News outlets are accusing Perplexity of plagiarism and unethical web scraping

Ambiguity around copyright laws and AI web crawlers complicates matters

8:00 AM PDT • July 2, 2024

In the age of generative AI, when chatbots can provide detailed answers to questions based on content pulled from the internet, the line between fair use and plagiarism, and between routine web scraping and unethical summarization, is a thin one.

Perplexity AI is a startup that combines a search engine with a large language model that generates answers with detailed responses, rather than just links. Unlike OpenAI’s ChatGPT and Anthropic’s Claude, Perplexity doesn’t train its own foundational AI models, instead using open or commercially available ones to take the information it gathers from the internet and translate that into answers.

But a series of accusations in June suggests the startup’s approach borders on being unethical. Forbes called out Perplexity for allegedly plagiarizing one of its news articles in the startup’s beta Perplexity Pages feature. And Wired has accused Perplexity of illicitly scraping its website, along with other sites.

Perplexity, which as of April was working to raise $250 million at a near-$3 billion valuation, maintains that it has done nothing wrong. The Nvidia- and Jeff Bezos-backed company says that it has honored publishers’ requests to not scrape content and that it is operating within the bounds of fair use copyright laws.

The situation is complicated. At its heart are nuances surrounding two concepts. The first is the Robots Exclusion Protocol, a standard used by websites to indicate that they don’t want their content accessed or used by web crawlers. The second is fair use in copyright law, which sets up the legal framework for allowing the use of copyrighted material without permission or payment in certain circumstances.

Surreptitiously scraping web content

Wired’s June 19 story claims that Perplexity has ignored the Robots Exclusion Protocol to surreptitiously scrape areas of websites that publishers do not want bots to access. Wired reported that it observed a machine tied to Perplexity doing this on its own news site, as well as across other publications under its parent company, Condé Nast.

The report noted that developer Robb Knight conducted a similar experiment and came to the same conclusion.

Both Wired reporters and Knight tested their suspicions by asking Perplexity to summarize a series of URLs and then watching on the server side as an IP address associated with Perplexity visited those sites. Perplexity then “summarized” the text from those URLs — though in the case of one dummy website with limited content that Wired created for this purpose, it returned text from the page verbatim.
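The server-side half of that experiment can be sketched in a few lines of Python. This is only an illustration, not the method Wired or Knight actually used: the log format (a Common Log Format style access log), the IP addresses (taken from ranges reserved for documentation) and the page path are all assumptions made for the example.

```python
import re

# Assumed Common Log Format style line:
# ip identity user [time] "METHOD path protocol" status size
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def hits_from(log_lines, watched_ips, watched_paths):
    """Return (time, ip, path) for each request from a watched IP to a watched path."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match["ip"] in watched_ips and match["path"] in watched_paths:
            hits.append((match["time"], match["ip"], match["path"]))
    return hits

# Hypothetical log excerpt; the addresses come from documentation-only ranges.
SAMPLE_LOG = [
    '198.51.100.23 - - [19/Jun/2024:10:15:01 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '203.0.113.7 - - [19/Jun/2024:10:15:32 +0000] "GET /test-page.html HTTP/1.1" 200 512',
    '198.51.100.23 - - [19/Jun/2024:10:16:10 +0000] "GET /test-page.html HTTP/1.1" 200 512',
]

if __name__ == "__main__":
    # 203.0.113.7 stands in for an address someone suspects belongs to a bot.
    for hit in hits_from(SAMPLE_LOG, {"203.0.113.7"}, {"/test-page.html"}):
        print(hit)
```

A hit in the log shows only that a given address requested the page. Tying that address to a particular company takes separate evidence.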

This is where the nuances of the Robots Exclusion Protocol come into play.

Web scraping, technically, is the practice of using automated pieces of software known as crawlers to scour the web, indexing and collecting information from websites. Search engines like Google do this so that web pages can be included in search results. Other companies and researchers use crawlers to gather data from the internet for market analysis, academic research and, as we’ve come to learn, training machine learning models.

Web scrapers that comply with this protocol first look for the “robots.txt” file at the root of a site to see what is permitted and what is not. Today, what is not permitted is usually scraping a publisher’s site to build massive training datasets for AI. Search engines and AI companies, including Perplexity, have stated that they comply with the protocol, but they aren’t legally obligated to do so.
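To make the mechanism concrete, here is a minimal sketch of the check a compliant crawler might run before fetching a page, using Python’s standard-library robots.txt parser. The robots.txt contents, the bot name “ExampleAIBot” and the URLs are invented for illustration and do not describe any real publisher or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is blocked from the whole site,
# everyone else is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeSearchBot"):
        for url in (
            "https://example.com/news/story",
            "https://example.com/private/draft",
        ):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check is enforced by the website itself. The file only states the publisher’s wishes, and a crawler that skips the check can still request the page.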

Perplexity’s head of business, Dmitry Shevelenko, told TechCrunch that summarizing a URL isn’t the same thing as crawling. “Crawling is when you’re just going around sucking up information and adding it to your index,” Shevelenko said. He noted that Perplexity’s IP might show up as a visitor to a website that is “otherwise kind of prohibited from robots.txt” only when a user puts a URL into their query, which “doesn’t meet the definition of crawling.”

“We’re just responding to a direct and specific user request to go to that URL,” Shevelenko said.

In other words, if a user manually provides a URL to an AI, Perplexity says its AI isn’t acting as a web crawler but rather as a tool to assist the user in retrieving and processing information they requested.

But to Wired and many other publishers, that’s a distinction without a difference because visiting a URL and pulling the information from it to summarize the text sure looks a whole lot like scraping if it’s done thousands of times a day.

(Wired also reported that Amazon Web Services, one of Perplexity’s cloud service providers, is investigating the startup for ignoring the robots.txt protocol to scrape web pages that users cited in their prompts. AWS told TechCrunch that Wired’s report is inaccurate and that it told the outlet it was processing its media inquiry as it does any other report alleging abuse of the service.)

Plagiarism or fair use?

Image: screenshot of Perplexity Pages. Forbes accused Perplexity of plagiarizing its scoop about former Google CEO Eric Schmidt developing AI-powered combat drones.

Wired and Forbes have also accused Perplexity of plagiarism. Ironically, Wired says Perplexity plagiarized the very article that called out the startup for surreptitiously scraping its web content.

Wired reporters said the Perplexity chatbot “produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them.” One sentence exactly reproduces a sentence from the original story; Wired says this constitutes plagiarism. The Poynter Institute’s guidelines say it might be plagiarism if the author (or AI) used seven consecutive words from the original source work.
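The seven-consecutive-words rule of thumb is simple enough to express in code. The sketch below is an assumption about how such a check could be run, not a tool that Poynter, Wired or Perplexity uses: it lowercases both texts, strips punctuation and reports every run of seven words that appears in both. The function names and sample sentences are made up for the example.

```python
import re

def word_list(text: str) -> list[str]:
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9'’]+", text.lower())

def shared_runs(source: str, candidate: str, n: int = 7) -> set[tuple[str, ...]]:
    """Return every run of n consecutive words found in both texts."""
    src = word_list(source)
    cand = word_list(candidate)
    src_runs = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    cand_runs = {tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)}
    return src_runs & cand_runs

if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog near the river bank"
    summary = "Reports say the quick brown fox jumps over the lazy cat"
    for run in sorted(shared_runs(original, summary)):
        print(" ".join(run))
```

A match flags only verbatim overlap. It says nothing about close paraphrase, which is where most of the dispute over summaries lies.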

Forbes also accused Perplexity of plagiarism. The news site published an investigative report in early June about how former Google CEO Eric Schmidt’s new venture is recruiting heavily and testing AI-powered drones with military applications. The next day, Forbes editor John Paczkowski posted on X saying that Perplexity had republished the scoop as part of its beta feature, Perplexity Pages.

Perplexity Pages, which is only available to certain Perplexity subscribers for now, is a new tool that promises to help users turn research into “visually stunning, comprehensive content,” according to Perplexity. Examples of such content on the site come from the startup’s employees, and include articles like “Beginner’s Guide to Drumming,” or “Steve Jobs: Visionary CEO.”

“It rips off most of our reporting,” Paczkowski wrote. “It cites us, and a few that reblogged us, as sources in the most easily ignored way possible.”

Forbes reported that many of the posts that were curated by the Perplexity team are “strikingly similar to original stories from multiple publications, including Forbes, CNBC and Bloomberg.” Forbes said the posts gathered tens of thousands of views and didn’t mention any of the publications by name in the article text. Rather, Perplexity’s articles included attributions in the form of “small, easy-to-miss logos that link out to them.”

Furthermore, Forbes said the post about Schmidt contains “nearly identical wording” to Forbes’ scoop. The aggregation also included an image created by the Forbes design team that appeared to be slightly modified by Perplexity.

Perplexity CEO Aravind Srinivas responded to Forbes at the time by saying the startup would cite sources more prominently in the future — a solution that’s not foolproof, as citations themselves face technical difficulties. ChatGPT and other models have hallucinated links, and since Perplexity uses OpenAI models, it is likely to be susceptible to such hallucinations. In fact, Wired reported that it observed Perplexity hallucinating entire stories.

Other than noting Perplexity’s “rough edges,” Srinivas and the company have largely doubled down on Perplexity’s right to use such content for summarizations.

This is where the nuances of fair use come into play. Plagiarism, while frowned upon, is not technically illegal.

According to the U.S. Copyright Office, it is legal to use limited portions of a work including quotes for purposes like commentary, criticism, news reporting and scholarly reports. AI companies like Perplexity posit that providing a summary of an article is within the bounds of fair use.

“Nobody has a monopoly on facts,” Shevelenko said. “Once facts are out in the open, they are for everyone to use.”

Shevelenko likened Perplexity’s summaries to how journalists often use information from other news sources to bolster their own reporting. The unfair advantage of AI companies, however, is that they can compile in seconds what it took several journalists hours to create.

Mark McKenna, a professor of law at the UCLA Institute for Technology, Law & Policy, told TechCrunch the situation isn’t an easy one to untangle. In a fair use case, courts would weigh whether the summary uses a lot of the expression of the original article, versus just the ideas. They might also examine whether reading the summary might be a substitute for reading the article.

“There are no bright lines,” McKenna said. “So [Perplexity] saying factually what an article says or what it reports would be using non-copyrightable aspects of the work. That would be just facts and ideas. But the more that the summary includes actual expression and text, the more that starts to look like reproduction, rather than just a summary.”

Unfortunately for publishers, unless Perplexity is reproducing an article’s actual expression (and apparently, in some cases, it is), its summaries might well fall within the bounds of fair use.

How Perplexity aims to protect itself

AI companies like OpenAI have signed media deals with a range of news publishers to access their current and archival content on which to train their algorithms. In return, OpenAI promises to surface news articles from those publishers in response to user queries in ChatGPT. (But even that has some kinks that need to be worked out, as Nieman Lab reported last week.)

Perplexity has held off from announcing its own slew of media deals, perhaps waiting for the accusations against it to blow over. But the company is “full speed ahead” on a series of advertising revenue-sharing deals with publishers.

The idea is that Perplexity will start including ads alongside query responses, and publishers that have content cited in any answer will get a slice of the corresponding ad revenue. Shevelenko said Perplexity is also working to allow publishers access to its technology so they can build Q&A experiences and power things like related questions natively inside their sites and products.

But is this just a fig leaf for systemic IP theft? Perplexity isn’t the only chatbot that threatens to summarize content so completely that readers fail to see the need to click out to the original source material.

And if AI scrapers like this continue to take publishers’ work and repurpose it for their own businesses, publishers will have a harder time earning ad dollars. That means eventually, there will be less content to scrape. When there’s no more content left to scrape, generative AI systems will then pivot to training on synthetic data, which could lead to a hellish feedback loop of potentially biased and inaccurate content.
