    Here are the biggest misconceptions about AI content scraping

    By TechAiVerse | July 2, 2025

    AI bots that retrieve real-time information are now scraping publishers’ sites more often than the bots used to train large language models. And they’re harder to detect.

    That’s according to the latest report from TollBit, a data marketplace for publishers and AI companies. From Q4 2024 to Q1 2025, bot scrapes used for Retrieval Augmented Generation, or RAG, per site grew 49%. That is nearly 2.5 times the rate of training bot scrapes (which grew by 18%) in the same time period. 

    An increase in bots scraping content from publishers’ sites represents a threat to their businesses. But scraping for AI training and scraping for real-time outputs present different challenges — and some opportunities — for publishers. And not all of them are fully understood. 

    Training scrapes are “one-and-done… to feed a model’s general knowledge,” said Josh Jaffe, AI and media consultant and former president of media at the publisher Ingenio.

    RAG scrapes, on the other hand, are continuous. They have to power responses to users’ questions in AI chatbots and search engines, he said. “It’s the difference between selling your archive once versus being part of an ongoing syndication feed. One is finite. The other has compounding value, assuming publishers can tap into it,” Jaffe said.

    Here is a look at some of the misconceptions:

    Myth: All AI bot scraping is the same

    There are two main types of AI bots — RAG AI bots and training data bots.

    RAG AI bots, or agents, retrieve factual, current information in real-time. They respond to user prompts in AI products like Perplexity and ChatGPT by searching the web. Responses include links or citations to the original sources, such as publishers’ sites. RAG can surface and summarize articles without storing them in training data, which makes the threat to traffic and monetization even more immediate and harder to regulate.

    “Despite the high commercial value of RAG to AI developers, the vast majority of companies take the raw materials required to create summarised simulacrums without any form of remuneration, licensing arrangement, or traffic back to the source publisher website. This is contrary to the terms of service of many publishers, and is neither fair nor sustainable,” reads a report from the Financial Times, submitted last month to the House of Lords Communications and Digital Select Committee’s inquiry into media literacy. The report also described publishers’ ability to prevent this process as “minimal.”

    Training data bots, on the other hand, crawl the web for data to feed into LLMs, such as Meta’s Llama or OpenAI’s GPT. Those large datasets are then used to train the models how to “speak,” or generate responses.

    Once the models have learned to speak, and as LLMs get smarter, training bots hit publishers’ sites less frequently. RAG bots, on the other hand, need to keep crawling publishers’ sites to access up-to-date information, which is why RAG scrapes are happening more often.

    AI companies have taken on the responsibility of labeling these bots to differentiate them. For example, OpenAI has an agent called “ChatGPT-User” (its RAG AI bot) that scrapes the web for real-time information, while “GPTBot” (its training data bot) scrapes to train OpenAI’s LLMs.

    But not all of them do so publicly.
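For bots that do identify themselves, a publisher could sort log traffic by purpose using the two OpenAI agent names mentioned above. The following is a minimal sketch under that assumption: the sample User-Agent strings are made up for illustration, the helper names are hypothetical, and bots that hide or spoof their identity would land in the "unknown" bucket.

```python
from collections import Counter

# Agent tokens named in the article. Tokens for other vendors would have to
# come from each vendor's own documentation; none are assumed here.
BOT_PURPOSE = {
    "ChatGPT-User": "rag",
    "GPTBot": "training",
}


def classify_user_agent(user_agent: str) -> str:
    """Return 'rag', 'training', or 'unknown' for a User-Agent header."""
    lowered = user_agent.lower()
    for token, purpose in BOT_PURPOSE.items():
        if token.lower() in lowered:
            return purpose
    return "unknown"


def count_by_purpose(user_agents):
    """Tally requests per purpose across an iterable of User-Agent strings."""
    return Counter(classify_user_agent(ua) for ua in user_agents)


# Illustrative (made-up) log entries, not real crawler strings.
sample_log = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
counts = count_by_purpose(sample_log)
print(dict(counts))
```

Matching on a self-declared header only works for bots that announce themselves, which is the limitation the next section describes.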

    Myth: RAG scraping is easy to detect

    What makes things even more complicated is that smarter AI agents are emerging that mimic human behavior (and can even solve CAPTCHAs and bypass advanced cybersecurity tools), according to an AI startup company exec, who asked to speak anonymously to share their thoughts freely. This makes them increasingly difficult to detect. Without that visibility, publishers have a hard time knowing how many bots are scraping their sites, how often, and what the impact is on their businesses.

    Also, search engines like Google and Bing don’t separate their RAG bots from the bots they use to categorize content for search results, which means publishers cannot “hide” from RAG bots without potentially also “hiding” themselves from search and its corresponding referral traffic.

    “This puts publishers in a difficult position as they would risk losing search rankings by restricting all bots — including search bots,” said Arvid Tchivzhel, managing director at Mather Economics’ digital consulting practice.

    For example, Google’s “Google-Extended” bot gathers data to train and improve Google’s AI models, which publishers can block with robots.txt. But Google’s LLM Gemini and its AI search feature AI Overviews do not use Google-Extended for real-time data retrieval, meaning publishers can’t block Google from crawling their sites for RAG without blocking the crawlers Google uses for its general search product.
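A rough illustration of that trade-off, using Python's standard `urllib.robotparser` to evaluate a sample robots.txt. The rules and the `example.com` URL are assumptions for the sketch, not any publisher's real configuration, and the parser only shows what a compliant crawler would conclude.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: block the training-related tokens named in the article,
# leave everything else (including general search crawling) allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

article_url = "https://example.com/news/story"  # placeholder URL

gptbot_allowed = parser.can_fetch("GPTBot", article_url)
extended_allowed = parser.can_fetch("Google-Extended", article_url)
googlebot_allowed = parser.can_fetch("Googlebot", article_url)

print("GPTBot:", gptbot_allowed)
print("Google-Extended:", extended_allowed)
print("Googlebot:", googlebot_allowed)
```

The training tokens are disallowed while the general search crawler stays allowed. Since, per the article, real-time retrieval does not go through Google-Extended, rules like these would not stop RAG scraping by Google unless the publisher also gave up search crawling.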

    TollBit’s report detected 436 million AI bot scrapes (both RAG and training scrapes) in Q1 2025, up 46% from Q4 2024. “The harder you block bots, the harder they will work to evade detection,” said Olivia Joslin, co-founder of TollBit.

    Myth: Monetizing training data is the only way publishers can make money

    AI companies like OpenAI have signed large, lump-sum deals with publishers that allow the AI companies to ingest publishers’ content to train their LLMs. But it’s not the only way publishers can monetize AI bots crawling their sites.

    Publishers could charge RAG AI bots for crawling their sites — either when they scrape for content, or when they’re cited in responses to users’ questions in AI products. Two digital publishing execs told Digiday this will be key to monetizing the increase in bot scraping.

    “The revenue is not there yet as the LLM platforms are still in the early days of building their commercial models, but [I would] expect that to be an area of growth,” said one publishing exec, who traded anonymity for candor.

    TollBit, for example, gives AI scrapers the option to pay a “toll” to access a publisher’s content. A web scraper or AI agent tries to go to a publisher’s webpage, gets redirected to TollBit’s platform, and is then offered access to that page for a transaction fee. TollBit has struck deals with over 2,000 publishers, including Penske Media and Time. However, it’s unclear how much money publishers are actually making from TollBit’s marketplace.
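The redirect step in a toll flow like this can be sketched in a few lines. This is a hypothetical illustration of the general idea only, not TollBit's actual implementation: the `toll.example.com` host, the `page` query parameter, and the `route_request` helper are all invented for the example, and bot detection is reduced to the two self-declared agent names cited earlier in the article.

```python
from urllib.parse import quote

# Self-declared AI agent tokens named in the article.
AI_BOT_TOKENS = ("GPTBot", "ChatGPT-User")

# Hypothetical toll endpoint; a placeholder, not a real service URL.
TOLL_HOST = "https://toll.example.com"


def route_request(user_agent: str, path: str):
    """Return (status_code, redirect_location) for an incoming request.

    Recognised AI bots are redirected to the (hypothetical) toll page,
    where a fee could be offered; everyone else gets the page directly.
    """
    lowered = user_agent.lower()
    if any(token.lower() in lowered for token in AI_BOT_TOKENS):
        return 302, f"{TOLL_HOST}/access?page={quote(path, safe='')}"
    return 200, None


bot_response = route_request("Mozilla/5.0 (compatible; GPTBot/1.0)", "/news/story")
human_response = route_request("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "/news/story")
print(bot_response)
print(human_response)
```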

    The IAB Tech Lab is also in the early stages of developing an API called LLM Content Ingest, a technical framework that could help control how publishers’ content is accessed and monetized by AI systems, although it will need buy-in from the AI companies to work.

    Publishers are likely to shift to monetizing RAG bot scraping over signing more and more licensing deals for LLM training. Recent deals between AI companies and publishers seem to be moving away from sharing publishers’ content to train LLMs, and instead shifting toward feeding data to AI models in response to queries in AI search engines through a RAG system. (Arguably, many AI companies have already trained their LLMs on huge amounts of data available on the web.)

    But it’s not an easy process, according to Tchivzhel.

    “Preventing and monetizing the scraping is very difficult for the average local publisher. Unless you have significant legal resources and scale, you are unlikely to generate meaningful ROI on monetizing the input into RAG models directly,” Tchivzhel said. “There is likely more ROI on monetizing the output from LLMs and RAG models and striking deals with intermediaries who have built attribution models and can prove a specific piece of content was used in an AI answer.”

    Myth: Scrape-to-referral ratio is the same for all AI crawlers 

    Another key finding in TollBit’s report is that AI bots crawling publishers’ sites are scraping way more than they are referring traffic — meaning publishers are losing out on monetizing those audiences.

    On average across TollBit’s partners’ sites, for every 11 scrapes, Bing returns one human visit to sites. This means that Bing’s scrape-to-referral ratio is 11:1. OpenAI’s scrape-to-referral ratio is 179:1, Perplexity’s is 369:1, and Anthropic’s is 8,692:1, according to the report.
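To make those ratios easier to compare, the sketch below converts each one into human visits returned per 10,000 scrapes. The ratios are the ones reported above; the function names and the 10,000-scrape baseline are choices made for this illustration.

```python
# Scrapes per one human referral, as reported in the article.
SCRAPES_PER_REFERRAL = {
    "Bing": 11,
    "OpenAI": 179,
    "Perplexity": 369,
    "Anthropic": 8692,
}


def scrape_to_referral_ratio(scrapes: int, referrals: int) -> float:
    """Scrapes made per human visit referred back to the site."""
    if referrals <= 0:
        raise ValueError("referrals must be positive to form a ratio")
    return scrapes / referrals


def referrals_per_10k_scrapes(ratio: float) -> float:
    """Human visits a site could expect back for every 10,000 scrapes."""
    return 10_000 / ratio


for name, ratio in SCRAPES_PER_REFERRAL.items():
    visits = referrals_per_10k_scrapes(ratio)
    print(f"{name}: about {visits:.1f} human visits per 10,000 scrapes")
```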

    Overall across TollBit’s publisher network, AI apps drove 0.04% of total external referral traffic to sites from Q4 2024 to Q1 2025.

    RAG scraping is also happening more often due to the increased adoption of AI tools, according to the AI startup company exec. Data shows more people are using AI tools for search, for example. As AI companies invest in more search-focused tools, RAG is needed to keep responses up-to-date.

    “We don’t see training bots hammering publishers’ sites thousands of times a day,” the AI company exec said.

    Myth: Robots.txt protects publishers from AI bots 

    If publishers aren’t managing to monetize AI bot scraping, the alternative is to block those bots from accessing the content on their websites. 

    Robots.txt — which tells web crawlers which URLs they can access and is a mechanism to disallow access to publishers’ sites — is the most straightforward way to do this, with just a few lines of code. But it’s also the weakest tactic to block bot traffic.

    Publishers attempted to block four times more AI bots using robots.txt between January 2024 and January 2025. But the percentage of AI bot scrapes that bypassed robots.txt surged from 3.3% in Q4 2024 to 12.9% by the end of Q1 2025. In March 2025, over 26 million scrapes from AI bots bypassed robots.txt for sites on TollBit.
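The size of that jump depends on how it is measured. A small sketch using only the two percentages above: the bypass share rose by about 9.6 percentage points, which is roughly 3.9 times the earlier share. The helper names are illustrative.

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_multiple(before_pct: float, after_pct: float) -> float:
    """How many times larger the later share is than the earlier one."""
    return after_pct / before_pct


# Share of AI bot scrapes that bypassed robots.txt, per the article.
q4_2024_bypass_pct = 3.3
q1_2025_bypass_pct = 12.9

points = percentage_point_change(q4_2024_bypass_pct, q1_2025_bypass_pct)
multiple = relative_multiple(q4_2024_bypass_pct, q1_2025_bypass_pct)

print(f"Change: {points:.1f} percentage points")
print(f"Multiple: {multiple:.1f}x the Q4 2024 share")
```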

    Recent updates to major AI companies’ terms of service state that their AI bots can act on behalf of user requests – effectively meaning they can ignore robots.txt when being used for RAG, according to the TollBit report.

    Among websites with TollBit Analytics set up before January 2025, AI bot traffic volume nearly doubled in Q1, rising by 87%. 

    The FT report called this the era of “digital dumping.”

    “AI developers flood the market for news and information with outputs that are created using generative AI models in response to natural language user prompts,” it read. “This probabilistic approach to the production of outputs is as far away from the process of producing high quality journalism as it is possible to be.”

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he delivers clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.
