Close Menu

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    The ‘last-mile’ data problem is stalling enterprise agentic AI — ‘golden pipelines’ aim to fix it

    New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

    Alibaba’s Qwen 3.5 397B-A17 beats its larger trillion-parameter model — at a fraction of the cost

    Facebook X (Twitter) Instagram
    • Artificial Intelligence
    • Business Technology
    • Cryptocurrency
    • Gadgets
    • Gaming
    • Health
    • Software and Apps
    • Technology
    Facebook X (Twitter) Instagram Pinterest Vimeo
    Tech AI Verse
    • Home
    • Artificial Intelligence

      Read the extended transcript: President Donald Trump interviewed by ‘NBC Nightly News’ anchor Tom Llamas

      February 6, 2026

      Stocks and bitcoin sink as investors dump software company shares

      February 4, 2026

      AI, crypto and Trump super PACs stash millions to spend on the midterms

      February 2, 2026

      To avoid accusations of AI cheating, college students are turning to AI

      January 29, 2026

      ChatGPT can embrace authoritarian ideas after just one prompt, researchers say

      January 24, 2026
    • Business

      The HDD brand that brought you the 1.8-inch, 2.5-inch, and 3.5-inch hard drives is now back with a $19 pocket-sized personal cloud for your smartphones

      February 12, 2026

      New VoidLink malware framework targets Linux cloud servers

      January 14, 2026

      Nvidia Rubin’s rack-scale encryption signals a turning point for enterprise AI security

      January 13, 2026

      How KPMG is redefining the future of SAP consulting on a global scale

      January 10, 2026

      Top 10 cloud computing stories of 2025

      December 22, 2025
    • Crypto

      Is Bitcoin Price Entering a New Bear Market? Here’s Why Metrics Say Yes

      February 19, 2026

      Cardano’s Trading Activity Crashes to a 6-Month Low — Can ADA Still Attempt a Reversal?

      February 19, 2026

      Is Extreme Fear a Buy Signal? New Data Questions the Conventional Wisdom

      February 19, 2026

      Coinbase and Ledn Strengthen Crypto Lending Push Despite Market Slump

      February 19, 2026

      Bitcoin Caught Between Hawkish Fed and Dovish Warsh

      February 19, 2026
    • Technology

      The ‘last-mile’ data problem is stalling enterprise agentic AI — ‘golden pipelines’ aim to fix it

      February 19, 2026

      New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

      February 19, 2026

      Alibaba’s Qwen 3.5 397B-A17 beats its larger trillion-parameter model — at a fraction of the cost

      February 19, 2026

      When accurate AI is still dangerously incomplete

      February 19, 2026

      Meta reportedly plans to release a smartwatch this year

      February 19, 2026
    • Others
      • Gadgets
      • Gaming
      • Health
      • Software and Apps
    Check BMI
    Tech AI Verse
    You are at:Home»Technology»OpenAI’s models ‘memorized’ copyrighted content, new study suggests
    Technology

    OpenAI’s models ‘memorized’ copyrighted content, new study suggests

    TechAiVerseBy TechAiVerseApril 5, 2025No Comments3 Mins Read1 Views
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Reddit
    OpenAI’s models ‘memorized’ copyrighted content, new study suggests
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp Email

    OpenAI’s models ‘memorized’ copyrighted content, new study suggests

    A new study appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.

    OpenAI is embroiled in suits brought by authors, programmers, and other rights-holders who accuse the company of using their works — books, codebases, and so on — to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that there isn’t a carve-out in U.S. copyright law for training data.

    The study, which was co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data “memorized” by models behind an API, like OpenAI’s.

    Models are prediction engines. Trained on a lot of data, they learn patterns — that’s how they’re able to generate essays, photos, and more. Most of the outputs aren’t verbatim copies of the training data, but owing to the way models “learn,” some inevitably are. Image models have been found to regurgitate screenshots from movies they were trained on, while language models have been observed effectively plagiarizing news articles.

    The study’s method relies on words that the co-authors call “high-surprisal” — that is, words that stand out as uncommon in the context of a larger body of work. For example, the word “radar” in the sentence “Jack and I sat perfectly still with the radar humming” would be considered high-surprisal because it’s statistically less likely than words such as “engine” or “radio” to appear before “humming.”

    The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to “guess” which words had been masked. If the models managed to guess correctly, it’s likely they memorized the snippet during training, concluded the co-authors.

    An example of having a model “guess” a high-surprisal word.Image Credits:OpenAI

    According to the results of the tests, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset containing samples of copyrighted ebooks called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.

    Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the “contentious data” models might have been trained on.

    “In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically,” Ravichander said. “Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem.”

    OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has certain content licensing deals in place and offers opt-out mechanisms that allow copyright owners to flag content they’d prefer the company not use for training purposes, it has lobbied several governments to codify “fair use” rules around AI training approaches.

    Kyle Wiggers is TechCrunch’s AI Editor. His writing has appeared in VentureBeat and Digital Trends, as well as a range of gadget blogs including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Manhattan with his partner, a music therapist.

    View Bio

    Share. Facebook Twitter Pinterest LinkedIn Reddit WhatsApp Telegram Email
    Previous ArticleAI has opened a new era in venture capital according to Forerunner founder Kirsten Green
    Next Article How Kalshi helped prediction markets go mainstream
    TechAiVerse
    • Website

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he deliver clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.

    Related Posts

    The ‘last-mile’ data problem is stalling enterprise agentic AI — ‘golden pipelines’ aim to fix it

    February 19, 2026

    New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

    February 19, 2026

    Alibaba’s Qwen 3.5 397B-A17 beats its larger trillion-parameter model — at a fraction of the cost

    February 19, 2026
    Leave A Reply Cancel Reply

    Top Posts

    Ping, You’ve Got Whale: AI detection system alerts ships of whales in their path

    April 22, 2025684 Views

    Lumo vs. Duck AI: Which AI is Better for Your Privacy?

    July 31, 2025273 Views

    6.7 Cummins Lifter Failure: What Years Are Affected (And Possible Fixes)

    April 14, 2025156 Views

    6 Best MagSafe Phone Grips (2025), Tested and Reviewed

    April 6, 2025118 Views
    Don't Miss
    Technology February 19, 2026

    The ‘last-mile’ data problem is stalling enterprise agentic AI — ‘golden pipelines’ aim to fix it

    The ‘last-mile’ data problem is stalling enterprise agentic AI — ‘golden pipelines’ aim to fix…

    New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

    Alibaba’s Qwen 3.5 397B-A17 beats its larger trillion-parameter model — at a fraction of the cost

    When accurate AI is still dangerously incomplete

    Stay In Touch
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    • YouTube
    • Vimeo

    Subscribe to Updates

    Get the latest creative news from SmartMag about art & design.

    About Us
    About Us

    Welcome to Tech AI Verse, your go-to destination for everything technology! We bring you the latest news, trends, and insights from the ever-evolving world of tech. Our coverage spans across global technology industry updates, artificial intelligence advancements, machine learning ethics, and automation innovations. Stay connected with us as we explore the limitless possibilities of technology!

    Facebook X (Twitter) Pinterest YouTube WhatsApp
    Our Picks

    The ‘last-mile’ data problem is stalling enterprise agentic AI — ‘golden pipelines’ aim to fix it

    February 19, 20260 Views

    New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

    February 19, 20262 Views

    Alibaba’s Qwen 3.5 397B-A17 beats its larger trillion-parameter model — at a fraction of the cost

    February 19, 20260 Views
    Most Popular

    7 Best Kids Bikes (2025): Mountain, Balance, Pedal, Coaster

    March 13, 20250 Views

    VTOMAN FlashSpeed 1500: Plenty Of Power For All Your Gear

    March 13, 20250 Views

    This new Roomba finally solves the big problem I have with robot vacuums

    March 13, 20250 Views
    © 2026 TechAiVerse. Designed by Divya Tech.
    • Home
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms & Conditions

    Type above and press Enter to search. Press Esc to cancel.