It's Not As Simple As It Seems: Is A Copyright Crackdown The End For The Little Guy?

Welcome to this week’s Deep-fried Dive with Fry Guy! In these long-form articles, Fry Guy conducts an in-depth analysis of a cutting-edge artificial intelligence (AI) development or developer. Today, Fry Guy explores how enforcing copyrights on training data might cause large tech companies to rule the world. We hope you enjoy!

*Notice: We do not gain any monetary compensation from the people and projects we feature in the Sunday Deep-fried Dives with Fry Guy. We explore these projects and developers solely for the purpose of revealing to you interesting and cutting-edge AI projects, developers, and uses.*



Giant tech companies like Microsoft and OpenAI have been freely using data from major news sources like The New York Times to train their massive AI models. This has left many in a rage, as the hard work of these news sources is being leveraged by these big companies with no fair compensation. So we ought to stick it to the big tech companies and make them pay infringement penalties, right? Well, maybe it’s not that simple …

In this article, we hope to present a dilemma (a double-edged sword, if you will): enforcing copyrights on training data may bring justice to content creators, but it may also mean the death of AI for small developers.


You may have heard of the Magnificent 7. If you haven’t, these are the tech stocks that continually dominate the market: Microsoft, Apple, Nvidia, Alphabet, Amazon, Meta, and Tesla. These companies alone account for around half of the weighting of the Nasdaq, and they have been riding high off the surge of generative AI over the past two years.

The generative AI storm began back in November of 2022, when OpenAI first released the famous ChatGPT model. Since then, we have seen a massive explosion of development and adoption. ChatGPT has itself garnered over 180.5 million users, and the Magnificent 7 have released a plethora of AI models, tools, and products. Much of this has been powered by Nvidia, which holds around 80% of the AI chip market.

The dominance of these giant tech companies in the AI market has caused major concern, particularly over the unresolved issue of copyrights. Given the cloudy legal structure surrounding AI development, large AI models like ChatGPT have had relatively free rein over publicly available internet material, such as news articles and even social media information. This has not come without massive pushback.

In September of 2023, Elon Musk banned crawling and data scraping on X (formerly Twitter). In simple terms, this meant that companies such as OpenAI, Microsoft, and Google were not able to use data from the social media site to train their AI models. This led to lawsuits of various kinds, as policymakers attempted to navigate the rights of access for training these AI models.

Beyond social media platforms, big tech companies like OpenAI and Microsoft have faced massive lawsuits over alleged copyright infringements. This began with a groundbreaking lawsuit filed late last year by The New York Times (NYT), which remains unresolved. The lawsuit, filed in Federal District Court in Manhattan, contends that millions of articles from the NYT were used to train automated chatbots, creating competition with the news outlet itself. The complaint accuses the defendants of “free-riding” on the NYT’s substantial journalism investment, using its content without compensation to develop products that substitute for and divert audiences from the news outlet. The claim seeks unspecified damages, potentially reaching into billions of dollars, related to the “unlawful copying and use of The Times's uniquely valuable works.” Furthermore, it demands the destruction of any chatbot models and training data incorporating copyrighted material from the NYT.

This NYT lawsuit sent shockwaves throughout the AI community and remains one of the most pivotal and contentious issues in the AI space. Since this original lawsuit, eight additional daily newspapers owned by Alden Global Capital have filed lawsuits against OpenAI and Microsoft, accusing the tech companies of illegally using news articles to train their AI models. This joint complaint was filed in federal court in the U.S. Southern District of New York and involves newspapers such as The New York Daily News, The Chicago Tribune, and others. The newspapers allege that OpenAI and Microsoft used millions of copyrighted articles without permission to train and support their AI products. They claim the AI models often present full articles behind paywalls without proper attribution, impacting subscriptions and licensing revenue. The newspapers seek hefty compensation and a jury trial, arguing that the tech companies are using their work to build their businesses without proper compensation. Frank Pine, an executive editor at Alden’s Newspapers, explained, “We’ve spent billions of dollars gathering information and reporting news at our publications, and we can’t allow OpenAI and Microsoft to expand the Big Tech playbook of stealing our work to build their own businesses at our expense.”

OpenAI is worth $86 billion, and Microsoft, its major partner in crime, has become the world’s most valuable company, valued at over $3 trillion. Much of their success in AI, according to these lawsuits, comes from the unlawful use of training data for AI models like ChatGPT and Microsoft Copilot.

Some might wonder why this issue is so controversial. Just make the big tech companies pay! Well, it’s not that simple.


On one side, we have people outraged by the unwillingness of big tech companies to pay for quality training data. On the other side, we have people who think that publicly available data should be fair game. In the middle, we have people who are lost in the mess or just don’t have an opinion on the issue. A recent FryAI poll suggested that 44% think tech companies should have to give compensation for the data, 24% think the data should be fair game, and 32% are lost in the middle. This shows the diversity of views on this issue, but also reveals a relatively substantial push for what might be called a copyright crackdown.

Why is there such a large push for stricter copyright regulations and penalties for infringement? Underlying this movement seems to be a sense of justice, or giving people what they deserve. Implementing strict copyright law on materials that are able to be used for the training of AI models would seemingly grant justice to the news sources. This would mean the humans who worked hard to produce the content being used to train the models would be compensated accordingly and given credit for what they contributed. No longer would AI be able to take advantage of their work and replace their expertise.

To the delight of many, substantial copyright deals are beginning to happen, and the money is flowing. OpenAI has reportedly been offering up to $5 million to various news organizations to license copyrighted news articles to train its AI models. This is chump change compared to what some larger content platforms, such as social media sites, command. Reddit, for example, reached a deal with Google for $60 million to allow Google to use the platform’s content to train its AI models. This seems like justice for these outlets, who are not to be taken advantage of. Stick it to the tech companies! Right? But that’s not the whole story …

If copyright laws continue to tighten, acquiring these quality training data sets will become insanely expensive. We are talking about billions of dollars to train an AI model on a quality data set. The issue with that, however, is that the only companies able and willing to pay that amount of money are … you guessed it: the tech giants who already control the AI market. This means that what seems like an act of justice for these news outlets turns out to also be a blessing for tech companies like those in the Magnificent 7, as they alone will have the means to acquire the training data necessary to develop high-quality Application Programming Interfaces (APIs) and foundational models. This would mark the end of novel AI for the startups and for the little guy. Without the APIs and foundational models from large tech companies who can afford the training data, small developers are out of luck. In this way, making big tech companies pay for training data gives them an opportunity to control how the underlying AI models work, who can get their hands on them, and what they can be used for. As a result, big tech will get even bigger … so if you find yourself pushing for strict copyrights and large bounties for infringement, be careful what you wish for, because this might be exactly what these tech giants want.

So at the end of the day, it seems we are caught in a Catch-22: we either get strict on copyright infringement, giving justice to the news outlets but handing big tech companies all the power in the AI space, or we allow free rein over training data, potentially using material from hard-working people unjustly but ensuring equal access for small developers and AI startups. This question will surely remain a topic of hot debate for some time, but no matter what happens, it seems someone is going to get the short end of the stick.
