Academic Publishers Are Selling Research to AI Companies

Introduction
Open Access Created More Than Access. It Created Data.

Publishers Have Realized the Secondary Value of Research
Academia Is Paying to Build AI Training Infrastructure
Researchers Never Really Consented to This Version of “Open”
“Free to Read” Is Quietly Becoming “Free to Train”

The Licensing Problem Nobody Saw Coming
Publishers Want to Own the AI Layer Too
The Academic Community May Eventually Push Back
The Bigger Question Is Not Legal. It Is Moral.

Introduction

For decades, the academic world treated open access as a moral victory. The logic seemed simple enough: remove paywalls, make research freely available, accelerate discovery, and allow scientific knowledge to circulate without financial barriers. Universities supported it, funders mandated it, researchers complied, and publishers adapted. Billions of dollars were poured into the transition under the promise that scholarship would become more open, more democratic, and more accessible to humanity. In theory, this was one of the great structural reforms in modern scholarly communication. In practice, something stranger happened.

Open access did remove many barriers to reading research, but it also created a new kind of digital asset. Scientific articles were no longer just papers sitting behind subscription walls. They became structured, machine-readable corpora, professionally edited, peer-reviewed, richly tagged, and legally reusable under permissive licenses. To human readers, that meant easier access to knowledge. To artificial intelligence companies, it meant something else entirely: premium training data.

That distinction matters more than many researchers realize. Large language models do not simply need text. They need vast amounts of high-quality text, ideally authoritative, clean, structured, and semantically rich. Academic publishing produces exactly that. Journal articles are among the most valuable textual datasets in existence because they contain expert knowledge, disciplinary terminology, citations, metadata, formal argumentation, and a level of editorial control rarely found on the open web. In the AI era, research papers are not merely scholarship. They are raw infrastructure.

This has created an uncomfortable paradox at the heart of modern scholarly publishing. Universities and funders spend billions to make research openly available, often through article processing charges and expensive publishing agreements. Publishers then retain control over the platforms hosting much of that research and discover that the same scholarly corpus can be monetized again, not by selling subscriptions, but by licensing access to AI developers hungry for training data. In other words, academia pays to create the knowledge, pays to publish the knowledge, and then watches publishers sell the resulting data to artificial intelligence companies as a secondary commercial asset.

That should make researchers pause.

Open access was supposed to expand scientific communication, not quietly build commercial data pipelines for the AI economy. Yet that increasingly appears to be what is happening. Deals between publishers and technology companies are beginning to reveal a new layer of monetization that was barely imaginable when Creative Commons licenses became central to open access policy. Research that was meant to be “free to read” is now becoming “free to train,” often with little discussion about consent, compensation, data sovereignty, or whether the academic community ever intended its intellectual labor to become fuel for proprietary AI systems.

Open Access Created More Than Access. It Created Data.

The original open access movement was built around a reader-centric idea. Scholarly literature should not be locked behind paywalls, especially when much of that research was publicly funded. The objective was to improve access for scientists, students, policymakers, clinicians, and the broader public. In that earlier framing, the conversation focused on readership. Could someone read the article without paying? Could researchers in poorer institutions gain access to scientific findings? Could knowledge circulate more freely across borders? These were fundamentally human-centered questions.

But digital publishing changed the nature of scholarly content itself. Research articles today are not simply PDF documents read by individuals. They are machine-processable assets embedded in metadata ecosystems, indexed by search engines, harvested by aggregators, linked through citation graphs, and structured in ways that make them ideal for computational ingestion. To a human reader, an article is a piece of scholarship. To a machine, it is a unit of structured knowledge that can be parsed, categorized, recombined, and learned from at scale.

This distinction is crucial because open access policy was largely designed before the rise of generative AI. The architects of early open access mandates were thinking about human knowledge dissemination, text mining for scientific purposes, and legal reuse in scholarly contexts. They were not imagining trillion-parameter language models ingesting millions of research papers to build commercial AI products. The licensing architecture of open access was not designed for that future, yet that future has arrived anyway.

Creative Commons licenses, especially CC BY, became the gold standard of open access because they maximize reuse. That flexibility was celebrated because it enabled sharing, translation, educational redistribution, and computational analysis. But the same permissiveness also creates a commercial opening. If research can legally be reused for any purpose, including commercial ones, as long as attribution is given, then it can also be mined, extracted, and repackaged into proprietary AI systems on those same terms. What was once framed as openness for science increasingly looks like openness for commercial machine learning.

That changes the economics of publishing in ways researchers have barely begun to confront. Scholarly content is no longer valuable only because humans read it. It is valuable because machines can train on it. And in the AI economy, that makes journal archives not just repositories of knowledge, but commercially monetizable data infrastructure.

Publishers Have Realized the Secondary Value of Research

For much of the open access debate, publishers justified their business models around familiar services: editorial management, peer review coordination, production, dissemination, indexing, and digital preservation. Whether one agreed with their pricing or not, the logic of the business was straightforward. Publishers sold access to journals, or they charged publication fees to make articles openly available. Revenue was tied directly to scholarly publishing.

Artificial intelligence has introduced a second business model.

Academic publishers increasingly recognize that the corpora they host possess enormous value beyond conventional publishing. Large AI developers require trustworthy, domain-rich textual datasets to improve model quality, particularly in scientific, technical, medical, and professional domains. Scholarly literature offers exactly that. It is authoritative, continuously updated, quality-filtered, and accompanied by metadata that makes machine ingestion easier and more useful.

This is no longer theoretical. Publishers have already begun striking deals in this emerging market. Informa, the parent company of Taylor & Francis, reportedly secured a $10 million AI-related agreement with Microsoft, while Wiley disclosed AI licensing deals worth an estimated $44 million. These are not publishing revenues in the traditional sense. They are data monetization revenues. Scholarly content is being treated not only as literature, but as licensable AI fuel.

What makes this especially controversial is that much of the academic ecosystem did not anticipate this secondary monetization loop. Researchers submit the papers. Peer reviewers donate labor. Universities fund the research. Libraries pay subscription fees or transformative agreement costs. Funders often support open access mandates. APCs are paid to make content openly available. Yet the secondary commercial value generated when that corpus becomes AI training infrastructure can flow elsewhere, often with little visibility to the scholars whose work produced it.

This raises a difficult question that academic publishing cannot avoid forever: when scholarly research becomes a monetizable AI asset, who should benefit from that value? Right now, the answer appears increasingly tilted toward platform owners. And that may become one of the defining publishing controversies of the AI era.

Academia Is Paying to Build AI Training Infrastructure

There is something deeply ironic about the economics of this emerging system. For years, universities and research funders have been told that open access requires financial sacrifice in the name of public good. Institutions paid article processing charges, libraries signed multi-million-dollar publishing agreements, and governments introduced mandates to accelerate public access. The justification was always framed in ethical language: taxpayers funded the research, so taxpayers should be able to read it. That argument resonated because it seemed to align moral purpose with scholarly dissemination.

But in the AI era, the same spending can produce an unintended second outcome. Openly available scholarly literature does not simply sit online waiting for human readers. It becomes part of a highly valuable corpus that can be mined, licensed, and used to train commercial artificial intelligence systems. In effect, academia may be financing the creation of structured datasets that later generate new commercial revenue streams elsewhere. The same article that cost a funder thousands of dollars to publish openly may now possess a second life as machine-readable training data sold into the AI market.

This creates a strange circularity in scholarly economics. Researchers produce the work. Universities pay salaries, laboratories, and infrastructure. Peer reviewers contribute unpaid intellectual labor. Funders underwrite the research itself. Libraries absorb publishing and access costs. APCs are paid to remove barriers to readership. Then publishers, sitting at a critical point in the digital infrastructure chain, discover that the resulting corpus has additional commercial value in AI licensing markets. The academic sector effectively helps create the asset, but does not necessarily control how that asset is monetized afterward.

That is not a trivial issue. Scholarly publishing has always involved asymmetries of value extraction, but AI introduces a new layer because the commercial value of research is no longer tied only to citations, readership, or subscriptions. It is tied to computational usefulness. In a world where data is infrastructure, journal archives become economically significant in ways that extend far beyond traditional publishing. The academic community may be paying to generate a resource whose secondary monetization is increasingly controlled by others.

Researchers Never Really Consented to This Version of “Open”

One of the most uncomfortable aspects of this debate is that most researchers did not enter open access publishing with AI training in mind. They published under open licenses because of funder mandates, institutional policies, disciplinary norms, or personal support for broader dissemination. Often, “open” was understood in a human sense. It meant students could read papers. It meant clinicians in poorer settings could access medical evidence. It meant researchers without wealthy library budgets could participate more fully in global science.

That is a very different concept from allowing scholarly literature to become feedstock for proprietary machine learning systems.

The legal architecture may permit certain kinds of reuse, but legality and scholarly intent are not the same thing. A scientist who agrees to publish under a permissive license may understand that their article can be shared, translated, cited, or mined for research purposes. That does not necessarily mean they imagined their work being ingested into a commercial AI model that later becomes part of a subscription product, enterprise tool, or billion-dollar technology ecosystem. There is a gap between legal permission and intellectual expectation, and AI has widened that gap dramatically.

This is where the philosophical tension inside open access becomes difficult to ignore. The movement was built around openness as a public good, not openness as unrestricted commercial extraction. Yet the legal simplicity of permissive licensing can flatten those distinctions. “Reusable” sounds noble until researchers begin asking reusable by whom, for what purpose, and under what economic arrangement. Those questions are becoming much harder to dismiss.

The issue is even sharper because scholarly labor is already largely uncompensated. Researchers write articles without payment. Peer reviewers work without payment. Editorial board members often contribute their time in exchange for prestige rather than direct compensation. Universities support much of the infrastructure. To then see this intellectual ecosystem transformed into monetizable AI training material without clear mechanisms for author participation or benefit-sharing raises ethical questions that go well beyond copyright law. It raises questions about consent, academic sovereignty, and whether the scholarly community inadvertently signed onto a model it never fully understood.

“Free to Read” Is Quietly Becoming “Free to Train”

This may become one of the defining tensions in scholarly communication over the next decade. Open access was built on a deceptively simple promise: free to read. But artificial intelligence has exposed how incomplete that phrase really was. Reading is a human act. Training is a computational act. The two are not economically or legally identical, yet they increasingly operate on the same corpus.

A human researcher reading a journal article contributes to scientific discourse. A language model ingesting millions of articles does something very different. It converts patterns, concepts, terminology, and domain-specific structures into machine capability. At scale, this transforms literature into a training substrate for commercial systems that may generate products, services, or profits far removed from the original scholarly context.

This does not mean AI training is inherently illegitimate. Scientific text mining, computational analysis, and machine learning all have legitimate research uses. The problem is that the boundaries between scientific reuse and commercial extraction are becoming increasingly blurred. When academic literature becomes a premium resource for proprietary AI development, the language of openness starts colliding with the economics of platform capitalism.

That collision matters because open access was never simply a technical publishing arrangement. It was a normative movement built on ethical claims about public knowledge. If the end result is that academic institutions spend billions creating open corpora that later become monetizable assets in private AI ecosystems, then the meaning of “open” deserves renewed scrutiny. Open for scholarship is not necessarily the same as open for unrestricted commercial training.

This is why licensing debates are intensifying. Some scholars and policy thinkers are beginning to ask whether the open access movement leaned too heavily toward permissive licensing without anticipating AI-era consequences. Others argue that any attempt to restrict reuse risks weakening the openness that made scientific collaboration possible in the first place. The debate is no longer just about access. It is about what kinds of reuse the academic community should endorse, and whether openness without guardrails creates new forms of extraction rather than genuine scientific freedom.

The Licensing Problem Nobody Saw Coming

Much of this controversy comes down to a deceptively boring topic that now looks much more explosive: licensing.

For years, open access advocates pushed for the widespread adoption of CC BY licenses because they remove barriers to reuse. A CC BY article can be shared, redistributed, translated, adapted, and reused commercially, as long as attribution is provided. In the early open access era, this was considered a strength. It allowed research to circulate more freely, enabled educational reuse, supported data mining, and prevented publishers from locking publicly funded scholarship behind restrictive legal walls.

At the time, this made sense. The biggest threat to knowledge dissemination was restricted access. The policy objective was to maximize circulation, not to anticipate how future technologies might exploit that legal openness.

But AI has changed what “reuse” means.

A commercial company no longer needs to republish an article in the traditional sense to extract value from it. It can ingest millions of articles into a machine learning system, train a proprietary model, and build commercial products whose capabilities are partially shaped by that scholarly corpus. The articles are not reproduced in the conventional publishing sense, but their informational value is absorbed into a system that can generate economic output. This is a very different kind of reuse than the one most early open access policy frameworks were designed to encourage.

That is why licensing has suddenly become a battleground.

Under CC BY, commercial reuse is broadly allowed, which makes CC BY content highly attractive both to AI developers and to publishers entering AI licensing markets. Under more restrictive licenses, such as CC BY-NC, commercial reuse becomes more constrained, at least in principle. That difference may sound technical, but it has major implications for the future of scholarly publishing. A license is no longer just about whether a professor can share a PDF with students. It may determine whether an entire corpus becomes available for large-scale commercial AI monetization.
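
To make the distinction concrete, here is a minimal Python sketch of how a corpus might be filtered by license before being offered for commercial use. The article records, the `LicenseTerms` class, and the `commercially_licensable` function are hypothetical names invented for illustration, and the table encodes only the headline terms of each Creative Commons license. Whether AI training is a use that requires a license at all remains a contested legal question, and this sketch does not settle it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Headline terms of a license (illustrative, not legal advice)."""
    attribution_required: bool
    commercial_use_allowed: bool


# Headline terms of a few common Creative Commons licenses.
CC_TERMS = {
    "CC BY": LicenseTerms(attribution_required=True, commercial_use_allowed=True),
    "CC BY-SA": LicenseTerms(attribution_required=True, commercial_use_allowed=True),
    "CC BY-NC": LicenseTerms(attribution_required=True, commercial_use_allowed=False),
    "CC BY-NC-ND": LicenseTerms(attribution_required=True, commercial_use_allowed=False),
}


def commercially_licensable(articles):
    """Return titles whose license nominally permits commercial reuse.

    Articles with an unrecognized license are excluded, on the
    conservative assumption that unknown terms grant nothing.
    """
    titles = []
    for article in articles:
        terms = CC_TERMS.get(article["license"])
        if terms is not None and terms.commercial_use_allowed:
            titles.append(article["title"])
    return titles


# Hypothetical corpus: four articles under four different licenses.
corpus = [
    {"title": "Paper A", "license": "CC BY"},
    {"title": "Paper B", "license": "CC BY-NC"},
    {"title": "Paper C", "license": "CC BY-SA"},
    {"title": "Paper D", "license": "CC BY-NC-ND"},
]

print(commercially_licensable(corpus))
```

Run against this made-up corpus, only Paper A and Paper C pass the filter. The point is how little it takes: a single field on each record can decide whether an article ends up inside or outside a commercial licensing pool.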

This is forcing an uncomfortable reassessment inside the academic world. The same legal openness once celebrated as a triumph of scholarly freedom may also be enabling forms of extraction that researchers never anticipated. A license that seemed ideal in 2015 may look very different in 2026, when journal archives are increasingly valuable not just to readers, but to AI companies building commercial products on top of scientific knowledge.

Publishers Want to Own the AI Layer Too

Academic publishers have spent decades controlling access to scholarly content. In the subscription era, they monetized scarcity by charging libraries to read research. In the open access era, many shifted toward monetizing publication by charging authors or institutions to publish research. Artificial intelligence now offers a third opportunity: monetizing computational access to scholarly corpora.

This is a significant shift because it means publishers are no longer simply defending legacy business models. They are positioning themselves inside the AI economy.

Control over journal archives suddenly looks like a strategic asset. Publishers host enormous repositories of peer-reviewed, professionally structured, domain-rich content. They maintain metadata systems, citation networks, taxonomies, and access infrastructures that make these corpora especially useful for machine learning applications. In an era when AI companies need high-quality training material, these archives become commercially valuable in a new way.

That creates a strategic temptation. Instead of merely selling subscriptions or APC-based publishing services, publishers can sell AI access, licensing rights, structured corpora, or computational partnerships. In other words, scholarly content can generate revenue not just once, but multiple times.

For publishers, this is smart business.

For academia, it may be a warning sign.

Because if publishers become gatekeepers not only of scholarly dissemination but also of scholarly AI infrastructure, then the power imbalance deepens. Universities may end up paying for access, paying for publication, and then potentially paying again to access AI systems trained partly on the research they helped produce in the first place. The same intellectual ecosystem can be monetized at multiple stages, each one controlled by a different layer of platform ownership.

That is where the AI question becomes inseparable from the publishing power question. This is no longer just about whether research is open. It is about who controls the economic value generated after that openness exists.

The Academic Community May Eventually Push Back

It would be a mistake to assume this system will continue uncontested.

Academic publishing has historically tolerated significant imbalances because journals were deeply tied to prestige, evaluation, and disciplinary legitimacy. Researchers often accepted dysfunctional economics because they had few alternatives. But AI changes the emotional and ethical stakes. Many scholars can accept publishers earning revenue from publishing services, even if they criticize the costs. They may feel differently when they realize their intellectual labor is also becoming AI training fuel in a separate commercial economy.

That could trigger a backlash.

Some policy thinkers are already discussing whether more restrictive licensing frameworks deserve renewed attention, especially for research intended to remain open for human use but not automatically available for unrestricted commercial AI exploitation. Others are exploring contractual mechanisms, platform-level restrictions, or new legal frameworks around data sovereignty and computational reuse. Universities may begin asking harder questions when they realize they are subsidizing knowledge production while others monetize the AI layer on top of it.

There is also a broader reputational issue for publishers. Scholarly communication depends heavily on trust, legitimacy, and the perception that publishing exists to support knowledge dissemination rather than maximize extraction at every available opportunity. If researchers begin to see publishers as quietly converting scholarly literature into AI licensing assets without meaningful transparency or community benefit-sharing, that trust may erode.

And if trust erodes, the political consequences could be significant.

Libraries, funders, and scholarly communities have already become more aggressive in challenging APC inflation, transformative agreement costs, and commercial concentration. AI monetization may become the next fault line. The academic sector may begin asking whether publicly funded scholarship should be governed as a shared knowledge commons, or whether it should continue serving as a commercial data reservoir for whichever platform controls the infrastructure.

The Bigger Question Is Not Legal. It Is Moral.

Publishers may have legal arguments. AI companies may have licensing arguments. Contracts may permit reuse. Terms of service may define access rights. But legality is not the only issue that matters in scholarly communication.

The deeper issue is moral.

Who should benefit when publicly funded knowledge becomes economically valuable in a new technological ecosystem? Should the answer automatically be the platform owner? Should researchers have a say? Should universities negotiate collective protections? Should scholarly communities distinguish between openness for science and openness for unrestricted commercial extraction?

These questions cut to the philosophical core of the open access movement.

Because open access was never supposed to be a simple transaction. It was supposed to be a reimagining of scholarly communication around public good, knowledge equity, and scientific progress. If that same openness now creates profitable AI pipelines that operate largely outside academic control, then the scholarly world has to decide whether this is an acceptable evolution, or whether it represents a distortion of the movement’s original purpose.

That debate is only beginning. But one thing is becoming increasingly clear: the future of open access may depend less on whether research is free to read, and more on who controls what happens after the reading stops.