Science publishers are selling access to research papers to technology companies to train artificial intelligence (AI) models. Some researchers have reacted with dismay to such deals, which take place without consulting authors. The trend raises questions about the use of published, and sometimes copyrighted, work to train the growing number of AI chatbots in development.

Experts say that any research paper that has not yet been used to train a large language model is likely to be used soon. Researchers are exploring technical options that would let authors determine whether their content has been used.

Last month, it was announced that the academic publisher Taylor & Francis, based in Milton Park, UK, had signed a US$10-million deal with Microsoft, allowing the US technology company to access the publisher's data to improve its AI systems. In June, an investor update showed that the US publisher Wiley had earned $23 million by allowing its content to be used to train generative AI models.

Everything that is available online, whether in an open-access repository or not, has "pretty much" already been fed into a large language model, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle. "And once a paper has been used as training data in a model, there is no way to remove that paper after the model has been trained," she adds.

Massive data sets

LLMs are trained on huge amounts of data, often scraped from the Internet. The models derive patterns from the often billions of snippets of language in the training data, known as tokens, which enable them to generate text with astonishing fluency.
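As a rough illustration of what such tokens look like, the sketch below splits a sentence into token IDs using the open tiktoken library. This is purely an example for illustration; the models discussed in this article do not necessarily use this tokenizer.

```python
# Minimal sketch of tokenization, using the open tiktoken library
# (an illustrative assumption, not the tokenizer of any specific model here).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common GPT-style vocabulary

text = "Scientific papers are valuable training data."
token_ids = enc.encode(text)

print(token_ids)                              # integer IDs the model actually sees
print([enc.decode([t]) for t in token_ids])   # the text fragment behind each ID
```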

Generative AI models rely on absorbing patterns from these masses of data to output text, images or computer code. Scientific papers are valuable to LLM developers because of their length and "high information density", says Stefan Baack, who analyses AI training data sets at the Mozilla Foundation in San Francisco, California.

The trend of buying high-quality data sets is growing. This year, the Financial Times offered its material to the ChatGPT developer OpenAI in a lucrative deal, as did the online forum Reddit to Google. And because scientific publishers probably see the alternative as having their work scraped without authorization, "I think more such deals are imminent," says Wang.

Secrecy about information

Some AI developers, such as the Large-scale Artificial Intelligence Open Network (LAION), deliberately keep their data sets open, but many companies developing generative AI models have kept a large part of their training data secret, says Baack. "We have no idea what is in it," he says. Open-source repositories such as arXiv and the scientific database PubMed are considered "very popular" sources, although paywalled journal articles are probably also scraped up by large technology companies free of charge. "They are always on the hunt for such information," he adds.

It is difficult to prove that an LLM has used a certain paper, says Yves-Alexandre de Montjoye, a computer scientist at Imperial College London. One option is to prompt the model with an unusual sentence from a text and check whether its output matches the next words in the original. If it does, that is a good sign that the paper is included in the training set. If not, that does not mean the paper was not used, not least because developers can program the LLM to filter its answers to ensure they do not match the training data too closely. "It takes a lot to make this work," he says.
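A minimal sketch of such a probe is shown below, using the Hugging Face transformers library with a small open model as a stand-in. The model name and the example sentences are placeholders for illustration, not the setup de Montjoye describes.

```python
# Hedged sketch: prompt a model with an unusual sentence fragment and see
# whether it reproduces the original continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in open model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prefix = "In this paper we introduce a rather unusual phrasing that"   # start of a sentence from the paper
true_continuation = "only appears in the original manuscript"          # how the sentence actually continues

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=12, do_sample=False)
generated = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)

print("model continues with:", generated)
# A close match with `true_continuation` hints that the text may have been in
# the training data; a mismatch proves nothing, as the article notes.
```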

Another procedure for checking whether data are included in a training set is called a membership inference attack. It is based on the idea that a model is more confident about its output when it sees something it has seen before. De Montjoye's team has developed a version of this, called a "copyright trap", for LLMs.

To set the trap, the team generates plausible but nonsensical sentences and hides them in a work, for example as white text on a white background or in a field displayed on a website with zero width. If an LLM is more "surprised" by an unused control sentence than by the sentence hidden in the text (surprise being a measure of the model's perplexity), "that is statistical evidence that the traps have been seen before," he says.
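The comparison behind such a trap can be sketched as below: compute the model's perplexity on the hidden trap sentence and on control sentences that were never published. This is a simplified illustration under assumed placeholder sentences and a stand-in open model, not the team's published method.

```python
# Hedged sketch: compare model perplexity on a trap sentence vs. controls.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in open model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(sentence: str) -> float:
    """Lower perplexity means the model is less 'surprised' by the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # average negative log-likelihood
    return torch.exp(loss).item()

trap = "The violet kettle negotiates a parliament of sleepy apples."      # sentence hidden in the work
controls = [
    "A copper ladder apologises to the arithmetic of wet Tuesdays.",       # never published anywhere
    "Eleven polite thunderstorms alphabetise the museum of soup.",
]

print("trap:", perplexity(trap))
for c in controls:
    print("control:", perplexity(c))
# If the trap is consistently less surprising than the controls, that is taken
# as statistical evidence that the model saw it during training.
```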

Copyright questions

Even if it were possible to prove that an LLM was trained on a specific text, it is not clear what would happen next. Publishers argue that using copyrighted texts in training without a licence counts as infringement. But a legal counter-argument holds that LLMs do not copy anything: they extract informational content from the training data, which is broken up in the process, and use what they have learned to generate new text.

A legal case might help to clarify this. In an ongoing US copyright lawsuit that could set a precedent, The New York Times is suing Microsoft and the ChatGPT developer OpenAI, based in San Francisco, California. The newspaper accuses the companies of using its journalistic content without permission to train their models.

Many academics are happy for their work to be included in LLM training data, especially if it makes the models more accurate. "Personally, I don't mind if a chatbot writes in my style," says Baack. But he concedes that his profession is not threatened by the output of LLMs in the way that other professions, such as those of artists and writers, are.

Individual academic authors currently have little say when the publisher of their paper sells access to their copyrighted works. For publicly available articles, there are no established means of assigning credit or of knowing whether a text has been used.

Some researchers, including de Montjoye, are frustrated. "We want LLMs, but we still want something that is fair, and I think we have not yet invented what that looks like," he says.