April 30 news: Artificial intelligence companies have been building their products on content created by countless people on the Internet, without those people's consent or compensation. Now a growing number of technology and media companies are demanding to be paid, hoping to claim a piece of the chatbot boom.
If you have ever written a blog post, posted on Reddit, or shared anything else on the open web, you may well have contributed to the birth of the latest generation of artificial intelligence.
Google's Bard, OpenAI's ChatGPT, Microsoft's new Bing, and similar tools from start-ups are all built on artificial intelligence language models. But without the vast amount of text freely available on the Internet, these clever robot writers could never have been created.
Today, web content has once again become contested ground, in a way not seen since the early days of search engines, as tech giants try to carve this irreplaceable and newly valuable source of information into their own territory.
Technology and media companies that had unwittingly supplied this material are realizing that their data is crucial to nurturing a new generation of language-based artificial intelligence. Reddit, one of OpenAI's most valuable training resources, recently announced that it would charge AI companies for data access. OpenAI declined to comment.
Twitter, too, has recently begun charging for data access, a change that affects many parts of its business, including AI companies' use of its data. The News Media Alliance, a trade group representing publishers, argued in a paper published this month that companies should pay licensing fees when they use its members' work to train artificial intelligence.
"what really matters to us is where the information belongs," says Prashanth Chandrasekar, chief executive of Stack Overflow, a programmer question and answer website. The company plans to start charging fees for user-generated content visited by large artificial intelligence companies. "the Stack Overflow community has spent so much effort answering questions over the past 15 years that we really want to make sure our efforts are rewarded."
AI services that learn to generate images, such as OpenAI's DALL-E 2, have already been accused of large-scale theft of intellectual property, and the companies behind them are now fighting lawsuits over those allegations. The dispute over AI-generated text could be even bigger, raising questions not only of compensation and credit but also of privacy.
But Emily M. Bender, a computational linguist at the University of Washington, notes that under current law an artificial intelligence cannot be held responsible for its actions.
The dispute stems from the way AI chatbots are built. Their core algorithms, known as large language models, must ingest and process enormous amounts of existing text in order to imitate the content and style of human speech. This data differs from the kind that powers the Internet services we are used to, such as the behavioral and personal information that Meta Platforms, Facebook's parent company, uses for targeted advertising.
The data is created by human beings using a wide variety of services, such as the hundreds of millions of posts written by Reddit users. Only the open Internet offers a human-generated corpus large enough; without it, none of today's chat-based AI and related technology would exist.
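To see why so much human text is needed, consider a toy sketch of the underlying idea: a model counts which words tend to follow which in a corpus, then generates new text by repeatedly sampling a likely next word. Real large language models use neural networks trained on billions of documents rather than simple counts; the tiny corpus and bigram counter below are purely illustrative.

```python
# Toy illustration (not any company's actual system) of the core idea behind
# language models: learn from a corpus which word tends to follow which, then
# generate text by repeatedly sampling a likely next word.
import random
from collections import defaultdict, Counter

corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
)

# Count next-word frequencies for every word (a bigram model).
next_words = defaultdict(Counter)
tokens = corpus.split()
for current, following in zip(tokens, tokens[1:]):
    next_words[current][following] += 1

def generate(start: str, length: int = 8) -> str:
    """Generate text by sampling each next word in proportion to its count."""
    word, output = start, [start]
    for _ in range(length):
        candidates = next_words.get(word)
        if not candidates:
            break
        choices, weights = zip(*candidates.items())
        word = random.choices(choices, weights=weights)[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the cat sat on the rug . the dog"
```

The more text such a model sees, the more human its output sounds, which is why the open web's corpus is so valuable.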
In a paper published in 2021, Jesse Dodge, a research scientist at the nonprofit Allen Institute for Artificial Intelligence, found that Wikipedia and countless copyrighted news articles from media organizations large and small sit inside the most commonly used web-crawl datasets. Google and Facebook have both used such datasets to train large language models, and OpenAI uses a similar database.
OpenAI no longer discloses its data sources, but according to a paper it published in 2020, its large language models have used posts crawled from Reddit to filter and improve the data used in training.
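The paper does not spell out a full pipeline, but one plausible reading of that filtering step is sketched below: treat a Reddit post's score as a rough human signal of quality and keep only the web pages it links to. The field names, threshold, and helper function are assumptions for illustration, not OpenAI's actual code.

```python
# Hedged sketch of a Reddit-based quality filter: keep only pages linked from
# posts that received a minimum score, using upvotes as a crude quality signal.
MIN_SCORE = 3  # assumed threshold, chosen for illustration

def filter_training_pages(reddit_posts: list[dict]) -> set[str]:
    """Collect outbound URLs from sufficiently upvoted Reddit posts."""
    keep: set[str] = set()
    for post in reddit_posts:
        if post["score"] >= MIN_SCORE and post.get("url"):
            keep.add(post["url"])
    return keep

posts = [
    {"score": 57, "url": "https://example.com/good-essay"},
    {"score": 1, "url": "https://example.com/spam"},
]
print(filter_training_pages(posts))  # {'https://example.com/good-essay'}
```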
Reddit spokesman Tim Rathschmidt said it is unclear how much revenue charging companies for data access will generate, but that the data Reddit holds could help improve today's most advanced large language models.
Publishing executives have reportedly been investigating how much of their content has been used to train ChatGPT and other AI tools, what compensation they should receive, and what laws they could invoke to defend their rights. But Danielle Coffey, the News Media Alliance's general counsel, said the group has yet to reach an agreement with the owner of any large AI chat engine, such as Google, OpenAI, or Microsoft, to pay for training data crawled from its members.
Twitter did not respond to a request for comment, and Microsoft declined to comment. A Google spokesman said: "For a long time, we have been helping creators and publishers monetize their content and strengthen their relationships with their audiences. In line with our AI principles, we will continue to innovate in a responsible and ethical manner." He added that "it is still at an early stage" and that Google is seeking input on how to build AI in a way that benefits the open web.
Legal and moral quagmire
In some cases, copying data available on the open web (a practice known as scraping or crawling) is legal, though companies are still wrangling over the details of how and when it is allowed.
Most companies and organizations put their data online because they want it to be found and indexed by search engines, so that people can reach the original source. Copying that data to train an AI that replaces the need to visit the source is something else entirely.
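That norm is partly encoded in robots.txt, the file through which sites tell crawlers what they may fetch. Below is a minimal sketch of a "polite" crawler that honors it, using only Python's standard library; the URL and user-agent string are placeholders, not a real crawler's identity.

```python
# Minimal sketch of a "polite" crawler: check a site's robots.txt before
# fetching a page, the convention that search-engine crawlers follow.
from urllib import robotparser, request

USER_AGENT = "example-research-bot"  # hypothetical crawler name
page_url = "https://example.com/some-article"

robots = robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # download and parse the site's crawling rules

if robots.can_fetch(USER_AGENT, page_url):
    # The site permits this agent to fetch the page.
    req = request.Request(page_url, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    print(f"Fetched {len(html)} characters")
else:
    print("robots.txt disallows fetching this page")
```

Nothing technically stops a crawler from ignoring these rules, which is part of why the dispute has moved into courtrooms and contracts.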
Bender, the computational linguist, says the operating principle of technology companies that harvest the Internet to train AI amounts to: "We can take it, so it's ours." Converting text (books, magazine articles, personal blog essays, patents, scientific papers, Wikipedia entries) into chatbot answers strips out the links back to the source material. That makes it harder for users to verify what a bot tells them, a serious problem for systems that frequently make things up.
This large-scale harvesting can also sweep up our personal information. Common Crawl, a nonprofit, has been collecting vast amounts of content from the open web for more than a decade and offers its database to researchers free of charge. That database has also become a starting point for companies that want to train AI, including Google, Meta, and OpenAI.
Sebastian Nagel, a data scientist and engineer at Common Crawl, says a blog post you wrote years ago and later deleted may still survive in the training data used by OpenAI, which trains its AI on web content captured years earlier.
Unlike the search indexes run by Google and Microsoft, a trained AI cannot simply have personal information deleted from it: removing the data means retraining the entire model, Bender said. And because the enormous computing power involved can push training costs into the tens of millions of dollars, Dodge said, companies are unlikely to retrain even if a user can prove that personal data went into the model.
Dodge added, however, that in most cases it is difficult to get an AI trained on data containing personal information to regurgitate it. OpenAI said it has tuned its chat-based systems to refuse requests for personal information, and governments in the European Union and the United States are considering new laws and regulations to govern such AI.
Accountability and profit sharing
Some proponents of artificial intelligence argue that AI should have access to all the data its engineers can reach, because that is how humans learn. Why, by that logic, shouldn't machines do the same?
Setting aside the fact that an AI is not a human, Bender says, the argument has a flaw: under current law, an AI cannot be held responsible for its actions. A person who copies someone else's work, or who repackages misinformation as truth, can face serious consequences; a machine and its creators do not bear the same liability.
That may not always be so. Just as Getty Images sued AI companies that used its intellectual property as training data for image generators, companies and other organizations whose content is used without authorization are likely to eventually take the makers of chat-based AI to court, unless those makers agree to license it.
Did the personal essays of countless people, the posts on obscure forums and long-vanished social networks, and everything else scraped from the web really help make today's chatbots such capable writers? If so, perhaps the only reward their creators will ever see is having contributed something to how these machines use language.