August 31 news: with the rapid development of artificial intelligence technology, web crawlers once used for search engine indexing are now being used to collect training data for artificial intelligence models.
Content creators are realizing that the fruits of their work are being used for free by big technology companies to build new artificial intelligence tools, and the crawler protocol can no longer solve this problem. That may sap content owners' motivation to share work online, and in doing so fundamentally change the Internet.
In the late 1990s, a simple file called the crawler protocol (robots.txt) appeared, allowing website owners to tell search engine crawlers which pages they may and may not crawl. Today the crawler protocol has become one of the unofficial rules of the web that the industry generally accepts.
The main purpose of these crawlers is to index information and improve search results. Google, Bing and other search engines all run crawlers that index web content and serve it to potentially billions of users. This is also the foundation of the booming Internet: creators share a wealth of information because they know users will visit their sites and view advertisements, subscribe to services or buy goods.
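In practice, robots.txt is just a plain-text file served at the root of a site, listing which crawlers (identified by user-agent name) may fetch which paths. The sketch below is a minimal, hypothetical example for an imaginary site; the domain example.com, the /private/ path and the sitemap line are illustrative only, not taken from any real deployment.

    # Hypothetical robots.txt served at https://example.com/robots.txt
    # Any crawler (User-agent: *) may fetch everything except /private/
    User-agent: *
    Disallow: /private/

    # Optional hint pointing search engines at the sitemap for indexing
    Sitemap: https://example.com/sitemap.xml

Nothing in the file is technically enforced; it is a request that well-behaved crawlers are expected to honor.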
However, generative artificial intelligence and large language models are rapidly changing the role of web crawlers. Instead of supporting content creators, these tools have become their adversaries.
Crawlers feed big technology companies
Today, web crawlers collect online information to build large datasets that wealthy technology companies use for free to develop artificial intelligence models. CCBot, for example, gathers data for Common Crawl, one of the largest sources of AI training data, while GPTBot collects data for the star AI startup OpenAI. Google describes the training data for its large language model as an "infinite collection", but does not mention that much of it comes from C4, a stripped-down version of Common Crawl.
The artificial intelligence models developed by these companies use this free information to learn how to answer users' questions, which is a far cry from the established pattern of indexing websites and giving users access to the original content.
Without potential consumers, content creators have no incentive to keep letting web crawlers collect their data for free. GPTBot has already been blocked by Amazon, Airbnb, Quora and thousands of other websites, and more and more sites are also blocking CCBot, the crawler behind the Common Crawl dataset.
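Blocking a particular AI crawler works the same way: the site owner adds robots.txt rules keyed to that crawler's published user-agent token. A minimal sketch, assuming the GPTBot and CCBot token names that OpenAI and Common Crawl publish for their crawlers, might look like the following; compliance still depends entirely on the crawler choosing to honor the file.

    # Hypothetical robots.txt rules refusing AI training crawlers site-wide
    User-agent: GPTBot     # OpenAI's crawler
    Disallow: /

    User-agent: CCBot      # Common Crawl's crawler
    Disallow: /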
A crude tool
The means of stopping these web crawlers have not changed much. Site owners can deploy the crawler protocol and block specific crawlers, but the effect is far from ideal.
"It's a bit of a crude tool," said Joost de Valk, a former WordPress executive, technology investor and founder of the digital marketing company Yoast. "It has no legal basis and is basically maintained by Google, even though they claim to maintain it together with other search engines."
Given companies' huge demand for high-quality artificial intelligence data, the crawler protocol is also easy to get around. A company like OpenAI, for example, could bypass the rules people have set in robots.txt simply by changing the name of its web crawler, since the rules target crawlers by name.
In addition, because compliance with the crawler protocol is voluntary, a web crawler can simply ignore the instructions and continue collecting information. Crawlers from newer search engines such as Brave are not affected by the rules.
"All the information on the Internet is being sucked up into these models," said Nick Vincent, a computer science professor who studies the relationship between human-generated data and artificial intelligence. "There's a lot going on behind this. In the time ahead, we hope to evaluate these models in different ways."
Creators respond
De Valk warned that content owners and creators may have been too slow to grasp the risk of letting these web crawlers access their data for free and use it indiscriminately to develop artificial intelligence models.
"Doing nothing now means, 'I agree that my content is in every AI and large language model in the world,'" de Valk said. "That is completely wrong. A better crawler protocol needs to be created, but it is hard for search engines and big artificial intelligence teams to do it themselves."
A number of large companies and websites have responded recently, some of them deploying the crawler protocol for the first time.
Originality.ai, a company that monitors AI-generated content, said that as of August 22, 70 of the 1,000 most popular websites were using the crawler protocol to block GPTBot.
Originality.ai also found that 62 of the 1,000 most popular sites were blocking Common Crawl's web crawler, CCBot. As awareness of AI data collection grows, more and more websites have begun blocking Common Crawl this year.
However, websites cannot enforce the crawler protocol. Any crawler can ignore the file and keep collecting data from a site's pages without the owner's knowledge. And even if deploying the crawler protocol carried legal weight, the protocol's original purpose has little to do with using web content to develop artificial intelligence models.
"Robots.txt is unlikely to be seen as a legal prohibition on the use of website data," said Jason Schultz, director of the Technology Law and Policy Clinic at New York University. It mainly signals that people do not want their websites indexed by search engines, not that they do not want their content used to train machine learning and artificial intelligence systems.
"This is a minefield"
In fact, this has been going on for years. As early as 2018, OpenAI released its first GPT model, trained on the BookCorpus dataset. Common Crawl started in 2008 and began making its dataset available through Amazon's cloud service in 2011.
Although more and more websites are blocking GPTBot, Common Crawl poses a greater threat to companies worried that their data is being used to train artificial intelligence models. It can be said that Common Crawl is to artificial intelligence what Google is to Internet search.
"This is a minefield. We updated our strategy only a few years ago, and now we are in a different world," said Catherine Stihler, chief executive of the nonprofit Creative Commons.
Creative Commons, founded in 2001, gives creators and content owners a way to replace strict copyright with Creative Commons licenses for using and sharing work online. Under such a license, creators and owners retain their rights while allowing others to access the content and create derivative works. Wikipedia, Flickr, Stack Overflow and many other well-known websites operate under Creative Commons licenses.
In its latest five-year strategy, Creative Commons says there are problems with the way open content is used to train artificial intelligence, and the organization wants to make sharing work online fairer.
Common Crawl, which collects public information through CCBot, maintains the largest such data repository. Since 2011 it has crawled and saved information from 160 billion web pages, and the total keeps growing. It typically crawls and saves around 3 billion web pages each month.
Common Crawl says it is an "open data" project intended to be open to anyone with the curiosity to analyze the world and pursue great ideas.
However, the situation is completely different now. Large amounts of data collected by Common Crawl are used by big technology companies to develop proprietary models. Even if a large technology company does not profit from its artificial intelligence products today, it may well do so in the future.
Some large technology companies have stopped disclosing the sources of their training data, but many powerful artificial intelligence models have been built with Common Crawl data. It helped Google develop Bard, helped Meta train Llama, and helped OpenAI create ChatGPT.
Common Crawl also supplies data to The Pile, which additionally contains datasets gathered by other crawlers. The Pile has been widely used in artificial intelligence projects, including Llama and MT-NLG, which was jointly developed by Microsoft and Nvidia.
Since June, one of the most downloaded parts of The Pile has been a collection of copyrighted comic books, including Archie, Batman, X-Men, Star Wars and Superman titles. These works were created by DC Comics and Marvel and are still protected by copyright. It has recently been reported that large numbers of copyrighted books are also stored in The Pile.
Schultz of New York University notes that crawlers differ widely in purpose and use, which makes it difficult to regulate them or to require that they use data in any particular way.
For its part, The Pile acknowledges that the data contains copyrighted material, and the technical paper that introduced the dataset concedes that "the processing and distribution of data owned by others may also violate copyright law".
The Pile also argues that, even though works are stored in the dataset relatively unchanged, the use of these materials should count as transformative under the principle of fair use. At the same time, it acknowledges that complete copyrighted works are needed to produce the best results when training large language models.
That notion of fair use in web crawling and artificial intelligence projects has been called into question. Writers, visual artists and even source-code developers have sued companies such as OpenAI, Microsoft and Meta because their original works were used to train models without permission and without any benefit to them.
Steven Sinofsky, a former Microsoft executive and partner at the venture capital firm Andreessen Horowitz, recently wrote on social media that putting something on the Internet does not mean anyone may use it for free and without restriction, absent the creator's consent.
No solution yet
"We are trying to work through all of these problems right now," said Stihler, the Creative Commons chief executive. There are many issues to be addressed: compensation, authorization, trust. In the age of artificial intelligence, there is no answer yet.
De Valk said that because Creative Commons licenses smooth the circulation of copyrighted work and let creators permit their works to be used across the Internet, they could serve as a potential licensing model for the development of artificial intelligence models.
Stihler is not sure. When it comes to artificial intelligence, she says, there may be no single solution; even a more flexible general license might not work. How do you license the entire Internet?
"Every lawyer I have spoken to says licensing doesn't solve the problem," Stihler said.
She often discusses the issue with authors, artificial intelligence industry executives and other stakeholders. Stihler met with representatives of OpenAI earlier this year and said the company was discussing how to reward creators.
But she added that it is not yet clear what the commons will look like in the era of artificial intelligence.
Given that web crawlers have already collected vast amounts of data for large technology companies, and that content creators have essentially no control over how it is used, the Internet could change dramatically.
If publishing information online means handing free data to artificial intelligence models that will compete with its creators, that activity may simply stop.
There are already signs that fewer programmers visit the question-and-answer site Stack Overflow to answer questions because their previous efforts have been used to train artificial intelligence models, which can now answer many questions automatically.
Stihler said the future of online content creation may soon resemble today's streaming landscape, with content locked inside increasingly expensive subscription services.
"If we are not careful, we will end up closing off the commons," Stihler said. "There will be more walled gardens and more things that people cannot access. That is not a successful model for the future of knowledge and creativity."