
AI data tagging is not "dirty work"

Updated 2025-03-31. From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

13:58, 14 January 2020

If artificial intelligence is a "rocket", then data is the "fuel" that lifts it. Machine learning relies on large amounts of labeled data, which enable machines to perceive and understand the world. Data tagging is an indispensable part of AI development and forms the base of the AI pyramid. In sharp contrast to the prosperity of AI's "front stage", data tagging stays behind the scenes, where it is often ignored and stereotyped with labels such as "sweatshop", "the Foxconn of AI" and "the new migrant workers". As AI deployment deepens and places higher demands on data, the data tagging industry has gradually moved from a stage of unchecked growth into a more refined one.

Data tagging behind the "AI pyramid"

Data is the foundation of machine learning: models are built from data, and rich labels are a prerequisite for successful modeling. Supervised learning is currently the most widely used machine learning approach. It relies heavily on labeled data, building a prediction model by learning from a large number of labeled training samples. Deep learning also needs to be "fed" large amounts of data, and deep learning models are generally trained on large supervised data sets. Su Haibo, the chief algorithm scientist at%, once said that deep learning can exert its power only in scenarios with sufficient labeled data, but many practical applications do not have enough of it.

The deployment of AI across every kind of scenario and the arrival of the big data era have produced exponentially growing volumes of data, and acquiring data has become relatively easy. Obtaining large amounts of labeled data, however, is not easy and often requires substantial human, material and financial resources. In segments with high professional barriers, such as medical AI, the lack of labeled data has become a "stumbling block" for the industry. Zheng Yefeng, director of Tencent YouTu Lab, once said in an interview with AI Frontline that medical data tagging is "difficult" for two reasons: top talent for medical data tagging is scarce, and medical experts carry heavy clinical and research workloads, so many of them do not have the time and energy for data tagging.

Data tagging mainly covers speech, image and text data. Annotators mark up a data set by classifying items, highlighting key points, adding tags, drawing bounding boxes and writing annotations, and the data set is then handed to the machine for training and learning. The main types of data tagging are pinyin tagging, prosodic tagging, part-of-speech tagging, phoneme time-point tagging, speech transcription, classification tagging, point tagging, bounding-box tagging, region tagging and so on. Because the data to be tagged is large in scale and costly to process, Internet giants and AI companies rarely keep tagging teams of their own; most hand the work over to third-party data service companies or data tagging teams.

Data services were the first line of business for Standard Bay Technology. Since its founding in 2016, the company has provided speech, image and NLP data collection and tagging services for BAT, AI unicorns and other companies. According to Miao Guanqiong, head of data at Standard Bay Technology, the company has self-developed collection and tagging platforms, including a long-speech (dialogue, continuous speech) tagging platform, a short-speech (about ten seconds) tagging platform, an AI speech synthesis data tagging platform, a data workshop app, and so on. Which platform is used depends on a combination of factors, such as whether the data is image or speech, where it comes from, and what the customer needs. In speech synthesis data tagging, for example, the labels include the phonetic transcription, prosody, phoneme time points, part of speech and other tags.
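To make the speech synthesis example more concrete, the sketch below shows one way such a tagged record could be represented and sanity-checked. It is only an illustration: the class names, field names and checks are hypothetical and do not come from Standard Bay Technology or any real tagging platform.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhonemeSegment:
    phoneme: str     # phoneme symbol, e.g. "n"
    start_s: float   # start time in seconds
    end_s: float     # end time in seconds


@dataclass
class SynthesisAnnotation:
    """Hypothetical record for one tagged speech synthesis utterance."""
    text: str                    # the sentence that was read aloud
    pinyin: List[str]            # one phonetic syllable per character
    prosody_breaks: List[int]    # indices of syllables followed by a prosodic boundary
    pos_tags: List[str]          # part-of-speech tag per word
    phonemes: List[PhonemeSegment] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return readable consistency problems (empty list if none)."""
        found = []
        previous_end = 0.0
        for i, seg in enumerate(self.phonemes):
            if seg.end_s <= seg.start_s:
                found.append(f"segment {i} ({seg.phoneme}) has non-positive duration")
            if seg.start_s < previous_end:
                found.append(f"segment {i} ({seg.phoneme}) overlaps the previous segment")
            previous_end = max(previous_end, seg.end_s)
        for index in self.prosody_breaks:
            if not 0 <= index < len(self.pinyin):
                found.append(f"prosody break {index} points outside the syllable list")
        return found
```

A record whose phoneme time points are ordered and non-overlapping returns an empty list from `problems()`; anything else yields messages that a human proofreader could act on.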

The prosperity of artificial intelligence has given rise to a growing data tagging industry and created a large number of jobs. According to available figures, China has about 200,000 full-time and 1 million part-time data tagging practitioners, and hundreds of companies are engaged in the data tagging business.

Data "migrant workers"?

In the data tagging industry there is a popular saying: "However much manual labor goes in, that is how much intelligence comes out." Data tagging is a vital part of the development of artificial intelligence, yet it is often overlooked.

Relatively speaking, data tagging is an "entry-level" job in the field of artificial intelligence. Judged by workflow alone, its technical content is relatively low, and people are the biggest factor in the work. Over time, "labor-intensive" has become the label attached to the data tagging industry. The low threshold has drawn many farmers, students and people with disabilities into the data tagging workforce, and distinctive "data tagging villages" have appeared in fourth- and fifth-tier cities in Henan, Hebei, Guizhou, Shanxi and other parts of China.

Moving to places with more abundant labor and lower costs is a trend across the global data tagging industry, not only in China. A number of data tagging villages have sprung up in India, serving AI companies in the United States, Europe, Australia and Asia, and Facebook has outsourced some of its social content tagging to an Indian company.

As a result, these workers have become participants in the wave of artificial intelligence. Although they are paid far less than other AI practitioners, data tagging is easier and more respectable than traditional manual labor. The flip side is that the workflow is simple and tedious: annotators repeat the same "box-drawing" work day after day. Claims that data tagging is "dirty work" done by "data migrant workers" also continue to circulate.

Miao Guanqiong does not agree with these "voices".

"I don't think this is a 'dirty work' industry, because it is not a job that just anyone can do. AI itself is developing rapidly, and as applications reach the market, the requirements for data keep rising, and so do the requirements for the people who collect it." Because the service quality of outsourced teams is hard to control, Standard Bay Technology relies mainly on its own data tagging team for the projects it takes on. It has data teams in Tianjin, Changchun and other cities, and brings in part-time staff temporarily according to the size of the project. When selecting part-time staff, it gives more weight to professional level and language or dialect background, or requires data tagging experience; those with no experience must complete at least 6 months of training.

Miao Guanqiong said that the data tagging industry is becoming more and more specialized. In the early days, most tagging work was on Chinese data. Now, with the growth of multilingual, dialect and personalized tagging, demand keeps increasing, and the work is not something that just anyone can do casually; professionals are needed. In addition, the "sweatshop" situation mostly occurred in the industry's early days and applies to small teams whose only business is data tagging and which cannot undertake complex, customized projects. As for workload, it depends on customer needs; taking speech tagging as an example, the effective speech one person tags in a working day amounts to 1 hour.

The share of machine tagging is rising, but it cannot replace manual work.

The wild days are over.

According to the 2019 White Paper on China's Artificial Intelligence Basic Data Service Industry, 2010-2016 was the "infancy" of the data service industry, marked by a surge in early demand for data tagging and low barriers to entry. Since 2017, as AI has been deployed in depth across application scenarios, the data tagging industry has entered a growth period, and application-layer vendors are demanding ever higher tagging quality in fields such as autonomous driving, motion images and computer vision.

The industry landscape is gradually becoming clear, and the Matthew effect is evident. Reportedly, hundreds of domestic companies and teams are engaged in the data tagging business; more than 100 of them independently handle the entire data quality service, dozens can provide integrated data acquisition services, and only about a dozen can provide high-standard basic data services. At this stage, downstream AI algorithm R&D organizations mostly spread their business across different data service companies, and data tagging standards still need improvement, so no dominant giant has yet emerged in the industry.

This is an unsaturated market, which also means there is a lot of room for development. According to statistics, the market for China's artificial intelligence basic data services was 2.586 billion yuan in 2018, with a compound annual growth rate of 23.5%.
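For readers unfamiliar with compound growth, the sketch below shows how the two quoted figures combine arithmetically. The article does not say which years the 23.5% rate covers, so applying it forward from the 2018 base is purely an assumption made for illustration, not a forecast; the function names are also hypothetical.

```python
def project_market_size(base_size: float, annual_rate: float, years: int) -> float:
    """Compound base_size forward by annual_rate for a number of years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return base_size * (1.0 + annual_rate) ** years


def implied_cagr(start_size: float, end_size: float, years: int) -> float:
    """Recover the compound annual growth rate linking two market sizes."""
    if years <= 0 or start_size <= 0:
        raise ValueError("years and start_size must be positive")
    return (end_size / start_size) ** (1.0 / years) - 1.0


BASE_2018_BILLION_YUAN = 2.586  # figure quoted in the article
ANNUAL_RATE = 0.235             # figure quoted in the article

# Assumption: the rate is applied forward from 2018 (illustration only).
for offset in range(0, 4):
    size = project_market_size(BASE_2018_BILLION_YUAN, ANNUAL_RATE, offset)
    print(f"2018 + {offset} years: {size:.3f} billion yuan")
```

`implied_cagr` is the inverse calculation: given a start value, an end value and the number of years between them, it returns the constant yearly rate that connects the two.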

Miao Guanqiong believes that as data security and quality standards keep rising and relevant data policies are introduced, players that do not meet industry standards and customer needs will be eliminated from the market. She added, "The industry is currently in a stage of rapid growth, and as a whole it is moving toward personalization and specialization, from early simple, general-purpose data to more complex, personalized and scenario-specific data. In many segments, a large amount of real data is needed to iterate the model, rather than simple general data."

The data tagging industry has also begun to enter a stage of human-machine cooperation. Market demand for data tagging remains very large, and both more professional people and more efficient machines are needed. The share of machine tagging will continue to rise. AI technology and data complement each other: AI technology improves the efficiency of data work, and data in turn serves the technology.

To reduce labor costs and improve efficiency, many Internet technology companies and third-party data service providers are developing their own tagging tools. In October last year, Google released Fluid Annotation, a human-machine collaboration interface for full image annotation. It can be used to label the class and outline of every object and background region in an image, and can triple the speed of creating annotated data sets. Crowdsourcing platforms for data tagging are also emerging, such as JD.com Zhong Zhi, Baidu Public Test, Figure Eight and Amazon Mechanical Turk.

Machine tagging with manual assistance is a foreseeable trend. That may not be good news for the "data tagging villages". But Miao Guanqiong believes that machines cannot completely replace people. At present, manual tagging is more accurate than machine tagging: a machine gets only a certain proportion of results right, so more accurate results still have to be tagged by hand, and people play the more important role. In quality inspection, too, the human role is irreplaceable. Standard Bay Technology's data proofreading is primarily manual and follows a process of "first review, second proofreading, third inspection". The machine samples and checks part of the data and supplies preprocessing results, while the final result depends on careful manual proofreading.
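The division of labor described above, with a machine supplying a preliminary label and three manual stages deciding the final one, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the `proofread` function, the `Reviewer` type and the stage names are hypothetical and do not describe Standard Bay Technology's actual tooling.

```python
from typing import Callable, List, Optional, Tuple

# A reviewer receives the item and its current label and returns a
# corrected label, or None to accept the current label unchanged.
Reviewer = Callable[[str, str], Optional[str]]


def proofread(
    item: str,
    machine_label: str,
    stages: List[Tuple[str, Reviewer]],
) -> Tuple[str, List[Tuple[str, str]]]:
    """Pass a machine pre-label through manual stages in order.

    Returns the final label and a history of (stage name, label after stage).
    """
    label = machine_label
    history = [("machine", machine_label)]
    for stage_name, reviewer in stages:
        correction = reviewer(item, label)
        if correction is not None:
            label = correction  # the human decision overrides the earlier label
        history.append((stage_name, label))
    return label, history


# Hypothetical reviewers for a toy image-classification item.
stages = [
    ("first review", lambda item, label: "cat" if label == "dog" else None),
    ("second proofreading", lambda item, label: None),
    ("third inspection", lambda item, label: None),
]
final_label, history = proofread("image_001.jpg", "dog", stages)
```

Keeping the per-stage history makes it possible to see where a machine pre-label was overridden, which is the kind of information a quality inspector would want.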

https://www.infoq.cn/article/F3eYbuTb2ygMIUdNtatL?utm_source=tt&utm_medium=infoq&utm_campaign=newinfoq&utm_content=0114ai
