Predicting the unknown has always been an ability human beings yearn for. There are the familiar Eight Trigrams of the Book of Changes and the "Tui Bei Tu" prophecies compiled by Taoist priests in the Tang Dynasty, as well as the astrology Westerners know well and the tarot cards that became popular in the Middle Ages. More recently, the nationwide frenzy and commercial carnival of 2012, fueled by the Mayan prophecy of "the end of the world," is still fresh in our memory.
The era lamented in the old verse, when rulers "asked not about the people's livelihood but about ghosts and spirits," has passed, and we are now accustomed to deterministic, empirical and even probabilistic predictions about the physical world and the social economy. But what about the highly complex, massively multivariate predictions over huge volumes of data described by the "butterfly effect"? Is there really nothing humans can do about those?
The answer is no.
Recently, the outbreak of the novel coronavirus epidemic in Wuhan, China has drawn close attention from the World Health Organization and health agencies around the world. Wired magazine reported that "BlueDot, a Canadian company, was the first to predict and publicize the infectious disease outbreak in Wuhan through its AI monitoring platform," a report that received wide attention from Chinese media. This seems to be the outcome we most want from "predicting the future": with big data as a foundation and AI for inference, human beings can seemingly glimpse "providence" and reveal the causal laws hidden in chaos, and thus try to save themselves before disaster strikes.
Today, starting from the prediction of infectious diseases, we will look at how AI is moving, step by step, toward "ingenious calculation."
Google GFT cries "wolf": a big data rhapsody on influenza
Using AI to predict infectious diseases is by no means unique to BlueDot. In fact, as early as 2008, Google, today's AI heavyweight, made an ultimately unsuccessful attempt.
In 2008, Google launched Google Flu Trends (hereinafter GFT), a system for predicting influenza trends. GFT became famous overnight a few weeks before the H1N1 outbreak in the United States in 2009, when Google engineers published a paper in Nature showing that, based on the huge volume of search data Google had accumulated, they had successfully predicted the spread of H1N1 across the United States. In its analysis of flu trends and regions, Google used billions of search records, processed 450 million different mathematical models, and constructed an influenza prediction index that showed a 97% correlation with official data from the US Centers for Disease Control and Prevention (CDC), yet ran two weeks ahead of the CDC's reports. In the face of an epidemic, time is life and speed is wealth: if GFT could sustain this predictive power, it could clearly buy society a head start in controlling infectious disease outbreaks.
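To make the "prediction index" idea concrete, here is a minimal Python sketch, using entirely synthetic data, of the general approach: regress a flu-related search-query share onto official illness rates, then score the resulting index by its correlation with the official series. The variable names and numbers are invented for illustration; GFT's actual model was far larger and was never published.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly data: the share of searches that are flu-related,
# and the CDC's influenza-like-illness (ILI) rate for the same weeks.
weeks = 104
cdc_ili = 2.0 + 1.5 * np.sin(np.linspace(0, 4 * np.pi, weeks))  # seasonal pattern
query_share = 0.8 * cdc_ili + rng.normal(0, 0.3, weeks)         # noisy proxy

# Fit a simple linear model: predicted_ili = a * query_share + b.
a, b = np.polyfit(query_share, cdc_ili, 1)
predicted_ili = a * query_share + b

# Correlation between the search-based index and the official data;
# GFT reported a figure around 0.97 against CDC reports.
corr = np.corrcoef(predicted_ili, cdc_ili)[0, 1]
print(f"correlation with CDC data: {corr:.2f}")
```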
However, the myth of prophecy did not last long. In 2014, GFT attracted media attention again, this time for its poor performance. That year, researchers published "The Parable of Google Flu: Traps in Big Data Analysis" in Science, pointing out that in 2009 GFT had failed to predict the non-seasonal influenza A (H1N1) wave, and that in 100 of the 108 weeks from August 2011 to August 2013, GFT's estimates ran higher than the influenza incidence reported by the CDC. By how much? In the 2011-2012 season, the incidence GFT predicted was more than 1.5 times the figure reported by the CDC; by the 2012-2013 season, it was more than double.
(Chart from "The Parable of Google Flu: Traps in Big Data Analysis", Science, 2014)
Although GFT adjusted its algorithm in 2013 and replied that the main culprit for the bias was a change in people's search behavior caused by heavy media coverage of GFT itself, its forecast of influenza incidence for the 2013-2014 season was still 1.3 times the CDC's report, and the systematic error researchers had identified earlier persisted. In other words, GFT kept crying "wolf."
What factors did GFT leave out that put its prediction system in such a dilemma?
According to the researchers' analysis, systematic error this large in GFT's big data analysis points to the following problems in its data collection and evaluation methods:
1. Big data hubris (Big Data Hubris)
The so-called "big data arrogance" is the premise given by Google engineers that big data, obtained through users' search keywords, contains full data collection of influenza diseases, which can completely replace the traditional data collection (sampling statistics). Not its supplement. In other words, GFT believes that the "collected user search information" data is fully related to the "population involved in a flu outbreak". This "arrogant" assumption ignores that the large amount of data does not represent the comprehensiveness and accuracy of the data, so the database samples successfully predicted in 2009 cannot cover the new data features that emerged in the following years. Also because of this "conceit", GFT does not seem to consider the introduction of professional health care data and expert experience, and does not "clean" and "denoise" user search data, leading to the problem that the incidence of the epidemic is overestimated but unable to solve.
2. Search engine evolution
At the same time, the search engine model itself was not static. After 2011, Google launched "recommended related search terms," the related-search model we are familiar with today.
For example, it lists flu-related search suggestions, and from around 2012 it also began recommending diagnostic terms related to influenza treatment. The researchers' analysis suggests these adjustments may have artificially pushed up certain searches and led Google to overestimate the epidemic's incidence. When a user searches for "sore throat," say, Google recommends keywords such as "sore throat and fever" and "how to treat sore throat"; users may click on these out of curiosity or other reasons, producing keyword data that does not reflect their own intent and degrading the accuracy of GFT's data collection.
Conversely, users' search behavior itself affects GFT's predictions: media reports on a flu epidemic, for instance, increase searches for flu-related terms, which in turn affect GFT's forecast. Just as Heisenberg's "uncertainty principle" in quantum mechanics holds that "measurement is interference," the noisy world of search engines, saturated with media reports and users' subjective intentions, harbors a parallel paradox: "prediction is interference." Search behavior is never entirely spontaneous; media reports, social media hot topics, search engine suggestions and even big data recommendations all shape users' minds, producing concentrated bursts of particular search terms.
Why were GFT's forecasts always on the high side? By this account, once the epidemic index GFT published rose, it would immediately trigger media reports, which generated more related searches, which in turn reinforced GFT's judgment of the epidemic. No amount of algorithmic adjustment could change this "uncertainty."
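This loop is easy to reproduce in a toy simulation. The sketch below assumes, purely for illustration, that media coverage of last week's published forecast boosts this week's searches by a fixed factor; a model that equates search volume with incidence then settles at a systematically inflated level even though true incidence is flat.

```python
import numpy as np

rng = np.random.default_rng(1)
true_incidence = 2.0 + rng.normal(0, 0.1, 52)  # flat true flu rate (hypothetical)
searches = np.zeros(52)
forecast = np.zeros(52)
media_boost = 0.4                              # assumed strength of coverage-driven searching

searches[0] = true_incidence[0]
forecast[0] = searches[0]
for t in range(1, 52):
    # Searches reflect real illness plus attention drawn by last week's forecast.
    searches[t] = true_incidence[t] + media_boost * forecast[t - 1] + rng.normal(0, 0.1)
    # The model naively equates search volume with incidence.
    forecast[t] = searches[t]

print(f"mean true incidence: {true_incidence.mean():.2f}")
print(f"mean forecast:       {forecast.mean():.2f}")  # systematically higher
```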
3. Correlation rather than causation
The root problem with GFT, the researchers point out, is that Google's engineers never established the causal link between search terms and the spread of flu; they focused only on the statistical correlation between data sets. Overemphasizing "correlation" while ignoring "causation" leads to inaccurate predictions. If searches for the word "flu" soar for a period, it may simply be because a "flu"-themed movie or song has just been released; it does not necessarily mean flu is actually breaking out.
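A few lines of synthetic data show how easily correlation without causation arises: two series that merely share a trend, say search volume inflated by a trending film and some unrelated rising quantity, can correlate strongly with no causal link at all. The series here are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(100)
flu_searches = 0.5 * t + rng.normal(0, 3, 100)     # rising because a film trends
ice_cream_sales = 1.2 * t + rng.normal(0, 5, 100)  # rising for its own reasons

corr = np.corrcoef(flu_searches, ice_cream_sales)[0, 1]
print(f"correlation: {corr:.2f}")  # high, yet neither series causes the other
```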
For a long time researchers hoped Google would disclose GFT's algorithm, but Google chose not to, leading many to question whether the results could be reproduced and whether commercial considerations were at play. Their hope was that big data from search would be combined with traditional statistics ("small data") to produce deeper and more accurate studies of human behavior.
Google evidently did not take this advice seriously, and GFT was officially taken offline in 2015. However, Google continues to collect the relevant search data, providing it only to the CDC and some research institutions.
Why BlueDot succeeded in predicting first: a concerto of AI algorithms and human analysis
As we all know, Google was already investing in artificial intelligence at the time, buying DeepMind in 2014 while keeping it independent. But Google paid no further attention to GFT; rather than adding AI to GFT's algorithmic model, it chose to let the project die quietly.
At about the same time, the BlueDot we see today was born.
BlueDot is an automated epidemic surveillance system founded by the infectious disease expert Kamran Khan. It tracks more than 100 infectious diseases by analyzing roughly 100,000 articles in 65 languages every day, using these targeted data collections to find clues to the outbreak and spread of potential epidemics. BlueDot trains its "automated disease surveillance platform" with natural language processing (NLP) and machine learning (ML) so that it can identify and discard irrelevant "noise" in the data, for example, distinguishing an outbreak of anthrax in Mongolia from a reunion of Anthrax, the heavy metal band formed in 1981. GFT, by contrast, simply treated anyone searching for "flu" as a possible influenza patient; the mass of unrelated users this swept in is an obvious reason its incidence estimates ran high. This ability to discriminate among key data is what sets BlueDot apart from GFT.
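BlueDot has not published its models, but the disambiguation described above is a standard NLP task. The toy sketch below, with a hypothetical hand-labeled corpus of six sentences, trains a bag-of-words classifier to separate "anthrax the disease" from "Anthrax the band," so that only disease mentions would flow on to the surveillance feed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled snippets; a real system would train on far more data.
train_texts = [
    "anthrax outbreak kills livestock in rural Mongolia",
    "health officials confirm anthrax infection in herders",
    "anthrax spores found in contaminated animal hides",
    "anthrax announces reunion tour with new album",
    "thrash metal band anthrax plays festival headline set",
    "anthrax guitarist discusses 1981 formation of the band",
]
train_labels = ["disease", "disease", "disease", "band", "band", "band"]

# TF-IDF features feeding a naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["suspected anthrax cases reported near a cattle farm"]))
# -> ['disease']; only disease-labeled items pass on as surveillance signals
```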
Regarding the novel coronavirus forecast, Kamran Khan has said BlueDot located the source of the epidemic by scanning foreign-language news reports, animal and plant disease networks, and official announcements. The platform's algorithm deliberately excludes social media content, because that data is too messy and prone to additional "noise."
To predict the transmission path of the virus after an outbreak, BlueDot instead draws on global airline ticketing data to better detect the movements and timing of infected travelers. In early January, within days of the outbreak, BlueDot correctly predicted that the novel coronavirus would spread from Wuhan to Beijing, Bangkok, Seoul and Taipei.
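In its crudest form, the ticketing-data idea reduces to ranking destinations by travel volume from the outbreak origin. The sketch below uses entirely hypothetical passenger counts; BlueDot's real model presumably weighs many more factors.

```python
# Hypothetical monthly traveler counts departing from the outbreak origin.
outbound_passengers = {
    "Bangkok": 110_000,
    "Beijing": 95_000,
    "Seoul":   68_000,
    "Taipei":  60_000,
    "Sydney":  22_000,
}

# Crude assumption: importation risk scales with the number of arriving travelers,
# so higher-volume destinations are flagged as likely to see cases first.
ranked = sorted(outbound_passengers.items(), key=lambda kv: kv[1], reverse=True)
for city, volume in ranked:
    print(f"{city}: {volume} travelers/month")
```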
This was not BlueDot's first success. In 2016, by building an AI model of the Zika virus's transmission path in Brazil, BlueDot predicted the appearance of Zika in Florida six months in advance, showing that its AI surveillance can even anticipate the geographic spread of an epidemic.
So what separates BlueDot's success from Google GFT's failure?
1. Differences in forecasting techniques
Earlier mainstream predictive analytics relied on a family of data mining techniques, chief among them the "regression" methods of mathematical statistics: multiple linear regression, polynomial regression, multi-factor logistic regression and so on. These are all, in essence, forms of curve fitting, that is, predictions of the "conditional mean" under different models. This is the technical principle behind GFT's prediction algorithm.
Before machine learning, multiple regression analysis offered an effective, general-purpose method: it seeks the result that minimizes the error on the observed data and maximizes goodness of fit. But regression's appetite for an unbiased fit to historical data cannot guarantee the accuracy of future predictions, which leads to the problem known as "overfitting."
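Overfitting is easy to demonstrate. In the synthetic example below, a straight line and a high-degree polynomial are both fit to the same noisy, truly linear data: the flexible model drives the training error toward zero but predicts unseen points worse than the simple one.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0, 1, 20) + 0.025           # unseen points nearby
f = lambda x: 1.0 + 2.0 * x                      # the true relationship is linear
y_train = f(x_train) + rng.normal(0, 0.2, 20)
y_test = f(x_test) + rng.normal(0, 0.2, 20)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)           # curve fitting
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: train MSE {mse(x_train, y_train):.4f}, "
          f"test MSE {mse(x_test, y_test):.4f}")
```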
In the article "The Glory and Traps of Big Data Analysis: Starting from Google Flu Trends," Shen Yan, a professor at Peking University's National School of Development, argues that GFT did indeed suffer from "overfitting." In 2009 GFT could observe all the CDC data from 2007-2008, and it searched for the best model against the training and test data with one standard: fit the CDC data as closely as possible, at any cost. Hence, according to the 2014 Science paper, GFT dropped some seemingly odd search terms and used another 50 million search terms to fit just 1,152 data points when modeling influenza prevalence in 2007-2008. After 2009, the data GFT had to predict contained ever more unknown variables, including the feedback of its own published predictions into the data. However GFT was adjusted, it still faced overfitting, and the system's overall error could not be avoided.
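The "50 million terms for 1,152 data points" problem can be illustrated at toy scale: when candidate predictors vastly outnumber observations, some of them will track the target over the training window by pure chance and then fail afterwards. In the sketch below every series is random noise, yet the best-correlated "search term" still looks impressive in-sample.

```python
import numpy as np

rng = np.random.default_rng(4)
n_weeks, n_terms = 50, 10_000
target = rng.normal(size=n_weeks)              # "flu incidence": pure noise
terms = rng.normal(size=(n_terms, n_weeks))    # candidate search-term series

# Pick the single term most correlated with the target over the training window.
train, test = slice(0, 25), slice(25, 50)
corrs = np.array([np.corrcoef(t[train], target[train])[0, 1] for t in terms])
best = np.argmax(np.abs(corrs))

print(f"best term, train correlation: {corrs[best]:+.2f}")       # looks strong
print(f"same term, test correlation:  "
      f"{np.corrcoef(terms[best][test], target[test])[0, 1]:+.2f}")  # near zero
```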
BlueDot adopts a different strategy: combining medical and public health expertise with artificial intelligence and big data analytics to track and predict the global distribution and spread of infectious diseases, and to propose the best response.
BlueDot relies mainly on natural language processing and machine learning to improve the effectiveness of its monitoring engine. With the growth of computing power and machine learning in recent years, statistical prediction methods have changed fundamentally. The key is deep learning (neural networks), trained by "back propagation": the model continuously trains on data, receives feedback, and acquires "knowledge," so that through systematic self-learning the prediction model is continuously optimized and its accuracy improves with learning. This makes the historical data fed in before training especially critical: sufficient feature data is the basis of training a prediction model, and cleaned, high-quality data with properly labeled features becomes the top priority for successful prediction.
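As a concrete (and deliberately tiny) picture of the back-propagation training loop the text refers to, the sketch below fits a two-layer network to toy data with plain numpy: a forward pass, a gradient computation, and a parameter update, repeated until the model improves. BlueDot's actual architecture is not public; this only illustrates the generic technique.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))                       # 200 samples, 3 toy features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float).reshape(-1, 1)

W1, b1 = rng.normal(0, 0.5, (3, 8)), np.zeros(8)    # hidden layer, 8 units
W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)
lr = 0.5

for step in range(500):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))            # sigmoid output
    # Backward pass: gradients of the cross-entropy loss for each parameter.
    d_out = (p - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)             # tanh derivative
    dW1, db1 = X.T @ d_h, d_h.sum(0)
    # Gradient descent update: the "feedback" step that improves the model.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"training accuracy: {((p > 0.5) == y).mean():.2f}")
```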
2. Differences in prediction models
Unlike GFT, which handed the entire prediction process over to the output of a big data algorithm, BlueDot does not leave prediction entirely to its AI monitoring system: after the data has been filtered, it is handed to human analysts. This is precisely the difference between GFT's big data "correlation" thinking and BlueDot's "expert experience" prediction model. The AI analyzes big data selected from specific websites (medical, health and disease news) and platforms (airline ticketing and the like), and the early warnings it produces are then re-examined by epidemiologists, who confirm whether the signal is genuine and assess whether the epidemic information can be released to the public as soon as possible.
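Schematically, the human-in-the-loop workflow described here can be pictured as a two-stage pipeline: the machine scores incoming signals, and only those above a threshold reach an epidemiologist, who decides whether anything is published. All names and thresholds below are invented.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    source: str
    text: str
    anomaly_score: float     # assumed output of the ML monitoring stage

REVIEW_THRESHOLD = 0.7       # hypothetical cutoff

def triage(signals):
    """Keep high-scoring signals for expert review; discard the rest as noise."""
    return [s for s in signals if s.anomaly_score >= REVIEW_THRESHOLD]

def expert_review(signal: Signal) -> bool:
    # Placeholder: in the described workflow a human epidemiologist decides here.
    print(f"review queue <- [{signal.source}] {signal.text}")
    return True

signals = [
    Signal("news", "cluster of unexplained pneumonia reported", 0.91),
    Signal("news", "metal band reunion tour announced", 0.12),
]
for s in triage(signals):
    if expert_review(s):
        print("publish alert")
```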
Of course, these cases do not show that BlueDot's epidemic predictions are an unqualified success. First, might the AI training model carry biases, for instance, overstating the severity of an epidemic in order to avoid underreporting, and so reviving the "crying wolf" problem? Second, is the data the monitoring model evaluates actually valid, given that BlueDot is careful about using social media data precisely to avoid excessive "noise"?
Fortunately, as a professional health services platform, BlueDot has to care more about the accuracy of its monitoring results than GFT did. After all, professional epidemiologists are the final publishers of these forecasts, and their accuracy directly affects the platform's credibility and commercial value. This also means BlueDot faces tests of its own: how to balance commercial profit against public responsibility, openness of information, and so on.
AI's prediction of an epidemic is just a prelude.
"is it artificial intelligence that issued the first Wuhan coronavirus warning?" The headline in the media really surprised many people. In the current era of globalization, the outbreak of epidemic diseases in any place is likely to spread to any corner of the world in a short period of time. The discovery time and the efficiency of early warning is the key to the prevention of epidemic diseases. If AI can become a better epidemic warning mechanism, it would be a way for the World Health Organization (WHO) and the health sector of countries to carry out epidemic prevention mechanisms.
This raises the question of how such institutions and organizations should adopt the epidemic forecasts AI provides. In the future, an AI epidemic forecasting platform will also have to assess the risk level of transmission, as well as the economic and political risks the disease's spread may create, to help the relevant departments make sounder decisions. All of this still takes time, and these organizations should put AI surveillance systems on the agenda when building rapid-response epidemic prevention mechanisms.
It is fair to say that AI's success in predicting the outbreak in advance is a bright spot in humanity's response to this global epidemic crisis. One hopes that this AI-assisted battle of epidemic prevention and control is only the prelude to a longer campaign, with more possibilities ahead: applying AI to identify the pathogens of major infectious diseases; building AI early-warning mechanisms from the regional and seasonal epidemic data of major infectious diseases; and using AI to help optimize the allocation of medical supplies after an outbreak. Let us wait and see.