As we all know, the mobile Internet, social media, e-commerce, and sensors of all kinds now generate very large data sets, and mining these data can extract useful information.
Focusing on data mining and machine learning in the big-data environment, the book comprehensively introduces data-processing algorithms that have proven themselves in practice, making it essential reading for students and practitioners alike. It covers ten major topics:
◆ distributed file systems and MapReduce tools
◆ similarity search
◆ data-stream processing, and specialized algorithms for data that is lost if not processed immediately
◆ search-engine technologies, such as Google's PageRank
◆ frequent-itemset mining
◆ clustering algorithms for large-scale, high-dimensional data sets
◆ key problems in Web applications: Web-wide information management and recommendation systems
◆ social-network graph mining
◆ dimensionality reduction, such as SVD and CUR decomposition
◆ large-scale machine learning
Basic concepts of data mining
This chapter introduces the whole book. It first examines the nature of data mining and how it is understood in various related disciplines.
It then presents Bonferroni's principle, which in effect warns against overusing data mining.
The chapter also surveys several very useful ideas that, while not all strictly part of data mining, help in understanding some of its important concepts. These include the TF.IDF weight for measuring the importance of words, the properties of hash functions and index structures, and identities involving e, the base of natural logarithms. Finally, it briefly previews the topics covered in the chapters that follow.
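To make the TF.IDF idea concrete, here is a minimal Python sketch under common conventions (term frequency normalized by the most frequent word in the document, IDF as a base-2 logarithm of the number of documents over document frequency); the `tf_idf` function and the toy documents are illustrative only, not from the book.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF.IDF weights for each word in each document.

    docs: list of documents, each a list of word tokens.
    Returns one dict per document, mapping word -> TF.IDF weight.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each word.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)
        max_tf = max(tf.values())
        weights.append({
            # TF is normalized by the most frequent term in the document;
            # IDF is log2(N / df), so a word in every document gets weight 0.
            w: (count / max_tf) * math.log2(n_docs / df[w])
            for w, count in tf.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats living together".split(),
]
print(tf_idf(docs)[0])
```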
Similar item discovery
A basic data-mining problem is finding "similar" items in data. Section 3.1 introduces applications of this problem, with the concrete example of detecting near-duplicate Web pages; such pages may be plagiarized copies, or mirror pages that differ from one another only in information about their respective hosts.
We first express similarity as the problem of finding sets with a relatively large intersection, then show how the famous "shingling" technique converts textual similarity into exactly that set problem. Next we introduce minhashing, which compresses large sets in such a way that the similarity of the original sets can still be deduced from the compressed results. Other techniques that work when the required degree of similarity is very high are described in Section 3.9.
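A minimal sketch of shingling plus minhashing, assuming random linear hash functions built on Python's built-in `hash`; the function names here are illustrative, not from the book.

```python
import random

def shingles(text, k=5):
    """k-shingles: the set of all length-k substrings of the text."""
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def make_hash_funcs(n, prime=2**61 - 1, seed=42):
    """n random hash functions of the form s -> (a*hash(s) + b) mod p."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(n):
        a, b = rng.randrange(1, prime), rng.randrange(prime)
        funcs.append(lambda s, a=a, b=b: (a * hash(s) + b) % prime)
    return funcs

def minhash_signature(shingle_set, hash_funcs):
    """Signature = the minimum hash value over the set, per hash function."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

hs = make_hash_funcs(100)
sig1 = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"), hs)
sig2 = minhash_signature(shingles("the quick brown fox jumped over a lazy dog"), hs)
# The fraction of matching signature positions estimates the Jaccard
# similarity of the two shingle sets.
est = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
print(f"estimated Jaccard similarity: {est:.2f}")
```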
Another important problem in any kind of similarity search is that even when computing the similarity of a single pair is easy, the number of pairs is far too large to test them all. Locality-sensitive hashing (LSH) addresses this by focusing the search on those pairs that are likely to be similar.
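The banding technique is the standard way to apply LSH to minhash signatures. Here is a hedged sketch, assuming signatures of length bands × rows such as those produced above; `lsh_candidate_pairs` is an illustrative name.

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """Banding technique: split each signature into `bands` bands of
    `rows` rows each; items whose band agrees exactly in any band fall
    into the same bucket and become a candidate pair.

    signatures: dict mapping item id -> minhash signature (list of ints),
    with len(signature) == bands * rows.
    """
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for item, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(item)
        for items in buckets.values():
            for i in range(len(items)):
                for j in range(i + 1, len(items)):
                    candidates.add(tuple(sorted((items[i], items[j]))))
    return candidates
```

Only the candidate pairs are then compared exactly, which is what makes all-pairs similarity search feasible at scale.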
Finally, instead of limiting similarity to the intersection of sets, we consider distance measures in arbitrary spaces. This in turn motivates a general framework for LSH that applies to other definitions of similarity.
Data stream mining
Most of the algorithms in this book assume we are mining a database; that is, all the data is available whenever we need it. In this chapter we make a different assumption: the data arrives as one or more streams, and if it is not processed or stored promptly, it is lost forever. Moreover, we assume the data arrives so rapidly that it is infeasible to store it all in active storage (that is, in a conventional database) and interact with it at a time of our choosing.
Every stream-processing algorithm summarizes the stream in some way. We first consider how to draw a useful sample from a stream and how to filter out most of the "undesirable" elements. We then show how to estimate the number of distinct elements in a stream using far less storage than would be needed to list every element seen.
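One classic way to estimate the number of distinct elements is the Flajolet-Martin approach. Below is a rough Python sketch assuming salted versions of Python's built-in `hash`; for brevity it averages the per-hash estimates, whereas in practice one takes the median of group averages to control variance.

```python
import random

def flajolet_martin(stream, num_hashes=64, seed=0):
    """Estimate the number of distinct elements in a stream.

    For each hash function, track R = the maximum number of trailing
    zeros seen in any hash value; 2**R estimates the distinct count.
    """
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    max_zeros = [0] * num_hashes
    for x in stream:
        for i, salt in enumerate(salts):
            h = hash((salt, x))
            # Number of trailing zero bits in h (h & -h isolates the
            # lowest set bit; this also works for negative h in Python).
            tz = (h & -h).bit_length() - 1 if h else 64
            max_zeros[i] = max(max_zeros[i], tz)
    return sum(2 ** r for r in max_zeros) / num_hashes

stream = [random.randrange(10_000) for _ in range(100_000)]
print(flajolet_martin(stream))  # roughly the number of distinct values
```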
Another way to summarize a stream is to look only at a fixed-length "window" consisting of the most recent n elements, for some given (and usually large) n. The window can then be queried as if it were a relation in a database.
If there are many streams and/or n is large, we may not be able to store each stream's entire window, so we must summarize even the windows themselves. For a window over a stream of bits, a fundamental problem is approximating the number of 1s it contains.
We use a method that requires far less space than storing the whole window, and that also extends to approximating various kinds of sums.
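The method alluded to here is commonly known as the DGIM algorithm. The following is a simplified sketch under its standard formulation (power-of-two bucket sizes, at most two buckets of each size, the oldest bucket half-counted); class and method names are illustrative.

```python
import random

class DGIM:
    """Approximately count the 1s among the last `window` bits of a
    bit stream, using O(log^2 N) space rather than N bits.

    Buckets are (timestamp-of-most-recent-1, size) pairs, newest first.
    Half-counting the oldest bucket bounds the relative error at 50%.
    """
    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []  # newest first

    def add(self, bit):
        self.time += 1
        # Expire the oldest bucket once its timestamp leaves the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.insert(0, (self.time, 1))
        size = 1
        while True:
            same = [j for j, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= 2:
                break
            # Merge the two oldest buckets of this size into one of
            # double the size, keeping the newer timestamp; may cascade.
            j1, j2 = same[-2], same[-1]
            self.buckets[j1] = (self.buckets[j1][0], size * 2)
            del self.buckets[j2]
            size *= 2

    def count(self):
        """Sum all bucket sizes, counting the oldest bucket only half."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2

d = DGIM(window=1000)
bits = [random.random() < 0.3 for _ in range(10_000)]
for b in bits:
    d.add(int(b))
print(d.count(), "vs exact", sum(bits[-1000:]))
```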
Frequent itemsets
This chapter focuses on frequent-itemset discovery, one of the principal techniques for characterizing data. The problem is often framed as "association rule" discovery, although association rules are a more complex form of data description built on top of frequent itemsets.
We first introduce the "market-basket" model of data, which is essentially a many-to-many relationship between "items" and "baskets", with some assumptions about the shape of the data. The frequent-itemsets problem is to find sets of items that appear together in (are related to) many of the same baskets.
Frequent-itemset discovery differs from the similarity search discussed in Chapter 3. Here we care about the absolute number of baskets containing a particular itemset, whereas similarity search looks for itemsets whose baskets largely overlap, even if the absolute number of those baskets is small.
This difference gives rise to a new class of algorithms for finding frequent itemsets. We begin with the A-Priori algorithm, whose basic idea is that if any subset of a set is not frequent, then the set itself cannot be frequent. The algorithm exploits this to eliminate most large sets from consideration by first examining small sets. We then cover various improvements to the basic A-Priori algorithm aimed at very large data sets that strain available main memory.
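A bare-bones sketch of the A-Priori idea in Python (in-memory, no MapReduce, illustrative names), showing how the frequent (k-1)-itemsets prune the candidate k-itemsets:

```python
from collections import Counter
from itertools import combinations

def apriori(baskets, support):
    """Find all itemsets appearing in at least `support` baskets.
    Candidate k-itemsets are built only from frequent (k-1)-itemsets,
    per the monotonicity principle.
    """
    # Pass 1: frequent single items.
    counts = Counter(item for basket in baskets for item in set(basket))
    frequent = {frozenset([i]) for i, c in counts.items() if c >= support}
    result, k = set(frequent), 2
    while frequent:
        items = {i for s in frequent for i in s}
        candidates = {
            frozenset(c) for c in combinations(sorted(items), k)
            # Monotonicity: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        # Pass k: count only the surviving candidates.
        counts = Counter()
        for basket in map(set, baskets):
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1
        frequent = {c for c, n in counts.items() if n >= support}
        result |= frequent
        k += 1
    return result

baskets = [["milk", "bread"], ["milk", "bread", "beer"],
           ["bread", "beer"], ["milk", "beer"]]
print(apriori(baskets, support=2))
```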
Next we consider faster approximate algorithms that are not guaranteed to find every frequent itemset. Some of these also exploit parallelism, including parallelization via the MapReduce framework.
Finally, we briefly discuss finding frequent itemsets in data streams.
Recommendation systems
There is an extensive class of Web applications that involve predicting users' preferences among options; these are called recommendation systems. This chapter begins with examples of the most important applications of such systems.
To make the problem concrete, here are two good examples of recommendation systems:
(1) Offering news articles to readers of an online newspaper, based on predictions of their interests.
(2) Suggesting to the customers of an online retailer what they might like to buy, based on their past purchases and/or product searches.
Recommendation systems use a range of different technologies, which can be divided into two broad categories:
Content-based systems examine the properties of the items recommended. For example, if a Netflix user has watched many cowboy movies, the system will recommend other movies classified in the database as "cowboy" movies.
Collaborative-filtering systems recommend items by computing similarities between users and/or items: items similar to those a user has liked are recommended to that user. Such systems can draw on the similarity-search fundamentals of Chapter 3 and the clustering techniques of Chapter 7, but those techniques alone are not enough, and several new algorithms have proven very effective for recommendation.
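As a toy illustration of collaborative filtering, here is a user-user sketch using cosine similarity on sparse rating vectors; the `recommend` function and the sample ratings are hypothetical, not from the book.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(ratings, user, top_n=3):
    """Score each item the user has not rated by the similarity-weighted
    ratings of the other users, then return the top-scoring items."""
    sims = {other: cosine(ratings[user], ratings[other])
            for other in ratings if other != user}
    scores = {}
    for other, sim in sims.items():
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores.setdefault(item, [0.0, 0.0])
                scores[item][0] += sim * r       # weighted rating sum
                scores[item][1] += abs(sim)      # normalizer
    ranked = {i: num / den for i, (num, den) in scores.items() if den}
    return sorted(ranked, key=ranked.get, reverse=True)[:top_n]

ratings = {
    "alice": {"western1": 5, "western2": 4, "comedy1": 1},
    "bob":   {"western1": 4, "western2": 5, "western3": 5},
    "carol": {"comedy1": 5, "comedy2": 4},
}
print(recommend(ratings, "alice"))  # likely ["western3", ...]
```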
Large-scale machine learning
Many algorithms today go by the name "machine learning". Like the other algorithms in this book, their purpose is to extract information from data: all data-analysis algorithms produce summaries of the data, on which decisions can then be based.
For instance, the frequent-itemset analysis of Chapter 6 produces information such as association rules, which can be used to plan sales strategies or serve other goals.
Algorithms called "machine learning", however, do more than summarize the data: they learn a model or classifier from it, and so can capture information that will apply to data seen in the future. For example, a clustering algorithm from Chapter 7 produces clusters that not only describe the analyzed data (the training set) but also let us assign future data to one of those clusters. Machine-learning practitioners accordingly call clustering "unsupervised learning"; "unsupervised" means the input data does not tell the algorithm what its output clusters should be. In supervised machine learning, the subject of this chapter, the given data includes information that correctly classifies at least part of it; data that has already been classified is called the training set.
This chapter does not attempt a comprehensive survey of machine-learning methods; it concentrates on those suited to very large data and amenable to parallelization. We introduce the classic "perceptron" approach to learning a classifier, which finds a hyperplane separating two classes of data. We then look at more modern techniques, including support-vector machines; like perceptrons, these seek the best separating hyperplane, one that as few (if any) training examples lie close to. Finally, we discuss nearest-neighbor techniques, in which data is classified according to the classes of its nearest neighbors in some space.
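A minimal sketch of the classic perceptron training rule just described, assuming ±1 labels and linearly separable data; the names are illustrative.

```python
def perceptron(points, labels, epochs=100, eta=0.1):
    """Learn weights w and bias b so that sign(w . x + b) matches the
    +1/-1 label of each training point. Converges if the two classes
    are linearly separable.
    """
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified (or on the plane)
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                errors += 1
        if errors == 0:                      # separating hyperplane found
            break
    return w, b

# Two linearly separable clusters in the plane.
pts = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
lbl = [-1, -1, -1, 1, 1, 1]
print(perceptron(pts, lbl))
```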
Space does not permit a fuller introduction here. Most readers already have some understanding of and opinions about data mining and distributed processing, but there are still conceptual gaps around truly large-scale data; I hope you will read [Big Data: Internet-Scale Data Mining and Distributed Processing] carefully to grasp its full meaning.