2025-01-28 Update From: SLTechnology News&Howtos
Data Mining: Recommendation Systems
Big data can be regarded as the aggregation of large amounts of data; data mining is the process of discovering the value hidden in that data. For example, given the meteorological data of the past 10 years, data mining can predict tomorrow's weather with a good chance of being right.
Machine learning is the core of artificial intelligence. Mining big data by hand is clearly impossible, so we rely on machines instead of people to derive an effective model, through which the value of big data can be realized.
The content of this chapter:
1) Concepts of data mining and machine learning
2) An application direction of machine learning: recommendation systems
3) Recommendation algorithm: content-based recommendation
4) Recommendation algorithm: collaborative filtering-based recommendation
5) Implementation of a collaborative filtering algorithm on MapReduce
1. Data Mining and Machine Learning Concepts
Machine learning and data mining techniques have begun to play a role in many fields of computer science, such as multimedia, computer graphics, computer networks, and even operating systems and software engineering. In computer vision and natural language processing in particular, machine learning and data mining have become so dominant that a considerable share of the papers at the top conferences in these fields involve them. Generally speaking, the introduction of machine learning and data mining techniques is an important trend in many branches of computer science.
For data mining, databases provide the data management technology, while machine learning provides the data analysis technology. Big data is usually managed on the HDFS cloud storage platform; the Hadoop ecosystem is now mature, and its tools and interfaces basically meet most data management needs. Faced with such a huge data resource, we need methods to extract its value, and machine learning provides a family of methods for analyzing and mining the data.
Mahout, a machine learning open-source library in the Hadoop ecosystem, provides scalable implementations of classic machine learning algorithms to help developers create intelligent applications more easily and quickly. Mahout includes implementations of clustering, classification, recommendation (collaborative filtering), and frequent itemset mining.
2. An Application Direction of Machine Learning: the Recommendation Field
Recommendation algorithms are among the best-known machine learning models. Recommendation is a core component behind many websites and sometimes an important source of revenue.
Generally speaking, a recommendation system tries to model the relationship between users and some class of items. For example, we can use a recommendation system to tell users which movies they might like; done well, this attracts more users to keep using our service, which benefits both sides. Similarly, if we can tell users exactly which movies are similar to a particular movie, it becomes easier for them to find more interesting content on the site, improving the user experience, engagement, and the appeal of the site's content. For large websites, much of the content comes from independent third parties: content providers. For example, the goods on Taobao come mostly from individual stores, many of the movies on iQiyi come from professional media groups and studios, and the well-produced advertisements on WeChat come from advertisers across all industries.
Establishing a good recommendation ecosystem benefits users, the website platform, and content providers alike. Users get what they want, the platform gets more traffic and revenue, and content providers sell their goods more efficiently. It is a win-win-win scenario, so a good recommendation system brings great value.
3. Recommendation Algorithm: Content-Based Recommendation
Content-based recommendation (CB) is probably the earliest recommendation method. Based on the items a user liked in the past (hereafter collectively referred to as item), it recommends items similar to those. For example, a restaurant recommender can suggest a new barbecue restaurant to a user based on barbecue restaurants he liked before. CB was first used mainly in information retrieval systems, so many methods from information retrieval and information filtering carry over to CB.
The process of CB generally consists of the following three steps:
1) Item Representation: extract features for each item (i.e., the item's content) to represent it.
2) Profile Learning: use the feature data of the items a user liked (and disliked) in the past to learn the user's preference profile.
3) Recommendation Generation: compare the user profile from the previous step with the features of the candidate items, and recommend to the user a set of the most relevant items.
Let's walk through the three steps with an example. For personalized reading, an item is an article. Following the first step, we extract representative attributes from the article's content. A common method is to represent the article by the words that appear in it, with each word's weight computed by tf-idf as in information retrieval. For example, for this article, the words "CB", "recommendation", and "preference" would get larger weights, while "barbecue" would get a lower weight. In this way, an abstract article can be represented by a concrete vector. The second step is to generate a profile describing the user's preferences from the articles he liked in the past. The simplest method is to take the average vector of all the user's liked articles as his profile. For example, if a user often reads articles about recommendation systems, the weights of "CB", "CF", and "recommendation" in his profile will be higher. Once a user's profile is obtained, CB computes each candidate item's relevance to the profile, commonly with cosine similarity, and finally returns the N candidate items most relevant to this user (with the largest cosine values) as recommendations.
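The three-step pipeline just described can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the toy tokenized "articles" are made-up data for demonstration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each tokenized document as an L2-normalized tf-idf vector (step 1)."""
    n_docs = len(docs)
    df = Counter()                 # in how many documents each word appears
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

def user_profile(liked_vectors):
    """Average the vectors of the user's liked articles (step 2)."""
    profile = Counter()
    for vec in liked_vectors:
        for t, w in vec.items():
            profile[t] += w / len(liked_vectors)
    return dict(profile)

def relevance(profile, item_vec):
    """Dot-product relevance of an item to the profile (step 3; vectors are normalized)."""
    return sum(w * item_vec.get(t, 0.0) for t, w in profile.items())

# Hypothetical toy corpus of three tokenized articles.
docs = [["cb", "recommend", "cf"], ["cb", "recommend"], ["barbecue", "food"]]
vecs = tfidf_vectors(docs)
profile = user_profile([vecs[0], vecs[1]])   # the user liked the first two articles
scores = [relevance(profile, v) for v in vecs]
```

With this data, the two recommendation-related articles score above the barbecue one, so the top-N cut would recommend them first.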
Next, let's describe the three steps above in detail.
1) Item Representation:
In real applications an item often has attributes that describe it. These attributes can usually be divided into two classes: structured and unstructured. A structured attribute has a fairly clear meaning and takes values in a limited range; an unstructured attribute often has no fixed meaning or value range and is hard to use directly. For example, on a dating site an item is a person, with structured attributes such as height, education, and hometown, as well as unstructured attributes (such as the person's dating manifesto or blog posts). Structured data can be used directly, but unstructured data (such as articles) usually has to be converted into a structured form before it can be used in a model. The most common unstructured data in practice is probably articles (as in personalized reading), so let's look in detail at how to structure an unstructured article.
How to represent an article has been studied in information retrieval for many years. The representation technique below also comes from information retrieval; it is called the vector space model (VSM).
Denote the collection of all the articles we want to represent as

D = {d1, d2, ..., dN}

and the collection of words that appear in those articles (for Chinese articles, the articles must be segmented into words first), also called the dictionary, as

T = {t1, t2, ..., tn}

In other words, we have N articles to deal with, and these articles contain n distinct words. Each article will eventually be represented by a vector; for example, the j-th article is represented as

dj = (w1j, w2j, ..., wnj)

where wkj is the weight of the k-th word in article j; the higher the value, the more important the word is to that article. So, to represent the j-th article, the key is how to compute each component wkj. For example, we can set wkj = 1 if the k-th word appears in article j and 0 otherwise, or set wkj to the number of times the k-th word appears in article j (term frequency). However, the most commonly used weighting is the term frequency-inverse document frequency (tf-idf) scheme from information retrieval. In article j, the tf-idf of the k-th word in the dictionary is

tf-idf(tk, dj) = TF(tk, dj) × log(N / nk)

where TF(tk, dj) is the number of times the k-th word appears in article j, and nk is the number of articles (out of all N) that contain the k-th word.

Finally, the weight of the k-th word in article j is obtained by normalizing the tf-idf values:

wkj = tf-idf(tk, dj) / sqrt( Σ s=1..n tf-idf(ts, dj)² )
The advantage of normalization is that the representation vectors of different articles are brought to the same scale, which is convenient for the operations in the following steps.
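As a quick numeric check of the two formulas above (the counts here, such as a word appearing 3 times in article j and occurring in 10 of N = 100 articles, are made-up values for illustration):

```python
import math

def tfidf(tf_kj, n_docs, df_k):
    """Raw tf-idf of word k in article j: term frequency times log(N / n_k)."""
    return tf_kj * math.log(n_docs / df_k)

def l2_normalize(weights):
    """Divide each raw weight by the Euclidean norm of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

# Assumed counts for three words in a 100-article corpus.
raw = [tfidf(3, 100, 10),   # frequent in the article, rare in the corpus -> large
       tfidf(1, 100, 50),
       tfidf(2, 100, 80)]   # common in the corpus -> small despite tf = 2
weights = l2_normalize(raw)
```

After normalization the vector has unit length, and the rare-but-frequent word dominates, which is exactly the behavior tf-idf is designed to produce.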
2) Profile Learning
Suppose user u has expressed preferences on some items, liking some of them and disliking the others. The goal of this step is to generate a preference model for user u from those past preferences; with this model, we can judge whether u will like a new item. This is a typical supervised classification problem, and in principle any classification algorithm from machine learning can be applied here.
Let's briefly introduce a learning algorithm commonly used in CB: k-nearest neighbors (kNN).
For a new item, the nearest-neighbor method first finds the k items that user u has already judged that are most similar to the new item, then predicts u's preference for the new item from his preferences for those k items. This approach is very similar to item-based kNN in collaborative filtering (CF), except that here the item similarity is computed from the items' attribute vectors, while in CF it is computed from all users' ratings of the items.
For this method, the key question is how to compute pairwise item similarity from the items' attribute vectors. It is suggested in [2] that Euclidean distance be used for structured data, and that cosine similarity can be used when items are represented with the vector space model (VSM).
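A minimal sketch of this kNN prediction, assuming items are sparse dict vectors and the user's past judgments are (vector, rating) pairs; the function names and data are illustrative, not from the original text.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(new_item, judged, k):
    """Predict u's preference for new_item from the k most similar judged items."""
    top = sorted(judged, key=lambda p: cos_sim(new_item, p[0]), reverse=True)[:k]
    sims = [cos_sim(new_item, vec) for vec, _ in top]
    total = sum(sims)
    if total == 0:
        return 0.0
    # Similarity-weighted average of the neighbors' ratings.
    return sum(s * r for s, (_, r) in zip(sims, top)) / total
```

For instance, with `judged = [({"a": 1.0, "b": 1.0}, 5.0), ({"c": 1.0}, 1.0)]`, a new item `{"a": 1.0}` borrows its prediction almost entirely from the first, similar item.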
3) Recommendation Generation
Through the previous step we obtain each candidate item's relevance to the user, and we can directly return the top n most relevant items as recommendations to the user.
4. Recommendation Algorithm: Collaborative Filtering-Based Recommendation
As the saying goes, "birds of a feather flock together." Continuing with the movie example: if you like movies such as Batman, Mission: Impossible, Star Trek, and Source Code, and someone else also likes these movies and additionally likes Iron Man, chances are that you will like Iron Man too.
Therefore, when user A needs personalized recommendations, we can first find the group G of users with similar interests, then recommend to A the items that users in G like and that A has not yet seen. This is the user-based collaborative filtering algorithm.
According to the above basic principles, we can split the user-based collaborative filtering recommendation algorithm into two steps:
1) find users with similar interests
The similarity between two users is usually computed with the Jaccard formula or cosine similarity. Let N(u) be the set of items that user u likes and N(v) the set of items that user v likes. The similarity between u and v is then:

Jaccard formula:

w_uv = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|

Cosine similarity:

w_uv = |N(u) ∩ N(v)| / sqrt(|N(u)| × |N(v)|)
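Both formulas translate directly to Python sets. As a sketch with illustrative item sets: for N(u) = {a, b, d} and N(v) = {a, c}, the Jaccard similarity is 1/4 and the cosine similarity is 1/√6.

```python
import math

def jaccard(n_u, n_v):
    """|N(u) ∩ N(v)| / |N(u) ∪ N(v)|"""
    return len(n_u & n_v) / len(n_u | n_v)

def cosine_sim(n_u, n_v):
    """|N(u) ∩ N(v)| / sqrt(|N(u)| * |N(v)|)"""
    return len(n_u & n_v) / math.sqrt(len(n_u) * len(n_v))
```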
Suppose there are four users, A, B, C, and D, and five items, a, b, c, d, and e. The relationships between users and items (which user likes which item) are shown in the following figure:
How do we compute the similarity between all pairs of users at once? To make the computation easier, we usually first build an item-user inverted table, as shown in the following figure:
Then, for each item, take every pair of users who both like it and add 1 to the corresponding cell of a user-user matrix. For example, if the users who like item a are A and B, add 1 to cell (A, B) of the matrix, as shown in the following figure:
This matrix only gives the numerator of the similarity formula. To finish computing the user similarities, taking cosine similarity as an example, the matrix above is further divided by the denominator:
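The inverted-table trick can be sketched as follows. The `likes` dictionary is assumed toy data standing in for the figure's user-item relationships, not necessarily its exact contents.

```python
import math
from collections import defaultdict

# Assumed toy preferences: which items each user likes.
likes = {"A": {"a", "b", "d"}, "B": {"a", "c"}, "C": {"b", "e"}, "D": {"c", "d", "e"}}

# Step 1: item-user inverted table.
item_users = defaultdict(set)
for user, items in likes.items():
    for item in items:
        item_users[item].add(user)

# Step 2: co-occurrence counts C[u][v] = |N(u) ∩ N(v)| (the numerator).
C = defaultdict(lambda: defaultdict(int))
for users in item_users.values():
    for u in users:
        for v in users:
            if u != v:
                C[u][v] += 1

# Step 3: divide by sqrt(|N(u)| * |N(v)|) to finish the cosine similarity.
W = {u: {v: c / math.sqrt(len(likes[u]) * len(likes[v]))
         for v, c in row.items()}
     for u, row in C.items()}
```

The inverted table avoids scanning every user pair against every item: only users who share at least one item ever contribute a co-occurrence count.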
At this point the user similarity computation is done, and we can directly read off which users are similar in interest to the target user.
2) recommended items
First, find in the similarity matrix the K users most similar to the target user u; call this set S(u, K). Collect all the items liked by the users in S(u, K) and remove the items u already likes. For each remaining candidate item i, the degree to which user u is interested in it is computed by:

p(u, i) = Σ_{v ∈ S(u,K) ∩ N(i)} w_uv × r_vi

Here N(i) is the set of users who like item i, w_uv is the similarity between users u and v, and r_vi indicates that user v likes item i, which in this case is simply 1. In recommendation systems that use explicit ratings, v's rating of i is substituted for r_vi.
For example, suppose we want to recommend items to A and choose K = 3. The similar users are B, C, and D, and the items they like that A does not yet like are c and e. We then compute p(A, c) and p(A, e):
It turns out that user A is likely to like c and e equally. In a real recommendation system, we would simply sort the candidates by score and take the top few.
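The whole user-based CF pipeline can be sketched end to end. The `likes` dictionary is assumed toy data (illustrative, not necessarily the figures' exact contents), and r_vi is fixed to 1 as in the text.

```python
import math

# Assumed toy preferences.
likes = {"A": {"a", "b", "d"}, "B": {"a", "c"}, "C": {"b", "e"}, "D": {"c", "d", "e"}}

def sim(u, v):
    """Cosine similarity between users based on their liked-item sets."""
    return len(likes[u] & likes[v]) / math.sqrt(len(likes[u]) * len(likes[v]))

def recommend(u, k=3):
    """Score items liked by u's k most similar users that u has not liked yet."""
    neighbours = sorted((v for v in likes if v != u),
                        key=lambda v: sim(u, v), reverse=True)[:k]
    scores = {}
    for v in neighbours:
        for item in likes[v] - likes[u]:
            scores[item] = scores.get(item, 0.0) + sim(u, v) * 1.0  # r_vi = 1
    return scores

scores = recommend("A")  # candidates drawn from B, C, D
```

With this particular data, the only candidates are c and e and their scores tie, matching the observation above.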