Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to understand pagerank algorithm

2025-04-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Share

Shulou(Shulou.com)05/31 Report--

How to understand the pagerank algorithm, I believe that many inexperienced people are at a loss about it. Therefore, this paper summarizes the causes and solutions of the problem. Through this article, I hope you can solve this problem.

1. Overview of PageRank algorithm

PageRank, that is, page ranking, also known as page level, Google left ranking or Page ranking.

Link analysis algorithm is proposed by Larry Page and Sergey Brin, founders of Google, when they built the early search system prototype in 1997. since Google achieved unprecedented success in business, the algorithm has become a computing model concerned by other search engines and academia. At present, many important link analysis algorithms are derived from PageRank algorithm. PageRank is a method used by Google to identify the level / importance of a web page, and it is the only criterion used by Google to measure the quality of a website. After combining all other factors, such as the Title logo and the Keywords logo, Google adjusts the results through PageRank to improve the relevance and quality of search results by improving the ranking of pages with more "rank / importance" in search results. The level ranges from 0 to 10, with a full score of 10. The higher the PR value, the more popular (and important) the page is. For example, a website with an PR value of 1 indicates that the site is not very popular, while a PR value of 7 to 10 indicates that the site is very popular (or extremely important). General PR value reaches 4, even if it is a good website. Google sets the PR value of his website to 10, which shows that Google is very popular, and it can also be said that this site is very important.

two。 From the number of chain members to PageRank

Before PageRank was put forward, some researchers have proposed to use the number of links to analyze and calculate links. This method assumes that the more links a page has, the more important it is. Many early search engines also adopted the number of links as a link analysis method, which also has a more obvious effect on the improvement of search engine effect. PageRank not only takes into account the impact of the number of links, but also refers to the web page quality factors, the combination of the two to obtain a better evaluation standard of web page importance.

For an Internet page A, the calculation of the page PageRank is based on the following two basic assumptions:

 quantity assumption: in the Web graph model, if a page node receives more incoming links from other pages, then this page is more important.

 quality assumption: the quality of the link to page An is different, and high-quality pages will transfer more weight to other pages through links. So the higher the quality of the page points to page A, the more important page A.

Using the above two assumptions, the PageRank algorithm just gives each page the same importance score, and updates the PageRank score of each page node through iterative recursive calculation until the score is stable. The result of PageRank calculation is the evaluation of the importance of the web page, which has nothing to do with the query entered by the user, that is, the algorithm has nothing to do with the topic. Suppose there is a search engine, its similarity calculation function does not take into account the content similarity factors, completely using PageRank to sort, then what does the performance of this search engine look like? The search engine returns the same result for any different query request, that is, the page with the highest PageRank value.

3. The principle of PageRank algorithm

PageRank's calculation makes full use of two assumptions: the quantity hypothesis and the quality hypothesis. The steps are as follows:

1) in the initial stage: the web page constructs the Web diagram through the link relationship, and each page sets the same PageRank value. Through several rounds of calculation, the final PageRank value obtained by each page will be obtained. With each round of calculation, the current PageRank value of the page will be constantly updated.

2) the calculation method of updating page PageRank score in one round: in the calculation of page PageRank score in one round, each page distributes its current PageRank value evenly to the outgoing chain contained on this page, so that each link gets the corresponding weight. On the other hand, each page sums all the weights passed into the chain that point to this page, and you can get a new PageRank score. When each page gets the updated PageRank value, a round of PageRank calculations is completed.

3.2 basic ideas:

If page T has a connection to page A, it indicates that the owner of T thinks An is more important and assigns a part of T's importance score to A. The importance score is: PR (T) / L (T)

Where PR (T) is the PageRank value of T and L (T) is the number of outgoing chains of T.

Then the PageRank value of An is the accumulation of a series of page importance scores similar to T.

That is, the number of votes for a page is determined by the importance of all links to its page, and a hyperlink to a page is equivalent to a vote for that page. The PageRank of a page is obtained by recursive algorithms for the importance of all links to its page (link to the page). A page with more links will have a higher level, on the contrary, if a page does not have any links to the page, then it has no level.

3.3 PageRank simple calculation:

Suppose a collection of only four pages: a _ Magi B _ M C and D. If all pages are linked to A, then the PR (PageRank) value of A will be the sum of BMague C and D.

Continue to assume that B also has links to C, and D also has links to three pages including A. You can't vote twice on one page. So B gives half the ticket to each page. By the same logic, only 1/3 of the votes cast by D are counted on A's PageRank.

Example:

The specific calculation process of PageRank is illustrated by the example shown in figure 1.

This formula is defined by. S Brin and L. Page in "The Anatomy of a Large- scale Hypertextual Web Search Engine Computer Networks and ISDN Systems".

So the PageRank of one page is calculated by the PageRank of other pages. Google repeatedly calculates the PageRank for each page. If you give each page a random PageRank value (non-0), then after repeated calculations, the PR value of these pages will tend to be normal and stable. That's why search engines use it.

4. PageRank power method calculation (linear algebra application)

4.1 complete formula:

For this section, you can check out: the Mathematics behind Google

First of all, find a complete formula:

Arvind Arasu in "Junghoo Cho Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan." "Searching the Web" is more accurately expressed as:

Is the number of pages studied, the number of linked pages, the number of linked pages, and N is the number of all pages.

The PageRank value is an eigenvector in a special matrix. This feature vector is:

If page I has a link to page j, then

= 0.

4.2 using the power method to calculate PageRank

Then our PageRank formula can be converted to solving / N. P is the probability transition matrix, =

) {/ / if the last two results are similar or the same, return R

Return R

} else {

X = R

R = AX

}

}

4.3 solving steps:

1. The calculation process of P probability transfer matrix:

First, we establish a model of the link relationship between web pages, that is, we need an appropriate data structure to represent the connection relationship between pages.

1) first of all, we use the form of a graph to express the relationship between web pages:

Now suppose there are only four collections of web pages: a, B, C, whose abstract structure is shown in figure 1:

Figure 2 web page link matrix: figure 3 web page link probability matrix:

Fig. 4 transpose matrix of P'

Second, the calculation process of A matrix.

1) P probability transfer matrix:

/ N is:

/ N = 0.85 × P + 0.15 *

Initially, the PageRank value of each web page is 1, that is, X = (1,1,1).

Third, the process of cyclic iterative calculation of PageRank

Step one:

Continue to iterate the process.

Until the results of the last two times are similar or the same, that is, R finally converges, R is about equal to X, and the calculation stops. The final R is the PageRank value of each page.

Using the power method to calculate the PageRank value is always convergent, that is, the number of calculation is limited.

Larry Page and Sergey Brin theoretically proved that no matter how the initial value is selected, this algorithm ensures that the estimated value of web page ranking can converge to their real value.

Because the number of web pages on the Internet is huge, the two-dimensional matrix mentioned above theoretically has as many elements as the square of the number of web pages. If we assume that there are a billion web pages, then this matrix has 10 billion elements. The amount of calculation is very large when such a large matrix is multiplied. Larry Page and Sergey Brin use the technique of sparse matrix calculation to greatly simplify the amount of calculation.

5. Advantages and disadvantages of PageRank algorithm

Advantages:

It is a static algorithm independent of query, and the PageRank values of all web pages are obtained by offline calculation, which effectively reduces the amount of computation in online query and greatly reduces the query response time.

Disadvantages:

1) people's queries have topic characteristics, and PageRank ignores the topic relevance, which reduces the relevance and theme of the results.

2) the old page will be rated higher than the new page. Because even a very good new page won't have many upstream links unless it's a subsite of a site.

After reading the above, have you mastered how to understand the pagerank algorithm? If you want to learn more skills or want to know more about it, you are welcome to follow the industry information channel, thank you for reading!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Servers

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report