2025-04-06 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
How does the PageRank algorithm rank web pages? Many newcomers are unsure, so this article summarizes the principle of the algorithm, the problems it runs into, and how they are solved.
1. Principle of the PageRank algorithm
The core idea of PageRank is this: on the Internet, if a web page is linked to by many other pages, the page is considered important, so it ranks highly.
Larry Page viewed the whole Internet as one large graph in which each page is a node and each link between pages is a directed edge. The Internet can therefore be described by a graph or, equivalently, by a matrix.
Larry Page was also elected a member of the National Academy of Engineering at the age of 30 in recognition of this algorithm.
Suppose there are four web pages, A, B, C, and D, linked as follows: A links to B, C, and D; B links to A and D; C links to A; D links to B and C.
We distinguish two kinds of links:
Outbound links (out-links): links that lead from the page to other pages.
Inbound links (in-links): links that lead from other pages to the page.
For example, page C has two inbound links (from A and D) and one outbound link (to A).
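The in-link and out-link counts can be checked mechanically. The sketch below assumes the edge list that the article later passes to NetworkX in section 5; the variable names `in_links` and `out_links` are illustrative choices, not part of the original.

```python
from collections import Counter

# (source, target) means "source links to target"; same edges as in section 5.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "A"),
         ("B", "D"), ("C", "A"), ("D", "B"), ("D", "C")]

out_links = Counter(src for src, _ in edges)  # links leaving each page
in_links = Counter(dst for _, dst in edges)   # links arriving at each page

for page in sorted(out_links | in_links):
    print(page, "in:", in_links[page], "out:", out_links[page])
```

For page C this reports two inbound links and one outbound link, matching the example in the text.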
The idea of PageRank is that the influence of a web page equals the sum of the weighted influence of all the pages that link to it.
Expressed as a formula:
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$
Where (the score represents the page's influence):
PR(u) is the score of page u.
B_u is the set of pages that link to page u (its inbound links).
v is any page in B_u.
PR(v) is the score of page v.
L(v) is the number of outbound links of page v.
The score that page v contributes to page u is PR(v) / L(v).
PR(u) is then the sum of the contributions from all inbound links.
The formula assumes that a user on page v is equally likely to follow any of its outbound links.
For example, page A has three outbound links, to B, C, and D. When a user visits A, they may jump to B, C, or D, each with probability 1/3.
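As a minimal sketch of the formula, the snippet below computes one score from the scores of the linking pages. The function name `simple_pr_update` and the `links` dictionary are assumptions made for illustration; the link structure is the four-page example, and every page starts with score 1/4.

```python
from fractions import Fraction

def simple_pr_update(u, scores, links):
    """Return PR(u) = sum of PR(v) / L(v) over every page v that links to u."""
    total = Fraction(0)
    for v, targets in links.items():
        if u in targets:
            # v passes an equal share of its score to each of its out-links
            total += scores[v] / len(targets)
    return total

# Out-links of each page in the four-page example.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
scores = {page: Fraction(1, 4) for page in links}

print(simple_pr_update("A", scores, links))  # 3/8
```

Page A receives 1/2 of B's score and all of C's score, so 1/8 + 1/4 = 3/8.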
2. Calculate the score of the web page
Let's look at how to calculate the score of a web page.
We can use a table to show the link relationships between the four pages and the probability of jumping from each page to every other page:
Columns are the source page and rows are the destination page; for example, the entry in row B, column A is the probability of the jump A -> B.
        A      B      C      D
A       0      1/2    1      0
B       1/3    0      0      1/2
C       1/3    0      0      1/2
D       1/3    1/2    0      0
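The numbers in this table give the transition matrix M (rows and columns in the order A, B, C, D; columns are source pages, rows are destination pages):
$$M = \begin{bmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{bmatrix}$$
Suppose the initial influence of the four pages A, B, C, and D is the same, namely 1/4:
$$W_0 = \begin{bmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{bmatrix}$$
After the first score transfer we get W1:
$$W_1 = M W_0 = \begin{bmatrix} 3/8 \\ 5/24 \\ 5/24 \\ 5/24 \end{bmatrix}$$
In the same way, we get W2, W3, and so on up to Wn: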
W2 = M * W1
W3 = M * W2
Wn = M * Wn-1
So when does the calculation stop?
Page and Brin showed that, whatever initial values are chosen (we assumed 1/4), the page scores eventually converge to stable values.
That is, we iterate until Wn no longer changes.
This is the whole process of calculating the score of a web page, and it is easy to follow.
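The iteration Wn = M * Wn-1 can be sketched in plain Python as follows. The function name `power_iterate`, the tolerance `tol`, and the cap `max_iter` are illustrative assumptions; the matrix is the transition matrix of the four-page example, with rows and columns in the order A, B, C, D.

```python
def power_iterate(M, w, tol=1e-10, max_iter=1000):
    """Repeat w <- M * w until no entry changes by more than tol."""
    for step in range(1, max_iter + 1):
        new_w = [sum(m_ij * w_j for m_ij, w_j in zip(row, w)) for row in M]
        if max(abs(a - b) for a, b in zip(new_w, w)) < tol:
            return new_w, step
        w = new_w
    return w, max_iter

# Column j holds the jump probabilities out of page j (order A, B, C, D).
M = [
    [0,   1/2, 1, 0],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 0, 0],
]
w0 = [1/4] * 4  # every page starts with the same score

scores, steps = power_iterate(M, w0)
print([round(s, 4) for s in scores], "after", steps, "iterations")
```

For this graph the scores settle near 1/3 for A and 2/9 for each of B, C, and D. Stopping on a small tolerance rather than exact equality is a practical choice, since floating-point values rarely stop changing completely.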
3. Two problems of PageRank
What we have described so far is the basic principle of PageRank in a simplified form. In practice, two problems arise: rank leak (Rank Leak) and rank sink (Rank Sink).
If a page has no outbound links, it absorbs the scores of other pages without passing any on, which eventually drives the scores of the other pages to 0. This phenomenon is called rank leak; an example is a page C that other pages link to but that links to nothing.
Conversely, if a page has only outbound links and no inbound links, its own score eventually becomes 0. This phenomenon is called rank sink; an example is a page C that links to other pages but that no page links to.
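The first case (a page with no outbound links) can be observed with a small experiment. The three-page graph below is hypothetical, chosen only because page C has no out-links; it is not the graph from the original figure. The helper `simple_step` is likewise an illustrative name.

```python
def simple_step(scores, links):
    """One update of the undamped formula: each page splits its score among its out-links."""
    new_scores = {page: 0.0 for page in links}
    for v, targets in links.items():
        for u in targets:
            new_scores[u] += scores[v] / len(targets)
    # A page with no out-links passes nothing on, so its score is lost here.
    return new_scores

# Hypothetical graph: A and B link to each other and to C; C links to nothing.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": []}
scores = {page: 1 / 3 for page in links}

for _ in range(50):
    scores = simple_step(scores, links)

print(sum(scores.values()))  # total score has drained to (nearly) zero
```

Each step hands half of the score held by A and B to C, and C never returns it, so the total score in the system shrinks toward 0.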
4. The PageRank random browsing model
To solve these problems, Larry Page proposed a random browsing model: users do not always reach pages by following links, but may also get there in other ways, such as typing a URL directly.
This leads to the damping factor d, which is the probability that the user continues by following a link, while 1 - d is the probability that the user reaches a page in some other way.
The formula above is therefore improved to:
$$PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$
Where:
d is the damping factor, usually 0.85.
N is the total number of web pages.
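A minimal sketch of the damped formula, iterated until the scores stop changing, might look like this. The function name `damped_pagerank` and its parameters `tol` and `max_iter` are assumptions for illustration, and the sketch assumes every page has at least one outbound link (pages without out-links would need extra handling that is not shown).

```python
def damped_pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(u) = (1 - d) / N + d * sum(PR(v) / L(v)) until scores settle."""
    pages = list(links)
    n = len(pages)
    scores = {p: 1 / n for p in pages}
    for _ in range(max_iter):
        # Every page receives the (1 - d) / N share from "random" visits.
        new_scores = {p: (1 - d) / n for p in pages}
        for v, targets in links.items():
            for u in targets:
                new_scores[u] += d * scores[v] / len(targets)
        if max(abs(new_scores[p] - scores[p]) for p in pages) < tol:
            return new_scores
        scores = new_scores
    return scores

# Out-links of each page in the four-page example.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
result = damped_pagerank(links)
print({page: round(score, 4) for page, score in result.items()})
```

With d = 1 the update reduces to the simplified formula from section 1; with d = 0.85 page A still ranks highest, but the gap to the other pages narrows slightly.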
5. Calculate the score of the web page with code
How do we calculate the PR score of a web page in code? We again use the four-page example from section 1.
The example is a directed graph in the data-structure sense, so we can run PageRank by first building a directed graph.
NetworkX is a Python toolkit that bundles common graph structures and network analysis algorithms.
We can use NetworkX to build the network structure of the example.
First, import the module:
import networkx as nx
Create a directed graph with the DiGraph class:
G = nx.DiGraph()
The link relationships of the four web pages are expressed as a list of (source, target) pairs:
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "A"), ("B", "D"), ("C", "A"), ("D", "B"), ("D", "C")]
Each element of the list is added to the graph as a directed edge:
for edge in edges:
    G.add_edge(edge[0], edge[1])
Use the pagerank function to calculate the PR scores:
# alpha is the damping factor; alpha=1 disables damping, which matches the simplified model
PRs = nx.pagerank(G, alpha=1)
print(PRs)
Output the PR value of each web page:
{'A': 0.333396911621094, 'B': 0.22222201029459634, 'C': 0.22222201029459634, 'D': 0.22222201029459634}
This gives us the PR value of each web page.
6. Draw the network diagram
The NetworkX package also provides a way to draw the network diagram:
import matplotlib.pyplot as plt
# draw the network diagram
nx.draw_networkx(G)
plt.show()
This opens a plot showing the four nodes and the directed edges between them.
We can also set the shape of the graph, the size of the nodes, the length of the edges, and other properties.
The PageRank algorithm offers an important insight: in many situations, weight is a key indicator.
For example, in personal relationships, your influence depends not only on how many friends you have but also on who those friends are, which shows how much your circle matters.
Likewise, in the age of self-media, the number of followers does not by itself represent your influence; the quality of those followers matters too. If many of your followers are "big Vs" (verified, high-profile accounts), your influence increases considerably.
After reading the above, you should have a grasp of how the PageRank algorithm ranks web pages. Thank you for reading!
© 2024 shulou.com SLNews company. All rights reserved.