2025-02-24 Update From: SLTechnology News&Howtos (shulou)
Shulou (Shulou.com) 05/31 Report --
This article walks through an example analysis of the HanLP keyword extraction algorithm. The content is meant to be easy to follow; hopefully it resolves any doubts you have about how the algorithm works.
Analysis of the HanLP keyword extraction algorithm
1. The paper
In this paper, we introduce the TextRank graph-based ranking model for graphs extracted from natural language texts
TextRank is an unsupervised algorithm. It constructs a graph from the text, treating the units of interest (for example, the words produced by segmentation) as vertices, and then runs the TextRank ranking procedure over that graph to extract information from the text.
Such keywords may constitute useful entries for building an automatic index for a document collection, can be used to classify a text, or may serve as a concise summary for a given document.
In other words, the extracted keywords can be used for text classification or to summarize the central idea of the text.
TextRank extracts keywords by iterating: in each iteration the algorithm updates the score of every vertex in the graph, and it stops once a termination condition is met (for example, the iteration count reaches 200, or the change in scores falls below a threshold).
For loosely connected graphs, with the number of edges proportional with the number of vertices, undirected graphs tend to have more gradual convergence curves.
That is, for sparse graphs, where the number of edges grows linearly with the number of vertices, keyword extraction has a more gradual (slower) convergence curve.
It may be therefore useful to indicate and incorporate into the model the "strength" of the connection between two vertices \(V_i\) and \(V_j\) as a weight \(w_{ij}\) added to the corresponding edge that connects the two vertices.
Sometimes the relationships between vertices in a graph are not all equal: some vertices are more closely related than others. Edge weights can measure the strength of the correlation between vertices, and this gives the weighted graph model.
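For reference, the two score formulas from the TextRank paper make this concrete: the unweighted model scores a vertex by the scores of the vertices pointing to it, and the weighted variant scales each vote by its edge weight (reconstructed here from the paper's notation; \(d\) is the damping factor):

\[S(V_i) = (1 - d) + d \times \Sigma_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)\]

\[WS(V_i) = (1 - d) + d \times \Sigma_{V_j \in In(V_i)} \frac{w_{ji}}{\Sigma_{V_k \in Out(V_j)} w_{jk}} WS(V_j)\]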
2. Source code implementation
2.1 Keyword extraction process
Given several sentences, extract keywords. TextRank is a graph-based ranking model, so we need to construct a graph. To construct the graph we must first decide what its vertices are, so we segment the sentences and take each word as a vertex.
When selecting words as graph vertices, some filtering rules can be applied: for example, remove stop words from the segmentation result, or add vertices only for certain parts of speech (for example, only nouns and verbs become vertices of the graph).
The vertices added to the graph can be restricted with syntactic filters, which select only lexical units of a certain part of speech. One can for instance consider only nouns and verbs for addition to the graph, and consequently draw potential edges based only on relations that can be established between nouns and verbs.
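A minimal sketch of such a syntactic filter (the stop-word set and the part-of-speech tags below are illustrative assumptions, not HanLP's actual dictionaries or tag set):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VertexFilter {
    // Illustrative stop-word list; HanLP maintains its own stop-word dictionary.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "of", "is"));

    // Keep a segmented term only if it is not a stop word and its
    // part-of-speech tag marks it as a noun ("n...") or verb ("v...").
    static boolean shouldInclude(String word, String posTag) {
        if (STOP_WORDS.contains(word)) return false;
        return posTag.startsWith("n") || posTag.startsWith("v");
    }

    public static void main(String[] args) {
        System.out.println(shouldInclude("programmer", "n")); // noun, kept
        System.out.println(shouldInclude("the", "n"));        // stop word, dropped
        System.out.println(shouldInclude("quickly", "d"));    // adverb, dropped
    }
}
```

Only terms passing this filter would become vertices; edges are then drawn between the surviving terms.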
After deciding which words become vertices of the graph, the next question is how words are related, that is, which vertices are connected by edges. For example, fix a window size and add an edge between any two words that fall within the same window.
It is the application that dictates the type of relations that are used to draw connections between any two such vertices
After determining the edges, the next step is to determine their weights. This, too, depends on the application.
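A minimal sketch of window-based edge construction (the sentence and window size are illustrative; every edge weight is 1, i.e. the graph is unweighted, as in the example that follows):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class WindowGraph {
    // Build an undirected co-occurrence graph: two words get an edge
    // when they appear within `window` positions of each other.
    static Map<String, Set<String>> build(List<String> words, int window) {
        Map<String, Set<String>> graph = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            graph.computeIfAbsent(w, k -> new TreeSet<>());
            // Link w to each of the previous (window - 1) words;
            // adding both directions makes the adjacency mutual.
            for (int j = Math.max(0, i - window + 1); j < i; j++) {
                String q = words.get(j);
                if (w.equals(q)) continue;
                graph.get(w).add(q);
                graph.computeIfAbsent(q, k -> new TreeSet<>()).add(w);
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        // Three words left after stop-word removal; with window 5 every
        // pair co-occurs, so the graph is a complete triangle.
        System.out.println(build(Arrays.asList("say", "indeed", "reason"), 5));
    }
}
```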
2.2 Determining a word's adjacent vertices from the window size
As mentioned earlier, segmenting the sentences yields a sequence of words, or Terms. Suppose the window size is 5. The Java implementation of TextRank keyword extraction determines which Terms are adjacent to a given Term as follows.
For example, the Term 'programmer' appears in several sentences; in the segmentation result, 'programmer' occurs at four positions:
At index 0: the adjacency points of 'programmer' are:
English, programmer, engagement, Program
At index 9, the adjacency points of 'programmer' are:
Development, maintenance, professional, personnel, division, program, design, personnel
At index 26, the adjacency points of the 'programmer' are:
China, software, practitioners, division, senior, programmers, system analysts, project managers
At index 28, the adjacency points of the 'programmer' are:
Practitioners, dividers, programmers, senior, system analysts, project managers, big four
Merging all the words from these four windows yields the full adjacency set of 'programmer'.
So with a window size of 5, the four Terms before a Term and the four after it are taken as its adjacent vertices; when a Term occurs several times, the neighbors from each occurrence are merged into a single adjacency set.
Note that if a Term appears many times, it tends to be more important: it accumulates more adjacent vertices, that is, more other Terms vote for it. This is somewhat like using Term Frequency to measure a Term's importance.
2.3 The score update algorithm
m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)));
Interpretation of this line:
If m.get(key) is read on the first pass through for (String element : value), it holds the first half of the formula, (1 - d). On later passes, the for loop accumulates the sum \(\Sigma_{V_j \in In(V_i)}\) from the formula.
for (String element : value) {
    int size = words.get(element).size();
    m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)));
}
Take the sentence "What he said is indeed reasonable" as an example, with a window size of 5. After word segmentation and stop-word removal, three words remain: 'say', 'indeed', and 'reason'.
The undirected graph constructed from them connects each of the three words to the other two (the weight of each edge is 1).
Take the vertex 'reason' as an example and see how its score is updated. In for (String element : value), two vertices vote for 'reason'. The first is the vertex 'indeed', which itself has two adjacent vertices, so int size = words.get(element).size() yields 2. Next, let's break down this line of code:
m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)));
m.get(key) is (1 - d), because the outer loop has already stored the first half of the formula with m.put(key, 1 - d).
score.get(element) == null ? 0 : score.get(element) is the result of the previous iteration. In the first iteration, score.get(element) is 0.8807971, the initial score assigned to each vertex:
// set the initial value according to TF; words represents the undirected graph
for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
    score.put(entry.getKey(), sigMoid(entry.getValue().size())); // initialize each vertex's score
}
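The initial value 0.8807971 is simply the sigmoid of the vertex's degree: each vertex in this example has two neighbors, and \(1/(1 + e^{-2}) \approx 0.8807971\). A quick check (the method name sigMoid mirrors the snippet above):

```java
public class SigmoidCheck {
    // Logistic sigmoid used to turn a vertex's degree into an initial score.
    static float sigMoid(float value) {
        return (float) (1d / (1d + Math.exp(-value)));
    }

    public static void main(String[] args) {
        // A vertex with 2 adjacent vertices starts at approximately 0.8807971.
        System.out.println(sigMoid(2f));
    }
}
```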
score.get(element) corresponds to \(WS(V_j)\) in the formula.
Finally, consider size, obtained by int size = words.get(element).size(). Because the weight of every edge is 1, this size is exactly the denominator \(\Sigma_{V_k \in Out(V_j)} w_{jk}\) in the formula.
In ('reason') = {'indeed', 'say'}
When \(V_j\) is 'indeed', \(Out(V_j)\) is {'say', 'reason'}, so \(\Sigma_{V_k \in Out(V_j)} w_{jk} = 2\). Therefore the score of the vertex 'reason' is updated to \((1 - d) + d \times \frac{1}{2} \times 0.8807971 = 0.5243387\). The intermediate result is then saved with m.put.
Next, for (String element : value) continues; now element is the vertex 'say'. The vertex 'say' also has two adjacent edges, so again \(\Sigma_{V_k \in Out(V_j)} w_{jk} = 2\), and the score of the vertex 'reason' is updated to \(0.5243387 + d \times \frac{1}{2} \times 0.8807971 = 0.89867747\).
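The two updates above can be checked with a few lines (d = 0.85; the helper update below is an illustration of one voter's contribution, not HanLP code):

```java
public class UpdateCheck {
    // One voter's contribution: d * (w_ji / sum_k w_jk) * WS(V_j).
    // All weights are 1, so the ratio is 1/size, where size is the
    // number of the voting vertex's neighbors.
    static double update(double prev, double d, int size, double voterScore) {
        return prev + d / size * voterScore;
    }

    public static void main(String[] args) {
        double d = 0.85, init = 0.8807971;
        double s = 1 - d;          // the (1 - d) half of the formula
        s = update(s, d, 2, init); // vote from 'indeed': close to 0.5243387
        s = update(s, d, 2, init); // vote from 'say':    close to 0.8986775
        System.out.println(s);
    }
}
```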
Following steps 1 and 2 above, for (String element : value) realizes the sum \(\Sigma_{V_j \in In(V_i)}\), because each partial result is written back into the HashMap m.
So after the first round of iteration, the score of the vertex 'reason' is 0.89867747.
Iteration continues like this for max_iter rounds, or until the score changes fall below a threshold. The adjacency sets themselves are built by the following code, where the window size of 5 is hard-coded: for a Term w, the four Terms before it (and, symmetrically, the four after it) become its adjacency points:

if (que.size() >= 5) { // window size is 5, hard-coded
    que.poll();
}
for (String qWord : que) {
    if (w.equals(qWord)) {
        continue;
    }
    // adjacency is mutual, so add the edge in both directions
    words.get(w).add(qWord);
    words.get(qWord).add(w);
}
que.offer(w);
Here is the process that assigns an initial score to each vertex in the graph:
Map<String, Float> score = new HashMap<>(); // saves the final score of each keyword
// set the initial value according to TF; words represents the undirected graph
for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
    score.put(entry.getKey(), sigMoid(entry.getValue().size())); // initialize each vertex's score
}
Next, there are three for loops: the first for loop represents the number of iterations; the second for loop calculates the score for each vertex in the undirected graph; and the third for loop calculates the voting weight given to it by each neighbor of a specific vertex.
for (int i = 0; i < max_iter; ++i) {
    // ...
    for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
        // ...
        for (String element : value) {
            // ...
        }
    }
}
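Putting the three loops together, here is a minimal self-contained sketch of the iteration (this is not HanLP's actual class; the graph is the three-word example above, and max_iter and min_diff are illustrative values):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MiniTextRank {
    static final double D = 0.85;          // damping factor
    static final int MAX_ITER = 200;       // iteration cap
    static final double MIN_DIFF = 0.001;  // convergence threshold

    static Map<String, Double> rank(Map<String, Set<String>> words) {
        Map<String, Double> score = new HashMap<>();
        // Initialize each vertex's score with sigmoid(degree).
        for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
            score.put(entry.getKey(), 1d / (1d + Math.exp(-entry.getValue().size())));
        }
        for (int i = 0; i < MAX_ITER; ++i) {                                 // loop 1: iterations
            Map<String, Double> m = new HashMap<>();
            double maxDiff = 0;
            for (Map.Entry<String, Set<String>> entry : words.entrySet()) { // loop 2: each vertex
                String key = entry.getKey();
                Set<String> value = entry.getValue();
                m.put(key, 1 - D);                 // the (1 - d) half of the formula
                for (String element : value) {     // loop 3: each neighbor's vote
                    int size = words.get(element).size();
                    if (key.equals(element) || size == 0) continue;
                    m.put(key, m.get(key) + D / size
                            * (score.get(element) == null ? 0 : score.get(element)));
                }
                maxDiff = Math.max(maxDiff,
                        Math.abs(m.get(key) - (score.get(key) == null ? 0 : score.get(key))));
            }
            score = m;
            if (maxDiff <= MIN_DIFF) break;        // converged before max_iter
        }
        return score;
    }

    public static void main(String[] args) {
        // The example sentence's graph: every word is adjacent to the other two.
        Map<String, Set<String>> g = new HashMap<>();
        g.put("say", new TreeSet<>(Arrays.asList("indeed", "reason")));
        g.put("indeed", new TreeSet<>(Arrays.asList("say", "reason")));
        g.put("reason", new TreeSet<>(Arrays.asList("say", "indeed")));
        // The graph is symmetric, so all three scores converge toward the same value.
        System.out.println(rank(g));
    }
}
```

Because this toy graph is completely symmetric, all three vertices end up with nearly identical scores; on real text the degrees differ and the ranking becomes informative.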
In this way, the formula in the paper is realized:
\[WS(V_i) = (1 - d) + d \times \Sigma_{V_j \in In(V_i)} \frac{w_{ji}}{\Sigma_{V_k \in Out(V_j)} w_{jk}} WS(V_j)\]
And the final keywords extracted are:
[reason, indeed, say]
The above demonstrates the details of the TextRank algorithm using only the single sentence "What he said is indeed reasonable", which may not be reasonable in practical applications, because:
With so little text, the co-occurrence statistics are not enough for TextRank to estimate word importance reliably; the algorithm has its limitations.
It can also be seen that TextRank's keyword extraction is affected by the word segmentation result, and likewise by the window size.
That is all the content of "HanLP keyword extraction algorithm example analysis". Thank you for reading!