

How to Use the Java-Based Chinese Word Segmentation Tool ANSJ


This article explains how to use ANSJ, a Java-based Chinese word segmentation tool. The content is simple, clear, and easy to follow; read on to learn how to use it.

ANSJ

ANSJ is a Java implementation of Chinese word segmentation based on n-gram, CRF, and HMM techniques.

Segmentation speed reaches roughly 2 million characters per second (tested on a MacBook Air), with accuracy above 96%.

It currently implements Chinese word segmentation, Chinese name recognition, user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging.

It can be applied to natural language processing and related tasks, and it is suitable for any project that demands high-quality segmentation.

The goal is a highly stable, readily usable Chinese segmentation tool that fits any scenario requiring word-level processing. The following briefly introduces the main algorithms and characteristics of Ansj.

Data structure

Highly optimized Trie tree

This is the workhorse behind user-defined dictionaries and all kinds of Map-like scenarios. As is well known, a trie offers high-speed text scanning with low memory occupancy, and it underpins structures such as Aho-Corasick automata; within the author's experience there is hardly a better option. Compared with other structures it strikes a good balance between lookup performance and construction cost. In Java, however, building very large maps, especially HashMap, is an expensive operation: putting a huge number of keys into a map forces autoboxing and collision resolution, and the sheer volume of hash probes adds up. Each individual probe may only cost nanoseconds, but over gigabytes of text that overhead can no longer be ignored. The author therefore uses a first-character hash followed by a binary search, which also avoids excessive memory consumption. Precisely because of this mechanism, Ansj can load very large user-defined dictionaries; when asked for concrete numbers: roughly 5 million words in about 1 GB of memory. The author highly recommends this little structure; you can get it through the nlp-lang package.
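To make the "first-character hash plus binary search" idea concrete, here is a minimal, hypothetical sketch (not Ansj's actual code): entries are bucketed by their first character, each bucket is a sorted array, and a lookup costs one hash probe plus a binary search, with no per-entry map nodes.

import java.util.*;

// Hypothetical illustration of a first-character-hash + binary-search
// dictionary; it only demonstrates the memory/speed idea described above.
public class FirstCharIndexDict {

    // One sorted bucket of words per first character.
    private final Map<Character, String[]> buckets = new HashMap<>();

    public FirstCharIndexDict(List<String> words) {
        Map<Character, List<String>> tmp = new HashMap<>();
        for (String w : words) {
            if (!w.isEmpty()) {
                tmp.computeIfAbsent(w.charAt(0), c -> new ArrayList<>()).add(w);
            }
        }
        for (Map.Entry<Character, List<String>> e : tmp.entrySet()) {
            String[] arr = e.getValue().toArray(new String[0]);
            Arrays.sort(arr); // sorted once, so lookups can binary-search
            buckets.put(e.getKey(), arr);
        }
    }

    public boolean contains(String word) {
        if (word.isEmpty()) return false;
        String[] bucket = buckets.get(word.charAt(0)); // first-character hash
        return bucket != null && Arrays.binarySearch(bucket, word) >= 0; // secondary binary search
    }
}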

Triple array trie tree

A trie built on three arrays. Yes, you may well complain about why the classic DAT (Double Array Trie) has become a TAT (Triple Array Trie) here. This was not done lightly: to stay rigorous to the algorithm while avoiding an unnecessary backtracking step when testing whether a state ends a word, a third array is used as a space-for-time trade-off. Those interested in the details can read the DAT construction code in nlp-lang. As for the DAT algorithm itself: if you do not have to use it, don't. Its construction and modification carry a lot of uncertainty, which does not fit the idea of simple dependability. The author wrote several articles about DAT on his blog years ago; they were rather clumsy, but reportedly a few people managed to understand them.
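For orientation, here is what a lookup against such a "three array" trie might look like. The field names are hypothetical (the real construction lives in nlp-lang), but the role of the third array, a direct word-end test instead of a backtracking step, comes through:

// Conceptual sketch of a triple-array trie lookup, not the nlp-lang code.
// base/check behave as in a classic double-array trie; the extra status
// array marks word ends so a membership test needs no backtracking.
public final class TripleArrayTrie {

    private final int[] base;    // transition base per state
    private final int[] check;   // owner check per slot
    private final byte[] status; // 1 if a word ends at this state

    TripleArrayTrie(int[] base, int[] check, byte[] status) {
        this.base = base;
        this.check = check;
        this.status = status;
    }

    public boolean contains(String word) {
        int s = 1; // root state
        for (int i = 0; i < word.length(); i++) {
            int t = base[s] + word.charAt(i);
            if (t < 0 || t >= check.length || check[t] != s) {
                return false; // no such transition
            }
            s = t;
        }
        return status[s] == 1; // direct word-end test, no backtracking
    }
}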

Machine learning

Hidden Markov model and shortest path: Ansj combines an n-gram model with a shortest-path search, where the transition relationship between adjacent words drives semantic disambiguation.
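As a rough illustration of the n-gram shortest-path idea (a simplified sketch, not Ansj's implementation): lay every dictionary word over the sentence as a lattice and keep, per position, the cheapest path scored by bigram adjacency costs. Keeping only one candidate word per position is a simplification of full Viterbi, but it shows the principle.

import java.util.*;

// Simplified bigram shortest-path segmentation sketch (not Ansj's code).
// best[i] is the cheapest cost of segmenting text[0..i); word[i] is the
// last word on that cheapest path.
public class BigramShortestPath {

    // Adjacency cost of two words; unseen pairs get a high default cost.
    static double cost(String prev, String next, Map<String, Double> bigram) {
        return bigram.getOrDefault(prev + "\t" + next, 10.0);
    }

    static List<String> segment(String text, Set<String> dict, Map<String, Double> bigram) {
        int n = text.length();
        double[] best = new double[n + 1];
        int[] from = new int[n + 1];
        String[] word = new String[n + 1];
        Arrays.fill(best, Double.MAX_VALUE);
        best[0] = 0.0;
        word[0] = "<s>"; // sentence-start marker
        for (int i = 0; i < n; i++) {
            if (best[i] == Double.MAX_VALUE) continue;
            for (int j = i + 1; j <= n; j++) {
                String w = text.substring(i, j);
                // Single characters are always candidates; longer spans must be dictionary words.
                if (j - i > 1 && !dict.contains(w)) continue;
                double c = best[i] + cost(word[i], w, bigram);
                if (c < best[j]) {
                    best[j] = c;
                    from[j] = i;
                    word[j] = w;
                }
            }
        }
        LinkedList<String> out = new LinkedList<>();
        for (int i = n; i > 0; i = from[i]) {
            out.addFirst(word[i]);
        }
        return out;
    }
}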

TF-IDF: a bag-of-words model used in keyword extraction to judge how important a word is; the extracted keywords are in turn used to summarize articles automatically.

CRF: context-based tagging similar to a CRF implements new word discovery, and new word discovery in turn serves keyword extraction.
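As a usage sketch of the TF-IDF keyword extraction described above: ansj ships a KeyWordComputer in org.ansj.app.keyword, but the constructor and method names vary between versions, so treat the following as an assumption to verify against your release.

// Extract the top 5 keywords of an article by TF-IDF.
// Requires org.ansj.app.keyword.KeyWordComputer and Keyword;
// title/content are placeholders for your own strings.
KeyWordComputer kwc = new KeyWordComputer(5);
List<Keyword> keywords = kwc.computeArticleTfidf(title, content);
for (Keyword kw : keywords) {
    System.out.println(kw.getName() + "\t" + kw.getScore());
}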

ToAnalysis: precise segmentation

Precise segmentation is the "house recommendation" among the Ansj segmenters.

It is easy to use and stable, and it strikes a good balance between accuracy and segmentation speed. If you are trying Ansj for the first time and want something that works out of the box, you cannot go wrong with this segmenter.
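A minimal out-of-the-box example (the artifact lives on Maven Central as org.ansj:ansj_seg; check the project page for the current version):

// Assumes the ansj_seg dependency on the classpath, e.g. Maven
// groupId org.ansj, artifactId ansj_seg (check the latest version).
import org.ansj.splitWord.analysis.ToAnalysis;

public class ToAnalysisDemo {
    public static void main(String[] args) {
        // Prints word/POS pairs, e.g. 欢迎/v,使用/v,ansj_seg/en,...
        System.out.println(ToAnalysis.parse("欢迎使用ansj_seg"));
    }
}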

DicAnalysis: user-dictionary-priority segmentation

This segmenter gives priority to the user-defined dictionary. If your user dictionary is good enough, or your requirements lean heavily on a custom dictionary, then DicAnalysis is strongly recommended.

In many respects, DicAnalysis even outperforms ToAnalysis.
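A sketch of feeding the user dictionary at runtime. The API has moved around between versions (older releases expose UserDefineLibrary, newer ones DicLibrary), so verify the class names against your release:

// Classic user-dictionary API: word, part-of-speech tag, frequency weight.
// Requires org.ansj.library.UserDefineLibrary and
// org.ansj.splitWord.analysis.DicAnalysis.
UserDefineLibrary.insertWord("ansj中文分词", "userDefine", 1000);
System.out.println(DicAnalysis.parse("我觉得ansj中文分词很好用"));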

NlpAnalysis: segmentation with new word discovery

The NLP segmenter is the one that can always surprise you.

It can recognize unknown words, but it also has shortcomings: it is comparatively slow and less stable. (Slow only relative to the other Ansj segmenters; it still runs at roughly 400,000 words per second.)

Personally, I see NLP segmentation applied to: 1. syntactic analysis; 2. entity name extraction; 3. collecting out-of-vocabulary words. Its main job is discovery and analysis over text.

IndexAnalysis: index-oriented segmentation

Index-oriented segmentation. As the name implies, it is suited to full-text retrieval engines such as Lucene. Two main points are considered:

* Recall: cover as many plausible segmentations of the result as possible. For example, the recalled result for "Shanghai Hongqiao Airport South Road" is [Shanghai/ns, Shanghai Hongqiao Airport/nt, Hongqiao Airport/ns, Hongqiao Airport/nz, Airport/n, South Road/nr].

* Accuracy: this is actually where Ansj's strength lies. Accuracy is somewhat at odds with recall, and Ansj cleverly sidesteps the conflict between the two. Take the classic ambiguous phrase "tourism and services" (旅游和服务): a recall-everything segmenter would also emit the crossing reading "travel / kimono / service" (旅游/和服/务), but Ansj produces no such cross-boundary terms. Every recalled word is simply a subdivision of the precise segmentation result, which solves the problem neatly.

BaseAnalysis: minimum-granularity segmentation

It guarantees only the most basic segmentation, at the smallest word granularity. The dictionary involved holds about 100,000 words.

Basic segmentation is very fast: on a MacBook Air it reaches about 3 million words per second, and its accuracy is also very high. Its support for new words, however, is very limited.

Function demonstration

String str = "facial Cleanser cooperates with facial Cleanser to deeply clean pores, clean nostrils, facial mask, squeeze hard to get a little wrinkle, cheek pores repair, invisible, strawberry nose, history, no problems, face and neck skin of the same color is healthy, long-term use, safe and healthy, look at your crow's feet, girls between five and ten years old and 28 years younger than their peers."

System.out.println(BaseAnalysis.parse(str));

clean/ag, face/q, meter/k, match/v, clean/ag, face/q, deep/b, clean/a, pore/n, clean/a, nostril/n, mask/n, broken/a, sense/v, push/v, squeeze/v, ability/v, out/v, one/m, spot/v, wrinkle/n, cheek/n, pore/n, repair/v, /u, see/v, missing/v, la/y, strawberry/n, nose/ng, history/n, legacy/vn, problem/n, no way/v, face/n, and/c, neck/n, almost/l, color/n, /u, skin/n, talent/d, is/v, healthy/a, /u, long-term/d, use/v, safety/an, health/a, ratio/p, peer/n, conspicuous/v, small/a, five/m, to/v, ten/m, year/q, 28/m, year/q, girl/n, look at/v, you/r, /u, crow's feet/n

System.out.println(ToAnalysis.parse(str));

clean/ag, face/q, meter/k, match/v, clean/ag, face/q, deep/b, clean/a, pore/n, clean/a, nostril/n, mask/n, broken/a, sense/v, push/v, squeeze/v, ability/v, out/v, one/m, spot/v, wrinkle/n, cheek/n, pore/n, repair/v, /u, see/v, missing/v, la/y, strawberry/n, nose/ng, history/n, legacy/vn, problem/n, no way/v, face/n, and/c, neck/n, almost/l, color/n, /u, skin/n, talent/d, is/v, healthy/a, /u, long-term/d, use/v, safety/an, health/a, ratio/p, peer/n, conspicuous/v, small/a, five/m, to/v, ten/mq, 28/mq, /u, girl/n, look at/v, you/r, /u, crow's feet/n

System.out.println(DicAnalysis.parse(str));

clean/ag, face/q, meter/k, match/v, clean/ag, face/q, deep/b, clean/a, pore/n, clean/a, nostril/n, mask/n, broken/a, sense/v, push/v, squeeze/v, ability/v, out/v, one/m, spot/v, wrinkle/n, cheek/n, pore/n, repair/v, /u, see/v, missing/v, la/y, strawberry/n, nose/ng, history/n, legacy/vn, problem/n, no way/v, face/n, and/c, neck/n, almost/l, color/n, /u, skin/n, talent/d, is/v, healthy/a, /u, long-term/d, use/v, safety/an, health/a, ratio/p, peer/n, conspicuous/v, small/a, five/m, to/v, ten/mq, 28/mq, /u, girl/n, look at/v, you/r, /u, crow's feet/n

System.out.println(IndexAnalysis.parse(str));

clean/ag, face/q, meter/k, match/v, clean/ag, face/q, deep/b, clean/a, pore/n, clean/a, nostril/n, mask/n, broken/a, sense/v, push/v, squeeze/v, ability/v, out/v, one/m, spot/v, wrinkle/n, cheek/n, pore/n, repair/v, /u, see/v, missing/v, la/y, strawberry/n, nose/ng, history/n, legacy/vn, problem/n, no way/v, face/n, and/c, neck/n, almost/l, color/n, /u, skin/n, talent/d, is/v, healthy/a, /u, long-term/d, use/v, safety/an, health/a, ratio/p, peer/n, conspicuous/v, small/a, five/m, to/v, ten/mq, 28/mq, /u, girl/n, look at/v, you/r, /u, crow's feet/n

System.out.println(NlpAnalysis.parse(str));

clean/ag, facial meter/nw, match/v, clean/nw, deep/b, clean/a, pore/n, clean/a, nostril/n, mask/n, crush/nw, push/v, squeeze/v, ability/d, energy/v, a little/nw, wrinkle/n, cheek/n, pore/n, repair/v, /u, see/v, not/d, see/v, la/y, strawberry/n, nose history/nw, legacy/vn, problem/n, no way/v, face/n, and/c, neck/n, almost/l, color/n, /u, skin/n, just/d, is/v, health/a, /u, long-term/d, use/v, safety/an, health/a, /u, ratio/p, peer/n, show/v, small/a, five/m, to/v, ten years old/mq, 28 years old/mq, /u, girl/n, look at/v, you/r, /u, crow's feet/n

Stop-word filtering

Stop-word filtering is a very common requirement, and it is usually praised for its advantages, but surprisingly its drawbacks often outweigh its benefits, so as a rule don't reach for it. It is largely a relic of the days when computing power was scarce and nobody wanted to waste cycles on harmless strings. In any case, the feature has to exist, so it is provided; it is invoked as follows.

1. Instantiate a filter

StopRecognition filter = new StopRecognition();
filter.insertStopNatures("uj"); // filter by part of speech
filter.insertStopNatures("ul");
filter.insertStopNatures("null");
filter.insertStopWords("我"); // filter by word
filter.insertStopRegexes("小.*?"); // regular expressions are supported

2. Apply the filter

Result modifResult = ToAnalysis.parse(str).recognition(filter); // filter the segmentation result
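If you need the individual terms rather than the printed string, the returned Result can be iterated; getName() and getNatureStr() are the Term accessors in recent versions (verify against your release):

// Walk the filtered terms one by one (str and filter as defined above).
// Requires org.ansj.domain.Result and org.ansj.domain.Term.
Result result = ToAnalysis.parse(str).recognition(filter);
for (Term term : result) {
    System.out.println(term.getName() + "\t" + term.getNatureStr());
}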

Test example

String str = "Welcome to ansj_seg, (ansj Chinese participle) here if you encounter anything

You can contact me with any questions. I will do my best. Help everyone. Ansj _ seg is faster, more accurate, freer! "

StopRecognition filter = new StopRecognition();
filter.insertStopWords("我"); // filter words
filter.insertStopWords("你");
filter.insertStopWords("的");
filter.insertStopWords(")");
filter.insertStopWords("(");

System.out.println(ToAnalysis.parse(str).recognition(filter));


welcome/v, use/v, ansj_seg/en, ansj/en, Chinese/nz, participle/v, in/p, here/r, if/c, encounter/v, what/r, problem/n, all/d, can/v, contact/v, be certain/d, do/v, help/v, everybody/rr, help/v, everybody/r, /w, more/d, quasi/a


Thank you for reading. The above covers how to use ANSJ, a Java-based Chinese word segmentation tool. After studying this article, you should have a deeper understanding of how to use it; the specifics still need to be verified in practice.
