2025-03-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article compares several common word segmentation algorithms and proposes a way of combining them.
Compared with understanding-based and statistics-based word segmentation algorithms, algorithms based on text matching are more general-purpose. A text-matching algorithm is also called a "mechanical word segmentation algorithm". It matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy. If a substring is found in the dictionary, the match succeeds and a word is identified. By scanning direction, text-matching methods can be divided into forward matching and reverse matching. By length priority, they can be divided into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with part-of-speech tagging, they can be divided into pure word segmentation methods and integrated methods that segment and tag at the same time.
Several commonly used mechanical word segmentation methods are as follows:
1) forward maximum matching method (from left to right)
2) reverse maximum matching method (from right to left)
3) minimum segmentation (to minimize the number of words cut out in each sentence).
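As a rough illustration of the third method, the sketch below uses dynamic programming to find a segmentation with the fewest words. The function name `min_word_segmentation`, the toy dictionary, and the sample sentence are assumptions made for this example, and any single character is accepted as a fallback word so that a segmentation always exists.

```python
def min_word_segmentation(text, dictionary):
    """Return one segmentation of `text` that uses the fewest words.

    Words must be in `dictionary`; a single character is always allowed
    as a fallback (an assumption of this sketch).
    """
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=1)
    # best[i] = fewest words needed to cover text[:i]
    # back[i] = start index of the last word in that best cover
    best = [0] + [n + 1] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            usable = (end - start == 1) or (piece in dictionary)
            if usable and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Toy dictionary and sentence, chosen only for illustration.
toy_dictionary = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}
words = min_word_segmentation("长春市长春节讲话", toy_dictionary)
print(len(words), words)
```

Several segmentations can tie on word count, so which one is returned depends on the order in which candidates are tried.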
Other word segmentation algorithms combine the methods above. For example, the forward maximum matching method and the reverse maximum matching method can be combined into a two-way (bidirectional) matching method. Because of the characteristics of Chinese word formation, forward minimum matching and reverse minimum matching are rarely used. This article focuses on the forward maximum matching method and the reverse maximum matching method.
The accuracy of a mechanical word segmentation algorithm depends on both the matching strategy and the completeness of the dictionary. This article assumes that the dictionary is large enough to contain all the words needed.
Generally speaking, reverse matching is slightly more accurate than forward matching and runs into fewer ambiguities. Statistical results show that the error rate of forward maximum matching is 1/169 (about 0.59%) and that of reverse maximum matching is 1/245 (about 0.41%). However, this accuracy is still far from meeting practical needs. Practical word segmentation systems use mechanical segmentation only as an initial pass and then improve accuracy further by drawing on other kinds of linguistic information.
Let's first look at two sentences in Chinese:
1) 长春市长春节讲话 (the Changchun mayor's Spring Festival speech)
2) 长春市长春药店 (Changchun Pharmacy, Changchun City)
Suppose the dictionary contains the following words: "长春" (Changchun), "长春市" (Changchun City), "市长" (mayor), "春节" (Spring Festival), "讲话" (speech), "春药" (aphrodisiac), "药店" (pharmacy), "春药店" (literally "spring pharmacy") and so on.
The results obtained by the forward maximum matching method are as follows:
长春市 / 长春 / 节 / 讲话 (Changchun City / Changchun / festival / speech: 4 words, of which "节" has no dictionary match; semantically wrong)
长春市 / 长春 / 药店 (Changchun City / Changchun / pharmacy: 3 words, all matched; semantically correct)
The results obtained by the reverse maximum matching method are as follows:
长春 / 市长 / 春节 / 讲话 (Changchun / mayor / Spring Festival / speech: 4 words, all matched; semantically correct)
长春 / 市长 / 春药店 (Changchun / mayor / "spring pharmacy": 3 words, all matched; semantically wrong)
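The four results above can be reproduced with a minimal sketch of both methods. It assumes the two example sentences are 长春市长春节讲话 and 长春市长春药店 and that the dictionary holds exactly the eight words listed; the function names `forward_max_match` and `backward_max_match` are illustrative only. A character that matches no dictionary entry is emitted as a one-character piece, which is one common convention rather than the only possible one.

```python
DICTIONARY = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}


def forward_max_match(text, dictionary):
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max((len(w) for w in dictionary), default=1)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                pieces.append(piece)  # size == 1 is the unmatched fallback
                i += size
                break
    return pieces


def backward_max_match(text, dictionary):
    """Scan right to left, always taking the longest dictionary word."""
    max_len = max((len(w) for w in dictionary), default=1)
    pieces = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                pieces.append(piece)
                j -= size
                break
    pieces.reverse()  # pieces were collected from the end of the text
    return pieces


for sentence in ("长春市长春节讲话", "长春市长春药店"):
    print(" / ".join(forward_max_match(sentence, DICTIONARY)))
    print(" / ".join(backward_max_match(sentence, DICTIONARY)))
# 长春市 / 长春 / 节 / 讲话
# 长春 / 市长 / 春节 / 讲话
# 长春市 / 长春 / 药店
# 长春 / 市长 / 春药店
```

Both functions bound the candidate length by the longest dictionary word, so each position costs at most that many set lookups.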
These results show the strengths and weaknesses of the forward and reverse maximum matching methods: each segments some Chinese sentences correctly and gets others wrong.
Can we combine the two matching methods and draw on the strengths of each? The answer is yes.
First, we segment the same sentence with both the forward maximum matching method and the reverse maximum matching method, and then compare the results. For example, when segmenting "Changchun Mayor's Spring Festival speech", forward maximum matching leaves one fragment with no dictionary match, so we take the reverse maximum matching result instead.
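This comparison step can be sketched as a small selection rule. The helper names `count_unmatched` and `choose_by_coverage` are hypothetical, the two segmentations are hard-coded from the results listed earlier (with the Chinese strings assumed as the original sentences), and returning `None` on a tie is a design choice of this sketch to signal that another criterion is needed.

```python
DICTIONARY = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}


def count_unmatched(pieces, dictionary):
    """Count the pieces that are not dictionary words."""
    return sum(1 for piece in pieces if piece not in dictionary)


def choose_by_coverage(forward, backward, dictionary):
    """Prefer the segmentation with fewer unmatched pieces.

    Returns None when both have the same count, meaning this rule
    alone cannot decide between them.
    """
    f = count_unmatched(forward, dictionary)
    b = count_unmatched(backward, dictionary)
    if f < b:
        return forward
    if b < f:
        return backward
    return None


speech_forward = ["长春市", "长春", "节", "讲话"]
speech_backward = ["长春", "市长", "春节", "讲话"]
print(choose_by_coverage(speech_forward, speech_backward, DICTIONARY))
# ['长春', '市长', '春节', '讲话']

pharmacy_forward = ["长春市", "长春", "药店"]
pharmacy_backward = ["长春", "市长", "春药店"]
print(choose_by_coverage(pharmacy_forward, pharmacy_backward, DICTIONARY))
# None
```

The second call returns `None` because every piece in both segmentations is a dictionary word, so coverage cannot separate them.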
Second, we can introduce word frequency: each word is assigned a frequency value according to how likely it is to occur in Chinese. When "Changchun Pharmacy in Changchun City" is segmented with both methods, the word "spring pharmacy" produced by reverse maximum matching has a far lower frequency than the other words. We can therefore treat that segmentation as unlikely and take the forward maximum matching result instead.
Of course, other techniques (such as the scanning and marking method or part-of-speech checking) can be combined with these two matching methods to make word segmentation more accurate still.
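A minimal sketch of the frequency rule follows. The frequency values in `WORD_FREQ` are invented purely for illustration and are not measured data, the function name `choose_by_frequency` is hypothetical, and scoring a segmentation by its least frequent word is just one simple way to express "one word is much rarer than the others".

```python
# Made-up relative frequencies, for illustration only (not real corpus data).
WORD_FREQ = {
    "长春市": 5e-5,
    "长春": 8e-5,
    "市长": 6e-5,
    "药店": 4e-5,
    "春药店": 1e-8,
}


def weakest_frequency(pieces, freq):
    """Frequency of the rarest piece; words missing from the table count as 0."""
    return min(freq.get(piece, 0.0) for piece in pieces)


def choose_by_frequency(forward, backward, freq):
    """Prefer the segmentation whose rarest word is more frequent."""
    if weakest_frequency(forward, freq) >= weakest_frequency(backward, freq):
        return forward
    return backward


pharmacy_forward = ["长春市", "长春", "药店"]
pharmacy_backward = ["长春", "市长", "春药店"]
print(choose_by_frequency(pharmacy_forward, pharmacy_backward, WORD_FREQ))
# ['长春市', '长春', '药店']
```

A real system would more likely sum log-probabilities estimated from a corpus; the minimum is used here only because it mirrors the reasoning in the text directly.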