2025-04-06 Update From: SLTechnology News&Howtos (shulou), Database
Shulou(Shulou.com)05/31 Report--
This article introduces how to use HanLP and is intended as a practical reference.
HanLP is a toolkit of models and algorithms that aims to bring natural language processing into production environments. It is feature-complete, fast, cleanly architected, built on up-to-date corpora, and customizable. It provides lexical analysis (Chinese word segmentation, part-of-speech tagging, named entity recognition), syntactic analysis, text classification, sentiment analysis, and other functions.
The sections below cover word segmentation, keyword extraction, phrase extraction, automatic summarization, and Pinyin conversion for text entered by the user, together with dictionary maintenance (adding a custom dictionary and discovering new words).
Tool class name: DKNLPBase
1. Standard word segmentation
Method signature: List<Term> StandardTokenizer.segment(String txt)
Return: list of tokens.
Parameter description: txt: the sentence to segment.
Example: the following test checks that the token at index 5 of a sentence is "AlphaGo".
Program listing 1
// Note: the string literals appear to be English renderings of Chinese test data,
// so the token indexes in the assertions refer to the segmentation of the Chinese originals.
// Requires java.util.List, HanLP's Term class, and JUnit's assertEquals.
public void testSegment() throws Exception
{
    String text = "goods and services";
    List<Term> termList = DKNLPBase.segment(text);
    assertEquals("goods", termList.get(0).word);
    assertEquals("and", termList.get(1).word);
    assertEquals("services", termList.get(2).word);
    text = "Ke Jie explains that Lee Se-dol VS AlphaGo's second inning ends like this.";
    termList = DKNLPBase.segment(text);
    assertEquals("AlphaGo", termList.get(5).word); // able to identify "AlphaGo"
}
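To make the shape of a segmentation result concrete, here is a minimal sketch of dictionary-based forward maximum matching in plain Java. It is not how HanLP's StandardTokenizer works internally. The class MaxMatchSketch, its segment method, and the maxWordLen parameter are hypothetical names introduced only for illustration, and unspaced Latin text stands in for unsegmented Chinese.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of HanLP or DKNLPBase.
class MaxMatchSketch {
    // Forward maximum matching: at each position take the longest dictionary
    // word (up to maxWordLen characters); fall back to a single character
    // when no dictionary word starts at that position.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the window until it is a dictionary word or one character long.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

Calling MaxMatchSketch.segment("goodsandservices", Set.of("goods", "and", "services"), 8) yields the three dictionary words in order, much like the first three assertions in listing 1. A character that starts no dictionary word comes back as a one-character token.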
2. Keyword extraction
Method signature: List<String> extractKeyword(String txt, int keySum)
Return: keyword list.
Parameter description: txt: the text to extract keywords from; keySum: the number of keywords to extract.
Example: extract one keyword from a paragraph and check that it is "programmer".
Program listing 2
public void testExtractKeyword() throws Exception
{
    String content = "programmers (English Programmer) are professionals engaged in program development and maintenance. " +
        "Programmers are generally divided into programmers and programmers. " +
        "but the line between the two is not very clear, especially in China. " +
        "Software practitioners are divided into junior programmers, senior programmers and systems " +
        "there are four categories of analysts and project managers.";
    List<String> keyword = DKNLPBase.extractKeyword(content, 1);
    assertEquals(1, keyword.size());
    assertEquals("programmer", keyword.get(0));
}
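As a rough illustration of what "the top keySum keywords" means, the sketch below ranks already-segmented tokens by raw frequency after dropping stop words. This is a deliberately simpler technique than whatever DKNLPBase.extractKeyword uses internally, and KeywordSketch, topKeywords, and the stopWords parameter are hypothetical names, not part of HanLP.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of HanLP or DKNLPBase.
class KeywordSketch {
    // Returns up to keySum tokens ordered by descending frequency;
    // ties are broken alphabetically so the result is deterministic.
    static List<String> topKeywords(List<String> tokens, Set<String> stopWords, int keySum) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            String word = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(word)) {
                freq.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(keySum, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Frequency counting favors common words, which is why the stop-word set matters here; graph-based rankers such as TextRank are a common alternative when a stop-word list alone is not enough.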
3. Phrase extraction
Method signature: List<String> extractPhrase(String txt, int phSum)
Return: list of phrases.
Parameter description: txt: the text to extract phrases from; phSum: the number of phrases to extract.
Example: extract the five phrases that best represent a passage; the first of them is "algorithm engineer".
Program listing 3
public void testExtractPhrase() throws Exception
{
    String text = "algorithm engineer\n" +
        "an algorithm (Algorithm) is a series of clear instructions to solve a problem, that is, to be able to obtain the required output for a certain standard input in a limited time. " +
        "if an algorithm is flawed or is not suitable for a problem, executing the algorithm will not solve the problem. Different algorithms may take different times, " +
        "Space or efficiency to accomplish the same task. the pros and cons of an algorithm can be measured by space complexity and time complexity. Algorithm engineers are people who use algorithms to deal with things.\n" +
        "\n" +
        "1 Job profile\n" +
        "algorithm engineer is a very high-end position;\n" +
        "Professional requirements: computer, electronics, communications, mathematics and other related majors;\n" +
        "academic requirements: bachelor's degree or above, most of them are master's degree or above;\n" +
        "language requirements: English requirements are proficient, basically able to read foreign professional books and periodicals;\n" +
        "must master computer-related knowledge, proficient in the use of simulation tools such as MATLAB, must be able to speak a programming language.\n" +
        "\n" +
        "2 Research directions\n" +
        "Video algorithm engineer, image processing algorithm engineer, audio algorithm engineer communication baseband algorithm engineer" +
        "\n" +
        "3 current situation at home and abroad\n" +
        "at present, there are many engineers engaged in algorithm research in China, but there are very few senior algorithm engineers, and they are a very scarce professional engineer. " +
        "According to the research field, algorithm engineers are mainly divided into audio / video algorithm processing, two-dimensional information algorithm processing in image technology and communication physical layer. " +
        "one-dimensional information algorithm processing in radar signal processing, biomedical signal processing and other fields.\n" +
        "at present, there are relatively advanced video processing algorithms in two-dimensional information algorithm processing, such as computer audio and video and graphics and image technology: machine vision has become the core of this kind of algorithm research; " +
        "in addition, there are 2D to 3D algorithm (2D-to-3D conversion), de-interlacing algorithm (de-interlacing), motion estimation motion compensation algorithm " +
        "(Motion estimation/Motion Compensation), denoising algorithm (Noise Reduction), scaling algorithm (scaling), " +
        "sharpening algorithm (Sharpness), Super Resolution algorithm (Super Resolution), gesture recognition (gesture recognition), face recognition (face recognition).\n" +
        "algorithms commonly used in the field of one-dimensional information such as communication physical layer: RRM and RTT in wireless field, modulation and demodulation in transmission field, channel equalization, signal detection, network optimization, signal decomposition, etc.\n" +
        "in addition, data mining and Internet search algorithms have also become popular directions.\n" +
        "algorithm engineers are gradually moving towards artificial intelligence.";
    List<String> phraseList = DKNLPBase.extractPhrase(text, 5);
    assertEquals(5, phraseList.size());
    assertEquals("algorithm engineer", phraseList.get(0));
}
4. Automatic summary
Method signature: List<String> extractSummary(String txt, int sSum)
Return: list of summary sentences.
Parameter description: txt: the text to summarize; sSum: the number of summary sentences to extract.
Example: automatically extract three summary sentences.
Program listing 4
public void testExtractSummary() throws Exception
{
    String document = "algorithm can be roughly divided into basic algorithm, data structure algorithm, number theory algorithm, computational geometry algorithm, graph algorithm, dynamic programming and numerical analysis, encryption algorithm, sorting algorithm, retrieval algorithm, randomization algorithm, parallel algorithm, Hermitian deformation model, random forest algorithm.\n" +
        "algorithms can be broadly divided into three categories,\n" +
        "first, finite deterministic algorithms, which are terminated within a limited period of time. They may take a long time to perform a specified task, but will still be terminated within a certain period of time. The results of such algorithms often depend on the input value.\n" +
        "second, finite indeterminate algorithms, which are terminated in a limited time. However, for a given value (or some), the result of the algorithm is not unique or definite.\n" +
        "third, infinite algorithms are those that do not terminate because there are no defined termination conditions, or the defined conditions cannot be satisfied by the input data. In general, infinite algorithms are generated because the termination conditions are not defined.";
    List<String> sentenceList = DKNLPBase.extractSummary(document, 3);
    assertEquals(3, sentenceList.size());
}
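The idea of extractive summarization (pick existing sentences rather than write new ones) can be sketched without any library. The example below scores each sentence by the average document-wide frequency of its words and returns the sSum best ones in document order. It is a simplified stand-in, not the method behind DKNLPBase.extractSummary, and SummarySketch, summarize, and words are hypothetical names. It also assumes English-style sentence punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of HanLP or DKNLPBase.
class SummarySketch {
    // Lower-cased alphanumeric words of a sentence.
    static List<String> words(String sentence) {
        List<String> out = new ArrayList<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    // Returns the sSum highest-scoring sentences, in their original order.
    static List<String> summarize(String document, int sSum) {
        // Split after '.', '!' or '?' followed by whitespace.
        String[] sentences = document.trim().split("(?<=[.!?])\\s+");
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String w : words(s)) freq.merge(w, 1, Integer::sum);
        }
        // Score = average document-wide frequency of the sentence's words.
        double[] score = new double[sentences.length];
        Integer[] order = new Integer[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            List<String> ws = words(sentences[i]);
            double sum = 0;
            for (String w : ws) sum += freq.get(w);
            score[i] = ws.isEmpty() ? 0 : sum / ws.size();
            order[i] = i;
        }
        // Stable sort by descending score, keep the top sSum, restore document order.
        Arrays.sort(order, (a, b) -> Double.compare(score[b], score[a]));
        Integer[] top = Arrays.copyOf(order, Math.max(0, Math.min(sSum, order.length)));
        Arrays.sort(top);
        List<String> summary = new ArrayList<>();
        for (int idx : top) summary.add(sentences[idx]);
        return summary;
    }
}
```

Averaging rather than summing the word frequencies keeps long sentences from winning purely on length, which is one of several reasonable design choices for a toy scorer like this.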
5. Pinyin conversion
Method signature: List<Pinyin> convertToPinyinList(String txt)
Return: Pinyin list.
Parameter description: txt: the sentence to convert to Pinyin.
Example: check the Pinyin of the second character in a sentence.
Program listing 5
// Note: the string literal appears to be an English rendering of a Chinese sentence.
// The assertions (one Pinyin per character, "lu4" at index 1) refer to the Chinese original.
public void testConvertToPinyinList() throws Exception
{
    String text = "the green of the Yalu River is not the same as the green.";
    List<Pinyin> pinyinList = DKNLPBase.convertToPinyinList(text);
    assertEquals(text.length(), pinyinList.size());
    assertEquals(Pinyin.lu4, pinyinList.get(1));
}
6. Add a custom dictionary
Method signature: String addCK(String filePath)
Return: null on success; otherwise an error message.
Parameter description: filePath: path to the new dictionary file, with one word per line (words separated by a carriage return and newline).
Example: load the new dictionary file, which adds the word "Xinmei" to the dictionary; the test then expects that word as the seventh token of the segmented sentence.
Program listing 6
// Note: the string literals appear to be English renderings of Chinese test data;
// presumably the original sentence contains the dictionary word itself.
public void testAddCK() throws Exception
{
    DKNLPBase.addCK("src/test/resources/custom_dictionary.txt");
    String text = "Internet home decoration quality problems frequently Meituan-Dianping into the odds of success";
    List<Term> termList = DKNLPBase.segment(text);
    assertEquals("Xinmei", termList.get(6).word);
}
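The file format and return convention described for addCK (one word per line, null on success, an error message otherwise) can be illustrated with a small stand-alone loader. This is only a sketch of that contract under the assumption of a UTF-8 text file; CustomDictionarySketch, WORDS, and addFromFile are hypothetical names, and nothing here touches HanLP's real dictionary.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of HanLP or DKNLPBase.
class CustomDictionarySketch {
    // In-memory stand-in for a custom dictionary.
    static final Set<String> WORDS = new LinkedHashSet<>();

    // Reads one word per line; returns null on success, otherwise an error message.
    static String addFromFile(String filePath) {
        try {
            // readAllLines accepts both "\n" and "\r\n" line endings.
            for (String line : Files.readAllLines(Paths.get(filePath))) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    WORDS.add(word);
                }
            }
            return null;
        } catch (IOException e) {
            return "could not read " + filePath + ": " + e.getMessage();
        }
    }
}
```

Returning a message instead of throwing lets a caller write `String err = addFromFile(path); if (err != null) { ... }`, which matches the "null means success" convention stated above.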
7. New word discovery
Method signature:
NewWordDiscover discover = new NewWordDiscover(max_word_len, min_freq, min_entropy, min_aggregation, filter);
discover.discovery(text, size);
Return: the list of discovered words (the example below stores the result in a list of WordInfo).
Parameter description:
max_word_len: the maximum length of a word in the results. The default value is 4. The higher the value, the greater the amount of computation and the more phrases appear in the results.
min_freq: the minimum frequency of a word in the results. Candidates below this frequency are filtered out, which reduces the amount of computation. Because the results are sorted by frequency, this parameter matters little in practice; the interface simply sets it to 0, so all candidate words are returned.
min_entropy: the minimum information entropy (uncertainty of information) of a word in the results, generally about 0.5. The higher the value, the more easily shorter words are extracted.
min_aggregation: the minimum mutual information (the correlation between words) of a word in the results, usually from 50 to 200. The higher the value, the more easily longer words are extracted, and sometimes phrases appear.
filter: when set to true, the internal dictionary is used to filter out "old" (already known) words.
text: the document used for new word discovery.
size: the number of new words to return.
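The min_entropy threshold is easier to picture with a small calculation. The sketch below computes the entropy of the characters that follow a candidate string in a text, using the natural logarithm. The names EntropySketch and rightEntropy are hypothetical, and HanLP's own computation (its log base, and how it combines left and right neighbors) may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of HanLP or DKNLPBase.
class EntropySketch {
    // Entropy (natural log) of the distribution of characters that
    // immediately follow each occurrence of candidate in text.
    static double rightEntropy(String text, String candidate) {
        Map<Character, Integer> counts = new HashMap<>();
        int total = 0;
        int idx = text.indexOf(candidate);
        while (idx >= 0) {
            int next = idx + candidate.length();
            if (next < text.length()) {
                counts.merge(text.charAt(next), 1, Integer::sum);
                total++;
            }
            idx = text.indexOf(candidate, idx + 1);
        }
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / total;
            entropy -= p * Math.log(p);
        }
        return entropy;
    }
}
```

A candidate that is always followed by the same character has entropy 0 and would fall below a threshold of 0.5, suggesting it is a fragment of a longer word. A candidate followed equally often by two different characters has entropy ln 2 (about 0.69) and would pass.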
Example: new word discovery.
Program listing 7
// Requires java.io.File, java.util.List, and HanLP's NewWordDiscover, WordInfo, and IOUtil.
public void testFindNewWord() {
    NewWordDiscover discover = new NewWordDiscover(4, 0.0f, 0.5f, 100f, true);
    // Read up to 100 documents under the folder and merge them into one document for new word discovery.
    // Note: the directory name appears to be an English rendering of the original path.
    StringBuilder sbText = new StringBuilder();
    File[] txtFiles = new File("src/test/resources/ Sogou text Classification Corpus Micro Edition / Health").listFiles();
    int i = 0;
    for (File file : txtFiles)
    {
        System.out.printf("[%d / %d] reads %s.\n", ++i, txtFiles.length, file.getName());
        sbText.append(IOUtil.readTxt(file.getPath()));
        if (i == 100) break; // stop after 100 files
    }
    System.out.printf("analyzing corpus of length %d.\n", sbText.length());
    List<WordInfo> wordInfoList = discover.discovery(sbText.toString(), 10);
    // Print the new words found.
    for (WordInfo wordInfo : wordInfoList) {
        System.out.println(wordInfo.text);
    }
}