2025-03-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article introduces how to use the word Chinese segmentation component in Java. Many people run into difficulties when working through real cases, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!
word is a distributed Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It accurately recognizes English words, numbers, dates, times, and other quantity expressions, and can identify unlisted words such as person names, place names, and organization names. Its behavior can be changed through custom configuration files: you can define user dictionaries (with automatic detection of dictionary changes), deploy it in a large-scale distributed environment, flexibly choose among several segmentation algorithms, use refine functions to fine-tune segmentation results, and apply part-of-speech tagging, synonym tagging, antonym tagging, pinyin tagging, and other functions. It also integrates seamlessly with Lucene, Solr, ElasticSearch, and Luke. Note: word 1.3 requires JDK 1.8.
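To make the dictionary-based idea concrete, here is a minimal sketch of forward maximum matching, the simplest member of the algorithm family listed later. The tiny dictionary and example sentence are purely illustrative; the real component loads large dictionaries and uses its ngram model to resolve exactly the kind of ambiguity this sketch exposes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ForwardMaximumMatching {
    // Greedily take the longest dictionary word starting at each position;
    // fall back to a single character when nothing matches.
    static List<String> seg(String text, Set<String> dict, int maxLen) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(maxLen, text.length() - i);
            while (len > 1 && !dict.contains(text.substring(i, i + len))) {
                len--;
            }
            result.add(text.substring(i, i + len));
            i += len;
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("中国", "中国人", "人民");
        // Greedy forward matching takes 中国人 first, stranding 民 —
        // the kind of ambiguity the ngram model is there to resolve.
        System.out.println(seg("中国人民", dict, 3)); // prints [中国人, 民]
    }
}
```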
Maven dependency:
Specify the dependency in pom.xml. Available versions are 1.0, 1.1, and 1.2:
<dependency>
    <groupId>org.apdplat</groupId>
    <artifactId>word</artifactId>
    <version>1.2</version>
</dependency>
How to use the word segmenter:
1. Quick experience
Run the script demo-word.bat in the root directory of the project to quickly try out segmentation.
Usage: command [text] [input] [output]
Available values for command are: demo, text, file
demo
text Yang Shangchuan is the author of the APDPlat application-level product development platform.
file d:/text.txt d:/word.txt
exit
2. Segment text
Remove stop words: List<Word> words = WordSegmenter.seg("Yang Shangchuan is the author of the APDPlat application-level product development platform");
Keep stop words: List<Word> words = WordSegmenter.segWithStopWords("Yang Shangchuan is the author of the APDPlat application-level product development platform");
System.out.println(words);
Output:
Remove stop words: [Yang Shangchuan, apdplat, application level, product, development platform, author]
Retain stop words: [Yang Shangchuan, yes, apdplat, application level, product, development platform, author]
3. Segment a file
String input = "d:/text.txt";
String output = "d:/word.txt";
Remove stop words: WordSegmenter.seg(new File(input), new File(output));
Keep stop words: WordSegmenter.segWithStopWords(new File(input), new File(output));
4. Custom configuration file
The default configuration file is word.conf on the classpath, packaged inside word-x.x.jar.
The custom configuration file is word.local.conf on the classpath, which the user provides.
If a custom configuration item has the same name as a default item, the custom value overrides the default.
Configuration files are encoded in UTF-8.
5. Custom user dictionary
A custom user dictionary consists of one or more folders or files; absolute or relative paths can be used.
The user dictionary is made up of dictionary files encoded in UTF-8.
A dictionary file is a plain text file with one word per line.
Paths can be specified through system properties or configuration files, with multiple paths separated by commas.
For dictionary files on the classpath, prefix the relative path with classpath:
There are three ways to specify the paths:
Method 1: programmatically (high priority):
WordConfTools.set("dic.path", "classpath:dic.txt,d:/custom_dic");
DictionaryFactory.reload(); // reload the dictionary after changing the dictionary path
Method 2: JVM startup parameter (medium priority):
java -Ddic.path=classpath:dic.txt,d:/custom_dic
Method 3: configuration file (low priority):
Use the file word.local.conf on the classpath to specify the configuration:
dic.path=classpath:dic.txt,d:/custom_dic
If nothing is specified, the dic.txt dictionary file on the classpath is used by default.
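The three-level override order can be pictured as a simple fallback chain. The sketch below is an illustration of the priority rule described above, not the component's actual resolution code:

```java
// Illustrative sketch of the dic.path override order:
// programmatic setting > JVM -D property > configuration file > built-in default.
class DicPathPriority {
    static String resolve(String programmatic, String jvmProperty, String configFile) {
        if (programmatic != null) return programmatic; // WordConfTools.set(...)  — high priority
        if (jvmProperty != null) return jvmProperty;   // java -Ddic.path=...     — medium priority
        if (configFile != null) return configFile;     // word.local.conf         — low priority
        return "classpath:dic.txt";                    // default dictionary
    }

    public static void main(String[] args) {
        // Only the config file sets a value, so it wins by default.
        System.out.println(resolve(null, null, "classpath:dic.txt,d:/custom_dic"));
    }
}
```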
6. Custom stop-word dictionary
Usage is similar to the custom user dictionary, with the following configuration item:
stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic
7. Automatic detection of dictionary changes
Changes to the custom user dictionary and the custom stop-word dictionary are detected automatically.
This covers files and folders on the classpath, as well as absolute and relative paths outside it.
For example:
classpath:dic.txt,classpath:custom_dic_dir
d:/dic_more.txt,d:/DIC_DIR,D:/DIC2_DIR,my_dic_dir,my_dic_file.txt
classpath:stopwords.txt,classpath:custom_stopwords_dic_dir
d:/stopwords_more.txt,d:/STOPWORDS_DIR,d:/STOPWORDS2_DIR,stopwords_dir,remove.txt
8. Explicitly specifying the segmentation algorithm
When segmenting text, you can explicitly specify a particular segmentation algorithm, for example:
WordSegmenter.seg("APDPlat application-level product development platform", SegmentationAlgorithm.BidirectionalMaximumMatching);
The optional types of SegmentationAlgorithm are:
Forward maximum matching algorithm: MaximumMatching
Inverse maximum matching algorithm: ReverseMaximumMatching
Forward minimum matching algorithm: MinimumMatching
Inverse minimum matching algorithm: ReverseMinimumMatching
Bi-directional maximum matching algorithm: BidirectionalMaximumMatching
Bi-directional minimum matching algorithm: BidirectionalMinimumMatching
Bi-directional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching
Full segmentation algorithm: FullSegmentation
Minimum word segmentation algorithm: MinimalWordCount
Maximum Ngram score algorithm: MaxNgramScore
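To illustrate why both forward and reverse variants exist, here is a sketch that runs maximum matching in both directions and keeps the result with fewer words, falling back to the reverse result on a tie (reverse matching is commonly more accurate for Chinese). Note this tie-break is a simplification: the real BidirectionalMaximumMatching chooses between the candidates with the ngram model.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

class BidirectionalMatchingSketch {
    static List<String> forward(String text, Set<String> dict, int maxLen) {
        LinkedList<String> result = new LinkedList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(maxLen, text.length() - i);
            while (len > 1 && !dict.contains(text.substring(i, i + len))) len--;
            result.add(text.substring(i, i + len));
            i += len;
        }
        return result;
    }

    static List<String> reverse(String text, Set<String> dict, int maxLen) {
        LinkedList<String> result = new LinkedList<>();
        int i = text.length();
        while (i > 0) {
            int len = Math.min(maxLen, i);
            while (len > 1 && !dict.contains(text.substring(i - len, i))) len--;
            result.addFirst(text.substring(i - len, i));
            i -= len;
        }
        return result;
    }

    // Simplified disambiguation: fewer words wins; on a tie, prefer reverse.
    static List<String> bidirectional(String text, Set<String> dict, int maxLen) {
        List<String> f = forward(text, dict, maxLen);
        List<String> r = reverse(text, dict, maxLen);
        return f.size() < r.size() ? f : r;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("研究", "研究生", "生命", "起源");
        System.out.println(forward("研究生命起源", dict, 3));       // [研究生, 命, 起源]
        System.out.println(bidirectional("研究生命起源", dict, 3)); // [研究, 生命, 起源]
    }
}
```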
9. Evaluating segmentation quality
Run the script evaluation.bat in the project root directory to evaluate segmentation quality.
The test text used for the evaluation has 2,533,709 lines and 28,374,490 characters in total.
The evaluation results are in the target/evaluation directory:
corpus-text.txt is the manually annotated, pre-segmented text, with words separated by spaces.
test-text.txt is the test text, produced by splitting corpus-text.txt into multiple lines at punctuation marks.
standard-text.txt is the manual annotation corresponding to the test text; it serves as the gold standard for correct segmentation.
result-text-***.txt (*** is the name of each segmentation algorithm) contains the segmentation results produced by word.
perfect-result-***.txt (*** is the name of each segmentation algorithm) contains the lines whose segmentation exactly matches the manual annotation.
wrong-result-***.txt (*** is the name of each segmentation algorithm) contains the lines whose segmentation differs from the manual annotation standard.
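Comparing a result file against the gold standard boils down to span matching. The following is a hedged sketch of how segmentation precision and recall are conventionally computed (the standard bakeoff-style metric, not necessarily the exact logic of the project's evaluation script):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegEvaluationSketch {
    // Represent a segmentation as the set of [start:end) character spans it induces.
    static Set<String> spans(List<String> words) {
        Set<String> spans = new HashSet<>();
        int offset = 0;
        for (String w : words) {
            spans.add(offset + ":" + (offset + w.length()));
            offset += w.length();
        }
        return spans;
    }

    // Returns {precision, recall} of a segmentation result against the gold standard.
    static double[] evaluate(List<String> result, List<String> gold) {
        Set<String> r = spans(result);
        Set<String> g = spans(gold);
        long correct = r.stream().filter(g::contains).count();
        return new double[] { (double) correct / r.size(), (double) correct / g.size() };
    }

    public static void main(String[] args) {
        // Only one of the three predicted words matches the gold segmentation.
        double[] pr = evaluate(List.of("研究生", "命", "起源"), List.of("研究", "生命", "起源"));
        System.out.printf("precision=%.3f recall=%.3f%n", pr[0], pr[1]);
    }
}
```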
10. Distributed Chinese word segmentation
1. In the custom configuration file word.conf or word.local.conf, point all *.path configuration items at HTTP resources, and also specify the redis.* configuration items.
2. Configure and start a web server that serves the HTTP resources, deploying the project https://github.com/ysc/word_web to Tomcat.
3. Configure and start the redis server.
11. Part-of-speech tagging (available in 1.3 and later only)
Take the segmentation result as the input, call the process method of the PartOfSpeechTagging class, and the part of speech is stored in the partOfSpeech field of the Word class.
For example:
List<Word> words = WordSegmenter.segWithStopWords("I love China");
System.out.println("Before tagging: " + words);
// part-of-speech tagging
PartOfSpeechTagging.process(words);
System.out.println("After tagging: " + words);
Output:
Before tagging: [I, love, China]
After tagging: [I/r, love/v, China/ns]
12. refine
Let's look at a segmentation example:
List<Word> words = WordSegmenter.segWithStopWords("China's working class and the broad masses of working people should unite even more closely around the CPC Central Committee");
System.out.println(words);
The result is:
[our country, working class, and, broad, working masses, should, more, closely, unite, around the Party Central Committee]
Suppose the result we want is:
[our country, workers, class, and, broad, labor, masses, should, more, closely, unite, around the Party Central Committee]
That is, we want to split "working class" into "workers" and "class", and "working masses" into "labor" and "masses". To achieve this, we can add the following to the file specified by the word.refine.path configuration item (default: classpath:word_refine.txt), where each line maps a word to its space-separated parts:
working class=workers class
working masses=labor masses
Then we refine the segmentation result:
words = WordRefiner.refine(words);
System.out.println(words);
This produces the result we want:
[our country, workers, class, and, broad, labor, masses, should, more, closely, unite, around the Party Central Committee]
Let's look at another segmentation example:
List<Word> words = WordSegmenter.segWithStopWords("make new achievements on the great journey to achieve the two centenary goals");
System.out.println(words);
The result is:
[on, achieve, two, one hundred years, goal, great, journey, on, make, new, achievements]
Suppose the result we want is:
[on, achieve, two-centenary, goal, great-journey, on, make, new, achievements]
That is, we want to merge "two" and "one hundred years" into one word, and "great" and "journey" into one word. We can add the following to the file specified by the word.refine.path configuration item (default: classpath:word_refine.txt), where each line maps a space-separated word sequence to its merged form:
two one hundred years=two-centenary
great journey=great-journey
Then we refine the segmentation result:
words = WordRefiner.refine(words);
System.out.println(words);
This produces the result we want:
[on, achieve, two-centenary, goal, great-journey, on, make, new, achievements]
13. Synonym tagging
List<Word> words = WordSegmenter.segWithStopWords("Chu Li Mo tries every possible means to ruthlessly retrieve memories");
System.out.println(words);
The result is:
[Chu Li Mo, tries every possible means, to, ruthlessly, retrieve, memories]
Apply synonym tagging:
SynonymTagging.process(words);
System.out.println(words);
The result is:
[Chu Li Mo, tries every possible means [long premeditated, painstaking, scheming, taking great pains], to, ruthlessly, retrieve, memories [impressions]]
If indirect synonyms are enabled:
SynonymTagging.process(words, false);
System.out.println(words);
The result is:
[Chu Li Mo, tries every possible means [long premeditated, painstaking, scheming, taking great pains], to, ruthlessly, retrieve, memories [impressions, images]]
List<Word> words = WordSegmenter.segWithStopWords("Old people with strong grip tend to live longer");
System.out.println(words);
The result is:
[grip strength, big, old people, often, more, long-lived]
Apply synonym tagging:
SynonymTagging.process(words);
System.out.println(words);
The result is:
[grip strength, big, old people [the elderly], often [frequently], more, long-lived [longevity, great age]]
If indirect synonyms are enabled:
SynonymTagging.process(words, false);
System.out.println(words);
The result is:
[grip strength, big, old people [the elderly], often [as usual, ordinarily, commonly, regularly, daily, usually, frequently], more, long-lived [longevity, great age]]
Take the word "tries every possible means" as an example:
You can get its synonyms through the getSynonym() method of Word:
System.out.println(word.getSynonym());
The result is:
[long premeditated, painstaking, taking great pains]
Note: if there are no synonyms, getSynonym() returns an empty collection: Collections.emptyList()
The difference between indirect synonyms and direct synonyms is as follows:
Suppose:
A and B are synonyms, A and C are synonyms, B and D are synonyms, C and E are synonyms
Then:
For A, A B C are direct synonyms
For B, A B D are direct synonyms
For C, A C E are direct synonyms
For A, B, and C, A B C D E are indirect synonyms
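The A/B/C/D/E relationships above can be pictured as a small graph: direct synonyms are a word's immediate neighbors, while indirect synonyms are everything reachable through chains of synonym links. A minimal illustration of that distinction (not the component's implementation):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SynonymGraphSketch {
    private final Map<String, Set<String>> edges = new HashMap<>();

    void addSynonym(String a, String b) {
        edges.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        edges.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    // Direct synonyms: the word itself plus its immediate neighbors.
    Set<String> direct(String word) {
        Set<String> result = new TreeSet<>(edges.getOrDefault(word, Set.of()));
        result.add(word);
        return result;
    }

    // Indirect synonyms: everything reachable through chains of synonym links.
    Set<String> indirect(String word) {
        Set<String> seen = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(word);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (seen.add(current)) queue.addAll(edges.getOrDefault(current, Set.of()));
        }
        return seen;
    }

    public static void main(String[] args) {
        SynonymGraphSketch g = new SynonymGraphSketch();
        g.addSynonym("A", "B");
        g.addSynonym("A", "C");
        g.addSynonym("B", "D");
        g.addSynonym("C", "E");
        System.out.println(g.direct("A"));   // [A, B, C]
        System.out.println(g.indirect("A")); // [A, B, C, D, E]
    }
}
```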
14. Antonym tagging
List<Word> words = WordSegmenter.segWithStopWords("what movies are worth watching at the beginning of May");
System.out.println(words);
The result is:
[May, beginning of the month, there are, which, movies, worth watching]
Apply antonym tagging:
AntonymTagging.process(words);
System.out.println(words);
The result is:
[May, beginning of the month [end of the month], there are, which, movies, worth watching]
List<Word> words = WordSegmenter.segWithStopWords("due to inadequate work and imperfect service, which caused customers an unpleasant experience while dining, the restaurant should make a sincere apology to the customers rather than be perfunctory");
System.out.println(words);
The result is:
[due to, work, not in place, service, imperfect, cause, customers, dining, when, happen, unpleasant, things, restaurant, side, should, to, customers, make, sincere, apology, rather than, perfunctory]
Apply antonym tagging:
AntonymTagging.process(words);
System.out.println(words);
The result is:
[due to, work, not in place, service, imperfect, cause, customers, dining, when, happen, unpleasant, things, restaurant, side, should, to, customers, make, sincere [hypocritical, false, deceitful], apology, rather than, perfunctory [meticulous, conscientious, doing one's best, striving for perfection, sincere]]
Take the word "beginning of the month" as an example:
You can get its antonyms through the getAntonym() method of Word:
System.out.println(word.getAntonym());
The result is:
[end of the month, month-end]
Note: if there are no antonyms, getAntonym() returns an empty collection: Collections.emptyList()
15. Pinyin tagging
List<Word> words = WordSegmenter.segWithStopWords("the mainland China box office of 'Fast and Furious 7' has exceeded 2 billion yuan in just two weeks since its release on April 12");
System.out.println(words);
The result is:
[speed, and, passion, 7, China, mainland, box office, since, April 12, release, since, in, just, two weeks, within, exceed, 2 billion, RMB]
Apply pinyin tagging:
PinyinTagging.process(words);
System.out.println(words);
The result is:
[speed sd sudu, and y yu, passion jq jiqing, 7, of d de, China zg zhongguo, mainland nd neidi, box office pf piaofang, since z zi, April 12, release sy shangying, since yl yilai, in z zai, just dd duanduan, two weeks lz liangzhou, within n nei, exceed tp tupo, 2 billion, RMB rmb renminbi]
Take the word "speed" as an example:
You can obtain the full pinyin "sudu" through the getFullPinYin() method of Word.
You can obtain the acronym pinyin "sd" through the getAcronymPinYin() method of Word.
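The relationship between the full pinyin ("sudu") and the acronym pinyin ("sd") is simply the first letter of each character's syllable. A toy sketch of that derivation, assuming the per-character syllables already come from a pronunciation dictionary such as the one bundled with word:

```java
import java.util.List;

class PinyinSketch {
    // Derive the acronym pinyin from per-character pinyin syllables,
    // e.g. ["su", "du"] -> "sd".
    static String acronym(List<String> syllables) {
        StringBuilder sb = new StringBuilder();
        for (String syllable : syllables) sb.append(syllable.charAt(0));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(acronym(List.of("su", "du")));     // sd
        System.out.println(acronym(List.of("piao", "fang"))); // pf
    }
}
```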
16. Lucene plug-in:
1. Construct the word analyzer ChineseWordAnalyzer:
Analyzer analyzer = new ChineseWordAnalyzer();
If you need a specific segmentation algorithm, specify it through the constructor:
Analyzer analyzer = new ChineseWordAnalyzer(SegmentationAlgorithm.FullSegmentation);
If not specified, the bidirectional maximum matching algorithm SegmentationAlgorithm.BidirectionalMaximumMatching is used by default.
For the available segmentation algorithms, see the enum class SegmentationAlgorithm.
2. Use the word analyzer to segment text:
TokenStream tokenStream = analyzer.tokenStream("text", "Yang Shangchuan is the author of the APDPlat application-level product development platform");
// prepare for consumption
tokenStream.reset();
// start consuming
while (tokenStream.incrementToken()) {
    // the word
    CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
    // the start and end offsets of the word in the text
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    // position increment
    PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // part of speech
    PartOfSpeechAttribute partOfSpeechAttribute = tokenStream.getAttribute(PartOfSpeechAttribute.class);
    // acronym pinyin
    AcronymPinyinAttribute acronymPinyinAttribute = tokenStream.getAttribute(AcronymPinyinAttribute.class);
    // full pinyin
    FullPinyinAttribute fullPinyinAttribute = tokenStream.getAttribute(FullPinyinAttribute.class);
    // synonyms
    SynonymAttribute synonymAttribute = tokenStream.getAttribute(SynonymAttribute.class);
    // antonyms
    AntonymAttribute antonymAttribute = tokenStream.getAttribute(AntonymAttribute.class);
    LOGGER.info(charTermAttribute.toString() + " (" + offsetAttribute.startOffset() + " - " + offsetAttribute.endOffset() + ") " + positionIncrementAttribute.getPositionIncrement());
    LOGGER.info("PartOfSpeech: " + partOfSpeechAttribute.toString());
    LOGGER.info("AcronymPinyin: " + acronymPinyinAttribute.toString());
    LOGGER.info("FullPinyin: " + fullPinyinAttribute.toString());
    LOGGER.info("Synonym: " + synonymAttribute.toString());
    LOGGER.info("Antonym: " + antonymAttribute.toString());
}
// finished consuming
tokenStream.close();
3. Use the word analyzer to build a Lucene index:
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
4. Use the word analyzer to query a Lucene index:
QueryParser queryParser = new QueryParser("text", analyzer);
Query query = queryParser.parse("text:Yang Shangchuan");
TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);
17. Solr plug-in:
1. Download word-1.3.jar
Download address: http://search.maven.org/remotecontent?filepath=org/apdplat/word/1.3/word-1.3.jar
2. Create the directory solr-5.1.0/example/solr/lib and copy the word-1.3.jar to the lib directory
3. Configure the word tokenizer in the schema:
In the solr-5.1.0/example/solr/collection1/conf/schema.xml file, replace every tokenizer element with the word tokenizer and remove all filter tags.
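For reference, a replacement fieldType might look like the sketch below. The tokenizer factory class name and its attributes are assumptions based on the word project's Solr integration; verify them against the jar you downloaded:

```xml
<fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- word tokenizer; segAlgorithm and conf are optional attributes -->
    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"
               segAlgorithm="BidirectionalMaximumMatching"
               conf="word.local.conf"/>
  </analyzer>
</fieldType>
```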
4. If you need to use a specific word segmentation algorithm:
Available values for segAlgorithm are:
Forward maximum matching algorithm: MaximumMatching
Inverse maximum matching algorithm: ReverseMaximumMatching
Forward minimum matching algorithm: MinimumMatching
Inverse minimum matching algorithm: ReverseMinimumMatching
Bi-directional maximum matching algorithm: BidirectionalMaximumMatching
Bi-directional minimum matching algorithm: BidirectionalMinimumMatching
Bi-directional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching
Full segmentation algorithm: FullSegmentation
Minimum word segmentation algorithm: MinimalWordCount
Maximum Ngram score algorithm: MaxNgramScore
If not specified, the bi-directional maximum matching algorithm is used by default: BidirectionalMaximumMatching
5. If you need to specify a custom configuration file:
For the configurable contents of the word.local.conf file, see the word.conf file inside word-1.3.jar.
If not specified, the default configuration file word.conf inside word-1.3.jar is used.
18. ElasticSearch plug-in:
1. Open a command line and change to the bin directory of elasticsearch:
cd elasticsearch-1.5.1/bin
2. Run the plugin script to install the word segmentation plug-in:
./plugin -u http://apdplat.org/word/archive/v1.2.zip -i word
3. Modify the file elasticsearch-1.5.1/config/elasticsearch.yml and add the following configuration:
index.analysis.analyzer.default.type: "word"
index.analysis.tokenizer.default.type: "word"
4. Start ElasticSearch and test the effect by visiting the following URL in a browser:
http://localhost:9200/_analyze?analyzer=word&text=Yang Shangchuan is the author of the APDPlat application-level product development platform
5. Custom configuration:
Modify the configuration file elasticsearch-1.5.1/plugins/word/word.local.conf
6. Specify the segmentation algorithm:
Modify the file elasticsearch-1.5.1/config/elasticsearch.yml and add the following configuration:
index.analysis.analyzer.default.segAlgorithm: "ReverseMinimumMatching"
index.analysis.tokenizer.default.segAlgorithm: "ReverseMinimumMatching"
The values that can be specified by segAlgorithm here are:
Forward maximum matching algorithm: MaximumMatching
Inverse maximum matching algorithm: ReverseMaximumMatching
Forward minimum matching algorithm: MinimumMatching
Inverse minimum matching algorithm: ReverseMinimumMatching
Bi-directional maximum matching algorithm: BidirectionalMaximumMatching
Bi-directional minimum matching algorithm: BidirectionalMinimumMatching
Bi-directional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching
Full segmentation algorithm: FullSegmentation
Minimum word segmentation algorithm: MinimalWordCount
Maximum Ngram score algorithm: MaxNgramScore
If not specified, the bi-directional maximum matching algorithm is used by default: BidirectionalMaximumMatching
19. Luke plug-in:
1. Download http://luke.googlecode.com/files/lukeall-4.0.0-ALPHA.jar (may not be accessible from mainland China)
2. Download and decompress the Java Chinese word segmentation component word-1.0-bin.zip: http://pan.baidu.com/s/1dDziDFz
3. Copy the 4 jar packages from the decompressed word-1.0-bin/word-1.0 folder into the current folder. Open lukeall-4.0.0-ALPHA.jar with an archive tool such as WinRAR, and drag everything in the current folder into lukeall-4.0.0-ALPHA.jar, except the META-INF folder, the .jar, .bat, and .html files, and word.local.conf.
4. Run java -jar lukeall-4.0.0-ALPHA.jar to start Luke. In the Analysis section of the Search tab, you can select the org.apdplat.word.lucene.ChineseWordAnalyzer analyzer.
5. In "Available analyzers found on the current classpath" on the Plugins tab, you can also select the org.apdplat.word.lucene.ChineseWordAnalyzer analyzer.
Note: if you want to integrate another version of the analyzer yourself, run mvn install in the project root to compile the project, then run mvn dependency:copy-dependencies to copy the dependency jars into the target/dependency/ directory. target/dependency/slf4j-api-1.6.4.jar is the logging framework used by word; target/dependency/logback-classic-0.9.28.jar and target/dependency/logback-core-0.9.28.jar are the recommended logging implementation, whose configuration file is target/classes/logback.xml. target/word-1.3.jar is the main jar of the word component. If you customize the dictionary, you also need to modify the configuration file target/classes/word.conf.
Download the Luke plug-in with word integrated (for lucene 4.0.0): lukeall-4.0.0-ALPHA-with-word-1.0.jar
Download the Luke plug-in with word integrated (for lucene 4.10.3): lukeall-4.10.3-with-word-1.2.jar
Word vectors:
Count each word's context words in a large-scale corpus, and represent the word by the vector of those context words.
Word similarity can then be obtained by computing the similarity of the word vectors.
This rests on the assumption that two words are more similar if their context words are more similar.
Run the script demo-word-vector-corpus.bat in the project root directory to try the effect on the corpus that ships with the word project.
If you have your own text, you can use the script demo-word-vector-file.bat to segment the text, build word vectors, and compute similarities.
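The idea behind those scripts can be sketched in a few lines: build a context-word count vector for each word from a toy corpus, then compare words with cosine similarity. This is an illustration of the principle, not the project's actual implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordVectorSketch {
    // Count, for each word, how often each other word appears within `window`
    // positions of it; the resulting count map is the word's context vector.
    static Map<String, Map<String, Integer>> contextVectors(List<List<String>> sentences, int window) {
        Map<String, Map<String, Integer>> vectors = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                Map<String, Integer> vector = vectors.computeIfAbsent(sentence.get(i), k -> new HashMap<>());
                int from = Math.max(0, i - window);
                int to = Math.min(sentence.size() - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j != i) vector.merge(sentence.get(j), 1, Integer::sum);
                }
            }
        }
        return vectors;
    }

    // Cosine similarity between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = Math.sqrt(a.values().stream().mapToDouble(v -> (double) v * v).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(v -> (double) v * v).sum());
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("the", "cat", "runs"),
                List.of("the", "dog", "runs"));
        Map<String, Map<String, Integer>> v = contextVectors(corpus, 1);
        // "cat" and "dog" share identical contexts, so their similarity is 1.0
        System.out.println(cosine(v.get("cat"), v.get("dog")));
    }
}
```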
This concludes "how to use the word segmentation component in Java". Thank you for reading. If you want to learn more about the topic, you can follow the site, where the editor will keep publishing practical articles.