How to use the word segmenter in Java

2025-03-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces how to use the word segmentation component in Java. Many people run into these situations in real projects, so let me walk you through how to handle them. I hope you read carefully and come away with something useful!

word is a Chinese word segmentation component implemented in Java that can also run in a distributed setting. It provides a variety of dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It accurately recognizes English words, numbers, dates, times, and other quantity expressions, and can identify unknown words such as person names, place names, and organization names. Its behavior can be changed through custom configuration files: you can define user dictionaries (changes to which are detected automatically), run it in a large-scale distributed environment, flexibly choose among several segmentation algorithms, fine-tune segmentation results with the refine function, and enable part-of-speech tagging, synonym tagging, antonym tagging, pinyin tagging, and other features. It also integrates seamlessly with Lucene, Solr, ElasticSearch, and Luke. Note: word 1.3 requires JDK 1.8.
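To make the ngram disambiguation idea concrete, here is a minimal, self-contained sketch: candidate segmentations are scored with bigram counts and the highest-scoring one wins. The counts and tokens (English glosses of a classic Chinese ambiguity) are invented for illustration; this is not the library's actual scoring code.

```java
import java.util.*;

// Toy sketch of ngram (here: bigram) disambiguation; invented data,
// not the library's implementation.
public class BigramDisambiguation {
    // Hypothetical bigram frequency table learned from a corpus.
    static final Map<String, Integer> BIGRAM = Map.of(
            "research|life", 2,
            "graduate|fate", 0,
            "life|origin", 5);

    // Score a candidate segmentation by summing adjacent-pair counts.
    static int score(List<String> segmentation) {
        int s = 0;
        for (int i = 0; i + 1 < segmentation.size(); i++) {
            s += BIGRAM.getOrDefault(segmentation.get(i) + "|" + segmentation.get(i + 1), 0);
        }
        return s;
    }

    // Pick the candidate segmentation with the highest bigram score.
    static List<String> best(List<List<String>> candidates) {
        return Collections.max(candidates, Comparator.comparingInt(BigramDisambiguation::score));
    }

    public static void main(String[] args) {
        List<List<String>> candidates = List.of(
                List.of("research", "life", "origin"),
                List.of("graduate", "fate", "origin"));
        System.out.println(best(candidates)); // [research, life, origin]
    }
}
```

The real component scores far more candidates over a large ngram model, but the principle is the same: corpus statistics decide between segmentations the dictionary alone cannot distinguish.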

Maven dependencies:

Specify the dependency in pom.xml. Available versions are 1.0, 1.1, and 1.2:

<dependency>
    <groupId>org.apdplat</groupId>
    <artifactId>word</artifactId>
    <version>1.2</version>
</dependency>

How to use the word segmenter:

1. Quick experience

Run the script demo-word.bat in the root directory of the project to quickly experience the effect of word segmentation

Usage: command [text] [input] [output]

Available values for command are: demo, text, file

demo

text Yang Shangchuan is the author of the APDPlat application-level product development platform

file d:/text.txt d:/word.txt

exit

2. Segmenting text

Remove stop words: List<Word> words = WordSegmenter.seg("Yang Shangchuan is the author of the APDPlat application-level product development platform");

Keep stop words: List<Word> words = WordSegmenter.segWithStopWords("Yang Shangchuan is the author of the APDPlat application-level product development platform");

System.out.println(words);

Output:

Remove stop words: [Yang Shangchuan, apdplat, application level, product, development platform, author]

Retain stop words: [Yang Shangchuan, yes, apdplat, application level, product, development platform, author]

3. Segmenting the document

String input = "d:/text.txt";

String output = "d:/word.txt";

Remove stop words: WordSegmenter.seg(new File(input), new File(output));

Keep stop words: WordSegmenter.segWithStopWords(new File(input), new File(output));

4. Custom configuration file

The default configuration file is word.conf on the classpath, packaged inside word-x.x.jar.

The custom configuration file is word.local.conf on the classpath, which must be provided by the user.

Where a configuration item appears in both files, the custom value overrides the default.

Configuration files are encoded in UTF-8.

5. Custom user dictionary

A custom user dictionary consists of one or more folders or files; absolute or relative paths can be used.

The user dictionary is made up of dictionary files, all encoded in UTF-8.

Each dictionary file is a plain-text file with one word per line.

Paths can be specified through system properties or configuration files, with multiple paths separated by commas.

For dictionary files on the classpath, prefix the relative path with classpath:

There are three ways to specify:

Method 1: specify programmatically (highest priority):

WordConfTools.set("dic.path", "classpath:dic.txt,d:/custom_dic");

DictionaryFactory.reload(); // after changing the dictionary path, reload the dictionary

Method 2: specify via a JVM startup parameter (medium priority):

java -Ddic.path=classpath:dic.txt,d:/custom_dic

Method 3: specify via the configuration file (lowest priority):

Use the file word.local.conf on the classpath to specify the configuration:

dic.path=classpath:dic.txt,d:/custom_dic

If not specified, the dic.txt dictionary file under the classpath is used by default
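The three-level priority above can be sketched as follows. This is a simplified stand-in with hypothetical names, not the real resolution logic inside WordConfTools:

```java
import java.util.*;

// Sketch of the three-way dic.path precedence: programmatic > JVM
// property > config file > built-in default. Hypothetical helper, not
// the library's actual code.
public class DicPathResolver {
    // Values set programmatically at runtime (highest priority).
    static final Map<String, String> PROGRAMMATIC = new HashMap<>();

    static String resolve(String key, Properties jvmProps,
                          Map<String, String> confFile, String defaultValue) {
        if (PROGRAMMATIC.containsKey(key)) return PROGRAMMATIC.get(key); // high
        String jvm = jvmProps.getProperty(key);
        if (jvm != null) return jvm;                                     // medium
        if (confFile.containsKey(key)) return confFile.get(key);         // low
        return defaultValue;                      // fallback: classpath dic.txt
    }

    public static void main(String[] args) {
        Properties jvm = new Properties();
        Map<String, String> conf = Map.of("dic.path", "classpath:dic.txt,d:/custom_dic");
        // Only word.local.conf sets the key, so its value wins.
        System.out.println(resolve("dic.path", jvm, conf, "classpath:dic.txt")); // classpath:dic.txt,d:/custom_dic
        // A -D style JVM property overrides the config file.
        jvm.setProperty("dic.path", "d:/jvm_dic");
        System.out.println(resolve("dic.path", jvm, conf, "classpath:dic.txt")); // d:/jvm_dic
        // A programmatic call overrides everything.
        PROGRAMMATIC.put("dic.path", "d:/programmatic_dic");
        System.out.println(resolve("dic.path", jvm, conf, "classpath:dic.txt")); // d:/programmatic_dic
    }
}
```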

6. Custom stop-word dictionary

Usage is similar to the custom user dictionary, with the following configuration item:

stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic

7. Automatic detection of dictionary changes

Changes to the custom user dictionary and the custom stop-word dictionary are detected automatically.

This covers files and folders on the classpath, as well as absolute and relative paths outside the classpath.

For example:

classpath:dic.txt,classpath:custom_dic_dir

d:/dic_more.txt,d:/DIC_DIR,d:/DIC2_DIR,my_dic_dir,my_dic_file.txt

classpath:stopwords.txt,classpath:custom_stopwords_dic_dir

d:/stopwords_more.txt,d:/STOPWORDS_DIR,d:/STOPWORDS2_DIR,stopwords_dir,remove.txt

8. Specifying the segmentation algorithm explicitly

When segmenting text, you can explicitly specify a particular segmentation algorithm, for example:

WordSegmenter.seg("APDPlat application-level product development platform", SegmentationAlgorithm.BidirectionalMaximumMatching);

The optional types of SegmentationAlgorithm are:

Forward maximum matching algorithm: MaximumMatching

Inverse maximum matching algorithm: ReverseMaximumMatching

Forward minimum matching algorithm: MinimumMatching

Inverse minimum matching algorithm: ReverseMinimumMatching

Bi-directional maximum matching algorithm: BidirectionalMaximumMatching

Bi-directional minimum matching algorithm: BidirectionalMinimumMatching

Bi-directional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching

Full segmentation algorithm: FullSegmentation

Minimum word segmentation algorithm: MinimalWordCount

Maximum Ngram score algorithm: MaxNgramScore
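As an illustration of the simplest entry in the list, forward maximum matching greedily takes the longest dictionary word at each position. This is a self-contained sketch with a toy dictionary, not the library's implementation (which uses optimized dictionary structures and real Chinese text):

```java
import java.util.*;

// Sketch of forward maximum matching (MaximumMatching): at each position,
// try the longest window first and shrink until a dictionary hit.
public class MaximumMatching {
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(maxWordLen, text.length() - i);
            // Shrink the window until it matches a dictionary word.
            while (len > 1 && !dict.contains(text.substring(i, i + len))) {
                len--;
            }
            result.add(text.substring(i, i + len)); // single char if nothing matched
            i += len;
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("word", "segmentation");
        System.out.println(segment("wordsegmentation", dict, 12)); // [word, segmentation]
    }
}
```

Reverse maximum matching scans from the end of the text instead, and the bidirectional variants run both and pick the better result, which is why BidirectionalMaximumMatching is the default.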

9. Evaluation of the effect of word segmentation

Run the script evaluation.bat in the project root directory to evaluate the effect of word segmentation

The test text used for the evaluation has 2,533,709 lines and 28,374,490 characters in total.

The evaluation results are located in the target/evaluation directory:

corpus-text.txt is the manually annotated, pre-segmented text, with words separated by spaces.

test-text.txt is the test text, produced by splitting corpus-text.txt into multiple lines at punctuation marks.

standard-text.txt is the manual annotation corresponding to the test text, serving as the gold standard for correct segmentation.

result-text-***.txt (where *** is the name of a segmentation algorithm) holds the segmentation results produced by word.

perfect-result-***.txt (where *** is the name of a segmentation algorithm) holds the lines whose segmentation exactly matches the manual annotation.

wrong-result-***.txt (where *** is the name of a segmentation algorithm) holds the lines whose segmentation differs from the manual annotation.

10. Distributed Chinese word segmentation

1. In the custom configuration file word.conf or word.local.conf, point all *.path configuration items at HTTP resources, and also specify the redis.* configuration items.

2. Configure and start a web server that serves those HTTP resources, deploying the project https://github.com/ysc/word_web to Tomcat.

3. Configure and start the redis server.

11. Part-of-speech tagging (available starting with word 1.3)

Pass the segmentation result to the process method of the PartOfSpeechTagging class; the part of speech is stored in the partOfSpeech field of each Word.

As follows:

List<Word> words = WordSegmenter.segWithStopWords("I love China");

System.out.println("Without POS tags: " + words);

// part-of-speech tagging

PartOfSpeechTagging.process(words);

System.out.println("With POS tags: " + words);

Output:

Without POS tags: [I, love, China]

With POS tags: [I/r, love/v, China/ns]

12. refine

Let's look at an example of syncopation:

List<Word> words = WordSegmenter.segWithStopWords("China's working class and the broad masses of working people should unite more closely around the CPC Central Committee");

System.out.println(words);

The results are as follows:

[our country, the working class, and, the broad masses of the working people, want, closer, closer, unite, in and around the Party Central Committee]

If the result of the segmentation we want is:

[our country, workers, classes, and, the broad masses, labor, the masses, want, closer, closer, United, in and around the Party Central Committee]

That is, we want to split "the working class" into "workers" and "classes", and "the working people" into "labor" and "the masses". How do we do that?

We can add the following split rules to the file classpath:word_refine.txt specified by the word.refine.path configuration item:

working class=workers classes

working masses=labor masses

Then we refine the segmentation result:

words = WordRefiner.refine(words);

System.out.println(words);

In this way, we can achieve the effect we want:

[our country, workers, classes, and, the broad masses, labor, the masses, want, closer, closer, United, in and around the Party Central Committee]

Let's look at another example of syncopation:

List<Word> words = WordSegmenter.segWithStopWords("make new achievements on the great journey to achieve the two centenary goals");

System.out.println(words);

The results are as follows:

[in, achieve, two, one hundred years, goal, greatness, journey, top, re-creation, new, achievement]

If the result of the segmentation we want is:

[in, realize, two hundred years, struggle goal, great journey, upper, re-creation, new, achievement]

That is, we want to merge "two" and "one hundred years" into "two hundred years", and "greatness" and "journey" into "great journey". How do we do that?

We can add the following merge rules to the file classpath:word_refine.txt specified by the word.refine.path configuration item:

two one hundred years=two hundred years

greatness journey=great journey

Then we refine the segmentation result:

words = WordRefiner.refine(words);

System.out.println(words);

In this way, we can achieve the effect we want:

[in, realize, two hundred years, struggle goal, great journey, upper, re-creation, new, achievement]
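The two refine behaviors above (splitting one word into several, merging adjacent words into one) can be sketched as follows. The in-memory rule representation is an assumption modeled on the examples; it is not a parser for the real word_refine.txt format:

```java
import java.util.*;

// Sketch of the refine step: split rules map one word to several parts,
// merge rules map an adjacent pair to a single word. Toy stand-in, not
// the WordRefiner implementation.
public class RefineSketch {
    static List<String> refine(List<String> words,
                               Map<String, List<String>> splitRules,
                               Map<String, String> mergeRules) {
        // 1. Apply split rules.
        List<String> split = new ArrayList<>();
        for (String w : words) {
            split.addAll(splitRules.getOrDefault(w, List.of(w)));
        }
        // 2. Apply merge rules to adjacent pairs, left to right.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < split.size(); i++) {
            if (i + 1 < split.size()) {
                String merged = mergeRules.get(split.get(i) + " " + split.get(i + 1));
                if (merged != null) {
                    result.add(merged);
                    i++; // consume both words of the pair
                    continue;
                }
            }
            result.add(split.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> splits = Map.of("the working class", List.of("workers", "classes"));
        Map<String, String> merges = Map.of("greatness journey", "great journey");
        System.out.println(refine(List.of("the working class", "and", "greatness", "journey"), splits, merges));
        // [workers, classes, and, great journey]
    }
}
```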

13. Synonym tagging

List<Word> words = WordSegmenter.segWithStopWords("Chu Li Mo tries every possible means to retrieve memories mercilessly");

System.out.println(words);

The results are as follows:

[Chu Li Mo, do everything possible, for, ruthless, get back, memory]

Apply synonym tagging:

SynonymTagging.process(words);

System.out.println(words);

The results are as follows:

[Chu Li Mo, do everything possible [for a long time intentional, painstaking, trying, painstaking], for, mercilessly, to retrieve, memory [image]]

If indirect synonyms are enabled:

SynonymTagging.process(words, false);

System.out.println(words);

The results are as follows:

[Chu Li Mo, do everything possible [for a long time intentional, painstaking effort, trying every means, painstaking effort], for, ruthless, retrieve, memory [image, image]]

List<Word> words = WordSegmenter.segWithStopWords("Old people with strong hands tend to live longer");

System.out.println(words);

The results are as follows:

[strong hands, big, old, often, more, longevity]

Apply synonym tagging:

SynonymTagging.process(words);

System.out.println(words);

The results are as follows:

[strong hand, big, old man [old man], often [often], more, longer life [longevity, turtle age]]

If indirect synonyms are enabled:

SynonymTagging.process(words, false);

System.out.println(words);

The results are as follows:

[hand strength, big, old man [old man], often [as usual, ordinary, ordinary, regular, ordinary, daily, ordinary, ordinary, often, often], more, longevity [longevity, turtle age]]

Take the word "do everything possible" as an example:

You can obtain its synonyms through the getSynonym() method of the Word class:

System.out.println(word.getSynonym());

The results are as follows:

[deliberate, painstaking, painstaking, painstaking efforts]

Note: if there are no synonyms, getSynonym() returns an empty collection: Collections.emptyList()

The differences between indirect synonyms and direct synonyms are as follows:

Suppose:

A and B are synonyms, A and C are synonyms, B and D are synonyms, and C and E are synonyms.

Then:

For A, the direct synonyms are A, B, C.

For B, the direct synonyms are A, B, D.

For C, the direct synonyms are A, C, E.

For each of A, B, and C, the indirect synonyms are A, B, C, D, E.
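The distinction above can be reproduced with a small graph sketch: direct synonyms are immediate neighbors in the synonym graph, while indirect synonyms are everything reachable through synonym links. This BFS stand-in is illustrative, not the library's dictionary code:

```java
import java.util.*;

// Sketch of direct vs. indirect synonyms over an undirected synonym graph.
public class SynonymGraph {
    static final Map<String, Set<String>> graph = new HashMap<>();

    static void link(String a, String b) {
        graph.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    // Direct synonyms: the word plus its immediate neighbors.
    static Set<String> direct(String w) {
        Set<String> s = new TreeSet<>(graph.getOrDefault(w, Set.of()));
        s.add(w);
        return s;
    }

    // Indirect synonyms: everything reachable through synonym links (BFS).
    static Set<String> indirect(String w) {
        Set<String> seen = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(w));
        while (!queue.isEmpty()) {
            String cur = queue.poll();
            if (seen.add(cur)) {
                queue.addAll(graph.getOrDefault(cur, Set.of()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        link("A", "B"); link("A", "C"); link("B", "D"); link("C", "E");
        System.out.println(direct("A"));   // [A, B, C]
        System.out.println(indirect("A")); // [A, B, C, D, E]
    }
}
```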

14. Antonym tagging

List<Word> words = WordSegmenter.segWithStopWords("what movies are worth watching at the beginning of May");

System.out.println(words);

The results are as follows:

[5, at the beginning of the month, there are, which movies are worth watching]

Apply antonym tagging:

AntonymTagging.process(words);

System.out.println(words);

The results are as follows:

[5, at the beginning of the month [at the end of the month], which movies are worth watching]

List<Word> words = WordSegmenter.segWithStopWords("due to inadequate work and poor service, the restaurant should make a sincere apology to the customer rather than being perfunctory");

System.out.println(words);

The results are as follows:

[due to, work, not in place, service, imperfect, cause, customer, in dining, when, happen, unpleasant, thing, restaurant, aspect, should, to, customer, make, sincere, apology, rather than, perfunctory]

Apply antonym tagging:

AntonymTagging.process(words);

System.out.println(words);

The results are as follows:

[due to, work, not in place, service, imperfect, cause, customer, in, meal, time, happen, unpleasant, thing, restaurant, aspect, should, to, customer, make, sincere [deceitful, hypocritical, false], apologize, rather than, perfunctory [meticulous, conscientious, do one's best, strive for perfection, sincere]]

Take the word "the beginning of the month" as an example:

You can obtain its antonyms through the getAntonym() method of the Word class:

System.out.println(word.getAntonym());

The results are as follows:

[end of month, end of month]

Note: if there are no antonyms, getAntonym() returns an empty collection: Collections.emptyList()

15. Pinyin tagging

List<Word> words = WordSegmenter.segWithStopWords("The box office of \"Fast and Furious 7\" in mainland China has exceeded 2 billion yuan in just two weeks since its release on April 12");

System.out.println(words);

The results are as follows:

[speed, and, passion, 7, China, mainland, box office, since, April, 12th, release, since, in, short, two weeks, within, breakthrough, 2 billion, RMB]

Apply pinyin tagging:

PinyinTagging.process(words);

System.out.println(words);

The results are as follows:

[speed sd sudu, with y yu, Passion jq jiqing, 7, d de, China zg zhongguo, mainland nd neidi, box office pf piaofang, since z zi, April 12, released sy shangying, since yl yilai, in z zai, short dd duanduan, two weeks lz liangzhou, n nei, break tp tupo, 2 billion, RMB rmb renminbi]

Take the word "speed" (full pinyin sudu) as an example:

You can obtain the full pinyin, such as sudu, through the getFullPinYin() method of Word.

You can obtain the acronym pinyin, such as sd, through the getAcronymPinYin() method of Word.
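The relationship between the two getters can be sketched simply: the acronym is the first letter of each syllable of the full pinyin. The space-separated syllable input here is a hypothetical simplification for illustration (the real getFullPinYin() returns the syllables joined together):

```java
// Sketch of deriving acronym pinyin (sd) from full pinyin (su du),
// assuming syllables are separated by spaces. Hypothetical helper, not
// the library's pinyin code.
public class PinyinAcronym {
    static String acronym(String fullPinyinWithSpaces) {
        StringBuilder sb = new StringBuilder();
        for (String syllable : fullPinyinWithSpaces.split(" ")) {
            if (!syllable.isEmpty()) {
                sb.append(syllable.charAt(0)); // first letter of each syllable
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(acronym("su du"));     // sd
        System.out.println(acronym("piao fang")); // pf
    }
}
```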

16. Lucene plug-in:

1. Construct the word analyzer ChineseWordAnalyzer:

Analyzer analyzer = new ChineseWordAnalyzer();

If you need to use a specific word segmentation algorithm, you can specify it through the constructor:

Analyzer analyzer = new ChineseWordAnalyzer(SegmentationAlgorithm.FullSegmentation);

If not specified, the bi-directional maximum matching algorithm is used by default: SegmentationAlgorithm.BidirectionalMaximumMatching

For available word segmentation algorithms, see enumerated class: SegmentationAlgorithm

2. Use the word analyzer to tokenize text:

TokenStream tokenStream = analyzer.tokenStream("text", "Yang Shangchuan is the author of the APDPlat application-level product development platform");
// prepare for consumption
tokenStream.reset();
// start consuming
while (tokenStream.incrementToken()) {
    // the word
    CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
    // start and end offsets of the word in the text
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    // position increment
    PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // part of speech
    PartOfSpeechAttribute partOfSpeechAttribute = tokenStream.getAttribute(PartOfSpeechAttribute.class);
    // acronym pinyin
    AcronymPinyinAttribute acronymPinyinAttribute = tokenStream.getAttribute(AcronymPinyinAttribute.class);
    // full pinyin
    FullPinyinAttribute fullPinyinAttribute = tokenStream.getAttribute(FullPinyinAttribute.class);
    // synonyms
    SynonymAttribute synonymAttribute = tokenStream.getAttribute(SynonymAttribute.class);
    // antonyms
    AntonymAttribute antonymAttribute = tokenStream.getAttribute(AntonymAttribute.class);
    LOGGER.info(charTermAttribute.toString() + " (" + offsetAttribute.startOffset() + " - " + offsetAttribute.endOffset() + ") " + positionIncrementAttribute.getPositionIncrement());
    LOGGER.info("PartOfSpeech: " + partOfSpeechAttribute.toString());
    LOGGER.info("AcronymPinyin: " + acronymPinyinAttribute.toString());
    LOGGER.info("FullPinyin: " + fullPinyinAttribute.toString());
    LOGGER.info("Synonym: " + synonymAttribute.toString());
    LOGGER.info("Antonym: " + antonymAttribute.toString());
}
// done consuming
tokenStream.close();

3. Use the word analyzer to build a Lucene index:

Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);

4. Use the word analyzer to query a Lucene index:

QueryParser queryParser = new QueryParser("text", analyzer);
Query query = queryParser.parse("text:Yang Shangchuan");
TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);

17. Solr plug-in:

1. Download word-1.3.jar

Download address: http://search.maven.org/remotecontent?filepath=org/apdplat/word/1.3/word-1.3.jar

2. Create the directory solr-5.1.0/example/solr/lib and copy the word-1.3.jar to the lib directory

3. Configure the schema to use the word tokenizer

In the file solr-5.1.0/example/solr/collection1/conf/schema.xml, replace all tokenizer elements with the word tokenizer and remove all filter tags.
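A sketch of what a resulting fieldType can look like follows. The factory class name and its attributes are assumptions based on the word project's conventions, so verify them against the version you deploy:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"
               segAlgorithm="BidirectionalMaximumMatching"
               conf="C:/word.local.conf"/>
  </analyzer>
</fieldType>
```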

4. If you need to use a specific word segmentation algorithm, set the segAlgorithm attribute on the tokenizer.

Available values for segAlgorithm are:

Forward maximum matching algorithm: MaximumMatching

Inverse maximum matching algorithm: ReverseMaximumMatching

Forward minimum matching algorithm: MinimumMatching

Inverse minimum matching algorithm: ReverseMinimumMatching

Bi-directional maximum matching algorithm: BidirectionalMaximumMatching

Bi-directional minimum matching algorithm: BidirectionalMinimumMatching

Bi-directional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching

Full segmentation algorithm: FullSegmentation

Minimum word segmentation algorithm: MinimalWordCount

Maximum Ngram score algorithm: MaxNgramScore

If not specified, the bi-directional maximum matching algorithm is used by default: BidirectionalMaximumMatching

5. If you need to specify a specific configuration file:

For the configurable contents of word.local.conf, see the word.conf file inside word-1.3.jar.

If not specified, the default configuration file word.conf inside word-1.3.jar is used.

18. ElasticSearch plug-in:

1. Open a command line and change to the bin directory of elasticsearch:

cd elasticsearch-1.5.1/bin

2. Run the plugin script to install the word segmentation plug-in:

./plugin -u http://apdplat.org/word/archive/v1.2.zip -i word

3. Modify the file elasticsearch-1.5.1/config/elasticsearch.yml and add the following configuration:

index.analysis.analyzer.default.type: "word"
index.analysis.tokenizer.default.type: "word"

4. Start ElasticSearch and test the effect by visiting this URL in a browser:

http://localhost:9200/_analyze?analyzer=word&text=Yang Shangchuan is the author of the APDPlat application-level product development platform

5. Custom configuration

Modify the configuration file elasticsearch-1.5.1/plugins/word/word.local.conf

6. Specify the word segmentation algorithm

Modify the file elasticsearch-1.5.1/config/elasticsearch.yml and add the following configuration:

index.analysis.analyzer.default.segAlgorithm: "ReverseMinimumMatching"
index.analysis.tokenizer.default.segAlgorithm: "ReverseMinimumMatching"

The values that can be specified by segAlgorithm here are:

Forward maximum matching algorithm: MaximumMatching

Inverse maximum matching algorithm: ReverseMaximumMatching

Forward minimum matching algorithm: MinimumMatching

Inverse minimum matching algorithm: ReverseMinimumMatching

Bi-directional maximum matching algorithm: BidirectionalMaximumMatching

Bi-directional minimum matching algorithm: BidirectionalMinimumMatching

Bi-directional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching

Full segmentation algorithm: FullSegmentation

Minimum word segmentation algorithm: MinimalWordCount

Maximum Ngram score algorithm: MaxNgramScore

If not specified, the bi-directional maximum matching algorithm is used by default: BidirectionalMaximumMatching

19. Luke plug-in:

1. Download http://luke.googlecode.com/files/lukeall-4.0.0-ALPHA.jar (may not be accessible from mainland China)

2. Download and decompress the Java Chinese word segmentation component word-1.0-bin.zip: http://pan.baidu.com/s/1dDziDFz

3. Extract the 4 jar packages from the word-1.0-bin/word-1.0 folder of the decompressed component into the current folder.

Open lukeall-4.0.0-ALPHA.jar with an archive tool such as WinRAR, and drag everything in the current folder into lukeall-4.0.0-ALPHA.jar, except the META-INF folder, the .jar, .bat, and .html files, and the word.local.conf file.

4. Run the command java -jar lukeall-4.0.0-ALPHA.jar to start Luke. In the Analysis section of the Search tab, you can select the org.apdplat.word.lucene.ChineseWordAnalyzer analyzer.

5. On the Plugins tab, under "Available analyzers found on the current classpath", you can also select the org.apdplat.word.lucene.ChineseWordAnalyzer analyzer.

Note: if you want to integrate another version of the word segmenter yourself, run mvn install from the project root to build it, then run mvn dependency:copy-dependencies to copy the dependency jars; all dependencies will then be in the target/dependency/ directory. target/dependency/slf4j-api-1.6.4.jar is the logging framework used by the word segmenter; target/dependency/logback-classic-0.9.28.jar and target/dependency/logback-core-0.9.28.jar are the recommended logging implementation, whose configuration file is target/classes/logback.xml. target/word-1.3.jar is the main jar of the word segmenter. If you need a custom dictionary, modify the configuration file target/classes/word.conf.

Download the integrated Luke plug-in (for lucene4.0.0): lukeall-4.0.0-ALPHA-with-word-1.0.jar

Download the integrated Luke plug-in (for lucene4.10.3): lukeall-4.10.3-with-word-1.2.jar

Word vectors:

From a large-scale corpus, count the context words of each word and represent the word as a vector of those context words.

The similarity between two words can then be obtained by computing the similarity of their word vectors.

The underlying assumption is that two words are more similar if their context words are more similar.

Run the script demo-word-vector-corpus.bat in the project root directory to try this on the corpus bundled with the word project.

If you have your own text, you can use the script demo-word-vector-file.bat to segment it, build word vectors, and compute similarities.
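The similarity computation described above can be sketched with sparse count vectors and cosine similarity. The vectors here are invented toy data, not output of the bundled scripts:

```java
import java.util.*;

// Sketch of word similarity via cosine similarity of context-word count
// vectors. Toy data for illustration.
public class WordVectorSimilarity {
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            nb += v * v;
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Context-word counts for three words; "cat" and "dog" share contexts.
        Map<String, Integer> cat = Map.of("pet", 4, "fur", 2, "food", 1);
        Map<String, Integer> dog = Map.of("pet", 3, "fur", 1, "walk", 2);
        Map<String, Integer> car = Map.of("road", 5, "fuel", 2);
        System.out.printf("cat~dog: %.3f%n", cosine(cat, dog)); // high (≈ 0.82)
        System.out.printf("cat~car: %.3f%n", cosine(cat, car)); // 0.000, no shared contexts
    }
}
```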

This concludes "How to use the word segmenter in Java". Thank you for reading! If you want to learn more, keep following the site, where more practical articles will be published.
