
What is the function of Sentence Detector?

2025-03-26 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains what the Apache OpenNLP Sentence Detector does and how to use, train and evaluate it. The approach is simple, fast and practical.

## Sentence Detection ##

The Apache OpenNLP Sentence Detector decides whether a punctuation character marks the end of a sentence. In this sense, a sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. The first and last sentences are exceptions to this rule: the first non-whitespace character is assumed to begin a sentence, and the last non-whitespace character is assumed to end one. The sample text below should be segmented into its sentences:

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

After the sentence boundaries are detected, each sentence is written on its own line:

    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
    Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
    Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
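To see why the decision "does this punctuation mark end a sentence?" needs a trained model, it may help to compare against a naive baseline. The sketch below is not OpenNLP code; `NaiveSplitDemo` and `naiveSplit` are hypothetical names, and the rule (break after every period followed by whitespace) is an assumption chosen only for illustration.

```java
class NaiveSplitDemo {
    // The sample text from this article, which contains three sentences.
    static final String SAMPLE =
          "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. "
        + "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. "
        + "Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, "
        + "was named a director of this British industrial conglomerate.";

    // Hypothetical baseline: break after every period that is followed by whitespace.
    static String[] naiveSplit(String text) {
        return text.trim().split("(?<=\\.)\\s+");
    }

    public static void main(String[] args) {
        // Prints one piece per line; the abbreviations "Nov." and "Mr." cause extra breaks.
        for (String piece : naiveSplit(SAMPLE)) {
            System.out.println(piece);
        }
    }
}
```

On the sample text this baseline yields five pieces instead of three, because it also breaks after "Nov." and "Mr.". Distinguishing such abbreviations from real sentence ends is the job the Sentence Detector model is trained for.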

Sentence detection is usually done before the text is tokenized, and the pre-trained models on the website were trained that way. It is also possible to tokenize first and let the Sentence Detector process the already tokenized text. The OpenNLP Sentence Detector cannot identify sentence boundaries based on the content of a sentence. A prominent example is the first sentence of an article, where the title is mistakenly treated as the first part of that sentence. Most components in OpenNLP expect input that has already been segmented into sentences.
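The title problem can be reproduced with any splitter that looks only at punctuation. The following is a minimal sketch under that assumption; `TitleBoundaryDemo` and its `split` method are hypothetical stand-ins, not part of OpenNLP, and the article text is invented for the example.

```java
class TitleBoundaryDemo {
    // Hypothetical punctuation-only splitter (not OpenNLP code): a boundary is
    // any '.', '!' or '?' that is followed by whitespace.
    static String[] split(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String article = "Sentence Detection in OpenNLP\n"
                + "The detector marks sentence ends. It works on raw text.";
        String[] parts = split(article);
        // The title has no closing punctuation, so it stays attached to the first sentence.
        System.out.println(parts.length + " pieces; first piece: " + parts[0].replace('\n', ' '));
    }
}
```

Because the title carries no end-of-sentence punctuation, nothing separates it from the sentence that follows. Handling titles therefore has to happen outside the punctuation-based step, for example by separating heading lines before detection; that workaround is a suggestion here, not something the original text prescribes.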

### Sentence Detection Tool ###

The easiest way to use Sentence Detector is the command line tool. This tool is for demonstration and testing only. Download the English sentence detector model and start Sentence Detector using the following command:

    $ opennlp SentenceDetector en-sent.bin

Just copy the above sample text to the console. Sentence Detector will read it and output one sentence per line to the console. Typically, the input is read from one file and the output is redirected to another file. This can be done with the following command.

    $ opennlp SentenceDetector en-sent.bin < input.txt > output.txt

For the English sentence model from the website, the input text should not be tokenized.

### Sentence Detection API ###

The Sentence Detector can be easily integrated into an application through its API. Before you can instantiate the Sentence Detector, you must first load the sentence model.

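    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceModel;

    // Declared outside the try block so the model is still in scope afterwards.
    SentenceModel model = null;
    // try-with-resources closes the stream even if loading fails.
    try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
        model = new SentenceModel(modelIn);
    } catch (IOException e) {
        // Loading failed; model stays null.
        e.printStackTrace();
    }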

After the model is loaded, you can instantiate the SentenceDetectorME.

    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

Sentence Detector can output an array of String, each element of which is a sentence.

    String[] sentences = sentenceDetector.sentDetect("  First sentence. Second sentence. ");

The resulting array contains two entries. The first String is "First sentence." and the second String is "Second sentence." The whitespace before, between and after the sentences in the input String is removed. The API also provides a method that simply returns the spans of the sentences in the input String.

    Span[] sentences = sentenceDetector.sentPosDetect("  First sentence. Second sentence. ");

This result array also contains two entries. The first span begins at index 2 and ends at 17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring that covers only the characters in the span.
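The indices quoted above are consistent with an input that has two leading spaces and one trailing space, with the start index inclusive and the end index exclusive. The sketch below checks that reading with plain `String` operations; it assumes that input, does not call OpenNLP, and `SpanDemo.locate` is a hypothetical helper that finds already-known sentences rather than detecting them.

```java
class SpanDemo {
    // Returns {start, end} pairs for each sentence found in text, in order.
    // start is inclusive and end is exclusive, like String.substring.
    static int[][] locate(String text, String... sentences) {
        int[][] spans = new int[sentences.length][];
        int from = 0;
        for (int i = 0; i < sentences.length; i++) {
            int start = text.indexOf(sentences[i], from);
            if (start < 0) {
                throw new IllegalArgumentException("not found: " + sentences[i]);
            }
            int end = start + sentences[i].length();
            spans[i] = new int[] {start, end};
            from = end;
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumed input: two leading spaces, one trailing space.
        String input = "  First sentence. Second sentence. ";
        for (int[] s : locate(input, "First sentence.", "Second sentence.")) {
            // substring(start, end) plays the role of the covered text of a span.
            System.out.println("[" + s[0] + ".." + s[1] + ") " + input.substring(s[0], s[1]));
        }
    }
}
```

Working with spans instead of copied strings keeps the link back to the original text, which is useful when later processing steps need character offsets.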

## Sentence Detector Training ##

### Training Tool ###

OpenNLP has a command-line tool for training the models available from the model download page on various corpora. The data must be converted to the OpenNLP Sentence Detector training format, which is one sentence per line. An empty line marks a document boundary. If the document boundaries are unknown, it is recommended to insert an empty line every few dozen lines. The format is exactly like the output in the sample above. Usage of the tool:

    $ opennlp SentenceDetectorTrainer
    Usage: opennlp SentenceDetectorTrainer[.namefinder|.conllx|.pos] [-abbDict path] \
              [-params paramsFile] [-iterations num] [-cutoff num] -model modelFile \
              -lang language -data sampleData [-encoding charsetName]
    Arguments description:
        -abbDict path
            abbreviation dictionary in XML format.
        -params paramsFile
            training parameters file.
        -iterations num
            number of training iterations, ignored if -params is used.
        -cutoff num
            minimal number of times a feature must be seen, ignored if -params is used.
        -model modelFile
            output model file.
        -lang language
            language which is being processed.
        -data sampleData
            data to be used, usually a file name.
        -encoding charsetName
            encoding for reading and writing text, if absent the system default is used.

To train an English sentence detector, use the following command:

    $ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8

It should produce the following output:

    Indexing events using cutoff of 5
        Computing event counts... Done. 4883 events
        Indexing... Done.
    Sorting and merging events... Done. Reduced 4883 events to 2945.
    Done indexing.
    Incorporating indexed data for training... Done.
        Number of Event Tokens: 2945
        Number of Outcomes: 2
        Number of Predicates: 467
    ...done.
    Computing model parameters...
    Performing 100 iterations.
      1:  .. Loglikelihood=-3384.6376826743144    0.38951464263772273
      2:  .. Loglikelihood=-2191.9266688597672    0.9397911120212984
      3:  .. Loglikelihood=-1645.8640771555981    0.9643661683391358
      4:  .. Loglikelihood=-1340.386303774519     0.9739913987302887
      5:  .. Loglikelihood=-1148.4141548519624    0.9748105672742167
     ...
     95:  .. Loglikelihood=-288.25556805874436    0.9834118369854598
     96:  .. Loglikelihood=-287.2283680343481     0.9834118369854598
     97:  .. Loglikelihood=-286.2174830344526     0.9834118369854598
     98:  .. Loglikelihood=-285.222486981048      0.9834118369854598
     99:  .. Loglikelihood=-284.24296917223916    0.9834118369854598
    100:  .. Loglikelihood=-283.2785335773966     0.9834118369854598
    Wrote sentence detector model.
    Path: en-sent.bin
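Before training, it can be worth checking that a data file really follows the training format described earlier: one sentence per line, with an empty line as a document boundary. The sketch below parses such text into documents using only the JDK; `TrainingFormatDemo` and `parseDocuments` are hypothetical names, and this is not how OpenNLP itself reads the data.

```java
import java.util.ArrayList;
import java.util.List;

class TrainingFormatDemo {
    // Groups non-empty lines into documents; an empty line closes the current document.
    static List<List<String>> parseDocuments(String data) {
        List<List<String>> documents = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : data.split("\r?\n")) {
            String sentence = line.trim();
            if (sentence.isEmpty()) {
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(sentence);
            }
        }
        if (!current.isEmpty()) {
            documents.add(current); // last document has no trailing empty line
        }
        return documents;
    }

    public static void main(String[] args) {
        String data = "First sentence of document one.\n"
                + "Second sentence of document one.\n"
                + "\n"
                + "Only sentence of document two.\n";
        List<List<String>> documents = parseDocuments(data);
        System.out.println(documents.size() + " documents, "
                + documents.get(0).size() + " sentences in the first");
    }
}
```

A quick count of documents and sentences like this can catch files where several sentences were accidentally left on one line or where document boundaries are missing.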

### Training API ###

Sentence Detector also provides an API to train a new sentence detection model. To train, there are three main steps:

The application must open a sample data stream

Call the SentenceDetectorME.train method

Save SentenceModel to a file or use it directly

The following sample code illustrates these three steps:

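    import java.io.BufferedOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.Charset;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    // Step 1: open a sample data stream over the training file.
    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

    // Step 2: train the model.
    SentenceModel model;
    try {
        model = SentenceDetectorME.train("en", sampleStream, true, null,
            TrainingParameters.defaultParams());
    } finally {
        sampleStream.close();
    }

    // Step 3: save the model. modelFile is assumed to be defined elsewhere.
    OutputStream modelOut = null;
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
        model.serialize(modelOut);
    } finally {
        if (modelOut != null) {
            modelOut.close();
        }
    }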

## Evaluation ##

### Evaluation Tool ###

This command shows how the evaluator tool works.

    $ opennlp SentenceDetectorEvaluator -model en-sent.bin -lang en -data en-sent.eval -encoding UTF-8
    Loading model... Done
    Evaluating... Done
    Precision: 0.9465737514518002
    Recall: 0.9095982142857143
    F-Measure: 0.9277177006260672

The en-sent.eval file has the same format as the training data.
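The three figures reported by the evaluator are related: the F-Measure shown is consistent with the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall; `FMeasureDemo` is a hypothetical helper written for this article, not part of the OpenNLP tooling.

```java
class FMeasureDemo {
    // Balanced F-measure: harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid dividing by zero when both are zero
        }
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double precision = 0.9465737514518002; // values from the evaluator output
        double recall = 0.9095982142857143;
        System.out.println("F-Measure: " + fMeasure(precision, recall));
    }
}
```

The result is approximately 0.9277, in line with the F-Measure in the evaluator output. Precision is the share of predicted sentence boundaries that are correct, and recall is the share of true boundaries that were found.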

