How to use the Tokenizer of OpenNLP

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to use the Tokenizer of OpenNLP: which tokenizer implementations are available, how to call them through the API, and how to train a new tokenizer model.

The OpenNLP Tokenizers split an input character sequence into tokens. Tokens are usually words, punctuation marks, numbers, and so on.

The following result shows the individual tokens in a whitespace separated representation.

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate .
A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago , researchers reported .

OpenNLP provides several Tokenizer implementations:

Whitespace Tokenizer - a whitespace tokenizer; every sequence of non-whitespace characters is identified as a token

Simple Tokenizer - a character class tokenizer; every sequence of characters of the same class is a token

Learnable Tokenizer - a maximum entropy tokenizer that detects token boundaries based on a probability model
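To make the difference between the first two strategies concrete, here is a minimal plain-Java sketch. It does not use OpenNLP and is not the library's implementation; the class name `TokenizerSketch`, the method names, and the four coarse character classes are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    // Coarse character classes assumed for this sketch.
    enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Whitespace strategy: every maximal run of non-whitespace characters is a token.
    public static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Character-class strategy: a token ends whenever the character class changes.
    public static List<String> characterClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the token being built, -1 if none
        CharClass current = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass cls = classify(text.charAt(i));
            if (cls != current) {
                if (start >= 0) tokens.add(text.substring(start, i));
                start = (cls == CharClass.WHITESPACE) ? -1 : i;
                current = cls;
            }
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Pierre Vinken, 61 years old.";
        System.out.println(whitespaceTokenize(text));
        System.out.println(characterClassTokenize(text));
    }
}
```

With the whitespace strategy the comma stays attached to "Vinken,", while the character-class strategy separates it because punctuation falls into a different class than letters. Neither sketch can decide whether the period in an abbreviation such as "Nov." belongs to the token, which is the kind of decision a learnable tokenizer is trained for.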

Most part-of-speech taggers, parsers, and similar components work with text tokenized in this way. It is important to make sure that your tokenizer produces tokens of the type expected by your later text processing components.

With OpenNLP (and many other systems), tokenization is a two-stage process: first, identify sentence boundaries, and then identify the tokens of each sentence.
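The two stages can be sketched in plain Java as follows. This is a deliberately naive illustration, not OpenNLP code: the class `TwoStageSketch` and its methods are hypothetical, sentence boundaries are guessed with a regular expression, and tokens are split on whitespace only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TwoStageSketch {

    // Stage 1 (naive): split after '.', '!' or '?' when followed by whitespace.
    public static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }

    // Stage 2 (naive): split one sentence on whitespace.
    public static List<String> tokenize(String sentence) {
        return Arrays.asList(sentence.trim().split("\\s+"));
    }

    // Runs both stages: one token list per detected sentence.
    public static List<List<String>> process(String text) {
        List<List<String>> result = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            result.add(tokenize(sentence));
        }
        return result;
    }

    public static void main(String[] args) {
        for (List<String> tokens : process("The board met today. Who will join it?")) {
            System.out.println(String.join(" ", tokens));
        }
    }
}
```

A rule like the one in `splitSentences` would wrongly break a sentence after "Nov." in "Nov. 29", which is one reason to use a trained sentence detector for the first stage rather than a fixed pattern.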

# Tokenizer Tools

### Tokenizer API

The tokenizers can be integrated into an application through their API. The shared instance of WhitespaceTokenizer can be retrieved from the static field WhitespaceTokenizer.INSTANCE. The shared instance of SimpleTokenizer can be retrieved in the same way from SimpleTokenizer.INSTANCE. Before you can instantiate TokenizerME (the learnable tokenizer), you must first load a token model. The following code example shows how to load a model.

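import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerModel;

// Declared outside the try block so the model is still in scope afterwards.
TokenizerModel model = null;

// try-with-resources closes the stream even if loading fails.
try (InputStream modelIn = new FileInputStream("en-token.bin")) {
    model = new TokenizerModel(modelIn);
} catch (IOException e) {
    e.printStackTrace();
}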

After the model is loaded, you can instantiate the TokenizerME.

Tokenizer tokenizer = new TokenizerME(model);

The tokenizer provides two tokenize methods. Both expect an input String that contains the untokenized text. If possible the input should be a single sentence, but depending on how the learnable tokenizer was trained this is not required. The first method returns an array of Strings, where each String is one token.

String tokens[] = tokenizer.tokenize("An input sample sentence.");

The output is an array containing these tokens.

"An", "input", "sample", "sentence", "."

The second method, tokenizePos, returns an array of Spans. Each Span contains the start and end character offsets of one token in the input String.

Span tokenSpans[] = tokenizer.tokenizePos("An input sample sentence.");

The tokenSpans array now has five elements. To get the text of a span, call getCoveredText on the Span and pass it the input text.

TokenizerME can also output the probabilities of the detected tokens. The getTokenProbabilities method must be called directly after one of the tokenize methods has been called.

TokenizerME tokenizer = ...

String tokens[] = tokenizer.tokenize(...);
double tokenProbs[] = tokenizer.getTokenProbabilities();

The tokenProbs array contains one double value per token. Each value lies between 0 and 1, where 1 is the highest possible probability and 0 is the lowest.
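Returning to the character offsets that tokenizePos reports, the following plain-Java sketch shows how start and end offsets relate to the covered text. It is an illustration only and does not use OpenNLP: `SpanSketch`, `tokenizePositions`, and `coveredText` are hypothetical names, tokens are found by whitespace alone, and the end offset is treated as exclusive.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanSketch {

    // Returns {start, end} pairs for each whitespace-delimited token; end is exclusive.
    public static List<int[]> tokenizePositions(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // start of the token being scanned, -1 if between tokens
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;
            } else if (ws && start >= 0) {
                spans.add(new int[] {start, i});
                start = -1;
            }
        }
        if (start >= 0) spans.add(new int[] {start, text.length()});
        return spans;
    }

    // Recovers the token text from a span and the original input.
    public static String coveredText(int[] span, String text) {
        return text.substring(span[0], span[1]);
    }

    public static void main(String[] args) {
        String text = "An input sample sentence.";
        for (int[] span : tokenizePositions(text)) {
            System.out.println(span[0] + ".." + span[1] + " " + coveredText(span, text));
        }
    }
}
```

Because this sketch splits on whitespace only, the final period stays attached to "sentence." and the input yields four spans, whereas the five elements described above include a separate span for the period. Working with offsets instead of Strings keeps the link back to the original text, which is useful when tokens need to be highlighted or aligned with other annotations.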

# Tokenizer Training

### Training API

The tokenizer offers an API to train a new tokenization model. Training requires three basic steps:

The application must open a sample data stream

Call the TokenizerME.train method

Save TokenizerModel to a file or use it directly
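For the first step it helps to know what a line of sample data looks like. The sketch below assumes the one-sentence-per-line layout in which tokens are separated either by whitespace or by a `<SPLIT>` marker; treat that layout as an assumption to verify against the OpenNLP version you use. The class `SplitFormatSketch` and its methods are hypothetical helpers written in plain Java, not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitFormatSketch {

    // Marker assumed to separate tokens that are not separated by whitespace.
    static final String SPLIT = "<SPLIT>";

    // Turns one annotated training line into its tokens.
    public static List<String> tokensOf(String annotatedLine) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : annotatedLine.trim().split("\\s+")) {
            for (String token : chunk.split(Pattern.quote(SPLIT))) {
                if (!token.isEmpty()) tokens.add(token);
            }
        }
        return tokens;
    }

    // Recovers the untokenized sentence by deleting the markers.
    public static String rawText(String annotatedLine) {
        return annotatedLine.replace(SPLIT, "");
    }

    public static void main(String[] args) {
        String line = "Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board<SPLIT>.";
        System.out.println(tokensOf(line));
        System.out.println(rawText(line));
    }
}
```

The marker lets the training data record a token boundary where the raw text has no whitespace, for example between "Vinken" and the comma, without changing the raw text itself.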

The following sample code explains these three steps:

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// Signatures are kept as given in this article; they differ between
// OpenNLP releases, so check them against the version you use.

// Step 1: open the sample data stream.
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

// Step 2: train the model.
TokenizerModel model;
try {
    model = TokenizerME.train("en", sampleStream, true, TrainingParameters.defaultParams());
} finally {
    sampleStream.close();
}

// Step 3: save the model. modelFile is assumed to be defined elsewhere.
OutputStream modelOut = null;
try {
    modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
    model.serialize(modelOut);
} finally {
    if (modelOut != null)
        modelOut.close();
}
