This article introduces the relevant knowledge of "what is the method of Python text preprocessing". Many people run into these situations in real-world cases, so let the editor walk you through how to handle them. I hope you read carefully and come away with something!
Convert letters that appear in text to lowercase
Example 1: convert letters to lowercase
Python implementation code:
Input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil." Input_strinput_str = input_str.lower () print (input_str)
Output:
the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.
Delete numbers that appear in the text
If the numbers in the text are irrelevant to the analysis, delete them. Regular expressions are usually the easiest way to do this.
Example 2: delete a number
Python implementation code:
import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', input_str)
print(result)
Output:
Box A contains red and white balls, while Box B contains red and blue balls.
Delete punctuation that appears in text
The following sample code shows how to delete punctuation marks from text, i.e. the characters in string.punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Example 3: delete punctuation
Python implementation code:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"
# map every punctuation character to None and drop it
result = input_str.translate(str.maketrans('', '', string.punctuation))
print(result)
Output:
This is an example of string with punctuation
Delete spaces that appear in text
You can remove spaces that appear before and after the text with the strip() function.
Example 4: delete spaces
Python implementation code:
Input_str = "\ t a string example\ t" input_strinput_str = input_str.strip () input_str
Output:
'a string example'
Tokenization
Tokenization is the process of splitting a given text into smaller pieces called tokens. Words, numbers, punctuation marks, and other symbols can all be regarded as tokens. Many common tools implement tokenization, including NLTK, TextBlob, spaCy, and Gensim; a minimal sketch follows.
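As a minimal sketch, NLTK's word_tokenize splits a sentence into word and punctuation tokens (this assumes the punkt tokenizer data has been downloaded via nltk.download('punkt'); the sentence is an arbitrary example):
from nltk.tokenize import word_tokenize
# a minimal sketch; assumes nltk.download('punkt') has been run
input_str = "The science of today is the technology of tomorrow."
print(word_tokenize(input_str))
# ['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow', '.']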
Delete stop words that appear in the text
Stop words are the most common words in a language, such as "the", "a", "on", "is", and "all". These words carry little special or important meaning and can usually be removed from the text. The Natural Language Toolkit (NLTK), an open-source library for symbolic and statistical natural language processing, is commonly used to remove these stop words.
Example 7: delete stop words
Implementation code:
Input_str = "NLTK is a leading platform for building Python programs to work with human language data." Stop_words = set (stopwords.words ('english')) from nltk.tokenize import word_tokenize tokens = word_tokenize (input_str) result = [i for i in tokens if not i in stop_words] print (result)
Output:
['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
In addition, scikit-learn provides a built-in stop-word list:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
spaCy provides a similar stop-word list:
from spacy.lang.en.stop_words import STOP_WORDS
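As a minimal sketch (the sentence below is an arbitrary example), either list can be used to filter tokens in the same way as the NLTK stop words above:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# a minimal sketch: filter tokens against scikit-learn's built-in list
words = "this is not a very long sentence".split()
print([w for w in words if w not in ENGLISH_STOP_WORDS])
# ['long', 'sentence']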
Remove sparse words and specific words from the text
In some cases it is necessary to remove sparse terms or particular words from the text. Since any set of words can be treated as stop words, this can be accomplished with the same stop-word removal machinery described above; a minimal sketch follows.
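A minimal sketch of this idea (the token list and the once-only threshold are arbitrary assumptions for illustration): count term frequencies, treat the rarest terms as a custom stop-word set, and filter them out:
from collections import Counter
# a minimal sketch: words that occur only once are treated as "sparse"
tokens = "apple banana apple cherry banana apple durian".split()
counts = Counter(tokens)
sparse_terms = {word for word, count in counts.items() if count == 1}
result = [t for t in tokens if t not in sparse_terms]
print(result)  # ['apple', 'banana', 'apple', 'banana', 'apple']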
Stemming
Stemming is the process of reducing words to their stem, base or root form (such as books -> book, looked -> look). The two mainstream algorithms at present are the Porter stemming algorithm (which removes common morphological and inflectional endings from words) and the Lancaster stemming algorithm.
Example 8: using NLTK to implement stemming
Implementation code:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))
Output:
There are sever type of stem algorithm.
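For comparison, NLTK also ships the Lancaster stemmer mentioned above; a minimal sketch (the word list is an arbitrary example):
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()
# Lancaster generally produces shorter, more aggressive stems than Porter
for word in ["types", "stemming", "algorithms"]:
    print(lancaster.stem(word))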
Lemmatization
Like stemming, the goal of lemmatization is to reduce a word's different inflected forms to a common base form. Unlike stemming, however, lemmatization does not simply cut off or alter word endings; it uses lexical knowledge bases to obtain the correct base form of a word.
Commonly used lemmatization tools include: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.
Example 9: using NLTK to implement lemmatization
Implementation code:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    # note: lemmatize() treats every token as a noun unless a pos argument is given
    print(lemmatizer.lemmatize(word))
Output:
be have do language city mouse
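Note that WordNetLemmatizer defaults to treating every word as a noun, so verb forms such as "been" or "had" are only reduced when a part of speech is passed explicitly; a minimal sketch:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("been", pos="v"))  # be
print(lemmatizer.lemmatize("had", pos="v"))   # have
print(lemmatizer.lemmatize("mice"))           # mouse (default pos is noun)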
Part of speech tagging (POS)
Part-of-speech tagging aims to assign a part of speech to each word in a given text (noun, verb, adjective, etc.) based on its definition and its context. Many tools include POS taggers, including NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.
Example 10: using TextBlob to implement part of speech tagging
Implementation code:
Input_str= "Parts of speech examples: an article, to write, interesting, easily, and, of" from textblob import TextBlob result = TextBlob (input_str) print (result.tags)
Output:
[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
Chunking (shallow parsing)
Chunking is a natural language process that identifies the constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher-level units that have discrete grammatical meaning (noun groups or phrases, verb groups, etc.). Commonly used chunking tools include: NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), and FreeLing.
Example 11: using NLTK to implement chunking
The first step is to determine the part of speech of each word.
Implementation code:
Input_str= "A black television and a white stove were bought for the new apartment of John." From textblob import TextBlob result = TextBlob (input_str) print (result.tags)
Output:
[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]
The second step is chunking:
Implementation code:
Reg_exp = "NP: {? *}" rp = nltk.RegexpParser (reg_exp) result = rp.parse (result.tags) print (result)
Output:
(S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN) of/IN John/NNP)
You can also draw the sentence tree structure by calling result.draw().
Named entity recognition (NER)
Named entity recognition aims to find named entities in text and classify them into predefined categories (persons, places, organizations, times, etc.).
Common named entity recognition tools include: NLTK, spaCy, General Architecture for Text Engineering (GATE) with ANNIE, Apache OpenNLP, Stanford CoreNLP, DKPro Core, MITIE, Watson NLP, TextRazor, FreeLing, etc.
Example 12: using NLTK to implement named entity recognition
Implementation code:
from nltk import word_tokenize, pos_tag, ne_chunk
input_str = "Bill works for Apple so he went to Boston for a conference."
print(ne_chunk(pos_tag(word_tokenize(input_str))))
Output:
(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)
Coreference resolution (anaphora resolution)
Pronouns and other referring expressions need to be linked to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence "Andrew said he would buy a car", the pronoun "he" refers to the same person as "Andrew". Common coreference resolution tools include Stanford CoreNLP, spaCy, Open Calais, Apache OpenNLP, and so on.
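As a hedged sketch using the third-party neuralcoref package (a separate extension that plugs into spaCy 2.x, not part of spaCy itself; assumes the en_core_web_sm model has been downloaded):
import spacy
import neuralcoref
# a sketch with the third-party neuralcoref extension (spaCy 2.x only)
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)
doc = nlp("Andrew said he would buy a car.")
print(doc._.coref_clusters)  # e.g. a cluster linking 'he' back to 'Andrew'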
Collocation extraction
Collocations are word combinations that occur together more often than would be expected by chance, rather than as accidental one-off pairings; collocation extraction is the process of finding them. Examples include "break the rules", "free time", "draw a conclusion", "keep in mind", "get ready", and so on.
Example 13: using ICE to implement collocation extraction
Implementation code:
Input= ["he and Chazz duel with all keys on the line."] From ICE import CollocationExtractor extractor = CollocationExtractor.with_collocation_pipeline ("T1", bing_key = "Temp", pos_check = False) print (extractor.get_collocations_of_length (input, length = 3))
Output:
["on the line"]
Relation extraction
Relation extraction obtains structured information from unstructured sources such as raw text. Strictly speaking, it determines the relations between named entities (people, organizations, places), such as spouse or employment relations. For example, from the sentence "Mark and Emily married yesterday", we can extract the information that Mark is Emily's husband.
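As a toy, pattern-based sketch of this idea (the pattern and the relation label are assumptions for illustration; real systems combine NER with learned extractors):
import re
# a toy pattern: "X and Y married ..." yields a spouse_of relation triple
sentence = "Mark and Emily married yesterday."
match = re.match(r"(\w+) and (\w+) married", sentence)
if match:
    person_a, person_b = match.groups()
    print((person_a, "spouse_of", person_b))  # ('Mark', 'spouse_of', 'Emily')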
This concludes "what is the method of Python text preprocessing". Thank you for reading. If you want to learn more about the industry, you can follow the website, where the editor will keep publishing practical articles for you!