This article introduces the relevant knowledge of "what is the method of Python text preprocessing". Many people run into these situations in real-world cases, so let the editor walk you through how to handle them. I hope you read carefully and come away with something!
Convert letters that appear in text to lowercase
Example 1: convert letters to lowercase
Python implementation code:
Input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil." Input_strinput_str = input_str.lower () print (input_str)
Output:
the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.
Delete numbers that appear in the text
If the numbers in the text are irrelevant to the analysis, delete them. Regular expressions are usually the easiest way to do this.
Example 2: delete a number
Python implementation code:
import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', input_str)
print(result)
Output:
Box A contains red and white balls, while Box B contains red and blue balls.
Delete punctuation that appears in text
The following sample code shows how to delete punctuation marks from text, i.e. the characters in string.punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Example 3: delete punctuation
Python implementation code:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"
# map every punctuation character to None and drop it
result = input_str.translate(str.maketrans('', '', string.punctuation))
print(result)
Output:
This is an example of string with punctuation
Delete spaces that appear in text
You can remove spaces that appear before and after the text with the strip() function.
Example 4: delete spaces
Python implementation code:
Input_str = "\ t a string example\ t" input_strinput_str = input_str.strip () input_str
Output:
'a string example'
Tokenization
Tokenization is the process of splitting a given text into smaller pieces called tokens. Words, numbers, punctuation marks, and other symbols can all be regarded as tokens. Many common tools implement tokenization, including NLTK, TextBlob, spaCy, and Gensim; a minimal sketch follows.
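As a minimal sketch, NLTK's word_tokenize splits a sentence into word and punctuation tokens (this assumes the punkt tokenizer data has been downloaded via nltk.download('punkt'); the sentence is an arbitrary example):
from nltk.tokenize import word_tokenize
# a minimal sketch; assumes nltk.download('punkt') has been run
input_str = "The science of today is the technology of tomorrow."
print(word_tokenize(input_str))
# ['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow', '.']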
Delete stop words that appear in the text
Stop words are the most common words in a language, such as "the", "a", "on", "is", and "all". These words carry little special or important meaning and can usually be removed from the text. The Natural Language Toolkit (NLTK), an open-source library for symbolic and statistical natural language processing, is commonly used to remove these stop words.
Example 7: delete stop words
Implementation code:
Input_str = "NLTK is a leading platform for building Python programs to work with human language data." Stop_words = set (stopwords.words ('english')) from nltk.tokenize import word_tokenize tokens = word_tokenize (input_str) result = [i for i in tokens if not i in stop_words] print (result)
Output:
['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
In addition, scikit-learn provides a built-in stop-word list:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
spaCy provides a similar stop-word list:
from spacy.lang.en.stop_words import STOP_WORDS
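As a minimal sketch (the sentence below is an arbitrary example), either list can be used to filter tokens in the same way as the NLTK stop words above:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# a minimal sketch: filter tokens against scikit-learn's built-in list
words = "this is not a very long sentence".split()
print([w for w in words if w not in ENGLISH_STOP_WORDS])
# ['long', 'sentence']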
Remove sparse words and specific words from the text
In some cases it is necessary to remove sparse terms or particular words from the text. Since any set of words can be treated as stop words, this can be accomplished with the same stop-word removal machinery described above; a minimal sketch follows.
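A minimal sketch of this idea (the token list and the once-only threshold are arbitrary assumptions for illustration): count term frequencies, treat the rarest terms as a custom stop-word set, and filter them out:
from collections import Counter
# a minimal sketch: words that occur only once are treated as "sparse"
tokens = "apple banana apple cherry banana apple durian".split()
counts = Counter(tokens)
sparse_terms = {word for word, count in counts.items() if count == 1}
result = [t for t in tokens if t not in sparse_terms]
print(result)  # ['apple', 'banana', 'apple', 'banana', 'apple']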
Stemming
Stemming is the process of reducing words to their stem, base or root form (such as books -> book, looked -> look). The two mainstream algorithms at present are the Porter stemming algorithm (which removes common morphological and inflectional endings from words) and the Lancaster stemming algorithm.
Example 8: using NLTK to implement stemming
Implementation code:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))
Output:
There are sever type of stem algorithm.
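For comparison, NLTK also ships the Lancaster stemmer mentioned above; a minimal sketch (the word list is an arbitrary example):
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()
# Lancaster generally produces shorter, more aggressive stems than Porter
for word in ["types", "stemming", "algorithms"]:
    print(lancaster.stem(word))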
Lemmatization
Like stemming, the goal of lemmatization is to reduce a word's different inflected forms to a common base form. Unlike stemming, however, lemmatization does not simply cut off or alter word endings; it uses lexical knowledge bases to obtain the correct base form of a word.
Commonly used lemmatization tools include: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.
Example 9: using NLTK to implement lemmatization
Implementation code:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    # note: lemmatize() treats every token as a noun unless a pos argument is given
    print(lemmatizer.lemmatize(word))
Output:
be have do language city mouse
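Note that WordNetLemmatizer defaults to treating every word as a noun, so verb forms such as "been" or "had" are only reduced when a part of speech is passed explicitly; a minimal sketch:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("been", pos="v"))  # be
print(lemmatizer.lemmatize("had", pos="v"))   # have
print(lemmatizer.lemmatize("mice"))           # mouse (default pos is noun)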
Part of speech tagging (POS)
Part-of-speech tagging aims to assign a part of speech to each word in a given text (noun, verb, adjective, etc.) based on its definition and its context. Many tools include POS taggers, including NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.
Example 10: using TextBlob to implement part of speech tagging
Implementation code:
Input_str= "Parts of speech examples: an article, to write, interesting, easily, and, of" from textblob import TextBlob result = TextBlob (input_str) print (result.tags)
Output:
[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
Chunking (shallow parsing)
Chunking is a natural language process that identifies the constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher-level units that have discrete grammatical meaning (noun groups or phrases, verb groups, etc.). Commonly used chunking tools include: NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), and FreeLing.
Example 11: using NLTK to implement chunking
The first step is to determine the part of speech of each word.
Implementation code:
Input_str= "A black television and a white stove were bought for the new apartment of John." From textblob import TextBlob result = TextBlob (input_str) print (result.tags)
Output:
[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]
The second step is chunking:
Implementation code:
Reg_exp = "NP: {? *}" rp = nltk.RegexpParser (reg_exp) result = rp.parse (result.tags) print (result)
Output:
(S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN) of/IN John/NNP)
You can also draw the sentence tree structure by calling result.draw().
Named entity recognition (NER)
Named entity recognition aims to find named entities in text and classify them into predefined categories (persons, places, organizations, times, etc.).
Common named entity recognition tools include: NLTK, spaCy, General Architecture for Text Engineering (GATE) with ANNIE, Apache OpenNLP, Stanford CoreNLP, DKPro Core, MITIE, Watson NLP, TextRazor, FreeLing, etc.
Example 12: using NLTK to implement named entity recognition
Implementation code:
from nltk import word_tokenize, pos_tag, ne_chunk
input_str = "Bill works for Apple so he went to Boston for a conference."
print(ne_chunk(pos_tag(word_tokenize(input_str))))
Output:
(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)
Coreference resolution (anaphora resolution)
Pronouns and other referring expressions need to be linked to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence "Andrew said he would buy a car", the pronoun "he" refers to the same person as "Andrew". Common coreference resolution tools include Stanford CoreNLP, spaCy, Open Calais, Apache OpenNLP, and so on.
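As a hedged sketch using the third-party neuralcoref package (a separate extension that plugs into spaCy 2.x, not part of spaCy itself; assumes the en_core_web_sm model has been downloaded):
import spacy
import neuralcoref
# a sketch with the third-party neuralcoref extension (spaCy 2.x only)
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)
doc = nlp("Andrew said he would buy a car.")
print(doc._.coref_clusters)  # e.g. a cluster linking 'he' back to 'Andrew'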
Collocation extraction
Collocations are word combinations that occur together more often than would be expected by chance, rather than as accidental one-off pairings; collocation extraction is the process of finding them. Examples include "break the rules", "free time", "draw a conclusion", "keep in mind", "get ready", and so on.
Example 13: using ICE to implement collocation extraction
Implementation code:
Input= ["he and Chazz duel with all keys on the line."] From ICE import CollocationExtractor extractor = CollocationExtractor.with_collocation_pipeline ("T1", bing_key = "Temp", pos_check = False) print (extractor.get_collocations_of_length (input, length = 3))
Output:
["on the line"]
Relation extraction
Relation extraction obtains structured information from unstructured sources such as raw text. Strictly speaking, it determines the relations between named entities (people, organizations, places), such as spouse or employment relations. For example, from the sentence "Mark and Emily married yesterday", we can extract the information that Mark is Emily's husband.
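As a toy, pattern-based sketch of this idea (the pattern and the relation label are assumptions for illustration; real systems combine NER with learned extractors):
import re
# a toy pattern: "X and Y married ..." yields a spouse_of relation triple
sentence = "Mark and Emily married yesterday."
match = re.match(r"(\w+) and (\w+) married", sentence)
if match:
    person_a, person_b = match.groups()
    print((person_a, "spouse_of", person_b))  # ('Mark', 'spouse_of', 'Emily')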
This concludes "what is the method of Python text preprocessing". Thank you for reading. If you want to learn more about the industry, you can follow the website, where the editor will keep publishing practical articles for you!