This article walks through a collection of practical Python text-processing cases. Each case is short, fast to run, and practical; the code is followed by its output where relevant.
1 Extract PDF content

# pip install PyPDF2
import PyPDF2

# Create a pdf file object.
pdf = open("test.pdf", "rb")

# Create a pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Check the total number of pages in the pdf file.
print("Total number of Pages:", pdf_reader.numPages)

# Create a page object.
page = pdf_reader.getPage(200)

# Extract text from that specific page.
print(page.extractText())

# Close the file object.
pdf.close()

2 Extract Word document content

# pip install python-docx
import docx

def main():
    try:
        # Create a Word document reader object.
        doc = docx.Document('test.docx')
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()

3 Extract web page content

# pip install bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parse the HTML.
soup = BeautifulSoup(webpage, 'html.parser')

# Pretty-print the parsed HTML.
strhtm = soup.prettify()

# Print the first 500 characters.
print(strhtm[:500])

# Extract the title and a meta tag value.
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag values.
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag values.
for x in soup.find_all('p'):
    print(x.text)

4 Read JSON data

import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract specific node content.
print(res['quiz']['sport'])

# Dump the data as a string.
data = json.dumps(res)
print(data)

5 Read CSV data

import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the header row
    for row in reader:
        print(row)

6 Remove punctuation marks

import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful! \
It paints the senery in your mind so well I would recomend \
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate \
guitars and soulful orchestras. \
It would impress anyone who cares to listen!"

# Method 1: regex
# Remove the special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: str.translate()
# Make a translator object that deletes all punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)

7 Use NLTK to remove stop words

from nltk.corpus import stopwords

data = ['Stuning even for the non-gamer: This sound track was beautiful! \
It paints the senery in your mind so well I would recomend \
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate \
guitars and soulful orchestras. \
It would impress anyone who cares to listen!']

# Remove stop words
stopwords = set(stopwords.words('english'))
output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stopwords:
            temp_list.append(word)
    output.append(' '.join(temp_list))

print(output)

8 Correct spelling with TextBlob

from textblob import TextBlob

data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."
output = TextBlob(data).correct()
print(output)

9 Tokenize with NLTK and TextBlob

import nltk
from textblob import TextBlob

# nltk.download('punkt') may be needed the first time word_tokenize is used.
data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words
print(nltk_output)
print(textblob_output)
Output:
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
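Both toolkits can also split text into sentences rather than words. The minimal sketch below is not part of the original case and assumes the NLTK punkt data has been downloaded:

import nltk
from textblob import TextBlob

# nltk.download('punkt') may be required once before either call works.
data = "Natural language is a central part of our day to day life. It's interesting to work on language problems."
print(nltk.sent_tokenize(data))   # NLTK sentence tokenizer
print(TextBlob(data).sentences)   # TextBlob Sentence objects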
10 Use NLTK to stem the words of a sentence or phrase

from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))
Output:
Where did he learn to danc like that?
Hi eye were danc with humor.
She shook her head and danc away
Alex wa an excel dancer.
jump jump jump
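As a side note, not part of the original case, NLTK also ships SnowballStemmer, an updated variant of the Porter algorithm with multi-language support; a minimal sketch:

from nltk.stem import SnowballStemmer

# "english" mirrors the Porter example above; other languages are also supported.
sb = SnowballStemmer("english")
print(sb.stem("dancing"), sb.stem("excellent"), sb.stem("jumped"))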
11 Use NLTK to lemmatize the words of a sentence or phrase

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
        'Her car was in full view.',
        'A number of cars carried out of state license plates.']

output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:
    print(item)

print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))
print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))
Output:
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
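WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is supplied, which is why words like "was" and "passed" are left unchanged above. A minimal sketch of POS-aware lemmatization follows; the wordnet_pos helper is hypothetical, written here for illustration, and it assumes the punkt, averaged_perceptron_tagger, and wordnet NLTK data are installed:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Hypothetical helper: map Penn Treebank tags to WordNet POS constants.
def wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

wnl = WordNetLemmatizer()
tokens = nltk.word_tokenize("Her car was in full view.")
print(" ".join(wnl.lemmatize(tok, wordnet_pos(tag))
               for tok, tag in nltk.pos_tag(tokens)))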
12 Use NLTK to find the frequency of each word in a text file

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

data_analysis = nltk.FreqDist(filter_words)
data_analysis.plot(25, cumulative=False)
Output:
[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...
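A small variation not shown in the original case: FreqDist.most_common() returns (word, count) pairs already sorted by frequency, which avoids the manual sort over the filtered dictionary:

import nltk

words = ["data", "science", "data", "python", "data", "science"]
fd = nltk.FreqDist(words)
# most_common(n) returns the n most frequent (word, count) pairs, already sorted.
print(fd.most_common(2))   # [('data', 3), ('science', 2)]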
13 Create a word cloud

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = nltk.FreqDist(wt_words)

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plot the word cloud.
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

14 NLTK lexical dispersion plot

import nltk
from nltk.corpus import webtext
import matplotlib.pyplot as plt

words = ['data', 'science', 'dataset']

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]

if points:
    x, y = zip(*points)
else:
    x = y = ()

plt.plot(x, y, "rx", scalex=.1)
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()

15 Use CountVectorizer to convert text to numbers

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data2]})

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change the column headers
df2.columns = df1.columns
print(df2)
Output:
             Go  Java  Python
and           2     2       2
application   0     1       0
are           1     0       1
bytecode      0     1       0
can           0     1       0
code          0     1       0
comes         1     0       1
compiled      0     1       0
derived       0     1       0
develops      0     1       0
for           0     2       0
from          0     1       0
functional    1     0       1
imperative    1     0       1
...
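Note that vectorizer.get_feature_names(), used in this case and the TF-IDF and n-gram cases below, was removed in recent scikit-learn releases (around 1.2). The sketch below, which assumes scikit-learn 1.0 or newer is installed, builds the same kind of document-term DataFrame with get_feature_names_out():

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Java compiles to bytecode.", "Python is an interpreted language."]
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(docs)

# In scikit-learn 1.0+ the vocabulary is exposed via get_feature_names_out().
df = pd.DataFrame(doc_vec.toarray().T,
                  index=vectorizer.get_feature_names_out(),
                  columns=["Java", "Python"])
print(df)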
16 Use TF-IDF to create a document-term matrix

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data2]})

# Initialize
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change the column headers
df2.columns = df1.columns
print(df2)
Output:
                   Go      Java    Python
and          0.323751  0.137553  0.323751
application  0.000000  0.116449  0.000000
are          0.208444  0.000000  0.208444
bytecode     0.000000  0.116449  0.000000
can          0.000000  0.116449  0.000000
code         0.000000  0.116449  0.000000
comes        0.208444  0.000000  0.208444
compiled     0.000000  0.116449  0.000000
derived      0.000000  0.116449  0.000000
develops     0.000000  0.116449  0.000000
for          0.000000  0.232898  0.000000
...
17 Generate N-grams for a given sentence

Using the Natural Language Toolkit (NLTK):

import nltk
from nltk.util import ngrams

# Function to generate n-grams from a sentence.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'
print("1-gram:", extract_ngrams(data, 1))
print("2-gram:", extract_ngrams(data, 2))
print("3-gram:", extract_ngrams(data, 3))
print("4-gram:", extract_ngrams(data, 4))
Using TextBlob:

from textblob import TextBlob

# Function to generate n-grams from a sentence.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'
print("1-gram:", extract_ngrams(data, 1))
print("2-gram:", extract_ngrams(data, 2))
print("3-gram:", extract_ngrams(data, 3))
print("4-gram:", extract_ngrams(data, 4))
Output:
1-gram: ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram: ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram: ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram: ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
18 Use sklearn CountVectorizer with an n-gram vocabulary

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language."

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

# Initialize with a bigram vocabulary
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change the column headers
df2.columns = df1.columns
print(df2)
Output:
                   Assembly  Machine
also either               0        1
and or                    0        1
are also                  0        1
are readable              1        0
are still                 1        0
assembly language         5        0
because each              1        0
but difficult             0        1
by computers              0        1
by people                 0        1
can execute               0        1
...
19 Use TextBlob to extract noun phrases

from textblob import TextBlob

# Extract noun phrases
blob = TextBlob("Canada is a country in the northern part of North America.")
for nouns in blob.noun_phrases:
    print(nouns)
Output:
canada
northern part
america
20 Calculate a word-word co-occurrence matrix

import numpy as np
import nltk
from nltk import bigrams
import itertools
import pandas as pd


def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in the corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise the co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams, taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count

    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # Return the matrix and the index
    return co_occurrence_matrix, vocab_index


text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python' 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Create one list from many lists
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)

data_matrix = pd.DataFrame(matrix, index=vocab_index, columns=vocab_index)
print(data_matrix)
Output:
best use What Where ... in is Python used
best 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
use 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0
What 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
Where 0.0 0.0 0.0 ... 0.0 0.0 0.0
Pythonused 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
Why 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
companies 0.0 1.0 0.0 1.0 ... 1.0 0.0 0.0 0.0
in 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0
is 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0
Python 0.0 0.0 0.0 ... 0.0 0.0 0.0
used 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0
[11 rows x 11 columns]
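As an alternative sketch, not part of the original case, the same previous/current counts can be obtained with pandas.crosstab over the bigram pairs, with much less bookkeeping:

import pandas as pd
from nltk import bigrams

tokens = ["Where", "Python", "is", "used", "What", "is", "Python", "used", "in"]
pairs = pd.DataFrame(list(bigrams(tokens)), columns=["previous", "current"])

# crosstab counts how often each "current" word follows each "previous" word.
print(pd.crosstab(pairs["current"], pairs["previous"]))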
21 Use TextBlob for sentiment analysis

from textblob import TextBlob

def sentiment(polarity):
    if polarity < 0:
        print("Negative")
    elif polarity > 0:
        print("Positive")
    else:
        print("Neutral")

blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
Output:
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
22 Language translation using Goslate

# pip install goslate
import goslate

text = "Comment vas-tu?"
gs = goslate.Goslate()

translatedText = gs.translate(text, 'en')
print(translatedText)

translatedText = gs.translate(text, 'zh')
print(translatedText)

translatedText = gs.translate(text, 'de')
print(translatedText)

23 Language detection and translation with TextBlob

from textblob import TextBlob

blob = TextBlob("Comment vas-tu?")
print(blob.detect_language())

print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))
Output:
fr
Como estas tu?
How are you?
How are you?
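Both goslate and TextBlob's detect_language()/translate() relied on a free Google Translate endpoint, so on current library versions the code above may fail or raise deprecation errors. One possible replacement, an assumption rather than part of the original article, uses the third-party deep-translator package (pip install deep-translator):

# Assumes the deep-translator package is installed and network access is available.
from deep_translator import GoogleTranslator

text = "Comment vas-tu?"
print(GoogleTranslator(source='auto', target='en').translate(text))
print(GoogleTranslator(source='auto', target='zh-CN').translate(text))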
24 Use TextBlob to get definitions and synonyms

from textblob import Word

text_word = Word('safe')
print(text_word.definitions)

synonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

print(synonyms)
Output:
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}
25 Use TextBlob to get a list of antonyms

from textblob import Word

text_word = Word('safe')

antonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            antonyms.add(lemma.antonyms()[0].name())

print(antonyms)
Output:
{'dangerous', 'out'}
At this point, you should have a good sense of these Python text-processing cases; the best way to consolidate them is to run the examples yourself and adapt them to your own data.