
Preprocessing methods for Chinese NLP data

Updated: 2025-04-06. Source: SLTechnology News&Howtos (Shulou), Development


Shulou (Shulou.com), 06/02 report

This article introduces common preprocessing methods for Chinese NLP data. Many people are unsure which steps are needed in day-to-day work, so the editor consulted a range of materials and collected simple, easy-to-use operations: loading data, removing blank lines, digits, punctuation, English and special symbols, segmenting words, and removing stop words.

Data loading (default csv format)

import pandas as pd

datas = pd.read_csv("./test.csv", header=0, index_col=0)  # DataFrame
n_datas = datas.to_numpy()  # converting to a NumPy ndarray makes it easier to handle (personal preference)
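The later steps iterate over sentences as plain strings, so it helps to finish loading with a flat list of text. The following is a minimal sketch that swaps in the standard-library csv module instead of pandas; the column name `text`, the helper name `load_sentences`, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be the opened ./test.csv file.
SAMPLE_CSV = "id,text\n0,今天天气很好。\n1,\n2,我买了3个苹果!\n"


def load_sentences(handle, column="text"):
    """Read one text column from a CSV file object into a flat list of strings."""
    reader = csv.DictReader(handle)
    return [row[column] for row in reader]


sentences = load_sentences(io.StringIO(SAMPLE_CSV))
print(sentences)
```

Empty cells come back as empty strings here, which the blank-line step can then drop.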

Remove blank lines

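def delete_blank_lines(sentences):
    # s.split() is empty for blank or whitespace-only strings
    return [s for s in sentences if s.split()]

# ravel() flattens the 2-D array from to_numpy() into a 1-D sequence; assumes every cell is a string
no_line_datas = delete_blank_lines(n_datas.ravel())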
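The filter relies on `str.split()` returning an empty list for text that contains only whitespace. A small sketch with made-up sample strings shows that this also covers the full-width space (U+3000) that is common in Chinese text:

```python
def delete_blank_lines(sentences):
    # str.split() with no arguments returns [] for empty or whitespace-only text
    return [s for s in sentences if s.split()]


# Made-up samples: empty, ASCII spaces, full-width space, tab and newline
raw = ["今天天气很好。", "", "   ", "\u3000", "\t\n", "我没出门。"]
cleaned = delete_blank_lines(raw)
print(cleaned)
```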

Remove digits

import re

DIGIT_RE = re.compile(r'\d+')

def delete_digit(sentences):
    # apply the substitution to each sentence; DIGIT_RE.sub() works on a string, not a list
    return [DIGIT_RE.sub('', s) for s in sentences]

no_digit_datas = delete_digit(no_line_datas)
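In Python 3, `\d` in a string pattern matches any Unicode decimal digit, so full-width digits are removed as well, while Chinese numerals such as 三 are left alone. A short sketch with made-up samples:

```python
import re

DIGIT_RE = re.compile(r'\d+')


def delete_digit(sentences):
    # Strip every run of decimal digits from each sentence
    return [DIGIT_RE.sub('', s) for s in sentences]


# Made-up samples: ASCII digits, full-width digits, and a Chinese numeral
samples = ["2023年第１２期", "我买了3个苹果", "共三个"]
result = delete_digit(samples)
print(result)
```

If Chinese numerals should also go, they need their own character class; that is a separate decision because many of them double as ordinary words.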

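Determine the sentence type (simple or complex)

STOPS = ['.']  # the rest of this list is cut off in the source; add the sentence-ending marks your data uses

def is_sample_sentence(sentence):
    # a sentence with more than one sentence-ending mark is treated as complex
    count = 0
    for word in sentence:
        if word in STOPS:
            count += 1
            if count > 1:
                return False
    return True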
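The check above can be tried out as follows. This is only a sketch: the list of sentence-ending marks in the article is cut off, so the `STOPS` set below is an assumption, and the name `is_simple_sentence` is chosen here for illustration.

```python
# Assumed set of sentence-ending marks (full-width and ASCII); not from the article
STOPS = {"。", "！", "？", ".", "!", "?"}


def is_simple_sentence(sentence, stops=STOPS):
    """Return True if the text contains at most one sentence-ending mark."""
    count = 0
    for ch in sentence:
        if ch in stops:
            count += 1
            if count > 1:
                return False
    return True


print(is_simple_sentence("今天天气很好。"))
print(is_simple_sentence("今天下雨。我没出门。"))
```

Counting marks is a rough heuristic: text such as "真的吗？！" or a version number like "1.2.3" also contains two marks and would be classed as complex.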

Remove Chinese and English punctuation

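from string import punctuation
import re

punc = punctuation + u''  # the Chinese punctuation string appended here is cut off in the source

def delete_punc(sentences):
    # re.escape keeps characters such as ] and \ from breaking the character class
    return [re.sub(r"[{}]+".format(re.escape(punc)), '', s) for s in sentences]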
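A runnable sketch of the same idea follows. Because the Chinese punctuation string in the article is truncated, `CN_PUNCTUATION` below is an assumed, non-exhaustive set of full-width marks; adjust it to the data at hand.

```python
import re
from string import punctuation

# Assumed set of full-width marks; the list in the article is cut off
CN_PUNCTUATION = "，。！？、；：“”‘’（）《》【】…—"

# Compile once; re.escape makes characters like ] and \ safe inside [...]
PUNC_RE = re.compile("[{}]+".format(re.escape(punctuation + CN_PUNCTUATION)))


def delete_punc(sentences):
    # Remove every run of ASCII or full-width punctuation
    return [PUNC_RE.sub("", s) for s in sentences]


result = delete_punc(["你好，世界！", "Hello, world!", "《红楼梦》（上）"])
print(result)
```

Compiling the pattern once avoids rebuilding it for every sentence, which matters on large corpora.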

Remove English (keep only Chinese characters)

ENGLISH_RE = re.compile(r'[a-zA-Z]+')

def delete_e_word(sentences):
    return [ENGLISH_RE.sub('', s) for s in sentences]

Remove garbled text and special symbols

Use a regular expression to remove useless symbols and garbled characters.

# This removes every character that is not a word character, whitespace, or a Chinese character, i.e. symbols and punctuation. In Python 3, \w also matches English letters and digits, so those are kept; use delete_e_word and delete_digit for them. Punctuation may still be needed to decide whether a sentence is simple, so run this step last.
SPECIAL_SYMBOL_RE = re.compile(r'[^\w\s\u4e00-\u9fa5]+')

def delete_special_symbol(sentences):
    return [SPECIAL_SYMBOL_RE.sub('', s) for s in sentences]

Chinese word segmentation

# using jieba
import jieba

def seg_sentences(sentences):
    cut_words = map(lambda s: list(jieba.cut(s)), sentences)
    return list(cut_words)

# using pyltp (an alternative; this definition replaces the jieba one if both are run)
from pyltp import Segmentor

def seg_sentences(sentences):
    segmentor = Segmentor()  # lowercase name, so the Segmentor class is not shadowed
    segmentor.load('./cws.model')  # loads word segmentation model parameters
    seg_sents = [list(segmentor.segment(sent)) for sent in sentences]
    segmentor.release()
    return seg_sents
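Since both versions share the same shape (one token list per sentence), the segmenter can also be passed in as a parameter. The sketch below is one way to do that; the `cut` parameter is an assumption added here, and its default of `list` merely splits text into single characters as a stand-in so the example runs without any segmentation library.

```python
def seg_sentences(sentences, cut=list):
    """Tokenize each sentence with `cut`, a callable mapping str to an iterable of tokens.

    The default, `list`, splits into single characters and is only a placeholder.
    """
    return [list(cut(s)) for s in sentences]


tokens = seg_sentences(["我爱北京", "天气很好"])
print(tokens)
```

With jieba installed, `seg_sentences(sentences, cut=jieba.lcut)` would produce word-level tokens instead of characters.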

Remove stop words

# the stop word list has to be downloaded separately and loaded into this list
stopwords = []

def delete_stop_word(sentences):
    # each sentence is a list of words produced by the segmentation step
    return [[word for word in s if word not in stopwords] for s in sentences]
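One possible way to fill the stop word list is to read a file with one word per line into a set, which makes each membership test constant-time. In this sketch the helper name `load_stopwords`, the extra `stopwords` parameter, and the sample words are assumptions; an in-memory file stands in for a downloaded list.

```python
import io


def load_stopwords(handle):
    """Read one stop word per line from a file object into a set."""
    return {line.strip() for line in handle if line.strip()}


def delete_stop_word(sentences, stopwords):
    # Each sentence is a list of tokens from the segmentation step
    return [[w for w in s if w not in stopwords] for s in sentences]


# Hypothetical file content; a real list would be opened with
# open(path, encoding="utf-8")
stopwords = load_stopwords(io.StringIO("的\n了\n是\n\n"))
segmented = [["今天", "的", "天气", "很", "好"], ["我", "是", "学生", "了"]]
result = delete_stop_word(segmented, stopwords)
print(result)
```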

This concludes the overview of preprocessing methods for Chinese NLP data. Combining the steps above with hands-on practice is the best way to learn them, so try them on your own data.
