What are the methods of dealing with text type data in pandas

Updated 2025-07-13. From: SLTechnology News&Howtos, Shulou (Shulou.com), 06/02 report.

This article walks through the methods pandas provides for dealing with text data. All of them are reached through the .str accessor of a Series or Index. The examples assume that pandas and numpy have been imported as pd and np (import pandas as pd, import numpy as np).

1. Case conversion and filling

s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])

Convert uppercase to lowercase: s.str.lower()

Convert lowercase to uppercase: s.str.upper()

Convert to title case, as in news headlines (every word capitalized): s.str.title()

Make the first letter uppercase and the rest lowercase: s.str.capitalize()

Swap the case of every letter, so uppercase becomes lowercase and lowercase becomes uppercase: s.str.swapcase()

Pad both sides of the text with a given character up to a fixed width: s.str.center (takes a width and an optional fillchar).

Pad the text to a fixed width with a given character and choose the side to pad (side defaults to 'left' and can be 'left', 'right', or 'both'): s.str.pad(width=10, side='right', fillchar='-')

Pad on the right of the text up to a fixed width, so the original string stays on the left: s.str.ljust.

Pad on the left of the text up to a fixed width, so the original string stays on the right: s.str.rjust.

With only a width given, pad falls back on its defaults (spaces added on the left): s.str.pad(3)

Pad strings with leading zeros up to the specified width:

s = pd.Series(['-1', '1', '1000', 10, np.nan])

s.str.zfill(3)
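The calls in this section can be tried together in a short sketch. The Series `p` of short strings and the width of 6 are assumptions chosen for illustration; they do not come from the article.

```python
import numpy as np
import pandas as pd

# Case conversion on the Series used in this section
s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])
lowered = s.str.lower().tolist()
titled = s.str.title().tolist()
capitalized = s.str.capitalize().tolist()
swapped = s.str.swapcase().tolist()

# Padding on a hypothetical Series of short strings (illustration only)
p = pd.Series(['ab', 'abcd'])
right_padded = p.str.pad(width=6, side='right', fillchar='-').tolist()
left_justified = p.str.ljust(6, fillchar='-').tolist()    # pads on the right
right_justified = p.str.rjust(6, fillchar='-').tolist()   # pads on the left
centered = p.str.center(6, fillchar='-').tolist()         # pads on both sides

# zfill pads with zeros on the left; the non-string entries (10, NaN) give NaN
z = pd.Series(['-1', '1', '1000', 10, np.nan]).str.zfill(3)

print(titled)
print(right_padded, right_justified, centered)
print(z.tolist())
```

Note that ljust produces the same result as pad with side='right', and rjust the same as pad with side='left'.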

2. String merge and split

2.1 Multi-column string merge

Note: to merge several string columns, the cat function (str.cat) is recommended. When it is given a Series, it aligns the values by index before merging.

s = pd.DataFrame({'col1': ['a', 'b', np.nan, 'd'], 'col2': ['A', 'B', 'C', 'D']})
# 1. Rows with a missing value are not merged (the result is NaN)
s['col1'].str.cat([s['col2']])
# 2. Replace the missing value with a fixed character (*) and merge
s['col1'].str.cat([s['col2']], na_rep='*')
# 3. Replace the missing value with a fixed character (*) and merge with the delimiter (,)
s['col1'].str.cat([s['col2']], na_rep='*', sep=',')
# 4. Merge Series whose indexes do not match
# create the Series
s = pd.Series(['a', 'b', np.nan, 'd'])
t = pd.Series(['d', 'a', 'e', 'c'], index=[3, 0, 4, 2])
# merge
s.str.cat(t, join='left', na_rep='-')
s.str.cat(t, join='right', na_rep='-')
s.str.cat(t, join='outer', na_rep='-')
s.str.cat(t, join='inner', na_rep='-')

2.2 Text in the form of a list is merged into one column

s = pd.Series([['lion', 'elephant', 'zebra'], [1.1, 2.2, 3.3], ['cat', np.nan, 'dog'], ['cow', 4.5, 'goat'], ['duck', ['swan', 'fish'], 'guppy']])
# join the list items with an underscore
s.str.join('_')
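As a rough check of the merging behaviour described in 2.1 and 2.2, the sketch below runs str.cat and str.join and keeps the results in variables. The names `df` and `lists` are illustrative, and the list example is shortened to two rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', np.nan, 'd'],
                   'col2': ['A', 'B', 'C', 'D']})
plain = df['col1'].str.cat(df['col2'])                        # NaN row stays NaN
filled = df['col1'].str.cat(df['col2'], na_rep='*', sep=',')  # NaN replaced by '*'

# Series with different indexes: join controls which labels are kept
s = pd.Series(['a', 'b', np.nan, 'd'])
t = pd.Series(['d', 'a', 'e', 'c'], index=[3, 0, 4, 2])
left = s.str.cat(t, join='left', na_rep='-')
inner = s.str.cat(t, join='inner', na_rep='-')
outer = s.str.cat(t, join='outer', na_rep='-')

# str.join only joins lists whose items are all strings
lists = pd.Series([['lion', 'elephant', 'zebra'], ['cat', np.nan, 'dog']])
joined = lists.str.join('_')

print(filled.tolist())
print(outer.to_dict())
print(joined.tolist())
```

With join='left' the result keeps the labels of s, with join='inner' only the labels present in both, and with join='outer' the union of both indexes.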

2.3 Strings are repeated (merged with themselves) into one column

s = pd.Series(['a', 'b', 'c'])
# repeat every string the same number of times
s.str.repeat(repeats=2)
# give one repeat count per element as a list
s.str.repeat(repeats=[1, 2, 3])
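A minimal sketch of the two ways to call str.repeat, with the results collected into plain lists for inspection:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])
doubled = s.str.repeat(repeats=2).tolist()          # same count for every element
stepped = s.str.repeat(repeats=[1, 2, 3]).tolist()  # one count per element

print(doubled)
print(stepped)
```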

2.4 One column of strings is split into multiple columns

2.4.1 partition function

The partition function splits each string at the first occurrence of the separator into 3 columns: the text before the separator, the separator itself, and the text after it.

It takes two parameters: sep (the separator, default is a space) and expand (whether to return a DataFrame, default is True).
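The difference between partition and rpartition is easiest to see side by side. This is a small sketch on two names; the variable names `first`, `last`, and `hyphen` are only for illustration.

```python
import pandas as pd

s = pd.Series(['Linda van der Berg', 'George Pitt-Rivers'])

first = s.str.partition()                    # split at the first space
last = s.str.rpartition()                    # split at the last space
hyphen = s.str.partition('-', expand=False)  # Series of 3-tuples, not a DataFrame

print(first.values.tolist())
print(last.values.tolist())
print(hyphen.tolist())
```

When the separator is not found, partition returns the whole string followed by two empty strings, as the first row of `hyphen` shows.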

s = pd.Series(['Linda van der Berg', 'George Pitt-Rivers'])
# default call: split at the first space
s.str.partition()
# split at the last space instead of the first
s.str.rpartition()
# use a fixed symbol as the separator
s.str.partition('-', expand=False)
# partition an Index
idx = pd.Index(['X 123', 'Y 999'])
idx.str.partition()

2.4.2 split function

The split function splits each string into multiple values at the delimiter.

Parameters:

pat (the delimiter, default is whitespace)

n (the maximum number of splits, that is, how many delimiters are used; the default -1 means all of them)

expand (whether to return a DataFrame, default is False).
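The parameters above can be tried in a short sketch. The variable names are illustrative; the regular expression case relies on pandas treating a multi-character pat as a regular expression by default.

```python
import numpy as np
import pandas as pd

s = pd.Series(["this is a regular sentence",
               "https://docs.python.org/3/tutorial/index.html",
               np.nan])

words = s.str.split()                          # split on whitespace
limited = s.str.split(n=2)                     # at most 2 splits, from the left
url_parts = s.str.split(pat="/", expand=True)  # one column per piece

# Split on '+' or '=' with a regular expression
equation = pd.Series(["1+1=2"]).str.split(r"\+|=", expand=True)

print(list(words[0]))
print(list(limited[0]))
print(url_parts.shape)
print(equation.iloc[0].tolist())
```

With expand=True, rows that produce fewer pieces than the widest row are filled with missing values.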

s = pd.Series(["this is a regular sentence", "https://docs.python.org/3/tutorial/index.html", np.nan])
# 1. By default, split on whitespace
s.str.split()
# 2. Split on whitespace and limit the output to 2 splits
s.str.split(n=2)
# 3. Split on the specified symbol and expand into a new DataFrame
s.str.split(pat="/", expand=True)
# 4. Split with a regular expression and expand into a new DataFrame
s = pd.Series(["1+1=2"])
s.str.split(r"\+|=", expand=True)

2.4.3 rsplit function

If n is not set, rsplit and split give the same result. The difference appears when n is set: split counts the n splits from the beginning of the string, while rsplit counts them from the end.
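A small sketch of that difference on a single sentence (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(["this is a regular sentence"])

from_left = list(s.str.split(n=2)[0])    # the 2 splits are taken from the left
from_right = list(s.str.rsplit(n=2)[0])  # the 2 splits are taken from the right
same_without_n = list(s.str.split()[0]) == list(s.str.rsplit()[0])

print(from_left)
print(from_right)
print(same_without_n)
```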

s = pd.Series(["this is a regular sentence", "https://docs.python.org/3/tutorial/index.html", np.nan])
# unlike split, the 2 splits are counted from the end
s.str.rsplit(n=2)

3. String Statistics

3.1 Count the number of strings in a column

s = pd.Series(['dog', '', 5, {'foo': 'bar'}, [2, 3, 5, 7], ('one', 'two', 'three')])
s.str.len()

3.2 String length

s = pd.Series(['dog', '', 5, {'foo': 'bar'}, [2, 3, 5, 7], ('one', 'two', 'three')])
s.str.len()
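As a sketch of what str.len returns on mixed data: strings give their character count, containers (dict, list, tuple) give their number of items, and values without a length, such as the integer 5, give NaN.

```python
import pandas as pd

s = pd.Series(['dog', '', 5, {'foo': 'bar'}, [2, 3, 5, 7], ('one', 'two', 'three')])
lengths = s.str.len()

print(lengths.tolist())
```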

4. String content lookup (including regular expressions)

4.1 extract

The specified content is extracted with a regular expression. Each capture group (a pair of parentheses) in the pattern produces one column, and only the first match in each string is used.
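The sketch below shows how the capture groups map to columns. The group names letter and digit are assumptions chosen for illustration.

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'])

strict = s.str.extract(r'([ab])(\d)')      # 'c3' does not match: whole row is NaN
optional = s.str.extract(r'([ab])?(\d)')   # first group optional: '3' is still found
named = s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')  # named groups name the columns
single = s.str.extract(r'[ab](\d)', expand=True)         # one group, one-column DataFrame

print(strict)
print(optional)
print(named.columns.tolist())
print(single.shape)
```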

s = pd.Series(['a1', 'b2', 'c3'])
# one column per pair of parentheses, so two columns are generated
s.str.extract(r'([ab])(\d)')
# with a question mark the first group is optional, so the second group can still match when the first does not
s.str.extract(r'([ab])?(\d)')
# named groups rename the generated columns
s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
# a single group generates one column
s.str.extract(r'[ab](\d)', expand=True)

4.2 extractall

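Unlike extract, which returns only the first match, extractall returns every match in each string. The result has a MultiIndex: the original index plus a match number.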
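A minimal sketch of the MultiIndex result, again using letter and digit as assumed group names:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])

# Every digit that follows an 'a' or 'b'; "a1a2" contributes two rows
digits = s.str.extractall(r"[ab](?P<digit>\d)")

# With the letter optional, the '1' in "c1" is also captured
both = s.str.extractall(r"(?P<letter>[ab])?(?P<digit>\d)")

print(digits)
print(both)
```

The second index level, named match, numbers the matches found within each original string.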

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
# extract all matching digits; the result has a MultiIndex and 1 column
s.str.extractall(r"[ab](\d)")
# extract the matching digits and name the column; the result has a MultiIndex and 1 column
s.str.extractall(r"[ab](?P<digit>\d)")
# extract the matching a, b and digits; the result has a MultiIndex and multiple columns
s.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")
# extract the matching a, b and digits; with the question mark the letter is optional, so the digit can still match; the result has a MultiIndex and multiple columns
s.str.extractall(r"(?P<letter>[ab])?(?P<digit>\d)")

4.3 find

Returns the lowest index at which a fixed substring occurs in each string.

If the substring does not occur in the string, the result is -1.

s = pd.Series(['appoint', 'price', 'sleep', 'amount'])
s.str.find('p')

4.4 rfind

Returns the highest index at which a fixed substring occurs in each string.

If the substring does not occur in the string, the result is -1.

s = pd.Series(['appoint', 'price', 'sleep', 'amount'])
s.str.rfind
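To compare find and rfind on the same data, here is a short sketch. Using 'p' as the substring for rfind is an assumption that mirrors the find example in 4.3.

```python
import pandas as pd

s = pd.Series(['appoint', 'price', 'sleep', 'amount'])

lowest = s.str.find('p').tolist()    # index of the first 'p', or -1
highest = s.str.rfind('p').tolist()  # index of the last 'p', or -1

print(lowest)
print(highest)
```

Only 'appoint' contains more than one 'p', so it is the only string where the two results differ.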

4.5 findall

Finds all occurrences of a pattern or regular expression in each string of a Series / Index.

s = pd.Series(['appoint', 'price', 'sleep', 'amount'])
s.str.findall(r'[ac]')
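A quick sketch of the result: each element becomes a list of matches, which is empty when nothing matches.

```python
import pandas as pd

s = pd.Series(['appoint', 'price', 'sleep', 'amount'])
found = [list(x) for x in s.str.findall(r'[ac]')]

print(found)
```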

4.6 get

Extracts the element at the given position from each list, tuple, or string in the Series / Index.

s = pd.Series(["String", (1, 2, 3), ["a", "b", "c"], 123, 456, {1: "Hello", "2": "World"}])
s.str.get(1)
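As a sketch of how get treats different element types: sequences are indexed by position, a dict is looked up by key, and values that cannot be indexed give NaN. The shortened Series here is illustrative.

```python
import pandas as pd

s = pd.Series(["String", (1, 2, 3), ["a", "b", "c"], 123, {1: "Hello", "2": "World"}])
got = s.str.get(1)

print(got.tolist())
```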

4.7 match

Determines whether each string matches the regular expression given as the parameter. The match is anchored at the start of the string.

s = pd.Series(['appoint', 'price', 'sleep', 'amount'])
s.str.match('^[ap].*t')
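A minimal sketch of the pattern above: a string qualifies when it starts with 'a' or 'p' and contains a 't' somewhere after that.

```python
import pandas as pd

s = pd.Series(['appoint', 'price', 'sleep', 'amount'])
matched = s.str.match('^[ap].*t').tolist()

print(matched)
```

'price' starts with 'p' but contains no 't', so it does not match.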

5. String logic judgment

5.1 contains function

Tests whether a pattern or regular expression is contained in each string of a Series or Index.

Parameters:

pat: a string or regular expression.

case: whether the test is case-sensitive. Default is True, that is, case-sensitive.

flags: flags passed through to the re module, such as re.IGNORECASE. Default is 0 (no flags).

na: the value to use in the result for missing values. By default they stay missing.

regex: whether to treat pat as a regular expression. Default is True.

s = pd.Series(['APpoint', 'Price', 'cap', 'approve', 123])
s.str.contains('ap', case=True, na=False, regex=False)
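The effect of case, na, and regex can be compared in a short sketch. The case-insensitive call and the `^ap` pattern are additions for illustration, not part of the article's example.

```python
import pandas as pd

s = pd.Series(['APpoint', 'Price', 'cap', 'approve', 123])

# Literal, case-sensitive search; the non-string 123 is reported as False via na
literal = s.str.contains('ap', case=True, na=False, regex=False).tolist()

# Same search, ignoring case
ignore_case = s.str.contains('ap', case=False, na=False, regex=False).tolist()

# pat as a regular expression: only strings starting with 'ap'
as_regex = s.str.contains(r'^ap', na=False).tolist()

print(literal)
print(ignore_case)
print(as_regex)
```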

5.2 endswith function

Tests whether the end of each string element matches the given string.

s = pd.Series(['APpoint', 'Price', 'cap', 'approve', 123])
s.str.endswith('e')

Handling NaN values with the na parameter:

s = pd.Series(['APpoint', 'Price', 'cap', 'approve', 123])
s.str.endswith('e', na=False)

5.3 startswith function

Tests whether the beginning of each string element matches the given string.

s = pd.Series(['APpoint', 'Price', 'cap', 'approve', 123])
s.str.startswith
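A small sketch of endswith and startswith on the same Series. The prefix 'a' passed to startswith is an assumption for illustration, since the article does not show the argument.

```python
import pandas as pd

s = pd.Series(['APpoint', 'Price', 'cap', 'approve', 123])

ends = s.str.endswith('e')                           # 123 is not a string: NaN
ends_filled = s.str.endswith('e', na=False).tolist()
starts_filled = s.str.startswith('a', na=False).tolist()  # case-sensitive

print(ends.tolist())
print(ends_filled)
print(starts_filled)
```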

5.4 isalnum function

Checks whether all characters in each string are alphanumeric.

s1 = pd.Series(['one', 'one1', '1', ''])
s1.str.isalnum()

5.5 isalpha function

Checks whether all characters in each string are letters.

s1 = pd.Series(['one', 'one1', '1', ''])
s1.str.isalpha()

5.6 isdecimal function

Checks whether all characters in each string are decimal characters, that is, characters that can form base-10 numbers.

s1 = pd.Series(['one', 'one1', '1'])
s1.str.isdecimal()

5.7 isdigit function

Checks whether all characters in each string are digits.

s1 = pd.Series(['one', 'one1', '1'])
s1.str.isdigit()

5.8 islower function

Checks whether all cased characters in each string are lowercase.

s1 = pd.Series(['one', 'one1', '1'])
s1.str.islower()

5.9 isnumeric function

Checks whether all characters in each string are numeric characters.

s1 = pd.Series(['one', 'one1', '1'])
s1.str.isnumeric()

5.10 isspace function

Checks whether all characters in each string are whitespace.

s1 = pd.Series(['one', '\t\r\n', ''])
s1.str.isspace()

5.11 istitle function

Checks whether each string is in title case, that is, every word starts with an uppercase letter and continues in lowercase.

s1 = pd.Series(['leopard', 'Golden Eagle', 'SNAKE', ''])
s1.str.istitle()

5.12 isupper function

Checks whether all cased characters in each string are uppercase.

s1 = pd.Series(['leopard', 'Golden Eagle', 'SNAKE', ''])
s1.str.isupper()
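The character checks from 5.4 to 5.12 can be compared in one sketch. The `checks` DataFrame and the whitespace Series are illustrative; only ASCII input is used here, where isdecimal, isdigit, and isnumeric agree.

```python
import pandas as pd

s1 = pd.Series(['one', 'one1', '1', ''])
checks = pd.DataFrame({
    'isalnum': s1.str.isalnum(),
    'isalpha': s1.str.isalpha(),
    'isdecimal': s1.str.isdecimal(),
    'isdigit': s1.str.isdigit(),
    'isnumeric': s1.str.isnumeric(),
    'islower': s1.str.islower(),
})

s5 = pd.Series(['leopard', 'Golden Eagle', 'SNAKE', ''])
title_flags = s5.str.istitle().tolist()
upper_flags = s5.str.isupper().tolist()

# Hypothetical whitespace-only strings plus an empty string
space_flags = pd.Series([' ', '\t\r\n ', '']).str.isspace().tolist()

print(checks)
print(title_flags, upper_flags, space_flags)
```

The empty string returns False for every one of these checks.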

5.13 get_dummies function

Splits each string in the Series by sep (default '|') and returns a DataFrame of dummy (indicator) variables.

s1 = pd.Series(['leopard', 'Golden Eagle', 'SNAKE', ''])
s1.str.get_dummies()

When one string holds several values joined by the separator, each value gets its own indicator column, so pay attention to the form of the input:

s1 = pd.Series(['a|b', np.nan, 'a|c'])
s1.str.get_dummies()
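A minimal sketch of the second call: each distinct token becomes a column, and a row holds 1 where the token is present. The missing value yields a row of zeros.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['a|b', np.nan, 'a|c'])
dummies = s1.str.get_dummies()  # default separator is '|'

print(dummies)
```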

6. Other

6.1 strip

Removes leading and trailing characters. By default it removes whitespace, including newlines and tabs.

s1 = pd.Series(['1. Ant.', '2. Bee!\n', '3. Cat?\t', np.nan])
s1.str.strip()

6.2 lstrip

Removes leading characters (whitespace by default) from each string in the Series / Index.

6.3 rstrip

Removes trailing characters (whitespace by default) from each string in the Series / Index.
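The three functions differ only in which side they trim. The sketch below uses a hypothetical Series and also passes an explicit set of characters to strip instead of whitespace.

```python
import numpy as np
import pandas as pd

s = pd.Series(['  ant  ', 'xxbeexx', np.nan])

both = s.str.strip()        # whitespace removed on both sides
left = s.str.lstrip()       # leading whitespace only
right = s.str.rstrip()      # trailing whitespace only
custom = s.str.strip('x')   # remove the character 'x' instead of whitespace

print(both.tolist())
print(left.tolist())
print(right.tolist())
print(custom.tolist())
```

Missing values stay missing in every result.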

This concludes the overview of the methods for dealing with text type data in pandas. Thank you for reading.
