What are the data cleaning methods of SQL 07/11 Update SLTechnology News&Howtos

What are the data cleaning methods of SQL

2025-07-11 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article mainly explains "what are the data cleaning methods of SQL". The content of the explanation is simple and clear, and it is easy to learn and understand. Please follow the editor's train of thought to study and learn "what are the data cleaning methods of SQL".

The outline is shown in the figure:

01 delete the specified column and rename the column

Scene:

In most cases, not all the features (columns) of the bottom table are useful for analysis, so only some columns need to be extracted, and those columns that are not used can be deleted.

Renaming columns can prevent some columns from being named too verbose (such as Case When statements) and sometimes based on different business metrics requirements.

Delete column Python version:

Df.drop (col_names, axis=1, inplace=True)

Delete column SQL version:

Select col_names from Table_Name alter table tableName drop column columnName

Rename is listed in the Python version:

Df.rename (index= {'row1':'A'}, columns = {' col1':'B'})

Rename is listed in the SQL version:

Select col_names as col_name_B from Table_Name

Because in general there is no permission to delete (you can build temporary tables), another logic of deletion is to select the specified column (Select).

02 duplicate value, missing value processing

Scenario: for example, if 1000 people visit a website today, but a person can visit it several times a day, multiple records of user visits will be recorded in the database. At this time, if you want to find the ID of 1000 people who visited the site today and do user research based on it, you need to remove the duplicate values and give them to the business side to return the visit.

Missing value: when NULL does the operation logic, the returned result is still NULL, which may lead to some BUG whose scripts run correctly but the results are incorrect. In this case, you need to populate the NULL value to the specified value.

Duplicate value handling Python version:

Df.drop_duplicates ()

Duplicate value handling SQL version:

Select distinct col_name from Table_Name select col_name from Table_Name group bycol_name

Missing value handling Python version:

Df.fillna (value = 0) df1.combine_first (df2)

Missing value handling SQL version:

Select ifnull (col_name,0) value from Table_Name select coalesce (col_name,col_name_A,0) as value from Table_Name select case when col_name is null then 0 else col_name end from Table_Name

03 replace string spaces, clean *% @ and other junk characters, string concatenation, separation and other string processing

Scenario: an important way to understand user behavior is to assume the user's psychology, which will use the user's feedback or some research text data, which is usually stored in the database in the form of a string. However, the text feedback from the user is usually very messy, so if you need to extract useful information from these dirty strings, you will need to use the text string processing function.

String processing Python version:

# # 1. Space processing DF [col _ name] = DF [col _ name] .str.lstrip () # 2, *% d and other garbage characters such as DF [col _ name]. Replace ('& #. *',', regex=True, inplace=True) # # 3, string segmentation DF [col _ name] .str.split ('separator') # 4, string concatenation DF [col _ name] .str.cat ()

String processing SQL version:

# # 1. Garbage characters such as select ltrim (col_name) from Table_name # # 2 and *% d handle select regexp_replace (col_name, regular expression) from Table_name # # 3, string segmentation select split (col_name,' separator') from Table_name # 4, string concatenation select concat_ws (col_name,' splitter') from Table_name

04 merge processing

Scenario: sometimes the features you need are stored in different tables, in order to facilitate cleaning, understanding and operation, the data of these tables need to be merged according to certain fields to form a new table, so join and other methods will be used.

Merge processing Python version:

Merge left and right

Pd.merge (left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes= ('_ x','_ y'), copy=True, indicator=False, validate=None) pd.concat ([df1,df2]) merges df1.append (df2, ignore_index=True, sort=False)

Merge processing SQL version:

Merge left and right

Select A.Magnum B.* from Table_a A join Table_b B on A.id = B.id select A.* from Table_a A left join Table_b B on A.id = B.id

Merge up and down

# # Union: merge two result sets, excluding duplicate rows, and sort the default rules at the same time; # # Union All: join two result sets, including duplicate rows, without sorting; select A.* from Table_a A union select B.* from Table_b B # Union is usually much slower than Union All because it compares the records of each query subset. In general, be sure to use Union All if you can meet the requirements with Union All.

05. Grouping sorting of window functions

Scene: if you are now an analyst of a certain treasure, if you want to analyze the sales of different categories in different stores this year, you need to find those categories that sell well and increase the exposure in the second year. At this time, you need to group different categories in different stores and sort them by sales in order to find the better categories in each store.

Demo data as above, a total of three stores, selling different categories of goods, sales corresponding to the above, to find the largest number of goods sold in each store.

Window grouping Python version:

Df ['Rank'] = df.groupby (by= [' Sale_store']) ['Sale_Num'] .transform (lambda x: x.rank (ascending=False))

Window grouping SQL version:

Select * from (Select *, row_number () over (partition by Sale_store order by Sale_Num desc) rk from table_name) b where b.rk = 1

It can be clearly seen that the hottest thing in shop an is vegetables, the most popular in store c is chicken, and shop b is the hottest.

Well, the b store is very good, selling 888 treasure dogs.

Thank you for your reading, the above is the content of "what are the data cleaning methods of SQL". After the study of this article, I believe you have a deeper understanding of what the data cleaning methods of SQL have, and the specific use needs to be verified in practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.