2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains how to use Python to make a small data preprocessing tool. The content is simple and easy to follow, so let's work through it step by step.
When we use Python for data processing and analysis, we usually import a number of libraries, preview the data to check for missing values, duplicate values and other anomalies, and then deal with them.
This article uses the GUI library PySimpleGUI to show how to build your own data preprocessing tool, so that this routine can be automated as well.
The article is divided into three parts:
Building the GUI
Data preprocessing
Packaging and testing
The tool relies on the following modules:
PySimpleGUI
Pandas
Matplotlib
1. Building the GUI
Approach
As usual, the approach comes first and the code second. Using PySimpleGUI still comes down to the same four steps:
Import the module => create the elements and fill in the layout => create the window => run the event loop
In terms of elements, the design calls for: a menu bar with usage instructions; a "data preprocessing" frame with a sunken look; three radio options inside the frame; three elements for reading the file path (a fixed text label, a text input, and a browse button); and three buttons, "View", "Process" and "Close".
Overall, all the elements in the window should be centered, except for the menu bar, which sits at the left edge of the window. The layout is built row by row.
In terms of events, the instructions menu should display the notes the user needs. The file picker is restricted to the two Excel formats we commonly use for storing data (".xlsx" and ".xls").
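The restriction to the two Excel formats can also be checked in code, since a path typed by hand bypasses the file-type filter of the dialog. The following is a minimal sketch; the helper name is_excel_path and the constant EXCEL_SUFFIXES are illustrative and not part of the original program.

```python
from pathlib import Path

# The two formats the tool accepts (names here are illustrative).
EXCEL_SUFFIXES = {".xlsx", ".xls"}

def is_excel_path(path):
    """Return True if `path` is a non-empty string ending in .xlsx or .xls."""
    # An empty string or None (for example, a cancelled dialog) is rejected.
    return bool(path) and Path(path).suffix.lower() in EXCEL_SUFFIXES

print(is_excel_path("C:/data/stocks.XLSX"))
print(is_excel_path("C:/data/stocks.csv"))
```

The check is case-insensitive, so a file saved as ".XLSX" is still accepted.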
After the file is read, we choose one of the options in the data preprocessing frame. We can then view the problems it finds in a pop-up box, and finally process the data.
Processing overwrites the original data file with the processed data, and the whole workflow should run without interruption. One tip: make a backup before each analysis, so that you are not left without the original data file if something goes wrong along the way.
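Because processing overwrites the original file, the backup can be automated too. This is a small sketch using only the standard library; the function name backup_file and the ".bak" suffix are assumptions made for illustration.

```python
import shutil
from pathlib import Path

def backup_file(path):
    """Copy `path` to a sibling file with '.bak' appended; return the copy's path."""
    source = Path(path)
    target = source.with_name(source.name + ".bak")
    shutil.copy2(source, target)  # copy2 copies the contents and file metadata
    return target
```

Calling backup_file on the selected path just before the data is written back would leave a "name.xlsx.bak" copy next to the original.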
Code
With the approach laid out, let's implement it. First look at the complete code, then we will go through it in detail.
import PySimpleGUI as sg
import pandas as pd
import matplotlib

matplotlib.use("TkAgg")  # interactive backend, used for the box plot
sg.ChangeLookAndFeel('GreenTan')

# Menu bar: one top-level menu with a single entry
menu_def = [['&Instructions for use', ['&Note']]]

layout = [
    [sg.Menu(menu_def, tearoff=True)],
    [sg.Frame(layout=[
        [sg.Radio('Duplicate value processing', "RADIO1", size=(15, 1), key="dup"),
         sg.Radio('Missing value processing', "RADIO1", size=(15, 1), key="mis"),
         sg.Radio('Outlier processing', "RADIO1", default=True, key="war")]],
        title='Data preprocessing', title_color='green', title_location='n',
        relief=sg.RELIEF_SUNKEN, tooltip='Choose one of the processing methods')],
    [sg.Text('File location', size=(8, 1), auto_size_text=False, justification='right'),
     sg.InputText(enable_events=True, key="lujing"),
     sg.Button('Browse', key='getf')],
    # An explicit key keeps the event name "Cancel" even though the label is "Close"
    [sg.Button('View', key='look'), sg.Submit('Process', key='handle'),
     sg.Cancel('Close', key='Cancel')],
]

window = sg.Window('Feature Engineering', layout, default_element_size=(40, 1), grab_anywhere=False)

while True:
    event, values = window.read()
    if event == 'getf':
        text = sg.popup_get_file(
            'Click Browse or enter the absolute path to the file yourself',
            title='Get file',
            file_types=(("Excel Files", "*.xlsx"), ("Excel Files", "*.xls")))
        sg.popup('Prompt', 'Confirm the selected file--', text)
        window['lujing'].update(text)
    if event == "look":
        pass  # preview the selected kind of problem (see the data preprocessing part)
    if event == "handle":
        pass  # process the data (see the data preprocessing part)
    if event == "Cancel" or event == sg.WIN_CLOSED:
        break
    if event == "Note":
        pass  # show the usage notes here

window.close()
Code interpretation
Once the approach is clear, the rest becomes much easier. Next, we explain the role of the relevant parameters.
First, matplotlib.use("TkAgg"): this call changes how the figure is displayed when we view outliers (as a box plot). TkAgg is an interactive backend.
An interactive backend lets you operate on the figure: zoom in and out of a region, inspect values, and so on.
We call it for two reasons. First, we are building a GUI and want that interactive feel. Second, when the amount of data is large the boxes become very small, and zooming makes them easier to inspect.
Second, sg.ChangeLookAndFeel('GreenTan'): changes the color theme of the window.
menu_def defines the menu bar as a nested list of the form ['menu', ['item']], giving the top-level menu and its submenu entries. With tearoff=True, a dashed line appears at the top of the menu, and clicking it detaches the menu into its own small window.
sg.Frame(): like the sg.Column() element, this is a container for several child elements. We set the relief parameter so that the whole frame looks sunken. The tooltip parameter is the small hint shown when the mouse hovers over the frame.
The title_location parameter is interesting. It sets the position of the title string and accepts values such as n, s, e, w, ne and so on. Unlike the position settings of other elements, it is based on compass directions.
sg.Radio(): a radio button. Giving all the radio buttons the same group_id makes them mutually exclusive, so only one of the three can be selected. Here we use "RADIO1" as the group_id.
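Because the three radio buttons share one group, exactly one of their keys is True in the values dictionary returned by window.read(). A minimal sketch of reading the selection follows; the helper selected_mode and the MODES mapping are illustrative names, and the sample dictionary only imitates the shape of the real one.

```python
# Keys of the three Radio elements in the layout, mapped to a readable label.
MODES = {"dup": "duplicate values", "mis": "missing values", "war": "outliers"}

def selected_mode(values):
    """Return the key of the radio button that is currently selected, or None."""
    for key in MODES:
        if values.get(key):
            return key
    return None

# A dictionary shaped like the one window.read() returns for this layout.
example_values = {"dup": False, "mis": True, "war": False, "lujing": ""}
mode = selected_mode(example_values)
print(mode, "->", MODES[mode])
```

Keeping this lookup in a plain function means the "View" and "Process" branches of the event loop can share it.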
sg.Button(): the GUI uses four buttons in total, including the predefined shortcut buttons sg.Submit and sg.Cancel.
sg.popup(): a basic pop-up box, used to show short prompt messages.
sg.popup_get_file(): a more advanced pop-up window with a text input field and a browse button, so that the user can select a file.
2. Data preprocessing
With the GUI part done, we move on to the data processing part, which covers duplicate values, missing values and outliers.
Data preparation
The data used here is the A-share market data for October 28, 2020.
A preview of the data shows that it contains duplicate rows and missing values.
Duplicate value processing
For two-dimensional tabular data (a DataFrame), Pandas is the most convenient module to work with.
import pandas as pd
df = pd.read_excel('file absolute path')
imfor = df[df.duplicated()]
imfor = str(imfor)
First we import Pandas and read the file from its path. We use an absolute path rather than a relative one because the packaged GUI no longer runs from the Python environment and directory the script was written in, so a relative path may not resolve.
df[df.duplicated()] returns the rows that duplicate an earlier row, as a DataFrame. The result imfor is converted to a str because it will be passed as a string to the pop-up element in the GUI later.
Duplicate values are finally removed as follows:
df.drop_duplicates(inplace=True)
It is only one line of code, but it removes every duplicated row in the table, which shows how powerful Pandas is.
inplace=True is needed because by default drop_duplicates() leaves the original table unchanged and returns a new one. With inplace=True the original df is modified directly and the call returns None, so its result must not be assigned back to df.
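The behaviour of duplicated() and drop_duplicates() can be seen on a tiny table. This is a sketch with made-up sample data (the column names and values are hypothetical, not taken from the A-share file).

```python
import pandas as pd

# Hypothetical sample data: the third row repeats the second one.
df = pd.DataFrame({"name": ["a", "b", "b"], "close": [1.0, 2.0, 2.0]})

duplicates = df[df.duplicated()]   # rows that repeat an earlier row
report = str(duplicates)           # string form, ready to pass to a pop-up

result = df.drop_duplicates(inplace=True)  # modifies df itself
print(result)    # the in-place call returns None
print(len(df))   # number of rows left after removing duplicates
```

If you prefer assignment, df = df.drop_duplicates() without inplace gives the same table.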
Missing value processing
Let's look at the code first. I covered missing value handling in more detail in an article about a year ago.
import pandas as pd
df = pd.read_excel('file absolute path')
# df.isnull()
imfor1 = df.isnull().sum()
# df.isnull().any()
imfor1 = str(imfor1)
For a table with missing values, df.isnull() or df.isna() shows where the null values are. Each cell is tested: if it is null the result is True, otherwise False.
Here we use df.isnull().sum() to count the missing values in each column. If the amount of data is large, you can use df.isnull().any() instead to see only which columns contain missing values.
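A short sketch of the two checks on made-up data (the column names are hypothetical). It also shows the row-wise variant, any(axis=1), which is not in the original code but is handy for listing the affected rows.

```python
import pandas as pd

# Hypothetical sample data with one missing value in the "open" column.
df = pd.DataFrame({"open": [1.0, None, 3.0], "close": [1.5, 2.5, 3.5]})

missing_per_column = df.isnull().sum()         # count of nulls in each column
has_missing = df.isnull().any()                # True for columns with any null
rows_with_missing = df[df.isnull().any(axis=1)]  # rows with at least one null

print(str(missing_per_column))
print(str(has_missing))
print(str(rows_with_missing))
```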
As for the solution, there are many ways to deal with missing values: filling with the mean, filling with the median, deleting the rows, filling with the value below, and so on. Here we fill each gap with the value above it (forward fill).
df = df.ffill()  # same result as the older df.fillna(method='pad')
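Forward filling copies the last valid value downward, which is what method='pad' means. The sketch below uses made-up data and DataFrame.ffill(); note that a gap in the first row has nothing above it and stays empty.

```python
import pandas as pd

# Hypothetical sample data: gaps in the middle of "close", and at the top of "open".
df = pd.DataFrame({
    "open": [None, 2.0, None, 4.0],
    "close": [10.0, None, None, 13.0],
})

filled = df.ffill()  # each gap takes the value from the row above

print(filled)
print(filled.isnull().sum())  # the leading gap in "open" is still missing
```

If leading gaps matter for your data, a second pass such as filled.bfill() is one option, at the cost of borrowing values from later rows.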
Outlier handling
An outlier is a value in a numeric field that does not fit in with the rest. For example, if a three-digit number appears in a column where all the other numbers are single digits, that three-digit number is an outlier.
There are two common ways to detect outliers in Python: inspecting a box plot and using the standard deviation. Here we choose the box plot.
A box plot is a statistical chart that shows how spread out the selected data is. Values above the upper edge or below the lower edge of the box plot are treated as outliers.
The lower quartile is the value below which 25 percent of the sample lies, written Q1. The upper quartile is the value above which 25 percent of the sample lies, written Q3. The upper edge is the upper quartile plus 1.5 times the difference between the upper and lower quartiles, Q3 + 1.5 × (Q3 - Q1). The lower edge is computed the other way round, Q1 - 1.5 × (Q3 - Q1).
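The edges described above can also be computed directly, which is useful when you want a list of outliers rather than a picture. This is a minimal sketch on made-up numbers; the function name iqr_fences is illustrative, and the quartiles use pandas' default (linear) interpolation, which may differ slightly from other tools.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return (lower edge, upper edge) using k times the interquartile range."""
    q1 = series.quantile(0.25)   # lower quartile
    q3 = series.quantile(0.75)   # upper quartile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: nine small numbers and one three-digit value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_fences(values)
outliers = values[(values < lower) | (values > upper)]

print(lower, upper)
print(outliers.tolist())
```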
In Pandas, you can call the .boxplot() function to draw a box plot:
import pandas as pd
import matplotlib.pyplot as plt
df.boxplot()
plt.show()  # opens the interactive figure window
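The "look" and "handle" branches of the event loop were left empty in the GUI code. One way they could be filled is with two small functions built from the calls explained in this part. This is only a sketch: the names preview and process are hypothetical, and the outlier option is left out because the article only shows how to view outliers with a box plot, not how to process them.

```python
import pandas as pd

def preview(df, mode):
    """Return a text report for the chosen mode ('dup' or 'mis')."""
    if mode == "dup":
        return str(df[df.duplicated()])   # the duplicated rows
    if mode == "mis":
        return str(df.isnull().sum())     # missing values per column
    raise ValueError(f"no text preview for mode {mode!r}")

def process(df, mode):
    """Return a cleaned copy of df for the chosen mode ('dup' or 'mis')."""
    if mode == "dup":
        return df.drop_duplicates()
    if mode == "mis":
        return df.ffill()
    raise ValueError(f"no processing defined for mode {mode!r}")

# Hypothetical sample data: one duplicated row and one missing value.
sample = pd.DataFrame({"a": [1.0, 1.0, None], "b": [2, 2, 3]})
print(preview(sample, "dup"))
print(process(sample, "mis"))
```

In the event loop, the "look" branch would pass the result of preview() to sg.popup(), and the "handle" branch would write the result of process() back to the selected path, for example with DataFrame.to_excel().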
3. Packaging and demo
After all the code is written, we can use pyinstaller to package it.
Suppose your program is named yuchuli.py. Type the following in a cmd window to build the package:
pyinstaller -F yuchuli.py
After packaging, the exe is in the dist folder inside the folder that contains the Python file. Run it to see the result.
All three preprocessing functions we need (duplicate values, missing values and outliers) can now be handled in the way we specified.
Thank you for reading. That is all for "how to use Python to make a data preprocessing gadget". After working through this article you should have a better understanding of how to build such a tool with Python, though the details still need to be verified in practice.