2025-04-06 Update From: SLTechnology News&Howtos
This article explains how a single line of Python can handle early-stage exploratory data analysis. The content is simple and easy to follow, so read along and study it step by step.
For anyone working in data science, early data cleaning and exploration is a time-consuming task. It is no exaggeration to say that 80% of our time is spent on this preliminary work, including cleaning, processing, and EDA (Exploratory Data Analysis). This work affects not only the quality of the data but also the predictive performance of the final model.
Whenever we get a new dataset, we first need to familiarize ourselves with it through manual inspection, reading field descriptions, and so on. The real EDA process does not begin until the data has been cleaned and processed.
The most common operation at this stage is to compute basic statistics and descriptions of the data: the mean, variance, maximum and minimum, frequencies, quantiles, distributions, and so on. In practice this work is fairly fixed and mechanical.
In the R language, the skimr package provides rich exploratory statistics, richer than the basic summary produced by describe() in Pandas.
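For reference, here is a minimal sketch of the Pandas baseline mentioned above: describe() reports count, mean, std, min, quartiles, and max for each numeric column. The small DataFrame below is invented purely for illustration.

```python
import pandas as pd

# A tiny invented dataset, just to show the baseline Pandas summary
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

summary = df.describe()  # rows: count, mean, std, min, 25%, 50%, 75%, max

print(summary.loc["count", "age"])  # 4.0 -- missing values are excluded
print(summary.loc["50%", "fare"])   # 8.05 -- the median fare
```

This is the fixed, mechanical summary that skimr (and, below, pandas-profiling) goes well beyond.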
[Figure 01: skimr summary output]
In the Python community we can reproduce, and even surpass, the functionality of skimr by using the pandas-profiling library for early data exploration.
Quick use
After pip install pandas-profiling, we can import and use it directly. Only one line of core code, ProfileReport(df, **kwargs), is needed:
import pandas as pd
import seaborn as sns
from pandas_profiling import ProfileReport

titanic = sns.load_dataset("titanic")
ProfileReport(titanic, title="The EDA of Titanic Dataset")
If we run this in Jupyter Notebook, the report is rendered and output directly in the cell.
[Figure 02: pandas-profiling report output]
The pandas-profiling library also extends the DataFrame object, which means we can achieve the same effect by calling DataFrame.profile_report() as a method.
Whichever method is used, the result is a ProfileReport object. To fit Jupyter Notebook better, you can call to_widgets() or to_notebook_iframe() to render the report as interactive widgets or as an embedded iframe, respectively; both look better than plain cell output.
[Figure 03: to_widgets() output]
If you are not working in Jupyter Notebook but in another IDE, you can export the report with the to_file() method; note that the saved file name needs the .html extension.
In addition, pandas-profiling is integrated with multiple frameworks and cloud platforms, making it easy to call. For more information, see https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/integrations.html.
Further customize report information
Although the generated report basically meets our simple needs for understanding the data, the output can still be insufficient in places or redundant in others. Fortunately, pandas-profiling also lets us customize it; these custom settings can ultimately be written to a YAML file.
The official documentation lists several sections we can adjust, corresponding to the tabs of each part of the report:
vars: statistical indicators, mainly controlling how fields (variables) in the data are presented in the report
missing_diagrams: mainly related to the visual display of missing values
correlations: as the name implies, adjusts the correlation section between fields, including whether to compute correlation coefficients, the relevant thresholds, etc.
interactions: mainly related to the rendering of interaction plots between pairs of fields
samples: corresponds to the head() and tail() methods in Pandas, i.e., how many rows of data to preview from the beginning and end
Many parameters can be specified within these sections. Interested readers can refer directly to the official documentation (https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html), so I won't repeat them here.
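As a quick illustration of what the samples section mirrors, here is a minimal sketch of head() and tail() in plain Pandas, using an invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# The "samples" config controls how many of these preview rows the report shows
first_rows = df.head(3)  # first 3 rows, like the head preview in the report
last_rows = df.tail(2)   # last 2 rows, like the tail preview in the report

print(len(first_rows))        # 3
print(list(last_rows["x"]))   # [8, 9]
```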
So we can write and adjust the configuration directly in code, like this:
profile_config = {
    "progress_bar": False,
    "sort": "ascending",
    "vars": {
        "num": {"chi_squared_threshold": 0.95},
        "cat": {"n_obs": 10},
    },
    "missing_diagrams": {
        "heatmap": False,
        "dendrogram": False,
    },
}

profile = titanic.profile_report(**profile_config)
profile.to_file("titanic-EDA-report.html")
We write all the configuration in a dictionary variable, then unpack its key-value pairs with the **variable syntax so that each key is matched to the corresponding parameter.
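Dictionary unpacking with ** is plain Python, independent of pandas-profiling. A minimal sketch, using a hypothetical make_report() function standing in for profile_report():

```python
# Hypothetical stand-in for profile_report(), just to show ** unpacking:
# keys in the dict are matched to parameter names by name
def make_report(progress_bar=True, sort=None, **kwargs):
    return {"progress_bar": progress_bar, "sort": sort, "extra": kwargs}

config = {
    "progress_bar": False,
    "sort": "ascending",
    "vars": {"cat": {"n_obs": 10}},
}

result = make_report(**config)  # same as make_report(progress_bar=False, ...)

print(result["progress_bar"])  # False
print(result["sort"])          # ascending
```

Keys without a matching named parameter ("vars" here) are collected into **kwargs.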
Besides configuring in code, if you know a little about writing YAML, you don't have to set options one by one in code; you can modify them in a YAML file instead. There you can change not only the options listed in the official documentation but also parameters that are not listed. Because the full configuration file is long, I only show the parts modified from the official default configuration file config_default.yaml:
# profile_config.yml
vars:
  num:
    quantiles:
      - 0.25
      - 0.5
      - 0.75
    skewness_threshold: 10
    low_categorical_threshold: 5
    chi_squared_threshold: 0.95
  cat:
    length: True
    unicode: True
    cardinality_threshold: 50
    n_obs: 5
    chi_squared_threshold: 0.95
    coerce_str_to_date: False
  bool:
    n_obs: 3
  file:
    active: False
  image:
    active: False
    exif: True
    hash: True
sort: "descending"
After modifying the yaml file, we just need to specify the path where the configuration file is located with the config_file parameter when generating the report, like this:
df.profile_report(config_file="your_file_path.yml")
Separating the configuration file from the core code improves the simplicity and readability of our code.
Thank you for reading. That covers handling early exploratory data analysis with one line of Python. I believe you now have a deeper understanding of the approach; the specifics still need to be verified in practice.