This article walks through a practical case of Python data analysis with the pandas-profiling library. I hope you find it useful as a reference and come away with something after reading it.
Preface
For anyone working in data science, the preliminary cleaning and exploration of data is a famously time-consuming task. It is no exaggeration to say that 80% of our time goes into this early work, including cleaning, preprocessing, and EDA (Exploratory Data Analysis). This groundwork affects not only the quality of the data but also the predictive performance of the final model.
Whenever we get a new dataset, we first need to familiarize ourselves with it through manual inspection, reading field descriptions, and so on. The real EDA process only begins once the data has been cleaned and processed.
The most common operation at this stage is computing basic descriptive statistics for the data at hand, including the mean, variance, maximum and minimum, frequencies, quantiles, distributions, and so on. In practice, this work is fairly fixed and mechanical.
In the R language, the skimr package provides rich exploratory statistics, noticeably richer than the basic statistics returned by describe() in pandas.
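For comparison, here is a minimal sketch of the baseline pandas offers out of the box (it loads the same seaborn Titanic dataset used later in this article):

```python
import seaborn as sns

# Load the Titanic dataset bundled with seaborn
titanic = sns.load_dataset("titanic")

# Basic descriptive statistics: count, mean, std, min, quartiles, max
print(titanic.describe())

# include="all" also summarizes categorical columns (unique, top, freq)
print(titanic.describe(include="all"))
```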
[Figure: example output of skimr]
But in the Python community we can reproduce skimr's functionality, and arguably go beyond it, by using the pandas-profiling library to handle our early data exploration work.
Quick use
After running pip install pandas-profiling, we can import and use it directly. A single line of core code, ProfileReport(df, **kwargs), does the job:
```python
import pandas as pd
import seaborn as sns
from pandas_profiling import ProfileReport

titanic = sns.load_dataset("titanic")
ProfileReport(titanic, title="The EDA of Titanic Dataset")
```
If we run this in Jupyter Notebook, the report renders right in the cell output.
[Figure: the generated profiling report]
The pandas-profiling library also extends the DataFrame object, which means we can achieve the same effect by calling DataFrame.profile_report() as a method.
Whichever way you call it, the result is a ProfileReport object. To fit Jupyter Notebook better, you can call to_widgets() or to_notebook_iframe() to render the report as interactive widgets or as an embedded iframe, respectively, which looks much nicer than the plain cell output.
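A minimal sketch of the method-style call and the two notebook renderers, reusing the titanic DataFrame from above (the title string is illustrative):

```python
# Method style: pandas-profiling registers profile_report() on DataFrame
profile = titanic.profile_report(title="The EDA of Titanic Dataset")

# Render as interactive tabbed widgets inside the notebook
profile.to_widgets()

# Or embed the full HTML report in the notebook as an iframe
profile.to_notebook_iframe()
```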
[Figure: the report rendered with to_widgets()]
If you are not working in Jupyter Notebook but in another IDE, you can export the report with the to_file() method; note that the file name you save to must carry the .html extension.
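A minimal sketch of the export step, reusing the profile object from above (the file name is an illustrative choice):

```python
# Export the report as a standalone HTML file you can open in any browser
profile.to_file("titanic-report.html")
```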
In addition, pandas-profiling integrates with multiple frameworks and cloud platforms, which makes it easy to call from those environments. For more information, see https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/integrations.html.
Customizing the report further
Although the generated exploratory report basically meets our need for a quick understanding of the data, its output can be insufficient in places and redundant in others. Fortunately, pandas-profiling also lets us customize it, and these settings can ultimately be written to a YAML file.
The official documentation lists several sections we can adjust, corresponding to the tabs of the report:
vars: statistical indicators, mainly used to adjust how the fields (variables) of the data are presented in the report
missing_diagrams: mainly concerns the visual display of missing values
correlations: as the name implies, adjusts the parts dealing with correlations between fields, including whether to compute correlation coefficients, the relevant thresholds, and so on
interactions: mainly concerns the rendering of interaction plots between pairs of fields
samples: corresponds to the head() and tail() methods in pandas, i.e. how many rows from the start and end of the data are previewed
Each of these sections accepts many parameters; interested readers can refer directly to the official documentation at https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html.
We can write and adjust the configuration directly in code, like this:
```python
profile_config = {
    "progress_bar": False,
    "sort": "ascending",
    "vars": {
        "num": {"chi_squared_threshold": 0.95},
        "cat": {"n_obs": 10},
    },
    "missing_diagrams": {
        "heatmap": False,
        "dendrogram": False,
    },
}

profile = titanic.profile_report(**profile_config)
profile.to_file("titanic-EDA-report.html")
```
Here all the configuration is written into a single dictionary, which is then unpacked with ** so that each key-value pair is matched to the corresponding parameter by key.
Besides configuring in code, if you know a little about writing YAML configuration files, you don't have to set everything in code one by one; you can modify a YAML file instead. There you can change not only the options listed in the official documentation but also parameters that are not listed. Since the full configuration file is long, I will only show the parts I modified relative to the official default configuration file, config_default.yaml:
```yaml
# profile_config.yml
vars:
  num:
    quantiles:
      - 0.25
      - 0.5
      - 0.75
    skewness_threshold: 10
    low_categorical_threshold: 5
    chi_squared_threshold: 0.95
  cat:
    length: True
    unicode: True
    cardinality_threshold: 50
    n_obs: 5
    chi_squared_threshold: 0.95
    coerce_str_to_date: False
  bool:
    n_obs: 3
  file:
    active: False
  image:
    active: False
    exif: True
    hash: True
sort: "descending"
```
After modifying the YAML file, we just need to point the config_file parameter at the configuration file's path when generating the report, like this:
```python
df.profile_report(config_file="your_config_path.yml")
```
Separating the configuration file from the core code keeps the code simple and readable.
Finally
The pandas-profiling library gives us a convenient, fast way to explore data, and it provides richer information than basic summary statistics (such as missing-value diagrams, correlation plots, and so on), which can save a lot of time in the early data exploration phase.
However, because the dimensions of the report pandas-profiling generates are relatively fixed and templated, those who want a richer report may need to do some extra work. Note also that pandas-profiling is better suited to small and medium-sized datasets: as the volume of data grows, rendering slows down markedly and generating the report takes much longer.
If you still need to run EDA on a large dataset, then, as the official documentation suggests, it is best to reduce the sample size by sampling in a way that does not distort the data's distribution. The maintainers have also indicated that future releases will add high-performance libraries and frameworks such as modin, spark, and dask as scalable backends, after which generating EDA reports for large datasets may no longer be a problem.
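A minimal sketch of that sampling idea (big_df, the 10% fraction, and the random seed are illustrative assumptions, not part of the original example):

```python
# Profile a random 10% sample instead of the full (hypothetical) big_df,
# keeping the row-level distribution roughly intact
sample = big_df.sample(frac=0.10, random_state=42)
sample.profile_report(title="EDA on a 10% sample").to_file("sample-report.html")
```

Recent versions of pandas-profiling also offer a minimal=True option, as in ProfileReport(df, minimal=True), which disables the most expensive computations and can likewise help on larger datasets.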
This concludes this article on a case of Python data analysis. I hope the content above has been helpful; if you think the article is good, please share it so more people can see it.