
How to use PyCaret to quickly and easily build machine learning projects and prepare the final model for deployment


This article introduces how to use PyCaret to quickly and easily build a machine learning project and prepare the final model for deployment.

This is where PyCaret comes in. PyCaret is a high-level, low-code Python library that makes it easy to compare, train, evaluate, tune, and deploy machine learning models in just a few lines of code. In essence, PyCaret is a large wrapper around many data science libraries such as Scikit-learn, Yellowbrick, SHAP, Optuna, and spaCy. Yes, you could use those libraries directly for the same tasks, but PyCaret can save a lot of time if you don't want to write much code.

Install PyCaret

PyCaret is a large library with many dependencies. I recommend using Conda to create a virtual environment for PyCaret so that the installation does not affect any of your existing libraries. To create and activate a virtual environment in Conda, run the following command:

conda create --name pycaret_env python=3.6
conda activate pycaret_env

To install the default, slimmer version of PyCaret with only the required dependencies, run the following command.

pip install pycaret

To install the full version of PyCaret, you should run the following command.

pip install pycaret[full]

Once PyCaret is installed, deactivate the virtual environment and then add it to Jupyter with the following commands.

conda deactivate
python -m ipykernel install --user --name pycaret_env --display-name "pycaret_env"

Now, after starting Jupyter Notebook in your browser, you should see an option to change the kernel to the environment you just created.

> Changing the Conda virtual environment in Jupyter.

Import libraries

You can find the complete code for this article in its GitHub repository. In the following code, I only import NumPy and Pandas to handle the data for this demonstration.

import numpy as np
import pandas as pd

Read data

For this example, I used the California Housing Prices dataset available on Kaggle. In the following code, I read this dataset into a DataFrame and display its first ten rows.

housing_data = pd.read_csv('./data/housing.csv')
housing_data.head(10)

> First ten rows of the housing dataset.

The output above gives us some idea of what the data look like. The data contain mostly numerical features and one categorical feature that describes each house's proximity to the ocean. The target column we are trying to predict is the median_house_value column. The entire dataset contains 20,640 observations.
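Before initializing PyCaret, it can also be useful to take a quick look at the column types and missing values with plain Pandas. This is an optional sketch; it assumes the Kaggle CSV uses its usual column names, such as ocean_proximity.

housing_data.info()  # column dtypes and non-null counts
housing_data.isnull().sum()  # missing values per column
housing_data['ocean_proximity'].value_counts()  # levels of the categorical feature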

Initialize the experiment

Now that we have the data, we can initialize a PyCaret experiment, which will preprocess the data and enable logging for every model that will be trained on this dataset.

from pycaret.regression import *

reg_experiment = setup(housing_data,
                       target='median_house_value',
                       session_id=123,
                       log_experiment=True,
                       experiment_name='ca_housing')

As shown in the GIF below, running the code above preprocesses the data and then produces a DataFrame summarizing the experiment settings.

> Pycaret setup function output.
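The setup function also accepts optional preprocessing arguments if you want more control over this step. Below is a hedged sketch; normalize and train_size are arguments as I recall them from the PyCaret 2.x API, and the values shown are only illustrative.

reg_experiment = setup(housing_data,
                       target='median_house_value',
                       session_id=123,
                       normalize=True,   # scale the numeric features
                       train_size=0.8,   # fraction of the data used for training vs. hold-out
                       log_experiment=True,
                       experiment_name='ca_housing')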

Compare baseline models

We can immediately compare different baseline models to find the one with the best k-fold cross-validation performance using the compare_models function, as shown in the code below. In this example, I exclude XGBoost for demonstration purposes.

best_model = compare_models(exclude=['xgboost'], fold=5)

> Results of comparing different models.

This function generates a DataFrame with performance statistics for each model and highlights the metrics of the best-performing model, in this case the CatBoost regressor.
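If you want the full comparison grid as a DataFrame for further analysis, the pull function (available in the PyCaret 2.x API, as far as I know) returns the last score grid that was displayed:

comparison_results = pull()  # score grid from the most recent PyCaret command
comparison_results.head()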

Create a model

We can also use PyCaret to train a model in a single line of code. The create_model function only needs a string that corresponds to the type of model you want to train. You can find a complete list of accepted strings and the corresponding regression models on the PyCaret documentation page.

catboost = create_model('catboost')

The create_model function generates a DataFrame like the one above with the cross-validation metrics of the trained CatBoost model.
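The create_model function also lets you adjust the cross-validation itself; for example, the fold argument (part of the PyCaret 2.x API, to the best of my knowledge) changes the number of folds, and the value and variable name below are only illustrative:

catboost_cv10 = create_model('catboost', fold=10)  # 10-fold instead of the default cross-validation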

Hyperparameter tuning

Now that we have a trained model, we can optimize it further with hyperparameter tuning. With just one line of code, we can tune the model's hyperparameters, as shown below.

tuned_catboost = tune_model(catboost, n_iter=50, optimize='MAE')

> Results of hyperparameter tuning with 10-fold cross-validation.

The most important results (in this case, the mean metrics) are highlighted in yellow.
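If the default random search is not enough, tune_model can also, as far as I know, search over a custom grid via the custom_grid argument. The CatBoost hyperparameter ranges below are only an illustration:

custom_grid = {
    'depth': [4, 6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'l2_leaf_reg': [1, 3, 5, 7]
}
tuned_catboost = tune_model(catboost, custom_grid=custom_grid, n_iter=50, optimize='MAE')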

Visualize model performance

We can use PyCaret to create many plots that visualize the model's performance. PyCaret uses another high-level library called Yellowbrick to build these visualizations.

Residual plot

By default, the plot_model function generates a residual plot for regression models, as shown below.

plot_model(tuned_catboost)

> Residual plot for the tuned CatBoost model.
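If you want to keep the figures, plot_model can also write them to disk instead of only displaying them; the save flag below is how I recall the PyCaret 2.x API, writing a PNG to the working directory:

plot_model(tuned_catboost, save=True)  # save the residual plot instead of only displaying it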

Prediction error

We can also visualize the predicted values against the actual target values by creating a prediction error plot.

plot_model(tuned_catboost, plot='error')

> Prediction error plot for the tuned CatBoost regressor.

The plot above is particularly useful because it gives us a visual representation of the R² coefficient of the CatBoost model. In the ideal case (R² = 1), where the predicted values exactly match the actual target values, the plot would contain only points along the dotted identity line.

Feature importance

We can also visualize the feature importances of the model, as shown below.

plot_model(tuned_catboost, plot='feature')

> Feature importance plot for the CatBoost regressor.

As can be seen in the chart above, median income is the most important feature for predicting house prices. Since this feature corresponds to the median income of the area where the houses were built, this ranking is very reasonable: houses built in high-income areas are likely to be more expensive than those in low-income areas.

Evaluate the model with all plots

We can also create multiple plots to evaluate the model using the evaluate_model function.

evaluate_model(tuned_catboost)

> The interface created using the evaluate_model function.

Interpret the model

The interpret_model function is a useful tool for explaining the model's predictions. This function uses an interpretable machine learning library called SHAP, which I have covered in a separate article.

With just one line of code, we can create a SHAP beeswarm plot for the model.

interpret_model(tuned_catboost)

> SHAP plot produced by calling the interpret_model function.

From the chart above, we can see that the median income field has the greatest impact on the predicted house value.
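interpret_model also supports other SHAP-based views through its plot argument. Below is a hedged sketch of a dependence (correlation) plot for a single feature; the plot and feature arguments are as I recall them from the PyCaret 2.x API, and median_income is the column name from this dataset:

interpret_model(tuned_catboost, plot='correlation', feature='median_income')  # SHAP dependence plot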

Automatic machine learning

PyCaret also has the ability to run automated machine learning (AutoML). We can specify the loss function or metric we want to optimize and let the library take over, as shown below.

automl_model = automl(optimize='MAE')

In this example, the AutoML model also happens to be a CatBoost regressor, which we can confirm by printing it out.

print(automl_model)

Running the print statement above displays the parameters of the selected model.

Generate predictions

The predict_model function allows us to generate predictions using either data from the experiment or new, unseen data.

pred_holdouts = predict_model(automl_model)
pred_holdouts.head()

The predict_model call above generates predictions for the hold-out dataset that was set aside to validate the model. The code also gives us a DataFrame with performance statistics for the predictions generated by the AutoML model.

> Predictions generated by the AutoML model.

In the output above, the Label column contains the predictions generated by the AutoML model. We can also generate predictions for the entire dataset, as shown in the code below.

new_data = housing_data.copy()
new_data.drop(['median_house_value'], axis=1, inplace=True)
predictions = predict_model(automl_model, data=new_data)
predictions.head()

Save the model

PyCaret also lets us save a trained model with the save_model function. This function saves the model's transformation pipeline to a pickle file.

save_model(automl_model, model_name='automl-model')

We can also use the load_model function to load the saved AutoML model.

loaded_model = load_model('automl-model')
print(loaded_model)

Printing out the loaded model produces the following output:

Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[], display_types=True,
                                      features_todrop=[], id_columns=[],
                                      ml_usecase='regression',
                                      target='median_house_value',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None, numer...
                ('cluster_all', 'passthrough'),
                ('dummy', Dummify(target='median_house_value')),
                ('fix_perfect', Remove_100(target='median_house_value')),
                ('clean_names', Clean_Colum_Names()),
                ('feature_select', 'passthrough'),
                ('fix_multi', 'passthrough'),
                ('dfs', 'passthrough'),
                ('pca', 'passthrough'),
                ['trained_model', ...]],
         verbose=False)

As can be seen from the output above, PyCaret saves not only the trained model at the end of the pipeline, but also the feature engineering and data preprocessing steps at the beginning of the pipeline. Now that we have a production-ready machine learning pipeline in a single file, we don't have to worry about stitching the pieces of the pipeline together ourselves.
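To make that concrete, here is a hedged sketch of how a separate scoring script might use the saved pipeline. The file name matches the save_model call above; the path to the new data is hypothetical, and the Label column name is taken from the predict_model output shown earlier.

from pycaret.regression import load_model, predict_model
import pandas as pd

pipeline = load_model('automl-model')  # loads automl-model.pkl, preprocessing steps included
raw_new_data = pd.read_csv('./data/new_houses.csv')  # hypothetical path to unscored data
scored = predict_model(pipeline, data=raw_new_data)  # the Label column holds the predictions
print(scored.head())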

Model deployment

Now that we have a model pipeline ready for production, we can also use the deploy_model function to deploy the model to a cloud platform such as AWS. If you plan to deploy the model to an S3 bucket, you must configure the AWS command line interface by running the following command before calling this function:

aws configure

Running the command above will trigger a series of prompts asking for information such as your AWS Secret Access Key. Once this process is complete, you can deploy the model with the deploy_model function.

deploy_model(automl_model,
             model_name='automl-model-aws',
             platform='aws',
             authentication={'bucket': 'pycaret-ca-housing-model'})

In the code above, I deployed the AutoML model to an S3 bucket named pycaret-ca-housing-model on AWS. From here, you can write an AWS Lambda function that pulls the model from S3 and runs it in the cloud. PyCaret also allows you to load the model from S3 using the load_model function.
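As a rough illustration of that Lambda idea, a handler could load the pipeline from S3 with load_model (using the same platform and authentication arguments as the deploy_model call above) and score incoming records. The event format and the practice of loading the model on every invocation are simplifying assumptions, not a recommended production setup.

import pandas as pd
from pycaret.regression import load_model, predict_model

def lambda_handler(event, context):
    # Load the pipeline that deploy_model uploaded to the S3 bucket
    model = load_model('automl-model-aws',
                       platform='aws',
                       authentication={'bucket': 'pycaret-ca-housing-model'})
    # Assume the event body contains a list of feature records
    data = pd.DataFrame(event['records'])
    predictions = predict_model(model, data=data)
    return predictions['Label'].tolist()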

MLflow user interface

Another nice feature of PyCaret is that it can record and track your machine learning experiments with a machine learning lifecycle tool called MLflow. Running the following command launches the MLflow user interface in the browser on localhost.

! mlflow ui

> MLFlow dashboard.

In the dashboard above, we can see that MLflow tracks the runs of the different models from our PyCaret experiment. You can view the performance metrics and the run time of each run in the experiment.
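By default the runs are written to a local mlruns directory. If you launch the UI from a terminal instead of a notebook cell, you can point it at that directory and choose a port; these are standard MLflow CLI flags rather than anything PyCaret-specific.

mlflow ui --backend-store-uri ./mlruns --port 5000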

Advantages and disadvantages of using PyCaret

If you have read this far, you now have a basic understanding of how to use PyCaret. Although PyCaret is a great tool, it has its own advantages and disadvantages that you should be aware of if you plan to use it for data science projects.

Advantages:

Low-code library.

Great for simple, standard tasks and general machine learning.

Provides support for regression, classification, natural language processing, clustering, anomaly detection, and association rule mining.

Makes it easy to create and save complex transformation pipelines for models.

Makes it easy to visualize model performance.

Disadvantages:

As of now, PyCaret is not ideal for text classification because its NLP utilities are limited to topic-modeling algorithms.

PyCaret is not ideal for deep learning and does not support Keras or PyTorch models.

You cannot use PyCaret (at least as of version 2.2.0) for more complex machine learning tasks such as image classification and text generation.

By using PyCaret, you trade some control for simple, high-level code.

This concludes the study of how to use PyCaret to quickly and easily build a machine learning project and prepare the final model for deployment. Pairing theory with practice is the best way to learn, so give it a try!
