This article explains which Python automation libraries data scientists should know. The libraries introduced here are simple, fast, and practical to pick up, so let's take a look at each of them.
1.auto-sklearn
Auto-sklearn is an automated machine learning toolkit that integrates seamlessly with the standard sklearn interface that many in the industry are familiar with. Using state-of-the-art methods such as Bayesian optimization, the library navigates the space of possible models and learns to infer whether a particular configuration will perform well on a given task.
The library was created by Matthias Feurer et al., and its technical details are described in the paper Efficient and Robust Automated Machine Learning. Feurer writes: "We introduce a robust new automated machine learning system based on scikit-learn, using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters."
Auto-sklearn is probably the best library for getting started with AutoML. In addition to preparing the data and selecting a model for your dataset, it can also learn from models that performed well on similar datasets.
Source: Efficient and Robust Automated Machine Learning (2015)
Auto-sklearn builds on an efficient implementation and minimizes the required user interaction. You can install the library with pip install auto-sklearn.
The two main classes are AutoSklearnClassifier and AutoSklearnRegressor, for classification and regression tasks respectively. Both take the same user-specified parameters, the most important of which are the time limits and the ensemble size.
import autosklearn.classification as ask
# autosklearn.regression.AutoSklearnRegressor() for regression tasks

model = ask.AutoSklearnClassifier(
    ensemble_size=10,             # size of the end ensemble (minimum is 1)
    time_left_for_this_task=120,  # number of seconds the search runs for
    per_run_time_limit=30)        # maximum seconds allocated per model

model.fit(X_train, y_train)             # begin fitting the search model
print(model.sprint_statistics())        # print statistics for the search
y_predictions = model.predict(X_test)   # get predictions from the model
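Once the search finishes, it is also worth inspecting what the tool actually built. A minimal sketch, assuming a reasonably recent auto-sklearn version in which leaderboard() and show_models() are available:

print(model.leaderboard())   # ranking of the models evaluated during the search
print(model.show_models())   # the models and weights that make up the final ensemble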
2.TPOT
TPOT is another Python library for automated modeling pipelines, with more emphasis on data preparation as well as modeling algorithms and model hyperparameters. It automates feature selection, preprocessing, and construction "through an evolutionary tree-based structure called the Tree-based Pipeline Optimization Tool (TPOT) that automatically designs and optimizes machine learning pipelines."
Source: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science (2016)
A program, or pipeline, is represented as a tree. Genetic programming selects and evolves certain programs to maximize the end result of each automated machine learning pipeline.
As Pedro Domingos puts it, "a dumb algorithm with lots of data beats a clever algorithm with limited data." And indeed, TPOT can generate complex data preprocessing pipelines.
Source: TPOT documentation
Like many AutoML tools, the TPOT pipeline optimizer can take hours to produce good results; you can run these long jobs in Kaggle notebooks or Google Colab.
import tpot

pipeline_optimizer = tpot.TPOTClassifier(
    generations=5,       # number of iterations to run the training
    population_size=20,  # number of individuals to train
    cv=5)                # number of folds in StratifiedKFold

pipeline_optimizer.fit(X_train, y_train)                 # fit the pipeline optimizer - can take a long time
print(pipeline_optimizer.score(X_test, y_test))          # print scoring for the pipeline
pipeline_optimizer.export('tpot_exported_pipeline.py')   # export the pipeline - in Python code!
Perhaps the best feature of TPOT is the ability to export models to Python code files for later use.
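For illustration, here is a rough sketch of the kind of file that export() writes. The actual contents depend entirely on the pipeline the search settles on; the random forest and file path below are only assumed placeholders:

# tpot_exported_pipeline.py (illustrative sketch)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# TPOT's exported scripts expect the outcome column to be labeled 'target'
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=42)

# the pipeline the search settled on - a plain random forest is assumed here
exported_pipeline = RandomForestClassifier(n_estimators=100, random_state=42)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)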
3.HyperOpt
HyperOpt, developed by James Bergstra, is a Python library for Bayesian optimization. Designed for large-scale optimization of models with hundreds of parameters, it can be used to optimize machine learning pipelines, with the option to scale the optimization process across multiple cores and machines.
"Our approach is to expose an underlying expression graph of how a performance metric (such as classification accuracy on validation examples) is computed from hyperparameters that control not only the application of individual processing steps, but even which processing steps are included. "
HyperOpt, however, is hard to use directly: it is technically demanding and requires the optimization procedure and parameter spaces to be specified carefully. Instead, I recommend HyperOpt-sklearn, a wrapper around HyperOpt that supports the sklearn library.
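A minimal sketch of how such a search is typically set up, following the hyperopt-sklearn README pattern (any_classifier and any_preprocessing define the search space; X_train, y_train, X_test, and y_test are assumed to exist):

from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
from hyperopt import tpe

estim = HyperoptEstimator(
    classifier=any_classifier('my_clf'),        # search over all supported classifiers
    preprocessing=any_preprocessing('my_pre'),  # search over all supported preprocessing steps
    algo=tpe.suggest,                           # Bayesian optimization via TPE
    max_evals=50,                               # number of configurations to try
    trial_timeout=120)                          # seconds allowed per configuration

estim.fit(X_train, y_train)
print(estim.score(X_test, y_test))
print(estim.best_model())                       # the winning model, like the output below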
Specifically, although HyperOpt supports preprocessing, it focuses primarily on the dozens of hyperparameters that go into a particular model. As an example, one HyperOpt-sklearn search produced this gradient boosting classifier, with no preprocessing:
{'learner': GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse',
                                        init=None, learning_rate=0.009132299586303643,
                                        loss='deviance', max_depth=None,
                                        max_features='sqrt', max_leaf_nodes=None,
                                        min_impurity_decrease=0.0, min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0, n_estimators=342,
                                        n_iter_no_change=None, presort='auto',
                                        random_state=2, subsample=0.6844206624548879,
                                        tol=0.0001, validation_fraction=0.1, verbose=0,
                                        warm_start=False),
 'preprocs': (),
 'ex_preprocs': ()}
As the documentation shows, building a HyperOpt-sklearn model is considerably more involved than auto-sklearn and slightly more involved than TPOT. But if the hyperparameters really matter, the extra work is worth it.
4.AutoKeras
Neural networks and deep learning are far more powerful, and therefore far harder to automate, than standard machine learning models.
With AutoKeras, a neural architecture search algorithm finds the best architecture for you: the number of neurons in a layer, the number of layers, which layers to merge, and layer-specific parameters such as filter sizes or the dropout rate. Once the search is complete, you can use the model as if it were an ordinary TensorFlow/Keras model.
By using AutoKeras, you can build a model that contains complex elements, such as embeddings and spatial reduction, that would otherwise be difficult for those still getting to grips with deep learning.
When AutoKeras creates a model, much of the preprocessing, such as vectorizing or cleaning text data, is done and optimized for you.
It takes only two lines of code to start and train a search, and because AutoKeras exposes a Keras-like interface, it is easy to remember and use.
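A minimal sketch for a text classification task (x_train, y_train, x_test, and y_test are assumed to exist; max_trials and epochs are illustrative values):

import autokeras as ak

clf = ak.TextClassifier(max_trials=3)   # search over up to 3 candidate architectures
clf.fit(x_train, y_train, epochs=10)    # the two lines that start and train the search

predictions = clf.predict(x_test)
print(clf.evaluate(x_test, y_test))
model = clf.export_model()              # retrieve the best model as a plain Keras model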
AutoKeras supports text, images, and structured data, and provides interfaces both for beginners and for those who want to dig into the technical details. It uses an evolutionary neural architecture search to take the difficulty and ambiguity out of the process. Although AutoKeras can take a long time to run, there are many user-specified parameters that control the runtime, the number of models explored, the size of the search space, and so on.
Hyperparameter                                              | Value       | Best Value So Far
text_block_1/block_type                                     | transformer | transformer
classification_head_1/dropout                               | 0           | 0
optimizer                                                   | adam        | adam
learning_rate                                               | 0.001       | 0.001
text_block_1/max_tokens                                     | 20000       | 20000
text_block_1/text_to_int_sequence_1/output_sequence_length  | 200         | 200
text_block_1/transformer_1/pretraining                      | none        | none
text_block_1/transformer_1/embedding_dim                    | 32          | 32
text_block_1/transformer_1/num_heads                        | 2           | 2
text_block_1/transformer_1/dense_dim                        | 32          | 32
text_block_1/transformer_1/dropout                          | 0.25        | 0.25
text_block_1/spatial_reduction_1/reduction_type             | global_avg  | global_avg
text_block_1/dense_block_1/num_layers                       | 1           | 1
text_block_1/dense_block_1/use_batchnorm                    | False       | False
text_block_1/dense_block_1/dropout                          | 0.5         | 0.5
text_block_1/dense_block_1/units_0                          | 20          | 20
Which automated library should I use?
If you prefer a clean, simple interface and relatively fast results, use auto-sklearn. It integrates naturally with sklearn and works with the models and methods you already use.
Use TPOT if you care about high accuracy and don't mind the time it takes to train. Its tree-structured pipeline representation puts extra emphasis on advanced preprocessing methods, and it can additionally export the best model as Python code.
If high accuracy matters and potentially long training times are not a concern, HyperOpt-sklearn is also worth considering; it emphasizes hyperparameter optimization of the model, and its effectiveness depends on the dataset and the algorithm.
If your problem involves neural networks, especially in text or image form, use AutoKeras. It does take a long time to train, but there are plenty of measures to control the time and size of the search space.
If you want to automate, don't miss these four libraries.
At this point, you should have a deeper understanding of which Python automation libraries data scientists should know - now go and try them out for yourself!