How to use Scikit-Learn in Python

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces how to use Scikit-Learn in Python. It walks through the library's most common features in detail and is intended as a practical reference.

1. Data sets

When learning algorithms, we all want some data sets to practice with. Scikit-Learn comes with some great datasets, such as the iris dataset, a house price dataset, and the diabetes dataset.

These datasets are easy to access and understand, and you can implement ML models directly on them, making them ideal for beginners.

You can load one as follows:

    from sklearn import datasets
    import pandas as pd

    # Load the iris dataset and wrap its features in a DataFrame
    dataset = datasets.load_iris()
    df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

Similarly, you can import other datasets in the same way.
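As an illustration of loading another bundled dataset, here is a minimal sketch using the diabetes dataset. The variable names and the "target" column name are choices made for this example, not something scikit-learn requires.

```python
import pandas as pd
from sklearn import datasets

# load_diabetes ships with scikit-learn, so nothing is downloaded
diabetes = datasets.load_diabetes()

# Put the features in a DataFrame and attach the target as an extra column
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
diabetes_df["target"] = diabetes.target  # column name chosen for this example

print(diabetes_df.shape)
print(diabetes_df.head())
```

The same pattern should work for the other `load_*` functions in `sklearn.datasets`.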

2. Data splitting

Sklearn provides the ability to split data sets for training and testing. Splitting the dataset is critical for unbiased evaluation of predictive performance, and the proportion of data in the training and test datasets can be defined.

We can split the dataset as follows:

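    from sklearn.model_selection import train_test_split

    # x holds the features and y the target, e.g. x, y = dataset.data, dataset.target
    # test_size=0.2 keeps 20% of the rows for testing
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)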

With the help of train_test_split, we split the dataset so that the training set has 80% of the data and the test set has 20%.
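To make the 80/20 proportion concrete, the following sketch runs the split on the iris dataset and prints the resulting shapes. Using iris here is an assumption made for illustration; any feature matrix and target vector would do.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target  # 150 rows, 4 features

# A float test_size is read as a fraction of the rows
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

print(x_train.shape, x_test.shape)  # 120 training rows, 30 test rows
```

Fixing `random_state` makes the split reproducible between runs.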

3. Linear regression

Linear regression is a supervised machine learning model used when the output variable is continuous and linearly related to the independent variables. For example, it can predict sales in future months by analyzing sales data from previous months.

With sklearn, we can easily implement linear regression models as follows:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    regression_model = LinearRegression()
    regression_model.fit(x_train, y_train)
    y_predicted = regression_model.predict(x_test)

    # mean_squared_error returns the MSE; take the square root to get the RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_predicted))
    r2 = r2_score(y_test, y_predicted)

First, LinearRegression() creates an object for linear regression, and then we fit the model on the training set. Finally, we use the model to predict on the test set. The values "rmse" and "r2" can be used to check how well the model fits.
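A self-contained sketch of the whole workflow, assuming the bundled diabetes dataset as the regression problem (the article does not say which data it uses):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumed example data: the diabetes dataset has a continuous target
x, y = datasets.load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

regression_model = LinearRegression()
regression_model.fit(x_train, y_train)
y_predicted = regression_model.predict(x_test)

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_predicted))
r2 = r2_score(y_test, y_predicted)
print(f"RMSE: {rmse:.2f}, R^2: {r2:.2f}")
```

A lower RMSE and an R^2 closer to 1 indicate a better fit on the test set.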

4. Logistic regression

Logistic regression is also a supervised algorithm, like linear regression, but despite its name it is used for classification: the output variable is categorical. For example, it can be used to predict whether a patient has heart disease.

With sklearn, we can easily implement Logistic regression models as follows:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix

    logreg = LogisticRegression()
    logreg.fit(x_train, y_train)
    y_pred = logreg.predict(x_test)

    # Store the result under a new name so the confusion_matrix function is not shadowed
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    print(classification_report(y_test, y_pred))

Confusion matrices and classification reports are used to check the accuracy of classification models.
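For a runnable end-to-end version, here is a sketch on the iris dataset. The choice of dataset, the `stratify=y` argument, and `max_iter=1000` are assumptions made for this example; `max_iter` is raised only to give the solver room to converge.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

x, y = datasets.load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in both splits
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4, stratify=y
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows are actual classes, columns are predicted classes
print(classification_report(y_test, y_pred))
```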

5. Decision tree

Decision trees are a powerful tool for classification and regression problems. A tree consists of a root, internal nodes, and leaves: the root and internal nodes represent split decisions, and the leaves represent output variable values. Decision trees are useful when the dependent variable does not follow a linear relationship with the independent variables.

Decision tree implementation for classification:

    from io import StringIO  # sklearn.externals.six is no longer available in recent releases

    from pydot import graph_from_dot_data
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    dt = DecisionTreeClassifier()
    dt.fit(x_train, y_train)
    y_pred = dt.predict(x_test)

    # Export the fitted tree in Graphviz DOT format and parse it into a graph
    # (dataset is the iris object loaded in section 1)
    dot_data = StringIO()
    export_graphviz(dt, out_file=dot_data, feature_names=dataset.feature_names)
    (graph,) = graph_from_dot_data(dot_data.getvalue())

We fit the model using the DecisionTreeClassifier() object; the remaining code exports the fitted tree so that it can be visualized.
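If Graphviz and pydot are not installed, one alternative is to print the tree as text with `sklearn.tree.export_text`. The sketch below assumes the iris dataset and limits the depth to 2 purely to keep the output short.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()

# max_depth=2 is an arbitrary choice to keep the printed tree small
dt = DecisionTreeClassifier(max_depth=2, random_state=0)
dt.fit(iris.data, iris.target)

# Each indented line is a split decision; "class:" lines are the leaves
tree_rules = export_text(dt, feature_names=iris.feature_names)
print(tree_rules)
```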

6. Bagging

Bagging is a technique for training multiple models of the same type using random samples from a training set. The inputs to different models are independent of each other.

For example, multiple decision trees can be used for prediction instead of a single tree; such an ensemble is called a random forest.
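A minimal sketch of bagging with scikit-learn's `BaggingClassifier`, assuming decision trees as the base model and the iris dataset as example data. The number of estimators is an arbitrary choice.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = datasets.load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

# Each of the 10 trees is trained on its own bootstrap sample of the training set
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
bagging.fit(x_train, y_train)

bagging_accuracy = accuracy_score(y_test, bagging.predict(x_test))
print("Accuracy:", bagging_accuracy)
```

The base model is passed positionally here because the keyword for it was renamed between scikit-learn releases.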

7. Boosting

In boosting, multiple models are trained sequentially, so that the input to each model depends on the output of the previous one. Data points that were predicted incorrectly are given more weight in the next round.
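One boosting algorithm that reweights misclassified samples in this way is AdaBoost. The sketch below uses scikit-learn's `AdaBoostClassifier` on the iris dataset; both the algorithm and the dataset are assumptions chosen for illustration, since the article does not name a specific implementation here.

```python
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = datasets.load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

# Weak learners are fitted one after another; samples the earlier
# learners got wrong receive a larger weight in later rounds
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(x_train, y_train)

boosting_accuracy = accuracy_score(y_test, boosting.predict(x_test))
print("Accuracy:", boosting_accuracy)
```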

8. Random forest

Random forest is a bagging technique that uses hundreds of decision trees to build models for classification and regression problems. Examples: classifying loan applicants, identifying fraudulent activity and predicting disease.

In Python it is implemented as follows:

    from sklearn import metrics
    from sklearn.ensemble import RandomForestClassifier

    num_trees = 100
    max_features = 3
    clf = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

9. XGBoost

XGBoost is a boosting technique that provides a high-performance implementation of gradient-boosted decision trees. It can handle missing data on its own, supports regularization, and often gives more accurate results than other models.

In Python it is implemented as follows:

    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier

    # reg_alpha is the L1 regularization term in the scikit-learn style wrapper
    xgb = XGBClassifier(colsample_bytree=0.3, learning_rate=0.1, max_depth=5,
                        reg_alpha=10, n_estimators=10)
    xgb.fit(x_train, y_train)
    y_pred = xgb.predict(x_test)

    # Accuracy is reported because XGBClassifier predicts class labels
    print("Accuracy: %f" % accuracy_score(y_test, y_pred))

10. Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm that classifies data by finding the hyperplane that best separates the classes. It is commonly used in applications such as face detection and mail classification.

In Python it is implemented as:

    from sklearn import metrics
    from sklearn import svm

    # A linear kernel looks for a separating hyperplane in the original feature space
    clf = svm.SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

11. Confusion matrix

The confusion matrix is a table used to describe the performance of a classification model. It is analyzed with the help of the following four terms:

True Positive (TP)

This means that the model predicts positive, and is actually positive.

True Negative (TN)

This means that the model predicts negative, and is actually negative.

False Positive (FP)

This means that the model predicts positive, but is actually negative.

False Negative (FN)

This means that the model predicts negative, but is actually positive.
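The four terms can be checked by counting them directly on a small example. The labels below are made up for illustration (1 means positive, 0 means negative), and the last lines cross-check the counts against scikit-learn's `confusion_matrix`.

```python
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up true labels
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up model predictions

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted positive, actually positive
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # predicted negative, actually negative
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted positive, actually negative
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted negative, actually positive
print(tp, tn, fp, fn)  # 3 3 1 1

# For binary labels, ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
sk_tn, sk_fp, sk_fn, sk_tp = confusion_matrix(actual, predicted).ravel()
```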

With scikit-learn, the confusion matrix can be computed as follows:

    from sklearn.metrics import confusion_matrix

    # Use a separate name so the confusion_matrix function is not overwritten
    cm = confusion_matrix(y_test, y_pred)
    print(cm)

12. K-Means clustering

K-Means clustering is an unsupervised machine learning algorithm for solving clustering problems. An unsupervised algorithm is one that works on a dataset with no labels or output variables.

In clustering, data points are grouped based on their features; the resulting groups are called clusters. K-means clustering has many applications, such as market segmentation, document clustering, and image segmentation.

It can be implemented in Python as:

    from sklearn.cluster import KMeans

    # Group the rows of x into 3 clusters; fit_predict fits the model and returns the labels
    kmeans = KMeans(n_clusters=3)
    identified_clusters = kmeans.fit_predict(x)

13. DBSCAN clustering

DBSCAN is also an unsupervised clustering algorithm; it groups data points based on how densely they are packed. In DBSCAN, a cluster is formed only if at least a minimum number of points lie within a specified radius.

The advantage of DBSCAN is that it is robust to outliers, i.e. it can handle outliers on its own, unlike k-means clustering. DBSCAN is used for creating heat maps, geospatial analysis, and anomaly detection in temperature data.

It can be implemented as:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # eps is the neighborhood radius; min_samples is the minimum number of points in it
    db = DBSCAN(eps=0.3, min_samples=10).fit(x)

    # Boolean mask marking the core samples
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True

    # Noise points are labelled -1 and are not counted as a cluster
    labels = db.labels_
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    print(labels)

14. Standardization and normalization

Standardization

Standardization is a scaling technique where we set the mean of an attribute to 0 and the standard deviation to 1, so that the values are centered around the mean with unit standard deviation. It can be written as X' = (X − μ) / σ.

Normalization

Normalization is a technique for rescaling values to the range 0 to 1, also known as min-max scaling. It can be written as X' = (X − Xmin) / (Xmax − Xmin).

    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler

Scikit-Learn provides the StandardScaler class for standardization and the MinMaxScaler class for normalization.
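A minimal sketch of both scalers applied to a single made-up column of values. The data is invented for illustration; the comments relate each result to the two formulas given earlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with five made-up values; scalers expect a 2-D array
data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: (X - mean) / std, giving mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(data)

# Min-max scaling: (X - Xmin) / (Xmax - Xmin), giving values between 0 and 1
normalized = MinMaxScaler().fit_transform(data)

print(standardized.ravel())
print(normalized.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```

In practice the scaler is usually fitted on the training set only and then applied to the test set with `transform`.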

15. Feature extraction

Feature extraction is a method of extracting features from data. We can only pass data to a machine learning model once it has been converted to numerical form. Scikit-Learn provides the ability to convert text and images into numbers.

Bag of Words and TF-IDF, both provided by scikit-learn, are the most common methods for converting words to numbers in natural language processing.
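As a rough sketch of both methods, the example below uses `CountVectorizer` (Bag of Words) and `TfidfVectorizer` (TF-IDF) from `sklearn.feature_extraction.text` on two made-up sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up documents for illustration
documents = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of Words: one column per vocabulary word, each cell is a word count
bow = CountVectorizer()
bow_matrix = bow.fit_transform(documents)
vocabulary = sorted(bow.vocabulary_)  # column order is alphabetical
print(vocabulary)
print(bow_matrix.toarray())

# TF-IDF: same columns, but counts are reweighted so that words
# appearing in every document contribute less
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)
print(tfidf_matrix.toarray().round(2))
```

Both vectorizers return sparse matrices, which can be passed directly to most scikit-learn estimators.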
