This article explains how to get started with machine learning using Python. The approach described here is simple, fast, and practical, so let's walk through it step by step.
With the rise of machine learning in industry, tools that help you iterate through the whole workflow quickly have become critical, and Python has emerged as one of the most popular languages for the job. That makes learning to do machine learning in Python well worth the effort.
An introduction to machine learning with Python
You may be wondering why Python rather than some other language. In my experience, Python is one of the easiest programming languages to learn. You want to iterate through the whole process quickly, and data scientists do not need deep expertise in the language because they can pick it up fast.
How easy is it?
for anything in the_list:
    print(anything)
Reads almost like English, doesn't it? Python's grammar is close to English (or human language, rather than machine language), and its syntax has none of the hassle caused by curly braces. I have a colleague in Quality Assurance who is not a software engineer, yet she could write production-level Python code within a day. (It's true!)
Below I introduce several Python-based libraries. As data analysts and data scientists, we can lean on them to get our work done. These incredible libraries are essential tools for machine learning with Python.
NumPy
This is a very famous data analysis library. NumPy can help you with everything from calculating the median of data distribution to dealing with multidimensional arrays.
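As a minimal sketch of what that looks like (the array values below are made up purely for illustration):

import numpy as np

ages = np.array([22, 38, 26, 35, 35, 54])   # a made-up sample of ages
print(np.median(ages))                      # median of the distribution
matrix = np.arange(12).reshape(3, 4)        # a 3 x 4 multidimensional array
print(matrix.mean(axis=0))                  # column-wise means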
Pandas
This is the tool for processing CSV files. You will also need to work with tables, look at summary statistics, and so on, and Pandas covers all of that.
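A quick, hedged sketch (assuming a file named train.csv sits in the working directory):

import pandas as pd

df = pd.read_csv("train.csv")   # load a CSV file into a DataFrame
print(df.head())                # first few rows
print(df.describe())            # summary statistics of the numeric columns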
Matplotlib
Once the data is stored in a Pandas DataFrame, you may want to do some visualization to understand it better. After all, a picture is worth a thousand words.
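For example, a minimal sketch that plots a histogram from the DataFrame loaded above (the Age column name is taken from the Titanic data used later):

import matplotlib.pyplot as plt

df["Age"].plot(kind="hist", bins=20)   # histogram of the Age column
plt.xlabel("Age")
plt.title("Age distribution")
plt.show()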
Seaborn
This is another visualization tool, but this tool focuses more on the visualization of statistical results, such as histograms, pie charts, graphs, or correlation tables.
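A hedged sketch of the kind of statistical plot Seaborn is good at (again using the DataFrame from above; sns.histplot assumes a reasonably recent Seaborn version):

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x="Survived", data=df)        # bar chart of survivors vs. non-survivors
plt.show()
sns.histplot(df["Age"].dropna(), bins=20)   # distribution of ages
plt.show()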
Scikit-Learn
This is the ultimate tool for machine learning with Python. When people talk about doing machine learning in Python, this is usually what they mean: Scikit-Learn. Everything you need, from algorithms to model improvement, can be found here.
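The typical Scikit-Learn workflow, sketched on a toy dataset that ships with the library (not the Titanic data yet):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # fit on the training split
print(accuracy_score(y_test, clf.predict(X_test)))              # evaluate on the held-out split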
TensorFlow and PyTorch
I won't say much about these two tools here, but if you are interested in deep learning they are well worth your time. (I will write another tutorial on deep learning later, so stay tuned!)
Python Machine Learning Program
Sample project:
Titanic: Machine Learning from Disaster (https://www.)
This is the famous Titanic. The 1912 disaster affected 2,224 passengers and crew, of whom 1,502 were killed. This Kaggle competition (or rather, tutorial) provides real data from the disaster, and your task is to analyze it and predict which passengers survived and which did not.
How to implement Machine Learning with Python
Before delving into the Titanic's data, we need to install some of the necessary tools.
The first, of course, is Python. If you are installing it for the first time, download it from the official website. Install version 3.6 or above so you can keep up with the latest versions of the libraries.
Python official website: https://www.
You can then install all the libraries using Python's pip. The Python distribution you just downloaded will automatically install pip.
All the other tools you need can be installed with pip. Open a terminal, command prompt, or PowerShell and run the following commands:
pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install scikit-learn
pip install jupyter
Everything seems to be working so far. But wait, what is Jupyter? Jupyter stands for Julia, Python, and R, so strictly it would be "Jupytr", but that word looked too strange, so it became Jupyter. It is a well-known notebook environment in which you can write interactive Python code.
Just type jupyter notebook in the terminal and a notebook page will open in your browser. You can write Python code in the cells (the green rectangles) and evaluate it interactively.
Now you have installed all the tools. Let's get started!
Data exploration
Exploring data is the first step. You need to download the data from the Titanic page of Kaggle, and then put the downloaded data in the folder where you started the Jupyter notebook.
Data download address: https://www.ta
Then import the necessary libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Load data:
train_df = pd.read_csv("train.csv")
train_df.head()
This is our data. It has the following columns:
PassengerId, the identifier of the passenger
Survived, whether he (she) survived
Pclass, the ticket class: 1 for first class, 2 for second class, 3 for third class
Name, passenger's name.
Sex, gender
Age, age
SibSp, that is, siblings or spouses, indicates the number of siblings and spouses on board the ship
Parch, that is, Parents or Children, indicates the number of parents and children on board.
Ticket, ticket details
Cabin, cabin number, NaN indicates unknown
Embarked, the starting place of boarding, S for Southampton, Q for Queenstown, C for Cherbourg
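To check the column types and non-null counts yourself, a quick inspection like the following helps (a minimal sketch):

train_df.info()                      # column dtypes and non-null counts
train_df["Embarked"].value_counts()  # how many passengers boarded at each port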
When exploring data, we often encounter the problem of missing data. Let's take a look.
def missingdata(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum() / data.isnull().count() * 100).sort_values(ascending=False)
    ms = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    ms = ms[ms["Percent"] > 0]
    f, ax = plt.subplots(figsize=(8, 6))
    plt.xticks(rotation='90')
    fig = sns.barplot(ms.index, ms["Percent"], color="green", alpha=0.8)
    plt.xlabel('Features', fontsize=15)
    plt.ylabel('Percent of missing values', fontsize=15)
    plt.title('Percent missing data by feature', fontsize=15)
    return ms

missingdata(train_df)
The cabin number, age, and embarkation columns all have missing values, and the cabin number column in particular is missing a lot of information. We need to deal with them, and that is what data cleaning is about.
Data cleaning
We spend 90% of our time on this. We have to do a lot of data cleaning for each machine learning project. When the data is cleaned up, we can easily move on to the next step without worrying about anything.
The most common technique in data cleaning is filling in missing data. You can fill missing values with the mode, the mean, or the median; there is no absolute rule for choosing among them, so you can try each and see how it performs. As a rule of thumb, though, categorical data can only take the mode, while continuous data can take the median or the mean. So we fill the embarkation data with the mode and the age data with the median.
train_df['Embarked'].fillna(train_df['Embarked'].mode()[0], inplace=True)
train_df['Age'].fillna(train_df['Age'].median(), inplace=True)
The next important operation is dropping data, especially columns with a large amount of missing information. We handle the cabin number column as follows:
drop_column = ['Cabin']
train_df.drop(drop_column, axis=1, inplace=True)
Now check the cleaned data.
print('check the nan value in train data')
print(train_df.isnull().sum())
Feature engineering
Now the data has been cleaned up. Next we're going to do feature engineering.
Feature engineering is basically the technique of deriving new features from the data that is already available. There are several ways to do it, and in many cases it comes down to common sense.
Take the embarkation column as an example: it is filled with Q, S, or C. Scikit-Learn cannot handle these values directly because it only works with numbers, so you need so-called one-hot encoding (One Hot Vectorization), which turns one column into three. You fill Embarked_Q, Embarked_S, and Embarked_C with 0 or 1 to indicate whether the person embarked from that port.
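In Pandas this can be done with get_dummies; a minimal sketch, applied to the Embarked column on its own here (the full feature-engineering code below does the same thing for several columns at once):

embarked_onehot = pd.get_dummies(train_df['Embarked'], prefix='Embarked')
# produces the columns Embarked_C, Embarked_Q, Embarked_S filled with 0/1
print(embarked_onehot.head())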
Take SibSp and Parch as another example. Neither column is very interesting on its own, but you might want to know how large a passenger's family was on board. A large family might improve the chance of survival, because family members could help each other, while passengers travelling alone may have found it harder to survive.
So you can create a new column for family size: family size = SibSp + Parch + 1 (the passenger themselves).
The last example is binning. Because it is hard to distinguish between things with very similar values, binning creates ranges of values and groups several values together. For example, is there a significant difference between a 5-year-old passenger and a 6-year-old one? Or between a 45-year-old and a 46-year-old?
That is why we create a bin column. For age, for instance, we can create four groups: children (up to 14), teenagers (14 to 20), adults (20 to 40), and the elderly (over 40).
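A hedged sketch of binning with pd.cut, using bin edges that mirror the groups just described (the full feature-engineering code below applies the same idea):

age_groups = pd.cut(train_df['Age'], bins=[0, 14, 20, 40, 120],
                    labels=['Children', 'Teenage', 'Adult', 'Elder'])
print(age_groups.value_counts())   # how many passengers fall into each age group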
The code is as follows:
all_data = [train_df]   # wrap the DataFrame in a list so we can loop over it

for dataset in all_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

import re

# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Create a new feature Title, containing the titles of passenger names
for dataset in all_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

# Group all non-common titles into one single grouping "Rare"
for dataset in all_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in all_data:
    dataset['Age_bin'] = pd.cut(dataset['Age'], bins=[0, 14, 20, 40, 120],
                                labels=['Children', 'Teenage', 'Adult', 'Elder'])

for dataset in all_data:
    dataset['Fare_bin'] = pd.cut(dataset['Fare'], bins=[0, 7.91, 14.45, 31, 120],  # upper edges assumed
                                 labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

traindf = train_df
for dataset in [traindf]:
    drop_column = ['Age', 'Fare', 'Name', 'Ticket']
    dataset.drop(drop_column, axis=1, inplace=True)

drop_column = ['PassengerId']
traindf.drop(drop_column, axis=1, inplace=True)

traindf = pd.get_dummies(traindf,
                         columns=["Sex", "Title", "Age_bin", "Embarked", "Fare_bin"],
                         prefix=["Sex", "Title", "Age_type", "Em_type", "Fare_type"])
Now, you have created all the features. Then let's look at the correlation between these features:
sns.heatmap(traindf.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)  # data.corr() --> correlation matrix
fig = plt.gcf()
fig.set_size_inches(20, 12)
plt.show()
A correlation value close to 1 means a strong positive correlation, and a value close to -1 means a strong negative correlation. For example, male and female are negatively correlated, since each passenger is identified as one gender or the other. Beyond that, you can see that no two features are highly correlated except the ones we created through feature engineering, which suggests we did the right thing.
What if two features turn out to be highly correlated? We can delete one of them, because the extra column adds no new information to the model: the two are essentially the same.
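A hedged sketch of how you might find and drop one member of each highly correlated pair (the 0.9 threshold is only illustrative):

corr = traindf.corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)                               # candidate columns to remove
# traindf = traindf.drop(columns=to_drop)    # uncomment to actually drop them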
Implementing machine learning with Python
Now we have reached the most exciting part of this tutorial, machine learning modeling.
from sklearn.model_selection import train_test_split   # for splitting the data
from sklearn.metrics import accuracy_score              # for accuracy_score
from sklearn.model_selection import KFold               # for K-fold cross validation
from sklearn.model_selection import cross_val_score     # score evaluation
from sklearn.model_selection import cross_val_predict   # prediction
from sklearn.metrics import confusion_matrix            # for confusion matrix

all_features = traindf.drop("Survived", axis=1)
Targeted_feature = traindf["Survived"]
X_train, X_test, y_train, y_test = train_test_split(all_features, Targeted_feature,
                                                    test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
There are a variety of algorithms to choose from in the Scikit-Learn library:
Logistic regression
Random forest
Support vector machine
K nearest neighbor
Naive Bayes
Decision tree
AdaBoost
LDA
Gradient boosting
You may feel overwhelmed and want to know what each of these is. Don't worry: just treat them as black boxes and pick the one that performs best. (I'll write a full article later on how to choose among these algorithms.)
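One hedged way to "pick the one that performs best" is to cross-validate a few candidates and compare their mean scores; a minimal sketch using the all_features and Targeted_feature variables defined above:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

candidates = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "K nearest neighbors": KNeighborsClassifier(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, all_features, Targeted_feature, cv=5, scoring="accuracy")
    print(name, round(scores.mean() * 100, 2))   # mean cross-validated accuracy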
Take my favorite random forest algorithm as an example:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(criterion='gini', n_estimators=700,
                               min_samples_split=10, min_samples_leaf=1,
                               max_features='sqrt',   # 'auto' in older scikit-learn versions
                               oob_score=True, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)
prediction_rm = model.predict(X_test)
print('--------- The Accuracy of the model ---------')
print('The accuracy of the Random Forest Classifier is',
      round(accuracy_score(prediction_rm, y_test) * 100, 2))
kfold = KFold(n_splits=10, shuffle=True, random_state=22)   # k=10, split the data into 10 equal parts
result_rm = cross_val_score(model, all_features, Targeted_feature, cv=10, scoring='accuracy')
print('The cross validated score for Random Forest Classifier is:',
      round(result_rm.mean() * 100, 2))
y_pred = cross_val_predict(model, all_features, Targeted_feature, cv=10)
sns.heatmap(confusion_matrix(Targeted_feature, y_pred), annot=True, fmt='3.0f', cmap="summer")
plt.title('Confusion_matrix', y=1.05, size=15)
Wow! The accuracy is as high as 83%. For a first attempt, that is a good result.
The cross-validated score refers to the K-fold cross-validation method: the data is divided into 10 parts, the model is scored on each part in turn, and the mean of all the scores is taken as the final score.
Fine tuning
Now you have completed the basic steps of machine learning with Python. But one more step will give you better results: fine-tuning. Fine-tuning means finding the best parameters for a machine learning algorithm. Take the random forest code above as an example:
model = RandomForestClassifier(criterion='gini', n_estimators=700,
                               min_samples_split=10, min_samples_leaf=1,
                               max_features='sqrt',   # 'auto' in older scikit-learn versions
                               oob_score=True, random_state=1, n_jobs=-1)
There are a lot of parameters to set (the values above are the ones we used earlier). You can change them as needed, but trying combinations by hand of course takes a lot of time.
Don't worry: there is a tool called grid search (Grid Search) that finds the best parameters automatically. Sounds good, doesn't it?
# RandomForestClassifier parameter tuning
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier()
n_estim = range(100, 1000, 100)   # candidate values for n_estimators (example values)

# Search grid for optimal parameters
param_grid = {"n_estimators": n_estim}

model_rf = GridSearchCV(model, param_grid=param_grid, cv=5, scoring="accuracy", verbose=1)
model_rf.fit(X_train, y_train)

# Best score
print(model_rf.best_score_)

# Best estimator
model_rf.best_estimator_

By now I believe you have a deeper understanding of how to get started with machine learning using Python. Why not put it into practice yourself?