2025-02-27 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article shows how to use Python and Jupyter Notebook to build a prediction model.
The data I used in this experiment is the hotel reservation demand dataset from Kaggle.
In this article I cover only the modeling phase, using a logistic regression model; the complete notebook, including data cleaning, preprocessing, and exploratory data analysis, is available on GitHub.
Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")
Load dataset
df = pd.read_csv('hotel_bookings.csv')
df = df.iloc[0:2999]
df.head()
The dataset has 32 columns; the full list is:
['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']
Based on the checks I ran in the notebook, the NaN values in the dataset appear in the "country", "agent", and "company" columns.
Grouping by the "lead_time" feature, I replaced the NaN values in "country" with PRT (Portugal), because PRT is the most common value.
I tried to replace the NaN values in "agent" based on lead_time, arrival_date_month, and arrival_date_week_number, but in most groups the most common agent was 240.
According to the dataset description available online, the author describes the "agent" feature as the "ID of the travel agency that made the booking". Bookings with an "agent" value were therefore made through a travel agency, while those with NaN were not. I think it is better to fill the NaN values with zeros than with the most common agent, which would distort the dataset.
Last but not least, I chose to drop the entire "company" feature, because NaN accounts for about 96% of its values; imputing it could heavily distort the data.
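The cleaning steps above can be sketched as follows. This is a minimal illustration on a toy DataFrame with made-up values, since the real Kaggle file is not reproduced here; only the column names match the article.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for hotel_bookings.csv (values are invented for illustration).
df = pd.DataFrame({
    "country": ["PRT", "PRT", np.nan, "GBR"],
    "agent": [240.0, np.nan, 9.0, np.nan],
    "company": [np.nan, np.nan, np.nan, 40.0],
})

# Fill missing countries with the most common value (PRT in the real dataset).
df["country"] = df["country"].fillna(df["country"].mode()[0])

# Bookings without an agent were not made through a travel agency,
# so 0 is used as an explicit "no agency" marker instead of the mode (240).
df["agent"] = df["agent"].fillna(0)

# "company" is ~96% NaN in the real dataset, so the whole column is dropped.
df = df.drop(columns=["company"])
```

In the real notebook the same `fillna`/`drop` calls are applied to the full 32-column frame.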
Split dataset
df_new = df.copy()[['required_car_parking_spaces', 'lead_time', 'booking_changes', 'adr', 'adults', 'is_canceled']]
df_new.head()
x = df_new.drop(['is_canceled'], axis=1)
y = df_new['is_canceled']
I built the feature set from the five features most strongly correlated with is_canceled: required_car_parking_spaces, lead_time, booking_changes, adr, and adults, together with the target is_canceled.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, shuffle=False)
The data is split into 80% training and 20% testing.
Fitting model
Model_LogReg_Asli is the original logistic regression model, before any hyperparameter tuning.
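The baseline fit might look like the sketch below. Since the article does not show the fitting code and the Kaggle file is not available here, synthetic data from `make_classification` stands in for the five selected features; the variable name `model_logreg_asli` mirrors the model name from the article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the five selected features and the is_canceled target.
x, y = make_classification(n_samples=3000, n_features=5, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, shuffle=False)

# Baseline model, no hyperparameter tuning yet.
model_logreg_asli = LogisticRegression()
model_logreg_asli.fit(x_train, y_train)

y_pred = model_logreg_asli.predict(x_test)      # class predictions on the hold-out set
accuracy = model_logreg_asli.score(x_test, y_test)  # mean accuracy
```

On the real hotel bookings data this baseline reaches roughly 69.3% accuracy, as reported below.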
Model performance
The accuracy of the baseline logistic regression model is about 69.3%.
Model parameters
Logistic regression with Randomized Search CV
Model_LR_RS is a logistic regression model with randomized-search hyperparameter tuning.
The logistic regression model with Randomized Search CV achieves exactly the same accuracy as the model without it: 69.3%.
Logistic regression with Grid Search CV
Model_LR2_GS is a logistic regression model with grid-search hyperparameter tuning.
The logistic regression model with Grid Search CV also achieves the same accuracy of 69.3%.
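The grid-search variant differs only in that every parameter combination is evaluated exhaustively. Again, the grid itself is an assumption and synthetic data keeps the sketch self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=1000, n_features=5, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, shuffle=False)

# Hypothetical grid: every C/solver pair is tried (8 fits per CV fold).
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"],
}
model_lr2_gs = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
model_lr2_gs.fit(x_train, y_train)
accuracy = model_lr2_gs.score(x_test, y_test)
```

Randomized search samples a subset of this space, while grid search covers all of it; for a small space like this the two often find the same best model.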
Model evaluation
Confusion matrix
TN is a true negative, FN a false negative, FP a false positive, and TP a true positive; label 0 means the booking was not canceled and 1 means it was canceled. Below is the model's classification report.
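Both artifacts come from the `sklearn.metrics` functions imported at the top of the notebook. The toy labels below are invented to show the mechanics; `ravel()` flattens the 2x2 matrix into TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Toy labels standing in for y_test / y_pred (0 = not canceled, 1 = canceled).
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Per-class precision, recall, and F1 as a formatted text report.
report = classification_report(y_true, y_pred)
print(report)
```

In the notebook the same calls are made with the real `y_test` and the model's predictions.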
This concludes the walkthrough of building a predictive model with Python and Jupyter Notebook. Pairing theory with practice is the best way to learn, so go and try it yourself!