This article introduces how to write a cardiovascular disease prediction model with Python. Many people run into difficulties with cases like this in practice, so let the editor walk you through how to handle these situations. I hope you read carefully and come away with something!
01 data understanding
The data come from a cardiovascular disease dataset shared on the Kaggle platform, with 13 fields and 299 patient diagnostic records in total. The specific fields, reconstructed from the column names used in the code below, are:

age: age of the patient (years)
anaemia: decrease of red blood cells or hemoglobin (0/1)
creatinine_phosphokinase: level of the CPK enzyme in the blood
diabetes: whether the patient has diabetes (0/1)
ejection_fraction: percentage of blood leaving the heart at each contraction
high_blood_pressure: whether the patient has hypertension (0/1)
platelets: platelet count in the blood
serum_creatinine: level of serum creatinine in the blood
serum_sodium: level of serum sodium in the blood
sex: sex of the patient (0/1)
smoking: whether the patient smokes (0/1)
time: follow-up period (days)
DEATH_EVENT: whether the patient died during the follow-up period (0/1, the target variable)
02 data reading and preliminary processing
First, import the required packages.
# data processing
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

# model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
import lightgbm

# preprocessing
from sklearn.preprocessing import StandardScaler

# model evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import plot_confusion_matrix, confusion_matrix, f1_score
Load and preview the dataset:
# read data
df = pd.read_csv('./data/heart_failure.csv')
df.head()
03 exploratory analysis
1. Descriptive analysis
df.describe().T
The results of the above descriptive analysis are briefly summarized as follows:
Death: the average death rate is 32%.
Age distribution: the average age is 60, with a minimum of 40 and a maximum of 95.
Diabetes: 41.8% of patients have diabetes.
Hypertension: 35.1% have high blood pressure.
Smoking: 32.1% are smokers.
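These rates can be double-checked directly from the data. A minimal sanity check (my own addition, using only columns that appear in this article): the mean of a 0/1 column is the share of ones.

# share of 1s in each binary column, as a percentage
binary_cols = ['DEATH_EVENT', 'diabetes', 'high_blood_pressure', 'smoking', 'anaemia']
print((df[binary_cols].mean() * 100).round(1))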
2. Target variable
# generate data
death_num = df['DEATH_EVENT'].value_counts()
death_num = death_num.reset_index()

# pie chart
fig = px.pie(death_num, names='index', values='DEATH_EVENT')
fig.update_layout(title_text='Distribution of the target variable DEATH_EVENT')
py.offline.plot(fig, filename='./html/Distribution of the target variable DEATH_EVENT.html')
There are 299 patients in total, of whom 96 did not survive during the follow-up period, accounting for 32.1% of the total.
3. Anemia
As the chart shows, patients with anemia have a higher probability of death, at 35.66%.
bar1 = draw_categorical_graph(df['anaemia'], df['DEATH_EVENT'], title='Red blood cell / hemoglobin decrease and survival status')
bar1.render('./html/Red blood cell hemoglobin decrease and survival status.html')
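Note that draw_categorical_graph is a helper the author uses without defining it. Judging from the .render() call it is likely built on pyecharts; here is a minimal sketch of what it might look like (the stacked-percentage layout is an assumption, not the author's code; it relies on the pandas import above):

from pyecharts import options as opts
from pyecharts.charts import Bar

def draw_categorical_graph(feature, target, title):
    # hypothetical reconstruction: per-category survival split, in percent
    ct = pd.crosstab(feature, target, normalize='index') * 100
    bar = (
        Bar()
        .add_xaxis([str(i) for i in ct.index])
        .add_yaxis('Survived', ct[0].round(2).tolist(), stack='total')
        .add_yaxis('Not survived', ct[1].round(2).tolist(), stack='total')
        .set_global_opts(title_opts=opts.TitleOpts(title=title))
    )
    return bar  # the caller then writes it out with bar.render(path)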
4. Age
The histogram shows that the age distribution of cardiovascular disease patients varies considerably, and the trend is clear: the older the patient, the lower the proportion who survive and the higher the proportion who die.
# generate data
surv = df[df['DEATH_EVENT'] == 0]['age']
not_surv = df[df['DEATH_EVENT'] == 1]['age']
hist_data = [surv, not_surv]
group_labels = ['Survived', 'Not Survived']

# histogram
fig = ff.create_distplot(hist_data, group_labels, bin_size=0.5)
fig.update_layout(title_text='Age and survival status')
py.offline.plot(fig, filename='./html/Age and survival status.html')
5. Age / sex
The grouped statistics and charts show no significant difference in survival status between the sexes. Among the deaths, the average age of the men is relatively high.
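The code for this chart is not shown in the article. A hedged sketch of how a similar view could be drawn with plotly express (the violin layout is my choice, not necessarily the original chart):

# age distribution by sex, split by survival status
fig = px.violin(df, x='sex', y='age', color='DEATH_EVENT', box=True, points='all')
fig.update_layout(title_text='Age, sex and survival status')
py.offline.plot(fig, filename='./html/Age sex and survival status.html')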
6. Age / smoking
Overall, the data show no significant relationship between smoking and survival. But among smokers, those under the age of 50 are more likely to survive.
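No code is shown here either; a quick pandas check of that claim (the under-50 split mirrors the text, the grouping itself is my own sketch):

# survival rate of smokers, split at age 50
smokers = df[df['smoking'] == 1]
surv_rate = 1 - smokers.groupby(smokers['age'] < 50)['DEATH_EVENT'].mean()
print(surv_rate)  # the True row is smokers under 50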
7. Creatine phosphokinase (CPK)
As the histogram shows, people with higher levels of the CPK enzyme in the blood are more likely to die.
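The plotting code for CPK and the remaining numeric features (sections 8 through 11) is not repeated in the article; presumably it follows the same ff.create_distplot pattern as the age chart above. A sketch that loops over them (the column list comes from the dataset; the bin-width heuristic is my choice):

num_cols = ['creatinine_phosphokinase', 'ejection_fraction', 'platelets',
            'serum_creatinine', 'serum_sodium']
for col in num_cols:
    hist_data = [df[df['DEATH_EVENT'] == 0][col], df[df['DEATH_EVENT'] == 1][col]]
    bin_size = (df[col].max() - df[col].min()) / 50  # roughly 50 bins per feature
    fig = ff.create_distplot(hist_data, ['Survived', 'Not Survived'], bin_size=bin_size)
    fig.update_layout(title_text=f'{col} and survival status')
    py.offline.plot(fig, filename=f'./html/{col} and survival status.html')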
8. Ejection fraction
Ejection fraction reflects the heart's pumping function; the probability of survival is lower at both very high and very low levels.
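To make the "too high and too low" claim concrete, here is a hedged sketch that bins ejection fraction and looks at the death rate per band (the bin edges are my own choice, not from the article):

# death rate by ejection-fraction band
bands = pd.cut(df['ejection_fraction'], bins=[0, 30, 45, 60, 100])
print(df.groupby(bands)['DEATH_EVENT'].mean().round(2))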
9. Platelets
A normal platelet count is roughly (100-300) × 10^9/L; levels above or below this range are abnormal and are associated with a lower probability of survival.
10. Serum creatinine level
Serum creatinine is the most commonly used indicator of renal function. A higher value indicates renal insufficiency or renal failure and is associated with a higher probability of death.
11. Serum sodium level
The chart shows that abnormally high or low serum sodium is often associated with risk.
12. Correlation analysis
The correlation plot of the numeric attributes shows no significant collinearity between the variables.
num_df = df[['age', 'creatinine_phosphokinase', 'ejection_fraction',
             'platelets', 'serum_creatinine', 'serum_sodium']]

plt.figure(figsize=(12, 12))
sns.heatmap(num_df.corr(), vmin=-1, cmap='coolwarm', linewidths=0.1, annot=True)
plt.title('Pearson correlation coefficient between numeric variables', fontdict={'fontsize': 15})
plt.show()
04 feature screening
We use statistical methods for feature screening. The target variable DEATH_EVENT is categorical, so when an independent variable is categorical we use the chi-square test, and when it is numeric we use analysis of variance (ANOVA).
# divide X and y
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

from feature_selection import Feature_select

fs = Feature_select(num_method='anova', cate_method='kf')
X_selected = fs.fit_transform(X, y)
X_selected.head()

2020 17:19:49 INFO attr select after select attr: ['serum_creatinine', 'serum_sodium', 'ejection_fraction', 'age', 'time']
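Note that feature_selection / Feature_select is the author's own helper module, which the article does not include. A rough stand-in using plain scikit-learn (this applies the ANOVA F-test to every column rather than switching to chi-square for the categorical ones; k=5 simply matches the five columns in the log above):

from sklearn.feature_selection import SelectKBest, f_classif

# score each feature against the class label with the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X, y)
selected_cols = X.columns[selector.get_support()]
X_selected = X[selected_cols]
print(list(selected_cols))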
05 data modeling
First, split the data into training and test sets.
# divide training set and test set
Features = X_selected.columns
X = df[Features]
y = df["DEATH_EVENT"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2020)

# standardization
scaler = StandardScaler()
scaler_Xtrain = scaler.fit_transform(X_train)
scaler_Xtest = scaler.transform(X_test)  # reuse the training-set statistics; the original called fit_transform here, which leaks test-set information

lr = LogisticRegression()
lr.fit(scaler_Xtrain, y_train)
test_pred = lr.predict(scaler_Xtest)

# F1-score
print("F1_score of LogisticRegression is:", round(f1_score(y_true=y_test, y_pred=test_pred), 2))
Next we model with a decision tree, using Gini impurity as the split criterion and a maximum tree depth of 5, and output the confusion matrix. Here class 1 (death) is the class we care about.
# DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=1)
clf.fit(X_train, y_train)
test_pred = clf.predict(X_test)

# F1-score
print("F1_score of DecisionTreeClassifier is:", round(f1_score(y_true=y_test, y_pred=test_pred), 2))

# plot
plt.figure(figsize=(10, 7))
plot_confusion_matrix(clf, X_test, y_test, cmap='Blues')
plt.title("DecisionTreeClassifier - Confusion Matrix", fontsize=15)
plt.xticks(range(2), ["Heart Not Failed", "Heart Fail"], fontsize=12)
plt.yticks(range(2), ["Heart Not Failed", "Heart Fail"], fontsize=12)
plt.show()

F1_score of DecisionTreeClassifier is: 0.61
Grid search is used for parameter tuning, with F1 as the optimization metric.
parameters = {'splitter': ('best', 'random'),
              'criterion': ('gini', 'entropy'),
              'max_depth': [*range(1, 20)]}
clf = DecisionTreeClassifier(random_state=1)
GS = GridSearchCV(clf, param_grid=parameters, cv=10, scoring='f1', n_jobs=-1)
GS.fit(X_train, y_train)
print(GS.best_params_)
print(GS.best_score_)

{'criterion': 'entropy', 'max_depth': 3, 'splitter': 'best'}
0.7638956305132776
Re-evaluate on the test set using the best model:
test_pred = GS.best_estimator_.predict(X_test)

# F1-score
print("F1_score of DecisionTreeClassifier is:", round(f1_score(y_true=y_test, y_pred=test_pred), 2))

# plot
plt.figure(figsize=(10, 7))
plot_confusion_matrix(GS, X_test, y_test, cmap='Blues')
plt.title("DecisionTreeClassifier - Confusion Matrix", fontsize=15)
plt.xticks(range(2), ["Heart Not Failed", "Heart Fail"], fontsize=12)
plt.yticks(range(2), ["Heart Not Failed", "Heart Fail"], fontsize=12)
plt.show()
Next, try a random forest:
# RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=1000, random_state=1)
parameters = {'max_depth': np.arange(2, 20, 1)}
GS = GridSearchCV(rfc, param_grid=parameters, cv=10, scoring='f1', n_jobs=-1)
GS.fit(X_train, y_train)
print(GS.best_params_)
print(GS.best_score_)

test_pred = GS.best_estimator_.predict(X_test)

# F1-score
print("F1_score of RandomForestClassifier is:", round(f1_score(y_true=y_test, y_pred=test_pred), 2))

{'max_depth': 3}
0.791157747481277
F1_score of RandomForestClassifier is: 0.53
Then gradient boosting:
gbl = GradientBoostingClassifier(n_estimators=1000, random_state=1)
parameters = {'max_depth': np.arange(2, 20, 1)}
GS = GridSearchCV(gbl, param_grid=parameters, cv=10, scoring='f1', n_jobs=-1)
GS.fit(X_train, y_train)
print(GS.best_params_)
print(GS.best_score_)

# test set
test_pred = GS.best_estimator_.predict(X_test)

# F1-score
print("F1_score of GradientBoostingClassifier is:", round(f1_score(y_true=y_test, y_pred=test_pred), 2))

{'max_depth': 3}
0.7288420428900305
F1_score of GradientBoostingClassifier is: 0.65
Use LGBMClassifier
lgb_clf = lightgbm.LGBMClassifier(boosting_type='gbdt', random_state=1)
parameters = {'max_depth': np.arange(2, 20, 1)}
GS = GridSearchCV(lgb_clf, param_grid=parameters, cv=10, scoring='f1', n_jobs=-1)
GS.fit(X_train, y_train)
print(GS.best_params_)
print(GS.best_score_)

# test set
test_pred = GS.best_estimator_.predict(X_test)

# F1-score
print("F1_score of LGBMClassifier is:", round(f1_score(y_true=y_test, y_pred=test_pred), 2))

{'max_depth': 2}
0.780378102289867
F1_score of LGBMClassifier is: 0.74

The F1 scores of the models on the test set compare as follows:

LogisticRegression: 0.63
DecisionTreeClassifier: 0.73
RandomForestClassifier: 0.53
GradientBoostingClassifier: 0.65
LGBMClassifier: 0.74

That concludes this introduction to "how to write a cardiovascular disease prediction model with Python". Thank you for reading. If you want to learn more about the field, you can follow the site, where the editor will keep publishing practical articles for you!