Shulou (Shulou.com), SLTechnology News&Howtos, updated 2025-04-02
This article explains in detail how to use Python to analyze the heart disease data set. I hope that after reading it you will have a working understanding of the topic.
We are all afraid of getting sick, but we have grown numb to colds and fevers, illnesses we have had since childhood, because they clear up within a week. As we grow older, though, inflammation of all kinds, the "three highs" (high blood pressure, high blood sugar and high blood lipids), heart disease and coronary heart disease start to appear.
Heart disease is frightening when it strikes, and it takes many lives every year. Those who live with it have to give up a great deal for the rest of their lives to prevent an attack.
When we are not sick, illness always feels far away. That is how I have thought about heart disease: I do not know what causes it, and I do not know how to maintain a normal life after a diagnosis.
Today I came across a heart disease data set on Kaggle (the dataset download address and source code are at the end of the article), so I am taking this opportunity to analyze it in depth.
Data set reading and simple description
First, import the libraries used in the follow-up analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Reading the data set and describing it produces two tables:
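The reading step itself is not shown in the article. A minimal sketch of it might look like the following, assuming the Kaggle file has been saved locally as heart.csv. Both the file name and the helper load_and_describe are assumptions for illustration. A two-row in-memory CSV with made-up values stands in for the file so the snippet runs on its own.

```python
import io
import pandas as pd

def load_and_describe(source):
    """Read the CSV and return the frame, its shape, and summary statistics."""
    data = pd.read_csv(source)
    return data, data.shape, data.describe()

# Made-up stand-in rows with the 14 column names listed in the article.
# With the real file, call load_and_describe("heart.csv") instead
# (the file name is an assumption).
demo = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "50,1,0,130,240,0,1,150,0,1.0,1,0,2,1\n"
    "60,0,2,140,260,1,0,140,1,2.0,2,1,3,0\n"
)
data, shape, summary = load_and_describe(demo)
print(shape)        # (rows, columns)
print(data.head())  # first rows of the table
print(summary)      # count, mean, std, min, quartiles and max per column
```

On the real file the shape should match the 303 rows and 14 columns reported in the article.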
The data has 303 rows and 14 columns, with the column headings age, sex, cp, ..., target. They read like the test sheet you get at every hospital visit, and many non-professionals will not recognize them. Translated from the official explanation, they mean:
age: the person's age
sex: the person's sex (1 = male, 0 = female)
cp: type of chest pain experienced (value 1: typical angina, value 2: atypical angina, value 3: non-anginal pain, value 4: asymptomatic)
trestbps: resting blood pressure (in mm Hg on admission)
chol: cholesterol measurement (in mg/dl)
fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: resting electrocardiographic result (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
thalach: maximum heart rate achieved
exang: exercise-induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest ("ST" refers to a segment of the ECG trace; this part is fairly specialized)
slope: slope of the peak exercise ST segment (value 1: upsloping, value 2: flat, value 3: downsloping)
ca: number of major vessels colored by fluoroscopy (0-4)
thal: a blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
target: heart disease (0 = no, 1 = yes)
All of these fields are physical measurements of sick or healthy people. None of them records whether the person smokes, stays up late, has a family history, or keeps a regular schedule, so the data cannot directly guide our daily lives, for example by telling us to quit smoking and drinking.
The descriptions above are only my own translation of the explanation given with the original data set. If there is any mistake, please correct it.
The first thing to do with a new set of data is to look at its general shape.
Male to female ratio
First, let's look at the disease ratio and the male-to-female ratio, which are routine checks.
countNoDisease = len(data[data.target == 0])
countHaveDisease = len(data[data.target == 1])
countfemale = len(data[data.sex == 0])
countmale = len(data[data.sex == 1])
# Each pair of print calls produces one output line: a count, then its share of all rows.
print(f'No patients: {countNoDisease}', end=', ')
print("no heart disease rate: {:.2f}%".format(countNoDisease / len(data.target) * 100))
print(f'Number of patients: {countHaveDisease}', end=', ')
print("heart disease rate: {:.2f}%".format(countHaveDisease / len(data.target) * 100))
print(f'Number of women: {countfemale}', end=', ')
print("proportion of women: {:.2f}%".format(countfemale / len(data.sex) * 100))
print(f'Number of males: {countmale}', end=', ')
print("male ratio: {:.2f}%".format(countmale / len(data.sex) * 100))
The code prints the following. At first glance there are more men than women, but keep in mind that the data is a sample of about 300 people and does not represent the whole population.
No patients: 138, no heart disease rate: 45.54%
Number of patients: 165, heart disease rate: 54.46%
Number of women: 96, proportion of women: 31.68%
Number of males: 207, male ratio: 68.32%
Besides a pie chart, these proportions can also be viewed side by side in one figure.
fig, ax = plt.subplots(1, 3)  # three side-by-side panels
fig.set_size_inches(w=15, h=5)  # set canvas size
sns.countplot(x="sex", data=data, ax=ax[0])
ax[0].set_xlabel("gender (0 = female, 1 = male)")
sns.countplot(x="target", data=data, ax=ax[1])
ax[1].set_xlabel("whether or not sick (0 = not sick, 1 = sick)")
# The swarmplot arguments were lost in the source; age by sex, colored by target,
# is reconstructed from the description in the text.
sns.swarmplot(x="sex", y="age", hue="target", data=data, ax=ax[2])
ax[2].set_xlabel("gender (0 = female, 1 = male)")
plt.show()
This three-panel figure shows that there are more men (1) than women (0), and more sick people (target 1) than healthy ones (target 0). In the age distribution panel, the share of patients is visibly higher among women than among men.
The bars are broken down by gender in the code and illustration below:
# The color codes were garbled in the source; the second value is a guess.
pd.crosstab(data.sex, data.target).plot(kind="bar", figsize=(15, 6), color=['#30A9DE', '#EFDC05'])
plt.title('disease picture in all genders')
plt.xlabel('gender (0 = female, 1 = male)')
plt.xticks(rotation=0)
plt.legend(["not sick", "suffering from heart disease"])
plt.ylabel('number')
plt.show()
You can see that in this data set the number of women with heart disease is about three times the number of healthy women. That leaves a question: is it easier for women to have heart disease? A quick search on Baidu showed that many people have asked this, but there is no specific scientific answer, and the same is true on Google. Finding the answer would probably require going to the literature, which is not the purpose of this article, so we will not look for the true proportion.
In this data set there are about twice as many men as women, 207 and 96 respectively, and slightly more sick than non-sick people, with 165 and 138 cases respectively. Because age is close to continuous, the relationship among age, sex and disease is drawn in the third panel. From the colors alone, the prevalence among women is higher than among men in this data set. From the fourth chart and the counts, the prevalence works out to 44.9% in males and 75% in females.
Note that the prevalence rates obtained in this article come only from this data set.
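The two prevalence figures can be reproduced with a short groupby. The sketch below does not read the real file. It rebuilds a sex/target table from counts back-calculated from the totals and rates quoted above (96 women, 207 men, 165 sick, 138 not sick), so the per-cell counts are an inference and not something printed in the original.

```python
import pandas as pd

# Counts per (sex, target) pair, back-calculated from the article's totals
# and prevalence rates; they are an inference, not values from the source.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}
rows = [
    {"sex": sex, "target": target}
    for (sex, target), n in counts.items()
    for _ in range(n)
]
data = pd.DataFrame(rows)

# target is 0/1, so its mean within each sex is the share of sick people.
prevalence = data.groupby("sex")["target"].mean()
print((prevalence * 100).round(1))

# The same figures as a table of row-wise shares.
print(pd.crosstab(data["sex"], data["target"], normalize="index").round(3))
```

Taking the mean of a 0/1 column is a compact way to get a rate; on the real data frame the same two lines apply unchanged.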
Relationship between age and disease
Does the disease rate change with age? Take a look at the following code.
(While writing this, it occurred to me that even if there is a change it may not be meaningful, because the sample is still limited. If the sample were 1,000 times larger, it might show something about the relationship between age and heart disease.)
# figsize was garbled in the source ("252.8"); (25, 8) is an assumed reading.
pd.crosstab(data.age, data.target).plot(kind="bar", figsize=(25, 8))
plt.title('disease distribution with age')
plt.xlabel('age')
plt.ylabel('number of people')  # the bars are counts, not ratios
plt.savefig('heartDiseaseAndAges.png')
plt.show()
In the output image, patients outnumber non-patients between the ages of 37 and 54. Whether this pattern holds as age continues to rise is unclear, and the number of patients rises again after 70. These are observations about this data, not conclusions.
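With only about 300 rows, the per-year bars are noisy. One way to smooth them is to group ages into bands and compute the disease rate per band. This is a sketch of that idea, not something the original article does; the helper name, the band edges and the six demo rows are all made up for illustration.

```python
import pandas as pd

def disease_rate_by_age_band(data, edges):
    """Share of rows with target == 1 inside each age band (left, right]."""
    bands = pd.cut(data["age"], bins=edges)
    return data.groupby(bands, observed=False)["target"].mean()

# Made-up rows, only to show the shape of the result.
demo = pd.DataFrame({
    "age":    [30, 35, 45, 50, 65, 70],
    "target": [0, 1, 1, 1, 0, 1],
})
rates = disease_rate_by_age_band(demo, edges=[29, 40, 55, 80])
print(rates)
```

On the real data, calling the function with the full data frame and wider or narrower edges shows how sensitive the apparent age pattern is to the chosen bands.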
There are many dimensions in the dataset that can be combined and analyzed, so let's start the combinatorial exploration and analysis.
The relationship among age, heart rate and disease
In this data set the maximum heart rate column is called thalach, so let's look at the relationship among age, heart rate and whether a person is sick.
# scatter plot
plt.scatter(x=data.age[data.target == 1], y=data.thalach[data.target == 1], c="red")
plt.scatter(x=data.age[data.target == 0], y=data.thalach[data.target == 0], c='#41D3BD')
plt.legend(["sick", "not sick"])
plt.xlabel("age")
plt.ylabel("maximum heart rate")
plt.show()
# violin plot; the arguments were lost in the source, so maximum heart rate
# by target is reconstructed from the discussion below
sns.violinplot(x="target", y="thalach", data=data)
plt.show()
Seeing a heart rate of 200 at around age 30 startled me. If that is not a sign of disease, a rate of 200 is remarkable.
What you can see is that the maximum heart rate of people with heart disease lies mostly between 140 and 200 bpm. This is generally higher than for people who are not sick, and the violin chart also shows that the distribution is higher and more concentrated than for healthy people.
The relationship between age and the distribution of blood pressure (trestbps)
Blood pressure is a routine measurement in any physical examination, so is there a relationship between blood pressure and age? And does heart disease have anything to do with either?
Let's make a picture and have a look, using different colors to tell the groups apart.
plt.scatter(x=data.age[data.target == 1], y=data.trestbps[data.target == 1], c="#FFA773")
plt.scatter(x=data.age[data.target == 0], y=data.trestbps[data.target == 0], c="#8DE0FF")
plt.legend(["sick", 'not sick'])
plt.xlabel("age")
plt.ylabel("blood pressure")
plt.show()
Does blood pressure seem to spread out more with age? What this result does show is that both patients and non-patients are evenly distributed across resting blood pressure values, with no clear stratification by age. So resting blood pressure alone is not a good way to tell whether someone has heart disease.
So what else is blood pressure related to?
Heart rate, perhaps? Okay, let's take a look.
Relationship between blood pressure (trestbps) and heart rate (thalach)
Blood pressure and heart rate both come from the pumping work of the heart, comparable to an engine's power and its speed. I would guess the two are related. Let's take a look.
plt.scatter(x=data.thalach[data.target == 1], y=data.trestbps[data.target == 1], c="#FFA773")
plt.scatter(x=data.thalach[data.target == 0], y=data.trestbps[data.target == 0], c="#8DE0FF")
plt.legend(["sick", 'not sick'])
plt.xlabel("heart rate")
plt.ylabel("blood pressure")
plt.show()
In reality, in this sample, apart from the already observed higher heart rate among patients, there is no visible correlation between blood pressure and heart rate.
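The "no visible correlation" reading of the scatter plot can be checked numerically with a Pearson coefficient. The sketch below is an addition, not part of the original analysis; the helper name is hypothetical and the six demo rows are made up so the snippet runs without the data file.

```python
import pandas as pd

def heart_rate_pressure_corr(data):
    """Pearson correlation of thalach and trestbps, overall and per target group."""
    overall = data["thalach"].corr(data["trestbps"])
    by_group = {
        int(target): group["thalach"].corr(group["trestbps"])
        for target, group in data.groupby("target")
    }
    return overall, by_group

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "thalach":  [100, 120, 140, 150, 160, 180],
    "trestbps": [120, 130, 125, 140, 120, 135],
    "target":   [0, 0, 0, 1, 1, 1],
})
overall, by_group = heart_rate_pressure_corr(demo)
print(round(overall, 3), by_group)
```

A coefficient near 0 on the real data would support the visual impression, while a value near 1 or -1 would contradict it. Splitting by target guards against the two groups hiding a relationship from each other.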
The relationship between the type of chest pain and heart disease and blood pressure
There are four types of chest pain in the table, coded 0, 1, 2 and 3. Do they have anything to do with heart disease? Let's take a look.
Note that the translated description above lists 1 = typical, 2 = atypical, 3 = non-anginal pain and 4 = asymptomatic.
But the data set uses 0, 1, 2 and 3. I read many people's notebooks on Kaggle and found no reasonable explanation for this, so I only visualize the data and do not interpret it.
# The swarmplot arguments were lost in the source; resting blood pressure by target,
# colored by chest pain type, is reconstructed from the section title and the conclusion.
sns.swarmplot(x="target", y="trestbps", hue="cp", data=data, size=6)
plt.xlabel('sickness')
plt.show()
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
sns.countplot(x="cp", data=data, ax=ax[0])
ax[0].set_title("chest pain type")
data.cp.value_counts().plot.pie(ax=ax[1], shadow=True, cmap='Blues')
ax[1].set_title("chest pain type")
plt.show()
The conclusion from the picture above: people with type 0 chest pain make up the majority of the non-disease group, while in the patient group types 1, 2 and 3 account for the majority.
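The same conclusion can be read from a table of shares instead of a plot. This is a small sketch using pd.crosstab with normalize="columns", so each target group sums to 1. The helper name and the eight demo rows are made up for illustration.

```python
import pandas as pd

def chest_pain_share(data):
    """Share of each cp type within the non-disease (0) and disease (1) groups."""
    return pd.crosstab(data["cp"], data["target"], normalize="columns")

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "cp":     [0, 0, 0, 1, 2, 3, 0, 2],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})
share = chest_pain_share(demo)
print(share)
```

On the real data, a large value in the cp = 0 row of the target = 0 column would match the statement that type 0 dominates the non-disease group.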
Relationship between angina pectoris caused by exercise and disease and heart rate
Does exercise-induced angina have anything to do with the disease? Does it have anything to do with heart rate? Draw a picture and have a look.
PS: angina pectoris caused by exercise (exang: 1 = yes; 0 = no)
# Only x="exang" survives in the source; y and hue are reconstructed
# from the axis label and the discussion below.
sns.swarmplot(x="exang", y="thalach", hue="target", data=data)
plt.xlabel('have you ever had angina pectoris')  # moved before plt.show() so the label is drawn
plt.ylabel('maximum heart rate')
plt.show()
The resulting image is very interesting!
Among people without exercise-induced angina (0 on the left), the maximum heart rate is concentrated at higher values, between 160 and 180, and those points belong to people with heart disease.
My guess is that they have heart disease, find exercise uncomfortable, and so do not exercise, which means the problem of "chest pain during exercise" never comes up.
Among the people who did have chest pain during exercise (1 on the right), the maximum heart rate is concentrated between 120 and 150, and many of them do not have heart disease, only an elevated heart rate.
The relationship between the number of large vessels (ca), blood pressure (trestbps) and disease
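# figsize was garbled in the source ("155.5"); (15, 5.5) is an assumed reading.
plt.figure(figsize=(15, 5.5))
# The swarmplot arguments were garbled in the source; ca on the x axis and
# hue="target" are reconstructed from the section title.
sns.swarmplot(x="ca", y="trestbps", hue="target", data=data)
plt.ylabel('resting blood pressure')
plt.show()
# catplot creates its own figure, so no plt.figure call is needed here.
sns.catplot(x="ca", y="age", hue="target", kind="swarm", data=data, palette='RdBu_r')
plt.xlabel('number of large vessels')
plt.ylabel('age')
plt.show()
The number of vessels refers to vessels colored under fluoroscopy. I have not found the specific medical meaning, so I do not analyze it further. The only observation is that a ca value of 0 is strongly associated with the disease.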
Age (age) and cholesterol (chol)
When I was in middle and high school, my mother told me not to eat more than two egg yolks a day, because it would cause high cholesterol. I was in good health at the time and never believed it, and in college I could not even guarantee that every day. But I remembered the sentence, so whenever I see the word cholesterol I think of this piece of family advice.
Cholesterol indirectly reflects blood lipids, so below is a scatter chart of the relationship among cholesterol, age and disease. To tell it apart from the earlier charts, I changed the colors this time.
plt.scatter(x=data.age[data.target == 1], y=data.chol[data.target == 1], c="orange")
plt.scatter(x=data.age[data.target == 0], y=data.chol[data.target == 0], c="green")
plt.legend(["sick", 'not sick'])
plt.xlabel("age")
plt.ylabel("cholesterol")
plt.show()
# box plot; the arguments were lost in the source, so cholesterol by target
# is reconstructed from the discussion below
sns.boxplot(x="target", y="chol", data=data)
plt.show()
In this sample there is no obvious stratification in cholesterol between patients and non-patients. The box plot shows that the whisker limits are about the same in both groups, while the 25%, 50% and 75% lines are slightly lower for patients.
The conclusion is that cholesterol does not directly indicate the presence or absence of heart disease.
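The three box-plot lines mentioned above are the quartiles, and they can be printed directly for each group. This is a sketch under the assumption that the columns are named chol and target as in the article. The helper name and the ten demo values are made up, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

def chol_quartiles(data):
    """25%, 50% and 75% cholesterol quantiles for each target group."""
    return data.groupby("target")["chol"].quantile([0.25, 0.5, 0.75]).unstack()

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "chol":   [200, 220, 240, 260, 280, 190, 210, 230, 250, 270],
    "target": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
quartiles = chol_quartiles(demo)
print(quartiles)
```

Comparing the two rows of the resulting table on the real data gives the size of the "slightly lower" gap without reading it off the plot.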
Correlation analysis
After all this analysis, which columns are related to the disease, and how are the columns related to each other? Take a look at the picture. With the colormap used here, the redder a cell is, the more positive the correlation, and the bluer it is, the more negative.
plt.figure(figsize=(15, 10))
ax = sns.heatmap(data.corr(), cmap=plt.cm.RdYlBu_r, annot=True, fmt='.2f')
# Workaround for matplotlib 3.1.1, which cut off the top and bottom heatmap rows;
# on other versions these two lines can be dropped.
a, b = ax.get_ylim()
ax.set_ylim(a + 0.5, b - 0.5)
plt.show()
Read the last row of the heatmap, the one for target: the disease is positively correlated with cp, thalach and slope, and negatively correlated with exang, oldpeak, ca, thal and so on.
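The target row of the heatmap can also be pulled out and sorted, which makes the strongest positive and negative columns easier to spot than in a 14 by 14 grid. This is a minimal sketch; the helper name and the four demo rows are made up for illustration.

```python
import pandas as pd

def rank_by_target_corr(data):
    """Correlation of every other column with target, strongest positive first."""
    return data.corr()["target"].drop("target").sort_values(ascending=False)

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "thalach": [120, 130, 160, 170],
    "oldpeak": [2.0, 1.5, 0.5, 0.0],
    "target":  [0, 0, 1, 1],
})
ranked = rank_by_target_corr(demo)
print(ranked)
```

On the real data frame, the top of the list should contain the positively related columns named above and the bottom the negatively related ones.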
This covers only part of the heart disease data set; the 14 columns allow many more combinations. In addition, this article does not fit any model and relies on data visualization for a brief analysis.
That is all for how to use Python to analyze the heart disease data set. I hope the content above is of some help. If you think the article is good, share it so more people can see it.