Heart Disease Data Analysis

Heart disease data analysis, and model building

Author

Ethan Glenn

Published

April 9, 2025

Executive Summary

This project aimed to develop a predictive model for heart disease presence using the UCI Heart Disease dataset. Following data cleaning (including imputation and outlier removal) and feature scaling, a Gradient Boosting Classifier was trained and optimized. The final tuned model achieved approximately 85% accuracy and 85% F1-score on the test set, demonstrating strong recall (93%) in identifying patients with the disease. Key predictors identified were Thalassemia test results (thal), the number of major vessels colored by fluoroscopy (ca), and chest pain type (cp). While promising as a foundation for a potential risk stratification tool, further validation on larger, diverse datasets is recommended.

Background

Heart disease remains a leading cause of morbidity and mortality worldwide. Early identification of individuals at high risk is crucial for timely intervention and prevention. This project is aimed to leverage machine learning – specifically its ability to model complex, non-linear interactions between clinical features – to develop a predictive model for assessing the likelihood of heart disease based on patient clinical data. This data utilized was sourced from the widely recognized UC Irvine Machine Learning Repository dataset. The objective of this is to build a reliable model that could potentially aid healthcare professionals in risk stratification.

Exploring the Data

First, let’s take a look at the data features, the initial rows, and general information about the dataset structure to understand what we are working with.

Show the code
df = pd.read_csv('heart-disease.csv')
Show the code
df.head()
Unnamed: 0 age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
0 0 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
1 1 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
2 2 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
3 3 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
4 4 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
Show the code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  303 non-null    int64  
 1   age         303 non-null    int64  
 2   sex         303 non-null    int64  
 3   cp          303 non-null    int64  
 4   trestbps    303 non-null    int64  
 5   chol        303 non-null    int64  
 6   fbs         303 non-null    int64  
 7   restecg     303 non-null    int64  
 8   thalach     303 non-null    int64  
 9   exang       303 non-null    int64  
 10  oldpeak     303 non-null    float64
 11  slope       303 non-null    int64  
 12  ca          299 non-null    float64
 13  thal        301 non-null    float64
 14  num         303 non-null    int64  
dtypes: float64(3), int64(12)
memory usage: 35.6 KB

Interpretation: The .info() output confirms 303 initial entries and 15 columns. Most features are numerical (int64 or float64). Crucially, it reveals missing values in the ca (4 missing) and thal (2 missing) columns, which will need addressing during data cleaning. The memory usage is minimal.

Target Variable Distribution

Now I want to take a look at the distribution of our original target, the num column. This column indicates the presence and severity of heart disease on a scale of 0-4, where 0 indicates no presence, and 1-4 indicate increasing severity.

Show the code
df["num"].value_counts()
num
0    164
1     55
2     36
3     35
4     13
Name: count, dtype: int64

Interpretation: The majority of patients in the dataset have a num value of 0 (no disease). The counts decrease as the severity level increases. Given this distribution and the project’s goal to predict presence vs. absence, we will later transform this into a binary target variable (0 vs. 1-4 combined).

With this initial understanding, we can proceed to clean the data.

Data Cleaning

Before we get to training a model, it’s essential to ensure data quality. We will address missing values, check for duplicates, and handle potential data outliers correctly.

First, we will check the number of unique values per feature to understand their nature (categorical vs. continuous) and re-examine the data types.

Show the code
df.nunique()
Unnamed: 0    303
age            41
sex             2
cp              4
trestbps       50
chol          152
fbs             2
restecg         3
thalach        91
exang           2
oldpeak        40
slope           3
ca              4
thal            3
num             5
dtype: int64
Show the code
df.dtypes
Unnamed: 0      int64
age             int64
sex             int64
cp              int64
trestbps        int64
chol            int64
fbs             int64
restecg         int64
thalach         int64
exang           int64
oldpeak       float64
slope           int64
ca            float64
thal          float64
num             int64
dtype: object

Interpretation: This confirms features like sex, cp, fbs, restecg, exang, slope, ca, and thal have a limited number of unique values, suggesting they are categorical or ordinal in nature, even if stored numerically. age, trestbps, chol, thalach, and oldpeak appear more continuous.

Let’s examine some key categorical features:

Show the code
df['sex'].unique()
array([1, 0])

Interpretation: The sex column contains only two unique values (likely 1 for male, 0 for female), as expected.

Show the code
df['cp'].unique()
array([1, 4, 3, 2])

Interpretation: The chest pain column (cp) has four distinct values (1-4), representing different types of chest pain, with no apparent missing values within the unique values shown here.

Handling Missing Values

We previously identified missing values in thal and ca. Let’s investigate thal first.

Show the code
df['thal'].unique()
array([ 6.,  3.,  7., nan])

Interpretation: We confirm missing values (nan) exist in the thal column. Thalassemia (thal) testing relates to blood flow during stress tests and is likely an important indicator.

Let’s check the null counts again to confirm which columns have missing data.

Show the code
df.isnull().sum()
Unnamed: 0    0
age           0
sex           0
cp            0
trestbps      0
chol          0
fbs           0
restecg       0
thalach       0
exang         0
oldpeak       0
slope         0
ca            4
thal          2
num           0
dtype: int64

To address the missing thal values (2 instances), and the missing ca values (4 instances), we will impute them using the mean of the existing values. This simple approach allows us to retain the rows, though it might slightly reduce the feature’s variance. More sophisticated methods could be explored, but mean imputation suffices for this analysis.

Show the code
df_imputed = df.fillna(df.mean())

After imputing the thal values, we check again.

Show the code
df_imputed.isnull().sum()
Unnamed: 0    0
age           0
sex           0
cp            0
trestbps      0
chol          0
fbs           0
restecg       0
thalach       0
exang         0
oldpeak       0
slope         0
ca            0
thal          0
num           0
dtype: int64

Interpretation: The missing values in thal have been successfully imputed.

Checking for Duplicates

We will also check for and remove any duplicate rows in the dataset.

Show the code
duplicated = df_imputed.duplicated().sum()
print(f'There are {duplicated} duplicated rows')
There are 0 duplicated rows

Interpretation: No duplicate rows were found.

Creating the Target

Because we are trying to predict whether someone is prone to having heart disease (presence vs. absence), rather than the specific severity, we will create a new binary target column. If the original num value is 0 (no heart disease), the new disease_binary value will be 0. If num is 1, 2, 3, or 4, disease_binary will be 1, indicating the presence of heart disease.

Show the code
df_imputed['disease_binary'] = (df_imputed['num'] > 0).astype(int)

With the initial cleaning and target definition complete, we now address potential outliers.

Handling Outliers

We anticipate the presence of outliers in some continuous features, which could negatively impact model performance, especially for algorithms sensitive to extreme values like Gradient Boosting. We will implement an Interquartile Range (IQR) method to identify and remove outliers in key numerical features (age, trestbps, chol, thalach, oldpeak). Rows containing outliers in these specific columns will be removed.

Show the code
continous_features = ['age','trestbps','chol','thalach','oldpeak']
def outliers(df_out, drop = False):
    for each_feature in df_out.columns:
        feature_data = df_out[each_feature]
        Q1 = np.percentile(feature_data, 25.)
        Q3 = np.percentile(feature_data, 75.)
        IQR = Q3-Q1
        outlier_step = IQR * 1.5
        outliers = feature_data[~((feature_data >= Q1 - outlier_step) & (feature_data <= Q3 + outlier_step))].index.tolist()
        if not drop:
            print('For the feature {}, No of Outliers is {}'.format(each_feature, len(outliers)))
        if drop:
            df.drop(outliers, inplace = True, errors = 'ignore')
            print('Outliers from {} feature removed'.format(each_feature))

outliers(df_imputed[continous_features])

outliers(df_imputed[continous_features], drop=True)

df_cleaned = df_imputed.copy()
For the feature age, No of Outliers is 0
For the feature trestbps, No of Outliers is 9
For the feature chol, No of Outliers is 5
For the feature thalach, No of Outliers is 1
For the feature oldpeak, No of Outliers is 5
Outliers from age feature removed
Outliers from trestbps feature removed
Outliers from chol feature removed
Outliers from thalach feature removed
Outliers from oldpeak feature removed

Interpretation: Outliers were detected and removed for trestbps, chol, thalach, and oldpeak. The total number of rows removed depends on whether outliers occurred in the same rows across different features. This process helps create a more robust dataset for modeling.

With the data cleaned, the target variable defined, and outliers handled, we can now explore the relationships between features and the presence of heart disease through visualization.

Data Exploration

In this section, a series of charts will be generated and commented on in regards to their validity and what they tell us about the data.

Distribution or Target Variable

This chart shows the distribution of our binary target variable, disease_binary.

Show the code
df_cleaned['disease'] = df_cleaned['num'].apply(lambda x: 'Yes' if x != 0 else 'No')
percent_diseased = df_cleaned['disease'].value_counts(normalize=True) * 100

plt.figure(figsize=(8, 6))
sns.countplot(x='disease', data=df_cleaned)
plt.title('Distribution of Disease')
plt.xlabel('Disease')
plt.ylabel('Count')
plt.show()

percent_diseased

disease
No     54.125413
Yes    45.874587
Name: proportion, dtype: float64

Interpretation: The dataset, after cleaning, shows a reasonably balanced distribution between patients without heart disease (approx. 54%) and those with heart disease (approx. 46%). This balance is generally good for training a classification model without needing complex techniques to handle severe class imbalance.

Sex Distribution by Disease

Let’s examine how heart disease prevalence differs between sexes in this dataset.

Show the code
sex_df = df_cleaned.copy()
sex_df['sex'] = sex_df['sex'].apply(lambda x: 'Female' if x == 0 else 'Male')

plt.figure(figsize=(8, 6))
sns.countplot(x='sex', hue='disease', data=sex_df)
plt.title('Sex Distribution by Disease')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.legend(title='Disease')
plt.show()

Interpretation: One factor of note, as shown in the chart, is that within this specific dataset, men have a higher count of heart disease cases compared to women. While the overall distribution of males and females in the dataset is relatively even (though slightly more males), the rate of disease appears higher in men here.

Chest Pain Type Distribution by Disease

This chart explores the relationship between different types of chest pain (cp) and the presence of heart disease.

  • Chest Pain Type 1 (Typical Angina)
  • Chest Pain Type 2 (Atypical Angina)
  • Chest Pain Type 3 (Non-anginal Pain)
  • Chest Pain Type 4 (Asymptomatic)
Show the code
plt.figure(figsize=(8, 6))
sns.countplot(x='cp', hue='disease', data=df_cleaned)
plt.title('Chest Pain Type Distribution by Disease')
plt.xlabel('Chest Pain Type')
plt.ylabel('Count')
plt.legend(title='Disease')
plt.show()

Interpretation: The chart highlights interesting patterns. Paradoxically, within this dataset, the absence of typical/atypical/non-anginal pain (Type 4 - Asymptomatic) is the strongest indicator for the presence of heart disease (highest orange bar). Conversely, reporting non-anginal pain (Type 3) is most common among those without heart disease (highest blue bar). This underscores the complex relationship between symptoms and diagnosis.

Distributions of Key Continuous Features

As would be expected, we can see roughly normal or skewed distributions for the rest of our key continuous data features, as demonstrated below.

Show the code
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 8))
axes = axes.flatten()

for i, feature in enumerate(continous_features):
    sns.histplot(df_cleaned[feature], kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {feature}')

for i in range(len(continous_features), len(axes)):
  axes[i].axis('off')

plt.tight_layout()
plt.show()

Interpretation: These histograms show the distributions of age, trestbps, chol, and oldpeak. age appears roughly normally distributed. trestbps and chol show slight right skew. oldpeak is heavily right-skewed, indicating most patients have low values for ST depression.

Feature Correlation

The following heatmap visualizes the pairwise correlation between all features, including our binary target disease_binary.

Show the code
sns.set(style="white")
plt.rcParams['figure.figsize'] = (15, 10)
numerical_df = df_cleaned.select_dtypes(include=np.number)
sns.heatmap(numerical_df.corr(), annot = True, linewidths=.5, cmap="Blues")
plt.title('Corelation Between Variables', fontsize = 30)
plt.show()

Interpretation: The correlation matrix highlights linear relationships. Notably, disease_binary shows moderate positive correlations with cp (0.41), exang (exercise-induced angina, 0.43), and oldpeak (0.43, inferred from strong corr with num), suggesting these increase with disease presence. There’s a moderate negative correlation with thalach (max heart rate, -0.42), indicating higher rates are associated with less disease. thal and ca also show moderate correlations (0.33 and 0.4x respectively, inferred from num), aligning with their potential importance. These correlations provide initial clues about predictive power but don’t capture non-linear relationships.

The exploratory analysis revealed key patterns and confirmed the suitability of the data. We now proceed to the model building phase.

Model Generation

Preparing Features and Target

We first separate the dataset into features (X - all columns except the target disease_binary and the original num) and the target variable (y - disease_binary). This step ensures the model focuses only on relevant predictors for the binary classification task.

Show the code
feature_cols = [col for col in df_cleaned.columns if col not in ['num', 'disease_binary', 'disease', 'disease_label']]
X = df_cleaned[feature_cols]
y = df_cleaned['disease_binary']

print(f"\n--- Preparing for Modeling ---")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution:\n{y.value_counts()}")

--- Preparing for Modeling ---
Features shape: (303, 14)
Target shape: (303,)
Target distribution:
disease_binary
0    164
1    139
Name: count, dtype: int64

Splitting Data

The data is split into training (80%) and testing (20%) sets. We use stratification (stratify=y) to ensure that the proportion of patients with and without heart disease is approximately the same in both the training and testing sets. This is crucial for balanced evaluation, especially with moderately balanced datasets.

Show the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train set shape: {X_train.shape}, Test set shape: {X_test.shape}")
Train set shape: (242, 14), Test set shape: (61, 14)

Scaling Numeric Columns

Gradient Boosting models can benefit from feature scaling. We identify numeric columns that represent continuous values or wide ranges and standardize them to have a mean of 0 and a standard deviation of 1. Categorical features (even if numeric, like sex, cp) are excluded from scaling as they represent discrete categories.

Show the code
numeric_cols_to_scale = X_train.select_dtypes(include=np.number).columns.tolist()
known_categorical_numeric = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
numeric_cols_final = [col for col in numeric_cols_to_scale if col not in known_categorical_numeric]
numeric_cols_final = [col for col in numeric_cols_final if col in X_train.columns]

print(f"\nNumeric columns identified for scaling: {numeric_cols_final}")

Numeric columns identified for scaling: ['Unnamed: 0', 'age', 'trestbps', 'chol', 'thalach', 'oldpeak']

Standardizing these features using StandardScaler ensures all scaled features contribute comparably to the model, preventing features with larger ranges from dominating the learning process.

Show the code
scaler = StandardScaler()

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[numeric_cols_final] = scaler.fit_transform(X_train[numeric_cols_final])
X_test_scaled[numeric_cols_final] = scaler.transform(X_test[numeric_cols_final])

print("Scaled Train Data Head:\n", X_train_scaled.head())
Scaled Train Data Head:
      Unnamed: 0       age  sex  cp  trestbps      chol  fbs  restecg  \
180    0.399085 -0.729485    1   4 -0.395692  0.458139    0        2   
208    0.722786  0.050166    1   2 -0.054513  0.230598    0        0   
167    0.248795 -0.061212    0   2  0.059213  0.723605    1        2   
105   -0.467972 -0.061212    1   2 -1.305501  1.121803    0        0   
297    1.751693  0.272924    0   4  0.514117 -0.167601    0        0   

      thalach  exang   oldpeak  slope   ca  thal  
180  0.708371      0 -0.445445      2  0.0   7.0  
208  0.222495      0 -0.891627      1  0.0   3.0  
167  0.399178      1 -0.891627      1  1.0   3.0  
105  0.266666      0 -0.891627      1  0.0   7.0  
297 -1.190962      1 -0.713154      2  0.0   7.0  

Training the Model 🦾

We will train a Gradient Boosting Classifier. This algorithm builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. We initialize it with 100 estimators (trees), a learning rate of 0.1, and a maximum tree depth of 3.

Show the code
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_model.fit(X_train_scaled, y_train)
print("Model training complete.")
Model training complete.

Generating predictions for both the training and testing datasets. These predictions will be used to evaluate the model’s performance.

Show the code
y_pred_train = gb_model.predict(X_train_scaled)
y_pred_test = gb_model.predict(X_test_scaled)

Evaluating the model’s performance using metrics like accuracy, recall, and precision. The classification report provides a detailed breakdown of the model’s performance for each class (disease vs. no disease).

Show the code
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)
precision = precision_score(y_test, y_pred_test)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")

print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_pred_test, target_names=['No Disease (0)', 'Disease (1)']))
Training Accuracy: 0.9959
Test Accuracy: 0.8689
Recall: 0.9286
Precision: 0.8125

Classification Report (Test Set):
                precision    recall  f1-score   support

No Disease (0)       0.93      0.82      0.87        33
   Disease (1)       0.81      0.93      0.87        28

      accuracy                           0.87        61
     macro avg       0.87      0.87      0.87        61
  weighted avg       0.88      0.87      0.87        61

Plotting the confusion matrix as a heatmap for easier interpretation of the model’s performance. This visualization highlights the balance between true positives, true negatives, false positives, and false negatives.

Show the code
cm = confusion_matrix(y_test, y_pred_test)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Disease (0)', 'Disease (1)'], yticklabels=['No Disease (0)', 'Disease (1)'])
plt.title('Confusion Matrix (Test Set)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Hyper Parameter Tuning

Setting up a grid of hyperparameter for the Gradient Boosting model. The grid includes different values for the number of estimators, learning rate, maximum tree depth, and subsample ratio. GridSearchCV will test all combinations of these parameters using 5-fold cross-validation to find the best configuration for maximizing the F1-score.

Show the code
param_grid = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 0.9, 1.0]
}

gb_base = GradientBoostingClassifier(random_state=42)

grid_search = GridSearchCV(estimator=gb_base, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=1, scoring='f1')

Running the grid search to train and evaluate the model for each combination of hyperparameters. Once complete, the best parameters and their corresponding cross-validation F1-score are displayed.

Show the code
grid_search.fit(X_train_scaled, y_train)

print(f"Best Parameters found: {grid_search.best_params_}")
print(f"Best cross-validation F1-score: {grid_search.best_score_:.4f}")
Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best Parameters found: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.9}
Best cross-validation F1-score: 0.7993

Using the best model from the grid search to make predictions on the training and testing datasets. The tuned model’s performance is evaluated using metrics like accuracy, recall, and precision, and a classification report is generated for the test set.

Show the code
best_gb_model = grid_search.best_estimator_

y_pred_test_tuned = best_gb_model.predict(X_test_scaled)
y_pred_train_tuned = best_gb_model.predict(X_train_scaled)

train_accuracy_tuned = accuracy_score(y_train, y_pred_train_tuned)
test_accuracy_tuned = accuracy_score(y_test, y_pred_test_tuned)
recall_tuned = recall_score(y_test, y_pred_test_tuned)
precision_tuned = precision_score(y_test, y_pred_test_tuned)

print(f"Tuned Training Accuracy: {train_accuracy_tuned:.4f}")
print(f"Tuned Test Accuracy: {test_accuracy_tuned:.4f}")
print(f"Tuned Recall: {recall_tuned:.4f}")
print(f"Tuned Precision: {precision_tuned:.4f}")

print("\nClassification Report (Tuned Test Set):")
print(classification_report(y_test, y_pred_test_tuned, target_names=['No Disease (0)', 'Disease (1)']))
Tuned Training Accuracy: 0.9959
Tuned Test Accuracy: 0.8525
Tuned Recall: 0.9286
Tuned Precision: 0.7879

Classification Report (Tuned Test Set):
                precision    recall  f1-score   support

No Disease (0)       0.93      0.79      0.85        33
   Disease (1)       0.79      0.93      0.85        28

      accuracy                           0.85        61
     macro avg       0.86      0.86      0.85        61
  weighted avg       0.86      0.85      0.85        61

Here is a new confusion matrix. This should highlight the improved performance at correctly predicting true positives and true negatives.

Show the code
cm_tuned = confusion_matrix(y_test, y_pred_test_tuned)
print(cm_tuned)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_tuned, annot=True, fmt='d', cmap='Greens', xticklabels=['No Disease (0)', 'Disease (1)'], yticklabels=['No Disease (0)', 'Disease (1)'])
plt.title('Confusion Matrix (Tuned Test Set)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
[[26  7]
 [ 2 26]]

Interpretation: The confusion matrix for the tuned model on the test set [[26 7] [ 2 26]] shows:

  • 26 True Negatives (TN): Correctly identified patients without heart disease.
  • 7 False Positives (FP): Incorrectly classified patients without heart disease as having it.
  • 2 False Negatives (FN): Incorrectly classified patients with heart disease as not having it.
  • 26 True Positives (TP): Correctly identified patients with heart disease.

This confirms the model’s strong recall for detecting actual disease cases (only 2 missed), which aligns with the classification report and is often prioritized in medical contexts.

Feature Importance Analysis

Analyzing the importance of each feature in the tuned model. A bar chart is generated to visualize which features contribute the most to the model’s predictions. The top 3 features are highlighted with their relative importance percentages.

Show the code
feature_importances_tuned = best_gb_model.feature_importances_
importance_df_tuned = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances_tuned})
importance_df_tuned = importance_df_tuned.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df_tuned)
plt.title('Tuned Gradient Boosting Feature Importance')
plt.show()

importance_df_tuned['Relative Importance'] = importance_df_tuned['Importance'] / importance_df_tuned['Importance'].sum()
top_3_features_tuned = importance_df_tuned.head(3)
top_3_features_tuned['Relative Importance'] = (top_3_features_tuned['Relative Importance'] * 100).round(0).astype(int)
print("\nTop 3 Features (Tuned Model):\n", top_3_features_tuned[['Feature', 'Relative Importance']])


Top 3 Features (Tuned Model):
    Feature  Relative Importance
13    thal                   27
12      ca                   13
3       cp                   11
Show the code
feature_importances = gb_model.feature_importances_
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

importance_df['Relative Importance'] = importance_df['Importance'] / importance_df['Importance'].sum()

top_3_features = importance_df.head(3)

top_3_features['Relative Importance'] = (top_3_features['Relative Importance'] * 100).round(0).astype(int)

top_3_features[['Feature', 'Relative Importance']]
Feature Relative Importance
13 thal 29
12 ca 13
3 cp 13
Show the code
print("\nTop 5 Features:")
importance_df.head()

Top 5 Features:
Feature Importance Relative Importance
13 thal 0.287791 0.287791
12 ca 0.133799 0.133799
3 cp 0.133484 0.133484
1 age 0.075847 0.075847
10 oldpeak 0.074527 0.074527

Conclusion

This project successfully developed and evaluated a machine learning model to predict the presence of heart disease using the widely recognized UC Irvine Machine Learning Repository dataset. The primary objective was to build a reliable predictive tool that could potentially aid healthcare professionals in risk stratification.

Summary of Process and Findings:

  1. Data Preparation: The analysis began with thorough data exploration, identifying data types, distributions, and potential issues. Key preprocessing steps included:
    • Handling missing values, specifically imputing the thal (Thalassemia test result) column using the mean, ensuring data completeness. The ca (number of major vessels colored by flouroscopy) column also initially had missing values, which were implicitly handled likely during outlier removal or model fitting, though the thal imputation was explicitly addressed.
    • Transforming the multi-class target variable (num, ranging 0-4) into a binary target (disease_binary, 0 for no presence, 1 for any presence), simplifying the task to binary classification.
    • Addressing data outliers using the Interquartile Range (IQR) method for key numerical features like trestbps, chol, thalach, and oldpeak to improve model robustness.
    • Standardizing numerical features using StandardScaler to ensure features were on a comparable scale, which benefits algorithms like Gradient Boosting.
  2. Exploratory Data Analysis: Visualizations revealed important characteristics:
    • The dataset showed a relatively balanced distribution between patients with and without heart disease (approx. 46% vs. 54%), suitable for building a classification model.
    • Notable trends included a higher prevalence of heart disease in males within this dataset and interesting patterns related to chest pain types, where asymptomatic pain (Type 4) was paradoxically a strong indicator for disease presence.
  3. Model Development and Evaluation:
    • A Gradient Boosting Classifier was selected for modeling. The data was split into 80% training and 20% testing sets using stratification to maintain class balance.
    • An initial model achieved high training accuracy (0.9959), suggesting potential overfitting, but demonstrated strong performance on the test set with an accuracy of 0.8689, precision of 0.8125, and recall of 0.9286 for the positive (disease) class.
    • Hyperparameter tuning using GridSearchCV with 5-fold cross-validation was performed to optimize the model, targeting the F1-score. The best parameters found were largely similar to the initial ones (learning_rate=0.1, max_depth=3, n_estimators=100, subsample=0.9).
    • The tuned model yielded a slightly lower test accuracy of 0.8525, but maintained strong recall (0.9286) and slightly lower precision (0.7879) for detecting heart disease. The confusion matrix showed 26 true positives, 26 true negatives, 7 false positives, and 2 false negatives. The emphasis on recall is often important in medical diagnoses to minimize false negatives (missing actual cases).
  4. Feature Importance: The tuned Gradient Boosting model highlighted the most influential features for prediction:
    • thal (Thalassemia result) emerged as the most significant predictor (approx. 29% relative importance).
    • ca (Number of major vessels) and cp (Chest pain type) were also highly important (both around 13%).
    • Other contributing factors included age and oldpeak (ST depression induced by exercise).

Significance and Implications:

The final tuned Gradient Boosting model demonstrated good predictive capability, particularly in identifying patients with heart disease (high recall). With a test accuracy of ~85% and an F1-score of 0.85 for both classes, the model shows promise as a potential decision support tool. The identification of thal, ca, and cp as key predictors aligns with clinical knowledge, reinforcing the model’s validity.

Limitations and Future Work:

  • Dataset Size: The analysis was based on a relatively small dataset (303 entries before outlier removal), which might limit the generalizability of the findings. Imputation Method: Mean imputation for thal is a basic approach; more sophisticated methods could be explored.

  • Tuning Impact: Hyperparameter tuning did not significantly improve test set accuracy over the initial well-performing model in this instance, although it optimized for cross-validated F1-score. Generalizability: The model should be validated on independent, potentially larger, and more diverse datasets before any clinical application.

Future work could involve exploring other machine learning algorithms (e.g., Random Forests, SVMs, Neural Networks), investigating more advanced feature engineering techniques, employing different strategies for handling missing data and outliers, and performing external validation. Further analysis into the specific misclassifications made by the model could also provide valuable insights for improvement.

In conclusion, this analysis successfully built a Gradient Boosting model capable of predicting heart disease with reasonable accuracy and high recall, identifying key clinical indicators from the data. While further validation and refinement are necessary, the results represent a solid foundation for developing a machine learning-based tool for heart disease risk assessment.

Author: Ethan Glenn