

Introduction - Cardiovascular Disease Classification: Data Science Project


• Created a classifier that identifies cardiovascular disease (accuracy: 81%) to help doctors detect the disease.

• Tuned Logistic Regression, K-Nearest Neighbors, Naive Bayes, XGBoost, LightGBM, and Random Forest classifiers using GridSearchCV to find the best model.

• Used the Cardiovascular Disease dataset from Kaggle.

• Explained the model's predictions using SHAP.
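The hyperparameter search mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic (generated with `make_classification` to mimic 11 features and a binary target), only Logistic Regression is tuned, and the grid of `C` values is an assumption chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle data: 11 features, binary target.
X, y = make_classification(n_samples=500, n_features=11, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled independently.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid: the values actually searched in the project are not stated.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_   # best value of C found by the search
best_cv_accuracy = search.best_score_  # mean cross-validated accuracy for it
```

The same pattern would apply to the other classifiers by swapping the estimator and its parameter grid.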

Data Collection


The dataset contains the following features:

• General information: age, weight, height, and gender
• Physical indicators: systolic blood pressure, diastolic blood pressure, cholesterol, and glucose
• Living habits: daily smoking, alcohol intake, and physical activity
• Target: whether the person has cardiovascular disease

Data Cleaning


I cleaned the data so that it was usable for the models. I made the following changes:

• Standardized the features.
• Split the data into training and validation sets, with a validation size of 20%.
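The two cleaning steps can be illustrated with a small dependency-free sketch. The helper names (`train_val_split`, `fit_scaler`, `standardize`) and the sample rows are hypothetical; in practice scikit-learn's `StandardScaler` and `train_test_split` do the same job. Note one assumption: the list above mentions scaling before splitting, whereas this sketch fits the scaler on the training rows only, a common way to keep validation statistics out of the training data.

```python
import random
import statistics


def train_val_split(rows, val_size=0.2, seed=42):
    """Shuffle a copy of rows and hold out a val_size fraction for validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_size)
    return shuffled[n_val:], shuffled[:n_val]


def fit_scaler(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Illustrative values only: age (years), weight (kg), height (cm).
rows = [
    [50, 62.0, 168], [55, 85.0, 156], [48, 64.0, 165], [61, 82.0, 169],
    [40, 56.0, 151], [60, 67.0, 151], [58, 93.0, 157], [62, 95.0, 178],
    [48, 71.0, 158], [54, 68.0, 164],
]

train_rows, val_rows = train_val_split(rows, val_size=0.2)
means, stds = fit_scaler(train_rows)  # statistics come from training rows only
train_scaled = standardize(train_rows, means, stds)
val_scaled = standardize(val_rows, means, stds)
```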

EDA


• I examined the normality of the data and the correlations between the variables. Below are a few highlights from the figures.

Correlation with other features:
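The correlations behind such a figure could be computed along these lines. This is a sketch using the Pearson coefficient written out by hand; the column names and the toy values are placeholders, not data from the Kaggle set, and the original analysis may have used a different correlation measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder columns and values, for illustration only.
features = {
    "systolic_bp": [110, 140, 130, 150, 100, 120, 130, 160],
    "height": [168, 156, 165, 169, 151, 151, 157, 178],
}
target = [0, 1, 0, 1, 0, 0, 1, 1]

# Correlation of each feature with the target, ranked by absolute strength.
correlations = {name: pearson(values, target) for name, values in features.items()}
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
```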

Model performance

Random Forest: accuracy on the training set = 81%

Naive Bayes: accuracy on the training set = 82%

XGBoost: accuracy on the training set = 82%

LightGBM: accuracy on the training set = 83%
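For reference, accuracy is the fraction of predictions that match the true labels. The sketch below shows the calculation on made-up labels; the `accuracy` helper and the example values are illustrative and do not reproduce the figures above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 1 = has cardiovascular disease, 0 = does not.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

score = accuracy(y_true, y_pred)  # 8 of 10 predictions match
```

Computing the same metric on the held-out validation set as well as the training set is the usual way to check whether a model generalizes.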