SHAP Values and Other Indicators of Feature Predictive Power in Binary Classification

📌 Project Overview

This repository contains the implementation, analysis, and experiments from the thesis SHAP Values and Other Indicators of Feature Predictive Power in Binary Classification. The primary objective of this project is a comparative analysis of feature predictive power assessment methods, evaluating the reliability of SHAP (SHapley Additive exPlanations) values against Permutation Feature Importance (PFI) and standard statistical measures such as Pearson Correlation and Normalized Mutual Information (NMI).

The study explores these methods in the context of Logistic Regression and XGBoost models, specifically highlighting how they behave under extreme multicollinearity.

🔬 Methodology

Interpretability Methods Analyzed

  1. SHAP (SHapley Additive exPlanations) values
  2. Permutation Feature Importance (PFI)
  3. Pearson Correlation Coefficient
  4. Normalized Mutual Information (NMI)

Datasets

  1. Breast Cancer Wisconsin (Diagnostic) Dataset: A strictly numerical dataset of 569 samples and 30 features derived from digitized images of cell nuclei. It is characterized by severe multicollinearity (e.g., radius, perimeter, and area are mathematically related).

  2. Adult Census Income Dataset: Contains 32,561 records with a mix of continuous and categorical features, predicting whether an individual earns more than $50,000 annually.

Models

  1. Logistic Regression
  2. XGBoost

📊 Key Findings and Experiments

1. The Impact of Multicollinearity

In the Breast Cancer Wisconsin dataset, near-perfect linear relationships were observed, such as between mean radius and mean perimeter ($r=0.9979$).
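The two statistics reported alongside these plots can be reproduced with scikit-learn. A minimal sketch follows; the quantile-binning step used to discretize continuous features before computing NMI is an illustrative assumption, not necessarily the thesis's exact scheme:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import normalized_mutual_info_score

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Pearson correlation between the two near-collinear features
r = X["mean radius"].corr(X["mean perimeter"])
print(f"Pearson r = {r:.4f}")  # ~0.998 for this pair

def nmi_continuous(a, b, bins=10):
    """NMI for continuous features via quantile binning (an assumed scheme)."""
    edges = np.linspace(0, 1, bins + 1)[1:-1]
    da = np.digitize(a, np.quantile(a, edges))
    db = np.digitize(b, np.quantile(b, edges))
    return normalized_mutual_info_score(da, db)

print(f"NMI = {nmi_continuous(X['mean radius'], X['mean perimeter']):.3f}")
```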

Correlation Plots

Pairwise relationships between selected features from the Breast Cancer Wisconsin dataset. Below-diagonal panels show scatter plots colored by diagnosis (red = malignant, blue = benign). Near-perfect collinearity is visible between mean radius and mean perimeter, along with nonlinear (approximately quadratic) relationships between mean area and each of mean radius and mean perimeter, since area scales with the square of the radius. Diagonal panels display kernel density estimates split by class. For each feature pair, the Pearson Correlation Coefficient (Corr) and Normalized Mutual Information (NMI) are reported.

2. Discrepancies Between SHAP and PFI

Experimental results reveal significant inconsistencies between global rankings derived from PFI and SHAP.


Permutation Feature Importance (PFI) for Breast Cancer Dataset. The x-axis represents the drop in ROC AUC score when feature values are permuted.


Permutation Feature Importance (PFI) for Adult Census Income Dataset. The x-axis represents the drop in ROC AUC score when feature values are permuted.
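A PFI run of this kind can be sketched with scikit-learn's `permutation_importance`. The train/test split, repeat count, and Logistic Regression baseline below are illustrative assumptions, not the thesis's exact configuration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)

# importances_mean[j] = average drop in ROC AUC when feature j is permuted
pfi = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                             n_repeats=10, random_state=42)
ranking = X.columns[pfi.importances_mean.argsort()[::-1]]
print(list(ranking[:5]))  # top-5 features by PFI
```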


SHAP summary plots for Breast Cancer Dataset. Points represent individual instances; color indicates feature value (red=high, blue=low); x-axis displays the SHAP values. High values of worst texture (red) result in negative SHAP values, increasing the predicted risk of malignancy.


SHAP summary plots for Adult Census Income Dataset. Each point represents a single observation.
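In practice these plots come from the `shap` library; for a linear model, however, the per-instance SHAP values have a closed form under the feature-independence assumption, which the following sketch uses to avoid extra dependencies:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
Xs = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=5000).fit(Xs, y)

# Linear SHAP under feature independence: phi_ij = w_j * (x_ij - E[x_j]),
# expressed in log-odds space; one value per instance and feature.
phi = clf.coef_[0] * (Xs - Xs.mean(axis=0))

# Global ordering used by summary plots: mean |SHAP| per feature
global_importance = np.abs(phi).mean(axis=0)
top5 = np.argsort(global_importance)[::-1][:5]
```

The efficiency property holds exactly here: for each instance, the SHAP values sum to the model's log-odds output minus the baseline $w \cdot E[x] + b$.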

3. Iterative Feature Addition

Iterative selection experiments (Forward and Backward Selection) demonstrated that SHAP does not necessarily prioritize features with the highest standalone predictive power in redundant datasets. When features were added sequentially based on rankings, PFI and Correlation sometimes recovered model performance faster than SHAP in the initial stages.
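A forward-addition curve of this kind can be sketched as follows. Absolute correlation with the target stands in for the ranking (in the experiments the rankings come from SHAP, PFI, Correlation, or NMI), and the cross-validation setup is an assumption:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Stand-in ranking: |Pearson correlation| with the target, best first
ranking = X.corrwith(y).abs().sort_values(ascending=False).index

aucs = []
for k in range(1, 11):  # first ten points of the curve
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    aucs.append(cross_val_score(model, X[ranking[:k]], y,
                                scoring="roc_auc", cv=5).mean())
```

Plotting `aucs` against `k` for each ranking method yields the curves shown below.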


Iterative Feature Addition Curves (Forward Ranking) for Breast Cancer Wisconsin Dataset (Logistic Regression).


Iterative Feature Addition Curves (Forward Ranking) for Adult Census Income Dataset (XGBoost).

4. Mitigating Multicollinearity via Hierarchical Clustering

To stabilize importance rankings and address the substitution effect, a dimensionality reduction strategy using Hierarchical Clustering (Ward’s minimum variance method) was applied to the feature space.
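The clustering step can be sketched with SciPy. The correlation-based distance below is an assumption about how the feature space is embedded before Ward linkage, so the exact cluster count at a given threshold need not match the 11 reported in the thesis:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, ward
from scipy.spatial.distance import squareform
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Distance between features: 1 - |corr| (assumed; other embeddings work too)
corr = np.corrcoef(X.values, rowvar=False)
dist = 1.0 - np.abs(corr)

# Ward's minimum variance linkage on the condensed distance matrix
Z = ward(squareform(dist, checks=False))

# Cut the dendrogram at a distance threshold and keep one
# representative feature per resulting cluster
clusters = fcluster(Z, t=0.5, criterion="distance")
representatives = [X.columns[clusters == c][0] for c in np.unique(clusters)]
```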

| Metric    | Full Model (30 Features) | Reduced Model (11 Features, $t=0.5$) |
|-----------|--------------------------|--------------------------------------|
| Precision | 0.98                     | 0.98                                 |
| Recall    | 0.95                     | 0.95                                 |
| F1-Score  | 0.96                     | 0.96                                 |
| Accuracy  | 0.97                     | 0.97                                 |
| ROC AUC   | 0.9974                   | 0.9967                               |

Performance Comparison: Full Feature Set vs. Reduced Feature Set at $t=0.5$ (class: Malignant)


Hierarchical Clustering Dendrogram (Ward Linkage). The y-axis represents the Ward linkage distance (increase in within-cluster variance). The black dashed line at threshold $t=0.5$ cuts the tree into 11 distinct clusters by grouping highly redundant features.


Permutation Feature Importance (PFI) for the Reduced Model. Importance scores are more pronounced than in the full model, and all retained predictors now have positive PFI scores.


SHAP Summary Plot for the Reduced Model (11 features). Compared to the full model, the SHAP values here exhibit substantially higher magnitude (maximum $\approx 2.23$ vs $0.96$) and a wider spread, facilitating improved interpretability.

💡 Conclusions