Imbalanced-learn: Essential Toolkit for Handling Imbalanced Datasets
Overview
Imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.
Why Imbalanced-learn Matters
In real-world datasets, it’s common to have imbalanced classes, where one class significantly outnumbers the others. A few typical examples:
- Credit Card Fraud: 99.9% legitimate transactions vs 0.1% fraudulent
- Medical Diagnosis: Rare diseases with few positive cases
- Manufacturing Defects: Most products pass quality control
- Customer Churn: Typically only 2-5% of customers churn
Standard machine learning algorithms often fail on imbalanced datasets: they can achieve high accuracy simply by predicting the majority class for every sample.
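A minimal sketch of this failure mode, using scikit-learn's make_classification to simulate an illustrative 99:1 split (the numbers are assumptions, not from any real dataset):
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
# Simulate a 99:1 imbalanced binary problem
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=42)
# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)
print(accuracy_score(y, y_pred))  # ~0.99 -- looks impressive
print(recall_score(y, y_pred))    # 0.0 -- never catches the minority class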
Key Features
1. Over-sampling Methods
- SMOTE (Synthetic Minority Over-sampling Technique)
- ADASYN (Adaptive Synthetic Sampling)
- BorderlineSMOTE and variants
- Random over-sampling
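All of these over-samplers expose the same fit_resample interface, so they can be swapped freely. A minimal sketch on synthetic data (the 95:5 split is illustrative):
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, RandomOverSampler
X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=42)
# Every sampler returns a resampled copy of the data via fit_resample
for sampler in (RandomOverSampler(random_state=42), SMOTE(random_state=42),
                ADASYN(random_state=42), BorderlineSMOTE(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))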
2. Under-sampling Methods
- Tomek Links
- Edited Nearest Neighbors
- Condensed Nearest Neighbors
- Random under-sampling
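Under-samplers follow the same pattern. A quick sketch contrasting random under-sampling (balances exactly) with Tomek-link cleaning (removes only boundary noise), again on illustrative synthetic data:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=42)
# Discard majority samples at random until the classes are balanced
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_rus))
# Remove only majority samples that form Tomek links with minority samples
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print(Counter(y_tl))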
3. Combination Methods
- SMOTEENN (SMOTE + Edited Nearest Neighbors)
- SMOTETomek (SMOTE + Tomek Links)
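These combined methods run both steps in a single fit_resample call. A minimal sketch on synthetic data:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=42)
# Over-sample with SMOTE, then clean noisy samples with Edited Nearest Neighbours
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X, y)
print(Counter(y_res))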
4. Ensemble Methods
- BalancedRandomForestClassifier
- BalancedBaggingClassifier
- RUSBoostClassifier
- EasyEnsembleClassifier
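Unlike the samplers above, these classifiers handle resampling internally (for example, per bootstrap sample), so they are used like ordinary scikit-learn estimators. A minimal sketch:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedRandomForestClassifier
X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
# Each tree is trained on a bootstrap sample under-sampled to balance the classes
clf = BalancedRandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))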
Quick Start Example
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Assuming X and y are your features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Apply SMOTE to balance the training data
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
# Train your model on balanced data
model = LogisticRegression()
model.fit(X_train_balanced, y_train_balanced)
# Evaluate on original imbalanced test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Installation
# Using pip
pip install imbalanced-learn
# Using conda
conda install -c conda-forge imbalanced-learn
# From source
git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git
cd imbalanced-learn
pip install .
Best Practices
1. Never Balance the Test Set
Only balance training data. Test on the original distribution to get realistic performance metrics.
2. Cross-Validation Strategy
Use StratifiedKFold to maintain the class distribution across folds, and put the sampler inside an imblearn Pipeline so resampling is applied only to each training fold:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.pipeline import Pipeline
pipeline = Pipeline([
    ('sampler', SMOTE()),
    ('classifier', LogisticRegression())
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')
3. Choose the Right Technique
- Small dataset: Prefer over-sampling (SMOTE)
- Large dataset: Consider under-sampling
- Noisy data: Use cleaning methods (Tomek Links)
- Ensemble preference: Try BalancedRandomForestClassifier
4. Monitor Multiple Metrics
Don’t rely on accuracy alone. Use the following (a sketch of computing them appears after this list):
- Precision, Recall, F1-score
- ROC-AUC and PR-AUC
- Confusion Matrix
- Class-specific metrics
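A minimal sketch of computing these metrics, assuming the fitted binary classifier model and the held-out test set from the Quick Start above:
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, average_precision_score)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1
print(confusion_matrix(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, y_proba))
print('PR-AUC:', average_precision_score(y_test, y_proba))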
Integration with ML Pipelines
Imbalanced-learn integrates seamlessly with scikit-learn pipelines through its own Pipeline class, which accepts samplers as steps and applies them during fit only, so resampling never touches prediction or scoring:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('sampler', SMOTE(sampling_strategy='auto')),
    ('classifier', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)
Comparison with Alternatives
| Tool | Pros | Cons | Best For |
|---|---|---|---|
| imbalanced-learn | Scikit-learn compatible, comprehensive methods | Python-only | General ML projects |
| Class weights | Simple, no data generation | Limited effectiveness | Quick baseline |
| Custom sampling | Full control | Time-consuming | Specific requirements |
| ROSE (R) | Good for R users | Not Python | R ecosystem |
Advanced Techniques
Custom Sampling Strategy
# Target number of samples for each class after resampling
# (an over-sampler can only grow a class, so each target must be >= its current count)
sampling_strategy = {0: 1000, 1: 1000, 2: 500}
smote = SMOTE(sampling_strategy=sampling_strategy)
Combining with Feature Selection
from sklearn.feature_selection import SelectKBest
from imblearn.pipeline import Pipeline
pipeline = Pipeline([
    ('feature_selection', SelectKBest(k=10)),
    ('sampler', SMOTE()),
    ('classifier', LogisticRegression())
])
Common Pitfalls to Avoid
- Data Leakage: Apply resampling only on training data
- Over-reliance on Accuracy: Use appropriate metrics for imbalanced data
- Ignoring Domain Knowledge: Sometimes imbalance reflects reality
- Not Trying Cost-Sensitive Learning: Consider adjusting class weights first, as sketched below
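Most scikit-learn classifiers accept a class_weight parameter, which makes a useful baseline before reaching for resampling. A minimal sketch, assuming the training split from the Quick Start:
from sklearn.linear_model import LogisticRegression
# 'balanced' weights errors inversely to class frequencies, so mistakes on the
# minority class are penalized more heavily -- no synthetic samples required
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)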
Resources
- Official Documentation: https://imbalanced-learn.org/
- GitHub Repository: https://github.com/scikit-learn-contrib/imbalanced-learn
- API Reference: https://imbalanced-learn.org/stable/references/index.html
- Examples Gallery: https://imbalanced-learn.org/stable/auto_examples/index.html
Conclusion
Imbalanced-learn is an essential tool for any data scientist dealing with real-world classification problems. Its scikit-learn compatibility and comprehensive set of techniques make it the go-to solution for handling imbalanced datasets in Python.
Whether you’re detecting fraud, diagnosing diseases, or predicting customer churn, imbalanced-learn provides the tools to build more robust and fair machine learning models.