🔗 Project Repository
Loan Eligibility using Gradient Boosting
Built a machine learning model to predict loan approval likelihood from applicant credit, income, and employment data.
The task is to explore and cleanse a dataset of over 100,000 loan records to determine the best way to predict whether a loan applicant should be granted a loan. The next step is to build a machine learning model that returns each unique customer ID together with a loan status label indicating whether the loan should be given to that individual.
- Banks and credit institutions need quick, data-driven tools to evaluate loan applications.
- Manual screening is inconsistent and time-consuming. The bank wants to reduce ineligible approvals.
- They are most concerned about false positives: approving people who are not actually eligible.
- The model should therefore prioritize high precision. High precision means few false approvals; low precision means many ineligible applicants slipped through as "eligible".
- Languages: Python
- Libraries: scikit-learn, XGBoost, imbalanced-learn (SMOTE), fancyimpute, matplotlib, seaborn
- Environment: Spyder IDE
- Model: Gradient Boosting Classifier (best performance)
- Deployment-ready assets: GBM_Model_version.pkl, Output_LoanResult.csv
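The precision concern described in the problem statement above can be made concrete with a small calculation. The following is a minimal sketch in plain Python, assuming "Loan Approved" is the positive class; the `precision` helper and the toy labels are hypothetical and are not taken from the project code or data.

```python
def precision(true_labels, predicted_labels, positive="Loan Approved"):
    """Precision = TP / (TP + FP) for the positive (approved) class."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: predicted approved and actually eligible.
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    # False positives: predicted approved but actually not eligible.
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    # With no predicted approvals precision is undefined; 0.0 is returned
    # here purely as a convention for this sketch.
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical toy labels for illustration only.
actual = ["Loan Approved", "Loan Rejected", "Loan Approved",
          "Loan Rejected", "Loan Approved"]
predicted = ["Loan Approved", "Loan Approved", "Loan Approved",
             "Loan Rejected", "Loan Rejected"]

score = precision(actual, predicted)
print(f"Precision: {score:.2f}")
```

In this toy case, two of the three predicted approvals are correct, so one false approval out of three drags precision down to about 0.67.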
Dataset
Loan application data containing features such as income, credit score, debt ratio, and job history.
Key Preprocessing Steps
- Removed duplicate entries based on Loan ID
- Handled outliers using IQR capping and percentile thresholds
- Standardized inconsistent categorical values (e.g., merged "HaveMortgage" and "Home Mortgage")
- Imputed missing values using:
  - SoftImpute for numerical features
  - KNN imputation for categorical and mixed types
- Factorized and one-hot encoded categorical features
- Scaled numerical features to ensure consistent model performance
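Several of the preprocessing steps above can be sketched with pandas. This is an illustrative example under stated assumptions, not the project's actual pipeline: the column names and toy records are hypothetical, the imputation steps (SoftImpute, KNN) are left out, and scaling is done as a hand-written z-score rather than with whichever scaler the project used.

```python
import pandas as pd

# Hypothetical toy records; column names are assumptions, not the real schema.
df = pd.DataFrame({
    "Loan ID": ["a1", "a1", "b2", "c3", "d4", "e5"],
    "Annual Income": [50_000, 50_000, 62_000, 58_000, 61_000, 900_000],
    "Home Ownership": ["Rent", "Rent", "HaveMortgage",
                       "Home Mortgage", "Own Home", "Rent"],
})

# 1. Drop duplicate applications, keeping the first row per Loan ID.
df = df.drop_duplicates(subset="Loan ID", keep="first").reset_index(drop=True)

# 2. Cap outliers at the IQR fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["Annual Income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["Annual Income"] = df["Annual Income"].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
)

# 3. Merge inconsistent spellings of the same category.
df["Home Ownership"] = df["Home Ownership"].replace(
    {"HaveMortgage": "Home Mortgage"}
)

# 4. One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["Home Ownership"])

# 5. Standardize the numeric column (z-score: zero mean, unit variance).
income = encoded["Annual Income"]
encoded["Annual Income"] = (income - income.mean()) / income.std(ddof=0)

print(encoded)
```

Capping rather than dropping outliers keeps every application in the dataset, which matters when each row must receive a prediction.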
Evaluation criteria: To achieve a passing grade, the accuracy of the model has to be at least 70%.
- Tried multiple models: Logistic Regression, Random Forest, XGBoost, and Gradient Boosting
- Used SMOTE to balance class distribution and handle imbalance between approved/rejected loans
- Evaluated each model with the Matthews correlation coefficient (MCC), ROC-AUC, F1-score, precision, and recall
- Saved final model using joblib.dump()
- Loaded model in a separate script for real-time predictions on test data
- Added logic to output "Loan Approved" or "Loan Rejected" status with probabilities
- Exported final predictions to 'Output_LoanResult.csv'
- Achieved a ROC-AUC of 0.75 and a precision of 58%
- Top predictive features: Term, Current Loan Amount, Credit Score, Annual Income, and Home Ownership.
- These features provide valuable signals for lenders:
- Term: Longer loan durations may increase risk, guiding lenders to adjust interest rates or approval criteria.
- Current Loan Amount: Higher amounts suggest greater financial burden, helping set loan caps or require additional documentation.
- Credit Score: A direct measure of creditworthiness, enabling personalized offers and risk segmentation.
- Annual Income: Indicates repayment ability, used to calculate debt-to-income ratios and tailor loan limits.
- Home Ownership: Suggests financial stability and potential collateral, influencing approval likelihood and loan terms.
- Together, these insights empower lenders to make faster, fairer, and more informed loan decisions while minimizing default risk.
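The modeling, evaluation, and export steps listed earlier can be sketched end to end with scikit-learn and joblib. This is a minimal sketch under stated assumptions, not the project's actual scripts: the data is synthetic (`make_classification`), the 0.5 decision threshold, the `Customer ID` and `Approval Probability` column names, and the temporary output directory are all hypothetical, and SMOTE balancing is omitted (it would typically be applied to the training split only).

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import matthews_corrcoef, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption: target 1 = approve).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Train the gradient boosting classifier with default hyperparameters.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Evaluate with some of the metrics named in the write-up.
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)  # 0.5 threshold is an assumption
metrics = {
    "precision": precision_score(y_test, pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, proba),
    "mcc": matthews_corrcoef(y_test, pred),
}
print(metrics)

# Persist the model, then reload it as a separate scoring script would.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "GBM_Model_version.pkl")
joblib.dump(model, model_path)
loaded = joblib.load(model_path)

# Attach a status label and probability to each (hypothetical) customer ID.
loaded_proba = loaded.predict_proba(X_test)[:, 1]
results = pd.DataFrame({
    "Customer ID": range(len(X_test)),
    "Approval Probability": loaded_proba,
    "Loan Status": [
        "Loan Approved" if p >= 0.5 else "Loan Rejected" for p in loaded_proba
    ],
})
csv_path = os.path.join(out_dir, "Output_LoanResult.csv")
results.to_csv(csv_path, index=False)
```

Because the bank cares most about false approvals, raising the threshold above 0.5 is one way to trade recall for higher precision; the right value would need to be chosen on validation data.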