Skills: Python, Scikit-learn, EDA, LightGBM

A LightGBM Model to identify borrowers who are most likely to repay a loan,thus resulting in a profitable bank. png

Business Problem

The two most critical questions in the lending industry are: 1) How risky is the borrower? 2) Given the borrower’s risk, should we lend him/her? The answer to the first question determines the interest rate the borrower would have. Interest rate measures among other things (such as time value of money) the riskness of the borrower, i.e. the riskier the borrower, the higher the interest rate.

Investors (lenders) provide loans to borrowers in exchange for the promise of repayment with interest. That means the lender only makes profit (interest) if the borrower pays off the loan. However, if he/she doesn’t repay the loan, then the lender loses money.

Our Goal is to understand the loan business of the bank.It wants to identify who would be the right target for the loans. The data contains both customers who have been given loans and those who haven’t yet received loans. The aim is to identify of the customers who haven’t been granted a loan which are more likely to repay the loan and provide loans to them.

Data Set

Borrower_table - informtion about the borrowers such as age, salary, checking/savings amount and previous credit history.
Loan_table – loan details such as repayment history, loan purpose.

Roadmap

Dataset missing value, outlier analysis
Data Transformation
Data Exploration for Insights
Data Preparation for Model Building
Model Build to maximize profit
Model Selection
Recommendations based on target audience

import numpy as np
import pandas as pd


from sklearn.metrics import auc, roc_curve,roc_auc_score, classification_report,accuracy_score
from sklearn.metrics import roc_curve
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
%matplotlib inline

import lightgbm as lgb

Data Loading

Loan Table:

Dataset contains 101100 unique loans and has 5 features including loan granted and loan repaid.
Loan Repaid has 3 values: 0,1,NaN
Definition of Loan Repaid:
Loan Granted and Repaid then repaid is 1
Loan Granted and Not Repaid then repaid is 0
Loan Not Granted then Repaid is NaN

loan = pd.read_csv('loan_table.csv',parse_dates =[ 'date'])
loan.head()

	loan_id	loan_purpose	date	loan_granted	loan_repaid
0	19454	investment	2012-03-15	0	NaN
1	496811	investment	2012-01-17	0	NaN
2	929493	other	2012-02-09	0	NaN
3	580653	other	2012-06-27	1	1.0
4	172419	business	2012-05-21	1	0.0

loan.shape

(101100, 5)

len(loan.loan_id.unique())

loan.isnull().sum()

loan_id             0
loan_purpose        0
date                0
loan_granted        0
loan_repaid     53446
dtype: int64

loan[(loan.loan_granted == 1) & (loan.loan_repaid.isnull())]

	loan_id	loan_purpose	date	loan_granted	loan_repaid

Borrower Table:
Dataset has loan id which is foreign key to loans table.
It has 101100 obs with 12 features related to the borrower.
A few columns have missing values. Currently Repaying other loans will be Nan if the borrower has never applied for a loan before.

borrower = pd.read_csv('borrower_table.csv')
borrower.head()

	loan_id	is_first_loan	fully_repaid_previous_loans	currently_repaying_other_loans	total_credit_card_limit	avg_percentage_credit_card_limit_used_last_year	saving_amount	checking_amount	is_employed	yearly_salary	age	dependent_number
0	289774	1	NaN	NaN	8000	0.49	3285	1073	0	0	47	3
1	482590	0	1.0	0.0	4500	1.03	636	5299	1	13500	33	1
2	135565	1	NaN	NaN	6900	0.82	2085	3422	1	24500	38	8
3	207797	0	1.0	0.0	1200	0.82	358	3388	0	0	24	1
4	828078	0	0.0	0.0	6900	0.80	2138	4282	1	18100	36	1

borrower.shape

(101100, 12)

len(borrower.loan_id.unique())

borrower.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101100 entries, 0 to 101099
Data columns (total 12 columns):
 #   Column                                           Non-Null Count   Dtype  
---  ------                                           --------------   -----  
 0   loan_id                                          101100 non-null  int64  
 1   is_first_loan                                    101100 non-null  int64  
 2   fully_repaid_previous_loans                      46153 non-null   float64
 3   currently_repaying_other_loans                   46153 non-null   float64
 4   total_credit_card_limit                          101100 non-null  int64  
 5   avg_percentage_credit_card_limit_used_last_year  94128 non-null   float64
 6   saving_amount                                    101100 non-null  int64  
 7   checking_amount                                  101100 non-null  int64  
 8   is_employed                                      101100 non-null  int64  
 9   yearly_salary                                    101100 non-null  int64  
 10  age                                              101100 non-null  int64  
 11  dependent_number                                 101100 non-null  int64  
dtypes: float64(3), int64(9)
memory usage: 9.3 MB

borrower.isnull().sum()

loan_id                                                0
is_first_loan                                          0
fully_repaid_previous_loans                        54947
currently_repaying_other_loans                     54947
total_credit_card_limit                                0
avg_percentage_credit_card_limit_used_last_year     6972
saving_amount                                          0
checking_amount                                        0
is_employed                                            0
yearly_salary                                          0
age                                                    0
dependent_number                                       0
dtype: int64

borrower.describe()

	loan_id	is_first_loan	fully_repaid_previous_loans	currently_repaying_other_loans	total_credit_card_limit	avg_percentage_credit_card_limit_used_last_year	saving_amount	checking_amount	is_employed	yearly_salary	age	dependent_number
count	101100.000000	101100.000000	46153.000000	46153.000000	101100.000000	94128.000000	101100.000000	101100.000000	101100.000000	101100.000000	101100.000000	101100.000000
mean	499666.826726	0.543492	0.899291	0.364332	4112.743818	0.724140	1799.617616	3177.150821	0.658675	21020.727992	41.491632	3.864748
std	288662.006929	0.498107	0.300946	0.481247	2129.121462	0.186483	1400.545141	2044.448155	0.474157	18937.581415	12.825570	2.635491
min	30.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	18.000000	0.000000
25%	250333.750000	0.000000	1.000000	0.000000	2700.000000	0.600000	834.000000	1706.000000	0.000000	0.000000	32.000000	2.000000
50%	499885.000000	1.000000	1.000000	0.000000	4100.000000	0.730000	1339.000000	2673.000000	1.000000	21500.000000	41.000000	3.000000
75%	749706.250000	1.000000	1.000000	1.000000	5500.000000	0.860000	2409.000000	4241.000000	1.000000	35300.000000	50.000000	6.000000
max	999987.000000	1.000000	1.000000	1.000000	13500.000000	1.090000	10641.000000	13906.000000	1.000000	97200.000000	79.000000	8.000000

Data Merging

Combine Loan and Borrower information to get a full dataset which can be used for analysis.

full_data = pd.merge(left = borrower, right = loan, on = 'loan_id',how = 'left')
full_data.head()

	loan_id	is_first_loan	fully_repaid_previous_loans	currently_repaying_other_loans	total_credit_card_limit	avg_percentage_credit_card_limit_used_last_year	saving_amount	checking_amount	is_employed	yearly_salary	age	dependent_number	loan_purpose	date	loan_granted	loan_repaid
0	289774	1	NaN	NaN	8000	0.49	3285	1073	0	0	47	3	business	2012-01-31	0	NaN
1	482590	0	1.0	0.0	4500	1.03	636	5299	1	13500	33	1	investment	2012-11-02	0	NaN
2	135565	1	NaN	NaN	6900	0.82	2085	3422	1	24500	38	8	other	2012-07-16	1	1.0
3	207797	0	1.0	0.0	1200	0.82	358	3388	0	0	24	1	investment	2012-06-05	0	NaN
4	828078	0	0.0	0.0	6900	0.80	2138	4282	1	18100	36	1	emergency_funds	2012-11-28	0	NaN

full_data.shape

(101100, 16)

Total 47,654 loans were granted which is 47% of the total loans and 53% were not granted.

grant = len(full_data[full_data['loan_granted'] == 1])
no_grant = len(full_data[full_data['loan_granted'] == 0])
print('Total Loans Granted: {}'.format(grant))
print('Total Loans Not Granted: {}'.format(no_grant))

Total Loans Granted: 47654
Total Loans Not Granted: 53446

Filter the dataset to include only the loans which were granted. This leaves us with 47654 observations.

data = full_data[full_data['loan_granted'] == 1]
len(data)

data.head()

	loan_id	is_first_loan	fully_repaid_previous_loans	currently_repaying_other_loans	total_credit_card_limit	avg_percentage_credit_card_limit_used_last_year	saving_amount	checking_amount	is_employed	yearly_salary	age	dependent_number	loan_purpose	date	loan_granted	loan_repaid
2	135565	1	NaN	NaN	6900	0.82	2085	3422	1	24500	38	8	other	2012-07-16	1	1.0
5	423171	1	NaN	NaN	6100	0.53	6163	5298	1	29500	24	1	other	2012-11-07	1	1.0
7	200139	1	NaN	NaN	4000	0.57	602	2757	1	31700	36	8	business	2012-09-19	1	0.0
8	991294	0	1.0	0.0	7000	0.52	2575	2917	1	58900	33	3	emergency_funds	2012-12-04	1	1.0
9	875332	0	1.0	0.0	4300	0.83	722	892	1	5400	32	7	business	2012-01-20	1	1.0

Data Quality Analysis

We see that a few columns such as currently repaying previous loans has large number of nulls.
After closer examination we see currently repaying previous loans and fully repaid previous loans is null for only first loan which makes sense since the borrower is applying for a loan for the first time.

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47654 entries, 2 to 101099
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype         
---  ------                                           --------------  -----         
 0   loan_id                                          47654 non-null  int64         
 1   is_first_loan                                    47654 non-null  int64         
 2   fully_repaid_previous_loans                      21865 non-null  float64       
 3   currently_repaying_other_loans                   21865 non-null  float64       
 4   total_credit_card_limit                          47654 non-null  int64         
 5   avg_percentage_credit_card_limit_used_last_year  46751 non-null  float64       
 6   saving_amount                                    47654 non-null  int64         
 7   checking_amount                                  47654 non-null  int64         
 8   is_employed                                      47654 non-null  int64         
 9   yearly_salary                                    47654 non-null  int64         
 10  age                                              47654 non-null  int64         
 11  dependent_number                                 47654 non-null  int64         
 12  loan_purpose                                     47654 non-null  object        
 13  date                                             47654 non-null  datetime64[ns]
 14  loan_granted                                     47654 non-null  int64         
 15  loan_repaid                                      47654 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(10), object(1)
memory usage: 6.2+ MB

data.isnull().sum()

loan_id                                                0
is_first_loan                                          0
fully_repaid_previous_loans                        25789
currently_repaying_other_loans                     25789
total_credit_card_limit                                0
avg_percentage_credit_card_limit_used_last_year      903
saving_amount                                          0
checking_amount                                        0
is_employed                                            0
yearly_salary                                          0
age                                                    0
dependent_number                                       0
loan_purpose                                           0
date                                                   0
loan_granted                                           0
loan_repaid                                            0
dtype: int64

#Data quality check
data[(data.is_first_loan == 1 ) & (data.fully_repaid_previous_loans == 1.0)]

	loan_id	is_first_loan	fully_repaid_previous_loans	currently_repaying_other_loans	total_credit_card_limit	avg_percentage_credit_card_limit_used_last_year	saving_amount	checking_amount	is_employed	yearly_salary	age	dependent_number	loan_purpose	date	loan_granted	loan_repaid

data[(data.is_first_loan == 1 ) & (data.currently_repaying_other_loans == 1.0)]

	loan_id	is_first_loan	fully_repaid_previous_loans	currently_repaying_other_loans	total_credit_card_limit	avg_percentage_credit_card_limit_used_last_year	saving_amount	checking_amount	is_employed	yearly_salary	age	dependent_number	loan_purpose	date	loan_granted	loan_repaid

A simple initial data investigation can tell us how many loans were granted and how many were actually repaid

Total 47654 Loan were granted
Out of which 64% of the loans were repaid and 36% were unpaid.

tot_loans = len(data)
tot_loans_rep = len(data[data['loan_repaid'] == 1.0])
tot_loan_norep = len(data[data['loan_repaid'] == 0.0])
print('Total Loans Granted: {}'.format(tot_loans))
print('Total Loans Repaid: {}'.format(tot_loans_rep))
print('Total Loans Not Repaid: {}'.format(tot_loan_norep))

Total Loans Granted: 47654
Total Loans Repaid: 30706
Total Loans Not Repaid: 16948

Data Exploration(EDA)

We will begin by extracting some date features to better understand our data.

# parse date information and extract month, week, and dayofweek information
data['month'] = data['date'].apply(lambda x: x.month)
data['week'] = data['date'].apply(lambda x: x.week)
data['dayofweek'] = data['date'].apply(lambda x: x.dayofweek)

<ipython-input-247-a568e6d2ef8a>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['month'] = data['date'].apply(lambda x: x.month)
<ipython-input-247-a568e6d2ef8a>:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['week'] = data['date'].apply(lambda x: x.week)
<ipython-input-247-a568e6d2ef8a>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['dayofweek'] = data['date'].apply(lambda x: x.dayofweek)

Loan Repayment Rate by Loan Purpose

Results:

A lot of the borrowers apply for a loan when purchasing a house, or for business and investment purposes.
Business and Investment have highest re-payment rate whereas Emergency funds have the lowest rate.

#Visualization of loan purpose
sns.set(style="whitegrid", color_codes=True)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
sns.countplot(x='loan_purpose', data=data, ax=ax[0])
ax[0].set_xlabel('Loan Purpose', fontsize=12)
ax[0].set_ylabel('Count', fontsize=12)
ax[0].set_title('Count Plot of Loan Purpose', fontsize=16)

sns.barplot(x='loan_purpose', y='loan_repaid', data=data, ax=ax[1])
ax[1].set_xlabel('Loan Purpose', fontsize=12)
ax[1].set_ylabel('Loan Repaid Ratio', fontsize=12)
ax[1].set_title('Loan Repaid Ratio by Loan Purpose', fontsize=16)
plt.tight_layout()
plt.show()

png

Loan Repayment Rate by Number of dependents

Results:

My inital assumption was higher the number of dependents, lower the repayment rate. More dependents might lead to more expenses and lesser savings to repay the loan. This assumption was validated by the data.
Loan Repayment rate is highest when number of dependents is 0 or 1 and it is inversely proportional to number of dependents.
Just over 2.5% loans are granted to people with zero dependents but more than 17.5% are granted to people with 8 dependents.

#Dependent number

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(16, 6))
sns.barplot(x="dependent_number", y="dependent_number", data=data, estimator=lambda x: len(x) / len(data) * 100,ax = ax)
ax.set_xlabel('dependent_number', fontsize=12)
ax.set_ylabel('Count Percentage', fontsize=12)
ax.set_title('Count Plot of dependents and Repayment Rate', fontsize=16)
ax2 = ax.twinx()
sns.lineplot(x='dependent_number', y='loan_repaid', data=data, ax=ax2)
ax2.grid(False)
plt.tight_layout()
plt.show()

png

Loan Repayment Rate by Age

Results:

A lot of the borrowers applying for the loan are between 30-55 years of age.
Loan Repayment rate fluctuates with age.

#Age
grouped_age = data[['age', 'loan_repaid']].groupby('age')
grouped_age
mean_age = grouped_age.mean().reset_index()
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))
hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
sns.distplot(data[data['loan_repaid'] == 0]['age'], 
             label='loan_repaid = 0', ax=ax[0], hist_kws=hist_kws)
sns.distplot(data[data['loan_repaid'] == 1]['age'], 
             label='loan_repaid = 1', ax=ax[0], hist_kws=hist_kws)
ax[0].set_title('Histogram of Age', fontsize=16)
ax[0].legend()
ax[1].plot(mean_age['age'], mean_age['loan_repaid'], '.-')
ax[1].set_title('Mean Repaid Rate by Age', fontsize=16)
ax[1].set_xlabel('Age')
ax[1].set_ylabel('Mean Repaid Rate')
ax[1].grid(True)
plt.show()

png

Loan Repayment Rate by Employment status

Results:

A lot of the borrowers who were granted the loan are employed and are able to repay the loan.
Repayment Rate among the ones who are unemployed is just 18%.

len(data[(data.is_employed == 0) & (data.loan_repaid == 1)])/len(data[data.is_employed == 0])

0.17942750756341633

#is_employed
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))
sns.countplot(x='is_employed', data=data, ax=ax[0])
ax[0].set_xlabel('is_employed', fontsize=12)
ax[0].set_ylabel('Count', fontsize=12)
ax[0].set_title('Count Plot of is_employed', fontsize=16)

sns.barplot(x='is_employed', y='loan_repaid', data=data, ax=ax[1])
ax[1].set_xlabel('is_employed', fontsize=12)
ax[1].set_ylabel('Loan Repaid Ratio', fontsize=12)
ax[1].set_title('Loan Repaid Ratio vs. is_employed', fontsize=16)
plt.tight_layout()
plt.show()

png

Loan Repayment Rate among first time loans¶

Results:

A lot of the borrowers who were granted a loan are first time borrowers.
More than 50% loans are granted to first-time borrowers who have a repayment rate of around 65%.
Repayment Rate is more than 60% for both first and non-first timers.

#is_first_loan


fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
sns.barplot(x="is_first_loan", y="is_first_loan", data=data, estimator=lambda x: len(x) / len(data) * 100,ax = ax)
ax.set_xlabel('is_first_loan', fontsize=12)
ax.set_ylabel('Count Percentage', fontsize=12)
ax.set_title('Count Plot and Repaid rate by first time borrowers', fontsize=16)
ax2 = ax.twinx()
sns.lineplot(x='is_first_loan', y='loan_repaid', data=data, color = "black",lw = 2,markers=True,ax=ax2)
ax2.grid(False)
plt.tight_layout()
plt.show()

png

Loan Repayment Rate by Previous Loan History

Results:

Not a lot of loans were granted to people who defaulted on their previous loans.
Surprisingly, Repayment rate is above 55% for those who did not pay their previous loans.

#fully_repaid_previous_loans
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))
sns.countplot(x='fully_repaid_previous_loans', data=data, ax=ax[0])
ax[0].set_xlabel('fully_repaid_previous_loans', fontsize=12)
ax[0].set_ylabel('Count', fontsize=12)
ax[0].set_title('Count Plot of fully_repaid_previous_loans', fontsize=16)

sns.barplot(x='fully_repaid_previous_loans', y='loan_repaid', data=data, ax=ax[1])
ax[1].set_xlabel('fully_repaid_previous_loans', fontsize=12)
ax[1].set_ylabel('Loan Repaid Ratio', fontsize=12)
ax[1].set_title('Loan Repaid Ratio vs. fully_repaid_previous_loans', fontsize=16)
plt.tight_layout()
plt.show()

png

Loan Repayment Rate by other loans

Results:

Around 14% loans were granted to borrowers who are currently paying other loans.
Repayment Rate is less than 30% for those who are paying other loans.

#currently_repaying_other_loans
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))
sns.barplot(x='currently_repaying_other_loans', y ='currently_repaying_other_loans',estimator=lambda x: len(x) / len(data) * 100, data=data, ax=ax[0])
ax[0].set_xlabel('currently_repaying_other_loans', fontsize=12)
ax[0].set_ylabel('Count Percent', fontsize=12)
ax[0].set_title('Count Plot of currently_repaying_other_loans', fontsize=16)

sns.barplot(x='currently_repaying_other_loans', y='loan_repaid', data=data, ax=ax[1])
ax[1].set_xlabel('currently_repaying_other_loans', fontsize=12)
ax[1].set_ylabel('Loan Repaid Ratio', fontsize=12)
ax[1].set_title('Loan Repaid Ratio vs. currently_repaying_other_loans', fontsize=16)
plt.tight_layout()
plt.show()

png

Loan Repayment Rate by Credit Card Limit

Results:

Borrowers who didn’t repay loan have credit card limit between $2000 to $6000, whereas the borrowers who did repay their loans had higher credit card limit ranging from $2000 to $10,000. These borrowers must be the ones with good credit scores.
Repayment rate increases as credit card limit increases. People approved for high credit limit must be the ones with good credit history so lesser chance of delinquency.

# Visualization of 'total_credit_card_limit'
grouped = data[['total_credit_card_limit', 'loan_repaid']].groupby('total_credit_card_limit')
grouped
mean = grouped.mean().reset_index()
mean
hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
sns.distplot(data[data['loan_repaid'] == 0]['total_credit_card_limit'], 
             label='loan_repaid = 0', ax=ax[0], hist_kws=hist_kws)
sns.distplot(data[data['loan_repaid'] == 1]['total_credit_card_limit'], 
             label='loan_repaid = 1', ax=ax[0], hist_kws=hist_kws)
ax[0].set_title('Histogram of total_credit_card_limit', fontsize=16)
ax[0].legend()
ax[1].plot(mean['total_credit_card_limit'], mean['loan_repaid'], '.-')
ax[1].set_title('Mean Repaid Rate vs. total_credit_card_limit', fontsize=16)
ax[1].set_xlabel('total_credit_card_limit')
ax[1].set_ylabel('Mean Repaid Rate')
ax[1].grid(True)
plt.show()

png

Loan Repayment Rate by Saving and Checking Amount

Checking and Savings amount are important factors when deciding a borrower’s capacity to repay a loan.
Borrowers who have lesser than ‘2000’ in savings and 4000 in checking are the ones who are more likely to default.
Borrowers who have higher than 2000 in savings and anywhere from 2000 to 10,000 dollars in checking are the ones who have high repayment rate.

# Visualization of 'saving_amount'
grouped = data[['saving_amount', 'loan_repaid']].groupby('saving_amount')
grouped
mean = grouped.mean().reset_index()
mean
hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
sns.distplot(data[data['loan_repaid'] == 0]['saving_amount'], 
             label='loan_repaid = 0', ax=ax[0], hist_kws=hist_kws)
sns.distplot(data[data['loan_repaid'] == 1]['saving_amount'], 
             label='loan_repaid = 1', ax=ax[0], hist_kws=hist_kws)
ax[0].set_title('Histogram of saving_amount', fontsize=16)
ax[0].legend()
ax[1].plot(mean['saving_amount'], mean['loan_repaid'], '.-')
ax[1].set_title('Mean Repaid Rate vs. saving_amount', fontsize=16)
ax[1].set_xlabel('saving_amount')
ax[1].set_ylabel('Mean Repaid Rate')
ax[1].grid(True)
plt.show()

png

# Visualization of 'checking_amount'
grouped = data[['checking_amount', 'loan_repaid']].groupby('checking_amount')
grouped
mean = grouped.mean().reset_index()
mean
hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
sns.distplot(data[data['loan_repaid'] == 0]['checking_amount'], 
             label='loan_repaid = 0', ax=ax[0], hist_kws=hist_kws)
sns.distplot(data[data['loan_repaid'] == 1]['checking_amount'], 
             label='loan_repaid = 1', ax=ax[0], hist_kws=hist_kws)
ax[0].set_title('Histogram of checking_amount', fontsize=16)
ax[0].legend()
sns.lineplot(x= mean['checking_amount'], y= mean['loan_repaid'],ax = ax[1])
ax[1].set_title('Mean Repaid Rate vs. checking_amount', fontsize=16)
ax[1].set_xlabel('checking_amount')
ax[1].set_ylabel('Mean Repaid Rate')
ax[1].grid(True)
plt.show()

png

Loan Repayment Rate by Yearly Salary

Results:

Borrowers who are unable to repay a loan earn less than 40,000 dollars a year
Borrowers who earn more than 40,000 dollars a year have a repayment rate of more than 80%.

# Visualization of 'yearly_salry'
grouped = data[['yearly_salary', 'loan_repaid']].groupby('yearly_salary')
grouped
mean = grouped.mean().reset_index()
mean
hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
sns.distplot(data[data['loan_repaid'] == 0]['yearly_salary'], 
             label='loan_repaid = 0', ax=ax[0], hist_kws=hist_kws)
sns.distplot(data[data['loan_repaid'] == 1]['yearly_salary'], 
             label='loan_repaid = 1', ax=ax[0], hist_kws=hist_kws)
ax[0].set_title('Histogram of yearly_salary', fontsize=16)
ax[0].legend()
ax[1].plot(mean['yearly_salary'], mean['loan_repaid'], '.-')
ax[1].set_title('Mean Repaid Rate vs. yearly_salary', fontsize=16)
ax[1].set_xlabel('yearly_salary')
ax[1].set_ylabel('Mean Repaid Rate')
ax[1].grid(True)
plt.show()

png

Loan Repayment by Last Year’s Credit Card Usage

Results:

Borrowers unable to repay a loan have higher credit card usage than the ones who repay the loans.
As the credit card usage increases, ability to repay a loan decreases.

not_null = data[~data['avg_percentage_credit_card_limit_used_last_year'].isnull()]

#avg_percentage_credit_card_limit_used_last_year

grouped = not_null[['avg_percentage_credit_card_limit_used_last_year', 'loan_repaid']].groupby('avg_percentage_credit_card_limit_used_last_year')
grouped
mean = grouped.mean().reset_index()

hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
sns.distplot(not_null[not_null['loan_repaid'] == 0]['avg_percentage_credit_card_limit_used_last_year'], 
             label='loan_repaid = 0', ax=ax[0], hist_kws=hist_kws)
sns.distplot(data[data['loan_repaid'] == 1]['avg_percentage_credit_card_limit_used_last_year'], 
             label='loan_repaid = 1', ax=ax[0], hist_kws=hist_kws)
ax[0].set_title('Credit Card Usage Last Year', fontsize=16)
ax[0].legend()
ax[1].plot(mean['avg_percentage_credit_card_limit_used_last_year'], mean['loan_repaid'], '.-')
ax[1].set_title('Mean Repaid Rate by Credit Card Use', fontsize=16)
ax[1].set_xlabel('avg_percentage_credit_card_limit_used_last_year')
ax[1].set_ylabel('Mean Repaid Rate')
ax[1].grid(True)
plt.show()

png

Repayment By Month

Results:

A lot of loans are granted at the beginning of the year, in May and in October.
Repayment rate remains almost the same across all the months.

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))
sns.countplot(x='month', data=data, ax=ax[0])
ax[0].set_xlabel('month', fontsize=12)
ax[0].set_ylabel('Count', fontsize=12)
ax[0].set_title('Count Of loans by month', fontsize=16)

sns.barplot(x='month', y='loan_repaid', data=data, ax=ax[1])
ax[1].set_xlabel('month', fontsize=12)
ax[1].set_ylabel('Loan Repaid Ratio', fontsize=12)
ax[1].set_title('Loan Repaid Ratio by Month', fontsize=16)
plt.tight_layout()
plt.show()

png

We can even look at the relationship of various variables together such as loan purpose by is_employed and loan_repaid. A lot of the borrowers who are unemployed apply for emergency loans.

sns.set_style('ticks')
sns.set(font_scale = 2)
sns.catplot("is_employed", col='loan_purpose', data=data, hue='loan_repaid', kind="count", col_wrap=5)

<seaborn.axisgrid.FacetGrid at 0x12980c4f0>

png

Feature Engineering

Avg_percentage_credit_card_limit

Avg_percentage_credit_card_limit_used_last_year is left-skewed which indicates that mean is lesser than the median so we will use median to impute the values

sns.set(font_scale = 1)
sns.distplot(data['avg_percentage_credit_card_limit_used_last_year'])

<matplotlib.axes._subplots.AxesSubplot at 0x12cddb6d0>

png

median_cc = data['avg_percentage_credit_card_limit_used_last_year'].median()
data = data.fillna({'avg_percentage_credit_card_limit_used_last_year':median_cc})
data.avg_percentage_credit_card_limit_used_last_year.isnull().sum()

#Rename avg_percentage_credit_card_limit_used_last_year to credit use
data.rename(columns={'avg_percentage_credit_card_limit_used_last_year':'credit_limit_used'},inplace=True)

is_first_loan, currently_repaying_loans, fully_repaid_previous_loans

We will create a new variable based on three columns: is_first_loan, currently_repaying_loans and fully_repaid_previous_loans :
Class 1: Not Repaid previous loan and currently paying other loans
Class 2: Paid previous loans and currently paying other loans
Class 3: First Loan
Class 4: Not paid previous loan and not paying other loans
Class 5: Repaid previous loan and not paying other loans

def new_variable(df):
    if((df['is_first_loan'] == 0) & (df['fully_repaid_previous_loans'] == 0) & (df['currently_repaying_other_loans'] == 1)):
        val = 1
    elif((df['is_first_loan'] == 0) & (df['fully_repaid_previous_loans'] == 1) & (df['currently_repaying_other_loans'] == 1)):
        val = 2
    elif((df['is_first_loan'] == 0) & (df['fully_repaid_previous_loans'] == 1) & (df['currently_repaying_other_loans'] == 0)):
        val = 5
    elif((df['is_first_loan'] == 0) & (df['fully_repaid_previous_loans'] == 0) & (df['currently_repaying_other_loans'] == 0)):
        val = 4
    else:
        val = 3
    return(pd.Series(val))
    
        
        
        

res = data.apply(new_variable,axis = 1)

data['prev_loan_status'] = res

data.loc[data.is_first_loan == 0,['is_first_loan','fully_repaid_previous_loans','currently_repaying_other_loans','prev_loan_status']]

	is_first_loan	fully_repaid_previous_loans	currently_repaying_other_loans	prev_loan_status
8	0	1.0	0.0	5
9	0	1.0	0.0	5
13	0	1.0	0.0	5
17	0	1.0	0.0	5
40	0	1.0	0.0	5
...	...	...	...	...
101081	0	1.0	0.0	5
101083	0	1.0	1.0	2
101088	0	0.0	0.0	4
101090	0	1.0	0.0	5
101098	0	1.0	0.0	5

21865 rows × 4 columns

group_loan = data[['prev_loan_status','loan_repaid']].groupby('prev_loan_status').agg({'count','mean'}).reset_index()
group_loan.columns = group_loan.columns.droplevel(0)

group_loan.columns = ['prev_loan_status','total_count','repaid_rate']

group_loan['cnt_pct'] = 100* group_loan['total_count']/group_loan['total_count'].sum()
group_loan

	prev_loan_status	total_count	repaid_rate	cnt_pct
0	1	682	0.218475	1.431150
1	2	5828	0.277454	12.229823
2	3	25789	0.650006	54.117178
3	4	1450	0.751724	3.042767
4	5	13905	0.797339	29.179083

Class 5(previous loan fully paid and not paying other loans) has the highest repayment rate of 79%.
54% of the loans granted are First time loans and have repayment rate of 65%.

yearly salary and employment correlated

Unemployed borrowers will not have a salary to report so we will remove one of them

print(data[data['is_employed'] == 0]['yearly_salary'].unique())

[0]

One hot encoding for loan purpose

K unique values, only need K-1 features, so remove one redundant feature

data_cp = pd.get_dummies(data)

Remove redundant columns

X_new = data_cp.drop(columns = ['loan_id','date','loan_granted','is_first_loan','fully_repaid_previous_loans','currently_repaying_other_loans','is_employed','loan_purpose_other'],axis = 1)
X_new.head()

	total_credit_card_limit	credit_limit_used	saving_amount	checking_amount	yearly_salary	age	dependent_number	loan_repaid	month	week	dayofweek	prev_loan_status	loan_purpose_business	loan_purpose_emergency_funds
2	6900	0.82	2085	3422	24500	38	8	1.0	7	29	0	3	0	0
5	6100	0.53	6163	5298	29500	24	1	1.0	11	45	2	3	0	0
7	4000	0.57	602	2757	31700	36	8	0.0	9	38	2	3	1	0
8	7000	0.52	2575	2917	58900	33	3	1.0	12	49	1	5	0	1
9	4300	0.83	722	892	5400	32	7	1.0	1	3	4	5	1	0

We have 16 features for model building

X_new.shape

(47654, 16)

LightGBM Model

LightGBM is a distributed and efficient gradient boosting framework that uses tree-based learning. It’s histogram-based and places continuous values into discrete bins, which leads to faster training and more efficient memory usage. In this piece, we’ll explore LightGBM in depth.
LightGBM Advantages

Faster training speed and higher efficiency
Lower memory usage
Better accuracy
Support of parallel and GPU learning
Capable of handling large-scale data
Is not affected by outliers and correlated variables
Can handle categorical data

Train-Test Split

#Train test split
Y = X_new['loan_repaid']
Y.head()

  1.0
  1.0
  0.0
  1.0
  1.0
Name: loan_repaid, dtype: float64

X = X_new.drop(columns= ['loan_repaid'],axis = 1)

X.shape

(47654, 15)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25,random_state=42)

len(X_train),len(X_test),len(y_train),len(y_test)

(35740, 11914, 35740, 11914)

print(y_train.mean(),y_test.mean())

0.6450475657526581 0.6422695987913379

Prepare GBM Dataset and Tune Parameters

#create LightGBM dataset
d_train = lgb.Dataset(data=X_train, label=y_train,free_raw_data=False)
d_test = lgb.Dataset(data=X_test,label=y_test)

# Cross validation
#Using Default Values
params = {'learning_rate': 0.01, 
          'boosting_type': 'gbdt', 
          'objective': 'binary', 
          'metric': ['binary_logloss', 'auc'], 
          'sub_feature':0.8, 
          'num_leaves': 31, 
          'min_data': 20, 
          'max_depth': 15, 
          'is_unbalance': True}

history = lgb.cv(params, train_set=d_train, num_boost_round=1000, nfold=5, 
                 early_stopping_rounds=10, seed=42, verbose_eval=False)

print('Best rounds:\t', len(history['binary_logloss-mean']))

Best rounds:	 647

Prediction and Evaluation

# re-train the model and make predictions
clf = lgb.train(params, train_set=d_train, num_boost_round=647)

pred = clf.predict(X_test)

roc_auc = roc_auc_score(y_test,pred)

pred_int = (pred >=0.5).astype(int)
accuracy = accuracy_score(pred_int,y_test)

print('ROC score:{}, Accuracy at 0.5 threshold: {}'.format(roc_auc,accuracy))

ROC score:0.9774140994352406, Accuracy at 0.5 threshold: 0.9204297465167031

Feature Importance

Savings Amount, Yearly Salary, Total Credit Card Usage are among the top features that can help determine a borrower’s ability to repay a loan.

# feature importance
features = clf.feature_name()
importance = clf.feature_importance()

fig, ax = plt.subplots(figsize=(10, 8))
lgb.plot_importance(clf, ax=ax, height=0.5)
plt.show()

png

Bank Profit Calculation

We will assume if the loan is repaid then bank earns a $1 and if the loan is not repaid then a loss of -1.

Bank profit by old model

#Total Profit from original Bank model

loan_paid = y_test.astype(int).values

bank_profit = np.sum(loan_paid * 2 - 1)
print('Bank profit:\t', bank_profit)

Bank profit:	 3390

Bank profit by New Model at different thresholds

We will calculate the profit at different thresholds and choose the one where we get maximum profit

# Calculate profit according to different thresholds
def calc_threshold(loan_paid,probability,threshold):
    loan_granted = (probability > threshold).astype(int)
    profit = 0
    for i in range(len(loan_paid)):
        if loan_granted[i] == 1:
            if loan_paid[i] == 0:
                profit -= 1
            else:
                profit += 1
                
    return profit
        
            
    

# calculate the profit according to given threshold
thresholds = list(np.linspace(0, 1, 100))
profits = []

for threshold in thresholds:
    profits.append(calc_threshold(loan_paid, pred, threshold))

thresh_prof = pd.DataFrame({'threshold':thresholds,'profit':profits})
thresh_prof.loc[thresh_prof.profit.argmax(),:]

threshold       0.393939
profit       6756.000000
Name: 39, dtype: float64

With the new model, bank can make a profit of 6756 dollars which is double of the bank’s earlier profit of 3390.

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(thresholds, profits, label='New Model')
ax.plot(thresholds, [bank_profit] * len(thresholds), label='Bank Model')
ax.set_xlabel('Threshold', fontsize=12)
ax.set_ylabel('Profit', fontsize=12)
ax.set_title('Profit Comparison')
ax.legend(fontsize=12)
plt.tight_layout()
plt.show()

png

fpr,tpr,threshold = roc_curve(loan_paid, pred)
print("Area under the Receiver Operating Characteristic Curves (ROC AUC) is %f " 
      %(roc_auc_score(loan_paid, pred)))

Area under the Receiver Operating Characteristic Curves (ROC AUC) is 0.977414

plt.figure()
plt.plot([0,1],[0,1], linestyle = '--', label = 'random guessing')
plt.plot(fpr, tpr, label = "ROC AUC" )
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc='lower right')

<matplotlib.legend.Legend at 0x137690a30>

png

Classification Report at 0.5 threshold

False Positives: If the model predicts that a loan will be repaid and the borrower defaults, then bank will lose money. We would like to keep our false positives low.

False Negatives: If the model predicts that a loan will not be repaid but the borrower does repay, the bank might lose a potential customer by not granting a loan. In this case loss is going to be 0.

More than False negatives, we should focus on high precision and keeping the false positives low.

print(classification_report(y_true = y_test, y_pred=pred_int))

              precision    recall  f1-score   support

         0.0       0.85      0.95      0.89      4262
         1.0       0.97      0.91      0.94      7652

    accuracy                           0.92     11914
   macro avg       0.91      0.93      0.92     11914
weighted avg       0.93      0.92      0.92     11914

cm = confusion_matrix(y_test, pred_int)

# view with a heatmap
sns.heatmap(cm, annot=True, annot_kws={"size":30}, cmap='Blues', square=True, fmt='.0f')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix')

Text(0.5, 1.0, 'Confusion Matrix')

png

Precision and Recall at 0.4 threshold

We can see that the true positive predictive power i.e model predicts that a borrower will reapy and borrower did repay increased at the new threshold.

pred_int_new = (pred >= 0.39).astype(int)

accuracy_score(y_test,pred_int_new)

0.9247104247104247

print(classification_report(y_true = y_test, y_pred=pred_int_new))

              precision    recall  f1-score   support

         0.0       0.87      0.92      0.90      4262
         1.0       0.96      0.93      0.94      7652

    accuracy                           0.92     11914
   macro avg       0.91      0.92      0.92     11914
weighted avg       0.93      0.92      0.93     11914

cm = confusion_matrix(y_test, pred_int_new)

# view with a heatmap
sns.heatmap(cm, annot=True, annot_kws={"size":30}, cmap='Blues', square=True, fmt='.0f')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix at new threshold')

Text(0.5, 1.0, 'Confusion Matrix at new threshold')

png

Strategy for Denied Application

My model is able to predict 7083 loans will be repaid and they did. The true positive predictive accuracy is 91.1 %. Thus, if we grant a loan to a previously denied application then the bank can make a lot of profit.

pos_ind = (pred_int_new == 1)

tpr_new = np.sum(y_test[pos_ind] == 1) # number of loan repaid
print("number of cases that model predicts repay and repaid: {}".format(tpr_new))

fpr_new = np.sum(y_test[pos_ind] == 0) # number of loan defaulted
print("number of cases that model predicts repay but defaulted: {}".format(fpr_new))

tot = (np.sum(y_test[pos_ind] == 1) - np.sum(y_test[pos_ind] == 0))/sum(pos_ind)
print("score to gain when granting a loan: {}".format(tot))

number of cases that model predicts repay and repaid: 7083
number of cases that model predicts repay but defaulted: 328
score to gain when granting a loan: 0.9114829307785723

Impact Analysis of “is_employed”

Results:

There are 1047 unemployed borrowers and 10,867 employed borrowers in the dataset
If an applicant is unemployed, then our model predicts 14% of the loans will be repaid and grants them the loan. If an applicant is employed, then model predicts 67% of the loans will be repaid and grants them the loan.
Among the ones who are unemployed and model predicts to grant the loan, 94.4% repay the loans.
Among the ones who are employed and model predicts to grant the loan, 95.5% do repay the loans.

unemployed = len(X_test[X_test['yearly_salary'] == 0])
not_emp = len(X_test) - unemployed
print('Number of unemployed borrowers in the test set : {}\nNumber of employed borrowers : {}'.format(unemployed,not_emp))

Number of unemployed borrowers in the test set : 1047
Number of employed borrowers : 10867

loan_grant_unemp = pred_int_new[X_test['yearly_salary'] == 0].mean()
loan_grant_emp = pred_int_new[X_test['yearly_salary'] != 0].mean()
print('Loan Grant Rate for unemployed by the model : {}\nLoan Grant Rate for employed by the model:{}'.format(loan_grant_unemp,loan_grant_emp))

Loan Grant Rate for unemployed by the model : 0.13753581661891118
Loan Grant Rate for employed by the model:0.6687218183491304

#Repay rate for unemployed
unemployed_granted_ind = np.logical_and(X_test['yearly_salary']==0,pred_int_new==1)
unemp_repay = y_test[unemployed_granted_ind].mean()

0.9444444444444444

#Repay rate for employed
employed_granted_ind = np.logical_and(X_test['yearly_salary']!=0,pred_int_new==1)
emp_repay = y_test[employed_granted_ind].mean()
emp_repay

0.9559653226916196

Savings Amount

pred_int_new[X_test['saving_amount'] < 4000].mean()

0.5704473910140228

pred_int_new[X_test['saving_amount'] >= 4000].mean()

1.0

Recommendations:

According to the feature importance plot, the most important feature is ‘savings amount’. This makes perfect sense because if a person has money in the savings account, it is more likely that he/she will be able to repay the loan. Given the data, our model predicts 100% repayment if savings amount exceeds 4000 dollars
Borrower’s employment status is very important. If a person is employed it is highly likely that he/she will be able to repay the loan. Target customers who are employed and have a yearly salary of atleast 60,000. This will reduce the probability of default.
Number of dependents does affect the repayment rate to some extent. Bank should modify it’s strategy to reduce giving loans to people with 8 dependents and give more loans to people with less dependents
Borrowers who have high credit card limit could mean they have good financial standing to be approved for such limits. These users have higher chance of repaying the loans. Hence, they could be granted a loan.
The gradient boosted model developed should be implemented and loans should be given to the ones predicted to have high repayment probability. This model has 92% positive predictive power and provides double the profit the bank is currently making.