Machine Learning on Aggregated Fannie's Data

First, let's import all the dependencies.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn
import re
import datetime
import sklearn as sk
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, SGDClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cross_validation import train_test_split, KFold, StratifiedShuffleSplit
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
pd.options.display.max_columns=200
matplotlib.rcParams['figure.figsize'] = (15.0, 12.0)
In [2]:
summaryfile_type = {'Unnamed':int,
                    'Unnamed2':int,
                    'LOAN_ID':object,
                    'ORIG_CHN':object,
                    'Seller.Name':object,
                    'ORIG_RT':float,
                    'ORIG_AMT':float,
                    'ORIG_TRM':float,
                    'ORIG_DTE':object,
                    'FRST_DTE':object,
                    'OLTV':float,
                    'OCLTV':float,
                    'NUM_BO':float,
                    'DTI':float,
                    'CSCORE_B':float,
                    'FTHB_FLG':object,
                    'PURPOSE':object,
                    'PROP_TYP':object,
                    'NUM_UNIT':int,
                    'OCC_STAT':object,
                    'STATE':object,
                    'ZIP_3':object,
                    'MI_PCT':float,
                    'Product.Type':object,
                    'CSCORE_C':float,
                    'Monthly.Rpt.Prd':object,
                    'Servicer.Name':object,
                    'LAST_RT':float,
                    'LAST_UPB':float,
                    'Loan.Age':float,
                    'Months.To.Legal.Mat':float,
                    'Adj.Month.To.Mat':float,
                    'Maturity.Date':object,
                    'MSA':int,
                    'Delq.Status':object,
                    'MOD_FLAG':object,
                    'Zero.Bal.Code':float,
                    'ZB_DTE':object,
                    'LPI_DTE':object,
                    'FCC_DTE':object,
                    'DISP_DT':object
}
In [3]:
fannie1 = pd.read_csv('./processed/total/total_2012.csv', dtype=summaryfile_type, nrows=9000000)

Pre-process the data

First, we replace the column name having '.' to underline just because we are using Python. Then, we drop rows that have NA in either OLTV, OCLTV, DTI or CSCORE_B columns. Also, all loan with "current" status should be considered as tempororarily OK, as their final status is unknown. Because of that, for our machine learning, we just use a subset of the data containing those loan with status "fully paid" and "default".

In [4]:
fannie1.drop(fannie1.columns[:1], axis=1, inplace=True)
fannie1.rename(columns=lambda x: re.sub('[.]','_',x), inplace=True)
fannie1.head()
Out[4]:
LOAN_ID ORIG_CHN Seller_Name ORIG_RT ORIG_AMT ORIG_TRM ORIG_DTE FRST_DTE OLTV OCLTV NUM_BO DTI CSCORE_B FTHB_FLG PURPOSE PROP_TYP NUM_UNIT OCC_STAT STATE ZIP_3 MI_PCT Product_Type CSCORE_C Monthly_Rpt_Prd Servicer_Name LAST_RT LAST_UPB Loan_Age Months_To_Legal_Mat Adj_Month_To_Mat Maturity_Date MSA Delq_Status MOD_FLAG Zero_Bal_Code ZB_DTE LPI_DTE FCC_DTE DISP_DT
0 100001726155 B STEARNS LENDING, LLC 3.500 278000.0 360.0 12/2012 02/2013 77.0 77.0 1.0 36.0 802.0 N R SF 1 P CA 949 NaN FRM NaN 09/01/2015 OTHER 3.500 258940.15 32.0 328.0 320.0 01/2043 42220 0 N 0.0 NaN NaN NaN NaN
1 100003164216 R OTHER 3.625 478000.0 360.0 11/2012 01/2013 68.0 68.0 2.0 25.0 763.0 N R PU 1 P CA 913 NaN FRM 742.0 09/01/2015 OTHER 3.625 452501.57 33.0 327.0 327.0 12/2042 31080 0 N 0.0 NaN NaN NaN NaN
2 100008218145 C FLAGSTAR CAPITAL MARKETS CORPORATION 3.625 403000.0 360.0 09/2012 11/2012 54.0 61.0 2.0 33.0 749.0 N R SF 1 P WA 981 NaN FRM 747.0 08/01/2014 MATRIX FINANCIAL SERVICES CORPORATION 3.625 389568.37 22.0 338.0 0.0 10/2042 42660 X N 1.0 08/2014 NaN NaN NaN
3 100009252313 C WELLS FARGO BANK, N.A. 3.625 626000.0 360.0 09/2012 11/2012 78.0 78.0 1.0 24.0 800.0 N R SF 1 P VA 220 NaN FRM NaN 09/01/2015 WELLS FARGO BANK, N.A. 3.625 590002.26 35.0 325.0 325.0 10/2042 47900 0 N 0.0 NaN NaN NaN NaN
4 100009333533 C JPMORGAN CHASE BANK, NATIONAL ASSOCIATION 3.625 130000.0 360.0 10/2012 12/2012 80.0 80.0 2.0 41.0 758.0 Y P SF 1 P IL 601 NaN FRM 749.0 09/01/2015 JP MORGAN CHASE BANK, NA 3.625 122262.89 34.0 326.0 322.0 11/2042 16980 0 N 0.0 NaN NaN NaN NaN
In [41]:
fannie1_filtered = fannie1.dropna(subset=('OLTV', 'OCLTV', 'DTI', 'CSCORE_B'))
In [42]:
fannie1_known = fannie1_filtered[fannie1_filtered['Zero_Bal_Code'] > 0]

Normalize the feature by states

As we already see in the exploratory data analysis, loan performance should consider within each geographycal region. For a proof-of-concept perspective, we use state to group our dataset. Those continuous variables (e.g. loan amount, loan-to-value ratio and debt-to-income ratio) are normalized by using their mean and standard deviation by each states.

In the future, a more refined geo-location grouping will be used, as the dataset does contain the first three digits of the zip code.

Feature extractions

All those classes with Extract... names are Scikit Learn Transformer to extract features from each record. Those features are further combined via Scikit-Learn's feature-union. Keep in mind that in order to successfully conduct the feature-union, you have to make sure the return value of your transformer is "column-like". At the end of each transformer functions of each class, the reshape line is there for that.

In [43]:
state_mean = fannie1_known.groupby('STATE')[('ORIG_AMT', 'OCLTV', 'DTI')].mean()
state_std = fannie1_known.groupby('STATE')[('ORIG_AMT', 'OCLTV', 'DTI')].std()
In [8]:
class ExtractOrigYear(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self):
        self.int = np.vectorize(int)
        pass
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, x):
        return self.int(x['ORIG_DTE'].apply(lambda x: x.split('/')[-1])).reshape(-1,1)

class extractfeatures(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self, column='OCLTV'):
        '''
        We use the colname as the selection rule to judge the 
        '''
        self.column = column
        
    
    def fit(self, x, y):
        return self
    
    def transform(self, x):
        return x.loc[:,self.column].values.reshape(-1,1)
    
class ExtractLoanStatus(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self):
        '''
        Initialize the class with bisection of the loan status: Default or Healthy
        '''
        pass
    
    def fit(self, x):
        return self
    
    def transform(self, x):
        '''
        Transform the loan status to a tertiary status: Healthy (0), Default (1) 
        '''
        status = x['Zero_Bal_Code'].apply(lambda x: 0 if x <=1 else 1)
        return status
    
class ExtractCreditScore(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self, is_take_minimum = True):
        self.take_minimum = is_take_minimum
        self.where = np.where
        pass
    
    def fit(self, x, y):
        return self
    
    def transform(self, x):
        result = self.where((x['CSCORE_B'] - x['CSCORE_C'] > 0), x['CSCORE_C'], x['CSCORE_B'])
        return result.reshape(-1,1)
    
class ExtractCategory(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self, colname):
        self.colname = colname
        self.transformer = LabelEncoder()
        pass
    
    def fit(self, x, y):
        self.transformer.fit(x[self.colname])
        return self
    
    def transform(self, x):
        return self.transformer.transform(x[self.colname]).reshape(-1,1)
    
class ExtractNormalized(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self, groupby, target, total_mean=state_mean, total_std=state_std):
        self.groupby = groupby
        self.target = target
        self.total_mean = total_mean
        self.total_std = total_std
        pass
    
    def fit(self, x, y):
        return self
    
    def transform(self, x):
        temp1 = x.groupby(self.groupby)[[self.groupby, self.target]].apply(lambda x: (x[self.target])).values
        temp2 = x.groupby(self.groupby)[[self.groupby, self.target]].apply(lambda x: self.total_mean.ix[x[self.groupby].values,self.target]).values
        temp3 = x.groupby(self.groupby)[[self.groupby, self.target]].apply(lambda x: self.total_std.ix[x[self.groupby].values,self.target]).values
        return ((temp1-temp2)/temp3).reshape(-1,1)

Logistic Regression

We want to start with the very simple linear model to see how good we can achieve. We should notice that because we are dealing with a very biased classification issue. Therefore, we apply the StratifiedShuffleSplit to ensure both training and testing sets contain same ratio of paid/default cases. Also, the option class_weight is turned on and set to balanced to tell the model to balance the weight between paid/default cases.

There are other approaches to deal with the unbalanced classification method, like undersampling, oversampling and SMOTE. We will use these methods in the next version of SmartUnderwriter.

In [44]:
features = FeatureUnion([
        ('Loan_Amount', ExtractNormalized('STATE','ORIG_AMT')),
        ('credit score', ExtractCreditScore()),
        ('Loan_to_Value', ExtractNormalized('STATE','OCLTV')),
        ('Debt_to_income', ExtractNormalized('STATE','DTI')),
        ('Loan_purpose', ExtractCategory('PURPOSE')),
        ('Property_Type', ExtractCategory('PROP_TYP')),
        ('Occupancy_Status', ExtractCategory('OCC_STAT'))
        
    ])
In [45]:
sss = StratifiedShuffleSplit(ExtractLoanStatus().fit_transform(fannie1_known), 1, test_size=0.15)
for train_index, test_index in sss:
    fannie_train = fannie1_known.iloc[train_index,]
    fannie_test = fannie1_known.iloc[test_index, ]
    status_train = ExtractLoanStatus().fit_transform(fannie1_known).iloc[train_index,]
    status_test = ExtractLoanStatus().fit_transform(fannie1_known).iloc[test_index,]

Grid Search for Best Hyperparameter of Logistic Regression

One of the most powerful feature of Scikit-Learn library is its grid search function. Here, we use cross-validation logistic model to search the best regulation parameter to give highest AUC score.

In [46]:
model2 = Pipeline([
        ('features', features),
        ('Logistic', LogisticRegressionCV(Cs = 10, scoring = 'roc_auc', class_weight='balanced'))
    ])

model2.fit(fannie_train, status_train)
Out[46]:
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('Loan_Amount', ExtractNormalized(groupby='STATE', target='ORIG_AMT',
         total_mean=            ORIG_AMT      OCLTV        DTI
STATE
AK     241308.434959  74.639228  31.997967
AL     194002.399177  76.045013  30.418485
AR     ...=None,
           refit=True, scoring='roc_auc', solver='lbfgs', tol=0.0001,
           verbose=0))])
In [47]:
print(model2.named_steps['Logistic'].C_)
print(model2.named_steps['Logistic'].coef_)
[ 0.00077426]
[[ 0.00161235 -0.01686483 -0.00233408  0.00108836 -0.20423069 -0.11163132
  -0.47136523]]

Comments about Logistic Model

Feature coefficients for the best logistic model we have:

Loan Amount Credit Score Loan-to-Value ratio Debt-to-Income ratio Loan Purpose Property Type Occupancy Status
0.00161235 -0.01686483 -0.00233408 0.00108836 -0.20423069 -0.11163132 -0.47136523

Remember label 1 means default and label 0 means paid. In general, the model gives a reasonable (qualitatively) explanation about how to have a less risky loan: lower loan amount, higher credit score etc.

In [48]:
status_pred2 = model2.predict(fannie_test)
print(classification_report(status_test, status_pred2))
print(pd.DataFrame(confusion_matrix(status_test, status_pred2), index=['Actual Healthy', 'Actual Default'], 
                   columns = ['Pred. Healthy', 'Pred. Default']))
print('Area under the curve is',roc_auc_score(status_test, status_pred2))
             precision    recall  f1-score   support

          0       0.98      0.73      0.83    810739
          1       0.11      0.69      0.20     41217

avg / total       0.94      0.72      0.80    851956

                Pred. Healthy  Pred. Default
Actual Healthy         587862         222877
Actual Default          12657          28560
Area under the curve is 0.70900599495
/usr/local/lib/python3.4/dist-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)

Comments about model metrics using the test set

Not surprisingly, this model gives a 11% precision on the default prediction, which means quite a bit of good loan has been categorized as potential default. As a contrast, the recall rate (which means predicted positive/all true positive, or power if you are statistician) is at 69%. You can adjust the precision/recall by adjusting the setpoint for the binary classification. This turns into the classic 'precision/recall' trade-off. I have a nice plot below to give you a better idea.

The AUC score of this model 0.71, which is not bad as the first shot with a simple linear model.

In [49]:
prec, rec, thres1 = precision_recall_curve(status_test, status_pred2)
fpr, tpr, thres2 = roc_curve(status_test, model2.decision_function(fannie_test))
In [50]:
plt.plot(prec,rec)
plt.xlabel('Precision')
plt.ylabel('Recall')
Out[50]:
<matplotlib.text.Text at 0x7f48a19024a8>
In [51]:
plt.plot(fpr,tpr, label='Logistic Regression Model')
plt.plot([0,1],[0,1], 'r--', label='random model')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc = 'upper left')
Out[51]:
<matplotlib.legend.Legend at 0x7f488cd36358>

Stochastic Gradient Descent Classifier

The next model to try is Support Vector Machine (SVM). However, SVM is notoriously slow especially when dealing with dataset size like here. Luckily, stochastic gradient descent based algorithm can speed up quite a bit. So, the second model to try is to use SGD classfier with a "SVM"-like hinge loss function.

SGD classifier is very sensitive to the scale of the data. So we put in a StandardScaler transformer to normalize the whole feature matrix.

In [52]:
features = FeatureUnion([
        ('Loan_Amount', ExtractNormalized('STATE','ORIG_AMT')),
        ('credit score', ExtractCreditScore()),
        ('Loan_to_Value', ExtractNormalized('STATE','OCLTV')),
        ('Debt_to_income', ExtractNormalized('STATE','DTI')),
        ('Loan_purpose', ExtractCategory('PURPOSE')),
        ('Property_Type', ExtractCategory('PROP_TYP')),
        ('Occupancy_Status', ExtractCategory('OCC_STAT'))
        
    ])


model4 = Pipeline([
        ('features', features),
        ('scale', StandardScaler()),
        ('SGD', SGDClassifier(loss = 'hinge', class_weight='balanced'))
    ])
model4.fit(fannie_train, status_train)
status_pred4 = model4.predict(fannie_test)
roc_auc_score(status_test, model4.decision_function(fannie_test))
Out[52]:
0.771532777604367
In [53]:
pred4 = model4.predict(fannie_test)
print(classification_report(status_test,pred4))
print(pd.DataFrame(confusion_matrix(status_test, pred4), index=['Actual Healthy', 'Actual Default'], 
                   columns = ['Pred. Healthy', 'Pred. Default']))
             precision    recall  f1-score   support

          0       0.98      0.72      0.83    810739
          1       0.11      0.69      0.19     41217

avg / total       0.94      0.72      0.80    851956

                Pred. Healthy  Pred. Default
Actual Healthy         585966         224773
Actual Default          12586          28631
/usr/local/lib/python3.4/dist-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)
In [54]:
model4.named_steps['SGD'].coef_
Out[54]:
array([[ 0.04530123, -0.86460842,  0.01849037, -0.04807039, -0.08237096,
        -0.10498967, -0.14175012]])

Comments on SGDClassifier

SGD classifier did give a better AUC score (0.77). But the classification report shows that this model is similar to what we have by using logistic regression.

Coefficients in the model:

Loan Amount Credit Score Loan-to-Value ratio Debt-to-Income ratio Loan Purpose Property Type Occupancy Status
0.04530123 -0.86460842 0.01849037 -0.04807039 -0.08237096 -0.10498967 -0.14175012

It would be good to grid search the SGDClassifier. However, due to the memory limitation, grid search was not conducted. In the future when I move all to the Spark module, we would redo the SGDClassifier.

This model is chosen to be the prediction model I use for the final prediction model.

Gradient Boosting

Gradient boosting technology is shown to be an effective method to find a low bias/low variance model. Here, we test out the gradient boosting method. Initial result shows it has similar performance as logistic regression or SGDclassifier, using AUC score. Unfortunately, no grid search can be done using my 16GB Digital Ocean box.

In [56]:
features = FeatureUnion([
        ('Loan_Amount', ExtractNormalized('STATE','ORIG_AMT')),
        ('credit score', ExtractCreditScore()),
        ('Loan_to_Value', ExtractNormalized('STATE','OCLTV')),
        ('Debt_to_income', ExtractNormalized('STATE','DTI')),
        ('Loan_purpose', ExtractCategory('PURPOSE')),
        ('Property_Type', ExtractCategory('PROP_TYP')),
        ('Occupancy_Status', ExtractCategory('OCC_STAT'))
        
    ])

model5 = Pipeline([
        ('features', features),
        ('GDB', GradientBoostingClassifier(n_estimators=200, learning_rate=0.5))
    ])
model5.fit(fannie_train, status_train)
status_pred5 = model5.predict(fannie_test)
In [57]:
roc_auc_score(status_test, model5.decision_function(fannie_test))
Out[57]:
0.7795108119781432