Predicting If Customer Will Subscribe to a Term Deposit
On November 9th to 10th, 2019, my team won the First Place Best Model Award at the Data Hackathon (a data science competition lasting less than 24 hours) held by OCRUG and the Merage Analytics Club at The Paul Merage School of Business. We were a team of three from the UCI Master’s in Business Analytics program. Below is the report of our analysis and prediction for the competition.
Background:
The data comes from direct marketing campaigns (phone calls) of a Portuguese banking institution between May 2008 and November 2010. Clients were often contacted more than once in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).
Goal:
Predict whether the client will subscribe (yes/no) to a term deposit. Key process owners can use the output of this prediction to improve the strategy for the next marketing campaign.
Data Dictionary:
Independent Variables
Bank client data
- Age (numeric)
- Job : type of job
- Marital : marital status
- Education
- Default: has credit in default? (“yes”,“no”)
- Balance: average yearly balance, in euros
- Housing: has housing loan? (“yes”,“no”)
- Loan: has personal loan? (“yes”,“no”)
Related with the last contact of the current campaign
- Contact: contact communication type
- Day: last contact day of the month
- Month: last contact month of year
- Duration: last contact duration, in seconds (numeric)
Other Variables
- Campaign: number of contacts performed during this campaign and for this client
- Pdays: number of days that passed by after the client was last contacted from a previous campaign (-1 means the client was not previously contacted)
- Previous: number of contacts performed before this campaign and for this client
- Poutcome: outcome of the previous marketing campaign
Dependent Variable (desired target)
- Deposit: has the client subscribed to a term deposit? (“yes”,“no”)
import pandas as pd

url = 'https://raw.githubusercontent.com/ocrug/hackathon-2019-11/master/data/bank-full.csv'
bank = pd.read_csv(url, sep=';', encoding='latin1')
Data Cleaning
- The first step we took to clean the data was to check for duplicated rows and for null or NA values.
- Then, we converted every variable with “yes”/“no” values to binary 1/0 to make analysis and model building easier.
Beyond these steps, the dataset was already clean, so we moved on to the EDA.
duplicated data : 0
null data : 0
NA data : 0
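The checking code itself is not shown above; a minimal sketch that would produce counts like these:
print('duplicated data :', bank.duplicated().sum())  # fully duplicated rows
print('null data :', bank.isnull().sum().sum())      # null cells
print('NA data :', bank.isna().sum().sum())          # NA cells (isna is an alias of isnull in pandas)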
bank.replace(('yes', 'no'), (1, 0), inplace=True)
Exploratory Data Analysis (EDA)
In the EDA part, we wanted to visualize and analyze the relationships of the independent variables with each other as well as with the dependent variable (deposit).
- The chart below showed that most of the customers in the dataset were blue-collar workers. However, customers who are retired, students, unemployed, or work in management had a higher percentage of subscribing to the term deposit compared to other job titles.
- The chart below showed that customers who successfully subscribed during the previous marketing campaign were most likely to subscribe again.
- The chart below showed that single customers were more likely to subscribe than divorced or married customers.
- The chart below showed that most of the data were collected in May. However, a higher percentage of customers subscribed in February, March, April, September, October, and December.
Correlation Matrix and Chi Square Test
Once we analyzed the breakdown of the dependent variable across each of these independent variables, we wanted to make sure the observed effects were not due to random chance. Therefore, we used the chi-squared test to check whether the relationships between the categorical variables and the dependent variable were significant. We also examined the correlation between the continuous independent variables and the dependent variable using a correlation matrix.
- The correlation matrix showed possible collinearity between the pdays and previous variables, since both variables essentially measure the same thing. Therefore, we removed one of them from the model.
- duration showed a very high correlation with the deposit variable. This made sense: the call duration is not known before a call is made, and once the call ends the deposit outcome is already known. Therefore, this input was discarded in order to have a realistic predictive model.
- The chi-squared test results were favorable in that every categorical variable showed a significant relationship with the dependent variable. This means that all of these variables are useful in predicting deposit.
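The correlation matrix computation itself is not shown, but the chi-squared code below refers to its columns (corr). A minimal sketch, assuming a seaborn heatmap was used for the plot:
import seaborn as sns
import matplotlib.pyplot as plt

corr = bank.corr()  # pairwise correlations of the numeric columns
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()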
import scipy.stats as scs

pd.set_option('display.float_format', lambda x: '%.8f' % x)

def chisq_of_df_cols(df, c1, c2):
    groupsizes = df.groupby([c1, c2]).size()
    ctsum = groupsizes.unstack(c1)
    # fillna(0) is necessary to remove any NAs which will cause exceptions
    return scs.chi2_contingency(ctsum.fillna(0))

# categorical columns = every column not in the (numeric) correlation matrix
categorical = bank[list(set(bank.columns) - set(corr.columns))]
variable = []
Chi_Squared_Val = []
P_Value = []
for col in categorical.columns:
    variable.append(col)
    # group by the categorical column and the deposit Series (aligned on the index)
    chi2_stat, p_val = chisq_of_df_cols(categorical, col, bank.deposit)[:2]
    Chi_Squared_Val.append(chi2_stat)
    P_Value.append(p_val)
bank.drop(columns=['duration'],inplace=True)
Outlier Detection
There were outliers in the previous and pdays columns. Therefore, we capped their values at roughly the 90th to 93rd percentile.
- previous top 90% value = 2.0
- pdays top 90% value = 185.0
- pdays top 93% value = 6.0
bank[[i for i in list(set(bank.columns) - set(variable)) if i != 'deposit']].describe()
| | pdays | campaign | balance | housing | previous | day | default | age | loan |
|---|---|---|---|---|---|---|---|---|---|
count | 45211.00000000 | 45211.00000000 | 45211.00000000 | 45211.00000000 | 45211.00000000 | 45211.00000000 | 45211.00000000 | 45211.00000000 | 45211.00000000 |
mean | 40.19782796 | 2.76384066 | 1362.27205769 | 0.55583818 | 0.58032337 | 15.80641879 | 0.01802659 | 40.93621021 | 0.16022649 |
std | 100.12874599 | 3.09802088 | 3044.76582917 | 0.49687781 | 2.30344104 | 8.32247615 | 0.13304894 | 10.61876204 | 0.36682004 |
min | -1.00000000 | 1.00000000 | -8019.00000000 | 0.00000000 | 0.00000000 | 1.00000000 | 0.00000000 | 18.00000000 | 0.00000000 |
25% | -1.00000000 | 1.00000000 | 72.00000000 | 0.00000000 | 0.00000000 | 8.00000000 | 0.00000000 | 33.00000000 | 0.00000000 |
50% | -1.00000000 | 2.00000000 | 448.00000000 | 1.00000000 | 0.00000000 | 16.00000000 | 0.00000000 | 39.00000000 | 0.00000000 |
75% | -1.00000000 | 3.00000000 | 1428.00000000 | 1.00000000 | 0.00000000 | 21.00000000 | 0.00000000 | 48.00000000 | 0.00000000 |
max | 871.00000000 | 63.00000000 | 102127.00000000 | 1.00000000 | 275.00000000 | 31.00000000 | 1.00000000 | 95.00000000 | 1.00000000 |
Now, let’s revisit the boxplot of each of these continuous independent variables after capping the outliers.
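The capping code itself is not shown in the notebook; a minimal sketch of percentile capping with pandas (which percentile was applied to which column is our assumption):
# cap 'previous' and 'pdays' at an upper percentile (assumed cutoffs)
bank['previous'] = bank['previous'].clip(upper=bank['previous'].quantile(0.90))
bank['pdays'] = bank['pdays'].clip(upper=bank['pdays'].quantile(0.93))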
Prepare the Data for the Model
For the first model, we included the important variables and then applied backward elimination to improve the model's performance.
# creating one-hot variables for each categorical column
bank_cleaned = bank
for i in variable:
    bank_cleaned = pd.get_dummies(bank_cleaned, columns=[i])
bank_cleaned.head()
| | age | default | balance | housing | loan | day | campaign | pdays | previous | deposit | ... | month_jun | month_mar | month_may | month_nov | month_oct | month_sep | poutcome_failure | poutcome_other | poutcome_success | poutcome_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58 | 0 | 2143 | 1 | 0 | 5 | 1.00000000 | -1.00000000 | 0.00000000 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 44 | 0 | 29 | 1 | 0 | 5 | 1.00000000 | -1.00000000 | 0.00000000 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 33 | 0 | 2 | 1 | 1 | 5 | 1.00000000 | -1.00000000 | 0.00000000 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 47 | 0 | 1506 | 1 | 0 | 5 | 1.00000000 | -1.00000000 | 0.00000000 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 33 | 0 | 1 | 0 | 0 | 5 | 1.00000000 | -1.00000000 | 0.00000000 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 48 columns
bank_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 48 columns):
age 45211 non-null int64
default 45211 non-null int64
balance 45211 non-null int64
housing 45211 non-null int64
loan 45211 non-null int64
day 45211 non-null int64
campaign 45211 non-null float64
pdays 45211 non-null float64
previous 45211 non-null float64
deposit 45211 non-null int64
contact_cellular 45211 non-null uint8
contact_telephone 45211 non-null uint8
contact_unknown 45211 non-null uint8
education_primary 45211 non-null uint8
education_secondary 45211 non-null uint8
education_tertiary 45211 non-null uint8
education_unknown 45211 non-null uint8
job_admin. 45211 non-null uint8
job_blue-collar 45211 non-null uint8
job_entrepreneur 45211 non-null uint8
job_housemaid 45211 non-null uint8
job_management 45211 non-null uint8
job_retired 45211 non-null uint8
job_self-employed 45211 non-null uint8
job_services 45211 non-null uint8
job_student 45211 non-null uint8
job_technician 45211 non-null uint8
job_unemployed 45211 non-null uint8
job_unknown 45211 non-null uint8
marital_divorced 45211 non-null uint8
marital_married 45211 non-null uint8
marital_single 45211 non-null uint8
month_apr 45211 non-null uint8
month_aug 45211 non-null uint8
month_dec 45211 non-null uint8
month_feb 45211 non-null uint8
month_jan 45211 non-null uint8
month_jul 45211 non-null uint8
month_jun 45211 non-null uint8
month_mar 45211 non-null uint8
month_may 45211 non-null uint8
month_nov 45211 non-null uint8
month_oct 45211 non-null uint8
month_sep 45211 non-null uint8
poutcome_failure 45211 non-null uint8
poutcome_other 45211 non-null uint8
poutcome_success 45211 non-null uint8
poutcome_unknown 45211 non-null uint8
dtypes: float64(3), int64(7), uint8(38)
memory usage: 5.1 MB
Split Train and Test Data
To evaluate the model, we used the holdout method and split the data into 80% for model training and 20% for testing/evaluation. Then, we measured the baseline percentage of each class in the overall dataset and compared it with the train and test sets to check that the randomized split was representative. The result was favorable: the train and test sets had very similar class proportions for the dependent variable.
print(bank_cleaned.deposit.value_counts())
# note: the +1 in the denominator makes these proportions sum to just under 1
print(bank_cleaned.deposit.value_counts()/(len(bank.deposit)+1))
0 39922
1 5289
Name: deposit, dtype: int64
0 0.88299566
1 0.11698222
Name: deposit, dtype: float64
Now splitting the train and test data
import numpy as np

# random 80/20 holdout split
split = np.random.rand(len(bank_cleaned)) <= .80
train = bank_cleaned[split]
test = bank_cleaned[~split]
Train Data Breakdown
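The call that produced this breakdown is not shown in the notebook; presumably it mirrors the test breakdown below:
train.deposit.value_counts()/(len(train.deposit)+1)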
0 0.88228142
1 0.11769095
Name: deposit, dtype: float64
Test Data Breakdown
test.deposit.value_counts()/(len(test.deposit)+1)
0 0.88576177
1 0.11412742
Name: deposit, dtype: float64
Model 1
- Our first model had a really high accuracy of ~89%. However, the recall was only ~18%. This indicated that the model was predicting “no subscription” for almost everyone; it was heavily skewed toward one class of the predicted variable and would not work well when evaluated on data with many “yes” cases.
- This happened because we had an imbalanced dataset. We wanted a more representative model that could handle more variability, so we built the next model with a different sampling method applied to the training dataset.
- Additionally, this model was not giving us a good result because, in this context, we want to maximize recall so that we identify more of the customers who are willing to subscribe to the term deposit.
##for dummy vars, we only need N-1
train = train.drop(columns=['job_unknown','marital_divorced','poutcome_other','contact_cellular','education_unknown'])
test = test.drop(columns=['job_unknown','marital_divorced','poutcome_other','contact_cellular','education_unknown'])
X_train = train[train.columns[train.columns != 'deposit']]
y_train = train[train.columns[train.columns == 'deposit']]
X_test = test[test.columns[test.columns != 'deposit']]
y_test = test[test.columns[test.columns == 'deposit']]
#X_train
##No longer using sklearn due to no summary statistics
#logit = LogisticRegression()
#logit.fit(X_train, y_train)
#y_pred = logit.predict(X_test)
import statsmodels.api as sm

logit_model = sm.Logit(y_train, X_train).fit()
print(logit_model.summary())
Optimization terminated successfully.
Current function value: 0.300826
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: deposit No. Observations: 36187
Model: Logit Df Residuals: 36145
Method: MLE Df Model: 41
Date: Sat, 30 Nov 2019 Pseudo R-squ.: 0.1697
Time: 23:21:43 Log-Likelihood: -10886.
converged: True LL-Null: -13111.
Covariance Type: nonrobust LLR p-value: 0.000
=======================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------
age 0.0014 0.002 0.631 0.528 -0.003 0.006
default -0.1694 0.164 -1.036 0.300 -0.490 0.151
balance 1.99e-05 4.96e-06 4.012 0.000 1.02e-05 2.96e-05
housing -0.5572 0.043 -12.940 0.000 -0.642 -0.473
loan -0.3957 0.059 -6.664 0.000 -0.512 -0.279
day 0.0027 0.002 1.084 0.278 -0.002 0.008
campaign -0.1179 0.013 -8.931 0.000 -0.144 -0.092
pdays -0.0021 0.001 -2.734 0.006 -0.004 -0.001
previous 0.2561 0.075 3.425 0.001 0.110 0.403
contact_telephone -0.2439 0.073 -3.345 0.001 -0.387 -0.101
contact_unknown -1.3145 0.070 -18.756 0.000 -1.452 -1.177
education_primary -0.2344 0.103 -2.284 0.022 -0.435 -0.033
education_secondary -0.0713 0.090 -0.790 0.430 -0.248 0.106
education_tertiary 0.0843 0.095 0.888 0.374 -0.102 0.270
job_admin. 0.2819 0.236 1.194 0.233 -0.181 0.745
job_blue-collar 0.1488 0.235 0.633 0.527 -0.312 0.610
job_entrepreneur 0.0889 0.255 0.349 0.727 -0.411 0.588
job_housemaid -0.1039 0.259 -0.401 0.689 -0.612 0.404
job_management 0.2081 0.234 0.888 0.374 -0.251 0.667
job_retired 0.6343 0.240 2.643 0.008 0.164 1.105
job_self-employed 0.2557 0.248 1.031 0.303 -0.230 0.742
job_services 0.2333 0.239 0.976 0.329 -0.235 0.702
job_student 0.6144 0.248 2.474 0.013 0.128 1.101
job_technician 0.1737 0.234 0.741 0.459 -0.286 0.633
job_unemployed 0.4274 0.248 1.725 0.085 -0.058 0.913
marital_married -0.1800 0.058 -3.079 0.002 -0.295 -0.065
marital_single 0.1339 0.067 2.008 0.045 0.003 0.265
month_apr -1.0249 0.330 -3.104 0.002 -1.672 -0.378
month_aug -1.8817 0.329 -5.716 0.000 -2.527 -1.236
month_dec -0.3764 0.366 -1.027 0.304 -1.094 0.342
month_feb -1.4508 0.329 -4.411 0.000 -2.095 -0.806
month_jan -2.1788 0.343 -6.351 0.000 -2.851 -1.506
month_jul -1.7086 0.330 -5.174 0.000 -2.356 -1.061
month_jun -0.8912 0.328 -2.716 0.007 -1.534 -0.248
month_mar 0.1450 0.343 0.423 0.673 -0.527 0.818
month_may -1.4663 0.328 -4.474 0.000 -2.109 -0.824
month_nov -1.8762 0.331 -5.670 0.000 -2.525 -1.228
month_oct -0.4541 0.338 -1.345 0.179 -1.116 0.208
month_sep -0.4072 0.339 -1.200 0.230 -1.072 0.258
poutcome_failure -0.3341 0.089 -3.734 0.000 -0.509 -0.159
poutcome_success 1.9193 0.098 19.516 0.000 1.727 2.112
poutcome_unknown -0.1566 0.188 -0.831 0.406 -0.526 0.213
=======================================================================================
from sklearn import metrics
from sklearn.preprocessing import binarize

y_pred = logit_model.predict(X_test)
y_predbin = pd.DataFrame(y_pred).values
y_predbin = binarize(y_predbin, threshold=0.5, copy=True)   # 0.5 probability cutoff
cnf_matrix = metrics.confusion_matrix(y_test, y_predbin)
heatMap(pd.DataFrame(cnf_matrix))   # heatMap is a custom plotting helper defined earlier in the notebook (not shown here)
print("Accuracy:", metrics.accuracy_score(y_test, y_predbin))
print("Precision:", metrics.precision_score(y_test, y_predbin))
print("Recall:", metrics.recall_score(y_test, y_predbin))
Accuracy: 0.8922872340425532
Precision: 0.5923566878980892
Recall: 0.18058252427184465
cnf_matrix
array([[7866, 128],
[ 844, 186]])
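To see where these numbers come from: the confusion matrix gives 7,866 true negatives, 128 false positives, 844 false negatives, and 186 true positives, so accuracy = (7866 + 186) / 9024 ≈ 0.892, precision = 186 / (186 + 128) ≈ 0.592, and recall = 186 / (186 + 844) ≈ 0.181. Only 186 of the 1,030 actual subscribers in the test set are caught.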
print('predicted class breakdown', pd.DataFrame(y_predbin)[0].value_counts()/(len(pd.DataFrame(y_predbin)[0])+1), sep='\n')
plt.bar(['0','1'], pd.DataFrame(y_predbin)[0].value_counts(), color = "blue")
plt.title('Predicted Classes')
predicted class breakdown
0.00000000 0.96509695
1.00000000 0.03479224
Name: 0, dtype: float64
y_pred_proba = y_pred
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="auc="+str(round(auc,2)))
plt.ylabel("TPR")
plt.xlabel("FPR")
plt.title("ROC Curve")
plt.legend(loc=4)
plt.show()
Model 2 - Using SMOTE and Dropping Insignificant Variables
The method that we used to treat the imbalanced dataset was the SMOTE (Synthetic Minority Oversampling Technique).
“What it does is, it creates synthetic (not duplicate) samples of the minority class. Hence making the minority class equal to the majority class. SMOTE does this by selecting similar records and altering that record one column at a time by a random amount within the difference to the neighbouring records.” (Rahim S., 2018)
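In effect, for a minority-class sample x and one of its nearest minority-class neighbours x_nn, SMOTE generates a synthetic point x_new = x + u · (x_nn − x), where u is drawn uniformly from [0, 1], so each new sample lies on the line segment between two existing minority samples.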
Next, we also dropped some variables below:
- The date variables (day and month) were excluded from the model. Generally speaking, a client's decision to subscribe to a bank's term deposit should not depend on the calendar date.
- The pdays variable was also excluded since we knew it provided the same information as previous.
- The default variable was dropped as well because it was insignificant in Model 1.
The accuracy of this model dropped significantly to ~71%. However, the recall improved greatly to ~62.8%, versus ~18% for the first model. We believed this model was more representative since it produced a better distribution of predictions between the two classes (yes/no).
train1 = train.drop(columns=['pdays','default','day','month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
'month_oct', 'month_sep'])
test1 = test.drop(columns=['pdays','default','day','month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
'month_oct', 'month_sep'])
X_test1 = test1.loc[:,test1.columns != 'deposit'].values
y_test1 = test1.loc[:, test1.columns == 'deposit'].values
train1.head()
| | age | balance | housing | loan | campaign | previous | deposit | contact_telephone | contact_unknown | education_primary | ... | job_self-employed | job_services | job_student | job_technician | job_unemployed | marital_married | marital_single | poutcome_failure | poutcome_success | poutcome_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58 | 2143 | 1 | 0 | 1.00000000 | 0.00000000 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 44 | 29 | 1 | 0 | 1.00000000 | 0.00000000 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
2 | 33 | 2 | 1 | 1 | 1.00000000 | 0.00000000 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3 | 47 | 1506 | 1 | 0 | 1.00000000 | 0.00000000 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
4 | 33 | 1 | 0 | 0 | 1.00000000 | 0.00000000 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
5 rows × 28 columns
from imblearn.over_sampling import SMOTE

columns = train1.loc[:, train1.columns != 'deposit'].columns
X = train1.loc[:, train1.columns != 'deposit'].values
y = train1.loc[:, train1.columns == 'deposit'].values
smt = SMOTE()
# fit_sample oversamples the minority class; newer versions of imbalanced-learn call this fit_resample
os_data_X, os_data_y = smt.fit_sample(X, y)
os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
os_data_y = pd.DataFrame(data=os_data_y, columns=['y'])
smoted_train = pd.concat([os_data_X, os_data_y], axis=1)
# we can Check the numbers of our data
print("Length of oversampled data is ",len(os_data_X))
print("Number of no subscription in oversampled data",len(os_data_y[os_data_y['y']==0]))
print("Number of subscription",len(os_data_y[os_data_y['y']==1]))
print("Proportion of no deposit data in oversampled data is ",len(os_data_y[os_data_y['y']==0])/len(os_data_X))
print("Proportion of deposit data in oversampled data is ",len(os_data_y[os_data_y['y']==1])/len(os_data_X))
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Length of oversampled data is 63856
Number of no subscription in oversampled data 31928
Number of subscription 31928
Proportion of no deposit data in oversampled data is 0.5
Proportion of deposit data in oversampled data is 0.5
os_data_y['y'].value_counts()
1 31928
0 31928
Name: y, dtype: int64
logit_model1 = sm.Logit(os_data_y, os_data_X).fit()
print(logit_model1.summary())
Optimization terminated successfully.
Current function value: 0.568418
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 63856
Model: Logit Df Residuals: 63829
Method: MLE Df Model: 26
Date: Sat, 30 Nov 2019 Pseudo R-squ.: 0.1799
Time: 23:26:45 Log-Likelihood: -36297.
converged: True LL-Null: -44262.
Covariance Type: nonrobust LLR p-value: 0.000
=======================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------
age 0.0010 0.001 0.920 0.357 -0.001 0.003
balance 2.739e-05 3.08e-06 8.882 0.000 2.13e-05 3.34e-05
housing -0.6192 0.021 -30.090 0.000 -0.660 -0.579
loan -0.5089 0.029 -17.669 0.000 -0.565 -0.452
campaign -0.1742 0.006 -26.912 0.000 -0.187 -0.162
previous 0.3661 0.041 8.857 0.000 0.285 0.447
contact_telephone -0.2008 0.040 -5.076 0.000 -0.278 -0.123
contact_unknown -1.0184 0.026 -39.113 0.000 -1.069 -0.967
education_primary -0.1984 0.055 -3.584 0.000 -0.307 -0.090
education_secondary -0.0314 0.049 -0.635 0.525 -0.128 0.066
education_tertiary 0.2294 0.052 4.387 0.000 0.127 0.332
job_admin. 0.6609 0.092 7.214 0.000 0.481 0.840
job_blue-collar 0.4505 0.090 4.997 0.000 0.274 0.627
job_entrepreneur 0.3466 0.104 3.338 0.001 0.143 0.550
job_housemaid 0.0829 0.108 0.767 0.443 -0.129 0.295
job_management 0.4444 0.091 4.868 0.000 0.265 0.623
job_retired 0.9909 0.103 9.650 0.000 0.790 1.192
job_self-employed 0.4771 0.101 4.703 0.000 0.278 0.676
job_services 0.4797 0.093 5.157 0.000 0.297 0.662
job_student 0.9870 0.102 9.636 0.000 0.786 1.188
job_technician 0.3958 0.091 4.370 0.000 0.218 0.573
job_unemployed 0.6659 0.102 6.531 0.000 0.466 0.866
marital_married -0.1836 0.031 -6.002 0.000 -0.244 -0.124
marital_single 0.2135 0.034 6.204 0.000 0.146 0.281
poutcome_failure -0.4828 0.049 -9.817 0.000 -0.579 -0.386
poutcome_success 3.1788 0.083 38.215 0.000 3.016 3.342
poutcome_unknown 0.1025 0.076 1.342 0.179 -0.047 0.252
=======================================================================================
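The evaluation code that produced the metrics below is not shown in the notebook; a minimal sketch, assuming the same 0.5 probability cutoff used for Model 1 (y_predbin1 is the name referenced in the class breakdown further down):
y_pred1 = logit_model1.predict(X_test1)
y_predbin1 = binarize(pd.DataFrame(y_pred1).values, threshold=0.5, copy=True)
print("Accuracy:", metrics.accuracy_score(y_test1, y_predbin1))
print("Precision:", metrics.precision_score(y_test1, y_predbin1))
print("Recall:", metrics.recall_score(y_test1, y_predbin1))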
Accuracy: 0.7116578014184397
Precision: 0.22575017445917656
Recall: 0.6281553398058253
pd.DataFrame(y_predbin1)[0].value_counts()/(len(pd.DataFrame(y_predbin1)[0])+1)
0.00000000 0.68232687
1.00000000 0.31756233
Name: 0, dtype: float64
Model 3
- Based on the Model 2 results, we dropped additional variables whose p-values were above 0.05 (poutcome_unknown and job_housemaid).
- The result indicated that accuracy improved slightly (by less than 1%) while recall stayed the same, so we chose Model 3.
train1 = train.drop(columns=['pdays','default','day','month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
'month_oct', 'month_sep','poutcome_unknown','job_housemaid'])
test1 = test.drop(columns=['pdays','default','day','month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
'month_oct', 'month_sep','poutcome_unknown','job_housemaid'])
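The oversampling and model-fitting code is not repeated in the notebook for this model; presumably it mirrors Model 2, applied to the reduced feature set (the model name below is ours):
columns = train1.loc[:, train1.columns != 'deposit'].columns
X = train1.loc[:, train1.columns != 'deposit'].values
y = train1.loc[:, train1.columns == 'deposit'].values
os_data_X, os_data_y = SMOTE().fit_sample(X, y)
logit_model2 = sm.Logit(pd.DataFrame(os_data_y, columns=['y']),
                        pd.DataFrame(os_data_X, columns=columns)).fit()
print(logit_model2.summary())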
Optimization terminated successfully.
Current function value: 0.567444
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 63856
Model: Logit Df Residuals: 63831
Method: MLE Df Model: 24
Date: Sat, 30 Nov 2019 Pseudo R-squ.: 0.1814
Time: 23:35:48 Log-Likelihood: -36235.
converged: True LL-Null: -44262.
Covariance Type: nonrobust LLR p-value: 0.000
=======================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------
age 0.0008 0.001 0.893 0.372 -0.001 0.003
balance 2.761e-05 3.07e-06 9.007 0.000 2.16e-05 3.36e-05
housing -0.6224 0.021 -30.343 0.000 -0.663 -0.582
loan -0.4924 0.029 -17.092 0.000 -0.549 -0.436
campaign -0.1678 0.006 -26.260 0.000 -0.180 -0.155
previous 0.3312 0.023 14.528 0.000 0.286 0.376
contact_telephone -0.1974 0.040 -4.976 0.000 -0.275 -0.120
contact_unknown -1.0073 0.026 -38.772 0.000 -1.058 -0.956
education_primary -0.1047 0.050 -2.095 0.036 -0.203 -0.007
education_secondary 0.0258 0.044 0.584 0.559 -0.061 0.112
education_tertiary 0.2607 0.047 5.524 0.000 0.168 0.353
job_admin. 0.6412 0.054 11.795 0.000 0.535 0.748
job_blue-collar 0.4388 0.051 8.538 0.000 0.338 0.540
job_entrepreneur 0.3063 0.073 4.200 0.000 0.163 0.449
job_management 0.4542 0.054 8.429 0.000 0.349 0.560
job_retired 0.9639 0.067 14.404 0.000 0.833 1.095
job_self-employed 0.4232 0.070 6.060 0.000 0.286 0.560
job_services 0.4812 0.057 8.499 0.000 0.370 0.592
job_student 0.9895 0.074 13.356 0.000 0.844 1.135
job_technician 0.3635 0.053 6.911 0.000 0.260 0.467
job_unemployed 0.6449 0.070 9.177 0.000 0.507 0.783
marital_married -0.1517 0.030 -5.117 0.000 -0.210 -0.094
marital_single 0.2907 0.032 8.995 0.000 0.227 0.354
poutcome_failure -0.4990 0.046 -10.792 0.000 -0.590 -0.408
poutcome_success 3.1733 0.081 39.050 0.000 3.014 3.333
=======================================================================================
Accuracy: 0.7138741134751773
Precision: 0.22733661278988054
Recall: 0.6281553398058253
Model 4
- Since there were two types of customers in the dataset, those who had been contacted in a previous campaign and those who had not, we wanted to build separate models for these segments because we believed personalization is key.
- A more homogeneous group of customers can also yield more accurate predictions.
- In this case, the variable we used to split the data was pdays, since any value above -1 indicates a customer who had been contacted in a previous marketing campaign.
- The result was favorable: using the exact same method as Model 3, we achieved an accuracy of ~77% for the previously contacted segment.
train3 = train[train.pdays != -1]
test3 = test[test.pdays != -1]
train1 = train3.drop(columns=['pdays','default','day','month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
'month_oct', 'month_sep','poutcome_unknown','job_housemaid'])
test1 = test3.drop(columns=['pdays','default','day','month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
'month_oct', 'month_sep','poutcome_unknown','job_housemaid'])
X_test1 = test1.loc[:,test1.columns != 'deposit'].values
y_test1 = test1.loc[:, test1.columns == 'deposit'].values
Length of oversampled data is 10122
Number of no subscription in oversampled data 5061
Number of subscription 5061
Proportion of no deposit data in oversampled data is 0.5
Proportion of deposit data in oversampled data is 0.5
1 5061
0 5061
Name: y, dtype: int64
Optimization terminated successfully.
Current function value: 0.502125
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 10122
Model: Logit Df Residuals: 10097
Method: MLE Df Model: 24
Date: Sat, 30 Nov 2019 Pseudo R-squ.: 0.2756
Time: 23:46:37 Log-Likelihood: -5082.5
converged: True LL-Null: -7016.0
Covariance Type: nonrobust LLR p-value: 0.000
=======================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------
age -0.0001 0.003 -0.046 0.963 -0.005 0.005
balance 1.595e-05 8.09e-06 1.972 0.049 1e-07 3.18e-05
housing -0.9681 0.056 -17.432 0.000 -1.077 -0.859
loan -0.5818 0.087 -6.723 0.000 -0.751 -0.412
campaign -0.1535 0.021 -7.246 0.000 -0.195 -0.112
previous 0.3172 0.054 5.844 0.000 0.211 0.424
contact_telephone -0.2155 0.102 -2.104 0.035 -0.416 -0.015
contact_unknown -0.5576 0.269 -2.076 0.038 -1.084 -0.031
education_primary -0.6053 0.143 -4.242 0.000 -0.885 -0.326
education_secondary -0.3852 0.121 -3.184 0.001 -0.622 -0.148
education_tertiary -0.1405 0.126 -1.112 0.266 -0.388 0.107
job_admin. 0.1873 0.153 1.227 0.220 -0.112 0.486
job_blue-collar -0.4044 0.150 -2.700 0.007 -0.698 -0.111
job_entrepreneur -0.5751 0.228 -2.519 0.012 -1.023 -0.128
job_management 0.0905 0.152 0.596 0.551 -0.207 0.388
job_retired 0.2913 0.185 1.573 0.116 -0.072 0.654
job_self-employed -0.1098 0.193 -0.570 0.569 -0.488 0.268
job_services -0.0418 0.162 -0.257 0.797 -0.360 0.277
job_student 0.3518 0.181 1.940 0.052 -0.004 0.707
job_technician -0.1949 0.151 -1.287 0.198 -0.492 0.102
job_unemployed 0.5237 0.199 2.628 0.009 0.133 0.914
marital_married 0.1811 0.084 2.164 0.030 0.017 0.345
marital_single 0.2831 0.091 3.117 0.002 0.105 0.461
poutcome_failure -0.3130 0.062 -5.015 0.000 -0.435 -0.191
poutcome_success 2.3982 0.082 29.264 0.000 2.238 2.559
=======================================================================================
Accuracy: 0.7750301568154403
Precision: 0.49353448275862066
Recall: 0.6239782016348774
Conclusion and Key Takeaways
Accuracy is not the only measure of model performance. In this context, we believe maximizing recall (sensitivity) is the more important metric. Domain knowledge is very important for building a realistic and accurate model, and it also guides the choice of method for handling imbalanced data. Lastly, building separate models for different customer segments helps the business understand its customers better and provides more accurate predictions.
Analysis and Report Written by : Jennifer Siwu
Reference for SMOTE : https://medium.com/@saeedAR/smote-and-near-miss-in-python-machine-learning-in-imbalanced-datasets-b7976d9a7a79