Toad Tutorial¶
Toad is a Python toolkit for professional model developers; part of its functionality is dedicated to scorecard development. The toad package is continuously upgraded with new features. We will introduce the key functionality in this tutorial, including:
- EDA-related functions
- how to use toad to fine-tune feature binning and conduct feature selection
- WOE transformation
- stepwise feature selection
- model evaluation and validation
- scorecard transformation
- other functions
This tutorial demonstrates how to use toad to model high-dimensional data efficiently.¶
*Install and upgrade:* 1. pip: !pip install toad 2. conda: conda install toad --channel conda-forge 3. update: !pip install -U toad; conda update toad --channel conda-forge
*Feel free to open new issues on our `github <https://github.com/amphibian-dev/toad>`__*
[4]:
import pandas as pd
import numpy as np
import toad
[1]:
'''
Please upgrade to the latest version
'''
[1]:
'\nPlease upgrade to the latest version\n'
### 0. Load data¶
The demo data has 165 dimensions, including one ID column, a target variable, and a month column. The feature columns contain both categorical and numerical features, with several having missing data.
*This demo showcases how toad can efficiently and effectively support model development on such a messy dataset.*
[7]:
data = pd.read_csv('train.csv')
print('Shape:',data.shape)
data.head(10)
Shape: (108940, 167)
[7]:
APP_ID_C | target | var_d1 | var_d2 | var_d3 | var_d4 | var_d5 | var_d6 | var_d7 | var_d8 | ... | var_l_118 | var_l_119 | var_l_120 | var_l_121 | var_l_122 | var_l_123 | var_l_124 | var_l_125 | var_l_126 | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | app_1 | 0 | Hit-6+ Vintage | 816.0 | RESIDENT INDIAN | Post-Graduate | M | RESIDENT INDIAN | SELF-EMPLOYED | Y | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 2019-03 |
1 | app_2 | 0 | NaN | 841.0 | RESIDENT INDIAN | Post-Graduate | F | RESIDENT INDIAN | SALARIED | N | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 2019-03 |
2 | app_3 | 0 | Hit-6+ Vintage | 791.0 | RESIDENT INDIAN | Post-Graduate | M | RESIDENT INDIAN | PROPRIETOR | Y | ... | 0.0 | 0.088235 | 0.0 | 0.100000 | 0.0 | 0.011494 | 0.5 | 0.000000 | 0.0 | 2019-03 |
3 | app_4 | 0 | Hit-6+ Vintage | 821.0 | RESIDENT INDIAN | Graduate | M | RESIDENT INDIAN | SELF-EMPLOYED | N | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 2019-03 |
4 | app_5 | 0 | Hit-6+ Vintage | 807.0 | RESIDENT INDIAN | Graduate | M | RESIDENT INDIAN | SALARIED | Y | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.540541 | 0.0 | 0.285714 | 0.0 | 2019-03 |
5 | app_6 | 0 | Hit-6+ Vintage | 788.0 | RESIDENT INDIAN | Others | M | RESIDENT INDIAN | SALARIED | N | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 2019-03 |
6 | app_7 | 0 | Hit-6+ Vintage | 779.0 | RESIDENT INDIAN | Graduate | M | RESIDENT INDIAN | ATTORNEY AT LAW | Y | ... | 0.0 | 0.722222 | 0.0 | 0.777778 | 0.0 | 0.380952 | 0.0 | 0.571429 | 0.0 | 2019-03 |
7 | app_8 | 0 | Hit-6+ Vintage | 801.0 | RESIDENT INDIAN | Post-Graduate | M | RESIDENT INDIAN | SAL(RETIRAL AGE 60) | N | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 2019-03 |
8 | app_9 | 0 | Hit-6+ Vintage | 815.0 | RESIDENT INDIAN | Graduate | F | RESIDENT INDIAN | NaN | Y | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 2019-03 |
9 | app_10 | 0 | NaN | 804.0 | RESIDENT INDIAN | Graduate | M | RESIDENT INDIAN | PROPRIETOR | N | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 2019-03 |
10 rows × 167 columns
The dataset contains monthly data from March to July 2019. We use the March and April data as the training sample, and the May, June, and July data as the out-of-time (OOT) sample.¶
[4]:
print('month:',data.month.unique())
month: ['2019-03' '2019-04' '2019-05' '2019-06' '2019-07']
[8]:
train = data.loc[data.month.isin(['2019-03','2019-04']), :]
OOT = data.loc[~data.month.isin(['2019-03','2019-04']), :]
print('train size:',train.shape,'\nOOT size:',OOT.shape)
train size: (43576, 167)
OOT size: (65364, 167)
### I. EDA functions¶
1. toad.detect(dataframe):¶
Performs EDA on the data and outputs a statistical summary of each column. The statistics worth attention include: number of missing values, number of unique values, mean for numerical features, mode for categorical features, etc. From the cell below, the takeaways include:
- positive samples account for about 2.2%: the mean of the target column is 0.0219479;
- several features have different amounts of missing values: see the missing column;
- there are both categorical and numerical features, and several categorical features have high cardinality - from 10 up to 84 unique values: see the unique column for features with type == object.
[6]:
toad.detect(train)[:10]
[6]:
type | size | missing | unique | mean_or_top1 | std_or_top2 | min_or_top3 | 1%_or_top4 | 10%_or_top5 | 50%_or_bottom5 | 75%_or_bottom4 | 90%_or_bottom3 | 99%_or_bottom2 | max_or_bottom1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
APP_ID_C | object | 43576 | 0.00% | 43576 | app_36227:0.00% | app_29819:0.00% | app_35476:0.00% | app_10104:0.00% | app_35794:0.00% | app_25789:0.00% | app_36858:0.00% | app_12750:0.00% | app_24:0.00% | app_13004:0.00% |
target | int64 | 43576 | 0.00% | 2 | 0.0213191 | 0.144447 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
var_d1 | object | 43576 | 37.57% | 2 | Hit-6+ Vintage:60.32% | Hit-lt 6 Vinta:2.10% | None | None | None | None | None | None | Hit-6+ Vintage:60.32% | Hit-lt 6 Vinta:2.10% |
var_d2 | float64 | 43576 | 5.44% | 389 | 570.492 | 355.565 | -1 | -1 | -1 | 778 | 810 | 832 | 864 | 900 |
var_d3 | object | 43576 | 5.31% | 6 | RESIDENT INDIAN:94.00% | NON-RESIDENT INDIAN:0.64% | PARTNERSHIP FIRM:0.02% | PRIVATE LTD COMPANIES:0.02% | PUBLIC LTD COMPANIES:0.00% | NON-RESIDENT INDIAN:0.64% | PARTNERSHIP FIRM:0.02% | PRIVATE LTD COMPANIES:0.02% | PUBLIC LTD COMPANIES:0.00% | OVERSEAS CITIZEN OF INDIA:0.00% |
var_d4 | object | 43576 | 1.08% | 5 | Graduate:55.30% | Post-Graduate:21.57% | Others:10.71% | Under Graduate:10.67% | Professional:0.67% | Graduate:55.30% | Post-Graduate:21.57% | Others:10.71% | Under Graduate:10.67% | Professional:0.67% |
var_d5 | object | 43576 | 1.08% | 3 | M:79.70% | F:14.33% | O:4.89% | None | None | None | None | M:79.70% | F:14.33% | O:4.89% |
var_d6 | object | 43576 | 1.08% | 13 | RESIDENT INDIAN:93.34% | PRIVATE LTD COMPANIES:2.57% | PARTNERSHIP FIRM:1.45% | PUBLIC LTD COMPANIES:0.73% | NON-RESIDENT INDIAN:0.64% | CO-OPERATIVE SOCIETIES:0.01% | LIMITED LIABILITY PARTNERSHIP:0.00% | ASSOCIATION:0.00% | TRUST-NGO:0.00% | OVERSEAS CITIZEN OF INDIA:0.00% |
var_d7 | object | 43576 | 1.60% | 84 | SALARIED:31.43% | PROPRIETOR:31.31% | SELF-EMPLOYED:10.74% | OTHERS:6.40% | FIRST TIME USERS:2.72% | NURSE:0.00% | PHARMACIST:0.00% | RETAIL BUS OPERATOR:0.00% | PRIVATE TAILOR:0.00% | TAXI DRIVER:0.00% |
var_d8 | object | 43576 | 1.08% | 2 | Y:59.90% | N:39.03% | None | None | None | None | None | None | Y:59.90% | N:39.03% |
2. toad.quality(dataframe, target=’target’, iv_only=False):¶
Outputs the IV (information value), Gini, entropy, and number of unique values of each feature, sorted by IV in descending order. “target” is the target variable, and “iv_only” specifies whether to compute IV only.
Note: for large or high-dimensional datasets, it is recommended to set “iv_only=True”.
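For intuition on what the IV column measures, here is a minimal hand-rolled computation of IV from per-bin good/bad counts. The counts below are made up for illustration and the code is independent of toad's internals:

```python
import math

def iv_from_bins(goods, bads):
    """Information value from per-bin good/bad counts:
    IV = sum over bins of (good% - bad%) * ln(good% / bad%)."""
    g_tot, b_tot = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        gp, bp = g / g_tot, b / b_tot
        iv += (gp - bp) * math.log(gp / bp)
    return iv

# Hypothetical 3-bin feature whose bad rate rises sharply across bins
print(round(iv_from_bins(goods=[500, 300, 200], bads=[5, 10, 35]), 4))  # -> 1.3107
```

A feature whose bins separate goods from bads strongly gets a high IV; identical good/bad distributions give IV = 0.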
[7]:
toad.quality(data,'target',iv_only=True)[:15]
[7]:
iv | gini | entropy | unique | |
---|---|---|---|---|
var_b19 | 0.353043 | NaN | NaN | 88.0 |
var_b18 | 0.317603 | NaN | NaN | 46.0 |
var_d2 | 0.313443 | NaN | NaN | 411.0 |
var_d7 | 0.309985 | NaN | NaN | 95.0 |
var_b10 | 0.301111 | NaN | NaN | 15726.0 |
var_b17 | 0.240104 | NaN | NaN | 235.0 |
var_b16 | 0.231403 | NaN | NaN | 104.0 |
var_b24 | 0.226939 | NaN | NaN | 30928.0 |
var_b20 | 0.198655 | NaN | NaN | 34.0 |
var_b11 | 0.187306 | NaN | NaN | 239.0 |
var_l_19 | 0.160020 | NaN | NaN | 32240.0 |
var_b9 | 0.157585 | NaN | NaN | 197.0 |
var_l_68 | 0.150068 | NaN | NaN | 757.0 |
var_l_123 | 0.146634 | NaN | NaN | 10602.0 |
var_l_125 | 0.146274 | NaN | NaN | 6338.0 |
### II. How to use toad to fine-tune feature binning and conduct feature selection¶
3. toad.selection.select(dataframe, target=’target’, empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None):¶
Conducts preliminary feature selection based on missing percentage, IV, and correlation with other features. The parameters are:
- empty=0.9: features with a missing percentage greater than 90% are dropped;
- iv=0.02: features with an IV smaller than 0.02 are dropped;
- corr=0.7: when two or more features have a Pearson correlation greater than 0.7, the ones with lower IV are dropped;
- return_drop=False: if True, the function also returns a list of the dropped columns;
- exclude=None: a list of features to exclude from the algorithm, typically the ID and month columns.
As shown in the cell below, no features are dropped for missing values, most are dropped by the IV threshold, and several are dropped for high correlation. In the end, 32 features are chosen out of the initial 165.
[9]:
train_selected, dropped = toad.selection.select(train,target = 'target', empty = 0.5, iv = 0.05, corr = 0.7, return_drop=True, exclude=['APP_ID_C','month'])
print(dropped)
print(train_selected.shape)
{'empty': array([], dtype=float64), 'iv': array(['var_d1', 'var_d4', 'var_d8', 'var_d9', 'var_b5', 'var_b6',
'var_b7', 'var_l_1', 'var_l_2', 'var_l_3', 'var_l_4', 'var_l_5',
'var_l_6', 'var_l_7', 'var_l_8', 'var_l_10', 'var_l_12',
'var_l_14', 'var_l_15', 'var_l_16', 'var_l_17', 'var_l_18',
'var_l_21', 'var_l_23', 'var_l_24', 'var_l_25', 'var_l_26',
'var_l_27', 'var_l_28', 'var_l_29', 'var_l_30', 'var_l_31',
'var_l_32', 'var_l_33', 'var_l_34', 'var_l_35', 'var_l_37',
'var_l_38', 'var_l_39', 'var_l_40', 'var_l_41', 'var_l_42',
'var_l_43', 'var_l_44', 'var_l_45', 'var_l_47', 'var_l_49',
'var_l_51', 'var_l_53', 'var_l_55', 'var_l_56', 'var_l_57',
'var_l_59', 'var_l_61', 'var_l_62', 'var_l_63', 'var_l_65',
'var_l_67', 'var_l_70', 'var_l_72', 'var_l_75', 'var_l_76',
'var_l_77', 'var_l_78', 'var_l_79', 'var_l_80', 'var_l_81',
'var_l_82', 'var_l_83', 'var_l_84', 'var_l_85', 'var_l_86',
'var_l_87', 'var_l_88', 'var_l_90', 'var_l_92', 'var_l_93',
'var_l_94', 'var_l_95', 'var_l_96', 'var_l_97', 'var_l_98',
'var_l_100', 'var_l_102', 'var_l_104', 'var_l_106', 'var_l_108',
'var_l_109', 'var_l_110', 'var_l_112', 'var_l_114', 'var_l_116',
'var_l_117', 'var_l_118', 'var_l_120', 'var_l_122', 'var_l_124',
'var_l_126'], dtype=object), 'corr': array(['var_b27', 'var_b28', 'var_b4', 'var_l_105', 'var_b25', 'var_b2',
'var_l_113', 'var_l_46', 'var_b26', 'var_b8', 'var_b1',
'var_l_103', 'var_l_99', 'var_l_74', 'var_b14', 'var_l_13',
'var_l_22', 'var_b22', 'var_l_101', 'var_l_111', 'var_b12',
'var_l_69', 'var_b11', 'var_l_115', 'var_l_11', 'var_l_36',
'var_l_50', 'var_l_54', 'var_l_121', 'var_b15', 'var_l_73',
'var_l_66', 'var_l_125', 'var_b16', 'var_b24'], dtype=object)}
(43576, 34)
4. Fine binning¶
Toad’s binning functions support both categorical and numerical features. The class “toad.transform.Combiner()” is used to train bins; the procedure is:
- *initialise:* c = toad.transform.Combiner()
- *train binning:* c.fit(dataframe, y = ‘target’, method = ‘chi’, min_samples = None, n_bins = None, empty_separate = False)
- y: target variable;
- method: the binning method. Supports ‘chi’ (Chi-squared), ‘dt’ (decision tree), ‘kmeans’ (K-means), ‘quantile’ (equal frequency), and ‘step’ (equal width);
- min_samples: a number or a proportion - the minimum number / proportion of samples required in each bucket;
- n_bins: the number of buckets. If the number is too large to achieve, the algorithm returns the maximum number of buckets it can obtain;
- empty_separate: whether to put missing values in a separate bucket. If False, missing values are grouped into the bucket with the closest bad rate.
- *export binning results:* c.export()
- *adjust bins:* c.update(dict)
- *apply bins and convert to discrete values:* c.transform(dataframe, labels=False)
- labels: whether to convert the data to explanatory labels. When False, returns 0, 1, 2, … (categorical features are sorted in descending order of proportion); when True, returns labels such as (-inf, 0], (0, 10], (10, inf).
Note: 1. remember to exclude unwanted columns, especially the ID and timestamp columns. 2. Columns with many unique values may take a long time to train.
[11]:
# initialise
c = toad.transform.Combiner()
to_drop=['APP_ID_C','month']
# Train binning with the previously selected features; use Chi-squared binning, and require each bucket to contain at least 5% of the samples.
c.fit(train_selected, y = 'target', method = 'chi', min_samples = 0.05, exclude = to_drop) #empty_separate = False
# For demonstration purposes, only three binning results are shown.
print('var_d2:',c.export()['var_d2'])
print('var_d5:',c.export()['var_d5'])
print('var_d6:',c.export()['var_d6'])
var_d2: [747.0, 782.0, 820.0]
var_d5: [['O', 'nan', 'F'], ['M']]
var_d6: [['PUBLIC LTD COMPANIES', 'NON-RESIDENT INDIAN', 'PRIVATE LTD COMPANIES', 'PARTNERSHIP FIRM', 'nan'], ['RESIDENT INDIAN', 'TRUST', 'TRUST-CLUBS/ASSN/SOC/SEC-25 CO.', 'HINDU UNDIVIDED FAMILY', 'CO-OPERATIVE SOCIETIES', 'LIMITED LIABILITY PARTNERSHIP', 'ASSOCIATION', 'OVERSEAS CITIZEN OF INDIA', 'TRUST-NGO']]
5. Fine tune bins¶
The “toad.plot” module provides visualisation functions to help with adjustments.
- *In-sample visualisation:* toad.plot.bin_plot(dataframe, x = None, target = ‘target’)
The bars show the proportion of each bin, and the line shows the corresponding positive-sample proportion (e.g. bad rate).
- x: the feature column of interest
- target: target variable
[15]:
from toad.plot import bin_plot
# Check the bin results of 'var_d2' of in-sample
col = 'var_d2'
# It's recommended to set 'labels = True' for better visualisation.
bin_plot(c.transform(train_selected[[col,'target']], labels=True), x=col, target='target')
No handles with labels found to put in legend.
No handles with labels found to put in legend.
[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x13b033128>
- *OOT visualisation:* toad.plot.badrate_plot(dataframe, target = ‘target’, x = None, by = None)
Shows the positive rate of each bin over time.
- target: target variable
- x: time column; must contain strings
- by: the feature column of interest
Note: the time column must be preprocessed and converted to strings - timestamps are not supported.
[14]:
from toad.plot import badrate_plot
col = 'var_d2'
# Check the stability of 'var_d2''s bins across time
#badrate_plot(c.transform(train[[col,'target','month']], labels=True), target='target', x='month', by=col)
#badrate_plot(c.transform(OOT[[col,'target','month']], labels=True), target='target', x='month', by=col)
badrate_plot(c.transform(data[[col,'target','month']], labels=True), target='target', x='month', by=col)
'''
A feature is preferable if the gaps between bins widen over time - it means the bins differ more strongly. No line crossing means the bin results are stable.
'''
[14]:
'\nA feature is preferable if the gaps between bins widen over time - it means the bins differ more strongly. No line crossing means the bin results are stable.\n'
[12]:
# Check the bin results of var_d5 of in-sample
col = 'var_d5'
# It's recommended to set 'labels = True' for categorical features.
bin_plot(c.transform(train_selected[[col,'target']], labels=True), x=col, target='target')
No handles with labels found to put in legend.
No handles with labels found to put in legend.
[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2d1b5f60>
- *adjust bins:* c.update(dict)
Only the bins passed in are updated; the bins of other features are kept intact.
[16]:
# The IV is small; suppose we want to separate 'F' out to lift the IV.
# Set new bins
rule = {'var_d5':[['O', 'nan'],['F'], ['M']]}
# Pass new bins
c.update(rule)
# Re-check both in-sample and OOT stability.
bin_plot(c.transform(train_selected[['var_d5','target']], labels=True), x='var_d5', target='target')
badrate_plot(c.transform(OOT[['var_d5','target','month']], labels=True), target='target', x='month', by='var_d5')
No handles with labels found to put in legend.
No handles with labels found to put in legend.
[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1282486d8>
### III. WOE transformation¶
WOE transformation is applied after the binning is tuned and finalised. The procedure is as follows:
*Use the finalised Combiner to bin the data:* c.transform(dataframe, labels=False)
Only the binned features are transformed.
*Initialise the WOE transformer:* transer = toad.transform.WOETransformer()
*fit_transform:* transer.fit_transform(dataframe, target, exclude = None) - fits and applies the WOE transformation on in-sample data
- target: target values as a Series or DataFrame;
- exclude: columns not to be WOE-transformed. Note: 1. “fit_transform” fits and transforms all columns, even the un-binned ones - remember to exclude unwanted columns. 2. Always exclude the target column.
*Apply the WOE transformation, typically to test / OOT data:* transer.transform(dataframe)
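For intuition, the WOE value of a single bin can be sketched by hand. This uses one common sign convention, ln(bad% / good%); sign conventions differ between libraries and toad's internal convention may differ:

```python
import math

def bin_woe(good, bad, good_total, bad_total):
    """WOE of one bin under the ln(bad% / good%) convention:
    positive WOE means the bin holds a larger share of bads than of goods."""
    return math.log((bad / bad_total) / (good / good_total))

# Hypothetical bin holding 10% of all goods but 30% of all bads -> positive (riskier) WOE
print(round(bin_woe(good=100, bad=30, good_total=1000, bad_total=100), 4))  # -> 1.0986
```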
[14]:
# Initialise
transer = toad.transform.WOETransformer()
# transer.fit_transform() & combiner.transform(). Remember to exclude target
train_woe = transer.fit_transform(c.transform(train_selected), train_selected['target'], exclude=to_drop+['target'])
OOT_woe = transer.transform(c.transform(OOT))
print(train_woe.head(3))
APP_ID_C target var_d2 var_d3 var_d5 var_d6 var_d7 \
0 app_1 0 -0.178286 0.046126 0.090613 0.047145 0.365305
1 app_2 0 -1.410248 0.046126 -0.271655 0.047145 -0.734699
2 app_3 0 -0.178286 0.046126 0.090613 0.047145 0.365305
var_d11 var_b3 var_b9 ... var_l_60 var_l_64 var_l_68 var_l_71 \
0 -0.152228 -0.141182 -0.237656 ... 0.132170 0.080656 0.091919 0.150975
1 -0.152228 0.199186 0.199186 ... 0.132170 0.080656 0.091919 0.150975
2 -0.152228 -0.141182 0.388957 ... -0.926987 -0.235316 -0.883896 -0.385976
var_l_89 var_l_91 var_l_107 var_l_119 var_l_123 month
0 0.091901 0.086402 -0.034434 0.027322 0.087378 2019-03
1 0.091901 0.086402 -0.034434 0.027322 0.087378 2019-03
2 0.091901 -0.620829 -0.034434 -0.806599 -0.731941 2019-03
[3 rows x 34 columns]
### IV. Stepwise feature selection¶
- *toad.selection.stepwise(dataframe, target=’target’, estimator=’ols’, direction=’both’, criterion=’aic’, max_iter=None, return_drop=False, exclude=None):*
Stepwise regression feature selection; supports forward, backward, and both directions:
- estimator: the model to fit; supports 'ols', 'lr', 'lasso', 'ridge'
- direction: stepwise direction; supports 'forward', 'backward', 'both' (recommended)
- criterion: the selection criterion; supports 'aic', 'bic', 'ks', 'auc'
- max_iter: maximum number of iterations
- return_drop: whether to return a list of the dropped column names
- exclude: list of columns to exclude from the algorithm, such as the ID and time columns.
*Tip: generally, direction = ‘both’ produces the best results. Setting estimator = ‘ols’ and criterion = ‘aic’ keeps stepwise fast, and the results are sound for a subsequent logistic regression.*
[15]:
# Apply stepwise regression on the WOE-transformed data
final_data = toad.selection.stepwise(train_woe,target = 'target', estimator='ols', direction = 'both', criterion = 'aic', exclude = to_drop)
# Apply the selected features to the test / OOT sample
final_OOT = OOT_woe[final_data.columns]
print(final_data.shape) # Out of 31 features, stepwise regression selected 10 of them.
(43576, 13)
[16]:
# The final list of features for modelling
col = list(final_data.drop(to_drop+['target'],axis=1).columns)
- *toad.metrics.PSI(df_train, df_test):*
Outputs the PSI of each feature - used to check the OOT stability of the WOE-transformed features.
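For reference, PSI can be sketched by hand from the bin proportions of the two samples. The numbers below are illustrative and this is not toad's internal code:

```python
import math

def psi(expected_prop, actual_prop):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%).
    Rough rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 unstable."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_prop, actual_prop))

# Hypothetical bin distributions: train vs OOT
print(round(psi([0.3, 0.4, 0.3], [0.25, 0.45, 0.30]), 4))  # -> 0.015
```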
[17]:
toad.metrics.PSI(final_data[col], final_OOT[col])
[17]:
var_d2 0.000254
var_d5 0.000012
var_d7 0.000079
var_d11 0.000191
var_b10 0.000209
var_b18 0.000026
var_b19 0.000049
var_b23 0.000037
var_l_20 0.000115
var_l_68 0.000213
dtype: float64
### V. Model evaluation and validation¶
- Common evaluation metrics: toad.metrics.KS, F1, AUC
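For intuition on what toad.metrics.KS reports, the KS statistic can be sketched from scratch as the maximum gap between the cumulative bad and good distributions, scanning score thresholds from highest to lowest (a simplified version that ignores tied scores):

```python
def ks_stat(scores, labels):
    """KS = max |cumulative bad share - cumulative good share|,
    sweeping a threshold from the highest predicted score downwards."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    bad_cum = good_cum = ks = 0.0
    for _, y in pairs:
        if y == 1:
            bad_cum += 1 / n_bad
        else:
            good_cum += 1 / n_good
        ks = max(ks, abs(bad_cum - good_cum))
    return ks

# Perfectly separating scores give KS = 1
print(ks_stat([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # -> 1.0
```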
[18]:
# Build a logit
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(final_data[col], final_data['target'])
# Obtain predicted probability for training and OOT
pred_train = lr.predict_proba(final_data[col])[:,1]
pred_OOT_may =lr.predict_proba(final_OOT.loc[final_OOT.month == '2019-05',col])[:,1]
pred_OOT_june =lr.predict_proba(final_OOT.loc[final_OOT.month == '2019-06',col])[:,1]
pred_OOT_july =lr.predict_proba(final_OOT.loc[final_OOT.month == '2019-07',col])[:,1]
/Users/zhouxiyu/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
[19]:
from toad.metrics import KS, AUC
print('train KS',KS(pred_train, final_data['target']))
print('train AUC',AUC(pred_train, final_data['target']))
print('OOT results')
print('May KS',KS(pred_OOT_may, final_OOT.loc[final_OOT.month == '2019-05','target']))
print('June KS',KS(pred_OOT_june, final_OOT.loc[final_OOT.month == '2019-06','target']))
print('July KS',KS(pred_OOT_july, final_OOT.loc[final_OOT.month == '2019-07','target']))
train KS 0.3707986228750539
train AUC 0.75060723924743
OOT results
May KS 0.3686687175756087
June KS 0.3495273403486497
July KS 0.3796914199845523
*PSI can also be used to gauge the stability of the predicted probabilities*
[20]:
print(toad.metrics.PSI(pred_train,pred_OOT_may))
print(toad.metrics.PSI(pred_train,pred_OOT_june))
print(toad.metrics.PSI(pred_train,pred_OOT_july))
0.12760761722158315
0.1268648506657109
0.1268648506657109
- *toad.metrics.KS_bucket(predicted_proba, y_true, bucket=10, method = ‘quantile’):*
Outputs evaluation information on the binned predicted probabilities, including the probability range, number of samples, bad rate, and KS of each bucket.
- bucket: number of buckets
- method: binning method; 'quantile' (equal sample counts) and 'step' (equal widths) are recommended.
Uses: (1) the larger the difference in bad_rate between buckets, the better; (2) checking the monotonicity of score buckets; (3) finding the optimal cutoff point; (4) comparing the predictability of models.
[21]:
# Group the predicted scores in bins with same number of samples in each (i.e. "quantile" binning)
toad.metrics.KS_bucket(pred_train, final_data['target'], bucket=10, method = 'quantile')
[21]:
min | max | bads | goods | total | bad_rate | good_rate | odds | bad_prop | good_prop | total_prop | cum_bads | cum_goods | cum_total | cum_bads_prop | cum_goods_prop | cum_total_prop | ks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000275 | 0.003380 | 9 | 4332 | 4341 | 0.002073 | 0.997927 | 0.002078 | 0.009688 | 0.101578 | 0.099619 | 9 | 4332 | 4341 | 0.009688 | 0.101578 | 0.099619 | -0.091890 |
1 | 0.003398 | 0.005207 | 12 | 3585 | 3597 | 0.003336 | 0.996664 | 0.003347 | 0.012917 | 0.084062 | 0.082545 | 21 | 7917 | 7938 | 0.022605 | 0.185640 | 0.182164 | -0.163035 |
2 | 0.005207 | 0.008116 | 37 | 5071 | 5108 | 0.007244 | 0.992756 | 0.007296 | 0.039828 | 0.118906 | 0.117220 | 58 | 12988 | 13046 | 0.062433 | 0.304547 | 0.299385 | -0.242114 |
3 | 0.008125 | 0.010862 | 26 | 3854 | 3880 | 0.006701 | 0.993299 | 0.006746 | 0.027987 | 0.090370 | 0.089040 | 84 | 16842 | 16926 | 0.090420 | 0.394916 | 0.388425 | -0.304497 |
4 | 0.010868 | 0.014651 | 59 | 4759 | 4818 | 0.012246 | 0.987754 | 0.012398 | 0.063509 | 0.111590 | 0.110565 | 143 | 21601 | 21744 | 0.153929 | 0.506507 | 0.498990 | -0.352578 |
5 | 0.014661 | 0.019846 | 76 | 3901 | 3977 | 0.019110 | 0.980890 | 0.019482 | 0.081808 | 0.091472 | 0.091266 | 219 | 25502 | 25721 | 0.235737 | 0.597979 | 0.590256 | -0.362241 |
6 | 0.019858 | 0.025968 | 116 | 4665 | 4781 | 0.024263 | 0.975737 | 0.024866 | 0.124865 | 0.109386 | 0.109716 | 335 | 30167 | 30502 | 0.360603 | 0.707365 | 0.699972 | -0.346762 |
7 | 0.025986 | 0.032467 | 108 | 4188 | 4296 | 0.025140 | 0.974860 | 0.025788 | 0.116254 | 0.098202 | 0.098586 | 443 | 34355 | 34798 | 0.476857 | 0.805567 | 0.798559 | -0.328710 |
8 | 0.032484 | 0.044998 | 173 | 4187 | 4360 | 0.039679 | 0.960321 | 0.041318 | 0.186222 | 0.098178 | 0.100055 | 616 | 38542 | 39158 | 0.663079 | 0.903745 | 0.898614 | -0.240666 |
9 | 0.045115 | 0.370055 | 313 | 4105 | 4418 | 0.070847 | 0.929153 | 0.076248 | 0.336921 | 0.096255 | 0.101386 | 929 | 42647 | 43576 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
### VI. Scorecard transformation¶
- toad.ScoreCard(combiner = {}, transer = None, pdo = 60, rate = 2, base_odds = 20, base_score = 750, card = None, C = 0.1, **kwargs):
Converts a logistic regression into a standard scorecard. Supports passing in the parameters of a LogisticRegression directly.
- combiner: a pre-fitted toad.transform.Combiner instance
- transer: a pre-fitted toad.transform.WOETransformer instance
- pdo, rate, base_odds, base_score: score-scaling parameters.
e.g. pdo=60, rate=2, base_odds=20, base_score=750
means that a sample with odds of 20 scores the base score of 750, and the score changes by pdo=60 points each time the odds change by a factor of rate=2.
- card: supports passing in a pre-defined (expert) scorecard
- **kwargs: supports the parameters of a logistic regression class (i.e. sklearn.linear_model.LogisticRegression)
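The score-scaling arithmetic above can be illustrated with the usual points-to-double-odds formula. This is a sketch under the good:bad odds convention; toad's internal formula and sign convention may differ:

```python
import math

def prob_to_score(p_bad, pdo=60, rate=2, base_odds=20, base_score=750):
    """Points-to-double-odds scaling: good:bad odds of base_odds:1 map to
    base_score, and each doubling (rate=2) of the good odds adds pdo points."""
    factor = pdo / math.log(rate)
    offset = base_score - factor * math.log(base_odds)
    odds = (1 - p_bad) / p_bad          # good:bad odds
    return offset + factor * math.log(odds)

# At odds of 20:1 (p_bad = 1/21) the score equals base_score
print(round(prob_to_score(1 / 21)))     # -> 750
# Doubling the good odds to 40:1 adds pdo points
print(round(prob_to_score(1 / 41)))     # -> 810
```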
[22]:
card = toad.ScoreCard(
combiner = c,
transer = transer,
#class_weight = 'balanced',
#C=0.1,
#base_score = 600,
#base_odds = 35 ,
#pdo = 60,
#rate = 2
)
card.fit(final_data[col], final_data['target'])
/Users/zhouxiyu/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
[22]:
ScoreCard(base_odds=35, base_score=750, card=None,
combiner=<toad.transform.Combiner object at 0x1a2434fdd8>, pdo=60,
rate=2,
transer=<toad.transform.WOETransformer object at 0x1a235a5358>)
[23]:
# Output standard scorecard
card.export()
[23]:
{'var_d2': {'[-inf ~ 747.0)': 65.54,
'[747.0 ~ 782.0)': 45.72,
'[782.0 ~ 820.0)': 88.88,
'[820.0 ~ inf)': 168.3},
'var_d5': {'O,nan': 185.9, 'F': 103.26, 'M': 68.76},
'var_d7': {'LARGE FLEET OPERATOR,COMPANY,STRATEGIC TRANSPRTER,SALARIED,HOUSEWIFE': 120.82,
'DOCTOR-SELF EMPLOYED,nan,SAL(RETIRAL AGE 60),SERVICES,SAL(RETIRAL AGE 58),OTHERS,DOCTOR-SALARIED,AGENT,CONSULTANT,DIRECTOR,MEDIUM FLEETOPERATOR,TRADER,RETAIL TRANSPORTER,MANUFACTURING,FIRST TIME USERS,STUDENT,PENSIONER': 81.32,
'PROPRIETOR,TRADING,STRATEGIC CAPTIVE,SELF-EMPLOYED,SERV-PRIVATE SECTOR,SMALL RD TRANS.OPR,BUSINESSMAN,CARETAKER,RETAIL,AGRICULTURIST,RETIRED PERSONNEL,MANAGER,CONTRACTOR,ACCOUNTANT,BANKS SERVICE,GOVERNMENT SERVICE,ADVISOR,STRATEGIC S1,SCHOOLS,TEACHER,GENARAL RETAILER,RESTAURANT KEEPER,OFFICER,POLICEMAN,SERV-PUBLIC SECTOR,BARRISTER,Salaried,SALESMAN,RETAIL CAPTIVE,Defence (NCO),STRATEGIC S2,OTHERS NOT DEFINED,JEWELLER,SECRETARY,SUP STRAT TRANSPORT,LECTURER,ATTORNEY AT LAW,TAILOR,TECHNICIAN,CLERK,PLANTER,DRIVER,PRIEST,PROGRAMMER,EXECUTIVE ASSISTANT,PROOF READER,STOCKBROKER(S)-COMMD,TYPIST,ADMINSTRATOR,INDUSTRY,PHARMACIST,Trading,TAXI DRIVER,STRATEGIC BUS OP,CHAIRMAN,CARPENTER,DISPENSER,HELPER,STRATEGIC S3,RETAIL BUS OPERATOR,GARAGIST,PRIVATE TAILOR,NURSE': 55.79},
'var_d11': {'N': 88.69, 'U': 23.72},
'var_b10': {'[-inf ~ -8888.0)': 67.76,
'[-8888.0 ~ 0.548229531)': 97.51,
'[0.548229531 ~ inf)': 36.22},
'var_b18': {'[-inf ~ 2)': 83.72, '[2 ~ inf)': 39.23},
'var_b19': {'[-inf ~ -9999)': 70.78, '[-9999 ~ 4)': 97.51, '[4 ~ inf)': 42.2},
'var_b23': {'[-inf ~ -8888)': 64.51, '[-8888 ~ inf)': 102.69},
'var_l_20': {'[-inf ~ 0.000404297)': 78.55,
'[0.000404297 ~ 0.003092244)': 103.85,
'[0.003092244 ~ inf)': 36.21},
'var_l_68': {'[-inf ~ 0.000255689)': 70.63,
'[0.000255689 ~ 0.002045513)': 24.56,
'[0.002045513 ~ 0.007414983000000002)': 66.63,
'[0.007414983000000002 ~ 0.019943748)': 99.55,
'[0.019943748 ~ inf)': 142.36}}
### VII. Other functions¶
*toad.transform.GBDTTransformer*
GBDT encoding - preprocessing for the GBDT + LR technique.
[28]:
gbdt_transer = toad.transform.GBDTTransformer()
gbdt_transer.fit(final_data[col+['target']], 'target', n_estimators = 10, max_depth = 2)
/Users/zhouxiyu/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
[28]:
<toad.transform.GBDTTransformer at 0x1a2daf60f0>
[29]:
gbdt_vars = gbdt_transer.transform(final_data[col])
[31]:
gbdt_vars.shape
[31]:
(43576, 40)
[ ]: