{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# toad Tutorial \n", "\n", "Toad is a Python toolkit for professional model developers - part of its functionality is specific to scorecard development. The toad package is continuously being upgraded with new features. This tutorial introduces the key functionality, including:\n", "\n", "1. EDA-related functions \n", "2. how to use toad to fine tune feature binning and conduct feature selection\n", "3. WOE transformation \n", "4. stepwise feature selection \n", "5. model evaluation and validation\n", "6. scorecard transformation \n", "7. other functions\n", "\n", "-----------------\n", "-----------------\n", "\n", "### This tutorial will demonstrate how to use toad to model high-dimensional data efficiently. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Install and upgrade:***\n", "1. pip: !pip install toad\n", "2. conda: conda install toad --channel conda-forge\n", "3. update: !pip install -U toad; conda install -U toad --channel conda-forge\n", "\n", "***Feel free to open new issues on our [github](https://github.com/amphibian-dev/toad)***\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import toad" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\nPlease upgrade to the latest version\\n'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''\n", "Please upgrade to the latest version\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----------------------\n", "### 0. Load data\n", "\n", "The demo data has 167 columns, including one ID column, a target variable, and a month column. The feature columns contain both categorical and numerical features, and several have missing values. 
\n", "\n", "***This demo will showcase how toad can efficiently and effectively support model development on such a messy dataset.***\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape: (108940, 167)\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " APP_ID_C target var_d1 var_d2 var_d3 var_d4 \\\n", "0 app_1 0 Hit-6+ Vintage 816.0 RESIDENT INDIAN Post-Graduate \n", "1 app_2 0 NaN 841.0 RESIDENT INDIAN Post-Graduate \n", "2 app_3 0 Hit-6+ Vintage 791.0 RESIDENT INDIAN Post-Graduate \n", "3 app_4 0 Hit-6+ Vintage 821.0 RESIDENT INDIAN Graduate \n", "4 app_5 0 Hit-6+ Vintage 807.0 RESIDENT INDIAN Graduate \n", "5 app_6 0 Hit-6+ Vintage 788.0 RESIDENT INDIAN Others \n", "6 app_7 0 Hit-6+ Vintage 779.0 RESIDENT INDIAN Graduate \n", "7 app_8 0 Hit-6+ Vintage 801.0 RESIDENT INDIAN Post-Graduate \n", "8 app_9 0 Hit-6+ Vintage 815.0 RESIDENT INDIAN Graduate \n", "9 app_10 0 NaN 804.0 RESIDENT INDIAN Graduate \n", "\n", " var_d5 var_d6 var_d7 var_d8 ... var_l_118 \\\n", "0 M RESIDENT INDIAN SELF-EMPLOYED Y ... 0.0 \n", "1 F RESIDENT INDIAN SALARIED N ... 0.0 \n", "2 M RESIDENT INDIAN PROPRIETOR Y ... 0.0 \n", "3 M RESIDENT INDIAN SELF-EMPLOYED N ... 0.0 \n", "4 M RESIDENT INDIAN SALARIED Y ... 0.0 \n", "5 M RESIDENT INDIAN SALARIED N ... 0.0 \n", "6 M RESIDENT INDIAN ATTORNEY AT LAW Y ... 0.0 \n", "7 M RESIDENT INDIAN SAL(RETIRAL AGE 60) N ... 0.0 \n", "8 F RESIDENT INDIAN NaN Y ... 0.0 \n", "9 M RESIDENT INDIAN PROPRIETOR N ... 
0.0 \n", "\n", " var_l_119 var_l_120 var_l_121 var_l_122 var_l_123 var_l_124 var_l_125 \\\n", "0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 \n", "1 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 \n", "2 0.088235 0.0 0.100000 0.0 0.011494 0.5 0.000000 \n", "3 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 \n", "4 0.000000 0.0 0.000000 0.0 0.540541 0.0 0.285714 \n", "5 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 \n", "6 0.722222 0.0 0.777778 0.0 0.380952 0.0 0.571429 \n", "7 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 \n", "8 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 \n", "9 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 \n", "\n", " var_l_126 month \n", "0 0.0 2019-03 \n", "1 0.0 2019-03 \n", "2 0.0 2019-03 \n", "3 0.0 2019-03 \n", "4 0.0 2019-03 \n", "5 0.0 2019-03 \n", "6 0.0 2019-03 \n", "7 0.0 2019-03 \n", "8 0.0 2019-03 \n", "9 0.0 2019-03 \n", "\n", "[10 rows x 167 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('train.csv')\n", "print('Shape:',data.shape)\n", "data.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The dataset contains monthly data from Mar, 2019 - Jul, 2019. We will use Mar and Apr data as training sample and May, Jun, Jul data as out-of-time (OOT) sample." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "month: ['2019-03' '2019-04' '2019-05' '2019-06' '2019-07']\n" ] } ], "source": [ "print('month:',data.month.unique())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train size: (43576, 167) \n", "OOT size: (65364, 167)\n" ] } ], "source": [ "# Mar and Apr form the training sample; the remaining months form the OOT sample\n", "is_train = data.month.isin(['2019-03','2019-04'])\n", "train = data.loc[is_train,:]\n", "OOT = data.loc[~is_train,:]\n", "\n", "print('train size:',train.shape,'\\nOOT size:',OOT.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----------------------\n", "### I. EDA functions\n", "\n", "#### 1. toad.detect(dataframe): \n", "\n", "Runs exploratory data analysis (EDA) and returns summary statistics and other information. Each row of the output summarises one column of the input. The fields that deserve the most attention are: the share of missing values, the number of unique values, the mean for numerical features, the mode for categorical features, etc. From the cell below, the takeaways include:\n", "\n", "a. positive samples account for about 2.1% of the training sample: the mean of the target column is 0.0213191\n", "\n", "b. several features have missing values, in differing amounts: see the missing column.\n", "\n", "c. there are both categorical and numerical features. Several categorical features have many unique values - from 10 to even 84: see the unique column for features of type==object." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " type size missing unique mean_or_top1 \\\n", "APP_ID_C object 43576 0.00% 43576 app_36227:0.00% \n", "target int64 43576 0.00% 2 0.0213191 \n", "var_d1 object 43576 37.57% 2 Hit-6+ Vintage:60.32% \n", "var_d2 float64 43576 5.44% 389 570.492 \n", "var_d3 object 43576 5.31% 6 RESIDENT INDIAN:94.00% \n", "var_d4 object 43576 1.08% 5 Graduate:55.30% \n", "var_d5 object 43576 1.08% 3 M:79.70% \n", "var_d6 object 43576 1.08% 13 RESIDENT INDIAN:93.34% \n", "var_d7 object 43576 1.60% 84 SALARIED:31.43% \n", "var_d8 object 43576 1.08% 2 Y:59.90% \n", "\n", " std_or_top2 min_or_top3 \\\n", "APP_ID_C app_29819:0.00% app_35476:0.00% \n", "target 0.144447 0 \n", "var_d1 Hit-lt 6 Vinta:2.10% None \n", "var_d2 355.565 -1 \n", "var_d3 NON-RESIDENT INDIAN:0.64% PARTNERSHIP FIRM:0.02% \n", "var_d4 Post-Graduate:21.57% Others:10.71% \n", "var_d5 F:14.33% O:4.89% \n", "var_d6 PRIVATE LTD COMPANIES:2.57% PARTNERSHIP FIRM:1.45% \n", "var_d7 PROPRIETOR:31.31% SELF-EMPLOYED:10.74% \n", "var_d8 N:39.03% None \n", "\n", " 1%_or_top4 10%_or_top5 \\\n", "APP_ID_C app_10104:0.00% app_35794:0.00% \n", "target 0 0 \n", "var_d1 None None \n", "var_d2 -1 -1 \n", "var_d3 PRIVATE LTD COMPANIES:0.02% PUBLIC LTD COMPANIES:0.00% \n", "var_d4 Under Graduate:10.67% Professional:0.67% \n", "var_d5 None None \n", "var_d6 PUBLIC LTD COMPANIES:0.73% NON-RESIDENT INDIAN:0.64% \n", "var_d7 OTHERS:6.40% FIRST TIME USERS:2.72% \n", "var_d8 None None \n", "\n", " 50%_or_bottom5 75%_or_bottom4 \\\n", "APP_ID_C app_25789:0.00% app_36858:0.00% \n", "target 0 0 \n", "var_d1 None None \n", "var_d2 778 810 \n", "var_d3 NON-RESIDENT INDIAN:0.64% PARTNERSHIP FIRM:0.02% \n", "var_d4 Graduate:55.30% Post-Graduate:21.57% \n", "var_d5 None None \n", "var_d6 CO-OPERATIVE SOCIETIES:0.01% LIMITED LIABILITY PARTNERSHIP:0.00% \n", "var_d7 NURSE:0.00% PHARMACIST:0.00% \n", "var_d8 None None \n", "\n", " 90%_or_bottom3 99%_or_bottom2 \\\n", "APP_ID_C app_12750:0.00% app_24:0.00% \n", "target 0 1 \n", "var_d1 
None Hit-6+ Vintage:60.32% \n", "var_d2 832 864 \n", "var_d3 PRIVATE LTD COMPANIES:0.02% PUBLIC LTD COMPANIES:0.00% \n", "var_d4 Others:10.71% Under Graduate:10.67% \n", "var_d5 M:79.70% F:14.33% \n", "var_d6 ASSOCIATION:0.00% TRUST-NGO:0.00% \n", "var_d7 RETAIL BUS OPERATOR:0.00% PRIVATE TAILOR:0.00% \n", "var_d8 None Y:59.90% \n", "\n", " max_or_bottom1 \n", "APP_ID_C app_13004:0.00% \n", "target 1 \n", "var_d1 Hit-lt 6 Vinta:2.10% \n", "var_d2 900 \n", "var_d3 OVERSEAS CITIZEN OF INDIA:0.00% \n", "var_d4 Professional:0.67% \n", "var_d5 O:4.89% \n", "var_d6 OVERSEAS CITIZEN OF INDIA:0.00% \n", "var_d7 TAXI DRIVER:0.00% \n", "var_d8 N:39.03% " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "toad.detect(train)[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. toad.quality(dataframe, target='target', iv_only=False):\n", "\n", "Returns the IV (information value), Gini, entropy and number of unique values of each feature, sorted by IV in descending order. \"target\" is the target variable, and \"iv_only\" specifies whether to calculate IV only. \n", "\n", "\n", "Note: setting \"iv_only=True\" is recommended for large or high-dimensional datasets. The gini and entropy columns then show NaN, as in the output below. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " iv gini entropy unique\n", "var_b19 0.353043 NaN NaN 88.0\n", "var_b18 0.317603 NaN NaN 46.0\n", "var_d2 0.313443 NaN NaN 411.0\n", "var_d7 0.309985 NaN NaN 95.0\n", "var_b10 0.301111 NaN NaN 15726.0\n", "var_b17 0.240104 NaN NaN 235.0\n", "var_b16 0.231403 NaN NaN 104.0\n", "var_b24 0.226939 NaN NaN 30928.0\n", "var_b20 0.198655 NaN NaN 34.0\n", "var_b11 0.187306 NaN NaN 239.0\n", "var_l_19 0.160020 NaN NaN 32240.0\n", "var_b9 0.157585 NaN NaN 197.0\n", "var_l_68 0.150068 NaN NaN 757.0\n", "var_l_123 0.146634 NaN NaN 10602.0\n", "var_l_125 0.146274 NaN NaN 6338.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "toad.quality(data,'target',iv_only=True)[:15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----------------------\n", "### II. how to use toad to fine tune feature binning and conduct feature selection\n", "\n", "#### 3. toad.selection.select(dataframe, target='target', empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None):\n", "\n", "Conducts preliminary feature selection based on missing percentage, IV and correlation with other features. The parameters are:\n", "\n", "(1) empty=0.9: features with a missing percentage larger than 90% are dropped;\n", "\n", "(2) iv=0.02: features with IV smaller than 0.02 are dropped;\n", "\n", "(3) corr=0.7: if two or more features have a Pearson correlation larger than 0.7, the ones with lower IV are dropped;\n", "\n", "(4) return_drop=False: if set to True, the function also returns the dropped columns, grouped by the reason they were dropped;\n", "\n", "(5) exclude=None: the list of columns to be excluded from the algorithm, typically the ID column and the month column. \n", "\n", "The cell below uses stricter thresholds than the defaults (empty=0.5, iv=0.05). As its output shows, no features are dropped for missing values, most dropped features fall below the IV threshold, and several more are dropped for high correlation. In the end, 34 columns remain out of the initial 167, counting the target and the two excluded columns. 
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'empty': array([], dtype=float64), 'iv': array(['var_d1', 'var_d4', 'var_d8', 'var_d9', 'var_b5', 'var_b6',\n", " 'var_b7', 'var_l_1', 'var_l_2', 'var_l_3', 'var_l_4', 'var_l_5',\n", " 'var_l_6', 'var_l_7', 'var_l_8', 'var_l_10', 'var_l_12',\n", " 'var_l_14', 'var_l_15', 'var_l_16', 'var_l_17', 'var_l_18',\n", " 'var_l_21', 'var_l_23', 'var_l_24', 'var_l_25', 'var_l_26',\n", " 'var_l_27', 'var_l_28', 'var_l_29', 'var_l_30', 'var_l_31',\n", " 'var_l_32', 'var_l_33', 'var_l_34', 'var_l_35', 'var_l_37',\n", " 'var_l_38', 'var_l_39', 'var_l_40', 'var_l_41', 'var_l_42',\n", " 'var_l_43', 'var_l_44', 'var_l_45', 'var_l_47', 'var_l_49',\n", " 'var_l_51', 'var_l_53', 'var_l_55', 'var_l_56', 'var_l_57',\n", " 'var_l_59', 'var_l_61', 'var_l_62', 'var_l_63', 'var_l_65',\n", " 'var_l_67', 'var_l_70', 'var_l_72', 'var_l_75', 'var_l_76',\n", " 'var_l_77', 'var_l_78', 'var_l_79', 'var_l_80', 'var_l_81',\n", " 'var_l_82', 'var_l_83', 'var_l_84', 'var_l_85', 'var_l_86',\n", " 'var_l_87', 'var_l_88', 'var_l_90', 'var_l_92', 'var_l_93',\n", " 'var_l_94', 'var_l_95', 'var_l_96', 'var_l_97', 'var_l_98',\n", " 'var_l_100', 'var_l_102', 'var_l_104', 'var_l_106', 'var_l_108',\n", " 'var_l_109', 'var_l_110', 'var_l_112', 'var_l_114', 'var_l_116',\n", " 'var_l_117', 'var_l_118', 'var_l_120', 'var_l_122', 'var_l_124',\n", " 'var_l_126'], dtype=object), 'corr': array(['var_b27', 'var_b28', 'var_b4', 'var_l_105', 'var_b25', 'var_b2',\n", " 'var_l_113', 'var_l_46', 'var_b26', 'var_b8', 'var_b1',\n", " 'var_l_103', 'var_l_99', 'var_l_74', 'var_b14', 'var_l_13',\n", " 'var_l_22', 'var_b22', 'var_l_101', 'var_l_111', 'var_b12',\n", " 'var_l_69', 'var_b11', 'var_l_115', 'var_l_11', 'var_l_36',\n", " 'var_l_50', 'var_l_54', 'var_l_121', 'var_b15', 'var_l_73',\n", " 'var_l_66', 'var_l_125', 'var_b16', 'var_b24'], dtype=object)}\n", "(43576, 34)\n" ] } ], "source": 
[ "train_selected, dropped = toad.selection.select(train,target = 'target', empty = 0.5, iv = 0.05, corr = 0.7, return_drop=True, exclude=['APP_ID_C','month'])\n", "print(dropped)\n", "print(train_selected.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. Fine binning\n", "\n", "Toad's binning supports both categorical and numerical features. The class \"toad.transform.Combiner()\" is used to train the bins; the procedure is as follows:\n", "\n", "(1) ***initialise***: c = toad.transform.Combiner()\n", "\n", "(2) ***train binning***: c.fit(dataframe, y = 'target', method = 'chi', min_samples = None, n_bins = None, empty_separate = False) \n", "\n", " - y: target variable;\n", "\n", " - method: the binning method. Supported values are 'chi' (Chi-squared), 'dt' (decision tree), 'kmeans' (K-means), 'quantile' (equal-frequency bins), and 'step' (equal-width bins);\n", "\n", " - min_samples: a count or a proportion: the minimum number / proportion of samples required in each bucket;\n", " \n", " - n_bins: minimum number of buckets. If that many buckets cannot be produced, the algorithm returns as many as it can;\n", "\n", " - empty_separate: whether to put missing values in a separate bucket. If False, missing values are merged into the bucket with the closest bad rate. \n", "\n", "\n", "(3) ***binning results***: c.export()\n", "\n", "(4) ***adjust bins***: c.update(dict)\n", "\n", "(5) ***apply bins and convert to discrete values***: c.transform(dataframe, labels=False):\n", "\n", " - labels: whether to convert the data to readable bin labels. When False, returns bin indices 0, 1, 2, ...; the bins of categorical features are ordered by proportion, descending. When True, returns interval labels such as (-inf, 0], (0, 10], (10, inf).\n", " \n", "Note: 1. Remember to exclude unwanted columns, especially the ID column and the timestamp column. 2. Columns with a large number of unique values may take a long time to train. 
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "var_d2: [747.0, 782.0, 820.0]\n", "var_d5: [['O', 'nan', 'F'], ['M']]\n", "var_d6: [['PUBLIC LTD COMPANIES', 'NON-RESIDENT INDIAN', 'PRIVATE LTD COMPANIES', 'PARTNERSHIP FIRM', 'nan'], ['RESIDENT INDIAN', 'TRUST', 'TRUST-CLUBS/ASSN/SOC/SEC-25 CO.', 'HINDU UNDIVIDED FAMILY', 'CO-OPERATIVE SOCIETIES', 'LIMITED LIABILITY PARTNERSHIP', 'ASSOCIATION', 'OVERSEAS CITIZEN OF INDIA', 'TRUST-NGO']]\n" ] } ], "source": [ "# initialise\n", "c = toad.transform.Combiner()\n", "\n", "to_drop=['APP_ID_C','month']\n", "# Fit the bins on the features selected above, using Chi-squared binning and requiring each bucket to hold at least 5% of the samples.\n", "c.fit(train_selected, y = 'target', method = 'chi', min_samples = 0.05, exclude = to_drop) #empty_separate = False\n", "\n", "# For demonstration purposes, show only 3 of the binning results.\n", "bins = c.export()\n", "print('var_d2:',bins['var_d2'])\n", "print('var_d5:',bins['var_d5'])\n", "print('var_d6:',bins['var_d6'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5. Fine tune bins\n", "\n", "The \"toad.plot\" module provides visualisation functions to help adjust the bins.\n", "\n", "(1) ***In-sample visualisation***: toad.plot.bin_plot(dataframe, x = None, target = 'target')\n", "\n", "The bars show the proportion of samples in each bin, and the line shows the corresponding positive sample proportion (e.g. 
bad rate).\n", "\n", " - x: the feature column of interest\n", " \n", " - target: target variable" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No handles with labels found to put in legend.\n", "No handles with labels found to put in legend.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from toad.plot import bin_plot\n", "\n", "# Check the in-sample binning results of 'var_d2'\n", "col = 'var_d2'\n", "\n", "# It's recommended to set 'labels = True' for better visualisation.\n", "bin_plot(c.transform(train_selected[[col,'target']], labels=True), x=col, target='target')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(2) ***OOT visualisation:*** toad.plot.badrate_plot(dataframe, target = 'target', x = None, by = None)\n", "\n", "Shows the positive rate of each bin over time.\n", "\n", " - target: target variable\n", " \n", " - x: time column, must be of string type\n", " \n", " - by: feature column of interest\n", "\n", "Note: the time column must be preprocessed and converted to string - timestamps are not supported " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\nA feature is preferable if the gaps between classes get wider as time goes by - it means the binned classes differ more. No line crossing means the bin results are stable.\\n'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from toad.plot import badrate_plot\n", "\n", "col = 'var_d2'\n", "\n", "# Check the stability of the 'var_d2' bins over time\n", "#badrate_plot(c.transform(train[[col,'target','month']], labels=True), target='target', x='month', by=col)\n", "#badrate_plot(c.transform(OOT[[col,'target','month']], labels=True), target='target', x='month', by=col)\n", "\n", "badrate_plot(c.transform(data[[col,'target','month']], labels=True), target='target', x='month', by=col)\n", "'''\n", "A feature is preferable if the gaps between classes get wider as time goes by - it means the binned classes differ more. No line crossing means the bin results are stable.\n", "'''" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No handles with labels found to put in legend.\n", "No handles with labels found to put in legend.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAwIAAAF2CAYAAADdkC9GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAgAElEQVR4nOzdeZTcVZ3//2dX70v2dDaSkA0uZCUhGCAsQRYTQBRHmd/wFXWUTdEhjMOIOhwUI+iwqowYv4o4+lVGOIOiY2QZR0UGMElnXy5JZwFJQrbOnl6qq35/VCd0miydkEolXc/HOTl2fe69n3rXOZLUqz93KUin00iSJEnKL4lcFyBJkiTp2DMISJIkSXnIICBJkiTlIYOAJEmSlIcMApIkSVIeKsp1AUegE/BRYCnQlONaJEmS1DEVA6cBPwW257iWrDgRg8BHge/mughJkiTljUdzXUA2nIhBYCnAtm27aW5O5boWSZIkdUCFhQk6dy6Hlu+eHdGJGASaAJqbUyST+w8CX//6V9i9exfTpv0rt9xyA+Xl5dx//7ff0W/z5k1cffXl3HPP/UyceP4h33jVqpXcf/+9LF26mF69enPTTbdw4YXvPWD/jRs3cN9991BTM5vOnTtz3XWf4IMf/PB++37lK19m48YNPPLI99/5gZuauP7667jxxlv2qXPBgnl8+tOf2qdveXk5zz//4jvuMX36v/H887/jqad+fcjPKUmSpL067FT0Dr9YePLkK5g16y9s27btHW3//d/P06lTZyZMOOeQ92loaODzn/8cAweezA9/+FPe//6r+cpXvsyyZfGAY774xX+ioKCA6dN/xPXX38y3v/0QL730zi/pL774B1544dkDvu9dd32J2trl72hbtWolQ4cO41e/+t3eP7/4xa/e0W/p0iX87Gf/fsjPKEmSpPzR4YPARRddQiJRyIsv/uEdbS+88CyXXPI+iooO/WDkf/7nBZqamvjHf/wCJ588iL/7u49y7rnn8dRT/7Hf/vPmzeG115bypS/dxZAhQ5ky5UquvvrD/OIXP9un37ZtW3nggW8yatSYd9wjxqXccMPHWLPmzf2+x8qVtQwaNIQePXru/dOtW/d9+jQ1NXHPPV/Z7/0lSZKUvzp8EKiqqmLixPP5n/95YZ/ra9euYdGiBUyefMXe1+edN57f/nb/U2cWLJjHiBGj9gkNZ5xxJvPnz91v//nz5zFkyFA6d+6y99rYseNYsGA+qdTbU5oefvh+3vveSxgxYtQ77jF79kwmTryA6dMf2+97rFy5gpNPHrT/D97iRz/6v5x00gAuuujig/aTJElSfunwQQDgfe+7/B3Tg1544VkGDRrCaaedDkCvXr351a9+x8UXX7rfe2zYsIHq6up9rvXo0ZMNG9bvt//Gjeupru71jv6NjQ1s3boVgD//+Y8sWrSAG2+8Zb/3uPba67jpplsoLS3bb/vKlStYvnwZH//433H11Zfz1a/+C5s2bdzbHuNSnnnmaf7pn+7Y73hJkiTlr7wIAmeffS5VVVX7TA964YVnmTz58r2vCwsL6dGj5wG/dDc01FNcXLLPtZKSYpqa9r9+pL6+nuLi4n2u7Rnf1NTItm3buP/+b/CFL/wLZWX7f8+D2bFjBxs3biCZTHLHHf/CnXfezdq1a/j85/+BZDK5d0rQLbfcSo8ePQ/7/pIkSerYTsRdgw5bUVERF198Gb///QtcccVVrFy5gpUrV3DZZVPafY/S0lKamhr3udbY2HTA4FBaWkpd3eZ9ru0ZX1ZWxre+dR/nnXcB48aNP8xPk1FVVcWzz/6BsrJyCgsLAbjnnvv44AenMGfObObOraFnz15MmXLlEd1fkiRJHVteBAHITA/6zGeuZ9u2rTz//O8YO3Y8vXr1bvf46upe+0y7Adi0aeM7pgu93b83CxcueEf/srIyqqo68eyzMygtLeXZZ38LZBb1plIpLr30fH7ykyfp06f
PIWuqrKza53X37j3o3LkLGzas57nnZrBp00YuvTSz3WgymSSZTHLppedz//3fZsyYse3+7JIkSep48iYIDB8+kn79TuKll17kj3/8PR/96CcOa/zIkaP53vceIZlM7l0wPHfubEaOHL3f/qNGjeaxx6azbds2OnfuDMCcOTWMGDGKRCLBE088vU//n/70cWprl3PXXdPo2fPQU3kWLlzAbbfdwk9+8ou9oWHdunVs3bqFk08ezHe+M51kMrm3/7PP/pbf/OZXfOc70w8YXiRJkpQ/8iYIAFx22RSefPLnrF//FpMm7buLTnNzM1u21FFVVbXf6T6TJl3M97//Xb75zWlce+3HePXV/+Xll19i+vTH9/bZtGkj5eUVVFRUMHr0GQwePJSvfvVf+Mxn/oHXXlvK008/xde//k0A+vcfsM/9q6o6UVpa+o7rB3LqqYGePXty771387nP3UZjYwMPP3w/Y8eeyYgRI9/Rv2vXrhQWFrb7/pIkSWq/EML1wK3AduDaGOOqVm09gCeAvsATMcZpIYQi4H7gfDLrdm+IMc4KIQTgFWDP+IdijFk5ECovFgvv8b73Xc6yZa9xwQUXUV5evk/b+vVv8YEPTOa///v5/Y4tLy/nvvu+xRtvvM6nPvVRfv3rX3L33fdyyimn7u3zgQ9M5uc//wkAiUSCe+65n3Q6zQ03fJwf/nA6t912O+ecc95R+SwlJSXcf/+3qago57OfvZF//MfPMmDAQKZN++ZRub8kSZLaJ4RQDdwBTAC+BjzYpsudwC+B0cAVIYTRwMnArBjjmcC/AN9q6dsF+GWMcWzLn6ydCluQTqezde9sOQ94sa5uJ8lk6pCdJUmSpMNVVJSgW7dKyPzG/s8H6xtC+D/AVTHGvw0hJIC1QJ8YY7qlfVlL+5IQwueBkhjjva3GdwEWxhgHhBAuBS6PMd6WnU/2tryaGiRJkiQdjhkzZvSZOnXqoDaXt8QYt7R63ReIADHGVAihDugObGpp7wXUtvz8JjCxzf0mADUtP3cBLgkhzAFWAzfHGNcdjc/SVl5NDZIkSZIOx3333fcksLLNn6n76dr6e3UnoO20m4L9tbWsFfgKmfUCAL8DPgacA7wBTHtXH+AgDAKSJEnSAdx+++0fAQa3+fNwm25rgAAQQugEdAPqWrW/BQxr+Tm09N/jB8CMGOOLADHGHTHGOTHGeuAx4LSj+oFacWqQJEmSdABTpkxZN2XKlFWH6PYccHcIoQKYBMwArgkh9IsxPgT8BrgohLAEuBD4JEAI4Rtk1gt8bc+NQgjXtPSvB64CZh3dT/Q2g4AkSZL0LsQYN4YQ7gFepWX7UOBDwKCWLtPIbB96M/DzGOOCEMIU4AvA7Jb1AAD/ADQBzwO9gYXAJ7JVt7sGSZIkSW0czq5BJyrXCEiSJEl5yCAgSZIk5SHXCEiSpKxKpVI88MA3WL58GcXFxdxxx5307z9gb/szzzzNr371nxQWFvLxj3+KiRPPZ926ddx77900NycB+Od//hIDBw7K0SeQOiafCEiSpKx68cU/0NjYyPTpP+Lmmz/HI488tLdt06aNPPXUEzz66A958MFHmD79ERobG/nBDx7lb/7mGh555Ptcd93f873v/VsOP4HUMflEQJIkZdX8+XOZMOEcAEaOHMXSpUv2ti1ZsohRo8ZQUlJCSUkJJ500gNraZXz2s7dRVVUFQHNzMyUlJTmpXerIfCIgSZKyaufOnVRWVu19nUgkSCaT+22rqKhgx44ddO3alaKiIl5/fRX/9m8P88lP3nDM65Y6Op8ISJKkrKqsrGTXrl17X6fTaYqKivbbtmvXLjp16gRATc0sHnjgG9x5592uD5CywCcCkiQpq0aNGsMrr7wEwMKFCxgyZNjettNPH8H8+XNoaGhgx44drF69ksGDh1JTM4tvfet+HnjgO5x22vBclS51aB4oJkmSsmrPrkG1tctJp9N86Ut38fLLf6Z//wGcd96FPPPM0zzzzNO
kUik+9rG/Z9Kki/n4x/+OpqZGunfvAcDAgSfzz//85Rx/EuWTfDhQzCAgSZIktZEPQcA1ApIkKasOdY7Az372E1544VkSiQTXXff3XHjhRTQ01HP33XdSV1dHRUUFX/7yV+nWrVsOP4XU8bhGQJIkZdXBzhHYvn07Tz31BNOn/4gHH3yEb3/7AQCefvophgwZxne/+wMmT76CH//4h7kqX+qwDAKSJCmrDnaOQHl5OX369GX37t3U1+8mkUi0jJnHhAnnAnD22ROZNesvx75wqYNzapAkScqqA50jsGcL0V69enPddR+huTnFddd9Yu+YPQeKVVRUsHPnjmNet9TRGQQkSVJWHewcgVdeeYlNmzbyi188A8DnP/85Ro0a0zJmJ5A5W2BPKJB09Dg1SJIkZdWBzhFIp9N0SzYxubSEXY88SMEbq6mqqmLHjh2MGjWGl1/OjHnllZcYM2ZszuqXOiq3D5UkSVnV9hyBL3/uNt584XcM2LaVsu3bSAGxqYln6hsZNGoMn/nMP9DQ0MC0aXexadNGiouLueuuafTo0TPXH0V5JB+2DzUISJKkrEvt3kXj/Lk0zJlF04rlkE5TdPJgSseNp3TMOBKVTv3R8SUfgoBrBCRJUlakk000Ll1MQ80sGpcshGSSRM9qKi6dQunY8RT2rM51iVJey2oQCCFcD9wKbAeujTGuatXWB3gS6A/8Cfj7GKO/4pck6QSWTqVIrlpBw5xZNMybQ3r3Lgoqqyg7eyKl486iqP9ACgoKcl2mJLIYBEII1cAdwGjgQuBB4EOtutwKPAvcA/wUuBh4Plv1SJKk7Em+tY6Gmpk0zJlFqm4zFJdQOnI0pePGU3zKaRQUFua6REltZPOJwGXA7BjjrhDCs8DjIYSCGOOeRQnbgDdijKkQwmygMYu1SJKkoyy1bSsNc2ZTP2cWzW++AQUFFJ9yGhXvu5LSkaMpKC3NdYmSDiKbQaAvEAFavuzXAd2BTS3tjwK/DiF0AgLwcNsbhBC6Al1bX5s+fXqfSZMmZbFsSZJ0IKn6ehoXzqOhZhZNy2Nm0W//gVRe9SFKx5xJonPnXJcoqZ2yvVi49TkFnYDWWxRNAZ4DSoHxwABgVZvxU4G7Wl+YPn06kyZN2rOKW5IkZVk6mWTnokVsffllttfUkG5spLhnT3pceSVdzjmH0n79cl2ipCOQzSCwBpgA0PJb/25AXav224EJMcamEMIW4GYyawpaexh4vPWFm266aTzwpNuHStKJp0u3CkqKnCt+Ikin09SvWMHW//1ftv3lLzRv305hZSVdJk6ky7nnUj5smIt+dcw0JpvZWrfr0B2Polbbh3ZY2QwCzwF3hxAqgEnADOCaEEK/GONDQGfgTOAV4BSgvO0NYoxbgC1tLvfPYs2SpCwqKSrkoT8tyXUZOoiKbXX0rV1Ev9rFVG6ro7mwkPUDhrFmwgg2njSEdGEhrE3C2qW5LlV55LYLTs91CR1S1oJAjHFjCOEe4FVatg8ls2vQoJYunwC+3xIUlgMfy1YtkiTpwIp376LvyiX0q11E1w1rSQOb+w5kxeizeWtQIFniol+pI8rqGoEY42PAY60uPdiq7c9kthaVJEnHWCLZRK/Xl9GvdjE9/7qCRDrNtm7VLD1rEmuHnE5DpYt+pY7Ok4UlScoXqRQ91q6mX+1ieq96jaJkI7srO7Fq5HtYM3Q4O7r3ynWFko4hg4AkSR1ZOk2nzevpt3wRfVcsoWz3DpqKS1g75DTWDh3B5j4DwEW/Ul4yCEiS1AGVb99K3xWL6Ve7iKotm0glEmzoP5QlQ4ezYcAwUkV+BZDynX8LSJLUQRQ37Kb3yki/2kV0f+uvAGzu3Z9F576PdYMDTaXv2KBPUh4zCEiSdAJLJJNU/7WWfssXUf3XWhKpFDu6dOe1ceezduhwdnfqmusSJR2nDAKSJJ1o0mm6rXuDfrWL6LMqUtzYQEN5Ja+fPo41Q0ewrUdv5/1LOiSDgCRJJ4iqug0
ti34XU75zO8miYt46+VTWDBvB5r4nk04kcl2ipBOIQUCSpONY6c7tLYt+F9N583pSBQVsPGkwr42fxPqBw2guLsl1iZJOUAYBSZKOM4WNDfRZFelbu5gea1dTAGzp2ZfFZ1/CusGn0VhemesSJXUABgFJko4DBc3N9HxzJf1qF9Hr9eUUNifZ2akrtWdMZM3Q4ezq0j3XJUrqYAwCkiTlSjpN1/Vv0q92MX1WLqWkYTeNZeX89dTRrBk6nK3V/Vz0KylrDAKSJB1jlVs30bc2M++/YvsWmguLWD/wFNYMG8HGkwaRThTmukRJecAgIEnSMVCyeyd9Vyyhb+1ium5cSxrY1O9klp9xLm+dfCrNJaW5LlFSnjEISJKUJYVNjfR6fRn9li+ix5pVJNJptvbozdL3XMTaIafTUNEp1yVKymMGAUmSjqKCVIoea1ZlFv2uXkZRsondlZ1ZOWoCa4cOZ0e36lyXKEmAQUCSpHcvnabzpnUth30tobR+F00lpawdOpw1Q0dQ17u/i34lHXcMApIkHaHybVvot2IRfWsXU7V1M6lEIesHDGXN0BFsGDCEdKH/zEo6fvk3lCRJh6G4fjd9Vi6lX+0iuq1/E4DNfQawcOR7WDcokCwty3GFktQ+BgFJkg4hkWyi1+vL6btiMdVvrCCRTrG9aw/imReyduhw6qs657pESTpsBgFJkvYnlaL7ujfoV7uIPqsiRU2N1FdUsXrEeNYMHc727r2c9y/phGYQkCSplarN6+lXu4h+tYsp27WDZHEJ6wadypqhI9jcZyAkErkuUZKOCoOAJCnvle3YRt8Vi+lXu4hOdRtJFSTY2H8IS98znPUDh5EqKs51iZJ01BkEJEl5qaihnj6rIn1rF9N93esUAHW9+rHonEtZN/g0msoqcl2iJGWVQUCSlDcKmpNU/3UF/WoXUf1GLYXNzezs3J3l485jzZDh7O7cLdclStIxYxCQJHVs6TRd33ozs+h35VJKGutpKKvgjXAGa4aOYFvPPi76lZSXDAKSpA6pcstG+i1fTL8ViyjfsY1kUTHrTz6FNUNHsKnfINIu+pV0FIUQrgduBbYD18YYV7Vq6wE8AfQFnogxTgshFAH3A+cDCeCGGOOsEEICeAS4AFgEfDzGWJ+Nmg0CkqQOo3TXDvqsWEK/2kV02fQW6YICNvYbxGvjLmD9yafQXFyS6xIldUAhhGrgDmA0cCHwIPChVl3uBH4JPAq8FEJ4BtgJzIoxTg0hXAF8C5gIXAVUxxhHhhC+CdwIfDsbdRsEJEkntMKmBnqvWka/2kX0WLuagnSarT37sGTCxawdfBqNFVW5LlFSx3cZMDvGuCuE8CzweAihIMaYbmm/ArgqxpgKITwFXBFjvBeobWn/MzCwVd/ft/z8C+AeDAKSJGUUpJrp8eYq+tUuovfqZRQ2J9lV1YXa0WezdugIdnbtkesSJXUQM2bM6DN16tRBbS5viTFuafW6LxABWr7s1wHdgU0t7b14+0v/m2R+89/aBKCm7b1a+vZ7t5/hQAwCkqQTQzpNlw1r6Ve7iL4rl1BSv5vG0jLePGUUa4YOZ0uvk1z0K+mou++++57cz+WvAl9pc631wqNOQLpNe8H+2lrWCnwF+MJ+7rW/+xw1BgFJ0nGtYlsdfVtO+q3cVkdzYSHrBwxjzbARbDxpCOnCwlyXKKkDu/322z8yderUWW0ub2nzeg2Z3+oTQugEdAPqWrW/BQwjs/g3tPTf4wfAjBjji63uFYAX9tP3qDIISJKOO8W7d9F3ZWbRb9cNa0kDm/uezIrRZ/PWoECypDTXJUrKE1OmTFk3ZcqUVYfo9hxwdwihApgEzACuCSH0izE+BPwGuCiEsITMYuJPAoQQvgGUxBi/1upevwE+BvwbcBHwX0fx4+zDICBJOi4kkk30en0Z/WoX0/OvK0ik02zrVs3SsyaxdsjpNFR2znWJkrRfMcaNIYR7gFdp2T6UzK5Bg1q6TCOzfejNwM9
jjAtCCFPITAeaHUKY09LvH4BfA5eFEBaSeYLwpWzVXZBOZ23aUbacB7xYV7eTZDKV61okSYehuroTD/1pydsXUil6rF1Nv9rF9F71GkXJRnZXdmLtkOGsGTqcHd175a5YSceN2y44nQ0bth/T9ywqStCtWyVk9vn/8zF982PEJwKSpGMrnabT5vX0W76IviuWULZ7B03FJawdchprh45gc58BLvqVpGPAICBJOiaaN29i4yt/4Lz//gNVWzaRSiTY0H8oS4YOZ8OAYaSK/CdJko4l/9aVJGVNatcuGufPob5mJsmVmS20G3v3Z9G572Pd4EBTaXmOK5Sk/GUQkCQdVemmJhqXLKJhzkwalyyG5iSFvXpTMflK+l58Ib9bsjHXJUqSMAhIko6CdCpFcmUt9TUzaZw/l3T9bgo6dabs3PMpG3cWhSf1p6CggJLqTmAQkKTjgkFAknTEkuvW0FAzk4Y5s0ltqYOSEkpHnUHp2PEUDzuVAg/7kqTjlkFAknRYmrduoWHObBpqZtK89k1IJCg+9TQqL7+KkhGjKPCwL0k6IRgEJEmHlKrfTeOCeTTUzKSpdhmk0xQNHETlBz5M6RnjSFR1ynWJkqTDZBCQJO1XOpmk8bUlNNTMpHHRQkg2kejRk/JLJlM2djyF1R72JUknMoOAJGmvdDpNcvXKzLz/eXNI79pJQWUVZRPOoXTseIoGDqLAw74kqUMwCEiSSK5/i4Y5s2iomUVq80YoKqZk5CjKxp1F8amnu+hXkjogg4Ak5anU9m00zK2hYc5Mkm+8DgUFFA8LVFw6mZKRo0mUediXJHVkBgFJyiPpxgYaFs7PLPpdFiGVovCk/lReeTUlZ5xJYZcuuS5RknSMGAQkqYNLNzfTtCxmpv4snAeNjSS6dqN80sWUjj2Loj59c12iJCkHDAKS1AGl02mSf30js+h37mzSO7ZTUF5O2dizKB03nqJBQyhIJHJdpiQphwwCktSBNG/a2LLodybNG9ZDYRElw0dSOnY8JacPp6CoONclSpKOEwYBSTrBpXbupGFeDQ01M0muXglA0ZBhVF14MSWjziBRUZHjCiVJxyODgCSdgNJNjTQuXkhDzSwaly7KLPrt3ZeKKe+ndOx4Crt1z3WJkqTjnEFAkk4Q6VSKphXLMl/+588l3VBPonMXys+/iNJx4ynse5KHfUmS2s0gIEnHueSaN/cu+k1t3UJBaRklo8ZQOu4sioee4qJfSdIRMQhI0nGoeUvd24t+162FRIKS04ZT+v6rKRk+koLiklyXKEk6wRkEJOk4kdq9i8b5c2mYM4umFcshnabo5MFUXv0RSseMI1FZlesSJUkdiEFAknIonWyicenizLz/JQshmaSwuhcVl15O6dgzKexZnesSJUkdlEFAko6xdCpFcvXKzLz/eXNI795FQWUVZWdPpHTcWRT1H+iiX0lS1hkEJOkYSb61LvPlf84sUnWbobiE0pGjM4t+TwkUFBbmukRJUh4xCEhSFqW2baVh7mzqa2bR/OYbUFBA8SmnUfG+KykdOZqC0tJclyhJylMGAUk6ylL19TQunEdDzSyalsfMot/+A6m86m8yi347d851iZIkGQQk6WhINzfT9NrSzNSfRfOhqYlEt+6Uv/cySseNp6hXn1yXKEnSPrIaBEII1wO3AtuBa2OMq1q1lQDfBs4FNgAfizG+mc16JOloSqfTJN9Y3XLYVw3pnTsoqKigbPyEzKLfkwe76FeSdNzKWhAIIVQDdwCjgQuBB4EPteryGaApxjg6hDAMWJ+tWiTpaGreuIGGmpnUz5lFauMGKCqiZPgoSsedRUk4nYIiH7ZKko5/2fzX6jJgdoxxVwjhWeDxEEJBjDHd0v5R4MMAMcblWaxDkt611I7tNMyroaFmFsnXV2UW/Q49hYr3XkbJyDEkystzXaIkSYclm0GgLxABYoypEEId0B3Y1NLeH7gihPBJoAa4OcbY3PoGIYSuQNfW16ZPn95n0qR
JWSxbkjLSjY00Ll5Afc1MmuISSKUo7NuPiis+QOkZZ1LYtVuuS5Qk6Yhl+/l1otXPnYB0q9eVwDpgPPAr4APAf7YZPxW4q/WF6dOnM2nSJLp1qzz61UrKe+lUip2LF7Pt5ZfZPns2qfp6irp3p8fkyXQ+5xzKBgzIdYmSlJeqqzvluoQOJ5tBYA0wASCE0AnoBtS1al8LvBpjTIcQ/gcYtp97PAw83vrCTTfdNB54sq5uJ8lkKht1S8oz6XSa5jV/pb5mJo1zZ5Pato2CsjJKRo/NHPY1eCgFiQTbge0btue63BOa/5BLOlIbjvHfv0VFiQ7/i+dsBoHngLtDCBXAJGAGcE0IoV+M8SEyTwH+JoTwCHAR8L22N4gxbgG2tLncP4s1S8ojzZs30TBnFg1zZtH81jooLKTktBGUjhtPyekjKSguznWJkiRlTdaCQIxxYwjhHuBVWrYPJbNr0KCWLtOAfyeze9B/xxh/m61aJGmP1K5dNM6fQ33NTJIrawEoGjyEyg/9LaVjxpKo6Ni//ZEkaY+srhGIMT4GPNbq0oOt2raSWRcgSVmVbmqicckiGubMpHHJYmhOUtirNxWTr6R07HgKu/fIdYmSJB1zbnYtqUNKp1IkV9Zm5v3Pn0u6fjcFnTpTdu75lI07i8KT+nvYlyQprxkEJHUoyXVrMif9zplNaksdlJRQOuoMSseOp3jYqRQUFua6REmSjgsGAUknvOatW2iYM5uGmpk0r30TEgmKTz2NysuvomTEKApKSnNdoiRJxx2DgKQTUqp+N40L5tFQM5Om2mWQTlM0cBCVH/wwpWPGkahym0pJkg7GICDphJFOJml8bQkNNTNpXLQQkk0kevSk/JLJlI0dT2F1r1yXKEnSCcMgIOm4lk6nSa5eSUPNLBrm1ZDetZOCyirKJpxD6djxFA0c5KJfSZKOgEFA0nEpuf6tzGFfNbNIbd4IxcWUjBhF2bizKD71dBf9SpL0LhkEJB03Utu30TCvhoaamSTfeB0KCigeFqi4dDIlI8eQKCvLdYmSJHUYBgFJOZVubKBh4fzMot9lEVIpCk/qT+WVV1NyxpkUdumS6xIlSeqQDAKSjrl0czNNy2Jm6s/CeS6ZyuAAAB+4SURBVNDYSKJbd8onXULpuPEU9e6b6xIlSerwDAKSjol0Ok3yr29kDvuaO5v0ju0UlJdTNvaszJf/QUMoSCRyXaYkSXnDICApq5o3bWxZ9DuT5g3robCIkuEjKR07npLTh1NQVJzrEiVJyksGAUlHXWrnzrcX/a5eCUDRkGFUXXgxJaPOIFFRkeMKJUmSQUDSUZFuaqRx8UIaambRuHRRZtFv775UXH4VpWecSWG37rkuUZIktWIQkHTE0qkUTSuWZb78z59LuqGeROculJ9/EaXjxlPY9yQP+5Ik6ThlEJB02JJr3ty76De1dQsFpWWUjD6D0rHjKR56iot+JUk6ARgEJLVL85a6txf9rlsLiQQlpw2n9P1XUzJ8JAXFJbkuUZKknAkhXA/cCmwHro0xrmrV1gN4AugLPBFjnNZy/RPAncAzMcbbWq69D/h3YE3L8NtjjC9ko2aDgKQDSu3eReOCeZnDvlYsh3SaopMHU3n1NZSOGUuisirXJUqSlHMhhGrgDmA0cCHwIPChVl3uBH4JPAq8FEJ4JsY4H/gT8BOg9emZXYDvxhi/mu26DQKS9pFONtG4dHFm3v+ShZBMUljdi4pLL6d07JkU9qzOdYmSJB1vLgNmxxh3hRCeBR4PIRTEGNMt7VcAV8UYUyGEp1pez48xrgghrATOaHWvLsCWY1G0QUAS6VSK5OqVmXn/8+aQ3r2LgqpOlJ19Xuawr/4DXfQrScpLM2bM6DN16tRBbS5viTG2/rLeF4gALV/264DuwKaW9l5AbcvPbwITD/KWXYC/DSF8CpgLfCbGuOPdfYr9MwhIeSz51rrMl/85s0jVbYbiEkpHjqZ03FkUnxIoKCzMdYmSJOXUfff
d9+R+Ln8V+Eqba613yugEpNu0FxykrbUfA78hExgeB6YC09pX7eExCEh5JrVtKw1zZ1NfM4vmN9+AggKKTz2NislXUjpiNAWlpbkuUZKk48btt9/+kalTp85qc7nt1J01wASAEEInoBtQ16r9LWAYsAgIvL0Q+B1ijBuADS33+inwN++m/oMxCEh5IFVfT+PCeTTUzKJpecws+u0/kMqr/obSMeNIdO6c6xIlSTouTZkyZd2UKVNWHaLbc8DdIYQKYBIwA7gmhNAvxvgQmd/wXxRCWEJmMfEnD3SjEML/Af6DzBOGy4G2IeSoMQhIHVS6uZmm15Zmpv4smg9NTSS696D84vdROnY8Rb1657pESZI6hBjjxhDCPcCrtGwfSmbXoEEtXaaR2T70ZuDnMcYFACGEOWTWEpSHECaRearQCfgzmXUFfwS+m626C9Lpg01ROi6dB7xYV7eTZDKV61qUB1KpFA888A2WL19GcXExd9xxJ/37D9jb/swzT/OrX/0nhYWFfPzjn2LixPP3tv3iFz9j06ZNfPrTnzsmtabTaZJvrG457KuG9M4dFFRUUDpmHKXjzqLo5MEu+lVOVVd34qE/Lcl1GZJOMLddcDobNmw/pu9ZVJSgW7dKgPPJfDHvcHwiIB3Ciy/+gcbGRqZP/xELFy7gkUce4hvfeBCATZs28tRTT/CDH/yExsZGPvOZT3HWWRNIp1N885tfZ/HihVx44XuzXmPzxg001Mykfs4sUhs3QFERJcNHUTruLErC6RQU+Z+6JEnal98OpEOYP38uEyacA8DIkaNYuvTt32YuWbKIUaPGUFJSQklJCSedNIDa2mWcdNIAJk++gvHj38Pq1auyUldqx3Ya5tXQUDOL5OurMot+h55CxXsvo2TkGBLl5Vl5X0mSdPwJIXRts6XpIRkEpEPYuXMnla1O0E0kEiSTSYqKit7RVlFRwY4dO+jcuTPvec/Z/Pa3vz6qtaQbG2lcvID6mpk0xSWQSlHY9yQqrvggpWeMo7Brt6P6fpIk6fgWQgjA00CXEMJ7gBeAq2OMSw811iAgHUJlZSW7du3a+zqdTlPUMtWmbduuXbvo1KnTUX3/dCpF0/LXaKiZSePCeaQbGkh06Ur5he+ldOxZFPXtd1TfT5IknVC+A9wK/GuM8c0QwneA7wMXHGqgQUA6hFGjxvDSSy9y8cWXsnDhAoYMGba37fTTR/D973+XhoYGmpqaWL16JYMHD33X75lOp2le81fqa2bSOHc2qW3bKCgrp6Rl0W/x4KEUJBKHvpEkSeroesQYn888GIAY43dDCDe2Z6BBQDqECy64iJkzX+Xmmz9JOp3mS1+6iyee+Cn9+w/gvPMu5MMf/v+45ZYbSKVS3HjjZyh9FwdyNW/eRMOcWTTMmUXzW+ugsJCS00ZkFv2ePoKC4uKj+MkkSVIHkA4hlNFyWnEIoQ9Q2J6Bbh8q5Vhq1y4a58+hvmYmyZW1ABQNHkLp2LMoHTOWREVljiuUjh63D5V0JNw+9MBCCJ8CPgYMBX4C/B3wzRjjo4ca6xMBKQfSTU00LllEw5yZNC5ZDM1JCnv1pmLylZSOHU9h9x65LlGSJJ0AYow/DCEsA64AioEbYozPt2esQUA6RtKpFMmVtZl5//Pnkq7fTUGnzpSdez5l486i8KT+HvYlSZIOSwjhazHGO4E/tbr2rRjjrYca2+4gEELoQWb1cTPwxxjj1iMpVso3yXVraKjJzPtPbamjoLSUkpFjMot+h53qol9JknTYQghfBboBfxtC6NKqqRh4H5mdhA6qXUEghHA18BiwAEgAPwwhXBNj/J/DrlrKA81bt9AwZzYNNTNpXvsmJBIUn3oalZdfRcmIURSUHPmCYkmSJOBV4CwgBWxqdT0J/J/23KC9TwS+DlwQY1wAEEIYB/wAGNfuUqUOLlW/m8YF82iomUlT7TJIpykaOIjKD36Y0jHjSFQd3fMFJElS/oox/hb4bQhhRozxL0dyj/Y
GgV17QkDLG9eEEE647YakbEgnm9jx1BM0zJsDySYSPaspv2QyZWPHU1jdK9flSZKkjq0uhPAtoAooILN16LAY48RDDWxvEJgRQvgC8AiZNQIfAxaGELoBBTHGzUdWt3TiSzclad60kbIJ51A67iyKBpzsol9JknSs/AyYBZwL/Bx4PzC7PQPbGwTuIJMu7m1z/Toyhxe069ACqSNKlJfT9Zbbcl2GJEnKT51ijJ8OITwMzAC+DfyxPQPbFQRijB5nKkmSJB1/9iwUXg6MjDHObO8U/vbuGpQA/gmYQmZLoueAe2KMySMoVpIkSdLRsbzlacCPyezsWUXm+/ohtXcD83uB9wLfAh4kMwfp/iMoVJIkSdLR82ngxRjjHOD/kvnOfmN7BrZ3jcBkYHyMsQkghPBfwLwjKFSSJEnS0fPrGOPFADHGR4FH2zuwvU8EEntCQMubNABNB+kvSZIkKfu6hhAqj2Rge58IzA0hPERm+9A08Flg/pG8oSRJkqSjZiewOoQwH9ix52KM8apDDWxvELiFzFZE/0vmoIJngc8dfp2SJEmSjqIfHunA9gaBL8YYP3GkbyJJkiTp6Isx/vhIx7Y3CFwJfPFI30SSJEnS0RNCSJGZsr8/6RjjIb/ntzcIrAghPAf8mX3nHj3YzvGSJEmSjp5qMlP2vwasBqYDzcAngJPbc4P2BoHNLf97ZssbbDmcKiVJkiQdPTHGTQAhhPExxk+3avp2CGFWe+7R3u1DvwmMBK4ArgKGAl9pf6mSJEmSsqAyhBD2vAghjAJK2zOwvU8EfkTmpLIfkXkEcROZFcqXHl6dkiRJko6ifwFeadk+NAEMB65tz8D2BoGKGOP3W73+TgjhhsOrUZIkSdLRFGP8zxDCn4HzWi79Kca4sT1j2zs1aGkI4dw9L0III4GVh1emJEmSpKMphFACTAQqgSrg/SGEr7dnbHufCJwM/DGEMA9IAmOBdS2PIIgxjj7sqiVJkiS9W/8BDAH6AnOACcAf2jOwvUHgC0dUliRJkqRsOgM4BXgUeJDMjJ9H2zOwXUEgxvjHIy5NkiRJUrasjTEmQwivASNjjE+GELq0Z2B71whIkiRJOv7sCCFcC8wDrmnZPrSqPQMNApIkSdKJ6xZgDPAykAb+CNzXnoEGAUmSJOnEdiFQB1wNzAX+qz2DDAKSJEnSiWs6mYN+y4EK4GngB+0Z2N5dg45ICOF64FZgO3BtjHHVfvo8CIyNMV6UzVokSZKkDqhbjPH/tnr9nRDCp9ozMGtPBEII1cAdZPYy/RqZ7Yza9jmNt09BkyRJknR4locQJux5EUIYDdS2Z2A2pwZdBsyOMe4CngUmhhAK2vR5APhGFmuQJEmSOpwQwoKWw33HA38OIcwOIfwFmAUMbc89sjk1qC8QAWKMqRBCHdAd2AQQQrgSeAOYfaAbhBC6Al1bX5s+fXqfSZMmZalkSZIk6YTw2Xd7g6yuEWDfJw6dyGxpRAihBPgycCUH3+d0KnBX6wvTp09n0qRJdOtWeZRLlSRJ0vGqurpTrks4rhyNA3+zGQTWkFkfQAihE9CNzLZGkFkX0AP4DVAKDA0hPBRjvK3NPR4GHm994aabbhoPPFlXt5NkMpW96iVJR53/kEs6Uhs2bD+m71dUlOjwv3jOZhB4Drg7hFABTAJmkDntrF+M8SHgVIAQwsnA4/sJAcQYtwBb2lzun8WaJUmSpLyQtcXCMcaNwD3Aq8AXgc8DJwGDs/WekiRJktonq2sEYoyPAY+1uvSOLURjjKsBzxCQJEmSjiFPFpYkSZLykEFAkiRJykMGAUmSJCkPGQQkSZKkPGQQkCRJkvKQQUCSJEnKQwYBSZIkKQ8ZBCRJkqQ8lNUDxSRJkqR8EEK4HrgV2A5cG2Nc1aqtB/AE0Bd4IsY4reX6J4A7gWdijLe1XCsDfgKcDvwB+IcYYyobNftEQJIkSXoXQgjVwB3ABOBrwINtutwJ/BIYDVwRQhjdcv1PZL70t3Y
zsCrGOBKoBq7MVt0GAUmSJOnduQyYHWPcBTwLTAwhFLRqvwL4fctv9p9qeU2McQWwss29rgB+3/LzL/b0zQanBkmSJEkHMGPGjD5Tp04d1Obylhjjllav+wIRIMaYCiHUAd2BTS3tvYDalp/fBCYe5C333qulb78jr/7gfCIgSZIkHcB99933JJnf2rf+M3U/XVt/r+4EpNu0Fxyk7UD3ak/fI2YQkCRJkg7g9ttv/wgwuM2fh9t0WwMEgBBCJ6AbUNeq/S1gWMvPoaX/gey9Vzv6vitODZIkSZIOYMqUKeumTJmy6hDdngPuDiFUAJOAGcA1IYR+McaHgN8AF4UQlgAXAp88yL1+A1wE/FfL/z7+rj7AQRgEJEmSpHchxrgxhHAP8Cot24cCHwIGtXSZRmb70JuBn8cYFwCEEOaQWUtQHkKYRGbXoenAv4cQFpLZPvQ32arbICBJkiS9SzHGx4DHWl16sFXbZjI7C7UdM/YAt/vbo1vd/rlGQJIkScpDBgFJkiQpDxkEJEmSpDxkEJAkSZLykEFAkiRJykMGAUmSJCkPGQQkSZKkPGQQkCRJkvKQQUCSJEnKQwYBSZIkKQ8ZBCRJkqQ8ZBCQJEmS8pBBQJIkScpDBgFJkiQpDxkEJEmSpDxkEJAkSZLykEFAkiRJykMGAUmSJCkPGQQkSZKkPGQQkCRJkvKQQUCSJEnKQwYBSZIkKQ8ZBCRJkqQ8ZBCQJEmS8pBBQJIkScpDBgFJkiQpDxkEJEmSpDxkEJAkSZLykEFAkiRJykMGAUmSJCkPGQQkSZKkPGQQkCRJkvKQQUCSJEnKQ0W5LkA63qVSKR544BssX76M4uJi7rjjTvr3H7C3/T/+4//xwgvPAXDOORP55CdvpKGhnrvvvpO6ujoqKir48pe/Srdu3XL1ESRJkt7BJwLSIbz44h9obGxk+vQfcfPNn+ORRx7a2/bmm3/lued+x/e+9xjTp/+ImTNfYfnyZTz99FMMGTKM7373B0yefAU//vEPc/gJJEmS3skgIB3C/PlzmTDhHABGjhzF0qVL9rb17t2HBx74DoWFhSQSCZLJJCUlJcyfP48JE84F4OyzJzJr1l9yUrskSdKBGASkQ9i5cyeVlVV7X+/5wg9QVFRE165dSafTPPLIw5xySmDgwJPZuXMnVVWZMRUVFezcuSMntUuSJB2IawSkQ6isrGTXrl17X6fTaYqK3v5Pp6GhgXvvvZuKigo+//k7Wo3ZCcCuXbv2hgJJkqTjhU8EpEMYNWoMr7zyEgALFy5gyJBhe9vS6TRf/OLnGTbsFP75n79MYWHh3jEvv5wZ88orLzFmzNhjX7gkSdJB+ERAOoQLLriImTNf5eabP0k6neZLX7qLJ574Kf37D6C5OcXcuTU0Njbyyiv/C8DNN3+Wq6/+MNOm3cWnP/0piouLueuuaTn+FJIkSfsqSKfTua7hcJ0HvFhXt5NkMpXrWiRJh6G6uhMP/WnJoTtKUiu3XXA6GzZsP6bvWVSUoFu3SoDzgT8f0zc/RpwaJEmSJOUhg4AkSZKUhwwCkiRJUh7K6mLhEML1wK3AduDaGOOqVm1XAV8AegA/iTF+PZu1SJIkSXpb1p4IhBCqgTuACcDXgAfbdDkNuAQ4A7gxhHBqtmqRJEmStK9sTg26DJgdY9wFPAtMDCEU7GmMMf5rjHF3jLEeqAEGZLEWSZIkSa1kc2pQXyACxBhTIYQ6oDuwqXWnEEIRMBpY0PYGIYSuQNfW16ZPn95n0qRJWSr54Lp0q6CkqDAn7y3pxNWYbGZr3a5Dd5Qk6RjK9oFirZ84dAL2d2jBzcDvY4zr99M2Fbir9YXp06czadKkPfu6HnPufy3pcN12welUV3fKdRmSdELz79GjL5tBYA2Z9QGEEDoB3YC61h1CCJcBHwcuOsA9HgYeb33hpptuGg88mYsDxfw/oKQjdawPwjle+feopCOVwwPFOqxsBoHngLtDCBXAJGAGcE0IoV+M8aE
Qwnjgu8B7Y4w79neDGOMWYEuby/2zWLMkSZKUF7K2WDjGuBG4B3gV+CLweeAkYHBLl9+SCSL/GUKYE0K4L1u1SJIkSdpXVtcIxBgfAx5rdenBVm29svnekiRJkg4s24uFJUmSpA7vEAfp9gCeILOr5hMxxmkt168A7gWSwA0xxtkhhAC8AuwZ/1CM8d+zUbNBQJIkSXoXWh2kOxq4kMwsmA+16nIn8EvgUeClEMIzwBLgETKb61QD/w6cCXQBfhlj/Pts153NA8UkSZKkfHDQg3SBK8hsl58Cnmp5/R5gfYxxfYxxEVASQuhLJgi03SwnK3wiIEmSJB3AjBkz+kydOnVQm8tbWna33ONQB+n2Ampbfn4TmNh6TKvr/cgEgUtCCHOA1cDNMcZ1R+8Tvc0nApIkSdIB3HfffU8CK9v8mbqfroc6SLdgP237G/M74GPAOcAbwLR3Uf5BGQQkSZKkA7j99ts/Qmb7+9Z/Hm7TbQ0Q4IAH6b4FDGv5ObT03zum9fUY444Y45wYYz2Z3TdPO6ofqBWnBkmSJEkHMGXKlHVTpkxZdYhuBz1IF/gNcFEIYQmZxcSfJLNYuEcIoTeZqUOrY4zrQgjXtPSvB64CZmXhYwEGAUmSJOldiTFuDCHsOUh3O3AtmV2DBrV0mUZm+9CbgZ/HGBcAhBBuAZ4ns33o9S19m1qu9QYWAp/IVt0GAUmSJOldOsRBupvJ7CzUdswMMk8PWl97Gng6S2XuwzUCkiRJUh4yCEiSJEl5yCAgSZIk5SGDgCRJkpSHDAKSJElSHjIISJIkSXnIICBJkiTlIYOAJEmSlIcMApIkSVIeMghIkiRJecggIEmSJOUhg4AkSZKUhwwCkiRJUh4yCEiSJEl5yCAgSZIk5SGDgCRJkpSHDAKSJElSHjIISJIkSXnIICBJkiTlIYOAJEmSlIcMApIkSVIeMghIkiRJecggIEmSJOUhg4AkSZKUhwwCkiRJUh4yCEiSJEl5yCAgSZIk5SGDgCRJkpSHDAKSJElSHjIISJIkSXnIICBJkiTlIYOAJEmSlIcMApIkSVIeMghIkiRJecggIEmSJOUhg4AkSZKUhwwCkiRJUh4yCEiSJEl5yCAgSZIk5SGDgCRJkpSHDAKSJElSHjIISJIkSXnIICBJkiTlIYOAJEmSlIcMApIkSVIeMghIkiRJecggIEmSJOUhg4AkSZKUhwwCkiRJUh4yCEiSJEl5yCAgSZIk5SGDgCRJkpSHDAKSJElSHirK5s1DCNcDtwLbgWtjjKtatfUAngD6Ak/EGKdlsxZJkiQpW47ke28I4QrgXiAJ3BBjnB1CSACPABcAi4CPxxjrs1Fz1p4IhBCqgTuACcDXgAfbdLkT+CUwGrgihDA6W7VIkiRJ2XIk33tDCMVkvvBfAlwHfL+l71VAdYxxJLAKuDFbdWdzatBlwOwY467/v727jbGjKgM4/l8pBIFStrxoC2hrTR8DH7BC1bIKBPENk4aCSIQEisGIFZWKFUxIoKQgxvBmE6mhtuUDkBgIYjXYSpEYQIhSSKutDwYDGGot1GKp/SC064eZi8N1t122e1925/9LNrlzzpkzzyS7s/Pcc84MsAroi4ieSv3ngIczczdwb7ktSZIkjTbDue/9MLAlM7dk5p+AAyJiUqNtud9PaeE9ciunBk0CEiAzd0fENmAisLWsPwp4rvz8EtDX3EFEHAYcVi1btmzZsX19fey3X2eWN0we/86OHFfS6DZunEuyGryOShqOdl9HG/eaa9asOWbevHlTmqpfzcxXK9vDue99c59K+eSm8kZZS7R0jQBvHXEYD/Q31ffsoQ7gcuCaasHq1avp6+vj0EM784/kvBlTOnJcSaNbb+/BnQ6ha3gdlTQcnbqOLl269J4BihcC1zaVDee+d7B93jFA2YhrZSKwiWKeFBExHugFtlXq/wG8n2IRRJTtm90KrKgWbN++/ZCtW7d++vDDD18LvD7yYUuSJEnsv3nz5g9
NmDBhFbCjqe7Vpu3h3PduKj83NJc/xOD3yCOip7+/NUlGRBwBPEGxKOITwFyKeU6TM/OWiLgZ+CvwI+BJ4EuZub4lwUiSJEktMpz7XmAj8Cwwi2Lq0IrMPDEiZgMXZubnI+Im4PnMXNyKuFs22SozXwFuoDjZ7wJXAEcDU8smiyhWRa8DfmYSIEmSpNFoOPe9mfkG8DXg18CdwJfLtiuBzRHxR+AY/vc0oRHXshEBSZIkSd3Lx1hIkiRJNWQiIEmSJNWQiYAkSZJUQyYCkiRJUg2ZCEiSJEk11Oo3C0tdKSIuAb4JvAacn5nPN9WfB3wH2J/i5R/zMnNbcz8tiGs5cCrwr7Lo/sy8rtXHlaR2iYgrgcuBGzPztqa65UBvZp5VKUvg7sxc2N5IpbHPREC1ExFHAldRvPTjVOBm4OxK/ckUScAZmbktIi4G7gE+06YQ52fmA206liS128+B9+2h/riIODgz/x0RxwEHtikuqXZMBFRHnwKeysydEbEKWBERPZnZeKnG14FrGyMAmbk8Ir4REdMz89lGJxFxEXAicCwwE/h+Zi6OiAnAbcDxwAHAFzNzQ/lN1wbgTGA6cG5mPt6eU5ak7pCZGyNi0x6aPAp8FrgXmAM83JbApBpyjYDqaBKQAJm5G9gGTKzUfwB4pmmfZ8ryZh8BLgJOp3iTIMAO4NbMnAncAny70n5q2XYhcNk+nYUkjU2rKRIAgD6KN7FKagETAdVV9Xd/PFB9xXY/0NPUvqepTcPjmbm9HCk4BCAzdwE7I+I64DzgvZX2q8qRh7XAuweJ7ZaIeLr8OWrIZyRJY8PfgYkRMQ14Adjd4XikMctEQHW0CQiAiBgP9FKMCjRsBGY07TMD+PNe+t1V9jkTWA78Criagf/OXuf/k42G+Zk5o/zZspdjStJYtBq4Cbiv04FIY5mJgOpoNTAjIg4CTgMeBL4QEfPL+sXANRHRC28+YeilzPzLEPs/BfhdOf+/OaGQJFVExLfKJ7VV3UcxLeiR9kck1YeJgGonM18BbgCepJjXfwVwNMX8fTLzCeB7wJqIWAecAVwAEBETImJlREwcqO/S/cDJEfE4b117MKCImBMRi/bhlCRpVIiISRHxNHApsCAiHgKmUFyD35SZLwIzM/ON9kcp1UdPf/9A054lDSYilgDXZ+bfRqi/A4G7MvOckehPkiRpKBwRkN6GiDgFeHGkkoDS1cCNI9ifJEnSXjkiIEmSJNWQIwKSJElSDZkISJIkSTVkIiBJkiTVkImAJEmSVEPjOh2AJKnzIuII4OXM7Cm37wNOAHaUTX6TmfMH21+SNPqYCEiSBjILOCkzN3U6EElSa5gISFIXi4i7gacy86Zy+6vA6cAm4KPAeKAHuCQzH4uIFRRvtJ4G/CIzr9xD32cD1wM7gd9XyqeW/d4REe8B/gBckZn/HPkzlCR1imsEJKm73QHMrWzPBdYBk4FZmXkccCdwVaXNQZl5/F6SgHcBy4BzMvNE4IVK9VHAQ8ClwAcppgct2+czkSR1FUcEJKm7PQIcGBEnUXxzfySwCJgOfCUipgGnAa9V9nl0CP1+DFifmRvK7R8DNwBk5pPAnEbDiLgW2BwRB2Tmf/blZCRJ3cMRAUnqYpnZD/wEuBC4uPx8JvDLsskDwBKK6UENOxia6j5vND5ExMcjYnZTu93ArrcVvCSpq5kISFL3WwHMBs4FlgOfBFZm5u0U8/fPAvZ7m33+Fjg+Ik4ot+dW6g4BFkfExHJ7AXBvZpoISNIYYiIgSV0uMzcDa4F15VN8lgCnRcT6svw5YGpEDPmanpkvA+cDd0XEWmBqpe5B4IfAYxGRFAuPLxup85EkdYee/v7+TscgSZIkqc1cLCxJY1RELAAuGKT6B5l5VzvjkSR1F0cEJEmSpBpyjYAkSZJUQyYCkiRJUg2ZCEiSJEk1ZCIgSZIk1ZCJgCRJklRD/wX7hMhNH86IugAAAABJRU5ErkJggg==", 
"text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Check the in-sample binning results for var_d5\n", "col = 'var_d5'\n", "\n", "# It's recommended to set 'labels = True' for categorical features.\n", "bin_plot(c.transform(train_selected[[col,'target']], labels=True), x=col, target='target')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(3) ***Adjust bins:*** c.update(dict)\n", "\n", "Only the bins passed in are updated; the bins of all other features are kept intact. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No handles with labels found to put in legend.\n", "No handles with labels found to put in legend.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# The IV is small; assume we want to separate 'F' out to lift the IV. \n", "\n", "# Set new bins \n", "rule = {'var_d5':[['O', 'nan'],['F'], ['M']]}\n", "\n", "# Pass new bins\n", "c.update(rule)\n", "\n", "# Re-check both in-sample and OOT stability. \n", "bin_plot(c.transform(train_selected[['var_d5','target']], labels=True), x='var_d5', target='target')\n", "badrate_plot(c.transform(OOT[['var_d5','target','month']], labels=True), target='target', x='month', by='var_d5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ### III. WOE transformation\n", "\n", "WOE transformation is applied after binning is tuned and finalised. The procedure is as follows:\n", "\n", "(1) ***Use the finalised Combiner to apply the binning:*** c.transform(dataframe, labels=False) \n", "\n", " It only transforms the binned features. \n", "\n", "(2) ***Initialise the WOE transformer:*** transer = toad.transform.WOETransformer()\n", "\n", "(3) ***fit_transform:*** transer.fit_transform(dataframe, target, exclude = None)\n", "\n", "\n", " Fit and apply the WOE transformation to the in-sample data\n", " \n", " - target: target values as a Series or DataFrame;\n", " \n", " - exclude: columns not to be WOE-transformed\n", " Note: 1. \"fit_transform\" fits and transforms all columns, even the ones not binned. Remember to exclude the unwanted columns. 2. Always exclude the target column.\n", " \n", "(4) ***Apply WOE transformation, typically to test / OOT data:*** transer.transform(dataframe)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " APP_ID_C target var_d2 var_d3 var_d5 var_d6 var_d7 \\\n", "0 app_1 0 -0.178286 0.046126 0.090613 0.047145 0.365305 \n", "1 app_2 0 -1.410248 0.046126 -0.271655 0.047145 -0.734699 \n", "2 app_3 0 -0.178286 0.046126 0.090613 0.047145 0.365305 \n", "\n", " var_d11 var_b3 var_b9 ... 
var_l_60 var_l_64 var_l_68 var_l_71 \\\n", "0 -0.152228 -0.141182 -0.237656 ... 0.132170 0.080656 0.091919 0.150975 \n", "1 -0.152228 0.199186 0.199186 ... 0.132170 0.080656 0.091919 0.150975 \n", "2 -0.152228 -0.141182 0.388957 ... -0.926987 -0.235316 -0.883896 -0.385976 \n", "\n", " var_l_89 var_l_91 var_l_107 var_l_119 var_l_123 month \n", "0 0.091901 0.086402 -0.034434 0.027322 0.087378 2019-03 \n", "1 0.091901 0.086402 -0.034434 0.027322 0.087378 2019-03 \n", "2 0.091901 -0.620829 -0.034434 -0.806599 -0.731941 2019-03 \n", "\n", "[3 rows x 34 columns]\n" ] } ], "source": [ "# Initialise\n", "transer = toad.transform.WOETransformer()\n", "\n", "# Bin with the Combiner, then fit and apply the WOE transformer. Remember to exclude the target\n", "train_woe = transer.fit_transform(c.transform(train_selected), train_selected['target'], exclude=to_drop+['target'])\n", "OOT_woe = transer.transform(c.transform(OOT))\n", "\n", "print(train_woe.head(3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ### IV. Stepwise regression feature selection\n", "---------------\n", "\n", "6. ***toad.selection.stepwise(dataframe, target='target', estimator='ols', direction='both', criterion='aic', max_iter=None, return_drop=False, exclude=None):***\n", "\n", "Stepwise regression feature selection; supports forward, backward, and both-direction (recommended) selection:\n", "\n", " - estimator: the regression model to fit, supports 'ols', 'lr', 'lasso', 'ridge' \n", " \n", " - direction: stepwise direction, supports 'forward', 'backward', 'both' (recommended)\n", " \n", " - criterion: selection criterion, supports 'aic', 'bic', 'ks', 'auc'\n", " \n", " - max_iter: maximum number of iterations\n", " \n", " - return_drop: whether to return a list of dropped column names\n", " \n", " - exclude: list of columns to be excluded from the algorithm, such as the ID column and the time column.\n", " \n", "***tip: generally, direction = 'both' produces the best results. 
Setting estimator = 'ols' and criterion = 'aic' makes stepwise selection fast, and the selected features still work well for logistic regression.***\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(43576, 13)\n" ] } ], "source": [ "# Apply stepwise regression on the WOE-transformed data\n", "final_data = toad.selection.stepwise(train_woe,target = 'target', estimator='ols', direction = 'both', criterion = 'aic', exclude = to_drop)\n", "\n", "# Keep the same selected features in the test / OOT sample \n", "final_OOT = OOT_woe[final_data.columns]\n", "\n", "print(final_data.shape) # Out of 31 features, stepwise regression selected 10 of them." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# The final list of features for modelling\n", "col = list(final_data.drop(to_drop+['target'],axis=1).columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. ***toad.metrics.PSI(df_train, df_test):***\n", "\n", "Output the PSI for each feature - used to check the OOT stability of WOE-transformed features." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "var_d2 0.000254\n", "var_d5 0.000012\n", "var_d7 0.000079\n", "var_d11 0.000191\n", "var_b10 0.000209\n", "var_b18 0.000026\n", "var_b19 0.000049\n", "var_b23 0.000037\n", "var_l_20 0.000115\n", "var_l_68 0.000213\n", "dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "toad.metrics.PSI(final_data[col], final_OOT[col])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ### V. Model evaluation and validation\n", "---------------\n", "\n", "7. **Common evaluation metrics**: toad.metrics.
KS, F1, AUC" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/zhouxiyu/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] } ], "source": [ "# Fit a logistic regression model\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "lr = LogisticRegression()\n", "lr.fit(final_data[col], final_data['target'])\n", "\n", "# Obtain predicted probabilities for training and OOT \n", "pred_train = lr.predict_proba(final_data[col])[:,1]\n", "\n", "pred_OOT_may = lr.predict_proba(final_OOT.loc[final_OOT.month == '2019-05',col])[:,1]\n", "pred_OOT_june = lr.predict_proba(final_OOT.loc[final_OOT.month == '2019-06',col])[:,1]\n", "pred_OOT_july = lr.predict_proba(final_OOT.loc[final_OOT.month == '2019-07',col])[:,1]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train KS 0.3707986228750539\n", "train AUC 0.75060723924743\n", "OOT results\n", "May KS 0.3686687175756087\n", "June KS 0.3495273403486497\n", "July KS 0.3796914199845523\n" ] } ], "source": [ "from toad.metrics import KS, AUC\n", "\n", "print('train KS',KS(pred_train, final_data['target']))\n", "print('train AUC',AUC(pred_train, final_data['target']))\n", "print('OOT results')\n", "print('May KS',KS(pred_OOT_may, final_OOT.loc[final_OOT.month == '2019-05','target']))\n", "print('June KS',KS(pred_OOT_june, final_OOT.loc[final_OOT.month == '2019-06','target']))\n", "print('July KS',KS(pred_OOT_july, final_OOT.loc[final_OOT.month == '2019-07','target']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***PSI can also be used to gauge the stability of predicted probabilities***" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "0.12760761722158315\n", "0.1268648506657109\n", "0.1268648506657109\n" ] } ], "source": [ "print(toad.metrics.PSI(pred_train,pred_OOT_may))\n", "print(toad.metrics.PSI(pred_train,pred_OOT_june))\n", "print(toad.metrics.PSI(pred_train,pred_OOT_july))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. ***toad.metrics.KS_bucket(predicted_proba, y_true, bucket=10, method = 'quantile'):***\n", "\n", "Outputs evaluation statistics for the binned predicted probabilities, including the probability range, number of samples, bad rate, and KS of each probability bin. \n", "\n", " - bucket: number of bins\n", " \n", " - method: binning method. 'quantile' or 'step' is recommended \n", " \n", " (1) the larger the difference in bad_rate between groups, the better the results; (2) can be used to check the monotonicity of the score groups; (3) can be used to find the optimal cutoff point; (4) can be used to compare the predictive power of models \n", "\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " min max bads goods total bad_rate good_rate odds \\\n", "0 0.000275 0.003380 9 4332 4341 0.002073 0.997927 0.002078 \n", "1 0.003398 0.005207 12 3585 3597 0.003336 0.996664 0.003347 \n", "2 0.005207 0.008116 37 5071 5108 0.007244 0.992756 0.007296 \n", "3 0.008125 0.010862 26 3854 3880 0.006701 0.993299 0.006746 \n", "4 0.010868 0.014651 59 4759 4818 0.012246 0.987754 0.012398 \n", "5 0.014661 0.019846 76 3901 3977 0.019110 0.980890 0.019482 \n", "6 0.019858 0.025968 116 4665 4781 0.024263 0.975737 0.024866 \n", "7 0.025986 0.032467 108 4188 4296 0.025140 0.974860 0.025788 \n", "8 0.032484 0.044998 173 4187 4360 0.039679 0.960321 0.041318 \n", "9 0.045115 0.370055 313 4105 4418 0.070847 0.929153 0.076248 \n", "\n", " bad_prop good_prop total_prop cum_bads cum_goods cum_total \\\n", "0 0.009688 0.101578 0.099619 9 4332 4341 \n", "1 0.012917 0.084062 0.082545 21 7917 7938 \n", "2 0.039828 0.118906 0.117220 58 12988 13046 \n", "3 0.027987 0.090370 0.089040 84 16842 16926 \n", "4 0.063509 0.111590 0.110565 143 21601 21744 \n", "5 0.081808 0.091472 0.091266 219 25502 25721 \n", "6 0.124865 0.109386 0.109716 335 30167 30502 \n", "7 0.116254 0.098202 0.098586 443 34355 34798 \n", "8 0.186222 0.098178 0.100055 616 38542 39158 \n", "9 0.336921 0.096255 0.101386 929 42647 43576 \n", "\n", " cum_bads_prop cum_goods_prop cum_total_prop ks \n", "0 0.009688 0.101578 0.099619 -0.091890 \n", "1 0.022605 0.185640 0.182164 -0.163035 \n", "2 0.062433 0.304547 0.299385 -0.242114 \n", "3 0.090420 0.394916 0.388425 -0.304497 \n", "4 0.153929 0.506507 0.498990 -0.352578 \n", "5 0.235737 0.597979 0.590256 -0.362241 \n", "6 0.360603 0.707365 0.699972 -0.346762 \n", "7 0.476857 0.805567 0.798559 -0.328710 \n", "8 0.663079 0.903745 0.898614 -0.240666 \n", "9 1.000000 1.000000 1.000000 0.000000 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Group the predicted scores in bins with same number of samples in each (i.e. 
\"quantile\" binning)\n", "toad.metrics.KS_bucket(pred_train, final_data['target'], bucket=10, method = 'quantile')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ### VI. Standard scorecard transformation\n", "---------------\n", "\n", "6. ***toad.ScoreCard(combiner = {}, transer = None, pdo = 60, rate = 2, base_odds = 20, base_score = 750, card = None, C=0.1, \\*\\*kwargs):***\n", "\n", "Converts a logistic regression model into a standard scorecard. Parameters of a LogisticRegression class can be passed in directly.\n", "\n", " - combiner: the pre-fitted toad.Combiner object\n", " \n", " - transer: the pre-fitted toad.WOETransformer object\n", " \n", " - pdo, rate, base_odds, base_score: \n", " e.g. pdo=60, rate=2, base_odds=20, base_score=750\n", " it means that when the odds are 1/20 (base_odds) the base score is 750, and each time the odds are multiplied by rate (2) the score shifts by pdo (60) points\n", " \n", " - card: supports passing in a pre-defined (expert) scorecard\n", " \n", " - **kwargs: supports passing parameters of a logistic regression class (i.e. sklearn.linear_model.LogisticRegression)\n", " " ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/zhouxiyu/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "data": { "text/plain": [ "ScoreCard(base_odds=35, base_score=750, card=None,\n", " combiner=, pdo=60,\n", " rate=2,\n", " transer=)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "card = toad.ScoreCard(\n", " combiner = c,\n", " transer = transer,\n", " #class_weight = 'balanced',\n", " #C=0.1,\n", " #base_score = 600,\n", " #base_odds = 35 ,\n", " #pdo = 60,\n", " #rate = 2\n", ")\n", "\n", "card.fit(final_data[col], final_data['target'])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'var_d2': {'[-inf ~ 747.0)': 65.54,\n", " '[747.0 ~ 782.0)': 45.72,\n", " '[782.0 ~ 820.0)': 88.88,\n", " '[820.0 ~ inf)': 168.3},\n", " 'var_d5': {'O,nan': 185.9, 'F': 103.26, 'M': 68.76},\n", " 'var_d7': {'LARGE FLEET OPERATOR,COMPANY,STRATEGIC TRANSPRTER,SALARIED,HOUSEWIFE': 120.82,\n", " 'DOCTOR-SELF EMPLOYED,nan,SAL(RETIRAL AGE 60),SERVICES,SAL(RETIRAL AGE 58),OTHERS,DOCTOR-SALARIED,AGENT,CONSULTANT,DIRECTOR,MEDIUM FLEETOPERATOR,TRADER,RETAIL TRANSPORTER,MANUFACTURING,FIRST TIME USERS,STUDENT,PENSIONER': 81.32,\n", " 'PROPRIETOR,TRADING,STRATEGIC CAPTIVE,SELF-EMPLOYED,SERV-PRIVATE SECTOR,SMALL RD TRANS.OPR,BUSINESSMAN,CARETAKER,RETAIL,AGRICULTURIST,RETIRED PERSONNEL,MANAGER,CONTRACTOR,ACCOUNTANT,BANKS SERVICE,GOVERNMENT SERVICE,ADVISOR,STRATEGIC S1,SCHOOLS,TEACHER,GENARAL RETAILER,RESTAURANT KEEPER,OFFICER,POLICEMAN,SERV-PUBLIC SECTOR,BARRISTER,Salaried,SALESMAN,RETAIL CAPTIVE,Defence (NCO),STRATEGIC S2,OTHERS NOT DEFINED,JEWELLER,SECRETARY,SUP STRAT TRANSPORT,LECTURER,ATTORNEY AT LAW,TAILOR,TECHNICIAN,CLERK,PLANTER,DRIVER,PRIEST,PROGRAMMER,EXECUTIVE ASSISTANT,PROOF READER,STOCKBROKER(S)-COMMD,TYPIST,ADMINSTRATOR,INDUSTRY,PHARMACIST,Trading,TAXI DRIVER,STRATEGIC BUS OP,CHAIRMAN,CARPENTER,DISPENSER,HELPER,STRATEGIC S3,RETAIL BUS OPERATOR,GARAGIST,PRIVATE TAILOR,NURSE': 55.79},\n", " 'var_d11': 
{'N': 88.69, 'U': 23.72},\n", " 'var_b10': {'[-inf ~ -8888.0)': 67.76,\n", " '[-8888.0 ~ 0.548229531)': 97.51,\n", " '[0.548229531 ~ inf)': 36.22},\n", " 'var_b18': {'[-inf ~ 2)': 83.72, '[2 ~ inf)': 39.23},\n", " 'var_b19': {'[-inf ~ -9999)': 70.78, '[-9999 ~ 4)': 97.51, '[4 ~ inf)': 42.2},\n", " 'var_b23': {'[-inf ~ -8888)': 64.51, '[-8888 ~ inf)': 102.69},\n", " 'var_l_20': {'[-inf ~ 0.000404297)': 78.55,\n", " '[0.000404297 ~ 0.003092244)': 103.85,\n", " '[0.003092244 ~ inf)': 36.21},\n", " 'var_l_68': {'[-inf ~ 0.000255689)': 70.63,\n", " '[0.000255689 ~ 0.002045513)': 24.56,\n", " '[0.002045513 ~ 0.007414983000000002)': 66.63,\n", " '[0.007414983000000002 ~ 0.019943748)': 99.55,\n", " '[0.019943748 ~ inf)': 142.36}}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Output standard scorecard \n", "card.export()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ### VII. Other functions \n", "\n", "--------------------\n", "\n", "***toad.transform.GBDTTransformer***\n", "\n", " GBDT encoding: a preprocessing step for the GBDT + LR technique. " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/zhouxiyu/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. 
Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.\n", "If you want the future behaviour and silence this warning, you can specify \"categories='auto'\".\n", "In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.\n", " warnings.warn(msg, FutureWarning)\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gbdt_transer = toad.transform.GBDTTransformer()\n", "gbdt_transer.fit(final_data[col+['target']], 'target', n_estimators = 10, max_depth = 2)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "gbdt_vars = gbdt_transer.transform(final_data[col])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(43576, 40)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gbdt_vars.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }