Welcome to toad’s documentation!

Installation

via pip

pip install toad

via anaconda

conda install toad --channel conda-forge

via source code

python setup.py install

Tutorial

A basic tutorial is provided.

Chinese guide

Contents

toad package

Submodules

toad.detector module

Command line tools for detecting csv data

Team: ESC

Examples

python detector.py -i xxx.csv -o report.csv

toad.detector.getTopValues(series, top=5, reverse=False)[source]

Get top/bottom n values

Parameters:
  • series (Series) – data series
  • top (number) – number of top/bottom n values
  • reverse (bool) – it will return bottom n values if True is given
Returns:

Series of top/bottom n values and percentage. [‘value:percent’, None]

Return type:

Series
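
As a rough illustration of the ‘value:percent’ output described above, the sketch below ranks values by frequency in plain Python. It is not toad's implementation: the helper name top_values, the two-decimal percentage format, and the absence of None padding are assumptions made for this example.

```python
from collections import Counter

def top_values(values, top=5, reverse=False):
    """Return up to `top` 'value:percent' strings (hypothetical helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() sorts by descending count; reversing puts the rarest values first.
    ranked = counts.most_common()
    if reverse:
        ranked = ranked[::-1]
    return [f"{value}:{count / total:.2%}" for value, count in ranked[:top]]

print(top_values(["a", "b", "a", "c", "a", "b"], top=2))
```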

toad.detector.getDescribe(series, percentiles=[0.25, 0.5, 0.75])[source]

Get descriptive statistics of a series

Parameters:
  • series (Series) – data series
  • percentiles – the percentiles to include in the output
Returns:

descriptive statistics of the data, including mean, std, min, max and percentiles

Return type:

Series

toad.detector.countBlank(series, blanks=[None])[source]

Count number and percentage of blank values in series

Parameters:
  • series (Series) – data series
  • blanks (list) – list of blank values
Returns:

number of blanks (number); the percentage of blank values (str)

Return type:

number
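
The count-and-percentage pair can be sketched in a few lines of plain Python. The name count_blank and the percentage format are assumptions for illustration; membership is tested with `in`, so NaN values would need separate handling.

```python
def count_blank(values, blanks=(None,)):
    """Return (number of blanks, percentage string); hypothetical helper."""
    n_blank = sum(1 for v in values if v in blanks)
    return n_blank, f"{n_blank / len(values):.2%}"

print(count_blank([1, None, "", 3], blanks=[None, ""]))
```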

toad.detector.isNumeric(series)[source]

Check if the series’s type is numeric

Parameters:series (Series) – data series
Returns:bool
toad.detector.detect(dataframe)[source]

Detect data

Parameters:dataframe (DataFrame) – data that will be detected
Returns:report of detecting
Return type:DataFrame

toad.merge module

toad.merge.ChiMerge()

Chi-Merge

Parameters:
  • feature (array-like) – feature to be merged
  • target (array-like) – an array of target classes
  • n_bins (int) – number of bins to merge into
  • min_samples (number) – minimum number of samples in each group; if float, it is used as the percentage of samples
  • min_threshold (number) – minimum chi-square threshold
Returns:

array of split points

Return type:

array
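
To make the idea behind Chi-Merge concrete, here is a minimal plain-Python sketch: start with one bin per distinct value and repeatedly merge the adjacent pair with the lowest chi-square statistic until n_bins remain. It assumes a binary 0/1 target and ignores min_samples and min_threshold; the names chi2_pair and chi_merge are hypothetical and this is not toad's implementation.

```python
def chi2_pair(a, b):
    """Chi-square statistic of two adjacent bins; a and b are [n_class0, n_class1]."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for j in range(2):
        col = a[j] + b[j]
        for row in (a, b):
            expected = sum(row) * col / total
            if expected > 0:
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2

def chi_merge(feature, target, n_bins):
    """Return split points after merging the most similar adjacent bins."""
    counts = {v: [0, 0] for v in set(feature)}
    for x, y in zip(feature, target):
        counts[x][y] += 1
    # Each bin is (lower bound, class counts); one bin per distinct value to start.
    bins = [(v, counts[v]) for v in sorted(counts)]
    while len(bins) > n_bins:
        scores = [chi2_pair(bins[i][1], bins[i + 1][1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))
        low, left = bins[i]
        _, right = bins[i + 1]
        bins[i:i + 2] = [(low, [left[0] + right[0], left[1] + right[1]])]
    # The lower bound of every bin except the first is a split point.
    return [low for low, _ in bins[1:]]

print(chi_merge([1, 1, 2, 2, 3, 3, 4, 4], [0, 0, 0, 0, 1, 1, 1, 1], n_bins=2))
```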

toad.merge.DTMerge()

Merge by Decision Tree

Parameters:
  • feature (array-like) –
  • target (array-like) – target will be used to fit decision tree
  • nan (number) – value will be used to fill nan
  • n_bins (int) – n groups that will be merged into
  • min_samples (int) – min number of samples in each leaf nodes
Returns:

array of split points

Return type:

array

toad.merge.KMeansMerge()

Merge by KMeans

Parameters:
  • feature (array-like) –
  • target (array-like) – target will be used to fit kmeans model
  • nan (number) – value will be used to fill nan
  • n_bins (int) – n groups that will be merged into
  • random_state (int) – random state will be used for kmeans model
Returns:

split points of feature

Return type:

array

toad.merge.QuantileMerge()

Merge by quantile

Parameters:
  • feature (array-like) –
  • nan (number) – value will be used to fill nan
  • n_bins (int) – n groups that will be merged into
  • q (array-like) – list of percentage split points
Returns:

split points of feature

Return type:

array
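
A quantile-based set of split points can be sketched with the standard library alone. The helper quantile_splits is hypothetical, requires Python 3.8 or later, and uses linear interpolation (method="inclusive"), which is an assumption rather than a statement about toad's behaviour; it also ignores the nan and q options.

```python
from statistics import quantiles

def quantile_splits(feature, n_bins=10):
    """Return the distinct cut points that divide `feature` into n_bins quantile groups."""
    cuts = quantiles(list(feature), n=n_bins, method="inclusive")
    # Repeated values can produce duplicate cut points, so keep each one once.
    return sorted(set(cuts))

print(quantile_splits(range(1, 11), n_bins=4))
```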

toad.merge.StepMerge()

Merge by step

Parameters:
  • feature (array-like) –
  • nan (number) – value will be used to fill nan
  • n_bins (int) – n groups that will be merged into
  • clip_v (number | tuple) – min/max value of clipping
  • clip_std (number | tuple) – min/max std of clipping
  • clip_q (number | tuple) – min/max quantile of clipping
Returns:

split points of feature

Return type:

array
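
Merging by step amounts to equal-width binning. The sketch below, with the hypothetical helper step_splits, computes the interior boundaries between the minimum and maximum of the feature; the clipping options are left out.

```python
def step_splits(feature, n_bins=10):
    """Return n_bins - 1 equally spaced split points between min and max."""
    low, high = min(feature), max(feature)
    step = (high - low) / n_bins
    return [low + step * i for i in range(1, n_bins)]

print(step_splits([0, 2, 10], n_bins=5))
```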

toad.merge.merge

merge feature into groups

Parameters:
  • feature (array-like) –
  • target (array-like) –
  • method (str) – ‘dt’, ‘chi’, ‘quantile’, ‘step’, ‘kmeans’ - the strategy to be used to merge feature
  • return_splits (bool) – if needs to return splits
  • n_bins (int) – n groups that will be merged into
Returns:

an array of merged labels with the same size as feature (array); list of split points (array), if return_splits is set

Return type:

array

toad.metrics module

toad.metrics.KS(score, target)[source]

calculate ks value

Parameters:
  • score (array-like) – scores or probabilities predicted by the model
  • target (array-like) – true target values
Returns:

the max KS value

Return type:

float
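
For readers unfamiliar with the statistic, the KS value is the largest gap between the cumulative distributions of the two classes over the sorted scores. The sketch below is a plain-Python illustration under the assumption of a 0/1 target with both classes present; ks_value is a hypothetical name, not toad's code.

```python
def ks_value(score, target):
    """Maximum absolute gap between the cumulative bad and good distributions."""
    pairs = sorted(zip(score, target))
    n_bad = sum(target)
    n_good = len(target) - n_bad
    cum_bad = cum_good = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Consume every row that shares the current score before comparing the CDFs.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            cum_bad += pairs[j][1]
            cum_good += 1 - pairs[j][1]
            j += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return best

print(ks_value([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```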

toad.metrics.KS_bucket(score, target, bucket=10, method='quantile', return_splits=False, **kwargs)[source]

calculate ks value by bucket

Parameters:
  • score (array-like) – scores or probabilities predicted by the model
  • target (array-like) – true target values
  • bucket (int) – number of groups to bin into
  • method (str) – method used to bin score: ‘quantile’ (default) or ‘step’
  • return_splits (bool) – if need to return splits of bucket
Returns:

DataFrame

toad.metrics.KS_by_col(df, by='feature', score='score', target='target')[source]
toad.metrics.SSE(y_pred, y)[source]

sum of squares due to error

toad.metrics.MSE(y_pred, y)[source]

mean of squares due to error
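
Both error measures follow directly from their definitions. The lowercase names below are hypothetical stand-ins written in plain Python.

```python
def sse(y_pred, y):
    """Sum of squared differences between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y))

def mse(y_pred, y):
    """SSE divided by the number of observations."""
    return sse(y_pred, y) / len(y)

print(sse([1, 2, 3], [1, 2, 5]), mse([1, 2, 3], [1, 2, 5]))
```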

toad.metrics.AIC(y_pred, y, k, llf=None)[source]

Akaike Information Criterion

Parameters:
  • y_pred (array-like) –
  • y (array-like) –
  • k (int) – number of features
  • llf (float) – result of log-likelihood function
toad.metrics.BIC(y_pred, y, k, llf=None)[source]

Bayesian Information Criterion

Parameters:
  • y_pred (array-like) –
  • y (array-like) –
  • k (int) – number of features
  • llf (float) – result of log-likelihood function
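
Given a log-likelihood, both criteria reduce to one-line formulas: AIC = 2k - 2·llf and BIC = k·ln(n) - 2·llf, where n is the number of observations. The sketch below uses hypothetical helper names and takes llf as a required input; how the functions above behave when llf is None is not covered here.

```python
import math

def aic(k, llf):
    """Akaike Information Criterion from k parameters and log-likelihood llf."""
    return 2 * k - 2 * llf

def bic(k, n, llf):
    """Bayesian Information Criterion; n is the number of observations."""
    return k * math.log(n) - 2 * llf

print(aic(3, -10.0), bic(3, 100, -10.0))
```
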
toad.metrics.F1(score, target, split='best', return_split=False)[source]

calculate f1 value

Parameters:
  • score (array-like) –
  • target (array-like) –
Returns:

best f1 score; best splitter (float), if return_split is set

Return type:

float
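
One way to picture a "best" split is to try every distinct score as a threshold and keep the one with the highest F1. The sketch below does exactly that in plain Python; the rule "predict 1 when score >= split" and the helper names are assumptions, not toad's implementation.

```python
def f1_at(score, target, split):
    """F1 of the rule 'predict 1 when score >= split'."""
    tp = sum(1 for s, t in zip(score, target) if s >= split and t == 1)
    fp = sum(1 for s, t in zip(score, target) if s >= split and t == 0)
    fn = sum(1 for s, t in zip(score, target) if s < split and t == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1(score, target):
    """Try every distinct score as a split and return (best F1, split)."""
    return max((f1_at(score, target, s), s) for s in set(score))

print(best_f1([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```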

toad.metrics.AUC(score, target, return_curve=False)[source]

AUC Score

Parameters:
  • score (array-like) – scores or probabilities predicted by the model
  • target (array-like) – true target values
  • return_curve (bool) – if need to return curve data for ROC plot
Returns:

auc score

Return type:

float
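
AUC can be read as the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The quadratic-time sketch below illustrates that definition in plain Python; auc_pairwise is a hypothetical name and production code would normally use a rank-based or library routine.

```python
def auc_pairwise(score, target):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(score, target) if t == 1]
    neg = [s for s, t in zip(score, target) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_pairwise([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```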

toad.metrics.PSI(test, base, combiner=None, return_frame=False)[source]

calculate PSI

Parameters:
  • test (array-like) – data to test PSI
  • base (array-like) – base data for calculating PSI
  • combiner (Combiner|list|dict) – combiner to combine data
  • return_frame (bool) – if need to return frame of proportion
Returns:

float|Series
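
The population stability index compares the share of each bin in the test data with its share in the base data: PSI = Σ (test% - base%) · ln(test% / base%). The sketch below assumes the inputs are already binned (categorical labels) and floors empty bins at a small eps so the logarithm stays defined; both the helper name and the eps floor are assumptions of this example.

```python
import math
from collections import Counter

def psi(test, base, eps=1e-6):
    """PSI over already-binned values; eps avoids log(0) for empty bins."""
    test_counts, base_counts = Counter(test), Counter(base)
    total = 0.0
    for key in set(test_counts) | set(base_counts):
        t = max(test_counts[key] / len(test), eps)
        b = max(base_counts[key] / len(base), eps)
        total += (t - b) * math.log(t / b)
    return total

print(psi(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```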

toad.metrics.matrix(y_pred, y, splits=None)[source]

confusion matrix of target

Parameters:
  • y_pred (array-like) –
  • y (array-like) –
  • splits (float|list) – split points of y_pred
Returns:

confusion matrix with true labels in rows and predicted labels in columns

Return type:

DataFrame

toad.plot module

toad.plot.badrate_plot(frame, x=None, target='target', by=None, freq=None, format=None, return_counts=False, return_proportion=False, return_frame=False)[source]

plot for badrate

Parameters:
  • frame (DataFrame) –
  • x (str) – column in frame that will be used as x axis
  • target (str) – target column in frame
  • by (str) – column in frame by which badrate will be calculated
  • freq (str) – offset aliases string by pandas http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
  • format (str) – format string for time
  • return_counts (bool) – if need return counts plot
  • return_frame (bool) – if need return frame
Returns:

badrate plot (Axes); counts plot (Axes); proportion plot (Axes); grouping detail data (DataFrame)

Return type:

Axes

toad.plot.corr_plot(frame, figure_size=(20, 15))[source]

plot for correlation

Parameters:frame (DataFrame) – frame to draw plot
Returns:Axes
toad.plot.proportion_plot(x=None, keys=None)[source]

plot for comparing proportion in different dataset

Parameters:
  • x (Series|list) – series or list of series data for plot
  • keys (str|list) – keys for each data
Returns:

Axes

toad.plot.roc_plot(score, target, compare=None)[source]

plot for roc

Parameters:
  • score (array-like) – predicted score
  • target (array-like) – true target
  • compare (array-like) – another score for comparing with score
Returns:

Axes

toad.plot.bin_plot(frame, x=None, target='target', iv=True, annotate_format='.2f')[source]

plot for bins

Parameters:
  • frame (DataFrame) –
  • x (str) – column in frame that will be used as x axis
  • target (str) – target column in frame
  • iv (bool) – if need to show iv in plot
  • annotate_format (str) – format str for axis annotation of chart
Returns:

bins’ proportion and badrate plot

Return type:

Axes

toad.scorecard module

class toad.scorecard.ScoreCard(pdo=60, rate=2, base_odds=35, base_score=750, card=None, combiner={}, transer=None, **kwargs)[source]

Bases: sklearn.base.BaseEstimator, toad.utils.mixin.RulesMixin, toad.utils.mixin.BinsMixin

coef_

coef of LR model

intercept_
n_features_
features_
combiner
fit(X, y)[source]
Parameters:
  • X (2D DataFrame) –
  • y (array-like) –
predict(X, **kwargs)[source]

predict score

Parameters:
  • X (2D array-like) – X to predict
  • return_sub (bool) – if need to return sub score, default False
Returns:predicted score (array-like); sub score for each feature (DataFrame)
Return type:array-like
predict_proba(X)[source]

predict probability

Parameters:X (2D array-like) – X to predict
Returns:probability of all classes
Return type:2d array
proba_to_score(prob)[source]

convert probability to score

odds = (1 - prob) / prob
score = factor * log(odds) + offset
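
The conversion can be sketched as follows, assuming the usual additive scorecard scaling (score = factor · log(odds) + offset) with factor = pdo / ln(rate) and offset = base_score - factor · ln(base_odds). Those two derivations and the helper make_scaler are assumptions based on common scorecard practice, not a transcription of toad's code; the defaults mirror the ScoreCard signature above.

```python
import math

def make_scaler(pdo=60, rate=2, base_odds=35, base_score=750):
    """Return (proba_to_score, score_to_proba) for the assumed scaling."""
    factor = pdo / math.log(rate)
    offset = base_score - factor * math.log(base_odds)

    def proba_to_score(prob):
        odds = (1 - prob) / prob
        return factor * math.log(odds) + offset

    def score_to_proba(score):
        # Inverse of the mapping above: odds = exp((score - offset) / factor).
        return 1 / (1 + math.exp((score - offset) / factor))

    return proba_to_score, score_to_proba

to_score, to_proba = make_scaler()
print(round(to_score(1 / 36), 6))  # odds of 35 map to the base score
```

Under these assumptions, multiplying the odds by rate raises the score by pdo points.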

score_to_proba(score)[source]

convert score to probability

Returns:the probability of 1
Return type:array-like|float
bin_to_score(bins, return_sub=False)[source]

predict score from bins

woe_to_score(woe, weight=None)[source]

calculate score by woe

after_load(rules)[source]

after load card

after_export(card, to_frame=False, to_json=None, to_csv=None, **kwargs)[source]

generate a scorecard object

Parameters:
  • to_frame (bool) – return DataFrame of card
  • to_json (str|IOBase) – io to write json file
  • to_csv (filepath|IOBase) – file to write csv
Returns:

dict

testing_frame(**kwargs)[source]

get testing frame with score

Returns:testing frame with score
Return type:DataFrame

toad.selection module

class toad.selection.StatsModel(estimator='ols', criterion='aic', intercept=False)[source]

Bases: object

get_estimator(name)[source]
stats(X, y)[source]
get_criterion(pre, y, k)[source]
t_value(pre, y, X, coef)[source]
p_value(t, n)[source]
loglikelihood(pre, y, k)[source]
toad.selection.stepwise(frame, target='target', estimator='ols', direction='both', criterion='aic', p_enter=0.01, p_remove=0.01, p_value_enter=0.2, intercept=False, max_iter=None, return_drop=False, exclude=None)[source]

stepwise to select features

Parameters:
  • frame (DataFrame) – dataframe that will be used for selection
  • target (str) – target name in frame
  • estimator (str) – model to use for stats
  • direction (str) – direction of stepwise; supports ‘forward’, ‘backward’ and ‘both’ (‘both’ is suggested)
  • criterion (str) – criterion for the statistical model; supports ‘aic’ and ‘bic’
  • p_enter (float) – threshold that will be used in ‘forward’ and ‘both’ to keep features
  • p_remove (float) – threshold that will be used in ‘backward’ to remove features
  • intercept (bool) – whether the model has an intercept
  • p_value_enter (float) – threshold that will be used in ‘both’ to remove features
  • max_iter (int) – maximum number of iterations
  • return_drop (bool) – if need to return the names of the features that have been dropped
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

selected dataframe; list of dropped feature names (array), if return_drop is set

Return type:

DataFrame

toad.selection.drop_empty(frame, threshold=0.9, nan=None, return_drop=False, exclude=None)[source]

drop columns by empty

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • threshold (number) – drop the features whose number of empty values is greater than threshold; if threshold is a float, it is used as a percentage
  • nan (any) – values that will be treated as empty
  • return_drop (bool) – if need to return the names of the features that have been dropped
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

selected dataframe; list of dropped feature names (array), if return_drop is set

Return type:

DataFrame

toad.selection.drop_var(frame, threshold=0, return_drop=False, exclude=None)[source]

drop columns by variance

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • threshold (float) – drop features whose variance is less than threshold
  • return_drop (bool) – if need to return the names of the features that have been dropped
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

selected dataframe; list of dropped feature names (array), if return_drop is set

Return type:

DataFrame

toad.selection.drop_corr(frame, target=None, threshold=0.7, by='IV', return_drop=False, exclude=None)[source]

drop columns by correlation

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • target (str) – target name in dataframe
  • threshold (float) – drop the feature with the smallest weight in each group whose correlation is greater than threshold
  • by (array-like) – weight of features that will be used to drop the features
  • return_drop (bool) – if need to return the names of the features that have been dropped
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

selected dataframe; list of dropped feature names (array), if return_drop is set

Return type:

DataFrame

toad.selection.drop_iv(frame, target='target', threshold=0.02, return_drop=False, return_iv=False, exclude=None)[source]

drop columns by IV

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • target (str) – target name in dataframe
  • threshold (float) – drop the features whose IV is less than threshold
  • return_drop (bool) – if need to return the names of the features that have been dropped
  • return_iv (bool) – if need to return features’ IV
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

selected dataframe; list of dropped feature names (array), if return_drop is set; features’ IV (Series), if return_iv is set

Return type:

DataFrame

toad.selection.drop_vif(frame, threshold=3, return_drop=False, exclude=None)[source]

variance inflation factor

Parameters:
  • frame (DataFrame) –
  • threshold (float) – drop features until every VIF is less than threshold
  • return_drop (bool) – if need to return the names of the features that have been dropped
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

selected dataframe; list of dropped feature names (array), if return_drop is set

Return type:

DataFrame

toad.selection.select(frame, target='target', empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None)[source]

select features by rate of empty, iv and correlation

Parameters:
  • frame (DataFrame) –
  • target (str) – target’s name in dataframe
  • empty (number) – drop the features whose number of empty values is greater than threshold; if threshold is a float, it is used as a percentage
  • iv (float) – drop the features whose IV is less than threshold
  • corr (float) – drop the feature with the smallest IV in each group whose correlation is greater than threshold
  • return_drop (bool) – if need to return the names of the features that have been dropped
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

selected dataframe; list of dropped feature names in each step (dict), if return_drop is set

Return type:

DataFrame

toad.stats module

toad.stats.gini(target)[source]

get gini index of a feature

Parameters:target (array-like) – list of target values for which gini will be calculated
Returns:gini value
Return type:number
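
As a reminder of the definition, the Gini index of a set of labels is 1 minus the sum of squared class proportions. The plain-Python helper below is a hypothetical sketch of that formula.

```python
from collections import Counter

def gini_index(target):
    """1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(target)
    return 1 - sum((count / n) ** 2 for count in Counter(target).values())

print(gini_index([0, 0, 1, 1]))
```
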
toad.stats.gini_cond

get conditional gini index of a feature

Parameters:
  • feature (array-like) –
  • target (array-like) –
Returns:

conditional gini value. If feature is continuous, it will return the best gini value when the feature is binned into two groups

Return type:

number

toad.stats.entropy(target)[source]

get information entropy of a feature

Parameters:target (array-like) –
Returns:information entropy
Return type:number
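
Information entropy is the negated sum of p · log(p) over the class proportions. The sketch below uses the natural logarithm, which is an assumption of this example (other bases only rescale the result); the helper name is hypothetical.

```python
import math
from collections import Counter

def information_entropy(target):
    """-sum(p_k * ln(p_k)) over the class proportions p_k."""
    n = len(target)
    return -sum((c / n) * math.log(c / n) for c in Counter(target).values())

print(information_entropy([0, 0, 1, 1]))
```
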
toad.stats.entropy_cond

get conditional entropy of a feature

Parameters:
  • feature (array-like) –
  • target (array-like) –
Returns:

conditional information entropy. If feature is continuous, it will return the best entropy when the feature is binned into two groups

Return type:

number

toad.stats.probability(target, mask=None)[source]

get probability of target by mask

toad.stats.WOE(y_prob, n_prob)[source]

get WOE of a group

Parameters:
  • y_prob – the probability of grouped y in total y
  • n_prob – the probability of grouped n in total n
Returns:

woe value

Return type:

number

toad.stats.IV

get the IV of a feature

Parameters:
  • feature (array-like) –
  • target (array-like) –
  • return_sub (bool) – if need return IV of each groups
  • n_bins (int) – n groups that the feature will bin into
  • method (str) – the strategy to be used to merge feature, default is ‘dt’
  • **kwargs – other options for merge function
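
WOE and IV are closely related: for each group, WOE compares the group's share of all "y" rows with its share of all "n" rows, and IV sums (y_prob - n_prob) · WOE over the groups. The sketch below assumes WOE = ln(y_prob / n_prob), a target where 1 marks "y", and bins that each contain both classes; the helper name and these conventions are assumptions for illustration.

```python
import math
from collections import defaultdict

def information_value(bins, target):
    """Sum of (y_prob - n_prob) * WOE over the bin labels in `bins`."""
    y_counts, n_counts = defaultdict(int), defaultdict(int)
    for label, t in zip(bins, target):
        if t == 1:
            y_counts[label] += 1
        else:
            n_counts[label] += 1
    y_total, n_total = sum(y_counts.values()), sum(n_counts.values())
    iv = 0.0
    for label in set(bins):
        y_prob = y_counts[label] / y_total
        n_prob = n_counts[label] / n_total
        woe = math.log(y_prob / n_prob)  # fails if a bin lacks one class
        iv += (y_prob - n_prob) * woe
    return iv

print(information_value(["a"] * 4 + ["b"] * 4, [1, 1, 1, 0, 1, 0, 0, 0]))
```
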
toad.stats.badrate(target)[source]

calculate badrate

Parameters:target (array-like) – target array in which 1 is bad
Returns:float
toad.stats.VIF(frame)[source]

calculate vif

Parameters:frame (ndarray|DataFrame) –
Returns:Series
class toad.stats.indicator(*args, is_class=False, **kwargs)[source]

Bases: toad.utils.decorator.Decorator

indicator decorator

name = 'indicator'
need_merge = False
dtype = None
wrapper(*args, **kwargs)[source]
toad.stats.column_quality(feature, target, name='feature', indicators=[], need_merge=False, **kwargs)[source]

calculate quality of a feature

Parameters:
  • feature (array-like) –
  • target (array-like) –
  • name (str) – feature’s name that will be set in the returned Series
  • indicators (list) – list of indicator functions
  • need_merge (bool) – if need merge feature
Returns:

a list of quality with the feature’s name

Return type:

Series

toad.stats.quality(dataframe, target='target', cpu_cores=0, iv_only=False, indicators=['iv', 'gini', 'entropy', 'unique'], **kwargs)[source]

get quality of features in data

Parameters:
  • dataframe (DataFrame) – dataframe whose quality will be calculated
  • target (str) – the target’s name in dataframe
  • iv_only (bool) – deprecated. if only calculate IV
  • cpu_cores (int) – the maximum number of CPU cores that will be used; 0 means all CPUs will be used, -1 means all CPUs but one will be used.
Returns:

quality of features with the features’ name as row name

Return type:

DataFrame

toad.transform module

class toad.transform.Transformer[source]

Bases: sklearn.base.TransformerMixin, toad.utils.mixin.RulesMixin

Base class for transformers

fit()

fit method, see details in fit_ method

transform(X, *args, **kwargs)[source]

transform method, see details in transform_ method

default_rule()[source]
export(**kwargs)[source]

export rules to dict or a json file

Parameters:to_json (str|IOBase) – json file to save rules
Returns:dictionary of rules
Return type:dict
fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.
  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
  • **fit_params (dict) – Additional fit parameters.
Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

load(rules, update=False, **kwargs)[source]

load rules from dict or json file

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
  • update (bool) – if need to use updating instead of replacing rules
rules
update(*args, **kwargs)[source]

update rules

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
class toad.transform.WOETransformer[source]

Bases: toad.transform.Transformer

WOE transformer

fit_(X, y)[source]

fit WOE transformer

Parameters:
  • X (DataFrame|array-like) –
  • y (str|array-like) –
  • select_dtypes (str|numpy.dtypes) – ‘object’, ‘number’ etc. only selected dtypes will be transformed
transform_(rule, X, default='min')[source]

transform function for single feature

Parameters:
  • X (array-like) –
  • default (str) – ‘min’(default), ‘max’ - the strategy to be used for unknown group
Returns:

array-like

default_rule()[source]
export(**kwargs)[source]

export rules to dict or a json file

Parameters:to_json (str|IOBase) – json file to save rules
Returns:dictionary of rules
Return type:dict
fit()

fit method, see details in fit_ method

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.
  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
  • **fit_params (dict) – Additional fit parameters.
Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

load(rules, update=False, **kwargs)[source]

load rules from dict or json file

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
  • update (bool) – if need to use updating instead of replacing rules
rules
transform(X, *args, **kwargs)[source]

transform method, see details in transform_ method

update(*args, **kwargs)[source]

update rules

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
class toad.transform.Combiner[source]

Bases: toad.transform.Transformer, toad.utils.mixin.BinsMixin

Combiner for merge data

fit_(X, y=None, method='chi', empty_separate=False, **kwargs)[source]

fit combiner

Parameters:
  • X (DataFrame|array-like) – features to be combined
  • y (str|array-like) – target data or name of target in X
  • method (str) – the strategy to be used to merge X, same as .merge, default is chi
  • n_bins (int) – number of bins to combine into
  • empty_separate (bool) – if need to combine empty values into a separate group
transform_(rule, X, labels=False, ellipsis=16, **kwargs)[source]

transform X by combiner

Parameters:
  • X (DataFrame|array-like) – features to be transformed
  • labels (bool) – if need to use labels for resulting bins, False by default
  • ellipsis (int) – maximum label length before a label is shortened with an ellipsis; None skips shortening
Returns:

array-like

set_rules(map, reset=False)[source]

set rules for combiner

Parameters:
  • map (dict|array-like) – map of splits
  • reset (bool) – if need to reset combiner
Returns:

self

ELSE_GROUP = 'else'
EMPTY_BIN = -1
NUMBER_EXP = re.compile('\\[(-inf|-?\\d+(.\\d+)?)\\s*[~-]\\s*(inf|-?\\d+(.\\d+)?)\\)')
default_rule()[source]
export(**kwargs)[source]

export rules to dict or a json file

Parameters:to_json (str|IOBase) – json file to save rules
Returns:dictionary of rules
Return type:dict
fit()

fit method, see details in fit_ method

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.
  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
  • **fit_params (dict) – Additional fit parameters.
Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

classmethod format_bins(bins, index=False, ellipsis=None)[source]

format bins to label

Parameters:
  • bins (ndarray) – bins to format
  • index (bool) – if need index prefix
  • ellipsis (int) – maximum label length before a label is shortened with an ellipsis; None skips shortening
Returns:

array of labels

Return type:

ndarray

load(rules, update=False, **kwargs)[source]

load rules from dict or json file

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
  • update (bool) – if need to use updating instead of replacing rules
classmethod parse_bins(bins)[source]

parse labeled bins to array

rules
transform(X, *args, **kwargs)[source]

transform method, see details in transform_ method

update(*args, **kwargs)[source]

update rules

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
class toad.transform.GBDTTransformer[source]

Bases: toad.transform.Transformer

GBDT transformer

fit_(X, y, **kwargs)[source]

fit GBDT transformer

Parameters:
  • X (DataFrame|array-like) –
  • y (str|array-like) –
  • select_dtypes (str|numpy.dtypes) – ‘object’, ‘number’ etc. only selected dtypes will be transformed
transform_(rules, X)[source]

transform woe

Parameters:X (DataFrame|array-like) –
Returns:array-like
default_rule()[source]
export(**kwargs)[source]

export rules to dict or a json file

Parameters:to_json (str|IOBase) – json file to save rules
Returns:dictionary of rules
Return type:dict
fit()

fit method, see details in fit_ method

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.
  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
  • **fit_params (dict) – Additional fit parameters.
Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

load(rules, update=False, **kwargs)[source]

load rules from dict or json file

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
  • update (bool) – if need to use updating instead of replacing rules
rules
transform(X, *args, **kwargs)[source]

transform method, see details in transform_ method

update(*args, **kwargs)[source]

update rules

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules

toad.preprocessing module

toad.preprocessing.process module
class toad.preprocessing.process.Processing(data)[source]

Bases: object

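Example:

>>> from toad.preprocessing.process import Processing, Mask
>>> from toad.preprocessing.partition import TimePartition
>>> (Processing(data)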
...     .groupby('id')
...     .partitionby(TimePartition(
...         'base_time',
...         'filter_time',
...         ['30d', '60d', '180d', '365d', 'all']
...     ))
...     .apply({'A': ['max', 'min', 'mean']})
...     .apply({'B': ['max', 'min', 'mean']})
...     .apply({'C': 'nunique'})
...     .apply({'D': {
...         'f': len,
...         'name': 'normal_count',
...         'mask':  Mask('D').isin(['normal']),
...     }})
...     .apply({'id': 'count'})
...     .exec()
... )
groupby(name)[source]

group data by name

Parameters:name (str) – column name in data
apply(f)[source]

apply functions to data

Parameters:f (dict|function) – a config dict whose keys are column names and whose values are functions; each function receives the column series as its argument. If f is a function, it receives the whole dataframe as its argument.
append_func(col, func)[source]
partitionby(p)[source]

partition data into multiple pieces; processing is applied to every piece

Parameters:p (Partition) –
exec()[source]
process(data)[source]
class toad.preprocessing.process.Mask(column=None)[source]

Bases: object

a placeholder to select dataframe

push(op, value)[source]
replay(data)[source]
isin(other)[source]
isna()[source]
class toad.preprocessing.process.F(f, name=None, mask=None)[source]

Bases: object

function class for processing

name
is_buildin
need_filter
filter(data)[source]
toad.preprocessing.partition module
class toad.preprocessing.partition.Partition[source]

Bases: object

partition(data)[source]

partition data

Parameters:data (DataFrame) – dataframe
Returns:mask of partition data (iterator -> ndarray[bool]); suffix string of current partition (iterator -> str)
Return type:iterator -> ndarray[bool]
class toad.preprocessing.partition.TimePartition(base, filter, times)[source]

Bases: toad.preprocessing.partition.Partition

partition data by time delta

Parameters:
  • base (str) – column name of base time
  • filter (str) – column name of target time to be compared
  • times (list) – list of time deltas

Example:

>>> TimePartition('apply_time', 'query_time', ['30d', '90d', 'all'])
partition(data)[source]

partition data

Parameters:data (DataFrame) – dataframe
Returns:mask of partition data (iterator -> ndarray[bool]); suffix string of current partition (iterator -> str)
Return type:iterator -> ndarray[bool]
class toad.preprocessing.partition.ValuePartition(column)[source]

Bases: toad.preprocessing.partition.Partition

partition data by column values

Parameters:column (str) – column name which will be used as partition

Example:

>>> ValuePartition('status')
partition(data)[source]

partition data

Parameters:data (DataFrame) – dataframe
Returns:mask of partition data (iterator -> ndarray[bool]); suffix string of current partition (iterator -> str)
Return type:iterator -> ndarray[bool]
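
To illustrate what partitioning by column values produces, the sketch below yields one boolean mask and one suffix per distinct value. It works on a list of dicts rather than a DataFrame, and both the function name and the suffix format are hypothetical choices for this example.

```python
def value_partition(rows, column):
    """Yield (mask, suffix) for each distinct value found in `column`."""
    values = [row[column] for row in rows]
    for value in sorted(set(values)):
        # The mask marks the rows that belong to the current partition.
        mask = [v == value for v in values]
        yield mask, f"_{value}"

rows = [{"status": "ok"}, {"status": "bad"}, {"status": "ok"}]
for mask, suffix in value_partition(rows, "status"):
    print(suffix, mask)
```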

toad.utils module

toad.utils.func module
class toad.utils.func.Parallel[source]

Bases: object

apply(func, args=(), kwargs={})[source]
join()[source]
toad.utils.func.np_count(arr, value, default=None)[source]
toad.utils.func.has_nan(arr)[source]
toad.utils.func.np_unique(arr, **kwargs)[source]
toad.utils.func.to_ndarray(s, dtype=None)[source]
toad.utils.func.fillna(feature, by=-1)[source]
toad.utils.func.bin_by_splits(feature, splits)[source]

Bin feature by split points
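
Binning by split points can be pictured as a binary search: each value is assigned the index of the interval it falls into. The sketch below assumes left-closed, right-open intervals, matching the "[a ~ b)" label pattern in NUMBER_EXP; assign_bins is a hypothetical helper.

```python
from bisect import bisect_right

def assign_bins(feature, splits):
    """Return the bin index of each value; bin i covers [splits[i-1], splits[i])."""
    ordered = sorted(splits)
    return [bisect_right(ordered, x) for x in feature]

print(assign_bins([1, 3, 5, 7], [3, 6]))
```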

toad.utils.func.feature_splits(feature, target)[source]

find possible split points

toad.utils.func.iter_df(dataframe, feature, target, splits)[source]

iterate dataframe by split points

Returns:iterator (df, splitter)
toad.utils.func.inter_feature(feature, splits)[source]
toad.utils.func.is_continuous(series)[source]
toad.utils.func.split_target(frame, target)[source]
toad.utils.func.unpack_tuple(x)[source]
toad.utils.func.generate_str(size=6, chars='ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')[source]
toad.utils.func.save_json(contents, file, indent=4)[source]

save json file

Parameters:
  • contents (dict) – contents to save
  • file (str|IOBase) – file to save
toad.utils.func.read_json(file)[source]

read json file

toad.utils.func.clip(series, value=None, std=None, quantile=None)[source]

clip series

Parameters:
  • series (array-like) – series to be clipped
  • value (number | tuple) – min/max value of clipping
  • std (number | tuple) – min/max std of clipping
  • quantile (number | tuple) – min/max quantile of clipping
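
Clipping by value can be sketched as bounding every element between a minimum and a maximum. The example below covers only the tuple form of value, with None leaving a side open; how a single number is interpreted is not specified above, so it is left out. The helper name is hypothetical.

```python
def clip_values(series, value):
    """Clip each element to the (low, high) bounds; None leaves a side open."""
    low, high = value
    clipped = []
    for x in series:
        if low is not None and x < low:
            x = low
        if high is not None and x > high:
            x = high
        clipped.append(x)
    return clipped

print(clip_values([1, 5, 10, 50], (2, 20)))
```
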
toad.utils.func.diff_time(base, target, format=None, time='day')[source]
toad.utils.func.diff_time_frame(base, frame, format=None)[source]
toad.utils.func.flatten_columns(columns, sep='_')[source]

flatten multiple columns to 1-dim columns joined with ‘_’
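
For instance, multi-level column names such as ('A', 'max') become 'A_max'. The sketch below shows the joining step on plain tuples; it is a hypothetical stand-in that passes non-tuple names through unchanged.

```python
def flatten(columns, sep="_"):
    """Join tuple column names with `sep`; leave plain names as strings."""
    return [
        sep.join(str(part) for part in col) if isinstance(col, tuple) else str(col)
        for col in columns
    ]

print(flatten([("A", "max"), ("A", "min"), "id"]))
```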

toad.utils.func.bin_to_number(reg=None)[source]
Returns:func(string) -> number
Return type:function
toad.utils.func.generate_target(size, rate=0.5, weight=None, reverse=False)[source]

generate target for reject inference

Parameters:
  • size (int) – size of target
  • rate (float) – rate of ‘1’ in target
  • weight (array-like) – weight of ‘1’ to generate target
  • reverse (bool) – if need reverse weight
Returns:

array

toad.utils.func.get_dummies(dataframe, exclude=None, binary_drop=False, **kwargs)[source]

get dummies

toad.utils.decorator module
class toad.utils.decorator.Decorator(*args, is_class=False, **kwargs)[source]

Bases: object

base decorator class

is_class = False
setup(*args, **kwargs)[source]
call(*args, **kwargs)[source]
wrapper(*args, **kwargs)[source]
class toad.utils.decorator.frame_exclude(*args, is_class=False, **kwargs)[source]

Bases: toad.utils.decorator.Decorator

decorator for excluding columns

wrapper(X, *args, exclude=None, **kwargs)[source]
class toad.utils.decorator.select_dtypes(*args, is_class=False, **kwargs)[source]

Bases: toad.utils.decorator.Decorator

decorator for selecting frame columns by dtypes

wrapper(X, *args, select_dtypes=None, **kwargs)[source]
class toad.utils.decorator.save_to_json(*args, is_class=False, **kwargs)[source]

Bases: toad.utils.decorator.Decorator

support saving the result to a json file

wrapper(*args, to_json=None, **kwargs)[source]
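
The general pattern of a decorator that adds a to_json keyword can be sketched with a plain function decorator. This is not toad's class-based Decorator: the function below, its handling of str versus file objects, and the choice to still return the result are assumptions made for illustration.

```python
import functools
import io
import json

def save_to_json(func):
    """Add an optional `to_json` keyword (path or writable stream) to `func`."""
    @functools.wraps(func)
    def wrapper(*args, to_json=None, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(to_json, str):
            # A string is treated as a file path.
            with open(to_json, "w") as handle:
                json.dump(result, handle, indent=4)
        elif to_json is not None:
            # Anything else is assumed to be a writable text stream.
            json.dump(result, to_json, indent=4)
        return result
    return wrapper

@save_to_json
def export_rules():
    return {"age": [18, 30, 45]}

buffer = io.StringIO()
rules = export_rules(to_json=buffer)
print(buffer.getvalue())
```
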
class toad.utils.decorator.load_from_json(*args, is_class=False, **kwargs)[source]

Bases: toad.utils.decorator.Decorator

support loading data from a json file

require_first = False
wrapper(*args, from_json=None, **kwargs)[source]
class toad.utils.decorator.support_dataframe(*args, is_class=False, **kwargs)[source]

Bases: toad.utils.decorator.Decorator

decorator for supporting dataframe

require_target = True
target = 'target'
wrapper(frame, *args, **kwargs)[source]
class toad.utils.decorator.proxy_docstring(*args, is_class=False, **kwargs)[source]

Bases: toad.utils.decorator.Decorator

method_name = None
toad.utils.mixin module
class toad.utils.mixin.RulesMixin[source]

Bases: object

default_rule()[source]
rules
load(rules, update=False, **kwargs)[source]

load rules from dict or json file

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
  • update (bool) – if need to use updating instead of replacing rules
export(**kwargs)[source]

export rules to dict or a json file

Parameters:to_json (str|IOBase) – json file to save rules
Returns:dictionary of rules
Return type:dict
update(*args, **kwargs)[source]

update rules

Parameters:
  • rules (dict) – dictionary of rules
  • from_json (str|IOBase) – json file of rules
class toad.utils.mixin.BinsMixin[source]

Bases: object

EMPTY_BIN = -1
ELSE_GROUP = 'else'
NUMBER_EXP = re.compile('\\[(-inf|-?\\d+(.\\d+)?)\\s*[~-]\\s*(inf|-?\\d+(.\\d+)?)\\)')
classmethod parse_bins(bins)[source]

parse labeled bins to array
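
The NUMBER_EXP pattern listed above describes interval labels such as "[-inf ~ 3.5)". The sketch below copies that pattern and uses it to read the two bounds of a single label; parse_interval is a hypothetical helper and does not reproduce what parse_bins returns.

```python
import re

# Pattern copied from the NUMBER_EXP attribute shown in this reference.
NUMBER_EXP = re.compile('\\[(-inf|-?\\d+(.\\d+)?)\\s*[~-]\\s*(inf|-?\\d+(.\\d+)?)\\)')

def parse_interval(label):
    """Return (low, high) as floats, or None if the label is not an interval."""
    match = NUMBER_EXP.match(label)
    if match is None:
        return None
    # Groups 1 and 3 hold the lower and upper bound; float() accepts 'inf' and '-inf'.
    return float(match.group(1)), float(match.group(3))

print(parse_interval("[-inf ~ 3.5)"))
print(parse_interval("else"))
```

Note that the dot in the pattern is unescaped, so it matches any character and not only a decimal point.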

classmethod format_bins(bins, index=False, ellipsis=None)[source]

format bins to label

Parameters:
  • bins (ndarray) – bins to format
  • index (bool) – if need index prefix
  • ellipsis (int) – maximum label length before a label is shortened with an ellipsis; None skips shortening
Returns:

array of labels

Return type:

ndarray
