toad.selection module

class toad.selection.StatsModel(estimator='ols', criterion='aic', intercept=False)

Bases: object

get_estimator(name)
stats(X, y)
get_criterion(pre, y, k)
t_value(pre, y, X, coef)
p_value(t, n)
loglikelihood(pre, y, k)

toad.selection.stepwise(frame, target='target', estimator='ols', direction='both', criterion='aic', p_enter=0.01, p_remove=0.01, p_value_enter=0.2, intercept=False, max_iter=None, return_drop=False, exclude=None)

select features by stepwise regression

Parameters:
  • frame (DataFrame) – dataframe from which features are selected
  • target (str) – name of the target column in frame
  • estimator (str) – model used to compute the statistics
  • direction (str) – direction of the stepwise search; supports ‘forward’, ‘backward’ and ‘both’ (‘both’ is recommended)
  • criterion (str) – criterion used to evaluate the statistical model; supports ‘aic’ and ‘bic’
  • p_enter (float) – threshold used in ‘forward’ and ‘both’ to keep features
  • p_remove (float) – threshold used in ‘backward’ to remove features
  • p_value_enter (float) – threshold used in ‘both’ to remove features
  • intercept (bool) – whether the model has an intercept
  • max_iter (int) – maximum number of iterations
  • return_drop (bool) – whether to also return the names of the dropped features
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

the selected dataframe; if return_drop is True, also an array of the feature names that have been dropped

Return type:

DataFrame
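The ‘aic’ and ‘bic’ criteria can be illustrated with a minimal sketch. It uses the textbook definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L) with a Gaussian log-likelihood. Whether toad computes them in exactly this form is an assumption, and the helper names and data below are hypothetical.

```python
import math

def gaussian_loglikelihood(pred, y):
    """Gaussian log-likelihood evaluated at the maximum-likelihood variance."""
    n = len(y)
    mse = sum((p - t) ** 2 for p, t in zip(pred, y)) / n
    return -n / 2 * (math.log(2 * math.pi * mse) + 1)

def aic(pred, y, k):
    """Akaike information criterion for a model with k parameters."""
    return 2 * k - 2 * gaussian_loglikelihood(pred, y)

def bic(pred, y, k):
    """Bayesian information criterion for a model with k parameters."""
    return k * math.log(len(y)) - 2 * gaussian_loglikelihood(pred, y)

# Hypothetical predictions from two candidate models of the same target.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
pred1 = [1.2, 1.8, 3.2, 3.8, 5.2, 5.8, 7.2, 7.8]          # 2 parameters
pred2 = [1.19, 1.81, 3.19, 3.81, 5.19, 5.81, 7.19, 7.81]  # 3 parameters

candidates = {"x1": (pred1, 2), "x1+x2": (pred2, 3)}
scores = {name: aic(pred, y, k) for name, (pred, k) in candidates.items()}

# Lower is better: keep the candidate with the smallest criterion value.
best = min(scores, key=scores.get)
print(best, scores)
```

A stepwise search keeps the candidate with the lowest criterion value. Here the extra feature improves the fit too little to justify its additional parameter, so the smaller model is preferred.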

toad.selection.drop_empty(frame, threshold=0.9, nan=None, return_drop=False, exclude=None)

drop columns that have too many empty values

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • threshold (number) – drop features whose number of empty values is greater than threshold; if threshold is a float, it is interpreted as a proportion of the total number of rows
  • nan (any) – value(s) that will be treated as empty
  • return_drop (bool) – whether to also return the names of the dropped features
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

the selected dataframe; if return_drop is True, also an array of the feature names that have been dropped

Return type:

DataFrame
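The threshold rule can be sketched in plain Python. This is an illustration of the idea, not toad's implementation: the dict-of-lists layout, the helper names, and the rule that a threshold below 1 is read as a fraction of the rows are all assumptions.

```python
import math

def is_empty(value, nan=None):
    """Treat None, float NaN and the user-supplied `nan` marker as empty."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return nan is not None and value == nan

def drop_empty_sketch(columns, threshold=0.9, nan=None, exclude=()):
    """columns: dict mapping feature name -> list of values (hypothetical layout)."""
    n_rows = len(next(iter(columns.values())))
    # Assumption: a threshold below 1 is a fraction of the row count.
    limit = threshold * n_rows if threshold < 1 else threshold
    dropped = [
        name for name, values in columns.items()
        if name not in exclude
        and sum(is_empty(v, nan) for v in values) > limit
    ]
    kept = {name: v for name, v in columns.items() if name not in dropped}
    return kept, dropped

data = {
    "a": [1, None, None, None],
    "b": [1, 2, None, 4],
    "c": [-1, -1, -1, 5],
}
kept, dropped = drop_empty_sketch(data, threshold=0.5)
# Treating -1 as an empty marker also removes column "c".
kept_nan, dropped_nan = drop_empty_sketch(data, threshold=0.5, nan=-1)
print(dropped, dropped_nan)
```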

toad.selection.drop_var(frame, threshold=0, return_drop=False, exclude=None)

drop columns by variance

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • threshold (float) – drop features whose variance is less than threshold
  • return_drop (bool) – whether to also return the names of the dropped features
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

the selected dataframe; if return_drop is True, also an array of the feature names that have been dropped

Return type:

DataFrame
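A variance filter can be sketched with the standard library. The description above says "less than threshold"; this sketch uses "less than or equal" so that the default threshold of 0 removes constant columns, which is an assumption about the intended behaviour. The function name and data layout are hypothetical.

```python
from statistics import pvariance

def drop_var_sketch(columns, threshold=0.0, exclude=()):
    """columns: dict mapping feature name -> list of numbers (hypothetical layout)."""
    dropped = [
        name for name, values in columns.items()
        # Assumption: "<=" so that zero-variance columns go at threshold 0.
        if name not in exclude and pvariance(values) <= threshold
    ]
    kept = {name: v for name, v in columns.items() if name not in dropped}
    return kept, dropped

data = {
    "const": [3, 3, 3, 3],   # zero variance, dropped
    "x": [1, 2, 3, 4],       # varies, kept
    "id": [7, 7, 7, 7],      # zero variance but protected by exclude
}
kept, dropped = drop_var_sketch(data, exclude=("id",))
print(dropped)
```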

toad.selection.drop_corr(frame, target=None, threshold=0.7, by='IV', return_drop=False, exclude=None)

drop columns by correlation

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • target (str) – target name in dataframe
  • threshold (float) – in each group of features whose correlation is greater than threshold, drop the features with the smallest weight
  • by (str or array-like) – weight of the features, used to decide which features to drop
  • return_drop (bool) – whether to also return the names of the dropped features
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

the selected dataframe; if return_drop is True, also an array of the feature names that have been dropped

Return type:

DataFrame
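The idea of dropping the lowest-weight feature among correlated ones can be sketched as follows. This is a simplified pairwise version written for illustration: toad's grouping logic may differ, and the function names, the weights and the data are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_corr_sketch(columns, weights, threshold=0.7):
    """For each correlated pair, drop the feature with the smaller weight."""
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.add(a if weights[a] < weights[b] else b)
    kept = {name: v for name, v in columns.items() if name not in dropped}
    return kept, sorted(dropped)

data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],   # almost a multiple of x1
    "x3": [5, 1, 4, 2, 3],    # weakly related to the others
}
# Hypothetical weights, e.g. each feature's IV.
weights = {"x1": 0.30, "x2": 0.10, "x3": 0.05}

kept, dropped = drop_corr_sketch(data, weights)
print(dropped)
```

Only x1 and x2 exceed the threshold, and x2 has the smaller weight, so it is the one removed. x3 survives despite having the lowest weight because it is not strongly correlated with anything.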

toad.selection.drop_iv(frame, target='target', threshold=0.02, return_drop=False, return_iv=False, exclude=None)

drop columns by IV

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • target (str) – target name in dataframe
  • threshold (float) – drop the features whose IV is less than threshold
  • return_drop (bool) – whether to also return the names of the dropped features
  • return_iv (bool) – whether to also return the features’ IV
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

the selected dataframe; if return_drop is True, also an array of the feature names that have been dropped; if return_iv is True, also a Series of the features’ IV

Return type:

DataFrame
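For orientation, IV (information value) can be computed for an already-binned feature against a binary target as the sum over bins of (bad share - good share) * ln(bad share / good share). The sketch below uses that common definition. How toad bins features and handles empty bins is not covered here, and the function names, the 0.5 floor for empty bins and the data are assumptions.

```python
import math
from collections import Counter

def information_value(feature, target):
    """IV of an already-binned feature against a binary 0/1 target."""
    good = Counter(f for f, t in zip(feature, target) if t == 0)
    bad = Counter(f for f, t in zip(feature, target) if t == 1)
    n_good, n_bad = sum(good.values()), sum(bad.values())
    iv = 0.0
    for b in set(feature):
        # Assumption: a floor of 0.5 avoids log(0) for one-sided bins.
        p_good = max(good[b], 0.5) / n_good
        p_bad = max(bad[b], 0.5) / n_bad
        iv += (p_bad - p_good) * math.log(p_bad / p_good)
    return iv

def drop_iv_sketch(columns, target, threshold=0.02):
    """Drop features whose IV is less than threshold."""
    ivs = {name: information_value(v, target) for name, v in columns.items()}
    dropped = [name for name, iv in ivs.items() if iv < threshold]
    kept = {name: v for name, v in columns.items() if name not in dropped}
    return kept, dropped, ivs

target = [0, 0, 0, 0, 1, 1, 1, 1]
data = {
    "informative": ["a", "a", "a", "b", "a", "b", "b", "b"],
    "useless": ["a", "b", "a", "b", "a", "b", "a", "b"],
}
kept, dropped, ivs = drop_iv_sketch(data, target)
print(dropped, ivs)
```

The "useless" feature has the same bin distribution for both target classes, so its IV is 0 and it falls below the default threshold of 0.02.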

toad.selection.drop_vif(frame, threshold=3, return_drop=False, exclude=None)

drop columns by variance inflation factor (VIF)

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • threshold (float) – drop features until every VIF is less than threshold
  • return_drop (bool) – whether to also return the names of the dropped features
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

the selected dataframe; if return_drop is True, also an array of the feature names that have been dropped

Return type:

DataFrame
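For reference, the variance inflation factor of feature j is conventionally defined from R_j^2, the coefficient of determination obtained by regressing feature j on all the other features. This is the standard definition, not a statement about how toad computes it internally:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A VIF of 1 means the feature is linearly unrelated to the others. Under this definition, requiring every VIF to be less than the default threshold of 3 is equivalent to requiring every R_j^2 to be less than 2/3.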

toad.selection.select(frame, target='target', empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None)

select features by rate of empty values, IV and correlation

Parameters:
  • frame (DataFrame) – dataframe that will be used
  • target (str) – name of the target column in the dataframe
  • empty (number) – drop features whose number of empty values is greater than this threshold; if it is a float, it is interpreted as a proportion of the total number of rows
  • iv (float) – drop features whose IV is less than this threshold
  • corr (float) – in each group of features whose correlation is greater than this threshold, drop the features with the smallest IV
  • return_drop (bool) – whether to also return the names of the dropped features
  • exclude (array-like) – list of feature names that will not be dropped
Returns:

the selected dataframe; if return_drop is True, also a dict of the feature names dropped in each step

Return type:

DataFrame
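The way successive filters combine, and how the dropped names can be collected per step, can be sketched as below. This is only an illustration of the chaining: the per-feature empty rates and IV values are supplied as hypothetical precomputed inputs, the correlation step is omitted for brevity, and the dict keys are assumptions.

```python
def drop_where(columns, should_drop):
    """Split columns into (kept, dropped names) using a predicate on the name."""
    dropped = [name for name in columns if should_drop(name)]
    kept = {name: v for name, v in columns.items() if name not in dropped}
    return kept, dropped

def select_sketch(columns, empty_rate, iv, empty=0.9, iv_min=0.02):
    """Apply the filters in order and record what each step removed."""
    drop_info = {}
    columns, drop_info["empty"] = drop_where(
        columns, lambda name: empty_rate[name] > empty)
    # Later steps only see the features that survived earlier ones.
    columns, drop_info["iv"] = drop_where(
        columns, lambda name: iv[name] < iv_min)
    return columns, drop_info

data = {"a": [None, None], "b": [1, 2], "c": [3, 4]}
empty_rate = {"a": 0.95, "b": 0.10, "c": 0.00}   # hypothetical values
iv = {"a": 0.50, "b": 0.01, "c": 0.30}           # hypothetical values

kept, drop_info = select_sketch(data, empty_rate, iv)
print(list(kept), drop_info)
```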