streamline.dataprep.data_process module

class streamline.dataprep.data_process.DataProcess(dataset, experiment_path, ignore_features=None, categorical_features=None, quantitative_features=None, exclude_eda_output=None, categorical_cutoff=10, sig_cutoff=0.05, featureeng_missingness=0.5, cleaning_missingness=0.5, correlation_removal_threshold=1.0, partition_method='Stratified', n_splits=10, random_state=None, show_plots=False)[source]

Bases: Job

Exploratory Data Analysis Class for the EDA/Phase 1 step of STREAMLINE

Initialization function for the Exploratory Data Analysis class. Parameters are defined below, followed by a minimal usage sketch.

Parameters:
  • dataset – a streamline.utils.dataset.Dataset object or a path to dataset text file

  • experiment_path – path to the experiment logging directory folder

  • ignore_features – list of column name strings, or path to a .csv file with feature labels, for features to be ignored in the analysis (default=None)

  • categorical_features – list of column name strings, or path to a .csv file with feature labels, for features to be treated as categorical where possible (default=None)

  • quantitative_features – list of column name strings, or path to a .csv file with feature labels, for features to be treated as quantitative where possible (default=None)

  • exclude_eda_output – list of names of EDA analyses to exclude from the output; names must come from the supported set of analyses (default=None)

  • categorical_cutoff – number of unique values used as the cutoff for treating a variable as categorical vs. quantitative (default=10)

  • sig_cutoff – significance cutoff for continuous variables (default=0.05)

  • featureeng_missingness – the proportion of missing values within a feature above which a new binary categorical feature is generated indicating whether the value for each instance was missing (default=0.5)

  • cleaning_missingness – the proportion of missing values within a feature or instance at which the given feature or instance is automatically cleaned (i.e. removed) from the processed ‘target dataset’ (default=0.5)

  • correlation_removal_threshold – the (Pearson) feature correlation at which one feature out of a correlated pair is randomly removed from the processed ‘target dataset’ (default=1.0)

  • random_state – random state used to set seeds for reproducibility of algorithms (default=None)
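
A minimal usage sketch is shown below. The dataset path and experiment directory are hypothetical, only a subset of the parameters above is set explicitly, and this is illustrative rather than a definitive invocation.

    from streamline.dataprep.data_process import DataProcess

    # Hypothetical file paths; any dataset text file with an outcome column could be used here.
    dp = DataProcess(dataset="data/demo_data.csv",
                     experiment_path="experiments/demo_run",
                     categorical_cutoff=10,
                     sig_cutoff=0.05,
                     featureeng_missingness=0.5,
                     cleaning_missingness=0.5,
                     correlation_removal_threshold=1.0,
                     random_state=42)
    dp.run(top_features=20)  # EDA, cleaning/feature engineering, and k-fold partitioning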

categorical_feature_encoding()[source]

Categorical feature encoding using the sklearn OneHotEncoder (not currently used/implemented)

categorical_feature_encoding_pandas()[source]

Categorical feature encoding using pandas get_dummies function
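
For illustration, a small pandas sketch of this style of encoding (not the exact STREAMLINE implementation): get_dummies expands each categorical column into one binary column per value.

    import pandas as pd

    df = pd.DataFrame({"Color": ["Red", "Blue", "Red"], "Dose": [1.2, 3.4, 2.2]})
    # Only columns treated as categorical are expanded; quantitative columns pass through.
    encoded = pd.get_dummies(df, columns=["Color"])
    print(list(encoded.columns))  # ['Dose', 'Color_Blue', 'Color_Red']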

counts_summary(total_missing=None, plot=False, save=True, replicate=False)[source]

Reports various dataset counts, i.e. the number of instances, total features, categorical features, quantitative features, and class counts. Also saves a simple bar graph of class counts if specified by the user.

Parameters:
  • save – flag to save the summary counts to the experiment log folder

  • total_missing – total number of missing values (optional; recomputed if not given)

  • plot – flag to output bar graph in the experiment log folder

  • replicate – flag indicating the summary is being run on a replication dataset

Returns:

data_manipulation()[source]

Wrapper function for all data cleaning and feature engineering data manipulation

drop_highly_correlated_features()[source]

Drops one feature from each pair of features whose (Pearson) correlation reaches the correlation_removal_threshold.
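
A minimal pandas sketch of this removal (assumptions: a DataFrame of features and Pearson correlation; here the second feature of a correlated pair is dropped rather than a randomly chosen one):

    import pandas as pd

    def drop_correlated(df: pd.DataFrame, threshold: float = 1.0):
        """Drop one feature from each pair whose absolute Pearson correlation
        reaches the threshold."""
        corr = df.corr(numeric_only=True).abs()
        to_drop = set()
        for i, a in enumerate(corr.columns):
            for b in corr.columns[i + 1:]:
                if a not in to_drop and b not in to_drop and corr.loc[a, b] >= threshold:
                    to_drop.add(b)
        return df.drop(columns=list(to_drop))
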
drop_ignored_rowcols(ignored_features=None)[source]

Basic data cleaning: Drops any instances with a missing outcome value, as well as any features (ignore_features) specified by the user

drop_invariant()[source]

Basic data cleaning: Drops any invariant features found by pandas

feature_engineering()[source]

Feature Engineering - Missingness as a feature (missingness feature engineering phase)

Using the featureeng_missingness run parameter, we define the minimum missingness of a variable at which STREAMLINE will automatically engineer a new feature (i.e. 0 = not missing vs. 1 = missing).

This parameter takes a value between 0 and 1 (default 0.5), meaning that by default any feature with missingness greater than 50% will have a corresponding missingness feature added.

This new feature is given the label “Miss_” + the original feature name. The list of feature names for which a missingness feature was constructed is saved in self.engineered_features. In the ‘apply’ phase, this feature list is used to add the same missingness features to the replication dataset.
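
A minimal pandas sketch of this idea (illustrative only; the function name and exact DataFrame handling are assumptions, not the STREAMLINE internals):

    import pandas as pd

    def engineer_missingness_features(df, featureeng_missingness=0.5):
        """Add a binary 'Miss_<feature>' column for any feature whose proportion
        of missing values exceeds the threshold; return the affected feature names."""
        engineered = []
        for col in list(df.columns):
            if df[col].isna().mean() > featureeng_missingness:
                df["Miss_" + col] = df[col].isna().astype(int)  # 1 = missing, 0 = not missing
                engineered.append(col)
        return df, engineered  # 'engineered' mirrors self.engineered_features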

feature_removal()[source]

Drops features whose proportion of missing values exceeds the cleaning_missingness threshold.
graph_selector(feature_name)[source]

Assuming a categorical class outcome, a barplot is generated given a categorical feature, and a boxplot is generated given a quantitative feature.

Parameters:

feature_name – name of the feature column the function operates on

identify_feature_types(x_data=None)[source]

Automatically identifies categorical vs. quantitative features/variables. Takes a dataframe (of independent variables) with column labels and returns a list of column names identified as categorical based on the user-defined cutoff (categorical_cutoff).
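
A sketch of this heuristic (assuming a pandas DataFrame of independent variables; whether the cutoff is inclusive is an assumption here):

    import pandas as pd

    def identify_categorical(x_data: pd.DataFrame, categorical_cutoff: int = 10):
        """Return the column names treated as categorical: those with at most
        'categorical_cutoff' unique (non-missing) values."""
        return [col for col in x_data.columns
                if x_data[col].dropna().nunique() <= categorical_cutoff]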

instance_removal()[source]

Drops instances whose missingness across features/columns is greater than the cleaning_missingness threshold
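
A rough sketch of per-instance missingness cleaning (assumptions: a pandas DataFrame and a 0-1 threshold; not the exact implementation):

    import pandas as pd

    def remove_sparse_instances(df: pd.DataFrame, cleaning_missingness: float = 0.5):
        """Drop rows (instances) whose proportion of missing values exceeds the threshold."""
        row_missingness = df.isna().mean(axis=1)  # per-row fraction of missing cells
        return df.loc[row_missingness <= cleaning_missingness]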

join()[source]
label_encoder()[source]

Numerical data encoder: for any feature in the data (other than the instance ID, but including the class column) that contains non-numerical values, those values are numerically encoded based on the alphabetical order of the feature values. Any such feature should also be treated as categorical, so it is added to the categorical feature list if not already present. A new output .csv file (Numerical_Encoding_Map.csv) is created, where each row gives a feature that was numerically encoded and the subsequent columns provide a mapping of the original values to the new numerical values.
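
A condensed sketch of alphabetical encoding for a single column (illustrative; the helper name and the exact layout of Numerical_Encoding_Map.csv are assumptions):

    import pandas as pd

    def encode_feature(df: pd.DataFrame, col: str):
        """Numerically encode a non-numeric column by the alphabetical order of its
        values, returning the value -> code mapping for the encoding map file."""
        values = sorted(df[col].dropna().unique())
        mapping = {value: code for code, value in enumerate(values)}
        df[col] = df[col].map(mapping)
        return mapping  # e.g. {'High': 0, 'Low': 1, 'Medium': 2}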

make_log_folders()[source]

Makes folders for logging exploratory data analysis

run(top_features=20)[source]

Wrapper function to run_explore and KFoldPartitioner

Parameters:

top_features – number of top features to consider (default=20)

run_process(top_features=20)[source]

Runs the exploratory data process on the EDA object

Parameters:

top_features – number of top features to consider (default=20)

save_runtime()[source]

Export runtime for this phase of the pipeline on current target dataset

second_eda(top_features=20)[source]
start(top_features=20)[source]
test_selector(feature_name)[source]

Selects and applies the appropriate univariate association test for a given feature. Returns the resulting p-value

Parameters:

feature_name – name of the feature column the operation is running on

univariate_analysis(top_features=20)[source]

Calculates the univariate association significance between each individual feature and the class outcome. Assumes a categorical outcome, using the Chi-square test for categorical features and the Mann-Whitney U test for quantitative features (a test-selection sketch follows the parameter below).

Parameters:

top_features – number of top features to show/consider (default=20)
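
A sketch of the test selection logic using scipy (assumes a categorical outcome column named 'Class' and, for the Mann-Whitney branch, a binary outcome; illustrative, not the exact implementation):

    import pandas as pd
    from scipy.stats import chi2_contingency, mannwhitneyu

    def univariate_p_value(df, feature, outcome="Class", categorical=True):
        """Chi-square test for a categorical feature, Mann-Whitney U test for a
        quantitative one, against a categorical class outcome."""
        if categorical:
            table = pd.crosstab(df[feature], df[outcome])
            _, p_value, _, _ = chi2_contingency(table)
        else:
            groups = [grp[feature].dropna() for _, grp in df.groupby(outcome)]
            _, p_value = mannwhitneyu(*groups)  # assumes a binary outcome (two groups)
        return p_value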

univariate_plots(sorted_p_list=None, top_features=20)[source]

Checks whether p-value of each feature is less than or equal to significance cutoff. If so, calls graph_selector to generate an appropriate plot.

Parameters:
  • sorted_p_list – sorted list of p-values

  • top_features – number of top features to consider (default=20)