streamline.dataprep.data_process module
- class streamline.dataprep.data_process.DataProcess(dataset, experiment_path, ignore_features=None, categorical_features=None, quantitative_features=None, exclude_eda_output=None, categorical_cutoff=10, sig_cutoff=0.05, featureeng_missingness=0.5, cleaning_missingness=0.5, correlation_removal_threshold=1.0, partition_method='Stratified', n_splits=10, random_state=None, show_plots=False)[source]
Bases: Job
Exploratory Data Analysis Class for the EDA/Phase 1 step of STREAMLINE
Initialization function for Exploratory Data Analysis Class. Parameters are defined below.
- Parameters:
dataset – a streamline.utils.dataset.Dataset object or a path to a dataset text file
experiment_path – path to the experiment logging directory folder
ignore_features – list of column-name strings, or path to a .csv file with feature labels, specifying features to be ignored in the analysis (default=None)
categorical_features – list of column-name strings, or path to a .csv file with feature labels, specifying features to be treated as categorical where possible (default=None)
quantitative_features – list of column-name strings, or path to a .csv file with feature labels, specifying features to be treated as quantitative where possible (default=None)
exclude_eda_output – list of names of EDA analyses to exclude from the output (must be in set X) (default=None)
categorical_cutoff – maximum number of unique values for a feature to be treated as categorical rather than quantitative (default=10)
sig_cutoff – significance cutoff used in univariate analysis (default=0.05)
featureeng_missingness – the proportion of missing values within a feature above which a new binary categorical feature is generated indicating whether each instance’s value was missing (default=0.5)
cleaning_missingness – the proportion of missing values, within a feature or instance, at which that feature or instance is automatically cleaned (i.e. removed) from the processed ‘target dataset’ (default=0.5)
correlation_removal_threshold – the (Pearson) feature correlation at which one of a pair of features is randomly removed from the processed ‘target dataset’ (default=1.0)
random_state – random state used to seed algorithms for reproducibility (default=None)
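The correlation_removal_threshold rule can be sketched with pandas as follows. This is an illustrative helper only, not the STREAMLINE implementation; the function name is hypothetical:

```python
import numpy as np
import pandas as pd

def remove_correlated_features(df, threshold=1.0):
    """Drop one feature from each pair whose absolute Pearson
    correlation meets the threshold (sketch only)."""
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop)

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],   # perfectly correlated with 'a'
    "c": [1, 0, 1, 0],
})
cleaned = remove_correlated_features(df, threshold=1.0)
```

With the default threshold of 1.0, only perfectly correlated features (here, "b") are removed.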
- categorical_feature_encoding()[source]
Categorical feature encoding using the scikit-learn OneHotEncoder (currently not used/implemented)
- categorical_feature_encoding_pandas()[source]
Categorical feature encoding using the pandas get_dummies function
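The pandas function this method relies on behaves as below (a minimal standalone illustration, not STREAMLINE code):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
# One-hot encode the categorical column; quantitative columns pass through
encoded = pd.get_dummies(df, columns=["color"])
```

Each unique value of "color" becomes its own indicator column (color_blue, color_red), while "size" is left unchanged.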
- counts_summary(total_missing=None, plot=False, save=True, replicate=False)[source]
Reports various dataset counts: i.e. number of instances, total features, categorical features, quantitative features, and class counts. Also saves a simple bar graph of class counts if specified by the user.
- Parameters:
total_missing – total count of missing values (optional; recomputed if not given)
plot – flag to output the class-count bar graph in the experiment log folder
save –
replicate –
Returns:
- data_manipulation()[source]
Wrapper function for all data cleaning and feature engineering data manipulation
- drop_ignored_rowcols(ignored_features=None)[source]
Basic data cleaning: drops any instances with a missing outcome value, as well as any user-specified features (ignore_features)
- feature_engineering()[source]
Feature Engineering - Missingness as a feature (missingness feature engineering phase)
The featureeng_missingness run parameter defines the minimum missingness of a feature at which STREAMLINE will automatically engineer a new binary feature (i.e. 0 = not missing vs. 1 = missing).
This parameter takes a value between 0 and 1 (default 0.5), meaning any feature with missingness above 50% has a corresponding missingness feature added.
The new feature is labeled “Miss_”+originalFeatureName. The list of feature names for which a missingness feature was constructed is saved in self.engineered_features. In the ‘apply’ phase, this feature list is used to build the same new missingness features in the replication dataset.
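The rule above can be sketched in a few lines of pandas. This is an illustrative version of the idea, not the STREAMLINE code; the function name is hypothetical:

```python
import numpy as np
import pandas as pd

def add_missingness_features(df, featureeng_missingness=0.5):
    """For each feature whose missingness exceeds the threshold, add a
    binary 'Miss_<name>' indicator column (sketch only)."""
    engineered = []
    for col in list(df.columns):  # snapshot so new columns aren't revisited
        if df[col].isna().mean() > featureeng_missingness:
            df["Miss_" + col] = df[col].isna().astype(int)
            engineered.append(col)
    return df, engineered

df = pd.DataFrame({"f1": [1.0, np.nan, np.nan, np.nan], "f2": [1, 2, 3, 4]})
df, engineered = add_missingness_features(df)
```

Here "f1" is 75% missing, so a Miss_f1 indicator is added, while "f2" (no missing values) is untouched.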
- graph_selector(feature_name)[source]
Assuming a categorical class outcome, a barplot is generated given a categorical feature, and a boxplot is generated given a quantitative feature.
- Parameters:
feature_name – name of the feature column being operated on
- identify_feature_types(x_data=None)[source]
Automatically identifies categorical vs. quantitative features/variables. Takes a dataframe (of independent variables) with column labels and returns a list of column names identified as categorical based on the user-defined cutoff (categorical_cutoff).
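The cutoff rule can be illustrated as follows (a sketch of the idea, not the STREAMLINE implementation; the function name is hypothetical):

```python
import pandas as pd

def identify_categorical(x_data, categorical_cutoff=10):
    """Return column names with at most `categorical_cutoff`
    unique values (sketch of the cutoff rule)."""
    return [col for col in x_data.columns
            if x_data[col].nunique() <= categorical_cutoff]

x = pd.DataFrame({"few": [0, 1, 0, 1] * 5, "many": range(20)})
cats = identify_categorical(x, categorical_cutoff=10)
```

"few" has 2 unique values and is flagged as categorical; "many" has 20 and is treated as quantitative.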
- instance_removal()[source]
Drops instances whose missingness across features is greater than the cleaning_missingness proportion
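The instance-cleaning step amounts to a row-wise missingness filter, sketched below (illustrative only; the function name is hypothetical and tie handling may differ in STREAMLINE):

```python
import numpy as np
import pandas as pd

def remove_sparse_instances(df, cleaning_missingness=0.5):
    """Keep only rows whose fraction of missing values is
    below the cleaning threshold (sketch only)."""
    keep = df.isna().mean(axis=1) < cleaning_missingness
    return df[keep]

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [2.0, np.nan, np.nan],
    "c": [3.0, 4.0, 5.0],
})
cleaned = remove_sparse_instances(df)
```

The middle row is 2/3 missing and is removed; the last row (1/3 missing) is kept.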
- label_encoder()[source]
Numerical data encoder: for any feature in the data (other than the instance ID, but including the class column) that has non-numerical values, numerically encodes those values based on their alphabetical order. Such a feature should also be treated as categorical, so it is added to the list of categorical features if not already present. A new output .csv file (Numerical_Encoding_Map.csv) is also created, in which each row gives a feature that was numerically encoded and the subsequent columns map the original values to the new numerical values.
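The alphabetical encoding scheme can be sketched as below. This is a simplified illustration, not the STREAMLINE code (which also writes the mapping out to Numerical_Encoding_Map.csv); the function name is hypothetical:

```python
import pandas as pd

def encode_non_numeric(df, categorical_features):
    """Numerically encode non-numeric feature values in alphabetical
    order and record the value mapping per feature (sketch only)."""
    mapping_rows = []
    for col in df.columns:
        if df[col].dtype == object:  # non-numeric values present
            if col not in categorical_features:
                categorical_features.append(col)
            values = sorted(df[col].dropna().unique())  # alphabetical order
            value_map = {v: i for i, v in enumerate(values)}
            df[col] = df[col].map(value_map)
            mapping_rows.append({"feature": col, **value_map})
    return df, mapping_rows

cats = []
df = pd.DataFrame({"grade": ["b", "a", "c"], "score": [1, 2, 3]})
df, mapping = encode_non_numeric(df, categorical_features=cats)
```

Alphabetical ordering gives a→0, b→1, c→2, so the "grade" column becomes [1, 0, 2], and "grade" is appended to the categorical feature list.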
- run(top_features=20)[source]
Wrapper function for run_explore and KFoldPartitioner
- Parameters:
top_features – number of top features to consider (default=20)
- run_process(top_features=20)[source]
Runs the exploratory data process on the EDA object
- Parameters:
top_features – number of top features to consider (default=20)
- test_selector(feature_name)[source]
Selects and applies the appropriate univariate association test for a given feature. Returns the resulting p-value.
- Parameters:
feature_name – name of the feature column being operated on
- univariate_analysis(top_features=20)[source]
Calculates univariate association significance between each individual feature and the class outcome. Assumes a categorical outcome, using the Chi-square test for categorical features and the Mann-Whitney U test for quantitative features.
- Parameters:
top_features – number of top features to show/consider
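The test-selection logic described above can be sketched with scipy (an illustration of the idea, not the STREAMLINE implementation; the function name is hypothetical):

```python
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

def univariate_p(feature, outcome, is_categorical):
    """Chi-square for a categorical feature vs. a categorical outcome,
    Mann-Whitney U for a quantitative feature split by a binary
    outcome (sketch of the test-selection logic only)."""
    if is_categorical:
        table = pd.crosstab(feature, outcome)
        _, p, _, _ = chi2_contingency(table)
    else:
        group0 = feature[outcome == 0]
        group1 = feature[outcome == 1]
        _, p = mannwhitneyu(group0, group1)
    return p

outcome = pd.Series([0, 0, 0, 1, 1, 1])
quant = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
p_quant = univariate_p(quant, outcome, is_categorical=False)
cat = pd.Series([0, 0, 1, 1, 1, 0])
p_cat = univariate_p(cat, outcome, is_categorical=True)
```

Each resulting p-value can then be compared against sig_cutoff to decide whether a feature's association is significant.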
- univariate_plots(sorted_p_list=None, top_features=20)[source]
Checks whether p-value of each feature is less than or equal to significance cutoff. If so, calls graph_selector to generate an appropriate plot.
- Parameters:
sorted_p_list – sorted list of p-values
top_features – number of top features to consider (default=20)