streamline.runners.dataprocess_runner module

class streamline.runners.dataprocess_runner.DataProcessRunner(data_path, output_path, experiment_name, exclude_eda_output=None, class_label='Class', instance_label=None, match_label=None, n_splits=10, partition_method='Stratified', ignore_features=None, categorical_features=None, quantitative_features=None, top_features=20, categorical_cutoff=10, sig_cutoff=0.05, featureeng_missingness=0.5, cleaning_missingness=0.5, correlation_removal_threshold=1.0, random_state=None, run_cluster=False, queue='defq', reserved_memory=4, show_plots=False)[source]

Bases: object

Description: Phase 1 of STREAMLINE - This 'Main' script manages Phase 1 run parameters, updates the metadata file (with user-specified run parameters across the pipeline run), and submits jobs to run either locally (serially) or on a Linux computing cluster (in parallel). It runs ExploratoryAnalysisJob.py, which conducts the initial exploratory analysis of the data and cross-validation (CV) partitioning. Note that this entire pipeline may also be run within a Jupyter Notebook (see STREAMLINE-Notebook.ipynb). All 'Main' scripts in this pipeline can be extended by users to submit jobs to other parallel computing frameworks (e.g. cloud computing).

Warning:

- Before running, be sure to check that all run parameters have relevant/desired values, including those with default values.
- 'Target' datasets for analysis should be in comma-separated format (.txt or .csv).
- Missing data values should be empty or indicated with 'NA'.
- Dataset(s) must include a header row giving column labels.
- Data columns include features, the class label, and optionally instance (i.e. row) labels or match labels (if matched cross-validation will be used).
- Binary class values are encoded as 0 (e.g. negative) and 1 (positive) with respect to true positive, true negative, false positive, and false negative metrics. PRC plots focus on classification of 'positives'.
- All feature values (both categorical and quantitative) must be numerically encoded; scikit-learn does not accept text-based values. However, both instance_label and match_label values may be either numeric or text.
- One or more target datasets for analysis should be placed in the same data_path folder; the path to this folder is a critical pipeline run parameter. No spaces are allowed in filenames (this leads to an 'invalid literal' error from export_exploratory_analysis). If multiple datasets are being analyzed, they must have the same class_label and (if present) the same instance_label and match_label.
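The input requirements above can be illustrated with a small toy dataset built using Python's standard library. The column names (InstanceID, Feature_A, Feature_B) and values here are hypothetical, not part of STREAMLINE:

```python
import csv
import os
import tempfile

# Toy dataset meeting the requirements above: a header row, numerically
# encoded features, a binary 'Class' column (0 = negative, 1 = positive),
# and missing values left empty or written as 'NA'.
header = ["InstanceID", "Feature_A", "Feature_B", "Class"]
rows = [
    ["inst_1", 0.5, 1, 0],
    ["inst_2", "NA", 0, 1],   # missing value indicated with 'NA'
    ["inst_3", 2.3, "", 1],   # missing value left empty
]

data_dir = tempfile.mkdtemp()  # stands in for the data_path folder
# No spaces in the filename (spaces break export_exploratory_analysis).
csv_path = os.path.join(data_dir, "demo_data.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```

Note that instance labels (here InstanceID) may remain text, while all feature values are numeric.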

Initializer for a runner class for Exploratory Data Analysis Jobs

Parameters:
  • data_path – path to directory containing datasets

  • output_path – path to output directory

  • experiment_name – name of experiment output folder (no spaces)

  • exclude_eda_output – list of EDA outputs to exclude (possible options: 'describe_csv', 'univariate_plots', 'correlation_plots')

  • class_label – outcome label of all datasets

  • instance_label – instance label of all datasets (if present)

  • match_label – only applies when matched ("Group") partitioning is selected as partition_method; indicates the column with matched instance IDs

  • n_splits – number of splits in cross-validation (default=10)

  • partition_method – method of partitioning in cross-validation; must be one of ["Random", "Stratified", "Group"] (default="Stratified")

  • ignore_features – list of column names (strings) of features to ignore, or path to a .csv file with feature labels to be ignored in analysis (default=None)

  • categorical_features – list of column names (strings), or path to a .csv file with feature labels, specified to be treated as categorical where possible (default=None)

  • quantitative_features – list of column names (strings), or path to a .csv file with feature labels, specified to be treated as quantitative where possible (default=None)

  • top_features – number of top features to report in exploratory analysis output (default=20)

  • categorical_cutoff – the maximum number of unique values for a variable to be considered categorical rather than quantitative (default=10)

  • sig_cutoff – significance cutoff used throughout pipeline (default=0.05)

  • featureeng_missingness – the proportion of missing values within a feature above which a new binary categorical feature is generated indicating whether the value for each instance was missing (default=0.5)

  • cleaning_missingness – the proportion of missing values within a feature or instance at which that feature or instance is automatically cleaned (i.e. removed) from the processed 'target dataset' (default=0.5)

  • correlation_removal_threshold – the (Pearson) feature correlation at which one of a pair of correlated features is randomly removed from the processed 'target dataset' (default=1.0)

  • random_state – sets a specific random seed for reproducible results (default=None)

  • run_cluster – name of cluster run setting or False (default=False)

  • queue – name of queue to be used in cluster run (default=”defq”)

  • reserved_memory – reserved memory for the cluster run in GB (default=4)

  • show_plots – flag to output plots for notebooks (default=False)
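
A minimal usage sketch for a local (serial) run. The parameter names come from the signature above, but the data folder "DemoData", output folder, and experiment name are placeholders, not files shipped with STREAMLINE:

```python
from streamline.runners.dataprocess_runner import DataProcessRunner

# Hypothetical paths/names: 'DemoData' holds one or more .csv datasets
# sharing the same class_label and instance_label.
runner = DataProcessRunner(
    data_path="DemoData",
    output_path="demo_output",
    experiment_name="demo_experiment",  # no spaces allowed
    class_label="Class",
    instance_label="InstanceID",
    n_splits=10,
    partition_method="Stratified",
    categorical_cutoff=10,
    random_state=42,  # fixed seed for reproducible CV partitions
)
runner.run(run_parallel=False)  # False = run serially on this machine
```

Setting run_cluster to a cluster-type name (with queue and reserved_memory) instead submits each dataset's job to the computing cluster.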

get_cluster_params(dataset_path)[source]
make_dir_tree()[source]

Checks existence of data folder path. Checks that experiment output folder does not already exist as well as validity of experiment_name parameter. Then generates initial output folder hierarchy.

run(run_parallel=False)[source]
save_metadata()[source]
submit_lsf_cluster_job(dataset_path)[source]
submit_slurm_cluster_job(dataset_path)[source]
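
As a rough illustration of how the cluster run parameters (queue, reserved_memory) feed into job submission, here is a hypothetical helper that assembles a SLURM command line; the actual sbatch script that submit_slurm_cluster_job generates is not reproduced here:

```python
def slurm_command(job_script, queue="defq", reserved_memory=4):
    """Assemble a hypothetical sbatch invocation from the runner's
    cluster parameters: queue maps to --partition and reserved_memory
    (GB) maps to --mem."""
    return (
        f"sbatch --partition={queue} "
        f"--mem={reserved_memory}G {job_script}"
    )
```

An LSF submission (submit_lsf_cluster_job) would build an analogous bsub command with that scheduler's queue and memory flags.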