Run Parameters

Here we review the run parameters available across the 9 phases of STREAMLINE. We begin with a quick guide/summary of all run parameters according to run mode along with their default values (when applicable). Then we provide further descriptions, formatting, valid values, and guidance (as needed) for each run parameter. Lastly, we provide overall guidance on setting STREAMLINE run parameters.


Quick Guide

The quick guide below distinguishes essential from non-essential run parameters within STREAMLINE, and further breaks down non-essential run parameters by pipeline phase. The name of each parameter is given for the command-line, configuration file, and notebooks (same for both Colab and Jupyter Notebooks), as well as the internal STREAMLINE default value (which occasionally differs from the default values used in the notebooks for the demonstration datasets).

  • Run parameters without default values are indicated with ‘no default’.

  • Run parameters that are not used in one of the run modes are indicated with ‘NA’.

  • All run parameters include quick links to their respective details in Parameter Details, including their description, format, values, and other tips.

Essential Parameters (Phases 1-9)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --data-path | dataset_path | data_path | no default |
| --out-path | output_path | output_path | no default |
| --exp-name | experiment_name | experiment_name | no default |
| --class-label | class_label | class_label | ‘Class’ |
| --inst-label | instance_label | instance_label | None |
| --match-label | match_label | match_label | None |
| --fi | ignore_features_path | ignore_features | None |
| --cf | categorical_feature_path | categorical_feature_headers | None |
| --qf | quantitative_feature_path | quantitiative_feature_headers | None |
| --rep-path | rep_data_path | rep_data_path | no default |
| --dataset | dataset_for_rep | dataset_for_rep | no default |
| --config or -c | NA | NA | no default |
| --do-till-report or -dtr | do_till_report | NA | False |
| --do-eda | do_eda | NA | False |
| --do-dataprep | do_dataprep | NA | False |
| --do-feat-imp | do_feat_imp | NA | False |
| --do-feat-sel | do_feat_sel | NA | False |
| --do-model | do_model | NA | False |
| --do-stats | do_stats | NA | False |
| --do-compare-dataset | do_compare_dataset | NA | False |
| --do-report | do_report | NA | False |
| --do-replicate | do_replicate | NA | False |
| --do-rep-report | do_rep_report | NA | False |
| --do-cleanup | do_cleanup | NA | False |
| NA | NA | applyToReplication | True |
| NA | NA | demo_run | True |
| NA | NA | use_data_prompt (Colab) | True |

General Parameters (Phase 1)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --cv | cv_partitions | n_splits | 10 |
| --part | partition_method | partition_method | ‘Stratified’ |
| --cat-cutoff | categorical_cutoff | categorical_cutoff | 10 |
| --sig | sig_cutoff | sig_cutoff | 0.05 |
| --rand-state | random_state | random_state | 42 |

Data Processing Parameters (Phase 1)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --exclude-eda-output | exclude_eda_output | exclude_eda_output | None |
| --top-uni-feature | top_uni_features | top_uni_features | 20 |
| --feat_miss | featureeng_missingness | featureeng_missingness | 0.5 |
| --clean_miss | cleaning_missingness | cleaning_missingness | 0.5 |
| --corr_thresh | correlation_removal_threshold | correlation_removal_threshold | 1.0 |

Imputation & Scaling Parameters (Phase 2)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --impute | impute_data | impute_data | True |
| --multi-impute | multi_impute | multi_impute | True |
| --scale | scale_data | scale_data | True |
| --over-cv | overwrite_cv | overwrite_cv | True |

Feature Importance Estimation Parameters (Phase 3)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --do-mi | do_mutual_info | do_mutual_info | True |
| --do-ms | do_multisurf | do_multisurf | True |
| --use-turf | use_turf | use_TURF | False |
| --turf-pct | turf_pct | TURF_pct | 0.5 |
| --inst-sub | instance_subset | instance_subset | 2000 |
| --n-jobs | n_jobs | cores | 1 |

Feature Selection Parameters (Phase 4)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --filter-feat | filter_poor_features | filter_poor_features | True |
| --max-feat | max_features_to_keep | max_features_to_keep | 2000 |
| --export-scores | export_scores | export_scores | True |
| --top-fi-features | top_fi_features | top_fi_features | 40 |
| --over-cv-feat | overwrite_cv_feat | overwrite_cv_feat | True |

Modeling Parameters (Phase 5)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --algorithms | algorithms | algorithms | None |
| --exclude | exclude | exclude | ‘eLCS,XCS’ |
| --subsample | training_subsample | training_subsample | 0 |
| --use-uniformFI | use_uniform_fi | use_uniform_FI | True |
| --metric | primary_metric | primary_metric | ‘balanced_accuracy’ |
| --metric-direction | metric_direction | metric_direction | ‘maximize’ |
| --n-trials | n_trials | n_trials | 200 |
| --timeout | timeout | timeout | 900 |
| --export-hyper-sweep | export_hyper_sweep_plots | export_hyper_sweep_plots | False |
| --do-LCS-sweep | do_lcs_sweep | do_lcs_sweep | False |
| --nu | lcs_nu | lcs_nu | 1 |
| --iter | lcs_iterations | lcs_iterations | 200000 |
| --N | lcs_n | lcs_N | 2000 |
| --lcs-timeout | lcs_timeout | lcs_timeout | 1200 |
| --model-resubmit | model_resubmit | NA | False |

Post-Analysis Parameters (Phase 6)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --exclude-plots | exclude_plots | exclude_plots | None |
| --metric-weight | metric_weight | metric_weight | ‘balanced_accuracy’ |
| --top-model-fi-features | top_model_fi_features | top_model_fi_features | 40 |

Compare Data Parameters (Phase 7)

There are currently no run parameters to adjust for this phase.

Replication Parameters (Phase 8)

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --exclude-rep-plots | exclude_rep_plots | exclude_rep_plots | None |

Summary Report Parameters (Phase 9)

There are currently no run parameters to adjust for this phase.

Cleanup Parameters

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --del-time | del_time | del_time | True |
| --del-old-cv | del_old_cv | del_old_cv | True |

Multiprocessing Parameters

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --run-parallel | run_parallel | NA | False |
| --run-cluster | run_cluster | NA | “SLURM” |
| --res-mem | reserved_memory | NA | 4 |
| --queue | queue | NA | “defq” |

Logging Parameters

| Command-line Parameter | Config File Parameter | Notebook Parameter | Default |
| --- | --- | --- | --- |
| --verbose | verbose | NA | False |
| --logging-level | logging_level | NA | ‘INFO’ |


Parameter Details

This section will go into greater depth for each run parameter, primarily using the configuration file parameter name to identify each.

  • Parameters identified as (str) format should be entered with single quotation marks within notebooks, or when using a configuration file, but without them when using command line arguments (CLA).


Essential Parameters (Phase 1-9)

dataset_path

  • Description: path to the folder containing one or more ‘target datasets’ to be analyzed that meet dataset formatting requirements

  • Format: (str), e.g. '/content/STREAMLINE/data/DemoData'

  • Values: must be a valid folder-path

  • Tips: STREAMLINE automatically detects the number of ‘target datasets’ in this folder and will run a complete analysis on each, comparing dataset performance in phase 7

output_path

  • Description: path to an output folder where STREAMLINE will save the experiment folder (containing all output files)

  • Format: (str), e.g. '/content/DemoOutput'

  • Values: must be a valid folder-path; the lowest-level folder (e.g. DemoOutput) does not need to exist already, and will be created automatically if it does not

  • Tips: When running multiple STREAMLINE experiments, it’s convenient to leave this parameter the same and just update experiment_name

experiment_name

  • Description: a unique name for the current STREAMLINE experiment output folder that will be created within output_path

  • Format: (str), e.g. 'demo_experiment'

  • Values: any string value name (avoid spaces)

  • Tips: a short, unique, and descriptive name is encouraged

class_label

  • Description: the name of the class/outcome column found in the dataset header

  • Format: (str), e.g. 'Class'

  • Values: the case-sensitive name used in the dataset to identify the outcome labels column

instance_label

  • Description: the name of the instance ID column that may (or may not) be included in the dataset

  • Format: (str), e.g. 'InstanceID'

  • Values: None, or the case-sensitive name used in the dataset to identify the instance ID column (if present)

  • Tips: having an instance ID column in the data allows users to later identify model predictions for specific instances in the dataset, as well as reverse-engineer instance subgroups in the dataset downstream using the ExSTraCS modeling algorithm’s capability to detect and characterize heterogeneous associations. This may not be necessary for most users.

match_label

  • Description: the name of the match/group ID column that can be included in a dataset to keep instances with the same match label together within the same CV partition

  • Format: (str), e.g. 'MatchID'

  • Values: None, or the case-sensitive name used in the dataset to identify the match/group ID column (if present)

  • Tips: having a match/group ID column in the data allows users to apply machine learning modeling to datasets where instances with different outcomes have been matched based on other covariates that the user wants to account for (e.g. age, sex, race, etc)
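As a rough illustration of the quoting rule stated at the top of this section, the sketch below holds a few essential parameters as a notebook would (quoted strings, or None), then renders them as command-line arguments, where the quotes are dropped. The notebook names and flags are taken from the quick guide; the `to_cli_args` helper and the dictionary layout are hypothetical and are not part of STREAMLINE.

```python
import shlex

# Notebook-style values: (str) parameters are quoted, unused labels stay None.
notebook_params = {
    'data_path': '/content/STREAMLINE/data/DemoData',
    'output_path': '/content/DemoOutput',
    'experiment_name': 'demo_experiment',
    'class_label': 'Class',
    'instance_label': 'InstanceID',
    'match_label': None,
}

# Notebook parameter name -> command-line flag, copied from the quick guide.
CLI_FLAGS = {
    'data_path': '--data-path',
    'output_path': '--out-path',
    'experiment_name': '--exp-name',
    'class_label': '--class-label',
    'instance_label': '--inst-label',
    'match_label': '--match-label',
}


def to_cli_args(params):
    """Hypothetical helper: turn notebook-style values into CLI arguments."""
    args = []
    for name, value in params.items():
        if value is None:  # parameters left at None are simply omitted
            continue
        args += [CLI_FLAGS[name], str(value)]
    return args


cli_args = to_cli_args(notebook_params)
print(shlex.join(cli_args))  # values appear without surrounding quotes
```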

ignore_features_path

  • Description: a list of feature names for STREAMLINE to immediately drop from the target datasets

  • Format:

    1. for notebook or config file modes: provide a (list) of (str) feature names that can be found in any of the ‘target datasets’, e.g. ['IgnoredFeature1','IgnoredFeature2']

    2. for command line arguments: provide a (str) path to a .csv file including a row of feature names that can be found in any of the ‘target datasets’, e.g. '/content/STREAMLINE/data/MadeUp/ignoreFeat.csv'

  • Values: None, or (for either format) should include case-sensitive feature names found in at least one of the ‘target datasets’

  • Tips: useful for easily dropping features found in the datasets that users may wish to exclude if those features might lead to data leakage, or for other data quality reasons

categorical_feature_path

  • Description: a list of feature names for STREAMLINE to explicitly treat as categorical feature types

  • Format:

    1. for notebook or config file modes: provide a (list) of (str) feature names that can be found in any of the ‘target datasets’, e.g. ['Feature1','Feature7']

    2. for command line arguments: provide a (str) path to a .csv file including a row of feature names that can be found in any of the ‘target datasets’, e.g. '/content/STREAMLINE/data/DemoFeatureTypes/hcc_cat_feat.csv'

  • Values: None, or (for either format) should include case-sensitive feature names found in at least one of the ‘target datasets’

  • Tips:

    • When specifying categorical_feature_path feature names and leaving quantitative_feature_path = None, all other features will be automatically treated as quantitative

    • When specifying quantitative_feature_path feature names and leaving categorical_feature_path = None, all other features will be automatically treated as categorical

    • When specifying feature names for both categorical_feature_path and quantitative_feature_path, any feature in the data not specified by one of these lists will have its feature type determined automatically using categorical_cutoff

    • Note: any text-valued features in a dataset will automatically be numerically encoded and treated as categorical features (overriding any other user specifications)

quantitative_feature_path

  • Description: a list of feature names for STREAMLINE to explicitly treat as quantitative feature types
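The interaction between the categorical list, the quantitative list, and categorical_cutoff can be sketched as follows. This is a minimal illustration of the rules described in the tips above, not STREAMLINE's actual implementation: the function name, the dictionary-of-lists data layout, and the choice that "more unique values than the cutoff" means quantitative are all assumptions.

```python
def assign_feature_types(columns, categorical=None, quantitative=None,
                         categorical_cutoff=10):
    """Illustrative only: label each feature 'categorical' or 'quantitative'.

    columns maps a feature name to its list of observed values.
    """
    types = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if any(isinstance(v, str) for v in observed):
            types[name] = 'categorical'    # text-valued: always categorical
        elif categorical is not None and name in categorical:
            types[name] = 'categorical'    # explicitly listed
        elif quantitative is not None and name in quantitative:
            types[name] = 'quantitative'   # explicitly listed
        elif categorical is not None and quantitative is None:
            types[name] = 'quantitative'   # only the categorical list was given
        elif quantitative is not None and categorical is None:
            types[name] = 'categorical'    # only the quantitative list was given
        else:
            # Not covered by either list: fall back to the cutoff.
            # Assumption: more unique values than the cutoff -> quantitative.
            n_unique = len(set(observed))
            types[name] = ('quantitative' if n_unique > categorical_cutoff
                           else 'categorical')
    return types


data = {
    'sex': [0, 1, 1, 0],
    'age': [34, 51, 47, 62],
    'site': ['A', 'B', 'A', 'C'],
}
types_cat_only = assign_feature_types(data, categorical=['sex'])
types_auto = assign_feature_types(data, categorical_cutoff=3)
```

With only the categorical list given, `age` falls through to quantitative; with no lists at all, the cutoff decides, and the text-valued `site` column is categorical in both cases.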

rep_data_path

  • Description: path to the folder containing one or more ‘replication datasets’ to be evaluated using previously trained models for a specific ‘target dataset’ (see data formatting requirements)

  • Format: (str), e.g. '/content/STREAMLINE/data/DemoRepData'

  • Values: must be a valid folder-path

  • Tips: STREAMLINE automatically detects the number of ‘replication datasets’ in this folder and will run a complete evaluation on each.

dataset_for_rep

  • Description: path to the individual ‘target dataset’ file used to train the models which you want to evaluate with the above ‘replication datasets’ (see data formatting requirements)

  • Format: (str), e.g. '/content/STREAMLINE/data/DemoData/hcc_data_custom.csv'

  • Values: must be a valid file-path

  • Tips: STREAMLINE’s replication phase is set up to evaluate all models trained from a single ‘target dataset’ at once using one or more ‘replication datasets’ specific to that ‘target dataset’. The replication phase can be run multiple times, each for a new ‘target dataset’ and its own respective ‘replication dataset(s)’.

config

  • Description: path to the configuration file used to run STREAMLINE from the command line using a configuration file locally or on a cluster

  • Format: (str), e.g. run_configs/local.cfg

  • Values: must be a valid file-path to a properly formatted configuration file

do_till_report

  • Description: boolean flag telling STREAMLINE to automatically run all phases excluding phase 8 (i.e. replication), and part of phase 9 (i.e. PDF report for replication)

  • Format: [Command Line Argument] just use flag (i.e. --do-till-report), [Configuration File] (bool)

  • Values: True or False

do_eda

  • Description: boolean flag telling STREAMLINE to run phase 1 (i.e. EDA and Processing)

  • Format: [Command Line Argument] just use flag (i.e. --do-eda), [Configuration File] (bool)

  • Values: True or False

do_dataprep

  • Description: boolean flag telling STREAMLINE to run phase 2 (i.e. Imputation and Scaling)

  • Format: [Command Line Argument] just use flag (i.e. --do-dataprep), [Configuration File] (bool)

  • Values: True or False

do_feat_imp

  • Description: boolean flag telling STREAMLINE to run phase 3 (i.e. Feature Importance Estimation)

  • Format: [Command Line Argument] just use flag (i.e. --do-feat-imp), [Configuration File] (bool)

  • Values: True or False

do_feat_sel

  • Description: boolean flag telling STREAMLINE to run phase 4 (i.e. Feature Selection)

  • Format: [Command Line Argument] just use flag (i.e. --do-feat-sel), [Configuration File] (bool)

  • Values: True or False

do_model

  • Description: boolean flag telling STREAMLINE to run phase 5 (i.e. Modeling)

  • Format: [Command Line Argument] just use flag (i.e. --do-model), [Configuration File] (bool)

  • Values: True or False

do_stats

  • Description: boolean flag telling STREAMLINE to run phase 6 (i.e. Post-Analysis)

  • Format: [Command Line Argument] just use flag (i.e. --do-stats), [Configuration File] (bool)

  • Values: True or False

do_compare_dataset

  • Description: boolean flag telling STREAMLINE to run phase 7 (i.e. Compare Datasets)

  • Format: [Command Line Argument] just use flag (i.e. --do-compare-dataset), [Configuration File] (bool)

  • Values: True or False

do_report

  • Description: boolean flag telling STREAMLINE to run phase 9 (i.e. Summary Report) specific to phases 1-7

  • Format: [Command Line Argument] just use flag (i.e. --do-report), [Configuration File] (bool)

  • Values: True or False

do_replicate

  • Description: boolean flag telling STREAMLINE to run phase 8 (i.e. Replication) specific to phases 1-7

  • Format: [Command Line Argument] just use flag (i.e. --do-replicate), [Configuration File] (bool)

  • Values: True or False

do_rep_report

  • Description: boolean flag telling STREAMLINE to run phase 9 (i.e. Summary Report) specific to phase 8

  • Format: [Command Line Argument] just use flag (i.e. --do-rep-report), [Configuration File] (bool)

  • Values: True or False

do_cleanup

  • Description: boolean flag telling STREAMLINE to run output file cleanup (optional)

  • Format: [Command Line Argument] just use flag (i.e. --do-cleanup), [Configuration File] (bool)

  • Values: True or False
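To make the relationship between these boolean flags and the pipeline phases easier to see, here is a small lookup sketch. The flag names and phase descriptions are taken from the entries above; the `phases_to_run` helper is hypothetical, and the assumption that do_till_report does not trigger cleanup is ours rather than something stated in this guide.

```python
# Flag -> what it runs, as described in the entries above.
PHASE_FLAGS = {
    'do_eda': 'phase 1 (EDA and Processing)',
    'do_dataprep': 'phase 2 (Imputation and Scaling)',
    'do_feat_imp': 'phase 3 (Feature Importance Estimation)',
    'do_feat_sel': 'phase 4 (Feature Selection)',
    'do_model': 'phase 5 (Modeling)',
    'do_stats': 'phase 6 (Post-Analysis)',
    'do_compare_dataset': 'phase 7 (Compare Datasets)',
    'do_report': 'phase 9 (Summary Report for phases 1-7)',
    'do_replicate': 'phase 8 (Replication)',
    'do_rep_report': 'phase 9 (Summary Report for phase 8)',
    'do_cleanup': 'output file cleanup',
}

# do_till_report covers everything except replication and its report.
# Assumption: cleanup is not included.
TILL_REPORT = ['do_eda', 'do_dataprep', 'do_feat_imp', 'do_feat_sel',
               'do_model', 'do_stats', 'do_compare_dataset', 'do_report']


def phases_to_run(flags):
    """Hypothetical helper: list what a set of boolean flags would trigger."""
    enabled = {name for name, on in flags.items() if on}
    if 'do_till_report' in enabled:
        enabled.update(TILL_REPORT)
    return [desc for name, desc in PHASE_FLAGS.items() if name in enabled]


till_report = phases_to_run({'do_till_report': True})
only_model = phases_to_run({'do_model': True, 'do_stats': False})
```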

applyToReplication

  • Description: a notebook-specific parameter indicating whether to include running phase 8 (i.e. Replication)

  • Format: (bool)

  • Values: True or False

demo_run

  • Description: a notebook-specific parameter indicating whether to automatically run the notebook on the demonstration datasets

  • Format: (bool)

  • Values: True or False

use_data_prompt

  • Description: a notebook-specific parameter that activates a notebook prompt to gather essential run parameter information directly from the user rather than have them manually update code cells

  • Format: (bool)

  • Values: True or False


General Parameters (Phase 1)

cv_partitions

  • Description: k, the number of k-fold cross validation training/testing data partitions to create and apply throughout pipeline

  • Format: (int)

  • Values: an integer between 3 and 10 is recommended

  • Tips: smaller values will yield shorter STREAMLINE run times, but training datasets will have a smaller number of instances

partition_method

  • Description: the cross validation strategy used

  • Format: (str)

  • Values: 'Stratified', 'Random', or 'Group'

  • Tips: 'Stratified' is generally recommended in order to keep class balance as similar as possible within respective partitions, however 'Group' can be selected when match_label has been specified to keep instances with the same match/group ID together within a respective partition
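The idea behind 'Group' partitioning can be sketched in a few lines: whole match groups, rather than individual instances, are assigned to partitions. This is a simplified stand-in written for illustration (round-robin assignment of shuffled groups); it is not the routine STREAMLINE uses, and the function name is hypothetical.

```python
import random


def group_partitions(match_ids, k, seed=42):
    """Assign instance indices to k partitions, keeping each match group intact."""
    groups = sorted(set(match_ids))
    random.Random(seed).shuffle(groups)
    # Deal the shuffled groups out to partitions in turn.
    partition_of_group = {g: i % k for i, g in enumerate(groups)}
    partitions = [[] for _ in range(k)]
    for index, group in enumerate(match_ids):
        partitions[partition_of_group[group]].append(index)
    return partitions


match_ids = ['m1', 'm1', 'm2', 'm2', 'm3', 'm3', 'm4', 'm4']
partitions = group_partitions(match_ids, k=3)
```

Every instance lands in exactly one partition, and both members of a matched pair always land in the same one.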

categorical_cutoff

  • Description: the number of unique values observed for a given feature in a ‘target dataset’ after which a variable is automatically considered to be quantitative

  • Format: (int)

  • Values: an integer between 3 and 10 is generally recommended, but should be set in a dataset-specific manner

  • Tips: this parameter will only be used if the user hasn’t specifically indicated which features to treat as categorical or quantitative using categorical_feature_path and/or quantitative_feature_path, respectively. However depending on the specific dataset, users can sometimes conveniently set this parameter to correctly assign variable types, e.g. if all categorical features in the dataset have fewer than 5 unique values, but quantitative ones all have more than 10 unique values, setting categorical_cutoff = 7 will make correct feature type assignments automatically.

sig_cutoff

  • Description: the statistical significance cutoff used throughout the pipeline, both in deciding whether to run pair-wise non-parametric statistical comparisons following group comparisons, and in identifying significant results in output files with a ‘*’

  • Format: (float)

  • Values: a value <= 0.05 is recommended

  • Tips: Note: STREAMLINE does not currently automatically account for multiple testing - users should take this into consideration themselves
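Since multiple testing is left to the user, one simple (and conservative) option is a Bonferroni correction, which divides the cutoff by the number of tests performed. The sketch below is a user-side calculation, not something STREAMLINE does; both function names are hypothetical.

```python
def bonferroni_cutoff(sig_cutoff, n_tests):
    """Per-test significance cutoff under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError('n_tests must be at least 1')
    return sig_cutoff / n_tests


def significant_after_correction(p_values, sig_cutoff=0.05):
    """Flag which p-values remain significant after the correction."""
    cutoff = bonferroni_cutoff(sig_cutoff, len(p_values))
    return [p <= cutoff for p in p_values]


# Four tests at sig_cutoff = 0.05 give a per-test cutoff of 0.0125.
flags = significant_after_correction([0.001, 0.02, 0.04, 0.30])
```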

random_state

  • Description: sets a specific random seed for the STREAMLINE run (important for pipeline reproducibility)

  • Format: (int)

  • Values: any positive integer value is fine

  • Tips: make sure to use the same value for random_state in a separate run along with the same datasets and run parameters to obtain reproducible pipeline results


Data Processing Parameters (Phase 1)

exclude_eda_output

  • Description: allows users to exclude some of the outputs automatically generated by STREAMLINE during phase 1

  • Format:

    1. for notebook or config file modes: provide a (list) of valid options (str) , e.g. ['describe','univariate_plots','correlation_plots']

    2. for command line arguments: provide as a list of comma separated values with no spaces, e.g. describe,univariate_plots,correlation_plots

  • Values: None, or any combination of 'describe', 'univariate_plots', and 'correlation_plots' (provided in the format above)

    • describe - don’t run or output the set of standard pandas summaries (i.e. describe(), dtypes, and nunique()) as .csv files

    • univariate_plots - don’t output individual univariate analysis plots illustrating features vs. outcome (by default STREAMLINE outputs these plots for any feature with a significant univariate association based on sig_cutoff)

    • correlation_plots - don’t output feature correlation heatmaps for the ‘initial’ or ‘processed’ data EDA

top_uni_features

  • Description: number of most significant features to report in the notebook and PDF summary

  • Format: (int)

  • Values: an integer between 10 and 40 is recommended

featureeng_missingness

  • Description: the proportion of missing values within a feature (above which) a new binary categorical feature is generated that indicates if the value for an instance was missing or not

  • Format: (float)

  • Values: (0.0 - 1.0)

  • Tips: this parameter controls automated feature engineering of a new ‘missingness’ feature, generated for another pre-existing feature in the ‘target dataset’. It’s useful for identifying the potentially predictive value of any feature whose missingness is not completely at random (NCAR)
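A minimal sketch of this kind of missingness feature engineering is shown below. It assumes missing values are represented as None and that new features are named with a `Miss_` prefix; both choices, and the function name, are illustrative assumptions rather than a description of STREAMLINE's code.

```python
def add_missingness_features(columns, featureeng_missingness=0.5):
    """Add a 0/1 indicator for each feature whose missing proportion
    exceeds the threshold. columns maps feature name -> list of values."""
    engineered = dict(columns)
    for name, values in columns.items():
        missing = [v is None for v in values]
        if sum(missing) / len(values) > featureeng_missingness:
            # 1 = value was missing for this instance, 0 = value was present
            engineered['Miss_' + name] = [int(m) for m in missing]
    return engineered


data = {
    'lab_value': [None, None, None, 4.2],  # 75% missing -> indicator added
    'age': [30, None, 41, 55],             # 25% missing -> left alone
}
engineered = add_missingness_features(data)
```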

cleaning_missingness

  • Description: the proportion of missing values, within a feature or instance, (at which) the given feature or instance will be automatically cleaned (i.e. removed) from the processed ‘target dataset’

  • Format: (float)

  • Values: (0.0 - 1.0)

  • Tips: this parameter controls automated data cleaning based on feature or instance ‘missingness’. STREAMLINE will first remove features with high missingness, then subsequently remove any instances with missingness over this proportion.
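The two-step order described in the tip (features first, then instances) can be sketched as follows. The sketch assumes None marks a missing value and that a feature or instance is removed once its missing proportion reaches the threshold; the exact boundary behaviour and the function name are assumptions.

```python
def clean_missingness(rows, feature_names, cleaning_missingness=0.5):
    """Drop high-missingness features, then high-missingness instances.

    rows is a non-empty list of dicts (one per instance).
    Returns (kept_features, kept_rows).
    """
    def missing_fraction(values):
        return sum(v is None for v in values) / len(values)

    # Step 1: remove features whose missingness reaches the threshold.
    kept_features = [f for f in feature_names
                     if missing_fraction([r[f] for r in rows]) < cleaning_missingness]
    if not kept_features:
        return [], []

    # Step 2: using only the surviving features, remove instances the same way.
    kept_rows = []
    for r in rows:
        values = [r[f] for f in kept_features]
        if missing_fraction(values) < cleaning_missingness:
            kept_rows.append(dict(zip(kept_features, values)))
    return kept_features, kept_rows


rows = [
    {'a': 1, 'b': None, 'c': None},
    {'a': 2, 'b': None, 'c': 3},
    {'a': None, 'b': None, 'c': 9},
    {'a': 4, 'b': 5, 'c': 6},
]
features, cleaned = clean_missingness(rows, ['a', 'b', 'c'])
```

Here feature `b` (75% missing) is dropped first, after which the first and third instances are each missing half of the remaining features and are removed.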

correlation_removal_threshold

  • Description: the (pearson) feature correlation at which one out of a pair of features is randomly removed from the processed ‘target dataset’

  • Format: (float)

  • Values: (0.0 - 1.0)

  • Tips: this parameter controls automated data cleaning based on feature correlation. The safest setting (to avoid missing predictive information) is the default of 1.0 (i.e. perfect correlation between two features). Note: STREAMLINE interprets this parameter as both a positive and negative correlation threshold.
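The sketch below illustrates correlation-based removal under the interpretation given above: the threshold is applied to the absolute Pearson correlation, and one feature of each offending pair is dropped at random. It is a from-scratch illustration with hypothetical function names, and the small floating-point tolerance is our addition so that perfectly correlated features are still caught at a threshold of 1.0.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant feature: treat as uncorrelated
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=1.0, seed=42):
    """Return the feature names kept after removing one of each correlated pair."""
    rng = random.Random(seed)
    names = list(columns)
    kept = list(columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept:
                r = pearson(columns[a], columns[b])
                # abs() applies the threshold to negative correlations too.
                if abs(r) >= threshold - 1e-12:
                    kept.remove(rng.choice([a, b]))
    return kept


data = {
    'x': [1, 2, 3, 4],
    'y': [2, 4, 6, 8],      # perfectly correlated with x
    'z': [-1, -2, -3, -4],  # perfectly anti-correlated with x
    'w': [1, 0, 2, 1],      # only weakly correlated with the others
}
kept = drop_correlated(data)
```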


Imputation & Scaling Parameters (Phase 2)

impute_data

  • Description: indicates whether or not to apply missing data imputation to features in the data

  • Format: (bool)

  • Values: True or False

  • Tips: leaving this at the default value of True is recommended, but imputation is not always necessary, depending on whether missing data is present in the original datasets and which algorithms a user wishes to run (e.g. ExSTraCS can handle missing values in data)

multi_impute

  • Description: indicates whether or not to apply multiple imputation using scikit-learn’s IterativeImputer for imputing missing values in quantitative features. Mode imputation is always applied for categorical features.

  • Format: (bool)

  • Values: True or False

  • Tips: for larger datasets, multiple imputation can run very slowly, and take up a lot of disk space in the pickled imputation files that are automatically stored for downstream imputation of replication data or further external application of the models. When False, median imputation is instead used for quantitative features.
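For intuition, the simpler imputation path (mode for categorical features, median for quantitative features when multi_impute is False) can be sketched with the standard library alone. This is only an illustration of the two fill rules, assuming None marks a missing value; the function name is hypothetical and this is not STREAMLINE's implementation.

```python
import statistics


def simple_impute(values, is_categorical):
    """Fill missing entries with the mode (categorical) or median (quantitative)."""
    observed = [v for v in values if v is not None]
    if is_categorical:
        fill = statistics.mode(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


imputed_categorical = simple_impute([1, None, 1, 2], is_categorical=True)
imputed_quantitative = simple_impute([1.0, None, 3.0, 10.0], is_categorical=False)
```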

scale_data

  • Description: indicates whether or not to apply standard scaling to features in the data

  • Format: (bool)

  • Values: True or False

  • Tips: leaving this at the default value of True is recommended, but scaling is not always necessary, depending on which algorithms a user wishes to run (see Imputation and Scaling)

overwrite_cv

  • Description: indicates whether or not to overwrite the phase 1 version of CV (training and testing) datasets with newly imputed and scaled CV datasets

  • Format: (bool)

  • Values: True or False

  • Tips: True will reduce the number of output files generated (and storage space) keeping only the final processed, imputed, scaled, and feature selected CV datasets, however False allows users to view intermediary CV datasets following phase one data processing and CV partitioning


Feature Importance Estimation Parameters (Phase 3)

do_mutual_info

  • Description: indicates whether or not to run mutual information as a feature importance estimation algorithm (prior to modeling)

  • Format: (bool)

  • Values: True or False

  • Tips: mutual information is good at detecting univariate association between a given feature and outcome. While we recommend running both feature importance algorithms, users should specify True for at least one algorithm.

do_multisurf

  • Description: indicates whether or not to run MultiSURF as a feature importance estimation algorithm (prior to modeling)

  • Format: (bool)

  • Values: True or False

  • Tips: MultiSURF is good at detecting both features involved in an interaction and univariate association with outcome. While we recommend running both feature importance algorithms, users should specify True for at least one algorithm.

use_turf

  • Description: indicates whether or not to run TuRF, a wrapper algorithm that operates around MultiSURF, improving its ability to detect feature interactions in data with larger numbers of features

  • Format: (bool)

  • Values: True or False

  • Tips: using TuRF is strongly recommended in datasets with >10,000 features, but can improve feature importance rankings in datasets with fewer features as well

turf_pct

  • Description: this parameter currently serves two functions: (1) it determines the proportion of features removed from consideration during a TuRF iteration, and (2) it dictates the number of TuRF iterations (where the number of iterations is 1/turf_pct)

  • Format: (float)

  • Values: (0.01 - 0.5)

  • Tips: setting turf_pct to 0.5 will run MultiSURF twice, removing the lowest scoring half of features in the first iteration (and giving them a very low feature importance score), then running MultiSURF again on the remaining features to rescore them. A setting of 0.2 would remove 20% of features each iteration, over 5 iterations. Thus lower values for this parameter will increase run time.
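The arithmetic in this tip can be worked through with a short sketch that lists how many features MultiSURF would score in each TuRF iteration. It assumes the removed fraction is taken from the original feature count at every iteration, which is our reading of the description and may differ from the actual TuRF implementation; the function name is hypothetical.

```python
def turf_schedule(n_features, turf_pct=0.5):
    """Number of features scored in each TuRF iteration.

    Assumption: int(n_features * turf_pct) features are removed per iteration,
    and the number of iterations is 1/turf_pct.
    """
    if not 0 < turf_pct <= 0.5:
        raise ValueError('turf_pct should be in (0, 0.5]')
    n_iterations = round(1 / turf_pct)
    removed_per_iteration = int(n_features * turf_pct)
    return [n_features - i * removed_per_iteration for i in range(n_iterations)]


two_iterations = turf_schedule(100, turf_pct=0.5)   # [100, 50]
five_iterations = turf_schedule(100, turf_pct=0.2)  # [100, 80, 60, 40, 20]
```

Lower values of turf_pct produce longer schedules, which is why they increase run time.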

instance_subset

  • Description: the number of randomly chosen instances in the training data used for running MultiSURF

  • Format: (int)

  • Values: any integer above 500 is recommended, but the default of 2000 seems to be a reasonable trade-off in many cases between run time and performance

  • Tips: the MultiSURF algorithm scales quadratically with the number of instances in the data, but linearly with the number of features. Thus a dataset with a large number of training instances can make MultiSURF run very slowly. However, MultiSURF does not necessarily need to see all training instances to reasonably estimate feature importance. If this parameter is set larger than the number of instances in a given training dataset, it will simply use all available training instances.
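The effect of this parameter can be sketched in two parts: capping the number of instances passed to MultiSURF, and a back-of-the-envelope cost comparison based on the quadratic scaling with instances. Both functions are hypothetical illustrations; the cost function is a relative figure only, not a run-time prediction.

```python
import random


def subsample_instances(n_instances, instance_subset=2000, seed=42):
    """Indices of the training instances that would be used for scoring."""
    indices = list(range(n_instances))
    if instance_subset >= n_instances:
        return indices  # fewer instances than the cap: use them all
    return sorted(random.Random(seed).sample(indices, instance_subset))


def relative_cost(n_instances, n_features):
    """Relative cost: quadratic in instances, linear in features."""
    return n_instances ** 2 * n_features


small = subsample_instances(500)     # 500 < 2000, so all 500 are used
capped = subsample_instances(10000)  # capped at 2000 randomly chosen instances
speedup = relative_cost(10000, 100) / relative_cost(2000, 100)
```

Under this scaling, capping 10,000 instances at 2,000 cuts the relative cost by a factor of 25.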

n_jobs

  • Description: the number of CPU cores dedicated to running MultiSURF

  • Format: (int)

  • Values: -1, or a positive integer <= the number of cores available on your machine

  • Tips: -1 will run MultiSURF on all available cores when run locally


Feature Selection Parameters (Phase 4)

filter_poor_features

  • Description: indicates whether or not to apply feature selection to the dataset

  • Format: (bool)

  • Values: True or False

  • Tips: when set to False all features will be preserved in the datasets for phase 5 modeling

max_features_to_keep

  • Description: indicates the maximum number of top scoring features to retain in the datasets prior to phase 5 modeling (based on the scores of the feature importance estimation algorithms, i.e. Mutual Information and MultiSURF)

  • Format: (int or None)

  • Values: any positive integer > 1 is acceptable

  • Tips: we have set the default of this parameter to 2000 primarily to limit the computational burden of modeling. Users should use their own judgment in setting this parameter for the dataset/task in hand. When set to None and filter_poor_features = True, STREAMLINE will automatically remove any feature that scored <= 0 for each feature importance estimation algorithm run. When set to an integer such as 2000 and filter_poor_features = True, STREAMLINE will first remove any feature that scored <= 0 for each feature importance estimation algorithm run, then alternate between the sets of feature importance rankings keeping the top scoring (non-redundant) features from each algorithm.
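The selection logic described in this tip can be sketched as below. It assumes a feature is dropped only when it scored <= 0 under every algorithm that was run, and that "alternating" means taking the next best not-yet-kept feature from each ranking in turn; both are our reading of the description, and the function name and data layout are hypothetical.

```python
def select_features(scores_by_algorithm, max_features_to_keep=None):
    """scores_by_algorithm maps algorithm name -> {feature: mean score}."""
    # One ranking per algorithm, containing only features that scored above 0.
    rankings = []
    for scores in scores_by_algorithm.values():
        informative = [f for f, s in scores.items() if s > 0]
        rankings.append(sorted(informative, key=scores.get, reverse=True))

    candidates = set().union(*rankings)
    if max_features_to_keep is None or len(candidates) <= max_features_to_keep:
        return sorted(candidates)

    # Alternate between rankings, taking each one's best remaining feature.
    kept = []
    while len(kept) < max_features_to_keep:
        for ranked in rankings:
            remaining = [f for f in ranked if f not in kept]
            if remaining and len(kept) < max_features_to_keep:
                kept.append(remaining[0])
    return kept


scores = {
    'MI': {'A': 0.5, 'B': 0.3, 'C': 0.0, 'D': 0.1, 'E': 0.0},
    'MS': {'A': 0.2, 'B': -0.1, 'C': 0.4, 'D': 0.0, 'E': -0.2},
}
no_cap = select_features(scores)                         # E is never informative
top_two = select_features(scores, max_features_to_keep=2)
```

In this toy example `E` is removed because neither algorithm scored it above 0, and with a cap of 2 the result takes the top feature from each ranking (`A` from the first, `C` from the second).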

export_scores

  • Description: indicates whether or not to export barplots for the feature importance estimation algorithms (Mutual Information and MultiSURF) summarizing average feature importance scores over CV training partitions

  • Format: (bool)

  • Values: True or False

top_fi_features

  • Description: number of top scoring features (mean over CV runs) to illustrate in the above feature importance estimation bar plots generated when export_scores = True

  • Format: (int)

  • Values: an integer between 10 and 40 is recommended

overwrite_cv_feat

  • Description: indicates whether or not to overwrite the phase 2 version of CV (training and testing) datasets with newly feature selected CV datasets

  • Format: (bool)

  • Values: True or False

  • Tips: True will reduce the number of output files generated (and storage space) keeping only the final processed, imputed, scaled, and feature selected CV datasets, however False allows users to view intermediary CV datasets following phase two imputation and scaling


Modeling Parameters (Phase 5)

algorithms

  • Description: used to specify which machine learning modeling algorithms will be applied

  • Format: (list of ‘str’ values, or None)

    1. for notebook or config file modes: provide a (list) of (str) algorithm identifiers, e.g. ['NB','LR','EN','DT','RF','XGB','SVM','ANN','KNN','GP','ExSTraCS']

    2. for command line arguments: provide as a list of comma separated values with no spaces, e.g. NB,LR,EN,DT,RF,XGB,SVM,ANN,KNN,GP,ExSTraCS

  • Values: None, or any subset of the following ['NB','LR','EN','DT','RF','GB','XGB','LGB','CGB','SVM','ANN','KNN','GP','eLCS','XCS','ExSTraCS'], where:

    • Naive Bayes (NB)

    • Logistic Regression (LR)

    • Elastic Net (EN)

    • Decision Tree (DT)

    • Random Forest (RF)

    • Gradient Boosting (GB)

    • Extreme Gradient Boosting (XGB)

    • Light Gradient Boosting (LGB)

    • Category Gradient Boosting (CGB)

    • Support Vector Machines (SVM)

    • Artificial Neural Networks (ANN)

    • K-Nearest Neighbors (KNN)

    • Genetic Programming, i.e. symbolic classification (GP)

    • Educational Learning Classifier System (eLCS)

    • ‘X’ Classifier System (XCS)

    • Extended Supervised Tracking Classifier System (ExSTraCS)

  • Tips: setting this parameter to None will run all algorithms in STREAMLINE with the exception of any algorithms specified within exclude. To run a fairly comprehensive subset of algorithms (without running them all), we recommend ['NB','LR','EN','DT','RF','XGB','SVM','ANN','KNN','GP','ExSTraCS']. Specifying algorithms using this parameter is most convenient when you want to run a small subset of algorithms, e.g. ['NB','LR','DT']

exclude

  • Description: used to specify which machine learning modeling algorithms to exclude from analysis

  • Format: (list of ‘str’ values, or None)

    1. for notebook or config file modes: provide a (list) of (str) algorithm identifiers, e.g. ['eLCS','XCS']

    2. for command line arguments: provide as a list of comma separated values with no spaces, e.g. eLCS,XCS

  • Values: same as for algorithms above

  • Tips: setting this parameter to None just tells STREAMLINE not to exclude any additional algorithms not already specified within algorithms. Currently, by default STREAMLINE excludes eLCS and XCS from an analysis. Specifying algorithms using this parameter is most convenient when you want to exclude a small subset of algorithms, e.g. ['SVM','eLCS','XCS'].
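How algorithms and exclude combine can be sketched as a small resolution step. The identifiers and the default exclusion come from the entries above; the helper names are hypothetical, and the assumption that exclude is also applied when algorithms is given explicitly is ours.

```python
ALL_ALGORITHMS = ['NB', 'LR', 'EN', 'DT', 'RF', 'GB', 'XGB', 'LGB', 'CGB',
                  'SVM', 'ANN', 'KNN', 'GP', 'eLCS', 'XCS', 'ExSTraCS']


def parse_cli_list(text):
    """Split a command-line value such as 'eLCS,XCS' into a list."""
    return None if text is None else text.split(',')


def resolve_algorithms(algorithms=None, exclude=('eLCS', 'XCS')):
    """Hypothetical helper: the final list of algorithms that would be run."""
    selected = ALL_ALGORITHMS if algorithms is None else list(algorithms)
    excluded = set(exclude or [])
    unknown = [a for a in list(selected) + sorted(excluded)
               if a not in ALL_ALGORITHMS]
    if unknown:
        raise ValueError('unknown algorithm identifiers: %s' % unknown)
    return [a for a in selected if a not in excluded]


defaults = resolve_algorithms()  # all algorithms except eLCS and XCS
small_subset = resolve_algorithms(['NB', 'LR', 'DT'], exclude=None)
from_cli = resolve_algorithms(exclude=parse_cli_list('SVM,eLCS,XCS'))
```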

training_subsample

  • Description: the number of randomly chosen instances in the training data used for training certain longer-running algorithms (i.e. XGB, SVM, KNN, ANN, LR, eLCS, XCS, ExSTraCS)

  • Format: (0, or another int)

  • Values: the default of 0 will use all training data. Otherwise, any positive integer is acceptable.

  • Tips: In general, we recommend leaving this parameter at 0, however some algorithms may take a very long time to run. If you are worried about this, we recommend setting this parameter to 2000 as a reasonable trade-off between run time and performance in many cases.
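
The idea behind training_subsample can be sketched as follows. This is a minimal illustration and not STREAMLINE's implementation: the function name is hypothetical, and returning all rows when the requested size is not smaller than the dataset is an assumption.

```python
import random


def subsample_training_rows(rows, training_subsample=0, random_state=42):
    """Return all rows when training_subsample is 0 (or not smaller than the
    data); otherwise return a reproducible random subset of that size."""
    if training_subsample < 0:
        raise ValueError("training_subsample must be 0 or a positive integer")
    if training_subsample == 0 or training_subsample >= len(rows):
        return list(rows)
    rng = random.Random(random_state)  # seeded so repeated runs pick the same rows
    return rng.sample(list(rows), training_subsample)


training_rows = list(range(10000))  # stand-in for 10,000 training instances
print(len(subsample_training_rows(training_rows, 0)))     # all rows are kept
print(len(subsample_training_rows(training_rows, 2000)))  # 2000 rows are kept
```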

use_uniform_fi

  • Description: indicates whether or not to override any available (modeling-algorithm-specific) model-feature-importance estimation methods, instead using scikit-learn’s permutation importance estimator uniformly for all algorithms

  • Format: (bool)

  • Values: True or False

  • Tips: when True, model feature importance will be estimated in the same way for all models/algorithms. However, when False, the following algorithms use their own strategies for estimating model feature importance instead: LR, DT, RF, XGB, LGB, GB, eLCS, XCS, ExSTraCS. Any algorithms without an internal strategy for estimating model feature importance will rely on permutation importance by default.
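
To make the idea of permutation importance concrete, here is a small hand-rolled sketch in plain Python. It is not scikit-learn's implementation (the real estimator is sklearn.inspection.permutation_importance) and not STREAMLINE's code; the function names, toy data, and toy model are all made up for illustration. A feature's importance is taken as the average drop in accuracy after that feature's column is shuffled.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance_sketch(model, X, y, n_repeats=5, random_state=0):
    """Importance of feature j = mean drop in accuracy after shuffling column j."""
    rng = random.Random(random_state)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label equals feature 0; feature 1 is an unrelated constant.
X = [[i % 2, 7] for i in range(40)]
y = [row[0] for row in X]


def toy_model(row):
    """A stand-in for a trained model that only looks at feature 0."""
    return row[0]


print(permutation_importance_sketch(toy_model, X, y))
```

Because the same procedure works for any model that can make predictions, it can be applied uniformly across algorithms, which is what use_uniform_fi = True requests.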

primary_metric

  • Description: the evaluation metric used to optimize hyperparameters

  • Format: (str)

  • Values: We recommend 'balanced_accuracy', 'roc_auc', or 'f1' (based on the user's needs/priorities), however it can be any available metric identifier from (https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)

metric_direction

  • Description: indicates whether the primary_metric should be maximized or minimized during hyperparameter optimization

  • Format: (str)

  • Values: maximize or minimize

  • Tips: For almost all metrics (including 'balanced_accuracy', 'roc_auc', or 'f1'), this should be maximize

n_trials

  • Description: an Optuna parameter controlling the number of hyperparameter optimization trials to be conducted

  • Format: (int)

  • Values: any positive integer > 1, (200 by default)

  • Tips: When this parameter is set to a larger value, hyperparameter optimization will take longer to complete, but a broader range of hyperparameter configurations will be considered which can improve algorithm modeling performance

timeout

  • Description: an Optuna parameter controlling the total number of seconds until a given hyperparameter sweep stops running new trials

  • Format: (int, or None)

  • Values: any positive integer > 1, (900 by default, i.e. 15 minutes), or None

  • Tips: To ensure STREAMLINE reproducibility, this parameter must be set to None, however this will force all algorithms to fully complete the number of trials specified by n_trials. When set to an integer, Optuna will submit new trials (as previous ones complete) up until this time limit, and then only use the hyperparameter sweep trials it has completed to pick the best hyperparameter settings for the given algorithm. Any trial already running when this time limit is reached will continue to run until completion. This means that one algorithm can spend more total time on hyperparameter trials than another when this parameter is given a time limit.
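
The interplay between n_trials and timeout can be illustrated with a simplified, sequential simulation. This is not Optuna's code (in Optuna both values are passed to study.optimize(objective, n_trials=..., timeout=...)); the function name and the per-trial durations below are invented for the example.

```python
def simulate_sweep(trial_seconds, n_trials=200, timeout=900):
    """Simulate a sequential sweep in which each trial takes the given number of
    seconds. A new trial starts only while elapsed time is below `timeout`;
    a trial that has already started always runs to completion."""
    elapsed = 0.0
    completed = 0
    for seconds in trial_seconds[:n_trials]:
        if timeout is not None and elapsed >= timeout:
            break  # time limit reached: no new trials are started
        elapsed += seconds  # a started trial is never interrupted
        completed += 1
    return completed, elapsed


fast = [4.0] * 200    # an algorithm whose trials take 4 s each
slow = [400.0] * 200  # an algorithm whose trials take 400 s each

print(simulate_sweep(fast, timeout=900))               # (200, 800.0)
print(simulate_sweep(slow, timeout=900))               # (3, 1200.0)
print(simulate_sweep(slow, n_trials=5, timeout=None))  # (5, 2000.0)
```

In the second call the slow algorithm overshoots the 900 second limit because its third trial started before the limit and was allowed to finish. The number of completed trials depends on how fast the hardware is, which is why a time limit interferes with exact reproducibility.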

export_hyper_sweep_plots

  • Description: indicates whether or not to generate an Optuna-plot visualizing the hyperparameter sweep of an algorithm on a given dataset

  • Format: (bool)

  • Values: True or False

do_lcs_sweep

  • Description: indicates whether or not to apply an Optuna hyperparameter sweep to one of the rule-based ML algorithms, i.e. (eLCS, XCS, ExSTraCS)

  • Format: (bool)

  • Values: True or False

  • Tips: Learning classifier system (LCS) algorithms, i.e. rule-based ML modeling algorithms, can be computationally expensive, but have fairly reliable default run parameter settings. This parameter allows users to avoid a hyperparameter sweep and train each LCS algorithm only once on manually specified run parameters. To save run time, in general we recommend leaving this parameter set to False and specifying the LCS run parameters described below. Watch this video to learn LCS basics.

lcs_nu

  • Description: specifies the nu parameter used by LCS algorithms (i.e. eLCS,XCS,ExSTraCS)

  • Format: (int)

  • Values: (1 - 10)

  • Tips: higher values place more pressure on these algorithms to generate perfectly accurate rules, which easily leads to overfitting in noisy problems. Unless you know that your models should be able to achieve 100% testing accuracy on the target data, we recommend leaving this parameter at the default of 1. Watch this video to learn LCS basics.

lcs_iterations

  • Description: specifies the number of learning iterations an LCS algorithm will run (i.e. eLCS,XCS,ExSTraCS)

  • Format: (int)

  • Values: a positive integer at least two times larger than the number of training instances in the target data

  • Tips: each iteration, an LCS algorithm focuses on one instance in the training dataset, thus this parameter should always be larger (ideally much larger) than the number of training instances in the data. For most users we recommend the default value of 200000 as a starting point, however, as a key run parameter, more learning iterations is typically expected to improve LCS algorithm performance. Watch this video to learn LCS basics.

lcs_N

  • Description: specifies the maximum rule-population size for an LCS algorithm (i.e. eLCS,XCS,ExSTraCS)

  • Format: (int)

  • Values: a positive integer > 50

  • Tips: LCS algorithms learn a population (i.e. set) of rules that collectively constitute the learned model. When this parameter is larger, LCS will take longer to run. However, LCS algorithms require a larger rule-population to solve more complex problems or analyze larger datasets. For most users we recommend the default value of 2000 as a starting point, however, as a key run parameter, a larger rule-population is typically expected to improve LCS algorithm performance. Watch this video to learn LCS basics.
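
The value ranges given above for lcs_nu, lcs_iterations, and lcs_N can be checked before launching a run. The helper below is a hypothetical sketch (STREAMLINE does not necessarily perform these checks itself); the defaults are the ones quoted in the tips above.

```python
def check_lcs_settings(n_training_instances, lcs_nu=1,
                       lcs_iterations=200000, lcs_N=2000):
    """Return a list of messages for LCS settings outside the documented ranges."""
    problems = []
    if not 1 <= lcs_nu <= 10:
        problems.append("lcs_nu should be between 1 and 10")
    if lcs_iterations < 2 * n_training_instances:
        problems.append("lcs_iterations should be at least twice the number "
                        "of training instances")
    if lcs_N <= 50:
        problems.append("lcs_N should be greater than 50")
    return problems


# With 1,000 training instances the defaults are within range.
print(check_lcs_settings(1000))
# With 150,000 training instances the default 200000 iterations is too few.
print(check_lcs_settings(150000))
```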

lcs_timeout

  • Description: similar to timeout, this Optuna parameter controls the total number of seconds until an LCS algorithm hyperparameter sweep stops running new trials. LCS uses a separate run parameter for this since it can take a lot longer to run an LCS hyperparameter sweep.

  • Format: (int, or None)

  • Values: any positive integer > 1, (1200 by default, i.e. 20 minutes), or None

  • Tips: To ensure STREAMLINE reproducibility, this parameter must be set to None if do_lcs_sweep = True, however this will force LCS algorithms to fully complete the number of trials specified by n_trials. When set to an integer, Optuna will submit new trials (as previous ones complete) up until this time limit, and then only use the hyperparameter sweep trials it has completed to pick the best hyperparameter settings for the given LCS algorithm. Any trial already running when this time limit is reached will continue to run until completion. This means that one LCS algorithm can spend more total time on hyperparameter trials than another when this parameter is given a time limit.

model_resubmit

  • Description: boolean flag telling STREAMLINE that this is a secondary run attempt of phase 5 (i.e. modeling)

  • Format: [Command Line Argument] just use flag (i.e. --model-resubmit), [Configuration File] (bool)

  • Values: True or False

  • Tips: set this parameter to True either because (1) one of the previous model training jobs timed-out, or failed and the user wants to re-submit them or (2) the user had previously run phase 5 on a subset of available algorithms, but now they’d like to run additional algorithms


Post-Analysis Parameters (Phase 6)

exclude_plots

  • Description: allows users to exclude some of the outputs automatically generated by STREAMLINE during phase 6 (post-analysis)

  • Format:

    1. for notebook or config file modes: provide a (list) of valid options (str), e.g. ['plot_ROC','plot_PRC']

    2. for command line arguments: provide as a list of comma separated values with no spaces, e.g. plot_ROC,plot_PRC

  • Values: None, or ['plot_ROC', 'plot_PRC', 'plot_FI_box', or 'plot_metric_boxplots'] - provided in format above

    • plot_ROC - don’t output ROC plots individually for each algorithm including all CV results and averages

    • plot_PRC - don’t output PRC plots individually for each algorithm including all CV results and averages

    • plot_FI_box - don’t output model feature importance boxplots for each algorithm

    • plot_metric_boxplots - don’t output evaluation metric boxplots for each metric comparing algorithm performance

metric_weight

  • Description: the evaluation metric used to weigh model feature importance estimates in the composite feature importance plots

  • Format: (str)

  • Values: balanced_accuracy or roc_auc

  • Tips: we recommend setting this parameter to the same value as primary_metric if possible

top_model_fi_features

  • Description: the number of top scoring features (based on model feature importance estimates) to illustrate in feature importance figures (i.e. feature importance boxplots, and composite feature importance plots)

  • Format: (int)

  • Values: an integer between 10 and 40 is recommended

  • Tips:


Replication Parameters (Phase 8)

exclude_rep_plots

  • Description: allows users to exclude some of the outputs automatically generated by STREAMLINE during phase 8 (replication)

  • Format:

    1. for notebook or config file modes: provide a (list) of valid options (str), e.g. ['plot_ROC', 'plot_PRC']

    2. for command line arguments: provide as a list of comma separated values with no spaces, e.g. plot_ROC,plot_PRC

  • Values: None, or ['feature_correlations','plot_ROC', 'plot_PRC', or 'plot_metric_boxplots'] - provided in format above

    • feature_correlations - don’t output feature correlation heatmaps for the replication datasets during replication EDA

    • plot_ROC - don’t output ROC plots individually for each algorithm including all CV results and averages

    • plot_PRC - don’t output PRC plots individually for each algorithm including all CV results and averages

    • plot_metric_boxplots - don’t output evaluation metric boxplots for each metric comparing algorithm performance


Cleanup Parameters

del_time

  • Description: boolean flag telling STREAMLINE to delete individual runtime files from the output experiment folder

  • Format: [Command Line Argument] just use flag (i.e. --del-time), [Configuration File] (bool)

  • Values: True or False

del_old_cv

  • Description: boolean flag telling STREAMLINE to delete intermediary cross validation datasets (i.e. training and testing datasets prior to completed data processing, imputation, scaling, and feature selection) from the output experiment folder

  • Format: [Command Line Argument] just use flag (i.e. --del-old-cv), [Configuration File] (bool)

  • Values: True or False

  • Tips: this parameter is only relevant if overwrite_cv was set to False


Multiprocessing Parameters

run_parallel

  • Description: indicates whether or not to run STREAMLINE in parallel (locally) with CPU core multiprocessing

  • Format: (bool)

  • Values: True or False

  • Tips: this parameter is only relevant when run_cluster = False

run_cluster

  • Description: indicates whether or not to run STREAMLINE on a Dask-compatible computing cluster (HPC)

  • Format: (bool or str)

  • Values: False, or a string identifying the cluster type from options below:

    • LSF - LSFCluster

    • SLURM - SLURMCluster

    • HTCondor - HTCondorCluster

    • Moab - MoabCluster

    • OAR - OARCluster

    • PBS - PBSCluster

    • SGE - SGECluster

    • UGE - SGECluster variant used at our institution

    • Local - LocalCluster

    • SLURMOld - Legacy job submission for SLURMCluster

    • LSFOld - Legacy job submission for LSFCluster

  • Tips: The default of "SLURM" is specific to our institution's HPC hardware/software, and may not be relevant to many users

reserved_memory

  • Description: the memory (in Gigabytes) reserved for STREAMLINE jobs

  • Format: (int)

  • Values: an integer, generally > 1 and < the maximum memory available for an HPC job on your system (consult your cluster documentation or administrator)

queue

  • Description: indicates the queue within your HPC where your STREAMLINE jobs will be scheduled to run

  • Format: (str)

  • Values: any viable str name for a queue you have access to at your institution

  • Tips: The default of "defq" is specific to our institution's HPC hardware/software, and may not be relevant to many users
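
The sketch below shows one way the three cluster-related parameters could be combined into a job description. It is a hypothetical illustration rather than STREAMLINE's code: the helper name is made up, the reserved_memory value of 4 is only an example, and 'UGE', 'SLURMOld', and 'LSFOld' are left out because the list above does not tie them to a single standard cluster class.

```python
# Cluster identifiers and class names copied from the list above.
CLUSTER_CLASS_NAMES = {
    'LSF': 'LSFCluster',
    'SLURM': 'SLURMCluster',
    'HTCondor': 'HTCondorCluster',
    'Moab': 'MoabCluster',
    'OAR': 'OARCluster',
    'PBS': 'PBSCluster',
    'SGE': 'SGECluster',
    'Local': 'LocalCluster',
}


def summarize_cluster_request(run_cluster='SLURM', reserved_memory=4, queue='defq'):
    """Hypothetical helper: describe the job settings implied by the parameters."""
    if run_cluster is False:
        return None  # not running on a cluster
    if run_cluster not in CLUSTER_CLASS_NAMES:
        raise ValueError(f"run_cluster value not covered by this sketch: {run_cluster!r}")
    return {
        'cluster_class': CLUSTER_CLASS_NAMES[run_cluster],
        'queue': queue,
        'memory': f'{reserved_memory}GB',  # reserved_memory is given in gigabytes
    }


print(summarize_cluster_request('SLURM', reserved_memory=4, queue='defq'))
print(summarize_cluster_request(False))
```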


Logging Parameters

verbose

  • Description: boolean flag telling STREAMLINE to send all print output and warnings to the command line output

  • Format: [Command Line Argument] just use flag (i.e. --verbose), [Configuration File] (bool)

  • Values: True or False

logging_level

  • Description: boolean flag telling STREAMLINE what logging level to use in the command line output

  • Format: [Command Line Argument] just use flag (i.e. --logging-level), [Configuration File] (bool)

  • Values: True or False


Guidelines for Setting Parameters

Ensuring Output Reproducibility

STREAMLINE is completely reproducible when the timeout parameter is set to None (and, when do_lcs_sweep is True, lcs_timeout is set to None as well). This also assumes that STREAMLINE is being run on the same datasets, with the same run parameters (including random_state).

When timeout is not set to None, STREAMLINE output can sometimes vary slightly (particularly when parallelized) since Optuna (for hyperparameter optimization) may not complete the same number of optimization trials within the user specified time limit on different computing resources.

However, having a timeout value specified helps ensure STREAMLINE run completion within a reasonable time frame.
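
The reproducibility conditions described for timeout and lcs_timeout can be expressed as a small check. The function below is a hypothetical sketch; the timeout and lcs_timeout defaults are the documented ones, while do_lcs_sweep = False as a default is an assumption based on the recommendation given for that parameter.

```python
def reproducibility_issues(timeout=900, do_lcs_sweep=False, lcs_timeout=1200):
    """List the settings that can prevent exactly reproducible output."""
    issues = []
    if timeout is not None:
        issues.append("timeout is set; use None for exact reproducibility")
    if do_lcs_sweep and lcs_timeout is not None:
        issues.append("lcs_timeout is set while do_lcs_sweep is True; use None")
    return issues


print(reproducibility_issues())              # default timeout is reported
print(reproducibility_issues(timeout=None))  # no issues
print(reproducibility_issues(timeout=None, do_lcs_sweep=True))
```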

Reducing Runtime and Memory Use

Conducting a more effective ML analysis typically demands a much larger amount of computing power and runtime. However, we provide general guidelines here for limiting overall runtime of a STREAMLINE experiment.

  1. Include fewer datasets in dataset_path at once.

  2. Run using fewer ML algorithms at once:

    • Naive Bayes, Logistic Regression, and Decision Trees are typically fastest.

    • Genetic Programming, eLCS, XCS, and ExSTraCS often take the longest (however other algorithms such as SVM, KNN, and ANN can take even longer when the number of instances is very large).

  3. Run using a smaller number of cv_partitions.

  4. Run without generating additional plots (see exclude_eda_output, export_hyper_sweep_plots, exclude_plots, exclude_rep_plots).

  5. In large datasets with missing values, set multi_impute to False. This will apply simple mean imputation to numerical features instead (saving computational time, memory and output file space).

  6. Set use_TURF to False. However, we strongly recommend setting this to True in feature spaces with more than 10,000 features in order to avoid missing feature interactions during feature selection.

  7. Set TURF_pct no lower than 0.5. Setting at 0.5 is by far the fastest, but it will operate more effectively in very large feature spaces when set lower.

  8. Set instance_subset at or below 2000 (speeds up multiSURF feature importance evaluation at potential expense of performance).

  9. Set max_features_to_keep at or below 2000 and filter_poor_features = True (this limits the maximum number of features that can be passed on to ML modeling).

  10. Set training_subsample at or below 2000 (this limits the number of instances used to train particularly expensive ML modeling algorithms). However, avoid setting this too low, or ML algorithms may not have enough training instances to learn effectively.

  11. Set n_trials and/or timeout to lower values (this limits the time spent on hyperparameter optimization).

  12. If using eLCS, XCS, or ExSTraCS, set do_lcs_sweep to False, lcs_iterations at or below 200000, and lcs_N at or below 2000.
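
The guidelines above can be collected into a single set of overrides, sketched here as a Python dictionary. The parameter names come from this page, but the specific values chosen for cv_partitions, n_trials, and timeout are illustrative assumptions only, and the dictionary name is hypothetical.

```python
# Illustrative "faster run" overrides assembled from the guidelines above.
fast_run_overrides = {
    'algorithms': ['NB', 'LR', 'DT'],   # typically the fastest algorithms
    'cv_partitions': 3,                 # illustrative: fewer CV partitions
    'export_hyper_sweep_plots': False,  # skip optional plots
    'exclude_plots': ['plot_ROC', 'plot_PRC', 'plot_FI_box', 'plot_metric_boxplots'],
    'multi_impute': False,              # simple mean imputation for numerical features
    'use_TURF': False,                  # keep True above 10,000 features
    'TURF_pct': 0.5,                    # only relevant if use_TURF is True
    'instance_subset': 2000,
    'max_features_to_keep': 2000,
    'filter_poor_features': True,
    'training_subsample': 2000,
    'n_trials': 50,                     # illustrative: fewer optimization trials
    'timeout': 300,                     # illustrative: shorter time limit (seconds)
    'do_lcs_sweep': False,
    'lcs_iterations': 200000,
    'lcs_N': 2000,
}

for name, value in fast_run_overrides.items():
    print(f"{name} = {value!r}")
```

Note that a non-None timeout trades exact reproducibility for a bounded run time, as discussed under Ensuring Output Reproducibility.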

Improving Modeling Performance

  • Generally speaking, the more computational time you are willing to spend on ML, the better the results. Doing the opposite of the above tips for reducing runtime will likely improve performance.

  • In certain situations, setting filter_poor_features to False, and relying on the ML algorithms alone to identify relevant features can possibly yield better performance. However, this may only be computationally practical when the total number of features in an original dataset is smaller (e.g. under 2000).

  • Note that eLCS, XCS, and ExSTraCS are newer algorithm implementations developed by our research group. As such, their algorithm performance may not yet be optimized in contrast to the other well established and widely utilized options. These learning classifier system (LCS) algorithms are unique however, in their ability to model very complex associations in data, while offering a largely interpretable model made up of simple, human readable IF:THEN rules. They have also been demonstrated to be able to tackle both complex feature interactions as well as heterogeneous patterns of association (i.e. different features are predictive in different subsets of the training data).

  • In problems with no noise (i.e. datasets where it is possible to achieve 100% testing accuracy), LCS algorithms (i.e. eLCS, XCS, and ExSTraCS) perform better when lcs_nu is set larger than 1 (5 or 10 is recommended). This applies significantly more pressure for individual rules to achieve perfect accuracy. In noisy problems this may lead to significant overfitting.

Other Guidelines

  • SVM and ANN modeling should only be applied when data scaling is applied by the pipeline.

  • Logistic Regression's baseline model feature importance estimation is determined by the exponential of the feature's coefficient. This should only be used if data scaling is applied by the pipeline. Otherwise use_uniform_fi should be True.

  • While STREAMLINE includes impute_data as an option that can be turned off in phase 2, most algorithm implementations (all those standard in scikit-learn) cannot handle missing data values, with the exception of eLCS, XCS, and ExSTraCS. In general, STREAMLINE is expected to fail with an error if run on data with missing values while impute_data is set to False.
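
The guidelines in this subsection can be turned into a quick pre-run check, sketched below. The function and its arguments scaling_applied and data_has_missing_values are hypothetical stand-ins for whatever settings and data checks apply to your run; they are not STREAMLINE parameter names.

```python
def guideline_warnings(algorithms, scaling_applied, use_uniform_fi,
                       impute_data, data_has_missing_values):
    """Return warnings for settings that conflict with the guidelines above."""
    warnings = []
    needs_scaling = sorted({'SVM', 'ANN'} & set(algorithms))
    if needs_scaling and not scaling_applied:
        warnings.append(f"{', '.join(needs_scaling)} should only be run "
                        "when data scaling is applied")
    if 'LR' in algorithms and not scaling_applied and not use_uniform_fi:
        warnings.append("LR coefficient-based feature importance needs data "
                        "scaling; otherwise set use_uniform_fi to True")
    if data_has_missing_values and not impute_data:
        warnings.append("data has missing values while impute_data is False; "
                        "STREAMLINE is expected to fail with an error")
    return warnings


# No conflicts: Naive Bayes only, imputation on, no scaling needed.
print(guideline_warnings(['NB'], False, False, True, False))
# Three conflicts: SVM without scaling, LR importance without scaling,
# and missing values without imputation.
for message in guideline_warnings(['SVM', 'LR'], False, False, False, True):
    print(message)
```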