Running STREAMLINE
This section details how to run STREAMLINE in any of its run modes. These include:
- Google Colab Notebook (run remotely on free Google Cloud resources)
  - Both an ‘easy’ and a ‘manual’ run mode are available for users to run their own data
- Jupyter Notebook (run locally on your PC)
- Command Line Interface (run locally or on a ‘dask-compatible’ CPU computing cluster)
While the notebooks only allow STREAMLINE to be run serially, it can be ‘embarrassingly’ parallelized when run from the command line in one of two ways:
- Local command line: basic CPU core parallelization
- CPU Computing Cluster: job submission parallelization
When run from the command line, STREAMLINE can be run in one of two ways:
- Using a Configuration File: run all, or any number of, phases using a single command that points to a ‘configuration file’ containing all necessary run parameters
- Using Command-Line Arguments: run all, or any number of, phases using command-line arguments
For more details and guidelines on selecting a run mode, see ‘Picking a Run Mode’.
All users may benefit from reviewing the ‘Guidelines for Setting Run Parameters’ section for tips on (1) ensuring reproducibility (2) reducing runtime, and (3) improving modeling performance. Details on the variety of outputs generated by STREAMLINE can be found in ‘Navigating STREAMLINE Output’.
Once you’ve completed the installation instructions for the run mode desired, follow the mode-specific directions below for running STREAMLINE.
Google Colab Notebook
This run mode is best for (1) easily trying out STREAMLINE on demonstration data, (2) running analyses on small datasets, or (3) educational purposes. Check out this tutorial to learn the basics of a Google Colab Notebook.
Below we first detail how to run the Colab Notebook on the included demonstration datasets, then how to adapt this notebook to run on your own dataset(s) as well as change STREAMLINE run parameters if desired.
Running the Demo (Colab)
The STREAMLINE Google Colab Notebook is set up to run a limited analysis applying all 9 phases of the pipeline. This includes 3-fold cross-validation, applying only three of the faster ML modeling algorithms to 2 example ‘target datasets’, and a ‘replication dataset’ relevant to only one of the target datasets. These datasets are detailed in Demonstration Data. This demo should take 6-7 minutes to run on Google Cloud, with results viewable in the notebook. With the user’s permission, the notebook will also automatically download the PDF summary reports and the zipped ‘experiment folder’ with all output files.
To run this demo, do the following:
1. Set up a Google account (if you don’t already have one). Click here for help.
2. Open the STREAMLINE Google Colab Notebook by clicking the link below: https://colab.research.google.com/drive/14AEfQ5hUPihm9JB2g730Fu3LiQ15Hhj2?usp=sharing
3. [Optional] Open the `Runtime` menu and select `Disconnect and delete runtime`. This clears the memory of the previous notebook run. This is only necessary when the underlying base code is modified, but it may be useful for troubleshooting if modifications to the notebook do not seem to have an effect.
4. Open the `Runtime` menu and select `Run all`. This will run all code cells of the notebook, i.e. all phases of STREAMLINE.
At this point the notebook will do the following automatically:
- Reserve a limited amount of free memory (RAM) and disk space on Google Cloud.
- Load the most recent STREAMLINE repository into memory from GitHub. The STREAMLINE release version is automatically indicated in the summary PDF reports.
- Install other necessary Python packages in the Google Colab environment.
- Run the entirety of STREAMLINE on the demonstration datasets.
- Download the ‘testing evaluation’ and ‘replication evaluation’ PDF summary reports automatically (Google will ask for user permission the first time).
- Download the zipped ‘experiment folder’ with all output files to your local computer.

See ‘Notebook Output’ for more on examining output within the notebook, and ‘Output Files’ for details on the output files generated.
Running Your Own Datasets (Colab)
Before running STREAMLINE on new data, make sure it adheres to ‘Input Data Requirements’. To update the STREAMLINE Colab Notebook to run on one or more user-specified ‘target datasets’, users can choose between an ‘easy’ and a ‘manual’ mode.
As above, begin by opening the STREAMLINE Google Colab Notebook by clicking the link below: https://colab.research.google.com/drive/14AEfQ5hUPihm9JB2g730Fu3LiQ15Hhj2?usp=sharing
Before running, update the run parameters within the ‘STREAMLINE RUN PARAMETERS’ section of the notebook as indicated below.
Note that, for brevity, some parameter names used in the notebook (used below) are slightly different from those used in the command line. Details on STREAMLINE run parameters are given here.
Easy Mode
This mode is most convenient if you want to run the notebook on other data, but would rather be prompted to enter/select essential parameter information than adjust parameters within the run parameter code cells. This mode prompts the user for essential ‘experiment’ and dataset-specific run parameter values. It is also convenient because it allows you to select datasets directly from your local computer rather than creating new folders within the temporary Colab Notebook workspace. All other non-essential run parameters need to be updated within the respective code cells.
1. In the first code cell, set `demo_run` = `False` and [`use_data_prompt`](parameters.md#use-data-prompt) = `True`.
   - This tells the notebook that you don’t want to run the demo datasets, and that you want to be ‘prompted’ to enter/select essential run parameters rather than edit the respective code cells.
2. [Optional] Update non-essential run parameters (within the respective code cells) to the user’s specifications.
   - Most commonly, this would include `n_splits`, `categorical_cutoff`, and `algorithms`.
   - We also strongly recommend specifying `categorical_feature_headers` and/or `quantitative_feature_headers` as lists of feature names in the dataset(s) headers that should be treated as either categorical or quantitative. If only one of these lists is specified, all features not in that list will be treated as the other feature type by default.
3. Open the `Runtime` menu and select `Run all`.
4. Reply to the prompts requesting the following essential parameter values:
   - `experiment_name` - a unique name for the output folder for the current STREAMLINE ‘experiment’
   - `data_path` - use the file navigation window to select the folder containing one or more ‘target datasets’ to be analyzed. These datasets must adhere to the formatting detailed in ‘Input Data Requirements’.
   - `class_label` - the header name for the outcome column in the dataset(s), e.g. ‘Class’
   - `instance_label` - the header name for the unique instance IDs in the dataset(s), or `None` if not relevant
   - `match_label` - the header name for the match/group column in the dataset(s), or `None` if not relevant
   - `applyToReplication` - indicate `True` or `False` as to whether ‘replication data’ is available for the replication phase
   - `rep_data_path` - use the file navigation window to select the folder containing one or more ‘replication datasets’ to be analyzed. All datasets in this folder should be replicates of a single ‘target dataset’, and must similarly adhere to formatting requirements.
   - `dataset_for_rep` - the filename (with extension) of the original ‘target dataset’, indicating which models the replication data will be applied to
After providing valid entries for these prompts, all phases of STREAMLINE will run in sequence within the notebook. STREAMLINE output files are automatically saved to the output ‘experiment folder’ named `UserOutput` within the temporary notebook workspace, and are optionally downloaded to the user’s computer after completion.
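The default feature-typing rule mentioned above (when only one of the two header lists is given, the remaining features fall to the other type) can be sketched as follows. This is an illustrative sketch, not STREAMLINE's actual implementation; the function and feature names are hypothetical.

```python
def infer_feature_types(all_features, categorical=None, quantitative=None):
    """Sketch of the default rule: if only one list is specified, all
    features not in that list are treated as the other feature type."""
    if categorical is not None and quantitative is None:
        quantitative = [f for f in all_features if f not in categorical]
    elif quantitative is not None and categorical is None:
        categorical = [f for f in all_features if f not in quantitative]
    return list(categorical or []), list(quantitative or [])

# Only the categorical list is given; 'Age' and 'BMI' default to quantitative.
cats, quants = infer_feature_types(["Age", "Sex", "BMI"], categorical=["Sex"])
```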
Manual Mode
1. In the first code cell, set `demo_run` = `False` and [`use_data_prompt`](parameters.md#use-data-prompt) = `False`.
   - This tells the notebook that you don’t want to run the demo datasets, but that you want to update all run parameters (essential and non-essential) within the respective code cells.
2. Click the ‘Files’ tab on the left side of the notebook (pictured as a blank folder), right-click the ‘content’ folder (i.e. the temporary Google Colab workspace), and create a ‘New Folder’ to contain your target dataset(s), called `UserData` (or some other name, if you also update the `data_path` parameter).
3. Save your formatted target dataset(s) within this folder.
   - Note: We recommend making sure that datasets run within Google Colab do not contain any sensitive or protected health information (PHI).
4. [Optional] Repeat steps 2-3 for any replication dataset(s) you wish to apply to the models trained for a specific ‘target dataset’.
   - If you have no replication data, make sure to update the `applyToReplication` parameter to `False`.
5. Update the (essential and non-essential) run parameter code cells to the dataset and user’s specifications.
   - Note: You can leave `output_path` as `/content/UserOutput`, and all output will be saved to this automatically created folder.
   - If you run more than one STREAMLINE ‘experiment’ in a single session, make sure to update `experiment_name` each time to avoid overwriting a prior experiment.
6. Open the `Runtime` menu and select `Run all`.
   - Note: Common errors preventing the notebook from running to completion include issues with file/path names, dataset formatting, or other incorrect changes to run parameter settings. The notebook includes comments after each run parameter indicating the format and value options for each.
Jupyter Notebook
This run mode is best for (1) confirming successful STREAMLINE installation for local computer use, (2) running STREAMLINE in a notebook on your own computer’s resources (generally faster than Colab Notebook), (3) running analyses on small to moderately sized datasets, (4) viewing output directly within a notebook, or (5) educational purposes. Click here to learn the basics of Jupyter Notebook.
Running STREAMLINE in Jupyter Notebook is largely the same as for running it in Google Colab. Below we specify how to run the Jupyter Notebook on the included demonstration datasets, then how to adapt it to run on your own dataset(s).
Running the Demo (Jupyter)
The STREAMLINE Jupyter Notebook is also set up to run a limited analysis applying all 9 phases of the pipeline. This includes 3-fold cross-validation, applying only three of the faster ML modeling algorithms to 2 example ‘target datasets’, and a ‘replication dataset’ relevant to only one of the target datasets. These datasets are detailed in Demonstration Data. This demo should take about 2-5 minutes to run (depending on your computer hardware), with results viewable in the notebook. The notebook will automatically save the ‘experiment folder’ (named `DemoOutput`) with all output files (including PDF reports).
1. From your command line, open Jupyter Notebook by typing `jupyter notebook`.
2. Within the Jupyter local file browser that opens, navigate into the previously saved/installed `STREAMLINE` directory, where you will find the file named `STREAMLINE-Notebook.ipynb`.
3. Click to open `STREAMLINE-Notebook.ipynb` as a Jupyter Notebook in a new page in your web browser.
4. Open the `Kernel` menu and select `Restart & Run All`. This will run all code cells of the notebook, i.e. all phases of STREAMLINE.
At this point the notebook will do the following automatically:
- Run the entirety of STREAMLINE on the demonstration datasets.
- Save all output files (including PDF reports) as an ‘experiment folder’ named `DemoOutput` within the `STREAMLINE` directory.

See ‘Notebook Output’ for more on examining output within the notebook, and ‘Output Files’ for details on the output files generated.
Running Your Own Datasets (Jupyter)
Begin by opening `STREAMLINE-Notebook.ipynb` as a Jupyter Notebook (steps 1-3 above). Before running, update the run parameters within the ‘STREAMLINE RUN PARAMETERS’ section of the notebook as indicated below.
Note that, for brevity, some parameter names used in the notebook (used below) are slightly different from those used in the command line. Details on STREAMLINE run parameters are given here.
1. In the first code cell, set [`demo_run`](parameters.md#demo-run) = `False`.
   - This tells the notebook that you don’t want to run the demo datasets, and that you will instead specify run parameters within the respective code cells.
2. Update essential run parameters (within the respective code cells) to the user/dataset’s specifications:
   - `experiment_name` - a unique name for the output folder for the current STREAMLINE ‘experiment’
   - `data_path` - path to the folder containing one or more ‘target datasets’ to be analyzed. These datasets must adhere to the formatting detailed in ‘Input Data Requirements’.
   - `output_path` - path to the folder (automatically created if it doesn’t yet exist) in which the ‘experiment folder’ including all STREAMLINE output will be saved. Note: You can leave `output_path` as `./UserOutput`, and output will be saved to this automatically created `UserOutput` folder.
   - `class_label` - the header name for the outcome column in the dataset(s), e.g. ‘Class’
   - `instance_label` - the header name for the unique instance IDs in the dataset(s), or `None` if not relevant
   - `match_label` - the header name for the match/group column in the dataset(s), or `None` if not relevant
   - `ignore_features` - a list of text-valued feature names in the target datasets that you want STREAMLINE to drop from the analysis, or `None` if not relevant
   - `categorical_feature_headers` - a list of text-valued feature names in the dataset(s) headers that should be treated as categorical, or `None` if `quantitative_feature_headers` was specified and you want all other features to be treated as categorical, or if you want feature types to be decided automatically using `categorical_cutoff`
   - `quantitative_feature_headers` - a list of text-valued feature names in the dataset(s) headers that should be treated as quantitative, or `None` if `categorical_feature_headers` was specified and you want all other features to be treated as quantitative, or if you want feature types to be decided automatically using `categorical_cutoff`
   - `applyToReplication` - indicate `True` or `False` as to whether ‘replication data’ is available for the replication phase
   - `rep_data_path` - path to the folder containing one or more ‘replication datasets’ to be analyzed. All datasets in this folder should be replicates of a single ‘target dataset’, and must similarly adhere to formatting requirements.
   - `dataset_for_rep` - path to the file (with extension) of the original ‘target dataset’, indicating which models the replication data will be applied to
3. [Optional] Update non-essential run parameters (within the respective code cells) to the user’s specifications.
   - Most commonly, this would include `n_splits`, `categorical_cutoff`, and `algorithms`.
4. Open the `Kernel` menu and select `Restart & Run All`. This will run all code cells of the notebook, i.e. all phases of STREAMLINE.
Note: It can take multiple hours or longer to run this notebook on larger datasets and/or using all machine learning modeling algorithms. We recommend using a computing cluster for such tasks if possible.
Command Line Interface
This run mode is best for (1) most efficiently running STREAMLINE with parallelization options, (2) users comfortable with command lines, or (3) running moderate to large datasets and/or more exhaustive run parameter configurations.
Running STREAMLINE from the command line can be done locally (with or without CPU core parallelization), or on a dask-compatible CPU computing cluster. Any of these scenarios can be run from a single command (i.e. all phases at once) using a ‘configuration file’, or separately, one phase at a time. Below we indicate how to run all of these possible command line configurations using the demonstration datasets as an example. As with the Google Colab and Jupyter Notebook run modes, to run STREAMLINE on datasets other than the demonstration datasets, essential run parameters should be specified/updated accordingly. STREAMLINE run parameters specified in a configuration file or as command line arguments have slightly different names, as detailed in the run parameters section.
Locally
This section explains running STREAMLINE locally using the command line interface.
Using a Configuration File (Locally)
All phases of STREAMLINE can be run (in sequence) with a single command by editing and calling an associated configuration file (`run_configs/local.cfg`) as indicated below.
Note: This approach also allows users to run any subset of sequential STREAMLINE phases (e.g. Phase 1 alone for EDA, or Phases 1-4 for EDA, data processing, and feature selection) using the different ‘phases to run’ flags within the configuration file.
1. Open your command line interface and navigate to the installed `STREAMLINE` directory.
2. To run the demonstration datasets, skip to step 5. To view the pre-specified configuration file, click here.
3. Assuming you want to run your own dataset(s), navigate into the `run_configs` folder and open `local.cfg` in a text editor to update the essential and non-essential run parameters accordingly (see run parameters and terminal text editors for help).
   - Under ‘phases to run’, the `do_till_report` parameter (when set to `True`) will automatically run all phases up until `do_replicate` by default. `do_replicate`, `do_rep_report`, and `do_cleanup` must each be specified individually.
   - To run a subset of phases (e.g. phases 1-4), set `do_till_report` = `False`, set `do_eda`, `do_dataprep`, `do_feat_imp`, and `do_feat_sel` each to `True`, and set the other ‘do’ phases to `False`.
   - Make sure to keep [`run_cluster`](parameters.md#run-cluster) = `False`, which tells STREAMLINE to run locally rather than on a CPU computing cluster.
   - Optionally set [`run_parallel`](parameters.md#run-parallel) = `False`, which will turn off local multi-core CPU parallelization.
4. Navigate back to the `STREAMLINE` base directory.
5. Run the following command within the `STREAMLINE` base directory:
python run.py -c run_configs/local.cfg
Note: You can save your own `.cfg` files to call with this command. We recommend copying, renaming, and editing `local.cfg`, and then calling the new configuration file as an argument to `run.py`.
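To orient new users, a minimal sketch of what the ‘phases to run’ and local-parallelization settings discussed above might look like inside such a `.cfg` file is shown below. The option names follow the parameters described in this section, but the fragment is illustrative only; consult the bundled `run_configs/local.cfg` for the authoritative layout and the full set of options.

```ini
; Illustrative fragment only -- see run_configs/local.cfg for the real file.

; multiprocessing settings (run locally, with core parallelization)
run_cluster = False
run_parallel = True

; 'phases to run' flags: run everything up to the replication phase,
; then the replication phase, replication report, and cleanup individually
do_till_report = True
do_replicate = True
do_rep_report = True
do_cleanup = True
```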
Using Command-Line Arguments (Locally)
STREAMLINE phases can also be called individually from the command line without a configuration file (instead specifying run parameters as arguments). This can be helpful, in particular, if you want to run a big analysis, and would like to look at the output of phases along the way without committing to running the whole pipeline upfront. Similar to any other run mode, make sure to specify arguments for all ‘essential’ run parameters for a given dataset.
Note: Command line run parameters have slightly different identifiers than for the configuration file (see run parameters)
Note: Any unspecified non-essential run parameters will be assigned their default values for a given STREAMLINE run
- Make sure to specify `--run-cluster` = `False`, which tells STREAMLINE to run locally rather than on a CPU computing cluster.
- Optionally specify `--run-parallel` = `False`, which will turn off local multi-core CPU parallelization.
Note: When specifying `--fi`, `--cf`, or `--qf` using this run approach, it is necessary to pass a file path to a `.csv` file containing a list of feature names for that parameter, rather than directly listing the feature names. We use this approach in the examples below using the `.csv` files found in `STREAMLINE/data/DemoFeatureTypes`.
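For orientation, such a feature-type `.csv` is nothing more than a list of column header names. The sketch below uses hypothetical feature names, written as a single comma-separated row; check the actual demo files in `STREAMLINE/data/DemoFeatureTypes` (e.g. `hcc_cat_feat.csv`) for the exact layout STREAMLINE expects.

```
Sex,TumorStage,SmokingStatus
```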
Open your command line interface and navigate to the installed `STREAMLINE` directory. The subsections below provide different example scenarios running `run.py` on the demonstration datasets. These scenarios run STREAMLINE similarly to the other demo run mode examples above, but we set `--run-cluster` = `False` (necessary) and optionally `--run-parallel` = `True` for each example.
All Phases at Once (Replication Data Included)
python run.py --do-till-report --do-rep-report --do-clean --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --algorithms=NB,LR,DT --do-replicate --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster False --run-parallel True
All Main Phases at Once (No Replication Data)
python run.py --do-till-report --do-clean --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --algorithms=NB,LR,DT --run-cluster False --run-parallel True
One Phase at a Time
The following commands can be run one after the other (in sequence), waiting for the previous command to complete.
Phase 1 - Data Exploration & Processing:
python run.py --do-eda --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --run-cluster False --run-parallel True
Phase 2 - Imputation and Scaling:
python run.py --do-dataprep --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 3 - Feature Importance Estimation
python run.py --do-feat-imp --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 4 - Feature Selection
python run.py --do-feat-sel --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 5 - Machine Learning (ML) Modeling
python run.py --do-model --out-path DemoOutput --exp-name demo_experiment --algorithms NB,LR,DT --run-cluster False --run-parallel True
Phase 6 - Post-Analysis
python run.py --do-stats --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 7 - Compare Datasets
If there is only one ‘target dataset’ in the given analysis, skip this command.
python run.py --do-compare-dataset --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 8 - Replication
If there are no replication datasets, skip this command. If you have multiple ‘target datasets’, each with one or more associated replication datasets, run this command once for each original target dataset (updating `--rep-path` and `--dataset` for each).
python run.py --do-replicate --out-path DemoOutput --exp-name demo_experiment --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster False --run-parallel True
Phase 9 - Summary Report(s)
Run the following command to generate the main PDF report (summarizing testing data evaluations of the models).
python run.py --do-report --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
If the models of a STREAMLINE experiment were applied to replication data in Phase 8, you can generate a report for the replication of a single target dataset using the following command. If you have multiple ‘target datasets’, each with one or more associated replication datasets, run this command once for each original target dataset (updating `--rep-path` and `--dataset` for each).
python run.py --do-rep-report --out-path DemoOutput --exp-name demo_experiment --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster False --run-parallel True
Optional Clean-up
python run.py --do-clean --out-path DemoOutput --exp-name demo_experiment --del-time --del-old-cv --run-cluster False --run-parallel True
CPU Computing Cluster
This section explains running STREAMLINE remotely on a dask-compatible CPU computing cluster (i.e. HPC).
Helpful Tools
First let’s discuss the role of a couple helpful tools mentioned in the installation section for running STREAMLINE on a computing cluster.
nano
GNU nano is a text editor for Unix-like operating systems that uses a command line interface. It is especially handy for editing the configuration file through an SSH terminal when running STREAMLINE with a configuration file.
A detailed guide can be found here
In short, you can edit a configuration file such as `run_configs/cedars.cfg` using the following steps:
1. Go to the root `STREAMLINE` folder.
2. Type `nano run_configs/cedars.cfg` in the terminal to open the file in nano.
3. Edit the configuration file as needed.
4. Press `Ctrl + X` to close the file and `Y` to save the changes.
tmux
tmux is a terminal multiplexer/emulator. It lets you switch easily between several programs in one terminal, detach them (they keep running in the background) and reattach them to a different terminal. This is particularly important when you want to run all phases of STREAMLINE automatically from a single command. To achieve this, STREAMLINE runs a script on the head node (i.e. job submission node) that monitors phase completion and submits new jobs for the next phase. A terminal emulator allows you to start a full pipeline run and close the window without killing the job.
A detailed guide on using it can be found here
In short, you can open a new terminal session that will stay open even if you disconnect and close your terminal, using the following steps:
1. Go to the root `STREAMLINE` folder.
2. Type and run `tmux new -s mysession`.
3. Open the required configuration file using nano (e.g. `run_configs/cedars.cfg`).
4. Make the necessary changes in the configuration file.
5. Press `Ctrl + X` to close the file and `Y` to save the changes.
6. Run the required commands.
7. Press `Ctrl + b` and then the `d` key to detach from the session.
Cluster-Specific Run Parameters
You should be aware of 3 cluster-specific run parameters:
- `run-cluster`: flag for the type of cluster, discussed in detail below (when not `False`, this overrides any value specified for `run-parallel`)
- `queue`: the partition queue used for job submissions
- `reserved_memory`: memory (in GB) reserved per job
Run Cluster Parameter
The `run-cluster` parameter is the most important parameter here. It should be set to `False` when running locally; to use a cluster, specify a string value for the cluster type. Currently, clusters supported by dask-jobqueue are supported, with the following setting options for `run-cluster`:
- `LSF`: LSFCluster
- `SLURM`: SLURMCluster
- `HTCondor`: HTCondorCluster
- `Moab`: MoabCluster
- `OAR`: OARCluster
- `PBS`: PBSCluster
- `SGE`: SGECluster
- `UGE`: SGECluster variant used at our institution
- `Local`: LocalCluster
Additionally, the earlier/legacy method of STREAMLINE manual job submission is supported for `SLURM` and `LSF` using the string values below for `run-cluster`. This will generate and submit jobs using shell files, as was done in STREAMLINE release 0.2.5 and earlier. This legacy option ensures that minimal memory/computation is used on the head node (i.e. the job-submit node).
- `SLURMOld`: legacy job submission for SLURMCluster
- `LSFOld`: legacy job submission for LSFCluster
Queue and Memory Parameters
Check with your cluster administrator on how to set these cluster-specific parameters. We have set defaults for these parameters for use on our own institution’s HPC (i.e. `queue` = `defq` and `reserved_memory` = `4`).
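Putting the three cluster-specific parameters together, a sketch of the corresponding configuration-file lines might look as follows. The values are our institution’s defaults from above, and the enclosing section layout is omitted here; check the bundled `run_configs/cedars.cfg` for exact placement.

```ini
; Illustrative fragment -- confirm the queue name and memory with your
; cluster administrator; see run_configs/cedars.cfg for the real file.
run_cluster = SLURM
queue = defq
reserved_memory = 4
```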
Using a Configuration File (Cluster)
This is largely the same as running STREAMLINE from a configuration file locally, with the addition of three cluster-specific parameters (`run_cluster`, `queue`, and `reserved_memory`).
1. Open your command line interface within your HPC and navigate to the installed `STREAMLINE` directory.
2. Edit any run parameters within a configuration file according to your needs (making sure to update `run_cluster`, `queue`, and `reserved_memory` within the multiprocessing section).
   - We have included example configuration files set up to run the demonstration datasets on three different clusters we utilize (i.e. `cedars.cfg`, `cedars_old.cfg`, and `upenn.cfg`), using `SLURM`, `UGE`, and `LSF`, respectively. We will focus here on `SLURM` as an example, with the respective configuration file (`cedars.cfg`) found here.
3. Run the following command within the `STREAMLINE` base directory:
python run.py -c run_configs/cedars.cfg
Note: The configuration filename and location can be anything, as long as it is a valid configuration file set up to run on your dask-compatible cluster.
Note: When using this run strategy, it is strongly recommended to use a terminal emulator, and to check with your cluster administrator that your system allows lightweight, longer-duration code to be run from the head node, which monitors job completion and can submit new jobs. If not, we recommend running STREAMLINE one phase at a time in legacy mode (i.e. `run-cluster` = `SLURMOld` or `LSFOld`), which only uses the head node to submit jobs.
Using Command-Line Arguments (Cluster)
This is largely the same as running STREAMLINE using command-line arguments locally, with the addition of three cluster-specific parameters (`--run-cluster`, `--queue`, and `--res-mem`); users can ignore `--run-parallel`.
Open your command line interface within your HPC and navigate to the installed `STREAMLINE` directory. The subsections below provide different example scenarios running `run.py` on the demonstration datasets; users can adjust these arguments for their own data. Also, here `--run-parallel` is automatically overridden when `--run-cluster` is set to `SLURM` (or any other cluster name, i.e. anything other than `False`).
Note: Any unspecified non-essential run parameters will be assigned their default values for a given STREAMLINE run
Note: When specifying `--fi`, `--cf`, or `--qf` using this run approach, it is necessary to pass a file path to a `.csv` file containing a list of feature names for that parameter, rather than directly listing the feature names. We use this approach in the examples below using the `.csv` files found in `STREAMLINE/data/DemoFeatureTypes`.
All Phases at Once (Replication Data Included)
Notice: this approach will run a lightweight script on the head node that monitors job completion and submits jobs for subsequent phases until all phases are complete.
python run.py --do-till-report --do-rep-report --do-clean --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --algorithms=NB,LR,DT --do-replicate --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster SLURM --res-mem 4 --queue defq
All Main Phases at Once (No Replication Data)
Notice: this approach will run a lightweight script on the head node that monitors job completion and submits jobs for subsequent phases until all phases are complete.
python run.py --do-till-report --do-clean --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --algorithms=NB,LR,DT --run-cluster SLURM --res-mem 4 --queue defq
One Phase at a Time
The following commands can be run one after the other (in sequence), waiting for all jobs of the previous phase to complete successfully. For minimal head node overhead, we recommend running these jobs using `--run-cluster` = `SLURMOld` or `LSFOld` rather than `SLURM` (if available to you).
Phase 1 - Data Exploration & Processing:
python run.py --do-eda --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --run-cluster SLURM --res-mem 4 --queue defq
Phase 2 - Imputation and Scaling:
python run.py --do-dataprep --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 3 - Feature Importance Estimation
python run.py --do-feat-imp --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 4 - Feature Selection
python run.py --do-feat-sel --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 5 - Machine Learning (ML) Modeling
python run.py --do-model --out-path DemoOutput --exp-name demo_experiment --algorithms NB,LR,DT --run-cluster SLURM --res-mem 4 --queue defq
Phase 6 - Post-Analysis
python run.py --do-stats --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 7 - Compare Datasets
If there is only one ‘target dataset’ in the given analysis, skip this command.
python run.py --do-compare-dataset --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 8 - Replication
If there are no replication datasets, skip this command. If you have multiple ‘target datasets’ each with one or more associated replication datasets, run this command once for each original target dataset (updating --rep-path
and --dataset
for each).
python run.py --do-replicate --out-path DemoOutput --exp-name demo_experiment --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster SLURM --res-mem 4 --queue defq
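When several target datasets each have replication data, the per-dataset Phase 8 commands above can be scripted rather than typed by hand. The sketch below is illustrative only: the dataset/replication-path pairing table holds hypothetical placeholders (here just the demo pair), and the run_phase8 helper is not part of STREAMLINE itself.

```shell
# Sketch: run Phase 8 once per target dataset that has replication data.
# The dataset|replication-path pairs in the here-document are illustrative
# placeholders; replace them with your own target datasets.
run_phase8() {
  local runner="${1:-python}"   # pass "echo" here to preview commands (dry run)
  while IFS='|' read -r dataset rep_path; do
    "$runner" run.py --do-replicate --out-path DemoOutput --exp-name demo_experiment \
      --rep-path "$rep_path" --dataset "$dataset" \
      --run-cluster SLURM --res-mem 4 --queue defq
  done <<'EOF'
./data/DemoData/hcc_data_custom.csv|./data/DemoRepData
EOF
}
```

Calling run_phase8 submits the jobs, while run_phase8 echo prints each assembled command instead, which is useful for checking the pairings before submitting to the cluster.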
Phase 9 - Summary Report(s)
Run the following command to generate the main PDF report (summarizing testing data evaluations of the models).
python run.py --do-report --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
If the models of a STREAMLINE experiment were applied to replication data in phase 8 you can generate a report for the replication of a single target dataset using the following command. If you have multiple ‘target datasets’ each with one or more associated replication datasets, run this command once for each original target dataset (updating --rep-path
and --dataset
for each).
python run.py --do-rep-report --out-path DemoOutput --exp-name demo_experiment --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster SLURM --res-mem 4 --queue defq
Optional Clean-up
python run.py --do-clean --out-path DemoOutput --exp-name demo_experiment --del-time --del-old-cv --run-cluster SLURM --res-mem 4 --queue defq
Checking STREAMLINE Job Completion
Whether running STREAMLINE from the configuration file or using command-line arguments, users may wish to check the job completion status for jobs within a given phase. For example: (1) if running STREAMLINE one phase at a time, users will want to ensure that all jobs of the current phase have completed before initiating the next, or (2) if the modeling phase is taking a long time to complete, they may wish to know which algorithms are still training. This can be accomplished using the included checker.py script.
First, make sure you are in the installed STREAMLINE
directory. Then you can check the parameter options of checker.py
with:
python checker.py --help
As an example, let’s say the user wants to check the status of modeling jobs on their cluster during phase 5. They could run the following command, which assumes the demonstration data is being run as described above.
python checker.py --out-path DemoOutput --exp-name demo_experiment --phase 5 --count-only True
This would return the number of STREAMLINE jobs that have not yet completed (or failed to run) within phase 5.
Alternatively, the command below would output the names of the jobs that have not completed, which (in the case of phase 5) would inform the user which algorithms were still running.
python checker.py --out-path DemoOutput --exp-name demo_experiment --phase 5
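When running one phase at a time, the count returned by checker.py can be polled in a loop so the next phase is only launched once the count reaches zero. The helper below is a sketch, not part of STREAMLINE: it assumes the checker command's output contains the incomplete-job count as its last number, so adjust the parsing if checker.py formats its output differently.

```shell
# Sketch: block until a phase's incomplete-job count reaches zero.
# Assumes the checker command's output ends with the count as a bare number;
# adjust the grep/tail parsing if checker.py's output format differs.
wait_for_phase() {
  local check_cmd="$1" interval="${2:-60}" remaining
  while :; do
    # Extract the last number from the checker's output as the remaining count.
    remaining=$(eval "$check_cmd" | grep -oE '[0-9]+' | tail -n 1)
    [ "${remaining:-0}" -eq 0 ] && break
    echo "Waiting on $remaining incomplete job(s)..."
    sleep "$interval"
  done
}

# Example usage (polls phase 5 every 60 seconds before moving to phase 6):
# wait_for_phase 'python checker.py --out-path DemoOutput --exp-name demo_experiment --phase 5 --count-only True' 60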
Picking a Run Mode
Why run STREAMLINE on Google Colab?
Running STREAMLINE on Google Colab is best for:
Running the STREAMLINE demonstration on the included demo data
Users with little to no coding experience
Users that want the quickest/easiest approach to running STREAMLINE
Users that do not have access to a very powerful computer or compute cluster.
Applying STREAMLINE to smaller-scale analyses (in particular when only using free/limited Google Cloud resources):
Smaller datasets (e.g. < 500 instances and features)
A small number of total datasets (e.g. 1 or 2)
Only using the simplest/quickest modeling algorithms (e.g. Naive Bayes, Decision Trees, Logistic Regression)
Only using 1 or 2 modeling algorithms
Google Colab Notebook - on free Google Cloud resources [Anyone can run]:
Advantages
No coding or PC environment experience needed
Automatically installs and uses the most recent version of STREAMLINE
Computing can be performed directly on Google Cloud from anywhere
One-click run of whole pipeline (all phases)
Offers in-notebook viewing of results and ability to save notebook as documentation of analysis
Allows easy customizability of nearly all aspects of the pipeline with minimal coding/environment experience
Disadvantages:
Can only run pipeline serially
Slowest of the run options
Limited by Google Cloud computing allowances (may only work for smaller datasets)
Notes: Requires a Google account (free)
Jupyter Notebook - locally [Basic experience]:
Advantages:
Does not rely on free computing limitations of Google Cloud (but rather your own computer’s limitations)
One-click run of whole pipeline (all phases)
Offers in-notebook viewing of results and ability to save notebook as documentation of analysis
Allows easy customizability of all aspects of the pipeline with minimal coding/environment experience
Disadvantages:
Can only run pipeline serially
Slower runtime than from command-line
Beginners have to set up their computing environment
Notes: Requires Anaconda3, Python3, and several other minor Python package installations
Command Line (Local) [Command-line Users]:
Advantages:
Typically runs faster than within Jupyter Notebook
A more versatile option for those with command-line experience
One-command run of whole pipeline available when using a configuration file to run
Can optionally run the pipeline one phase at a time
Disadvantages:
Can only run pipeline serially or with limited local CPU core parallelization
Command-line experience recommended
Notes: Requires Anaconda3, Python3, and several other minor Python package installations
Command Line (HPC Cluster) [Computing Cluster Users]:
Advantages:
By far the fastest, most efficient way to run STREAMLINE
Offers the ability to run STREAMLINE on 7 types of HPC systems
One-command run of whole pipeline available when using a configuration file to run
Can optionally run the pipeline one phase at a time
Disadvantages:
Experience with command-line and dask-compatible clusters recommended
Access to a computing cluster required
Notes: Requires Anaconda3, Python3, and several other minor Python package installations. Cluster runs of STREAMLINE were set up using dask-jobqueue and thus should support the 7 cluster types described in the dask documentation. Currently, we have only directly tested STREAMLINE on SLURM and LSF clusters; further codebase adaptation may be needed for cluster types not listed at the link above.