Running STREAMLINE
This section details how to run STREAMLINE in any of its run modes. These include:
- Google Colab Notebook (run remotely on free Google Cloud resources)
  - Both an ‘easy’ and a ‘manual’ run mode are available for users to run their own data
- Jupyter Notebook (run locally on your PC)
- Command Line Interface (run locally or on a ‘dask-compatible’ CPU computing cluster)
While the notebooks only allow STREAMLINE to be run serially, it can be ‘embarrassingly’ parallelized when run from the command line in one of two ways:
- Local command line: basic CPU core parallelization
- CPU Computing Cluster: job submission parallelization
When run from the command line, STREAMLINE can be run in one of two ways:
- Using a Configuration File: run all, or any number of, phases using a single command that points to a ‘configuration file’ containing all necessary run parameters
- Using Command-Line Arguments: run all, or any number of, phases using command-line arguments
For more details and guidelines on selecting a run mode, see ‘Picking a Run Mode’.
All users may benefit from reviewing the ‘Guidelines for Setting Run Parameters’ section for tips on (1) ensuring reproducibility (2) reducing runtime, and (3) improving modeling performance. Details on the variety of outputs generated by STREAMLINE can be found in ‘Navigating STREAMLINE Output’.
Once you’ve completed the installation instructions for the run mode desired, follow the mode-specific directions below for running STREAMLINE.
Google Colab Notebook
This run mode is best for (1) easily trying out STREAMLINE on demonstration data, (2) running analyses on small datasets, or (3) educational purposes. Check out this tutorial to learn the basics of a Google Colab Notebook.
Below we first detail how to run the Colab Notebook on the included demonstration datasets, then how to adapt this notebook to run on your own dataset(s) as well as change STREAMLINE run parameters if desired.
Running the Demo (Colab)
The STREAMLINE Google Colab Notebook is set up to run a limited analysis applying all 9 phases of the pipeline. This includes 3-fold cross-validation, applying only three of the faster ML modeling algorithms to 2 example ‘target datasets’, and a ‘replication dataset’ relevant to only one of the target datasets. These datasets are detailed in Demonstration Data. This demo should take 6-7 minutes to run on Google Cloud, with results viewable in the notebook. With the user’s permission, the notebook will also automatically download the PDF summary reports and the zipped ‘experiment folder’ with all output files.
To run this demo, do the following:
1. Set up a Google account (if you don’t already have one). Click here for help.
2. Open the STREAMLINE Google Colab Notebook by clicking the link below: https://colab.research.google.com/drive/14AEfQ5hUPihm9JB2g730Fu3LiQ15Hhj2?usp=sharing
3. [Optional] Open the `Runtime` menu and select `Disconnect and delete runtime`. This clears the memory of the previous notebook run. This is only necessary when the underlying base code is modified, but it may be useful for troubleshooting if modifications to the notebook do not seem to have an effect.
4. Open the `Runtime` menu and select `Run all`. This will run all code cells of the notebook, i.e. all phases of STREAMLINE.
At this point the notebook will do the following automatically:
- Reserve a limited amount of free memory (RAM) and disk space on Google Cloud.
- Load the most recent STREAMLINE repository into memory from GitHub. The STREAMLINE release version is automatically indicated in the summary PDF reports.
- Install other necessary Python packages in the Google Colab environment.
- Run the entirety of STREAMLINE on the demonstration datasets.
- Download the ‘testing evaluation’ and ‘replication evaluation’ PDF summary reports automatically (Google will ask for user permission the first time).
- Download the zipped ‘experiment folder’ with all output files to your local computer.

See ‘Notebook Output’ for more on examining output within the notebook, and ‘Output Files’ for details on the output files generated.
Running Your Own Datasets (Colab)
Before running STREAMLINE on new data, make sure it adheres to ‘Input Data Requirements’. To update the STREAMLINE Colab Notebook to run on one or more user-specified ‘target datasets’, users can choose between an ‘easy’ and a ‘manual’ mode.
As above, begin by opening the STREAMLINE Google Colab Notebook by clicking the link below: https://colab.research.google.com/drive/14AEfQ5hUPihm9JB2g730Fu3LiQ15Hhj2?usp=sharing
Before running, update the run parameters within the ‘STREAMLINE RUN PARAMETERS’ section of the notebook as indicated below.
Note that, for brevity, some parameter names used in the notebook (used below) are slightly different from those used in the command line. Details on STREAMLINE run parameters are given here.
Easy Mode
This mode is most convenient if you want to run the notebook on other data, but would rather be prompted to enter/select essential parameter information than adjust parameters within the run parameter code cells. This mode prompts the user for essential ‘experiment’ and dataset-specific run parameter values. It is also convenient because it allows you to select datasets directly from your local computer rather than creating new folders within the temporary Colab Notebook workspace. All other non-essential run parameters need to be updated within the respective code cells.
1. In the first code cell, set `demo_run` = `False` and [`use_data_prompt`](parameters.md#use-data-prompt) = `True`.
   - This tells the notebook that you don’t want to run the demo datasets, and that you want to be ‘prompted’ to enter/select essential run parameters rather than edit the respective code cells.
2. [Optional] Update non-essential run parameters (within the respective code cells) to the user’s specifications.
   - Most commonly, this would include `n_splits`, `categorical_cutoff`, and `algorithms`.
   - We also strongly recommend specifying `categorical_feature_headers` and/or `quantitative_feature_headers` as lists of feature names in the dataset(s) headers that should be treated as either categorical or quantitative. If only one of these lists is specified, all features not in that list will be treated as the other feature type by default.
3. Open the `Runtime` menu and select `Run all`.
4. Reply to the prompts requesting the following essential parameter values:
   - `experiment_name` - a unique name for the output folder for the current STREAMLINE ‘experiment’
   - `data_path` - use the file navigation window to select the folder containing one or more ‘target datasets’ to be analyzed. These datasets must adhere to the formatting detailed in ‘Input Data Requirements’.
   - `class_label` - the header name for the outcome column in the dataset(s), e.g. ‘Class’
   - `instance_label` - the header name for the unique instance IDs in the dataset(s), or `None` if not relevant
   - `match_label` - the header name for the match/group column in the dataset(s), or `None` if not relevant
   - `applyToReplication` - indicate `True` or `False` as to whether ‘replication data’ is available for the replication phase
   - `rep_data_path` - use the file navigation window to select the folder containing one or more ‘replication datasets’ to be analyzed. All datasets in this folder should be replicates of a single ‘target dataset’, and must similarly adhere to formatting requirements.
   - `dataset_for_rep` - the filename (with extension) of the original ‘target dataset’, indicating which models the replication data will be applied to
After providing valid entries for these prompts, all phases of STREAMLINE will run in sequence within the notebook. STREAMLINE output files are automatically saved to the output ‘experiment folder’ named `UserOutput` within the temporary notebook workspace, and are optionally downloaded to the user’s computer after completion.
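The default feature-typing rule mentioned above (when only one of the two header lists is given, the remaining features fall to the other type) can be sketched as follows. This is an illustrative sketch, not STREAMLINE's actual implementation; the function and feature names are hypothetical.

```python
def infer_feature_types(all_features, categorical=None, quantitative=None):
    """Sketch of the default rule: if only one list is specified, all
    features not in that list are treated as the other feature type."""
    if categorical is not None and quantitative is None:
        quantitative = [f for f in all_features if f not in categorical]
    elif quantitative is not None and categorical is None:
        categorical = [f for f in all_features if f not in quantitative]
    return list(categorical or []), list(quantitative or [])

# Only the categorical list is given; 'Age' and 'BMI' default to quantitative.
cats, quants = infer_feature_types(["Age", "Sex", "BMI"], categorical=["Sex"])
```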
Manual Mode
1. In the first code cell, set `demo_run` = `False` and [`use_data_prompt`](parameters.md#use-data-prompt) = `False`.
   - This tells the notebook that you don’t want to run the demo datasets, but that you want to update all run parameters (essential and non-essential) within the respective code cells.
2. Click the ‘Files’ tab on the left side of the notebook (pictured as a blank folder), right-click the ‘content’ folder (i.e. the temporary Google Colab workspace), and create a ‘New Folder’ to contain your target dataset(s), called `UserData` (or some other name, if you also update the `data_path` parameter).
3. Save your formatted target dataset(s) within this folder.
   - Note: We recommend making sure that datasets run within Google Colab do not contain any sensitive or protected health information (PHI).
4. [Optional] Repeat steps 2-3 for any replication dataset(s) you wish to apply to the models trained for a specific ‘target dataset’.
   - If you have no replication data, make sure to update the `applyToReplication` parameter to `False`.
5. Update the (essential and non-essential) run parameter code cells to the dataset and user’s specifications.
   - Note: You can leave `output_path` as `/content/UserOutput`, and all output will be saved to this automatically created folder.
   - If you run more than one STREAMLINE ‘experiment’ in a single session, make sure to update `experiment_name` each time to avoid overwriting a prior experiment.
6. Open the `Runtime` menu and select `Run all`.
   - Note: Common errors preventing the notebook from running to completion include issues with file/path names, dataset formatting, or other incorrect changes to run parameter settings. The notebook includes comments after each run parameter indicating the format and value options for each.
Jupyter Notebook
This run mode is best for (1) confirming successful STREAMLINE installation for local computer use, (2) running STREAMLINE in a notebook on your own computer’s resources (generally faster than Colab Notebook), (3) running analyses on small to moderately sized datasets, (4) viewing output directly within a notebook, or (5) educational purposes. Click here to learn the basics of Jupyter Notebook.
Running STREAMLINE in Jupyter Notebook is largely the same as for running it in Google Colab. Below we specify how to run the Jupyter Notebook on the included demonstration datasets, then how to adapt it to run on your own dataset(s).
Running the Demo (Jupyter)
The STREAMLINE Jupyter Notebook is also set up to run a limited analysis applying all 9 phases of the pipeline. This includes 3-fold cross-validation, applying only three of the faster ML modeling algorithms to 2 example ‘target datasets’, and a ‘replication dataset’ relevant to only one of the target datasets. These datasets are detailed in Demonstration Data. This demo should take about 2-5 minutes to run (depending on your computer hardware), with results viewable in the notebook. The notebook will automatically save the ‘experiment folder’ (named `DemoOutput`) with all output files (including PDF reports).
1. From your command line, open Jupyter Notebook by typing `jupyter notebook`.
2. Within the Jupyter local file browser that opens, navigate into the previously saved/installed `STREAMLINE` directory, where you will find the file named `STREAMLINE-Notebook.ipynb`.
3. Click to open `STREAMLINE-Notebook.ipynb` as a Jupyter Notebook in a new page in your web browser.
4. Open the `Kernel` menu and select `Restart & Run All`. This will run all code cells of the notebook, i.e. all phases of STREAMLINE.
At this point the notebook will do the following automatically:
- Run the entirety of STREAMLINE on the demonstration datasets.
- Save all output files (including PDF reports) as an ‘experiment folder’ named `DemoOutput` within the `STREAMLINE` directory.

See ‘Notebook Output’ for more on examining output within the notebook, and ‘Output Files’ for details on the output files generated.
Running Your Own Datasets (Jupyter)
Begin by opening `STREAMLINE-Notebook.ipynb` as a Jupyter Notebook (steps 1-3 above). Before running, update the run parameters within the ‘STREAMLINE RUN PARAMETERS’ section of the notebook as indicated below.
Note that, for brevity, some parameter names used in the notebook (used below) are slightly different from those used in the command line. Details on STREAMLINE run parameters are given here.
1. In the first code cell, set [`demo_run`](parameters.md#demo-run) = `False`.
   - This tells the notebook that you don’t want to run the demo datasets, and that you will instead specify run parameters within the respective code cells.
2. Update essential run parameters (within the respective code cells) to the user/dataset’s specifications:
   - `experiment_name` - a unique name for the output folder for the current STREAMLINE ‘experiment’
   - `data_path` - path to the folder containing one or more ‘target datasets’ to be analyzed. These datasets must adhere to the formatting detailed in ‘Input Data Requirements’.
   - `output_path` - path to the folder (automatically created if it doesn’t yet exist) in which the ‘experiment folder’ including all STREAMLINE output will be saved. Note: You can leave `output_path` as `./UserOutput`, and output will be saved to this automatically created `UserOutput` folder.
   - `class_label` - the header name for the outcome column in the dataset(s), e.g. ‘Class’
   - `instance_label` - the header name for the unique instance IDs in the dataset(s), or `None` if not relevant
   - `match_label` - the header name for the match/group column in the dataset(s), or `None` if not relevant
   - `ignore_features` - a list of text-valued feature names in the target datasets that you want STREAMLINE to drop from the analysis, or `None` if not relevant
   - `categorical_feature_headers` - a list of text-valued feature names in the dataset(s) headers that should be treated as categorical, or `None` if `quantitative_feature_headers` was specified and you want all other features to be treated as categorical, or if you want feature types to be decided automatically using `categorical_cutoff`
   - `quantitative_feature_headers` - a list of text-valued feature names in the dataset(s) headers that should be treated as quantitative, or `None` if `categorical_feature_headers` was specified and you want all other features to be treated as quantitative, or if you want feature types to be decided automatically using `categorical_cutoff`
   - `applyToReplication` - indicate `True` or `False` as to whether ‘replication data’ is available for the replication phase
   - `rep_data_path` - path to the folder containing one or more ‘replication datasets’ to be analyzed. All datasets in this folder should be replicates of a single ‘target dataset’, and must similarly adhere to formatting requirements.
   - `dataset_for_rep` - path to the file (with extension) of the original ‘target dataset’, indicating which models the replication data will be applied to
3. [Optional] Update non-essential run parameters (within the respective code cells) to the user’s specifications.
   - Most commonly, this would include `n_splits`, `categorical_cutoff`, and `algorithms`.
4. Open the `Kernel` menu and select `Restart & Run All`. This will run all code cells of the notebook, i.e. all phases of STREAMLINE.
Note: It can take multiple hours or longer to run this notebook on larger datasets and/or using all machine learning modeling algorithms. We recommend using a computing cluster for such tasks if possible.
Command Line Interface
This run mode is best for (1) most efficiently running STREAMLINE with parallelization options, (2) users comfortable with command lines, or (3) running moderate to large datasets and/or more exhaustive run parameter configurations.
Running STREAMLINE from the command line can be done locally (with or without CPU core parallelization), or on a dask-compatible CPU computing cluster. Any of these scenarios can be run from a single command (i.e. all phases at once) using a ‘configuration file’, or separately, one phase at a time. Below we indicate how to run all of these possible command line configurations using the demonstration datasets as an example. As with the Google Colab and Jupyter Notebook run modes, to run STREAMLINE on datasets other than the demonstration datasets, essential run parameters should be specified/updated accordingly. STREAMLINE run parameters specified in a configuration file or as command line arguments have slightly different names, as detailed in the run parameters section.
Locally
This section explains running STREAMLINE locally using the command line interface.
Using a Configuration File (Locally)
All phases of STREAMLINE can be run (in sequence) with a single command by editing and calling an associated configuration file (`run_configs/local.cfg`) as indicated below.
Note: This approach also allows users to run any subset of sequential STREAMLINE phases (e.g. Phase 1 alone for EDA, or Phases 1-4 for EDA, data processing, and feature selection) using the different ‘phases to run’ flags within the configuration file.
1. Open your command line interface and navigate to the installed `STREAMLINE` directory.
2. To run the demonstration datasets, skip to step 5. To view the pre-specified configuration file, click here.
3. Assuming you want to run your own dataset(s), navigate into the `run_configs` folder and open `local.cfg` in a text editor to update the essential and non-essential run parameters accordingly (see run parameters and terminal text editors for help).
   - Under ‘phases to run’, the `do_till_report` parameter (when set to `True`) will automatically run all phases up until `do_replicate` by default. `do_replicate`, `do_rep_report`, and `do_cleanup` must each be specified individually.
   - To run a subset of phases (e.g. phases 1-4), set `do_till_report` = `False`, set `do_eda`, `do_dataprep`, `do_feat_imp`, and `do_feat_sel` each to `True`, and set the other ‘do’ phases to `False`.
   - Make sure to keep [`run_cluster`](parameters.md#run-cluster) = `False`, which tells STREAMLINE to run locally rather than on a CPU computing cluster.
   - Optionally set [`run_parallel`](parameters.md#run-parallel) = `False`, which will turn off local multi-core CPU parallelization.
4. Navigate back to the `STREAMLINE` base directory.
5. Run the following command within the `STREAMLINE` base directory:
python run.py -c run_configs/local.cfg
Note: You can save your own `.cfg` files to call with this command. We recommend copying, renaming, and editing `local.cfg`, and then calling the new configuration file as an argument to `run.py`.
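To orient new users, a minimal sketch of what the ‘phases to run’ and local-parallelization settings discussed above might look like inside such a `.cfg` file is shown below. The option names follow the parameters described in this section, but the fragment is illustrative only; consult the bundled `run_configs/local.cfg` for the authoritative layout and the full set of options.

```ini
; Illustrative fragment only -- see run_configs/local.cfg for the real file.

; multiprocessing settings (run locally, with core parallelization)
run_cluster = False
run_parallel = True

; 'phases to run' flags: run everything up to the replication phase,
; then the replication phase, replication report, and cleanup individually
do_till_report = True
do_replicate = True
do_rep_report = True
do_cleanup = True
```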
Using Command-Line Arguments (Locally)
STREAMLINE phases can also be called individually from the command line without a configuration file (instead specifying run parameters as arguments). This can be helpful, in particular, if you want to run a big analysis, and would like to look at the output of phases along the way without committing to running the whole pipeline upfront. Similar to any other run mode, make sure to specify arguments for all ‘essential’ run parameters for a given dataset.
Note: Command line run parameters have slightly different identifiers than for the configuration file (see run parameters)
Note: Any unspecified non-essential run parameters will be assigned their default values for a given STREAMLINE run
- Make sure to specify `--run-cluster` = `False`, which tells STREAMLINE to run locally rather than on a CPU computing cluster.
- Optionally specify `--run-parallel` = `False`, which will turn off local multi-core CPU parallelization.
Note: When specifying `--fi`, `--cf`, or `--qf` using this run approach, it is necessary to pass a file path to a `.csv` file containing a list of feature names for that parameter, rather than directly listing the feature names. We use this approach in the examples below using the `.csv` files found in `STREAMLINE/data/DemoFeatureTypes`.
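For orientation, such a feature-type `.csv` is nothing more than a list of column header names. The sketch below uses hypothetical feature names, written as a single comma-separated row; check the actual demo files in `STREAMLINE/data/DemoFeatureTypes` (e.g. `hcc_cat_feat.csv`) for the exact layout STREAMLINE expects.

```
Sex,TumorStage,SmokingStatus
```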
Open your command line interface and navigate to the installed `STREAMLINE` directory. The subsections below provide different example scenarios running `run.py` on the demonstration datasets. These scenarios run STREAMLINE similarly to the other demo run mode examples above, but we set `--run-cluster` = `False` (necessary) and optionally `--run-parallel` = `True` for each example.
All Phases at Once (Replication Data Included)
python run.py --do-till-report --do-rep-report --do-clean --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --algorithms=NB,LR,DT --do-replicate --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster False --run-parallel True
All Main Phases at Once (No Replication Data)
python run.py --do-till-report --do-clean --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --algorithms=NB,LR,DT --run-cluster False --run-parallel True
One Phase at a Time
The following commands can be run one after the other (in sequence), waiting for the previous command to complete.
Phase 1 - Data Exploration & Processing:
python run.py --do-eda --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --run-cluster False --run-parallel True
Phase 2 - Imputation and Scaling:
python run.py --do-dataprep --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 3 - Feature Importance Estimation
python run.py --do-feat-imp --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 4 - Feature Selection
python run.py --do-feat-sel --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 5 - Machine Learning (ML) Modeling
python run.py --do-model --out-path DemoOutput --exp-name demo_experiment --algorithms NB,LR,DT --run-cluster False --run-parallel True
Phase 6 - Post-Analysis
python run.py --do-stats --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 7 - Compare Datasets
If there is only one ‘target dataset’ in the given analysis, skip this command.
python run.py --do-compare-dataset --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
Phase 8 - Replication
If there are no replication datasets, skip this command. If you have multiple ‘target datasets’, each with one or more associated replication datasets, run this command once for each original target dataset (updating `--rep-path` and `--dataset` for each).
python run.py --do-replicate --out-path DemoOutput --exp-name demo_experiment --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster False --run-parallel True
Phase 9 - Summary Report(s)
Run the following command to generate the main PDF report (summarizing testing data evaluations of the models).
python run.py --do-report --out-path DemoOutput --exp-name demo_experiment --run-cluster False --run-parallel True
If the models of a STREAMLINE experiment were applied to replication data in Phase 8, you can generate a report for the replication of a single target dataset using the following command. If you have multiple ‘target datasets’, each with one or more associated replication datasets, run this command once for each original target dataset (updating `--rep-path` and `--dataset` for each).
python run.py --do-rep-report --out-path DemoOutput --exp-name demo_experiment --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster False --run-parallel True
Optional Clean-up
python run.py --do-clean --out-path DemoOutput --exp-name demo_experiment --del-time --del-old-cv --run-cluster False --run-parallel True
CPU Computing Cluster
This section explains running STREAMLINE remotely on a dask-compatible CPU computing cluster (i.e. HPC).
Helpful Tools
First let’s discuss the role of a couple helpful tools mentioned in the installation section for running STREAMLINE on a computing cluster.
nano
GNU nano is a text editor for Unix-like operating systems that uses a command line interface. It is especially handy for editing the configuration file through an SSH terminal when running STREAMLINE with a configuration file.
A detailed guide can be found here
In short, you can edit a configuration file such as `run_configs/cedars.cfg` using the following steps:
1. Go to the root `STREAMLINE` folder.
2. Type `nano run_configs/cedars.cfg` in the terminal to open the file in nano.
3. Edit the configuration file as needed.
4. Press `Ctrl + X` to close the file and `Y` to save the changes.
tmux
tmux is a terminal multiplexer/emulator. It lets you switch easily between several programs in one terminal, detach them (they keep running in the background) and reattach them to a different terminal. This is particularly important when you want to run all phases of STREAMLINE automatically from a single command. To achieve this, STREAMLINE runs a script on the head node (i.e. job submission node) that monitors phase completion and submits new jobs for the next phase. A terminal emulator allows you to start a full pipeline run and close the window without killing the job.
A detailed guide on using it can be found here
In short, you can open a new terminal session that will stay open even if you disconnect and close your terminal, using the following steps:
1. Go to the root `STREAMLINE` folder.
2. Type and run `tmux new -s mysession`.
3. Open the required configuration file using nano (e.g. `run_configs/cedars.cfg`).
4. Make the necessary changes in the configuration file.
5. Press `Ctrl + X` to close the file and `Y` to save the changes.
6. Run the required commands.
7. Press `Ctrl + b` and then the `d` key to detach from the session.
Cluster-Specific Run Parameters
You should be aware of 3 cluster-specific run parameters:
- `run-cluster`: flag for the type of cluster, discussed in detail below (when not `False`, this overrides any value specified for `run-parallel`)
- `queue`: the partition queue used for job submissions
- `reserved_memory`: memory (in GB) reserved per job
Run Cluster Parameter
The `run-cluster` parameter is the most important parameter here. It should be set to `False` when running locally; to use a cluster, specify a string value for the cluster type. Currently, clusters supported by dask-jobqueue are supported, with the following setting options for `run-cluster`:
- `LSF`: LSFCluster
- `SLURM`: SLURMCluster
- `HTCondor`: HTCondorCluster
- `Moab`: MoabCluster
- `OAR`: OARCluster
- `PBS`: PBSCluster
- `SGE`: SGECluster
- `UGE`: SGECluster variant used at our institution
- `Local`: LocalCluster
Additionally, the earlier/legacy method of STREAMLINE manual job submission is supported for `SLURM` and `LSF` using the string values below for `run-cluster`. This will generate and submit jobs using shell files, as was done in STREAMLINE release 0.2.5 and earlier. This legacy option ensures that minimal memory/computation is used on the head node (i.e. the job-submit node).
- `SLURMOld`: legacy job submission for SLURMCluster
- `LSFOld`: legacy job submission for LSFCluster
Queue and Memory Parameters
Check with your cluster administrator on how to set these cluster-specific parameters. We have set defaults for these parameters for use on our own institution’s HPC (i.e. `queue` = `defq` and `reserved_memory` = `4`).
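Putting the three cluster-specific parameters together, a sketch of the corresponding configuration-file lines might look as follows. The values are our institution’s defaults from above, and the enclosing section layout is omitted here; check the bundled `run_configs/cedars.cfg` for exact placement.

```ini
; Illustrative fragment -- confirm the queue name and memory with your
; cluster administrator; see run_configs/cedars.cfg for the real file.
run_cluster = SLURM
queue = defq
reserved_memory = 4
```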
Using a Configuration File (Cluster)
This is largely the same as running STREAMLINE from a configuration file locally, with the addition of three cluster-specific parameters (`run_cluster`, `queue`, and `reserved_memory`).
1. Open your command line interface within your HPC and navigate to the installed `STREAMLINE` directory.
2. Edit any run parameters within a configuration file according to your needs (making sure to update `run_cluster`, `queue`, and `reserved_memory` within the multiprocessing section).
   - We have included example configuration files set up to run the demonstration datasets on three different clusters we utilize (i.e. `cedars.cfg`, `cedars_old.cfg`, and `upenn.cfg`), using `SLURM`, `UGE`, and `LSF`, respectively. We will focus here on `SLURM` as an example, with the respective configuration file (`cedars.cfg`) found here.
3. Run the following command within the `STREAMLINE` base directory:
python run.py -c run_configs/cedars.cfg
Note: The configuration filename and location can be anything, as long as it is a valid configuration file set up to run on your dask-compatible cluster.
Note: When using this run strategy, it is strongly recommended to use a terminal emulator, and to check with your cluster administrator that your system allows lightweight, longer-duration code to be run from the head node, which monitors job completion and can submit new jobs. If not, we recommend running STREAMLINE one phase at a time in legacy mode (i.e. `run-cluster` = `SLURMOld` or `LSFOld`), which only uses the head node to submit jobs.
Using Command-Line Arguments (Cluster)
This is largely the same as running STREAMLINE using command-line arguments locally, with the addition of three cluster-specific parameters (`--run-cluster`, `--queue`, and `--res-mem`); users can ignore `--run-parallel`.
Open your command line interface within your HPC and navigate to the installed `STREAMLINE` directory. The subsections below provide different example scenarios running `run.py` on the demonstration datasets; users can adjust these arguments for their own data. Also, here `--run-parallel` is automatically overridden when `--run-cluster` is set to `SLURM` (or any other cluster name, i.e. anything other than `False`).
Note: Any unspecified non-essential run parameters will be assigned their default values for a given STREAMLINE run
Note: When specifying `--fi`, `--cf`, or `--qf` using this run approach, it is necessary to pass a file path to a `.csv` file containing a list of feature names for that parameter, rather than directly listing the feature names. We use this approach in the examples below using the `.csv` files found in `STREAMLINE/data/DemoFeatureTypes`.
All Phases at Once (Replication Data Included)
Notice: this approach will run a lightweight script on the head node that monitors job completion and submits jobs for subsequent phases until all phases are complete.
python run.py --do-till-report --do-rep-report --do-clean --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --algorithms=NB,LR,DT --do-replicate --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster SLURM --res-mem 4 --queue defq
All Main Phases at Once (No Replication Data)
Notice: this approach will run a lightweight script on the head node that monitors job completion and submits jobs for subsequent phases until all phases are complete.
python run.py --do-till-report --do-clean --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --algorithms=NB,LR,DT --run-cluster SLURM --res-mem 4 --queue defq
One Phase at a Time
The following commands can be run one after the other (in sequence), waiting for all jobs of the previous phase to complete successfully. For minimal head node overhead, we recommend running these jobs using `--run-cluster` = `SLURMOld` or `LSFOld` rather than `SLURM` (if available to you).
Phase 1 - Data Exploration & Processing:
python run.py --do-eda --data-path ./data/DemoData --out-path DemoOutput --exp-name demo_experiment --class-label Class --inst-label InstanceID --cf ./data/DemoFeatureTypes/hcc_cat_feat.csv --qf ./data/DemoFeatureTypes/hcc_quant_feat.csv --cv 3 --run-cluster SLURM --res-mem 4 --queue defq
Phase 2 - Imputation and Scaling:
python run.py --do-dataprep --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 3 - Feature Importance Estimation
python run.py --do-feat-imp --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 4 - Feature Selection
python run.py --do-feat-sel --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 5 - Machine Learning (ML) Modeling
python run.py --do-model --out-path DemoOutput --exp-name demo_experiment --algorithms NB,LR,DT --run-cluster SLURM --res-mem 4 --queue defq
Phase 6 - Post-Analysis
python run.py --do-stats --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 7 - Compare Datasets
If there is only one ‘target dataset’ in the given analysis, skip this command.
python run.py --do-compare-dataset --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
Phase 8 - Replication
If there are no replication datasets, skip this command. If you have multiple ‘target datasets’ each with one or more associated replication datasets, run this command once for each original target dataset (updating --rep-path
and --dataset
for each).
python run.py --do-replicate --out-path DemoOutput --exp-name demo_experiment --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster SLURM --res-mem 4 --queue defq
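When several target datasets each have replication data, the per-dataset Phase 8 commands above can be scripted rather than typed by hand. The sketch below is illustrative only: the dataset/replication-path pairing table holds hypothetical placeholders (here just the demo pair), and the run_phase8 helper is not part of STREAMLINE itself.

```shell
# Sketch: run Phase 8 once per target dataset that has replication data.
# The dataset|replication-path pairs in the here-document are illustrative
# placeholders; replace them with your own target datasets.
run_phase8() {
  local runner="${1:-python}"   # pass "echo" here to preview commands (dry run)
  while IFS='|' read -r dataset rep_path; do
    "$runner" run.py --do-replicate --out-path DemoOutput --exp-name demo_experiment \
      --rep-path "$rep_path" --dataset "$dataset" \
      --run-cluster SLURM --res-mem 4 --queue defq
  done <<'EOF'
./data/DemoData/hcc_data_custom.csv|./data/DemoRepData
EOF
}
```

Calling run_phase8 submits the jobs, while run_phase8 echo prints each assembled command instead, which is useful for checking the pairings before submitting to the cluster.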
Phase 9 - Summary Report(s)
Run the following command to generate the main PDF report (summarizing testing data evaluations of the models).
python run.py --do-report --out-path DemoOutput --exp-name demo_experiment --run-cluster SLURM --res-mem 4 --queue defq
If the models of a STREAMLINE experiment were applied to replication data in phase 8 you can generate a report for the replication of a single target dataset using the following command. If you have multiple ‘target datasets’ each with one or more associated replication datasets, run this command once for each original target dataset (updating --rep-path
and --dataset
for each).
python run.py --do-rep-report --out-path DemoOutput --exp-name demo_experiment --rep-path ./data/DemoRepData --dataset ./data/DemoData/hcc_data_custom.csv --run-cluster SLURM --res-mem 4 --queue defq
Optional Clean-up
python run.py --do-clean --out-path DemoOutput --exp-name demo_experiment --del-time --del-old-cv --run-cluster SLURM --res-mem 4 --queue defq
Checking STREAMLINE Job Completion
Whether running STREAMLINE from the configuration file or using command-line arguments, users may wish to check the job completion status for jobs within a given phase. For example: (1) if running STREAMLINE one phase at a time, users will want to ensure that all jobs of the current phase have completed before initiating the next, or (2) if the modeling phase is taking a long time to complete, they may wish to know which algorithms are still training. This can be accomplished using the included checker.py script.
First, make sure you are in the installed STREAMLINE
directory. Then you can check the parameter options of checker.py
with:
python checker.py --help
As an example, let’s say the user wants to check the status of modeling jobs on their cluster during phase 5. They could run the following command, which assumes the demonstration data is being run as described above.
python checker.py --out-path DemoOutput --exp-name demo_experiment --phase 5 --count-only True
This would return the number of STREAMLINE jobs that have not yet completed (or failed to run) within phase 5.
Alternatively, the command below would output the names of the jobs that have not completed, which (in the case of phase 5) would inform the user which algorithms were still running.
python checker.py --out-path DemoOutput --exp-name demo_experiment --phase 5
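When running one phase at a time, the count returned by checker.py can be polled in a loop so the next phase is only launched once the count reaches zero. The helper below is a sketch, not part of STREAMLINE: it assumes the checker command's output contains the incomplete-job count as its last number, so adjust the parsing if checker.py formats its output differently.

```shell
# Sketch: block until a phase's incomplete-job count reaches zero.
# Assumes the checker command's output ends with the count as a bare number;
# adjust the grep/tail parsing if checker.py's output format differs.
wait_for_phase() {
  local check_cmd="$1" interval="${2:-60}" remaining
  while :; do
    # Extract the last number from the checker's output as the remaining count.
    remaining=$(eval "$check_cmd" | grep -oE '[0-9]+' | tail -n 1)
    [ "${remaining:-0}" -eq 0 ] && break
    echo "Waiting on $remaining incomplete job(s)..."
    sleep "$interval"
  done
}

# Example usage (polls phase 5 every 60 seconds before moving to phase 6):
# wait_for_phase 'python checker.py --out-path DemoOutput --exp-name demo_experiment --phase 5 --count-only True' 60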
Picking a Run Mode
Why run STREAMLINE on Google Colab?
Running STREAMLINE on Google Colab is best for:
Running the STREAMLINE demonstration on the included demo data
Users with little to no coding experience
Users that want the quickest/easiest approach to running STREAMLINE
Users that do not have access to a very powerful computer or compute cluster.
Applying STREAMLINE to smaller-scale analyses (in particular when only using free/limited Google Cloud resources):
Smaller datasets (e.g. < 500 instances and features)
A small number of total datasets (e.g. 1 or 2)
Only using the simplest/quickest modeling algorithms (e.g. Naive Bayes, Decision Trees, Logistic Regression)
Only using 1 or 2 modeling algorithms
Google Colab Notebook - on free Google Cloud resources [Anyone can run]:
Advantages
No coding or PC environment experience needed
Automatically installs and uses the most recent version of STREAMLINE
Computing can be performed directly on Google Cloud from anywhere
One-click run of whole pipeline (all phases)
Offers in-notebook viewing of results and ability to save notebook as documentation of analysis
Allows easy customizability of nearly all aspects of the pipeline with minimal coding/environment experience
Disadvantages:
Can only run pipeline serially
Slowest of the run options
Limited by Google Cloud computing allowances (may only work for smaller datasets)
Notes: Requires a Google account (free)
Jupyter Notebook - locally [Basic experience]:
Advantages:
Does not rely on free computing limitations of Google Cloud (but rather your own computer’s limitations)
One-click run of whole pipeline (all phases)
Offers in-notebook viewing of results and ability to save notebook as documentation of analysis
Allows easy customizability of all aspects of the pipeline with minimal coding/environment experience
Disadvantages:
Can only run pipeline serially
Slower runtime than from command-line
Beginners have to set up their computing environment
Notes: Requires Anaconda3, Python3, and several other minor Python package installations
Command Line (Local) [Command-line Users]:
Advantages:
Typically runs faster than within Jupyter Notebook
A more versatile option for those with command-line experience
One-command run of whole pipeline available when using a configuration file to run
Can optionally run the pipeline one phase at a time
Disadvantages:
Can only run pipeline serially or with limited local CPU core parallelization
Command-line experience recommended
Notes: Requires Anaconda3, Python3, and several other minor Python package installations
Command Line (HPC Cluster) [Computing Cluster Users]:
Advantages:
By far the fastest, most efficient way to run STREAMLINE
Offers the ability to run STREAMLINE on 7 types of HPC systems
One-command run of whole pipeline available when using a configuration file to run
Can optionally run the pipeline one phase at a time
Disadvantages:
Experience with command-line and dask-compatible clusters recommended
Access to a computing cluster required
Notes: Requires Anaconda3, Python3, and several other minor Python package installations. Cluster runs of STREAMLINE were set up using dask-jobqueue and thus should support the 7 cluster types described in the dask documentation. Currently, we have only directly tested STREAMLINE on SLURM and LSF clusters; further codebase adaptation may be needed for cluster types not listed at the link above.