streamline.utils.dataset module

class streamline.utils.dataset.Dataset(dataset_path, class_label, match_label=None, instance_label=None)[source]

Bases: object

Creates a dataset from the path to a tabular file

Parameters:
  • dataset_path – path to the tabular file (csv, tsv, or txt)

  • class_label – column label for the outcome to be predicted in the dataset

  • match_label – column identifying unique groups of instances in the dataset that have been ‘matched’ while preparing the dataset, e.g. cases and controls matched on some covariates. The match label is used only during cross-validation partitioning: it keeps any set of instances with the same match label value in the same partition.

  • instance_label – column of instance identifiers. It is used mainly by the rule-based learner during modeling to trace heterogeneous subgroups back to the instances in the original dataset.
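As a hedged illustration (the toy column names below are hypothetical, not part of StreamLine), the three label arguments simply name columns of the tabular file; every remaining column is treated as a feature:

```python
import pandas as pd

# Hypothetical toy dataset: 'Class' is the outcome, 'MatchID' groups
# matched cases/controls, and 'InstanceID' uniquely identifies each row.
df = pd.DataFrame({
    "InstanceID": [1, 2, 3, 4],
    "MatchID": [10, 10, 11, 11],
    "Age": [34, 36, 51, 49],
    "BMI": [22.1, 27.5, 30.2, 24.8],
    "Class": [1, 0, 1, 0],
})

class_label, match_label, instance_label = "Class", "MatchID", "InstanceID"

# Feature columns are everything not named by one of the three labels.
feature_names = [c for c in df.columns
                 if c not in (class_label, match_label, instance_label)]
print(feature_names)  # ['Age', 'BMI']
```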

clean_data(ignore_features)[source]

Basic data cleaning: drops any instances with a missing outcome value, as well as any features (ignore_features) specified by the user
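A minimal pandas sketch of this cleaning step, assuming the data is held in a dataframe (column names here are illustrative, not StreamLine's own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Class": [1, 0, np.nan, 1],   # outcome column, one value missing
    "Age": [34, 36, 51, 49],
    "Noise": [0.1, 0.2, 0.3, 0.4],
})

class_label = "Class"
ignore_features = ["Noise"]

# Drop instances with a missing outcome, then drop user-ignored features.
cleaned = df.dropna(axis=0, subset=[class_label])
cleaned = cleaned.drop(columns=ignore_features)
print(cleaned.shape)  # (3, 2)
```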

counts_summary(experiment_path, total_missing=None, plot=True, show_plots=False, initial='')[source]

Reports various dataset counts: i.e. number of instances, total features, categorical features, quantitative features, and class counts. Also saves a simple bar graph of class counts if specified by the user.

Parameters:
  • experiment_path

  • total_missing – total number of missing values (optional; recomputed if not given)

  • plot – flag to output a bar graph in the experiment log folder

  • show_plots – flag to show plots

  • initial – flag for initial EDA

Returns:
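A rough sketch of the counts this method reports, using pandas (column names are hypothetical; bar-graph plotting and file export are omitted):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [34, 36, 51, 49],        # quantitative feature
    "Sex": ["M", "F", "F", "M"],    # categorical feature
    "Class": [1, 0, 1, 1],          # outcome
})
class_label = "Class"
features = df.drop(columns=[class_label])

n_instances = len(df)
n_features = features.shape[1]
n_categorical = features.select_dtypes(exclude="number").shape[1]
n_quantitative = features.select_dtypes(include="number").shape[1]
class_counts = df[class_label].value_counts()

print(n_instances, n_features, n_categorical, n_quantitative)  # 4 2 1 1
```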

describe_data(experiment_path, initial='')[source]

Conducts and exports basic dataset descriptions, including basic column statistics, column variable types (i.e. int64 vs. float64), and unique value counts for each column

eda(experiment_path, plot=False, initial='')[source]

feature_correlation(experiment_path, x_data=None, plot=True, show_plots=False, initial='')[source]

Calculates feature correlations via Pearson correlation and exports a corresponding heatmap visualization. Due to computational expense, this may not be recommended for datasets with a large number of instances and/or features unless needed. The generated heatmap will also be difficult to read when the target dataset has a large number of features.

Parameters:
  • experiment_path

  • x_data – data with only feature columns

  • plot

  • show_plots

  • initial
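The Pearson correlation matrix itself is a one-liner in pandas; the sketch below computes the matrix a heatmap would visualize (the heatmap call itself is omitted, and the column names are illustrative):

```python
import pandas as pd

x_data = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with A
    "C": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with A
})

# Pairwise Pearson correlations between all feature columns.
corr = x_data.corr(method="pearson")
print(corr.loc["A", "B"], corr.loc["A", "C"])  # ~1.0 and ~-1.0
```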

feature_only_data()[source]

Creates a features-only version of the dataset for some operations.

Returns: dataframe x_data with only feature columns

get_headers()[source]

Returns the feature names of the dataset.

Returns: list of feature names

get_outcome()[source]

Gets the outcome values from the data.

Returns: outcome column

initial_eda(experiment_path, plot=False, initial='initial/')[source]

load_data()[source]

Loads the data from dataset_path into the dataset.
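Since the constructor accepts csv, tsv, or txt paths, loading presumably dispatches on the file extension. A hedged sketch of that dispatch (StreamLine's actual separator handling may differ):

```python
from pathlib import Path

import pandas as pd

def load_tabular(dataset_path):
    # Assume comma-separated values for .csv and tab-separated
    # values for .tsv/.txt files.
    sep = "," if Path(dataset_path).suffix == ".csv" else "\t"
    return pd.read_csv(dataset_path, sep=sep)
```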

missing_count_plot(experiment_path, plot=False, initial='')[source]

Plots a histogram of missingness across all data columns.

missingness_counts(experiment_path, initial='', save=True)[source]

Count and export missing values for all data columns.
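Counting missing values per column is a standard pandas idiom; a sketch of what this method likely computes (file export omitted, column names hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [34, np.nan, 51],
    "BMI": [22.1, 27.5, np.nan],
    "Class": [1, 0, 1],
})

# Per-column counts of missing (NaN) values.
missing_counts = df.isna().sum()
print(missing_counts["Age"], missing_counts["Class"])  # 1 0
```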

non_feature_data()[source]

Creates a non-feature version of the dataset for some operations.

Returns: dataframe y_data with only non-feature columns

set_original_headers(experiment_path, phase='exploratory')[source]

Exports the original dataset header labels for use as a reference later in the pipeline.

Returns: list of header labels

set_processed_headers(experiment_path, phase='exploratory')[source]

Exports the processed dataset header labels for use as a reference later in the pipeline.

Returns: list of header labels