streamline.utils.dataset module
- class streamline.utils.dataset.Dataset(dataset_path, class_label, match_label=None, instance_label=None)[source]
Bases: object
Creates a dataset from the path to a tabular file.
- Parameters:
dataset_path – path of tabular file (as csv, tsv, or txt)
class_label – column label for the outcome to be predicted in the dataset
match_label – column that identifies unique groups of instances that have been ‘matched’ while preparing a dataset of cases and controls matched on some covariates. The match label is only used in cross-validation partitioning, where it keeps any set of instances with the same match label value in the same partition.
instance_label – mostly used by the rule-based learner during modeling, to trace heterogeneous subgroups back to the instances in the original dataset
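The effect of match_label on partitioning can be illustrated with a minimal sketch. This is not STREAMLINE's implementation: the helper name partition_by_match and the round-robin assignment of groups are assumptions made purely for illustration.

```python
from collections import defaultdict


def partition_by_match(match_values, n_partitions):
    """Hypothetical helper: assign instance indices to partitions so that
    instances sharing a match label value always land in the same partition."""
    # Collect the instance indices belonging to each matched group.
    groups = defaultdict(list)
    for index, value in enumerate(match_values):
        groups[value].append(index)

    # Deal whole groups to partitions round-robin, in order of first appearance.
    partitions = [[] for _ in range(n_partitions)]
    for group_number, indices in enumerate(groups.values()):
        partitions[group_number % n_partitions].extend(indices)
    return partitions


# Toy match column: instances 0 and 2 are matched, as are 1 and 4, and 3 and 5.
match_column = ["a", "b", "a", "c", "b", "c", "d"]
partitions = partition_by_match(match_column, n_partitions=2)
```

Because groups are assigned as whole units, no matched set is ever split across partitions, at the cost of partitions that may differ slightly in size.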
- clean_data(ignore_features)[source]
Basic data cleaning: drops any instances with a missing outcome value, as well as any features (ignore_features) specified by the user.
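The two cleaning steps described above could look roughly like this in pandas. The function clean_data_sketch and the toy column names are hypothetical; the actual method operates on the Dataset object's own data and may differ in detail.

```python
import pandas as pd


def clean_data_sketch(data, class_label, ignore_features=None):
    """Hypothetical stand-in for the cleaning steps described above."""
    # Drop instances (rows) whose outcome value is missing.
    cleaned = data.dropna(subset=[class_label])
    # Drop any user-specified features (columns).
    if ignore_features:
        cleaned = cleaned.drop(columns=ignore_features)
    return cleaned


df = pd.DataFrame(
    {
        "age": [30, 41, 25],
        "notes": ["x", "y", "z"],
        "Class": [1, None, 0],  # the second instance has no outcome
    }
)
cleaned = clean_data_sketch(df, class_label="Class", ignore_features=["notes"])
```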
- counts_summary(experiment_path, total_missing=None, plot=True, show_plots=False, initial='')[source]
Reports various dataset counts: number of instances, total features, categorical features, quantitative features, and class counts. Also saves a simple bar graph of class counts if requested by the user.
- Parameters:
experiment_path –
total_missing – total number of missing values (optional; recomputed if not given)
plot – flag to output bar graph in the experiment log folder
show_plots – flag to show plots
initial – flag for initial EDA (exploratory data analysis)
Returns:
- describe_data(experiment_path, initial='')[source]
Computes and exports basic dataset descriptions, including basic column statistics, column variable types (e.g. int64 vs. float64), and unique value counts for each column.
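The three kinds of description listed above correspond to standard pandas calls. The sketch below assumes the data is held in a pandas DataFrame and uses made-up column names; it shows only the computations, not how or where describe_data exports them.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [30, 41, 25, 41],
        "weight": [70.5, 82.0, 64.3, 82.0],
        "Class": [1, 0, 0, 1],
    }
)

summary = df.describe()       # basic column statistics (count, mean, std, ...)
dtypes = df.dtypes            # variable type of each column
unique_counts = df.nunique()  # number of unique values per column
```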
- feature_correlation(experiment_path, x_data=None, plot=True, show_plots=False, initial='')[source]
Calculates pairwise feature correlations using the Pearson correlation coefficient and exports a corresponding heatmap visualization. Because of the computational expense, this is not recommended for datasets with a large number of instances and/or features unless needed. The generated heatmap will also be difficult to read when the target dataset has a large number of features.
- Parameters:
experiment_path –
x_data – data with only feature columns
plot –
show_plots –
initial –
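A minimal sketch of the Pearson correlation step, assuming x_data is a pandas DataFrame holding only feature columns. It relies on pandas' DataFrame.corr and omits the heatmap export; the toy column names are illustrative, and this is not necessarily how feature_correlation is implemented internally.

```python
import pandas as pd

# Toy feature-only data: y rises linearly with x, z falls linearly with x.
x_data = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],
        "z": [4.0, 3.0, 2.0, 1.0],
    }
)

# Pairwise Pearson correlation matrix (features x features).
correlations = x_data.corr(method="pearson")
```

The resulting matrix has one row and one column per feature, which is why both the computation and the heatmap scale poorly as the number of features grows.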
- feature_only_data()[source]
Create a features-only version of the dataset for some operations.
Returns:
dataframe x_data with only features
- missing_count_plot(experiment_path, plot=False, initial='')[source]
Plots a histogram of missingness across all data columns.
- missingness_counts(experiment_path, initial='', save=True)[source]
Count and export missing values for all data columns.
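Counting missing values per column can be sketched with pandas as follows. The DataFrame and its column names are made up for illustration, and the export step performed by missingness_counts is not shown.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [30.0, None, 25.0],
        "weight": [None, None, 64.3],
        "Class": [1, 0, 1],
    }
)

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
# Total number of missing values across the whole dataset.
total_missing = int(missing_per_column.sum())
```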
- non_feature_data()[source]
Create a non-features version of the dataset for some operations.
Returns:
dataframe y_data with only non-features
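The split between feature_only_data and non_feature_data can be pictured with the sketch below. It assumes the non-feature columns are exactly the class, match, and instance labels passed to the constructor; the helper split_columns and the column order of y_data are hypothetical.

```python
import pandas as pd


def split_columns(data, class_label, match_label=None, instance_label=None):
    """Hypothetical helper: separate feature columns from label columns."""
    # Non-feature columns are whichever of the three labels were provided.
    non_feature_columns = [
        label
        for label in (class_label, match_label, instance_label)
        if label is not None
    ]
    x_data = data.drop(columns=non_feature_columns)  # features only
    y_data = data[non_feature_columns]               # non-features only
    return x_data, y_data


df = pd.DataFrame({"id": [1, 2], "age": [30, 41], "Class": [0, 1]})
x_data, y_data = split_columns(df, class_label="Class", instance_label="id")
```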