streamline.featurefns.selection module

class streamline.featurefns.selection.FeatureSelection(full_path, n_splits, algorithms, class_label, instance_label, export_scores=True, top_features=20, max_features_to_keep=2000, filter_poor_features=True, overwrite_cv=False, show_plots=False)[source]

Bases: Job

Feature Selection Job for CV Data Splits

Parameters:
  • export_scores – flag to export top feature scores (default=True)

  • top_features – number of top features to consider (default=20)

  • max_features_to_keep – maximum number of features to keep (default=2000)

  • filter_poor_features – flag to filter poor features (default=True)

  • overwrite_cv – overwrite last cross validation dataset (default=False)

  • show_plots – flag to show plots (default=False)

gen_filtered_datasets(cv_selected_list, path_to_csv, dataset_name, overwrite_cv)[source]

Takes the lists of final features to be kept and creates new filtered cv training and testing datasets including only those features.

Parameters:
  • cv_selected_list – list of lists of feature names selected for each CV split

  • path_to_csv – path to cv splits from the last phase

  • dataset_name – name of dataset

  • overwrite_cv – whether to overwrite the old CV splits (otherwise they are renamed)
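The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the STREAMLINE implementation: the helper name filter_columns, the in-memory row dictionaries, and the column order (instance label, class label, then features) are all assumptions made for the example.

```python
def filter_columns(rows, selected_features, class_label, instance_label=None):
    """Keep only the selected feature columns plus the label columns.

    `rows` is a list of dicts, one per instance (hypothetical layout).
    """
    keep = []
    if instance_label is not None:
        keep.append(instance_label)
    keep.append(class_label)
    keep.extend(selected_features)
    # Rebuild every row with just the columns we decided to keep.
    return [{name: row[name] for name in keep} for row in rows]


# Toy training split with three features; only f1 and f3 were selected.
train = [
    {"id": "a", "Class": 1, "f1": 0.2, "f2": 5, "f3": 9},
    {"id": "b", "Class": 0, "f1": 0.7, "f2": 3, "f3": 4},
]
filtered = filter_columns(train, ["f1", "f3"], class_label="Class", instance_label="id")
```

The same selected feature list would be applied to both the training and the testing split of a given CV partition so that the two stay aligned.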

report_ave_fs(algorithm, algorithmlabel, selected_feature_lists, meta_feature_ranks)[source]

Loads feature importance results from phase 3, stores sorted feature importance scores for all cvs, creates a list of all feature names that have a feature importance score greater than 0 (i.e. some evidence that it may be informative), and creates a barplot of average feature importance scores.

Parameters:
  • algorithm – name of algorithm reporting for

  • algorithmlabel – label of algorithm reporting for (used for saving logs)

  • selected_feature_lists – dictionary used to store the lists of selected features

  • meta_feature_ranks – dictionary for data storage

Returns:
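As an illustration of the logic described for report_ave_fs, the sketch below averages per-CV importance scores and collects the features scoring above 0 in each split. It assumes the scores are already loaded as one {feature: score} dictionary per CV split; the helper names average_scores and informative_features are hypothetical, and the plotting and file loading steps are left out.

```python
from statistics import mean


def average_scores(cv_scores):
    """Average each feature's importance score across CV splits.

    `cv_scores` is a list of {feature: score} dicts, one per split
    (an assumed layout for this example).
    """
    features = cv_scores[0].keys()
    return {f: mean(scores[f] for scores in cv_scores) for f in features}


def informative_features(scores):
    """Return the features whose score is strictly greater than 0."""
    return [f for f, s in scores.items() if s > 0]


cv_scores = [
    {"f1": 0.5, "f2": 0.0, "f3": -0.25},
    {"f1": 0.25, "f2": 0.0, "f3": 0.75},
]
avg = average_scores(cv_scores)
# Feature names sorted from highest to lowest average score.
ranked = sorted(avg, key=avg.get, reverse=True)
# Informative features are determined separately for each CV split.
selected_per_cv = [informative_features(s) for s in cv_scores]
```

The ranked averages are what a barplot of average feature importance would display, while the per-split lists feed the later selection step.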

report_informative_features(informative_feature_counts, uninformative_feature_counts)[source]

Saves counts of informative vs. uninformative features (uninformative features being those with feature importance scores <= 0) in a CSV file.

Parameters:
  • informative_feature_counts – count of informative features to save

  • uninformative_feature_counts – count of uninformative features to save
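A rough sketch of writing such counts to CSV follows. It assumes one informative and one uninformative count per CV split, held in two parallel lists; the column names and the helper write_feature_counts are hypothetical and do not reflect the file layout STREAMLINE actually produces.

```python
import csv
import io


def write_feature_counts(handle, informative_counts, uninformative_counts):
    """Write one row per CV split with both counts (assumed layout)."""
    writer = csv.writer(handle)
    writer.writerow(["cv", "informative", "uninformative"])
    pairs = zip(informative_counts, uninformative_counts)
    for split, (informative, uninformative) in enumerate(pairs):
        writer.writerow([split, informative, uninformative])


# An in-memory buffer stands in for a real file on disk.
buffer = io.StringIO()
write_feature_counts(buffer, [12, 10], [3, 5])
```

Passing an open file handle instead of the buffer would write the same rows to disk.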

run()[source]

Runs all elements of feature selection: reports average feature importance scores across CV sets and applies collective feature selection to generate new feature-selected datasets.

save_runtime(full_path)[source]

Saves the phase runtime.

Parameters:
  • full_path – full path of current experiment

select_features(selected_feature_lists, max_features_to_keep, meta_feature_ranks)[source]

Function to select features

Identifies the features to keep for each CV split. If more than one feature importance algorithm was applied, collective feature selection is used so that the union of informative features is preserved. Overall, only informative features (i.e. those with a score > 0) are preserved. If there are more informative features than max_features_to_keep, only the top-scoring features are preserved. To reduce the feature list to that limit, the method alternates between the algorithm-ranked feature lists, taking the top remaining feature from each until the limit is reached.

Parameters:
  • selected_feature_lists – dictionary for data storage

  • max_features_to_keep – number of maximum features to keep

  • meta_feature_ranks – dictionary for data storage

Returns:

cv_selected_list (list of final selected features for each CV split), informative_feature_counts, uninformative_feature_counts
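The collective selection described for select_features can be sketched for a single CV split as follows. This is a simplified illustration under stated assumptions, not the library's code: the input is a hypothetical {algorithm: ranked list of informative features} dictionary, the helper name select_for_split is invented, and an algorithm whose next-ranked feature was already taken simply skips that turn.

```python
def select_for_split(ranked_by_algorithm, max_features_to_keep):
    """Pick features for one CV split from several ranked lists.

    Each list holds only informative features (score > 0), best first.
    """
    union = set()
    for ranked in ranked_by_algorithm.values():
        union.update(ranked)
    # Under the limit: keep the whole union of informative features.
    if len(union) <= max_features_to_keep:
        return sorted(union)

    # Over the limit: alternate between algorithms, rank by rank.
    selected = []
    longest = max(len(r) for r in ranked_by_algorithm.values())
    rank = 0
    while len(selected) < max_features_to_keep and rank < longest:
        for ranked in ranked_by_algorithm.values():
            if rank < len(ranked) and ranked[rank] not in selected:
                selected.append(ranked[rank])
                if len(selected) == max_features_to_keep:
                    break
        rank += 1
    return selected


ranked = {
    "algo_a": ["f1", "f2", "f3"],
    "algo_b": ["f3", "f4", "f1"],
}
capped = select_for_split(ranked, max_features_to_keep=3)
uncapped = select_for_split(ranked, max_features_to_keep=10)
```

With a limit of 3, the sketch takes the top feature from each algorithm in turn before moving to second-ranked features, so no single algorithm dominates the reduced list.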