Code Documentation

skrare.rare module

class skrare.rare.RARE(label_name='Class', duration_name='grf_yrs', given_starting_point=False, amino_acid_start_point=None, amino_acid_bins_start_point=None, iterations=1000, rare_variant_maf_cutoff=1, set_number_of_bins=1, min_features_per_group=1, max_number_of_groups_with_feature=1, informative_cutoff=0.2, crossover_probability=0.5, mutation_probability=0.05, elitism_parameter=0.2, scoring_method='Relief', score_based_on_sample=True, score_with_common_variables=False, instance_sample_size=500, random_seed=None, bin_size_variability_constraint=None, max_features_per_bin=None, multiprocessing=False)

Bases: BaseEstimator, TransformerMixin

A Scikit-Learn compatible framework for the RARE Algorithm.

Parameters:
  • given_starting_point – whether or not expert knowledge is being inputted (True or False)

  • amino_acid_start_point – if RARE is starting with expert knowledge, input the list of features here; otherwise None

  • amino_acid_bins_start_point – if RARE is starting with expert knowledge, input the list of bins of features here; otherwise None

  • iterations – the number of evolutionary cycles RARE will run

  • label_name – label for the class/endpoint column in the dataset (e.g., ‘Class’)

  • rare_variant_maf_cutoff – the minor allele frequency cutoff separating common features from rare variant features

  • set_number_of_bins – the population size of candidate bins

  • min_features_per_group – the minimum number of features in a bin

  • max_number_of_groups_with_feature – the maximum number of bins containing a feature

  • scoring_method – ‘Univariate’, ‘Relief’, or ‘Relief only on bin and common features’

  • score_based_on_sample – if Relief scoring is used, whether or not bin evaluation is done based on a sample of instances rather than the whole dataset

  • score_with_common_variables – if Relief scoring is used, whether or not common features should be used as context for evaluating rare variant bins

  • instance_sample_size – if bin evaluation is done based on a sample of instances, input the sample size here

  • crossover_probability – the probability of each feature in an offspring bin to crossover to the paired offspring bin (recommendation: 0.5 to 0.8)

  • mutation_probability – the probability of each feature in a bin to be deleted; a proportionate probability is automatically applied to each feature outside the bin to be added (recommendation: 0.05 to 0.5 depending on situation and number of iterations run)

  • elitism_parameter – the proportion of elite bins in the current generation to be preserved for the next evolutionary cycle (recommendation: 0.2 to 0.8 depending on conservativeness of approach and number of iterations run)

  • random_seed – the seed value needed to generate a random number

  • bin_size_variability_constraint – sets the max bin size of children to be n times the size of their sibling (recommendation: 2; with larger or smaller values, the population tends to drift heavily toward small or large bins without exploring the search space)

  • max_features_per_bin – sets a max value for the number of features per bin

  • multiprocessing – flag for using multiprocessing implementation of RARE
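The evolutionary parameters above can be illustrated with a small, self-contained sketch. This is not skrare's implementation: the function names (split_by_maf, crossover, mutate, select_elites), the representation of a bin as a Python set of feature names, and the feature names themselves are hypothetical. It also assumes that a feature counts as rare when its MAF is strictly below the cutoff, and that the "proportionate" add probability is the delete probability scaled by the ratio of in-bin to out-of-bin features.

```python
import random


def split_by_maf(maf_by_feature, rare_variant_maf_cutoff):
    """Split features into (rare, common) lists by minor allele frequency.

    Assumption: a feature is rare when its MAF is strictly below the cutoff.
    """
    rare = [f for f, maf in maf_by_feature.items() if maf < rare_variant_maf_cutoff]
    common = [f for f, maf in maf_by_feature.items() if maf >= rare_variant_maf_cutoff]
    return rare, common


def crossover(parent_a, parent_b, crossover_probability, rng):
    """Each feature moves to the paired offspring bin with the given probability."""
    child_a, child_b = set(), set()
    for feature in sorted(parent_a):
        if rng.random() < crossover_probability:
            child_b.add(feature)
        else:
            child_a.add(feature)
    for feature in sorted(parent_b):
        if rng.random() < crossover_probability:
            child_a.add(feature)
        else:
            child_b.add(feature)
    return child_a, child_b


def mutate(bin_features, all_rare_features, mutation_probability, rng):
    """Delete in-bin features and add outside features at random.

    Assumed scaling for additions: mutation_probability * |bin| / |outside|.
    """
    outside = [f for f in all_rare_features if f not in bin_features]
    add_probability = 0.0
    if outside:
        add_probability = min(1.0, mutation_probability * len(bin_features) / len(outside))
    kept = {f for f in sorted(bin_features) if rng.random() >= mutation_probability}
    added = {f for f in outside if rng.random() < add_probability}
    return kept | added


def select_elites(scored_bins, elitism_parameter):
    """Keep the top-scoring proportion of (bin, score) pairs for the next cycle."""
    n_elites = int(len(scored_bins) * elitism_parameter)
    ranked = sorted(scored_bins, key=lambda pair: pair[1], reverse=True)
    return ranked[:n_elites]


# Hypothetical feature names and MAF values, for illustration only.
rng = random.Random(0)
maf = {"P_1": 0.01, "P_2": 0.03, "P_3": 0.20, "P_4": 0.04}
rare, common = split_by_maf(maf, rare_variant_maf_cutoff=0.05)
child_a, child_b = crossover({"P_1", "P_2"}, {"P_4"}, 0.5, rng)
mutated = mutate(child_a, rare, 0.05, rng)
scored = [({"P_1"}, 0.9), ({"P_2"}, 0.4), ({"P_4"}, 0.1),
          ({"P_1", "P_4"}, 0.7), ({"P_2", "P_4"}, 0.2)]
elites = select_elites(scored, elitism_parameter=0.2)
```

Passing a seeded random.Random instance mirrors the role of random_seed: the same seed reproduces the same sequence of crossover and mutation decisions in this sketch.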

fit(original_feature_matrix, y=None)

Scikit-learn compatible fit function for supervised training of RARE

Parameters:
  • original_feature_matrix – array-like {n_samples, n_features} Training instances. ALL INSTANCE ATTRIBUTES MUST BE NUMERIC or NAN

  • y – array-like {n_samples} Training labels. ALL INSTANCE PHENOTYPES MUST BE NUMERIC NOT NAN OR OTHER TYPE

Returns: self
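The input requirements stated for fit can be checked before training. The following is a minimal sketch of such a check, not part of skrare: the helper name check_fit_inputs is hypothetical, and it assumes the feature matrix is given as a list of rows and the labels as a flat list.

```python
import math
from numbers import Real


def check_fit_inputs(feature_rows, labels):
    """Raise ValueError if the inputs violate the documented fit() requirements."""
    if len(feature_rows) != len(labels):
        raise ValueError("feature matrix and labels must have the same number of samples")
    for i, row in enumerate(feature_rows):
        for j, value in enumerate(row):
            # Features must be numeric; NaN is allowed for missing values.
            if not isinstance(value, Real):
                raise ValueError(f"non-numeric feature at row {i}, column {j}: {value!r}")
    for i, label in enumerate(labels):
        # Labels must be numeric and must not be NaN.
        if not isinstance(label, Real):
            raise ValueError(f"non-numeric label at row {i}: {label!r}")
        if math.isnan(label):
            raise ValueError(f"label at row {i} is NaN")


# Hypothetical toy data: NaN is acceptable in features but not in labels.
X = [[0, 1, float("nan")], [1, 0, 0]]
y = [0, 1]
check_fit_inputs(X, y)
```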

reboot_population()

Reboots the population. Not implemented.

transform(original_feature_matrix, y=None)

Scikit-learn compatible transform function for RARE

Parameters:
  • original_feature_matrix – the original feature matrix (pd.DataFrame)

  • y – array-like {n_samples} Training labels. ALL INSTANCE PHENOTYPES MUST BE NUMERIC NOT NAN OR OTHER TYPE

Returns: self, bin_feature_matrix, common_features_and_bins_matrix, amino_acid_bins, amino_acid_bin_scores, rare_feature_maf_dict, common_feature_maf_dict, rare_feature_df, common_feature_df, maf_0_features
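Because transform returns ten values positionally, it can help to give them names before use. The sketch below wraps such a tuple in a namedtuple whose fields follow the order listed above. The names RareTransformResult and label_transform_output are hypothetical, the first field is called estimator in place of self, and the placeholder strings stand in for a real return value.

```python
from collections import namedtuple

# Field order follows the documented return order of transform().
RareTransformResult = namedtuple(
    "RareTransformResult",
    [
        "estimator",
        "bin_feature_matrix",
        "common_features_and_bins_matrix",
        "amino_acid_bins",
        "amino_acid_bin_scores",
        "rare_feature_maf_dict",
        "common_feature_maf_dict",
        "rare_feature_df",
        "common_feature_df",
        "maf_0_features",
    ],
)


def label_transform_output(output):
    """Wrap a 10-element transform() result; raises TypeError on a length mismatch."""
    return RareTransformResult(*output)


# Placeholder values only; a real call would pass the tuple returned by transform().
placeholder = tuple(f"<{name}>" for name in RareTransformResult._fields)
result = label_transform_output(placeholder)
bins = result.amino_acid_bins
scores = result.amino_acid_bin_scores
```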