Code Documentation
skrare.rare module
- class skrare.rare.RARE(label_name='Class', duration_name='grf_yrs', given_starting_point=False, amino_acid_start_point=None, amino_acid_bins_start_point=None, iterations=1000, rare_variant_maf_cutoff=1, set_number_of_bins=1, min_features_per_group=1, max_number_of_groups_with_feature=1, informative_cutoff=0.2, crossover_probability=0.5, mutation_probability=0.05, elitism_parameter=0.2, scoring_method='Relief', score_based_on_sample=True, score_with_common_variables=False, instance_sample_size=500, random_seed=None, bin_size_variability_constraint=None, max_features_per_bin=None, multiprocessing=False)[source]
Bases: BaseEstimator, TransformerMixin
A Scikit-Learn compatible framework for the RARE Algorithm.
- Parameters:
given_starting_point – whether or not expert knowledge is being supplied (True or False)
amino_acid_start_point – if RARE is starting with expert knowledge, input the list of features here; otherwise None
amino_acid_bins_start_point – if RARE is starting with expert knowledge, input the list of bins of features here; otherwise None
iterations – the number of evolutionary cycles RARE will run
label_name – label for the class/endpoint column in the dataset (e.g., ‘Class’)
rare_variant_maf_cutoff – the minor allele frequency cutoff separating common features from rare variant features
set_number_of_bins – the population size of candidate bins
min_features_per_group – the minimum number of features in a bin
max_number_of_groups_with_feature – the maximum number of bins containing a feature
scoring_method – ‘Univariate’, ‘Relief’, or ‘Relief only on bin and common features’
score_based_on_sample – if Relief scoring is used, whether or not bin evaluation is done based on a sample of instances rather than the whole dataset
score_with_common_variables – if Relief scoring is used, whether or not common features should be used as context for evaluating rare variant bins
instance_sample_size – if bin evaluation is done based on a sample of instances, input the sample size here
crossover_probability – the probability that each feature in an offspring bin crosses over to the paired offspring bin (recommendation: 0.5 to 0.8)
mutation_probability – the probability that each feature in a bin is deleted; a proportionate probability is automatically applied to each feature outside the bin to be added (recommendation: 0.05 to 0.5, depending on the situation and the number of iterations run)
elitism_parameter – the proportion of elite bins in the current generation to be preserved for the next evolutionary cycle (recommendation: 0.2 to 0.8 depending on conservativeness of approach and number of iterations run)
random_seed – the seed value for the random number generator
bin_size_variability_constraint – limits the size of a child bin to at most n times the size of its sibling (recommendation: 2; with larger or smaller values the population would trend heavily towards small or large bins without exploring the search space)
max_features_per_bin – sets a max value for the number of features per bin
multiprocessing – flag for using multiprocessing implementation of RARE
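The roles of crossover_probability, mutation_probability, and elitism_parameter described above can be illustrated with a minimal sketch. This is not the skrare implementation: the function names crossover, mutate, and select_elites are hypothetical, bins are modeled as plain sets of feature names, and the "proportionate" add probability is assumed to be the one that makes expected additions equal expected deletions.

```python
import random


def crossover(bin_a, bin_b, crossover_probability, rng):
    """Uniform crossover: each feature moves to the paired offspring
    with probability crossover_probability, otherwise it stays put."""
    child_a, child_b = set(), set()
    for feature in sorted(bin_a):
        target = child_b if rng.random() < crossover_probability else child_a
        target.add(feature)
    for feature in sorted(bin_b):
        target = child_a if rng.random() < crossover_probability else child_b
        target.add(feature)
    return child_a, child_b


def mutate(bin_features, all_features, mutation_probability, rng):
    """Delete each in-bin feature with mutation_probability and add each
    outside feature with a proportionate probability."""
    kept = {f for f in sorted(bin_features)
            if rng.random() >= mutation_probability}
    outside = sorted(set(all_features) - set(bin_features))
    # Assumption: "proportionate" means expected additions match
    # expected deletions; the library may define it differently.
    if outside:
        add_probability = min(
            1.0, mutation_probability * len(bin_features) / len(outside))
    else:
        add_probability = 0.0
    added = {f for f in outside if rng.random() < add_probability}
    return kept | added


def select_elites(scored_bins, elitism_parameter):
    """Return the top-scoring fraction of bins, to be carried over
    unchanged. scored_bins is a list of (bin, score) pairs."""
    n_elite = int(len(scored_bins) * elitism_parameter)
    ranked = sorted(scored_bins, key=lambda item: item[1], reverse=True)
    return [bin_ for bin_, _ in ranked[:n_elite]]


rng = random.Random(42)  # seeded for reproducibility, like random_seed
child_a, child_b = crossover({"v1", "v2", "v3"}, {"v4", "v5"}, 0.5, rng)
mutated = mutate(child_a, ["v1", "v2", "v3", "v4", "v5"], 0.05, rng)
```

Crossover in this sketch conserves the union of the two parent bins, so only mutation introduces features that neither parent contained.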
- fit(original_feature_matrix, y=None)[source]
Scikit-learn compatible fit function for supervised training of RARE
- Parameters:
original_feature_matrix – array-like {n_samples, n_features}. Training instances. ALL INSTANCE ATTRIBUTES MUST BE NUMERIC OR NAN
y – array-like {n_samples}. Training labels. ALL INSTANCE PHENOTYPES MUST BE NUMERIC (NOT NAN OR ANY OTHER TYPE)
- Returns:
self
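The input requirements for fit can be checked before training. The following is a minimal sketch using only the standard library; check_fit_inputs is a hypothetical helper, not part of skrare, and it assumes the feature matrix is available as a list of rows and the labels as a flat list.

```python
import math
from numbers import Real


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def check_fit_inputs(rows, labels):
    """Raise if the data violate the documented constraints:
    attributes must be numeric or NaN; labels must be numeric and not NaN."""
    if len(rows) != len(labels):
        raise ValueError("number of rows and labels differ")
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # NaN is a float, so it passes this check, as documented.
            if not _is_number(value):
                raise TypeError(f"non-numeric attribute at row {i}, column {j}")
    for i, label in enumerate(labels):
        if not _is_number(label):
            raise TypeError(f"non-numeric label at row {i}")
        if math.isnan(label):
            raise ValueError(f"NaN label at row {i}")


# Passes: missing attribute values are allowed, missing labels are not.
check_fit_inputs([[0, 1.0, float("nan")], [2, 0, 1]], [0, 1])
```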
- transform(original_feature_matrix, y=None)[source]
Scikit-learn compatible transform function for RARE
- Parameters:
original_feature_matrix – the original feature matrix (pd.DataFrame)
y – array-like {n_samples}. Training labels. ALL INSTANCE PHENOTYPES MUST BE NUMERIC (NOT NAN OR ANY OTHER TYPE)
- Returns:
self, bin_feature_matrix, common_features_and_bins_matrix, amino_acid_bins, amino_acid_bin_scores, rare_feature_maf_dict, common_feature_maf_dict, rare_feature_df, common_feature_df, maf_0_features
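Because transform returns ten values positionally, it can help to name them at the call site. The sketch below wraps such a tuple in a NamedTuple whose fields follow the documented return order. TransformResult is a hypothetical helper, not part of skrare, and the first field is called estimator here to stand in for the returned self.

```python
from typing import Any, NamedTuple


class TransformResult(NamedTuple):
    """Names for the ten values returned by transform, in documented order."""
    estimator: Any  # the returned `self`
    bin_feature_matrix: Any
    common_features_and_bins_matrix: Any
    amino_acid_bins: Any
    amino_acid_bin_scores: Any
    rare_feature_maf_dict: Any
    common_feature_maf_dict: Any
    rare_feature_df: Any
    common_feature_df: Any
    maf_0_features: Any


# Placeholder tuple standing in for the real return value of transform.
raw = tuple(f"value_{i}" for i in range(10))
result = TransformResult(*raw)
bins = result.amino_acid_bins  # access by name instead of by index 3
```

With a fitted estimator, the same wrapper would be applied as TransformResult(*rare.transform(original_feature_matrix)), assuming the return order matches the one listed above.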