Feature Selection
filter_by_laplacian(X, layer, k=5, t=None, topN=100, correlation_threshold=0.9)
Filters features in a dataset based on Laplacian scores and removes highly correlated features, retaining only the top N features with the lowest scores and optionally considering correlation.
This function computes Laplacian scores for each feature in the dataset to measure its importance in capturing the data's intrinsic geometry. It then selects the top N features with the lowest scores. Additionally, it can remove features that are highly correlated beyond a specified threshold to reduce redundancy. The selection process can extend to include slightly more features before correlation filtering to ensure the best candidates are retained.
Parameters
X : pandas.DataFrame The input dataset with samples in rows and features in columns. layer : str Identifier for the dataset layer being processed (for logging purposes). k : int, optional The number of nearest neighbors to consider for computing the Laplacian score (default is 5). t : float, optional The heat kernel parameter for Laplacian score computation. If None, the default behavior applies without a heat kernel (default is None). topN : int, optional The number of top features to keep based on the lowest Laplacian scores (default is 100). correlation_threshold : float, optional The Pearson correlation coefficient threshold for identifying and removing highly correlated features. Features with a correlation above this threshold are considered redundant (default is 0.9).
Returns
X_selected : pandas.DataFrame The filtered data matrix with only the selected features retained. feature_log : pandas.DataFrame A DataFrame logging each feature's Laplacian score, correlation information (if applicable), and whether it was selected. It includes columns for 'feature', 'laplacian_score', 'correlated_with' (feature causing redundancy), 'correlation_score', and 'selected' (a boolean indicating if the feature is retained in the final selection).
Notes
- The function prioritizes features with lower Laplacian scores as they are considered more important for capturing the data's structure.
- Features are initially selected based on their Laplacian scores. If
correlation_threshold
is set below 1, the function further filters these features by removing those that are highly correlated with any already selected feature. - The process may select additional features beyond
topN
before correlation filtering to ensure that the best candidates are considered. The final number of features, however, is pruned totopN
.
Source code in flexynesis/feature_selection.py
153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 |
|
laplacian_score(X, k=5, t=None)
Calculate Laplacian Score for each feature in the dataset.
Parameters: X: numpy array (n_samples, n_features) - Input data k: int - Number of nearest neighbors to consider t: float - Heat kernel parameter (optional)
Returns: scores: (n_features,) - Laplacian Scores for each feature
Source code in flexynesis/feature_selection.py
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 |
|
remove_redundant_features(X, laplacian_scores, threshold, topN=None)
Removes features from the dataset based on correlation and Laplacian scores, optionally keeping only the top N features based on their scores, and manages redundant features.
This function evaluates features in a dataset for redundancy by measuring the correlation between them. If the absolute correlation between two features exceeds a specified threshold, the feature with the higher Laplacian score (a lower score is better) is considered redundant and is marked for removal. If the number of selected features is less than a specified top N, additional features are included from the redundant set based on their Laplacian scores until the top N count is reached.
Parameters
X : pandas DataFrame The input dataset with samples in rows and features in columns. laplacian_scores : numpy array An array of Laplacian scores corresponding to each feature in the dataset. Lower scores indicate features that better capture the intrinsic geometry of the data. threshold : float The correlation threshold above which a feature is considered redundant. Features with absolute correlation greater than this threshold with any selected feature are marked as redundant. topN : int, optional The desired number of features to select. If specified, the function will ensure that exactly topN features are selected, including from the redundant set if necessary. If None, no limit is applied to the number of selected features.
Returns
selected_features : list A list of indices of the features selected to be retained in the dataset. redundant_features_df : pandas DataFrame A DataFrame listing the features considered redundant, the indices of the features they are most correlated with, and the correlation scores.
Notes
- The function prioritizes the removal of features with higher Laplacian scores when deciding between two correlated features, under the assumption that lower scores indicate features more important for preserving data structure.
- Features initially marked as redundant but included during the top-up process to meet the topN requirement are removed from the redundant_features_df before it is returned.
Source code in flexynesis/feature_selection.py
58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 |
|