DS-3 Data reduction tasks using Scikit-learn
Including more features in a model makes it more complex and more likely to overfit the data. Some features can be pure noise and may actively harm the model. By removing those features, the model generalizes better.
The Iris dataset, available through the sklearn.datasets module, is used to perform this task.
Importing all required libraries,
Loading the Iris dataset. Now, let's look at the basic information about the dataset, starting with its shape.
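A minimal sketch of the import and loading step (variable names are illustrative):

```python
# Load the Iris dataset and inspect its shape.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)            # (150, 4): 150 samples, 4 features
print(iris.feature_names)
```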
To test the effectiveness of different feature selection methods, we add some noise features to the data set.
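One way to add noise features is to append random columns to the data; the number of noise features (20 here) and the fixed random seed are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Append 20 random (uninformative) noise features to the 4 real ones.
rng = np.random.RandomState(42)
noise = rng.normal(size=(X.shape[0], 20))
X_noisy = np.hstack([X, noise])

print(X_noisy.shape)  # (150, 24)
```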
Before applying any feature selection method, we need to split the data, because features should be selected using information from the training set only, not from the whole data set.
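A sketch of the split using train_test_split (the default 75/25 split and the random seed are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Stratify so each split keeps the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```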
Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so the data is unaffected here.
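The baseline can be sketched with VarianceThreshold at its default threshold of 0.0:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

X, y = load_iris(return_X_y=True)

# Default threshold=0.0 drops only zero-variance features.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)  # (150, 4): no feature was removed
```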
Univariate Feature Selection
- Univariate feature selection works by selecting the best features based on univariate statistical tests.
- Each feature is compared to the target variable to see whether there is a statistically significant relationship between them.
- When we analyze the relationship between one feature and the target variable, we ignore the other features. That is why it is called 'univariate'.
- Each feature has its own test score.
- Finally, all the test scores are compared, and the features with top scores will be selected.
- These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):
- For regression: f_regression, mutual_info_regression
- For classification: chi2, f_classif, mutual_info_classif
- Here we use f_classif (the ANOVA F-test).
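The steps above can be sketched with SelectKBest and f_classif; keeping k=2 features is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-scores against the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)       # one F-score per feature
print(selector.get_support()) # boolean mask of the selected features
print(X_new.shape)            # (150, 2)
```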
Recursive Feature Elimination
Recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. The estimator is first trained on the initial set of features, and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute.
Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
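The procedure can be sketched with sklearn's RFE; the choice of LogisticRegression as the estimator and of 2 as the target number of features are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively drop the least important feature until 2 remain.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of selected features
print(rfe.ranking_)  # rank 1 = selected; higher = eliminated earlier
```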
Principal Component Analysis (PCA)
If your learning algorithm is too slow because the input dimension is too high, then using PCA to speed it up can be a reasonable choice. This is probably the most common application of PCA. Another common application of PCA is for data visualization.
We will use PCA to reduce the 4-dimensional Iris data to 2 or 3 dimensions, so that we can plot it and hopefully understand the data better.
PCA Projection to 2D
The original data has 4 columns (sepal length, sepal width, petal length, and petal width). PCA converts this 4-dimensional data into 2 dimensions.
Concatenating the DataFrames along axis=1 gives finalDf, the final DataFrame used for plotting the data.
Visualize the DataFrame.
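The 2D projection and plot can be sketched as follows; standardizing before PCA and the column names PC1/PC2 are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # standardize before PCA

# Project the 4-dimensional data onto 2 principal components.
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X)
principalDf = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

# Concatenate the target labels along axis=1 for plotting.
finalDf = pd.concat(
    [principalDf, pd.Series(iris.target, name='target')], axis=1)

fig, ax = plt.subplots()
for label, name in enumerate(iris.target_names):
    subset = finalDf[finalDf['target'] == label]
    ax.scatter(subset['PC1'], subset['PC2'], label=name)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.legend()
plt.show()
```

The first two components capture most of the variance in the standardized data (see pca.explained_variance_ratio_), which is why the 2D scatter plot separates the classes reasonably well.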
PCA Projection to 3D
This converts the original 4-dimensional data into 3 dimensions.
Visualize the DataFrame.
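The 3D projection is analogous; a sketch using matplotlib's 3D axes (standardization is again an assumed preprocessing step):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

# Project the 4-dimensional data onto 3 principal components.
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)
print(X_3d.shape)  # (150, 3)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=iris.target)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
```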
Thank You !!