Preparing Your Data

In the previous chapter, we dealt with clean data, where all the values were available to us, all the columns had numeric values, and when faced with too many features, we had a regularization technique on our side. In real life, it will often be the case that the data is not as clean as you would like it to be. Sometimes, even clean data can be preprocessed in ways that make things easier for our machine learning algorithms. In this chapter, we will learn about the following data preprocessing techniques:

  • Imputing missing values
  • Encoding non-numerical columns
  • Changing the data distribution
  • Reducing the number of features via selection
  • Projecting data into new dimensions

Imputing missing values

"It is a capital mistake to theorize before one has data."
– Sherlock Holmes

To simulate a real-life scenario where the data has missing values, we will create a dataset with people's weights as a function of their height. Then, we will randomly remove 75% of the values in the height column and set them to NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'gender': np.random.binomial(1, .6, 100),
        'height': np.random.normal(0, 10, 100),
        'noise': np.random.normal(0, 2, 100),
    }
)

# Shift the heights according to the gender value
df['height'] = df['height'] + df['gender'].apply(
    lambda g: 150 if g else 180
)
# Keep each height with a 25% probability; otherwise, set it to NaN
df['height (with 75% NaN)'] = df['height'].apply(
    lambda x: x if np.random.binomial(1, .25, 1)[0] else np.nan
)
df['weight'] = df['height'] + df['noise'] - 110

We used a random number generator with an underlying binomial/Bernoulli distribution here to decide whether each sample will be removed. The distribution's n value is set to 1—that is, it is a Bernoulli distribution—and its p value is set to 0.25—that is, each sample has a 25% chance of staying. Whenever the generator returns 0, the sample is set to NaN. As you can see, due to the nature of the random generator, the final percentage of NaN values may be slightly more or less than 75%.

Here are the first four rows of the DataFrame that we have just created. Only the height column, with the missing values, and the weights are shown here:

We can also check what percentage of each column has missing values by using the following code:

df.isnull().mean()

When I ran the previous line, 77% of the values were missing. Note that you may get a different ratio of missing values than the ones I've got here, thanks to the random number generator used.

None of the regressors we have seen so far will accept this data with all the NaN values in it. Therefore, we need to convert those missing values into something. Deciding on which values to fill in place of the missing values is the job of the data imputation process.

There are different kinds of imputation techniques. We are going to try them here and observe their effect on our weight estimations. Keep in mind that we happen to know the original height data without any missing values, and we know that using a ridge regressor on the original data gives us an MSE value of 3.4. Let's keep this piece of information as a reference for now.
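
For reference, that baseline could be computed like this on the original, fully observed height column (your exact value will differ because of the randomly generated data):

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

reg = Ridge()
x, y = df[['height']], df['weight']
reg.fit(x, y)
mean_squared_error(y, reg.predict(x))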

Setting missing values to 0

One simple approach would be to set all the missing values to 0. The following code will make our data usable once more:

df['height (75% zero imputed)'] = df['height (with 75% NaN)'].fillna(0)

Fitting a ridge regressor on the newly imputed column will give us an MSE value of 365:

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

reg = Ridge()
x, y = df[['height (75% zero imputed)']], df['weight']
reg.fit(x, y)
mean_squared_error(y, reg.predict(x))

Although we were able to use the regressor, its error is huge compared to our reference scenario. To understand the effect of zero imputation, let's plot the imputed data and use the regressor's coefficients to see what kind of line it created after training. Let's also plot the original data for comparison. I am sure the code for generating the following graph is straightforward to you by now, so I'll skip it:

We already know by now that a linear model is only capable of fitting a continuous straight line onto the data (or a hyperplane, in the case of higher dimensions). We also know that 0 is not a reasonable height for anyone. Nevertheless, with zero imputation, we introduced a bunch of values where the heights are 0 and the weights range between 10 and 90 or so. This obviously confused our regressor, as we can see in the right-hand side graph.

A non-linear regressor, such as a decision tree, will be able to deal with this problem much better than its linear counterpart. Actually, for tree-based models, I'd suggest you try replacing the missing values in x with values that don't exist in your data. For example, you may experiment with setting the height to -1, in this case.
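
For instance, a quick variant along those lines could look like this:

# Use a sentinel value that cannot be a real height
df['height (75% -1 imputed)'] = df['height (with 75% NaN)'].fillna(-1)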

Setting missing values to the mean

Another name for the statistical mean is the expected value. That's because the sample mean serves as an unbiased estimate of the data's expected value. That said, replacing missing values with the column's mean sounds like a plausible idea.

In this chapter, I am fitting a regressor on the entire dataset. I am not concerned about splitting the data into training and test sets here, since I am mainly interested in how the regressor behaves with imputation. Nevertheless, in real life, you will want to learn the mean value from the training set only and use it to impute the missing values in both the training and test sets.

scikit-learn's SimpleImputer class makes it possible to learn the mean value from the training set and use it to impute both the training and test sets. It does so by using our favorite fit() and transform() methods. But let's stick to the one-step fit_transform() function here since we only have one set:

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
df['height (75% mean imputed)'] = imp.fit_transform(
    df[['height (with 75% NaN)']]
)[:, 0]

We have a single column to impute here, which is why I used [:, 0] to access its values after imputation.

A ridge regressor will give us an MSE value of 302. To understand where this improvement came from, let's plot the model's decision and compare it to the previous one with zero imputation:

Clearly, the model's decisions make more sense now. You can see how the dotted line coincides with the actual non-imputed data points.

In addition to the mean strategy, SimpleImputer can also use the median of the training data. The median is usually a better option if your data has outliers. In the case of non-numerical features, you should use the most_frequent strategy instead.
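
For example, those alternative strategies could be configured like this:

median_imp = SimpleImputer(missing_values=np.nan, strategy='median')
frequent_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')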

Using informed estimations for missing values

Using a single value for all missing values may not be ideal. For example, we know here that our data includes male and female samples and each sub-sample has a different average height. The IterativeImputer() method is an algorithm that can use neighboring features to estimate the missing values in a certain feature. Here, we use the gender information to infer values to use when imputing the missing heights:

# We need to enable the module first since it is an experimental one
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(missing_values=np.nan)
df['height (75% iterative imputed)'] = imp.fit_transform(
    df[['height (with 75% NaN)', 'gender']]
)[:, 0]

We now have two values to be used for imputation:

The MSE value is 96 this time. This strategy is the clear winner here.

We only had one feature with missing values here. In the case of multiple features, the IterativeImputer() method loops over all the features. It uses all the features but one to predict the missing values of the remaining one via regression. Once it is done looping over all the features, it may repeat the entire process more than once until the values converge. There are parameters to decide which regression algorithm to use, what order to loop over the features in, and the maximum number of iterations allowed. Clearly, this strategy may be computationally expensive with bigger datasets and a higher number of incomplete features. Furthermore, the IterativeImputer() implementation is still experimental, and its API might change in the future.

A column with too many missing values carries too little information for our estimate to use. We can try our best to impute those missing values; but nevertheless, dropping the entire column and not using it at all is sometimes the best option, especially if the majority of the values are missing.

Encoding non-numerical columns

"Every decoding is another encoding."
– David Lodge

Non-numerical data is another issue that most algorithm implementations cannot deal with directly. In addition to the core scikit-learn implementation, scikit-learn-contrib has a list of satellite projects. These projects provide additional tools for our data arsenal, and here is how they describe themselves:

"scikit-learn-contrib is a GitHub organization for gathering high-quality, scikit-learn - compatibleprojects. It also provides a template for establishing new scikit-learn compatible projects."

We are going to use one of these projects here—category_encoders. This allows us to encode non-numerical data into different forms. First, we will install the library using the pip installer, as follows:

pip install category_encoders

Before jumping into the different encoding strategies, let's first create a fictional dataset to play with:

df = pd.DataFrame({
    'Size': np.random.choice(['XS', 'S', 'M', 'L', 'XL', 'XXL'], 10),
    'Brand': np.random.choice(['Nike', 'Puma', 'Adidas', 'Le Coq', 'Reebok'], 10),
})

We will then split it into two equal halves:

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.5)
Keep in mind that the core scikit-learn library implements two of the encoders we are going to see here—preprocessing.OneHotEncoder and preprocessing.OrdinalEncoder. Nevertheless, I prefer the category_encoders implementation for its richness and versatility.

Now, on to our first, and most popular, encoding strategy—one-hot encoding.

One-hot encoding

One-hot encoding, also known as dummy encoding, is the most common method for dealing with categorical features. If you have a column containing the red, green, and blue values, it sounds logical to convert them into three columns—is_red, is_green, and is_blue—and fill these columns with ones and zeroes, accordingly.

Here is the code for encoding our datasets using OneHotEncoder:

from category_encoders.one_hot import OneHotEncoder
encoder = OneHotEncoder(use_cat_names=True, handle_unknown='return_nan')
x_train = encoder.fit_transform(df_train)
x_test = encoder.transform(df_test)

I set use_cat_names=True so that the category values are used when naming the newly created columns. The handle_unknown parameter tells the encoder how to deal with values in the test set that don't exist in the training set. For example, we have no clothing of the XS or S sizes in our training set. We also don't have any Adidas clothing in there. That's why these records in the test set are converted to NaN:

You still have to impute those NaN values. Alternatively, we can set those values to 0 by setting handle_unknown to 'value'.
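
For example, something along these lines (the 'value' option name is as described above):

encoder = OneHotEncoder(use_cat_names=True, handle_unknown='value')
x_train = encoder.fit_transform(df_train)
x_test = encoder.transform(df_test)  # unseen categories are encoded as zeros instead of NaN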

One-hot encoding is recommended for linear models and K-Nearest Neighbor (KNN) algorithms. Nevertheless, due to the fact that one column may be expanded into too many columns and some of them may be inter-dependent, regularization or feature selection are recommended here. We will look further at feature selection later in this chapter, and the KNN algorithm will be discussed later in this book.

Ordinal encoding

Depending on your use case, you may need to encode your categorical values in a way that reflects their order. If I am going to use this data to predict the level of demand for the items, then I know that it isn't the case that the larger the item's size, the higher the demand for it. So, one-hot encoding may still be apt for the sizes here. However, if we are to predict the amount of material needed to create each item of clothing, then we need to encode the sizes in a way that implies that XL needs more material than L. In this case, we are concerned with the order of those values and so we use OrdinalEncoder, as follows:

from category_encoders.ordinal import OrdinalEncoder

oencoder = OrdinalEncoder(
    mapping=[
        {
            'col': 'Size',
            'mapping': {'XS': 1, 'S': 2, 'M': 3, 'L': 4, 'XL': 5}
        }
    ]
)

df_train.loc[:, 'Size [Ordinal Encoded]'] = oencoder.fit_transform(
    df_train['Size']
)['Size'].values
df_test.loc[:, 'Size [Ordinal Encoded]'] = oencoder.transform(
    df_test['Size']
)['Size'].values

Note that we have to specify the mapping by hand. We want XS to be encoded as 1, S as 2, and so on. As a result, we get the following DataFrame:

This time, the encoded data fits into just one column, and the values missing from the training set are encoded as -1.

This encoding method is recommended for non-linear models, such as decision trees. Linear models, on the other hand, may interpret XL (encoded as 5) as being five times the size of XS (encoded as 1). That's why one-hot encoding is still preferred for linear models. Furthermore, coming up with meaningful mappings and setting them by hand can be time-consuming.

Target encoding

One obvious way to encode categorical features, in a supervised learning scenario, is to base the encoding on the target values. Say we want to estimate the price of an item of clothing. We can replace the brand names with the average price for all items of the same brand in our training dataset. Nevertheless, there is one obvious problem here. Say one brand happens to appear only once or twice in our training set. There is no guarantee that these few appearances are good representations of the brand's price. In other words, using the target values just like that may result in overfitting, and the resulting model may not generalize well when dealing with new data. That's why the category_encoders library has multiple variations of target encoding; they all have the same underlying objective, but each of them has a different method for dealing with the aforementioned overfitting issue. Here are some examples of these implementations:

  • Leave-one-out cross-validation
  • The target encoder
  • The catboost encoder
  • The M-estimator

Leave-one-out is probably the most well-known of the implementations listed. In the training data, it replaces each row's categorical value with the mean of the target values of all the other rows that share the same categorical value, excluding the current row itself. For the test data, it just uses the mean of the corresponding targets for each category value learned from the training data. Furthermore, the encoder also has a parameter called sigma, which allows you to add noise to the learned means to prevent even more overfitting.
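
Here is a hedged sketch of how the leave-one-out encoder could be applied to our Brand column, assuming a hypothetical price_train target series (not part of the dataset created above):

from category_encoders.leave_one_out import LeaveOneOutEncoder

# sigma adds Gaussian noise to the learned means during training to reduce overfitting
loo_encoder = LeaveOneOutEncoder(cols=['Brand'], sigma=0.05)
brand_train_encoded = loo_encoder.fit_transform(df_train[['Brand']], price_train)  # price_train is hypothetical
brand_test_encoded = loo_encoder.transform(df_test[['Brand']])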

Homogenizing the columns' scale

Different numerical columns may have different scales. An age column typically holds values in the tens, while a salary column holds values in the thousands. As we saw earlier, putting different columns onto a similar scale helps in some cases. Here are some of the cases where scaling is recommended:

  • It allows gradient-descent solvers to converge more quickly.
  • It is needed for algorithms such as KNN and Principal Component Analysis (PCA).
  • When training an estimator, it puts the features on a comparable scale, which helps when juxtaposing their learned coefficients.

In the next sections, we are going to examine the most commonly used scalers.

The standard scaler

This standardizes the features by setting their mean to 0 and their standard deviation to 1 (it does not change the shape of a feature's distribution; it only shifts and rescales it). This is done using the following operation, where a column's mean value is subtracted from each value in it, and then the result is divided by the column's standard deviation:
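
(In symbols, where μ and σ denote a column's mean and standard deviation:)

$$ z = \frac{x - \mu}{\sigma} $$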

The scaler's implementation can be used as follows:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

Once fitted, you can also find out the mean and variance for each column in the training data via the mean_ and var_ attributes. In the presence of outliers, the standard scaler does not guarantee balanced feature scales.

The MinMax scaler

This squeezes the features into a certain range, typically between 0 and 1. If you need to use a different range, you can set it using the feature_range parameter. This scaler works as follows:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

Once fitted, you can also find out the minimum and maximum values for each column in the training data via the data_min_ and data_max_ attributes. Since all samples are limited to a predefined range, outliers may force inliers to be squeezed into a small subset of this range.
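
For reference, the underlying transformation for the default [0, 1] range can be written as:

$$ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} $$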

RobustScaler

This is similar to the standard scaler, but uses the data quantiles instead to be more robust to the outliers' effect on the mean and standard deviation. It's advised that you use this if your data has outliers, and it can be used as follows:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

Other scalers also exist; however, I have covered the most commonly used scalers here. Throughout this book, we will be using the aforementioned scalers. All scalers have an inverse_transform() method, so you can restore a feature's original scales if needed. Furthermore, if you cannot load all training data into memory at once, or if the data comes in batches, you can then call the scaler's partial_fit() method with each batch instead of calling the fit() method for the entire dataset once.
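
Here is a minimal sketch of that batch-wise workflow, assuming the data arrives as chunks of a hypothetical large CSV file:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Update the scaler's running statistics one chunk at a time
for chunk in pd.read_csv('big_file.csv', chunksize=10_000):  # hypothetical file name
    scaler.partial_fit(chunk[feature_names])  # feature_names is a placeholder list of columns

# Once all chunks have been seen, call scaler.transform() on each chunk as needed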

Selecting the most useful features

"More data, such as paying attention to the eye colors of the people around when crossing the street, can make you miss the big truck."
– Nassim Nicholas Taleb

We have seen, in previous chapters, that too many features can degrade the performance of our models. What is known as the curse of dimensionality may negatively impact an algorithm's accuracy, especially if there aren't enough training samples. Furthermore, it can also lead to more training time and higher computational requirements. Luckily, we have also learned how to regularize our linear models or limit the growth of our decision trees to combat the effect of feature abundance. Nevertheless, we may sometimes end up using models where regularization is not an option. Additionally, we may still need to get rid of some pointless features to reduce the algorithm's training time and computational needs. In these situations, feature selection is a wise first step.

Depending on whether we are dealing with labeled or unlabeled data, we can choose different methods for feature selection. Furthermore, some methods are more computationally expensive than others, and some lead to more accurate results. In the following sections, we are going to see how those different methods can be used and, to demonstrate that, we will load scikit-learn's wine dataset:

from sklearn import datasets

wine = datasets.load_wine()
wine = datasets.load_wine()
df = pd.DataFrame(
    wine.data,
    columns=wine.feature_names
)
df['target'] = pd.Series(
    wine.target
)

We then split the data as we usually do:

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.4)

x_train = df_train[wine.feature_names]
x_test = df_test[wine.feature_names]

y_train = df_train['target']
y_test = df_test['target']

The wine dataset has 13 features and is used for classification tasks. In the following sections, we are going to discover which features are less important than the others.

VarianceThreshold

If you recall, when we used the PolynomialFeatures transformer, it added a column where all the values were set to 1. Additionally, categorical encoders, such as one-hot encoding, can result in columns where almost all of the values are 0. It's also common, in real-life scenarios, to have columns where all the values are identical or almost identical. Variance is the most obvious way to measure the amount of variation in a dataset, so VarianceThreshold allows us to set a minimum threshold for the accepted variance of each feature. In the following code, we will set the variance threshold to 0. The transformer then goes through the training set to learn which features deserve to stay:

from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=0)
vt.fit(x_train)

Like all of our other modules, this one also provides the usual fit(), transform(), and fit_transform() methods. However, I prefer not to use them here since we already gave our columns names, and the transform() functions don't honor the names we have given. That's why I prefer to use another method called get_support(). This method returns a list of Booleans, where any False values correspond to columns that ought to be removed based on the threshold we set. Here is how I remove unnecessary features using the pandas library's iloc function:

x_train = x_train.iloc[:, vt.get_support()]
x_test = x_test.iloc[:, vt.get_support()]

We can also print the feature names and sort them according to their variance, as follows:

pd.DataFrame(
    {
        'Feature': wine.feature_names,
        'Variance': vt.variances_,
    }
).sort_values(
    'Variance', ascending=True
)

This gives us the following table:

We can see that none of our features have zero variance; therefore, none of them are removed. You may decide to use a higher threshold—for example, setting the threshold to 0.05 will get rid of nonflavanoid_phenols. However, let me list the key advantages and disadvantages of this module to help you decide when and how to use it:

  • Unlike the other feature selection methods we are going to see in a bit, this one does not use data labels when selecting features. This is useful when dealing with unlabeled data, as in unsupervised learning scenarios.
  • The fact that it is label-agnostic also means that a low-variance feature might still correlate well with our labels, and removing it would be a mistake.
  • The variance, just like the mean, is scale-dependent. A list of the numbers from 1 to 10 has a variance of 8.25, while the list 10, 20, 30, ..., 100 has a variance of 825.0. We can clearly see this in the variance of proline. This makes the numbers in our table incomparable and makes it hard to pick a correct threshold. One idea is to scale your data before calculating its variance. However, keep in mind that you cannot use StandardScaler since it deliberately unifies the variance of all features, so MinMaxScaler is more meaningful here (see the sketch after this list).
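
Here is one way that idea could look in practice, a hedged sketch that scales the wine features to [0, 1] before applying a variance threshold (the 0.01 threshold is an arbitrary choice for illustration):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

selector = Pipeline([
    ('scale', MinMaxScaler()),            # put all features on a comparable [0, 1] scale
    ('select', VarianceThreshold(0.01)),  # then drop the (nearly) constant ones
])
x_train_selected = selector.fit_transform(x_train)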

In summary, I find the variance threshold handy in removing zero-variance features. As for the remaining features, I'd let the next feature selection algorithms deal with them, especially when dealing with labeled data.

Filters

Now that our data comes with labels, it makes sense to use the correlation between each feature and the labels to decide which features are more useful for our model. This category of feature selection algorithms, known as filters, deals with each individual feature separately and measures its usefulness in relation to the label. In other words, the algorithm takes each column in x and uses some measure to evaluate how useful it is in predicting y. Useful columns stay, while the rest are removed. The way usefulness is measured is what differentiates one filter selector from another. For the sake of clarity, I am going to focus on two selectors here, since each has its roots in a different scientific field, and understanding both serves as a good foundation for future concepts. The two concepts are ANOVA (F-values) and mutual information.

f-regression and f-classif

As its name suggests, f_regression is used for feature selection in regression tasks, while f_classif is its classification cousin. f_regression has its roots in the field of statistics. Its scikit-learn implementation uses the Pearson correlation coefficient to calculate the correlation between each column in x and y. The results are then converted into F-values and P-values, but let's set that conversion aside since the correlation coefficient is the key here. We start by subtracting the mean value of each column from all the values in the same column, which is similar to what we did in StandardScaler, but without dividing the values by their standard deviation. Then, we calculate the correlation coefficient using the following formula:
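
(Here, x and y are the mean-centered feature and label values; the numerator is their dot product and the denominator is the product of their magnitudes.)

$$ r = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}} $$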

Since the mean is subtracted, the values of x and y are positive when an instance is above its column's mean and negative when it is below. So, this equation is maximized when, every time x is above average, y is also above average, and whenever x is below average, y follows suit. The maximum value of this equation is 1, in which case we say that x and y are perfectly correlated. The equation gives -1 when x and y stubbornly go in opposite ways, in other words, when they are negatively correlated. A result of zero means that x and y are uncorrelated (that is, independent or orthogonal).

Usually, statisticians write this equation differently. The fact that the mean is subtracted from x and y is usually written down as a part of the equation. Then, the numerator is clearly the covariance and the denominator is the product of the two standard deviations. Nevertheless, I deliberately chose not to follow the statistical convention here so that our natural language processing friends feel at home once they realize that this is the exact same equation as for cosine similarity. There, x and y are seen as vectors, the numerator is their dot product, and the denominator is the product of their magnitudes. Consequently, the two vectors are perfectly correlated (they point in the same direction) when the angle between them is 0 (cos 0 = 1). Conversely, they are independent when they are perpendicular to each other, hence the term orthogonal. One takeaway from this visual interpretation is that this metric only captures the linear relationship between x and y.

For the case of classification, a one-way ANOVA test is performed. This compares the variance between the different class labels to the variance within each class. Just like its regression cousin, it measures the linear dependence between the features and the class labels.

Enough theory for now; let's use f_classif to pick the most useful features in our dataset:

from sklearn.feature_selection import f_classif
f, p = f_classif(x_train, y_train)

Let's keep the resulting f and p values to one side for now. After explaining the mutual information approach for feature selection, we will use these values to contrast the two approaches.

Mutual information

This approach has its roots in a different scientific field called information theory. This field was introduced by Claude Shannon to solve issues relating to signal processing and data compression. When we send a message made up of zeros and ones, we may know the exact content of this message, but can we actually quantify the amount of information this very message carries? Shannon solved this problem by borrowing the concept of entropy from thermodynamics. Further down the line comes the concept of mutual information, which quantifies the amount of information obtained about one variable when observing another variable. The formula for mutual information is as follows:
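
(This is the standard form for two discrete variables, summing over all of their possible values.)

$$ MI(x; y) = \sum_{x}\sum_{y} P(x, y)\,\log\frac{P(x, y)}{P(x)\,P(y)} $$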

Before dissecting this equation, keep the following in mind:

  • P(x) is the probability of x taking a certain value, as is P(y) for y.
  • P(x, y) is known as joint probability, which is the probability of both x and y taking a specific pair of values.
  • P(x, y) only equals the product P(x) * P(y) if x and y are independent. Otherwise, its value is more or less than their product, depending on whether x and y are positively or negatively correlated.

The double summation and the first part of the equation, P(x, y), are our way of calculating a weighted average for all possible values of x and y. The logarithmic part is what we care about, and it is known as point-wise mutual information. If x and y are independent, the fraction is equal to 1 and its logarithm is 0. In other words, we get 0 when the two variables are uncorrelated. Otherwise, the sign of the outcome points to whether x and y are positively or negatively correlated.

Here is how we get the mutual information coefficient for each feature:

from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(x_train, y_train)

Unlike Pearson's correlation coefficient, mutual information captures any kind of correlation, whether it is linear or not.

Comparing and using the different filters

Let's now compare our mutual information scores to the F-values. To do so, we will put them both into one DataFrame and use the pandas styling feature to plot bar charts within the DataFrame, as follows:

pd.DataFrame(
    {
        'Feature': wine.feature_names,
        'F': f,
        'MI': mi,
    }
).sort_values(
    'MI', ascending=False
).style.bar(
    subset=['F', 'MI'], color='grey'
)

This gives us the following DataFrame:

As you can see, they mostly agree on the order of feature importance, yet they still disagree sometimes. I used each of the two methods to select the top four features, then compared the accuracy of a logistic regression classifier to that of a decision tree classifier with each feature selection method. Here are the resulting accuracy scores on the training set:

As you can tell, each of the two selection methods worked better for one of the two classifiers here. It seems that f_classif served the linear model better due to its linear nature, while the non-linear model favored an algorithm that captures non-linear correlations. I have not found any literature confirming the generality of this speculation, however.
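
If you want to reproduce a comparison along these lines, here is one possible sketch using SelectKBest (your exact numbers will differ because of the random train/test split):

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

for name, score_func in [('F', f_classif), ('MI', mutual_info_classif)]:
    # Keep only the four highest-scoring features according to each filter
    selector = SelectKBest(score_func, k=4)
    x_train_top4 = selector.fit_transform(x_train, y_train)
    for clf in [LogisticRegression(max_iter=10_000), DecisionTreeClassifier()]:
        clf.fit(x_train_top4, y_train)
        print(name, clf.__class__.__name__, clf.score(x_train_top4, y_train))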

It is hard not to see the underlying theme that links the two measures. The numerator calculates how the two variables vary together—the covariance, dot product, or joint probability. The denominator calculates the product of quantities computed on each variable separately—the standard deviations, norms, or marginal probabilities. This very theme will continue to appear in different topics in the future. One day, we might use cosine similarity to compare two documents; another day, we might use mutual information to evaluate a clustering algorithm.

Evaluating multiple features at a time

The feature selection methods shown in the Filters section of this chapter are also regarded as univariate feature selection methods, since they check each feature separately before deciding whether to keep it. This can result in either of the following two issues:

  • If two features are highly correlated, we only want to keep one of them. However, due to the nature of univariate feature selection, both will still be selected.
  • If two features are not very useful on their own, yet their combination is useful, they will still be removed due to the way univariate feature selection methods work.

To deal with these issues, we may decide to use one of the following solutions:

  • Using estimators for feature selection: Typically, regressors and classifiers assign values to the features they use after training, signifying their importance. So, we can use an estimator's coefficients (or feature importances) to add or remove features from our initial feature set. scikit-learn's Recursive Feature Elimination (RFE) algorithm starts with an initial set of features and then iteratively removes features using the trained model's coefficients. The SelectFromModel algorithm is a meta-transformer that can make use of a regularized model to remove features with zero or near-zero coefficients (see the sketch after this list).
  • Using estimators with built-in feature selection: In other words, this means using a regularized estimator such as lasso, where feature selection is part of the estimator's objectives.
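
Here is a brief sketch of both ideas on the wine features from earlier, shown for illustration rather than as a tuned recipe:

from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

# Recursive Feature Elimination: repeatedly drop the weakest features until four remain
rfe = RFE(LogisticRegression(max_iter=10_000), n_features_to_select=4)
rfe.fit(x_train, y_train)
print(x_train.columns[rfe.get_support()])

# SelectFromModel with an L1-regularized model: keep features with non-negligible coefficients
sfm = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'))
sfm.fit(x_train, y_train)
print(x_train.columns[sfm.get_support()])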

In summary, methods such as using variance thresholds and filters are quick to perform but have their drawbacks when it comes to feature correlation and interaction. More computationally expensive methods, such as wrappers, deal with these issues but are prone to overfitting.

If you ask me about my recommendations for feature selection, personally, my go-to method would be regularization after removing the zero-variance features, unless I am dealing with a huge number of features where training on the entire set is unfeasible. In that case, I'd use a univariate feature selection method while being careful about removing features that might end up being useful. I'd still use a regularized model afterward to deal with any multicollinearity.

In the end, the proof of the pudding is in the eating, and empirical results via trial and error may trump my recommendations. Furthermore, besides improving the final model's accuracy, feature selection can still be used to understand the data at hand. The feature importance scores can still be used to inform business decisions. For example, if our label states whether a user is going to churn, we can come up with a hypothesis that the top-scoring features affect the churn rate the most. Then, we can run experiments by changing the relevant parts of our product to see whether we can decrease the churn rate.

Summary

Pursuing a data-related career requires a tendency to deal with imperfections. Dealing with missing values is one step that we cannot progress without. So, we started this chapter by learning about different data imputation methods. Additionally, suitable data for one task may not be perfect for another. That's why we learned about feature encoding and how to change categorical and ordinal data to fit into our machine learning needs. Helping algorithms to perform better can require rescaling the numerical features. Therefore, we learned about three scaling methods. Finally, data abundance can be a curse on our models, so feature selection is one prescribed way to deal with the curse of dimensionality, along with regularization.

One main theme that ran through this entire chapter is the trade-off between simple and quick methods versus more informed and computationally expensive methods that may result in overfitting. Knowing which methods to use requires an understanding of their underlying theories, in addition to a willingness to experiment and use iterations. So, I decided to go a bit deeper into the theoretical background where needed, not only so that it helps you pick your methods wisely, but also so that it allows you to come up with your own methods in the future.

Now that we have the main data preprocessing tools on our side, we are ready to move on to our next algorithm—KNN.