Feature engineering is the process of transforming raw features into more informative features that can be used in modeling or EDA tasks.
Feature Functions
As the number of features grows, we can capture arbitrarily complex relationships.
Suppose we wish to develop a model to predict a vehicle's fuel efficiency ("mpg") as a function of its horsepower ("hp"). Glancing at the plot below, we see that the relationship between "mpg" and "hp" is non-linear.
A simple linear regression (SLR) fit doesn't capture the relationship between the two variables.
Our standard multiple linear regression model, in its current form, predicts "mpg" using only "hp":

mpg = θ0 + θ1(hp)

By introducing the squared horsepower as an additional feature, we can do a better job of capturing the non-linear relationship between the two variables:

mpg = θ0 + θ1(hp) + θ2(hp)^2

This model is still linear in θ - the prediction is a linear combination of the model parameters.
This means that we can use the same linear algebra methods as before to derive the optimal model parameters when fitting the model.
Although the model contains non-linear x terms, it is linear with respect to the model parameters θi.
Because our OLS derivation only assumed that the model is linear in θi, the method is still valid for fitting this new model.
If we refit the model with "hp" squared as its own feature, we see that the model follows the data much more closely.
By squaring the "hp" feature, we were able to create a new feature that significantly improved the quality of our model.
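As a minimal sketch of this refit (assuming seaborn's built-in "mpg" dataset as the source of the vehicle data), we can build the squared feature ourselves and solve for the optimal parameters with the same least-squares machinery:

import numpy as np
import seaborn as sns

# Load the vehicle data; rows with missing values are dropped
vehicles = sns.load_dataset("mpg").dropna(subset=["horsepower", "mpg"])
hp = vehicles["horsepower"].to_numpy()
y = vehicles["mpg"].to_numpy()

# Design matrices: [1, hp] for SLR, [1, hp, hp^2] with the engineered feature
X_slr = np.column_stack([np.ones_like(hp), hp])
X_quad = np.column_stack([np.ones_like(hp), hp, hp ** 2])

# The model is still linear in theta, so least squares applies unchanged
theta_slr, *_ = np.linalg.lstsq(X_slr, y, rcond=None)
theta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

mse = lambda X, theta: np.mean((y - X @ theta) ** 2)
print("SLR MSE:      ", mse(X_slr, theta_slr))
print("Quadratic MSE:", mse(X_quad, theta_quad))

The quadratic fit's lower training MSE reflects the improvement visible in the plot.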
A feature function is some function applied to the original variables in the data to generate one or more new features.
One Hot Encoding
One hot encoding is a feature engineering technique that generates numeric features from categorical data, allowing us to use our usual methods to fit a regression model on the data.
import numpy as np
import seaborn as sns

# Sample 100 rows from seaborn's built-in tips dataset
np.random.seed(1337)
tips_df = sns.load_dataset("tips").sample(100)
tips_df[["day"]].head(5)
It doesn't seem possible to fit a regression model to this data - we can't directly perform any mathematical operations on the entry "Thur".
To resolve this, we create a new table with a feature for each unique value in the original "day" column.
We then iterate through the "day" column. For each entry in "day", we fill the corresponding feature in the new table with 1; all other features are set to 0.
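A minimal hand-rolled sketch of that procedure (a plain loop purely for illustration; scikit-learn's encoder below does the same work for us):

import pandas as pd

# One new column per unique day, initialized to 0
manual_ohe = pd.DataFrame(0, index=tips_df.index,
                          columns=list(tips_df["day"].unique()))
# Fill the column matching each row's day with 1
for idx, day in tips_df["day"].items():
    manual_ohe.loc[idx, day] = 1
manual_ohe.head(5)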
Example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Perform the one hot encoding
oh_enc = OneHotEncoder()
oh_enc.fit(tips_df[['day']])
ohe_data = oh_enc.transform(tips_df[['day']]).toarray()

# Combine the one-hot columns with the original features
data_w_ohe = tips_df.join(
    pd.DataFrame(ohe_data, columns=oh_enc.get_feature_names_out(), index=tips_df.index)
)
data_w_ohe
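With the encoded columns joined in, we can fit a regression model in the usual way. A quick sketch (the choice of predicting "tip" from "total_bill" plus the day indicators is just for illustration):

from sklearn.linear_model import LinearRegression

# Numeric bill amount plus the one-hot day columns as features
ohe_cols = list(oh_enc.get_feature_names_out())
X = data_w_ohe[["total_bill"] + ohe_cols]
y = data_w_ohe["tip"]

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)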
Higher-order Polynomial Example
We fit models from order 0 (the constant model) up to order 5 (polynomial features through horsepower to the fifth power).
We observe a small improvement in MSE with each added term, and the MSE continues to decrease marginally as we add more and more terms to our model.
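A minimal sketch of that comparison with scikit-learn's PolynomialFeatures, reusing the vehicles data loaded above:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

hp_2d = vehicles[["horsepower"]]
for order in range(6):
    # include_bias supplies the constant column, so order 0 is the constant model
    poly = PolynomialFeatures(degree=order, include_bias=True)
    X_poly = poly.fit_transform(hp_2d)
    model = LinearRegression(fit_intercept=False).fit(X_poly, vehicles["mpg"])
    print(order, mean_squared_error(vehicles["mpg"], model.predict(X_poly)))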
Variance and Training Error
To take full advantage of capturing non-linear relationships, we might be inclined to design increasingly complex features.
- Model with order 1: mpg = θ0 + θ1(hp)
- Model with order 2: mpg = θ0 + θ1(hp) + θ2(hp)^2
- Model with order 4: mpg = θ0 + θ1(hp) + θ2(hp)^2 + ... + θ4(hp)^4
We find that the training MSE decreases with increasingly complex models.
The training error is the model's error when generating predictions from the same data that was used for training. We conclude that the training error goes down as the complexity of the model increases.
The sensitivity of the model to the data used to train it is called the model variance. Highly complex models follow their particular training set very closely, so refitting them on different data can change them dramatically; model variance therefore tends to increase with model complexity.
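One way to see this sensitivity directly is to refit the same high-order model on different random subsamples of the data and compare the predictions it makes at a fixed input. A rough sketch, again reusing the vehicles data:

# Refit an order-4 polynomial on different random samples and compare
# the prediction each fit makes at the same input (hp = 100)
probe = pd.DataFrame({"horsepower": [100]})
for seed in range(3):
    sample = vehicles.sample(50, random_state=seed)
    poly = PolynomialFeatures(degree=4, include_bias=True)
    X_sample = poly.fit_transform(sample[["horsepower"]])
    model = LinearRegression(fit_intercept=False).fit(X_sample, sample["mpg"])
    print(seed, model.predict(poly.transform(probe)))

The more these predictions move from sample to sample, the higher the model variance.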
Overfitting
The phenomenon above is called overfitting. The model effectively just memorized the training data it encountered when it was fitted, leaving it unable to generalize to new situations.
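A standard way to expose this behavior is to evaluate the model on data held out from fitting; a minimal sketch using a train/test split (the orders shown are arbitrary):

from sklearn.model_selection import train_test_split

train, test = train_test_split(vehicles, test_size=0.25, random_state=42)
for order in [1, 2, 4, 8]:
    poly = PolynomialFeatures(degree=order, include_bias=True)
    X_train = poly.fit_transform(train[["horsepower"]])
    X_test = poly.transform(test[["horsepower"]])
    model = LinearRegression(fit_intercept=False).fit(X_train, train["mpg"])
    print(order,
          mean_squared_error(train["mpg"], model.predict(X_train)),
          mean_squared_error(test["mpg"], model.predict(X_test)))

Training error keeps falling as the order grows, but if the model is overfitting, the held-out error eventually stops improving.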