Feature Engineering

Viktoria Karamyshau
2 min read · Jul 7, 2021

#RefreshingStatistics Day 6

Feature engineering is a crucial step in predictive modelling (also called predictive analytics).

Predictive analytics is a modelling process that uses statistics to predict future outcomes.

Predictive modeling cycle:

  1. Data cleaning
  2. Feature engineering
  3. Model building
  4. Model deployment
  5. Model updating
  6. Repeat steps 3–4

Feature engineering is the process of applying domain knowledge to transform raw data into features (variables, attributes). The objective of feature engineering is to reduce the modelling error for a given prediction target.

Feature engineering process:

  1. Testing a baseline model on the existing feature set
  2. Creating new features
  3. Testing the impact of the new features on the task
  4. Improving the features if needed
  5. Repeating the cycle
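
A minimal sketch of this loop using pandas and scikit-learn. The synthetic dataset, the rooms_per_household feature, and the choice of a linear model are illustrative assumptions, not a prescription:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "total_rooms": rng.integers(100, 5000, size=500),
    "households": rng.integers(50, 1000, size=500),
})
# Hypothetical target that depends on the ratio of the two raw features.
df["price"] = 3.0 * df["total_rooms"] / df["households"] + rng.normal(0, 0.5, 500)

# 1. Test a baseline model on the existing feature set.
baseline = cross_val_score(
    LinearRegression(), df[["total_rooms", "households"]], df["price"],
    cv=5, scoring="r2").mean()

# 2. Create a new feature using domain knowledge.
df["rooms_per_household"] = df["total_rooms"] / df["households"]

# 3. Test the impact of the new feature on the task.
engineered = cross_val_score(
    LinearRegression(),
    df[["total_rooms", "households", "rooms_per_household"]], df["price"],
    cv=5, scoring="r2").mean()

print(f"baseline R^2: {baseline:.3f}, with new feature: {engineered:.3f}")
# 4.-5. Keep, refine, or discard the feature based on the comparison; repeat.
```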

Types of new features being created:

  • numerical
  • categorical
  • bucketized
  • crossed
  • embedding
  • hashed
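
A rough pandas sketch of most of these types; the column names, bucket edges, and bucket count are made up for illustration. Embeddings are the exception: they are dense vectors learned inside a model rather than computed up front.

```python
import zlib
import pandas as pd

df = pd.DataFrame({
    "age": [23, 41, 35, 67],           # numerical: used as-is or scaled
    "city": ["NY", "SF", "NY", "LA"],  # categorical
})

# Bucketized: discretize a numerical feature into ranges.
df["age_bucket"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                          labels=["young", "middle", "senior"])

# Crossed: combine two features into a single composite feature.
df["city_x_age_bucket"] = df["city"] + "_" + df["age_bucket"].astype(str)

# Hashed: map a high-cardinality feature into a fixed number of buckets
# (crc32 is used because Python's built-in hash() is not stable across runs).
NUM_BUCKETS = 8
df["city_hashed"] = df["city"].map(lambda v: zlib.crc32(v.encode()) % NUM_BUCKETS)

# Categorical: one-hot encode the original category.
df = pd.get_dummies(df, columns=["city"])
print(df)
```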

Feature engineering approaches:

  • manual
  • automatic
  • combined

Feature engineering types:

  • based on interactions between two or more features (e.g. sum, product, etc.)
  • representing the same feature in a different way (e.g. grouping similar categories, ranking, transforming categorical features using label or one-hot encoding, etc.)
  • using indicator variables
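
As a small illustration of all three types, assuming hypothetical width, height, and color features and an arbitrary 1.5 threshold:

```python
import pandas as pd

df = pd.DataFrame({
    "width": [2.0, 3.5, 1.2],
    "height": [1.0, 2.0, 4.0],
    "color": ["red", "crimson", "blue"],
})

# 1. Interactions between two features: sum and product.
df["w_plus_h"] = df["width"] + df["height"]
df["area"] = df["width"] * df["height"]

# 2. Representing the same feature differently: group similar
#    categories, then one-hot encode the result.
df["color_group"] = df["color"].replace({"crimson": "red"})
df = pd.get_dummies(df, columns=["color_group"])

# 3. Indicator variable: a binary flag derived from a condition.
df["is_tall"] = (df["height"] > 1.5).astype(int)
print(df)
```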

Feature set explosion

When feature engineering is done inappropriately and adds redundant features, it leads to feature explosion. Feature explosion might be caused by:

  • applying feature templates instead of coding new features,
  • feature combinations that are not representable by a linear system.

Solutions: regularization, kernel methods (PCA, CCA, spectral clustering, and others), and dimensionality reduction by applying feature selection.
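
Two of these remedies are easy to sketch with scikit-learn: L1 regularization (Lasso), which shrinks redundant coefficients to exactly zero, and univariate feature selection, which drops weak features before modelling. The "exploded" synthetic feature set is illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso

# 100 features, only 5 of which are informative: a small "explosion".
X, y = make_regression(n_samples=200, n_features=100, n_informative=5,
                       noise=0.1, random_state=0)

# Regularization: L1 drives most redundant coefficients to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))

# Feature selection: keep only the k features most related to the target.
X_reduced = SelectKBest(f_regression, k=5).fit_transform(X, y)
print("reduced shape:", X_reduced.shape)
```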

Feature crosses

A feature cross is a multiplication (crossing) of two or more features.

If the features are of Boolean type, like those generated by one-hot encoding or binning, the resulting crosses can be extremely sparse.

Feature crossing is a very efficient strategy for learning highly complex models with linear learners, because linear models scale well to massive datasets.

Feature crossing discretizes the input space and memorizes the training dataset. Memorization is the opposite of generalization, though!
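
The sparsity point is easy to see on a toy example. The sketch below crosses one-hot columns with scikit-learn's PolynomialFeatures (interaction_only=True computes exactly the pairwise products); the tiny dataset is an assumption for illustration:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"city": ["NY", "SF", "NY"], "day": ["sat", "mon", "mon"]})
onehot = pd.get_dummies(df)  # Boolean columns: city_NY, city_SF, day_mon, day_sat

# Keep the original columns plus all pairwise products (the crosses).
cross = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
crossed = cross.fit_transform(onehot)

print(crossed.shape)          # 4 one-hot columns + 6 pairwise crosses
print((crossed == 0).mean())  # fraction of zeros: the result is very sparse
```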
