Data Preprocessing Introduction

What is data preprocessing?

Data preprocessing is an important step that must be carried out before a dataset can be used for model training. Raw data is often noisy and unreliable, and may contain missing values. Training a model on such data can produce misleading results.
To avoid this, data preprocessing performs five main tasks:

  • Data cleaning: Fill in missing values, and detect and remove noisy data and outliers.
  • Data transformation: Normalize data so that attributes are on a comparable scale.
  • Data reduction: Sample data records or attributes for easier data handling.
  • Data discretization: Convert continuous attributes to categorical attributes for ease of use with certain machine learning methods.
  • Text cleaning: Remove embedded characters that may cause data misalignment, for example, embedded tabs in a tab-separated data file, or embedded new lines that may break records.
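
Three of the tasks above (data cleaning, data discretization, and text cleaning) can be illustrated with a minimal sketch. It uses only the Python standard library; the function names, the bin edges, and the sample values are assumptions made for illustration. In practice, libraries such as pandas or scikit-learn are commonly used for this work.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def discretize(values, edges, labels):
    """Data discretization: map each continuous value to a labeled bin.

    `edges` are exclusive upper bounds for the first len(edges) labels;
    values at or above the last edge get the final label.
    """
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v < edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result


def clean_text(field):
    """Text cleaning: collapse any run of whitespace (including embedded
    tabs and newlines) into a single space so the field cannot break a
    tab-separated record."""
    return " ".join(field.split())


# Hypothetical sample data.
ages = [20, None, 40, 60]
filled = fill_missing(ages)
bins = discretize(filled, edges=[30, 50], labels=["young", "middle", "senior"])
note = clean_text("red\tapple\nfresh")
```

Here the missing age is replaced by the mean of the three observed ages, each age is then assigned to one of three bins, and the embedded tab and newline in the text field are replaced by spaces.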

What is the order of data preprocessing?

  1. Taking care of missing data
  2. Encoding categorical data
  3. Splitting the dataset into the training set and test set
  4. Feature scaling
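
The four steps above can be sketched end to end as follows. This is a minimal illustration using only the Python standard library, not a definitive implementation: the toy dataset, the column meanings, the 25% test fraction, and the random seed are all assumptions. Mean imputation, one-hot encoding, and standardization are one possible choice for each step.

```python
import random
from statistics import mean, pstdev

# Hypothetical toy dataset: (country, age, purchased)
rows = [
    ("France", 44, 0),
    ("Spain", None, 1),
    ("Germany", 30, 0),
    ("Spain", 38, 0),
    ("France", 35, 1),
    ("Germany", None, 1),
    ("France", 48, 1),
    ("Spain", 27, 0),
]

# 1. Taking care of missing data: fill missing ages with the observed mean.
observed_ages = [age for _, age, _ in rows if age is not None]
age_mean = mean(observed_ages)
rows = [(c, age_mean if a is None else a, y) for c, a, y in rows]

# 2. Encoding categorical data: one-hot encode the country column.
categories = sorted({c for c, _, _ in rows})


def one_hot(country):
    return [1.0 if country == k else 0.0 for k in categories]


X = [one_hot(c) + [float(a)] for c, a, _ in rows]
y = [label for _, _, label in rows]

# 3. Splitting the dataset into the training set and test set.
rng = random.Random(0)  # fixed seed so the split is reproducible
indices = list(range(len(X)))
rng.shuffle(indices)
n_test = max(1, round(0.25 * len(X)))
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train = [X[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

# 4. Feature scaling: standardize the age column using statistics
#    computed from the training set only.
train_ages = [row[-1] for row in X_train]
mu, sigma = mean(train_ages), pstdev(train_ages)


def scale(row):
    return row[:-1] + [(row[-1] - mu) / sigma]


X_train = [scale(row) for row in X_train]
X_test = [scale(row) for row in X_test]
```

Scaling comes after the split so that the mean and standard deviation are estimated from the training set alone and then reused unchanged on the test set.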

There are more details in the Data preprocessing post.

This post is licensed under CC BY 4.0 by the author.