Data Cleaning and Pre-processing

Principles

| Name / Principle | What It Is (Summary) | Further Reading |
| --- | --- | --- |
| Tidy Data Principles (Hadley Wickham) | Defines how datasets should be structured so they are easier to clean, analyze, and visualize. | https://vita.had.co.nz/papers/tidy-data.pdf |
| GIGO (Garbage In, Garbage Out) | Poor quality input data leads inevitably to poor analytical or modeling outcomes. | https://en.wikipedia.org/wiki/Garbage_in,_garbage_out |
| DRY Data (Don’t Repeat Yourself) | Avoids duplicated values across datasets to reduce inconsistency and maintenance issues. | https://en.wikipedia.org/wiki/Don%27t_repeat_yourself |
| Single Source of Truth (SSOT) | Ensures all teams use the same authoritative version of data. | https://en.wikipedia.org/wiki/Single_source_of_truth |
| Data Quality Dimensions Framework | Standard framework defining data quality in terms of accuracy, completeness, consistency, validity, timeliness, etc. | https://tdwi.org/articles/2013/02/19/tdwi-checklist-evaluating-data-quality.aspx |
| Normalization Rules (1NF → BCNF) | Rules that eliminate redundancy and improve integrity when structuring relational data. | https://www.studytonight.com/dbms/database-normalization.php |
| Data Profiling Methodology | Systematic examination of data for patterns, anomalies, and structure before cleaning. | https://en.wikipedia.org/wiki/Data_profiling |
| Imputation Frameworks (MCAR / MAR / MNAR) | Defines the statistical nature of missing data so the right filling strategy can be chosen. | https://datascience.stackexchange.com/questions/14667/what-are-mcar-mar-and-mnar |
| Feature Scaling Methods | Standardization and normalization methods used to bring numeric variables to comparable ranges. | https://scikit-learn.org/stable/modules/preprocessing.html |
| Feature Engineering Life Cycle | Process of transforming raw data into meaningful features for ML models. | https://developers.google.com/machine-learning/data-prep/construct |
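
As an illustration of the Feature Scaling Methods entry, here is a minimal sketch of the two methods it names: standardization (z-scores) and min-max normalization. It uses only the Python standard library. The function names `standardize` and `min_max_scale` and the sample `ages` and `incomes` columns are hypothetical, chosen for this example; in practice a library such as scikit-learn would usually handle this.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-max normalization: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20_000, 30_000, 40_000, 50_000, 60_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
z_ages = standardize(ages)
```

After scaling, both columns fall in the same range even though the raw incomes are a thousand times larger than the ages, so neither dominates a distance-based or gradient-based model simply because of its units. Min-max scaling is sensitive to outliers because it depends on the extreme values; standardization is often the safer default when outliers are present.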
