Data Cleaning and Pre-processing

Principles

Name / Principle	What It Is (Summary)	Further Reading
Tidy Data Principles (Hadley Wickham)	Defines how datasets should be structured so they are easier to clean, analyze, and visualize.	https://vita.had.co.nz/papers/tidy-data.pdf
GIGO (Garbage In, Garbage Out)	Poor quality input data leads inevitably to poor analytical or modeling outcomes.	https://en.wikipedia.org/wiki/Garbage_in,_garbage_out
DRY Data (Don’t Repeat Yourself)	Stores each fact once instead of duplicating values across datasets, which reduces inconsistency and maintenance overhead.	https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
Single Source of Truth (SSOT)	Ensures all teams use the same authoritative version of data.	https://en.wikipedia.org/wiki/Single_source_of_truth
Data Quality Dimensions Framework	Describes data quality along measurable dimensions such as accuracy, completeness, consistency, validity, and timeliness.	https://tdwi.org/articles/2013/02/19/tdwi-checklist-evaluating-data-quality.aspx
Normalization Rules (1NF → BCNF)	Successive normal forms that reduce redundancy and improve integrity when structuring relational data.	https://www.studytonight.com/dbms/database-normalization.php
Data Profiling Methodology	Systematic examination of data for patterns, anomalies, and structure before cleaning.	https://en.wikipedia.org/wiki/Data_profiling
Missing-Data Mechanisms (MCAR / MAR / MNAR)	Classifies why values are missing (completely at random, at random, or not at random) so an appropriate imputation or handling strategy can be chosen.	https://datascience.stackexchange.com/questions/14667/what-are-mcar-mar-and-mnar
Feature Scaling Methods	Standardization (zero mean, unit variance) and min-max normalization, used to bring numeric variables onto comparable scales.	https://scikit-learn.org/stable/modules/preprocessing.html
Feature Engineering Life Cycle	Process of transforming raw data into meaningful features for ML models.	https://developers.google.com/machine-learning/data-prep/construct
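
Example

Several rows in the table above (data profiling, handling missing values, feature scaling) can be illustrated with a minimal sketch. It uses only the Python standard library. The toy records, the column names, and the helper names (profile_missing, impute_median, standardize, min_max) are hypothetical and not taken from any of the linked sources. Median imputation is only one possible strategy, and it is most defensible when values are missing completely at random (MCAR).

```python
from statistics import mean, median, pstdev

# Hypothetical toy records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": 52000.0},
    {"id": 2, "age": None, "income": 61000.0},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 58000.0},
]


def profile_missing(records, columns):
    """Profiling step: count missing (None) values per column."""
    return {c: sum(1 for r in records if r[c] is None) for c in columns}


def impute_median(records, column):
    """Return copies of the records with None in `column` replaced
    by the median of the observed values (assumes MCAR-like gaps)."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = median(observed)
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in records
    ]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Min-max scaling to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# 1. Profile before cleaning.
missing = profile_missing(rows, ["age", "income"])

# 2. Impute each numeric column; the original rows are left untouched.
cleaned = impute_median(impute_median(rows, "age"), "income")

# 3. Scale one feature in two different ways.
ages = [r["age"] for r in cleaned]
age_z = standardize(ages)
age_01 = min_max(ages)

print(missing)
print(cleaned)
print(age_z)
print(age_01)
```

Profiling runs first so that the number of gaps is known before any values are filled in. The imputation helper returns new records instead of mutating its input, which keeps the raw data available as a reference copy.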