Data Cleaning and Pre-processing
Principles
| Name / Principle | What It Is (Summary) | Further Reading |
|---|---|---|
| Tidy Data Principles (Hadley Wickham) | Structures a dataset so that each variable is a column, each observation is a row, and each type of observational unit is a table, which makes it easier to clean, analyze, and visualize. | https://vita.had.co.nz/papers/tidy-data.pdf |
| GIGO (Garbage In, Garbage Out) | Poor quality input data leads inevitably to poor analytical or modeling outcomes. | https://en.wikipedia.org/wiki/Garbage_in,_garbage_out |
| DRY Data (Don’t Repeat Yourself) | Avoids duplicated values across datasets to reduce inconsistency and maintenance issues. | https://en.wikipedia.org/wiki/Don%27t_repeat_yourself |
| Single Source of Truth (SSOT) | Ensures all teams use the same authoritative version of data. | https://en.wikipedia.org/wiki/Single_source_of_truth |
| Data Quality Dimensions Framework | Standard framework defining data quality in terms of accuracy, completeness, consistency, validity, timeliness, etc. | https://tdwi.org/articles/2013/02/19/tdwi-checklist-evaluating-data-quality.aspx |
| Normalization Rules (1NF → BCNF) | Rules that eliminate redundancy and improve integrity when structuring relational data. | https://www.studytonight.com/dbms/database-normalization.php |
| Data Profiling Methodology | Systematic examination of data for patterns, anomalies, and structure before cleaning. | https://en.wikipedia.org/wiki/Data_profiling |
| Missing-Data Mechanisms (MCAR / MAR / MNAR) | Classifies why values are missing (completely at random, at random, or not at random) so that an appropriate imputation strategy can be chosen. | https://datascience.stackexchange.com/questions/14667/what-are-mcar-mar-and-mnar |
| Feature Scaling Methods | Standardization and normalization methods used to bring numeric variables to comparable ranges. | https://scikit-learn.org/stable/modules/preprocessing.html |
| Feature Engineering Life Cycle | Process of transforming raw data into meaningful features for ML models. | https://developers.google.com/machine-learning/data-prep/construct |

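To make the tidy data row in the table concrete, here is a minimal sketch of reshaping a wide table into a long one using only the Python standard library. The `melt` helper, the field names, and the sample records are hypothetical and chosen for illustration; in practice a library such as pandas would usually do this.

```python
# Hypothetical wide-format records: one row per patient, one column per treatment.
wide_rows = [
    {"patient": "A", "treatment_a": 12, "treatment_b": 7},
    {"patient": "B", "treatment_a": 9, "treatment_b": None},
]


def melt(rows, id_field, var_name, value_name):
    """Reshape wide rows into long rows with one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the identifier is repeated on every output row
            tidy.append({id_field: row[id_field], var_name: column, value_name: value})
    return tidy


tidy_rows = melt(wide_rows, "patient", "treatment", "result")
for record in tidy_rows:
    print(record)
```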
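
As a rough illustration of the data profiling row, the sketch below summarizes one column before any cleaning is attempted. The `profile` function and the sample records are assumptions made for this example, and the summary is deliberately small (row count, missing count, distinct values, observed types). Even this is enough to surface the inconsistent casing of "Paris" and "paris".

```python
from collections import Counter

# Hypothetical raw records with typical problems: inconsistent casing,
# missing values, and an implausible age.
records = [
    {"city": "Paris", "age": 34},
    {"city": "paris", "age": None},
    {"city": "Berlin", "age": 29},
    {"city": None, "age": 290},
]


def profile(rows, column):
    """Return a small summary of one column: size, missing, distinct, types."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
    }


city_profile = profile(records, "city")
age_profile = profile(records, "age")
print(city_profile)
print(age_profile)
```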
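
The two feature scaling methods named in the table can be sketched in plain Python as follows. Standardization rescales values to zero mean and unit variance, and min-max normalization maps them onto the range 0 to 1. The function names and the `ages` sample are hypothetical, and using the population standard deviation is a choice made for this sketch.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40, 50, 60]  # hypothetical sample column
print(standardize(ages))
print(min_max_scale(ages))
```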