
# Problem Definition

The first step in any data science project is a clear, concise problem definition that scopes the task at hand.

The following table is a suggested framework for writing one:

| Section | Example Entry |
| --- | --- |
| Context | Retail chain struggles with product stockouts causing lost sales. |
| Business Objective | Reduce lost sales by improving inventory planning. |
| Data Science Objective | Forecast daily product demand per store for next 14 days. |
| Key Questions | What is expected sales volume per product-store-day? |
| Success Metric | MAPE ≤ 15% for top 100 products. |
| Constraints | Must run nightly and integrate with existing PostgreSQL system. |
| Data Needed | Sales history, calendar events, weather, promotions. |
| Deliverables | Forecast table + visualization dashboard in BI tool. |
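
One way to keep this framework actionable is to capture it in a machine-readable form alongside the project. The following is a minimal sketch in Python; the class and field names simply mirror the table above and are not part of any standard library or tool.

```python
from dataclasses import dataclass, field

@dataclass
class ProblemDefinition:
    """Structured problem definition mirroring the framework table above."""
    context: str
    business_objective: str
    data_science_objective: str
    key_questions: list[str] = field(default_factory=list)
    success_metric: str = ""
    constraints: list[str] = field(default_factory=list)
    data_needed: list[str] = field(default_factory=list)
    deliverables: list[str] = field(default_factory=list)

# Example entries taken from the table above.
demand_forecasting = ProblemDefinition(
    context="Retail chain struggles with product stockouts causing lost sales.",
    business_objective="Reduce lost sales by improving inventory planning.",
    data_science_objective="Forecast daily product demand per store for next 14 days.",
    key_questions=["What is expected sales volume per product-store-day?"],
    success_metric="MAPE <= 15% for top 100 products",
    constraints=["Must run nightly", "Integrate with existing PostgreSQL system"],
    data_needed=["Sales history", "Calendar events", "Weather", "Promotions"],
    deliverables=["Forecast table", "Visualization dashboard in BI tool"],
)
```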

This can be augmented with a question-to-data mapping, as in the following table:

| Key Question | Data Needed | Data Source | Notes (Quality, Processing, Granularity) |
| --- | --- | --- | --- |
| What will demand be per store next week? | Historical sales by store/date | Sales database | Aggregate to daily level |
| Does weather affect demand? | Temperature, rainfall | Weather API | Join on geo + date |
| Do promotions change buying patterns? | Promo flags, discount %, marketing schedule | Marketing calendar | Need categorical encoding |
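
The processing notes in this mapping translate directly into data-preparation steps. Below is a minimal pandas sketch of aggregating sales to the daily level and joining weather on geography + date; the table and column names (`store_id`, `timestamp`, `units_sold`, etc.) are illustrative assumptions, not a real schema.

```python
import pandas as pd

# Hypothetical raw inputs; values and column names are illustrative only.
sales = pd.DataFrame({
    "store_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2024-05-01 09:10", "2024-05-01 17:45", "2024-05-01 12:30"]),
    "units_sold": [3, 5, 2],
})
weather = pd.DataFrame({
    "store_id": [1, 2],
    "date": pd.to_datetime(["2024-05-01", "2024-05-01"]),
    "temperature": [21.5, 18.0],
    "rainfall_mm": [0.0, 4.2],
})

# Aggregate transactional sales to the store/day grain used by the forecast.
daily_sales = (
    sales.assign(date=sales["timestamp"].dt.normalize())
         .groupby(["store_id", "date"], as_index=False)["units_sold"].sum()
)

# Join weather features on geography (store) + date, per the mapping table.
features = daily_sales.merge(weather, on=["store_id", "date"], how="left")
print(features)
```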

When preparing the problem definition, identify the performance metrics for the analysis up front. The following table provides examples:

| Analysis Type | Example Metrics | Interpretation Focus |
| --- | --- | --- |
| Forecasting | MAE, RMSE, MAPE | Forecast error relative to scale |
| Regression | R², RMSE, MAE | Variance explained & error magnitude |
| Classification | F1 score, ROC AUC, Accuracy | Balance error vs. precision/recall trade-offs |
| Clustering | Silhouette Score, Davies-Bouldin | How well data points cluster |
| EDA Insights | Summary stats, distributions | Whether patterns are meaningful and reproducible |
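
As a concrete illustration, the forecasting metrics above can be computed directly. The sketch below uses made-up actuals and forecasts (values are illustrative only):

```python
import numpy as np

# Illustrative actuals and forecasts for a handful of product-store-days.
actual = np.array([120.0, 95.0, 40.0, 210.0, 75.0])
forecast = np.array([110.0, 100.0, 35.0, 230.0, 70.0])

error = forecast - actual
mae = np.mean(np.abs(error))                           # average absolute error, in units sold
rmse = np.sqrt(np.mean(error ** 2))                    # penalizes large misses more heavily
mape = np.mean(np.abs(error) / np.abs(actual)) * 100   # error relative to scale, in percent

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  MAPE={mape:.1f}%")
# A MAPE at or below the 15% target from the success metric would pass this check.
```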

As an extension, the stability of the conclusions also needs to be tested. The following table lists common stress tests and how to run them:

| Stress Test | How |
| --- | --- |
| Train/Test or Cross-validation consistency | k-fold or rolling window validation |
| Sensitivity analysis | Vary assumptions or thresholds |
| Stability across subgroups | Performance by region/customer/segment/time |
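
For the first stress test, a rolling-window validation can check whether the error is consistent across folds rather than just good on average. The sketch below assumes scikit-learn's `TimeSeriesSplit` with a deliberately simple linear model on synthetic demand data; both are illustrative choices, not a recommended pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Synthetic daily demand series with weekly seasonality (illustrative only).
days = np.arange(365)
demand = 100 + 20 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 5, size=days.size)
X = days.reshape(-1, 1)

fold_mapes = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Each fold trains only on the past and evaluates on the next window.
    model = LinearRegression().fit(X[train_idx], demand[train_idx])
    preds = model.predict(X[test_idx])
    fold_mapes.append(mean_absolute_percentage_error(demand[test_idx], preds))

# Stable conclusions should show similar error across folds, not just a good mean.
print([f"{m:.2%}" for m in fold_mapes])
```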

A proper validation can be done using the following checklist:

```text
[ ] The analysis answers the original objective.
[ ] Data used is appropriate and clean, with no leakage.
[ ] Performance was assessed using correct evaluation metrics.
[ ] A baseline comparison was included.
[ ] Model/analysis is robust across tests and subsets.
[ ] Results are interpretable to stakeholders.
[ ] Actionable recommendations are clearly stated.
[ ] Ethical and contextual impacts were considered.
```
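
The baseline item on this checklist is easy to overlook. The sketch below, again with made-up numbers, compares a model's MAPE against a naive "same day last week" baseline to confirm the model actually adds value; the arrays and baseline choice are assumptions for illustration.

```python
import numpy as np

# Illustrative two weeks of daily demand for one store (made-up values).
last_week = np.array([110, 95, 100, 120, 130, 150, 90], dtype=float)
this_week = np.array([115, 90, 105, 125, 140, 145, 95], dtype=float)
model_forecast = np.array([112, 93, 102, 128, 135, 148, 92], dtype=float)

def mape(actual, forecast):
    return np.mean(np.abs(forecast - actual) / np.abs(actual)) * 100

# Naive baseline: forecast this week as a copy of the same weekday last week.
print(f"baseline MAPE: {mape(this_week, last_week):.1f}%")
print(f"model MAPE:    {mape(this_week, model_forecast):.1f}%")
```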

The theoretical aspects, such as the analysis performance metrics, are covered in the statistics section.
