
Data Science

Data science is the discipline of analysing and extracting information from large sets of data. It typically combines elements of statistics, mathematics, computing, and other subjects.

It is an interdisciplinary field focused on extracting knowledge and insights from structured and unstructured data using scientific methods, processes, algorithms, and systems. It integrates elements from statistics, computer science, mathematics, and domain expertise to support decision-making, predictions, and automation.

Because the field is so broad, this topic draws on elements of Information and Communication Technology as well as Mathematics and Statistics. The topics covered in this section are:

The following topics will be discussed under statistics:

Data science follows a common process across industries, so it is recommended to follow the same process when preparing workloads in this domain:

  1. Problem Definition

    • Identify the objective: What question needs to be answered?
    • Determine constraints, success criteria, and required outputs.
  2. Data Acquisition (Data Extraction)

    • Collect data from sources such as:

      • Databases (SQL/NoSQL)
      • APIs or web services
      • Logs, sensors, IoT streams
      • Web scraping and open datasets
  3. Data Cleaning

    • Fix or remove incorrect, missing, or inconsistent data:

      • Handle missing values
      • Correct data types
      • Remove duplicates or anomalies
      • Normalize formats
  4. Data Pre-Processing / Transformation

    • Prepare data for analysis or modeling:

      • Feature engineering
      • Scaling / normalization
      • Encoding categorical variables
      • Time-series formatting
      • Text vectorization
  5. Data Loading / Storage

    • Store processed data in usable structures:

      • Data warehouse / lake
      • Analysis-ready tables
      • Feature stores for ML pipelines
  6. Exploratory Data Analysis (EDA)

    • Understand the data before modeling:

      • Statistical summaries
      • Data visualization
      • Detect trends, relationships, outliers
  7. Modeling / Analysis / Machine Learning

    • Choose and apply methods based on the problem:

      • Regression / classification / clustering
      • Predictive models
      • Statistical inference
      • Simulation, forecasting
  8. Model Evaluation & Testing

    • Verify reliability and performance:

      • Train-test splits
      • Cross-validation
      • Performance metrics (accuracy, F1, MSE, etc.)
      • A/B testing if applied to real systems
  9. Interpretation & Insights

    • Translate results into meaningful conclusions.
    • Communicate findings to stakeholders clearly and visually.
  10. Deployment & Monitoring

    • Put the model or solution into production:

      • APIs, dashboards, automated workflows
      • Continuous monitoring for drift, decay, and reliability
  11. Iteration

    • Data science is cyclical, not linear.
    • New data or business changes may require re-training or re-design.
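
As a rough illustration of steps 3 and 4 (cleaning and pre-processing), the sketch below uses only the Python standard library so that it runs anywhere. In practice pandas and scikit-learn would usually do this work. The records, the column names (`id`, `age`, `city`) and the helper names `clean` and `transform` are invented for the example, and mean imputation is only one of several reasonable ways to handle missing values.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from a CSV export or an API.
raw_rows = [
    {"id": "1", "age": "34", "city": "london"},
    {"id": "2", "age": "", "city": "Paris "},
    {"id": "2", "age": "", "city": "Paris "},  # exact duplicate of the row above
    {"id": "3", "age": "51", "city": "LONDON"},
]


def clean(rows):
    """Step 3: remove duplicates, correct types, normalise formats, fill gaps."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # remove exact duplicates
            continue
        seen.add(key)
        cleaned.append({
            "id": int(row["id"]),                              # correct data types
            "age": float(row["age"]) if row["age"] else None,  # flag missing values
            "city": row["city"].strip().title(),               # normalise formats
        })
    # Handle missing values: impute with the mean of the observed ages.
    mean_age = mean(r["age"] for r in cleaned if r["age"] is not None)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = mean_age
    return cleaned


def transform(rows):
    """Step 4: min-max scale `age` and one-hot encode `city`."""
    ages = [r["age"] for r in rows]
    lo, hi = min(ages), max(ages)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant column
    cities = sorted({r["city"] for r in rows})
    features = []
    for r in rows:
        scaled_age = (r["age"] - lo) / span
        one_hot = [1.0 if r["city"] == c else 0.0 for c in cities]
        features.append([scaled_age] + one_hot)
    return features, ["age_scaled"] + [f"city_{c}" for c in cities]


cleaned = clean(raw_rows)
features, columns = transform(cleaned)
for row in features:
    print(dict(zip(columns, row)))
```

Keeping cleaning and transformation in separate functions mirrors the separation of the two steps in the process: the cleaned rows can be stored and inspected before any modelling-specific encoding is applied.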

Each of these stages is expanded upon in its own section.
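
To make step 8 (model evaluation and testing) a little more concrete, here is a minimal sketch of a train-test split and three of the metrics named there (accuracy, F1, and MSE), written in plain Python. The function names and the example labels are assumptions made for illustration. A real project would more likely rely on the equivalents in scikit-learn.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and hold out the last `test_fraction` for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_squared_error(y_true, y_pred):
    """Average squared difference, used for regression problems."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and predictions, purely to exercise the functions.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

train, test = train_test_split(list(range(8)))
print(len(train), len(test))
print(accuracy(labels, predictions))
print(f1_score(labels, predictions))
print(mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))
```

Fixing the `seed` makes the split reproducible, which matters when comparing models against the same held-out data.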

Datasets to practice data science methods

*Apache Parquet is a columnar file format widely used in the Apache Hadoop ecosystem. pandas can be used to read datasets stored in this format.

APIs to practice on

Finance & Economics

| API | Data Provided | Notes |
| --- | --- | --- |
| Yahoo Finance API (via yfinance library) | Stocks, ETFs, currencies | Free, no API key required if using yfinance |
| Alpha Vantage | Real-time + historical stocks, crypto, forex | Free tier with rate limits (5 calls/min) |
| FRED (Federal Reserve) | Macroeconomic indicators (GDP, CPI, rates) | Fully free, requires simple key |
| Finnhub.io | Stocks, sentiment analysis, crypto | Generous free tier |

Weather / Environment

| API | Data Provided | Notes |
| --- | --- | --- |
| OpenWeatherMap | Weather forecasts & history | Free tier w/ API key; limited calls/day |
| NOAA Climate Data API | Historical climate and weather | Fully free, but some datasets require request forms |
| AirNow API | Air quality index (AQI) data | Free registration |

Maps / Geospatial / Places

| API | Data Provided | Notes |
| --- | --- | --- |
| OpenStreetMap (OSM) | Roads, buildings, geodata | Free, use through Overpass API |
| GeoNames API | Geographic names & location metadata | Free with registration |
| USGS Earthquake API | Global earthquake data in real time | No key required |
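
As an example of the data acquisition step, the sketch below queries the USGS Earthquake API, which needs no key, and flattens its GeoJSON response into rows ready for analysis. The helper names (`build_url`, `parse_quakes`, `fetch_quakes`) and the default magnitude threshold are assumptions made for this example. The sample payload is hand-written with made-up values so that the parsing can be tried offline, and `fetch_quakes` is defined but not called here because it needs network access.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

USGS_QUERY_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def build_url(start, end, min_magnitude=5.0):
    """Build a query URL for events between two ISO dates (YYYY-MM-DD)."""
    params = {
        "format": "geojson",
        "starttime": start,
        "endtime": end,
        "minmagnitude": min_magnitude,
    }
    return f"{USGS_QUERY_URL}?{urlencode(params)}"


def parse_quakes(payload):
    """Flatten a GeoJSON FeatureCollection into a list of plain dicts."""
    rows = []
    for feature in payload.get("features", []):
        props = feature["properties"]
        lon, lat, depth_km = feature["geometry"]["coordinates"]
        rows.append({
            # The API reports event time in milliseconds since the Unix epoch.
            "time": datetime.fromtimestamp(props["time"] / 1000, tz=timezone.utc),
            "magnitude": props["mag"],
            "place": props["place"],
            "longitude": lon,
            "latitude": lat,
            "depth_km": depth_km,
        })
    return rows


def fetch_quakes(start, end, min_magnitude=5.0):
    """Download and parse events. Requires network access."""
    with urlopen(build_url(start, end, min_magnitude), timeout=30) as response:
        return parse_quakes(json.load(response))


# Hand-written payload in the same shape as a real response (values are made up).
sample_payload = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"mag": 5.4, "place": "Example Region", "time": 1704067200000},
            "geometry": {"type": "Point", "coordinates": [142.5, 38.2, 35.0]},
        }
    ],
}

quakes = parse_quakes(sample_payload)
print(quakes[0]["time"].isoformat(), quakes[0]["magnitude"], quakes[0]["place"])
```

Separating URL construction and parsing from the network call keeps the parsing logic testable without hitting the API, which is also kinder to the service's rate limits.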

Open Government / Demographics

| API | Data Provided | Notes |
| --- | --- | --- |
| U.S. Census API | Demographic & socioeconomic data | Free, requires registration |
| UN Data API | Population, development stats | Mostly open datasets downloadable via JSON |
| World Bank Open Data API | Global development indicators | Completely free, no key required |

Social Media / Text / NLP

| API | Data Provided | Notes |
| --- | --- | --- |
| Reddit API | Posts & comments | Requires a free API key |
| Wikipedia API | Articles, summaries, pageviews | No key required |
| NewsAPI | News headlines & metadata | Free tier limited, no full text |
| HuggingFace Datasets API | NLP datasets programmatic access | Fully free |

Health / Science / Research

| API | Data Provided | Notes |
| --- | --- | --- |
| PubMed Entrez API | Scientific paper metadata | No key needed (but recommended) |
| OpenFDA | Drug & adverse event data | Completely free |
| ClinicalTrials.gov API | Medical trial data | Open and unrestricted |

E-Commerce / Products

| API | Data Provided | Notes |
| --- | --- | --- |
| OpenFoodFacts API | Ingredients, nutrition labels | Great for classification/ML |
| Fake Store API | Product + cart data for mock ecommerce | Good for beginner ML demos |

Fun & Miscellaneous

| API | Data Provided | Notes |
| --- | --- | --- |
| PokéAPI | Pokémon stats | Fully free & fun to use |
| Star Wars API (SWAPI) | Characters, planets, starships | No key needed |
| Open Trivia Database | Trivia Q&A | No key required |

Programming & Software Tools

  • Languages: Python, R (sometimes SQL, Scala, Julia)
  • Libraries: pandas, NumPy, scikit-learn, TensorFlow, PyTorch, matplotlib/seaborn
  • Version control: Git
  • Development tools: Jupyter Notebooks, VS Code
