Electricity Consumption Predictor

May 2025

Python
scikit-learn
Time-Series
Machine Learning
A modular, design-pattern-driven machine learning framework to ingest smart-meter readings and predict aggregate daily household power consumption.
Published

June 15, 2025

GitHub

Project Overview

The Electricity Consumption Predictor is an end-to-end time-series regression pipeline designed to predict residential energy usage. Following software engineering best practices, it aggregates and cleans 2M+ high-frequency telemetry readings, imputes missing records, filters out anomalies using interquartile-range (IQR) bounds, and fits non-linear regressors (including Random Fourier Features and Multi-Layer Perceptrons) to assist utility companies in grid load management.
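
The aggregation, imputation, and IQR steps described above can be sketched in pandas roughly as follows. This is a minimal illustration on synthetic data rather than the project's code: the function names, the `Global_active_power` column name, the use of a daily sum as the aggregate, and the 1.5 × IQR multiplier are all assumptions.

```python
import numpy as np
import pandas as pd


def aggregate_daily(readings: pd.DataFrame, column: str = "Global_active_power") -> pd.Series:
    """Downsample minute-level readings to one value per calendar day."""
    # min_count=1 keeps a day with no valid readings as NaN instead of 0.
    return readings[column].resample("D").sum(min_count=1)


def impute_mean(daily: pd.Series) -> pd.Series:
    """Replace missing days with the mean of the observed days."""
    return daily.fillna(daily.mean())


def filter_iqr(daily: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep only days inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = daily.quantile(0.25), daily.quantile(0.75)
    iqr = q3 - q1
    return daily[daily.between(q1 - k * iqr, q3 + k * iqr)]


# Synthetic stand-in for the raw file: 28 days of minute-level readings
# with a mild weekly pattern (hypothetical values).
index = pd.date_range("2025-01-01", periods=28 * 1440, freq="min")
day_number = np.arange(len(index)) // 1440
readings = pd.DataFrame(
    {"Global_active_power": 1.0 + 0.05 * (day_number % 7)}, index=index
)
readings.loc[day_number == 3, "Global_active_power"] = np.nan  # a fully missing day
readings.loc[day_number == 10, "Global_active_power"] *= 5     # an anomalous day

daily = aggregate_daily(readings)
cleaned = filter_iqr(impute_mean(daily))
print(len(daily), len(cleaned))
```

In this toy run the missing day is filled before filtering, so it survives, while the inflated day falls outside the IQR fences and is dropped.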

Problem

Household electricity consumption is highly non-linear, with sudden peaks driven by human behavior and appliance usage. Traditional load forecasting models (like ARIMA) assume constant variance and linear relationships, so they tend to miss sudden spikes and long-term seasonal variation.

I wanted to design a modular, object-oriented forecasting pipeline that compares simple parametric curves against high-dimensional non-linear mapping methods (RFF kernel approximations, MLPs) under strict data leakage guards.

Features

  • Strategy Design Pattern Pipeline: Encapsulates data ingestion, missing value imputation, outlier filtering, and feature transformations in interchangeable strategy classes.
  • Robust Telemetry Aggregation: Downsamples over 2 million raw minute-level measurements into daily aggregates to capture macro consumption patterns.
  • IQR-Based Outlier Removal: Filters out anomalous power spikes that fall outside the interquartile-range bounds to stabilize regression models.
  • Multi-Model Regression Sandbox: Benchmarks Linear Regression, Polynomial Regression, Gaussian Radial Basis Functions (RBF), Random Fourier Features (RFF), and Multi-Layer Perceptrons (MLP).
  • Random Fourier Feature Projections: Maps inputs to a 12,000-dimensional space via scikit-learn’s RBFSampler to approximate Gaussian RBF kernels in linear time.
  • Standard Scaling Data Guard: Fits scaling parameters exclusively on the training partition to prevent test set data leakage.
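
The scaling guard and the Random Fourier Feature projection from the list above can be combined in a single scikit-learn pipeline, sketched below under stated assumptions. The data is synthetic, `n_components=200` is far smaller than the 12,000 dimensions mentioned above, and `Ridge` with a small penalty is swapped in for plain `LinearRegression` purely for numerical stability in this toy setting; none of these choices are taken from the project.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic non-linear target (hypothetical stand-in for daily consumption).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(400, 1))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.normal(size=400)

# Split first, without shuffling, so nothing from the test rows reaches fit().
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Inside a pipeline, StandardScaler is fitted on the training partition only
# and merely applied to the test partition at predict time.
rff_model = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=1.0, n_components=200, random_state=0),
    Ridge(alpha=1e-3),  # assumption: small ridge penalty instead of plain OLS
)
rff_model.fit(X_train, y_train)

baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)

rff_mse = mean_squared_error(y_test, rff_model.predict(X_test))
linear_mse = mean_squared_error(y_test, baseline.predict(X_test))
print(f"RFF test MSE: {rff_mse:.4f}  linear test MSE: {linear_mse:.4f}")
```

Keeping the scaler inside the pipeline means there is no separate `fit_transform` call that could accidentally see the full dataset.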

Tech Stack

  • ETL & Processing:
    • Python
    • Pandas
    • NumPy
    • zipfile (Python standard library)
  • Modeling & Benchmarking:
    • scikit-learn (RBFSampler, MLPRegressor, LinearRegression)
    • Matplotlib
    • Seaborn

Architecture

graph TD
    Zip["raw zip archive"] --> Extractor["ZipDataExtractor"]
    Extractor --> Text["household_power_consumption.txt"]
    Text --> Ingest["Ingestor Strategy Pattern (CSV/Text)"]
    Ingest --> Process["initial_process.py (Daily Aggregation)"]
    Process --> Missing["missing_value_handling.py (Mean Impute)"]
    Missing --> Outliers["outliers_detection.py (IQR Filtering)"]
    Outliers --> Features["feature_engineering.py (Standard Scaler)"]
    Features --> Sandbox["Regression Sandbox<br>(Linear / Poly / RFF / MLP)"]
    Sandbox --> Predict["Predicts Daily Active Power"]
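
Each stage in the graph above is described as an interchangeable strategy. A minimal sketch of that pattern for the missing-value stage might look like the following; the class and method names are hypothetical and are not taken from the repository.

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import List, Optional


class MissingValueStrategy(ABC):
    """Common interface that every missing-value strategy implements."""

    @abstractmethod
    def handle(self, values: List[Optional[float]]) -> List[float]:
        ...


class MeanImputeStrategy(MissingValueStrategy):
    """Fill gaps with the mean of the observed values."""

    def handle(self, values: List[Optional[float]]) -> List[float]:
        observed = [v for v in values if v is not None]
        fill = mean(observed)
        return [fill if v is None else v for v in values]


class DropMissingStrategy(MissingValueStrategy):
    """Discard missing entries instead of filling them."""

    def handle(self, values: List[Optional[float]]) -> List[float]:
        return [v for v in values if v is not None]


class MissingValueHandler:
    """Context object: delegates to whichever strategy is currently set."""

    def __init__(self, strategy: MissingValueStrategy) -> None:
        self._strategy = strategy

    def set_strategy(self, strategy: MissingValueStrategy) -> None:
        self._strategy = strategy

    def run(self, values: List[Optional[float]]) -> List[float]:
        return self._strategy.handle(values)


daily_power = [1200.0, None, 1800.0]

handler = MissingValueHandler(MeanImputeStrategy())
imputed = handler.run(daily_power)   # [1200.0, 1500.0, 1800.0]

handler.set_strategy(DropMissingStrategy())
dropped = handler.run(daily_power)   # [1200.0, 1800.0]
```

The calling code never changes when a strategy is swapped, which is what lets ingestion, imputation, outlier filtering, and scaling be varied independently.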

My Contributions

  • Programmed the Strategy Design Pattern for data ingestion, missing value imputation, outlier detection, and feature transformation.
  • Built the daily-aggregation processor that downsamples the minute-level readings into daily aggregates, reducing the dataset size.
  • Configured data splitting wrappers isolating training and testing partitions.
  • Benchmarked Linear, Polynomial, Gaussian RBF, RFF, and MLP regressors.
  • Conducted hyperparameter grid sweeps over polynomial degree and the number of Fourier projection components to balance the bias-variance tradeoff.
  • Audited the codebase and identified a return-value bug in the Min-Max scaling method as well as hard-coded local absolute paths.
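
A sweep over polynomial degree, as mentioned in the list above, could be sketched as below. The synthetic data, the degree range, and the variable names are assumptions for illustration; the project's actual grid is not shown in this write-up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data (hypothetical).
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# degree -> (train MSE, test MSE)
sweep = {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    sweep[degree] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )

# Select on held-out error, not training error, which keeps shrinking
# as the model becomes more flexible.
best_degree = min(sweep, key=lambda d: sweep[d][1])

for degree, (train_mse, test_mse) in sweep.items():
    print(f"degree={degree}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
print("best degree by test MSE:", best_degree)
```

Low degrees underfit (high error on both partitions), while a widening gap between training and test error at high degrees signals overfitting.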

What I Learned

  • Structuring modular machine learning pipelines using the Strategy Design Pattern.
  • Approximating infinite-dimensional radial basis function kernels using Random Fourier Features.
  • Managing data splits and scaling configurations to prevent leakage.
  • Analyzing bias-variance tradeoffs via parametric sweeps.
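
To make the Random Fourier Features point concrete, the sketch below compares the exact Gaussian RBF kernel matrix with the inner products of `RBFSampler` features as the number of components grows. The data, `gamma`, and component counts are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
gamma = 0.5

# Exact kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
exact = rbf_kernel(X, gamma=gamma)

# n_components -> mean absolute error of the approximation
errors = {}
for n_components in (10, 100, 1000, 10000):
    sampler = RBFSampler(gamma=gamma, n_components=n_components, random_state=0)
    Z = sampler.fit_transform(X)   # explicit finite-dimensional feature map
    approx = Z @ Z.T               # inner products approximate the kernel
    errors[n_components] = float(np.abs(approx - exact).mean())

for n_components, err in errors.items():
    print(f"n_components={n_components:>5}  mean abs error={err:.4f}")
```

Because the feature map is explicit, a linear model trained on `Z` behaves like an approximate kernel method while its cost scales linearly with the number of samples.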

Results

  • Downsampled 2M+ records into a stable 1,399-row daily dataset.
  • Random Fourier Features (RFF) regression achieved a training MSE of 98,803.99 kW².
  • Deep MLP Regressor achieved the best generalization with a test MSE of 140,600.85 kW², a reduction of about 53% relative to the baseline Linear Regression (298,697.81 kW²).
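
The relative improvement quoted above follows directly from the two reported MSE values. The short calculation below reproduces it and also converts both figures to RMSE, which is in the same units as the target; the variable names are illustrative.

```python
import math

linear_mse = 298_697.81  # baseline Linear Regression, as reported above
mlp_mse = 140_600.85     # MLP regressor test MSE, as reported above

# Fraction of the baseline error removed by the MLP.
relative_reduction = (linear_mse - mlp_mse) / linear_mse

# RMSE is the square root of MSE, so it is expressed in kW rather than kW².
linear_rmse = math.sqrt(linear_mse)
mlp_rmse = math.sqrt(mlp_mse)

print(f"MSE reduction: {relative_reduction:.1%}")  # 52.9%
print(f"RMSE: {linear_rmse:.1f} kW -> {mlp_rmse:.1f} kW")
```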

Future Work

  • Fix the Min-Max scaling method return bug in the feature engineering module.
  • Transition the pipeline to a distributed engine (PySpark or Dask) to scale to millions of active meters.
  • Add model serialization utilities using joblib.
  • Wrap model inference in a FastAPI web container.