Smart-Home Audio Keyword-Spotting

Sep 2025 - Dec 2025
Python · PyTorch · TensorFlow · Raspberry Pi · Embedded Systems · Machine Learning · Audio Processing · MFCC · CNN · LSTM · GPIO · I²C · librosa
View on GitHub

Overview

An embedded keyword-spotting system that controls peripherals based on voice commands. This project implements a multi-class audio classifier on a Raspberry Pi Zero 2 W single-board computer, creating a prototype smart-home device that recognizes spoken commands and executes them through peripheral components.

The system recognizes 9 classes: "Red", "Green", "Blue", "White", and "Off" for RGB LED control; "Time" and "Temperature" for the LCD display, which reads from an RTC chip and a temperature sensor; plus "Noise" and "Unknown Command" for robustness.
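
As a rough illustration of the keyword-to-peripheral dispatch, the sketch below uses gpiozero; the GPIO pin numbers and the lcd_show/read_temp helpers are hypothetical stand-ins for the project's actual wiring and drivers, not its real code.

```python
from datetime import datetime
from gpiozero import RGBLED  # runs on the Pi; needs an RGB LED wired to GPIO

# Hypothetical pin assignments -- the real wiring may differ.
led = RGBLED(red=17, green=27, blue=22)

COLOR_MAP = {
    "red":   (1, 0, 0),
    "green": (0, 1, 0),
    "blue":  (0, 0, 1),
    "white": (1, 1, 1),
    "off":   (0, 0, 0),
}

def lcd_show(text: str) -> None:
    # Placeholder: the real system writes to an I2C character LCD.
    print(f"[LCD] {text}")

def read_temp() -> float:
    # Placeholder: the real system reads an I2C temperature sensor.
    return 21.5

def handle_keyword(label: str) -> None:
    """Dispatch one recognized keyword to the matching peripheral action."""
    if label in COLOR_MAP:
        led.color = COLOR_MAP[label]                       # drive the RGB LED
    elif label == "time":
        lcd_show(datetime.now().strftime("%H:%M:%S"))      # RTC-backed in the real system
    elif label == "temperature":
        lcd_show(f"{read_temp():.1f} C")
    # "noise" and "unknown" deliberately trigger no action
```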

The project involved a complete machine learning pipeline: recording audio from multiple speakers, chopping utterances, applying data augmentation (low-pass, high-pass, and band-pass filtering; pitch-shifting; noise addition; dynamic range compression), extracting Mel-Frequency Cepstral Coefficients (MFCCs) as features, training CNN and LSTM models in PyTorch and TensorFlow, and compressing the model for deployment on the Raspberry Pi.
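
A minimal sketch of the augmentation and MFCC-extraction step, assuming librosa, 16 kHz mono clips, and 13 MFCCs; only pitch-shifting and noise addition are shown, and the parameters and the utterance.wav filename are illustrative rather than the project's actual settings.

```python
import numpy as np
import librosa

SR = 16_000       # assumed sample rate
N_MFCC = 13       # illustrative MFCC count

def augment(y: np.ndarray, sr: int = SR) -> list[np.ndarray]:
    """Generate a few augmented variants of one utterance."""
    variants = [y]
    variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=2))  # shift pitch up
    variants.append(y + 0.005 * np.random.randn(len(y)))               # additive noise
    return variants

def extract_mfcc(y: np.ndarray, sr: int = SR) -> np.ndarray:
    """MFCC matrix of shape (n_mfcc, frames) for one clip."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)

y, sr = librosa.load("utterance.wav", sr=SR, mono=True)
features = [extract_mfcc(v) for v in augment(y)]
```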

The inference system captures 5-second audio buffers, trims them to the model's ~1.8 s input window, extracts MFCCs, and feeds the features to the model for real-time classification. The CNN model achieved 99.79% validation accuracy and ~98.5% test accuracy.
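
A sketch of that capture-trim-classify loop, assuming the sounddevice library for recording and a saved TorchScript model; the label order, the kws_cnn.pt path, and the silence-trimming threshold are assumptions, not details from the project.

```python
import numpy as np
import sounddevice as sd
import librosa
import torch

SR = 16_000
BUFFER_S, WINDOW_S = 5.0, 1.8
LABELS = ["red", "green", "blue", "white", "off",
          "time", "temperature", "noise", "unknown"]

model = torch.jit.load("kws_cnn.pt")   # hypothetical compressed model file
model.eval()

def capture_and_classify() -> str:
    # Record a 5 s mono buffer from the default microphone.
    buf = sd.rec(int(BUFFER_S * SR), samplerate=SR, channels=1, dtype="float32")
    sd.wait()
    y = buf.squeeze()

    # Trim leading/trailing silence, then crop or pad to the ~1.8 s window.
    y, _ = librosa.effects.trim(y, top_db=20)
    n = int(WINDOW_S * SR)
    y = np.pad(y, (0, max(0, n - len(y))))[:n]

    # MFCCs -> (batch, channel, n_mfcc, frames) tensor -> class prediction.
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13)
    x = torch.from_numpy(mfcc).unsqueeze(0).unsqueeze(0)
    with torch.no_grad():
        pred = model(x).argmax(dim=1).item()
    return LABELS[pred]
```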

Project Context

This project was developed as part of a Machine Learning course (EE 475) at Northwestern University. The goal was to create an embedded keyword-spotting system that could recognize voice commands and control hardware peripherals in real-time.

Key Features

  • Real-time audio keyword recognition on Raspberry Pi Zero 2 W
  • 9-class classification: RGB LED commands (Red, Green, Blue, White, Off), LCD commands (Time, Temperature), plus Noise and Unknown classes for robustness
  • Hardware integration: RGB LED control, LCD display, RTC for time, temperature sensor
  • Complete ML pipeline: data collection, augmentation, feature extraction (MFCCs), model training, and deployment (a CNN sketch follows this list)
  • Data augmentation techniques: LP/HP/BP filters, pitch-shifting, noise addition, dynamic compression
  • High accuracy: 99.79% validation accuracy, ~98.5% test accuracy with CNN model
  • Real-time inference with 5s buffer capture, trimmed to ~1.8s for optimal model input
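
The trained model's exact architecture isn't listed above, so the following is only a plausible PyTorch sketch of a small CNN over the MFCC matrix; the layer widths and the (13 × 57) input shape are guesses for illustration.

```python
import torch
import torch.nn as nn

class KeywordCNN(nn.Module):
    """Small 2-D CNN over an (n_mfcc x frames) MFCC matrix; 9 output classes."""
    def __init__(self, n_classes: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # shape-agnostic pooling keeps the head small
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# Dummy batch: (batch, channel, n_mfcc, frames)
logits = KeywordCNN()(torch.randn(1, 1, 13, 57))
print(logits.shape)  # torch.Size([1, 9])
```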

Challenges

  • Optimizing model size for embedded deployment on the Raspberry Pi Zero 2 W (see the quantization sketch after this list)
  • Handling real-time audio processing with latency constraints
  • Managing data collection and augmentation across multiple speakers
  • Integrating hardware peripherals (LED, LCD, sensors) with the ML inference pipeline
  • Balancing model accuracy with computational efficiency for edge deployment
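
The writeup doesn't specify how the model was compressed, but post-training dynamic quantization is one common option for a CPU-only board like the Pi Zero 2 W; the stand-in model and output filename below are purely illustrative.

```python
import torch
import torch.nn as nn

# Stand-in for the trained keyword model (any nn.Module with Linear layers works).
model = nn.Sequential(nn.Flatten(), nn.Linear(13 * 57, 64), nn.ReLU(), nn.Linear(64, 9))
model.eval()

# Dynamic quantization stores Linear weights as int8, shrinking the saved
# file and speeding up CPU inference on the Pi.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
torch.save(quantized.state_dict(), "kws_int8.pt")  # hypothetical filename
```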

Results

  • Successfully deployed CNN model achieving 99.79% validation accuracy
  • Real-time keyword recognition with sub-second latency
  • Robust system handling noise and unknown commands
  • Complete end-to-end pipeline from data collection to hardware control