Neural Speech Decoding: Brain-to-Text System

Overview

Developed a high-performance neural speech decoder that translates intracranial neural recordings (area 6v) into phoneme sequences for a participant with ALS. The project focused on optimizing a deep learning pipeline under strict real-time (causal) constraints. By moving step by step from a GRU baseline to an LSTM with a non-linear post-processing stack, the system reached a Phoneme Error Rate (PER) of 19.59%, compared with 23.60% for the baseline.

Project Architecture

System Baseline Architecture: The left panel illustrates the Training pipeline, where neural activity data undergoes feature preprocessing and augmentation before being fed into a Gated Recurrent Unit (GRU) model. The model is trained to predict phonemes using Connectionist Temporal Classification (CTC) loss. The right panel depicts the Inference pipeline, where new neural activity is preprocessed and encoded by the trained GRU model. The output is then decoded into text using a lexicon-constrained beam search algorithm that integrates an external language model.

Performance Metrics

Metric	Our Best Model (LSTM)	Baseline (GRU)
Phoneme Error Rate (PER)	19.59%	23.60%
Architecture	LSTM + Linderman Stack	Vanilla GRU
Loss Function	Focal CTC Loss	Standard CTC Loss
Optimizer and LR Schedule	AdamW + Scheduler	Adam (Fixed LR)
Training Time per Batch (T4)	~0.88 s	~0.76 s

Performance comparison on the Brain-to-Text Benchmark '24 test set. Our model reduces error by ~17% relative to the baseline.
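
For reference, PER and the relative reduction quoted above can be computed as in the sketch below. It assumes PER is the total phoneme-level edit distance divided by the total number of reference phonemes; the benchmark's exact aggregation is not stated here, and the function names are illustrative rather than taken from the project code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of labels)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Assumed definition: summed edit distance / summed reference length."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phonemes = sum(len(r) for r in refs)
    return total_errors / total_phonemes


def relative_reduction(baseline, model):
    """Fraction of the baseline error removed by the model."""
    return (baseline - model) / baseline
```

With the table's values, `relative_reduction(23.60, 19.59)` gives about 0.17, which is the ~17% figure in the caption.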

Technical Details

Architectural Optimization (GRU → LSTM):
- Transitioned the core recurrent backbone from GRU to LSTM to better capture long-range temporal dependencies in continuous speech.
- Integrated a “Linderman” Post-Stack (Linear → LayerNorm → Dropout → GELU) to increase representational capacity and stabilize gradient flow.
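
A minimal PyTorch sketch of the backbone described above: a uni-directional LSTM followed by the Linear → LayerNorm → Dropout → GELU post-stack and a phoneme output layer. The class name, layer sizes, dropout rate, and class count are placeholder assumptions, not the project's actual hyperparameters.

```python
import torch
import torch.nn as nn


class CausalLSTMDecoder(nn.Module):
    """Uni-directional LSTM + post-stack; all sizes are hypothetical."""

    def __init__(self, n_features=256, hidden=512, n_layers=2,
                 n_classes=41, dropout=0.2):
        super().__init__()
        # bidirectional=False keeps the model causal: the output at time t
        # depends only on inputs up to and including t.
        self.lstm = nn.LSTM(n_features, hidden, num_layers=n_layers,
                            batch_first=True, bidirectional=False,
                            dropout=dropout)
        # Post-stack in the order given in the text.
        self.post = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.Dropout(dropout),
            nn.GELU(),
        )
        self.out = nn.Linear(hidden, n_classes)  # per-frame phoneme logits

    def forward(self, x):
        # x: (batch, time, n_features) -> logits: (batch, time, n_classes)
        h, _ = self.lstm(x)
        return self.out(self.post(h))
```

Because every post-stack layer acts on each time step independently, the stack adds capacity without breaking the causal constraint.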
Loss Function Engineering:
- Implemented Focal CTC Loss (L_focal = (1 - p_t)^γ · L_CTC) to address class imbalance.
- This effectively down-weighted easy phonemes (like silence or common vowels) and focused learning on hard, ambiguous phonemes.
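
One way to realize the focal weighting above is sketched below. It assumes the weight is applied per utterance with p_t = exp(-L_CTC), i.e. the model's probability of the target sequence; the text does not say how p_t was defined in the project, and the γ value and function name are illustrative.

```python
import torch
import torch.nn.functional as F


def focal_ctc_loss(log_probs, targets, input_lengths, target_lengths,
                   gamma=2.0, blank=0):
    """Focal-weighted CTC loss (sketch).

    log_probs: (T, N, C) log-softmax outputs, as expected by F.ctc_loss.
    """
    # Per-utterance negative log-likelihood, shape (N,).
    nll = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=blank, reduction="none", zero_infinity=True)
    # Assumed definition: p_t is the probability of the target sequence.
    p_t = torch.exp(-nll)
    # Confident (easy) utterances have p_t near 1 and are down-weighted.
    return ((1.0 - p_t) ** gamma * nll).mean()
```

Setting `gamma=0` recovers the plain mean CTC loss, which makes for a simple sanity check when swapping it into an existing training loop.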
High-Performance Computing & Distillation:
- Conducted comparative training on NVIDIA Tesla T4 (Cloud) and RTX 5070 Ti (Local).
- Demonstrated that a High-Capacity GRU (1024 units) can nearly match the LSTM (19.65% vs. 19.59% PER) when trained for 6x more iterations, leveraging the raw throughput of modern GDDR7 hardware.

Key Features

Real-Time Constraint: Strictly enforced uni-directional processing (Causal Masking) to ensure the system is viable for live BCI applications.
Robustness: Enhanced generalization using Time Masking augmentation and increased White Noise injection.
Stabilized Training: Solved gradient explosion issues using Gradient Clipping (Norm=5.0) and a Sequential Learning Rate Scheduler (Warmup + Cosine Annealing).
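
The augmentation and training-stability features above can be sketched in PyTorch as follows. Only the clipping norm of 5.0 comes from the text; the noise level, mask width, learning rate, weight decay, and step counts are placeholder assumptions, and the helper names are hypothetical.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR


def augment(x, noise_std=0.1, max_mask=10):
    """x: (batch, time, features). Adds white noise, then zeroes one time span."""
    x = x + noise_std * torch.randn_like(x)            # white noise injection
    n_steps = x.shape[1]
    width = int(torch.randint(0, max_mask + 1, (1,)))  # time-mask length
    start = int(torch.randint(0, max(n_steps - width, 1), (1,)))
    x[:, start:start + width] = 0.0                    # time masking
    return x


def build_optimizer(model, lr=1e-3, warmup_steps=100, total_steps=1000):
    """AdamW with linear warmup followed by cosine annealing."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    warmup = LinearLR(opt, start_factor=0.1, total_iters=warmup_steps)
    cosine = CosineAnnealingLR(opt, T_max=total_steps - warmup_steps)
    sched = SequentialLR(opt, schedulers=[warmup, cosine],
                         milestones=[warmup_steps])
    return opt, sched


def train_step(model, opt, sched, x, y, loss_fn, max_norm=5.0):
    """One update with augmentation, gradient clipping, and an LR step."""
    opt.zero_grad()
    loss = loss_fn(model(augment(x)), y)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most max_norm.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
    sched.step()
    return loss.item(), float(grad_norm)
```

Here the scheduler is stepped once per batch, so `warmup_steps` and `total_steps` count optimizer updates rather than epochs.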

Technical Stack

Frameworks: PyTorch, NumPy, SciPy
Hardware: NVIDIA Tesla T4 (GCP), RTX 5070 Ti
Algorithms: Connectionist Temporal Classification (CTC), Recurrent Neural Networks (RNN/LSTM), AdamW Optimization
Data: Intracranial Neural Recordings (Brain-to-Text Benchmark '24)

Course

UCLA ECE C243A – Neural Signal Processing (Fall 2025) – Jonathan Kao