Neural Speech Decoding: Brain-to-Text System
Real-time phoneme prediction from intracranial neural signals achieving 19.59% PER using optimized LSTM architectures.
Overview
Developed a high-performance neural speech decoder to translate intracranial neural recordings (area 6v) into phoneme sequences for a participant with ALS. The project focused on optimizing a deep learning pipeline under strict real-time (causal) constraints. By systematically evolving the architecture from a GRU baseline to a specialized LSTM with a non-linear post-processing stack, the system achieved a Phoneme Error Rate (PER) of 19.59%, down from the 23.60% baseline (a relative reduction of about 17%).
Links
Project Files
Source Code & Report
Project Architecture
System Baseline Architecture: The left panel illustrates the Training pipeline, where neural activity data undergoes feature preprocessing and augmentation before being fed into a Gated Recurrent Unit (GRU) model. The model is trained to predict phonemes using Connectionist Temporal Classification (CTC) loss. The right panel depicts the Inference pipeline, where new neural activity is preprocessed and encoded by the trained GRU model. The output is then decoded into text using a lexicon-constrained beam search algorithm that integrates an external language model.
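To make the CTC output format concrete, the sketch below turns per-frame class scores into a phoneme index sequence. It uses greedy decoding (argmax per frame, merge repeats, drop blanks), which is a deliberately simplified stand-in and not the lexicon-constrained beam search with a language model described above. The blank index of 0, the function name, and the toy scores are assumptions made for illustration.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Collapse per-frame argmax labels into a phoneme index sequence."""
    decoded: List[int] = []
    previous = blank
    for scores in frame_scores:
        # Pick the most likely class for this time step.
        best = max(range(len(scores)), key=lambda k: scores[k])
        # CTC collapse rule: drop blanks and merge consecutive repeats.
        if best != blank and best != previous:
            decoded.append(best)
        previous = best
    return decoded


# Toy example with 3 classes (blank, phoneme 1, phoneme 2) over 7 frames.
frames = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
    [0.10, 0.10, 0.80],
    [0.80, 0.10, 0.10],
]
decoded = greedy_ctc_decode(frames)
print(decoded)
```

The blank between the second and fourth frames is what allows phoneme 1 to appear twice in the output; without it the repeated label would be merged into one.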
Performance Metrics
| Metric | Our Best Model (LSTM) | Baseline (GRU) |
|---|---|---|
| Phoneme Error Rate (PER) | 19.59% | 23.60% |
| Architecture | LSTM + Linderman Stack | Vanilla GRU |
| Loss Function | Focal CTC Loss | Standard CTC Loss |
| Optimizer Stability | AdamW + Scheduler | Adam (Fixed LR) |
| Training Efficiency | ~0.88s / batch (T4) | ~0.76s / batch (T4) |
Performance comparison on the Brain-to-Text Benchmark '24 test set. Our model reduces error by ~17% relative to the baseline.
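For reference, PER is typically computed as the edit distance between predicted and reference phoneme sequences, divided by the reference length. The sketch below assumes the corpus-level form (total edits over total reference phonemes); the benchmark's exact scoring script may differ, and the toy sequences are hypothetical. It also reproduces the relative-reduction arithmetic from the table.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps) -> float:
    """Total edit distance divided by total reference length (assumed form)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    length = sum(len(r) for r in refs)
    return errors / length


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved model."""
    return (baseline - improved) / baseline


# Hypothetical toy data: one perfect sentence, one with a dropped phoneme.
refs = [["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]
hyps = [["HH", "AH", "L", "OW"], ["W", "ER", "D"]]
toy_per = phoneme_error_rate(refs, hyps)

# Figures from the table above: (23.60 - 19.59) / 23.60.
reduction = relative_reduction(23.60, 19.59)
print(f"toy PER = {toy_per:.3f}, relative reduction = {reduction:.1%}")
```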
Technical Details
- Architectural Optimization (GRU → LSTM):
  - Transitioned the core recurrent backbone from GRU to LSTM to better capture long-range temporal dependencies in continuous speech.
  - Integrated a “Linderman” Post-Stack (Linear → LayerNorm → Dropout → GELU) to increase representational capacity and stabilize gradient flow.
- Loss Function Engineering:
  - Implemented Focal CTC Loss, $L_{\text{focal}} = (1 - p_t)^{\gamma} \cdot L_{\text{CTC}}$, to address class imbalance.
  - This down-weights easy phonemes (such as silence or common vowels) and focuses learning on hard, ambiguous phonemes.
- High-Performance Computing & Distillation:
  - Conducted comparative training on NVIDIA Tesla T4 (Cloud) and RTX 5070 Ti (Local).
  - Demonstrated that a High-Capacity GRU (1024 units) can nearly match the LSTM's error rate (19.65% PER vs. 19.59%) when trained for 6x more iterations, leveraging the raw throughput of modern GDDR7 hardware.
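The focal weighting in the Focal CTC Loss bullet can be sketched numerically. This is a minimal illustration under two assumptions that the write-up does not state: that p_t is the model's probability of the target sequence, recovered from the per-sample CTC negative log-likelihood as p_t = exp(-L_CTC), and that γ = 2. The function name is hypothetical.

```python
import math


def focal_ctc_loss(ctc_nll: float, gamma: float = 2.0) -> float:
    """Apply the focal factor (1 - p_t)^gamma to a per-sample CTC loss.

    Assumes p_t = exp(-ctc_nll), i.e. the probability the model assigns
    to the target sequence. gamma = 2.0 is an assumed default.
    """
    p_t = math.exp(-ctc_nll)
    return (1.0 - p_t) ** gamma * ctc_nll


# An "easy" sample (low loss, high p_t) is suppressed far more strongly
# than a "hard" sample (high loss, low p_t).
easy = focal_ctc_loss(0.1)
hard = focal_ctc_loss(3.0)
print(f"easy: 0.1 -> {easy:.5f}, hard: 3.0 -> {hard:.3f}")
```

With γ = 0 the factor is 1 and the expression reduces to standard CTC loss. In a PyTorch training loop the same weighting would be applied to the unreduced per-sample losses before averaging.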
Key Features
- Real-Time Constraint: Strictly enforced uni-directional processing (Causal Masking) to ensure the system is viable for live BCI applications.
- Robustness: Enhanced generalization using Time Masking augmentation and increased White Noise injection.
- Stabilized Training: Solved gradient explosion issues using Gradient Clipping (Norm=5.0) and a Sequential Learning Rate Scheduler (Warmup + Cosine Annealing).
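The two stabilization techniques in the last bullet can be sketched in plain Python. The step counts and base learning rate below are hypothetical; only the clipping norm of 5.0 comes from the text. In PyTorch, comparable behavior is available through `torch.nn.utils.clip_grad_norm_` and the schedulers in `torch.optim.lr_scheduler`, though the exact configuration used in the project is not given here.

```python
import math
from typing import List


def lr_at_step(step: int, base_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: List[float], max_norm: float = 5.0) -> List[float]:
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


# Hypothetical schedule: 10 warmup steps out of 100, peak LR 1e-3.
schedule = [lr_at_step(s, 1e-3, 10, 100) for s in range(101)]

# A gradient with norm 50 is scaled down to norm 5; direction is preserved.
clipped = clip_by_global_norm([30.0, 40.0])
print(schedule[0], schedule[10], schedule[100], clipped)
```

Clipping rescales the whole gradient vector rather than truncating individual components, so the update direction is unchanged and only its magnitude is bounded.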
Technical Stack
- Frameworks: PyTorch, NumPy, SciPy
- Hardware: NVIDIA Tesla T4 (GCP), RTX 5070 Ti
- Algorithms: Connectionist Temporal Classification (CTC), Recurrent Neural Networks (RNN/LSTM), AdamW Optimization
- Data: Intracranial Neural Recordings (Brain-to-Text Benchmark '24)
Course
UCLA ECE C243A – Neural Signal Processing (Fall 2025) – Jonathan Kao