CUDA Optimized CNN

Overview

Designed and implemented an optimized CUDA-based convolutional neural network, focusing on the forward pass of the convolutional layers in a modified LeNet-5 architecture. The goal was to minimize inference time for image classification, benchmarked on the Fashion-MNIST dataset.

Implementation Architecture

Architecture diagram showing the CUDA-based CNN implementation with memory management and computation optimization.

Technical Details

Implemented GPU-based forward convolution in CUDA C++ using a structured prolog-kernel-epilog approach
Handled device memory allocation and input transfer in the prolog, convolution computation in the kernel, and output transfer and cleanup in the epilog
Matched the output of the CPU reference implementation while tuning performance with Nsight profiling tools
Applied advanced GPU programming techniques including:
- CUDA streams for concurrent execution
- GEMM kernels for matrix operations
- Kernel fusion for reduced overhead
- Optimized memory access patterns
- Efficient data parallelism strategies
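
The prolog-kernel-epilog structure described above can be sketched roughly as follows. This is a minimal, unoptimized illustration rather than the project's actual code: the names `conv_forward_kernel` and `conv_forward_gpu`, the `CUDA_CHECK` macro, the tensor layouts, and the stride-1, no-padding convolution are all assumptions made for the example. It uses one thread per output element and none of the listed optimizations (streams, GEMM, fusion).

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Direct forward convolution, one thread per output element.
// Assumed layouts: in [B][C][H][W], mask [M][C][K][K], out [B][M][H_out][W_out].
__global__ void conv_forward_kernel(float* out, const float* in, const float* mask,
                                    int M, int C, int H, int W, int K) {
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;
    const int pix = blockIdx.x * blockDim.x + threadIdx.x;  // flattened (h, w)
    if (pix >= H_out * W_out) return;
    const int h = pix / W_out;
    const int w = pix % W_out;
    const int m = blockIdx.y;  // output feature map
    const int b = blockIdx.z;  // image in the batch

    float acc = 0.0f;
    for (int c = 0; c < C; ++c)
        for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q) {
                const size_t i = ((size_t(b) * C + c) * H + (h + p)) * W + (w + q);
                const size_t j = ((size_t(m) * C + c) * K + p) * K + q;
                acc += in[i] * mask[j];
            }
    out[((size_t(b) * M + m) * H_out + h) * W_out + w] = acc;
}

// Host wrapper showing the three phases. B and M must each be <= 65535
// because they are mapped to gridDim.z and gridDim.y.
std::vector<float> conv_forward_gpu(const std::vector<float>& in,
                                    const std::vector<float>& mask,
                                    int B, int M, int C, int H, int W, int K) {
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;
    std::vector<float> out(size_t(B) * M * H_out * W_out);
    float *d_in = nullptr, *d_mask = nullptr, *d_out = nullptr;

    // Prolog: allocate device memory and copy inputs to the device.
    CUDA_CHECK(cudaMalloc(&d_in, in.size() * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_mask, mask.size() * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_out, out.size() * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_in, in.data(), in.size() * sizeof(float),
                          cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_mask, mask.data(), mask.size() * sizeof(float),
                          cudaMemcpyHostToDevice));

    // Kernel: launch enough 256-thread blocks to cover every output pixel.
    const dim3 block(256);
    const dim3 grid((H_out * W_out + block.x - 1) / block.x, M, B);
    conv_forward_kernel<<<grid, block>>>(d_out, d_in, d_mask, M, C, H, W, K);
    CUDA_CHECK(cudaGetLastError());

    // Epilog: copy the result back (this also waits for the kernel) and free.
    CUDA_CHECK(cudaMemcpy(out.data(), d_out, out.size() * sizeof(float),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_mask));
    CUDA_CHECK(cudaFree(d_out));
    return out;
}
```

Keeping allocation and transfers outside the kernel phase makes it possible to time and profile the kernel on its own, and gives a natural place to later overlap copies and computation with CUDA streams.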

Key Achievements

Achieved the target inference time of ≤ 80 ms for 10,000 images from the Fashion-MNIST dataset
Optimized memory access patterns for improved throughput
Implemented efficient data parallelism strategies
Reduced computational overhead through kernel fusion and GEMM-based convolution
Validated GPU output against the CPU reference implementation for correctness
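
Validation against a CPU implementation can be sketched with a straightforward host-side reference and an element-wise tolerance check. The function names `conv_forward_cpu` and `all_close`, the tensor layouts, and the default tolerance of 1e-4 are assumptions for illustration, not details taken from the project.

```cuda
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference: direct convolution, stride 1, no padding.
// Assumed layouts: in [B][C][H][W], mask [M][C][K][K], out [B][M][H_out][W_out].
std::vector<float> conv_forward_cpu(const std::vector<float>& in,
                                    const std::vector<float>& mask,
                                    int B, int M, int C, int H, int W, int K) {
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;
    std::vector<float> out(size_t(B) * M * H_out * W_out, 0.0f);
    for (int b = 0; b < B; ++b)
        for (int m = 0; m < M; ++m)
            for (int h = 0; h < H_out; ++h)
                for (int w = 0; w < W_out; ++w) {
                    float acc = 0.0f;
                    for (int c = 0; c < C; ++c)
                        for (int p = 0; p < K; ++p)
                            for (int q = 0; q < K; ++q)
                                acc += in[((size_t(b) * C + c) * H + (h + p)) * W + (w + q)] *
                                       mask[((size_t(m) * C + c) * K + p) * K + q];
                    out[((size_t(b) * M + m) * H_out + h) * W_out + w] = acc;
                }
    return out;
}

// True when both tensors have the same size and every pair of elements
// differs by at most tol. A tolerance is used because GPU and CPU results
// can differ slightly when floating-point sums are accumulated in a
// different order.
bool all_close(const std::vector<float>& a, const std::vector<float>& b,
               float tol = 1e-4f) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > tol) return false;
    return true;
}
```

In use, the GPU result for a batch would be copied back to the host and passed to `all_close` together with the output of `conv_forward_cpu` for the same inputs.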

Technical Stack

CUDA C++
Parallel Programming
GPU Computing
Performance Optimization
Deep Learning
Computer Vision
Convolutional Neural Networks
Nsight Profiling Tools

Project Advisor

Prof. Volodymyr Kindratenko, UIUC