
ASR Toolkit

A minimal, educational repository that implements a complete Automatic Speech Recognition (ASR) pipeline based on Connectionist Temporal Classification (CTC). The toolkit lets you train CTC models on a single GPU/CPU or in a distributed multi-GPU setup, and includes decoding and evaluation.

Features

  • CTC Model: Bidirectional LSTM-based CTC model for speech recognition
  • Training: Single GPU/CPU and distributed multi-GPU training
  • Decoding: Greedy CTC decoding with optional language model integration
  • Evaluation: Word Error Rate (WER) computation
  • Data Processing: LibriSpeech dataset support with mel-spectrogram features
  • Augmentation: SpecAugment for improved model robustness
  • Monitoring: Weights & Biases (wandb) integration for experiment tracking
  • Mixed Precision: Automatic Mixed Precision (AMP) support for faster training

Installation

  1. Clone the repository:
git clone <repository-url>
cd asr_toolkit
  2. Create a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Install additional audio libraries (for FLAC support):
# On macOS
brew install flac

# On Ubuntu
sudo apt-get install flac

# Or install soundfile
pip install soundfile

Quick Start

Training

Single GPU/CPU Training

python train_ctc.py \
  --data_root /path/to/LibriSpeech \
  --train_subset train-clean-100 \
  --valid_subset dev-clean \
  --epochs 10 \
  --batch_size 32 \
  --lr 1e-3 \
  --device cuda \
  --wandb_project ctc-minilab \
  --wandb_run my-experiment

Distributed Multi-GPU Training

torchrun --nproc_per_node=4 train_distributed.py \
  --data_root /path/to/LibriSpeech \
  --train_subset train-clean-100 \
  --valid_subset dev-clean \
  --epochs 10 \
  --batch_size 8 \
  --lr 1e-3 \
  --backend nccl \
  --wandb_project ctc-minilab \
  --wandb_run distributed-experiment

Decoding and Evaluation

Greedy Decoding

python decode_ctc.py \
  --checkpoint_path ./exp_ctc_bilstm/best.pt \
  --data_root /path/to/LibriSpeech \
  --subset test-clean \
  --output_file results.txt
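
Under the hood, greedy CTC decoding takes the most likely label for each frame, collapses consecutive repeats, and drops blank symbols. A minimal sketch of that rule follows; it assumes blank index 0 and is not necessarily the exact code in decode_ctc.py:

import torch

def greedy_ctc_decode(log_probs, blank=0):
    # log_probs: (frames, vocab) log-probabilities for a single utterance.
    best = log_probs.argmax(dim=-1).tolist()   # best label per frame
    tokens, prev = [], blank
    for t in best:
        if t != blank and t != prev:           # collapse repeats, drop blanks
            tokens.append(t)
        prev = t
    return tokens                              # token ids; map back to text with the tokenizer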

Decoding with Language Model

python decode_ctc.py \
  --checkpoint_path ./exp_ctc_bilstm/best.pt \
  --data_root /path/to/LibriSpeech \
  --subset test-clean \
  --lm_path /path/to/language_model.arpa \
  --output_file results_with_lm.txt

Dataset

This toolkit uses the LibriSpeech dataset, which is automatically downloaded and processed:

  • train-clean-100: 100 hours of clean speech for training
  • dev-clean: Development set for validation
  • test-clean: Test set for evaluation

The dataset is processed to extract 80-dimensional mel-spectrogram features with log-magnitude scaling.
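
For reference, log-mel features of this kind can be computed with torchaudio roughly as follows. This is a sketch using the default --n_mels/--n_fft/--hop_length values listed under Model Parameters and a placeholder file name, not the exact code in features.py:

import torch
import torchaudio

# 80-dim log-mel features at 16 kHz (the LibriSpeech sample rate).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)

waveform, sr = torchaudio.load("sample.flac")   # placeholder path; shape (channels, samples)
log_mel = torch.log(mel(waveform) + 1e-6)       # log-magnitude scaling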

Model Architecture

The CTC model consists of the following components (a minimal sketch follows the list):

  • Feature Extractor: Mel-spectrogram with log-magnitude scaling
  • Encoder: Bidirectional LSTM with 256 hidden units
  • Projection: Linear layer mapping to vocabulary size
  • CTC Loss: Connectionist Temporal Classification loss
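
A minimal PyTorch sketch of this architecture is shown below. The layer count, vocabulary size, and blank index are illustrative assumptions; the actual model lives in model_ctc.py.

import torch
import torch.nn as nn

class BiLSTMCTC(nn.Module):
    # Bidirectional LSTM encoder + linear projection to the vocabulary (CTC head).
    def __init__(self, n_mels=80, hidden=256, vocab_size=32, num_layers=3):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=num_layers,
                               bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, vocab_size)  # 2x for the two directions

    def forward(self, feats):
        # feats: (batch, frames, n_mels) -> log-probs: (batch, frames, vocab)
        out, _ = self.encoder(feats)
        return self.proj(out).log_softmax(dim=-1)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # blank id 0 is an assumption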

Training Features

Data Augmentation

  • SpecAugment: Frequency and time masking for improved robustness (see the sketch after this list)
  • CMVN: Cepstral Mean and Variance Normalization
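
As an illustration of these two steps, the sketch below uses torchaudio's masking transforms on a (channels, n_mels, frames) log-mel tensor. The mask sizes are placeholder values and need not match what features.py actually uses:

import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)

def cmvn(log_mel, eps=1e-8):
    # Per-utterance mean/variance normalization over the time axis.
    mean = log_mel.mean(dim=-1, keepdim=True)
    std = log_mel.std(dim=-1, keepdim=True)
    return (log_mel - mean) / (std + eps)

def augment(log_mel):
    # SpecAugment-style masking, applied during training only.
    return time_mask(freq_mask(cmvn(log_mel)))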

Optimization

  • AdamW Optimizer: With gradient clipping (a training-step sketch follows this list)
  • Learning Rate Scaling: Automatic scaling for distributed training
  • Mixed Precision: AMP support for faster training on modern GPUs
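
A hedged sketch of a single optimization step combining these pieces (AdamW, gradient clipping, AMP) is shown below. It reuses the BiLSTMCTC and ctc_loss sketches from the architecture section; the clipping threshold and variable names are assumptions, and train_ctc.py may differ:

import torch

model = BiLSTMCTC().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

def train_step(feats, targets, feat_lens, target_lens):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        log_probs = model(feats)                 # (batch, frames, vocab)
        # nn.CTCLoss expects (frames, batch, vocab)
        loss = ctc_loss(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                   # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()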

Monitoring

  • Weights & Biases: Experiment tracking and visualization (a minimal logging example follows this list)
  • Gradient Monitoring: Gradient norm tracking
  • Loss Tracking: Training and validation loss monitoring
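
A minimal example of the kind of logging this implies is shown below; the metric names and values are illustrative, not necessarily those used by train_ctc.py. Note that torch.nn.utils.clip_grad_norm_ returns the total gradient norm, which is a convenient value to log:

import wandb

run = wandb.init(project="ctc-minilab", name="my-experiment")

# Inside the training loop you would log per-step metrics, e.g.:
wandb.log({"train/loss": 1.23, "train/grad_norm": 4.5})

run.finish()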

Evaluation

The toolkit reports the following evaluation metrics (a minimal WER implementation is sketched after the list):

  • Word Error Rate (WER): Primary metric for ASR evaluation
  • Character Error Rate (CER): Character-level accuracy
  • Decoding Speed: Inference time measurement
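
WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words; CER is the same quantity computed over characters. A minimal reference implementation (not necessarily the one used by decode_ctc.py):

def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate: Levenshtein distance over words / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 reference words = 0.33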

File Structure

asr_toolkit/
├── train_ctc.py          # Single GPU/CPU training
├── train_distributed.py  # Distributed training
├── decode_ctc.py         # Decoding and evaluation
├── model_ctc.py          # CTC model definition
├── data.py               # Dataset and data loading
├── features.py           # Feature extraction and augmentation
├── tokenizer.py          # Character tokenizer
├── utils.py              # Utility functions
├── requirements.txt      # Python dependencies
└── README.md            # This file

Configuration

Training Parameters

  • --data_root: Path to LibriSpeech dataset
  • --train_subset: Training subset (e.g., train-clean-100)
  • --valid_subset: Validation subset (e.g., dev-clean)
  • --epochs: Number of training epochs
  • --batch_size: Batch size
  • --lr: Learning rate
  • --device: Device (cuda/cpu)
  • --num_workers: Number of data loading workers

Model Parameters

  • --hidden: LSTM hidden size (default: 256)
  • --n_mels: Number of mel filters (default: 80)
  • --n_fft: FFT size (default: 400)
  • --hop_length: Hop length (default: 160)

Troubleshooting

Common Issues

  1. FLAC Backend Error: Install FLAC support or use num_workers=0
  2. MPS CTC Loss Error: Set PYTORCH_ENABLE_MPS_FALLBACK=1 for Apple Silicon
  3. Segmentation Fault: Reduce num_workers or use num_workers=0
  4. CUDA Out of Memory: Reduce batch_size or use gradient accumulation

Performance Tips

  • Use pin_memory=True for faster GPU data transfer
  • Enable torch.backends.cudnn.benchmark = True when input sizes are roughly constant
  • Use mixed precision training for faster training on modern GPUs
  • Adjust num_workers based on your system (start with 0, then increase); see the DataLoader sketch below
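
As a sketch of how these tips map onto PyTorch settings; the dataset here is a dummy placeholder, whereas the toolkit uses the LibriSpeech dataset and collate function from data.py:

import torch
from torch.utils.data import DataLoader, TensorDataset

torch.backends.cudnn.benchmark = True    # beneficial when input shapes are roughly constant

# Dummy placeholder dataset; the real dataset and padding collate_fn live in data.py.
dataset = TensorDataset(torch.randn(64, 100, 80), torch.randint(1, 30, (64, 20)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,       # start at 0, then increase based on your system
    pin_memory=True,     # faster host-to-GPU transfer when training on CUDA
)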

Educational Purpose

This repository is designed for educational purposes to understand:

  • CTC loss and its applications in ASR
  • Distributed training with PyTorch
  • Speech feature extraction and preprocessing
  • Model evaluation and decoding strategies
  • End-to-end ASR pipeline implementation

License

This project is for educational purposes. Please check the original LibriSpeech dataset license for commercial use.

Contributing

This is an educational repository. Feel free to fork and modify for your learning purposes.

Acknowledgments

  • LibriSpeech dataset
  • PyTorch team for the excellent framework
  • Weights & Biases for experiment tracking
