Releases: pplmx/llm

v0.0.5

08 Jan 12:42

Added

  • SFT (Supervised Fine-tuning):

    • SFTDataset for instruction tuning with input masking
    • SFTDataModule for data loading
    • SFTTask registered as --task sft in CLI
    • Tests for all SFT components
  • DPO (Direct Preference Optimization):

    • DPODataset handling chosen/rejected pairs
    • DPODataModule for preference data loading
    • DPOTask with reference model management and DPO loss
    • Registered as --task dpo in CLI
    • Tests for all DPO components
  • Continuous Batching Engine (Serving):

    • src/llm/serving/engine.py with ContinuousBatchingEngine class
    • Iteration-level scheduling via Scheduler and SlotAllocator
    • Pre-allocated KV cache pool for efficient memory management
    • Supports mixed prefill/decode batching with automatic padding
    • Clean API: requires model and tokenizer instances upfront
    • src/llm/serving/scheduler.py with FCFS scheduling logic
  • LoRA (Low-Rank Adaptation):

    • src/llm/core/lora.py with LoRALinear class for parameter-efficient fine-tuning
    • apply_lora(), merge_lora(), get_lora_parameters() helper functions
    • Device/dtype handling for CUDA compatibility
    • 17 tests covering training and weight merging
  • QLoRA (Quantized LoRA):

    • src/llm/core/qlora.py with QLoRALinear class
    • NF4 4-bit quantization for base weights (~4x memory reduction)
    • LoRA adapters remain in fp16/bf16 for training stability
    • apply_qlora() and get_qlora_parameters() helpers
  • RoPE (Rotary Position Embedding):

    • src/llm/core/rope.py with RotaryPositionEmbedding class
    • Linear, dynamic, and NTK-aware scaling methods for extended context
    • apply_rotary_pos_emb(), get_rope_scaling_factor() utilities
    • 15 tests
  • ALiBi (Attention with Linear Biases):

    • src/llm/core/alibi.py with ALiBiPositionBias class
    • get_alibi_slopes(), build_alibi_bias() functions
    • Cached bias computation for efficiency
    • 13 tests
  • Sliding Window Attention:

    • window_size parameter in scaled_dot_product_attention
    • Propagated through MultiHeadAttention, TransformerBlock, DecoderModel
    • Reduces memory for long sequences by limiting attention scope
    • 10 tests
  • KV Cache Optimization:

    • src/llm/core/kv_cache.py with KVCache class for pre-allocated cache buffers
    • In-place updates during autoregressive generation (avoids O(n²) memory operations)
    • Integrated into MHA, TransformerBlock, DecoderModel
    • Factory method KVCache.from_model_config() for easy instantiation
    • Backward compatible: legacy past_key_value tuple format still works
  • E2E Testing Infrastructure:

    • tests/e2e/ directory with comprehensive pipeline tests
    • test_training.py, test_sft.py, test_dpo.py
    • test_gradient_accumulation.py, test_resume_training.py
    • Advanced inference and callback tests
  • Documentation:

    • notebooks/quick_start.ipynb interactive tutorial
    • Covers model building, training, inference, and advanced features
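
Two of the additions above, SFT input masking and the DPO loss, are compact enough to sketch in plain Python. This is an illustration of the general techniques only, not the code in SFTDataset or DPOTask: the function names, the -100 ignore index (the default ignore_index of PyTorch's cross-entropy), and beta=0.1 are all assumptions made for the example.

```python
import math

# Assumed sentinel: PyTorch's cross-entropy skips targets equal to -100 by default.
IGNORE_INDEX = -100


def mask_prompt_labels(prompt_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Build (input_ids, labels) so only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved away from the frozen reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2. It falls below that as the policy favors the chosen response more strongly than the reference does.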

Changed

  • SDPA Refactoring:

    • Consolidated scaled_dot_product_attention wrapper into src/llm/core/attn/sdpa.py
    • Refactored MultiHeadAttention and MultiLatentAttention to use common sdpa wrapper
    • Archived custom implementation to _learning/03_lab/experiments/custom_sdpa.py
  • Test Suite Refactoring:

    • Organized test files into subdirectories (tests/training/, tests/inference/, etc.)
    • Converted to functional testing style (real components over mocks)
    • Added shared fixtures in tests/conftest.py
    • Test count: 385 → 432
  • TrainingEngine:

    • Support for dictionary batches in training/validation loops
    • Gradient accumulation implementation
  • DPO Reference Model:

    • ref_model is now created by reconstructing the model instead of using deepcopy
  • Documentation:

    • Added docs/README.md as documentation entry point
    • Added MkDocs Material configuration (mkdocs.yml) for documentation site
    • Added GitHub Actions workflow for automatic GitHub Pages deployment
    • Added guide-finetuning.md (LoRA/QLoRA) and guide-inference.md (KVCache/GQA/Continuous Batching)
    • Enhanced architecture.md with detailed component diagrams and data flow analysis
    • Updated ROADMAP Phase 10.2 (Continuous Batching complete)
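
For readers unfamiliar with gradient accumulation, the idea behind the TrainingEngine change can be shown on a one-parameter model. The sketch below is not TrainingEngine code: the helper names and the toy model y = w * x are hypothetical, and it assumes equally sized micro-batches.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    num_micro_batches = len(starts)
    total = 0.0
    for start in starts:
        micro_xs = xs[start:start + micro_batch_size]
        micro_ys = ys[start:start + micro_batch_size]
        # Scaling here mirrors dividing the loss by the accumulation steps
        # before calling backward() in a real training loop.
        total += grad_mse(w, micro_xs, micro_ys) / num_micro_batches
    return total


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```

With equally sized micro-batches the accumulated gradient matches the full-batch gradient, which is why accumulation can emulate a larger batch when memory is limited.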

v0.0.4

07 Jan 07:33

Added

  • Gradient Checkpointing:

    • Memory-efficient training via gradient_checkpointing parameter in DecoderModel
    • enable_gradient_checkpointing() / disable_gradient_checkpointing() methods
    • Automatic incompatibility check with use_cache=True
  • E2E Pipeline Automation:

    • scripts/e2e_pipeline.py for automated Train → Evaluate → Inference workflow
    • src/llm/utils/e2e.py with reusable E2E core functions (E2EConfig, E2EResult, run_e2e_pipeline)
    • Rich progress UI and configurable CLI options
  • OpenAI-Compatible Chat API (/v1/chat/completions):

    • Compatible with official OpenAI Python SDK
    • Streaming and non-streaming chat completions
    • Bearer token authentication support
    • Multi-turn conversation handling
    • 8 new test cases for compatibility layer
  • Batch Inference:

    • batch_generate function in inference.py with left-padding and batched forward pass
    • BatchGenerationRequest / BatchGenerationResponse schemas
    • /batch_generate API endpoint
    • 3 tests for batch inference (basic, single, empty)
  • Request Queue and Concurrency Control:

    • max_concurrent_requests and request_timeout in ServingConfig
    • asyncio.Semaphore for concurrency limiting
    • asyncio.timeout for request timeout handling (504 response)
  • CLI Entry Points:

    • llm-train command for training models
    • llm-serve command for starting inference server
  • Testing Infrastructure:

    • Pytest markers using decorators: quick, slow, heavy, e2e
    • MoE integration tests (6 tests for expert routing, gradient flow)
    • E2E pipeline tests (full workflow, streaming consistency)
    • Gradient checkpointing tests (8 tests)
    • Total test count: 296 → 337
  • Examples Directory:

    • inference_demo.py for basic text generation
    • openai_client_demo.py for OpenAI SDK usage
  • Documentation:

    • scripts/README.md documenting all available scripts
    • HFTokenizer example in usage.md
    • Updated root README.md with links to Examples and Scripts
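
The request queue and concurrency control items above follow a common asyncio pattern: a semaphore caps in-flight requests and a timeout turns slow requests into a 504. The sketch below is a minimal illustration under assumptions, not the serving module's code. run_limited and demo are hypothetical names, and asyncio.wait_for is used in place of asyncio.timeout so that the example also runs on Python versions before 3.11.

```python
import asyncio


async def run_limited(handler, semaphore, timeout_s):
    """Run one request under a concurrency cap and map a timeout to HTTP 504."""
    async with semaphore:
        try:
            result = await asyncio.wait_for(handler(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return 504, None
        return 200, result


async def demo():
    # Stands in for max_concurrent_requests from ServingConfig.
    semaphore = asyncio.Semaphore(2)

    async def fast():
        await asyncio.sleep(0.01)
        return "ok"

    async def slow():
        await asyncio.sleep(1.0)
        return "never returned"

    return await asyncio.gather(
        run_limited(fast, semaphore, timeout_s=0.2),
        run_limited(slow, semaphore, timeout_s=0.2),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the timeout covers only the handler, not the time spent waiting for a semaphore slot. Whether the actual server also counts queueing time against request_timeout is not stated in these notes.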

Changed

  • Makefile Reorganization:

    • make test now runs all tests by default
    • make test-fast for daily development (excludes heavy/e2e)
    • make test-quick for rapid iteration (~6s)
    • make test-cov for CI with coverage and allure reports
    • Removed redundant test-all and test-integration
  • CLI Standardization:

    • CLI parameters changed from snake_case to kebab-case (--file-path, --batch-size)
    • Replaced typer with typer-slim[standard] for reduced dependencies
  • Code Quality Improvements:

    • Translated Chinese docstrings to English in serving module
    • Removed ~75 lines of redundant comments
    • Simplified section comments while preserving algorithm clarity
  • Documentation Refactoring:

    • Eliminated redundancy between README, usage.md, and development.md
    • Clear document responsibility separation
    • Updated all docs to use new CLI commands
    • Enhanced package metadata (keywords, classifiers)
  • Module Exports:

    • Enhanced llm/__init__.py with public API exports (DecoderModel, generate, etc.)
    • Enhanced llm.serving module exports (LLMEngine, ServingConfig, OpenAI schemas)

Fixed

  • Removed obsolete TODO comment in engine.py
  • Removed duplicate num_kv_heads field in ModelConfig
  • Fixed MD051/link-fragments in tutorial-cpu-llm.md and faq.md
  • Fixed train.py task registration for lm task

v0.0.3

23 Dec 06:32

Added

  • Inference Serving:

    • Production-ready REST API with FastAPI
    • Streaming support via Server-Sent Events (SSE)
    • Advanced sampling strategies (nucleus sampling/top-p, repetition penalty)
    • Prometheus metrics endpoint for monitoring
    • API key authentication (X-API-Key header)
    • Structured logging with python-json-logger
    • Real PyTorch model weights loading from checkpoint files
    • Pickled tokenizer object loading support
  • Component Registry:

    • Automatic component registration system (ComponentRegistry)
    • Core components (MHA, MLP, MoE) auto-registered via side-effect imports
    • Prevents "component not found" errors in simplified scripts
  • Data Abstraction:

    • Formalized BaseTokenizer protocol
    • BaseDataModule abstraction for flexible data handling
    • Environment variable configuration support (e.g., LLM_TRAINING__EPOCHS)
  • Testing & CLI:

    • --num-samples flag in train.py for rapid regression testing
    • Scheduler edge case tests (test_scheduler_edge_cases.py)
    • Validation logging tests (test_engine_logging.py)
    • Component registry tests (test_init.py)
    • Model loading verification tests
    • Auto-device detection in training scripts (prioritizes CUDA)
  • Documentation:

    • Comprehensive usage guide (docs/usage.md)
    • Architecture documentation (docs/architecture.md)
    • Engineering documentation (ADRs, PR templates, FAQ)
    • VS Code configuration and extensions
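
Of the sampling strategies listed above, nucleus (top-p) sampling is the easiest to misread, so a plain-Python sketch of the filtering step may help. It illustrates the general technique and is not the project's inference code. top_p_filter is a hypothetical helper that works on a list of probabilities rather than a logits tensor.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens; everything else gets probability 0.
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7)
```

Here the two most likely tokens already cover the 0.7 threshold, so the last two are zeroed out and the first two are rescaled to sum to 1. The next token would then be sampled from the filtered distribution.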

Changed

  • Architecture Modernization:

    • Migrated to Pydantic v2 (BaseSettings, BaseModel) for configuration
    • Fully typed and validated configuration system
    • CLI migration from argparse to typer for better UX
  • Naming Standardization:

    • Unified ffn_hidden_size → intermediate_size across codebase
    • Standardized input parameter x → hidden_states in forward methods
    • Applied to MLP, LayerNorm, RMSNorm, DecoderModel, TransformerBlock
    • Updated all 309 tests to reflect API changes
  • Code Quality:

    • Standardized punctuation in documentation (full-width → half-width)
    • Improved type hints and documentation comments
    • Refactored TransformerBlock.forward for clarity

Fixed

  • Core Bugs:

    • CosineAnnealingLR T_max calculation when epochs == warmup_epochs (ZeroDivisionError)
    • TrainingEngine validation logging crash when gradient_norms is empty (IndexError)
    • PAD token generation issue in inference (logits masking)
    • SyntheticDataModule prefetch_factor handling with num_workers=0
    • TransformerBlock shared norm instance bug (independent norm1/norm2)
    • Scheduler/optimizer step order warnings in tests
    • PositionalEncoding support for start_pos in incremental generation
    • MLP SwiGLU operation order for numerical consistency
    • Prompt truncation respecting max_seq_len with new tokens
    • Auto AMP dtype resolution for CPU-only environments
  • Registry & Imports:

    • Package auto-registration via import llm
    • Component not found errors in simplified execution

v0.0.2

23 Dec 06:32

0.0.3 - 2025-12-23

🚀 Features

  • (inference) Add simple autoregressive generation loop - (4585a2e)
  • (scripts) Auto-detect device in train_simple_decoder.py - (ff2ad6f)
  • (scripts) Add best_train.py for efficient distributed training - (b345f82)
  • (scripts) Add optimized_train_02.py for efficient distributed training - (e585cf2)
  • (scripts) Add optimized DDP training script - (02787e6)
  • Support real weights loading and serving enhancements - (1da2b8a)
  • Implement production-ready inference serving with streaming and observability - (2c21937)
  • Expand LLM functional test suite and fix core regressions - (60ce0cd)
  • Finalize architecture modernization and code quality cleanup - (7954f61)
  • Enhance inference performance and project quality tools - (03d640f)
  • Implement and integrate Mixture of Experts (MoE) - (deaed1b)
  • Enhance training framework with modularity and extensibility - (592e1d9)
  • Introduce modular training framework - (c944960)
  • Add a training sample - (a5417f2)
  • Remove something redundant - (31393e9)
  • Remove sys.path manipulations. - (a4e4c76)
  • Implement PositionalEncoding. - (f34f019)
  • Implement core components for a custom LLM framework - (2a4881f)
  • Introduce moe - (a26b0c0)

🐛 Bug Fixes

  • (attn) Correct QKV splitting in MultiHeadAttention - (9b5d634)
  • (core) Ensure distinct norm instances in TransformerBlock - (5e7f4cc)
  • (core/mlp) Move provided norm instance to target device/dtype to avoid device mismatch - (6a52399)
  • Test_engine_auto_amp_dtype case failed on cuda env - (a6406ca)
  • Failed to run on cpu-only env - (4faee5c)
  • Resolve multiple training stability and correctness issues - (334d76d)
  • ARG001 Unused function argument: dummy_input - (ff461df)

🚜 Refactor

  • (arch) Modernize architecture with registry, data abstraction, and robust config - (6f74012)
  • (core) Unify deep learning naming conventions - (d2c9237)
  • (moe) Optimize it - (3a91204)
  • (moe) Optimize it - (7fee39e)
  • (scripts) Migrate CLI scripts to typer - (f86ad16)
  • (scripts) Implement train_02.py with rich logging and optimized config management - (47cf2f8)
  • (scripts) Improve code readability and structure in optimized_train.py - (69eb300)
  • (scripts) Optimize and modularize PyTorch DDP training script - (e005d6a)
  • Consolidate environment variable handling and simplify code - (331bbdd)

📚 Documentation

  • (changelog) Enhance v0.0.2 release notes with detailed information - (cdd384e)
  • (changelog) Prepare v0.0.3 release notes - (b3bc1ab)
  • (llm) Add attn - (222095c)
  • Add usage - (039bf76)
  • Optimize documentation comments and standardize punctuation - (c382e34)
  • Add engineering docs and tooling config - (f3818d8)
  • Comprehensive documentation overhaul and roadmap expansion - (5bd6dac)
  • Update the roadmap - (e9d6176)
  • Add roadmap for learning - (ab81c7d)
  • Add ROADMAP - (c0e57d4)
  • Update GEMINI.md - (cd30474)
  • Standardize training documentation filenames to kebab-case - (de4a99c)
  • Update project documentation - (ebf95e6)
  • Streamline documentation entry point - (5b197fc)
  • Refactor and standardize documentation structure - (b2a46ee)
  • Refine the structure - (e4097d1)
  • Create comprehensive training framework documentation - (3ae4ace)
  • Add GEMINI-example.md template - (8eaf0f4)
  • Update GEMINI.md with commit workflow - (bdbde1a)
  • Add some docs about transformer - (16774c9)
  • Add moe - (f82b734)

🎨 Styling

🧪 Testing

  • (core) Make device comparisons robust by comparing device.type - (b5229ed)
  • Increase the coverage - (85be1e7)

⚙️ Miscellaneous Tasks
