# Releases: pplmx/llm

## v0.0.5

### Added

- **SFT (Supervised Fine-tuning)**:
  - `SFTDataset` for instruction tuning with input masking
  - `SFTDataModule` for data loading
  - `SFTTask` registered as `--task sft` in CLI
  - Tests for all SFT components
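
  The "input masking" mentioned above is the common SFT convention of excluding prompt tokens from the loss; a minimal illustration of the idea (not the project's exact `SFTDataset` code):

  ```python
  import torch

  IGNORE_INDEX = -100  # label value ignored by torch.nn.CrossEntropyLoss

  def build_sft_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
      """Mask prompt tokens so the loss is computed only on the response."""
      labels = input_ids.clone()
      labels[:prompt_len] = IGNORE_INDEX
      return labels
  ```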
- **DPO (Direct Preference Optimization)**:
  - `DPODataset` handling chosen/rejected pairs
  - `DPODataModule` for preference data loading
  - `DPOTask` with reference model management and DPO loss
  - Registered as `--task dpo` in CLI
  - Tests for all DPO components
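
  For reference, the standard DPO objective over summed per-sequence log-probabilities (a sketch of the general formula, not necessarily the project's exact `DPOTask` code):

  ```python
  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
      """Standard DPO loss: maximize the margin between chosen and rejected responses."""
      # Implicit rewards: log-ratio of the policy vs. the frozen reference model
      chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
      rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
      return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
  ```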
- **Continuous Batching Engine (Serving)**:
  - `src/llm/serving/engine.py` with `ContinuousBatchingEngine` class
  - Iteration-level scheduling via `Scheduler` and `SlotAllocator`
  - Pre-allocated KV cache pool for efficient memory management
  - Supports mixed prefill/decode batching with automatic padding
  - Clean API: requires `model` and `tokenizer` instances upfront
  - `src/llm/serving/scheduler.py` with FCFS scheduling logic
- **LoRA (Low-Rank Adaptation)**:
  - `src/llm/core/lora.py` with `LoRALinear` class for parameter-efficient fine-tuning
  - `apply_lora()`, `merge_lora()`, `get_lora_parameters()` helper functions
  - Device/dtype handling for CUDA compatibility
  - 17 tests covering training and weight merging
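
  Conceptually, a LoRA linear layer wraps a frozen base layer with a trainable low-rank update; a self-contained sketch (class name, defaults, and layout are assumptions, not the project's `LoRALinear` API):

  ```python
  import torch
  import torch.nn as nn

  class LoRALinearSketch(nn.Module):
      """Frozen base weight plus trainable low-rank delta: y = Wx + (alpha/r) * B(Ax)."""
      def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
          super().__init__()
          self.base = base
          for p in self.base.parameters():
              p.requires_grad_(False)          # only the adapters are trained
          self.lora_a = nn.Linear(base.in_features, r, bias=False)
          self.lora_b = nn.Linear(r, base.out_features, bias=False)
          nn.init.zeros_(self.lora_b.weight)   # start as an identity of the base layer
          self.scaling = alpha / r

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
  ```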
- **QLoRA (Quantized LoRA)**:
  - `src/llm/core/qlora.py` with `QLoRALinear` class
  - NF4 4-bit quantization for base weights (~4x memory reduction)
  - LoRA adapters remain in fp16/bf16 for training stability
  - `apply_qlora()` and `get_qlora_parameters()` helpers
- **RoPE (Rotary Position Embedding)**:
  - `src/llm/core/rope.py` with `RotaryPositionEmbedding` class
  - Linear, dynamic, and NTK-aware scaling methods for extended context
  - `apply_rotary_pos_emb()`, `get_rope_scaling_factor()` utilities
  - 15 tests
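
  The core rotation behind `apply_rotary_pos_emb()` follows the standard RoPE formulation; a minimal sketch, assuming `cos`/`sin` are precomputed and broadcastable to the query/key shapes:

  ```python
  import torch

  def rotate_half(x: torch.Tensor) -> torch.Tensor:
      x1, x2 = x.chunk(2, dim=-1)
      return torch.cat((-x2, x1), dim=-1)

  def apply_rope(q, k, cos, sin):
      """Rotate query/key feature pairs by position-dependent angles (standard RoPE)."""
      return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
  ```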
- **ALiBi (Attention with Linear Biases)**:
  - `src/llm/core/alibi.py` with `ALiBiPositionBias` class
  - `get_alibi_slopes()`, `build_alibi_bias()` functions
  - Cached bias computation for efficiency
  - 13 tests
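
  For head counts that are powers of two, the standard ALiBi slopes form a geometric sequence; a small sketch of that case (the project's `get_alibi_slopes()` presumably also handles other head counts):

  ```python
  def alibi_slopes_power_of_two(num_heads: int) -> list[float]:
      """Standard ALiBi slopes: one geometrically decaying slope per attention head."""
      start = 2.0 ** (-8.0 / num_heads)
      return [start ** (i + 1) for i in range(num_heads)]

  # e.g. 8 heads -> [0.5, 0.25, 0.125, ..., 0.00390625]
  ```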
- **Sliding Window Attention**:
  - `window_size` parameter in `scaled_dot_product_attention`
  - Propagated through `MultiHeadAttention`, `TransformerBlock`, `DecoderModel`
  - Reduces memory for long sequences by limiting attention scope
  - 10 tests
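
  One way to realize a `window_size` limit is to intersect the causal mask with a band mask before attention; a minimal sketch of the masking idea (not necessarily how the project implements it):

  ```python
  import torch

  def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
      """Boolean mask: position i may attend to j only if i - window_size < j <= i."""
      i = torch.arange(seq_len).unsqueeze(1)
      j = torch.arange(seq_len).unsqueeze(0)
      return (j <= i) & (j > i - window_size)

  # The result can be passed as attn_mask to
  # torch.nn.functional.scaled_dot_product_attention (True = may attend).
  ```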
- **KV Cache Optimization**:
  - `src/llm/core/kv_cache.py` with `KVCache` class for pre-allocated cache buffers
  - In-place updates during autoregressive generation (avoids O(n²) memory operations)
  - Integrated into `MHA`, `TransformerBlock`, `DecoderModel`
  - Factory method `KVCache.from_model_config()` for easy instantiation
  - Backward compatible: legacy `past_key_value` tuple format still works
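
  The idea behind a pre-allocated cache is to write each step's key/value tensors into fixed buffers in place rather than concatenating; a sketch of that pattern (buffer layout and method names are assumptions, not the project's `KVCache` API):

  ```python
  import torch

  class KVCacheSketch:
      """Pre-allocated K/V buffers updated in place during autoregressive decoding."""
      def __init__(self, batch, num_heads, max_seq_len, head_dim, dtype=torch.float16):
          self.k = torch.zeros(batch, num_heads, max_seq_len, head_dim, dtype=dtype)
          self.v = torch.zeros_like(self.k)
          self.pos = 0  # number of tokens cached so far

      def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
          t = k_new.shape[2]  # new tokens this step (1 during decode)
          self.k[:, :, self.pos:self.pos + t] = k_new
          self.v[:, :, self.pos:self.pos + t] = v_new
          self.pos += t
          # Return views over the valid prefix for attention
          return self.k[:, :, :self.pos], self.v[:, :, :self.pos]
  ```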
- **E2E Testing Infrastructure**:
  - `tests/e2e/` directory with comprehensive pipeline tests
  - `test_training.py`, `test_sft.py`, `test_dpo.py`
  - `test_gradient_accumulation.py`, `test_resume_training.py`
  - Advanced inference and callback tests
- **Documentation**:
  - `notebooks/quick_start.ipynb` interactive tutorial
  - Covers model building, training, inference, and advanced features
### Changed

- **SDPA Refactoring**:
  - Consolidated `scaled_dot_product_attention` wrapper into `src/llm/core/attn/sdpa.py`
  - Refactored `MultiHeadAttention` and `MultiLatentAttention` to use common `sdpa` wrapper
  - Archived custom implementation to `_learning/03_lab/experiments/custom_sdpa.py`
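
  Such a wrapper typically delegates to PyTorch's fused kernel; a hedged sketch of what it can look like (the actual `sdpa` signature in the project may differ):

  ```python
  import torch.nn.functional as F

  def sdpa(query, key, value, attn_mask=None, is_causal=False, dropout_p=0.0):
      """Thin wrapper over PyTorch's fused scaled dot-product attention."""
      return F.scaled_dot_product_attention(
          query, key, value,
          attn_mask=attn_mask,
          dropout_p=dropout_p,
          is_causal=is_causal,
      )
  ```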
- **Test Suite Refactoring**:
  - Organized test files into subdirectories (`tests/training/`, `tests/inference/`, etc.)
  - Converted to functional testing style (real components over mocks)
  - Added shared fixtures in `tests/conftest.py`
  - Test count: 385 → 432
- **TrainingEngine**:
  - Support for dictionary batches in training/validation loops
  - Gradient accumulation implementation
- **DPO Reference Model**:
  - Use model reconstruction instead of `deepcopy` for ref_model creation
- **Documentation**:
  - Added `docs/README.md` as documentation entry point
  - Added MkDocs Material configuration (`mkdocs.yml`) for documentation site
  - Added GitHub Actions workflow for automatic GitHub Pages deployment
  - Added `guide-finetuning.md` (LoRA/QLoRA) and `guide-inference.md` (KVCache/GQA/Continuous Batching)
  - Enhanced `architecture.md` with detailed component diagrams and data flow analysis
  - Updated ROADMAP Phase 10.2 (Continuous Batching complete)
## v0.0.4

### Added

- **Gradient Checkpointing**:
  - Memory-efficient training via `gradient_checkpointing` parameter in `DecoderModel`
  - `enable_gradient_checkpointing()` / `disable_gradient_checkpointing()` methods
  - Automatic incompatibility check with `use_cache=True`
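
  This kind of option typically routes each transformer block through `torch.utils.checkpoint`, recomputing activations in the backward pass; a minimal sketch (the `DecoderModel` internals here are assumptions):

  ```python
  import torch.utils.checkpoint as checkpoint

  def run_block(block, hidden_states, use_checkpointing: bool):
      """Trade extra compute for lower activation memory during training."""
      if use_checkpointing:
          return checkpoint.checkpoint(block, hidden_states, use_reentrant=False)
      return block(hidden_states)
  ```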
- **E2E Pipeline Automation**:
  - `scripts/e2e_pipeline.py` for automated Train → Evaluate → Inference workflow
  - `src/llm/utils/e2e.py` with reusable E2E core functions (`E2EConfig`, `E2EResult`, `run_e2e_pipeline`)
  - Rich progress UI and configurable CLI options
- **OpenAI-Compatible Chat API** (`/v1/chat/completions`):
  - Compatible with official OpenAI Python SDK
  - Streaming and non-streaming chat completions
  - Bearer token authentication support
  - Multi-turn conversation handling
  - 8 new test cases for compatibility layer
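
  Because the endpoint mirrors `/v1/chat/completions`, the official SDK can simply be pointed at the local server; a usage sketch (host, port, API key, and model name are placeholders):

  ```python
  from openai import OpenAI

  # base_url and api_key are placeholders for a locally running llm-serve instance
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

  response = client.chat.completions.create(
      model="local-model",  # model name is illustrative
      messages=[{"role": "user", "content": "Hello!"}],
      stream=False,
  )
  print(response.choices[0].message.content)
  ```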
- **Batch Inference**:
  - `batch_generate` function in `inference.py` with left-padding and batched forward pass
  - `BatchGenerationRequest` / `BatchGenerationResponse` schemas
  - `/batch_generate` API endpoint
  - 3 tests for batch inference (basic, single, empty)
- **Request Queue and Concurrency Control**:
  - `max_concurrent_requests` and `request_timeout` in `ServingConfig`
  - `asyncio.Semaphore` for concurrency limiting
  - `asyncio.timeout` for request timeout handling (504 response)
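
  The combination described above can be expressed roughly as follows (a sketch, not the project's handler; `asyncio.timeout` requires Python 3.11+, and `run_generation` is a stand-in for the real generation call):

  ```python
  import asyncio
  from fastapi import HTTPException

  MAX_CONCURRENT = 8        # corresponds to max_concurrent_requests (example value)
  REQUEST_TIMEOUT = 30.0    # corresponds to request_timeout, in seconds (example value)
  semaphore = asyncio.Semaphore(MAX_CONCURRENT)

  async def run_generation(prompt: str) -> str:
      # Placeholder for the actual generation coroutine
      await asyncio.sleep(0.01)
      return f"echo: {prompt}"

  async def guarded_generate(prompt: str) -> str:
      try:
          async with asyncio.timeout(REQUEST_TIMEOUT):
              async with semaphore:
                  return await run_generation(prompt)
      except TimeoutError:
          raise HTTPException(status_code=504, detail="Request timed out")
  ```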
- **CLI Entry Points**:
  - `llm-train` command for training models
  - `llm-serve` command for starting inference server
- **Testing Infrastructure**:
  - Pytest markers using decorators: `quick`, `slow`, `heavy`, `e2e`
  - MoE integration tests (6 tests for expert routing, gradient flow)
  - E2E pipeline tests (full workflow, streaming consistency)
  - Gradient checkpointing tests (8 tests)
  - Total test count: 296 → 337
- **Examples Directory**:
  - `inference_demo.py` for basic text generation
  - `openai_client_demo.py` for OpenAI SDK usage
- **Documentation**:
  - `scripts/README.md` documenting all available scripts
  - HFTokenizer example in `usage.md`
  - Updated root `README.md` with links to Examples and Scripts
### Changed

- **Makefile Reorganization**:
  - `make test` now runs all tests by default
  - `make test-fast` for daily development (excludes heavy/e2e)
  - `make test-quick` for rapid iteration (~6s)
  - `make test-cov` for CI with coverage and allure reports
  - Removed redundant `test-all` and `test-integration`
- **CLI Standardization**:
  - CLI parameters changed from snake_case to kebab-case (`--file-path`, `--batch-size`)
  - Replace `typer` with `typer-slim[standard]` for reduced dependencies
- **Code Quality Improvements**:
  - Translate Chinese docstrings to English in serving module
  - Remove ~75 lines of redundant comments
  - Simplify section comments while preserving algorithm clarity
- **Documentation Refactoring**:
  - Eliminated redundancy between README, usage.md, and development.md
  - Clear document responsibility separation
  - Updated all docs to use new CLI commands
  - Enhanced package metadata (keywords, classifiers)
- **Module Exports**:
  - Enhanced `llm/__init__.py` with public API exports (`DecoderModel`, `generate`, etc.)
  - Enhanced `llm.serving` module exports (`LLMEngine`, `ServingConfig`, OpenAI schemas)
### Fixed

- Removed obsolete TODO comment in `engine.py`
- Removed duplicate `num_kv_heads` field in `ModelConfig`
- Fixed MD051/link-fragments in `tutorial-cpu-llm.md` and `faq.md`
- Fixed `train.py` task registration for `lm` task
## v0.0.3

### Added

- **Inference Serving**:
  - Production-ready REST API with FastAPI
  - Streaming support via Server-Sent Events (SSE)
  - Advanced sampling strategies (nucleus sampling/top-p, repetition penalty)
  - Prometheus metrics endpoint for monitoring
  - API key authentication (`X-API-Key` header)
  - Structured logging with `python-json-logger`
  - Real PyTorch model weights loading from checkpoint files
  - Pickled tokenizer object loading support
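
  Nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability exceeds `p`; a minimal sketch of the filtering step (illustrative, not the project's exact code):

  ```python
  import torch
  import torch.nn.functional as F

  def top_p_filter(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
      """Mask logits outside the smallest token set whose probability mass >= top_p."""
      sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
      cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
      # Drop tokens beyond the nucleus, always keeping the highest-probability token
      to_remove = cumulative_probs > top_p
      to_remove[..., 1:] = to_remove[..., :-1].clone()
      to_remove[..., 0] = False
      remove_mask = torch.zeros_like(to_remove).scatter(dim=-1, index=sorted_indices, src=to_remove)
      return logits.masked_fill(remove_mask, float("-inf"))
  ```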
- **Component Registry**:
  - Automatic component registration system (`ComponentRegistry`)
  - Core components (MHA, MLP, MoE) auto-registered via side-effect imports
  - Prevents "component not found" errors in simplified scripts
- **Data Abstraction**:
  - Formalized `BaseTokenizer` protocol
  - `BaseDataModule` abstraction for flexible data handling
  - Environment variable configuration support (e.g., `LLM_TRAINING__EPOCHS`)
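
  The double-underscore form (`LLM_TRAINING__EPOCHS`) matches the nested-delimiter convention of pydantic-settings; a hedged sketch of how such a config can be wired (field names, defaults, and prefix are assumptions, not the project's actual settings classes):

  ```python
  from pydantic import BaseModel
  from pydantic_settings import BaseSettings, SettingsConfigDict

  class TrainingSection(BaseModel):
      epochs: int = 3  # example default

  class AppSettings(BaseSettings):
      model_config = SettingsConfigDict(env_prefix="LLM_", env_nested_delimiter="__")
      training: TrainingSection = TrainingSection()

  # With LLM_TRAINING__EPOCHS=10 set in the environment:
  # AppSettings().training.epochs == 10
  ```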
- **Testing & CLI**:
  - `--num-samples` flag in `train.py` for rapid regression testing
  - Scheduler edge case tests (`test_scheduler_edge_cases.py`)
  - Validation logging tests (`test_engine_logging.py`)
  - Component registry tests (`test_init.py`)
  - Model loading verification tests
  - Auto-device detection in training scripts (prioritizes CUDA)
- **Documentation**:
  - Comprehensive usage guide (`docs/usage.md`)
  - Architecture documentation (`docs/architecture.md`)
  - Engineering documentation (ADRs, PR templates, FAQ)
  - VS Code configuration and extensions
### Changed

- **Architecture Modernization**:
  - Migrated to Pydantic v2 (`BaseSettings`, `BaseModel`) for configuration
  - Fully typed and validated configuration system
  - CLI migration from `argparse` to `typer` for better UX
- **Naming Standardization**:
  - Unified `ffn_hidden_size` → `intermediate_size` across codebase
  - Standardized input parameter `x` → `hidden_states` in forward methods
  - Applied to `MLP`, `LayerNorm`, `RMSNorm`, `DecoderModel`, `TransformerBlock`
  - Updated all 309 tests to reflect API changes
- **Code Quality**:
  - Standardized punctuation in documentation (full-width → half-width)
  - Improved type hints and documentation comments
  - Refactored `TransformerBlock.forward` for clarity
### Fixed

- **Core Bugs**:
  - `CosineAnnealingLR` `T_max` calculation when `epochs == warmup_epochs` (ZeroDivisionError)
  - `TrainingEngine` validation logging crash when `gradient_norms` is empty (IndexError)
  - PAD token generation issue in inference (logits masking)
  - `SyntheticDataModule` `prefetch_factor` handling with `num_workers=0`
  - `TransformerBlock` shared norm instance bug (independent `norm1`/`norm2`)
  - Scheduler/optimizer step order warnings in tests
  - `PositionalEncoding` support for `start_pos` in incremental generation
  - MLP SwiGLU operation order for numerical consistency
  - Prompt truncation respecting `max_seq_len` with new tokens
  - Auto AMP dtype resolution for CPU-only environments
- **Registry & Imports**:
  - Package auto-registration via `import llm`
  - Component not found errors in simplified execution
## v0.0.2

### 0.0.3 - 2025-12-23

#### 🚀 Features
- (inference) Add simple autoregressive generation loop - (4585a2e)
- (scripts) Auto-detect device in train_simple_decoder.py - (ff2ad6f)
- (scripts) Add best_train.py for efficient distributed training - (b345f82)
- (scripts) Add optimized_train_02.py for efficient distributed training - (e585cf2)
- (scripts) Add optimized DDP training script - (02787e6)
- Support real weights loading and serving enhancements - (1da2b8a)
- Implement production-ready inference serving with streaming and observability - (2c21937)
- Expand LLM functional test suite and fix core regressions - (60ce0cd)
- Finalize architecture modernization and code quality cleanup - (7954f61)
- Enhance inference performance and project quality tools - (03d640f)
- Implement and integrate Mixture of Experts (MoE) - (deaed1b)
- Enhance training framework with modularity and extensibility - (592e1d9)
- Introduce modular training framework - (c944960)
- Add a training sample - (a5417f2)
- Remove something redundant - (31393e9)
- Remove `sys.path` manipulations. - (a4e4c76)
- Implement `PositionalEncoding`. - (f34f019)
- Implement core components for a custom LLM framework - (2a4881f)
- Introduce moe - (a26b0c0)
#### 🐛 Bug Fixes
- (attn) Correct QKV splitting in MultiHeadAttention - (9b5d634)
- (core) Ensure distinct norm instances in TransformerBlock - (5e7f4cc)
- (core/mlp) Move provided norm instance to target device/dtype to avoid device mismatch - (6a52399)
- Test_engine_auto_amp_dtype case failed on cuda env - (a6406ca)
- Failed to run on cpu-only env - (4faee5c)
- Resolve multiple training stability and correctness issues - (334d76d)
- ARG001 Unused function argument: `dummy_input` - (ff461df)
#### 🚜 Refactor
- (arch) Modernize architecture with registry, data abstraction, and robust config - (6f74012)
- (core) Unify deep learning naming conventions - (d2c9237)
- (moe) Optimize it - (3a91204)
- (moe) Optimize it - (7fee39e)
- (scripts) Migrate CLI scripts to typer - (f86ad16)
- (scripts) Implement train_02.py with rich logging and optimized config management - (47cf2f8)
- (scripts) Improve code readability and structure in optimized_train.py - (69eb300)
- (scripts) Optimize and modularize PyTorch DDP training script - (e005d6a)
- Consolidate environment variable handling and simplify code - (331bbdd)
#### 📚 Documentation
- (changelog) Enhance v0.0.2 release notes with detailed information - (cdd384e)
- (changelog) Prepare v0.0.3 release notes - (b3bc1ab)
- (llm) Add attn - (222095c)
- Add usage - (039bf76)
- Optimize documentation comments and standardize punctuation - (c382e34)
- Add engineering docs and tooling config - (f3818d8)
- Comprehensive documentation overhaul and roadmap expansion - (5bd6dac)
- Update the roadmap - (e9d6176)
- Add roadmap for learning - (ab81c7d)
- Add ROADMAP - (c0e57d4)
- Update GEMINI.md - (cd30474)
- Standardize training documentation filenames to kebab-case - (de4a99c)
- Update project documentation - (ebf95e6)
- Streamline documentation entry point - (5b197fc)
- Refactor and standardize documentation structure - (b2a46ee)
- Refine the structure - (e4097d1)
- Create comprehensive training framework documentation - (3ae4ace)
- Add GEMINI-example.md template - (8eaf0f4)
- Update GEMINI.md with commit workflow - (bdbde1a)
- Add some docs about transformer - (16774c9)
- Add moe - (f82b734)
#### 🎨 Styling
- Ruff - (60ab171)
- Ruff - (20fb4e3)
- Format code for readability and consistency - (7831aa7)
- Add .markdownlint.yaml - (c98f6e1)
- Format - (bb7bb41)
#### 🧪 Testing
- (core) Make device comparisons robust by comparing device.type - (b5229ed)
- Increase the coverage - (85be1e7)
#### ⚙️ Miscellaneous Tasks
- (release) Use CHANGELOG.md for GitHub releases instead of git-cliff - (a0621e7)
- Some minor changes - (09e3a6f)
- Update the prek plugins - (6d97ec3)
- Upgrade actions/checkout to v5 - (d34c441)
- Update GEMINI.md - (7e36f67)
- Some minor changes - (4c7e0e7)
- Some minor changes - (55c12c8)
- Some minor changes - (c330cec)
- Add mlp vs moe - (e2478f0)
- Remove something redundant - ([b962d05](https://github.com/pplmx/ll...