Fix FP4 init range for nvfp4_group_gemm #96

gau-nernst · 2026-01-24T04:01:15Z

Key change: make the FP4 init values symmetric around zero, so the GEMM output has a mean of approximately 0.
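
To illustrate why a symmetric range matters, here is a minimal pure-Python sketch. It is not the code changed in this PR: `sample_fp4` and `dot_stats` are hypothetical helpers, the "one-sided" case is an assumed stand-in for an asymmetric init, and block scale factors are ignored. It only uses the eight magnitudes representable in FP4 E2M1.

```python
import random
import statistics

# Magnitudes representable in FP4 E2M1 (the element format used by NVFP4).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def sample_fp4(rng, n, symmetric):
    """Draw n FP4-representable values (hypothetical stand-in for the real init)."""
    values = []
    for _ in range(n):
        v = rng.choice(FP4_MAGNITUDES)
        # Symmetric init: flip the sign with probability 0.5.
        if symmetric and rng.random() < 0.5:
            v = -v
        values.append(v)
    return values


def dot_stats(k, symmetric, trials=200, seed=0):
    """Return (mean, std) of length-k dot products of two random FP4 vectors."""
    rng = random.Random(seed)
    outs = []
    for _ in range(trials):
        a = sample_fp4(rng, k, symmetric)
        b = sample_fp4(rng, k, symmetric)
        outs.append(sum(x * y for x, y in zip(a, b)))
    return statistics.fmean(outs), statistics.stdev(outs)


for k in (256, 1024):
    for symmetric in (False, True):
        mean, std = dot_stats(k, symmetric)
        print(f"k={k}, symmetric={symmetric}: mean={mean:.1f}, std={std:.1f}")
```

With one-sided values every product is non-negative, so the mean of each output element grows linearly with k. With symmetric values the products cancel in expectation and only the spread grows, roughly with sqrt(k).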

Debug script using Modal (run with `modal run debug.py`):

import modal

from pathlib import Path

image = (
    modal.Image.from_registry(
        "nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04", add_python="3.12"
    )
    .entrypoint([])  # remove verbose logging by base image on entry
    .uv_pip_install("torch==2.9.1", index_url="https://download.pytorch.org/whl/cu130")
    .uv_pip_install("numpy")
    .add_local_python_source("reference", "task", "utils")
)
app = modal.App("debug", image=image)


@app.function(gpu="B200")
def run(task_config: dict):
    import torch
    from reference import generate_input, ref_kernel

    for cfg in task_config["benchmarks"]:
        data = generate_input(**cfg)
        out_list = ref_kernel(data)
        out = torch.cat([x.view(-1) for x in out_list], dim=0)

        print(cfg)
        print(
            f"  mean={out.mean().item():.2f}, std={out.std().item():.2f}, inf_any={out.isinf().any().item()}"
        )


@app.local_entrypoint()
def main():
    import yaml

    task_yaml = Path(__file__).parent / "task.yml"
    # read_text() closes the file, unlike a bare open() passed to safe_load
    task_config = yaml.safe_load(task_yaml.read_text())
    run.remote(task_config)

Before

{'m': [80, 176, 128, 72, 64, 248, 96, 160], 'n': [4096, 4096, 4096, 4096, 4096, 4096, 4096, 4096], 'k': [7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168], 'g': 8, 'seed': 1111}
  mean=inf, std=nan, inf_any=True
{'m': [40, 76, 168, 72, 164, 148, 196, 160], 'n': [7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168], 'k': [2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048], 'g': 8, 'seed': 1111}
  mean=16944.00, std=1964.00, inf_any=False
{'m': [192, 320], 'n': [3072, 3072], 'k': [4096, 4096], 'g': 2, 'seed': 1111}
  mean=34048.00, std=2820.00, inf_any=False
{'m': [128, 384], 'n': [4096, 4096], 'k': [1536, 1536], 'g': 2, 'seed': 1111}
  mean=12816.00, std=1731.00, inf_any=False

After

{'m': [80, 176, 128, 72, 64, 248, 96, 160], 'n': [4096, 4096, 4096, 4096, 4096, 4096, 4096, 4096], 'k': [7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168], 'g': 8, 'seed': 1111}
  mean=0.65, std=184.12, inf_any=False
{'m': [40, 76, 168, 72, 164, 148, 196, 160], 'n': [7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168], 'k': [2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048], 'g': 8, 'seed': 1111}
  mean=0.26, std=98.31, inf_any=False
{'m': [192, 320], 'n': [3072, 3072], 'k': [4096, 4096], 'g': 2, 'seed': 1111}
  mean=0.43, std=139.12, inf_any=False
{'m': [128, 384], 'n': [4096, 4096], 'k': [1536, 1536], 'g': 2, 'seed': 1111}
  mean=0.19, std=85.62, inf_any=False
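
As a rough sanity check on the "Before" numbers, the sketch below fits the three finite results and extrapolates to the k=7168 case. It assumes the mean scales linearly with k and the std with sqrt(k), and it assumes the reference output dtype is float16 (max finite value 65504). None of this is stated in the PR itself.

```python
import math

# (k, mean, std) copied from the "Before" log lines that did not overflow.
before = [
    (2048, 16944.0, 1964.0),
    (4096, 34048.0, 2820.0),
    (1536, 12816.0, 1731.0),
]

# Average per-k mean and per-sqrt(k) std across the three finite cases.
mean_per_k = sum(mean / k for k, mean, _ in before) / len(before)
std_per_sqrt_k = sum(std / math.sqrt(k) for k, _, std in before) / len(before)

# Extrapolate to the benchmark that produced inf.
k_overflow = 7168
projected_mean = mean_per_k * k_overflow
projected_std = std_per_sqrt_k * math.sqrt(k_overflow)

FP16_MAX = 65504.0  # assumption: the reference output dtype is float16
sigmas_to_overflow = (FP16_MAX - projected_mean) / projected_std

print(f"mean/k ~ {mean_per_k:.2f}, std/sqrt(k) ~ {std_per_sqrt_k:.2f}")
print(f"projected mean at k={k_overflow}: {projected_mean:.0f}")
print(f"float16 max is {sigmas_to_overflow:.2f} std above the projected mean")
```

Under these assumptions the projected mean at k=7168 is just under 60,000 and the float16 limit sits less than two standard deviations above it, which is consistent with some output elements overflowing to inf before the fix.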

msaroufim · 2026-01-24T17:41:55Z

@vickiw973 for review

vickiw973

thanks for the fix.

change fp4 init range

c5353b8

vickiw973 approved these changes Jan 25, 2026

msaroufim self-requested a review January 26, 2026 00:47

msaroufim approved these changes Jan 26, 2026

msaroufim merged commit 07f0321 into gpu-mode:main Jan 26, 2026
