parameter-golf

My submission to OpenAI's Parameter Golf competition, trained on a single RTX 3080 (12 GB) in about two weeks. The challenge assumes 8xH100; I had a desktop GPU.

Overview

OpenAI's Parameter Golf challenges participants to train the best language model that fits in a 16MB artifact. The reference hardware is 8xH100; I submitted from a single RTX 3080 with 12 GB of VRAM.

Final score: val_bpb 1.5568. Not competitive with the leaderboard, but the process was worth documenting. Full ablation results and training logs are in the PR.

Architecture

  • 10-layer GQA transformer with progressive sequence length scheduling (512 → 1024 → 2048)
  • Multi-scale RoPE with different frequency bases per KV group
  • Byte-level token embeddings as a side channel
  • Mixed-bit quantization (int5/int6 + zstd) for the 16MB artifact cap
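The multi-scale RoPE idea can be sketched as giving each KV group its own frequency base, so some groups rotate slowly (long-range) and others quickly (local detail). The base values and function names below are illustrative, not the submission's actual configuration:

```python
import numpy as np

def rope_inv_freqs(head_dim, kv_groups, bases=(10_000, 50_000, 250_000, 1_000_000)):
    """Return one inverse-frequency vector per KV group.

    A larger base stretches the rotation wavelengths, letting that group
    attend more easily over long distances; smaller bases preserve
    fine-grained local position resolution.
    """
    assert kv_groups <= len(bases)
    # Standard RoPE exponent schedule: one frequency per pair of dims.
    exponents = np.arange(0, head_dim, 2) / head_dim
    return [1.0 / (bases[g] ** exponents) for g in range(kv_groups)]

freqs = rope_inv_freqs(head_dim=64, kv_groups=2)
# Group 0 (base 10k) rotates at least as fast as group 1 (base 50k)
# at every dimension.
assert all(f0 >= f1 for f0, f1 in zip(freqs[0], freqs[1]))
```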
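The quantization side of the 16MB cap can be sketched with simple symmetric per-tensor scaling to a 5-bit range; the real artifact additionally packs the integers bit-tight and zstd-compresses the result, which this sketch omits:

```python
import numpy as np

def quantize_symmetric(w, bits=5):
    """Map float weights to signed integers in [-2^(bits-1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_symmetric(w, bits=5)

# Round-trip error of symmetric quantization is bounded by half a
# quantization step, so it never exceeds one scale unit.
err = np.abs(dequantize(q, s) - w).max()
assert err <= s
```

Stored bit-packed, each 5-bit weight takes 5/8 of a byte before compression, which is where the int5/int6 split earns its space under the artifact cap.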

What I Learned

I hit my best score on day five, then spent nine more days trying to beat it. I ablated 12 leaderboard-proven techniques across three phases. Almost all of them made things worse.

The reason: on a single GPU I got ~3,600 steps within the 10-minute budget; the challenge hardware gets ~7,100. Techniques like XSA, EMA, and Partial RoPE add per-step overhead that costs hundreds of training steps. The loss curve is still dropping when time runs out, so anything that slows you down is net negative, even if it improves per-step efficiency.
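The step-budget argument above reduces to simple arithmetic. Using the rough figures from this writeup and an illustrative 5% per-step overhead:

```python
# Back-of-envelope for the fixed-wall-clock tradeoff. Step counts are
# the approximate figures above; the 5% overhead is illustrative.
WALL_CLOCK_S = 10 * 60   # 10-minute budget
my_steps = 3_600         # single RTX 3080
ref_steps = 7_100        # 8xH100, same wall clock

base_step_s = WALL_CLOCK_S / my_steps  # ~0.167 s/step

# A technique adding 5% per-step overhead shrinks the step count:
overhead = 0.05
slower_steps = int(WALL_CLOCK_S / (base_step_s * (1 + overhead)))
steps_lost = my_steps - slower_steps   # 172 steps gone

# Its per-step improvement must therefore outweigh ~172 steps of plain
# training just to break even -- a much higher bar at 3,600 total steps
# than at 7,100, where the curve is flatter by the end.
print(steps_lost)
```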

The most useful takeaway: techniques proven at scale don't automatically transfer down. The optimal architecture depends on your step budget, not just your parameter budget.

Tech Stack

Python · PyTorch · Machine Learning