My submission to OpenAI's Parameter Golf competition, which challenges participants to train the best language model that fits in a 16 MB artifact. The competition assumes 8xH100; I had a single RTX 3080 with 12 GB VRAM and about two weeks.
Final score: val_bpb: 1.5568 — not competitive with the leaderboard, but the process was worth documenting. Full ablation results and training logs are in the PR.
I hit my best score on day five, then spent nine more days trying to beat it. I ablated 12 leaderboard-proven techniques across 3 phases. Almost all of them made things worse.
The reason is step budget: on a single GPU you get ~3,600 steps in 10 minutes, while the challenge hardware gets ~7,100. Techniques like XSA, EMA, and Partial RoPE add per-step overhead that costs hundreds of training steps. Since the loss curve is still dropping when time runs out, anything that slows you down is net negative, even if it improves per-step efficiency.
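To make the overhead trade-off concrete, here is a back-of-the-envelope sketch. Only the ~3,600-step baseline comes from the run described above; the 10-minute window and the 10% overhead figure are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope step-budget math for a fixed wall-clock run.
# The 10% overhead figure below is a hypothetical example, not a
# measurement of any specific technique.

def steps_in_budget(budget_s: float, step_time_s: float) -> int:
    """Number of optimizer steps that fit in the wall-clock budget."""
    return int(budget_s // step_time_s)

BUDGET_S = 10 * 60             # 10-minute training window (assumed)
BASE_STEP_S = BUDGET_S / 3600  # ~0.167 s/step, back-solved from ~3,600 steps

overhead = 0.10  # a technique that makes every step 10% slower
baseline_steps = steps_in_budget(BUDGET_S, BASE_STEP_S)
slower_steps = steps_in_budget(BUDGET_S, BASE_STEP_S * (1 + overhead))

print(baseline_steps)                  # 3600
print(slower_steps)                    # 3272
print(baseline_steps - slower_steps)   # 328 steps lost to overhead
```

A technique with this cost only pays off if its per-step gain outweighs the hundreds of steps it forfeits, which is exactly why leaderboard-proven additions kept losing here.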
The most useful takeaway: techniques proven at scale don't automatically transfer down. The optimal architecture depends on your step budget, not just your parameter budget.