An autonomous research loop ran 73 microarchitectural hypotheses against a 5-stage RV32IM FPGA core and produced a 92% CoreMark improvement over baseline in under 10 hours.
Key Takeaways
Baseline: 2.23 CoreMark/MHz (CM/MHz) on a Gowin GW2A-LV18 (Tang Nano 20K). End state after 10 accepted wins: 2.91 CM/MHz (about 30% higher per MHz), 577 iter/s, 199 MHz Fmax, 5,944 LUT4.
63 of 73 hypotheses were wrong; the verifier chain (riscv-formal 53-check BMC, Verilator cosim, 3-seed nextpnr P&R, CoreMark CRC re-validation) caught every one before it corrupted the trunk.
The DIV/REM multi-cycle split was the breakout move. The agent did not predict that it would also cut LUT count by roughly 40%, from ~10K to ~5.9K; it found out by watching the synthesizer output.
A path sandbox blocked agents from touching formal/checks.cfg or the canonical CRC table; without it, agents eventually optimize by softening their own checks.
Central thesis: the loop (model + scaffold + parallel slots) is a commodity converging toward zero margin; the verifier that encodes what your domain means by “correct” is the defensible artifact.
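To make the sandbox and verifier-chain takeaways concrete, here is a minimal sketch of an acceptance gate. It is not the project's actual code. The function names, the `Check` type, and the CRC table path are hypothetical. Only `formal/checks.cfg` is taken from the text. Each verifier stage (formal, cosim, P&R, CRC) is stood in for by a plain callable that returns True on a pass.

```python
import posixpath
from typing import Callable, Iterable, List, Tuple

# formal/checks.cfg is named in the text; the CRC table path is a
# hypothetical placeholder for wherever the canonical table lives.
PROTECTED_PATHS = ("formal/checks.cfg", "bench/crc_table.txt")

# A named verifier stage: (label, callable returning True on pass).
Check = Tuple[str, Callable[[], bool]]


def is_blocked(path: str) -> bool:
    """Return True if an agent edit to `path` must be refused."""
    # Normalize first so "rtl/../formal/checks.cfg" cannot slip through.
    norm = posixpath.normpath(path.replace("\\", "/"))
    # Refuse absolute paths and anything escaping the repository root.
    if posixpath.isabs(norm) or norm == ".." or norm.startswith("../"):
        return True
    # Refuse a protected file itself or anything beneath a protected path.
    return any(norm == p or norm.startswith(p + "/") for p in PROTECTED_PATHS)


def gate(changed_paths: Iterable[str], checks: List[Check]) -> Tuple[bool, str]:
    """Accept a candidate only if the sandbox and every check pass, in order."""
    blocked = sorted(p for p in changed_paths if is_blocked(p))
    if blocked:
        return False, "sandbox: " + ", ".join(blocked)
    for name, run in checks:
        if not run():
            return False, name  # the first failing stage rejects the hypothesis
    return True, "accepted"
```

Calling `gate(["rtl/div.v"], stages)` with a list of stage callables returns a pass/fail flag and the name of the stage that rejected the change. The sandbox check runs before any verifier stage, so a candidate that edits its own checks is rejected without those checks ever being consulted.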