A 2019 FLOW Lab post walks through optimizing a Julia vortex particle N-body kernel from 58x slower than C++ to near parity by using concrete types, avoiding allocations, and manually unrolling loops.
Key Takeaways
Julia structs with untyped fields (ParticleAmbiguous) show up as ::Any in @code_warntype output and run 58x slower than C++ compiled with -O3; switching to parametric concrete types alone yields a 3x speedup.
After the types are fixed, the bottleneck shifts to allocations: array comprehensions and cross()/norm() calls generate millions of heap objects per benchmark run.
@code_warntype is the primary diagnostic tool; ::Any annotations, printed in red, mark JIT-hostile code paths where the compiler cannot infer a concrete type.
The benchmark kernel is a real aerodynamics O(N²) particle-to-particle (P2P) interaction on 216 particles, making results representative of HPC inner loops.
The C++ baseline, compiled with -O3 on an i7-7820HQ, runs the P2P kernel in about 4 ms at minimum; the optimization effort targets that figure.
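The takeaways above can be sketched in a few lines of Julia. This is an illustrative sketch, not the post's code: ParticleTyped, lattice_particles, p2p_allocating, p2p_unrolled!, and the two bytes_* helpers are hypothetical names, the field layout of ParticleAmbiguous is assumed, and the kernel is a simplified singular cross-product interaction standing in for the post's actual P2P kernel. Only the particle count (216, here a 6x6x6 lattice) comes from the summary.

```julia
using LinearAlgebra  # standard library: cross, norm

# Assumed layout: fields without annotations default to Any.
struct ParticleAmbiguous
    X
    Gamma
end

# Hypothetical parametric variant: fields are concrete once T is fixed.
struct ParticleTyped{T<:Real}
    X::NTuple{3,T}
    Gamma::NTuple{3,T}
end

# In the REPL, `@code_warntype f(p)` on a function that reads p.X would
# flag the ParticleAmbiguous case with ::Any.

# 216 particles on a 6x6x6 lattice, stored as 3xN matrices.
function lattice_particles()
    pts = vec([(Float64(a), Float64(b), Float64(c)) for a in 1:6, b in 1:6, c in 1:6])
    X = [p[k] for k in 1:3, p in pts]
    G = [(-1.0)^k * 0.1 * p[k] for k in 1:3, p in pts]
    return X, G
end

# Allocation-heavy O(N^2) loop: every pair builds temporary vectors.
function p2p_allocating(X::Matrix{Float64}, G::Matrix{Float64})
    n = size(X, 2)
    U = zeros(3, n)
    for i in 1:n, j in 1:n
        i == j && continue
        dX = [X[k, i] - X[k, j] for k in 1:3]        # comprehension allocates
        U[:, i] += cross(G[:, j], dX) / norm(dX)^3   # slices and cross allocate
    end
    return U
end

# Same arithmetic with the cross product and norm unrolled into scalars.
function p2p_unrolled!(U::Matrix{Float64}, X::Matrix{Float64}, G::Matrix{Float64})
    n = size(X, 2)
    @inbounds for i in 1:n
        u1 = u2 = u3 = 0.0
        for j in 1:n
            i == j && continue
            d1 = X[1, i] - X[1, j]
            d2 = X[2, i] - X[2, j]
            d3 = X[3, i] - X[3, j]
            r3 = sqrt(d1^2 + d2^2 + d3^2)^3
            u1 += (G[2, j] * d3 - G[3, j] * d2) / r3
            u2 += (G[3, j] * d1 - G[1, j] * d3) / r3
            u3 += (G[1, j] * d2 - G[2, j] * d1) / r3
        end
        U[1, i] = u1; U[2, i] = u2; U[3, i] = u3
    end
    return U
end

# Measure inside functions so global-scope effects do not pollute the count.
bytes_allocating(X, G) = @allocated p2p_allocating(X, G)
bytes_unrolled(U, X, G) = @allocated p2p_unrolled!(U, X, G)

X, G = lattice_particles()
U = zeros(3, size(X, 2))
p2p_allocating(X, G); p2p_unrolled!(U, X, G)  # warm up so compilation is excluded

println("field type, untyped struct: ", fieldtype(ParticleAmbiguous, :X))
println("field type, typed struct:   ", fieldtype(ParticleTyped{Float64}, :X))
println("bytes allocated, comprehension version: ", bytes_allocating(X, G))
println("bytes allocated, unrolled version:      ", bytes_unrolled(U, X, G))
```

The two kernels compute the same sums in the same order, so any gap in the reported byte counts comes from the temporaries alone. For timings rather than allocation counts, a benchmarking package such as BenchmarkTools.jl is the usual choice; it is left out here to keep the sketch free of external dependencies.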
Hacker News Comment Review
No substantive HN discussion yet; one commenter flagged the 2019 publication date, and another linked a Julia Discourse thread for follow-up context.