Stratum: System-Hardware Co-Design with 3D-Stackable DRAM for Efficient MoE

TLDR

  • An IEEE/ACM MICRO paper proposes Stratum, a system-hardware co-design that combines Mono3D DRAM, near-memory processing (NMP), and a GPU to serve MoE LLMs, reporting up to 8.29x higher decoding throughput and 7.66x better energy efficiency than GPU baselines.

Key Takeaways

  • Mono3D DRAM attaches to a logic die through hybrid bonding (1 μm pitch, ~5x finer than HBM TSVs), giving the logic die higher internal bandwidth than HBM and enabling stronger near-memory processing (NMP) without embedding logic in the DRAM cells.
  • Stratum introduces in-memory tiering: DRAM layers with lower access latency hold hot experts and slower layers hold cold experts, with placement guided by a lightweight classifier that predicts each request's topic.
  • A silicon interposer connects the Mono3D DRAM + logic stack to the GPU, keeping dense compute on the GPU while offloading expert weight fetches to NMP.
  • A cross-stack evaluation (device, circuit, algorithm, system) reports up to 8.29x higher decoding throughput and 7.66x better energy efficiency than state-of-the-art GPU-HBM baselines across multiple MoE benchmarks.
  • Target workloads include large MoE models such as DeepSeek-V3 (671B parameters) and Kimi-K2 (1T parameters), whose expert weight volume saturates HBM bandwidth during decoding.
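
The in-memory tiering takeaway above can be illustrated with a minimal sketch. It assumes a greedy policy (hottest experts fill the fastest tier first), which this summary does not confirm as the paper's actual placement algorithm. The names `place_experts`, `USAGE_BY_TOPIC`, and `TIER_SLOTS`, along with all the numbers, are hypothetical, and the topic classifier is stood in for by a plain dictionary lookup.

```python
def place_experts(predicted_hits, tier_slots):
    """Greedily map experts to memory tiers, hottest expert first.

    predicted_hits: dict of expert id -> predicted access count for the
        request's topic (hypothetical input; a real system would obtain
        this from a topic classifier plus profiling data).
    tier_slots: expert slots per tier, ordered fastest tier first.
    Returns a dict of expert id -> tier index (0 is the fastest tier).
    """
    if len(predicted_hits) > sum(tier_slots):
        raise ValueError("not enough tier capacity for all experts")

    # Hottest first; ties are broken by expert id so the result is deterministic.
    ranked = iter(sorted(predicted_hits, key=lambda e: (-predicted_hits[e], e)))

    placement = {}
    for tier, slots in enumerate(tier_slots):
        for _ in range(slots):
            expert = next(ranked, None)
            if expert is None:  # every expert has been placed
                return placement
            placement[expert] = tier
    return placement


# Hypothetical per-topic expert usage profiles (made-up numbers).
USAGE_BY_TOPIC = {
    "code": {0: 90, 1: 5, 2: 40, 3: 1},
    "math": {0: 3, 1: 80, 2: 10, 3: 60},
}
TIER_SLOTS = [1, 2, 1]  # fastest tier holds 1 expert, then 2, then 1

code_plan = place_experts(USAGE_BY_TOPIC["code"], TIER_SLOTS)
math_plan = place_experts(USAGE_BY_TOPIC["math"], TIER_SLOTS)
print(code_plan)  # {0: 0, 2: 1, 1: 1, 3: 2}
```

With these made-up profiles, expert 0 lands in the fastest tier for "code" requests but in the slowest tier for "math" requests, which is the kind of topic-dependent placement the takeaway describes.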

Hacker News Comment Review

  • One commenter notes that Mono3D DRAM-based NMP likely generalizes well beyond LLM workloads to memory-bound traditional computing, a direction the paper does not explore.
