CUDA-oxide: Nvidia's official Rust to CUDA compiler

TLDR

  • Nvidia’s experimental cuda-oxide compiles standard Rust directly to PTX for SIMT GPU kernels, with no DSLs or foreign bindings.

Key Takeaways

  • #[cuda_module] and #[kernel] macros embed device artifacts into the host binary and generate typed launch methods per kernel.
  • DisjointSlice<T> enforces aliasing safety at the type level: safe code cannot have multiple threads write to the same index, and no unsafe block is needed for the common one-thread-per-element pattern.
  • Supports async GPU execution via lazy DeviceOperation graphs scheduled across stream pools, requiring familiarity with tokio.
  • v0.1.0 is early alpha: API breakage expected. Build and run via cargo oxide run.
  • Uses a custom rustc codegen backend that emits PTX without depending on MLIR, explicitly avoiding LLVM/TableGen build complexity.
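
To make the "lazy graph scheduled across a stream pool" takeaway concrete, here is a minimal CPU-only sketch in plain Rust. Every name in it (LazyOp, Stream, StreamPool, submit) is hypothetical and chosen for illustration; it is not the cuda-oxide API, it touches no GPU, and it runs synchronously rather than on tokio. It only shows the shape of the idea: building the chain runs nothing, and work happens when the operation is handed to a stream.

```rust
/// Hypothetical stand-in for a GPU stream: it only records which steps ran on it.
struct Stream {
    log: Vec<String>,
}

/// A lazily described operation. Constructing or chaining it runs nothing;
/// the boxed closure is only called in `run_on`.
struct LazyOp<T> {
    run: Box<dyn FnOnce(&mut Stream) -> T>,
}

impl<T: 'static> LazyOp<T> {
    /// Describe a root step (for example an upload) without executing it.
    fn new(label: &'static str, f: impl FnOnce() -> T + 'static) -> Self {
        LazyOp {
            run: Box::new(move |s: &mut Stream| {
                s.log.push(label.to_string());
                f()
            }),
        }
    }

    /// Append a dependent step; still nothing executes.
    fn then<U: 'static>(
        self,
        label: &'static str,
        f: impl FnOnce(T) -> U + 'static,
    ) -> LazyOp<U> {
        LazyOp {
            run: Box::new(move |s: &mut Stream| {
                let input = (self.run)(s); // run the earlier steps first
                s.log.push(label.to_string());
                f(input)
            }),
        }
    }

    /// Execute the whole chain on one stream.
    fn run_on(self, stream: &mut Stream) -> T {
        (self.run)(stream)
    }
}

/// Hypothetical pool that assigns each submitted chain to a stream round-robin.
struct StreamPool {
    streams: Vec<Stream>,
    next: usize,
}

impl StreamPool {
    fn new(n: usize) -> Self {
        let streams = (0..n).map(|_| Stream { log: Vec::new() }).collect();
        StreamPool { streams, next: 0 }
    }

    fn submit<T: 'static>(&mut self, op: LazyOp<T>) -> T {
        let i = self.next % self.streams.len();
        self.next += 1;
        op.run_on(&mut self.streams[i])
    }
}

fn main() {
    let mut pool = StreamPool::new(2);

    // Building the chain is cheap and has no side effects yet.
    let op = LazyOp::new("upload", || vec![1.0f32, 2.0, 3.0])
        .then("scale", |v| v.into_iter().map(|x| x * 2.0).collect::<Vec<f32>>())
        .then("sum", |v| v.iter().sum::<f32>());

    // Only submission actually runs the steps, in dependency order.
    let total = pool.submit(op);
    println!("total = {total}");
    println!("stream 0 ran: {:?}", pool.streams[0].log);
}
```

In a real async setup the submit step would presumably return a future instead of blocking, which is where tokio familiarity would come in; the sketch leaves that part out.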

Hacker News Comment Review

  • Commenters are cautiously optimistic about replacing cudarc-based workflows, but note build-time comparisons to nvcc are unresolved and depend heavily on incremental compilation setup.
  • The safety model is acknowledged as partial: Rust’s borrow checker cannot enforce GPU thread-level aliasing, so DisjointSlice and ThreadIndex do the heavy lifting where the type system can reach.
  • Skepticism surfaced that the codebase may be largely AI-generated, which raises concerns about long-term maintainability and correctness of the custom IR and codegen backend.
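
The safety point above (a per-thread index plus a slice type that only exposes that thread's element) can be modeled on the CPU with standard Rust. The sketch below is an analogy under stated assumptions, not cuda-oxide code: ThreadIdx and launch are hypothetical names, and OS threads stand in for GPU threads. It shows how a launcher can hand each thread exactly one output slot, so kernel code has no way to write another thread's index.

```rust
mod sketch {
    use std::thread;

    /// Hypothetical per-thread index token. The field is private to this
    /// module, so kernel code cannot construct an index for another thread.
    pub struct ThreadIdx(usize);

    impl ThreadIdx {
        pub fn get(&self) -> usize {
            self.0
        }
    }

    /// Run `kernel` once per output element, each call on its own OS thread.
    /// `iter_mut` yields non-overlapping `&mut T` borrows, so every thread
    /// receives exclusive access to exactly one slot.
    pub fn launch<T, F>(out: &mut [T], kernel: F)
    where
        T: Send,
        F: Fn(ThreadIdx, &mut T) + Sync,
    {
        thread::scope(|s| {
            for (i, slot) in out.iter_mut().enumerate() {
                let kernel = &kernel;
                s.spawn(move || kernel(ThreadIdx(i), slot));
            }
        });
    }
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [10.0f32, 20.0, 30.0, 40.0];
    let mut out = [0.0f32; 4];

    // Inputs are shared read-only; the output is reachable only via `slot`.
    sketch::launch(&mut out, |idx, slot| {
        let i = idx.get();
        *slot = a[i] + b[i];
    });

    println!("{out:?}");
}
```

On a GPU the launcher cannot pre-split a buffer into one borrow per thread like this, which is presumably why the project needs dedicated types for the job; the sketch only conveys the "one thread, one slot" contract those types are described as enforcing.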

Notable Comments

  • @nextaccountic: quotes the safety docs directly – the borrow checker was “not designed for 2048 threads per SM all pointing at the same output buffer.”
  • @the__alchemist: asks whether cuda-oxide enables shared host/device structs, which existing Rust/CUDA workflows still lack.
