Launching pyptx — a Python DSL for writing NVIDIA PTX kernels directly.
The world's first open source Python DSL for NVIDIA PTX.
Repo: https://github.com/patrick-toulme/pyptx
Docs: https://pyptx.dev/
Connect on LinkedIn: https://www.linkedin.com/in/patrick-toulme-150b041a5/
Follow on X: https://x.com/PatrickToulme
Launching pyptx — a Python DSL for writing NVIDIA PTX kernels directly.
Today I’m open-sourcing a project I’ve been building on personal time: pyptx, a Python DSL where the function body is the PTX instruction stream. One PTX instruction = one Python call. No optimizer, no autotuner, no tile IR between you and the hardware.
Why? Because the newest GPU features — Hopper’s wgmma, TMA multicast, mbarrier-based pipelines, Blackwell’s tcgen05.mma + TMEM + cooperative 2-SM MMA often only exist at the PTX level. For developers chasing peak performance, that has historically meant writing inline PTX inside CUDA C++.
pyptx brings that whole path into Python. Callable from JAX (via typed XLA FFI) and PyTorch (eager, torch.compile, and a C++ extension fast path).
A few numbers from real silicon:
• H100 bf16 GEMM: 815 TFLOPS, competitive with cuBLAS at matrix sizes ≥ 6K
• B200 bf16 GEMM: 1240 TFLOPS on the 1SM kernel
• RMSNorm: 2.6 TB/s (88% of HBM3 peak, 3.9× PyTorch eager)
• SwiGLU: 2.8 TB/s (94% HBM3)
The other half of the project is a transpiler in the opposite direction:
python -m pyptx.codegen kernel.ptx --sugar
takes PTX from anywhere — nvcc, Triton, CUTLASS output, DeepGEMM kernels — and emits editable pyptx Python. The parser/emitter round-trips byte-identical on 218+ real-world kernels. So you can read someone else’s kernel as Python, modify it, and ship the result.
Built end-to-end: parser, IR, emitter, transpiler, JAX integration, PyTorch integration, full Hopper + Blackwell ISA coverage, multi-arch wheels published to PyPI. ~17K lines of Python total.
Ships with maintained GEMM, grouped GEMM, RMSNorm, LayerNorm, and SwiGLU kernels for both Hopper and Blackwell, plus the PTX → Python transpiler.
pip install pyptx[torch] # for PyTorch
pip install pyptx[jax] # for JAX
pip install pyptx[all] # both
Repo: https://github.com/patrick-toulme/pyptx
Docs: https://pyptx.dev
If you write GPU kernels — especially if you’ve ever wished Triton would let you express a specific wgmma pattern, or wanted to read a CUTLASS PTX dump as editable Python — try it. PRs welcome, especially Blackwell tuning and new ISA wrappers.





Cool work! What would you say is the primary limitation of pyptx now?