Launching pyptx — a Python DSL for writing NVIDIA PTX kernels directly.

The world's first open source Python DSL for NVIDIA PTX.

Patrick C. Toulme

Apr 25, 2026

Repo: https://github.com/patrick-toulme/pyptx

Docs: https://pyptx.dev/

Connect on LinkedIn: https://www.linkedin.com/in/patrick-toulme-150b041a5/

Follow on X: https://x.com/PatrickToulme

Launching pyptx — a Python DSL for writing NVIDIA PTX kernels directly.

Patrick C Toulme@PatrickToulme

Launching pyptx — a Python DSL for writing NVIDIA PTX kernels. One PTX instruction = one Python call. Write pure PTX in Python. Direct Hopper + Blackwell support: wgmma, TMA, tcgen05, mbarriers. JAX + PyTorch integration. Includes GEMM, grouped GEMM, RMSNorm, SwiGLU, and a

6:54 PM · Apr 25, 2026 · 1.63K Views

10 Replies · 8 Reposts · 48 Likes

Today I’m open-sourcing a project I’ve been building on personal time: pyptx, a Python DSL where the function body is the PTX instruction stream. One PTX instruction = one Python call. No optimizer, no autotuner, no tile IR between you and the hardware.

Why? Because the newest GPU features — Hopper’s wgmma, TMA multicast, mbarrier-based pipelines, Blackwell’s tcgen05.mma + TMEM + cooperative 2-SM MMA often only exist at the PTX level. For developers chasing peak performance, that has historically meant writing inline PTX inside CUDA C++.

pyptx brings that whole path into Python. Callable from JAX (via typed XLA FFI) and PyTorch (eager, torch.compile, and a C++ extension fast path).

A few numbers from real silicon:

• H100 bf16 GEMM: 815 TFLOPS, competitive with cuBLAS at matrix sizes ≥ 6K

• B200 bf16 GEMM: 1240 TFLOPS on the 1SM kernel

• RMSNorm: 2.6 TB/s (88% of HBM3 peak, 3.9× PyTorch eager)

• SwiGLU: 2.8 TB/s (94% HBM3)

The other half of the project is a transpiler in the opposite direction:

python -m pyptx.codegen kernel.ptx --sugar

takes PTX from anywhere — nvcc, Triton, CUTLASS output, DeepGEMM kernels — and emits editable pyptx Python. The parser/emitter round-trips byte-identical on 218+ real-world kernels. So you can read someone else’s kernel as Python, modify it, and ship the result.

Built end-to-end: parser, IR, emitter, transpiler, JAX integration, PyTorch integration, full Hopper + Blackwell ISA coverage, multi-arch wheels published to PyPI. ~17K lines of Python total.

Ships with maintained GEMM, grouped GEMM, RMSNorm, LayerNorm, and SwiGLU kernels for both Hopper and Blackwell, plus the PTX → Python transpiler.

pip install pyptx[torch] # for PyTorch

pip install pyptx[jax] # for JAX

pip install pyptx[all] # both

Repo: https://github.com/patrick-toulme/pyptx

Docs: https://pyptx.dev

If you write GPU kernels — especially if you’ve ever wished Triton would let you express a specific wgmma pattern, or wanted to read a CUTLASS PTX dump as editable Python — try it. PRs welcome, especially Blackwell tuning and new ISA wrappers.

Just a Byte - AI Compilers, Silicon, and Systems

Discussion about this post

Ready for more?