Mountain View · PT · Available 2026

Hi,
I am Bhuvan.

Graduate researcher at Carnegie Mellon working at the intersection of diffusion models, autoregressive video, and the systems that make modern inference fast — from action-driven compute schedules and DPM-Solver++ on world models, to RadixAttention from first principles, to NKI kernels for MoE on Trainium.

✦ CUDA ✦ Triton ✦ FlashAttention ✦ Diffusion ✦ KV-Cache ✦ INT4 ✦ Trainium NKI ✦ MoE ✦ Speculative Decoding ✦ Modal ✦ vLLM ✦ SGLang ✦ PyTorch

(01) About

A researcher at the boundary of model and metal.

I build the layer where models meet metal. My research lives at the boundary of generative models and inference systems — squeezing latency, memory, and compounding error out of frontier models so they can run at interactive frame rates.

Lately that has meant 3.54× training-free speedups on autoregressive Minecraft world models, recursive language models that beat their own baseline by 4.4 points while spending 64% fewer tokens, and a closed-loop synthesis pipeline that mines, generates, and gates long-tail driving scenes for vision-language drivers.

I work end-to-end: hand-tuned CUDA softmax kernels, KV-cache compression, INT4/FP8 quantization, distributed training, and the empirical discipline of measuring 72 runs across 16 optimization families before claiming a result.

3.54×

WorldServe · 950-frame speedup

64%

ERLM token reduction · LongBench v2

Top 15

AWS NKI Challenge · Trainium3

72

Configs measured · 16 families

(02) Selected work

Where ideas meet silicon.

19 projects

01 / Inference

2026

WorldServe

A 3.54× training-free speedup recipe for autoregressive world model inference.

Open-Oasis 500M is an autoregressive Minecraft world model that runs ten DDIM steps per frame regardless of player activity, capping interactive generation at 2 fps on an H100. WorldServe stacks two orthogonal step-count cuts: DPM-Solver++ 2M (5 base steps) and an action-magnitude bucket schedule that picks 2, 3, or 5 forward passes per frame from the 25-dim Minecraft action vector. Together they give a 3.54× speedup at preserved self-coherence (Δvs_prev = −0.02 dB) on 950 real frames.

→ 72 measured configurations across 16 optimization families
→ Action-magnitude difficulty schedule: first free per-frame signal for adaptive diffusion compute
→ Empirical proof: step-count reduction is autoregressive-safe; per-step substitution compounds by 5–22 dB at length
→ TaylorSeer port with per-frame state reset: 2.52× standalone, 0/152K validation failures
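
As a rough illustration of the action-magnitude bucket schedule described above, here is a minimal Python sketch. The only details taken from this page are the 25-dim action vector and the 2/3/5 step buckets. The magnitude measure (an L1 norm), the threshold values, and every function and constant name are hypothetical stand-ins, not WorldServe's actual code.

```python
from typing import Sequence

ACTION_DIM = 25  # size of the Minecraft action vector, as stated on this page

# Hypothetical bucket edges: the page gives the 2/3/5 buckets but not
# how magnitude is measured or where the edges sit.
LOW_THRESHOLD = 0.5
HIGH_THRESHOLD = 2.0


def action_magnitude(action: Sequence[float]) -> float:
    """Return the L1 norm of the action vector (an assumed activity measure)."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} action values, got {len(action)}")
    return sum(abs(a) for a in action)


def steps_for_frame(action: Sequence[float]) -> int:
    """Pick 2, 3, or 5 denoising forward passes for one frame."""
    magnitude = action_magnitude(action)
    if magnitude < LOW_THRESHOLD:
        return 2  # near-idle player: cheapest bucket
    if magnitude < HIGH_THRESHOLD:
        return 3  # moderate activity
    return 5      # heavy activity: full base step count


def mean_steps(actions: Sequence[Sequence[float]]) -> float:
    """Average forward passes per frame over a sequence of actions."""
    steps = [steps_for_frame(a) for a in actions]
    return sum(steps) / len(steps)


idle = [0.0] * ACTION_DIM
moderate = [1.0] + [0.0] * (ACTION_DIM - 1)
busy = [1.0] * 3 + [0.0] * (ACTION_DIM - 3)

print(steps_for_frame(idle), steps_for_frame(moderate), steps_for_frame(busy))
```

With these illustrative thresholds the three sample actions land in the 2, 3, and 5 step buckets respectively. Because the schedule reads only the action vector that the model already receives, the per-frame decision adds no extra model evaluation.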

PyTorch 2.4 · CUDA 12.4 · DPM-Solver++ 2M · TaylorSeer · Modal H100 SXM · fp16 autocast

3.54×

−0.02 dB self-coherence · 950 frames

WorldServe

ERLM — Enhanced Recursive Language Models

Closed-Loop Long-Tail Synthesis

NKI-MoE

CUDA Transformer Acceleration

Distributed GPT-2 Training

Inception of Odyssey

Diabetic Retinopathy Explainability

Triton 3D Kernels

AgentStack

Indian-Language LLM Research

Photon Brain

Alexa-at-Home

CBIR + Depth Estimation

BERT from Scratch

SmartFarm DL + IoT

Pneumonia Detection

Kanbas LMS

Premier Visual Manipulator

A research trajectory.

M.S. Researcher — Diffusion & Inference Systems

AWS Annapurna — Trainium Kernel Challenge

WorldServe — World Model Inference

Enhanced Recursive Language Models

Tools of the trade.

Inference

Systems

Distributed

Research

MLOps

Languages

The full record.

Let's build the next inference stack.
