Mountain View · PT · Available 2026

Hi,
I am Bhuvan.

Graduate researcher at Carnegie Mellon working at the intersection of diffusion models, autoregressive video, and the systems that make modern inference fast — from action-driven compute schedules and DPM-Solver++ on world models, to RadixAttention from first principles, to NKI kernels for MoE on Trainium.

CUDA · Triton · FlashAttention · Diffusion · KV-Cache · INT4 · Trainium NKI · MoE · Speculative Decoding · Modal · vLLM · SGLang · PyTorch
(01) About

A researcher at the boundary of model and metal.

I build the layer where models meet metal. My research lives at the boundary of generative models and inference systems — squeezing latency, memory, and compounding error out of frontier models so they can run at interactive frame rates.

Lately that has meant 3.54× training-free speedups on autoregressive Minecraft world models, recursive language models that beat their own baseline by 4.4 points while spending 64% fewer tokens, and a closed-loop synthesis pipeline that mines, generates, and gates long-tail driving scenes for vision-language drivers.

I work end-to-end: hand-tuned CUDA softmax kernels, KV-cache compression, INT4/FP8 quantization, distributed training, and the empirical discipline of measuring 72 runs across 16 optimization families before claiming a result.

3.54×
WorldServe · 950-frame speedup
64%
ERLM token reduction · LongBench v2
Top 15
AWS NKI Challenge · Trainium3
72
Configs measured · 16 families
(02) Selected work

Where ideas meet silicon.

01 / Inference
2026

WorldServe

A 3.54× training-free speedup recipe for autoregressive world model inference.

Open-Oasis 500M is an autoregressive Minecraft world model that runs ten DDIM steps per frame regardless of player activity, capping interactive generation at 2 fps on H100. WorldServe stacks two orthogonal step-count cuts — DPM-Solver++ 2M (5 base steps) and an action-magnitude bucket schedule that picks 2/3/5 forwards per frame from the 25-dim Minecraft action vector — for a 3.54× speedup at preserved self-coherence (Δvs_prev = −0.02 dB) on 950 real frames.

  • 72 measured configurations across 16 optimization families
  • Action-magnitude difficulty schedule — first free per-frame signal for adaptive diffusion compute
  • Empirical result: step-count reduction is autoregressive-safe, while per-step substitution compounds to 5–22 dB of degradation at long rollout lengths
  • TaylorSeer port with per-frame state reset — 2.52× standalone, 0/152K validation failures
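
The action-magnitude bucket schedule described above can be sketched as follows. This is a minimal illustration, not WorldServe's code: the function name `steps_for_action`, the use of an L1 magnitude, and the thresholds `low` and `high` are all assumptions. Only the 25-dim action vector and the 2/3/5 step buckets come from the description.

```python
from typing import Sequence

ACTION_DIM = 25  # size of the Minecraft action vector, per the description above


def steps_for_action(action: Sequence[float], low: float = 0.5, high: float = 2.0) -> int:
    """Pick a per-frame solver step count from the action magnitude.

    `low` and `high` are hypothetical thresholds chosen for illustration.
    """
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} action values, got {len(action)}")
    magnitude = sum(abs(a) for a in action)  # L1 norm (an assumed choice of magnitude)
    if magnitude < low:
        return 2  # near-idle frame: cheapest bucket
    if magnitude < high:
        return 3  # moderate activity
    return 5  # large action: full base step count


idle = [0.0] * ACTION_DIM
busy = [1.0] * ACTION_DIM
print(steps_for_action(idle), steps_for_action(busy))
```

The appeal of a schedule like this is that the signal is free: the action vector is already an input to the model, so choosing the bucket costs no extra forward pass.
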
PyTorch 2.4 · CUDA 12.4 · DPM-Solver++ 2M · TaylorSeer · Modal H100 SXM · fp16 autocast
3.54×
−0.02 dB self-coherence · 950 frames
02 / Research
2026

ERLM — Enhanced Recursive Language Models

Five composable systems optimizations that beat the RLM baseline while spending 64% fewer tokens.

Recursive Language Models give an LLM a Python REPL to iteratively query long documents — but introduce five systems inefficiencies: linear retrieval, no convergence criterion, sequential sub-queries, KV recomputation, and over-provisioned weights. ERLM fixes all five through composable optimizations layered on the base RLM loop, evaluated on LongBench v2 with Qwen3-8B.

  • TF-IDF dynamic retrieval + Jaccard-based adaptive budget control
  • Async parallel sub-calls + RadixAttention KV prefix caching
  • FP8/INT8 quantization fitting Qwen3-8B with KV headroom
  • MiniTorch extension: KV cache + RadixAttention + Flash Attention from scratch (8/8 tests, 72.7% prefix hit)
  • OpenAI-compatible serving endpoint as drop-in for Ollama/vLLM
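
One way a Jaccard-based stopping rule could address the missing convergence criterion is sketched below. It assumes the signal is the token overlap between consecutive answers, which the description does not confirm. The helper names `jaccard` and `should_stop` and the `threshold` value are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def should_stop(prev_answer: str, new_answer: str, threshold: float = 0.8) -> bool:
    """Stop recursing once consecutive answers overlap strongly.

    `threshold` is a hypothetical value, not one taken from ERLM.
    """
    prev_tokens = set(prev_answer.lower().split())
    new_tokens = set(new_answer.lower().split())
    return jaccard(prev_tokens, new_tokens) >= threshold
```

A rule of this shape lets the loop spend fewer tokens on queries whose answer has already stabilized, while harder queries keep their full budget.
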
Qwen3-8B · vLLM · RadixAttention · Flash-Attn · PyTorch · MiniTorch
44.4% · −63.7%
accuracy · token reduction · 63.5% cache hit
scene frequency · log scale · long tail · < 0.1% · Mine → Generate → Verify → Train · DINOv3 · V-JEPA-2.1 · Cosmos-Transfer-2.5-2B
03 / Research
2026

Closed-Loop Long-Tail Synthesis

A mining → generation → verification pipeline for driving vision-language models.

Real long-tail driving clips constitute well under 0.1% of any open AV corpus, and that 0.1% is exactly what ships VLM drivers to regulators. The project proposes a four-stage closed loop: a 5-axis high-signal scorer (DINOv3 + V-JEPA-2.1 + 5-model BEV ensemble + Westhofen criticality + Cosmos-Reason2 rarity prior), typed SceneLayout extraction, Cosmos-Transfer-2.5-2B + multi-view + LiDARGen synthesis, and a four-gate verifier that re-uses the scoring stack to reject generations that lost their seed property.

  • 5-axis calibrated mining over WOD-E2E + nuScenes + CODA + nuPlan + comma2k19
  • 5-axis perturbation taxonomy: weather, agent substitution, multiplication, trajectory, layout
  • Multi-view (8-camera) + matching synthetic LiDAR + UniAD auto-labels
  • Registered ablation across 5 BEVFusion training recipes — quantifies targeting vs filtering vs generation
  • ~40K verified high-signal clips at full WOD-E2E scale
DINOv3 · V-JEPA-2.1 · UniAD / VAD / BEVFormer · Cosmos-Transfer-2.5-2B · BEVFusion
~40K
verified clips · 4-gate closed loop
04 / Systems
2026

NKI-MoE

Custom Trainium kernels for a 30B Mixture-of-Experts model.

Hand-written Neuron Kernel Interface (NKI) kernels for Qwen3-30B-A3B running on AWS Trainium2 and Trainium3. Re-engineered MoE routing, expert computation, and sparse-pattern attention to compete in the AWS Annapurna kernel challenge, which scores entries on Accuracy × Reduced_Latency × Throughput × Normalized_NKI_FLOPs.

  • Top-15 finish — advanced to Trainium3 round 2
  • Single-file NKI kernel covering routing + expert matmul
  • Direct work on the Trainium silicon programming model
  • AWS Neuron SDK 2.28 / NKI 1 + 2
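
For readers unfamiliar with MoE routing, the step that such a kernel fuses with the expert matmul can be sketched in plain Python. This is a generic top-k gate, not the NKI kernel. The function name `route_top_k` is hypothetical, and renormalizing the softmax over only the selected experts is an assumed convention rather than something the description states.

```python
import math
from typing import List, Tuple


def route_top_k(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Stable sort: ties keep their original (lowest-index-first) order.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # shift by the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]
```

Each token's output is then the weighted sum of the selected experts' outputs, which is why only a fraction of the model's parameters is active per token.
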
AWS Neuron SDK 2.28 · NKI 1/2 · Trainium2/3 · Qwen3-30B-A3B · PyTorch
Top 15
AWS Annapurna kernel challenge
softmax · LayerNorm · float4
05 / Systems
2026

CUDA Transformer Acceleration

Hand-tuned softmax + LayerNorm kernels for transformer attention.

Custom CUDA kernels replacing PyTorch ops in the attention block. Two softmax variants — warp-level reduction for short sequences, block-level with CUB BlockLoad/Store for long ones — plus a fused LayerNorm with float4 vectorized loads.

  • Causal, padding, and future-mask handling with -inf shifting
  • Numerically stable max-shift normalization
  • Single-pass fused LayerNorm: variance + mean + normalize
  • CUB primitives, cooperative groups, and shared-memory reductions
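
The max-shift normalization and -inf masking listed above can be shown on a single row in plain Python. This is a reference sketch of the arithmetic only, not the CUDA kernel. The function name `masked_softmax` and the choice to return zeros for a fully masked row are assumptions.

```python
import math
from typing import List

NEG_INF = float("-inf")


def masked_softmax(scores: List[float], keep: List[bool]) -> List[float]:
    """Softmax over one row with masked positions forced to probability 0."""
    shifted = [s if k else NEG_INF for s, k in zip(scores, keep)]
    row_max = max(shifted)
    if row_max == NEG_INF:  # fully masked row: return zeros rather than NaN
        return [0.0] * len(scores)
    # Subtracting the row max keeps exp() in range; exp(-inf) evaluates to 0.0.
    exps = [math.exp(s - row_max) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]
```

Without the shift, a score of 1000 would overflow `math.exp`; with it, the largest exponent is always 0. In a kernel, the row maximum and the sum are the two reductions that the warp-level or block-level code computes.
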
CUDA C++ · CUB · cooperative_groups · PyTorch C++ ext
~6.5×
kernel speedup over PyTorch baseline
06 / Systems
2026

Distributed GPT-2 Training

DDP + pipeline parallelism from first principles.

Implementation of data-parallel and pipeline-parallel training for GPT-2, including dataset partitioning, gradient AllReduce across ranks, layer-wise model splits, microbatch scheduling, and worker thread queues. I then layered SGLang RadixAttention, DeepSpeed ZeRO, and LoRA on Llama-2 7B for fine-tuning on 2× V100.

  • Custom _clock_cycles microbatch scheduler with worker threads
  • 1.5×+ throughput scaling on 2 GPUs
  • Direct torch.distributed primitives, no Lightning
  • Llama-2 7B LoRA on 2× V100 16GB via DeepSpeed ZeRO
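
A microbatch scheduler of the kind named above can be sketched as a generator. The body below is a reconstruction of the usual GPipe-style schedule and is an assumption. Only the name `_clock_cycles` appears in the project description, so the sketch uses the plain name `clock_cycles` and hypothetical parameter names.

```python
from typing import Iterator, List, Tuple


def clock_cycles(num_microbatches: int, num_stages: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield, per clock tick, the (microbatch, stage) pairs that can run in parallel."""
    for clock in range(num_microbatches + num_stages - 1):
        yield [
            (clock - stage, stage)
            for stage in range(num_stages)
            if 0 <= clock - stage < num_microbatches
        ]


for tick, work in enumerate(clock_cycles(3, 2)):
    print(tick, work)
```

With 3 microbatches and 2 stages the pipeline takes 4 ticks: stage 1 sits idle on the first tick and stage 0 on the last. That idle time is the pipeline bubble, and it shrinks relative to useful work as the microbatch count grows.
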
PyTorch Distributed · NCCL · DeepSpeed ZeRO · SGLang · FlashInfer · Llama-2 7B + LoRA
2× GPU
pipeline + DDP scaling
Kafka · Train · Snapshot · A/B Route · Serve
07 / MLOps
2026

Inception of Odyssey

Production movie recommender for 1M users with full MLOps.

End-to-end recommendation system: hybrid User-User CF + TF-IDF content model, intelligent routing, blue/green deploys, A/B routing via stable MD5 hashing, Kafka ingestion, automated 3-day retraining, and a Prometheus + Grafana telemetry stack. CF achieves 2.22 RMSE; cold-start TF-IDF achieves 5.80 RMSE at 123 req/s.

  • Stable hash-bucket A/B router with statistical comparison
  • Versioned model snapshots with git-commit provenance
  • Five Prometheus alert rules — drift, latency, availability, accuracy, new-user fraction
  • 70%+ availability target with <50h downtime / 72h window · 200+ tests · 74% coverage
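
Stable hash-bucket routing can be sketched with the standard library. The function name `assign_variant`, the 100-bucket split, and the `treatment_share` parameter are assumptions for illustration. Only the use of MD5 for stable A/B assignment comes from the description.

```python
import hashlib


def assign_variant(user_id: str, treatment_share: float = 0.5, buckets: int = 100) -> str:
    """Map a user to "A" or "B" deterministically from an MD5 digest of the ID."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets  # stable across processes and restarts
    return "B" if bucket < treatment_share * buckets else "A"
```

A digest is used instead of Python's built-in `hash()` because string hashes are salted per process by default, which would move users between arms on every restart.
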
Flask · scikit-learn · Kafka · Docker · Prometheus · Grafana
1M users
20K movies · 200+ tests · 74% cov
08 / Research
2026

Diabetic Retinopathy Explainability

Policy-compliant explainability for medical screening AI.

ResNet-50 classifier across 5 severity levels of diabetic retinopathy, packaged with dual-audience explainability — Grad-CAM heatmaps and confidence reports for nurses, fairness audits and limitations docs for procurement officers under an 8-point responsible-AI policy.

  • Cohen’s Kappa 0.913 against expert grading
  • Grad-CAM spatial attention overlays per severity class
  • Per-demographic fairness audits (age, gender)
  • 8-point responsible-AI policy compliance mapping
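
For reference, Cohen's kappa compares observed agreement with the agreement expected by chance. The sketch below computes the unweighted form. Whether the reported κ is weighted or unweighted is not stated above, so treat that choice and the function name `cohens_kappa` as assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa is more informative than raw accuracy when severity classes are imbalanced, because a model that always predicts the majority class scores near zero.
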
TensorFlow · Grad-CAM · scikit-learn · APTOS fundus dataset
88.5%
classification accuracy · κ 0.913
09 / Systems
2026

Triton 3D Kernels

Custom Flash Attention v2: 100.8 TFLOPS on H100, 1.35× attention speedup in TripoSR.

Hand-tuned Triton kernel for Flash Attention v2 with deferred normalization, split backward kernels (no atomics), and `triton.autotune` over 6 configs. Drops directly into TripoSR's 16 self-attention layers for an end-to-end 1.14× speedup at output cosine-similarity 1.000000. Live Modal demo deployable from a single command.

  • 100.8 TFLOPS on H100 80GB HBM3 (B=2, H=16, S=2048, D=64)
  • Deferred normalization — accumulate unnormalized O, divide by ℓ once at end
  • Split backward (`_bwd_dkdv` + `_bwd_dq`) — no atomics, 1.4× faster at long seq
  • Live demo on Modal: bnallamo--triton-3d-demo-launch-gradio.modal.run
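
Deferred normalization can be illustrated for a single query row in plain Python. This is a reference sketch of the arithmetic, not the Triton kernel. The function name `attention_row` and the `block` parameter are hypothetical, and the usual 1/sqrt(d) score scaling is left out for brevity.

```python
import math
from typing import List


def attention_row(q: List[float], keys: List[List[float]], values: List[List[float]],
                  block: int = 2) -> List[float]:
    """One query row of softmax(q·K^T)·V, streamed over key/value blocks."""
    running_max = float("-inf")
    denom = 0.0                      # ℓ: running sum of exp(score - running_max)
    acc = [0.0] * len(values[0])     # unnormalized output accumulator
    for start in range(0, len(keys), block):
        ks, vs = keys[start:start + block], values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale earlier partial sums
        denom *= scale
        acc = [x * scale for x in acc]
        for s, v in zip(scores, vs):
            p = math.exp(s - new_max)
            denom += p
            acc = [x + p * vi for x, vi in zip(acc, v)]
        running_max = new_max
    return [x / denom for x in acc]  # the single deferred division by ℓ
```

The result does not depend on the block size, because every block's contribution is rescaled to the same running maximum before the final division.
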
Triton · PyTorch · CUDA 12 · TripoSR · Modal H100 · Gradio
100.8
TFLOPS · 1.35× attention speedup
10 / MLOps
2026

AgentStack

Stack Overflow for AI agents — agent-first bug resolution platform.

When a developer agent hits a bug, it queries AgentStack instead of burning compute on trial-and-error. If another agent has already solved that exact error, the platform returns a structured executable patch. If not, the agent solves it and contributes back. Full-stack: FastAPI backend + Next.js dashboard + TypeScript & Python SDKs + MCP server, published as `agentstackio` on npm.

  • FastAPI backend (Render) + Next.js frontend (Vercel) — live deployment
  • TypeScript SDK + Python SDK published as `agentstackio` on npm
  • MCP server for Claude Code / Cursor integration
  • 60-second install via `npm install -g agentstackio` + JSON config
FastAPI · Next.js · TypeScript · Python · Supabase · MCP · Render · Vercel
Live
agentstackio · npm · Render · Vercel
11 / Research
2026

Indian-Language LLM Research

Research proposal + landscape review of sovereign Indian-language foundation models.

Comprehensive research artifact on the Indian-language LLM ecosystem unveiled at the India AI Impact Summit 2026 (New Delhi) — Sarvam-105B (105B MoE), Sarvam-30B, BharatGen Param-2, Krutrim-2, Gnani Voice Indic. Includes formal LaTeX research proposal, benchmark analysis (IndicGenBench, MMLU, ARC-C, Flores), dataset survey, and correspondence with CMU LTI faculty.

  • 6 flagship Indic foundation models surveyed in depth
  • Benchmark comparison — IndicGenBench, MMLU, ARC-C, Flores En→Indic
  • Formal LaTeX research proposal with PDF
  • Tokenizer + Indic data pipeline analysis
LaTeX · Markdown · Research
6 LLMs
sovereign Indic foundations · benchmarks
12 / Systems
2026

Photon Brain

iMessage-native second brain — memory · CRM · OCR · voice · scheduler from chat.

TypeScript SDK + agent that turns iMessage on Mac into a programmable second brain. It reads and writes Apple's chat.db SQLite database directly, watches messages in real time, sends in batches, and runs scheduled and recurring messages, with smart reminders, attachments, a plugin system, and webhook support. Built on Bun / Node.

  • Reads + writes Apple chat.db SQLite directly · macOS native
  • MessageChain API + MessageScheduler + Smart Reminders
  • Real-time message watching with reactive triggers
  • Plugin + webhook system for custom automations
TypeScript · Bun · SQLite · AppleScript · macOS
12 features
iMessage SDK · 17-section docs
13 / Research
2026

Alexa-at-Home

Multi-user smart-home reasoning system — NLU + patent-aware insights.

Python backend organized into six reasoning modules: CASAS dataset integration, multi-user disambiguation, NLU, insights generation, patent-aware retrieval, and a unified reasoning layer. Frontend dashboard for visualization. Smart-home assistant that understands who is speaking and what activity is in progress.

  • 6 specialized modules: casas / insights / multiuser / nlu / patent / reasoning
  • CASAS smart-home dataset integration
  • Multi-user disambiguation in shared spaces
  • Patent-aware retrieval-augmented reasoning
Python · NLU · CASAS · Smart Home
6 modules
reasoning · multi-user
14 / Research
2025

CBIR + Depth Estimation

C++/OpenCV image retrieval engine with monocular depth + face detection.

Computer-vision pipeline that combines histogram, texture/color, custom feature, and deep embedding matchers with Depth-Anything-V2 ONNX inference for monocular depth. Includes face detection (Haar cascades), video processing (`vidDisplay` + `da2-video`), and a baseline matcher for benchmarking. 30+ source files across 115 MB of code and assets.

  • Depth-Anything-V2 ONNX integration for monocular depth
  • Multiple matchers — histogram / texture / custom / embedding
  • Video pipeline (`vidDisplay` + `da2-video`)
  • Face detection (Haar cascades) + face tracking
C++ · OpenCV · ONNX Runtime · Depth-Anything-V2
115 MB
CV pipeline · 30+ source files
15 / Research
2026

BERT from Scratch

BERT pre-training and fine-tuning implemented from first principles.

Implementing BERT (Bidirectional Encoder Representations from Transformers) from scratch as a CMU course project. Covers tokenization, masked-language-modeling pre-training, fine-tuning, and downstream evaluation. Includes formal midterm and final reports documenting baseline architecture, training experiments, and ablations.

  • BERT pre-training implemented from scratch
  • Baseline + multiple ablation experiments
  • Formal midterm + final reports
  • Foundation-model fundamentals exercise
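
The masked-language-modeling step can be sketched with the corruption rule from the original BERT recipe: select about 15% of positions, then replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged. The function name `mask_tokens` and its parameters are hypothetical, this is not the course project's code, and special-token handling is omitted.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mask_tokens(token_ids: Sequence[int], vocab_size: int, mask_id: int,
                mask_prob: float = 0.15,
                seed: Optional[int] = None) -> Tuple[List[int], List[Optional[int]]]:
    """Return (corrupted inputs, labels); labels are None where no loss is taken."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels: List[Optional[int]] = [None] * len(inputs)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels
```

Keeping some selected tokens unchanged or randomized reduces the mismatch between pre-training, where the mask token appears, and fine-tuning, where it never does.
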
PyTorch · Transformers · Python · Jupyter
From scratch
pre-train + ablations + reports
16 / Research
2024

SmartFarm DL + IoT

Deep learning + IoT decision-support system for agriculture.

Combines crop-vision deep learning with IoT sensor telemetry to surface irrigation, fertilizer, and disease-risk decisions for farmers. Includes a video walkthrough of the working system showing the end-to-end pipeline.

  • Crop classification CNN trained on field imagery
  • IoT sensor integration (soil, humidity, temperature)
  • Decision-support dashboard for farmers
  • Video demo: youtu.be/SXvQFBdIwcg
TensorFlow · Python · IoT · Jupyter
Demo
youtu.be/SXvQFBdIwcg
17 / Research
2023

Pneumonia Detection

Full-stack X-ray pneumonia classifier — Flask API + React frontend.

A CNN-based pneumonia classifier trained on a chest X-ray dataset, served as a full-stack web app. A Flask API serves the trained model; a React frontend uploads images and displays predictions with confidence scores. The stack is deployment-ready end to end.

  • CNN classifier on chest X-ray dataset
  • Flask API serving the trained model
  • React frontend for image upload + result display
  • End-to-end deployment-ready stack
TensorFlow · Flask · React · Python · Jupyter
Full-stack
CNN · Flask API · React UI
18 / MLOps
2024

Kanbas LMS

Canvas LMS clone — React + TypeScript + Redux frontend, Node + Express backend.

Full-stack Learning Management System inspired by Canvas. React + Redux Toolkit + TypeScript on the front; Express/Node + MongoDB on the back. Course modules, assignments, quizzes, gradebook, role-based access (admin / faculty / student).

  • React + TypeScript + Redux Toolkit frontend
  • Express/Node + MongoDB backend
  • Course modules, assignments, quizzes, gradebook
  • Role-based access (admin / faculty / student)
React · TypeScript · Redux · Node.js · Express · MongoDB
Full-stack
CRUD · auth · roles
19 / Systems
2024

Premier Visual Manipulator

Java image-manipulation tool with Swing GUI and a custom scripting language.

Multi-mode image processor implementing classical image operations (filters, color transforms, blur/sharpen, histogram equalization) with a Swing-based GUI for interactive editing and a custom scripting language for batch operations. Includes a JUnit test suite with reference output images.

  • Swing GUI with live preview
  • Custom scripting language for batch processing
  • Filters, color transforms, histogram equalization
  • JUnit test suite with reference outputs
Java · Swing · JUnit
GUI + DSL
image manipulation · script-driven
Trajectory

A research trajectory.

  1. 2025 — Present

    M.S. Researcher — Diffusion & Inference Systems

    Carnegie Mellon University

    Building inference frameworks for diffusion-based world models and recursive language models. Course path through CMU 11-868 LLM Systems, 15-442/642 ML Systems, and ML in Production. Working on autoregressive video acceleration, KV-cache surgery, and long-tail synthesis for driving VLMs.

  2. AWS Annapurna — Trainium Kernel Challenge

    AWS Open Competition

    Top-15 finish writing custom NKI kernels for a 30B MoE model on Trainium2. Advanced to Trainium3 round 2 with the top teams.

  3. Spring 2026

    WorldServe — World Model Inference

    CMU 15-442 / 15-642 Final

    3.54× training-free speedup on Open-Oasis 500M. 72 measured configurations across 16 optimization families. First per-frame difficulty signal exploited from the action input pipe of an autoregressive world model.

  4. Spring 2026

    Enhanced Recursive Language Models

    CMU 11-868 LLM Systems

    Five composable systems optimizations on RLM. 44.4% accuracy on LongBench v2 (vs 40% baseline) with 63.7% token reduction and 63.5% RadixAttention prefix cache hit rate.

(03) Stack

Tools of the trade.

Inference

01
  • DPM-Solver++
  • TaylorSeer caching
  • KV-cache compression
  • INT4 / FP8 quant
  • Speculative decoding
  • Sparse attention

Systems

02
  • CUDA / CUB
  • Triton
  • AWS NKI
  • Flash-Attn
  • NCCL
  • torchao

Distributed

03
  • DDP
  • Pipeline parallel
  • DeepSpeed ZeRO
  • vLLM
  • SGLang / RadixAttention
  • Modal

Research

04
  • Diffusion models
  • World models
  • MoE routing
  • Long-tail mining
  • Cosmos-Transfer
  • Grad-CAM

MLOps

05
  • Kafka
  • Prometheus
  • Grafana
  • Docker
  • A/B testing
  • Blue/green deploys

Languages

06
  • Python
  • C++ / CUDA C
  • PyTorch
  • TensorFlow
  • JAX
  • TypeScript
(04) Resume

The full record.

Last updated · Apr 2026

resume.pdf
1 page · 80 KB · v.Apr-2026

(05) Get in touch

Let's build the
next inference stack.

Open to research collaborations, internships, and conversations about diffusion inference, kernel work, or any system that needs to be made fast. Best way to reach me is email — I read everything.

bnallamo@andrew.cmu.edu