Projects
A growing collection of nifty projects
Chunk-Caching for RAG in LLMs
Primary project (2024–25) + Undergraduate Internship (2024). Created a key-value (KV) cache reuse technique for LLMs, cutting token recomputation by 80–90% and substantially reducing prefill times. Included kernel-level implementations in vLLM and GPU/CPU cache management systems. Contributed to a paper submitted to SIGMOD 2025.
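The core idea can be illustrated with a minimal, hypothetical sketch: if a new prompt shares a prefix of cached chunks with an earlier request, prefill can reuse their KV entries and only compute the suffix. All names here (KVChunkCache, CHUNK) are illustrative stand-ins, not the actual vLLM implementation.

```python
CHUNK = 4  # tokens per cached chunk (illustrative granularity)

class KVChunkCache:
    """Toy model of chunk-level KV reuse: counts reused vs. recomputed tokens."""

    def __init__(self):
        self._store = {}  # chunk-prefix key -> stand-in for cached KV tensors

    def prefill(self, tokens):
        """Return (reused_tokens, computed_tokens) for one prompt."""
        reused = 0
        prefix = []
        # Walk the prompt chunk by chunk; reuse stops at the first cache miss,
        # since deeper prefixes necessarily contain the missing chunk.
        for i in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
            prefix.append(tuple(tokens[i:i + CHUNK]))
            key = tuple(prefix)
            if key in self._store:
                reused += CHUNK
            else:
                self._store[key] = object()  # placeholder for the KV entry
        computed = len(tokens) - reused
        return reused, computed
```

A second request extending a previously seen prompt then recomputes only its new tail, which is where the 80–90% savings come from when prompts share long common prefixes.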
Interference-Aware LLM Inference Scheduling
Alind Khare (GaTech), PhD Internship (2024). Designed an interference-aware LLM inference scheduling system with a two-level (global + local) scheduler. Optimized both time-to-first-token (TTFT) and time-per-output-token (TPOT) latencies via a Mixed-Integer Linear Programming (MILP) formulation. Integrated predictive placement based on expected decode length.
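To make the placement idea concrete, here is a hypothetical greedy stand-in for the decision the MILP solves exactly: assign each request to the GPU with the least predicted residual decode work, using the expected decode length as the interference proxy. The linear interference model and all names are illustrative assumptions, not the actual scheduler.

```python
def place_requests(requests, num_gpus):
    """Greedy longest-processing-time placement sketch.

    requests: list of (request_id, expected_decode_len) pairs.
    Returns (placement dict, predicted per-GPU load).
    """
    load = [0] * num_gpus  # predicted residual decode work per GPU
    placement = {}
    # Placing longest-decode requests first keeps loads balanced (LPT heuristic).
    for rid, decode_len in sorted(requests, key=lambda r: -r[1]):
        gpu = min(range(num_gpus), key=lambda g: load[g])
        placement[rid] = gpu
        load[gpu] += decode_len
    return placement, load
```

The real system replaces this heuristic with a MILP so that TTFT and TPOT constraints can be encoded jointly rather than approximated by a single load metric.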
Kernel Profiling and Optimization for A100 GPUs
Shweta Pandey (IISc), PhD Internship (2024). Built a detailed performance profile of models running on NVIDIA A100 GPUs using advanced NVIDIA profiling tools. Analyzed kernel compute, memory, and communication characteristics to identify bottlenecks. Extended the profiling work to custom kernel scheduling decisions, achieving 1.5–2× gains over naive batching with Multi-Process Service (MPS) optimizations.
Scaling Approximate Caching for Text-to-Image pipelines
Primary project (2024). Developed a concept-decomposition technique for recombining noise caches, improving generation fidelity and efficiency (ECCV 2024), and designed a dynamic scaling framework that reduces GPU costs by 30% (NSDI 2024; ASPLOS 2025, under review).
Scaling Visualization Recommender Models on Large Data
Undergraduate Internship (2023). Designed a plug-in framework for Visualization Recommender Systems, achieving up to 10× lower latency. Implemented a scalable Deep Q-Learning algorithm to optimize input-statistics selection. Contributed to a publication at PAKDD 2024.
Training Encoder-Decoder for Noise Retrieval
Undergraduate Internship (2023). Trained a novel embedding architecture (bi-encoder/cross-encoder) for efficient retrieval of noise caches, focusing on improving retrieval efficiency and accuracy in approximate caching systems.
Approximate Caching for Diffusion Models
Primary project (2023). Developed a caching system that skips redundant denoising iterations by retrieving and refining intermediate noise states, achieving up to 2× faster image generation with a 20% reduction in inference costs; deployed in Adobe Firefly (NSDI 2024).
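A minimal sketch of the skip-redundant-iterations idea: cache the intermediate noise state from step K of an earlier prompt, and for a sufficiently similar new prompt, resume denoising from that state instead of from pure noise. The Jaccard word-overlap similarity, the thresholds, and all names are illustrative stand-ins for the learned retrieval used in the real system.

```python
TOTAL_STEPS = 50     # full denoising schedule (assumed)
SKIP_STEPS = 25      # iterations saved on a cache hit (assumed)
SIM_THRESHOLD = 0.6  # minimum prompt similarity to trust a cached state

def similarity(a, b):
    """Toy prompt similarity: Jaccard overlap of word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

class NoiseCache:
    def __init__(self):
        self._entries = []  # (prompt, cached intermediate noise state)

    def generate(self, prompt):
        """Return the number of denoising steps actually run for this prompt."""
        best = max(self._entries, key=lambda e: similarity(e[0], prompt),
                   default=None)
        if best and similarity(best[0], prompt) >= SIM_THRESHOLD:
            # Cache hit: resume from the stored step-K state, then refine.
            return TOTAL_STEPS - SKIP_STEPS
        # Cache miss: run full generation and store the step-K state.
        self._entries.append((prompt, f"state@{SKIP_STEPS}:{prompt}"))
        return TOTAL_STEPS
```

Under these assumptions, a hit halves the step count, which mirrors the up-to-2× speedup reported above; the production system instead retrieves cached states via learned embeddings rather than word overlap.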
Cloud Systems Reliability
Primary project (2022–23). Developed a proactive outage detection system that cut detection time by 88% (FSE 2023), and advanced root-cause analysis techniques using causal discovery and knowledge graphs, improving resolution efficiency (WWW 2023, ASE 2023).