Siddhant Ray

I am a third-year PhD student in Computer Science at the University of Chicago, advised by Junchen Jiang and Nick Feamster. I am interested in efficient serving systems for Large Language Models that use LLM application feedback, and in machine learning methods for improving the performance of computer networks and systems.

Currently, I work on jointly optimizing quality and delay in Retrieval-Augmented Generation (RAG) systems through query-level configuration selection and resource scheduling. I also work on using Transformer models for per-packet latency prediction to improve queue selection and reduce tail latency for latency-sensitive applications.

In the past, I have worked on advances in software-defined networking, programmable networks, and cloud computing. I have also spent some time developing NLP techniques to analyse political corpora.

I'm fortunate to be additionally supported by the Liew Family Graduate Fellowship. Prior to starting my PhD, I earned my MSc in Electrical Engineering and Information Technology at ETH Zurich and my B.Tech in Electronics and Communication Engineering at VIT Vellore.

News

Oct, 2025	METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation accepted at ACM SOSP’25.
Sep, 2025	Serving as a Reviewer for AAAI’26 and ICLR’26.
Jun, 2025	Starting my research internship at Microsoft, Redmond jointly working with Microsoft Research and Outlook.
May, 2025	Selected for, and awarded a travel grant to attend, the inaugural PhD Research School held by the LDOS expedition at UT Austin.
May, 2025	Serving on the Artifact Evaluation Committee for USENIX ATC’25, OSDI’25 and CoNEXT’25.

Selected publications

SOSP
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation

Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, and Junchen Jiang

In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles 2025

RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge causes higher response delay. Prior work focuses either on reducing the response delay (e.g., better scheduling of RAG queries) or on maximizing quality (e.g., tuning the RAG workflow), but they fall short in systematically balancing the tradeoff between the delay and quality of RAG responses. To balance both quality and response delay, this paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods. Using four popular RAG-QA datasets, we show that compared to the state-of-the-art RAG optimization schemes, METIS reduces the generation latency by 1.64–2.54× without sacrificing generation quality.
@inproceedings{10.1145/3731569.3764855,
  author = {Ray, Siddhant and Pan, Rui and Gu, Zhuohan and Du, Kuntai and Feng, Shaoting and Ananthanarayanan, Ganesh and Netravali, Ravi and Jiang, Junchen},
  title = {{METIS}: Fast Quality-Aware {RAG} Systems with Configuration Adaptation},
  year = {2025},
  isbn = {9798400718700},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3731569.3764855},
  doi = {10.1145/3731569.3764855},
  booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
  pages = {606--622},
  numpages = {17},
  keywords = {RAG systems, LLM inference, scheduling},
  location = {Lotte Hotel World, Seoul, Republic of Korea},
  series = {SOSP '25},
  talk = {https://drive.google.com/file/d/1bMZf-38ubFO6hZk6ivZcsaL9kC72GEpa/view?usp=sharing},
}
EuroSys
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang

In Proceedings of the Twentieth European Conference on Computer Systems 2025

🏅 Best Paper

Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, which makes precomputed KV caches not directly usable since they ignore the text’s cross-attention with the preceding texts. Thus, the benefits of reusing KV caches remain largely unrealized. This paper tackles just one challenge: when an LLM input contains multiple text chunks, how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? This challenge naturally arises in retrieval-augmented generation (RAG) where the input is supplemented with multiple retrieved texts as the context. We present CacheBlend, a scheme that reuses the precomputed KV caches, regardless of whether they are prefixes, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache. In the meantime, the small extra delay for recomputing some tokens can be pipelined with the retrieval of KV caches within the same job, allowing CacheBlend to store KV caches in slower devices with more storage capacity while retrieving them without increasing the inference delay. By comparing CacheBlend with the state-of-the-art KV cache reusing schemes on three open-source LLMs of various sizes and four popular benchmark datasets of different tasks, we show that CacheBlend reduces time-to-first-token (TTFT) by 2.2–3.3× and increases the inference throughput by 2.8–5× from full KV recompute without compromising generation quality. The code is available at https://github.com/LMCache/LMCache.
@inproceedings{10.1145/3689031.3696098,
  author = {Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  title = {{C}ache{B}lend: {F}ast {L}arge {L}anguage {M}odel {S}erving for {RAG} with {C}ached {K}nowledge {F}usion},
  year = {2025},
  isbn = {9798400711961},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3689031.3696098},
  doi = {10.1145/3689031.3696098},
  booktitle = {Proceedings of the Twentieth European Conference on Computer Systems},
  pages = {94--109},
  numpages = {16},
  keywords = {KV Cache, Large Language Models, Retrieval-Augmented-Generation},
  location = {Rotterdam, Netherlands},
  series = {EuroSys '25},
  award_name = {Best Paper},
}
SIGCOMM
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang

In Proceedings of the ACM SIGCOMM 2024 Conference 2024

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache’s distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. When available bandwidth drops, CacheGen may raise the compression level for a part of the context or recompute its KV cache on the fly. We test CacheGen on popular LLMs and datasets. Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality. Our code is at: this https URL.
@inproceedings{10.1145/3651890.3672274,
  author = {Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and Maire, Michael and Hoffmann, Henry and Holtzman, Ari and Jiang, Junchen},
  title = {{C}ache{G}en: {KV} {C}ache {C}ompression and {S}treaming for {F}ast {L}arge {L}anguage {M}odel {S}erving},
  year = {2024},
  isbn = {9798400706141},
  publisher = {ACM},
  url = {https://doi.org/10.1145/3651890.3672274},
  doi = {10.1145/3651890.3672274},
  booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
  keywords = {large language models, KV cache, compression},
  location = {Sydney, NSW, Australia},
}
HOTNETS
A New Hope for Network Model Generalization

Alexander Dietmüller, Siddhant Ray, Romain Jacob, and Laurent Vanbever

In Proceedings of the 21st ACM Workshop on Hot Topics in Networks 2022

Generalizing machine learning (ML) models for network traffic dynamics tends to be considered a lost cause. Hence, for every new task, we often resolve to design new models and train them on model-specific datasets collected, whenever possible, in an environment mimicking the model’s deployment. This approach essentially gives up on generalization. Yet, an ML architecture called Transformer has enabled previously unimaginable generalization in other domains. Nowadays, one can download a model pre-trained on massive datasets and only fine-tune it for a specific task and context with comparatively little time and data. These fine-tuned models are now state-of-the-art for many benchmarks. We believe this progress could translate to networking and propose a Network Traffic Transformer (NTT), a transformer adapted to learn network dynamics from packet traces. Our initial results are promising: NTT seems able to generalize to new prediction tasks and contexts. This study suggests there is still hope for generalization, though it calls for a lot of future research.
@inproceedings{dietmuller2022new,
  author = {Dietm\"{u}ller, Alexander and Ray, Siddhant and Jacob, Romain and Vanbever, Laurent},
  title = {A New Hope for Network Model Generalization},
  year = {2022},
  isbn = {9781450398992},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3563766.3564104},
  doi = {10.1145/3563766.3564104},
  booktitle = {Proceedings of the 21st ACM Workshop on Hot Topics in Networks},
  pages = {152--159},
  numpages = {8},
  keywords = {packet-level modeling, transformer},
  location = {Austin, Texas},
  series = {HotNets '22},
}