preprints

arXiv
SwiftQueue: Optimizing Low-Latency Applications with Swift Packet Queuing

Siddhant Ray, Xi Jiang, Jack Luo, Nick Feamster, and Junchen Jiang

2024

Abs arXiv Bib

Low Latency, Low Loss, and Scalable Throughput (L4S), as an emerging router-queue management technique, has seen steady deployment in the industry. An L4S-enabled router assigns each packet to the queue based on the packet header marking. Currently, L4S employs per-flow queue selection, i.e. all packets of a flow are marked the same way and thus use the same queues, even though each packet is marked separately. However, this may hurt tail latency and latency-sensitive applications because transient congestion and queue buildups may only affect a fraction of packets in a flow. We present SwiftQueue, a new L4S queue-selection strategy in which a sender uses a novel per-packet latency predictor to pinpoint which packets likely have latency spikes or drops. The insight is that many packet-level latency variations result from complex interactions among recent packets at shared router queues. Yet, these intricate packet-level latency patterns are hard to learn efficiently by traditional models. Instead, SwiftQueue uses a custom Transformer, which is well-studied for its expressiveness on sequential patterns, to predict the next packet’s latency based on the latencies of recently received ACKs. Based on the predicted latency of each outgoing packet, SwiftQueue’s sender dynamically marks the L4S packet header to assign packets to potentially different queues, even within the same flow. Using real network traces, we show that SwiftQueue is 45-65% more accurate in predicting latency and its variations than state-of-art methods. Based on its latency prediction, SwiftQueue reduces the tail latency for L4S-enabled flows by 36-45%, compared with the existing L4S queue-selection method.
@misc{ray2024swiftqueueoptimizinglowlatencyapplications, title = {{S}wift{Q}ueue: {O}ptimizing {L}ow-{L}atency {A}pplications with {S}wift {P}acket {Q}ueuing}, author = {Ray, Siddhant and Jiang, Xi and Luo, Jack and Feamster, Nick and Jiang, Junchen}, year = {2024}, eprint = {2410.06112}, archiveprefix = {arXiv}, primaryclass = {cs.NI}, note = {In Submission}, }

peer reviewed

2025

EuroSys
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang

In Proceedings of the Twentieth European Conference on Computer Systems 2025

Abs arXiv Bib 🏅 Best Paper

Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, which makes precomputed KV caches not directly usable since they ignore the text’s cross-attention with the preceding texts. Thus, the benefits of reusing KV caches remain largely unrealized.This paper tackles just one challenge: when an LLM input contains multiple text chunks, how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? This challenge naturally arises in retrieval-augmented generation (RAG) where the input is supplemented with multiple retrieved texts as the context. We present CacheBlend, a scheme that reuses the precomputed KV caches, regardless prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache. In the meantime, the small extra delay for recomputing some tokens can be pipelined with the retrieval of KV caches within the same job, allowing CacheBlend to store KV caches in slower devices with more storage capacity while retrieving them without increasing the inference delay. By comparing CacheBlend with the state-of-the-art KV cache reusing schemes on three open-source LLMs of various sizes and four popular benchmark datasets of different tasks, we show that CacheBlend reduces time-to-first-token (TTFT) by 2.2-3.3\texttimes and increases the inference throughput by 2.8-5\texttimes from full KV recompute without compromising generation quality. The code is available at https://github.com/LMCache/LMCache.
@inproceedings{10.1145/3689031.3696098, author = {Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen}, title = {{C}ache{B}lend: {F}ast {L}arge {L}anguage {M}odel {S}erving for {RAG} with {C}ached {K}nowledge {F}usion}, year = {2025}, isbn = {9798400711961}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3689031.3696098}, doi = {10.1145/3689031.3696098}, booktitle = {Proceedings of the Twentieth European Conference on Computer Systems}, pages = {94–109}, numpages = {16}, keywords = {KV Cache, Large Language Models, Retrieval-Augmented-Generation}, location = {Rotterdam, Netherlands}, series = {EuroSys '25}, award_name = {Best Paper}, }
SOSP (to appear)
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation

Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, and Junchen Jiang

2025

Abs arXiv Bib

RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, METIS reduces the generation latency by 1.64−2.54× without sacrificing generation quality.
@misc{ray2024ragservefastqualityawarerag, title = {{{METIS}: {F}ast {Q}uality-{A}ware {RAG} {S}ystems with {C}onfiguration {A}daptation}}, author = {Ray, Siddhant and Pan, Rui and Gu, Zhuohan and Du, Kuntai and Feng, Shaoting and Ananthanarayanan, Ganesh and Netravali, Ravi and Jiang, Junchen}, year = {2025}, eprint = {2412.10543}, archiveprefix = {arXiv}, primaryclass = {cs.LG}, }

2024

SIGCOMM
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang

In Proceedings of the ACM SIGCOMM 2024 Conference 2024

Abs arXiv Bib

s large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache’s distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. % When available bandwidth drops, CacheGen may raise the compression level for a part of the context or recompute its KV cache on the fly. We test CacheGen on popular LLMs and datasets. Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality. Our code is at: this https URL.
@inproceedings{10.1145/3651890.3672274, author = {Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and Maire, Michael and Hoffmann, Henry and Holtzman, Ari and Jiang, Junchen}, title = {Cache{G}en: KV {C}ache {C}ompression and {S}treaming for {F}ast {L}arge {L}anguage {M}odel {S}erving}, year = {2024}, isbn = {9798400706141}, publisher = {ACM}, url = {https://doi.org/10.1145/3651890.3672274}, doi = {10.1145/3651890.3672274}, booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference}, keywords = {large language models, KV cache, compression}, location = {Sydney, NSW, Australia}, }
NAIC
Eloquent: A More Robust Transmission Scheme for LLM Token Streaming

Hanchen Li, Yuhan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, and Junchen Jiang

In Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing 2024

Abs arXiv Bib

To render each generated token in real-time for users, the Large Language Model (LLM) server generates tokens one by one and streams each token (or group of a few tokens) through the network to the user right after generation, which we refer to as LLM token streaming. However, under unstable network conditions, the LLM token streaming experience could suffer greatly from stalls since one packet loss could block the rendering of later tokens even if the packets containing them arrive on time. With a measurement study, we show that current applications suffer from increased stalls under unstable networks. For this emerging token streaming problem in LLM Chatbots that differs from previous multimedia and text applications, we propose a novel transmission scheme, called Eloquent, which puts newly generated tokens as well as currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and, in the meantime, is independently rendered when received, avoiding the aforementioned stalls caused by missing packets. Through simulation under various networks, we show Eloquent reduces stall ratio (proportion of token rendering wait time) by 71.0% compared to the retransmission method commonly used by real chatbot applications and by 31.6% compared to the baseline packet duplication scheme. By tailoring Eloquent to fit the token-by-token generation of LLM, we enable the Chatbots to respond like an eloquent speaker for users to better enjoy pervasive AI.
@inproceedings{10.1145/3672198.3673797, author = {Li, Hanchen and Liu, Yuhan and Cheng, Yihua and Ray, Siddhant and Du, Kuntai and Jiang, Junchen}, title = {Eloquent: {A} {M}ore {R}obust {T}ransmission {S}cheme for {LLM} {T}oken {S}treaming}, year = {2024}, isbn = {9798400707131}, publisher = {ACM}, url = {https://doi.org/10.1145/3672198.3673797}, doi = {10.1145/3672198.3673797}, booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing}, keywords = {Large Language Models, Real-Time Communication, Token Streaming}, location = {Sydney, NSW, Australia}, }

2022

HOTNETS
A New Hope for Network Model Generalization

Alexander Dietmüller, Siddhant Ray, Romain Jacob, and Laurent Vanbever

In Proceedings of the 21st ACM Workshop on Hot Topics in Networks 2022

Abs arXiv Bib Code

Generalizing machine learning (ML) models for network traffic dynamics tends to be considered a lost cause. Hence, for every new task, we often resolve to design new models and train them on model-specific datasets collected, whenever possible, in an environment mimicking the model’s deployment. This approach essentially gives up on generalization. Yet, an ML architecture called_Transformer_ has enabled previously unimaginable generalization in other domains. Nowadays, one can download a model pre-trained on massive datasets and only fine-tune it for a specific task and context with comparatively little time and data. These fine-tuned models are now state-of-the-art for many benchmarks. We believe this progress could translate to networking and propose a Network Traffic Transformer (NTT), a transformer adapted to learn network dynamics from packet traces. Our initial results are promising: NTT seems able to generalize to new prediction tasks and contexts. This study suggests there is still hope for generalization, though it calls for a lot of future research.
@inproceedings{dietmuller2022new, author = {Dietm\"{u}ller, Alexander and Ray, Siddhant and Jacob, Romain and Vanbever, Laurent}, title = {A New Hope for Network Model Generalization}, year = {2022}, isbn = {9781450398992}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3563766.3564104}, doi = {10.1145/3563766.3564104}, booktitle = {Proceedings of the 21st ACM Workshop on Hot Topics in Networks}, pages = {152–159}, numpages = {8}, keywords = {packet-level modeling, transformer}, location = {Austin, Texas}, series = {HotNets '22}, }

2020

IJMNDI
Machine learning based cell association for mMTC 5G communication networks

Siddhant Ray, and Budhaditya Bhattacharyya

International Journal of Mobile Network Design and Innovation 2020

Abs Bib Code

With the advent of 5G communication networks, the number of devices on the core 5G network significantly increases. A 5G network is a cloud native, massively connected IoT platform with a huge number of devices hosted on the network as compared to prior generation networks. Previously known Machine Type Communication (MTC), it is now known as massive Machine Type Communication (mMTC) and plays a pivotal role in the new network scenario with a larger pool of devices. As ultra-low latency is the key metric in developing 5G communication, a proper cell association scheme is now required to meet the load and traffic needs of the new network, as compared to the earlier cell association schemes which were based only on the Reference Signal Received Power (RSRP). The eNodeB with the highest RSRP may not always be optimal for cell association to provide the lowest latency. This paper proposes an unsupervised machine learning algorithm, namely Hidden Markov Model (HMM) learning on the network’s telemetry data, which is used to learn network parameters and select the best eNodeB for cell association, with the objective of ultimate ultralow latency. The proposed model uses an HMM learning followed by decoding for selecting the optimal cell for association.
@article{ray2020machine, title = {Machine learning based cell association for mMTC 5G communication networks}, author = {Ray, Siddhant and Bhattacharyya, Budhaditya}, journal = {International Journal of Mobile Network Design and Innovation}, volume = {10}, number = {1}, pages = {10--16}, year = {2020}, publisher = {Inderscience Publishers (IEL)}, }

posters

NSDI
Transformer-based Predictions for Sudden Network Changes (Poster)

Siddhant Ray, Xi Jiang, Zhuohan Gu, Junchen Jiang, and Nick Feamster

In 21st USENIX Symposium on Networked Systems Design and Implementation 2024

Abs Bib PDF

Accurate predictions on sudden changes in network states are crucial for the integrity of real-time applications. Traditional heuristic models fall short, especially in tail cases, struggling to capture long-term network dependencies. Modelling network traces as time series sequences, we explore the use of a Transformer model architecture, known for its success in time series prediction, to model network trace dependencies, focusing on sudden change predictions. Our preliminary result on using a Transformer model for predicting one-way delay (OWD) shows observable improvement over the heuristic baseline in prediction loss. This suggests a promising direction for enhancing network predictability and optimizing resource utilization.
@inproceedings{siddhantnsdi, author = {Ray, Siddhant and Jiang, Xi and Gu, Zhuohan and Jiang, Junchen and Feamster, Nick}, title = {Transformer-based Predictions for Sudden Network Changes (Poster)}, year = {2024}, isbn = {9781450398992}, publisher = {USENIX Association}, booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation}, keywords = {packet-level modeling, transformer}, location = {Santa Clara, CA}, series = {NSDI '24}, }