Publications
Publications by category in reverse chronological order.
preprints
- SwiftQueue: Optimizing Low-Latency Applications with Swift Packet Queuing. Siddhant Ray, Xi Jiang, Jack Luo, Nick Feamster, and Junchen Jiang. 2024.
Low Latency, Low Loss, and Scalable Throughput (L4S), an emerging router-queue management technique, has seen steady deployment in the industry. An L4S-enabled router assigns each packet to a queue based on the packet header marking. Currently, L4S employs per-flow queue selection, i.e., all packets of a flow are marked the same way and thus use the same queues, even though each packet is marked separately. However, this may hurt tail latency and latency-sensitive applications because transient congestion and queue buildups may only affect a fraction of packets in a flow. We present SwiftQueue, a new L4S queue-selection strategy in which a sender uses a novel per-packet latency predictor to pinpoint which packets are likely to see latency spikes or drops. The insight is that many packet-level latency variations result from complex interactions among recent packets at shared router queues. Yet, these intricate packet-level latency patterns are hard to learn efficiently with traditional models. Instead, SwiftQueue uses a custom Transformer, which is well-studied for its expressiveness on sequential patterns, to predict the next packet’s latency based on the latencies of recently received ACKs. Based on the predicted latency of each outgoing packet, SwiftQueue’s sender dynamically marks the L4S packet header to assign packets to potentially different queues, even within the same flow. Using real network traces, we show that SwiftQueue is 45-65% more accurate in predicting latency and its variations than state-of-the-art methods. Based on its latency prediction, SwiftQueue reduces the tail latency for L4S-enabled flows by 36-45%, compared with the existing L4S queue-selection method.
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2024.
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, and when they are not, their precomputed KV caches cannot be directly used since they ignore the text’s cross-attention with the preceding text in the LLM input. Thus, the benefits of reusing KV caches remain largely unrealized. This paper tackles just one question: when an LLM input contains multiple text chunks, how can their precomputed KV caches be quickly combined to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? We present CacheBlend, a scheme that reuses the pre-computed KV caches, regardless of whether they are the prefix, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache. Meanwhile, the small extra delay for recomputing some tokens can be pipelined with the retrieval of KV caches within the same job, allowing CacheBlend to store KV caches in slower devices with more storage capacity while retrieving them without increasing the inference delay. By comparing CacheBlend with the state-of-the-art KV cache reusing schemes on three open-source LLMs of various sizes and four popular benchmark datasets of different tasks, we show that CacheBlend reduces time-to-first-token (TTFT) by 2.2-3.3X and increases the inference throughput by 2.8-5X, compared with full KV recompute, without compromising generation quality or incurring more storage cost.
- RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation. Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Ganesh Ananthanarayanan, Ravi Netravali, and Junchen Jiang. 2024.
RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents RAGServe, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, RAGServe reduces the generation latency by 1.64-2.54X without sacrificing generation quality.
peer reviewed
2024
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. In Proceedings of the ACM SIGCOMM 2024 Conference, 2024.
As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache’s distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. When available bandwidth drops, CacheGen may raise the compression level for a part of the context or recompute its KV cache on the fly. We test CacheGen on popular LLMs and datasets. Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality. Our code is at: this https URL.
- Eloquent: A More Robust Transmission Scheme for LLM Token Streaming. Hanchen Li, Yuhan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, and Junchen Jiang. In Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, 2024.
To render each generated token in real-time for users, the Large Language Model (LLM) server generates tokens one by one and streams each token (or group of a few tokens) through the network to the user right after generation, which we refer to as LLM token streaming. However, under unstable network conditions, the LLM token streaming experience could suffer greatly from stalls since one packet loss could block the rendering of later tokens even if the packets containing them arrive on time. With a measurement study, we show that current applications suffer from increased stalls under unstable networks. For this emerging token streaming problem in LLM Chatbots that differs from previous multimedia and text applications, we propose a novel transmission scheme, called Eloquent, which puts newly generated tokens as well as currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and, in the meantime, is independently rendered when received, avoiding the aforementioned stalls caused by missing packets. Through simulation under various networks, we show Eloquent reduces stall ratio (proportion of token rendering wait time) by 71.0% compared to the retransmission method commonly used by real chatbot applications and by 31.6% compared to the baseline packet duplication scheme. By tailoring Eloquent to fit the token-by-token generation of LLM, we enable the Chatbots to respond like an eloquent speaker for users to better enjoy pervasive AI.
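The packet-construction idea in the Eloquent abstract (each outgoing packet carries the new tokens plus every token not yet acknowledged) can be illustrated with a toy sketch. This is not the authors' implementation: the `Packet`, `Sender`, and `Receiver` classes, the cumulative-ACK convention, and the token indexing are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    start: int         # index of the first token carried (hypothetical field)
    tokens: list[str]  # unacknowledged tokens followed by the new ones


@dataclass
class Sender:
    sent: list[str] = field(default_factory=list)  # every token generated so far
    acked: int = 0  # count of leading tokens the receiver has confirmed

    def next_packet(self, new_tokens: list[str]) -> Packet:
        # Append the new tokens, then ship everything past the ACK point.
        self.sent.extend(new_tokens)
        return Packet(start=self.acked, tokens=self.sent[self.acked:])

    def on_ack(self, received_upto: int) -> None:
        # Assumed cumulative ACK: the receiver holds tokens [0, received_upto).
        self.acked = max(self.acked, received_upto)


@dataclass
class Receiver:
    rendered: list[str] = field(default_factory=list)

    def on_packet(self, pkt: Packet) -> int:
        have = len(self.rendered)
        if pkt.start <= have:
            # Skip tokens already rendered; render the rest immediately.
            self.rendered.extend(pkt.tokens[have - pkt.start:])
        return len(self.rendered)  # value to send back as the ACK


def demo() -> str:
    sender, receiver = Sender(), Receiver()
    sender.on_ack(receiver.on_packet(sender.next_packet(["The"])))
    sender.next_packet([" quick"])             # this packet is lost in transit
    third = sender.next_packet([" brown"])     # carries " quick" and " brown"
    receiver.on_packet(third)
    return "".join(receiver.rendered)
```

In this toy run the second packet never arrives, yet the third packet still carries the missing token, so the receiver can render without waiting for a retransmission.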
2022
- A New Hope for Network Model Generalization. Alexander Dietmüller, Siddhant Ray, Romain Jacob, and Laurent Vanbever. In Proceedings of the 21st ACM Workshop on Hot Topics in Networks, 2022.
Generalizing machine learning (ML) models for network traffic dynamics tends to be considered a lost cause. Hence, for every new task, we often resolve to design new models and train them on model-specific datasets collected, whenever possible, in an environment mimicking the model’s deployment. This approach essentially gives up on generalization. Yet, an ML architecture called Transformer has enabled previously unimaginable generalization in other domains. Nowadays, one can download a model pre-trained on massive datasets and only fine-tune it for a specific task and context with comparatively little time and data. These fine-tuned models are now state-of-the-art for many benchmarks. We believe this progress could translate to networking and propose a Network Traffic Transformer (NTT), a transformer adapted to learn network dynamics from packet traces. Our initial results are promising: NTT seems able to generalize to new prediction tasks and contexts. This study suggests there is still hope for generalization, though it calls for a lot of future research.
2020
- Machine learning based cell association for mMTC 5G communication networks. Siddhant Ray and Budhaditya Bhattacharyya. International Journal of Mobile Network Design and Innovation, 2020.
With the advent of 5G communication networks, the number of devices on the core 5G network significantly increases. A 5G network is a cloud native, massively connected IoT platform with a huge number of devices hosted on the network as compared to prior generation networks. Previously known as Machine Type Communication (MTC), this mode of communication is now known as massive Machine Type Communication (mMTC) and plays a pivotal role in the new network scenario with a larger pool of devices. As ultra-low latency is the key metric in developing 5G communication, a proper cell association scheme is now required to meet the load and traffic needs of the new network, as compared to the earlier cell association schemes which were based only on the Reference Signal Received Power (RSRP). The eNodeB with the highest RSRP may not always be optimal for cell association to provide the lowest latency. This paper proposes an unsupervised machine learning algorithm, namely Hidden Markov Model (HMM) learning on the network’s telemetry data, which is used to learn network parameters and select the best eNodeB for cell association, with the objective of achieving ultra-low latency. The proposed model uses HMM learning followed by decoding to select the optimal cell for association.
posters
- Transformer-based Predictions for Sudden Network Changes (Poster). Siddhant Ray, Xi Jiang, Zhuohan Gu, Junchen Jiang, and Nick Feamster. In 21st USENIX Symposium on Networked Systems Design and Implementation, 2024.
Accurate predictions on sudden changes in network states are crucial for the integrity of real-time applications. Traditional heuristic models fall short, especially in tail cases, struggling to capture long-term network dependencies. Modelling network traces as time series sequences, we explore the use of a Transformer model architecture, known for its success in time series prediction, to model network trace dependencies, focusing on sudden change predictions. Our preliminary result on using a Transformer model for predicting one-way delay (OWD) shows observable improvement over the heuristic baseline in prediction loss. This suggests a promising direction for enhancing network predictability and optimizing resource utilization.