Parrot is a system designed to optimize the end-to-end performance of LLM-based applications (e.g., AI agents and copilots).
The key innovation is the semantic variable, a unified abstraction allowing LLM services to understand the structure and dependencies of LLM requests.
Current LLM-Based Applications:
To accomplish a task, LLM-based applications typically require multiple rounds of conversation, implemented through multiple API calls to the LLM and forming complex workflow patterns.
Workflow of popular LLM-based applications:
As pictures (a) and (b) show, a meeting-summary application divides a lengthy document into multiple shorter sections to satisfy the length constraint of the LLM conversation, then combines the partial summaries into the final summary through a Map-Reduce or chained-summary pattern.
Chat-based applications, e.g., Bing Copilot, call LLM APIs multiple times to generate answers based on user queries.
In multi-agent coding, each agent represents a different role played by different LLM calls, collaborating to achieve a task.
What is the problem here?
The existing API design of these applications is request-centric.
Public LLM services observe only a flood of individual requests, without any application-level information. These services blindly optimize the performance of each request in isolation, leading to suboptimal end-to-end performance for the application.
Excessive Overhead of Consecutive Requests: Multiple dependent requests incur network latency and queuing delays because the client must wait for the response of one request before issuing the next.
Misaligned Scheduling Objectives: Public LLM services optimize for individual request latency, which may not align with the end-to-end performance goals of applications.
Redundant Computations: Long system prompts (e.g., task definitions, safety rules) are repeated across requests, wasting storage, computation, and memory bandwidth.
Inefficient Cluster Scheduling for Mixed Workloads: LLM services handle diverse workloads (e.g., chatbots vs. document processing). Mixing low-latency and batch workloads on the same GPU leads to resource contention.
What do we need to do then?
Find the correlations among multiple LLM requests by exploiting application-level information.
PARROT DESIGN:
Parrot is implemented in Python with approximately 14,000 lines of code.
Parrot provides a natural way of programming LLM applications with Semantic Variable annotations, compatible with existing LLM orchestration frameworks like LangChain.
The front end handles Semantic Variables and Semantic Functions, translating them into Parrot's API requests using FastAPI.
The centralized Parrot Manager is responsible for request management at the cluster level, deriving application-level knowledge and optimizing the end-to-end performance of the application.
The manager schedules LLM requests to an LLM Engine: a GPU server (or a group of servers) in the cluster that can serve LLM requests independently.
The LLM engine is optimized with kernels from:
vLLM (for efficient token generation)
xFormers (for transformer-based models)
Custom GPU kernels developed by the authors.
Parrot treats an LLM request as a Semantic Function:
import Parrot as P
from Parrot.PerformanceCriteria import LATENCY

@P.SemanticFunction
def WritePythonCode(task: P.SemanticVariable):
    """ You are an expert software engineer.
    Write python code of {{input:task}}.
    Code: {{output:code}}
    """

@P.SemanticFunction
def WriteTestCode(
        task: P.SemanticVariable,
        code: P.SemanticVariable):
    """ You are an experienced QA engineer.
    You write test code for {{input:task}}.
    Code: {{input:code}}.
    Your test code: {{output:test}}
    """

def WriteSnakeGame():
    task = P.SemanticVariable("a snake game")
    code = WritePythonCode(task)
    test = WriteTestCode(task, code)
    return code.get(perf=LATENCY), test.get(perf=LATENCY)
It contains two SemanticFunctions, one for the software engineer to write code and one for the QA engineer to write test code.
Three Semantic Variables: task, code, and test, for task description, the code to be developed by the software engineer, and the test code to be developed by the QA engineer, respectively.
Existing LLM orchestration frameworks also allow placeholders in a prompt; however, the placeholders are rendered with real data before submission, so public LLM services cannot detect such structure.
Instead, Parrot relies on Semantic Variables to preserve the prompt structure for further inter-request analysis on the public LLM services side.
In addition to the semantic functions, LLM application developers can further define orchestration functions that connect multiple semantic functions.
The code variable connects the two LLM requests originating from WritePythonCode and WriteTestCode, showing their sequential dependency.
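The dependency expressed by the code variable can be sketched with plain Python futures. This is a simplified stand-in, not Parrot's actual implementation: both requests are handed over immediately, and the shared variable carries the result from producer to consumer.

```python
# Minimal sketch (illustrative names, not Parrot's real internals): a
# Semantic Variable modeled as a future, so a dependent request can be
# submitted before its producer has finished.
from concurrent.futures import ThreadPoolExecutor, Future

class SemanticVariable:
    def __init__(self, value=None):
        self.future = Future()
        if value is not None:
            self.future.set_result(value)

    def get(self):
        # Blocks until the producing request fills the variable.
        return self.future.result()

def fake_llm(prompt):
    # Stand-in for a real LLM call.
    return f"<llm output for: {prompt}>"

def submit(pool, template, inputs, output):
    # Submit a "request"; it waits on its input variables, then fills
    # its output variable.
    def run():
        rendered = template.format(*[v.get() for v in inputs])
        output.future.set_result(fake_llm(rendered))
    pool.submit(run)

pool = ThreadPoolExecutor()
task = SemanticVariable("a snake game")
code, test = SemanticVariable(), SemanticVariable()

# Both requests are submitted at once; the `code` variable links them.
submit(pool, "Write python code of {0}.", [task], code)
submit(pool, "Write test code for {0}. Code: {1}.", [task, code], test)
result = test.get()
```

Because both requests are in flight on the service side, the client pays one round trip instead of two, which is the point of exposing the dependency.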
DAG-Based Inter-Request Analysis in PARROT:
Parrot builds a DAG (Directed Acyclic Graph) where:
Nodes represent LLM requests or Semantic Variables (task, code, test).
Edges show dependencies (e.g., one request's output is another request's input).
Key Primitives (Functions)
GetProducer(Variable) → finds which request generates a Semantic Variable.
GetConsumers(Variable) → identifies all requests dependent on a Semantic Variable.
GetPerfObj(Request) → retrieves the performance objective (latency vs. throughput).
E.g.
If a multi-agent system has:
WritePythonCode(task) → code
WriteTestCode(task, code) → test
Parrot builds a DAG like this:
task → WritePythonCode → code → WriteTestCode → test
The code variable connects two requests, so Parrot optimizes their scheduling together.
Instead of waiting for each step sequentially, Parrot detects dependencies and processes them efficiently.
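The three primitives above can be sketched over a plain adjacency-list DAG. The dictionaries and function names below are illustrative data structures, not Parrot's internals:

```python
# Sketch of the DAG primitives (illustrative, not Parrot's actual code).
# Maps: variable -> producing request, variable -> consuming requests,
# request -> its declared performance objective.
producers = {"code": "WritePythonCode", "test": "WriteTestCode"}
consumers = {
    "task": ["WritePythonCode", "WriteTestCode"],
    "code": ["WriteTestCode"],
}
perf_objs = {"WritePythonCode": "latency", "WriteTestCode": "latency"}

def GetProducer(var):
    # Which request fills this Semantic Variable?
    return producers.get(var)

def GetConsumers(var):
    # Which requests read this Semantic Variable?
    return consumers.get(var, [])

def GetPerfObj(request):
    # Latency- vs. throughput-oriented objective for a request.
    return perf_objs[request]

# `code` is produced by WritePythonCode and consumed by WriteTestCode,
# so the service knows the two requests must run in that order.
producer_of_code = GetProducer("code")
consumers_of_code = GetConsumers("code")
```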
Prompt Structure-Based Inter-Request Analysis in PARROT:
Parrot analyzes prompt structures to identify reusable components.
It stores a hash of repeated sections (using PrefixHash()) and reuses cached results.
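A minimal sketch of this idea follows. The real PrefixHash operates on tokenized prompts inside the service; here the hash boundary and cache representation are simplified assumptions:

```python
# Sketch of prefix-hash based reuse (assumed behavior, simplified).
# Two prompts that share a system-prompt prefix hash to the same key,
# so the KV state for the prefix is computed once and reused.
import hashlib

cache = {}  # prefix hash -> cached state for that prefix

def prefix_hash(prompt, boundary):
    """Hash the static prefix of a prompt up to a structural boundary."""
    return hashlib.sha256(prompt[:boundary].encode()).hexdigest()

system = "You are an expert software engineer. "
states = []
for task in ("a snake game", "a web crawler"):
    h = prefix_hash(system + task, len(system))
    if h in cache:
        state = cache[h]          # reuse the cached prefix state
    else:
        state = f"KV({system!r})" # compute once, then cache
        cache[h] = state
    states.append(state)
```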
Optimizations with Semantic Variable:
Serving Dependent Requests:
Parrot constructs a Directed Acyclic Graph (DAG) to represent task dependencies.
Performance Objective Deduction:
Analyze the DAG structure and Semantic Variables to infer priorities (e.g., deadlines for real-time tasks).
Allocates GPU resources and batch sizes accordingly.
Sharing Prompt Prefix:
Uses PrefixHash to detect identical prefixes (e.g., repetitive system prompts)
Caches intermediate attention key-value (KV) matrices for shared prefixes.
Dynamic Batching with DAG Awareness:
Groups independent tasks into batches while respecting dependencies.
Semantic Variables reveal which tasks can be parallelized (e.g., branches in a workflow).
Parrot batches independent tasks to maximize GPU utilization.
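The batching idea above can be sketched as grouping requests whose dependencies are already satisfied into successive "levels" of the DAG. The dependency map and request names here are hypothetical examples:

```python
# Sketch: DAG-aware batching (illustrative). Requests whose inputs are
# all ready form one batch; each batch can run in parallel on the engine.
deps = {  # request -> requests it depends on (a Map-Reduce summary)
    "summarize_ch1": [],
    "summarize_ch2": [],
    "summarize_ch3": [],
    "combine": ["summarize_ch1", "summarize_ch2", "summarize_ch3"],
}

def batches(deps):
    done, out = set(), []
    while len(done) < len(deps):
        ready = [r for r, ds in deps.items()
                 if r not in done and all(d in done for d in ds)]
        out.append(ready)
        done.update(ready)
    return out

plan = batches(deps)
# The three chapter summaries are independent, so they form one batch;
# the combine step waits for all of them and forms a second batch.
```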
PARROT's Scheduling:
Two scheduling principles:
group LLM requests with similar performance requirements together to avoid conflicting objectives,
maximize opportunities for sharing across requests.
With the extracted DAG, the system arranges the LLM requests according to their topological order.
Scheduling Algorithm:
Data: Q: the request queue
Q.sort() ;  /* Topological order */
for r ∈ Q do
    SharedReqsInQueue, CtxInEngine = FindSharedPrefix(r);
    if r.TaskGroup ≠ ∅ then
        r′ = FindEngine(r.TaskGroup);
    else if SharedReqsInQueue ≠ ∅ then
        r′ = FindEngine(SharedReqsInQueue);
    else if CtxInEngine ≠ ∅ then
        r′ = FindEngine(r, filter=CtxInEngine);
    if r′ = ∅ then
        r′ = FindEngine(r);
    Q.remove(r′);
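Rendered as runnable Python, the loop might look like the following. FindSharedPrefix and FindEngine are stubbed out (the notes only specify their roles), and FindEngine is treated here as returning the set of requests it managed to place, which is an adaptation for illustration:

```python
# Python rendering of the scheduling loop (stubs are illustrative).
def schedule(Q, find_shared_prefix, find_engine):
    Q.sort(key=lambda r: r.topo_order)   # topological order
    while Q:                             # safe equivalent of the for-loop
        r = Q[0]
        shared_in_queue, ctx_in_engine = find_shared_prefix(r)
        placed = None
        if r.task_group:                 # co-schedule the whole task group
            placed = find_engine(r.task_group)
        elif shared_in_queue:            # co-schedule prefix-sharing requests
            placed = find_engine(shared_in_queue)
        elif ctx_in_engine:              # prefer engines holding the prefix
            placed = find_engine([r], filter=ctx_in_engine)
        if placed is None:               # fall back: place r anywhere
            placed = find_engine([r])
        for req in placed:               # drop scheduled requests from Q
            Q.remove(req)

class Req:
    def __init__(self, topo_order):
        self.topo_order = topo_order
        self.task_group = None

def stub_find_shared_prefix(r):
    return [], []

def stub_find_engine(group, filter=None):
    return list(group)  # pretend every request finds an engine

queue = [Req(1), Req(0)]
schedule(queue, stub_find_shared_prefix, stub_find_engine)
```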
PARROT Implementation:
Front-end API:
Submit Request:
{
    "prompt": "Write a Python script for sorting a list.",
    "placeholders": [
        {
            "name": "task",
            "in_out": true,
            "semantic_var_id": "sv1",
            "transforms": "none"
        }
    ],
    "session_id": "1234"
}
Instead of directly sending a static prompt, Parrot submits a request with placeholders (Semantic Variables). This allows the LLM service to retain structural information, enabling optimizations like batching, caching, and dependency tracking.
Get Request:
{
    "semantic_var_id": "sv1",
    "criteria": "latency",
    "session_id": "1234"
}
Instead of waiting for each request to complete sequentially, Parrot fetches results asynchronously. This allows LLM requests to be scheduled more efficiently without blocking execution.
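The submit-then-fetch flow can be sketched with a mock service. The class and method names below are stand-ins for illustration, not Parrot's documented API; only the payload fields come from the request above:

```python
# Sketch of the asynchronous client flow (the service class is a mock;
# a real deployment would be HTTP calls against the Parrot front end).
import json
import threading
import time

class FakeParrotService:
    """Stand-in service: fills a semantic variable in the background."""
    def __init__(self):
        self.results = {}

    def submit(self, payload):
        req = json.loads(payload)
        sv = req["placeholders"][0]["semantic_var_id"]
        # Pretend the LLM resolves the variable asynchronously.
        threading.Timer(
            0.01, self.results.__setitem__, (sv, "xs.sort()")
        ).start()

    def get(self, sv_id):
        while sv_id not in self.results:  # a real client would await/poll
            time.sleep(0.001)
        return self.results[sv_id]

svc = FakeParrotService()
svc.submit(json.dumps({
    "prompt": "Write a Python script for sorting a list.",
    "placeholders": [{"name": "task", "in_out": True,
                      "semantic_var_id": "sv1", "transforms": "none"}],
    "session_id": "1234",
}))
# The client is free to submit further requests here before fetching.
result = svc.get("sv1")
```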
Parrot's Kernel Optimization Approach
To speed up LLM inference, Parrot implements a novel GPU kernel using OpenAI Triton and CUDA. The improvements include:
Integration of PagedAttention and FlashAttention:
PagedAttention: Stores key-value (KV) cache in separate memory segments and maintains a page table per request.
FlashAttention: Maximizes data reuse within shared memory to minimize unnecessary memory transfers.
Efficient KV Cache Loading:
Unlike traditional PagedAttention, which reloads KV cache tiles multiple times, Parrot's kernel loads them only once into shared memory.
Reduces memory transactions between L2 Cache and Shared Memory, making attention computation faster.
Optimized Attention Score Computation:
The kernel first computes attention scores for shared prefix tokens and writes them back to high-bandwidth memory (HBM).
Then, it processes new tokens separately and merges their attention scores with the stored prefix results.
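The merge step relies on the standard FlashAttention-style trick: attention over the prefix chunk and over the new-token chunk is computed separately, each with its own running max and softmax denominator, and the partials are then combined exactly. A scalar sketch (illustrative scores and values, pure Python for clarity):

```python
# Sketch: merging separately computed attention partials so that
# prefix tokens and new tokens can be processed in two passes.
import math

def attn_partial(scores, values):
    """One pass over a KV chunk: return (max, denominator, weighted sum)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    out = sum(e * v for e, v in zip(exps, values))
    return m, denom, out

def merge(p1, p2):
    """Combine two partials as if their chunks were one sequence."""
    m = max(p1[0], p2[0])
    d = p1[1] * math.exp(p1[0] - m) + p2[1] * math.exp(p2[0] - m)
    o = p1[2] * math.exp(p1[0] - m) + p2[2] * math.exp(p2[0] - m)
    return o / d

prefix_scores, prefix_vals = [0.9, 0.2], [1.0, 2.0]  # shared-prefix chunk
new_scores, new_vals = [0.5], [3.0]                  # new-token chunk

merged = merge(attn_partial(prefix_scores, prefix_vals),
               attn_partial(new_scores, new_vals))

# Reference: softmax over the full concatenated sequence.
m, d, o = attn_partial(prefix_scores + new_scores, prefix_vals + new_vals)
reference = o / d
```

Because the merge is exact, the kernel can reuse the prefix partials across every request sharing that prefix without changing the attention result.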
Universal Engine Abstraction in Parrot
Parrot introduces a universal abstraction for LLM engines: any engine that implements Parrot's core APIs (Fill, Generate, and FreeContext) can serve requests under the Parrot manager.
Evaluations of the PARROT approach:
Parrot reduced end-to-end latency by up to 2.37× compared to baseline LLM services.
Serving multiple GPT applications:
Parrot achieved up to 12× higher request rates by detecting and caching repeated system prompts. Parrot stored and reused common text instead of reprocessing the same instructions for every query.
Multi-agent Applications - using MetaGPT within Parrot:
For multi-agent workflows, Parrot again achieved up to 12× higher request rates by caching repeated system prompts, combined with efficient dependency tracking and optimized scheduling.
Parrot reduced latency for chat applications by 5.5× while improving throughput for batch jobs by 3.7×. Smart GPU scheduling prevented chatbots from slowing down due to large document-processing jobs.
Paper https://www.usenix.org/conference/osdi24/presentation/lin-chaofan
Core API Methods defined by PARROT in the LLM Engine:
1. Fill(token_ids, context_id, parent_context_id): fills prompt tokens into a context (the prefill phase); a parent context lets a request reuse an already-computed shared prefix.
2. Generate(sampling_configs, context_id, parent_context_id): generates tokens within a context according to the given sampling configuration (the decoding phase).
3. FreeContext(context_id): frees a context and the memory holding its KV cache.
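How a manager might drive these three calls can be sketched with a mock engine; the context-as-token-list representation below is a stand-in (the real engine stores per-context KV cache on the GPU):

```python
# Sketch: driving the engine API with forked contexts (mock engine).
class MockEngine:
    def __init__(self):
        self.ctx = {}  # context_id -> accumulated tokens (stands in for KV cache)

    def Fill(self, token_ids, context_id, parent_context_id=None):
        # A child context starts from its parent's state, so a shared
        # prefix is filled only once.
        base = list(self.ctx.get(parent_context_id, []))
        self.ctx[context_id] = base + token_ids

    def Generate(self, sampling_configs, context_id, parent_context_id=None):
        # Pretend 42 is the sampled token for this decoding step.
        self.ctx[context_id].append(42)
        return [42]

    def FreeContext(self, context_id):
        # Release the context's memory once its requests are done.
        del self.ctx[context_id]

eng = MockEngine()
eng.Fill([1, 2, 3], context_id="sys")                      # shared system prompt
eng.Fill([7], context_id="q1", parent_context_id="sys")    # two queries fork it
eng.Fill([8], context_id="q2", parent_context_id="sys")
eng.Generate({"temperature": 0.0}, context_id="q1")
eng.FreeContext("q1")
```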