Paper Reading - PARROT: Efficient Serving of LLM-based Applications with Semantic Variable
Feb 18, 2025
Paper: https://www.usenix.org/conference/osdi24/presentation/lin-chaofan
Parrot is a system designed to optimize the end-to-end performance of LLM-based applications (e.g., AI agents and copilots).
The key innovation is the semantic variable, a unified abstraction allowing LLM services to understand the structure and dependencies of LLM requests.
Current LLM-Based Applications:
To accomplish a task, LLM-based applications typically require multiple rounds of conversation, implemented as a series of API calls to the LLM, which produces complex workflow patterns.
Workflow of popular LLM-based applications:
In panels (a) and (b) of the paper's workflow figure, a meeting-summary application divides a lengthy document into shorter sections to satisfy the LLM's context-length constraint, then combines the partial summaries into a final summary using the Map-Reduce or chain-summary pattern.
Chat-based applications, e.g., Bing Copilot, call LLM APIs multiple times to generate answers based on user queries.
In multi-agent coding, each agent is a distinct role played by separate LLM calls, and the agents collaborate to accomplish a task.
What is the problem here?
Today's public LLM services receive individual requests stripped of application-level information: the service cannot tell that requests belong to the same application, depend on one another's outputs, or share large prompt prefixes, so it optimizes per-request latency rather than the application's end-to-end performance.
What do we need to do then?
Expose that application-level structure (request dependencies and prompt commonality) to the LLM service, so correlated requests can be scheduled and executed jointly with end-to-end objectives in mind.
PARROT DESIGN:
Parrot is implemented in Python with approximately 14,000 lines of code.
Parrot provides a natural way of programming LLM applications with Semantic Variable annotations, compatible with existing LLM orchestration frameworks like LangChain.
The front end handles Semantic Variables and Semantic Functions, translating them into Parrot’s API requests using FastAPI.
The centralized Parrot manager is responsible for request management at the cluster level: it derives application-level knowledge from Semantic Variables and optimizes the end-to-end performance of the application.
The manager schedules LLM requests onto LLM engines; each engine is a GPU server (or a group of servers) in the cluster that can serve LLM requests independently.
The LLM engine is optimized with attention kernels adapted from existing systems (vLLM's PagedAttention and xFormers), together with Parrot's custom Triton/CUDA kernel described later in these notes.
Parrot treats an LLM request as a semantic function: a parameterized prompt template whose inputs and outputs are Semantic Variables, so calling the function submits a request whose structure remains visible to the service.
DAG-Based Inter-Request Analysis in PARROT:
Parrot builds a DAG (Directed Acyclic Graph) where nodes are LLM requests (decomposed into Fill and Generate primitives) and edges are data dependencies carried by Semantic Variables: a request that consumes a variable depends on the request that produces it.
Key Primitives (Functions): Fill, which feeds a chunk of prompt tokens into a context, and Generate, which produces output tokens from the model; in the DAG, each request appears as a chain of Fill nodes ending in a Generate node.
E.g., if a multi-agent system has a WritePythonCode agent that produces code from a task description, and a WriteTestCode agent that consumes both the task and the generated code, Parrot builds a DAG like this (a code sketch of these semantic functions follows below):
task → WritePythonCode → code → WriteTestCode → test
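In Parrot's front end, this pipeline is written as semantic functions whose bodies are prompt templates. The sketch below paraphrases the paper's example; the decorator and method names are approximations of Parrot's API rather than verbatim code:

```python
import parrot as P  # hypothetical import; the real package/API may differ

@P.semantic_function
def WritePythonCode(task: P.Input, code: P.Output):
    """You are an expert software engineer.
    Write python code of {{task}}.
    Code: {{code}}"""

@P.semantic_function
def WriteTestCode(task: P.Input, code: P.Input, test: P.Output):
    """You are an experienced QA engineer.
    You write test code for {{task}}.
    Code: {{code}}
    Your test code: {{test}}"""

def WriteSnakeGame():
    task = "a snake game"
    code = WritePythonCode(task)      # returns a future-like Semantic Variable
    test = WriteTestCode(task, code)  # the dependency on `code` stays visible
    return code.get(), test.get()     # get() blocks until the value is produced
```

Because `code` is passed around as a variable rather than rendered text, the service sees the edge WritePythonCode → WriteTestCode and can pipeline the two requests.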
Prompt Structure-Based Inter-Request Analysis in PARROT:
Beyond dependencies, Parrot also analyzes prompt structure: requests instantiated from the same semantic function share the same template tokens (a common prompt prefix), which the service can detect and serve from a single KV-cache copy, as sketched below.
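To make the shared-prefix intuition concrete, here is a minimal, self-contained illustration (my own toy code, not Parrot's): two requests instantiated from the same semantic function differ only after the template tokens, so one KV-cache copy can serve the common part.

```python
def longest_common_prefix(a: list[int], b: list[int]) -> int:
    """Length of the shared token prefix between two tokenized prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Two requests from the same template: only the {{task}} tokens differ.
req_a = [101, 7, 8, 9, 42, 55]  # "... Write python code of <task A>"
req_b = [101, 7, 8, 9, 42, 77]  # "... Write python code of <task B>"
shared = longest_common_prefix(req_a, req_b)
print(f"{shared} prefix tokens can reuse a single KV-cache copy")  # -> 5
```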
Optimizations with Semantic Variable:
Exposing Semantic Variables enables several optimizations: intermediate results can flow between dependent requests inside the service instead of round-tripping through the client, dependent requests can be pipelined as soon as their inputs are ready, and common prompt prefixes can be computed and cached once.
PARROT's Scheduling:
Two scheduling principles: (1) optimize the application-level objective derived from the DAG (e.g., end-to-end latency) rather than the latency of each individual request; (2) co-locate requests that share a context or belong to the same application on the same engine, to maximize KV-cache reuse and avoid cross-engine communication.
With the extracted DAG, the system dispatches LLM requests in topological order, so a request is scheduled only after the requests that produce its inputs.
Scheduling Algorithm:
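The paper's full algorithm also weighs application-level latency and throughput requirements; the sketch below keeps only the two core ingredients under my simplifying assumptions (all names are illustrative): dispatch in topological order, and prefer an engine that already holds the request's context.

```python
from collections import deque

def topo_order(requests, deps):
    """Kahn's algorithm; deps maps a request to the requests it waits on."""
    indeg = {r: len(deps.get(r, ())) for r in requests}
    users = {r: [] for r in requests}
    for r, ds in deps.items():
        for d in ds:
            users[d].append(r)
    ready = deque(r for r in requests if indeg[r] == 0)
    order = []
    while ready:
        r = ready.popleft()
        order.append(r)
        for u in users[r]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return order

def schedule(requests, deps, engines, context_of):
    """Assign requests to engines, preferring engines that already
    cache the request's context so shared prefixes are not recomputed."""
    plan = {}
    for r in topo_order(requests, deps):
        ctx = context_of[r]
        candidates = [e for e in engines if ctx in e["contexts"]] or engines
        eng = min(candidates, key=lambda e: e["load"])
        eng["contexts"].add(ctx)
        eng["load"] += 1
        plan[r] = eng["name"]
    return plan

engines = [{"name": "engine-0", "contexts": set(), "load": 0},
           {"name": "engine-1", "contexts": set(), "load": 0}]
deps = {"write_test": {"write_code"}}
ctx = {"write_code": "app-1", "write_test": "app-1"}
print(schedule(["write_code", "write_test"], deps, engines, ctx))
# -> both requests land on engine-0, which holds the shared context
```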
PARROT Implementation:
Front-end API:
Submit Request:
Instead of directly sending a static prompt, Parrot submits a request with placeholders (Semantic Variables). This allows the LLM service to retain structural information, enabling optimizations like batching, caching, and dependency tracking.
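As a rough illustration of the idea, a submitted request could carry the template plus placeholder descriptors instead of a fully rendered prompt. The field names below are my assumption, not Parrot's actual wire format:

```python
import json

request = {
    "prompt": ("You are an expert software engineer. "
               "Write python code of {{task}}. Code: {{code}}"),
    "placeholders": [
        {"name": "task", "direction": "input"},
        {"name": "code", "direction": "output",
         "sampling": {"temperature": 0.7}},
    ],
    "app_id": "snake-game-1",  # lets the service group requests per application
}
print(json.dumps(request, indent=2))
```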
Get Request:
Instead of waiting for each request to complete sequentially, Parrot fetches results asynchronously. This allows LLM requests to be scheduled more efficiently without blocking execution.
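The resulting pattern is ordinary asynchronous fan-out. A minimal asyncio sketch, where `get_result` is a hypothetical stand-in for Parrot's get on a Semantic Variable:

```python
import asyncio

async def get_result(var_name: str) -> str:
    """Stand-in for an async 'get' on a Semantic Variable: it resolves
    once the producing request has finished on the service side."""
    await asyncio.sleep(0.1)  # placeholder for awaiting the LLM service
    return f"<value of {var_name}>"

async def main():
    # Fetch two outputs concurrently instead of blocking on each in turn.
    code, test = await asyncio.gather(get_result("code"), get_result("test"))
    print(code, test)

asyncio.run(main())
```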
Parrot’s Kernel Optimization Approach
To speed up LLM inference, Parrot implements a novel GPU kernel using OpenAI Triton and CUDA. The improvements include:
Integration of PagedAttention and FlashAttention: the kernel combines PagedAttention's paged KV-cache management (so a shared prompt is stored once) with FlashAttention-style fused, tiled computation.
Efficient KV Cache Loading: the KV cache of a shared prompt prefix is loaded from GPU memory once per batch rather than once per request.
Optimized Attention Score Computation: attention over the shared prefix and over each request's own suffix is computed separately, and the partial results are then merged, avoiding redundant computation and memory traffic (see the sketch below).
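The split works because softmax attention over a concatenated KV cache equals a log-sum-exp-weighted merge of the partial attentions over each segment, so the shared-prefix part needs to be computed only once per batch. A NumPy sketch of the merge (my illustration of the general technique, not Parrot's kernel):

```python
import numpy as np

def partial_attn(q, K, V):
    """Attention over one KV segment, plus the log-sum-exp of its
    scores so partial results can be merged exactly later."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    w = np.exp(s - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())

def merged_attn(q, K_pre, V_pre, K_suf, V_suf):
    """Merge attention over a shared prefix and a per-request suffix."""
    o1, l1 = partial_attn(q, K_pre, V_pre)
    o2, l2 = partial_attn(q, K_suf, V_suf)
    a = np.exp(l1 - np.logaddexp(l1, l2))  # weight of the prefix part
    return a * o1 + (1 - a) * o2

rng = np.random.default_rng(0)
d, n_pre, n_suf = 8, 16, 4
q = rng.normal(size=d)
K_pre, V_pre = rng.normal(size=(n_pre, d)), rng.normal(size=(n_pre, d))
K_suf, V_suf = rng.normal(size=(n_suf, d)), rng.normal(size=(n_suf, d))

# Reference: one attention pass over the concatenated KV cache.
ref, _ = partial_attn(q, np.vstack([K_pre, K_suf]), np.vstack([V_pre, V_suf]))
assert np.allclose(ref, merged_attn(q, K_pre, V_pre, K_suf, V_suf))
```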
Universal Engine Abstraction in Parrot
Core API methods defined by Parrot for the LLM engine:
1. Fill(token_ids, context_id, parent_context_id): appends prompt tokens to the given context (the prefill phase), extending its KV cache; passing a parent_context_id forks from an existing context so a shared prefix is reused rather than recomputed.
2. Generate(sampling_configs, context_id, parent_context_id): autoregressively generates tokens within the context under the given sampling configuration (e.g., temperature, top-p) until a stop condition is met.
3. FreeContext(context_id): releases the context and the GPU memory held by its KV cache.
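Put together, the engine contract could look like the Python sketch below. The method names follow the paper's API list above; the class scaffolding and type hints are my assumptions:

```python
from abc import ABC, abstractmethod

class LLMEngine(ABC):
    """Universal engine abstraction: any backend capable of serving
    LLM requests can plug into the Parrot manager by implementing
    these three context-aware primitives."""

    @abstractmethod
    def fill(self, token_ids: list[int], context_id: int,
             parent_context_id: int) -> None:
        """Append prompt tokens to a context (prefill); forking from
        parent_context_id reuses a shared prefix's KV cache."""

    @abstractmethod
    def generate(self, sampling_configs: dict, context_id: int,
                 parent_context_id: int) -> list[int]:
        """Autoregressively generate tokens within the given context."""

    @abstractmethod
    def free_context(self, context_id: int) -> None:
        """Release the context and the KV-cache memory it holds."""
```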
Evaluations of the PARROT approach:
The paper evaluates Parrot on representative workloads (chain- and map-reduce-style summarization, multi-agent coding, and mixed serving of chat and batch applications) and reports end-to-end speedups of up to 11.7× over baselines that optimize each request in isolation.