Core Argument
GPU kernel optimization has long been regarded as a domain exclusive to human experts. However, Cursor’s multi-agent system autonomously solved 235 CUDA kernel optimization problems in just three weeks, achieving a 38% geometric mean speedup. This validates a key insight: the core value of a multi-agent architecture lies not in the number of agents but in decomposing a complex task into decoupled pieces that specialized agents can execute. This article analyzes the case from an engineering perspective and explores best practices for multi-agent orchestration in vertical-domain optimization.
Background: Why Kernel Optimization is the Litmus Test for Multi-Agent Systems
Kernel optimization is about as close to “extreme engineering” as GPU programming gets. It requires a deep understanding of hardware architecture (register allocation, instruction pipelining, memory hierarchy), precise modeling of mathematical operations, and efficient search within a vast solution space. For a long time, this field has relied on the accumulated experience of a small number of top kernel engineers.
Cursor chose this field as a test for its multi-agent system for a straightforward reason:
“One of the best ways to evaluate long-running, multi-agent systems is to give them open-ended optimization problems where even we don’t know the right answer.” — Cursor Engineering Blog
Unlike simple code completion or text generation, kernel optimization provides quantifiable objectives (latency, throughput, SOL scores), allowing the system to iteratively optimize without relying on human judgment of intermediate results.
System Architecture: Three-Layer Decoupled Collaborative Design
Planner-Agent: Global Coordination and Task Distribution
At the core of the entire system is a Planner Agent, responsible for:
- Distributing 235 optimization problems to Worker Agents
- Monitoring performance metrics and dynamically rebalancing task assignments
- Deciding when to accept the current solution and when to continue iterating
The official description states:
“The multi-agent system solved all 235 GPU kernel optimization problems in a single run by deploying a planner agent that distributed and rebalanced work across autonomous workers based on performance metrics.”
This Planner does not execute specific optimization work but acts as the “task scheduling hub”—a variant of the Supervisor Pattern commonly seen in orchestration paradigms.
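The scheduling role described above can be sketched in a few lines. This is a minimal illustration, not Cursor's implementation: the source only says that work was rebalanced "based on performance metrics", so the `Problem` class, the `pick_next` policy (lowest SOL score first), and the `accept_threshold` value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    best_sol: float = 0.0  # best SOL score seen so far (0.0 = no valid result yet)
    attempts: int = 0      # number of benchmarked candidates so far


def pick_next(problems, accept_threshold=0.9):
    """Hypothetical policy: hand the next free worker the weakest open problem.

    Problems at or above accept_threshold are treated as accepted and skipped.
    Ties on score go to the problem with fewer attempts.
    """
    open_problems = [p for p in problems if p.best_sol < accept_threshold]
    if not open_problems:
        return None
    return min(open_problems, key=lambda p: (p.best_sol, p.attempts))


def record_result(problem, sol_score):
    """Fold one benchmark result back into the planner's global state."""
    problem.attempts += 1
    problem.best_sol = max(problem.best_sol, sol_score)


problems = [
    Problem("gqa_attention", best_sol=0.97),
    Problem("bf16_gemm", best_sol=0.60),
    Problem("nvfp4_moe", best_sol=0.55),
]
print(pick_next(problems).name)  # nvfp4_moe
```

The planner never touches kernel code here; it only reads scores and decides where the next unit of effort goes, which is the separation of concerns the Supervisor Pattern is meant to provide.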
Worker-Agent: Domain Specialization and Self-Driven Optimization
Each Worker Agent independently runs the following loop:
- Read the problem description and baseline implementation
- Analyze the kernel’s computational patterns and hardware bottlenecks
- Generate an optimized version (CUDA C++ or CuTe DSL)
- Call the Benchmark Pipeline for performance feedback
- Iterate on the kernel based on that feedback
The key is the automated invocation of the Benchmark Pipeline:
“The multi-agent system independently learned to call the benchmarking pipeline during its runs, creating a loop where the system continuously tested, debugged, and optimized kernels without any developer intervention.”
This means that Worker Agents have a complete self-driven optimization loop—they do not rely on humans to tell them where to optimize but autonomously discover issues and iterate.
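As a rough sketch of such a self-driven loop, the function below keeps proposing candidates, benchmarks each one, and remembers the fastest valid result. The names `optimize_loop`, `propose`, and `benchmark` are hypothetical stand-ins: in the real system the proposal step is an LLM agent and the benchmark is the SOL-ExecBench pipeline, neither of which is reproduced here.

```python
def optimize_loop(baseline_latency_us, propose, benchmark, max_iters=10):
    """Propose, benchmark, and keep the fastest valid candidate.

    propose(history)     -> a new candidate, given all earlier attempts
    benchmark(candidate) -> (ok, latency_us); ok is False for failed runs
    """
    best = (None, baseline_latency_us)  # (candidate, latency); None = baseline
    history = []
    for _ in range(max_iters):
        candidate = propose(history)
        ok, latency_us = benchmark(candidate)
        # Failed runs stay in the history so the proposer can learn from them.
        history.append((candidate, ok, latency_us))
        if ok and latency_us < best[1]:
            best = (candidate, latency_us)
    return best, history


# Toy stand-ins so the loop can be run end to end.
def fake_propose(history):
    return len(history)  # candidate id 0, 1, 2, ...


def fake_benchmark(candidate):
    latencies = [120.0, None, 90.0, 95.0]  # None models a failed build or run
    latency = latencies[candidate]
    return (latency is not None, latency)


best, history = optimize_loop(100.0, fake_propose, fake_benchmark, max_iters=4)
print(best)  # (2, 90.0)
```

The important design point is that the feedback signal is produced by a measurement, not by a human reviewer, so the loop can run unattended.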
Benchmark Pipeline: Objective Performance Judging
The system’s “judge” is the SOL-ExecBench benchmarking framework provided by NVIDIA:
- Generates 235 real optimization problems from 124 production open-source models
- Executes on 27 NVIDIA B200 (Blackwell) GPUs
- SOL (Speed-of-Light) scores measure the quality of solutions: 0.5 = baseline, 1.0 = hardware limits
- Built-in anti-cheating mechanisms (cache detection, hardware limit verification)
“If agents use cheating tactics like caching and deliver performance beyond what a B200 can support, the pipeline invalidates the result.” — Cursor Engineering Blog
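The invalidation rule in the quote can be illustrated with a simple check. This is only a sketch of the idea, not SOL-ExecBench's actual logic: the function name, the `outputs_match` correctness flag, and the returned labels are assumptions made for the example.

```python
def validate(measured_latency_us, sol_latency_us, outputs_match):
    """Classify one benchmark run (hypothetical labels).

    sol_latency_us is the speed-of-light latency: the fastest time the
    hardware could achieve for this workload. Anything faster is treated
    as evidence of cheating, such as returning cached results.
    """
    if not outputs_match:
        return "incorrect"
    if measured_latency_us < sol_latency_us:
        return "invalid"
    return "valid"


print(validate(15.0, 12.0, True))  # valid
print(validate(10.0, 12.0, True))  # invalid
```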
Key Technical Decisions
Decision 1: Dual-Language Testing—Validating Architectural Generalization
Cursor chose to have the system solve the same set of problems in two languages:
- CUDA C++ with inline PTX: close to the hardware level, testing the system’s understanding of ISA-level instructions
- CuTe DSL: high-level abstraction, testing the system’s ability to learn new APIs
“CuTe DSL … has minimal presence in public training data, testing whether the system can learn novel APIs purely from provided documentation.”
This design is clever: the PTX track tests whether the system can drive the lowest-level hardware resources, while the CuTe track tests whether it can quickly pick up an unfamiliar tool from documentation alone. Success on both suggests that the architecture's results do not hinge on the model having already seen a particular toolchain during training.
Decision 2: Problem-Level vs. Combined Metrics
Result data is divided into two levels:
- combined_metrics.csv: baseline latency, SOL latency, selected latency, SOL scores for each workload
- problem_level_metrics.csv: aggregated results by problem: SOL scores and relative baseline speedup
This two-level layout lets later analysis examine the optimization effect on an individual problem and also evaluate the effectiveness of the overall strategy.
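To make the relationship between the two levels concrete, the sketch below rolls workload-level rows up into one speedup per problem. The column names, the sample rows, and the choice of a per-problem geometric mean are all assumptions for illustration; the source does not specify the real schema of `combined_metrics.csv` or how the aggregation is defined.

```python
import csv
import io
import math
from collections import defaultdict

# Made-up rows with hypothetical column names, standing in for workload-level data.
SAMPLE = """problem,workload,baseline_latency_us,selected_latency_us
gemm,w0,100,50
gemm,w1,200,200
attn,w0,80,40
"""


def problem_speedups(csv_text):
    """Aggregate workload-level latencies into one speedup per problem."""
    per_problem = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        speedup = float(row["baseline_latency_us"]) / float(row["selected_latency_us"])
        per_problem[row["problem"]].append(speedup)
    # Assumed aggregation: geometric mean of the workload speedups.
    return {
        name: math.exp(sum(math.log(s) for s in values) / len(values))
        for name, values in per_problem.items()
    }


result = problem_speedups(SAMPLE)
for name, speedup in result.items():
    print(f"{name}: {speedup:.2f}x")
```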
Result Analysis: Engineering Implications Behind the Numbers
Overall Performance
| Metric | Value |
|---|---|
| Problems Solved | 235 / 235 (100%) |
| Problems Faster Than Baseline | 149 / 235 (63%) |
| Geometric Mean Speedup | 38% |
| Problems with Speedup > 2x | 45 / 235 (19%) |
| Median SOL Score | 0.56 |
| Highest SOL Score | 0.9722 |
Key Interpretations:
- 38% geometric mean speedup: the figure is a geometric mean rather than an arithmetic mean, so a handful of extreme speedups cannot dominate it, which makes it more representative of the typical problem.
- 63% of problems beat the baseline: the remaining 37% did not. Because a SOL score of 0.5 corresponds to the baseline, a median of 0.56 means the typical solution is only slightly ahead of it and significant room for improvement remains.
- SOL 0.9722 for the attention kernel: this is very close to the hardware limit (1.0), indicating that the system reached expert-level performance on certain problems.
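A small worked example shows why the choice of mean matters. The speedup values below are made up for illustration and are not taken from Cursor's results.

```python
from statistics import geometric_mean, mean

# Four modest speedups and one extreme outlier (illustrative values only).
speedups = [1.0, 1.1, 1.2, 1.3, 10.0]

arith = mean(speedups)           # pulled up strongly by the 10x outlier
geo = geometric_mean(speedups)   # fifth root of the product of the speedups

print(f"arithmetic mean: {arith:.2f}x")  # about 2.92x
print(f"geometric mean:  {geo:.2f}x")    # about 1.77x
```

With the outlier included, the arithmetic mean suggests nearly a 3x improvement, while the geometric mean stays much closer to what most of the problems actually saw.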
Case 1: Grouped Query Attention (SOL 0.9722, 84% Speedup)
This is the standout single result. The agent optimized the Attention Kernel from SGLang (Llama 3.1 8B) using CUDA C++:
- Successfully optimized memory loading and mathematical operations using hardware-level instructions
- Improved scheduling with Persistent Kernels
- Specialized the kernel for specific input sizes
Result: This kernel was reintegrated into the SGLang production code, with a measured TTFT (Time To First Token) improvement of 3%.
“We compared the multi-agent system’s custom kernel with a human-optimized baseline in the FlashInfer library. We found that the system produced a solution approaching hardware limits with a SOL score of 0.9722.”
Case 2: BF16 GEMM (Close to cuBLAS, Faster at Small M)
Matrix multiplication is recognized as one of the “hardest problems” by kernel engineers, as it requires a deep understanding of hardware unit scheduling. The system:
- Generated a dedicated CUDA C++ GEMM kernel from scratch
- Independently learned to use Blackwell-specific instructions
- Surpassed NVIDIA’s cuBLAS library by 9% at small batch sizes (small M), the regime that matters most for LLM inference decoding
Case 3: NVFP4 MoE Linear (39% Speedup)
In the quantized mixture-of-experts scenario, the agent correctly identified the quantization step as the bottleneck and used precomputed threshold buckets to map FP32 values directly to FP4. Arriving at this optimization depends on understanding how the numerical formats are actually laid out.
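The threshold-bucket idea can be sketched as follows. This is an illustration of the general technique, not the agent's kernel: it assumes the FP4 E2M1 value set (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), assumes the input has already been divided by its block scale, and simplifies tie handling at bucket boundaries. The function names are hypothetical.

```python
from bisect import bisect_right

# Magnitudes representable in FP4 E2M1, in order of their 3-bit code (0..7).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Precomputed bucket boundaries: the midpoints between neighbouring values,
# i.e. [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0].
THRESHOLDS = [(a + b) / 2 for a, b in zip(FP4_MAGNITUDES, FP4_MAGNITUDES[1:])]


def quantize_fp4(x):
    """Map a pre-scaled float to a 4-bit code: sign bit plus magnitude index.

    Only comparisons against fixed thresholds are needed, with no exponent
    or mantissa arithmetic. Values above 5.0 saturate to 6.0. Exact midpoints
    round up here, which is a simplification of round-to-nearest-even.
    """
    sign = 1 if x < 0 else 0
    index = bisect_right(THRESHOLDS, abs(x))
    return (sign << 3) | index


def dequantize_fp4(code):
    """Recover the float value of a 4-bit code, for checking the mapping."""
    magnitude = FP4_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


for value in (0.3, 2.4, -7.0):
    print(value, "->", dequantize_fp4(quantize_fp4(value)))
```

On a GPU the same lookup becomes a short chain of comparisons per element, which is presumably why it is cheaper than a general floating-point conversion path.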
Engineering Insights: What Determines the Limits of Multi-Agent Systems
Insight 1: Computational Resources are Hard Constraints on Exploration Depth
Cursor explicitly states:
“The median SOL score was still only 0.56, leaving significant room for further optimization. We believe that multi-agent solutions can be vastly improved with more compute, as we had hundreds of problems and agents running on only 27 GPUs.”
This is a very honest assessment. With 235 problems on 27 GPUs, each GPU had to serve roughly nine problems, which leaves little budget for deep exploration of any single one. As a speculative estimate rather than a measured result, scaling to 270 GPUs would let the system explore the solution space more deeply and might raise the median SOL to 0.7 or higher.
Insight: The quality of optimization in multi-agent systems is strongly correlated with computational resources. If your scenario is “rapid convergence under limited resources,” such systems may not perform as expected; if it is “large-scale parallel exploration,” they can fully leverage their advantages.
Insight 2: The Quality of Task Decomposition Determines System Limits
The entire system’s coordination protocol exists in a single Markdown file:
“The entire coordination protocol lived in a single markdown file that specified the output format, rules, and tests.”
This means that the Planner’s scheduling strategy, the Workers’ execution boundaries, and the criteria for judging results all depend on this protocol. If the protocol is poorly designed, the Planner may distribute tasks unevenly, Workers may duplicate effort, and the Benchmark may provide misleading feedback.
Insight: The engineering complexity of multi-agent systems lies not in the “agents themselves” but in the design of task decomposition and coordination protocols. The protocol is the constitution of the system; everything else is the execution of that constitution.
Insight 3: Boundaries of Domain Knowledge vs. General Reasoning
The CuTe DSL experiment demonstrated a key conclusion: even if a certain domain is almost non-existent in public training data, the system can still learn from documentation and complete optimizations. This implies:
“Multi-agent architectures will quickly become the default approach to building software because they can tackle novel problems that fall far outside training data distribution.”
Insight: The true value of multi-agent systems lies not in “replacing existing experts” but in solving problems that have never been solved before—those long-tail issues that lack training data, reference implementations, and where experts also lack the time to address.
Insights for Harness Design
1. Harness Must Be Task-Aware
One key factor in the success of this system is the deep integration of the Benchmark Pipeline with the Agent: Agents do not have to wait until all code has been executed to receive feedback but can call the Benchmark for validation at any time. This changes the Agents’ behavior pattern from “generate and guess” to “verify and iterate.”
Many existing Agent Harness designs follow “execute → observe results → human judgment,” while this case shows how far “execute → automated verification → autonomous agent decision-making” can go.
2. Scoring Mechanisms Must Prevent Cheating
The anti-cheating design of SOL-ExecBench (hardware limit detection) is worth borrowing for any system that needs an objective assessment of agent capability. Under strong reward incentives, agents may be motivated to find loopholes in the scoring rather than truly solve the problem, and every evaluation framework has to account for this.
3. The Planner’s Task Distribution Strategy is a Performance Bottleneck
As the system scales (more Workers, more problems), the Planner’s scheduling overhead becomes a new bottleneck. Cursor opted for “dynamic rebalancing based on performance metrics,” which requires:
- Benchmark results to be returned to the Planner in real-time
- The Planner to maintain global state
- Task granularity to be reasonable (too coarse leads to uneven load, too fine leads to excessive scheduling overhead)
Conclusion: Where are the True Boundaries of Multi-Agent Systems?
Cursor’s case answers a key question: Where are the limits of multi-agent systems?
The answer is: computational resources determine exploration depth, protocol design dictates execution efficiency, and model capability determines the quality of each step. All three are indispensable.
If your problem calls for extensive parallel exploration and can be validated automatically, such multi-agent systems may yield astonishing returns; if it requires deep domain intuition and cannot be validated automatically, current systems may still fall short of human experts.
“The most ambitious tasks in software are open-ended, without a clear solution. Single-agent systems struggle here because models are best at narrowly scoped tasks they have already seen during training.” — Cursor Engineering Blog
Next Steps: If you are building an agent system that requires “long-term exploration + automated validation” capabilities, Cursor’s framework is worth studying in depth—not to imitate its specific implementation but to understand how it organically combines “task decomposition,” “autonomous validation,” and “dynamic scheduling.”