From task ideas to benchmark-ready evidence.
Task Lab is the working space behind new ComponentBench tasks. It helps turn a component-level interaction idea into a structured task packet, a runnable page, a verifier, human reference traces, and agent-run evidence. This page shows the intended flow and keeps the current draft and generation controls in one place.
How a task becomes usable evidence
The benchmark is useful only when each task is specific enough to run, verify, compare against a human baseline, and debug after an agent fails. Task Lab keeps those pieces connected instead of treating task generation as a standalone demo.
Start from a component skill
A draft begins with the UI behavior we want to test: selecting from a dense dropdown, resolving a modal, filtering a table, recovering from feedback, or confirming a state change.
Turn it into a structured task
The idea is normalized into an instruction, component metadata, expected end state, negative cases, and hints about the page context. This keeps tasks measurable instead of just plausible.
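As a rough illustration, a structured task could be held in a small record like the one below. This is a sketch under stated assumptions, not the actual Task Lab schema: the class name `TaskPacket`, its field names, and the `problems` check are all hypothetical, chosen to mirror the pieces listed above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TaskPacket:
    """Hypothetical task record; the field names are illustrative only."""
    instruction: str
    component_family: str
    library: str
    expected_end_state: dict[str, Any]
    negative_cases: list[dict[str, Any]] = field(default_factory=list)
    page_context_hints: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return reasons the packet is not yet measurable (empty list means ok)."""
        issues = []
        if not self.instruction.strip():
            issues.append("instruction is empty")
        if not self.expected_end_state:
            issues.append("expected end state is empty, so success cannot be checked")
        if not self.negative_cases:
            issues.append("no negative cases, so near-misses cannot be rejected")
        return issues


packet = TaskPacket(
    instruction="Select 'Norway' in the country dropdown.",
    component_family="select",
    library="example-ui",  # placeholder, not a real library name
    expected_end_state={"country": "Norway"},
    negative_cases=[{"country": "Northern Mariana Islands"}],
)
print(packet.problems())  # an empty list means the packet passes this basic check
```

A check like `problems()` is one way to separate a measurable task from a merely plausible one: a draft without an expected end state or negative cases is flagged before any page is generated.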
Build a verifiable page
The generated page is tied to an explicit success checker. The goal is not to make a puzzle; it is to create a small, realistic interaction that can be judged consistently.
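One way to picture an explicit success checker is as a pure function over the page's final state. The sketch below assumes the final state can be read as a flat dictionary; the function names `matches` and `verify` are hypothetical and do not describe the verifier Task Lab actually generates.

```python
from typing import Any, Mapping, Sequence


def matches(state: Mapping[str, Any], pattern: Mapping[str, Any]) -> bool:
    """True when every key in the pattern has the same value in the state."""
    return all(state.get(key) == value for key, value in pattern.items())


def verify(
    final_state: Mapping[str, Any],
    expected_end_state: Mapping[str, Any],
    negative_cases: Sequence[Mapping[str, Any]] = (),
) -> bool:
    """Pass only if the expected state is reached and no negative case applies.

    Patterns are assumed to be non-empty; an empty pattern matches any state.
    """
    if any(matches(final_state, negative) for negative in negative_cases):
        return False
    return matches(final_state, expected_end_state)


expected = {"country": "Norway"}
# Hypothetical negative case: the right option was chosen, but an unrelated box got ticked.
negatives = [{"country": "Norway", "newsletter": True}]

print(verify({"country": "Norway", "newsletter": False}, expected, negatives))
print(verify({"country": "Norway", "newsletter": True}, expected, negatives))
```

Because the checker depends only on the final state, the same run is judged the same way every time, which is the consistency the text asks for.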
Ground it with traces and runs
Human reference traces provide the efficiency baseline. Agent runs add observations, actions, screenshots, timing, and failure evidence for debugging and release decisions.
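To make the efficiency baseline concrete, an agent run can be expressed as a ratio against the human reference trace. This is a minimal sketch; `RunSummary` and `efficiency_vs_human` are hypothetical names, and the specific ratios are an assumption about how such a comparison might be reported.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical summary of one trace or run."""
    steps: int
    seconds: float
    passed: bool = True


def efficiency_vs_human(agent: RunSummary, human: RunSummary) -> dict[str, float]:
    """Ratios above 1.0 mean the agent used more steps or time than the human."""
    if human.steps <= 0 or human.seconds <= 0:
        raise ValueError("human baseline must have positive steps and time")
    return {
        "step_ratio": agent.steps / human.steps,
        "time_ratio": agent.seconds / human.seconds,
    }


human = RunSummary(steps=4, seconds=8.0)
agent = RunSummary(steps=6, seconds=20.0)
print(efficiency_vs_human(agent, human))  # {'step_ratio': 1.5, 'time_ratio': 2.5}
```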
What the pipeline should leave behind
The important product is not just the generated page. It is the bundle of artifacts that lets a researcher trust the task and lets an agent engineer understand why a run succeeded or failed.
Instruction, component family, library, task template, scene factors, expected end state, and negative cases.
Replayable trace with human step count and time, used as the efficiency baseline for agent runs.
Observation/action history, screenshots or frames, pass/fail status, timing, and useful failure notes.
Aggregated component-level signals that help decide whether a task is ready for the held-out benchmark pool.
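As one example of an aggregated component-level signal, pass rates can be grouped by component family. The sketch below assumes each run is reduced to a (family, passed) pair; the function name and the input shape are hypothetical, and a real readiness decision would likely weigh more than pass rate.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_component(runs: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Map each component family to the fraction of its runs that passed."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # family -> [passed, total]
    for family, passed in runs:
        counts[family][1] += 1
        if passed:
            counts[family][0] += 1
    return {family: passed / total for family, (passed, total) in counts.items()}


runs = [("select", True), ("select", False), ("modal", True)]
print(pass_rate_by_component(runs))  # {'select': 0.5, 'modal': 1.0}
```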
A possible path toward component-level training gyms.
This is not ready as a live gym product yet. The immediate purpose of Task Lab is benchmark creation, validation, and analysis. Longer term, the same task/component/verifier pipeline could generate a separate training pool for latency-aware computer-use agents, while the public ComponentBench evaluation set remains held out.
The research motivation is straightforward: long-horizon gyms teach planning and recovery, while short component-level gyms can train agents to execute common UI primitives quickly, reliably, and at lower reasoning cost.
The same task specs, verifiers, traces, and failure packets could seed separate component-level training environments.
The public benchmark should remain held out. Training gyms would be generated as a separate pool, not from the evaluation set.
There is no production rollout scheduler, reward service, policy training loop, or published gym split wired into this site today.
If built out, these gyms would train fast UI grounding and low-latency component execution alongside longer workflow gyms.
The natural training path would be hybrid rather than pure RL from scratch.
Warm-start from human traces and strong-agent successful runs
Prefer shorter verified successes over wasteful or failed trajectories
Use verifier rewards with step, time, token, and invalid-action penalties
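The three points above could be sketched as follows. Nothing here is wired into the site: the `Trajectory` fields, the penalty weights, and both function names are hypothetical, and the weights in particular are arbitrary placeholders rather than tuned values.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one human trace or agent run."""
    passed: bool          # verifier outcome
    steps: int
    seconds: float
    tokens: int
    invalid_actions: int


def shaped_reward(
    t: Trajectory,
    step_cost: float = 0.01,
    time_cost: float = 0.001,
    token_cost: float = 0.00001,
    invalid_cost: float = 0.05,
) -> float:
    """Verifier reward (1 or 0) minus step, time, token, and invalid-action penalties."""
    base = 1.0 if t.passed else 0.0
    penalty = (
        step_cost * t.steps
        + time_cost * t.seconds
        + token_cost * t.tokens
        + invalid_cost * t.invalid_actions
    )
    return base - penalty


def warm_start_pool(trajectories: list[Trajectory], keep: int) -> list[Trajectory]:
    """Keep only verified successes, shortest first, as warm-start examples."""
    verified = [t for t in trajectories if t.passed]
    return sorted(verified, key=lambda t: (t.steps, t.seconds))[:keep]


fast = Trajectory(passed=True, steps=5, seconds=10.0, tokens=1000, invalid_actions=1)
slow = Trajectory(passed=True, steps=12, seconds=40.0, tokens=6000, invalid_actions=3)
failed = Trajectory(passed=False, steps=3, seconds=5.0, tokens=500, invalid_actions=0)

print(shaped_reward(fast) > shaped_reward(slow) > shaped_reward(failed))
print(warm_start_pool([slow, failed, fast], keep=1) == [fast])
```

Under these placeholder weights a short verified success outranks a long one, and any failed run scores below both, which matches the stated preference for shorter verified successes over wasteful or failed trajectories.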
Draft a new task
This form is the current working entry point for the pipeline above. It can save a draft, pass ontology and verifier hints, and call the configured backend when generation services are available.
Describe the task in plain language. Be specific about what success looks like.