Task Lab

From task ideas to benchmark-ready evidence.

Task Lab is the working space behind new ComponentBench tasks. It helps turn a component-level interaction idea into a structured task packet, a runnable page, a verifier, human reference traces, and agent-run evidence. The page below shows the intended flow and keeps the current draft/generation controls in one place.

Input: task idea

Core check: verifier

Output: evidence packet

How a task becomes usable evidence

The benchmark is useful only when each task is specific enough to run, verify, compare against a human baseline, and debug after an agent fails. Task Lab keeps those pieces connected instead of treating task generation as a standalone demo.

Start from a component skill

A draft begins with the UI behavior we want to test: selecting from a dense dropdown, resolving a modal, filtering a table, recovering from feedback, or confirming a state change.

Canonical component, task template, scene factors

Turn it into a structured task

The idea is normalized into an instruction, component metadata, expected end state, negative cases, and hints about the page context. This keeps tasks measurable instead of just plausible.

Instruction, verifier intent, failure boundaries

Build a verifiable page

The generated page is tied to an explicit success checker. The goal is not to make a puzzle; it is to create a small, realistic interaction that can be judged consistently.

Runnable task, deterministic success checker
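As a rough illustration of what a deterministic success checker could look like, the sketch below compares a page's final state against an expected end state and a list of negative cases. The function name, the `VerifierResult` type, and the state keys are hypothetical and are not taken from the Task Lab codebase.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Sequence


@dataclass(frozen=True)
class VerifierResult:
    passed: bool
    reason: str


def verify_end_state(
    final_state: Mapping[str, Any],
    expected: Mapping[str, Any],
    negative_cases: Sequence[Mapping[str, Any]],
) -> VerifierResult:
    """Hypothetical checker: same inputs always give the same verdict."""
    # Negative cases are checked first so a known failure pattern is
    # reported by name rather than as a generic mismatch.
    for case in negative_cases:
        if case and all(final_state.get(k) == v for k, v in case.items()):
            return VerifierResult(False, f"matched negative case: {dict(case)}")
    # Every expected key must be present with exactly the expected value.
    for key, want in expected.items():
        got = final_state.get(key)
        if got != want:
            return VerifierResult(False, f"{key}: expected {want!r}, got {got!r}")
    return VerifierResult(True, "ok")
```

Because the checker reads only the final state and uses no clock or randomness, two runs that end in the same state receive the same verdict.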

Ground it with traces and runs

Human reference traces provide the efficiency baseline. Agent runs add observations, actions, screenshots, timing, and failure evidence for debugging and release decisions.

Human baseline, agent logs, audit packet

What the pipeline should leave behind

The important product is not just the generated page. It is the bundle of artifacts that lets a researcher trust the task and lets an agent engineer understand why a run succeeded or failed.

Task packet

Instruction, component family, library, task template, scene factors, expected end state, and negative cases.
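One way to picture the task packet is as a small typed record. The sketch below mirrors the fields listed above; the class name, field types, and the `problems` check are assumptions made for illustration, not the actual Task Lab schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TaskPacket:
    # Field names follow the list above; the types are assumed.
    instruction: str
    component_family: str
    library: str
    task_template: str
    scene_factors: dict[str, Any] = field(default_factory=dict)
    expected_end_state: dict[str, Any] = field(default_factory=dict)
    negative_cases: list[dict[str, Any]] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return reasons the packet cannot yet be run and verified."""
        issues = []
        if not self.instruction.strip():
            issues.append("instruction is empty")
        if not self.expected_end_state:
            issues.append("expected_end_state is empty, so nothing can be verified")
        return issues

    def to_json(self) -> str:
        # Sorted keys keep the serialized packet stable across runs.
        return json.dumps(asdict(self), sort_keys=True)
```

A packet with an empty expected end state is flagged before any page is generated, which matches the goal of keeping tasks measurable rather than merely plausible.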

Human reference

Replayable trace with human step count and time, used as the efficiency baseline for agent runs.
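To make the efficiency baseline concrete, the sketch below expresses an agent run as a ratio of the human reference. The function name and the ratio definition are assumptions; the benchmark may define efficiency differently.

```python
def efficiency_ratios(
    human_steps: int,
    human_seconds: float,
    agent_steps: int,
    agent_seconds: float,
) -> dict[str, float]:
    """Hypothetical metric: agent cost divided by human cost (1.0 means parity)."""
    if human_steps <= 0 or human_seconds <= 0:
        raise ValueError("human baseline must have positive steps and time")
    return {
        "step_ratio": agent_steps / human_steps,
        "time_ratio": agent_seconds / human_seconds,
    }
```

For example, a human trace of 4 steps in 8 seconds and an agent run of 6 steps in 20 seconds give a step ratio of 1.5 and a time ratio of 2.5.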

Run packet

Observation/action history, screenshots or frames, pass/fail status, timing, and notes on where and why a run failed.

Release evidence

Aggregated component-level signals that help decide whether a task is ready for the held-out benchmark pool.
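A minimal sketch of how run packets might roll up into component-level signals is shown below. The `RunPacket` fields and the choice of pass rate as the signal are assumptions for illustration; the source does not specify which signals drive release decisions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class RunPacket:
    # Hypothetical subset of the run packet described above.
    task_id: str
    component_family: str
    passed: bool
    seconds: float
    failure_note: str = ""


def summarize_by_component(runs: Iterable[RunPacket]) -> dict[str, dict[str, Any]]:
    """Group runs by component family and compute simple release signals."""
    grouped: dict[str, list[RunPacket]] = defaultdict(list)
    for run in runs:
        grouped[run.component_family].append(run)

    summary: dict[str, dict[str, Any]] = {}
    for family, items in grouped.items():
        passes = sum(1 for r in items if r.passed)
        summary[family] = {
            "runs": len(items),
            "pass_rate": passes / len(items),
            # Keep the notes from failed runs so the numbers stay debuggable.
            "failure_notes": [r.failure_note for r in items if not r.passed and r.failure_note],
        }
    return summary
```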

Future direction

A possible path toward component-level training gyms.

This is not ready as a live gym product yet. The immediate purpose of Task Lab is benchmark creation, validation, and analysis. Longer term, the same task/component/verifier pipeline could generate a separate training pool for latency-aware computer-use agents, while the public ComponentBench evaluation set remains held out.

The research motivation is straightforward: long-horizon gyms teach planning and recovery, while short component-level gyms can train agents to execute common UI primitives quickly, reliably, and with less reasoning cost.

Scope: separate pool

Focus: latency-aware

Status: not live yet

What carries over

The same task specs, verifiers, traces, and failure packets could seed separate component-level training environments.

What stays protected

The public benchmark should remain held out. Training gyms would be generated as a separate pool, not from the evaluation set.

What is not live yet

There is no production rollout scheduler, reward service, policy training loop, or published gym split wired into this site today.

Why it matters

If built out, these gyms would complement longer workflow gyms by training fast UI grounding and low-latency component execution.

If this becomes a gym layer

The natural training path would be hybrid rather than pure RL from scratch.

Status: not wired yet

Warm-start from human traces and strong-agent successful runs

Prefer shorter verified successes over wasteful or failed trajectories

Use verifier rewards with step, time, token, and invalid-action penalties
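The last item can be sketched as a shaped reward: a verifier pass is worth a fixed amount, and step, time, token, and invalid-action costs are subtracted from it. Nothing like this is wired into the site today; the weights, names, and linear form below are placeholder assumptions, not tuned or published values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PenaltyWeights:
    # Placeholder weights chosen only for illustration.
    per_step: float = 0.01
    per_second: float = 0.001
    per_1k_tokens: float = 0.005
    per_invalid_action: float = 0.05


def shaped_reward(
    passed: bool,
    steps: int,
    seconds: float,
    tokens: int,
    invalid_actions: int,
    weights: PenaltyWeights = PenaltyWeights(),
) -> float:
    """Verifier outcome (1.0 or 0.0) minus linear cost penalties."""
    base = 1.0 if passed else 0.0
    penalty = (
        weights.per_step * steps
        + weights.per_second * seconds
        + weights.per_1k_tokens * tokens / 1000
        + weights.per_invalid_action * invalid_actions
    )
    return base - penalty
```

Under these assumed weights, a shorter verified success always scores higher than a longer one. That also gives a simple ordering for preferring short successes when filtering trajectories.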

Draft a new task

This form is the current working entry point for the pipeline above. It can save a draft, pass ontology and verifier hints, and call the configured backend when generation services are available.

What do you want to build?

Describe the task in plain language. Be specific about what success looks like.