Training Infrastructure That Generates Itself

Describe any training environment. 930 builds it, grades it, and exports the data.

930 Forge generates gyms, tasks, and rubrics from a prompt. Every session is event-sourced — replay any moment, fork from any state, export as SFT data. The training universe expands with every run.

9 gyms · 81 tasks · Event-sourced sessions · One API

Trusted by teams at

Amazon · X · OpenAI · Anthropic

930 Forge

Generate any gym, task, or rubric from a prompt.

Describe what you need in plain language. Forge generates the environment (state machine, UI, event handlers), the tasks, and the grading rubrics. Code compiles at runtime, validates against the task behaviour, and registers in the catalog. Immediately runnable.

Describe the environment

Tell Forge what the gym should do. The orchestrator decomposes your request into gym, tasks, rubrics, and world generators — then builds them in parallel.

Auto-generated rubrics

Forge writes rubric functions that return per-criterion scores and detailed textual feedback. Composable, weighted, and inspectable — not a black box.

Compile, validate, run

Generated Elixir code compiles to BEAM bytecode, passes AST safety checks, smoke-tests its required callbacks, and registers in the catalog. Ready in seconds.

Event-sourced platform

Every action is an event. Replay any moment. Fork from any state.

Sessions are event-sourced from the ground up. Every click, every keystroke is stored as a trace entry with a full state snapshot. This gives you capabilities that stateless systems can't offer.

Time travel

Replay events to reconstruct the exact state at any point in time. See what the model saw, step by step.
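Conceptually, replay is a fold over the event log: re-apply each event to the initial state and stop wherever you like. The sketch below is a minimal Python illustration of that idea only; 930's implementation is Elixir, and the `apply_event` reducer and event names here are assumptions, not the platform's API.

```python
def apply_event(state, event_name, params):
    """Hypothetical reducer: each event maps deterministically to a state change."""
    if event_name == "set_cell":
        cells = dict(state.get("cells", {}))
        cells[params["ref"]] = params["value"]
        return {**state, "cells": cells}
    if event_name == "click":
        return {**state, "focused": params["target"]}
    return state

def replay(initial_state, trace, upto=None):
    """Reconstruct state by re-applying events in order.

    `trace` is a list of (event_name, params) pairs; `upto` limits replay
    to the first N events, giving the exact state at that moment.
    """
    state = dict(initial_state)
    for event_name, params in trace[:upto]:
        state = apply_event(state, event_name, params)
    return state
```

Because every reducer is deterministic, replaying the same prefix always reconstructs the same state, which is what makes "see what the model saw, step by step" possible.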

Session forking

Model failed at step 15? Fork from there. Test a different strategy from the same checkpoint without re-running known-good steps.
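With an event log, forking reduces to copying a prefix of the trace. A minimal Python sketch of the concept (the event shapes and step counts are illustrative assumptions, not the real session format):

```python
def fork(trace, at_step):
    """Fork a session: copy the first `at_step` events as shared history.

    The fork reuses known-good steps without re-running them; new actions
    append to the copy, leaving the parent session's trace untouched.
    """
    return list(trace[:at_step])

# Hypothetical parent session that went wrong at step 15:
parent = [("click", {"target": f"step_{i}"}) for i in range(20)]
child = fork(parent, 15)  # shares steps 0-14 with the parent
child.append(("click", {"target": "alternative_strategy"}))
```

The parent keeps its original 20 steps; the child diverges from the shared checkpoint with a different strategy.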

Full observability

Every trace entry carries the event name, parameters, timestamp, and post-action state snapshot. Nothing is opaque.

Deterministic seeds

Same seed, same scenario. The only variable is your model. Reproducible environments that don't drift between runs.
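The standard way to get "same seed, same scenario" is to derive every random choice from one isolated, seeded RNG. A Python sketch under that assumption (the scenario fields here are invented for illustration):

```python
import random

def generate_scenario(seed):
    """Deterministic world generation: the same seed always yields the
    same scenario, so the only variable between runs is the model."""
    rng = random.Random(seed)  # isolated RNG; no shared global state
    return {
        "client": rng.choice(["Quigley-Block", "Acme", "Globex"]),
        "deal_value": rng.randrange(50_000, 200_000, 1_000),
        "rows": [rng.randint(100, 999) for _ in range(5)],
    }
```

Using an RNG instance rather than the global `random` module is what keeps generation reproducible even when other code draws random numbers in between.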

Criterion-level grading

Not a number. A diagnosis.

Every task is graded by composable rubric functions. Each criterion returns a scalar score and detailed textual feedback. You see exactly which skills your model has learned, which it hasn't, and why.

Scalar score + feedback

Each criterion returns a 0–1 score and a human-readable explanation. "Cell (row_1, amount) = 450, expected 500" — not just "73%".
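A criterion of that shape can be sketched as a function returning a (score, feedback) pair. This Python version mirrors the cell example above; the state layout and function name are assumptions for illustration:

```python
def grade_cell(state, ref, expected):
    """A rubric criterion: returns a 0-1 score plus human-readable
    feedback, not just a bare number."""
    actual = state.get("cells", {}).get(ref)
    if actual == expected:
        return 1.0, f"Cell {ref} = {expected} as expected"
    return 0.0, f"Cell {ref} = {actual}, expected {expected}"
```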

Weighted composition

Tasks compose rubric entries with weights. Main objectives can matter more than safety checks. Multi-gym tasks grade across every environment in one pass.
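Weighted composition amounts to a weight-normalised sum over criteria while keeping every criterion's feedback. A hedged Python sketch (the rubric tuple layout is an assumption, not the platform's data model):

```python
def grade(state, rubric):
    """Compose weighted criteria into one report.

    `rubric` is a list of (name, weight, criterion_fn) where each
    criterion returns (score, feedback). The composite is the
    weight-normalised sum; per-criterion details are preserved.
    """
    results = [(name, w, *fn(state)) for name, w, fn in rubric]
    total_w = sum(w for _, w, _, _ in results)
    composite = sum(w * s for _, w, s, _ in results) / total_w
    return composite, [(name, s, fb) for name, _, s, fb in results]
```

Raising the weight of a main-objective criterion relative to a safety check changes the composite without hiding either result, which keeps the grade inspectable.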

RL reward signals

The same rubric results feed directly into RL training as reward signals. Per-criterion scores enable fine-grained curriculum learning and auxiliary rewards.
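One plausible reading of that mapping, sketched in Python (the result layout and the choice of the mean as episode reward are assumptions for illustration):

```python
def rubric_to_rewards(results):
    """Turn rubric output into RL signals.

    `results` is a list of (criterion_name, score, feedback). The mean
    score serves as the episode reward; per-criterion scores become
    auxiliary rewards for fine-grained curriculum learning.
    """
    aux = {name: score for name, score, _ in results}
    reward = sum(aux.values()) / len(aux)
    return reward, aux
```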

Training data

One run. Three outputs. Your failures become your curriculum.

Every session produces eval grades, RL reward signals, and SFT training data. No separate pipelines, no reformatting. Export any session as a ZIP with screenshots and action manifests, or as trace JSON for replay.

SFT data export

Export sessions as ZIP files with step-by-step screenshots, bounding boxes, action types, and selectors. Ready for supervised fine-tuning pipelines.

Trace JSON

Full session export with initial and final states, complete event trace with snapshots, grading results, and scenario metadata. One file, every detail.
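To make the shape concrete, here is a hypothetical session export built from the fields named above. Every key in this Python dict is an assumption based on that description, not the actual schema:

```python
import json

# Hypothetical trace export; field names are illustrative assumptions.
session = {
    "scenario": {"seed": 42, "gym": "spreadsheet"},
    "initial_state": {"cells": {}},
    "trace": [
        {
            "event": "set_cell",
            "params": {"ref": "A1", "value": 500},
            "timestamp": "2025-01-01T00:00:00Z",
            "snapshot": {"cells": {"A1": 500}},  # post-action state
        }
    ],
    "final_state": {"cells": {"A1": 500}},
    "grading": {"composite": 1.0, "criteria": [{"name": "cell_A1", "score": 1.0}]},
}

exported = json.dumps(session, indent=2)  # one file, every detail
```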

Kick-start before RL

Use SFT data to bootstrap your model before RL training begins. Run solver sessions, export the good ones, and fine-tune on demonstrated behaviour.

See it in action

Watch models learn in real tasks.

Each task runs in a controlled gym with criterion-level grading. Multi-gym tasks span CRM, spreadsheets, and more.

Cross-gym workflow

Model learns to onboard a client across three apps in one episode.

CRM updates, spreadsheet bookkeeping, and operational follow-through — graded per criterion across every gym.

Example prompt

Onboard Quigley-Block as a new client with a $126000 contract. Update their CRM status and deal value, add the contract as revenue in Excel, and create three onboarding …

13 steps Open task

Spreadsheet reasoning

Model learns to reconcile sheets without corrupting the source.

Inspect multiple sheets, infer what's missing, write only the derived output. Graded on accuracy and restraint.

Example prompt

Reconcile invoices against payments: 1. Switch to the 'Invoices' sheet and review all invoice IDs 2. Switch to the 'Payments' sheet and note which invoice IDs have payme…

15 steps Open task

Spatial manipulation

Model learns precise sequencing in a world full of distractors.

Small enough to grade exactly, complex enough to require planning. Each criterion isolates a different spatial skill.

Example prompt

Build a tower at the center of the grid by stacking the colored cubes in this order (from bottom to top): green → blue → yellow → red. Do not move the purple sphere.

12 steps Open task

The vision

A training universe that expands with every user.

Because 930 can generate any task or gym from a prompt, it can also analyze your training rollouts, identify your model's blind spots, and generate new tasks specifically in those weak areas. Do that for every team on the platform, and the training universe grows — especially where models need it most.

01

Evaluate

Run tasks with criterion-level grading. See exactly where your model fails and why.

02

Analyze blind spots

930 analyzes rollouts across sessions. Detailed reports surface your model's strengths, weaknesses, and missing capabilities.

03

Generate targeted tasks

Forge creates new gyms and tasks specifically in the model's blind spots. Training curriculum that adapts to what's actually broken.

04

Train & repeat

Export SFT data. Train. Re-evaluate on the same tasks. Measure improvement. The loop compounds. The universe expands.

Stop maintaining pipelines. Start compounding training.

Generate a gym from a prompt. Run a graded task in under a minute. Export SFT data. Fork from any failure. Every round makes the next one better — and every team on the platform makes it better for everyone.