Reproducible eval sandbox for testing Computer Use / browser agents on triaging tickets in Zendesk in a gaming context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on extracting leads from Apollo.io in a gaming context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on scheduling posts in Buffer in a gaming context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on pulling metrics from Mixpanel in a gaming context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on filling out job applications on company portals in a gaming context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on downloading reports from the Stripe dashboard in a gaming context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on booking flights on Google Flights in a gaming context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on managing ads in Meta Ads Manager in a gaming context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on updating records in Salesforce in a gaming context. Includes fixture sites, gold trajectories, and regression gates.