Reproducible eval sandbox for testing Computer Use / browser agents on research companies on Crunchbase in education context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on monitor competitor pricing in education context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on scrape product listings on Amazon in education context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on download reports from Stripe dashboard in education context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on pull metrics from Mixpanel in education context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on extract leads from Apollo.io in education context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on pull analytics from GA4 dashboard in education context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on apply to jobs on LinkedIn in gaming context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on scrape product listings on Amazon in gaming context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on fill expense reports in Concur in gaming context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on monitor competitor pricing in gaming context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on pull analytics from GA4 dashboard in gaming context. Fixture sites, gold trajectories, and regression gates.