§ 01

Overview

Schematic of the DiscoverPhysics benchmarking pipeline
Fig. 1 · The benchmark loop. For each world, the agent proposes initial conditions for an N-body simulator, observes the resulting noisy trajectories, and after a fixed budget of experimentation rounds submits both a natural-language explanation and a Python implementation of the inferred law of motion. Both outputs are scored against the world's hidden ground-truth physics — the explanation by an LLM judge, the implementation by simulating it forward and comparing trajectories.

Frontier LLMs perform strongly across physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. DiscoverPhysics asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own.

We construct eleven public and eleven private worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, and time-varying interactions. Each world is generated on demand by an N-body simulator. The agent proposes several rounds of experiments, observes raw trajectory data, and submits both a natural-language explanation and a Python implementation of the inferred law.
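To make the setup concrete, here is a minimal sketch of what one such world and one experiment round might look like. It is purely illustrative: the Yukawa-style screened force, the 2D setting, the velocity Verlet integrator, the Gaussian observation noise, and every name and parameter value (`screened_gravity_accel`, `run_experiment`, `screening_length`, `noise_std`) are assumptions, not the benchmark's actual simulator or force laws.

```python
import math
import random


def screened_gravity_accel(positions, masses, G=1.0, screening_length=2.0):
    """Accelerations for a hypothetical screened potential V = -G m_i m_j exp(-r/L) / r."""
    n = len(positions)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            r = math.hypot(dx, dy)
            # |a_i| = G m_j exp(-r/L) (1/r^2 + 1/(r L)), pointing toward body j
            mag = (
                G * masses[j] * math.exp(-r / screening_length)
                * (1.0 / r**2 + 1.0 / (r * screening_length))
            )
            acc[i][0] += mag * dx / r
            acc[i][1] += mag * dy / r
    return acc


def run_experiment(positions, velocities, masses, accel_fn,
                   dt=0.01, steps=200, noise_std=0.01, seed=0):
    """Integrate with velocity Verlet; return noisy observed positions per step."""
    rng = random.Random(seed)
    pos = [list(p) for p in positions]
    vel = [list(v) for v in velocities]
    acc = accel_fn(pos, masses)
    observed = []
    for _ in range(steps):
        # Half-step velocity update, then full-step position update
        for i in range(len(pos)):
            for d in range(2):
                vel[i][d] += 0.5 * dt * acc[i][d]
                pos[i][d] += dt * vel[i][d]
        acc = accel_fn(pos, masses)
        # Second half-step velocity update with the new accelerations
        for i in range(len(pos)):
            for d in range(2):
                vel[i][d] += 0.5 * dt * acc[i][d]
        # The agent only ever sees positions corrupted by observation noise
        observed.append([[x + rng.gauss(0.0, noise_std) for x in p] for p in pos])
    return observed


# One "experiment": the agent chooses the initial conditions of two bodies.
trajectory = run_experiment(
    positions=[(-0.5, 0.0), (0.5, 0.0)],
    velocities=[(0.0, -0.5), (0.0, 0.5)],
    masses=[1.0, 1.0],
    accel_fn=screened_gravity_accel,
)
```

In this sketch, an agent's submitted implementation would play the role of `accel_fn`: swapping in a candidate force law and rerunning the same initial conditions yields trajectories that can be compared with the observed ones.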

Across eleven frontier models, the strongest agents pass only about half of the worlds and consistently fail on those where latent structure must be uncovered. Predictive accuracy and conceptual understanding decouple: the model with the lowest trajectory MSE is not the one with the highest explanation score.

22 Physical worlds
11 Models evaluated
~50% Best pass@5
§ 02

Leaderboard

A world counts as passed if its per-trajectory normalized MSE is below 0.1 and its explanation score is ≥ 0.9. Pass@k is the expected percentage of worlds passed when k seeds are sampled (without replacement) from a 5-seed pool, averaged over 1,000 Monte Carlo draws. Norm. MSE is the geometric mean of per-trajectory normalized MSE; lower is better.
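The scoring rules above can be sketched in a few lines of Python. This is a hedged illustration, not the benchmark's evaluation code: the function names and the data layout are hypothetical, the pass check is applied to a single aggregated normalized MSE per seed, and we assume a world counts as passed in a draw if at least one of the k sampled seeds passes.

```python
import math
import random

MSE_THRESHOLD = 0.1          # pass requires normalized MSE below this
EXPLANATION_THRESHOLD = 0.9  # pass requires explanation score at or above this


def geometric_mean(values):
    """Geometric mean of strictly positive values (as used for Norm. MSE)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def seed_passes(norm_mse, explanation_score):
    """Apply the pass criterion to one seed's result on one world."""
    return norm_mse < MSE_THRESHOLD and explanation_score >= EXPLANATION_THRESHOLD


def pass_at_k(seed_results, k, n_draws=1000, rng=None):
    """Monte Carlo estimate of pass@k, as a percentage of worlds.

    seed_results maps a world name to a list of per-seed booleans (the pool).
    Assumption: a world is passed in a draw if any of the k sampled seeds passes.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_draws):
        # Sample k seeds without replacement from each world's pool
        passed = sum(
            any(rng.sample(results, k)) for results in seed_results.values()
        )
        total += passed / len(seed_results)
    return 100.0 * total / n_draws


# Hypothetical results for two worlds with a 5-seed pool each.
results = {
    "world_a": [True, False, False, False, False],
    "world_b": [False, False, False, False, False],
}
print(pass_at_k(results, k=5))  # all 5 seeds are drawn, so this prints 50.0
print(pass_at_k(results, k=1))  # roughly 10, subject to Monte Carlo noise
```

With k equal to the pool size every seed is drawn, so the estimate is exact; for smaller k the value fluctuates slightly from run to run unless the random generator is seeded, as it is here.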

§ 03

Per-world breakdown

Mean explanation score (0 – 1) per model and world, averaged across 5 seeds. Hover any cell for the standard error and the geometric-mean trajectory error. Models are ordered from highest Pass@5 (top) to lowest.

§ 04

Results at a glance

Fig. 2a · Pareto frontier. Explanation score against normalized MSE. The strongest models reach the upper-left corner; some achieve low MSE without high explanation scores, indicating fitting without conceptual understanding. Hover any marker for the full model identity and stats.
Fig. 2b · Pass@k. Expected fraction of worlds passed from k independent attempts. Click any model in the legend to toggle its line.
§ 05

The eleven public worlds

Each world is defined by a hidden force law. Agents see only the simulator output; world names and equations below are revealed in the paper, never to the agent.


Submit a model

We're opening DiscoverPhysics to community submissions. During the initial validation period, please contact matt.sampson@princeton.edu for guidelines on being added to the public leaderboard. To access the full public and private repository, request access here: https://huggingface.co/mattWiemann/DiscoverPhysics

If you'd like to discuss a non-standard evaluation setting (different round budgets, new worlds, custom prompts), open an issue first.
