Using persistent object instances as anchors to organize long-horizon 3D reconstruction.
We present LIST3R, an instance-aware framework for long-sequence 3D reconstruction inspired by the way humans organize spatial memory around stable, recognizable objects. LIST3R uses instance anchors to reconnect fragmented subsequences and consolidate local observations into a coherent global 3D scene. Given a long video, our approach partitions it into overlapping subsequences and builds a local instance library for each partial reconstruction, maintaining trackable anchors with semantic and geometric evidence. These anchors are matched across subsequences to recover revisited regions and provide object-aware constraints for fragment alignment, producing a global reconstruction. During this process, the evolving geometric evidence updates the local instance libraries and progressively organizes them into a unified global 3D instance library. Experiments on long-sequence benchmarks (TUM, ETH3D, BONN, and NRGBD) show that our method produces more accurate trajectories and higher-quality 3D reconstructions than the compared baselines on most metrics, highlighting the effectiveness of persistent instance anchors for organizing long-horizon 3D reconstruction.
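The partitioning step mentioned above can be illustrated with a minimal sketch. The function name `partition_with_overlap` and the parameters `window` and `overlap` are hypothetical; the text does not state the subsequence length or overlap that LIST3R actually uses.

```python
from typing import List


def partition_with_overlap(num_frames: int, window: int, overlap: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into windows that share `overlap` frames.

    `window` and `overlap` are illustrative parameters, not values from LIST3R.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    stride = window - overlap  # how far each new window advances
    chunks: List[range] = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        chunks.append(range(start, end))
        if end == num_frames:  # the last window reached the end of the video
            break
        start += stride
    return chunks


if __name__ == "__main__":
    for chunk in partition_with_overlap(num_frames=10, window=4, overlap=2):
        print(list(chunk))
```

The shared frames are what would let neighboring partial reconstructions be registered to each other before any long-range, anchor-based association takes place.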
Our core idea is to use recognizable object instances as persistent anchors throughout the reconstruction process. Our method follows a three-stage pipeline. First, we build a local instance library for each subsequence, organizing partial reconstructions around trackable object-centric cues. Second, these local instance libraries are used to establish cross-subsequence associations, including long-range revisit discovery and instance-aware subsequence merging, which are further refined by confidence-weighted optimization for global consistency. Finally, local instance observations are consolidated into a unified 3D instance library.
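To make the second stage more concrete, here is a rough sketch of one way anchors from two local instance libraries could be associated. It is not the authors' matching procedure: the `Anchor` fields, the cosine-similarity threshold, the size-ratio check, and the greedy one-to-one assignment are all assumptions chosen for illustration. The size check also assumes both fragments are reconstructed at roughly the same scale.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Anchor:
    anchor_id: str
    descriptor: Sequence[float]  # hypothetical semantic embedding
    extent: float                # hypothetical geometric cue: bounding-box diagonal


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def match_anchors(
    lib_a: List[Anchor],
    lib_b: List[Anchor],
    min_sim: float = 0.8,
    max_extent_ratio: float = 1.5,
) -> List[Tuple[str, str, float]]:
    """Greedy one-to-one matching; the similarity doubles as a match confidence."""
    candidates = []
    for a in lib_a:
        for b in lib_b:
            sim = cosine(a.descriptor, b.descriptor)
            big, small = max(a.extent, b.extent), min(a.extent, b.extent)
            # Keep pairs that agree semantically and have comparable size.
            if sim >= min_sim and small > 0 and big / small <= max_extent_ratio:
                candidates.append((sim, a.anchor_id, b.anchor_id))
    candidates.sort(reverse=True)  # most similar pairs are assigned first
    used_a, used_b, matches = set(), set(), []
    for sim, id_a, id_b in candidates:
        if id_a in used_a or id_b in used_b:
            continue
        used_a.add(id_a)
        used_b.add(id_b)
        matches.append((id_a, id_b, sim))
    return matches
```

In a pipeline like the one described, the returned similarity could serve as the per-match weight in a confidence-weighted alignment, so that ambiguous associations pull less on the merged result.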
| Method | TUM ATE↓ | TUM RTE↓ | TUM RRE↓ | ETH3D ATE↓ | ETH3D RTE↓ | ETH3D RRE↓ | BONN ATE↓ | BONN RTE↓ | BONN RRE↓ |
|---|---|---|---|---|---|---|---|---|---|
| CUT3R | 0.866 | 0.963 | 40.19 | 2.895 | 2.537 | 43.04 | 0.319 | **0.561** | 58.13 |
| TTT3R | 0.317 | 0.385 | 9.92 | 1.317 | 0.939 | 10.33 | 0.149 | 0.759 | 47.51 |
| VGGT-Long | 0.325 | 0.489 | 25.21 | 1.292 | 1.701 | 32.92 | 0.123 | 0.787 | 47.43 |
| π-Long | 0.208 | 0.279 | 7.81 | 0.562 | 0.455 | 13.65 | 0.094 | 0.770 | 48.01 |
| Scal3R | 0.267 | 0.329 | **5.72** | 0.807 | 0.590 | **7.00** | 0.117 | 0.779 | 49.09 |
| LIST3R (Ours) | **0.150** | **0.211** | 6.97 | **0.516** | **0.444** | 9.32 | **0.085** | 0.779 | **45.89** |

Camera pose estimation on long sequences. ATE and RTE are in meters and RRE is in degrees; lower is better for all three. Bold = best.
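As a reading aid for the ATE column, the sketch below computes absolute trajectory error as the root-mean-square distance between estimated and ground-truth camera positions. The RMSE form and the helper name `ate_rmse` are assumptions, since the page does not spell out the exact protocol. The sketch also assumes the two trajectories are already time-associated and expressed in the same frame, so the usual similarity or rigid alignment step is omitted.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def ate_rmse(estimated: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """RMSE of per-frame position error, in the units of the inputs (e.g. meters).

    Assumes both trajectories are already aligned and matched frame by frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

RTE and RRE are relative errors, typically computed on pose differences between frames rather than on absolute positions, which is why a method can rank differently on them than on ATE.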
| Method | ETH3D Chamfer↓ | ETH3D Acc↓ | ETH3D Comp↓ | ETH3D NC↑ | ETH3D F@5↑ | NRGBD Chamfer↓ | NRGBD Acc↓ | NRGBD Comp↓ | NRGBD NC↑ | NRGBD F@5↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| CUT3R | 140.0 | 62.8 | 217.3 | 0.536 | 4.0 | 73.2 | 50.1 | 96.4 | 0.575 | 9.0 |
| TTT3R | 102.6 | 36.6 | 168.6 | 0.610 | 7.3 | 41.3 | 26.2 | 56.4 | 0.647 | 22.2 |
| VGGT-Long | 50.6 | 56.7 | 44.4 | 0.618 | 19.8 | 6.1 | 5.3 | 6.9 | 0.857 | 68.9 |
| π-Long | 41.5 | 37.8 | 45.3 | 0.686 | 32.7 | 5.0 | 4.4 | 5.5 | **0.876** | 68.9 |
| Scal3R | 33.8 | 36.5 | 31.1 | 0.658 | 26.2 | 7.7 | 4.2 | 11.1 | 0.829 | 71.2 |
| LIST3R (Ours) | **27.4** | **31.1** | **23.6** | **0.709** | **36.5** | **4.7** | **4.1** | **5.3** | 0.875 | **73.4** |

Point cloud reconstruction quality. Chamfer, Acc, and Comp are in cm. Bold = best.
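The following sketch shows how the distance-based metrics in this table are commonly defined. These are the usual conventions rather than the authors' confirmed evaluation code. Accuracy is the mean distance from each predicted point to its nearest ground-truth point, completeness is the reverse, and Chamfer is taken as their average, which is consistent with the reported numbers (for example, (31.1 + 23.6) / 2 ≈ 27.4). The F-score threshold `tau` is an assumed parameter, and the brute-force nearest-neighbor search is only suitable for small point sets.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_distances(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Distance from each point in `src` to its nearest neighbor in `dst` (brute force)."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def reconstruction_metrics(
    pred: Sequence[Point], gt: Sequence[Point], tau: float
) -> Dict[str, float]:
    """Accuracy, completeness, Chamfer, and F-score at threshold `tau` (same units as inputs)."""
    d_pred = nearest_distances(pred, gt)  # prediction -> ground truth
    d_gt = nearest_distances(gt, pred)    # ground truth -> prediction
    acc = sum(d_pred) / len(d_pred)
    comp = sum(d_gt) / len(d_gt)
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return {
        "acc": acc,
        "comp": comp,
        "chamfer": 0.5 * (acc + comp),
        "f_score": f_score,
    }
```

For real point clouds, a spatial index such as a k-d tree would replace the quadratic search. NC (normal consistency) additionally requires surface normals and is not covered here.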