LIST3R · Long-sequence Instance-aware 3D Reconstruction

Overview

We present LIST3R, an instance-aware framework for long-sequence 3D reconstruction inspired by the way humans organize spatial memory around stable and recognizable objects. LIST3R organizes long-sequence reconstruction around instance anchors, using them to reconnect fragmented subsequences and consolidate local observations into a coherent global 3D scene. Given a long video, our approach partitions it into overlapping subsequences and builds a local instance library for each partial reconstruction, maintaining trackable anchors with semantic and geometric evidence. These anchors are matched across subsequences to recover revisited regions and provide object-aware constraints for fragment alignment, producing a global reconstruction. During this process, the evolving geometric evidence updates the local instance libraries and progressively organizes them into a unified global 3D instance library. Experiments on long-sequence benchmarks show that our method produces more accurate trajectories and higher-quality 3D reconstructions than prior long-sequence methods, including the lowest ATE on TUM, ETH3D, and BONN, highlighting the effectiveness of persistent instance anchors for organizing long-horizon 3D reconstruction.
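To make the partitioning step concrete, here is a minimal sketch of splitting a frame range into overlapping subsequences. The function name `partition_frames` and the `window` and `overlap` parameters are illustrative assumptions; the window length and overlap actually used by LIST3R are not stated here.

```python
from typing import List, Tuple


def partition_frames(num_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Split frame indices [0, num_frames) into overlapping half-open ranges.

    `window` and `overlap` are hypothetical parameters chosen for illustration.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    stride = window - overlap  # how far each new subsequence advances
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        chunks.append((start, end))
        if end == num_frames:  # last subsequence reached the end of the video
            break
        start += stride
    return chunks


print(partition_frames(num_frames=10, window=4, overlap=2))
# [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Consecutive subsequences share `overlap` frames, which is what gives neighbouring partial reconstructions common observations to align on.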

Teaser: LIST3R leverages instance guidance to recover more effective **revisits** and smoother **cross-subsequence alignment**, producing more accurate and stable camera trajectories than the baseline.

Method

Our core idea is to use recognizable object instances as persistent anchors throughout the reconstruction process. Our method follows a three-stage pipeline. First, we build a local instance library for each subsequence, organizing partial reconstructions around trackable object-centric cues. Second, these local instance libraries are used to establish cross-subsequence associations, including long-range revisit discovery and instance-aware subsequence merging, which are further refined by confidence-weighted optimization for global consistency. Finally, local instance observations are consolidated into a unified 3D instance library.
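The cross-subsequence association step can be illustrated with a small sketch. This is not the LIST3R matching procedure. It assumes each anchor in a local instance library is summarized by a single descriptor vector, and it pairs anchors by mutual nearest neighbour under cosine similarity. The names `match_anchors` and `min_sim`, the descriptor format, and the threshold value are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Library = Dict[str, Sequence[float]]  # hypothetical: anchor id -> descriptor


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def match_anchors(lib_a: Library, lib_b: Library, min_sim: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, similarity) for mutual nearest neighbours above min_sim."""
    if not lib_a or not lib_b:
        return []
    # Best candidate in the other library for every anchor, in both directions.
    best_for_a = {ka: max(lib_b, key=lambda kb: cosine(va, lib_b[kb])) for ka, va in lib_a.items()}
    best_for_b = {kb: max(lib_a, key=lambda ka: cosine(lib_a[ka], vb)) for kb, vb in lib_b.items()}
    matches = []
    for ka, kb in best_for_a.items():
        sim = cosine(lib_a[ka], lib_b[kb])
        # Keep a pair only if each anchor picks the other and the score is high enough.
        if best_for_b[kb] == ka and sim >= min_sim:
            matches.append((ka, kb, sim))
    return matches


lib_a = {"chair_0": [1.0, 0.0, 0.0], "table_0": [0.0, 1.0, 0.0]}
lib_b = {"obj_7": [0.9, 0.1, 0.0], "obj_9": [0.0, 0.0, 1.0]}
print([(a, b) for a, b, _ in match_anchors(lib_a, lib_b)])
# [('chair_0', 'obj_7')]
```

The mutual check discards one-sided matches such as `table_0` to `obj_7`. The similarity score could plausibly serve as a confidence weight in a later optimization, though that too is an assumption rather than a description of the method.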

Quantitative Analysis

Method	TUM ATE↓	TUM RTE↓	TUM RRE↓	ETH3D ATE↓	ETH3D RTE↓	ETH3D RRE↓	BONN ATE↓	BONN RTE↓	BONN RRE↓
CUT3R	0.866	0.963	40.19	2.895	2.537	43.04	0.319	**0.561**	58.13
TTT3R	0.317	0.385	9.92	1.317	0.939	10.33	0.149	0.759	47.51
VGGT-Long	0.325	0.489	25.21	1.292	1.701	32.92	0.123	0.787	47.43
π-Long	0.208	0.279	7.81	0.562	0.455	13.65	0.094	0.770	48.01
Scal3R	0.267	0.329	**5.72**	0.807	0.590	**7.00**	0.117	0.779	49.09
LIST3R (Ours)	**0.150**	**0.211**	6.97	**0.516**	**0.444**	9.32	**0.085**	0.779	**45.89**

Camera pose estimation on long sequences. ATE and RTE are in meters, RRE in degrees; lower is better for all three. Bold = best.
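For readers unfamiliar with ATE, the following sketch shows one common way to compute it: align the estimated camera positions to the ground truth with a least-squares similarity transform (Umeyama alignment), then take the RMSE of the residuals. Whether the numbers above use Sim(3) or SE(3) alignment is not stated, so the alignment choice is an assumption, as are the function names `align_umeyama` and `ate_rmse`.

```python
import numpy as np


def align_umeyama(est: np.ndarray, gt: np.ndarray):
    """Least-squares similarity (scale, R, t) mapping est onto gt; both are (N, 3)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / est.shape[0]           # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                     # flip to avoid a reflection
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / est.shape[0]  # variance of the estimated positions
    scale = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - scale * R @ mu_e
    return scale, R, t


def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """RMSE of position error after similarity alignment (assumed convention)."""
    scale, R, t = align_umeyama(est, gt)
    aligned = scale * est @ R.T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))


# Synthetic check: an estimate that differs from ground truth only by a
# similarity transform should give an ATE of (numerically) zero.
gt = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1], [2, 1, 3]])
c, s = np.cos(0.5), np.sin(0.5)
Rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
est = 0.5 * gt @ Rz.T + np.array([1.0, 2.0, 3.0])
print(ate_rmse(est, gt) < 1e-9)
```

RTE and RRE are computed on relative motion between frame pairs rather than on globally aligned positions, and are not covered by this sketch.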

**Estimated long-sequence camera trajectories.**

Method	ETH3D Chamfer↓	ETH3D Acc↓	ETH3D Comp↓	ETH3D NC↑	ETH3D F@5↑	NRGBD Chamfer↓	NRGBD Acc↓	NRGBD Comp↓	NRGBD NC↑	NRGBD F@5↑
CUT3R	140.0	62.8	217.3	0.536	4.0	73.2	50.1	96.4	0.575	9.0
TTT3R	102.6	36.6	168.6	0.610	7.3	41.3	26.2	56.4	0.647	22.2
VGGT-Long	50.6	56.7	44.4	0.618	19.8	6.1	5.3	6.9	0.857	68.9
π-Long	41.5	37.8	45.3	0.686	32.7	5.0	4.4	5.5	**0.876**	68.9
Scal3R	33.8	36.5	31.1	0.658	26.2	7.7	4.2	11.1	0.829	71.2
LIST3R (Ours)	**27.4**	**31.1**	**23.6**	**0.709**	**36.5**	**4.7**	**4.1**	**5.3**	0.875	**73.4**

Point cloud reconstruction quality. Chamfer, Acc, and Comp are in cm (lower is better); NC and F@5 are higher-is-better. Bold = best.
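The point cloud metrics can be sketched as follows, under stated assumptions. Acc is taken as the mean distance from each predicted point to its nearest ground-truth point, Comp as the reverse, and Chamfer as their average. That last convention agrees with the tabulated values, for example (31.1 + 23.6) / 2 ≈ 27.4 for LIST3R on ETH3D. The F-score threshold `tau` is assumed to be 5 cm for F@5 (0.05 if the inputs are in meters); the exact evaluation protocol is not given here. NC needs surface normals and is omitted.

```python
import numpy as np


def nn_dist(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Distance from each src point to its nearest dst point (brute force, O(N*M))."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)


def point_cloud_metrics(pred: np.ndarray, gt: np.ndarray, tau: float = 0.05) -> dict:
    """Acc, Comp, Chamfer, and F-score at threshold tau (same units as the inputs)."""
    acc_d = nn_dist(pred, gt)    # prediction -> ground truth
    comp_d = nn_dist(gt, pred)   # ground truth -> prediction
    acc, comp = float(acc_d.mean()), float(comp_d.mean())
    precision = float((acc_d < tau).mean())
    recall = float((comp_d < tau).mean())
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"acc": acc, "comp": comp, "chamfer": 0.5 * (acc + comp), "f_score": f_score}


# Toy example: the prediction misses one ground-truth point.
pred = np.array([[0.0, 0, 0], [1, 0, 0]])
gt = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0]])
print(point_cloud_metrics(pred, gt))
```

In the toy example every predicted point lies on the ground truth, so Acc is 0, while the missing point raises Comp and lowers recall. Real evaluations would typically replace the brute-force search with a KD-tree for large clouds.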

**Qualitative results for long-sequence 3D reconstruction.**
