Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution: they maintain fixed-size states and enable linear-time inference, but the limited capacity of their compressed latent memories often leads to drift accumulation and temporal forgetting over long sequences. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory, implemented as a lightweight multi-layer perceptron (MLP) updated via Test-Time Training (TTT). For geometric mapping, Mem3R maintains an explicit, fixed-size token-based state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R remains compatible with plug-and-play state-update strategies developed for CUT3R: integrating it with TTT3R reduces Absolute Trajectory Error by up to 39% over the base implementation on 500- to 1000-frame sequences. These improvements extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput.
We present Mem3R, an RNN-based model built on the CUT3R paradigm that achieves stronger long-sequence streaming 3D perception. Its dual-memory design combines (i) an implicit memory W for camera pose estimation and (ii) an explicit memory of persistent tokens for global geometric context. Mem3R is further compatible with plug-and-play state-update strategies developed for CUT3R, such as TTT3R, yielding additional gains in reconstruction quality and camera pose estimation. Replacing CUT3R's heavy pose-related state tokens and decoder with a lightweight implicit MLP-based memory reduces the parameter count by about 19%, from 793M to 644M.
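To give a rough feel for the implicit fast-weight memory, the sketch below shows the core Test-Time Training idea on a deliberately simplified *linear* fast weight (Mem3R's actual memory is an MLP with its own objective; the loss, learning rate, and dimensions here are illustrative assumptions, not the paper's design). For each incoming observation, the memory's weights take a gradient step on a self-supervised reconstruction loss, then are queried for the current prediction:

```python
import numpy as np

def ttt_update(W, k, v, lr=0.05):
    """One test-time-training step on a linear fast-weight memory.

    Takes a gradient step on the reconstruction loss ||W @ k - v||^2
    for the current (key, value) pair, so W absorbs the new observation.
    """
    err = W @ k - v                  # prediction error for this observation
    W = W - lr * np.outer(err, k)    # gradient step: dL/dW = 2 * err k^T (folded into lr)
    return W

def ttt_read(W, q):
    """Query the fast-weight memory with features q of the current frame."""
    return W @ q

rng = np.random.default_rng(0)
d = 8                                # illustrative feature dimension
W = np.zeros((d, d))                 # fast-weight state, fixed size
k = rng.standard_normal(d)           # key: current observation features
v = rng.standard_normal(d)           # value: target the memory should reproduce

# Repeated TTT steps drive the reconstruction error toward zero.
for _ in range(200):
    W = ttt_update(W, k, v)
print(np.linalg.norm(ttt_read(W, k) - v))  # small residual
```

Because W has a fixed size regardless of sequence length, this kind of memory keeps per-frame compute and GPU usage constant, which matches the streaming setting described above.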
End-to-end streaming 3D reconstruction without state resets or post-hoc optimization.
Overview of Mem3R. Top: For each frame I_t in the streaming image sequence, a ViT encoder extracts image features F_t. The fast-weight module W and its decoder perform camera tracking under Test-Time Training (TTT), while the fixed-size state S and its decoder preserve and update geometric information, producing an intermediate state S'_t. S'_t is then fused with the previous state S_{t-1} through a channel-wise update module to obtain the final updated state S_t. By replacing the pose-related state tokens with W, Mem3R maintains a constant-size recurrent state while reducing long-term temporal forgetting in streaming 3D reconstruction. Bottom: Illustration of the fast-weight module and its decoder under TTT.
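The channel-wise update module in the figure can be read as a per-channel gate that interpolates between the previous state S_{t-1} and the intermediate state S'_t. The sketch below is a hypothetical parametrization of such a gate (the gate's inputs, weights, and dimensions are assumptions for illustration, not Mem3R's actual module):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_update(S_prev, S_tilde, w_gate, b_gate):
    """Channel-wise gated fusion of the intermediate state with the previous one.

    A per-token, per-channel gate in (0, 1) decides how much of each channel
    is overwritten by the new observation vs. retained from the old state.
    Here the gate is computed from the concatenated states (an assumption).
    """
    gate_in = np.concatenate([S_prev, S_tilde], axis=-1)  # (tokens, 2C)
    g = sigmoid(gate_in @ w_gate + b_gate)                # (tokens, C)
    return g * S_tilde + (1.0 - g) * S_prev               # convex combination

rng = np.random.default_rng(1)
T, C = 4, 6                              # illustrative: state tokens, channels
S_prev = rng.standard_normal((T, C))     # S_{t-1}
S_tilde = rng.standard_normal((T, C))    # S'_t from the decoder
w = rng.standard_normal((2 * C, C)) * 0.1
b = np.zeros(C)
S_new = channel_update(S_prev, S_tilde, w, b)  # S_t, same fixed size (T, C)
```

Since the gate yields a per-channel convex combination, every entry of S_t lies between the corresponding entries of S_{t-1} and S'_t, and the state size stays constant across frames.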
We quantitatively evaluate camera pose estimation on the ScanNet dataset and the TUM Dynamics dataset. Mem3R achieves significant improvements over CUT3R. Moreover, when combined with the training-free, plug-and-play state-update strategies TTT3R or TTSA3R, Mem3R yields further gains over applying the same strategies to CUT3R.
Quantitative evaluation of camera pose estimation on the ScanNet dataset (left) and the TUM Dynamics dataset (right).
| Method | Runtime (fps) ↑ | Memory (MiB) ↓ | Params ↓ |
|---|---|---|---|
| CUT3R | 26 | 7930 | 793M |
| Ours | 26 | 7340 | 644M |
| TTT3R | 25 | 8364 | 793M |
| Ours + TTT3R | 25 | 7774 | 644M |
| TTSA3R | 25 | 8786 | 793M |
| Ours + TTSA3R | 25 | 8208 | 644M |
Efficiency comparison of runtime (fps) and GPU memory usage (MiB). Green indicates that our model matches or outperforms its corresponding base model.
Changkun Liu is supported by Android XR, Google. We would like to thank Guangyao Zhai for the valuable discussions during the initial stages of this project, and Haian Jin for assistance with the training setup. We appreciate the great code bases and examples provided by CUT3R, TTT3R, TTSA3R, and LaCT.