DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
Reliable 4D object detection, i.e., 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has also been hindered by the lack of large-scale datasets that provide continuous, reliable 3D bounding box (b-box) annotations.
To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set, end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundation models and employs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths.
Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection.
The data processing pipeline for the 4D detection task. We record posed RGB frames sequentially and split the recordings into fixed-length sequences. Objects in global coordinates are projected into the ego view and filtered with policies that remove occluded and out-of-view objects. Object b-boxes are then recalculated according to visibility and accumulated using the point cloud within the sequence. Finally, the coordinates of each sequence are re-referenced to its first frame.
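To make the caption concrete, here is a minimal, self-contained Python sketch of such a per-sequence preprocessing step. It assumes each frame is a dict carrying the image, a 4x4 world-from-ego pose, intrinsics K, the image size, and world-frame box annotations; the sequence length, the frustum test, and all names are illustrative assumptions rather than the released pipeline, and the occlusion policy and point-cloud-based box refitting are only indicated by comments.

import numpy as np

SEQ_LEN = 32  # assumed fixed clip length; the actual value is not stated above

def split_into_sequences(frames, seq_len=SEQ_LEN):
    """Cut a posed RGB recording into fixed-length clips."""
    return [frames[i:i + seq_len] for i in range(0, len(frames) - seq_len + 1, seq_len)]

def in_view(center_ego, K, img_w, img_h):
    """Keep objects whose center projects inside the image with positive depth."""
    if center_ego[2] <= 0:                       # behind the camera
        return False
    u, v, w = K @ center_ego
    return 0 <= u / w < img_w and 0 <= v / w < img_h

def process_sequence(seq):
    """Project world-frame boxes into each ego view, filter them, and re-reference to frame 0."""
    frame0_from_world = np.linalg.inv(seq[0]["pose"])       # pose: 4x4 world-from-ego matrix
    out = []
    for frame in seq:
        ego_from_world = np.linalg.inv(frame["pose"])
        kept = []
        for box in frame["boxes_world"]:                     # box: dict with 'center' (3,), 'size', 'yaw'
            c_ego = (ego_from_world @ np.append(box["center"], 1.0))[:3]
            if not in_view(c_ego, frame["K"], frame["w"], frame["h"]):
                continue                                     # out-of-view; occlusion policy would also go here
            # visibility-based extent refit from the accumulated sequence point cloud is omitted
            c_seq = (frame0_from_world @ frame["pose"] @ np.append(c_ego, 1.0))[:3]
            kept.append({**box, "center": c_seq})            # box expressed in frame-0 coordinates
        out.append({"image": frame["image"], "boxes": kept})
    return out

The key design point the sketch reflects is that all boxes in a clip end up in one shared coordinate frame (the first frame), so temporal supervision can be applied consistently across the sequence.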
Pipeline of the proposed DetAny4D model. The RGB sequence, together with prompts, is encoded by the feature extractor, producing tokens, image embeddings, and depth- and camera-related embeddings. A Geometry Context Transformer then injects 3D spatial embeddings in a transformer-control manner, and these embeddings, together with those decoded by the Spatiotemporal Transformer, generate the prediction results. Multi-task heads are employed for effective training.
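For a concrete picture of the forward flow, the following is a toy PyTorch skeleton of the caption above. The module names, dimensions, query count, and head outputs are assumptions for illustration, not the released DetAny4D architecture; the foundation-model encoders, prompt handling, and depth/camera embeddings are reduced to simple stand-ins.

import torch
import torch.nn as nn

class DetAny4DSketch(nn.Module):
    """Toy skeleton of the forward flow described in the caption."""

    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        # stand-in for the frozen foundation-model feature extractor (patchify)
        self.feature_extractor = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Geometry Context Transformer: injects 3D spatial context into the tokens
        self.geometry_ctx = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        # Spatiotemporal Transformer: queries cross-attend over tokens of the whole clip
        self.spatiotemporal = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # multi-task heads (assumed): 3D box parameters and objectness
        self.box_head = nn.Linear(dim, 7)   # (x, y, z, w, h, l, yaw)
        self.obj_head = nn.Linear(dim, 1)

    def forward(self, video):                                  # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.feature_extractor(video.flatten(0, 1))    # (B*T, dim, h, w)
        tokens = feats.flatten(2).transpose(1, 2)              # (B*T, h*w, dim)
        tokens = self.geometry_ctx(tokens)                     # inject geometry context per frame
        tokens = tokens.reshape(B, -1, tokens.shape[-1])       # concatenate tokens across time
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        dec = self.spatiotemporal(q, tokens)                   # attend over the full clip at once
        return self.box_head(dec), self.obj_head(dec)

Because the decoder queries attend to tokens from every frame of the clip rather than one frame at a time, predictions for the same object can stay consistent across the sequence, which is the property the paper targets with its temporal-stability evaluation.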
@inproceedings{hou2025detany4d,
  title     = {DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video},
  author    = {Hou, Jiawei and Zhang, Shenghao and Wang, Can and Gu, Zheng and Ling, Yonggen and Zeng, Taiping and Xue, Xiangyang and Zhang, Jingbo},
  booktitle = {arXiv},
  year      = {2025},
}