
Depth Annotation to Edge

Abstract

We study monocular metric depth estimation (MMDE) without camera intrinsics at training or inference. When focal length and scene depth vary together, depth changes are difficult to perceive from the image alone, yet edge-frequency statistics exhibit systematic, scale-correlated shifts. Building on this observation, we introduce a spectral quantile estimator (SQE) that analyzes the Fourier spectrum of a predicted edge map and outputs a single score used as a proxy for metric scale. On this basis, we propose MD2E, a method that models depth-to-edge cues in three ways: it derives edge targets from depth annotations, calibrates metric scale using the spectral score, and uses edge predictions to regularize depth boundaries while producing metric depth. Across diverse cameras and datasets, MD2E achieves state-of-the-art monocular metric depth in both zero-shot and fine-tuning settings without camera metadata.

Method

Overview of MD2E. A monocular depth estimation (MDE) model processes the input image to produce an edge map and an initial depth map. Dense depth labels are converted into edge annotations that supervise the edge branch. A spectral quantile estimator computes an edge score t_pred from the predicted edge map, and this score calibrates the initial depth to metric scale. The final depth is supervised by both the ground-truth depth and the predicted edges, yielding sharp and accurate metric depth.
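The two data transformations in the overview can be illustrated with a minimal NumPy sketch. It is not the MD2E implementation: the page does not specify how edges are derived from depth or how the spectral quantile is defined, so the log-depth gradient threshold (`rel_threshold`), the radial energy quantile (`q`), and both function names are assumptions chosen for illustration only.

```python
import numpy as np


def depth_to_edge(depth, rel_threshold=0.05):
    """Derive a binary edge map from a dense depth map.

    Assumption: an edge is a pixel whose log-depth gradient magnitude
    exceeds rel_threshold. The actual MD2E rule may differ.
    """
    log_d = np.log(np.clip(depth, 1e-6, None))  # avoid log(0)
    gy, gx = np.gradient(log_d)                 # derivatives along rows, columns
    return (np.hypot(gx, gy) > rel_threshold).astype(np.float32)


def spectral_quantile_score(edge_map, q=0.9):
    """Return the radial frequency below which a fraction q of the
    edge map's spectral energy lies (in cycles per pixel).

    Assumption: this is one plausible reading of a "spectral quantile";
    it stands in for the score t_pred and is not the paper's definition.
    """
    h, w = edge_map.shape
    centered = edge_map - edge_map.mean()       # remove the DC component
    power = np.abs(np.fft.fft2(centered)) ** 2  # 2D power spectrum
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.hypot(fx, fy).ravel()           # radial frequency per bin
    order = np.argsort(radius)
    cumulative = np.cumsum(power.ravel()[order])
    if cumulative[-1] == 0:                     # blank edge map: no energy
        return 0.0
    idx = np.searchsorted(cumulative, q * cumulative[-1])
    return float(radius[order][min(idx, radius.size - 1)])


# Usage: a synthetic depth map with one vertical depth discontinuity.
depth = np.ones((64, 64))
depth[:, 32:] = 2.0
edges = depth_to_edge(depth)
t_pred = spectral_quantile_score(edges)
```

Under these assumptions, an edge map with denser, finer structure concentrates energy at higher radial frequencies and so receives a larger score. How that score is mapped to a metric scale factor is not described on this page and is left out of the sketch.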

Main Results

Depth Prediction

MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation

(CVPR 2026)
