We study monocular metric depth estimation (MMDE) without camera intrinsics at training or inference. When focal length and scene depth vary together, changes in depth are difficult to perceive from the image alone, yet the edge-frequency statistics exhibit systematic, scale-correlated shifts. Building on this observation, we introduce a spectral quantile estimator (SQE) that analyzes the Fourier spectrum of a predicted edge map and outputs a single score used as a proxy for metric scale. Based on the SQE, we propose MD2E, a method that models depth-to-edge cues in three ways: it derives edge targets from depth annotations, calibrates metric scale using the spectral score, and uses edge predictions to regularize depth boundaries while producing metric depth. Across diverse cameras and datasets, MD2E achieves state-of-the-art monocular metric depth estimation in both zero-shot and fine-tuning settings without camera metadata.
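To make the idea of a spectral score concrete, the following is a minimal sketch of one plausible quantile statistic over the Fourier spectrum of an edge map. It is not the SQE as defined in this work: the function name `spectral_quantile_score`, the energy-weighted radial-frequency quantile, and the default `q=0.5` are all assumptions chosen for illustration.

```python
import numpy as np

def spectral_quantile_score(edge_map, q=0.5):
    """Hypothetical score: the radial frequency (cycles/pixel) below which
    a fraction q of the edge map's spectral energy lies."""
    edge = np.asarray(edge_map, dtype=np.float64)
    edge = edge - edge.mean()  # drop the DC component

    # Power spectrum of the edge map.
    power = np.abs(np.fft.fft2(edge)) ** 2

    # Radial frequency of every FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(edge.shape[0])[:, None]
    fx = np.fft.fftfreq(edge.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2).ravel()
    power = power.ravel()

    total = power.sum()
    if total == 0.0:
        return 0.0  # a constant map carries no edge energy

    # Energy-weighted cumulative distribution over increasing frequency.
    order = np.argsort(radius)
    cdf = np.cumsum(power[order]) / total
    idx = min(int(np.searchsorted(cdf, q)), cdf.size - 1)
    return float(radius[order][idx])

# Toy inputs: vertical stripes with 2 and 16 cycles across 64 pixels.
x = np.arange(64)
coarse = np.tile(np.sin(2 * np.pi * 2 * x / 64), (64, 1))
fine = np.tile(np.sin(2 * np.pi * 16 * x / 64), (64, 1))
score_coarse = spectral_quantile_score(coarse)
score_fine = spectral_quantile_score(fine)
```

In this toy setting, the finer stripe pattern yields the larger score, which illustrates how a single scalar can track shifts in edge-frequency content. How such a score maps to metric scale is not specified here and is left out of the sketch.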
Overview of MD2E. An input image is processed by the monocular depth estimation (MDE) model to produce an edge map and an initial depth map. Dense depth labels are transformed into edge annotations to supervise the edge branch. A spectral quantile estimator computes an edge score t_pred from the predicted edge map, which calibrates the initial depth to metric scale. The final depth is trained with supervision from both the ground-truth depth and the edge predictions, yielding sharp and accurate metric depth.
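The step that turns dense depth labels into edge annotations can be sketched as follows. This is an illustrative assumption rather than the procedure used by MD2E: the function name `depth_to_edge_targets`, the use of log-depth gradients, and the threshold value `0.1` are hypothetical choices.

```python
import numpy as np

def depth_to_edge_targets(depth, threshold=0.1):
    """Hypothetical edge labels: pixels where the gradient magnitude of
    log-depth exceeds a threshold. Assumes dense, strictly positive depth."""
    depth = np.asarray(depth, dtype=np.float64)
    if np.any(depth <= 0):
        raise ValueError("depth must be dense and strictly positive")

    # Log-depth makes the response depend on relative depth jumps,
    # not on the absolute scale of the scene.
    log_depth = np.log(depth)
    gy, gx = np.gradient(log_depth)  # central differences along rows, cols
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return magnitude > threshold

# Toy depth map: left half at 1 m, right half at 2 m.
depth = np.ones((8, 64))
depth[:, 32:] = 2.0
edges = depth_to_edge_targets(depth)
```

On this toy depth map, the only marked pixels are the two columns that straddle the depth discontinuity. Working in log-depth is one design option among several; a threshold on raw depth gradients would instead favor boundaries in distant regions.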