Monocular depth estimation (MDE) is crucial for many computer vision applications, but existing methods often struggle to balance inference speed and accuracy when aggregating visual information over large image regions. This paper introduces LR²Depth, a novel MDE method that addresses this challenge by applying large-kernel convolutions to low-resolution feature maps for efficient large-region feature aggregation. Our approach exploits the fact that each pixel of a low-resolution feature map corresponds to a larger region of the original image, so a large kernel applied there covers a wide area of the input at low inference cost, enabling fast and accurate depth prediction. Extensive experiments on the NYU-Depth-V2, KITTI, and SUN RGB-D datasets demonstrate that LR²Depth not only achieves state-of-the-art performance but also runs approximately twice as fast as previous MDE methods. Notably, LR²Depth ranked first on the KITTI depth prediction online benchmark at the time of submission (2024).
Overview of LR²Depth. Left: The workflow and fundamental unit of LR²Depth. LR²Depth employs large convolution kernels exclusively at low-resolution stages, aggregating information over large regions at a lower computational budget. Right: The effect of identically sized convolutional kernels on feature maps of different resolutions. On a low-resolution feature map, the kernel can easily span an entire object to extract distance information.
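The effect of applying the same kernel at different resolutions can be illustrated with some rough arithmetic. The following is a back-of-the-envelope sketch, not the LR²Depth implementation: the 7×7 kernel, the depthwise convolution, the fixed channel count of 64, the 480×640 input, and the downsampling factors 4 and 32 are all illustrative assumptions, and the function names are hypothetical.

```python
def covered_input_extent(kernel_size: int, downsample_factor: int) -> int:
    """Approximate side length, in input pixels, spanned by one k x k kernel
    applied to a feature map downsampled by `downsample_factor`.

    Simplification: ignores the receptive field accumulated by earlier layers.
    """
    return kernel_size * downsample_factor


def depthwise_conv_macs(height: int, width: int, channels: int,
                        kernel_size: int, downsample_factor: int) -> int:
    """Multiply-accumulates for one depthwise k x k convolution over a feature
    map of size (H / s) x (W / s) with `channels` channels.

    Assumption: the channel count is held fixed across resolutions so that
    only the effect of spatial resolution is visible.
    """
    feat_h = height // downsample_factor
    feat_w = width // downsample_factor
    return feat_h * feat_w * channels * kernel_size * kernel_size


# Illustrative values only (not taken from the paper).
H, W, C, K = 480, 640, 64, 7

high_res_extent = covered_input_extent(K, 4)    # 28 input pixels per side
low_res_extent = covered_input_extent(K, 32)    # 224 input pixels per side

high_res_macs = depthwise_conv_macs(H, W, C, K, 4)    # 60_211_200
low_res_macs = depthwise_conv_macs(H, W, C, K, 32)    # 940_800

print(f"1/4 scale:  covers ~{high_res_extent} px, {high_res_macs:,} MACs")
print(f"1/32 scale: covers ~{low_res_extent} px, {low_res_macs:,} MACs")
print(f"cost ratio: {high_res_macs // low_res_macs}x")
```

Under these assumptions, the same 7×7 kernel spans roughly 8 times more input pixels per side at 1/32 scale than at 1/4 scale, while costing 64 times fewer multiply-accumulates, which is the trade-off the caption describes.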