LR²Depth: Large-Region Aggregation at Low Resolution for Efficient Monocular Depth Estimation

(IROS 2025)

Chao Ning1,2, Weihao Xuan1,2, Wanshui Gan1,2, Naoto Yokoya1,2,*
1The University of Tokyo    2RIKEN

Demo

MDE using LR²Depth on a low-power mobile device

o_image

Abstract

Monocular depth estimation (MDE) is crucial for various computer vision applications, but existing methods often struggle to balance inference speed and accuracy when processing large-region visual information. This paper introduces LR²Depth, a novel MDE method that addresses this challenge by applying large-kernel convolution to low-resolution feature maps for efficient large-region feature aggregation. Our approach leverages the fact that each pixel of a low-resolution feature map corresponds to a larger region of the original image, allowing for fast and accurate depth predictions at a lower inference cost. Extensive experiments on the NYU-Depth-V2, KITTI, and SUN RGB-D datasets demonstrate that LR²Depth not only achieves state-of-the-art performance but also runs approximately twice as fast as previous MDE methods. Notably, at the time of submission in 2024, LR²Depth ranked first on the KITTI depth prediction online benchmark.

o_image

Method

Overview of LR²Depth. Left: The workflow and fundamental unit of LR²Depth. LR²Depth employs large convolution kernels exclusively at low-resolution stages, aggregating information over large regions at a lower computational cost. Right: The effect of identically sized convolutional kernels on feature maps of different resolutions. On a low-resolution feature map, the kernel can easily span an entire object and extract distance information.

overview_image
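The intuition behind the figure can be checked with a little arithmetic. The sketch below is only an illustration, not the LR²Depth implementation: the function names, the 7x7 kernel, the 64 channels, and the downsampling factors 4 and 32 are all hypothetical choices. It assumes a depthwise convolution with stride 1 and "same" padding, and it approximates each feature-map pixel as summarizing a `downsample` x `downsample` patch of the input image (ignoring any receptive-field growth from earlier layers).

```python
def covered_region(kernel_size: int, downsample: int) -> int:
    """Approximate side length, in input-image pixels, spanned by a
    kernel_size x kernel_size kernel on a feature map that has been
    downsampled by `downsample` relative to the input image."""
    return kernel_size * downsample


def depthwise_macs(height: int, width: int, channels: int,
                   kernel_size: int, downsample: int) -> int:
    """Multiply-accumulate count of one depthwise convolution
    (stride 1, 'same' padding) on the downsampled feature map."""
    feat_h = height // downsample
    feat_w = width // downsample
    return feat_h * feat_w * channels * kernel_size * kernel_size


if __name__ == "__main__":
    # Hypothetical setting: a 480 x 640 input, 64 channels, 7 x 7 kernel.
    for factor in (4, 32):
        side = covered_region(7, factor)
        macs = depthwise_macs(480, 640, 64, 7, factor)
        print(f"1/{factor} resolution: ~{side} px per side, {macs:,} MACs")
```

Under these assumptions, the same kernel spans a region eight times wider at 1/32 resolution than at 1/4 resolution, while the convolution costs 64 times fewer multiply-accumulates, since the cost scales with the number of feature-map positions.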

Main Results

Quantitative comparison

nyu-results
kitti-results
sun-results

Qualitative comparison

nyu-v-results
kitti-v-results

3D Reconstruction

3d-v-results