Depth estimation is an important task in autonomous driving and typically requires dedicated depth sensors or multiple cameras. In this paper, we propose a novel approach to monocular depth estimation based on two cheaper annotation tasks, semantic segmentation and prediction of a single vanishing point, without the need for ground-truth depth data. Under a Manhattan-world assumption, a single dominant vanishing point exists and marks the far end of the scene along the z-axis. Guided by the semantic segmentation prediction, we define hand-crafted rules that determine the depth of each pixel from its label and its spatial position relative to the vanishing point. We train two convolutional neural networks (CNNs): one for semantic segmentation and one for vanishing point prediction. Their outputs are then fused using the hand-crafted rules, which are derived from single-view geometry and take into account the label of each pixel and the nature of the object identified by the segmentation model. Extensive experiments were conducted on the KITTI and Cityscapes benchmark datasets. The proposed model achieves strong performance in semantic segmentation (mean intersection over union of 82.20%) and vanishing point estimation (mean absolute error of 1.87). Monocular depth estimation achieves an absolute relative error of 0.070 on the KITTI dataset and 0.289 on the Cityscapes dataset, outperforming many state-of-the-art methods in depth estimation and semantic segmentation while running at 10 frames per second.
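To make the fusion step concrete, the sketch below shows one possible way such geometry-based rules could turn a semantic map and a vanishing point into a per-pixel depth map. It is a minimal illustration only: the class ids, focal length, camera height, depth cap, and the specific ground-plane and object rules are assumptions made for this example and are not the rules used in the paper.

```python
import numpy as np

# Hypothetical label ids and camera constants, assumed for illustration only;
# the paper's actual label set and calibration are not reproduced here.
SKY, ROAD, OBJECT = 0, 1, 2          # illustrative semantic classes
FOCAL_PX = 720.0                     # assumed focal length in pixels
CAM_HEIGHT_M = 1.65                  # assumed camera height above the road
MAX_DEPTH_M = 80.0                   # assumed depth cap for far/sky pixels


def depth_from_rules(seg, vp, focal=FOCAL_PX, cam_h=CAM_HEIGHT_M,
                     max_depth=MAX_DEPTH_M):
    """Fuse a semantic map and a vanishing point into a per-pixel depth map.

    seg : (H, W) array of semantic label ids.
    vp  : (u, v) vanishing point in pixel coordinates; its row v marks the
          far end of the scene along the z-axis.
    """
    h, w = seg.shape
    _, v_vp = vp
    rows = np.arange(h, dtype=np.float64)[:, None].repeat(w, axis=1)

    depth = np.full((h, w), max_depth, dtype=np.float64)

    # Ground pixels: classic single-view rule, depth ~ f * h / (v - v_vp),
    # so rows closer to the vanishing point map to larger depths.
    ground = (seg == ROAD) & (rows > v_vp)
    depth[ground] = np.clip(focal * cam_h / (rows[ground] - v_vp), 0, max_depth)

    # Vertical objects: assign each column of object pixels the depth of its
    # lowest (ground-contact) pixel, so the whole object shares one depth.
    for col in range(w):
        obj_rows = np.flatnonzero(seg[:, col] == OBJECT)
        if obj_rows.size:
            contact = obj_rows.max()
            if contact > v_vp:
                depth[obj_rows, col] = min(
                    focal * cam_h / (contact - v_vp), max_depth)

    # Sky pixels keep the maximum depth already written above.
    return depth


if __name__ == "__main__":
    seg = np.full((375, 1242), ROAD, dtype=np.int64)
    seg[:150] = SKY
    seg[200:300, 600:700] = OBJECT       # a box-shaped obstacle on the road
    print(depth_from_rules(seg, vp=(621, 170)).round(1))
```

In the proposed system, the segmentation map and vanishing point come from the two CNNs rather than being given directly; this NumPy-only sketch is meant only to show how label- and position-dependent geometric rules can produce depth without ground-truth supervision.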
Year: 2024
Journal: IEEE Transactions on Intelligent Vehicles