Abstract: Foundation models have recently attracted considerable attention for monocular depth estimation in endoscopic surgery. However, endoscopic images in clinical scenarios often suffer from complex degradation, which compromises robustness. We therefore propose EndoMDNet, a self-supervised, generative-feature-driven network that mitigates degradation for robust monocular depth estimation in endoscopic surgery. Specifically, a content recognition mechanism is designed to guide a diffusion model to generate detail information, which serves as a supplement to the degraded depth features. However, diffusion models tend to produce artifacts that may be inconsistent with the target distribution. To address this problem, we propose a wavelet-based refined adaptive fusion block that filters noise from the generative features and adaptively fuses them with the degraded depth features. Extensive experiments on the SCARED, SERV-CT, and Hamlyn datasets demonstrate the robustness of the proposed method.
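The abstract does not specify how the wavelet-based refined adaptive fusion block is implemented. The following is a minimal illustrative sketch, not the paper's actual design: it assumes a single-level Haar decomposition, a learned gate that attenuates the high-frequency sub-bands of the generative features (where diffusion artifacts are assumed to concentrate), and a per-pixel sigmoid weight for fusing the refined generative features with the degraded depth features. The module name, layer choices, and hyperparameters are all hypothetical.

```python
import torch
import torch.nn as nn


def haar_dwt(x):
    # Single-level orthonormal Haar decomposition of (B, C, H, W), H and W even.
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh


def haar_idwt(ll, lh, hl, hh):
    # Exact inverse of haar_dwt: reassemble the four sub-bands into (B, C, 2H, 2W).
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    B, C, H, W = ll.shape
    out = ll.new_zeros(B, C, 2 * H, 2 * W)
    out[..., 0::2, 0::2] = a
    out[..., 0::2, 1::2] = b
    out[..., 1::2, 0::2] = c
    out[..., 1::2, 1::2] = d
    return out


class WaveletRefinedAdaptiveFusion(nn.Module):
    """Hypothetical sketch of a wavelet-based refined adaptive fusion block:
    gate the high-frequency sub-bands of the generative feature, reconstruct
    it, then fuse it with the degraded depth feature via a learned weight."""

    def __init__(self, channels):
        super().__init__()
        # Gate over the three high-frequency sub-bands (LH, HL, HH).
        self.band_gate = nn.Sequential(
            nn.Conv2d(3 * channels, 3 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Per-pixel weight balancing refined generative vs. degraded depth features.
        self.fuse_gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, gen_feat, depth_feat):
        ll, lh, hl, hh = haar_dwt(gen_feat)
        gates = self.band_gate(torch.cat([lh, hl, hh], dim=1))
        g_lh, g_hl, g_hh = torch.chunk(gates, 3, dim=1)
        # Suppress artifact-prone high-frequency content, keep low-frequency structure.
        refined = haar_idwt(ll, g_lh * lh, g_hl * hl, g_hh * hh)
        # Adaptive per-pixel fusion with the degraded depth feature.
        alpha = self.fuse_gate(torch.cat([refined, depth_feat], dim=1))
        return alpha * refined + (1 - alpha) * depth_feat


if __name__ == "__main__":
    fusion = WaveletRefinedAdaptiveFusion(channels=64)
    gen = torch.randn(2, 64, 32, 32)    # generative (diffusion) feature
    dep = torch.randn(2, 64, 32, 32)    # degraded depth feature
    print(fusion(gen, dep).shape)        # torch.Size([2, 64, 32, 32])
```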