Multi-modal Fusion and Wavelet Feature Extraction for Event-intensity Stereo Matching

Qinghao Zhao; Jiangtao Xu

doi:10.1007/s11633-025-1567-z

Qinghao Zhao, Jiangtao Xu. Multi-modal Fusion and Wavelet Feature Extraction for Event-intensity Stereo Matching[J]. Machine Intelligence Research, 2026, 23(3): 662-676. DOI: 10.1007/s11633-025-1567-z

Abstract

Stereo matching networks based on a single modality, either intensity or events, are vulnerable in complex environments, whereas multi-modal stereo matching networks fail to fully utilize geometric contextual information and suffer from information loss during feature extraction. To address these challenges, we propose a novel end-to-end event-intensity stereo matching network based on multi-modal fusion and wavelet feature extraction. First, the multi-modal geometric contextual fusion module (MGCFM) integrates contextual information from event streams with geometric information from intensity images, thereby generating robust fusion representations. Subsequently, the wavelet feature extraction module (WFEM) applies the Haar wavelet to transform the fusion representations into the frequency domain and extract precise wavelet features for pixelwise stereo matching. Finally, the wavelet features are fed into a stereo matching network to generate depth maps. The proposed method outperforms most state-of-the-art (SOTA) approaches in accuracy and speed, achieving a mean disparity error of 0.34 pixels on the multi-vehicle stereo event camera (MVSEC) dataset, and a runtime of 0.04 seconds with a 1-pixel error (1PE) of 6.728% on the stereo event camera dataset (DSEC).
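To make the Haar step concrete, the following is a minimal sketch of a single-level 2D Haar decomposition in plain Python. It is not the authors' WFEM, whose architecture the abstract does not specify; the function names `haar_dwt2` and `haar_idwt2`, the list-of-lists input, and the orthonormal scaling are assumptions chosen for illustration. The exact inverse shows that the decomposition itself discards no information.

```python
def haar_dwt2(x):
    """Single-level 2D orthonormal Haar transform (illustrative, not the paper's WFEM).

    x is a list of rows with even height and width. Returns four
    half-resolution subbands: ll (average), lh (horizontal difference),
    hl (vertical difference), hh (diagonal difference).
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must both be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, w, 2):
            # The four pixels of one 2x2 block.
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # low-pass in both directions
            r_lh.append((a - b + c - d) / 2)  # difference across columns
            r_hl.append((a + b - c - d) / 2)  # difference across rows
            r_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2: rebuilds the full-resolution image."""
    out = []
    for r_ll, r_lh, r_hl, r_hh in zip(ll, lh, hl, hh):
        top, bottom = [], []
        for s, u, v, t in zip(r_ll, r_lh, r_hl, r_hh):
            top.extend([(s + u + v + t) / 2, (s - u + v - t) / 2])
            bottom.extend([(s + u - v - t) / 2, (s - u - v + t) / 2])
        out.append(top)
        out.append(bottom)
    return out


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    subbands = haar_dwt2(image)
    restored = haar_idwt2(*subbands)
    print(restored == [[float(v) for v in row] for row in image])
```

In a learned feature extractor, the same decomposition would typically be applied per channel to feature maps rather than to a raw image; how the paper arranges this is not stated in the abstract.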
