CM-CLIP: Few-shot Cross-modal Generalization for Face Anti-spoofing via Prompt Learning
Abstract
Multi-modal face anti-spoofing (FAS) is receiving increasing attention because samples from different modalities provide complementary information. However, its performance is limited by the scarcity of multi-modal data (e.g., Depth and NIR) relative to visible-light (RGB) data. In this paper, we aim to exploit abundant RGB data to improve FAS performance on Depth or NIR data given only a few target-modality samples, which is essentially a cross-modal generalization FAS (CMG-FAS) task. The core difficulty of CMG-FAS is that the classifier is severely disturbed by the modality gap arising from large distribution discrepancies among modalities. To address this problem, we introduce a novel cross-modal contrastive language-image pre-training framework, termed CM-CLIP, which leverages textual features to dynamically adjust the classifier's weights and thereby explore generalizable visual features. Specifically, in the text branch of CM-CLIP, a fine-grained prompt learning (FGPL) strategy describes each sample at three levels: content, modality, and category. Furthermore, our prompts are made modality-generalizable through two designs: 1) reducing the correlation between content and modality prompts, and 2) combining the learnable context vectors with an input-conditional token generated for each image. Additionally, we insert an adaptive differential convolution adapter (ADC-Adapter) into the image encoder of CM-CLIP, which substantially reduces the number of learnable parameters and quickly bridges the task gap between general image recognition and FAS. Extensive experiments show that the CM-CLIP framework is effective and outperforms state-of-the-art methods on several multi-modal benchmarks.
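To make the two components described above concrete, the following is a minimal PyTorch sketch of fine-grained prompt learning with an input-conditional token, under our own assumptions: the class name `FineGrainedPrompts`, the Meta-Net design, and all dimensions are illustrative and not taken from the paper; only the content/modality/category split and the per-image conditional token follow the abstract.

```python
import torch
import torch.nn as nn


class FineGrainedPrompts(nn.Module):
    """Sketch: learnable context vectors at three levels (content,
    modality, category), each shifted by an input-conditional token
    produced from the image feature. Sizes are assumptions."""

    def __init__(self, ctx_len=4, dim=512, num_modalities=3, num_classes=2):
        super().__init__()
        self.content_ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        self.modality_ctx = nn.Parameter(torch.randn(num_modalities, ctx_len, dim) * 0.02)
        self.category_ctx = nn.Parameter(torch.randn(num_classes, ctx_len, dim) * 0.02)
        # Hypothetical Meta-Net: maps an image feature to a bias token.
        self.meta_net = nn.Sequential(
            nn.Linear(dim, dim // 16), nn.ReLU(), nn.Linear(dim // 16, dim)
        )

    def forward(self, img_feat, modality_idx):
        # img_feat: (B, dim) features from the CLIP image encoder.
        bias = self.meta_net(img_feat).unsqueeze(1)                # (B, 1, dim)
        content = self.content_ctx.unsqueeze(0) + bias             # (B, L, dim)
        modality = self.modality_ctx[modality_idx] + bias          # (B, L, dim)
        # One prompt per class, broadcast over the batch.
        category = self.category_ctx.unsqueeze(0) + bias.unsqueeze(1)  # (B, C, L, dim)
        return content, modality, category
```

Likewise, a plausible reading of the ADC-Adapter is a lightweight residual bottleneck around a frozen encoder block that mixes a vanilla convolution with a central-difference convolution (a standard texture-sensitive operator in FAS) via a learnable ratio; the class name, the bottleneck layout, and the CDC choice are our assumptions, not the paper's exact design.

```python
import torch.nn.functional as F


class ADCAdapter(nn.Module):
    """Sketch: bottleneck adapter mixing vanilla and central-difference
    convolution responses with a learnable ratio theta. Assumed design."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down = nn.Conv2d(dim, hidden, 1)                  # few parameters
        self.conv = nn.Conv2d(hidden, hidden, 3, padding=1, bias=False)
        self.up = nn.Conv2d(hidden, dim, 1)
        self.theta = nn.Parameter(torch.tensor(0.7))           # adaptive mix

    def forward(self, x):
        h = self.down(x)
        vanilla = self.conv(h)
        # Central-difference term: vanilla response minus the center pixel
        # weighted by the kernel's spatial sum (equivalent to CDC).
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        center = F.conv2d(h, kernel_sum)
        out = vanilla - self.theta * center
        return x + self.up(out)                                # residual path
```

Because only the adapter and prompt parameters are trained while the CLIP backbone stays frozen, this kind of design keeps the learnable parameter count small, consistent with the abstract's claim.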