CM-CLIP: Few-shot Cross-modal Generalization for Face Anti-spoofing via Prompt Learning
Abstract
Multi-modal face anti-spoofing (FAS) is receiving increasing attention because samples from different modalities provide complementary information. However, its performance is limited by the scarcity of multi-modal data (e.g., Depth and NIR) relative to visible-light (RGB) data. In this paper, we aim to exploit abundant RGB data to improve FAS performance on Depth or NIR data given only a few target-modality samples, which is essentially a cross-modal generalization FAS (CMG-FAS) task. The core difficulty of CMG-FAS is that the classifier is severely disturbed by the modality gap arising from large distribution discrepancies among modalities. To address this problem, we introduce a novel cross-modal contrastive language-image pre-training framework, termed CM-CLIP, which leverages textual features to dynamically adjust the classifier's weights and thereby explore generalizable visual features. Specifically, in the text branch of CM-CLIP, a fine-grained prompt learning (FGPL) strategy describes each sample at three levels: content, modality, and category. Furthermore, our prompts are made modality-generalizable through two designs: 1) reducing the correlation between content and modality prompts, and 2) combining the learnable context vectors with an input-conditional token generated for each image. Additionally, we insert an adaptive differential convolution adapter (ADC-Adapter) into the image encoder of CM-CLIP, which substantially reduces the number of learnable parameters and quickly bridges the task gap between general image recognition and FAS. Extensive experiments show that the CM-CLIP framework is effective and outperforms state-of-the-art methods on several multi-modal benchmarks.
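To make the two components described above concrete, the following is a minimal PyTorch sketch of fine-grained prompt learning with an input-conditional token, under our own assumptions: the class name `FineGrainedPrompts`, the Meta-Net design, and all dimensions are illustrative and not taken from the paper; only the content/modality/category split and the per-image conditional token follow the abstract.

```python
import torch
import torch.nn as nn


class FineGrainedPrompts(nn.Module):
    """Sketch: learnable context vectors at three levels (content,
    modality, category), each shifted by an input-conditional token
    produced from the image feature. Sizes are assumptions."""

    def __init__(self, ctx_len=4, dim=512, num_modalities=3, num_classes=2):
        super().__init__()
        self.content_ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        self.modality_ctx = nn.Parameter(torch.randn(num_modalities, ctx_len, dim) * 0.02)
        self.category_ctx = nn.Parameter(torch.randn(num_classes, ctx_len, dim) * 0.02)
        # Hypothetical Meta-Net: maps an image feature to a bias token.
        self.meta_net = nn.Sequential(
            nn.Linear(dim, dim // 16), nn.ReLU(), nn.Linear(dim // 16, dim)
        )

    def forward(self, img_feat, modality_idx):
        # img_feat: (B, dim) features from the CLIP image encoder.
        bias = self.meta_net(img_feat).unsqueeze(1)                # (B, 1, dim)
        content = self.content_ctx.unsqueeze(0) + bias             # (B, L, dim)
        modality = self.modality_ctx[modality_idx] + bias          # (B, L, dim)
        # One prompt per class, broadcast over the batch.
        category = self.category_ctx.unsqueeze(0) + bias.unsqueeze(1)  # (B, C, L, dim)
        return content, modality, category
```

Likewise, a plausible reading of the ADC-Adapter is a lightweight residual bottleneck around a frozen encoder block that mixes a vanilla convolution with a central-difference convolution (a standard texture-sensitive operator in FAS) via a learnable ratio; the class name, the bottleneck layout, and the CDC choice are our assumptions, not the paper's exact design.

```python
import torch.nn.functional as F


class ADCAdapter(nn.Module):
    """Sketch: bottleneck adapter mixing vanilla and central-difference
    convolution responses with a learnable ratio theta. Assumed design."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down = nn.Conv2d(dim, hidden, 1)                  # few parameters
        self.conv = nn.Conv2d(hidden, hidden, 3, padding=1, bias=False)
        self.up = nn.Conv2d(hidden, dim, 1)
        self.theta = nn.Parameter(torch.tensor(0.7))           # adaptive mix

    def forward(self, x):
        h = self.down(x)
        vanilla = self.conv(h)
        # Central-difference term: vanilla response minus the center pixel
        # weighted by the kernel's spatial sum (equivalent to CDC).
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        center = F.conv2d(h, kernel_sum)
        out = vanilla - self.theta * center
        return x + self.up(out)                                # residual path
```

Because only the adapter and prompt parameters are trained while the CLIP backbone stays frozen, this kind of design keeps the learnable parameter count small, consistent with the abstract's claim.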