Citation: Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Xin Jiang, Maosong Sun. Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-level Backdoor Attacks. Machine Intelligence Research, vol. 20, no. 2, pp. 180–193, 2023. https://doi.org/10.1007/s11633-022-1377-5
[1] J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4171−4186, 2019. DOI: 10.18653/v1/N19-1423.

[2] K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015.

[3] S. F. Li, S. Q. Ma, M. H. Xue, B. Z. H. Zhao. Deep learning backdoors. [Online], Available: https://arxiv.org/abs/2007.08273, 2020.

[4] Q. X. Xiao, K. Li, D. Zhang, W. L. Xu. Security risks in deep learning implementations. In Proceedings of IEEE Security and Privacy Workshops, San Francisco, USA, pp. 123−128, 2018. DOI: 10.1109/SPW.2018.00027.

[5] O. Kovaleva, A. Romanov, A. Rogers, A. Rumshisky. Revealing the dark secrets of BERT. In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 4365−4374, 2019. DOI: 10.18653/v1/D19-1445.

[6] A. Chan, Y. Tay, Y. S. Ong, A. Zhang. Poison attacks against text datasets with conditional adversarially regularized autoencoder. In Proceedings of Findings of the Association for Computational Linguistics, pp. 4175−4189, 2020. DOI: 10.18653/v1/2020.findings-emnlp.373.

[7] Y. J. Ji, X. Y. Zhang, S. L. Ji, X. P. Luo, T. Wang. Model-reuse attacks on deep learning systems. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security, Toronto, Canada, pp. 349−363, 2018. DOI: 10.1145/3243734.3243757.

[8] K. Kurita, P. Michel, G. Neubig. Weight poisoning attacks on pretrained models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2793−2806, 2020. DOI: 10.18653/v1/2020.acl-main.249.

[9] K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 770−778, 2016. DOI: 10.1109/CVPR.2016.90.

[10] Y. H. Liu, M. Ott, N. Goyal, J. F. Du, M. Joshi, D. Q. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. [Online], Available: https://arxiv.org/abs/1907.11692, 2019.
|
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
|
[12] Z. Z. Lan, M. D. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.

[13] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 2261−2269, 2017. DOI: 10.1109/CVPR.2017.243.
|
[14] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. H. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, A. Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. In Proceedings of the 35th Conference on Neural Information Processing Systems, 2021.
|
[15] H. X. Liu, Z. H. Dai, D. R. So, Q. V. Le. Pay attention to MLPs. In Proceedings of the 35th Conference on Neural Information Processing Systems, 2021.

[16] I. J. Goodfellow, J. Shlens, C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015.
|
[17] D. Jin, Z. J. Jin, J. T. Zhou, P. Szolovits. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 8018−8025, 2020.
|
[18] Y. Zang, F. C. Qi, C. H. Yang, Z. Y. Liu, M. Zhang, Q. Liu, M. S. Sun. Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6066−6080, 2020. DOI: 10.18653/v1/2020.acl-main.540.

[19] H. Xu, Y. Ma, H. C. Liu, D. Deb, H. Liu, J. L. Tang, A. K. Jain. Adversarial attacks and defenses in images, graphs and text: A review. International Journal of Automation and Computing, vol. 17, no. 2, pp. 151–178, 2020. DOI: 10.1007/s11633-019-1211-x.

[20] M. Ren, Y. L. Wang, Z. F. He. Towards interpretable defense against adversarial attacks via causal inference. Machine Intelligence Research, vol. 19, no. 3, pp. 209–226, 2022. DOI: 10.1007/s11633-022-1330-7.

[21] T. Y. Gu, B. Dolan-Gavitt, S. Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. [Online], Available: https://arxiv.org/abs/1708.06733, 2017.

[22] Y. Ji, Z. X. Liu, X. Hu, P. Q. Wang, Y. H. Zhang. Programmable neural network trojan for pre-trained feature extractor. [Online], Available: https://arxiv.org/abs/1901.07766, 2019.
|
[23] R. Schuster, T. Schuster, Y. Meri, V. Shmatikov. Humpty Dumpty: Controlling word meanings via corpus poisoning. In Proceedings of IEEE Symposium on Security and Privacy, San Francisco, USA, pp. 1295−1313, 2020. DOI: 10.1109/SP40000.2020.00115.
|
[24] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, C. Raffel. Extracting training data from large language models. In Proceedings of the 30th USENIX Security Symposium, 2021.

[25] R. Pang, H. Shen, X. Y. Zhang, S. L. Ji, Y. Vorobeychik, X. P. Luo, A. Liu, T. Wang. A tale of evil twins: Adversarial inputs versus poisoned models. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security, pp. 85−99, 2020. DOI: 10.1145/3372297.3417253.

[26] G. X. Liu, I. Khalil, A. Khreishah, N. Phan. A synergetic attack against neural network classifiers combining backdoor and adversarial examples. In Proceedings of IEEE International Conference on Big Data, Orlando, USA, pp. 834−846, 2021. DOI: 10.1109/BigData52589.2021.9671964.

[27] Y. Q. Liu, S. Q. Ma, Y. Aafer, W. C. Lee, J. Zhai, W. H. Wang, X. Y. Zhang. Trojaning attack on neural networks. In Proceedings of 25th Annual Network and Distributed System Security Symposium, San Diego, USA, 2018.

[28] J. Z. Dai, C. S. Chen, Y. F. Li. A backdoor attack against LSTM-based text classification systems. IEEE Access, vol. 7, pp. 138872–138878, 2019. DOI: 10.1109/ACCESS.2019.2941376.

[29] X. Y. Chen, A. Salem, D. F. Chen, M. Backes, S. Q. Ma, Q. N. Shen, Z. H. Wu, Y. Zhang. BadNL: Backdoor attacks against NLP models with semantic-preserving improvements. [Online], Available: https://arxiv.org/abs/2006.01043, 2020.
|
[30] L. C. Sun. Natural backdoor attack on text data. [Online], Available: https://arxiv.org/abs/2006.16176, 2020.
|
[31] X. Y. Zhang, Z. Zhang, S. L. Ji, T. Wang. Trojaning language models for fun and profit. In Proceedings of IEEE European Symposium on Security and Privacy, Vienna, Austria, pp. 179−197, 2021. DOI: 10.1109/EuroSP51992.2021.00022.

[32] F. C. Qi, M. K. Li, Y. Y. Chen, Z. Y. Zhang, Z. Y. Liu, Y. S. Wang, M. S. Sun. Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 443−453, 2021. DOI: 10.18653/v1/2021.acl-long.37.

[33] F. C. Qi, Y. Yao, S. Xu, Z. Y. Liu, M. S. Sun. Turn the combination lock: Learnable textual backdoor attacks via word substitution. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 4873−4883, 2021. DOI: 10.18653/v1/2021.acl-long.377.

[34] W. K. Yang, Y. K. Lin, P. Li, J. Zhou, X. Sun. Rethinking stealthiness of backdoor attack against NLP models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 5543−5557, 2021. DOI: 10.18653/v1/2021.acl-long.431.

[35] L. Y. Li, D. M. Song, X. N. Li, J. H. Zeng, R. T. Ma, X. P. Qiu. Backdoor attacks on pre-trained models by layerwise weight poisoning. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, pp. 3023−3032, 2021. DOI: 10.18653/v1/2021.emnlp-main.241.

[36] Y. S. Yao, H. Y. Li, H. T. Zheng, B. Y. Zhao. Latent backdoor attacks on deep neural networks. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security, London, UK, pp. 2041−2055, 2019. DOI: 10.1145/3319535.3354209.

[37] J. Y. Jia, Y. P. Liu, N. Z. Gong. BadEncoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In Proceedings of IEEE Symposium on Security and Privacy, San Francisco, USA, pp. 2043−2059, 2022. DOI: 10.1109/SP46214.2022.9833644.

[38] E. Bagdasaryan, V. Shmatikov. Blind backdoors in deep learning models. In Proceedings of the 30th USENIX Security Symposium, pp. 1505−1521, 2021.

[39] S. Rezaei, X. Liu. A target-agnostic attack on deep models: Exploiting security vulnerabilities of transfer learning. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.

[40] L. J. Shen, S. L. Ji, X. H. Zhang, J. F. Li, J. Chen, J. Shi, C. F. Fang, J. W. Yin, T. Wang. Backdoor pre-trained models can transfer to all. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security, pp. 3141−3158, 2021. DOI: 10.1145/3460120.3485370.

[41] Y. M. Li, Y. Jiang, Z. F. Li, S. T. Xia. Backdoor learning: A survey. [Online], Available: https://arxiv.org/abs/2007.08745, 2020.

[42] K. Liu, B. Dolan-Gavitt, S. Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In Proceedings of the 21st International Symposium on Research in Attacks, Intrusions, and Defenses, Springer, Heraklion, Greece, pp. 273−294, 2018. DOI: 10.1007/978-3-030-00470-5_13.

[43] Y. G. Li, X. Lyu, N. Koren, L. Lyu, B. Li, X. J. Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks. In Proceedings of the 9th International Conference on Learning Representations, 2021.

[44] B. L. Wang, Y. S. Yao, S. Shan, H. Y. Li, B. Viswanath, H. T. Zheng, B. Y. Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In Proceedings of IEEE Symposium on Security and Privacy, San Francisco, USA, pp. 707−723, 2019. DOI: 10.1109/SP.2019.00031.

[45] Y. S. Gao, C. E. Xu, D. R. Wang, S. P. Chen, D. C. Ranasinghe, S. Nepal. STRIP: A defence against trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications Conference, ACM, San Juan, USA, pp. 113−125, 2019. DOI: 10.1145/3359789.3359790.

[46] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Miami, USA, pp. 248−255, 2009. DOI: 10.1109/CVPR.2009.5206848.

[47] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Seattle, USA, pp. 1631−1642, 2013.

[48] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar. Predicting the type and target of offensive posts in social media. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 1415−1420, 2019. DOI: 10.18653/v1/N19-1144.
|
[49] V. Metsis, I. Androutsopoulos, G. Paliouras. Spam filtering with Naive Bayes – which Naive Bayes? In Proceedings of the 3rd Conference on Email and Anti-Spam, Mountain View, USA, 2006.
|
[50] J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, vol. 32, pp. 323–332, 2012. DOI: 10.1016/j.neunet.2012.02.016.

[51] J. Wieting, K. Gimpel. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 451−462, 2018. DOI: 10.18653/v1/P18-1042.

[52] X. Y. Chen, C. Liu, B. Li, K. Lu, D. Song. Targeted backdoor attacks on deep learning systems using data poisoning. [Online], Available: https://arxiv.org/abs/1712.05526, 2017.

[53] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In Proceedings of NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 2011.

[54] A. Coates, A. Y. Ng, H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, USA, pp. 215−223, 2011.

[55] S. Ioffe, C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 448−456, 2015.

[56] X. J. Xu, Q. Wang, H. C. Li, N. Borisov, C. A. Gunter, B. Li. Detecting AI trojans using meta neural analysis. In Proceedings of IEEE Symposium on Security and Privacy, San Francisco, USA, pp. 103−120, 2021. DOI: 10.1109/SP40001.2021.00034.

[57] Y. K. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of IEEE International Conference on Computer Vision, Santiago, Chile, pp. 19−27, 2015. DOI: 10.1109/ICCV.2015.11.

[58] P. Chrabaszcz, I. Loshchilov, F. Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. [Online], Available: https://arxiv.org/abs/1707.08819, 2017.
|