In the past decades, artificial intelligence (AI) has achieved unprecedented success, with statistical models becoming the central entities of AI systems. However, the centralized paradigm for training and serving these models faces mounting privacy and legal challenges. To bridge the gap between data privacy and the need for data fusion, federated learning (FL) has emerged as an AI paradigm for overcoming data silos and data privacy problems. Built on secure distributed AI, federated learning emphasizes data security throughout the lifecycle, which includes data preprocessing, training, evaluation, and deployment. FL preserves data security by applying methods such as secure multi-party computation (MPC), differential privacy, and hardware-based solutions to build and use distributed multi-party machine-learning systems and statistical models over different data sources. Beyond data privacy, this paper argues that the concept of “model” also matters: when federated models are developed and deployed, they are exposed to various risks, including plagiarism, illegal copying, and misuse. To address these issues, this paper introduces FedIPR, a novel ownership verification scheme that embeds watermarks into FL models to verify their ownership and protect model intellectual property rights (IPR, or IP-right for short). Although security is at the core of FL, many articles still refer to distributed machine learning with no security guarantee as “federated learning”, which does not satisfy the intended definition of FL. To this end, this paper reiterates the concept of federated learning and proposes secure federated learning (SFL), whose ultimate goal is to build trustworthy and safe AI with strong privacy preservation and IP-right protection.
This paper provides a comprehensive overview of existing works on threats, attacks, and defenses in each phase of SFL from a lifecycle perspective. The review was published in the first issue of Machine Intelligence Research in 2023, and the full text is open access.
In recent years, artificial intelligence (AI) has made great progress in many commercial applications, including computer vision, natural language processing, and recommender systems. However, behind this rapid development, the drawbacks of traditional AI approaches have also been revealed: they rely heavily on the availability of large-scale, high-quality data but provide no mechanism for obtaining and using it securely. For example, the development of computer vision benefited from large-scale public datasets such as ImageNet, which is essentially a centralized data model. From e-commerce to online video, recommender systems analyze user preferences from historical data and precisely recommend the most relevant items to users. In biology, by training on publicly available data consisting of 170 000 protein structures from the Protein Data Bank (PDB), AlphaFold, developed by DeepMind, achieved highly accurate predictions of protein structure. These examples are all centralized, data-driven computation systems: they require that data scattered across multiple devices first be uploaded to a central database before being used to train statistical models.
Centralized data fusion for AI modeling is facing more and more legal and ethical challenges. In practice, data is often spread across multiple end devices and held by different individual users or organizations, and data in different locations is heterogeneous in form and distribution. Fusing the data into a central database inevitably increases the risk of privacy leakage. With growing awareness of privacy concerns, governments are strengthening data privacy laws, such as the General Data Protection Regulation (GDPR) in the EU, the California Consumer Privacy Act (CCPA) in the USA, and the Data Security Law (DSL) in China. On the other hand, owing to its uniqueness and rarity, the value of data is also a challenge that cannot be neglected: once data can be freely shared and copied, its value gradually disappears, which is why no organization is willing to share data without benefit. To eliminate the drawbacks caused by data fusion, Google proposed a new training paradigm, called federated learning (FL), to address these data challenges. The original FL requires model parameters, rather than raw training data, to be exchanged between multiple devices during training, which greatly mitigates data privacy risks. However, existing works have shown that vanilla FL, without protection on the exchanged model parameters, may not always provide strong security guarantees. Zhu et al. demonstrated that the original training data can be recovered from gradients. Phong et al. showed that even a small portion of the original gradients can expose information about local data. Moreover, beyond the training stage, vanilla FL is vulnerable to various attacks throughout the entire FL lifecycle, which includes data preprocessing, training, evaluation, and deployment. For example, data can be poisoned in the preprocessing stage, and membership inference attacks can occur in the model deployment phase.
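To see why unprotected gradients leak data, consider the observation attributed above to Phong et al.: for a single linear layer with a squared-error loss, the gradient with respect to the weights is the outer product of the output error and the input, so a server holding the gradient can reconstruct the client's input exactly. The sketch below illustrates this on synthetic data; it is a minimal toy example, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A client's private input and a shared linear layer y = W x + b.
x = rng.normal(size=4)           # private data point
W = rng.normal(size=(3, 4))      # shared model weights
b = rng.normal(size=3)

# Forward pass with loss L = 0.5 * ||y - target||^2.
target = rng.normal(size=3)
err = (W @ x + b) - target       # dL/dy

# Gradients a client would upload in vanilla (unprotected) FL.
grad_W = np.outer(err, x)        # dL/dW = err ⊗ x
grad_b = err                     # dL/db = err

# Each row i of grad_W equals err[i] * x, so dividing by the
# matching entry of grad_b recovers x. Pick a well-scaled row.
i = np.argmax(np.abs(grad_b))
x_recovered = grad_W[i] / grad_b[i]

assert np.allclose(x_recovered, x)   # exact reconstruction
```

The same principle underlies the iterative gradient-inversion attacks (e.g., Zhu et al.) that recover data from deeper networks by optimization rather than in closed form.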
It is thus important to emphasize that security guarantees are an essential part of FL system design.
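One common client-side safeguard of this kind is differentially private gradient release: clipping each gradient to a fixed norm and adding calibrated Gaussian noise before it leaves the device, in the style of DP-SGD. The sketch below is illustrative only; the function name and parameter values are assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize_gradient(grad, clip_norm=1.0, noise_multiplier=1.0):
    """Clip a gradient to an L2 norm bound and add Gaussian noise
    before uploading it, so any single example's influence is bounded."""
    scale = min(1.0, clip_norm / (np.linalg.norm(grad) + 1e-12))
    clipped = grad * scale
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise

raw = rng.normal(size=8) * 10.0          # a large raw gradient
released = privatize_gradient(raw)       # what the server actually sees

# With the noise turned off, the released gradient is norm-bounded.
assert np.linalg.norm(
    privatize_gradient(raw, noise_multiplier=0.0)) <= 1.0 + 1e-9
```

The noise multiplier and clipping bound trade privacy against utility; choosing them is part of the privacy-accounting analysis that a full DP deployment requires.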
Threat phases of federated learning lifecycle
Moreover, as statistical models are the central entities in AI, multiple assets, including training data, hardware, and human expertise, are involved when developing and deploying FL models in practice. This makes “model management” a critical issue. To prevent models from being misused or plagiarized without authorization, this paper reinforces awareness of model intellectual property and introduces an IP-right-preserving mechanism for models in federated learning. It realizes federated model IPR protection by embedding watermarks into deep neural network (DNN) model parameters, achieving good results in practice.
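A common way to embed such a parameter watermark, which FedIPR builds on, is feature-based white-box watermarking: the owner keeps a secret projection matrix and a signature bit string, and a regularization term during training pushes the signs of the projected parameters to encode the bits. The toy sketch below embeds and verifies a signature in a random weight vector; the loss, dimensions, and step size are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

n_bits, n_params = 16, 256
signature = rng.integers(0, 2, n_bits)       # owner's secret bit string
A = rng.normal(size=(n_bits, n_params))      # secret projection matrix
w = rng.normal(size=n_params) * 0.01         # flattened model parameters

def extract(w):
    # Each watermark bit is read out as the sign of one projection.
    return (A @ w > 0).astype(int)

# Embed: gradient steps on a hinge loss pushing each projection to the
# correct side of zero (in practice added to the FL training loss).
target = 2 * signature - 1                   # map {0,1} -> {-1,+1}
for _ in range(200):
    margin = target * (A @ w)
    grad = -(A.T @ (target * (margin < 0.5))) / n_bits
    w -= 0.1 * grad

assert np.array_equal(extract(w), signature)  # ownership verified
```

Verification needs only the secret matrix and the suspect model's parameters, so ownership can be checked without the training data; the challenge FedIPR addresses is doing this robustly when many FL clients embed watermarks into one shared model.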
In summary, the true spirit of federated learning lies in its ability to provide strong privacy preservation and model IP-right protection. To distinguish it from vanilla FL, we call it secure federated learning (SFL). In contrast to many existing FL works that provide weak or no security guarantees, this paper emphasizes that the principle of SFL should receive more attention in both industry and academia. This article gives a comprehensive overview of the key aspects of SFL, covering existing works on security guarantees throughout the entire lifecycle as well as model IP-right protection. In the rest of the article, unless stated otherwise, we use federated learning (FL) and secure federated learning (SFL) interchangeably to refer to federated learning systems that employ certain security mechanisms.
The architecture of secure federated learning
Compared to the previous works, the main contributions of our paper are as follows:
1) It reiterates the core concept of secure federated learning and emphasizes that security guarantees should cover the entire FL lifecycle, which includes data preprocessing, model training, evaluation, and deployment.
2) It provides a general SFL architecture that covers both HFL and VFL scenarios, and summarizes existing works on threats, attacks, and defenses in each phase throughout the entire lifecycle.
3) It views model intellectual property rights as an important part of building a secure FL system, and provides detailed implementations of how to protect federated models’ IPR.
Download full text:
Federated Learning with Privacy-preserving and Model IP-right-protection
Qiang Yang, Anbu Huang, Lixin Fan, Chee Seng Chan, Jian Han Lim, Kam Woh Ng, Ding Sheng Ong & Bowen Li