
Limitations and Solutions of Federated Learning Privacy Protection in the AI Era
Liu Zegang
(Associate Professor, Southwest University of Political Science and Law)
[Abstract]
Legislation on artificial intelligence (AI) often tends to favor specific technologies. Federated learning is a mainstream machine learning technology whose greatest advantage lies in an architecture designed around privacy needs. It is already widely applied in fields such as finance and data sharing and has a significant impact on individual rights. Yet federated learning, despite being built for privacy protection, keeps exposing privacy problems that reveal the legal defects of a protection path centered on personal data: sparse regulation leaves federated learning without clear privacy requirements, making it difficult to realize its “privacy by design” advantages; its distributed architecture complicates the allocation of privacy protection responsibilities; an excessive emphasis on confidentiality and security weakens and displaces the personal dimension of privacy protection; and the absence of normative constraints on technical trade-offs deprives privacy protection of transparency and certainty. These problems highlight the significant gap between AI privacy protection and personal data protection in terms of protected objects, protection focus, protection responsibilities, and protection frameworks. To meet the special requirements of AI privacy protection, future upgrades and improvements to AI privacy regulation can focus on integrating normative bases, adjusting regulatory priorities, exploring accountability mechanisms, and constructing communication mechanisms.
[Keywords]
Artificial Intelligence Legislation; Federated Learning; Privacy by Design; Differential Privacy; Privacy Computing
There are many privacy risks in the training, deployment, and use of artificial intelligence. However, there is currently no comprehensive solution for AI privacy protection. The term “artificial intelligence” encompasses many divergent technological solutions. In the tech community, scientists, including Yann LeCun and Geoffrey Hinton, still have significant disagreements about the theoretical development direction and engineering practice plans for AI technology in the future. Before 2010, deep learning based on artificial neural network architecture was widely regarded as a dead-end technology path. However, the rapid development of AI technology and industry over the past decade has overturned previous popular beliefs. The EU’s AI legislative texts mainly reflect the current technological understanding and practical directions. For example, the EU’s Artificial Intelligence Act specifically emphasizes its relationship with the General Data Protection Regulation (GDPR) and primarily regulates machine learning, which is a data-driven technology. The tortuous process of EU AI legislation and its final version’s special emphasis on foundational models also indicate that current AI legislation is not entirely neutral, but rather favors specific technologies. Federated learning is a foundational technology for machine learning that enables collaborative modeling using data from multiple institutions or entities while ensuring privacy protection. This privacy-preserving machine learning framework can operate various mainstream algorithms, including neural networks, and is compatible with large model technologies, making it a current mainstream machine learning technology. As a form of “privacy by design,” federated learning’s greatest advantage lies in its architecture that fully considers privacy needs, ensuring that “data does not leave the local environment” in all tasks. The characteristic of “data remains static while the model moves, data is available but not visible” allows federated learning to effectively utilize participant data for collaborative model training while also protecting user privacy and data security. However, this architecture designed based on personal data (information) protection laws cannot permanently eliminate privacy risks. The tech community has pointed out that federated learning’s privacy design still has shortcomings. In February 2024, Stanford University’s Human-Centered Artificial Intelligence Institute pointed out in a white paper titled “Rethinking Privacy in the AI Era” that personal data protection rights cannot effectively eliminate the privacy risks caused by the massive data collection of AI; existing and proposed privacy legislation is insufficient to address AI’s privacy issues. Currently, federated learning is widely applied in fields such as finance and data sharing, with participants primarily being institutions and enterprises, making it difficult for the general public to perceive and understand. However, legal research should aim to analyze the legal deficiencies of federated learning privacy protection beyond ordinary and popular cognitive levels, revealing the inherent technical flaws and helping to uncover the specificity of AI privacy protection, providing inspiration for the development direction of AI privacy regulations.
This article takes a stance grounded in institutional correlation and realistic development when using terms such as “privacy protection” and “personal information protection.” Among the concepts now widely used in the legal field, such as “personal information protection,” “privacy rights,” and “privacy protection,” none is so fundamental that it can serve as the basis for the others or must inevitably replace them. On the contrary, privacy rights and personal information protection have developed in a highly intertwined manner within an extremely complex real-world context. From the perspective of real-world development, industry more often uses the highly inclusive and flexible concept of “privacy protection” (privacy preserving). The proponents of federated learning used the term “privacy protection” in their papers to respond to the personal data protection requirements represented by the EU’s GDPR. However, the privacy protection requirements posed by laws such as the Personal Information Protection Law of the People’s Republic of China (hereinafter the “Personal Information Protection Law”) will not be realized automatically in emerging technologies and products; they require relevant actors, including the legal community, to scrutinize, reflect on, and promote them purposefully. This article examines the actual effects and shortcomings of existing federated learning privacy protection from the perspective of legal norms, analyzes their institutional causes, and proposes corresponding legal countermeasures. Combining technology and regulation, it uses the term “privacy” to cover the various rights concerned with protecting data from accidental or intentional disclosure in order to safeguard personal dignity.
1. Legal Limitations of Federated Learning Privacy Protection
Federated learning originated as a product of the effective implementation of personal data protection laws. When Google’s technical team first proposed the concept in 2016, they described it as a distributed deep learning framework that fully considered privacy protection and was suited to training AI on modern mobile devices. The training data held on mobile devices is often privacy-sensitive or large in volume, which makes the traditional approach of uploading it to a central data center for training unsuitable. As an alternative, federated learning leaves the training data distributed across mobile devices and learns a shared model by aggregating locally computed updates, thereby solving the problem of optimizing local updates for the keyboard input method used on Android smartphones. This device-based collaborative modeling is only one form of federated learning. Its forms have gradually expanded from cross-device collaborative training to cross-institution collaborative modeling. Broadly speaking, federated learning requires that each participant’s original data remain local, without exchange or transmission, while model learning is achieved through the real-time aggregation of updates. The vision of federated learning is to make full use of data from more participants for AI development and deployment while meeting privacy protection needs. Currently, the legal community’s interest in pre-trained large language models (LLMs) such as ChatGPT far exceeds its interest in federated learning. In fact, federated learning can effectively support the steady development of LLMs. Research indicates that the stock of high-quality language data may be exhausted by 2026, while the stocks of low-quality language data and image data will gradually be depleted over the next two decades. If data usage efficiency does not improve significantly or new data sources are not found, the development of machine learning may slow and the scale of LLMs will be limited. Federated learning provides a compliant framework for fully utilizing data from various terminals and institutions, helping to break data bottlenecks while ensuring privacy protection in LLM construction. Additionally, given the communication and computational limitations on the client side, general-purpose AI can also rely on distributed machine learning such as federated learning for real-time client-side training and model updates.
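To make the aggregation mechanism described above concrete, the following minimal sketch illustrates the widely used federated averaging pattern: each client takes gradient steps on data that never leaves it, and a coordinating server only averages the resulting model parameters. The linear model, synthetic data, and function names are illustrative assumptions for exposition, not Google’s production protocol.

```python
# A minimal sketch of federated averaging: clients train locally, the server
# aggregates only model parameters, and raw data is never pooled.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local gradient steps on data that never leaves the device."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """Server aggregates local models, weighted by each client's sample count."""
    sizes = np.array([len(y) for _, y in clients])
    local_models = [local_update(global_w, X, y) for X, y in clients]
    return np.average(local_models, axis=0, weights=sizes)  # FedAvg aggregation

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):                              # three participants, data kept local
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):                             # communication rounds
    w = federated_round(w, clients)
print("learned weights:", w)                    # approaches true_w without pooling raw data
```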
In the first years after federated learning was proposed, there was great confidence in its privacy protection effectiveness. Since the EU’s privacy rules are crucial for multinational companies like Google, the formulation and implementation of EU laws such as the GDPR directly and significantly influenced the design of federated learning. Some scholars claimed: “Federated learning protects user data privacy through parameter exchange under encryption mechanisms; data and models themselves are not transmitted and cannot be reverse-engineered to infer each other’s data, so there is no possibility of leakage at the data level, nor does it violate stricter data protection laws such as the GDPR.” However, research as early as 2019 showed that participants’ training data could be inferred from model inputs and outputs and from intermediate gradients. In 2020, studies demonstrated that gradient inversion attacks could reconstruct participants’ training data. Clearly, the optimistic assessments made when federated learning was introduced have proven unfounded. In reality, attackers can exploit the structural characteristics of federated learning systems to mount data poisoning, model attacks, inference attacks, attacks on server vulnerabilities, and other attacks, some of which lead to severe privacy risks. Common privacy attacks against federated learning include model reconstruction attacks, malicious server attacks, GAN-based inference attacks, and membership inference attacks.
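The gradient-leakage findings cited above can be illustrated with a deliberately simple case: for a single fully connected layer trained on one sample, the shared weight gradient is the outer product of the error signal and the input, so the raw input can be read directly off the gradients. The toy sketch below, using arbitrary synthetic values, shows only this elementary situation; real gradient inversion attacks on deep networks proceed iteratively, but they exploit the same basic leakage.

```python
# Toy illustration of gradient leakage for a single fully connected layer:
# with one training sample, the weight gradient equals error (outer) input,
# so dividing a weight-gradient row by the bias gradient recovers the input.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                   # "private" training sample held by a client
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
target = rng.normal(size=3)

z = W @ x + b                            # forward pass of one linear layer
err = z - target                         # dLoss/dz for a squared-error loss
grad_W = np.outer(err, x)                # gradient the client would share
grad_b = err

x_reconstructed = grad_W[0] / grad_b[0]  # attacker divides one row by the bias gradient
print(np.allclose(x, x_reconstructed))   # True: the sample is recovered exactly
```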
Currently, research on the privacy risks of federated learning is still in its early stages. As this article will soon point out, the technology field often confuses the concepts of “privacy,” “data,” and “security.” This article organizes four aspects of federated learning privacy risks that are closely related to law from a normative and technical combined perspective, without delving deeply into “privacy” technical issues that are less related to law. This organization may be incomplete, but it basically covers the most pressing privacy regulatory issues. Overall, the current limitations of federated learning privacy protection at the legal normative level are mainly caused by the privacy protection path centered on personal data protection, primarily reflected in the following four aspects.
(1) Sparse Privacy Protection Norms Lead to a Lack of Privacy Requirements
Privacy requirements are system requirements related to privacy. The direct sources of privacy requirements mainly include laws, regulations, standards, best practices, and stakeholder expectations. The general process of privacy by design is to first identify privacy requirements; next, conduct a privacy risk assessment and choose appropriate privacy control methods; and finally, develop and integrate procedures. Privacy requirements are the starting point and key to the success or failure of privacy design. Unclear or incomplete privacy requirements can significantly undermine the effectiveness of privacy design. Currently, the sources of privacy requirements for federated learning are singular, primarily driven by developers’ proactive exploration and improvement of system privacy performance. This fully reflects the proactive and preventive characteristics of “privacy by design”: designers should proactively estimate potential weaknesses and possible privacy threats of the system and then choose appropriate technical and management measures to prevent relevant risks. Beyond the privacy requirements from developers and researchers, other privacy requirements in the field of federated learning are very scarce. Specifically, this is reflected in several aspects.
First, there are no specific provisions or special requirements in the law for the privacy protection of machine learning architectures like federated learning. Currently, the legal norms for the AI and big data fields operate through a combination of hard and soft laws, traditional legislation, and various standards, guidelines, and best practices. These different types of norms reference and are closely linked to each other, collectively forming the broad legal norm environment for AI. As global AI legislation is still in the exploratory stage, the most operable norms for privacy protection in existing laws are personal data protection laws. Taking the GDPR as an example, the characteristic of federated learning, where “data is available but not visible,” allows it to easily comply with the principles of accuracy, storage limitation, integrity, and confidentiality among its six principles, and it is highly compliant with the purpose limitation principle and data minimization principle, with compliance levels similar to other machine learning systems regarding the principle of lawful, fair, and transparent processing. The EU’s AI legislation does not set specific regulations for architectures like federated learning. Following the scenario-based regulatory path of that law, most scenarios applicable to federated learning fall under low-risk situations, and even in high-risk AI applications, federated learning is easier to comply with various privacy requirements of that law than other AI architectures. Under the existing regulatory framework, federated learning appears as a highly self-aware privacy protection “top student,” with the mandatory privacy requirements from laws and regulations being scarce.
Second, due to the rapid development stage of federated learning, there is still much room for improvement in the technical framework and details, and relevant consensus is yet to be formed. Even if researchers or institutions propose some evaluation standards, they are mostly provisional and exploratory. Overall, there are currently few best practices and standards for privacy protection in federated learning.
Finally, stakeholder privacy expectations are also scarce. Within the existing personal data protection law framework, data subjects have vague privacy expectations regarding federated learning. From a regulatory perspective, the norms of personality rights and privacy rights in the constitution, civil law, and other sectoral laws are too abstract, lacking operability in specific scenarios. In terms of actual effects, privacy protection has effectively been replaced by data protection. The subjects of rights corresponding to privacy rights have been downgraded to passive data subjects without agency. From a factual perspective, due to the complexity of the architecture and technology of federated learning, most ordinary data subjects cannot understand the scenarios and principles where their relevant rights may be harmed. The training and inference processes of federated learning generally protect original data better than other architectures, hence the concerns of data subjects regarding rights protection are weaker. The scarcity of privacy expectation requirements from rights subjects leads to privacy protection goals in commercial deployments of federated learning being easily sacrificed in trade-offs with data compliance, model efficiency, or network security.
(2) Ambiguous Legal Responsibilities for Privacy Protection
The determination and pursuit of legal responsibilities for federated learning privacy protection are primarily complicated by its loose “federal” relationships and unique “joint” learning processes. This is reflected in several aspects.
First, the “looseness” of federated learning in terms of subject relationships leads to difficulties in accountability. In a loose federal structure, the relationships among data holders are relatively equal. Once accountability issues arise, it is challenging to simply determine specific responsible subjects from a regulatory perspective. Taking horizontal federated learning as an example, in a client-server network structure, if privacy breaches occur in federated learning applications like Google’s Gboard, it is relatively easy to identify the major responsibility of the large company deploying the application. However, in cross-institution federated learning, the server may either be set up by the entity leading the construction of the federated learning system or be a third party trusted by all client entities. Moreover, the server controller may not bear more responsibility for privacy protection. The relationships among participants in federated learning with a peer-to-peer network structure are more akin to a loose “confederation”: participants can communicate directly without relying on a third party, enhancing security but making it more challenging to determine responsibility when issues arise.
Second, the joint-modeling character of federated learning makes it difficult to determine the legal status of its participants. Participants may not necessarily be the “data controllers” under the EU’s personal data protection law or the “personal information processors” under China’s Personal Information Protection Law. The GDPR defines a data controller as an organization or individual that determines the purposes and means of processing personal data, playing a central decision-making role in personal data processing activities and being responsible for those decisions. Clearly, the concept of data controller is closely tied to responsibility. EU data protection authorities also recognize: “Controller is a functional concept aimed at allocating responsibility based on factual influence.” Controllers must determine which data to process for which intended purposes. Participants in federated learning often do not meet these criteria. This is because the primary purpose of personal data processing in federated learning is to train models, not to obtain more information about individuals. More importantly, under the federated learning architecture personal data is available but not visible, and its utility does not depend on its movement: the data is used where it resides. That invisibility itself aligns with privacy expectations under the personal data protection framework. If federated learning participants cannot be classified as data controllers or personal information processors, it becomes impossible to allocate and pursue responsibility from the perspective of personal data law.
(3) The Personal Rights of Privacy Protection are Transformed and Weakened
The legal concept of privacy includes two key points: personal nature (autonomy, identity, and dignity) and concealment (freedom from intrusion and observation). When federated learning was proposed, it considered the requirements of personal data protection laws, but its design and engineering practices primarily reference existing cybersecurity and information security norms. The privacy protection of federated learning has actually undergone two transformations: first, it was transformed into personal data protection at the legislative level, and then at a more specific technical standard level, it was transformed into information security protection. Consequently, federated learning privacy protection overly emphasizes concealment (confidentiality) while neglecting the personal dimension.
In design and engineering implementation, privacy protection is treated as a branch of the information system security framework. For example, the international standards issued by the International Organization for Standardization (ISO) in the fields of data communication and network security are primarily security standards, with privacy standards gradually introduced on top of them, the most important being the ISO/IEC 29100 series of privacy framework standards. These privacy standards are built on security standards and security management frameworks, and most privacy requirements are ultimately realized through the implementation of security standards. Among the many standards produced by ISO/IEC JTC 1/SC 42, the subcommittee responsible for AI standardization, none is specifically dedicated to privacy protection. This also indirectly indicates that security is prioritized far above privacy in the current AI standards landscape.
The National Institute of Standards and Technology (NIST) in the United States provides detailed and comprehensive privacy controls in SP 800-53, “Security and Privacy Controls for Information Systems and Organizations,” which has a clear organization and structure. The controls are divided into 20 families, most of which are control items common to privacy and security. Revision 4 of SP 800-53 added two privacy-specific control items: individual participation and privacy authorization. However, the document’s authors believe that “without a basic foundation of information security, organizations cannot have effective privacy.” Thus, in Revision 5, issued in 2020, many control items were significantly adjusted, including replacing the individual participation and privacy authorization control items and breaking down and refining some controls from Revision 4, overall strengthening the integration of privacy and security controls. In the process, however, the demand for privacy protection has been further absorbed into security protection. Other relevant standards include the IEEE federated learning standard (IEEE 3652.1), released in 2021, as well as the security requirements and test methods (YD/T 4691-2024) and performance requirements and test methods (YD/T 4692-2024) for privacy computing federated learning products led by the China Academy of Information and Communications Technology, which focus on product research and development, evaluation, testing, and acceptance rather than rights protection and legal supervision. Some so-called international standards were originally driven by domestic enterprises (such as WeBank) and promoted for approval. Since these standards lack enforceability and are not closely related to the theme of this article, they will not be discussed further here.
More importantly, merely emphasizing security protection cannot eliminate high-harm privacy risks. Most research on federated learning security and privacy protection is based on the honest-but-curious model assumption. Some scholars have systematically studied the semi-honest security of horizontal federated learning based on this assumption. Semi-honest attackers will attempt to infer or extract other participants’ private data based on the intermediate results generated during the execution of the federated learning cryptographic security protocol while adhering to it. Due to constraints from data laws and regulations, along with the fact that malicious behavior can degrade model quality and harm the attackers’ own interests, participants in federated learning model training usually conform to the semi-honest but curious assumption and do not attempt extreme malicious attacks. The introduction of secure computing technologies like trusted execution environments can also somewhat limit the impact of such attackers, making it difficult for them to infer other participants’ private information from parameters returned by the server. However, fundamentally speaking, federated learning finds it hard to eliminate the challenges posed by the “Byzantine Generals Problem.” Once a collapse in cooperation occurs, leading to the failure of the semi-honest model assumption, especially if the central server becomes a malicious attacker, it is entirely possible to infer participant-related data through the model. More seriously, various parties may tacitly collude to obtain data from other participants through technical means. This seemingly extreme assumption is, in reality, highly realistic in terms of interests: if the participants in federated learning are not in competition, they may fully collude to circumvent legal restrictions for data exchange, illegally obtaining data while acquiring usable AI models. Due to the complexity of the federated learning architecture, external regulation of such malicious collusion behaviors is very challenging. In cross-institution federated learning, even if personal data is involved, individuals find it difficult to participate and intervene, and data subjects may unknowingly suffer privacy violations.
(4) Lack of Normative Constraints on Technical Trade-offs Leads to Uncertainty in Privacy Protection
Federated learning exhibits significant complexity at the technical level, which is primarily reflected in the following aspects. First, the types and algorithms of federated learning are complex. Federated learning can generally be divided into horizontal federated learning, vertical federated learning, and federated transfer learning. Horizontal and vertical federated learning are distinguished by how participants’ data overlap: in horizontal federated learning the participants share the same feature space but hold different samples, while in vertical federated learning they share the same samples but hold different features. Federated transfer learning combines federated learning with transfer learning. Data features and labels can vary significantly between clients and must be aligned during training. As the federated learning framework has developed, more and more traditional machine learning algorithms can be implemented on top of it, resulting in high complexity in terms of frameworks and algorithm types. Second, federated learning is a form of distributed learning, and distributed machine learning is inherently more complex than centralized learning. Moreover, federated learning faces even more severe challenges than general distributed machine learning: multi-source heterogeneous data, unstable devices, and high communication costs. Traditional machine learning assumes that training data are independent and identically distributed (IID). In federated learning, data are spread across multiple devices or servers, and the data on each device or server may come from different user groups or environments, so local distributions are inconsistent; this contradicts the identical-distribution requirement that all samples come from the same distribution. In addition, data on different devices may be correlated: for example, data on a user’s phone and computer are related even though they are collected in different environments and exhibit different feature distributions, and correlations driven by region or user behavior violate the independence requirement. Federated data therefore typically satisfies neither the independence nor the identical-distribution assumption. Furthermore, to protect user privacy, federated learning typically requires techniques such as differential privacy, and applying these techniques to non-IID data is more complex. This high complexity forces developers to make trade-offs among multiple objectives and technical parameters. However, these trade-offs currently lack substantive legal constraints, so privacy-protecting technical indicators are easily sacrificed in them, increasing the uncertainty of federated learning privacy protection at the normative level. This uncertainty arising from trade-offs is particularly pronounced in differential privacy.
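For illustration, the non-IID problem described above can be reduced to a simple label-skew example: when each participant holds only some classes, no local dataset resembles the global distribution that centralized training presupposes. The sketch below uses synthetic labels and an arbitrary three-client split purely to make the point visible.

```python
# A small illustration of label-skewed (non-IID) data: each client holds only
# some classes, so local distributions differ sharply from the global one.
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
labels = rng.integers(0, 10, size=6000).tolist()   # global pool: 10 classes, roughly uniform

clients = {f"client_{i}": [] for i in range(3)}
for y in labels:
    owner = 0 if y <= 3 else (1 if y <= 6 else 2)  # classes 0-3, 4-6, 7-9 on different clients
    clients[f"client_{owner}"].append(y)

print("global distribution:", dict(sorted(Counter(labels).items())))
for name, ys in clients.items():
    print(name, dict(sorted(Counter(ys).items())))  # each local distribution is heavily skewed
```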
Differential privacy was formally proposed by Cynthia Dwork in 2006 with rigorous mathematical guarantees. In essence, it is a perturbation technique: noise is added so that statistics computed from the perturbed data are hard to distinguish from those computed from the original data. Perturbation techniques are simple and efficient but vulnerable to probabilistic attacks, which creates a dilemma: too much noise severely impairs learning accuracy and efficiency, while too little fails to protect privacy. The typical form is known as ε-differential privacy, where ε is called the privacy budget. When the privacy budget ε is sufficiently small, privacy protection is strong but data utility is low and machine learning performance suffers; increasing ε produces the opposite situation. The privacy budget thus serves as a trade-off parameter between efficiency and privacy protection. Different ways of deploying differential privacy yield markedly different utility, performance, and privacy loss, requiring trade-offs based on the needs and goals at hand. Moreover, the aspects that must be balanced in federated learning extend beyond differential privacy. For example, the privacy requirements of horizontal federated learning can be implemented through secret sharing, key agreement, authenticated encryption, or homomorphic encryption, but these impose substantial computational and communication overhead, so in practice a weakened, semi-honest level of privacy security is often implemented through differential privacy instead. This lack of normatively constrained trade-offs increases the uncertainty of federated learning privacy protection.
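The role of the privacy budget can be made concrete with the classic Laplace mechanism applied to a counting query, where the noise scale is the query’s sensitivity divided by ε. The sketch below uses arbitrary numbers solely to show the direction of the trade-off; it is not a deployment-grade implementation of differential privacy for federated learning.

```python
# The Laplace mechanism on a counting query (sensitivity 1): smaller epsilon
# means more noise and stronger protection but worse utility, and vice versa.
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=np.random.default_rng(3)):
    """Release a count perturbed with Laplace noise calibrated to epsilon."""
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 1000                                  # e.g. number of records matching a query
for eps in (0.01, 0.1, 1.0, 10.0):
    noisy = np.array([laplace_count(true_count, eps) for _ in range(1000)])
    print(f"epsilon={eps:<5} mean abs error = {np.mean(np.abs(noisy - true_count)):.1f}")
# Small epsilon: strong protection, large error; large epsilon: weak protection, small error.
```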
2. The Specificity of AI Privacy Protection
AI has a significant degree of uncertainty regarding its impact on various rights, including privacy rights. Personal data protection laws, represented by the GDPR, are constructed to address the characteristics of technology and industry related to computers, the internet, and big data. Personal data protection laws effectively achieve privacy protection under big data conditions. This easily leads to a misconception that personal data protection laws can fully handle AI’s privacy protection. However, based on the problems in federated learning privacy protection, personal data protection laws cannot adequately adapt to the challenges of AI privacy protection. In summary, the legal defects of federated learning privacy protection reveal the gap between AI privacy protection and personal data protection, highlighting the characteristics of AI privacy protection.
(1) Gap in Protected Objects
From the nature and development trends of AI, personal data protection laws cannot capture the most crucial factors in AI privacy protection. The goal of AI is to approach the overall intelligence level of humans rather than individual intelligent performance; thus, it focuses on the statistical characteristics of the training dataset as a whole. The essence of machine learning is to automatically learn patterns from limited data and use those patterns to make predictions on unknown data. From both essence and goal, AI inherently relies more on public data than personal data and has no real interest in acquiring personal data. With the promotion of AI applications, future AI development may increasingly utilize data generated through data augmentation techniques and data produced by humans using AI systems. In the traditional sense, personal data has limited value for AI.
Federated learning conducts model training using jointly sourced data, effectively breaking down data silos and learning the overall characteristics of human behavior more comprehensively. Although the weights and gradients transmitted during federated learning model training still reflect data characteristics, various types of training data, including personal data, do not leave the local databases of data controllers. In a broader sense, the training data used in mainstream machine learning technologies of current AI is merely the raw material for “alchemy”; the final product is the model rather than new personal data. There is no strict mapping between the model and personal data. Even if an attacker obtains training data through technical means, what is leaked is often the so-called “privacy data” at the participant level, i.e., data that participants have the right to store and process but do not have the right or willingness to share. These data have already been pre-processed before being used for machine learning training to allow the machine to learn valuable features. The pre-processed dataset significantly differs from traditional personal data. For an attacker to restore identifiable data about specific individuals, they often require additional data and technical conditions. Therefore, from the perspective of compliance with personal data protection laws, regulating privacy protection in machine learning lacks an appropriate “normative interface.”
The various large models representing AI development trends are essentially machine learning, and their normative connection to personal data is tenuous. For example, current mainstream LLMs are trained using vast public datasets. While LLM developers typically disclose the composition of project datasets and data collection standards, they often remain vague about data sources, dataset sizes, and data labeling counts. Although OpenAI has faced scrutiny over whether it illegally used personal data during training, the vast majority of this data was obtained through public channels, diverging significantly from the general concept of personal data. As a powerful open-source LLM project parallel to OpenAI’s closed-source projects, Meta’s LLaMA also only utilized publicly available data for training. Of course, the application inference phase of LLMs in specific domains involves data protection and privacy issues. If users upload data involving privacy to the model during use, it may lead to privacy leaks. However, this is not the responsibility of developers and companies but rather a result of user negligence. From the essential characteristics of data usage in federated learning to the low dependency of LLMs like ChatGPT on personal data, data regulations represented by personal data protection laws are not entirely suitable for the industrial logic of AI, and the space for regulating AI privacy protection through personal data laws is also gradually tightening.
(2) Gap in Protection Focus
In the process of federated learning expanding from early cross-device forms to cross-institution forms, the meaning of “privacy protection” has undergone a clear transformation: system security has become the overwhelming focus of protection, while the personal interests of natural persons have been neglected. From a legal perspective, only individuals with personhood can discuss privacy, but in the realm of federated learning, the rights holders for privacy protection have gradually been replaced by the various institutional entities participating in the training. When Google initially proposed the federated learning scheme, it targeted mobile terminal input method users, who are natural persons, and the rights holders for privacy protection were also natural persons. However, cross-institution federated learning primarily protects the various security rights of participating institutions, which are often generically referred to as privacy protection. This has caused a significant shift in the focus of privacy protection, transforming “privacy protection” into confidentiality and security protection.
There are clear differences between security protection and privacy protection in the field of AI. First, the protection goals differ. The ultimate goal of privacy protection is personal rights: creating barriers and managing boundaries to protect individuals from unnecessary interference, thereby promoting their autonomous development and dignity. The goal of security protection is to ensure the confidentiality, integrity, and availability of systems. Second, the actors to be guarded against differ. Security protection primarily guards against unauthorized external actors, while privacy protection must guard against both internal and external actors. Third, the types of attacks differ. AI privacy protection is more closely related to confidentiality, with attacks on confidentiality including reconstruction attacks, model inversion attacks, and membership inference attacks. AI security protection is more related to integrity and availability, and guards against poisoning attacks, adversarial attacks, and query attacks. Fourth, the defensive methods differ. Defensive methods for privacy protection include secure multi-party computation, homomorphic encryption, and differential privacy, while defensive methods for security protection include defensive distillation, adversarial training, and regularization.
In the absence of specialized norms, AI privacy protection still primarily relies on information security and cybersecurity norms. Specialized privacy norms and standards in AI will also form path dependencies on existing security norms and standards. Although privacy is closely related to network information security, there are essential differences. Privacy protection has been compressed through personal data protection and security protection, gradually distancing itself from its focus on personal rights, overly emphasizing “concealment” or confidentiality goals. The goal of personal rights in privacy protection has been weakened and shifted.
(3) Gap in Protection Responsibility
Personal data protection laws represented by the GDPR have created a series of subject concepts centered on “data controllers,” with the distinction standards for these subjects being the sharing of responsibilities for personal data processing. Personal data protection laws adopt a relatively simple framework for describing data flow processes, including data collection, storage, processing, transmission, portability, and deletion. However, this relatively simple method of responsibility allocation is no longer suitable for the complex scenarios of AI privacy protection. Although current machine learning is data-driven, data is not the primary contradiction in AI privacy protection. Taking federated learning as an example, since data does not flow or migrate in the legal sense, the responsibilities of participants cannot be delineated and determined using data processing flow. In this situation, continuing to use the responsibility allocation norms of personal data protection laws may lead to such laws becoming a compliance barrier for AI projects to evade privacy protection obligations. Therefore, it is challenging to clarify the responsibilities for privacy protection in AI projects from a technical and normative perspective based on data. Correspondingly, confidentiality or security responsibilities for data cannot fully cover the responsibilities for AI privacy protection. This creates a gap between personal data protection responsibilities and AI privacy protection responsibilities. The concepts of responsible subjects such as data controllers and processors in big data scenarios are no longer very suitable for the responsibility allocation in AI scenarios.
Federated learning has rich forms and complex technologies, and it plays an important role in fields such as finance and healthcare, but most of the time it operates silently as a foundational architecture, unlike large models like ChatGPT and Sora that easily attract attention and anxiety. The public’s awareness of the use of federated learning and its impact on various important rights is also insufficient. However, in terms of the breadth and depth of actual impact, federated learning is no less significant for ordinary people than foundational models, necessitating risk prevention and responsibility regulation. Yet, the loose network structure and joint training model of federated learning make it exceptionally challenging to allocate responsibility among participants from both technical and normative perspectives. To achieve privacy protection in models like federated learning, it is necessary to establish relatively specific responsibility norms in conjunction with technical characteristics.
(4) Gap in Protection Framework
Machine learning is fundamentally the process of computers transforming data into knowledge and intelligence, with the transformation method being “computation.” From a practical perspective, AI is based on computing and pattern recognition implemented by computers. Pattern recognition can also be reduced to a broad computational problem. Of course, without data, machine learning cannot develop, but overly emphasizing the importance of data makes it difficult to truly understand the essence of AI privacy protection or accurately describe the process of AI privacy protection. However, if one thinks from the perspective of data rather than computation, it becomes impossible to understand the “trade-off” characteristics of current AI privacy protection: as long as one does not care about computational costs and availability, privacy can be perfectly protected. If computational factors are not considered, existing cryptographic techniques, such as homomorphic encryption, not only have high accuracy but can also eliminate the trade-off between data utility and data privacy. However, the computational costs of homomorphic encryption are exceptionally high. Research indicates that under typical application scenarios and parameter settings, the most advanced fully homomorphic encryption schemes are several thousand times less efficient than multi-party secure computation. If not for considerations of reducing computational costs and improving economic efficiency, one could comprehensively protect privacy using cryptographic techniques. It is precisely the trade-offs between computational and communication costs that lead to increased privacy risks. However, these trade-offs are realistic and necessary; otherwise, some AI projects may remain at the mathematical principle level and find it difficult to be implemented in engineering.
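To make the cost argument tangible, the toy sketch below implements additively homomorphic (Paillier-style) encryption with deliberately tiny, insecure parameters: sums can be computed directly on ciphertexts without ever decrypting them. Real deployments use keys thousands of bits long, and fully homomorphic schemes that also support multiplication are costlier still, which is precisely the computational burden the trade-offs described above seek to avoid.

```python
# Toy Paillier cryptosystem with insecure, tiny parameters, for illustration only:
# multiplying two ciphertexts yields an encryption of the sum of the plaintexts.
from math import gcd
import random

p, q = 17, 19                                   # toy primes; real keys use >=1024-bit primes
n = p * q                                       # public modulus
n2 = n * n
g = n + 1                                       # standard choice of generator
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)    # lcm(p-1, q-1), the private key
mu = pow(lam, -1, n)                            # modular inverse, valid because g = n + 1

def encrypt(m):
    """Encrypt an integer 0 <= m < n under the toy public key (n, g)."""
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Decrypt with the toy private key (lam, mu)."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

a, b = 42, 77
c_sum = (encrypt(a) * encrypt(b)) % n2          # multiply ciphertexts...
print(decrypt(c_sum))                           # ...to add plaintexts: prints 119
```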
Since computation is the key to transforming data into information, knowledge, and intelligence, AI privacy protection can only be comprehensively grasped from a computational perspective. From a computational viewpoint, the training and inference (application) processes of AI exhibit very different data characteristics and significant differences in their implications for privacy protection. For instance, in federated learning, the training process is more complex, and privacy risks are more pronounced. In client-side architectures, the aggregation function is often provided by trusted computing platforms. Participants in horizontal federated learning all possess complete models, thus allowing local inference. During the inference process, appropriate technologies can be employed for privacy protection. In vertical federated learning, since participants do not have complete models, the inference process still requires computation and coordination through trusted third parties, making privacy protection more challenging.
Therefore, only by centering on computation and starting from the overall system can one fully describe and comprehensively standardize AI privacy protection issues. This is also the fundamental reason for the rise of the concept of “privacy computing.” Privacy computing is not a single technology but a collective term for technologies that conduct data computation and analysis while protecting privacy. Secure multi-party computation, federated learning, and trusted execution environments are currently the main privacy computing technical solutions. Each of these three solutions has its advantages and can complement each other in certain application practices. From an AI perspective, the federated learning framework can integrate the strengths of secure multi-party computation and trusted execution environments. The concept of privacy computing fully reflects the trend of privacy protection’s development at the technical level: describing and protecting privacy more realistically from a computational perspective.
3. Directions for Improving AI Privacy Protection Norms
AI privacy protection and personal data protection are complementary rather than mutually exclusive. This is also the rationale behind the EU’s Artificial Intelligence Act grounding its data norms in personal data rights. The explanatory section of the Artificial Intelligence Act states: “If this regulation contains specific rules for protecting individuals in the processing of personal data, involving restrictions on the use of artificial intelligence systems for remote biometric identification for law enforcement purposes, risk assessments of individuals using artificial intelligence systems for law enforcement purposes, and biometric categorization of individuals using artificial intelligence systems for law enforcement purposes, then for these specific rules, this regulation should be based on Article 16 of the Treaty on the Functioning of the European Union.” The issuance of the Artificial Intelligence Act is merely the beginning of AI legal norms, not the end. This is partly because legislation not only regulates and supports the development of AI in the EU but also implicitly aims to resist the monopolization of the AI industry by the US in Europe. Additionally, the uncertainty of AI makes any legal regulation plan inherently incomplete. From a practical effects perspective, the US, which leads in the AI field, actually adopts a very lenient regulatory policy and legal framework. The ongoing AI legislation in the EU has yet to sufficiently address some significant differences between AI privacy protection and personal data protection. China’s current AI legislation, which has already been initiated, can draw on the experiences of Europe and the US but does not need to follow them blindly. Overall, if China’s privacy protection norms make moderate adjustments, it can make leading contributions to the field of AI privacy protection.
(1) Integrating Normative Bases
From an institutional environment perspective, the current global technology legislation focus has shifted from data law and competition law to AI legislation. AI has become the basic institutional environment for privacy rights protection. Data legislation represented by personal data protection laws emerged in the early development of the internet, represented by the EU’s 1995 Data Protection Directive, which also includes a series of norms related to data security and usage. With the rapid development of the internet industry and technology, data laws have gradually upgraded but have always been in a reactive state. To reverse this situation, the EU introduced the more comprehensive and suitable GDPR for the big data era. However, it has proven that data protection laws cannot effectively adjust the relevant objects. Therefore, competition laws represented by the EU’s Digital Markets Act and Digital Services Act have emerged to fill the gaps in data laws. However, as AI becomes the leading technology driving industrial development, the framework of data legislation combined with competition legislation can no longer effectively regulate relevant industries and activities. Specialized AI legislation represented by the EU has emerged accordingly. In this context, purely relying on data law, especially personal data protection law, as the main normative basis for protecting privacy rights is no longer aligned with reality. With the rapid development of technologies such as federated learning and blockchain, current data protection methods appear increasingly outdated and stale. Although the enactment of China’s Data Security Law and Personal Information Protection Law is of great significance, if it aims to effectively protect privacy in the AI field, the existing Personal Information Protection Law must be adjusted.
In the short term, the regulatory integration role of “privacy protection” should be fully utilized. Since data protection cannot replace the need for privacy, “privacy protection” has become an effective concept for integrating various privacy norms and technologies. This has been first practiced in the pragmatic United States. On March 31, 2023, the White House Office of Science and Technology Policy (OSTP) released the National Strategy to Advance Privacy-Preserving Data Sharing and Analytics, which proposed a privacy-preserving data sharing and analytics strategy (PPDSA). This strategy established four guiding pillars, representing the foundation of its privacy and data approach: carefully crafted PPDSA technologies that protect citizen rights; promoting innovation while ensuring equality; establishing technologies with accountability mechanisms; and minimizing risks for vulnerable groups. More importantly, PPDSA uses the concept of “privacy protection” to integrate various social, legal, and technological means. In the context of stagnant privacy legislation, privacy protection, as an integrative concept, possesses both technical operability and normative necessity.
In the medium term, legal distinctions between data, information, and privacy should be carefully drawn in response to the realities of the AI industry. In the era of big data, privacy protection norms often treated data as the core element of privacy protection. However, federated learning suggests that a narrow focus on data may cause the privacy protection goals of AI systems to be overlooked. This is also evident in the field of LLMs. In AI scenarios, data, information, and privacy can be transformed into one another through “computation” and are dynamically correlated. Boundaries between data, information, and privacy that could remain blurred in the big data era increasingly need to be drawn clearly in the AI era. China’s existing legislation also uses the terms “data,” “information,” and “privacy” interchangeably. China’s AI legislation can innovatively distinguish between data, information, and privacy in AI scenarios according to technological realities. In the long term, the role of privacy rights should be emphasized to enhance the effectiveness of privacy protection and return to the original intention of protecting personal rights. After the rise of personal data protection laws, the importance of privacy rights has significantly diminished, showing a clear trend of “taking a back seat.” In the big data and AI industries, privacy rights have been fragmented into several rights related to personal data protection. Amid rapid technological and industrial development, the normative content and realization paths of privacy rights appear obscure. Fundamentally, however, privacy rights are personality rights bound up with the dignity and development of individuals, and the AI era must continue to promote human dignity. As industrial technologies gradually stabilize and normative exploration continues, future efforts should leverage the concept of privacy rights to strengthen AI privacy protection through more integrated legal norms.
(2) Exploring Accountability Mechanisms
The accountability issues of AI privacy protection are highly complex. Especially for AI projects like federated learning involving numerous participants and loose architectures, accountability is challenging both technically and legally. From a technical standpoint, the key to privacy protection accountability auditing lies in explanation. The explanation of AI is aimed at addressing theoretical flaws, application defects, and regulatory requirements. The goal of explanation is not only to coordinate relationships between humans and machines but also to strengthen communication and trust among AI participants. The accountability audit of inherently interpretable models is relatively simple, but complex models like federated learning are not inherently interpretable; these complex models usually obtain interpretability through post-hoc analysis methods. Common post-hoc analysis methods for complex models include partial dependence plots, accumulated local effects plots, LIME, SHAP, and others.
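The intuition behind such post-hoc methods can be conveyed with a simplified local-surrogate sketch in the spirit of LIME: the black-box model is probed with perturbed samples around a single instance, and a weighted linear model fitted to those probes yields local feature attributions that an auditor can inspect. The function and values below are illustrative assumptions, not the full LIME algorithm or any particular library’s API.

```python
# A simplified local-surrogate explanation: probe a black-box model around one
# instance and fit a weighted linear model whose coefficients serve as local
# feature attributions.
import numpy as np

def black_box(X):                                  # stands in for an opaque model
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2 + X[:, 2]

def local_explanation(model, x, n_samples=500, scale=0.1, rng=np.random.default_rng(4)):
    """Fit a distance-weighted linear surrogate around x; return its coefficients."""
    Z = x + scale * rng.normal(size=(n_samples, x.size))   # perturb around the instance
    y = model(Z)
    weights = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * scale ** 2))  # closer = heavier
    A = np.hstack([Z, np.ones((n_samples, 1))])            # add intercept column
    W = np.diag(weights)
    coef = np.linalg.lstsq(W @ A, W @ y, rcond=None)[0]
    return coef[:-1]                                        # per-feature local attributions

x0 = np.array([1.0, 2.0, 0.5])
print(local_explanation(black_box, x0))   # roughly [3, -8, 1]: the model's local gradient
```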
In addition to interpretability, the accountability auditing issues of AI privacy protection can also be realized through blockchain technology. Blockchain is not a single technology but a combination of P2P dynamic networking, cryptographic-based shared ledgers, consensus mechanisms, and smart contracts. The greatest significance of blockchain lies in building a trusted network that enables peer-to-peer value transfer without centralized endorsement. Since any actions and data changes on the blockchain are faithfully recorded, this undoubtedly provides strong support for various regulatory auditing tasks. The integration of blockchain with federated learning can not only further optimize the architecture and processes of federated learning but also enhance the incentive and regulatory audit mechanisms of federated learning. Of course, the integration of blockchain and AI is still in the exploratory stage, and the deployment costs are relatively high, but the advantages of blockchain AI in privacy protection accountability mechanisms are undeniable.
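At its simplest, the audit capability described here rests on hash chaining: each record of a training round references a hash of the previous record, so any retroactive tampering with the logged history becomes detectable. The sketch below is a minimal single-node illustration of this idea; the record fields are hypothetical, and a real deployment would add consensus and distributed replication.

```python
# A hash-chained audit log for training-round records: any later modification
# of an earlier record breaks the chain and is detected by verify().
import hashlib, json, time

class AuditChain:
    def __init__(self):
        self.blocks = [{"index": 0, "prev_hash": "0" * 64, "record": "genesis"}]

    def _hash(self, block):
        return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

    def append(self, record):
        prev = self.blocks[-1]
        self.blocks.append({"index": prev["index"] + 1,
                            "prev_hash": self._hash(prev),
                            "timestamp": time.time(),
                            "record": record})

    def verify(self):
        return all(self.blocks[i]["prev_hash"] == self._hash(self.blocks[i - 1])
                   for i in range(1, len(self.blocks)))

chain = AuditChain()
chain.append({"round": 1, "update_digest": "digest-r1", "clients": 3})
chain.append({"round": 2, "update_digest": "digest-r2", "clients": 3})
print(chain.verify())                        # True: history is consistent
chain.blocks[1]["record"]["clients"] = 99    # tamper with the recorded history
print(chain.verify())                        # False: the audit trail exposes the change
```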
From a normative perspective, both the EU and the US currently adopt a risk-based approach. However, risks themselves are difficult to quantify. The focus of the AI accountability mechanism should be to develop a flexible system of responsible subjects, accountability principles, and responsibility frameworks. In this regard, the EU’s experience can be drawn upon.
1. Responsible Subjects
Currently, the Organization for Economic Cooperation and Development (OECD), the EU, and the US are gradually adopting a converging system of responsible subject concepts adapted to AI, particularly the responsible subject descriptions in the EU’s Artificial Intelligence Act, which are more detailed and complete, warranting reference. The Artificial Intelligence Act adopts a framework of subjects and responsibilities that is completely different from that of personal data protection laws. It no longer centers on data but defines a complete set of subject concepts based on the characteristics of AI in products, services, and markets, detailing their responsibilities. The subject concepts in the AI Act use the term “actor”; the core participants are “providers” and “deployers.” A “provider” is defined as any natural person or legal entity, public authority, agency, or other body that develops an AI system or general-purpose AI model, or has developed an AI system or general-purpose AI model and places it on the market or provides it under their name or trademark, whether for payment or free of charge. A “deployer” is any natural person or legal entity, public authority, agency, or other body that uses an AI system under their authorization. Besides providers and deployers, the AI Act also includes a broader concept of “operators,” which encompasses providers, deployers, authorized representatives, importers, and distributors. Clearly, compared to the data-centric perspective of the GDPR, the AI Act provides a more detailed description of the responsibilities of various subjects in the AI field from a broader market perspective.
2. Accountability Principles
The obligations and responsibilities of participants in the AI Act are not vague but are described very specifically in conjunction with the development, deployment, and use processes of AI systems, proposing three types of accountability principles: risk-corresponding responsibility principle, fair burden-sharing responsibility principle, and goal-proportional responsibility principle.
The risk-corresponding responsibility principle states that different risks associated with AI systems correspond to different responsibilities. The AI Act classifies AI systems based on a “risk-based approach” and proposes different requirements and obligations. AI systems with “unacceptable” risks are prohibited in the EU. “High-risk” AI systems must meet a series of requirements and obligations to enter the EU market. “Limited risk” AI systems are mainly subject to very light transparency obligations. “Low-risk or minimal risk” AI systems can be developed and used in the EU without complying with any additional legal obligations. The scope of high-risk AI systems has gradually expanded during the drafting of the AI Act. Article 6 of Chapter 3 of the AI Act describes the criteria for defining “high-risk” AI systems. Articles 16 to 27 of the AI Act stipulate the obligations that participants of high-risk AI systems must fulfill, which also constitute important grounds for distinguishing and pursuing responsibility. To ensure that the responsibilities of high-risk AI systems are not overlooked, the explanatory section of the AI Act indicates that it is appropriate for specific natural persons or legal entities defined as providers to bear the responsibility for placing high-risk AI systems on the market or into service, regardless of whether those individuals or entities designed or developed the system.
The fair burden-sharing responsibility principle states that responsibilities should be shared fairly along the AI value chain. The explanatory section of the AI Act provides very cautious, pragmatic, and detailed positions on various issues related to general-purpose AI models that may significantly affect safety and fundamental rights. To ensure that responsibilities are shared fairly along the AI value chain, general-purpose AI models must meet the proportionate and more specific requirements and obligations stipulated by the AI Act, including assessing and mitigating potential risks and dangers through appropriate design, testing, and analysis; implementing data governance measures; complying with technical design requirements; and meeting environmental standards.
The goal-proportional responsibility principle states that specific requirements and responsibilities must be imposed on certain AI systems to achieve the goals of the AI Act. Although the AI Act stipulates very specific requirements and obligations for general-purpose AI models, this does not mean that all AI systems using general-purpose AI models are necessarily high-risk; rather, they are subjected to more specific requirements to achieve the goals of the AI Act. This implicitly indicates that the principles of responsibility allocation for AI systems are not only based on the value chain but also include the principle of proportionality in bearing corresponding responsibilities to achieve the goals of the AI Act.
China's Measures for the Administration of Generative AI Services, issued in 2023, likewise employs the concept of "generative AI service providers." However, the Measures describe other related subjects too simply: apart from "generative AI service users," they contain no specific subjects such as "deployers," "authorized representatives," "importers," or "distributors," nor any general umbrella term such as "operator." The Basic Requirements for the Security of Generative AI Services, released by the National Cybersecurity Standardization Technical Committee in February 2024, likewise describes the relevant subject only as the "service provider." China's AI legislation should adopt a more comprehensive set of responsible-subject concepts and explore appropriate accountability mechanisms. As accountability theories and norms improve, the challenges of assigning responsibility for AI privacy protection will gradually be resolved.
3. Responsibility Framework
Attempting to cover AI's privacy protection responsibilities and obligations with a single simple process framework is wishful thinking; yet without any description of the privacy protection process, responsibilities for highly complex AI systems cannot be allocated reasonably. A sensible approach is a dual-layer responsibility framework: prescribe general principles for the responsibilities of AI systems, while providing specialized, specific rules for particular types of AI technology solutions. This legislative approach may look like a stopgap, but it matches the complexity and current developmental state of AI. The EU's AI legislation has already adopted such a dual-layer responsibility framework. The AI Act distinguishes between AI systems and AI models: although models are essential components of systems, they do not by themselves constitute systems, which require additional components. The AI Act lays down principled rules for the responsibilities of AI systems based on the general processes of development, training, deployment, and application. At the same time, its explanatory section draws on the characteristics of development, deployment, and use to impose very specific requirements and obligations on participants in general-purpose AI models. The AI Act also defines the "systemic risks" of general-purpose AI models: risks specific to the high-impact capabilities of such models that significantly affect the internal market, present actual or reasonably foreseeable negative impacts on public health, safety, fundamental rights, or society as a whole, and can propagate at scale across the value chain. Articles 53 and 54 of the AI Act stipulate the obligations of providers of general-purpose AI models, while Article 55 stipulates the obligations of providers of general-purpose AI models with systemic risks.
This dual-layer framework enhances the flexibility and adaptability of AI legislation. However, the dual-layer framework of the EU’s AI Act is not exhaustive. In reality, the reason for specifically regulating general-purpose AI models is that they possess significant influence and can lead to greater risks. In the future, it is entirely possible for other AI technology solutions with significant influence to emerge, and existing technologies may also possess significant influence due to expanded deployment and application scales. It is entirely feasible to use influence as a criterion to determine which AI technology solutions require specialized regulation through stricter processes and to stipulate more targeted obligations and responsibilities, thereby forming a more complete dual-layer responsibility framework.
(4) Adjusting Regulatory Priorities
The demands of privacy protection in AI are difficult to realize fully within a security framework. AI legislation needs to distinguish security needs from privacy protection needs, allowing privacy protection to return to its essence as a personal right. Achieving this requires a systems perspective and engineering concepts, so that the elements of privacy protection can be regulated comprehensively around "computation."
AI privacy protection must emphasize a systems perspective. In particular, it cannot focus solely on data; it should consider all the elements across the entire development and application cycle of AI systems, direct attention to how privacy protection needs are actually realized, and regulate the most important stages appropriately. The guiding role of law for technical standards, guidelines, and best practices should be fully recognized, so that feasible AI privacy protection standards can be formulated under legal guidance.
AI privacy protection should also emphasize engineering concepts. The privacy-by-design principles proposed in the 1990s faced a technological and industrial environment very different from today's, and the processes for developing AI products and services differ significantly from those of the big data industry. Although privacy by design emphasizes regulation across the entire lifecycle, its focus remains on design, which limits how far it can reach into the development of more complex AI systems and makes it difficult to ensure that privacy needs are fully realized. AI technologies such as deep learning contain many "black box" elements, and design requirements must be adjusted through engineering before they can be realized step by step. In this context, privacy engineering can play a greater role. Privacy engineering focuses on implementing a set of technologies to reduce privacy risks and to enable organizations to make purposeful decisions about resource allocation and the effective implementation of controls in information systems. In principle, privacy design should precede privacy engineering; in practice, however, as some scholars have pointed out, privacy engineering typically encompasses all privacy-related activities throughout the system development lifecycle.
Guided by systems and engineering perspectives, AI legislation should describe technology and product processes centered on "computation" rather than "data," and on that basis regulate, in a targeted way, the training and inference processes in AI that implicate privacy. In the training phase, beyond the legal basis for data processing, legislation should also guide the privacy protection responsibilities of AI providers, encouraging them to attend to personal privacy risks that lie beyond data compliance. In the inference and application phase, it is essential to further open channels for individuals to participate in AI projects and to provide them with more tools and means to safeguard their privacy rights.
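As a concrete illustration of the kind of training-phase control that a "computation"-centered norm could require providers to document and audit, the sketch below shows clipping each participant's model update and adding calibrated Gaussian noise before aggregation, in the spirit of differentially private federated averaging. It is a simplified NumPy illustration under assumed parameters, not the author's proposal or any statutory requirement; a real deployment would derive the noise scale from a formal privacy budget.

```python
# Simplified illustration of a documentable training-phase privacy control:
# clip each participant's model update, then add Gaussian noise to the aggregate.
# Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)


def clip_update(update, clip_norm=1.0):
    """Bound each participant's influence by limiting the L2 norm of its update."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))


def aggregate_with_noise(updates, clip_norm=1.0, noise_multiplier=1.0):
    """Average the clipped updates and add Gaussian noise scaled to the clipping bound."""
    clipped = [clip_update(u, clip_norm) for u in updates]
    mean_update = np.mean(clipped, axis=0)
    noise_std = noise_multiplier * clip_norm / len(updates)
    return mean_update + rng.normal(0.0, noise_std, size=mean_update.shape)


# Three hypothetical participants' raw model updates (e.g., gradient deltas).
updates = [rng.normal(0, 1, size=10) for _ in range(3)]
noisy_global_update = aggregate_with_noise(updates, clip_norm=1.0, noise_multiplier=1.0)
print(noisy_global_update)
```

Controls of this kind are auditable precisely because their parameters, such as the clipping bound and noise multiplier, can be recorded and reviewed, which is what a computation-centered regulatory description would make visible.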
4. Conclusion
Former European Data Protection Supervisor Giovanni Buttarelli pointed out that the privacy paradox is not a contradiction between people’s hidden and exposed needs but rather that we have not found the right methods to address the possibilities and vulnerabilities brought about by rapid digitalization. Compared to big data, the intense transformation brought about by AI technology has a more profound impact on human society. However, the current response to AI risks exhibits an excessive focus on “distant worries” while lacking attention to “immediate concerns.” Various crisis theories, such as AI potentially replacing or controlling humans, abound. At the same time, there is a severe lack of attention to the privacy risks posed by the ongoing impact of AI. People have unreasonably assumed that the technologies and norms of the big data era can smoothly resolve the privacy issues of AI. Research indicates that when humans make decisions in states of uncertainty without any information on probability distributions, the brain regions activated are the prefrontal cortex and the amygdala. The former stimulates planning and inhibition of instincts, while the latter evokes fear emotions. The current cognitive biases of the public towards AI largely stem from emotional responses to extreme uncertainty. For ordinary individuals, this may be understandable. However, scholars should not follow the tide, nor should they amplify and misuse negative emotions. Legal research in the VUCA (Volatility, Uncertainty, Complexity, Ambiguity) era should overcome emotional barriers, surmount cognitive obstacles, and explore the stability and foundation of norms while understanding industry and technology dynamics.
Source: Chinese and Foreign Law Studies, 2025, Issue 1