Abstract
The global issue of AI-related copyright infringement is currently the topic of much heated debate and litigation, leading to landmark decisions[1].
While this issue has been extensively explored in many publications[2], the implications for the protection of personal data of the underlying characteristics of AI platforms, from which the copyright-related issues arise, seem to have been overlooked.
This paper addresses these implications by providing an overview of the technological reasons why AI is intrinsically a factor of personal data breaches, and why this situation will become increasingly acute and will not easily be solved by law, in spite of the considerable efforts being deployed by governments to tackle AI-related data privacy issues both in the European Union and abroad[3].
AI platforms increasingly act as repositories of personal data
As with copyright-related issues, the source of the problem is the use of sensitive data as training material for AI platforms.
Training a modern-day AI requires its developer to supply astronomical amounts of data from a variety of sources[4], and the nature of the underlying technology leaves AI developers no possibility of doing otherwise.
The “self-improving through self-learning” aspect of modern-day AI platforms, which differentiates them from previous generations of the technology, is enabled by this use of massive amounts of data in order to feed the embedded artificial neural networks.
The quality and quantity of the labeled training data have a direct impact on the ability of the neural network to update its parameters in order to minimize the defined loss function, thereby allowing the AI to generalize to previously unseen data.
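The parameter update described above can be illustrated with a deliberately tiny sketch. It is not the training procedure of any real AI platform: it assumes a model with a single parameter w, a mean squared error loss and plain gradient descent, and the function names, learning rate and sample data are hypothetical choices made for the illustration.

```python
# Toy illustration (assumed setup, not a real platform's training loop):
# one parameter w, prediction w * x, mean squared error loss.

def mse(w, samples):
    """Loss: mean squared difference between the prediction w * x and the label y."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

def train(samples, lr=0.01, epochs=200):
    """Repeatedly move w against the gradient of the loss."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w

# Hypothetical labeled training data following y = 2 * x.
labeled = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = train(labeled)

# Generalization: a prediction for an input absent from the training data.
unseen_prediction = w * 5.0
```

With clean, consistently labeled samples the loss falls close to zero and the prediction for the unseen input 5.0 lands close to 10. With mislabeled or very scarce samples, the same loop would settle on a parameter that generalizes poorly.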
The underlying datasets of AI-based solutions increasingly embed large quantities of non-sanitized or non-anonymized personal data.
This can be because the use of personal data is key for the AI to adapt its response to the individual subject, which happens across a wide range of applications, from medical diagnosis (which cannot be accurate without knowledge of the patient’s full medical data) to customer-facing applications such as chatbots (which increasingly aim to provide a personalized experience and therefore need access to a wide range of customer data).
Even where identification of the subject is not necessary for the training data (a situation which is becoming less and less frequent, given the propensity of companies using AI to train on actual customer data), identification has to be performed on actual personal data once the system is in production.
This can also be because AI platforms are increasingly built to construct their datasets dynamically from the internet[5], gathering information from various sources without discriminating whether or not they contain personal data. These sources might include ill-protected personal data repositories or the results of a data breach.
AI platforms are not designed in order to protect data, whether it is personal or not
A modern-day generative AI essentially relies upon a deep generative model, which is constructed by applying unsupervised or self-supervised machine learning to a dataset using a deep neural network, with architectures such as generative adversarial networks, autoregressive models and variational autoencoders.
This makes it possible to efficiently “predict” the desired output data Y from the observation of a collection of inputs X, thus providing the result expected by the user of the AI platform. An autoencoder learns two functions: an encoding function, which transforms the input data, and a decoding function, which recreates the input data from the encoded data. Its generative feature also makes it possible to randomly create new data which is similar to the training data.
This essentially means that a generative AI is not “creating” per se (i.e. generating brand new data through its “imagination”) but is instead “reusing” pre-existing data by way of transformation.
The degree of transformation (i.e. the variance between the original training data and the “newly created” output data) is a direct function of the complexity of the model and of the constraints imposed by the degree of precision of the user’s prompt.
The more precisely a prompt is formulated, forcing the model to focus on the exact representation of the original data, the greater the chances that the model output will be a non-transformed, or only slightly transformed, instance of the original data. For example, using the prompt “An original picture of Andy Warhol’s Campbell Soup cans” with any AI image generator will most probably reproduce the original with little to no modification, and in any case without modifications significant enough to avoid a copyright infringement lawsuit, as the author of this paper was able to verify directly using a popular AI image generator[6].
Figure 1: output of Deep AI using the aforementioned prompt
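The link between prompt precision and near-verbatim reproduction can be sketched with a far simpler stand-in than a deep generative model. The example below assumes a character-level n-gram predictor, which is a crude autoregressive model. The training text, the record it contains and the function names are fictitious and chosen only for illustration.

```python
from collections import Counter, defaultdict

ORDER = 6  # number of preceding characters the toy model conditions on

def train(corpus, order=ORDER):
    """Record, for every context of `order` characters, the characters that follow it."""
    model = defaultdict(Counter)
    for i in range(len(corpus) - order):
        model[corpus[i:i + order]][corpus[i + order]] += 1
    return model

def complete(model, prompt, order=ORDER, max_chars=40):
    """Greedily extend the prompt with the most frequent next character."""
    out = prompt
    for _ in range(max_chars):
        followers = model.get(out[-order:])
        if not followers:
            break  # context never seen in training: nothing to predict
        out += followers.most_common(1)[0][0]
    return out

# Entirely fictitious training text containing one "personal" record.
corpus = ("The weather is mild today. Customer Jane Doe, phone 555-0100. "
          "The shop opens at nine.")
model = train(corpus)

# A prompt that matches the training text closely pulls the record back out.
prompt = "Customer Jane Doe, phone "
completion = complete(model, prompt)
```

Because the prompt ends on a context that occurs only once in the training text, the model has a single continuation available and reproduces the fictitious phone number character for character. This is the untransformed reuse described above, on a miniature scale.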
The numerous cases in which the transformation performed by the AI does not modify the input data enough for the output to qualify as a new creation, rather than a mere reuse or mashup of the copyrighted source materials, explain why owners of copyrighted materials used as training data are able to evidence the reuse of those materials in their claims for copyright infringement[7][8].
This cannot but lead to the leakage of personal data used as training material, through two commonly unwanted side-effects of the technology underlying generative AI platforms: their propensity to reveal training data through their “predictive” capabilities, and their propensity to experience “hallucinations”. These side-effects are further aggravated by (i) the current market trend towards software-as-a-service implementations which allow users to supply their own training data in order to customize “out-of-the-box” AI platforms, exponentially increasing the number of cases in which personal data will be imported into the platform’s dataset (for example in order to train chatbots), and (ii) the extension of the collection of training data to uncategorized data from internet sources (including, in the case of automated collection, the integration into the dataset of personal data made available over the internet through a previous data breach).
Predictive features are a source of widespread diffusion of personal data breach:
The larger the quantity of data in a training dataset and the more precise its labeling, the more accurate and efficient the AI will be “out-of-the-box” and the more self-adaptive it will be. Conversely, an improperly trained AI, whether for lack of a sufficient amount of training data or because of poor labeling of that data, will be entirely useless.
However, beyond this initial, “controlled” training dataset, deep generative models are able to take advantage of uncategorized data sources, this being their key advantage over previous-generation models.
Therefore, developers of such AI platforms have a strong incentive to open the training dataset to the largest possible quantity of data, without exercising control over its contents (which would defeat the purpose), and to give their AI platform the ability to crawl the internet for source content, subject only to rule-based limitations in the self-learning model.
Here again, there is no incentive for the AI developer to limit the web crawling capabilities of the platform, as it would defeat the overall purpose.
As a result, previously leaked personal data readily available from the internet will easily be integrated by the AI as part of its source dataset, and will as easily be reproduced as output data through an insufficient transformation, in a way which is unavoidable, similarly to the previously discussed copyright-related issues[9].
Worse, because one of the core features of a deep neural network combined with a variational autoencoder is to identify, with great accuracy, which collection of input data corresponds to an expected output, users of such AI platforms who use adequately formatted prompts are able to correlate with absolute certainty various personal data of a single subject which would otherwise be difficult to link to that subject because it is spread across multiple internet sources. This facilitates access to comprehensive data for an ill-intentioned user and thereby increases the potential impact of pre-existing data breaches[10].
Hallucinations are a source of personal data breach:
A generative AI will “hallucinate” when a user prompt is “incorrectly” formulated, meaning that the AI algorithm is unable to make sense of the prompt, or misunderstands it, yet fails to reject it as improperly formulated and therefore processes it. Issues during the transformation process, a lack of training data on a subject or improperly labeled training data will have the same effect.
Overfitting, which is the inability of a model to make accurate predictions from data other than the training data, and highly complex models are also sources of “hallucinations”[11].
Hallucinations lead the AI to provide an output which is not based upon training data, is improperly transformed or does not follow any identifiable pattern. The output then provided is a regurgitation by the AI of portions of raw data from the underlying dataset, without proper application of the rules normally used by the AI to control the output.
Should the dataset contain personal data (which is a certainty for any generative AI with a customer-facing application, such as customer-support chatbots), this personal data might be more or less extensively revealed without transformation to the users as a response to their prompts.
The occurrence of a sizeable data breach due to hallucinations is just a matter of time. While OpenAI’s GPT-4 hallucinates “only” around 3 percent of the time, Google’s PaLM Chat had a shocking 27 percent rate[12].
Needless to say, GPT-4 is the go-to AI engine for most users, and with 180 million registered users and 1.6 billion visitors to the website in January 2024[13], even a “low” 3 percent hallucination rate still yields at least a few million daily hallucinations, hence a few million daily opportunities to expose raw data.
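The order of magnitude behind this estimate can be checked with a back-of-the-envelope calculation. The sketch below uses the figures quoted in the text and relies on two assumptions which are not in the source: the 1.6 billion figure is treated as the number of visits over the 31 days of January, and the number of prompts per visit is a hypothetical parameter.

```python
MONTHLY_VISITS = 1.6e9      # January 2024 figure quoted in the text
DAYS_IN_JANUARY = 31
HALLUCINATION_RATE = 0.03   # the "low" 3 percent rate quoted in the text

def daily_hallucinations(prompts_per_visit):
    """Estimated hallucinated answers per day.

    prompts_per_visit is a hypothetical parameter: the text gives no figure for it.
    """
    daily_visits = MONTHLY_VISITS / DAYS_IN_JANUARY
    return daily_visits * prompts_per_visit * HALLUCINATION_RATE

one_prompt = daily_hallucinations(1)     # about 1.5 million per day
three_prompts = daily_hallucinations(3)  # about 4.6 million per day
```

Under these assumptions a single prompt per visit already gives roughly 1.5 million hallucinated answers per day, and the figure scales linearly with the number of prompts submitted during each visit.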
The last major hallucination incident of ChatGPT occurred on February 20th, 2024, affecting users worldwide, and it took OpenAI almost a day to get things back under control[14].
Legal response is relatively simple, but will create hurdles for innovation
The only ways to avoid personal data breaches would be to impose on AI vendors mandatory restrictions on the content of datasets, or an obligation to scrub the input data and/or the output data of any personal data.
Restricting the contents of datasets to make sure that they contain only duly sanitized or anonymized data is not per se an issue, provided always that the source data can be controlled. This excludes the use of any web crawling features to enrich the datasets, or at least implies implementing content analysis capable of identifying the presence of personal data among the contents of the internet sources targeted by the AI platform, thus creating technological constraints equivalent to those required to scrub the input/output data.
Scrubbing the input data would obviously be a problem for any application which relies upon personal data to produce meaningful output, but could be a solution for other AI-powered applications.
Scrubbing the output data would be a solution which would avoid data leaks, but would run the risk of making the output less meaningful.
In both cases, the challenge will be to discriminate what is personal data from what is not: it is fairly obvious that enabling an AI platform to discriminate between, say, a fictitious character, a person whose personal data is public and a person whose personal data is subject to protection will require adding a further level of complexity to already complex systems.
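A minimal sketch of pattern-based scrubbing shows both what is easy and where the difficulty lies. It assumes that only two kinds of personal data are targeted, email addresses and phone numbers, and uses deliberately simplistic regular expressions. The patterns, placeholder tokens and sample sentence are hypothetical and far from production-grade.

```python
import re

# Simplistic, assumed patterns: they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text):
    """Replace email addresses and phone-like digit sequences with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

# Fictitious sample output of an AI platform.
sample = "Contact Jane Doe at jane.doe@example.com or +1 202-555-0100."
scrubbed = scrub(sample)
```

The email address and the phone number are replaced, but the name “Jane Doe” passes through untouched. Nothing in its form indicates whether it belongs to a fictitious character, a public figure or a protected individual, which is exactly the discrimination problem described above.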
This will defeat the purpose of unsupervised training and will constrain the level of sophistication which can be expected from self-supervised training. It will also create entry barriers on the market for deep generative models, due to the complexity of guaranteeing compliant self-supervised models.
The challenge for the legislator will be to strike a balance between personal data protection and innovation, knowing that neither will be left unharmed, neither will be entirely safeguarded, and that a technological and legal conundrum will ensue.
[1] Thaler v. Perlmutter, No. 22-cv-1564-BAH (D.D.C.), judgment of August 18, 2023: AI output is not copyrightable.
[2] See, for example, Lee, Jyh-An; Hilty, Reto; Liu, Kung-Chung, eds. (2021). Artificial Intelligence and Intellectual Property. Oxford University Press.
[3] For example, through the upcoming Artificial Intelligence Act in the EU or the Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence dated October 30th, 2023 in the United States.
[4] GPT-3 has 175 billion parameters, a rule of thumb for sizing the training dataset being a minimum of 10 sources per parameter.
[5] Which is what is advocated by major researchers in the Natural Language Processing field, from Carnegie Mellon and Berkeley https://internet-explorer-ssl.github.io/
[6] Deep AI, available here: https://deepai.org/machine-learning-model/text2img
[7] Such as in the New York Times lawsuit against Microsoft and OpenAI: https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
[8] See also “The Generative AI Copyright Fight Is Just Getting Started”, article by Gregory Barber in Wired, December 9th, 2023.
[9] On an efficiently trained system open to internet content, the prompt “What is the date of birth and gender of [person X], living in [place Y]?” should return the proper data as an answer.
[10] The example prompt given in Note 9 above could easily be supplemented by requesting an email address, street address and phone number, and by asking whether any available sources feature a password for the said email, as well as financial information such as a credit card number or bank account details.
[11] https://www.ibm.com/topics/ai-hallucinations
[12] In Wired, January 5th 2024 “In Defense of AI Hallucinations” by Steven Levy, see also https://github.com/vectara/hallucination-leaderboard for current open source ranking of major AI platforms by hallucination rate.