Generative artificial intelligence (AI) models are AI models capable of generating text, images, code, audio, video, and other content in response to inputs or prompts; examples include OpenAI's ChatGPT and DALL-E, Meta's Llama, and Google's Imagen (accessed via Gemini). Such models require significant volumes of high-quality data in order to train the model and enable it, through an iterative process, to assimilate information and refine its output. Generative AI models do not 'memorize' or recount their training data per se; instead, they learn to predict an appropriate output based on probabilities derived from patterns in the training data.
According to OpenAI, ChatGPT was developed using 'three primary sources of information': publicly available information on the internet, information licensed from third parties, and information provided by users or human trainers. Meta's Llama 2 was similarly 'pretrained on publicly available online data sources' and trained on '2 trillion tokens', tokens being the units of data into which training data is split, such that each word, punctuation mark, or pixel, for example, constitutes a separate token. Both developers state that they either did not intentionally target sources with high volumes of personal data or sought to remove such data from their training data.

Web scraping, the process of gathering or extracting data from websites through the use of an automated tool or bot, has legal implications for website operators, developers of AI models, their deployers, and data subjects where the scraped data is publicly available and includes personal data. Nicola Cain, of Handley Gill Limited, discusses these legal implications for all those involved in web scraping data.
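For readers unfamiliar with the notion of a token, the sketch below gives a rough conceptual illustration of splitting text into tokens. It is a naive word-and-punctuation tokenizer written for illustration only; production models such as ChatGPT and Llama use subword tokenizers (for example, byte-pair encoding), so the actual token boundaries and counts differ.

```python
import re

def naive_tokenize(text):
    # Illustrative only: treat each run of word characters and each
    # punctuation mark as a separate token. Real model tokenizers
    # (e.g. byte-pair encoding) split text into subword units instead.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("Llama 2 was trained on 2 trillion tokens.")
print(tokens)
# Under this naive scheme, each word and the full stop is one token,
# giving 9 tokens for this sentence.
print(len(tokens))
```

On this naive scheme the sentence yields nine tokens; a subword tokenizer applied to the same sentence would generally produce a different count, which is why developers' figures such as '2 trillion tokens' depend on the specific tokenizer used.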