Apple Details Its AI Foundation Models and Applebot Web Scraping

From Apple’s Machine Learning Research¹ blog:

Our foundation models are trained on Apple’s AXLearn framework, an open-source project we released in 2023. It builds on top of JAX and XLA, and allows us to train the models with high efficiency and scalability on various training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs. We used a combination of data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) to scale training along multiple dimensions such as data, model, and sequence length.

We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control.

We never use our users’ private personal data or user interactions when training our foundation models, and we apply filters to remove personally identifiable information like social security and credit card numbers that are publicly available on the Internet. We also filter profanity and other low-quality content to prevent its inclusion in the training corpus. In addition to filtering, we perform data extraction, deduplication, and the application of a model-based classifier to identify high quality documents.
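Apple doesn't publish the filters themselves, so to make that last paragraph a bit more concrete, here is a rough sketch of what regex-based PII scrubbing and exact deduplication can look like. Everything in it is my own assumption for illustration (the patterns, the placeholder tokens, the `redact_pii` and `dedupe` names); it is not Apple's pipeline, and it skips the profanity filter and the model-based quality classifier entirely.

```python
import hashlib
import re

# Illustrative patterns only; real PII detection is far more involved.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> str:
    """Replace SSN-shaped and Luhn-valid card-shaped numbers with placeholders."""
    text = SSN_RE.sub("[SSN]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        # Only redact when the checksum passes, to spare order numbers and the like.
        return "[CARD]" if luhn_ok(digits) else match.group()

    return CARD_RE.sub(_card, text)


def dedupe(docs: list[str]) -> list[str]:
    """Drop documents that are identical after lowercasing and whitespace collapse."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Something like `dedupe([redact_pii(d) for d in docs])` would chain the two steps. Exact hashing only catches verbatim copies; near-duplicate detection needs a fuzzier technique, which this sketch doesn't attempt.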

It’s a very technical read, but it shows how Apple approached building AI features in their products and how their on-device and server models compare to others in the industry (on servers, Apple claims their model is essentially neck and neck with GPT-4-Turbo, OpenAI’s older model).

This blog post, however, pretty much parallels my reaction to the WWDC keynote. Everything was fun and cool until they showed generative image creation that spits out slop “resembling” (strong word) other people; and in this post, everything was cool until they mentioned how – surprise! – Applebot had already indexed web content to train their model without the consent of publishers, who can only opt out now. (This was also confirmed by Apple executives elsewhere.)
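For publishers who do want to opt out after the fact, the control is a robots.txt rule. As far as I understand Apple's Applebot documentation, the user agent to target is `Applebot-Extended`, which doesn't crawl anything itself but tells Apple not to use content gathered by Applebot for training. Here is a minimal sketch, using Python's standard `urllib.robotparser`, to check that a rule says what you think it says. The robots.txt contents, the `allowed` helper, and the example.com URL are placeholders of mine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of training use, keep regular crawling allowed.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/post"  # placeholder URL
    for agent in ("Applebot-Extended", "Applebot"):
        print(agent, "allowed:", allowed(ROBOTS_TXT, agent, page))
```

With the rules above, the check should come back disallowed for `Applebot-Extended` and allowed for plain `Applebot`, so the site stays in Apple's search features. None of this undoes whatever was collected before the rule existed.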

As a creator and website owner, I guess that these things will never sit right with me. Why should we accept that certain data sets require a licensing fee but anything that is found “on the open web” can be mindlessly scraped, parsed, and regurgitated by an AI? Web publishers (and especially indie web publishers these days, who cannot afford lawsuits or hiring law firms to strike expensive deals) deserve better.

It’s disappointing to see Apple muddy an otherwise compelling set of features (some of which I really want to try) with practices that are no better than the rest of the industry.


  1. How long until this becomes the ‘Apple Intelligence Research’ website?