Selected from liuliu.me
Author: liuliu
Translated by Machine Heart
Machine Heart Editorial Team
Stable Diffusion may soon become popular on mobile devices.
Is it difficult to run Stable Diffusion on an iPhone? The author of this article has the answer: it is not difficult, and the iPhone even has about 50% of its performance to spare.
As we all know, Apple releases a new iPhone every year, each claiming to be faster and better in every way, thanks in part to the rapid development of new vision models and image sensors. Take photography: if we went back ten years, could you take high-quality pictures with the iPhone of that time? The answer is no, because technology advances gradually, and those ten years were what it took to bring mobile photography to where it is now.
Because technology develops this way, gradually, there will always be programs that remain nearly unusable for a while, even on the best computing hardware. But these new programs enable new scenarios, draw people's attention, and some users are willing to explore them early.
The author of this article is one of those people. Over the past three weeks, he developed an application that can generate images with Stable Diffusion and then edit them however you like. On the latest iPhone 14 Pro, the app generates an image in about one minute, uses roughly 2GiB of application memory, and requires downloading about 2GiB of initial data to get started.
App Store link: https://apps.apple.com/us/app/draw-things-ai-generation/id6444050820
This result has sparked discussions among many netizens. Some have started to worry about battery consumption, jokingly saying, “This is cool, but it looks like a great way to drain your phone’s battery.”
“I’ve never felt the heat from my iPhone so happily as I do now.”
“In this cold winter, I can use my phone as a hand warmer.”
However, while everyone is joking about the phone’s heating issue, they are also giving high praise to this work.
“This is incredible. Generating a complete image on my iPhone SE3 takes about 45 seconds—almost as fast as my M1 Pro MacBook with the original version!”
Memory and Hardware Optimization
How was this achieved? Next, let’s take a look at the author’s implementation process:
To run Stable Diffusion on an iPhone while still leaving it 50% of its performance to spare, one major challenge is fitting the program into 6GiB of RAM. 6GiB sounds like a lot, but iOS will kill your application if you use more than 2.8GiB on a 6GiB device, or more than 2GiB on a 4GiB device.
So how much memory does the Stable Diffusion model need for inference?
This starts with the model's structure. A typical Stable Diffusion model consists of four parts:
1. A text encoder that produces text feature vectors to guide the image generation;
2. An optional image encoder that encodes an image into latent space (for image-to-image generation);
3. A denoising model that gradually denoises a latent representation of the image, starting from noise;
4. An image decoder that decodes the image from that latent representation.
Modules 1, 2, and 4 run once during inference, requiring a maximum of about 1GiB. The denoising model, however, occupies about 3.2GiB (full float), and it needs to be executed multiple times, so the author wanted to keep this module in RAM longer.
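To make that structure concrete, here is a minimal Swift sketch of the generation flow. Every name and stub body in it is a hypothetical placeholder rather than the author's actual s4nnc code; the point is simply that modules 1, 2 and 4 each run once, while the denoiser runs at every sampling step.

```swift
import Foundation

// Hypothetical sketch of the four-part pipeline; not the author's s4nnc implementation.
typealias Tensor = [Float]   // stand-in for a real GPU tensor type

func encodeText(_ prompt: String) -> Tensor { Tensor(repeating: 0, count: 77 * 768) }      // 1. runs once
func encodeImage(_ pixels: Tensor) -> Tensor { Tensor(repeating: 0, count: 2 * 4 * 64 * 64) } // 2. optional, image-to-image
func denoise(_ latent: Tensor, _ guidance: Tensor, _ step: Int) -> Tensor { latent }       // 3. runs every step
func decodeImage(_ latent: Tensor) -> Tensor { latent }                                    // 4. runs once

func generate(prompt: String, steps: Int = 30) -> Tensor {
    let guidance = encodeText(prompt)
    var latent = Tensor(repeating: 0.5, count: 2 * 4 * 64 * 64)  // placeholder for random noise (text-to-image)
    for step in (0..<steps).reversed() {
        // Only this module repeats, so its ~3.2GiB of FP32 weights (~1.6GiB in FP16)
        // are what the author wants to keep resident in RAM.
        latent = denoise(latent, guidance, step)
    }
    return decodeImage(latent)
}
```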
The initial Stable Diffusion implementation required nearly 10GiB to perform a single image inference. Between a single input (2x4x64x64) and a single output (2x4x64x64) there are many intermediate layer outputs. The latents themselves are tiny (2x4x64x64 in full float is only about 128KiB); the memory goes to those intermediate outputs, and not all of them can be freed immediately, because some must be retained for later use (the residual / skip connections of the network).
For a time, optimization efforts centered on the PyTorch implementation of Stable Diffusion, including the scratch space reserved by NVIDIA's cuDNN and cuBLAS libraries, all aimed at reducing memory usage far enough that the model could run on cards with as little as 4GiB of VRAM.
However, this was still more than the author's budget allowed. So he began to focus on optimizing for Apple hardware.
Initially, the author weighed the options: 3.2GiB in full float, or 1.6GiB in half float. To avoid triggering iOS's OOM (out-of-memory) kill, he had roughly 500MiB of space left to work with.
The first question is, how large is each intermediate output?
It turns out that most of them are relatively small, each under 6MiB (2x320x64x64). The framework the author uses (s4nnc) can pack them into less than 50MiB and reuse that space.
Notably, the denoiser contains a self-attention mechanism that takes its own latent representation of the image as input. During the self-attention computation there is a batched matrix of size 16x4096x4096; that matrix is about 500MiB in FP16, and the softmax applied to it can be done "in place," meaning it can safely overwrite its input without corrupting anything. Fortunately, both Apple's and NVIDIA's low-level libraries provide in-place softmax implementations, while higher-level libraries such as PyTorch do not expose one.
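As a quick back-of-the-envelope check of those two figures (simple arithmetic, not from the original post):

```swift
// FP16 = 2 bytes per element.
let fp16Bytes = 2

// A typical intermediate output of the denoiser: 2 x 320 x 64 x 64.
let intermediate = 2 * 320 * 64 * 64 * fp16Bytes   // 5,242,880 bytes
// The batched self-attention matrix: 16 x 4096 x 4096.
let attention = 16 * 4096 * 4096 * fp16Bytes       // 536,870,912 bytes

print(Double(intermediate) / 1_048_576, "MiB")     // 5.0 MiB, under the 6MiB quoted above
print(Double(attention) / 1_048_576, "MiB")        // 512.0 MiB, the ~500MiB quoted above
```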
So can we really complete the task with around 550MiB + 1.6GiB of memory?
On Apple hardware, a common choice for implementing neural network backends is using the MPSGraph framework. Therefore, the author first attempted to implement all neural network operations using MPSGraph. At FP16 precision, the peak memory usage is about 6GiB, which is clearly much more than the expected memory usage. What’s going on?
The author analyzed the causes in detail. First, he was not using MPSGraph the way it is commonly used (the TensorFlow style), where the entire computation graph is encoded up front and input and output tensors are then fed in; the framework handles the internal allocations, and the user only submits the whole graph for execution.
However, the author used MPSGraph much the way PyTorch is used: as an operation execution engine. To run an inference pass, many compiled MPSGraphExecutables are executed on a Metal command queue, and each of them may hold some intermediate allocations. If everything is submitted at once, all of these commands hold on to their allocated memory until they finish executing.
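A hedged sketch of what this "one small graph per operation" style can look like in Swift, with a softmax standing in for an arbitrary op; shapes and names are illustrative, and this is not the author's s4nnc code:

```swift
import Metal
import MetalPerformanceShadersGraph

let mtlDevice = MTLCreateSystemDefaultDevice()!
let graphDevice = MPSGraphDevice(mtlDevice: mtlDevice)
let commandQueue = mtlDevice.makeCommandQueue()!

func runSoftmaxOp(on input: Data, shape: [NSNumber]) -> MPSGraphTensorData {
    // A brand-new graph per operation, the way an eager execution engine would do it.
    let graph = MPSGraph()
    let x = graph.placeholder(shape: shape, dataType: .float16, name: "x")
    let y = graph.softMax(with: x, axis: shape.count - 1, name: nil)
    let feed = MPSGraphTensorData(device: graphDevice, data: input,
                                  shape: shape, dataType: .float16)
    // Submitted asynchronously: each such submission can hold its own intermediate
    // allocations until the GPU finishes it, which is what drives peak memory up.
    let results = graph.runAsync(with: commandQueue, feeds: [x: feed],
                                 targetTensors: [y], targetOperations: nil,
                                 executionDescriptor: nil)
    return results[y]!
}
// Usage, e.g.: runSoftmaxOp(on: attentionScores, shape: [16, 4096, 4096])
```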
A simple way to solve this is to pace the submissions: there is no need to submit every command at once. In fact, a Metal command queue caps concurrent submissions at 64 anyway. The author tried submitting 8 operations at a time, and peak memory usage dropped to about 4GiB.
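One common way to pace submissions is a fixed in-flight window guarded by a semaphore; the sketch below is a standard Metal pattern under that assumption, not necessarily how s4nnc implements it:

```swift
import Metal
import Dispatch

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let inFlight = DispatchSemaphore(value: 8)   // at most 8 command buffers in flight

func submit(_ encodeWork: (MTLCommandBuffer) -> Void) {
    inFlight.wait()                          // block once 8 buffers are already in flight
    let commandBuffer = queue.makeCommandBuffer()!
    encodeWork(commandBuffer)                // encode one operation's kernels
    commandBuffer.addCompletedHandler { _ in
        inFlight.signal()                    // its intermediate allocations can now be freed
    }
    commandBuffer.commit()
}
```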
However, this is still 2GiB more than what the iPhone can handle.
When computing self-attention with CUDA, a common trick in the original Stable Diffusion implementation is to use permutation instead of transposition. This works because cuBLAS can consume permuted, strided tensors directly, which avoids allocating dedicated memory for a transposed copy of the tensor.
MPSGraph, however, does not support strided tensors: a permuted tensor still gets transposed internally anyway, which requires an intermediate allocation. By transposing explicitly instead, the allocation is handled by the higher-level layer and MPSGraph's internal inefficiency is avoided. With this trick, memory usage came down to close to 3GiB.
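The article says the explicit transpose lets the higher-level layer handle the allocation; the sketch below only shows the shape manipulation itself, with illustrative attention shapes (a batch of 16 matrices over 4096 tokens with head dimension 40, matching the 16x4096x4096 figure above):

```swift
import MetalPerformanceShadersGraph

let graph = MPSGraph()
let q = graph.placeholder(shape: [16, 4096, 40], dataType: .float16, name: "q")
let k = graph.placeholder(shape: [16, 4096, 40], dataType: .float16, name: "k")

// Make the transpose an explicit, visible op (K becomes 16x40x4096) instead of a
// permuted view that MPSGraph would silently re-materialize with its own allocation...
let kT = graph.transposeTensor(k, dimension: 1, withDimension: 2, name: nil)
// ...then hand the batched matmul plain operands, producing the 16x4096x4096 scores.
let scores = graph.matrixMultiplication(primary: q, secondary: kT, name: nil)
_ = scores
```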
It turns out that, starting with iOS 16.0, MPSGraph can no longer make the optimal allocation decision for softmax: even when the input and output tensors point to the same data, MPSGraph allocates an extra output tensor and then copies the result back to the location they both point to.
The author found that switching to a Metal Performance Shaders alternative for softmax met the requirement exactly, cutting memory usage to 2.5GiB without any performance regression.
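A sketch of what that Metal Performance Shaders path can look like, assuming MPSMatrixSoftMax accepts FP16 matrices and in-place encoding (input and result aliasing the same buffer), as the article describes; it is not the author's exact code:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

// Treat the 16x4096x4096 attention tensor as (16*4096) rows x 4096 columns,
// so the softmax runs over the last dimension.
let rows = 16 * 4096
let columns = 4096
let rowBytes = columns * 2                       // 2 bytes per FP16 element
let buffer = device.makeBuffer(length: rows * rowBytes, options: .storageModeShared)!
let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns,
                                     rowBytes: rowBytes, dataType: .float16)
let matrix = MPSMatrix(buffer: buffer, descriptor: descriptor)

let softmax = MPSMatrixSoftMax(device: device)
let commandBuffer = queue.makeCommandBuffer()!
// Same MPSMatrix as input and result: the kernel overwrites its input in place,
// so no second ~500MiB output tensor is allocated.
softmax.encode(commandBuffer: commandBuffer, inputMatrix: matrix, resultMatrix: matrix)
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
```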
MPSGraph's GEMM kernel, on the other hand, requires an internal transposition. Explicit transposition does not help here either, because at the higher level these transposes are not "in-place" operations, and for this particular 500MiB tensor the extra allocation is unavoidable. By switching to Metal Performance Shaders for GEMM as well, the author reclaimed another 500MiB at a performance cost of about 1%, finally bringing memory usage down to the ideal 2GiB.
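For the GEMM side, MPSMatrixMultiplication takes transpose flags directly, which is presumably how the extra transposed copy is avoided; a simplified sketch with the same illustrative attention dimensions:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!

// Per matrix in the batch: (4096 x 40) times the transpose of (4096 x 40) -> 4096 x 4096.
let gemm = MPSMatrixMultiplication(device: device,
                                   transposeLeft: false,
                                   transposeRight: true,   // consume K as stored, no K^T copy
                                   resultRows: 4096,
                                   resultColumns: 4096,
                                   interiorColumns: 40,
                                   alpha: 1.0,
                                   beta: 0.0)
// Encoding then looks like (with q, k and scores built as MPSMatrix objects):
// gemm.encode(commandBuffer: commandBuffer, leftMatrix: q, rightMatrix: k, resultMatrix: scores)
```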
Reference Links:
https://news.ycombinator.com/item?id=33539192
https://liuliu.me/eyes/stretch-iphone-to-its-limit-a-2gib-model-that-can-draw-everything-in-your-pocket/