Blog
November 11, 2024

Unlocking the Power of Edge Devices: EdgeMaxxing's First Breakthrough

TL;DR: Subnet 39 has achieved a significant milestone by optimizing Stable Diffusion XL to run twice as fast on consumer GPUs like the NVIDIA RTX series while maintaining similar output quality. This marks the first step and proof of concept that EdgeMaxxing’s competition model can work.

The Untapped Power of Consumer Hardware

Global demand for GPUs has risen sharply over the last few years, primarily driven by AI model training and inference. As models improve and unlock new use cases, companies and developers are building applications that require more compute capacity to serve at scale.

While the race for cutting-edge datacenter GPUs like the A100s and H100s intensifies, there’s an untapped alternative: millions of powerful consumer GPUs that often sit idle.

Let's break down the available compute power: NVIDIA shipped an estimated 30M consumer GPUs in 2023 (mostly RTX 4080s and 4090s), Apple has sold 43M M-series Macs since 2022, and consumer devices typically sit unused 60-80% of the time.

To contextualize this: 

For certain AI workloads, roughly 10 Mac M2/M3 machines or 10 NVIDIA RTX 4080s can achieve performance comparable to a single A100 GPU. Based on recent shipment estimates, this translates to roughly:

- 21.5M Macs ≈ 2.15M A100 equivalents

- 30M RTX GPUs ≈ 3M A100 equivalents

Just these two sources add up to about 5.15M A100 equivalents - that's over 2.5x the actual A100 GPUs that NVIDIA shipped in 2023. And we haven't even looked at the broader PC market.
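The back-of-envelope math works out as follows. A quick sketch (the shipment figures and the 10:1 device-to-A100 ratio are the article's rough, workload-specific estimates, not benchmarks):

```python
# Back-of-envelope A100-equivalent math from the shipment figures above.
MACS_M2_M3 = 21_500_000     # estimated M2/M3 Macs (subset of 43M M-series sold)
RTX_GPUS = 30_000_000       # estimated consumer RTX GPUs shipped in 2023
CONSUMER_PER_A100 = 10      # ~10 consumer devices ≈ 1 A100 for some workloads

mac_equiv = MACS_M2_M3 / CONSUMER_PER_A100   # 2.15M A100 equivalents
rtx_equiv = RTX_GPUS / CONSUMER_PER_A100     # 3M A100 equivalents
total = mac_equiv + rtx_equiv                # 5.15M A100 equivalents

print(f"{total / 1e6:.2f}M A100 equivalents")  # → 5.15M A100 equivalents
```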

These chips are less powerful than H100 and A100 datacenter chips, but they remain very capable of handling tasks like image generation and inference for mid-sized LLMs. They also present a significant economic advantage: consumer GPUs often deliver higher performance per dollar than their datacenter counterparts.

Source: Gensyn

Looking at performance per dollar, consumer GPUs like the RTX 4090 actually outperform datacenter cards by 2-3x, making this not just a technical opportunity but an economic one too.

The Genesis of SN39

This underutilized consumer compute presents an opportunity to democratize AI computation. We believe that making consumer devices more capable through optimized models is one path toward that accessibility.

By harnessing this underutilized network of computing power, especially the compute already owned by the average person, we can reduce reliance on centralized datacenters, lower the costs of inference, and make AI more accessible.

In July 2024, we launched a subnet dedicated to optimizing AI models for edge devices, starting with Stable Diffusion XL and targeting the NVIDIA GeForce RTX 4090 GPU. Our first optimization contest focused on generation speed.

Breakthrough: First Successful Competition

Our first competition successfully optimized Stable Diffusion XL to run twice as fast on consumer-grade GPUs while maintaining over 94.7% of the original image quality. 

Baseline Model (Access here) | Optimized Model (Access here)

Examples

Try out the demo here

How It Works

Our top miner (model optimizer) achieved this result by leveraging three key techniques:

1. Custom Scheduler: Developed a custom scheduler that increases quality per step, reducing the need for a high step count.

2. Lower Step Count: With the custom scheduler in place, the step count was reduced with minimal quality loss; since generation time scales roughly linearly with step count, this yields a proportional speedup.

3. Compilation and Caching: Leveraged tools like OneDiffX for efficient model compilation and caching, further optimizing performance.
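The winning submission itself isn't published here, but the three techniques above can be sketched with the Hugging Face diffusers API. Everything in this snippet is illustrative: the model ID, the scheduler choice (a stock DPMSolverMultistepScheduler standing in for the miner's custom scheduler), the step count, and the onediffx `compile_pipe` call are assumptions, not the actual winning code. Running it requires a CUDA GPU with diffusers and onediff installed.

```python
# Illustrative sketch only -- not the winning miner's code.
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from onediffx import compile_pipe  # OneDiff's pipeline compilation helper

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # public SDXL base weights
    torch_dtype=torch.float16,
).to("cuda")

# (1) Swap in a higher-quality-per-step scheduler (stand-in for the custom one).
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# (3) Compile and cache the pipeline's compute graphs for faster inference.
pipe = compile_pipe(pipe)

# (2) Fewer steps: generation time scales roughly linearly with step count,
# so dropping from SDXL's default 50 steps to ~20 steps cuts most of the cost.
image = pipe("a watercolor fox in the snow", num_inference_steps=20).images[0]
image.save("fox.png")
```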

Is this model useful?

Yes! The optimized model will soon be running in production in Dream by WOMBO, a mobile AI art app serving over 1M active users every month. Several million images are generated daily with this Stable Diffusion model.

The results:

Previously, NVIDIA L40 GPUs achieved generation times of approximately 2.3 seconds per image. By deploying the optimized model, generation times will be cut in half, letting users experience image generation twice as fast. This also doubles throughput per GPU, resulting in a 50% cost reduction for inference with this model.
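The deployment math can be sketched in a few lines. The 2.3s baseline is the figure stated for L40 GPUs; a uniform 2x speedup is assumed:

```python
# Rough throughput/cost math for the L40 deployment described above.
baseline_s = 2.3                  # seconds per image on an L40 (stated)
optimized_s = baseline_s / 2      # 2x speedup → ~1.15 s per image

baseline_tput = 1 / baseline_s    # images per second per GPU
optimized_tput = 1 / optimized_s  # doubles with the 2x speedup

# The same GPU fleet now serves 2x the images, so cost per image halves.
cost_reduction = 1 - baseline_tput / optimized_tput

print(f"{optimized_s:.2f}s per image, {cost_reduction:.0%} cost reduction")
```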

That's cool! What's next for EdgeMaxxing?

This is step one towards a future where advanced AI models run efficiently on consumer devices. This n=1 demonstration shows that our contests are viable for driving meaningful model improvements.

We're planning to do the same for hundreds of popular models. Here's what's coming:

1. Expanding hardware support beyond RTX 4090s

2. Creating contests for optimizing VRAM and power consumption

3. Building an open-source optimization platform

4. Enabling any company to submit models for optimization

How do I EdgeMaxx?

The best places to start:

     • Check out our Hugging Face Dashboard and Demo

     • Join the Bittensor Discord Community

Follow our progress:

     • Twitter: @WOMBO

     • GitHub: SN39 Repository

Peace & Love to all ❤️