TL;DR: Yesterday I wrote about running 400B models on a laptop. Today someone did it on an iPhone. The AI democratization curve is steeper than anyone expected — and it's changing how I think about building AI agents.


What Just Happened

A developer named @anemll posted a video on Twitter showing an iPhone 17 Pro running a 400-billion parameter language model. No cloud. No internet. Airplane mode on.

The model runs at 0.6 tokens per second — roughly one word every two seconds. That's painfully slow compared to cloud APIs. But here's why this matters: the iPhone has 12GB of RAM. This model normally needs over 200GB.

The math shouldn't work. Yet it does.

Why This Isn't Just a Stunt

When I wrote about Flash-MoE running on laptops [yesterday](/flash-moe-397b-laptop/), I thought we'd see phones in maybe two years. It took 24 hours.

Here's the trick: Mixture of Experts (MoE) models don't use all their parameters for every token. A 400B MoE model with 512 experts only activates 4-10 experts per token — less than 2% of total weights.
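The arithmetic behind that "less than 2%" claim is easy to check. A quick back-of-envelope sketch, using the article's numbers (400B total parameters, 512 experts) plus one assumption I'm making for illustration: that expert weights dominate the parameter count, so each expert holds roughly 1/512 of the total.

```python
# Back-of-envelope: how much of a 400B MoE model is active per token?
# Figures from the article; per-expert size is an illustrative assumption.
total_params = 400e9
num_experts = 512
active_experts = 8  # article says 4-10 experts fire per token

# Assume expert weights dominate, so each expert is ~1/512 of the total.
params_per_expert = total_params / num_experts
active_params = active_experts * params_per_expert

share = active_experts / num_experts
print(f"Active per token: ~{active_params / 1e9:.2f}B params ({share:.1%})")
```

With 8 active experts that works out to roughly 6.25B parameters touched per token, about 1.6% of the model. That is why streaming a tiny slice of the weights per token is even plausible.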

Instead of loading everything into memory, Flash-MoE streams model weights from the phone's flash storage to the GPU on demand. The approach builds on Apple's own 2023 research paper, "LLM in a Flash," combined with aggressive quantization and speculative decoding.

The result: a model whose weights total over 200GB runs on a device with 12GB of RAM.
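The core idea of on-demand streaming can be sketched as a small cache of resident experts. This is a toy illustration, not Flash-MoE's actual implementation; real systems add prefetching, quantization, and careful flash-read scheduling. `load_fn` here is a placeholder standing in for an SSD/flash read.

```python
from collections import OrderedDict

class ExpertCache:
    """Keep only recently used experts in RAM, evicting the least
    recently used when the cache is full (the streaming idea, simplified)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity   # max experts resident in RAM
        self.load_fn = load_fn     # reads one expert's weights from storage
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as recently used
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)      # slow path: storage read
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return weights

# Usage: a fake loader stands in for the flash read.
cache = ExpertCache(capacity=16, load_fn=lambda i: f"weights-{i}")
for expert_id in [3, 7, 3, 41]:  # router picks a few experts per token
    cache.get(expert_id)
print(len(cache.cache))  # 3 distinct experts resident
```

Because consecutive tokens often reuse the same experts, a cache like this means most lookups never touch storage, which is what makes the 200GB-on-12GB trick work at all.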

The Pattern I'm Seeing

In the past week:

  • **$12K Tinybox** — 120B parameter inference at home
  • **Flash-MoE on laptop** — 397B on consumer hardware
  • **Flash-MoE on iPhone** — 400B in your pocket

The direction is clear: AI compute is collapsing from data centers to laptops to phones. Each step happens faster than the last.

What This Means for Builders

I run an AI agent 24/7 to help manage this newsletter. She costs about $0.08 per article using the Claude API. Here's my honest calculation of what pocket-sized LLMs change:

Today:

  • Cloud APIs: fast, capable, costs per token
  • Local models: slower, less capable, zero marginal cost
  • My choice: cloud for complex tasks, local for simple ones

Tomorrow (maybe 12-18 months):

  • Phone-class models reach "good enough" for many tasks
  • Background AI agents running on-device without internet
  • Privacy-first AI becomes the default

For builders like me, this creates a decision point: keep building for cloud APIs, or start preparing for on-device?

My answer: both. The hybrid model wins. Complex reasoning stays in the cloud. But simple classification, quick lookups, and offline fallbacks? Those are going local.
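The hybrid setup is simple enough to sketch as a routing function. Everything here is hypothetical: the task categories and the `"local"`/`"cloud"` targets are my own placeholders, not any real API.

```python
# Hypothetical task router for the hybrid model described above.
SIMPLE_TASKS = {"classify", "extract", "lookup"}

def route(task_type, online=True):
    """Decide where a task should run under the cloud/local split."""
    if task_type in SIMPLE_TASKS:
        return "local"   # zero marginal cost, good enough quality
    if not online:
        return "local"   # offline fallback, accept degraded quality
    return "cloud"       # complex reasoning stays remote

print(route("classify"))                  # local
print(route("summarize"))                 # cloud
print(route("summarize", online=False))   # local fallback
```

The point of making the router explicit is that the boundary can move: as on-device models improve, you grow `SIMPLE_TASKS` without touching anything else.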

What's Still Missing

Let's not oversell this. At 0.6 tokens/second, you can't have a conversation. The battery drain is brutal. Context windows are severely limited by RAM.

And the real bottleneck isn't compute, it's memory bandwidth. Moving weights from storage to the processor fast enough is the hard problem, one that Apple's "LLM in a Flash" research only began to solve.
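A rough calculation shows why bandwidth, not compute, sets the ceiling. The numbers below are my assumptions for illustration (roughly 3 GB/s sustained flash read, 8 active experts out of 512, 4-bit quantized weights), not measured figures from the demo.

```python
# Rough upper bound on decode speed when every active expert's weights
# must stream from flash each token. All inputs are assumptions.
read_bandwidth = 3e9                   # bytes/s sustained flash read (assumed)
active_params = 8 * (400e9 / 512)      # ~6.25B params touched per token
bytes_per_token = active_params * 0.5  # 4-bit quantization = 0.5 bytes/param

tokens_per_sec = read_bandwidth / bytes_per_token
print(f"~{tokens_per_sec:.2f} tokens/s upper bound")  # ~0.96 tokens/s
```

Under these assumptions the ceiling lands just under 1 token/s, in the same ballpark as the demo's observed 0.6. Expert caching lets real systems beat the naive bound, but the shape of the limit is clear: faster storage, not a faster GPU, is what speeds this up.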

The iPhone demo is a proof of concept, not a product.

The Uncomfortable Question

If phones can run 400B models (slowly), what can laptops run in two years? What can small servers run?

The value of cloud AI infrastructure — the thing that's absorbed billions in investment — depends on a capability gap that's shrinking faster than expected.

I'm not saying cloud AI dies. But the moat around API providers is eroding. The question isn't whether local AI catches up. It's when.


My Takeaway

I started AI Insider to track how AI is actually changing work. Not the hype — the reality.

This week taught me: the pace of hardware efficiency is the story I've been underweighting. Software capability gets all the headlines. But the quiet work on inference optimization, MoE architectures, and memory streaming? That's what's making AI accessible.

The 400B phone demo isn't useful today. But it's a signal. The gap between "data center AI" and "pocket AI" just got smaller.

I'm updating my mental model. You should too.


Yesterday: [Flash-MoE: Running 397B on a Laptop](/flash-moe-397b-laptop/)
Last week: [The $12,000 AI Independence Box](/tinybox-ai-independence-box/)