
The handoff is the hard part



May 21, 2026 - 2 min read

The case for putting language models on the phone has stopped being speculative. A fully charged iPhone running a 7B model at 10 tokens per second drains in under two hours, while a 350M 8-bit model runs all day at the same rate. These figures come from Liu et al. (2024) in MobileLLM, presented at ICML, who designed sub-billion-parameter models that approach LLaMA-2 7B accuracy on API-calling tasks. They observed that for models under one billion parameters, depth matters more than width, a result that cuts against the dominant scaling laws.
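The battery claim reduces to simple arithmetic. The sketch below assumes roughly 0.1 J per generated token per billion parameters and about 50 kJ in a full battery; these two constants are not stated in this article and are best treated as ballpark assumptions rather than measurements.

```python
# Assumed constants (not given in this article): energy per token scales
# linearly with parameter count, and a full phone battery holds about 50 kJ.
JOULES_PER_TOKEN_PER_BILLION = 0.1
BATTERY_JOULES = 50_000.0


def hours_of_generation(params_billions: float, tokens_per_second: float = 10.0) -> float:
    """Hours of continuous decoding before the battery is empty."""
    joules_per_token = JOULES_PER_TOKEN_PER_BILLION * params_billions
    watts = joules_per_token * tokens_per_second  # joules per second
    return BATTERY_JOULES / watts / 3600.0


print(f"7B model:   {hours_of_generation(7.0):.1f} h")
print(f"350M model: {hours_of_generation(0.35):.1f} h")
```

Under these assumptions the 7B model lasts just under two hours and the 350M model roughly forty, which is consistent with the "under two hours" and "all day" figures.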

The empirical picture is now mapped. A survey by Lu et al. (2024) of 70 open-source SLMs between 100M and 5B parameters found that the gap in reasoning quality between small and large models is closing: newer small models match the performance of previous-generation large ones. This is the technical premise behind the architectural argument made by Belcak et al. (2025) at NVIDIA, who position SLMs not as constrained alternatives but as the natural engine for agentic systems. Agents decompose tasks into narrow, repetitive sub-calls, precisely the workload where SLMs run 10 to 30 times cheaper than large models. They propose a heterogeneous architecture with SLMs handling the routine flow and larger models invoked only when generality or hard reasoning is genuinely required.
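The economics of such a heterogeneous system can be sketched with one formula. The function name and the 80% share below are hypothetical, chosen only for illustration; the 10x and 30x ratios are the range quoted above.

```python
def relative_cost(slm_share: float, slm_cost_ratio: float) -> float:
    """Cost of a mixed SLM/LLM system relative to sending every call to the LLM.

    slm_share: fraction of calls handled by the small model (0 to 1).
    slm_cost_ratio: how many times cheaper one SLM call is than one LLM call.
    """
    if not 0.0 <= slm_share <= 1.0:
        raise ValueError("slm_share must be between 0 and 1")
    if slm_cost_ratio <= 0.0:
        raise ValueError("slm_cost_ratio must be positive")
    return slm_share / slm_cost_ratio + (1.0 - slm_share)


# Hypothetical workload: the small model handles 80% of calls.
for ratio in (10.0, 30.0):
    print(f"{ratio:.0f}x cheaper SLM -> {relative_cost(0.8, ratio):.2f} of all-LLM cost")
```

With an 80% share, the total comes to about 0.28 of the all-LLM cost at 10x and about 0.23 at 30x. The remaining large-model calls dominate the bill, so the share that gets handed off matters more than the exact cost ratio.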

This is what Apple has shipped. The Apple Intelligence Foundation Language Models Tech Report (2025) describes a 3B on-device model with 2-bit quantisation-aware training and KV-cache sharing that cuts cache memory by 37.5%, paired with a server-side Mixture-of-Experts model on Private Cloud Compute. The on-device layer handles the routine; the cloud handles what it cannot. The routing logic itself is studied by Ding et al. (2024) at ICLR, whose Hybrid LLM router learns to send up to 40% fewer queries to the large model with no measurable drop in response quality.
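A threshold router of this general kind can be sketched in a few lines. This is a simplified illustration, not the Hybrid LLM implementation: it assumes a separate scorer has already produced, for each query, a number where higher means the small model is more likely to answer adequately. The function names and the calibration scores are hypothetical.

```python
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], large_model_share: float) -> float:
    """Pick a threshold so that about `large_model_share` of the calibration
    queries score strictly below it and would go to the large model."""
    if not scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 <= large_model_share <= 1.0:
        raise ValueError("large_model_share must be between 0 and 1")
    ordered = sorted(scores)
    k = int(round(large_model_share * len(ordered)))  # queries to send upward
    if k >= len(ordered):
        return float("inf")  # every query goes to the large model
    return ordered[k]


def route(score: float, threshold: float) -> str:
    """Keep the query on the small model only if its score clears the threshold."""
    return "small" if score >= threshold else "large"


# Hypothetical router scores on a held-out calibration set.
calibration = [0.95, 0.20, 0.70, 0.85, 0.40, 0.90, 0.60, 0.99, 0.30, 0.80]
threshold = calibrate_threshold(calibration, large_model_share=0.3)
decisions = [route(s, threshold) for s in calibration]
print(threshold, decisions.count("large"))
```

Moving the threshold trades cost against quality: a lower value keeps more traffic on the device, a higher value sends more to the cloud. Tied scores at the threshold stay on the small model, so the realised share can fall slightly below the target.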

The bottleneck is where the architecture meets the device. A measurement study by Yan & Ding (2025) implemented a smartphone sensor-reasoning application across mobile, edge, and cloud deployments. They found that only models under 4B parameters run successfully on powerful mobile devices, that quantisation often produces meaningful quality loss, and that on-device output takes over 30 seconds against under 10 seconds in the cloud. The first layer, running a small model on the device, is technically settled. The second layer, where the small model decides whether it can answer or must defer, is where the real engineering still sits.
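One common heuristic for that defer decision, shown here only as a sketch and not taken from any of the studies cited, is to let the small model answer first and hand off when its own token probabilities look weak. The function names and the 0.6 cutoff are hypothetical, and token probabilities are often poorly calibrated, so a real system would need to tune and validate the cutoff.

```python
import math
from typing import Sequence


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability the model assigned to its own output tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_defer(token_logprobs: Sequence[float], min_confidence: float = 0.6) -> bool:
    """Hand the query to the larger model when the small model looks unsure."""
    return mean_token_probability(token_logprobs) < min_confidence


# Hypothetical natural-log probabilities for two on-device drafts.
confident_draft = [-0.05, -0.10, -0.02]
uncertain_draft = [-1.2, -0.9, -1.5]
print(should_defer(confident_draft), should_defer(uncertain_draft))
```

The first draft has a mean token probability of about 0.95 and stays on the device; the second sits near 0.30 and is deferred. The cost of this design is that the device pays for a full draft before deciding, which matters when on-device generation is already the slow path.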




On-device LLMs, Small language models, Agentic AI, Edge-cloud routing, Mobile inference