Edge AI: Running Models on Phones, Drones, and IoT Devices

Why Edge Matters

Most AI discussion assumes cloud inference: send data to a server, get results back. This works for many applications but fails for others: when latency is critical (real-time robot control), when connectivity is unreliable (drones in remote areas), when privacy is required (medical devices that can't send patient data to the cloud), or when cost makes continuous cloud API calls impractical (10 million IoT sensors).

Edge AI runs models on the device itself: phone, drone, microcontroller, or embedded system. Getting useful AI to run efficiently on constrained hardware is a significant engineering challenge that calls for specialized tools and techniques.

The Hardware Landscape

  • Mobile SoCs: Apple A18, Snapdragon 8 Elite (the chip widely anticipated as "Snapdragon 8 Gen 4"), Samsung Exynos. All include dedicated Neural Processing Units (NPUs) that provide roughly 10-40 TOPS (tera-operations per second) of neural network compute. Modern smartphones can run quantized 7B-parameter LLMs at 5-15 tokens/second.
  • Edge GPUs: NVIDIA Jetson Orin (20-275 TOPS depending on module); designed for robotics, drones, and smart cameras. Full PyTorch and TensorRT support.
  • Microcontrollers (MCUs): STM32, Arduino Nano 33, Raspberry Pi Pico. These run on milliwatts of power with kilobytes of RAM. TinyML models on this class of device handle classification and anomaly detection, not LLMs.
  • FPGAs: Xilinx/AMD and Intel FPGAs enable custom neural network inference at extremely low latency and power. Used in specialized industrial and medical applications.
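
To make the gap between these hardware classes concrete, the sketch below estimates how much storage a model's weights need at different precisions. It is a back-of-the-envelope calculation only: it counts weights and ignores activations, the KV cache, and the per-block scale factors that real quantized formats store. The function name and the 7B parameter count are illustrative assumptions.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_7B = 7e9  # a 7B-parameter LLM, as mentioned in the mobile SoC bullet

# Print the footprint at each common precision.
for label, bits in [("float32", 32), ("float16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(PARAMS_7B, bits):5.1f} GB")
```

At 4 bits per weight a 7B model needs about 3.5 GB for its weights, which is why quantization is what makes such models plausible on a phone, while a device with kilobytes of RAM is limited to far smaller models.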

Quantization for Edge

Standard ML models use 32-bit or 16-bit floating-point weights. Edge hardware often works best with 8-bit integer (INT8) or 4-bit quantization. INT8 quantization reduces memory 4× (vs float32) and lets the hardware use specialized integer multiply-accumulate (MAC) units, which are 4-8× more efficient than floating-point units. For LLMs on mobile, the GGUF format (llama.cpp) with 4-bit or 5-bit quantization is standard.
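
The 4× memory figure can be seen in a minimal sketch of symmetric per-tensor INT8 quantization, written with only the Python standard library. This is one simple scheme chosen for illustration, not the method used by any particular toolkit; production formats such as GGUF quantize in blocks with more elaborate layouts. The weight values are made up.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the shared scale."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.03, 0.88, -0.51]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32 = array("f", weights)  # the same weights stored as 32-bit floats
print("float32 bytes:", fp32.itemsize * len(fp32))
print("INT8 bytes:   ", q.itemsize * len(q))
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight shrinks from 4 bytes to 1, at the cost of a rounding error of at most half a quantization step (scale / 2) per weight.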

Model Architecture for Edge

Edge deployment benefits from architectures designed for efficiency: MobileNet, EfficientNet, MobileViT for vision; MobileLLM, Phi-3-mini for language. Knowledge distillation — training a smaller "student" model to mimic a larger "teacher" — produces compact models that punch above their parameter count. Neural architecture search (NAS) finds Pareto-optimal architectures for specific hardware targets.
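
As an illustration of what "mimic" means in practice, here is a sketch of one common distillation objective: the KL divergence between temperature-softened teacher and student output distributions, scaled by the square of the temperature. The text above does not say which distillation variant is intended, so treat this formulation, the function names, the temperature of 2.0, and the example logits as assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]  # made-up logits for a 3-class problem
student = [2.5, 1.5, 0.2]
print("loss vs. teacher:", distillation_loss(student, teacher))
print("loss when equal: ", distillation_loss(teacher, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In training it is typically combined with the ordinary cross-entropy loss on the ground-truth labels.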

Practical Deployment

Key tools: ONNX (model export format), TensorFlow Lite, now renamed LiteRT (mobile and embedded), Core ML (Apple devices), ONNX Runtime (cross-platform), llama.cpp (LLM inference on CPUs and Apple Silicon). For Android, LiteRT, part of Google AI Edge, provides hardware-accelerated inference on Qualcomm, MediaTek, and Google Tensor NPUs. Dr. Lin's research group at Meridian AI focuses on NAS methods for these edge hardware targets.