What is Edge AI?
Edge AI runs AI models directly on local devices such as phones, cameras, or sensors instead of sending data to remote cloud servers.
It performs inference (and sometimes training) on the device itself by using lightweight, optimized models that fit within the hardware’s memory and power limits.
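Key techniques include model quantization (storing weights at lower numeric precision), pruning (removing weights that contribute little to the output), and specialized accelerators such as NPUs or edge TPUs that speed up neural-network operations locally.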
This approach reduces the need to transmit raw data, enabling faster responses and continued operation without an internet connection.
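As a rough illustration of two of the techniques named above, the sketch below applies symmetric 8-bit quantization and magnitude-based pruning to a plain list of weights. The function names (`quantize_int8`, `dequantize`, `prune_by_magnitude`) and the sample weights are hypothetical, chosen only for this example. Production toolchains typically quantize per layer or per channel and fine-tune after pruning, which this sketch does not attempt.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127].

    Returns the integer codes and the scale needed to recover
    approximate float values. (Hypothetical helper for illustration.)
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


# Made-up weights, used only to exercise the helpers.
weights = [0.5, -1.27, 0.02, 0.9, -0.3, 0.001]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 codes:", codes)
print("worst reconstruction error:", worst_error)

print("50% pruned:", prune_by_magnitude(weights, 0.5))
```

Each quantized weight needs one byte instead of the four used by a 32-bit float, at the cost of a reconstruction error of at most about half the scale per weight. Pruning leaves zeros that a sparse storage format or sparsity-aware hardware can skip.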
Example
A smartphone camera app applies filters or detects objects in real time using on-device models, without uploading photos to the cloud.
Why it matters
Edge AI cuts latency for real-time tasks, lowers cloud costs, and improves privacy by keeping sensitive data on the device.
Frequently asked questions
How does Edge AI differ from Cloud AI?
Cloud AI sends data to remote servers for processing, while Edge AI runs the model locally on the device, which improves speed and privacy.
Related terms
Federated learning is a machine learning technique that trains models across many decentralized devices or servers, each holding its own local data, without ever moving the raw data to a central location.
An API (Application Programming Interface) is a standardized set of rules that lets software applications request services or data from each other. In AI infrastructure, an API typically exposes machine learning models as callable endpoints for inference or training.
CUDA is NVIDIA's parallel computing platform and programming model, which lets developers run general-purpose computations on NVIDIA GPUs rather than only on CPUs.
Knowledge distillation is a technique that transfers knowledge from a large, complex 'teacher' model to a smaller 'student' model so the student can achieve similar performance with far less compute and memory.
In AI/ML infrastructure, an endpoint is a deployed URL or network address that exposes a trained model so applications can send data and receive predictions via API calls.
FLOPs stands for floating-point operations and counts the total number of arithmetic calculations (additions, multiplications) a neural network performs during a forward or backward pass. It is distinct from FLOPS, which measures floating-point operations per second.
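To make the FLOPs count concrete, here is a minimal sketch that tallies the multiplications and additions in the forward pass of a small fully connected network. The helper names (`dense_layer_flops`, `mlp_forward_flops`) and the 784-128-10 layer sizes are assumptions made for illustration. The count covers only the matrix multiply and bias add, ignoring activation functions, and conventions differ between tools (some report multiply-accumulate operations, which are roughly half this figure).

```python
def dense_layer_flops(n_in, n_out, bias=True):
    """Count FLOPs for one dense layer applied to a single input vector.

    Each of the n_out outputs needs n_in multiplications and
    (n_in - 1) additions, plus one more addition if a bias is used.
    (Hypothetical helper for illustration.)
    """
    multiplications = n_in * n_out
    additions = (n_in - 1) * n_out
    if bias:
        additions += n_out
    return multiplications + additions


def mlp_forward_flops(layer_sizes):
    """Sum dense-layer FLOPs over consecutive pairs of layer sizes."""
    return sum(
        dense_layer_flops(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Assumed example: 784 inputs, one hidden layer of 128 units, 10 outputs.
print(mlp_forward_flops([784, 128, 10]))  # prints 203264
```

With a bias term, each dense layer costs 2 × n_in × n_out FLOPs per input, so the first layer (2 × 784 × 128 = 200,704) dominates the total. Estimates like this help judge whether a model fits an edge device's compute budget.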