Shipping AI features is less about model choice and more about where you run inference. Here is a practical decision guide.
When edge inference wins
- Hard latency budgets: AR, voice assistants, or UI co-pilots that need <100 ms round-trip.
- Intermittent connectivity: on-device experiences where offline resilience matters.
- Data residency/privacy: sensitive inputs that must not leave the device.
When cloud inference wins
- Large models and accelerators: GPU/TPU capacity, or proprietary models you cannot ship to clients.
- Bursty workloads: scale elastically during traffic spikes and pay only for what you use.
- Centralized governance: rotate keys, audit usage, and update models without shipping new app binaries. (A routing sketch that weighs these criteria follows this list.)
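
A minimal sketch of how the criteria above can drive a routing decision. The `Request` fields, thresholds, and `choose_route` helper are illustrative assumptions, not part of any particular framework; a real system would also fold in device capability and model availability.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_budget_ms: int         # hard round-trip budget for this feature
    offline_required: bool         # must the feature work without connectivity?
    contains_sensitive_data: bool  # data that should not leave the device
    payload_bytes: int             # rough proxy for model/compute demand

def choose_route(req: Request, cloud_reachable: bool) -> str:
    """Return "edge" or "cloud" by applying the criteria above in order."""
    # Privacy and offline requirements are hard constraints favoring the edge.
    if req.contains_sensitive_data or req.offline_required or not cloud_reachable:
        return "edge"
    # Tight latency budgets favor on-device inference.
    if req.latency_budget_ms < 100:
        return "edge"
    # Everything else (large inputs, large models, bursty load) goes to cloud accelerators.
    return "cloud"

# Example: a voice-assistant turn with an 80 ms budget stays on-device.
print(choose_route(Request(80, False, False, 8_192), cloud_reachable=True))  # -> "edge"
```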
Hybrid patterns that work well
- Cascade: quick on-device filter, escalate to the cloud for complex cases (see the sketch after this list).
- Speculative UX: render a fast local guess while the cloud refines the answer; reconcile when the slower result arrives.
- Dynamic routing: choose edge or cloud based on input size, device class, or user tier.
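
The cascade is the simplest pattern to prototype. A minimal sketch, assuming a hypothetical on-device model exposing `predict(text) -> (label, confidence)` and a cloud client with the same call; the confidence threshold is an assumed value to tune per task.

```python
CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff; tune per task and model pair

def cascade_predict(text: str, local_model, cloud_client) -> dict:
    """Try the small on-device model first; escalate to the cloud when unsure."""
    label, confidence = local_model.predict(text)
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"label": label, "route": "edge", "confidence": confidence}
    # Low confidence: pay the network round-trip for the larger cloud model.
    label, confidence = cloud_client.predict(text)
    return {"label": label, "route": "cloud", "confidence": confidence}
```

Returning the route alongside the result makes it easy to log which path served each request and measure how often the cascade escalates.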
Engineering tips
- Cache tokenizers and model weights on the edge to avoid cold starts.
- Batch requests in the cloud to reduce per-call overhead; cap batch size to avoid tail latency.
- Collect latency/error histograms per route; auto-fallback to the alternative path on SLO violations (a fallback sketch follows this list).
- Treat models as deployable artifacts with semantic versions; log model_id on every response.
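
A minimal sketch of the SLO-driven fallback, using only the standard library. The SLO values, window size, and route names are assumptions; a production version would use real histograms in your metrics system rather than an in-process rolling window.

```python
import time
from collections import defaultdict, deque

P95_SLO_MS = {"edge": 120, "cloud": 400}   # assumed per-route latency SLOs
WINDOW = 200                               # rolling sample window per route
latencies = defaultdict(lambda: deque(maxlen=WINDOW))

def record(route: str, latency_ms: float) -> None:
    latencies[route].append(latency_ms)

def p95(route: str) -> float:
    samples = sorted(latencies[route])
    return samples[int(0.95 * (len(samples) - 1))] if samples else 0.0

def healthy(route: str) -> bool:
    """A route counts as healthy until its rolling p95 breaches the SLO."""
    return p95(route) <= P95_SLO_MS[route]

def timed_call(route: str, fn, *args):
    """Run an inference call, record its latency, and return the result."""
    start = time.monotonic()
    result = fn(*args)
    record(route, (time.monotonic() - start) * 1000)
    return result

def route_with_fallback(preferred: str) -> str:
    """Prefer the chosen route, but switch paths when it is violating its SLO."""
    alternative = "cloud" if preferred == "edge" else "edge"
    return preferred if healthy(preferred) else alternative
```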
Next steps
- Prototype a cascade flow: small on-device model first, cloud fallback on low-confidence outputs.
- Add feature flags to switch routing without redeploying clients.
- Instrument with OpenTelemetry spans to compare edge vs cloud latency end-to-end (a minimal instrumentation sketch follows).
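
A minimal OpenTelemetry sketch for that last step, using the Python API/SDK with a console exporter. `run_edge_inference` and `run_cloud_inference` are hypothetical stand-ins for your actual inference calls, and the `inference.*` attribute names are just a suggested convention.

```python
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to stdout for the prototype; swap in an OTLP exporter later.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("inference-routing")

def run_edge_inference(text: str) -> str:   # hypothetical placeholder
    return "edge-result"

def run_cloud_inference(text: str) -> str:  # hypothetical placeholder
    return "cloud-result"

def predict(text: str, route: str, model_id: str) -> str:
    # One span per inference call, tagged so edge and cloud latency
    # can be compared end-to-end in your tracing backend.
    with tracer.start_as_current_span("inference") as span:
        span.set_attribute("inference.route", route)
        span.set_attribute("inference.model_id", model_id)
        result = run_edge_inference(text) if route == "edge" else run_cloud_inference(text)
        span.set_attribute("inference.result_length", len(result))
        return result

predict("turn on the lights", route="edge", model_id="tiny-intent-1.2.0")
```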

