Workers AI and AI Gateway
LLM inference on Cloudflare's GPU edge, AI Gateway for routing/caching across providers, and the patterns that pair them with Workers.
This module is being expanded.
Coming in the next revision:
- Workers AI — Cloudflare-owned GPU PoPs (a separate fleet from the regular Worker PoPs) hosting open-weights models. Llama, Mistral, embedding models, image generation, speech-to-text.
- The pricing model — metered per neuron (Workers AI's unified compute unit) or per token, with flat-rate egress. Cheap by hyperscaler standards.
- AI Gateway — a proxy layer that sits in front of OpenAI, Anthropic, Google, and Workers AI itself. Provides caching, rate limiting, retries, logging, and fallback routing.
- Cost-routing patterns — route easy queries to Llama 3.1 8B on Workers AI; route complex ones to Claude or GPT-5 via AI Gateway.
- Vectorize integration — RAG built end-to-end on Workers: docs in R2, embeddings in Vectorize, LLM via Workers AI or AI Gateway.
- Observability — every LLM call logged with latency, cost, cache-hit status.
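Inside a Worker, the GPU fleet is reached through an AI binding. A minimal sketch, assuming a `[ai]` / `binding = "AI"` entry in wrangler.toml and one example model ID from the catalog (both may differ in your setup):

```typescript
// Minimal Worker calling an open-weights model via the Workers AI binding.
// Assumes wrangler.toml contains:
//   [ai]
//   binding = "AI"
// The model ID is one example from the catalog; the binding interface is
// typed inline here as an assumption so the sketch is self-contained.
interface AiBinding {
  run(model: string, input: { prompt: string }): Promise<{ response: string }>;
}

export default {
  async fetch(_req: Request, env: { AI: AiBinding }): Promise<Response> {
    const { response } = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "Summarize Workers AI in one sentence.",
    });
    return new Response(response);
  },
};
```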
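Mechanically, AI Gateway works by swapping a provider's base URL for a per-account gateway URL, after which requests pick up caching, rate limiting, and logging. A sketch of that URL construction; the account ID and gateway name below are placeholders:

```typescript
// AI Gateway proxies provider calls through a per-account gateway URL of
// the form /v1/{accountId}/{gateway}/{provider}/{providerPath}.
const GATEWAY_BASE = "https://gateway.ai.cloudflare.com/v1";

function gatewayUrl(
  accountId: string,
  gateway: string,
  provider: string,
  providerPath: string,
): string {
  return `${GATEWAY_BASE}/${accountId}/${gateway}/${provider}/${providerPath}`;
}

// Point an OpenAI-style client here instead of at api.openai.com;
// "ACCOUNT_ID" and "my-gateway" are placeholders for your own values.
const openaiViaGateway = gatewayUrl(
  "ACCOUNT_ID",
  "my-gateway",
  "openai",
  "chat/completions",
);
```

The same builder covers fallback routing: the caller keeps a list of provider/path pairs and retries down the list when a request fails.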
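The cost-routing pattern above can be sketched as a pure function. The length threshold, keyword hints, and frontier-model name are illustrative assumptions, not anything fixed by Cloudflare:

```typescript
// Illustrative cost-routing heuristic: short, simple prompts go to a cheap
// open-weights model on Workers AI; anything that looks complex goes to a
// frontier model behind AI Gateway. Threshold, hint list, and the
// "claude-sonnet" name are assumptions for the sketch.
type Route = { target: "workers-ai" | "ai-gateway"; model: string };

const COMPLEX_HINTS = ["prove", "refactor", "multi-step", "analyze"];

function pickModel(prompt: string): Route {
  const looksComplex =
    prompt.length > 500 ||
    COMPLEX_HINTS.some((hint) => prompt.toLowerCase().includes(hint));
  return looksComplex
    ? { target: "ai-gateway", model: "claude-sonnet" }
    : { target: "workers-ai", model: "@cf/meta/llama-3.1-8b-instruct" };
}
```

In production the classifier itself can be a cheap model call, but a static heuristic like this is often enough to shift the bulk of traffic onto the inexpensive tier.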
For the broader AI inferencing landscape see the AI inferencing post.
Next: Module 11 — Pages.