~50 min read · updated 2026-05-10

Workers AI and AI Gateway

LLM inference on Cloudflare's GPU edge, AI Gateway for routing/caching across providers, and the patterns that pair them with Workers.

This module is being expanded.

Coming in the next revision:

  • Workers AI — Cloudflare-owned GPU PoPs (a separate fleet from the regular Worker PoPs) hosting open-weights models: Llama, Mistral, embedding models, image generation, speech-to-text.
  • Pricing — per-neuron or per-token depending on the model, with no separate egress charges. Cheap by hyperscaler standards.
  • AI Gateway — a proxy that sits in front of OpenAI, Anthropic, Google, and Workers AI itself, providing caching, rate limiting, retries, logging, and fallback routing.
  • Cost-routing patterns — route easy queries to Llama 3.1 8B on Workers AI; route complex ones to Claude or GPT-5 via AI Gateway.
  • Vectorize integration — RAG built end-to-end on Workers: docs in R2, embeddings in Vectorize, LLM via Workers AI or AI Gateway.
  • Observability — every LLM call logged with latency, cost, and cache-hit status.
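The cost-routing pattern above can be sketched as a plain routing function. This is an illustrative heuristic, not a Cloudflare API: the `routePrompt` function, its thresholds, and the keyword check are assumptions for the sketch; only the Workers AI model identifier `@cf/meta/llama-3.1-8b-instruct` follows Cloudflare's naming scheme, and the gateway model name is a placeholder.

```typescript
// Hypothetical routing decision: cheap open-weights model for short,
// simple prompts; frontier model via AI Gateway for everything else.
type Route = { provider: "workers-ai" | "ai-gateway"; model: string };

function routePrompt(prompt: string): Route {
  // Crude length estimate: whitespace-separated words stand in for tokens.
  const words = prompt.trim().split(/\s+/).length;
  // Keywords that usually signal multi-step reasoning (illustrative list).
  const multiStep = /\b(step by step|analyze|compare|refactor)\b/i.test(prompt);
  if (words < 64 && !multiStep) {
    return { provider: "workers-ai", model: "@cf/meta/llama-3.1-8b-instruct" };
  }
  return { provider: "ai-gateway", model: "frontier-model" };
}
```

In a real Worker this decision would sit in front of the actual inference call; the point of the pattern is that the classifier is cheap relative to the cost difference between the two routes.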
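The retrieval step of the Vectorize RAG pipeline can be sketched in plain TypeScript. This is a stand-in for what a vector index does under the hood, not the Vectorize API itself: the `Chunk` type, `cosine`, and `topK` are names invented for the sketch.

```typescript
// A stored document chunk with its embedding vector.
type Chunk = { id: string; text: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank chunks by similarity to the query embedding and keep the top k;
// the retrieved texts would then be stuffed into the LLM prompt.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

In the end-to-end version, the embeddings come from an embedding model on Workers AI, the chunks live in R2, and a managed index replaces the brute-force sort.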

For the broader AI inference landscape, see the AI inferencing post.

Next: Module 11 — Pages.