🍓 LLM Proxy for Raspberry Pi

Deploy lightweight AI infrastructure on Raspberry Pi for edge computing, privacy-focused local inference, and cost-effective prototyping environments.

  • Minimum RAM: 4GB
  • Maximum practical model size: ~7B parameters
  • Power usage: ~2W
  • Hardware cost: from $35

🥧 Recommended Hardware

Raspberry Pi 5 (8GB)

  • CPU: Quad-core Cortex-A76
  • RAM: 8GB LPDDR4X
  • Storage: NVMe SSD recommended
  • Performance: 2-3x faster than the Pi 4
✨ Best for LLM inference

Raspberry Pi 4 (8GB)

  • CPU: Quad-core Cortex-A72
  • RAM: 8GB LPDDR4
  • Storage: USB 3.0 SSD
  • Performance: good for small models
💰 Budget-friendly option

With Coral Edge TPU

  • Accelerator: Google Coral Edge TPU
  • Compute: 4 TOPS (int8)
  • Power: ~2W additional
  • Use case: int8-quantized TensorFlow Lite models only
⚡ Hardware acceleration

🚀 Compatible Solutions

🦙 Ollama on Pi

Run Ollama on Raspberry Pi with ARM-optimized models. It supports quantized versions of popular models for efficient inference; a quick API check is sketched after the feature list.

  • ARM64 native support
  • 4-bit quantization
  • Low memory footprint
  • Easy model management
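
Once Ollama is installed (see the setup guide below), a quick way to confirm it is serving models is to hit its local HTTP API directly; phi3:mini here is just an example of a small quantized model:

# Ollama listens on localhost:11434 by default
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Why is the sky blue?",
  "stream": false
}'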

🔮 LiteLLM (Lightweight)

Deploy the LiteLLM proxy on a Pi to manage both local and cloud models. It connects to Ollama for local inference and forwards cloud API requests; a sample configuration follows the feature list.

  • Unified API interface
  • Multi-provider support
  • Rate limiting
  • Minimal dependencies
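
As a rough sketch, a LiteLLM config file can front an Ollama model and a cloud model behind one endpoint; the model names, the config file name, and the OPENAI_API_KEY variable below are illustrative, not required:

# Write an example proxy config (adjust model names to what you actually run)
cat > litellm_config.yaml << 'EOF'
model_list:
  - model_name: local-phi3
    litellm_params:
      model: ollama/phi3:mini
      api_base: http://localhost:11434
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
EOF
litellm --config litellm_config.yaml --port 4000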

🧠 LocalAI

An OpenAI-compatible API running entirely on the Raspberry Pi. It supports multiple quantized model formats optimized for ARM; example requests follow the feature list.

  • GGUF model support
  • Streaming responses
  • No external API calls
  • Privacy-first design
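
Because LocalAI mirrors the OpenAI API, existing clients only need the base URL changed. A minimal check might look like the following, assuming LocalAI's default port 8080 and a model name that matches one installed on your Pi:

# List the models LocalAI has loaded (OpenAI-compatible endpoint)
curl http://localhost:8080/v1/models
# Ask one of them a question; the model name is illustrative
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-3-mini", "messages": [{"role": "user", "content": "Hello from the Pi"}]}'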

📋 Setup Guide

1. Prepare Your Pi

Install Raspberry Pi OS 64-bit (Lite recommended). Enable SSH and configure adequate cooling for sustained inference workloads.

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip git
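
Before installing anything, it is worth confirming the OS is actually 64-bit and that SSH is reachable; a minimal check, assuming a standard Raspberry Pi OS image:

# Should print "aarch64"; a 32-bit OS cannot run the ARM64 builds used below
uname -m
# Enable and start the SSH service if it is not already running
sudo systemctl enable --now ssh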

2. Install Ollama

Download and install the ARM64 version of Ollama. This provides the core inference engine, optimized for the Raspberry Pi's architecture.

curl -fsSL https://ollama.com/install.sh | sh
ollama pull phi3:mini
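
A quick smoke test confirms the model loads and responds; the first run is slow because the weights are read from storage:

# One-off prompt from the command line
ollama run phi3:mini "Say hello in one short sentence."
# List the models Ollama has pulled, via its local API on port 11434
curl http://localhost:11434/api/tags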

3. Configure LiteLLM Proxy

Set up LiteLLM to provide a unified API interface. Configure it to use Ollama as the backend provider.

pip install 'litellm[proxy]'
litellm --model ollama/phi3:mini --port 4000
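
With the proxy running, clients talk to it using the OpenAI chat completions format; a sketch of a test request against the port 4000 assumed above:

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ollama/phi3:mini", "messages": [{"role": "user", "content": "Hello from the Pi"}]}'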

4. Optimize Performance

Apply ARM-specific optimizations and configure memory limits. Use quantized models for best performance on limited hardware.

# Pull a 4-bit quantized build (quantized tags keep RAM usage within the Pi's limits)
ollama pull llama3:8b-instruct-q4_0
# Monitor CPU and memory use during inference
htop
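
Two Ollama environment variables help on memory-constrained boards; this assumes you run Ollama directly from a shell (if it runs as a systemd service, set them in the service unit instead):

# Keep at most one model resident in RAM at a time
export OLLAMA_MAX_LOADED_MODELS=1
# Keep the loaded model warm between requests instead of reloading it each time
export OLLAMA_KEEP_ALIVE=30m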

💡 Optimization Tips

Use Quantized Models

4-bit and 8-bit quantized models run 4-8x faster with minimal quality loss. Ideal for Pi's limited resources.
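
On-disk size is a quick proxy for how much RAM a model will need; comparing the models you have pulled is a one-liner:

# Smaller on-disk size generally means a more aggressive quantization
ollama list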

Increase Swap Space

Configure 4GB+ of swap to handle larger models. Use fast storage (an NVMe or USB 3.0 SSD) for the swap file.
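
On Raspberry Pi OS the swap file is managed by dphys-swapfile; a sketch that raises it to 4GB (the CONF_MAXSWAP cap must also be raised, and heavy swapping will still be slow for inference):

sudo dphys-swapfile swapoff
# Raise the swap size and the hard cap to 4GB in /etc/dphys-swapfile
sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=4096/' /etc/dphys-swapfile
sudo sed -i 's/^#\?CONF_MAXSWAP=.*/CONF_MAXSWAP=4096/' /etc/dphys-swapfile
sudo dphys-swapfile setup
sudo dphys-swapfile swapon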

Cooling is Critical

Add active cooling (fan + heatsink) to prevent thermal throttling during extended inference sessions.
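
Raspberry Pi OS ships vcgencmd, which makes it easy to watch for thermal throttling during a long inference run:

# SoC temperature; sustained readings near 80-85°C mean the firmware will throttle
vcgencmd measure_temp
# "throttled=0x0" means no throttling or under-voltage has occurred
vcgencmd get_throttled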

Batch Requests

Process multiple requests together to maximize throughput rather than issuing many small requests one at a time.
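
True batching depends on the backend, but even firing independent requests concurrently keeps the CPU busy; a rough bash sketch against the proxy on port 4000, with illustrative prompts:

# Send three prompts in parallel and wait for all responses
for prompt in "Summarize the benefits of edge AI" "List three uses for a Pi" "Define quantization"; do
  curl -s http://localhost:4000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"ollama/phi3:mini\", \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}]}" &
done
wait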

📊 Model Performance on Pi 5

Model                Quantization  Tokens/sec  RAM Usage  Best For
Phi-3 Mini (3.8B)    Q4            8-12        2.5GB      General chat
Llama 3 8B           Q4            4-6         5GB        Complex tasks
Gemma 2B             F16           15-20       4GB        Fast responses
TinyLlama 1.1B       Q4            25-30       1GB        Edge cases
CodeLlama 7B         Q4            5-8         4.5GB      Coding