🍓 LLM Proxy for Raspberry Pi

Deploy lightweight AI infrastructure on Raspberry Pi for edge computing, privacy-focused local inference, and cost-effective prototyping environments.

  • Minimum RAM: 4GB
  • Maximum practical model size: ~7B parameters
  • Power usage: ~2W
  • Hardware cost: from $35

🥧 Recommended Hardware

Raspberry Pi 5 (8GB)

  • CPU: Quad-core Cortex-A76
  • RAM: 8GB LPDDR4X
  • Storage: NVMe SSD recommended
  • Performance: 2-3x faster than the Pi 4
✨ Best for LLM inference

Raspberry Pi 4 (8GB)

  • CPU: Quad-core Cortex-A72
  • RAM: 8GB LPDDR4
  • Storage: USB 3.0 SSD
  • Performance: good for small models
💰 Budget-friendly option

With Coral Edge TPU

  • Accelerator: Google Coral Edge TPU
  • Compute: 4 TOPS (int8)
  • Power: ~2W additional
  • Use case: int8-quantized TensorFlow Lite models only
⚡ Hardware acceleration

🚀 Compatible Solutions

🦙 Ollama on Pi

Run Ollama on Raspberry Pi with ARM-optimized models. It supports quantized versions of popular models for efficient inference; a quick API check is sketched after the feature list.

  • ARM64 native support
  • 4-bit quantization
  • Low memory footprint
  • Easy model management
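
Once Ollama is installed (see the setup guide below), a quick way to confirm it is serving models is to hit its local HTTP API directly; phi3:mini here is just an example of a small quantized model:

# Ollama listens on localhost:11434 by default
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Why is the sky blue?",
  "stream": false
}'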

🔮 LiteLLM (Lightweight)

Deploy the LiteLLM proxy on a Pi to manage both local and cloud models. It connects to Ollama for local inference and forwards cloud API requests; a sample configuration follows the feature list.

  • Unified API interface
  • Multi-provider support
  • Rate limiting
  • Minimal dependencies
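
As a rough sketch, a LiteLLM config file can front an Ollama model and a cloud model behind one endpoint; the model names, the config file name, and the OPENAI_API_KEY variable below are illustrative, not required:

# Write an example proxy config (adjust model names to what you actually run)
cat > litellm_config.yaml << 'EOF'
model_list:
  - model_name: local-phi3
    litellm_params:
      model: ollama/phi3:mini
      api_base: http://localhost:11434
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
EOF
litellm --config litellm_config.yaml --port 4000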

🧠 LocalAI

An OpenAI-compatible API running entirely on the Raspberry Pi. It supports multiple quantized model formats optimized for ARM; example requests follow the feature list.

  • GGUF model support
  • Streaming responses
  • No external API calls
  • Privacy-first design
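
Because LocalAI mirrors the OpenAI API, existing clients only need the base URL changed. A minimal check might look like the following, assuming LocalAI's default port 8080 and a model name that matches one installed on your Pi:

# List the models LocalAI has loaded (OpenAI-compatible endpoint)
curl http://localhost:8080/v1/models
# Ask one of them a question; the model name is illustrative
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-3-mini", "messages": [{"role": "user", "content": "Hello from the Pi"}]}'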

📋 Setup Guide

1. Prepare Your Pi

Install Raspberry Pi OS 64-bit (Lite recommended). Enable SSH and configure adequate cooling for sustained inference workloads.

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip git
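
Before installing anything, it is worth confirming the OS is actually 64-bit and that SSH is reachable; a minimal check, assuming a standard Raspberry Pi OS image:

# Should print "aarch64"; a 32-bit OS cannot run the ARM64 builds used below
uname -m
# Enable and start the SSH service if it is not already running
sudo systemctl enable --now ssh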

2. Install Ollama

Download and install the ARM64 version of Ollama. This provides the core inference engine, optimized for the Raspberry Pi's architecture.

curl -fsSL https://ollama.com/install.sh | sh
ollama pull phi3:mini
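
A quick smoke test confirms the model loads and responds; the first run is slow because the weights are read from storage:

# One-off prompt from the command line
ollama run phi3:mini "Say hello in one short sentence."
# List the models Ollama has pulled, via its local API on port 11434
curl http://localhost:11434/api/tags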

3. Configure LiteLLM Proxy

Set up LiteLLM to provide a unified API interface. Configure it to use Ollama as the backend provider.

pip install 'litellm[proxy]'
litellm --model ollama/phi3:mini --port 4000
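
With the proxy running, clients talk to it using the OpenAI chat completions format; a sketch of a test request against the port 4000 assumed above:

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ollama/phi3:mini", "messages": [{"role": "user", "content": "Hello from the Pi"}]}'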

4. Optimize Performance

Apply ARM-specific optimizations and configure memory limits. Use quantized models for best performance on limited hardware.

# Pull a 4-bit quantized build (quantized tags keep RAM usage within the Pi's limits)
ollama pull llama3:8b-instruct-q4_0
# Monitor CPU and memory use during inference
htop
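
Two Ollama environment variables help on memory-constrained boards; this assumes you run Ollama directly from a shell (if it runs as a systemd service, set them in the service unit instead):

# Keep at most one model resident in RAM at a time
export OLLAMA_MAX_LOADED_MODELS=1
# Keep the loaded model warm between requests instead of reloading it each time
export OLLAMA_KEEP_ALIVE=30m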

💡 Optimization Tips

Use Quantized Models

4-bit and 8-bit quantized models run 4-8x faster with minimal quality loss. Ideal for Pi's limited resources.
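
On-disk size is a quick proxy for how much RAM a model will need; comparing the models you have pulled is a one-liner:

# Smaller on-disk size generally means a more aggressive quantization
ollama list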

Increase Swap Space

Configure 4GB+ of swap to handle larger models. Use fast storage (an NVMe or USB 3.0 SSD) for the swap file.
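
On Raspberry Pi OS the swap file is managed by dphys-swapfile; a sketch that raises it to 4GB (the CONF_MAXSWAP cap must also be raised, and heavy swapping will still be slow for inference):

sudo dphys-swapfile swapoff
# Raise the swap size and the hard cap to 4GB in /etc/dphys-swapfile
sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=4096/' /etc/dphys-swapfile
sudo sed -i 's/^#\?CONF_MAXSWAP=.*/CONF_MAXSWAP=4096/' /etc/dphys-swapfile
sudo dphys-swapfile setup
sudo dphys-swapfile swapon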

Cooling is Critical

Add active cooling (fan + heatsink) to prevent thermal throttling during extended inference sessions.
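
Raspberry Pi OS ships vcgencmd, which makes it easy to watch for thermal throttling during a long inference run:

# SoC temperature; sustained readings near 80-85°C mean the firmware will throttle
vcgencmd measure_temp
# "throttled=0x0" means no throttling or under-voltage has occurred
vcgencmd get_throttled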

Batch Requests

Process multiple requests together to maximize throughput rather than issuing many small requests one at a time.
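
True batching depends on the backend, but even firing independent requests concurrently keeps the CPU busy; a rough bash sketch against the proxy on port 4000, with illustrative prompts:

# Send three prompts in parallel and wait for all responses
for prompt in "Summarize the benefits of edge AI" "List three uses for a Pi" "Define quantization"; do
  curl -s http://localhost:4000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"ollama/phi3:mini\", \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}]}" &
done
wait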

📊 Model Performance on Pi 5

Model                Quantization  Tokens/sec  RAM Usage  Best For
Phi-3 Mini (3.8B)    Q4            8-12        2.5GB      General chat
Llama 3 8B           Q4            4-6         5GB        Complex tasks
Gemma 2B             F16           15-20       4GB        Fast responses
TinyLlama 1.1B       Q4            25-30       1GB        Edge cases
CodeLlama 7B         Q4            5-8         4.5GB      Coding