Introduction
Running generative AI locally in your own environment is appealing in many contexts, particularly when privacy is a concern. While cloud-based solutions like ChatGPT dominate headlines, open-source alternatives now make it possible to run sophisticated language models directly on your own hardware.
This article walks through setting up GPT4All—an open-source generative AI—using Docker for easy deployment and isolation.
Background
The development of local AI solutions has accelerated in response to privacy concerns around cloud-based models. As noted in a Wired article:
A number of experts who spoke to Wired suspect that Google is using Bard to send a message that the EU's laws around privacy and online safety aren't to its liking. But more than this, it could be a sign that generative AI technology as it exists now is fundamentally incompatible with existing and developing privacy and online safety laws in the EU.
The uncertainty around Bard's rollout in the region comes as the bloc's lawmakers are negotiating new draft rules to govern artificial intelligence via the fledgling AI Act.
Local AI models offer several advantages:
- Complete data privacy (no data leaves your machine)
- Operation in air-gapped environments
- No subscription costs
- Customisation potential for specific datasets
- Independence from third-party service availability
Prerequisites
Before beginning, ensure your system has:
- Docker and Docker Compose installed
- CPU with AVX or AVX2 instruction support (check with commands below)
- Minimum 8GB RAM (16GB recommended)
- At least 5GB free disk space for models
Check for AVX support with:
lscpu | grep -o 'avx[^ ]*'
Or alternatively:
grep -o 'avx[^ ]*' /proc/cpuinfo
Setting up your environment
Creating your workspace
First, create a directory for your GPT4All Docker environment:
mkdir -p ~/docker/gpt4all && cd ~/docker/gpt4all
Downloading a model
Let's start with a moderately sized model. Create a models directory and download your first model:
mkdir -p models
curl -L https://huggingface.co/nomic-ai/gpt4all-j/resolve/main/gpt4all-lora-quantized-ggml.bin \
-o models/gpt4all-lora-quantized-ggml.bin
This basic model is approximately 4GB but provides a good balance of performance and resource usage for testing.
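If you want to confirm the download completed before building anything, a quick size check catches truncated files early. A minimal sketch (the ~4GB figure is approximate and varies by model):
import os

# Path to the model downloaded above
model_path = "models/gpt4all-lora-quantized-ggml.bin"

size_gb = os.path.getsize(model_path) / (1024 ** 3)
print(f"{model_path}: {size_gb:.2f} GB")

# A partial download is usually far smaller than the expected ~4 GB
if size_gb < 3:
    print("Warning: file looks too small -- the download may have been interrupted")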
Docker configuration
Create a docker-compose.yaml file with the following configuration:
version: '3.6'

services:
  gpt4all-ui:
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    platform: linux/amd64
    ports:
      - "3000:3000"
    environment:
      - OPENAI_API_KEY=dummy-key-not-needed-for-local-models
      # Point the UI at the local backend instead of api.openai.com
      # (assumes the chatbot-ui image honours OPENAI_API_HOST)
      - OPENAI_API_HOST=http://gpt4all-api:8000
    depends_on:
      - gpt4all-api

  gpt4all-api:
    build: .
    volumes:
      - ./models:/models
    ports:
      - "8000:8000"
    environment:
      - MODEL=/models/gpt4all-lora-quantized-ggml.bin
Next, create a minimal Dockerfile in the same directory:
FROM python:3.9-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    build-essential \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the API code into the image so uvicorn can find api:app
COPY api.py .

EXPOSE 8000

CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
Create a requirements.txt file with the dependencies:
fastapi==0.95.1
uvicorn==0.22.0
llama-cpp-python==0.1.48
pydantic==1.10.7
Finally, create an api.py file for the backend service:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import os
from llama_cpp import Llama

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.95
    stop: Optional[List[str]] = None

# Load the model
model_path = os.environ.get("MODEL", "/models/gpt4all-lora-quantized-ggml.bin")
if not os.path.exists(model_path):
    raise RuntimeError(f"Model file not found: {model_path}")

llm = Llama(model_path=model_path, n_ctx=2048)

@app.post("/api/completions")
async def create_completion(request: CompletionRequest):
    try:
        response = llm(
            request.prompt,
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            stop=request.stop or []
        )
        return {
            "id": "cmpl-local",
            "object": "text_completion",
            "created": 0,
            "model": os.path.basename(model_path),
            "choices": [
                {
                    "text": response["choices"][0]["text"],
                    "index": 0,
                    "finish_reason": "stop" if response["choices"][0]["finish_reason"] == "stop" else None
                }
            ]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/models")
async def list_models():
    return {
        "object": "list",
        "data": [
            {
                "id": os.path.basename(model_path),
                "object": "model",
                "created": 0,
                "owned_by": "local"
            }
        ]
    }
Running the container
With all files in place, start the services:
docker-compose up -d
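Before opening the UI, it can help to confirm the backend has finished loading the model. A minimal check against the two endpoints defined in api.py (assumes the requests library is installed on your host; the prompt is just illustrative):
import requests

BASE = "http://localhost:8000"

# The model list endpoint responds once the API container is up
models = requests.get(f"{BASE}/api/models", timeout=10).json()
print("Available models:", [m["id"] for m in models["data"]])

# A short completion request; the first call can take a while
# because the model runs entirely on the CPU
payload = {"prompt": "Docker is", "max_tokens": 32, "temperature": 0.7}
completion = requests.post(f"{BASE}/api/completions", json=payload, timeout=120).json()
print(completion["choices"][0]["text"])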
Accessing the UI
Once the containers are running, access the UI at:
http://localhost:3000
The first time you open the interface, you might need to:
- Select the model from the dropdown
- Configure basic parameters like temperature and tokens
- Begin interacting with your local AI model
Troubleshooting common issues
Error: no cgroup mount found in mountinfo
If you encounter this error, the following command may fix it:
sudo mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd
Error fetching models
If you see an "Error fetching models" message:
- Ensure both containers are running (docker-compose ps)
- Check container logs for more detailed errors: docker-compose logs gpt4all-api
- Verify the model file exists in the models directory and has the correct permissions (a standalone check is sketched after this list)
- Check if your CPU supports the required instruction set
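If the API container keeps failing, it can be easier to rule out a corrupted or incompatible model file by loading it directly on the host, outside Docker. A minimal sketch, assuming llama-cpp-python is installed locally (pip install llama-cpp-python):
from llama_cpp import Llama

# Same file the container mounts at /models
model_path = "models/gpt4all-lora-quantized-ggml.bin"

# If this raises an error, the problem is the model file or the CPU
# (e.g. missing AVX support), not the Docker setup
llm = Llama(model_path=model_path, n_ctx=2048)
print("Model loaded successfully")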
Slow responses or container crashes
Local AI models are resource-intensive. If you're experiencing slow responses:
- Try a smaller model
- Increase the memory allocation for Docker
- Reduce max_tokens to generate shorter responses
- Close other memory-intensive applications
Taking things further
Exploring other models
Several models are compatible with this setup:
- GPT4All-J (the model we used above)
- Orca Mini
- Wizard Vicuna
- Replit Code
Models can be downloaded from the official GPT4All website or from Hugging Face.
Using Python bindings
For programmatic access to models, install the Python bindings:
pip install llama-cpp-python
This enables:
- Low-level access to the C API via the ctypes interface
- High-level Python API for text completion (see the example after this list)
- OpenAI-like API compatibility
- LangChain integration
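For example, the high-level API can generate a completion in a few lines. A minimal sketch using the model downloaded earlier (the prompt and parameters are illustrative):
from llama_cpp import Llama

# Load the same model used by the Docker setup
llm = Llama(model_path="models/gpt4all-lora-quantized-ggml.bin", n_ctx=2048)

# Generate a short completion
output = llm(
    "Q: What is Docker used for? A:",
    max_tokens=64,
    temperature=0.7,
    stop=["Q:"]
)
print(output["choices"][0]["text"])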
Flask integration
For integrating GPT4All into Flask applications:
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path="models/gpt4all-lora-quantized-ggml.bin", n_ctx=2048)

@app.route("/api/complete", methods=["POST"])
def complete_text():
    data = request.json
    prompt = data.get("prompt", "")
    max_tokens = data.get("max_tokens", 256)

    response = llm(prompt, max_tokens=max_tokens)
    return jsonify({"completion": response["choices"][0]["text"]})

if __name__ == "__main__":
    app.run(debug=True)
Observations and limitations
In my testing, I've observed that:
- Response quality varies significantly between models
- Generation is noticeably slower than cloud services (5-20 seconds per response)
- Models can require 4-16GB of RAM when loaded
- Some technical knowledge and contextual prompts improve responses
- Models excel at code completion but may struggle with current events
Conclusion
Running GPT4All locally with Docker provides a privacy-focused alternative to cloud-based AI services. While performance doesn't match commercial offerings, the technology is rapidly evolving, and the privacy benefits make it a compelling option for many use cases.
As open-source models continue to improve, local AI solutions will become increasingly viable for a wider range of applications. This approach is particularly valuable for developers working with sensitive data or in environments with strict privacy requirements.