Running an open-source generative AI on a local device with GPT4All and Docker

Learn how to run GPT-like AI models locally using GPT4All and Docker for improved privacy and full local control.

Introduction

The prospect of running generative AI locally in your own environment is appealing in many contexts, particularly when privacy is a concern. While cloud-based solutions like ChatGPT dominate headlines, open-source alternatives now enable us to run sophisticated language models directly on our own hardware.

Important

Running AI models locally offers significant privacy advantages but comes with hardware requirements and performance trade-offs compared to cloud-based solutions.

This article walks through setting up GPT4All—an open-source generative AI—using Docker for easy deployment and isolation.

Background

The development of local AI solutions has accelerated in response to privacy concerns around cloud-based models. As noted in a Wired article:

A number of experts who spoke to Wired suspect that Google is using Bard to send a message that the EU's laws around privacy and online safety aren't to its liking. But more than this, it could be a sign that generative AI technology as it exists now is fundamentally incompatible with existing and developing privacy and online safety laws in the EU.

The uncertainty around Bard's rollout in the region comes as the bloc's lawmakers are negotiating new draft rules to govern artificial intelligence via the fledgling AI Act.

Local AI models offer several advantages:

  • Complete data privacy (no data leaves your machine)
  • Operation in air-gapped environments
  • No subscription costs
  • Customisation potential for specific datasets
  • Independence from third-party service availability

Prerequisites

Before beginning, ensure your system has:

  • Docker and Docker Compose installed
  • CPU with AVX or AVX2 instruction support (check with commands below)
  • Minimum 8GB RAM (16GB recommended)
  • At least 5GB free disk space for models

Check for AVX support with:

lscpu | grep -o 'avx[^ ]*'

Or alternatively:

grep -o 'avx[^ ]*' /proc/cpuinfo

Setting up your environment

Creating your workspace

First, create a directory for your GPT4All Docker environment:

mkdir -p ~/docker/gpt4all && cd ~/docker/gpt4all

Downloading a model

Let's start with a moderately sized model. Create a models directory and download your first model:

mkdir -p models
curl -L https://huggingface.co/nomic-ai/gpt4all-j/resolve/main/gpt4all-lora-quantized-ggml.bin \
  -o models/gpt4all-lora-quantized-ggml.bin

This basic model is approximately 4GB but provides a good balance of performance and resource usage for testing.
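
You can quickly confirm that the file downloaded completely by checking its size before moving on:

ls -lh models/gpt4all-lora-quantized-ggml.bin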

Docker configuration

Create a docker-compose.yaml file with the following configuration:

version: '3.6'

services:
  gpt4all-ui:
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    platform: linux/amd64
    ports:
      - "3000:3000"
    environment:
      - OPENAI_API_KEY=dummy-key-not-needed-for-local-models
    depends_on:
      - gpt4all-api

  gpt4all-api:
    build: .
    volumes:
      - ./models:/models
    ports:
      - "8000:8000"
    environment:
      - MODEL=/models/gpt4all-lora-quantized-ggml.bin

Next, create a minimal Dockerfile in the same directory:

FROM python:3.9-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    build-essential \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY api.py .

EXPOSE 8000

CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]

Create a requirements.txt file with the dependencies:

fastapi==0.95.1
uvicorn==0.22.0
llama-cpp-python==0.1.48
pydantic==1.10.7

Finally, create an api.py file for the backend service:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import os
import json
from llama_cpp import Llama

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.95
    stop: Optional[List[str]] = None

# Load the model
model_path = os.environ.get("MODEL", "/models/gpt4all-lora-quantized-ggml.bin")
if not os.path.exists(model_path):
    raise RuntimeError(f"Model file not found: {model_path}")

llm = Llama(model_path=model_path, n_ctx=2048)

# Completion endpoint: run the prompt through the local model and return an OpenAI-style response
@app.post("/api/completions")
async def create_completion(request: CompletionRequest):
    try:
        response = llm(
            request.prompt,
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            stop=request.stop or []
        )

        return {
            "id": "cmpl-local",
            "object": "text_completion",
            "created": 0,
            "model": os.path.basename(model_path),
            "choices": [
                {
                    "text": response["choices"][0]["text"],
                    "index": 0,
                    "finish_reason": "stop" if response["choices"][0]["finish_reason"] == "stop" else None
                }
            ]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/models")
async def list_models():
    return {
        "object": "list",
        "data": [
            {
                "id": os.path.basename(model_path),
                "object": "model",
                "created": 0,
                "owned_by": "local"
            }
        ]
    }

Running the container

With all files in place, start the services:

docker-compose up -d

Note

The first time you run this, Docker will build the container, which may take several minutes depending on your system's performance.
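
Once the build finishes and both containers are up, you can sanity-check the backend directly. The request below calls the /api/completions route defined in api.py; the prompt and parameters are only placeholders:

curl -X POST http://localhost:8000/api/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Docker?", "max_tokens": 64}'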

Accessing the UI

Once the containers are running, access the UI at:

http://localhost:3000

The first time you open the interface, you may need to:

  1. Select the model from the dropdown
  2. Configure basic parameters such as temperature and max tokens

You can then begin interacting with your local AI model.

Troubleshooting common issues

Error: no cgroup mount found in mountinfo

If you encounter this error, the following command may fix it:

sudo mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd

Error fetching models

If you see an "Error fetching models" message:

  • Ensure both containers are running (docker-compose ps)
  • Check container logs for more detailed errors: docker-compose logs gpt4all-api
  • Verify the model file exists in the models directory and has the correct permissions
  • Check if your CPU supports the required instruction set
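
If both containers are running but the UI still cannot see any models, query the backend's /api/models route (defined in api.py above) directly to check whether it responds:

curl http://localhost:8000/api/models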

Slow responses or container crashes

Local AI models are resource-intensive. If you're experiencing slow responses:

  • Try a smaller model
  • Increase the memory allocation for Docker
  • Reduce max_tokens to generate shorter responses
  • Close other memory-intensive applications

Taking things further

Exploring other models

Several models are compatible with this setup:

  • GPT4All-J (the model used above)
  • Orca Mini
  • Wizard Vicuna
  • Replit Code

Compatible model files can be downloaded from the official GPT4All website or from Hugging Face.

Using Python bindings

For programmatic access to models, install the Python bindings:

pip install llama-cpp-python

This enables:

  • Low-level access to C API via ctypes interface
  • High-level Python API for text completion
  • OpenAI-like API compatibility
  • LangChain integration
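
As a quick illustration of the high-level completion API mentioned above, here is a minimal sketch; the model path and prompt are placeholders for your own setup:

from llama_cpp import Llama

# Load the quantised model (adjust the path to wherever your model file lives)
llm = Llama(model_path="models/gpt4all-lora-quantized-ggml.bin", n_ctx=2048)

# Generate a short completion and print the text of the first choice
output = llm("Q: What is containerisation? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])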

Flask integration

For integrating GPT4All into Flask applications:

from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)
# Load the model once at startup so every request reuses the same instance
llm = Llama(model_path="models/gpt4all-lora-quantized-ggml.bin", n_ctx=2048)

@app.route("/api/complete", methods=["POST"])
def complete_text():
    data = request.json
    prompt = data.get("prompt", "")
    max_tokens = data.get("max_tokens", 256)

    response = llm(prompt, max_tokens=max_tokens)
    return jsonify({"completion": response["choices"][0]["text"]})

if __name__ == "__main__":
    app.run(debug=True)
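
With the Flask development server running on its default port (5000), the endpoint can be exercised with a simple request; the prompt below is just an example:

curl -X POST http://localhost:5000/api/complete \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a haiku about containers", "max_tokens": 64}'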

Observations and limitations

In my testing, I've observed that:

  • Response quality varies significantly between models
  • Generation is noticeably slower than cloud services (5-20 seconds per response)
  • Models can require 4-16GB of RAM when loaded
  • Some technical knowledge and contextual prompts improve responses
  • Models excel at code completion but may struggle with current events

Conclusion

Running GPT4All locally with Docker provides a privacy-focused alternative to cloud-based AI services. While performance doesn't match commercial offerings, the technology is rapidly evolving, and the privacy benefits make it a compelling option for many use cases.

As open-source models continue to improve, local AI solutions will become increasingly viable for a wider range of applications. This approach is particularly valuable for developers working with sensitive data or in environments with strict privacy requirements.

Further reading