Cloud Server for Ollama in Europe: Self-Host AI EU Guide
Ollama is the fastest way to get a local LLM running - a single command installs the runtime, pulls a model, and exposes an OpenAI-compatible API. For European teams, running Ollama on an EU cloud server means all AI inference stays within EU jurisdiction, satisfying GDPR requirements while giving developers the simplicity of a managed service.
This guide covers how to deploy Ollama on a DCXV EU cloud server, which models to use for different workloads, and what performance to expect.
Why Run Ollama on an EU Cloud Server
Running Ollama locally on developer laptops works for testing, but production AI features need a server: consistent availability, GPU acceleration, shared access for multiple services, and stable API endpoints your applications can call reliably.
EU cloud hosting specifically matters because Ollama serves as the inference endpoint for your applications. Every prompt your users send flows through this server. Under GDPR, if those prompts contain personal data - and in most real-world applications they do - that inference must happen on infrastructure under EU jurisdiction. A DCXV EU cloud server running Ollama gives you a compliant, private AI endpoint that never routes data to US infrastructure.
Choosing the Right Model for Your Use Case
Ollama supports hundreds of models. For production EU deployments:
- llama3.1:8b - best all-around for chat, summarization, Q&A. Runs on CPU or GPU. 4-5 GB VRAM at Q4.
- llama3.1:70b - near-GPT-4 quality. Requires 40+ GB VRAM. Use on A100/H100 servers.
- mistral:7b - fast, efficient, excellent for structured output and function calling.
- nomic-embed-text - embedding model for RAG pipelines. CPU-friendly, 274 MB.
- codellama:13b - code generation and review. Good on a single 16 GB GPU.
- phi3:mini - Microsoft’s 3.8B model. Very fast on CPU, useful for classification.
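The VRAM figures above follow a simple pattern: at Q4 quantization, weights take roughly half a byte per parameter, plus overhead for the KV cache. As a rough sketch (the 0.55 GB-per-billion factor and 1 GB overhead are approximations, not official sizing numbers):

```shell
# Rough VRAM estimate for a Q4-quantized model. Assumptions: ~0.55 GB per
# billion parameters for weights, plus ~1 GB for KV cache and runtime overhead.
estimate_q4_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 0.55 + 1.0 }'
}
estimate_q4_gb 8    # llama3.1:8b  -> ~5.4 GB, matching the 4-5 GB range above
estimate_q4_gb 70   # llama3.1:70b -> ~39.5 GB, hence the 40+ GB VRAM requirement
```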
Minimum Specs for Ollama
- CPU-only (small models, 7B Q4) - 8 vCPU, 16 GB RAM, 100 GB NVMe SSD
- CPU production (parallel requests, 7B Q4) - 16 vCPU, 32 GB RAM, 200 GB NVMe SSD
- GPU entry (7B-13B at FP16) - 4 vCPU, 16 GB RAM, 16-24 GB VRAM, 200 GB NVMe
- GPU production (34B+ models) - 8 vCPU, 64 GB RAM, 40-80 GB VRAM, 500 GB NVMe
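A quick sanity check against the CPU-only minimum can be scripted; this is a sketch, with thresholds mirroring the list above rather than any official sizing tool:

```shell
# Check whether a host meets the CPU-only minimum (8 vCPU, 16 GB RAM).
meets_cpu_minimum() {
  local vcpus="$1" ram_gb="$2"
  [ "$vcpus" -ge 8 ] && [ "$ram_gb" -ge 16 ]
}
# On a live server, feed in real values, e.g.:
#   meets_cpu_minimum "$(nproc)" "$(free -g | awk '/^Mem:/ {print $2}')"
meets_cpu_minimum 16 32 && echo "ok for 7B Q4"
```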
Recommended DCXV Configuration
DCXV cloud servers run on Tier III EU infrastructure with private networking. For Ollama in production:
- CPU server, 16 vCPU / 32 GB RAM - serves 7B models at 18-28 tokens/s, suitable for internal tools and batch jobs
- GPU server, 16-24 GB VRAM - serves 7B-13B models at 80-120 tokens/s, suitable for user-facing features
- GPU server, 80 GB VRAM - serves 70B models at 25-40 tokens/s, GPT-4 class for production APIs
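To translate these throughput numbers into user-facing latency, divide the expected reply length by the generation rate. A back-of-envelope sketch (illustrative only; real latency adds prompt processing and time to first token):

```shell
# Seconds to generate a full response: tokens / (tokens per second).
response_seconds() {
  awk -v t="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", t / tps }'
}
response_seconds 150 25    # 70B on 80 GB GPU: a 150-token answer in ~6.0 s
response_seconds 150 100   # 7B on a 16 GB GPU: the same answer in ~1.5 s
```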
Contact sales@dcxv.com to configure Ollama-ready GPU or CPU instances in EU data centers.
Quick Setup Commands
# Install Ollama on Ubuntu 22.04
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Check the Ollama service (it starts automatically after install)
sudo systemctl status ollama

# Pull models you need
ollama pull llama3.1:8b # 4.7 GB - general purpose
ollama pull mistral:7b # 4.1 GB - fast, structured output
ollama pull nomic-embed-text # 274 MB - embeddings for RAG
ollama pull codellama:13b # 7.4 GB - code tasks
# List downloaded models
ollama list
# Run a quick test
ollama run llama3.1:8b "In one sentence, what is GDPR?"

# Configure Ollama to serve on your private network
# Edit /etc/systemd/system/ollama.service
# Add under [Service]:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_NUM_PARALLEL=4" # concurrent requests
# Environment="OLLAMA_MAX_LOADED_MODELS=2" # models to keep in memory
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify API is accessible from app server
curl http://10.0.0.5:11434/api/tags

# Use the OpenAI-compatible API from your application
# List available models
curl http://10.0.0.5:11434/v1/models
# Chat completion (drop-in for OpenAI SDK)
curl http://10.0.0.5:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Summarize GDPR Article 5 in 3 bullet points."}
    ]
  }'
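The response comes back in the standard OpenAI chat-completion shape, with the reply under choices[0].message.content. A sketch of extracting it with python3's stdlib json module (the sample JSON below is illustrative; jq would work equally well if installed):

```shell
# Extract the assistant's reply from a chat-completion response on stdin.
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
# Sample response body, trimmed to the fields this sketch reads:
echo '{"choices":[{"message":{"role":"assistant","content":"GDPR is an EU data-protection law."}}]}' | extract_reply
```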
# Embeddings for RAG
curl http://10.0.0.5:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "input": "EU data residency requirements"}'

# Optional: protect Ollama with a reverse proxy (nginx)
sudo apt install -y nginx
sudo tee /etc/nginx/sites-available/ollama > /dev/null << 'EOF'
server {
    listen 443 ssl;
    server_name ai.yourdomain.eu;
    # TLS certificate paths - e.g. issued by certbot for your domain
    ssl_certificate     /etc/letsencrypt/live/ai.yourdomain.eu/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.eu/privkey.pem;
    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        # Add auth header check here for production
    }
}
EOF
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx

Expected Performance Benchmarks
CPU server (16 vCPU DCXV), llama3.1:8b Q4_K_M:
- Single request generation - 18-28 tokens/s
- Concurrent requests (OLLAMA_NUM_PARALLEL=4) - 6-10 tokens/s per request
- Embedding throughput (nomic-embed-text) - 250-400 vectors/s
GPU server (16 GB VRAM), llama3.1:8b FP16:
- Single request generation - 80-120 tokens/s
- Concurrent requests (4 parallel) - 50-80 tokens/s per request
- Time to first token - 100-250ms
GPU server (24 GB VRAM), mistral:7b FP16:
- Single request generation - 100-150 tokens/s
- Structured JSON output latency - 200-400ms typical
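Note that per-request speed drops under OLLAMA_NUM_PARALLEL, but aggregate throughput across requests usually rises, which is why parallel serving still pays off for batch workloads. A linear approximation (real schedulers vary):

```shell
# Aggregate tokens/s across n parallel requests, assuming linear scaling.
aggregate_tps() {
  awk -v n="$1" -v per="$2" 'BEGIN { printf "%d\n", n * per }'
}
aggregate_tps 4 8    # CPU: 4 x 8 tok/s  = 32 aggregate vs 18-28 single-stream
aggregate_tps 4 65   # GPU: 4 x 65 tok/s = 260 aggregate vs 80-120 single-stream
```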
Bottom Line
Ollama on a DCXV EU cloud server gives your team a private, GDPR-compliant AI endpoint that is as simple to manage as any other service. Install takes under five minutes, models pull with a single command, and the OpenAI-compatible API means any application using the OpenAI SDK works without code changes. Contact DCXV to spin up a CPU or GPU server in an EU data center.
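The no-code-changes claim works because the official OpenAI SDKs read their endpoint and key from environment variables. A sketch of pointing an existing application at the Ollama server (the IP is a placeholder for your server's private address; Ollama ignores the key's value, but the SDKs require one to be set):

```shell
# Redirect any OpenAI-SDK application to the EU Ollama endpoint.
export OPENAI_BASE_URL="http://10.0.0.5:11434/v1"
export OPENAI_API_KEY="ollama"   # value unused by Ollama, but must be non-empty
echo "$OPENAI_BASE_URL"
```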