DeepSeek V4: 1.6T MoE Model with 1M Context on EU Server

DeepSeek V4 is the most capable open-source language model family available as of April 2026. The series ships two variants - DeepSeek-V4-Pro (1.6T parameters, 49B activated) and DeepSeek-V4-Flash (284B parameters, 13B activated) - both supporting a one million token context window. For European businesses, self-hosting DeepSeek V4 on an EU cloud server means accessing frontier AI capabilities under full GDPR data residency.

This guide covers what DeepSeek V4 brings architecturally, how to run it locally on DCXV GPU infrastructure, and what performance to expect across reasoning modes.

What Is New in DeepSeek V4

DeepSeek V4 introduces three architectural upgrades over V3.2:

  • Hybrid Attention (CSA + HCA) - Combines Compressed Sparse Attention and Heavily Compressed Attention to dramatically reduce long-context cost. At 1M tokens, V4-Pro requires only 27% of the inference FLOPs and 10% of the KV cache compared with V3.2.
  • Manifold-Constrained Hyper-Connections (mHC) - Strengthens residual connections to improve signal propagation stability across layers without sacrificing model expressivity.
  • Muon Optimizer - Replaces AdamW for faster convergence and improved training stability at scale.
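The claimed KV-cache saving is easiest to appreciate with a back-of-envelope calculation. The layer count, head count, and head dimension below are illustrative assumptions, not published V4 specs; only the 10% ratio comes from the figures above.

```python
# Back-of-envelope KV-cache size at 1M context.
# layers/kv_heads/head_dim are illustrative assumptions, not V4 specs.

def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes = 2 (K and V) * tokens * layers * kv_heads * head_dim."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

tokens = 1_000_000
# Hypothetical dense baseline: 61 layers, 8 KV heads, head_dim 128, FP8 cache
baseline = kv_cache_gb(tokens, layers=61, kv_heads=8, head_dim=128, bytes_per_elem=1)
# The article states V4 needs only ~10% of the baseline KV cache at 1M tokens:
v4_estimate = 0.10 * baseline
print(f"baseline ~ {baseline:.0f} GB, V4 hybrid attention ~ {v4_estimate:.0f} GB")
```

Even under these modest assumptions, a vanilla KV cache at 1M tokens exceeds a single GPU's VRAM, which is why the 10x compression matters for practical long-context serving.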

Both models are pre-trained on over 32T tokens and post-trained with a two-stage pipeline: domain-specific expert cultivation (SFT + RL with GRPO), followed by on-policy distillation to consolidate expertise into a single model.

Model Variants and Precision

Model                     Total Params   Activated   Context   Precision
DeepSeek-V4-Flash-Base    284B           13B         1M        FP8 Mixed
DeepSeek-V4-Flash         284B           13B         1M        FP4 + FP8 Mixed
DeepSeek-V4-Pro-Base      1.6T           49B         1M        FP8 Mixed
DeepSeek-V4-Pro           1.6T           49B         1M        FP4 + FP8 Mixed

FP4 + FP8 Mixed means MoE expert parameters use FP4 precision while most other parameters use FP8, significantly reducing VRAM requirements without major quality loss.
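As a rough illustration of why the mixed layout saves memory: if (hypothetically) about 90% of parameters sit in MoE experts stored at FP4 (0.5 bytes each) and the rest at FP8 (1 byte each), the weight footprint works out close to the hardware figures quoted later in this guide.

```python
# Rough weight-memory estimate for the FP4 + FP8 mixed layout.
# The 90/10 expert vs. non-expert split is an assumption for illustration.

def weight_gb(total_params, expert_frac, fp4_bytes=0.5, fp8_bytes=1.0):
    # MoE expert weights in FP4, everything else in FP8
    return total_params * (expert_frac * fp4_bytes + (1 - expert_frac) * fp8_bytes) / 1e9

flash = weight_gb(284e9, expert_frac=0.90)   # ~156 GB
pro = weight_gb(1.6e12, expert_frac=0.90)    # ~880 GB
print(f"V4-Flash ~ {flash:.0f} GB, V4-Pro ~ {pro:.0f} GB")
```

The Flash estimate lands inside the 150-180 GB range given in the hardware section below, which suggests a split in this ballpark.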

Three Reasoning Modes

Both Pro and Flash support three inference modes you select at runtime:

  • Non-think - Fast, intuitive responses for routine tasks. The model emits no reasoning trace; output begins with a closing </think> tag and goes straight to the answer.
  • Think High - Deliberate logical analysis. Slower but more accurate. The model produces a full <think> block before the final answer.
  • Think Max - Maximum reasoning effort, best for math proofs, hard coding problems, and complex agentic tasks. Requires a special system prompt and a context window of at least 384K tokens.
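The model repo ships a parsing helper for completions (used in the setup section below); as a rough illustration of the output format, here is a minimal sketch that splits a reasoning trace from the final answer, assuming the completion wraps its chain of thought in a <think>...</think> block as described above.

```python
import re

# Minimal sketch: separate the reasoning trace from the final answer.
# Assumes Think High completions use a <think>...</think> block.

def split_thinking(completion: str):
    match = re.match(r"<think>(.*?)</think>(.*)", completion, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    # Non-think outputs carry no reasoning trace
    return None, completion.strip()

thinking, answer = split_thinking("<think>2+2 is basic arithmetic.</think>4")
print(thinking)  # 2+2 is basic arithmetic.
print(answer)    # 4
```

In production, prefer the official parsing utility from the model repo over a hand-rolled regex; this sketch only shows the shape of the data.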

Benchmark Highlights

DeepSeek-V4-Pro in Think Max mode achieves the top Codeforces rating (3206) among all models tested, ahead of GPT-5.4 (3168) and Gemini-3.1-Pro (3052). On LiveCodeBench it scores 93.5% Pass@1, and on SWE-bench Verified it resolves 80.6% of real-world GitHub issues.

For knowledge tasks, V4-Pro in Think Max mode reaches 90.1% on GPQA Diamond and 87.5% on MMLU-Pro. The Flash variant in Think Max mode scores comparably to Pro on most reasoning tasks, at a fraction of the VRAM cost.

Hardware Requirements for Self-Hosting

  • V4-Flash (284B, FP4+FP8) - Approximately 150-180 GB of weights. Requires at least 2x A100 80 GB (or H100 equivalents) with tensor parallelism.
  • V4-Pro (1.6T, FP4+FP8) - Requires 8x A100 80 GB or equivalent multi-node GPU setup. More practical for organizations running dedicated inference clusters.

For most European businesses, V4-Flash is the practical choice - comparable reasoning to Pro in Think Max mode, much lower hardware cost, and still 1M token context.

Quick Setup Commands

# Download DeepSeek-V4-Flash from HuggingFace
pip install huggingface_hub transformers

python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='deepseek-ai/DeepSeek-V4-Flash',
    local_dir='/models/deepseek-v4-flash',
    ignore_patterns=['*.md'],
)
"
# The snapshot download above already fetches the repo's encoding utilities,
# so no separate clone is needed. Encode messages with the provided helper:

# Use the provided encoding helper
python3 << 'EOF'
import sys
sys.path.insert(0, '/models/deepseek-v4-flash/encoding')
from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "Summarize GDPR Article 5 in 3 bullet points."}
]

# Non-think mode - fast response
prompt = encode_messages(messages, thinking_mode="non-thinking")
print("Non-think prompt length:", len(prompt))

# Think High mode - analytical response
prompt_think = encode_messages(messages, thinking_mode="thinking")
print("Think prompt length:", len(prompt_think))
EOF
# Serve with vLLM (recommended for production)
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model /models/deepseek-v4-flash \
    --host 10.0.0.5 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.90 \
    --served-model-name deepseek-v4-flash

# Test the API
curl http://10.0.0.5:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": "What is GDPR?"}],
        "temperature": 1.0,
        "top_p": 1.0
    }'
# For Think Max mode - serve with --max-model-len of at least 393216 (384K)
# and include the special system prompt
curl http://10.0.0.5:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-v4-flash",
        "messages": [
            {"role": "system", "content": "Think step by step. Use maximum reasoning effort."},
            {"role": "user", "content": "Prove that sqrt(2) is irrational."}
        ],
        "temperature": 1.0,
        "top_p": 1.0,
        "max_tokens": 32768
    }'

Recommended Inference Parameters

The DeepSeek team recommends using temperature = 1.0 and top_p = 1.0 for local deployment. For Think Max mode, set the context window to at least 384K tokens to allow full chain-of-thought reasoning.
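To keep the recommended defaults consistent across calls, they can be baked into a small payload builder that you POST to the vLLM endpoint. This is a sketch: build_request is a hypothetical helper name, and the model name matches the --served-model-name used in the serving command above.

```python
import json

# Sketch of a request builder with the recommended sampling defaults.
# build_request is a hypothetical helper; the model name assumes the
# vLLM --served-model-name shown earlier in this guide.

def build_request(prompt, think_max=False):
    messages = []
    if think_max:
        # Think Max requires the special system prompt (and a server
        # context window of at least 384K tokens)
        messages.append({"role": "system",
                         "content": "Think step by step. Use maximum reasoning effort."})
    messages.append({"role": "user", "content": prompt})
    return {
        "model": "deepseek-v4-flash",
        "messages": messages,
        "temperature": 1.0,  # recommended default
        "top_p": 1.0,        # recommended default
    }

payload = build_request("What is GDPR?")
print(json.dumps(payload, indent=2))
```

Centralizing the defaults this way avoids each caller silently drifting to different sampling settings.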

Running DeepSeek V4 on DCXV EU Infrastructure

DCXV GPU servers in EU Tier III data centers are the practical path to self-hosting DeepSeek V4 under GDPR data residency. Recommended configurations:

  • 2x A100 80 GB - Runs V4-Flash in FP4+FP8 mixed precision. Handles Think High and Think Max modes. Suitable for internal enterprise tools and EU API services.
  • 8x A100 80 GB - Required for V4-Pro. Handles the full 1.6T parameter model in FP4+FP8. For organizations needing frontier model quality with full data control.

Contact sales@dcxv.com to discuss multi-GPU configurations for DeepSeek V4 deployment.

Bottom Line

DeepSeek V4 is the strongest open-source model release of 2026, with V4-Pro-Max achieving the top Codeforces rating and near-frontier performance on reasoning and agentic benchmarks. The hybrid attention architecture makes 1M token context practical - not just technically possible. For European organizations that cannot send prompts to US-hosted APIs, self-hosting V4-Flash on DCXV EU GPU infrastructure delivers GPT-4-class reasoning under full GDPR compliance.
