GLM-5.1 — AI Glossary | Superteams.ai

GLM-5.1 is Z.AI’s next-generation open-weight foundation model, released on March 27, 2026 under the MIT License. It is the successor to GLM-5 (“From Vibe Coding to Agentic Engineering”) and sets a new state-of-the-art result on SWE-Bench Pro — outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on real-world software engineering tasks. Its defining characteristic is sustained long-horizon agentic execution: GLM-5.1 can work autonomously on a single complex task for up to 8 hours, continuously evaluating its own intermediate results and revising its approach hundreds of times before delivering final output.

Architecture

GLM-5.1 is a Sparse Mixture-of-Experts model with 754 billion total parameters and 40 billion active per token. The MoE layer uses 256 routed experts with top-8 routing plus 1 always-active shared expert, giving the model a stable base representation alongside dynamic specialisation.
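
The routing scheme described above can be illustrated with a minimal sketch. The expert count and top-8 selection come from the text; the gating function (a softmax over the selected logits), the scalar toy "experts", and all function names are assumptions made for illustration, not Z.AI's implementation.

```python
import math
import random

NUM_ROUTED = 256   # routed experts (from the text)
TOP_K = 8          # routed experts selected per token (from the text)

def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, gate_weight) pairs for the top_k routed experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    # Assumed gating: softmax over the selected logits, so weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, routed_experts, shared_expert):
    """Add the always-active shared expert to the gated routed experts."""
    out = shared_expert(x)
    for idx, gate in route_token(router_logits):
        out += gate * routed_experts[idx](x)
    return out

# Toy usage: each "expert" is a scalar function standing in for a feed-forward block.
rng = random.Random(0)
routed_experts = [lambda x, s=i: x * (1 + s / NUM_ROUTED) for i in range(NUM_ROUTED)]
shared_expert = lambda x: 0.5 * x
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

selected = route_token(logits)
y = moe_layer(1.0, logits, routed_experts, shared_expert)
```

Only 8 of the 256 routed experts run for a given token, which is why the active parameter count is far smaller than the total.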

Multi-head Latent Attention (MLA)

GLM-5.1 replaces standard multi-head attention with MLA, which compresses key-value representations into a lower-dimensional latent space before projecting back out for attention computation. This reduces KV cache memory by roughly 33% versus standard MHA. The MLA-256 variant used here increases head dimension from 192 to 256 while reducing total attention heads by a third, cutting decoding compute without sacrificing training efficiency or context fidelity.
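
The compress-then-expand idea behind a latent KV cache can be sketched in a few lines. All sizes and weights below are toy, hypothetical values (they are not GLM-5.1's dimensions and do not reproduce the 33% figure), and details of real MLA such as the separate handling of positional encodings are omitted.

```python
import random

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy, hypothetical sizes chosen only to keep the example small.
HIDDEN, LATENT, N_HEADS, HEAD_DIM = 16, 4, 2, 8

rng = random.Random(0)
W_down = rand_matrix(LATENT, HIDDEN, rng)               # hidden -> latent
W_up_k = rand_matrix(N_HEADS * HEAD_DIM, LATENT, rng)   # latent -> keys
W_up_v = rand_matrix(N_HEADS * HEAD_DIM, LATENT, rng)   # latent -> values

cache = []  # holds only the compressed latent vector for each token

def append_token(hidden_state):
    """Compress the token's hidden state and cache the latent vector."""
    cache.append(matvec(W_down, hidden_state))

def keys_and_values():
    """Project cached latents back up to per-head keys and values."""
    keys = [matvec(W_up_k, c) for c in cache]
    values = [matvec(W_up_v, c) for c in cache]
    return keys, values

for _ in range(3):
    append_token([rng.uniform(-1, 1) for _ in range(HIDDEN)])

keys, values = keys_and_values()
cached_per_token_latent = len(cache[0])           # LATENT numbers per token
cached_per_token_mha = 2 * N_HEADS * HEAD_DIM     # full keys plus values per token
```

The saving comes from storing `LATENT` numbers per token instead of a full key and value for every head; the up-projections are recomputed (or folded into other matrices) at attention time.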

DeepSeek Sparse Attention (DSA)

For long-context processing, GLM-5.1 combines MLA with DeepSeek Sparse Attention, which attends selectively across the 200K-token context window rather than computing full quadratic attention across all positions. This combination keeps long-context inference compute tractable at production scale.
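
One common way to attend selectively is to score positions with a cheap index and run full attention only over the top-k survivors. The sketch below illustrates that general idea under stated assumptions; the low-dimensional `index_keys`, the function names, and the toy data are hypothetical, and this is not claimed to be the exact mechanism used in GLM-5.1.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Standard scaled dot-product attention for a single query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def select_positions(index_query, index_keys, top_k):
    """Pick the top_k positions using cheap, low-dimensional index scores."""
    scores = [dot(index_query, ik) for ik in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attend(query, keys, values, positions):
    """Run attention over the selected positions only."""
    return attend(query,
                  [keys[i] for i in positions],
                  [values[i] for i in positions])

# Toy data: four context positions with 2-d keys and 1-d index keys.
query = [1.0, 1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
index_query = [1.0]
index_keys = [[1.0], [0.0], [1.0], [-1.0]]

top2 = select_positions(index_query, index_keys, top_k=2)
sparse_out = sparse_attend(query, keys, values, top2)
dense_out = attend(query, keys, values)
all_positions = select_positions(index_query, index_keys, top_k=len(keys))
full_out = sparse_attend(query, keys, values, all_positions)
```

With `top_k` fixed, the expensive attention step scales with the number of selected positions rather than with the full context length; selecting every position recovers dense attention.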

Multi-Token Prediction (MTP) head

A secondary prediction head is trained to predict multiple future tokens per forward pass. During inference, this enables speculative decoding (generating candidate continuations in parallel and verifying them in bulk), significantly improving tokens-per-second throughput without affecting output quality.
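
The draft-and-verify loop can be sketched with toy stand-ins for the two predictors. Here `target_next` plays the main model and `draft_next` plays the cheaper multi-token head; both are hypothetical functions, and the sketch covers only greedy decoding, where acceptance is a simple equality check.

```python
def target_next(tokens):
    """Toy 'main model': deterministic next token for a given context."""
    return (sum(tokens) * 31 + len(tokens)) % 50

def draft_next(tokens):
    """Toy 'draft head': agrees with the target except on some contexts."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, draft_len=3):
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    while len(tokens) < target_len:
        # 1. Draft several tokens cheaply.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. Verify each drafted position against the target.
        #    A real system does this in one batched forward pass.
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)   # keep the target's token and stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched; the verify pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len]

baseline = greedy_decode([1, 2, 3], 20)
speculative = speculative_decode([1, 2, 3], 20)
```

Because every accepted token is exactly what the target would have produced, the output matches plain greedy decoding; the speed-up comes from verifying several positions per target pass.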

Training: Slime Asynchronous RL

GLM-5.1’s post-training uses Slime, Z.AI’s asynchronous reinforcement learning infrastructure. Standard RL for LLMs is sequential: generate a rollout, evaluate it, update weights, repeat — with the cluster idling while evaluation completes. Slime breaks the dependency:

Training trajectories are generated independently across the cluster in parallel
APRIL (Active Partial Rollouts) allows partial evaluation of incomplete trajectories, feeding results back into weight updates without waiting for full rollouts to finish
This keeps GPU utilisation high throughout training and dramatically increases the volume of RL signal the model receives per wall-clock hour
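
The decoupling described above can be sketched with a producer/consumer queue. This is a toy illustration using Python's `asyncio`, not Slime's API: the worker and trainer functions, the segment length, and the use of list lengths as stand-ins for gradient updates are all assumptions.

```python
import asyncio

async def rollout_worker(worker_id, n_steps, segment_len, queue):
    """Generate one trajectory, emitting partial segments as soon as they fill."""
    segment = []
    for step in range(n_steps):
        await asyncio.sleep(0)            # stand-in for generation latency
        segment.append((worker_id, step))
        if len(segment) == segment_len or step == n_steps - 1:
            await queue.put(segment)      # hand off without waiting for the full rollout
            segment = []
    await queue.put(None)                 # sentinel: this worker has finished

async def trainer(queue, n_workers):
    """Consume segments as they arrive instead of waiting for whole rollouts."""
    updates, finished = [], 0
    while finished < n_workers:
        item = await queue.get()
        if item is None:
            finished += 1
            continue
        updates.append(len(item))         # stand-in for a weight update on this segment
    return updates

async def main():
    queue = asyncio.Queue()
    trajectory_lengths = [5, 2, 7]        # rollouts of uneven length
    workers = [rollout_worker(i, n, 2, queue)
               for i, n in enumerate(trajectory_lengths)]
    results = await asyncio.gather(trainer(queue, len(trajectory_lengths)), *workers)
    return results[0]

updates = asyncio.run(main())
```

The trainer never blocks on the slowest rollout: short trajectories and partial segments of long ones are consumed as soon as they are available.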

The RL training distribution for GLM-5.1 was heavily weighted toward coding tasks. Crucially, long-horizon agentic data was incorporated during mid-training rather than only at the fine-tuning stage — which is the key reason the model generalises to multi-step agent workflows rather than excelling only at single-shot code generation.

Thinking mode is enabled by default in GLM-5.1, a change from GLM-5 where it was opt-in.

Agentic Execution

GLM-5.1 was explicitly designed for tools like Claude Code and OpenClaw. Its long-horizon planning improves on GLM-5 across the full agentic loop:

Planning — decompose a complex task into an ordered sequence of sub-tasks
Stepwise execution — invoke tools, write code, run tests, read outputs
Process adjustment — evaluate interim results and revise the plan mid-execution (potentially hundreds of times on a single task)
Result delivery — produce production-ready output after full self-verification
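
The four stages above map naturally onto a control loop with a time budget. The following is a minimal sketch: `run_agent` and the `plan`, `execute`, `evaluate`, and `revise` callables are hypothetical names, and the toy task stands in for real tool calls and model-driven planning.

```python
import time

def run_agent(task, plan, execute, evaluate, revise,
              budget_seconds=8 * 3600, clock=time.monotonic):
    """Plan, execute step by step, revise on failure, and report the result."""
    deadline = clock() + budget_seconds
    steps = plan(task)                              # Planning
    results, revisions = [], 0
    while steps and clock() < deadline:
        step = steps.pop(0)
        outcome = execute(step)                     # Stepwise execution
        if evaluate(step, outcome):
            results.append(outcome)
        else:
            steps = revise(step, outcome, steps)    # Process adjustment
            revisions += 1
    # Result delivery: report what was produced and whether the plan finished.
    return {"results": results, "revisions": revisions, "complete": not steps}

# Toy usage: the test run fails once, forcing one plan revision.
attempts = {"tests": 0}

def plan(task):
    return ["write code", "run tests"]

def execute(step):
    if step == "run tests":
        attempts["tests"] += 1
        return "pass" if attempts["tests"] > 1 else "fail"
    return "done"

def evaluate(step, outcome):
    return outcome != "fail"

def revise(step, outcome, remaining):
    return ["fix bug", step] + remaining

report = run_agent("add feature", plan, execute, evaluate, revise)
```

The default budget of 8 hours mirrors the figure quoted in the text; passing a custom `clock` makes the loop easy to test without waiting.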

The model sustains this loop for up to 8 hours — a threshold that covers most real-world engineering tasks from specification to working, tested code.

Benchmarks & Pricing

Metric	GLM-5.1	Notes
SWE-Bench Pro	58.4%	#1 open-weight, above GPT-5.4 and Claude Opus 4.6
Context window	200K tokens	128K max output

API pricing: approximately $3 per million output tokens. The model delivers roughly 94.6% of Claude Opus 4.6’s performance at a fraction of the cost.
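
As a rough worked example of the output pricing, the sketch below estimates the cost of a response from its token count. It uses only the approximate $3 figure quoted above, ignores input-token charges (not stated here), and assumes "128K" means 128,000 tokens.

```python
PRICE_PER_MILLION_OUTPUT_TOKENS = 3.00  # USD, approximate figure from the text

def output_cost_usd(output_tokens, price_per_million=PRICE_PER_MILLION_OUTPUT_TOKENS):
    """Estimate output-token cost in USD; input tokens are not included."""
    return output_tokens / 1_000_000 * price_per_million

# Cost of a single response at the 128K maximum output length.
max_single_response = output_cost_usd(128_000)
```

Under these assumptions, even a maximum-length response costs well under one dollar in output tokens.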

Features supported: function calling, structured output, context caching, MCP integration, and native thinking mode.

Significance

GLM-5.1 is part of a broader pattern in 2026 where open-weight models reach or exceed proprietary frontier performance on specific capability domains — in this case, agentic software engineering. Its combination of MLA efficiency, sparse attention, asynchronous RL training, and native agentic data during mid-training represents a coherent architectural bet that the bottleneck for long-horizon agents is training distribution, not raw parameter count.
