
Two Chinese AI Giants Clash on the Coding Frontier

By Skeptik Log

Two open-weight models from China just ate Western labs’ lunch on two fronts: sustained autonomous execution and cost-efficient frontier coding. And both are MIT-licensed.

📋 Source: Z.AI, DeepSeek, MarkTechPost, Morph, CodersEra

TL;DR

GLM-5.1 runs for 8 hours straight on a single task and hits SOTA on SWE-Bench Pro. DeepSeek V4 Pro matches Claude Opus 4.6 on coding at 1/21 the cost with a 1M-token context window. Both MIT-licensed, both Chinese, both making the OpenAI-Anthropic-Google trio look overpriced.

Where we’re headed

April 2026 will be remembered as the month the “frontier” shifted east. GLM-5.1 and DeepSeek V4 Pro landed within weeks of each other, and they’re not playing catch-up anymore. In specific domains, they’re setting the pace.

We’ll break down what each model actually does, how they compare, and which one you should reach for depending on your use case. Spoiler: they’re good at different things.

GLM-5.1: The 8-Hour Workhorse

Z.AI released GLM-5.1 on April 7, 2026. It’s a 754-billion-parameter Mixture-of-Experts model built on a Dynamic Sparse Attention (DSA) architecture. The headline: 58.4 on SWE-Bench Pro, edging out GPT-5.4 (57.7) and Claude Opus 4.6 (57.3).

But the real story isn’t a single benchmark. GLM-5.1 was engineered for long-horizon autonomous execution. The model can work continuously on a single task for up to 8 hours, running experiment-analyze-optimize loops across hundreds of rounds and thousands of tool calls without human intervention.

In concrete demos it:

  • Built a complete Linux desktop environment from scratch
  • Optimized a CUDA kernel from 2.6× to 35.7× speedup through 178 autonomous iterations
  • Consistently improved a vector database implementation to 1.5× its initial performance

The 8-hour sustained execution capability addresses a known failure mode in LLM agents: the “plateau problem,” where models exhaust their repertoire early and stop making progress regardless of additional compute time. GLM-5.1’s asynchronous reinforcement learning training, which decouples generation from training, appears to be the key enabler.
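
To make the plateau problem concrete, here is a minimal sketch of an experiment-analyze-optimize loop with a stall counter. It is an illustration only, not Z.AI's implementation: `run_experiment`, `propose_change`, the `patience` threshold, and the toy scoring function are all hypothetical names introduced for this example.

```python
import random
from typing import Callable, List, Tuple


def optimize(
    run_experiment: Callable[[float], float],                 # hypothetical: scores a candidate
    propose_change: Callable[[float, random.Random], float],  # hypothetical: mutates a candidate
    start: float,
    max_rounds: int = 200,
    patience: int = 25,
    seed: int = 0,
) -> Tuple[float, float, List[float]]:
    """Keep the best candidate seen so far; stop after `patience` rounds without a gain."""
    rng = random.Random(seed)
    best, best_score = start, run_experiment(start)
    history = [best_score]
    stale = 0  # consecutive rounds with no improvement (the "plateau" signal)
    for _ in range(max_rounds):
        candidate = propose_change(best, rng)
        score = run_experiment(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
        history.append(best_score)
        if stale >= patience:
            break  # plateau: further rounds are unlikely to help
    return best, best_score, history


def toy_score(x: float) -> float:
    """Toy objective with its maximum at x = 3."""
    return -(x - 3.0) ** 2


def toy_step(x: float, rng: random.Random) -> float:
    """Toy mutation: nudge the candidate by a small random amount."""
    return x + rng.uniform(-0.5, 0.5)


best, best_score, history = optimize(toy_score, toy_step, start=0.0)
print(f"best={best:.3f} score={best_score:.4f} rounds={len(history) - 1}")
```

A real agent would replace the toy functions with tool calls (build, benchmark, profile), but the control flow is the same: the stall counter is what separates a run that keeps finding gains from one that has hit a plateau.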

Key benchmarks:

| Benchmark | GLM-5.1 Score |
|---|---|
| SWE-Bench Pro | 58.4 (SOTA) |
| AIME 2026 | 95.3 |
| GPQA Diamond | 86.2 |
| Terminal-Bench 2.0 | 63.5 (66.5 w/ Claude Code) |
| CyberGym | 68.7 (up from 48.3 in GLM-5) |
| BrowseComp | 68.0 |
| MCP-Atlas | 71.8 |

The model packs a 200K context window with 128K max output tokens, thinking mode with multiple reasoning depths, function calling, structured output, and native MCP integration. It is available under the MIT license on Hugging Face and is supported by SGLang, vLLM, Transformers, and KTransformers for local deployment. API pricing: ~$1.4/M input tokens, $4.4/M output tokens.

DeepSeek V4 Pro: The Efficiency Monster

DeepSeek V4 arrived on April 24, 2026 with two variants:

  • V4-Pro: 1.6T total parameters, 49B active per token
  • V4-Flash: 284B total, 13B active

Both share a 1M-token context window with 384K max output, and both are MIT-licensed.

The architecture introduces three innovations over V3.2: Compressed Sparse Attention (CSA) interleaved with Heavily Compressed Attention (HCA), Manifold-Constrained Hyper-Connections (mHC) for stable deep-layer training, and the Muon optimizer replacing AdamW. The result: at 1M-token context, V4-Pro uses only 27% of the inference FLOPs and 10% of the KV cache compared to V3.2, despite being 2.4× larger overall.
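
As a back-of-the-envelope check on those ratios, the arithmetic can be written out directly. The parameter counts and percentages below come from this article; the 100 GB baseline KV-cache figure is a hypothetical placeholder, used only to show how the 10% ratio scales.

```python
# Parameter counts quoted in this article, in billions.
MODELS = {
    "V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}

# Ratios quoted for V4-Pro versus V3.2 at 1M-token context.
FLOPS_RATIO = 0.27
KV_CACHE_RATIO = 0.10
SIZE_RATIO = 2.4


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the parameters that are active for each token."""
    return active_b / total_b


def scaled(baseline: float, ratio: float) -> float:
    """Apply a quoted ratio to a baseline resource figure."""
    return baseline * ratio


for name, params in MODELS.items():
    print(f"{name}: {active_fraction(**params):.1%} of parameters active per token")

# Total size of V3.2 implied by the "2.4x larger" claim.
implied_v32_total_b = MODELS["V4-Pro"]["total_b"] / SIZE_RATIO
print(f"Implied V3.2 size: about {implied_v32_total_b:.0f}B parameters")

baseline_kv_gb = 100.0  # hypothetical V3.2 KV-cache footprint, not a measured value
print(f"V4-Pro KV cache at the same context: {scaled(baseline_kv_gb, KV_CACHE_RATIO):.0f} GB")
```

The sparsity is the point: only about 3% of V4-Pro's parameters are active per token, which is how a larger model can still cost less per token to run.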

The coding numbers speak for themselves. V4-Pro-Max scores 93.5 on LiveCodeBench, the highest of any model. It hits 80.6% on SWE-Bench Verified (0.2 points behind Claude Opus 4.6) and a Codeforces rating of 3206. On reasoning: GPQA Diamond at 90.1, MMLU-Pro at 87.5.

The dual Think/Non-Think mode is more than a convenience feature. In production agent workflows, the ability to switch between a 2-second rapid response and a 10-second deliberated answer at the API level, without prompt engineering, dramatically simplifies orchestration. Schema violation rates for function calling reportedly dropped from 4.2% in V3.2 to under 0.8% in V4, which makes the model more viable for unsupervised autonomous agents.
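
For readers who want to measure a schema violation rate on their own traffic, here is a minimal sketch. The tool schema, the sample calls, and the definition of a violation (malformed JSON, a missing or unknown field, or a wrong type) are assumptions made for this example; the source does not say how the reported figures were measured.

```python
import json
from typing import Any, Dict, List

# Hypothetical tool schema: argument name -> required Python type.
GET_WEATHER_SCHEMA: Dict[str, type] = {"city": str, "days": int}


def violates_schema(raw_arguments: str, schema: Dict[str, type]) -> bool:
    """True if the arguments are malformed JSON, have missing or unknown
    fields, or carry a value of the wrong type."""
    try:
        args: Any = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return True
    if not isinstance(args, dict) or set(args) != set(schema):
        return True
    # Exact type match, so that True/False is not accepted where an int is required.
    return any(type(args[key]) is not expected for key, expected in schema.items())


def violation_rate(calls: List[str], schema: Dict[str, type]) -> float:
    """Fraction of tool calls whose arguments violate the schema."""
    if not calls:
        return 0.0
    return sum(violates_schema(call, schema) for call in calls) / len(calls)


sample_calls = [
    '{"city": "Berlin", "days": 3}',    # valid
    '{"city": "Berlin", "days": "3"}',  # wrong type
    '{"city": "Berlin"}',               # missing field
    '{"city": "Berlin", "days": 3',     # malformed JSON
]
rate = violation_rate(sample_calls, GET_WEATHER_SCHEMA)
print(f"violation rate: {rate:.0%}")
```

Running a check like this over logged tool calls gives a number you can compare against the reported figures, as long as the caveat about the unknown measurement method is kept in mind.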

Key benchmarks (V4-Pro Max):

| Benchmark | DeepSeek V4-Pro Max |
|---|---|
| SWE-Bench Verified | 80.6% |
| LiveCodeBench | 93.5 (SOTA) |
| GPQA Diamond | 90.1 |
| MMLU-Pro | 87.5 |
| Codeforces Rating | 3206 (SOTA) |
| HMMT 2026 | 95.2 |
| BrowseComp | 83.4 |

Pricing is aggressive: V4-Pro at $1.74/M input and $3.48/M output, V4-Flash at just $0.14/M input and $0.28/M output. That is roughly 1/21 the cost of Claude Opus 4.6 for near-identical SWE-Bench Verified performance.
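
To see what these list prices mean for a concrete workload, here is a small cost estimator. The per-million-token prices are the ones quoted in this article (GLM-5.1's are approximate); the workload of 50M input and 10M output tokens is a hypothetical example, and real bills can differ because of caching discounts or price changes.

```python
from typing import Dict, Tuple

# (input, output) list prices in USD per million tokens, as quoted in this article.
PRICES_PER_M: Dict[str, Tuple[float, float]] = {
    "GLM-5.1": (1.40, 4.40),  # approximate
    "DeepSeek V4-Pro": (1.74, 3.48),
    "DeepSeek V4-Flash": (0.14, 0.28),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload at the quoted list prices."""
    input_price, output_price = PRICES_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload: 50M input tokens, 10M output tokens.
for model in PRICES_PER_M:
    print(f"{model}: ${job_cost(model, 50_000_000, 10_000_000):.2f}")
```

Note that GLM-5.1 is cheaper on input but more expensive on output than V4-Pro, so the cheaper option depends on how output-heavy the workload is.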

Head to Head

| Dimension | GLM-5.1 | DeepSeek V4 Pro |
|---|---|---|
| Total Parameters | 754B | 1.6T |
| Active per Token | MoE (not disclosed) | 49B |
| Context Window | 200K | 1M |
| Max Output | 128K | 384K |
| License | MIT | MIT |
| SWE-Bench (Pro/Verified) | 58.4 (Pro, SOTA) | 80.6% (Verified) |
| LiveCodeBench | N/A | 93.5 (SOTA) |
| GPQA Diamond | 86.2 | 90.1 |
| Terminal-Bench 2.0 | 63.5 | N/A |
| Long-Horizon Execution | 8 hours | Standard |
| API Input Price | ~$1.4/M | $1.74/M |
| Key Innovation | Async RL, sustained execution | CSA+HCA, Muon optimizer |

A critical note on the SWE-Bench comparison: GLM-5.1 reports on SWE-Bench *Pro* while DeepSeek V4 reports on SWE-Bench *Verified*. These are different benchmark variants with different difficulty levels, so the numbers are not directly comparable. SWE-Bench Verified typically yields higher scores than SWE-Bench Pro.

When to Reach for Which

Pick GLM-5.1 when:

  • You need an agent that works autonomously for hours on complex engineering tasks
  • Your workflow involves iterative optimization, debugging loops, or sustained multi-step execution
  • You want the strongest open-weight model for real-world software engineering (SWE-Bench Pro)
  • MCP integration and tool-use orchestration are central to your pipeline

Pick DeepSeek V4 Pro when:

  • You need frontier-level coding at a fraction of Western model pricing
  • Your tasks involve very long contexts (1M tokens) like full codebase ingestion
  • Competitive programming or algorithmic problem-solving is the primary use case
  • You need dual Think/Non-Think mode for mixed-latency workflows

Pick DeepSeek V4 Flash when:

  • Cost is the primary constraint and V4-Pro-level reasoning isn’t strictly necessary
  • You’re running high-volume production workloads (classification, summarization, simple generation)
  • Latency matters more than peak accuracy
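
The three checklists above can be condensed into a small routing function. This is one possible encoding of the guidance, not an official rule set: the `Task` fields, the one-hour threshold for "long-running," and the order of the checks are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Task:
    context_tokens: int                    # prompt size you need to fit
    autonomous_hours: float = 0.0          # expected unattended runtime
    competitive_programming: bool = False  # algorithmic contest-style work
    cost_sensitive: bool = False           # cost matters more than peak accuracy


def pick_model(task: Task) -> str:
    """Map a task profile to one of the three models discussed in this article."""
    if task.context_tokens > 1_000_000:
        raise ValueError("exceeds the 1M-token window of both DeepSeek V4 variants")
    if task.cost_sensitive:
        return "DeepSeek V4 Flash"
    if task.context_tokens > 200_000 or task.competitive_programming:
        # Beyond GLM-5.1's 200K window, or algorithm-heavy work.
        return "DeepSeek V4 Pro"
    if task.autonomous_hours >= 1:  # assumed threshold for long-horizon work
        return "GLM-5.1"
    return "DeepSeek V4 Pro"


print(pick_model(Task(context_tokens=50_000, autonomous_hours=8)))
print(pick_model(Task(context_tokens=400_000)))
print(pick_model(Task(context_tokens=10_000, cost_sensitive=True)))
```

Putting the cost check first is a deliberate choice here; a team that values accuracy over spend would reorder the branches.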

For the technically curious

From here on it gets technical. If you care about the what more than the how, skip to the takeaway.

Real-world deployment scenarios

Startup CTO - A 20-person fintech startup sets GLM-5.1 loose on a 180K-line Rails monolith every Friday night. By Monday morning, the model has completed 8-hour autonomous refactoring sessions: extracting service objects, migrating deprecated APIs, updating the test suite across hundreds of files. What used to take a senior engineer two weeks ships over a weekend, with PRs that pass CI on the first try 73% of the time.

Data Platform Lead - A logistics company feeds DeepSeek V4 Pro their entire 400K-token microservices codebase, API specs, and Terraform configs in a single prompt. The model identifies 23 circular dependencies, maps a complete migration path to a cleaner service mesh, and generates a phased rollout plan. With a 200K-context model, this would require splitting the codebase across multiple prompts and manually stitching the results.
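
The context-window arithmetic behind this scenario can be sketched in a few lines. The function gives only a token-budget lower bound: the `reserved_for_output` headroom is a hypothetical parameter, and splitting along real service boundaries usually needs more prompts than the minimum.

```python
import math


def prompts_needed(total_tokens: int, context_window: int, reserved_for_output: int = 0) -> int:
    """Lower bound on the number of prompts needed to cover `total_tokens`."""
    usable = context_window - reserved_for_output
    if usable <= 0:
        raise ValueError("no room left for input after reserving output tokens")
    return math.ceil(total_tokens / usable)


codebase_tokens = 400_000  # size quoted in the scenario above

one_million = prompts_needed(codebase_tokens, 1_000_000)
two_hundred_k = prompts_needed(codebase_tokens, 200_000)
with_headroom = prompts_needed(codebase_tokens, 200_000, reserved_for_output=32_000)
print(one_million, two_hundred_k, with_headroom)
```

Even at the lower bound, any split means cross-file relationships such as circular dependencies have to be reassembled by hand, which is the practical argument for the larger window.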

DevOps Engineer - During a 3 AM production outage, GLM-5.1 autonomously traces a cascading Kubernetes failure from a misconfigured HPA through three services, identifies the root cause, writes the fix, and prepares the rollback command, all without hand-holding.

Quant Researcher - A Hong Kong trading firm uses V4 Pro’s Think mode to verify a new pairs-trading algorithm, catching a subtle mean-reversion error that would have caused losses in low-volatility regimes. In Non-Think mode, the same model handles real-time signal generation at sub-second latency during market hours.

Solo Developer - An indie dev routes routine code completion, commit messages, and test scaffolding through V4-Flash at $0.14/M input. The monthly bill barely crosses $8, while V4-Pro kicks in only for the occasional complex bug. The two-model setup replaces a $200/month coding assistant subscription at a fraction of the cost.

Security Analyst - A bank’s cybersecurity team deploys GLM-5.1 on internal code repositories, leveraging its CyberGym-honed vulnerability detection (68.7, up from 48.3 in GLM-5). The model flags a SQL injection pattern that static analyzers missed because it was split across two ORM layers, and generates the patched code. The team estimates 40% more true positives than their previous SAST toolchain.

Competitive Programmer - A university research team uses V4 Pro to tackle Codeforces Div.1 problems. With its 3206 Elo rating and 93.5 LiveCodeBench, the model consistently produces near-optimal solutions for graph theory and dynamic programming challenges, serving as both a training partner and correctness oracle.

Platform Architect - An enterprise team builds a two-tier review pipeline: incoming bugs hit V4-Pro in Non-Think mode for rapid triage and duplicate detection (under 2 seconds per ticket), while complex or security-relevant issues escalate to Think mode for deep root-cause analysis. One model handles both speed and depth, no separate fast/slow configurations needed.
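
A two-tier pipeline like this comes down to one escalation decision per ticket. The sketch below uses a keyword and fan-out heuristic as a stand-in for the model's own Non-Think triage pass; the keywords, the `linked_services` field, and the mode labels are hypothetical, and how a mode is actually selected in an API request is not shown because the source does not specify it.

```python
from dataclasses import dataclass

# Hypothetical escalation triggers, chosen for illustration only.
SECURITY_KEYWORDS = ("injection", "authentication", "xss", "cve", "credential")


@dataclass
class Ticket:
    title: str
    body: str
    linked_services: int = 1  # how many services the bug report touches


def choose_mode(ticket: Ticket) -> str:
    """Return 'think' for security-relevant or wide-reaching tickets, else 'non-think'."""
    text = f"{ticket.title} {ticket.body}".lower()
    if any(keyword in text for keyword in SECURITY_KEYWORDS):
        return "think"
    if ticket.linked_services >= 3:
        return "think"
    return "non-think"


tickets = [
    Ticket("Typo on settings page", "Label says 'Emial' instead of 'Email'."),
    Ticket("Possible SQL injection in search", "User input reaches the ORM unescaped."),
    Ticket("Checkout timeout", "Fails intermittently.", linked_services=4),
]
for ticket in tickets:
    print(f"{ticket.title}: {choose_mode(ticket)}")
```

In a real deployment the first pass would be the fast model call itself, with its output (duplicate, routine, or needs analysis) driving the same branch.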

The takeaway

Key points:

  • GLM-5.1’s 8-hour autonomous execution and SWE-Bench Pro record prove LLMs can ship production code without hand-holding
  • DeepSeek V4 Pro matches Claude Opus 4.6 on coding at 1/21 the cost, with a 1M-token context window
  • Open-weight Chinese models are no longer catching up. In specific domains, they’re setting the pace

The competitive landscape is no longer “OpenAI vs. Anthropic vs. Google.” It’s a five-way race, and two contenders just proved they can win on home turf.
