Kimi K2.6 Beats GPT-5.4 at Code Review: antirez's Real-World Test on Richard Hipp's Patch
⏱️ Key moments
- 00:00 - Introduction: Chinese models as exit strategy
- 02:30 - Richard Hipp's PR on linenoise
- 05:00 - The escape sequence bug in the prompt
- 08:15 - Kimi K2.6 analyzes the patch and finds the bug
- 13:40 - Claude Opus 4.7: fast and the ZWJ case
- 17:20 - GPT-5.4 with Codex: the golden standard that fails
- 21:00 - Strategic lesson: affinity and optionality
Salvatore Sanfilippo, better known as antirez, published a video testing three top-tier LLMs (Kimi K2.6, Claude Opus 4.7, and GPT-5.4) on a real code review task: analyzing a pull request from Dr. Richard Hipp, creator of SQLite, against the linenoise library. The result? The Chinese open-source model beat the American “golden standard” in one specific but instructive case.
Context: A PR from a Programming Legend
Antirez opens with a political and pragmatic premise: he watches Chinese models with growing interest not just for democratization, but as an exit strategy. Between transatlantic tensions and the risk that access to American models could become expensive or restricted, having viable alternatives is a practical necessity.
The pull request comes from Richard Hipp himself, who describes linenoise as “the default command line editor library for SQLite 3, because linenoise fits so well with the low dependency, self-contained, it just works philosophy of SQLite.” There is a deep cultural kinship between Hipp and antirez: both believe in essential programming.
The Bug: Escape Sequences in the Prompt
The problem is simple but subtle. When using linenoise with a colored prompt (e.g., ANSI escape sequences to change color based on a SQLite transaction state), the library doesn’t correctly calculate the prompt’s visible length because it doesn’t strip escape sequences. This breaks cursor positioning.
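To see why this breaks cursor positioning, here is a minimal sketch (not linenoise code) of the calculation the library needs: counting only the columns a prompt occupies on screen, skipping ANSI CSI sequences (ESC, `[`, parameter bytes, then a final byte in the 0x40–0x7E range). The function name `visible_len` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch: count the on-screen columns of a NUL-terminated
 * prompt, skipping ANSI CSI escape sequences (ESC '[' params final-byte).
 * Without this stripping, a colored prompt's raw byte length overstates
 * its width and the cursor lands in the wrong column. */
size_t visible_len(const char *s) {
    size_t n = 0;
    while (*s) {
        if (s[0] == '\x1b' && s[1] == '[') {
            s += 2;                                /* skip ESC '[' */
            while (*s && (*s < 0x40 || *s > 0x7e)) /* skip parameter bytes */
                s++;
            if (*s) s++;                           /* skip final byte, e.g. 'm' */
        } else {
            n++;                                   /* a visible character */
            s++;
        }
    }
    return n;
}
```

With a red prompt like `"\x1b[31msqlite>\x1b[0m "`, the raw length is 18 bytes but the visible width is 8 columns.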
Hipp’s patch introduces a linenoiseEscapeSeqLen() function that, given a pointer to a character, returns the length in bytes of the escape sequence if the character starts one, or zero otherwise. Small, self-contained, elegant patch.
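The article does not reproduce the patch itself, so the following is only a hedged guess at the contract it describes: given a pointer into a NUL-terminated string, return the byte length of the CSI escape sequence starting there, or 0 if none starts there. Hipp's actual implementation may differ in detail.

```c
#include <stddef.h>

/* Guess at the contract described in the article, NOT Hipp's code:
 * return the byte length of the ANSI CSI sequence beginning at s,
 * or 0 if s does not start one. Assumes s is NUL-terminated. */
size_t linenoiseEscapeSeqLen(const char *s) {
    if (s[0] != '\x1b' || s[1] != '[') return 0;
    size_t i = 2;
    /* CSI body: parameter/intermediate bytes 0x20-0x3F ... */
    while (s[i] && (unsigned char)s[i] >= 0x20 && (unsigned char)s[i] <= 0x3f)
        i++;
    /* ... terminated by a final byte in 0x40-0x7E. */
    if (s[i] >= 0x40 && s[i] <= 0x7e) return i + 1;
    return 0; /* unterminated or not a CSI sequence */
}
```

For example, `"\x1b[31m"` yields 5, while a plain character yields 0.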
Kimi K2.6: The Newcomer That Surprises
Antirez used Kimi Code (Moonshot AI’s coding agent) with the newly released Kimi K2.6. The model received the prompt: “Please evaluate this pull request against this codebase”, with the linenoise directory and the PR link.
Kimi first produced a correct bug analysis, but couldn’t access the patch.txt file due to tool limitations. After antirez pointed it to the file at /tmp/patch.txt, Kimi:
- Correctly identified that CSI sequence detection is sufficient for the SQLite colored prompt use case
- Found a real bug: accessing `s[1]` without first checking the escape sequence length, a potential out-of-bounds read
- Proposed the correct fix: add a length check before accessing `s[1]`
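The bug pattern can be shown schematically (this is illustrative code, not the actual patch): if the buffer is not guaranteed NUL-terminated, reading `s[1]` before checking how many bytes remain is an out-of-bounds read, and the fix is a bound check up front. The name `escape_seq_len_checked` and the `avail` parameter are hypothetical.

```c
#include <stddef.h>

/* Schematic illustration of the flaw the models flagged, not the patch
 * itself. Buggy version: `if (s[0] == '\x1b' && s[1] == '[')` with no
 * length check reads past the buffer when only one byte remains. */
size_t escape_seq_len_checked(const char *s, size_t avail) {
    if (avail < 2) return 0;                     /* the fix: check length first */
    if (s[0] != '\x1b' || s[1] != '[') return 0; /* now reading s[1] is safe */
    size_t i = 2;
    while (i < avail && s[i] >= 0x20 && s[i] <= 0x3f)
        i++;                                     /* CSI parameter bytes */
    if (i < avail && s[i] >= 0x40 && s[i] <= 0x7e)
        return i + 1;                            /* include the final byte */
    return 0;                                    /* unterminated sequence */
}
```

With the guard, a lone trailing ESC byte (`avail == 1`) returns 0 instead of reading past the end of the buffer.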
Kimi K2.6 was released on April 20, 2026 by Moonshot AI. It’s a 1T total parameter MoE model with 32B active per token, up to 262K context, and a native MoonViT vision encoder. In benchmarks, it competes directly with GPT-5.4 and Claude Opus 4.6: HLE-Full with tools 54.0 (vs 52.1 GPT-5.4, 53.0 Opus 4.6), BrowseComp 83.2, SWE-Bench Verified 80.2, LiveCodeBench v6 89.6. Open weights on Hugging Face under a modified MIT license, with Cloudflare pricing at $0.95/M input tokens and $4/M output, roughly 15x cheaper than Claude Opus 4.6.
One interesting detail: Kimi’s thinking process is fully visible, something no longer possible with paid models from OpenAI and Anthropic. That transparency has practical value for code review.
Claude Opus 4.7: Fast and Thorough
With Claude Opus 4.7 (“high effort” mode), the picture is different. Opus is faster and produces a more complete analysis:
- Confirms: “Escape detection needs small fixes before merge”
- Identifies the out-of-bounds read like Kimi
- Notes that only CSI sequences are handled, not OSC
- Raises the ZWJ (Zero Width Joiner) issue: an edge case where an emoji followed by a joiner code and a color code could theoretically trigger a bug. A subtle observation no other model made.
Antirez notes this ZWJ case is theoretical: “nobody emits an emoji, then a joiner code, then a color code in a prompt.” But the fact that Opus caught it demonstrates a superior level of analytical depth.
GPT-5.4 with Codex: The Golden Standard Stumbles
And here comes the surprise. GPT-5.4, which antirez calls “the golden standard of programming,” completely failed to find the bug on the first pass. After receiving the same prompt:
- It responded: “Findings: no blocking findings”
- Classified the PR as “issue” rather than “pull request”
- Did not identify the out-of-bounds read autonomously
Only after antirez explicitly asked “Does the patch contain any bugs?” did GPT-5.4 find the `s[1]` access without the length check. It then proposed the same fix as Kimi, but the fact it didn’t find it on its own is significant.
Antirez puts it plainly: “This is an example where GPT-5.4, usually the strongest, performed worse than all, and Kimi performed better in this specific context than GPT.”
Broader Lesson: Exit Strategy and Affinity
Beyond the single test, antirez draws a strategic lesson:
“If we European programmers no longer had access to certain models, or for economic reasons the most important vendors started charging exorbitant prices for these models, we need to have alternatives.”
His position is pragmatic: he’ll use Kimi K2.6 more and more to develop affinity, just as he uses GPT-5.4 with Codex daily and Opus with Claude Code. Not for replacement, but for optionality.
On small local models (Gemma 4, Qwen 3.5/3.6), the verdict is blunt: “they kind of work, but this is a different level of capability.”