MiniMax Releases M2.7: First AI Model That Can Iteratively Improve Itself
MiniMax has released M2.7, the first AI model designed to participate deeply in its own evolution. Through 100+ autonomous self-iteration cycles it achieved a 30% performance gain on internal benchmarks; it competed autonomously on Kaggle to win 9 gold medals; it scored 56.22% on SWE-Pro, matching GPT-5.3 Codex; and it maintained a 97% skill adherence rate while operating 40+ complex tools simultaneously, marking a qualitative leap in AI agent capabilities.
In March 2026, Chinese AI powerhouse MiniMax officially released **MiniMax M2.7**, its latest flagship model. This is not merely another incremental performance upgrade — it represents a landmark moment in the history of artificial intelligence. M2.7 is the first large language model explicitly designed to **deeply participate in its own model evolution**, signaling that AI systems are graduating from passive tools to actively self-improving entities.
The Self-Iteration Breakthrough: A Paradigm Shift
Traditional AI model improvement has always relied on a human-intensive cycle: engineers collect data, design experiments, tune parameters, and retrain. This dependency on human labor has long been the bottleneck constraining the speed of AI capability gains. MiniMax M2.7 breaks this paradigm.
According to MiniMax's official technical blog, M2.7 played a pivotal role in its own development. The engineering team had the model **autonomously update its own memory and construct dozens of complex Agent skills** within its harness to support reinforcement learning experiments. More critically, M2.7 was then tasked with **proactively refining its own learning processes and tool chains** based on experimental outcomes — forming a genuine model self-evolution loop.
The core logic of this loop: M2.7 builds and monitors its own reinforcement learning harness, identifies failure trajectories, plans improvements, modifies code, runs evaluations, compares results, and decides whether to keep or revert changes. In one internal test, M2.7 executed over **100 complete rounds** of this "analyze → improve → validate" cycle entirely autonomously, ultimately achieving a **30% performance gain** on internal evaluation sets.
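Conceptually, that cycle is a control loop over the model's own harness. The sketch below is a minimal illustration of one way the "analyze → improve → validate" loop could be structured; every method on the `harness` object (`run_evaluation`, `collect_failure_trajectories`, `propose_patch`, `apply_patch`, `revert_patch`) is a hypothetical stand-in for the tooling described in MiniMax's blog, not their actual API:

```python
# Minimal sketch of an autonomous "analyze -> improve -> validate" loop.
# All harness methods are hypothetical stand-ins for the agent harness
# described in the blog post; they are not MiniMax's actual tooling.

def self_iteration_loop(harness, max_rounds: int = 100) -> float:
    best_score = harness.run_evaluation()                    # baseline on the internal eval set
    for _ in range(max_rounds):
        failures = harness.collect_failure_trajectories()    # identify failure trajectories
        if not failures:
            break
        patch = harness.propose_patch(failures)              # model plans an improvement
        harness.apply_patch(patch)                            # modify harness code / skills
        score = harness.run_evaluation()                      # re-run the evaluation
        if score > best_score:                                # compare results
            best_score = score                                # keep the change
        else:
            harness.revert_patch(patch)                       # or revert it
    return best_score
```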
Autonomous Kaggle Competitions: From 50% to 74% Medal Rate
To stress-test M2.7's autonomous evolution capabilities, MiniMax designed a demanding challenge: entering M2.7 into **MLE-Bench Lite**, a set of 22 machine learning competitions open-sourced by OpenAI, each runnable on a single A30 GPU yet spanning virtually all stages of the ML workflow.
M2.7's approach was notable for its full autonomy. Without any human intervention, the model deployed a self-designed three-module agent architecture: **short-term memory, self-feedback, and self-optimization**. After each iteration round, it generated a memory markdown file, critiqued its own current results, and synthesized candidate optimization directions for the next round.
Over 24 hours of autonomous iteration, M2.7 accumulated **9 gold medals, 5 silver medals, and 1 bronze medal**, with its medal win rate climbing from approximately 50% to **nearly 74%**. The average medal rate across three separate 24-hour runs reached **66.6%**, ranking third among all models tested — behind only Opus-4.6 (75.7%) and GPT-5.4 (71.2%), and tied with Gemini-3.1 (66.6%).
Crucially, the entire process required zero human intervention. M2.7 autonomously analyzed competition rules, formulated strategies, adjusted model hyperparameters (temperature, frequency penalty, presence penalty, etc.), and continuously optimized its agent architecture across dozens of iterations.
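For a concrete picture, here is a minimal sketch of how the described three-module loop (short-term memory, self-feedback, self-optimization) and the sampling-parameter adjustments could be wired together. The `agent` methods, file layout, and plan format are illustrative assumptions, not the actual MLE-Bench Lite harness:

```python
from pathlib import Path

# Hypothetical sketch of the three-module iteration loop described above:
# short-term memory (a markdown file), self-feedback (a critique of results),
# and self-optimization (proposals for the next round, incl. sampling params).

sampling = {"temperature": 0.7, "frequency_penalty": 0.0, "presence_penalty": 0.0}

def iterate(agent, rounds: int, memory_path: Path = Path("memory.md")) -> None:
    for i in range(rounds):
        result = agent.run_competition_round(sampling)      # attempt the competition task
        critique = agent.self_criticize(result)             # self-feedback on the result
        plan = agent.propose_next_steps(result, critique)   # self-optimization directions
        # Short-term memory: append a markdown summary for the next round to read.
        with memory_path.open("a") as f:
            f.write(f"## Round {i}\nscore: {result['score']}\n"
                    f"critique: {critique}\nplan: {plan}\n\n")
        sampling.update(plan.get("sampling_updates", {}))    # e.g. adjust temperature
```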
Software Engineering: Approaching the Industry Frontier
M2.7's performance on software engineering benchmarks is equally striking.
On **SWE-Pro**, a multi-language, contamination-resistant benchmark of 1,865 tasks across 41 repositories, M2.7 scored **56.22%**, effectively tied with GPT-5.3 Codex (56.8%). On benchmarks closer to real-world engineering scenarios, its relative standing is even stronger: **SWE Multilingual** (76.5) and **Multi SWE Bench** (52.7).
On **VIBE-Pro**, a repo-level code generation benchmark for end-to-end project delivery, M2.7 scored **55.6%**, nearly on par with Claude Opus 4.6, suggesting that Web, Android, iOS, and simulation projects can be handed to M2.7 for independent completion. On **Terminal Bench 2** (57.0%) and **NL2Repo** (39.8%), both of which demand deep system-level comprehension, M2.7 also performs solidly, indicating that it doesn't just write code but genuinely understands software systems.
A particularly compelling real-world demonstration is **live production environment debugging**. When faced with production alerts, M2.7 can: correlate monitoring metrics with deployment timelines for causal reasoning, perform statistical analysis on trace sampling and propose precise hypotheses, proactively connect to databases to verify root causes, locate missing index migration files in code repositories, and even initiate emergency non-blocking index creation before filing a merge request. MiniMax reports that in multiple real production incidents, using M2.7 reduced recovery time to **under three minutes**.
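One concrete, widely used counterpart to "emergency non-blocking index creation" is PostgreSQL's `CREATE INDEX CONCURRENTLY`. The snippet below only illustrates what that mitigation step might look like; the connection string, table, and column names are invented for the example and are not taken from MiniMax's incident reports:

```python
import psycopg2

# PostgreSQL builds an index without blocking writes when CONCURRENTLY is used,
# but the statement cannot run inside a transaction block, hence autocommit.
conn = psycopg2.connect("dbname=orders_db user=oncall")   # hypothetical connection
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_created_at "
        "ON orders (created_at);"   # hypothetical missing index from the incident
    )
conn.close()
```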
Agent Teams: Native Multi-Agent Collaboration
One of M2.7's most significant advances in the agent ecosystem is its native support for **Agent Teams**: true multi-agent collaboration. Unlike pseudo-multi-agent systems cobbled together via prompt engineering, M2.7 has internalized multi-agent collaboration as a native model capability (a rough harness sketch follows the list below):
- **Stable role anchoring**: Maintains consistent role identity within complex state machines
- **Adversarial reasoning**: Proactively challenges teammates' logical and ethical blind spots
- **Protocol adherence**: Sustains stable instruction compliance across extended multi-turn interactions
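As a rough illustration of what stable role anchoring can mean at the harness level, the sketch below pins each agent to a fixed role prompt that is re-asserted on every turn. This is a generic multi-agent pattern with a stubbed model call, not MiniMax's Agent Teams API:

```python
from dataclasses import dataclass, field

def call_model(system: str, messages: list) -> str:
    # Placeholder for a real chat-completion call; returns a canned reply here.
    return f"[{system[:20]}...] ack: {messages[-1]['content'][:40]}"

@dataclass
class TeamAgent:
    name: str
    role_prompt: str                      # anchored role identity, re-sent every turn
    history: list = field(default_factory=list)

    def respond(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})
        reply = call_model(system=self.role_prompt, messages=self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

def run_team(agents: list[TeamAgent], task: str, turns: int = 6) -> str:
    message = task
    for t in range(turns):
        agent = agents[t % len(agents)]   # simple round-robin protocol
        message = agent.respond(message)  # e.g. a "critic" agent challenges the others
    return message
```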
On the **Toolathon** tool-use benchmark, M2.7 achieved **46.3% accuracy**, ranking in the global top tier. On MiniMax's internally developed **MM Claw** evaluation set — covering real-world scenarios including personal learning planning, office document processing, scheduled research and investment advice, and code development — M2.7 achieved **62.7% accuracy**, approaching Claude Sonnet 4.6's performance.
Perhaps most impressive: even when operating with over **40 complex skills simultaneously** (each skill description exceeding 2,000 tokens), M2.7 maintains a **97% skill adherence rate** — an extraordinary level of stability for high-density tool-use scenarios.
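MiniMax does not spell out how "skill adherence rate" is computed. One plausible reading, sketched below under that assumption, is the fraction of test cases in which the model invokes the expected skill with arguments that validate against the skill's schema; the registry contents and data layout are invented for illustration:

```python
from jsonschema import validate, ValidationError   # third-party: pip install jsonschema

# Hypothetical skill registry: in the scenario described above there would be
# 40+ entries, each with a 2,000+ token description and an argument schema.
SKILLS = {
    "create_calendar_event": {
        "description": "Create a calendar event ... (long description)",
        "schema": {
            "type": "object",
            "required": ["title", "start"],
            "properties": {"title": {"type": "string"}, "start": {"type": "string"}},
        },
    },
    # ... 40+ more skills ...
}

def adherence_rate(test_cases: list[dict]) -> float:
    """Fraction of cases where the expected skill was called with schema-valid args."""
    ok = 0
    for case in test_cases:
        call = case["model_call"]            # {"skill": ..., "arguments": {...}}
        if call["skill"] != case["expected_skill"]:
            continue
        try:
            validate(call["arguments"], SKILLS[call["skill"]]["schema"])
            ok += 1
        except ValidationError:
            pass
    return ok / len(test_cases) if test_cases else 0.0
```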
Professional Work: A Leap in Document Intelligence
In professional office software domains, M2.7 has made substantial gains. On **GDPval-AA** — which measures domain expertise and task delivery capability — M2.7 achieved an **ELO score of 1495** among 45 models, the highest among open-source models, surpassing GPT-5.3, and ranking just behind Opus 4.6, Sonnet 4.6, and GPT-5.4.
M2.7's document processing capabilities for Word, Excel, and PPT have been systematically optimized. The model can both generate files directly from templates and follow interactive user instructions to perform multiple rounds of high-fidelity editing on existing files, delivering polished, directly-usable outputs.
In financial analysis scenarios, M2.7 can autonomously: read company annual reports and earnings call transcripts, cross-reference multiple research reports, independently design assumptions and construct revenue forecast models, and finally generate a PPT and Word research report from templates — "understanding, judging, and producing like a junior analyst, with self-correction through multiple rounds of interaction." MiniMax notes that the output quality is already sufficient to serve as a first draft in actual professional workflows.
Head-to-Head: M2.7 vs. GPT-5.x and Claude
Here's how M2.7 stacks up against the industry's top models across key benchmarks:
| Benchmark | M2.7 | GPT-5.3 Codex | Claude Opus 4.6 | Gemini-3.1 |
|---|---|---|---|---|
| SWE-Pro | 56.22% | 56.8% | ~57% | — |
| VIBE-Pro | 55.6% | — | ~56% | — |
| Terminal Bench 2 | 57.0% | 77.3% | — | — |
| GDPval-AA ELO | 1495 | — | >1495 | — |
| MLE-Bench Lite (Avg. Medal Rate) | 66.6% | — | 75.7% | 66.6% |
| MM Claw | 62.7% | — | ~65% (Sonnet 4.6) | — |
| Toolathon | 46.3% | — | — | — |
The overall picture: M2.7 has entered the industry's top tier across software engineering, multi-turn agent tasks, and professional document processing. It matches GPT-5.3 Codex on coding benchmarks and trails Claude Opus 4.6 only marginally on most tasks. However, M2.7's unique self-iteration capabilities give it a dimension of value that raw benchmark scores don't fully capture.
The Agent Ecosystem: OpenClaw and Beyond
M2.7's release coincides with what MiniMax describes as "the recent surge in popularity of OpenClaw," which it presents as representative of a thriving agent ecosystem. MiniMax notes that its M2-series models have contributed to this community's growth.
The **MM Claw** evaluation set was specifically built around common OpenClaw use cases, covering a wide spectrum of real-world needs. M2.7's 62.7% accuracy on this benchmark signals that the model is purpose-built for the kinds of complex, multi-step agentic tasks that define modern personal AI assistants.
MiniMax also unveiled **OpenRoom**, a preliminary demo of an interaction system that moves AI interaction out of plain text streams and into an interactive Web GUI space. In OpenRoom, character settings are no longer static prompt blocks: conversation drives the experience, generating real-time visual feedback and scene interactions, and characters proactively engage with their environment. Notably, most of the OpenRoom code was written by M2.7 itself.
The Road to Recursive Self-Improvement
MiniMax's technical blog explicitly articulates its vision for the future: "AI self-evolution will gradually transition towards full autonomy, coordinating data construction, model training, inference architecture, evaluation, and other stages without human involvement."
M2.7 is described as "early echoes" of this vision. While it cannot yet train its own successor fully autonomously, it can already handle, at the agent-harness layer, large portions of the iteration work that previously required human engineers. Within MiniMax's own organization, M2.7 handles **30%-50%** of the RL team's daily workflow, autonomously collecting feedback, building evaluation sets, and optimizing its own skills and memory mechanisms.
The RL team's daily workflow, as described by MiniMax, illustrates the scope of this change: a researcher proposes an experimental idea; M2.7 conducts literature review, tracks experiment specs, pipelines data, launches experiments, monitors progress, automatically reads logs, debugs, analyzes metrics, fixes code, submits merge requests, and runs smoke tests — with human researchers only engaging for critical decisions and strategic discussions.
What It Means for the AI Industry
M2.7's release carries implications at multiple levels:
- **Technical**: First production-grade model to demonstrate meaningful autonomous self-iteration at scale, validating a viable path toward AI self-improvement
- **Capability**: Achieves top-tier performance across software engineering, multi-agent collaboration, and professional office tasks, confirming that Chinese AI labs are competing head-to-head with OpenAI and Anthropic
- **Ecosystem**: Native Agent Teams support, high-density skill handling, and multi-turn high-fidelity editing provide a substantially more capable foundation for agent application development
- **Strategic**: Signals a corporate transformation; MiniMax's own R&D processes are now deeply dependent on M2.7, accelerating the company's evolution into a genuinely AI-native organization
MiniMax M2.7 is now fully available on the MiniMax Agent platform and MiniMax API Platform. The model represents a genuine inflection point — the moment when AI stopped being purely a product of human engineering and began contributing meaningfully to its own development. The recursive loop has begun.