QianLong · Chinese-language AI content and tools platform

In the high-stakes race of artificial intelligence, "clean data" has become the ultimate marketing flex. When Microsoft unveiled its new MAI series of large language models, the company made a point of highlighting that the models were built end-to-end using "enterprise grade, clean and commercially licensed data."

For a brief moment, it appeared the tech giant had achieved the impossible: training a cutting-edge AI without resorting to the controversial practice of mass-scraping the public internet. But a closer look at the fine print reveals that the industry's definition of "clean" might be very different from what the public imagines.

The two models in question are undeniable technical marvels. MAI-Thinking-1 is a reasoning powerhouse boasting 1 trillion total parameters, though it runs efficiently by activating only 35 billion of them at a time through a Mixture of Experts (MoE) architecture. Its sibling, MAI-Code-1-Flash, is a 137-billion-parameter model designed specifically to power GitHub Copilot with lower latency and cost.
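To put the MoE figures in proportion, here is a minimal arithmetic sketch in Python. It uses only the two parameter counts quoted above; the function name `active_fraction` is a hypothetical helper introduced for illustration, not anything from Microsoft's paper.

```python
# Figures quoted in the article for MAI-Thinking-1.
TOTAL_PARAMS = 1_000_000_000_000   # 1 trillion total parameters
ACTIVE_PARAMS = 35_000_000_000     # 35 billion parameters active at a time


def active_fraction(active: int, total: int) -> float:
    """Return the share of a model's parameters that are active at a time."""
    if total <= 0:
        raise ValueError("total must be positive")
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active share of parameters: {fraction:.1%}")
```

In other words, by the article's numbers only a few percent of the model's weights take part in any given step, which is where the efficiency claim comes from.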

The impressive efficiency of these models, however, is overshadowed by the reality of their origin. Deep within the MAI-Thinking-1 technical paper, the curtain is pulled back on the "appropriately licensed" training corpus. Rather than a pristine, walled-garden dataset, the foundation of the model is a massive crawl of the open web.

Microsoft’s proprietary web crawler initially ingested a staggering 1.2 trillion pages. The "cleaning" process involved applying blocklists to remove adult content and piracy-related domains, and deploying proprietary detection models to scrub out websites heavily populated by AI-generated text. After this aggressive filtering, the corpus was whittled down to 794 billion pages. An additional 24.2 billion pages were sourced from the widely used, openly available Common Crawl dataset.
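The two filtering stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Microsoft's pipeline: the blocklist entries, the `ai_text_score` field, and the 0.8 threshold are all hypothetical stand-ins, since the article names only the categories of filter. The arithmetic at the end uses the page counts quoted in the article and assumes the Common Crawl pages are simply added to the filtered crawl.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical blocklist; the article only says adult and piracy domains were blocked.
BLOCKED_DOMAINS = {"adult.example", "piracy.example"}
# Assumed cutoff for a hypothetical AI-text detector score in [0, 1].
AI_TEXT_THRESHOLD = 0.8


@dataclass
class Page:
    url: str
    ai_text_score: float  # stand-in for the output of a proprietary detector


def keep_page(page: Page) -> bool:
    """Apply the domain blocklist first, then the AI-generated-text check."""
    host = urlparse(page.url).hostname or ""
    if host in BLOCKED_DOMAINS:
        return False
    return page.ai_text_score < AI_TEXT_THRESHOLD


def filter_pages(pages: list[Page]) -> list[Page]:
    """Keep only the pages that pass both filters."""
    return [p for p in pages if keep_page(p)]


# Page counts quoted in the article.
CRAWLED_PAGES = 1_200_000_000_000      # 1.2 trillion pages crawled
KEPT_PAGES = 794_000_000_000           # 794 billion pages after filtering
COMMON_CRAWL_PAGES = 24_200_000_000    # 24.2 billion pages from Common Crawl

retention = KEPT_PAGES / CRAWLED_PAGES
combined = KEPT_PAGES + COMMON_CRAWL_PAGES  # assumes the two sets are simply summed

print(f"Share of crawled pages kept: {retention:.1%}")
print(f"Combined page count: {combined:,}")
```

By these figures roughly two thirds of the crawl survived filtering, so "clean" here describes what was removed from a web-scale scrape rather than a different kind of source.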

This revelation highlights a persistent tension in generative AI. Tech companies are increasingly eager to distance themselves from the legal and ethical gray areas of web scraping. Yet, the engineering reality dictates that building a state-of-the-art model still requires vacuuming up vast swaths of the open internet.

Ultimately, Microsoft’s approach demonstrates sophisticated data curation rather than a fundamental shift in data sourcing. Until the legal frameworks surrounding AI training are definitively settled, the internet remains the indispensable, albeit highly contested, fuel for the AI revolution.

Key Points

Microsoft launched MAI-Thinking-1 (1T parameters) and MAI-Code-1-Flash (137B parameters).
MAI-Thinking-1 uses a Mixture of Experts (MoE) architecture, activating only 35 billion of its 1 trillion parameters at a time.
Despite claims of using 'commercially licensed' and 'clean' data, the models were trained on a massive web crawl.
The 1.2 trillion crawled pages were heavily filtered to remove piracy, adult content, and AI-generated text, leaving 794 billion pages.

Why It Matters

The gap between corporate marketing and engineering reality reveals that even the most advanced AI models still fundamentally rely on scraping the public internet, keeping the debate over copyright and data ethics wide open.

Sources:

Microsoft's new MAI models — Simon Willison's Weblog


QianLong Editorial Team · 2026/6/7

The 'Clean Data' Illusion Behind Microsoft's New MAI Models
