Today, we’re proud to introduce Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI tokens. Maia 200 is an AI inference powerhouse: an accelerator built on TSMC’s 3nm process with native FP8/FP4 tensor cores, a redesigned memory system with 216GB of HBM3e at 7 TB/s and 272MB of on-chip SRAM, plus data movement engines that keep large models fed, fast and highly utilized. This makes Maia 200 the most performant first-party silicon from any hyperscaler, with three times the FP4 performance of the third-generation Amazon Trainium and FP8 performance above Google’s seventh-generation TPU. Maia 200 is also the most efficient inference system Microsoft has ever deployed, with 30% better performance per dollar than the latest generation of hardware in our fleet today.
Maia 200 is part of our heterogeneous AI infrastructure and will serve multiple models, including the latest GPT-5.2 models from OpenAI, bringing a performance-per-dollar advantage to Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use Maia 200 for synthetic data generation and reinforcement learning to improve next-generation in-house models. For synthetic data pipeline use cases, Maia 200’s distinctive design helps accelerate the rate at which high-quality, domain-specific data can be generated and filtered, feeding downstream training with fresher, more targeted signals.
Maia 200 is deployed in our US Central datacenter region near Des Moines, Iowa, with the US West 3 datacenter region near Phoenix, Arizona, coming next and future regions to follow. Maia 200 integrates seamlessly with Azure, and we’re previewing the Maia SDK, a complete set of tools to build and optimize models for Maia 200, including PyTorch integration, a Triton compiler with an optimized kernel library, and access to Maia’s low-level programming language. This gives developers fine-grained control when needed while enabling easy model porting across heterogeneous hardware accelerators.
Engineered for AI inference
Fabricated on TSMC’s cutting-edge 3-nanometer process, each Maia 200 chip contains over 140 billion transistors and is tailored for large-scale AI workloads while also delivering efficient performance per dollar. On both fronts, Maia 200 is built to excel. It’s designed for the latest models using low-precision compute, with each Maia 200 chip delivering over 10 petaFLOPS in 4-bit precision (FP4) and over 5 petaFLOPS of 8-bit (FP8) performance, all within a 750W SoC TDP envelope. In practical terms, Maia 200 can effortlessly run today’s largest models, with plenty of headroom for even bigger models in the future.
Crucially, FLOPS aren’t the only ingredient for faster AI. Feeding data is equally important. Maia 200 attacks this bottleneck with a redesigned memory subsystem centered on narrow-precision datatypes, a specialized DMA engine, on-die SRAM and a dedicated NoC fabric for high-bandwidth data movement, increasing token throughput.
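To see why the memory subsystem matters as much as the tensor cores, a rough roofline-style estimate using only the headline figures above is instructive. This is a simplified sketch, not a published Maia 200 performance model; real kernels also interact with SRAM, the NoC and scheduling.

```python
# Rough roofline sketch from the headline Maia 200 figures above.
# Simplified: ignores on-die SRAM reuse, NoC effects and scheduling.

fp4_flops = 10e15  # >10 petaFLOPS of FP4 compute per chip
hbm_bw = 7e12      # 7 TB/s of HBM3e bandwidth (bytes/s)

# Arithmetic intensity (FLOPs per byte of HBM traffic) a kernel needs
# before it becomes compute-bound rather than memory-bound.
ridge = fp4_flops / hbm_bw
print(f"ridge point: ~{ridge:.0f} FLOPs/byte")

# Memory-bound example: decoding one token of a hypothetical
# 200B-parameter model at FP4 (0.5 bytes/param) must stream all
# weights from HBM at least once per token.
weight_bytes = 200e9 * 0.5
min_ms_per_token = weight_bytes / hbm_bw * 1e3
print(f"weight-streaming lower bound: {min_ms_per_token:.1f} ms/token")
```

The ridge point shows that token decoding, which has low arithmetic intensity, lives firmly on the bandwidth side of the roofline, which is why the 7 TB/s HBM3e and on-die SRAM figure so prominently in the design.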
Optimized AI systems
At the systems level, Maia 200 introduces a novel two-tier scale-up network design built on standard Ethernet. A custom transport layer and a tightly integrated NIC unlock performance, strong reliability and significant cost advantages without relying on proprietary fabrics.
Each accelerator exposes:
- 2.8 TB/s of bidirectional, dedicated scale-up bandwidth
- Predictable, high-performance collective operations across clusters of up to 6,144 accelerators
This architecture delivers scalable performance for dense inference clusters while reducing power usage and overall TCO across Azure’s global fleet.
Within each tray, four Maia accelerators are fully connected with direct, non-switched links, keeping high-bandwidth communication local for optimal inference efficiency. The same communication protocols are used for intra-rack and inter-rack networking via the Maia AI transport protocol, enabling seamless scaling across nodes, racks and clusters of accelerators with minimal network hops. This unified fabric simplifies programming, improves workload flexibility and reduces stranded capacity while maintaining consistent performance and cost efficiency at cloud scale.
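A back-of-envelope estimate illustrates what the 2.8 TB/s bidirectional scale-up figure means for collectives. The sketch below uses the textbook ring all-reduce cost model, not Microsoft’s actual collective implementation, and assumes half the bidirectional figure is available per direction.

```python
# Collective-cost sketch using the scale-up bandwidth figure above.
# Textbook ring all-reduce model; actual Maia collectives may differ.

def ring_allreduce_seconds(payload_bytes: float, n: int, link_bw: float) -> float:
    """Bandwidth term of a ring all-reduce: each of n ranks sends
    2*(n-1)/n * payload over its link (latency term ignored)."""
    return 2 * (n - 1) / n * payload_bytes / link_bw

link_bw = 1.4e12  # assume 1.4 TB/s per direction of the 2.8 TB/s figure
t = ring_allreduce_seconds(1e9, 4, link_bw)  # 1 GB across a 4-chip tray
print(f"~{t * 1e6:.0f} us to all-reduce 1 GB across 4 accelerators")
```

Even this simplified model shows why keeping tray-local traffic on direct, non-switched links pays off: the bandwidth term dominates at these payload sizes, so per-link throughput translates almost directly into collective latency.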
A cloud-native development approach
A core principle of Microsoft’s silicon development programs is to validate as much of the end-to-end system as possible ahead of final silicon availability.
An advanced pre-silicon environment guided the Maia 200 architecture from its earliest stages, modeling the computation and communication patterns of LLMs with high fidelity. This early co-development environment enabled us to optimize silicon, networking and system software as a unified whole, long before first silicon.
We also designed Maia 200 for fast, seamless availability in the datacenter from the start, building out early validation of some of the most complex system components, including the backend network and our second-generation, closed-loop, liquid-cooling Heat Exchanger Unit. Native integration with the Azure control plane delivers security, telemetry, diagnostics and management capabilities at both the chip and rack levels, maximizing reliability and uptime for production-critical AI workloads.
Thanks to these investments, AI models were running on Maia 200 silicon within days of the first packaged parts arriving. Time from first silicon to first datacenter rack deployment was reduced to less than half that of comparable AI infrastructure programs. And this end-to-end approach, from chip to software to datacenter, translates directly into higher utilization, faster time to production and sustained improvements in performance per dollar and per watt at cloud scale.
Sign up for the Maia SDK preview
The era of large-scale AI is just beginning, and infrastructure will define what’s possible. Our Maia AI accelerator program is designed to be multi-generational. As we deploy Maia 200 across our global infrastructure, we’re already designing for future generations and expect each generation to continually set new benchmarks for what’s possible and deliver ever greater performance and efficiency for critical AI workloads.
Today, we’re inviting developers, AI startups and academics to begin exploring early model and workload optimization with the new Maia 200 software development kit (SDK). The SDK includes a Triton compiler, support for PyTorch, low-level programming in NPL, and a Maia simulator and cost calculator to optimize for efficiency earlier in the code lifecycle. Sign up for the preview here.
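As a purely hypothetical sketch of the kind of estimate a cost calculator like the one mentioned above can produce, consider an energy-only tokens-per-dollar model. The function name, parameters and rates below are illustrative assumptions, not the SDK’s API.

```python
# Hypothetical tokens-per-dollar estimate, in the spirit of a cost
# calculator. Names and rates here are illustrative, not from the SDK.

def tokens_per_dollar(tokens_per_sec: float,
                      power_watts: float,
                      price_per_kwh: float = 0.10) -> float:
    """Illustrative energy-only model: ignores capex, cooling, networking."""
    kw = power_watts / 1000.0
    cost_per_sec = kw * price_per_kwh / 3600.0  # dollars of energy per second
    return tokens_per_sec / cost_per_sec

# Example: an assumed 5,000 tokens/s workload at the 750W SoC TDP above.
est = tokens_per_dollar(tokens_per_sec=5000, power_watts=750)
print(f"~{est:,.0f} tokens per dollar of energy")
```

Running this kind of what-if analysis against a simulator before deployment is precisely the point of optimizing "earlier in the code lifecycle": throughput and power assumptions can be varied long before a kernel ever touches hardware.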
Get more photos, video and resources on our Maia 200 site and read more details.
Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft’s cloud computing platform, generative AI solutions, data platforms, and information and cybersecurity. These platforms and services help organizations worldwide solve urgent challenges and drive long-term transformation.



