Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to advance innovation.
In the transition from building computing infrastructure for cloud scale to building cloud and AI infrastructure for frontier scale, the world of computing has experienced tectonic shifts in innovation. Throughout this journey, Microsoft has shared its learnings and best practices, optimizing our cloud infrastructure stack in cross-industry forums such as the Open Compute Project (OCP) Global Foundation.
Today, we see that the next phase of cloud infrastructure innovation is poised to be the most consequential period of transformation yet. In just the last year, Microsoft has added more than 2 gigawatts of new capacity and launched the world's most powerful AI datacenter, which delivers 10x the performance of today's fastest supercomputer. Yet, this is just the beginning.
Delivering AI infrastructure at the highest performance and lowest cost requires a systems approach, with optimizations across the stack to drive quality, speed, and resiliency at a level that can provide a consistent experience to our customers. In the quest to deliver resilient, sustainable, secure, and broadly scalable technology to handle the breadth of AI workloads, we are embarking on an ambitious new journey: one not just of redefining infrastructure innovation at every layer of execution from silicon to systems, but one of tightly integrated industry alignment on standards that offer a model for global interoperability and standardization.
At this year's OCP Global Summit, Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to further advance innovation in the industry.
Redefining power distribution for the AI era
As AI workloads scale globally, hyperscale datacenters are experiencing unprecedented power density and distribution challenges.
Last year, at the OCP Global Summit, we partnered with Meta and Google on the development of Mt. Diablo, a disaggregated power architecture. This year, we are building on this innovation with the next step of our full-stack transformation of datacenter power systems: solid-state transformers. Solid-state transformers simplify the power chain with new conversion technologies and protection schemes that can accommodate future rack voltage requirements.
Training large models across thousands of GPUs also introduces variable and intense power draw patterns that can strain the grid, the utility, and traditional power delivery systems. These fluctuations not only risk hardware reliability and operational efficiency but also create challenges for capacity planning and sustainability goals.
Together with key industry partners, Microsoft is leading a power stabilization initiative to address this challenge. In a recently published paper with OpenAI and NVIDIA, "Power Stabilization for AI Training Datacenters," we address how full-stack innovations spanning rack-level hardware, firmware orchestration, predictive telemetry, and facility integration can smooth power spikes, reduce power overshoot by 40%, and mitigate operational risk and costs to enable predictable, scalable power delivery for AI training clusters.
This year, at the OCP Global Summit, Microsoft is joining forces with industry partners to launch a dedicated power stabilization workgroup. Our goal is to foster open collaboration across hyperscalers and hardware partners, sharing our learnings from full-stack innovation and inviting the community to co-develop new methodologies that address the unique power challenges of AI training datacenters. By building on the insights from our recently published white paper, we aim to accelerate industry-wide adoption of resilient, scalable power delivery solutions for the next generation of AI infrastructure. Read more about our power stabilization efforts.
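To make the power-smoothing idea concrete, here is a minimal sketch of one mechanism a rack-level controller could use: ramp-rate limiting against a facility cap, so that bursty training-step power requests are stepped up and down gradually. All names, thresholds, and the workload shape below are illustrative assumptions, not the design from the published paper.

```python
# Illustrative sketch of rack-level power ramp limiting for AI training
# bursts. All values (cap, ramp step, workload profile) are hypothetical
# assumptions for demonstration only.

def limit_ramp(requested_kw, prev_kw, max_step_kw, cap_kw):
    """Clamp the next power setpoint so it moves at most max_step_kw
    from the previous value and never exceeds the facility cap."""
    step = max(-max_step_kw, min(max_step_kw, requested_kw - prev_kw))
    return min(cap_kw, prev_kw + step)

def smooth_profile(profile_kw, max_step_kw=50.0, cap_kw=900.0):
    """Apply the ramp limiter across a time series of requested draws."""
    out, prev = [], profile_kw[0]
    for req in profile_kw:
        prev = limit_ramp(req, prev, max_step_kw, cap_kw)
        out.append(prev)
    return out

if __name__ == "__main__":
    # Synthetic training loop: compute bursts alternating with sync stalls.
    requested = [300, 950, 950, 200, 950, 950, 200, 950]
    delivered = smooth_profile(requested)
    print(delivered)
```

In this toy run the raw request swings by up to 750 kW between steps, while the delivered profile never moves more than 50 kW at a time; the real systems described in the paper combine firmware orchestration and predictive telemetry rather than a simple limiter like this.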
Cooling innovations for resiliency
As the power profile for AI infrastructure changes, we are also continuing to rearchitect our cooling infrastructure to support evolving needs around energy consumption, space optimization, and overall datacenter sustainability. Diverse cooling solutions must be implemented to support the scale of our expansion: as we seek to build new AI-scale datacenters, we are also using Heat Exchanger Unit (HXU)-based liquid cooling to rapidly deploy new AI capacity within our existing air-cooled datacenter footprint.
Microsoft's next-generation HXU is an upcoming OCP contribution that enables liquid cooling for high-performance AI systems in air-cooled datacenters, supporting global scalability and rapid deployment. The modular HXU design delivers 2X the performance of current models and maintains >99.9% cooling service availability for AI workloads. No datacenter modifications are required, allowing seamless integration and expansion. Learn more about the next-generation HXU here.
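As a quick back-of-envelope check on what the >99.9% availability figure permits, availability converts to an annual downtime budget; the arithmetic below is generic and only the 99.9% target comes from the HXU description above.

```python
# Back-of-envelope conversion from an availability target to the
# annual downtime it allows. Only the 99.9% figure comes from the
# text; the arithmetic is generic availability math.

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def max_downtime_hours(availability):
    """Annual downtime budget for a given availability fraction."""
    return HOURS_PER_YEAR * (1.0 - availability)

if __name__ == "__main__":
    print(f"{max_downtime_hours(0.999):.2f} hours/year")
```

At 99.9% the budget is under nine hours of cooling interruption per year, which is why the design pairs modularity with in-place serviceability.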
Meanwhile, we are continuing to innovate across multiple layers of the stack to address changes in power and heat dissipation: using facility water cooling at datacenter scale, circulating liquid in closed loops from server to chiller; and exploring on-chip cooling innovations like microfluidics to efficiently remove heat directly from the silicon.
Unified networking solutions for growing infrastructure demands
Scaling hundreds of thousands of GPUs to operate as a single, coherent system brings significant challenges in creating rack-scale interconnects that can deliver low-latency, high-bandwidth fabrics that are both efficient and interoperable. As AI workloads grow exponentially and infrastructure demands intensify, we are exploring networking optimizations that can support these needs. To that end, we have developed solutions leveraging scale-up, scale-out, and Wide Area Network (WAN) technologies to enable large-scale distributed training.
We partner closely with standards bodies like UEC (Ultra Ethernet Consortium) and UALink, focused on innovation in networking technologies for this critical element of AI systems. We are also driving forward adoption of Ethernet for scale-up networking across the ecosystem and are excited to see the Ethernet for Scale-Up Networking (ESUN) workstream launch under the OCP Networking Project. We look forward to promoting adoption of cutting-edge networking solutions and enabling a multi-vendor ecosystem based on open standards.
Security, sustainability, and quality: Fundamental pillars for resilient AI operations
Defense in depth: Trust at every layer
Our comprehensive approach to scaling AI systems responsibly includes embedding trust and security into every layer of our platform. This year, we are introducing new security contributions that build on our existing body of work in hardware security and introduce new protocols uniquely fit to support the new scientific breakthroughs that have been accelerated by the introduction of AI:
- Building on previous years' contributions and Microsoft's collaboration with AMD, Google, and NVIDIA, we have further enhanced Caliptra, our open-source silicon root of trust. The introduction of Caliptra 2.1 extends the hardware root of trust to a full security subsystem. Learn more about Caliptra 2.1 here.
- We have also added Adams Bridge 2.0 to Caliptra, extending support for quantum-resilient cryptographic algorithms to the root of trust.
- Finally, we are contributing OCP Layered Open-source Cryptographic Key Management (L.O.C.K), a key management block for storage devices that secures media encryption keys in hardware. L.O.C.K was developed through collaboration between Google, Kioxia, Microsoft, Samsung, and Solidigm.
Advancing datacenter-scale sustainability
Sustainability continues to be a major area of opportunity for industry collaboration and standardization through communities such as the Open Compute Project. Working collaboratively as an ecosystem of hyperscalers and hardware partners is one catalyst to address the need for sustainable datacenter infrastructure that can effectively scale as compute demands continue to evolve. This year, we are pleased to continue our collaborations as part of OCP's Sustainability workgroup across areas such as carbon reporting, accounting, and circularity:
- Announced at this year's Global Summit, we are partnering with AWS, Google, and Meta to fund the Product Category Rule initiative under the OCP Sustainability workgroup, with the goal of standardizing carbon measurement methodology for devices and datacenter equipment.
- Together with Google, Meta, OCP, Schneider Electric, and the iMasons Climate Accord, we are establishing the Embodied Carbon Disclosure Base Specification to create a common framework for reporting the carbon impact of datacenter equipment.
- Microsoft is advancing the adoption of waste heat reuse (WHR). In partnership with the NetZero Innovation Hub, NREL, and EU and US collaborators, Microsoft has published heat reuse reference designs and is developing an economic modeling tool that gives datacenter operators and waste heat offtakers the cost of developing WHR infrastructure based on conditions such as the size and capacity of the WHR system, season, location, and the WHR mandates and subsidies in place. These region-specific solutions help operators convert excess heat into usable energy, meeting regulatory requirements and unlocking new capacity, especially in regions like Europe where heat reuse is becoming mandatory.
- We have developed an open methodology for Life Cycle Assessment (LCA) at scale across large IT hardware fleets to drive toward a "gold standard" in sustainable cloud infrastructure.
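The WHR economic modeling tool mentioned above is not public; the sketch below illustrates the general shape such a cost model might take, amortizing capital expenditure net of subsidies and adding operating cost scaled by utilization. Every field name, price, and number here is a hypothetical assumption for illustration, not data from Microsoft's actual tool.

```python
# Hypothetical sketch of a waste-heat-reuse (WHR) annual cost estimate.
# All parameter names and values are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class WhrScenario:
    capacity_mw_thermal: float   # recoverable heat capacity
    capex_per_mw: float          # piping, heat pumps, grid connection
    opex_per_mwh: float          # pumping and maintenance cost
    utilization: float           # fraction of the year heat is offtaken
    subsidy_fraction: float      # capex covered by local WHR subsidies

def annual_cost(s: WhrScenario, amortization_years: int = 15) -> float:
    """Amortized capex (net of subsidies) plus yearly operating cost."""
    capex = s.capacity_mw_thermal * s.capex_per_mw * (1 - s.subsidy_fraction)
    mwh_delivered = s.capacity_mw_thermal * 8760 * s.utilization
    return capex / amortization_years + mwh_delivered * s.opex_per_mwh

if __name__ == "__main__":
    # Invented example: a district-heating offtake with a 30% subsidy.
    scenario = WhrScenario(capacity_mw_thermal=20, capex_per_mw=1_500_000,
                           opex_per_mwh=5.0, utilization=0.6,
                           subsidy_fraction=0.3)
    print(f"{annual_cost(scenario):,.0f} per year")
```

A model like this makes the text's point concrete: season and location drive `utilization`, while regional mandates and subsidies shift `subsidy_fraction`, so the same plant can pencil out very differently across regions.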
Rethinking node management: Fleet operational resiliency for the frontier era
As AI infrastructure scales at an unprecedented pace, Microsoft is investing in standardizing how diverse compute nodes are deployed, updated, monitored, and serviced across hyperscale datacenters. In collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, we are driving a series of Open Compute Project (OCP) contributions focused on streamlining fleet operations, unifying firmware management and manageability interfaces, and enhancing diagnostics, debug, and RAS (Reliability, Availability, and Serviceability) capabilities. This standardized approach to lifecycle management lays the foundation for consistent, scalable node operations across this period of rapid expansion. Read more about our approach to resilient fleet operations.
Paving the way for frontier-scale AI computing
As we enter a new era of frontier-scale AI development, Microsoft takes pride in leading the advancement of standards that will drive the future of globally deployable AI supercomputing. Our commitment is reflected in our active role in shaping the ecosystem that enables scalable, secure, and reliable AI infrastructure across the globe. We invite attendees of this year's OCP Global Summit to connect with Microsoft at booth #B53 to discover our latest cloud hardware demonstrations. These demonstrations showcase our ongoing collaborations with partners throughout the OCP community, highlighting innovations that support the evolution of AI and cloud technologies.
Connect with Microsoft at the OCP Global Summit 2025 and beyond