
Perspective on Optical Transceivers and AI Networking


In recent years, Arista has been actively advocating for Linear Pluggable Optics (LPO), a low-power pluggable approach, in AI networking. At this year’s ECOC, Arista presented its continued support for LPO as a key enabler for addressing the challenges of AI infrastructure. Below is an overview of Arista’s perspective on the role of optical transceivers, along with high-speed AOC (Active Optical Cable) and DAC (Direct Attach Copper) products, in AI networking environments where data rates range from 10G to 800G and beyond.


Linear Pluggable Optics (LPO) in AI networking


The Growing Need for High-Speed Interconnects in AI Networking


As AI models grow more complex, the computational demand on AI accelerators such as GPUs is increasing at an unprecedented pace. While the performance of individual AI compute chips improves only at a limited rate, training data volumes and model sizes are expanding rapidly, so workloads must be spread across large GPU clusters linked by high-capacity networks. This is where transceivers, AOCs, and DAC solutions with data rates from 10G to 800G come into play, providing the high-speed, high-bandwidth interconnects that intensive AI workloads require.


The computational demand on AI accelerators

Currently, NVIDIA’s 72-GPU NVLink-based backplane switch uses DAC (Direct Attach Copper) cables for connectivity between racks. DAC was chosen because this design only needs to interconnect two cabinets. Copper cables are attractive here because of their low power consumption and high reliability, but they have a critical limitation: short reach. At 112 Gbps per lane, DAC can reliably transmit signals only up to about 2 meters.


As per-lane data rates increase to 224 Gbps and 448 Gbps, DAC reach will shrink even further, making optical solutions a necessity. Unlike copper, optical fiber does not suffer from the skin effect, which confines high-frequency current to a thin layer near the conductor surface and drives up copper’s loss as frequency rises. This makes optical technologies such as transceivers and OSFP AOCs far better suited to high-speed, longer-reach links in advanced AI networks.
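To put numbers on the skin effect: at high frequency, current in a copper conductor crowds into a surface layer of depth δ = √(ρ / (π·f·μ)), so AC resistance per meter grows roughly with √f. The Python sketch below evaluates δ at the Nyquist frequency of 112G, 224G, and 448G PAM4 lanes. It assumes bulk copper resistivity of 1.68×10⁻⁸ Ω·m, permeability μ₀, and a Nyquist frequency equal to half the symbol rate; the function names are illustrative only, and the sketch ignores dielectric loss, which in real cables rises even faster with frequency.

```python
import math

RHO_CU = 1.68e-8          # assumed resistivity of bulk copper at ~20 C, ohm*m
MU_0 = 4e-7 * math.pi     # permeability of free space, H/m (copper is ~non-magnetic)


def pam4_nyquist_hz(lane_gbps: float) -> float:
    """Nyquist frequency of a PAM4 lane: 2 bits/symbol, Nyquist = baud / 2."""
    baud = lane_gbps * 1e9 / 2
    return baud / 2


def skin_depth_m(freq_hz: float, rho: float = RHO_CU, mu: float = MU_0) -> float:
    """Classical skin depth: delta = sqrt(rho / (pi * f * mu))."""
    return math.sqrt(rho / (math.pi * freq_hz * mu))


if __name__ == "__main__":
    ref = skin_depth_m(pam4_nyquist_hz(112))
    for lane in (112, 224, 448):
        f = pam4_nyquist_hz(lane)
        d = skin_depth_m(f)
        # AC resistance per unit length scales roughly with 1 / skin depth
        print(f"{lane}G PAM4: Nyquist {f / 1e9:.0f} GHz, "
              f"skin depth {d * 1e6:.2f} um, relative R_ac x{ref / d:.2f}")
```

Quadrupling the lane rate from 112G to 448G halves the skin depth (from about 0.39 µm to about 0.19 µm), roughly doubling conductor resistance per meter, which is one reason passive copper reach keeps shrinking as lane rates climb.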


The Shift from Copper to Optical Transceivers and OSFP AOCs


Looking ahead to 448 Gbps per-lane interconnects, Arista suggests that optical transceivers and OSFP AOCs could continue to use PAM4 modulation, whereas copper would likely need to move to PAM6 to preserve signal integrity. Because PAM6 carries more bits per symbol, it runs at a lower symbol rate, and therefore a lower frequency, than PAM4 at the same bit rate, which reduces signal loss in copper cables; the trade-off is that its six closely spaced amplitude levels leave less noise margin. As AI data rates and inter-rack distances increase, copper’s limitations become more apparent, making the shift to optical transceivers and high-performance OSFP AOCs inevitable.
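The PAM4 versus PAM6 trade-off can be quantified with two simple relations: symbol rate = bit rate / log₂(M), and, at equal peak-to-peak swing, each of the M - 1 eyes gets 1/(M - 1) of the amplitude. The sketch below assumes an ideal log₂(6) ≈ 2.585 bits per PAM6 symbol (practical bit-to-symbol mappings carry slightly less) and ignores FEC and coding overhead; the helper names are hypothetical.

```python
import math


def symbol_rate_gbaud(lane_gbps: float, levels: int) -> float:
    """Ideal symbol rate for M-level PAM: bits per symbol = log2(M)."""
    return lane_gbps / math.log2(levels)


def eye_penalty_db(levels: int, ref_levels: int = 4) -> float:
    """Vertical eye-opening penalty relative to a reference PAM order.

    At equal peak swing, M levels split the swing into (M - 1) eyes,
    so each eye is 1 / (M - 1) of the swing.
    """
    return 20 * math.log10((levels - 1) / (ref_levels - 1))


if __name__ == "__main__":
    lane = 448  # Gb/s per lane
    for m in (4, 6):
        baud = symbol_rate_gbaud(lane, m)
        print(f"PAM{m}: {baud:.1f} GBd, Nyquist {baud / 2:.1f} GHz, "
              f"eye penalty vs PAM4 {eye_penalty_db(m):.1f} dB")
```

For a 448 Gb/s lane this gives about 224 GBd for PAM4 versus about 173 GBd for PAM6, roughly 23% lower, in exchange for about 4.4 dB less vertical eye opening.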


Shift from Copper to Optical Transceivers and OSFP AOCs

In a scenario where optics replace copper in NVLink-based backplane interconnects, Arista estimates that approximately 648 1.6T optical transceivers would be required to connect a 72-GPU backplane. Using traditional DSP-based 1.6T transceivers would increase overall power consumption by about 16.2%, whereas LPO-based 1.6T transceivers would limit the increase to around 5.4%, a far more manageable impact.
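The 648 figure and both percentages can be reproduced with simple arithmetic. The sketch below assumes each GPU has 1.8 TB/s of NVLink bandwidth (7.2 Tb/s per direction, NVIDIA’s published figure for fifth-generation NVLink) and one 1.6T module at each end of every optical link. The power values, 30 W per DSP-based module, 10 W per LPO module, and about 120 kW for the 72-GPU rack, are hypothetical: they are not stated in the source, but they are one set of assumptions that yields the quoted 16.2% and 5.4%.

```python
NVLINK_GBPS_PER_DIR = 7200   # 1.8 TB/s bidirectional per GPU = 7.2 Tb/s each way
MODULE_GBPS = 1600           # a 1.6T transceiver carries 1.6 Tb/s each way

# Hypothetical power assumptions (not stated by Arista); chosen because
# they reproduce the 16.2% / 5.4% figures quoted in the text.
RACK_KW = 120.0
DSP_MODULE_W = 30.0
LPO_MODULE_W = 10.0


def transceivers_needed(gpus: int) -> int:
    """Modules at the GPU end plus the switch end of every optical link."""
    total = gpus * NVLINK_GBPS_PER_DIR
    per_end = (total + MODULE_GBPS - 1) // MODULE_GBPS  # integer ceiling division
    return 2 * per_end


def power_increase_pct(modules: int, watts_per_module: float) -> float:
    """Added optics power as a percentage of the assumed rack power."""
    return 100 * modules * watts_per_module / (RACK_KW * 1000)


if __name__ == "__main__":
    n = transceivers_needed(72)
    print(n, "modules")
    print(f"DSP-based: +{power_increase_pct(n, DSP_MODULE_W):.1f}%")
    print(f"LPO:       +{power_increase_pct(n, LPO_MODULE_W):.1f}%")
```

Under these assumptions, 72 GPUs need 324 modules on the GPU side and 324 on the switch side, and the 3:1 ratio between the two percentages simply mirrors the assumed 30 W versus 10 W per module.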


Switches use copper interconnects to link 72 xPU chips

Comparing LPO and CPO in AI Networking


The industry has debated LPO (Linear Pluggable Optics) versus CPO (Co-Packaged Optics) extensively for AI networking. Arista has consistently argued in favor of LPO, weighing the specific advantages and disadvantages of each technology.


LPO (Linear Pluggable Optics): LPO transceivers offer relatively low power consumption, with energy per bit ranging from 6.5 to 8.3 pJ/bit according to various manufacturers. However, because an LPO module has no DSP to retime and re-equalize the signal, the switch ASIC’s SerDes must compensate for the electrical trace between the ASIC and the module, and the resulting signal degradation limits transmission distance. Another challenge with LPO is the high power density on the front panel, where all optical modules are concentrated. Arista believes this issue can be managed effectively with liquid cooling, which can mitigate the thermal density on the front panel.


CPO (Co-Packaged Optics): CPO is potentially even more energy-efficient, with power consumption figures ranging from 1.5 to 5 pJ/bit. However, CPO has a less mature supply chain and presents maintenance challenges: because the optics are integrated directly with the switch ASIC, replacements and upgrades are more complicated. Although CPO has a theoretical efficiency advantage (estimates of 5 pJ/bit versus 6.5 pJ/bit for LPO), the practical benefit is marginal given the trade-offs in flexibility and maturity. NVIDIA’s CPO evaluation team projects that CPO power consumption could eventually fall to as low as 1-1.5 pJ/bit, but Arista remains cautious about these claims given the open challenges in CPO’s ecosystem and reliability.
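Energy-per-bit figures are easier to compare once converted to watts, since pJ/bit × Gb/s = mW. The sketch below assumes the quoted pJ/bit values apply to a 1.6 Tb/s port’s one-direction line rate (vendors do not always use the same convention) and applies the LPO-versus-CPO gap of 6.5 versus 5 pJ/bit to 648 ports, the count from the 72-GPU scenario discussed earlier; the function names are illustrative only.

```python
PORT_GBPS = 1600  # assumption: pJ/bit is quoted against the 1.6 Tb/s one-way line rate


def port_watts(pj_per_bit: float, gbps: float = PORT_GBPS) -> float:
    """Power = energy per bit x bit rate; pJ/bit x Gb/s = mW, so divide by 1000 for W."""
    return pj_per_bit * gbps / 1000


def fleet_delta_watts(pj_a: float, pj_b: float, ports: int) -> float:
    """Total power difference between two technologies over `ports` 1.6T ports."""
    return (port_watts(pj_a) - port_watts(pj_b)) * ports


if __name__ == "__main__":
    for name, pj in [("LPO high", 8.3), ("LPO low", 6.5),
                     ("CPO high", 5.0), ("CPO low", 1.5)]:
        print(f"{name:9s} {pj:>4} pJ/bit -> {port_watts(pj):5.1f} W per 1.6T port")
    # Comparison point from the text: 6.5 pJ/bit LPO vs 5 pJ/bit CPO
    print(f"648-port gap: {fleet_delta_watts(6.5, 5.0, 648):.0f} W")
```

Under these assumptions an LPO port draws roughly 10.4 to 13.3 W and a CPO port roughly 2.4 to 8 W, and the 1.5 pJ/bit gap between 6.5 pJ/bit LPO and 5 pJ/bit CPO amounts to about 1.6 kW across 648 ports.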


Comparing LPO and CPO in AI Networking

Applications for LPO Transceivers and OSFP AOCs in AI Networking


Arista sees LPO as particularly well suited to backplane interconnects spanning up to 10 meters, such as configurations linking 128 or 512 xPUs in data-intensive computing systems. At these distances, optical transceivers and OSFP AOCs are a natural fit; DAC works over shorter distances but struggles to deliver the higher speeds and longer reach that are increasingly required. LPO transceivers are also hot-swappable, which simplifies maintenance compared with CPO, where the optics are integrated with the switch ASIC.


The higher front-panel power density of LPO is a known drawback, but Arista believes that advanced cooling techniques, such as liquid cooling, can alleviate it. Based on a comparative assessment of copper, DPO (DSP-based hot-pluggable optics), LPO, and CPO, Arista concludes that **LPO offers the best balance of performance, power efficiency, and practicality** for AI networking. Copper cannot meet the speed or reach demands of backplane interconnects, DPO requires too much power, and CPO’s immature supply chain makes it less attractive for large-scale deployment.


Liquid cooling to reduce power consumption

Conclusion


In summary, Arista’s position is that LPO-based optical transceivers and OSFP AOC solutions are the optimal choice for AI networking. With data rates scaling from 10G up to 800G and beyond, the limitations of copper and DSP-based optics highlight the necessity of transitioning to low-power, high-speed optical solutions. While CPO may offer some theoretical energy efficiency gains, LPO’s practical benefits and flexibility make it a superior choice for AI networking environments.


As data rates and interconnect distances continue to grow, the industry is moving towards high-performance optical solutions to support future AI workloads. Arista’s commitment to advancing LPO technology reaffirms its belief that LPO and OSFP AOC will play a crucial role in the evolution of AI networking, enabling higher bandwidth, lower power consumption, and greater scalability for next-generation data center architectures.
