Switch to Spine-and-Leaf Architecture
High-speed data center networking functions are the basis for everything else: intersystem links, storage and reliable connectivity to customers. That means not just high speed, but also low-latency and low-loss networks. To deliver the performance needed for AI, IT managers should be thinking about changes both in architecture and in hardware.
IT managers with traditional three-tier core/distribution/edge networks in their data centers should be planning to replace all that gear — even without AI in the picture — with spine-and-leaf architecture. Changing to spine-and-leaf ensures that every system in a computing pod is no more than two hops from every other system.
Selecting 40-gigabit-per-second or 100Gbps links between leaf switches and the network spine helps reduce the impact of oversubscription when servers are commonly connected at 10Gbps to the network leaf switches. To really be on the cutting edge of performance, IT managers can aim for a 100Gbps fabric end-to-end, although some find that 10Gbps server connections occupy a price-performance sweet spot.
When a network has to support high-speed NVMe over Fabric storage, IT managers have another option for notching up speeds to match the demands being made by ML models: remote direct memory access (RDMA) combined with lossless Ethernet.
NVMe over Fabric can run over standard Ethernet, utilizing Transmission Control Protocol to encapsulate traffic. But NVMe over Fabric storage delivers even lower latency when server network interface controllers, or NICs, are replaced with RDMA NICs, or RNICs. By offloading everything from the CPU and bypassing the OS kernel, network stack and disk drivers, performance is supercharged over traditional architectures. The lossless Ethernet side of the equation is provided by modern high-performance network switches that can compensate for oversubscription, prioritize RDMA traffic and manage congestion end to end within the data center.
With high-speed networking in place, and high-speed storage systems ready to roll, IT managers are poised for the last part of the AI equation: computing power.
Are GPUs Necessary for AI?
Start researching AI and ML, and you may discover that your old servers are not powerful enough and you need to immediately invest in graphics processing units to handle the load. In truth, moving to GPUs will give the best results in many cases, but not all the time. And for IT managers who have extensive experience with traditional servers and large server farms already deployed, adding GPUs can be an expensive choice.
The key point here is parallelism: the requirement to run multiple streams at the same time, combined with memory use. GPUs are great at parallel operations, and mainstream ML tools are especially efficient and high-performing when they can run on these GPUs. But all this performance comes at a cost, and GPU upgrades don’t do anything when your developers and operations teams don’t dim the lights when they run the processor-intensive parts of their ML models.
That’s the big difference between GPUs and storage and network upgrades, which deliver better performance for everything running in the data center, all the time.
IT managers should plan their investments carefully when it comes to GPUs, and make sure that workloads are heavy enough to justify investing in this new technology. It’s also worthwhile to look at the major cloud computing providers, including Amazon, Google and Microsoft, as they already have the GPU hardware installed and ready to go, and are happy to rent it to you through their cloud computing services.