How to Get a Business’s Network Ready to Handle AI Applications

Artificial intelligence requires a lot of storage and computing power, plus an elegant architecture.

Joel is an internationally recognized expert in the areas of security, networking and messaging. Follow him on X (formerly Twitter) @joelsnyder.

Artificial intelligence is moving from small research projects and blue-sky ideas to a solid and valuable presence in enterprise IT portfolios. AI teams are focusing on improving customer chatbots, building environmental management systems, speeding up image processing and making supply-chain management tools smarter.

IT managers may think of new AI-based applications as “just another app,” but that could be a dangerous assumption. For most developers, AI means using machine learning and neural networks, and these technologies can have a huge impact on both on-premises and cloud-based resources. Let’s look at how AI can affect the main components in data centers: storage, network and compute.

Why Your Organization Should Consider Adding More Storage

AI, machine learning and neural networks eat storage like crazy. Consider some of the big open-source data sets being used for ML: YouTube-8M, which has 350,000 hours of video; Google’s Open Images, with 9 million images; and ImageNet, with 14 million images. ML tools will stress both the capacity and performance of storage systems.
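
As a rough illustration of how both capacity and read performance come into play, the sketch below estimates the footprint of a 14-million-image data set and the time needed to read it once from start to finish. The average image size and the sustained read rate are hypothetical placeholders, not measurements of ImageNet or of any storage product, so substitute real figures before drawing conclusions.

```python
def dataset_bytes(item_count: int, avg_item_bytes: int) -> int:
    """Total size of a data set, given an item count and an average item size."""
    return item_count * avg_item_bytes


def full_read_seconds(total_bytes: int, bytes_per_second: int) -> float:
    """Time to stream the whole data set once at a sustained read rate."""
    return total_bytes / bytes_per_second


# Assumptions for illustration only (not taken from the article):
ASSUMED_AVG_IMAGE_BYTES = 100_000      # hypothetical 100 KB per image
ASSUMED_READ_RATE = 2_000_000_000      # hypothetical 2 GB/s sustained reads

total = dataset_bytes(14_000_000, ASSUMED_AVG_IMAGE_BYTES)
print(total / 1e12)                                   # size in TB: 1.4
print(full_read_seconds(total, ASSUMED_READ_RATE))    # seconds per pass: 700.0
```

Training jobs typically read the data set many times, so the per-pass time matters as much as the raw capacity.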

Data centers based on storage area networks with spinning disks have mostly given way to flash-based solid-state drive arrays, yet even flash may not deliver enough performance for demanding ML applications. IT managers looking for serious performance may wish to investigate storage arrays based on Non-Volatile Memory Express (NVMe). Fortunately, NVMe is now mainstream enough that most popular storage vendors support it, including NetApp, HPE, Dell and IBM.

NVMe drives can be attached directly to a server, and they get their speed by connecting straight to the PCIe bus. Because NVMe supports many parallel queues, every CPU core can talk directly to the storage system and take advantage of non-uniform memory access (NUMA), eliminating the bottleneck of a controller and the single queue that comes with a traditional storage array. But the performance of NVMe attached directly to a single server depends on the speed of that server, which may simply shift the bottleneck.

IT managers would be wise to also investigate NVMe over Fabric SANs. These extend the speed of NVMe storage arrays across network fabrics, most commonly Ethernet and Fibre Channel. NVMe over Fabric delivers best when paired with a high-speed backbone, which brings us to the next part of our data center equation: the network.

Switch to Spine-and-Leaf Architecture

High-speed data center networking functions are the basis for everything else: intersystem links, storage and reliable connectivity to customers. That means not just high speed, but also low-latency and low-loss networks. To deliver the performance needed for AI, IT managers should be thinking about changes both in architecture and in hardware.

IT managers with traditional three-tier core/distribution/edge networks in their data centers should be planning to replace all that gear — even without AI in the picture — with spine-and-leaf architecture. Changing to spine-and-leaf ensures that every system in a computing pod is no more than two hops from every other system.
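
The two-hop property can be checked with a small model. The sketch below builds a hypothetical fabric in which every leaf switch connects to every spine switch, then uses a breadth-first search to find the longest shortest path between any pair of leaves. The switch counts and names are made up for illustration, and a hop is counted here as one switch-to-switch link.

```python
from collections import deque
from itertools import combinations


def build_leaf_spine(num_spines: int, num_leaves: int):
    """Return an adjacency map in which every leaf links to every spine."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    graph = {node: set() for node in spines + leaves}
    for leaf in leaves:
        for spine in spines:
            graph[leaf].add(spine)
            graph[spine].add(leaf)
    return graph, leaves


def hops(graph, src: str, dst: str) -> int:
    """Breadth-first search: number of links on the shortest path."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path from {src} to {dst}")


# Hypothetical pod: 2 spine switches, 4 leaf switches.
graph, leaves = build_leaf_spine(num_spines=2, num_leaves=4)
worst = max(hops(graph, a, b) for a, b in combinations(leaves, 2))
print(worst)  # 2: leaf -> spine -> leaf, whichever pair is chosen
```

Adding more leaves or spines to this model does not change the result, which is why the design scales without adding latency between systems.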

Selecting 40-gigabit-per-second or 100Gbps links between leaf switches and the network spine helps reduce the impact of oversubscription when servers are commonly connected at 10Gbps to the network leaf switches. To really be on the cutting edge of performance, IT managers can aim for a 100Gbps fabric end-to-end, although some find that 10Gbps server connections occupy a price-performance sweet spot.
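
The oversubscription arithmetic is easy to sketch. The example below divides the total server-facing bandwidth of one leaf switch by its total uplink bandwidth to the spine. The port counts (48 servers and 4 uplinks per leaf) are assumptions chosen for illustration; only the 10Gbps, 40Gbps and 100Gbps link speeds come from the discussion above.

```python
def oversubscription_ratio(server_ports: int, server_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Server-facing capacity divided by spine-facing capacity for one leaf."""
    downlink_capacity = server_ports * server_gbps
    uplink_capacity = uplinks * uplink_gbps
    return downlink_capacity / uplink_capacity


# Hypothetical leaf switch: 48 servers at 10Gbps, 4 uplinks to the spine.
print(oversubscription_ratio(48, 10, 4, 40))    # 480 / 160 = 3.0
print(oversubscription_ratio(48, 10, 4, 100))   # 480 / 400 = 1.2
```

A ratio of 1.0 means the uplinks can carry everything the servers can send at once; the higher the ratio, the more the fabric relies on servers not all transmitting at full rate simultaneously.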

When a network has to support high-speed NVMe over Fabric storage, IT managers have another option for notching up speeds to match the demands being made by ML models: remote direct memory access (RDMA) combined with lossless Ethernet.

NVMe over Fabric can run over standard Ethernet, using Transmission Control Protocol (TCP) to encapsulate traffic. But NVMe over Fabric storage delivers even lower latency when server network interface controllers, or NICs, are replaced with RDMA NICs, or RNICs. RNICs offload the data transfer work from the CPU and bypass the OS kernel, network stack and disk drivers, which gives a substantial performance boost over traditional architectures. The lossless Ethernet side of the equation is provided by modern high-performance network switches that can compensate for oversubscription, prioritize RDMA traffic and manage congestion end to end within the data center.

With high-speed networking in place, and high-speed storage systems ready to roll, IT managers are poised for the last part of the AI equation: computing power.

Are GPUs Necessary for AI?

Start researching AI and ML, and you may discover that your old servers are not powerful enough and you need to immediately invest in graphics processing units to handle the load. In truth, moving to GPUs will give the best results in many cases, but not all the time. And for IT managers who have extensive experience with traditional servers and large server farms already deployed, adding GPUs can be an expensive choice.

The key point here is parallelism: the requirement to run multiple streams at the same time, combined with memory use. GPUs are great at parallel operations, and mainstream ML tools are especially efficient and high-performing when they can run on these GPUs. But all this performance comes at a cost, and GPUs sit idle whenever your developers and operations teams aren’t running the processor-intensive, dim-the-lights parts of their ML models.

That’s the big difference between GPUs and storage and network upgrades, which deliver better performance for everything running in the data center, all the time.

IT managers should plan their investments carefully when it comes to GPUs, and make sure that workloads are heavy enough to justify investing in this new technology. It’s also worthwhile to look at the major cloud computing providers, including Amazon, Google and Microsoft, as they already have the GPU hardware installed and ready to go, and are happy to rent it to you through their cloud computing services.
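
One way to frame the buy-versus-rent decision is a simple break-even calculation, sketched below. The purchase price, hourly cloud rate and usage figures are hypothetical placeholders, not quotes from any vendor or cloud provider, and the model deliberately ignores power, cooling, staffing and depreciation.

```python
def break_even_hours(purchase_cost: float, cloud_rate_per_hour: float) -> float:
    """GPU-hours of use at which buying costs the same as renting."""
    return purchase_cost / cloud_rate_per_hour


def cheaper_to_buy(purchase_cost: float, cloud_rate_per_hour: float,
                   busy_hours_per_month: float, months: int) -> bool:
    """True if expected use over the planning period passes break-even."""
    expected_hours = busy_hours_per_month * months
    return expected_hours > break_even_hours(purchase_cost, cloud_rate_per_hour)


# Hypothetical figures: a $30,000 GPU server versus $3.00 per hour in the cloud.
print(break_even_hours(30_000, 3.0))           # 10000.0 hours
print(cheaper_to_buy(30_000, 3.0, 40, 36))     # False: 1,440 hours over 3 years
print(cheaper_to_buy(30_000, 3.0, 500, 36))    # True: 18,000 hours over 3 years
```

With these made-up numbers, an occasional training workload favors renting, while a GPU kept busy most of the month favors buying.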

James Steinberg/Theispot