GPU Cluster Architecture: Bare-Metal vs. Virtualized for AI
When your training run spans hundreds of billions of parameters and dozens of GPUs, the infrastructure layer underneath is not an abstraction you can afford to ignore. The choice between bare metal GPU servers and virtualized GPU instances determines MFU (model FLOPs utilization), inter-node latency, and ultimately the cost per training step. This post examines both architectures at the hardware and software layers so you can make an informed decision before committing capital or signing long-term contracts.
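Before comparing architectures, it helps to pin down the metric. A common back-of-envelope for dense transformer training is roughly 6 FLOPs per parameter per token (forward plus backward); the sketch below computes MFU from that approximation, using illustrative numbers rather than measurements:

```python
def model_flops_utilization(params: float, tokens_per_sec: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU via the ~6 * N FLOPs-per-token approximation for dense
    transformer training (forward + backward passes)."""
    achieved = 6 * params * tokens_per_sec      # FLOPs/s actually spent on the model
    available = num_gpus * peak_flops_per_gpu   # cluster peak FLOPs/s
    return achieved / available

# Illustrative numbers (assumptions, not measurements): a 70B-parameter
# model on 64 H100s at ~989 TFLOP/s dense BF16 peak, 28k tokens/s observed.
mfu = model_flops_utilization(70e9, 28_000, 64, 989e12)
print(f"MFU: {mfu:.1%}")   # → MFU: 18.6%
```

Anything that lengthens the communication phases described below subtracts directly from the numerator of this ratio.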
What Is Bare-Metal GPU Infrastructure?
A bare-metal GPU server is a physical machine provisioned directly to a single tenant with no hypervisor, no virtual machine layer, and no shared kernel between the workload and the hardware. The operating system boots on the physical host, drivers communicate directly with the GPU silicon, and the network fabric connects node-to-node at full line rate.
In a GPU cluster context, bare-metal nodes are typically connected via high-bandwidth, low-latency fabrics such as NVIDIA InfiniBand NDR (400 Gb/s) or HDR (200 Gb/s), with NVLink or NVSwitch providing intra-node GPU-to-GPU bandwidth that reaches 900 GB/s on H100 SXM configurations. Nothing in this datapath is shared with another tenant's workload, and no hypervisor interrupt handler competes for CPU cycles.
How Virtualized GPU Clusters Work
Virtualized GPU infrastructure places a hypervisor — most commonly KVM, VMware ESXi, or a cloud provider's proprietary equivalent — between the physical GPU and the guest operating system. GPUs are surfaced to tenants through full GPU passthrough (vGPU in pass-through mode), NVIDIA vGPU software partitioning, or MIG (Multi-Instance GPU) slicing on Ampere- and Hopper-generation hardware.
Each approach imposes different overhead profiles, but all share a common constraint: the hypervisor must arbitrate hardware access, coordinate interrupt routing, and manage memory mappings across tenants. For inference serving at moderate batch sizes or for development and experimentation, this overhead is often acceptable. For large-scale distributed training, it becomes a significant and compounding bottleneck.
The Performance Gap: Where It Comes From
Collective Communication Latency
Distributed training frameworks — PyTorch DDP, DeepSpeed, Megatron-LM — rely on collective operations: AllReduce, AllGather, ReduceScatter. These operations synchronize gradient tensors across every GPU in the cluster after each backward pass. The time spent in collectives directly subtracts from compute utilization.
On bare-metal InfiniBand clusters, NCCL (NVIDIA Collective Communications Library) can bypass the CPU entirely using GPUDirect RDMA, writing gradient tensors from GPU memory to the network interface and into a remote GPU's memory without a single host-CPU copy. Latencies for small messages drop below 2 microseconds. In a virtualized environment, GPUDirect RDMA support is either absent, emulated, or requires specific hypervisor configurations that most cloud providers do not expose to tenants. The result is CPU-mediated transfers with latencies 5x to 15x higher for the same message sizes.
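The collective itself is an algorithmic pattern independent of transport. As a minimal sketch, here is the ring AllReduce schedule that NCCL commonly selects, in plain Python with lists standing in for GPU buffers — the point is the 2*(n-1)-step communication pattern, not the data path:

```python
def ring_allreduce(grads: list[list[float]]) -> list[list[float]]:
    """Toy ring AllReduce: n ranks each hold a gradient vector; after
    2*(n-1) steps every rank holds the elementwise sum. Models the
    schedule NCCL uses, not its RDMA transport."""
    n = len(grads)
    chunk = len(grads[0]) // n            # vector split into n chunks
    out = [g[:] for g in grads]
    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n            # chunk rank r forwards this step
            dst = (r + 1) % n
            for i in range(c * chunk, (c + 1) * chunk):
                out[dst][i] += out[r][i]
    # Phase 2, all-gather: each rank circulates its finished chunk around
    # the ring until every rank has every chunk.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n        # finished chunk rank r forwards
            dst = (r + 1) % n
            out[dst][c * chunk:(c + 1) * chunk] = out[r][c * chunk:(c + 1) * chunk]
    return out

# Four ranks, each contributing a constant vector; every rank ends with
# the elementwise sum 1 + 2 + 3 + 4 = 10.
print(ring_allreduce([[float(r + 1)] * 4 for r in range(4)]))
```

Each of those 2*(n-1) steps is a point-to-point transfer whose latency is paid on the critical path — which is why the microsecond-level difference between RDMA and CPU-mediated transfers compounds so quickly.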
PCIe and NVLink Topology Visibility
Multi-GPU training performance depends heavily on whether the framework can enumerate the actual hardware topology — which GPUs share an NVSwitch fabric, which sit behind the same PCIe root complex, which are on different NUMA nodes. NCCL topology detection uses this information to select optimal communication trees and rings.
On bare metal, nvidia-smi topo -m reflects the real physical interconnect. On virtualized instances, the topology presented to the guest is a virtualized approximation. In many public cloud configurations, NVLink is not exposed to VM guests at all, forcing all intra-node GPU communication over PCIe — reducing intra-node bandwidth from 900 GB/s to roughly 64 GB/s bidirectional on a standard PCIe 4.0 x16 link.
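A practical pre-flight check is to parse the topology matrix and flag GPU pairs that would fall back to PCIe. The sketch below runs against a trimmed, hypothetical 4-GPU sample of `nvidia-smi topo -m` output; on a real node you would feed it the actual command output, which also carries CPU- and NUMA-affinity columns:

```python
# Hypothetical, trimmed 4-GPU topology matrix in the style of
# `nvidia-smi topo -m` (GPU3 attached over PCIe only in this sample).
SAMPLE = """\
     GPU0 GPU1 GPU2 GPU3
GPU0 X    NV18 NV18 NODE
GPU1 NV18 X    NV18 NODE
GPU2 NV18 NV18 X    NODE
GPU3 NODE NODE NODE X
"""

def pcie_only_pairs(topo: str) -> list[tuple[str, str]]:
    """Return GPU pairs whose link is a PCIe/NUMA hop rather than NVLink.
    Link codes: NV# = NVLink; PIX/PXB/PHB/NODE/SYS = PCIe or NUMA paths."""
    lines = topo.strip().splitlines()
    headers = lines[0].split()
    pairs = []
    for row in lines[1:]:
        cells = row.split()
        src = cells[0]
        for dst, link in zip(headers, cells[1:]):
            # src < dst keeps each unordered pair once; "X" is the diagonal.
            if src < dst and not link.startswith(("NV", "X")):
                pairs.append((src, dst))
    return pairs

print(pcie_only_pairs(SAMPLE))   # → [('GPU0', 'GPU3'), ('GPU1', 'GPU3'), ('GPU2', 'GPU3')]
```

If this check reports NVLink-less pairs on an instance advertised as NVLink-connected, the guest is seeing the virtualized approximation described above.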
Jitter and Tail Latency in Synchronous Training
Synchronous data-parallel and tensor-parallel training requires all nodes to complete a collective before the optimizer step can proceed. A single slow node holds back the entire cluster. In a virtualized environment, hypervisor scheduling introduces jitter that manifests as unpredictable stalls in the 50–500 microsecond range. These stalls are individually small but accumulate across thousands of iterations and tens of nodes into training runs that are measurably longer and less predictable.
Bare-metal nodes with real-time kernel tuning, CPU isolation (cpuset), and NUMA-aware process pinning eliminate this class of jitter at the source.
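The straggler effect is easy to quantify: a synchronous step takes as long as the slowest node, so the expected penalty is the expected maximum of the per-node stalls. A small simulation — with jitter modeled as a uniform stall, an assumption rather than a measured hypervisor distribution — shows the cost growing with cluster size:

```python
import random

def simulated_step_time(num_nodes: int, base_ms: float, jitter_ms: float,
                        iters: int = 1000, seed: int = 0) -> float:
    """Mean synchronous-step time when every node must finish before the
    optimizer step proceeds: each step lasts as long as its slowest node.
    Per-node jitter is modeled as uniform in [0, jitter_ms] (an assumption)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iters):
        total += max(base_ms + rng.uniform(0, jitter_ms) for _ in range(num_nodes))
    return total / iters

# The expected max of n uniform stalls approaches the worst case as n grows,
# so identical per-node jitter costs more on larger clusters.
for n in (1, 8, 64):
    print(n, round(simulated_step_time(n, base_ms=100.0, jitter_ms=0.5), 3))
```

With one node the average stall is half the jitter bound; at 64 nodes nearly every step pays close to the full bound — the compounding described above.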
When Virtualization Is the Right Choice
Virtualized GPU infrastructure is not categorically inferior — it is appropriate for specific workloads:
- Development and experimentation: A single-engineer workload that needs a GPU for a few hours benefits from the elasticity and rapid provisioning of virtualized cloud instances.
- Inference serving with moderate SLAs: If your P99 latency target is in the tens of milliseconds and throughput requirements are met by a single GPU or a small multi-GPU node, virtualization overhead is negligible.
- Burst capacity: Short-duration, non-training workloads like batch scoring or embedding generation can run cost-effectively on spot or preemptible virtualized instances.
- MIG-partitioned workloads: NVIDIA MIG on H100 creates isolated GPU instances with deterministic memory and compute guarantees, which suits multi-tenant inference platforms where isolation matters more than raw throughput.
The calculus changes the moment you are running multi-node distributed training, fine-tuning models with more than 7 billion parameters, or building a persistent inference platform where hardware costs are amortized over 12 to 36 months.
Bare-Metal GPU Clusters for Enterprise AI: Infrastructure-Layer Specifics
InfiniBand Fabric Design
A production bare-metal GPU cluster for LLM training is typically a fat-tree or dragonfly topology of InfiniBand NDR switches. Each compute node connects to a leaf switch at 400 Gb/s per port. Leaf switches uplink to spine switches with no oversubscription: a non-blocking 1:1 ratio ensures that all-to-all collective operations never saturate uplinks. This topology is incompatible with the multi-tenant network overlays that public cloud providers use to share physical infrastructure across customers.
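The oversubscription arithmetic is simple enough to sanity-check in a few lines. The 64-port radix below matches current-generation NDR leaf switches; the node counts are illustrative:

```python
def oversubscription(ports_per_switch: int, nodes_per_leaf: int) -> float:
    """Leaf-level oversubscription ratio: downlink bandwidth divided by
    uplink bandwidth, assuming equal per-port speed. 1.0 is non-blocking."""
    uplinks = ports_per_switch - nodes_per_leaf
    return nodes_per_leaf / uplinks

# A 64-port NDR leaf with 32 node-facing ports leaves 32 uplinks: 1:1.
print(oversubscription(64, 32))   # → 1.0
# Packing 48 nodes per leaf leaves only 16 uplinks: 3:1 oversubscribed,
# so all-to-all collectives will contend on the spine.
print(oversubscription(64, 48))   # → 3.0
```

Cutting leaf count by packing more nodes per switch looks cheaper on the bill of materials but shows up directly as collective-phase stalls.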
Storage: GPUDirect Storage and NVMe-over-Fabrics
Training at scale requires feeding GPUs fast enough that storage never becomes the bottleneck. Bare-metal clusters support GPUDirect Storage (GDS), which allows NVIDIA GPUs to read directly from NVMe SSDs or NVMe-over-Fabrics (NVMe-oF) arrays without staging data through host DRAM. A hypervisor layer breaks GDS support in most deployments, forcing data through the CPU memory bus and cutting achievable storage bandwidth by 30 to 50 percent.
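Checkpointing is where that bandwidth loss is felt most directly, since every GPU stalls while state is flushed. A rough sizing sketch follows; the bytes-per-parameter figure and array throughputs are assumptions, and real checkpoint sizes depend on the optimizer and sharding scheme:

```python
def checkpoint_write_seconds(params: float, bytes_per_param: float,
                             storage_gbps: float) -> float:
    """Wall-clock seconds to flush one full training checkpoint at a
    given sustained storage write bandwidth (GB/s)."""
    return params * bytes_per_param / (storage_gbps * 1e9)

# Illustrative: 70B parameters at ~16 bytes/param (bf16 weights plus fp32
# master weights and Adam moments -- an assumption, not a fixed constant).
full_rate = checkpoint_write_seconds(70e9, 16, storage_gbps=50)  # direct-path array
degraded  = checkpoint_write_seconds(70e9, 16, storage_gbps=25)  # ~50% loss via host staging
print(round(full_rate, 1), round(degraded, 1))   # → 22.4 44.8
```

Doubling checkpoint time at frequent checkpoint intervals translates into a measurable fraction of total billed GPU hours.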
Power and Thermal Design
An eight-GPU H100 SXM node draws between 10 and 12 kW under full training load. Data center infrastructure must be designed for this density from the floor up: power delivery at 208V or 415V three-phase, rear-door heat exchangers or direct liquid cooling, and PDU circuits rated per rack rather than averaged across a multi-tenant floor. OneSource Cloud provisions dedicated power circuits and cooling infrastructure per customer cluster, not shared capacity pools that degrade under concurrent load.
Cost Predictability: Reserved Bare Metal vs. On-Demand Cloud GPUs
The per-hour list price of a cloud H100 instance appears lower than a reserved bare-metal contract at first glance. The comparison becomes unfavorable for cloud once you account for:
- Egress fees on checkpoints and datasets
- Storage costs for persistent model weights
- The performance degradation that extends wall-clock training time (and therefore total billed hours)
- The absence of committed-use discounts that match what a dedicated infrastructure provider offers on 1- or 3-year terms
Enterprise buyers running more than 500 GPU-hours per month consistently find that dedicated bare-metal contracts — with predictable monthly invoicing and no per-byte egress charges — deliver 30 to 60 percent lower total cost than equivalent on-demand cloud GPU spend, after accounting for the performance-adjusted training throughput.
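The performance-adjusted comparison can be made concrete with a small model. All prices and factors below are illustrative placeholders, not quotes from any provider:

```python
def performance_adjusted_cost(rate_per_gpu_hour: float, gpu_hours: float,
                              throughput_factor: float = 1.0,
                              egress_cost: float = 0.0) -> float:
    """Effective monthly cost: billed hours scale inversely with realized
    training throughput, plus any per-byte egress charges."""
    return rate_per_gpu_hour * gpu_hours / throughput_factor + egress_cost

# Illustrative only: 2,000 GPU-hours/month of useful work, on-demand cloud
# at $4.00/GPU-hr with 0.85x realized throughput and $500 of checkpoint
# egress, versus reserved bare metal at $2.80/GPU-hr with no egress fees.
cloud = performance_adjusted_cost(4.00, 2000, throughput_factor=0.85, egress_cost=500)
metal = performance_adjusted_cost(2.80, 2000)
print(round(cloud), round(metal), f"savings: {1 - metal / cloud:.0%}")
```

The structure of the calculation matters more than the placeholder numbers: the throughput factor multiplies the entire cloud line item, which is why a modest per-step slowdown moves the total so far.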
Key Takeaways
- Bare metal GPU servers eliminate the hypervisor layer that degrades collective communication latency, intra-node NVLink bandwidth, and storage throughput for distributed AI training.
- GPUDirect RDMA and GPUDirect Storage — the two most impactful throughput technologies for large-scale training — require bare-metal or near-bare-metal access to function at rated performance.
- Virtualized GPU instances remain appropriate for development, small-scale inference, and burst workloads where per-hour elasticity outweighs raw performance.
- Multi-node InfiniBand fabric design, power density provisioning, and NVMe-oF storage architecture are bare-metal-only capabilities that hyperscaler shared infrastructure cannot replicate for single tenants.
- Total cost of ownership over 12 to 36 months consistently favors dedicated bare-metal contracts for enterprises running sustained AI training workloads above 500 GPU-hours per month.
Frequently Asked Questions
Can I get NVLink connectivity on a virtualized GPU instance?
On most public cloud platforms, NVLink is not exposed to VM guests. Intra-node GPU-to-GPU communication falls back to PCIe, reducing bandwidth from 900 GB/s to approximately 64 GB/s bidirectional over PCIe 4.0 x16. Some providers offer NVLink passthrough in specific instance families, but this is not universally available and typically requires dedicated host reservations that approach bare-metal pricing anyway.
Is bare-metal GPU hosting more secure than cloud VMs?
From a hardware isolation standpoint, yes. There is no shared kernel, no hypervisor attack surface, and no risk of side-channel attacks that exploit shared last-level cache or shared DRAM controllers between tenants. For AI workloads involving proprietary model weights or sensitive training data, the absence of a shared hardware layer is a meaningful security boundary.
How long does it take to provision a bare-metal GPU cluster?
Provisioning timelines vary by provider and cluster size. OneSource Cloud offers pre-staged H100 and H200 node configurations that can be delivered to a production-ready state within 5 to 10 business days for standard cluster sizes. Custom InfiniBand fabric builds for 64-node and larger configurations require longer lead times tied to hardware allocation.
What operating systems are supported on bare-metal GPU nodes?
Bare-metal nodes support any OS the customer chooses to install — Ubuntu, Rocky Linux, RHEL, and custom images are all common. This is a meaningful advantage over managed cloud services, which constrain OS choices to vendor-approved images and restrict kernel versions to those compatible with the hypervisor stack.
Talk to OneSource Cloud About Your GPU Cluster Requirements
If your team is evaluating infrastructure for an upcoming training run or building a persistent AI platform, the architecture decisions made now will compound across every future workload. Contact OneSource Cloud to discuss your GPU cluster requirements with an infrastructure engineer who works at the hardware layer, not the console layer. You can also schedule a 30-minute technical call to walk through your specific model size, parallelism strategy, and throughput targets so we can size a bare-metal configuration that matches your actual workload — not a generic instance family.
