Don't rent the cloud, own instead
In this detailed overview, Harald Schäfer, CTO of comma.ai, advocates for companies to build and operate their own data centers rather than relying on cloud providers. He argues that self-hosting fosters better engineering incentives, provides greater control over infrastructure, and is significantly more cost-effective for consistent workloads like ML training—estimating a savings of $20M compared to cloud equivalents. The post details the technical architecture of comma's $5M facility, which features 600 GPUs across 75 in-house 'TinyBox Pro' machines, a custom outside-air cooling system, and 4PB of SSD storage. On the software side, comma utilizes tools like Slurm for workload management, PyTorch for training, and custom open-source solutions like 'minikeyvalue' for distributed storage and 'miniray' for task orchestration. This level of vertical integration allows for efficient model training and rapid code iteration within a streamlined, monorepo-based environment.