Distributed Systems

Distributed systems discussions covering consensus algorithms, CAP theorem, fault tolerance, and architectural patterns from developer communities.

Articles from the last 30 days

Don't rent the cloud, own instead
01 Tuesday, February 3, 2026

In this detailed overview, Harald Schäfer, CTO of comma.ai, advocates for companies to build and operate their own data centers rather than relying on cloud providers. He argues that self-hosting fosters better engineering incentives, provides greater control over infrastructure, and is significantly more cost-effective for consistent workloads like ML training—estimating a savings of $20M compared to cloud equivalents. The post details the technical architecture of comma's $5M facility, which features 600 GPUs across 75 in-house 'TinyBox Pro' machines, a custom outside-air cooling system, and 4PB of SSD storage. On the software side, comma utilizes tools like Slurm for workload management, PyTorch for training, and custom open-source solutions like 'minikeyvalue' for distributed storage and 'miniray' for task orchestration. This level of vertical integration allows for efficient model training and rapid code iteration within a streamlined, monorepo-based environment.

The Day the Telnet Died
02 Tuesday, February 10, 2026

In January 2026, GreyNoise analysts observed a sudden and dramatic 65% drop in global Telnet traffic within a single hour, which eventually settled at an 83% reduction from the baseline. This structural shift preceded the public disclosure of CVE-2026-24061, a critical authentication bypass vulnerability in GNU Inetutils telnetd that allows unauthenticated root access via a simple argument injection. The data suggests that major Tier 1 transit providers likely implemented port 23 filtering on backbone infrastructure in anticipation of the vulnerability's disclosure. This proactive infrastructure-level response significantly impacted residential and enterprise ISPs while leaving major cloud providers with direct peering largely unaffected. The incident highlights a potential shift in how global network operators coordinate to mitigate high-impact security risks at the routing level before they can be exploited at scale.

Making WebAssembly a first-class language on the Web
03 Thursday, February 26, 2026

WebAssembly's core capabilities have expanded, yet it remains a 'second-class' web citizen because it relies on complex JavaScript glue code for loading and API access. The WebAssembly Component Model aims to bridge this gap, allowing direct Web API interaction and better cross-language interoperability without JavaScript overhead.

Be wary of Bluesky
04 Friday, February 20, 2026

The article warns that while Bluesky uses the open ATProto, its infrastructure remains highly centralized. Most users rely on Bluesky-run servers for data storage and identity. This creates a 'centralization flywheel' where new apps increase dependency on one company, making users vulnerable to future monetization or acquisition despite the protocol's theoretical portability.

Sources: Hacker News, 302 pts

Running Your Own AS: BGP on FreeBSD with FRR, GRE Tunnels, and Policy Routing
05 Sunday, February 8, 2026

This technical guide explains how individuals can operate their own Autonomous System (AS) and announce IPv6 prefixes on the public internet using FreeBSD and FRR. The author details the process of obtaining an AS number and IPv6 prefix via a sponsoring LIR, configuring BGP peering with multiple upstreams, and using GRE/GIF tunnels for prefix distribution. A significant portion of the article focuses on advanced networking techniques, specifically dual-FIB policy routing and PF firewall rules to manage multiple IPv6 address spaces on a single server. The setup ensures that traffic from a personal BGP prefix and provider-assigned addresses can coexist without routing loops or spoofing issues. Key takeaways include the importance of MSS clamping for tunnels, bogon filtering for BGP security, and the use of the reply-to directive in PF to handle asymmetric routing.

Use Protocols, Not Services
06 Sunday, February 15, 2026

This article advocates for a shift from centralized services to decentralized protocols to preserve privacy and circumvent censorship. Services like Discord are vulnerable to government compliance demands and surveillance, whereas protocols like IRC, XMPP, and Matrix have no single entity that can be compelled, making them resilient against regulatory pressure and account bans.

Sources: Hacker News, 239 pts

Async/Await on the GPU
07 Tuesday, February 17, 2026

VectorWare has successfully implemented Rust's async/await and Future trait on GPUs, enabling structured concurrency without custom DSLs. This achievement allows developers to write complex, high-performance GPU applications using familiar Rust abstractions, leveraging the Embassy executor for task scheduling. The milestone promotes code reuse and safer, more maintainable GPU-native software.

Sources: Hacker News, 204 pts

Reports of Telnet's death have been greatly exaggerated
08 Wednesday, February 11, 2026

Reports suggesting core ISPs are blocking Telnet traffic following recent security vulnerabilities are likely inaccurate. Analysis from Terrace, using RIPE Atlas and internal sensor data, shows no evidence of widespread port 23 filtering. The observed traffic drop in other reports may result from measurement artifacts, session counting methods, or threat actors intentionally avoiding specific monitoring infrastructure.

Show HN: AI agents play SimCity through a REST API
09 Monday, February 9, 2026

The simulation system is currently running, managing a population of over 8.4 million across 463 cities. Key updates include high residential demand, decreased crime rates, and power plants coming online. Despite localized traffic issues and a downtown monster sighting, major city portals remain connected with forty-five mayors actively registered within the network.

Sources: Hacker News, 169 pts

Like Game-of-Life, but on Growing Graphs, with WASM and WebGL
10 Sunday, February 8, 2026

This project explores an experimental simulation of emergent complexity inspired by Paul Cousin’s research on Graph-Rewriting Automata. By utilizing specific local topological rules, the system demonstrates how complex global patterns can arise from simple structural transformations. Unlike traditional cellular automata which operate on static grids, this approach employs dynamic graphs where nodes and edges evolve based on rewriting logic. This simulation serves as a bridge between computer science and mathematical theory, offering insights into how self-organizing systems function within computational frameworks. The work highlights the potential for non-linear growth and structural evolution in distributed systems and algorithmic design.

Sources: Hacker News, 169 pts

What functional programmers get wrong about systems
11 Monday, February 9, 2026

The essay explores the gap between functional programming's focus on program correctness and the realities of production systems, which are inherently distributed. While tools like static types and algebraic data types provide ironclad guarantees for a single binary, they fail to account for the 'set of deployments'—where multiple versions of code, schemas, and data co-exist. The author argues that correctness is a systemic property involving interaction between differing versions across network boundaries, message queues (like Kafka), and databases. Addressing this requires looking beyond the type checker toward schema registries, 'parse, don't validate' at version boundaries, and infrastructure that tracks runtime versioning. Ultimately, it emphasizes that technical excellence in logic cannot solve 'semantic drift' or the temporal archeology of long-lived data, necessitating a shift in focus from snapshots to the evolution of the entire ensemble.
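
The 'parse, don't validate' idea at a version boundary can be sketched as follows. This is a minimal illustration, not the essay's code: the `OrderPlaced` type, the `schema_version` field, and both payload layouts are hypothetical. The point is that every wire version is converted into one internal type at the edge, so code past the boundary never sees a raw, version-dependent dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    """Internal representation; code past the boundary only sees this type."""
    order_id: str
    amount_cents: int
    currency: str


def parse_order_placed(raw: dict) -> OrderPlaced:
    """Convert any supported wire version into the single internal type."""
    version = raw.get("schema_version")
    if version == 1:
        # Hypothetical v1 layout: amount in whole currency units, no currency field.
        return OrderPlaced(raw["order_id"], round(raw["amount"] * 100), "USD")
    if version == 2:
        # Hypothetical v2 layout: integer cents plus an explicit currency.
        return OrderPlaced(raw["order_id"], raw["amount_cents"], raw["currency"])
    # Unknown versions are rejected at the boundary instead of leaking inward.
    raise ValueError(f"unsupported schema_version: {version!r}")


old_event = parse_order_placed({"schema_version": 1, "order_id": "a1", "amount": 12.5})
new_event = parse_order_placed(
    {"schema_version": 2, "order_id": "a2", "amount_cents": 990, "currency": "EUR"}
)
```

A message written by an older deployment and one written by a newer deployment both come out as the same `OrderPlaced` shape, which is the property a type checker alone cannot guarantee across versions.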

Sources: Hacker News, 155 pts

Cloudflare outage on February 20, 2026
12 Saturday, February 21, 2026

Cloudflare experienced a 6-hour service outage on February 20, 2026, due to an Addressing API bug that unintentionally withdrew 1,100 BGP prefixes from its Bring Your Own IP (BYOIP) service. The incident, caused by an automated cleanup sub-task, impacted services like CDN, Magic Transit, and Spectrum. Cloudflare is implementing standardized API schemas and health-mediated rollouts to prevent recurrence.

Sources: Hacker News, 150 pts

A distributed queue in a single JSON file on object storage
13 Saturday, February 21, 2026

Turbopuffer replaced its internal indexing job queue with a single-file system on object storage. Utilizing a stateless broker, group commit batching, and Compare-and-Set (CAS) for atomic operations, the design achieved 10x lower tail latency and improved scalability. This architecture ensures high availability through a simple, predictable system built on durable object storage primitives.
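
To make the compare-and-set pattern concrete, here is a minimal sketch, not Turbopuffer's implementation. `InMemoryObjectStore`, `put_if_match`, and the `{"jobs": [...]}` layout are hypothetical stand-ins for an object store that supports conditional writes. Each call appends a whole batch in one write (a simple form of group commit) and retries if another writer changed the file first.

```python
import json
from dataclasses import dataclass


class PreconditionFailed(Exception):
    """Raised when the stored version no longer matches the expected one."""


@dataclass
class InMemoryObjectStore:
    """Hypothetical stand-in for an object store with conditional writes."""
    body: bytes = b'{"jobs": []}'
    version: int = 0

    def get(self):
        return self.body, self.version

    def put_if_match(self, body, expected_version):
        # The write succeeds only if nobody else wrote since we read.
        if expected_version != self.version:
            raise PreconditionFailed
        self.body = body
        self.version += 1
        return self.version


def enqueue_batch(store, jobs, max_attempts=10):
    """Append a batch of jobs with one conditional write, retrying on conflict."""
    for _ in range(max_attempts):
        body, version = store.get()
        state = json.loads(body)
        state["jobs"].extend(jobs)
        try:
            return store.put_if_match(json.dumps(state).encode(), version)
        except PreconditionFailed:
            continue  # Another writer won the race; re-read and try again.
    raise RuntimeError("too many CAS conflicts")


store = InMemoryObjectStore()
enqueue_batch(store, ["index-a", "index-b"])
enqueue_batch(store, ["index-c"])
queue_state = json.loads(store.get()[0])
```

Because every mutation is a read, modify, conditional-write cycle on one object, the writer itself can stay stateless; the object store is the only source of truth.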

Sources: Hacker News, 139 pts

gRPC: From service definition to wire format
14 Monday, February 9, 2026

A technical exploration of gRPC, covering the contract-first approach using Protocol Buffers and its four streaming models. It details transport via HTTP/2, including URL construction, metadata, length-prefixed framing, and binary wire formats. The article also explains error handling with trailers, compression mechanisms, and adaptations like gRPC-Web for browsers.
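
The length-prefixed framing mentioned above can be illustrated with a short sketch. In gRPC over HTTP/2, each message is preceded by a 5-byte prefix: a 1-byte compressed flag and a 4-byte big-endian length. The helper names `frame_message` and `read_frames` are hypothetical, and the payload is treated as opaque bytes rather than a real Protocol Buffers message.

```python
import struct

PREFIX = struct.Struct(">BI")  # 1-byte compressed flag + 4-byte big-endian length


def frame_message(payload: bytes, compressed: bool = False) -> bytes:
    """Prepend the 5-byte gRPC message prefix to an already-serialized payload."""
    return PREFIX.pack(int(compressed), len(payload)) + payload


def read_frames(buffer: bytes):
    """Split a byte buffer into (compressed, payload) pairs."""
    messages = []
    offset = 0
    while offset < len(buffer):
        if len(buffer) - offset < PREFIX.size:
            raise ValueError("truncated message prefix")
        flag, length = PREFIX.unpack_from(buffer, offset)
        offset += PREFIX.size
        end = offset + length
        if end > len(buffer):
            raise ValueError("truncated message body")
        messages.append((bool(flag), buffer[offset:end]))
        offset = end
    return messages


stream = frame_message(b"hi") + frame_message(b"", compressed=False) + frame_message(b"abc")
decoded = read_frames(stream)
```

The prefix is what lets several messages share one HTTP/2 stream, which streaming RPCs depend on.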

Sources: Hacker News, 136 pts

A Botnet Accidentally Destroyed I2P
15 Saturday, February 21, 2026

In February 2026, the I2P network suffered a massive Sybil attack from the Kimwolf botnet, which deployed 700,000 hostile nodes. Originally mistaken for state-sponsored interference, the disruption was an accidental consequence of botnet command-and-control operations. I2P responded by releasing version 2.11.0, featuring post-quantum encryption and advanced Sybil mitigations.

Sources: Hacker News, 134 pts

Exploring a Modern SMPTE 2110 Broadcast Truck
16 Saturday, February 7, 2026

A behind-the-scenes look at an NHL sports broadcast exploring SMPTE 2110 technology. Key technical elements include Evertz 5700MSC-IP master clocks, PTP timing synchronization, and hybrid fiber-copper SMPTE cabling. The experience highlights the move from analog to digital IP-based media distribution while emphasizing the precision and professionalism required by the production crew.

Sources: Hacker News, 122 pts

Bridging Elixir and Python with Oban
17 Thursday, February 19, 2026

The article demonstrates how Oban facilitates seamless interoperability between Elixir and Python by sharing a common PostgreSQL database. Using a 'Badge Forge' micro-app example, it shows how developers can leverage Python libraries like WeasyPrint for PDF generation while maintaining core logic in Elixir, allowing both ecosystems to exchange durable background jobs transparently.

Sources: Hacker News, 121 pts

Postgres Postmaster does not scale
18 Wednesday, February 4, 2026

Recall.ai, a platform for recording and processing millions of virtual meetings, encountered a rare scaling bottleneck within PostgreSQL. Because meetings tend to start on the hour, their EC2 infrastructure faced extreme, highly synchronized traffic bursts, leading to connection delays of up to 15 seconds. Investigation revealed that the PostgreSQL postmaster process operates on a single-threaded main loop responsible for spawning and reaping backends. During high connection churn or heavy background worker activity, this loop saturates a single CPU core, delaying new connections. The team mitigated this by enabling Linux huge pages to reduce 'fork' overhead, introducing jitter to stagger EC2 connection times, and limiting parallel query bursts. The findings highlight that the primary bottleneck in PostgreSQL scaling is often the single-threaded nature of the postmaster rather than general resource availability.
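
The jitter mitigation can be sketched in a few lines. This is an illustration under stated assumptions, not Recall.ai's code: `schedule_connects`, the 30-second window, and the client count are all hypothetical. Instead of every client connecting at the same instant, each one waits a random delay inside a window, which spreads the backend-spawning work over time.

```python
import random


def schedule_connects(start_time, client_count, max_jitter_seconds, seed=None):
    """Return sorted connect times, each offset by a uniform random delay."""
    rng = random.Random(seed)  # Seeded only so the example is repeatable.
    return sorted(
        start_time + rng.uniform(0.0, max_jitter_seconds)
        for _ in range(client_count)
    )


# 1,000 hypothetical clients that would otherwise all connect at t = 0.
connect_times = schedule_connects(
    start_time=0.0, client_count=1000, max_jitter_seconds=30.0, seed=42
)

# Count how many connection attempts land in each one-second bucket.
per_second = [0] * 31
for t in connect_times:
    per_second[int(t)] += 1
peak_per_second = max(per_second)
```

Without jitter the peak would be all 1,000 connections in the same second; with a uniform 30-second window the expected load is roughly 33 per second, at the cost of some clients connecting later.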

Sources: Hacker News, 110 pts

Goblins: Distributed, Transactional Programming with Racket and Guile
19 Saturday, January 31, 2026

Spritely Goblins is a sophisticated distributed object programming environment designed to simplify building secure, networked applications. By utilizing a capability-based security model, it ensures that objects remain encapsulated and protected while participating in distributed transactions. The platform provides automatic local transactions for synchronous operations and an efficient asynchronous interface for objects regardless of their network location. A key feature of Goblins is its language-agnostic approach, currently supporting Guile and Racket, which allows for cross-language object interaction. By abstracting the complexities of protocol architecture and networking details, Goblins enables developers to focus on core logic. Additionally, it integrates powerful debugging tools and a process persistence model that facilitates seamless upgrades without compromising security fundamentals.

Sources: Hacker News, 110 pts

Hypergrowth isn’t always easy
20 Friday, January 30, 2026

Tailscale addressed recent uptime concerns by providing a transparent look at their system architecture and incident response philosophy. They clarified that their 'coordination server' has evolved into a sharded 'coordination service' designed as a high-speed message bus. This architecture allows the data plane to remain functional, meaning existing connections persist even if the control plane is down, but a control plane outage still disrupts administrative actions like logging in or updating ACLs. To combat recent instability, Tailscale is implementing several improvements: caching network maps on nodes to survive restarts during outages, enhancing the coordination service with hot spares and auto-rebalancing, and investing in multi-tailnet sharing to reduce geographic latency. They also commit to reporting all incidents publicly to maintain trust as they scale their global infrastructure.

Sources: Hacker News, 110 pts