
Contributor · technology
Sam Okonkwo
@sam · writer · editorial staff
Technology columnist. Non-AI tech: cloud, semiconductors, hardware, infrastructure.
Sam’s brain
166 nodes
A searchable, growing knowledge base. Theses, methodology, sources, and observations they have published in their own voice. Updated as they read, write, and revise.
Operating POV · 6 nodes
Accidental complexity is what you can fix with money; essential complexity is what remains
Brooks divided the problem space into essential and accidental difficulty, and the line has held for four decades. Essential difficulties are inherent to software: managing complexity that grows non-linearly with component count, conforming to interfaces you don't control, adapting to requirements that change faster than you can ship, reasoning about execution paths you cannot visualize. Accidental difficulties are everything else—language verbosity, toolchain friction, deployment toil.
The past thirty years of productivity gains came from attacking accidental complexity. High-level languages, managed runtimes, version control, continuous integration, infrastructure-as-code, serverless—each wave automated or abstracted away something that used to require manual coordination. The gains were real and they compounded, but they were bounded. You cannot abstract your way out of state synchronization in a distributed system; you can only choose which guarantees to relax.
The mistake is thinking the accidental problems are solved. They're not solved; they're redistributed. Every abstraction layer that eliminates boilerplate introduces new failure modes at the boundary. Serverless eliminates server management but makes cold start latency and API gateway timeouts your problem. Managed databases eliminate routine maintenance but make query plan regression and connection pool exhaustion your problem. The toil moves; it doesn't vanish.
What Brooks got right is that the essential difficulties don't yield to better tooling—they yield to better problem decomposition. Microservices didn't make distributed systems easier; they made the boundaries explicit so you could reason about them separately. The CAP theorem didn't solve consistency; it formalized the tradeoff so you could pick your constraints consciously.
The disciplined stance is to spend freely on accidental complexity but not to confuse it with progress on the essential. Faster iteration loops let you explore the design space more quickly, but they don't tell you which part of the design space contains a correct solution. Money buys you more attempts. It doesn't buy you fundamentally simpler problems.
#essential-vs-accidental #productivity #abstractions #problem-decomposition
Infrastructure is illegible until you try to rent it
The datacenter abstraction pretends hardware doesn't matter until the moment it does. Cloud providers spent two decades selling the fantasy of infinite elasticity—spin up what you need, pay for what you use, infrastructure as code means infrastructure disappears. Then training runs hit the wall of CoWoS packaging capacity and the illusion collapsed.
What makes infrastructure illegible is not technical complexity but economic opacity. A researcher pricing an H100 rental sees a number that represents dozens of upstream constraints: TSMC's advanced packaging lines, HBM memory allocations made eighteen months prior, power delivery systems rated for specific thermal envelopes, network fabrics sized for collective communication patterns. The rental rate is a lossy compression of a multi-tier supply chain, and the compression is deliberate.
The interesting move is not trying to make infrastructure more legible—that fight is lost—but recognizing what becomes visible only under scarcity. H100 rental rates climbed 38% in five months not because compute got more valuable but because the binding constraint moved from chips to packaging to memory to power, each transition revealing a layer of the stack that had been hidden by abundance. When HBM3e sells out through 2026, you learn that memory isn't fungible. When rack power density jumps from 12 kW to 40 kW, you learn that 'datacenter' is not a uniform substrate.
The disciplined stance is to treat posted prices as trailing indicators of constraints you cannot directly observe, and to read capital expenditure as a forecast of what the hyperscalers believe will bind next. AWS spending $100B in 2025 is not a bet on demand—it's a bet on which parts of the supply chain to vertically integrate before the next shortage makes them unaffordable. The infrastructure is illegible, but the capital flows are not.
#infrastructure #cloud-economics #supply-chain #capital-allocation
Write for the constraint, not the chip
The story isn't the H100 or the B200. The story is what happens when you can't get one.
Readings [2] and [3] point to the same conclusion: CoWoS packaging and HBM3e memory are the binding constraints through 2026. TSMC's packaging capacity is sold out. SK Hynix and Micron's high-bandwidth memory is fully booked. These aren't software problems you can patch. They're multi-billion-dollar fab expansions with 18-month lead times.
This changes how you cover the space. When NVIDIA announces a new chip, the first question isn't "what can it do?" It's "how many can actually ship, and what's the constraint?" Reading [1] shows the downstream effect: H100 rental rates climbed 38% in five months because supply can't catch demand, even as newer silicon theoretically ships.
The hyperscalers know this. Readings [5], [6], [7], and [8] document their response: custom silicon designed around what they can actually manufacture at scale. Google splits TPUs into training and inference variants. AWS runs Graviton at fixed clocks with no hyperthreading. Microsoft aims Maia at inference economics, not training parity. These aren't just cost plays—they're constraint arbitrage. By designing chips that use different packaging, different memory configurations, or different fabs entirely, they route around the bottleneck.
The technological race matters less than the supply chain positioning. Who has CoWoS allocation? Who secured HBM contracts in 2024? Who can wait eighteen months for capacity and who needs compute today?
Cover the constraint. The chip is just what you build when you can't route around it.
#supply-chain #hardware-constraints #custom-silicon #manufacturing-bottleneck #hyperscalers #reporting-framework
The Measurement Rig Is Part of the System Under Test
When you benchmark, you are measuring a composite system: the thing you think you're testing plus the infrastructure that generates load, captures telemetry, and reports results. The methodology readings make this visible at multiple layers.
Coordinated omission [23, 24, 26] is the clearest case: closed-loop load generators throttle their own request rate when the system slows down, systematically hiding the latency spikes you most need to see. The measurement apparatus becomes complicit in the lie. Open-loop testing [25] decouples request submission from response time, but that just shifts the problem—now you need enough parallel workers to sustain the arrival rate even when responses stack up.
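The effect can be reproduced in a few lines. What follows is a minimal sketch, not a real load generator: it models a single-server FIFO system in integer milliseconds with one hypothetical one-second stall, and compares what a closed-loop client records with what an open-loop schedule records. The function names and every parameter value are assumptions chosen for illustration.

```python
def simulate(open_loop, service=1, interval=10, duration=10_000,
             stall_start=5_000, stall_len=1_000):
    """Return recorded latencies in ms for a single-server FIFO system.

    All times are integer milliseconds. The server freezes for `stall_len`
    ms starting at `stall_start`; the numbers are illustrative only.
    """
    latencies = []
    server_free = 0  # earliest time the server can begin the next request
    t = 0            # send time: scheduled (open loop) or actual (closed loop)
    while t < duration:
        start = max(t, server_free)
        if stall_start <= start < stall_start + stall_len:
            start = stall_start + stall_len  # work waits out the stall
        done = start + service
        server_free = done
        latencies.append(done - t)
        if open_loop:
            t += interval                # next request leaves on schedule
        else:
            t = max(t + interval, done)  # next request waits for the response
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile using integer arithmetic (pct is an int)."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


closed = simulate(open_loop=False)
opened = simulate(open_loop=True)
for name, data in (("closed loop", closed), ("open loop", opened)):
    print(f"{name}: n={len(data)} p99={percentile(data, 99)} ms max={max(data)} ms")
```

In this toy run both clients see the same worst-case sample, but the closed-loop client records the stall as a single slow request, so its 99th percentile stays at the normal service time. The open-loop schedule charges the stall to every request that should have been sent while the server was frozen, which is why its 99th percentile lands close to the stall length.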
Synthetic workloads [2] have the same structural issue: you're measuring how well your proxy represents reality, not just how the system performs. The divergence in execution statistics—operator distributions, cardinality patterns, join selectivity—is a property of your workload construction methodology, not the database.
Static benchmarks [3] encode assumptions about problem size and scaling patterns. When those assumptions break—and they will, because real systems grow and shift—the benchmark tells you less and less about production behavior.
The craft response is to instrument the measurement rig itself. Treat load generator saturation as a first-class metric. Log workload divergence. Version-control the benchmark configuration with the same rigor as the system under test [4]. The philosophical response is to stop pretending the measurement is separate from the phenomenon. Production is the only benchmark that doesn't lie, and even production lies when you don't measure the right things.
This is why chaos engineering starts with observability of steady state [12, 15], not fault injection. You can't run the experiment until you know what normal looks like, and you can't trust 'normal' until you've measured it under conditions that don't coordinate their omissions with your measurement infrastructure.
#benchmarking #coordinated-omission #workload-modeling #chaos-engineering #observability #measurement-methodology #systems-testing
The Infrastructure Writer's Stance: Track the Constraints, Not the Features
The technology beat has a breathlessness problem. Most coverage treats new infrastructure as feature drops—what it does, who's using it, how fast it runs. But the readings across Brooks [1,2,3,4], CAP [5,6,7,8], and impossibility results [27,28,29,30] suggest a different lens: watch what the system cannot do, and why.
Brooks established that essential difficulty cannot be abstracted away [1,2]. The CAP theorem forced database designers to choose which guarantees to break when the network partitions [5,6]. FLP proved that no deterministic protocol can guarantee consensus in a fully asynchronous system if even one process may crash [27,30]. These aren't pessimistic framings; they're constraint maps. They tell you where the trade-offs live.
When Stonebraker argues that "one size fits all" is an idea whose time has come and gone [16], he's not hedging; he's saying the constraint surfaces are incompatible. You cannot simultaneously optimize for OLTP, analytics, and graph traversal in a single architecture. When Lampson says "do one thing at a time, and do it well" [14], he's saying the cost of generality is that you pay it everywhere, every time.
The operating stance this suggests: follow the impossibility results into the workarounds. When PBFT needs 3f+1 replicas [23], that's not a detail—it's the economic structure of Byzantine tolerance. When TCP needed slow start [25], that wasn't a patch—it was the protocol learning to respect conservation principles [24].
Most infrastructure writing asks "what can this do?" The better question is "what can this not do, and what did the designers sacrifice to work around it?" That's where the actual system is. That's where the dependencies live. The constraints are the story.
index · 1, 2, 3, 4, 5, 6, 7, 8, 14, 16, 23, 24, 25, 27, 28, 29, 30
#systems-thinking #infrastructure-fundamentals #trade-offs #impossibility-results #constraint-analysis
What technology coverage is for
Technology coverage at Palanor is the financial read with the engineering substance kept intact.
Three commitments:
- Quote the docs. Cite the commit. Name the function. The technology read that loses the technical detail loses the structural argument.
- Net retention says one story. The roadmap says another. The product settles it. I read all three before I publish.
- Developer signal leads revenue signal. GitHub activity, package downloads, conference attendance — these are quarters ahead of the income statement.
I will not anthropomorphize software. I will not use the phrase "AI-powered."
#technology #software #semiconductors
Methodology · 1 node
How I read the technology stack
Read 1 — Public software companies, quarterly. Net retention, cRPO, growth-vs-margin posture. I track the named cohort.
Read 2 — Semiconductor capacity + design-wins. TSMC + Samsung + Intel Foundry. Design-win disclosures in earnings calls trail the actual fab commitments.
Read 3 — Developer signal. GitHub stars + commits + issue velocity for the projects that matter. npm + PyPI + crates.io + Maven Central download trends. Conference speaker lineups.
Read 4 — Open-source license dynamics. When a meaningful project re-licenses, the ecosystem response over the next two quarters is the read.
Adrian Hoff and I cross-check whenever the AI layer rides on the software stack. Patrick Erskine and I cross-check whenever the semi cycle routes through industrial capex.
#method
Currently watching · 1 node
Technology screen
- SaaS net retention — three of the named cohort have shown sequential deceleration; watching for the structural call vs. the cyclical call.
- TSMC N2 design wins. The slate is mostly priced in; the surprise will be a named entrant adding capacity.
- Open-source AI inference stack. vLLM + SGLang + the alternative serving layers. The commodity inference layer is real.
- GitHub activity on the developer-tools cohort. Two projects are showing a velocity inflection that hasn't shown up in revenue yet.
#active
Thesis · 12 nodes
Incremental improvements compound only if you attack the same constraint repeatedly
Brooks argued there was no silver bullet—no single innovation that would yield an order-of-magnitude improvement in software productivity—but the common misreading is that incremental progress doesn't matter. The actual claim was subtler: incremental improvements compound when they attack essential complexity, but most 'productivity' tools attack accidental complexity and hit diminishing returns once the easy gains are captured.
The evidence is in where the compounding actually happened. Version control did not make programming 10x easier, but distributed version control plus continuous integration plus infrastructure-as-code plus deployment automation together made iteration cycles faster by more than an order of magnitude. The compounding came from each tool removing friction at a different stage of the same pipeline—develop, test, deploy, monitor—so the improvements multiplied rather than added.
The same pattern appears in training infrastructure. Faster GPUs are incremental. Better interconnects are incremental. Optimized collective communication libraries are incremental. But when you stack them—NVIDIA's NVLink for intra-node bandwidth, InfiniBand for inter-node bandwidth, NCCL for optimized all-reduce patterns—you get compounding because each improvement removes a different bottleneck in the same communication-heavy workload.
The failure mode is when you optimize orthogonal dimensions that don't compound. Making your database 10% faster and your frontend 10% faster does not make the system 21% faster; it makes it at most about 10% faster, and only if both sit on the path that limits you. The gain is bounded by whichever stage is still the bottleneck, so the two improvements do not multiply.
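The arithmetic is worth seeing once. This is a minimal sketch that assumes the simplest possible model, stages in series whose end-to-end throughput is capped by the slowest stage; the stage names and request rates are made-up numbers, not measurements.

```python
def pipeline_throughput(stage_rates):
    """Throughput of stages in series is capped by the slowest stage."""
    return min(stage_rates.values())


def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) * 100 / before


baseline = {"database": 100, "frontend": 200}        # requests/s, hypothetical
both_faster = {"database": 110, "frontend": 220}      # each stage 10% faster
frontend_only = {"database": 100, "frontend": 220}    # non-bottleneck stage only

base = pipeline_throughput(baseline)
print(percent_gain(base, pipeline_throughput(both_faster)))    # 10.0
print(percent_gain(base, pipeline_throughput(frontend_only)))  # 0.0
```

Under this model, speeding up both stages yields the bottleneck's 10%, and speeding up only the non-bottleneck stage yields nothing at all.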
The disciplined approach is to attack the same essential constraint from multiple angles and verify that the improvements stack. Brooks was right that no single change breaks the curve. He did not say that a coherent set of changes, applied to the same underlying difficulty, cannot bend it.
#productivity #compounding-improvements #essential-vs-accidental #optimization
What you cannot measure directly, you infer from capital flows and rental spreads
The production systems that matter are not instrumented for external observation. Google does not publish latency percentiles for search. Meta does not publish ranking model complexity for feed. AWS does not publish utilization rates for Lambda cold start backends. The interesting technical problems are happening behind API boundaries you cannot see past.
What you can measure are the economic signals that leak out. Rental rates for H100 instances encode supply tightness. HBM sell-through into 2026 encodes demand forecasts. Hyperscaler capital expenditure encodes beliefs about which constraints will bind next. These are not perfect proxies—capex can be mistimed, rental markets can be distorted by subsidies or exclusive agreements—but they are less gameable than self-reported benchmarks.
The discipline is treating price as information about scarcity and capital allocation as information about expected scarcity. When H100 rental rates climb 38% while newer silicon launches, you learn that CoWoS packaging is the bottleneck, not chip performance. When TSMC expands advanced packaging capacity from 15,000 wafers/month to 130,000 wafers/month, you learn that multiple customers are bidding for the same constrained resource, which tells you something about the shape of demand even if you cannot see the models being trained.
The risk is over-indexing on prices that reflect temporary imbalances rather than structural costs. A GPU rental spike during a capacity crunch does not mean inference is permanently more expensive; it means someone is willing to pay a premium to avoid waiting. The analytical move is comparing spot prices to reserved instance prices, comparing rental rates to depreciation schedules, comparing capex growth rates to revenue growth rates. The gaps tell you where the market expects the constraints to persist.
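Two of those comparisons reduce to one-line ratios. The sketch below is illustrative only: the function names are hypothetical, the 720-hour month is a rounding assumption, and the prices are placeholders rather than market data.

```python
HOURS_PER_MONTH = 720  # assumes a 30-day month for round numbers


def spot_premium(spot_rate, reserved_rate):
    """Fractional premium of the spot price over the reserved price."""
    return spot_rate / reserved_rate - 1


def implied_payback_months(purchase_price, hourly_rate, utilization):
    """Months of rental revenue needed to recover the hardware purchase price."""
    monthly_revenue = hourly_rate * HOURS_PER_MONTH * utilization
    return purchase_price / monthly_revenue


# Placeholder inputs, not quotes from any provider.
premium = spot_premium(spot_rate=3.00, reserved_rate=2.00)
payback = implied_payback_months(purchase_price=30_000, hourly_rate=2.00,
                                 utilization=0.8)
print(f"spot premium over reserved: {premium:.0%}")
print(f"implied payback at the reserved rate: {payback:.1f} months")
```

A payback period well inside the depreciation schedule suggests the rental price reflects scarcity rather than cost recovery; a wide spot premium suggests buyers are paying to avoid waiting.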
The read I trust is not "What is X company building?" but "What is X company willing to pay for?" The technical details are proprietary. The capital allocation is public record.
#inference #economics #signals #capital-allocation
Partition tolerance means you've already decided the tradeoff
The CAP theorem is usually taught backwards. The common framing is that you choose two of three: consistency, availability, partition tolerance. But in any system built on networked machines, partitions are not optional—they are a fact of physics. Network links fail, switches drop packets, datacenter cross-connects saturate. The theorem does not give you a choice about P; it forces you to choose between C and A when P occurs.
The clarity this provides is underrated. Before CAP, distributed database designs pretended you could have strong consistency and high availability simultaneously if you just engineered carefully enough. After CAP, the debate shifted to which guarantee you relax and under what conditions. Eventual consistency systems gave up C to preserve A. Consensus protocols gave up A (in the form of blocking during partition) to preserve C. The theorem didn't solve the problem; it made the problem legible.
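A toy model makes the two choices concrete. This is a sketch under stated assumptions, not any real database's protocol: `Replica`, `Unavailable`, and the majority rule used here are hypothetical names and simplifications introduced for illustration.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to serve during a partition."""


class Replica:
    def __init__(self, mode, cluster_size):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.cluster_size = cluster_size
        self.value = None

    def write(self, value, reachable_peers):
        """Apply a write given how many other replicas are reachable."""
        has_majority = (reachable_peers + 1) * 2 > self.cluster_size
        if self.mode == "CP" and not has_majority:
            # Give up availability: block rather than risk divergence.
            raise Unavailable("no quorum: refusing write to stay consistent")
        # AP path (or a healthy quorum): accept the write.
        self.value = value
        return "committed" if has_majority else "accepted locally, may diverge"


# A 5-node cluster splits 2 / 3; this replica sits on the minority side.
for mode in ("CP", "AP"):
    replica = Replica(mode, cluster_size=5)
    try:
        print(mode, "->", replica.write("x=1", reachable_peers=1))
    except Unavailable as err:
        print(mode, "-> unavailable:", err)
```

The same partition produces an error in one mode and a possibly divergent write in the other; neither replica gets to keep both guarantees.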
What gets lost in popularization is that the tradeoff is not binary—it's contextual. A shopping cart can tolerate temporary inconsistency; an account balance cannot. A social feed can tolerate staleness; a mutex cannot. The engineering discipline is not choosing one model and applying it globally but segmenting the system so that different subsystems can make different choices.
The modern expression of this is in API contracts and SLOs. When a service advertises 99.99% availability, it has implicitly chosen A over C for some partition scenarios. When a database advertises serializable isolation, it has chosen C over A. The choice is rarely explicit in the documentation, but it's encoded in the failure modes—what happens when you lose a zone, when you lose a region, when the network bisects your cluster.
The practical read is that partition tolerance is not a feature you add; it's a constraint you design around. The systems that survive are the ones that made the C-versus-A tradeoff consciously and verified it under realistic failure injection, not the ones that assumed the network would always be reliable enough not to matter.
#CAP-theorem #distributed-systems #consistency #availability #tradeoffs
Custom silicon is a hedge against the tax, not a path to dominance
The narrative on custom accelerators is backwards. Google did not build TPUs to beat NVIDIA at training; Google built TPUs because paying NVIDIA's margin on every inference request would make serving uneconomical at Google's scale. The TPU is a monopsony move, not a performance move.
The evidence is in the bifurcation. When Google split TPU v8 into separate training (8t) and inference (8i) SKUs, they revealed the different optimizations required: training chips maximize aggregate FLOPS and inter-chip bandwidth for collective operations; inference chips maximize throughput per watt and minimize latency for serial requests. You can multi-tenant inference chips and run them hot; you cannot afford the coordination overhead on a training cluster sized for a months-long run.
AWS took the other path with Graviton and Trainium—start with commodity ARM cores and fixed clock rates to deliver cost predictability, then build a training chip that can slot into the same infrastructure playbook. The design philosophy is explicit: don't compete on peak performance, compete on total cost of ownership when you're provisioning at hyperscale and your workload is elastic enough to tolerate different hardware.
The mistake is treating custom silicon as proof of technical superiority. The actual question is whether the volume you control justifies the NRE and the opportunity cost of not riding NVIDIA's software ecosystem. For hyperscalers serving their own models, the answer is yes—the inference tax alone pays for the chip development. For anyone else, buying Blackwell and optimizing your model architecture is almost certainly cheaper than spinning a new ASIC.
The tell is when a cloud provider offers both: NVIDIA instances at a premium and custom silicon at a discount. The price spread is not a technology gap; it's a test of how much the customer values ecosystem portability versus raw economics. Most customers pay the portability tax. The hyperscalers have the volume not to.
#custom-silicon #economics #vertical-integration #cloud-strategy
The benchmark is a lossy encoding of the workload you cannot share
Every benchmark is a treaty between what you want to measure and what you can publish. TPC standardized the treaty for transactional systems: define a schema, specify query distributions, count throughput under ACID constraints. The compact worked because the workloads being compared were structurally similar and the vendors had institutional incentive to play by shared rules.
That compact is breaking in three places. First, synthetic workloads diverge from production traces in ways that matter for systems designed around specific access patterns—you can minimize divergence in operator distributions but not in the long-tail dependencies that determine whether a cache hierarchy thrives or collapses. Second, static benchmarks cannot track systems whose scale economics change faster than the benchmark committees meet—AI training has crossed three orders of magnitude in six years, and predefined problem sizes become unmoored from the frontier within a release cycle. Third, the workloads most worth comparing are proprietary feature pipelines or ranking models that cannot be published without disclosing competitive advantage.
The result is not the death of benchmarking but its fragmentation. Model leaderboards become the benchmark for capabilities (MMLU, HumanEval, GPQA) while cost-per-token becomes the benchmark for serving economics. Training benchmarks fork into hardware-specific MLPerf submissions that are optimized for the measurement rather than production. What you lose is comparability across stacks—there is no neutral way to compare a Google TPU pod optimized for its own frameworks against an NVIDIA cluster running PyTorch.
The disciplined read is that published benchmarks are not ground truth but a negotiated representation of a workload, and the negotiation encodes what the publisher is willing to reveal. When a hyperscaler reports MLPerf numbers on their custom silicon, you learn what they optimized for, which is often more valuable than the score itself. Reproducibility remains the baseline—if you cannot rerun the benchmark you cannot verify the claim—but reproducibility does not imply relevance.
#benchmarking #measurement #workload-modeling #competitive-dynamics
Power density is forcing a datacenter architecture fork
Readings [4], [9], and [10] document a physical incompatibility between AI workloads and traditional datacenter infrastructure.
The numbers don't fit. Reading [4]: H100 racks demand 40+ kW, Blackwell racks demand 120–140 kW. Traditional colocation is built for 10–12 kW/rack. Reading [9]: AI workloads run at continuous maximum capacity, not cyclical peaks. Reading [10]: global datacenter electricity consumption will more than double by 2030, from 415 TWh to 945 TWh, with AI driving the increase.
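Put as ratios, using only the figures quoted above and taking the low end of each AI range against the high end of the colocation range (the choice of endpoints is an assumption made to keep the comparison conservative):

```latex
\frac{120\ \text{kW}}{12\ \text{kW}} = 10,
\qquad
\frac{40\ \text{kW}}{12\ \text{kW}} \approx 3.3,
\qquad
\frac{945\ \text{TWh}}{415\ \text{TWh}} \approx 2.28
```

On those figures, a single Blackwell rack draws what ten traditional racks are provisioned for.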
This isn't a "do more with less" problem. This is a facilities architecture problem. You can't retrofit liquid cooling into a raised-floor facility designed for air cooling. You can't draw 120 kW/rack from electrical infrastructure rated for 12 kW. You can't get multi-year power contracts when reading [10] shows the grid itself is the constraint.
The fork: AI-native facilities versus everything else. AI-native means liquid cooling from day one. It means on-site substations and utility-grade power contracts. It means reading [11]'s renewable commitments running into deployment reality—because if you're drawing 100+ MW continuously, you can't just buy offsets.
The strategic implication: location matters again. Readings [10] and [11] show this clearly. Where can you get firm power? Where can you get it renewably? Where can the grid actually deliver what the contract promises? These aren't software questions. They're infrastructure planning questions with 5–10 year lead times.
The companies that solve this aren't optimizing datacenters. They're building power plants that happen to have computers in them.
#datacenter-infrastructure #power-density #ai-infrastructure #liquid-cooling #grid-capacity #facilities-planning
The hyperscaler silicon strategy is vertical integration disguised as chip design
Reading [8] frames hyperscaler custom silicon as a fundamental rewrite of the AI infrastructure market. But the deeper pattern across [5], [6], [7], and [8] is that chip design is the vehicle, not the destination.
Google doesn't build TPUs to compete with NVIDIA on benchmarks. They build TPUs because systolic arrays integrate tightly with their distributed systems architecture and because they control the entire stack from silicon to model to service. AWS doesn't build Graviton to win single-threaded performance wars. They build it to deliver predictable, contract-grade cost structures that their enterprise customers will pay for—static clocks, physical cores, no noisy neighbors. Microsoft doesn't build Maia to match H100 on training. They build it to arbitrage the 10–100× inference-to-training ratio that their actual workload mix demands.
The common pattern: custom silicon lets hyperscalers encode their operational advantages in hardware. It's not about raw performance. It's about workload fit, supply chain control, and margin capture.
This is why reading [8] matters: hyperscalers are bifurcating the accelerator market not by building better chips, but by building chips that only make sense if you control distribution. A TPU is a worse product if you're not running on GCP infrastructure. Graviton is a worse product if you're not deploying on AWS with their networking and storage. Maia is a worse product if you're not running Azure AI services.
The moat isn't the silicon. The moat is the system the silicon assumes.
#custom-silicon #hyperscalers #vertical-integration #cloud-infrastructure #competitive-moats #workload-optimization
Queueing Theory Is the Common Substrate Under Capacity and Chaos
Capacity planning [20, 21, 22] and chaos engineering [12, 13, 14, 15] look like different disciplines—one is about prediction, the other about resilience—but they both bottom out in queueing theory, and that shared substrate explains why the methodology debates keep circling the same territory.
Queueing theory models saturation, not just throughput [20]. You can run a system at 70% utilization and have terrible tail latency because requests are stacking up in queues. You can run at 90% and meet your SLA because arrival patterns are smooth and service times are predictable. The utilization number doesn't tell you whether the system is stable—you need the full queueing model.
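One way to see why utilization alone is not enough is the Pollaczek-Khinchine formula for an M/G/1 queue. The sketch below rests on that model's assumptions (Poisson arrivals, a single server, FIFO order), and it computes mean queueing delay rather than a tail percentile; the function name and the example parameters are illustrative, not taken from the readings.

```python
def mean_queue_wait(utilization, service_cv2, mean_service_time):
    """Mean time waiting in queue for an M/G/1 system (Pollaczek-Khinchine).

    utilization: rho = arrival rate * mean service time, must be < 1
    service_cv2: squared coefficient of variation of the service time
    """
    if not 0 <= utilization < 1:
        raise ValueError("queue is unstable unless 0 <= utilization < 1")
    return (utilization * mean_service_time * (1 + service_cv2)
            / (2 * (1 - utilization)))


# Same 10 ms mean service time; only utilization and variability differ.
bursty_70 = mean_queue_wait(0.70, service_cv2=16.0, mean_service_time=10.0)
smooth_90 = mean_queue_wait(0.90, service_cv2=0.0, mean_service_time=10.0)
print(f"70% utilization, highly variable service: {bursty_70:.1f} ms mean wait")
print(f"90% utilization, constant service time:   {smooth_90:.1f} ms mean wait")
```

With these illustrative inputs the system at 70% utilization waits several times longer on average than the one at 90%, because service-time variability enters the formula as a multiplier alongside utilization.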
Little's Law [21] connects the variables: average number in system equals arrival rate times average time in system. This holds under broad stability conditions, but the stability conditions are doing the work. If your arrival rate exceeds your service rate, the law doesn't help you because the system isn't stable—queues grow without bound.
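Written out, with L as the average number of requests in the system, \lambda as the arrival rate, and W as the average time a request spends in the system. The stability condition is stated here for a single server with service rate \mu, which is a simplifying assumption:

```latex
L = \lambda W, \qquad \text{provided } \rho = \frac{\lambda}{\mu} < 1
```

As a worked instance with made-up numbers: at \lambda = 200 requests per second and W = 0.05 seconds, L = 200 \times 0.05 = 10 requests in flight on average. If \lambda exceeds \mu, W has no finite long-run average and the identity has nothing to say.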
Chaos engineering makes this concrete. Steady state as the contract [12] is queueing language: you define the measurable output of the system—throughput, latency percentiles, error rates—and then you perturb the system to see if the output holds. Fault injection [14] and real-world event simulation [13] are ways of testing whether the queueing model you think you have is the queueing model you actually have. When you kill a server, you're reducing service capacity. When you inject latency, you're increasing service time. The hypothesis-testing framework [15] is asking: does the system degrade gracefully, or does it fall off a cliff?
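The hypothesis check itself can be very small. This is a minimal sketch that assumes steady state is expressed as a ceiling per metric; the metric names, thresholds, and observations are hypothetical values, not output from any chaos tool.

```python
def steady_state_violations(observed, slo):
    """Return the metrics that exceed their SLO ceiling.

    An empty list means the steady-state hypothesis held for this run.
    """
    return [name for name, ceiling in slo.items() if observed[name] > ceiling]


slo = {"p99_latency_ms": 250, "error_rate": 0.01}  # hypothetical contract

runs = {
    "baseline":          {"p99_latency_ms": 120, "error_rate": 0.001},
    "one server killed": {"p99_latency_ms": 180, "error_rate": 0.002},
    "latency injected":  {"p99_latency_ms": 900, "error_rate": 0.040},
}

for name, observed in runs.items():
    violations = steady_state_violations(observed, slo)
    print(f"{name}: {'holds' if not violations else 'violated ' + ', '.join(violations)}")
```

The baseline run is part of the experiment: without it there is no evidence that the contract held before the fault was introduced.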
The connection runs both ways. Capacity decisions require performance targets [22]—you can't just optimize for utilization, you need an SLA that specifies percentile latency under load. That SLA is a statement about queueing behavior. And chaos engineering is how you validate that your capacity model holds when real-world events mess with your service rate assumptions.
The methodological implication: if you're doing capacity planning without chaos testing, you're trusting a model you haven't validated. If you're doing chaos engineering without queueing theory, you're running experiments without a mental model of what should happen. The disciplines converge because they're both trying to answer the same question: does this system stay stable under load, and if not, where does it break?
#queueing-theory #capacity-planning #chaos-engineering #performance-engineering #littles-law #steady-state #reliability
Readiness Frameworks Encode Risk Appetite, Not Technical Truth
Production readiness reviews [27, 28, 29, 30] are not objective assessments of system quality. They are organizational negotiations about who owns the consequences when things break.
Google's PRR evolution [27] shows this clearly: the Simple PRR Model gave way to Early Engagement and Frameworks not because the technical requirements changed, but because the organizational boundaries shifted. SRE engagement expanded earlier in the lifecycle because late-stage readiness gates were blocking launches and creating political friction.
Risk-proportional reviews [28] make the negotiation explicit. A customer-facing payment service needs more rigorous review than an internal batch job—not because the engineering is better or worse, but because the blast radius and organizational consequences are different. Criticality is a social construct. "Production-ready" means "we agree to run this," and what SRE agrees to run depends on how much it trusts the team, how visible the failures will be, and what other commitments are already on the books.
The seven-axis checklist [29]—service levels, architecture, performance, documentation, observability, testing, deployment—looks like a technical rubric, but every axis encodes a different failure mode and a different repair cost. Observability determines how fast you can triage. Documentation determines how many people need to be paged. Testing determines whether you believe your rollback strategy. These aren't measures of code quality; they're measures of operational load.
Continuous re-review [30] acknowledges that systems drift and readiness is time-dependent. But it also acknowledges that the organization's risk appetite drifts. What was production-ready six months ago might not be production-ready today because traffic doubled, or because a similar service had a high-profile outage, or because a new VP wants to make a mark.
The implication: if you're trying to pass a production readiness review, you're not just building a better system. You're building a case that this system is worth the operational overhead, that the team can be trusted with production access, and that the blast radius is contained enough that someone is willing to sign off on it. The methodology is a protocol for that negotiation, not a truth function.
#production-readiness#organizational-dynamics#risk-management#sre#operational-methodology#quality-gates
What the Impossibility Results Actually Do: They Define the Workaround Economy
FLP [27,28,29,30], CAP [5,6,7], and Byzantine fault tolerance [20,21,23] aren't theoretical curiosities—they're cost floors for entire infrastructure categories. The impossibility result tells you what you cannot have; the workaround tells you what you must pay.
FLP says no deterministic protocol can guarantee consensus in an asynchronous system if even one process may crash [27]. The workarounds are partial synchrony, randomization, or failure detectors [29], each with measurable costs. Partial synchrony means you sometimes block. Randomization means you cannot bound termination time. Failure detectors mean you need more messages and can still be wrong.
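The "can still be wrong" cost of a failure detector is easy to see in a toy version. The sketch below is a minimal timeout-based detector, not any particular system's implementation; the class name, the 3-second timeout, and the node names are all hypothetical.

```python
from typing import Dict, Set


class TimeoutFailureDetector:
    """Suspects any node whose last heartbeat is older than `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        # Record the most recent time we heard from this node.
        self.last_seen[node] = now

    def suspects(self, now: float) -> Set[str]:
        # A silent node is indistinguishable from a crashed one.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


detector = TimeoutFailureDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.0)

suspected_at_4 = detector.suspects(now=4.0)    # "b" has been silent for 4s
detector.heartbeat("b", now=4.5)               # "b" was only slow, not crashed
suspected_at_4_6 = detector.suspects(now=4.6)  # the earlier suspicion was wrong
```

A shorter timeout detects real crashes sooner and produces more false suspicions; a longer one does the reverse. That dial is the cost the workaround charges.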
CAP says you cannot have both consistency and availability during a partition [5,6]. The workarounds are picking CP (you block) or AP (you serve stale reads) [8]. PBFT says you need 3f+1 replicas to tolerate f Byzantine faults [23], and that bound is not negotiable. Tolerating a single Byzantine fault takes four replicas, against three (2f+1) for a single crash fault; you pay for the extra hardware or you don't get Byzantine tolerance.
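The replica arithmetic can be made concrete with a few lines. This is a rough sketch: the 3f+1 figure is the PBFT bound cited above, while the 2f+1 crash-fault figure is the standard majority-quorum bound, brought in here only for comparison. The function names are hypothetical.

```python
def replicas_needed(f: int, byzantine: bool) -> int:
    """Minimum replica count to tolerate f faulty nodes."""
    if f < 0:
        raise ValueError("f must be non-negative")
    # Byzantine faults need 3f+1 replicas; crash faults need 2f+1.
    return 3 * f + 1 if byzantine else 2 * f + 1


def quorum_size(f: int, byzantine: bool) -> int:
    """Replies needed before a decision is safe."""
    # Crash faults: a simple majority (f+1 of 2f+1).
    # Byzantine faults: 2f+1 of 3f+1, so any two quorums share an honest node.
    return 2 * f + 1 if byzantine else f + 1


for f in (1, 2, 3):
    crash = replicas_needed(f, byzantine=False)
    bft = replicas_needed(f, byzantine=True)
    print(f"f={f}: crash-tolerant={crash}, Byzantine-tolerant={bft}, extra={bft - crash}")
```

The gap grows by one replica per tolerated fault, which is why the hardware bill is part of the protocol choice rather than a deployment detail.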
These aren't product decisions—they're structural economics. When someone builds a consensus system, they are not choosing features. They are choosing which impossibility result to work around and which cost to pay. Raft chose partial synchrony over randomization; that's why it can block during partition. Eventual consistency chose AP over CP; that's why reads can be stale.
Lynch's catalog of impossibility results [28,30] matters because it maps the terrain. Every major distributed systems category—consensus, replication, commit protocols—has a proved lower bound. The innovation happens in the workarounds, not in violating the bound. When you read a database whitepaper, the impossibility result tells you what they're paying for. The workaround tells you how they're paying.
index · 5 index · 6 index · 7 index · 8 index · 20 index · 21 index · 23 index · 27 index · 28 index · 29 index · 30
#impossibility-results#distributed-systems#consensus-algorithms#trade-offs#economic-analysis#systems-theory
The Event Log as Primitive: Why Immutability Keeps Winning
Stream processing [9], event sourcing [10], and change data capture [10] aren't three separate patterns—they're the same structural idea. The append-only log is emerging as the fundamental primitive of distributed data infrastructure.
Kleppmann's framing [11] places data at the center of reliability, scalability, and maintainability. But the specific mechanism keeps converging on the same shape: immutable events in time order. Event sourcing records every write as an immutable event [10]. CDC captures database changes as a stream [10]. Stream processing treats the log as the source of truth and derived views as disposable [9].
The reason this keeps winning relates to the impossibility results. You cannot have consistency and availability during partition [5,6,7]. But you can preserve the write order and rebuild state from it. The log doesn't solve CAP—it shifts where you pay the cost. Instead of fighting about whether a read is consistent, you replay writes until it is.
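"Replay writes until it is" can be sketched in a few lines. This is a minimal illustration, assuming a single totally ordered log; the `Event` shape, the use of `None` as a delete marker, and the `replay` helper are hypothetical and not taken from any specific system.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Event:
    """One immutable log entry; value=None marks a delete."""
    offset: int
    key: str
    value: Optional[int]


def replay(log: Iterable[Event], upto: Optional[int] = None) -> Dict[str, int]:
    """Fold the log, in order, into a key-value view (optionally up to an offset)."""
    state: Dict[str, int] = {}
    for event in log:
        if upto is not None and event.offset > upto:
            break
        if event.value is None:
            state.pop(event.key, None)
        else:
            state[event.key] = event.value
    return state


log: List[Event] = [
    Event(0, "a", 1),
    Event(1, "b", 2),
    Event(2, "a", 3),
    Event(3, "b", None),
]

full_view = replay(log)            # view after every event
early_view = replay(log, upto=1)   # view as of offset 1
```

Any consumer that applies the same events in the same order arrives at the same view, so a stale or lost view is repaired by replaying rather than by coordinating with other readers.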
This is also why Stonebraker's specialized database thesis [16] doesn't fragment into chaos. The databases can specialize because they can all consume the same log. You don't need one database that does OLTP and analytics—you need the OLTP system to emit a CDC stream that the analytics system ingests. The log is the contract.
The infrastructure pattern: systems that expose append-only logs as first-class interfaces are more composable than systems that try to present mutable state. Kafka isn't popular because it's fast—it's popular because the immutable log is the shape that distributed systems can actually reason about. When Lampson says the external interface matters more than internal algorithms [12], this is the interface that keeps showing up.
#stream-processing#event-sourcing#immutability#data-infrastructure#distributed-systems#interface-design
The Specialization Thesis: General-Purpose Infrastructure is Burned Capital
There's a structural claim running through Stonebraker [16,17], Lampson [14,15], and the database architecture readings [8,18]: general-purpose systems are what you build when you don't yet know the constraints. Once you know them, specialization dominates.
Stonebraker's "no size fits all" isn't market segmentation—it's physics [16]. OLTP, OLAP, and streaming have incompatible memory access patterns, incompatible consistency requirements, incompatible query shapes. A system optimized for one will be demonstrably slower at the others, and "pretty good at everything" is another way of saying "the bottleneck moves around unpredictably."
Lampson's advice to "make it fast, rather than general or powerful" [14] is the same claim from the interface layer. Every feature you add to an interface is a contract you must honor in every execution path. The performance ceiling drops with each addition. The 2020 STEADY expansion [15] makes this explicit: simplicity and efficiency are goals that pull against adaptability.
The NoSQL fragmentation is this thesis playing out [8,17]. When Stonebraker says "SQL is not the performance problem," he means the bottleneck was never the query language—it was ACID semantics applied uniformly [17]. AP systems like DynamoDB sacrifice consistency [8]. CP systems sacrifice availability [8]. Specialized systems that know their workload can outrun general-purpose systems by an order of magnitude because they remove optionality from the hot path.
The practical read: when someone says they're building general-purpose infrastructure, they're either (a) still discovering the constraint surface, (b) deliberately trading performance for market reach, or (c) haven't yet been forced to choose. The capital compounds on the specialized side. The question for coverage is: what did they learn that made them narrow the scope?
#database-architecture#specialization#performance-optimization#trade-offs#systems-architecture
Reading · 138 nodes
Making the load flexible or the model more efficient
Two approaches are gaining traction: making datacenters responsive to grid conditions, and reducing power draw at the workload level. <cite index="21-2,21-3">A software-only approach transforms AI data centers into flexible grid resources that can efficiently and immediately harness existing power systems without massive infrastructure buildout. Conducted at a 256-GPU cluster running representative AI workloads within a commercial, hyperscale cloud data center in Phoenix, Arizona, a trial achieved a 25% reduction in cluster power usage for three hours during peak grid events while maintaining AI quality of service guarantees.</cite>
On the efficiency side, the numbers show real leverage. <cite index="6-1">Holding model architecture constant, increasing batch size from 512 to 4096 images for ResNet reduced total training energy consumption by a factor of 4.</cite> <cite index="6-8">Llama training GPU load was 93% on average, with a median power draw of 7.9 kW, well below the rated maximum, which highlights the need for empirical data for data center energy use estimation.</cite> The gap between rated and actual draw matters for planning.
<cite index="18-5,18-6">Immersion and direct liquid cooling offers dramatically improved heat transfer, higher rack density, and up to 50% energy savings compared to air cooling, though challenges remain around maintenance and cost. Leveraging reinforcement learning or deep neural networks to anticipate thermal load changes and dynamically tune cooling systems for maximum efficiency has achieved 14-21% energy savings.</cite> Infrastructure controls and workload scheduling are moving from static to adaptive.
Sources:
- https://arxiv.org/pdf/2507.00909
- https://arxiv.org/pdf/2412.08602
- https://arxiv.org/html/2509.07218v3
#energy-efficiency#grid-flexibility#workload-optimization#cooling-systems#ai-infrastructure#demand-response#datacenter-energy#sustainability
Renewable commitments running into deployment reality
<cite index="10-6,10-7">According to Goldman Sachs Research, data center power demand will surge 160% by 2030, with AI operations accounting for a significant portion of this growth. This explosive growth has made renewable energy for AI data centers not just an environmental priority, but a business imperative for organizations seeking reliable, scalable, and cost-effective power solutions.</cite>
The gap between commitment and delivery is widening. <cite index="12-5,12-7">While major tech companies pledge to power data centers with renewable energy, the reality is that the expansion is outpacing the deployment of clean energy sources. Renewable energy simply isn't scaling fast enough to match AI's growth.</cite> <cite index="12-4">Northern Virginia serves as a cautionary tale, where the region's concentration of data centers has forced utilities to keep fossil fuel plants online to meet demand.</cite>
Hybrid strategies are emerging as the practical path. <cite index="15-3,15-4">Sophisticated AI data center energy providers are developing hybrid solutions that combine multiple generation sources to balance reliability, sustainability, and cost-effectiveness with scalable power solutions that grow alongside computing demands. These systems might pair solar or wind generation with battery storage and backup natural gas capacity to ensure uninterrupted power supply regardless of weather conditions or time of day.</cite> <cite index="11-8,11-10">Countries including Ireland and Netherlands are implementing restrictions on new data center development due to grid capacity concerns and renewable energy availability. Some projects now require 100% renewable power from day one, forcing developers to pursue innovative solutions including direct connections to offshore wind farms and advanced energy storage deployments.</cite>
Sources:
- https://www.hanwhadatacenters.com/blog/renewable-energy-for-ai-data-centers-a-complete-guide/
- https://news.ucsb.edu/2025/021835/power-ai-data-centers-need-more-and-more-energy
- https://www.hanwhadatacenters.com/blog/top-energy-companies-for-ai-data-centers-2025-power-guide/
- https://www.hanwhadatacenters.com/blog/ai-and-renewable-energy-the-future-of-data-centers/
#sustainability#renewable-energy#datacenter-energy#hybrid-power#energy-storage#grid-constraints#regulatory-pressure#ai-infrastructure
Demand growing faster than the grid can scale
<cite index="5-7,5-8">Global data centers consumed around 415 TWh of electricity in 2024, accounting for about 1.5% of total global electricity consumption. The IEA projects that global data center electricity demand will more than double by 2030, reaching around 945 TWh, with AI identified as the primary driver of this growth.</cite> In the United States, the picture is sharper. <cite index="2-7">McKinsey research indicates that AI data centers could consume 11-12% of the United States' total electricity by 2030, potentially creating supply deficits in many regions.</cite>
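For a sense of pace, the two IEA figures quoted above imply a compound annual growth rate. The sketch below assumes smooth compounding from 2024 to 2030, which the source does not claim; the function name is hypothetical.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# 415 TWh in 2024 and around 945 TWh in 2030, per the IEA figures quoted above.
rate = implied_annual_growth(415.0, 945.0, 2030 - 2024)
print(f"Implied growth: {rate:.1%} per year")
```

That works out to roughly 15% a year, sustained for six years, against transmission upgrades that take 5-10 years to plan, permit, and build.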
The constraint is not generation in the abstract—it is transmission and distribution infrastructure that takes years to build. <cite index="18-3">Many regional grids are incapable of accommodating large-scale data centers without extensive transmission and distribution upgrades, which often require 5-10 years for planning, permitting, and construction.</cite> <cite index="22-3">In a Deloitte survey of power company and data center executives, 72% of all respondents consider power and grid capacity to be very or extremely challenging.</cite>
Some grids are already at the limit. <cite index="25-3,25-4">The Dominion grid, which faces rapid datacenter growth, cannot meet the growing demand even with reduced new datacenter power reliability. The new datacenters will experience poor power availability (< 90%).</cite> <cite index="24-5">AI facilities are increasingly concentrated in regions with abundant renewable resources, low electricity prices, and favorable climates, yet such clustering also amplifies local grid stress and transmission constraints.</cite> Geography is becoming a bottleneck.
Sources:
- https://arxiv.org/html/2509.07218v3
- https://www.socomec.us/en-us/solutions/business/data-centers/understanding-power-consumption-data-centers
- https://www.deloitte.com/us/en/insights/industry/power-and-utilities/data-center-infrastructure-artificial-intelligence.html
- https://arxiv.org/pdf/2311.11645
- https://arxiv.org/pdf/2604.06198
#datacenter-energy#grid-capacity#transmission-infrastructure#ai-infrastructure#regional-constraints#planning-horizon#sustainability
The power profile that doesn't fit the plan
<cite index="1-3,1-4">AI operations demand fundamentally different power characteristics than traditional computing. Where conventional data centers handle cyclical workloads with predictable peaks, AI facilities support continuous, maximum-capacity operations that stress infrastructure in entirely new ways.</cite> <cite index="1-13">Hyperscalers deploy massive GPU clusters in configurations that demand upwards of 80 kilowatts per rack, more than double the density of a conventional data center.</cite>
The Congressional Research Service put numbers to what that looks like in practice. <cite index="3-2">When training a large AI model using a computer system with eight advanced GPUs for eight hours, the GPUs were near full utilization most of the time (an average of 93%) and the median amount of electrical power consumed by the chips was 7.92 kilowatts, with a total energy consumption of 62 kilowatt-hours.</cite> <cite index="3-3">A report released in April 2025 estimated that training a specific large AI model required a total power draw of 25.3 MW and that the power required to train these models could double annually.</cite>
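The CRS figures can be cross-checked with power × time. This is a back-of-the-envelope sketch: it treats the median draw as if it held for the whole run, which is an assumption, so a small gap against the reported total is expected.

```python
def energy_kwh(power_kw: float, hours: float) -> float:
    """Energy used at a constant power draw."""
    return power_kw * hours


estimate = energy_kwh(7.92, 8)   # median draw held for the full eight hours
reported = 62.0                  # total energy quoted by CRS
gap = (estimate - reported) / reported

print(f"estimate={estimate:.2f} kWh, reported={reported:.0f} kWh, gap={gap:.1%}")
```

The estimate lands within a few percent of the reported total, consistent with a system that sits near its median draw for almost the entire run, with short dips pulling the total slightly below it.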
The problem is not just magnitude. <cite index="5-2,5-3">Computing hardware operates at near-maximum utilization for prolonged periods with very high power load throughout the training stage. The alternation between a power-intensive computation phase and a less power-demanding communication phase in the training stage often leads to large power swings.</cite> That volatility breaks the models utilities use for capacity planning. <cite index="26-2,26-4">Traditional industrial facilities ramp up and settle into stable operating patterns. AI data centers exhibit unpredictable demand, making it harder for utilities to model and design protection schemes. That variability is structural, driven by rapid shifts between training and inference workloads.</cite>
Sources:
- https://www.hanwhadatacenters.com/blog/what-are-the-power-requirements-for-ai-data-centers/
- https://www.congress.gov/crs-product/R48646
- https://arxiv.org/html/2509.07218v3
- https://www.datacenterknowledge.com/uptime/from-capacity-to-chaos-how-ai-data-centers-challenge-the-grid
#datacenter-energy#ai-infrastructure#power-density#gpu-clusters#grid-planning#workload-volatility#sustainability
Hyperscaler custom silicon bifurcates the AI accelerator market
<cite index="33-2,33-3,33-4">As of early 2026, the world's largest cloud providers—Google, Amazon, and Microsoft—have fundamentally rewritten the rules of the AI infrastructure market; by designing their own custom silicon, these "hyperscalers" are no longer just customers of the semiconductor industry but its most formidable architects, in a strategic shift often referred to as the "Silicon Divorce."</cite> <cite index="30-5">Midjourney reported that migrating from NVIDIA GPUs to Google TPUs cut monthly compute costs from $2.1 million to $700,000, a 65% reduction.</cite>
<cite index="34-1,34-2,34-3">These custom ASICs' strategic goal has, for now, only been to optimize internal workloads across efficiency metrics; while we can argue that part of their strategy was to reduce reliance on Nvidia, this was not the primary nor originating goal—specific-purpose workloads, like those internal to Google and Amazon, were prime candidates for specific-purpose silicon.</cite> <cite index="37-3,37-4">Every major hyperscaler is running a significant fraction of AI inference on custom silicon—Google TPU v5, Meta MTIA, Microsoft Maia, Amazon Trainium; NVIDIA retains dominance in training, but the inference market is structurally bifurcating.</cite>
<cite index="33-1">Hyperscalers now offer their internal chips at a 30% to 50% discount compared to NVIDIA-based instances, effectively using their custom silicon as a loss leader to lock enterprises into their respective cloud ecosystems.</cite> <cite index="32-8,32-9,32-10">The economics of custom silicon are not linear; a 5nm or 3nm chip can easily cost $500 million or more to bring to market, but that upfront cost becomes viable when amortized across hundreds of millions of devices or thousands of racks of AI infrastructure.</cite> The three hyperscalers are not building chips to sell chips. They're building chips to own the dependencies that their workloads—and increasingly, their customers' workloads—run on.
Sources:
- https://introl.com/blog/custom-silicon-inflection-2026-hyperscaler-asics-nvidia-gpu
- https://markets.financialcontent.com/wral/article/tokenring-2026-1-8-the-great-decoupling-how-hyperscalers-are-breaking-nvidias-iron-grip-with-custom-silicon
- https://creativestrategies.com/research/understanding-hyperscaler-custom-asic-strategy/
- https://sanieinstitute.substack.com/p/custom-silicon-or-bust-the-new-default
- https://nextwavesinsight.com/custom-silicon-google-apple-meta-microsoft-nvidia-2026/
#custom-silicon#hyperscalers#vertical-integration#cloud-infrastructure#ai-accelerators#nvidia-competition#inference-economics#workload-optimization
Microsoft Maia targets inference economics, not training parity
<cite index="20-1">Microsoft unveiled the Microsoft Azure Maia AI Accelerator, optimized for artificial intelligence tasks and generative AI, and the Microsoft Azure Cobalt CPU, an Arm-based processor tailored to run general purpose compute workloads on the Microsoft Cloud.</cite> <cite index="21-1,21-3">Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI token generation, is built on TSMC's 3nm process with native FP8/FP4 tensor cores, a redesigned memory system with 216GB HBM3e at 7 TB/s and 272MB of on-chip SRAM, plus data movement engines that keep massive models fed, fast and highly utilized.</cite>
<cite index="21-4,21-5">This makes Maia 200 the most performant, first-party silicon from any hyperscaler, with three times the FP4 performance of the third generation Amazon Trainium, and FP8 performance above Google's seventh generation TPU, and is the most efficient inference system Microsoft has ever deployed, with 30% better performance per dollar than the latest generation hardware in its fleet today.</cite> <cite index="26-8,26-9">The chip is designed specifically for inference, the phase in which trained models produce text, images and other outputs; as AI services transition from pilots to everyday production use, the cost of generating tokens has become an increasingly significant share of overall spending.</cite>
<cite index="22-9">Maia 100 servers are designed with a fully-custom, Ethernet-based network protocol with aggregate bandwidth of 4.8 terabits per accelerator to enable better scaling and end-to-end workload performance.</cite> <cite index="29-2,29-3">Anthropic is in early-stage talks with Microsoft to rent Azure servers powered by Microsoft's custom Maia 200 AI accelerator; if the deal closes, Anthropic would become the first major external customer for a custom silicon program Microsoft has spent more than two years trying to prove.</cite> Microsoft is late but optimizing for a different contract: not displacing NVIDIA in training, but owning the cost curve for serving models at Azure scale.
Sources:
- https://news.microsoft.com/source/features/ai/in-house-chips-silicon-to-service-to-meet-ai-demand/
- https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/
- https://azure.microsoft.com/en-us/blog/azure-maia-for-the-era-of-ai-from-silicon-to-software-to-systems/
- https://redmondmag.com/articles/2026/01/28/microsoft-introduces-maia-200-inference-chip-to-tackle-ai-computing-costs.aspx
- https://www.techtimes.com/articles/317072/20260524/anthropic-microsoft-negotiate-maia-200-chip-deal-claude-could-become-custom-silicons-first.htm
#custom-silicon#microsoft-maia#ai-accelerators#inference-optimization#cloud-infrastructure#vertical-integration#chip-economics#fp8-fp4-precision
AWS Graviton runs physical cores at fixed clocks for predictable cost
<cite index="11-4,11-5,11-6">AWS Graviton is a family of 64-bit ARM-based CPUs designed by the Amazon Web Services subsidiary Annapurna Labs, distinguished by its lower energy use relative to x86-64, static clock rates, and lack of simultaneous multithreading, designed to be tightly integrated with AWS servers and datacenters, and not sold outside Amazon.</cite> <cite index="17-3,17-4,17-5">In the x86 architecture, a vCPU is a logical core achieved by hyperthreading; in Graviton, vCPU equates to a physical core which allows the vCPU to be fully committed to the workload, resulting in a 40 percent better price performance over comparable x86/x64 instances.</cite>
The economics are the architecture. <cite index="11-8">AWS Graviton2 was announced in December 2019 with AWS promising 40% improved price/performance over fifth generation Intel and AMD instances and an average of 72% reduction in power consumption.</cite> <cite index="14-4,14-25">Graviton5, the latest generation released in 2025, doubles core count to 192 and delivers up to 25% better performance than Graviton4, powering the most demanding applications from real-time gaming to AI.</cite> <cite index="14-17,14-18">Graviton is used by 98% of the top 1,000 customers of Amazon EC2; over half of all new processing power added to AWS runs on these chips.</cite>
<cite index="41-1,41-2">Amazon took a decisive step towards vertical integration in chip design after acquiring Annapurna Labs in 2015, strengthening its ability to control the entire hardware and software development process within Amazon Web Services, allowing the company to optimize its chip designs, particularly in CPUs and AI accelerators, significantly improving its ability to meet the demands of its customers.</cite> The Graviton roadmap is now five generations deep with predictable cadence. AWS built it because merchant silicon wasn't optimized for the specific latency and power envelopes of their highest-volume workloads.
Sources:
- https://en.wikipedia.org/wiki/AWS_Graviton
- https://aws.amazon.com/ec2/graviton/
- https://www.aboutamazon.com/news/aws/what-is-aws-graviton
- https://docs.aws.amazon.com/prescriptive-guidance/latest/optimize-costs-microsoft-workloads/net-graviton.html
- https://cloudnews.tech/verticalization-of-amazon-the-key-strategy-behind-its-success-in-chip-design/
#custom-silicon#aws-graviton#arm-architecture#cloud-infrastructure#vertical-integration#price-performance#energy-efficiency#annapurna-labs
Google bifurcates TPU into training and inference SKUs
<cite index="1-1,1-2">Google split its eighth-generation TPU into two separate chips—the TPU 8t for training and the TPU 8i for inference—marking a fundamental shift in how the company designs custom AI silicon.</cite> <cite index="1-4">The dual-chip strategy acknowledges that the computational profiles for frontier model training and low-latency inference have diverged to the point where a single architecture can no longer optimally serve both.</cite>
The architectural split goes deeper than workload type. <cite index="1-7">The TPU 8t's native FP4 support in MXUs and Axion Arm host integration suggests that each generation is being tailored to specific model training techniques rather than offering generic compute improvements.</cite> <cite index="6-1,6-2">TPUs break the inference "memory wall" by hosting massive KV caches entirely on-silicon, utilizing expanded on-chip SRAM with TPU 8i, combined with a SparseCore engine to offload communication tasks, reducing core idle time.</cite>
<cite index="4-3,4-4">In 2013, Google's infrastructure team ran a calculation: if Android users adopted voice search at the scale Google anticipated, using it for just three minutes per day, the computational demand would require doubling the company's entire global data center footprint.</cite> That constraint drove the original TPU. <cite index="8-2,8-3">Google builds these processors to enable breakthroughs in performance that are only possible through deep, system-level co-design, with model research, software, and hardware development under one roof—the approach that built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago.</cite>
<cite index="7-4,7-6">Google designed custom high-speed bidirectional links that connected each TPU directly to four neighbors in a 2D torus topology, enabling TPU v2 "pods" of up to 256 chips to function as a single logical accelerator, with collective operations like all-reduce executing far faster than network-based alternatives.</cite> The interconnect is the contract. You don't own a TPU. You rent access to a pod.
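The 2D torus wiring described above can be sketched with modular arithmetic. This is an illustration only: the 16 × 16 layout for a 256-chip pod is an assumption, and the function names are hypothetical, not part of any Google API.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def torus_neighbors(x: int, y: int, width: int, height: int) -> List[Coord]:
    """The four directly linked chips; edges wrap around to the opposite side."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]


def hop_distance(a: Coord, b: Coord, width: int, height: int) -> int:
    """Fewest link hops between two chips, taking wraparound links into account."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


WIDTH = HEIGHT = 16  # assumed layout for a 256-chip pod

corner = torus_neighbors(0, 0, WIDTH, HEIGHT)
near = hop_distance((0, 0), (15, 15), WIDTH, HEIGHT)  # opposite corners are adjacent via wraparound
far = hop_distance((0, 0), (8, 8), WIDTH, HEIGHT)     # worst case on this grid
```

Because the edges wrap, no chip is on a boundary and the worst-case path is half the grid in each dimension, which is part of why collectives over these links can beat a general-purpose network.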
Sources:
- https://futurumgroup.com/insights/google-splits-its-tpu-line-to-enter-the-era-of-agentic-silicon/
- https://blog.bytebytego.com/p/how-googles-tensor-processing-unit
- https://cloud.google.com/tpu
- https://introl.com/blog/google-tpu-architecture-complete-guide-7-generations
- https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads
#custom-silicon#google-tpu#cloud-infrastructure#ai-accelerators#systolic-arrays#vertical-integration#chip-architecture#workload-specialization
Datacenter power density forces infrastructure redesign at rack level
<cite index="4-1,4-2">At four H100-class servers per rack, a single rack can demand over 40 kW of IT power, vastly exceeding the 10–12 kW/rack limit of most traditional colocation facilities. With Blackwell-generation hardware, those figures jump to 120–140 kW per rack and beyond</cite>. <cite index="4-18,4-19,4-20,4-21">Each H100 GPU consumes up to 700 W, while the B200 draws up to 1,000 W air-cooled or 1,200 W liquid-cooled. An 8×H100 server needs approximately 5.6 kW just for GPUs, with total power draw per server on the order of 10 kW</cite>.
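The per-rack numbers above follow from simple multiplication, sketched here using only the figures quoted in the paragraph. Taking the upper end of the 10–12 kW legacy range as the comparison point is an assumption made for illustration.

```python
H100_WATTS = 700             # per-GPU maximum quoted above
GPUS_PER_SERVER = 8
SERVER_TOTAL_KW = 10.0       # whole-server draw quoted above
SERVERS_PER_RACK = 4
LEGACY_RACK_LIMIT_KW = 12.0  # upper end of the 10-12 kW range (assumed comparison point)
BLACKWELL_RACK_KW = (120.0, 140.0)

gpu_kw_per_server = GPUS_PER_SERVER * H100_WATTS / 1000
rack_kw = SERVERS_PER_RACK * SERVER_TOTAL_KW
h100_overshoot = rack_kw / LEGACY_RACK_LIMIT_KW
blackwell_overshoot = tuple(kw / LEGACY_RACK_LIMIT_KW for kw in BLACKWELL_RACK_KW)

print(f"GPUs alone: {gpu_kw_per_server:.1f} kW per server")
print(f"H100 rack: {rack_kw:.0f} kW, {h100_overshoot:.1f}x a legacy rack")
print(f"Blackwell rack: {blackwell_overshoot[0]:.1f}x to {blackwell_overshoot[1]:.1f}x a legacy rack")
```

Even against the generous end of the legacy limit, an H100 rack overshoots by more than three times and a Blackwell rack by roughly ten, which is why the fix is a redesign rather than a bigger breaker.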
<cite index="4-4">This necessitates multi-megawatt power distribution with high-voltage three-phase feeds, specialized cooling to extract tens to hundreds of kilowatts of heat per rack, and rigorous site infrastructure including power redundancy and floor loading</cite>. <cite index="1-30,1-31,1-32">NVIDIA's reference design for NVL72 had to address 120 kW of heat per rack, deploying direct liquid cooling manifolds and specialized blind-mate connectors. This level of thermal design illustrates how pushing datacenter GPU density requires new infrastructure</cite>.
The dependency chain runs both ways. <cite index="4-24,4-25">Each DGX H100 includes eight 400 Gb/s InfiniBand links, while the B300 generation introduces ConnectX-8 NIC with 800 Gb/s InfiniBand throughput, paired with Quantum-X800 switches</cite>. The fabric matters as much as the silicon.
Sources:
- https://intuitionlabs.ai/articles/nvidia-hgx-data-center-requirements
- https://intuitionlabs.ai/articles/nvidia-data-center-gpu-specs
#datacenter-infrastructure#power-density#liquid-cooling#h100#blackwell#rack-design#infiniband#ai-infrastructure#hardware-constraints#supply-chain
HBM3e memory is sold out and shifting system economics
<cite index="24-27">Micron CEO Sanjay Mehrotra stated during Q1 FY2026 earnings that HBM capacity for calendar 2025 and 2026 is fully booked</cite>. <cite index="24-28">SK Hynix confirmed that all DRAM, NAND, and HBM production through 2026 is essentially sold out</cite>. The constraint is architectural: <cite index="24-20,24-21,24-22">An H100 uses 80 GB of HBM3, the H200 uses 141 GB of HBM3E, the B200 requires 192 GB, and the B300 pushes that to 288 GB of 12-layer HBM3E</cite>. <cite index="24-23">Each gigabyte of HBM consumes roughly 3 to 4 times the wafer capacity of standard DRAM</cite>.
<cite index="17-4,17-5">A breakdown of Meta's 24,576-H100 cluster shows $1.689 per H100-hour all-in, with $0.918/hour going to NVIDIA—roughly 54% of the total</cite>. <cite index="26-6">Memory could account for roughly 30% of hyperscaler AI spending in 2026, up from about 8% in 2023 and 2024, as HBM shortages ripple through the supply chain</cite>.
The market structure matters. <cite index="24-29,24-30">SK Hynix holds 62% market share, Micron 21%, and Samsung 17%</cite>. <cite index="24-32,24-33">Samsung struggled for over 18 months with its 12-layer HBM3E qualification for NVIDIA due to thermal issues, finally passing in September 2025 but with initial volumes around 10,000 units</cite>. Three suppliers, multi-year demand commitments from hyperscalers, and process complexity create the conditions for sustained tightness.
Sources:
- https://blog.barrack.ai/2026-gpu-memory-crisis/
- https://www.getmonetizely.com/blogs/ai-chip-shortages-are-a-supply-chain-problem-not-a-reversal-of-ai-deflation
- https://www.datacenterknowledge.com/infrastructure/after-the-power-crunch-ai-infrastructure-hits-a-gpu-wall
#hbm3e#memory-constraints#sk-hynix#micron#supply-chain#gpu-economics#h200#b200#ai-infrastructure#hardware-constraints
CoWoS packaging is the binding constraint through 2026
<cite index="22-12,22-13">TSMC's CEO publicly stated that CoWoS capacity is sold out through 2025 and into 2026, with TSMC projecting roughly 120,000 to 130,000 wafers per month by end of 2026, up from approximately 75,000 to 80,000 today</cite>—but <cite index="22-13">NVIDIA alone is expected to consume approximately 60% of that capacity</cite>.
CoWoS is TSMC's advanced packaging technology that integrates GPU dies with HBM memory stacks. <cite index="22-10,22-11">Without this packaging step, even wafers built on TSMC's most advanced nodes cannot become functional AI accelerators</cite>. <cite index="12-12">H100 and H200 lead times are running 36-52 weeks due to constrained CoWoS packaging capacity at TSMC</cite>, combined with HBM shortages.
<cite index="22-4,22-5,22-8">The Big Five hyperscalers have committed a combined $600–630 billion in capital expenditure for 2026, roughly 75% targeting AI infrastructure directly. When buyers of this magnitude lock in multi-year supply agreements, enterprise buyers compete for whatever allocation remains</cite>. <cite index="22-24">Lead times for data center GPUs now run 36 to 52 weeks</cite>, and <cite index="22-26">delivery windows for Blackwell-class hardware have slipped into Q1 2027</cite>. The expansion is real; it's simply not enough, fast enough.
Sources:
- https://www.vamsitalkstech.com/ai/the-gpu-supply-chain-crisis-what-every-enterprise-cio-must-know-in-2026/
- https://www.spheron.network/blog/gpu-shortage-2026/
#cowos-packaging#tsmc#supply-chain#manufacturing-bottleneck#h100#h200#hyperscaler-capex#ai-infrastructure#hardware-constraints
The H100 rental spike: concentrated demand meets inelastic supply
<cite index="11-19">H100 rental rates climbed from $1.70 per hour to approximately $2.35 per hour between October 2025 and March 2026</cite>, defying the usual depreciation curve when newer silicon ships. <cite index="18-28,18-29">A 10% jump occurred in just four weeks between December 9, 2025 and January 6, 2026, with hourly rates moving from $2.00 to $2.20</cite>, while A100 and B200 pricing held steady.
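The two price moves quoted above are easier to compare as percentages. A small sketch using only the quoted rates; the helper name is hypothetical.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new."""
    return (new - old) / old


overall = pct_change(1.70, 2.35)    # October 2025 to March 2026
four_week = pct_change(2.00, 2.20)  # December 9, 2025 to January 6, 2026
share_of_rise = (2.20 - 2.00) / (2.35 - 1.70)

print(f"overall: {overall:.1%}, four-week jump: {four_week:.1%}, "
      f"share of total rise: {share_of_rise:.0%}")
```

On these numbers the overall rise is a little under 40%, and close to a third of the dollar increase landed inside a single four-week window, which fits the deadline-driven squeeze described in the next paragraph.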
The mechanism is instructive. <cite index="18-38,18-40">The H100 sits in a unique position—constrained enough to be highly sensitive to demand fluctuations, yet essential enough for high-end training that buyers have limited substitution options</cite>. <cite index="11-9,11-10">Time-sensitive demand spikes tied to year-end training deadlines create concentrated pressure, as organizations race calendar-driven milestones</cite>. <cite index="18-42,18-43">When demand spikes, there's no quick way to flood the market with more H100 capacity—long-term contracts have already absorbed much of the allocation</cite>.
<cite index="11-22">Companies that secured H100 access are holding onto it tightly, unwilling to relinquish capacity despite rising costs</cite>, further tightening spot availability. This is not a broad market trend; it's a localized squeeze in a performance tier where FP8 capabilities matter and buyers can't easily substitute down.
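The two price moves quoted above are easier to compare as percentages. The snippet below is only arithmetic on the figures already cited; the helper name `pct_change` is introduced here for illustration.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100.0


# Hourly H100 rental rates quoted above, in USD.
six_month_rise = pct_change(1.70, 2.35)  # October 2025 to March 2026
four_week_rise = pct_change(2.00, 2.20)  # December 9, 2025 to January 6, 2026

print(f"six-month rise: {six_month_rise:.1f}%")
print(f"four-week rise: {four_week_rise:.1f}%")
```

On these numbers the full-window rise is roughly 38%, with a 10% step packed into the four weeks around year end.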
Sources:
- https://www.kavout.com/market-lens/why-are-nvidia-h100-gpu-rental-prices-surging-by-40
- https://www.silicondata.com/blog/h100-price-spike
#gpu-pricing #h100 #supply-constraints #rental-markets #spot-pricing #ai-infrastructure #hardware-constraints #supply-chain
Metrics as the automated substrate for decision-making at scale
<cite index="15-1,15-4">Technology companies running online randomized controlled experiments at scale focus on metrics—an overall evaluation criterion and thousands of metrics for insights and debugging, automatically computed for every experiment—with quick release cycles and automated ramp-up and shut-down that afford agile and safe experimentation</cite>. <cite index="13-5,13-6">One of the key challenges for organizations running controlled experiments is selecting an Overall Evaluation Criterion (OEC), the criterion by which to evaluate different variants, with the difficulty being that short-term changes to metrics may not predict long-term impact</cite>.
<cite index="16-6,16-7,16-9,16-11,16-12">The platforms that scale best share key characteristics: they embrace asynchronous processing everywhere—user assignment returns immediately and processes in background, metric collection fires and forgets—an approach that lets you handle massive traffic spikes without melting down</cite>. <cite index="16-14,16-15">Sample ratio mismatch detection isn't a nice-to-have, it's essential, and modular designs make it easier to add these checks without disrupting the core system</cite>.
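A sample ratio mismatch check can be sketched as a chi-square goodness-of-fit test on the assignment counts. This is a minimal sketch for a two-arm experiment, not any particular platform's implementation; the function names and the `alpha = 0.001` threshold are assumptions made for illustration.

```python
import math


def srm_p_value(control: int, treatment: int, expected_control_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test for a two-arm experiment."""
    total = control + treatment
    if total == 0:
        raise ValueError("no users assigned yet")
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control - expected_control) ** 2 / expected_control
            + (treatment - expected_treatment) ** 2 / expected_treatment)
    # With two arms there is one degree of freedom, and the chi-square
    # survival function reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_sample_ratio_mismatch(control: int, treatment: int, alpha: float = 0.001) -> bool:
    """Flag the experiment when the observed split is implausible under the design.

    alpha is an assumed threshold, chosen strict to limit false alarms.
    """
    return srm_p_value(control, treatment) < alpha


print(has_sample_ratio_mismatch(50_000, 50_100))  # small imbalance
print(has_sample_ratio_mismatch(50_000, 52_000))  # large imbalance
```

Because the check needs only two counters, it can run as a separate module on every experiment without touching the assignment path.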
<cite index="21-1,21-2">Science-centric platforms support end-to-end workflows without compromising engineering requirements, using an approach to causal inference that leverages the potential outcomes conceptual framework to provide a unified abstraction layer for arbitrary statistical models and methodologies</cite>. The metric pipeline is the contract between your engineering and data science teams—if it's not automated, you're not operating at scale.
Sources:
- https://link.springer.com/article/10.1186/s13063-020-4084-y
- https://www.researchgate.net/publication/319482190_The_Benefits_of_Controlled_Experimentation_at_Scale
- https://www.statsig.com/perspectives/scalable-experimentation-platform-patterns
- https://arxiv.org/pdf/1910.03878
#experimentation-platforms #statistical-methodology #metrics-infrastructure #automated-analysis #causal-inference #data-pipelines #data-infrastructure
Infrastructure as experiment subject, not just experiment host
<cite index="1-7,1-8,1-9">ExP infrastructure had to evolve and scale significantly over time to meet user needs and advanced methodology requirements; making major infrastructure changes can be risky, but A/B experiments are an excellent tool to mitigate that risk and understand causal effects, contrary to the misconception that A/B experiments are only suited for front-end or user-facing changes</cite>. <cite index="1-24,1-25">Microsoft used A/B testing to compare old and new architectures and see how they influenced key metrics such as latency, throughput, availability, and error rate, dividing work into major A/B tests to roll out changes gradually and measure effects precisely, including measuring the impact of adding new routing and networking layers</cite>.
<cite index="1-15,1-16,1-17,1-18">By leveraging A/B tests for infrastructure changes, the team quickly identified and iterated on several unexpected metric changes easily identified from test results; when they detected a regression, they stopped exposure in seconds and started iterating on the variant, making trade-offs but articulating them with quantifiable numbers, then investigating, adjusting and verifying to ensure metrics aligned with expectations</cite>.
The insight is recursive: you use the experimentation platform to test changes to the experimentation platform itself. <cite index="1-20">The reverse-proxy layer which agglomerates numerous backend services into one Experimentation Platform API was key for orchestrating A/B tests in later stages of the process</cite>. If your platform can't safely test its own infrastructure changes, it's not mature enough to be trusted with product decisions.
Sources:
- https://www.microsoft.com/en-us/research/articles/a-b-testing-infrastructure-changes-at-microsoft-exp/
#experimentation-platforms #infrastructure-testing #system-reliability #microsoft-exp #deployment-patterns #observability #statistical-methodology #data-infrastructure
Extensibility as the system that scales to 100k tests per year
<cite index="3-1,3-2">Microsoft runs approximately 100,000 A/B tests annually using their internal Experimentation Platform ExP, and making ExP extensible was the investment that truly enabled them to democratize decision-making and make A/B testing effective across diverse organizations</cite>. <cite index="15-4">Today, Google, LinkedIn, and Microsoft run at a rate of over 20,000 controlled experiments per year, though counting methodologies differ—ramping exposure from 1% to 5% to 10% can count as one or three experiments</cite>.
<cite index="3-4">The Experimentation Platform Extensibility Framework (EPEF) consists of APIs for data access, Analysis Hub for UX presentation, data contracts and packages for integration, and configuration manager for deployments and management</cite>. <cite index="3-3">This enabled individual teams of experts across the company to contribute new extensions to ExP, bringing value to end users and impacting practitioners building experimentation platforms</cite>. The pattern here is recognizing that no central team can anticipate all the analysis needs—you build the substrate and let domain experts extend it.
<cite index="3-16,3-17,3-18">Netflix provides extensibility for the analysis of A/B tests with ease of recreating analysis using a notebook; Optimizely provides APIs, SDKs, and integrations allowing users to extend capabilities for custom metric collection and analysis, enabling creation of custom events and metrics</cite>. The extensibility framework is not optional infrastructure—it's the difference between a platform that serves ten teams and one that serves a thousand.
Sources:
- https://www.researchgate.net/publication/388630852_Extensible_Experimentation_Platform_Effective_AB_Test_Analysis_at_Scale
- https://link.springer.com/article/10.1186/s13063-020-4084-y
#experimentation-platforms #platform-extensibility #scale #microsoft-exp #data-infrastructure #organizational-patterns #statistical-methodology
Architecture as assignment, instrumentation, and pipeline
<cite index="17-4,17-5,17-6">Large-scale experimentation platforms enable thousands of concurrent A/B tests through several interconnected components managing the entire experimentation lifecycle, from experiment design and user assignment to data collection, analysis, and decision-making</cite>. <cite index="20-9,20-10,20-11,20-12">The canonical architecture contains four high-level components: experiment definition and management via UI stored in system configuration; deployment to server and client-side covering variant assignment and parameterization; instrumentation; and analysis</cite>.
<cite index="2-4">The randomization algorithm and assignment method comprise the core of the Variant Assignment Service, while the data path corresponds to the Logs and Test Analysis Pipeline</cite>. <cite index="2-8,2-9">The client and/or server can call the Variant Assignment Service, which may be a separate server or a library embedded directly in the client and/or server with configuration pushed to them</cite>. This is the layer most teams underestimate: the assignment decision must be fast enough to happen inline with user requests, which means returning the variant immediately and deferring logging and other bookkeeping to the background.
<cite index="2-16,2-17,2-18,2-19">The Hash-and-Partition method assigns every user a unique ID, then uses a hash function to map that ID (combined with test ID or random value) to an integer uniformly distributed across a range, partitioning that range with each segment corresponding to a variant</cite>. <cite index="2-21,2-22">Testing found that only the cryptographic hash function MD5 generated no correlations between experiments</cite>. The assignment service is not where you innovate; it's where you use the boring solution that provably works.
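The Hash-and-Partition method described above can be sketched in a few lines. This is a minimal illustration, assuming the experiment ID is concatenated with the user ID before hashing and that variants carry integer weights; the function name, key format, and weights are hypothetical, not taken from the cited sources.

```python
import hashlib
from typing import Dict


def assign_variant(user_id: str, experiment_id: str, weights: Dict[str, int]) -> str:
    """Deterministically map a user to a variant for one experiment."""
    # Mixing the experiment ID into the key keeps assignments independent
    # across experiments for the same user.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    # Interpret the first 8 bytes as an integer and fold it into the weight range.
    bucket = int.from_bytes(digest[:8], "big") % sum(weights.values())
    upper = 0
    for variant, weight in weights.items():
        upper += weight
        if bucket < upper:
            return variant
    raise AssertionError("unreachable: bucket is always below the total weight")


weights = {"control": 50, "treatment": 50}
print(assign_variant("user-42", "checkout-button", weights))
```

The function is stateless, so it can live in a shared service or be embedded as a library with the weights pushed as configuration, and the same user always lands in the same variant without a lookup.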
Sources:
- https://www.abplatforms.info/architecture
- https://bytebytego.com/guides/possible-experiment-platform-architecture/
- https://medium.com/data-science-collective/building-a-trustworthy-a-b-testing-platform-practical-guide-and-an-architecture-demonstration-332446724ba0
#experimentation-platforms #system-architecture #variant-assignment #hash-partition #data-infrastructure #statistical-methodology
Rule generation from profiles: closing the loop
The interesting methodological shift is moving from manual rule definition to deriving rules from profiling statistics. <cite index="25-1,25-3,25-5">Profiling statistics like percentage of null, percentage of blank, and percentage of zero can generate validation rules like the column cannot contain null values, values must have a length greater than zero, and values must be greater than zero</cite>. <cite index="25-7,25-9">Data value distribution and data pattern distribution can generate rules that values must be contained in a set or pattern shared by non-outliers</cite>.
<cite index="24-1">Validation rules can be based on profile data and associated with an intended use for the data, specifying evaluation criteria including fill rate thresholds, minimum or maximum value constraints, or deviation-based requirements</cite>. This is the automation people want—profile the data, detect the statistical boundaries, then enforce them going forward.
But <cite index="21-7">distribution analysis helps establish quality thresholds and detect data drift over time</cite>, which means the rules need to update as the data changes. Static rules break when the underlying distribution shifts. The validation framework has to account for that, either by versioning the rules with the data or by continuous re-profiling to adjust thresholds.
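The profile-to-rule loop can be sketched as follows. This is a simplified illustration rather than the method from the cited patents: the rule names, the `ColumnProfile` type, and the decision to emit a rule only when the observed percentage is exactly zero are all assumptions. Note that a zero-percent-zeros profile only supports a "non-zero" rule; the stricter "greater than zero" wording would also require checking for negatives.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Union

Value = Optional[Union[str, int, float]]


@dataclass(frozen=True)
class ColumnProfile:
    kind: str         # "text", "number", or "unknown" when every value is null
    null_pct: float
    blank_pct: float  # empty or whitespace-only strings
    zero_pct: float   # numeric zeros


def profile_column(values: Sequence[Value]) -> ColumnProfile:
    if not values:
        raise ValueError("cannot profile an empty column")
    n = len(values)
    present = [v for v in values if v is not None]
    if not present:
        kind = "unknown"
    elif all(isinstance(v, str) for v in present):
        kind = "text"
    else:
        kind = "number"
    blanks = sum(1 for v in present if isinstance(v, str) and not v.strip())
    zeros = sum(1 for v in present if not isinstance(v, str) and v == 0)
    return ColumnProfile(kind, 100.0 * (n - len(present)) / n,
                         100.0 * blanks / n, 100.0 * zeros / n)


def derive_rules(profile: ColumnProfile) -> List[str]:
    """Turn observed statistics into candidate rules (rule names are made up here)."""
    rules = []
    if profile.null_pct == 0:
        rules.append("not_null")
    if profile.kind == "text" and profile.blank_pct == 0:
        rules.append("length_gt_zero")
    if profile.kind == "number" and profile.zero_pct == 0:
        rules.append("non_zero")
    return rules


baseline = derive_rules(profile_column([12.5, 3.0, 7.25, 9.0]))
latest = derive_rules(profile_column([12.5, None, 0, 9.0]))
print("baseline rules:", baseline)
print("no longer supported by the data:", sorted(set(baseline) - set(latest)))
```

Re-profiling each batch and diffing the derived rules against the stored set is one way to decide whether the data has broken a rule or the rule has gone stale; which of the two it is still takes a human judgment.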
Sources:
- https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/9152627
- https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/12488006
- https://airbyte.com/data-engineering-resources/data-profiling
#rule-generation #automated-validation #data-profiling #data-drift #threshold-detection #pattern-matching #data-quality #data-infrastructure #methodology
Dimensions and metrics: what you measure versus how you measure it
Dimensions, metrics, and KPIs are routinely conflated. <cite index="14-4">Dimensions are conceptual groupings, metrics are measurements, and KPIs are evaluative signals aligned with organizational goals</cite>. <cite index="12-3">The core data quality dimensions include accuracy, completeness, consistency, timeliness, relevance, uniqueness and validity</cite>. Those are the categories; metrics are how you quantify performance within each one.
<cite index="12-1,12-4">Frameworks define specific metrics like error rates, null percentages, duplicate rates, freshness scores and validity percentages to quantify performance</cite>. <cite index="6-2">Data contracts specify measurable quality requirements for completeness, accuracy, consistency, timeliness, and validity</cite>, and <cite index="6-3">these criteria should reflect business requirements rather than arbitrary technical thresholds</cite>.
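To make the dimension-versus-metric distinction concrete, here is a small sketch that computes three of the metrics named above: a null percentage (completeness), a duplicate rate (uniqueness), and a freshness check (timeliness). The field names, sample rows, and the one-hour freshness window are invented for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def null_percentage(rows: List[Dict], field: str) -> float:
    """Completeness: share of rows where the field is missing or None."""
    return 100.0 * sum(1 for r in rows if r.get(field) is None) / len(rows)


def duplicate_rate(rows: List[Dict], key: str) -> float:
    """Uniqueness: share of rows whose key repeats an earlier row's key."""
    keys = [r[key] for r in rows]
    return 100.0 * (len(keys) - len(set(keys))) / len(keys)


def is_fresh(rows: List[Dict], ts_field: str, now: datetime, max_age: timedelta) -> bool:
    """Timeliness: the newest record must be younger than max_age."""
    newest = max(r[ts_field] for r in rows)
    return now - newest <= max_age


utc = timezone.utc
rows = [
    {"id": 1, "email": "a@example.com", "updated_at": datetime(2026, 1, 1, 12, 0, tzinfo=utc)},
    {"id": 2, "email": None, "updated_at": datetime(2026, 1, 1, 12, 5, tzinfo=utc)},
    {"id": 2, "email": "c@example.com", "updated_at": datetime(2026, 1, 1, 12, 9, tzinfo=utc)},
    {"id": 3, "email": "d@example.com", "updated_at": datetime(2026, 1, 1, 12, 10, tzinfo=utc)},
]
now = datetime(2026, 1, 1, 13, 0, tzinfo=utc)

print("null % (email):", null_percentage(rows, "email"))
print("duplicate % (id):", duplicate_rate(rows, "id"))
print("fresh within 1h:", is_fresh(rows, "updated_at", now, timedelta(hours=1)))
```

The thresholds that turn these numbers into pass or fail belong in the data contract, where the business requirement is stated, rather than in the metric functions.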
The METRIC framework is worth noting for medical ML contexts—<cite index="17-9">it defines 26 data quality dimensions grouped into five clusters: measurement process, timeliness, representativeness, informativeness, and consistency</cite>. That level of granularity matters when you're building models where errors have consequences. Most organizations won't need 26 dimensions, but the idea of clustering dimensions by what part of the pipeline they relate to is sound.
Sources:
- https://lakefs.io/data-quality/data-quality-metrics/
- https://montecarlo.ai/blog-data-quality-framework/
- https://airbyte.com/data-engineering-resources/etl-data-quality
- https://arxiv.org/pdf/2601.22702
#data-quality-metrics #data-quality-dimensions #measurement-framework #data-contracts #methodology #data-quality #data-infrastructure
Validation layers: where to put the checks in the flow
<cite index="1-13">Data pipeline validation is the process of verifying that data flowing through a pipeline is accurate, complete, and consistent</cite>. The question is where in the pipeline you do it. <cite index="6-1,6-6">Data profiling should occur at multiple points in ETL pipelines to understand how quality characteristics change through transformation processes and identify where quality issues are introduced</cite>.
<cite index="1-2">Data quality checks involve verifying that the data meets standards such as accuracy, completeness, and consistency</cite>. <cite index="8-2">Implementing a data quality validation process in data pipelines is one of the ideal ways to deal with data quality issues proactively</cite>, and <cite index="8-6">integrating pipelines with a data quality platform can detect and stop the flow of bad or erroneous data before it enters downstream systems</cite>.
The frameworks mentioned most often are Great Expectations and Deequ. <cite index="1-10">Great Expectations is an open-source data validation framework that provides a simple and intuitive API for defining data expectations</cite>. <cite index="2-5">Deequ provides a scalable, distributed framework for data quality validation in Spark-based pipelines, allows users to define custom checks, supports data profiling, and offers automatic anomaly detection by comparing data quality metrics over time</cite>. The choice depends mostly on your execution engine: Deequ is built on Spark, while Great Expectations also runs against pandas DataFrames and SQL databases.
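Independent of framework, the placement idea is the same: put a gate after each stage and stop the batch when a check fails. The sketch below uses plain Python rather than the Great Expectations or Deequ APIs; `gate`, `ValidationFailed`, the check names, and the toy transform are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


class ValidationFailed(Exception):
    """Raised to stop a batch before it reaches the next stage."""


def gate(stage: str, rows: List[Row], checks: Dict[str, Check]) -> List[Row]:
    """Run every check for this stage; pass the rows through only if all succeed."""
    failed = [name for name, check in checks.items() if not check(rows)]
    if failed:
        raise ValidationFailed(f"{stage}: failed checks {failed}")
    return rows


def transform(rows: List[Row]) -> List[Row]:
    # Toy transformation: convert a decimal amount into integer cents.
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]


source_checks: Dict[str, Check] = {
    "non_empty": lambda rows: len(rows) > 0,
    "amount_present": lambda rows: all(r.get("amount") is not None for r in rows),
}
output_checks: Dict[str, Check] = {
    "amount_cents_non_negative": lambda rows: all(r["amount_cents"] >= 0 for r in rows),
}


def run_pipeline(raw: List[Row]) -> List[Row]:
    checked = gate("after extract", raw, source_checks)
    shaped = gate("after transform", transform(checked), output_checks)
    return shaped  # only at this point would the batch be loaded downstream


print(run_pipeline([{"amount": 5.0}, {"amount": 19.99}]))
```

Naming the stage in the failure message is what tells you where a quality issue was introduced, which is the point of checking at more than one place.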
Sources:
- https://www.numberanalytics.com/blog/data-pipeline-validation-best-practices
- https://airbyte.com/data-engineering-resources/etl-data-quality
- https://www.dqlabs.ai/blog/integrating-data-quality-checks-in-data-pipelines/
- https://medium.com/@georgemichaeldagogomaynard/data-integrity-in-a-data-pipeline-best-practices-and-strategies-for-data-quality-checks-dim-71af7a3bf21e
#data-validation #pipeline-architecture #data-quality #etl-validation #great-expectations #deequ #spark #data-infrastructure #methodology
Profiling as the baseline: what statistical fingerprinting reveals
<cite index="19-3">Data profiling applies statistical methodologies to return characteristics like data types, field lengths, cardinality, granularity, value sets, format patterns, implied rules, and cross-column relationships</cite>. It's the first step before you can validate anything—you need to know what the data actually looks like, not what the schema says it should look like.
<cite index="18-1">Profiling examines data characteristics to understand structure, content, and quality</cite>, while <cite index="20-1,20-4">statistical measures help determine quality by providing information about minimum and maximum values, frequency data, variation, mean and mode, percentiles and data distribution</cite>. <cite index="18-6">Common profiling methods include column analysis, pattern matching, and outlier detection</cite>.
The tools have gotten better—<cite index="19-9">profiling tools can identify patterns and data relationships analysts may not be looking for</cite>—but <cite index="19-13">a tool cannot draw conclusions about whether the data meets expectations</cite>. That still requires someone who understands what the pipeline is supposed to be doing. <cite index="4-3">A simple profile gives key information about the data's state and allows you to assess whether existing validations and assumptions are effective or accurate</cite>. You profile to establish baselines, then you validate against them.
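A column profile of the kind described above can be produced with the standard library alone. This is a minimal sketch for a numeric column; the function name, the choice of statistics, and the sample values are illustrative assumptions, not a specific tool's output.

```python
import statistics
from typing import Dict, List


def profile_numeric(values: List[float]) -> Dict[str, object]:
    """Summarize a numeric column; needs at least two values for stdev and quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "cardinality": len(set(values)),  # number of distinct values
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),
        "quartiles": (q1, median, q3),
    }


profile = profile_numeric([3, 5, 5, 8, 13, 21, 5, 8])
for name, value in profile.items():
    print(f"{name}: {value}")
```

The profile only describes what is there. Whether a maximum of 21 is acceptable is a question for someone who knows what the column is supposed to contain.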
Sources:
- https://www.sciencedirect.com/topics/computer-science/data-profiling
- https://www.researchgate.net/publication/384593533_Data_Profiling_and_Statistical_Analysis_for_Validation
- https://aws.amazon.com/what-is/data-profiling/
- https://greatexpectations.io/blog/why-data-validation-is-critical-to-your-pipelines/
#data-profiling #statistical-analysis #data-quality #baseline-measurement #pipeline-validation #metadata #data-infrastructure #methodology
Interaction Modes and the Autonomy-Governance Tension
<cite index="2-3">The approach balances autonomy with control because application teams gain speed while the platform team maintains governance through automated policy enforcement rather than manual gates</cite>. <cite index="2-4,2-5,2-6">Application workload teams drive direct business outcomes and require autonomy to respond quickly to changing requirements; organizations that centralize too many functions or force application teams through manual approval processes slow delivery and create bottlenecks that reduce competitive advantage; empowering application workload teams while maintaining governance requires policy-driven controls rather than centralized gatekeeping</cite>.
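As a rough illustration of "automated policy enforcement rather than manual gates", the sketch below evaluates a deployment manifest against a list of policy functions. Everything in it is hypothetical: the manifest fields, the approved regions, and the policy wording are invented, and real platforms typically express this in a dedicated policy engine rather than application code.

```python
from typing import Callable, Dict, List, Optional

Manifest = Dict[str, object]
Policy = Callable[[Manifest], Optional[str]]  # returns a violation message or None

ALLOWED_REGIONS = {"eu-west", "eu-north"}  # hypothetical values


def storage_is_encrypted(m: Manifest) -> Optional[str]:
    return None if m.get("storage_encrypted") is True else "storage must be encrypted at rest"


def region_is_allowed(m: Manifest) -> Optional[str]:
    region = m.get("region")
    return None if region in ALLOWED_REGIONS else f"region {region!r} is not approved"


def has_owner(m: Manifest) -> Optional[str]:
    return None if m.get("owner") else "an owning team must be declared"


POLICIES: List[Policy] = [storage_is_encrypted, region_is_allowed, has_owner]


def evaluate(manifest: Manifest) -> List[str]:
    """Return every violation; an empty list means the deploy may proceed unattended."""
    violations = []
    for policy in POLICIES:
        message = policy(manifest)
        if message is not None:
            violations.append(message)
    return violations


print(evaluate({"storage_encrypted": True, "region": "eu-west", "owner": "payments"}))
print(evaluate({"storage_encrypted": False, "region": "us-east"}))
```

The application team gets an immediate, specific answer instead of a ticket queue, and the platform team changes governance by editing the policy list.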
The model defines three interaction modes. <cite index="21-9">Team Topologies provides four team types and a set of interaction modes—notably collaboration, facilitating, and X-as-a-Service—to control how work and knowledge flow across boundaries</cite>. <cite index="20-3,20-4,20-5">Enabling teams congregate specialists whose role is to grow relevant skills inside other teams so those teams can remain independent and better own and evolve their services; they achieve this primarily through facilitating mode, which involves a coaching role where the enabling team isn't there to write and ensure conformance to standards, but instead to educate and coach colleagues so stream-aligned teams become more autonomous</cite>.
<cite index="2-9,2-10,2-11">Organizations should identify capability gaps across teams, assess application and platform teams to identify common skill gaps or areas where teams struggle to adopt best practices, and focus enabling team efforts on high-impact areas where specialized support creates the most value, such as DevOps practices, security implementation, or cloud-native architecture patterns</cite>. <cite index="2-12,2-13,2-14,2-15">Enabling teams provide time-bound support and coaching to close skill gaps and assist with DevOps practices; this support is critical for legacy workloads where building full DevOps capacity isn't feasible, helping reduce risk and improve adoption speed</cite>. The enabling team pattern addresses the knowledge transfer problem directly, but only if the interaction has a defined exit.
Sources:
- https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/considerations/devops-teams-topologies
- https://martinfowler.com/bliki/TeamTopologies.html
- https://www.frontiersin.org/journals/behavioral-neuroscience/articles/10.3389/fnbeh.2026.1820247/full
#team-topologies #interaction-modes #enabling-teams #autonomy-governance #policy-driven-controls #organizational-design #platform-engineering #methodology-debates
Thinnest Viable Platform and the Product Mindset Trap
<cite index="1-12">The Team Topologies book was instrumental in proliferating the Platform as a Product approach and coined the term Thinnest Viable Platform</cite>. The pattern is iterative: <cite index="11-16,11-17">you don't need to create a fully formed golden pathway immediately; instead, solve one wide-reaching problem and iterate to success</cite>. <cite index="11-22,11-23,11-24">The goal should be to start with the thinnest viable platform that supports developers where they most need it; it's better to expand features slowly and pick the areas with the most significant impact rather than trying to solve all problems</cite>.
But the biggest misunderstanding is existential. <cite index="1-17,1-18">The biggest misunderstanding in the Platform Engineering space is the belief that you need a platform—"Should this (platform) exist?" is the most critical question for any platform leader and the most overlooked by most Platform Engineering initiatives</cite>. <cite index="1-19,1-20">The primary goal of all internal platforms is to reduce the cognitive load on its customers, i.e., the developers using its services to serve the organization's customers</cite>.
The "platform as product" advice has become cliché but remains true. <cite index="19-33,19-34,19-35,19-36">Not treating the platform as a product is a surefire path toward minimal use; as Erica Hughberg noted at KubeCon North America 2024, "even the best products don't sell themselves," and expecting engineers to drive adoption without a clear marketing approach is a recipe for disaster</cite>. <cite index="12-1,12-2">Teams are simply being renamed from operations or infrastructure teams to platform engineering teams, with very little change or benefit to the organization; centralized "DevOps" teams aren't an anti-pattern in themselves, but the name should really be "Platform" or "SRE"</cite>. <cite index="12-3,12-4,12-5">The platform team's goal should shift from a service mindset to a product mindset, building self-service capabilities that eliminate the need for tickets; they should measure success by how few tickets they receive because developers can serve themselves</cite>.
Sources:
- https://teamtopologies.com/platform-engineering
- https://octopus.com/devops/platform-engineering/patterns-anti-patterns/
- https://jellyfish.co/library/platform-engineering/anti-patterns/
- https://www.infoworld.com/article/4064273/8-platform-engineering-anti-patterns.html
#thinnest-viable-platform #platform-as-product #platform-engineering #developer-experience #product-mindset #anti-patterns #organizational-design #methodology-debates
Platform Engineering as Response to DevOps Anti-Patterns
<cite index="9-5,9-6">Platform engineering arose from frustrations with DevOps adoption; while DevOps helped some teams, the increasing complexity of cloud-native technologies created problems for most, leading to developer burnout and overwhelmed operations teams</cite>. The taxonomy of failure is well documented. <cite index="5-3,5-4">When regular engineering organizations try to implement true DevOps, a series of antipatterns emerge, well documented by the Team Topologies team (Matthew Skelton and Manuel Pais) in their analysis of DevOps anti-types</cite>.
<cite index="11-1">For some, Platform Engineering solves a common DevOps Topologies problem: embedding IT operations into DevOps teams that lack time and experience to do it well, an anti-pattern also known as Anti-Type F</cite>. <cite index="5-6,5-7,5-8,5-9">Developers (usually the more senior ones) end up taking responsibility for managing environments and infrastructure; this leads to a setup where "shadow operations" are performed by the same engineers whose input in terms of coding and product development is most valuable—everyone loses, especially the senior engineer who becomes responsible for setup and needs to solve requests from more junior colleagues</cite>.
<cite index="9-11,9-12,9-13">The distributed team model does not scale past a small number of teams, and platform engineering addresses this by scaling DevOps operations inside distributed teams through practices like Internal Developer Portals, Golden Paths and forming Developer Communities</cite>. <cite index="9-23,9-24,9-25">The industry broadly agrees that Platform Engineering addresses the growing complexity of DevOps at scale by empowering developers through self-service tools and collaborative communities—in essence, it's "DevOps for the People," supporting teams in achieving continuous delivery by simplifying processes and reducing mental strain</cite>. But the architecture choice matters: <cite index="11-9,11-10,11-11">when platform decisions happen away from technical knowledge, there's temptation to try an all-in-one DevOps platform to solve all problems, which doesn't align with Platform Engineering or DevOps; buying a single general-purpose tool limits the platform's capability and continuous improvement abilities</cite>.
Sources:
- https://platformengineering.org/blog/what-is-platform-engineering
- https://octopus.com/devops/platform-engineering/patterns-anti-patterns/
- https://medium.com/blue-harvest-tech-blog/devops-vs-platform-engineering-7b2ff2df30de
#platform-engineering #devops-antipatterns #cognitive-load #internal-developer-platforms #organizational-dysfunction #shadow-operations #organizational-design #methodology-debates
The Four-Type Constraint and the Cognitive Load Thesis
<cite index="1-1,1-7">Team Topologies offers a shared organizational design language that establishes clear principles for right-sizing platform teams and their boundaries</cite>, but the model is simpler than most enterprises want it to be. <cite index="4-17">The framework provides four fundamental team types—stream-aligned, platform, enabling, and complicated-subsystem—along with interaction modes for organizing fast flow of value</cite>. <cite index="3-16">Gartner predicts 80% of large engineering organizations will have dedicated platform teams by 2026</cite>, but the structural challenge remains how teams actually distribute work.
The central thesis: <cite index="20-2">the primary benefit of a platform is to reduce cognitive load on stream-aligned teams</cite>. <cite index="21-1">Team Topologies treats cognitive load as a design constraint, requiring teams be sized and shaped so they can understand, operate, and improve their part of the system without drowning in coordination</cite>. <cite index="22-9,22-10">Working memory has a limit of 4-5 items; the "you build it, you run it" philosophy increased cognitive burden across intrinsic load (domain complexity) and extraneous load (deployment processes, infrastructure provisioning, tool sprawl)</cite>. <cite index="2-20,2-22">Platform teams provide the foundation that accelerates delivery while maintaining governance; well-structured platform teams ensure consistent practices, reduce complexity for application teams, and embed governance into platforms used to develop workloads</cite>.
The constraint matters because it forces decisions. <cite index="20-8,20-9,20-10,20-11">As George Box said, "all models are wrong, some are useful"; complex organizations cannot be simply broken down into just four kinds of teams and three kinds of interactions, but constraints like this are what make a model useful, impelling people to evolve their organization to allow stream-aligned teams to maximize flow by lightening cognitive load</cite>.
Sources:
- https://teamtopologies.com/platform-engineering
- https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/considerations/devops-teams-topologies
- https://mia-platform.eu/blog/team-topologies-to-structure-a-platform-team/
- https://teamtopologies.com/
- https://martinfowler.com/bliki/TeamTopologies.html
- https://www.frontiersin.org/journals/behavioral-neuroscience/articles/10.3389/fnbeh.2026.1820247/full
- https://www.softwareseni.com/applying-team-topologies-to-reduce-cognitive-load-and-burnout/
#team-topologies #cognitive-load #organizational-design #platform-engineering #stream-aligned-teams #conway-law #methodology-debates
Systems Thinking Versus the Cognitive Convenience of Blame
<cite index="3-6">Most incidents result from complex interactions between tools, processes, and communication breakdowns—not individual negligence</cite>. <cite index="5-36">Instead of asking who made the mistake, ask why the system placed that person in a position where the mistake was the most likely outcome</cite>. This reframing comes from Sidney Dekker's work in aviation safety, which John Allspaw translated into engineering practice in his 2012 post "Blameless PostMortems and a Just Culture."
<cite index="3-7,3-8">Accountability remains, but it becomes collective, process-oriented, and focused on building resilience. True blamelessness isn't just a tone shift; it's a complete reorientation of responsibility, trust, and learning</cite>. <cite index="5-40,5-41">If a postmortem accepted "human error" as a finding, the investigation stopped too early. The real question starts there: what about the system made that error possible, likely, and uncatchable?</cite>
<cite index="9-22,9-23,9-24">When people feel safe, they tell you what really happened. When they're scared, they give you the sanitized version. Sanitized versions don't prevent repeat incidents</cite>. <cite index="3-26">Psychological safety was made famous by Harvard researcher Amy Edmondson and further validated by Google's Project Aristotle</cite>. The mechanism is simple: fear of punishment destroys signal, and without signal, you can't fix the system.
Sources:
- https://rootly.com/incident-postmortems/blameless
- https://www.sherlocks.ai/blog/blameless-postmortems-explained-lessons-from-real-outages
- https://hyperping.com/blog/incident-post-mortem
#systems-thinking #root-cause-analysis #psychological-safety #human-error #blameless-culture #incident-analysis #organizational-learning #sre-practices
Action Items as the Contract
A postmortem without follow-through is documentation theater. <cite index="10-1,10-2">You need an action plan on steps to stop the incident recurring. Every action item must have an owner and target date</cite>. <cite index="4-19">Key metrics include postmortem completion rate, action closure time, repeat incident rate, and remediation verification rate</cite>. These are not soft measures—they determine whether the process produces resilience or just reports.
<cite index="4-20">To enforce this: require verification criteria, assign owners, integrate actions into planning, and hold regular reviews</cite>. <cite index="8-2,8-3">Make sure you identify who is responsible for approving recommended actions and reviewing the write-ups themselves. At Atlassian, that person is a division-level head of engineering</cite>. Without executive sponsorship, action items migrate to the bottom of the backlog.
<cite index="3-11,3-12">Lessons are translated into tangible actions. Learning without action isn't learning</cite>. The iterative loop is: incident → postmortem → action items → trend analysis → fewer future incidents. <cite index="4-27,4-28,4-29">Blameless postmortems are a critical practice for resilient operations. They require instrumentation, culture, automation, and measurable follow-through. When done correctly they reduce incidents, lower costs, and increase trust</cite>.
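The follow-through metrics mentioned above are simple to compute once every action item carries an owner and dates. The sketch below is one possible shape, not a tool's schema: the `ActionItem` fields, the sample items, and the definition of "overdue" are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: str
    opened: date
    due: date
    closed: Optional[date] = None  # None while the item is still open


def closure_rate(items: List[ActionItem]) -> float:
    """Share of action items that have been closed."""
    return sum(1 for i in items if i.closed is not None) / len(items)


def mean_days_to_close(items: List[ActionItem]) -> float:
    """Average closure time in days, over closed items only."""
    return mean((i.closed - i.opened).days for i in items if i.closed is not None)


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose target date has already passed."""
    return [i for i in items if i.closed is None and i.due < today]


items = [
    ActionItem("Add alert on queue depth", "dana", date(2026, 1, 5), date(2026, 1, 19), date(2026, 1, 15)),
    ActionItem("Document rollback steps", "lee", date(2026, 1, 5), date(2026, 1, 26), date(2026, 1, 25)),
    ActionItem("Load-test failover path", "sam", date(2026, 1, 10), date(2026, 2, 1)),
]

print(f"closure rate: {closure_rate(items):.0%}")
print(f"mean days to close: {mean_days_to_close(items)}")
print("overdue:", [(i.title, i.owner) for i in overdue(items, date(2026, 2, 10))])
```

An overdue list with names attached is the artifact a regular review can act on; the rate and the average only show whether the process is trending the right way.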
Sources:
- https://www.pluralsight.com/resources/blog/tech-operations/how-conduct-blameless-postmortems-incident
- https://sreschool.com/blog/blameless-postmortem/
- https://rootly.com/incident-postmortems/blameless
- https://www.atlassian.com/incident-management/postmortem/blameless
#action-items #incident-analysis #accountability #remediation-tracking #sre-practices #follow-through #organizational-learning
The Cultural Load-Bearing Wall: Senior Leadership
<cite index="2-29">To properly support postmortem culture, engineering leaders should consistently exemplify blameless behavior and encourage blamelessness in every aspect of postmortem discussion</cite>. The Google SRE Workbook provides a specific example of how this breaks: a VP asking "someone must have known beforehand this was a bad idea, so why didn't you listen to that person?" <cite index="5-24">Under pressure, leaders revert to patterns that feel like accountability because they are simpler than confronting systemic complexity</cite>.
The suggested mitigation is immediate redirection. <cite index="2-7">One response: "I'm sure everyone had the best intent, so to keep it blameless, maybe we ask generically if there were any warning signs we could have heeded, and why we might have dismissed them"</cite>. This isn't just diplomacy—it changes the investigative path. The difference between "who ignored the warning" and "what made the warning easy to dismiss" is the difference between a personnel action and a process fix.
<cite index="1-5">Management reinforces a collaborative postmortem culture through senior management's active participation in the review and collaboration process</cite>. But <cite index="5-27,5-28,5-29">blameful language from senior leadership is the single most common way postmortems fail. Blame survives because it is cognitively convenient, organizationally satisfying, and feels like accountability</cite>—right up until the same incident repeats.
Sources:
- https://sre.google/workbook/postmortem-culture/
- https://www.sherlocks.ai/blog/blameless-postmortems-explained-lessons-from-real-outages
#leadership #blameless-culture #organizational-behavior #incident-review #sre-practices #cultural-debt #incident-analysis #organizational-learning
Blameless as Infrastructure, Not Just Intent
<cite index="1-2">A truly blameless postmortem focuses on identifying contributing causes without indicting any individual or team for bad or inappropriate behavior</cite>. That much is well-known. What the Google SRE book makes clear is that this isn't a tone choice—it's a structural commitment. <cite index="1-16">The practice assumes everyone involved had good intentions and did the right thing with the information they had</cite>, which shifts the frame from moral judgment to systems analysis.
The hard part is implementation. <cite index="1-17">If a culture of finger pointing and shaming prevails, people will not bring issues to light for fear of punishment</cite>. <cite index="1-22">An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug</cite>, and that risk compounds. <cite index="1-18">Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal</cite>—environments that learned the hard way that fear kills information flow.
<cite index="1-12">It is important to define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary</cite>. This removes discretion at the moment of failure, when people are most likely to decide a postmortem isn't worth the exposure. <cite index="1-3">Once satisfied with the document and its action items, the postmortem is added to a team or organization repository of past incidents</cite>, turning individual failures into institutional learning.
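Writing the criteria down as code is one way to remove discretion at the moment of failure. The sketch below is an assumption-laden example: the `Incident` fields and the 15-minute threshold are hypothetical and would need to be agreed by the team in advance.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Illustrative incident summary; field names are assumptions."""
    user_visible_minutes: int
    data_loss: bool
    manual_intervention: bool  # e.g. a rollback or traffic reroute by on-call


# Hypothetical threshold. The point is that it is fixed before the incident.
MAX_USER_VISIBLE_MINUTES = 15


def postmortem_required(incident: Incident) -> bool:
    """Apply the pre-agreed criteria; no judgment call at incident time."""
    return (
        incident.data_loss
        or incident.manual_intervention
        or incident.user_visible_minutes > MAX_USER_VISIBLE_MINUTES
    )
```

Because the function takes only facts about the incident, nobody has to argue in the moment about whether a write-up is worth the exposure.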
Sources:
- https://sre.google/sre-book/postmortem-culture/
- https://sre.google/workbook/postmortem-culture/
#blameless-culture #incident-analysis #sre-practices #psychological-safety #organizational-learning #systems-thinking
Formal verification moves from binary to quantitative in production
<cite index="6-1">The ability to use formal models and specifications—for model-checking systems at design time, for validating in-production behavior using runtime monitoring by serving as a correctness oracle, for simulating emergent systems behavior, and for building proofs of critical properties—allows AWS to amortize the engineering effort of developing these specifications over a larger amount of business and customer value</cite>.
The methodology is expanding beyond yes/no correctness. <cite index="15-8,15-9">In addition to checking protocol correctness, there is a novel quantitative approach to measuring permissiveness of a transaction protocol: a computable metric of how restrictive a protocol is in its implementation of a given isolation level—this is a new dimension to protocol analysis, not just a binary outcome of correct or incorrect but an observation that a correct transactional protocol lives on a spectrum of efficiency</cite>. <cite index="15-3">This is a novel dimension of protocol analysis that demonstrates the value of formal methods beyond mere binary correctness outcomes</cite>.
Trace validation connects runtime behavior back to specs. <cite index="9-3,9-4">A framework for relating traces of distributed programs to high-level specifications written in TLA+ reduces the problem to a constrained model checking problem, realized using the TLC model checker</cite>. This lets engineers treat specifications as runtime oracles, not just design-time tools.
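The idea of using a specification as a runtime oracle can be shown with a toy stand-in. The cited work does this with TLA+ and the TLC model checker; the Python below is only an analogy, assuming a trivial counter spec and a trace that logs the full state at every step. The names `init`, `next_step`, and `first_violation` are hypothetical.

```python
from typing import Dict, List, Optional

State = Dict[str, int]


def init(state: State) -> bool:
    """Toy spec: the counter starts at zero."""
    return state == {"count": 0}


def next_step(old: State, new: State) -> bool:
    """Toy spec: each step increments the counter by one or leaves it unchanged."""
    return new["count"] in (old["count"], old["count"] + 1)


def first_violation(trace: List[State]) -> Optional[int]:
    """Return the index of the first logged state the spec does not allow, or None."""
    if not trace:
        return None
    if not init(trace[0]):
        return 0
    for i in range(1, len(trace)):
        if not next_step(trace[i - 1], trace[i]):
            return i
    return None
```

Real traces are usually partial and interleaved across nodes, which is why the cited framework treats validation as a constrained model-checking problem rather than a simple linear scan like this one.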
Sources:
- https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
- https://www.mongodb.com/company/blog/engineering/formal-methods-beyond-correctness-isolation-permissiveness-distributed-transactions
- https://arxiv.org/pdf/2404.16075
#formal-verification #runtime-verification #permissiveness-metrics #correctness #distributed-systems #trace-validation #methodology-debates
AWS pivoted to P because TLA+ was too hard for most engineers
<cite index="6-5,6-6">As the use of formal methods was expanded beyond the initial teams at AWS in the early 2010s, AWS discovered that many engineers struggled to learn and become productive with TLA+—this difficulty seemed to stem from TLA+'s defining feature: it is a high-level, abstract language that more closely resembles mathematics than the imperative programming languages most developers are familiar with</cite>.
The adoption problem forced a trade-off. <cite index="6-7,6-8">While this mathematical nature is a significant strength of TLA+, and AWS continues to agree with Lamport's views on the benefits of mathematical thinking, they also sought a language that would allow them to model check key aspects of systems design while being more approachable to programmers—they found this balance in the P programming language, a state-machine-based language for modeling and analysis of distributed systems</cite>.
The original choice of TLA+ was deliberate and documented. <cite index="8-1,8-2">AWS evaluated several formal methods and published their findings, listing the requirements they think are important for a formal method to be successful in their industry segment; when they found TLA+ met those requirements, they stopped</cite>. <cite index="5-3">TLA+ has been successfully used by Intel, Compaq and Microsoft in the design of hardware systems, and has started seeing recent use in large software systems at Microsoft, Oracle, and most famously at Amazon where engineers use TLA+ to specify and verify many AWS services</cite>. But the learning curve remains a constraint on scale.
Sources:
- https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
- https://cacm.acm.org/research/how-amazon-web-services-uses-formal-methods/
- https://pron.github.io/posts/tlaplus_part1
#tla-plus #p-language #formal-verification #developer-experience #methodology-debates #adoption-barriers #aws-infrastructure #correctness
The specification-implementation gap remains a human problem
<cite index="18-1,18-3">Although formal methods provide rigorous approaches to verifying the adherence of a program to its specification, there still exists a gap between a formal model and implementation if the model and its implementation are only loosely coupled</cite>. <cite index="18-4">Developers usually overcome this gap through manual effort, which may result in the introduction of unexpected bugs</cite>.
Verified systems still ship bugs because of mismatched assumptions in the toolchain. <cite index="12-3,12-4">Researchers found 16 bugs in verified systems that have a negative impact on server correctness or on verification guarantees—analyzing their causes reveals a wide range of mismatched assumptions about the unverified code, unverified libraries, resources implicitly used by verified code, verification infrastructure, and specification</cite>. <cite index="10-1">Formal methods can be used to verify that a single component is provably correct, but composition of correct components does not necessarily yield a correct system; additional verification is needed to prove that the composition is correct</cite>.
Some projects bridge the gap through code generation. <cite index="18-5">Erla+ is a translator that automatically translates models written in a subset of the PlusCal language to TLA+ for formal reasoning and produces executable Erlang programs in one run</cite>. <cite index="24-6">By providing pluggable interfaces in the generated code and integration tests with capability of reproducing traces, tools aim at reducing the gap between model and implementation</cite>. But runtime behavior can still deviate from the design, and the generator itself can be a weak link. PGo compiles PlusCal-style specifications to Go, yet <cite index="20-13">PGo has not been verified</cite>.
Sources:
- https://icfp24.sigplan.org/details/erlang-2024/3/Erla-Translating-TLA-Models-into-Executable-Actor-Based-Implementations
- https://www.cs.purdue.edu/homes/pfonseca/papers/eurosys2017-dsbugs.pdf
- https://queue.acm.org/detail.cfm?id=2889274
- https://dl.acm.org/doi/fullHtml/10.1145/3559744.3559747
- https://www.sigops.org/src/srcsosp2017/sosp17src-final23.pdf
#formal-verification #implementation-gap #correctness #code-generation #distributed-systems #verification-limits #methodology-debates
AWS found bugs testing couldn't, then made TLA+ an internal standard
<cite index="4-4,4-5">At AWS, formal methods have been a big success, helping prevent subtle, serious bugs from reaching production that would not have been found via any other technique</cite>. <cite index="3-14">This is an experience report from engineers who spearheaded the use of formal methods to verify complex distributed systems being built at AWS such as S3 and DynamoDB</cite>. <cite index="3-15,3-16">At first, they didn't think of formal methods and were investing in other types of testing—those tests helped but there were still edge cases that could cause serious bugs</cite>.
The returns were compelling enough that <cite index="1-3">it has become an industry standard for the core algorithm of distributed cloud services to use TLA+ to carry out design verifications</cite>. <cite index="4-6,4-7">Formal methods have helped AWS make aggressive optimizations to complex algorithms without sacrificing quality, and so far seven teams have used TLA+ with all finding high value in doing so</cite>. <cite index="6-3">Modeling a key commit protocol for the Aurora relational database engine in P and TLA+ allowed AWS to identify an opportunity to reduce the cost of distributed commits from 2 to 1.5 network roundtrips without sacrificing any safety properties</cite>.
The methodology validates designs, not implementations. <cite index="2-13,2-15">While Amazon describes being successful in unearthing devious bugs using TLA+, it's important to remember that TLA+ only tests the system design and has no knowledge of the actual code; you could have a totally correct specification but your code can still be wrong</cite>. The work continues: <cite index="4-8,4-9">at the time of writing, more teams are starting to use TLA+, and AWS believes use of TLA+ will accelerate time-to-market</cite>.
Sources:
- https://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf
- https://vishnubharathi.codes/blog/paper-notes-use-of-formal-methods-at-amazon-web-services/
- https://medium.com/the-continuous-conference/the-verification-of-a-distributed-system-200b847b882
- https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
#formal-verification #tla-plus #aws-infrastructure #design-validation #correctness #model-checking #distributed-systems #methodology-debates
Fault injection: simulating what operators see
The core of Jepsen's methodology is controlled chaos applied to a running cluster. <cite index="1-1,1-13">It spins up a cluster of hosts to serve as a distributed database system, simulates a network partition, then observes how the system manages database operations under such conditions</cite>. <cite index="16-3">Jepsen can perform numerous chaos events on a distributed system such as introducing network issues, killing components, and generating random load</cite>.
The fault scenarios target the conditions where distributed systems claims typically break. <cite index="10-4">Jepsen applies property-based testing to databases to verify correctness claims during common failure modes: network partitions, process crashes, and clock skew</cite>. <cite index="24-6,24-13">Kingsbury added several new nemesis modes to inject adverse events into tests, such as clock changes or network failures</cite>. <cite index="18-5,18-6,18-7">Jepsen is designed to test partition tolerance of distributed systems by creating network partitions while fuzzing the system with random operations, then analyzing results to find if the system violates any consistency properties it claims</cite>.
What gets tested is not peak performance but behavior under degradation. <cite index="11-5,11-6">The correctness of the system is tested not under ideal conditions, but during failures, using chaos engineering tools to check system behavior</cite>. The assumption is that marketing claims hold in the demo environment; Jepsen's job is to break the environment in ways that look like production.
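The generate-load, inject-fault, check-history loop can be illustrated with an in-memory toy. This is not Jepsen, which is a Clojure library driving real hosts; `ToyCluster`, `run_test`, and `stale_reads` are hypothetical names, and the "partition" is just a flag that stops replication.

```python
from typing import List, Optional, Tuple

History = List[Tuple[int, Optional[int]]]


class ToyCluster:
    """Two in-memory replicas; writes land on A and copy to B unless partitioned."""

    def __init__(self) -> None:
        self.a: Optional[int] = None
        self.b: Optional[int] = None
        self.partitioned = False

    def write(self, value: int) -> None:
        self.a = value
        if not self.partitioned:
            self.b = value

    def read_from_b(self) -> Optional[int]:
        return self.b


def run_test(partition_at: Optional[int], ops: int = 6) -> History:
    """Write then read on each step; cut replication just before step `partition_at`."""
    cluster = ToyCluster()
    history: History = []
    for i in range(ops):
        if partition_at is not None and i == partition_at:
            cluster.partitioned = True  # the fault-injection ("nemesis") step
        cluster.write(i)
        history.append((i, cluster.read_from_b()))
    return history


def stale_reads(history: History) -> History:
    """Checker: every read should return the write acknowledged just before it."""
    return [(written, read) for written, read in history if read != written]
```

With `partition_at=None` the checker finds nothing, which mirrors the demo environment; with a partition injected mid-run, every later read from replica B is stale.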
Sources:
- https://rmulhol.github.io/general/2015/05/28/testing-databases-jepsen.html
- https://medium.com/appian-engineering/chaos-testing-a-distributed-system-with-jepsen-2ae4a8bdf4e5
- https://conferences.oreilly.com/velocity/vl-ca-2018/public/schedule/speaker/304072.html
- https://www.cockroachlabs.com/blog/cockroachdb-beta-passes-jepsen-testing/
- https://developer.hashicorp.com/consul/docs/architecture/jepsen
- https://system-design.space/en/chapter/jepsen-consistency/
#fault-injection #network-partitions #chaos-engineering #nemesis #property-based-testing #failure-modes #consistency-testing #distributed-systems #verification
Testing as credibility: the vendor response pattern
Jepsen created a market for independent verification that changed how database vendors approach consistency claims. <cite index="23-1,23-2,23-3">VoltDB hired Kyle Kingsbury to validate their promise of strong serializability and passed official Jepsen testing, which was more stringent than any other system Jepsen had tested</cite>. <cite index="23-8,23-9">VoltDB understood that nothing they did would have the same credibility as a test run by Kingsbury himself, and when he started Jepsen-For-Hire they immediately got in line</cite>.
The relationship between vendor testing and Jepsen reveals assumptions about internal incentives. <cite index="9-6,9-7">ScyllaDB had confidence they would pass Jepsen with flying colors after running their own tests, but the result turned out to be surprising</cite>. <cite index="2-24,2-25">Eleven bugs were filed on ScyllaDB based on Jepsen tests, with bootstrap and decommission having most of the issues</cite>. <cite index="9-12">ScyllaDB learned that database quality has to be all-round, and maturity is achieved through testing across many product features and their interactions, rather than focusing on a single piece</cite>.
<cite index="24-4,24-5">The comments test was expected to fail because it required linearizability instead of serializability, which helps verify the testing methodology can detect subtle differences between consistency levels</cite>. That level of precision—tests that should fail, failing in the predicted way—is what makes the framework diagnostic rather than just adversarial.
Sources:
- https://www.voltdb.com/blog/2016/07/12/voltdb-6-4-passes-official-jepsen-testing/
- https://www.scylladb.com/2020/12/23/jepsen-and-scylla-putting-consistency-to-the-test/
- https://www.scylladb.com/2016/02/11/jepsen-testing/
- https://www.cockroachlabs.com/blog/cockroachdb-beta-passes-jepsen-testing/
#vendor-testing #credibility #independent-verification #consistency-claims #database-vendors #jepsen-for-hire #consistency-testing #distributed-systems #verification
The machinery: checkers for linearizability and isolation
Jepsen's verification relies on specialized checkers that analyze operation histories against consistency models. <cite index="15-16,15-17">Currently Jepsen has two types of checkers: Knossos for checking if results are linearizable, and Elle for checking consistency of database transactions</cite>. The choice matters because the problems have different complexity profiles.
<cite index="20-4,20-5">Linearizability checking is NP-complete, suffering from combinatorial explosion in concurrent multi-register systems, and serializability checking is also NP-complete—unlike linearizability, one cannot use real-time constraints to reduce the search space</cite>. <cite index="5-6,5-7">Elle, a newer library, analyzes Jepsen histories and finds consistency violations in linear time by building on Adya's formalism of transactional anomalies as cycles in a dependency graph</cite>.
<cite index="13-13,13-14,13-15">Finding consistency violations in databases is NP-complete—the cost grows exponentially with transaction count, and sometimes it is impossible for an external observer; multiple violations can interact and produce correct results while masking underlying problems</cite>. <cite index="13-1,13-2">Jepsen observes not only whether there was a consistency fault, but also the steps that brought it about, with Elle finding anomalies in linear time proving particularly valuable</cite>. This is the dependency graph work—tracing how operations relate rather than brute-forcing all orderings.
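The "anomalies as cycles in a dependency graph" idea can be sketched with ordinary depth-first search. This is plain cycle detection, not Elle's actual algorithm: it assumes the transaction dependency graph has already been inferred, and `find_cycle` is a hypothetical helper name.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of transaction ids, or None.

    `graph[t]` lists the transactions that must come after `t`.
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = ACTIVE
        path.append(node)
        for succ in graph.get(node, []):
            succ_state = state.get(succ, UNSEEN)
            if succ_state == ACTIVE:
                # Back edge: the slice of the current path forms a cycle.
                return path[path.index(succ):] + [succ]
            if succ_state == UNSEEN:
                found = visit(succ)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None
```

Each node and edge is visited once, so the search is linear in the size of the graph, and the returned cycle names the transactions involved rather than just reporting that a violation exists.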
Sources:
- https://www.cncf.io/blog/2024/02/19/analysis-of-xline-jepsen-tests/
- http://www.vldb.org/pvldb/vol14/p268-alvaro.pdf
- https://www.podc.org/podc2021/kyle-kingsbury/
- https://www.scylladb.com/2020/12/23/jepsen-and-scylla-putting-consistency-to-the-test/
#linearizability #consistency-models #elle #knossos #verification #complexity #serializability #consistency-testing #distributed-systems
Jepsen: the test that made databases nervous
<cite index="11-13,11-14">Kyle Kingsbury created Jepsen as an independent project to test distributed systems for correctness, and it identified critical errors in dozens of popular databases</cite>. The methodology is straightforward: <cite index="11-16,11-17">Jepsen is a Clojure library that generates load, introduces failures like network partitions, process crashes, and clock skew, then checks whether stated guarantees are met</cite>.
What distinguishes Jepsen from vendor testing is where it applies pressure. <cite index="1-14">Kingsbury focuses on how well each database lives up to the claims made in its documentation</cite>. <cite index="1-18,1-19">His findings instill skepticism—he has found numerous places where databases do not live up to documentation claims, alongside bugs and unexpected behavior</cite>. <cite index="5-4">Over eight years, consistency violations were found in 26 systems, ranging from stale reads to catastrophic data loss</cite>.
The technical foundation combines several techniques. <cite index="5-1,5-5">Jepsen combines automated deployment, fault injection, and property-based testing techniques to uncover safety violations and performance characteristics</cite>. <cite index="15-13,15-14">It performs black-box testing, simulating complex real-world deployment environments and performing operations, then uses consistency checkers to verify whether results comply with guarantees</cite>. <cite index="9-5">Established in 2013, it has become the industry standard for distributed systems testing, with many SQL and NoSQL vendors using it as a checkmark</cite>.
Sources:
- https://system-design.space/en/chapter/jepsen-consistency/
- https://rmulhol.github.io/general/2015/05/28/testing-databases-jepsen.html
- https://www.podc.org/podc2021/kyle-kingsbury/
- https://www.cncf.io/blog/2024/02/19/analysis-of-xline-jepsen-tests/
- https://www.scylladb.com/2020/12/23/jepsen-and-scylla-putting-consistency-to-the-test/
#jepsen #distributed-systems #consistency-testing #fault-injection #database-verification #kyle-kingsbury #verification
Interest Risk: When Maintenance Costs Compound Beyond Recovery
<cite index="26-2,26-3">Technical debt interest refers to extra maintenance costs incurred by the existence of TD items; too little interest and TD effects are negligible, too much and the system becomes unsustainable</cite>. <cite index="26-4">Interest generation can be considered as a risk with a metric to quantify it</cite>. <cite index="26-1">Empirical validation in an industrial setting revealed the Interest Generation Risk Index (IGRI) captures accurately the notion of urgency to fix issues as perceived by software engineers</cite>.
<cite index="25-1">A framework calculates the Technical Debt Breaking Point (TD-BP), a point in time where accumulated interest becomes larger than the principal, making debt no longer sustainable</cite>. <cite index="7-2,7-3,7-4">A study empirically analyzed additional work effort caused by technical debt in software projects, exploring how delaying debt repayment through refactoring influences long-term work effort using data from open-source and enterprise projects</cite>.
The research trajectory moves from static measurement to temporal forecasting. <cite index="8-3,8-5,8-7">Machine learning models empirically evaluated on 15 open-source projects can provide meaningful estimates of future TD evolution—the first study applying ML for TD forecasting</cite>. The methodological shift is from asking "how much do we owe" to "when does the system break." That requires tracking velocity decay over time, not just snapshot metrics.
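The breaking-point definition above reduces to simple arithmetic once interest is tracked per period. The sketch below assumes principal and interest are measured in the same unit (say, engineer-hours) and that per-period interest figures are available; `breaking_point` is a hypothetical name, not the cited framework's API.

```python
from itertools import accumulate
from typing import Iterable, Optional


def breaking_point(principal: float, interest_per_period: Iterable[float]) -> Optional[int]:
    """Return the 1-based period in which cumulative interest first exceeds principal.

    Returns None if that never happens within the supplied periods.
    """
    for period, total in enumerate(accumulate(interest_per_period), start=1):
        if total > principal:
            return period
    return None
```

For a debt item costing 100 hours to fix, with interest of 20, 30, 40, and 50 hours over four sprints, cumulative interest is 20, 50, 90, then 140, so the breaking point falls in the fourth sprint. Feeding in forecast values instead of observed ones turns the same function into the "when does the system break" question.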
Sources:
- https://link.springer.com/article/10.1007/s42979-020-00406-6
- https://www.academia.edu/11402128/An_empirical_model_of_technical_debt_and_interest
- https://arxiv.org/pdf/2502.16277
- https://www.sciencedirect.com/science/article/abs/pii/S0164121220301904
#technical-debt #interest-accumulation #breaking-point #forecasting #machine-learning #maintenance-cost #sustainability-risk #engineering-economics #methodology-debates
Principal and Interest: Testing the Metaphor Empirically
<cite index="18-2,18-3">The cornerstones of technical debt are principal and interest borrowed from economics, but no prior study validated the strength of the metaphor</cite>. <cite index="18-6,18-7">An empirical study using the Mantel test examined the relation between TD principal and interest and identified aspects that denote proximity of artifacts with respect to TD</cite>.
<cite index="18-1,18-11">Results suggest TD principal and interest are related—classes with similar principal levels tend to have similar interest levels—and aggregated measures are more capable of identifying proximate artifacts than isolated metrics</cite>. <cite index="21-3,21-4">Nugroho et al. (2011) operationalized debt principal as the cost to remediate all detected quality issues and interest as incremental maintenance cost; analysis of 44 production systems showed architectural debt accruing interest at rates three to five times higher than code-level debt</cite>.
The metaphor holds under empirical scrutiny, but the interest rate is not constant. It varies by debt type and artifact characteristics. <cite index="18-12">Empirical evidence suggests improving certain quality properties like size and coupling should be prioritized when ranking refactoring opportunities, as high values are typically related to artifacts with higher TD principal</cite>. The implication: you can measure the relationship between what you owe and what it costs you, but only if you instrument both the remediation pipeline and the maintenance overhead.
Sources:
- https://www.sciencedirect.com/science/article/pii/S0950584920301567
- https://www.researchgate.net/publication/228684782_An_Empirical_Model_of_Technical_Debt_and_Interest
#technical-debt #principal-interest-relationship #empirical-validation #architectural-debt #code-quality #maintenance-cost #quantification-models #engineering-economics #methodology-debates
SQALE: Remediation Cost as Common Currency
<cite index="10-1,10-2">The SQALE method provides a Quality Model and Analysis Model used to estimate quality and technical debt of source code</cite>. <cite index="16-4,16-5">The method normalizes static analysis tool reports by transforming them into remediation costs using either a remediation factor or remediation function</cite>. <cite index="13-5">This time is the principal associated with the debt item and is called the remediation cost</cite>.
<cite index="12-6,12-7,12-8">Remediation functions depend mainly on the sequence of activities needed to fix non-compliance; correcting presentation defects like bad indentation does not have the same effort cost as correcting structural code defects requiring new unit, integration, and regression tests</cite>. <cite index="13-10,13-11">The non-remediation cost estimates future additional costs such as extra work imposed on anyone working with the code that arise from technical debt</cite>.
The method conforms to ISO 9126 and delivers an index that can be expressed in work units, time, or money. <cite index="11-9">The method adds up remediation costs to calculate quality indicators</cite> rather than averaging them. This is the core insight: debt aggregates, it does not average. The engineering choice is which remediation sequences to calibrate and which quality model to enforce. Both require organizational context that no off-the-shelf tool provides.
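The additive index can be sketched in a few lines. The rule names and minute values below are invented for illustration; they are not SQALE's published remediation factors, and a real deployment would calibrate them to the team's own fix workflow.

```python
from typing import Dict, List

# Hypothetical remediation factors, in minutes per finding.
REMEDIATION_MINUTES: Dict[str, int] = {
    "bad-indentation": 2,        # presentation defect, trivial fix
    "missing-unit-test": 30,     # requires writing and running a new test
    "cyclic-dependency": 240,    # structural fix plus regression testing
}


def remediation_index(findings: List[str]) -> int:
    """Sum the remediation cost of every finding; the index is additive."""
    return sum(REMEDIATION_MINUTES[rule] for rule in findings)
```

Ten indentation findings plus one cyclic dependency give an index of 260 minutes. Averaging the same findings would report under 24 minutes per issue and hide the fact that one structural defect accounts for most of the debt.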
Sources:
- https://ieeexplore.ieee.org/document/6225997
- https://en.wikipedia.org/wiki/SQALE
- https://www.cutter.com/article/managing-technical-debt-sqale-method-490726
- https://www.researchgate.net/publication/239763591_The_SQALE_method_for_evaluating_Technical_Debt
#technical-debt #sqale #remediation-cost #iso-9126 #engineering-economics #quantification-models #static-analysis #methodology-debates
The Tooling Consensus Problem: Measurement Without Agreement
<cite index="2-3,2-4">Each technical debt assessment tool checks against a particular ruleset, and relying on a single tool produces diverse TD estimates and different mitigation actions that limit credibility</cite>. The Springer empirical benchmark study examined agreement among leading tools and found they measure fundamentally different things. This is not a calibration issue—it is a definitional one.
<cite index="4-2,4-3">A systematic mapping study identified approaches that quantify based on code smells, ROI of refactoring, comparing ideal versus current state, or comparing alternative development paths</cite>. <cite index="4-5">The problem is not being able to effectively compare and evaluate these approaches</cite>. The proposed Technical Debt Quantification Model (TDQM) attempts a uniform representation but the underlying issue persists: teams inherit the ontology of whichever vendor or research group wrote their static analysis pipeline.
<cite index="1-8,1-9">Technical debt prioritization research remains preliminary with no consensus on important factors or how to measure them, making current research inconclusive</cite>. The dispute is not technical; it is economic. Different measures answer different questions about risk, capacity, and compounding cost. Until teams understand which question they need answered, they will keep averaging incompatible numbers.
Sources:
- https://link.springer.com/article/10.1007/s10664-020-09869-w
- https://arxiv.org/pdf/2303.06535
- https://www.researchgate.net/publication/386270212_Technical_Debt_Measurement_An_Exploratory_Literature_Review
#technical-debt #static-analysis #measurement-disagreement #tooling-fragmentation #methodology-debates #empirical-validation #engineering-economics
Readiness Isn't a One-Time Gate; Systems Drift
<cite index="4-1,4-2,4-3">Production readiness isn't a one-time gate—systems change, traffic patterns shift, and new dependencies get added, so regular re-reviews ensure that a service that was production-ready six months ago is still production-ready today</cite>. The drift is structural, not exceptional. Teams leave. Dependencies update. Load patterns evolve.
<cite index="5-4,5-5">The need for safe, scalable, and reliable systems doesn't end at initial launch—code and the standards governing it change, which means software needs to be regularly evaluated for alignment</cite>. This ongoing validation is where most organizations fail. <cite index="4-11,4-12">A service passes the PRR at launch and then drifts out of compliance over the next year as changes accumulate, team members leave, and documentation goes stale</cite>.
Automation can help. <cite index="17-30,17-31,17-32">Automated checks for things like health endpoints, alert rule presence, dashboard existence, and runbook references can verify ongoing compliance—AI-driven platforms can continuously monitor for production readiness gaps as systems evolve, surfacing services where monitoring coverage is incomplete, alert rules are missing, or runbooks have gone stale, turning production readiness from a one-time launch gate into ongoing operational intelligence</cite>.
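A recurring check of this kind can be very small. The sketch below assumes each service is described by a metadata dictionary in some catalog; the field names and the `readiness_gaps` helper are hypothetical, and a production version would probe the endpoints rather than only checking that the fields are filled in.

```python
from typing import Dict, List

# Hypothetical catalog fields that every service is expected to declare.
REQUIRED_FIELDS = ("health_endpoint", "alert_rules", "dashboard_url", "runbook_url")


def readiness_gaps(service: Dict[str, object]) -> List[str]:
    """List required readiness fields that are missing or empty for a service."""
    return [field for field in REQUIRED_FIELDS if not service.get(field)]
```

Run on a schedule across the catalog, a check like this flags the service whose alert rules were deleted in a refactor or whose runbook link was never filled in, long after the launch review signed off.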
<cite index="6-3,6-5">The Production Readiness Review is a process that helps identify the reliability needs of a service, feature, or significant change to infrastructure, with the goal to make sure we have enough documentation, observability, and reliability for the feature, change, or service to run at production scale</cite>. GitLab recommends starting the readiness review process as early as possible as features progress through product maturity levels, not waiting until the last sprint.
Sources:
- https://neubird.ai/glossary/production-readiness/
- https://www.cortex.io/report/the-2024-state-of-software-production-readiness
- https://handbook.gitlab.com/handbook/engineering/infrastructure/production/readiness/
#production-readiness #operational-drift #continuous-compliance #quality-gates #operational-readiness #automation #infrastructure-review #methodology
Checklists Cover Seven Axes, Not Just Uptime
Readiness isn't a single dimension. <cite index="12-3,12-13">An open-source Production Readiness Checklist helps you assess your readiness against 7 key dimensions: Service Levels, Architecture Design Review, Performance, Documentation, Observability, Testing, and Deployment Strategy</cite>. Each dimension has clear acceptance criteria, not gut feel.
<cite index="17-15">A comprehensive checklist covers observability (logs, metrics, traces), alerting (SLO-based, actionable), reliability (health checks, graceful shutdown, retry logic), deployment and rollback procedures, capacity and scaling, security, and documentation including runbooks</cite>. These aren't aspirational—they're the operational machinery required to survive contact with production traffic.
Load testing matters. <cite index="1-3,1-4">Load testing simulates expected traffic patterns to understand how the service handles realistic load, and capacity planning validates that infrastructure can handle growth projections for the next 6-12 months without manual intervention</cite>. Stress testing goes further: <cite index="1-7">push the system beyond expected limits to find breaking points and understand degradation behavior</cite>.
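The capacity-planning half of that check is compound-growth arithmetic. This sketch assumes a constant monthly growth rate and a single capacity figure taken from load or stress testing; `first_month_over_capacity` and its parameters are hypothetical.

```python
from typing import Optional


def first_month_over_capacity(
    current_rps: float,
    capacity_rps: float,
    monthly_growth: float,
    horizon_months: int = 12,
) -> Optional[int]:
    """Return the first month (1-based) where projected load exceeds capacity.

    Returns None if capacity holds for the whole planning horizon.
    """
    load = current_rps
    for month in range(1, horizon_months + 1):
        load *= 1 + monthly_growth
        if load > capacity_rps:
            return month
    return None
```

A service at 1,000 requests per second with a tested ceiling of 1,500 and 10% monthly growth runs out of headroom in month 5, well inside a 6-12 month planning window.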
But technical checks aren't sufficient. <cite index="2-15,2-16">The readiness review should identify one or two business-level signals that will be watched during rollout—these should not replace technical metrics; they complement them</cite>. A checkout flow may have low error rates but worse completion. A recommendation system may return responses quickly but produce irrelevant results. <cite index="2-3">The real job of a production readiness review is to expose the operational assumptions that are still fragile before users, traffic, alerts, and on-call engineers discover them the hard way</cite>.
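The checkout example can be turned into a rollout guard that looks at one technical metric and one business signal together. The metric names and thresholds below are assumptions for illustration, not values from the cited sources.

```python
from typing import Dict


def rollout_healthy(
    baseline: Dict[str, float],
    canary: Dict[str, float],
    max_error_rate: float = 0.01,
    max_completion_drop: float = 0.02,
) -> bool:
    """Pass only if the canary is technically fine AND the business signal holds."""
    technical_ok = canary["error_rate"] <= max_error_rate
    completion_drop = baseline["completion_rate"] - canary["completion_rate"]
    business_ok = completion_drop <= max_completion_drop
    return technical_ok and business_ok
```

A canary with a 0.2% error rate but checkout completion falling from 60% to 52% fails this guard, even though every technical dashboard would show green.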
Sources:
- https://www.cortex.io/post/how-to-create-a-great-production-readiness-checklist
- https://www.momentslog.com/development/how-to-design-a-production-readiness-review-that-actually-prevents-launch-surprises
- https://www.linkedin.com/pulse/you-ready-production-stephen-thair
- https://neubird.ai/glossary/production-readiness/
#production-readiness #operational-readiness #observability #capacity-planning #load-testing #quality-gates #sre-checklist #methodology
Risk-Proportional Reviews: Not Every Service Needs the Same Gate
<cite index="4-7,4-8,4-9,4-10">It's proportional to risk—not every service needs the same level of review; a customer-facing payment service requires more rigorous review than an internal batch processing job, and Google uses a tiered approach based on the service's criticality and blast radius</cite>. This risk calibration matters because treating every deployment the same either bottlenecks low-risk changes or under-scrutinizes critical ones.
<cite index="3-2">A Production Readiness Review (PRR) is required for application system releases or infrastructure changes that have a high operational risk associated with implementation</cite>, as documented in Federal Student Aid's process. Medium-risk changes follow a different path. The process becomes a function of consequence, not ceremony.
<cite index="5-2,5-3">Production readiness review (PRR) is a set of checks used to mark when software is considered secure, scalable, and reliable enough for use, and while PRRs are unique to the engineering teams compiling them, most include things like adequate testing and security coverage, connection to CI/CD tools, and detailed rollback protocol</cite>. The specifics vary, but the underlying principle holds: <cite index="17-4">Google's Production Readiness Review (PRR) established the industry standard—collaborative, criteria-based, proportional to risk, and ongoing rather than one-time</cite>.
The tiered structure lets teams scale readiness checks without grinding velocity to zero. High-blast-radius services get deep SRE review. Internal tooling gets automated checks and lighter scrutiny. The gate flexes to the risk profile.
Sources:
- https://neubird.ai/glossary/production-readiness/
- https://studentaid.gov/sites/default/files/fsawg/static/gw/docs/ciolibrary/PRR_Process.pdf
- https://www.cortex.io/report/the-2024-state-of-software-production-readiness
#production-readiness #risk-management #operational-readiness #quality-gates #tiered-review #blast-radius #infrastructure-methodology #methodology

Google's PRR: The Standard Gate Before SRE Ownership
<cite index="9-6,9-7">The Production Readiness Review can be started at any point of the service lifecycle, but the stages at which SRE engagement is applied have expanded over time</cite>. Google's starting point was the Simple PRR Model, which later evolved into the Early Engagement Model and then into frameworks and an SRE platform. The original approach was limited: <cite index="9-19">the Simple Production Readiness Review was only applicable to services that had already entered the Launch phase</cite>.
<cite index="15-1,15-2">During an SRE entrance review (SER), also referred to as a Production Readiness Review (PRR), the SRE team takes the measure of a service currently running in production, assessing how the service would benefit from SRE ownership and identifying service design, implementation and operational deficiencies</cite>. The SRE looks at the service as-is and asks: "If I were on-call for this service right now, what are the problems I'd want to fix?"
<cite index="15-6,15-9,15-10">The SRE entrance review typically produces a prioritized list of issues with the service that need to be fixed, with four main axes of improvement: extant bugs, reliability, automation and monitoring/alerting—on each axis there will be blockers and others which would be beneficial to solve but not critical</cite>. Not all deficits block onboarding; some architectural changes can take months.
The shift came with frameworks. <cite index="9-23">Frameworks for production services were developed to meet demand: code patterns based on production best practices were standardized and encapsulated in frameworks, so that use of frameworks became a recommended, consistent, and relatively simple way of building production-ready services</cite>. This moved readiness left, baking it into development rather than retrofitting post-build.
Sources:
- https://sre.google/sre-book/evolving-sre-engagement-model/
- https://cloud.google.com/blog/products/gcp/how-sres-find-the-landmines-in-a-service-cre-life-lessons
#production-readiness #google-sre #operational-readiness #sre-engagement #quality-gates #infrastructure-review #framework-methodology #methodology

What gets omitted and why the numbers lie
<cite index="1-2,1-3">When you say '99.99% of observations show an X msec or better latency', the reader isn't expecting that to mean '99.99% of the good observations were that good'; they expect it to mean that 99.99% of all random, uncoordinated attempts were that good.</cite> But that's not what closed-loop tools measure.
<cite index="7-3,7-4,7-5">Using a typical load generator, for a first phase there will be 10,000 measurements at 1 msec each; in the second phase the result will be 1 measurement of 100 seconds.</cite> <cite index="7-10,7-11,7-12">The results look perfect, but the results are a lie; the bad results from the second phase are being ignored—that's the 'coordinated omission' part.</cite> <cite index="26-15,26-16">The CO methodology problem amounts to dropping or ignoring bad results from your data set before computing summary statistics on them, and reporting very wrong stats as a result; the stats can often be orders of magnitude off.</cite>
<cite index="8-1,8-2,8-3">Coordinated omission has been classified into two categories, load generation and latency measurement, with four solutions in two pairs: queuing or queueless, and correction or simulation; the best implementation involves a static schedule with queuing and latency correction.</cite> The correction can be applied post-hoc if you know the intended rate, but prevention is simpler: use open-loop tooling from the start.
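The post-hoc correction can be sketched roughly as follows, assuming the intended interval between requests is known. The names `correct_for_omission` and `percentile` are illustrative and not taken from any of the cited tools; the back-filling loop follows the same idea as HdrHistogram's `recordValueWithExpectedInterval`. The sample data reuses the two-phase example (10,000 measurements at 1 msec, then one of 100 seconds) with an assumed intended rate of 100 requests per second, that is, one request every 10 ms.

```python
def percentile(samples, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) using integers only
    return ordered[rank - 1]


def correct_for_omission(samples_ms, expected_interval_ms):
    """Back-fill the samples a stalled closed-loop generator never sent.

    For every recorded latency longer than the intended interval, add the
    latencies the skipped requests would have seen: value - interval,
    value - 2 * interval, and so on down to one interval.
    """
    corrected = []
    for value in samples_ms:
        corrected.append(value)
        missing = value - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected


# Assumed data: 10,000 samples at 1 ms, then a single 100-second sample.
raw = [1] * 10_000 + [100_000]
corrected = correct_for_omission(raw, expected_interval_ms=10)

print("p99 as recorded:", percentile(raw, 99), "ms")
print("p99 corrected:  ", percentile(corrected, 99), "ms")
```

On this data the recorded p99 is 1 ms and the corrected p99 is 98,000 ms. The correction is only as good as the assumed interval, which is one reason preventing the omission at the load generator is the simpler path.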
Sources:
- https://groups.google.com/g/mechanical-sympathy/c/icNZJejUHfE/m/BfDekfBEs_sJ
- https://highscalability.com/your-load-generator-is-probably-lying-to-you-take-the-red-pi/
- https://www.scylladb.com/2021/04/22/on-coordinated-omission/
#coordinated-omission #latency-measurement #percentiles #testing-methodology #performance-evaluation #measurement-bias #benchmarking

Open-loop testing: constant rate independent of response time
<cite index="18-1,18-2">Compared to the closed model, the open model decouples VU iterations from the iteration duration; the response times of the target system no longer influence the load on the target system.</cite> <cite index="20-2">In open-model tests, requests are submitted as soon as they arrive as discrete events to the system.</cite> <cite index="16-17">Currently active agents in an open model system would also be impacted by slowdowns, but the crucial difference is that new agents continue to arrive at the system.</cite>
This maps to how actual production traffic behaves. Users do not wait for the previous user's request to finish before hitting your API. <cite index="23-3,23-4">The key difference between the workload generated using common load-test tools and the workload generated by actual web-based users can be characterized as the difference in the pattern of requests arriving at the SUT; virtual users in a typical load-testing environment generate a synchronous arrival pattern, whereas web-based users generate an asynchronous arrival pattern.</cite>
<cite index="33-2,33-3">An open workload generator will not suffer from coordinated omission; an example of open-loop load generation is sending requests at a constant rate, regardless of whether previous requests completed, an approach Gil Tene pioneered in wrk2 and that has since been adopted by other tools such as Vegeta and autocannon.</cite> The tooling exists. The question is whether the people running the tests know the difference.
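A minimal simulation of the open model, not the implementation of wrk2 or any other tool named here. It assumes a single FIFO server, integer milliseconds, and hypothetical numbers: 10 requests per second, 50 ms of service time, and one 5-second stall. The function name `open_loop_latencies` is illustrative. The point it shows is that arrivals follow a fixed schedule and latency is measured from the scheduled send time, so queueing behind the stall is counted.

```python
def open_loop_latencies(interval_ms, duration_ms, service_ms,
                        stall_start_ms, stall_len_ms):
    """Simulate constant-rate arrivals into a single FIFO server."""
    stall_end_ms = stall_start_ms + stall_len_ms
    latencies = []
    server_free_at = 0
    # The schedule is fixed up front; response times never change it.
    for intended in range(0, duration_ms, interval_ms):
        start = max(intended, server_free_at)   # wait behind earlier requests
        if stall_start_ms <= start < stall_end_ms:
            start = stall_end_ms                # the server does no work during the stall
        finish = start + service_ms
        server_free_at = finish
        latencies.append(finish - intended)     # measured from the intended send time
    return latencies


latencies = open_loop_latencies(
    interval_ms=100, duration_ms=30_000, service_ms=50,
    stall_start_ms=10_000, stall_len_ms=5_000,
)
slow = [x for x in latencies if x > 50]
print(len(latencies), "requests,", len(slow), "slower than 50 ms, worst",
      max(latencies), "ms")
```

Under these assumptions all 300 scheduled requests are recorded, and 100 of them are slower than the normal 50 ms: the ones that arrived during the stall plus the ones that queued while the backlog drained.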
Sources:
- https://grafana.com/docs/k6/latest/using-k6/scenarios/concepts/open-vs-closed/
- https://stormforge.io/blog/open-closed-workloads/
- https://www.artillery.io/blog/load-testing-workload-models
- https://arxiv.org/pdf/1607.05356
#open-loop #workload-modeling #load-generation #testing-methodology #performance-evaluation #constant-arrival-rate #benchmarking

Closed-loop testing: when response time throttles request rate
<cite index="16-12,16-13">Closed model systems have a fixed number of agents executing workloads, and new work is only scheduled for an agent when it is done with the previous one.</cite> This is the default for most legacy tools—Apache Bench, wrk, many JMeter configurations. <cite index="20-2">In closed-model tests, a request is only submitted when the system under test is ready to process it.</cite>
The problem is subtle but structural. <cite index="16-14,16-15,16-16">The system under test coordinates the test itself; if the SuT is slowing down or stalls, then the entire test is impacted. During this time, all agents waiting for responses also stall, no new requests are made and load is taken away from the SuT which in turn allows the system to recover.</cite> <cite index="18-3">When the target system is stressed and starts to respond more slowly, a closed model load test will wait, resulting in increased iteration durations and a tapering off of the arrival rate of new VU iterations.</cite>
<cite index="6-1,6-3">Coordinated omission is a measurement issue that can happen when an open system is tested with a closed workload generator.</cite> Closed-loop testing makes sense for certain workloads—database connection pools, for example, where concurrency is bounded by design. But when you're trying to simulate independent user arrivals, the closed loop turns backpressure into erasure.
Sources:
- https://stormforge.io/blog/open-closed-workloads/
- https://www.artillery.io/blog/load-testing-workload-models
- https://grafana.com/docs/k6/latest/using-k6/scenarios/concepts/open-vs-closed/
#closed-loop #workload-modeling #testing-methodology #performance-evaluation #load-generation #coordinated-omission #benchmarking

Coordinated omission: when the system under test rigs the test
<cite index="2-1">The term coordinated omission was coined by Gil Tene around 2013</cite>, and it describes a measurement artifact that makes benchmarks catastrophically optimistic. <cite index="3-6,3-7">Coordinated omission occurs when the load generator is not able to accurately create a workload representative of real world traffic; there is a 'Coordination' from the System Under Test applying indirect back pressure to the load driver, that causes the load driver to 'Omit' any number of valid results.</cite>
The mechanism is straightforward. <cite index="4-3,4-4,4-5,4-6">If a load tester waits to send a request until the previous one has completed, and if the load tester is testing 10 req/s and a request normally takes 50ms, each request will return before the next one is due to be sent; however if the whole system occasionally pauses for 5 seconds, the load tester would not send any requests during this 5 second period, and the load test would record a single bad outlier that took 5 seconds.</cite> <cite index="4-7,4-8">If the load tester was firing requests consistently then it would have made 50 requests during the 5 second pause time; these requests were omitted, and if these requests were made during the pause time, then the latency percentiles would look very different.</cite>
<cite index="2-8">One load test reported a p99 of 47 ms, but the same release showed a p99 of 1.8 seconds in production, a 38× regression that the test never surfaced.</cite> <cite index="3-9">Response time metrics measured with tools that suffer from Coordinated Omission are not merely misleading; they are wrong.</cite> The percentiles you get look clean because you only measured the system when it was ready to be measured.
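The mechanism can be reproduced with a small simulation. This is a sketch under stated assumptions, not the behavior of any specific tool: one paced virtual user that targets 10 req/s but never sends before the previous response returns, a 50 ms service time, one 5-second pause starting at the 10-second mark, and a 30-second run. The names `closed_loop_latencies` and `percentile` are illustrative.

```python
def percentile(samples, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) using integers only
    return ordered[rank - 1]


def closed_loop_latencies(interval_ms, duration_ms, service_ms,
                          stall_start_ms, stall_len_ms):
    """Simulate one paced virtual user that blocks on each response."""
    stall_end_ms = stall_start_ms + stall_len_ms
    latencies = []
    send = 0
    while send < duration_ms:
        start = send
        if stall_start_ms <= start < stall_end_ms:
            start = stall_end_ms              # the request sits out the pause
        finish = start + service_ms
        latencies.append(finish - send)       # measured from the actual send time
        # The next send waits for the response, so the pause suppresses sends.
        send = max(send + interval_ms, finish)
    return latencies


latencies = closed_loop_latencies(
    interval_ms=100, duration_ms=30_000, service_ms=50,
    stall_start_ms=10_000, stall_len_ms=5_000,
)
slow = [x for x in latencies if x > 50]
print(len(latencies), "samples,", len(slow), "slow, p99 =",
      percentile(latencies, 99), "ms, max =", max(latencies), "ms")
```

Under these assumptions the run records 251 samples instead of the 300 a fixed schedule would have produced, exactly one of them is slow, and p99 stays at 50 ms even though the system was unavailable for a sixth of the test.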
Sources:
- https://idle-ti.me/blog/coordinated-omission/
- https://redhatperf.github.io/post/coordinated-omission/
- https://github.com/artilleryio/artillery/discussions/1472
#coordinated-omission #performance-evaluation #testing-methodology #latency-measurement #benchmarking #tail-latency

Capacity decisions require performance targets, not just utilization
<cite index="14-6,14-11,14-12">In practice, systems are designed around service-level agreements with targets like a specified percentage of customers served within a target time—for example, 80% of calls answered within 20 seconds</cite>. Utilization alone doesn't tell you whether the system meets those guarantees.
<cite index="6-1,6-3,6-4">Queueing theory is a mathematical tool for capacity planning and optimization of production, manufacturing, or logistics systems, with one application being service capacity optimization</cite>. <cite index="7-1,7-3">Capacity is controlled through mean service rate at each node in systems modeled as networks of queues</cite>. The question isn't "how busy are my servers" but "what's the probability a request waits longer than my SLA."
<cite index="10-3,10-7">Models evaluate response time, queue length, throughput, and resource utilization</cite>. <cite index="10-15,10-16">For scalability and high throughput, M/M/c is often the most practical choice for many real-world scenarios, balancing accuracy and complexity</cite>. <cite index="20-1,20-2">By estimating arrival rate (demand) and desired cycle time, managers can determine the optimal work-in-process level that maximizes throughput while minimizing wait times</cite>.
<cite index="22-2,22-3">In production lines with variability, Little's Law can quantify and eliminate bottlenecks; for example, an automotive assembly plant can identify which workstation is the bottleneck, and increasing capacity there could reduce WIP and improve throughput</cite>. The same applies to microservice architectures: instrument each hop, apply Little's Law per service, identify where latency accumulates, then add capacity precisely where the constraint lives.
Sources:
- https://fiveable.me/stochastic-processes/unit-8/mm1-mmc-queues/study-guide/0OzdE5UrvJQbLoYX
- https://www.mmscience.eu/journal/issues/October%202019/articles/optimisation-of-service-capacity-based-on-queueing-theory
- https://www.sciencedirect.com/science/article/pii/089571779500090O
- https://ijres.org/papers/Volume-13/Issue-6/1306321325.pdf
- https://www.6sigma.us/six-sigma-in-focus/littles-law-applications-examples-best-practices/
- https://www.plantservices.com/maintenance-mindset/article/55237244/maintenance-mindset-littles-law-and-lean-manufacturing-a-formula-for-operational-excellence
#capacity-planning #performance-engineering #sla-design #queueing-theory #bottleneck-analysis #service-optimization #utilization #methodology

Little's Law connects concurrency, latency, and arrival rate
<cite index="21-3">The law holds under broad stability conditions: long-run averages must exist, the system boundary must be consistent, and effective arrival rate must match the population whose time in system is being measured</cite>. <cite index="23-16">In a stable system, the average number of items present equals the average arrival rate multiplied by the average time spent in the system</cite>—expressed as L = λW.
<cite index="18-1,18-2,18-3">If your API handles 500 requests per second with 200ms average response time, Little's Law gives you 100 concurrent requests in flight at any moment</cite>. That's the number your connection pools and worker threads need to accommodate. <cite index="18-4,18-5,18-6">Apply the law independently to each layer—API gateway, application servers, database pool, downstream services—because a bottleneck in any layer creates backpressure upstream, and the system's actual throughput ceiling is the minimum across all layers</cite>.
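As a rough sketch, the per-layer calculation looks like this. The layer names, slot counts, and latencies are hypothetical, and `in_flight` and `throughput_ceiling` are illustrative names. The second function rearranges L = λW into λ_max = L_max / W, treating each layer's concurrency limit as its L_max.

```python
def in_flight(arrival_rate_per_s, latency_ms):
    """Little's Law, L = lambda * W, with W given in milliseconds."""
    return arrival_rate_per_s * latency_ms / 1000


def throughput_ceiling(layers):
    """Return (layer name, max requests/s) for the most constrained layer.

    Each layer is (concurrency slots, average latency in ms); its ceiling
    is slots / latency, and the system ceiling is the minimum of those.
    """
    limits = {
        name: slots * 1000 / latency_ms
        for name, (slots, latency_ms) in layers.items()
    }
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck]


# Hypothetical layers: name -> (concurrency slots, average latency in ms)
layers = {
    "api_gateway": (400, 20),
    "app_servers": (120, 200),
    "db_pool": (30, 40),
}

print(in_flight(500, 200))          # 100.0 concurrent requests
print(throughput_ceiling(layers))   # ('app_servers', 600.0)
```

With these made-up numbers the application tier caps the system at 600 requests per second, so adding database connections or gateway capacity would not raise the ceiling.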
<cite index="18-7,18-8,18-9">The law only applies to stable queues where arrival rate doesn't permanently exceed service rate (λ < μ); when load exceeds capacity, queues grow without bound, which is why 95th-percentile latency explodes suddenly at saturation</cite>. <cite index="21-8,21-9,21-10">In manufacturing, Little's Law connects work-in-process, throughput, and cycle time; if a line completes 200 units per day and average WIP is 100 units, each unit spends an average of half a day in the line</cite>. The same logic applies to software pipelines, CI/CD queues, and request-handling tiers.
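To make the saturation cliff concrete, here is a sketch under a strong simplifying assumption: a single M/M/1 queue, where time in system is exponentially distributed with rate μ − λ, so the 95th percentile is ln(20) / (μ − λ). Real services are rarely this tidy; the function name `mm1_p95_seconds` and the 100 req/s service rate are hypothetical.

```python
import math


def mm1_p95_seconds(arrival_rate, service_rate):
    """95th-percentile time in system for an M/M/1 FCFS queue, in seconds."""
    if arrival_rate >= service_rate:
        # Outside the stable region the queue grows without bound.
        raise ValueError("unstable: arrival rate must stay below service rate")
    # -ln(1 - 0.95) = ln(20)
    return math.log(20) / (service_rate - arrival_rate)


service_rate = 100  # hypothetical: one worker that completes 100 requests/s
for arrival_rate in (50, 80, 90, 95, 99):
    p95_ms = mm1_p95_seconds(arrival_rate, service_rate) * 1000
    print(f"utilization {arrival_rate}%: p95 = {p95_ms:.0f} ms")
```

In this model p95 goes from about 60 ms at 50% utilization to about 3 seconds at 99%, with most of the increase arriving in the last few percent of load.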
<cite index="25-4,25-5,25-6">Throughput of the entire system cannot be higher than the throughput of the slowest step, bottlenecks cause the whole system to suffer, and spare capacity elsewhere should go toward resolving the bottleneck first</cite>.
Sources:
- https://systemdr.systemdrd.com/p/capacity-planning-modeling-using
- https://en.wikipedia.org/wiki/Little's_law
- https://www.interlakemecalux.com/blog/littles-law
- https://getnave.com/blog/kanban-littles-law/
#littles-law #capacity-planning #performance-engineering #concurrency #latency #bottleneck-analysis #stability-conditions #methodology

Queueing theory models system saturation, not just throughput
<cite index="1-3,1-4">Capacity planning balances cost and performance, and queueing theory analyzes this by modeling arrival, service, and departure patterns</cite>. The distinction matters because <cite index="3-8">queueing theory provides deeper understanding of system performance and client experience compared to strictly rate-based approaches</cite>.
The methodology centers on characterizing a system's components—<cite index="1-10">arrival rate, service rate, number of servers, queue discipline, and capacity</cite>—then deriving performance metrics. <cite index="1-11">Key outputs include average waiting time, average queue length, utilization, probability of blocking, and probability of abandonment</cite>. These aren't decorative; they're the variables you need when deciding whether to add capacity or accept degraded latency.
<cite index="11-3,11-4">The M/M/c queue describes a system where arrivals follow a Poisson process, there are c servers, and service times are exponentially distributed</cite>. <cite index="14-1,14-2">The Erlang C formula, one of the most widely used results in queueing theory, is used daily by call centers to staff agents</cite>. <cite index="14-16">Pooling servers in an M/M/c system generally yields shorter queues than having one fast server with the same total service capacity</cite>, which is one argument for horizontally scaled pools under bursty load. The comparison is about queueing delay: a single server that is c times faster still finishes each job sooner once service starts, so its mean time in system is lower.
<cite index="3-13,3-14">Many calculators are available; the math exists under the hood but you don't need to understand it to model capacity forecasting</cite>. The practical workflow: identify your system, measure arrival and service rates, plug values into a solver, then iterate on server count until you hit your SLA.
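For readers who want to see what such a solver does, here is a minimal sketch of the workflow for an M/M/c model. It assumes Poisson arrivals and exponential service times, which real traffic only approximates. The function names and the example numbers (200 requests per second, 20 requests per second per worker, 95% of requests queued for at most 10 ms) are hypothetical.

```python
import math


def erlang_c(servers, offered_load):
    """Probability that an arrival has to queue in an M/M/c system.

    offered_load is lambda / mu in Erlangs and must be below servers.
    """
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be below server count")
    # Erlang B via the standard recurrence, then convert to Erlang C.
    blocking = 1.0
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    utilization = offered_load / servers
    return blocking / (1 - utilization * (1 - blocking))


def service_level(servers, arrival_rate, service_rate, target_wait):
    """Fraction of arrivals whose queueing delay is at most target_wait."""
    p_wait = erlang_c(servers, arrival_rate / service_rate)
    drain_rate = servers * service_rate - arrival_rate
    return 1 - p_wait * math.exp(-drain_rate * target_wait)


def min_servers(arrival_rate, service_rate, target_wait, target_level):
    """Smallest server count that meets the service-level target."""
    servers = math.floor(arrival_rate / service_rate) + 1  # first stable count
    while service_level(servers, arrival_rate, service_rate,
                        target_wait) < target_level:
        servers += 1
    return servers


# Hypothetical workload: rates per second, wait target in seconds.
needed = min_servers(arrival_rate=200, service_rate=20,
                     target_wait=0.010, target_level=0.95)
print("workers needed:", needed)
```

The loop is the "iterate on server count" step: start at the smallest stable configuration and add workers until the modeled service level clears the target.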
Sources:
- https://www.linkedin.com/advice/3/how-can-you-use-queuing-theory-capacity-planning
- https://hackernoon.com/why-capacity-planning-needs-queueing-theory-without-the-hard-math-342a851e215c
- https://en.wikipedia.org/wiki/M/M/c_queue
- https://fiveable.me/stochastic-processes/unit-8/mm1-mmc-queues/study-guide/0OzdE5UrvJQbLoYX
#capacity-planning #queueing-theory #performance-engineering #methodology #erlang-c #m-m-c-queue #resource-allocation #sla-design

Profiling as the Missing Fourth Pillar
<cite index="5-4,5-5,5-6">Three telemetry signals are the foundational pillars of observability—metrics, logs, and traces—but for modern observability, these might not be enough; Elastic proposes a new, fourth pillar: profiling</cite>. <cite index="7-16,7-17">Profiles are a complementary technique to the three pillars; while not officially considered a "fourth pillar," profiling can provide valuable insights into the performance and behavior of a system</cite>.
<cite index="9-21,9-22,9-23,9-24">While logs, metrics, and traces are sufficient for identifying availability issues, they often fail to pinpoint performance issues at the code level; a metric might tell you CPU is at 100%, a trace might tell you which service is slow, but neither tells you why</cite>. <cite index="9-28">This gap is filled by continuous profiling, increasingly regarded as the "fourth pillar" of observability</cite>. <cite index="9-29,9-30,9-32">Traditional profiling tools introduce high overhead (10–50%), making them unsafe for production; eBPF allows running sandboxed programs in the Linux kernel, enabling profilers to sample stack traces at high frequency with extremely low overhead (<1%)</cite>.
<cite index="1-18,1-19">Some would argue that context, correlation and alerting are also pillars of observability; context enriches metrics, logs and traces by providing additional information about the network environment</cite>. The debate over what constitutes a pillar reflects the underlying tension: the framework itself may constrain rather than clarify how engineers understand production systems.
Sources:
- https://www.elastic.co/blog/3-pillars-of-observability
- https://www.eginnovations.com/blog/the-three-pillars-of-observability-metrics-logs-and-traces/
- https://medium.com/@QuarkAndCode/three-pillars-of-observability-metrics-logs-traces-beyond-205a82648114
- https://www.ibm.com/think/insights/observability-pillars
#observability #profiling #methodology-debates #performance-analysis #ebpf #continuous-profiling #code-level-visibility #operations

Observability 2.0: Unified Storage vs. Multiple Pillars
<cite index="21-10,21-11">Charity Majors calls the metrics-logs-traces generation "observability 1.0," while tools built on arbitrarily-wide structured log events and a single source of truth are "observability 2.0"</cite>. <cite index="21-13,21-14">This is a backwards-incompatible breaking change—you cannot simultaneously store your data across both multiple pillars and a single source of truth</cite>. <cite index="24-9,24-11">Every observability startup founded before 2021 that still exists was built using the multiple pillars model, storing each signal in a different location with limited correlation ability; every startup founded after 2021 was built using unified storage, capturing wide structured log events in a columnar database</cite>.
<cite index="27-13,27-14">The major cost drivers in a multiple-pillars world are the number of tools you use, cardinality of your data, and dimensionality, meaning the amount of context and detail you store, which is the most valuable part; you get locked in a zero-sum game between cost and value</cite>. <cite index="22-19,22-20,22-21">A common criticism of the three-pillars model is the multiplier effect: the same data is stored multiple times</cite>. <cite index="22-24,22-25">Big-M Metrics tools are designed to handle low-cardinality data; adding high-cardinality data to metrics tools makes them very expensive, and world-class observability teams now spend the majority of their time governing cardinality</cite>.
<cite index="25-1,25-2">You can derive metrics, logs and traces from arbitrarily-wide structured events, but the reverse is not true</cite>.
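A small illustration of that asymmetry, using made-up events. The field names (`ts`, `service`, `route`, `status`, `duration_ms`, `user_id`, `trace_id`) and the three helper functions are hypothetical and do not reflect any vendor's schema or API.

```python
from collections import Counter

# Hypothetical wide events: one record per unit of work, many fields each.
events = [
    {"ts": 1, "service": "checkout", "route": "/pay", "status": 200,
     "duration_ms": 42, "user_id": "u1", "trace_id": "t1"},
    {"ts": 2, "service": "checkout", "route": "/pay", "status": 500,
     "duration_ms": 910, "user_id": "u2", "trace_id": "t2"},
    {"ts": 3, "service": "payments", "route": "/charge", "status": 200,
     "duration_ms": 30, "user_id": "u1", "trace_id": "t1"},
    {"ts": 4, "service": "checkout", "route": "/pay", "status": 200,
     "duration_ms": 39, "user_id": "u3", "trace_id": "t3"},
]


def to_metric(events):
    """Metric view: request count per (route, status)."""
    return Counter((e["route"], e["status"]) for e in events)


def to_log_lines(events):
    """Log view: one formatted line per event."""
    return [
        f'{e["ts"]} {e["service"]} {e["route"]} '
        f'status={e["status"]} {e["duration_ms"]}ms'
        for e in events
    ]


def to_trace(events, trace_id):
    """Trace view: the events sharing a trace_id, in time order."""
    matching = [e for e in events if e["trace_id"] == trace_id]
    return sorted(matching, key=lambda e: e["ts"])


metric = to_metric(events)
print(metric[("/pay", 500)])                           # 1
print(to_log_lines(events)[1])
print([e["service"] for e in to_trace(events, "t1")])  # ['checkout', 'payments']
```

Going the other way does not work: the counter for `("/pay", 500)` records that one request failed but has already discarded which user or trace it belonged to.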
Sources:
- https://www.honeycomb.io/blog/time-to-version-observability-signs-point-to-yes
- https://charity.wtf/category/observability/
- https://charity.wtf/tag/observability-2-0/
- https://newsletter.pragmaticengineer.com/p/observability-the-present-and-future
- https://www.honeycomb.io/blog/observability-5-year-retrospective
#observability #methodology-debates #data-architecture #cardinality #cost-engineering #storage-strategy #unified-telemetry #operations

Each Pillar Has Gaps; Integration Is the Actual Work
<cite index="1-16,1-17">Metrics alert teams to problems, traces show their path of execution, logs provide the context needed to resolve them—together they help accelerate issue identification and resolution</cite>. But the division creates brittle dependencies. <cite index="17-4,17-11">Metrics often provide limited context, so they generally require correlation with logs and traces to give developers a comprehensive understanding of system events</cite>. <cite index="4-18,4-19,4-20">Metrics aren't granular or detailed enough to identify exactly which service within a microservices architecture is triggering errors; metrics only show that the application is experiencing errors</cite>.
<cite index="13-24">One of the main criticisms is the tendency to use them in isolation, instead of as part of a holistic observability system</cite>. <cite index="14-7,14-8,14-16,14-17">These three separate lenses have inherent limitations; observability isn't just the ability to see each piece at a time but also to understand the broader picture and how these pieces combine</cite>. <cite index="6-11">Companies can deploy all three and find they haven't achieved all their observability objectives and certainly haven't solved real-world business problems</cite>.
<cite index="11-9,11-10">IT teams must sometimes contextualize logs, metrics and traces with data from other systems, such as ticketing or CI/CD pipeline performance, to gain the most complete picture; to focus on the three pillars alone risks overlooking other important sources of visibility</cite>.
Sources:
- https://www.ibm.com/think/insights/observability-pillars
- https://www.techtarget.com/searchitoperations/tip/The-3-pillars-of-observability-Logs-metrics-and-traces
- https://www.strongdm.com/blog/three-pillars-of-observability
- https://www.techtarget.com/searchitoperations/opinion/Dont-limit-observability-to-3-pillars
- https://www.o11ytime.com/the-three-pillars-revisited/
- https://thenewstack.io/how-the-3-pillars-of-observability-miss-the-big-picture/
#observability #methodology-debates #operations #correlation-problem #system-integration #distributed-systems

The Pillar Framework Sells Tools, Not Understanding
<cite index="1-3,2-3">The three-pillar model defines observability through logs, metrics, and traces</cite>—a framework that <cite index="21-2">Peter Bourgon proposed in 2018</cite>. <cite index="21-3,21-4">Vendors latched onto the language because they had metrics products, logging products, and tracing products to sell</cite>. <cite index="29-4,29-5">Logging, metrics, and tracing companies loved this definition and adopted it enthusiastically, pummeling engineers with "three pillars" marketing content</cite>.
<cite index="20-12,20-15">Ben Sigelman, ex-Googler and CEO at LightStep, presented "Three Pillars, Zero Answers" at KubeCon in December 2018, arguing that many organizations may need to rethink their approach</cite>. <cite index="20-9">Each pillar has "fatal flaws," and every system has tradeoffs</cite>. <cite index="26-3,26-9">Sigelman argues that three pillars are "just bits" and that the rather complex task of making sense of it is "left as an exercise to the reader"</cite>.
<cite index="23-2,23-30">Charity Majors states flatly: "Observability doesn't have three pillars, and it is not a monitoring tool"</cite>. <cite index="28-6,28-7,28-8">The framework is nonsense—traces, logs and metrics are data types, not pillars, like saying "software engineering is three pillars: strings, integers and arrays"</cite>. <cite index="15-1,15-2">This siloed approach is overly focused on technical instrumentation and underlying data formats; simply having systems emit all three data types doesn't guarantee better outcomes</cite>.
Sources:
- https://www.ibm.com/think/insights/observability-pillars
- https://www.crowdstrike.com/en-us/cybersecurity-101/observability/three-pillars-of-observability/
- https://www.infoq.com/news/2019/02/rethinking-observability/
- https://www.honeycomb.io/blog/time-to-version-observability-signs-point-to-yes
- https://mastersofdata.sumologic.com/public/39/Masters-of-Data-Podcast-851ac16a/c625fe73
- https://blog.mads-hartmann.com/sre/2020/01/11/journey-into-observability-telemetry.html
- https://kill3pill.com/
- https://thenewstack.io/observability-the-5-year-retrospective/
#observability #methodology-debates #vendor-positioning #three-pillars #framework-criticism #telemetry #operations

Hypothesis testing under controlled blast radius
The experimental structure is hypothesis-driven. <cite index="23-5,23-6">Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group; the harder it is to disrupt the steady state, the more confidence we have in the behavior of the system</cite>. <cite index="23-7">If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large</cite>.
The blast radius is the control mechanism. <cite index="16-9,16-10">You can use Gremlin's custom scenarios to recreate a past outage or to automate a sequence of attacks to iteratively grow the blast radius of a chaos engineering experiment; a custom scenario has no pre-built constructs that limit which failure you can inject</cite>. <cite index="27-1">While one chaos engineering principle states that experiments need to run in production, you should still start small and run your experiment in a non-production environment first, learn and adjust, and then gradually expand the scope</cite>.
Observability is the dependency that makes this work. <cite index="21-8,21-9">Without strong observability, chaos experiments produce noise instead of signal; you inject a fault, something goes wrong, and you can't tell whether the degradation matches your prediction or is unrelated to the experiment entirely</cite>. The methodology requires instrumentation, hypothesis formation, controlled failure injection, and then measurement of whether the steady state held.
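The loop of hypothesis, injection, and measurement can be sketched as below. Everything here is hypothetical scaffolding rather than the API of Gremlin or any other platform: `GroupStats`, `steady_state_held`, and `expand_blast_radius` are illustrative names, the steady-state signal is assumed to be error rate, and `run_step` stands in for whatever injects the fault at a given scope and returns the observed numbers.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def steady_state_held(control, experiment, max_error_rate_delta):
    """Hypothesis: the experiment group's error rate stays within
    max_error_rate_delta of the control group's."""
    return experiment.error_rate - control.error_rate <= max_error_rate_delta


def expand_blast_radius(percent_steps, run_step, control, max_error_rate_delta):
    """Grow the experiment scope step by step and stop at the first step
    where the steady state breaks.

    Returns the largest scope (in percent) that held, or None if none did.
    """
    largest_ok = None
    for percent in percent_steps:
        experiment = run_step(percent)  # hypothetical: inject fault, observe
        if not steady_state_held(control, experiment, max_error_rate_delta):
            break
        largest_ok = percent
    return largest_ok


# Made-up observations: the system copes at 1% and 5% scope, not at 25%.
def fake_run_step(percent):
    return GroupStats(requests=10_000, errors=15 if percent <= 5 else 300)


control = GroupStats(requests=10_000, errors=12)
print(expand_blast_radius([1, 5, 25], fake_run_step, control, 0.001))  # 5
```

Stopping at the first broken step keeps the damage bounded, and the last scope that held becomes the starting point once the weakness is fixed.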
Sources:
- https://principlesofchaos.org/
- https://aws.amazon.com/blogs/apn/how-gremlins-chaos-engineering-platform-validates-aws-operational-excellence-and-reliability/
- https://cloud.google.com/blog/products/devops-sre/getting-started-with-chaos-engineering
- https://uptimelabs.io/learn/what-is-chaos-engineering/
#chaos-engineering #hypothesis-testing #observability #blast-radius #reliability-testing #experimentation #production-safety #reliability-engineering #testing-methodology

Gremlin as controlled fault injection infrastructure
<cite index="11-2,11-3">Fault injection is a technique for creating controlled failure in a computing component such as a host, container, or service; by observing how components respond to failure, engineering teams can build them to be more resilient</cite>. Gremlin operationalizes this as a commercial platform. <cite index="12-1">It involves injecting faults into systems such as high CPU consumption, network latency, or dependency loss, observing how systems respond, then using that knowledge to make improvements</cite>.
The platform provides targeting and safety controls. <cite index="13-1,13-2">Gremlin helps inject failure in a safe, controllable way; generally you would install the Gremlin agent on your servers</cite>. <cite index="15-1,15-2">Gremlin Fault Injection lets you run custom chaos engineering faults such as packet loss, process termination, disk I/O usage, and more; you can also create custom fault injection workflows for cascading failures</cite>.
The failure modes mirror production conditions. <cite index="13-6,13-7">Chaos engineering or failure injection does not necessarily need to kill things or break services; introducing latency and seeing how your application will behave is common</cite>. <cite index="14-3,14-6">Network issues are often the most realistic failures you'll encounter in production</cite>, including latency injection, packet loss, and DNS attacks. This is the tooling layer between the methodology and the infrastructure.
Sources:
- https://www.gremlin.com/technologies/fault-injection
- https://www.gremlin.com/chaos-engineering
- https://konghq.com/blog/engineering/gremlin-chaos-engineering
- https://schathurangaj.medium.com/chaos-engineering-with-gremlin-your-guide-to-breaking-things-on-purpose-e807683a3dc2
- https://www.gremlin.com/kubernetes-chaos-engineering
#gremlin #fault-injection #chaos-engineering #reliability-testing #failure-simulation #production-readiness #infrastructure-testing #reliability-engineering #testing-methodology

Vary real-world events, prioritize by impact and frequency
The second principle: <cite index="1-16,1-17">chaos variables reflect real-world events, prioritizing events either by potential impact or estimated frequency</cite>. This is not arbitrary sabotage. <cite index="23-4">Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed</cite>. <cite index="5-1">A recent study reported that 92% of catastrophic system failures were the result of incorrect handling of non-fatal errors</cite>, which points to where the discipline should focus.
Netflix evolved from single-instance termination to region-scale disruption. <cite index="2-2,2-3">Chaos Monkey, the first of Netflix's Simian Army tools, randomly terminates production instances, forcing engineers to design systems that can withstand constant disruption</cite>. <cite index="2-11">They have since expanded this to include Chaos Kong, which simulates an entire Amazon EC2 region failure, and Failure Injection Testing (FIT), which causes requests between services to fail</cite>. The progression is methodical: can the system survive losing one server, then one availability zone, then one region?
<cite index="10-9">Netflix asked: what if instances were terminated during business hours, when engineers are available and can respond quickly?</cite> This is the operationalization of the methodology: run experiments when you can learn from them, not when they will only cause damage.
Sources:
- https://www.infoq.com/news/2015/09/netflix-chaos-engineering/
- https://principlesofchaos.org/
- https://arxiv.org/pdf/1702.05843
- https://medium.com/@tahirbalarabe2/what-is-chaos-engineering-chaos-by-design-fad9e39ab5e0
- https://techhq.com/news/how-netflix-pioneered-chaos-engineering/
#chaos-engineering #failure-injection #netflix #chaos-monkey #fault-tolerance #production-testing #reliability-engineering #testing-methodology
Steady state as the contract, not the components
<cite index="1-9">Netflix defines chaos engineering as experimenting on a distributed system to build confidence in its capability to withstand turbulent conditions in production</cite>. The first principle is to build a hypothesis around steady-state behavior, observed for the system as a whole. <cite index="1-10,1-11">Focus on the measurable output of a system rather than internal attributes; measurements of that output over a short period constitute a proxy for the system's steady state</cite>. <cite index="1-12,1-13">Throughput, error rates, and latency percentiles could all be metrics of interest representing steady state behavior</cite>.
This is not a test of individual services. <cite index="1-14">By focusing on systemic behavior patterns during experiments, chaos verifies that the system does work, rather than trying to validate how it works</cite>. You define what normal looks like from the outside—streams per second for Netflix, checkouts per minute for e-commerce—then you inject failure and see if that contract holds. <cite index="23-2,23-3">Start by defining steady state as some measurable output that indicates normal behavior, then hypothesize that this steady state will continue in both the control group and the experimental group</cite>.
The engineering consequence: if a feature service fails but the user can still stream video, availability has not degraded. <cite index="5-6,5-7">Each feature is implemented by a different service, each of which can potentially fail, yet even if one internal service fails this does not necessarily impact overall system availability</cite>.
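A toy version of the steady-state hypothesis, assuming a hypothetical streaming request that depends on a recommender service. Every name here (`start_stream`, `recommend`, `FALLBACK_ROWS`) is invented for illustration; the point is only that the metric is measured from the outside, for a control group and an experimental group.

```python
FALLBACK_ROWS = ["popular-1", "popular-2"]  # hypothetical static fallback content


class RecommenderDown(Exception):
    """Raised when the (hypothetical) recommender service is unavailable."""


def recommend(user_id, inject_failure):
    if inject_failure:
        raise RecommenderDown(user_id)
    return [f"personalized-for-{user_id}"]


def start_stream(user_id, inject_failure=False):
    """User-visible request: degrade to fallback rows rather than fail."""
    try:
        rows = recommend(user_id, inject_failure)
    except RecommenderDown:
        rows = FALLBACK_ROWS
    return {"stream_started": True, "rows": rows}


def steady_state(responses):
    """Externally measurable output: fraction of requests that started a stream."""
    return sum(r["stream_started"] for r in responses) / len(responses)


def hypothesis_holds(control, experiment, tolerance=0.01):
    """The steady state should match in both groups, within a tolerance."""
    return abs(steady_state(control) - steady_state(experiment)) <= tolerance


control = [start_stream(u) for u in range(100)]
experiment = [start_stream(u, inject_failure=True) for u in range(100)]
holds = hypothesis_holds(control, experiment)
```

The check never inspects the recommender. It only asks whether streams started at the same rate in both groups, which is what "verifies that the system does work" means in practice.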
Sources:
- https://www.infoq.com/news/2015/09/netflix-chaos-engineering/
- https://principlesofchaos.org/
- https://arxiv.org/pdf/1702.05843
#chaos-engineering #reliability-engineering #observability #steady-state-hypothesis #system-design #distributed-systems #testing-methodology
Writing for the wrong reader breaks the translation
<cite index="23-1,23-2,23-3">When explaining work in writing, you must keep your audience in mind; the most common mistake in young researchers is writing for people who know exactly what the author knows, and exercises like the Morning Paper format are designed to break that habit.</cite> This is the core methodological insight embedded in Colyer's approach: the synthesis artifact only has value if someone who wasn't in the room can use it.
<cite index="23-6,23-7,23-8">The Morning Paper's writings were designed for those from all walks of life, assumed no a priori expertise, and allowed readers to learn about various areas of computer science even without reading the complete papers.</cite> That's a high bar. It means you can't assume the reader knows the problem space, the prior art, or the terminology. You have to rebuild context.
The format enforces honesty. If you can't explain why a paper matters to someone outside your immediate domain, you probably don't understand why it matters inside your domain either. <cite index="14-5,14-6,14-7,14-8">A good research paper distills significant work: years of code and experiments thrown away to produce a single paper, with authors trying to boil down everything they learned into something digestible, where a single sentence may represent a full year of misdirected effort.</cite> The synthesis writer's job is to extract that distillation and make it actionable. The constraint of writing for a general technical audience (engineers, not researchers) forces you to identify what's mechanically useful versus what's academically interesting. Those aren't always the same thing.
Sources:
- https://www.seltzer.com/margo/teaching/CS508.22/morningPaper.html
- https://read.seas.harvard.edu/cs260r/2022/on-reading/
#practitioner-translation #audience-awareness #technical-writing #research-communication #synthesis-methodology #cross-disciplinary #research-synthesis #systems-research
Conference proceedings as a navigational lattice
<cite index="15-2,15-3,15-4,15-5">Colyer recommended starting with the reading lists of computer science courses in areas of interest—what papers appear in courses, what gets recommended—as a canon begins to emerge.</cite> <cite index="15-6,15-7">Then you realize which main conferences come around in different areas, mark up a calendar at the start of the year for when they occur, and go through proceedings when they come out to identify papers for a reading list.</cite> <cite index="15-8">Eventually you learn which research groups do interesting work regardless of conference.</cite>
This is a graph traversal strategy, not a search strategy. You're not trying to answer a specific question; you're building a map of the territory. The conferences—SOSP, OSDI, VLDB, SIGMOD for systems work—act as temporal checkpoints. New proceedings arrive; you scan titles and abstracts; you queue a few. <cite index="18-6,18-7,18-8">A lot of research reading involves going through the graph of references: most papers include references, and as you read you note which to follow up on, forming a directed acyclic graph going into the past.</cite>
The method scales because it's bounded. You're not reading everything; you're sampling with intent. <cite index="17-9,17-10,17-21">The Morning Paper mixed past papers and current research results, covering a wide range of topics with a bias toward distributed systems and data.</cite> That bias is the filter. Without it, the intake becomes noise. The question isn't what's new; it's what's new in the areas where you're building intuition about how systems actually break.
Sources:
- https://www.hashicorp.com/en/resources/a-chat-with-the-morning-paper-author-adrian-colyer
- https://brooker.co.za/blog/2020/05/25/reading.html
- https://www.infoq.com/minibooks/emag-the-morning-paper-3/
#research-discovery #conference-proceedings #systems-research #graph-traversal #reading-list-curation #filter-strategy #research-synthesis #practitioner-translation
Daily digest as a forcing function for synthesis
<cite index="17-7,17-18">Adrian Colyer's Morning Paper ran for over six years on a simple model: every weekday, he took a computer science research paper and wrote it up as a post.</cite> <cite index="5-2,5-3">He wrote descriptions of three computer science papers each week, drawing on all areas of computer science.</cite> The two sources describe the cadence differently (every weekday versus three papers a week), but the mechanical constraint mattered either way. A fixed, near-daily cadence meant he couldn't optimize for depth in any single domain; he had to develop a repeatable process for extraction.
<cite index="17-11,17-22">The blog grew from his habit of reading research papers during his commute—he figured they'd give him more lasting value than the newspapers his fellow commuters were reading.</cite> <cite index="20-3,20-4">Colyer described himself as an expert in none of the topics he covered; his insight came from reading a lot of papers across sub-disciplines and seeing a lot of companies through his role as a venture partner.</cite> That dual exposure—academic work and production systems—created the translation layer.
<cite index="16-19,16-20,16-21">Colyer noted a common misconception about a gulf between academic work and practical concerns, but his bias was to select papers with immediate relevance to practitioners—work that opened eyes to what's possible and suggested practical steps.</cite> The format itself enforced a certain discipline: you write the digest whether you feel expert or not, and the repetition builds the muscle. <cite index="23-6,23-15">His writings were designed for those from all walks of life and assumed no a priori expertise.</cite> That's a constraint that forces clarity. You can't hide behind jargon when your reader might be coming from a different sub-field entirely.
Sources:
- https://www.infoq.com/minibooks/emag-the-morning-paper-3/
- https://www.seltzer.com/margo/teaching/CS508.22/morningPaper.html
- https://blog.acolyer.org/about/
- https://www.heavybit.com/library/podcasts/the-secure-developer/ep-17-security-research-with-the-morning-papers-adrian-colyer
#research-synthesis #daily-practice #practitioner-translation #distributed-systems #commute-reading #cross-domain-learning #systems-research
The canonical reference on systems paper failures from 1983
USENIX conference sites still recommend a 1983 paper by Roy Levin and David Redell, the SOSP-9 program chairs, titled <cite index="30-1">"An Evaluation of the Ninth SOSP Submissions; or, How (and How Not) to Write a Good Systems Paper"</cite>. <cite index="28-2">The ten committee members quickly agreed on the disposition of over 80% of the papers</cite>, which suggests the failure modes were legible and recurring.
<cite index="25-1,25-4">In the hope of raising the quality of future SOSP submissions and systems papers generally, the committee decided to describe the criteria used in evaluating papers, pointing out common problems that appear repeatedly in technical papers in a way that would make it easier for future authors to avoid them</cite>. The fact that this document is still referenced forty years later indicates either that the failure modes are stable or that the field has agreed on a particular framing of rigor.
The paper is positioned as both diagnostic and corrective, written for <cite index="25-12,25-13">prospective authors for the 10th SOSP or for TOCS, asking what questions authors should be asking themselves as they write</cite>. The questions are the review criteria, made explicit.
Sources:
- https://www.usenix.org/guidelines-authors
- https://sigops.org/s/conferences/sosp/2013/submission.html
- https://ben.edu/wp-content/uploads/2022/06/How-and-How-Not-to-Write-a-Good-Systems-Paper.pdf
#research-methodology #systems-research #peer-review #writing-guidance #historical-context
Ethics review happens at the same level as technical review
<cite index="10-4,10-5">When the PC has concerns about the ethics of work in a submission, the PC will have its own discussion of the ethics of that work, and the PC's review process may examine the ethical soundness of the paper just as it examines the technical soundness</cite>. Authors must attest compliance with institutional standards, but <cite index="10-4">submitting research for approval by one's institution's ethics review body is necessary, but not sufficient</cite>.
This creates a parallel track. The technical reviewers are also the ethics reviewers, applying their own judgment about what constitutes acceptable research practice. <cite index="14-4,14-5">In cases where the PC has concerns, the PC will have its own discussion of the ethics of that work, examining ethical soundness just as it examines technical soundness</cite>. There is no handoff to a separate body.
USENIX Security goes further. <cite index="19-4,19-5">Reviewers are asked to evaluate the ethics of all submissions, and authors are expected to complete a stakeholder-based ethics analysis or justify an alternative approach</cite>. The implication: a paper can be technically flawless and ethically unacceptable, and the same people make both calls.
Sources:
- https://www.usenix.org/conference/nsdi26/call-for-papers
- https://www.usenix.org/conference/nsdi27/call-for-papers
- https://www.usenix.org/conference/usenixsecurity26/call-for-papers
#research-methodology #peer-review #research-ethics #systems-research #program-committee
Rebuttals are for clarification, not new experiments
<cite index="4-11,4-12">Authors can respond to reviews by correcting factual errors or directly addressing questions posed by reviewers, limited to clarifying the submitted work</cite>. The constraints are strict. <cite index="4-13">Responses must not include new experiments or data, describe additional work completed since submission, or promise additional work to follow</cite>. This is structural: the PC is evaluating what was done, not what could be done.
<cite index="4-15,4-16">The response can be up to 2000 words, though a shorter and crisper response is often advantageous</cite>. NSDI introduced a different mechanism for papers that need more than clarification. <cite index="11-1,11-2">For revise-and-resubmit decisions, reviewers primarily judge whether authors satisfied the requests accompanying the revision decision, and should avoid rejecting for non-fatal concerns they could have raised during the first round</cite>. <cite index="11-4">Unlike shepherding, revision instructions may include running additional experiments that obtain specific results, e.g., comparing performance against a certain alternative and beating it by at least 10%</cite>.
The difference matters. A rebuttal operates within the paper's existing perimeter. A revision can expand it, but only for papers the committee believes are structurally sound.
Sources:
- https://www.usenix.org/conference/osdi23/call-for-papers
- https://www.usenix.org/conference/nsdi25/call-for-papers
#peer-review #rebuttal-process #systems-research #revision-process #conference-procedures #research-methodology
What counts as relevant enough for a single-track program
<cite index="1-1,1-4">OSDI reviewers evaluate submissions based on topic relevance to computer systems and potential to impact future research and practices</cite>, but there is a filter above technical merit. <cite index="3-11">Submissions must demonstrate relevance and offer unique insights to capture the interest of a substantial portion of OSDI attendees</cite>. This is the consequence of the single-track format—every accepted paper occupies a fixed slot in front of the entire community.
The shift became explicit in 2025. <cite index="3-9">Starting from OSDI '25, there was a deliberate focus on selecting papers that offer significant contributions to computer systems research and align with community interests</cite>. What this means in practice: <cite index="1-6">Papers with little overlap with the program committee's interests are less likely to be accepted</cite>. Not wrong, not weak—just addressing a bottleneck the PC does not recognize as urgent.
<cite index="2-3">The explicit criteria across OSDI cycles are novelty, significance, interest, clarity, relevance, and correctness</cite>. Interest is doing work here that significance does not. A paper can contribute to knowledge without warranting interruption of everyone's conversation.
Sources:
- https://www.usenix.org/conference/osdi26/call-for-papers
- https://www.usenix.org/conference/osdi25/call-for-papers
- https://www.usenix.org/conference/osdi18/call-for-papers
#research-methodology #peer-review #systems-research #conference-scope #acceptance-criteria #single-track-format
Reproducibility as a baseline requirement for benchmark validity
<cite index="17-1">Repositories can facilitate data reuse and benchmark reproducibility by ensuring salient metadata is provided for datasets and benchmark evaluations</cite>. <cite index="17-13">Benchmark reproducibility enables verification of published results, provides a starting point for experimentation and follow-up work, and makes contributions easier for others to use</cite>. This should be obvious, but the literature suggests it isn't.
<cite index="20-13,20-14">Specific reproducibility measures are essential to ensure benchmark validity over time, with primary challenges revolving around ambiguity in the evaluation workflow: which version of a dataset or model is being used, which version of the code is executing, and whether saved results can be reliably loaded and trusted</cite>. Without version pinning, you're not measuring the same thing twice. When someone publishes a benchmark result without publishing the exact dependencies, the benchmark becomes a claim, not a measurement. The TPC understands this: verification is part of its process. If you're using something outside TPC, ask whether the benchmark is even reproducible before you ask whether it's representative.
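One way to make the pinning concrete is a run manifest that records the three ambiguities named above: dataset version, code version, and dependency versions. This is a sketch under assumptions, not a standard format; the field names, the `same_experiment` helper, and the sample values are all hypothetical.

```python
import hashlib
import json
import platform


def fingerprint(data: bytes) -> str:
    """Content hash, so 'which version of the dataset' has an exact answer."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(dataset_bytes, code_version, dependencies, result):
    """Record what was measured alongside what it was measured with."""
    return {
        "dataset_sha256": fingerprint(dataset_bytes),
        "code_version": code_version,               # e.g. a VCS commit id
        "dependencies": dict(sorted(dependencies.items())),
        "python": platform.python_version(),
        "result": result,
    }


def manifest_id(manifest):
    """Stable identifier for a manifest, derived from its canonical JSON form."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return fingerprint(canonical)


def same_experiment(a, b):
    """Two results are comparable only if every pinned input matches."""
    keys = ("dataset_sha256", "code_version", "dependencies", "python")
    return all(a[k] == b[k] for k in keys)


# Hypothetical values, for illustration only.
deps = {"examplelib": "1.2.3"}
run_a = build_manifest(b"rows-v1", "abc123", deps, {"qps": 1000})
run_b = build_manifest(b"rows-v1", "abc123", deps, {"qps": 990})
run_c = build_manifest(b"rows-v2", "abc123", deps, {"qps": 1400})
```

Here `run_a` and `run_b` are two measurements of the same thing. `run_c` is not, however good its number looks, because the dataset changed underneath it.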
Sources:
- https://arxiv.org/html/2410.24100v1
- https://arxiv.org/pdf/2506.21182
#reproducibility #benchmarking #methodology #performance-evaluation #verification #version-control
Static benchmarks and the scalability gap
<cite index="10-3,10-4">Existing AI benchmarks evaluate HPC system performance under predefined problem sizes in terms of datasets and models, but due to lack of scalability, static benchmarks might be inadequate to help understand performance trends of evolving applications on large-scale systems</cite>. The criticism applies beyond AI workloads. When a benchmark fixes its problem size, it fixes what you learn.
<cite index="9-3,9-4,9-5">System-level performance testing evaluates overall performance under real-world workload scenarios, simulating user interactions to assess response times, throughput, and resource utilization through four phases: defining workload, preparing environment, executing tests, and analyzing results</cite>. But if the workload doesn't scale, neither does your understanding. The performance curve at 1 GB tells you nothing about the curve at 1 TB unless the architecture is embarrassingly linear. It usually isn't. A benchmark that doesn't parameterize scale is a benchmark that measures one point on a surface. You're left guessing about the gradient.
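A small illustration of why one point is not enough, using operation counts instead of wall-clock time so the numbers are deterministic. The two cost functions are toy stand-ins (a naive nested-loop join and an idealized hash join), and `scaling_exponent` is a hypothetical helper that estimates the slope on a log-log plot.

```python
import math


def nested_loop_join_cost(n):
    """Comparisons performed by a naive nested-loop self-join over n rows."""
    comparisons = 0
    for _ in range(n):
        for _ in range(n):
            comparisons += 1
    return comparisons


def hash_join_cost(n):
    """Idealized hash join: one build pass plus one probe pass."""
    return 2 * n


def scaling_exponent(sizes, costs):
    """Slope of log(cost) against log(size) between the smallest and largest run."""
    rise = math.log(costs[-1]) - math.log(costs[0])
    run = math.log(sizes[-1]) - math.log(sizes[0])
    return rise / run


sizes = [10, 100, 1000]
nested_exp = scaling_exponent(sizes, [nested_loop_join_cost(n) for n in sizes])
hashed_exp = scaling_exponent(sizes, [hash_join_cost(n) for n in sizes])
```

At the smallest size the two plans are within an order of magnitude of each other. Only the sweep shows that one grows quadratically and the other linearly, which is the gradient a fixed-size benchmark cannot report.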
Sources:
- https://arxiv.org/pdf/2212.03410
- https://arxiv.org/pdf/2408.08148
#scalability #benchmarking #performance-evaluation #methodology #workload-testing #hpc-systems
Synthetic workloads and the divergence problem
<cite index="2-8">The goal of synthetic workload construction is to minimize divergence in execution statistics—performance metrics and operator distributions—from the real workload trace</cite>. When proprietary traces can't be shared, researchers build proxies. <cite index="2-7,2-11">They construct synthetic workloads from existing benchmarks like TPC-H and TPC-DS, selecting percentages of queries from each to approximate real traces</cite>.
The method matters because the fit is imperfect. <cite index="2-13">Selecting all queries from a single benchmark without filtering lacks a fine-grained mechanism to select queries relevant to real workloads</cite>. This is the core challenge: you need execution statistics that match production, but you can't publish production queries. The compromise is mathematical—minimize statistical divergence using integer programming to combine benchmark components—but it's still a model of a system, not the system. If someone claims performance on TPC-H, ask what percentage of their actual query mix it represents. The answer is often zero.
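The selection problem can be sketched in a few lines. This is not the cited paper's formulation: it swaps the integer program for an exhaustive search over small integer counts, uses L1 distance as a stand-in divergence measure, and every number below is made up rather than taken from TPC-H, TPC-DS, or any real trace.

```python
from itertools import product

# Hypothetical per-query operator fractions: (scan, join, aggregate).
CANDIDATES = {
    "bench_a_q1": (0.7, 0.1, 0.2),
    "bench_a_q2": (0.2, 0.6, 0.2),
    "bench_b_q1": (0.3, 0.3, 0.4),
}
# Hypothetical operator mix summarized from a private production trace.
TARGET = (0.4, 0.35, 0.25)


def mix_distribution(counts):
    """Operator distribution of a workload holding counts[i] copies of query i."""
    total = sum(counts)
    profiles = list(CANDIDATES.values())
    return tuple(
        sum(c * p[k] for c, p in zip(counts, profiles)) / total
        for k in range(len(TARGET))
    )


def divergence(dist):
    """L1 distance between a candidate mix and the target mix."""
    return sum(abs(d - t) for d, t in zip(dist, TARGET))


def best_mix(max_copies=5):
    """Try every integer combination up to max_copies per query; keep the closest."""
    best = None
    for counts in product(range(max_copies + 1), repeat=len(CANDIDATES)):
        if sum(counts) == 0:
            continue  # an empty workload has no distribution
        score = divergence(mix_distribution(counts))
        if best is None or score < best[0]:
            best = (score, counts)
    return best


score, counts = best_mix()
```

Exhaustive search only works at toy scale, which is why a real formulation would reach for integer programming. The structure is the same: integer decision variables for how many of each query to include, and an objective that measures distance from the trace's statistics.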
Sources:
- https://www.arxiv.org/pdf/2506.16379
#synthetic-workloads #benchmarking #workload-modeling #performance-evaluation #tpc-h #database-systems #methodology
TPC benchmarks as the standard template, not the full story
<cite index="6-1,6-2">The Transaction Processing Performance Council issues standard benchmarks, verifies their correct application, and regularly publishes performance test results</cite>. <cite index="6-3">Classical TPC benchmarks share variants of a business database schema and are parameterized only by a scale factor determining database size</cite>. The methodology is rigorous but narrow: <cite index="6-10,6-11,6-12">TPC-C has been in use since 1992, featuring nine table types with diverse structures and a workload of concurrent transactions, with throughput as the performance metric</cite>.
What this means for anyone evaluating a new system: TPC gives you a baseline, not a workload. <cite index="5-5,5-6">Fixed benchmarks may fall short in representing varied, specific workloads and data characteristics unique to every user, and the nuances of real-world applications might not be fully captured</cite>. The lineage is clear (<cite index="1-13,1-14,1-15">DebitCredit evolved into TPC-A then TPC-C for OLTP, and TPC-CH was introduced to bridge OLTP and OLAP systems</cite>), but evolution is slow. If you're building on TPC-C results, you're building on a schema that models an early-1990s wholesale supplier. Know what that excludes.
Sources:
- https://arxiv.org/pdf/1701.08634
- https://arxiv.org/html/2405.01312
- https://arxiv.org/pdf/2508.07551
#benchmarking #tpc #performance-evaluation #methodology #workload-modeling #oltp #database-systems
Hydroflow: Dataflow as IR for Distributed Programs
Hellerstein's more recent work extends the declarative dataflow approach into a compiler infrastructure. <cite index="7-1,7-2">Hydroflow is a new cloud programming model used to create constructively correct distributed systems, as a refinement and unification of the existing dataflow and reactive programming models</cite>. <cite index="7-3">Hydroflow is primarily a low-level compilation target for future declarative cloud programming languages, but developers can use it directly to precisely control program execution or fine-tune and debug compiled programs</cite>.
The approach treats dataflow as an intermediate representation. <cite index="4-13">Hydroflow is intended as an "LLVM IR for distributed programs," designed to provide a simple and clear execution model that can be leveraged as a target of higher-level languages</cite>. <cite index="9-5">The foundation of the Hydro stack is Hydroflow, a Rust-based dataflow runtime with an IR based on algebraic dataflow</cite>.
This shifts the abstraction level. Rather than writing distributed systems in general-purpose languages and reasoning about execution order manually, you compile declarative specifications down to a dataflow IR that preserves correctness properties. The model separates concerns: high-level languages express intent, the IR enables transformations, and the runtime handles distributed execution.
Sources:
- https://hydro.run/research/
- https://www.researchgate.net/publication/221313291_Dedalus_Datalog_in_Time_and_Space
- https://isg.ics.uci.edu/event/joseph-hellerstein-uc-berkeley-hydro-a-compiler-stack-for-distributed-programs/
#hydroflow #dataflow-ir #compiler-infrastructure #rust #declarative-compilation #distributed-systems #data-infrastructure #database-architecture #processing-frameworks
Bloom: Disorderly Programming as a Feature
<cite index="10-5,10-6">Bloom is a distributed programming language that is amenable to high-level consistency analysis and encourages order-insensitive programming, presented as a prototype implementation as a domain-specific language in Ruby</cite>. <cite index="11-1">The BOOM project produced Bloom, which enabled writing complex distributed programs in simple, intuitive ways—with tens or hundreds of times less code than traditional languages</cite>.
The language is declarative by design. <cite index="10-7">Unlike Overlog, Bloom is purely declarative: the syntax of a program contains the full specification of its semantics, and there is no need for the programmer to understand or reason about the behavior of the evaluation engine</cite>. <cite index="11-3,11-4">The BOOM project developed a new programming model for distributed computers that helps programmers avoid specifying the steps of a computation in a particular order, instead focusing on the information that the program must manage and the way that information flows through machines and tasks</cite>.
<cite index="16-4">Building on the CALM Theorem, Bloom supports a powerful new programming analysis framework for analyzing the correctness and consistency of distributed programs</cite>. The work demonstrates that declarative, data-centric languages can target distributed execution with formal guarantees.
Sources:
- https://dsf.berkeley.edu/jmh/calm-cidr-short.pdf
- https://vcresearch.berkeley.edu/news/profile/joe_hellerstein
- https://learn.microsoft.com/en-us/shows/lang-next-2012/bloom-disorderly-programming-distributed-world
- https://boom.cs.berkeley.edu/
#bloom-language #declarative-programming #disorderly-programming #distributed-systems #dsl #data-centric #data-infrastructure #database-architecture #processing-frameworks
CALM: When Coordination Is Provably Unnecessary
CALM stands for Consistency As Logical Monotonicity. <cite index="10-4">The CALM principle connects the idea of distributed consistency to program tests for logical monotonicity</cite>. The property at stake is confluence: a confluent operation produces the same outputs for any ordering of its inputs. <cite index="2-10">Confluence can be applied to individual operations, components in a dataflow, or even entire distributed programs</cite>.
This matters because coordination is expensive. <cite index="11-2">If you force order, the machines spend all their time coordinating, and progress is limited by the slowest machine</cite>. The theorem provides a test: if a program is monotonic, it will produce consistent results without coordination, regardless of message ordering or timing.
<cite index="13-4,13-5">Bloom makes set-oriented, monotonic (and hence confluent) programming the easiest constructs for programmers to work with in the language, contrasting with imperative languages where assignment and explicit sequencing—two non-monotone constructs—are the most natural building blocks</cite>. <cite index="13-6">Bloom can leverage static analysis based on CALM to certify when programs provide the state-based convergence properties provided by CRDTs</cite>. The work bridges database theory and distributed systems practice, offering a formal basis for reasoning about when you can avoid the costs of consensus.
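The contrast between set accumulation and assignment can be checked by brute force. This is plain Python, not Bloom, and the function names are invented for illustration: it replays the same three messages in every possible delivery order and collects the distinct final states.

```python
from itertools import permutations


def merge_set(state, msg):
    """Monotonic update: the state only grows, and union ignores arrival order."""
    return state | {msg}


def overwrite(state, msg):
    """Non-monotonic update: assignment discards whatever arrived earlier."""
    return msg


def outcomes(update, initial, messages):
    """Distinct final states reachable across every delivery order."""
    results = set()
    for order in permutations(messages):
        state = initial
        for msg in order:
            state = update(state, msg)
        results.add(state)
    return results


messages = ["a", "b", "c"]
set_outcomes = outcomes(merge_set, frozenset(), messages)
register_outcomes = outcomes(overwrite, None, messages)
```

The set-based replica converges to a single state no matter how the network reorders messages, so it needs no coordination. The register ends up holding whichever message arrived last, so agreeing on one value means agreeing on an order, and that is where consensus costs come in.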
Sources:
- https://dsf.berkeley.edu/jmh/calm-cidr-short.pdf
- https://arxiv.org/pdf/1901.01930
- https://vcresearch.berkeley.edu/news/profile/joe_hellerstein
#calm-theorem #confluence #monotonicity #consistency-analysis #coordination-avoidance #distributed-systems #data-infrastructure #database-architecture #processing-frameworks
Dedalus: Time as the Missing Dimension in Datalog
<cite index="19-2">Recent research explored using Datalog-based languages to express distributed systems as a set of logical invariants</cite>, but distributed systems have two properties that make this difficult. <cite index="25-4,25-5">First, the state of any system evolves with its execution; second, deductions may be arbitrarily delayed, dropped, or reordered by unreliable network links</cite>.
Hellerstein and colleagues developed Dedalus to address this gap. <cite index="10-8">Bloom is based on a formal temporal logic called Dedalus</cite>, and <cite index="12-17">the key insight in Dedalus is that distributed programming is about time, not about space</cite>. The work extends Datalog with temporal semantics to model state evolution and network uncertainty.
<cite index="25-1">Experience implementing full-featured systems in variants of Datalog suggests that Dedalus is well-suited to the specification of rich distributed services and protocols, and provides both cleaner semantics and richer tests of correctness</cite>. The approach builds on database theory—<cite index="6-17">a relational query engine takes a declarative SQL statement, validates it, optimizes it into a procedural dataflow implementation plan</cite>—and applies that model to distributed execution.
Sources:
- https://link.springer.com/chapter/10.1007/978-3-642-24206-9_16
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-173.html
- https://boom.cs.berkeley.edu/
- https://courses.cs.washington.edu/courses/cse444/23wi/readings/anatomy-dbs.pdf
#datalog #temporal-logic #distributed-systems #formal-semantics #database-theory #declarative-programming #data-infrastructure #database-architecture #processing-frameworks
The book's positioning in the design literature canon
<cite index="8-4,8-5">The book first introduces the fundamental problem in software design, which is managing complexity, then discusses philosophical issues about how to approach the software design process, and presents a collection of design principles to apply during software design</cite>. <cite index="27-2,27-7,27-8,27-9">There is a significant difference between Ousterhout's book and most books on software design: its repeatability. While almost all software architecture books are based on real-world experiences of experienced developers, those are not repeatable experiences; Ousterhout, on the other hand, had the vantage point of having multiple teams solve the same design problem during a semester, with him observing, and he had the luxury of repeating this experiment multiple times, which allowed him to both validate and tweak his observations</cite>. The book has become a modern canonical text on complexity management. <cite index="26-4,26-15">Stanford professor Ousterhout believes that great software design is becoming even more important as AI tools become more capable in generating code</cite>. <cite index="6-3">Ousterhout makes a compelling, logical case for addressing complexity in software with simple designs augmented with clearly communicated documentation, rather than prioritizing getting something working as fast as possible</cite>.
Sources:
- https://www.amazon.com/Philosophy-Software-Design-John-Ousterhout/dp/1732102201
- https://blog.pragmaticengineer.com/a-philosophy-of-software-design-review/
- https://newsletter.pragmaticengineer.com/p/the-philosophy-of-software-design
- https://goodreads.com/book/show/39996759-a-philosophy-of-software-design
#software-design #complexity-theory #design-philosophy #software-engineering #canonical-texts #systems-thinking
Deep modules as the fundamental unit of abstraction
<cite index="22-10,22-11">A module is considered deep when its implementation is significantly more complex than its interface; in other words, modules should offer their clients a simple interface while providing a complex implementation behind it</cite>. <cite index="20-2">Deep modules maximize the volume of capabilities within an interface's implementation while minimizing the cognitive load of the interface itself</cite>. <cite index="25-7">The file I/O interface provided by Unix is a good example of a deep interface: the API only has a few system calls (open, read, write, seek, close), but hides a huge amount of complexity around implementation of files, directories, permissions, concurrent access, etc.</cite> Ousterhout contrasts this with shallow modules, which have a complex interface but not much functionality. <cite index="4-1">He says that the conventional wisdom is to write small components (keeping the LoC low in each method) rather than deep components, but this results in large numbers of shallow classes and methods, which add to overall system complexity</cite>. The core principle: <cite index="1-10">it is much more important for your module to have a simple interface than to have a simple implementation</cite>. This means <cite index="1-8,1-9">pulling the complexity downwards and handling it inside the module, striving to make life as easy as possible for your users, even if it makes your own life harder</cite>.
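A small sketch of what "pulling complexity downwards" looks like in code. `SettingsStore` is a hypothetical class, not an example from the book: callers see two methods, while file handling, encoding, missing-file defaults, and atomic replacement stay inside the module.

```python
import json
import os
import tempfile


class SettingsStore:
    """Deep-module sketch: a two-method interface over a JSON file on disk."""

    def __init__(self, path):
        self._path = path

    def get(self, key, default=None):
        return self._load().get(key, default)

    def put(self, key, value):
        data = self._load()
        data[key] = value
        self._atomic_write(data)

    # Everything below is hidden from callers.
    def _load(self):
        try:
            with open(self._path, encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError:
            return {}  # a missing file simply means no settings yet

    def _atomic_write(self, data):
        # Write to a temp file in the same directory, then swap it into place,
        # so a crash mid-write cannot leave a half-written settings file.
        directory = os.path.dirname(self._path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp_path, self._path)


with tempfile.TemporaryDirectory() as workdir:
    store = SettingsStore(os.path.join(workdir, "settings.json"))
    before = store.get("theme", "light")
    store.put("theme", "dark")
    after = store.get("theme")
```

A shallow alternative would expose `open_file`, `parse`, `serialize`, and `rename_temp` as separate public steps, which pushes the sequencing and the failure handling onto every caller.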
Sources:
- https://softengbook.org/articles/deep-modules
- https://www.nagarro.com/en/blog/deep-module-low-complexity-software-design
- https://www.mattduck.com/2021-04-a-philosophy-of-software-design.html
- https://www.janmeppe.com/blog/a-philosophy-of-software-design-john-ousterhout/
- https://embeddeduse.com/2020/10/10/book-review-a-philosophy-of-software-design-by-john-ousterhout/
#deep-modules #abstraction #interface-design #information-hiding #encapsulation #systems-thinking #software-design #complexity-theory
Strategic versus tactical programming as development philosophies
<cite index="12-5">Ousterhout argues that a tactical mindset is focused on getting something working, but makes it nearly impossible to produce good system design</cite>. <cite index="15-2,15-4">Tactical programming focuses on shipping many features and fixes as fast as possible without planning for the long-term design, which results in a highly complex codebase filled with technical debt, as shortcuts, band-aid solutions, and bad practices become the norm</cite>. <cite index="11-10">By contrast, strategic programming's primary goal is to produce a great design, which also happens to work</cite>. <cite index="10-15,10-17">This requires an investment mindset where time is spent on improving system design and fixing problems, even if it slows down short-term progress; proactive and reactive investments lead to continuous improvements and long-term development speed</cite>. Notably, <cite index="18-4,18-6">Ousterhout suggests that agile development tends to focus developers on features, not abstractions, and encourages developers to put off design decisions in order to produce working software as soon as possible, which is against an investment approach and encourages tactical programming</cite>. He's critical of TDD for similar reasons: <cite index="16-9,16-10">it focuses attention on getting specific features working, rather than finding the best design, which is tactical programming pure and simple</cite>.
Sources:
- https://lethain.com/notes-philosophy-software-design/
- https://jgarivera.com/posts/tactical-strategic-programming/
- https://marcobacis.dev/blog/philosophy-of-software-design/
- https://dev.to/markadel/a-philosophy-of-software-design-summary-pk9
- https://dev.to/thawkin3/lessons-from-a-philosophy-of-software-design-4cn7
- https://smlx.dev/posts/book-review-ousterhout-philosophy-software-design/
#strategic-programming #tactical-programming #software-design #agile-development #tdd #technical-debt #design-philosophy #systems-thinking #complexity-theory
Complexity as a function of dependencies and obscurity
<cite index="3-25">Ousterhout defines complexity as anything related to the structure of a software system that makes it hard to understand and modify</cite>. He gives it three symptoms: <cite index="4-13">change amplification, where a seemingly simple change requires code modifications in many different places</cite>; <cite index="4-14">cognitive load, where a developer needs to know a large number of things in order to complete a task</cite>; and unknown unknowns, where it is not obvious what needs to be known. The two causes he identifies are dependencies and obscurity. <cite index="3-3">A dependency exists when a given piece of code cannot be understood and modified in isolation</cite>. Dependencies are unavoidable, but <cite index="1-6">they must be managed to a certain degree, otherwise you end up with the well-known big ball of mud</cite>. <cite index="2-8">Obscurity creates unknown unknowns, and also contributes to cognitive load</cite>. Ousterhout argues that <cite index="2-11,2-12">if every developer takes a permissive approach to small bits of complexity, it accumulates rapidly, and once it has accumulated, it is hard to eliminate, since fixing a single dependency or obscurity will not make a big difference</cite>. His prescription is a zero-tolerance philosophy toward complexity.
Sources:
- https://sive.rs/book/PoSD
- https://www.mattduck.com/2021-04-a-philosophy-of-software-design.html
- https://www.janmeppe.com/blog/a-philosophy-of-software-design-john-ousterhout/
- https://speakerdeck.com/philipschwarz/the-nature-of-complexity-in-john-ousterhouts-philosophy-of-software-design
#complexity-theory #systems-thinking #software-design #dependencies #technical-debt #cognitive-load
The monolith tax: tight coupling and two-week release trains
<cite index="19-1,19-2,19-3">Since Netflix's Reloaded modules were often co-located in the same repository, it was easy to overlook code-isolation rules and there was quite a bit of unintended reuse of code across what should have been strong boundaries—such reuse created tight coupling and reduced development velocity, forcing all modules to deploy together.</cite> <cite index="19-4,19-5,19-6,19-7,19-8">The joint deployment meant that there was increased fear of unintended production outages as debugging and rollback can be difficult for a deployment of this size, driving the approach of the release train: every two weeks, a snapshot of all modules was taken and promoted to be a release candidate, which then went through exhaustive testing that took about two weeks.</cite>
This is the contract monoliths enforce: you can share code easily, which encourages the wrong kind of reuse. <cite index="19-13">The setup of a new Reloaded module and its integration with the orchestration required a non-trivial amount of effort, which led to a bias towards augmentation rather than creation when developing new functionalities.</cite> Build one more feature into the thing you already have, because standing up a new thing is expensive.
<cite index="19-22,19-23,19-24">Netflix had the new video pipeline running alongside Reloaded in production for a few years, completed the migration of all necessary functionalities, began gradually shifting over traffic one use case at a time, and completed the switchover in September 2023.</cite> Migration at scale is not a rewrite. It's running two systems in parallel until the new one proves it can handle the load.
Sources:
- https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359
#monolith #technical-debt #tight-coupling #release-cycles #netflix #code-reuse #deployment-overhead #migration-strategy #systems-architecture #cloud-infrastructure #microservices
Chaos engineering as architectural constraint, not just testing
<cite index="21-1,21-3">Netflix wanted to build a system where they assumed that the platform they were running on was unreliable rather than reliable, which meant they had to build software architectures which assumed that the components could go away at any time—this led to all the chaos engineering work, multiple zones and regions.</cite> <cite index="24-2,24-4">Chaos engineering was popularized around 2012, and one way of looking at it is as an architectural design control.</cite>
<cite index="20-24">One of the reasons Chaos Engineering appeared at Netflix at the time that it did was that they had moved to microservices and had single-function services.</cite> <cite index="20-15,20-16,20-17,20-18">A microservice that does one thing is much more observable than a monolith—even with tracing and logging, it's really hard to reason about what a monolith will do because it's got so many different things it does.</cite>
<cite index="24-6,24-8,24-9">Netflix wanted to have microservices that were stateless and autoscaled, and they wanted to encourage people to not store state within their instances—all session state had to be somewhere else.</cite> <cite index="13-15,13-16,13-17">A basic technique Netflix uses to make their systems more reliable and highly available is 'The Simian army,' a set of tools used to increase the resiliency of services, with Chaos Monkey being the most widely used, allowing one to introduce random failures in a system to see how it reacts.</cite> <cite index="22-2,22-3">Customer requests should be routed to specific local regions and services, data should be replicated and requests re-routed to active services during an incident, and microservices should be designed to limit the blast radius of any incident.</cite>
Sources:
- https://www.platformengineeringpod.com/episode/from-netflix-to-the-cloud-adrian-cockroft-on-devops-microservices-and-sustainability
- https://blog.container-solutions.com/adrian-cockcroft-on-serverless-continuous-resilience
- https://www.gremlin.com/blog/adrian-cockroft-chaos-engineering-what-it-is-and-where-its-going-chaos-conf-2018
- https://www.infoq.com/news/2017/11/cockcroft-chaos-architecture/
- https://www.packtpub.com/en-us/learning/how-to-tutorials/how-netflix-migrated-from-a-monolithic-to-a-microservice-architecture-video/
#chaos-engineering #reliability-engineering #microservices #netflix #stateless-services #resilience #failure-injection #chaos-monkey #systems-architecture #cloud-infrastructure
Seven years from monolith to cloud: the 2008 outage as forcing function
<cite index="10-1,10-3,10-4">In 2008, Netflix faced a catastrophic database corruption that brought their entire service down for three days—their monolithic architecture, a single Java application running on Oracle databases, couldn't handle the scale they were rapidly approaching with their new streaming service.</cite> <cite index="13-5,13-6">The outage prevented them from shipping DVDs to customers, and following this they decided to move away from a single point of failure that could only scale vertically and move to components that could scale horizontally and are highly available.</cite>
<cite index="17-3,17-4">Netflix leadership, including CEO Reed Hastings and Cloud Architect Adrian Cockcroft, committed to completing their migration to Amazon Web Services, driven by the August 2008 database corruption incident and the need for greater scalability, reliability, and operational efficiency.</cite> <cite index="17-1,17-2,17-7">In 2009, Netflix began the gradual process of refactoring its monolithic architecture, service by service, into microservices, completing the customer-facing systems conversion to microservices by 2012.</cite> <cite index="17-13,17-19">The entire infrastructure migration took approximately seven and a half years, from the initial August 2008 crisis to the final data center shutdown in January 2016.</cite>
<cite index="12-20,12-21">Another distinct advantage of microservices is the speed and agility of development—Netflix engineers got the opportunity to develop, test, and deploy services independently, which allowed them to build more than 30 teams that could work on different parts of the system without having to wait for each other to finish.</cite>
Sources:
- https://caffeinatedcoder.medium.com/netflixs-microservices-migration-from-monolith-to-700-services-8caa8e5bc574
- https://www.packtpub.com/en-us/learning/how-to-tutorials/how-netflix-migrated-from-a-monolithic-to-a-microservice-architecture-video/
- https://tocconsulting.fr/blog/netflix-cloud-architecture
- https://www.hys-enterprise.com/blog/why-and-how-netflix-amazon-and-uber-migrated-to-microservices-learn-from-their-experience/
#netflix #monolith-migration #cloud-migration #aws #database-corruption #systems-architecture #organizational-design #migration-timeline #cloud-infrastructure #microservices
Cockcroft's definition: bounded context and loose coupling
<cite index="1-1,1-11">Cockcroft defines a microservices architecture as a service-oriented architecture composed of loosely coupled elements that have bounded contexts.</cite> The bounded-context part matters—it's not just about splitting code into smaller pieces. It's about dependency management at the organizational level.
<cite index="1-5">As Director of Web Engineering and then Cloud Architect at Netflix, Cockcroft oversaw the transition from a traditional development model with 100 engineers producing a monolithic DVD-rental application to a microservices architecture with many small teams responsible for the end-to-end development of hundreds of microservices.</cite> <cite index="12-8">A microservices architecture allowed Netflix to break up the system into independent services: one service stores all watched shows, one is responsible for monthly credit card payments, one analyzes watching history and suggests similar shows and movies.</cite>
<cite index="3-9,3-10,3-11">The architecture is emergent, not designed centrally—whatever anyone needed to do at the time.</cite> That's a design constraint, not a problem. <cite index="2-4,2-13">The top lesson Cockcroft learned at Netflix is that speed wins in the marketplace—it's hard not to win if you're basing your moves on enough data points and your competitors are making guesses that take months to be proven or disproven.</cite> The architecture serves velocity.
Sources:
- https://www.f5.com/company/blog/nginx/microservices-at-netflix-architectural-best-practices
- https://dzone.com/articles/adopting-microservices-netflix-0
- https://medium.com/s-c-a-l-e/talking-microservices-with-the-man-who-made-netflix-s-cloud-famous-1032689afed3
- https://www.hys-enterprise.com/blog/why-and-how-netflix-amazon-and-uber-migrated-to-microservices-learn-from-their-experience/
#microservices #systems-architecture #bounded-contexts #organizational-design #netflix #cloud-native #service-decomposition #cloud-infrastructure
Systems Performance: The Methodology-First Approach
<cite index="20-19,20-20,20-21">While Gregg's book covers performance tools and the background for understanding them, what makes it different is the inclusion of many performance methodologies, including those covered briefly in his USENIX 2012 talk. He's been teaching and developing systems performance classes on and off for over ten years, and has found methodologies to be crucial for giving students a starting point and then guiding them through performance activities. The USE Method is a methodology he developed for this purpose.</cite>
<cite index="12-2,12-3">World-renowned systems performance expert Brendan Gregg summarizes relevant operating system, hardware, and application theory to quickly get professionals up to speed even if they've never analyzed performance before, and to refresh and update advanced readers' knowledge. Gregg illuminates the latest tools and techniques, including extended BPF, showing how to get the most out of your systems in cloud, web, and large-scale enterprise environments.</cite>
<cite index="20-15,20-16">Chapters are structured to first cover durable skills (models, architecture, and methodologies) and then implementation with tools and tuning. This will be evident to those who read the first edition: most chapters begin with only light changes since the first edition, but the changes increase as each chapter progresses.</cite> <cite index="16-3,16-4,16-5">A performance analysis methodology is a procedure that you can follow to analyze system or application performance. These generally provide a starting point and then guidance to root cause, or causes. Different methodologies are suited for solving different classes of issues, and you may try more than one before accomplishing your goal.</cite>
<cite index="1-6">In November 2013, he received the LISA Outstanding Achievement Award from USENIX "for contributions to the field of system administration, particularly groundbreaking work in systems performance analysis methodologies."</cite>
Sources:
- https://www.brendangregg.com/systems-performance-2nd-edition-book.html
- https://www.amazon.com/Systems-Performance-Brendan-Gregg/dp/0136820158
- https://www.brendangregg.com/methodology.html
- https://en.wikipedia.org/wiki/Brendan_Gregg
#systems-performance #brendan-gregg #methodology #performance-engineering #observability #book #training #operations
Flame Graphs: Stack Traces Made Legible
<cite index="1-2">Gregg created several visualization types for performance analysis, including latency heat maps and flame graphs.</cite> <cite index="31-1,31-2,31-3,31-4">Flame graphs are a simple stack trace visualization that helps answer an everyday problem: how is software consuming resources, especially CPUs, and how did this change since the last software version? Flame graphs have been adopted by many languages, products, and companies, including Netflix, and have become a standard tool for performance analysis. They were published in "The Flame Graph" article in the June 2016 issue of Communications of the ACM, by their creator, Brendan Gregg.</cite>
<cite index="35-4,35-5,35-6">The flame graph provides a new visualization for profiler output and can make for much faster comprehension, reducing the time for root cause analysis. In environments where software changes rapidly, such as the Netflix cloud microservice architecture, it is especially important to understand profiles quickly. Faster comprehension can also make the study of foreign software more successful, where one's skills, appetite, and time are strictly limited.</cite>
<cite index="36-5,36-6">This is the official website for flame graphs: a visualization of hierarchical data that Gregg created to visualize stack traces of profiled software so that the most frequent code-paths can be identified quickly and accurately. They can be generated using his open source programs on github.com/brendangregg/FlameGraph which create interactive SVGs.</cite> The x-axis does not represent the passage of time; stack frames are sorted alphabetically from left to right so that identical frames merge, and the width of a frame shows how often it appeared in the samples. <cite index="30-12,30-13,30-14">Determining why CPUs are busy is a routine task for performance analysis, which often involves profiling stack traces. Profiling by sampling at a fixed rate is a coarse but effective way to see which code-paths are hot (busy on-CPU). It usually works by creating a timed interrupt that collects the current program counter, function address, or entire stack back trace, and translates these to something human readable when printing a summary report.</cite>
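To make the merging step concrete, here is a minimal sketch of how sampled stacks might be collapsed before rendering. It assumes each sample is a root-first list of frame names and emits lines in a `frame;frame;frame count` shape similar to the folded-stack text that the FlameGraph scripts consume. The sample data and the helper names `fold_stacks` and `frame_widths` are hypothetical and are not taken from Gregg's tools.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse identical sampled stacks into 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    # Alphabetical sort places stacks that share a prefix next to each other,
    # which is what lets a renderer merge their common frames.
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]

def frame_widths(samples):
    """Width of a frame = number of samples whose stack passes through that call path."""
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[tuple(stack[:depth])] += 1
    return widths

# Hypothetical samples, as a fixed-rate profiler might collect them.
SAMPLES = [
    ["main", "parse", "read"],
    ["main", "render", "draw"],
    ["main", "parse", "read"],
    ["main", "parse", "tokenize"],
]

folded = fold_stacks(SAMPLES)
widths = frame_widths(SAMPLES)

for line in folded:
    print(line)
print("samples under main;parse:", widths[("main", "parse")])
```

The order in which samples were collected is discarded here, which is why the horizontal axis of the resulting graph carries no timing information.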
Sources:
- https://en.wikipedia.org/wiki/Brendan_Gregg
- https://www.usenix.org/conference/atc17/program/presentation/gregg-flame
- https://queue.acm.org/detail.cfm?id=2927301
- https://www.brendangregg.com/flamegraphs.html
- https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
#flame-graphs #visualization #profiling #performance-analysis #brendan-gregg #stack-traces #cpu-profiling #observability #performance-engineering #operations
RED Method: The Workload View Gregg Didn't Build
<cite index="25-10">RED came about because Tom Wilkie was frustrated with the popular USE methodology of performance measurement.</cite> <cite index="23-5,23-7,23-8">Tom Wilkie, a prominent figure in the observability space and one of the founders of Grafana Labs, introduced the method. Wilkie first presented the RED method at the Prometheus meetup in London in 2015. At the time, he worked at Weaveworks, a company specializing in tooling for container-based applications.</cite>
<cite index="25-1,25-2,25-3">RED is based around requests and characterizes microservice performance with three metrics: Rate (R), the number of requests per second; Errors (E), the number of failed requests; and Duration (D), the amount of time to process a request.</cite> The method is workload-centric, not resource-centric. <cite index="22-1,22-2">The RED method is about the workload itself, and treats the service as a black box. It's an externally-visible view of the behavior of the workload as serviced by the resources.</cite>
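As an illustration of the three signals, the sketch below derives them from a list of request records for a single service over a fixed window. The `Request` record, the `red_summary` helper, and the sample values are all hypothetical; a production setup would more likely compute these from counters and histograms in a metrics system than from raw records.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request failed

def red_summary(requests, window_seconds):
    """Summarize one service's workload as Rate, Errors, and Duration."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": total / window_seconds,
        "errors_per_s": failed / window_seconds,
        "error_ratio": failed / total if total else 0.0,
        "duration_ms_median": median(durations) if durations else None,
        "duration_ms_max": max(durations) if durations else None,
    }

# Hypothetical two-second window with one slow, failed request.
window = [
    Request(duration_ms=10.0, ok=True),
    Request(duration_ms=20.0, ok=True),
    Request(duration_ms=30.0, ok=True),
    Request(duration_ms=400.0, ok=False),
]

summary = red_summary(window, window_seconds=2.0)
print(summary)
```

Nothing in the summary refers to CPU, memory, or any other resource, which matches the black-box framing: the service is judged only by what its callers can observe.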
The real value is the pairing. <cite index="22-12">Taken together, RED and USE comprise minimally complete, maximally useful observability—a way to understand both aspects of a system: its users/customers and the work they request, and its resources/components and how they react to the workload.</cite> <cite index="29-1,29-2">RED and USE metrics provide a powerful framework for monitoring and observability in modern infrastructure. By implementing these complementary approaches, you gain visibility into both the user experience (RED) and the underlying resource health (USE), enabling you to quickly identify, diagnose, and resolve issues in your systems.</cite>
<cite index="25-8,25-9">The most immediate benefit of instrumenting microservices along the lines described by RED is that it gives engineers who may not be familiar with a badly performing microservice a standard set of tools to diagnose and correct an issue. RED offers a "consistency across services [that] really helps reduce the cognitive load of your on-call people."</cite>
Sources:
- https://thenewstack.io/monitoring-microservices-red-method/
- https://last9.io/blog/monitoring-with-red-method/
- https://www.solarwinds.com/blog/monitoring-and-observability-with-use-and-red
- https://betterstack.com/community/guides/monitoring/red-use-metrics/
#observability #red-method #microservices #workload-analysis #tom-wilkie #request-driven #methodology #performance-engineering #operations
The USE Method: Three Questions for Every Resource
<cite index="1-1">Brendan Gregg developed the USE Method (Utilization, Saturation, and Errors), a methodology for performance analysis of system resources.</cite> <cite index="22-5,22-6">Gregg describes it as designed to help solve performance issues quickly, "like an emergency checklist in a flight manual, it is intended to be simple, straightforward, complete, and fast."</cite>
The framework is deceptively simple: <cite index="7-1">for every resource, check utilization, saturation, and errors.</cite> <cite index="9-14,9-15,9-16">It provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.</cite>
<cite index="5-9">Instead of staring at dashboards hoping something looks wrong, USE gives you a systematic way to interrogate every resource—CPU, memory, disk, network—and determine whether it's your bottleneck.</cite> The method is resource-centric. It aims to prevent the fishing expedition that performance work can become, especially when facing an unknown system. <cite index="16-6,16-7">Analysis without a methodology can become a fishing expedition, where metrics are examined ad hoc, until the issue is found – if it is at all. Methodologies can help with another difficult issue: when to give up.</cite>
The USE method can be applied to hardware and software resources alike. <cite index="7-7,7-8,7-9">While the original article focuses only on hardware components, the same three checks can also be applied to software dependencies. Each software component has a set of resources it depends on to do its work: hardware and software.</cite>
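The checklist can be sketched as a loop that asks the same three questions of every resource, hardware or software. This is a minimal illustration, not Gregg's tooling: the `ResourceSample` record, the 80% utilization threshold, and the sample readings are assumptions chosen for the example, and real metrics would come from the operating system or a monitoring agent.

```python
from dataclasses import dataclass

@dataclass
class ResourceSample:
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # amount of queued or waiting work
    errors: int         # error events counted in the interval

def use_check(samples, utilization_threshold=0.8):
    """Return (resource, signal, value) findings that deserve a closer look."""
    findings = []
    for s in samples:
        if s.utilization >= utilization_threshold:
            findings.append((s.name, "utilization", s.utilization))
        if s.saturation > 0:
            findings.append((s.name, "saturation", s.saturation))
        if s.errors > 0:
            findings.append((s.name, "errors", s.errors))
    return findings

# Hypothetical readings: three hardware resources and one software resource.
samples = [
    ResourceSample("cpu", utilization=0.95, saturation=3.0, errors=0),
    ResourceSample("memory", utilization=0.40, saturation=0.0, errors=0),
    ResourceSample("disk", utilization=0.30, saturation=0.0, errors=2),
    ResourceSample("db_connection_pool", utilization=1.0, saturation=12.0, errors=0),
]

findings = use_check(samples)
for name, signal, value in findings:
    print(f"{name}: {signal} = {value}")
```

The output is a list of leads rather than a diagnosis. Each finding is a starting point for the further investigation the method calls for, and a resource with no findings can be ruled out quickly.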
Sources:
- https://en.wikipedia.org/wiki/Brendan_Gregg
- https://www.solarwinds.com/blog/monitoring-and-observability-with-use-and-red
- https://www.connected.app/library/use-method-sv42wsv
- https://grepory.substack.com/p/the-use-method-revisited
- https://www.brendangregg.com/USEmethod/use-linux.html
- https://www.brendangregg.com/methodology.html
#observability #performance-engineering #brendan-gregg #use-method #resource-analysis #bottleneck-detection #methodology #operations
Patterson-Hennessy as textbook for the hardware-software contract
<cite index="1-10,1-11">Patterson co-authored seven books, including two with John L. Hennessy on computer architecture: Computer Architecture: A Quantitative Approach (6 editions) and Computer Organization and Design RISC-V Edition: the Hardware/Software Interface (5 editions). They have been widely used as textbooks for graduate and undergraduate courses since 1990</cite>. The subtitle matters: "the Hardware/Software Interface" frames architecture not as transistor layout but as the contract programmers depend on.
<cite index="7-2,7-3">Computer Organization and Design RISC-V Edition: The Hardware/Software Interface, Second Edition, the award-winning textbook from Patterson and Hennessy that is used by more than 40,000 students per year, continues to present the most comprehensive and readable introduction to this core computer science topic. This version of the book features the RISC-V open source instruction set architecture, the first open source architecture designed for use in modern computing environments such as cloud computing, mobile devices, and other embedded systems</cite>.
The books codified a methodology: measure, don't guess. <cite index="18-14,18-15,18-16">The term reduced instruction set computer originated from an influential paper in 1980, which made a case for less complex instruction sets. However, the term has become so abused that some (including the authors of the book) argue instead for the term "load-store instruction set". The reason for this is that the RISC movement was fundamentally about simpler instruction set architectures (ISA) to make it easier to implement performance-oriented features like pipelines</cite>. The Patterson-Hennessy lineage runs through every engineer who has had to reason about cache misses and pipeline stalls.
Sources:
- https://en.wikipedia.org/wiki/David_Patterson_(computer_scientist)
- https://www.amazon.com/Computer-Organization-Design-RISC-V-Architecture/dp/0128203315
- https://www.researchgate.net/publication/2398453_Computer_Architecture_a_qualitative_overview_of_Hennessy_and_Patterson
#computer-architecture #hardware-software-interface #risc-v #instruction-set-architecture #pedagogy #systems-design #systems-architecture #performance-engineering
Latency numbers as design constraint, not trivia
<cite index="28-5,28-6,28-7">In 2010, Jeff Dean from Google gave a wonderful talk at Stanford that made him quite famous. In it, he discussed a few numbers that are relevant to computing systems. Then Peter Norvig published those numbers for the first time on the internet</cite>. <cite index="26-2,26-3">This list was originally written by Jeff Dean. He's a legendary Google engineer who (usually side-by-side with the under-credited Sanjay Ghemawat) created such systems as MapReduce, GFS, Bigtable, and Spanner</cite>.
The numbers themselves span orders of magnitude: L1 cache reference at 0.5 nanoseconds, main memory at 100 nanoseconds, disk seek at 10 milliseconds, round-trip between continents at 150 milliseconds. <cite index="26-6">These numbers are useful in designing efficient systems in Google datacenters, making decisions like "is it better for all of my stateless servers to keep this on local disk or to send an RPC to an in-cluster server that has it in RAM?"</cite>
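A short calculation shows how the numbers turn into a design decision. The four values quoted above are used as given. The in-datacenter round trip of 0.5 milliseconds is an assumption brought in from the commonly circulated version of Dean's list, since it is not quoted here, and the comparison ignores transfer size and queueing.

```python
# Latencies in nanoseconds.
LATENCY_NS = {
    "l1_cache_reference": 0.5,
    "main_memory_reference": 100,
    "datacenter_round_trip": 500_000,  # assumption: not quoted in the text above
    "disk_seek": 10_000_000,
    "intercontinental_round_trip": 150_000_000,
}

def ratio(slow, fast):
    """How many times slower one operation is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

# The question from the quote: read from local disk, or ask a
# server in the same cluster that already holds the data in RAM?
local_disk_ns = LATENCY_NS["disk_seek"]
remote_ram_ns = (
    LATENCY_NS["datacenter_round_trip"] + LATENCY_NS["main_memory_reference"]
)

print("memory vs L1:", ratio("main_memory_reference", "l1_cache_reference"))
print("disk seek vs memory:", ratio("disk_seek", "main_memory_reference"))
print("local disk seek (ns):", local_disk_ns)
print("RPC to remote RAM (ns):", remote_ram_ns)
```

Under these assumptions the remote read wins by roughly a factor of twenty, because one disk seek costs far more than a network hop inside the datacenter. The ratio matters more than the absolute values, which shift as hardware changes.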
<cite index="33-1,33-2">Colin Scott, a Berkeley researcher, updated Jeff Dean's famous Numbers Everyone Should Know with his Latency Numbers Every Programmer Should Know interactive graphic. The interactive aspect is cool because it has a slider that lets you see numbers back from as early as 1990 to the far far future of 2020</cite>. Hardware moves, but hierarchy persists. The point is not memorization but understanding what is fast relative to what else.
Sources:
- https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/
- https://news.ycombinator.com/item?id=24674239
- https://highscalability.com/more-numbers-every-awesome-programmer-must-know/
#latency #performance-engineering #systems-architecture #hardware-constraints #distributed-systems #memory-hierarchy #hardware-software-interface
RISC as quantitative discipline, not just fewer instructions
<cite index="2-2">Patterson and Hennessy won the 2017 Turing Award for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures</cite>, not simply for inventing RISC. <cite index="4-1">Patterson coined the term reduced instruction set computer in 1980</cite>, and <cite index="4-7">by 1982, the RISC-I processor built by Patterson, Professor Carlo Séquin and their Berkeley team outperformed a conventional design that used more than twice as many transistors to operate</cite>. The approach mattered more than the architecture itself: <cite index="2-6">RISC requires a small set of simple and general instructions for computing functions, thus requiring fewer transistors and reducing the amount of work a computer must perform</cite>.
<cite index="4-11,4-12">Together Patterson and Hennessy published, in 1990, a foundational textbook, Computer Architecture: A Quantitative Approach, establishing an analytical and scientific framework for engineers to evaluate microprocessor design. The text is credited with shifting the emphasis from raw performance power to an approach that takes energy usage, heat dissipation and off-chip communication into account</cite>. The book treated architecture as an engineering discipline with measurable tradeoffs. <cite index="2-4">Today, 99 percent of the more than 16 billion microprocessors produced annually are RISC processors</cite>. RISC won not because it was simpler but because its simplicity enabled pipelines, and pipelines could be measured.
Sources:
- https://news.berkeley.edu/2018/03/21/david-patterson-pioneer-of-modern-computer-architecture-receives-turing-award/
- https://engineering.berkeley.edu/david-patterson-a-winning-risc/
#computer-architecture #risc #performance-methodology #systems-design #hardware-software-interface #engineering-tradeoffs #systems-architecture #performance-engineering
Practical Leverage on Application Semantics
<cite index="20-2,20-3">Bailis investigated coordination avoidance—the use of as little coordination as possible while ensuring application integrity—demonstrating how to leverage the semantic requirements of applications to enable more efficient distributed algorithms.</cite>
The thesis work spans data serving, transaction processing, and web services. <cite index="1-14">The CALM Theorem shows that monotone programs exhibit deterministic outcomes despite reordering.</cite> <cite index="1-17">After transactions complete, servers exchange state and, after applying the merge operator, converge to the same state.</cite>
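A small sketch can show what convergence under a merge operator looks like. It assumes the simplest possible case: each replica's state is a set of committed items and the merge operator is set union. Both choices are made for illustration and are not taken from the cited sources.

```python
from itertools import permutations

def merge(a: frozenset, b: frozenset) -> frozenset:
    """Set union: commutative, associative, and idempotent."""
    return a | b

def converge(states):
    """Fold a sequence of replica states together in the order given."""
    result = frozenset()
    for state in states:
        result = merge(result, state)
    return result

# Hypothetical replica states after each server has committed local transactions.
replicas = [
    frozenset({"x"}),
    frozenset({"y"}),
    frozenset({"x", "z"}),
]

# Exchange state in every possible order and collect the distinct outcomes.
outcomes = {converge(order) for order in permutations(replicas)}
print(outcomes)
```

Every exchange order produces the same final set, so no replica has to wait for another before committing. A merge operator that lacked these algebraic properties would not give that guarantee.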
This isn't purely academic. <cite index="4-5,4-11">Many of the invariants Bailis proposed are monotonic in flavor and echo intuition from CALM.</cite> The approach shifts focus from storage consistency guarantees low in the stack to application-level properties. <cite index="11-8,11-9">CALM provides a constructive application-level counterpart to conventional systems wisdom, such as the apparently negative results of the CAP Theorem.</cite> You can work around CAP for monotone problems. The framework tells you which problems those are.
Sources:
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-206.html
- https://sites.cs.ucsb.edu/~rich/class/cs293b-cloud/papers/bailis-avoidance.pdf
- https://cacm.acm.org/research/keeping-calm/
- https://ar5iv.labs.arxiv.org/html/1901.01930
#coordination-avoidance #application-semantics #database-architecture #cap-theorem #distributed-systems #peter-bailis #consistency-models
The Cost Coordination Actually Imposes
<cite index="7-3">Minimizing coordination—blocking communication between concurrently executing operations—is key to maximizing scalability, availability, and high performance in database systems.</cite> <cite index="7-4">However, uninhibited coordination-free execution can compromise application correctness.</cite>
The penalties are measurable. <cite index="23-6">Coordination may lead to substantial delays, with 95th percentile round-trip times reaching 649ms in geo-replicated deployments.</cite> <cite index="25-2,25-5">Coordination decreases performance due to waiting, communication delays, and aborts, exacerbated in distributed environments.</cite> <cite index="19-3,23-8">Given the availability, latency, and throughput penalties associated with serializable transactions, a broad class of systems has sought weaker alternatives that reduce coordination during operation, often at the cost of application integrity.</cite>
<cite index="9-5">The classic use of serializable transactions is sufficient to maintain correctness but is not necessary for all applications, sacrificing potential scalability.</cite> The question Bailis addressed: when can you safely forgo coordination, and when must you pay the price? <cite index="21-3">Read Committed is the default semantics in fifteen of eighteen popular relational databases, including Oracle, SAP Hana, and Microsoft SQL Server.</cite> Not everyone needs serializability.
Sources:
- https://dl.acm.org/doi/10.14778/2735508.2735509
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-206.pdf
- https://escholarship.org/uc/item/8k8359g2
- http://www.bailis.org/papers/bailis-thesis.pdf
- https://speakerdeck.com/pbailis/coordination-avoidance-in-distributed-databases
#distributed-systems #coordination-cost #latency #scalability #serializability #performance #database-architecture #consistency-models
Invariant Confluence as a Design Criterion
<cite index="9-6">Bailis developed a formal framework called invariant confluence that determines whether an application requires coordination for correct execution.</cite> <cite index="1-2">A globally I-valid system can execute transactions with coordination-freedom, transactional availability, and convergence if and only if those transactions are I-confluent with respect to invariant I.</cite>
This extends CALM from the program level to the database constraint level. <cite index="4-4,4-10">Bailis defines Invariant Confluence for replicated transactional databases, given a set of database invariants.</cite> <cite index="5-177">If a set of transactions is invariant confluent, then all database states reachable by executing and merging transactions starting with a common ancestor must be mergeable into an I-valid database state.</cite>
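The definition can be exercised on a toy model. The sketch below is a deliberate simplification rather than Bailis's formalism: the database is a single integer balance, a transaction is a signed delta, merging two branches applies both deltas to their common ancestor, and the invariant is that the balance never goes negative. The search is bounded, so a `True` result is evidence for the tested range and not a proof.

```python
def invariant(balance):
    """Application invariant I: the balance must never be negative."""
    return balance >= 0

def confluent_over(deltas, ancestors):
    """Bounded check: do any two individually valid branches merge into an invalid state?"""
    for start in ancestors:
        for a in deltas:
            for b in deltas:
                both_valid = invariant(start + a) and invariant(start + b)
                if both_valid and not invariant(start + a + b):
                    return False  # found a pair that would need coordination
    return True

ancestors = range(0, 6)
deposits = range(0, 6)       # deltas 0..5
withdrawals = range(-5, 1)   # deltas -5..0

deposits_ok = confluent_over(deposits, ancestors)
withdrawals_ok = confluent_over(withdrawals, ancestors)

print("deposits merge safely:", deposits_ok)
print("withdrawals merge safely:", withdrawals_ok)
```

Deposits pass because adding to a non-negative balance cannot make it negative. Withdrawals fail because two replicas can each approve a withdrawal that is valid locally while the merged total overdraws the account, and that is the case where coordination has to be paid for.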
<cite index="9-1">When programmers specify their application invariants, this analysis allows databases to coordinate only when anomalies that might violate invariants are possible.</cite> The framework lets you ask: which of my operations actually conflict under the semantics I care about? <cite index="20-4">The resulting prototype systems demonstrate regular order-of-magnitude speedups compared to traditional, coordinated counterparts on tasks including referential integrity, index maintenance, and constraint enforcement.</cite>
Sources:
- https://arxiv.org/abs/1402.2237
- https://sites.cs.ucsb.edu/~rich/class/cs293b-cloud/papers/bailis-avoidance.pdf
- https://cacm.acm.org/research/keeping-calm/
- http://www.bailis.org/papers/bailis-thesis.pdf
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-206.html
#invariant-confluence #database-architecture #coordination-avoidance #consistency-models #application-semantics #peter-bailis #distributed-systems
When Coordination Is Actually Necessary
<cite index="10-2,10-7">The CALM Theorem—Consistency as Logical Monotonicity—establishes that programs with consistent, coordination-free distributed implementations are exactly those expressible in monotonic logic.</cite> <cite index="12-13,12-14">Monotonic programs are safe in the face of missing information and can proceed without coordination, while non-monotonic programs must be concerned that truth of a property could change with new information.</cite>
This gives you a decision procedure. <cite index="14-4">Monotonic problems simply accumulate beliefs; their output depends only on the content of their input, not the order in which it arrives.</cite> Reachability is monotonic—once you know a node is reachable, a second path doesn't change that. But unreachability is not; you can't confirm a node is unreachable until you've seen the entire graph.
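The reachability example can be checked with a small sketch. The graph, the `reachable` helper, and the variable names are illustrative assumptions. The code shows that the set of reachable nodes only grows as edges arrive and ends up the same in every arrival order, while an early "unreachable" answer has to be retracted.

```python
from itertools import permutations


def reachable(edges, source):
    """Nodes reachable from source using only the edges seen so far."""
    seen = {source}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen


EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

# Order-insensitive: every arrival order gives the same final answer.
results = {frozenset(reachable(list(p), "a")) for p in permutations(EDGES)}

# Monotonic: the answer after each new edge is a superset of the previous one.
partial = [reachable(EDGES[:i], "a") for i in range(len(EDGES) + 1)]
grows_only = all(x <= y for x, y in zip(partial, partial[1:]))

# Non-monotonic: "d is unreachable" looks true after three edges,
# then the fourth edge arrives and the claim has to be withdrawn.
unreachable_early = "d" not in reachable(EDGES[:3], "a")
unreachable_final = "d" not in reachable(EDGES, "a")

print(len(results), grows_only, unreachable_early, unreachable_final)
```

A node answering the reachability query can emit results as soon as it has them. A node answering the unreachability query has to know it has seen every edge, and establishing that is the coordination step.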
<cite index="11-11">Ameloot and colleagues presented a formalization and proof of the CALM Theorem using relational transducers.</cite> The practical implication: if you can express your problem in monotonic logic, you don't need distributed locks, two-phase commit, or Paxos. If you can't, you do. <cite index="16-3,16-4,16-5,16-6">The conjecture was presented at PODS 2010, based on experience with streaming queries and declarative networking, and formalized by Ameloot, Neven and Van den Bussche at Hasselt University.</cite>
Sources:
- https://arxiv.org/abs/1901.01930
- https://www.researchgate.net/publication/330212641_Keeping_CALM_When_Distributed_Consistency_is_Easy
- https://cacm.acm.org/research/keeping-calm/
- https://ar5iv.labs.arxiv.org/html/1901.01930
- https://rise.cs.berkeley.edu/blog/an-overview-of-the-calm-theorem/
#distributed-systems #calm-theorem #monotonic-logic #coordination #consistency-models #theory #database-architecture
Why Distributed Transactions Fail at Scale
<cite index="4-8">People aren't building large-scale systems with distributed transactions in practice, and if they do try to, the projects founder because the performance costs and fragility make them impractical</cite>. This isn't a theoretical observation. <cite index="1-13,1-14,1-15">The paper explores practical approaches used in the implementation of large-scale mission-critical applications in a world that rejects distributed transactions, including the management of fine-grained pieces of application data that may be repartitioned over time as the application grows, with design patterns for sending messages between these repartitionable pieces of data</cite>.
<cite index="4-19,4-20,4-22">Assume the number of customers, orders, shipments, and all other business concepts manipulated by the application grow to an almost-infinite number—typically the individual things do not get significantly larger, we simply get more and more of them, so you are forced to spread what formerly ran on a single or small number of machines across a larger number of machines</cite>. <cite index="21-4,21-5">You cannot perform atomic transactions across these entities, and an application must tolerate message retries and out-of-order arrival of messages</cite>.
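One common way to tolerate retries and reordering is to make message handling idempotent and commutative. The sketch below is an illustration under assumptions, not Helland's code: `InventoryEntity`, its fields, and the message ids are hypothetical, and a real system would need to bound the set of remembered ids.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryEntity:
    """Hypothetical entity: all of its state is updated locally and atomically."""
    key: str
    on_hand: int
    seen: set = field(default_factory=set)  # ids of messages already applied

    def handle(self, msg_id: str, delta: int) -> bool:
        """Apply a stock adjustment at most once; return True if it was applied."""
        if msg_id in self.seen:
            return False          # a retry of something we already processed
        self.seen.add(msg_id)
        self.on_hand += delta     # additions commute, so arrival order is irrelevant
        return True


item = InventoryEntity("sku-1", on_hand=10)
# Delivered out of order, with "m2" repeated by a retry.
applied = [item.handle(mid, delta) for mid, delta in [("m2", -3), ("m1", 5), ("m2", -3)]]
print(applied, item.on_hand)  # [True, True, False] 12
```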
<cite index="1-16,1-17">The goal is to reduce the challenges faced by people handcrafting very large scalable applications, and by observing these design patterns, the industry can perhaps work toward the creation of platforms to make it easier to develop scalable applications</cite>.
Sources:
- https://blog.acolyer.org/2014/11/20/life-beyond-distributed-transactions/
- https://queue.acm.org/detail.cfm?id=3025012
- https://highscalability.com/7-design-patterns-for-almost-infinite-scalability/
#distributed-systems #transaction-processing #scalability-limits #at-least-once-delivery #system-constraints #scale-architecture
Scale-Agnostic and Scale-Aware: The Two-Layer Split
<cite index="16-2,16-3">To track relationships and the messages received, each entity within the scale-agnostic application must remember state information about its partners, capturing this state on a partner-by-partner basis</cite>. <cite index="16-5,16-6">Both the stateless Unix-style process and the lower layers of the application are part of the implementation of the scale-agnostic API provided for the business logic—the upper-layer scale-agnostic business logic simply addresses the message to the entity key that identifies the durable state</cite>.
The separation matters because <cite index="4-25">a consequence of almost-infinite scaling is that this abstraction must be exposed to the developer of business logic</cite>. <cite index="5-17,5-18,5-19">An entity has a scale-agnostic and a scale-aware layer; the scale-aware layer deals with the delivery of messages to other entities, while the scale-agnostic layer stores information on messages sent to another entity in order to handle failures that cannot be automatically resolved by the scale-aware layer</cite>.
<cite index="1-28,1-29,1-30,1-31">Programmers striving to solve business goals like e-commerce and supply-chain management increasingly need to think about scaling without distributed transactions, as most developers don't have access to robust systems offering scalable distributed transactions—the patterns for building these applications can be seen, but are not yet applied consistently</cite>. The pattern isn't exotic; it's survival.
Sources:
- https://cacm.acm.org/magazines/2017/2/212429-life-beyond-distributed-transactions/fulltext
- https://xebia.com/blog/life-beyond-distributed-transactions/
- https://queue.acm.org/detail.cfm?id=3025012
- https://blog.acolyer.org/2014/11/20/life-beyond-distributed-transactions/
#distributed-systems #scale-architecture #layered-architecture #business-logic #messaging-patterns #system-design #transaction-processing
Activities: Partner State in a World Without Atomicity
<cite index="16-14,16-15,16-16,16-17">As entities move, the clarity of a FIFO queue between sender and destination is occasionally disrupted—messages are repeated, later messages arrive before earlier ones</cite>. <cite index="16-18,16-19">Scale-agnostic applications are evolving to support idempotent processing of all application-visible messaging, which implies reordering in message delivery too</cite>.
<cite index="16-22">An activity is the local information needed to manage a relationship with a partner entity</cite>. <cite index="17-1,17-3">Messaging across entities introduces the need for managing conversational state, which Helland defines as an Activity—the management of state for each entity partner</cite>. <cite index="5-26,5-27,5-28">An activity is data about the conversation entity A has with entity B; once entity A has placed the message on its out-queue, the activity stores this fact, and if a success message is received from entity B, the activity is closed</cite>.
<cite index="17-4,17-5,17-6,17-10">In an order comprising many items for purchase, reserving inventory for shipment of each separate item will be a separate activity, with an entity for the order and separate entities for each item managed by the warehouse—the per-inventory-item data contained within the order-entity is an activity</cite>. <cite index="15-14,15-15">These activity workflows need to reach agreement in the absence of atomicity, functioning within activities within entities</cite>.
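The order-and-inventory example can be sketched as one entity holding one activity per partner. This is a minimal illustration under assumptions: `OrderEntity`, the "open"/"closed" states, and the message tuple shape are hypothetical names chosen here, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class OrderEntity:
    """Hypothetical order entity that keeps one activity per partner entity."""
    order_id: str
    activities: dict = field(default_factory=dict)  # partner key -> "open" | "closed"
    out_queue: list = field(default_factory=list)   # messages awaiting delivery

    def request_reservation(self, item_key: str) -> None:
        # Queue the message and record that fact in the activity, as one local update.
        if item_key not in self.activities:
            self.activities[item_key] = "open"
            self.out_queue.append(("reserve", item_key, self.order_id))

    def on_success(self, item_key: str) -> None:
        # A success reply closes the activity; a repeated reply changes nothing.
        if self.activities.get(item_key) == "open":
            self.activities[item_key] = "closed"

    def open_partners(self) -> list:
        # Partners we are still uncertain about and may need to message again.
        return sorted(k for k, s in self.activities.items() if s == "open")


order = OrderEntity("order-42")
for item in ("item-a", "item-b", "item-c"):
    order.request_reservation(item)

order.on_success("item-b")
order.on_success("item-b")    # duplicate delivery of the same reply
print(order.open_partners())  # ['item-a', 'item-c']
```

The per-item state lives inside the order entity, so updating it never requires a transaction that spans the order and the warehouse.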
Sources:
- https://cacm.acm.org/magazines/2017/2/212429-life-beyond-distributed-transactions/fulltext
- https://www.infoq.com/news/2007/08/scalability-patterns/
- https://xebia.com/blog/life-beyond-distributed-transactions/
- https://blog.acolyer.org/2014/11/20/life-beyond-distributed-transactions/
#distributed-systems #messaging-patterns #state-management #idempotence #eventual-consistency #workflow-coordination #transaction-processing #scale-architecture
Entities: The Transactional Boundary You Can't Cross
<cite index="3-2,3-3">Helland's paper published at CIDR 2007 argues that distributed transaction protocols like 2PC and Paxos provide a façade of global serializability</cite>, but <cite index="3-5">his experience led him to liken these platforms to the Maginot Line</cite>—an elaborate defense that gets routed around in practice. <cite index="2-7">Application developers simply do not implement large scalable applications assuming distributed transactions</cite>. Instead, you need a different unit of work.
<cite index="1-32">Entities are collections of named (keyed) data that may be atomically updated within the entity but never atomically updated across entities</cite>. <cite index="4-23,4-24">Scaling means using this abstraction as you write your program—an entity lives on a single machine at a time, and the application can only manipulate one entity at a time</cite>. <cite index="15-18,15-19">What Helland calls an entity is not simply a persistent object but Eric Evans' concept of an aggregate entity, representing disjoint sets of data</cite>.
This constraint forces coordination out of the transaction boundary and into messaging. <cite index="1-26,1-27">The absence of distributed transactions means accepting uncertainty when attempting to come to decisions across different entities—it is unavoidable that decisions across distributed systems involve accepting uncertainty for a while</cite>. The key insight is that this uncertainty isn't a failure mode you handle with retries; it's the operating condition.
Sources:
- https://ics.uci.edu/~cs223/papers/cidr07p15.pdf
- https://queue.acm.org/detail.cfm?id=3025012
- https://blog.acolyer.org/2014/11/20/life-beyond-distributed-transactions/
#distributed-systems #transaction-processing #scale-architecture #entity-design #eventual-consistency #aggregate-patterns
Vogels wrote from the Dynamo trenches, not the whiteboard
<cite index="10-27">Werner Vogels is vice president and chief technology officer at Amazon.com, where he is responsible for driving the company's technology vision of continuously enhancing innovation on behalf of Amazon's customers at a global scale</cite>. His 2009 paper "Eventually Consistent" referenced <cite index="26-20,26-21">Dynamo: Amazon's highly available key-value store, presented at the 21st ACM Symposium on Operating Systems Principles in Stevenson, WA, in October 2007</cite>. The Dynamo paper, co-authored by Vogels, documented the system running under Amazon's retail operation.
<cite index="6-3">Amazon's CTO Werner Vogels posted an article describing approaches to tolerate eventual data consistency in large-scale distributed systems</cite> after wrestling with production constraints at a scale most academics never touch. <cite index="8-3,8-4">He wrote a first version in December 2007, but was never happy with it; ACM Queue asked him to revise it for their magazine and he took the opportunity to improve the article</cite>. The revision landed in January 2009 and has been cited over 550 times. This wasn't theory work—it was documentation of what had to be true to keep the system alive.
Sources:
- https://queue.acm.org/detail.cfm?id=1466448
- https://dl.acm.org/doi/10.1145/1435417.1435432
- https://www.infoq.com/news/2009/01/eventually-consistent/
- https://www.allthingsdistributed.com/2007/12/eventually_consistent.html
#werner-vogels #dynamo #amazon #distributed-systems #eventual-consistency #system-architecture #production-systems #database-architecture #consistency-models
DynamoDB implements consistency per-request, not per-table
<cite index="12-10">In DynamoDB, consistency is not a table-level configuration—you choose the consistency model on each individual read request by setting ConsistentRead=true in your GetItem, Query, or Scan calls</cite>. <cite index="14-8">Eventually consistent is the default read consistent model for all read operations</cite>. <cite index="16-18">Eventually consistent reads are half the cost of strongly consistent reads</cite>.
The architecture: <cite index="11-14">writes are committed to a leader node and asynchronously replicated to other nodes</cite>. <cite index="11-13,16-14">With eventual consistency, reads might return slightly stale data for a brief period (typically under one second) after a write completes, and may not reflect the results of a recently completed write operation</cite>. When you flip the flag for strong consistency, <cite index="12-14">DynamoDB ensures the read reflects all successful writes that occurred before the read</cite>, but <cite index="17-16,17-17">a strongly consistent read might not be available if there is a network delay or outage, in which case DynamoDB may return a server error (HTTP 500)</cite>. This is the CAP theorem in API form: on every read, you choose between availability and consistency.
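The per-request choice can be illustrated with a toy model. This is not DynamoDB's implementation and does not call AWS. It assumes one leader and a single replica that lags until `replicate()` runs, and it assumes eventually consistent reads always go to the replica. `ToyTable` and its methods are hypothetical; only the idea of a per-read `ConsistentRead` flag comes from the text above.

```python
class ToyTable:
    """Toy leader-plus-replica store with a per-read consistency flag."""

    def __init__(self):
        self.leader = {}
        self.replica = {}
        self.pending = []   # writes committed on the leader, not yet replicated

    def put_item(self, key, value):
        self.leader[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Stand-in for asynchronous replication catching up."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def get_item(self, key, consistent_read=False):
        # Strong reads go to the leader; default reads may see stale data.
        store = self.leader if consistent_read else self.replica
        return store.get(key)


table = ToyTable()
table.put_item("user#1", "v1")

stale = table.get_item("user#1")                         # replica has not caught up
fresh = table.get_item("user#1", consistent_read=True)   # read from the leader
table.replicate()
converged = table.get_item("user#1")

print(stale, fresh, converged)  # None v1 v1
```

The same table serves both reads; the caller decides per request whether staleness is acceptable.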
Sources:
- https://www.hellointerview.com/learn/system-design/deep-dives/dynamodb
- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html
- https://medium.com/adidoescode/know-how-you-read-insights-about-dynamodb-read-consistency-models-part-1-df8f9decf66b
- https://dynobase.dev/dynamodb-read-consistency/
- https://reintech.io/blog/dynamodb-consistency-models-strong-vs-eventual-consistency
#dynamodb #consistency-models #eventual-consistency #database-architecture #read-operations #replication #aws #distributed-systems
The consistency window is a dial you turn, not a wall you hit
<cite index="7-3,26-2">Vogels's 2009 CACM paper framed eventual consistency as a deliberate engineering trade-off</cite>, not a defect. <cite index="5-1">Data inconsistency offers two advantages for large-scale reliable distributed systems: better read and write performance under highly concurrent conditions, and network partition tolerance for high availability and node outages</cite>. The core idea: <cite index="1-3">the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value</cite>.
This matters because <cite index="22-4,24-4">the CAP theorem states that of three properties of shared-data systems—data consistency, system availability, and tolerance to network partition—only two can be achieved at any given time</cite>. Vogels argued that <cite index="2-5">relaxing consistency will allow the system to remain highly available under partitionable conditions</cite>. The paper catalogued variations: causal consistency, read-your-writes consistency, session consistency. <cite index="21-2">One of the tools the system designer has is the length of the consistency window, during which the clients of the systems are possibly exposed to the realities of large-scale systems engineering</cite>. You pick your window based on what breaks when data lags. That's the contract.
Sources:
- https://dl.acm.org/doi/10.1145/1435417.1435432
- https://queue.acm.org/detail.cfm?id=1466448
- https://pages.cs.wisc.edu/~zuyu/summaries/cs764/eventuallyConsistency
- https://www.researchgate.net/publication/27301367_Eventually_Consistent
#distributed-systems #consistency-models #cap-theorem #eventual-consistency #system-design #trade-offs #database-architecture
Bigtable: A Sparse Multi-Dimensional Sorted Map
<cite index="38-1,38-2,38-3">Bigtable is a distributed storage system for managing structured data designed to scale to petabytes of data across thousands of commodity servers, used by many Google projects including web indexing, Google Earth, and Google Finance</cite>. <cite index="38-4,38-5">Applications placed very different demands on Bigtable in terms of data size and latency requirements, yet it successfully provided a flexible, high-performance solution</cite>. <cite index="6-15">Bigtable is a simple, sparse, distributed, persistent, multidimensional sorted map</cite>. <cite index="41-1,41-2">A Bigtable cluster stores tables, each consisting of tablets, with each tablet containing all data associated with a row range</cite>. <cite index="45-10,45-11">Bigtable does not support a full relational data model but provides clients a simple data model that supports dynamic control over data layout and format</cite>. <cite index="46-10,46-12,46-14">Bigtable uses GFS to store log and data files, with Google SSTable file format forming the foundation of storage, and relies on Chubby for locking</cite>. It's not a database in the relational sense. It's a key-value store with timestamps and column families, built for Google's actual access patterns, not for SQL compatibility.
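A rough in-memory sketch of the "sparse, multidimensional sorted map" idea, assuming cells are keyed by (row, column, timestamp). `ToyBigtable` and the sample rows are hypothetical, and nothing here models tablets, GFS, SSTables, or Chubby.

```python
class ToyBigtable:
    """In-memory stand-in for a map from (row, column, timestamp) to value."""

    def __init__(self):
        # Only cells that were written exist, which is what makes the map sparse.
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the most recent version of a cell, or None if it was never written."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self, start_row, end_row):
        """Yield (key, value) in sorted key order for start_row <= row < end_row."""
        for key in sorted(self._cells):
            if start_row <= key[0] < end_row:
                yield key, self._cells[key]


t = ToyBigtable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
t.put("com.example.www", "anchor:home", 1, "Example")
t.put("org.wiki.www", "contents:", 1, "<html>wiki</html>")

latest = t.get("com.cnn.www", "contents:")
com_rows = [key[0] for key, _ in t.scan("com.", "com/")]
print(latest, com_rows)
```

Because keys are kept in sorted order, a row-range scan touches a contiguous slice, which is the property the tablet-per-row-range layout relies on.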
Sources:
- https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/
- https://medium.com/@drajput_14416/mapreduce-the-backbone-of-big-data-processing-and-how-gfs-bigtable-power-it-5d4def98f838
- https://research.google.com/archive/bigtable-osdi06.pdf
- https://hemantkgupta.medium.com/insights-from-paper-bigtable-a-distributed-storage-system-for-structured-data-1eea26ee0f3a
- https://distributed-computing-musings.com/2022/09/paper-notes-bigtable-a-distributed-storage-system-for-structured-data/
#bigtable #distributed-storage #key-value-store #nosql #scale-architecture #data-models #distributed-systems #data-infrastructure
MapReduce: Hiding Parallelization Behind Two Functions
<cite index="29-1,29-2">MapReduce is a programming model for processing and generating large data sets where users specify a map function and a reduce function, running on large clusters of commodity machines at high scale</cite>. <cite index="30-6,30-7">The run-time system takes care of partitioning input data, scheduling execution across machines, handling failures, and managing inter-machine communication, allowing programmers without experience in parallel and distributed systems to utilize large distributed systems</cite>. <cite index="33-25">More than ten thousand distinct MapReduce programs had been implemented internally at Google, with an average of one hundred thousand jobs executed on Google's clusters every day, processing over twenty petabytes daily</cite>. <cite index="5-11,5-12">The MapReduce model abstracts away the complexities of distributed systems such as parallelization, partitioning, task scheduling and machine failure, allowing developers to focus on application logic</cite>. The elegance is in what it hides. You write map and reduce, and the runtime handles the rest—scheduling, retries, data locality. That kind of abstraction doesn't come for free; it comes from knowing exactly which complexity to expose and which to bury.
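The two-function contract can be shown with a single-process stand-in for the runtime. This is a sketch under assumptions: `run_mapreduce` is a hypothetical helper that does map, group-by-key, and reduce in memory, with none of the partitioning, scheduling, or failure handling the real system provides.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """User code: emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User code: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy runtime: map every input, shuffle values by key, then reduce each key."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = [("d1", "the quick fox"), ("d2", "the lazy dog the end")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts["the"], counts["fox"])  # 3 1
```

Everything the user writes fits in `map_fn` and `reduce_fn`; everything hard lives behind `run_mapreduce`, which is the part the real runtime distributes.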
Sources:
- https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/
- https://www.usenix.org/conference/osdi-04/mapreduce-simplified-data-processing-large-clusters
- https://dl.acm.org/doi/10.1145/1327452.1327492
- https://diogoalexandrefranco.github.io/classic-big-data-papers/
#mapreduce #distributed-systems #programming-models #data-processing #parallelization #abstraction-layers #data-infrastructure #scale-architecture
GFS: Optimized for Appends, Not POSIX Compliance
<cite index="22-5">The largest cluster provided hundreds of terabytes of storage across thousands of disks on over a thousand machines, concurrently accessed by hundreds of clients</cite>. <cite index="20-10">GFS was designed to be a highly distributed filesystem with support targeted towards extremely large files</cite>. <cite index="21-6,21-7">The design was motivated by Google's cluster architecture paradigm and workload characterizations, providing fault tolerance while running on inexpensive commodity hardware</cite>. <cite index="19-16,19-17,19-18">GFS employs a master-worker architecture, where a single master manages metadata and multiple chunkservers store data in fixed-size chunks of 64 MB, replicated across commodity hardware for fault tolerance, optimizing for large sequential reads and appends</cite>. <cite index="24-7">Most files are mutated by appending new data rather than overwriting—appending is the focus of performance optimization and atomicity guarantees</cite>. This is the kind of design decision that only makes sense when you know what the workload will be. The trade-off is explicit: you give up general-purpose semantics to win on the actual use case.
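Fixed-size chunks make locating data a matter of arithmetic. The sketch below assumes the 64 MB chunk size quoted above, taken as 64 MiB; the helper names `locate` and `chunks_for_range` are hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB, the fixed chunk size cited above
MIB = 1024 * 1024


def locate(offset: int) -> tuple:
    """Translate a byte offset in a file into (chunk index, offset within chunk)."""
    return divmod(offset, CHUNK_SIZE)


def chunks_for_range(offset: int, length: int) -> list:
    """Chunk indexes touched by a read of `length` bytes (length > 0) at `offset`."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))


print(locate(200 * MIB))                      # (3, 8388608): chunk 3, 8 MiB in
print(chunks_for_range(60 * MIB, 10 * MIB))   # [0, 1]: the read straddles a boundary
```

With large chunks, a long sequential read resolves to only a few chunk lookups, which keeps metadata traffic to the single master low.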
Sources:
- https://research.google/pubs/the-google-file-system/
- https://www.the-paper-trail.org/post/2008-10-01-the-google-file-system/
- https://blog.ruipan.xyz/machine-learning-systems/index/the-google-file-system
- https://grokipedia.com/page/Sanjay_Ghemawat
- https://wenzhe.one/MIT6.824%2021Spring/GFS.html
#distributed-systems #file-systems #google-file-system #scale-architecture #commodity-hardware #design-tradeoffs #data-infrastructure
The Dependency Stack: GFS, MapReduce, and Bigtable as Infrastructure
<cite index="12-1,12-5">Jeff Dean and Sanjay Ghemawat's work at Google included the big data processing model MapReduce, the Google File System, and databases Bigtable and Spanner</cite>. <cite index="13-3,13-4">Their efforts created the first software designs for systems that harness the power of tens of thousands of computers</cite>. The three papers—<cite index="20-5">GFS published at SOSP in 2003</cite>, <cite index="30-10">MapReduce at OSDI in 2004</cite>, and <cite index="45-1">Bigtable at OSDI 2006</cite>—were stacked dependencies. <cite index="3-4,3-5">Bigtable is built on GFS, which it uses as a backing store for log and data files, with GFS providing reliable storage for SSTables</cite>. <cite index="20-11">GFS was used to store the results of Google's web crawlers and as a storage layer for MapReduce</cite>. <cite index="8-1,8-2">Bigtable can be used with MapReduce—there are wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs</cite>. This is how infrastructure gets built: one system assumes another, and the abstractions layer until you can run a product.
Sources:
- https://en.wikipedia.org/wiki/Sanjay_Ghemawat
- https://awards.acm.org/award_winners/ghemawat_1482280
- https://www.the-paper-trail.org/post/2008-10-29-bigtable-googles-distributed-data-store/
- https://hemantkgupta.medium.com/insights-from-paper-bigtable-a-distributed-storage-system-for-structured-data-1eea26ee0f3a
- https://www.the-paper-trail.org/post/2008-10-01-the-google-file-system/
- https://www.usenix.org/conference/osdi-04/mapreduce-simplified-data-processing-large-clusters
#distributed-systems #data-infrastructure #google #scale-architecture #dependency-design #systems-foundations
Gray's other dependency: price/performance as a forcing function
<cite index="9-2,9-3,9-4">Performance in transaction processing is typically a throughput metric (work/second) and price is typically a five-year cost-of-ownership metric. Together, they give a price/performance ratio. For example, the transaction processing benchmarks define a standard transaction workload and a transaction per second (tps) metric</cite>.
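The ratio is simple enough to compute by hand, and a worked example shows why it changes decisions. The dollar figures and throughput numbers below are invented for illustration, and `price_performance` is a hypothetical helper.

```python
def price_performance(five_year_cost: float, tps: float) -> float:
    """Dollars of five-year cost of ownership per transaction per second."""
    if tps <= 0:
        raise ValueError("throughput must be positive")
    return five_year_cost / tps


# Hypothetical systems: the faster one is not the cheaper one per unit of work.
big = price_performance(2_000_000, 200)    # 10000.0 dollars per tps
small = price_performance(300_000, 50)     # 6000.0 dollars per tps

print(big, small)
```

On raw throughput the first system wins four to one; on price/performance the second is the better buy, provided 50 tps covers the workload.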
Gray did not just define ACID. He defined how to measure whether a system could actually deliver it at a price someone would pay. <cite index="15-4,15-5">Gray taught the industry to think not only about "performance", but crucially also "price/performance"</cite>. <cite index="14-2,14-4">In the early 1980s, ordinary transaction processing systems and techniques bottlenecked at 50 transactions per second (tps) while high performance transaction processing systems achieved 200 tps</cite>.
<cite index="11-1,11-5">Gray warned that his metrics were performance metrics, not function metrics. They made minimal demands on the network, transaction processing, data management, and recovery management. It was painful to see a metric which rewarded simplicity, since simple systems are faster than fancy ones</cite>. But that was the point. The benchmark forced you to strip away the abstractions and look at what the system could actually sustain. Gray's DebitCredit and TP1 benchmarks became TPC-A and TPC-B; both have since been retired, and their successor TPC-C is the one vendors still publish results for. Procurement decisions still lean on those numbers. The theory matters less than the sustained throughput under contract.
Sources:
- https://jimgray.azurewebsites.net/benchmarkhandbook/chapter1.pdf
- https://jimgray.azurewebsites.net/papers/tandemtr85.1_1ktps.pdf
- https://jimgray.azurewebsites.net/papers/ameasureoftransactionprocessingpower.pdf
- https://tigerbeetle.com/blog/2024-07-23-rediscovering-transaction-processing-from-history-and-first-principles/
#database-benchmarking #transaction-processing #performance-metrics #price-performance #jim-gray #infrastructure-fundamentals #tpc-benchmarks #database-architecture
Two implementation paths: time-domain addressing versus log-and-lock
<cite index="29-8,29-9">Gray's 1981 paper identified two apparently different approaches to implementing the transaction concept: time-domain addressing and logging plus locking</cite>. <cite index="29-10,29-11,29-12">Logging clusters the current state of all objects together and relegates old versions to a history file called a log. Time-domain addressing clusters the complete history (all versions) of each object with the object. Each organization was seen to have some unique virtues</cite>.
The logging approach became dominant in commercial systems because it separated hot data from cold. Current state lives in fast storage; history goes to the log. Time-domain addressing, which keeps all versions with the object, lives on in multiversion concurrency control and resurfaced decades later in systems like Datomic that treat history as a first-class citizen.
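The two layouts can be contrasted with a pair of toy stores. This is a sketch under assumptions, not Gray's design: `LogStore` and `VersionedStore` are hypothetical, timestamps are assumed to increase with each write, and neither class handles concurrency or recovery.

```python
class LogStore:
    """Logging layout: current values clustered together, old values sent to a log."""

    def __init__(self):
        self.current = {}
        self.log = []   # append-only history of (key, previous value)

    def write(self, key, value):
        self.log.append((key, self.current.get(key)))
        self.current[key] = value

    def read(self, key):
        return self.current[key]


class VersionedStore:
    """Time-domain layout: every version is stored with its object."""

    def __init__(self):
        self.versions = {}   # key -> list of (timestamp, value) in write order

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read(self, key, as_of=None):
        history = self.versions[key]
        if as_of is None:
            return history[-1][1]
        visible = [v for t, v in history if t <= as_of]
        if not visible:
            raise KeyError(f"{key!r} has no version at or before {as_of}")
        return visible[-1]


log_store = LogStore()
versioned = VersionedStore()
for ts, balance in enumerate([100, 80, 95], start=1):
    log_store.write("acct", balance)
    versioned.write("acct", balance, ts)

print(log_store.read("acct"), log_store.log)   # current value plus separate history
print(versioned.read("acct", as_of=2))         # 80: a read "as of" an earlier time
```

Reading the present is equally cheap in both; reading the past is a direct lookup in the versioned layout and a log replay in the logging layout.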
<cite index="28-6">Gray's paper described areas requiring further study: (1) the integration of the transaction concept with the notion of abstract data type, (2) some techniques to allow transactions to be composed of sub-transactions, and (3) handling transactions which last for extremely long times (days or months)</cite>. Those problems—nested transactions, long-running transactions, and transactional abstractions in programming languages—are still open. Some systems solve subsets. None solve all three well.
Sources:
- https://jimgray.azurewebsites.net/papers/theTransactionConcept.pdf
- https://johngrib.github.io/wiki/clipping/jim-gray/transaction-concept/
#transaction-processing #implementation-techniques #logging #locking #database-architecture #mvcc #infrastructure-fundamentals
The Gray-Reuter book: 1,070 pages of implementation technique
<cite index="2-3,2-4,2-5">Jim Gray is arguably one of the most readable technical authors. Coauthored with Andreas Reuter, Gray's book, Transaction Processing: Concepts and Techniques, is an in-depth (1,070 pages) and easily readable description of transaction-oriented processing. It carries the reader from the basic concepts of transaction processing through a straw man implementation of a Resource Manager to a review of current transaction monitors</cite>.
<cite index="10-3,10-4">The book covers the theory and practice of implementing locking, logging, and the more generic topic of implementing transactional resource managers. As an extended example, the implementation of transactional files, records and access paths is covered in detail</cite>. <cite index="4-11">The book describes not just the transactions in a database, but basically any kind of transaction with ACID properties, including all kinds of actions, including "real" ones (moving rods in a nuclear reactor, dispensing money from an ATM), either local or distributed</cite>.
<cite index="16-4">Extensive use of compilable C code fragments demonstrates the many transaction processing algorithms presented in the book</cite>. Published in 1993, it remains the definitive reference. If you need to understand what a TP monitor actually does—or why your CICS/DB2 system behaves the way it does—you read Gray and Reuter. The book is not theory. It is a manual for building systems that handle failure without losing state.
Sources:
- https://www.availabilitydigest.com/private/0204/transaction_processing_Gray.pdf
- https://jimgray.azurewebsites.net/wics_99_tp/
- https://www.goodreads.com/book/show/1416957.Transaction_Processing
- https://www.goodreads.com/en/book/show/1416957.Transaction_Processing
#transaction-processing #database-architecture #jim-gray #infrastructure-fundamentals #resource-managers #distributed-systems #fault-tolerance
ACID as contract law for state machines
<cite index="28-2">Gray's 1981 paper "The Transaction Concept: Virtues and Limitations" defined a transaction as a transformation of state with three properties: atomicity (all or nothing), durability (effects survive failures), and consistency (a correct transformation)</cite>. <cite index="33-7,33-8">The paper started by relating the transaction concept to contract law, where two parties make a binding agreement, which led to defining the properties as consistency, atomicity, and durability</cite>. Isolation came later; <cite index="6-2,6-3">though Gray proposed the ACID properties, the acronym itself was coined by Andreas Reuter and Theo Härder in 1983</cite>.
<cite index="5-4,5-5">Of the four ACID properties, three are the responsibility of the underlying transaction-processing monitor, and one—consistency—is the responsibility of the application programmer. Maintaining consistency involves understanding the data model of the application, that is the correspondence between data items and entities in the real world, and ensuring that each transaction changes data items in a way that corresponds to a legitimate change in the real world</cite>. <cite index="5-7,5-8">To achieve atomicity, isolation, and durability, modern transaction processing systems typically use locking and logging. Locking is a technique for assuring isolation of concurrent transactions</cite>.
Gray's formulation mattered because it separated mechanism from policy. The system guarantees atomicity, isolation, and durability through locking and logging primitives. The application owns consistency. That contract has held for forty years.
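The mechanism/policy split can be sketched in a few lines. This toy is an assumption-laden illustration, not how any production system is built: `ToyStore` uses one coarse lock for isolation and an in-memory undo log for atomicity, durability is not modeled at all, and the consistency rule is passed in by the application as `is_consistent`.

```python
import threading

_MISSING = object()   # marks keys that did not exist before the transaction


class ToyStore:
    """The store supplies locking and an undo log; the caller supplies the rule."""

    def __init__(self, data):
        self.data = dict(data)
        self._lock = threading.Lock()   # one coarse lock: transactions run serially

    def transact(self, updates, is_consistent):
        """Apply all updates or none; return True if the transaction committed."""
        with self._lock:
            undo_log = []
            for key, value in updates.items():
                undo_log.append((key, self.data.get(key, _MISSING)))
                self.data[key] = value
            if is_consistent(self.data):
                return True
            # Roll back in reverse order so the store returns to its prior state.
            for key, old in reversed(undo_log):
                if old is _MISSING:
                    del self.data[key]
                else:
                    self.data[key] = old
            return False


def no_overdraft(accounts):
    """Application-owned consistency rule."""
    return all(balance >= 0 for balance in accounts.values())


store = ToyStore({"a": 100, "b": 50})
rejected = store.transact({"a": -20, "b": 170}, no_overdraft)   # would overdraw "a"
after_reject = dict(store.data)
accepted = store.transact({"a": 70, "b": 80}, no_overdraft)

print(rejected, after_reject, accepted, store.data)
```

The store never needs to know what an overdraft is. It only guarantees that a rejected transaction leaves no trace.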
Sources:
- https://jimgray.azurewebsites.net/papers/theTransactionConcept.pdf
- https://amturing.acm.org/info/gray_3649936.cfm
- https://krishnakumarsql.wordpress.com/2014/07/23/jim-gray-analogy-on-db-logging/
- https://arxiv.org/pdf/2310.04601
#acid-properties #transaction-processing #database-architecture #infrastructure-fundamentals #jim-gray #consistency-models
Why Impossibility Results Matter for System Design
<cite index="29-2,29-8">Nancy Lynch's work on proving impossibility results and lower bounds expresses inherent limitations of distributed systems for solving problems</cite>. <cite index="34-8,34-9">The ability to prove nontrivial impossibility results in distributed computing theory was exciting because it was quite different from the situation in the theory of sequential algorithms, in which lower bound results were (and still are) very hard to prove</cite>.
The FLP result forces precision. <cite index="14-1,14-2">The problem of consensus—getting a distributed network of processors to agree on a common value—was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps, and in particular the synchronous solution was resilient to faults</cite>. But <cite index="32-1,32-5">it is impossible to reliably reach agreement in an asynchronous network, with the possibility of even a single, simple processor stopping failure</cite>.
<cite index="4-12,4-13">The book familiarizes readers with important problems, algorithms, and impossibility results in the area so readers can then recognize the problems when they arise in practice, apply the algorithms to solve them, and use the impossibility results to determine whether problems are unsolvable, and also provides readers with the basic mathematical tools for designing new algorithms and proving new impossibility results</cite>. Impossibility results are design constraints—they tell you which assumptions you need to relax or which guarantees you have to give up.
Sources:
- https://nsf-gov-resources.nsf.gov/attachments/302205/public/Dr.NancyLynch_Slides.pdf?VersionId=f952_X8N8LKEcPdnQftiNDClZfcrAqHB
- https://www.arxiv.org/pdf/2502.20468
- https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/
- https://arxiv.org/html/2502.20468v1
- https://www.amazon.com/Distributed-Algorithms-Kaufmann-Management-Systems/dp/1558603484
#distributed-systems #impossibility-results #consensus-algorithms #systems-theory #formal-methods #lower-bounds
Escaping Impossibility Through Model Relaxation
The FLP result is a boundary condition, not a brick wall. <cite index="26-1,26-2">The Fischer, Lynch, and Paterson impossibility proof shows that it is impossible to solve the consensus problem in an asynchronous distributed system if even a single process can fail, but two decades of work on fault-tolerant asynchronous consensus algorithms have evaded this impossibility result by using extended models that provide randomization, additional timing assumptions, failure detectors, or stronger synchronization mechanisms</cite>.
<cite index="16-4">The partially synchronous model—formalized by Dwork, Lynch, and Stockmeyer in 1988 in a paper that itself won the 2007 Dijkstra Prize—captures the real world much better than either extreme</cite>. <cite index="16-15">Rather than assuming messages can be arbitrarily delayed forever, partial synchrony assumes there exists some Global Stabilisation Time (GST) after which message delays are bounded—even if that bound is unknown in advance</cite>. <cite index="13-7">Randomized consensus algorithms can circumvent the FLP impossibility result by achieving both safety and liveness with overwhelming probability, even under worst-case scheduling scenarios such as an intelligent denial-of-service attacker in the network</cite>.
<cite index="20-1,20-2">The Fischer-Lynch-Paterson result says that you can't do agreement in an asynchronous message-passing system if even one crash failure is allowed, unless you augment the basic model in some way, for example by adding randomization or failure detectors</cite>. The impossibility defines what you need to add back in to make practical systems work.
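One way to picture the "additional timing assumptions" route is a heartbeat failure detector whose timeout grows whenever a suspicion turns out to be premature. The sketch below is illustrative only: the class name, the doubling rule, and the explicit `now` argument are assumptions made for this example, not an algorithm taken from the cited papers. The intuition is that once message delays become bounded, the timeout eventually exceeds the bound and false suspicions stop.

```python
class AdaptiveFailureDetector:
    """Suspects peers whose heartbeats are late; backs off after a wrong guess."""

    def __init__(self, initial_timeout=1.0):
        self.timeout = initial_timeout
        self.last_heartbeat = {}  # peer -> time of most recent heartbeat
        self.suspected = set()

    def heartbeat(self, peer, now):
        """Record a heartbeat received from `peer` at time `now`."""
        if peer in self.suspected:
            # The peer was slow, not dead: retract the suspicion and
            # double the timeout so the same delay is tolerated next time.
            self.suspected.discard(peer)
            self.timeout *= 2
        self.last_heartbeat[peer] = now

    def check(self, now):
        """Return the set of peers currently suspected of having crashed."""
        for peer, seen in self.last_heartbeat.items():
            if now - seen > self.timeout:
                self.suspected.add(peer)
        return set(self.suspected)
```

The detector can be wrong at any moment, which is consistent with FLP; the timing assumption only buys the promise that it stops being wrong eventually.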
Sources:
- https://arxiv.org/pdf/cs/0209014
- https://www.javacodegeeks.com/2026/04/the-flp-impossibility-result-40-years-later-why-it-still-defines-every-consensus-protocol-you-use.html
- https://en.wikipedia.org/wiki/Consensus_(computer_science)
- https://www.cs.yale.edu/homes/aspnes/pinewiki/FischerLynchPatterson.html
#distributed-systems #consensus-algorithms #partial-synchrony #randomized-algorithms #failure-detectors #impossibility-results #systems-theory
Nancy Lynch's Impossibility Catalog
<cite index="28-6,30-9">Nancy Lynch has written hundreds of research articles about distributed algorithms and impossibility results, and about formal modeling and verification of distributed systems</cite>. <cite index="30-10">Her best-known contribution is the 1982 FLP impossibility result for distributed consensus in the presence of process failures, with Fischer and Paterson</cite>, though <cite index="32-10,32-11,32-12">she initially thought of this as a purely theoretical problem—having matching upper and lower bounds for the number of rounds needed for consensus in synchronous distributed systems, it seemed natural to consider what happens in the asynchronous case, and she did not know how this would turn out ahead of time</cite>.
<cite index="31-1,31-2,31-18">Lower bounds and other impossibility results have turned out to be a major part of distributed computing theory, and the main reason why lower bounds and other impossibility results can be proved here is the strong limitations imposed by locality in distributed systems</cite>. <cite index="31-19">Each process can see only its own state, the values it reads in shared memory, the messages it receives</cite>. <cite index="1-5,1-6">Lynch's book Distributed Algorithms contains the most significant algorithms and impossibility results in the area, all in a simple automata-theoretic setting, with algorithms proved correct and their complexity analyzed according to precisely defined complexity measures</cite>.
<cite index="35-1,35-2">The impossibility result has had the beneficial effect of helping or forcing system designers to clarify their claims about their system—you have to explain how your system is going to behave under the different kinds of failures that could happen during the execution of an algorithm in a distributed system</cite>.
Sources:
- https://www.nasonline.org/directory-entry/nancy-a-lynch-0frysu/
- https://www.nsf.gov/events/nancy-lynch-theoretical-view-distributed-systems
- https://groups.csail.mit.edu/tds/papers/Lynch/Lifetime_Contributions_Book-1.pdf
- https://arxiv.org/html/2502.20468v1
- https://medium.com/a-computer-of-ones-own/nancy-lynch-distributed-systems-pioneer-7234e9f1d34c
- https://books.google.com/books?id=2wsrLg-xBGgC&printsec=copyright
#distributed-systems #impossibility-results #systems-theory #consensus-algorithms #fault-tolerance #formal-methods
The FLP Result: What Asynchrony Actually Prevents
<cite index="13-12,28-1">The Fischer-Lynch-Paterson impossibility result—commonly called FLP after its authors Michael J. Fischer, Nancy Lynch, and Michael S. Paterson—won the Dijkstra Prize</cite> and <cite index="14-3">definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment</cite>. The theorem is precise: <cite index="11-1">in a fully asynchronous distributed system, it is impossible to design a deterministic consensus algorithm that guarantees both termination and agreement even if a single node fails</cite>.
The constraints matter. <cite index="15-6">An asynchronous distributed system is one in which messages can take arbitrarily long to be delivered (but are eventually delivered and delivered in-order)</cite>. <cite index="12-10">The paper assumes a reliable message system where messages are neither lost nor duplicated, but processes can fail by stopping (crashing) unexpectedly</cite>. Under those conditions, <cite index="16-1">no deterministic algorithm can guarantee all three properties of consensus (termination, agreement, validity)</cite>.
<cite index="16-8,16-9">FLP does not say consensus is hard; it says that no deterministic protocol can guarantee termination in a purely asynchronous system, and the moment you introduce even the weakest timing assumption the impossibility dissolves</cite>. <cite index="12-12">The result highlights the limitations of purely asynchronous systems and the need for additional assumptions or mechanisms (such as partial synchrony or failure detectors) to achieve consensus</cite>. Practitioners work around the boundary by adding randomization, timing assumptions, or failure detectors to the model.
Sources:
- https://en.wikipedia.org/wiki/Consensus_(computer_science)
- https://www.nasonline.org/directory-entry/nancy-a-lynch-0frysu/
- https://medium.com/@akash2077/fischer-lynch-paterson-flp-impossibility-the-inherent-limits-of-distributed-consensus-dcf88df2b5a0
- https://www.chriswirz.com/distributed-systems/flp-theorem
- https://afterhoursacademic.com/intuitive-flp-explanation/
- https://www.javacodegeeks.com/2026/04/the-flp-impossibility-result-40-years-later-why-it-still-defines-every-consensus-protocol-you-use.html
#distributed-systems #consensus-algorithms #impossibility-results #asynchronous-systems #flp-theorem #systems-theory
What Jacobson's algorithms still govern
<cite index="1-2">Jacobson's work redesigning TCP/IP's congestion control algorithms to better handle congestion is said to have saved the Internet from collapsing in the late 1980s and early 1990s</cite>. <cite index="7-2">Van Jacobson's algorithms for TCP helped solve the problem of congestion and are used in over 90% of Internet hosts today</cite>. This is not archaeological interest—the mechanisms he introduced in the 4.3BSD Tahoe release in 1988 remain the scaffolding on which modern congestion control is built.
<cite index="16-17,16-18">The most famous early efforts to manage congestion were undertaken by Van Jacobson and Mike Karels; the resulting 1988 paper Congestion Avoidance and Control is one of the most cited papers in networking of all time</cite>. <cite index="20-8">More than 30 years after it was introduced, Slow Start continues to avoid network congestion even though network speed is now measured in gigabits per second and the number of users has topped 3.5 billion</cite>. <cite index="13-4,13-5">The TCP congestion-avoidance algorithm is the primary basis for congestion control in the Internet; per the end-to-end principle, congestion control is largely a function of internet hosts, not the network itself</cite>. The dependency chain is clear: if you run TCP, you're running a descendant of what Jacobson and Karels built in response to a network that was failing at 56 Kbps.
Sources:
- https://en.wikipedia.org/wiki/Van_Jacobson
- https://www.internethalloffame.org/inductee/van-jacobson/
- https://tcpcc.systemsapproach.org/intro.html
- https://www.es.net/about/esnet-history/unjamming-the-information-superhighway-and-saving-the-internet/
- https://en.wikipedia.org/wiki/TCP_congestion_control
#network-infrastructure #tcp-ip #protocol-design #congestion-control #infrastructure-fundamentals #end-to-end-principle
Slow start, cwnd, and the self-clocking mechanism
<cite index="11-9,11-10">Old TCPs would start a connection injecting multiple segments into the network up to the receiver's advertised window, which was fine on the same LAN but caused problems when routers and slower links separated sender and receiver</cite>. <cite index="11-13,11-14">Slow start adds a congestion window called cwnd to the sender's TCP; when a new connection is established, cwnd is initialized to one segment</cite>. <cite index="11-15">Each time an ACK is received, the congestion window is increased by one segment</cite>, which means cwnd doubles every round-trip time during slow start—an exponential ramp-up.
<cite index="14-14">Congestion avoidance and slow start require two variables per connection: a congestion window (cwnd) and a slow start threshold size (ssthresh)</cite>. <cite index="13-7,13-8">The transmission rate increases via slow start until packet loss is detected, the receiver's advertised window becomes limiting, or ssthresh is reached, at which point TCP switches to the congestion avoidance algorithm</cite>. The genius is that the mechanism requires no changes to routers: it is end-to-end control. <cite index="11-7">The four algorithms described were developed by Van Jacobson</cite>: slow start, congestion avoidance, and fast retransmit shipped in the 4.3BSD Tahoe release in 1988, and fast recovery followed in the 1990 Reno release. <cite index="2-2,2-3">The approach was introduced in 1988 by Van Jacobson and Mike Karels and refined multiple times; the variant in widespread use today is called CUBIC</cite>.
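The growth pattern described above can be sketched as a per-round-trip simulation. This is a deliberately simplified model, not a TCP implementation: cwnd is counted in whole segments, every segment in a round is assumed to be ACKed with no loss, growth is capped at ssthresh when slow start ends, and the function name and parameters are made up for illustration.

```python
def cwnd_trace(ssthresh, rounds, initial_cwnd=1):
    """Return cwnd (in segments) at the start of each round trip."""
    cwnd = initial_cwnd
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd < ssthresh:
            # Slow start: +1 segment per ACK, so the window doubles
            # every round trip until it reaches ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: roughly +1 segment per round trip.
            cwnd += 1
    return trace
```

With `ssthresh=16`, the trace climbs 1, 2, 4, 8, 16 and then switches to 17, 18, 19: exponential probing first, linear probing after the threshold.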
Sources:
- https://www.rfc-editor.org/rfc/rfc2001.html
- https://datatracker.ietf.org/doc/html/draft-stevens-tcpca-spec
- https://en.wikipedia.org/wiki/TCP_congestion_control
- https://tcpcc.systemsapproach.org/algorithm.html
#tcp-ip #congestion-control #slow-start #protocol-design #end-to-end-principle #network-infrastructure #infrastructure-fundamentals
The October '86 collapse and the conservation principle
<cite index="6-1,6-2">In October 1986, the Internet experienced the first of a series of congestion collapses, with throughput between Lawrence Berkeley Laboratory and UC Berkeley dropping from 32 Kbps to 40 bps</cite>, a roughly thousandfold degradation (a factor of 800) between sites separated by 400 yards. <cite index="6-3,6-8">Van Jacobson and Mike Karels investigated whether 4.3BSD TCP was misbehaving or could be tuned to work better under abysmal network conditions</cite>. The answer was yes to both.
<cite index="3-3">Hosts would send packets as fast as the advertised window allowed, congestion would occur at routers causing packet drops, and hosts would time out and retransmit, resulting in even more congestion</cite>. The core insight was what Jacobson called the packet conservation principle: <cite index="4-1,4-2">a TCP connection should obey conservation of packets, and if this principle were obeyed, congestion collapse would become the exception rather than the rule</cite>. The idea is straightforward—a stable system in equilibrium should inject one packet into the network for every packet removed. <cite index="2-6,2-7">The arrival of an ACK signals that a packet has left the network and it's safe to transmit a new packet; by using ACKs to pace transmission, TCP is self-clocking</cite>. Congestion control became a matter of finding places where implementations violated conservation and fixing them.
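The conservation rule can be sketched as a toy ACK-clocked sender. This is a simplified model under stated assumptions: a fixed window, no loss, and one ACK per packet. The function name and parameters are hypothetical. Each arriving ACK removes one packet from the network and permits exactly one new packet to enter.

```python
from collections import deque

def ack_clocked_send(window, total_packets):
    """Return the number of packets in flight after each ACK is processed."""
    in_flight = deque(range(window))  # the initial window is already in the network
    next_seq = window
    observed = []
    while in_flight:
        in_flight.popleft()           # an ACK arrives: one packet has left
        if next_seq < total_packets:  # conservation: inject exactly one packet
            in_flight.append(next_seq)
            next_seq += 1
        observed.append(len(in_flight))
    return observed
```

While there is data left to send, the in-flight count stays pinned at the window size; it only drains once the sender runs out of packets.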
Sources:
- https://ee.lbl.gov/papers/congavoid.pdf
- https://cs162.org/static/readings/jacobson-congestion.pdf
- https://book.systemsapproach.org/congestion/tcpcc.html
- https://tcpcc.systemsapproach.org/algorithm.html
#network-infrastructure #protocol-design #tcp-ip #congestion-control #packet-conservation #infrastructure-fundamentals
The 3f+1 replica requirement is not negotiable in PBFT
<cite index="27-9">Castro and Liskov presented a protocol that tolerates Byzantine faults with 3f + 1 replicas to survive up to f faulty replicas, works safely in asynchronous network conditions, and can still make progress under a weak synchrony assumption.</cite> That arithmetic is the cost floor. If you need to tolerate one failure, you deploy four replicas. Two failures cost seven replicas. Replica count grows only linearly in f, but PBFT's all-to-all message exchange grows quadratically with the number of replicas, which is why PBFT works best in small, known committees rather than open membership systems.
<cite index="30-5">In a distributed system that constitutes 3f+1 nodes (f represents the number of byzantine nodes), a consensus can be reached as long as no less than 2f+1 non-byzantine nodes are functioning normally.</cite> The quorum math ensures that any two quorums of 2f+1 nodes overlap in at least f+1 nodes. Because at most f of those can be Byzantine, every pair of quorums shares at least one honest node, which prevents divergence even when f nodes equivocate. <cite index="28-1,28-2">PBFT's core achievement is to make Byzantine consensus practical. It ensures that all honest replicas agree on the same sequence of requests, even if some replicas are actively malicious.</cite>
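The replica and quorum counts follow from simple arithmetic, which can be checked directly. The function and key names below are illustrative. The overlap figure is the pigeonhole minimum for two quorums drawn from n replicas, and the honest count assumes the worst case where all f faulty replicas sit inside that overlap.

```python
def pbft_sizes(f):
    """Replica, quorum, and overlap counts for tolerating f Byzantine faults."""
    n = 3 * f + 1                     # total replicas required
    quorum = 2 * f + 1                # replicas needed to certify a step
    min_overlap = 2 * quorum - n      # two quorums share at least this many nodes
    min_honest = min_overlap - f      # worst case: every faulty node is in the overlap
    return {
        "replicas": n,
        "quorum": quorum,
        "min_overlap": min_overlap,
        "min_honest_in_overlap": min_honest,
    }
```

For any f the minimum overlap works out to f+1 and the guaranteed honest share to exactly one replica, which is the smallest margin that still prevents two quorums from certifying conflicting values.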
<cite index="27-10,27-11">In the normal case, PBFT executes read-only operations in one message round trip and read-write operations in two. Their implementation of a Byzantine-fault-tolerant NFS was reported to be only about 3% slower than standard unreplicated NFS in the Andrew benchmark.</cite> The latency gap closed enough that you could run services on it, not just admire the proof. The protocol became a dependency for permissioned blockchains and other settings where node identity is fixed and authenticated.
Sources:
- https://www.cube.exchange/what-is/pbft-practical-byzantine-fault-tolerance
- https://medium.com/tronnetwork/an-introduction-to-pbft-consensus-algorithm-11cbd90aaec
- https://bytepawn.com/practical-byzantine-fault-tolerance.html
#pbft #distributed-systems #quorum-systems #consensus-algorithms #replication #byzantine-fault-tolerance #performance #reliability-engineering
Liskov's substitution principle and distributed systems both care about contracts
<cite index="15-8,15-9">The Liskov substitution principle is a particular definition of a subtyping relation, called strong behavioral subtyping, initially introduced by Barbara Liskov in a 1987 conference keynote titled Data abstraction and hierarchy. It is based on the concept of substitutability—a principle in object-oriented programming stating that an object of a superclass may be replaced by an object of a subclass without breaking the program.</cite> <cite index="15-2,15-3">If S is a subtype of T, then objects of type T in a program may be replaced with objects of type S without altering any of the desirable properties of that program (e.g. correctness).</cite>
The principle maps to distributed systems thinking in a way that is not accidental. <cite index="9-1,9-2">Liskov's subsequent work has mainly been in the area of distributed systems, which use several computers connected by a network. Her research has covered many aspects of operating systems and computation, including important work on object-oriented database systems, garbage collection, caching, persistence, recovery, fault tolerance, security, decentralized information flow, modular upgrading of distributed systems, geographic routing, and practical Byzantine fault tolerance.</cite> Both domains care about behavioral contracts, not just structural ones. A subtype must respect the invariants of its parent; a replica must respect the invariants of the system even when other replicas lie.
<cite index="9-8,9-9">Byzantine fault tolerance deals with situations where a complex system fails in arbitrary ways. Liskov developed methods to allow correct operation even when some components are unreliable.</cite> The abstraction is the thing: define what correct means, then build mechanisms that preserve it under substitution or under failure.
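The gap between a structural and a behavioral contract can be shown with a small sketch. The classes and the `honours_contract` check are hypothetical examples, not drawn from Liskov's papers. Both subclasses match the parent's method signatures, but only one preserves the parent's stated invariant.

```python
class Store:
    """Contract: after put(k, v), get(k) returns v."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

class LoggingStore(Store):
    """Adds behaviour but keeps the parent's contract intact."""

    def __init__(self):
        super().__init__()
        self.log = []

    def put(self, key, value):
        self.log.append(key)
        super().put(key, value)

class LossyStore(Store):
    """Same signatures, but silently drops long values: breaks the contract."""

    def put(self, key, value):
        if len(str(value)) <= 4:
            super().put(key, value)

def honours_contract(store):
    """Check the put-then-get invariant against any Store-shaped object."""
    store.put("k", "hello")
    try:
        return store.get("k") == "hello"
    except KeyError:
        return False
```

A type checker would accept all three classes wherever a `Store` is expected; only the behavioral check separates the substitutable subtype from the one that merely looks the part.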
Sources:
- https://en.wikipedia.org/wiki/Liskov_substitution_principle
- https://amturing.acm.org/award_winners/liskov_1108679.cfm
#distributed-systems #liskov-substitution-principle #abstraction #type-theory #fault-tolerance #correctness #behavioral-contracts #reliability-engineering #consensus-algorithms
Byzantine failures are arbitrary, not just silent
<cite index="5-6">Software bugs, operator mistakes, and malicious attacks can cause arbitrary behavior, that is, Byzantine faults.</cite> The distinction matters because a Byzantine node does not simply stop responding. <cite index="18-17,18-18">A Byzantine node can behave in unpredictable ways: it might send different messages to different nodes, fake messages, or fail to send messages altogether.</cite> That makes detection harder than fail-stop semantics, where a dead process just stops.
<cite index="22-3,22-4">The Byzantine generals problem describes the difficulty of coordinating the actions of several independent parties in a distributed system. Marshall Pease, Robert Shostak, and Leslie Lamport developed the idea in 1982.</cite> <cite index="22-5,22-6,22-7">The problem is framed as a military metaphor, in which a group of generals are camping around a city and must decide whether to attack or retreat. The generals can only communicate with one another by sending messages, but they cannot be sure that the messages are authentic. Therefore, the generals must find a way to reach a consensus despite the possibility of deception and betrayal.</cite>
<cite index="19-3">Byzantine Fault Tolerance describes fault tolerance to Byzantine faults, where a rogue process may continue to generate arbitrary data instead of gracefully failing, making it difficult to detect as a faulty process.</cite> The protocol surface expands: you need to verify not just liveness but correctness under adversarial conditions. The replicas cannot assume honest mistakes; they must build proofs that survive coordination with malicious peers.
Sources:
- https://www.microsoft.com/en-us/research/publication/practical-byzantine-fault-tolerance-proactive-recovery/
- https://newsletter.scalablethread.com/p/what-is-the-byzantine-generals-problem
- https://www.baeldung.com/cs/distributed-systems-the-byzantine-generals-problem
- https://www.pdcunplugged.org/activities/byzantinegenerals/
#distributed-systems #byzantine-fault-tolerance #consensus-algorithms #fault-models #reliability-engineering #lamport
PBFT moved Byzantine tolerance from theory into production
<cite index="3-1,3-2">Miguel Castro and Barbara Liskov published their 1999 OSDI paper on Practical Byzantine Fault Tolerance, developing a Byzantine fault tolerant consensus protocol that was both efficient and applicable to realistic scenarios.</cite> The work mattered because it changed the economics of building trustworthy distributed systems. <cite index="1-5">The algorithm worked in asynchronous environments like the Internet and incorporated optimizations that improved response time of previous algorithms by more than an order of magnitude.</cite> <cite index="3-10">Liskov and Castro were the first to propose a correct algorithm that worked efficiently in asynchronous networks and realized the 3f+1 lower bound.</cite>
The constraint is real: <cite index="31-1">a practical Byzantine Fault Tolerant system can function on the condition that the maximum number of malicious nodes must not be greater than or equal to one-third of all the nodes in the system.</cite> That means if you need to tolerate f failures, you deploy 3f+1 replicas: more hardware than the 2f+1 that crash-tolerant protocols need, but without the exponential message complexity of earlier Byzantine solutions. <cite index="27-11">Their implementation of a Byzantine-fault-tolerant NFS, called BFS, was reported to be only about 3% slower than standard unreplicated NFS in the Andrew benchmark.</cite> The gap between theoretical possibility and deployable service narrowed to something an engineer could work with.
<cite index="5-2,5-3">Software bugs, operator mistakes, and malicious attacks cause arbitrary behavior—Byzantine faults—and BFT can be used to build highly available systems that tolerate Byzantine faults.</cite> The protocol assumes nodes can lie, not just crash, which changes the verification surface. You need cryptographic proofs and multiple rounds of agreement before state changes commit.
Sources:
- https://www.the-paper-trail.org/post/2009-03-30-barbara-liskovs-turing-award-and-byzantine-fault-tolerance/
- https://www.researchgate.net/publication/2437947_Practical_Byzantine_Fault_Tolerance
- https://www.microsoft.com/en-us/research/publication/practical-byzantine-fault-tolerance-proactive-recovery/
- https://www.cube.exchange/what-is/pbft-practical-byzantine-fault-tolerance
- https://www.geeksforgeeks.org/practical-byzantine-fault-tolerancepbft/
#distributed-systems #consensus-algorithms #byzantine-fault-tolerance #reliability-engineering #replication #asynchronous-systems #pbft
Turing Recognition for Systems That Ship
<cite index="17-1,17-3">ACM has named Michael Stonebraker of the Massachusetts Institute of Technology the recipient of the 2014 ACM A.M. Turing Award for fundamental contributions to the concepts and practices underlying modern database systems. Stonebraker invented many of the concepts that are used in almost all modern database systems.</cite> <cite index="17-4,17-5">He demonstrated how to engineer database systems that support these concepts and released these systems as open software, which ensured their widespread adoption. Source code from Stonebraker's systems can be found in many modern database systems.</cite>
<cite index="23-2">He has been critical of the assumption that "one size fits all" when implementing relational database management systems and that dominant general purpose systems, such as Oracle, can serve the needs of all users.</cite> <cite index="23-3">Stonebraker is the only Turing award winner to have engaged in serial entrepreneurship on anything like this scale, giving him a distinctive perspective on the academic world.</cite> <cite index="20-2,20-3">Through a series of academic prototypes and commercial startups, Stonebraker's research and products are central to many relational databases. He is also the founder of many database companies, including Ingres Corporation, Illustra, Paradigm4, StreamBase Systems, Tamr, Vertica, VoltDB and Hopara, and served as chief technical officer of Informix.</cite>
Sources:
- https://www.acm.org/articles/bulletins/2015/march/turing-award-2014
- https://amturing.acm.org/award_winners/stonebraker_1172121.cfm
- https://en.wikipedia.org/wiki/Michael_Stonebraker
#turing-award #database-systems #academic-research #commercialization #database-companies #systems-engineering #database-architecture #data-infrastructure #systems-thinking
Ingres and Postgres: The Relational Foundation Layer
<cite index="8-1">Michael Stonebraker, along with Eugene Wong, created the first working relational database system, INGRES, in 1974.</cite> <cite index="1-1,1-9">Stonebraker introduced the object-relational model of database architecture with the release of Postgres, integrating important ideas from object-oriented programming into the relational database context. Postgres extended the relational database model, enabling users to define, store and manipulate rich objects with complex state and behavior.</cite> <cite index="1-10,1-11">Concepts introduced in Ingres and Postgres can be found in nearly all major database systems today. Ingres and Postgres were well engineered, built on UNIX, released as open software, and form the basis of many modern commercial database systems including Illustra, Informix, Netezza and Greenplum.</cite>
The architecture work was foundational. <cite index="2-2,2-7">Architecture of a Database System presents an architectural discussion of DBMS design principles, including process models, parallel architecture, storage system design, transaction system implementation, query processor and optimizer architectures, and typical shared components and utilities.</cite> <cite index="2-4,2-9">Historically, DBMSs were among the earliest multi-user server systems to be developed, and thus pioneered many systems design techniques for scalability and reliability now in use in many other contexts.</cite>
Sources:
- https://thenewstack.io/dr-michael-stonebraker-a-short-history-of-database-systems/
- https://www.prweb.com/releases/acm_turing_award_goes_to_pioneer_in_database_systems_architecture_mit_s_michael_stonebraker_brought_relational_database_systems_from_concept_to_commercial_success/prweb12607207.htm
- https://books.google.com/books/about/Architecture_of_a_Database_System.html?id=aBPcm3C1avMC
#database-history #relational-databases #postgres #ingres #database-architecture #open-source #systems-design #data-infrastructure #systems-thinking
The SQL Performance Argument Was Always a Decoy
<cite index="14-5,14-6,14-7">NoSQL is a collection of 50 or 75 vendors with various objectives. For some of them, the goal is to go fast by rejecting SQL and ACID. Stonebraker feels these folks are misguided, since SQL is not the performance problem in current RDBMSs.</cite> <cite index="13-4,13-8">The overhead associated with OLTP databases in traditional SQL systems has little to do with SQL.</cite> This matters because it changes the question from "relational versus NoSQL" to "what workload, what contract."
<cite index="14-8">There is a NewSQL movement that contains very high-performance ACID/SQL implementations.</cite> <cite index="14-9,14-10">Other members of the NoSQL movement are focused on document management or semi-structured data—application areas where RDBMSs are known not to work very well. These folks seem to be filling a market not well served by RDBMSs.</cite> The argument is not about query languages. <cite index="9-3,9-4,9-5">Most have a data model, which is unique to that system, along with a one-off, record-at-a-time user interface. My enterprise guru was very concerned with the proliferation of such one-offs. In contrast, SQL offers a standard environment.</cite> The costs are in the ecosystem—in the people who have to learn the interfaces, in the teams who build on unstable abstractions.
Sources:
- https://iggyfernandez.wordpress.com/2011/12/23/nocoug-journal-interview-professor-stonebraker/
- https://www.scribd.com/document/105112961/Stonebraker-SQL-vs-NoSQL-2010
- https://cacm.acm.org/blogcacm/stonebraker-on-nosql-and-enterprises/
#nosql #sql #database-performance #query-languages #systems-architecture #acid-properties #trade-offs #database-architecture #data-infrastructure #systems-thinking
No Size Fits All: Stonebraker's Fracture Thesis
<cite index="1-2">Stonebraker has been an advocate of the "no size fits all" approach to database systems architecture and has developed database architectures for specialized purposes.</cite> This is not academic hedging. <cite index="27-4,27-5">The last 25 years of commercial DBMS development can be summed up in a single phrase: "one size fits all". This phrase refers to the fact that the traditional DBMS architecture (originally designed and optimized for business data processing) has been used to support many data-centric applications with widely varying characteristics and requirements.</cite> <cite index="25-1">The paper contends that the relational database cannot be extended ad infinitum, demonstrates how RDBMSs are inappropriate for several new applications, and argues that the DBMS market will fragment into a series of special-purpose engines, perhaps unified by a common front-end parser.</cite>
Stonebraker built the thesis with systems, not papers. <cite index="1-3,1-4,1-5">He pioneered real-time processing over streaming data sources (Aurora/StreamBase). His work on column-oriented storage architecture resulted in systems optimized for complex queries (C-Store/Vertica). He developed a high throughput, distributed main-memory online transaction processing system (H-Store/VoltDB).</cite> <cite index="24-4,24-5">Stonebraker argues that most people won't even consider a special-purpose database (largely due to inertia) unless it is at least 10x faster than relational for a given application. He then demonstrates several applications where you can see 10 - 100x gains in performance.</cite> These are not theoretical constructs. They are bottleneck-specific solutions with known contracts.
Sources:
- https://www.prweb.com/releases/acm_turing_award_goes_to_pioneer_in_database_systems_architecture_mit_s_michael_stonebraker_brought_relational_database_systems_from_concept_to_commercial_success/prweb12607207.htm
- https://www.kellblog.com/stonebrakers-one-size-fits-all-papers/
- https://www.researchgate.net/publication/4133428_One_size_fits_all_an_idea_whose_time_has_come_and_gone
- https://kellblog.com/2007/07/18/stonebrakers-one-size-fits-all-papers/
#database-architecture #specialized-databases #systems-thinking #architectural-trade-offs #performance-optimization #data-infrastructure
STEADY Goals and the 2020 Expansion
<cite index="10-1,10-2">In 2020, Lampson published a long version of the 1983 paper that suggests the goals you might have for your system—Simple, Timely, Efficient, Adaptable, Dependable, Yummy (STEADY)—and techniques for achieving them—Approximate, Incremental, Divide & Conquer (AID), along with some principles for system design that are more than just hints, and many examples of how to apply the ideas.</cite> The updated paper runs to hundreds of pages on arXiv, compared to the original's 16.
<cite index="11-1,11-2,11-3">The techniques are organized around efficiency (algorithm, batch, cache, concurrent, lazy, local, shard, stream, summarize, translate), adaptability (dynamic, index, indirect, scale, virtualize), and dependability (atomic, consensus, eventual, redundant, replicate, retry).</cite> <cite index="11-12,11-13">The main thing is to keep the spec simple and to divide the system into modules with simple specs, but there's also advice about keeping the code simple.</cite>
<cite index="9-2,9-3">Lampson presented the updated version at the Heidelberg Laureate Forum in September, talking about the lessons of the past decades that helped him update his 1983 work.</cite> The expansion is both a reflection and a catalog: proof that the original hints aged well enough to merit systematization.
Sources:
- https://arxiv.org/abs/2011.02455
- http://bwl-website.s3-website.us-east-2.amazonaws.com/87-HintsAndPrinciples/Hints%20and%20Principles%20short.pdf
- https://queue.acm.org/blogposting.cfm?id=73924
#systems-thinking #reliability-engineering #design-patterns #infrastructure-fundamentals #distributed-systems #modularity
Simplicity, Speed, and the Refusal to Generalize
<cite index="16-8,16-9,16-10,16-11">The core advice: keep it simple—do one thing at a time and do it well; don't generalize in the interface; make it fast, rather than general or powerful, because it's much better to have basic operations executed quickly than powerful ones that are slower.</cite> <cite index="12-11,12-12">The trouble with slow, powerful operations is that the client who doesn't want the power pays more for the basic function, which is usually not the right one.</cite>
<cite index="12-1,12-2,12-3,12-4">Cache answers to expensive computations, rather than doing them over; use hints to speed up normal execution—a hint, like a cache entry, is the saved result of some computation, but it may sometimes be wrong, and it is not necessarily reached by an associative lookup.</cite> <cite index="22-6,22-7">Handle normal and worst case separately as a rule, because the requirements for the two are quite different: the normal case must be fast; the worst case must make some progress, and sometimes radically different strategies are appropriate.</cite>
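A minimal sketch of how a hint differs from a cache entry, assuming a hypothetical cluster in which keys can move between servers. The names `Cluster`, `Client`, `slow_locate`, and `server_has` are illustrative and do not come from Lampson's paper. The hint serves the normal case, is checked before it is trusted, and is repaired on the slower worst-case path when it turns out to be wrong.

```python
class Cluster:
    """Hypothetical cluster: `placement` is the truth about where each key lives."""

    def __init__(self, placement):
        self.placement = dict(placement)
        self.slow_lookups = 0  # counts trips through the expensive path

    def slow_locate(self, key):
        """Worst-case path: always correct, but expensive (say, a directory scan)."""
        self.slow_lookups += 1
        return self.placement[key]

    def server_has(self, server, key):
        """Cheap check performed by whichever server the client contacts."""
        return self.placement.get(key) == server


class Client:
    def __init__(self, cluster):
        self.cluster = cluster
        self.hints = {}  # key -> server; may be stale, so never trusted blindly

    def find(self, key):
        hinted = self.hints.get(key)
        # Normal case: the hint is right and one cheap check confirms it.
        if hinted is not None and self.cluster.server_has(hinted, key):
            return hinted
        # Worst case: hint missing or wrong. Consult the truth and repair the hint.
        actual = self.cluster.slow_locate(key)
        self.hints[key] = actual
        return actual


cluster = Cluster({"inbox": "server-a"})
client = Client(cluster)

first = client.find("inbox")              # no hint yet: slow path
second = client.find("inbox")             # hint correct: fast path
cluster.placement["inbox"] = "server-b"   # data moves; the hint is now wrong
third = client.find("inbox")              # stale hint detected: slow path again
```

The design choice worth noticing is that correctness never depends on the hint. A wrong hint costs one extra check and then the slow path, which keeps the two cases separate in the way the paragraph above describes.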
<cite index="12-22,12-23">Use a good idea again, instead of generalizing it, because a specialized implementation of the idea may be much more effective than a general one.</cite> <cite index="19-10,19-11">One way to combine simplicity, flexibility, and high performance is to focus only on solving one problem and leaving the rest up to the client—Lampson gives the example of parsers that do context-free recognition but call out to client-supplied semantic routines.</cite>
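The parser example can be sketched as follows, assuming a toy grammar of integers joined by `+`. The function `parse_sum` and the callback names `on_number` and `on_add` are hypothetical, chosen only for illustration. The recognizer knows the grammar and nothing else; every decision about meaning belongs to the client.

```python
def parse_sum(text, on_number, on_add):
    """Recognize NUMBER ('+' NUMBER)* and delegate all meaning to the client."""
    tokens = text.replace(" ", "").split("+")
    if not all(tok.isdecimal() for tok in tokens):
        raise ValueError(f"not a sum of integers: {text!r}")
    result = on_number(tokens[0])
    for tok in tokens[1:]:
        result = on_add(result, on_number(tok))
    return result


# Client 1: evaluate the expression.
value = parse_sum("1 + 2 + 3", on_number=int, on_add=lambda a, b: a + b)

# Client 2: build a syntax tree from the same recognizer, unchanged.
tree = parse_sum(
    "1 + 2 + 3",
    on_number=lambda tok: ("num", tok),
    on_add=lambda left, right: ("add", left, right),
)
```

Two clients with different needs reuse one small recognizer, and neither pays for the other's semantics.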
Sources:
- http://www.vendian.org/mncharity/dir3/hints_lampson/
- https://medium.com/@bowen.raymone/paper-review-hints-for-computer-system-design-65a3d32ea380
- https://blog.acolyer.org/2016/09/16/hints-for-computer-system-design/
#performance-optimization#interface-design#systems-thinking#cache-architecture#trade-offs#infrastructure-fundamentals#reliability-engineering
Folk Wisdom From Someone Who Built Things That Worked
<cite index="14-6,14-7,14-8,14-9">Lampson had designed and built a number of computer systems, some that worked and some that didn't, and also used and studied many other systems, both successful and unsuccessful—from this experience came general hints for designing successful systems, with no claim to originality; most are part of the folk wisdom of experienced designers.</cite> <cite index="8-8,8-9,8-10,8-11">"They are just hints," he writes—some quite general and vague, others specific techniques that are more widely applicable than many people know, with both the hints and examples necessarily oversimplified, and many controversial.</cite>
<cite index="20-13,20-14,20-15">The disclaimer is deliberate: these hints are not novel, not foolproof recipes, not laws of design, not precisely formulated, and not always appropriate—they are just hints, context dependent, and some may be controversial.</cite> <cite index="1-1,1-3">Lampson's hints ("do one thing at a time, and do it well," "use brute force," "keep secrets") are not principles derived from theory but wisdom distilled from decades of building things that had to work.</cite>
This is anti-methodology: a set of lessons drawn from working systems, not a top-down design process or a catalog of abstraction patterns. <cite index="20-6">What's remarkable is how relevant and fresh these hints remain decades after their publication.</cite>
Sources:
- https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/acrobat-17.pdf
- http://muratbuffalo.blogspot.com/2011/01/hints-for-computer-system-design-acm.html
- https://inigomedina.co/library/work/lampson-hints-for-computer-system-design
#pragmatism#systems-thinking#reliability-engineering#design-philosophy#infrastructure-fundamentals#xerox-parc
Why Interfaces Matter More Than Algorithms
<cite index="20-2,20-3,20-4">System design is different from algorithm design because the external interface is less precisely defined, more complex, and more subject to change; the system has more internal structure and hence more internal interfaces; and the measure of success is less clear.</cite> <cite index="23-10,23-11">Lampson says the designer "finds himself floundering in a sea of possibilities, unclear about how one choice will limit his freedom," and that there probably isn't a best way to build the system—much more important is to avoid choosing a terrible way.</cite>
<cite index="27-2,27-3">Defining interfaces is the most important part of system design, and usually the most difficult, since the interface must satisfy three conflicting requirements: it should be simple, it should be complete, and it should admit a sufficiently small and fast implementation.</cite> <cite index="27-6">The difficulty comes from the fact that each interface is a small programming language: it defines a set of objects and the operations that can be used to manipulate them.</cite>
<cite index="2-3,2-4">Lampson's 1983 paper drew on experience with the design and implementation of a number of computer systems, illustrated by examples ranging from hardware such as the Alto and the Dorado to applications programs such as Bravo and Star.</cite> The paper was presented at the 9th ACM Symposium on Operating Systems Principles and came out of Xerox PARC, where Lampson had watched enough systems succeed and fail to recognize patterns.
Sources:
- https://dl.acm.org/doi/10.1145/800217.806614
- https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/acrobat-17.pdf
- http://www.leptonica.org/cachedpages/lampson-hints.html
- http://muratbuffalo.blogspot.com/2011/01/hints-for-computer-system-design-acm.html
#interface-design#systems-thinking#xerox-parc#software-architecture#trade-offs#infrastructure-fundamentals#reliability-engineering
Kleppmann's Framework for Data System Tradeoffs
<cite index="8-2,8-3">Martin Kleppmann's Designing Data-Intensive Applications (DDIA) begins by establishing what matters in distributed systems: building applications that are reliable, scalable, and maintainable for the long run</cite>. <cite index="5-13,5-14">Data is at the center of system design challenges today: scalability, consistency, reliability, efficiency, and maintainability</cite>. <cite index="6-7,6-8">Kleppmann navigates the landscape by examining pros and cons of various technologies for processing and storing data. Software keeps changing, but fundamental principles remain the same</cite>.
<cite index="4-4,4-7,4-8">The book's structure moves from single-machine systems (data models, storage, encoding) through distributed data (replication, partitioning, transactions, consensus) to derived data (batch processing, stream processing)</cite>. <cite index="8-4">It explores different database types, distributed systems, and data processing to help understand strengths, weaknesses, and tradeoffs</cite>. This is not a vendor playbook. It's a dependency map for engineers who need to explain why one consistency model costs more than another, or when stream processing creates more problems than it solves.
Sources:
- https://newsletter.techworld-with-milan.com/p/what-i-learned-from-the-book-designing
- https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373320
- https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/
- https://cdn.bookey.app/files/pdf/book/en/designing-data-intensive-applications.pdf
#data-infrastructure#systems-architecture#distributed-systems#infrastructure-fundamentals#martin-kleppmann#ddia#reliability#scalability
Event Sourcing and Change Data Capture as Stream Primitives
<cite index="28-3,28-4">Keeping all changes as a log of immutable events gives strictly richer information than overwriting data in a database. Event sourcing records every write as an immutable event rather than performing destructive state mutation</cite>. <cite index="32-1">For data modeling, an append-only event log is preferred over database mutations because events capture state transitions and business processes more accurately than insert/update/delete operations</cite>.
<cite index="12-2,12-3">The log abstraction reappears: a database's change log can be viewed as a stream of events. This is the idea behind Change Data Capture (CDC), where database changes are captured and streamed to other systems</cite>. <cite index="12-4,12-5">You can stream database updates to a search index or cache rather than batch-syncing occasionally—this is how systems like Debezium or LinkedIn's Databus work</cite>. <cite index="10-6,10-9">Dual writes that update multiple systems concurrently are error-prone due to race conditions. A better approach designates one system as leader and makes others followers</cite>. The pattern turns the database inside out: replication logs become real-time data pipelines.
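A minimal in-memory sketch of the leader-and-followers pattern described above. The classes `EventLog` and `Follower` and the cart keys are hypothetical; this is not the API of Debezium, Databus, or any real CDC tool. One append-only log is the leader, and each derived system rebuilds its state by applying events in offset order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    offset: int
    key: str
    value: object  # None marks a deletion


class EventLog:
    """Append-only log acting as the leader. Events are never updated or removed."""

    def __init__(self):
        self._events = []

    def append(self, key, value):
        event = Event(offset=len(self._events), key=key, value=value)
        self._events.append(event)
        return event

    def read_from(self, offset):
        return self._events[offset:]


class Follower:
    """A derived view (cache, search index) that applies events in log order."""

    def __init__(self):
        self.state = {}
        self.next_offset = 0  # position in the log this follower has reached

    def catch_up(self, log):
        for event in log.read_from(self.next_offset):
            if event.value is None:
                self.state.pop(event.key, None)
            else:
                self.state[event.key] = event.value
            self.next_offset = event.offset + 1


log = EventLog()
cache, index = Follower(), Follower()

log.append("cart:42", ["book"])
cache.catch_up(log)                     # cache is current; index lags behind
log.append("cart:42", ["book", "pen"])
log.append("cart:7", ["lamp"])
log.append("cart:7", None)              # cart 7 deleted
cache.catch_up(log)
index.catch_up(log)                     # index replays from offset 0 and converges
```

Because both followers read the same ordered log, they cannot disagree about the final state the way two independently written systems can under dual writes. A follower that falls behind simply resumes from its last offset.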
Sources:
- https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html
- https://queue.acm.org/detail.cfm?id=3321612
- http://muratbuffalo.blogspot.com/2024/12/ddia-chapter-11-stream-processing.html
- https://newsletter.techworld-with-milan.com/p/what-i-learned-from-the-book-designing
#event-sourcing#change-data-capture#data-infrastructure#stream-processing#immutability#replication#database-internals#systems-architecture#infrastructure-fundamentals
Stream Processing as the Unbounded Opposite of Batch
<cite index="10-10,10-12">Batch processes run periodically on a bounded dataset—daily, hourly—introducing latency between input and output. Stream processing runs continuously on unbounded data, handling events as they arrive</cite>. <cite index="10-13">In stream processing, a record is typically an event: a small, immutable object with a timestamp</cite>. <cite index="10-1,10-2">Partitioned logs like Apache Kafka, Amazon Kinesis, and DistributedLog assign offsets to messages in append-only logs and prioritize throughput and ordering</cite>, while <cite index="10-5">JMS/AMQP brokers work better when messages are expensive to process and parallel execution matters more than order</cite>.
The tradeoff is operational. <cite index="23-8,23-9">Batch processing works on bounded datasets collected over a window and processed as a discrete job; streaming operates on unbounded data with no natural stopping point</cite>. <cite index="23-14">Streaming introduces failure modes batch pipelines rarely encounter, the most common being backpressure when incoming events exceed processing capacity</cite>. <cite index="22-11,22-12">Batch is simpler because data is finite; streaming is more complex due to continuous flow and consistency challenges</cite>. You pick based on whether you need answers now or can wait until the job finishes.
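The contrast can be sketched with the standard library alone. The function and class names here (`batch_total`, `RunningTotal`, `offer`) and the buffer size of 3 are assumptions made for illustration. The batch job sees its whole input and finishes; the stream consumer keeps running state, and a bounded buffer surfaces backpressure when a burst exceeds capacity.

```python
import queue


def batch_total(records):
    """Batch: the input is bounded, so the job can read all of it and finish."""
    return sum(records)


class RunningTotal:
    """Stream: the input has no end, so state is updated per event."""

    def __init__(self):
        self.total = 0

    def on_event(self, amount):
        self.total += amount
        return self.total


def offer(buffer, event):
    """Try to enqueue without blocking; False is the backpressure signal."""
    try:
        buffer.put_nowait(event)
        return True
    except queue.Full:
        return False


buffer = queue.Queue(maxsize=3)  # bounded buffer between producer and consumer
accepted = [offer(buffer, n) for n in [5, 10, 15, 20]]  # burst exceeds capacity

stream = RunningTotal()
while not buffer.empty():
    stream.on_event(buffer.get_nowait())

full_batch = batch_total([5, 10, 15, 20])
```

The fourth event is rejected, so the stream total covers only what fit in the buffer. What the producer does with that rejection (block, drop, or spill to disk) is the operational decision that a batch job never has to make.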
Sources:
- http://muratbuffalo.blogspot.com/2024/12/ddia-chapter-11-stream-processing.html
- https://www.landskill.com/blog/streaming-vs-batch-processing/
#stream-processing#batch-processing#data-infrastructure#kafka#message-brokers#systems-architecture#unbounded-data#infrastructure-fundamentals
How real databases pick sides under partition
The CAP taxonomy sorts databases by what they preserve when the network splits. <cite index="7-9">NoSQL databases are generally considered to be AP systems, providing Availability and Partition tolerance at the expense of Consistency</cite>. <cite index="12-13">Examples of AP databases include eventually consistent stores such as Amazon DynamoDB and Riak</cite>. <cite index="16-10,16-11">An AP system provides Availability and Partition Tolerance but does not guarantee immediate consistency. During a network partition, the system continues to serve requests, but some nodes may return stale or outdated data until the system eventually synchronizes</cite>.
On the other side, CP systems prioritize consistency. <cite index="15-1">Bigtable is designed to prioritize consistency and partition tolerance (CP), which means that in the event of a network partition or failure, Bigtable may compromise availability to maintain data consistency</cite>. Some systems try to have it both ways: <cite index="15-2,15-3">DynamoDB is designed to prioritize high availability and partition tolerance. Initially, it offered only 'eventual consistency,' but now it also provides a 'strong consistency' option</cite>.
<cite index="7-11">According to Eric Brewer's painstaking analysis, Google Spanner is technically a CP system that can claim to be an "effectively CA" system</cite>. The categories matter less than understanding the mechanism: what happens to a read request when replication lags? What SLA does the client get? The theorem names the question; the implementation is where the dependencies live.
Sources:
- https://www.scylladb.com/glossary/cap-theorem/
- https://medium.com/@ajayverma23/demystifying-the-cap-theorem-understanding-consistency-availability-and-partition-tolerance-446de8452fac
- https://www.geeksforgeeks.org/system-design/cap-theorem-in-system-design/
- https://www.splunk.com/en_us/blog/learn/cap-theorem.html
#database-classification#ap-systems#cp-systems#eventual-consistency#dynamodb#bigtable#google-spanner#nosql#distributed-systems#database-architecture#infrastructure-fundamentals
How the proof works and what it actually constrains
<cite index="19-5">Gilbert and Lynch's specification and proof of the CAP Theorem</cite> rely on a simple scenario. Consider two servers maintaining the same variable. In an inconsistent system, <cite index="19-16">a client writes a value to one server and that server acknowledges, but when it reads from the other server, it gets stale data</cite>. <cite index="19-17,19-18,19-19">In a consistent system, the first server replicates its value to the second server before sending an acknowledgement to the client. Thus, when the client reads from the second server, it gets the most up to date value</cite>.
The proof by contradiction assumes <cite index="19-2">there does exist a system that is consistent, available, and partition tolerant</cite>. A simple network partition scenario breaks this. <cite index="22-18,22-19,22-20,22-21">Suppose there is a network failure and two servers cannot talk to each other. Now assume that the client makes a write to one server. The client then sends a read to the other server. Given the servers cannot talk, they have different views of the data</cite>. <cite index="22-1,22-2">If the system has to remain consistent, it must deny the request and thus give up on availability. If the system is available, then the system has to give up on consistency</cite>.
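The two-server scenario can be simulated directly. This is a toy model under stated assumptions, not Gilbert and Lynch's formalism: the `Node` class, the `"AP"` and `"CP"` mode labels, and the `Unavailable` exception are hypothetical names introduced for this sketch. It shows the fork the proof forces: under partition, a node either answers and risks a stale read, or refuses and gives up availability.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


class Node:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP": labels assumed for this sketch
        self.value = "v0"
        self.peer = None
        self.partitioned = False  # True when the link to the peer is down

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Cannot replicate before acknowledging, so refuse the write.
                raise Unavailable("cannot reach peer")
            self.value = value    # AP: accept locally; the peer diverges
            return "ack"
        self.value = value
        self.peer.value = value   # replicate before acknowledging
        return "ack"

    def read(self):
        return self.value


def make_pair(mode):
    a, b = Node(mode), Node(mode)
    a.peer, b.peer = b, a
    return a, b


def partition(a, b):
    a.partitioned = b.partitioned = True


# Partition with AP nodes: both requests are answered, but the read is stale.
a, b = make_pair("AP")
partition(a, b)
ap_ack = a.write("v1")
ap_read = b.read()

# Partition with CP nodes: the write is refused, so no stale read can occur.
c, d = make_pair("CP")
partition(c, d)
try:
    c.write("v1")
    cp_write_refused = False
except Unavailable:
    cp_write_refused = True
cp_read = d.read()

# No partition: replication happens before the acknowledgement.
e, f = make_pair("CP")
e.write("v1")
healthy_read = f.read()
```

In the AP run the client holds an acknowledgement for `v1` and then reads `v0`, which is the inconsistency. In the CP run the read of `v0` is still correct, because no newer write was ever acknowledged.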
<cite index="1-5,1-10">Brewer noted the different definition of consistency used in the CAP theorem relative to the definition used in ACID. Consistency as defined in the CAP theorem is quite different from the consistency guaranteed in ACID database transactions</cite>. CAP's consistency is linearizability, which Gilbert and Lynch call atomic consistency: every read returns the most recent acknowledged write, as if there were a single copy of the data. That use of "atomic" is unrelated to ACID atomicity, which means a transaction applies entirely or not at all.
Sources:
- https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/
- https://www.geeksforgeeks.org/system-design/cap-theorem-in-system-design/
- https://en.wikipedia.org/wiki/CAP_theorem
#cap-theorem-proof#gilbert-lynch#consistency-models#distributed-systems#formal-verification#network-partitions#acid-vs-cap#database-architecture#infrastructure-fundamentals
Why partition tolerance is not optional in real systems
<cite index="11-15,11-16">In the theorem, partition tolerance is a must. The assumption is that the system operates on a distributed data store so the system, by nature, operates with network partitions</cite>. This is the part that collapses the "pick two" framing: <cite index="14-3">no distributed system is safe from network failures, thus network partitioning generally has to be tolerated</cite>.
Once you accept that partitions are environmental reality rather than a design choice, the theorem narrows. <cite index="14-1">If there is a network partition, one has to choose between consistency or availability</cite>. <cite index="14-8">When a network partition failure happens, it must be decided whether to cancel the operation and thus decrease the availability but ensure consistency, or proceed with the operation and thus provide availability but risk inconsistency</cite>. During normal operations, <cite index="14-2">a data store covers all three</cite>.
<cite index="10-5,10-6">You cannot choose whether to have network partitions. Brewer clarified the design problem as one of how to trade off consistency against availability when network partitions occur</cite>. <cite index="10-7">Brewer points out that network partitions are not a binary property; all networks have latency, and a complete communication failure is just the limiting case when the latency goes to infinity</cite>. The theorem is a forcing function: if you design distributed storage, you are answering this question whether you articulate it or not.
Sources:
- https://www.bmc.com/blogs/cap-theorem/
- https://en.wikipedia.org/wiki/CAP_theorem
- https://arxiv.org/pdf/2109.07771
#partition-tolerance#distributed-systems#network-failures#cap-theorem#system-design#latency#consistency-availability-tradeoff#database-architecture#infrastructure-fundamentals
The theorem that forced engineers to choose their constraints
<cite index="1-2,23-2">Eric Brewer presented the CAP conjecture at the 2000 Symposium on Principles of Distributed Computing</cite>, after <cite index="1-1">it first appeared in autumn 1998</cite>. <cite index="1-3,23-3">In 2002, Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer's conjecture, rendering it a theorem</cite>.
The assertion is simple: <cite index="1-7">any distributed data store can provide at most two of the following three guarantees</cite>. Consistency means <cite index="1-8">all clients see the same data at the same time, no matter which node they connect to</cite>. For this to work, <cite index="1-9">whenever data is written to one node, it must be instantly forwarded or replicated to all the other nodes in the system before the write is deemed 'successful'</cite>. Availability means <cite index="1-11">every request received by a non-failing node in the system must result in a response, without the guarantee that it contains the most recent version of the data</cite>. Partition tolerance means <cite index="14-7">the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes</cite>.
<cite index="1-4,23-4">In 2012, Brewer clarified some of his positions, including why the often-used "two out of three" concept can be somewhat misleading because system designers only need to sacrifice consistency or availability in the presence of partitions</cite>. <cite index="8-2,8-4">By explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some trade-off of all three</cite>. The theorem is less a vending machine where you pick two buttons and more a statement about what breaks when the network does.
Sources:
- https://en.wikipedia.org/wiki/CAP_theorem
- https://sites.cs.ucsb.edu/~rich/class/cs293b-cloud/papers/brewer-cap.pdf
- https://www.ibm.com/think/topics/cap-theorem
#distributed-systems#cap-theorem#eric-brewer#consistency-models#network-partitions#database-architecture#tradeoff-analysis#infrastructure-fundamentals
What Brooks did not say: incremental attacks can still compound
The phrase "no silver bullet" has drifted in common usage. <cite index="3-7">When people state something along the lines that there's no silver bullet in software development, the impression is often that they mean there's no panacea</cite>. But <cite index="2-1">Brooks insisted that while there is no one silver bullet, he believes that a series of innovations attacking essential complexity could lead to significant improvements</cite>. <cite index="21-14,21-15">Brooks saw no startling breakthroughs and believed such to be inconsistent with the nature of software, yet many encouraging innovations were under way, and a disciplined, consistent effort to develop, propagate, and exploit them should indeed yield an order-of-magnitude improvement</cite>. <cite index="19-12,19-13">Brooks made a bold statement: no single technique will cause an order of magnitude improvement in productivity in the next ten years (1986-1996), and this turned out true</cite>. The original paper's claim was narrower than people remember: not that progress is impossible, but that singular miracle cures don't exist. The path is incremental compounding, not one dramatic leap. Engineers who cite Brooks to argue against investing in better abstractions or better tooling are misreading him. He was skeptical of magic, not of discipline.
Sources:
- https://blog.ploeh.dk/2019/07/01/yes-silver-bullet/
- https://en.wikipedia.org/wiki/No_Silver_Bullet
- https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf
- https://www.researchgate.net/publication/221322165_No_silver_bullet_a_retrospective_on_the_essence_and_accidents_of_software_engineering
#complexity-theory#systems-thinking#infrastructure-fundamentals#software-engineering#brooks#incremental-improvement#essential-complexity
Where the past productivity gains came from, and why they're spent
<cite index="5-12,5-13">Brooks examined areas where great improvements in productivity had taken place in the past and concluded that these breakthroughs addressed accidental difficulties and not essential ones</cite>. <cite index="5-14">High-level languages abstract away the hardware architecture, so the developer doesn't have to be concerned with registers and other low-level constructs</cite>. <cite index="17-8">Brooks argued that the major gains to be realized from addressing the accidental elements of software engineering have already been made: the invention of high level languages, movement to interactive computing from batch processing, and development of powerful integrated environments</cite>. <cite index="19-10">Brooks thought in 1986 that it was likely that the accidental complexity of projects was less than 90% of the total, so shrinking the accidental complexity to zero could not result in a productivity improvement of an order of magnitude</cite>. The implication is structural: if most of what remains is essential, then the compounding 10x wins that defined the previous era—assembler to FORTRAN to time-sharing—are not repeatable by the same means. <cite index="17-9">Any further order-of-magnitude improvements can be made only by addressing software's essential difficulties—the complexity, conformity, changeability, and invisibility inherent to software development</cite>. The tooling frontier had closed.
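The arithmetic behind the 90% threshold can be written out as a short worked bound. The symbols $a$ and $S$ are introduced here for illustration and are not Brooks's notation. Let $a$ be the fraction of total effort that is accidental and $S$ the speedup from eliminating it entirely.

```latex
% a = fraction of total effort that is accidental (symbol assumed for this sketch)
% S = speedup if all accidental effort is removed and essential effort is unchanged
S = \frac{1}{1 - a}, \qquad S \ge 10 \iff a \ge 0.9

% Illustrative value only, not an estimate from Brooks:
a = 0.5 \implies S = \frac{1}{1 - 0.5} = 2
```

So an order-of-magnitude gain from tooling alone requires that accidental work be at least nine-tenths of the total. If it is less than that, as Brooks thought likely, even perfect tooling falls short of 10x.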
Sources:
- https://kenbaumcoder.medium.com/no-silver-bullet-db166c3a1add
- https://stevemcconnell.com/articles/software-engineering-principles/
- https://www.researchgate.net/publication/221322165_No_silver_bullet_a_retrospective_on_the_essence_and_accidents_of_software_engineering
#complexity-theory#systems-thinking#infrastructure-fundamentals#software-engineering#brooks#productivity#high-level-languages#accidental-complexity
The four properties that make software inherently difficult
Brooks identified <cite index="10-1">four essential, or inherent, difficulties of software: complexity, conformity, changeability, and invisibility</cite>. On complexity: <cite index="21-20">Software entities are more complex for their size than perhaps any other human construct, because no two parts are alike (at least above the statement level)</cite>. <cite index="18-5">When a software entity is scaled up, in most cases elements interact with each other in some non-linear fashion, such that the complexity of the whole increases much more than linearly</cite>. <cite index="14-3">Conformity means that systems are complex because humans impose complex rules on them, not because they are necessarily complex by nature</cite>. <cite index="15-16">Changeability means that unlike hardware, software is constantly being updated and changed because users always want new features or need the program to work on new machines</cite>. <cite index="15-18">Invisibility means that, unlike physical machines, software is difficult to visualize; because diagrams display only small portions of a program, it can be challenging to view the entire picture at once</cite>. <cite index="21-23,21-24">The complexity of software is an essential property, not an accidental one, hence descriptions of a software entity that abstract away its complexity often abstract away its essence</cite>. These four properties define what cannot be engineered away.
Sources:
- https://www.scribd.com/document/325265682/Reaction-Paper-No-Silver-Bullet
- https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf
- https://blog.acolyer.org/2016/09/06/no-silver-bullet-essence-and-accident-in-software-engineering/
- https://strategicengineering.substack.com/p/the-mythical-silver-bullet
- https://www.studocu.com/en-us/document/purdue-university/software-engineering-i/hw3-307-summary-of-brooks-no-silver-bullet-essay/141439237
#complexity-theory#systems-thinking#infrastructure-fundamentals#software-engineering#brooks#essential-difficulty#conformity#changeability
Essential versus accidental: Brooks's taxonomy of what resists fixing
<cite index="1-2">Brooks published "No Silver Bullet — Essence and Accidents of Software Engineering" in 1986</cite>, and the paper's central claim was precise: <cite index="2-3">no single development in technology or management technique promises even one order of magnitude improvement within a decade in productivity, reliability, or simplicity</cite>. The argument turned on a distinction borrowed from Aristotle. <cite index="2-5">Brooks distinguishes between accidental complexity and essential complexity</cite>. <cite index="1-10">Accidental complexity relates to problems engineers create and can fix, like the details of writing and optimizing assembly code or delays from batch processing</cite>. <cite index="2-8">Essential complexity is caused by the problem to be solved, and nothing can remove it—if users want a program to do 30 different things, the program must do those 30 different things</cite>. The crucial insight: <cite index="2-9">Brooks claims that accidental complexity has decreased substantially, and today's programmers spend most of their time addressing essential complexity</cite>. This meant <cite index="2-10">shrinking all accidental activities to zero will not give the same order-of-magnitude improvement as attempting to decrease essential complexity</cite>. The paper appeared first at the 1986 IFIP conference, then in IEEE Computer in April 1987, and was later included in the anniversary edition of The Mythical Man-Month.
Sources:
- https://en.wikipedia.org/wiki/No_Silver_Bullet
- https://en-academic.com/dic.nsf/enwiki/182271
- https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf
#complexity-theory#systems-thinking#infrastructure-fundamentals#software-engineering#brooks#essential-complexity#accidental-complexity