Episode 55 — Fault Domains and Update Domains: planning for “planned failure” events

In Episode Fifty Five, titled “Fault Domains and Update Domains: planning for ‘planned failure’ events,” the focus is on treating planned maintenance as a real failure mode that architects must design for, rather than as an exception that operations will “handle somehow.” Many outages come from planned work, including patching, reboots, firmware updates, and platform maintenance, because planned events often take down multiple components at once. The exam tests whether you recognize that availability is challenged not only by unexpected failures but also by predictable maintenance cycles. Fault domains and update domains are two concepts used to structure where instances land and how they are updated so that maintenance does not become a full outage. When you design with these domains in mind, maintenance becomes a controlled disturbance rather than a surprise collapse. The goal is to keep service continuity by ensuring that not all capacity is removed or destabilized at the same time.

Before we continue, a quick note: this audio course is a companion to the Cloud Net X books. The first book is about the exam and provides detailed guidance on how best to pass it. The second book is a Kindle-only eBook containing 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

A fault domain is a shared hardware group that can fail together, meaning it represents a set of resources that share physical dependencies such as power, network switches, or underlying host infrastructure. If two instances are placed in the same fault domain, a failure in that shared hardware group can affect both at once, defeating redundancy. Fault domains are therefore a way to reason about correlated failure at the physical layer, even when resources appear separate at the virtual layer. Cloud providers use placement policies and infrastructure segmentation to reduce correlation, but the concept remains that some resources share more risk than others. For exam purposes, fault domain should translate to “things that break together,” especially due to hardware and infrastructure dependencies. The architecture goal is to place redundant instances in different fault domains so that one hardware group failure does not eliminate all replicas. When you can identify fault domain risk, you can explain why merely having multiple instances is not enough if they share the same underlying failure group.
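
To make “things that break together” concrete, here is a minimal Python sketch that spreads replicas round-robin across fault domains and then checks whether any domain holds more than one replica. The domain names, instance names, and three-domain count are assumptions for illustration only, not any provider's placement API.

    # Minimal sketch: spread replicas across fault domains, then verify that
    # no single hardware group holds more than one replica of the service.
    # Domain and instance names are illustrative, not a provider API.
    from itertools import cycle

    FAULT_DOMAINS = ["fd-0", "fd-1", "fd-2"]   # assumed: three hardware groups

    def place_replicas(replicas):
        """Assign replicas to fault domains round-robin."""
        return {replica: domain for replica, domain in zip(replicas, cycle(FAULT_DOMAINS))}

    def shares_a_domain(placement):
        """True if any fault domain holds more than one of these replicas."""
        domains = list(placement.values())
        return len(domains) != len(set(domains))

    placement = place_replicas(["web-1", "web-2", "web-3"])
    print(placement)                   # each replica lands in a different domain
    print(shares_a_domain(placement))  # False: one hardware group failure cannot take all three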

An update domain is a group updated together during maintenance cycles, meaning it represents how platform and operating system maintenance is applied over time. Update domains exist to prevent all instances from being rebooted or patched simultaneously, which would cause downtime even if the service is otherwise redundant. The platform groups instances into sets and updates those sets in sequence, allowing other sets to remain running while one set is being updated. This concept matters for planned maintenance because a reboot is effectively a temporary failure, and update domains are how that failure is staggered. For exam questions, update domain should translate to “things that go down together during planned updates,” a correlation that is different from, but just as important as, the one fault domains describe. An instance that survives a hardware fault is still unavailable if it is in the update group being rebooted. Designing for planned failure means ensuring enough capacity remains outside the active update domain.
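
To see the staggering in miniature, the Python sketch below groups ten hypothetical instances into five update domains and walks the domains in sequence, printing what fraction of capacity is offline at each step. The five-domain count and instance names are assumptions for illustration.

    # Minimal sketch: instances grouped into update domains; maintenance walks
    # the domains one at a time, so only a slice of capacity is down at once.
    # The five-domain count and instance names are illustrative assumptions.
    UPDATE_DOMAIN_COUNT = 5
    instances = [f"app-{i}" for i in range(10)]

    # Assign instance i to update domain i mod UPDATE_DOMAIN_COUNT.
    update_domains = {ud: [] for ud in range(UPDATE_DOMAIN_COUNT)}
    for i, name in enumerate(instances):
        update_domains[i % UPDATE_DOMAIN_COUNT].append(name)

    for ud, members in update_domains.items():
        offline = len(members) / len(instances)
        print(f"Updating domain {ud}: {members} offline ({offline:.0%}); the other domains keep serving")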

Spreading instances across domains is the core directive because it is what allows the service to survive maintenance cycles without losing all capacity. When instances are distributed across multiple fault domains, a hardware failure is less likely to remove every replica. When instances are distributed across multiple update domains, planned maintenance can occur without taking every replica down at once. This distribution is a placement decision, not an afterthought, and it should be intentional for any tier that contributes to availability. Load balancing is usually the mechanism that makes this useful because it can steer traffic away from instances that are temporarily unavailable due to update events. The exam often presents a scenario where maintenance causes downtime and expects you to recognize that the replicas were not spread appropriately. Spreading is also a capacity planning exercise because you must ensure that the remaining instances can handle load while one domain is being updated. The key is that redundancy must be distributed, not clustered.
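
The capacity side of spreading reduces to simple arithmetic. The Python sketch below, assuming a peak load worth eight instances, five update domains, and instances spread as evenly as possible, finds the smallest instance count that still covers peak while the largest update domain is offline.

    # Minimal sketch: how many instances are needed so that peak load can
    # still be served while the largest update domain is offline?
    # The peak figure and domain count are illustrative assumptions.
    import math

    def min_instances_needed(peak_units, update_domain_count):
        """Smallest n such that n minus its largest update domain still covers peak.

        Assumes one instance serves one 'unit' of peak load and instances are
        spread as evenly as possible across update domains.
        """
        n = peak_units
        while True:
            largest_domain = math.ceil(n / update_domain_count)
            if n - largest_domain >= peak_units:
                return n
            n += 1

    # Example: peak load needs 8 instances' worth of capacity, 5 update domains.
    print(min_instances_needed(8, 5))  # 10: losing a domain of 2 still leaves 8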

Rolling updates reduce outage risk when combined with load balancing because they change one small part of the service at a time while routing traffic away from the part being changed. A rolling update changes a subset of instances, verifies they are healthy, and then proceeds to the next subset, which aligns naturally with update domains. Load balancing supports this by removing instances from rotation during the update, ensuring users are not sent to nodes that are rebooting or initializing. After the update, health checks determine when an instance is ready to serve traffic again, preventing premature routing to unstable nodes. This pattern reduces the chance that a bad update takes down the whole service, because only a slice is affected at any moment. It also reduces the chance of a full outage from planned reboots because the majority of capacity remains available. The exam often tests rolling updates indirectly by describing staggered maintenance and asking what architecture enables continued service. The key is that rolling updates are a coordination pattern between deployment process and traffic management.
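
As a rough outline of that coordination pattern, here is a minimal Python sketch of a rolling update driven by update domains. The functions passed in, drain, apply_update, is_healthy, add_to_rotation, and error_rate_elevated, are placeholders standing in for whatever your load balancer, patching tooling, and monitoring actually expose; they are not real APIs.

    # Minimal sketch of a rolling update across update domains.
    # The callables passed in are placeholders, not a specific platform's API.
    import time

    def rolling_update(update_domains, drain, apply_update, is_healthy,
                       add_to_rotation, error_rate_elevated,
                       readiness_timeout_s=300):
        """Update one domain at a time: drain, update, verify, re-admit, watch."""
        for domain, instances in update_domains.items():
            # 1. Drain: stop new traffic to this slice, let in-flight work finish.
            for inst in instances:
                drain(inst)
            # 2. Update: patch, reboot, or deploy only this slice.
            for inst in instances:
                apply_update(inst)
            # 3. Verify: an instance must prove readiness before rejoining rotation.
            deadline = time.time() + readiness_timeout_s
            for inst in instances:
                while not is_healthy(inst):
                    if time.time() > deadline:
                        raise RuntimeError(f"{inst} never became healthy; halt and roll back")
                    time.sleep(5)
                add_to_rotation(inst)
            # 4. Watch: an error spike is the stop signal, not a reason to push on.
            if error_rate_elevated():
                raise RuntimeError(f"error spike after domain {domain}; pause and roll back")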

Dependency awareness is where many designs fail, because spreading one tier is not enough if critical dependencies do not also span domains. Databases, gateways, and identity services can become hidden single points of failure during maintenance if they are not redundant or if they are placed in a way that causes correlated downtime. If the application tier is distributed across domains but the database is a single instance in one fault domain, planned maintenance on that database or its host group can still bring down the service. Similarly, if an application gateway or load balancer depends on a single underlying component or is updated as one unit, traffic may be blocked even if backends are healthy. The exam tests this by presenting architectures where the obvious tier is redundant but a dependency is not, causing downtime during maintenance. True resilience to planned events requires that the whole request path is domain aware, from client entry through application logic to data stores and network gateways. This also includes supporting services like logging, secrets, and name resolution when they are critical to service operation. Availability is end to end, and planned failure events expose every weak dependency.

A common scenario is patching that causes downtime because instances were not separated by fault domain or update domain, even though the team believed they had redundancy. If all replicas of a service are placed on the same host group or fall into the same update domain, a maintenance cycle can reboot them together. The service appears highly available until the first planned patch window, at which point the redundancy collapses and users see an outage. This kind of failure often surprises teams because the change was planned and “routine,” yet the impact is severe. On the exam, this scenario is usually framed as “we have multiple instances but updates still cause downtime,” and the root cause is correlated placement. The lesson is that redundancy without distribution is an illusion, because it does not reduce correlated failure risk. Maintenance events are predictable stress tests that reveal whether redundancy is real. When you see planned downtime in a redundant design, domain separation is one of the first things to examine.

A contrasting scenario is one where rolling updates keep the service available by draining instances, updating them, verifying health, and then returning them to rotation gradually. The service remains responsive because there are always enough healthy instances outside the update group to handle traffic. Load balancing plays the central role by ensuring users are directed only to healthy instances and by removing instances that are updating or failing readiness checks. This scenario also demonstrates why capacity planning matters, because the service must tolerate reduced capacity while updates occur. If the service is running near peak utilization at baseline, even a controlled rolling update can create overload and timeouts, turning maintenance into an outage. The exam expects you to recognize that rolling updates are a combination of process and architecture, not just a button that makes updates safe. When done correctly, the update becomes a controlled, incremental change that can be halted or rolled back if errors appear. This is resilience through operational discipline, reinforced by placement and traffic management.

A pitfall that shows up repeatedly is deploying all replicas on the same host group, which defeats fault domain separation and makes hardware failures and maintenance events correlated. This can happen when placement is left to default behavior, when capacity is constrained in a region or zone, or when teams unknowingly deploy within a narrow pool of hosts. The result is that a single rack-level incident, switch failure, or host maintenance event can take down all replicas simultaneously. For the exam, this is often tested as a subtle version of “single point of failure,” where the replicas exist but share the same underlying risk. The correct response is not merely to add more replicas, but to distribute them across fault domains or to ensure the platform places them across independent hardware groups. This is why fault domains exist as a concept, because virtual separation does not always imply physical separation. When you can explain correlated failure at the host group level, you can see why placement policy is part of availability design.
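
One way to catch this pitfall early is a simple placement audit. The Python sketch below uses a made-up instance-to-host-group mapping and flags any service whose replicas all landed in the same host group; the names and the service-number naming convention are assumptions for illustration.

    # Minimal sketch: flag services whose replicas all share one host group.
    # The placement mapping and naming convention are made-up examples.
    from collections import defaultdict

    placement = {
        "web-1": "hostgroup-A", "web-2": "hostgroup-A", "web-3": "hostgroup-A",
        "api-1": "hostgroup-A", "api-2": "hostgroup-B",
    }

    def correlated_services(placement):
        """Return services where every replica sits in a single host group."""
        groups = defaultdict(set)
        for instance, host_group in placement.items():
            service = instance.rsplit("-", 1)[0]   # "web-1" -> "web"
            groups[service].add(host_group)
        return [svc for svc, host_groups in groups.items() if len(host_groups) == 1]

    print(correlated_services(placement))  # ['web']: redundant in name only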

Another pitfall is skipping health checks during updates and serving bad instances, which can produce partial outages that look like random errors. If instances are added back into rotation before they are truly ready, users may hit nodes that are still warming up, missing dependencies, or running a partially applied update. This can cause authentication failures, application errors, or performance degradation that is difficult to diagnose because it appears intermittently. Health checks and readiness probes prevent this by requiring the instance to prove it can serve traffic before the balancer sends it requests. During rolling updates, health checks also act as the stop signal, because a spike in failures should pause the rollout rather than pushing forward blindly. The exam often tests this by describing an update that increases error rates and by expecting you to identify missing readiness gating as the design flaw. Serving bad instances is often worse than a clean outage because it damages trust and complicates recovery. Health checks are therefore not optional; they are the guardrail that makes rolling updates safe.
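
A minimal version of that gate looks like the Python sketch below: poll a health endpoint until it answers successfully before re-admitting the instance, and treat an error rate well above baseline as the signal to pause the rollout. The /healthz URL, thresholds, and timeouts are illustrative assumptions, not any platform's probe configuration.

    # Minimal sketch of readiness gating before re-admitting an instance.
    # The health URL, thresholds, and timeouts are illustrative assumptions.
    import time
    import urllib.error
    import urllib.request

    def wait_until_ready(health_url, timeout_s=300, interval_s=5):
        """Poll a health endpoint; return True only after it answers 200."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                with urllib.request.urlopen(health_url, timeout=3) as resp:
                    if resp.status == 200:
                        return True
            except (urllib.error.URLError, OSError):
                pass  # not ready yet: connection refused, 5xx, timeout, ...
            time.sleep(interval_s)
        return False

    def should_pause_rollout(current_error_rate, baseline_error_rate, factor=2.0):
        """Stop signal: errors well above baseline mean halt, not push forward."""
        return current_error_rate > baseline_error_rate * factor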

Quick wins for planned failure resilience include automating draining, monitoring errors closely during rollout, and keeping rollback ready as a standard operating posture. Draining means removing an instance from rotation and allowing in-flight work to complete before taking it down, reducing user impact during updates. Monitoring errors during rollout provides early detection of bad changes, and it should include both application errors and infrastructure signals like health check failures and latency spikes. Rollback readiness means having a tested way to revert to the previous known good version quickly, rather than improvising under pressure. These practices convert maintenance from a high risk event into a controlled routine, which is essential for environments that patch frequently. The exam often rewards answers that include automation and rollback because they reflect the reality that humans are slow and error prone during incidents. Planned events become safer when the system itself enforces orderly traffic handling and when operators have clear triggers to stop and revert. The goal is not to avoid change, but to make change survivable.
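
Draining itself is a small loop, sketched below in Python: remove the instance from rotation, then wait for in-flight work to finish or for a timeout to expire before taking the node down. remove_from_rotation and active_requests are placeholders for whatever your load balancer and instances actually expose.

    # Minimal sketch of connection draining: stop new traffic, then wait for
    # in-flight work to finish (or a timeout) before taking the node down.
    # remove_from_rotation() and active_requests() are placeholder callables.
    import time

    def drain(instance, remove_from_rotation, active_requests,
              timeout_s=120, interval_s=2):
        remove_from_rotation(instance)          # no new requests arrive
        deadline = time.time() + timeout_s
        while active_requests(instance) > 0:    # let in-flight work complete
            if time.time() > deadline:
                break                           # give up gracefully after the timeout
            time.sleep(interval_s)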

A useful memory anchor is “spread, drain, update, verify, rollback,” because it captures the sequence that prevents planned work from becoming unplanned downtime. Spread reminds you to distribute instances across fault domains and update domains so there is always surviving capacity. Drain reminds you to remove instances from traffic before changing them, preserving user experience and reducing abrupt failures. Update is the maintenance action itself, whether patching, rebooting, or deploying a new version. Verify reminds you to use health checks and readiness signals to confirm the instance is safe before returning it to rotation. Rollback reminds you that updates can go wrong, and you must be able to revert quickly if verification fails. This anchor aligns closely with what the exam tests because it covers both placement and operational process. When you can apply the anchor, you can reason through many maintenance related outage scenarios efficiently.

To apply these ideas, imagine you are designing a rollout for a three-tier application, and you must ensure the system remains available while each tier is updated. You would ensure that the web tier instances are spread across domains and behind a load balancer with health checks, so they can be drained and updated in small groups. You would ensure the application tier follows the same pattern, with enough capacity remaining during updates and with readiness checks that prove dependencies are reachable. For the data tier, you would ensure redundancy and a controlled update strategy that avoids taking the primary data store offline without a ready alternative, because the database often becomes the limiting dependency for planned maintenance. You would coordinate updates so that tiers that depend on each other are updated in a sequence that avoids incompatibility windows, and you would monitor errors continuously during the process. The exam expects you to think about rollout as a system-wide event, not as a single-tier action. When you can describe how each tier is protected during an update, you demonstrate domain-aware availability thinking.

To close Episode Fifty Five, titled “Fault Domains and Update Domains: planning for ‘planned failure’ events,” the central lesson is that change is a predictable failure mode and availability designs must account for it. Fault domains represent shared hardware groups that can fail together, and update domains represent groups updated together during maintenance, and both determine whether redundancy is real during planned events. Spreading instances across these domains, combined with load balancing, draining, and health checks, enables rolling updates that keep service available while parts are updated. Dependencies such as databases and gateways must also span domains, because a redundant front end cannot compensate for a single-domain dependency that goes down during maintenance. The pitfalls of clustered replicas and skipped health checks are common reasons planned work becomes user-facing downtime. Automation, error monitoring, and rollback readiness are quick wins that turn maintenance into a controlled process rather than a gamble. Your rehearsal assignment is to narrate one rollout end to end, stating when instances are drained, how readiness is verified, and what triggers rollback, because that narration is how you prove the design can survive planned failure events.
