The main thing to know about single points of failure
- A single point of failure is any device, link, service, or process whose loss can stop a larger system.
- The real risk is the failure domain, meaning how much of the network depends on one component.
- Two devices do not create resilience if they share the same power feed, rack, carrier path, or control plane.
- Good redundancy is about independent paths, not just duplicate hardware.
- In network infrastructure, the most common weak spots are edge routing, WAN circuits, DNS, identity, power, and management access.
What a single point of failure means in network infrastructure
In network terms, a single point of failure is any component that can take down more than itself. It might be physical, like a router or power feed, or logical, like DNS, authentication, or a VPN concentrator. I also look at the failure domain, which is simply the amount of the environment that depends on one component. The smaller that domain, the less one fault can spread.The easy mistake is to count boxes instead of dependencies. Two appliances sitting side by side do not create resilience if they share the same rack, power strip, circuit, and upstream path. Once that dependency map is visible, the next step is to find the places where these weak points usually hide.

Where single points of failure usually hide
When I review a network, I start with the parts that every other service quietly leans on. These are the components that rarely get attention during a feature rollout, but they are often the first thing to fail when something goes wrong.
| Component | Why it becomes risky | What I prefer instead |
|---|---|---|
| Edge router or firewall | All inbound and outbound traffic passes through it. | A pair with separate power and separate uplinks. |
| WAN circuit | One cut, one outage, or one maintenance window can isolate the site. | Two circuits from different providers and different physical routes. |
| Core switch | It can strand every access switch, server, and wireless controller behind it. | Redundant core design or a tested stacked pair. |
| DNS, identity, or DHCP | Users cannot resolve names, authenticate, or receive network settings. | Secondary services, caching, and break-glass access. |
| Power feed or UPS | Network kit is alive only as long as the feed holds. | Separate circuits, battery runtime, and generator planning where justified. |
| Management plane | You may lose the ability to fix the network during the incident itself. | Out-of-band access and independent monitoring. |
The most dangerous items are usually the ones nobody lists as “network equipment” at all, which is why the next section matters more than the box count.
What happens when one dependency fails
When a hidden dependency fails, the effect is usually broader than the broken part itself. The obvious case is a full outage: no internet, no VPN, no VoIP, no cloud access. More often, I see a partial failure that is harder to diagnose because some users still work while others are blocked behind a dead path or a stale DNS record. That ambiguity burns time, and time is what outage management never has enough of.
- Blast radius grows when many services share the same switch, circuit, or authentication service.
- Graceful degradation disappears when there is no fallback path.
- Recovery slows if monitoring, admin access, or backups depend on the same failed layer.
- Security pressure rises because teams start bypassing controls to restore service quickly.
This is why the fix is architectural, not cosmetic. Adding another piece of hardware helps only if it removes the shared dependency that causes the outage in the first place, and that leads straight into design choices.
How to design around it without wasting budget
I usually start with the smallest change that removes the widest shared dependency. The goal is not to build an expensive fortress; it is to make sure one fault does not take the whole service with it.
| Pattern | What it gives you | Main limitation |
|---|---|---|
| N+1 | One extra component beyond the minimum needed to run. | It still fails if the spare shares the same power, path, or controller. |
| 2N | A full duplicate path or system, usually with active-standby behaviour. | Cost and operational complexity go up quickly. |
| Active-active | Both nodes serve traffic and can absorb some failure instantly. | State synchronisation and load distribution must be solid. |
| Active-standby | One node is ready to take over if the other fails. | Failover can be slower, especially for stateful services. |
| Diverse path design | Traffic can survive a cable cut, carrier issue, or building fault. | It only works if the paths are truly different, not just labelled differently. |
| Out-of-band management | You can still reach the network when the main path is broken. | It is often forgotten until the first real incident. |
My blunt rule is this: if the backup shares the same rack, duct, power source, or control layer, it is not a real backup. Once the design is sensible, the next step is to apply that thinking to the UK environment you actually operate in.
What matters most in UK network infrastructure
In the UK, resilience is shaped by carrier diversity, route diversity, power disruption, and business continuity expectations. The NCSC treats resilience as something you design into systems, while Ofcom’s guidance for UK communications providers frames availability, performance, and functionality as part of the same problem. That is a useful reminder that uptime is not only a hardware issue; it is an operational one too.
For most UK businesses, the practical questions are straightforward:
- Are the primary and backup circuits from genuinely different providers?
- Do they enter the building through different routes, or do they meet in the same duct and fail together?
- Does the secondary path survive a local power cut, not just a router failure?
- If you rely on mobile backup, does it have enough bandwidth for the services you actually need?
- Can the business still authenticate users, resolve names, and reach support tools if the main site is unavailable?
For smaller sites, a leased line paired with a separate 4G or 5G backup, or a fixed wireless option where available, is often more useful than spending everything on a single faster primary link. The right answer is not universal, so the last step is always to check the biggest risks in your own environment.
The first changes I would make if I had one week
If I had to improve a real network quickly, I would not start with the most impressive technology. I would start with the dependencies that create the largest outage if they fail once.
- Separate power first, because many “network” incidents begin as power incidents.
- Make the WAN edge redundant with different carriers and different physical routes.
- Protect DNS, identity, and remote access before adding more bandwidth.
- Give the operations team out-of-band access and independent monitoring.
- Test failover after meaningful changes, not just on an annual schedule.
If I had to reduce the whole topic to one rule, it would be this: a network is only as resilient as the shared dependency you forgot to notice. The real goal is not perfect immunity to failure; it is making sure one fault does not take the whole service with it.