Key things to know before you redesign a network
- A failure is often caused by shared dependencies, not by the visible box that stopped working.
- The biggest risks usually hide in power, access circuits, DNS, identity, control planes, and physical routing.
- Redundancy only helps when the backup does not share the same fate as the primary path.
- The fastest gains usually come from diversifying the parts that sit closest to the service boundary.
- In the UK, resilience should be proportionate: the right fix is often one genuinely diverse path, not two copies of the same path.
What a single point of failure looks like in a real network
A network failure rarely starts with a dramatic “everything broke” event. More often, one dependency disappears and the rest of the stack collapses because the service was never truly designed to survive that loss. A core switch, firewall, DNS provider, identity service, power feed, or upstream circuit can each behave like a single point of failure (SPOF) if everything else depends on it.
The important idea is failure domain. That is the part of the system that fails together. If two devices share the same rack, the same PDU, the same duct, or the same upstream carrier core, they may look redundant on paper while still failing in the same incident. I see this most often when teams duplicate hardware but not the underlying route, site, or operating assumption.
That is why the visible device is often not the real problem. The hidden dependency behind it is usually the thing that turns a small fault into a service outage. Once you start looking at the network that way, the next question becomes obvious: where are those hidden dependencies actually sitting?
Where hidden dependencies usually sit
If I am mapping risk in a live environment, I start with the places where a shared dependency can quietly take down multiple services at once. These are the spots where teams often believe they have resilience, but in practice they have only duplicated the outer shell.
| Dependency | Why it becomes risky | Better pattern |
|---|---|---|
| Internet access circuit | Two links from the same carrier can still share the same route, exchange, or duct. | Diverse carriers, diverse paths, and separate building entry points where possible. |
| Core routing or switching | A single chassis, or a pair that shares one control plane, software image, or configuration, can stop all internal traffic. | Redundant core devices with tested failover and clear routing convergence. |
| Power delivery | One UPS, one PDU, or one mains feed turns a power event into a full outage. | A/B feeds, dual PSUs, generator support, and regular failover testing. |
| DNS | If name resolution fails, services can look “up” while users still cannot reach them. | Secondary DNS, caching strategy, and a recovery plan for registry and zone access. |
| Identity and access | A single identity provider or admin plane can block users and operators at the same time. | Break-glass accounts, offline recovery access, and separate admin controls. |
| Firewall or VPN edge | A single policy engine, software bug, or tunnel endpoint can cut off entire user groups. | High-availability pairs, independent management access, and rollback-ready configs. |
| Cloud region or availability zone | One zone can look resilient until a regional or shared-service failure hits. | Multi-AZ or multi-region design with realistic data replication and restore times. |
| Physical duct or site | Separate devices are not separate if they all cross the same corridor or enter through the same room. | Route diversity, site diversity, and a map of shared-fate infrastructure. |
The pattern is always the same: what matters is not whether you have two assets, but whether they fail independently. If they share power, space, timing, provider, or control logic, they are not really independent. That distinction is what separates useful redundancy from expensive theatre, which leads straight into how I would find the weakest link before an outage does.
How I would find the weakest link before an outage does
When I audit a network, I do not start with hardware inventories. I start with the service path: user, access layer, transport, edge security, application, data, and management plane. The goal is to answer one question clearly: what has to keep working for the service to stay alive?
Trace the service path end to end
Map every dependency that sits between the user and the service. That includes DNS, authentication, certificates, firewall policy, load balancers, logging, remote access, and the systems your team uses to administer the network. If a component is required for both normal operation and recovery, it deserves extra scrutiny.
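The "required for both normal operation and recovery" check can be sketched as a simple set intersection. This is a minimal illustration, not a real inventory: the component names and the `needed_for_both` helper are hypothetical, and a real map would come from your own service-path tracing.

```python
# Hypothetical inventories: every component name here is illustrative only.
normal_operation = {
    "dns",
    "identity_provider",
    "certificates",
    "firewall_policy",
    "load_balancer",
}
recovery_path = {
    "dns",
    "identity_provider",
    "remote_access_vpn",
    "out_of_band_console",
}


def needed_for_both(operation: set[str], recovery: set[str]) -> list[str]:
    """Return components required to run the service AND to recover it."""
    return sorted(operation & recovery)


if __name__ == "__main__":
    for name in needed_for_both(normal_operation, recovery_path):
        print(f"extra scrutiny: {name}")
```

Anything this prints is a component whose failure both breaks the service and blocks the people trying to fix it, which is why it belongs at the top of the review list.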
Look for shared fate, not just duplicate boxes
Two switches in two racks are not enough if they share the same power room, the same cable route, and the same upstream provider. Shared fate is what turns a backup into a mirror image of the failure. I usually mark these dependencies visually, because they are easy to miss when people think in vendor components rather than in failure domains.
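One way to make shared fate visible is to record the failure-domain attributes of each "redundant" asset and compare them. The sketch below assumes a hand-maintained record per asset; the `Asset` class, the attribute keys, and the values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    # Failure-domain attributes; keys and values are hypothetical examples.
    domains: dict[str, str] = field(default_factory=dict)


def shared_fate(a: Asset, b: Asset) -> dict[str, str]:
    """Return the failure domains in which both assets have the same value."""
    return {
        key: value
        for key, value in a.domains.items()
        if key in b.domains and b.domains[key] == value
    }


primary = Asset("switch-a", {
    "rack": "R1",
    "power_room": "PR-1",
    "cable_route": "riser-east",
    "upstream": "carrier-x",
})
backup = Asset("switch-b", {
    "rack": "R2",
    "power_room": "PR-1",
    "cable_route": "riser-east",
    "upstream": "carrier-x",
})

if __name__ == "__main__":
    for domain, value in shared_fate(primary, backup).items():
        print(f"{primary.name} and {backup.name} share {domain}: {value}")
```

In this made-up example the two switches sit in different racks but still share a power room, a cable route, and an upstream provider, so three incidents could take out both. The comparison is only as good as the attributes recorded, so it supplements a physical survey rather than replacing one.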
Test failover under conditions that resemble reality
Green dashboards are not proof. I want to see traffic move, sessions re-establish, and operators recover access when a real dependency is removed. A failover that works only in a quiet lab is not a control; it is a theory. The test should include load, timing, and the awkward details, such as stale DNS, stateful sessions, or delayed route convergence.
Count people and process dependencies too
A network can also have a human SPOF. If only one engineer knows how the access layer works, or only one person can restart a broken service, that is a resilience risk. The same is true for “tribal knowledge” held in a private chat, a forgotten runbook, or one vendor contact who goes on holiday just when you need them.
Once those dependencies are visible, the fix becomes much more practical. You stop guessing and start choosing which risks deserve money, time, and extra complexity.
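A rough way to start that prioritisation is to count how many services each dependency would take down. The sketch below assumes a hypothetical service-to-dependency map; the service names, dependency names, and the `rank_by_reach` helper are illustrative, and a raw count is only a crude proxy because it ignores how much each service matters to the business.

```python
from collections import Counter

# Hypothetical map of service -> direct dependencies.
services = {
    "customer_portal": {"access_circuit", "dns", "identity_provider", "firewall"},
    "internal_wiki": {"dns", "identity_provider"},
    "voip": {"access_circuit", "power_feed_a"},
}


def rank_by_reach(service_deps: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank dependencies by how many services rely on them, highest first."""
    counts = Counter(dep for deps in service_deps.values() for dep in deps)
    # Sort by descending count, then by name so ties are ordered predictably.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for dependency, reach in rank_by_reach(services):
        print(f"{dependency}: {reach} service(s) affected")
```

Weighting each service by its business impact would be a natural next step, but even the unweighted list tends to surface the shared dependencies that deserve attention first.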
How to remove the risk without overengineering
The right answer is not to duplicate everything. That usually creates more cost, more operational drift, and more things that can be misconfigured. The better move is to remove the dependencies that create the largest blast radius and then design the recovery path so it is easy to execute under pressure.
| Risk pattern | Practical fix | Best use case |
|---|---|---|
| One access circuit | Add a second circuit from a different carrier and, where possible, a different route. | Sites where downtime has immediate business or customer impact. |
| One power source | Use dual PSUs, separate feeds, and monitored backup power. | Edge gear, core devices, and server rooms that cannot go dark. |
| One DNS or identity provider | Build secondary resolution and emergency access paths. | Any environment where users or admins need access during an incident. |
| One control plane | Separate management from production and protect it with break-glass access. | Networks that depend on remote administration or automation. |
| One recovery story | Document, rehearse, and time the restore process. | Systems with stateful services, compliance pressure, or limited staff coverage. |
If the service is stateless and traffic is steady, active-active designs can be worth the complexity. If the service is stateful, or if the team is small, active-passive is often the smarter compromise. I care less about the label and more about whether failover is predictable, fast enough, and actually usable at 2 a.m.
There is also a point where resilience stops being about hardware and becomes about discipline. Configuration management, rollback plans, monitoring, and spare capacity often prevent the ugly kind of outage more effectively than buying one more box. That said, the next section matters because the right answer in the UK is often shaped by physical reality, not just architecture diagrams.
The trade-offs that matter in UK network design
The UK network environment rewards proportionate design. Not every service needs metro diversity, multi-region replication, and duplicated facilities, but the most critical ones do need a serious look at route diversity, building entry diversity, and provider concentration. Ofcom’s current resilience guidance is useful here because it frames resilience as the ability to resist disruption from physical, technology, human, and architectural causes, including single points of failure without backup routes or systems.
The NCSC takes the same practical line: resilience has to be built into design, implementation, operation, and management, and then exercised through failover testing and recovery planning. That sounds obvious until you look at real networks, where spare equipment exists but the recovery path has never been rehearsed, or where the “backup” link still rides the same duct as the primary one.
In practice, I would watch for three UK-specific constraints. First, diverse routing is not always easy in dense urban buildings, especially when landlords control risers and entry points. Second, multiple suppliers can still terminate through shared physical infrastructure, which defeats the point of buying two services. Third, emergency or customer-facing services deserve a much higher resilience bar than internal tooling, so the same design choice should not be applied everywhere.
That is why spending is best guided by blast radius. A 99.9% service target still allows roughly 8.8 hours of downtime a year; 99.99% cuts that to about 52.6 minutes. Those numbers do not tell you what to buy, but they do force a sharper conversation about how much outage the business can actually absorb.
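The downtime figures above follow from simple arithmetic, shown here as a short sketch. It assumes a 365-day year (8,760 hours) and treats the target as a plain percentage of elapsed time; the function name is my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    three_nines = downtime_hours_per_year(99.9)
    four_nines = downtime_hours_per_year(99.99)
    print(f"99.9%  -> {three_nines:.2f} hours per year")        # 8.76 hours
    print(f"99.99% -> {four_nines * 60:.2f} minutes per year")  # 52.56 minutes
```

The same function makes it easy to put a number on any proposed target before discussing what it would cost to meet.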
The first three fixes I would make in a UK network audit
If I had to improve a network quickly, I would start with the changes that remove the most obvious shared-fate risks. The first is connectivity: one truly diverse backup path, not a second line that follows the same trench. The second is power: separate feeds, tested UPS behaviour, and a generator or battery plan that is realistic for the site. The third is recovery access: DNS, identity, and out-of-band management that still work when the main path is already failing.
After that, I would test the whole thing under pressure. Pull the primary circuit. Simulate a power event. Lose a resolver. Rehearse the admin recovery path. If the service survives those exercises, I trust it more than a rack full of identical gear that only looks resilient from a distance.
The goal is not to make failure impossible. The goal is to make it local, understandable, and boring enough that one broken dependency does not become a business-wide outage.