Network Single Points of Failure - Don't Let Them Crash You

2 May 2026

Figure: a system with multiple load balancers and gateway services, designed to avoid single points of failure.

Network outages rarely begin with something dramatic. More often, one router, one fibre route, one DNS service, or one power feed fails and everything attached to it goes dark. This article breaks down single point of failure examples in network infrastructure, explains why they matter in the UK, and shows how to reduce the risk without building an overengineered mess.

The fastest way to spot a SPOF is to ask what breaks when one dependency disappears

  • A single point of failure is not just one device; it is any dependency that can take many services down with it.
  • In network estates, the usual culprits are edge routers, firewalls, WAN circuits, DNS, identity systems, and shared power.
  • Redundancy only works when the backup lives in a different failure domain, not just in a second box in the same rack.
  • Failover that has never been tested under real traffic is a design assumption, not resilience.
  • In the UK, resilience planning now has to account for route diversity, power continuity, and third-party dependencies, not just hardware count.

What counts as a single point of failure in network infrastructure

When I assess a network, I start by tracing failure domains. A failure domain is the set of systems affected by one fault. A component becomes a real SPOF only when its failure removes a service, a site, or a whole business function, not just when it slows traffic a little. That is why a pair of devices can still behave like one weak point if they share the same power feed, fibre duct, rack, or upstream provider.

The distinction matters because people often confuse capacity with resilience. A network can be fast, well-sized, and still brittle if a single control element, access path, or dependency can collapse the whole chain. I also look for shared fate situations, where separate-looking components actually fail together because they sit inside the same physical or logical boundary.
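A shared-fate check can be sketched in a few lines. This is a minimal illustration, not a tool: the `Device` record, its four attributes, and the example values are all assumptions chosen to mirror the power feed, rack, duct, and upstream provider mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    power_feed: str
    rack: str
    fibre_duct: str
    upstream: str


FAILURE_DOMAIN_FIELDS = ("power_feed", "rack", "fibre_duct", "upstream")


def shared_fate(a: Device, b: Device) -> list[str]:
    """Return the failure-domain attributes on which both devices hold the same value."""
    return [f for f in FAILURE_DOMAIN_FIELDS if getattr(a, f) == getattr(b, f)]


# A "redundant" firewall pair that still sits on one power feed.
fw1 = Device("fw-1", power_feed="A", rack="R12", fibre_duct="north", upstream="carrier-x")
fw2 = Device("fw-2", power_feed="A", rack="R14", fibre_duct="south", upstream="carrier-y")

print(shared_fate(fw1, fw2))  # ['power_feed']
```

An empty result suggests the pair is independent on the attributes you recorded. It says nothing about dependencies you did not record, which is usually where the surprises are.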

That is the core idea behind SPOFs in networks: not just one thing breaking, but one thing breaking in a way that takes too much else with it. Once that is clear, the useful part starts, which is looking at the components that most often cause the problem.

The network components that most often create the problem

These are the examples I see most often in real estates, and they are the ones that usually hurt the most when they fail.

Example | What fails | Why it becomes a SPOF | Better approach
One edge router or firewall | All traffic entering or leaving the site | If the only perimeter device fails, the site loses internet access, VPN access, or both | Use a redundant pair with separate power and tested failover
One WAN circuit or leased line | Branch connectivity to the wider network | A cut, outage, or provider fault isolates the branch completely | Add a second carrier and, where possible, a second physical route
One DNS service | Name resolution for apps, VPNs, SaaS, and mail | The network may still be alive, but users cannot find services by name | Run authoritative DNS across independent providers and regions
One identity or remote access platform | Authentication for VPN, Wi-Fi, admin access, and SSO | If identity stops, remote work and privileged access stop with it | Separate critical identity functions and keep a break-glass path
One core switch or aggregation device | Traffic between access, distribution, and upstream layers | When the core fails, many lower-level devices are still healthy but useless | Use resilient cores with diverse links and separate power domains
One power feed, UPS, or generator path | Multiple devices at once | Every box on the same electrical dependency can fail together | Separate A and B feeds, diversify backup power, and test runtime
One site or one data centre | Everything hosted there | A building, flood, fire, or regional fault can remove the whole service | Distribute critical services across sites with real geographic separation

The pattern is consistent: one fault becomes a wide outage whenever the network has no independent path around it. A second box only helps if the second box does not share the same fate as the first one.

That leads naturally to the next question: how does one local fault turn into a full service outage so quickly?

How one failure spreads through the stack

Network failures are rarely isolated. They tend to cascade because modern infrastructure is layered, and each layer depends on the layers around it. A broken transport link can strand the control plane, a control-plane fault can blackhole user traffic, and a power issue can quietly remove every layer at once.

  • Access failure happens when a branch depends on one circuit, so a fibre cut or provider fault disconnects everyone on that site.
  • Control-plane failure happens when the logic that steers traffic breaks. BGP, for example, is the protocol that tells networks where to send traffic, so a routing mistake can make healthy links unreachable.
  • Dependency failure happens when the network itself is fine but a service like DNS, RADIUS, or SSO is unavailable, which makes the rest of the stack look broken.
  • Common-mode failure happens when two supposedly separate devices fail for the same reason, such as identical firmware, one shared power path, or one duct feeding both links.
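The way one fault spreads can be sketched as propagation over a dependency map. The model below is an assumption made for illustration: each component lists groups of dependencies, and it stays up only while every group still has at least one healthy member. The component names and the `failed_after` helper are hypothetical.

```python
def failed_after(fault: str, requires: dict[str, list[list[str]]]) -> set[str]:
    """Return every component that is down once `fault` fails.

    requires[x] is a list of groups. x stays up only if every group still
    has at least one healthy member; a group of one is a hard dependency.
    """
    down = {fault}
    changed = True
    while changed:  # repeat until no further component falls over
        changed = False
        for node, groups in requires.items():
            if node in down:
                continue
            if any(all(member in down for member in group) for group in groups):
                down.add(node)
                changed = True
    return down


requires = {
    "ups-1": [],
    "duct-north": [],
    "fw-a": [["ups-1"]],
    "fw-b": [["ups-1"]],        # second firewall, same UPS
    "wan-1": [["duct-north"]],
    "wan-2": [["duct-north"]],  # second circuit, same duct
    "dns": [["fw-a", "fw-b"]],
    "vpn": [["fw-a", "fw-b"], ["wan-1", "wan-2"], ["dns"]],
}

print(sorted(failed_after("fw-a", requires)))   # ['fw-a']
print(sorted(failed_after("ups-1", requires)))  # ['dns', 'fw-a', 'fw-b', 'ups-1', 'vpn']
```

Losing one firewall is contained because its partner survives. Losing the shared UPS is a common-mode failure: both firewalls go, and DNS and VPN follow.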

The most expensive outages I see are usually not caused by a lack of hardware. They happen because the network was built with duplicated equipment but not duplicated risk. Two firewalls on the same UPS, two circuits in the same duct, or two servers in the same rack can all fail as one.

This is also where software-defined and virtualised networks add new wrinkles. As more of the stack moves into control software, the weakest link is often no longer the cable itself but the logic, orchestration, or automation behind it. That is useful when it works; it is brutal when it does not.

So the real job is not just spotting a weak component. It is tracing how one fault would move across layers before you discover it in production.

How I would audit a network for single points of failure

If I were reviewing a UK enterprise or communications network today, I would start with the dependencies that matter most to service continuity, then work outward. In practice, that means mapping every critical service to the physical and logical parts it relies on, including carriers, ducts, racks, sites, power systems, and cloud services.

  1. List the services that must stay up. Internet access, voice, VPN, identity, DNS, customer portals, and monitoring usually come first.
  2. Draw the dependency chain for each one. Keep going until you reach physical layers such as fibre, power, and premises.
  3. Check whether the duplicates are truly independent. Two devices from different vendors still share a SPOF if they sit on one power strip or one provider handoff.
  4. Look for third-party concentration risk. One carrier, one DNS platform, one cloud region, or one managed security provider can become your weak point even if your own design looks fine.
  5. Test failover under load. Cold failover in a maintenance window is not the same as failover during a live incident with real traffic.
  6. Document the manual recovery path. If recovery depends on tribal knowledge, it will be slower and riskier than you think.
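The first three steps can be sketched with a simple model. It assumes each critical service is described as a list of alternative paths, each path is the set of components it needs, and the service stays up while at least one path is fully healthy. Under that assumption, the single points of failure are the components that appear in every path. The service names, component names, and the `spofs` helper are hypothetical.

```python
def spofs(paths: list[set[str]]) -> set[str]:
    """Components present in every path; losing any one of them stops the service."""
    return set.intersection(*paths) if paths else set()


# Each service maps to its alternative paths, traced down to physical layers.
services = {
    "internet": [
        {"fw-a", "carrier-x", "duct-north", "ups-1"},
        {"fw-b", "carrier-y", "duct-north", "ups-2"},  # second carrier, same duct
    ],
    "vpn": [
        {"fw-a", "carrier-x", "duct-north", "ups-1", "idp-1"},
        {"fw-b", "carrier-y", "duct-south", "ups-2", "idp-1"},  # one identity platform
    ],
    "dns": [
        {"dns-provider-1"},
        {"dns-provider-2"},
    ],
}

for name, paths in services.items():
    print(name, sorted(spofs(paths)))
# internet ['duct-north']
# vpn ['idp-1']
# dns []
```

The result is only as good as the dependency chains you feed in. A path that stops at "carrier-x" without recording the duct underneath it will hide exactly the kind of shared fate the audit is meant to find.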

For UK communications estates, I would also compare the design against Ofcom’s 2026 resilience guidance, which pushes operators toward geographically separate paths, better power continuity, and more honest treatment of interdependencies. The useful lesson for everyone else is simple: resilience is increasingly judged by how many independent routes and resources you actually have, not by how many components you bought.

Once you can see the dependency map clearly, the next step is deciding which fixes give you real resilience instead of decorative redundancy.

The fixes that actually reduce risk

Not every SPOF needs a complicated solution. In many cases, the best fix is boring: separate the dependency, test the fallback, and make sure the backup does not share the same failure domain.

Fix | What it protects | Trade-off
Dual-homing to two upstreams | Loss of one carrier or handoff | Costs more and needs cleaner routing design
Diverse fibre routes | Cable cuts and duct failures | Not always possible in dense cities or leased sites
Independent power feeds and backup | Electrical faults, UPS failure, generator issues | Space, maintenance, and fuel logistics become real concerns
Active-active or active-passive clustering | Single device or node failure | Complexity rises, especially if state synchronisation is weak
Separate management plane | Loss of access to administer the network | Needs strict access control and clear operational discipline
Multi-site service placement | Building or site loss | Data replication and latency become design constraints
Regular failover testing | False assumptions about resilience | Tests can be disruptive if they are not planned carefully

What I like about this list is that it forces a hard question: are you buying extra hardware, or are you buying genuine independence? The answer is not always obvious until you test routes, power, authentication, and recovery together.

The NCSC’s CAF Principle B5 gets this right by treating resilience as something that must be built into design, implementation, operation, and management. That matters because a beautiful architecture can still fail if the runbook is missing, the failover path has never been exercised, or the dependency graph is wrong.

In other words, resilience is a systems property. You do not get it from one extra appliance.

The first things I would fix when the budget is tight

If the budget is limited, I would prioritise the weak points that can take out the most users in the shortest time. That usually means the external edge, identity, DNS, and power before anything fancier.

  • Add a second, independent path wherever a single WAN circuit serves a branch, office, or critical site.
  • Fix any firewall or router pair that still shares one power path or one upstream handoff.
  • Separate DNS and identity from the rest of the stack if they still live on one server or one site.
  • Confirm that backup power actually covers the outage window you care about.
  • Test one real failover path per quarter so the design is not just theoretical.
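One way to turn that prioritisation into an ordered list is to rank each finding by how many users it can take out and how quickly. The sketch below is illustrative only: the `Spof` record, the `rank` helper, and every number in the example are made-up placeholders, not measurements.

```python
from typing import NamedTuple


class Spof(NamedTuple):
    # Hypothetical audit finding; both figures are estimates you supply.
    name: str
    users_affected: int
    minutes_to_impact: float


def rank(findings: list[Spof]) -> list[Spof]:
    """Most users first; ties broken by how quickly the outage bites."""
    return sorted(findings, key=lambda s: (-s.users_affected, s.minutes_to_impact))


findings = [
    Spof("single WAN circuit at branch", 40, 0),
    Spof("shared UPS under firewall pair", 900, 15),
    Spof("identity platform on one site", 900, 0),
    Spof("DNS on one server", 900, 5),
]

for item in rank(findings):
    print(item.name)
```

With these placeholder figures the identity platform comes first, then DNS, then the shared UPS, then the branch circuit. Real estimates will reorder the list, which is the point of writing them down.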

The shortest route to better resilience is usually not a large redesign. It is identifying the handful of dependencies that can still stop the whole network, then removing the shared fate behind them. If you fix those first, the rest of the hardening work becomes much more meaningful.

Frequently asked questions

What is a single point of failure (SPOF)?

A SPOF is any component or dependency whose failure makes an entire system, service, or site unavailable. It is not just a single device. It is anything that takes too much else down with it when it fails, often because of shared dependencies such as power or a single physical path.

What are the most common SPOFs in network infrastructure?

Typical SPOFs include a single edge router or firewall, one WAN circuit, a sole DNS service, a single identity platform, one core switch, or a shared power feed. If these components are not properly redundant and diversified, they can bring down entire network segments or services.

How do I identify SPOFs in my network?

Start by mapping critical services to all their physical and logical dependencies, including carriers, power, and cloud services. Look for shared-fate scenarios where seemingly separate components fail together, such as two devices on the same UPS or two circuits in the same duct.

How can I reduce SPOF risk?

Implement true redundancy by ensuring backup components sit in different failure domains (for example, separate power and diverse fibre routes). Test failover regularly under load, and document manual recovery paths. Prioritise external edge, identity, DNS, and power SPOFs first, especially when the budget is limited.

Hazel Schuppe

My name is Hazel Schuppe, and for 10 years I have been writing about future technologies, connectivity, and security. My interest in these areas began when I noticed how quickly the fast-moving world of technology affects our everyday lives. Writing about what lies ahead lets me not only share knowledge but also inspire others to think about how we can use new possibilities responsibly and safely. It is especially important to me to understand how technology can bring people together, and also what security challenges come with that. In my articles I try to explain the complexity of these issues so that readers can better find their way in a rapidly changing world of technology.