The practical aim is to turn network trade-offs into decisions you can measure
- The model translates business goals into a clear objective, such as lower latency, higher throughput, lower cost, or better resilience.
- The most useful inputs are traffic demand, topology, latency, jitter, packet loss, failover risk, and operating cost.
- Linear and mixed-integer models work best for planning, while simulation is better for stress testing real behaviour.
- Most performance gains come from removing bottlenecks, improving path diversity, and avoiding policy conflicts.
- In UK estates, the biggest wins often come from branch-to-cloud consistency and route resilience, not just raw speed.
What the model is really trying to optimise
At its core, the framework turns a messy operational question into a set of variables, constraints, and a clear objective. Instead of asking only “how do I make the network faster?”, I ask: what should improve, by how much, under which limits, and at what cost?
That distinction matters. A network can occasionally become cheaper, faster, and more resilient at once, but more often a gain on one axis costs something on another. If I optimise only for throughput, I can create latency spikes. If I optimise only for redundancy, I may overspend on capacity that never carries traffic. The model exists to make those trade-offs explicit before they show up in production.
In practical terms, the decision variables are things like link capacity, route selection, workload placement, and where to add headroom. The constraints are the physical and operational limits: budget, provider availability, maintenance windows, security policy, and the fact that some sites simply cannot be rearranged without business disruption. Once those pieces are visible, the model becomes a planning tool rather than an abstract exercise.
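As a minimal sketch of that framing, the snippet below treats a handful of candidate upgrades as yes/no decision variables, a budget as the constraint, and total latency saved as the objective. Every name and number here is hypothetical, and the exhaustive search is only there to keep the example dependency-free; a model of any real size would normally go to an LP or mixed-integer solver instead.

```python
from itertools import product

# Hypothetical candidate upgrades: (name, cost in GBP thousands, latency saved in ms).
CANDIDATES = [
    ("upgrade_branch_circuit", 40, 12),
    ("add_cloud_on_ramp", 70, 25),
    ("second_transit_provider", 60, 15),
    ("regional_breakout", 30, 10),
]
BUDGET = 120  # constraint: total spend must not exceed this (illustrative figure)


def best_plan(candidates, budget):
    """Try every yes/no combination and keep the feasible one with the best objective."""
    best_choice, best_saving = (), 0
    for decisions in product([0, 1], repeat=len(candidates)):
        cost = sum(d * c[1] for d, c in zip(decisions, candidates))
        if cost > budget:
            continue  # violates the budget constraint
        saving = sum(d * c[2] for d, c in zip(decisions, candidates))
        if saving > best_saving:
            best_choice, best_saving = decisions, saving
    chosen = [c[0] for d, c in zip(best_choice, candidates) if d]
    return chosen, best_saving


plan, saving = best_plan(CANDIDATES, BUDGET)
print(plan, saving)
```

With these made-up figures the search picks the branch circuit upgrade and the cloud on-ramp. The useful part is less the answer than the fact that the budget and the objective are written down where someone can challenge them.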
The next question is which inputs deserve that level of attention.
The variables that matter most in network infrastructure
Not every metric belongs in the model. I usually start with the variables that change user experience or operating risk in a material way, then ignore the noise around them.
- Traffic demand tells you where the network is actually under pressure. Peak-hour usage, backup traffic, software updates, and batch jobs often behave differently, so averages alone are misleading.
- Latency and jitter matter most for voice, video, remote desktops, and real-time control. As a working line, I treat one-way latency above about 150 ms or jitter above roughly 30 ms as a warning zone for interactive traffic, not a universal law.
- Packet loss is the silent killer of perceived quality. A link can look “up” while still degrading applications through retransmissions and queueing.
- Topology and path diversity shape resilience. Two fast links that fail the same way are not the same as two truly independent paths.
- Cost includes more than circuit price. I also count support overhead, cloud egress, hardware lifecycle, and the operational cost of complex routing rules.
- Availability targets change the whole design. A 99.9% service target allows about 8.8 hours of downtime a year, while 99.99% cuts that to roughly 52 minutes. That one extra nine often changes the architecture.
- Security constraints can be performance-relevant. Inspection, segmentation, and policy enforcement may protect the business, but they also add hops, latency, or bottlenecks if they are bolted on late.
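The availability arithmetic and the interactive-traffic warning zone from the list above are easy to check in a few lines. This is a rough sketch: the function names are illustrative, the year is taken as 365 days, and the 150 ms and 30 ms thresholds are the working lines quoted above rather than fixed limits.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

# Working thresholds for interactive traffic, taken from the text above.
LATENCY_WARN_MS = 150
JITTER_WARN_MS = 30


def allowed_downtime_hours(availability_pct):
    """Yearly downtime permitted by an availability target given in percent."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def in_warning_zone(one_way_latency_ms, jitter_ms):
    """True if either measurement crosses its working threshold."""
    return one_way_latency_ms > LATENCY_WARN_MS or jitter_ms > JITTER_WARN_MS


for target in (99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min)")

print(in_warning_zone(one_way_latency_ms=95, jitter_ms=42))
```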
When these variables are cleanly defined, the model stops being vague and starts looking like an engineering decision engine. That leads naturally to the next issue: which mathematical approach should do the work.
Which modelling approach fits which problem
Not every network question needs the same mathematics. Some problems are planning problems, some are simulation problems, and some only need a transparent heuristic. The useful move is to match the method to the decision, not to the buzzword.
| Approach | Best for | Strengths | Limits |
|---|---|---|---|
| Linear programming | Continuous capacity and flow planning | Fast, transparent, easy to explain to non-specialists | Weak for yes/no decisions and discrete design choices |
| Mixed-integer programming | Link upgrades, site selection, redundancy design | Handles binary choices and hard constraints well | Can grow computationally heavy as the network gets larger |
| Simulation | Queueing, burst traffic, failure behaviour, and failover drills | Shows how the network behaves under stress, not just on paper | Does not directly choose the best design |
| Heuristics and rules | Operational tuning and quick wins | Simple, fast, and often good enough for routine changes | May leave performance on the table and hide global trade-offs |
| AI-assisted optimisation | Telemetry-rich environments with recurring patterns | Can surface non-obvious correlations and adaptive recommendations | Needs clean data, governance, and enough variation to learn from |
I usually start with the simplest model that can answer the decision. A perfect answer that nobody trusts is less useful than a good answer with clear assumptions. If the network is stable and the decision is small, a rule-based or linear approach may be enough. If the question involves site selection, redundant paths, or budget trade-offs, mixed-integer methods become much more attractive. The right method is the one that survives contact with operations.
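To illustrate why simulation earns its own row in the table, here is a deliberately simple discrete-time sketch of a single link with a queue. It is a toy fluid model with assumed figures, not a queueing-theory result: both traffic patterns carry the same total volume, yet only one of them builds a backlog.

```python
def simulate_link(arrivals_mbit, capacity_mbit_per_tick):
    """Fluid queue: each tick the link drains up to its capacity and the rest waits."""
    backlog = 0.0
    peak_backlog = 0.0
    for arriving in arrivals_mbit:
        backlog = max(0.0, backlog + arriving - capacity_mbit_per_tick)
        peak_backlog = max(peak_backlog, backlog)
    return peak_backlog


CAPACITY = 100.0  # Mbit drained per tick (hypothetical)

smooth = [80.0] * 10               # steady 80% load
bursty = [20.0] * 8 + [320.0] * 2  # same total volume, squeezed into two ticks

print("total volume:", sum(smooth), sum(bursty))
print("peak backlog, smooth:", simulate_link(smooth, CAPACITY))
print("peak backlog, bursty:", simulate_link(bursty, CAPACITY))
```

A capacity plan built on the average would treat these two cases as identical, which is the gap a simulation is meant to expose.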
That principle becomes much easier to apply once the model is built against actual data, not just theory.

How I would build one for a live network
When I build an optimisation framework for an active environment, I keep the process boring on purpose. The goal is not mathematical elegance. The goal is to make a better decision with enough confidence to deploy it.
- Define the primary objective. I start by choosing one main outcome, such as reducing branch-to-cloud latency, lowering transit spend, improving failover behaviour, or increasing throughput for a specific service class.
- Map the real topology. I include sites, cloud regions, circuits, firewalls, load balancers, and any dependency that can become a hidden bottleneck. If it can fail or queue traffic, it belongs in the map.
- Collect usable telemetry. Flow logs, interface counters, latency probes, incident history, and application metrics all matter. I care less about raw volume than about consistency and timestamp quality.
- Build a demand picture. I split traffic into business hours, peaks, backups, patching windows, and failure scenarios. A network that looks fine on average may still collapse under one predictable event.
- Encode constraints honestly. Security zoning, provider contracts, change windows, regulatory boundaries, and budget limits should all be explicit. If a constraint is real but ignored, the result is not a model. It is wishful thinking.
- Run at least three scenarios. I test steady state, peak load, and one failure mode. If the proposal only works when everything is perfect, I discard it.
- Pilot before broad rollout. I prefer a low-risk segment, such as one branch cluster or one application path, and then compare predicted versus observed results.
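The three-scenario step can be sketched as a simple capacity check. The site, link capacities, demand figures, and the 80% utilisation ceiling below are all assumptions chosen for illustration; a real check would also account for routing, not just summed capacity.

```python
# Hypothetical two-circuit site; capacities in Mbit/s.
LINKS = {"primary": 1000, "secondary": 500}

# Each scenario: offered demand in Mbit/s and the links assumed to be down.
SCENARIOS = {
    "steady_state": {"demand": 400, "failed": set()},
    "peak_load": {"demand": 900, "failed": set()},
    "primary_failure": {"demand": 600, "failed": {"primary"}},
}
MAX_UTILISATION = 0.8  # assumed planning margin: keep 20% headroom


def evaluate(links, scenarios, max_utilisation):
    """Return {scenario: True/False} for whether surviving capacity covers demand."""
    results = {}
    for name, scenario in scenarios.items():
        surviving = sum(
            cap for link, cap in links.items() if link not in scenario["failed"]
        )
        results[name] = scenario["demand"] <= surviving * max_utilisation
    return results


results = evaluate(LINKS, SCENARIOS, MAX_UTILISATION)
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'FAILS'}")
print("accept design:", all(results.values()))
```

In this made-up case the design passes steady state and peak load but fails once the primary circuit is lost, so it would be discarded or reworked before any pilot.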
The biggest mistake here is optimising to the average day. Networks fail at the edges, so the model has to respect bursts, maintenance, and partial outages. That is also where most real-world limitations show up.
Where the model fails in the real world
Even a strong model can mislead you if the inputs are weak or the assumptions are too clean. I see the same failure patterns repeatedly.
- Poor telemetry produces confident nonsense. Missing counters, inconsistent sampling, and noisy timestamps can make a bad path look healthy.
- Hidden coupling breaks neat plans. A route change might improve one workload while hurting voice, security inspection, or backup traffic elsewhere.
- Overfitting is easy when the network rarely changes. A model that only reflects last month’s behaviour may fail the first time maintenance, weather, or a provider issue changes the pattern.
- Local optimisation can backfire. If the network team, security team, and finance team each optimise their own target, the organisation may end up worse overall.
- AI limitations are real in static environments. If network parameters barely change, there may be too little varied data for a learning system to generalise well.
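One cheap guard against the telemetry problem in the first item is to check sampling gaps before trusting any series. The sketch below assumes timestamps in seconds and a known polling interval; the function name and the 1.5x tolerance are illustrative choices, not a standard.

```python
def find_gaps(timestamps, expected_interval_s, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are further apart than expected."""
    ordered = sorted(timestamps)
    limit = expected_interval_s * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical poll times for an interface counter sampled every 60 s.
samples = [0, 60, 120, 300, 360, 420, 425, 480]

gaps = find_gaps(samples, expected_interval_s=60)
print(gaps)
```

Here the counter went unpolled between 120 s and 300 s, so any average computed across that window would be describing data that was never collected.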
My rule is simple: a model should be wrong in a detectable way, not silently wrong. That means explainable assumptions, easy-to-audit inputs, and enough scenario testing that surprises become less likely. From there, the country context starts to matter in a practical way.
What changes in the UK network infrastructure context
In the UK, the most useful network models are rarely about chasing the highest theoretical throughput on one circuit. They are about keeping service stable across a mixed estate: head offices, regional branches, retail sites, cloud workloads, and security layers that all introduce their own constraints.
That matters because a London-centric design can look efficient on paper while creating unnecessary latency or fragility for teams in other regions. I would rather see a model that understands branch-to-cloud paths, route diversity, and failover behaviour than one that merely assumes every site behaves like a data centre. For many UK organisations, especially those running hybrid estates, the real question is not “can the link go faster?” but “can the service stay predictable when traffic shifts, a provider degrades, or a maintenance window lands at the wrong time?”
Security also sits closer to the centre of the design in the UK than many teams expect. Zero Trust controls, segmentation, and traffic inspection may be necessary, but they should be modelled as part of the network rather than appended after the fact. If they are added late, they often become the hidden source of latency and routing complexity.
Once you frame the problem this way, the fastest gains usually come from a small number of disciplined moves.
The decisions that usually move performance fastest
When I want quick, defensible improvement, I focus on the levers that consistently matter instead of chasing a perfect architecture on day one.
- Remove the worst bottleneck before buying more bandwidth.
- Fix policy conflicts between routing, security inspection, and QoS.
- Reserve failover capacity explicitly instead of assuming it will absorb normal load.
- Standardise telemetry so every site is measured the same way.
- Keep the objective narrow enough that the business can judge success.
That is the practical value of the framework: it helps you decide where performance is genuinely constrained, where extra capacity will help, and where the better answer is cleaner design. If I had to reduce the whole topic to one line, it would be this: the best network is not the one with the most parts, but the one whose trade-offs are understood before the outage rather than after it.