Cross-region network performance usually fails in the boring places: a congested firewall, a route change that nobody announced, a load balancer behaving differently after maintenance, or a link that looks healthy until traffic shifts in one direction. The useful question is not just whether the circuit is up, but whether the path between your northern and southern sites is still behaving the way the business expects. In this article, I focus on how to observe that traffic, which signals matter most, and how to turn raw monitoring into something you can act on quickly.
Key signals for regional traffic visibility
- A practical stack for north-south traffic should combine flow data, metrics, traces, and selective packet capture.
- Monitoring tells you whether the link is healthy; observability tells you why the pattern changed.
- Start with direction-specific baselines, not a single blended average.
- Track latency, loss, utilisation, retransmits, route changes, and interface errors together.
- For UK estates, the most common choke points are WAN links, firewalls, load balancers, and region-to-region cloud paths.
What north-south traffic means in a split regional network
When I talk about north-south traffic in this context, I mean the data that crosses between geographically separated parts of the estate rather than moving around inside one local cluster. In a UK setup, that might be traffic between London and a northern site, or between an on-premises data centre and a cloud region serving the same users. The detail matters because these paths are usually longer, more policy-heavy, and more vulnerable to changes outside the application itself.
That is the main difference from east-west traffic. East-west stays inside a platform, a campus, or a service mesh and is often controlled by the application team. North-south flows cross boundaries: WAN, internet edge, private interconnect, NAT, proxies, firewalls, and sometimes multiple providers. When something goes wrong there, the symptom can look like an app problem even though the root cause sits in routing, capacity, or policy.
I find this distinction useful for observability because it tells me where to look first. If the problem only appears when traffic crosses regions, I care less about raw server CPU and more about path behaviour, queueing, and asymmetry. That naturally leads to the first question: which signals actually separate a noisy dashboard from a useful one?
Which signals tell you more than raw throughput
Bandwidth alone is a blunt instrument. A link can sit below 60% utilisation and still feel terrible if latency is unstable, packets are being dropped in bursts, or one direction is silently taking a longer path. I prefer to start with a small set of signals that tell me whether the path is stable, not just busy.
| Signal | What it tells you | Practical starting point |
|---|---|---|
| Latency by direction | Shows whether one path is slower than the other, which often reveals routing or congestion issues | Investigate when p95 latency stays 20-30% above baseline for 5 minutes or more |
| Packet loss | Reveals congestion, drops, or a failing physical or virtual link | Treat sustained loss above 0.1% on critical links as a real warning |
| Retransmits and resets | Shows hidden transport pain even when a circuit appears up | Watch for a 2x jump versus the normal hour-of-day pattern |
| Utilisation and queue drops | Shows whether you are running out of headroom before the traffic visibly fails | Start paying attention above 70% sustained utilisation on shared links |
| Route or path changes | Explains sudden latency shifts, traffic rebalancing, or failover behaviour | Alert on unplanned changes during business hours |
| Interface errors and optics alarms | Highlights physical or virtual link quality problems that do not show up in app metrics | Any step change after a maintenance window deserves investigation |
| Jitter | Important for voice, remote desktop, streaming, and real-time workflows | Investigate when jitter stays 10-20 ms above the normal baseline |
Those values are starting points, not universal laws. The real test is how each signal behaves at the same time of day, on the same route, and under the same load profile. I get much more value from comparing Tuesday 10:00 to last Tuesday 10:00 than from staring at a single absolute threshold that ignores the shape of demand.
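To make the same-slot comparison concrete, here is a minimal sketch. It assumes you already have historical p95 latency samples as `(timestamp, value)` pairs. The function names, the median as the baseline statistic, and the sample numbers are all illustrative choices of mine, not part of any particular monitoring product.

```python
from datetime import datetime
from statistics import median


def slot_key(ts: datetime) -> tuple[int, int]:
    """Bucket a timestamp by (weekday, hour); Tuesday 10:00 becomes (1, 10)."""
    return (ts.weekday(), ts.hour)


def build_baseline(history):
    """Map each (weekday, hour) slot to the median of its past samples."""
    buckets = {}
    for ts, value in history:
        buckets.setdefault(slot_key(ts), []).append(value)
    return {slot: median(values) for slot, values in buckets.items()}


def deviation_ratio(baseline, ts, value):
    """Return value / baseline for the matching slot, or None without a baseline."""
    base = baseline.get(slot_key(ts))
    if not base:
        return None
    return value / base


# Hypothetical p95 latency samples in milliseconds for one route and direction.
history = [
    (datetime(2024, 1, 2, 10, 0), 20.0),   # Tuesday 10:00
    (datetime(2024, 1, 9, 10, 0), 22.0),   # Tuesday 10:00
    (datetime(2024, 1, 16, 10, 0), 21.0),  # Tuesday 10:00
    (datetime(2024, 1, 6, 2, 0), 40.0),    # Saturday 02:00 backup window
]
baseline = build_baseline(history)

# Compare this Tuesday 10:00 against previous Tuesdays at 10:00 only.
ratio = deviation_ratio(baseline, datetime(2024, 1, 23, 10, 0), 27.3)
```

A ratio of about 1.3 here would fall in the "20-30% above baseline" band from the table, even though 27.3 ms is well below the Saturday backup figure that a single absolute threshold might have been tuned around.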
Once those basics are visible, the next question is how to connect them so you can see whether the issue is network-wide, path-specific, or tied to a particular request. That is where telemetry design starts to matter.
Why traces, flow logs, and metrics work better together
In practice, I still rely on a three-layer model: coarse network telemetry, service telemetry, and occasional proof-level evidence. Flow records tell me who is talking to whom. Metrics tell me whether the path is healthy over time. Traces tell me which user journey or service call experienced the pain. When those three stay correlated, a network anomaly stops being a guess and becomes a chain of evidence.
| Telemetry type | Best use | Where it falls short |
|---|---|---|
| Flow logs or NetFlow-style records | Spotting traffic sources, destinations, ports, directionality, and sudden shifts in volume | Low application context; it shows patterns, not user experience |
| Metrics | Alerting on latency, utilisation, loss, queue depth, and error trends | Great for trends, weak on explaining which request or transaction was affected |
| Traces | Following a request across services, regions, and intermediaries | Only useful if the application is instrumented well and sampling is sane |
| Targeted packet capture | Proving retransmits, TLS problems, DNS issues, MTU mismatches, or odd handshake behaviour | Too expensive to run continuously at scale |
| Synthetic probes | Measuring whether a path still behaves from the user’s point of view | Only covers what you test, not the entire traffic mix |
I like the correlation model because it keeps the story coherent. If a trace shows repeated timeout behaviour, a flow log can tell me whether the traffic volume changed at the same time, and a metric can tell me whether the path was already saturated. That is a lot faster than jumping between disconnected tools and trying to reconstruct the incident from memory.
OpenTelemetry fits this approach well because it is built to correlate traces, metrics, and logs across service boundaries. The important part is not the brand name; it is the discipline of attaching the same service, region, and request context everywhere so network and application evidence can be read together. With that in place, the dashboard becomes much easier to design in a way that mirrors how incidents actually unfold.
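The shared-context idea can be illustrated without any specific library. The sketch below is plain Python and does not use the OpenTelemetry API; the `Record` type, its field names, and the 60-second window are assumptions made for illustration. It shows how a trace event can pull in the flow and metric evidence that carries the same site and direction labels around the same time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    kind: str       # "trace", "metric", or "flow"
    ts: float       # seconds since epoch
    site: str       # hypothetical site label, e.g. "london"
    direction: str  # "northbound" or "southbound"
    detail: str


def correlated(anchor, records, window_s=60.0):
    """Return records of other kinds with the anchor's labels, within window_s seconds."""
    return [
        r for r in records
        if r.kind != anchor.kind
        and (r.site, r.direction) == (anchor.site, anchor.direction)
        and abs(r.ts - anchor.ts) <= window_s
    ]


timeout = Record("trace", 1000.0, "london", "northbound", "checkout call timed out")
evidence = [
    Record("metric", 990.0, "london", "northbound", "utilisation 94%"),
    Record("flow", 1010.0, "london", "northbound", "replication volume tripled"),
    Record("metric", 995.0, "london", "southbound", "utilisation 31%"),
    Record("flow", 2000.0, "london", "northbound", "normal volume"),
]
related = correlated(timeout, evidence)
```

The join only works because every record carries the same label names with the same values, which is the discipline the paragraph above describes.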

How I would lay out a dashboard for cross-region traffic
A good dashboard answers three questions immediately: is the path healthy, which direction is hurting, and what changed just before the problem started. If you have to scroll for the answer, the design is too busy. I usually limit the first screen to eight panels at most, with the most important ones at the top and the topology view underneath.
The structure I prefer is simple.
- Top row for business health: active sessions, error rate, p95 latency, and loss.
- Middle row for direction-specific links: northbound and southbound throughput plotted separately, not blended into one average.
- Bottom row for topology and change markers: firewalls, load balancers, interconnects, and maintenance windows.
- Side panel for top talkers and top destinations so you can see whether one service or one site is dominating the path.
For a UK estate, that often means I want London-to-region links shown separately from region-to-region links. A single line chart can hide asymmetry very effectively, which is exactly why it is dangerous. If one direction is clean and the other is congested, the aggregate can look acceptable right up until users complain. I would rather see an awkward, slightly noisier dashboard than a pretty one that hides the failure mode.
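A small arithmetic sketch shows how the blended view hides the problem. The utilisation figures and the 70% limit below are made-up illustrative values, and the function names are my own.

```python
def blended_utilisation(northbound_pct: float, southbound_pct: float) -> float:
    """Average both directions into one number, as a single blended chart would."""
    return (northbound_pct + southbound_pct) / 2


def directions_over_limit(northbound_pct: float, southbound_pct: float,
                          limit_pct: float = 70.0) -> list[str]:
    """Return the directions that exceed limit_pct when judged separately."""
    readings = {"northbound": northbound_pct, "southbound": southbound_pct}
    return [name for name, pct in readings.items() if pct > limit_pct]


north, south = 92.0, 28.0
print(blended_utilisation(north, south))    # 60.0
print(directions_over_limit(north, south))  # ['northbound']
```

The blended figure of 60% sits comfortably under the limit while the northbound direction alone is at 92%, which is why I plot the two directions separately.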
In an incident, the best dashboard elements are the ones that show change, not just state. A route flip, a spike in retransmits, and a new firewall rule should all be visible on the same timeline. That makes the next step much easier: reading the symptoms without guessing.
What usually goes wrong and how to read the symptoms
Most cross-region incidents fall into a few familiar patterns. The cause may differ, but the signal pattern is usually repeatable enough that I can narrow it down quickly if the telemetry is good. A symptom table like the one below helps because many teams overreact to the symptom they can see and underweight the layer where the fault actually lives.
| Symptom | What it usually suggests | First thing I would check |
|---|---|---|
| One direction slows down while the other looks fine | Asymmetric routing, stateful inspection, or a path-specific policy change | Compare route tables, firewall path, and any recent failover events |
| Latency rises but loss stays flat | Queueing, traffic shaping, or deeper packet inspection | Check utilisation, buffer drops, and any change in service chaining |
| Loss appears in short spikes at regular times | Backups, replication jobs, batch transfers, or another scheduled burst | Correlate with job schedules and see whether the burst is directional |
| Application timeouts with clean network metrics | DNS, TLS, load balancer behaviour, or an upstream service problem | Run a synthetic request and inspect the trace through the first hop |
| Interface errors on one edge device only | Optics, cabling, MTU mismatch, or a hardware issue | Check counters, transceiver health, and any recent change on that link |
| Traffic moves to a different path after a change window | Failover, policy drift, or a capacity trigger in the routing layer | Review the change record and compare path latency before and after |
The important habit here is to avoid treating every symptom as a network problem or every timeout as an application bug. Cross-region paths sit in the middle of both worlds. If you can see route changes, utilisation, and request traces on the same clock, the diagnosis becomes far less speculative. That leads naturally to the operational habits that keep the whole system trustworthy over time.
The habits that keep monitoring useful over time
The biggest mistake I see is not lack of data; it is collecting data without deciding how the team will use it under pressure. A few operational habits make a much bigger difference than adding another dashboard.
- Baseline by time of day and day of week. A Tuesday morning spike is not the same as a Saturday backup window.
- Alert on combinations, not single numbers. Utilisation alone is noisy; utilisation plus loss plus rising retransmits is much more meaningful.
- Keep labels consistent. Every metric should know the site, direction, circuit, service, and owner.
- Use different retention tiers. I usually keep metrics for 90 days or more, flow data for 14-30 days, and packet captures for short investigative windows of 24-72 hours.
- Review changes after every incident. If a route flip or firewall adjustment caused the spike, fold that learning into the next alert rule.
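The "alert on combinations" habit can be sketched as a simple rule. The limits reuse the starting points from the signals table earlier (70% utilisation, 0.1% loss, a 2x retransmit jump). Requiring at least two of the three signals is my own assumption, and the function name and parameters are hypothetical rather than taken from any alerting tool.

```python
def should_alert(utilisation_pct: float, loss_pct: float, retransmit_ratio: float,
                 *, util_limit: float = 70.0, loss_limit: float = 0.1,
                 retransmit_limit: float = 2.0, min_signals: int = 2) -> bool:
    """Fire only when several signals breach together.

    retransmit_ratio is current retransmits divided by the normal value
    for the same hour of day, so 2.0 means a doubling.
    """
    breaches = [
        utilisation_pct > util_limit,
        loss_pct > loss_limit,
        retransmit_ratio >= retransmit_limit,
    ]
    return sum(breaches) >= min_signals


print(should_alert(75.0, 0.0, 1.0))  # False: busy link, nothing else wrong
print(should_alert(75.0, 0.3, 2.5))  # True: utilisation, loss, and retransmits agree
```

Setting `min_signals=3` gives the stricter rule where all three must agree, which trades earlier warning for fewer pages.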
If telemetry contains personal data or customer identifiers, I would also make sure the storage and retention plan lines up with the organisation’s UK governance rules. That is not about turning observability into a compliance project; it is about preventing the monitoring stack itself from becoming a hidden risk. Once those habits are in place, the final step is deciding what to instrument first when you are starting from scratch.
What I would put in place first on a UK network
If I had to start with a fresh environment, I would not try to instrument everything. I would begin with the busiest inter-site paths, the firewall or load balancer in the middle, and the first application hop on either side. That gives me enough visibility to distinguish a transport issue from a service issue without drowning in noise.
My first rollout would be very small and very deliberate.
- Collect interface counters and flow records from every north-south edge.
- Add synthetic probes between the northern and southern hubs every 1-5 minutes.
- Correlate traces for the top three user journeys that cross regions.
- Build one incident view that ties route changes, retransmits, and user-facing errors together.
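For the synthetic probe step, a minimal sketch using only the Python standard library is shown below. It measures TCP connect time as a rough proxy for path health; a real probe would more likely issue an application-level request. The function name, the timeout, and the example hostname in the comment are placeholders.

```python
import socket
import time


def tcp_connect_ms(host: str, port: int, timeout_s: float = 2.0):
    """Return TCP connect time in milliseconds, or None if the connection fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return None


# Example call; "southern-hub.example.internal" is a hypothetical hostname.
# latency = tcp_connect_ms("southern-hub.example.internal", 443)
```

Running this from each hub towards the other every 1-5 minutes, and labelling each result with its direction, gives a per-direction series that can sit on the same timeline as route changes and retransmits.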
That approach is usually enough to expose whether the issue is capacity, path selection, policy, or application behaviour. It also keeps the team focused on explainability instead of dashboard theatre. When the path is visible in both directions and the data is tied together cleanly, north-south monitoring becomes less about guessing and more about making fast, defensible decisions.