Building a Multi-DC Zabbix Environment: Rights, Wrongs, and Everything in Between
A few years back I had the opportunity to present at a Zabbix conference about something we'd been grinding through at Sentia (whose NL branch has since been sold and is now part of Accenture): building a large-scale, multi-datacenter Zabbix monitoring environment from scratch, relying on open-source tooling as much as possible. This post is a write-up of that talk, covering the decisions we made, the problems we hit, and what we'd do differently.
The Goal: 99.99% Uptime
The whole project was driven by one deceptively simple goal: maximum availability. That breaks down into a bunch of concrete requirements:
- Uncompromised access (users can always reach monitoring)
- Data duplication (no single point of data loss)
- Multi-component redundancy (no single point of failure in the stack)
- Server duplication across datacenters
- A solid failover and failback methodology
- Fast inter-component communication
The target was 99.99% uptime. And honestly? We believed it was attainable.
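To put that target in perspective, here is a small worked calculation of the downtime budget it implies. It is plain arithmetic rather than anything from our tooling, and the function name is only illustrative.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    for target in (99.9, 99.99):
        per_year = allowed_downtime_minutes(target)
        per_30_days = allowed_downtime_minutes(target, period_days=30)
        print(f"{target}% -> {per_year:.1f} min/year, {per_30_days:.1f} min per 30 days")
```

At 99.99% the budget is roughly 52.6 minutes per year, which is why every failover path has to be automatic: a single manual intervention can use up most of it.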
The high-level idea was to have two datacenters — AM6 and GS — where traffic naturally flows to AM6, but if AM6 goes down, everything seamlessly switches to GS.
What We Found First: The Discovery Phase
Before building anything, we had to understand what we were replacing. And what we found was... a mess.
No standards whatsoever. The environment had Zabbix in some places, Nagios in others, PRTG, AWS native monitoring, Azure Monitor, IBM Radar, WhatsUp Gold, System Center Operations Manager, custom scripts — you name it, it was there. No ticketing framework, no unified notification system, no centralization at all.
The scale was real though:
- 4,500+ hosts being monitored
- 650,000+ items collected
- 3TB+ of data per year
- 250+ users depending on this
Consolidating all of that into one coherent platform was the actual challenge. Choosing Zabbix as the backbone wasn't hard — it was already the most capable open-source option in the mix.
What We Wanted: The Wishlist
Once we knew what we were dealing with, we sat down and wrote the wishlist for the new platform:
- Start fresh, no legacy cruft
- Cover all features people actually use
- Open-source as much as possible
- Multi-datacenter is non-negotiable
- Multi-active databases for failover
- Keep 10 years of data (yes, really)
- Push notifications to multiple systems
- Capture metrics from everything in operations
- Dashboards everywhere (Grafana)
- Elasticsearch for long-term storage
- In-house integration tooling
- Automation via Ansible
The stack we landed on: Zabbix + Elasticsearch + Percona MySQL + Grafana + Ansible.
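As a rough sizing sketch, the 3TB+ per year from the discovery phase and the 10-year retention wish can be combined into a back-of-the-envelope storage figure. The flat growth rate and the `copies` parameter (how many full copies of the data are kept across both DCs) are assumptions for illustration, not numbers from the project.

```python
def retained_terabytes(tb_per_year: float, years: int, copies: int = 1) -> float:
    """Estimate total stored data, assuming a constant ingest rate per year."""
    if tb_per_year < 0 or years < 0 or copies < 1:
        raise ValueError("tb_per_year and years must be >= 0, copies must be >= 1")
    return tb_per_year * years * copies


if __name__ == "__main__":
    single_copy = retained_terabytes(3, 10)
    one_copy_per_dc = retained_terabytes(3, 10, copies=2)  # hypothetical: one full copy per DC
    print(f"single copy: {single_copy} TB, one copy per DC: {one_copy_per_dc} TB")
```

Since the source figure is "3TB+", treat the result as a lower bound.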
Storage: Elasticsearch or MySQL?
Zabbix at the time (4.x) let you split storage between MySQL and Elasticsearch. The idea was to use MySQL for config/short-term data and offload historical time-series to Elasticsearch.
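To make the split more concrete, here is a minimal sketch of the kind of search body you could send to Elasticsearch to read one item's history. It assumes the index names (`uint`, `dbl`, `str`, `log`, `text`) and field names (`itemid`, `clock`, `value`) from the Zabbix documentation's Elasticsearch mappings. The function itself is hypothetical and only builds the request; it does not talk to a cluster.

```python
import json

# Index names Zabbix uses per value type, per the Zabbix Elasticsearch setup docs.
HISTORY_INDICES = ("uint", "dbl", "str", "log", "text")


def build_history_query(itemid: int, time_from: int, time_till: int, size: int = 100) -> dict:
    """Build a search body for one item's values between two Unix timestamps."""
    if time_from > time_till:
        raise ValueError("time_from must not be later than time_till")
    return {
        "size": size,
        "sort": [{"clock": {"order": "desc"}}],  # newest values first
        "query": {
            "bool": {
                "filter": [
                    {"term": {"itemid": itemid}},
                    {"range": {"clock": {
                        "gte": time_from,
                        "lte": time_till,
                        "format": "epoch_second",  # timestamps are given in seconds
                    }}},
                ]
            }
        },
    }


if __name__ == "__main__":
    body = build_history_query(itemid=12345, time_from=1546300800, time_till=1546387200)
    print(json.dumps(body, indent=2))
```

The body would be POSTed to `<elasticsearch>/uint/_search` (or whichever index matches the item's value type), while configuration and recent data stay in MySQL.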
Testing this was straightforward. But production was a different story — there were limitations and broken promises that we had to work around.
The Failover Architecture
We went from a simple two-node failover concept to a more resilient four-node cross-datacenter setup.
Simple version (two nodes):
What we actually built (four nodes, multiple failover paths):
Running Percona MySQL Active/Active and Elasticsearch across datacenters is entirely feasible; it just takes work.
Data Independence: The Key Design Principle
One of the most important architectural decisions was data independence between datacenters. The idea: if AM6 explodes, GS keeps running without skipping a beat. And when AM6 comes back, re-syncing should be easy.
Each DC runs its own full stack. The databases replicate between them. If the link dies, both sides carry on independently.
Networking: The Stretched VLAN
The network setup was key to making all of this work. Each DC has its own local VLAN, but all internal component communication runs over a stretched VLAN that spans both datacenters.
All internal traffic between Zabbix, MySQL, Elasticsearch etc. goes through the stretched VLAN. This is what makes cross-DC clustering and replication work without needing complex routing.
Two Zabbix Servers Per Datacenter
Here's something that took some design thought: we needed two Zabbix server instances per datacenter — one for infrastructure monitoring and one for client/team operations. They share the same Elasticsearch and Percona backends but are logically separated.
MySQL was fine with this — you can point each Zabbix instance to a different database name, port, or host. Elasticsearch, on the other hand, originally only supported one server and one index. That was a problem.
The Elasticsearch Index Prefix Problem (and the Fix)
With two Zabbix servers writing to the same Elasticsearch cluster, the indices would collide. The solution was an index prefix per Zabbix instance:
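The idea can be sketched in a few lines. Note that the naming scheme below (`<prefix>_<type>`) and the second prefix `global` are assumptions for illustration; the actual patch may compose index names differently.

```python
from typing import Dict, Iterable, Set

# Value-type indices Zabbix writes to when Elasticsearch history is enabled.
HISTORY_TYPES = ("uint", "dbl", "str", "log", "text")


def prefixed_indices(prefix: str, separator: str = "_") -> Dict[str, str]:
    """Map each history type to a per-instance index name (hypothetical scheme)."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    return {value_type: f"{prefix}{separator}{value_type}" for value_type in HISTORY_TYPES}


def colliding_indices(prefixes: Iterable[str]) -> Set[str]:
    """Return index names that more than one instance would write to."""
    seen: Set[str] = set()
    collisions: Set[str] = set()
    for prefix in prefixes:
        for index_name in prefixed_indices(prefix).values():
            if index_name in seen:
                collisions.add(index_name)
            seen.add(index_name)
    return collisions


if __name__ == "__main__":
    print(prefixed_indices("infra"))
    print(colliding_indices(["infra", "global"]))  # distinct prefixes share no index
```

Without a prefix, both servers would write to the same five indices; with one prefix per instance the index sets are disjoint.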
This required a patch to Zabbix itself (tracked as ZBXNEXT-4968). The $HISTORY_PREFIX variable gets added to the frontend config:
$HISTORY_PREFIX = 'infra'; // **## SENTIA PATCH ONLY ##**
Elasticsearch in Production: The Problems
Once we had volume going through Elasticsearch, we started seeing errors:
cannot get values from elasticsearch, HTTP status code: 503
cannot get values from elasticsearch, HTTP status code: 429
cannot get values from elasticsearch, HTTP status code: 404
- 503 = Service Unavailable
- 429 = Too Many Requests (rate limiting)
- 404 = Index not found
- 400 = Bad Request
The 400 errors were particularly annoying. They came from Elasticsearch's default http.max_initial_line_length being only 4KB. When Zabbix sends a DELETE request to clean up old scroll contexts, the URL can get enormous — easily blowing past 4KB.
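A quick way to see why this bites is to measure the HTTP request line against the limit. The sketch below assumes the scroll ID is passed in the URL path (`/_search/scroll/<id>`); the ID lengths are made-up values for illustration.

```python
DEFAULT_LIMIT_BYTES = 4 * 1024  # Elasticsearch default for http.max_initial_line_length


def request_line_length(method: str, path: str) -> int:
    """Length in bytes of the first line of an HTTP/1.1 request."""
    return len(f"{method} {path} HTTP/1.1".encode("ascii"))


def scroll_delete_fits(scroll_id: str, limit_bytes: int = DEFAULT_LIMIT_BYTES) -> bool:
    """Check whether a DELETE for this scroll ID stays within the request-line limit."""
    path = f"/_search/scroll/{scroll_id}"
    return request_line_length("DELETE", path) <= limit_bytes


if __name__ == "__main__":
    for id_length in (200, 5000):  # hypothetical scroll ID sizes
        fake_id = "x" * id_length
        print(id_length, scroll_delete_fits(fake_id), scroll_delete_fits(fake_id, 16 * 1024))
```

A 5,000-character ID fails the 4KB check but fits comfortably once the limit is raised to 16KB.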
Fix: bump it in your elasticsearch.yml:
http.max_initial_line_length: 16kb
Pacemaker: Only One Zabbix Server Active at a Time
This is critical and easy to get wrong. Zabbix cannot run in active/active mode — if two instances are writing to the same database simultaneously, you get data corruption and chaos. So even though we have 4 Zabbix server instances across the two DCs, only one per "role" should be active at any given time.
Pacemaker + Corosync handles which instance is active. Standby nodes are kept ready but not running until needed.
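The invariant is simple enough to express as a sanity check. The following is not how Pacemaker enforces it; it is a minimal sketch of an external check you could run against whatever inventory of instances you keep. The `Instance` structure and the host names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Instance(NamedTuple):
    name: str        # hypothetical host name
    role: str        # e.g. "infra" or "client"
    datacenter: str  # e.g. "AM6" or "GS"
    active: bool     # True if zabbix_server is currently running


def roles_violating_single_active(instances: Iterable[Instance]) -> Dict[str, List[str]]:
    """Return roles that do not have exactly one active instance, with the active names."""
    active_by_role: Dict[str, List[str]] = defaultdict(list)
    roles = set()
    for instance in instances:
        roles.add(instance.role)
        if instance.active:
            active_by_role[instance.role].append(instance.name)
    return {
        role: active_by_role[role]
        for role in sorted(roles)
        if len(active_by_role[role]) != 1
    }


if __name__ == "__main__":
    fleet = [
        Instance("zbx-infra-am6", "infra", "AM6", True),
        Instance("zbx-infra-gs", "infra", "GS", False),
        Instance("zbx-client-am6", "client", "AM6", True),
        Instance("zbx-client-gs", "client", "GS", True),  # second active: a violation
    ]
    print(roles_violating_single_active(fleet))
```

An empty result means every role has exactly one active server; a role with zero active instances is reported with an empty list.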
The Frontend: Multiple Zabbix UIs on One Server
We needed to serve multiple Zabbix frontends — one for infra monitoring, one for global/client monitoring — from the same web server, through HAProxy, across both DCs.
Making this work requires copying the Zabbix PHP frontend to separate directories and editing two files in each copy:
# Copy PHP frontend
cp -r /usr/share/zabbix/ /usr/share/zabbix-infra/
# Copy config
cp -r /etc/zabbix /etc/zabbix-infra
Then edit these two files to point to the correct config path:
- include/classes/core/ZBase.php:276 (path to maintenance.php)
- include/classes/core/CConfigFile.php:27 (path to zabbix.conf.php)
Each frontend gets its own zabbix.conf.php pointing to the right database and Elasticsearch endpoint.
The Full Frontend Access Layer
The complete access layer looks like this — PowerDNS at the top doing health-aware DNS, HAProxy in the middle load-balancing across both DCs, and then the full Zabbix stack behind it:
PowerDNS + LUA: Smart DNS Failover
The clever bit at the DNS layer is using PowerDNS with LUA scripting to do health-check-based DNS failover. The ifurlup function checks whether a monitor URI is alive and returns the appropriate IP:
inframon-sentia.net 1 IN LUA A
"ifurlup('https://infra-monitoring/site-alive', {
{'185.133.x.x'}, {'213.264.x.x'}
})"
For external DNS failover:
- DNS lives outside both DCs
- Uses HAProxy's monitor-uri feature as the health source
- ifurlup has orderable targets (AM6 preferred, GS fallback)
For internal DNS failover:
- DNS lives inside both DCs
- Same monitor-uri health check
- Points to internal addresses for replication traffic
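The ordered-target behaviour of ifurlup can be modelled in a few lines. This is a simplified sketch of the selection logic as I understand it from the PowerDNS documentation, not PowerDNS code: it ignores the selector options the real function applies afterwards, and the addresses are documentation placeholders rather than our real ones.

```python
from typing import Callable, List, Sequence


def pick_addresses(groups: Sequence[Sequence[str]], is_up: Callable[[str], bool]) -> List[str]:
    """Return the healthy addresses of the first group that has any.

    Groups are ordered by preference (AM6 first, GS second). If nothing is
    healthy, fall back to every address so the record still resolves.
    """
    for group in groups:
        alive = [address for address in group if is_up(address)]
        if alive:
            return alive
    return [address for group in groups for address in group]


if __name__ == "__main__":
    am6, gs = "192.0.2.10", "198.51.100.10"  # placeholder addresses
    groups = [[am6], [gs]]

    print(pick_addresses(groups, lambda a: True))      # AM6 healthy: prefer AM6
    print(pick_addresses(groups, lambda a: a == gs))   # AM6 down: fail over to GS
    print(pick_addresses(groups, lambda a: False))     # all down: return everything
```

In the real setup the `is_up` check is the HTTP request against HAProxy's monitor-uri, so DNS only fails over when the load balancer itself reports the site as unhealthy.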
MySQL Percona: Choosing the Right Replication Mode
We evaluated three Percona XtraDB cluster topologies:
The principle: the more automatic the better. Active/Active is the most resilient but also the most complex.
The XtraDB Headaches
With Active/Active at scale, we hit some gnarly problems:
InnoDB: BF-BF X lock conflict, mode: 1027 supremum: 0
Slave SQL: Could not execute Delete_rows event on table zabbix_infra.problem;
Can't find record in 'problem', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND
Key lessons:
- Multi-database writes are not an advantage in Galera-based clusters — they create BF-BF (brute-force vs brute-force) lock conflicts
- Async failures stop the whole cluster — one node falling behind can stall everything
- Watch out for: disk speed, data volume, and latency between nodes
Other Things Worth Knowing
A few operational observations that didn't fit neatly elsewhere:
- Percona requires an Arbiter — one per DC, ideally in a separate Pacemaker cluster to avoid split-brain
- Pacemaker can stretch across both DCs or run as two separate clusters coordinated by a Booth Cluster Ticket Manager
- Elasticsearch can run in Cross-Cluster Search (CCS) mode, or if latency is very low, you can stretch a single cluster
- Kibana, Grafana, Zabbix FE and other web services can all co-exist on the same servers as multiple instances
- All internal component traffic goes through the stretched VLAN — not the internet entry points
- Ansible handles all server provisioning and configuration deployment
- HAProxy MySQL load balancing needs custom health check scripts — standard TCP checks don't tell you if the Galera node is read-write capable
- Grafana had compatibility issues with Elasticsearch at the time — version pinning was necessary
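On the HAProxy health check point: the decision such a script has to make looks roughly like the sketch below. It assumes you have already collected the relevant values from `SHOW GLOBAL STATUS` and `SHOW GLOBAL VARIABLES` into a dictionary of strings; the function names are hypothetical, and this is an illustration of the logic rather than the script we ran.

```python
from typing import Mapping


def galera_node_is_writable(status: Mapping[str, str]) -> bool:
    """Decide whether a Galera node should receive write traffic."""
    return (
        status.get("wsrep_cluster_status") == "Primary"  # node is in the primary component
        and status.get("wsrep_ready") == "ON"            # node accepts queries
        and status.get("wsrep_local_state") == "4"       # 4 = Synced
        and status.get("read_only", "OFF") == "OFF"      # not forced read-only
    )


def http_status_for(status: Mapping[str, str]) -> int:
    """Map node health to the HTTP code an HAProxy httpchk would evaluate."""
    return 200 if galera_node_is_writable(status) else 503


if __name__ == "__main__":
    synced = {"wsrep_cluster_status": "Primary", "wsrep_ready": "ON",
              "wsrep_local_state": "4", "read_only": "OFF"}
    donor = dict(synced, wsrep_local_state="2")  # 2 = Donor/Desynced
    print(http_status_for(synced), http_status_for(donor))
```

A plain TCP check would pass for both nodes above, since MySQL is listening in either state; only the status-based check keeps writes away from the donor.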
Migration: How We Got There
Moving 4,500 hosts from a chaotic multi-tool environment into this new stack followed a four-phase approach:
The "no historical data" phase is just a reality you have to accept — historical metrics don't migrate, only the configuration. Users need to be prepared for that.
What We Concluded
After all of this, here's what we'd tell anyone attempting something similar:
- Multi-datacenter Zabbix works — it's feasible, we proved it
- Elasticsearch for history is better suited to lower data volumes (at least up to Zabbix 4.0.x — things may have improved since)
- Percona Active/Active is promising but imperfect — diagnosing where problems originate is hard
- Build flexibility into your failover — there are many ways to configure fallback, and having options means small changes can recover from architectural failures
- Increase your proxy buffers — bigger buffers protect you during failovers, updates, and unexpected load spikes
- Keep each DC as independent as possible — the more self-sufficient each site is, the more resilient the whole system becomes
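For the proxy buffer advice, a small estimate helps when picking a value for the proxy's `ProxyOfflineBuffer` setting (expressed in hours, 1 to 720 in the Zabbix proxy configuration). The outage length, safety factor, and values-per-second figure below are hypothetical inputs, not measurements from our environment.

```python
import math


def offline_buffer_hours(expected_outage_minutes: float, safety_factor: float = 2.0) -> int:
    """Suggest a ProxyOfflineBuffer value (hours) for a given worst-case outage."""
    if expected_outage_minutes < 0 or safety_factor <= 0:
        raise ValueError("outage must be >= 0 and safety_factor must be > 0")
    hours = math.ceil(expected_outage_minutes * safety_factor / 60)
    return max(1, min(720, hours))  # clamp to the range the proxy accepts


def buffered_values(values_per_second: float, outage_minutes: float) -> int:
    """Estimate how many values a proxy accumulates while the server is unreachable."""
    return math.ceil(values_per_second * outage_minutes * 60)


if __name__ == "__main__":
    outage = 90  # hypothetical: a 90-minute failover plus maintenance window
    print("ProxyOfflineBuffer =", offline_buffer_hours(outage))
    print("values to hold:", buffered_values(500, outage))  # hypothetical 500 values/s
```

The second number is what the proxy's own database has to absorb during the outage, so it is worth checking against the disk available on the proxy hosts.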




