Server Health Monitoring

Know about an outage before your customers do.

External pings, internal health checks, and alert routing wired up the way they should have been from day one — so you stop finding out about downtime from a customer’s email.

Sub-techniques covered: Uptime Kuma · External Pings · Status Page · Alert Routing · Resource Monitoring · Log Analysis · Incident Response · SLO Tracking

01 — What’s Included

Eight sub-techniques.
One honest early-warning system.

Monitoring is not a single dashboard. It is a layered set of checks running from inside your servers and outside your network, plus the alert routing and incident response that turn signals into action.

Every server we monitor has CPU, memory, disk, and swap thresholds set against its actual workload — not generic defaults. Every public endpoint is probed from at least one external location. And every alert lands in front of a human, not an inbox nobody opens.

N° 01

Uptime Kuma Self-Hosted

Foundation

A self-hosted Uptime Kuma instance is the backbone of our monitoring stack. It sits on infrastructure we control, probes your sites and services on a 60-second interval, and keeps a year of historical uptime data we can refer back to when anyone questions the numbers. Self-hosting keeps the recurring cost flat as your service count grows, removes the third-party SaaS dependency, and gives us full control over the probe configuration. We install it, harden it, configure your monitors, and document the credentials.

N° 02

External Synthetic Pings

Outside-in

A second monitoring service running entirely outside our infrastructure — UptimeRobot, Better Stack, or a comparable provider — checks your public endpoints from multiple geographic regions every minute. The redundancy matters: if our monitoring host itself goes down, the external service still catches the outage. We configure synthetic checks against the URLs that actually matter to your business — homepage, login, checkout, key API endpoints — not just an HTTP 200 from the front door.
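To illustrate what a check that goes beyond "an HTTP 200 from the front door" might involve, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the `must_contain` marker text, and the 3-second latency budget are hypothetical and do not reflect the configuration of any particular monitoring provider.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    url: str
    ok: bool
    reason: str


def http_fetch(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch a URL and return (status_code, body_text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; report the status code.
        return err.code, ""


def synthetic_check(
    url: str,
    must_contain: str,
    max_seconds: float = 3.0,  # assumed latency budget
    fetch: Callable[[str], tuple[int, str]] = http_fetch,
) -> CheckResult:
    """Pass only if the page answers 200, contains the marker, and is fast enough."""
    started = time.monotonic()
    try:
        status, body = fetch(url)
    except Exception as err:  # DNS failure, timeout, TLS error, and so on
        return CheckResult(url, False, f"request failed: {err}")
    elapsed = time.monotonic() - started
    if status != 200:
        return CheckResult(url, False, f"status {status}")
    if must_contain not in body:
        return CheckResult(url, False, "expected content missing")
    if elapsed > max_seconds:
        return CheckResult(url, False, f"slow response: {elapsed:.1f}s")
    return CheckResult(url, True, "ok")
```

A login page that returns 200 but shows a maintenance banner instead of the sign-in form would fail the content check here, which is the kind of outage a plain status-code ping misses.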

N° 03

Public Status Page

Transparency

For clients who want it, we publish a branded status page reflecting the live state of their services — green, degraded, or down — with a public history of recent incidents. It deflects the inbound “is the site down?” emails during outages, gives your customers a place to confirm they aren’t imagining things, and signals operational maturity to enterprise prospects evaluating your reliability. We host it, brand it, and keep it accurate.

N° 04

Alert Routing

Signal-to-action

An alert that nobody acts on is worse than no alert at all — it teaches the team to ignore the channel. We route alerts to the people who can actually do something about them, through the channels they actually read: email for low-severity warnings, Slack or DingTalk for active issues, SMS or phone for production-down events. Alerts are deduplicated so a single outage doesn’t generate forty pages, and they are enriched with context — which monitor fired, what it was checking, what the recovery threshold is — so the on-call engineer doesn’t start from zero.
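The routing and deduplication described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not our production configuration: the severity names, the channel labels, and the 5-minute suppression window are all hypothetical.

```python
from dataclasses import dataclass, field

# Severity-to-channel mapping following the text; channel names are labels only.
ROUTES = {
    "warning": ["email"],
    "critical": ["chat"],      # Slack or DingTalk
    "page": ["sms", "phone"],  # production-down events
}


@dataclass
class AlertRouter:
    dedup_window_s: float = 300.0  # assumed: suppress repeats for 5 minutes
    _last_sent: dict = field(default_factory=dict)

    def route(self, monitor: str, severity: str, now: float) -> list[str]:
        """Return the channels to notify, or [] if this alert is a duplicate."""
        key = (monitor, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window_s:
            return []  # same monitor, same severity, still inside the window
        self._last_sent[key] = now
        return list(ROUTES[severity])
```

Keying the window on both monitor and severity means a warning that escalates to a page is never suppressed by the earlier, lower-severity notification.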

N° 05

Per-Host Resource Monitoring

Inside-out

CPU, RAM, disk, and swap on every server we manage, sampled every minute and graphed over time. Resource monitoring is what catches the slow leaks — a memory leak in a PHP-FPM worker, a runaway log file silently filling the disk, an OOM-killer about to evict the database. We set thresholds calibrated to each host’s actual workload, alert on sustained pressure rather than transient spikes, and keep enough history that we can show you the week-over-week trend, not just today’s snapshot.
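One simple way to alert on sustained pressure rather than transient spikes is to require several consecutive samples above the limit before firing. The Python sketch below assumes one sample per minute; the class name and the default of five consecutive samples are hypothetical choices for illustration.

```python
from collections import deque


class SustainedThreshold:
    """Fire only when the last `required` samples all exceed `limit`."""

    def __init__(self, limit: float, required: int = 5):
        self.limit = limit
        self.recent = deque(maxlen=required)  # keeps only the newest samples

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the alert should fire."""
        self.recent.append(value)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v > self.limit for v in self.recent)
```

With `required=3` and a 90 per cent limit, a single reading of 95 followed by 40 stays quiet, while three readings in a row above 90 trigger the alert.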

N° 06

Log Analysis & Real-User Monitoring

Visitor reality

A site can be technically “up” while half the visitors hit a 502 — and a synthetic ping from one location won’t catch it. We ship server logs into a searchable index, surface error spikes, and where appropriate add a lightweight real-user monitoring (RUM) script that captures actual visitor experience: page-load times, JavaScript errors, failed API calls. The combination is the only way to know what your users are genuinely seeing rather than what your status page is claiming.
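As a rough illustration of surfacing error spikes from server logs, the Python sketch below counts 5xx responses per minute in access-log lines. It assumes the common or combined log format used by default in Apache and NGINX; the 5 per cent ratio and the 20-request minimum are hypothetical values, and a real deployment would more likely query a log index than parse lines directly.

```python
import re
from collections import defaultdict

# Captures the timestamp (to the minute) and the status code of one access-log line.
LINE_RE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [^\]]+\] "[^"]*" (\d{3}) '
)


def error_spikes(lines, max_5xx_ratio=0.05, min_requests=20):
    """Return {minute: 5xx_ratio} for minutes whose 5xx share exceeds the limit."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        minute, status = match.group(1), int(match.group(2))
        totals[minute] += 1
        if status >= 500:
            errors[minute] += 1
    return {
        minute: errors[minute] / totals[minute]
        for minute in totals
        # Ignore very quiet minutes so one failed request is not a "spike".
        if totals[minute] >= min_requests
        and errors[minute] / totals[minute] > max_5xx_ratio
    }
```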

N° 07

Incident Response

Same-day, business-hours

When an alert fires during business hours, a human is on it within minutes. We diagnose, communicate, and resolve, and afterwards we write a brief postmortem covering what happened, what fixed it, and what we are changing so it doesn’t recur. We are honest about the boundaries: this is reliable same-day incident response, not a 24×7 NOC. Outside business hours, monitoring still runs and still alerts, and clients on a maintenance retainer have defined response commitments. If you genuinely need 24×7 paging, we’ll tell you upfront and help scope an appropriate provider.

N° 08

SLO Tracking & Monthly Review

Accountability

Each month we send a one-page operational review: uptime against your defined SLO, the number and severity of incidents, mean time to detection, mean time to recovery, and the resource trends worth watching for the month ahead. The review is short by design — three minutes to read, no jargon, no chart-heavy filler. It is the document that tells you, in plain English, whether the infrastructure is healthier or weaker than it was last month, and what we are doing about it.
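The headline numbers in the review are simple arithmetic over the month's incident records. The Python sketch below shows one way to compute them. The `Incident` fields are hypothetical, and it assumes non-overlapping incidents, with time to recovery measured from the start of the fault (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the fault actually began
    detected: datetime  # when the first alert fired
    resolved: datetime  # when service was restored


def monthly_review(incidents, period: timedelta, slo_percent: float) -> dict:
    """Summarise uptime, MTTD, and MTTR for one reporting period."""
    zero = timedelta()
    downtime = sum((i.resolved - i.started for i in incidents), zero)
    uptime_percent = 100.0 * (1 - downtime / period)
    count = len(incidents)
    detect_total = sum((i.detected - i.started for i in incidents), zero)
    return {
        "uptime_percent": uptime_percent,
        "meets_slo": uptime_percent >= slo_percent,
        "incident_count": count,
        "mttd": detect_total / count if count else None,
        "mttr": downtime / count if count else None,
    }
```

For a 30-day month, 432 minutes of total downtime works out to 99.0 per cent uptime, which would miss a 99.9 per cent SLO.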

02 — Our Approach

Layered checks.
Honest boundaries.

Most monitoring fails for one of three reasons — generic thresholds nobody tuned, alert fatigue from too many false positives, or a stack that depends on the very thing it is meant to watch. Our approach is built around avoiding all three.

i

Inside and outside, both

Internal checks tell us how the host is feeling — CPU, memory, disk pressure, swap activity, queue depth. External pings tell us how the world is finding it — DNS resolution, TLS handshakes, response times from real geographic regions. Either layer alone gives a partial picture. Run both in parallel and you cover the failure modes the other layer would miss, including the case where your monitoring server itself becomes the problem.

ii

Thresholds calibrated, not guessed

We do not ship generic alert rules. After the first week of baseline data, we tune every threshold to the actual workload of the host — CPU steady-state, peak memory, disk-fill rate, request volume — and we alert on sustained pressure rather than spikes. The result is fewer false positives, fewer ignored pages, and alerts that genuinely mean something. We re-tune quarterly as the workload shifts.
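One plausible way to turn a week of baseline samples into a threshold is to take a high percentile of what the host normally does and add some headroom. The Python sketch below does that. The 95th percentile, the 25 per cent headroom, and the function names are assumptions for illustration, not a description of our exact tuning procedure.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one baseline sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def calibrated_threshold(baseline, pct=95, headroom=1.25, ceiling=100.0):
    """Threshold = baseline percentile plus headroom, capped at `ceiling`."""
    return min(ceiling, percentile(baseline, pct) * headroom)
```

A host that sits at 50 per cent CPU for most of the week would get a threshold of 62.5 per cent under these defaults, while a host that routinely runs at 90 per cent would be capped at 100 and is better served by a sustained-duration rule than a level rule.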

iii

Alerts go to humans, with context

Every alert includes the monitor name, what it was checking, the current and threshold values, the duration of the breach, and a link straight to the relevant graph or log. The on-call engineer sees what they need to act in the first ten seconds — not a paste of opaque IDs. Severity routing is explicit: warning to email, critical to a chat channel, page-level to SMS. No mysterious tickets, no acronym soup.
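As an illustration of the context listed above, a notification payload could be modelled like this in Python. The field names, the message layout, and the example URL are hypothetical; the real format depends on the notification channel in use.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    monitor: str         # monitor name
    check: str           # what it was checking
    current: float       # current value
    threshold: float     # threshold value
    unit: str
    breach_minutes: int  # duration of the breach
    severity: str        # "warning", "critical", or "page"
    link: str            # relevant graph or log

    def render(self) -> str:
        """Format the alert so the key facts are readable at a glance."""
        return (
            f"[{self.severity.upper()}] {self.monitor}\n"
            f"Check: {self.check}\n"
            f"Value: {self.current:g}{self.unit} "
            f"(threshold {self.threshold:g}{self.unit})\n"
            f"Breached for: {self.breach_minutes} min\n"
            f"Details: {self.link}"
        )


alert = Alert("web1 disk", "Disk usage on /", 96.0, 90.0, "%", 15,
              "critical", "https://monitoring.example/graphs/web1-disk")
print(alert.render())
```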

iv

Honest about what we are not

We are not a 24×7 NOC. We do not page someone at 3am on a Sunday for a third-tier site. What we offer is reliable, layered monitoring with same-day response during business hours and clearly documented expectations outside them. For most growing businesses that is the right balance of cost and coverage. When it isn’t — when you genuinely need round-the-clock paging — we’ll tell you upfront and help scope a provider that fits.

03 — Who It’s For

Teams who have
been surprised once.

The clients who reach out about monitoring almost always have the same story — there was an outage, it lasted longer than it should have, somebody in the team only learned about it from a customer, and now nobody wants that to happen again. The fix is not magic. It is the boring, layered, well-routed work below.

A few recurring profiles where monitoring is the unlock.

  • i Founders running production on a single VPS
    One DigitalOcean droplet, one WordPress, one database. No monitoring, no alerts, no idea what’s normal, until something stops being normal and a customer notices first.
  • ii Teams with a host that “has monitoring built in”
    Most managed hosting dashboards report “all systems operational” right up until the moment a critical page returns 502s for ninety minutes. Built-in does not mean useful.
  • iii E-commerce stores where every minute of downtime is revenue
    Checkout errors that go undetected for an hour translate directly into lost orders. You need outside-in pings on the checkout flow, not just the homepage.
  • iv Professional firms where uptime is a credibility signal
    Legal, financial, and medical practices where a prospective client clicking a dead site quietly moves to the next firm. Monitoring is the cheapest reputation insurance available.
  • v Anyone running their own self-hosted SaaS or internal tool
    Plane PMS, n8n, a self-hosted Mautic, a Coolify panel: the productivity-stack apps your team relies on daily. They need the same monitoring discipline as the public site, and they almost never get it.

Monitoring is also the natural starting point if you don’t yet know whether your infrastructure has a problem. The first month of data tells you what your real baseline looks like — peak CPU, average memory, weekly disk-fill rate, response-time distribution — and that baseline is usually the most useful diagnostic any technical team can have. Pair this work with our IT Services parent practice or fold it into a maintenance & care plan for an integrated retainer.

04 — A complimentary report

Curious how Google sees your site?

Send us your URL. We’ll send back a Premium SEO Report, prepared by hand, within 48 hours — domain authority, keyword rankings, backlinks, competitor gap, and the technical quick-wins worth chasing first.

No sales call required.

The worst kind of failure is the kind that doesn’t announce itself. Monitoring is the practice of refusing to be surprised twice.
— The Aureole Practice —

05 — Frequently Asked

Questions we get
about monitoring.

If a question is missing here, the contact link at the foot of the page goes straight to the person who would answer it. No ticket queues, no funnels.

i Is this a 24×7 NOC?
No, and we’ll tell you that on the first call. What we offer is layered, properly tuned monitoring with same-day incident response during business hours (Mon–Fri, 9–6 Pacific). Outside business hours, the monitoring still runs, alerts still fire, and clients on a maintenance retainer have defined response commitments. For most growing businesses that is the right balance of cost and coverage. If you genuinely need round-the-clock paging (a regulated SaaS, a high-revenue e-commerce platform during peak season, a critical internal tool used across timezones), we’ll tell you upfront and help you scope a provider whose model fits that need. We’d rather lose the engagement than misrepresent what we do.
ii My host says they have built-in monitoring. Why would I need anything else?
Built-in monitoring is usually a basic up/down check on the front door of the server. It will tell you if the box is unreachable. It typically will not tell you that PHP-FPM is leaking memory, that the disk is 96 per cent full and growing, that a critical cron job stopped running last Tuesday, or that the checkout endpoint is returning 502s while the homepage stays green. Hosting-dashboard monitoring also tends to alert through channels that nobody actively monitors — an email address registered five years ago by a developer who has since left. We supplement (or replace) it with checks that match how your business actually fails, alert routing that goes to people who will act, and dashboards your team can genuinely read.
iii How quickly can you get monitoring set up?
Initial setup typically takes a week. Day one we audit your existing infrastructure, identify the hosts and endpoints worth monitoring, and set up the Uptime Kuma instance and external ping service. Days two and three we install resource agents on each server, configure baseline thresholds, and wire up alert routing to your team’s preferred channels. Days four and five we verify the alerts behave correctly with test outages and document what we built. After that we run for a week with the baseline thresholds, gather real workload data, then re-tune. By the end of the second week the system is running steady-state with thresholds calibrated to your actual usage rather than generic defaults.
iv Why Uptime Kuma rather than a managed service like Datadog?
For the scale of business we work with, Uptime Kuma covers the use case at a fraction of the cost. It is open source, mature, actively maintained, and self-hostable on a small VPS. The monthly cost stays flat as you add hosts and endpoints, where managed services scale per-host or per-metric and quickly become expensive. We do use external SaaS pings as a redundant outside-in layer — that piece needs to live outside our infrastructure to be useful — but the bulk of the stack runs on infrastructure we control. For organisations that genuinely need APM-grade observability across distributed services, we’ll recommend the right tool and help configure it; the calculus is different at that scale.
v What happens after an incident is resolved?
For any incident that affected production, we write a short postmortem within one or two business days. It covers a brief timeline of what happened, the root cause as we currently understand it, what brought it back, and the changes we are making — to monitoring, to thresholds, to capacity, to runbooks — to prevent or shorten the next occurrence of the same class of problem. We share it with you in plain English. The point is not to assign blame; the point is to make the system stronger every time it teaches us something. After a few months, the postmortem archive itself becomes a useful reference for the next person who has to think about your infrastructure.
vi Can you take over monitoring someone else set up?
Yes, and this is a common starting point. We audit what’s currently in place — checks, thresholds, alert routing, dashboards — identify gaps, false-positive sources, and hosts or endpoints that are missing coverage entirely, and present a remediation plan in priority order. If the existing setup is sound, we’ll keep it and take over maintenance. More often we find the configuration was reasonable two years ago and has drifted since: alerts going to people who left the company, thresholds that no longer match the workload, monitors still pointed at services that have been decommissioned. We clean it up methodically and document the new state.

The Invitation

Ready to stop being the
last to know?

Tell us about the infrastructure you’d like watched — a single VPS, a small fleet, or the public endpoints that matter most. We’ll respond within one business day with a clear monitoring plan and a fair monthly figure for the retainer.

Mon–Fri · 9–6 PT · support@aureoleintelligence.com · Reply within 1 business day