Operations Runbook¶
Infrastructure management and alert response procedures for the Hetzner CA-SERVER (204.168.175.45).
Server coordinates¶
- Host: 204.168.175.45 (Hetzner CX23, 8 GB RAM, 38 GB SSD)
- SSH: ssh -i ~/.ssh/klm_deploy_tmp root@204.168.175.45
- API URL: https://klm-vzd.204.168.175.45.nip.io
- Grafana: https://grafana.204.168.175.45.nip.io
- ntfy: https://ntfy.204.168.175.45.nip.io
- Wiki (this site): https://wiki.204.168.175.45.nip.io
Container overview¶
15 containers on the host:
KLM product (2): geo-backend, klm-vzd-service
KLM observability (8):
| Container | Port (host) | Purpose |
|---|---|---|
| klm-prometheus | – | TSDB hub, 15d retention |
| klm-grafana | 127.0.0.1:3000 | UI visualization |
| klm-alertmanager | – | Alert routing → ntfy |
| klm-node-exporter | host net 9100 | Host CPU/RAM/disk |
| klm-cadvisor | 127.0.0.1:8085 | Per-container metrics |
| klm-postgres-exporter | – | Postgres internal stats |
| klm-blackbox-exporter | – | HTTP probes |
| klm-ntfy | 127.0.0.1:8082 | Push delivery |
Other project (3): vetocentrs-frontend, vetocentrs-api, vetocentrs-postgres
Management (2): pgadmin, portainer
Key commands¶
# Check all containers
docker ps --format "table {{.Names}}\t{{.Status}}"
# Disk
df -h /
# Postgres
sudo -u postgres psql -d vmd_db -c "SELECT pg_size_pretty(pg_database_size('vmd_db'));"
sudo -u postgres psql -d vmd_db -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"
# Active alerts
docker exec klm-alertmanager wget -qO- http://localhost:9093/alertmanager/api/v2/alerts | python3 -m json.tool
# Reload Prometheus after a config change
docker exec klm-prometheus kill -HUP 1
# Restart containers
docker restart geo-backend klm-vzd-service
Alert responses¶
DiskSpaceLow (warning, >85% 10m)¶
Disk usage is growing. First step: find out what is consuming the space:
df -h /
du -sh /var/lib/* | sort -h
docker image prune -af # usually frees 4-5 GB
du -sh /var/lib/docker/overlay2/
ls -laht /var/backups/klm-postgres/ # if > 5 GB, see the backup cleanup cron
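A minimal sketch for checking the root filesystem against the 85% warning threshold from a script. The function names disk_used_pct and disk_over_threshold are hypothetical helpers, not part of the existing tooling, and the parsing assumes the POSIX output format of df -P.

```shell
# Read `df -P` output on stdin and print the usage percentage of the
# first listed filesystem as a bare integer (for example 87).
disk_used_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is strictly above the given threshold in percent.
disk_over_threshold() {
  local threshold="$1" used
  used="$(disk_used_pct)"
  [ "$used" -gt "$threshold" ]
}

# Usage on the host:
#   df -P / | disk_over_threshold 85 && echo "root filesystem is above 85%"
```

Reading from stdin keeps the parsing separate from the df call, so the same helper can be pointed at any mount point.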
DiskSpaceCritical (critical, >95% 5m)¶
Postgres may start failing writes. Immediate action:
# 1. Free space quickly
docker image prune -af
rm /var/backups/klm-postgres/*.sql.gz # loses the backup history, but resolves the alert
journalctl --vacuum-size=100M # shrinks the journal
# 2. Stop less important workloads
docker stop vetocentrs-frontend vetocentrs-api # if needed
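Removing every dump leaves no restore point. One possible alternative, sketched here under the assumption that dump filenames contain no spaces (as in the klm-vmd_db-YYYYMMDD-HHMMSS.sql.gz pattern), is to list all dumps except the newest ones and review that list before deleting. old_backups is a hypothetical helper name.

```shell
# Print every *.sql.gz in a directory except the newest `keep` files.
# "Newest" is decided by modification time; `keep` defaults to 1.
old_backups() {
  local dir="$1" keep="${2:-1}"
  ls -1t "$dir"/*.sql.gz 2>/dev/null | tail -n +"$((keep + 1))"
}

# Usage: inspect the list first, then delete.
#   old_backups /var/backups/klm-postgres 1
#   old_backups /var/backups/klm-postgres 1 | xargs -r rm --
```

With a single ~1.3 GB dump kept, this frees less space than removing everything, so whether it is enough depends on how many dumps are present.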
MemoryLow (warning, <5% available 10m)¶
The OOM killer may start killing containers.
free -h
docker stats --no-stream | sort -k4 -h
# Top memory consumers are usually: klm-prometheus, klm-grafana, geo-backend, native postgres
sudo -u postgres psql -c "SHOW shared_buffers;" # should match 2GB after tuning
If a container has leaked RAM (memory leak):
docker restart <name>
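To compare the host against the 5% threshold directly, the available share can be computed from /proc/meminfo. This is a sketch that assumes the alert is based on MemAvailable relative to MemTotal; the actual alert expression is not shown here, and mem_available_pct is a hypothetical helper name.

```shell
# Read /proc/meminfo-style text on stdin and print MemAvailable as an
# integer percentage of MemTotal (rounded down).
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage on the host:
#   mem_available_pct < /proc/meminfo
```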
HostLoadHigh (warning, load5 > 4, 15m)¶
CPU is saturated. Run top and see which process is consuming it:
top -b -n1 | head -20
# usually: the encumbrance_diff.py cron run (~04:00 daily)
# or the LAD/VMD import on Mondays ~04:00-07:30
If the import is still hanging past its expected duration, check docker logs --since 10m klm-vzd-service.
PrometheusTargetDown (warning, up == 0, 5m)¶
A scrape target has been unreachable for 5+ minutes. Check:
docker exec klm-prometheus wget -qO- http://localhost:9090/api/v1/targets | python3 -m json.tool | less
docker ps | grep <container name>
docker logs --tail 50 <container>
PrometheusConfigReloadFailed (warning, 5m)¶
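Config syntax error. Verify:
docker exec klm-prometheus promtool check config /etc/prometheus/prometheus.yml
docker logs klm-prometheus --tail 50 2>&1 | grep -i error # Prometheus logs to stderr, so merge it into the pipe
Roll back to the backup config and reload:
# Pick the newest backup (assumed to be the last known-good config); a bare glob with several matches would make cp fail
cp "$(ls -t /opt/klm/observability/prometheus/prometheus.yml.bak-* | head -1)" /opt/klm/observability/prometheus/prometheus.yml
docker exec klm-prometheus kill -HUP 1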
CronJobMissedRun (warning, 10m)¶
The klm_cron_last_run_seconds value exceeds the expected interval. Check:
crontab -l | grep <name> # is the cron entry enabled?
ls -laht /var/log/klm-*.log | head # is the log file growing?
tail -50 /var/log/klm-<name>.log # what is happening?
If the Docker container has stalled, that affects the cron jobs that call docker exec klm-vzd-service ...:
docker restart klm-vzd-service
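The "is the log file growing" check can be made explicit by comparing the log's age with the job's interval. A sketch, assuming GNU stat (as shipped with Ubuntu) and assuming each run writes to its log; file_age_seconds and log_is_stale are hypothetical helper names.

```shell
# Print how many seconds ago a file was last modified.
# Optional second argument: "now" as a Unix timestamp (defaults to the current time).
file_age_seconds() {
  local file="$1" now="${2:-$(date +%s)}"
  echo "$(( now - $(stat -c %Y "$file") ))"
}

# Succeed when the file is older than the allowed interval in seconds.
log_is_stale() {
  local file="$1" interval="$2"
  [ "$(file_age_seconds "$file")" -gt "$interval" ]
}

# Usage for a daily job, with 10 minutes of slack:
#   log_is_stale /var/log/klm-<name>.log $((86400 + 600)) && echo "job looks overdue"
```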
CronJobFailed (warning, exit_code > 0, 1m)¶
The last cron run exited non-zero. Tail the log:
tail -100 /var/log/klm-<name>.log
Common causes: a network timeout against an external API, a full disk before log rotation, DB lock contention.
AppHealthEndpointDown (critical, 2m)¶
The blackbox probe against klm-vzd-service /health is not returning 200.
curl -sk https://klm-vzd.204.168.175.45.nip.io/health
docker logs --tail 100 klm-vzd-service
docker logs --tail 100 geo-backend
sudo -u postgres psql -c "SELECT 1" # is the DB alive?
systemctl status nginx # is nginx OK?
If the container is running but /health does not respond, restart it:
docker restart klm-vzd-service
AppMetricsScrapeDown (warning, 5m)¶
Prometheus cannot scrape klm-vzd-service:5050/metrics. If AppHealthEndpointDown is also firing, the cause is the same. If only this alert fires, the problem is in the /metrics endpoint itself (DB lock?).
docker exec klm-prometheus wget -qO- http://host.docker.internal:5050/metrics | head
PostgresDown (critical, 2m)¶
The native Postgres on the host is no longer responding.
systemctl status postgresql@16-main
journalctl -u postgresql@16-main --since "10m ago"
sudo -u postgres psql -c "SELECT version();"
# If it crashed:
systemctl restart postgresql@16-main
# If the disk is full: free space first, then restart
PostgresTooManyConnections (warning, 5m)¶
Too many connections. Normal is ~10 (gunicorn × 3 workers × 2 + klm-vzd-service × 2).
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
sudo -u postgres psql -c "SELECT pid, usename, application_name, state, query_start, left(query, 80) AS q FROM pg_stat_activity ORDER BY query_start;"
To find leaks: application_name shows which container owns each connection. Restart the culprit.
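To see at a glance which application holds the most connections, the application_name column can be tallied. A small sketch; count_by_app is a hypothetical helper name, and it assumes each container sets a distinct application_name.

```shell
# Count identical lines on stdin and print "count name", highest count first.
# Empty lines (sessions without an application_name) are shown as "(unset)".
count_by_app() {
  sed 's/^$/(unset)/' | sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Usage on the host (-A: unaligned output, -t: rows only):
#   sudo -u postgres psql -At -c "SELECT application_name FROM pg_stat_activity" | count_by_app
```

The same tally can be produced inside Postgres with a GROUP BY on application_name; the pipe version is handy when the output is already captured as text.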
DatabaseRowCountDropped (warning, 30m)¶
The row count of a table has dropped by > X%. Possible causes:
- A manual DELETE was executed (worth checking, see the audit log)
- An import failed after truncating the table, and the new content was never written
- A bug in a migration script
# Check which table:
docker exec klm-prometheus wget -qO- "http://localhost:9090/api/v1/query?query=klm_db_row_count" | python3 -m json.tool
# Compare with a backup:
zcat /var/backups/klm-postgres/klm-vmd_db-<date>.sql.gz | grep "COPY public.<table>" -A3
DatabaseDataStale (warning, > 8d, 30m)¶
klm_data_freshness_seconds > 691200. The resource has not been refreshed for > 8 days.
# Which resource:
docker exec klm-prometheus wget -qO- "http://localhost:9090/api/v1/query?query=klm_data_freshness_seconds" | python3 -m json.tool
# Is the cron working?
crontab -l | grep <resource>
ls -laht /var/log/klm-*-refresh.log
Specific to the VMD MVR resources: VMD archives are published once a month. vzd_refresh.py --include-vmd detects "up-to-date" and does not touch import_meta.imported_at on no-op checks, so the metric goes stale even though nothing is wrong. Quick fix:
UPDATE import_meta SET imported_at = NOW() WHERE resource_name LIKE 'vmd_%';
Proper fix: change the vzd_refresh.py code so that it also touches the timestamp on a no-op.
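The threshold arithmetic is easy to verify: 691200 seconds is exactly 8 days, while a resource published monthly can reach roughly 31 days of age between releases, which is why the VMD resources cross the threshold without anything being broken. A small sketch; seconds_to_days is a hypothetical helper name.

```shell
# Alert threshold: 8 days expressed in seconds.
threshold_seconds=$(( 8 * 24 * 60 * 60 ))   # 691200

# Convert a freshness value in seconds to whole days (rounded down).
seconds_to_days() {
  echo "$(( $1 / 86400 ))"
}

# Example: a value of 2678400 seconds corresponds to 31 days.
#   seconds_to_days 2678400
```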
ContainerOOMKilled (warning, 1m)¶
A container was killed because of OOM. Often it is klm-prometheus, which grows when the TSDB holds too many series.
docker stats --no-stream | sort -k4 -h
dmesg -T | grep -i "killed process" | tail -5
ContainerHighMemory (warning, 10m)¶
A container is above 80% of its memory limit. Check docker stats and assess whether the limit should be raised or whether there is a memory leak.
Deploy procedures¶
geo-backend deploy¶
Push to main → CI automatically:
- Build Docker image (~3 min)
- SSH to Hetzner
- docker pull the new image
- docker stop && docker rm && docker run for geo-backend and klm-vzd-service
- 3 min health check loop
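The health check loop in the last step can be approximated by hand after a manual deploy. This is a sketch of a generic retry loop, not the actual CI script; wait_until_healthy is a hypothetical helper name, and the attempt count and delay are assumptions chosen to add up to about 3 minutes.

```shell
# Retry a command until it succeeds or the attempts run out.
# Arguments: number of attempts, delay in seconds, then the command to run.
wait_until_healthy() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: 18 attempts, 10 s apart, is roughly 3 minutes.
#   wait_until_healthy 18 10 curl -fsk -o /dev/null https://klm-vzd.204.168.175.45.nip.io/health
```

curl -f makes HTTP error statuses count as failures, so the loop only stops early once /health actually answers successfully.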
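Manual re-deploy without code changes:
ssh root@204.168.175.45
docker pull registry.gitlab.com/geo-ca-group/geo-ca-project/geo-backend:latest
# Note: docker restart keeps the image the container was created from; a newly pulled
# image only takes effect once the container is recreated (docker stop && docker rm && docker run, as in CI)
docker restart geo-backend klm-vzd-service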
geo-mobile deploy¶
CI builds the release APK as a manual job (it consumes CI minutes). The default for day-to-day work is a local debug build.
Wiki deploy¶
A push of docs/** or mkdocs.yml to main:
- The CI pages job builds the static HTML
- (Future) rsync to /var/www/klm-wiki/ on the Hetzner host
For now, manually:
cd geo-mobile && mkdocs build -d /tmp/wiki-build
rsync -avz --delete /tmp/wiki-build/ root@204.168.175.45:/var/www/klm-wiki/
Backup and Disaster Recovery¶
Backup schedule¶
- PG full dump: Sundays 03:00 → /var/backups/klm-postgres/klm-vmd_db-YYYYMMDD-HHMMSS.sql.gz (~1.3 GB)
- Cleanup: daily 02:00, keeps 21 days and deletes 0-byte files
- Subscription cleanup: daily 02:30, deletes never-notified user_subscription rows older than 30d
- Event log cleanup: daily 02:15
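The cleanup rule (21-day retention, plus removal of 0-byte files) can be previewed with find before anything is deleted. This is a sketch of the rule as described, not the actual cron script; cleanup_candidates is a hypothetical helper name.

```shell
# List dump files the retention rule would remove:
# last modified more than `days` days ago, or exactly 0 bytes in size.
cleanup_candidates() {
  local dir="$1" days="${2:-21}"
  find "$dir" -maxdepth 1 -type f -name '*.sql.gz' \( -mtime +"$days" -o -size 0 \) -print
}

# Usage: dry run first, then delete.
#   cleanup_candidates /var/backups/klm-postgres 21
#   cleanup_candidates /var/backups/klm-postgres 21 | xargs -r rm --
```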
Restoring from backup¶
# 1. Stop the app
docker stop geo-backend klm-vzd-service
# 2. Drop + recreate the DB
sudo -u postgres dropdb vmd_db
sudo -u postgres createdb -O vmd_user vmd_db
# 3. Restore
zcat /var/backups/klm-postgres/klm-vmd_db-YYYYMMDD-HHMMSS.sql.gz | sudo -u postgres psql -d vmd_db
# 4. Restart
docker start geo-backend klm-vzd-service
# 5. Verify
curl -sk https://klm-vzd.204.168.175.45.nip.io/health
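Because step 2 drops the database before the dump is read, it is worth confirming beforehand that the chosen dump is a readable gzip file. A minimal sketch; backup_is_usable is a hypothetical helper name, and passing the check only proves the archive is intact, not that the SQL inside is complete.

```shell
# Succeed when the file exists, is non-empty, and passes gzip's integrity test.
backup_is_usable() {
  local file="$1"
  [ -s "$file" ] && gzip -t "$file" 2>/dev/null
}

# Usage, before dropping the DB (the path uses the placeholder pattern from the steps above):
#   backup_is_usable /var/backups/klm-postgres/klm-vmd_db-YYYYMMDD-HHMMSS.sql.gz || echo "do not drop the DB"
```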
Disaster scenario¶
If the entire disk is lost (Hetzner volume crash):
- New Hetzner CX23 instance (Ubuntu 24.04)
- Restore the PG backup (if an offsite copy exists: TODO!)
- Re-deploy via GitLab CI (new host IP)
- Renew the certbot certificates
- Update DNS (or change the IP in the nip.io hostnames; other services stay unchanged)
TODO: Offsite backup to S3 or another location. Currently all backups stay on the same disk.
Postgres tuning (2026-05-22)¶
Added to /etc/postgresql/16/main/postgresql.conf:
shared_buffers = 2GB
effective_cache_size = 6GB
work_mem = 16MB
maintenance_work_mem = 256MB
random_page_cost = 1.1
Restart: systemctl restart postgresql@16-main (3s downtime).
Backup config: /etc/postgresql/16/main/postgresql.conf.backup-*.
ntfy subscription for the operator¶
To receive push alerts yourself:
- Install the ntfy app (F-Droid or Play Store)
- Subscribe: https://ntfy.204.168.175.45.nip.io/klm-alerts (warning) and klm-alerts-critical (critical)
- Basic auth: create a separate read-only user:
docker exec -it klm-ntfy ntfy user add --role=user andrievs
docker exec -it klm-ntfy ntfy access andrievs klm-alerts read-only
docker exec -it klm-ntfy ntfy access andrievs klm-alerts-critical read-only
- Connect with andrievs:<password>.
Links¶
- Grafana dashboards: Operations Overview
- Hetzner control panel
- docs/DEPLOYMENT.md (geo-backend)
- docs/DISASTER_RECOVERY.md