
Operations Runbook

Infrastructure management and alert response procedures for the Hetzner CA-SERVER (204.168.175.45).

Server coordinates

  • Host: 204.168.175.45 (Hetzner CX23, 8 GB RAM, 38 GB SSD)
  • SSH: ssh -i ~/.ssh/klm_deploy_tmp root@204.168.175.45
  • API URL: https://klm-vzd.204.168.175.45.nip.io
  • Grafana: https://grafana.204.168.175.45.nip.io
  • ntfy: https://ntfy.204.168.175.45.nip.io
  • Wiki (this site): https://wiki.204.168.175.45.nip.io

Container overview

15 containers on the host:

KLM product (2): geo-backend, klm-vzd-service

KLM observability (8):

| Container | Port (host) | Purpose |
|---|---|---|
| klm-prometheus | | TSDB hub, 15d retention |
| klm-grafana | 127.0.0.1:3000 | UI visualization |
| klm-alertmanager | | Alert routing → ntfy |
| klm-node-exporter | host network, 9100 | Host CPU/RAM/disk |
| klm-cadvisor | 127.0.0.1:8085 | Per-container metrics |
| klm-postgres-exporter | | Postgres internal stats |
| klm-blackbox-exporter | | HTTP probes |
| klm-ntfy | 127.0.0.1:8082 | Push delivery |

Other project (3): vetocentrs-frontend, vetocentrs-api, vetocentrs-postgres

Management (2): pgadmin, portainer

Key commands

# Check all containers
docker ps --format "table {{.Names}}\t{{.Status}}"

# Disk
df -h /

# Postgres
sudo -u postgres psql -d vmd_db -c "SELECT pg_size_pretty(pg_database_size('vmd_db'));"
sudo -u postgres psql -d vmd_db -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"

# Active alerts
docker exec klm-alertmanager wget -qO- http://localhost:9093/alertmanager/api/v2/alerts | python3 -m json.tool

# Reload Prometheus after a config change
docker exec klm-prometheus kill -HUP 1

# Restart containers
docker restart geo-backend klm-vzd-service
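
As a quick sanity check against the container overview, the running names can be compared with the expected list. This is a minimal sketch: the function name missing_containers and the EXPECTED_CONTAINERS variable are hypothetical, and the list is simply copied from the overview in this runbook.

```shell
# Expected container names, copied from the container overview (15 in total).
EXPECTED_CONTAINERS="geo-backend klm-vzd-service
klm-prometheus klm-grafana klm-alertmanager klm-node-exporter
klm-cadvisor klm-postgres-exporter klm-blackbox-exporter klm-ntfy
vetocentrs-frontend vetocentrs-api vetocentrs-postgres
pgadmin portainer"

# Reads running container names on stdin (one per line) and prints
# every expected name that is not among them.
missing_containers() {
  local running name
  running="$(cat)"
  for name in $EXPECTED_CONTAINERS; do
    printf '%s\n' "$running" | grep -qxF -- "$name" || echo "$name"
  done
}

# Usage on the host:
#   docker ps --format '{{.Names}}' | missing_containers
```

Empty output means all 15 containers are up; any printed name is a container that docker ps did not report.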

Alert responses

DiskSpaceLow (warning, >85% 10m)

Disk usage is growing. First step: find out what is consuming the space:

df -h /
du -sh /var/lib/* | sort -h
docker image prune -af              # usually frees 4-5 GB
du -sh /var/lib/docker/overlay2/
ls -laht /var/backups/klm-postgres/ # if > 5 GB, see the backup cleanup cron

DiskSpaceCritical (critical, >95% 5m)

Postgres may start failing writes. Immediate action:

# 1. Free space quickly
docker image prune -af
rm /var/backups/klm-postgres/*.sql.gz   # loses backup history, but frees the space
journalctl --vacuum-size=100M           # shrinks the journal

# 2. Stop less important workloads
docker stop vetocentrs-frontend vetocentrs-api  # if needed
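
To decide quickly which of the two disk alerts applies, the usage percentage can be mapped to a severity. A minimal sketch, assuming only the thresholds from the alert titles (>85% and >95%); the function name disk_severity is hypothetical and the 10m/5m hold times of the alert rules are not modelled.

```shell
# Maps a disk-usage percentage to the alert severity it would fall under.
# Thresholds mirror the alert titles: >85% warning, >95% critical.
disk_severity() {
  local pct="${1%\%}"   # accept "91" or "91%"
  if [ "$pct" -gt 95 ]; then
    echo critical
  elif [ "$pct" -gt 85 ]; then
    echo warning
  else
    echo ok
  fi
}

# Usage on the host (GNU df):
#   disk_severity "$(df --output=pcent / | tail -1 | tr -d ' ')"
```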

MemoryLow (warning, <5% available 10m)

The OOM killer may start killing containers.

free -h
docker stats --no-stream | sort -k4 -h
# Top memory consumers are usually: klm-prometheus, klm-grafana, geo-backend, native postgres
sudo -u postgres psql -c "SHOW shared_buffers;"   # should match 2GB after tuning

If a container is leaking RAM (memory leak):

docker restart <name>
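
The "<5% available" condition can be checked by hand from /proc/meminfo. This is a sketch under an assumption: that the alert compares MemAvailable with MemTotal, which this runbook does not state. The function name mem_available_pct is hypothetical.

```shell
# Prints available memory as a whole-number percentage of total memory.
# $1 is the meminfo file to read; it defaults to /proc/meminfo.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' \
    "${1:-/proc/meminfo}"
}

# Usage on the host:
#   [ "$(mem_available_pct)" -lt 5 ] && echo "below the MemoryLow threshold"
```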

HostLoadHigh (warning, load5 > 4, 15m)

CPU is saturated. Run top to see which process is consuming it:

top -b -n1 | head -20
# usually: the encumbrance_diff.py cron run (~04:00 daily)
# or the LAD/VMD import on Mondays ~04:00-07:30

If the import is still hung after its expected duration, check docker logs --since 10m klm-vzd-service.
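
The load5 > 4 condition can be confirmed directly from /proc/loadavg, whose second field is the 5-minute load average. A minimal sketch; the function name load5_exceeds is hypothetical and the 15m hold time of the alert is not modelled.

```shell
# Succeeds (exit 0) when the 5-minute load average exceeds the limit.
# $1 = limit, $2 = loadavg file (defaults to /proc/loadavg).
load5_exceeds() {
  awk -v limit="$1" 'NR == 1 { over = ($2 > limit) } END { exit !over }' \
    "${2:-/proc/loadavg}"
}

# Usage on the host:
#   load5_exceeds 4 && top -b -n1 | head -20
```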

PrometheusTargetDown (warning, up == 0, 5m)

A scrape target has been unreachable for 5+ minutes. Check:

docker exec klm-prometheus wget -qO- http://localhost:9090/api/v1/targets | python3 -m json.tool | less
docker ps | grep <container name>
docker logs --tail 50 <container>

PrometheusConfigReloadFailed (warning, 5m)

Config syntax error. Verify:

docker exec klm-prometheus promtool check config /etc/prometheus/prometheus.yml
docker logs klm-prometheus --tail 50 | grep -i error

Roll back to the backup config and reload:

# the glob can match several backups; copy only the newest one
cp "$(ls -t /opt/klm/observability/prometheus/prometheus.yml.bak-* | head -1)" /opt/klm/observability/prometheus/prometheus.yml
docker exec klm-prometheus kill -HUP 1

CronJobMissedRun (warning, 10m)

The klm_cron_last_run_seconds value exceeds the expected interval. Check:

crontab -l | grep <name>                       # is the cron job enabled?
ls -laht /var/log/klm-*.log | head             # is the log file growing?
tail -50 /var/log/klm-<name>.log               # what is happening?

If the Docker container stalls, that affects the cron jobs that call docker exec klm-vzd-service ...:

docker restart klm-vzd-service
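
Whether a cron log "is still growing" can be turned into a yes/no check on the file's modification time. A minimal sketch; the function name log_is_stale is hypothetical, and the right interval depends on the job's schedule, which this runbook does not list per job.

```shell
# Succeeds when a cron log has not been written to for longer than
# the allowed number of minutes. A missing log also counts as stale.
# $1 = log file, $2 = max age in minutes.
log_is_stale() {
  [ -e "$1" ] || return 0
  [ -n "$(find "$1" -mmin +"$2")" ]
}

# Usage on the host (the 90-minute interval is only an example value):
#   log_is_stale /var/log/klm-<name>.log 90 && echo "log has stopped growing"
```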

CronJobFailed (warning, exit_code > 0, 1m)

The last cron run exited non-zero. Tail the log:

tail -100 /var/log/klm-<name>.log

Common causes: network timeout against an external API, disk full before log rotation, DB lock contention.

AppHealthEndpointDown (critical, 2m)

The Blackbox probe against klm-vzd-service /health is not returning 200.

curl -sk https://klm-vzd.204.168.175.45.nip.io/health
docker logs --tail 100 klm-vzd-service
docker logs --tail 100 geo-backend
sudo -u postgres psql -c "SELECT 1"             # is the DB alive?
systemctl status nginx                          # nginx OK?

If the container is running but /health does not respond → restart:

docker restart klm-vzd-service

AppMetricsScrapeDown (warning, 5m)

Prometheus cannot scrape klm-vzd-service:5050/metrics. If AppHealthEndpointDown is also firing, the cause is the same. If only this alert fires, the problem is in the /metrics endpoint itself (DB lock?).

docker exec klm-prometheus wget -qO- http://host.docker.internal:5050/metrics | head

PostgresDown (critical, 2m)

Native Postgres on the host is no longer responding.

systemctl status postgresql@16-main
journalctl -u postgresql@16-main --since "10m ago"
sudo -u postgres psql -c "SELECT version();"

# If it crashed:
systemctl restart postgresql@16-main

# If the disk is full → free space first, then restart

PostgresTooManyConnections (warning, 5m)

Too many connections. Normal is ~10 (gunicorn × 3 workers × 2 + klm-vzd-service × 2).

sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
sudo -u postgres psql -c "SELECT pid, usename, application_name, state, query_start, left(query, 80) AS q FROM pg_stat_activity ORDER BY query_start;"

Find the leak: application_name shows which container is responsible. Restart the offender.
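
Grouping connections by application_name makes the offender easier to spot than reading the full activity list. A minimal sketch; the function name count_by_app is hypothetical, and it only aggregates whatever names psql prints.

```shell
# Reads one application_name per line on stdin and prints
# "<count> <application_name>" sorted by count, highest first.
count_by_app() {
  awk '{ name = ($0 == "" ? "(unset)" : $0); n[name]++ }
       END { for (name in n) print n[name], name }' | sort -rn
}

# Usage on the host (-A unaligned, -t tuples only):
#   sudo -u postgres psql -At -c "SELECT application_name FROM pg_stat_activity" | count_by_app
```

Connections that report no application_name are grouped under "(unset)".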

DatabaseRowCountDropped (warning, 30m)

The row count of a table has dropped by more than X%. Possible causes:

  • A manual DELETE was run (worth checking, see the audit log)
  • An import failed after truncating the table, and the new content was never written
  • A bug in a migration script

# Check which table:
docker exec klm-prometheus wget -qO- "http://localhost:9090/api/v1/query?query=klm_db_row_count" | python3 -m json.tool

# Compare with the backup:
zcat /var/backups/klm-postgres/klm-vmd_db-<date>.sql.gz | grep "COPY public.<table>" -A3
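
Once two row counts are known (for example an earlier and a current klm_db_row_count sample), the size of the drop is simple arithmetic. A minimal sketch; the function name row_drop_pct is hypothetical, and the alert's actual X% threshold is not stated in this runbook.

```shell
# Prints by how many percent the row count fell between two samples,
# rounded to one decimal place. A negative result means the table grew.
# $1 = earlier row count, $2 = current row count.
row_drop_pct() {
  awk -v before="$1" -v now="$2" 'BEGIN {
    if (before <= 0) { print "n/a"; exit }
    printf "%.1f\n", (before - now) * 100 / before
  }'
}

# Usage:
#   row_drop_pct 1000 900
```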

DatabaseDataStale (warning, > 8d, 30m)

klm_data_freshness_seconds > 691200. A resource has not been refreshed for more than 8 days.

# Which resource:
docker exec klm-prometheus wget -qO- "http://localhost:9090/api/v1/query?query=klm_data_freshness_seconds" | python3 -m json.tool

# Is cron running?
crontab -l | grep <resource>
ls -laht /var/log/klm-*-refresh.log

Specific to VMD MVR resources: VMD archives are published once a month. vzd_refresh.py --include-vmd determines that the data is "up-to-date" and does not touch import_meta.imported_at on no-op checks, so the metric goes stale even though nothing is wrong. Quick fix:

UPDATE import_meta SET imported_at = NOW() WHERE resource_name LIKE 'vmd_%';

Proper fix: change the vzd_refresh.py code so that it also touches the timestamp on a no-op.
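
The 691200-second threshold is 8 × 86400, so a raw klm_data_freshness_seconds value can be converted to days when reading the query output. A minimal sketch; the function name freshness_report is hypothetical and any fractional part of the value is dropped.

```shell
# Converts a klm_data_freshness_seconds value to whole days and reports
# whether it is past the 8-day threshold (8 * 86400 = 691200 seconds).
freshness_report() {
  local seconds="${1%%.*}"          # drop any fractional part
  local limit=$((8 * 86400))
  if [ "$seconds" -gt "$limit" ]; then
    echo "$((seconds / 86400)) days: stale"
  else
    echo "$((seconds / 86400)) days: fresh"
  fi
}

# Usage:
#   freshness_report 950400
```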

ContainerOOMKilled (warning, 1m)

A container was killed by the OOM killer. Often it is klm-prometheus, which grows when the TSDB holds too many series.

docker stats --no-stream | sort -k4 -h
dmesg -T | grep -i "killed process" | tail -5

ContainerHighMemory (warning, 10m)

A container is above 80% of its memory limit. Check docker stats and decide whether the limit should be raised or whether there is a memory leak.
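
To list only the containers above the 80% mark, the docker stats output can be filtered. A minimal sketch; the function name over_mem_limit is hypothetical, and it relies on the Name and MemPerc placeholders of docker stats --format.

```shell
# Reads "<name> <mem%>" lines on stdin and prints those above the limit.
# $1 = limit in percent (defaults to 80).
over_mem_limit() {
  awk -v limit="${1:-80}" '{
    pct = $2
    sub(/%$/, "", pct)          # "83.12%" -> "83.12"
    if (pct + 0 > limit) print
  }'
}

# Usage on the host:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | over_mem_limit 80
```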


Deploy procedures

geo-backend deploy

Push to main → CI automatically:

  1. Build Docker image (~3 min)
  2. SSH to Hetzner
  3. docker pull the new image
  4. docker stop && docker rm && docker run for geo-backend and klm-vzd-service
  5. 3 min health check loop
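
The health check loop in step 5 can be approximated by hand with a small retry helper. This is only a sketch of the idea, not the CI job's actual script: the function name retry_until_ok and the 18 × 10 s split are assumptions.

```shell
# Runs a command until it succeeds or the attempts run out.
# $1 = attempts, $2 = delay in seconds between attempts, rest = command.
retry_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: 18 attempts 10 s apart cover roughly 3 minutes.
#   retry_until_ok 18 10 curl -skf https://klm-vzd.204.168.175.45.nip.io/health
```

With curl -f, any HTTP error status counts as a failed attempt, so the loop only ends early once /health answers successfully.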

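Manual re-deploy without code changes:

ssh root@204.168.175.45
docker pull registry.gitlab.com/geo-ca-group/geo-ca-project/geo-backend:latest
# note: docker restart reuses the image each container was created from;
# a newly pulled image is only picked up when the container is recreated
docker restart geo-backend klm-vzd-service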

geo-mobile deploy

CI builds the release APK as a manual job (it consumes CI minutes). The default for day-to-day work is a local debug build.

Wiki deploy

On a push of docs/** or mkdocs.yml to main:

  1. The CI pages job builds the static HTML
  2. (Future) rsync to /var/www/klm-wiki/ on the Hetzner host

For now, manually:

cd geo-mobile && mkdocs build -d /tmp/wiki-build
rsync -avz --delete /tmp/wiki-build/ root@204.168.175.45:/var/www/klm-wiki/

Backup and Disaster Recovery

Backup schedule

  • PG full dump: Sundays 03:00 → /var/backups/klm-postgres/klm-vmd_db-YYYYMMDD-HHMMSS.sql.gz (~1.3 GB)
  • Cleanup: daily 02:00, keeps 21 days, also deletes 0-byte files
  • Subscription cleanup: daily 02:30, deletes never-notified user_subscription rows older than 30d
  • Event log cleanup: daily 02:15
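
The retention rule for the dumps can be previewed without deleting anything. This is a sketch of the rule as described in the list above, not the actual cron script, which this runbook does not show; the function name backup_cleanup_candidates is hypothetical.

```shell
# Lists backup files the retention rule would remove: dumps older than
# the retention window, or dumps that are 0 bytes. It only prints paths.
# $1 = backup directory, $2 = retention in days (defaults to 21).
backup_cleanup_candidates() {
  # find rounds file age down to whole days before comparing
  find "$1" -maxdepth 1 -type f -name '*.sql.gz' \
    \( -mtime +"${2:-21}" -o -size 0 \) -print
}

# Usage on the host:
#   backup_cleanup_candidates /var/backups/klm-postgres
```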

Restoring from backup

# 1. Stop the app
docker stop geo-backend klm-vzd-service

# 2. Drop + recreate DB
sudo -u postgres dropdb vmd_db
sudo -u postgres createdb -O vmd_user vmd_db

# 3. Restore
zcat /var/backups/klm-postgres/klm-vmd_db-YYYYMMDD-HHMMSS.sql.gz | sudo -u postgres psql -d vmd_db

# 4. Restart
docker start geo-backend klm-vzd-service

# 5. Verify
curl -sk https://klm-vzd.204.168.175.45.nip.io/health
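
Because step 2 drops the database before the restore starts, it may be worth confirming first that the chosen dump is readable. A minimal sketch; the function name backup_is_usable is hypothetical, and the check covers only the gzip container, not the SQL inside it.

```shell
# Succeeds only if the dump exists, is non-empty, and passes gzip's
# integrity test. It does not validate the SQL inside the archive.
backup_is_usable() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Usage before step 2 (the file name is the placeholder from the steps above):
#   backup_is_usable /var/backups/klm-postgres/klm-vmd_db-YYYYMMDD-HHMMSS.sql.gz || echo "do not drop the DB"
```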

Disaster scenario

If the entire disk is lost (Hetzner volume crash):

  1. New Hetzner CX23 instance (Ubuntu 24.04)
  2. Restore the PG backup (if an offsite copy exists; TODO!)
  3. Re-deploy via GitLab CI (new host IP)
  4. Renew the certbot certificates
  5. Update DNS (or, with nip.io, the IP in the hostnames changes; other services stay as they are)

TODO: Offsite backup to S3 or another location. Currently all backups stay on the same disk.

Postgres tuning (2026-05-22)

Added to /etc/postgresql/16/main/postgresql.conf:

shared_buffers = 2GB
effective_cache_size = 6GB
work_mem = 16MB
maintenance_work_mem = 256MB
random_page_cost = 1.1

Restart: systemctl restart postgresql@16-main (3s downtime).

Backup config: /etc/postgresql/16/main/postgresql.conf.backup-*.

ntfy subscription for operators

To receive push alerts yourself:

  1. Install the ntfy app (F-Droid or Play Store)
  2. Subscribe: https://ntfy.204.168.175.45.nip.io/klm-alerts (warning) and klm-alerts-critical (critical)
  3. Basic auth: create a separate read-only user:
docker exec -it klm-ntfy ntfy user add --role=user andrievs
docker exec -it klm-ntfy ntfy access andrievs klm-alerts read-only
docker exec -it klm-ntfy ntfy access andrievs klm-alerts-critical read-only
  4. Log in with andrievs:<password>.

Links