Kubernetes

Logging & Monitoring

Plan

  • Introduction
  • Container technology
  • Kubernetes
  • Logging architecture
  • Monitoring

Julien Bisconti

SRE / Data Engineer

Google Cloud Platform icon

contact

g.dev/julien

slides: bisconti.cloud

How Long

from monolith to microservices?

8 fallacies of distributed computing

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Source: wikipedia

RFC 1925 - 12 Networking Truths

Logging & Monitoring:

monolithic app

-vs-

distributed system

Logging: recording events
Metrics: data combined from measuring events
Tracing: recording events with causal ordering


credit @coda

Logging

Log Levels for dev

  • Info 🤷‍♂️ don't care
  • Debug 🤓 when necessary
  • Warning 🤷‍♂️ don't care
  • Error 🧐 to investigate
  • Fatal 🤷‍♂️ don't care
  • Zombie-Apocalypse 🤷‍♂️ don't care
  • Meteor 🤷‍♂️ don't care
string perf tweet

Monitoring

  • Application errors 👉 where to look
  • Business metrics 👉 money
  • Latency 👉 user experience

Metrics, Metrics everywhere

youtu.be/czes-oa0yik

build OR buy

Tweet about datacenter

whole thread

Containers

What is a container

Not a real thing. An application delivery mechanism with process isolation based on several Linux kernel features.

kernel features

Dev 👉 inside container (build)

Ops 👉 outside container (run)



container = common interface for deploying services

cAdvisor


docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest
    
localhost:8080

Logs for containers

👉 Treat logs as event streams

12factor.net

  • ❌ no routing
  • ❌ no storage
  • ❌ no file handling


write logs to: stdout/stderr

Log Levels for dev

  • Info 🤷‍♂️ don't care
  • Debug 🤓 stdout
  • Warning 🤷‍♂️ don't care
  • Error 🧐 stderr
  • Fatal 🤷‍♂️ don't care
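
The stdout/stderr split above could be wired up roughly as follows. This is a minimal sketch using Python's standard `logging` module; `MaxLevelFilter` and `build_logger` are hypothetical names, and sending everything below Error to stdout is a simplifying assumption, not something the slides prescribe.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records strictly below max_level."""

    def __init__(self, max_level: int) -> None:
        super().__init__()
        self.max_level = max_level

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno < self.max_level


def build_logger(name: str = "app", out=None, err=None) -> logging.Logger:
    """Route records below ERROR to stdout and ERROR or above to stderr."""
    out = out if out is not None else sys.stdout
    err = err if err is not None else sys.stderr

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep records out of the root logger

    fmt = logging.Formatter("%(levelname)s %(message)s")

    stdout_handler = logging.StreamHandler(out)
    stdout_handler.addFilter(MaxLevelFilter(logging.ERROR))
    stdout_handler.setFormatter(fmt)

    stderr_handler = logging.StreamHandler(err)
    stderr_handler.setLevel(logging.ERROR)
    stderr_handler.setFormatter(fmt)

    logger.addHandler(stdout_handler)
    logger.addHandler(stderr_handler)
    return logger
```

No file handling, no routing, no storage: the application only writes to the two streams and leaves the rest to the platform.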

Kubernetes

kubernetes architecture

Logging architecture

Node level logging

  • JSON (no multiline)
  • /var/log/
  • keep previous pod logs
  • pod eviction = ❌ no logs
  • logrotate script
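
Since node-level logs are handled line by line, a multi-line stack trace tends to end up as several unrelated entries. One way around that, sketched here under the assumption that the application is free to pick its own log format, is to emit one JSON object per event. `JsonLineFormatter` and the field names are hypothetical.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # The traceback contains newlines; json.dumps escapes them,
            # so the whole event still fits on one line.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)
```

Attach it with `handler.setFormatter(JsonLineFormatter())` on a `logging.StreamHandler(sys.stdout)`.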

cluster level logging

Logs lifecycle & storage

independent of nodes, pods, or containers

logging with node agent

  • per node agent pod (DaemonSet)
  • centralized logging
  • fluentd
  • logs to stdout/stderr

logging with streaming side car

  • logs to shared volumes
  • sidecar streams logs to its own stdout
  • separate log streams
  • double disk usage
  • better to directly write to stdout/stderr
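
To make the streaming sidecar idea concrete, here is a rough sketch of what such a sidecar does: follow a file on the shared volume and copy complete lines to its own stdout. The function names and the polling interval are assumptions, and log rotation is deliberately not handled.

```python
import sys
import time


def stream_new_lines(log_file, out=sys.stdout) -> int:
    """Copy every complete line currently available in log_file to out.

    Returns the number of lines copied. A trailing partial line is left
    in place so it can be picked up once the writer finishes it.
    """
    count = 0
    while True:
        pos = log_file.tell()
        line = log_file.readline()
        if not line.endswith("\n"):
            log_file.seek(pos)  # nothing new, or an unfinished line
            break
        out.write(line)
        count += 1
    out.flush()
    return count


def follow(path: str, out=sys.stdout, poll_s: float = 1.0) -> None:
    """Stream a log file forever, sleeping when there is nothing new."""
    with open(path, "r") as log_file:
        while True:
            if stream_new_lines(log_file, out) == 0:
                time.sleep(poll_s)
```

Every line is written twice (once to the file, once to stdout), which is the double disk usage mentioned above.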

logging with sidecar agent

  • per pod agent (resources!)
  • no kubectl logs

logging from application

logs from application

which logs

typical app

tips

Monitoring

CPU & RAM should be enough

not really...

  • docker stats
  • kubectl top nodes
  • kubectl top pods

Why monitoring

  • detect/prevent outages (alerting)
  • entry price for chaos engineering
  • auto-scaling (HPA)
  • optimize (cost & performance)
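
Auto-scaling is a good illustration of why metrics matter: the Horizontal Pod Autoscaler derives the replica count from the ratio between the observed metric and its target. The sketch below follows the formula given in the Kubernetes documentation; the function name is hypothetical and the 0.1 tolerance is the documented default, which a cluster may override.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """desired = ceil(current_replicas * current_metric / target_metric)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)
```

For example, 4 pods averaging 200m CPU against a 100m target would be scaled to 8. Without a trustworthy metric, this calculation has nothing to work with.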

different levels of monitoring

  • Infrastructure level - U.S.E
  • Application level - R.E.D
USE method: for every resource, check:
  • utilization
  • saturation
  • errors
RED method: for every service, check that the request:
  • rate
  • error rate
  • duration (distributions)
are within the SLO/SLA
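
As a rough illustration of the RED method, the sketch below computes rate, error rate and a duration distribution from a batch of request records. The `Request` type, the nearest-rank percentile and the chosen percentiles are assumptions; in practice a metrics library would compute these from histograms.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    duration_s: float
    ok: bool


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile of a non-empty sorted list, p in (0, 100]."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests: List[Request], window_s: float) -> Dict[str, Optional[float]]:
    """Summarize one observation window as Rate, Errors, Duration."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_rate": errors / len(requests) if requests else 0.0,
        "p50_s": percentile(durations, 50) if durations else None,
        "p99_s": percentile(durations, 99) if durations else None,
    }
```

Percentiles are used instead of an average because a mean hides the slow tail that users actually notice.
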
Site Reliability Engineering book by Google
4 Golden Signals

  • Latency
  • Traffic
  • Errors
  • Saturation


landing.google.com/sre/books/

What to monitor

  1. request time/rate (if it's fast, it works)
  2. connections (health check, DB, pods)
  3. kubernetes pods (CrashLoopBackOff,...)
  4. kubernetes internals (control plane, kubelet, ...)
  5. infrastructure (disk space, CPU, RAM, network,...)

Health check

what does "healthy" mean?

typical app
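
One possible answer to what "healthy" means, sketched under the assumption that an app is healthy only when every dependency it needs is reachable: the `health` function and the check names are hypothetical, and the 200/503 status codes are a common convention, not a requirement.

```python
from typing import Callable, Dict, Tuple


def health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and report an HTTP-style status."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:  # a crashing check counts as unhealthy
            results[name] = f"error: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

A web handler could call `health({"db": ping_db, "cache": ping_cache})`, where `ping_db` and `ping_cache` stand in for real connectivity checks, and return the status code together with the per-dependency detail.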

Where to monitor

maybe not on the cluster that you are monitoring

don't take my word for it

SLI, SLO, SLA

These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service.
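
A small worked example may help tie the three terms together. The sketch assumes an availability SLI defined as good events over total events and an SLO expressed as a fraction such as 0.999; both function names are hypothetical.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that met the success criterion."""
    return good_events / total_events


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been missed.
    """
    budget = 1.0 - slo   # failures the SLO tolerates
    spent = 1.0 - sli    # failures actually observed
    return (budget - spent) / budget
```

With 9,995 good requests out of 10,000 and a 99.9% SLO, about half of the error budget is left. The SLA is then the agreement about what happens when that budget runs out.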

Define SLI SLO

Going further

K8s for dev

Open data

example: makebook.io/open

simpleanalytics.io

doesn't track your users

GDPR friendly

THANK YOU

and I'm sorry 🙏
If you had to maintain my code
I hope you learned more by maintaining it
than me by writing it

contact

https://bisconti.cloud/

@julienBisconti

Slides made with Reveal.js and hugo-reveal