CHAOS ENGINEERING

with

SERVICE MESH

Julien Bisconti

SRE / Data Engineer

contact

g.dev/julien

slides: bisconti.cloud

Outline

  1. Genesis
  2. Kubernetes networking model
  3. Service mesh: architecture and features
  4. Demo of Istio
  5. Chaos Engineering: concepts & origin
  6. Demo of fault-injection
  7. Q&A

Questions to the audience

  1. Who uses containers?
  2. Who uses an orchestrator such as Kubernetes?
  3. Who knows what a service mesh is?
  4. Who knows what SLIs, SLOs and SLAs are?
  5. Who knows what Chaos Engineering is?
  6. Who has already done Chaos Engineering?

Service Mesh

CONNECT

SECURE

CONTROL

OBSERVE


VIDEO: Istio a la carte by Dan Ciruli

8 fallacies of distributed computing

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

source (wikipedia)
RFC 1925 (The Twelve Networking Truths)

using a library, framework or service for:

  • circuit breaking
  • timeouts
  • retries
  • service discovery
  • client-side loadbalancing
  • metrics
  • traffic shaping
  • rate limiting
  • ...
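To make the list above concrete, here is a minimal sketch of one of these features, a circuit breaker, written directly in application code. The class name, parameters and thresholds are illustrative assumptions, not taken from any particular library; the point is only to show the kind of logic every service would have to carry and maintain.

```python
import time


class CircuitBreaker:
    """Tiny in-process circuit breaker (hypothetical, for illustration only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: the remote service is not even contacted.
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Usage would look like `breaker.call(fetch_details, product_id)`, where `fetch_details` is whatever function performs the remote call. The same logic then has to exist, with the same behaviour, in every language and framework in use.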

for which languages?

drawbacks

  • combination language/framework/version/feature
  • maintain, upgrade, migrate, retire
  • code pollution and complexity (+ testing)
  • deployment / rolling update
  • language/framework/version lock-in
  • debugging


➡️ move it to the infrastructure

Kubernetes networking model

1. all containers → all other containers, without NAT

2. all nodes ↔ all containers, without NAT

3. the IP that a container sees itself as is the SAME IP that others see it as

NAT (wikipedia)
VIDEO: Kubernetes Deconstructed (1h)

What is a service mesh

What problems does it solve


Communication between services


A network for services, not bytes

How does it solve inter service communication

The overall architecture of an Istio-based application.
source

What's in the code


# Each downstream service is addressed by its plain service name and port.
details = {
    "name" : "http://details:9080",
    "endpoint" : "details",
    "children" : []
}
ratings = {
    "name" : "http://ratings:9080",
    "endpoint" : "ratings",
    "children" : []
}
  
source code

Traffic Management


apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
  ...
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
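How the rules above are evaluated can be sketched in a few lines of Python. This is a simplified model, not Istio's implementation: rules are tried in order, the first match wins, and a rule without a match block catches everything else. The `RULES` structure and `pick_subset` function are hypothetical names, and only exact header matches are modelled.

```python
# Simplified mirror of the VirtualService above: ordered rules, first match wins.
RULES = [
    {"match": {"end-user": "jason"}, "subset": "v2"},
    {"match": {}, "subset": "v1"},  # no conditions: the catch-all route
]


def pick_subset(headers, rules=RULES):
    """Return the subset of 'reviews' that a request with these headers reaches."""
    for rule in rules:
        # Every header listed in the rule must be present with exactly that value.
        if all(headers.get(name) == value for name, value in rule["match"].items()):
            return rule["subset"]
    return None  # no rule matched (cannot happen while a catch-all exists)
```

With these rules, `pick_subset({"end-user": "jason"})` selects `v2`, while any other user, or a request without the header, falls through to `v1`.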

Resiliency


apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    retries:
      attempts: 3
      perTryTimeout: 2s
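What the retry policy above means for a caller can be sketched as follows. This is a rough model under two assumptions: `attempts` counts retries on top of the initial try, and the short back-off the proxy inserts between retries is ignored. `call_with_retries` and `worst_case_seconds` are hypothetical helper names.

```python
def call_with_retries(send, attempts=3):
    """Call send() once, then retry up to `attempts` times on TimeoutError.

    Returns a (result, tries) tuple so the number of tries is visible.
    """
    tries = 0
    for _ in range(1 + attempts):
        tries += 1
        try:
            return send(), tries
        except TimeoutError:
            continue  # this try exceeded its per-try timeout; go again
    raise TimeoutError(f"gave up after {tries} tries")


def worst_case_seconds(attempts=3, per_try_timeout=2.0):
    """Upper bound on how long the caller waits if every try times out."""
    return (1 + attempts) * per_try_timeout
```

Under these assumptions the caller may wait up to `worst_case_seconds(3, 2.0)`, that is 8 seconds, before seeing a failure. Any timeout configured further up the call chain has to leave room for that, otherwise timeouts cascade.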

Security

  • namespace-level and service-level policies
  • mutual TLS Authentication
  • role-based access control (RBAC)

Observability

  • Metrics (prometheus)
  • Logs (fluentd)
  • Tracing (jaeger)
  • Cluster traffic (kiali)

DEMO

Bookinfo Application without Istio

QUESTIONS

about service mesh

CHAOS ENGINEERING

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

principlesofchaos.org

Thoughtful, planned experiments designed to reveal the weakness in our systems.

Kolton Andrus (cofounder and CEO of Gremlin Inc.)

Chaos Engineering isn't done to cause problems; it is done to reveal them.

Nora Jones (Netflix)

Chaos Engineering is exploratory testing of non-functional requirements where ‘non-functional requirements’ are the requirements that if not met render a service non-functional.

@littleidea

Usually untested

  1. Graceful shutdown
  2. Health check
  3. Cascading timeouts
  4. Deployments (smoke test)

Type of errors

  • Unreachable
  • Delays
  • Timeout cascading
  • Circuit breaker
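Two of these error types, delays and unreachable services, can be provoked with Istio's fault injection on a VirtualService. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The helper name, the `ratings` / `v1` target and the 7 second delay are illustrative choices, and the subset is assumed to be defined by an existing DestinationRule.

```python
import json


def fault_virtual_service(host, subset, delay_seconds=None, abort_status=None,
                          percent=100.0):
    """Build a VirtualService that injects a fixed delay and/or an HTTP abort."""
    fault = {}
    if delay_seconds is not None:
        # Hold the affected requests for a fixed time before forwarding them.
        fault["delay"] = {
            "percentage": {"value": percent},
            "fixedDelay": f"{delay_seconds}s",
        }
    if abort_status is not None:
        # Answer the affected requests with this status instead of forwarding.
        fault["abort"] = {
            "percentage": {"value": percent},
            "httpStatus": abort_status,
        }
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [{
                "fault": fault,
                "route": [{"destination": {"host": host, "subset": subset}}],
            }],
        },
    }


if __name__ == "__main__":
    # Example: delay every request to ratings v1 by 7 seconds.
    print(json.dumps(fault_virtual_service("ratings", "v1", delay_seconds=7), indent=2))
```

Lowering `percent` limits the blast radius of the experiment to a fraction of the traffic.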

Word of caution

"Chaos": sounds cool and fun for you.
"Resiliency": sounds great for your manager and the system.

Results of Chaos Engineering = resiliency


Article: Would a Chaos by any other Name

Book: Resilience Engineering in Practice

How to start Chaos Engineering

  1. Don't mention "Chaos" yet - talk about goals
  2. Set up monitoring !!!
  3. Identify a measurable output for "steady state"
  4. Form a hypothesis
  5. Simulate real-world events
  6. Disprove your hypothesis
  7. Write a report with findings and measurements
  8. Talk about the "Chaos" experiment
  9. Practice and improve
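Steps 3 to 6 above can be sketched as a small, self-contained simulation. Nothing here talks to a real system: `simulated_request` is a stand-in for a real call, and the latency range, the 0.5 second threshold and the choice of p95 latency as the steady-state metric are all assumptions made for illustration.

```python
import random
import statistics


def simulated_request(rng, injected_delay=0.0):
    """Stand-in for a real request: returns a latency in seconds."""
    return rng.uniform(0.05, 0.15) + injected_delay


def p95(samples):
    """95th percentile: the last of the 19 cut points that split data in 20."""
    return statistics.quantiles(samples, n=20)[-1]


def run_experiment(injected_delay, threshold=0.5, requests=200, seed=42):
    """Measure steady state, inject a delay, and check the hypothesis.

    Hypothesis: p95 latency stays at or below `threshold` under the fault.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    baseline = p95([simulated_request(rng) for _ in range(requests)])
    under_fault = p95([simulated_request(rng, injected_delay)
                       for _ in range(requests)])
    return {
        "baseline_p95": baseline,
        "fault_p95": under_fault,
        "hypothesis_holds": under_fault <= threshold,
    }
```

Calling `run_experiment(2.0)` disproves the hypothesis, since every request takes more than two seconds, which is exactly the kind of finding that goes into the report of step 7. In a real experiment the samples would come from the monitoring set up in step 2.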

Site Reliability Engineering

DEMO

Bookinfo Application without Istio

Resources

THANK YOU

and I'm sorry 🙏
If you had to maintain my code
I hope you learned more by maintaining it
than me by writing it

contact

https://bisconti.cloud/

@julienBisconti

Slides made with Reveal.js and hugo-reveal