CHAOS ENGINEERING

with

SERVICE MESH

Julien Bisconti

SRE / Data Engineer


contact

g.dev/julien

slides: bisconti.cloud

Outline

  1. Genesis
  2. Service mesh: architecture and features
  3. Demo of Envoy and Istio
  4. Chaos Engineering: concepts & origin
  5. Demo of fault-injection
  6. Q&A

Questions to the audience

  1. Who knows what a service mesh is?
  2. Who knows what an SLI, SLO, or SLA is?
  3. Who knows what Chaos Engineering is?
  4. Who has already done Chaos Engineering?

Service Mesh

CONNECT

SECURE

CONTROL

OBSERVE


VIDEO: Istio a la carte by Dan Ciruli

8 fallacies of distributed computing

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

source (wikipedia)
RFC 1925 (12 Networking Truths)

using a library, framework or service for:

  • circuit breaking
  • timeouts
  • retries
  • service discovery
  • client-side load balancing
  • metrics
  • traffic shaping
  • rate limiting
  • ...
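
To make the cost concrete, here is a minimal sketch of just one item from the list, circuit breaking, as each service would otherwise have to carry it in its own code. The names (`CircuitBreaker`, `CircuitOpenError`, `max_failures`, `reset_after`) are hypothetical and not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the wrapped function."""


class CircuitBreaker:
    """Fail fast after repeated errors, then allow a trial call after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure of that trial call re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Multiply this by every feature in the list and by every language in use, and the case for moving it out of the application becomes clearer.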

for which languages?

drawbacks

  • combination language/framework/version/feature
  • maintain, upgrade, migrate, retire
  • code pollution and complexity (+ testing)
  • deployment / rolling update
  • language/framework/version lock-in
  • debugging


➡️ move it to the infrastructure

What is a service mesh

What problems does it solve


Communication between services


A network for services, not bytes

How does it solve inter service communication

The overall architecture of an Istio-based application.
source

What's in the code


# Each downstream service is addressed by a plain hostname and port.
details = {
    "name": "http://details:9080",
    "endpoint": "details",
    "children": []
}
ratings = {
    "name": "http://ratings:9080",
    "endpoint": "ratings",
    "children": []
}
  
source code

Traffic Management


apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
  ...
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
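
Read as plain logic, the rules above are checked in order and the first match wins. Here is a minimal Python sketch of that decision, assuming exact, case-sensitive header matching and ignoring everything else the sidecar does. `ROUTES` and `pick_subset` are illustrative names, not Istio APIs.

```python
# Ordered rules mirroring the VirtualService: first match wins.
ROUTES = [
    {"match": {"end-user": "jason"}, "subset": "v2"},
    {"match": {}, "subset": "v1"},  # no conditions: the default route
]


def pick_subset(headers, routes=ROUTES):
    """Return the subset of the first rule whose header conditions all hold."""
    for rule in routes:
        if all(headers.get(name) == value for name, value in rule["match"].items()):
            return rule["subset"]
    return None  # no rule matched (cannot happen with a default route)
```

With these rules, a request carrying `end-user: jason` goes to `v2` and every other request falls through to `v1`.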

Resiliency


apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    retries:
      attempts: 3
      perTryTimeout: 2s
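
What the proxy does with this retry policy can be approximated in a few lines of Python. This is a sketch under assumptions: `attempts` is taken to mean retries after the first try, the callable is assumed to honor a `timeout` argument itself, and the delay the proxy adds between tries is left out. `call_with_retries` is a hypothetical helper, not part of Istio or Envoy.

```python
def call_with_retries(func, attempts=3, per_try_timeout=2.0):
    """Call func(timeout=...) once, then retry up to `attempts` times on error."""
    last_error = None
    for _ in range(1 + attempts):  # one initial try plus the retries
        try:
            return func(timeout=per_try_timeout)
        except Exception as exc:  # a real proxy only retries selected failures
            last_error = exc
    raise last_error
```

The worst case is worth keeping in mind: with these values a caller may wait for up to four tries of two seconds each, which is how cascading timeouts start.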

Security

  • namespace-level and service-level policies
  • mutual TLS Authentication
  • role-based access control (RBAC)

Observability

  • Metrics (Prometheus)
  • Logs (Fluentd)
  • Tracing (Jaeger)
  • Cluster traffic (Kiali)

DEMO

Bookinfo Application without Istio

QUESTIONS about service mesh

List of service meshes

Comparison: Consul vs Istio

Resources

CHAOS ENGINEERING

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

principlesofchaos.org

Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

Kolton Andrus (cofounder and CEO of Gremlin Inc.)

Chaos Engineering isn't done to cause problems; it is done to reveal them.

Nora Jones (Netflix)

Chaos Engineering is exploratory testing of non-functional requirements where ‘non-functional requirements’ are the requirements that if not met render a service non-functional.

@littleidea

What Chaos Engineering is not

pray to server

Hope is not a strategy

Usually untested

  1. Graceful shutdown
  2. Health check
  3. Cascading timeouts
  4. Deployments (smoke test)

Types of errors

  • Unreachable
  • Delays
  • Timeout cascading
  • Circuit breaker
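
The first two error types, unreachable services and delays, are the ones a fault-injection experiment typically simulates. A minimal sketch of the idea in Python, assuming a percentage-based delay and abort applied in front of a call. `with_faults` and `InjectedAbort` are hypothetical names; in a mesh the same effect comes from configuration rather than application code.

```python
import random
import time


class InjectedAbort(Exception):
    """Stands in for an HTTP error returned instead of calling the service."""

    def __init__(self, status):
        super().__init__(f"injected abort with HTTP status {status}")
        self.status = status


def with_faults(func, delay_s=0.0, delay_pct=0.0, abort_status=None,
                abort_pct=0.0, rng=random.random, sleep=time.sleep):
    """Wrap func so a percentage of calls is delayed and/or aborted."""

    def wrapper(*args, **kwargs):
        if rng() * 100 < delay_pct:  # simulate a slow dependency
            sleep(delay_s)
        if abort_status is not None and rng() * 100 < abort_pct:
            raise InjectedAbort(abort_status)  # simulate a failing dependency
        return func(*args, **kwargs)

    return wrapper
```

For example, `with_faults(get_ratings, delay_s=7, delay_pct=100)` would delay every call by seven seconds, where `get_ratings` is whatever client function is under test. Timeout cascades and circuit-breaker behavior are then observed in the callers rather than injected directly.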

Site Reliability Engineering

Gameday

What happens when ________ ? [fill in the blank]

example: Breaking DynamoDB

Organization failures

Expect failure and learn from it

  • High Severity Incident Management Program
  • If you don't learn from it, it will happen again!
  • Practice: It's a cultural approach to failure
  • Publish reports (RCA) and results

source

Kaizen (改善)

kai-zen = change-good

Toyota andon cord

definition

Word of caution

"Chaos": sounds cool and fun for you.
"Resiliency": sounds great for your manager and the system.

Results of Chaos Engineering = resiliency


Article: Would a Chaos by any other Name

Book: Resilience Engineering in Practice

How to start Chaos Engineering

  1. Don't mention "Chaos" yet - talk about goals
  2. Set up monitoring!!!
  3. Identify a measurable output for "steady state"
  4. Form a hypothesis
  5. Simulate real-world events
  6. Disprove your hypothesis
  7. Write a report with findings and measurements
  8. Talk about the "Chaos" experiment
  9. Practice and improve
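
Steps 3 to 7 can be sketched as a small experiment loop. This is an illustration under assumptions, not a tool: the steady state is taken to be an error rate below a threshold, and `probe`, `inject`, and `rollback` are hypothetical callables that you would supply (a health request, a fault rule being applied, the same rule being removed).

```python
def error_rate(results):
    """Fraction of failed calls; results is a list of booleans (True = success)."""
    return 0.0 if not results else results.count(False) / len(results)


def run_experiment(probe, inject, rollback, samples=100, max_error_rate=0.01):
    """Check steady state, inject a fault, measure again, always roll back."""
    baseline = error_rate([probe() for _ in range(samples)])
    if baseline > max_error_rate:
        # The system is not in steady state: do not add chaos on top of it.
        return {"status": "aborted", "baseline": baseline}
    inject()
    try:
        observed = error_rate([probe() for _ in range(samples)])
    finally:
        rollback()  # remove the fault even if measuring fails
    held = observed <= max_error_rate
    return {
        "status": "hypothesis held" if held else "weakness found",
        "baseline": baseline,
        "observed": observed,
    }
```

The returned dictionary is the raw material for the report in step 7. Either outcome is a result worth publishing.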

DEMO

Bookinfo Application without Istio

Resources

THANK YOU

and I'm sorry 🙏
If you had to maintain my code
I hope you learned more by maintaining it
than me by writing it

contact

https://bisconti.cloud/

@julienBisconti

Slides made with Reveal.js and hugo-reveal