CHAOS ENGINEERING

with

SERVICE MESH

Julien Bisconti

SRE / Data Engineer


contact

g.dev/julien

slides: bisconti.cloud

Outline

  1. Genesis
  2. Service mesh: architecture and features
  3. Demo of Envoy and Istio
  4. Chaos Engineering: concepts & origin
  5. Demo of fault-injection
  6. Q&A

in the beginning there was an

APP

and the app was code

that needed to scale

👉 microservices

Deployment

Containers: lightweight VMs

  • 12 factor app
  • easier deploy
  • reproducible build


but ...

Deployment concerns

  • Scaling up and down
  • Redundancy
  • Scheduling / Orchestration
  • Service Discovery
  • Resiliency
  • Rolling out and back
  • Health checks
  • Secrets and config

➡️ kubernetes

but ...

8 fallacies of distributed computing

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

source (wikipedia)
RFC 1925 (The Twelve Networking Truths)

Kubernetes concerns

  • Logging
  • Tracing
  • Metrics
  • Dependency visualisation
  • Service identity and Auth
  • Circuit breaking
  • Traffic flow and policies
  • Failover
  • Fault injection
  • ...


➡️ use code?

drawbacks

  • combination language/framework/version/feature
  • maintain, upgrade, migrate, retire
  • code pollution and complexity (+ testing)
  • deployment / rolling update
  • language/framework/version lock-in
  • debugging


➡️ move it to the infrastructure

Data plane

envoy proxy
The network should be transparent to applications.
When network and application problems do occur it should be easy to determine the source of the problem.

The overall architecture of an Istio-based application.

How to manage a fleet of Envoy proxies?

Service Mesh

CONNECT

SECURE

CONTROL

OBSERVE


VIDEO: Istio a la carte by Dan Ciruli

What is a service mesh

What problems does it solve


Communication between services


A network for services, not bytes

How does it solve inter-service communication

The overall architecture of an Istio-based application.
source

What's in the code


details = {
    "name" : "http://details:9080",
    "endpoint" : "details",
    "children" : []
}
ratings = {
    "name" : "http://ratings:9080",
    "endpoint" : "ratings",
    "children" : []
}
  
source code

Traffic Management


apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
  ...
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
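
The rule above reads as a first-match routing table: requests with the header end-user: jason go to v2, everything else to v1. A minimal Python sketch of that matching logic, assuming exact header matches only. `Rule` and `pick_subset` are hypothetical names for illustration, not Istio or Envoy APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rule:
    """One http route entry: optional exact-match headers and a target subset."""
    subset: str
    exact_headers: Dict[str, str] = field(default_factory=dict)


# Mirrors the VirtualService above: jason -> v2, everyone else -> v1.
RULES: List[Rule] = [
    Rule(subset="v2", exact_headers={"end-user": "jason"}),
    Rule(subset="v1"),  # no match clause: catch-all
]


def pick_subset(headers: Dict[str, str], rules: List[Rule] = RULES) -> Optional[str]:
    """Return the subset of the first rule whose headers all match exactly."""
    for rule in rules:
        if all(headers.get(name) == value for name, value in rule.exact_headers.items()):
            return rule.subset
    return None  # no rule matched
```

Order matters: if the catch-all rule came first, the jason rule would never be reached.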

Resiliency


apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    retries:
      attempts: 3
      perTryTimeout: 2s
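
Without a mesh, the retry policy above would live in application code. A rough Python sketch of the equivalent logic, assuming `attempts` counts retries after the first try and that the callee honours the timeout it is given. `call_with_retries` is a hypothetical helper, not part of any library.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(call: Callable[[float], T],
                      retries: int = 3,
                      per_try_timeout: float = 2.0) -> T:
    """Run call(timeout), retrying on timeout or connection errors.

    Makes at most 1 + retries tries and re-raises the last error if all fail.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    last_error: Optional[Exception] = None
    for _ in range(1 + retries):
        try:
            return call(per_try_timeout)
        except (TimeoutError, ConnectionError) as err:
            last_error = err  # remember the failure and try again
    assert last_error is not None
    raise last_error
```

Every service and every language would need its own copy of this, which is the argument for moving it to the proxy.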

Security

  • namespace-level and service-level policies
  • mutual TLS Authentication
  • role-based access control (RBAC)

Observability

  • Metrics (prometheus)
  • Logs (fluentd)
  • Tracing (jaeger)
  • Cluster traffic (kiali)

DEMO

Bookinfo Application without Istio

QUESTIONS about service mesh

List of service meshes

Comparison: Consul vs Istio

Resources

CHAOS ENGINEERING

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

principlesofchaos.org

Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

Kolton Andrus (cofounder and CEO of Gremlin Inc.)

Chaos Engineering isn't done to cause problems; it is done to reveal them.

Nora Jones (Netflix)

Chaos Engineering is exploratory testing of non-functional requirements where ‘non-functional requirements’ are the requirements that if not met render a service non-functional.

@littleidea

What Chaos Engineering is not

pray to server

Hope is not a strategy

Usually untested

  1. Graceful shutdown
  2. Health check
  3. Cascading timeouts
  4. Deployments (smoke test)
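
Cascading timeouts are one item on this list that is easy to sketch. One common mitigation is to pass a shrinking deadline down the call chain, so a downstream call never waits longer than the caller has left. A minimal Python sketch under that assumption. The `Deadline` class is hypothetical, not taken from the talk or from any library.

```python
import time
from typing import Callable


class Deadline:
    """A time budget shared by all downstream calls of one request."""

    def __init__(self, budget_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, cap: float) -> float:
        """Timeout for the next downstream call: the cap or what is left, whichever is smaller."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("deadline exceeded before calling downstream")
        return min(cap, left)
```

A chaos experiment that injects delays is one way to check whether such budgets actually hold end to end.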

Type of errors

  • Unreachable
  • Delays
  • Timeout cascading
  • Circuit breaker
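
For the last item, a sketch of the classic in-process circuit breaker pattern in Python: fail fast after repeated errors, then allow a trial call after a cool-down. This illustrates the pattern only and is not how Envoy or Istio implement it. The class name, thresholds, and injectable clock are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```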

Site Reliability Engineering

Gameday

What happens when ________ ? [fill in the blank]

example: Breaking DynamoDB

Organization failures

Expect failure and learn from it

  • High Severity Incident Management Program
  • If you don't learn from it, it will happen again!
  • Practice: It's a cultural approach to failure
  • Publish reports (RCA) and results

source

Kaizen (改善)

kai-zen = change-good

Toyota andon cord

definition

Word of caution

"Chaos": sounds cool and fun for you.
"Resiliency": sounds great for your manager and the system.

Results of Chaos Engineering = resiliency


Article: Would a Chaos by any other Name

Book: Resilience Engineering in Practice

How to start Chaos Engineering

  1. Don't mention "Chaos" yet - talk about goals
  2. Set up monitoring !!!
  3. Identify a measurable output for "steady state"
  4. Form a hypothesis
  5. Simulate real-world events
  6. Disprove your hypothesis
  7. Write a report with findings and measurements
  8. Talk about the "Chaos" experiment
  9. Practice and improve
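
Steps 3 to 7 can be sketched as a tiny experiment harness. This is a simulation under stated assumptions: the steady-state metric is the request success rate, and `healthy` and `faulty` are stand-in callables rather than real services. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Report:
    """Findings and measurements of one experiment (step 7)."""
    baseline: float
    under_fault: float
    threshold: float

    @property
    def hypothesis_holds(self) -> bool:
        # Hypothesis (step 4): the success rate stays at or above the threshold.
        return self.under_fault >= self.threshold


def success_rate(request: Callable[[], bool], samples: int = 1000) -> float:
    """Measurable steady-state output (step 3): fraction of successful requests."""
    return sum(1 for _ in range(samples) if request()) / samples


def run_experiment(healthy: Callable[[], bool], faulty: Callable[[], bool],
                   threshold: float = 0.99, samples: int = 1000) -> Report:
    baseline = success_rate(healthy, samples)
    if baseline < threshold:
        # No steady state to defend: fix the system before experimenting.
        raise RuntimeError("baseline is below the threshold")
    # Steps 5 and 6: simulate the fault and try to disprove the hypothesis.
    return Report(baseline, success_rate(faulty, samples), threshold)
```

For example, passing a `faulty` callable that fails about 10% of the time against a 0.99 threshold should yield a report whose `hypothesis_holds` is false, which is the finding to write up.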

DEMO

Bookinfo Application without Istio
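
The demo itself is not captured in these slides. As an assumption about what a fault-injection demo could apply, here is a Python sketch that builds an Istio VirtualService injecting a fixed delay for one user, reusing the end-user header match from the Traffic Management slide. The function name, the ratings host, and the 7s delay are illustrative choices, not taken from the talk.

```python
import json
from typing import Any, Dict


def delay_fault_virtual_service(host: str, subset: str, user: str,
                                fixed_delay: str = "7s",
                                percent: float = 100.0) -> Dict[str, Any]:
    """Build a VirtualService that delays requests from one user only."""
    route = [{"destination": {"host": host, "subset": subset}}]
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Only requests carrying this header get the fault.
                    "match": [{"headers": {"end-user": {"exact": user}}}],
                    "fault": {"delay": {"percentage": {"value": percent},
                                        "fixedDelay": fixed_delay}},
                    "route": route,
                },
                {"route": route},  # everyone else: no fault
            ],
        },
    }


if __name__ == "__main__":
    # Emitted as JSON, which kubectl also accepts as a manifest format.
    print(json.dumps(delay_fault_virtual_service("ratings", "v1", "jason"), indent=2))
```

Scoping the fault to a single test user keeps the blast radius small while the hypothesis is checked.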

Resources

THANK YOU

and I'm sorry 🙏
If you had to maintain my code
I hope you learned more by maintaining it
than me by writing it

contact

https://bisconti.cloud/

@julienBisconti

Slides made with Reveal.js and hugo-reveal