The Infrastructure-as-Go Experiment

Lessons from the trenches

Julien Bisconti

Software Engineer / SRE

Google Cloud Platform icon

slides: bisconti.cloud

contact: g.dev/julien

MY_VAR="world"
echo "hello $MY_VAR"

#!/bin/sh

FOO="foo"
ssh some.remote.host << EOF
  BAR="bar"
  echo "FOO=$FOO"
  echo "BAR=\$BAR"
EOF
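The heredoc above is expanded twice: first by the local shell (FOO), then on the remote host (BAR, which the backslash protects from the local pass). A minimal Go sketch of that first, local pass. It supports only `$NAME` and `\$`, a small subset of sh rules, and `expandHeredoc` is a hypothetical helper, not part of any tool in this talk:

```go
package main

import (
	"fmt"
	"strings"
)

// expandHeredoc mimics a small subset of what the LOCAL sh does to the body
// of an unquoted heredoc (<< EOF) before ssh sends it anywhere:
//   - `\$` becomes a literal `$` (the backslash is dropped)
//   - `$NAME` (letters, digits, underscores) becomes env[NAME], or "" if unset
// ${...}, $(...), backticks and other backslash escapes are not handled.
func expandHeredoc(body string, env map[string]string) string {
	var b strings.Builder
	for i := 0; i < len(body); i++ {
		c := body[i]
		if c == '\\' && i+1 < len(body) && body[i+1] == '$' {
			b.WriteByte('$') // escaped locally, so it survives for the remote shell
			i++
			continue
		}
		if c != '$' {
			b.WriteByte(c)
			continue
		}
		j := i + 1
		for j < len(body) && isNameByte(body[j]) {
			j++
		}
		if j == i+1 {
			b.WriteByte('$') // a lone "$" is kept as-is
			continue
		}
		b.WriteString(env[body[i+1:j]])
		i = j - 1 // the loop increment moves past the name
	}
	return b.String()
}

func isNameByte(c byte) bool {
	return c == '_' || ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9')
}

func main() {
	// Body of the heredoc from the slide (indentation dropped).
	body := "BAR=\"bar\"\necho \"FOO=$FOO\"\necho \"BAR=\\$BAR\"\n"
	// Only FOO exists locally; BAR is assigned on the remote host.
	fmt.Print(expandHeredoc(body, map[string]string{"FOO": "foo"}))
}
```

The printed text is what the remote host actually receives: `FOO` has already become `foo`, while `$BAR` is left for the remote shell.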
resource "local_file" "shell_script_example" {
  filename = "${path.module}/env_script.sh"
  content  = <<-EOF
    #!/bin/bash
    # This will be interpolated by Terraform:
    echo "Terraform variable 'aws_region': ${var.aws_region}"

    # Escaped with $$ so Terraform leaves them for the shell to expand:
    echo "Current shell user: $${USER}"
    echo "Home directory: $${HOME}/data"
  EOF
}

Spot the bug

spec:
  template:
    spec:
      containers:
      - name: message
        image: busybox:1.36
        env:
        - name: MESSAGE
          value: "hello world"
        command: ["/bin/echo"]
        args: ["$(MESSAGE)"]
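Kubernetes brings yet another syntax. Per the container API reference, `$(VAR_NAME)` in `command`/`args` is expanded from the container's environment, an unresolvable reference is left unchanged, and `$$` is reduced to `$`, so `$$(VAR_NAME)` yields a literal `$(VAR_NAME)`. A simplified Go sketch of those rules; `expandK8s` is hypothetical and skips edge cases the real implementation handles:

```go
package main

import (
	"fmt"
	"strings"
)

// expandK8s is a simplified sketch of Kubernetes' $(VAR) expansion in
// command/args: known variables are substituted, unknown references are
// kept verbatim, and "$$" collapses to a single literal "$".
func expandK8s(in string, env map[string]string) string {
	var b strings.Builder
	for i := 0; i < len(in); i++ {
		if in[i] != '$' || i+1 >= len(in) {
			b.WriteByte(in[i])
			continue
		}
		switch in[i+1] {
		case '$':
			// "$$" -> "$": this is how "$$(NAME)" escapes a reference.
			b.WriteByte('$')
			i++
		case '(':
			end := strings.IndexByte(in[i+2:], ')')
			if end < 0 {
				// Unterminated reference: copy the rest unchanged.
				b.WriteString(in[i:])
				return b.String()
			}
			name := in[i+2 : i+2+end]
			if v, ok := env[name]; ok {
				b.WriteString(v)
			} else {
				// Unresolvable: leave "$(NAME)" untouched.
				b.WriteString(in[i : i+3+end])
			}
			i += 2 + end // land on ")"; the loop increment skips it
		default:
			b.WriteByte('$')
		}
	}
	return b.String()
}

func main() {
	env := map[string]string{"MESSAGE": "hello world"}
	for _, arg := range []string{"$(MESSAGE)", "$$(MESSAGE)", "$(MISSING)"} {
		fmt.Printf("%-12s -> %s\n", arg, expandK8s(arg, env))
	}
}
```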

What do you notice about the API?

abomination

https://gist.github.com/veggiemonk/57b25ab2afeb8618424629b1a9a9855b

Where is Waldo?

https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/prometheus-operator/deployment.yaml

Now, let’s add:

  • ArgoCD
  • FluxCD
  • multilayer Kustomize
  • CUE
  • Jsonnet
  • Starlark
  • [insert name of tool] …

You get the picture

We build our most critical systems on foundations that lack basic engineering safety nets. We wouldn’t write our backend services this way, yet we accept it for the infrastructure that runs them.

The Status Quo

“We have accepted a strange status quo.”

  • Too many layers of abstraction
  • Too many tools
  • Too many DSLs (Domain Specific Languages)

What is missing

  • Day 2 operations: migrations and upgrades are an exercise left to the reader.
  • No compile-time verification (shift left)
  • Reuse is hard.

What is missing (cont.)

  • The feedback loop is slow (>15 min)
  • Authentication can be really messy
  • Apply and pray

Enforcing engineering practices is so hard it is nearly impossible.

Building the solution

LINGON

github.com/golingon/lingon

  • Jacob Lärsfors
  • Leonard Aukea
  • Julien Bisconti

The goal was to be anti-abstraction

See rationale.md and README.md

  • Kubernetes YAML/JSON => Go
  • Go => Kubernetes YAML/JSON
  • Terraform providers => Go
  • Go => Terraform HCL

We converted a bunch of Helm charts to Go.

😱

I cannot believe people put that into production.

=> wrong permissions, many unused resources/config

Why Go?

  • It is not about the language
  • It is about the tooling

Who is it for?

  • Advanced DevOps teams
  • Software engineers dealing with infrastructure
  • Teams with too many providers
  • Teams with extraordinary use cases
    • Always in a migration
    • Authentication is weird
    • Deprecated APIs that are still needed
    • A re-org every 6 months

Who is it NOT for?

  • You can keep the whole infra in your head.
  • You just want to deploy something quickly.
  • Anyone but us, at the time.

Why don’t you use X?

See Comparison.md

The Post-Mortem

  • Technically: success (it worked beautifully).
  • Culturally: archived (feature complete).

Why?

Lessons

  • Always have an escape hatch: sometimes you have to get your hands dirty.
  • Understand what production means; environments are just names.
  • Monitoring is much more important than I imagined.
  • Do things manually before you automate them.
  • Define what an error is (e.g. HTTP status code 4xx vs 5xx).
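One possible definition, sketched in Go: count 5xx responses as errors (the service failed) and treat 4xx as client mistakes tracked separately. This is an assumed policy for illustration only; some teams also count specific 4xx codes such as 429:

```go
package main

import (
	"fmt"
	"net/http"
)

// countsAsError encodes one assumed policy: only 5xx responses count
// against the service; 4xx responses are the client's mistake.
func countsAsError(status int) bool {
	return status >= http.StatusInternalServerError && status < 600
}

func main() {
	for _, s := range []int{http.StatusOK, http.StatusNotFound, http.StatusTooManyRequests, http.StatusServiceUnavailable} {
		fmt.Printf("%d %-20s error=%v\n", s, http.StatusText(s), countsAsError(s))
	}
}
```

Writing the rule down as code makes the definition explicit and reviewable instead of living in someone's head.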

THANK YOU

and I'm sorry 🙏
if you had to maintain my code.
I hope you learned more by maintaining it
than I did by writing it.

Slides made with Reveal.js and hugo-reveal
