Building data/ML platforms

Lessons from the trenches

What is the purpose of

ML/data platforms ?

scaling

people & systems

( Skills & Tools )

Conway’s Law

"organizations [...] are constrained to produce designs which are copies of the communication structures of these organizations"

💡

tools

can help fix an

organization

but not its

culture

how to scale a system

  • vertically
  • horizontally
  • deeply

village (100+) -> city (10k+) -> megacity (1m+)

same same but different

💡 we are not scaling a system up

we are creating a new system

that can sustain 100x the load

and we continuously migrate to it

💡 Always be migrating

or at least plan for it

Migration

create the new

deprecate the old
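One way to "create the new, deprecate the old" without a big-bang cutover is to shift traffic gradually. The sketch below is only an illustration under assumptions: `route` and `new_system_percent` are hypothetical names, and hashing the request id is one possible way to keep each request on the same side while the percentage grows.

```python
import hashlib


def route(request_id: str, new_system_percent: int) -> str:
    """Return "new" or "old" for a request, deterministically per request_id."""
    if not 0 <= new_system_percent <= 100:
        raise ValueError("new_system_percent must be between 0 and 100")
    # Hash the id into a stable bucket in [0, 100).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Raising the percentage only ever moves requests from "old" to "new".
    return "new" if bucket < new_system_percent else "old"
```

Starting at 0 and raising the percentage step by step lets the old system be deprecated only once the new one carries 100% of the load.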

problem definition

how to:

  • train
  • build
  • deploy
  • monitor


Machine Learning models

in a reproducible manner

at scale ?
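A small sketch of what "reproducible" can mean in practice: a training run is identified by everything that went into it. The function name `run_fingerprint` and the three inputs (code version, data version, parameters) are assumptions for illustration, not a prescribed scheme.

```python
import hashlib
import json


def run_fingerprint(code_version: str, data_version: str, params: dict) -> str:
    """Return a stable hash identifying the inputs of a training run."""
    manifest = {"code": code_version, "data": data_version, "params": params}
    # Sorted keys and fixed separators give the same bytes for the same inputs.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with the same fingerprint should produce the same model; if they do not, something that matters (a seed, a dependency, the hardware) is missing from the manifest.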

Hidden Technical Debt in Machine Learning Systems

Source: D. Sculley, et al.: Hidden Technical Debt in Machine Learning Systems

cost of context switching

spreadsheet

source link

Mental limitations


  • # decisions / day
  • # things to remember
  • speed of memory / reflexes

Reliable

def: consistently good in quality or performance; able to be trusted.

“holistic approach toward deciding how to integrate development, deployment, production operations, and long-term care.”

src: “Reliable Machine Learning - Cathy Chen” (c)

CHAOS ENGINEERING*

* Requires good monitoring

What Chaos Engineering is not

pray to server

Hope is not a strategy

Strategy >< tactics

A tactic is a specific action taken to achieve a goal

A strategy is the plan to achieve that goal

ML platform assembly kit

data engineer toolbox Source: article by Clemens Mewald

production grade infrastructure

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

Main blockers (Infra)

  • networking
  • identity and access management (IAM)
  • resources organization
  • billing
  • policy management

Shared VPC

source: GCP blog

IAM

  • how to get access to the data
  • how to remove access to the data

Good to understand OAuth & TLS certificates

build OR buy

Tweet about datacenter

whole thread

💡

Laws might dictate that you have

to build your own

Tradeoffs

Build >< buy

time & skills >< Money

When things go wrong

AWS Shared Responsibility Model

At scale

everything hurts

[the cloud bill is not proportional to the number of customers but to the number of developers with access to the cloud account]

“Corey Quinn (@QuinnyPig)”

Cloud operation team

basically restricting what devs can do

💡

Paying for a platform costs more

but allows us to validate our use cases faster

💡

What is measured can be improved

Infrastructure as “Code”

  • Terraform (HCL)
  • CloudFormation (YAML)
  • Jsonnet (JSON)
  • Pulumi (TS, Python, Go, …)
  • CDKs (TS, Python, Go, …)


70% of outages are caused by a config change

prog. lang. (dynamic)

vs

config. lang. (static)
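Since so many outages start with a config change, one tactic is to validate config with code before it is applied. This is a minimal sketch under assumptions: `ServiceConfig`, its three fields and `validate_config` are hypothetical and stand in for whatever the real platform config contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    name: str
    replicas: int
    memory_mb: int


# Hypothetical schema: key name -> required Python type.
EXPECTED_TYPES = {"name": str, "replicas": int, "memory_mb": int}


def validate_config(raw: dict) -> ServiceConfig:
    """Check a raw config dict and report every problem at once."""
    errors = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in raw:
            errors.append(f"missing key '{key}'")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(raw[key], bool) or not isinstance(raw[key], expected):
            errors.append(
                f"'{key}' must be {expected.__name__}, got {type(raw[key]).__name__}"
            )
    for key in sorted(set(raw) - set(EXPECTED_TYPES)):
        errors.append(f"unknown key '{key}'")
    if not errors and raw["replicas"] < 1:
        errors.append("'replicas' must be at least 1")
    if errors:
        raise ValueError("invalid config: " + "; ".join(errors))
    return ServiceConfig(**raw)
```

Running a check like this in CI turns a production outage into a failed build with a readable error message.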

Error messages

good error message

source: Jenni Nadler (c)

converting types

the most awkward and common thing to do!

  • gRPC/protobuf
  • AVRO
  • JSON
  • DBs
  • language types
  • ORMs
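A small sketch of why these conversions are awkward: JSON has no timestamp type and numbers often arrive as strings, so every boundary needs explicit conversion code. The `Prediction` record and its fields are hypothetical, chosen only to illustrate a round trip between JSON and language types.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Prediction:
    model: str
    score: float
    created_at: datetime


def prediction_from_json(payload: str) -> Prediction:
    """Convert a JSON document into a typed record, failing loudly on bad input."""
    raw = json.loads(payload)
    return Prediction(
        model=str(raw["model"]),
        score=float(raw["score"]),  # accepts 0.5 and "0.5"
        created_at=datetime.fromisoformat(raw["created_at"]),
    )


def prediction_to_json(prediction: Prediction) -> str:
    """Convert the typed record back to JSON, timestamp as an ISO 8601 string."""
    return json.dumps(
        {
            "model": prediction.model,
            "score": prediction.score,
            "created_at": prediction.created_at.isoformat(),
        }
    )
```

The same record would need a similar pair of functions for protobuf, Avro, the database and the ORM, which is where the cost adds up.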

💡 Everything is a tradeoff

“best” is the enemy of “good”

Main blockers (ML)

  • data organization
  • data quality
  • data access/security
  • data privacy
  • data governance
  • data lineage
  • deployment

Deployment concerns

  • Scaling up and down
  • Redundancy
  • Scheduling / Orchestration
  • Service Discovery
  • Resiliency
  • Rolling out and back
  • Health checks
  • Secret and config

Kubernetes architecture

Kubernetes concerns

  • Logging
  • Tracing
  • Metrics
  • Dependency visualisation
  • Service identity and Auth
  • Circuit breaking
  • Traffic flow and policies
  • Failover
  • Fault injection
  • ...

using a library, framework or service for:

  • circuit breaking
  • timeouts
  • retries
  • service discovery
  • client-side loadbalancing
  • metrics
  • traffic shaping
  • rate limiting
  • ...

for which languages?
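To make the question concrete, here is a minimal sketch of just one item from the list, retries with exponential backoff, written as a library helper. The name `call_with_retries` and its parameters are hypothetical. Done as a library, this has to be rewritten and maintained for every language in use, which is the argument for pushing it into a shared service or the platform.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Out of attempts: let the last error reach the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the helper can be tested without actually waiting.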

Programming languages

  • How many languages do we use ?
  • How many languages do we know really well ?

the bar for a new programming language in production at Google

source: "thread by Jaana Dogan" (c)

Go

It is not the language, it is the tooling/opinions

testing, benchmarks, deployments, dependency management, security, code formatting, distribution …

Consistency is key

source

Archives of the History of American Psychology, The Center for the
History of Psychology, The University of Akron

army report uniformity

💡 Consistency is a myth but a worthy goal

💡

DevOps

basically people talking to each other through APIs

#thisisdevops

this is devops

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

Dev + Ops = DevOps

DevOps + Data = DataOps

DataOps + ML = MLOps

SRE ?

how different is ML

  1. Various hardware
  2. Resources heavy
  3. Various cycles
  4. Many languages
  5. Dependencies
  6. Explainability >< debugging
  7. Composability of models
  8. Huge amount of data

Testing

what if we had 2 clusters + LB

instead of a “clone” of production

Observability

is a key component of a platform

Monitoring

Answers the question: Are we making money now ?

Observability >< Monitoring

Observability deduces the state of the system based on its output

SLI, SLO, SLA

These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service

Define SLI SLO

SLI / SLO / SLA

Start from the user experience

Nothing is perfect but we can always approximate
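As a sketch of how an SLI and an SLO relate, assume the SLI is the fraction of successful requests and the SLO is a target for that fraction. The function names and the 99.9% target used below are illustrative assumptions only.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, < 0 overspent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure
```

With 99,950 good requests out of 100,000 and an SLO of 0.999, about half of the error budget is left. The SLA is then the agreement on what happens when the budget runs out.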

Security

Hackers are bots now

CVEs are monitored by hackers

They scan the internet for outdated software to exploit

and code too

Deploying

bake the model in the container or not ?
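A sketch of how a serving container could support both answers. Everything here is an assumption for illustration: the `MODEL_URI` variable, the baked path and the function name are hypothetical.

```python
from typing import Mapping, Tuple

# Hypothetical location of a model file copied in at image build time.
BAKED_MODEL_PATH = "/opt/model/model.bin"


def resolve_model_source(env: Mapping[str, str]) -> Tuple[str, str]:
    """Decide where the model comes from when the container starts."""
    uri = env.get("MODEL_URI")
    if uri:
        # Not baked: small generic image, model fetched at startup.
        return ("download", uri)
    # Baked: image and model are versioned and rolled back together.
    return ("baked", BAKED_MODEL_PATH)
```

Baking gives one immutable artifact to roll out and back; downloading keeps images small and lets models change without a rebuild, at the cost of a startup dependency on storage.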

CICD

how to make changes to the system

And after a while

More models, more requests and more data

DORA

DevOps Research and Assessment report

https://cloud.google.com/blog/products/devops-sre/dora-2022-accelerate-state-of-devops-report-now-out

Speed is what matters
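The DORA metrics include lead time for changes and change failure rate. A minimal sketch of computing both from deployment records follows; the `Deployment` record and its fields are hypothetical, and real data would come from the CI/CD system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool


def median_lead_time(deployments: List[Deployment]) -> timedelta:
    """Median time from commit to running in production."""
    seconds = [
        (d.deployed_at - d.committed_at).total_seconds() for d in deployments
    ]
    return timedelta(seconds=median(seconds))


def change_failure_rate(deployments: List[Deployment]) -> float:
    """Fraction of deployments that caused a failure in production."""
    failures = sum(1 for d in deployments if d.caused_failure)
    return failures / len(deployments)
```

Tracking these over time is one way to check whether the platform is actually getting faster.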

💡 Getting better at getting better

💡 If it hurts, do it more often

Julien Bisconti

SRE / Data Engineer

contact

g.dev/julien

slides: bisconti.cloud