Building data/ML platforms

Lessons from the trenches

What is the purpose of

ML/data platforms?


people & systems

( Skills & Tools )

Conway’s Law

"organizations [...] are constrained to produce designs which are copies of the communication structures of these organizations"



can help fix an


but not its


how to scale a system

  • vertically
  • horizontally
  • deeply

village (100+) -> city (10k+) -> megacity (1m+)

same same but different

💡 we are not scaling a system up

we are creating a new system

that can sustain 100x the load

and we continuously migrate to it

💡 Always be migrating

or at least plan for it


create the new

deprecate the old

problem definition

how to:

  • train
  • build
  • deploy
  • monitor

Machine Learning models

in a reproducible manner

at scale?

Hidden Technical Debt in Machine Learning Systems

Source: D. Sculley et al., Hidden Technical Debt in Machine Learning Systems

cost of context switching


Mental limitations

  • # decisions / day
  • # things to remember
  • speed of memory / reflexes


reliable, def.: consistently good in quality or performance; able to be trusted.

“holistic approach toward deciding how to integrate development, deployment, production operations, and long-term care.”

src: “Reliable Machine Learning - Cathy Chen” (c)


* Requires good monitoring

What Chaos Engineering is not

pray to server

Hope is not a strategy

Strategy >< tactics

A tactic is a specific action taken to achieve a goal

A strategy is a plan to achieve a goal

ML platform assembly kit

data engineer toolbox

Source: article by Clemens Mewald

production-grade infrastructure

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

Main blockers (Infra)

  • networking
  • identity and access management (IAM)
  • resources organization
  • billing
  • policy management

Shared VPC

source: GCP blog


  • how to get access to the data
  • how to remove access to the data

Good to understand OAuth & TLS certificates

build OR buy

Tweet about datacenter



Laws might dictate that you have

to build your own


Build >< buy

time & skills >< Money

When things go wrong

AWS Shared Responsibility Model

At scale

everything hurts

[the cloud bill is not proportional to the number of customers but to the number of developers with access to the cloud account]

“Corey Quinn (@QuinnyPig)”

Cloud operation team

basically restricting what devs can do


Paying for a platform costs more

but allows us to validate our use cases faster


What is measured can be improved

Infrastructure as “Code”

  • Terraform (HCL)
  • CloudFormation (YAML)
  • Jsonnet (JSON)
  • Pulumi (TS, Python, Go, …)
  • CDKs (TS, Python, Go, …)

70% of outages are caused by a config change

prog. lang. (dynamic)


config. lang. (static)

Error messages

good error message

source: Jenni Nadler (c)

converting types

the most awkward and common thing to do!

  • gRPC/protobuf
  • AVRO
  • JSON
  • DBs
  • language types
  • ORMs
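
A minimal sketch of why type conversion is awkward, assuming a JSON payload: JSON has no timestamp or decimal type, so each field has to be mapped by hand in both directions. The `Prediction` record, its fields and the two helper functions are hypothetical and only serve as an illustration.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class Prediction:
    # Hypothetical record, used only for illustration.
    model: str
    score: Decimal
    created_at: datetime


def prediction_from_json(payload: str) -> Prediction:
    """Convert a JSON document into language types, failing loudly on bad input."""
    raw = json.loads(payload)
    return Prediction(
        model=str(raw["model"]),
        # JSON numbers become binary floats in Python; carrying the score as a
        # string in the payload avoids silent rounding.
        score=Decimal(raw["score"]),
        # JSON has no timestamp type, so an ISO 8601 string is assumed here.
        created_at=datetime.fromisoformat(raw["created_at"]),
    )


def prediction_to_json(prediction: Prediction) -> str:
    """Convert back to JSON, again spelling out every type mapping."""
    return json.dumps(
        {
            "model": prediction.model,
            "score": str(prediction.score),
            "created_at": prediction.created_at.isoformat(),
        }
    )


example = '{"model": "churn-v1", "score": "0.1", "created_at": "2024-01-01T00:00:00+00:00"}'
parsed = prediction_from_json(example)
round_tripped = prediction_from_json(prediction_to_json(parsed))
```

Every additional format in the list (protobuf, AVRO, database columns, ORM models) needs its own pair of conversions like this one, which is where most of the friction comes from.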

💡 Everything is a tradeoff

“best” is the enemy of “good”

Main blockers (ML)

  • data organization
  • data quality
  • data access/security
  • data privacy
  • data governance
  • data lineage
  • deployment

Deployment concerns

  • Scaling up and down
  • Redundancy
  • Scheduling / Orchestration
  • Service Discovery
  • Resiliency
  • Rolling out and back
  • Health checks
  • Secrets and config

Kubernetes architecture

Kubernetes concerns

  • Logging
  • Tracing
  • Metrics
  • Dependency visualisation
  • Service identity and Auth
  • Circuit breaking
  • Traffic flow and policies
  • Failover
  • Fault injection
  • ...

using a library, framework or service for:

  • circuit breaking
  • timeouts
  • retries
  • service discovery
  • client-side load balancing
  • metrics
  • traffic shaping
  • rate limiting
  • ...

for which languages?
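
To make the question concrete, here is a minimal sketch of one item from the list, retries with exponential backoff, written as a library function. Doing this in a library means it has to be rewritten and maintained for every language in use. The function name, its parameters and the `flaky` dependency are assumptions made for the example, not an existing API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on ConnectionError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Double the delay on each attempt and add jitter so that many
            # clients do not retry in lockstep.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# The sleep function is injected so the example runs without waiting.
result = call_with_retries(flaky, sleep=lambda _seconds: None)
```

Moving the same behaviour into a service or a proxy next to the application is the usual alternative when the number of languages grows.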

Programming languages

  • How many languages do we use?
  • How many languages do we know really well?

the bar for a new programming language in production at Google

source: "thread by Jaana Dogan" (c)


It is not the language, it is the tooling/opinions

testing, benchmarks, deployments, dependency management, security, code formatting, distribution, …

Consistency is key


Archives of the History of American Psychology, The Center for the
History of Psychology, The University of Akron

army report uniformity

💡 Consistency is a myth but a worthy goal



basically people talking to each other through APIs


this is devops

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

Dev + Ops = DevOps

DevOps + Data = DataOps

DataOps + ML = MLOps


how different is ML

  1. Various hardware
  2. Resource-heavy
  3. Various cycles
  4. Many languages
  5. Dependencies
  6. Explainability >< debugging
  7. Composability of models
  8. Huge amount of data


what if we had 2 clusters + LB

instead of a “clone” of production
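
One way to read the idea above is as a load balancer that sends a small, stable share of real traffic to the second cluster. The sketch below assumes that reading; the cluster names and the `pick_cluster` function are hypothetical, and a real load balancer would normally provide weighted routing itself.

```python
import hashlib


def pick_cluster(request_id: str, new_cluster_percent: int) -> str:
    """Route a stable share of requests to the new cluster.

    Hashing the request id keeps the decision deterministic, so the same
    request id always lands on the same cluster.
    """
    if not 0 <= new_cluster_percent <= 100:
        raise ValueError("new_cluster_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # Bucket in [0, 99].
    return "cluster-b" if bucket < new_cluster_percent else "cluster-a"


# Send roughly 10% of requests to the new cluster.
routed = [pick_cluster(f"request-{i}", 10) for i in range(10_000)]
share_on_new_cluster = routed.count("cluster-b") / len(routed)
```

Raising the percentage step by step moves traffic to the new cluster, and setting it back to 0 is the rollback.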


is a key component of a platform


Answers the question: Are we making money now?

Observability >< Monitoring

Observability deduces the state of the system based on its output


These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service

Define SLIs and SLOs
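
A minimal sketch of how an availability SLI, an SLO and the remaining error budget relate to each other. The 99.9% target and the request counts are assumptions chosen for the example, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    # Target fraction of good requests over the measurement window,
    # e.g. 0.999 for "three nines". Must leave room for some failures.
    target: float

    def __post_init__(self) -> None:
        if not 0 < self.target < 1:
            raise ValueError("target must be strictly between 0 and 1")


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that were served well."""
    if total_requests == 0:
        return 1.0  # No traffic: nothing was served badly.
    return good_requests / total_requests


def error_budget_remaining(
    slo: AvailabilitySLO, good_requests: int, total_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    if total_requests == 0:
        return 1.0
    allowed_failures = (1 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


slo = AvailabilitySLO(target=0.999)
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
budget = error_budget_remaining(slo, good_requests=999_500, total_requests=1_000_000)
```

With these numbers, 500 of the roughly 1,000 allowed failures are used, so about half of the budget is left. How to react when the budget runs out is the part that has to be agreed on beforehand.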


Start from the user experience

Nothing is perfect but we can always approximate


Hackers are bots now

CVEs are monitored by hackers

They scan the internet for outdated software to exploit

and code too


bake the model in the container or not?


how to make changes to the system

And after a while

More models, more requests and more data


DevOps Research and Assessment report

Speed is what matters

💡 Getting better at getting better

💡 If it hurts, do it more often

Julien Bisconti

SRE / Data Engineer
