Building a data platform

for machine learning operations

Contents

  • Who am I
  • Problem definition
  • MLOps
  • Containers & Kubernetes
  • Observability
  • Conclusions

Julien Bisconti

Software Engineer
specialized in Google Cloud


Google Cloud Professional Data Engineer certification, Google Developer Expert badge, Kubernetes certifications

previous talks

PART I

problem definition

how to:

  • train
  • build
  • deploy
  • monitor


Machine Learning models

in a reproducible manner

at scale?

Hidden Technical Debt in Machine Learning Systems

Source: D. Sculley et al., Hidden Technical Debt in Machine Learning Systems

cost of context switching


spreadsheet


Mental limitations


  • # decisions / day
  • # things to remember
  • speed of memory / reflexes

production grade infrastructure

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

build OR buy

Tweet about datacenter

whole thread

We could build it

BUT

spending time on the business

makes more sense financially

build OR buy

Restaurants buy, cook and sell food.

Very few do farming and even fewer are good at both.

no code repository

PART II

MLOPS

#thisisdevops

this is devops

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

ML platform assembly kit

data engineer toolbox Source: article by Clemens Mewald

how different is ML

  1. Various hardware
  2. Resource heavy
  3. Various cycles
  4. Many languages
  5. Dependencies
  6. Explainability vs. debugging
  7. Composability of models
  8. Huge amount of data

And after a while

More models, more requests and more data

Consistency is key

Source: Archives of the History of American Psychology, The Center for the History of Psychology, The University of Akron

army report uniformity

A platform needs:

  • Command line interface
  • User interface
  • APIs
  • SDKs
  • Documentation
  • + examples
  • Migration paths
  • Reliability
  • Observability
  • Security
  • Cost management
  • Support

PART III

Containers & Kubernetes

Containers

container image: zip file of app + dependencies
Docker: program that runs the image
each container runs in its own namespace

container vm


Deployment

Containers: like lightweight VMs (but sharing the host kernel)

  • 12 factor app
  • easier deploy
  • reproducible build


but ...

how to orchestrate containers across many computers?

Deployment concerns

  • Scaling up and down
  • Redundancy
  • Scheduling / Orchestration
  • Service Discovery
  • Resiliency
  • Rolling out and back
  • Health checks
  • Secrets and config
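
To make the "Health checks" item concrete, here is a minimal sketch of liveness and readiness endpoints an orchestrator could probe. It uses only the Python standard library. The paths `/healthz` and `/readyz`, the `READY` flag and the port are assumptions chosen for illustration, not something the talk prescribes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: flip to True once the model is loaded.
READY = {"model_loaded": False}


def health_response(path, ready):
    """Return (HTTP status, JSON body) for a probe path."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer.
        return 200, {"status": "alive"}
    if path == "/readyz":
        # Readiness: only accept traffic once dependencies are loaded.
        if ready:
            return 200, {"status": "ready"}
        return 503, {"status": "loading"}
    return 404, {"status": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path, READY["model_loaded"])
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Blocking call: start the probe server (port is an assumption)."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keeping the decision logic in a plain function makes it testable without opening a socket. Call `serve()` to expose it inside a container.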

Kubernetes

kubernetes architecture

Batch

vs

Application

PART IV

Observability

Observability

What changes in your system?

  • Site Reliability engineering
  • Chaos/Resilience engineering
  • FinOps 💸
  • ⚠️ Language proliferation
  • opentelemetry.io

(see prev talks)

Logging and microservices

Don’t do it

(UYKWYAD)

In distributed systems, logging is not debugging

💸 : # apps x $ (network + storage) x retention days
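
As a rough sketch of that cost formula, the function below multiplies the number of apps by a per-app daily cost (network plus storage) and by the retention period. The parameter names and the per-app, per-day unit are assumptions made for illustration. Real cloud pricing is usually per GB and tiered.

```python
def logging_cost(num_apps, network_usd_per_app_day,
                 storage_usd_per_app_day, retention_days):
    """Rough logging bill: apps x (network + storage) x retention days.

    All unit prices are hypothetical inputs, expressed in USD per app per day.
    """
    daily_cost_per_app = network_usd_per_app_day + storage_usd_per_app_day
    return num_apps * daily_cost_per_app * retention_days


# Example with made-up numbers: 50 apps, $0.50 network and $1.50 storage
# per app per day, kept for 30 days.
example_cost = logging_cost(50, 0.5, 1.5, 30)
print(example_cost)  # 3000.0
```

Every factor multiplies the others, so doubling the number of services or the retention period doubles the bill.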

  • Logging: (immutable) events. (Selfish traces)
  • Metrics: just statistics over time
  • Tracing: traces provide context in the life of a transaction

They help narrow down a problem and guide you where to investigate.
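
The difference between the three signals can be sketched with the standard library alone. This is a toy illustration, not a replacement for a real telemetry SDK such as OpenTelemetry. The names `handle_request`, `span` and `log_event`, and the in-memory stores, are hypothetical.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics: aggregated numbers over time
spans = []           # traces: timed steps that share a trace id
events = []          # logs: immutable, self-contained events


def log_event(message, **fields):
    """Append one structured log event (serialized, so never mutated later)."""
    event = json.dumps({"ts": time.time(), "message": message, **fields},
                       sort_keys=True)
    events.append(event)
    return event


@contextmanager
def span(name, trace_id):
    """Record how long a step took, tagged with the transaction's trace id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"trace_id": trace_id, "name": name,
                      "duration_s": time.perf_counter() - start})


def handle_request(payload):
    trace_id = uuid.uuid4().hex  # one id follows the whole transaction
    metrics["requests_total"] += 1
    with span("validate", trace_id):
        ok = isinstance(payload, dict) and "features" in payload
    with span("predict", trace_id):
        # Stand-in for a model call: sum the features.
        result = sum(payload["features"]) if ok else None
    if not ok:
        metrics["errors_total"] += 1
        log_event("invalid payload", trace_id=trace_id)
    return trace_id, result
```

The metric says that errors are happening, the trace id says which transaction and which step, and the log event carries the detail for that one case.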

It is serverless the same way WiFi is wireless. At some point, it will hit a wire.

Gojko Adzic

CONCLUSIONS

  • Consistency is key
  • Context switching is expensive
  • Re-use = able to share = caring
  • More models & data tomorrow than today

Resources

THANK YOU

and I'm sorry 🙏
If you had to maintain my code
I hope you learned more by maintaining it
than me by writing it

contact

https://bisconti.cloud/

@julienBisconti

Slides made with Reveal.js and hugo-reveal