Building a data platform

for machine learning operations

Content

Who am I
Problem definition
MLOps
Containers & Kubernetes
Observability
Conclusions

Julien Bisconti

Software Engineer
specialized in Google Cloud

previous talks

PART I

problem definition

how to:

train
build
deploy
monitor

Machine Learning models

in a reproducible manner

at scale?

Hidden Technical Debt in Machine Learning Systems

Source: D. Sculley, et al.: Hidden Technical Debt in Machine Learning Systems

cost of context switching

Hidden Technical Debt in Machine Learning Systems paper

spreadsheet

source link

Mental limitations

# decisions / day
# things to remember
speed of memory / reflexes

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

build OR buy

whole thread

We could build it

BUT

spending time on the business

makes more sense financially

build OR buy

Restaurants buy, cook and sell food.

Very few do farming, and even fewer are good at both.

repository

PART II

MLOPS

#thisisdevops

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

ML platform assembly kit

Source: article by Clemens Mewald

how different is ML

Various hardware
Resource-heavy
Various cycles
Many languages

Dependencies
Explainability vs. debugging
Composability of models
Huge amount of data

And after a while

More models, more requests and more data

Consistency is key

source

Archives of the History of American Psychology, The Center for the
History of Psychology, The University of Akron

A platform needs:

Command line interface
User interface
APIs
SDKs
Documentation
+ examples

Migration paths
Reliability
Observability
Security
Cost management
Support

PART III

Containers & Kubernetes

Containers

container image: zip file of app + dependencies
docker: program that runs the image
each container runs in its own namespace

source link

Deployment

Containers: lightweight VMs

12 factor app
easier deploy
reproducible build
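One twelve-factor practice that makes containers easier to deploy is reading configuration from the environment, so the same image runs unchanged in every stage. A minimal sketch, assuming hypothetical variable names (`MODEL_PATH`, `PORT`, `LOG_LEVEL`) and defaults that are not from the talk:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Runtime settings for a model server (all names are illustrative)."""
    model_path: str
    port: int
    log_level: str


def load_settings(env=None):
    """Build Settings from environment variables, with explicit defaults.

    Passing a dict as `env` makes the function easy to test without
    touching the real process environment.
    """
    env = os.environ if env is None else env
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/latest"),
        port=int(env.get("PORT", "8080")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Calling `load_settings()` at startup keeps the image identical across environments; only the variables injected at deploy time differ.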

but ...

how to orchestrate containers across many computers?

Deployment concerns

Scaling up and down
Redundancy
Scheduling / Orchestration
Service Discovery

Resiliency
Rolling out and back
Health checks
Secret and config
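Several of these concerns rely on the application cooperating with the orchestrator. Health checks are a good example: the orchestrator polls an endpoint and restarts or stops routing traffic to instances that fail. A minimal sketch using only the Python standard library; the `/healthz` path and the JSON body are common conventions assumed here, not something the talk prescribes:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200 while the process is able to serve."""

    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Keep the example quiet; a real service would log or count requests.
        pass


def start_health_server(port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

In Kubernetes, an endpoint like this would typically be referenced by a liveness or readiness probe in the pod specification.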

Kubernetes

Batch

vs

Application

PART IV

Observability

What changes in your system?

Site Reliability Engineering
Chaos/Resilience Engineering
FinOps 💸
⚠️ Language proliferation
opentelemetry.io

(see prev talks)

Logging and microservices

Don’t do it

(UYKWYAD: unless you know what you are doing)

In distributed systems, logging is not debugging

💸: # apps × $ (network + storage) × retention days
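The cost formula above can be sketched as a small function. This is a rough estimate under stated assumptions: costs are expressed per app per day, and every number in the usage note is made up for illustration rather than taken from a real bill.

```python
def logging_cost(num_apps, network_cost_per_app_day,
                 storage_cost_per_app_day, retention_days):
    """Rough estimate: # apps x $ (network + storage) x retention days."""
    daily_cost_per_app = network_cost_per_app_day + storage_cost_per_app_day
    return num_apps * daily_cost_per_app * retention_days
```

With illustrative inputs of 50 apps, $0.50 network and $1.50 storage per app per day, and 30 days of retention, `logging_cost(50, 0.5, 1.5, 30)` returns `3000.0`. Every factor multiplies the others, so adding apps or extending retention scales the whole bill.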

Logging: (immutable) events (selfish traces)
Metrics: just statistics over time
Tracing: traces provide context in the life of a transaction

They help narrow down a problem and guide you to where to investigate.
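To make the difference between the three signals concrete, here is a toy sketch in standard-library Python. It is not the OpenTelemetry API; the function names, field names and the `predictions_total` counter are hypothetical and only illustrate what each signal carries.

```python
import json
import time
import uuid
from collections import Counter

# Metric: an aggregated number over time, with no per-request detail.
metrics = Counter()


def log_event(message, **fields):
    """Log: one immutable event, serialized as a JSON line."""
    event = {"ts": time.time(), "message": message, **fields}
    return json.dumps(event, sort_keys=True)


def start_span(name, trace_id=None, parent_id=None):
    """Trace: each span carries the trace_id shared by the whole transaction."""
    return {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "start": time.time(),
    }


def handle_request(features):
    """Hypothetical prediction request that emits all three signals."""
    root = start_span("handle_request")
    child = start_span("predict", trace_id=root["trace_id"],
                       parent_id=root["span_id"])
    prediction = sum(features)  # stand-in for a real model call
    metrics["predictions_total"] += 1
    line = log_event("prediction served", trace_id=root["trace_id"],
                     n_features=len(features))
    return prediction, [root, child], line
```

The shared `trace_id` links the log line and both spans to the same transaction, while the counter only says how many predictions were served overall.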

It is serverless the same way WiFi is wireless. At some point, it will hit a wire.
— Gojko Adzic

CONCLUSIONS

Consistency is key
Context switching is expensive
Re-use = able to share = caring
More models & data tomorrow than today

Resources

THANK YOU

and I'm sorry 🙏
If you had to maintain my code
I hope you learned more by maintaining it
than me by writing it

contact

https://bisconti.cloud/

@julienBisconti

Slides made with Reveal.js and hugo-reveal