Building a data platform
for machine learning operations
Contents
- Who am I
- Problem definition
- MLOps
- Containers & Kubernetes
- Observability
- Conclusions
Julien Bisconti
Software Engineer
specialized in Google Cloud
previous talks
problem definition
how to:
Machine Learning models
in a reproducible manner
at scale?
Mental limitations
- # decisions / day
- # things to remember
- speed of memory / reflexes
We could build it
BUT
spending time on the business
makes more sense financially
build OR buy
Restaurants buy, cook and sell food.
Very few do farming, and even fewer are good at both.
how different is ML
- Various hardware
- Resource-heavy
- Various cycles
- Many languages
- Dependencies
- Explainability vs. debugging
- Composability of models
- Huge amount of data
And after a while
More models, more requests and more data
A platform needs:
- Command line interface
- User interface
- APIs
- SDKs
- Documentation
- + examples
- Migration paths
- Reliability
- Observability
- Security
- Cost management
- Support
PART III
Containers & Kubernetes
Containers
container image: zip file of app + dependencies
docker: program that runs the image
each container runs in its own namespace
Deployment
Containers: lightweight VMs
- 12 factor app
- easier deployment
- reproducible build
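One way to picture the "12 factor app" point: configuration lives in the environment, so the same container image can run unchanged in every deployment. The sketch below is only an illustration; the variable names (`MODEL_PATH`, `PORT`, `LOG_LEVEL`) and their defaults are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Runtime configuration; field names are hypothetical examples."""
    model_path: str
    port: int
    log_level: str


def load_settings(env=None):
    """Build settings from environment variables, with defaults for local runs."""
    env = os.environ if env is None else env
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/latest"),
        port=int(env.get("PORT", "8080")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Calling `load_settings()` inside the container reads whatever the deployment injected; passing a plain dict instead makes the function easy to test.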
but ...
how to orchestrate containers across many computers?
Deployment concerns
- Scaling up and down
- Redundancy
- Scheduling / Orchestration
- Service Discovery
- Resiliency
- Rolling out and back
- Health checks
- Secrets and config
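As an illustration of the health check concern, an orchestrator can poll an HTTP endpoint and restart or stop routing to a container that reports a failure. This is a minimal sketch using only the Python standard library; the `/healthz` path, the port, and the `model_loaded` check are assumptions, not something the talk prescribes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_status(checks):
    """Run each named check and return (http_status, report)."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = bool(check())
        except Exception:
            # A check that raises counts as unhealthy.
            report[name] = False
    status = 200 if all(report.values()) else 503
    return status, report


# Hypothetical check; a real service would verify its model, database, etc.
CHECKS = {"model_loaded": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, report = health_status(CHECKS)
        body = json.dumps(report).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block and serve health checks on the given port."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Running `serve()` inside the container exposes the endpoint; the orchestrator's probe would then be pointed at `/healthz`.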
Observability
What changes in your system?
- Site Reliability engineering
- Chaos/Resilience engineering
- FinOps 💸
- ⚠️ Language proliferation
- opentelemetry.io
(see prev talks)
Logging and microservices
Don’t do it
(UYKWYAD)
In distributed systems, logging is not debugging
💸 : # apps x $ (network + storage) x retention days
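The cost formula can be read literally as a small calculation. The sketch below assumes the network and storage costs are expressed per app per day; the example figures are made up purely to show the arithmetic.

```python
def logging_cost(num_apps, network_cost_per_day, storage_cost_per_day, retention_days):
    """Cost = # apps x $ (network + storage) x retention days.

    Both costs are assumed to be in dollars per app per day.
    """
    return num_apps * (network_cost_per_day + storage_cost_per_day) * retention_days


# Hypothetical numbers: 50 apps, $2 network + $3 storage per app per day, 30 days kept.
example = logging_cost(50, 2.0, 3.0, 30)
```

Every factor multiplies the others, so doubling the number of apps or the retention period doubles the bill.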
- Logging: (immutable) events. (Selfish traces)
- Metrics: just statistics over time
- Tracing: traces provide context in the life of a transaction
They help narrow down a problem and guide you to where to investigate.
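To make the three signals concrete, here is a toy sketch using only the Python standard library. It is not the OpenTelemetry API; the names `log_event`, `span`, and `handle_request` are hypothetical, and the "model" is a stand-in average.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics: numbers aggregated over time
spans = []           # tracing: timed steps that share one trace id


def log_event(message, **fields):
    """Logging: emit one immutable, structured event."""
    event = {"ts": time.time(), "message": message, **fields}
    print(json.dumps(event))
    return event


@contextmanager
def span(name, trace_id):
    """Tracing: record how long one step of a transaction took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(payload):
    """Serve one request while producing all three signals."""
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1
    with span("preprocess", trace_id):
        features = [float(x) for x in payload]
    with span("predict", trace_id):
        prediction = sum(features) / len(features)  # stand-in for a model
    log_event("prediction served", trace_id=trace_id, prediction=prediction)
    return trace_id, prediction
```

The shared `trace_id` is what ties the log event and the spans of a single request together, which is the context that isolated log lines lack.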
It is serverless the same way WiFi is wireless. At some point, it will hit a wire.
—
Gojko Adzic
CONCLUSIONS
- Consistency is key
- Context switching is expensive
- Re-use = able to share = caring
- More models & data tomorrow than today
THANK YOU
and I'm sorry 🙏
If you had to maintain my code
I hope you learned more by maintaining it
than me by writing it
contact
https://bisconti.cloud/
@julienBisconti
Slides made with Reveal.js and hugo-reveal