Building a data platform

for machine learning operations

Content

  • Who am I
  • Problem definition
  • MLOps
  • Containers & Kubernetes
  • Observability
  • Conclusions

Julien Bisconti

Software Engineer
specialized in Google Cloud


  • Google Cloud Professional Data Engineer certification
  • Google Developer Expert badge
  • Kubernetes certifications

previous talks

PART I

problem definition

how to:

  • train
  • build
  • deploy
  • monitor


Machine Learning models

in a reproducible manner

at scale?

Hidden Technical Debt in Machine Learning Systems

Figure from the paper. Source: D. Sculley et al., Hidden Technical Debt in Machine Learning Systems

cost of context switching

Hidden Technical Debt in Machine Learning Systems paper

spreadsheet

source link

Mental limitations


  • number of decisions per day
  • number of things to remember
  • speed of memory / reflexes

production-grade infrastructure

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

build OR buy

Tweet about datacenter

whole thread

We could build it

BUT

spending time on the business

makes more sense financially

build OR buy

Restaurants buy, cook and sell food.

Very few do farming and even fewer are good at both.

no code repository

PART II

MLOPS

#thisisdevops

this is devops

Yevgeniy Brikman - Lessons from 300k+ Lines of Infrastructure Code

ML platform assembly kit

data engineer toolbox Source: article by Clemens Mewald

how different is ML

  1. Various hardware
  2. Resource-heavy
  3. Various cycles
  4. Many languages
  5. Dependencies
  6. Explainability vs. debugging
  7. Composability of models
  8. Huge amounts of data

And after a while

More models, more requests and more data

Consistency is key

source

Archives of the History of American Psychology, The Center for the
History of Psychology, The University of Akron

army report uniformity

A platform needs:

  • Command line interface
  • User interface
  • APIs
  • SDKs
  • Documentation + examples
  • Migration paths
  • Reliability
  • Observability
  • Security
  • Cost management
  • Support

    PART III

    Containers & Kubernetes

    Containers

    container image: archive (like a zip file) of app + dependencies
    Docker: program that runs the image
    each container runs in its own set of Linux namespaces

    container vm

    source link

    Deployment

    Containers: lightweight VMs

    • 12 factor app
    • easier deploy
    • reproducible build
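One of the twelve-factor principles listed above is to keep configuration in the environment, so the same container image can run unchanged in every stage. A minimal sketch in Python, assuming hypothetical variable names (`MODEL_PATH`, `PORT`, `LOG_LEVEL`) that are not from the talk:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Runtime configuration; all field names here are illustrative."""
    model_path: str
    port: int
    log_level: str


def load_settings(env=None) -> Settings:
    # Accept an explicit mapping so the function is easy to test;
    # fall back to the real process environment otherwise.
    env = os.environ if env is None else env
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/latest"),
        port=int(env.get("PORT", "8080")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )


if __name__ == "__main__":
    # The image stays identical; only the environment differs per deployment.
    print(load_settings())
```

Because nothing is baked into the image, the build stays reproducible and the deployment decides the values.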


    but ...

    how to orchestrate containers across many computers?

    Deployment concerns

    • Scaling up and down
    • Redundancy
    • Scheduling / Orchestration
    • Service Discovery
    • Resiliency
    • Rolling out and back
    • Health checks
    • Secret and config
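To make the "health checks" concern above concrete, here is a minimal sketch of a service exposing liveness and readiness endpoints that an orchestrator could poll. The paths `/healthz` and `/readyz` and the `ready` flag are common conventions used here as assumptions, not something prescribed by the talk:

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def health_response(path, ready):
    """Map a request path to (status code, JSON payload)."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer.
        return 200, {"status": "alive"}
    if path == "/readyz":
        # Readiness: only accept traffic once dependencies (e.g. a model) are loaded.
        return (200 if ready else 503), {"ready": ready}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical flag; a real service would set it after loading its model.
    ready = True

    def do_GET(self):
        code, payload = health_response(self.path, HealthHandler.ready)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="0.0.0.0", port=8080):
    return ThreadingHTTPServer((host, port), HealthHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Separating liveness from readiness lets the orchestrator restart a dead process but merely stop routing traffic to one that is still warming up.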

    Kubernetes

    kubernetes architecture

    Batch

    vs

    Application

    PART IV

    Observability

    Observability

    What changes in your system?

    • Site Reliability engineering
    • Chaos/Resilience engineering
    • FinOps 💸
    • ⚠️ Language proliferation
    • opentelemetry.io

    (see prev talks)

    Logging and microservices

    Don’t do it

    (UYKWYAD: unless you know what you are doing)

    In distributed systems, logging is not debugging

    💸: number of apps × $ (network + storage) × retention days
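The cost formula above can be turned into a quick back-of-the-envelope calculation. This is only a sketch: the per-app daily prices and every number in the example are made-up assumptions, not figures from the talk or from any cloud provider.

```python
def logging_cost(apps, network_usd_per_app_day, storage_usd_per_app_day, retention_days):
    """Rough logging bill: apps x (network + storage) x retention days."""
    return apps * (network_usd_per_app_day + storage_usd_per_app_day) * retention_days


if __name__ == "__main__":
    # Hypothetical inputs: 50 apps, $0.25/day network and $0.75/day storage each,
    # logs kept for 30 days.
    print(logging_cost(50, 0.25, 0.75, 30))  # 1500.0
```

Every factor multiplies the others, so adding apps or extending retention grows the bill proportionally.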

    • Logging: (immutable) events (selfish traces)
    • Metrics: just statistics over time
    • Tracing: traces provide context in the life of a transaction

    They help narrow down a problem and guide you to where to investigate.
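The three signals can be illustrated with a small standard-library sketch. It deliberately does not use the OpenTelemetry API; the helper names (`log_event`, `span`, `handle_request`) and the in-memory `events` list are hypothetical stand-ins for a real backend.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics: aggregated counts over time
events = []          # stand-in for a log/trace backend


def log_event(trace_id, message, **fields):
    # Logging: an immutable event, tagged with the trace it belongs to.
    event = {"ts": time.time(), "trace_id": trace_id, "msg": message, **fields}
    events.append(event)
    return json.dumps(event)


@contextmanager
def span(trace_id, name):
    # Tracing: time one step of a transaction and attach it to the shared trace id.
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        metrics[f"{name}.count"] += 1
        log_event(trace_id, "span finished", span=name, duration_ms=round(duration_ms, 3))


def handle_request(payload):
    trace_id = uuid.uuid4().hex
    with span(trace_id, "handle_request"):
        with span(trace_id, "predict"):
            result = len(payload)  # placeholder for a model call
        log_event(trace_id, "prediction done", result=result)
    return trace_id, result


if __name__ == "__main__":
    handle_request("some input")
    print(dict(metrics))
```

The shared `trace_id` is what ties events from different steps together, which is the context that isolated log lines lack.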

    It is serverless the same way WiFi is wireless. At some point, it will hit a wire.

    Gojko Adzic

    CONCLUSIONS

    • Consistency is key
    • Context switching is expensive
    • Re-use = able to share = caring
    • More models & data tomorrow than today

    Resources

    THANK YOU

    and I'm sorry 🙏
    If you had to maintain my code
    I hope you learned more by maintaining it
    than me by writing it

    contact

    https://bisconti.cloud/

    @julienBisconti

    Slides made with Reveal.js and hugo-reveal