manifesto.

ProductionML is an initiative to level up data engineers, enabling them to understand the fundamentals, technologies and algorithms they need to run machine learning systems in production at scale: on their own for trivial problems, and with the help of data scientists for the development and validation of novel, quant-heavy solutions.

Data Actors Venn Diagram

Over the past few years, data science has rightfully been the most hyped career track. Numerous online courses, bootcamps and books have been published on the topic. Most recently, fast.ai1 and Andrew Ng’s Deep Learning Specialization2 on Coursera have received due attention.

Despite all the hard work being put into spreading machine learning knowledge, we see companies (those that are not AMZN, GOOG, FB, MSFT or the like) struggle to build a machine-learning-backed feature/product3.

At most companies, the machine learning lifecycle consists of data scientists developing models offline and handing them off to data engineers, who productionize them and integrate them with the rest of the (microservice) ecosystem, at times using a completely different implementation technology (language/compute framework). This slows down the entire process for the company.

The data engineers want to automate these engineering handoffs, deployments and integrations with the rest of the enterprise ecosystem. The data scientists want to have confidence in the model’s ability, its effectiveness and its performance at scale. Most importantly, both data scientists and data engineers want to move up the machine learning maturity scale so they can adapt to the changing environment of the problem:

Machine Learning Spectrum

Today, ‘bias’ is a bug (in data-driven products) and models need to be secure, transparent and localized. We need machine learning to monitor models at scale4.

Our objective is to work through production technologies such as (but not limited to) Apache Spark5, ModelDB6, MLeap7 and Apache MXNet8, and to untangle cutting-edge research from pioneers such as RISELab9, the Google Brain team10 and Stanford DAWN11, in order to reduce the time it takes teams to take their models to production at scale. Eventually, teams should be able to build pipelines where models are trained, deployed, tuned and selected in an online manner, following best practices12 and being wary of “The High-Interest Credit Card of Technical Debt”13.

We aim to achieve this through bridge workshops, meetups and content aggregation, thereby helping solid data engineers learn enough machine learning to be dangerous and turn into unicorns, aka Machine Learning Engineers14.


Footnotes: