what's your ml test score?

Using machine learning in real-world production systems is complicated by a host of issues not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for assessing the production-readiness of an ML system. But how much testing and monitoring is enough? We present an ML Test Score rubric based on a set of actionable tests to help quantify these issues.[1]

Compute your ML test score

For each test, one point is awarded for executing the test manually and documenting and distributing the results. A second point is awarded if there is a system in place to run that test automatically on a repeated basis. Sum the points within each of the four sections below; the final ML Test Score is the minimum of the four section totals (the scoring scheme is sketched in code below).

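To make the arithmetic concrete, here is a minimal sketch of the scoring scheme in Python, assuming each test is recorded as 0 (not done), 1 (run manually, with results documented and distributed), or 2 (also automated). The section names and point values are illustrative.

```python
def ml_test_score(points_by_section: dict[str, list[int]]) -> int:
    """Final score = minimum over sections of the summed per-test points."""
    assert all(p in (0, 1, 2) for pts in points_by_section.values() for p in pts)
    return min(sum(pts) for pts in points_by_section.values())

score = ml_test_score({
    "features_and_data": [2, 1, 0, 1, 2, 0, 1],
    "model_development": [1, 0, 0, 1, 2, 1, 0],
    "ml_infrastructure": [2, 2, 1, 1, 0, 0, 1],
    "monitoring":        [1, 1, 2, 0, 1, 1, 0],
})
print(score)  # 5: the weakest section (model_development) caps the score
```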

Section 1: Tests for Features and Data

1-1 Test that the distributions of each feature match your expectations (see the sketch after this section's list)
1-2 Test the relationship between each feature and the target, and the pairwise correlations between individual signals
1-3 Test the cost of each feature
1-4 Test that a model does not contain any features that have been manually determined as unsuitable for use
1-5 Test that your system maintains privacy controls across its entire data pipeline
1-6 Test the calendar time needed to develop and add a new feature to the production model
1-7 Test all code that creates input features, both in training and serving
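
To illustrate test 1-1, here is a minimal sketch that checks a batch of incoming feature values against hand-written expectations. The feature names, ranges, and null-fraction thresholds are hypothetical; in practice you would derive the expectations from a schema or from historical statistics.

```python
import pandas as pd

# Hypothetical per-feature expectations: allowed value range and the
# maximum tolerated fraction of missing values.
EXPECTATIONS = {
    "age": (0, 120, 0.01),
    "session_length_s": (0, 86_400, 0.05),
}

def check_feature_distributions(batch: pd.DataFrame) -> list[str]:
    """Return human-readable violations; an empty list means the test passes."""
    violations = []
    for name, (lo, hi, max_null_frac) in EXPECTATIONS.items():
        col = batch[name]
        null_frac = float(col.isna().mean())
        if null_frac > max_null_frac:
            violations.append(f"{name}: null fraction {null_frac:.3f} > {max_null_frac}")
        observed = col.dropna()
        if len(observed) and (observed.min() < lo or observed.max() > hi):
            violations.append(f"{name}: values outside [{lo}, {hi}]")
    return violations

batch = pd.DataFrame({"age": [34, 29, 51], "session_length_s": [120.0, 3600.0, 42.0]})
assert check_feature_distributions(batch) == []
```
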
Section 2: Tests for Model Development

2-1 Test that every model specification undergoes a code review and is checked in to a repository
2-2 Test the relationship between offline proxy metrics and the actual impact metrics
2-3 Test the impact of each tunable hyperparameter
2-4 Test the effect of model staleness
2-5 Test against a simpler model as a baseline
2-6 Test model quality on important data slices (see the sketch after this section's list)
2-7 Test the model for implicit bias
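
To illustrate test 2-6, here is a minimal sketch that asserts a per-slice quality floor, since an aggregate metric can hide a badly served subpopulation. The slice labels, toy data, and the 0.80 accuracy floor are hypothetical.

```python
import numpy as np

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float((y_true == y_pred).mean())

def check_quality_per_slice(y_true, y_pred, slice_ids, min_accuracy=0.80):
    """Fail if any slice falls below the floor, even if the aggregate looks fine."""
    failing = {}
    for s in np.unique(slice_ids):
        mask = slice_ids == s
        acc = accuracy(y_true[mask], y_pred[mask])
        if acc < min_accuracy:
            failing[str(s)] = round(acc, 3)
    assert not failing, f"slices below {min_accuracy}: {failing}"

# Toy data: aggregate accuracy is 0.875, but the 'new_users' slice is at 0.5.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0])
slices = np.array(["core"] * 6 + ["new_users"] * 2)
check_quality_per_slice(y_true, y_pred, slices)  # raises AssertionError
```
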
Section 3: Tests for ML Infrastructure

3-1 Test the reproducibility of training (see the sketch after this section's list)
3-2 Unit test model specification code
3-3 Integration test the full ML pipeline
3-4 Test model quality before attempting to serve it
3-5 Test that a single example or training batch can be sent to the model, and changes to internal state can be observed from training through to prediction
3-6 Test models via a canary process before they enter production serving environments
3-7 Test how quickly and safely a model can be rolled back to a previous serving version
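
To illustrate test 3-1, here is a minimal sketch that trains the same model twice under fixed seeds and asserts identical learned weights. It uses scikit-learn purely for illustration; a real test would retrain your actual pipeline and may need numeric tolerances for nondeterministic operations.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def train(seed: int) -> np.ndarray:
    """Train on fixed synthetic data; only the model's seed varies."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    model = SGDClassifier(random_state=seed, max_iter=100, tol=1e-3)
    model.fit(X, y)
    return model.coef_.copy()

def test_training_is_reproducible():
    # Two runs with identical data and seeds must yield identical weights.
    np.testing.assert_allclose(train(seed=7), train(seed=7))

test_training_is_reproducible()
```
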
Section 4: Monitoring Tests for ML

4-1 Test for upstream instability in features, both in training and serving
4-2 Test that data invariants hold in training and serving inputs
4-3 Test that your training and serving features compute the same values (see the sketch after this section's list)
4-4 Test for model staleness
4-5 Test for NaNs or infinities appearing in your model during training or serving
4-6 Test for dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage
4-7 Test for regressions in prediction quality on served data
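
To illustrate test 4-3, here is a minimal sketch that asserts the training-time and serving-time implementations of a feature agree on the same raw inputs. Both bucketize_* functions are hypothetical stand-ins for code paths that, in a real system, live in separate batch and serving stacks.

```python
import math

def bucketize_training(age: float) -> int:
    # Feature computation as written in the batch training pipeline.
    return min(int(age // 10), 9)

def bucketize_serving(age: float) -> int:
    # Independent reimplementation in the serving stack.
    return min(math.floor(age / 10), 9)

def test_training_serving_parity():
    for age in [0, 7.5, 10, 34.2, 99, 150]:
        assert bucketize_training(age) == bucketize_serving(age), age

test_training_serving_parity()
```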

Interpreting your score:

0 points: More of a research project than a productionized system.
(0, 1]: Not totally untested, but it is worth considering the possibility of serious holes in reliability.
(1, 2]: There's been a first pass at basic productionization, but additional investment may be needed.
(2, 3]: Reasonably tested, but it's possible that more of those tests and procedures may be automated.
(3, 5]: Strong levels of automated testing and monitoring, appropriate for mission-critical systems.
> 5: Exceptional levels of automated testing and monitoring.



Footnotes:

[1] Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley. "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." Proceedings of the IEEE International Conference on Big Data, 2017.