Machine Learning stories roundup 2022.10

Xin Cheng
5 min read · Oct 12, 2022


This series is focused on what is happening in machine learning.

MLOps

Overview

O'Reilly MLOps

Machine Learning Operations (MLOps): Overview, Definition, and Architecture

MLOps

MLOps Platform

Thoughtworks Evaluating MLOps Platforms

Thoughtworks, a company closely associated with agile software development, strategy, and design, discusses how to evaluate a machine learning/MLOps platform (typical lifecycle: data storage, processing, experimenting, visualizing, training, deploying, monitoring; typical ML platform capabilities: feature engineering/feature store, model training, model serving, model registry, rollouts, monitoring). It also shares a public Git repo with some of the evaluation criteria.

An MLOps platform can also be assembled from open source, e.g. Kubeflow for end-to-end machine learning on Kubernetes, MLflow for experiment tracking, Seldon for ML model serving, and Prometheus/Grafana for monitoring ML application metrics. Here is a more comprehensive article about tooling in the MLOps landscape.
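As a tiny taste of the metric-monitoring piece, the official prometheus_client package can expose serving metrics for Prometheus/Grafana to scrape. A minimal sketch; the metric names are made up for this example:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical serving metrics; the names are invented for illustration.
PREDICTIONS = Counter("model_predictions_total", "Predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency")

start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics

for _ in range(100):
    with LATENCY.time():                  # times the block it wraps
        time.sleep(random.random() / 10)  # stand-in for model inference
    PREDICTIONS.inc()
```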

The article lists the core activities in each stage of the machine learning lifecycle.

The article covers multiple areas in the MLOps space and offers simpler solutions when the requirements do not call for comprehensive features (e.g. if data does not drift much and stays within an acceptable range, scheduled retraining is much simpler than drift detection with continuous retraining). It is a reminder of the KISS (keep it simple, stupid) principle: don't over-engineer.

MLOps on major cloud providers

Feature engineering

Featuretools and Feature-engine are feature engineering Python packages. Some feature engineering techniques are generating aggregation features (e.g. count, mean, max, min) and feature interactions (sum of features, difference of features, etc.).
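Both techniques take only a few lines of plain pandas; here is a minimal sketch on an invented transactions table:

```python
import pandas as pd

# Invented transactions table: one row per transaction.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 40.0, 5.0, 5.0, 20.0],
    "balance": [100.0, 90.0, 50.0, 45.0, 40.0],
})

# Aggregation features: per-customer count/mean/max/min of amount.
agg_features = df.groupby("customer_id")["amount"].agg(
    ["count", "mean", "max", "min"])

# Feature interactions: sum and difference of two raw features.
df["amount_plus_balance"] = df["amount"] + df["balance"]
df["balance_minus_amount"] = df["balance"] - df["amount"]
```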

The above two articles talk about the main automated feature engineering tools: Featuretools, TSFresh, Featurewiz, PyCaret, Autofeat, and FeatureSelector. Featuretools can handle most relational datasets and time series; TSFresh works specifically on time series data.
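For a flavor of Featuretools, here is a minimal deep feature synthesis sketch; the tables are invented and the calls follow the Featuretools 1.x API, so treat the exact names as an assumption:

```python
import featuretools as ft
import pandas as pd

# Invented parent/child tables: customers and their transactions.
customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 2],
    "amount": [10.0, 40.0, 5.0, 20.0],
})

es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id")
es = es.add_relationship("customers", "customer_id",
                         "transactions", "customer_id")

# Deep feature synthesis: auto-generate per-customer aggregation features.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "mean", "max", "min"],
)
print(feature_matrix.columns.tolist())
```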

Model monitoring

Model drift: Over time, even highly accurate models decay as the incoming data shifts away from the original training set.

Types of model drift

Concept drift: a change in the underlying relationship between features and outcomes, i.e. the probability of output Y given input X, or P(Y|X). For example, applicants with the same feature values (e.g. income, credit score, age) could become more or less risky because the underlying decision boundary changes.

Data drift: simply a change in the data distribution, without considering the relationship between features and labels.

Feature drift: a change in the distribution of the input features, e.g. we may receive more applicants with higher ages.

Label drift: a change in the distribution of the predicted variable, e.g. the ratio of loan approvals to non-approvals changes.
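To make the distinction concrete: feature/data drift changes P(X) while the decision rule stays fixed, whereas concept drift changes P(Y|X) itself. A toy numpy simulation of a credit-risk setting (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training-time world (invented): income ~ N(50, 10); risky if income < 40.
income_train = rng.normal(50, 10, 10_000)
risky_train = (income_train < 40).astype(int)

# Feature/data drift: P(X) shifts (higher incomes), same decision rule.
income_live = rng.normal(60, 10, 10_000)
risky_live = (income_live < 40).astype(int)

# Concept drift: same P(X), but P(Y|X) changes -- the decision boundary
# moves, so applicants with identical features are now riskier.
risky_new_rule = (income_train < 45).astype(int)

# Label drift shows up as a changed positive rate in either scenario.
print(risky_train.mean(), risky_live.mean(), risky_new_rule.mean())
```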

The article discusses a tool called NannyML, which supports the following drift analyses:

Univariate Drift Analysis: detects a distribution change in a single feature (supports continuous and categorical variables)

Multivariate Drift Analysis: detects a distribution change across combinations of multiple features
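For the multivariate case, NannyML's documentation describes a data-reconstruction-with-PCA approach: fit PCA on reference data and track the reconstruction error on new data, where a rising error suggests the joint feature distribution has moved. Below is a rough from-scratch illustration of that idea using scikit-learn, not NannyML's actual code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Reference data with correlated features (2 latent factors, 5 features).
latent = rng.normal(size=(5_000, 2))
mixing = rng.normal(size=(2, 5))
reference = latent @ mixing + 0.1 * rng.normal(size=(5_000, 5))

# Analysis data where the correlation structure has drifted.
analysis = reference.copy()
analysis[:, 0] = rng.normal(size=5_000)  # feature 0 decouples from the rest

scaler = StandardScaler().fit(reference)
pca = PCA(n_components=2).fit(scaler.transform(reference))

def reconstruction_error(X):
    """Mean distance between points and their PCA reconstructions."""
    Z = pca.transform(scaler.transform(X))
    X_hat = pca.inverse_transform(Z)
    return float(np.mean(np.linalg.norm(scaler.transform(X) - X_hat, axis=1)))

print(reconstruction_error(reference))  # baseline error on reference data
print(reconstruction_error(analysis))   # noticeably larger under drift
```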

Ways to detect model drift

If you have labeled data, compare the model's performance over time (e.g. F1 score, AUC).

If you have unlabeled data, compare the training data with the post-training (live) data. The article introduces some methods, illustrated in the sketch after this list:

  • Kolmogorov-Smirnov (K-S) test: The K-S test is a nonparametric test that compares the cumulative distributions of two data sets, in this case the training data and the post-training data. The null hypothesis for this test states that the distributions of both datasets are identical; if it is rejected, you can conclude that your model has drifted.
  • Population Stability Index (PSI): The PSI is a metric used to measure how a variable’s distribution has changed over time. It is a popular metric for monitoring changes in the characteristics of a population, and thus for detecting model decay.
  • Z-score: Lastly, you can compare the feature distribution between the training and live data using the z-score. For example, if a number of live data points of a given variable have a z-score beyond +/- 3, the distribution of the variable may have shifted.
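Here is a compact numpy/scipy sketch of all three checks on a single synthetic feature; the thresholds shown are common rules of thumb, not universal constants:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 10_000)  # feature values seen in training
live = rng.normal(0.3, 1.0, 10_000)   # same feature, post-deployment

# 1. Kolmogorov-Smirnov: reject "same distribution" at small p-values.
stat, p_value = ks_2samp(train, live)

# 2. Population Stability Index over shared bins.
#    Rule of thumb (conventions vary): PSI > 0.2 suggests a big shift.
bins = np.histogram_bin_edges(train, bins=10)
expected = np.histogram(train, bins=bins)[0] / len(train)
actual = np.histogram(live, bins=bins)[0] / len(live)
eps = 1e-6  # avoids division by zero in empty bins
psi = np.sum((actual - expected) * np.log((actual + eps) / (expected + eps)))

# 3. Z-score of live points under the training distribution.
z = (live - train.mean()) / train.std()
share_extreme = np.mean(np.abs(z) > 3)  # fraction beyond +/- 3 sigma

print(f"KS p={p_value:.4f}, PSI={psi:.3f}, extreme z share={share_extreme:.3%}")
```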

Data versioning. Source code versioning is already part of a software developer's life, but managing data the way we manage code is still at an early stage. There are roughly two situations: data that lives on a developer's machine (small data/files), where DVC is popular for simulating a Git-like workflow while storing the data in an object store rather than in Git itself, and big data, which needs a data lake solution.
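For the small-data case, DVC also exposes a Python API for reading a specific version of a tracked file; the path, repo URL, and revision below are placeholders:

```python
import dvc.api

# Read one version of a DVC-tracked file; path, repo, and rev are
# placeholders for illustration only.
data = dvc.api.read(
    "data/train.csv",
    repo="https://github.com/example/project",
    rev="v1.0",  # Git tag/commit that pins the data version
)
```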

Machine learning training generally involves lots of experiments to find the best-performing configuration of training data, metrics, hyperparameters, code, etc. Some tools mentioned in the article go beyond experiment management, e.g. Pachyderm, Kubeflow, SageMaker, DVC. The MLOps space is still fragmented, with lots of players. More generally, this falls under the machine learning metadata store, which can involve tracking metadata at different stages (e.g. experiment, staging, production, decommission).
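As a minimal example of experiment tracking, MLflow (mentioned earlier) records each run's parameters and metrics; the experiment name and numbers below are invented:

```python
import mlflow

mlflow.set_experiment("loan-default")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline"):
    # Log the configuration under test...
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_param("n_estimators", 200)
    # ...train the model here, then log how it performed.
    mlflow.log_metric("auc", 0.87)  # invented number for the example
```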

Among model management components, the model registry lets developers manage the lifecycle of machine learning models: one model can have different versions (unlike code versions, models are generally much larger in size), living in different environments (staging, production, archived/retired).
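As one concrete sketch, MLflow's model registry assigns versions to a registered model and moves them between stages; the model name is a placeholder, and this shows MLflow's classic stages API (the registry needs a database-backed tracking store, hence the SQLite URI):

```python
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.linear_model import LogisticRegression

# The registry needs a database-backed store; SQLite works locally.
mlflow.set_tracking_uri("sqlite:///mlflow.db")

# Train and log a toy model, then register it under a (made-up) name.
with mlflow.start_run() as run:
    model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    mlflow.sklearn.log_model(model, artifact_path="model")

version = mlflow.register_model(f"runs:/{run.info.run_id}/model",
                                "loan-default-model")

# Promote that registered version to the Production stage.
MlflowClient().transition_model_version_stage(
    name="loan-default-model", version=version.version, stage="Production")
```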

AI Ethics

Some of the ethical aspects mentioned in Microsoft Responsible AI are related to AI security. The core of AI ethics is fairness and inclusiveness, while transparency is more related to model interpretability, privacy is related to federated machine learning (keeping training data private, unseen by other parties), and reliability is related to robustness against adversarial attacks.

Fairness is about understanding whether predictions are biased towards certain groups. Interpretable models make it easier to see which features are important and then help correct bias, e.g. by re-weighting data points to shift the distribution of groups, as mentioned in this article.
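The re-weighting idea can be made concrete with the classic reweighing scheme (Kamiran & Calders): give each (group, label) combination the weight P(group)·P(label)/P(group, label), so the weighted data looks as if group and label were independent. A small pandas sketch with invented data:

```python
import pandas as pd

# Invented applicant data: 'group' is the protected attribute.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "label": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Reweighing: w(g, y) = P(group=g) * P(label=y) / P(group=g, label=y).
p_group = df["group"].value_counts(normalize=True)
p_label = df["label"].value_counts(normalize=True)
p_joint = df.groupby(["group", "label"]).size() / len(df)

df["weight"] = [
    p_group[g] * p_label[y] / p_joint[(g, y)]
    for g, y in zip(df["group"], df["label"])
]
# These weights can be passed to most classifiers via sample_weight.
```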

Others

Nature-inspired algorithms
