Machine Learning stories roundup 2022.10

Xin Cheng
5 min read · Oct 12, 2022


This series is focused on what is happening in machine learning.

MLOps

Overview

O'Reilly MLOps

Machine Learning Operations (MLOps): Overview, Definition, and Architecture

MLOps

MLOps Platform

Thoughtworks Evaluating MLOps Platforms

Thoughtworks, a company closely associated with agile software development, strategy, and design, discusses how to evaluate a machine learning/MLOps platform (typical lifecycle: data storage, processing, experimenting, visualizing, training, deploying, monitoring; typical ML platform capabilities: feature engineering/feature store, model training, model serving, model registry, rollouts, monitoring). It also shares a public Git repo with some of the evaluation criteria.

An MLOps platform can also be assembled from open source, e.g. Kubeflow for end-to-end machine learning on Kubernetes, MLflow for experiment tracking, Seldon for ML model serving, and Prometheus/Grafana for monitoring ML application metrics. Here is a more comprehensive article about tooling in the MLOps landscape.
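As a tiny taste of the metric-monitoring piece, the official prometheus_client package can expose serving metrics for Prometheus/Grafana to scrape. A minimal sketch; the metric names are made up for this example:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical serving metrics; the names are invented for illustration.
PREDICTIONS = Counter("model_predictions_total", "Predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency")

start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics

for _ in range(100):
    with LATENCY.time():                  # times the block it wraps
        time.sleep(random.random() / 10)  # stand-in for model inference
    PREDICTIONS.inc()
```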

The article lists the core activities in each stage of the machine learning lifecycle.

The article covers multiple areas in the MLOps space and offers simpler solutions when the requirements do not call for comprehensive features (e.g. if data does not drift much and stays within an acceptable range, scheduled retraining is much simpler than drift detection with continuous retraining). It is a reminder of the KISS (keep it simple, stupid) principle: don't over-engineer.

MLOps on major cloud providers

Feature engineering

Featuretools and Feature-engine are feature engineering Python packages. Some feature engineering techniques are generating aggregation features (e.g. count, mean, max, min) and feature interactions (sum of features, difference of features, etc.).
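Both techniques take only a few lines of plain pandas; here is a minimal sketch on an invented transactions table:

```python
import pandas as pd

# Invented transactions table: one row per transaction.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 40.0, 5.0, 5.0, 20.0],
    "balance": [100.0, 90.0, 50.0, 45.0, 40.0],
})

# Aggregation features: per-customer count/mean/max/min of amount.
agg_features = df.groupby("customer_id")["amount"].agg(
    ["count", "mean", "max", "min"])

# Feature interactions: sum and difference of two raw features.
df["amount_plus_balance"] = df["amount"] + df["balance"]
df["balance_minus_amount"] = df["balance"] - df["amount"]
```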

The above two articles talk about the main automated feature engineering tools: Featuretools, TSFresh, Featurewiz, PyCaret, Autofeat, and FeatureSelector. Featuretools can handle most relational datasets and time series; TSFresh works specifically on time series data.
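For a flavor of Featuretools, here is a minimal deep feature synthesis sketch; the tables are invented and the calls follow the Featuretools 1.x API, so treat the exact names as an assumption:

```python
import featuretools as ft
import pandas as pd

# Invented parent/child tables: customers and their transactions.
customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 2],
    "amount": [10.0, 40.0, 5.0, 20.0],
})

es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id")
es = es.add_relationship("customers", "customer_id",
                         "transactions", "customer_id")

# Deep feature synthesis: auto-generate per-customer aggregation features.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "mean", "max", "min"],
)
print(feature_matrix.columns.tolist())
```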

Model monitoring

Model drift: Over time, even highly accurate models decay as the incoming data shifts away from the original training set.

Types of model drift

Concept drift: a change in the underlying relationship between features and outcomes, i.e. the probability of output Y given input X, or P(Y|X). For example, applicants with the same feature values (e.g. income, credit score, age) could become more or less risky because the underlying decision boundary changes.

Data drift: simply a change in the data distribution, without considering the relationship between features and labels.

Feature drift: a change in the distribution of the input features, e.g. we may receive more applicants with higher ages.

Label drift: a change in the distribution of the predicted variable, e.g. the ratio of loan approvals to non-approvals changes.
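To make the distinction concrete: feature/data drift changes P(X) while the decision rule stays fixed, whereas concept drift changes P(Y|X) itself. A toy numpy simulation of a credit-risk setting (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training-time world (invented): income ~ N(50, 10); risky if income < 40.
income_train = rng.normal(50, 10, 10_000)
risky_train = (income_train < 40).astype(int)

# Feature/data drift: P(X) shifts (higher incomes), same decision rule.
income_live = rng.normal(60, 10, 10_000)
risky_live = (income_live < 40).astype(int)

# Concept drift: same P(X), but P(Y|X) changes -- the decision boundary
# moves, so applicants with identical features are now riskier.
risky_new_rule = (income_train < 45).astype(int)

# Label drift shows up as a changed positive rate in either scenario.
print(risky_train.mean(), risky_live.mean(), risky_new_rule.mean())
```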

The article discusses a tool called NannyML, which supports the following drift analyses:

Univariate Drift Analysis: detects a distribution change in a single feature (supports continuous and categorical variables)

Multivariate Drift Analysis: detects a distribution change across combinations of multiple features
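For the multivariate case, NannyML's documentation describes a data-reconstruction-with-PCA approach: fit PCA on reference data and track the reconstruction error on new data, where a rising error suggests the joint feature distribution has moved. Below is a rough from-scratch illustration of that idea using scikit-learn, not NannyML's actual code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Reference data with correlated features (2 latent factors, 5 features).
latent = rng.normal(size=(5_000, 2))
mixing = rng.normal(size=(2, 5))
reference = latent @ mixing + 0.1 * rng.normal(size=(5_000, 5))

# Analysis data where the correlation structure has drifted.
analysis = reference.copy()
analysis[:, 0] = rng.normal(size=5_000)  # feature 0 decouples from the rest

scaler = StandardScaler().fit(reference)
pca = PCA(n_components=2).fit(scaler.transform(reference))

def reconstruction_error(X):
    """Mean distance between points and their PCA reconstructions."""
    Z = pca.transform(scaler.transform(X))
    X_hat = pca.inverse_transform(Z)
    return float(np.mean(np.linalg.norm(scaler.transform(X) - X_hat, axis=1)))

print(reconstruction_error(reference))  # baseline error on reference data
print(reconstruction_error(analysis))   # noticeably larger under drift
```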

Ways to detect model drift

If you have labeled data, compare the model's performance over time (e.g. F1 score, AUC).

If you have unlabeled data, compare the training data with the post-training (live) data. The article introduces some methods, illustrated in the sketch after this list:

  • Kolmogorov-Smirnov (K-S) test: The K-S test is a nonparametric test that compares the cumulative distributions of two data sets, in this case the training data and the post-training data. The null hypothesis for this test states that the distributions of both datasets are identical; if it is rejected, you can conclude that your model has drifted.
  • Population Stability Index (PSI): The PSI is a metric used to measure how a variable’s distribution has changed over time. It is a popular metric for monitoring changes in the characteristics of a population, and thus for detecting model decay.
  • Z-score: Lastly, you can compare the feature distribution between the training and live data using the z-score. For example, if a number of live data points of a given variable have a z-score beyond +/- 3, the distribution of the variable may have shifted.
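Here is a compact numpy/scipy sketch of all three checks on a single synthetic feature; the thresholds shown are common rules of thumb, not universal constants:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 10_000)  # feature values seen in training
live = rng.normal(0.3, 1.0, 10_000)   # same feature, post-deployment

# 1. Kolmogorov-Smirnov: reject "same distribution" at small p-values.
stat, p_value = ks_2samp(train, live)

# 2. Population Stability Index over shared bins.
#    Rule of thumb (conventions vary): PSI > 0.2 suggests a big shift.
bins = np.histogram_bin_edges(train, bins=10)
expected = np.histogram(train, bins=bins)[0] / len(train)
actual = np.histogram(live, bins=bins)[0] / len(live)
eps = 1e-6  # avoids division by zero in empty bins
psi = np.sum((actual - expected) * np.log((actual + eps) / (expected + eps)))

# 3. Z-score of live points under the training distribution.
z = (live - train.mean()) / train.std()
share_extreme = np.mean(np.abs(z) > 3)  # fraction beyond +/- 3 sigma

print(f"KS p={p_value:.4f}, PSI={psi:.3f}, extreme z share={share_extreme:.3%}")
```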

Data versioning. Source code versioning is already part of a software developer's life, but managing data the way we manage code is still at an early stage. There are roughly two situations: data that lives on a developer's machine (small data/files), where DVC is popular for simulating a Git-like workflow while storing the data in an object store rather than in Git itself, and big data, which needs a data lake solution.
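For the small-data case, DVC also exposes a Python API for reading a specific version of a tracked file; the path, repo URL, and revision below are placeholders:

```python
import dvc.api

# Read one version of a DVC-tracked file; path, repo, and rev are
# placeholders for illustration only.
data = dvc.api.read(
    "data/train.csv",
    repo="https://github.com/example/project",
    rev="v1.0",  # Git tag/commit that pins the data version
)
```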

Machine learning training generally involves lots of experiments to find the best-performing configuration of training data, metrics, hyperparameters, code, etc. Some tools mentioned in the article go beyond experiment management, e.g. Pachyderm, Kubeflow, SageMaker, DVC. The MLOps space is still fragmented, with lots of players. More generally, this falls under the machine learning metadata store, which can involve tracking metadata at different stages (e.g. experiment, staging, production, decommission).
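As a minimal example of experiment tracking, MLflow (mentioned earlier) records each run's parameters and metrics; the experiment name and numbers below are invented:

```python
import mlflow

mlflow.set_experiment("loan-default")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline"):
    # Log the configuration under test...
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_param("n_estimators", 200)
    # ...train the model here, then log how it performed.
    mlflow.log_metric("auc", 0.87)  # invented number for the example
```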

Among model management components, the model registry lets developers manage the lifecycle of machine learning models: one model can have different versions (unlike code versions, models are generally much larger in size), living in different environments (staging, production, archived/retired).
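As one concrete sketch, MLflow's model registry assigns versions to a registered model and moves them between stages; the model name is a placeholder, and this shows MLflow's classic stages API (the registry needs a database-backed tracking store, hence the SQLite URI):

```python
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.linear_model import LogisticRegression

# The registry needs a database-backed store; SQLite works locally.
mlflow.set_tracking_uri("sqlite:///mlflow.db")

# Train and log a toy model, then register it under a (made-up) name.
with mlflow.start_run() as run:
    model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    mlflow.sklearn.log_model(model, artifact_path="model")

version = mlflow.register_model(f"runs:/{run.info.run_id}/model",
                                "loan-default-model")

# Promote that registered version to the Production stage.
MlflowClient().transition_model_version_stage(
    name="loan-default-model", version=version.version, stage="Production")
```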

AI Ethics

Some of the ethical aspects mentioned in Microsoft Responsible AI are related to AI security. The core of AI ethics is fairness and inclusiveness, while transparency is more related to model interpretability, privacy is related to federated machine learning (keeping training data private, unseen by other parties), and reliability is related to robustness against adversarial attacks.

Fairness is about understanding whether predictions are biased towards certain groups. Interpretable models make it easier to see which features are important and then help correct bias, e.g. by re-weighting data points to shift the distribution of groups, as mentioned in this article.
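The re-weighting idea can be made concrete with the classic reweighing scheme (Kamiran & Calders): give each (group, label) combination the weight P(group)·P(label)/P(group, label), so the weighted data looks as if group and label were independent. A small pandas sketch with invented data:

```python
import pandas as pd

# Invented applicant data: 'group' is the protected attribute.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "label": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Reweighing: w(g, y) = P(group=g) * P(label=y) / P(group=g, label=y).
p_group = df["group"].value_counts(normalize=True)
p_label = df["label"].value_counts(normalize=True)
p_joint = df.groupby(["group", "label"]).size() / len(df)

df["weight"] = [
    p_group[g] * p_label[y] / p_joint[(g, y)]
    for g, y in zip(df["group"], df["label"])
]
# These weights can be passed to most classifiers via sample_weight.
```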

Others

Nature-inspired algorithms
