I'm the Co-Founder and CTO of NannyML (https://github.com/NannyML/nannyml). We detect silent model failure, estimate the performance of ML models without access to target data, and robustly identify data drift that might have caused a drop in performance. At NannyML, I lead the research and product teams, contributing to novel algorithms in model monitoring. I am also an experienced speaker at global conferences, having taken the stage at multiple events since 2020, including:
• Web Summit
• PyData
• MLCon
• Data Science Summit
• Machine Learning Week
Univariate drift detection is a good starting point in ML monitoring, but it can fail when the drift occurs in the relationships between multiple variables. The Domain Classifier is one method that detects multivariate drift. Instead of inspecting each feature in isolation, it treats drift detection as a binary classification problem. Here's how it works:
1. A machine learning classifier (often LightGBM) is trained to distinguish between two datasets: the reference set (resembles training data) and the analysis set (resembles production data).
2. If the classifier struggles to tell them apart, your data distributions are likely similar.
3. A high AUROC signals that your data is drifting, possibly in ways feature-level tests would miss.
If you're looking to catch subtle, multivariate drift in complex datasets, the Domain Classifier method provides a powerful, proactive solution.
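Here's a minimal sketch of the idea using LightGBM and scikit-learn. It illustrates the mechanism only, not NannyML's exact implementation, and the function name and assumption of numeric features are mine:

```python
# Domain classifier sketch: label reference rows 0 and analysis rows 1,
# train a classifier to separate them, and read cross-validated AUROC as a drift score.
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def domain_classifier_auroc(reference: pd.DataFrame, analysis: pd.DataFrame) -> float:
    # assumes numeric feature columns in both dataframes
    X = pd.concat([reference, analysis], ignore_index=True)
    y = np.r_[np.zeros(len(reference)), np.ones(len(analysis))].astype(int)
    proba = cross_val_predict(LGBMClassifier(), X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

# AUROC near 0.5: the two sets look alike. AUROC near 1.0: multivariate drift.
```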
The technical performance of AI models is often disconnected from the business side. Accuracy, precision, and ROC AUC might tell you how well your model is doing technically, but they won't explain whether it's helping to increase profits, reduce costs, or improve user experience. That's where custom metrics step in, connecting model performance to business outcomes, and they're much easier to set up than you might think. To track a custom metric for a regression model, you just need two simple things:
1️⃣ A loss function that computes the metric per observation
2️⃣ An aggregation function that rolls it up
That's it. With these, you can track any number of metrics and estimate them even when ground truth is absent.
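For illustration, here's a minimal sketch of those two ingredients for a regression model. The asymmetric cost numbers are made up, and this is not NannyML's custom-metric API:

```python
# Hypothetical business metric: under-forecasting costs 10 per unit,
# over-forecasting costs 2 per unit.
import numpy as np

def loss(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    diff = y_pred - y_true
    return np.where(diff < 0, -10 * diff, 2 * diff)   # per-observation cost

def aggregate(per_row_loss: np.ndarray) -> float:
    return float(per_row_loss.mean())                  # average cost per prediction

y_true = np.array([100.0, 120.0, 80.0])
y_pred = np.array([90.0, 125.0, 80.0])
print(aggregate(loss(y_true, y_pred)))                 # (10*10 + 2*5 + 0) / 3 ≈ 36.67
```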
Are you deploying a credit card fraud detection ML model? Here's how it can degrade in real-world scenarios:
→ Concept Drift – Fraud tactics change constantly. What was once a red flag might not be anymore, and new fraud strategies emerge.
→ Covariate Shift – Customer spending habits evolve, new payment methods arise, and anti-fraud policies alter transaction behavior. The model was trained on one distribution, but real-world data follows another.
→ Data Quality Issues – Inconsistent data formatting, errors, missing values, and mislabeled fraud cases distort model learning.
→ Delayed Ground Truth Data – Fraud labels often come late (e.g., chargebacks take weeks), making real-time evaluation difficult and slowing model updates.
Fraud detection is a data-centric, adversarial problem. I have just the blog that discusses these issues in detail and how to resolve them. Link in the comments.
How do you handle model degradation in production? Let's discuss. 👇
Why can't we use univariate drift methods to predict performance during ML monitoring?
First, with enough features, false alarms are inevitable. If we have 100 or 200 features, some will always drift, and the model usually handles that well, yet these detections still flood your system with alerts.
Second, models are often robust to changes in data distribution. Even a real, quantifiable statistical change in a feature's distribution does not mean that model performance will drop.
Third, univariate methods look at each feature in isolation, so they cannot see drift in the relationships between features.
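A toy illustration of that last point (synthetic data, not from the post): each feature's marginal distribution stays the same, but the correlation between them flips, so per-feature KS tests see nothing.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=5000)
production = rng.multivariate_normal([0, 0], [[1, -0.9], [-0.9, 1]], size=5000)  # relationship flipped

for i in range(2):
    _, p_value = ks_2samp(reference[:, i], production[:, i])
    print(f"feature {i}: KS p-value = {p_value:.2f}")  # large p-values -> no univariate drift detected
```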
Is your ML monitoring workflow focused on the right thing? Most companies monitor ML models by measuring realized performance and checking for data drift. This makes sense when labeled data is available, but what happens when it's not? Or when data drift doesn't actually impact performance? Our ML workflows are about performance, so our monitoring workflow should be about performance too, right? Here's how to get started with the NannyML workflow:
1. Estimate model performance ahead of time: Stop waiting for ground truth. Probabilistic algorithms like CBPE and DLE can estimate performance even without new labels (see the sketch after this list).
2. Investigate only when performance drops: Drift detection should help you understand why a performance drop happens, not serve as an early warning system.
3. Context is key to solving issues: Once you identify the cause of a performance issue, tailor your solution to the context. There's no one-size-fits-all fix.
How do you monitor your models: performance-first or drift-first?
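As an example of step 1, a hedged sketch of estimating classification performance with NannyML's CBPE. The column names and the reference/analysis dataframes are placeholders, and argument names may vary between NannyML versions:

```python
import nannyml as nml

# reference_df: labeled data the model performed well on (placeholder)
# analysis_df: production data without labels (placeholder)
estimator = nml.CBPE(
    y_pred_proba="y_pred_proba",
    y_pred="y_pred",
    y_true="label",
    problem_type="classification_binary",
    metrics=["roc_auc"],
    chunk_size=5000,
)
estimator.fit(reference_df)                # fit on labeled reference data
results = estimator.estimate(analysis_df)  # estimate ROC AUC without targets
results.plot().show()
```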
Importance Weighting (IW) is a method where we adjust our ML performance metrics by assigning different weights to observations based on how much they resemble the new data. For instance, if our current data differs significantly from the reference data, IW emphasizes reference observations that closely match the new data. This makes our performance metrics more reflective of the current situation. While IW metrics excel with large datasets, they have limitations with smaller samples. In such cases, methods like Probabilistic Adaptive Performance Estimation (PAPE) might provide a better alternative. PAPE uses adaptive mechanisms to recalibrate probabilities based on the shifted data distribution, which means it adjusts its performance estimates according to how the current data differs from the reference distribution.
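A rough sketch of classifier-based importance weighting, to illustrate the idea only (not NannyML's or PAPE's implementation; variable names are placeholders): estimate w(x) ≈ p_prod(x) / p_ref(x) with a domain classifier, then reweight the metric computed on the labeled reference set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def importance_weights(X_ref: np.ndarray, X_prod: np.ndarray) -> np.ndarray:
    """Estimate w(x) = p_prod(x) / p_ref(x) for each reference row."""
    X = np.vstack([X_ref, X_prod])
    y = np.r_[np.zeros(len(X_ref)), np.ones(len(X_prod))].astype(int)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(X_ref)[:, 1]
    return p / (1 - p)

# Usage (X_ref, X_prod, y_ref, ref_scores stand in for your own data):
# weights = importance_weights(X_ref, X_prod)
# estimated_auc = roc_auc_score(y_ref, ref_scores, sample_weight=weights)
```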
Covariate shift is a change in the distribution of the model's inputs between training and production data. This shift can lead to poor predictions if not monitored. Let's go over the methods we provide to detect and monitor covariate shift:
→ Univariate Methods: Analyze individual features for changes using Chi2 and KS tests.
→ Multivariate Methods: Detect shifts across multiple features with tools like Domain Classifiers and Data Reconstruction Error.
→ Summary Statistics: Spot trends using metrics like mean and standard deviation.
How do you handle covariate shift in your production-grade models?
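To make the univariate checks concrete, here's a small SciPy sketch on synthetic data (illustrative only, not NannyML's API): a KS test for a continuous feature and a chi-square test for a categorical one.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(42)

# Continuous feature: transaction amount, shifted upward in production
ref_amount = rng.lognormal(3.0, 1.0, 5000)
prod_amount = rng.lognormal(3.3, 1.0, 5000)
_, ks_p = ks_2samp(ref_amount, prod_amount)

# Categorical feature: payment channel, with shifted category frequencies
ref_channel = pd.Series(rng.choice(["web", "app", "store"], 5000, p=[0.5, 0.3, 0.2]))
prod_channel = pd.Series(rng.choice(["web", "app", "store"], 5000, p=[0.3, 0.5, 0.2]))
counts = pd.DataFrame({"ref": ref_channel.value_counts(),
                       "prod": prod_channel.value_counts()}).fillna(0)
chi2, chi2_p, _, _ = chi2_contingency(counts.T)

print(f"KS p-value: {ks_p:.4f}, Chi2 p-value: {chi2_p:.4f}")  # small p-values flag drift
```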
What is the key to ML monitoring? It's knowing when data drift impacts performance. Data drift happens all the time, but it's mostly harmless. Ideally, we'd compute ML metrics to find out, but targets are often delayed, censored, or unavailable. To illustrate: in the first plot, we see multivariate drift detection results. The data structure is changing significantly, but should we care? Data drift signals can't tell us that. Fortunately, we can quantify the impact of data drift on performance with our performance estimation algorithms - DLE and (M-)CBPE. The second plot shows the estimated performance (obtained without target data) and realized performance. We see that estimated performance predicts realized performance well, while even comprehensive multivariate drift detection doesn't.
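For regression models, the counterpart to CBPE is DLE. A hedged sketch of a typical setup follows; the dataframes and feature names are placeholders, and argument names may differ by NannyML version:

```python
import nannyml as nml

# reference_df (with targets) and analysis_df (without targets) are placeholders
estimator = nml.DLE(
    feature_column_names=["feature_1", "feature_2"],  # assumed feature names
    y_pred="y_pred",
    y_true="y_true",
    metrics=["rmse", "mae"],
    chunk_size=5000,
)
estimator.fit(reference_df)                 # learns to predict the model's loss from features
results = estimator.estimate(analysis_df)   # estimated RMSE/MAE without targets
results.plot().show()
```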
Concept drift is covariate shift in unseen features. Concept drift happens when the relationship between inputs and outputs changes over time. A model trained on past data starts failing because real-world patterns shift. Covariate shift is when the input features change, but the relationship between input and output stays the same. So why is concept drift covariate shift in unseen features? Because models only learn from the data they were trained on. When new features emerge in production—features the model never saw—this creates a shift in the input distribution. That’s covariate shift. Example: A fraud detection model trained before cryptocurrency payments became popular. When crypto payments start appearing, the model hasn’t seen this feature before. The input distribution shifts, and its predictions suffer. Concept drift is often just covariate shift in features the model was blind to.
We are taught about confusion matrices in textbooks, but in production scenarios, the confusion matrix gets censored. What does that mean? In ML, we need ground truth or true labels to calculate performance metrics. But in the real world, ground truth is often delayed or absent. Predictive maintenance models tell us how likely a machine is to fail and when it needs repairs. If the model predicts a machine breakdown and maintenance is scheduled, the machine might never fail in real life. This raises the question: how can we be sure the model was correct? Is there a way to know whether our machine would have failed if we hadn't maintained it? This lack of failure is our intended outcome, but it also means there's no immediate failure data, and as a result, the confusion matrix is "censored." If we have no TP, TN, FN, or FP, how do we calculate performance metrics? And what do we tell stakeholders when we have no performance data? This is a serious issue that data scientists often realize too late. Do you know the solution? Tell me in the comments below.
Had a great time on the AI Stories Podcast with Neil Leiser. We covered my journey into AI, early freelance projects, building NannyML, model monitoring, and the future of NannyML. Neil was an amazing host. He was thoughtful and asked all the right questions to make this a meaningful conversation. Grateful for the opportunity to share these insights. Go and listen to our conversation on your favourite platforms (links in the comments)
The most popular univariate drift detection methods are... Kolmogorov-Smirnov for continuous variables and L-infinity for categorical ones. Should you always just use these two, then? NO! Each univariate method has trade-offs that you have to carefully consider depending on your use case. For example, Kolmogorov-Smirnov produces many false positives and is insensitive to changes in the tails, while L-infinity is sensitive to big changes in a single category. So if you care about changes in the tails, or don't care about big changes in one category, these methods wouldn't be the right choice. Fret not, we have run a ton of experiments and written a comprehensive blog that covers each of the 6 methods. Check it out in the comments! What method do you use, and how did you pick it?
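For reference, the L-infinity distance between two categorical distributions is just the largest absolute difference in relative category frequencies. A quick sketch on synthetic data (illustrative, not NannyML's implementation):

```python
import numpy as np
import pandas as pd

def l_infinity(reference: pd.Series, analysis: pd.Series) -> float:
    """Largest absolute difference in relative category frequencies."""
    ref_freq = reference.value_counts(normalize=True)
    ana_freq = analysis.value_counts(normalize=True)
    categories = ref_freq.index.union(ana_freq.index)
    diff = ref_freq.reindex(categories, fill_value=0) - ana_freq.reindex(categories, fill_value=0)
    return float(diff.abs().max())

rng = np.random.default_rng(1)
ref = pd.Series(rng.choice(["a", "b", "c"], 3000, p=[0.6, 0.3, 0.1]))
ana = pd.Series(rng.choice(["a", "b", "c"], 3000, p=[0.4, 0.4, 0.2]))
print(f"L-infinity distance: {l_infinity(ref, ana):.3f}")  # ~0.2, driven by category "a"
```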