Writing
Machine learning and a wide range of other analytic techniques are key ingredients of modern decision processes. I take a human-centric approach to automation and focus on the application end of analytics rather than the math. All views are my own.
Contents
Learning about learning
Mar 30 2024
Over the past 18 months I have been learning, not writing. It is time to start documenting some of what I have learned. I am learning about learning.
While the statement below is extreme when applied to every machine, it is still worth considering. I am sure you can find at least one machine that you use for which it holds true.
Humanity has formed an unhealthy relationship with machines. Machines evolved to maximize collective productivity, with little consideration for individual growth.
The premise behind this statement is that:
People grow through learning.
When we delegate tasks to machines, it allows us to do things that we could never do before.
We do more, but learn less. Hence we grow less.
When we grow less, we are less satisfied by doing.
You may argue that we grow by doing. If you look at us collectively as a group, yes, a group could grow from its collective accomplishments. When you look at the individual, if each individual in the group is able to "do" the same things when aided by the same "machines", in what way is each individual growing? How much personal satisfaction do they take from this type of growth?
Since the industrial revolution, we have produced better machines that allow humans to do more. However, as the people building the machines, we placed all of our effort on the collective outputs of society. We focused on "doing". Is it time to balance doing with learning? Would this benefit the individual? In the long run, does it improve the sustainability of the group?
Too abstract for you? Consider this example:
At the turn of the millennium everybody clamored to implement ERP systems to replace their homegrown systems that were doomed to collapse in a Y2K timebomb. These ERP systems allowed companies to do a lot, but let's be frank, they are unintuitive to use and implement. The drain they place on users and implementers is considered a fair tax to pay for their utility. Is it a fair tax?
Let's say the productivity gains associated with the ERP clearly outpace the direct tax on usability. In the short term, the ERP is a winner. How about the long term? When the machine took over functions that people previously had to think about, people started doing without thinking. In the process, they stopped thinking about improvements; and even when they did think about improvements, the machine became a barrier to implementing them. The ERP wasn't necessarily a clear winner for every task.
My meta-goal of learning about learning is about trying to find new ways for humans to learn and grow, while doing. My hope is that machines can help us learn and grow; and that machines can adapt what they do to accommodate our changing needs as we learn and grow. As a software and ML person, I believe that the patterns associated with learning and growing are best baked into our software from the get-go. Taking on this learning about learning task, I am discovering new patterns to bake into software.
My findings will take time to emerge, but I will leave you with two practical design suggestions:
Design suggestion 1: Design the feedback loop
It has become second nature to design for a primary task.
When designing for the primary task, think about the task lifecycle. Think about what your product is doing to help users take the first step towards the task; master the task; and recognize and realize future improvements to the task.
After considering the task lifecycle, confirm that you are being intentional in your choices. If your design is all about "do fast", question whether you are doing enough to encourage first steps; understand task outcomes; deal with task exceptions; and improve task outcomes.
Design suggestion 2: Promote "Getting Started" tasks to "Everyday Use" tasks
Ask yourself, how does the system get set up so that it supports any given task?
Who does this setup?
If the setup is suboptimal, who has to figure it out? How? What can they do about it?
Critically examine tasks with big setup hurdles. Are you sure all of those hurdles need to be placed up front? Can you eliminate any? Can you move any so that they are only incurred when absolutely required?
Question any "getting started" tasks that can't be performed by the end user. This is a signal of lack of empowerment. How much does this lack of empowerment impede usage? How much does it compromise improvement?
That is all for now. I would love to hear thoughts and counterpoints on the subject!
Progressive and forward looking decision processes require a broad and thoughtful approach to modeling
Aug 26 2022
Genichi Taguchi was ahead of his time. He is best known for using statistically based experiments to help design manufacturing processes that are robust to disturbances. These techniques were impractical to adopt in his day as they required conscious experimentation and manual computation. Nowadays we can get a lot of the benefit of his methods by automatically collecting and analyzing actual process data.
Taguchi was also ahead of his time when it came to thinking more holistically about quality. He came up with a novel way to understand quality from a customer/society perspective rather than an internal perspective. His view of quality looked at the loss imparted to society by the product from the time the product is designed to the time it is shipped to the customer. At the time, everybody else was thinking of quality only from the perspective of the final product and its level of conformance to a standard. By taking a broader view, Taguchi’s loss function included the effects of things like scrap, rework, and delay.
The machine learning process also has a “loss function”. This loss function guides the learning algorithm, helping to recognize signal and reject noise. Loss functions in machine learning tend to be quite simple - often measuring the difference between a single variable and a target. In my previous post I gave an example of “regularization”: this was achieved by adding a term to the loss function to penalize model complexity. It is possible to add terms to loss functions to penalize or reward many different things. Loss functions tend to be simple because focusing the algorithm on a single thing makes it easier to model and to understand model accuracy. We should, however, expect that the learnings from simple loss functions may be shallow – like pre-Taguchi views on quality.
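To make this concrete, here is a minimal sketch of a loss function with an added complexity penalty; the alpha knob and the L1-style penalty are illustrative choices, not the only options:

```python
import numpy as np

# A minimal sketch of a loss function with an extra term that penalizes model
# complexity. `alpha` is an illustrative knob controlling how heavily complexity
# is punished relative to prediction error.
def regularized_loss(y_true, y_pred, weights, alpha=0.1):
    data_loss = np.mean((y_true - y_pred) ** 2)           # how far predictions miss the target
    complexity_penalty = alpha * np.sum(np.abs(weights))  # L1-style term discourages large weights
    return data_loss + complexity_penalty
```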
I already introduced you to one of my favorite “gurus” in this post. There is another guru whose teachings are very relevant to this topic. Peter Senge is the champion of the learning organization. His learning disciplines are very broad and abstract so they apply to anything. I have been wondering how they apply to AI/ML, and whether they can help achieve a desired goal of continuous ongoing learning. For this post I will consider one of Peter Senge’s concepts - the first law: “Today’s problems come from yesterday’s solutions”.
This first law is a somewhat fatalistic reminder that whatever you do to fix a problem will inevitably cause at least one future problem (a recurrence of the original in a different form or a brand new problem). This law encourages leaders who strive to build learning organizations to anticipate some of these new problems rather than merely react later when they appear out of the blue. Peter Senge teaches leaders to employ “systems thinking” to deeply understand how the parts of a “whole system” interrelate and use this knowledge when making decisions.
Machine Learning techniques are somewhat compatible with the notion of systems thinking, except for the part where learning algorithms rely on examining each observation of data as if it were independent from the rest. With true systems thinking, you need to attune yourself to the interdependence of just about everything. The assumption of independent observations flies in the face of this aspect of systems thinking, and it is a critically important aspect because it is what helps you understand time dependencies. Few actions cause an immediate reaction. There are leads and lags all over the place. Systems thinkers should understand these leads and lags.
With machine learning modeling, generally, unless the modeler deliberately manipulates input data by blending elements derived from past or future observations into current observations, the learning algorithm will be blind to leads and lags.
There is no universal method to accommodate the effects of the passage of time in modelling. It takes effort and may significantly worsen the effects of the curse of dimensionality when not done carefully. As a result, many ML models see and learn based on an overly simplistic view of the world. This overly simplistic view is blind to side effects and subject to bias.
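As an illustration of that kind of deliberate manipulation, the sketch below derives lagged, leading and rolling versions of a signal so that a learner that treats rows as independent can still see leads and lags (the column names and values are invented):

```python
import pandas as pd

# A sketch of blending elements derived from past or future observations into the
# current observation, so that a row-by-row learner can see time dependencies.
df = pd.DataFrame({"demand": [100, 120, 90, 130, 110, 95]})
df["demand_lag_1"] = df["demand"].shift(1)                    # one period ago
df["demand_lag_2"] = df["demand"].shift(2)                    # two periods ago
df["demand_lead_1"] = df["demand"].shift(-1)                  # next period (for label engineering)
df["demand_rolling_mean_3"] = df["demand"].rolling(3).mean()  # smoothed recent history
print(df)
```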
Need tangible evidence? It was only a few years ago that people realized that models trained from historical data mirrored historical bias. A new awareness grew around measuring bias of this nature and eliminating features that encouraged bias. In certain industries, these models had been used for decades before the world became aware of these problems.
Learning algorithms find patterns. They don’t understand the source of these patterns and can’t distinguish right from wrong. It requires active human effort to produce morally sound models and there is no foolproof way to do this. Despite awareness of things like gender, race or age bias, there is no guarantee that they don’t inadvertently creep in via some unexpected correlation.
There is also recent awareness that because recommender systems tend to recommend popular items, and because they have the power to influence what we watch, read and listen to, they too introduce bias. A biased recommender system is particularly dangerous to society because it has the power to build self-fulfilling prophecies. Taken to the extreme, the impact would be a reduction in the reward for human creativity and a reduction in diversity of thought. The long-term effects of these models could genuinely make the world a duller place.
What I am talking about is not a simple technical problem; it is a bigger societal problem. If human behavior is being influenced by models, who owns the right to decide what and how these models learn? Currently it is the builder of the models that chooses the loss function and features. It is the builder of the model that is entrusted with protecting the world from bias and myopia. As somebody building models, or somebody commissioning somebody else to build them for you, you are expected to take the utmost precautions.
Now for the tough part. If you have ever read the Fifth Discipline or listened to Peter Senge speak, you will know that he never gives you a quick fix and I can’t offer one either. The purpose of this post is to raise awareness as to the need for a “broad system view” when modeling: choosing features, working out how to best transform features and encouraging more effective continuous learning by trying to incorporate future penalties in loss functions. It doesn’t stop there. The benefits of a broad system view apply when it comes to incorporating ML/AI or even simple descriptive data into any decision process.
One more definitive piece of advice is to keep everything in the whole chain of data collection, modeling, training, inference and decisioning as light and nimble as possible so that change is a natural part of usage. The more people and effort needed to recognize a need for change and make the change, the more likely it is that you will be using stale thinking as part of your decision making. When designing and building MLOps technology you can help by providing a “model playground” with some of the following characteristics:
1) Provide approachable visualizations of model results. Make them accessible and understandable to consumers of models, not just model builders. These visualizations should provide access to detailed model inference, not just summary statistics.
2) Allow for exploration and experimentation of current, past and possible future models on current, past, simulated and type-in data.
3) Allow people to isolate local areas of the dimensional space, view results in these areas, compare them with other local areas. This is a good way to understand bias.
4) Allow people to experiment with local modeling to understand how the performance of models trained in a local area differs from those trained on global data.
5) Allow loss data to be viewed using alternative loss functions to the one that the model was trained with. Allow people to define their own loss functions and add terms to existing ones (a sketch of this idea follows this list).
6) Clearly define all assumptions inherent in the models themselves and processes that support decision making using data and models. Allow people to monitor these assumptions and understand the sensitivity of inference and recommendations to the value of these assumptions.
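Here is the sketch promised in item 5: a rough illustration, with invented numbers, of scoring one set of held-out predictions under several loss functions, including a user-defined asymmetric one that penalizes under-prediction more heavily than over-prediction.

```python
import numpy as np

# Score one set of held-out predictions under several loss functions, including a
# user-defined asymmetric loss. Numbers are invented for illustration.
def squared_error(y, yhat):
    return np.mean((y - yhat) ** 2)

def absolute_error(y, yhat):
    return np.mean(np.abs(y - yhat))

def asymmetric_error(y, yhat, under_weight=3.0):
    residual = y - yhat
    return np.mean(np.where(residual > 0, under_weight * residual, -residual))

y_true = np.array([10.0, 12.0, 9.0, 14.0])
y_pred = np.array([11.0, 10.0, 9.5, 12.0])
for name, fn in [("MSE", squared_error), ("MAE", absolute_error), ("Asymmetric", asymmetric_error)]:
    print(name, round(fn(y_true, y_pred), 3))
```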
This post is a reminder that ML and the human processes around modeling are far from infallible. Biases and local/short term optimizations are common. There are no foolproof, automated ways to guard against undesirable modeling side effects. The best safeguards are: broad systems thinking about data and modeling and the decision processes that consume model inference; exposing models to scrutiny by non-technologists so that they can fully understand and experiment with alternative models. This post also stresses the importance of designing for change. There has been a dramatic rise in the level of automation of tasks that could previously only be performed by humans. Society is still learning how to best live with this powerful machine. While we learn together, one of the best things we can do is engineer a nimble machine so that it adapts rapidly as society learns to better express its needs and expectations of the machine.
The Curse of Dimensionality
Aug 17 2022
Image by author (with help from Midjourney)
This post is an attempt to quantify and visualize the effects of a phenomenon that is commonly referred to as the curse of dimensionality. You can think of “dimensions” as different aspects of a system and machine learning as a means of understanding how to represent a system through its “dimensions”. Intuitively it feels like the more aspects of the system you can gather data for, the more thorough your understanding of the system will be.
Unfortunately, sometimes your attempts to explore thoroughly backfire – making your highly dimensioned models less reliable at representing the system. Richard Bellman coined the phrase the “curse of dimensionality” to explain how by adding volume to a space (adding dimensions) you increase the amount of data needed to understand the space. Bellman called it a curse because in the dynamic programming problems that he was wrestling with, the amount of data rose exponentially for each new dimension added. This post explores the curse of dimensionality as it applies to machine learning.
Machine learning engineers are very aware of the curse. This awareness makes us tentative about increasing dimensionality willy-nilly. When confronted with highly dimensioned problems, we use a number of strategies to improve our chances of producing a useful model. These include:
1) Choosing learning algorithms that are less prone to the evil effects of the curse.
2) Using regularization and other model hyper parameters to help the algorithm “explore the space” less arduously.
3) Using feature selectors to reduce the dimensional space.
4) Collecting more observations of data to explain the space more thoroughly.
This post explores the effectiveness of these strategies in an objective way. To do this, I synthesized a dataset that is a real torture test. I did it this way because if you use real data, you can build a model and estimate its accuracy, but you don’t ever really know how good your estimate is and you don’t know what portion of the inaccuracy in your estimate should be attributed to the curse of dimensionality.
My torture test allowed me to measure the impact of noise and dimensionality on results. The dataset used is so tortuous that there is no real truth to be learned. This means that irreducible noise is quantifiable, and all learnings are falsehoods that can be attributed to the algorithms getting thrown off by noise and high dimensionality. In the extreme version of the test, I have 500 dimensions/features. There is signal in only one of the features, but since this is truly a torture test I haven’t provided enough data about the signal for the algorithm to detect it.
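A dataset in the same spirit can be synthesized along the following lines (a simplified sketch: here the target is pure noise, whereas the actual test also hides a weak signal in one feature):

```python
import numpy as np

# A simplified sketch of torture-test data: 500 standard-normal features and a
# target that is pure noise, so the honest prediction of y is always 0.
rng = np.random.default_rng(42)
n_rows, n_features = 1_000, 500
X = rng.normal(size=(n_rows, n_features))  # x0 ... x499, mean 0, SD 1
y = rng.normal(size=n_rows)                # irreducible noise only
```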
The plot below shows the results of a least squares regression on the single feature that would have been the only interesting one to look at had I not obscured the signal that is present in it.
Like x0 above, all 500 features are normally distributed with a mean of zero and a standard deviation of 1. As the linear regression line above shows, the “best” prediction of y is 0 for any value of x0. The value of y varies considerably, but these 500 dimensions don’t explain why y varies. This variation can all be attributed to irreducible noise. The safest/most honest/most reliable prediction of y is 0 regardless of the value of x0 or any of its 499 cursed friends.
Choice of learning algorithm
Highly dimensioned models are more prone to overfitting than lesser dimensioned models. Some algorithms are more prone to overfitting than others. In particular, algorithms like Random Forests that “bag” by averaging out multiple predictions should be more prone to “bias”, while those that “boost” by furiously hunting down each and every source of error in predictions are more likely to overfit.
An ML engineer may select an algorithm like Random Forests to deliberately err on the side of bias (if they suspect that the results could be marred by the curse of dimensionality) or choose an algorithm like Gradient Boosting if they are confident in their ability to control the “variance” that would result from overfit.
The plot that follows shows the results of building several models using different algorithms on my tortuous dataset. The x-axis of the plot shows the “bias”. Any deviation from 0 is “bias” introduced by the learning algorithm. The plot below shows that the mean value of predictions made using most of the models was close to 0. Some models predicted values greater than 0 more often and some predicted values less than 0 more often, but most are close enough to 0 to be considered low bias. How about variance? The y-axis of this plot (standard deviation) shows us just how much variance was introduced by the modeling process.
The theoretical “honest” model has a variance of 0. The best performing algorithm above is a lot worse than this desired result of 0 variance. It has an SD of 0.5. The worst result is off the chart with an SD of 9. To understand the impact of what this plot is telling you, let me explain how I calculated standard deviation. I trained 10 models using the same algorithm and same hyper parameters – using 10 different random samples of the same data and different seeds for the internal randomized portions of the algorithm. I then calculated the standard deviation of predicted values by looking at the difference between predictions produced by each model on a common set of new data that wasn’t used in training.
This test emulates what would happen if you asked the model to predict the value of y for the same set of x values on 10 consecutive days (after retraining overnight). You would expect the same value of y for the same set of x values – instead what you get is unstable predictions – where the same x values give you different y values on different days.
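Roughly, the test looks like this (the sample sizes and the choice of algorithm below are illustrative, not the exact settings behind the plot):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train the same algorithm 10 times on different random samples and seeds, then
# measure how much the predictions for identical inputs disagree across the models.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 500))
y = rng.normal(size=2_000)            # pure-noise target
X_new = rng.normal(size=(200, 500))   # common holdout, never used in training

predictions = []
for seed in range(10):
    idx = rng.choice(len(X), size=1_000, replace=False)  # a different sample each "day"
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X[idx], y[idx])
    predictions.append(model.predict(X_new))

pred = np.vstack(predictions)
print("mean prediction:", pred.mean())            # bias relative to the honest answer of 0
print("prediction SD:", pred.std(axis=0).mean())  # instability across the 10 models
```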
Instability of predicted values is bad for the person or process that consumes the predictions. For anybody expecting ML to resolve with the clarity and precision of a crystal ball, these models don’t. They change their mind a lot! When you have a temperamental crystal ball, it makes it difficult to use predictions from it as part of a decision process.
What I learned from this test:
You cannot rely on the use of an algorithm that favors bias over variance to protect you from high variance. All algorithms that I tried above produced results that showed high variance – some more so than others.
Regularization to minimize the impact of the curse
Some of the algorithms (SVM, Lasso, MLP) have built-in regularization parameters that help directly tune and gain control over the effects of dimensionality. You can tweak the bias/variance tradeoff in a less direct way with other algorithms using their hyper parameters. These parameters are an effective way to change the bias/variance characteristics of a model. Unregularized linear regression on 500 features of noise produces models with an SD of a whopping 9.03 in my tests. The regularized Lasso model on the same data has an SD of 1.41, but this is still a lot worse than the desired value of 0.
The test results below show that it is possible to increase model stability by tuning the regularization parameter.
These plots show how stable the predictions of y are for the same value of x0 when using models trained on different subsets of data. As the value of the regularization parameter increases, the predictions become stable as the variance gets close to 0.
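A rough sketch of this sweep (the sample sizes and alpha values below are illustrative, not the exact settings behind the plots):

```python
import numpy as np
from sklearn.linear_model import Lasso

# For each regularization strength, train several Lasso models on different subsets
# of noisy data and check how stable their predictions are for the same inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(2_000, 500))
y = rng.normal(size=2_000)
X_new = rng.normal(size=(200, 500))

for alpha in [0.001, 0.01, 0.1, 1.0]:
    preds = []
    for seed in range(5):
        idx = rng.choice(len(X), size=1_000, replace=False)
        preds.append(Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx]).predict(X_new))
    print(f"alpha={alpha}: prediction SD = {np.vstack(preds).std(axis=0).mean():.3f}")
```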
What I learned from this test:
Regularization worked particularly well with Lasso on this noisy dataset. I had to train multiple models to understand model stability during the tuning process. This is not always practical to do.
Feature Selection
The plot below shows results that mimic what could be obtained using a feature selector to reduce the dimensional space. The top row shows the mean squared error on the test dataset for a collection of models trained on a small volume of data. I expected this small number of rows to highlight the benefits of feature selection. The top row contained 50 features, the row below it a single feature, and the next two involve models trained on 5 features. The difference between them is that the 5 features in the 3rd row are largely independent of one another. The 5 features in the last row of plots are all highly correlated with one another.
The results above show the effects of lots of features, limited data and high noise quite clearly. There is more of a difference between training and testing error with more features.
I was curious to see how the regularization that Lasso provides improved the results over Linear Regression. The regularization worked. Lasso is not in the least bit fazed by 500 features. The other thing about Lasso is that it is lightning fast to train. In all of the tests that I ran, the slowest was an SVM model that took over 3 hours to train. The equivalent Lasso model took 0.5 seconds. In case you are thrown by the different units in these timings, the Lasso model was 21,600 times faster to train than the SVM.
Most learning algorithms weren't fazed by 4 bogus features. I had to add many more bogus features to see the clear impact of high dimensionality.
Neural Networks and Gradient Boosting are notorious for overfitting when not well tuned. The Neural Network (MLP) was the worst performing algorithm in this test - even with fairly strong regularization. The Neural Network behaved much better with fewer features.
This is the first time that I had used AdaBoost. I expected it to perform better than Gradient Boosting due to less overfit. In the end the conservative Random Forests, AdaBoost and Gradient Boosting all delivered similar test results. The training results are quite different.
At first I thought that Lasso and AdaBoost's similar performance on test and train was a good thing, but on further reflection I am not so sure. It could be misleading given the fact that all of these models have high variance.
I expected to see different results between the conservative Random Forests (single feature per tree) and the Complex Random Forests (multiple features per tree). Random Forests are known for producing useful results with minimal tuning. This test confirmed that the severely detuned complex model managed to compensate quite a bit.
Were these results all due to the artificial nature of testing on pure noise? I repeated the test with more data and a strong signal in one of the features to find out.
As expected, where there is strong signal and more data, results are better across the board. There was one exception: Lasso.
The odd thing is that the best performer from the previous test is now the worst performer. The plots below are zoomed-in versions of the upper RHS (500 features) plots of each of the 2 grid plots above.
To further understand the effects of adding more training data, I ran some more tests. The plot below shows the difference between training MSE and test MSE for various amounts of training data. I ran this test on the torture test data with 500 features and 100% irreducible noise.
As expected, when there is more training data the summary stats obtained from training and testing are closer to one another. The conservative Random Forest model and the SVM model were the two that benefited least from more training observations.
What I learned from this test:
Bellman coined the phrase “curse of dimensionality” to describe dynamic programming problems. ML algorithms seem to deal with the curse better than dynamic programming does. The data needed for training does not rise anywhere close to exponentially with each new dimension. When data volumes are low, regularization and feature selection are both useful. As data volumes increase, they become less important.
A small difference between evaluation metrics produced on test and train does not necessarily imply low variance. An algorithm that consistently produces inaccurate models will not show a big difference between summary stats between test and train. To get a better estimate of variance you have to look at the stability of individual predictions – not just summary statistics.
Conclusions
The torture test was a useful exercise for me. This is the first time that I had gone out of my way to measure model variance and understand stability in this way. Irreducible error, bias and variance are a triad of deeply related concepts that each warrant individual attention during model building and selection.
I learned that the sheer volume of the space (number of dimensions) played less of a role than I thought it would and that the presence of so much noise often made for high variance models even when the model was based on a single feature.
You may wonder how relevant these findings are on real world problems given just how tortuous the dataset was. A lot of my work in ML has involved building pipelines that were tested in the software development lab and then used to train on customer data. In a situation like this, you often encounter noisy data and there is a lot of value in figuring out quickly and cheaply how much potential there is to produce something meaningful from the data at hand.
In subsequent posts I want to get back to the topic of signal that has been obfuscated by the data collection or modeling process. I believe that I can actually find the signal that I hid in this dataset. If so, a large chunk of what I accepted as irreducible error in these tests will become reducible. I will also explore a few other scenarios that hide signal and see whether they too can be made reducible.
Pull Process for Continuous Incremental Learning
July 21 2022
When I got started with machine learning, I was fascinated by the supreme number crunching capabilities of machines. Humans can only consider a small number of data points at a time. We make conclusions from what we observe, we act on those conclusions. We learn from our actions and the actions of others. This is a slow, iterative learning process. In contrast, a machine can happily plough through millions of observations across many dimensions and reach conclusions about how the world works amazingly rapidly. I have spent considerable time figuring out how best to structure large datasets to take maximum advantage of this supreme number-crunching ability and arrive at valid conclusions about the world.
Strangely enough, as my process has evolved, I have begun seeking ways to emulate the slower incremental human learning process to augment or even replace the supreme powers of bulk learning. Why the change of heart? There are lots of reasons, but I will focus on the one that got me started down this journey – “overfitting”.
Simply put, overfitting is an unwanted side effect of the machine’s super-human ability to describe patterns observed in data. When confronted with a dataset containing a mix of “signal” and “noise”, the machine will describe the patterns observed in both the signal and the noise. As the person preparing data for learning and tuning the learning process, it is your job to ensure that the machine focuses on the “signal” and doesn’t get too distracted by the “noise”. You do this by carefully manipulating input data and “punishing” the machine for being too precise using techniques like regularization, bagging, binning, and pruning. These techniques work when you have a lot of data and when the data that you are learning from has strong signal. As I encountered more and more problems where I didn’t trust the labeling of the data (resulting in high noise) and/or I didn’t have enough observations and/or I was dealing with vast dimensional spaces where the behavior of the system was not consistent across the entire space, I knew that there was little that I could do to reliably prevent overfitting. This is what led me to wonder what machines could learn from their less number-crunching-adept human brethren.
Conventional supervised learning deals with certain problems really well – particularly problems where you have access to data that describes a large population, and this population is consistent and stable in its behavior. Once you understand the common patterns of the population, you can apply your knowledge to understand how new members of the population that you have never seen before may act. Take an extreme case – you only ever interact with any given member of the population once. In this extreme case, your best decisions about this member can only be informed by what you have learned about the population at large. Now consider the opposite extreme. You only interact with two members of the population, but you do so repeatedly. You never get to formulate a view of how the population at large behaves, but you gather enough understanding about each of the two members to understand them intimately. In this second extreme example, using a single supervised model to represent both members could be problematic. Supervised learning techniques are best used to represent the (aggregate) behavior of a population rather than the individual behavior of members. In its attempt to generalize, the uber-model would either average out the differences between the individuals, resulting in a model that doesn’t describe either member well (high bias), or find spurious data to explain the differences and produce an overfit model (high variance).
I created the example below to show this extreme case of a combined model created using data sampled from 2 members of a population. Just to prove the point, in this example, each member has a target variable "y" that is correlated with a feature "x0" in opposite ways. The plot below is a least squares regression that "correctly" shows zero correlation between the combined y value and "x0".
Image by author
In the extreme 2 member example, the obvious way to deal with it is to create a model for each distinct member. Since you interact with these members often, with enough experience of each member you should be able to represent their behavior in an unbiased way. With separate models, it is easy to control overfitting (high variance) as you can regularize each model separately – limiting the number of features separately based on how much data you have to describe each.
The plot below shows individual least squares models for the two members. A mini-model per segment clearly outperforms the "uber-model" in this admittedly deliberately contrived example.
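The contrived example can be reconstructed in a few lines (the slopes and noise levels below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Member A's target rises with x0, member B's falls with x0, so a combined model
# learns roughly nothing while a model per member recovers each slope.
rng = np.random.default_rng(7)
x0 = rng.normal(size=(200, 1))
y_a = 2.0 * x0.ravel() + rng.normal(scale=0.3, size=200)   # member A: positive slope
y_b = -2.0 * x0.ravel() + rng.normal(scale=0.3, size=200)  # member B: negative slope

uber = LinearRegression().fit(np.vstack([x0, x0]), np.concatenate([y_a, y_b]))
mini_a = LinearRegression().fit(x0, y_a)
mini_b = LinearRegression().fit(x0, y_b)
print("uber-model slope:", round(uber.coef_[0], 2))   # close to 0, describes neither member
print("member A slope:", round(mini_a.coef_[0], 2))   # close to +2
print("member B slope:", round(mini_b.coef_[0], 2))   # close to -2
```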
Most of my prior work has been in B2B, where each business interacts with a small subset of the overall population of other businesses (unlike a Google or Facebook that understands a lot about the population at large). When I look back on many of the models I have built in the past, and the effort I expended trying to control overfit or produce valid inference despite it, I now want to approach things differently.
There is an obvious disadvantage of doing what I did above. You have to get good at recognizing "segments of members" and building individual models for them. When you see a new "member" for the first time, you need an automated process for dealing with it. As part of this automated process, you have to accept that if you don't know much about the new member, you may have to guess which segment it is best aligned with and your guess involves the risk of high bias until you understand the member better. Once you have more experience of the new member's behavior you can start correcting this bias by moving them between segments or automatically creating a new segment for them. All of this automated process around members and segments requires upfront consideration and effort. The payback for upfront effort is that you are setting yourself up for continuous incremental learning.
Is it beneficial to endure the initial pains of getting set up for slower, incremental, local learning rather than trying to build an uber-model? It will depend on what you are modeling. What are these initial pains that I refer to? They are the automation pains associated with creating an environment that can produce, evaluate, and promote large numbers of models hands-free.
Building a collection of little models instead of a large uber-model mandates an MLOps process with a high level of automation. You can think of the process that produces the “uber-model” as a push system – a time, data and skills intensive process to produce a model and then push it to production. Producing each new minor or major version of the model requires some level of human effort in the form of a new push cycle. When using a collection of little models, a semi-automated push cycle for each model would be too slow and too expensive, so the whole process needs to be automated. Also, instead of a central authority declaring a new model fit for use, each mini-model should really be evaluated in a more local context – perhaps by the consumer of the model and not by its producer.
This implies that the process of building and managing a large collection of mini-models is best carried out using a pull system. A distributed pull system allows models to be created and updated on demand and tuned to localized goals. Imagine a case where the recipient of inference obtained from a mini-model doesn’t agree at all with the model’s first, naïve inference. The recipient gets to supply feedback. If the mini-model is small enough, it can be updated immediately to reflect that feedback. This is not to say that all changes are initiated by a user “pull” and all learning takes place “online”. Suppose that the feedback Fred gave about mini-model A was used to produce mini-model A’. Fred’s feedback may also be relevant to mini-models B and C. It would make sense for the process that builds and maintains mini-models to also be able to learn offline and transfer learnings between mini-models.
Image by author
The diagram above shows a push system in action. A central team starts the development process for a new model and then pushes it to production when it achieves an acceptable level of performance (i.e. it is accurate enough to meet a global goal).
The model is retrained periodically on new data. This allows it to maintain acceptable performance. Periodically a new major version of the model is pushed into production. New major versions improve overall performance.
The diagram below shows results you may expect from a pull system.
Image by author
There are two main differences in this pull system. Firstly, there is no single global "uber-model". Instead there are several "mini-models", each evaluated and deployed separately. Secondly, there is no centralized authority for releasing models. Each model is "published" along with an appropriate performance metric. The consumer of the model gets to decide whether the model's performance is adequate for the task at hand.
Since the local learning task is easier than the global one, some of the "mini-models" should mature faster than their equivalent "uber-model". Some of them may mature slower if they see little data or the underlying process isn't well structured to learn from them.
If you have a mental model of mini-models as individual spreadsheet-like objects randomly dispersed across a computer network, then I have led you astray with my description. There is a definite order to the structure of mini-models. Imagine the biggest possible uber-model – a vast dimensional space spanning many organizations. Each mini-model is aware of the vast dimensional space that it lives within; it merely chooses to exclude a lot of “global” context when learning and supplying inference. It is selective about using global context because too much global context makes models prone to overfitting and/or getting stale. Instead of thinking of the mini-model as a standalone spreadsheet, think of it as a fragment of an ultra-large uber-model, but unlike an actual uber-model, each fragment has a mind of its own. In theory, if the conditions are right to produce an uber-model, the collection of mini-models will closely resemble one. Consider the degenerate case: each node in the network of mini-models chooses to defer to the learnings of its global neighbors. In this degenerate case, the network of mini-models chooses to organize itself as an uber-model.
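To make the structure a little more tangible, here is a hedged sketch (the class and parameter names are invented) of a registry of mini-models that learn incrementally from feedback and defer to a shared global model until they have seen enough local data:

```python
from collections import defaultdict
from sklearn.linear_model import SGDRegressor

# Mini-models are created on demand per segment, updated incrementally from feedback,
# and defer to a shared global model until they have seen enough local observations.
class MiniModelRegistry:
    def __init__(self, global_model, min_observations=50):
        self.global_model = global_model      # the shared "uber" view of the world
        self.min_observations = min_observations
        self.models = {}                      # one mini-model per segment, built lazily
        self.counts = defaultdict(int)

    def learn(self, segment, X, y):
        model = self.models.setdefault(segment, SGDRegressor())
        model.partial_fit(X, y)               # online, incremental update from feedback
        self.counts[segment] += len(y)

    def predict(self, segment, X):
        if self.counts[segment] < self.min_observations:
            return self.global_model.predict(X)   # defer to global knowledge while immature
        return self.models[segment].predict(X)
```

The degenerate case described above corresponds to setting the deferral threshold so high that every segment keeps deferring to the global model.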
In the introduction to this post, I mentioned a connection to human learning. To elaborate on this point, consider the spread of knowledge or inference from human curators. Imagine a large population of people. Within this population there are people that play a curator role. Each curator has a mind of their own – which they use to formulate their own set of priorities and a unique outlook on the world. These human curators share their inference with the rest of the world. They don’t do it in a vacuum. They are selective about taking inspiration from or differentiating themselves against other curators. You can see an individual mini-model as if it were an individual human curator. Each mini-model applies its own deliberately limited outlook of the world to its learning process and to produce inference. In so doing, it chooses how much of its limited outlook is informed by the learnings of other mini-models.
Understanding mini-models as being similar to curators helps in understanding an embodiment of one. PinnerSage is an evolution of the graph neural network. Pinterest built PinnerSage as a specialized graph neural network to deal with their vast dimensional space and many sparse features. Historically, graph neural networks convert these large sparse sets of features into smaller, denser dimensional spaces using the abstract notion of “embeddings”. These embeddings are the common features used by an “uber model”. PinnerSage is a departure from this thinking. Instead of converting the full sparse feature space into a common set of dense embeddings, it clusters the data and then builds separate embeddings for each cluster. Each cluster represents a common set of interests for a collection of users.
PinnerSage is a relatively recent advancement in the field of incremental learning. Incremental learning is not actually a new concept though. You can think of learnings from random forests as a form of distributed incremental learning too. Each tree in the forest has specific interests (a subset of features and data) and a mind of its own. It produces what is considered a weak understanding of the “whole forest” from its limited perspective. To build a complete understanding of the world, the inference from all trees is averaged. This is very much an “uber-model” view of incremental learning as it harmonizes the opposing views of each random tree into a single common view. Once again, drawing on the mental model of the curator, this is like asking a bunch of people to curate items and then combining the results into a single list of the most commonly curated items. Depending on the task at hand, there may be merit in doing this, but it is obvious that by aggregating the results of multiple curation efforts, the process loses its ability to present a true view of the variety of preferences that naturally exist in the world. Boosting is another incremental learning approach that uses a less democratic way of combining the results of multiple learners. The opinions of each learner are weighted to correct for the mistakes of prior learners. Boosting is also an “uber-model” approach as the inference from multiple learners is combined under a global definition of what constitutes right and wrong in a globally dimensioned space.
In a random forest, each learner has random interests. Each learner carries equal weight in the final vote when inference is produced – regardless of how close its interests were to the subject of inference or how strong or weak a learner it was. In a mini-model approach, there is a deliberate effort to scope the interests of each learner down to a particular area of interest (not random, but perhaps clustered). There is also a deliberate effort to align learners with consumers of inference so that consumers receive inference from the learner/s whose learnings are best suited to the goals of the task at hand. In the case of PinnerSage, interests are clustered and each user of the system is aligned to a small subset of clusters whose “interests” most resemble their own.
PinnerSage is a specialization of a Graph Neural Network as an image recommender system. Does this concept of learning in distinct clusters apply more generally? I think it does. I first started to toy with these ideas several years before PinnerSage was created, in the context of predictive maintenance, where I believe it will be useful to carve up a global dimensional space purely as a means of controlling overfitting. In a follow-up post I will provide more detail. For now, I will leave you with an intuitive understanding that draws parallels with the human decision process. You may have seen this before with car or household maintenance. If you ask 3 experts for an opinion on what is wrong and what should be done about it, how many answers do you expect to get? If you are used to getting a single common opinion from multiple experts, I consider you lucky. All too often, I get conflicting advice from experts. Which of the human experts is more right? In a global context, you would have to aggregate their recommendations in some way. In a local context, you might find that each is potentially right in their own particular way and that one of their recommendations may be more right for you given your own local context and goals. If all you saw was the aggregated inference from the 3 experts, you would have no way of unpacking it to expose 3 distinct recommendations – one of which may have been best for you. Intuitively, since immersing myself in this mini-model mindset, I think of the commonly used practice of bagging as a smoothing process that may be absolutely necessary to produce valid inference in a global context, but that at the same time smooths out details that may have been really important in a local context.
All of this talk of mini-models “vs” uber-models may sound like it flies in the face of current best practices of using pre-trained global models for producing inference from text and images. I don’t see it that way. Instead, I see the usefulness of pre-trained models for text and images as strengthening the position that it makes sense to use a different process for local vs global learning. The success of pre-trained models from the likes of OpenAI and Deepmind shows just how rigorous the process of building an uber model should be – if you are going to produce a model that describes global knowledge, it should be trained on the biggest, cleanest, most global source of data imaginable. This is why these models are so successful. However, if you need something that can infer results from text that contains your organization’s particular products, org names and other local jargon, you may still benefit from a local mini-model that doesn’t reinvent GPT-3 but instead decides how best to transfer learning from GPT-3 given knowledge of the local jargon and the task at hand.
This post introduced you to a couple of cornerstones of the concept of building a network of mini-models that are hybrid learners - comfortable learning and producing inference both online and offline, and able to decide when to favor local learning over global knowledge. In the posts that follow I will delve deeper into the nature of inference from mini-models and into specific use cases for them.
Building Influential Agents
Design Considerations
June 7 2022
In any decision process, there are two places where decisions get made: the operating environment where actual work is performed, and the “management” environment where work is monitored and optimized. The lingo of reinforcement learning expresses this well: the environment is responsible for operations. Agents manage the environment using a decision process.
It is tempting to want to collapse the functions of the agent into the purview of the environment – to build a fully-contained smart environment. I encourage you to hold back. It is useful to have a clear logical separation between the functions of the agent and the functions of the environment. Environments are generally inhabited by the likes of ERP systems, MES systems and CRM systems that understand finance, manufacturing, and sales. These systems don’t understand risk and uncertainty and have limited understanding of the broader environment that they are connected to. If you try to build all the smarts for managing an environment directly into the environment itself, you will run the risk of curtailing your management smarts due to limited functionality of the systems that inhabit the environment, and/or land up destabilizing the environment. The risk of destabilization stems from the fact that the range of management decisions is broader, more volatile and subject to more unknowns than the range of operating decisions, so it is generally difficult to retrofit them into existing operations.
When you consider this logical distinction between the duties of the agent and the duties of the environment, you need to consider each of their goals and how they will interoperate. The goal of the agent is continuous improvement through observation, analysis and experimentation. The goal of the environment is stable, dependable operations. These conflicting goals can be a major dilemma. A dilemma that requires careful consideration when designing agents, the systems that operate environments and the linkage between the agents and these systems.
To create a system that learns and is responsive to change, your goal is to create an environment where agents are highly influential. The technology that powers agents understands the decision process and concepts like risk, reward, and uncertainty, but has little domain knowledge. For example, an agent can detect a likely mismatch between supply and demand and determine appropriate actions to address the imbalance, but can’t fully execute on its recommended actions. Instead, it influences actions that take place in the environment.
An agent’s ability to influence is limited both by how “smart it is” and to what extent the environment is equipped to receive and act on agents’ input. For the remainder of this post I will focus on design considerations for building a decision process.
1. Data Integration
The most obvious integration point between the agent and environment is data. All operational systems can produce data and assimilate data. At minimum, agents need read access to the environment’s data. Agents can run queries against databases and make API calls for data. Here is your first design challenge. Agents can be pretty data hungry and don’t have fully formed ideas of what they will need and when they will need it. The current state of the art in data integration still relies heavily on data integration jobs or APIs that are built and maintained by humans.
2. Data Capture
Are all of your business goals and assumptions written down? Is everything you know about your customers and products written down? Often the answer to these questions is “sort of”. The next question is even harder to answer. “Where and in what form are they written down?”
There is important data used to manage and optimize systems that sits outside of the systems used to operate the environment. In many cases the amount of rigor employed in capturing, storing and disseminating this data is low. Agents need access to this data. The design of your decision process is going to have to provide a rigorous way to store, retrieve, display, and edit goals and assumptions. Since goals and assumptions are volatile, your design process will also have to provide a rigorous way to identify and describe new goals and assumptions and changes to existing ones.
3. Calculation
Agents are well equipped for number crunching. The type of number crunching that they do naturally is the domain-agnostic kind. Agents know how to interpret data about loss or rewards. What about the task of calculating performance, rewards and any other data that is used in the decision process? These calculations are domain specific so they generally can’t be invented by the agent.
This is another important design consideration: where and how are calculations run and how are they authored? Can the agent request new calculations on the fly?
4. Actions
Like goals and assumptions, actions are pieces of decision data that you are probably not accustomed to describing rigorously. The design of your decision process will need to make provision for describing actions, evaluating the potential outcome of actions, and invoking actions on the environment.
5. Simulation
Agents are good at designing experiments. Sometimes you may trust your agent to run limited experiments in the real world. When it is not feasible to run real world experiments, agents can use simulations for experimentation. Your agent may be able to build simulations and oversee analyzing the results of simulations. All but the most basic of simulations are domain specific. The business logic for simulation is generally a more rigorous version of the other calculations that you do to monitor the environment. With appropriate upfront design consideration, you can establish a robust calculation framework for defining and running all calculations required for basic monitoring, evaluation, and simulation.
Putting it all together
Agents and the environment are different beasts. To get a decision process to work well with highly influential agents that apply meaningful learning to the environment, there are lots of moving parts that need to come together. The design considerations for each of these moving parts will seem scary at first. They become a lot less scary when you see them as a collection of metadata and data with hooks to the environment.
A decision process is merely a means of obtaining, transforming, and managing data and metadata. Once you can express a simple goal along with its actions, evaluation criteria and data sources in metadata you have the fundamentals in place. You can start with really simple decision logic and then advance to more meaningful recommended actions by doing more rigorous metadata and data transformations.
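As an illustration (every field name below is invented, not a reference to any particular product), a simple goal with its data sources, evaluation criteria, assumptions and actions might be expressed in metadata like this:

```python
# A minimal sketch of a goal expressed as metadata: where its data comes from, how it
# is evaluated, what is assumed, and which hooks into the environment it can invoke.
goal_metadata = {
    "goal": "reduce_late_shipments",
    "data_sources": [
        {"name": "open_orders", "query": "SELECT order_id, promised_date, status FROM orders"},
        {"name": "capacity", "query": "SELECT line_id, available_hours FROM capacity_plan"},
    ],
    "evaluation": {"metric": "pct_orders_shipped_on_time", "target": 0.95, "direction": "maximize"},
    "assumptions": [{"name": "average_changeover_minutes", "value": 45}],
    "actions": [
        {"name": "expedite_order", "hook": "erp.reschedule_order", "params": ["order_id", "new_date"]},
        {"name": "add_overtime_shift", "hook": "mes.schedule_shift", "params": ["line_id", "date"]},
    ],
}
```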
A key prerequisite to automation is metadata-driven query and calculation. To automate a decision process you will need to be able to formulate a query for data from metadata, use the results of this query to build more queries, perform calculations, produce more data and metadata. For closed loop automation, you will translate the results of a decision process into terms that the environment understands so that you can invoke hooks and influence the environment. Once these basics are in place, you can start questioning the extent to which the existing hooks into the environment are holding you back. In a future post I will delve into methods for evolving static hooks into the environment into more dynamic hooks so that agents can broaden their sphere of influence over the environment.
Embedding a Decision Process in a User Experience
May 18 2022
You can analyze any decision using a structured method:
1) Understand the current state of affairs
2) Understand possible actions that could improve or degrade the current state of affairs
3) Estimate the impact of actions on goals and choose the action that is expected to make the best overall impact (a toy sketch of these steps follows below).
Image by author
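Here is a toy sketch of the three steps with invented numbers: capture the current state of affairs, enumerate candidate actions, estimate each action's impact on the goal under uncertainty, and choose the one with the best expected impact.

```python
# Step 1: the current state of affairs (invented values).
current_state = {"stock_on_hand": 40, "weekly_demand_estimate": 55}

# Step 2: possible actions and their estimated outcomes, including uncertainty.
candidate_actions = {
    "do_nothing": {"expected_revenue": 4_000, "probability_of_stockout": 0.6},
    "expedite_replenishment": {"expected_revenue": 5_200, "probability_of_stockout": 0.1},
    "discount_to_slow_demand": {"expected_revenue": 4_500, "probability_of_stockout": 0.2},
}

STOCKOUT_COST = 2_000  # assumed penalty for running out of stock

# Step 3: choose the action with the best expected overall impact on the goal.
def expected_impact(outcome):
    return outcome["expected_revenue"] - STOCKOUT_COST * outcome["probability_of_stockout"]

best_action = max(candidate_actions, key=lambda a: expected_impact(candidate_actions[a]))
print(best_action)
```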
Sounds easy? It is easy if you truly understand the current state of affairs, have a stated goal, and understand how possible actions impact that goal. Now all of a sudden it is not so easy. Why? Uncertainty. It is common with real world decisions that goals are unstated, the current state of affairs is only partially understood, and actions and their impacts are shrouded in the fog of the unknown. Machine learning helps quantify and understand uncertainty, but historical data and brute force analysis only goes so far. Most machine learning applications perform better when they incorporate direct and immediate human feedback. The most efficient way of doing this is embedding the decision process directly into the user experience.
Consider a common web or mobile experience today. The structure of the application is dictated by its “information architecture”. The app is successful when the user understands the UI well enough to locate appropriate items and tasks within its “information architecture”. The onus is on the user to translate their current state of affairs (e.g. my shoes are worn out) and their current unstated goal (e.g. I have $200 to spend and I need a new pair right now) into a series of requests (e.g. men > see all categories > oops not clothing > gear by activity > oops don’t seem to find them here either > search: shoes > category: footwear > mens shoes > size > scroll > scroll > scroll > found something > reviews are good > colour I want is out of stock > goodbye).
Imagine if instead of having to translate my holey shoe state of affairs and goal into the online store’s classification system, and then hunt for something that meets my goal, the software could help me more. In many cases the software could already have a reasonable view of my current state of affairs and goals, and could present a starting point in the experience that was based on me and my current state of affairs. In this case, the app is very unlikely to be able to infer that I have a hole in my well-worn sneakers. I will have to tell it. The freeform search method that I used in the scenario above was quite effective at finding shoes, but it was hopeless at helping me take action and purchase a new pair. The results were a grab-bag of men’s shoes, ladies’ shoes, kids’ shoes, climbing shoes, cycling shoes, winter boots. It was only after I found ways to narrow down the selection (gender, size) that it was worth starting to scroll and find something that I liked.
Image by author
As soon as the application understands that I need shoes, it is reasonable to assume that we could use purchase history to present a better starting point for the experience. The application could reorient itself around a recommendation. When presented with a recommendation, it is useful to include supporting information to help reinforce the decision. In this example, the price, availability, sizing info and key points from reviews are useful supporting info.
In the AS-IS scenario I did lots of scrolling. This was because there were limited ways to narrow down my choices. These limits were a result of the human encoding of each product into a rigid predefined classification system. Human encoding takes time and, in the example above, was quite ineffective.
Modern machine learning techniques provide effective ways to “encode” knowledge about items such as products and actions like purchases into data called “features”. There can be lots of “features”, each representing a different characteristic of the item or action. “Features” provide fine-grained descriptors about the specifics of items and tasks. They are generally more effective than rigid human classification structures, and they don’t all have to be identified upfront when the application is built: features can be inferred on the fly by machine learning and associated with user preferences just as dynamically.
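As a rough sketch of what this kind of encoding can look like, the snippet below turns a handful of made-up product descriptions into feature vectors and ranks them against a preference expressed in plain text. I am using TF-IDF here as a simple stand-in for the richer learned embeddings a real system would build.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative product descriptions; no predefined category tree required.
products = [
    "men's leather casual shoe, pairs well with jeans or chinos",
    "women's waterproof trail running shoe",
    "men's canvas sneaker, lightweight and casual",
    "men's suede derby, smart casual, works with chinos",
]

vectorizer = TfidfVectorizer()
item_features = vectorizer.fit_transform(products)  # each row is a feature vector for one product

# A preference expressed as text lands in the same feature space, so new
# descriptors never have to be added to a rigid classification first.
preference = vectorizer.transform(["men's casual shoe for chinos"])
scores = cosine_similarity(preference, item_features).ravel()
print(sorted(zip(scores, products), reverse=True)[:2])
```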
With a bit of ML finesse, the application’s initial take on my next shoe purchase could be pretty decent. I don’t know of a clothing retailer that is applying a decision intelligence mindset to the purchasing experience, but I can tell you that the clothing ads that Instagram picks for me are often very good. It has access to enough data from enough users and must have built some good “features” that describe the abstract concept of style. For the purposes of this scenario, let’s assume that the initial recommendation is not as good as Instagram’s ads. The application got the recommendation wrong. Boohoo – I am not wearing those! How will the application redeem itself?
In the intro I spoke about the potential to embed a decision process into software so that applications can respond to human feedback immediately and align themselves with user goals. The first bit of feedback that the application responded to was “shoes”. It used this bit of feedback to come up with an action and supporting information, both informed by data derived from my history. The decision process that helped choose this action incorporated other people’s input in the form of their purchase history and reviews. It also included other inputs that were relevant to my goal: I wanted a new pair of shoes pronto.
To continue this scenario, I provide some feedback on the application’s personalized suggestion: “a bit more dressy”. This is where we see the benefit of inferring “features” on the fly over a rigid human classification system. The rigid human classification system is a strict hierarchy: Footwear > Men’s Shoes > Casual Shoes OR Dress Shoes. This classification makes no accommodation for the fact that some casual shoes only work with shorts, others work with jeans and shorts, and others work best with chinos; and, equally importantly, that not all guys are wired the same way. Some are more fussy about shoe pairings than others.
The feedback of “a bit more dressy” tells the decision process that I am one of those more fussy guys. The feedback translates directly into updates of “parameters” that are specific to me and will influence not only the system’s next attempt at reshoeing me, but also future recommendations about shoes and other things. My “fussy” choices are also useful to the system. When it is asked to make recommendations for “equally fussy” users in the future, my actions carry more weight in the decision process than those of non-fussy or differently fussy users.
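Here is a minimal sketch of how a single piece of feedback could nudge user-specific parameters. The two feature dimensions, the items and the update rule are illustrative, not a description of any particular production system.

```python
import numpy as np

# Illustrative item features: [dressiness, sportiness]
item_features = {
    "canvas sneaker": np.array([0.2, 0.9]),
    "suede derby":    np.array([0.7, 0.3]),
    "leather oxford": np.array([0.9, 0.1]),
}

user_prefs = np.array([0.5, 0.5])  # parameters learned for this user so far
learning_rate = 0.3                # how strongly one piece of feedback moves them

def apply_feedback(prefs, item, liked):
    # Nudge the user's parameters toward accepted items and away from rejections.
    direction = item_features[item] - prefs
    return prefs + (learning_rate if liked else -learning_rate) * direction

# "a bit more dressy": the recommended sneaker is rejected, so the parameters shift
user_prefs = apply_feedback(user_prefs, "canvas sneaker", liked=False)
scores = {name: float(features @ user_prefs) for name, features in item_features.items()}
print(max(scores, key=scores.get))  # the next suggestion leans dressier
```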
This post used an online shopping example to demonstrate the principle of embedding dynamic, personalized and goal-driven decision processes directly into a user experience to drive action. The same principle applies in just about every user experience for consumer or business applications. Most applications are driven by predefined, rigid information architectures and data classification systems that are authored and maintained manually by humans. Machine learning provides the opportunity for richer classification of items, tasks and actions. To derive maximum benefit from the ability of ML to produce dynamic features and use them to make improved personalized recommendations, the application UI must be designed accordingly – allowing itself to take shape around an evolving machine-learning-derived structure and reshape itself quickly to meet the personal needs of each individual user.
Investing time and effort to resolve uncertainties in decision making.
April 19 2022
I favour proceeding confidently in the face of uncertainty, rather than waiting to resolve uncertainty before taking action. This is because most of my own business decisions are made in the fast-paced and always uncertain world of high tech. The Rolling Stones said it perfectly.... "Time waits for no one". This means get on with it, "exploit" the current market opportunity. Fail fast, learn and adapt.
There are Decision Intelligence strategies for the patient too. Perhaps the Rolling Stones were in a different frame of mind when they belted out "Time Is on My Side" in 1964. When time is on your side, you can use an "explore" strategy to better understand and resolve uncertainty.
Image by author
When the risks associated with a decision are immense, it makes sense to spend more time getting comfortable with the level of uncertainty. A thorough way of doing this is using A/B testing. When designing an A/B test, you expose two equivalent samples of a larger population to two different variations of an experiment. There should be only one thing different between the two experiments. This way you can conclude that if the two samples of people behave differently, it is because of the one thing that was different between the two experiments. Also, because the population samples of each test are indicative of the population at large, you can assume that the population at large will behave in the same way. A/B testing is used in clinical trials and other places where the decision is highly repeatable and carries high risk.
There are lots of assumptions involved with A/B testing. These assumptions are not always realistic. Are the two populations the same or is there some difference that could explain their different response to the experiment? Are the two populations used in the experiment actually indicative of the broader population? Was there only one difference between the experiments? Are the learnings from the experiment still valid given any changes that happened after the experiments were run?
A/B testing may be seen as the slower/safer route to a decision, but after looking at all of the assumptions and questions above you may wonder just how safe it is and whether there is an alternative. With an A/B test strategy, the "exploration" happens before a decision is made. After the decision is made, the knowledge gained from it is "exploited" to the full. The alternative is to always blend appropriate amounts of "exploration" with "exploitation" so that you can balance risk and learning potential. There is a statistical method called "multi-armed bandits" that provides a rigorous way of choosing how much "exploration" vs "exploitation" you should do at any point in time.
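If you want to see the explore/exploit balancing act in code, here is a minimal Thompson-sampling sketch of the multi-armed bandit idea. The two competing variations and their true success rates are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.05, 0.07]   # unknown in real life; used here only to simulate feedback
successes = np.ones(2)      # Beta(1, 1) prior on each variation's success rate
failures = np.ones(2)

for _ in range(5000):
    # Explore vs exploit in one step: sample a plausible rate for each variation
    # and act on whichever sample looks best right now.
    sampled_rates = rng.beta(successes, failures)
    choice = int(np.argmax(sampled_rates))
    reward = rng.random() < true_rates[choice]
    successes[choice] += reward
    failures[choice] += 1 - reward

print(successes + failures - 2)  # how often each variation was tried
```

Early on the two variations get tried roughly equally; as evidence accumulates, most of the traffic drifts to the better one without ever fully abandoning the other.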
Regardless of whether you use the rigor of statistics or not, I suggest adopting the following practices:
1) Seek to understand the level of uncertainty present when anticipating the outcome of any proposed action.
2) Right-size exploration activities and use them to understand or resolve uncertainty.
3) Work out what you can exploit while exploring.
4) Constantly tweak the amount of exploration vs exploitation based on the results of experiments and feedback from actual exploitation.
To understand these practices from a more practical standpoint, let's return to a quality engineering example where I introduced some useful ways to quantify and visualize uncertainty. How do you use the knowledge gained from quality explorations to tweak the amount of exploration vs exploitation?
Image by author
In manufacturing the bias is to deliver product, not test it. This means that your efforts are mainly "exploit" efforts. Use quality inspection as an ongoing "explore" to quantify uncertainty. Tweak the inspection policy so that there is greater coverage at times where there is evidence to suggest that you need it.
These four practices are a recipe for sustained competitive advantage. Consider any repeatable decision - a decision that you make over and over again. How did you get comfortable with the level of risk in this decision? Do you feel that your decisions are optimal? What makes you feel this way? Do you do anything on a regular basis to validate the knowledge and assumptions that you use to make the decision? How often do you make tweaks? Could a competitor gain advantage by making a more optimal decision? If these questions raised any doubts in your mind, consider how the four practices above could apply to your repeatable decision.
"Digital you" and the metaverse
April 18 2022
There are a wide range of analytic techniques and technologies that lend structure and rigor to the art of decision making. They have been successfully used to make better decisions in areas of business like manufacturing, supply chain and sales. What about the metaverse?
Does the metaverse need Decision Intelligence? It does! In fact, Decision Intelligence may form an important piece of the fabric of the metaverse – the piece that deals with how human goals manifest and are achieved. The groundwork for Decision Intelligence was laid decades ago under the discipline of operations research, but it is the newer term “Digital Twin” that will help explain how Decision Intelligence relates to the metaverse. This is how IBM defines digital twin:
"A digital twin is a virtual representation of an object or system that spans its lifecycle, is updated from real-time data, and uses simulation, machine learning and reasoning to help decision-making."
IBM
When I worked at IBM, I was most concerned with decision processes around virtual representations of IoT devices and how they fitted into a broader system, but the “object” the twin represents could be anything. In the context of the metaverse, the “object” is you. “Virtual you” lives inside a broader system that allows you to interact with other human and non-human “objects”. To tie the concepts together, think of the metaverse containing a digital twin of “you” and Decision Intelligence as something that enables your twin to process its interactions with the metaverse. The metaverse is not just a rich 3D visualization. There is another side to it – the behavioral side. I would argue that this behavioral side is even more important than the visual side. After all, you don’t want your virtual you to be just an empty shell displaying overpriced virtual clothing to other inhabitants of the metaverse. Surely you want it to do things that are useful to the “real you”.
You already have a digital alter ego. You are a “profile” containing a name, a password and loads of information about your internet browsing habits, spread over a large collection of different web applications. The metaverse aims to form a more coherent view of you and unify your interactions with other people and things. Sound scary? It doesn’t have to be. The metaverse is going to understand you much better than the current siloed web apps that track you. Imagine if you get to declare your goals to the metaverse…now everything that it knows about you can be used to help you.
Image by author
In my post about recommender systems, I alluded to the fact that many recommender systems are hampered by not understanding your goals. Think of how this could change when you own your digital representation in the metaverse. This "digital you" becomes your personal agent that helps you achieve your goals. At minimum, your "digital you" should be a really good spam filter – negotiating with various purveyors of ads and helping you see things that are best aligned with your goals. It helps the purveyors too. I will give an example. Ever since I browsed the web gathering competitive intelligence, YouTube has been bombarding me with ads for one of the products I looked at. It is a nice-looking product, but I am not in the market for it at all. I am not annoyed each time I see that ad; I would rather see it than many others. But I do feel sorry for the poor startup company that pays for me to watch it. Currently, there are no digital agents negotiating which ads get to grace which eyeballs, but I would love my digital alter ego in the metaverse to do this for me.
A personalized spam filter was just a teaser. There are lots of tasks I could imagine trusting my digital self with. For a start, I think an automated agent would do a better job than me at keeping track of expenses, making long-range investment and risk projections, scheduling appointments, and even making purchases for things that I buy repeatedly. All of this becomes possible and not so scary if my digital me in the metaverse understands my goals, lets me see and influence any assumptions it is using and can explain its decision reasoning to me.
Automatically detecting changes in assumptions
April 12 2022
In a previous post I wrote about how quality engineers came to terms with the challenges of making process control decisions under uncertainty. I also commented on the fact that their process still requires a lot of human interpretation. I will use the plot below to recap.
Image by author
A manufacturer produces on average 25 units per day and inspects the exact dimensions of 1 percent of the items that they produce. The plot above shows the results of each inspection. This plot demonstrates just how hard it is to run a tight ship when you can only afford to inspect 1 percent of outputs. The target dimension of the item is 100mm. At some point in time, the simulation that I used to produce this data introduced a machine calibration error. This put the process off target by a small amount: 0.07 mm. With the small number of units produced and inspected, there is a high level of uncertainty about how well calibrated the manufacturing process is. It is impossible to tell from the above plot that a change occurred. Let's see what happens if we simulate a bigger calibration error.
Image by author
The above plot shows a clear visual indication that something changed, but the statistical tests still pass. According to the control limits, the process looks fine. The stats used for control charts are very low tech as they were designed for pen and paper. What would happen if we unleashed the power of modern computers on this problem?
The blue points on the plot below are inspection results. The grey lines are alternative viewpoints about what actually happened to produce the evidence shown in the blue dots. Most of these grey lines support the idea that there was a change in size. Each of these grey lines is a plausible view of what happened given the evidence at hand. The amount of discrepancy between the lines shows how much uncertainty there is. As expected, there is a lot of uncertainty.
Image by author
The above plot is a lot more insightful than the low-tech control charts shown earlier. It was produced by magic performed under the covers: a Bayesian model fitted with Markov chain Monte Carlo (MCMC) sampling, using a Python library called "NumPyro".
The magic starts by defining what a change could look like algebraically. A sigmoid is a generalized way of describing a shift in a variable. It describes the "before" and "after" values, when the shift occurred and how quickly the shift happened. A sigmoid is versatile: it can model a step change, gradual linear drift or a non-linear change. This versatility makes it perfect for trying to describe and quantify the unknown. The plot below shows a sigmoid representation of what could have transpired in the fictitious factory whose simulated data we are looking at.
Image by author
The MCMC uses available data to imagine, via simulation, many different permutations of possible parameters of the curve. Each one of these simulations represents a plausible possibility for when in time a possible shift started, how long it took to play out and what the starting and ending mean values were.
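For readers who want to experiment, here is a minimal sketch of what such a sigmoid change-point model could look like in NumPyro. The priors, parameter names and simulated inspection data are my own illustrative choices rather than the exact model behind the plots above.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

# Simulated inspection results: a small shift in mean size partway through the series.
t = jnp.arange(100.0)
y_obs = 100.0 + 0.5 * (t > 60) + 0.4 * random.normal(random.PRNGKey(0), (100,))

def change_model(t, y=None):
    mu_before = numpyro.sample("mu_before", dist.Normal(100.0, 1.0))  # mean size before the shift
    mu_after = numpyro.sample("mu_after", dist.Normal(100.0, 1.0))    # mean size after the shift
    t_change = numpyro.sample("t_change", dist.Uniform(0.0, 100.0))   # when the shift started
    rate = numpyro.sample("rate", dist.HalfNormal(1.0))               # how quickly it played out
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))             # measurement noise
    weight = 1.0 / (1.0 + jnp.exp(-rate * (t - t_change)))            # the sigmoid itself
    mu = mu_before + (mu_after - mu_before) * weight
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=y)

mcmc = MCMC(NUTS(change_model), num_warmup=1000, num_samples=2000)
mcmc.run(random.PRNGKey(1), t=t, y=y_obs)
mcmc.print_summary()  # credible intervals for the change point and the before/after means
```

Each posterior sample of these parameters corresponds to one of the grey lines described below.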
The resulting plot of all of those grey lines gives an intuitive understanding of the various possibilities that could have transpired and reveals patterns visually. The algorithm also produces some really useful summary statistics.
Image by author
These summary stats describe the range of values of the parameters of the curve. The first parameter provides an estimate of when the nasty shift in the manufacturing process took place. Given the high level of uncertainty present in the data, the spread is understandably large. With 90 percent certainty, the process went awry between the 9th inspection and the 70th. With more certainty it tells you that it thinks the original size was between 99.18 and 100.47 and the mean size after the change was between 99.78 and 101.15.
For any quality engineers out there, I am sure that you can appreciate what the power of this modern computational process reveals vs. traditional statistical process control. It doesn't remove uncertainty from the decision process, but it does a pretty decent job of explaining it.
I used a quality example in this post, but the thinking applies to any decision process. If you have a small amount of data that describes some observations of business behavior, you may be able to use techniques like the one outlined above to quantify the level of uncertainty and generate plausible ways to accommodate what you don't know in your decision process. You may also be able to monitor these hard-to-measure assumptions and detect possible changes - looking for important clues about shifts in probabilities and rewards that are material to decision making.
Acknowledgements:
The inspiration for this post and the method of using estimated parameters for a sigmoid function to detect change came from Andy Kitchen.
Reinforcement Learning for Business Decisions
April 11 2022
The most commonly used machine learning techniques in business are regression and classification. Regression and classification models are useful for predicting how a business will respond when exposed to some form of change. These models recognize patterns in previously observed data and are useful as predictors as long as 3 conditions are met:
1) there is enough historical data to observe all of the patterns
2) the models are suitably conditioned to understand the difference between pattern and noise
3) the patterns don't change after the model was created
For many business decisions, the harsh reality is that there is seldom enough data available to identify all of the relevant patterns, the effects of noise can't be isolated, and patterns are often subject to change without notice. Reinforcement Learning techniques are useful because:
1) They allow the decision process to learn on the fly so there is no need to discover all patterns up-front.
2) They are tuned to respond to changing patterns.
Consider the decision model below:
Image by author
The diagram above tells a very familiar story in business decision making. The blue circles are business states. The present state is what we will call "business as usual" because the business is operating normally. Businesses generally stay in this state until some form of conscious or external action moves them out of it. The red circles are chance nodes. The numbers are probabilities. The probabilities associated with the bottom red circle are consistent with the fact that if there is no action, the most likely outcome is that the business will continue operating in "business as usual" mode. It would take some external change (like a competitor introducing a new product or a competitor going out of business) to either transition to a less desirable regressed state or a more desirable future state.
The numbers associated with the top red chance node are the probabilities that describe the likely outcome of an action. The decision about whether or not to take action all comes down to the probabilities and rewards associated with the action. Businesses generally operate in steady state. Changes are introduced carefully as they cost money to implement and may carry risk. In the example above, this risk is marginal, but there is still a strong argument to stick with "business as usual". This is because the change being evaluated has only a 15% chance of advancing the business to a desired future state. The most likely outcome of taking the action is that the business will incur the cost of the action, but the action won't significantly advance them beyond the current state.
The diagram below highlights the key assumptions that impact the decision in question. If these assumptions are wrong then the decision is suboptimal. If a competitor is about to release a new product then the assumed probability in the bottom left is wrong. The 0.005 should be much higher. If there is a more proven way to achieve the desired future state, the 0.15 in the top right is misleading and is biasing the decision to non-action.
Image by author
The probabilities 0.005 and 0.15 are key uncertainties. They warrant the most attention in the decision process. A classification model could be a good way to obtain initial estimates for these probabilities, but as I stated in the intro, it is generally unsafe to assume that the probabilities were "correct" at the time that the classification model was built and that they haven't changed since then.
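To make the sensitivity concrete, here is a rough sketch that plugs those two probabilities into a simple expected-value comparison. Only 0.005 and 0.15 come from the example; the reward values, the action cost and the remaining probabilities are hypothetical numbers chosen for illustration.

```python
# Assumed payoffs for landing in each state, plus an assumed cost of acting.
rewards = {"regressed": -100.0, "business_as_usual": 0.0, "future_state": 100.0}

def expected_value(p_regress, p_advance, cost=0.0):
    p_stay = 1.0 - p_regress - p_advance
    return (p_regress * rewards["regressed"]
            + p_stay * rewards["business_as_usual"]
            + p_advance * rewards["future_state"] - cost)

stick = expected_value(p_regress=0.005, p_advance=0.005)          # business as usual
act = expected_value(p_regress=0.01, p_advance=0.15, cost=20.0)   # take the action
print(round(stick, 2), round(act, 2))  # sticking wins under these assumptions

# If the do-nothing regression risk is really 0.10, or a more proven path offers
# a 0.40 chance of success, the same arithmetic now favours taking action.
print(round(expected_value(p_regress=0.10, p_advance=0.005), 2))
print(round(expected_value(p_regress=0.01, p_advance=0.40, cost=20.0), 2))
```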
Reinforcement learning gives us a set of techniques for systematically questioning and improving assumptions so that we don't make suboptimal decisions using stale assumptions. There is however a catch. Standard reinforcement learning techniques work amazingly well when the business takes actions often and receives immediate feedback on those actions. This is in direct conflict with what I wrote earlier about businesses operating in steady state conditions. With most business processes, there is almost certainly going to be a lag between when an action is taken and when the results of the action are observed. The low frequency of business change and the lag between action and observed feedback make it hard to apply reinforcement learning verbatim to business decisions.
The above highlights the challenges in using regression, classification and reinforcement learning techniques in business decisions. These challenges don't imply that you can't use these techniques - just that you can't use them blindly! As I repeat often, decision intelligence requires a holistic view of the decision process and a thorough exploration of the facts and assumptions along with logic derived from knowledge and inputs based on human intuition.
A good way to approach the example above may be to use a classification model to arrive at starter probabilities and then use reinforcement learning principles as a systematic framework to encourage the practice of challenging assumptions regularly. There are a number of formal and semi-formal techniques that can be applied inside the framework. These include:
1) Using anomaly detection to detect change (see the sketch after this list). If the business generally operates in "business as usual" mode, anomaly models can detect when it is no longer operating as it did before.
2) Using simulation to describe the world under various assumptions and then testing whether the world is consistent with the simulation.
3) Engaging humans in the process - asking them questions that test assumptions against their intuition.
4) Building a dedicated classification model to assess the likelihood that an assumption has changed.
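Here is a rough sketch of the first technique: fit an anomaly model on "business as usual" observations and flag new observations that no longer look like them. The data is simulated and IsolationForest is just one convenient choice of model.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Two illustrative operating metrics observed while the business ran as usual.
business_as_usual = rng.normal(loc=[100.0, 50.0], scale=[2.0, 5.0], size=(500, 2))

detector = IsolationForest(random_state=0).fit(business_as_usual)

new_observations = np.array([
    [101.0, 52.0],   # still looks like business as usual
    [112.0, 20.0],   # does not: a prompt to re-examine the assumed probabilities
])
print(detector.predict(new_observations))  # +1 = normal, -1 = anomaly
```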
The example below shows how this holistic approach applies to maintenance decisions. Imagine a critical piece of machinery. Downtime associated with failure carries a very large penalty. Premature replacement is wasteful. Replacement may offer a reward of upgraded performance. The dominant uncertainty around this decision is how likely it is that the equipment will fail. There is a secondary uncertainty around the likelihood that replacement will offer performance improvement.
Image by author
The primary uncertainty demands utmost attention in the decision process. A classification model is a great way to quantify uncertainty and discover the factors, like age, that contribute to or explain increasing failure probability. Anomaly models can help provide evidence that the equipment isn't behaving normally. Simulations are an excellent way to understand known failure modes and compare data associated with healthy equipment with what you expect unhealthy equipment to look like. Generally, equipment doesn't fail that often, so actual failure experience takes a long time to accumulate. By wrapping a number of these techniques inside a reinforcement learning framework that tweaks the probabilities dynamically, you stand the best chance of producing useful repair recommendations.
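As a sketch of how the dynamic probability tweaking could work, the snippet below applies a simple Bayes-rule update to the failure probability as anomaly evidence accumulates, then weighs it against the cost of downtime. The starting probability, likelihood ratios and costs are all assumed for illustration.

```python
def update_failure_probability(p_fail, likelihood_ratio):
    # Bayes update on the odds: how much more likely is this evidence
    # if the equipment is heading for failure than if it is healthy?
    odds = p_fail / (1.0 - p_fail)
    odds *= likelihood_ratio
    return odds / (1.0 + odds)

p_fail = 0.02                    # initial estimate, e.g. from a classification model
for ratio in [1.0, 3.0, 5.0]:    # successive anomaly readings, increasingly abnormal
    p_fail = update_failure_probability(p_fail, ratio)
    print(round(p_fail, 3))

downtime_cost, replacement_cost = 500_000, 50_000  # assumed penalties
print("replace" if p_fail * downtime_cost > replacement_cost else "wait")
```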
This post introduced you to the principles of placing multiple methods for estimating assumptions into a reinforcement learning framework. This approach provides systematic methods for challenging and re-estimating assumptions. It aims to overcome the problems of relying on potentially stale assumptions.
Making "Good" Recommendations
April 5 2022
As a consumer of web content and online shopper, you probably see the outputs of "recommender systems" at least daily, e.g. YouTube recommends what to watch, Amazon recommends what to buy. I am sure that you find some of the recommendations produced by these systems "better" than others. In this post I will talk about how to evaluate automated recommendations and compare "consumer recommender systems" with systems that help people make business decisions.
So far I have used the subjective terms "good" and "better" as criteria for evaluating recommendations, but what separates "good" from "bad" when it comes to recommendations? Most ML techniques involve a learning objective that is backed up with a tangible measurement. In the case of consumer-style recommender systems, that metric is typically something called MAP@K. It is a numeric indicator of the "relevance" of the recommendations that the system comes up with.
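For the curious, here is a minimal sketch of how MAP@K is computed. The ranked recommendation lists and the "relevant" items the users actually engaged with are made up.

```python
def average_precision_at_k(recommended, relevant, k):
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank           # precision at each rank where we hit
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k):
    pairs = list(zip(all_recommended, all_relevant))
    return sum(average_precision_at_k(recs, rel, k) for recs, rel in pairs) / len(pairs)

recommendations = [["a", "b", "c"], ["d", "e", "f"]]  # ranked lists shown to two users
engaged_with = [{"a", "c"}, {"f"}]                    # what each user actually clicked or bought
print(map_at_k(recommendations, engaged_with, k=3))
```

The metric rewards putting relevant items near the top of the list, which is exactly what "relevance" means here, and nothing more.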
My personal favorite content recommendations come from a piece of software called Roon. Roon plays music. I listen to varied music. I like to discover new music. When an artist that I like releases new music, I like to hear it. Roon seems to understand me pretty well and I trust Roon to play music that is "relevant" to me. Like most content recommender systems, Roon suggests music by understanding what other people who listen to similar music are listening to. It probably builds a lot of assumptions about me and my listening habits and uses those assumptions to personalize recommendations. Many content recommender systems do this too - so why does Roon stand out as "better"?
I believe that the secret to Roon's success is that it gathers a lot of feedback on recommendations and uses this feedback to alter its assumptions about me. This allows it to personalize better and allows it to adapt fast as my preferences change. Collecting feedback from users can annoy them, so Roon went out of their way to do it in a non-intrusive way. For example, if you skip one of its recommended songs, it will serve up something else immediately and then allow you to describe why you skipped if you want to. It does this without nagging.
When Roon gathers feedback, it figures out whether I didn't want to listen to the track because I don't like it, because I didn't feel like listening to it at the moment, or because it didn't fit the "theme" of music that it is currently playing. I credit Roon's ability to please me with its music recommendations to the finesse they designed into the software to unobtrusively gather feedback and use it to update assumptions.
Roon's design finesse and their determination to not just make assumptions, but also validate and correct them are lessons that any application making recommendations can take inspiration from.
Now let's consider recommendations about business decisions - things like when to maintain equipment, how much and when to inspect goods, and what to invest in. The first thing we will do is go back to the original question about what constitutes a "good" recommendation. "Relevance" is necessary but not sufficient as a criterion for serving up recommended business actions. So what are sufficient criteria? If you read my prior posts, you will have noticed that I talk a lot about goals. When making recommendations about business actions, of course these actions must be relevant to whoever is receiving them, but they must also be consistent with achieving some or other stated goal.
In this post I stressed the importance of feedback as a mechanism to validate and update assumptions. I also spoke about the UX concerns involved with collecting feedback without getting in the way. I covered the importance of goals too - and how it is an understanding of goals that distinguishes recommendations about business actions from content recommendations. In subsequent posts we will explore more of the anatomy of Decision Intelligence apps and how this anatomy makes them naturally equipped to understand and use goals and assumptions as part of a decision process.
Getting Organized for Decision Intelligence
March 31 2022
This post shares observations from my interactions with over 100 organizations about their initiatives to make better use of data and AI. The organizations that I spoke to covered various sectors: manufacturing and industrial, energy and utilities, software and technology services, financial services, and health care.
My main role in these discussions was talking about the applications of AI to achieve various business goals, but outside of the AI itself, there was another common thread that kept on coming up in discussions: how to organize to best "operationalize" learnings from ad hoc data science studies.
I learned that many organizations were treating data science as a standalone function. This is what a typical org structure looked like:
1) Organizations had established a new data science function and staffed it as a separate team.
2) Another team or teams provided the infrastructure around data: data lake / warehouse and BI capabilities.
3) Other teams provided applications and backend services needed to deliver applications to internal users and/or end customers.
Pains
This organizational design does not promote efficiency in either building AI models or getting these models "operationalized" into business processes . The various siloed org units shown above are expected to come up with a shared view of the business goals, prioritize them, figure out the analytical techniques to deliver on those goals and then try to add AI capabilities into existing applications or create new AI-powered applications.
One of the most interesting discussions I had on the subject of organizational design was when I was asked to facilitate a "birds of a feather" session at a VDI conference. In this session we focused on barriers to successful use of AI with a diverse group of manufacturing and maintenance engineers from multiple organizations. The consensus of this group was that the ad hoc communication and prioritization across the siloed teams of data science and applications delivery was the number 1 barrier to AI adoption in their orgs.
A better way to deliver Decision Intelligence
First off, to maximize business impact, any use of analytics or technology should be organized around a business goal and not around a technology or data.
Secondly, for efficient prioritization and delivery, the team that is going to satisfy that business goal should have all of the analytics and technology skills it needs.
In other words, you need a cross-functional team. The members of this team need 3 distinct sets of skills: domain expertise ; AI / data analytics expertise and application delivery expertise.
Domain Expertise
If you look back at the traditional view of a separate data science team and separate applications teams, you will notice that there is no mention of domain experts. There is an assumption that the data science and applications teams will find domain expertise wherever it is present in the organization and harvest that knowledge. There is a big problem with this assumption. The domain experts have day jobs that take precedence over having their brains tapped for knowledge by data scientists and application developers. That is not the only big problem. Who is prioritizing the work? Domain experts should be the ones prioritizing, as they are intimate with the goals, pains and current business process. It is hard for a team of data scientists to rely on ad hoc input from domain experts for prioritization.
Assemble around the goal
The first step in organizing teams to deliver AI and applications that improve decision making is dedicating at least one domain expert per major business goal, then forming a team around each goal. Add appropriate analytics and technology skills to support the domain expert in achieving the goal.
Analytics Platform
If you have lots of different goals, you need lots of small teams. Each team will comprise domain expert(s), data science skills and application delivery skills. You can't have each of these teams reinventing the wheel when it comes to how data is organized and how applications are delivered, so you need another separate team that looks after common best practices and componentry for both analytics and UI.
Technology infrastructure is separate from all of this
The Analytics Platform is focused on the common best practices for adding analytic value not the fundamental basics of the infrastructure needed for things like data storage, compute, networking and security.
Start with a small project that serves an important goal
A number of organizations that expressed frustration at not getting enough productive benefit from AI had started with a data-first approach. The team was trying to answer the question "what can we do with this data?" and generally concluding "not as much as we hoped". Often the team's initial findings were reports of data quality issues - not insights or production-ready models. By putting a major business goal first and dedicating a domain expert to the team, it is much easier for the team to scope a small and manageable quick win.
If your teams are operating in silos or working with a data-first mindset, it may be worth trying at least one experiment with a cross-functional team focused on a clear business goal. Place business domain knowledge, analytics and technology skills in the team and challenge them to deliver a quick win.
The perils of too much automation too soon
March 31 2022
In a previous post I stressed the importance of considering data, knowledge and intuition when making decisions. I mentioned that you shouldn't be tempted to automate a decision process just because you have automated the means to collect fact data. I also noted that the level of automation of a decision is a decision in its own right, so it needs full consideration of data, knowledge, logic and intuition.
If you automate a decision process without considering all of the above, it could have disastrous consequences. I apologize for the macabre nature of this example, but it is the best example I have. I am sure that most readers are familiar with the case of "sensor failures" in the 737 Max aircraft. The New York Times offers a thorough account of the problem and how they believe it happened.
I am going to restate some of what the New York Times reported in the context of a Decision Intelligence scenario.
The design goal of an automated anti-stall system was safety. It aimed to save lives by automatically enabling an anti-stall maneuver when it detected a stall. This is a totally valid goal. The system operated on "fact data" describing a "nose angle" measurement delivered from a sensor. The system interpreted sensor data in real-time to assess the probability that the aircraft was in a stalled state. There was a fundamentally flawed assumption in this process: it assumed the fact data was reliable.
If you search, you will find many references blaming the crashes on a faulty sensor. With my Decision Intelligence lens, I don't blame the sensor for the crashes at all. I blame the fact that the level of automation of the process was too high for the data that it had to operate under. The sensor is not to blame because no sensor can ever be assumed to be infallible. The decision process failed to take assumptions about sensor health into consideration.
Other than the inclusion of a second sensor, I don't know how Boeing addressed the issue, but I will offer an opinion on some of the considerations that my Decision Intelligence mindset would have wanted me to explore in the design of such a system.
1) I would want to better align the decision with the goal. I would do this by evaluating potential actions based on the probability of saving lives - not the probability that the aircraft is in a stall. With this criterion in mind, the decision logic would look quite different, as it would incorporate more data and knowledge into the process to allow for a more thorough understanding of the probable end state, i.e. the decision process would try to estimate where in space the aircraft would end up with or without the action and then judge which possible state would minimize loss. Altitude is an obvious piece of data that I would have wanted access to as part of the process.
2) I would want to allow the pilot to contribute their knowledge and intuition into the process too. I can understand that they didn't want to slow down the process by relying on a mandatory pilot input, but I would have still looked for ways to incorporate pilot feedback into the process.
3) To fully acknowledge the fallibility of fact data, I would want to consider false negatives from sensors too. The crashes were caused by false positives, where the sensor over-stated the nose angle of an aircraft that was not in a stall. There is another possible state: the aircraft is in a stall, but the nose angle reported by the sensor doesn't reflect this.
The New York Times raised another important point about the organization and process used to perform automation:
"many of the employees say, they didn’t recognize the importance of the decision. They described a compartmentalized approach, each of them focusing on a small part of the plane. The process left them without a complete view of a critical and ultimately dangerous system."
In a subsequent post I am going to talk about the cross section of skills needed for decision automation and what this means for organization and process.
Keeping tabs on assumptions
March 30, 2022
You used a decision model to help you make an important decision. You identified and described a number of assumptions as part of this process. You are now acting on the decision. If your logic was sound you are on track to achieve your goal.....except if your assumptions were wrong or something has changed that invalidates one or more assumptions. There is a community of people that understands the need to make and test assumptions really well. Who are they? Quality engineers.
In this post we will explore some of the disciplines that quality engineers have developed over decades to keep tabs on assumptions, as many of their practices translate to any domain.
Organizations that manufacture components or finished product generally have to conform to agreed-upon quality standards - often expressed as Defective Parts per Million (DPPM). Manufacturing processes are designed to conform to this quality standard, but if anything goes wrong with a process and it is not performing quite as designed, quality can slip. How does this relate to assumptions and goals? The goal is to achieve the DPPM target. This DPPM target is achievable as long as numerous assumptions hold.
Keeping tabs on assumptions is tough for quality engineers because of the sheer number of things that can go awry with suppliers or the manufacturing process. It is made even tougher because they have to operate under conditions of high uncertainty. This uncertainty arises because it is not viable to thoroughly inspect every item that flows into the production process and every item as it moves through parts of the process. It is as if they are navigating in fog - making best use of the facts they can collect to understand what might be happening with everything that they don't know . To help them deal with the fog of uncertainty, engineers developed various statistical methods that they use to estimate incoming and outgoing quality levels and to decide when to take action to correct a possible manufacturing process issue.
Statistical quality control is a discipline that was honed to make quality decisions in the presence of uncertainty. It was developed decades ago before data collection was automated and compute resources became readily available - so it is really quite simple. It all revolves around the familiar concept of probability. Any variable that is subject to random variation can be described by a statistical distribution. Any observed values that fall in the extreme bounds of the assumed statistical distribution are used as possible red flags to suggest that the actual observed values may not conform with the assumptions.
Image by author
The plot above shows quality test results for a manufactured item. A random sample of items is inspected and measured. This plot shows the individual measurements along with the mean, upper control limit and lower control limit.
The upper and lower control limits are calculated using a simple formula. I calculated the control limits above based on 3 standard deviations. 99.7 % of values are expected to fall within these limits. This six sigma study guide is a good reference if you are looking to understand more about the use of standard deviations to set limits.
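For those who prefer code to formulas, here is a minimal sketch of that three-standard-deviation calculation; the inspection measurements are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
measurements = rng.normal(loc=100.0, scale=0.4, size=60)  # sampled inspection results (mm)

mean = measurements.mean()
sigma = measurements.std(ddof=1)
upper_control_limit = mean + 3 * sigma
lower_control_limit = mean - 3 * sigma

# Points outside the limits are the red flags described earlier.
out_of_control = (measurements > upper_control_limit) | (measurements < lower_control_limit)
print(round(lower_control_limit, 2), round(upper_control_limit, 2), int(out_of_control.sum()))
```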
The first discipline that quality engineers learned about keeping tabs on assumptions is that if the assumption is important to the goal, you need to observe it regularly and you need to be consistent in the way that you observe it. In the case of quality, this meant establishing an inspection policy that describes how much to inspect, when to inspect and exactly what to measure when inspecting. If this is all you do to keep tabs on important assumptions, it is often enough.
The quality engineering disciplines used for observing quality assumptions go far beyond this example. There are different styles of plots and different calculation methods depending on the nature of the production process and the nature of the quality testing performed. These different methods are well documented. Some of these methods are applicable to assumptions outside of quality, but the main thing that you should take from this is that there is no one size fits all technique for observing assumptions. You need to pick observation methods that are in line with the nature of the assumption and the importance of the assumption to your goals. Quality engineers went to great lengths to understand and codify assumptions because quality assumptions are critical to achieve the goal of staying in business.
Quality engineers went a step beyond merely observing and recording results methodically; they also developed methods of interpreting results. The upper and lower control limits in the plot above are examples of interpretation methods. If you are keen to see more examples of interpretation methods, take a look at the Western Electric rules. As with any uncertain assumption, it is not possible to categorically state from this plot whether the item size assumption is valid or not. Structured methods help with the consistency of interpretations - especially when different people have to interpret the same numbers.
Previous posts went into how the Decision Intelligence mindset encourages you to identify and document assumptions rigorously. This post stressed the need to also check important assumptions regularly and methodically, because a change in an assumption may signal the need for a new call to action. Each assumption is different and may need its own methods for checking and interpreting - there is no one size fits all.
Be wary of rule-based decision logic
March 29 2022
A consistent theme in my writing is the importance of assumptions in decision making. Not only is it important that you identify and document assumptions, you also have to be careful about how you use them. As useful as assumptions are to the decision process, they have an uncanny ability to be wrong. This means that when you use assumptions inside a decision process, that decision process should never assume that they are cast in stone.
Even the most non-technical of readers will have seen flow charts like the one below. Flow charts like these are an excellent way to describe non-volatile logic that is based on known factual data.
Image by author
Are decision processes ever based entirely on fact? Typically not! They are based on assumptions.
Are decision processes non-volatile? Typically not, as assumptions, almost by definition, are subject to change without notice.
If you were to describe a decision process in a rules based manner using assumptions as pseudo-facts like the flowchart above, you would have to revisit and rework the flow each time the assumptions change. When I gave an example of a decision in the previous post, I used another method to describe the process - something known as a Markov Decision Process. You can't use a Markov Decision Process for every decision, but it is a good example of how you should be thinking about describing decision logic and assumptions in a flexible and extensible way that accommodates change naturally and only ever has to be overhauled if there are major changes.
Image by author
The decision model above is flexible because you can add more states without restructuring it. The assumptions are expressed as discrete pieces of data so that they can readily be changed without restructuring too. These two factors together help minimize the complexity of the decision process and make it easy to automate. Once you have put the infrastructure in place to deal with describing states, probabilities, rewards and discounting you can automate many decision processes using the same infrastructure.
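To make the contrast concrete, here is a hedged sketch of what "assumptions expressed as discrete pieces of data" can look like, using the customer-count decision from that post. The 9% and 77% outcome chances come from the original example; the regressed customer count and the exact probability split are assumptions made for illustration.

```python
# action -> {resulting customer count: probability}; every assumption is plain data.
transitions = {
    "retain_feature":    {3: 0.99, 1: 0.005, 5: 0.005},
    "phase_out_feature": {1: 0.09, 3: 0.14, 5: 0.77},
}

def expected_customers(action):
    return sum(count * p for count, p in transitions[action].items())

for action in transitions:
    print(action, round(expected_customers(action), 2))

# Revising an assumption means editing a probability above; adding a new possible
# state means adding a key. The surrounding logic never has to be restructured.
```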
Contrast this with the flowchart, where each new state required in the decision process may manifest itself in several different places in the flow, and where assumptions can be used in an open-ended way as arbitrary pseudo-fact data. The flowchart is more general, but this generality comes with additional complexity. When automating a process using a rules-based model like this you pay for this complexity many times over: when describing logic; when implementing the data and processing infrastructure to carry out the decision process; and when you need to change the decision process.
Decision Intelligence vs Failure
March 28 2022
When using a Decision Intelligence mindset, your aim is to take "Fail Fast, Fail Often" to its logical conclusion: "Fail Continuously", as this is what drives "Continuous Learning" and "Continuous Improvement". After you reach this plateau of acceptance of failure you will wonder why organizational practices were so obsessed with failure in the first place. I put it down to the fact that everybody wants to be successful and failure is wrongly perceived as the opposite of success.
To get over this obsession with failure, we need a more enlightened view of success. I like this one:
"When you realize that, you free yourself from the fear of failure. In life, failure is inevitable. And the best leaders learn from their failure. They use it as a tool to become successful."
Arianna Huffington
A Decision Intelligence mindset encourages decision makers to apply data, knowledge and intuition inside a logical decision framework that allows them to constantly prioritize actions. The aims of the process are to maximize the potential for success and avoid being crippled by fear of failure.
Consider a software product decision: You could reduce the operating costs for a software product significantly by phasing out an existing feature that requires specialized infrastructure. There are existing customers using this feature.
Are you thinking, "phasing out a feature that people are using is a recipe for failure"? Before we allow fear of failure to get in the way, let's consider what success looks like.
Let's say success is measured by customer count. The goal is to capture more of the market. We are considering a binary decision: Should we phase out the feature? Yes or No.
The blue circles in the diagram below are the current state and possible immediate future states that could manifest as outcomes of the decision. The current customer count is 3. The top red circle is called a chance node. It describes the various possible outcomes of retaining the feature. There is a good chance that nothing will happen as an immediate result of retaining the feature, as the chance node points back to the initial state. Doing nothing shouldn't cause an immediate loss or influx of customers.
The bottom red circle is a chance node that describes the possible outcomes of phasing out the feature. Some of the operating cost reduction derived from dropping the feature will be passed on to customers as a price cut. The most likely outcome of phasing out the feature is a net increase of the number of customers from 3 to 5.
Image by author
The example above shows how the decision process is oriented around a goal and how it encourages full exploration of a range of possible outcomes. The decision maker uses estimates of the probability of each outcome to make a decision. The decision maker can also assess the risks associated with the decision. In the example above there is a 9% chance that the decision to phase out will significantly compromise the goal and a 77% chance that it will result in net improvements.
This example is a good way to recap prior thoughts on facts, assumptions, knowledge, intuition and logic. There is only a single fact in this decision model. The current customer count is 3. The logic is supported by assumptions about probabilities. These assumptions may be informed by historical facts, knowledge or intuition. The rigor that went into formulating this model was useful as it provided a structure by which to explore and document assumptions. The process of identifying, describing and challenging assumptions is just as valuable as the decision model itself, as this process encourages creative thought about how to achieve desirable outcomes and fosters an in depth understanding of the risks involved.
Conflict in Decision Making
March 27 2022
Meetings and other organizational structures that support decision making may somewhat impolitely be compared with an agar-filled petri dish in that they are a perfect breeding ground for conflict. Decision Intelligence is not going to "cure" conflict, but it provides a framework to understand it. When conflict is well understood by conflicting parties they are less likely to be emotionally drawn into the conflict and more likely to act decisively in spite of it.
Conflict can cripple decision making and render the decisions that do get made ineffective, by means of a phenomenon that Ron Ashkenas called "decision spin". Ashkenas has some sensible tips to tackle decision spin, and Decision Intelligence is a helpful analytical tool to back these tips up.
Letting people be heard
I have witnessed a lot of conflict that starts with people arguing about alternative solutions to a problem as the first order of business before actually talking about the problem. When subscribing to a Decision Intelligence mindset, before you allow people to start selling their proposed decision or debating alternatives, you should give people a solid chance to describe their goals, bring forward facts that may be relevant to the decision, raise assumptions that they have made, and express what their intuition is telling them. Once all of this is on the table, if there are different goals expressed, somebody is going to have to declare a common goal or an effective way to trade-off conflicting goals against each other. If everybody gets to invent their own goals, no amount of Decision Intelligence will help get things on track.
Reminding people not to take things personally
People get upset during the decision process when they attribute differences of opinions about a proposed decision to things like: others aren't listening ; others are ignorant ; others are short sighted ; others have their heads in the cloud ; others don't like me. Differences in understanding of goals, assumptions, facts and intuition are easier to rationalize than any of the aforementioned personal responses. Let's take the most contentious of these as an example: goals. When decisions cross org units it is common that they impact conflicting goals. Finance is legitimately trying to reduce cost. Customer Service is legitimately trying to improve the customer experience. When Customer Service and Finance are at odds over a decision that involves spending, things could easily get heated. By applying a Decision Intelligence mindset, it becomes obvious to all parties that there is more than one goal. By acknowledging the validity of both goals and describing goals in tangible terms each party is given the tools to think through the tradeoffs.
Discussing Pros and Cons
How many biased pros and cons lists have you seen? Lists with 10 pros and 1 con. Lists where the cons are dressed up as pros in disguise. In my experience merely listing pros and cons doesn't resolve conflict, it feeds the agar in that metaphorical petri dish. The Decision Intelligence mindset encourages people to express things that would have traditionally been described as pros and cons in terms of expected impact on a goal. This means that they get described in terms of data and decision logic....not just bullet points on the petri dish.
Setting Limits
Ron Ashkenas spoke about setting time limits for debates. A Decision Intelligence mindset encourages thinking about the tradeoff between the cost of making the decision vs the value derived from the decision. When decision spin takes place, the cost of making decisions rises and the value derived drops. Setting time limits for debates helps, but also consider limits on everything else: people involved in the decision, number of meetings, amount of data collected. Not only do these limits control the damage by reducing time spent fighting, they also ensure that when there is conflict, it is over something that really matters.
Unilateral changes
Ron Ashkenas advocates that decisions don't get changed unilaterally. I see where he was coming from. It is highly frustrating for the participants of a long, drawn-out and conflict-ridden decision process to have the results of the process overturned in a way that undermines what everybody fought for. What I don't like about this statement is that it implies that each change in a decision could involve another long, drawn-out and conflict-ridden decision process. This is how a Decision Intelligence mindset addresses change:
When a decision is made, it is with the expectation that this decision will contribute to a goal. If at any time after making the decision, the goal changes or it appears that the goal is not being met, a new decision should be considered. That doesn't mean that a new committee needs to be established to debate a possible decision. It means that the facts, assumptions and logic used to test the original decision should be re-evaluated as quickly as possible, using as few people as possible and as little new data as possible.
Data: Facts and Assumptions
March 26 2022
When trying to describe the past, facts are of primary importance. Line A produced an average yield of 85% vs Line B's yield of 82%. These are the facts. They can be taken at face value as long as they are used in the appropriate context. Let's restate these facts slightly. Line A produced an average yield 3 percentage points higher than Line B. Is this a fact? Surely something computed from 2 facts must be a fact too. Not so fast.
When trying to understand the past, facts need to be considered alongside assumptions. "Line A produced a yield 3 percentage points higher than Line B" is a fact as long as the yield calculation method was the same for A and B and the time period for both calculations was the same. The statements about equivalent calculation method and time period are assumptions. Assumptions should always be stated when inferring facts from other facts. Any attempt to get deeper into the understanding of the past will expose a need for understanding and stating more assumptions. If you were to infer that Line A is more efficient than Line B from these facts, you would have to state assumptions like: Lines A and B were given raw materials of the same quality.
When making decisions, assumptions take on a whole new level of significance. Since decisions impact the future and not the past, and nothing is truly known about the future, there is no such thing as a future fact - only assumptions. Imagine somebody makes a decision to expand the capacity of Line A and decommission Line B to achieve a goal of improving overall yield. The facts about the past suggest that this might be a good idea. Those facts need to be interrogated and massaged into a state where they might be suitable for the purposes of making a decision. The logic that drives this decision has to consider a raft of assumptions, including those stated above and new ones such as: yield is not dependent on capacity.
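To make the habit concrete, here is a minimal Python sketch of what I mean by never recording a derived "fact" without recording the assumptions it rests on. The names and numbers are hypothetical, not a prescription for how your system should look:

```python
# A minimal sketch (hypothetical names and values) of recording a derived "fact"
# together with the assumptions it rests on, so the assumptions can be scrutinized.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DerivedFact:
    statement: str
    value: float
    assumptions: List[str] = field(default_factory=list)  # explicit, reviewable assumptions

def yield_difference(yield_a: float, yield_b: float) -> DerivedFact:
    """Infer a new 'fact' from two measured facts, recording what it assumes."""
    return DerivedFact(
        statement="Line A's average yield is higher than Line B's",
        value=yield_a - yield_b,  # difference in percentage points
        assumptions=[
            "Yield was calculated the same way for both lines",
            "Both averages cover the same time period",
            "Both lines received raw materials of the same quality",
        ],
    )

fact = yield_difference(85.0, 82.0)
print(f"{fact.statement} by {fact.value:.1f} percentage points")
for assumption in fact.assumptions:
    print(" - assumes:", assumption)
```

The point is not the data structure; it is that the assumptions travel with the inference, where they can be challenged.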
Given the importance of assumptions to decision making, you would think that humans would be great at understanding and documenting assumptions. They are not. In fact, I would go as far as to say that assumptions are a human blind spot. I know because I am a human, and even though I have recognized this blind spot in myself and actively search for assumptions, I still sometimes miss important ones. We are really good at making assumptions, but we are terrible at recognizing that we made them and describing them. It is only when we describe them that we can scrutinize them. Don't get me wrong: I am not suggesting that implicit assumptions are always wrong - just that you can't evaluate how realistic they are unless you state them explicitly.
In a previous post I mentioned the importance of intuition in decision making. Human intuition is a blessing. It distinguishes us from machines. It can also be a curse when it comes to assumptions though - as behind every bit of intuitive understanding lies a collection of unstated assumptions.
In subsequent posts I will delve into assumptions in more detail as they play an important role in conflict around decisions, communicating decisions and automated decision making.
Data Driven Decisions: Is data really “driving”?
March 25 2022
I am going to come right out and say it. The phrase “Data Driven Decision Making” is misleading. Data does not and will never drive decisions. Decisions are driven by logic. The logic used in decision-making may be supported by data, but it is the logic that drives, not the data. Data influences decision logic, but it is not the only influencer. The other influencers are equally worthy of consideration: knowledge and intuition. If you still don’t believe me, consider this driving example. When your car’s autonomous braking system decides to jam on the anchors to save you from oblivion, what drove the decision? The data coming from the camera, or the logic that assessed the likelihood of oblivion after correlating data from multiple sensors and cameras? Clearly data is merely a signal that influences the logic that drives the decision.
So the phrase is misleading! Why be a stickler for terminology and make such a song and dance about it? After all, surely everybody knows that they can’t use data as a substitute for logic? Part of the reason why I wrote this paper is that not everybody knows that they can’t use data as a substitute for logic, but that is not the main reason. The main reason is that the quest for “data driven decisions” is causing people to place too much emphasis on the collection of data and not enough emphasis on the rest of the influencers and on the logic itself. I use the term Decision Intelligence to encapsulate logic, data, knowledge and intuition as equally important ingredients of the decision-making process. When you think about all of the elements of Decision Intelligence in harmony, more avenues to improve decision making open up – some of which may require less data or easier-to-collect proxy data. Data collection and processing is a costly and messy business. We only want to do as much of it as we really need to – and we definitely don’t want to fall into the ever-common trap of collecting data that is not used at all or is used incorrectly in the decision process.
A Decision Intelligence mindset places the decision logic first and gives equal consideration to data, intuition and knowledge. The Decision Intelligence mindset applies to any decision – whether fully automated, partially automated or totally brain-powered and free from technical aids. In fact it is only after you adopt a Decision Intelligence mindset that you will feel adequately empowered to make appropriate decisions about the level of automation of each decision. In contrast, a data-driven mindset frequently encourages misguided attempts at automation – where people attempt to automate decisions that are not ready for automation just because they have data.
Some of you may be questioning how I dare to state that “intuition” deserves equal consideration with data. Wasn’t data-driven decision making supposed to free us from the perils of intuition? Let’s get one thing out of the way: there are always perils with decision making. These perils exist due to uncertainty, and no amount of data or “Decision Intelligence” can eliminate them. With data and logic, we can estimate the degree of peril in any decision. Once you understand the perils, intuition can have the liberating effect of allowing you to proceed boldly in the face of uncertainty. This is better than proceeding blindly on the data-driven fallacy that the future will look like the past.
Making a confident decision and proceeding boldly based on intuition adds a strong human element to the decision – which increases the level of investment in it. This investment is both a good and a bad thing. It is good when it becomes a motivator. It is bad when the decision maker becomes over-invested to the point of being unable to objectively decide to change course when it becomes obvious that the course needs correction.
What about “knowledge”? I left the touchiest element of Decision Intelligence till last. Knowledge is a touchy subject because human knowledge can be encoded as new data and logic – and hence placed in machines – resulting in an increased level of automation and, in the doom-and-gloom scenario, less need for humans. The counter-argument to doom and gloom normally goes like this: replacing mindless tasks performed by humans with ones automated by machines creates more capacity for human thinking, which leads to the creation of more valuable knowledge. This new knowledge is used to improve products and processes, and as long as people value product and process improvement, people will continue to be rewarded for their contributions of knowledge. I don’t subscribe to either of these as blanket arguments. Each decision about how far to go with automating any decision process is a decision in its own right – and should be handled with the rigor of Decision Intelligence.
This paper outlined the need for a more holistic approach to Decision Intelligence. You can start practicing Decision Intelligence right now. All it takes is considering the role of knowledge, intuition and data in your decisions, then acting confidently. Never beat yourself up over past decisions – instead, use the knowledge gained from past decisions to make [hopefully better] new decisions.
Interpreting Equipment Failure Predictions
March 24 2022
Image by author
The human race has spent the last two centuries mechanizing industrial processes. This has allowed us to produce goods faster, better and cheaper, and to produce goods that could never be created by humans alone. It does, however, mean that our industrial processes run at the mercy of equipment reliability. Today, an operation is running at full steam, but it is only a matter of time before a key piece of equipment breaks down and brings it to a halt. Predictive Maintenance is an approach that helps understand and manage the uncertainty around equipment failure.
Predictive Maintenance uses statistical models to predict when equipment is going to fail. By using the knowledge gained from predictive maintenance models to drive maintenance policy, it is possible to reduce both maintenance costs and the unplanned downtime resulting from equipment failure. The success of any predictive model depends on its accuracy and how you use its results. Accuracy is evaluated by going back in time and comparing the predictions that would have been made by a model with the actual failure history.
Below, you can see the results of such a comparison of predictions made from 3 models compared with actual failures. Which model, A, B or C is best?
Model A’s predictions are accurate within a range of 300 days, a year ahead of failure
Model B’s predictions are accurate within a range of 16 days, a month ahead of failure
Model C’s predictions are accurate within a range of 1 hour, 3 hours ahead of failure
When I started working on Predictive Maintenance 4 years ago, I would have almost certainly picked model C. I was working under the misguided goal of trying to eliminate scheduled maintenance. This led me to focus on finding the model that gave predictions closest to the actual failure date. The theory was that if you could learn enough about equipment from its operating data and maintenance history, you could figure out exactly when the equipment would fail and repair it just before it failed. The theory sounds good. After all, scheduled maintenance is wasteful. Parts and labour cost money. Taking equipment out of operation for maintenance reduces capacity, which effectively costs money. So why not stop scheduling maintenance for anything other than routine lubrication, and repair shortly before failure?
There are two reasons not to do this:
1) Most people do not have enough good data to build a model that can tell you exactly when equipment will fail; and
2) If you are one of the lucky ones with good enough data to accurately predict failure to the day, what do you do when your model predicts that 5 machines will break tomorrow and you only have enough maintenance capacity and parts to repair 2 of them?
Bottom line: You need to schedule maintenance to make allowances for the uncertainty associated with predictions and to balance maintenance workload with available resources. Use predictive maintenance to decide on the best possible time to do scheduled maintenance on each machine.
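Here is a minimal sketch of that balancing act. The risk estimates and capacity are invented for illustration; the point is that predictions feed a schedule, and the schedule respects the resources you actually have:

```python
# A minimal sketch (hypothetical risk estimates and capacity) of using failure
# predictions to schedule maintenance, rather than reacting to predictions alone.

# Predicted probability that each machine fails within the next scheduling window.
failure_risk = {"M1": 0.70, "M2": 0.55, "M3": 0.40, "M4": 0.15, "M5": 0.10}

CAPACITY_PER_WINDOW = 2  # crews and parts available in this window

# Service the highest-risk machines first; defer the rest to the next window.
ranked = sorted(failure_risk, key=failure_risk.get, reverse=True)
maintain_now = ranked[:CAPACITY_PER_WINDOW]
deferred = ranked[CAPACITY_PER_WINDOW:]

print("Maintain this window:", maintain_now)
print("Defer:", deferred)
```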
When following the guidance above, you must think a little differently about your predictive models. Predictive models are biased to respond to the strongest signals. Unfortunately, the strongest failure signals are the lagging ones – obvious signs of impending failure like vibration, efficiency loss and changes in thermal response. They are lagging because whatever is going to cause the failure has already happened: a worn bearing is causing increased vibration, a leak is causing efficiency loss, a failed brake lining has changed the thermal response. Predictive models readily observe patterns in these lagging indicators because they are evident in black and white in the data, whereas the subtler leading indicators – like overload and usage – identify themselves only as shades of gray.
When you ask a data scientist to build a model for you, or evaluate a vendor’s model, how do you take this into account? You design an evaluation metric that is consistent with your maintenance scheduling goals. If your maintenance scheduling window is 2 weeks, your metrics for evaluating the model should reflect this, i.e. measure the accuracy of predictions made 4 weeks before failure, not at the time of failure (which would likely be the data scientist’s or vendor’s default metric). Revisit the question at the start of the post. Which model is best now?
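As a rough sketch of what such a metric might look like (the numbers are invented), evaluate the model on the predictions it made at your planning lead time, not on its last-minute predictions:

```python
# A minimal sketch (hypothetical numbers) of evaluating a model at the lead time that
# maintenance scheduling actually needs, rather than at the moment of failure.

import numpy as np

LEAD_TIME_DAYS = 28  # evaluate the predictions the model made this far before failure

# For five historical failures: what the model predicted (days to failure) when it was
# queried 28 days before each failure actually happened.
predicted_days_to_failure = np.array([35.0, 10.0, 30.0, 55.0, 26.0])
actual_days_to_failure = np.full_like(predicted_days_to_failure, LEAD_TIME_DAYS)

mae = np.mean(np.abs(predicted_days_to_failure - actual_days_to_failure))
print(f"Mean absolute error at a {LEAD_TIME_DAYS}-day lead time: {mae:.1f} days")
```

A model that looks superb when scored at the moment of failure can look very ordinary when scored this way – and the ordinary-looking score is the one your planners actually live with.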
Unless care is taken when building a predictive maintenance model, it will naturally bias itself towards predictions made from lagging indicators. It becomes a condition monitoring model that may be able to predict failure a day or a week before it happens, but it is not much use in longer-range maintenance planning, which is where predictive maintenance tends to be most beneficial.
Jakob Bernoulli vs A humble laptop: The coin toss challenge
Nov 18 2016
Image by author
When the right problems are posed, machines are far more efficient at learning than humans are. The coin toss example that I used in the last few posts seems simple enough, but some really smart mathematicians never figured it out: Aristotle may have laid some of the groundwork as possibly the first practitioner of “analytics”, but then Euclid, Archimedes and just about everybody else steered clear of the black art of uncertainty until Cardano wrote a book on games of chance in the mid-1500s. It was only really after Jakob Bernoulli’s “Law of Large Numbers” theorem was published in 1713 that probability theory started to take shape. How efficient was the human race at discovering the math of certainty and uncertainty? Not very - it took centuries. How long did it take a machine to build up enough knowledge about patterns in outcomes of coin tosses to present comparable results? Less than a minute, with a machine learning process running on a single commodity computer.
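If you doubt the "less than a minute" claim, here is a minimal sketch of the idea – nothing clever, just simulated tosses – showing that a laptop can recover a serviceable probability table for coin-toss outcomes from raw data in seconds:

```python
# A minimal sketch: recover a probability table for coin-toss outcomes from nothing
# but simulated raw tosses. This runs in seconds on a commodity laptop.

import random
from collections import Counter

TRIALS, TOSSES = 200_000, 8
heads_counts = Counter(
    sum(random.random() < 0.5 for _ in range(TOSSES))  # heads in one 8-toss trial
    for _ in range(TRIALS)
)

for heads in range(TOSSES + 1):
    print(f"P({heads} heads in {TOSSES} tosses) ~= {heads_counts[heads] / TRIALS:.3f}")
```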
Don’t get me wrong. I am not suggesting that there is anything wrong with theoretical knowledge or with knowledge that humans have derived from empirical analysis. I am just making it clear that these time-honoured methods of acquiring, documenting and applying knowledge are no longer the only way to do it. The Internet of Things is a massive network of automated data collection and compute resources - ready to acquire new knowledge at a far faster rate than academia and research labs can.
Human knowledge and knowledge discovered by machines are also not directly comparable - you can’t use them in the same way.
Humans are much more articulate in describing what they know. This makes it easier for humans to consume and apply knowledge acquired by other humans. The math around the Binomial distribution is a perfect example. It is well documented and concise. Anybody with enough grounding in statistics can interpret it and apply it to understanding uncertainty in any problem domain with binary outcomes like a coin toss.
Machines are not at all articulate when it comes to describing the knowledge they have discovered. Although it is possible to look at and produce summary statistics from most analytic models, the outputs are far from concise, hard to interpret and generally can’t be used outside of the specific context that the data was collected in. The model that I made to help understand uncertainty around fair or biased coins does not give me anything like the concise set of formulae of the Binomial distribution. It gives me a table of probabilities. I demonstrated how to use a table of probabilities to gain an intuitive understanding of the level of uncertainty. I manually interpreted the results and scrawled them onto the table. Those interpretations are not as easy to apply in human decision making as the equations of the Binomial distribution.
Knowledge discovered by machines may be harder for humans to interpret, but it is actually a lot easier for computers to interpret. A computer couldn’t read up on probability theory and make decisions about whether to accept a coin as fair. It could, however, readily consume the table of probabilities produced by the model and use these results to make a decision. The best way to use the outputs of machine learning is to automate the response to what has been learned. If you don’t do this, human capacity to interpret voluminous and obscure outputs from machine learning will always be a limiting factor.
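Here is a minimal sketch of what automating the response could look like. The probabilities and thresholds are invented for illustration; the machine consumes the model output and acts on it, with no human interpretation required:

```python
# A minimal sketch (hypothetical probabilities and thresholds) of a machine consuming
# model output directly and automating the response, instead of asking a human to
# interpret a table of probabilities.

probability_fair = {"coin_A": 0.91, "coin_B": 0.18, "coin_C": 0.62}  # model output

ACCEPT, REJECT = 0.80, 0.20  # decision thresholds chosen for illustration

for coin, p in probability_fair.items():
    if p >= ACCEPT:
        action = "accept as fair"
    elif p <= REJECT:
        action = "remove from circulation"
    else:
        action = "collect more toss data"
    print(f"{coin}: P(fair) = {p:.2f} -> {action}")
```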
Humans are slower to adapt to change than machines. Once human knowledge has been cast in stone, it is difficult to correct. The internet reduced our reliance on physical media, so my reference to stone tablets is unfair, but the internet hasn’t changed human nature. People like to stick with what they know, as learning new things takes effort and involves an admission that current knowledge is inadequate. Machines don’t suffer from either of these human traits. If the data fed into a machine learning model is volatile, so is the knowledge gleaned from it. A machine will have no problem proudly proclaiming that what it discovered before was wrong and has been supplanted by something better.
Fast response to change and high suitability of outputs to drive automated decisions are two things to embrace in machine learning. If you try to use machine learning like you would human knowledge described in a textbook, you are unlikely to be successful with it.
Making decisions under conditions of uncertainty
Nov 10 2016
This post will give you a reasonably non-technical example of how machine learning techniques can be used to quantify uncertainty and guide decision making.
Wouldn't it be nice if you could pose questions like the one I explored in the last post as something like this: "Hey AI, look at these results and tell me which coin is fair". It is not quite so simple.
As long as you have enough data, there are lots of ways to answer a question like this using AI/ML, but all of them will need you to totally rephrase the question and then transform your data to match how you asked the question. The way that you phrase the problem depends on the class of the problem you are trying to solve. I am going to use a "classifier" to solve the problem, so I need to phrase the problem as a classification problem. I know I may have lost you there, but classification is something your brain does all the time without even thinking. This is what a classifier does:
Imagine you have a collection of things. This collection contains chocolate, water, a doll, a pencil, some ink and whatever else was hanging around in your backpack. Now consider a set of classifications: food, toy, art supplies. A "classifier" can tell you how likely it is that any given thing in your backpack is food, a toy or art supplies. Classifiers are useful to decision makers as they are really good at quantifying uncertainty.
For a classifier to learn how to classify anything, it needs to see lots of different examples of each of the things it is expected to classify, along with suitable data that describes the characteristics of the things being classified. The example above is actually a monstrously difficult classification task, as there aren't obvious characteristics for the things in question that would allow a machine to determine whether something is edible, or whether it is some form of plaything. The human brain uses something akin to classifiers a lot. For those of you with children, think back to how long it took your children to learn not to play with food or eat art supplies; once we have this classification nailed, the brain's ability to classify things as foodstuffs helps keep us alive.
So if a classifier needs data describing examples of each class and characteristics for each class, how would we build a classifier that can classify coins as fair or unfair? We would need to find a coin that was known to be fair and one or more that were known to be unfair. We would collect coin toss data from the fair coins and the unfair coins. That is the easy bit. Now the slightly more tricky bit: the data that you present to the classifier needs to contain one or more characteristics of the coins. Each time you toss a coin you get an outcome: heads or tails. That outcome isn't a suitable characteristic in its own right, because fair and unfair coins both produce heads and tails. To formulate a classifier we need to transform those outcomes into one or more useful characteristics.
The table below shows the transformed data from lots of tosses involving both known fair and known unfair coins. Each number records an occurrence of consecutive heads or tails in a series of trials. Each time you see the number 1 in the Heads area, it records the fact that there was a THHT sequence in the trial. A 2 in the Tails area indicates an HTTTH sequence.
Image by author
If you look at the shape of the data for fair and unfair coins, you can visually see the skew from the greater concentration of values in the lower left quadrant. The classification algorithm will pick up these patterns too and work out which characteristic values best describe the skew. Once the classifier understands the pattern, it can use this knowledge to evaluate any future trial data and estimate the probability that the coin that produced the data is either fair or unfair. Note: in this simplified example, this particular skew was easy to see visually because the unfair coins were all unfair in the same way. If the classifier had to deal with different types of skew, it wouldn't be so easy. Consider a casino investing in a robust classifier to find anomalies in dice, decks of cards, gambling machines or casino staff. The patterns they are looking for would not be as obvious as these and would be unlikely to be observable visually like this. That is the power of the classification algorithm over simple visual analysis - it can find patterns among a lot more variables than you could ever plot visually.
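For the curious, here is a minimal sketch of the whole pipeline. The coins are simulated, and the run-count features are my stand-in for the table above rather than its exact layout: transform tosses into characteristics, train a classifier, and get back a probability of fairness for a new trial.

```python
# A minimal sketch (simulated data, assumed feature design) of classifying coins as
# fair or unfair from run-count features, in the spirit of the table described above.

import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def toss(n, p_heads):
    """Simulate n tosses; True = heads."""
    return [random.random() < p_heads for _ in range(n)]

def run_counts(outcomes, max_run=5):
    """Transform raw tosses into characteristics: how many runs of each length occurred."""
    features = np.zeros(2 * max_run)  # heads runs first, then tails runs, indexed by length
    run_len, current = 1, outcomes[0]
    for o in outcomes[1:] + [not outcomes[-1]]:  # sentinel flips to close the last run
        if o == current:
            run_len += 1
        else:
            idx = (0 if current else max_run) + min(run_len, max_run) - 1
            features[idx] += 1
            run_len, current = 1, o
    return features

# Training data: trials from a known fair coin (p=0.5) and a known unfair coin (p=0.7).
X, y = [], []
for p_heads, label in [(0.5, 1), (0.7, 0)]:  # label 1 = fair
    for _ in range(500):
        X.append(run_counts(toss(100, p_heads)))
        y.append(label)

clf = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))

# Score a new, unlabelled trial: the classifier quantifies uncertainty as a probability.
new_trial = run_counts(toss(100, 0.55))
print(f"P(fair) = {clf.predict_proba([new_trial])[0, 1]:.2f}")
```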
The use of a machine learning classifier to assess the level of uncertainty imparts a rigor to the analysis. Some of the decisions that you encounter on a day-to-day basis warrant this level of rigor; others don't. Regardless of the level of rigor required, the Decision Intelligence mindset encourages you to spend some time upfront thinking about the level of uncertainty involved in any decision. What are the risks associated with this uncertainty? Do the risks warrant spending more time understanding the uncertainty, or is it better to make some assumptions and proceed boldly under those assumptions?
Getting to grips with uncertainty
Nov 7 2016
The first stage of getting to grips with uncertainty is quantifying it. Here is an activity to test your ability to quantify uncertainty. A fair coin has a 50% probability of showing heads on each toss.
Image by author
My answer is "No". Just because coin B produced more tails than heads in 8 tosses doesn't prove its guilt.
The best way to answer the question is to get to grips with the level of uncertainty by trying to quantify it. The field of statistics gives us an excellent way to quantify uncertainty: probability.
“Probability is the measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty)” Wikipedia
Since coin tosses can be described very elegantly using the Binomial distribution, and the math of the Binomial distribution is simple, any statistician would make light work of this problem. Their answer: either of those results could quite plausibly have come from a fair coin, so there is no way to conclude which coin is bad. The best we could do is to state how many tosses would be needed to conclude which coin was bad with any degree of certainty.
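For those who want to see the statistician's version, here is a minimal sketch. I am assuming, purely for illustration, that the suspicious coin showed 6 tails in 8 tosses; the actual counts are in the activity above.

```python
# A minimal sketch of the statistician's route: the Binomial distribution quantifies
# how surprising a run of tails really is. The 6-of-8 count is an assumed illustration,
# not the exact numbers from the activity above.

from scipy.stats import binom

n_tosses, p_tails = 8, 0.5
p_six_or_more_tails = binom.sf(5, n_tosses, p_tails)  # P(X >= 6) = survival function at 5

print(f"P(>=6 tails in 8 tosses of a fair coin) = {p_six_or_more_tails:.3f}")  # ~0.145
```

A result that happens roughly one time in seven with a perfectly fair coin is hardly proof of guilt.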
Quantifying uncertainty is a discipline that anybody who wants to thrive under conditions of uncertainty will need to get to grips with. In the next post I will go into more detail about how to quantify uncertainty for the coin toss problem. Don't worry, I won't make you pull out your statistics textbook. We will explore the problem empirically, using machine learning techniques to extract knowledge about uncertainty from coin toss data.
Life is uncertain. Don't sip.
Oct 23 2016
The title of this post is stolen from a beer label. Thanks, Lagunitas Brewing Company!
There is no disputing the first statement. Life is uncertain.
How about the second? "Don't sip."
Image by author
Should we be tentative about how we deal with uncertainty - putting off decisions, waiting for certainty, doing what everybody else does as it must be right? I say no. Act with courage in conditions of uncertainty, not "Dutch Courage" as the makers of this tasty beverage would suggest. Act with the courage you get from having the tools and techniques needed to understand uncertainty and its effect on industrial processes. The next series of posts will discuss how uncertainty applies to specific industrial processes and how to use knowledge about uncertainty to improve processes.
New Modern Times
Sept 19 2016
Image by author
How could a comedy film made in 1936 inspire me to write about the future of manufacturing and the industrial sector? Charlie Chaplin’s ‘Modern Times’ made a strong and humorous statement about how advances in mechanization were changing the world. In 2016, we are at another juncture. Advancements in industrial operations over the 19th and 20th centuries were fueled by the increasing sophistication of machinery. This sophisticated machinery was designed and constructed using new engineering knowledge that had its roots in applied physics and empirical studies.
The changes that we are observing today are different. They are fueled by new knowledge and a new sense of confidence in how to take actions in the presence of uncertainty. The source of this knowledge is different. It is not derived from lab experiments or academic discoveries. New knowledge is being extracted from the humble outputs of the numerous, but generally rather unsophisticated, sensors that adorn modern equipment. This knowledge is not what we see written down or cast in stone in equipment catalogs, textbooks, operating procedures and maintenance manuals. It is dynamic in that it keeps evolving as the systems that it represents evolve. It has the potential to be put to use directly to make better operational decisions such as: when and what to sample for quality inspection, what operating parameters to use for a process and when to maintain equipment.
I am an Industrial Engineer, software technologist and data scientist. Over the two decades of my professional life, I have been exposed to numerous technologies and techniques for acquiring data, learning from data and putting the knowledge learned from data to use to improve processes. My writing will focus on my experiences and offer tips on how to thrive in our New Modern Times.