Selection bias arises when the data selected to train a machine learning model is not representative of the population or conditions the model will encounter in use. It is common when prototyping teams focus narrowly on solving a specific problem without considering how the solution will be used or how well the data sets will generalize.
It is imperative that data science and developer teams ensure their training data are representative of the real-world situation in which the model is to perform. For example, a machine learning model that predicts customer attrition for a bank may need to account for the demographics of the customer population. Attrition among high-net-worth individuals is likely to have substantially different characteristics than attrition among lower-net-worth individuals, so a model trained on one group would likely perform poorly on the other. A machine learning modeler must look out for potential selection bias and take appropriate measures to keep it from being introduced into the model.
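One simple way to check for the kind of mismatch described above is to compare how each customer segment is represented in the training sample and in a reference population. The following is a minimal sketch rather than a definitive method. The function names, the segment labels, the 10-percentage-point tolerance, and the data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def segment_shares(labels):
    """Return each segment's share of the sample as a fraction between 0 and 1."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {segment: count / total for segment, count in counts.items()}


def representation_gaps(training_labels, population_labels, tolerance=0.10):
    """Return segments whose training share differs from the population
    share by more than `tolerance` (an assumed, illustrative threshold).

    Positive values mean the segment is over-represented in training data;
    negative values mean it is under-represented.
    """
    train = segment_shares(training_labels)
    population = segment_shares(population_labels)
    gaps = {}
    for segment in sorted(set(train) | set(population)):
        gap = train.get(segment, 0.0) - population.get(segment, 0.0)
        if abs(gap) > tolerance:
            gaps[segment] = round(gap, 3)
    return gaps


# Hypothetical data: the training sample is dominated by high-net-worth
# customers, while the bank's overall customer base is not.
training = ["high_net_worth"] * 80 + ["lower_net_worth"] * 20
population = ["high_net_worth"] * 10 + ["lower_net_worth"] * 90

gaps = representation_gaps(training, population)
print(gaps)
```

A non-empty result suggests the training sample may need to be resampled or reweighted before modeling. A check like this only covers the segments one thinks to measure, so it complements, rather than replaces, evaluating model performance separately for each segment.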
C3 AI Application Platform and C3 AI Applications provide sophisticated capabilities to explore training data sets and evaluate model performance prior to production deployment. In addition, based on C3 AI’s extensive experience helping organizations solve large-scale problems with machine learning, we have codified best practices for detecting bias in data sets and make them available to the organizations we work with.