Unveiling the Dynamics of Human-AI Complementarity through Bayesian Modeling

Article Title: Bayesian modeling of human–AI complementarity

Authors & Year: M. Steyvers, H. Tejeda, G. Kerrigan, and P. Smyth (2022)

Journal: Proceedings of the National Academy of Sciences of the United States of America [DOI:10.1073/pnas.2111547119]

Review Prepared by David Han

Exploration of Human-Machine Complementarity with CNN

In recent years, artificial intelligence (AI) and machine learning (ML), especially deep learning, have advanced significantly on tasks such as computer vision and speech recognition. Despite their high accuracy, these systems still have weaknesses, including in image and text classification. This has led to interest in hybrid systems in which AI and humans collaborate, reflecting a more human-centered approach to AI design. Studies show that humans and machines have complementary strengths, prompting the development of frameworks and platforms for their collaboration. To explore this further, the authors of the paper developed a Bayesian model for image classification tasks that analyzes predictions from both humans and convolutional neural networks (CNNs). A Bayesian model is a statistical approach that updates beliefs in light of evidence, combining prior knowledge with new information to make better predictions or decisions. CNNs, widely used in image processing, resemble human vision in some respects but make different kinds of mistakes than humans do, which makes them well suited to studying human-machine complementarity (viz., the synergy and cooperation between humans and machines). The model helps explore the conditions for complementarity, such as when to combine predictions from humans and machines, or from groups of humans or groups of machine algorithms. It also helps in understanding how to account for the different errors made by each and how to integrate their different kinds of confidence scores.
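The idea of Bayesian updating described above can be illustrated with a minimal sketch. The labels, prior, and likelihood values below are made up for illustration and do not come from the paper; the function simply applies Bayes' rule over a small set of candidate labels.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Apply Bayes' rule: posterior is proportional to prior times likelihood."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

# Hypothetical three-label problem (values are illustrative only).
labels = ["bird", "boat", "bear"]
prior = [1 / 3, 1 / 3, 1 / 3]          # no initial preference among labels
first_evidence = [0.7, 0.2, 0.1]       # assumed likelihood of a first observation
second_evidence = [0.6, 0.3, 0.1]      # assumed likelihood of a second observation

posterior = bayes_update(prior, first_evidence)
posterior_after_both = bayes_update(posterior, second_evidence)
best_label = labels[int(np.argmax(posterior_after_both))]
```

Each call treats the previous posterior as the new prior, which is the sense in which beliefs are updated as evidence accumulates.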

Bayesian Combination of Human & Machine Classifier Predictions

The proposed Bayesian model combines classifications and confidence scores from different types of classifiers, human or machine. It focuses on three kinds of pairs: hybrid human-machine (HM), human-human (HH), and machine-machine (MM). The model produces combined predictions and estimates correlations between classifiers, capturing how their confidence scores relate. For instance, if one classifier is confident about a label, another might show similar confidence. Unlike previous models, which assume no association between the classifiers, this model estimates that association. It also accommodates different types of confidence scores: machine classifiers give probability distributions over labels, while humans provide a single ordinal response (such as “low,” “medium,” or “high”). The process starts with a probabilistic model that generates human and machine logit scores for each label. These are then transformed into probability confidence scores. For machine classifiers, the confidence scores are observed directly, whereas for human classifiers they are observed only indirectly. Ground-truth labels are treated as known for the training data and as unknown for the test data. For both training and test sets, the model conditions on the observed human labels, human confidence ratings, and machine classifier probabilities. Figure 1 illustrates the graphical model for combining hybrid HM pairs. The model parameters are estimated from the observed data using Markov Chain Monte Carlo (MCMC) methods, which yield the posterior distribution (i.e., the updated probability distribution).
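The generative steps described above (correlated logit scores, a transformation to probabilities, directly observed machine scores, and an ordinal human response) can be sketched as follows. This is a simplified illustration, not the authors' parameterization: the bivariate normal noise, the `boost` given to the true label, and the confidence cut points are all assumptions made for this example.

```python
import numpy as np

def softmax(logits):
    """Turn logit scores into a probability distribution over labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sample_pair_logits(true_label, n_labels, rho, rng, boost=2.0):
    """Draw human and machine logits for one image.

    For each label, the two noise terms are bivariate normal with
    correlation rho; `boost` (hypothetical) raises the true label's score.
    """
    mean = np.zeros(n_labels)
    mean[true_label] = boost
    cov = np.array([[1.0, rho], [rho, 1.0]])
    noise = rng.multivariate_normal(np.zeros(2), cov, size=n_labels)
    return mean + noise[:, 0], mean + noise[:, 1]

def ordinal_confidence(prob, cutpoints=(0.5, 0.8)):
    """Map a latent probability to an ordinal rating (cut points are assumed)."""
    levels = ["low", "medium", "high"]
    return levels[int(np.searchsorted(cutpoints, prob))]

rng = np.random.default_rng(0)
human_logits, machine_logits = sample_pair_logits(
    true_label=2, n_labels=4, rho=0.5, rng=rng
)

# Machine confidence is observed directly as a full probability vector.
machine_probs = softmax(machine_logits)

# Human confidence is observed only indirectly: a label plus an ordinal rating.
human_probs = softmax(human_logits)
human_label = int(human_probs.argmax())
human_rating = ordinal_confidence(human_probs[human_label])
```

In the paper the latent quantities and the correlation are inferred from data with MCMC; the sketch only runs the process forward to show what each observed variable looks like.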

The image is a graphical representation of the Bayesian combination model for hybrid Human-Machine (HM) pairs. It is divided into two main sections: "Training Data" on the left and "Test Data" on the right. Both sections include boxes containing nodes connected by arrows, illustrating the flow of data and relationships between variables. In both sections, there are shaded nodes representing observed variables and unshaded nodes representing latent variables. Plates (rectangles) indicate conditionally independent replicates of instances (images) and label-related variables for each instance. In the center section titled "Correlation & Performance Parameters," there is a central node labeled with parameters, connected to both data sections. Below this central node are additional parameters of human labeling and confidence. — Figure 1. Graphical representation of the Bayesian combination model for hybrid HM pairs; Shaded nodes indicate observed variables while unshaded nodes represent latent variables. Plates denote conditionally independent replicates of instances (images) and label-related variables for each instance.

Complementarity of Human & Machine Classifiers

For this study, the authors gathered human and machine classification decisions for 4,800 images. They used several CNN architectures known for high accuracy and also introduced variability in machine classifier performance. Figure 2A shows images on which humans struggle but machines excel: human accuracy and confidence are low while machine accuracy is high. Figure 2B shows the reverse: humans succeed while machines struggle. This highlights the complementarity between humans and machines, each performing better in different situations.

The image is divided into two rows labeled A and B, each containing six black-and-white, blurry, and textured images. The first row (A) includes the following images from left to right:

- An indistinct image with various shades of gray, appearing highly textured and grainy.

- An image of a structure resembling a tower or oil rig with crisscrossing lines, somewhat clearer than the first image.

- A heavily blurred and dark image with major smudging.

- Another blurry and dark image slightly clearer in the center with indistinct forms possibly representing a bear.

- A close-up of a texture or fur, appearing vertically streaked.

- Another image that is predominantly vertical streaks.

These are images that pose challenges for humans but are relatively easy for machine classifiers. The correct classifications are bird, boat, bear, bear, oven, and oven.

The second row (B) includes the following images from left to right:

- A slightly more discernible image of a car, still grainy but recognizable.

- A heavily obscured image of a large front portion of a vehicle, possibly a truck.

- An image with a blurred figure of a cat, slightly clearer yet still heavily textured.

- Another blurry image with a different cat.

- An image slightly resembling a bear amidst heavy blurring.

- Another obscured image with a bear-like figure.

These are images that are difficult for machine classifiers but easier for humans. The correct classifications are car, car, cat, cat, bear, and bear. — Figure 2. Examples demonstrating complementarity between human and machine classifiers;

A) Images that pose challenges for humans but are relatively easy for machine classifiers. The correct classifications are bird, boat, bear, bear, oven, and oven;

B) Images that are difficult for machine classifiers but easier for humans. The correct classifications are car, car, cat, cat, bear, and bear.

Figure 3A shows how well the classifiers perform on new (out-of-sample) data after the CNNs received a small amount of additional training. HM pairs perform well, especially under high image noise, although AlexNet’s low baseline performance limits its complementarity with humans. Combining two humans generally outperforms a single human, which underscores the value of human confidence scores for improving accuracy. Figure 3B shows that the correlations between predictions are weaker for hybrid human-machine pairs than for pairs of humans or pairs of machines. Another important finding is that complementarity between human and machine classifiers depends on the difference in their accuracy. Figure 4 illustrates this with observed and predicted complementarity outcomes for hybrid pairs. It displays results for 320 comparisons across image noise levels and degrees of fine-tuning (a procedure that adjusts a pre-trained model to boost task-specific performance). The shaded region marks the narrow range of accuracy differences in which complementarity occurs, and its extent depends on the correlation between the classifiers. If the HM pair had zero correlation, the complementarity zone would expand (dashed line), because uncorrelated errors leave more room for one classifier to compensate for the other. Even then, there are limits on the accuracy difference that still allows complementarity.
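The dependence of complementarity on both correlation and accuracy difference can be explored with a small simulation. The sketch below is not the authors' model: it assumes Gaussian logit noise, represents accuracy through a hypothetical `boost` on the true label, and combines two classifiers by simply adding their logits with equal weight. It then checks whether the HM pair beats both the HH and MM pairs.

```python
import numpy as np

def pair_accuracy(boost_a, boost_b, rho, n_images=20000, n_labels=8, seed=0):
    """Accuracy of an equal-weight combination of two classifiers.

    Each classifier's logit for the true label is raised by its boost
    (a stand-in for skill); the noise of the two classifiers on the same
    label has correlation rho. All settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    true = rng.integers(0, n_labels, size=n_images)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    noise = rng.multivariate_normal(np.zeros(2), cov, size=(n_images, n_labels))
    signal = np.zeros((n_images, n_labels))
    signal[np.arange(n_images), true] = 1.0
    combined = (boost_a + boost_b) * signal + noise[..., 0] + noise[..., 1]
    return float((combined.argmax(axis=1) == true).mean())

def shows_complementarity(human, machine, rho_hm, rho_hh, rho_mm):
    """True when the hybrid pair beats both same-type pairs."""
    hm = pair_accuracy(human, machine, rho_hm)
    hh = pair_accuracy(human, human, rho_hh)
    mm = pair_accuracy(machine, machine, rho_mm)
    return hm > max(hh, mm), (hm, hh, mm)

# Similar skill, weaker cross-type correlation: the hybrid pair wins.
similar, similar_accs = shows_complementarity(2.0, 2.0, 0.3, 0.7, 0.7)

# Large skill gap: two strong machines beat the hybrid pair.
unequal, unequal_accs = shows_complementarity(1.0, 3.0, 0.3, 0.7, 0.7)

# Lower correlation helps a pair of equally skilled classifiers.
acc_uncorrelated = pair_accuracy(2.0, 2.0, 0.0)
acc_correlated = pair_accuracy(2.0, 2.0, 0.3)
```

Under these assumptions the hybrid pair wins only when the two skill levels are close enough, and lowering the correlation raises combined accuracy, which mirrors the qualitative pattern described for Figure 4.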

The image consists of two panels. Panel A presents accuracy results with 95% confidence intervals, while panel B shows posterior distributions over correlations from a Bayesian combination model.

Panel A features a horizontal bar chart illustrating the accuracies of different models and their combinations, along with 95% confidence intervals. The x-axis ranges from 0.75 to 1, labeled as "Accuracy." The y-axis is labeled with model names or combinations of models and their labels in parentheses:

Single Human (H)

Two Humans (HH)

AlexNet (M,HM)

DenseNet161 (M,HM)

GoogleNet (M,HM)

ResNet152 (M,HM)

VGG-19 (M,HM)

AlexNet + DenseNet161 (MM)

AlexNet + GoogleNet (MM)

AlexNet + ResNet152 (MM)

AlexNet + VGG-19 (MM)

DenseNet161 + GoogleNet (MM)

DenseNet161 + ResNet152 (MM)

DenseNet161 + VGG-19 (MM)

GoogleNet + ResNet152 (MM)

GoogleNet + VGG-19 (MM)

ResNet152 + VGG-19 (MM)

Color coding is used to distinguish between "Human Only" (red), "Machine Only" (blue), and "Hybrid" (green):

- Red dashed vertical line at approximately 0.91 represents the accuracy of "Two Humans."

- Blue dots with horizontal lines represent Machine Only models (M, MM).

- Green dots with horizontal lines represent Hybrid models (HM).

Panel B displays posterior distributions over correlations from the Bayesian combination model across different model combinations. The x-axis ranges from 0.2 to 1, labeled as "Correlation." The y-axis is labeled with similar model names as in panel A:

Two Humans (HH)

AlexNet + Human (HM)

DenseNet161 + Human (HM)

GoogleNet + Human (HM)

ResNet152 + Human (HM)

VGG-19 + Human (HM)

AlexNet + DenseNet161 (MM)

AlexNet + GoogleNet (MM)

AlexNet + ResNet152 (MM)

AlexNet + VGG-19 (MM)

DenseNet161 + GoogleNet (MM)

DenseNet161 + ResNet152 (MM)

DenseNet161 + VGG-19 (MM)

GoogleNet + ResNet152 (MM)

GoogleNet + VGG-19 (MM)

ResNet152 + VGG-19 (MM)

Triangles and kidney-shaped curves in different colors represent the posterior distributions for each combination. — Figure 3. Top: Accuracy results with 95% confidence intervals;
Bottom: Posterior distributions over correlations from the Bayesian combination model

The image is a scatter plot comparing the accuracy of human classifiers (H) on the x-axis to the accuracy of machine classifiers (M) on the y-axis. It displays various observed accuracies using circles, which are color-coded based on whether the hybrid human-machine (HM) accuracy surpasses human-human (HH) and machine-machine (MM) accuracies. Filled red circles represent cases where HM accuracy is superior, while empty black circles indicate otherwise. The plot includes a solid diagonal line, indicating equivalence between single human and machine performances. A red shaded area, labeled as the predicted complementarity zone, aligns closely with correlations inferred by a Bayesian model. Dashed lines outline the boundaries of a predicted complementarity area for an ideal scenario where human and model predictions are uncorrelated. A legend in the top left corner explains the color coding of the circles under the title "Complementarity." — Figure 4. Observed and predicted complementarity based on human and machine classifier accuracy; Circles represent observed accuracy in various datasets, with filled circles indicating cases where the hybrid HM pair’s out-of-sample accuracy surpasses HH and MM pairs. The colored area represents the predicted complementarity zone, closely aligning with the correlations inferred by the Bayesian model. The dashed line indicates the predicted complementarity area in an ideal scenario, where human and model predictions are uncorrelated. The diagonal line indicates equivalence between single human and model performances.

Importance of Designing Collaborative Human-AI Systems

The study finds that the following factors significantly improve the performance of hybrid HM pairs, especially in high-noise conditions:

- A customized error model that corrects errors and biases from both human and machine classifiers for specific labels
- Human confidence scores
- Machine classifier scores

Machine confidence scores have a larger impact on performance than human confidence scores or the error model. This is because machine classifiers provide scores across all labels simultaneously, while humans offer a single confidence score for their decision. Human confidence ratings and the error model contribute similarly to hybrid performance. Thus, one way to boost hybrid HM classifier performance is to include human confidence ratings. Together, these findings emphasize the importance of confidence scores from both human and machine classifiers.

The findings also show that combining the predictions of a single human with those of a machine can improve performance, even when the human outperforms the machine. Conversely, a hybrid HM pair can outperform combinations of machine classifiers that individually outperform a single human. This has implications for algorithms not yet at human-level accuracy: adding less accurate algorithmic predictions to a human predictor can improve performance more than adding further human predictions. The benchmark for AI algorithms therefore need not always be human-level performance. Effective AI systems do not need human-level accuracy to improve outcomes, and human judgment remains valuable.

However, there are limits to complementarity. An important factor is the correlation between human and machine classifier predictions, which limits the accuracy difference that still supports complementarity; see Figures 3 and 4. This suggests that effective AI advice should strive for independence from human judgment, so that combining humans and machines yields a meaningful performance gain.
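The contrast drawn above, where a machine supplies scores for every label while a human supplies one label and one confidence rating, can be made concrete with a small sketch. This is a naive product-of-evidence combination, not the paper's model: it ignores the human-machine correlation that the Bayesian model estimates, and the mapping from confidence rating to human accuracy is a hypothetical calibration that would in practice be estimated from training data.

```python
import numpy as np

# Hypothetical calibration: how often the human is right at each rating.
HUMAN_ACCURACY_BY_CONFIDENCE = {"low": 0.5, "medium": 0.7, "high": 0.9}

def combine_human_machine(human_label, human_confidence, machine_probs):
    """Combine one human response with a machine probability vector.

    The human response becomes a likelihood over labels: the chosen label
    gets the calibrated accuracy and the rest share the remainder equally.
    The two sources are multiplied as if independent (an assumption).
    """
    machine_probs = np.asarray(machine_probs, dtype=float)
    n_labels = machine_probs.size
    p_correct = HUMAN_ACCURACY_BY_CONFIDENCE[human_confidence]
    human_likelihood = np.full(n_labels, (1.0 - p_correct) / (n_labels - 1))
    human_likelihood[human_label] = p_correct
    posterior = human_likelihood * machine_probs
    return posterior / posterior.sum()

machine_probs = [0.7, 0.2, 0.1]  # illustrative machine output over three labels

# The human disagrees with the machine and picks label 1.
confident = combine_human_machine(1, "high", machine_probs)
hesitant = combine_human_machine(1, "low", machine_probs)

confident_choice = int(confident.argmax())  # human overrides the machine
hesitant_choice = int(hesitant.argmax())    # machine's label prevails
```

With these illustrative numbers, a confident human overrides the machine while a hesitant one does not, which is one way to see why the confidence rating carries information beyond the label itself.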
Human confidence scores play an important role, enhancing hybrid performance about as much as an explicit error model. The study provides a framework for evaluating hybrid human-machine predictions that is relevant for domains such as medicine and the justice system. In the justice system, for example, machines could analyze large amounts of legal data to identify patterns and trends, while human experts could provide contextual understanding and moral reasoning. This combination could lead to better informed and fairer decisions in areas such as case evaluations, sentencing, and parole determinations, potentially reducing the biases and errors inherent in human judgment alone.