Explainability tools generate plausible narratives about model behavior. Those narratives satisfy regulators and reassure stakeholders. They do not explain how the model actually works.
Most explainable AI methods produce post-hoc rationalizations that correlate with model outputs without revealing the mechanism that generated those outputs. The explanations look compelling while being fundamentally misleading.
Post-hoc explanations are rationalizations
A deep neural network with millions of parameters makes a prediction. An explainability tool like LIME or SHAP analyzes that prediction and produces a list of important features. The loan was denied because of credit utilization, recent inquiries, and account age.
That explanation describes correlations between inputs and the specific output. It does not describe what the neural network computed. The network does not have variables for credit utilization or account age. It has weight matrices and activation functions. The path from input to output passes through thousands of nonlinear transformations that have no semantic meaning.
LIME works by training a simple linear model to approximate the complex model’s behavior in a local region around a specific input. It explains the linear approximation, not the original model. The linear approximation may be a terrible representation of what the neural network is doing.
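The mechanics fit in a dozen lines. A minimal sketch of the local-surrogate idea, assuming an invented black_box model in place of a real network:

```python
import numpy as np

# Hypothetical "complex model": strongly nonlinear in its first feature.
def black_box(X):
    return np.sin(3 * X[:, 0]) + 0.1 * X[:, 1]

rng = np.random.default_rng(0)
x0 = np.array([1.0, 0.5])                      # the input being explained

# Step 1: sample perturbations in a local region around x0.
Z = x0 + rng.normal(scale=0.3, size=(500, 2))
y = black_box(Z)

# Step 2: fit a distance-weighted linear surrogate by least squares.
w = np.sqrt(np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.25))
A = np.hstack([Z, np.ones((len(Z), 1))])       # add an intercept column
coef, *_ = np.linalg.lstsq(w[:, None] * A, w * y, rcond=None)

# The "explanation" is coef[:2]: slopes of the linear fit. They describe
# the surrogate, not the sin(3x) surface the model actually computes.
print(coef[:2])
```

Change the sampling scale or the kernel width and the explanation changes with it, while the model stays fixed.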
SHAP computes feature importance by averaging how predictions change as features are added to or removed from coalitions. Its attributions are additive by construction. Neural networks learn interactions between features. SHAP smears those interactions across per-feature contributions, presenting entangled effects as if each feature acted on its own.
Both methods produce feature importance rankings that look like explanations. Neither reveals how the model processes information. They rationalize outputs in terms humans find intuitive without being faithful to the model’s actual computation.
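The additivity point can be made exact. Below is a from-scratch Shapley computation in pure Python (using the baseline-replacement convention; real SHAP libraries approximate this), applied to a model that is nothing but interaction:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values; 'absent' features are set to the baseline
    (one common SHAP convention -- real libraries approximate this)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without = [x[j] if j in S else baseline[j] for j in range(n)]
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (f(with_i) - f(without))
    return phi

# A model that is PURE interaction: neither feature does anything alone.
def product(v):
    return v[0] * v[1]

print(shapley_values(product, x=[2.0, 3.0], baseline=[0.0, 0.0]))  # -> [3.0, 3.0]
# All 6.0 of output comes from the interaction, yet the attribution
# splits it into two tidy "independent" contributions of 3.0 each.
```

The ranking looks like an explanation. The mechanism, a multiplicative interaction, is invisible in it.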
Explanations optimize for human satisfaction
Humans prefer explanations that are simple, coherent, and match their existing mental models. Explainability tools exploit these preferences. They generate explanations humans find satisfying rather than explanations that are accurate.
A hiring model rejects a candidate. An explainability tool attributes the decision to years of experience and education level. The recruiter nods. That explanation makes sense based on how hiring normally works.
The actual model might be using years of experience as a proxy for age and education level as a proxy for socioeconomic status. The model learned correlations in historical hiring data that encode bias. The explainability tool reframes that bias in neutral terms.
The explanation satisfied the human. It hid the problem.
Studies of human-AI interaction consistently find that people trust AI systems more when given explanations, regardless of whether those explanations are accurate. Adding any explanation increases trust. In several experiments, incorrect or placebic explanations increased trust about as much as correct ones.
Explainability tools are persuasion mechanisms. They make people comfortable with model outputs. Comfort and understanding are not the same.
Faithfulness vs. plausibility
An explanation is faithful if it accurately represents the model’s reasoning process. An explanation is plausible if humans find it believable. Explainability tools prioritize plausibility over faithfulness.
Research comparing explanations to actual model internals shows consistent divergence. Feature importance rankings from SHAP often disagree with gradient-based measures of what drives the output. Saliency maps highlighting important image regions frequently fail to correspond to the activations that drive predictions.
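The gap is easy to reproduce with a toy model (both functions here are invented for illustration): compare the finite-difference gradient at a point against the slopes a wide linear surrogate reports for that same point.

```python
import numpy as np

def black_box(x):
    """Hypothetical opaque model: saturating in x[0], gated by x[1]."""
    return np.tanh(4 * x[0]) * x[1]

def finite_diff_grad(f, x, eps=1e-5):
    """The model's actual local sensitivity at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
x0 = np.array([0.0, 1.0])

# A surrogate "explanation": one linear fit over a wide neighborhood.
Z = x0 + rng.normal(scale=1.0, size=(2000, 2))
y = np.array([black_box(z) for z in Z])
A = np.hstack([Z, np.ones((len(Z), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(finite_diff_grad(black_box, x0))  # what the model computes at x0: ~[4, 0]
print(coef[:2])                         # what the surrogate reports: a far smaller slope
```

Both numbers claim to describe the same point. Only one of them is what the model does there.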
Plausible explanations are easier to generate than faithful ones. They require less computational overhead. They produce cleaner results. They avoid exposing model pathologies that might raise questions.
A credit scoring model uses ZIP code as its most important feature. That reveals the model is encoding geographic discrimination. Explainability tools can be tuned to downrank ZIP code and emphasize more defensible features like payment history.
The explanation becomes more plausible and less faithful. Stakeholders get the narrative they want. The discrimination continues.
Interpretable models get rejected for being too simple
The alternative to explaining complex models is using inherently interpretable models. Logistic regression, decision trees, and rule-based systems have transparent decision logic. You can inspect exactly how they process inputs.
Organizations reject interpretable models because they have lower accuracy than neural networks. A logistic regression achieves 82% accuracy. A neural network achieves 87%. The neural network gets deployed with an explainability wrapper.
The five-percentage-point improvement is often marginal in business terms. The cost of inscrutability is large. But accuracy is measurable and appears on dashboards. Interpretability is fuzzy and does not have a number.
Teams optimize for accuracy metrics and then attempt to bolt explainability onto opaque models after the fact. They could have had real interpretability by accepting slightly lower accuracy. They chose to have neither real interpretability nor high reliability.
Complex models fail in subtle ways that simple models do not. Neural networks are fragile to distribution shift, adversarial inputs, and edge cases. Interpretable models fail predictably. When a decision tree makes a mistake, you can trace the exact branch that caused it.
When a neural network makes a mistake, explainability tools generate post-hoc narratives about why it might have erred. Those narratives are guesses.
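The decision-tree half of that contrast can be sketched directly (tree structure and feature names invented for illustration):

```python
# A hand-written toy tree (the structure is invented, not learned from data).
TREE = {
    "feature": "credit_utilization", "threshold": 0.6,
    "left":  {"leaf": "approve"},
    "right": {"feature": "account_age_years", "threshold": 2,
              "left":  {"leaf": "deny"},
              "right": {"leaf": "approve"}},
}

def trace(node, applicant, path=()):
    """Return the decision plus every comparison on the branch that produced it."""
    if "leaf" in node:
        return node["leaf"], list(path)
    feat, thr = node["feature"], node["threshold"]
    went_left = applicant[feat] <= thr
    step = f"{feat}={applicant[feat]} {'<=' if went_left else '>'} {thr}"
    return trace(node["left" if went_left else "right"], applicant, path + (step,))

decision, path = trace(TREE, {"credit_utilization": 0.8, "account_age_years": 1})
print(decision, path)  # deny, via the exact branch: utilization > 0.6, age <= 2
```

The trace is the computation. No surrogate, no approximation, nothing to be unfaithful to.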
Explanations fail for ensemble and deep models
Explainability methods barely work for single models. They fail completely for ensembles and very deep networks.
Gradient boosted trees are ensembles of hundreds or thousands of decision trees. Each tree is interpretable individually. The ensemble is not. Explaining why the ensemble made a prediction requires understanding how hundreds of trees voted and how those votes combined. Explainability tools collapse this into feature importance rankings that obscure the actual decision process.
Transformer models have hundreds of millions or billions of parameters organized in dozens of layers, with attention mechanisms connecting distant parts of the input. Explaining a transformer’s output requires understanding how attention patterns propagate through layers and how residual connections preserve information from earlier layers.
Current explainability methods cannot do this. They approximate transformer behavior with simpler models or highlight input tokens that correlate with outputs. Neither approach explains what the transformer computed.
As models get larger and more complex, explainability methods become less faithful. The gap between what the explanation claims and what the model does widens.
Regulatory compliance drives cosmetic explainability
GDPR is widely read as granting a right to explanation for automated decisions. The regulation does not specify what counts as an explanation. Organizations interpret this loosely. They deploy complex models and attach explainability tools that generate feature importance scores.
A loan denial includes an explanation: “credit utilization was the primary factor.” That satisfies the regulatory requirement. The applicant receives words that look like an explanation. Whether those words accurately describe why the model denied the loan is unverified and probably unverifiable.
Regulators lack the technical capacity to evaluate whether explanations are faithful. They check that explanations exist, not that explanations are correct. This creates incentives to generate plausible explanations rather than accurate ones.
Banks, insurers, and other regulated entities deploy explainability as a compliance checkbox. The explanations serve legal and PR functions. They do not serve understanding.
Adversarial attacks exploit explanation systems
Explanations can be manipulated. If a model is paired with an explainability tool, adversaries can craft inputs that produce desired predictions with innocuous explanations.
A loan application is designed to trigger approval while generating an explanation that attributes approval to income and employment history. The actual reason is a carefully tuned combination of features that exploits model vulnerabilities. The explainability tool misses this because it analyzes local feature importance, not adversarial structure.
Content moderation systems using explainability can be bypassed by inputs that violate policies while generating explanations that emphasize benign features. The text contains hate speech, but the explainability tool highlights neutral words that happen to correlate with non-violation in the local approximation.
Explainability tools add attack surface. They assume inputs are benign. Adversarial inputs break that assumption.
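The loan example above can be made concrete in stylized form (model, backdoor location, and feature meanings all invented): the surrogate explainer samples around the input, misses the narrow region that actually drives the score, and reports only the benign part.

```python
import numpy as np

def model(x):
    """Invented scorer: a benign linear part plus a narrow 'backdoor'
    region that forces a high score regardless of the linear part."""
    benign = 0.5 * x[0] + 0.2 * x[1]        # income, employment history
    hit = np.linalg.norm(x - np.array([0.3, 0.7])) < 1e-6
    return benign + (5.0 if hit else 0.0)

x_adv = np.array([0.3, 0.7])                # crafted to land on the backdoor

# LIME-style perturbations all step off the narrow backdoor, so the
# surrogate only ever observes the benign linear behavior.
rng = np.random.default_rng(2)
Z = x_adv + rng.normal(scale=0.1, size=(300, 2))
y = np.array([model(z) for z in Z])
A = np.hstack([Z, np.ones((300, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(model(x_adv))  # ~5.29: the backdoor dominates the score
print(coef[:2])      # ~[0.5, 0.2]: the explanation reports only income and employment
```

The prediction is driven by the backdoor. The explanation is driven by everything except the backdoor.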
Explanations do not prevent harm
The stated goal of explainable AI is to make systems safer and more trustworthy. Explanations do not achieve this. They make systems feel safer without making them actually safer.
A medical diagnosis model misdiagnoses a patient. An explainability tool attributes the diagnosis to symptoms and test results. The explanation sounds medical. The diagnosis is still wrong. The explanation reassured the doctor, which made the error more likely to go unchallenged.
Explainability creates a false sense of oversight. Humans reviewing AI outputs with explanations assume the explanations are accurate. They lower their guard. Errors that would have been caught through manual review slip through because the explanation made the output seem reasonable.
The harm from incorrect AI decisions is not reduced by attaching explanations to those decisions. It may be increased because explanations reduce scrutiny.
What explanations actually reveal
When an explainability tool highlights certain features as important, it reveals correlations in the training data. It does not reveal causation. It does not reveal model internals. It does not reveal whether the decision was correct.
Feature importance tells you what patterns the model learned. If the model learned the wrong patterns, feature importance will show you the wrong patterns, framed as if they were legitimate decision factors.
A hiring model trained on biased historical data learns that certain names correlate with rejection. Explainability tools might label this as “cultural fit” or “communication style.” The explanation hides the bias behind neutral language.
Explanations are useful for debugging in this narrow sense. They can reveal that a model learned something problematic. But they only reveal this if someone knows to look for it and interprets the explanation critically. Most users of explainability tools take explanations at face value.
The interpretability trade space
Organizations want models that are accurate, interpretable, and cheap to deploy. These requirements conflict. You can have any two.
Accurate and interpretable models exist. Logistic regression and decision trees can be highly accurate on structured data, but only with manual feature engineering, which is expensive and requires domain expertise.
Accurate and cheap models exist. Train a neural network on raw features. Let it learn representations automatically. It will be accurate and opaque.
Interpretable and cheap models exist. Use simple heuristics. They are transparent and require minimal engineering. They are not accurate.
Explainability tools promise to make accurate and cheap models interpretable. They do not deliver. They add a layer of plausible rationalization on top of opaque models. The models remain opaque. The rationalization is cheap. The accuracy claims become suspect because no one can verify them.
True interpretability requires choosing interpretable architectures and accepting their limitations. Explainability tools let organizations pretend they can avoid that choice. They cannot.