Remember HAL 9000 from 2001: A Space Odyssey? When asked why he was being weird, HAL’s response was: “I’m sorry, Dave. I’m afraid I can’t do that.” If only Dave had had explainable AI, he’d have avoided that whole murdering-a-computer-in-space thing.
Modern AI systems are less murderous but equally opaque. They reject loan applications, flag content for removal, and deny insurance claims without explanation. When pressed, they produce something called “explainability”—a set of numbers that allegedly describe why the model did what it did.
These explanations are often worse than no explanation. They provide false confidence. They suggest the model is interpretable when it is not. They create liability when deployed incorrectly.
What explainability actually means
Explainability is not a single concept. It is at least four different things that get confused in practice.
Model transparency is understanding the algorithm. Linear regression is transparent. You can see the coefficients. Deep neural networks are not. You cannot inspect 100 million parameters and understand what they encode.
Prediction justification is explaining why a specific output was produced for a specific input. This is what LIME and SHAP attempt to provide. They do not explain the model. They explain one prediction.
Feature importance is identifying which inputs most influenced the output. This sounds useful until you realize importance is context-dependent. A feature can be critical for one prediction and irrelevant for another.
Counterfactual explanation is describing what would need to change for a different outcome. “Your loan was denied because your income was too low” is a counterfactual. It does not explain the model. It identifies a path to a different decision.
Organizations ask for “explainability” without specifying which of these they need. They get the wrong one, then discover it does not solve their problem.
LIME does not explain the model
LIME—Local Interpretable Model-agnostic Explanations—is popular because it works with any model and produces numbers that look like explanations.
It works by training a simple model to approximate the complex model’s behavior in a local region around a specific prediction. The simple model is interpretable. The complex model is not.
This creates an obvious problem. LIME explains the approximation, not the original model. If the approximation is poor, the explanation is wrong. If the approximation is good locally but the model behaves differently elsewhere, the explanation does not generalize.
LIME also assumes the approximation model is the right structure. It typically uses linear models. If the original model’s local behavior is non-linear, the linear approximation will be misleading.
Organizations deploy LIME to satisfy explainability requirements. They show users feature weights. The weights describe the local linear approximation, not the model. The user receives an explanation that is technically correct for a model they did not interact with.
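The procedure is easy to sketch from scratch, which makes the problem concrete. This is a minimal local-surrogate sketch in the spirit of LIME, not the lime library itself, and `black_box` is a made-up stand-in for an opaque model: perturb the instance, weight samples by proximity, fit a weighted linear model, and report its coefficients as the “explanation.”

```python
import numpy as np

def black_box(X):
    # Hypothetical opaque model: nonlinear in both features.
    return np.tanh(3 * X[:, 0]) + X[:, 1] ** 2

def lime_sketch(f, x0, n_samples=500, width=0.5, seed=0):
    """Fit a local weighted linear surrogate around x0 (the LIME idea)."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance.
    X = x0 + rng.normal(scale=width, size=(n_samples, x0.size))
    y = f(X)
    # 2. Weight samples by proximity to x0 (exponential kernel).
    d2 = ((X - x0) ** 2).sum(axis=1)
    w = np.exp(-d2 / width**2)
    # 3. Weighted least squares: scale rows by sqrt(w), then solve.
    A = np.hstack([X, np.ones((n_samples, 1))])  # add intercept column
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)
    return coef[:-1]  # per-feature local weights (intercept dropped)

weights = lime_sketch(black_box, np.array([0.1, 0.8]))
# Repeat at a different point: the "explanation" changes, because it
# describes the local linear fit, not the model.
weights_far = lime_sketch(black_box, np.array([2.0, -0.5]))
```

The coefficients are faithful only to the linear approximation inside the sampling neighborhood; move the instance and they change.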
SHAP is expensive and misleading
SHAP—SHapley Additive exPlanations—is based on game theory. It computes each feature’s contribution to a prediction by considering all possible combinations of features.
This is mathematically principled. It is also computationally expensive. For a model with N features, exact SHAP requires evaluating on the order of 2^N feature subsets. This is infeasible for models with hundreds of features.
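The exact computation can be written as a brute-force enumeration of coalitions, which makes the exponential cost visible. This is a from-scratch sketch, not the shap library; “removing” a feature here means fixing it at a baseline value, which is itself a modeling choice.

```python
import itertools
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for one prediction, by enumerating coalitions."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in itertools.combinations(others, size):
                # Coalition input: features in S take their real values,
                # everything else is fixed at the baseline.
                def value(include_i):
                    z = list(baseline)
                    for j in S:
                        z[j] = x[j]
                    if include_i:
                        z[i] = x[i]
                    return f(z)
                # Shapley weight for a coalition of this size.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (value(True) - value(False))
    return phi

# Linear toy model: Shapley values recover coefficient * (x - baseline).
model = lambda z: 2 * z[0] + 3 * z[1] + 1 * z[2]
phi = shapley_values(model, x=[1, 1, 1], baseline=[0, 0, 0])
```

Each feature requires a pass over every subset of the remaining features, so model evaluations grow exponentially with N. That is why production systems fall back to approximations.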
Approximate SHAP methods exist. They trade accuracy for speed. The explanations become estimates of estimates. The error compounds.
SHAP also assumes features are independent. They are not. Features correlate. Removing one feature changes the distribution of others. SHAP computes Shapley values as if features can be independently turned on and off. This assumption is violated in most real datasets.
The result is a number that looks precise but represents a counterfactual world where features are independent. That world does not exist. The explanation describes behavior that does not occur.
Attention is not explanation
Transformer models use attention mechanisms. Attention weights show which tokens the model focused on when generating output.
This is often presented as explanation. High attention means the token was important. Low attention means it was ignored.
This is wrong. Attention weights show where the model looked. They do not show what the model learned or why. A model can attend to a token and ignore its meaning. A model can attend to a token because it is irrelevant and needs to be suppressed.
Attention weights are also layer-specific. A token might receive low attention in one layer and high attention in another. Which layer’s attention is the explanation? All of them? None of them?
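The weights themselves are just a softmax over query-key scores, computed independently in every layer and head. A toy numpy sketch (hand-picked vectors, not a trained transformer) shows two layers producing perfectly valid distributions that rank the same tokens differently:

```python
import numpy as np

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query position."""
    scores = keys @ query / np.sqrt(query.size)
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

query = np.array([1.0, 0.0])
layer1_keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
layer2_keys = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])  # later layer

w1 = attention_weights(query, layer1_keys)
w2 = attention_weights(query, layer2_keys)
# Both are valid probability distributions over the same three tokens,
# but they disagree on which token matters most.
```

Each layer’s weights sum to one and look equally authoritative. Nothing in the mechanism designates one of them as “the” explanation.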
Visualization tools show attention heatmaps. They look convincing. They do not explain predictions. They show intermediate computation steps. These steps are not interpretable without understanding what each layer encodes.
Organizations that point to attention weights as explanations are confusing observability with understanding.
Saliency maps highlight artifacts
Saliency maps show which pixels in an image most influenced a classification. They are generated by computing gradients with respect to the input.
This works in theory. In practice, saliency maps often highlight noise, edges, and artifacts that have nothing to do with the classification.
A model might classify an image as “dog” based on grass texture because most dog images in the training set have grass. The saliency map highlights grass. The explanation is: the model classified this as a dog because of grass. This is true but useless. It does not explain dog recognition. It exposes dataset bias.
Saliency maps also vary based on the method used. Different gradient-based techniques produce different maps for the same prediction. There is no ground truth. The explanation depends on the explanation method.
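Both points are easy to reproduce with a finite-difference sketch on a toy scorer (a hypothetical “classifier,” not a trained network). Plain gradients and gradient-times-input are both common methods, and they disagree on the same prediction:

```python
import numpy as np

def score(img):
    # Toy "classifier": only the left half of the image affects the score,
    # the way grass texture might dominate a dog classifier.
    return float(img[:, : img.shape[1] // 2].sum())

def saliency(f, img, eps=1e-4):
    """Finite-difference gradient per pixel."""
    base = f(img)
    grad = np.zeros_like(img)
    for idx in np.ndindex(img.shape):
        bumped = img.copy()
        bumped[idx] += eps
        grad[idx] = (f(bumped) - base) / eps
    return grad

img = np.array([[0.2, 0.9], [0.7, 0.4]])
plain = np.abs(saliency(score, img))
times_input = np.abs(saliency(score, img) * img)  # a second common method
# Same model, same prediction, two different "explanations."
```

The map faithfully highlights what the model uses (the left half), which says nothing about what it should use, and the two techniques assign different magnitudes to the same pixels.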
Organizations use saliency maps to debug models. The maps show what the model used, not what it should have used. Fixing saliency map problems requires changing the training data or the model architecture. The explanation does not provide a path to the fix.
Explanations are adversarially fragile
Small changes to input can produce large changes in explanations, even when the prediction does not change.
Add imperceptible noise to an image. The classification remains the same. The saliency map changes completely. The explanation is now different for the same prediction.
This happens because explanations are computed from gradients. Gradients are sensitive to input perturbations. Predictions are computed from forward passes, which are more stable.
This creates a problem for explainability. If the explanation changes when the input changes imperceptibly, the explanation is not describing a stable property of the model. It is describing a local gradient that shifts with noise.
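A one-dimensional sketch shows why: gradients of ReLU-style models are discontinuous at the kinks, so a perturbation too small to move the decision can still flip the gradient-based explanation. The model and numbers here are hand-picked to sit on such a kink:

```python
def relu(v):
    return max(0.0, v)

def model(x0, x1):
    # Toy score with a ReLU kink at x0 = 0 and a steep slope beyond it.
    return relu(100.0 * x0) + x1

def grad(f, x0, x1, eps=1e-6):
    """Forward-difference gradient of f at (x0, x1)."""
    base = f(x0, x1)
    return ((f(x0 + eps, x1) - base) / eps,
            (f(x0, x1 + eps) - base) / eps)

decision = lambda x0, x1: model(x0, x1) > 0.5  # the "prediction"

a = (0.0001, 0.6)
b = (-0.0001, 0.6)  # imperceptible perturbation of a
same_prediction = decision(*a) == decision(*b)  # both approved
explanation_a = grad(model, *a)  # x0 dominates (~100 vs ~1)
explanation_b = grad(model, *b)  # x0 contributes nothing (0 vs ~1)
```

The forward pass barely moves; the gradient jumps from roughly 100 to exactly 0. Any explanation built on that gradient jumps with it.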
Adversarial examples make this worse. An input can be crafted to produce a specific explanation while maintaining the same prediction. The explanation can be manipulated without changing the output. This means explanations are not trustworthy signals of model behavior.
Post-hoc explanations are rationalizations
Most explainability methods are post-hoc. The model is trained. Then explanations are generated afterward.
This is rationalization, not explanation. The model was not designed to be interpretable. It was designed to minimize loss. The explanations are reverse-engineered.
Humans rationalize too. We make decisions based on subconscious processes, then construct plausible stories to explain them. The stories are not the causes. They are narratives that sound reasonable.
Post-hoc AI explanations are the same. The model’s decision process is opaque. The explanation is a plausible narrative generated after the fact. It might align with the true cause. It might not. There is no way to verify.
Organizations that rely on post-hoc explanations are treating rationalizations as ground truth. The explanation sounds good. It is not necessarily correct.
Interpretable models are not always better
The alternative to post-hoc explanation is to use inherently interpretable models. Decision trees, linear models, and rule-based systems are interpretable by design.
This works when interpretable models have acceptable performance. Often they do not. Complex tasks require complex models. Interpretable models underfit.
Organizations are told to choose between accuracy and interpretability. This is framed as a trade-off. It is often a false choice. Interpretable models are not actually interpretable at scale.
A decision tree with 10 nodes is interpretable. A decision tree with 10,000 nodes is not. A linear model with 10 coefficients is interpretable. A linear model with 10,000 coefficients is not. Interpretability does not scale with model size.
Large interpretable models are no more understandable than black box models. They just create the illusion of transparency.
Explanations create legal liability
Explanations can be used as evidence in disputes. If a model denies a loan and provides an explanation, that explanation becomes part of the decision record.
If the explanation is wrong—and post-hoc explanations often are—it creates liability. The explanation says income was the reason. Discovery reveals income was not actually predictive. The explanation is now evidence of arbitrary decision-making.
Regulations require explanations for certain decisions. GDPR includes a “right to explanation.” Fair lending laws require adverse action notices. These laws assume explanations are accurate.
When explanations are generated by methods like LIME or SHAP, they are approximations. Approximations can be misleading. Misleading explanations in regulated contexts create legal risk.
Organizations deploy explainability to comply with regulations. They may be increasing risk if the explanations are inaccurate.
Feature importance changes with context
Feature importance is not a fixed property of a model. It depends on the input distribution.
A model trained on diverse data might rely on different features for different subgroups. Age might be important for older applicants. Employment history might be important for younger ones.
Global feature importance averages over all inputs. It does not describe local behavior. Local feature importance—what LIME and SHAP provide—does not describe global patterns.
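Context dependence is easy to demonstrate. A toy model that branches on age (a hypothetical scorer, with made-up coefficients) gives opposite importance rankings for two applicants:

```python
def risk_model(age, years_employed):
    # Toy scorer: relies on age for older applicants,
    # on employment history for younger ones.
    if age >= 60:
        return 0.02 * age
    return 0.1 * years_employed

def local_importance(f, age, years, eps=1.0):
    """Sensitivity of the score to a one-unit change in each feature."""
    base = f(age, years)
    return {
        "age": abs(f(age + eps, years) - base),
        "years_employed": abs(f(age, years + eps) - base),
    }

older = local_importance(risk_model, age=70, years=30)
younger = local_importance(risk_model, age=30, years=5)
# The same question, "which feature matters?", gets opposite answers.
```

Averaging the two into a global ranking would describe neither applicant.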
Organizations ask: “What features does the model use?” The answer is: it depends. Feature importance is conditional on context. A global answer is misleading. A local answer does not generalize.
This makes feature importance less useful than it appears. It cannot answer the question organizations actually care about: “Is this model using protected attributes?” The answer is “sometimes, depending on context,” which is not actionable.
Counterfactuals assume causality
Counterfactual explanations suggest changes that would produce a different outcome. “If your credit score were 50 points higher, the loan would be approved.”
This assumes the model encodes causality. It does not. The model encodes correlation. A higher credit score is correlated with approval. It is not necessarily causal.
The counterfactual explanation implies: increase your credit score, get approved. This might be false. The model might be using credit score as a proxy for something else. Changing the score without changing the underlying factor does not guarantee approval.
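A counterfactual generator can be as simple as a search over one feature. The scoring function and threshold below are invented for illustration; note that everything it finds is a fact about the model’s correlations, not about the world:

```python
def approval_score(income, credit_score):
    # Hypothetical model: a weighted sum of normalized features.
    return 0.4 * (income / 50_000) + 0.6 * (credit_score / 700)

def credit_counterfactual(income, credit_score, threshold=1.0,
                          step=5, cap=1_000):
    """Smallest credit-score increase (in 5-point steps) that crosses
    the approval threshold."""
    delta = 0
    while (approval_score(income, credit_score + delta) < threshold
           and delta < cap):
        delta += step
    return delta

needed = credit_counterfactual(income=40_000, credit_score=600)
# "If your credit score were `needed` points higher, you would be
# approved" -- true of this model, with no causal guarantee attached.
```

The search answers “what input change flips this model’s output,” which is a different question from “what action will change the applicant’s outcome.”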
Counterfactuals also assume the model is stable. If the model is retrained, the same input might produce a different output even if the suggested changes are made. The explanation is no longer valid.
Organizations provide counterfactual explanations to users. Users act on them. The outcomes do not match the explanations. Trust is lost.
Debugging requires more than explanations
Explanations are often sought for debugging. The model is making bad predictions. Explanations should reveal why.
In practice, explanations do not debug models. They show correlations the model learned. They do not show whether those correlations are spurious, biased, or correct.
A model trained on biased data will learn biased patterns. Explanations will reveal those patterns. But knowing the model is biased does not fix it. Fixing requires changing the data, reweighting samples, or modifying the architecture.
Explanations are diagnostic, not corrective. They identify symptoms. They do not provide treatments. Organizations that expect explanations to fix models are disappointed.
Debugging requires access to training data, evaluation metrics, and error analysis. Explanations are one signal among many. They are not sufficient.
The transparency illusion
Explainability tools create an illusion of transparency. They produce numbers, graphs, and visualizations. These artifacts look like understanding.
They are not understanding. They are approximations, gradients, attention weights, and correlations. They describe aspects of model behavior. They do not explain why the model works or whether it is correct.
Organizations deploy these tools to satisfy stakeholders. Regulators want explanations. Users want transparency. Management wants accountability. Explainability tools provide the appearance of all three.
The appearance is not the reality. A LIME explanation satisfies a regulator without actually making the model interpretable. An attention heatmap satisfies a user without actually explaining the prediction. A feature importance chart satisfies management without revealing whether the model is biased.
This is dangerous. It replaces real understanding with plausible-sounding artifacts. Decisions are made based on explanations that are technically correct but practically misleading.
What works instead
Explanation is not a technical problem. It is a requirements problem. Organizations need to specify what question they are trying to answer.
If the question is “Why was this decision made?” the answer is not an explanation method. It is an audit log. Record the inputs, the model version, and the output. That is the actual cause. Explanations are post-hoc rationalizations.
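The audit-log answer is boring and effective. A minimal record might look like this; the field names are illustrative, not a standard, and the digest is a simple tamper-evidence measure:

```python
import datetime
import hashlib
import json

def audit_record(model_version, features, prediction):
    """Record what actually caused the decision: inputs, model, output."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    # Tamper-evidence: a digest over the canonical serialization.
    payload = json.dumps(record, sort_keys=True).encode()
    record["digest"] = hashlib.sha256(payload).hexdigest()
    return record

entry = audit_record("credit-model-v12",
                     {"income": 40_000, "credit_score": 600}, "deny")
```

No approximation, no surrogate model: the record is the decision.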
If the question is “Is this model biased?” the answer is not feature importance. It is fairness metrics computed on test data. Measure disparate impact. Compare error rates across groups. Explanations do not reveal bias. Evaluation metrics do.
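Those checks compute directly from predictions and group labels. A minimal sketch; the 80% disparate-impact ratio used in the comment is a common rule of thumb, not a legal standard:

```python
from collections import defaultdict

def selection_rates(predictions, groups):
    """Fraction of positive decisions per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}

def disparate_impact(predictions, groups):
    """Ratio of the lowest selection rate to the highest."""
    rates = selection_rates(predictions, groups)
    return min(rates.values()) / max(rates.values())

preds = [1, 1, 0, 1, 0, 0, 0, 1]
grps = ["a", "a", "a", "a", "b", "b", "b", "b"]
ratio = disparate_impact(preds, grps)
# ratio well below 0.8 flags a disparity worth investigating
```

This measures the disparity directly instead of inferring it from feature weights.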
If the question is “Can I trust this model?” the answer is not a saliency map. It is validation on out-of-distribution data. Test edge cases. Measure calibration. Explanations do not establish trust. Testing does.
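Calibration, likewise, is measurable rather than explainable. A minimal expected-calibration-error sketch for binary probabilities, using the standard equal-width-bin formulation:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Average |confidence - accuracy| over equal-width probability bins,
    weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece, total = 0.0, len(probs)
    for bucket in bins:
        if not bucket:
            continue
        confidence = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(confidence - accuracy)
    return ece

# Toy cases: 80% confidence with 80% observed positives is calibrated;
# 90% confidence with 20% observed positives is not.
calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
overconfident = expected_calibration_error([0.9] * 5, [1, 0, 0, 0, 0])
```

A model can produce beautiful saliency maps and still be badly miscalibrated; this number catches what the pictures do not.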
If the question is “How do I improve this model?” the answer is not SHAP values. It is error analysis. Identify failure modes. Inspect misclassified examples. Explanations do not guide improvement. Systematic analysis does.
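Error analysis can start as simply as counting misclassifications by a candidate grouping. The grouping key and data here are invented for illustration:

```python
from collections import Counter

def failure_modes(examples, y_true, y_pred, key):
    """Count misclassified examples by a candidate grouping feature."""
    return Counter(
        key(ex) for ex, t, p in zip(examples, y_true, y_pred) if t != p
    )

examples = [{"length": 5}, {"length": 120}, {"length": 130}, {"length": 8}]
errors = failure_modes(
    examples,
    y_true=[1, 1, 0, 0],
    y_pred=[1, 0, 1, 0],
    key=lambda ex: "long" if ex["length"] > 100 else "short",
)
# Errors concentrated in one bucket point at a failure mode to inspect.
```

Slicing errors by hypothesized causes, then reading the examples in the worst slice, guides fixes in a way a per-feature attribution number cannot.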
Explainability methods are useful when the question they answer matches the question being asked. Most organizations ask the wrong questions, get technically correct answers, and remain confused.
HAL would still not explain himself
HAL 9000 refused to explain his behavior because he was programmed to lie. No amount of LIME, SHAP, or attention visualization would have revealed that. His explanations would have been plausible and wrong.
Modern AI systems are not malicious. But their explanations are often plausible and wrong for different reasons. They approximate. They rationalize. They assume independence, stability, and causality that do not exist.
Asking AI to explain itself is asking for a story. Stories are useful when they are accurate. When they are not, they are just creative fiction with gradients.
Organizations that want real transparency need audits, not explanations. Logs, not LIME. Testing, not SHAP. Explanations make models feel understandable. Actual understanding requires different tools.
Dave should have checked the audit logs.