Why evaluating a DSS is hard
In ML you score a function f : X → Y on how often it's right. But prediction ≠ decision support: a DSS operates among people and inside organizations.
Picture a medical DSS that predicts disease risk with excellent accuracy. Does that improve clinical decisions? Not necessarily — its recommendations may be ignored, its explanations incomprehensible, users may over-trust its mistakes, its outputs may not fit the workflow, or it may overload people. A great predictor can still be a useless DSS.
Lecture 6 said a DSS is a socio-technical system — Human + DSS + Interaction. Evaluation has to respect that: you can't measure only the algorithm and call it done. You measure technical performance and human behaviour and organizational fit.
Outcome vs process
Two systems can reach the same decisions but support people completely differently:
Outcome-oriented
Judges the final performance — accuracy, utility, cost reduction. "Did the decisions come out well?"
Process-oriented
Judges how decisions were reached — cognitive effort, trust & reliance, interpretability, interaction dynamics. A DSS can improve outcomes while increasing automation bias, eroding human expertise, and breeding overreliance.
So evaluation is inherently multi-dimensional:
| Dimension | Asks… |
|---|---|
| Predictive | Is the model accurate? |
| Decision-theoretic | Does it improve utility? |
| Behavioral | How do users react? |
| Cognitive | Does it reduce workload? |
| Organizational | Does it fit the workflow? |
| Ethical | Is it fair and accountable? |
Classical quantitative evaluation
Start with the numbers — but learn where they lie. Most DSS make a binary call Y ∈ {0,1}, summarised by a confusion matrix.
The confusion matrix cross-tabulates actual against predicted classes. TP/TN = correct · FP/FN = errors.
| | Predicted + | Predicted − |
|---|---|---|
| Actual + | TP | FN |
| Actual − | FP | TN |
Accuracy = fraction correct — but misleading under class imbalance (1% disease prevalence → "always healthy" scores 99%). Precision (PPV) = of the positives you flagged, how many were real. Recall (sensitivity/TPR) = of the real positives, how many you caught. They trade off: pushing recall up usually drags precision down. Specificity (TNR) and NPV are the mirror images for negatives.
Which of precision and recall to favour depends on the cost of each error. Medical screening → high recall (missing a sick patient is catastrophic). Spam filtering → high precision (deleting a real email is worse than letting spam through). F1 (harmonic mean of precision & recall) and balanced accuracy (mean of sensitivity & specificity) each compress two metrics into one number, but they still ignore utility, organizational cost, and human use.
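As a minimal sketch of the metrics above, the function below recomputes them from the four cell counts. The function name and the choice to report undefined ratios as `None` are assumptions made for illustration; the example reproduces the "always healthy" trap at 1% prevalence.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return the standard metrics for one 2x2 confusion matrix."""
    def ratio(num, den):
        # A ratio with an empty denominator is undefined; report it as None.
        return num / den if den else None

    recall = ratio(tp, tp + fn)          # sensitivity / TPR
    specificity = ratio(tn, tn + fp)     # TNR
    balanced = None
    if recall is not None and specificity is not None:
        balanced = (recall + specificity) / 2
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),  # PPV
        "recall": recall,
        "specificity": specificity,
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "balanced_accuracy": balanced,
    }


# "Always healthy" on 1,000 patients with 1% prevalence: 10 sick, all missed.
always_healthy = confusion_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_healthy["accuracy"])           # 0.99
print(always_healthy["recall"])             # 0.0
print(always_healthy["balanced_accuracy"])  # 0.5
```

Precision comes back as `None` here because the classifier never flags anyone, which is itself a warning sign that accuracy alone hides.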
From scores to decisions: thresholds, ROC, AUC
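Many DSS output a probability p̂ = P(Y=1|X), turned into a yes/no by a threshold τ: predict positive if p̂ ≥ τ. Lowering τ raises recall; raising τ usually raises precision. So threshold choice is itself a decision-theoretic problem. The ROC curve visualises the trade-off by plotting TPR against FPR (1 − specificity) as τ sweeps from high to low. AUC, the area under that curve, is the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (0.5 = chance, 1 = perfect ranking).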
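To make the threshold sweep concrete, here is a small sketch that builds ROC points by trying every distinct score as τ and integrates them with the trapezoid rule. The scores and labels are made-up toy data, and the code assumes both classes are present.

```python
def roc_points(scores, labels):
    """Sweep tau over every distinct score; return (FPR, TPR) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # tau above every score: nothing is flagged
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (illustrative only): 4 positives, 4 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))  # 0.8125
```

The result matches the ranking view of AUC: 13 of the 16 positive/negative pairs are ordered correctly, and 13/16 = 0.8125.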
A DSS is calibrated if P(Y=1 | p̂=p) = p — among cases it calls "80% likely", about 80% truly are. Why it's special for a DSS: a human reads that number and decides how much to trust it. Poor calibration breeds overconfidence, undertrust, and inappropriate reliance. AUC and accuracy completely ignore it.
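One common way to check the definition above is a reliability table: bin the predictions and compare the mean p̂ in each bin with the observed positive rate. The sketch below does that and summarises the gaps as an expected calibration error (ECE). ECE and the equal-width binning are assumptions brought in for illustration, not something the lecture prescribes.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per non-empty bin: (count, mean predicted probability, observed rate)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((len(members), mean_p, rate))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average of |mean p-hat - observed rate| over bins."""
    total = sum(n for n, _, _ in rows)
    return sum(n / total * abs(mean_p - rate) for n, mean_p, rate in rows)


# Calibrated toy case: "80%" cases are right 4 times in 5, "20%" cases 1 in 5.
calibrated = reliability_table([0.8] * 5 + [0.2] * 5,
                               [1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
# Overconfident toy case: says 90% but is right only half the time.
overconfident = reliability_table([0.9] * 4, [1, 1, 0, 0])

print(round(expected_calibration_error(calibrated), 3))     # 0.0
print(round(expected_calibration_error(overconfident), 3))  # 0.4
```

Both toy models could have identical accuracy at a fixed threshold; only the reliability table shows which probabilities a human can take at face value.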
Costs, utility & net benefit
Errors aren't equal: a missed cancer ≠ a spam false alarm. Cost-sensitive evaluation weights each cell of the confusion matrix with a cost matrix; utility-aware evaluation (decision theory, Lecture 1!) goes further and scores each action a by its expected utility, EU(a) = Σ_y P(Y=y | X) · U(a, y), recommending the action that maximises it.
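A short sketch of utility-aware scoring for a binary outcome, assuming the standard expected-utility rule EU(a) = p̂ · U(a, 1) + (1 − p̂) · U(a, 0). The action names and utility values are hypothetical, chosen only to show that the best action can flip well below p̂ = 0.5.

```python
def expected_utility(p_positive, utilities, action):
    """EU(a) = p * U(a, Y=1) + (1 - p) * U(a, Y=0)."""
    return (p_positive * utilities[(action, 1)]
            + (1 - p_positive) * utilities[(action, 0)])


def best_action(p_positive, utilities, actions=("treat", "wait")):
    """Pick the action with the highest expected utility."""
    return max(actions,
               key=lambda a: expected_utility(p_positive, utilities, a))


# Hypothetical utilities: missing a sick patient is far worse than
# treating a healthy one.
U = {
    ("treat", 1): 10, ("treat", 0): -1,
    ("wait", 1): -20, ("wait", 0): 0,
}

print(best_action(0.10, U))  # treat
print(best_action(0.02, U))  # wait
```

With these numbers the break-even probability is 1/31 (about 3%), so a 10% risk already justifies treating even though a 0.5 threshold would predict "negative".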
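Decision Curve Analysis makes this operational with net benefit, which counts true positives as good and penalises false positives by the threshold-implied cost: NB(τ) = TP/N − (FP/N) · τ/(1 − τ), where N is the number of cases and τ is the threshold probability at which the decision-maker would act.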
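The following sketch assumes the usual decision-curve definition, net benefit = TP/N − (FP/N) · τ/(1 − τ), and compares a model against the two default strategies "treat everyone" and "treat no one". The counts are invented for illustration.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit at threshold probability tau (0 < tau < 1)."""
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Invented cohort: 1,000 patients, 100 truly sick, decision threshold 0.2.
n, tau = 1000, 0.2
model = net_benefit(tp=80, fp=100, n=n, threshold=tau)
treat_all = net_benefit(tp=100, fp=900, n=n, threshold=tau)
treat_none = 0.0  # no true positives, no false positives

print(round(model, 3))      # 0.055
print(round(treat_all, 3))  # -0.125
print(treat_none)           # 0.0
```

A model is worth using at a given τ only if its net benefit beats both defaults, which is the comparison a decision curve plots across thresholds.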
Beyond accuracy
A DSS can ace every metric above and still fail: accurate but unfair, accurate only on yesterday's data, accurate but unusable, accurate but harmful.
Distribution shift
A DSS learns under Ptrain(X,Y) but runs under Pdeploy(X,Y). When they differ, performance silently degrades — a medical DSS moved to a new hospital, a financial DSS during a crisis, a recommender after tastes change.
| Type of shift | What changes |
|---|---|
| Covariate shift | P(X) changes, P(Y|X) stable |
| Label shift | P(Y) changes |
| Concept drift | P(Y|X) changes — the most dangerous: the learned relationships become invalid |
Robustness tests stability under perturbations (noise, missing data, sensor failures): a robust system keeps Δ = |L(f,x) − L(f, x+δ)| small. And a wise DSS does OOD detection — recognising inputs unlike its training data and deferring to a human instead of guessing.
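The robustness gap Δ can be probed empirically. The sketch below takes the worst |L(f,x) − L(f, x+δ)| over random Gaussian perturbations δ; the logistic `risk_model` is a hypothetical stand-in for a trained DSS, and log loss, the noise model, and the number of trials are all assumptions.

```python
import math
import random


def risk_model(x):
    """Hypothetical stand-in for a trained DSS: a fixed logistic score."""
    z = 0.8 * x[0] - 0.5 * x[1]
    return 1 / (1 + math.exp(-z))


def log_loss(p, y):
    """Log loss of predicted probability p against binary label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def robustness_gap(model, x, y, noise_scale, trials=200, seed=0):
    """Largest |L(f, x) - L(f, x + delta)| over random perturbations."""
    rng = random.Random(seed)  # fixed seed so the probe is repeatable
    base = log_loss(model(x), y)
    worst = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0, noise_scale) for v in x]
        worst = max(worst, abs(base - log_loss(model(noisy), y)))
    return worst


small = robustness_gap(risk_model, [1.0, 2.0], y=1, noise_scale=0.01)
large = robustness_gap(risk_model, [1.0, 2.0], y=1, noise_scale=1.0)
print(small < large)  # True
```

Random sampling only gives a lower bound on the true worst case; an adversarial search over δ would be a stricter test.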
Fairness
"95% accuracy" can mean 99% for one group and 70% for another. A DSS can amplify social inequalities, so evaluation must be stratified by protected attributes A (gender, ethnicity, age, …) — subgroup performance, subgroup calibration, subgroup robustness.
Demographic parity
The positive-prediction rate is the same across groups: P(Ŷ=1|A=a) = P(Ŷ=1|A=b).
Equalized odds
Fixing the true outcome, the prediction doesn't depend on the group: P(Ŷ|Y,A=a) = P(Ŷ|Y,A=b). In the binary case this means equal TPR and equal FPR across groups.
Calibration fairness
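Calibration holds equally in every group: P(Y=1 | p̂=p, A=a) = p for each group a. (Multicalibration is a stronger version that demands this across many overlapping subgroups.) Key catch: these three criteria generally cannot all hold at once unless base rates are equal across groups or the predictor is perfect, so you must choose.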
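A stratified check of the first two criteria can be sketched as below: per group, compute the positive-prediction rate (demographic parity) and the TPR/FPR pair (equalized odds). The data is a toy example, and the code assumes every group contains both outcomes.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group positive-prediction rate, TPR and FPR."""
    out = {}
    for g in sorted(set(groups)):
        rows = [(y, p) for y, p, a in zip(y_true, y_pred, groups) if a == g]
        pos = [p for y, p in rows if y == 1]  # predictions for actual positives
        neg = [p for y, p in rows if y == 0]  # predictions for actual negatives
        out[g] = {
            "positive_rate": sum(p for _, p in rows) / len(rows),
            "tpr": sum(pos) / len(pos),
            "fpr": sum(neg) / len(neg),
        }
    return out


# Toy data: two groups with identical base rates.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
print(rates["a"])  # {'positive_rate': 0.5, 'tpr': 1.0, 'fpr': 0.0}
print(rates["b"])  # {'positive_rate': 0.5, 'tpr': 0.5, 'fpr': 0.5}
```

Both groups are flagged at the same rate, so demographic parity holds, yet group b gets half the true-positive rate and many more false alarms, so equalized odds fails.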
Explainability — and explaining well
Complex models (ensembles, deep nets, LLMs) need explanations for trust, accountability, debugging, justification. But having explanations ≠ having useful ones — explainability itself must be evaluated, along several axes: fidelity (does it reflect the real model?), comprehensibility, completeness, actionability, cognitive compatibility. There's a real tension: more comprehensible often means less faithful.
Perceived understanding ≠ actual understanding. An explanation can make a user feel more confident without making them any more correct. That's why explainability usually needs human studies, not just code metrics — and ties straight back to trust calibration.
And none of this is context-free: the same DSS performs differently by workflow, expertise, time-pressure, law, and culture — Performance = f(System, User, Context).
Human–DSS interaction
The final outcome is Human + DSS + Interaction. So the real question isn't "how good is the model?" but "how good is the team?"
Does the human+DSS team beat either alone? Four outcomes are possible — including Human + DSS < both. Pairing a person with a tool can degrade performance through cognitive overload, misplaced trust, and workflow disruption. So team performance must be measured, not assumed.
Interaction is evaluated along six dimensions:
| Dimension | Asks… |
|---|---|
| Effectiveness | Does decision quality improve? (ΔU = Uwith − Uwithout) |
| Efficiency | Does workload / time drop? |
| Trust | Is it trusted appropriately? |
| Reliance | Are users over/under-relying? |
| Usability | Is interaction cognitively manageable? |
| Adoption | Is the DSS actually used? |
Efficiency = decision quality ÷ resource consumption, and "resources" include human cognitive workload (measured with NASA-TLX, eye-tracking, think-aloud…). The catch: an information-rich DSS can lower efficiency — overload pushes the user back onto crude heuristics, exactly the biased shortcuts from Lecture 5.
Trust vs reliance — and why it's tricky to measure
Trust is a psychological attitude toward the DSS; reliance is the behaviour of actually depending on its advice. Attitude drives behaviour. The goal is appropriate reliance: trust ≈ the DSS's real competence — avoiding both under-reliance (ignoring good advice) and over-reliance (accepting advice without scrutiny).
Since trust is internal, we measure it by proxies: direct (questionnaires, which ask about trust itself but give noisy, subjective answers) or indirect (acceptance/override rates, behaviour after the DSS makes an error, behaviour under uncertainty).
Just watching whether a user agrees or disagrees with the AI is not enough: reliance ≠ agreement/disagreement. If a user already had the right answer, "agreeing" tells you nothing about reliance. Proper studies are human-first: capture the person's baseline reasoning, then show the AI and measure how their thinking changes.
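The human-first design can be turned into a simple analysis: keep only the trials where the person's baseline answer differed from the AI's advice, then see how often and in which direction they switched. This is one possible operationalisation, not the lecture's prescribed measure; the field names and trial data are hypothetical.

```python
def reliance_summary(initial, advice, final, truth):
    """Summarise reliance on trials where the user first disagreed with the AI."""
    # Trials where baseline == advice are uninformative: agreeing there
    # says nothing about whether the user relied on the AI.
    disagreed = [(f, a, t)
                 for i, a, f, t in zip(initial, advice, final, truth)
                 if i != a]
    switched = [(a, t) for f, a, t in disagreed if f == a]
    return {
        "informative_trials": len(disagreed),
        "switch_rate": len(switched) / len(disagreed) if disagreed else None,
        "switched_to_correct": sum(1 for a, t in switched if a == t),
        "switched_to_wrong": sum(1 for a, t in switched if a != t),
    }


# Hypothetical six-trial session (binary answers).
initial = [1, 0, 1, 0, 1, 0]  # user's answer before seeing the AI
advice = [1, 1, 0, 0, 0, 1]   # AI recommendation
final = [1, 1, 0, 0, 1, 1]    # user's answer after seeing the AI
truth = [1, 1, 1, 0, 1, 0]

print(reliance_summary(initial, advice, final, truth))
# {'informative_trials': 4, 'switch_rate': 0.75, 'switched_to_correct': 1, 'switched_to_wrong': 2}
```

Raw agreement between final answers and advice is 5 of 6 here, which looks healthy; the breakdown shows that two of the three switches followed wrong advice, a sign of over-reliance.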
Usability, adoption & how you actually run the study
Usability (learnability, cognitive simplicity, consistency, accessibility) is measured with usability testing, cognitive walkthroughs, and scores like SUS. Adoption asks whether it's truly woven into the workflow — frequency of use, retention, voluntary engagement. And effectiveness itself is established with real study designs:
RCT
Randomized controlled trial: randomly assign users to work with or without the DSS and compare the groups under controlled conditions. Strong causal claims, but often low ecological validity.
A/B testing
Compare alternative versions/interfaces in a real deployment.
Simulated tasks
Replayed/synthetic scenarios where ground truth is known — cheap and controllable.
Longitudinal
Track performance over extended real-world use — catches adoption, drift, and habit effects.
The recurring tension across all of them: laboratory control ≠ real-world ecological validity.
No single score is enough. A useful DSS is evaluated across accuracy, calibration, robustness, fairness, interpretability, usability, and utility — and different applications weight them differently. Good DSS = Good Predictor + everything accuracy forgets.
The Exam Lab
This lecture mixes real formulas (the metrics) with conceptual breadth. Know the metrics cold, and be ready to discuss why they aren't enough.
① define / give the metric · ② show its limit · ③ name what to use instead (and why). The thread the professor wants: evaluation is multi-dimensional — accuracy is the start, never the end. For the confusion-matrix metrics it's fine to give the formula, but always pair it with the intuition and a use-case.
Why is high predictive accuracy not enough to call a DSS good? Discuss.
Define accuracy, precision, recall, specificity. Why is accuracy misleading under class imbalance, and when do you prefer high recall vs high precision? Discuss.
What does the threshold control? What does AUC measure, and why does calibration matter especially for a DSS with human users? Discuss.
Why evaluate a DSS by cost / utility / net benefit rather than accuracy? Discuss.
Explain distribution shift (its types), robustness/OOD, and fairness in DSS evaluation. Discuss.
How do you evaluate human–DSS interaction? Cover team performance, trust vs reliance, and why reliance ≠ agreement. Discuss.
• Why is "good predictor" not "good DSS"? Outcome vs process.
• Accuracy, precision, recall, specificity — formula + one-line meaning each.
• The class-imbalance trap (99% accuracy, 0% recall).
• Threshold → ROC/AUC → calibration: what each adds, what each ignores.
• Net benefit / utility: why best predictor ≠ best DSS.
• Three types of distribution shift; OOD = defer to human.
• Three fairness criteria — and that they can't all hold at once.
• Trust vs reliance; why reliance ≠ agreement; four study designs.