Lecture 7 · DSS Evaluation Principles & Methods

1 · the problem

Why evaluating a DSS is hard

In ML you score a function f : X → Y on how often it's right. But prediction ≠ decision support: a DSS operates among people and inside organizations.

! High accuracy ≠ high decision quality

Picture a medical DSS that predicts disease risk with excellent accuracy. Does that improve clinical decisions? Not necessarily — its recommendations may be ignored, its explanations incomprehensible, users may over-trust its mistakes, its outputs may not fit the workflow, or it may overload people. A great predictor can still be a useless DSS.

💡 in plain words

Lecture 6 said a DSS is a socio-technical system — Human + DSS + Interaction. Evaluation has to respect that: you can't measure only the algorithm and call it done. You measure technical performance and human behaviour and organizational fit.

Outcome vs process

Two systems can reach the same decisions but support people completely differently:

Outcome-oriented

Judges the final performance — accuracy, utility, cost reduction. "Did the decisions come out well?"

Process-oriented

Judges how decisions were reached — cognitive effort, trust & reliance, interpretability, interaction dynamics. A DSS can improve outcomes while increasing automation bias, eroding human expertise, and breeding overreliance.

So evaluation is inherently multi-dimensional:

Dimension	Asks…
Predictive	Is the model accurate?
Decision-theoretic	Does it improve utility?
Behavioral	How do users react?
Cognitive	Does it reduce workload?
Organizational	Does it fit the workflow?
Ethical	Is it fair and accountable?

2 · the quantitative toolkit

Classical quantitative evaluation

Start with the numbers — but learn where they lie. Most DSS make a binary call Y ∈ {0,1}, summarised by a confusion matrix.

The confusion matrix

TP/TN = correct · FP/FN = errors. The four counts determine every metric below, and any single number computed from them can fool you.

	Predicted +	Predicted −
Actual +	TP	FN
Actual −	FP	TN

★ The metrics, in words

Accuracy = fraction correct — but misleading under class imbalance (1% disease prevalence → "always healthy" scores 99%). Precision (PPV) = of the positives you flagged, how many were real. Recall (sensitivity/TPR) = of the real positives, how many you caught. They trade off: pushing recall up usually drags precision down. Specificity (TNR) and NPV are the mirror images for negatives.
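A minimal Python sketch of these definitions, to make the imbalance trap concrete. The function name `confusion_metrics` and the 1000-patient example are illustrative assumptions, not part of the lecture; undefined ratios are returned as `None`.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return the basic rates from the four counts; None where the denominator is zero."""
    def ratio(num, den):
        return num / den if den else None

    total = tp + fn + fp + tn
    return {
        "accuracy": ratio(tp + tn, total),
        "precision": ratio(tp, tp + fp),    # PPV: flagged positives that were real
        "recall": ratio(tp, tp + fn),       # sensitivity / TPR: real positives caught
        "specificity": ratio(tn, tn + fp),  # TNR: real negatives cleared
        "npv": ratio(tn, tn + fn),          # cleared cases that were truly negative
    }

# Hypothetical "always healthy" rule on 1000 patients at 1% prevalence:
# 10 sick (all missed), 990 healthy (all cleared).
always_healthy = confusion_metrics(tp=0, fn=10, fp=0, tn=990)
print(always_healthy)
```

Accuracy comes out at 0.99 while recall is 0 and precision is undefined, which is exactly the situation the paragraph above warns about.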

🔗 which one do you want?

It depends on the cost of each error. Medical screening → high recall (missing a sick patient is catastrophic). Spam filtering → high precision (deleting a real email is worse than letting spam through). F1 (harmonic mean of precision & recall) and balanced accuracy (mean of sensitivity & specificity) summarise the two — but they still ignore utility, organizational cost, and human use.
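The two summary scores can be sketched in a few lines of Python. The screening counts below are made up for illustration, and returning 0.0 for F1 when there are no true positives is a convention chosen for this sketch, not something the lecture specifies.

```python
def f1_score(tp, fn, fp):
    """Harmonic mean of precision and recall; 0.0 by convention when tp == 0."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def balanced_accuracy(tp, fn, fp, tn):
    """Mean of sensitivity and specificity (assumes both classes are present)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Hypothetical screening result: 8 of 10 sick patients caught, 90 false alarms.
screen = dict(tp=8, fn=2, fp=90, tn=900)
f1 = f1_score(screen["tp"], screen["fn"], screen["fp"])
bal = balanced_accuracy(**screen)
print(round(f1, 3), round(bal, 3))
```

Here recall is high (0.8) but precision is low (8/98), so F1 is small while balanced accuracy looks healthy: the two summaries answer different questions, and neither says anything about cost.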

From scores to decisions: thresholds, ROC, AUC

Many DSS output a probability p̂ = P(Y=1|X), turned into a yes/no by a threshold τ: predict positive if p̂ ≥ τ. Lowering τ raises recall (or leaves it unchanged); raising τ usually raises precision. So threshold choice is itself a decision-theoretic problem. Two classic curves visualise this:

ROC curve — TPR vs FPR across all thresholds. Up-left = ideal; the diagonal = random. AUC = P(a positive scores above a negative): threshold-free ranking quality.

Reliability diagram — observed frequency vs predicted probability. On the diagonal = well-calibrated; below = overconfident, above = underconfident.
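The threshold trade-off and the ranking reading of AUC can both be checked on a toy example. This is a sketch on eight made-up scores; `precision_recall_at` and `auc_pairwise` are hypothetical helper names, and the pairwise loop is the plain O(n²) definition rather than an efficient implementation.

```python
def precision_recall_at(scores, labels, tau):
    """Predict positive when score >= tau; return (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

def auc_pairwise(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data only.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for tau in (0.25, 0.50, 0.75):
    print(tau, precision_recall_at(scores, labels, tau))

auc = auc_pairwise(scores, labels)
auc_squared = auc_pairwise([s * s for s in scores], labels)  # same ranking, different numbers
print(auc, auc_squared)
```

On this data, raising τ from 0.25 to 0.75 moves recall from 1.0 down to 0.5 and precision from 2/3 up to 1.0. Squaring the scores leaves AUC unchanged, because AUC only sees the ranking, not the probability values themselves.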

✓ Calibration: often more important than accuracy

A DSS is calibrated if P(Y=1 | p̂=p) = p — among cases it calls "80% likely", about 80% truly are. Why it's special for a DSS: a human reads that number and decides how much to trust it. Poor calibration breeds overconfidence, undertrust, and inappropriate reliance. AUC and accuracy completely ignore it.
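One way to check calibration in practice is to bin the predictions and compare the mean predicted probability with the observed frequency in each bin, which is what a reliability diagram plots. The sketch below also computes a count-weighted average gap, commonly called expected calibration error; that summary, the equal-width bins, and the function names are assumptions of this example, not something the lecture prescribes.

```python
def reliability_bins(probs, labels, n_bins=5):
    """Equal-width bins; return (mean predicted, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    rows = []
    for cell in bins:
        if cell:
            mean_p = sum(p for p, _ in cell) / len(cell)
            rate = sum(y for _, y in cell) / len(cell)
            rows.append((mean_p, rate, len(cell)))
    return rows

def expected_calibration_error(probs, labels, n_bins=5):
    """Count-weighted average of |mean predicted - observed rate| over the bins."""
    n = len(probs)
    return sum(count / n * abs(mean_p - rate)
               for mean_p, rate, count in reliability_bins(probs, labels, n_bins))

# Illustrative cases: "80% likely" and 8 of 10 are positive (calibrated),
# versus "90% likely" and only 6 of 10 are positive (overconfident).
ece_calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
ece_overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
print(round(ece_calibrated, 3), round(ece_overconfident, 3))
```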

Costs, utility & net benefit

Errors aren't equal — a missed cancer ≠ a spam false alarm. Cost-sensitive evaluation weights each cell with a cost matrix; utility-aware evaluation (decision theory, Lecture 1!) goes further and scores actions by expected utility:

a* = argmax_a 𝔼[ U(a, s) ]

Two models with equal accuracy can deliver very different utility → best predictor ≠ best DSS.
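A small Python sketch of the expected-utility rule and of "equal accuracy, different utility". Every number in `UTILITY` is a made-up illustration, as are the two confusion matrices and the action names `treat` / `wait`.

```python
# Hypothetical utilities U(action, state); the values are illustrative only.
UTILITY = {
    ("treat", "sick"): 10.0,
    ("treat", "healthy"): -5.0,
    ("wait", "sick"): -100.0,
    ("wait", "healthy"): 0.0,
}

def expected_utility(action, p_sick):
    """E[U(a, s)] when the patient is sick with probability p_sick."""
    return p_sick * UTILITY[(action, "sick")] + (1 - p_sick) * UTILITY[(action, "healthy")]

def best_action(p_sick):
    """a* = argmax_a E[U(a, s)] over the two available actions."""
    return max(("treat", "wait"), key=lambda a: expected_utility(a, p_sick))

def mean_utility(tp, fn, fp, tn):
    """Average realised utility per case for a model's confusion matrix."""
    total = tp + fn + fp + tn
    return (tp * UTILITY[("treat", "sick")] + fn * UTILITY[("wait", "sick")]
            + fp * UTILITY[("treat", "healthy")] + tn * UTILITY[("wait", "healthy")]) / total

model_a = dict(tp=5, fn=5, fp=5, tn=85)  # accuracy 0.90, misses half the sick
model_b = dict(tp=9, fn=1, fp=9, tn=81)  # accuracy 0.90, more false alarms
print(mean_utility(**model_a), mean_utility(**model_b))
print(best_action(0.10), best_action(0.01))
```

Both hypothetical models score 90% accuracy, yet model B loses far less utility per case because it trades cheap false alarms for expensive misses.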

Decision Curve Analysis makes this operational with net benefit, which counts true positives as good and penalises false positives by the threshold-implied cost:

NB(τ) = TP/N − (FP/N)·( τ / (1−τ) )

Directly decision-connected, threshold-aware, utility-sensitive: a DSS is useful where its curve beats "treat all" / "treat none".
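The net-benefit formula and its two reference strategies fit in a few lines. The counts below (1000 patients, 10% prevalence, a model with 80 TP and 150 FP at τ = 0.2) are invented for illustration.

```python
def net_benefit(tp, fp, n, tau):
    """NB(tau) = TP/N - (FP/N) * tau / (1 - tau), for 0 <= tau < 1."""
    return tp / n - (fp / n) * tau / (1 - tau)

def net_benefit_treat_all(prevalence, tau):
    """Treating everyone: every positive is a TP and every negative is an FP."""
    return prevalence - (1 - prevalence) * tau / (1 - tau)

NET_BENEFIT_TREAT_NONE = 0.0  # no true positives and no false positives

tau = 0.2
nb_model = net_benefit(tp=80, fp=150, n=1000, tau=tau)
nb_all = net_benefit_treat_all(prevalence=0.10, tau=tau)
print(nb_model, nb_all, NET_BENEFIT_TREAT_NONE)
```

At this threshold the hypothetical model beats both "treat all" and "treat none", so it adds value there; repeating the calculation over a range of τ gives the decision curve.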

◇ Check yourself

A disease has 1% prevalence and a DSS predicts "healthy" for everyone. What's true?

Accuracy is 99%, yet the system never finds a sick patient: under heavy class imbalance, accuracy rewards ignoring the rare class. Recall = 0 (no true positives found), and precision is undefined (no positive predictions). Use recall / balanced accuracy / net benefit instead.

3 · the failure modes accuracy hides

Beyond accuracy

A DSS can ace every metric above and still fail: accurate but unfair, accurate only on yesterday's data, accurate but unusable, accurate but harmful.

Distribution shift

★ Trained on one world, deployed in another

A DSS learns under P_train(X,Y) but runs under P_deploy(X,Y). When they differ, performance silently degrades — a medical DSS moved to a new hospital, a financial DSS during a crisis, a recommender after tastes change.

Type of shift	What changes
Covariate shift	P(X) changes, P(Y|X) stable
Label shift	P(Y) changes, P(X|Y) stable
Concept drift	P(Y|X) changes (the most dangerous: the learned relationships become invalid)

Robustness tests stability under perturbations (noise, missing data, sensor failures): a robust system keeps Δ = |L(f,x) − L(f, x+δ)| small. And a wise DSS does OOD detection — recognising inputs unlike its training data and deferring to a human instead of guessing.
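A deliberately simple sketch of "detect OOD, then defer". It assumes a single numeric feature and flags anything more than three training standard deviations from the training mean; that z-score rule, the threshold `z_max`, and the toy `risk_model` are all assumptions of this example, and real OOD detectors are usually more elaborate.

```python
from statistics import mean, stdev

def fit_range(train_values):
    """Summarise the training distribution of one feature as (mean, std)."""
    return mean(train_values), stdev(train_values)

def decide_or_defer(x, mu, sigma, model, z_max=3.0):
    """Return the model's output, or 'defer' when x is far from the training data."""
    if abs(x - mu) / sigma > z_max:
        return "defer"  # hand the case to a human instead of guessing
    return model(x)

# Hypothetical training ages and a toy rule standing in for the real model.
train_ages = [30, 35, 40, 45, 50, 55, 60]
mu, sigma = fit_range(train_ages)

def risk_model(age):
    return "high risk" if age >= 50 else "low risk"

for age in (38, 52, 95):
    print(age, decide_or_defer(age, mu, sigma, risk_model))
```

The 95-year-old lies well outside anything seen in training, so the sketch defers rather than extrapolating.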

Fairness

! Global metrics hide local failures

"95% accuracy" can mean 99% for one group and 70% for another. A DSS can amplify social inequalities, so evaluation must be stratified by protected attributes A (gender, ethnicity, age, …) — subgroup performance, subgroup calibration, subgroup robustness.
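Stratifying a metric is mechanically simple. The sketch below uses synthetic data sized so that the overall figure is exactly 95% while the two groups sit at 99% and 70%; the group labels and sizes are invented to reproduce the numbers in the text.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each value of the protected attribute."""
    hits, counts = defaultdict(int), defaultdict(int)
    for y, y_hat, g in zip(y_true, y_pred, groups):
        counts[g] += 1
        hits[g] += int(y == y_hat)
    return {g: hits[g] / counts[g] for g in counts}

# Synthetic data: group "a" has 2500 cases (2475 correct), group "b" has 400 (280 correct).
y_true = [1] * 2900
y_pred = [1] * 2475 + [0] * 25 + [1] * 280 + [0] * 120
groups = ["a"] * 2500 + ["b"] * 400

overall = sum(int(y == y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)
by_group = accuracy_by_group(y_true, y_pred, groups)
print(overall, by_group)
```

The same pattern applies to any other metric: compute it per subgroup before trusting the global number.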

Demographic parity

The positive-prediction rate is the same across groups: P(Ŷ=1|A=a) = P(Ŷ=1|A=b).

Equalized odds

Fixing the true outcome, the prediction doesn't depend on the group: P(Ŷ|Y,A=a) = P(Ŷ|Y,A=b).

Calibration fairness

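Calibration holds equally in every group (multicalibration extends this idea to many overlapping subgroups). Key catch: when base rates differ across groups and predictions are imperfect, these three criteria generally cannot all hold at once, so you must choose.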
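The first two criteria can be checked from per-group rates, and a toy example shows how they pull apart. The data below are synthetic: both groups get the same TPR and FPR (equalized odds holds), but their base rates differ, so the positive-prediction rates differ (demographic parity fails). The helper names are hypothetical.

```python
def rate(values):
    """Mean of a list of 0/1 values; None for an empty list."""
    return sum(values) / len(values) if values else None

def group_rates(y_true, y_pred, groups, g):
    """Positive-prediction rate, TPR and FPR for one group."""
    rows = [(y, y_hat) for y, y_hat, a in zip(y_true, y_pred, groups) if a == g]
    return {
        "positive_rate": rate([y_hat for _, y_hat in rows]),         # P(Yhat=1 | A=g)
        "tpr": rate([y_hat for y, y_hat in rows if y == 1]),         # P(Yhat=1 | Y=1, A=g)
        "fpr": rate([y_hat for y, y_hat in rows if y == 0]),         # P(Yhat=1 | Y=0, A=g)
    }

# Group a: 5 positives, 5 negatives. Group b: 5 positives, 15 negatives.
y_true = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 15
y_pred = [1, 1, 1, 1, 0] + [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0] + [1, 1, 1] + [0] * 12
groups = ["a"] * 10 + ["b"] * 20

rates_a = group_rates(y_true, y_pred, groups, "a")
rates_b = group_rates(y_true, y_pred, groups, "b")
print(rates_a)
print(rates_b)
```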

Explainability — and explaining well

Complex models (ensembles, deep nets, LLMs) need explanations for trust, accountability, debugging, justification. But having explanations ≠ having useful ones — explainability itself must be evaluated, along several axes: fidelity (does it reflect the real model?), comprehensibility, completeness, actionability, cognitive compatibility. There's a real tension: more comprehensible often means less faithful.

💡 the trap to remember

Perceived understanding ≠ actual understanding. An explanation can make a user feel more confident without making them any more correct. That's why explainability usually needs human studies, not just code metrics — and ties straight back to trust calibration.

And none of this is context-free: the same DSS performs differently by workflow, expertise, time-pressure, law, and culture — Performance = f(System, User, Context).

4 · the part that decides everything

Human–DSS interaction

The final outcome is Human + DSS + Interaction. So the real question isn't "how good is the model?" but "how good is the team?"

! Collaboration can make things worse

Does the human+DSS team beat either alone? Four outcomes are possible — including Human + DSS < both. Pairing a person with a tool can degrade performance through cognitive overload, misplaced trust, and workflow disruption. So team performance must be measured, not assumed.
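Measuring the team means scoring three conditions on the same cases: human alone, DSS alone, and human with DSS. A minimal sketch with ten invented decisions, arranged so the team ends up below both of its members:

```python
def accuracy(decisions, truth):
    """Share of decisions that match the ground truth."""
    return sum(d == t for d, t in zip(decisions, truth)) / len(truth)

# Invented decisions on the same ten cases under three conditions.
truth      = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
human_only = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
dss_only   = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]
team       = [1, 0, 0, 0, 0, 1, 1, 0, 1, 1]

acc_human = accuracy(human_only, truth)
acc_dss = accuracy(dss_only, truth)
acc_team = accuracy(team, truth)

team_beats_both = acc_team > max(acc_human, acc_dss)
team_worse_than_both = acc_team < min(acc_human, acc_dss)
print(acc_human, acc_dss, acc_team, team_beats_both, team_worse_than_both)
```

In a real study the three conditions would come from a controlled design with many participants; the point here is only that the comparison has to be computed, not assumed.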

Interaction is evaluated along six dimensions:

Dimension	Asks…
Effectiveness	Does decision quality improve? (ΔU = U_with − U_without)
Efficiency	Does workload / time drop?
Trust	Is it trusted appropriately?
Reliance	Are users over/under-relying?
Usability	Is interaction cognitively manageable?
Adoption	Is the DSS actually used?

🔗 efficiency ↔ bounded rationality (Lecture 5)

Efficiency = decision quality ÷ resource consumption, and "resources" include human cognitive workload (measured with NASA-TLX, eye-tracking, think-aloud…). The catch: an information-rich DSS can lower efficiency — overload pushes the user back onto crude heuristics, exactly the biased shortcuts from Lecture 5.

Trust vs reliance — and why it's tricky to measure

★ Two different things

Trust is a psychological attitude toward the DSS; reliance is the behaviour of actually depending on its advice. Attitude drives behaviour. The goal is appropriate reliance: trust ≈ the DSS's real competence — avoiding both under-reliance (ignoring good advice) and over-reliance (accepting advice without scrutiny).

Since trust is internal, we measure it by proxies: direct (questionnaires — honest target, but noisy) or indirect (acceptance/override rates, behaviour after the DSS makes an error, behaviour under uncertainty).

! The subtle experimental-design point

Just watching whether a user agrees or disagrees with the AI is not enough: reliance ≠ agreement/disagreement. If a user already had the right answer, "agreeing" tells you nothing about reliance. Proper studies are human-first: capture the person's baseline reasoning, then show the AI and measure how their thinking changes.
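A human-first study records, for each case, the person's initial answer, the AI's advice, and the final answer. The sketch below shows why raw agreement overstates reliance. The field names and the "switch rate" summary (share of initial disagreements where the user moved to the AI's answer) are assumptions of this example, not a measure defined in the lecture.

```python
def reliance_summary(trials):
    """trials: list of dicts with keys initial, advice, final, truth."""
    disagreements = [t for t in trials if t["initial"] != t["advice"]]
    switched = [t for t in disagreements if t["final"] == t["advice"]]
    return {
        # What you see if you only log the final answer against the AI's advice.
        "raw_agreement": sum(t["final"] == t["advice"] for t in trials) / len(trials),
        # Reliance can only show up where the user initially disagreed with the AI.
        "switch_rate": len(switched) / len(disagreements) if disagreements else None,
        # Switches to advice that was wrong: a sign of over-reliance.
        "switched_to_wrong_advice": sum(t["advice"] != t["truth"] for t in switched),
    }

# Invented trials: 7 cases where the user already agreed with the AI,
# 1 where they switched to wrong advice, 2 where they kept their own correct answer.
trials = (
    [dict(initial=1, advice=1, final=1, truth=1)] * 7
    + [dict(initial=0, advice=1, final=1, truth=0)]
    + [dict(initial=0, advice=1, final=0, truth=0)] * 2
)
summary = reliance_summary(trials)
print(summary)
```

Raw agreement is 80%, but the user changed their mind in only one of three disagreements, and that one switch followed bad advice.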

Usability, adoption & how you actually run the study

Usability (learnability, cognitive simplicity, consistency, accessibility) is measured with usability testing, cognitive walkthroughs, and scores like SUS. Adoption asks whether it's truly woven into the workflow — frequency of use, retention, voluntary engagement. And effectiveness itself is established with real study designs:

RCT

Randomized controlled trial: compare users with vs without the DSS under controlled conditions. Strong causal claim, low ecological validity.

A/B testing

Compare alternative versions/interfaces in a real deployment.

Simulated tasks

Replayed/synthetic scenarios where ground truth is known — cheap and controllable.

Longitudinal

Track performance over extended real-world use — catches adoption, drift, and habit effects.

The recurring tension across all of them: laboratory control ≠ real-world ecological validity.

✓ The synthesis: seven quality dimensions

No single score is enough. A useful DSS is evaluated across accuracy, calibration, robustness, fairness, interpretability, usability, and utility — and different applications weight them differently. Good DSS = Good Predictor + everything accuracy forgets.

◇ Check yourself

Why is "the user agreed with the AI 80% of the time" a poor measure of reliance?

Reliance ≠ agreement/disagreement. You need human-first designs that capture the user's baseline reasoning and measure how it changes after seeing the AI — otherwise prior correctness masquerades as reliance.

★ study like the exam

The Exam Lab

This lecture mixes real formulas (the metrics) with conceptual breadth. Know the metrics cold, and be ready to discuss why they aren't enough.

📋 How to answer

① define / give the metric · ② show its limit · ③ name what to use instead (and why). The thread the professor wants: evaluation is multi-dimensional — accuracy is the start, never the end. For the confusion-matrix metrics it's fine to give the formula, but always pair it with the intuition and a use-case.

★ The framing question

Why is high predictive accuracy not enough to call a DSS good? Discuss.

① CLAIM · Prediction ≠ decision support. A DSS is a socio-technical system (Human + DSS + Interaction); its value is the quality of the decisions it helps produce, not its prediction error.

② WHY · A highly accurate DSS can still fail: recommendations ignored, explanations incomprehensible, users over-trusting errors, outputs that don't fit workflows, cognitive overload; plus it may be unfair, drift off-distribution, or be economically harmful.

③ INSTEAD · Evaluate multi-dimensionally: accuracy, calibration, robustness, fairness, interpretability, usability, utility (technical + human + organizational). Best predictor ≠ best DSS.

Question · confusion-matrix metrics

Define accuracy, precision, recall, specificity. Why is accuracy misleading under class imbalance, and when do you prefer high recall vs high precision? Discuss.

① METRICS · Accuracy = (TP+TN)/total. Precision (PPV) = TP/(TP+FP): reliability of positive flags. Recall (sensitivity/TPR) = TP/(TP+FN): share of real positives caught. Specificity (TNR) = TN/(TN+FP): share of real negatives caught.

② IMBALANCE · If positives are rare (1% prevalence), always predicting "negative" gives 99% accuracy while catching zero positives (recall 0). So accuracy rewards ignoring the important class; use recall, balanced accuracy, or net benefit.

③ CONTEXT · Recall ↑ usually means precision ↓. Medical screening → high recall (don't miss a case). Spam filtering → high precision (don't delete real mail). The choice is set by the cost of each error.

Question · thresholds, ROC/AUC, calibration

What does the threshold control? What does AUC measure, and why does calibration matter especially for a DSS with human users? Discuss.

① THRESHOLD · A DSS outputs a probability p̂; the threshold τ turns it into a yes/no (positive if p̂ ≥ τ). Low τ → high recall; high τ → high precision. Choosing τ is a decision-theoretic problem.

② ROC/AUC · The ROC curve plots TPR vs FPR across all thresholds; AUC = probability a positive scores above a negative: threshold-free ranking quality, insensitive to imbalance. But it ignores utility, calibration, and human use.

③ CALIBRATION · Calibrated means P(Y=1 | p̂=p) = p. A human reads that probability and decides how much to trust it, so miscalibration → overconfidence / undertrust / inappropriate reliance. In a DSS, calibration often matters more than raw accuracy.

Question · cost / utility / decision curves

Why evaluate a DSS by cost / utility / net benefit rather than accuracy? Discuss.

① COSTS · Errors have unequal consequences (a missed cancer vs a spam false alarm). Cost-sensitive evaluation weights confusion-matrix cells by a cost matrix; expected cost = Σ P(i,j)·C(i,j).

② UTILITY · Decision theory refines this: score actions by expected utility, a* = argmax E[U(a,s)]. Two equally accurate models can yield very different utility → best predictor ≠ best DSS.

③ DCA · Decision Curve Analysis plots net benefit = TP/N − (FP/N)·τ/(1−τ) across thresholds; it's decision-connected, threshold-aware, utility-sensitive. A DSS is worth using where its curve beats "treat all" and "treat none".

Question · beyond accuracy

Explain distribution shift (its types), robustness/OOD, and fairness in DSS evaluation. Discuss.

① SHIFT · Trained on P_train, deployed on P_deploy; if they differ, performance drops. Covariate (P(X) changes), label (P(Y) changes), concept drift (P(Y|X) changes: the worst case, because the learned relationships break).

② ROBUST/OOD · Robustness = small performance change under perturbations (noise, missing data). OOD detection = recognising inputs outside the training distribution and deferring to a human.

③ FAIRNESS · Global metrics hide subgroup failures (95% overall can mean 99% for one group and 70% for another). Stratify by protected attributes; criteria include demographic parity, equalized odds, calibration fairness, which are generally incompatible simultaneously.

Question · human–DSS interaction

How do you evaluate human–DSS interaction? Cover team performance, trust vs reliance, and why reliance ≠ agreement. Discuss.

① TEAM · Ask whether the human+DSS team beats either alone, and accept it can be worse (overload, misplaced trust, workflow disruption). Dimensions: effectiveness, efficiency, trust, reliance, usability, adoption.

② TRUST/RELIANCE · Trust = attitude; reliance = behaviour. Goal = appropriate reliance (trust ≈ real competence), avoiding under-reliance (ignore good advice) and over-reliance (accept blindly). Trust is measured by proxies (questionnaires, override rates, behaviour after errors).

③ DESIGN · Reliance ≠ agreement/disagreement: a user who was already right "agrees" without relying. Use human-first designs that capture baseline reasoning, then measure how it changes after seeing the AI. Run RCTs / A-B / simulated / longitudinal studies (lab control ≠ ecological validity).

🗣 Say these out loud (cover the page)

• Why is "good predictor" not "good DSS"? Outcome vs process.
• Accuracy, precision, recall, specificity — formula + one-line meaning each.
• The class-imbalance trap (99% accuracy, 0% recall).
• Threshold → ROC/AUC → calibration: what each adds, what each ignores.
• Net benefit / utility: why best predictor ≠ best DSS.
• Three types of distribution shift; OOD = defer to human.
• Three fairness criteria — and that they can't all hold at once.
• Trust vs reliance; why reliance ≠ agreement; four study designs.
