FOUNDATIONS · AIL-FP-2026-03

Reading the Learner

Why an AI Coach That Cannot See the Person in Front of It Cannot Coach — and what modern systems get right, get wrong, and cannot yet do

Gregg Collins & Brandon Dickens · Apr 2026 · 21 min read

My experience is what I agree to attend to.

— William James, The Principles of Psychology

You can observe a lot by just watching.

— Yogi Berra

§1. What Expert Coaches Have Always Done

A coach sits with a learner working through a termination conversation. The learner has handled four scenarios with the simulated employee cleanly—the right phrasing, the right pauses, the right concessions. In the fifth, the simulated employee says, “I have two kids in college.” The learner’s next line comes out half a second later than any of the four before it. The sentence is shorter. His voice flattens. The coach pauses the simulation. “That one landed somewhere the others didn’t.” The learner exhales and, after a moment, starts talking about the year his father was laid off. The coach has read the learner.

Coaches have always read the room. Before the coach chose what to say, he ran a quiet read on the person in front of him: response latency, voice quality, sentence length, the trajectory from confident to flat across five exchanges. None of that reading was written down. None of it appeared on a checklist. It was the kind of perception that, by the time the coach could explain it, would already be too slow to use. Klein (1998) described what this kind of reading actually is. Experts operating under time pressure do not analyze, compare options, and select. They pattern‑match the present moment against a library of past situations and execute the response associated with the closest match. Firefighters under a collapsing roof, ICU nurses in the middle of a code, coaches with a learner who just went flat: all are making recognition‑primed decisions. The analysis is already compiled into the recognition. The coach does not know in words what he saw. He knows what to do next.

The signals are not new. Aviation came to the same conclusion the hard way. Two decades of cockpit accident review produced the finding that the information most often implicated in a crash was not on the instrument panel. It was in the first officer’s tone, the captain’s half‑answer, the sentence phrased as a question instead of an assertion. Crew Resource Management was built around the conclusion that situational awareness—the continuous read of what is happening in the room—is a distinct, trainable skill, and that training it reduced errors more than any other intervention the industry could find (Helmreich, Merritt, & Wilhelm, 1999). The panel data was fine. The crews were qualified. What the industry had been missing was not competence in the instruments. It was the read of the people around them.

Expert practice runs this way across domains. Clinicians read patients. Teachers read rooms. Coaches read learners. The read is the work. It happens in under a second, it shapes every decision that follows, and until recently nothing in coaching AI had tried to do it.

Notice what the coach did not do. He did not consult a rubric for termination‑conversation fidelity. He did not compare the learner’s wording against a model script. He did not score the utterance on a five‑point scale of empathy. All of those are forms of evaluation, and evaluation is what most coaching systems, human and machine, spend their attention on. The coach in the vignette did something earlier and more fundamental. He read the learner and noticed a change. The evaluation of what the learner did was, at that moment, the wrong question. The right question was what the learner could next absorb—and the only way to answer that question was to read him first.

The question the rest of this paper answers: can a machine do this, and if so, what does it have to see?

§2. What “Reading the Learner” Means in This Paper

The learner is the only subject. When this paper talks about reading, it means reading ten concrete things about the person in front of the system. It means the learner’s affect—whether she is engaged, confused, frustrated, bored, calm, or quietly proud of what she just said. It means her cognitive load at this moment: whether the current problem is inside, at, or beyond the edge of what her working memory can hold. It means her confidence—whether she believes her answer or is hedging it. It means her trajectory across the last three exchanges—whether she is moving toward understanding or drifting away from it. It means her attention—whether she is still with the task or has gone somewhere else. It means whether the confusion in her words is productive confusion that precedes insight, or the stuck kind that hardens into frustration. It means engagement read across multiple channels, not time on task alone. It means mode receptivity—whether this learner, right now, can absorb a challenge, can absorb support, or can absorb neither and needs the conversation to pause. It means metacognitive calibration—when she says “I understand,” how far is she from actually understanding. And it means the struggle pattern of the current attempt—whether this is the first try, the third, or the seventh, and what the curve of those attempts looks like.

What it does not mean. This paper does not argue for reading enterprise data. It does not ask the coaching system to ingest CRM records, post‑call notes, performance reviews, ticket volumes, or quota attainment. Those signals matter, and they deserve their own paper; they are not this paper’s subject. We are not writing about what an organization knows about a worker. We are writing about what a coaching system can perceive about a learner during a coaching interaction.

The question is not new. Human–computer interaction has been asking some version of it for thirty years. Picard (1997) reframed the field by arguing that any machine meant to interact productively with a person has to sense and reason about that person’s state, not just process the tokens on the screen. Abowd, Dey, and their collaborators (1999) formalized the broader question as context‑aware computing—the situation of a person relative to an interactive system is itself a first‑class signal, not a bag of features. Dey (2001) proposed an operational definition of context that has been cited ever since: any information that can be used to characterize the situation of an entity relevant to the interaction. Reading the learner is a specialization of that twenty‑five‑year‑old question. What is specific to coaching—and the argument this paper makes—is that in coaching, the answer to that question is the architectural center of the system, not a feature added on top of something else.

§3. Why Human Coaches Cannot Do This at Scale

Human bandwidth is finite. A coach cannot run ten parallel reads at the same time she is conducting dialogue, holding the thread of a session, and remembering what happened three weeks ago with this same learner. The constraint is not effort. It is what a single cognitive system can do at once. Simons and Chabris (1999) gave the field its sharpest demonstration of the limit: observers asked to count basketball passes routinely fail to register a person in a gorilla suit walking through the middle of the scene. The signal was unmissable in retrospect. At the moment it arrived, the observers’ attention was occupied. Expert coaches are not exempt from this. They are excellent at seeing what they have been trained to see, and they miss almost everything else. That is not a failing. That is what it is to be a human operating under bandwidth. Attention is scarce. It cannot be split across ten channels without loss.

The coach’s own cognitive load is a signal too, and it is the signal she is most reluctant to honor. Hart and Staveland (1988) developed the NASA Task Load Index nearly four decades ago by decomposing the felt experience of mental workload into six subscales (mental demand, physical demand, temporal demand, performance, effort, and frustration) and showing that the construct was stable across fields from cockpit operations to surgical practice. The instrument has been cited tens of thousands of times because the thing it measures is real. A coach conducting a hard session is running near the top of her own load budget. Reading more channels raises the load. Past a point, the additional read degrades the reads she was already doing. She does not hold ten channels simultaneously. She holds the two or three she has built the deepest reflex for and lets the others go.

This is why one‑to‑one coaching has always been a rationed practice. It worked because the coach’s full attention was bought. It went to executives, elite athletes, doctoral students with dedicated mentors, and the tutees Bloom (1984) had in mind when he reported two standard deviations of advantage for one‑to‑one mastery instruction over the ordinary classroom. The finding was not controversial. The finding was unreachable, because the sensing that made the finding work required a pair of human eyes, a pair of human ears, and a working‑memory buffer that did not come at a discount.

The constraint was never the theory of coaching. Expert practice has known what good sensing looks like for as long as anyone has taught anyone. The constraint was what a single human could do in real time. That limit decided how many learners got the practice that worked. Change the limit and you change the population that can be coached.

§4. Until Now

Until now, that meant rationing. The sensing that made expert coaching effective was available to the learners whose coaches could afford to do it, and to nobody else. Three things changed that in a decade. Real‑time probabilistic inference over multiple noisy channels stopped being a research curiosity and became an engineering discipline—Conati and Maclaren (2009) showed that dynamic Bayesian networks could fuse scenario‑side variables with learner‑side signals and produce real‑time affect estimates that held up empirically. The compute to run that inference at conversational latency stopped being expensive. And the empirical case that scalable, research‑based tutoring actually works stopped being a wish list: Kestin, Miller, Klales, Milbourne, and Ponti (2025) ran a randomized trial in a Harvard physics course and found that a carefully engineered AI tutor produced learning gains of 0.73 to 1.3 standard deviations, in less time than expert‑led instruction, with higher reported engagement.

The question of whether a machine could read a learner was an economic question, not a scientific one. It was whether we could afford to build and run the reading. We can. What remains is the question of what, exactly, the system has to read.

§5. The Three Timescales

The hardest thing about reading a learner is that the reading has to happen at three different speeds at once. A mode shift has to be decided inside the current exchange. A pattern in this learner’s week has to be noticed across the last several sessions. A stable trait—learning velocity, metacognitive accuracy, autonomy preference, resilience baseline—has to accumulate over weeks and months, and should not revise on an eight‑second pause. One reader cannot do all three at the same fidelity. One architecture can, if it stops pretending the three are the same thing.

The three‑layer learner model (Table 1) was introduced in our companion paper on the Modal AI Coach and is the architectural center of this one. The claim is that the system reads the learner at three timescales simultaneously, with different update rates, different signal fidelity, and different standards of evidence for changing its mind.

Table 1. Three-Layer Learner Model

Layer	Timescale	What It Tracks	Example
Momentary State	Seconds to minutes	Cognitive load, affect, trajectory, mode receptivity	Response latency jumped from 2s to 15s
Session Patterns	Hours to weeks	Mode response patterns, optimal challenge calibration, recovery patterns	Responds well to challenge—except immediately after failure
Stable Traits	Weeks to months	Learning velocity, metacognitive accuracy, autonomy preference, resilience baseline	High autonomy preference; tends toward overconfidence

Momentary state is fast. It tracks cognitive load, affect, response latency, confidence markers, and the direction of the current trajectory. It updates on every exchange. The system maintains probability distributions, not point estimates—it is rarely certain the learner is frustrated; it is sometimes seventy percent confident. A pause from two seconds to fifteen is a cue here, not a conclusion. The momentary layer answers one question on every turn: what is happening right now, and does the current mode still fit.
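
To make "probability distributions, not point estimates" concrete, here is a minimal sketch of a single‑turn belief update over a small set of discrete learner states. It is an illustration under stated assumptions, not the system's implementation: the three state names, the two cue names, and every likelihood value in `CUE_LIKELIHOOD` are hypothetical, and a deployed model would estimate them from data.

```python
from typing import Dict

# Hypothetical likelihoods P(cue | state). Illustrative values only,
# not taken from any study cited in this paper.
CUE_LIKELIHOOD: Dict[str, Dict[str, float]] = {
    "long_pause": {"engaged": 0.15, "confused": 0.45, "frustrated": 0.40},
    "hedged_opening": {"engaged": 0.20, "confused": 0.50, "frustrated": 0.30},
}


def update_belief(belief: Dict[str, float], cue: str) -> Dict[str, float]:
    """Bayes' rule over discrete states: posterior is prior times likelihood, renormalized."""
    unnormalized = {
        state: p * CUE_LIKELIHOOD[cue][state] for state, p in belief.items()
    }
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


# Belief before the turn (assumed), then two cues observed in sequence.
prior = {"engaged": 0.6, "confused": 0.3, "frustrated": 0.1}
after_pause = update_belief(prior, "long_pause")
after_hedge = update_belief(after_pause, "hedged_opening")

print({state: round(p, 2) for state, p in after_pause.items()})
print({state: round(p, 2) for state, p in after_hedge.items()})
```

With these assumed numbers, one long pause moves the estimate for confusion from 0.30 to roughly 0.51: a cue, not a conclusion. The second cue moves it further, which is the behavior the momentary layer needs from an update rule.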

Session patterns are the bridge. They aggregate across the last several sessions with this learner: whether challenge lands when she is fresh, whether it lands when she has just failed, whether support moves her forward or reinforces dependency. Corbett and Anderson (1994) formalized the move toward this kind of inference in their Bayesian Knowledge Tracing work, which represented each skill as a probability distribution over mastery states, updated response by response. Session patterns generalize that move from what the learner knows to how the learner takes in feedback. The answers at this layer are conditional—she responds to challenge except immediately after failure, she responds to explanation except when she is tired, she recovers from public mistakes faster than from private ones—and it is the conditionals that make personalization real.
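
For readers who have not seen knowledge tracing written out, the sketch below shows the update as it is usually presented: a Bayes step on the observed response, followed by a learning transition. The parameter values are illustrative assumptions, and the names `BKTParams` and `bkt_update` are ours, not from Corbett and Anderson.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    p_init: float   # probability the skill is mastered before any practice
    p_learn: float  # probability of moving to mastery after a practice step
    p_slip: float   # probability of a wrong answer despite mastery
    p_guess: float  # probability of a right answer without mastery


def bkt_update(p_mastery: float, correct: bool, params: BKTParams) -> float:
    """Return the mastery estimate after one observed response."""
    if correct:
        mastered = p_mastery * (1 - params.p_slip)
        unmastered = (1 - p_mastery) * params.p_guess
    else:
        mastered = p_mastery * params.p_slip
        unmastered = (1 - p_mastery) * (1 - params.p_guess)
    posterior = mastered / (mastered + unmastered)
    # The learner may also have learned the skill during this step.
    return posterior + (1 - posterior) * params.p_learn


# Illustrative parameters (assumed, not fitted to any data set).
params = BKTParams(p_init=0.3, p_learn=0.1, p_slip=0.1, p_guess=0.2)

p = params.p_init
for was_correct in [True, True, False, True]:
    p = bkt_update(p, was_correct, params)
    print(round(p, 3))
```

The same structure carries over to the session‑patterns layer if "mastery of a skill" is replaced by a latent such as "challenge lands after failure", though that substitution is our extrapolation rather than something the knowledge‑tracing literature establishes.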

Stable traits are slow. They hold the things that persist across sessions and topics: learning velocity relative to baseline, metacognitive accuracy, autonomy preference, resilience. The layer revises reluctantly. A single bad day does not reset it. But people develop, and the layer is not infinitely stubborn either; it weights evidence by consistency over time. Pardos and Heffernan (2010) showed that learner models work better when they individualize—population‑level priors are a reasonable starting point, but learner‑specific parameters outperform them as data accumulates. Layer 3 is where that replacement lives. It is also what makes layer 1 interpretable. The same ten‑second pause means different things for a learner who typically answers in two seconds and for one who typically takes eight.
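
The point about baselines can be shown with a few lines of arithmetic. The sketch below scores one latency against a learner's own history using a plain z‑score; the function name and the two sample histories are hypothetical, and a z‑score is only one of several reasonable choices.

```python
from statistics import mean, stdev
from typing import Sequence


def latency_z_score(latency_s: float, history_s: Sequence[float]) -> float:
    """How unusual is this latency relative to the learner's own history?"""
    sigma = stdev(history_s)  # needs at least two past observations
    if sigma == 0:
        raise ValueError("history has no variation; z-score is undefined")
    return (latency_s - mean(history_s)) / sigma


# Hypothetical response-time histories, in seconds.
fast_learner = [1.5, 2.0, 2.5, 2.0, 2.0]  # typically answers in about two seconds
slow_learner = [7.0, 8.0, 9.0, 8.5, 7.5]  # typically takes about eight

z_fast = latency_z_score(10.0, fast_learner)
z_slow = latency_z_score(10.0, slow_learner)

print(round(z_fast, 1), round(z_slow, 1))
```

The same ten‑second pause is far outside the first learner's range and only moderately unusual for the second. Response times tend to be right‑skewed, so a production system would more likely work with log latencies or empirical percentiles; the z‑score is used here only because it is short.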

Information flows in both directions. Observations aggregate upward into patterns; patterns aggregate upward into traits. Traits shape downward the interpretation of fresh observations. The layers do not just stack. They converse. A learner who has been marked at Layer 3 as tending toward overconfidence gets a different interpretation attached to a fast, cheerful “I’ve got it” than a learner marked as well‑calibrated. Both sentences look the same on the page. The read is not the same.

To see the three layers operate at once, consider a single exchange. A learner who has been working on a negotiation simulation for forty minutes types, “I think I should offer sixty percent and see what they say.” Layer 1 registers the exchange: response time is three seconds longer than her session baseline, the sentence opens with a hedging phrase, there is no elaboration. On the face of it, cognitive load has risen and confidence has dropped. Layer 2 supplies the interpretation: across four previous sessions, this learner’s confident answers begin with commitments and end with reasons, and her hedged openings have coincided with productive doubt twice and pre‑shutdown frustration twice. The read at Layer 2 is that hedging by itself is ambiguous; the differentiator has been what follows in the next one or two exchanges. Layer 3 supplies the tilt: this learner has tended, over seven weeks, to underestimate her own readiness and respond well to a specific kind of calibrating question. The three layers together produce a read no single layer could have produced alone—pause, probe the doubt rather than answer it, and watch for recovery on the next exchange. None of the three layers asserts that read alone. The three converge on it.

Di Mitri, Schneider, Specht, and Drachsler (2018) formalized the pipeline: signals become features, features become constructs, constructs become feedback. The twist for a coaching system is that the feedback is not a dashboard—it is a mode selection. What is new is that this pipeline can now run at dialogue speed, continuously, on every learner.

§6. What the System Reads

No single signal is load‑bearing. The read is a composite. The ten dimensions named earlier in this paper cluster by function into three groups, each with its own research tradition.

The first group is affect. Picard (1997) made the case, against resistance, that machines meant to interact with people have to sense and reason about emotional state, not just process the tokens on the screen. The claim was unpopular then and is unremarkable now. D’Mello and Graesser (2012) later showed that learner affect during hard tasks is not a random walk. It oscillates on identifiable trajectories: engaged flow gives way to confusion, confusion gives way either to insight or to frustration, frustration gives way to boredom if nothing resolves it. The trajectories are measurable; the transition probabilities are stable. And D’Mello, Lehman, Pekrun, and Graesser (2014) added the sharpest cut: confusion that resolves is associated with deeper learning; confusion left to sour is not. It is not confusion the system has to detect. It is the direction confusion is moving. A learner writing, “wait, why does that work,” while still typing is in a different place from the same learner ninety seconds later, hands off the keyboard, restating the same question in a tone that has gone flat.

The second group is cognitive load. Hart and Staveland (1988) decomposed the felt experience of load into six subscales that have held up across domains for nearly four decades: mental demand, physical demand, temporal demand, performance, effort, and frustration. In a text‑and‑voice interface, the coach cannot read forehead sweat, but she can read response‑time distributions, self‑correction rate, the rate at which sentences are started and abandoned, the moment the learner stops thinking out loud, and the softening language the learner reaches for when she is no longer sure. A two‑second pause after a hard question is loading. A fifteen‑second pause after the same question, with a typed‑and‑deleted attempt in between, is near‑capacity. A thirty‑second pause is past capacity or gone. The signal is not the latency itself; the signal is the distribution of latencies for this learner, interpreted against what she typically produces at this level of difficulty.

The third group is trajectory. Trajectory is where coaching diverges from neighboring fields. A pilot’s load matters in the moment. A surgeon’s affect matters in the moment. A learner’s state matters mostly because of where it implies the learner is heading. VanLehn (2011) reviewed the tutoring evidence and found that step‑based intelligent tutoring systems—those that sense and respond at the granularity of each solution step rather than each problem—approach the effect size of one‑to‑one human tutoring. Finer sensing, larger effects. Piech and colleagues (2015) showed that recurrent networks could learn useful learner‑trajectory representations from interaction traces without hand‑coded skill taxonomies; the learner’s path through a problem is itself a signal. Shute (2011) framed the reading itself as stealth assessment: every action the learner takes during normal work is a piece of evidence about an underlying construct—persistence, calibration, strategy—that the system can update without interrupting the work. A learner who tries the same approach three times, fails three times, and gets progressively terser with each attempt is on a different trajectory from the learner who tries three different approaches, fails each in a different way, and is still reasoning aloud. The count of failures is the same. The trajectory is not.
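
The contrast between the two learners at the end of that paragraph can be made mechanical. The sketch below computes three features from a short run of attempts: the failure count, the number of distinct approaches, and the least‑squares slope of response length across attempts. The `Attempt` fields, the approach labels, and the word counts are all hypothetical; how a real system would label an "approach" is left open.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Attempt:
    approach: str    # hypothetical label for the strategy the learner tried
    word_count: int  # length of the learner's accompanying explanation
    succeeded: bool


def trajectory_features(attempts: List[Attempt]) -> Tuple[int, int, float]:
    """Return (failures, distinct approaches, change in word count per attempt)."""
    failures = sum(1 for a in attempts if not a.succeeded)
    distinct = len({a.approach for a in attempts})
    n = len(attempts)
    if n < 2:
        return failures, distinct, 0.0
    # Ordinary least-squares slope of word count against attempt index.
    x_mean = (n - 1) / 2
    y_mean = sum(a.word_count for a in attempts) / n
    num = sum((i - x_mean) * (a.word_count - y_mean) for i, a in enumerate(attempts))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return failures, distinct, num / den


# Same approach three times, getting terser each time.
stuck = [
    Attempt("anchor_high", 40, False),
    Attempt("anchor_high", 22, False),
    Attempt("anchor_high", 8, False),
]
# Three different approaches, still reasoning aloud.
exploring = [
    Attempt("anchor_high", 35, False),
    Attempt("split_difference", 38, False),
    Attempt("ask_first", 41, False),
]

print(trajectory_features(stuck))
print(trajectory_features(exploring))
```

Both runs report three failures. They differ on the other two features, and those two are what the read depends on.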

No one of the three groups resolves the read by itself. Affect without load is mood without context. Load without trajectory is a snapshot of difficulty with no direction attached. Trajectory without affect is a curve with nothing to anchor its meaning to the person. The three groups have to be combined into one read, and the combination has to happen continuously. That combination is what the three‑layer model from the preceding section is for: it decides how much evidentiary weight to place on any one feature before letting it change the system’s read. A single late response weighs differently when the session‑patterns layer says this learner takes her time, the stable‑traits layer says she is well‑calibrated, and the momentary‑state layer says her trajectory has been steady. The same late response weighs very differently when two of those three layers are flashing in the other direction.

What the system does not read. This paper’s argument holds on the text and voice channels of a normal coaching interaction. It does not require and does not endorse camera‑based affect detection, keystroke‑dynamics biometrics, physiological sensors, or any of the surveillance‑adjacent channels that readers sometimes expect from a paper about learner sensing. Those channels belong to a different argument with different tradeoffs. The read we are describing is the read an attentive coach would run, if she had the bandwidth, on what the learner is already saying and doing in the coaching session itself.

§7. Most Systems Still Cannot See

Most systems still cannot see. The AI coaching systems in wide deployment right now read the learner’s typed words and, occasionally, a self‑report checkbox. They run the words through a language model and return a response that is polite, helpful, and tonally appropriate. They do not track response‑time distributions for this learner. They do not track confidence trajectories across a session. They do not know that this learner has failed three times already today and is close to shutdown, or that this learner is coasting and a push would land. Whatever read they are doing is whatever the language model happens to infer from the last turn of text.

This is not a problem with language models. It is a problem with architecture. The read‑the‑learner step has been treated as implicit in the generation step, as if a model that writes fluently must be reading fluently too. It does not follow. In the tutoring evidence VanLehn (2011) reviewed, systems that sensed at finer granularity generally produced larger learning effects: step‑level tutors outperformed answer‑level tutors, and answer‑level tutors outperformed no tutoring at all. The efficacy ceiling of an adaptive system tracks the resolution of its read.

Kestin, Miller, Klales, Milbourne, and Ponti (2025) showed how high that ceiling can reach when the read is taken seriously. Their Harvard physics tutor was engineered around seven research‑based pedagogical practices and deployed in an introductory course against expert‑led active‑learning instruction. Students in the AI‑tutor condition learned more, learned faster, reported higher engagement, and spent less time on task. Effect sizes ran from 0.73 to 1.3 standard deviations. The authors were careful about what they claimed: their tutor was built to sense and respond well inside a narrow context, and it worked because the sensing and responding were designed together. That is the existence proof. When the read is the center, the system performs near the top of what the technology permits. When the read is a byproduct of the generation loop, the system performs near the bottom.

The current failure mode is not that the outputs are bad. The outputs are polished. A learner who asks a question usually receives a clear, helpful answer. The failure mode is that the question is sometimes the wrong question to answer—the learner is not stuck on content, she is stuck on confidence, and another paragraph of content will not move her. A learner in the middle of productive struggle gets interrupted by a helpful scaffold she did not need and will never build the instinct the struggle was producing. A learner who is coasting receives encouragement she does not deserve and will not be pushed by. The polish is not the problem. The sensing gap is.

The architectural fix is not to add a sensing module next to the generation module. It is to decide that the sensing is the architecture and the generation sits on top of it. A coaching system that reads the learner first and generates second is a coaching system. A coaching system that generates first and reads whatever comes back is a chatbot with coaching vocabulary.

§8. How Reading Gets Better Over Time

Reading gets better with practice. The system is itself a learner. The way its read improves is the way an apprentice coach’s read improves: through feedback from outcomes, accumulated across sessions, stratified by what kind of thing is being learned.

Four loops run at four speeds. The fastest is the loop that learns this learner. Over the course of minutes and hours, the system updates its read of one particular person in front of it—when she pauses, what it means; when she says “I think I’ve got it,” how often she actually has; what a confident opener from her looks like on a day she is tired. The second loop learns learner types. Over days and weeks, conditionals that looked individual turn out to be shared across categories of learner. A high performer encountering a novel failure often reacts more sharply than her baseline would predict; the system notices that, and after enough instances the pattern stops being a quirk of this one learner and becomes a prior that transfers to the next learner with a similar profile. Pardos and Heffernan (2010) demonstrated the principle in knowledge tracing: models that combine population priors with learner‑specific parameters outperform models that pick one or the other. A good coaching system keeps both.
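
One simple way to "keep both" is a Beta‑Binomial estimate, in which the population prior acts as a set of pseudo‑observations that the learner's own record gradually outweighs. The sketch applies it to metacognitive calibration: how often this learner's "I've got it" is followed by a correct answer. This is a stand‑in we chose for brevity, not the Bayesian‑network method Pardos and Heffernan used, and the prior counts are hypothetical.

```python
def calibration_estimate(
    prior_hits: float, prior_misses: float, hits: int, misses: int
) -> float:
    """Posterior mean of a Beta-Binomial model.

    prior_hits and prior_misses encode the population prior as pseudo-counts;
    hits and misses are this learner's own record.
    """
    return (prior_hits + hits) / (prior_hits + prior_misses + hits + misses)


# Hypothetical population prior: "I've got it" is right about 60% of the time,
# held with the weight of ten observations.
POP_HITS, POP_MISSES = 6.0, 4.0

new_learner = calibration_estimate(POP_HITS, POP_MISSES, hits=0, misses=0)
after_4 = calibration_estimate(POP_HITS, POP_MISSES, hits=1, misses=3)
after_40 = calibration_estimate(POP_HITS, POP_MISSES, hits=10, misses=30)

print(new_learner, after_4, after_40)
```

With no data, the estimate is the population prior. After four observations it has moved partway toward this learner's own 25 percent hit rate, and after forty it sits much closer to it. The weight given to the prior (here ten pseudo‑observations) is the knob that decides how quickly the individual replaces the population.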

The third loop refines mode execution. Detecting that challenge is appropriate is not the same as knowing how to deliver a challenge that will land. “You can do better than that” produces defensiveness in one learner and a clean, energized response in the next. “What would this look like at your best?” produces engagement where the blunter form produced shutdown. Over weeks and months, the system accumulates evidence about which phrasings of which modes land where, for which learners. The read controls the choice of mode. The third loop controls the choice of wording.

The slowest loop, and the one that matters most at scale, refines the read itself. It asks which signals actually predict which learner states, which patterns of latency pair with which mode receptivity, and which combinations of features carry diagnostic weight and which are noise. This is the loop the learning‑analytics and educational‑data‑mining communities have been running across the research field for more than fifteen years (Baker & Yacef, 2009; Siemens, 2013). The community has been learning which features predict which outcomes, in offline studies, on archived logs. What is new is the ability to close that loop in real time, for each deployed system, using the sessions it is running now. A coaching system that reads a million learners this month knows, by next month, things about which signals matter that no single coach could have learned in a career.

Cold start is a known problem. A new learner arrives. The session‑patterns layer is empty; the stable‑traits layer has nothing but population priors. The system defaults to those priors, weights early observations heavily, borrows from similar learners when the similarity is defensible, and sometimes just asks the learner directly—“Do you prefer to struggle with problems before getting help, or would you rather have guidance upfront?” Asking is a sensing channel, not a failure of sensing. A well‑constructed question is often the highest‑fidelity read available in the first five minutes with a new learner.

The deployed system also has to balance using what it has already learned against continuing to learn. Swets (1973) gave this its canonical framing as a relative‑operating‑characteristic problem: how the system tunes its thresholds depends on the downstream cost of a miss versus the cost of a false alarm. Early in a relationship, the cost of a miss is low and the cost of fitting too fast to noisy evidence is high; the system explores. Later, the cost of a miss climbs and exploration costs more than it returns; the system leans on what it has learned. Neither setting is correct in general. Both are correct when the relationship is at the right stage.
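
The dependence of the threshold on the two costs has a short closed form under a simplifying assumption we are adding here: correct decisions cost nothing, a false alarm costs `cost_false_alarm`, and a miss costs `cost_miss`. Acting then has the lower expected cost whenever the probability of the state exceeds `cost_false_alarm / (cost_false_alarm + cost_miss)`. The function names and cost values below are hypothetical.

```python
def intervention_threshold(cost_false_alarm: float, cost_miss: float) -> float:
    """Probability above which acting has lower expected cost than waiting.

    Acting costs (1 - p) * cost_false_alarm in expectation; waiting costs
    p * cost_miss. Acting wins when p > cost_false_alarm / (sum of both costs).
    """
    if cost_false_alarm < 0 or cost_miss < 0 or cost_false_alarm + cost_miss == 0:
        raise ValueError("costs must be non-negative and not both zero")
    return cost_false_alarm / (cost_false_alarm + cost_miss)


def should_intervene(p_state: float, cost_false_alarm: float, cost_miss: float) -> bool:
    """Decide whether to act on a state the system believes with probability p_state."""
    return p_state > intervention_threshold(cost_false_alarm, cost_miss)


p_frustrated = 0.70  # the system is seventy percent confident

# Early in the relationship: acting on noise is costly, a miss is cheap.
early = should_intervene(p_frustrated, cost_false_alarm=3.0, cost_miss=1.0)
# Later: a miss has become the expensive error.
later = should_intervene(p_frustrated, cost_false_alarm=1.0, cost_miss=3.0)

print(early, later)
```

The same seventy percent belief falls below the early threshold of 0.75 and above the later threshold of 0.25, so the system waits in the first case and acts in the second.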

§9. What This Is, and What It Is Not

What this paper is. It is a claim about where the architectural center of coaching AI belongs. Reading the learner—not generating a response—is the load‑bearing act of coaching, and a coaching system that does not put that act at its center is not building coaching, whatever it is building. The components are not missing. Four decades of intelligent‑tutoring systems (Woolf, 2009 gives the textbook architecture), affective‑computing research, stealth assessment, multimodal learning analytics, and context‑aware HCI have produced the parts. Probabilistic learner models. Affect inference at dialogue speed. Multimodal pipelines from raw signal to actionable construct. Threshold calibration under uncertainty. Each part has been built, studied, and refined inside its own literature. What has been missing is a coaching field that treats the assembly of these parts as its architectural center rather than as decoration on the output layer.

What this paper is not. It is not a case for enterprise‑signal coaching—for reading an organization’s CRM data, performance‑review notes, or ticket volumes and turning those into coaching prompts. That is a different architecture with different tradeoffs, and we are not writing it here. It is not a call for camera‑based affect detection, keystroke biometrics, or any of the physiological sensing channels some readers may expect from a paper about learner sensing. The argument holds on what the learner is already saying and doing in the session. It is not an argument that AI coaching should replace human coaches in any setting where a human coach is available and paying attention; it is an argument about what machines will do for the overwhelming majority of learners who have never had a human coach and were never going to get one. A reader who walks away believing that we are advocating for surveillance in the workplace has read a paper we did not write.

The choice that remains. Two coaching systems can be built with the technology we have in 2026. The first puts learner‑reading at its architectural center. It tracks cognitive load, affect, confidence, and trajectory in real time across three timescales. It refines its read session by session, and across populations of learners. It generates a response only after it has decided what this learner needs, not before. It does, at scale, what expert coaches have always done. The second generates clean responses on top of whatever the language model happens to infer from the last turn of text. It is polished. It is helpful in the limited way that a polite conversation partner is helpful. It is not coaching. Both are possible. Both will be built. Both are already being built. The difference between them is visible in the architecture, long before it is visible in the outcomes.

A coaching system that cannot read its learner cannot coach its learner. That is the whole paper in one sentence. The only question is whether the systems being deployed right now were designed by people who believed it.

References

Abowd, G. D., Dey, A. K., Brown, P. J., Davies, N., Smith, M., & Steggles, P. (1999). Towards a better understanding of context and context-awareness. In Handheld and Ubiquitous Computing (HUC ’99), Lecture Notes in Computer Science 1707, 304–307. Springer.

Baker, R. S. J. d., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3–17.

Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6), 4–16.

Conati, C., & Maclaren, H. (2009). Empirically building and evaluating a probabilistic model of user affect. User Modeling and User-Adapted Interaction, 19(3), 267–303.

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.

Dey, A. K. (2001). Understanding and using context. Personal and Ubiquitous Computing, 5(1), 4–7.

Di Mitri, D., Schneider, J., Specht, M., & Drachsler, H. (2018). From signals to knowledge: A conceptual model for multimodal learning analytics. Journal of Computer Assisted Learning, 34(4), 338–349.

D’Mello, S., & Graesser, A. (2012). Dynamics of affective states during complex learning. Learning and Instruction, 22(2), 145–157.

D’Mello, S., Lehman, B., Pekrun, R., & Graesser, A. (2014). Confusion can be beneficial for learning. Learning and Instruction, 29, 153–170.

Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In P. A. Hancock & N. Meshkati (Eds.), Human Mental Workload (Advances in Psychology, Vol. 52, pp. 139–183). North-Holland.

Helmreich, R. L., Merritt, A. C., & Wilhelm, J. A. (1999). The evolution of Crew Resource Management training in commercial aviation. International Journal of Aviation Psychology, 9(1), 19–32.

Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms in-class active learning: An RCT introducing a novel research-based design in an authentic educational setting. Scientific Reports, 15, Article 17458.

Klein, G. A. (1998). Sources of Power: How People Make Decisions. MIT Press.

Pardos, Z. A., & Heffernan, N. T. (2010). Modeling individualization in a Bayesian networks implementation of knowledge tracing. In User Modeling, Adaptation, and Personalization (UMAP 2010), Lecture Notes in Computer Science 6075, 255–266. Springer.

Picard, R. W. (1997). Affective Computing. MIT Press.

Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), 505–513.

Shute, V. J. (2011). Stealth assessment in computer-based games to support learning. In S. Tobias & J. D. Fletcher (Eds.), Computer Games and Instruction (pp. 503–524). Information Age Publishing.

Siemens, G. (2013). Learning analytics: The emergence of a discipline. American Behavioral Scientist, 57(10), 1380–1400.

Simons, D. J., & Chabris, C. F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception, 28(9), 1059–1074.

Swets, J. A. (1973). The relative operating characteristic in psychology. Science, 182(4116), 990–1000.

VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221.

Woolf, B. P. (2009). Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing E-Learning. Morgan Kaufmann.