FOUNDATIONS · AIL-FP-2026-03

The Sensing Engine

Why Coaching AI That Cannot See Is Still Blind — and how one sensing layer powers both the coach and the simulation

Gregg Collins & Brandon Dickens · Apr 2026 · 24 min read

My experience is what I agree to attend to.

— William James

You can observe a lot by watching.

— Yogi Berra

§1. The Invisible Sensory Work Every Expert Coach Does

A jet on final approach. The first officer’s voice changes—not the words, the rhythm. He says, “We might be a little low here.” The captain, working through a landing checklist, answers, “Yeah, we’re fine,” and keeps reading. Two minutes later the aircraft strikes terrain. The flight data recorder holds nothing unusual. The voice recorder holds everything: a first officer who knew, a captain whose attention was elsewhere, and a signal that traveled in a hedge rather than in an instrument.

This is the scene that produced Crew Resource Management. Over two brutal decades, commercial aviation concluded that the most important information in a cockpit often arrives through channels the panel does not monitor: tone, pacing, social deference, the phrasing of a question instead of the question itself. The instruments were fine. The crews were qualified. The failure mode sat above the instruments, in the allocation of attention, in the half-said sentence, in a recognition that never fired. CRM training is now a licensing requirement because the industry decided that situational awareness was a skill with a definable anatomy, and that training it produced more error reduction than any other intervention on offer (Helmreich, Merritt, & Wilhelm, 1999).

Expert coaching has always run on the same kind of attention. A learner pauses before answering. The pause has a texture: thinking-through pause, frustrated pause, dissociated pause, performance pause. A tone contracts. A question turns into a hedge. A sentence ends up half a beat too long. Good coaches read this stream continuously—without announcing that they are reading it, without writing it down, often without being able to say out loud what they just read. The sensing was the work. It was invisible to the literature because it was invisible to the coaches themselves.

Pull up any transcript from a capable coach working with a capable learner, and you will see the apparatus running. Pause duration rises from two seconds to eight to fifteen. A confident “Right—I think I see it” gives way to a hedged “I think? Maybe?” and then to “I don’t know, can you just tell me?” Three exchanges. The coach reads each one against the learner she has been building in her head for weeks, decides the trajectory has crossed from productive struggle into depleted frustration, and quietly shifts mode. The shift is invisible to the learner. It is also invisible on the transcript, because the coach never announces it. What the transcript shows is the exchange. What the sensing produced was the next move.

This is not metaphor. Psychologists have measured it in adjacent domains. Simons and Chabris (1999) famously demonstrated that humans miss strikingly salient events when their attention is allocated to a counting task: a gorilla walks through the scene, and roughly half the viewers do not register it. Expert coaches are not immune to this. They are very good at seeing what they are trained to see and very good at missing everything else. That is not a failing. That is what it is to be a human working with bandwidth-limited attention.

It is also the reason the sensing layer has, until recently, been undocumented. Coaches did not describe the apparatus because, to them, there was no apparatus—there was only the room, the learner, and what the room was telling them. The moment the field started building AI systems to deliver coaching at scale, the apparatus became visible by its absence. What worked in the master coach’s office because she was reading eleven channels in parallel did not work on a chat interface that saw one. The sensing that had scaled through apprenticeship and mentorship and executive coaching did not scale through software. Not because the sensing was new, but because the rationing was.

§2. What Coaches Actually Read

The signals were always there. What changed is that we now have language to cluster them. Three families do most of the work, and they map onto research traditions that did not grow up in coaching but have already built the measurement apparatus coaches rely on.

The first cluster is affective. Picard (1997) reframed the field of human-computer interaction by arguing that any system meant to interact productively with a human has to sense and reason about that human’s emotional state. The claim was unpopular when she made it and is unremarkable now. D’Mello and Graesser (2012) later showed that learner affect during deep tasks does not wander—it oscillates on identifiable trajectories, engaged-flow giving way to confusion, confusion giving way to frustration or to resolution, frustration giving way to boredom if left unresolved. Affect is not noise. It is a structured signal with measurable transition probabilities. Expert coaches read it because it tells them where the learner is in the cycle and where the learner is likely to go next.
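One way to make "measurable transition probabilities" concrete is a first-order Markov estimate over a sequence of affect labels. The sketch below is illustrative only: the state names and the sample sequence are invented for the example, not data from D’Mello and Graesser, and a real system would first have to infer the labels from raw signals.

```python
from collections import Counter, defaultdict

def transition_probabilities(states):
    """Estimate first-order transition probabilities from a label sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }

# Hypothetical, hand-labeled affect sequence for a single session.
observed = ["flow", "confusion", "flow", "confusion",
            "frustration", "frustration", "boredom"]
probs = transition_probabilities(observed)
print(probs["confusion"])  # {'flow': 0.5, 'frustration': 0.5}
```

In this toy sequence, confusion resolves back to flow half the time and tips into frustration the other half, which is the kind of branch point a coach would want to know the learner is standing on.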

The second cluster is cognitive load. Hart and Staveland (1988) developed the NASA Task Load Index nearly four decades ago by decomposing the felt experience of mental workload into six subscales: mental demand, physical demand, temporal demand, performance, effort, and frustration. The instrument has been cited tens of thousands of times precisely because the construct it measures turned out to be stable across domains from cockpit operations to surgical training. When a coach reads a response latency that jumped from two seconds to fifteen, what she is reading is a load transient. The latency is not the signal; it is the channel the signal traveled through.

The third cluster is trajectory. Trajectory is where coaching proves to be different from the adjacent fields. A pilot’s cognitive load matters in the moment. A surgeon’s affective state matters in the moment. But a learner’s state matters mostly as a prediction about where the learner is heading. Trajectory signals are aggregates—recovery speed after failure, response variance across comparable prompts, confidence calibration over a session, the rate at which a small frustration is escalating or resolving. Coaches read them continuously, because the mode that is right for a learner on a climbing trajectory is frequently wrong for the same learner two minutes later on a falling one. Trajectory is the integration of affect and load over time, and it is where most of the modal decisions get made.
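A trajectory signal can be as simple as the slope of a momentary signal across recent exchanges. The sketch below fits a least-squares line to the pause sequence described in §1 (two, eight, fifteen seconds). The function name is hypothetical, and treating a raw slope as "the" trajectory is an assumption made for illustration; a fuller integration would combine several signals.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (units per exchange)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

latencies = [2.0, 8.0, 15.0]  # seconds, the three pauses from the §1 example
slope = trend_slope(latencies)
print(round(slope, 2))  # 6.5 seconds of added pause per exchange
```

The same absolute pause reads differently depending on the sign of this slope: fifteen seconds on the way up is a warning, fifteen seconds on the way down is recovery.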

Clustering the signals this way is not a list in disguise. It is a reminder that each family is tractable. Each has been studied, instrumented, and measured in a research literature with decades of investment. Coaches have been reading them all along, one coach at a time, one learner at a time. What has been missing is not the science. What has been missing is an architecture that fuses the three families into a running model of the learner that can be updated many times a minute without a human in the loop.

That architecture is a sensing engine. The question the next section has to answer is why, given that every family already has a measurement tradition behind it, coaching has not been able to work at this density before.

§3. The Bandwidth Constraint

But bandwidth is finite. The measurement literature for affect, load, and trajectory has existed for decades. Coaching has not worked at this density because a human coach cannot run three measurement pipelines in parallel while also conducting dialogue, holding context, and remembering what happened in the session three weeks ago.

The constraint is not about effort. It is about what a single cognitive system can do. Simons and Chabris (1999) demonstrated with their selective-attention paradigm that even alert, motivated human observers will fail to register signals they would call unmissable in retrospect, provided they are occupied with another task. Attention is scarce, and it is jealous. The coach who is listening carefully to the words is reading less from the tone. The coach who is reading the tone is missing the response latency. The coach who is integrating everything is running out of room to compare this learner to a cohort of past learners she worked with years ago. The apparatus fits or it does not fit.

Signal-detection theory (Green & Swets, 1966) gave cognitive science the grammar for what this tradeoff looks like. Every decision to classify a signal is a placement of a criterion on a continuum: a more liberal criterion catches more true signals but yields more false alarms; a more conservative one yields fewer false alarms but more misses. Human coaches run this tradeoff constantly and silently. A novice over-calls frustration and escalates support too early; an experienced coach has calibrated her threshold through years of practice. What she has done, in the technical vocabulary, is move her criterion, pushing down the false-alarm rate at some small cost in hit rate, until her decisions land at the learner’s actual state more often than not. She has not solved the bandwidth problem. She has negotiated with it.
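The distinction between how well a coach discriminates and where she sets her threshold can be computed directly from hit and false-alarm rates. The sketch below uses the standard equal-variance Gaussian formulas for d' (discriminability) and c (criterion). The two sets of rates are hypothetical, constructed so that both coaches discriminate equally well and differ only in criterion.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for the z-transform

def sensitivity_and_criterion(hit_rate, false_alarm_rate):
    """Return (d', c) under the equal-variance Gaussian model."""
    z_hit = _STD.inv_cdf(hit_rate)
    z_fa = _STD.inv_cdf(false_alarm_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)

# Hypothetical rates for calling "frustrated" on a learner.
novice = (_STD.cdf(1.5), _STD.cdf(-0.5))  # about 93% hits, 31% false alarms
expert = (_STD.cdf(1.0), _STD.cdf(-1.0))  # about 84% hits, 16% false alarms

for name, (hits, fas) in {"novice": novice, "expert": expert}.items():
    d_prime, c = sensitivity_and_criterion(hits, fas)
    print(name, round(d_prime, 2), round(c, 2))
```

In this constructed case both coaches have d' of about 2.0. The novice sits at a liberal criterion (c near -0.5) and the expert at a neutral one (c near 0), trading some hits for about half as many false alarms.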

Klein (1998) described what it looks like to operate near the bandwidth limit. His work on recognition-primed decision-making documented how experts under time pressure make decisions: not by comparing options but by pattern-matching the situation to a library of past situations and executing the response associated with the closest match. Firefighters, nurses in triage, neonatal intensive-care physicians—they do not analyze. They recognize. The pattern match is fast precisely because most of the analysis has been compiled into the recognition itself. This is what expert coaches do. It is also what the literature has historically failed to describe, because the experts cannot articulate the recognition without degrading it.

Recognition-primed decisions scale badly. They depend on a library of prior cases that only accumulates through years of one-on-one work. They are fragile in the presence of distraction, which is the default condition of every synchronous coaching encounter. And they cannot be inspected, corrected, or transferred—the master coach cannot show the apprentice what she just saw, because she does not have direct access to it. What she has is the decision.

The consequence is that coaching at its best has been a practice that runs at the ceiling of a single cognitive system. That ceiling is high. It is still a ceiling. Every scaling strategy the field has tried—larger groups, standardized curricula, video libraries, first-generation AI tutors—has involved turning the sensing down so that one instance of the apparatus could serve more learners. Reduce the channels. Reduce the frequency. Reduce the integration across sessions. The bandwidth was going to be rationed somewhere. It was rationed in the sensing, because the sensing was where it hurt least to cut.

§4. Until Now, That Meant Rationing

Until now, that meant rationing. The sensing that made expert coaching effective was available to the learners whose coaches could afford to do it—the executive, the elite athlete, the doctoral student with a dedicated mentor, the private-tutoring clients Benjamin Bloom had in mind when he wrote about the two-sigma problem. Bloom’s finding was that one-to-one mastery tutoring outperformed conventional classroom instruction by roughly two standard deviations on comparable assessments (Bloom, 1984). It was not a small effect. It was not controversial. It was simply unreachable, because the tutoring that produced it required sensing that required attention that required a human whose cost made the arrangement feasible for roughly one percent of the population.

Generative AI does not change the theory of learning. It changes the economics of attention. A system built to sense can run affect detection, load detection, trajectory integration, pattern recognition across sessions, and cohort comparison in parallel without any of those pipelines stealing cycles from any other. The ceiling Klein described does not apply—not because the machine is smarter, but because the machine is not running against a single-threaded bottleneck. This is the bandwidth constraint lifting.

Kestin and colleagues at Harvard recently provided the first direct empirical foothold for what happens when it does. In a randomized trial, they compared an AI tutor built around seven research-based pedagogical practices—active engagement, cognitive-load management, growth-mindset framing, scaffolding, accuracy, timely feedback, and self-pacing—against expert-led active-learning instruction in an introductory Harvard physics course. The AI-tutor group learned more, learned faster, reported higher engaged-state ratings, and did so with less time on task. The effect sizes ran from 0.73 to 1.3 standard deviations (Kestin, Miller, Klales, Milbourne, & Ponti, 2025).

Read that finding carefully. It is not a claim that AI tutors beat human teachers. It is a claim that a carefully instrumented single pedagogical mode, applied in a well-scoped context, already clears the bar that conventional instruction has failed to clear for forty years. VanLehn (2011) had already shown in a meta-analytic review that step-based intelligent tutoring systems approach the effect size of human tutoring—roughly d = 0.76 compared with d = 0.79 for human tutors—precisely because step-based systems sense and respond at the granularity of each solution step. The empirical lesson is that the ceiling of adaptive instruction tracks the resolution of its sensing. Better sensing, better outcomes. Not because sensing is the whole story, but because nothing downstream of sensing can be more accurate than sensing allows.

What Kestin’s group did with one mode well-instrumented, the field can now do with many modes continuously re-instrumented. That extension is the Sensing Engine. It is not a fresh claim about pedagogy. It is a claim about what becomes operationally possible when sensing stops being the rate-limiting step. For forty years the adaptive-instruction literature has built what it could build against the constraint that a real-time learner model was expensive to maintain. The constraint is going away. The papers that were written to it will age quickly.

The rest of this paper describes what we think the sensing layer has to look like, and what the argument commits us to if we take it seriously.

§5. Three Layers, Three Timescales

Everything changes at the signal layer. The modal decisions downstream—challenge versus support, scaffold versus step back, question versus explain—all rest on a prior decision about what the learner’s state is, and the quality of the modal decision cannot exceed the quality of that prior. So the sensing has to be built right. Building it right, however, runs straight into an apparent contradiction.

On one side, sensing has to be fast. A coaching turn is a matter of seconds. A mode shift that arrives after the learner has disengaged has missed its moment. On the other side, sensing has to be slow. A “stable trait” that resets every conversation is not a trait; it is noise. Whether the learner prefers autonomy, how reliably she calibrates her own confidence, what her baseline resilience looks like—these are not claims that should revise on an eight-second pause. They are claims that should accumulate evidence across weeks and months and should, once formed, be harder to move than yesterday’s outburst.

The contradiction dissolves when the sensing layer stops being one system and becomes three.

Layer 1 is fast and reactive. It tracks cognitive load, affect, response latency, confidence markers, and the direction of the current trajectory. It updates on every exchange. Conati and Maclaren (2009) demonstrated that dynamic Bayesian networks can fuse cause-side variables (scenario difficulty, recent failures) with effect-side variables (pause duration, error rate) to produce real-time affective inference that holds up empirically. The engineering is tractable. The channel count has gone up; the decision-under-uncertainty mathematics has been worked out for decades.
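The cause-side and effect-side fusion can be illustrated with a single Bayes update. This is a deliberate simplification and not the dynamic Bayesian network Conati and Maclaren built: it has one hidden state, one cause variable, one effect variable, and no time dimension. Every probability in it is invented for the example.

```python
def posterior_frustration(prior, p_evidence_if_frustrated, p_evidence_if_not):
    """Bayes rule for a binary state given one piece of evidence."""
    numerator = prior * p_evidence_if_frustrated
    return numerator / (numerator + (1 - prior) * p_evidence_if_not)

# Cause side (hypothetical): the prior rises with the number of recent failures.
PRIOR_BY_RECENT_FAILURES = {0: 0.1, 1: 0.2, 2: 0.4}
# Effect side (hypothetical): a long pause is likelier when the learner is frustrated.
P_LONG_PAUSE_IF_FRUSTRATED = 0.6
P_LONG_PAUSE_IF_NOT = 0.2

for failures, prior in PRIOR_BY_RECENT_FAILURES.items():
    p = posterior_frustration(prior, P_LONG_PAUSE_IF_FRUSTRATED, P_LONG_PAUSE_IF_NOT)
    print(failures, round(p, 3))
```

The same long pause moves the estimate to 0.25 after no failures and to about 0.667 after two, which is the point of fusing the two sides: the effect is read in light of the cause.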

Layer 2 is the bridge. It tracks how this specific learner responds to specific modes across the last several sessions—does challenge land well? does it land well except after failure? does support produce progress or reinforce dependency? The conditional patterns that live at this layer are where personalization becomes real. Corbett and Anderson’s (1994) Bayesian Knowledge Tracing established the canonical form: probability distributions over learner skill, updated after each response. Layer 2 of the Sensing Engine is the generalization of that move from skill mastery to response patterns—from what the learner knows to how the learner metabolizes feedback.
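For readers who have not seen it, the standard Bayesian Knowledge Tracing update fits in a few lines. The sketch below follows the usual four-parameter formulation (prior mastery, slip, guess, learn). The parameter values are illustrative assumptions, not fitted estimates, and nothing here is specific to the Sensing Engine.

```python
def bkt_update(p_known, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One BKT step: condition on the response, then apply the learning transition."""
    if correct:
        numerator = p_known * (1 - p_slip)
        denominator = numerator + (1 - p_known) * p_guess
    else:
        numerator = p_known * p_slip
        denominator = numerator + (1 - p_known) * (1 - p_guess)
    posterior = numerator / denominator
    # The learner may also have acquired the skill during this opportunity.
    return posterior + (1 - posterior) * p_learn

p = 0.5  # assumed starting belief that the skill is known
for outcome in [True, True, False, True]:
    p = bkt_update(p, outcome)
    print(round(p, 3))
```

Generalizing this from skill mastery to response patterns would mean replacing "skill is known" with a hypothesis such as "challenge lands well after failure" and updating it on each observed reaction to a mode.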

Layer 3 is slow and conservative. It holds stable traits: learning velocity, metacognitive accuracy, autonomy preference, resilience baseline. It revises reluctantly, on the basis of accumulated evidence across weeks. Pardos and Heffernan (2010) showed that learner models work better when they individualize—population-level priors are a reasonable starting point, but they are replaced by learner-specific parameters as data accumulates. Layer 3 is where that replacement lives. It is also what makes Layer 1 interpretable: the same ten-second pause means different things for a learner who typically answers in two seconds and for one who typically takes eight.

Information flows in both directions. Observations aggregate upward into patterns; patterns aggregate upward into traits. Traits flow downward to shape the interpretation of fresh observations. The layers converse.

Three-Layer Sensing Architecture

Layer	Timescale	What the Sensing Engine Tracks	What It Powers (Coach / Sim)
Momentary State	Seconds to minutes	Cognitive load, affect, response latency, confidence markers, trajectory	Mode shifts within a dialogue turn; real-time difficulty adjustment in sim
Session Patterns	Minutes to weeks	Mode response patterns, recovery curves, conditional patterns (e.g., challenge OK except after failure)	Which mode variants work for this learner; scenario selection across sims
Stable Traits	Weeks to months	Learning velocity, metacognitive accuracy, autonomy preference, resilience baseline	Default modes, default challenge level, pre-configured sim difficulty

The architecture does two things the earlier, single-layer designs could not do. It lets the fast layer be fast without forcing the slow layer to be credulous: a single volatile session does not revise the baseline. And it lets the slow layer inform the fast layer without freezing it: the baseline shapes interpretation, but unexpected behavior can still propagate upward if it persists. This regulated coupling between timescales is what makes the sensing layer operable. It is also what the learner-modeling literature has been converging toward for thirty years. What is different now is the ability to instantiate it in real time, under dialogue, at the resolution the decision logic actually needs.
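The two-way flow between a fast layer and a slow layer can be sketched with a single signal. In the example below, a slow baseline shapes how a pause is read (downward flow), and each observation nudges that baseline by a small step (upward flow). The class name, the 0.02 update rate, and the standard deviations are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class LatencyBaseline:
    """Slow-layer belief about one learner's typical response latency."""
    mean_s: float
    sd_s: float
    slow_rate: float = 0.02  # assumed small step: one pause barely moves the trait

    def surprise(self, latency_s: float) -> float:
        """Fast-layer reading: how unusual is this pause for this learner?"""
        return (latency_s - self.mean_s) / self.sd_s

    def absorb(self, latency_s: float) -> None:
        """Slow-layer revision: nudge the baseline toward the observation."""
        self.mean_s += self.slow_rate * (latency_s - self.mean_s)

quick = LatencyBaseline(mean_s=2.0, sd_s=1.0)
deliberate = LatencyBaseline(mean_s=8.0, sd_s=2.0)
print(quick.surprise(10.0), deliberate.surprise(10.0))  # 8.0 1.0

quick.absorb(10.0)        # one long pause moves the baseline only slightly
print(round(quick.mean_s, 2))
for _ in range(100):      # a persistent change eventually revises the trait
    quick.absorb(10.0)
print(round(quick.mean_s, 2))
```

The same ten-second pause is eight standard deviations out for the quick learner and one for the deliberate learner. A single such pause leaves the baseline near two seconds, while a sustained run of them pulls it most of the way to ten.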

§6. The Coach-Sim Inversion, and Why Sensing Must Be Shared

One engine, two consumers. A coaching system built around the modal hypothesis needs a sensing layer. A simulation that adapts difficulty, manages consequences, and detects when a learner has drifted from a stated goal also needs a sensing layer. The question that decides the architecture is whether these are one sensing layer or two.

The intuitive first move is to embed sensing inside each simulation. The simulation, after all, owns the scenario. It knows which branch the learner took, which decisions were deferred, which NPCs are cooperative. It can directly measure decision latency, revision behavior, outcome quality. Embedding sensing there looks tidy.

Tidy is wrong. Embedded sensing breaks the moment the coach needs to stay coupled to a simulation in progress. A coach who cannot see what is happening inside the sim until the debrief is a coach who cannot intervene in real time—who cannot say “let’s pause for a minute, something is landing differently than you expected.” The whole point of the coach-sim inversion—the move where the coach becomes primary and simulations become tools the coach deploys—is that the coach decides. The coach cannot decide against a blank wall. Sim-embedded sensing also duplicates work across every scenario type, and it leaves each scenario type with its own private signal vocabulary. The learner model goes incoherent. One sim thinks the learner is resilient; another thinks the learner is brittle; neither talks to the other.

A shared sensing layer, external to both the coach and the simulation, closes each of these gaps.

Coach vs. Sim Sensing Overlap

Signal Dimension	What the Coach Channel Sees	What the Sim Channel Sees	Why One Layer Must Serve Both
Real-time cognitive load	Message speed, revision behavior, question patterns	Decision latency, hesitation, information-processing delay	Same phenomenon, different input channels; fusion sharpens detection
Real-time affect	Tone, word choice, energy	Dialogue tone, risk-taking propensity, urgency in utterance	Same phenomenon; fused channel eliminates single-modality blind spots
Decision quality	Reasoning quality in real work, patterns across sessions	Choices inside scenario, error correction, strategy consistency	Coach sees transfer; sim sees controlled practice; both needed to distinguish mastery from luck
Learner profile / context	Role, team, career, constraints, preferences	Partial—must be inferred from behavior	Sim’s inference is anemic without the coach’s explicit context; shared layer closes the gap
Scenario state	Limited—coach wasn’t in the scenario	Full—sim owns the scenario rules	Coach cannot monitor progress without sim-state visibility; shared layer makes real-time intervention possible
Long-term trait drift	Cross-domain, weeks to months	Limited to sim-type-specific patterns	Trait-level inference needs coach’s breadth; sim-level inference needs sim’s depth; both feed one learner model

What the table makes visible is that the overlap between coach-side and sim-side sensing is large on the learner-state dimensions (load, affect, decision quality), concentrated on the coach side for learner context and long-term patterns, and concentrated on the sim side for scenario state and information discovery. Neither side can build the full picture alone. The coach is rich in context but weak on what is happening inside this specific scenario. The simulation is rich in scenario telemetry but weak on why the learner is here, what the week has been like, what the last three sims looked like. A shared sensing layer is the only architecture that gives both consumers what they need without doubling the sensing investment.
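The claim that fusion sharpens detection has a simple statistical reading. If the coach channel and the sim channel each produce a noisy estimate of the same quantity, an inverse-variance weighted average has lower variance than either input. The sketch below assumes the two channels' errors are independent and roughly Gaussian, which real channels may not satisfy, and the numbers are invented.

```python
def fuse_estimates(estimates):
    """Inverse-variance weighted fusion of independent (mean, variance) pairs."""
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, estimates)) / total
    return mean, 1.0 / total

# Hypothetical cognitive-load estimates on a 0-1 scale.
coach_channel = (0.6, 0.04)  # from message speed and revision behavior
sim_channel = (0.8, 0.04)    # from decision latency and hesitation
fused_mean, fused_var = fuse_estimates([coach_channel, sim_channel])
print(round(fused_mean, 3), round(fused_var, 3))  # 0.7 0.02
```

Two equally noisy channels halve the variance when fused. When one channel is much noisier, the weighting leans toward the other automatically, which is one reason to keep per-channel uncertainty in the shared layer.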

The prior art for this exists. Abowd and his collaborators formalized context-aware computing a quarter-century ago around the argument that “context” is not a feature set but a characterization of the situation of a user and the objects they interact with—and that treating context as platform infrastructure allows many applications to share a sensing substrate that none of them would invest in individually (Abowd et al., 1999). Activity-recognition research has walked the same path (Bulling, Blanke, & Schiele, 2014): the Activity Recognition Chain moves raw sensor streams through feature extraction to classification to application, as a pipeline deliberately reusable across applications rather than rebuilt for each one. The Sensing Engine we are arguing for is the learning analog. It is not a new kind of infrastructure. It is a known kind of infrastructure that has not yet been built for learning, because, until recently, the learning side lacked the compute and the models to consume it.

One engine. Two consumers that cannot see each other’s business. A learner model both of them agree on.

§7. From Signals to Modes

The gap is upstream of theory. The theory of modal coaching is reasonably well-developed. Expert coaches have a repertoire of modes, they detect when trigger conditions are present, and they shift between modes as the conditions change. Each mode has counter-indications that tell you when to avoid it even if the learner is asking for it. The decision logic is neither particularly exotic nor particularly contested among practitioners. What is contested, and what has been absent, is the detection that has to happen before the decision logic gets to run.

The first thing a shared sensing layer enables is a clean handoff from sensing to decision. The Sensing Engine maintains probability distributions over the relevant learner-state variables—affect, load, trajectory, mode responsiveness, resilience—and hands those distributions, with uncertainty bounds, to downstream decision logic. The decision logic then applies mode-selection rules, hysteresis to avoid oscillation, minimum dwell time so a mode has a chance to work before being abandoned, and counter-indications that rule out certain moves even when they look attractive on the surface. Sensing gathers. Decision applies. The boundary is not decorative; it lets each side be improved separately, and it lets each side’s failures be diagnosed separately.
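Hysteresis and minimum dwell time can be shown in a small state machine. The sketch below is a hypothetical two-mode selector that consumes a single probability from the sensing side. The thresholds, the mode names, and the three-turn dwell are assumptions, and counter-indications and uncertainty bounds are left out to keep it short.

```python
class ModeSelector:
    """Two-mode selector with separate enter/exit thresholds and a dwell floor."""

    def __init__(self, enter_support=0.7, exit_support=0.4, min_dwell_turns=3):
        self.enter_support = enter_support      # switch to support at or above this
        self.exit_support = exit_support        # switch back only at or below this
        self.min_dwell_turns = min_dwell_turns  # turns a mode must run before review
        self.mode = "challenge"
        self.turns_in_mode = 0

    def step(self, p_frustrated):
        self.turns_in_mode += 1
        if self.turns_in_mode < self.min_dwell_turns:
            return self.mode  # give the current mode time to work
        if self.mode == "challenge" and p_frustrated >= self.enter_support:
            self._switch("support")
        elif self.mode == "support" and p_frustrated <= self.exit_support:
            self._switch("challenge")
        return self.mode

    def _switch(self, mode):
        self.mode = mode
        self.turns_in_mode = 0

selector = ModeSelector()
readings = [0.8, 0.8, 0.8, 0.5, 0.5, 0.5, 0.5, 0.3]
print([selector.step(p) for p in readings])
```

With these readings the selector holds challenge for two turns despite high frustration, switches to support on the third, stays in support while the estimate hovers at 0.5 (inside the gap between the two thresholds), and returns to challenge only when the estimate drops to 0.3.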

Swets (1973) formalized this as the separation between discriminability and criterion placement. How well you can tell signal from noise is one question. Where you set the threshold—how willing you are to false-alarm in order to catch more hits—is a different question. Trigger calibration, in the Sensing Engine, is the criterion-placement question. It depends on what the downstream cost of a miss is and what the downstream cost of a false alarm is, and it can be tuned separately for different trigger types. Mode shifts are cheap and frequent at Layer 1; trait revisions are expensive and rare at Layer 3. The criterion placement should look different. Treating trigger calibration as a binary-rules problem—“if pause longer than ten seconds, escalate support”—misreads the problem. Those rules go brittle on contact with real learners. ROC-based calibration does not.
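The criterion-placement question has a compact decision-theoretic form. If p is the estimated probability that a trigger condition is present, acting costs (1 - p) times the false-alarm cost in expectation and not acting costs p times the miss cost, so the break-even threshold is the false-alarm cost divided by the sum of the two costs. The sketch below applies that formula with invented cost ratios for a cheap Layer 1 mode shift and an expensive Layer 3 trait revision.

```python
def decision_threshold(cost_false_alarm, cost_miss):
    """Probability above which acting has lower expected cost than not acting."""
    return cost_false_alarm / (cost_false_alarm + cost_miss)

# Hypothetical relative costs; only the ratios matter.
layer1_mode_shift = decision_threshold(cost_false_alarm=1.0, cost_miss=3.0)
layer3_trait_revision = decision_threshold(cost_false_alarm=9.0, cost_miss=1.0)
print(layer1_mode_shift, layer3_trait_revision)  # 0.25 0.9
```

Under these assumed costs a mode shift fires at a quarter probability while a trait revision waits for ninety percent, which is the sense in which the two criteria should look different even when the underlying discriminability is the same.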

At the organizational scale, triggers have already been categorized into five types: events (something happened), changes (the landscape shifted), goals (performance drifted), opportunities (a high-stakes moment approaches), and bad outcomes (something went wrong). The categories are operationally useful and data-source-indexed. But they run at a different granularity from the moment-to-moment triggers the coach responds to in a dialogue turn. A quota miss is a goal trigger. What the coach does about it depends on whether the learner is coasting or genuinely at limit—a Layer-1 affective-state question—and on whether the learner has been resilient or depleted this week—a Layer-2 pattern question. The organizational triggers tell the coach that something is worth coaching. The sensing layer tells the coach how to coach it. Neither can be collapsed into the other without loss.

Di Mitri, Schneider, Specht, and Drachsler (2018) laid out the pipeline that connects the two: raw signals get processed into features, features get combined into constructs, constructs get interpreted into feedback. The Sensing Engine sits at the construct stage. It produces a running estimate of learner state; downstream systems consume that estimate to produce feedback. What this paper argues for is the architecture of the first three stages. What to do with the constructs—how to choose the mode, how to sequence the intervention, how to design the debrief—is downstream work that deserves its own paper. The line between sensing and deciding is where the modularity lives.

§8. What Weak Sensing Breaks

Most AI coaches are flying blind. The systems in wide deployment right now sense almost nothing. They take the learner’s typed words, run them through a language model, and return a sequence of tokens that reads as helpful or encouraging or curious. They do not read pause length. They do not track confidence trajectories across a session. They do not know that this learner has failed three times already today and is nearing shutdown, or that this learner is coasting and a push would land. They are not bad systems; they are sensing-less systems. And their failures are specific.

Missed struggle detection. A learner tries the same approach three times, fails three times, and gets progressively terser with each attempt. A sensing-less system offers encouragement and invites the learner to try again. The sensing system recognizes the signature—repeated approach without variation, error rate steady, response time lengthening, affect curdling—and scaffolds. D’Mello, Lehman, Pekrun, and Graesser (2014) showed that confusion, productively resolved, benefits learning; confusion left to fester into frustration does not. The cost of missing this distinction is learned helplessness masquerading as persistence.

Unchecked confidence overreach. A learner answers quickly, confidently, and correctly—but for the wrong reasons. The sensing-less system congratulates and moves on. The sensing system notices the low latency paired with shallow justification and builds in a verification step: “walk me through why that worked.” Stealth assessment (Shute, 2011; Shute & Ventura, 2013) treats every action as diagnostic evidence. Confidence-correctness calibration is one of the constructs this machinery is designed to measure, and it is exactly the construct conventional testing has always struggled with. The cost of missing it is brittle mastery—an appearance of competence that collapses in the real case.

Emotional mismatch at the growth edge. A learner is working through an emotionally loaded scenario—a termination conversation, a difficult piece of feedback, a negotiation with someone whose agreement they need. Signals drift: response latency doubles, word choice turns passive, the engaged state of ten minutes ago falls off the table. The sensing-less system continues at the same difficulty because nothing has changed in the text. The sensing system recognizes that load has jumped past the productive zone and that what the learner needs is a pause, not another push. The cost of missing this is avoidance. The learner protects themselves by not coming back to the growth edge that hurt.

Path incoherence. A learner in an open-ended simulation starts with a stated goal—collaborate, hold the line, stay honest—and gradually drifts. The choices become more aggressive, more defensive, more expedient. The scenario’s metrics still look fine: the deal closed, the conversation ended, the objective was achieved. The sensing-less system congratulates. The sensing system tracks the goal-path divergence and surfaces it in the debrief: “notice where the tactics you used diverged from the approach you wanted to take.” The cost of missing this is locally correct, globally misaligned learning. The learner rehearsed the wrong thing with feedback that said it was the right thing.

Cold start. A new learner arrives. No Layer-3 model. No session history. The sensing-less system defaults to a one-size-fits-all posture, which either over-adapts to the first few noisy data points or under-adapts and treats a careful thinker as if she were stuck. The sensing system knows it is in a cold-start regime and acts like it. It maintains wide uncertainty bounds. It weights early observations heavily but flags them as provisional. It uses population baselines where they are defensible and asks the learner directly where they are not: “[d]o you prefer to struggle with problems before getting help, or would you rather have guidance upfront?” The question is not a failure of the sensing layer. It is the sensing layer doing its job, which sometimes means admitting that the best channel for a signal right now is the spoken one.
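One conventional way to combine a population baseline with sparse learner data is a normal-normal conjugate update, in which the estimate starts at the population mean and shifts toward the learner's own data as observations accumulate. This is a textbook technique offered as an illustration, not necessarily the method the authors have in mind, and the prior and noise values are assumptions.

```python
def shrunken_estimate(prior_mean, prior_var, observations, obs_var):
    """Posterior (mean, variance) for a learner-level mean with a Gaussian prior."""
    n = len(observations)
    if n == 0:
        return prior_mean, prior_var  # pure cold start: fall back to the population
    sample_mean = sum(observations) / n
    precision = 1.0 / prior_var + n / obs_var
    mean = (prior_mean / prior_var + n * sample_mean / obs_var) / precision
    return mean, 1.0 / precision

# Hypothetical population baseline: 5 s typical latency, variance 4.
print(shrunken_estimate(5.0, 4.0, [], 4.0))         # (5.0, 4.0)
print(shrunken_estimate(5.0, 4.0, [9.0], 4.0))      # (7.0, 2.0)
print(shrunken_estimate(5.0, 4.0, [9.0] * 9, 4.0))
```

After one observation the estimate sits halfway between the population and the learner, with wide variance that marks it as provisional. After nine consistent observations it has moved to about 8.6 with variance 0.4, and the population prior has largely been replaced.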

These are not hypothetical failures. They are the default failure modes of systems that were built without a sensing layer and then decorated with conversational polish. The polish is not the problem. The sensing gap is.
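The cold-start behavior described above can be sketched with a Beta-Bernoulli estimate, under the assumption that the quantity being tracked is a simple success rate. A weak population prior gives wide uncertainty and lets early observations move the estimate a lot, and a spread threshold marks the estimate as provisional. `BetaEstimate`, `cold_start_prior`, `is_provisional`, and the numeric defaults are hypothetical choices made for illustration, not the system's actual model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaEstimate:
    alpha: float  # pseudo-count of successes
    beta: float   # pseudo-count of failures

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def std(self) -> float:
        n = self.alpha + self.beta
        return math.sqrt(self.alpha * self.beta / (n * n * (n + 1)))

    def update(self, success: bool) -> "BetaEstimate":
        """Return a new estimate that includes one more observation."""
        if success:
            return BetaEstimate(self.alpha + 1, self.beta)
        return BetaEstimate(self.alpha, self.beta + 1)


def cold_start_prior(population_rate: float, strength: float = 2.0) -> BetaEstimate:
    """Population baseline as a weak prior: low strength means wide bounds."""
    return BetaEstimate(population_rate * strength, (1 - population_rate) * strength)


def is_provisional(est: BetaEstimate, std_threshold: float = 0.15) -> bool:
    """Flag the estimate as provisional while its spread is still wide."""
    return est.std > std_threshold


# Usage: one early success moves the mean from 0.50 to about 0.67,
# but the estimate remains provisional until more evidence accumulates.
estimate = cold_start_prior(population_rate=0.5)
estimate = estimate.update(success=True)
```

While `is_provisional` holds, a consumer could prefer asking the learner directly over acting on the estimate, which matches the point that the spoken channel is sometimes the best one.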

§9. Sensing as Infrastructure

Sensing is not optional. It is the base layer on which any adaptive system has to sit, and the interesting consequence of building it right is that it turns out to serve far more than modal coaching. Any downstream decision system that needs to know something about the learner—assessment, recommendation, failure prediction, scheduling, escalation, cohort comparison—has to consume the same state estimates the coach consumes. Building those pipes once and running them cleanly, rather than embedding redundant sensors in every application that claims to be adaptive, is the architectural move that makes the learning stack look less like a collection of artifacts and more like a system.
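A minimal sketch of that arrangement, assuming a single shared state estimate and a simple subscribe-and-publish interface. `StateEstimate`, `SensingEngine`, the consumer functions, and every threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class StateEstimate:
    learner_id: str
    load: float        # estimated cognitive or emotional load, 0..1
    confidence: float  # how much the engine trusts this estimate, 0..1


Consumer = Callable[[StateEstimate], str]


class SensingEngine:
    """One producer of state estimates, many downstream decision consumers."""

    def __init__(self) -> None:
        self._consumers: Dict[str, Consumer] = {}

    def subscribe(self, name: str, consumer: Consumer) -> None:
        self._consumers[name] = consumer

    def publish(self, estimate: StateEstimate) -> Dict[str, str]:
        # Every consumer sees the same estimate; each makes its own decision.
        return {name: consumer(estimate) for name, consumer in self._consumers.items()}


def coach(e: StateEstimate) -> str:
    return "pause" if e.load > 0.7 else "continue"


def scheduler(e: StateEstimate) -> str:
    return "shorten next session" if e.load > 0.7 else "keep schedule"


def escalation(e: StateEstimate) -> str:
    return "notify human coach" if e.load > 0.9 and e.confidence > 0.8 else "no action"


engine = SensingEngine()
engine.subscribe("coach", coach)
engine.subscribe("scheduler", scheduler)
engine.subscribe("escalation", escalation)
decisions = engine.publish(StateEstimate("learner-1", load=0.8, confidence=0.9))
```

The design choice being illustrated is that the estimate is computed once and the decision logic lives in the consumers, so adding a new consumer does not require a new sensor.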

The Handbook of Learning Analytics (Lang, Siemens, Wise, & Gašević, 2017) already reflects how broad the consumer base is—assessment analytics, dashboard analytics, multimodal analytics, emotional analytics, predictive analytics. Each of those subfields is a different decision consumer sitting on the same underlying signals. Until recently, each tended to build its own sensing because the alternative did not exist. The Sensing Engine changes the arithmetic. Dey’s (2001) one-sentence operational definition—that context is any information that can be used to characterize the situation of an entity relevant to the interaction—is the bound. Everything inside that bound is shared territory. Everything outside of it belongs to the individual decision system.

What does not change is the warning that comes with shared sensing. The sensing layer has to translate organizational signals into coaching conversations, not deliver corporate mandates through a friendly interface. A performance dip in the sales pipeline is legitimate input to a coaching session about objection handling. It is not legitimate input to an algorithmic-management pipeline that micro-monitors the rep until the next review. The same sensing infrastructure can feed either one. Which one it actually feeds is a design choice the architecture cannot make on behalf of the organization.

This is the point at which the Sensing Engine stops being a coaching feature and starts being a governance problem. Who sees the signals. Who decides how they translate into interventions. What the learner is told about what the system knows. Those questions have not yet been answered in the literature. They will be answered in practice, by the organizations that go first, and those organizations will set the pattern. The narrower claim here is that the architecture that makes good coaching possible is the same architecture that makes algorithmic management possible, and pretending otherwise is how the field sleepwalks into the second outcome while thinking it is building the first.

Sensing is infrastructure. Infrastructure has politics. We will not develop that argument further here. The flag is planted.

§10. The Coach Who Cannot See

The limitation was never epistemic. The field has known for a long time what expert coaching looks like and what it requires of the coach. Anderson, Corbett, Koedinger, and Pelletier (1995) argued, in the early years of cognitive tutoring, that adaptive instruction has to run on an explicit, updating model of what the learner knows. That was thirty years ago. The intelligent-tutoring literature has refined and extended that claim in every decade since. The sensing problem has been sitting in plain view for decades, waiting for the compute and the models to catch up. They have now caught up.

Dehaene and Changeux (2011) documented that even in the brain, the moment a signal crosses into conscious access is a physical event with measurable neural signatures. Thresholding is not a metaphor. It is a design problem with a measurement tradition. The design problem for the Sensing Engine is the same design problem that cognitive neuroscience, signal-detection theory, context-aware computing, activity recognition, affective computing, and intelligent tutoring have each worked on for their own reasons. None of it had to be invented. All of it had to be assembled, in a learning context, at a fidelity the learner-facing systems had never previously supported.

A coach without sensing is not a coach. It is a script with a conversational tone. It can be helpful for some learners some of the time: a well-engineered single mode, as Kestin et al. (2025) showed, can already outperform in-class active learning in well-scoped contexts. That is not a future claim. It is a 2025 claim. What the field has not yet done is move from one mode well-instrumented to many modes continuously re-instrumented, which requires a sensing layer that a single-mode system does not need.

There are two possible futures for the organizations that deploy AI coaching at scale in the next five years.

In the first, the sensing layer is the center of the architecture. Coach and simulation are two consumers of one engine. Every interaction feeds the model. The model becomes denser, more calibrated, more individualized. Failure modes get detected before they metastasize. The learner experiences something that feels like the elite tutoring she was never going to get, because the machinery that made that tutoring effective is finally available at scale.

In the second, the sensing layer does not exist. Conversational polish is layered on top of unchanged data pipelines. Coaches are deployed that cannot read their learners. Simulations are deployed that cannot hold coherent context across sessions. The failure modes cataloged earlier arrive on schedule and are blamed on the learner, because the system has no visibility into what it is missing. The investments mount and the outcomes do not. The gap between this future and the first one comes down to whether the organization treated sensing as a feature or as the thing.

One of those futures will happen. It is already being decided.

References

Abowd, G. D., Dey, A. K., Brown, P. J., Davies, N., Smith, M., & Steggles, P. (1999). Towards a better understanding of context and context-awareness. In Handheld and Ubiquitous Computing (HUC ’99), Lecture Notes in Computer Science 1707, 304–307. Springer.

Anderson, J. R., Corbett, A. T., Koedinger, K. R., & Pelletier, R. (1995). Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2), 167–207.

Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6), 4–16.

Bulling, A., Blanke, U., & Schiele, B. (2014). A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys, 46(3), Article 33, 1–33.

Conati, C., & Maclaren, H. (2009). Empirically building and evaluating a probabilistic model of user affect. User Modeling and User-Adapted Interaction, 19(3), 267–303.

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.

Dehaene, S., & Changeux, J.-P. (2011). Experimental and theoretical approaches to conscious processing. Neuron, 70(2), 200–227.

Dey, A. K. (2001). Understanding and using context. Personal and Ubiquitous Computing, 5(1), 4–7.

Di Mitri, D., Schneider, J., Specht, M., & Drachsler, H. (2018). From signals to knowledge: A conceptual model for multimodal learning analytics. Journal of Computer Assisted Learning, 34(4), 338–349.

D’Mello, S., & Graesser, A. (2012). Dynamics of affective states during complex learning. Learning and Instruction, 22(2), 145–157.

D’Mello, S., Lehman, B., Pekrun, R., & Graesser, A. (2014). Confusion can be beneficial for learning. Learning and Instruction, 29, 153–170.

Green, D. M., & Swets, J. A. (1966). Signal Detection Theory and Psychophysics. Wiley.

Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In P. A. Hancock & N. Meshkati (Eds.), Human Mental Workload (Advances in Psychology, Vol. 52, pp. 139–183). North-Holland.

Helmreich, R. L., Merritt, A. C., & Wilhelm, J. A. (1999). The evolution of Crew Resource Management training in commercial aviation. International Journal of Aviation Psychology, 9(1), 19–32.

Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms in-class active learning: An RCT introducing a novel research-based design in an authentic educational setting. Scientific Reports, 15, Article 17458.

Klein, G. A. (1998). Sources of Power: How People Make Decisions. MIT Press.

Lang, C., Siemens, G., Wise, A., & Gašević, D. (Eds.). (2017). Handbook of Learning Analytics (1st ed.). Society for Learning Analytics Research.

Pardos, Z. A., & Heffernan, N. T. (2010). Modeling individualization in a Bayesian networks implementation of knowledge tracing. In User Modeling, Adaptation, and Personalization (UMAP 2010), Lecture Notes in Computer Science 6075, 255–266. Springer.

Picard, R. W. (1997). Affective Computing. MIT Press.

Shute, V. J. (2011). Stealth assessment in computer-based games to support learning. In S. Tobias & J. D. Fletcher (Eds.), Computer Games and Instruction (pp. 503–524). Information Age Publishing.

Shute, V. J., & Ventura, M. (2013). Stealth Assessment: Measuring and Supporting Learning in Video Games. MIT Press.

Simons, D. J., & Chabris, C. F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception, 28(9), 1059–1074.

Swets, J. A. (1973). The relative operating characteristic in psychology. Science, 182(4116), 990–1000.

VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221.