Last fall at NYU Stern, thirty-six students filed in for the final exam in a course on AI product management. There was no blue book. There was no essay. Each student spent about twenty-five minutes talking to a voice AI that asked follow-up questions and scored the answers. Total cost to the professor: fifteen dollars. That works out to forty-two cents a head. One undergraduate later told a reporter the format was "awkward," like "talking to what was pretty much a blank screen." On the post-exam survey, 83 percent said the experience was more stressful than a traditional test. Around 70 percent also called it the most honest measurement of their understanding they'd ever taken (The Decoder, 2025).
That tension, more anxiety on one hand and more honesty on the other, is the implicit pitch behind the biggest shift in college assessment in decades. Writing assignments are suspect now. ChatGPT can do them. So Cornell, UPenn, NYU, and UC San Diego are spinning up oral exam programs. Clay Shirky, NYU's Vice Provost for AI and Technology in Education, has called for a "medieval" return to in-person, face-to-face demonstration of learning (Fortune, 2025). Professors in the trenches agree. Chris Schaffer, a biomedical engineer at Cornell, added an oral defense requirement last semester and put it bluntly to a reporter: "You won't be able to AI your way through an oral exam" (San Diego Today, 2026). It sounds bulletproof.
It isn't. The oral exam revival does solve a real problem (a teacher who receives a polished, AI-smoothed essay has no reliable way to evaluate the thinking behind it), but it does so by creating a different one. Oral exams don't measure thinking. They measure the ability to perform thinking under pressure, on demand, often in a second language, to an authority figure, through vocal cords that may or may not cooperate on a given day. Those aren't the same skill. Pretending they are is how we end up punishing the students who were already losing under the old system.
The research on what oral exams actually measure
Shirky, to his credit, has already conceded the core problem. "Timed assessment may benefit students who are good at thinking quickly, not students who are good at thinking deeply," he wrote in the same essay arguing for oral exams (Fortune, 2025). The caveat usually gets folded into a logistics line (how do you give oral exams in a lecture of four hundred?) and then dropped. But it isn't a logistics problem. It's the pedagogical core of the objection. Ray Hembree's foundational 1988 meta-analysis of 562 studies found that test anxiety depresses performance, inversely tracks self-esteem, and varies by ability, gender, and grade level (Hembree, 1988). A thirty-year follow-up confirmed the association hasn't weakened. If anything, the effect is stronger in high-stakes evaluative settings (von der Embse et al., 2018). Oral assessments reliably elicit more anxiety than written ones. In the second-language literature the finding is even starker: oral performance is more anxiety-sensitive than any other modality, and high-stakes speaking tasks produce the largest anxiety spikes in English-language learners (Liu & Yan, 2025). The Foreign Language Classroom Anxiety Scale, the workhorse instrument in the field since Horwitz's 1986 paper, is so oral-biased that researchers have had to build new scales just to measure written anxiety cleanly.
What this means in practice: an oral exam doesn't just add some honest friction that filters for the students who "really know it." It filters for a specific cognitive profile. Fast retrieval under surveillance. Comfort improvising in the evaluator's primary language. Tolerance for being looked at while thinking. Some students have that profile and the subject knowledge behind it. Some have the subject knowledge and not the profile. The exam can't tell them apart.
The students who quietly pay
Consider who walks into an oral exam carrying a handicap the rubric doesn't name.
English-language learners are the clearest case. The ACTFL guidelines treat oral proficiency as its own construct, distinct from reading and writing competence, because the research says it is. A student can write lucid, argued English prose and still freeze when asked to produce the same thought in real time. Anxiety research on L2 speakers shows speech rate drops, accent thickens, and comprehensibility ratings fall under evaluation pressure. Graders register all of this, often unconsciously, as weaker understanding (Saito et al., 2025). In a district where 20 or 30 percent of students are English learners (which describes a lot of American public schools), converting a writing assessment into an oral one is, quantitatively, a redistribution of grades away from those students.
"Speaking is often regarded as the most anxiety-provoking modality in second language performance." Annual Review of Applied Linguistics on L2 anxiety (Gkonou et al., 2017)
Students with speech-related disabilities are another group. Stuttering is explicitly recognized under the ADA as a qualifying disability when it substantially limits speaking, and recommended accommodations include "written or typed responses in lieu of continuous speaking" and "alternative assessment formats" (ASHA, 2024). A school that defaults to oral examination is, procedurally, asking every student who stutters (along with every student with selective mutism, social anxiety disorder, or certain autism-spectrum profiles) to either disclose and request an accommodation or perform through the impediment. Neither is a reasonable tax to levy on a student who can write a good essay.
The quieter population is introverts and socially reticent students. Research distinguishes introversion from shyness, but both groups participate less in oral settings, and teachers systematically rate lower-participation students as less capable even when actual assessment performance is controlled for (Caspi et al., 2006). "Educators should avoid relying exclusively on oral presentations or timed verbal assessments," one recent secondary-school review concludes, because doing so reliably underestimates the capabilities of a quarter to a third of the class (Condliffe et al., 2023).
The grader problem no one wants to talk about
Even if every student walked in with identical nerves and identical fluency, oral assessment has a second failure mode: the grader. Written essays can be anonymized. An oral exam can't.
Medical education gives us the cleanest evidence on what happens when evaluators have to assign real-time performance grades. A review covering more than 107,000 students across up to 113 medical schools found statistically significant racial and ethnic disparities in clerkship grades, which is the closest analogue in professional education to an oral performance evaluation (Low et al., 2019). Three studies of nearly 95,000 written evaluations found that the same students get systematically different adjectives depending on race and gender: fewer "exceptional" and "outstanding" descriptors for women and underrepresented minorities, more hedging language, more faint praise (Teherani et al., 2018). These weren't written exam grades. They were clinical performance assessments, the same kind of live, evaluator-present judgment that Cornell and UPenn are expanding.
Stereotype threat is the other piece. The classic Spencer, Steele, and Quinn (1999) findings on women's math performance under threat conditions have been replicated, challenged, and narrowed in the years since. The current consensus is that the effects are real but more contextual than the early lab studies suggested. What hasn't been challenged is that live, high-stakes, evaluator-present assessment is exactly the condition where threat effects tend to surface. An oral exam is almost the ideal stimulus for the mechanism the literature describes.
When a teacher reads an essay, bias still operates, but it operates against a stable text that can be re-read, anonymized, cross-graded, or appealed. In an oral exam, the evaluation happens in the same room and the same second as the performance. There's no artifact to recheck. If a student wants to appeal, the best they can produce is their own recollection of the conversation against the professor's.
What oral exams are good for
None of this is an argument that oral defense is worthless. In the right place it's extremely valuable. A graduate thesis defense, a capstone project presentation, an oral component that follows and supplements a written product: these are well-designed because they test something the writing alone can't, namely the student's ability to hold the argument under pushback, revise in real time, and acknowledge what they don't know.
The problem is the slide from "oral exams are good for some things" to "oral exams are the answer to AI cheating." The first is a claim about the format's proper niche. The second is a claim that because ChatGPT broke take-home writing, we should replace take-home writing with live performance, and that students who do worse in the new regime must have been cheating in the old one. They weren't. They were writing essays. Some of those essays are now suspect because of the tool, not the student. The collateral damage is not evenly distributed.
It's also worth saying what oral exams cost the teacher. Panos Ipeirotis, the NYU professor who ran the AI voice exam, was explicit about why he automated it: human oral exams for thirty-six students would have taken a research week and cost around seven hundred and fifty dollars in grader time (The Decoder, 2025). The human version does not scale to a hundred-student lecture, let alone a high school English department. So the realistic future of "oral exams as AI-proofing" isn't a thoughtful Socratic dialogue. It's a student talking to a voice agent for twenty minutes and getting scored by a model. We have replaced "evaluating writing a student may not have written" with "evaluating speech a student gave to a robot." It isn't obvious this is progress.
The thing oral exams are reaching for, and can't deliver
The deeper reason oral exams are appealing right now is that they promise something every teacher wants: visibility into the thinking behind the product. You want to see the student reason in real time. You want to hear them struggle. You want evidence they own the answer.
That's the right goal. The catch is that oral performance isn't the only way to see thinking, and it isn't the best one. Writing is thinking. That has been the entire pedagogical premise English teachers have defended for a century, and when a student drafts, revises, and revises again, the drafting process already contains all the evidence you need. The problem isn't that writing doesn't show thinking. The problem is that the only artifact we've been collecting is the final essay, which (with AI) is no longer a reliable proxy for the process.
Process-visible writing, where the keystrokes, the revision timeline, the planning notes, and the false starts are part of what gets submitted, turns a take-home essay back into evidence of learning without replacing it with a performance. It keeps writing instruction intact. It doesn't tax the students who think deeply but slowly, or write fluently in English but speak hesitantly, or have a stutter, or freeze in front of an authority figure. It just asks the question the oral exam is really trying to ask, in a modality that doesn't punish the wrong people.
If the reason we're going medieval is that we can no longer tell whose thinking an essay represents, the answer is not to stop assigning essays. It's to stop pretending that a clean final draft was ever the only thing worth looking at.

