Ensippification, part 3
Teaching to the (Sip) Test
This is the 3rd installment in an arc that looks at the way that the sip test (exemplified by the Pepsi Challenge) provides an analogy for developments over the past couple of decades across discourse and media:
It’s pretty easy to look at the past twenty years or so of internet development, and see it as an account, in part, of the gradual recession of book-based literacy. We’ve moved from relatively static websites to blog entries to paragraph- and sentence-length social media posts to SMS, emojis, and photo/video captions. And those transitions correlate to a steady erosion of both individual attention and its relative cultural value. There’s all sorts of discourse bemoaning the notion that school-aged children can’t even sit still long enough to watch a video, much less read a book with even grade-level comprehension1.
As our media ecology has evolved to account for these shifts, it’s difficult not to feel as though this is a story of decline. As our preferred channels provide less and less room for context and nuance, there are those who’ve intentionally exploited this fact, and continue to do so for partisan purposes. As an antidote, I find Parker Molloy really useful in this regard, as someone who looks closely at media coverage and often painstakingly reconstructs context to look at the ways that our news is transformed and distorted before it even reaches us. I’m thinking specifically about this post from a few weeks ago about Chicago Alderwoman Maria Hadden, where Molloy demonstrates how Fox News and friends strip her comments out of context, distort them, and then send them circulating to the point that they bear no resemblance to actual events. It’s deeply cynical manipulation, to be sure, but it’s also enabled (and even incentivized) by the changes in medium I talked about in my last post. If you’re reducing the story (or the beverage) to a single sip, you need to ensure that taste is as intense as possible (regardless of whether or not it’s true).

But I don’t want to spend this episode rehashing arguments from part 2 about social media. Instead, I’d like to think in a little more depth about literacy itself. I’ll probably talk more about this in coming months, but one of the questions that the post-literacy discourse raised early on for me was methodological. What does it mean to say that literacy is in decline, what evidence might we use to verify (or disprove) such a claim, and is that evidence even possible to gather?2
A little more than a month ago, the NYT offered one answer. In the words of Kevin Roose, they set up a baitwidget—sorry, “a blind taste test to see whether NYT readers prefer human writing or AI writing.” Set aside the fact that neither category is coherent enough to be tested3, and the fact that we don’t typically encounter texts in this way (outside of these sorts of artificial contexts), and we’re still left with a “test” that gets it fundamentally wrong. As Max Read writes, these sorts of quizzes misunderstand4 why people take them in the first place:
It’s not that people see two paragraphs, prefer one based on its quality, and then attribute it to humans based on that preference. It’s that they see two paragraphs, attribute one to human authorship based on style, and then prefer the one they’ve attributed. What’s at stake when taking these tests isn’t quality or beauty or clarity, but style; not “which one is better,” but “which one sounds more like an L.L.M.?”
Read goes on to describe the point of these quizzes in terms that resonate with my own medium/message discussion from last episode. Just as Pepsi was/is a beverage engineered to “win” its own Challenge, Read asserts that “Quizzes like the Times’ are games that L.L.M.s are designed to excel at. A.I. writing is literally optimized to be the writing most people prefer in A/B preference tests.” In other words, the claim that these tests demonstrate folks’ preference shouldn’t carry much more weight than a Pepsi commercial. Read suggests an explanation as well for why so many people are willing to participate: “I suspect we’re training ourselves, taking the tests to measure and adjust our own heuristics for distinguishing A.I. text from human writing.” (He also acknowledges the likelihood that tests like these provide data for companies looking to evade detection.)
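For anyone curious about the mechanics behind Read’s claim, here’s a minimal sketch of pairwise preference modeling, the generic mechanism by which chatbot output gets tuned toward A/B wins. The function and numbers are illustrative only, not any particular lab’s pipeline:

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry model: probability that a rater picks sample A
    over sample B, given a scalar 'reward' score for each sample."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

# A reward model is fit so that texts raters picked in A/B trials earn
# higher scores; the language model is then tuned to maximize that
# score. The style that wins the most "sips" wins the training run.
print(round(preference_probability(1.2, 0.4), 2))  # 0.69: A usually wins
```

The structural point is the same as the Pepsi Challenge: the training signal is literally “which of these two did you pick,” a sip test all the way down.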
As with the example from Molloy above, the broader issue for me is that question of context. As a writing teacher, I’m fairly confident in my ability to detect writing that misrepresents a particular student, as long as I have a certain amount of baseline context to work from. If I’ve read someone’s writing and spoken and interacted with them (and it doesn’t take that long), there’s a good chance that I’ll be able to tell if they’ve used Ai. It’s not magical, nor a matter of maintaining an updated list of chatbot tells (em-dashes! metanoia! off-tune metaphors!). I don’t even know that it’s particularly skillful on my part—I certainly don’t think it’s something that I’d brag about. It occurs to me that what I’m thinking about as context is also sometimes described as an author’s voice, which isn’t entirely wrong. I’d allow it only insofar as I’m allowed to define voice in a particular way.
When I wrote about voice (2 years ago now), I was taken with the idea of describing it as iridescent, by which I meant that “It depends entirely on the moment you encounter it, and it shifts as you give it…dedicated attention.” Authorial voice isn’t a characteristic of writing itself but the qualitative residue of context that infuses our writing. We have some control over it, certainly, but substantial parts of it are invisible to us as well, and they emerge only when our writing comes into contact with a reader who inhabits a different context. Voice is frictional as well as contextual.
I just came across an interesting piece, “The Social Edge of Intelligence,” by researcher Bright Simons, who shares and discusses the following study5. In the UK, researchers asked a few hundred writers to produce short fiction; some were asked to use Ai while others were not. “Which stories, the researchers wanted to know, would be more creative? On average, the writers with AI help produced stories that independent judges rated as more creative than those written without it.” Simons notes that this sounds like another run-of-the-mill pro-Ai argument. But the researchers went further:
when the researchers examined the full body of stories rather than individual ones, the picture became murky. The AI-assisted stories were more similar to each other. Each writer had been individually elevated; collectively, they had converged.
(I wish I’d bookmarked it6, but I also came across a Substack note about the “house style” that’s emerging among high-volume Substack writers, particularly those who’ve turned to Ai tools to support their work. While each of them has taken the “individual gain” plunge, the “collective loss” is beginning to emerge for those of us who read them. See also Taylor Lorenz’s recent piece that attempts to pin down “How Much of Substack is Actually Ai?”…)
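To put Simons’s individual-gain/collective-loss pattern in the starkest possible terms, here’s a toy illustration. Every number and premise below is made up; this isn’t the study’s data or method, just the shape of the finding:

```python
import statistics

# Four stories per group, each tagged with a (fictional) judge score
# and a one-word premise standing in for its subject matter.
unassisted = [(3.1, "ghost"), (2.8, "heist"), (3.4, "divorce"), (2.5, "war")]
assisted = [(3.9, "ghost"), (4.0, "heist"), (3.8, "ghost"), (3.7, "ghost")]

for label, stories in (("unassisted", unassisted), ("assisted", assisted)):
    avg = statistics.mean(score for score, _ in stories)
    distinct = len({premise for _, premise in stories})
    print(f"{label}: mean score {avg:.2f}, distinct premises {distinct}")

# unassisted: mean score 2.95, distinct premises 4
# assisted: mean score 3.85, distinct premises 2
```

Every assisted writer scores higher, and the shelf gets emptier.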
I’ve wandered a bit off the path, but only a bit. There’s a certain amount of value to these tools, arguably, when we frame them in terms of automating repeatable tasks. But of course, the key word in there is repeatable, along with the question of who has the right to apply it. From the perspective of the c-suite, customer service is one such task, which is why it’s getting increasingly difficult to speak to a human. From individual customers’ points of view, getting cattle-chuted through a push-button menu that may or may not address their issues is neither desirable nor repeatable7. Imagining tasks as repeatable invites us to decontextualize them, to strip away the narratives held by the people who perform them.
Returning to Nguyen’s theory of value capture, one of the most obvious (and potentially universal) examples is the gradual takeover of education by testing. By no means am I an expert in this particular narrative. I only vaguely remember growing up with the ITBS (the Iowa Test of Basic Skills), taking the PSAT, ACT, SAT, GRE, et al. In fact, it wasn’t until I got to college that it began to dawn on me that these tests were anything other than an occasional (and putatively objective) assessment. I was always pretty good at them, particularly on the math side of things, so it simply didn’t occur to me to convert these metrics into targets. The idea of paying someone money to help improve my score seemed outlandish to me, until I had friends from the East Coast who assured me that it was in fact “normal,” at least where they were from.
For reasons worth considering, the test preparation market today is estimated at nearly $125 billion, and that’s to say nothing of the market for fake application essays or consultants providing the wealthy with ways to evade the admissions process altogether. (Yes, that’s billion with a B. Annually, we spend nearly half the Dept of Education’s entire budget on ways to game the system.)
Like I said, I’m no expert, but it’s interesting to look at the history of this industry and the ways that it’s gradually ensippified literacy. It’s instructive, I think, to consider how tools like the SAT have changed over the years and how the test has narrowed in its assessment of literacy, reducing it to the ability to answer certain types of questions about decontextualized short passages. This shifts the focus of tests like these to a bounded set of strategies8, of the sort that test prep courses will drill into their charges. Another piece of this is state and federal educational policy, which multiplied the number and frequency of tests. Those policies also tied teacher contracts and school budgets to the results9, scaling up the intensity and impact of the stakes involved. In the name of “accountability,” both parties have spent the past couple of decades revising our entire education system to revolve around sip tests.
Think of it as the New Coke of literacy instruction. Peter Shull writes,
Many will blame our movement to a ‘post-literate’ culture on smart phones and social media, but I see the seeds of something we planted in 2002—NCLB and the standardized testing and ‘mastery’ movements—coming to fruition. If you want to produce readers and a literate society, you don’t do it by training kids to pass tests, you do it by teaching them to love reading, and training kids to pass tests has the unsurprising consequence of making them hate reading.
The more that we’ve been captured by the dubious value of test scores, the more incentive our teachers and administrators have to engage in what Shull describes as EDI (explicit, direct instruction). EDI is the same principle behind cramming for a test—it can produce short-term gains that are offset by long-term consequences for the students themselves: anxiety, stress, and the unlikelihood that any of that knowledge will persist beyond the test. In another post, Shull explains that direct instruction is better described as training as opposed to teaching. While there are professions where training is vital, that practice is grounded in a very specific approach to knowledge that works well for motivated students and highly circumscribed contexts. In the looser, far more varied context of K-12 classrooms, training reduces students and teachers to data points. While this serves administrators and legislators, it comes at a steep price: as Shull puts it, “they traded rich curricula and trust in classroom teachers for short-term numerical gains and data’s ephemeral mirages of progress.”
Unlike the soda drinkers upon whom New Coke was forced, though, the generations of students made to choke down their test-centric educations didn’t have much recourse. Either they succeeded despite their ensippified instruction, contributing to the mirage that this system worked, or they failed, whereupon the system doubled down. Carl Hendrick characterizes this as “turn[ing] the description of the failure into the curriculum.” Hendrick’s analogy for this maps fairly neatly onto Shull’s:
The entire architecture of the randomised controlled trial assumes that “content instruction” is a sort of treatment: something that can be administered in a controlled dose, measured against a control group, and evaluated after a fixed period. But knowledge does not work like a treatment. It compounds. It accretes. It builds the very architecture through which future comprehension becomes possible. To measure it as though it were a pill is to fundamentally misunderstand the mechanism.
Despite decades and hundreds of billions of dollars wasted on the attempt, illiteracy is not a condition that can be “treated” out of a student (or a targeted demographic) by reducing their education to standardized test training exercises. It’s not a “condition” at all, at least in the datafied sense that these tests purport to “measure.” Treating it that way, as Hendrick observes, ends up eroding the very foundations upon which those skills depend, and embedding that loss into our culture. According to Shull, many of the worst alleged features of younger generations can be attributed (partially) to “the near-vacuum of ideas and historical understandings they’ve grown up in during our shift to ‘skills-based’ curricula.”
I’m sure I’ll revisit some of these ideas down the road, as I review some of the more prominent “post-literacy” ideas circulating, but thinking about value capture and ensippification has prompted me to consider how post-literate we’d already become, how much deeper this particular problem runs. I’ve got one more episode in this arc before I turn to lighter fare, and I’m hoping to get myself publishing a little more frequently now that spring semester is winding down. See you soon.
There are lots of examples of these complaints out there, but Vanessa Wingårdh has aggregated a number of them alongside educational reports.
These are such big questions that, honestly, it depresses me a little to even consider them. And I’ll tackle them, or at least chip away at them, down the road, most likely. But if you’d like a spoiler for my own opinion, here it is. I think the idea of literacy is too massive and varied to be articulated in any way that makes the post-literacy hypothesis viable, and yet, the vibes represented by these claims are indeed real, and worth exploring and resisting. I do think that we’re steadily losing certain cognitive capacities, but they’re really difficult to measure accurately, and if we wait until they are subject to such measurement, it’ll probably be too late. There’s no pithy way to make this argument, unfortunately. Call it the “you won’t know you’ve got it until it’s gone” frame.
Unlike two different cans of Pepsi, samples extracted from two different books are unlikely to resemble each other strongly enough to be characteristic of “human writing.” This is one of the most ridiculous assumptions that we’re asked to swallow in order to take such tests seriously.
And yes, I find it entirely plausible that folks at the NYT are intentionally misreading their little Turing test in the interest of engagement—where the point isn’t that the quiz proves anything, but that people will talk about it (and argue over whether it does), and thus drive more traffic to the NYT.
I haven’t read the study itself yet, but it’s located here: Generative AI enhances individual creativity but reduces the collective diversity of novel content
I’m leaving this as written, but I did locate and bookmark that note after I wrote this. The note is from Rick Lewis, whose wife is a professional editor and increasingly faced with authors who “just run [their manuscripts] through AI for grammar.” The result? “‘What they don’t know,’ she tells me privately ‘is that every one of these books sound exactly the same, even across varying subjects. They don’t know what AI is doing to the voice they think they have.’”
Those of us who teach writing may be alert here to the analogy with our profession (and administrative attempts/desires to replace our work with Ai). I’ve written about this before, the tendency for folks to consider “college writing” from the outside, mischaracterize it, and pontificate about its value from a highly limited perspective. I’ve come across a couple of recent pieces that embody this tendency quite well, and may write about them down the road.
When I got to college, I took a placement test for French, which I hadn’t studied at all for a few years after fulfilling the requirement in high school. Despite the fact that I didn’t understand most of the questions, to say nothing of the reading passages, I was able to place into French III because I knew how tests worked. I knew that there was no way I’d be able to hang in 3rd semester French, so I ended up taking Japanese instead.
I’d argue that cultural hostility towards teachers is another piece of this, most of which originates from one end of the political spectrum, although it’s been abetted by both.

