One of the (relatively few) things I hate about marking student work is the sinking feeling when a suspicion of plagiarism starts to form; the moment when something like an abrupt switch of style or changes in spelling or the wording of a phrase makes you look over the pages you’ve already read and start spotting more such possible indications. Turnitin, as we all know, is moderately useless – it generates lots of false positives (well-formatted bibliography entries and properly referenced quotes) while missing stuff that can be found with a quick google – so this suspicion implies the need to invest a load of time in identifying and checking likely sources and marking up the exercise, while experiencing a general feeling of annoyance and disappointment.
Well, I’ve now found something worse: the suspicion that a substantial portion of an essay may be fake. The alleged quotation from, say, Cicero, that doesn’t actually sound much like Cicero; sometimes with a precise reference that doesn’t lead to anything like the quote, sometimes with an unhelpful reference (Republic 20, anyone? Check 1.20, 2.20, 3.20; check 1.XX (=33) to see if that’s it; check p.20 in the edition cited; check p.20 in any other edition to hand…), sometimes with no reference at all. Google key phrases; google key terms; search sections of the text where the topic might be discussed. All of this in the knowledge that it’s impossible to prove a negative, that there could be a bona fide translation out there that hasn’t been digitised so wouldn’t show up, so negative results don’t prove anything. The one thing to be said is that years of experience in hunting alleged Thucydides quotations have given me some relevant skills, but also a sense of their limits.
The ‘glass half full’ approach to all this is to think that at least the students in question have grasped the importance of engaging with ancient sources and supporting their argument with evidence; they’ve just skipped the usual step of finding some actual passages in the sources that make the relevant point, and have simply made them up.
Or: have had them made up – not as deliberate fakes, because why on earth would anyone do that in this situation, but they’ve outsourced the writing of the essay to someone, or something, that’s indifferent to questions of actual truth. My suspicions in this direction were aroused by a bibliography that included a load of stuff that I hadn’t recommended and didn’t recognise – which is of course usually grounds for praise, that the student has gone off and done their own research.*
Okay, suppose you were marking an essay on how the Romans came to terms with autocracy in their political thinking, and came across the following:
Morley, N. (2010) “Tacitus and the ‘state of exception’: how the institution of the dictatorship legitimised the Principate”, Polis: the journal of Greek and Roman political thought 14: 3-47
Relevant topic; credible journal; not exactly one of Morley’s specialist topics, but he writes all sorts of random stuff, and the implied use of modern political theory is entirely typical; nothing to worry about. Except that of course it’s completely fake; no such article exists. I’ve made up this example (but I really ought to write it some time…) as I’m not sure about the ethics of quoting the actual imaginary references from a dubious student essay, but I can assure you that they were equally convincing at first glance: the sorts of scholars you’d expect to be writing on this topic, publishing where you’d expect them to publish.
The good news is that such fake references are much quicker and easier to verify than the quotes are: check the journal’s archive, check the webpage of the supposed author. If you feel like offering the benefit of the doubt, check the title on Google Scholar. What I find remarkable – the reason I didn’t at first question these entries in the bibliography – is the level of knowledge required to create a plausible fake. You need a sense of the names of people likely to work on such a topic, and of the likely journals they’d publish in, and of the form and content of article titles. If you’ve done that amount of reading, why not just write the bloody essay properly?
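For the programmatically inclined, that first-pass check can even be scripted; a minimal sketch in Python, querying the public Crossref database of published articles (the function name and the crude word-overlap test are purely illustrative, not a polished tool):

```python
# A minimal sketch of automating the first-pass check against the public
# Crossref API (api.crossref.org). The helper name and the word-overlap
# threshold are illustrative choices, not an authoritative method.
import requests

def reference_exists(title: str, author: str) -> bool:
    """Ask Crossref whether anything plausibly matching this citation exists."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": f"{author}, {title}", "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    wanted = set(title.lower().split())
    for item in resp.json()["message"]["items"]:
        found = set(" ".join(item.get("title", [])).lower().split())
        # Crude test: does a returned title share most of its words with ours?
        if wanted and len(wanted & found) / len(wanted) > 0.8:
            return True
    return False

# The made-up Morley/Polis article from above duly fails the check:
print(reference_exists(
    "Tacitus and the 'state of exception': how the institution of the "
    "dictatorship legitimised the Principate",
    "Morley, N.",
))  # -> False
```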
Well, obviously one possibility is that the student hasn’t done that reading. It seems plausible, if not likely, that this is a Large Language Model (i.e. an ‘AI’ chatbot) scanning the vast corpus of published words and constructing a plausible-seeming reference on a statistical basis: X is a scholar regularly associated with topics in ancient political thought (not necessarily differentiating between Greek or Roman), Polis is a journal that publishes articles on ancient political thought, this string of numbers is associated with something published in Polis, this string of words resembles the strings of words in quote marks associated with Polis, and so forth.
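To see how little actual knowledge this requires, here’s a deliberately crude toy sketch in Python – nothing remotely like the real internals of a language model, just random recombination of plausible components, all of them assumed for illustration – which will happily produce references every bit as convincing at first glance:

```python
# Toy illustration only: recombining statistically 'plausible' parts into
# a citation that never existed. Every component list below is invented
# for the example, not drawn from any real model or dataset.
import random

authors = ["Morley, N.", "Atkins, J.", "Hammer, D."]
journals = ["Polis", "Classical Quarterly", "History of Political Thought"]
phrases = ["Tacitus and the 'state of exception'",
           "legitimising the Principate",
           "Cicero on autocracy",
           "dictatorship and republican liberty"]

def fake_reference() -> str:
    """Assemble a convincing-looking but entirely fictional citation."""
    title = ": ".join(random.sample(phrases, 2))
    volume, start = random.randint(10, 60), random.randint(1, 40)
    return (f'{random.choice(authors)} ({random.randint(1995, 2022)}) '
            f'"{title}", {random.choice(journals)} '
            f'{volume}: {start}-{start + random.randint(10, 50)}')

print(fake_reference())
# e.g. Morley, N. (2010) "Tacitus and the 'state of exception':
# legitimising the Principate", Polis 14: 3-47
# -- plausible in every part, real in none.
```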
I was suddenly reminded of the Nutri-Matic machine in Douglas Adams’ The Hitch-Hiker’s Guide to the Galaxy (the original radio series, obviously).
The way it functioned was very interesting. When the Drink button was pressed it made an instant but highly detailed examination of the subject’s taste buds, a spectroscopic analysis of the subject’s metabolism, and then sent tiny experimental signals down the neural pathways to the taste centres of the subject’s brain to see what was likely to go down well. However, no one knew quite why it did this because it invariably delivered a cupful of liquid that was almost, but not quite, entirely unlike tea.
One might apply that judgement equally to the bibliographic entries and to the essay as a whole: looks good, but undrinkable. To extend the metaphor: if as a marker you just glance at an essay, rather than properly tasting/testing it, you might be fooled – but the proper level of scrutiny implies a substantial amount of extra time, and a permanent bad taste in the mouth.
The alternatives, given that ChatGPT has apparently got better at faking references in just the couple of months since my last discussion of it, are equally unpalatable. One is a reversion to traditional unseen in-person short-term-memory-regurgitation exams, on the basis that they’re harder to fake even if they don’t actually assess what we’re supposed to be assessing. Another is outsourcing assessment to other ‘AI’ programmes; well, I tried the two best-known ones, ZeroGPT and OpenAI’s own AI classifier, and both estimated the essay to be only 25% AI-generated (‘this was probably written by a human’), NOT flagging up the fake bibliography entries. They look genuine to an LLM, so another LLM isn’t going to quibble. The future will see Chatbots serving up glasses of slurry and other Chatbots confirming that they are indeed perfect cups of tea.**
The obvious risk is that the conversation becomes wholly driven by ‘How do we stop students cheating by using Chatbots?’ and/or ‘How do we set things up so we don’t invest lots of time trying to establish whether or not a Chatbot may have been used?’: wholly defensive, rather than focusing on the best way to assess the skills and knowledge we want to assess. I’ve been using ChatGPT content in classes recently as the basis for helping students develop critical analysis – here is a plausible-looking bit of academic prose, now identify its deficiencies – and at the same time exploring the underlying processes insofar as we can identify them (why is it generating occasional howlers within the usual mass of vague assertions?). This could be a way forward in assessment as well: focus more, at least in the first couple of years of university, on evaluating critical skills (just as we already use source analysis of individual passages to build towards using lots of evidence to support arguments).
At the moment, you could say, we ask history students to focus on producing simulacra of academic arguments; 2000-word imitations of the sorts of stuff they’re reading, but unavoidably thin and watery at best. It’s partly, I guess, a kind of ‘learning by doing’, on the assumption that they will pick up the skills and understanding required for genuine high-level historical analysis by repeatedly trying and failing to put them into practice – rather than focusing on teaching and assessing those ‘component’ skills first. Is there a fear that they will be bored or alienated by being asked to do detailed exercises rather than getting to pontificate about the Causes of the Fall of the Roman Republic in 2000 words? An old-fashioned assumption that, whatever we say in public, history is really all about content and old-fashioned sweeping surveys and narrative – or at any rate, that’s what we want to teach ‘cos it’s more fun?
If ChatGPT can produce plausible, passable equivalents of the sort of ersatz historical analysis we get first- and second-year students to generate, AND those exercises are not wholly reliable means of getting those students to the level of producing genuine historical analysis, then what is the point? Especially if we are expected, for credentialling purposes, to devote serious time to trying to determine whether a given piece is human-generated mediocrity with dubious references or its AI-generated equivalent?
The basic problem – among the basic problems – is that, while it’s relatively straightforward to determine that quotations, let alone references, are fictional, it isn’t obvious how to demonstrate whether the author is human or machine. For the specific essays I’ve been reviewing, we decided in the end that this didn’t matter; they’re the first part of a two-stage assessment, in which students get feedback on their first draft and revise the work for resubmission – and so there would be little point in requiring them to re-do the draft for a capped mark (the usual penalty for academic malpractice short of definitively-proven cheating on a grand scale), which I’d then have to spend time marking, when they can instead be expected to sort out these issues in the revised version, i.e. do what they would have to do anyway but make it very clear that they’d recognised the problem and how to remedy it.
Obviously this doesn’t work for a one-off assessment where the question is whether or not to penalise, depending on whether guilt can be established. Which is perhaps another reason for focusing assessment on the development of actual skills, including learning from feedback, rather than just the award of marks. ‘This is an undrinkable cup of tea; go away and do this to make a better one’, rather than just throwing the cup at the machine. Share and enjoy!
*So long as it’s not another Canadian Masters thesis or US undergraduate thesis, things which appear ever more frequently as students respond to our attempts at setting less generic questions, to encourage independent thinking, by looking up the limited number of things on the Internet whose titles suggest they cover exactly the same ground, regardless of age or credibility…
**Would an LLM flag up this gem, produced by ChatGPT? “Socrates was among those who were accused of opposing the Thirty Tyrants and was forced to drink hemlock in 399 BCE as a result.”
If I were one of your students, I’d be placing bets on whether or not you would spot parts of essays written by ChatGPT, inserted intentionally to test whether you’re just a normal human professor or some sort of superhuman with powers to identify the author of any sentence or paragraph.
That would be unwise of them…
One of the problems with assessment in UKHE at the moment is that lecturers are simultaneously under pressure to provide more and better feedback, a good thing in itself but one that only really makes sense considered formatively – what can I do better next time? – and to cut assessment to the minimum, with the result that there’s no such thing as a formative assessment. (My last employer’s UG regs calculated final grades on the basis of L6 (year 3) marks or L6 and L5 – so the last time students get formally assessed in a way that doesn’t “matter” is the end of year 1.) Your two-stage setup sounds like an excellent idea, as long as your time allocation for marking is scaled up accordingly.
Sorry to hear about ChatGPT’s invented sources. In my last round of marking – round about a year ago – there was already a 90:10 situation developing, with hugely disproportionate amounts of my time spent on chasing up obscure sources in a small minority of essays. I talked about this here once before, or thought I did…
[pauses, ironically, to spend ten minutes looking through old comments on this blog, without success]
…never mind. Basically I had a few smart students who thought they’d impress me with a referenced quote on Heydrich (say) and used something out of the first interesting-looking thing Google Scholar threw up, even if it was a study of the psychodynamics of West German cinema (‘Heydrich’s posture was “symbolically erect, libidinally charged” (Smith 2003)’) or a paper several layers deep in an exchange of polemics between two historians (so “Viewed through Jones’s reductive lens, Heydrich becomes a figure of comical insignificance” gave us ‘Heydrich was “a figure of comical insignificance” (Davis 1987)’). I mean, not wrong, but… actually it is wrong, but in ways that this margin is too narrow to explain.
Papers that Don’t Even Bleeding Exist may be less of a time-sink – particularly if the university can be brought to take a strong enough line on them. Although as a criminologist (sort of) I do have to point out that punishment doesn’t work as a deterrent – what does work is the fear of getting caught. So it’s over to you, I’m afraid.
Thank you! I don’t recall hearing these comments before.
Since it’s the only assessment in the module, the two-stage marking would be adequately compensated if I wrote the normal amount of feedback for a summative essay on the draft; because it’s formative, I actually write substantially more, but on the other hand it’s generally much more fun and interesting, and feels more productive, so I don’t begrudge it.
The key problem is clearly that of proving that a given piece of work was AI-assisted – which seems very likely in the case of elaborate fictional references, but I have now encountered the claim that this can all be explained innocently – and when it comes to bland, unexceptionable prose I really don’t know what could be done other than lengthy vivas, which are only likely to be helpful in interrogating a suspiciously good performance rather than a poor/mediocre-but-with-weird-anomalies one. Maybe the latter don’t matter so much, since any possible AI assistance is not exactly giving a huge unfair advantage, but that means it’s not being officially identified, let alone specifically punished, and hence your deterrent effect is absent. I guess I just need to carry on talking to my students about it.