AI Detectors: A Proposed Way of Evaluating Efficiency and Accuracy
Disclaimers.
- The word "AI" in this article is used to refer to Large Language Models (LLMs) likes of ChatGPT, Gemini, Grok, etc..
- Any findings related to the previous research should not, and frankly cannot, be considered conclusive given today's funky digital landscape.
- Parts of this article may be opinionated and considered biased.
While I was digging through some older material trying to pick out a topic for this article, I stumbled upon a small research project I did in 2022 to evaluate just how well AI-generated text detectors work.
This one has a fun backstory behind it. One of these detectors marked my work as AI-generated once. And I am incredibly petty, so I tried to see how accurate they really are. Hence the previous research, this article and the intent to research the topic further.
My previous research was fundamentally flawed in a lot of ways, so... why not roast myself over it, learn from the mistakes, and propose a new way of conducting similar research later down the line?
But before that, let's talk about why this is important.
AI detectors are being employed at universities and companies to review student work, essays, documentation and whatnot. This is not a bad thing; in theory these tools are intended to preserve academic integrity, ensure fairness in evaluation and yada-yada-yada. However, the fields they are deployed in can make them an ethical nightmare. As Northern Illinois University's Center for Innovative Teaching and Learning puts it in their article:
... even a small error rate can add up. If a typical first-year student writes 10 essays, and there are 2.235 million first-time degree-seeking college students in the U.S., that would add up to 22.35 million essays. If the false positive rate were 1%, then 223,500 essays could be falsely flagged as AI-generated (assuming all were written by humans). That’s a lot of false accusations.
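Just to make the scale of that quoted arithmetic concrete, here it is as a trivial calculation (the numbers are taken straight from the quote above):

```python
# Reproducing the arithmetic from the quote: even a "small" false positive
# rate produces a huge absolute number of false accusations.
students = 2_235_000        # first-time degree-seeking college students (from the quote)
essays_per_student = 10     # essays per first-year student (from the quote)
false_positive_rate = 0.01  # 1% of human-written essays wrongly flagged as AI

total_essays = students * essays_per_student
falsely_flagged = total_essays * false_positive_rate

print(f"{total_essays:,} essays -> {falsely_flagged:,.0f} falsely flagged")
# 22,350,000 essays -> 223,500 falsely flagged
```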
Any machine learning system is prone to biases due to gaps in its training data or the nature of the data it was trained on. For example, researchers from Stanford University concluded:
GPT detectors frequently misclassify non-native English writing as AI generated, raising concerns about fairness and robustness. Addressing the biases in these detectors is crucial to prevent the marginalization of non-native English speakers in evaluative and educational settings and to create a more equitable digital landscape.
And what makes it even more ironic is that while these tools have the potential to penalize honest students, they can often be easily bypassed by those intending to cheat. This is something even my little 2022 research had shown: simply paraphrasing with automated tools sliced detection rates in half.
To understand why these kinds of mistakes happen, let's take a look at how LLMs and detectors work.

AI, AI, Generative AI, it uses AI to detect AI
While I have the opportunity, I'd like to disambiguate a little. Depending on the definitions you use, LLMs are not AI, at best they are what would be called a "narrow AI". And for these words I shall pay when AI eventually overthrows us.
At their core, all these advanced models are nothing more than incredibly complex and sophisticated prediction engines. All modern LLMs use what's called a Transformer architecture which, without going into technical details, means they are really good at finding patterns in sequential data, like words in a sentence.
Basically, LLMs are trained on vast amounts of data (trillions of words, if not dozens of trillions) and learn to predict the next word in a sequence. Then, when you give an LLM a prompt, it predicts the most probable next words based on what it learned from the training data.
This results in coherent and grammatically correct sentences (though, as you may know, not for all languages; some have limited training data) because the model mirrors the patterns it learned from its vast training data.
All of this is a gross oversimplification, but the key takeaway is that LLMs generate text based on probability (and a sprinkle of randomness sometimes). They produce smooth and logical text, often what one would describe as "very average" because they are designed to do just that.
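To make the "prediction engine" idea concrete, here's a minimal sketch using GPT-2 via the Hugging Face transformers library (chosen purely because it's small enough to run locally, not because any modern chatbot uses it) that peeks at the probabilities a model assigns to candidate next words:

```python
# Minimal sketch: peeking at an LLM's next-word probabilities.
# GPT-2 is used only because it is small and runs locally; modern LLMs are
# far larger, but the underlying "predict the next token" idea is the same.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the *next* token, given the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()]):>10}  p={prob.item():.3f}")
# Typically " Paris" dominates; generation is just repeatedly sampling
# (or greedily picking) from distributions like this one.
```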
AI detectors are, essentially, mirroring this process. Many are classifiers built on top of LLMs, trained on huge datasets of both human-written and AI-generated text. These classifiers are there to answer one question: "Statistically speaking, is that something a human would write or something an LLM would write?".
There are two main fingerprints they may be looking for:
- Humans phrase things creatively: we use metaphors, we make typos and our punctuation isn't perfect. All of this makes our word choices less predictable. LLMs, in turn, use more statistically probable word choices and punctuation. Different LLMs also tend to make different "most probable" choices, which actually makes it possible to identify models from their outputs through a process called "model fingerprinting".
- Humans also vary sentence structure much more often. We are more "burst-y" when it comes to writing, mixing in short and long sentences. LLMs produce more predictable, structured and uniform sentences.
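Here's a rough sketch of what those two fingerprints can look like in code, using GPT-2 as a stand-in scoring model; real detectors use their own proprietary models and features, so treat this as an illustration of the idea rather than of any specific product:

```python
# Rough illustration of the two "fingerprints": predictability (perplexity
# under some language model) and burstiness (variation in sentence length).
import math
import re

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Lower perplexity = the text is more predictable to the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Standard deviation of sentence lengths: humans tend to vary more."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    mean = sum(lengths) / len(lengths)
    return (sum((l - mean) ** 2 for l in lengths) / len(lengths)) ** 0.5

sample = "Short one. Then a much longer, meandering sentence that rambles on. Tiny."
print(f"perplexity={perplexity(sample):.1f}, burstiness={burstiness(sample):.2f}")
```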
So, with all that, see the potential problem? These fingerprints are not tell-tale signs of machine-generated text, they are simply statistical markers describing how predictable the text is.
And more importantly, with how fast the field is moving, new LLMs may change faster than these detection models can be updated. Not to forget that a classifier may be trained, for example, only on technical writing by a certain LLM, and the resulting gaps (e.g. no exposure to creative writing by the same LLM) increase the risk of false negatives.
It's hard for humans to make exactly the same choices an LLM would, but with the growing number of models, it's far from impossible either. Coincidences happen. Not to forget that, living in a world where loads of content is written by LLMs, we might subconsciously adopt some of their patterns, as research suggests.
This brings us to the most critical takeaway: these tools do not provide definitive proof. A high "AI-generated" score doesn't mean someone used LLMs; if anything, it means that the statistical properties of the text align with the detector's training data and that the text is predictable. Not to forget that for most detectors there is no way of knowing what they were trained on.
These detectors are not useless, but it's increasingly common to see them used as a lie detector when, really, they are still fallible statistical models. My own early attempt to quantify this fallibility, the small research project mentioned above, was equally flawed, but it was still a valuable lesson.
Post-Mortem of My 2022 Experiment
The initial experiment (not published as it's outdated and the data is not conclusive) was an honest (and very naive) attempt at probing a new, emerging technology and ways to counteract it.
The methodology had a decent level of coherence: test a RoBERTa-based detector against raw ChatGPT output, then compare the scores after the text had been rephrased by a person and by different tools. There was even some level of foresight to run a baseline test for false positives on my own writing and on text that (almost) definitely wasn't AI-generated (back then: small-ish websites and older page revisions from community-run websites like Wikipedia).
While the methodology was decent, the flaws were significant. To name a few:
- The dataset was tiny: 5-10 samples per category.
- The "human rephrasing" test (which is a crucial part of the experiment) was performed entirely by me, the person running the study.
- The toolset (one detector, one LLM and a couple of basic paraphrasers) looks basic, to say the least, from today's perspective.
Keep in mind, the landscape has shifted drastically since then. In 2022 it was just OpenAI's ChatGPT (running GPT-3.5 under the hood, which would be considered rudimentary by today's standards), now it's a bustling ecosystem of thousands of models like Google's Gemini and Gemma, Anthropic's Claude and Meta's Llama.
In response, more "advanced" detection tools like Copyleaks, Originality.ai and ZeroGPT appeared, claiming to detect AI-generated content with high accuracy.
Not to forget, a new front in this "arms race" has opened: AI "humanizer" tools. These applications are marketed specifically to rewrite AI-generated text to avoid detection, and they are a lot more complex than the simple paraphrasing tools I used. And, as it goes, a modern study must account for every new kind of adversarial tool.
With that, let's try to improve on the methods used and outline a better, more robust plan for the research, shall we?
Research Plan 2.0
My original little experiment was a poke with a stick; this should be more of a systematic stress test.
First and foremost, let's state a clear objective: to quantitatively assess the effectiveness, limitations and biases of the leading / most heavily marketed LLM text detectors.
Second, while I'd love to include human paraphrasing as part of this research, I lack an army to rewrite thousands of texts in their own words, and doing it myself is somewhere between "biased" and a "full-time job", so for round two it's machines vs. machines.
Third, I want it to be (mostly) automated, hence reproducible and expandable. This is not hard to achieve and will allow anyone to re-run the experiment while tinkering with prompts and different variables.
With that in mind, a new methodology could look something like this:
Step 0: Establishing a Baseline
Before even starting, running a bunch of definitely-human-written texts through these detectors couldn't hurt. Pre-2020 texts from forums and community-built websites may be a good option, especially less visited ones, since they might not be in the training data (the odds of that are low, but at this scale it's the best bet).
With this we'll have a good baseline understanding of how common false positives are.
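A sketch of what that baseline calculation could look like, assuming each detector's scores for the known-human texts have been collected into a plain list; the 0.5 threshold is an arbitrary placeholder, real tools apply their own cut-offs:

```python
# Sketch: estimating a detector's false positive rate from the baseline set.
# Assumes `baseline_scores` holds the "AI probability" a detector returned
# for texts we are confident were written by humans (pre-2020 forum posts etc.).
DETECTION_THRESHOLD = 0.5  # arbitrary placeholder; real tools use their own cut-offs

def false_positive_rate(baseline_scores: list[float],
                        threshold: float = DETECTION_THRESHOLD) -> float:
    """Fraction of known-human texts the detector would flag as AI-generated."""
    flagged = sum(1 for score in baseline_scores if score >= threshold)
    return flagged / len(baseline_scores)

# Hypothetical values, just to show the shape of the calculation:
baseline_scores = [0.02, 0.10, 0.65, 0.08, 0.30, 0.91, 0.05, 0.12]
print(f"False positive rate: {false_positive_rate(baseline_scores):.1%}")  # 25.0%
```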
Step 1: Text Generation
We'll need to generate a large, statistically significant dataset of samples. Initially I would aim for around 100-200 text samples for each combination of LLM and prompt type. This is easily automatable via APIs or something like OpenRouter.
The dataset will have to cover multiple topics, such as technical writing (summarizing scientific abstracts, writing articles based on several scientific works, etc.), creative writing (stories, essays and whatnot) and academic writing (one could ask an LLM to write university assignments).
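As a sketch of how this step could be automated, here's one OpenAI-compatible client pointed at OpenRouter, looping over models and prompt types; the model IDs, prompts and sample counts below are illustrative placeholders, not a final list:

```python
# Sketch of the generation step. OpenRouter exposes an OpenAI-compatible API,
# so a single client can query many providers' models.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Placeholder model IDs and prompts for illustration only.
MODELS = ["openai/gpt-4o-mini", "anthropic/claude-3.5-haiku", "meta-llama/llama-3.1-8b-instruct"]
PROMPTS = {
    "technical": "Summarize the key findings of a recent paper on battery chemistry.",
    "creative":  "Write a short story about a lighthouse keeper who hates the sea.",
    "academic":  "Write a 500-word essay on the causes of the French Revolution.",
}
SAMPLES_PER_COMBINATION = 100  # 100-200 per (model, prompt type), as stated above

def generate(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# dataset[(model, prompt_type)] -> list of generated texts
dataset = {
    (model, kind): [generate(model, prompt) for _ in range(SAMPLES_PER_COMBINATION)]
    for model in MODELS
    for kind, prompt in PROMPTS.items()
}
```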
Step 2: Modification of Generated Texts
All texts will be processed through different paraphrasing, "humanization" and other modification tools to create different versions for testing. All tools will use default settings to ensure consistency. Most provide API access as well.
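Since every vendor exposes its own endpoints and parameters, here's only a structural sketch of how the pipeline could treat all modification tools uniformly; the concrete API calls are deliberately left out:

```python
# Structural sketch only: a common interface for modification tools so the
# pipeline doesn't care whether a step is a paraphraser or a "humanizer".
from typing import Protocol

class Modifier(Protocol):
    name: str
    def modify(self, text: str) -> str: ...

class NoOpModifier:
    """Control condition: the raw LLM output, passed through unchanged."""
    name = "raw"
    def modify(self, text: str) -> str:
        return text

def build_variants(text: str, modifiers: list[Modifier]) -> dict[str, str]:
    """Every text sample becomes one variant per tool, default settings only."""
    return {m.name: m.modify(text) for m in modifiers}
```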
Step 3: Analyzing the texts
Now the fun (and cheapest) bit: transforming the raw data into meaningful insights.
First, every text sample (both the raw LLM output and its modified versions) is submitted to every detector in the set via their respective APIs. For each submission we want to obtain the following (see the sketch after this list):
- The raw AI probability score: the most crucial metric, the specific score assigned by the detector.
- The final judgement of the detector: the label it placed on the text ("AI" or "Human").
- Score change: how much a modification tool reduced the detection score (percentage reduction from the original, unmodified output's score).
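Assuming all detector responses land in one flat table (one row per sample, variant and detector), deriving those three metrics could look like this with pandas; the column names and threshold are placeholders:

```python
# Sketch: deriving the three metrics from collected detector responses.
import pandas as pd

DETECTION_THRESHOLD = 0.5  # placeholder cut-off for the "AI" / "Human" label

df = pd.DataFrame({
    "sample_id": [1, 1, 1],
    "detector":  ["detector_x"] * 3,
    "variant":   ["raw", "paraphrased", "humanized"],
    "ai_score":  [0.97, 0.55, 0.12],
})

# Final judgement derived from the raw probability score.
df["label"] = df["ai_score"].apply(lambda s: "AI" if s >= DETECTION_THRESHOLD else "Human")

# Score change: percentage reduction relative to the raw (unmodified) output.
raw_scores = (df[df["variant"] == "raw"]
              .set_index(["sample_id", "detector"])["ai_score"]
              .rename("raw_score"))
df = df.join(raw_scores, on=["sample_id", "detector"])
df["score_reduction_pct"] = 100 * (df["raw_score"] - df["ai_score"]) / df["raw_score"]

print(df[["variant", "ai_score", "label", "score_reduction_pct"]])
```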
With this data it's possible to perform a decent statistical analysis to identify prominent patterns and avoid flukes (a sketch follows this list), for example:
- Is Tool A significantly more effective at evasion than Tool B?
- Does Detector X perform significantly worse on texts from Claude than on texts from Llama?
- Is there a significant difference in detection scores between technical and creative writing prompts?
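For the comparisons themselves, a non-parametric test like Mann-Whitney U is a reasonable default, since detection scores are bounded and rarely normally distributed; a hedged sketch with SciPy and made-up numbers:

```python
# Sketch: answering "is Tool A significantly better at evasion than Tool B?"
# with a Mann-Whitney U test. The arrays below are placeholders standing in
# for the collected per-sample score reductions.
from scipy.stats import mannwhitneyu

tool_a_reductions = [42.0, 55.3, 61.2, 38.9, 70.1]  # % score reduction, Tool A
tool_b_reductions = [12.5, 20.1, 8.3, 15.7, 25.0]   # % score reduction, Tool B

stat, p_value = mannwhitneyu(tool_a_reductions, tool_b_reductions,
                             alternative="greater")
print(f"U={stat:.1f}, p={p_value:.4f}")
# A small p-value would suggest Tool A reduces detection scores more than Tool B;
# the same test applies to "Claude vs. Llama" or "technical vs. creative" splits.
```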
The framework above could provide a simple, reproducible test to answer (or at least hint at) how accurate these detectors actually are beyond the marketing claims. If (when) I get to it, I intend to open-source all the scripts, prompts and data for anyone to contribute, check it out themselves and find even more flaws in my methods!
To wrap it up: initially I thought of this as more of a technical exercise. It's plain fun and interesting to mess with data and new tech. But behind this wall of text is personal annoyance with unsupported marketing claims, with how some of this can be used against honest people, and all that.
As far as I am aware, none of these detectors actually have anything to prove they are 100500% accurate beyond their own internal research, which was probably conducted under near-sterile conditions. And people buy into it, and, really, I don't see how it benefits anyone but these companies.