Ability of AI detection tools and humans to accurately identify different forms of AI-generated written content

Executive Summary

As generative artificial intelligence (AI) and large language models (LLMs) like ChatGPT become integrated into academic scholarship, the ability to distinguish human-authored text from AI-generated content has become critical for maintaining research integrity. This briefing document synthesizes findings from a 2025 study evaluating three open-access AI detection tools and five human raters across a spectrum of five AI-use conditions in healthcare simulation literature.

The investigation reveals a stark disparity between automated and human detection capabilities. AI detection tools, specifically ZeroGPT and PhraslyAI, demonstrate a statistically significant ability to differentiate between levels of AI intervention, but their absolute scores vary from tool to tool, raising concerns about their reliability as sole arbiters of academic honesty. Human raters, even those with subject matter expertise, achieved an accuracy of 19%, which is indistinguishable from random chance. The study concludes that AI detection tools may assist editorial workflows but must be supplemented by other verification methods, particularly because human intuition is an unreliable safeguard against AI-generated content.

Study Methodology and Experimental Design

The study utilized 30 open-access articles published before 2022 in the journals Advances in Simulation and Simulation in Healthcare to ensure the source material was of human origin. Researchers extracted introduction sections (500–600 words) and subjected them to five experimental conditions using ChatGPT-4o to mirror realistic scholarly usage.

Experimental Conditions of AI Use

Condition	Description	Prompt/Methodology Focus
1. 100% Human	Original text used verbatim.	Baseline (pre-2022 publication).
2. Light AI Editing	Human text edited for minor errors.	Spelling and punctuation only.
3. Heavy AI Editing	Human text edited for flow and structure.	Grammar, sentence structure, and readability.
4. AI from Human Ideas	AI writes text based on author's bullets.	3–4 bullet points per paragraph provided to AI.
5. 100% AI Written	AI generates text de-novo from a title.	No human content provided beyond the title.

The resulting text was evaluated by three open-access tools (ZeroGPT, PhraslyAI, and Grammarly AI Detector) and five blinded human raters who are active healthcare simulation researchers.

Performance Analysis of AI Detection Tools

All three tools demonstrated the ability to detect increasing percentages of AI content as the level of AI intervention escalated (p < 0.001). However, significant variations in reliability and absolute scoring were observed.

Comparative Tool Reliability

ZeroGPT and PhraslyAI: Showed "very good" agreement with an Intraclass Correlation Coefficient (ICC) of 0.96. Both tools were highly effective at identifying de-novo AI content, with mean scores of 92.5% and 92.4% respectively for 100% AI-written text.
Grammarly AI Detector: Demonstrated "moderate" agreement with ZeroGPT (ICC 0.60) and PhraslyAI (ICC 0.57). Notably, Grammarly scored 100% AI-written text (Condition 5) at only 50% AI-generated on average, well below the other two tools.
Baseline Noise: Even in the 100% human condition, tools detected "noise," with average AI scores ranging from 1.6% to 6.5%.
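
The ICC values above summarize how closely two tools' scores agree across the same set of texts. As a rough illustration, the sketch below computes a two-way, absolute-agreement, single-measure ICC, often written ICC(2,1), in plain Python. This summary does not state which ICC form the study used, so that choice, the function name `icc_2_1`, and the example scores are assumptions for illustration only and are not study data.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC.

    `scores` is a table with one row per text and one column per detector.
    The ICC form is an assumption; the study's exact variant is not stated here.
    """
    n = len(scores)        # number of texts
    k = len(scores[0])     # number of detectors
    if n < 2 or k < 2:
        raise ValueError("need at least two texts and two detectors")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Mean squares from a two-way ANOVA without replication.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_rows - ms_err) / denom


# Hypothetical scores (percent AI) from two detectors on three texts.
identical = [[10.0, 10.0], [50.0, 50.0], [90.0, 90.0]]
offset = [[10.0, 20.0], [50.0, 60.0], [90.0, 100.0]]
print(icc_2_1(identical))  # perfect agreement
print(icc_2_1(offset))     # lower: a constant offset counts against agreement
```

Because this form measures absolute agreement, two detectors that rank texts identically but differ by a constant offset score below 1. That is one plausible way for a tool to track the same trend as the others while agreeing only moderately with them.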

Mean AI Detection Scores by Condition

Condition	ZeroGPT	PhraslyAI	Grammarly
1. 100% Human	6.5%	5.9%	1.6%
2. Light AI Editing	20.2%	24.8%	3.0%
3. Heavy AI Editing	43.1%	45.1%	11.5%
4. AI from Human Ideas	89.9%	85.6%	62.5%
5. 100% AI Written	92.5%	92.4%	50.0%

Analysis of Human Detection Capabilities

Human raters proved highly unreliable in identifying the origin of the text. Overall accuracy across all conditions was 19%, marginally below the 20% expected from random guessing among five conditions.

Accuracy and Error Patterns

Detection Accuracy: Human accuracy was highest for "heavily edited" text (30%) and lowest for de-novo AI-generated text (10%).
False Positives (FP): In 100% human-written samples, the FP rate (identifying human text as AI) was 72.2%.
False Negatives (FN): In 100% AI-written samples, the FN rate (identifying AI text as human) was 76.9%.
Expertise Limitation: Despite being subject-matter experts capable of identifying factual "hallucinations," raters could not distinguish the stylistic markers of AI. The agreement between human raters and AI detection tools was "extremely poor" (ICC = 0).
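
The comparison with chance above is simple arithmetic: with five equally likely conditions, a rater guessing at random is right one time in five. A minimal sketch follows. The function names and the example ratings are hypothetical and are not the study's data.

```python
from typing import Sequence


def chance_accuracy(n_conditions: int) -> float:
    """Expected accuracy when guessing uniformly among n_conditions labels."""
    if n_conditions < 1:
        raise ValueError("n_conditions must be positive")
    return 1 / n_conditions


def accuracy(true_labels: Sequence[int], guesses: Sequence[int]) -> float:
    """Fraction of texts whose guessed condition matches the true condition."""
    if not true_labels or len(true_labels) != len(guesses):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == g for t, g in zip(true_labels, guesses))
    return correct / len(true_labels)


# Hypothetical example: one text per condition (1-5), and a rater who
# labels every text as condition 3.
true_conditions = [1, 2, 3, 4, 5]
rater_guesses = [3, 3, 3, 3, 3]

print(chance_accuracy(5))                        # 0.2
print(accuracy(true_conditions, rater_guesses))  # 0.2
```

A rater who gives every text the same label lands exactly on the chance rate, which is why accuracy near 20% carries no evidence of real discrimination between conditions.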

Linguistic and Ethical Considerations

Markers of AI vs. Human Writing

The study highlights that AI detection tools rely on specific linguistic features that distinguish machine output from human creativity:

Perplexity: AI text has lower perplexity, meaning it is more predictable and follows common linguistic patterns found in training data. Human writing features "unexpected word choices."
Burstiness: AI writing is uniform in sentence length and structure (low burstiness). Humans exhibit high variance in sentence length, types of phrases, and overall structure.
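
Both markers can be made concrete with simplified textbook definitions. The sketch below treats perplexity as the exponential of the average negative log-probability per token, and burstiness as the coefficient of variation of sentence lengths. These definitions are assumptions for illustration: the detectors evaluated in the study are proprietary and their actual computations are not described here. The token probabilities would come from a language model and are simply passed in by the caller.

```python
import math
import re
from statistics import mean, pstdev
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp(mean negative log-probability); lower means more predictable text."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths, measured in words.

    Sentences are split naively on '.', '!' and '?'. A value of 0.0 means
    every sentence has the same length.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)


# Hypothetical inputs for illustration.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # every token had a 1-in-4 chance
print(burstiness("One two three. Four five six."))  # uniform lengths
print(burstiness("Short. This one is a much longer sentence indeed."))
```

Under these definitions, highly predictable tokens lower the perplexity, and evenly sized sentences push burstiness toward zero, matching the direction of the two markers described above.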

Ethical Spectrum of AI Use

The research aligns with ethical models that categorize AI use based on the degree of original thought replacement:

Ethically Sound: Light editing for grammar and spelling (Condition 2).
Ethical "Grey Area": Heavy restructuring for readability (Condition 3).
Ethically Suspect: Generating de-novo text or ideas from bullet points or titles (Conditions 4 and 5). These methods are more likely to lead to factual inaccuracies (hallucinations) and plagiarism.

Practical Implications for Academic Publishing

Recommendations for Editorial Staff

Utilize Detection Thresholds: The data suggests that an AI detection score above 40% should serve as a "flag" for journal editors to initiate a closer review and discussion with authors.
Verify Citations: Given the AI propensity for hallucinating sources, manual verification of all cited articles is a critical secondary defense.
Establish Clear Policies: Journals must develop and communicate transparent plans regarding acceptable AI use and the adjudication of conflicts when AI use is suspected.
Avoid Sole Reliance on Software: Due to the risk of false positives (wrongful accusations) and false negatives (evasion of detection), AI tools should be part of a peer-review "toolkit" rather than the final authority.
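
The 40% flagging threshold recommended above can be expressed as a simple screening rule. The sketch below applies it to the mean ZeroGPT scores reported in the study's table. The function name `needs_review` and the strict "greater than" comparison are assumptions. Applying the rule to condition means is only illustrative, since an editor would apply it to individual manuscripts.

```python
FLAG_THRESHOLD = 40.0  # percent; the "flag" level suggested in the recommendations


def needs_review(score_percent: float, threshold: float = FLAG_THRESHOLD) -> bool:
    """Return True when a detector score warrants closer editorial review.

    A flag is a prompt for discussion with the authors, not a verdict.
    """
    return score_percent > threshold


# Mean ZeroGPT scores per condition, taken from the table of mean scores.
zerogpt_means = {
    "100% human": 6.5,
    "light AI editing": 20.2,
    "heavy AI editing": 43.1,
    "AI from human ideas": 89.9,
    "100% AI written": 92.5,
}

flagged = [name for name, score in zerogpt_means.items() if needs_review(score)]
print(flagged)
# ['heavy AI editing', 'AI from human ideas', '100% AI written']
```

On these means, the threshold separates light editing from the three heavier forms of AI use. The same rule applied to Grammarly's means would not flag heavy AI editing (11.5%), which illustrates why the threshold cannot be detached from the tool that produced the score.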

Conclusion

The evolution of LLMs has created a landscape where human intuition is no longer a viable defense against synthetic content in scholarship. While automated tools show promise in identifying high levels of AI intervention, their variable reliability necessitates a cautious, multi-modal approach to editorial oversight. As scholars continue to adopt AI assistants, the academic community must prioritize transparency and the development of robust, proactive standards for AI-assisted writing.
