
The Real Capabilities of AI Text Detectors: False Negatives, Costs and Accuracy

Ramón Sánchez
AI Detectors · AI · OpenAI · Anthropic · Google


September 18, 2025 - 3 min read

What are the real capabilities of tools for detecting the use of AI?

Detecting AI-generated text has become crucial for universities, media outlets, and technology companies, since each depends on ecosystems built on trust and human verification. In universities, for example, peer review is a cornerstone of the academic system, yet AI has exposed weaknesses such as the difficulty of distinguishing human from AI-generated content and potential biases in detection algorithms. A reliable system seeks not only to preserve academic and professional integrity but also to prevent the spread of disinformation and to establish clear rules on the use of artificial intelligence. Relying on these tools, however, can create ethical and practical problems: unfairly penalizing authors, or letting automated content pass unnoticed.

In this context, Brian Jabarian and Alex Imas conducted a study evaluating the performance of commercial and open-source detectors on texts generated by AI and texts written by humans. These detectors typically operate through a web interface or a code-accessible API, and usually return a percentage verdict on how much of a text appears to be AI-generated.

The study was based on 1,992 text passages spanning categories such as news articles, Amazon reviews, novels, restaurant reviews, and summaries, split evenly between AI-generated and human-written texts. The objective was to analyze two types of error: the False Positive Rate (FPR), the share of human-written texts classified as AI-generated, and the False Negative Rate (FNR), the share of AI-generated texts that the detector classifies as human.
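
As a minimal sketch of how these two error rates are computed (the helper name and the toy data are illustrative, not taken from the study):

```python
# Sketch of the two error rates evaluated in the study.
# `is_human` is the ground truth (True = written by a human);
# `flagged_ai` is the detector's verdict (True = classified as AI).
def error_rates(is_human, flagged_ai):
    humans = [f for h, f in zip(is_human, flagged_ai) if h]
    ais    = [f for h, f in zip(is_human, flagged_ai) if not h]
    fpr = sum(humans) / len(humans)           # humans wrongly flagged as AI
    fnr = sum(not f for f in ais) / len(ais)  # AI texts that slip through
    return fpr, fnr

# Toy example: 4 human passages, 4 AI passages.
truth = [True, True, True, True, False, False, False, False]
flags = [False, True, False, False, True, True, True, False]
print(error_rates(truth, flags))  # (0.25, 0.25)
```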

For detection, four tools were used: GPTZero, Originality, and Pangram (commercial detectors) and RoBERTa (an open-source model). The analysis covered both the “optimal” configuration recommended by each tool and tests with decision thresholds adjusted between 0.1 and 0.9, in order to observe different sensitivity scenarios.
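
To make the threshold sweep concrete, here is a minimal sketch; the scores and labels are toy data assumed for this illustration, not values from the paper:

```python
# Illustrative threshold sweep: a passage is flagged as AI when its
# score meets or exceeds the threshold. Lower thresholds flag more
# texts (fewer false negatives, more false positives), and vice versa.
def sweep_thresholds(scores, is_human):
    results = {}
    for t in [x / 10 for x in range(1, 10)]:  # 0.1, 0.2, ..., 0.9
        flags = [s >= t for s in scores]
        n_humans = sum(is_human)
        n_ai = len(is_human) - n_humans
        fpr = sum(f for f, h in zip(flags, is_human) if h) / n_humans
        fnr = sum(not f for f, h in zip(flags, is_human) if not h) / n_ai
        results[t] = (fpr, fnr)
    return results

# Toy example: humans tend to score low, AI passages high.
scores   = [0.05, 0.20, 0.35, 0.60, 0.55, 0.70, 0.85, 0.95]
is_human = [True, True, True, True, False, False, False, False]
for t, (fpr, fnr) in sweep_thresholds(scores, is_human).items():
    print(f"threshold {t:.1f}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```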

The results show that tools like Pangram reach detection levels close to 99% on long passages, although their performance, like that of the other detectors, declines on shorter texts; the drop is more pronounced for Originality and GPTZero. Likewise, when performance was weighed against each tool's price per detection, Pangram offered the best cost-effectiveness. This calculation established a clear methodology for computing the cost per true positive detection (CPTP), a useful metric for the financial analysis of these tools.
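
A back-of-the-envelope sketch of the CPTP idea follows; the per-scan prices and detection counts are invented placeholders, not figures from the paper:

```python
# Cost per true positive (CPTP): total spend divided by the number
# of AI passages the detector actually caught. All numbers below
# are illustrative placeholders.
def cptp(price_per_scan, n_scanned, n_true_positives):
    return price_per_scan * n_scanned / n_true_positives

# A detector that is cheaper per scan can still lose on CPTP
# if it catches fewer AI texts.
print(cptp(price_per_scan=0.01, n_scanned=1000, n_true_positives=400))  # 0.025
print(cptp(price_per_scan=0.02, n_scanned=1000, n_true_positives=900))  # ~0.022
```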

Finally, the authors propose that the choice of a detector should not rest solely on global accuracy metrics, but on a policy caps approach: defining explicit tolerance thresholds for false positives and false negatives, which allows models to be compared more fairly and in alignment with each institution's needs. In this way, universities, media outlets, and technology companies can make informed decisions consistent with their values, protecting academic integrity and information quality. The study thus adds a key piece to the broader picture: moving from a purely technical approach to a normative one, where explicit rules state what is acceptable, and to a contextual one, where the intended use is understood, so that the detection tool can adapt to the real priorities of each ecosystem.
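
As a minimal sketch of how a policy-caps rule could work in code, assuming made-up detector statistics: keep only the detectors whose error rates stay under the institution's caps, then pick the cheapest survivor.

```python
# Policy-caps selection: the institution sets explicit tolerance caps
# for both error rates, discards any detector that violates them, and
# only then compares cost. All detector stats are illustrative.
def pick_detector(detectors, max_fpr, max_fnr):
    eligible = [d for d in detectors if d["fpr"] <= max_fpr and d["fnr"] <= max_fnr]
    return min(eligible, key=lambda d: d["cptp"], default=None)

candidates = [
    {"name": "A", "fpr": 0.08, "fnr": 0.02, "cptp": 0.020},
    {"name": "B", "fpr": 0.01, "fnr": 0.05, "cptp": 0.031},
    {"name": "C", "fpr": 0.02, "fnr": 0.01, "cptp": 0.045},
]
# A university intolerant of false accusations caps FPR at 2%:
# A is excluded despite being cheapest, and B wins on cost.
print(pick_detector(candidates, max_fpr=0.02, max_fnr=0.10))
```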
