How Many LLMs Does it Take to Classify a Suspicious Email?

Published: 12 Mar, 2026
Created by: Bridget Bartell

Large language models (LLMs) such as ChatGPT, Microsoft CoPilot, and Google Gemini are becoming increasingly accessible to end users and may offer a novel avenue for evaluating suspicious email messages as they are encountered. However, little is known about how these publicly available models perform when classifying phishing versus legitimate content without additional tuning.

This study examines the accuracy, reliability, and operational behavior of three widely available LLMs using a dataset of 2000 human-written emails containing both legitimate and suspicious messages. Each model was provided with identical inputs and prompts across six runs to assess variability in output quality, classification consistency, and suspiciousness scoring.
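The run-to-run comparison described above can be illustrated with a small sketch. The metrics and data below are illustrative assumptions, not the study's actual scoring scheme: for one email, six hypothetical model runs each return a verdict and a suspiciousness score, and we measure agreement with the majority verdict and the spread of the scores.

```python
from statistics import mean, pstdev

# Hypothetical outputs from six runs of one model on a single email:
# (verdict, suspiciousness score). Values are illustrative only.
runs = [
    ("phishing", 8), ("phishing", 6), ("legitimate", 4),
    ("phishing", 9), ("phishing", 7), ("phishing", 8),
]

verdicts = [v for v, _ in runs]
scores = [s for _, s in runs]

# Classification consistency: fraction of runs agreeing with the majority verdict.
majority = max(set(verdicts), key=verdicts.count)
consistency = verdicts.count(majority) / len(verdicts)

# Score variability: population standard deviation of the suspiciousness scores.
score_mean = mean(scores)
score_spread = pstdev(scores)

print(majority, round(consistency, 2), score_mean, round(score_spread, 2))
# → phishing 0.83 7 1.63
```

Aggregating these two numbers over every email in the dataset gives a simple picture of how stable a model's classifications and scores are across repeated runs.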

The results show stark differences in performance: ChatGPT accepted and processed the full dataset but exhibited highly inconsistent scoring and categorization; CoPilot processed fewer messages but showed strong reliability and accuracy for those it evaluated; and Gemini displayed significant operational instability, returning inconsistent, partial, or malformed outputs. These findings indicate that publicly available LLMs vary widely in their dependability for phishing detection tasks, highlighting critical limitations for real-world adoption, informing recommendations for organizational use, and pointing to opportunities for future study.