SEC595: Applied Data Science and AI/Machine Learning for Cybersecurity Professionals


Large language models (LLMs) such as ChatGPT, Microsoft CoPilot, and Google Gemini are becoming increasingly accessible to end users and may offer a novel avenue for evaluating suspicious email messages as they are encountered. However, little is known about how these publicly available models perform when classifying phishing versus legitimate content without additional tuning.
This study examines the accuracy, reliability, and operational behavior of three widely available LLMs using a dataset of 2,000 human-written emails containing both legitimate and suspicious messages. Each model was provided with identical inputs and prompts across six runs to assess variability in output quality, classification consistency, and suspiciousness scoring.
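The run-to-run comparison described above can be sketched in a few lines: given one email's outputs from repeated identical prompts, compute how often the runs agree on a label and how much the suspiciousness scores spread. The labels and scores below are hypothetical illustrations, not values from the study.

```python
from statistics import mean, pstdev
from collections import Counter

# Hypothetical outputs for ONE email across six identical prompt runs.
# Each run yields a classification label and a 0-100 suspiciousness score.
runs = [
    ("phishing", 85), ("phishing", 90), ("legitimate", 40),
    ("phishing", 95), ("phishing", 70), ("phishing", 88),
]

labels = [label for label, _ in runs]
scores = [score for _, score in runs]

# Classification consistency: fraction of runs agreeing with the majority label.
majority_label, majority_count = Counter(labels).most_common(1)[0]
consistency = majority_count / len(runs)

# Scoring variability: population standard deviation of the suspiciousness scores.
score_mean = mean(scores)
score_spread = pstdev(scores)

print(f"majority label: {majority_label} ({consistency:.0%} agreement)")
print(f"mean score: {score_mean:.1f}, std dev: {score_spread:.1f}")
```

Repeating this per email and aggregating across the dataset yields per-model consistency and variability figures of the kind the results below compare.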
The results show stark differences in performance: ChatGPT accepted the full dataset but exhibited highly inconsistent scoring and categorization; CoPilot processed fewer messages but showed strong reliability and accuracy for those it evaluated; and Gemini displayed significant operational instability, returning inconsistent, partial, or malformed outputs. These findings indicate that publicly available LLMs vary widely in their dependability for phishing detection tasks, highlighting critical limitations for real-world adoption, informing recommendations for organizational use, and identifying opportunities for future study.









