Trust But Verify: Evaluating the Accuracy of LLMs in Normalizing Threat Data Feeds

Published: 16 Jul, 2025
Created by: Nicholas Peterson

This paper examines whether Large Language Models (LLMs) can be reliably applied to the normalization of Indicators of Compromise (IOCs) into Structured Threat Information Expression (STIX) format. Using benchmark datasets of 200 IOCs across three types (MD5 hashes, URLs, and IPv4 addresses), it evaluates the performance of Google’s Gemini 2.0 Flash and OpenAI’s ChatGPT-4o.

While both models achieved 100% validity in generating syntactically correct STIX outputs, their fidelity in accurately preserving IOC values varied significantly. Gemini outperformed ChatGPT overall, though both models struggled with hash values, exhibiting frequent omissions and erroneous pattern translations. The inconsistencies in these errors pose a major obstacle to the reliable use of LLMs in operational security and data engineering pipelines.
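
The distinction between validity and fidelity can be illustrated with a small check. The following is a minimal sketch, not the paper's evaluation code: the function name `check_fidelity`, the `STIX_PATHS` mapping, and the result labels are hypothetical, and it assumes each model output is a single STIX equality pattern whose value contains no single quotes or backslashes. It compares the value embedded in the pattern against the original IOC, which is the kind of verification a syntactic validator alone would not perform.

```python
import re

# Hypothetical mapping from IOC type to the STIX object path expected in a pattern.
STIX_PATHS = {
    "md5": "file:hashes.MD5",
    "url": "url:value",
    "ipv4": "ipv4-addr:value",
}

# Matches one equality comparison, e.g. [url:value = 'http://example.com/'].
# Assumption: the value holds no escaped quotes or backslashes.
PATTERN_RE = re.compile(r"^\[\s*([\w:.'\-]+)\s*=\s*'(.*)'\s*\]$")


def check_fidelity(ioc_type, ioc_value, stix_pattern):
    """Return 'ok' if the pattern carries the original IOC value unchanged."""
    match = PATTERN_RE.match(stix_pattern.strip())
    if match is None:
        return "unparseable"
    path, value = match.groups()
    # Accept both hashes.MD5 and hashes.'MD5' by dropping quotes in the path.
    if path.replace("'", "") != STIX_PATHS[ioc_type]:
        return "wrong_path"
    if value != ioc_value:
        return "value_mismatch"
    return "ok"


if __name__ == "__main__":
    original = "d41d8cd98f00b204e9800998ecf8427e"
    # A syntactically well-formed pattern whose last hash character was altered.
    altered = "[file:hashes.MD5 = 'd41d8cd98f00b204e9800998ecf8427f']"
    print(check_fidelity("md5", original, altered))
```

A pattern like the altered one above would pass a syntax check yet fail this comparison, which is why a pipeline that relies on LLM normalization would likely need a value-level check of this kind in addition to schema validation.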