SEC536: Adversarial AI - Penetration Testing AI Systems


This paper examines whether Large Language Models (LLMs) can be reliably applied to the normalization of Indicators of Compromise (IOCs) into Structured Threat Information Expression (STIX) format. Using benchmark datasets of 200 IOCs across three types (MD5 hashes, URLs, and IPv4 addresses), it evaluates the performance of Google’s Gemini 2.0 Flash and OpenAI’s ChatGPT-4o.
While both models achieved 100% validity, meaning every output was syntactically correct STIX, their fidelity in preserving the original IOC values varied significantly. Gemini outperformed ChatGPT overall, though both models struggled with hash values, exhibiting frequent omissions and erroneous pattern translations. Because these errors occur inconsistently, they pose a major obstacle to the reliable use of LLMs in operational security and data engineering pipelines.
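
The difference between validity and fidelity can be illustrated with a short Python sketch. This is not the paper's evaluation harness. The function names, the rough validity rule (parseable JSON with the core indicator fields), and the single canonical pattern form per IOC type are all assumptions made for illustration. A real pipeline would more likely validate against the full STIX 2.1 specification.

```python
import json
import re

# Assumed canonical STIX pattern per IOC type (illustrative only; STIX
# permits other equivalent spellings, such as file:hashes.'MD5').
PATTERN_TEMPLATES = {
    "md5": "[file:hashes.MD5 = '{value}']",
    "url": "[url:value = '{value}']",
    "ipv4": "[ipv4-addr:value = '{value}']",
}


def expected_pattern(ioc_type, value):
    """Build the pattern we expect for a given IOC (hypothetical helper)."""
    # Escape backslashes and single quotes inside the pattern string literal.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return PATTERN_TEMPLATES[ioc_type].format(value=escaped)


def is_valid(raw_output):
    """Rough validity check: JSON that looks like a STIX indicator object."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and obj.get("type") == "indicator"
        and obj.get("pattern_type") == "stix"
        and isinstance(obj.get("pattern"), str)
    )


def _strip_whitespace(text):
    return re.sub(r"\s+", "", text)


def is_faithful(raw_output, ioc_type, value):
    """Fidelity check: the pattern must carry the original IOC value intact."""
    if not is_valid(raw_output):
        return False
    pattern = json.loads(raw_output)["pattern"]
    # Compare exactly, ignoring only whitespace differences.
    return _strip_whitespace(pattern) == _strip_whitespace(
        expected_pattern(ioc_type, value)
    )
```

Under these assumptions, an output whose MD5 hash has lost a character would pass `is_valid` but fail `is_faithful`. That is the kind of failure the abstract describes: well-formed STIX that carries the wrong indicator value.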







