Microsoft AI Researchers Accidentally Expose Data
The Wiz research team discovered that Microsoft AI researchers inadvertently exposed 38 terabytes of private data while publishing open-source training data on GitHub. The issue was due to an “overly-permissive” Shared Access Signature (SAS) token for an internal storage account. The exposed data included passwords, private keys, secrets, and more than 30,000 internal Microsoft Teams messages. Wiz notified Microsoft through a Coordinated Vulnerability Disclosure (CVD) report.
AI governance processes that include data management are critical to avoiding this and many other AI risks. Think of it this way: imagine a “Home Cooking AI” that ingested everything in your kitchen, including the food, the cleaning supplies, and the pile of mail sitting on the counter (or on the hard drive of your computer), and then you typed in “Give me a recipe for Airline Chicken.” High probability of a poisonous meal, and of recipes containing the credit card numbers you used on airline reservations.
This is actually not an AI incident but a cloud incident. Someone at Microsoft uploaded a huge amount of data to Azure and GitHub (both Microsoft cloud platforms), misconfigured the storage account's access token, and accidentally exposed 38 TB of data to the public. In addition, the data was editable, meaning malicious actors could have modified it. It just so happens the data was AI-related, part of a research project. One of the biggest risks in the cloud is often not cyber threat actors but privileged users making mistakes. Cloud environments are complex and constantly changing; if the cloud sometimes confuses you like it does me, think what IT admins and developers are experiencing.
The core problem here was the improper scope of the SAS (data sharing) token: it is much easier to share an entire storage account than specific folders or containers. This is a good opportunity to review how you are training users to share only what is needed, as well as what processes you have for reviewing what has been shared. Also take a look at expiring shares. While some data will need to be shared indefinitely, other data only needs to be available for a short interval. When reviewing the scope and duration of data shares, also factor in the purpose, keeping an eye on how the data could be misused, particularly data used to train AI.
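To make the least-privilege idea concrete, here is a minimal sketch of a scoped, expiring share token. This is not Azure's actual SAS implementation; the signing key, claim names, and token format are all illustrative stand-ins. It does, however, mirror the three properties a share token should lock down: a narrow path scope, minimal permissions, and a hard expiry.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical signing key -- in Azure this role is played by the storage
# account key; here it is only an illustrative stand-in.
SIGNING_KEY = b"account-key-placeholder"

def make_share_token(path: str, permissions: str, ttl_seconds: int) -> str:
    """Mint a token scoped to one path, with limited permissions and a hard expiry."""
    claims = {
        "path": path,                           # narrow scope: one folder, not the whole account
        "perm": permissions,                    # e.g. "r" for read-only; never blanket write access
        "exp": int(time.time()) + ttl_seconds,  # hard expiry, not "valid forever"
    }
    payload = json.dumps(claims, sort_keys=True).encode()
    payload_b64 = base64.urlsafe_b64encode(payload).decode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return f"{payload_b64}.{sig}"

def validate(token: str, requested_path: str, requested_perm: str) -> bool:
    """Reject tokens that are tampered with, expired, out of scope, or under-privileged."""
    payload_b64, _, sig = token.partition(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # tampered
    claims = json.loads(payload)
    if time.time() > claims["exp"]:
        return False  # expired
    if not requested_path.startswith(claims["path"]):
        return False  # out of scope
    return requested_perm in claims["perm"]
```

Had the exposed token been scoped this way, to a single read-only path with a short lifetime, the blast radius would have been a handful of files for a limited window rather than an entire writable storage account with no expiry.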
It continues to be a bad couple of months for Microsoft. Interestingly, GitHub recently added the capability to scan repositories for secrets. Use the tools that GitHub and Microsoft make available to routinely scan your data repositories.
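As a rough illustration of what secret scanning does under the hood, here is a minimal regex-based scanner. The patterns are simplified stand-ins; real scanners such as GitHub's secret scanning maintain much larger, provider-specific rule sets and can verify matches with the issuing providers.

```python
import re

# Illustrative patterns only -- not GitHub's actual rule set.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_password": re.compile(r"(?i)password\s*[:=]\s*\S+"),
}

def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every suspected secret in `text`."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Running a scan like this routinely, before data is published rather than after, is exactly the kind of control that would have flagged the passwords and private keys sitting in the exposed dataset.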