What You Will Learn
SEC595 is a crash-course introduction to practical data science, statistics, probability, and machine learning. The course is structured as a series of short discussions with extensive hands-on labs that help students develop a solid and intuitive understanding of how these concepts relate and can be used to solve real-world problems. If you've never done anything with data science or machine learning but want to use these techniques, this is definitely the course for you!
Unlike other courses in this space, SEC595 is squarely centered on solving information security problems. Where other courses tend to be at the extremes of teaching almost all theory or solving trivial problems that don't translate to the real world, this course strikes a balance. We cover only the theory and math fundamentals that you absolutely must know, and only insofar as they apply to the techniques that we then put into practice.
The major topics covered in SEC595 include:
 Data acquisition from SQL, NoSQL document stores, web scraping, and other common sources
 Data exploration and visualization
 Descriptive statistics
 Inferential statistics and probability
 Bayesian inference
 Unsupervised learning and clustering
 Deep learning neural networks
You Will Be Able To:
 Apply statistical models to real-world problems in meaningful ways
 Generate visualizations of your data
 Perform mathematics-based threat hunting on your network
 Understand and apply unsupervised learning/clustering methods
 Build Deep Learning Neural Networks
 Build and understand Convolutional Neural Networks
 Understand and build Genetic Search Algorithms
You Will Receive with this Course:
 A supporting virtual machine
 Jupyter notebooks of all of the labs and complete solutions
This Course Will Prepare You To:
 Build AI anomaly detection tools
 Model information security problems in useful ways
 Build useful visualization dashboards
 Solve problems with neural networks
Additional Resources:
 Anaconda
 TensorFlow (and supporting libraries)
 Matplotlib
 VMware Workstation/Player/Fusion
Syllabus (30 CPEs)

Overview
This section introduces some of the terminology in the data science and machine learning fields. It also presents a number of the technologies that are used as data sources. Since the first step in any data science or machine learning project is to acquire data, the balance of the section is focused on hands-on exercises to prepare students for these tasks.
The first necessary skill is the use of Python, our chosen language for this course. The only course prerequisite is a fundamental understanding of Python. If you've written even one line of Python, you are probably knowledgeable enough to get started! We will cover lists, arrays, tuples, dictionaries, and comprehensions, and then begin introducing the NumPy variants!
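As a rough illustration of the kind of Python the refresher touches on, the sketch below uses tuples, a list, a dictionary, and comprehensions to summarize some connection records. The data and variable names are invented for this example and are not taken from the course labs.

```python
# Hypothetical sample data: (source IP, destination port) tuples.
connections = [
    ("10.0.0.5", 443),
    ("10.0.0.7", 22),
    ("10.0.0.5", 80),
    ("10.0.0.9", 443),
    ("10.0.0.5", 443),
]

# List comprehension: source IPs that connected to port 443.
https_sources = [ip for ip, port in connections if port == 443]

# Dictionary built in a loop: number of connections per source IP.
counts = {}
for ip, _port in connections:
    counts[ip] = counts.get(ip, 0) + 1

# Set comprehension: the distinct destination ports seen.
ports = {port for _ip, port in connections}

# Dictionary comprehension: sources with more than one connection.
busy = {ip: n for ip, n in counts.items() if n > 1}

print(https_sources)
print(counts)
print(sorted(ports))
print(busy)
```

The same filtering and counting ideas carry over to array-based tooling, where they operate on whole columns at once rather than one element at a time.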
Following the Python "refresher", we'll provide some theory followed immediately by hands-on exercises to give you just enough knowledge of SQL, MongoDB, and web scraping to get real work done.
Exercises
 Python Refresher
 Accessing, Manipulating, and Retrieving SQL Data
 Accessing, Manipulating, and Retrieving NoSQL Data: MongoDB
 Web Scraping for Data Acquisition
Topics
 Data Science
 Python
 SQL
 NoSQL
 Web Scraping

Overview
This section begins with the fundamentals of statistics that matter for data science and machine learning. We'll quickly move to hands-on exercises that provide practical uses for these techniques against real-world data.
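To give a feel for why robust measures matter, here is a minimal sketch comparing the mean with the median and the median absolute deviation (MAD) on data containing one extreme value. The byte counts and the cutoff of 10 MADs are assumptions made for illustration only.

```python
import statistics

# Hypothetical data: bytes sent per session, with one extreme outlier.
bytes_sent = [510, 495, 502, 488, 505, 499, 250000]

mean = statistics.mean(bytes_sent)      # pulled far upward by the outlier
median = statistics.median(bytes_sent)  # barely affected by the outlier

# Median absolute deviation (MAD): a robust measure of spread.
abs_dev = [abs(x - median) for x in bytes_sent]
mad = statistics.median(abs_dev)

# Flag values more than 10 MADs from the median (the cutoff is an assumption).
outliers = [x for x in bytes_sent if mad and abs(x - median) / mad > 10]

print(round(mean, 1), median, mad, outliers)
```

The mean lands far from every typical session, while the median and MAD still describe the bulk of the data, which is what makes the single large transfer stand out.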
The course section then transitions to probability theory, which is an extensive field of its own. Following the introduction of some fundamentals, the course works directly toward deriving Bayes' theorem. Building on this introduction, students then engage in a hands-on lab that builds a useful Bayesian analysis tool that students will improve upon later in the course.
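As a small worked example of Bayes' theorem in a phishing setting, the sketch below computes the probability that a message is phishing given that it contains a particular word. The function name and all of the probabilities are hypothetical and chosen only to make the arithmetic easy to follow; this is not the tool built in the lab.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' theorem: P(H | E) from P(H), P(E | H), and P(E | not H)."""
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

# Assumed numbers: 10% of mail is phishing; the word "verify" appears in
# 60% of phishing messages and in 5% of legitimate messages.
p_phish_given_word = posterior(prior=0.10, likelihood=0.60, likelihood_given_not=0.05)

print(round(p_phish_given_word, 3))  # 0.06 / (0.06 + 0.045) = 0.571
```

Even though the word is twelve times more common in phishing mail, the low prior keeps the posterior near 57 percent, which is why a practical classifier combines evidence from many words.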
The remainder of this section involves translating the statistical knowledge gained into the field of signals analysis. After a discussion of the derivation and applications of the Fourier series, the Fast Fourier Transform, and the Discrete Fourier Transform, students will use these tools in a real-world threat-hunting activity.
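One way signals analysis can apply to threat hunting is spotting regular, beacon-like timing in traffic. The sketch below is a naive Discrete Fourier Transform written out directly from its definition (not a Fast Fourier Transform) and applied to a synthetic signal; the signal, its 8-minute period, and the variable names are all assumptions for illustration.

```python
import cmath
import math

def dft(samples):
    """Naive O(n^2) Discrete Fourier Transform, straight from the definition."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Hypothetical signal: traffic volume per minute over 64 minutes, with a
# component that repeats every 8 minutes on top of a constant baseline.
n = 64
signal = [10.0 + 4.0 * math.sin(2 * math.pi * t / 8) for t in range(n)]

# Remove the mean so the constant baseline does not dominate the spectrum.
mean = sum(signal) / n
centered = [x - mean for x in signal]

spectrum = dft(centered)

# Only bins 1 .. n/2 - 1 are needed for a real-valued signal.
magnitudes = [abs(c) for c in spectrum[1 : n // 2]]
peak_bin = magnitudes.index(max(magnitudes)) + 1
period = n / peak_bin  # period of the dominant component, in minutes

print(peak_bin, period)
```

In practice a library FFT would replace the hand-written transform, and real traffic would be far noisier, but the idea of looking for a dominant peak in the spectrum is the same.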
Exercises
 Statistics Fundamentals: Medians and Means
 Statistics Fundamentals: Variance, Deviations, and Robust Measures
 Applications of Statistics to Data Identification
 Probability, Bayes, and Phishing
 Threat Hunting through Signals Analysis
Topics
 Statistics
 Robust Measures
 Probability
 Bayes' Theorem and Inference
 Fourier Series and Related Derivations

Overview
The remaining 18+ contact hours of this course are spent learning about and immediately applying various machine learning models. After each topic is introduced and discussed, students engage in lengthy hands-on labs to develop an intuitive understanding and apply the technique to real problems.
This section begins with various clustering approaches and unsupervised machine learning. The exploration begins with Support Vector Classifiers, kernel functions, and Support Vector Machines. Following this discussion and exercises, we continue the clustering theme by considering the K-Means and KNN approaches. After working through examples in just two or three dimensions, we turn our attention to methods for determining the ideal number of clusters. With this done, we finally explore high-dimensional applications and dimensionality reduction through Principal Component Analysis.
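To make the idea of choosing the number of clusters concrete, here is a minimal sketch of one-dimensional K-Means together with a simple elbow check. The data, the deterministic initialization, and the 50 percent improvement cutoff are assumptions for this example and are not the course's implementation.

```python
def kmeans_1d(points, k, iterations=20):
    """Tiny 1-D K-Means; returns (centers, inertia)."""
    ordered = sorted(points)
    # Deterministic start (an assumption): k evenly spaced values from the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    # Inertia: sum of squared distances from each point to its nearest center.
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

# Hypothetical data: logins per day for nine accounts, in three obvious groups.
logins = [1, 2, 3, 20, 21, 22, 50, 51, 52]

inertias = {k: kmeans_1d(logins, k)[1] for k in range(1, 5)}

# Elbow heuristic (assumed cutoff): the first k where adding one more
# cluster reduces inertia by less than 50 percent.
elbow = next(
    k for k in range(1, 4)
    if (inertias[k] - inertias[k + 1]) / inertias[k] < 0.5
)

print(inertias)
print(elbow)
```

Inertia always falls as k grows, so the useful signal is where the improvement levels off; here that happens once each of the three groups has its own center.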
The balance of this section is spent discussing Decision Trees. After a hands-on activity and discussion of the limitations of Decision Trees, we expand into Random Forests and explore hands-on how these provide better inferences in most cases. The section wraps up with a cluster-based approach to finding anomalies in user activity on a network.
Exercises
 Support Vector Classifiers
 Support Vector Machines
 K-Means/KNN
 Elbow Functions and PCA
 Decision Trees
 Random Forests
 Finding Anomalies: Clustering
Topics
 Support Vector Classifiers
 Support Vector Machines
 Kernel Functions
 Principal Component Analysis
 K-Means
 KNN
 Elbow Functions
 Decision Trees
 Random Forests
 Anomaly Detection

Overview
The entire focus of this section is on the theory, development, and use of supervised learning approaches in the field of information security. Building on the mathematics and statistics covered in section 2, this course section begins with linear regressions and ends with an introduction to Convolutional Neural Networks.
The material is focused on using supervised machine learning and mathematics to create predictive models. The initial discussion and exercises center around forecasting and trends analysis for anomaly detection. Following this, most of the material focuses on classification problems.
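As a rough sketch of trend-based anomaly detection, the example below fits a straight line to a series by ordinary least squares and flags points whose residual is unusually large. It uses a hand-written line fit as a stand-in for a library polynomial-fitting routine; the data, the helper name `fit_line`, and the two-standard-deviation cutoff are all assumptions for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical daily log volume in GB for days 0..9; day 7 contains a spike.
days = list(range(10))
volume = [10, 12, 14, 16, 18, 20, 22, 60, 26, 28]

slope, intercept = fit_line(days, volume)

# Residual: how far each observation sits from the fitted trend.
residuals = [y - (slope * x + intercept) for x, y in zip(days, volume)]
spread = statistics.pstdev(residuals)

# Flag days whose residual exceeds two standard deviations (assumed cutoff).
anomalies = [x for x, r in zip(days, residuals) if abs(r) > 2 * spread]

print(anomalies)
```

Note that the spike itself drags the fitted line upward, so a more careful approach would fit on known-good history or use a robust fit before scoring new points.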
Building on the Bayes approach used in section 2, this course section introduces deep learning neural networks and fully connected dense networks through the development of a far more accurate phishing detection network. Following this, we'll explore visualization and measurement of neural network training performance, in addition to discussing overfitting and overtraining and how to identify (and avoid!) them.
The next portion of this section turns to categorical problems. Students will build a real-time network protocol classification system and, more importantly, implement anomaly detection in this classification system, a task typically reserved for unsupervised approaches.
The final portion of this course section will introduce Convolutional Neural Networks. Further exploration of these continues in section 5.
Exercises
 Polyfit Regressions
 Hello, World!, Ham vs. Spam
 Identifying Protocols
 Protocol Anomaly Detection
Topics
 Regression and Fitting
 Loss and Error Functions
 Vectors, Matrices, and Tensors
 Fundamentals of the Perceptron
 Dense Networks
 AutoEncoders
 Convolutional Neural Networks

Overview
The final section of the course picks up right where section 4 left off: Convolutional Neural Networks. Students begin by exploring the applications of Convolutional Networks to Natural Language Processing in the form of the Ham vs. Spam problem, generating a highly accurate tool for distinguishing one from the other.
The major focus of this section is on the creation of a deep neural network using TensorFlow's functional pattern for both testing the quality of and solving CAPTCHAs. Whether you are on a red, blue, or purple team, you will learn how to think through and use machine learning to solve what amounts to a computer vision problem, and to solve it at greater than 95 percent accuracy! After this, we'll explore a different way to think about the problem that results in even greater accuracy with far less training time.
The final portion of the section investigates genetic algorithms as they can be applied to machine learning problems.
Exercises
 Ham vs. Spam, CNN Style!
 Solving CAPTCHAs
 Genetic Algorithms
Topics
 Convolutional Neural Networks
 Functional Definition of Neural Networks
 Deep Learning Networks with Multiple Outputs
 Thinking about Machine Learning Problems
 Genetic Algorithms
Prerequisites
A basic knowledge of Python or some similar scripting language is needed. You need not be proficient, but you should have written at least a handful of simple scripts at some point in your life.
Laptop Requirements
A laptop with at least 8 gigabytes of RAM is required for this course. The course relies heavily on the Anaconda Individual environment, which you should install before attending class. It also makes use of a VMware-based virtual machine. A VMware-based virtualization solution such as VMware Player, VMware Workstation, or VMware Fusion is required.
Special note for those with an M1 Mac: Even though it is not possible to run the virtual machine on an M1 Mac, the course exercises are fully supported on your device! Please identify yourself to your instructor at the start of the course so that the instructor can provide you with alternate instructions for some of the labs.
If you have additional questions about the laptop specifications, please contact laptop_prep@sans.org.
Author Statement
"AI and machine learning are everywhere. How do the vendor solutions work? Is this really black magic? I wrote this course to fill an enormous knowledge gap in our field. I believe that if you are going to use a tool, you should understand how that tool works. If you don't, you don't really know what the results mean or why you are getting them. SEC595 is a crash course in statistics, mathematics, Python, and machine learning that will take you from zero to being a... I'm reluctant to promise 'hero', so let's just say to being a competent person who can solve real problems today!" - David Hoelzer