SEC595: Applied Data Science and AI/Machine Learning for Cybersecurity Professionals

  • In Person (5 days)
  • Online
30 CPEs

Data science, artificial intelligence, and machine learning aren't just the latest buzzwords - they are fast becoming primary tools in our information security arsenal. The problem is that, unless you have a degree in mathematics or data science, you're likely at the mercy of vendors of these tools. The new SEC595 course completely demystifies machine learning and data science. More than 50 percent of the time in class is spent solving machine learning and data science problems hands-on, rather than just talking about them.

What You Will Learn

SEC595 is a crash-course introduction to practical data science, statistics, probability, and machine learning. The course is structured as a series of short discussions with extensive hands-on labs that help students develop a solid and intuitive understanding of how these concepts relate and can be used to solve real-world problems. If you've never done anything with data science or machine learning but want to use these techniques, this is definitely the course for you!

Unlike other courses in this space, SEC595 is squarely centered on solving information security problems. Where other courses tend toward the extremes of teaching almost all theory or solving trivial problems that don't translate to the real world, this course strikes a balance. We cover only the theory and math fundamentals that you absolutely must know, and only insofar as they apply to the techniques that we then put into practice.

The major topics covered in SEC595 include:

  • Data acquisition from SQL, NoSQL document stores, web scraping, and other common sources
  • Data exploration and visualization
  • Descriptive statistics
  • Inferential statistics and probability
  • Bayesian inference
  • Unsupervised learning and clustering
  • Deep learning neural networks

You Will Be Able To:

  • Apply statistical models to real-world problems in meaningful ways
  • Generate visualizations of your data
  • Perform mathematics-based threat hunting on your network
  • Understand and apply unsupervised learning/clustering methods
  • Build Deep Learning Neural Networks
  • Build and understand Convolutional Neural Networks
  • Understand and build Genetic Search Algorithms

You Will Receive with this Course:

  • A supporting virtual machine
  • Jupyter notebooks of all of the labs and complete solutions

This Course Will Prepare You To:

  • Build AI anomaly detection tools
  • Model information security problems in useful ways
  • Build useful visualization dashboards
  • Solve problems with neural networks

Additional Resources:

  • Anaconda
  • TensorFlow (and supporting libraries)
  • Matplotlib
  • VMware Workstation/Player/Fusion

Syllabus (30 CPEs)

  • Overview

    This section introduces some of the terminology in the data science and machine learning fields. It also presents a number of the technologies that are used as data sources. Since the first step in any data science, artificial intelligence or machine learning project is to acquire data, the balance of the section is focused on hands-on exercises to prepare students for these tasks.

    The first necessary skill is the use of Python, our chosen language for this course. The only course prerequisite is a fundamental understanding of Python. If you've written even one line of Python, you are probably knowledgeable enough to get started! We will cover lists, arrays, tuples, dictionaries, and comprehensions, and then begin introducing the NumPy variants!
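
To give a flavor of the refresher material, here is a minimal sketch of tuples, lists, sets, dictionaries, and comprehensions. The connection records and variable names are invented for illustration and are not taken from the course labs.

```python
# Hypothetical connection records as (source IP, destination port, bytes) tuples.
connections = [
    ("10.0.0.5", 443, 1200),
    ("10.0.0.7", 22, 300),
    ("10.0.0.5", 80, 800),
    ("10.0.0.9", 443, 150),
]

# List comprehension: the ports contacted by one host, in original order.
ports_for_host = [port for ip, port, _ in connections if ip == "10.0.0.5"]

# Set comprehension: the distinct source addresses.
hosts = {ip for ip, _, _ in connections}

# Dictionary comprehension: total bytes sent per source address.
bytes_per_host = {
    host: sum(size for ip, _, size in connections if ip == host)
    for host in hosts
}

print(ports_for_host)
print(bytes_per_host)
```

The same filtering and grouping ideas carry over to array libraries, which is roughly where the refresher heads next.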

    Following the Python "refresher", we'll provide some theory followed immediately by hands-on exercises to give you just enough knowledge of SQL, MongoDB, and web scraping to get real work done.
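
As one example of what "just enough SQL" can look like, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The `logins` table, its columns, and the sample rows are assumptions made up for this illustration; a real data source would have its own schema and driver.

```python
import sqlite3

# In-memory database standing in for a real log store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (src_ip TEXT, username TEXT, success INTEGER)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?, ?)",
    [
        ("203.0.113.4", "root", 0),
        ("203.0.113.4", "admin", 0),
        ("203.0.113.4", "root", 0),
        ("198.51.100.7", "alice", 1),
        ("198.51.100.7", "alice", 0),
    ],
)

# Count failed logins per source address, most active first.
rows = conn.execute(
    """
    SELECT src_ip, COUNT(*) AS failures
    FROM logins
    WHERE success = 0
    GROUP BY src_ip
    ORDER BY failures DESC
    """
).fetchall()
conn.close()

for src_ip, failures in rows:
    print(src_ip, failures)
```

Letting the database do the filtering and aggregation keeps the Python side small, which matters once the tables hold millions of rows.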

    Exercises
    • Python Refresher
    • Accessing, Manipulating, and Retrieving SQL Data
    • Accessing, Manipulating, and Retrieving NoSQL Data: MongoDB
    • Webscraping for Data Acquisition

    Topics
    • Data Science
    • Python
    • SQL
    • NoSQL
    • Webscraping

  • Overview

    This section begins with the fundamentals of statistics that matter for data science, artificial intelligence, and machine learning. We'll quickly move to hands-on exercises that provide practical uses for these techniques against real-world data.
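
A short sketch shows why both means and medians matter. The transfer sizes are hypothetical, and the cutoff of 5 median absolute deviations is an arbitrary choice for this example, not a recommendation from the course.

```python
import statistics

# Hypothetical outbound transfer sizes in kilobytes; the last one is an outlier.
sizes_kb = [12, 15, 11, 14, 13, 12, 16, 950]

mean = statistics.mean(sizes_kb)      # dragged upward by the single large value
median = statistics.median(sizes_kb)  # barely affected by it

# Median absolute deviation (MAD): a robust counterpart to standard deviation.
mad = statistics.median([abs(x - median) for x in sizes_kb])

# Flag values that sit more than 5 MADs from the median (assumed cutoff).
outliers = [x for x in sizes_kb if abs(x - median) / mad > 5]

print(mean, median, mad, outliers)
```

One extreme value pulls the mean far from the typical transfer size, while the median and MAD stay put. That stability is what makes robust measures useful on messy security data.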

    The course section then transitions to probability theory, which is an extensive field of its own. Following the introduction of some fundamentals, the course works directly toward deriving Bayes' theorem. Building on this introduction, students then engage in a hands-on lab that builds a useful Bayesian analysis tool that they will improve upon later in the course.
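
Bayes' theorem states that P(H | E) = P(E | H) P(H) / P(E), where P(E) = P(E | H) P(H) + P(E | not H) P(not H). The sketch below applies it to a single word in an email. The percentages are invented for illustration and the function name is hypothetical; this is not the tool built in the lab.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(H | E) from P(H), P(E | H), and P(E | not H) via Bayes' theorem."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical figures: 20% of mail is phishing; the word "verify" appears in
# 60% of phishing messages and in 5% of legitimate ones.
p_phish_given_word = posterior(prior=0.20, likelihood=0.60, likelihood_given_not=0.05)

print(round(p_phish_given_word, 3))
```

With these assumed numbers, seeing the word raises the probability of phishing from 20 percent to about 75 percent. Chaining such updates over many words is the basic idea behind a naive Bayes filter.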

    The remainder of this section involves translating the statistical knowledge gained into the field of signals analysis. After a discussion of the derivation and applications of the Fourier series, the Fast Fourier Transform, and the Discrete Fourier Transform, students will use these tools in a real-world threat hunting activity.
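
To illustrate how frequency analysis can expose periodic behavior such as beaconing, here is a minimal sketch that applies the Discrete Fourier Transform definition directly. It uses the slow O(N^2) form for clarity rather than a Fast Fourier Transform, and the traffic series is synthetic, so treat it as an illustration of the idea, not the method used in the lab.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Return |X[k]| for k = 0 .. N/2 using the direct O(N^2) DFT definition."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        mags.append(abs(total))
    return mags

# Synthetic series: connections per minute over 64 minutes, with a burst
# lasting 3 minutes out of every 8 on top of a constant background rate.
samples = [5 + (20 if minute % 8 < 3 else 0) for minute in range(64)]

mags = dft_magnitudes(samples)

# Skip k = 0, which only reflects the average level of the series.
peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
period_minutes = len(samples) / peak_bin

print(peak_bin, period_minutes)
```

The strongest non-zero frequency bin corresponds to the burst interval. Real traffic is noisier, so in practice the peak has to be judged against the surrounding bins.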

    Exercises
    • Statistics Fundamentals: Medians and Means
    • Statistics Fundamentals: Variance, Deviations, and Robust Measures
    • Applications of Statistics to Data Identification
    • Probability, Bayes, and Phishing
    • Threat Hunting through Signals Analysis

    Topics
    • Statistics
    • Robust Measures
    • Probability
    • Bayes' Theorem and Inference
    • Fourier Series and Related Derivations

  • Overview

    The remaining 18+ contact hours of this course are spent learning about and immediately applying various machine learning models. After each topic is introduced and discussed, students engage in lengthy hands-on labs to develop an intuitive understanding and apply the technique to real problems.

    This section begins with various clustering approaches and unsupervised machine learning. The exploration starts with Support Vector Classifiers, kernel functions, and Support Vector Machines. Following this discussion and exercises, we continue the clustering theme by considering the K-Means and KNN approaches. After working through examples in just two or three dimensions, we turn our attention to methods for determining the ideal number of clusters. With this done, we finally explore high-dimensional applications and dimensionality reduction through Principal Component Analysis.
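
The sketch below shows K-Means and the elbow idea in plain Python. The data points, the deterministic choice of starting centroids, and the fixed iteration count are all simplifying assumptions for illustration; library implementations use smarter initialization and convergence checks.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-Means on 2-D points; returns (centroids, inertia)."""
    # Deterministic start: pick k points spread evenly through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    # Inertia: total squared distance from each point to its nearest centroid.
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

# Hypothetical (logins per hour, megabytes sent) for eight users in two groups.
points = [(1, 2), (2, 1), (1, 1), (2, 2), (9, 8), (8, 9), (9, 9), (8, 8)]

inertias = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

Inertia falls sharply from one cluster to two and only slightly after that. That bend is the "elbow" suggesting two clusters for this toy data.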

    The balance of this section is spent discussing Decision Trees. After a hands-on activity and discussion of the limitations of Decision Trees, we expand into Random Forests and explore hands-on how these provide better inferences in most cases. The section wraps up with a cluster-based approach to finding anomalies in user activity on a network.
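
At the heart of a decision tree is the search for a split that makes the resulting groups as pure as possible. The sketch below finds one such split using Gini impurity on a single made-up feature. The function names and data are hypothetical. A Random Forest repeats this kind of search across many trees built on resampled data and random feature subsets, which is not shown here.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted impurity) for the best `value <= threshold` split."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical feature: failed logins per hour; label 1 marks a compromised account.
failed_logins = [0, 1, 2, 3, 12, 15, 20, 30]
compromised = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(failed_logins, compromised)
print(threshold, impurity)
```

A single tree keeps splitting until it fits its training data closely, which is one source of the limitations mentioned above.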

    Exercises
    • Support Vector Classifiers
    • Support Vector Machines
    • K-Means/KNN
    • Elbow Functions and PCA
    • Decision Trees
    • Random Forests
    • Finding Anomalies: Clustering

    Topics
    • Support Vector Classifiers
    • Support Vector Machines
    • Kernel Functions
    • Principal Component Analysis
    • K-Means
    • KNN
    • Elbow Functions
    • Decision Trees
    • Random Forests
    • Anomaly Detection

  • Overview

    The entire focus of this section is on the theory, development, and use of supervised learning approaches in the field of information security. Building on the mathematics and statistics covered in section 2, this course section begins with linear regressions and ends with an introduction to Convolutional Neural Networks.

    The material is focused on using supervised machine learning and mathematics to create predictive models. The initial discussion and exercises center around forecasting and trends analysis for anomaly detection. Following this, most of the material focuses on classification problems.
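
As a small illustration of trend-based anomaly detection, the sketch below fits a straight line by ordinary least squares and flags observations that stray too far from the forecast. It uses a hand-written degree-1 fit rather than a library routine. The query counts and the tolerance value are assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical history: daily DNS queries (in thousands) for days 0 through 6.
days = [0, 1, 2, 3, 4, 5, 6]
queries = [10, 12, 14, 16, 18, 20, 22]

slope, intercept = fit_line(days, queries)

def is_anomalous(day, observed, tolerance=3.0):
    """Flag an observation more than `tolerance` away from the fitted trend."""
    expected = slope * day + intercept
    return abs(observed - expected) > tolerance

print(slope, intercept)
print(is_anomalous(7, 24), is_anomalous(7, 40))
```

Higher-degree polynomial fits follow the same pattern. The hard part in practice is choosing a tolerance that reflects the normal scatter of the data.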

    Building on the Bayes approach used in section 2, this course section introduces deep learning neural networks and fully connected dense networks through the development of a far more accurate phishing detection network. Following this, we'll explore visualization and measurement of neural network training performance, in addition to discussing overfitting and overtraining and how to identify (and avoid!) them.
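
Dense networks are built from many copies of a very simple unit. The sketch below trains a single perceptron with the classic learning rule on a tiny, linearly separable toy set. The two features and the labels are invented for illustration, and a real phishing detector would use many more features and layers.

```python
def predict(weights, bias, features):
    """Step-activation perceptron: 1 if the weighted sum is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train(samples, labels, learning_rate=1.0, max_epochs=1000):
    """Classic perceptron learning rule; stops once an epoch has no mistakes."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            if error:
                mistakes += 1
                # Nudge the weights toward the correct answer for this sample.
                weights = [
                    w + learning_rate * error * x for w, x in zip(weights, features)
                ]
                bias += learning_rate * error
        if mistakes == 0:
            break
    return weights, bias

# Hypothetical features per message: (number of links, contains "urgent").
samples = [(0, 0), (1, 0), (1, 1), (3, 1), (4, 0), (5, 1)]
labels = [0, 0, 0, 1, 1, 1]  # 1 = phishing

weights, bias = train(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(predictions)
```

Fitting the training set perfectly, as this toy does, says nothing about performance on unseen mail. That gap is the reason for measuring training performance and watching for overfitting.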

    The next portion of this section turns to categorical problems. Students will build a real-time network protocol classification system and, more importantly, implement anomaly detection in this classification system, a task typically reserved for unsupervised approaches.

    The final portion of this course section will introduce Convolutional Neural Networks. Further exploration of these continues in section 5.

    Exercises
    • Polyfit Regressions
    • Hello, World!, Ham vs. Spam
    • Identifying Protocols
    • Protocol Anomaly Detection

    Topics
    • Regression and Fitting
    • Loss and Error Functions
    • Vectors, Matrices, and Tensors
    • Fundamentals of the Perceptron
    • Dense Networks
    • Auto-Encoders
    • Convolutional Neural Networks
  • Overview

    The final section of the course picks up right where section 4 left off: Convolutional Neural Networks. Students begin by exploring the applications of Convolutional Networks to Natural Language Processing in the form of the Ham vs. Spam problem, generating a highly accurate tool for distinguishing one from the other.
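
The core operation in a convolutional layer is a small filter sliding across its input. The sketch below shows that operation in one dimension over a sequence of per-token numbers. The token scores and the hand-picked filter are purely illustrative assumptions; in a trained network the inputs would be learned embeddings and the filter weights would be learned as well.

```python
def conv1d(sequence, kernel):
    """Valid (no padding) 1-D cross-correlation, as used in CNN layers."""
    width = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

def relu(values):
    """Rectified linear activation: negative responses become zero."""
    return [max(0.0, v) for v in values]

# Hypothetical per-token scores for the message below (not real embeddings).
tokens = ["please", "verify", "your", "account", "now"]
scores = [0.1, 0.9, 0.2, 0.8, 0.7]

# A hand-picked filter that responds to two high-scoring tokens in a row.
kernel = [1.0, 1.0]

feature_map = relu(conv1d(scores, kernel))
pooled = max(feature_map)  # global max pooling keeps the strongest response

print(feature_map, pooled)
```

Because the same filter is applied at every position, it can react to a pattern wherever it appears in the message.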

    The major focus of this section is on the creation of a deep neural network using TensorFlow's functional pattern for both testing the quality of and solving CAPTCHAs. Whether you are on a red, blue, or purple team, you will learn how to think through and use machine learning to solve what amounts to a computer vision problem, and solve it at greater than 95 percent accuracy! After this, we'll explore a different way to think about the problem that results in even greater accuracy with far less training time.

    The final portion of the section investigates genetic algorithms as they can be applied to machine learning problems.
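
A genetic algorithm can be sketched in a few lines: score candidates, keep the fittest, and breed replacements through crossover and mutation. The target bit string, population size, mutation rate, and generation count below are arbitrary choices for illustration. In a machine learning setting the candidates might instead encode hyperparameters or feature subsets.

```python
import random

TARGET = [1] * 20       # hypothetical goal: a bit string of all ones
rng = random.Random(0)  # fixed seed so runs are repeatable

def fitness(candidate):
    """Number of positions that match the target."""
    return sum(c == t for c, t in zip(candidate, TARGET))

def crossover(a, b):
    """Single-point crossover: head of one parent, tail of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(candidate, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in candidate]

population = [[rng.randint(0, 1) for _ in TARGET] for _ in range(30)]
history = []  # best fitness seen at the start of each generation

for _ in range(40):
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    parents = population[:10]  # selection: keep the fittest third
    children = [
        mutate(crossover(rng.choice(parents), rng.choice(parents)))
        for _ in range(len(population) - len(parents))
    ]
    population = parents + children  # elitism: parents survive unchanged

best = max(population, key=fitness)
print(history[0], history[-1], fitness(best))
```

Because the fittest candidates are carried over unchanged, the best score can never get worse from one generation to the next.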

    Exercises
    • Ham vs. Spam, CNN Style!
    • Solving CAPTCHAs
    • Genetic Algorithms

    Topics
    • Convolutional Neural Networks
    • Functional Definition of Neural Networks
    • Deep Learning Networks with Multiple Outputs
    • Thinking about Machine Learning Problems
    • Genetic Algorithms

Prerequisites

A basic knowledge of Python or some similar scripting language is needed. You need not be proficient, but you should have written at least a handful of simple scripts at some point in your life.

Laptop Requirements

A laptop with at least 8 gigabytes of RAM is required for this course. The course relies heavily on the Anaconda Individual environment, which you should pre-install before attending class. It also makes use of a VMware-based virtual machine. A VMware-based virtualization solution such as VMware Player, VMware Workstation, or VMware Fusion is required.

Special note for those with an M1 Mac: Even though it is not possible to run the virtual machine on an M1 Mac, the course exercises are fully supported on your device! Please identify yourself to your instructor at the start of the course so that the instructor can provide you with alternate instructions for some of the labs.

If you have additional questions about the laptop specifications, please contact laptop_prep@sans.org.

Author Statement

"AI and machine learning are everywhere. How do the vendor solutions work? Is this really black magic? I wrote this course to fill an enormous knowledge gap in our field. I believe that if you are going to use a tool, you should understand how that tool works. If you don't, you don't really know what the results mean or why you are getting them. SEC595 is a crash course in statistics, mathematics, Python, and machine learning that will take you from zero to being a - I'm reluctant to promise 'hero', so let's just say to being a competent person who can solve real problems today!" - David Hoelzer

Register for SEC595

  • In Person

Training events and topical summits feature presentations and courses in classrooms around the world.

  • Live Online

Live, interactive sessions with SANS instructors over the course of one or more weeks, at times convenient to students worldwide.

  • OnDemand

Study and prepare for GIAC Certification with four months of online access. Includes labs, exercises, and support.
