Intrusion Detection FAQ: Statistical based approach to Intrusion Detection

Jamil Farshchi

Network Intrusion Detection Systems (IDS) monitor computer network traffic and attempt to identify, alert, and present all anomalous activity to the user. The basic premise is that if a transmission is not allowed on the network, the IDS will have the ability to recognize and report the illegitimate traffic. The key to any Intrusion Detection System is to maximize accurate alerts (true-positive) while at the same time minimizing the occurrence of non-justified alerts (false-positive). This is much easier in theory than in practice, as attested by the variety of intrusion detection methods. These methods include but are not limited to Artificial Immune System [7], Control-Loop Measurement [8], Data Mining [9], Statistical [24], and Signature-Based (Rule-Based [25]). The most popular of these methods is Signature-Based Intrusion Detection. While there are many approaches to intrusion detection, this document specifically focuses on Statistical-Based Intrusion Detection Systems, Spade, and the deployment of Spade in concurrence with a current IDS.

Signature-Based systems
Some of the more popular Signature-Based IDSs are NFR [11], RealSecure [12], Dragon [13], Snort [14], and Cisco Secure IDS [15]. Signature-Based Intrusion Detection has been shown to have many benefits, such as the potential for low alarm rates, accuracy of detection, and detailed textual logs [4]. With verbose signatures, it is relatively simple to identify specific packets of interest. For example, it would be trivial to write a rule to alert on all TCP packets with the SYN flag set. Not all IDSs allow independent rule development, but some, like Snort and Dragon, accept user-created rules. Nearly all IDS vendors provide rules for their products, with the number of signatures usually in the range of 500 to 1500 or more. Rules are developed over time as the security community identifies new vulnerabilities and scanning techniques. The extensiveness of a vendor’s rule set and the speed with which it is developed are a good benchmark for how effective the IDS will ultimately be.
While the Signature-Based approach to intrusion detection is acceptable, it leaves much to be desired. With vendors releasing new signatures on a weekly or daily basis, it is difficult for an already overburdened security professional to keep up to date with the latest rule sets. A far more serious shortcoming of the Signature-Based approach is the inability to detect new and previously unidentified attacks. A Signature-Based IDS is only as strong as its rule set, and if the attack is new, there will simply not be any signatures developed to identify the probe. Signature-Based Intrusion Detection also has a limited ability to detect port scanning. In fact, most IDSs use a rudimentary approach whereby the system generates an alert if X events of interest are detected across a Y-sized time window [16]. By limiting the number of packets targeted at a network over a specified time frame, an attacker can easily escape detection by the IDS.
These deficiencies are inherent in the Signature-Based model, which is why different methods of detection are needed to address the inadequacies of the Signature-Based approach.
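The “X events in a Y-sized time window” rule can be sketched in a few lines of Python. This is only an illustration of the general idea, not the logic of any particular product; the function name, the probe timings, and the values chosen for X and Y are all hypothetical.

```python
from collections import deque


def window_alerts(timestamps, max_events, window_seconds):
    """Return the times at which at least max_events events (X)
    fall inside the trailing window of window_seconds (Y)."""
    recent = deque()
    alerts = []
    for ts in sorted(timestamps):
        recent.append(ts)
        # Discard events that are older than the window ending at ts.
        while ts - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= max_events:
            alerts.append(ts)
    return alerts


# Hypothetical fast scan: 20 probes, one per second.
fast_scan = [float(i) for i in range(20)]
# Hypothetical slow scan: the same 20 probes, one every 10 minutes.
slow_scan = [i * 600.0 for i in range(20)]

print(window_alerts(fast_scan, max_events=10, window_seconds=60))
print(window_alerts(slow_scan, max_events=10, window_seconds=60))
```

With these assumed settings the fast scan trips the rule while the slow scan produces no alerts at all, which is the evasion described above.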

An Introduction To The Statistical Approach
Statistical-Based Intrusion Detection Systems (SBIDS) can alleviate many of the aforementioned pitfalls of a Signature-Based IDS. Statistical-Based systems rely on statistical models, such as Bayes’ Theorem [26], to identify anomalous packets on the network. To identify an anomaly, the system uses data compiled from previous network behavior. Since warnings are based on actual usage patterns, statistical systems can adapt to behaviors and therefore create their own usage-pattern rules. The usage patterns dictate how anomalous a packet may be to the network. Anomalous activity is measured by a number of variables sampled over time and stored in a profile. In short, the SBIDS identifies and tracks patterns and usage of the network data and then assigns an anomaly score to each packet. The reporting process then generates an alert if the packet’s anomaly score is greater than or equal to the threshold level set by the user; otherwise, the IDS simply ignores the trace.
As an example, let’s say that every morning, you wake up and read the morning paper that is waiting outside the door. After a few days or weeks of this behavior, it becomes normal; you expect the paper to arrive at the door in the morning. One morning, the paper is not waiting at the doorstep. Instead, the paper is lying in the driveway. This is not normal; it is clearly anomalous activity, but probably not enough to warrant investigation. Now, let’s say you continue to see approximately the same pattern of a few papers landing on the driveway every week. Then, one day, you wake up to no paper at all, or even worse, the paper is thrown through the window. Neither of these events is normal, and both would warrant some degree of investigation.
If an anomaly number is associated with these events, we can begin to see how a SBIDS works. The action of receiving a paper at the door in the morning would be deemed “normal” activity. The system would recognize the pattern and learn that this is normal behavior. Other activities would be judged based on the number of occurrences and how “unique” they were in relation to normal activity. The importance of the threshold level is shown in this example as well. If the threshold is set to a low number, the SBIDS would generate an alert for any discrepancy from the norm, so there would be an alert when the paper landed on the driveway. If it is set too high, an alert would be created only when the paper broke through the window (and maybe not even then). Optimally, a report will be generated on all significant anomalous activity. What constitutes “significant” can and will vary from user to user. Therefore, it is ultimately up to the user to decide how many alerts are generated for a specific environment.
The particular environment is crucial to the proper functioning of a SBIDS. The SBIDS will “learn” what is “normal” for a network. Each Statistical-Based IDS in every individual environment will alert to discrepancies based on its specific knowledge of the network at hand. The benefit of this approach is that the system does not need predefined signatures to identify an anomaly on the network; instead, the IDS is free to flag anything it deems unusual. For example, H4x0r has a brand new exploit she wants to use on the network. She launches the attack knowing that there is no signature for this exploit because the vulnerability was found recently. If one of the systems is exploitable by the attack, it will be compromised and no alert will be generated, because a Signature-Based IDS has no signature with which to recognize this new attack.
If, on the other hand, there is a Statistical-Based IDS in addition to the current Signature-Based IDS, the results of the attack would differ greatly. The SBIDS would see the packets and may recognize that their properties were inconsistent with the traffic that usually traverses the network. Following this detection, the Statistical-Based system would compute a high score for the packets in the attacker’s packet stream (like the newspaper breaking through the window), which would lead to an alert. While notification of an attack on the systems is a highly desirable feature for an IDS, so too is the detection of an enemy trying to enumerate the network through portscanning.
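The newspaper example can be turned into a small Python sketch. It assumes one simple scoring rule, the negative base-2 logarithm of an event’s observed relative frequency, with add-one smoothing so that never-seen events still receive a finite score. The class name, the event counts, and the threshold are hypothetical and chosen only for illustration; a real SBIDS may score quite differently.

```python
import math
from collections import Counter


class FrequencyProfile:
    """Learns how often each event occurs and scores how unusual an event is."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, event):
        self.counts[event] += 1
        self.total += 1

    def score(self, event):
        # Add-one smoothing: every known event, plus one bucket for
        # "anything never seen before", gets one extra count.
        p = (self.counts[event] + 1) / (self.total + len(self.counts) + 1)
        return -math.log2(p)


profile = FrequencyProfile()
for _ in range(60):          # hypothetical: 60 mornings with the paper at the door
    profile.observe("doorstep")
for _ in range(8):           # hypothetical: 8 mornings with the paper on the driveway
    profile.observe("driveway")

threshold = 4.0              # hypothetical user-chosen alert threshold
for event in ("doorstep", "driveway", "window"):
    s = profile.score(event)
    print(event, round(s, 2), "ALERT" if s >= threshold else "ignored")
```

Under these assumptions only the never-seen “window” event crosses the threshold; lowering the threshold to 2.0 would also alert on the driveway deliveries.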

A SBIDS can provide a more accurate notification of portscanning activities. Portscan detection is a byproduct of the way in which SBIDS gather data, because a scan will be anomalous. At least some of the portscan is likely to be highly anomalous traffic relative to the usual traffic distribution. If a scan packet has unusual features (for example, it is a crafted packet), this will be still more true [1]. With this in mind, even portscans that are distributed over a lengthy time frame will be recorded because they will be inherently anomalous. SBIDS give us the ability to detect portscanning packets with much greater accuracy than the “X packets in a Y-sized time frame” method that Signature-Based (Rule-Based) IDSs must rely on. The problem with the Statistical-Based system is not the detection of the portscan packets; they will be identified, as any other anomalous activity on the network will be. The problems lie in the dissemination and correlation of the data once it is collected. Correlation is beyond the scope of this document, but Silicon Defense is currently developing a correlation engine called Spice. Refer to the Silicon Defense web site for more information.

While there are many advantages to the Statistical-Based approach, there are also some shortcomings with this technology. To begin, a Statistical system must “learn” what is “normal” traffic for a particular network (SBIDS need a good baseline of network traffic). Unlike a Signature-Based system, which has the benefit of being implemented and immediately utilized, the Statistical-Based systems must initially adapt to the network at hand. The longer a SBIDS is placed on a specific network, the more accurate the results will be (assuming the network traffic doesn’t significantly alter in form). The second issue with the Statistical-Based approach is related to the adaptive nature of the systems. SBIDS detect anomalies based on discrepancies from “normal” network traffic. If the “normal” network traffic is malicious, the SBIDS will be rendered useless. For example, if the SBIDS sees a large number of SYN scans on a network over a period of time, the system will eventually assume that this is normal behavior and cease to alert on the activity. This example, while drastic, is a possible scenario. Finally, the alerts that a SBIDS generates will be relatively difficult to assess compared to those of a Signature-Based system. The alerts will simply be packet information with no immediately obvious reason for the alert. Analyzing them will require the services of a trained security professional with the ability to identify abnormalities in traffic at the packet level. Although Statistical-Based systems have some deficiencies, the positive effects of this technology far outweigh the growing pains that will be experienced upon implementation.

The benefits of the statistical-based approach are threefold. Not only do we now have notification for previously unknown attacks, we also have a system that doesn’t need constant signature updates, and we have a method to detect port scans that span extensive timeframes as well.

The Statistical Packet Anomaly Detection Engine: Spade
Spade is an anomaly detector publicly released under the GNU GPL [20]. It is a Snort [14] preprocessor plug-in that uses joint probability measurements to decide which packets are anomalous. Spade uses Snort’s input/output facilities to grab packets and put them into tables, which are used to determine an anomaly score [1]. The anomaly score is assigned by evaluating fields such as the source IP, source port, destination IP, and destination port. Based on the user-specified threshold level, Spade will either flag the packet or let it pass without notification. The threshold setting is critical in Spade: if it is set too high, the user will miss critical packets; if it is too low, the analyst will see many false positives. Spade also has an option that performs automatic threshold adjustment, letting Spade decide what the threshold should be. Spade can also generate other useful reports, such as a survey of the distribution of anomaly scores and various reports on feature statistics such as entropy and conditional probabilities. For more specifics on how Spade calculates anomaly scores, threshold numbers, and probabilities, refer to the documentation on the Silicon Defense web site [17].
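As a rough illustration of scoring from a joint-probability table, the Python sketch below keeps a count for every (destination IP, destination port) pair and scores each packet as the negative base-2 logarithm of that pair’s observed probability. Both the formula and the choice to record a packet before scoring it are assumptions made for this sketch, not a description of Spade’s source code; the class name, addresses, and packet counts are hypothetical.

```python
import math
from collections import Counter


class JointTable:
    """Counts (destination IP, destination port) pairs seen on the network."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, dst_ip, dst_port):
        """Record one packet, then return its anomaly score."""
        key = (dst_ip, dst_port)
        self.counts[key] += 1
        self.total += 1
        # Assumed scoring rule: rarer pairs get higher scores.
        return -math.log2(self.counts[key] / self.total)


table = JointTable()
for _ in range(989):                          # hypothetical web traffic
    table.observe("10.0.0.5", 80)
for _ in range(10):                           # hypothetical mail traffic
    table.observe("10.0.0.5", 25)

rare = table.observe("10.0.0.5", 27374)       # first packet ever to this port
common = table.observe("10.0.0.5", 80)        # yet another web request

threshold = 4.0                               # hypothetical reporting threshold
print(round(rare, 2), rare >= threshold)
print(round(common, 2), common >= threshold)
```

In this sketch the single packet to an unused port scores close to 10, while routine traffic to port 80 scores near zero and is never reported.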

The most critical output for the security analyst will be the Spade alerts, which look very similar to Snort alerts. The list below shows four Spade-generated alerts.

Review the Snort documentation [23] for specifics on how to read these alerts.

[**] [104:1:1] spp_anomsensor: Anomaly threshold exceeded: 3.8919 [**] 08/22-22:37:00.419813 -> VICTIM.HOST:80 TCP TTL:116 TOS:0x0 ID:25395 IpLen:20 DgmLen:48 DF ******S* Seq: 0xEBCF8EB7  Ack: 0x0  Win: 0x4000  TcpLen: 28 TCP Options (4) => MSS: 1460 NOP NOP SackOK

[**] [104:1:1] spp_anomsensor: Anomaly threshold exceeded: 10.5464 [**] 08/22-22:22:46.577210 -> VICTIM.HOST:27374 TCP TTL:108 TOS:0x0 ID:10314 IpLen:20 DgmLen:48 DF ******S* Seq: 0x63B97FE2  Ack: 0x0  Win: 0x4000  TcpLen: 28 TCP Options (4) => MSS: 1460 NOP NOP SackOK

[**] [104:1:1] spp_anomsensor: Anomaly threshold exceeded: 7.8051 [**] 08/23-23:04:53.051245 VICTIM.HOST:31337 -> TCP TTL:255 TOS:0x0 ID:0 IpLen:20 DgmLen:40 DF ***A*R** Seq: 0x0 Ack: 0x22676B9  Win: 0x0  TcpLen: 20

[**] [104:1:1] spp_anomsensor: Anomaly threshold exceeded: 9.0907 [**] 09/02-01:30:31.545406 VICTIM.HOST:515 -> TCP TTL:64 TOS:0x0 ID:0 IpLen:20 DgmLen:60 DF ***A**S* Seq: 0x16FC5A7F  Ack: 0x529F8CE7  Win: 0x16A0  TcpLen: 40 TCP Options (5) => MSS: 1460 SackOK TS: 124399151 14755839 NOP TCP Options => WS: 0

Note how these alerts differ from ordinary Snort alerts. Spade flags packets based on how anomalous they are, not on a specific signature. So, unlike a normal Snort alert, we do not see an alert name associated with these traces. Instead, we see an anomaly score preceded by an “Anomaly threshold exceeded” message. We can assess how anomalous these packets are by noting the score associated with each packet; the higher the number, the more anomalous the packet. Also, note that packets are flagged only if their anomaly score reaches the set threshold level. The first alert is an attempt to connect to a local web server. There is no web server at the VICTIM.HOST address, so this is unusual activity. Yet, Spade did not flag this packet with a high anomaly score. In this specific case, the low anomaly score is likely due to the Code Red [21] epidemic[1]. The anomaly score of this packet is very low because the system had become accustomed to seeing traffic to port 80. Spade clearly did not consider this packet exceedingly anomalous (Spade likened the port 80 request to the newspaper landing on the driveway, which was anomalous, but not particularly unusual). This packet is an example of a weakness in the Statistical-Based approach. If a large amount of illicit traffic is introduced to a network monitored by a SBIDS, the system will begin to assume this activity is normal and cease to report occurrences of the packet.

The second packet shows a highly anomalous trace. With a score of 10.5464, this packet is highly unusual for the network. The destination port makes it clear why this packet should not be transmitted to the network: no services on the network use port 27374. In fact, further investigation shows that this port is usually associated with the Sub Seven Trojan [22]. Therefore, the packet warrants investigation, and Spade correctly assigned a high anomaly score to the trace.

The third and fourth headers are two more examples of alerts that may be generated by Spade. The difference between Spade and Snort alerts lies primarily in the fact that Spade packets will not immediately identify the reason for capture. An analyst will initially have to analyze the Spade packets more closely than the Snort traces. They will have to inspect the trace and come to a conclusion as to why the particular packet was selected to become a candidate for investigation. 
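Because Spade alerts carry a score rather than a name, one way to decide which traces to inspect first is to pull the score out of each alert and sort on it. The following Python sketch does this for the four alerts listed above, abbreviated to their first few fields. The regular expression is written against the alert text as it appears in this document and is an assumption about the format, not an official parser.

```python
import re

# Matches the score and the timestamp that follows the closing [**].
ALERT_RE = re.compile(
    r"spp_anomsensor: Anomaly threshold exceeded: (?P<score>\d+\.\d+)\s+\[\*\*\]\s+"
    r"(?P<timestamp>\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d+)"
)


def parse_spade_alert(line):
    """Return (score, timestamp) for a Spade alert line, or None if it is not one."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return float(match.group("score")), match.group("timestamp")


# The four alerts shown above, truncated after the TTL field.
alerts = [
    "[**] [104:1:1] spp_anomsensor: Anomaly threshold exceeded: 3.8919 [**] 08/22-22:37:00.419813 -> VICTIM.HOST:80 TCP TTL:116",
    "[**] [104:1:1] spp_anomsensor: Anomaly threshold exceeded: 10.5464 [**] 08/22-22:22:46.577210 -> VICTIM.HOST:27374 TCP TTL:108",
    "[**] [104:1:1] spp_anomsensor: Anomaly threshold exceeded: 7.8051 [**] 08/23-23:04:53.051245 VICTIM.HOST:31337 -> TCP TTL:255",
    "[**] [104:1:1] spp_anomsensor: Anomaly threshold exceeded: 9.0907 [**] 09/02-01:30:31.545406 VICTIM.HOST:515 -> TCP TTL:64",
]

parsed = [result for result in map(parse_spade_alert, alerts) if result is not None]
ranked = sorted(parsed, reverse=True)   # most anomalous first
for score, timestamp in ranked:
    print(score, timestamp)
```

Sorting this way puts the port 27374 packet at the top of the analyst’s list and the port 80 packet at the bottom.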

[**] [104:2:1] spp_anomsensor: Threshold adjusted to 9.9015 after 2 alerts (of 13) [**]

[**] [104:2:1] spp_anomsensor: Threshold adjusted to 9.7523 after 0 alerts (of 12) [**]

[**] [104:2:1] spp_anomsensor: Threshold adjusted to 8.5722 after 0 alerts (of 12) [**]

[**] [104:2:1] spp_anomsensor: Threshold adjusted to 8.4727 after 0 alerts (of 11) [**]

Above is a sample of the alert logs that show Spade adjusting the threshold automatically. Here, Spade is decreasing the threshold due to a lack of activity. If the adaptive threshold option is not enabled before Spade is run, the threshold stays at a fixed number and the log does not show these entries.

The survey log listed below displays the distribution of anomaly scores over time. The file shows the hour relative to the start of the Spade program, the total number of packets in that hour, the median anomaly score (Median Anom), and the 90th and 99th percentile anomaly scores. This log is created only if specified in the Spade configuration.

60.00 minute interval

Hour  Packet Count  Median Anom  90th Percentile Anom  99th Percentile Anom
1     20            3.629443     9.708243              10.331995
2     16            5.620299     8.082586              8.135222
3     14            7.415492     10.130501             10.333078
4     25            7.001369     10.333560             10.333619
5     22            6.758892     9.193461              10.297281
6     16            3.575038     8.832395              8.947573
7     10            3.562193     8.530327              8.530327
8     8             5.730879     8.109143              8.109143
9     5             3.547780     3.548970              3.548970
      8             3.542491     7.570529              7.570529
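To make the survey columns concrete, the Python sketch below summarizes one interval’s worth of anomaly scores into a packet count, a median, and the 90th and 99th percentiles. The percentile rule used here (nearest rank) is an assumption, since this document does not say how Spade computes its percentiles, and the scores in the example are hypothetical.

```python
import statistics


def nearest_rank(sorted_scores, percent):
    """Nearest-rank percentile: the smallest score with at least
    `percent` percent of the scores at or below it."""
    n = len(sorted_scores)
    rank = max(1, (percent * n + 99) // 100)   # integer ceiling of percent*n/100
    return sorted_scores[rank - 1]


def summarize_interval(scores):
    """Return (packet count, median, 90th percentile, 99th percentile)."""
    ordered = sorted(scores)
    return (
        len(ordered),
        statistics.median(ordered),
        nearest_rank(ordered, 90),
        nearest_rank(ordered, 99),
    )


# Hypothetical anomaly scores observed during one 60-minute interval.
hour_scores = [3.5, 3.6, 3.6, 4.1, 5.0, 7.4, 8.1, 9.2, 10.1, 10.3]
count, median, p90, p99 = summarize_interval(hour_scores)
print(count, median, p90, p99)
```

A low median together with a high 99th percentile, as in the first row of the survey above, suggests that most traffic is routine and a small number of packets are highly unusual.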

The log.txt file is of importance in that it displays, at minimum, the number of packets that Spade accepted (analyzed) and the number of alerts generated.

Below is an example of the log.txt file output; the results are typical of what would be seen if Spade were executed in probability mode 3 (edited for brevity).

392 packets recorded

51 packets reported as alerts

Threshold learning results: top 200 anomaly scores over 23.58361 hours

Suggested threshold based on observation: 3.522590

Top scores: 3.52317, 3.52433, 3.52549, 3.52665, 3.52782, 3.52898, 3.53015, 3.53132, 3.53249, 3.53366, 3.53483, 3.53601, 3.53718, 3.53836, 3.53954, 3.54072, 3….10.29728,

First runner up is 3.52201, so use threshold between 3.52201 and 3.52317 for 8.523 packets/hr
P(dip=44044824)= 0.064466877730
P(dip=44044824,dport=1)= 0.000062047043
P(dip=44044824,dport=2)= 0.000077558804
P(dip=44044824,dport=3)= 0.000062047043
P(dip=44044824,dport=4)= 0.000062047043
P(dip=44044824,dport=5)= 0.000062047043

The log begins with basic packet statistics and the threshold learning results, which show how and why Spade arrives at a suggested threshold for a particular period. Towards the bottom of the file, probability statistics are listed, where H = entropy, dip = destination IP, dport = destination port, and P = probability.
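The threshold learning output above can be read as follows: Spade keeps the top 200 anomaly scores, and any threshold between the lowest of those scores (3.52317) and the first runner up (3.52201) would have reported exactly those packets. In the sample log the suggested threshold, 3.522590, is the midpoint of those two values. The Python sketch below reproduces that arithmetic; the midpoint rule is inferred from the sample log rather than taken from the Spade source, and the function name is hypothetical.

```python
def suggest_threshold(scores, target_count):
    """Suggest a threshold that would report exactly the target_count
    highest scores: the midpoint between the lowest reported score
    and the first runner up."""
    if len(scores) <= target_count:
        raise ValueError("need more scores than target_count to find a runner up")
    ordered = sorted(scores, reverse=True)
    lowest_reported = ordered[target_count - 1]
    runner_up = ordered[target_count]
    return (lowest_reported + runner_up) / 2


# A four-score excerpt from the log above; target_count=3 stands in for 200.
sample_scores = [10.29728, 3.52433, 3.52317, 3.52201]
print(round(suggest_threshold(sample_scores, target_count=3), 6))
```

Using the two boundary values from the log, the sketch arrives at the same 3.52259 that Spade suggested.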

In addition to the previously mentioned facilities, Spade also produces binary log output by using the Snort output method. This feature enables the user to go back later and do a more thorough analysis of the actual packet with other tools such as tcpdump [5], ethereal [6], or any other packet analyzer that reads the tcpdump log file format. Spade has a lot of functionality, and because it is built on Snort, the two can be used in conjunction with each other as a dual IDS solution. Snort benefits the network by alerting on packets with known signatures, whereas Spade learns what is normal traffic for the network and alerts on any discrepancies from that norm.

The deployment of Spade is relatively easy but there are a few prerequisites.
  1. A Unix operating system
  2. Packet capture software (Snort)
  3. A computer connected to an active network

The authors of Spade have made it very easy to deploy this SBIDS in addition to a current IDS. Snort is required on the system because Spade is built to utilize Snort’s input/output facilities[2]. All versions of Snort above 1.7 have support for Spade installed by default. The documentation is located in /contrib/Spade-<version>.tar.gz (where <version> is the version of Spade) within the Snort directory of the unzipped snort source tarball. For example, to start by reading the Spade README document, proceed with the following steps:

Change into the Snort contrib directory:
> cd $SNORT/snort/contrib (where $SNORT is the snort root directory)

Untar and gunzip the Spade source:
> tar -xvzf Spade-010818.1.tar.gz

Change into the Spade directory:
> cd Spade-010818.1

Open the README file:
> less README

To upgrade to a newer version of Spade, follow the steps above, but view the Installation file in addition to the README. The upgrade process is detailed in the Installation file; upgrading is a simple two-step procedure.

Once Spade is installed correctly, make a decision as to whether Spade will be run in addition to Snort or as a separate process. The Spade authors advise users to initially try Spade as a separate process, especially if it is on a production system. The differences in configuration are minimal regardless of which method is chosen. Continue by configuring the spade.config file.

Open the spade.config file for editing:
> vi spade.config

The spade.config file is short and direct. The layout of this file is identical to that of the Snort configuration file. Snort actually processes the spade.config file and then hands it to Spade upon completion. The default comment for each variable is descriptive and valuable. If there are any questions regarding the specifics of each option, refer to the Usage file located in the same directory. The primary configuration options in the spade.config file are the threshold and the output methods.

Change the reporting threshold because it is off by default:
preprocessor spade: 4 $SPADEDIR/spade.rcv $SPADEDIR/log.txt 3 50000

All packets with an anomaly score of at least 4.0 will be reported as alerts. The “3” is the probability mode; this mode bases probability on destination IP and destination port. Refer to the Usage file for more specifics on the modes available. The next configuration lines to modify are those for the adaptive threshold feature. Comment them all out and use the static threshold mentioned earlier (4). When testing is complete, it is highly recommended to modify the configuration and use one of the adaptive threshold methods available. The adaptive threshold allows Spade to decide what the optimal threshold level should be. Please review the Usage document to choose which adaptive method is best suited for a particular environment.

#preprocessor spade-adapt3: 0.01 60 168

Enable the reporting options that Spade offers:
preprocessor spade-survey: $SPADEDIR/survey.txt 60
preprocessor spade-stats: entropy uncondprob condprob

The spade-survey option enables the generation of a report that shows anomaly scores produced in the last time interval (an example was listed previously in the Spade section of this document). The spade-stats configuration reports periodically on certain information about the network traffic but will not write to the log.txt file until Spade receives a SIGHUP, SIGQUIT, SIGUSR1 or Snort is exited. Refer to the Usage manual for the specific descriptions of each argument.

The configuration file in its entirety (comments edited out for brevity):

var SPADEDIR /var/log/snort
preprocessor spade: 4 $SPADEDIR/spade.rcv $SPADEDIR/log.txt 3 50000
preprocessor spade-homenet:
#preprocessor spade-adapt3: 0.01 60 168
#preprocessor spade-adapt: 20 2 0.5
#preprocessor spade-adapt2: 0.01 15 4 24 7
preprocessor spade-threshlearn: 200 24
preprocessor spade-survey: $SPADEDIR/survey.txt 60
preprocessor spade-stats: entropy uncondprob condprob
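As a reading aid for the main “preprocessor spade” line, the Python sketch below splits it into the five values discussed in this document: the reporting threshold, the state file, the log file, the probability mode, and the number of packets between checkpoints. The field names are labels invented for this sketch, and the positional layout is taken only from the example line above; consult the Usage file for the authoritative argument list.

```python
from typing import NamedTuple


class SpadeSettings(NamedTuple):
    threshold: float          # report packets scoring at least this much
    state_file: str           # spade.rcv, which maintains state between runs
    log_file: str             # log.txt
    probability_mode: int     # 3 = destination IP and destination port
    checkpoint_packets: int   # packets between checkpoints of the state file


def parse_spade_line(line, variables):
    """Parse a 'preprocessor spade: ...' line, expanding $NAME variables."""
    directive, _, args = line.partition(":")
    if directive.split() != ["preprocessor", "spade"]:
        raise ValueError("not a 'preprocessor spade' line")
    fields = args.split()
    if len(fields) != 5:
        raise ValueError("expected five arguments")

    def expand(text):
        for name, value in variables.items():
            text = text.replace("$" + name, value)
        return text

    threshold, state_file, log_file, mode, checkpoint = fields
    return SpadeSettings(
        float(threshold), expand(state_file), expand(log_file),
        int(mode), int(checkpoint),
    )


settings = parse_spade_line(
    "preprocessor spade: 4 $SPADEDIR/spade.rcv $SPADEDIR/log.txt 3 50000",
    {"SPADEDIR": "/var/log/snort"},
)
print(settings)
```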

Execute Spade by running Snort with the following option:
> /usr/local/bin/snort -c spade.config

Spade should now be monitoring packets on the network. The above command runs Spade as its own process, so as not to interfere with other instances of Snort that may be running. If the Snort IDS and Spade must run at the same time in the same process, the snort.conf file must be modified. The snort.conf section that deals with Spade (commented out by default) will need to be edited to mirror the configuration options in the spade.config file.

To assure everything is working properly, check the specified logging directory (/var/log/snort in the example) to see if the files spade.rcv, survey.txt, and log.txt are present. There will be a spade.rcv file as soon as the process captures the prespecified number of packets – this is called the “checkpointing” process of Spade. In the above example this number would be 50000. The spade.rcv file is what maintains state for the program. So the spade.rcv file should be produced sometime after the initial execution of Spade.
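The check described above is easy to script. The Python sketch below reports which of the three expected files are not yet present in the logging directory; the function name is hypothetical and the directory is the one used in the example configuration.

```python
from pathlib import Path

# Files this document's example configuration is expected to produce.
EXPECTED_FILES = ("spade.rcv", "survey.txt", "log.txt")


def missing_spade_files(log_dir):
    """Return the names of expected Spade output files not found in log_dir."""
    directory = Path(log_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_spade_files("/var/log/snort")
    if missing:
        print("Not present yet:", ", ".join(missing))
    else:
        print("All expected Spade files are present.")
```

A missing spade.rcv shortly after startup is not necessarily a problem, since that file appears only after the first checkpoint.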

For further information regarding installation and configuration, refer to the documentation in the Spade directory or the Silicon Defense web site.

Statistical-Based Intrusion Detection Systems are an extremely effective supplement to a current Intrusion Detection System. The benefits of a SBIDS, like Spade, should not be overlooked. Utilizing Spade adds a second layer of defense. Spade is one of the first tools of its kind that shows the security community the possibilities of Statistical-Based Intrusion Detection. Never before has there been the ability to accurately identify rogue packets by comparing them with what is “normal” for a specific network. Never before has there been a method to easily recognize portscans spanning lengthy time frames. With automated threshold discovery and constant assessment of network activity to identify anomalous traffic, Spade is also a relatively low-labor IDS. SBIDS technology is still in its infancy, though, so there is still a lot of progress to be made in terms of functionality and false-positive control. Nevertheless, by utilizing both a Signature-Based and a Statistical-Based Intrusion Detection System, the vast majority of anomalous traffic on the network will be identified. There is no one silver bullet in the IDS field, but layering the systems and experimenting with new methods of intrusion detection can greatly improve the chances of winning the uphill battle against electronic intruders.

[1] Code Red is a program that exploits a vulnerability in the Microsoft IIS web server. Once a system is compromised with this program it propagates by scanning for other vulnerable hosts on the Internet. When this program was infecting hosts at its peak (July-August, 2001), it flooded the Internet with probes to port 80.

[2] The fact that Spade requires Snort to operate does not imply that Snort must be used as the complementary IDS; any IDS can be used in conjunction with Spade.

References
[1] S. Staniford, J. Hoagland, J. McAlerney. “Practical Automated Detection of Stealthy Portscans.” In: CCS IDS Workshop Athens. November 1, 2000.

[3] A. Sundaram. “An Introduction to Intrusion Detection.”

[4] H. Debar. “What is knowledge-based intrusion detection?” In: Intrusion Detection FAQ.

[5] H. Debar. “What is behavior-based intrusion detection?” In: Intrusion Detection FAQ.

[6] D. Lehmann. “What is ID?” In: Intrusion Detection FAQ.

[7] J. Kim. “An Artificial Immune System for Network Intrusion Detection.”

[8] M. Craymer, J. Cannady, J. Harrell. “New Methods of Intrusion Detection using Control-Loop Measurement.” In: Fourth Technology for Information Security Conference ’96. May 16, 1996.

[9] W. Lee, S. Stolfo. “Data Mining Approaches for Intrusion Detection.” In: Proceedings of the 7th USENIX Security Symposium. 1998.

[10] M. Gerken. “Statistical-Based Intrusion Detection.”






[16] S. Northcutt. Network Intrusion Detection: An Analyst’s Handbook. New Riders, Indianapolis, 1999. p. 125.





[21] R. Permeh, M. Maiffret. “.ida “Code Red” Worm.”

[22] R. Lyttle.

[23] D. Ruiu. “Snort FAQ Version 1.8.”

[24] M. Prabhaker. “Intrusion Detection.”

[25] M. Gerken. “Rule-Based Intrusion Detection.”

[26] R. Lupton. Statistics In Theory And Practice. Princeton University Press, Princeton, NJ, 1993. p. 50.