Thought Leaders

Leigh Purdie, InterSect Alliance, co-founder of Snare: Evolution of log analysis

Stephen Northcutt - January 28th, 2009


We asked Leigh Purdie if he would give us an update on Snare and log analysis, as a follow-up to our interview with him in March 2008, and we thank him for his time.

Leigh, when people use syslog, they don't get a response; normally that is OK, but there are situations where this matters. What do you think the three most critical situations for a response from the syslog server might be? And, related to that, please share your thoughts on the balance between using UDP versus TCP for logging - essentially the trade-off between UDP's susceptibility to spoofing and its lack of delivery confirmation, versus its far lower overhead compared with TCP.

IT Security administrators are always being pulled in two directions - on one hand, you have the wish to implement bullet-proof system security. On the other, system administrators have kitted out a computer to do a particular job, and anything that doesn't contribute directly to the 'core mission' of the system needs to be as resource-efficient as possible, or you'll start seeing the admin team knocking at your door for taking up too much of their bandwidth, CPU or disk space. IT Security specialists always have to compromise, balancing the need for security against the need for an organization to get the work done; the word 'compromise' has needless negative connotations though, so let's use an acceptable industry phrase instead: "a risk-based approach".

UDP vs TCP, or syslog acknowledgment, is a reasonable example of this sort of thing - UDP has some advantages when it comes to sending eventlog data; the startup/shutdown cost of sending an event is minimal, there's slightly less network overhead on a per-event basis, and on the server end you don't have to try and maintain too many open 'streams', which means you can effectively support many more clients sending you data at the same time. That said though, with modern networks and computing hardware, the capacity of most deployed internal organizational networks fairly significantly exceeds day-to-day requirements, so the resource benefits of using UDP over TCP are decreasing.
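As a rough illustration of the plumbing difference (a minimal sketch using Python's standard library; the collector hostname is hypothetical), the same message can be shipped over UDP or TCP simply by changing the socket type the syslog handler uses:

    import logging
    import logging.handlers
    import socket

    # UDP: fire-and-forget - minimal per-event overhead, but no indication
    # of whether the collector ever received the message.
    udp_handler = logging.handlers.SysLogHandler(
        address=("loghost.example.com", 514), socktype=socket.SOCK_DGRAM)

    # TCP: connection-oriented - a dead or unreachable collector is noticed
    # almost immediately, at the cost of one open stream per client.
    tcp_handler = logging.handlers.SysLogHandler(
        address=("loghost.example.com", 514), socktype=socket.SOCK_STREAM)

    log = logging.getLogger("audit")
    log.addHandler(udp_handler)   # swap in tcp_handler where delivery matters more
    log.warning("The enable password has been changed on CISCO Router x")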

It used to be the case that the variety and volume of audit log data sent by many devices was reasonably restrained; systems would generally send out 'critical' information only. Ensuring the delivery of a single event log (eg: "The enable password has been changed on CISCO Router x") could be highly important, so TCP delivery and receipt acknowledgment were pretty high on the priority list, despite the resource cost. These days, devices and operating systems seem to be both more verbose and more holistic in their approach to security information; attacks are identified not necessarily by a single security event, but by a pattern of activity, potentially across many devices or systems. Using our risk-based approach above, the relative value of any one individual event is generally lower when weighed against the potential impact of requiring guaranteed delivery on the computers and networks that graciously host our security tools. So, strangely enough, at a time when the extra resource impact of TCP was noteworthy, the relative importance of TCP delivery was high. Now that the extra resources are no longer as much of a concern, TCP is not so much of a critical requirement.

However, there are certainly places where TCP delivery is very handy. UDP has no easy way of detecting whether the server-end is receiving or not. Sure, we can ping and traceroute, and so on, but in TCP mode, we know practically instantly, when the remote server is no longer available. This is spectacularly useful in today's mobile workforce, when a laptop may only be connected back to base on a sporadic basis. If we use TCP to send log data, we can put a 'bookmark' in our current event delivery schedule, and start back up again when the laptop is connected to the network once more. TCP is also a very handy skeleton on which to layer encryption, non-repudiation, checksums, digital signing, and related technologies.
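The 'bookmark' idea is straightforward to sketch: track how far into the locally spooled event data the collector has accepted, and resume from that offset whenever the laptop reconnects. The following is an illustrative sketch only - the file names, port and framing are assumptions, not how the Snare agent actually implements it:

    import socket

    SPOOL = "events.spool"       # locally queued events, one per line (assumed)
    BOOKMARK = "events.offset"   # byte offset of the last event delivered

    def read_offset():
        try:
            with open(BOOKMARK) as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def deliver(collector=("loghost.example.com", 514)):
        offset = read_offset()
        with open(SPOOL, "rb") as spool, \
                socket.create_connection(collector, timeout=10) as conn:
            spool.seek(offset)
            for line in spool:
                conn.sendall(line)   # a send failure raises, leaving the bookmark untouched
                offset += len(line)
                with open(BOOKMARK, "w") as f:   # advance only after a successful send
                    f.write(str(offset))

    if __name__ == "__main__":
        try:
            deliver()
        except OSError:
            pass   # collector unreachable - pick up from the bookmark next time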

UDP still has its uses - in fact, it is practically essential in some organizations; particularly high security agencies that have a requirement to consolidate logging centrally, from networks that would otherwise be air-gapped. UDP is effectively a requirement for a 'data diode' (a network connection that is physically capable of only transmitting information in one direction). Since there's no return network path available to tell the source server whether the packet arrived or not, TCP can't be used. This is a classic example of our risk based approach dictating our network choice - allowing logs to be centrally managed, at the potential expense of non-guaranteed delivery of any one individual event.

Regardless of whether we're using TCP or UDP, syslog or not - we're certainly seeing a lot more logs come through - upwards of thirteen thousand events per second being sent to a single Snare Server in some locations, from many thousands of hosts, resulting in several billion events being stored per month - and when you're looking at those sorts of volumes, network overhead is much less of a concern.


Leigh, as you say, these archives get huge, what can we do to improve query performance?

We're consistently seeing customers these days with billions of events in their data store, and terabytes of data available to query.
Over time, we've dealt with system bottlenecks by adopting a variety of strategies - when we first started, audit volumes were small enough that we could justify storing our data in a traditional database. It wasn't a perfect fit, by any means; normal databases are optimized on the basis that data will be sporadically added or updated, but queried regularly. We turned that on its head - we were adding data to the database hundreds of times a second, and only occasionally running queries (maybe a few per hour, on average).

As log volumes grew, the database storage paradigm was just not working well enough for us. CPU speed increases were significantly outpacing the increase in speed of storage, and it made a lot of sense for us to use that additional latent power to make queries faster. We've moved our Snare Servers to a system where we compress log data - sometimes at huge compression ratios. This shifts some of the burden of accessing log data away from our (relatively) slow disk, and up to our (faster, and faster) CPUs and RAM. Files are much smaller, we're able to greatly simplify our indexing, and in turn, data seek and access times are significantly decreased. We designed our applications in such a way that each major component is nice and modular, so it could be swapped out if required as advances in technology opened up options for us; our back-end storage system is practically independent of our front-end user interface. Our indexing and metadata subsystem doesn't care how audit events are collected at the front end, so these changes were relatively easy to constrain, and didn't require wholesale changes to the whole system.
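A toy illustration of that trade-off, assuming plain-text event records and gzip compression (the real Snare Server storage and indexing layer is considerably more involved than this):

    import gzip

    # Writing: spend CPU cycles compressing so far fewer bytes hit the disk.
    with gzip.open("2009-01-28.log.gz", "wt", encoding="utf-8") as out:
        for event in ("user jdoe logged on", "group 'Domain Admins' modified"):
            out.write(event + "\n")

    # Querying: decompression is cheap relative to the disk I/O it saves, so
    # scanning the archive is bounded by CPU and RAM rather than spindle speed.
    with gzip.open("2009-01-28.log.gz", "rt", encoding="utf-8") as archive:
        hits = [line for line in archive if "Domain Admins" in line]
    print(hits)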

Small tweaks can also go a long way to improving the speed at which the end user perceives the query is running. Sitting down with people who use log data day-in/day-out can sometimes result in 'light-bulb' moments that can make a big difference in how usable your query tool is; little things like 'query caching', for example, can go a long way. Quite often, queries such as "tell me who has logged in over the course of the last 7 days" will be run fairly regularly - sometimes, on a daily basis. This means that on the second day a query is run, 6 days worth of valid log data has already been processed; if we had saved off the results of the previous query somehow, then we only have to do one seventh of the hard work. Similarly, when our expert users are doing forensics work, they quite often 'gradually narrow' their query to reduce the false positives, based on the results they're seeing. If they have to wait an hour for the complete objective to be processed before they see the first results, that's a lot of wasted time if they just have to tweak the query a little more. If we can return at least SOME results very quickly back to the user (even if they're unsorted), then the user can stop the query from running, go back to the objective, narrow the results, and re-run the query. Although the total time that it would take to run a complete query might not have changed at all, the actual speed at which useful work gets done increases significantly.
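The query-caching idea is easy to sketch, assuming results are cached in per-day files (the file layout and the logons_for_day helper are invented for illustration): re-running the 'last 7 days of logons' query only has to scan days that haven't already been processed.

    import json
    import os
    from datetime import date, timedelta

    CACHE_DIR = "query_cache"   # assumed location for per-day cached results

    def logons_for_day(day):
        # Placeholder for the expensive scan of one day's log archive.
        return ["example logon event for %s" % day]

    def logons_last_7_days(today=None):
        today = today or date.today()
        os.makedirs(CACHE_DIR, exist_ok=True)
        results = []
        for n in range(7):
            day = today - timedelta(days=n)
            cache_file = os.path.join(CACHE_DIR, "logons-%s.json" % day)
            if day != today and os.path.exists(cache_file):   # completed days never change
                with open(cache_file) as f:
                    results.extend(json.load(f))
            else:
                day_hits = logons_for_day(day)
                with open(cache_file, "w") as f:
                    json.dump(day_hits, f)
                results.extend(day_hits)
        return results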

So, there are two keys to significantly improving query performance these days. The first is the ability to recognize the opportunities presented by advances in technology, rather than seeing the changes as a series of hurdles that result in decreasing performance - who knows where we'll find our next speed jump? Maybe utilizing the graphics processing unit for some limited speed-critical tasks? Perhaps by using a neural network at the front end to cull obviously irrelevant events? The second is listening to, and understanding, how people use the system. Following your users' work-flow, rather than forcing them to follow a non-optimal path, can result in some significant performance benefits, and hopefully, happier users.
Tons and tons of text just is not inspiring; have you given much thought to visualization of log data?
With most organizations collecting megabytes, or gigabytes, of log data per day, much of it in human-readable logs that are difficult to break up and analyze programmatically, plain text simply doesn't scale. I haven't met a security administrator yet who can finish the Encyclopedia Britannica while drinking their morning coffee, but that's effectively what quite a few vendors are asking people to do by trying to provide some sort of grammatical nirvana in their logs, at the expense of ease of follow-on processing. If we were talking a few dozen events per day, this approach would probably still be viable. These days, it's just an obstacle.

Add to this the fact that the responsibility for managing critical data is generally migrating away from the operating system to more 'outlying' applications, such as web servers or database engines. This effectively doubles or triples the amount of logging you receive, since the operating system still logs, the web or database server that hosts the application will then provide additional logging, and the application that manages the actual data may log information as well. With technologies such as 'offline synchronization services' (eg: Google Gears) available these days, browser-based logging may soon become a critical requirement for some organizations also.

We can pull data in from most of these sources, though some are ten times harder to mash into shape than others. Once everything is collected, we get to play with the data, and working out how best to present it can be a tough job. The form will be different depending on the type of data we're analyzing, and also who the target audience is.

We've played with many different ways of teasing information out of raw data - parallel coordinate plots, 3d spinning cubes that show source and destination network information, traditional line, pie and bar plots, and many, many more - here are a few examples of the same data source (network IDS logs, in this case), represented 8 different ways:

Figure: 15 minute pattern map, tabular data, Geolocation based on IP Address, port map showing a high-end port scan, 3d spinning java source/destination cube, horizontal graph of alert category, line graph of activity over the last 24 hours, 3d bar graph of alert categories over time.


On the topic of visualization, I previously highlighted that visualization depends a great deal on your target audience. In the past, IT security has been a very centralized, controlled function, usually with some very technically proficient people; you could get away with down-and-dirty representations of logging information, with a reasonable level of confidence that the security administrators would be able to skim the data for useful highlights. Data owners were represented by proxy only - they would rarely see the output of security tools, and were almost never provided with access to the application that did most of the analysis. This strategy is a bit inefficient, when you're not taking advantage of the people who have the most to lose from a potential breach, and who have the most knowledge about how access to the information being protected should be limited. I'm not really talking about operating-system level logging here - that probably needs to remain in the 'specialist' category, in most circumstances; but for the security of a data store, or membership of groups that have access to sensitive information, data owners are a spectacularly good source of corporate knowledge. Current regulatory requirements such as NISPOM, HIPAA or PCI are good examples of where the sort of logging required crosses the traditional 'IT security only' boundaries to reach far into the organization. Log visualization, therefore, needs to be tailored so that data owners, who may know very little about computers in general, but who know a great deal about how their information should be secured, can gain a good understanding of the relative security of their data.

Of course, I mentioned earlier the 'cup of coffee' factor. Security administrators, and data owners, can justify very little time to operational, day to day, log analysis. Good visualization tools, summaries, traffic-light reports, health checkers... all these things are critical to the day-to-day operation of a log analysis engine in a time-constrained environment. It's important to make sure that security teams can drill down to the spectacularly detailed raw log information in situations where forensic analysis is required - since in these circumstances, time is rarely a factor; but on a day-to-day basis, if the security administrator can get a reasonable level of confidence that the objectives he or she has configured aren't reporting anything significant in the time it takes to drink the average caffeine-laden-beverage, then we're on the right track.


As you collect all these messages, what are some things that can be done for front-end analysis? Are facility and severity enough information to work with?

Under Unix, syslog facility and severity have a history of not really being used to classify events of interest - instead, they've generally been employed to distinguish between actual log types more than anything else - eg: Network Time Protocol logs might go to "local1.info" on the local syslog server, whereas your RADIUS authentication server might send its logs to "local3.debug" for some reason. As a result, for better or worse, the actual facility/severity values are rarely of much use from an analysis perspective, and we have to delve deeper into each event (and also holistically, across multiple events) to get any really useful information out of the log stream.
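For reference, the facility and severity are packed into the single <PRI> number at the front of a syslog message (facility * 8 + severity), so they are trivial to recover - even if, as noted, the values often say more about which daemon was pointed at which facility than about the security relevance of the event. A quick sketch:

    def split_pri(pri):
        # Split a syslog <PRI> value into (facility, severity), per RFC 3164.
        return pri >> 3, pri & 0x07

    # "local1.info": facility local1 = 17, severity info = 6  ->  PRI 142
    facility, severity = split_pri(142)
    print(facility, severity)   # 17 6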

One of the real front-end challenges, though, is how to make the process of setting up an 'objective' (ie: a series of query terms that allow you to search for the information that you're looking for) as streamlined and simple as possible - taking the hard work out of diving into those gigabytes of data trying to find the proverbial needles in the haystack. Although we try very hard to make sure that the system is usable out of the box for the majority of users (by providing frameworks for NISPOM, Sarbanes-Oxley, PCI, DCID/DIAM, and so on), there are requirements that might be more localized, or specialized, that we don't have any 'canned objectives' ready for yet (for example, California Senate Bill 1386/AB, or perhaps Danish Standard DS-484:2005). 'Web 2.0' technologies (for lack of a better term), like Ajax, have really helped us bring some of the interface flexibility of traditional applications to our analysis software, without having to throw away all the big advantages that a web-based interface offers; drag & drop, pseudo-windowing environments, and realtime feedback are all great tools that help us take away a bit of the learning curve when it comes to actually using audit-related software.
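Conceptually, an 'objective' boils down to a named bundle of match terms applied to the incoming event stream. A hypothetical, much-simplified sketch (the field names and event IDs below are examples invented for illustration, not Snare's internal representation):

    # A toy 'objective': a name, some match criteria, and a criticality that
    # drives traffic-light style reporting.
    objective = {
        "name": "Changes to sensitive groups",
        "match": {
            "event_id": {632, 636},   # example Windows group-membership change events
            "group": {"Domain Admins", "Enterprise Admins"},
        },
        "criticality": "critical",
    }

    def matches(event, objective):
        return all(event.get(field) in allowed
                   for field, allowed in objective["match"].items())

    event = {"event_id": 636, "group": "Domain Admins", "user": "svc_sql"}
    if matches(event, objective):
        print(objective["criticality"], objective["name"], event)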


Can you share a war story of how a hacker, or malicious insider was caught via log file analysis?

Sure! Unfortunately, quite a few of our customers are pretty high security environments, which means that I can't pass on some absolute doozies, but here are a couple of examples where Snare has helped the data owners track down some inappropriate activity.
  • Outsourced Environments A customer outsourced some of their database administration, on the firm proviso that domain administrator access would not be allowed. Such access would have made the database administrators' job a little bit simpler, but it's not a path that the original customer wanted to take. A member of the outsourcer's team obviously decided that this made the task way too hard, and therefore set up a job that used a SQL service account to add himself into the domain administrators group at around 7pm, once the security team had gone home, then took himself back out again at around 7am, before the team came into the office of a morning.
Snare picked up both changes via two different paths - an 'Authorized members of a group' objective, that takes a snapshot of group membership around midnight each night, and a 'Changes to sensitive groups' objective that watches the Windows event log for group-related modifications to groups that the customer has defined as particularly sensitive. (A minimal sketch of the snapshot-comparison idea appears after these examples.)
  • Inappropriate material A slightly more security-savvy user decided to mask his administrator-level activity by turning off auditing on the hosts that he was using to store inappropriate material, then deleting audit records that may have been incriminating. Snare agents are designed to send log data off the source system as soon as it is generated, so the commands leading up to, and including, the command to stop the audit service were captured centrally, and couldn't be erased. The events that indicated that the auditing subsystem was being shut down were captured by Snare and highlighted for the attention of the security administrator.
A casual glance at an objective that shows the security administrator a snapshot of images downloaded through the organisational web-proxy server revealed a couple of images that were not even slightly work-related. Further forensic analysis, however, showed that the culprit was not an internal employee as originally suspected, but pointed to a previously unknown, unprotected wireless gateway that was just in the right spot to be reachable from a few seats at a nearby cafe.
  • Legal Historical log information can sometimes be useful for unanticipated reasons. A civil claim was launched against a particular company, in which a laid-off employee claimed to be present, logged on, and working, at a particular period of time. The HR department came to see the Snare Server administrator, and the archived logs confirmed that the employee in question was not logged on during that period. The claim was dropped after this proof was made available.
  • Sensitive Information Flow A company suspected sensitive documents were being leaked via electronic mail. Using our Exchange server agent to collect Exchange message tracking log data, in combination with a Windows file/directory tracking objective, the Snare Server was able to find the source of the leak, and the company addressed the problem.
  • Tracking Administrative users A large company, with many lines of business, was having a problem with users being removed from the Enterprise Administrators group when moving to another part of the country, then being re-hired and re-inserted into the Enterprise Administrators group without authorisation. Analysis of the logs, and validation of the group membership snapshots taken by the server, picked up this behaviour, and the company addressed the problem through internal policies and processes.
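To make the snapshot comparison mentioned in the first example concrete, the nightly 'Authorized members of a group' check amounts to a simple set difference; a minimal sketch (the group membership values are purely illustrative):

    def diff_membership(previous, current):
        # Compare two nightly snapshots of a sensitive group's membership.
        added = current - previous
        removed = previous - current
        return added, removed

    yesterday = {"alice", "bob"}
    today = {"alice", "bob", "svc_sql"}     # the 7pm self-addition shows up here
    added, removed = diff_membership(yesterday, today)
    if added or removed:
        print("ALERT: Domain Admins membership changed - added:", added, "removed:", removed)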