ltechnologygroup.com is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to amazon.com.
Mandiant’s Nick Bennett and Jake Valletta discussed data stacking
at MIRcon™ last month. If you were unable to attend the talk, we
will discuss this data analysis technique here on the M-Unition
What is Data Stacking?
Data stacking is the
application of frequency analysis to large volumes of similar data
in an effort to isolate and identify anomalies. In short, data
stacking is an investigative technique that can be used to find a
needle in a digital haystack. It involves an iterative process of
reducing large amounts of data into manageable chunks that can be
consumed and investigated.
At Mandiant we frequently
investigate large organizations with hundreds of thousands of hosts.
Looking at this data with no pre-processing and reduction, such as
data stacking, is infeasible. We commonly use data stacking to
discover unknown malware and new Indicators of Compromise (IOCs)
during our investigations.
When investigating targeted
attacks, malicious data should be much less common throughout the
enterprise than benign data, allowing investigators to concentrate
their efforts on audit rows that appear different. (We will discuss
the “different” in different rows later in the post.)
Given a large set of data, the investigator determines what
attributes make the unusual data rows stand out and may indicate
they are malicious. These attributes then become the grouping
criteria used to create the frequency analysis calculations. The
frequency, or count, is then used to determine investigative
starting points for possible anomalies within the enterprise.
Data Stacking in Four Steps
There are four
basic steps involved in the data stacking process. Most of the time
we perform steps three and four several times – grouping, filtering
and reducing the data on different sets of attribute as discussed
below. The four basic steps are as follows:
First, we acquire data from across the environment to
find anomalies across all hosts. It is imperative to get data from
as many hosts in the environment as possible. At Mandiant we
typically use Mandiant Intelligent Response® (MIR), but
the data can come from any source. As long as you have a
sufficiently large amount of similar data from multiple systems, the
data can be aggregated and analyzed. In the examples given below we
discuss stacking of data collected with MIR — specifically the
We usually skip step 2 listed above.
However, we mention it to show the iterative approach to data
reduction. This step is stacking the data without reducing the data
set to interesting attributes, which is explained in step 3
At step 3, we are trying to determine a good subset of
attributes for stacking. The list of attributes in a standard MIR
Services Audit is shown below:
The investigator’s goal is to determine a set of
attributes that can be used to stack the data. If you pick something
that is unique to every system you will end up with a grouping that
means nothing. For example, using the process ID of a running
service is meaningless for stacking. If you group and count services
based on that attribute, you will end up with an even distribution,
with approximately the same counts for every row. Also, the fact
that something is running with a particular process ID, with no
other context, doesn’t help a great deal in determining the nature
of the process.
Likewise, if you pick an attribute set that
is not sufficiently unique, you’ll run into the opposite problem:
not being able to identify the “evilness” because you
don’t have enough data. For instance, if you only stack (or group)
on the service name, you may be able to identify malware that is
installed with random names, but chances are you will not find
malware that uses more sophisticated techniques to disguise itself.
So, only looking at the names, will not find a malicious service
that is called “Disk Driver” (a common service name), but
is installed in C: EMP. If you were to group the data on service
name and service path you would be able to easily identify such
Determining a useful set of attributes to group the
data by, thus reducing the data set, is a balancing act. There are a
number of standard attribute sets that investigators at Mandiant
use, but even then, a lot depends on the environment, the type of
malware, the type of data available to the investigator and so
That is why you would typically repeat the grouping/data
reduction/analysis process more than once, employing different
attribute sets and different ways to slice the data.
A set of
attributes commonly used for Service Audits by Mandiant
Using this smaller grouping of attributes, we are able
to obtain better and more manageable frequency distribution counts.
This distribution makes flagging rows for more detailed analysis
much easier. This makes analysis easier because the stacked data now
has rows at each end of the spectrum. Rows that have been seen on a
large number of systems (typically ignored) and rows that have only
been seen on a few systems. As we learned earlier, these rows that
we have only seen on a few systems are probably where targeted
attackers are hiding. With data stacking, false positives are not
uncommon, and reducing the number of rows that require investigation
makes discovery of unknown threats more feasible.
investigators may choose different attribute groupings for their
stacking purposes. Most, however, have the same goal in mind: data
reduction in an effort to make manual inspection of results possible
and manageable in a relatively short time frame.
Last is step
4, looking for anomalies, which can be very tedious. Some rows
(frequently mass malware) will immediately jump out of the frequency
distribution, making identification easy. Sophisticated attackers
are typically more subtle and therefore, harder to discover. We hope
to give you a few pointers in discovering these camouflaged
As an example, let’s consider a small set of services
data that we obtained and stacked on service name, service path and
the example above, we have two rows representing the
“Seclogon” service. One has nearly 5,600 systems with the
same attributes. We know the count roughly represents systems
because the service names must be unique, so a single system cannot
have two services with the same name allowing that system to
increment the count more than once. The other row has a much lower
count of two. From the table, we can see the difference lies in the
naming of the Service DLL, seclogon.dll vs. selogon.dll (note the
missing ‘c’). In this situation the investigator would acquire the
“selogon.dll” file to determine if it’s malicious.
The iprip service exhibits the same distribution: one row has a
high incidence and two rows that are uncommon. The uncommon rows are
worth investigating further for different reasons. The first has an
odd spelling of the Service DLL name and the other has the DLL in a
temp directory — a common location where attackers place malicious
service DLLs. It is easier for malware authors to drop files into
these directories than the more protected directories, such as
It is important to note here that just because
something is not like the others doesn’t automatically mean that it
is evil. For example, it is common to find a few services that are
located in C:winnt. Windows NT 4 systems are not commonly found
these days but are still legitimate installs. Therefore, services on
NT systems commonly show up in the unique rows because they are rare
in 2012, but that doesn’t make them malicious. Uniqueness provides a
starting point for more thorough investigation and that is one of
the main purposes of data stacking.
Consider the following
psexec service listed above, in and of itself, is not malicious, but
is frequently used by attackers to execute tools on systems
throughout an enterprise. However, depending on the network
administration practices, the presence of this service may or may
not be a red flag. It might be that the red flag in this example is
the presence of three different versions. Or perhaps that is
expected due to other administrators using different versions.
Determining the concern level at this point requires further
Another services audit where we are
concentrating on the Service DLLs:
Here we see a listing of services named “bits”.
Looking at the table above we see that each row has a unique MD5.
Using a feature in MIR, the agent tries to verify the signature of
the file associated with the service. Signatures are used to ensure
that a file is the same as what the original author created. If a
single bit in the file changes so will the signature, and therefore
it would not match a previously known good signature. Shown above,
we see one service where the agent could not verify the signature.
This could mean several things: that the file was modified, that the
file does not have a signature, or that the MIR agent was not able
to properly verify it. Either way, this is worth further
In the situation where the MIR agent is unable
to properly verify a signature (for whatever reason: an error, lack
of internet connectivity and so on) you may find other rows that
have the same MD5 and a verified signature. That means you can
ignore the fact that a signature didn’t verify correctly for one row
because another row has the exact same file that did verify.
We have shown you just a few examples of the usefulness and data
reduction capabilities of performing data stacking. At Mandiant we
will stack and slice and analyze the services data set several times
using different sets of attributes: analyzing services as a whole;
and then, service paths separately; then, service DLLs and
associated MD5s separately. The process is more art than science and
there is no universal procedure. Generally, it depends a lot on the
environment and the attacker, but practice makes perfect.
At Mandiant, we do not limit our
data stacking analysis to service audits. We stack many different
types of audits in an effort to discover all aspects of a
compromise. For example, we have stacked particular registry keys
(think runkeys and shim cache entries), file data from file listing
audits (careful! too much data alert!), user accounts and even
Master Boot Records (MBR) from physical disks to identify
In all instances, the goal is the same: group and
reduce the data set to identify the outliers and make manual
analysis of an entire environment feasible.
At L Technology Group, we know technology alone will not protect us from the risks associated with in cyberspace. Hackers, Nation States like Russia and China along with “Bob” in HR opening that email, are all real threats to your organization. Defending against these threats requires a new strategy that incorporates not only technology, but also intelligent personnel who, eats and breaths cybersecurity. Together with proven processes and techniques combines for an advanced next-generation security solution. Since 2008 L Technology Group has develop people, processes and technology to combat the ever changing threat landscape that businesses face day to day.
Call Toll Free (855) 999-6425 for a FREE Consultation from L Technology Group, https://www.ltechnologygroup.com.