Using big data in cybersecurity

Operations research analysts search for ‘cyber subs’

By Douglas A. Samuelson

Defending U.S. cyber assets against adversaries is one of the most critical tasks the U.S. military faces. Source: U.S. Army

Defending U.S. cyber assets such as computer networks, operational systems and intelligence systems against adversaries is one of the most critical tasks the U.S. military faces. The U.S. Army Cyber Command and Second Army (ARCYBER & 2A) has implemented a new, big data approach to address the challenges inherent in this task, exemplary not only for what it is accomplishing but also as a model for how to conduct analytical studies in a fast-paced, complicated setting. The team has created an integrated structure of large data sets, quick connections between them and readily usable tools that enable swift analyses by operators on deployed real-world missions. They can create prototypes in a week and deliver functional web-based analytics at mission-relevant speeds – often in three weeks or less.

In a recent presentation at the Center for Strategic and International Studies, one of the most prominent and respected defense “think tanks” in the country, Lt. Gen. H. R. McMaster pointed to the ARCYBER & 2A team as an especially good example of how analysis should be done in support of military missions. It is his job to know, as he now serves as deputy commander of the Training and Doctrine Command (TRADOC) and director of TRADOC’s Army Capabilities Integration Center. He is noted both as a successful combat commander, especially as a cavalry troop commander in the 1991 Gulf War, and as a provocative, iconoclastic thought leader. He wrote one of the most highly regarded critiques of U.S. military policy and doctrine in the escalation of the Vietnam War [1] and managed to continue to advance in the Army, no small feat.

Lt. Col. Cade Saie leads the small team that built the capability, along with Maj. Isaac Faber, who recently returned to graduate school for Ph.D. studies, and Maj. Ross Schuchard. “Cyber is different,” Maj. Faber explained. “Traditional statistics don’t work because everything is incredibly non-linear. There’s a high false positive rate, so operational commanders lose interest pretty quickly if you can’t do better.”

The characteristics of these systems require adapting operations research approaches, as well as rethinking military tactics. Formal O.R. approaches such as regression and optimization give way to rapid adaptation and the generation of multiple options. Traditional military tactics likewise transform into a rapid feedback loop between defender and adversary, so the operators’ analytical requirements change along with the pace of action.

One of the key proponents of bringing O.R. methods into cyber, Maj. Gen. John Ferrari (U.S. Army Director of Program Analysis and Evaluation), handpicked the team in 2014. Referring back to the roots of operations research, he described the problem as “searching for cyber subs.” As in the World War II search for submarines, the essential idea is to place the analysts with the field commands, close to the situations of interest, and have them work closely with operational commanders to define the challenges, produce prototype solutions and implement them rapidly. Agreeing with Ferrari’s sentiment, U.S. Army Cyber Command’s Lt. Gen. Edward Cardon chartered the creation of a small ORSA (operations research/systems analysis) cell within the command. With this approach, Lt. Col. Saie amplified, “Analytics is then embedded into the daily operations routine.”

The ARCYBER ORSA team spent the first year after its founding in late 2014 doing some traditional O.R. analyses and modeling, developing some cyber operations metrics, and defining requirements for a big data platform that would support the expansion of advanced analytics and cutting-edge O.R. techniques and dissemination of those techniques to operators [2]. “We had a big data platform that couldn’t be leveraged by operators,” Lt. Col. Saie stated. “We needed to create a framework for the ORSA community to participate more readily in analytics with tools widely available in the field such as R or Python.”

The focus so far is solely on defensive operations, such as intrusion detection and response. The effort to date has also been limited to unclassified data, possibly “For Official Use Only,” but nothing more restricted than that. The key is assembling patterns of low-level anomalies that are not of much interest by themselves but might, in combination, indicate something worth investigating.

The Building Blocks: Use Cases

The data platform the team built now integrates several dozen live data streams. Defenders identify use cases, that is, activities that are to some extent out of the ordinary, and they and the analysts then build analytics to address operational needs in response to the use cases. Most of these analytics now integrate regression, clustering, time series and visualizations – and heavily emphasize open source software.

Current data assembly relies on a global sensor grid that relays alerts to a central repository, consolidated by a commercial software product known as a Security Information and Event Management (SIEM) system. Queries can be complicated to formulate and slow to execute, with results that an analyst must then evaluate manually. It is difficult to answer complex questions or support even moderately complex mathematical algorithms. Verifying actions and their effects at multiple levels of activity is also difficult.

Big data technologies enable drastic increases in query speed and data storage limits by leveraging parallel computing. These technologies also create dynamic computing environments to support more advanced analytical tools and methods. Hence, the vision for the future is a federated network of cyber analytics platforms; that is, the data sets are all compatible in terminology and structure and therefore can easily be viewed and studied in combination.

To move toward the new structure, the team gathers problems from the Defensive Cyber Operations (DCO) community as part of the community’s routine functioning. Each problem is then given to a development partner (the Center for Army Analysis, the U.S. Military Academy at West Point, the Naval Postgraduate School or the Air Force Institute of Technology) or kept in-house for resolution via analytic development. Once the first version of an analytic is complete, it is deployed on a big data training system and used and validated by DCO community members. After feedback is incorporated, the revised analytic is deployed onto an operational platform, where it becomes part of the operational workflow for the consuming organization.

The analytics range from simple (sorted counts) to moderate (interactive network flows) to complex (such as Bayesian change point detection). Some of the most immediate impact of the work came simply from observing workflow processes and building capabilities into the analytics that automated routine analyst tasks, such as generating reports. Addressing this time-consuming aspect of the cyber analyst workflow had a double benefit: it helped the team gain operators’ trust and solidified the rapid analytic development framework. Over a short period spent testing the framework, the team worked closely with a small group of personnel with a wide range of specialties to develop a set of use cases and, from there, to produce analytics that helped identify certain types of malicious behavior and thwart numerous unauthorized communication attempts.
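To give a feel for the complex end of that range, here is a minimal sketch of Bayesian single-change-point detection on a series of hourly alert counts. The Poisson/Gamma model, the prior parameters and the data are illustrative assumptions for exposition, not ARCYBER’s actual analytic.

```python
import math

def log_marginal_poisson(counts, a=1.0, b=1.0):
    """Log marginal likelihood of Poisson counts with a Gamma(a, b) prior on the rate."""
    n, s = len(counts), sum(counts)
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + s) - (a + s) * math.log(b + n)
            - sum(math.lgamma(x + 1) for x in counts))

def changepoint_posterior(counts):
    """Posterior over single change-point locations, uniform prior on the split."""
    logp = [log_marginal_poisson(counts[:t]) + log_marginal_poisson(counts[t:])
            for t in range(1, len(counts))]
    m = max(logp)
    w = [math.exp(v - m) for v in logp]          # stabilize before normalizing
    z = sum(w)
    return [v / z for v in w]                    # entry t-1 = P(change at index t)

# Hypothetical hourly alert counts: the rate jumps from ~2/hour to ~8/hour at hour 6
counts = [2, 1, 3, 2, 2, 1, 8, 9, 7, 10, 8, 9]
post = changepoint_posterior(counts)
best = max(range(len(post)), key=post.__getitem__) + 1   # most probable split index
```

Because the Gamma prior is conjugate to the Poisson likelihood, each segment’s marginal likelihood has a closed form, so the whole posterior is computed exactly with no sampling.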

In broad terms, analytics may employ a range of standard descriptive displays, some statistical tools, and innovative data exploration methods to find patterns of activity that are identified as potentially of interest but that would tend to elude more traditional approaches. In Medicare fraud detection, combining data from different types of claims often yields findings that would not have been apparent from just one source – for example, hospital surgery claims without associated claims from a surgeon and an anesthetist, or reasonable-looking numbers of services allegedly delivered within a short time span in several different places. A similar idea of combining disparate data sources and looking for connections among events that seem innocuous by themselves applies in cybersecurity threat detection.
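The Medicare example above amounts to a cross-source join: flag hospital surgery claims that lack a matching claim from a surgeon or an anesthetist. The records and field layout below are hypothetical, just to show the shape of the check.

```python
# Hypothetical claims, keyed by (patient_id, service_date)
hospital_surgeries = {("P1", "2016-03-01"), ("P2", "2016-03-02"), ("P3", "2016-03-02")}
surgeon_claims     = {("P1", "2016-03-01"), ("P3", "2016-03-02")}
anesthesia_claims  = {("P1", "2016-03-01")}

# A hospital surgery with no matching surgeon or anesthetist claim is suspicious
flagged = sorted(key for key in hospital_surgeries
                 if key not in surgeon_claims or key not in anesthesia_claims)
# flagged -> [('P2', '2016-03-02'), ('P3', '2016-03-02')]
```

The same join pattern applies to cyber data: key two event streams on a shared identifier (host, account, time window) and look for records on one side with no plausible counterpart on the other.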

Another parallel to Medicare claims analysis is that the anomaly of interest may not be an outlier. Rather, it might be a number of events, each quite unremarkable by itself, with unusual frequency – or even a set of events with less variation than typical. In Medicare claims, for example, an event of interest could be a provider with a high volume of claims and no claims with values that trigger a range check, when some such values are often observed in general. The absence of typical variation suggests that the provider may be submitting false claims for services that were never rendered; they know enough to fake unremarkable claims, but not to fake typical variation. Similarly, in cybersecurity monitoring, a “too regular” log of activity on the system could be an indicator of a log file being spoofed to conceal an intrusion.
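The “too regular” test can be sketched with a coefficient of variation on inter-event gaps: genuinely random (Poisson-like) activity has a coefficient near 1, while a machine-fabricated log is often far more regular. The threshold and the sample timestamps are illustrative assumptions, not an actual ARCYBER detector.

```python
import statistics

def regularity_flag(timestamps, cv_threshold=0.3):
    """Flag a log whose inter-event times are suspiciously regular.

    Returns (flagged, cv) where cv is the coefficient of variation of
    the gaps between consecutive timestamps (in seconds).
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    cv = statistics.stdev(gaps) / statistics.mean(gaps)
    return cv < cv_threshold, cv

# Hypothetical spoofed log: an event every ~60 seconds, almost no jitter
spoofed = [60 * i + (i % 3) for i in range(20)]
flag, cv = regularity_flag(spoofed)   # flag is True: gaps vary by only a second or two
```

A real deployment would of course calibrate the threshold against known-benign machine traffic, since plenty of legitimate automation (cron jobs, heartbeats) is also perfectly regular.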

These examples do not describe the actual use cases ARCYBER has pursued, but they are meant to illustrate the principles of reasoning in this field. In the view of some people especially knowledgeable in this topic area, too much specificity, even based on unclassified information, could reveal too much to prospective adversaries. ARCYBER produced a report, “The Rapid Analytic Development Framework,” [2] that describes many of the analytical tools and use cases in greater detail, along with a more detailed description of the command and its activities. Although an unclassified version is available, even that version of the report has distribution limitations and must therefore be requested from the organization.

Closely Embedding Analysts with Operators

The examples briefly summarized here and expounded in detail in the RADF report illustrate the kinds of analytics, based on use cases identified by operators, the analytical team has conducted. What is most important, however, is how the analysts do this. “We sit next to the operator,” Maj. Faber says, “and we’re very adaptive. We put an extreme premium on change. We have tight iterative feedback, changing approaches, getting new problems. Our goal is a simple solution evolving to more complex with continual feedback. With this approach, parties stay interested because they stay involved. The end user is involved from inception to delivery.”

The analytical focus is on reducing false positives and identifying low-level events of potential interest. False positives are common and a serious challenge. Maj. Faber recounted, “Routine scans to see how many Windows 10 machines were active on a network set off intrusion alerts.” To detect the subtle elements that do not set off intrusion alerts but are more meaningful, a key analytical approach is finding correlations between heterogeneous data sets. “At some point in the future,” he went on, “we hope offensive and defensive data sets will talk easily, at some level of classification.”
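One simple way to correlate heterogeneous data sets is to bucket two event streams into common time windows and compute a correlation coefficient between the bucketed counts. The streams below (failed logins and outbound DNS spikes) are purely illustrative.

```python
from collections import Counter

def hourly_counts(timestamps, hours):
    """Count events per hour bucket, timestamps in seconds since some epoch."""
    c = Counter(int(t) // 3600 for t in timestamps)
    return [c.get(h, 0) for h in range(hours)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical: the two streams spike in the same hours
failed_logins = hourly_counts([100, 200, 3700, 3800, 7300, 7400, 7500], hours=4)
dns_spikes    = hourly_counts([150, 3650, 3750, 7350, 7450, 7460], hours=4)
r = pearson(failed_logins, dns_spikes)   # strongly positive here
```

Each stream is individually unremarkable; it is the co-movement across sources, surfaced by the correlation, that makes the pattern worth an analyst’s attention.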

Concentrating on operator-identified use cases drove the implementation and the data architecture. The development and improvement of the large, integrated data platform provided the capability to ingest and process mission-relevant data actively and quickly. The team automated the inclusion of network activity reports and other incident data. Standardizing formats greatly eased the task of comparing data across sources. An additional financial benefit was enabling commands to do more analytical tasks in-house rather than having to rely on other agencies or commercial providers.


The Army Cyber Command and Second Army’s Rapid Analytic Development Framework, built on a big data and parallel computing architecture, has produced striking improvements in defensive cybersecurity operations and provides a powerful example of how to integrate OR/MS into a real operating setting. “Placing analysts on station,” integrating them into the operational team to identify and address problems quickly and adaptively, as Philip Morse famously recommended during World War II, remains the most effective approach to using OR/MS professionals’ talents.

Douglas A. Samuelson is president and chief scientist of InfoLogix, Inc., a small R&D and consulting company in Annandale, Va.

  1. H. R. McMaster, 1996, “Dereliction of Duty: Johnson, McNamara, the Joint Chiefs of Staff and the Lies That Led to Vietnam,” Harper.
  2. U.S. Army Cyber Command and Second Army, 2016, “The Rapid Analytic Development Framework.” Point of contact: U.S. Army Cyber Command and Second Army, 8825 Beulah Street, Fort Belvoir, Va.

For more on the topic of cybersecurity from Doug Samuelson, see the September/October 2016 issue of Analytics magazine.