Can machine learning improve cybersecurity?

December 16, 2019
Bruce Davie

As machine learning (ML) has matured over the last decade, the range of use cases to which it can be applied has continued to expand. Several factors have contributed to this considerable success in a field that has been around (depending on how you define it) for over 50 years:

  • Dramatic improvements in specialized computing hardware, such as GPUs (graphics processing units) and TPUs (tensor processing units), which are particularly well suited to machine learning algorithms;
  • Availability of such hardware at scale in public clouds;
  • Open source software libraries making ML algorithms widely accessible;
  • Massive data sets that can be used to train ML algorithms such as deep neural networks.

Image classification is the canonical example: rapid advances in ML mean that machines now routinely outperform humans on tasks that seemed out of reach for computers only a few years ago.

Could ML be applied to the field of cybersecurity? The history of cybersecurity is a long series of ever more sophisticated attacks, and an explicit goal of such attacks is to be hard to detect. An ever-expanding range of software and other tools has been developed to prevent or mitigate these attacks. Spending on prevention and mitigation continues to rise, but so too does the cost of security breaches.

Malware has been around for over 30 years, and there is plenty of accumulated knowledge about how attacks are launched and what an attack looks like. Of course, new attacks happen all the time and new strains of malware continue to be created. This is sometimes characterized as an arms race, with attackers always seeking an advantage over the current set of defenses. Today, an increasing number of attacks are “fileless”, i.e., they exploit existing (and trusted) software on the target systems rather than installing malware. Given that we now have a large data set related to cyberattacks, it is reasonable to think that machine learning might be a useful tool to apply to the field.

Interestingly, machine learning algorithms themselves have opened up new avenues of attack. For example, it is possible to add a carefully calculated amount of “noise” to an image that is imperceptible to the human eye yet causes a well-trained machine learning model to misclassify the image. So we need to be aware that if we try to use ML to detect malware, attackers can be expected to take steps to confuse our algorithms, just as they can confuse image classifiers.
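To make this concrete, here is a minimal sketch of such a perturbation in the style of the fast gradient sign method (FGSM). The use of PyTorch, and the model, image, and label placeholders, are illustrative assumptions rather than anything from a specific attack.

```python
# Minimal sketch of an adversarial perturbation (FGSM-style), assuming a
# hypothetical pretrained PyTorch classifier and a correctly labeled image.
# The model, image, and label are placeholders, not from the original post.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=0.01):
    """Return a copy of image with a small, human-imperceptible perturbation."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # Nudge each pixel slightly in the direction that increases the loss;
    # a small epsilon keeps the change invisible to a human observer
    # while often flipping the classifier's prediction.
    return (image + epsilon * image.grad.sign()).detach()
```

The same idea, applied to whatever features a malware detector consumes, is why defenders should expect adversaries to probe and try to evade ML-based defenses.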

Training on “Goodware”

One promising avenue that has recently been explored is to build systems that, rather than trying to detect malware, try to enforce known good behavior. In such a system, we train our ML algorithms with known examples of “well-behaved” software, a.k.a. “goodware”. For example, in a cloud data center, a server might run some version of a Linux operating system (OS) and a web server such as Apache. There are many versions of both the OS and the web server, and they can be configured in a wide range of ways, but it is still possible to train ML algorithms to recognize these known forms of good behavior, just as image recognition software can recognize many different kinds of dogs and cats. If at some point the server displays behavior that the ML algorithm considers anomalous, this could well be a signal that some sort of exploit is underway. An example of this approach was recently described by my colleague, Tom Corn.

By training our algorithms on the known, finite set of acceptable software, which is under our control, we avoid the challenge of trying to recognize the ever-growing millions of strains of malware controlled by adversaries. We can also detect fileless attacks: by learning the normal behavior of our software, the algorithms can alert us when the behavior of our installed software changes unexpectedly.
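To illustrate the idea, the sketch below fits a simple anomaly detector only on observations of known-good server behavior and raises an alert when new behavior falls outside that baseline. The choice of scikit-learn's IsolationForest, the feature set, and the numbers are all hypothetical, offered as one way such a system might be prototyped, not a description of any particular product.

```python
# Minimal sketch of "training on goodware": fit an anomaly detector only on
# observations of known-good behavior, then flag anything it has not seen.
# Features and numbers are hypothetical, not taken from the original post.
import numpy as np
from sklearn.ensemble import IsolationForest

# Per-interval behavior features for a hypothetical Linux + Apache host:
# [processes_spawned, outbound_connections, files_written, cpu_percent]
goodware_behavior = np.array([
    [12, 30, 5, 20],
    [11, 28, 4, 22],
    [13, 35, 6, 18],
    [12, 32, 5, 21],
    # ... in practice, many more observations of normal operation
])

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(goodware_behavior)  # learn only what "good" looks like

new_observation = np.array([[80, 400, 200, 95]])  # sudden burst of activity
if detector.predict(new_observation)[0] == -1:    # -1 means "anomalous"
    print("Alert: behavior deviates from the known-good baseline")
```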

One issue that has consistently hampered cybersecurity is the bane of “false positives”. When cybersecurity systems generate thousands of alerts, it becomes too hard to separate the signal from the noise. It’s been well documented that many high-profile breaches did set off alarms, but the important alarms were impossible to pick out among so many false positives. Training ML algorithms on goodware, especially in environments such as cloud data centers where the set of running software is well controlled, promises to provide a clear signal that stands out from the noise when unexpected software is executed or behaves in an anomalous way. Furthermore, we can train ML algorithms to recognize “normal” changes in behavior, such as patches and upgrades, making it possible to highlight the abnormal changes that might signal an attack.
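Continuing the same hypothetical feature set, the sketch below scores recent activity against a learned goodware baseline and surfaces only the most anomalous interval, which is one simple way to let the strong signals rise above the noise; again, the data and modeling choices are illustrative only.

```python
# Sketch of using anomaly scores to rank alerts so the strongest signals
# stand out, rather than raising an equal-weight alarm for every deviation.
# Baseline data, features, and thresholding choices are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "goodware" baseline: 500 intervals of normal-looking behavior
# using the same hypothetical features as the previous sketch.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=[12, 30, 5, 20], scale=[2, 4, 1, 3], size=(500, 4))
detector = IsolationForest(random_state=0).fit(baseline)

recent = np.array([
    [12, 31, 5, 21],      # ordinary interval
    [14, 29, 6, 19],      # ordinary interval (e.g., routine patch traffic)
    [90, 500, 300, 97],   # far outside the learned baseline
])
scores = detector.score_samples(recent)  # lower score = more anomalous
worst = int(np.argmin(scores))
print(f"Highest-priority alert: interval {worst}, score {scores[worst]:.3f}")
```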

As with other applications of ML, some caution is warranted here: ML is unlikely to be a silver bullet for security. It does, however, have the potential to be a powerful tool in the cybersecurity toolkit. Improving security at massive scale in a fast-changing world means leveraging the best technologies available to manage complexity. ML has become a fundamental tool for making sense of large and dynamic data sets. In the cybersecurity context, it has the potential to help us better manage risk, improve the way we detect and prioritize threats, prevent attacks from spreading, and tilt the playing field in favor of those looking to protect their applications and data.

Bruce Davie, VP and CTO, Asia Pacific and Japan – VMware