2020 Edelman Finalist: IBM

Predictive Analytics for Server Incident Reduction in the Wild

While the Internet has become a major backbone of our society, the computers that support it can and will break.

IBM creates value for clients by providing integrated solutions and products that leverage data, information technology, and deep expertise in industries and business processes, backed by trust and security and a broad ecosystem of partners and alliances.

IBM solutions typically create value by enabling new capabilities for clients that transform their businesses and help them engage with their customers and employees in new ways. These solutions draw from an industry-leading portfolio of consulting and IT implementation services, cloud, digital and cognitive offerings, and enterprise systems and software, which are all bolstered by one of the world’s leading research organizations.

Among IBM’s many businesses, Global Technology Services (GTS) is the main focus of our work. IBM GTS operates and manages some of the world’s largest data centers, and thereby some of the world’s most mission-critical workflows and franchises. GTS helps clients along their journey to the hybrid cloud, leveraging the best of their existing systems in the context of the regulatory, security, and workflow requirements of their industry.

Every day, GTS operates many thousands of computers for IBM clients. To help manage these systems, GTS and IBM Research have developed and deployed a system that uses deep analytics and operations research (O.R.) to identify and propose remedies that keep IT systems operational. This solution, Predictive Analytics for Server Incident Reduction (PASIR), has been deployed in more than 360 IT environments since 2013. These environments, covering every sector from banking to travel to e-commerce, are serviced by IBM support groups.

Today, incidents occurring on servers, along with descriptions of the problems and their resolutions, are documented in an account-specific ticket management system. First, PASIR classifies the incident tickets of an IT environment, using their descriptions and resolutions, to identify high-impact incidents that describe server outages and performance degradation. Second, it correlates the occurrence of these high-impact tickets with server properties and utilization measurements, using statistical multivariate analysis and simulation, to identify troubled server configurations and prescribe improvement actions. In this contribution, we present the findings from deploying our machine learning solution in the wild.
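The two steps can be illustrated with a deliberately small Python sketch. It is not the PASIR implementation: the source does not say which classifier or statistical model is used, so a keyword match stands in for ticket classification and a per-configuration incident rate stands in for the multivariate analysis. The keyword list, the field names (description, resolution, server, os), and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical keyword list; the source does not say how PASIR classifies tickets.
HIGH_IMPACT_KEYWORDS = ("outage", "server down", "unresponsive", "high cpu", "out of memory")


def is_high_impact(description, resolution):
    """Step 1 (simplified): flag tickets whose text mentions an outage or degradation."""
    text = f"{description} {resolution}".lower()
    return any(keyword in text for keyword in HIGH_IMPACT_KEYWORDS)


def high_impact_rate_by_config(tickets, servers, key):
    """Step 2 (simplified): high-impact tickets per server, grouped by one server property."""
    # Count how many servers share each value of the chosen property.
    server_count = defaultdict(int)
    for props in servers.values():
        server_count[props[key]] += 1

    # Count high-impact tickets against the property value of the affected server.
    ticket_count = defaultdict(int)
    for ticket in tickets:
        if is_high_impact(ticket["description"], ticket["resolution"]):
            ticket_count[servers[ticket["server"]][key]] += 1

    return {value: ticket_count[value] / n for value, n in server_count.items()}


# Toy data, invented purely for illustration.
servers = {
    "srv1": {"os": "legacy"},
    "srv2": {"os": "legacy"},
    "srv3": {"os": "current"},
    "srv4": {"os": "current"},
}
tickets = [
    {"server": "srv1", "description": "Server down after reboot", "resolution": "Replaced disk"},
    {"server": "srv1", "description": "Password reset request", "resolution": "Reset password"},
    {"server": "srv2", "description": "High CPU on database host", "resolution": "Restarted service"},
    {"server": "srv2", "description": "Application unresponsive", "resolution": "Added memory"},
    {"server": "srv3", "description": "New user account", "resolution": "Account created"},
    {"server": "srv4", "description": "Intermittent outage reported", "resolution": "Patched firmware"},
]

rates = high_impact_rate_by_config(tickets, servers, "os")
print(rates)
```

In this toy data the legacy configuration shows more high-impact tickets per server than the current one, which is the kind of signal that would mark a configuration as a candidate for an improvement action. A real analysis would consider many properties and utilization measurements jointly rather than one at a time.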

We describe the PASIR methodology, from ticket classification to the recommendation of modernization actions. We also demonstrate the model's effectiveness by comparing its predictions of the impact of prescriptive actions with actual system improvements. We have applied PASIR to more than 840,000 servers since 2013, resulting in more precise refresh spending and greater environment stability, and saving our clients an estimated $7 billion over that period.
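As a rough illustration of comparing predicted and actual improvements, the sketch below computes a mean absolute error over a handful of environments. The source does not name the metric it uses, so the choice of mean absolute error is an assumption, and the numbers are made up solely to show the calculation.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and observed values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up values: fractional reduction in high-impact incidents per environment.
predicted_reduction = [0.25, 0.5, 0.125]
actual_reduction = [0.25, 0.25, 0.25]

mae = mean_absolute_error(predicted_reduction, actual_reduction)
print(mae)
```

A smaller error means the predicted impact of an action tracked the observed improvement more closely.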