Probability Management: Rolling up operational risk at PG&E

Emerging discipline provides an overall risk snapshot that allows diverse stakeholders to assess tradeoffs between safety, reliability and cost.

Utility companies operate thousands of miles of pipelines carrying flammable natural gas, creating a tradeoff of risk mitigation and an increase in rates. Image © | 123rf.com

By Jordan Alen, Christine Cowsert Chapman, Melissa Kirmse, Farshad Miraftab and Sam Savage

Utility companies operate thousands of miles of pipelines carrying flammable natural gas as well as power lines containing high-voltage electricity. The industry is highly regulated and must weigh risk mitigation investments against increases in utility rates. This places these firms in an environment where diverse stakeholders (regulators, rate payers, line repair crews, etc.) need to understand the tradeoffs between cost and various categories of risk, including financial, safety and reliability. It would benefit the industry to work toward a standardized consolidated risk statement that aggregates operational risk across the physical assets of the system in the manner that a consolidated financial statement aggregates the financial health of the firm.

Risks Weren’t Additive – Until Now

Until recently, there has been no practical way to add risks together into a consolidated total. Here’s why: consider an adverse event that has a 10 percent chance of injuring someone (or, equivalently, injures 1/10th of a person on average). Now suppose the organization is exposed to 10 such risks. Because averages can be added, risk management procedures based on expected values estimate the total risk by multiplying 1/10th by 10 to get an average of one injury across the 10 events.

Although this does give the correct average, risks are not characterized by averages but by extreme outcomes, such as 10 people being injured across the 10 events. Reducing the risk to an average is wrong because it ignores the degree of statistical dependence or independence between events. This is analogous to playing craps with a pair of “average” dice, which can only display 3 1/2 dots (Figure 1).

Figure 1: Average dice (left) and real dice (right).

In fact, an average of 1/10 of an injury per event could represent extremely disparate outcomes: anywhere between one chance in 10 of 10 injuries across the 10 events (if the results are completely correlated) to one chance in 10 billion of 10 injuries (if they are completely independent).
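The disparity is easy to verify with a short Monte Carlo sketch (a hypothetical illustration; the 10 percent probability and the 10 events come from the example above):

```python
import random

random.seed(42)
TRIALS = 200_000
P = 0.1          # each of 10 events injures someone with 10% probability
N_EVENTS = 10

# Completely correlated: all 10 events hinge on a single draw.
correlated = [N_EVENTS if random.random() < P else 0 for _ in range(TRIALS)]

# Completely independent: each event gets its own draw.
independent = [sum(random.random() < P for _ in range(N_EVENTS))
               for _ in range(TRIALS)]

avg_corr = sum(correlated) / TRIALS
avg_ind = sum(independent) / TRIALS
# Both averages come out near 1 injury, yet the chance of the extreme
# outcome (all 10 injured) is about 0.1 when correlated and 0.1**10 --
# one in 10 billion -- when independent, far too rare to ever appear
# in a simulation of this size.
```

The two distributions share an average but have radically different tails, which is exactly the information an expected-value rollup throws away.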

An alternative to rolling up risk with averages is to use simulation. Traditionally, this has required a single monolithic simulation that spans the entire organization. These systems have been effectively employed in the insurance industry and financial engineering, but they involve significant investment. Like all large programs, they run the risk of collapsing under their own weight or of being too inflexible to maintain. No wonder so many risk management procedures are ineffective [1, 2, 3].

However, government agencies must regulate risk in areas as diverse as banking, food, pharmaceuticals, transportation and utilities. This is not just an academic exercise. When a regulatory agency underestimates risk, it jeopardizes public safety. When it overestimates risk, it imposes excessive mitigation costs, penalizes the consumer and stifles the economy [4].

A Modular Approach

The emerging discipline of probability management communicates uncertainties as arrays of trials called SIPs (Stochastic Information Packets), which may be added up across simulations like numbers [5]. This means that enterprise-wide risk simulations may now be broken down into Lego block-like modules created on a wide variety of software platforms, and then rolled up. Recently, native Microsoft Excel has become powerful enough to perform calculations with SIPs, a process known as SIPmath, without add-ins or macros [6]. To leverage these developments, ProbabilityManagement.org, a 501(c)(3) nonprofit, has created open, cross-platform standards and tools to help aggregate risk calculations systemwide [7, 8]. So today, instead of large monolithic risk simulations, SIPs of various risks (generated on multiple platforms at multiple levels of the enterprise) may be added, multiplied or used in other calculations, and then rolled up to higher levels.
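As a minimal sketch of the idea (in Python rather than the Excel models the article describes, with invented distributions), SIPs are simply equal-length arrays of trials that combine trial by trial:

```python
import random

random.seed(7)
N = 1000  # trials; every SIP in a shared library uses the same trial count

# Two hypothetical risk SIPs, e.g., annual repair costs ($M) of two assets.
sip_a = [random.lognormvariate(0, 0.5) for _ in range(N)]
sip_b = [random.lognormvariate(0.2, 0.4) for _ in range(N)]

# Rolling up: SIPs are added trial by trial, so any statistical
# dependence baked into the trials is preserved in the total.
total = [a + b for a, b in zip(sip_a, sip_b)]

# Probabilistic outputs fall out of simple formulas over the array,
# e.g., the chance the combined cost exceeds a $4M limit.
chance_over_limit = sum(t > 4 for t in total) / N
```

Because the result is itself a SIP, it can be passed on and combined again at the next level up, which is what makes the modules composable.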

Large firms have begun to develop internal communities of practice around probability management, and the discipline was recently recognized by Gartner Inc. as “transformational” [9]. The time for the consolidated risk statement has arrived.

A Hypothetical Risk Statement

Figure 2 displays a fully “rolled up” Excel model of a conceptual consolidated risk statement (CRS) for a multi-level organization with three categories of operational risk. Financial risk accounts for the direct monetary consequences of facility and equipment failure measured in millions of dollars over the upcoming year. Safety risk is measured in the number of injuries over the same period, and reliability risk is measured in terms of minutes of lost service per customer. The effects of various mitigations may be tested using check boxes, which instantly run simulations of 1,000 trials to display the residual risk post-mitigation.

Figure 2: Top level of a hypothetical consolidated risk statement.

The model is based on a hypothetical SIP library of the joint probability distribution of risks across an entire enterprise. Because the output SIPs are simply named spreadsheet ranges, they may be used in any Excel formula. Thus, it is easy to include a wide array of probabilistic outputs. At the fully rolled up level we have included only three, which contain the chances that pre-specified limits will be breached in each area of risk. Conditional formatting has been applied that turns the cells green when the chance is zero, and red when the chance is 50 percent.

The authors suggest that you download this model at ProbabilityManagement.org, since vicarious simulation is not as good as the real thing.

Figure 3 displays the fully “drilled down” view of the same model exposing the multiple levels of the organization. Rolling up and drilling down are accomplished with the Group and Ungroup tools on the Data Ribbon in Excel. Ultimately, mitigation decisions will involve tradeoffs between the risk dimensions, which must accommodate the risk tolerances of diverse stakeholders. In this model we have incorporated a graphical method used by Doug Hubbard to display such tolerances [10].

Figure 3: Drilled down view of consolidated risk statement.

Exceedance and tolerance chart. The graphs at the bottom of the CRS show an exceedance graph in blue and a risk tolerance level in red. The blue curve is a direct output of the model and displays the chance that the risk will exceed the number shown on the X axis. The red curve is not an output, but an indication of the organization’s risk tolerance. Where the blue curve is below the red curve, risk is within tolerance; where it is above the red curve, the risk is beyond tolerance. Coming up with such curves is at the crux of risk management.
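A sketch of how an exceedance curve is computed from a SIP and checked against a tolerance curve (the safety SIP and the tolerance numbers below are invented for illustration):

```python
import random

random.seed(3)
# Hypothetical safety SIP: injuries per year across 8 dependent-free events.
injuries = [sum(random.random() < 0.3 for _ in range(8)) for _ in range(1000)]

def exceedance(sip, x):
    """Chance the simulated outcome exceeds x (the blue curve at one point)."""
    return sum(v > x for v in sip) / len(sip)

# Hypothetical tolerance curve (the red curve): the organization accepts
# at most a 20% chance of more than 2 injuries, 5% of more than 4, and
# 1% of more than 6.
tolerance = {2: 0.20, 4: 0.05, 6: 0.01}

# Within tolerance wherever the blue curve sits below the red curve.
within = {x: exceedance(injuries, x) <= t for x, t in tolerance.items()}
```

Evaluating `exceedance` over a grid of x values traces out the blue curve; the negotiation described below is over the red one.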

In fact, an important function of the CRS is to facilitate discussions among the stakeholders and to negotiate a mutually agreeable set of tolerance curves. For example, in this hypothetical model, replacing equipment reduces reliability risk, but due to the added use of repair crews, it increases their safety risk. To satisfy the interests of both customers and repair crews, this model rewards combining safety training with equipment repair.

Risk Management is a Journey, Not a Destination

Technologically we have reached a tipping point at which a consolidated risk statement can be created with everyday computers and software. But how does an organization start down this new path?

Pacific Gas & Electric (PG&E) is a utility company that provides natural gas and electricity to roughly 16 million customers in northern and central California. For them, the journey toward consolidating risk began in September 2010, when a tragic gas pipeline explosion in San Bruno, Calif., became a catalyst for change and improved utility risk management practices. In 2011, a California Senate bill [11] mandated that PG&E, the operator of the pipeline, adopt a safety focus and risk-based decision-making framework. The California Public Utilities Commission (CPUC) required that pipeline operators subject to CPUC rate regulation develop a specific plan to identify and minimize hazards and systemic risk to protect the public and employees when proposing programs in their rate case applications.

As a result, PG&E was directed to move away from characterizing risks as single values and begin using distributions of event outcomes. On the Electric side of the business, an external consultant leveraged the recent improvements in Excel to develop simulations to improve the reliability of electric transmission lines. Meanwhile, the Gas Operations team, under the guidance of another consultant, started developing its own models using the latest generation of free tools available at ProbabilityManagement.org. These tools help create SIPmath models, but all it takes is Excel to run them, so both electric and gas models are compatible from a probability management perspective. This article describes the experience with the Gas Operations team.

A tragic gas pipeline explosion in San Bruno, Calif., became a catalyst for change and improved utility risk management practices. Image © Eric Broder Van Dyke | 123rf.com

In the past, operational risks were often evaluated in isolation, for example, the risk of a transformer fire or a loss of containment in a gas pipe. However, for cost-effective risk reduction, it would be desirable to allocate a total budget to a portfolio of mitigation activities to minimize total risk across the various categories.

The introduction of the third generation of free SIPmath Modeler Tools [12] in 2016 allowed Gas Operations to employ rapid prototyping on a large number of conceptual models. This led to quick wins, and built credibility within the organization. PG&E managers exposed to this work, whether they were new to the company or executives with decades of experience, could easily interact with the models and grasp the concepts being presented.

These models often shed light on unrelated areas. As more models were built, more questions were raised, which led to more models, which led to more questions. In many cases, this resulted in models that had nothing to do with the goal of risk quantification, but instead focused on improved decision-making, improved data visualization or measuring effectiveness of programs.

An asset level model

Figure 1: The model aggregates financial, safety and reliability risk across the assets making up a pipeline.

This model (available for download at ProbabilityManagement.org) aggregates financial, safety and reliability risk across the assets making up a pipeline. Each asset has three SIPs that can be rolled up to view overall risk using the Excel data filter and subtotal formula.

Using the filter function of Excel, the user can view the distribution of impacts under specified conditions. For example, one could select the total risk across all Fittings in the North region by adjusting the filters in cells D6 and H6.

The model also displays percentiles along with a threshold of risk and associated likelihood of exceedance.
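The filter-and-subtotal pattern can be mimicked outside Excel; here is a hypothetical Python sketch with invented asset records (the region and asset-type values mirror the example above):

```python
import random

random.seed(5)
N = 1000  # trials per SIP

# Hypothetical asset table: each asset carries a region, a type, and a
# financial-risk SIP ($M per year).
assets = [
    {"region": region, "type": kind,
     "sip": [random.expovariate(2.0) for _ in range(N)]}
    for region in ("North", "South")
    for kind in ("Pipe", "Valve", "Fitting")
]

# The Excel filter + SUBTOTAL step: sum SIPs trial by trial across the
# assets matching the selected criteria.
def rolled_up(assets, **criteria):
    selected = [a["sip"] for a in assets
                if all(a[k] == v for k, v in criteria.items())]
    return [sum(trial) for trial in zip(*selected)]

north_fittings = rolled_up(assets, region="North", type="Fitting")
mean_risk = sum(north_fittings) / N
```

Because the conditional sum is itself a SIP, percentiles and exceedance probabilities for any filtered slice of the system come straight out of the same array.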

Should the development of a consolidated risk statement be done top-down (estimating probability distributions using subject matter experts and historical statistics) or bottom-up (simulating failures through root cause analysis)? The answer is both. An organization must be comfortable with each approach, and the two are synergistic. Bottom-up models can inform the probabilistic assumptions at the top level, while top-down models can pinpoint areas in which additional detailed analysis is most beneficial.

Following is an overview of a current initiative to optimize the mitigation of external corrosion on transmission pipes. A sidebar story describes how distributions of risks simulated on an asset level (that is, across pipes, valves, fittings, etc.) may be filtered to generate conditional distributions across portions of the system, for example, the sum of all risks across pipes in the North.

The end-to-end process. Gas system assets are typically buried underground, and their condition, operation and performance are highly uncertain. Many methods exist to quantify risk or create distributions of adverse outcomes for gas assets at the component or segment level (e.g., a valve has a probability of sticking closed with a distribution of subsequent hours of service interruption, or a specific segment of pipe has a likelihood of failure with a distribution of potential safety consequences).
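The valve example can be expressed as a compound SIP, in which severity is drawn only on trials where the event occurs (all numbers here are invented):

```python
import random

random.seed(11)
N = 1000

P_STICK = 0.02   # hypothetical annual chance the valve sticks closed

# Severity materializes only on trials where the event occurs: hours of
# service interruption, drawn from an invented lognormal distribution.
outage_hours = [
    random.lognormvariate(2.0, 0.8) if random.random() < P_STICK else 0.0
    for _ in range(N)
]

frequency = sum(h > 0 for h in outage_hours) / N   # roughly 2% of trials
```

The resulting array carries both the likelihood and the consequence distribution in one SIP, so it rolls up with other component SIPs by ordinary addition.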

The Gas Operations team arrived at a six-phase probabilistic framework to roll up external corrosion risk across PG&E’s gas transmission pipeline network, which, after component testing, is now under implementation. Instead of being developed from purely written plans, the effort is based on functioning prototype models, which fall into four categories:

  • “Paper airplanes” provide the team with a common understanding of some mathematical principles relating to risk.
  • “Balsa airplanes” look like real models in that they have user interfaces and data requirements suitable for managerial decisions.
  • “Eiffel Towers” are huge models whose purpose is to test how big they can get before the limits of scalability cause them to collapse.
  • “Interactive blueprints” are greatly simplified working models of the entire system.

Figure 4 displays an interactive blueprint of the external corrosion system. It moves through a six-phase process that begins with a database of historical asset conditions and concludes with an optimized set of mitigation portfolios. In practice, each of the six phases would be a separate model, database or application, perhaps not even created on the same platform. Each would take input SIPs from the previous phase and pass output SIPs to the next.

Figure 4: Interactive blueprint of the end-to-end corrosion mitigation system.

Phase 1 – initial conditions: This is a database describing the conditions of the assets as of the date of last inspection. The data can be as detailed as the records of thousands of small corrosion pits detected by sensitive electronic monitoring equipment run through the pipes. It also contains information on the locations and soil conditions of the pipes, as well as densities of the populations surrounding them.

Phase 2 – time dependence: Probabilistic corrosion growth models estimate the current and future conditions of the assets.

Phase 3 – loss of containment simulation: Adverse events such as leaks or ruptures are simulated based on the current conditions and external factors such as earthquakes.

Phase 4 – conditional consequences: Distributions of consequences are simulated based on SIPs of adverse events generated in Phase 3.

Phase 5 – mitigation strategies: The results of various mitigation strategies are simulated.

Phase 6 – optimization: Optimization is performed graphically in the blueprint. In practice, stochastic optimization would be applied to the SIPs generated in Phase 5, but for this small model it is instructive to keep it interactive. The scatter plot displays cost vs. risk reduction for each combination of mitigations. The green dot displays the currently selected portfolio. An efficient frontier is observable in the southwest region of the graph.
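The last two phases can be sketched in miniature: enumerate mitigation portfolios, apply each to a baseline risk SIP, and keep the portfolios that no other portfolio beats on both cost and risk reduction (mitigation names, costs and effects are invented for illustration):

```python
import itertools
import random

random.seed(13)
N = 1000

# Hypothetical Phase 3/4 output: baseline risk SIP ($M per year).
baseline = [random.expovariate(0.5) for _ in range(N)]

# Hypothetical mitigations: name -> (cost in $M, fraction of risk removed).
mitigations = {"recoat": (2.0, 0.25), "replace": (5.0, 0.50), "monitor": (1.0, 0.10)}

portfolios = []
for combo in itertools.chain.from_iterable(
        itertools.combinations(mitigations, r)
        for r in range(len(mitigations) + 1)):
    cost = sum(mitigations[m][0] for m in combo)
    keep = 1.0
    for m in combo:
        keep *= 1 - mitigations[m][1]          # effects combine multiplicatively
    residual = [r * keep for r in baseline]     # post-mitigation SIP (Phase 5)
    reduction = (sum(baseline) - sum(residual)) / N   # average risk removed
    portfolios.append((cost, reduction, combo))

# The efficient frontier (Phase 6): portfolios not dominated on both axes.
frontier = [p for p in portfolios
            if not any(q[0] <= p[0] and q[1] > p[1] for q in portfolios)]
```

In a production system, each phase would read input SIPs from the previous module and write output SIPs for the next, and a stochastic optimizer would replace the exhaustive enumeration.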

Conclusion

In the past year, PG&E has developed probabilistic models in native Excel within both Electric and Gas Operations. The open SIPmath standard assures that these models may be used collaboratively to provide an overall risk snapshot that allows diverse stakeholders to assess tradeoffs between safety, reliability and cost.

As George Bernard Shaw said, “The single biggest problem in communication is the illusion that it has taken place.” This is particularly true in the area of enterprise risk management. A consolidated risk statement would create a common language of risk that would make risks transparent to everyone involved.

Jordan Alen is a technology coordinator at ProbabilityManagement.org and a risk consultant with experience in utility regulation and aggregated simulation models.

Christine Cowsert Chapman is the senior director of Asset Knowledge & Integrity Management in Gas Operations at Pacific Gas and Electric Company (PG&E). She is responsible for developing the strategic direction of PG&E’s risk, asset and integrity management programs that are applied to all of PG&E’s natural gas assets.

Melissa Kirmse is director of Operations at ProbabilityManagement.org. She has more than 20 years of experience in project coordination and administration, as well as technical writing and editing, at tech companies such as Microsoft and TiVo.

Farshad Miraftab is the senior risk analyst for Gas Operations Risk Management at PG&E. He is responsible for developing data-driven approaches to identify and mitigate operational risks as well as investment optimization solutions that prioritize Gas Operations portfolio based on risk, financial and other operational constraints.

Sam L. Savage, Ph.D., is the executive director of ProbabilityManagement.org, author of “The Flaw of Averages – Why We Underestimate Risk in the Face of Uncertainty,” and an adjunct professor at Stanford University. He is the inventor of the open SIP data structure that allows simulations to communicate with each other across platforms.

References

  1. Sam Savage, 2009, “The Flaw of Averages,” John Wiley & Sons.
  2. Douglas Hubbard, 2009, “The Failure of Risk Management: Why It’s Broken and How to Fix It,” John Wiley & Sons.
  3. Philip Thomas, Reidar B. Bratvold and J. Eric Bickel, 2014, “The Risk of Using Risk Matrices,” Society of Petroleum Engineers, SPE Economics & Management, Vol. 6, Issue 2, April 2014.
  4. Stephen Breyer, 1993, “Breaking the Vicious Circle: Toward Effective Risk Regulation,” Harvard University Press.
  5. Sam Savage, Stefan Scholtes and Daniel Zweidler, 2006, “Probability Management,” OR/MS Today, February 2006, Vol. 33, No. 1.
  6. Sam L. Savage, 2012, “Distribution Processing and the Arithmetic of Uncertainty,” Analytics Magazine, November/December 2012 (http://viewer.zmags.com/publication/90ffcc6b#/90ffcc6b/29).
  7. Melissa Kirmse and Sam Savage, 2014, “Probability Management 2.0,” OR/MS Today, October 2014, Vol. 41, No. 5 (http://viewer.zmags.com/publication/ad9e976e#/ad9e976e/32).
  8. Sam L. Savage, 2016, “Monte Carlo for the Masses,” Analytics Magazine, September/October 2016 (http://analytics-magazine.org/monte-carlo-for-the-masses/).
  9. https://www.gartner.com/doc/3388917/hype-cycle-data-science-
  10. Douglas Hubbard and Richard Seiersen, 2016, “How to Measure Anything in Cybersecurity Risk,” p. 47, John Wiley & Sons.
  11. California SB-705 Natural Gas: Service and Safety, October 2011.
  12. http://probabilitymanagement.org/tools.html.