Mining social contact networks for pandemic disease mitigation strategies.

By Mario Ventresca and Dionne Aleman

The process of vaccine development and mass deployment remains a very lengthy process.

Like so many other fields, public health is poised to reap the benefits of data mining technologies. Given the recent concern over influenza pandemics (H1N1) and emergent avian flu strains, this article specifically addresses the potential of analyzing social networks with combinations of data mining, optimization and network science to uncover promising pandemic mitigation strategies.

The World Health Organization’s requirements for a disease to be considered pandemic are that it is able to infect and cause death in humans, as well as being able to spread from human to human with ease, thus resulting in many potential infections across large regions or globally. Typically, the disease in question is novel, and if very severe, may cause hundreds of millions of deaths and incur trillions of dollars in socio-economic costs.

Recently, early detection capabilities have improved, as evidenced by daily updates concerning potentially pandemic diseases such as Middle East respiratory syndrome coronavirus (MERS-CoV) and avian influenza A (H7N9). However, surveillance systems remain far from comprehensive or complete. Moreover, despite the tremendous technological and scientific advances made over the past decade, the process of vaccine development and mass deployment remains a very lengthy process (at least several months). Therefore, it is critical to focus on effective public policies to mitigate disease spread as we wait for vaccines to be developed and distributed through the population.

Wearing facemasks may decrease the likelihood of infection by an airborne pathogen.

Pandemic preparedness and response logistics are a vital component of a comprehensive public health policy. A great deal of research has already been undertaken concerning the ability of specific policies to mitigate certain types of disease spread. For instance, wearing facemasks may decrease the likelihood of infection by an airborne pathogen. Unfortunately, there is no one-size-fits-all, “silver bullet” approach to mitigating disease spread. Simply, the world contains too many diverse regions (cities, towns, states, countries), whose constituents have varied histories of disease exposure, and will therefore act and react differently due to cultural, educational and infrastructure differences, in addition to climate influence on the pathogen, and so on. Socio-economic and infrastructure constraints further confound the situation. Indeed, public health policy officials have an excruciatingly difficult and complex task.

Simply identifying high-risk individuals alone is typically insufficient for mitigating disease spread because they may not be those most responsible for actually spreading the disease, but rather those who are most adversely affected. After all, if the disease is adequately contained then the pandemic will cease, or at least be slowed enough for the opportunity to deploy vaccines. Also, it is highly unlikely that privacy laws will be relaxed to the level that allows for these specific individuals to be identified in real time. Even if so, there is no guarantee an individual will adhere to the suggested policy. For instance, a person may be asked to wear a facemask, but could decide against it for social or personal reasons. Hence, a portfolio of approaches is put forth, such as encouraging the use of facemasks and closing public schools. For moderate to severe pandemics, more sweeping and costly policies may be required.

On the very rare occasion that a vaccine already exists, there will likely be a very limited supply initially, so having the vaccine does not preclude the need for mitigation policies. Indeed, vaccination is only a single component of the broader campaign against the disease spread. Note that we distinguish between a strategy, which indicates who should be targeted (i.e., children under the age of five), and a policy, which indicates how the strategy is realized (i.e., school closure). From the perspective of the public policy official, the problem of pandemic disease mitigation is to devise an effective and efficient portfolio of policies to limit disease spread.

A robust tool to aid public health officials in devising effective policies would be enormously useful. This system could allow for the analysis and comparison of different policy portfolios, in addition to providing insight into more effective strategies that policies could be targeted toward and their subsequent impact. Such a robust system has yet to be fully implemented, but a number of strides have been taken in recent years. In particular, highly detailed agent-based simulations have been quite useful [1], although typically they are burdened by enormous computational resource needs. An alternative approach has recently emerged that abstracts away some of the detail provided by agent-based simulations and instead concentrates on the pathways of disease transmission through the population. That is, if a disease transmits through sexual contact, then we examine a sexual contact network, or if by social interaction, then we examine the social contact network, and so on.

This article provides brief preliminary results within the context of a social contact network, where individuals are the nodes and the connections between them are arcs. The results are based on the Greater Toronto Area (GTA), Toronto, Canada, which has approximately 5.5 million residents. While we focus on pandemic disease, the system itself can easily be adapted for devising strategies against any diffusive process over a network.

Identifying Policy Targets

The approach is composed of three stages [5]. First, the fusion of publicly available information and knowledge of social behavior and mixing patterns leads to an efficient algorithm for generating a representative large-scale social contact network. It is possible to calculate the basic reproductive number R0 (a measure of disease spread) and the probability of a disease outbreak leading to a pandemic based solely on the network topology [2]. If R0 < 1, then the disease will not transmit sufficiently to cause a pandemic, and a larger R0 corresponds to a more rapid spread. Second, public officials advocate policies that sufficiently alter the social network topology (i.e., by decreasing the potential avenues for disease spread) such that R0 drops below 1. From a network topology viewpoint, such a policy can be implemented by selecting a subset of nodes and removing some or all of their links to other nodes in the network. Finally, we combine the results from the previous two stages and perform exploratory data mining procedures that aim to classify individuals as targeted or ignored based on their generated personal characteristics. These rules represent potential strategies for health officials to consider when devising policy portfolios.

The first stage, recreating a representative social contact network, is a complicated task, primarily because of privacy and surveillance-related concerns and issues that leave input data only partially observed or reported as marginal distributions. Nevertheless, reconstructing a usable network is possible by leveraging existing knowledge of human behavior and mixing patterns in conjunction with publicly available information, such as the census, public transportation usage, hospital inpatient and staffing patterns, etc. [3].

For instance, members in a household will have some degree of meaningful interaction irrespective of their ages, whereas people who work in a large business office will predominantly interact with similarly aged individuals and on a less intimate level. Importantly, when the social network is constructed, it also generates realistic individual characteristics such as age, household size and location, whether public transportation is used, type of employment and workplace location, whether the individual attends or is employed by an educational institution or healthcare facility, etc. That is, a census of individuals in the generated social contact network will be highly similar to the actual input census data, allowing for subsequent and meaningful data mining of the population. The mixing patterns and distribution of contact duration in our networks are similar to those in popular agent-based models [3].

A simple decision tree using the PageRank algorithm in Stage 2

After creating the social contact network, a subset of nodes (i.e., individuals) is identified and effectively removed from the network. Technically, any subset of nodes will be associated with a potential solution to the pandemic mitigation problem. However, given the population sizes under consideration, most of these solutions will not be very useful. Instead, a variety of existing and potentially useful network-based strategies, including seemingly unrelated ones such as Google’s PageRank algorithm [3], are employed. Contrary to expectation, often these strategies are able to significantly decrease R0 and thus the probability of a pandemic occurring, while targeting just 10 percent to 25 percent of the population.

Intuitively, a more promising approach would be to solve the optimization problem of minimizing R0. Unfortunately, this goal is too broad to solve directly, and so a proxy goal is instead considered, such as minimizing the worst-case scenario [6] or minimizing the worst-case expected number of infections [4]. Despite being limited to small networks, the proposed algorithms appear highly effective. We have subsequently devised a number of efficient approaches for large-scale problems [7], though we are far from having exhausted the list of potentially useful proxy objectives. Regardless of the targeting strategy, some subset of nodes will be identified and labeled as the most effective targets for stopping disease spread. These individuals could even act as bridges between populations; for instance, public school teachers can represent a simple potential pathway for disease to spread between children and adults.

The Greater Toronto Area

The second stage, generating mitigation policies from the social network, is somewhat limited since it does not provide immediately usable information to public policy officials because the contact network is only a statistically reasonable representation of the true population. For example, a policy may say to vaccinate node 42, but there is no “real” person represented by node 42. However, by examining the personal characteristics of the identified individuals from Stage 2, it is possible to ascertain their common traits to arrive at strategies that can have a realistic impact, which leads us to Stage 3.

In the final stage, the generated data set of individual characteristics from Stage 1 is combined with the newly acquired property of whether each individual should be targeted or not from Stage 2. This merging of data allows us to view the problem as one of classifying each individual as somebody to “target” or “ignore,” based on their personal characteristics. Then, by mining the population using decision trees, it is possible to automatically ascertain the traits common to people whom public policies should target. Decision trees are used because they have a number of highly desirable characteristics: (1) the result is presented as an easily interpretable set of logical “if-then” decisions, (2) the output of the decision tree directly provides information as to the proportion of individuals correctly targeted versus those who happened to have similar characteristics, and (3) there is a simple mechanism by which the output of the decision trees are created that allows the user to adjust the output of (2).

A number of existing algorithms are capable of creating decision trees whose output “if-then” rules each correspond to a specific mitigation strategy. The entire tree can be seen as a portfolio of strategies. For instance, a potential rule could be to target children between the ages of five and 12 and who have at least four people in their household. Rule complexity will vary based on the subset of targeted individuals and their characteristics. A simple decision tree generated from the Greater Toronto Area network [3] and using the PageRank strategy during Stage 2 is shown in Figure 1, where the top 10 percent of PageRanked individuals are those labeled as “target” and the others are “ignore.” For demonstration purposes only, individual home and work locations are considered to create the decision tree. Using geographic information system software, the rules of decision trees can be overlaid onto the geographic area under consideration. For instance, Figure 2 displays the GTA broken into census dissemination blocks and highlights a strategy that targets a particular subset of schools.

The proportion of individuals affected by a strategy, as well as the classification accuracy of the rule, can be easily ascertained by tabulation and a confusion matrix, respectively. The practical interpretation of these results indicates whether the strategy is perhaps overly specific (i.e., targets a very small number of individuals in the population), or could potentially incur an excessive socio-economic cost through overly broad and invasive strategies. That is, individuals falsely identified to be targeted will yield a false positive in the associated confusion matrix, and those not targeted but who should have been, could lead to an increase in the number of infected individuals and decreased ability to mitigate the disease. A hypothetical confusion matrix and its interpretation are shown in Table 1. Deeper analysis includes per-person economic costs of mitigation, a risk assessment based on mortality and morbidity rates, etc. These considerations can be incorporated into the decision tree creation algorithm through the loss or cost matrix, which weighs the cost of false positive/negatives when deciding how to split the data along the tree branches.

Caveats and Looking Forward

Like any model or simulation, this approach is sensitive to input data problems.  Therefore, system output should be taken into consideration during the decision process but should not act as a replacement for public health policies or officials. These tools can also be used to efficiently explore potential “what if” scenarios, but with the caveat that validation studies for potential future events cannot be reliably conducted. Nevertheless, comparing the relative merits and downfalls of different mitigation portfolios can be straightforwardly accomplished within this framework, under appropriate model assumptions.

Another important caveat, also for agent-based simulation models, is that the manner and rate of disease spread, incubation time, etc., are assumed to be known with reasonable accuracy. Eventually, scientific and surveillance advances may be able to provide such highly accurate information within a reasonable time after an outbreak has been detected. For the present, however, caution is needed.

The future of network-based pandemic preparedness systems seems bright, but a significant amount of work remains to be done before such systems can truly be of use. We need validation of the generated social contact network structure in different environments. Moreover, interactions between individuals across vast distances are becoming increasingly common, and thus we need to study the boundaries of a population under consideration and the effect of outside influence on an assumed closed population.

Other important research directions are the effect of different proxy goals for mitigating disease spread within a contact network and efficient algorithms for doing so, and more efficient and effective data processing methods for generating decision tree rules. Yet another topic is the impact of public announcements and compliance with policy recommendations based on sentiment analysis. Despite its current shortcomings, the suggested approach outlined above to developing an efficient and robust pandemic preparedness system has enormous potential significance and could become an integral part of the cyberhealth infrastructure of society.

Mario Ventresca ( is an assistant professor in the School of Industrial Engineering and the School’s representative for the Computational Science and Engineering program at Purdue University. His research interests are in the areas of network science, machine learning, approximation and optimization, as well as automated design and inference, as applied to a number of real-world problems including in the biomedical and healthcare domains. He is a member of INFORMS, IEEE, AAAI and the ACM.

Dionne M. Aleman ( is an associate professor in the Department of Mechanical & Industrial Engineering at the University of Toronto, with appointments to the Institute for Health Policy, Management & Evaluation and the Techna Institute. Dr. Aleman’s research interests are in the application of operations research to medical and healthcare systems, including using agent-based simulation to predict the spread of a pandemic disease in an urban population. She is the secretary of the Canadian Operational Research Society (CORS) and is an active member of INFORMS, and serves as an associate editor for IIE Transactions on Healthcare Systems Engineering and for the International Journal of Biomedical Data Mining, a topical editor for the Wiley Encyclopedia of OR/MS, and an editorial board member of Operations Research in Health Care.

The authors thank Doug Samuelson for his helpful suggestions regarding this article.

1.     S. Eubank, H. Goclu, A. Kumar, M. Marathe, A. Srinivasan, Z. Totoczkal, and N. Wang, 2004, “Modelling disease outbreaks in realistic urban social networks,” Nature, Vol. 429, pp. 180-184.
2.     M. E. J. Newman, 2002, “The spread of epidemic disease on networks,” Physical Review E, 66:016128.
3.     M. Ventresca and D. Aleman, 2013, “Evaluation of strategies to mitigate contagion spread using social network characteristics,” Social Networks, Vol. 35, No. 11, pp. 75-88.
4.     M. Ventresca and D. Aleman, 2014, “A derandomized approximation algorithm for the critical node detection problem,” Computers and Operations Research, Vol. 43, No. 3, pp. 261–270.
5.     M. Ventresca and D. Aleman, 2013, “Deriving public policies from contact networks,” INFORMS Workshop on Data Mining and Heath Informatics, Minneapolis (expanded version is in submission).
6.     M. Ventresca and D. Aleman, 2014, “A randomized algorithm with local search for containment of pandemic disease spread, Computers and Operations Research (in press).
7.     M. Ventresca and D. Aleman, 2014, “Rule mining in critical node detection,” IFORS (expanded version in submission).