Endogeneity and Related Concerns that Keep an Empirical Researcher Up at Night

This post is motivated by two questions posed to us by Chris Tang:

  1.  “Is causality a requirement for publication?”
  2.  “Is predictive analysis valuable?”

We divide the first question, “Is causality a requirement for publication?”, into two parts. Is it necessary to present causal arguments for publication? Is it necessary to demonstrate causality for publication? We believe that causal arguments are required for research questions that do not deal with forecasting. However, even if a paper argues for causality, the extent to which causality needs to be demonstrated should depend on several factors.

When are causal arguments required?

It is common to separate empirical analysis into two large classes: prediction versus causal analysis. This classification implicitly suggests that prediction does not require causal analysis, but we find this misleading. Many problems in operations management (OM) require predicting the impact of a decision: for example, predicting the impact of increasing product variety on sales, or predicting how a reduction in customer wait reduces abandonment rates. We argue that such predictions do require causal analysis in order to be useful. On the other hand, there are several problems where causality is not a requirement. Many papers in OM focus on predicting demand and using that prediction as an input to decision models. In such papers, the true causal factors that influence demand are unknown. Therefore, we propose classifying empirical research in OM into three classes of problems: a) forecasting, b) hypothesis testing, and c) what-if analysis. The need for causal inference varies across the three classes, and we expand on them below.

a) Forecasting: This research deals with generating models that can be used to predict an outcome based on external observable factors that are not controllable by the decision maker. In this stream, we rarely care about causal interpretations since the goal is to obtain good predictions. The validity of the predictions is generally demonstrated with a hold-out sample. Some of these external factors may have large explanatory power for the outcome, but this does not necessarily imply that they cause the outcome; a factor may be picking up the effect of another, omitted factor that causes the outcome to change. In forecasting, however, omitting such variables does not invalidate the prediction; all we care about is including factors that capture part of the effect and thereby help to predict variation in the outcome. It is usual in forecasting to use lagged values of the outcome as external factors. Examples of forecasting models include Gaur et al. (2005, 2007), Ferreira et al. (2015), and Glaeser et al. (2019).
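The point about omitted factors can be illustrated with a minimal simulated sketch. Everything in it is hypothetical: the variables z, x, and y, the coefficients, and the noise levels are assumptions chosen for illustration and are not taken from any of the papers cited. An unobserved driver z causes the outcome y, while the observed factor x is only a noisy proxy for z and has no causal effect on y. A regression on x still forecasts well in a hold-out sample.

```python
import random

def ols_fit(x, y):
    """Least-squares intercept and slope for the line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 4000

# Hypothetical data-generating process (an assumption of this sketch):
# z is the unobserved causal driver of y.
z = [rng.gauss(0, 1) for _ in range(n)]
# x is an observed proxy for z; it does NOT enter the equation for y.
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [2.0 * zi + rng.gauss(0, 0.5) for zi in z]

# Fit on the first half, then forecast the hold-out half.
split = n // 2
intercept, slope = ols_fit(x[:split], y[:split])
forecast = [intercept + slope * xi for xi in x[split:]]
holdout_r2 = r_squared(y[split:], forecast)

print(f"slope on proxy x: {slope:.2f}")
print(f"hold-out R^2:     {holdout_r2:.2f}")
```

The fitted slope on x is clearly nonzero and the hold-out fit is good, yet in this setup changing x directly would leave y unchanged. That is harmless for forecasting but would mislead anyone who read the slope as the effect of a decision.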

Gaur et al. (2005), for example, is a forecasting paper where the objective is to identify deviations from the forecast for the purpose of benchmarking across observations. In this paper, the authors build an econometric model to predict the inventory turns of a firm using contemporaneous values of gross margin, capital intensity, and sales surprise (these three are the external factors according to our definition). Deviations from those predictions are treated as abnormalities and used for benchmarking the inventory performance of retailers. Even though this paper uses rigorous arguments from the theoretical literature to motivate the hypotheses relating inventory turns to these external factors, it does not demonstrate causality. The objective of this empirical analysis is to forecast the average inventory turns of a firm with given characteristics and then to benchmark firms with the same predicted performance.

A common theme across many models in OM is how to deal with variability and uncertainty. In the newsvendor model and in queuing models, we learn that reducing variability and uncertainty can improve performance. This is exactly the purpose of forecasting: to use any available data to improve predictive power and thereby reduce uncertainty. Forecasting is crucial for many OM decisions, and with the recent explosion of “big data” and machine learning, we should be interested in discovering new approaches to improve forecasting.

b) Hypothesis testing: A number of empirical papers in operations management focus on evaluating whether an intervention that has already been implemented had any effect on an outcome of interest. Here, the hypotheses are typically made to explain observations that are already available (Lipton 2005), in order to develop a better understanding of the observed phenomenon. For example, Gallino and Moreno (2014) measure the effect of a new purchase channel where customers could buy online and pick up in the store, looking at how this intervention affected store sales and e-commerce sales. Another example is the work by Olivares and Cachon (2009), which seeks to test whether an increase in competition in small markets leads retail firms to increase their inventory level. In this type of research, the focus is to examine how a change in some factor x caused the outcome y to change (i.e., ∂y/∂x). The emphasis is much more on ensuring that this causal impact is correctly estimated (consistently and without bias) than on the overall model fit and predictive power. In other words, even if the R² of the model is low, it may be possible to identify the impact of x with an appropriate research design and a sufficiently large sample size. This type of work can be categorized within the hypothesis testing framework, where the research seeks to show that the change that took place had an effect by rejecting the null hypothesis of no effect. It is necessary to make causal arguments, motivated by theory, to hypothesize the relationships.
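The claim that a low R² need not prevent identification can be sketched with simulated data. The setup is an assumption made purely for illustration: a randomly assigned binary intervention with a true effect of 0.5 on an outcome whose noise has a standard deviation of 5, and a sample of 100,000 observations. Random assignment stands in here for "an appropriate research design".

```python
import random
from statistics import mean, variance

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 100_000
true_effect = 0.5       # hypothetical effect size, assumed for illustration

# Random assignment makes the intervention independent of the noise.
treated = [rng.random() < 0.5 for _ in range(n)]
y = [(true_effect if t else 0.0) + rng.gauss(0, 5) for t in treated]

y_treated = [yi for yi, t in zip(y, treated) if t]
y_control = [yi for yi, t in zip(y, treated) if not t]

# With a binary regressor, the OLS slope equals the difference in group means.
estimate = mean(y_treated) - mean(y_control)
std_error = (variance(y_treated) / len(y_treated)
             + variance(y_control) / len(y_control)) ** 0.5

# R^2 of the same regression: fitted values are simply the two group means.
grand_mean = mean(y)
m1, m0 = mean(y_treated), mean(y_control)
ss_tot = sum((yi - grand_mean) ** 2 for yi in y)
ss_res = (sum((yi - m1) ** 2 for yi in y_treated)
          + sum((yi - m0) ** 2 for yi in y_control))
r2 = 1 - ss_res / ss_tot

print(f"estimated effect: {estimate:.3f} (s.e. {std_error:.3f})")
print(f"R^2: {r2:.4f}")
```

In this sketch the intervention explains well under one percent of the variation in the outcome, yet the estimated effect is close to the assumed value and many standard errors away from zero. A poor overall fit and a well-identified effect can coexist.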

c) What-if analysis: Operations management has a long tradition of developing optimization models to improve management decisions. Optimization can be viewed as a what-if analysis, evaluating how alternative values of the decision variables affect the objective function. Many applications in supply chain management and service operations require empirical analysis in order to conduct this evaluation. For example, balancing operating costs with customer abandonments in a queuing system requires measuring the impact of waiting time on customer behavior. Assortment planning requires estimating the effect of product substitution when a product is included in or excluded from the assortment. All of these problems fall within the category of what-if analysis, where it is necessary both to measure the causal effect of a decision and to achieve reasonable precision in the prediction/forecast.

Papers dealing with structural estimation typically fall under this category. This type of work requires causal inference because decision makers want to measure the impact of a change in the system, that is, a change in the process by which the data were generated. But a what-if analysis also cares about predictive power, in order to make the evaluation of alternative scenarios meaningful. While it is desirable to test the validity of the predictions in a hold-out sample[1], this is often not possible. Consequently, the ceteris paribus assumption becomes crucial in these types of analyses. High predictive power of the underlying model may partially, but not fully, alleviate concerns about violations of ceteris paribus.

Demonstrating Causality

As we discussed above, both hypothesis testing and what-if analysis require causal arguments driven by theory. However, we want to make a distinction on the need for demonstrable proof of causality. It was common among early papers in OM and other fields to make causal arguments in the hypothesis development section and simply perform an ordinary least squares (OLS) regression to show positive or negative association among the variables. Today, it is common for researchers to pursue various methodologies to demonstrate causality. 

The gold standard for demonstrating causal effects is a controlled field experiment. In some cases, an exogenous shock may serve as a natural experiment that helps authors demonstrate causality. However, in most cases the researcher receives access to the problem after the intervention/change took place and had no control over the design of the intervention. This is known as an observational study, where the empirical strategy used to identify the causal effect of interest is critical for the validity of the research. While there are a number of techniques available to demonstrate causality (instrumental variables, propensity score matching (PSM), difference-in-differences (DiD), etc.), it is essential to temper expectations about the causality demonstrated using these techniques, as none of them can guarantee causality. For example, finding valid instruments is a challenging task, and propensity score matching is questionable in the presence of unobservable factors that drive decisions.
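A small simulation can make the instrumental-variables idea, and its fragility, concrete. This is a sketch under strong assumptions rather than a recipe: the variables, coefficients, and the instrument z are all hypothetical, and z is valid by construction (it shifts x but is unrelated to the unobserved factor u), which is exactly the condition that is hard to establish in a real observational setting.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

rng = random.Random(2)  # fixed seed so the sketch is reproducible
n = 20_000
true_effect = 1.0       # hypothetical causal effect of x on y

# Assumed data-generating process for illustration only.
z = [rng.gauss(0, 1) for _ in range(n)]   # instrument: affects x, not y directly
u = [rng.gauss(0, 1) for _ in range(n)]   # unobserved factor driving both x and y
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_effect * xi + 2.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# OLS slope of y on x: contaminated by u, which is omitted from the regression.
ols_estimate = cov(x, y) / cov(x, x)

# Simple IV (Wald) estimator with a single instrument.
iv_estimate = cov(z, y) / cov(z, x)

print(f"true effect:  {true_effect:.2f}")
print(f"OLS estimate: {ols_estimate:.2f}")
print(f"IV estimate:  {iv_estimate:.2f}")
```

Here OLS overstates the effect because x and y share the unobserved driver u, while the IV estimate lands near the assumed value. If z were allowed to depend on u, the IV estimate would be biased as well, which is why the validity of the instrument, not the mechanics of the estimator, carries the argument.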

While authors need to make the best attempt to demonstrate causality, we believe that it is also necessary for reviewers to have reasonable expectations about the extent to which authors can demonstrate causality. Having an unrealistically high bar for causal evidence could limit the research questions we can pursue in OM to only those where we can perform field experiments or where researchers were fortunate enough to observe exogenous shocks in their settings. It is essential to weigh the other “virtues” of the paper (novelty of question, dataset, research setting) against the demonstrable evidence for causality. If the other virtues of the paper outweigh some of the shortcomings in the methodology, then the limitations of the approach can be clearly documented in the paper, so that other researchers who have access to a better research setting, one that allows cleaner identification of the causal effects, may reevaluate the research question.

We think the current climate surrounding empirical research in operations management leads to two significant problems. First, the types of research questions we pursue in our field are stymied by the excessive need to demonstrate causality. Many interesting, unaddressed research questions where clean causal inference is difficult are abandoned. For example, it is difficult to expect authors who propose to study a new research setting with an innovative problem to show the same level of sophistication in handling endogeneity as a study that examines a well-studied empirical problem. While we believe that causal inference is ultimately required, we do not think that every paper needs to be subjected to the same burden of proof regardless of the lifecycle of the research problem that it studies. Second, we feel that there is a hide-and-seek game going on in many papers, where authors do not divulge limitations of their research setting that compromise the causal claims they can make, while reviewers distrust the authors and seek out lurking endogeneity in every paper. We think a different, more open culture could lead to significant progress on the empirical front. To address these issues, we propose some changes.

Note to authors: Authors need to make the best attempt to handle causality (see Ho et al. 2017 for various techniques to handle endogeneity). Even after doing so, if they perceive any weaknesses in their methodology or, more importantly, in their research setting, then they should be upfront about them and list them among their limitations. Explicitly stating when the technique used to handle endogeneity could lead to incorrect inferences (weak instruments, presence of unobservables in PSM, etc.) would be beneficial for readers and future researchers. Every research setting has some limitations (even those where field experiments are run), and these limitations can spur further research into these areas and deepen our understanding. Unlike analytical papers, where the entire proof is available to reviewers, empirical papers require reviewers to make a leap of faith, and being upfront about the limitations builds confidence in the eyes of reviewers.

Note to reviewers: When you review an empirical paper, put away the endogeneity “checklist” and think about the context of the specific study. Simply arguing that “x can be endogenous” is not enough. There is a risk that this criticism turns into a blunt instrument to reject every paper. In our opinion, three aspects need to be considered when examining the threat of endogeneity to the results reported in a paper. First, is there a credible story of an alternative mechanism, arising from endogeneity, that may be driving the observed effect? Second, how much would the reported results change if endogeneity could be fully alleviated, that is, what is the magnitude of the endogeneity bias? Third, is there a feasible test to alleviate such endogeneity concerns? Thinking through these aspects would help reviewers weigh the benefits of handling endogeneity against the costs of doing so. For example, if there are no feasible tests to alleviate endogeneity concerns fully then, as we mentioned earlier, the reviewers should balance the other virtues of the paper against this limitation. As the authors strive to establish causal evidence beyond a reasonable doubt, credit should be given when their research design reflects good knowledge of the institutional details and context of their application.
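One way to reason about the second question, the magnitude of the endogeneity bias, is the textbook omitted-variable-bias expression for a linear model. The notation below (an unobserved factor u with coefficient γ) is introduced here for illustration and is an assumption of this sketch; real settings may involve other sources of endogeneity, such as simultaneity or measurement error, that this simple form does not cover.

```latex
% Assumed linear model with one regressor x and one omitted factor u
y = \beta x + \gamma u + \varepsilon,
\qquad \operatorname{Cov}(x,\varepsilon) = 0,
\quad \operatorname{Cov}(u,\varepsilon) = 0

% Probability limit of the OLS slope when u is left out of the regression
\hat{\beta}_{\mathrm{OLS}} \;\xrightarrow{\;p\;}\;
\beta + \gamma \, \frac{\operatorname{Cov}(x,u)}{\operatorname{Var}(x)}
```

Under these assumptions the bias is large only if the omitted factor both matters for the outcome (γ is far from zero) and moves with x (the covariance term is large relative to the variance of x). A reviewer who suspects endogeneity can therefore ask how plausible it is that both conditions hold, and in which direction the resulting bias would push the reported estimate.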

References

Ferreira KJ, Lee BHA, Simchi-Levi D (2015) Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & Service Operations Management 18(1):69–88.

Gallino S, Moreno A (2014) Integration of online and offline channels in retail: The impact of sharing reliable inventory availability information. Management Science 60(6):1434–1451. 

Gaur V, Fisher ML, Raman A (2005) An econometric analysis of inventory turnover performance in retail services. Management Science 51(2):181–194. 

Gaur V, Kesavan S, Raman A, Fisher ML (2007) Estimating demand uncertainty using judgmental forecasts. Manufacturing & Service Operations Management 9(4):480–491.

Glaeser CK, Fisher M, Su X (2019) Finalist–2017 M&SOM practice-based research competition—Optimal retail location: Empirical methodology and application to practice. Manufacturing & Service Operations Management 21(1):86–102.

Ho TH, Lim N, Reza S, Xia X (2017) OM forum—Causal inference models in operations management. Manufacturing & Service Operations Management 19(4):509–525.

Lipton P (2005) Testing hypotheses: prediction and prejudice. Science 307(5707):219–221. 

Olivares M, Cachon GP (2009) Competing retailers and inventory: An empirical investigation of General Motors’ dealerships in isolated U.S. markets. Management Science 55(9):1586–1604.

Endnote

[1] Edmond Halley not only accounted for the elliptical orbit of comets based on observations from 1531, 1607, and 1682 but was also able to predict that the comet would return in 1758. The latter prediction impressed more people than Halley’s explanations based on prior observations (Lipton 2005).

Comments

Hi Jan,

Just saw your comment. You make an excellent point. I agree that the external validity of the findings needs to be examined carefully. Having said that, in my opinion, the goal should not be to aspire to one definitive study that addresses all questions but to aim for a combination of studies that take various contextual variations into account. To do so, it is essential for authors to be clear about the idiosyncrasies of their research setting (as you have mentioned).

thanks,
saravanan

Thanks Saravanan and Marcelo for posting this and thanks to Chris for initiating this.

I very much like the balanced view in this post as it relates to the position of OM as a field right in between (social) science and (industrial) engineering, with scientists primarily interested in descriptives and causality, and engineers interested in developing tools for better decision making.

One additional suggestion that would be worthwhile for authors of empirical papers is to much more extensively describe (or make available online) the context of the data that have been used and to have an open discussion to what extent this context drives the findings. This will really enable us to develop more context-rich theory, especially now that many of us have started to use transactional data that have been taken from a single company in a single context.

For instance, many of the online retail studies now use data from China or from the US, where the online retail environments and consumer behavior really differ from one another, and also from other areas such as Europe or Africa. This would affect both forecasting (would this forecasting model also work in a different context?) and theory (is the theory dependent on the context?).

I would be very interested to learn your perspective on this issue of context.

Jan Fransoo