Causal inference using instrumental variables

Data scientists often find themselves repeating the mantra “correlation is not causation”. Constantly reminding our stakeholders, and ourselves, of this is a good thing, because data can be treacherous and because the human brain cannot help but interpret statistical evidence causally. But maybe this is a feature, not a bug: we instinctively seek a causal explanation because it is ultimately what we need to make the right decision. Without a causal story behind it, a correlation is not particularly helpful to decision makers.

But in the end, correlations are all we can read directly from the data. It is very challenging to ensure that the causal stories we attach to these correlations are actually true, and we can get causality wrong in many ways. The most common mistake is failing to account for common causes, or confounding factors. To use a typical example, there is a positive correlation between hospitalization and death: people who are hospitalized are more likely to die than people who are not. If we ignore the fact that illness leads to both hospitalization and death, we may end up with the wrong causal story: hospitals kill.
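To make the confounding story concrete, here is a minimal simulation sketch. All probabilities are made up for illustration, and the names (`corr`, `rows`, `stratum_corr`) are our own. Illness drives both hospitalization and death, while hospitalization itself has no effect on death at all.

```python
import random


def corr(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
rows = []
for _ in range(50_000):
    ill = rng.random() < 0.10                              # the common cause
    hospitalized = rng.random() < (0.60 if ill else 0.02)  # illness -> hospital
    died = rng.random() < (0.20 if ill else 0.01)          # illness -> death only
    rows.append((ill, hospitalized, died))

# Naive view: correlate hospitalization with death, ignoring illness.
naive_corr = corr([h for _, h, _ in rows], [d for _, _, d in rows])

# Adjusted view: the same correlation within each illness stratum.
stratum_corr = {}
for ill_level in (True, False):
    stratum = [(h, d) for ill, h, d in rows if ill == ill_level]
    stratum_corr[ill_level] = corr([h for h, _ in stratum], [d for _, d in stratum])

print(f"naive corr(hospitalized, died): {naive_corr:.3f}")
print(f"within ill / not ill: {stratum_corr[True]:.3f} / {stratum_corr[False]:.3f}")
```

In this toy world the naive correlation is clearly positive, while the within-stratum correlations hover around zero, which is what adjusting for the confounder is supposed to achieve.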

Another common pitfall arises when we over-learn the lesson about confounding factors and also adjust for common effects, or colliders. The example here is adapted from the description of Berkson’s Paradox in Pearl and Mackenzie’s “The Book of Why”. Suppose we are trying to see if COVID-19 infection can induce diabetes. Suppose further that, in reality, there is no such causal relationship, but that diabetic patients who are infected with the virus are more likely to be hospitalized. Now, in our enthusiasm to account for any potential confounding factors, we decide to limit our study to hospitalized patients. Even though there is no direct causal link, this can lead us to observe a correlation between COVID-19 and diabetes. If we are careless on top of that, we may start talking about how COVID-19 causes diabetes.

If we only look at the hospitalized population, even if there is no direct causal relationship, we may observe the correlation between COVID-19 and diabetes and incorrectly infer that COVID-19 causes diabetes.
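The collider effect can also be reproduced in a few lines. This is a sketch under invented assumptions: COVID-19 and diabetes are drawn independently, and the hospitalization probabilities in `P_HOSPITALIZED` are hypothetical numbers chosen so that having both conditions makes hospitalization much more likely.

```python
import random


def corr(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical P(hospitalized | covid, diabetes); the key is (covid, diabetes).
P_HOSPITALIZED = {(0, 0): 0.02, (1, 0): 0.05, (0, 1): 0.05, (1, 1): 0.50}

rng = random.Random(1)
people = []
for _ in range(100_000):
    covid = int(rng.random() < 0.10)
    diabetes = int(rng.random() < 0.10)  # independent of covid by construction
    hospitalized = rng.random() < P_HOSPITALIZED[(covid, diabetes)]
    people.append((covid, diabetes, hospitalized))

# Whole population: no relationship between the two conditions.
overall_corr = corr([c for c, _, _ in people], [d for _, d, _ in people])

# Conditioning on the collider: look at hospitalized patients only.
in_hospital = [(c, d) for c, d, h in people if h]
hospital_corr = corr([c for c, _ in in_hospital], [d for _, d in in_hospital])

print(f"corr(covid, diabetes), everyone:          {overall_corr:.3f}")
print(f"corr(covid, diabetes), hospitalized only: {hospital_corr:.3f}")
```

Restricting the sample to hospitalized patients manufactures a positive correlation that does not exist in the full population.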

Another way that causal stories go wrong is when we adjust for mediators. Continuing the morbid theme of this blog post so far, suppose we are studying whether smoking actually causes premature death. If we adjust or control for all the ways in which smoking can lead to death (lung cancer, heart disease), we may find almost no correlation between smoking and death, even though smoking does increase mortality.

“So, what’s so difficult about this!?” you might say. “Just adjust for confounding factors and leave the colliders and mediators alone!” Causal inference is difficult because, first of all, we almost never have data on all possible confounding factors. Second, it is often difficult to tell colliders, mediators, and confounders apart. And sometimes causality runs both ways, and it is almost impossible to disentangle these two-way effects.

Roblox example

So, how do we address these very real challenges? A more reliable solution, especially in tech, is experimentation, or A/B testing. However, this is not always feasible. You have probably had enough of morbid examples by now, so let’s use a more fun one. On Roblox, our users express their identity and creativity through their avatars, dressing them in the different items they can get in the Avatar Shop.


As you can imagine, keeping this feature healthy is very important to us. To decide how much of our resources to invest in this marketplace, we want to know how much it ultimately contributes to our company’s goals. More specifically, we want to estimate the impact of the Avatar Shop on community participation. Unfortunately, direct experimentation is not feasible.

  1. We cannot close the Avatar Shop for only some users, because it is a very important part of the user experience on our platform.
  2. The Avatar Shop is a marketplace where users interact with each other as buyers and sellers. Turning it off for one group of users also affects the users for whom it stays on.

At the same time, using non-experimental data to estimate this causal effect is a dangerous path, because (i) we have identified several confounding factors that either cannot be fully adjusted for or are unobserved, and (ii) we have found that changes in our top-line metrics also have a reverse effect on interaction with the shop.

Why causal inference is difficult.

This is not an uncommon problem, and there are several statistical methods that may be useful. For example, Differences-in-Differences or Two-Way Fixed Effects (TWFE) estimation tracks a group of users over time and looks at how their participation time changes after they start participating in the Avatar Shop. Another popular technique is propensity score matching (PSM), which attempts to match users who use the Avatar Shop with users who do not, based on various observed factors. These methods have their own advantages and challenges, but even when implemented correctly they often suffer from the same fatal flaw: unobserved factors that affect both Avatar Shop participation and participation time, that is, confounding factors. (Side note: Differences-in-Differences is robust to confounding factors that are fixed over time, but it is still vulnerable to confounding factors that change over time.)

Instrumental variables to the rescue

Instrumental variables (IV) can provide a solution for unobserved confounding factors that other causal inference techniques cannot. The emphasis here is on “can”, because the hardest part is finding a special variable that meets the two main conditions for valid IV estimation:

  1. First stage: It needs to be strongly correlated with the variable of interest (in our case, Avatar Shop participation).
  2. Exclusion: Its only association with the outcome (participation time) is through the variable of interest (Avatar Shop participation).

If we can identify such an instrument, causal estimation from non-experimental data becomes much simpler: any change in the outcome (Y) associated with the change in the variable of interest (X) that is explained by the instrument (Z) is the causal effect of X on Y. See the chart for a simplified example of the basic idea behind instrumental variables.

Z predicts a change in average Avatar Shop participation from X1 to X2. As a result, average participation time increases from Y1 to Y2. The slope is then a causal estimate of the X -> Y relationship.

The figure above also shows why these two conditions matter. First, the instrument has to strongly predict the movement from X1 to X2. Second, we are taking a leap of faith that the movement from Y1 to Y2 is entirely due to the movement from X1 to X2. If Z affects Y in any way other than through X, we will mistakenly attribute all of the movement in Y to X.
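The slope idea can be sketched with simulated data. This is not our production analysis: the data-generating process, the coefficient values, and the helper names (`ols_slope`, `wald_iv`) are assumptions made purely for illustration. An unobserved confounder `u` pushes both X and Y, a randomly assigned binary instrument `z` moves only X, and the IV estimate is the change in Y between instrument groups divided by the change in X (often called the Wald estimator).

```python
import random


def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Slope from a simple regression of y on x (ignores confounding)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sum((a - mx) ** 2 for a in x)


def wald_iv(z, x, y):
    """IV estimate for a binary instrument: (Y2 - Y1) / (X2 - X1)."""
    x1 = [xi for zi, xi in zip(z, x) if zi == 1]
    x0 = [xi for zi, xi in zip(z, x) if zi == 0]
    y1 = [yi for zi, yi in zip(z, y) if zi == 1]
    y0 = [yi for zi, yi in zip(z, y) if zi == 0]
    return (mean(y1) - mean(y0)) / (mean(x1) - mean(x0))


TRUE_EFFECT = 0.5  # assumed causal effect of X on Y in this toy world

rng = random.Random(42)
z, x, y = [], [], []
for _ in range(50_000):
    u = rng.gauss(0, 1)                           # unobserved confounder
    zi = rng.randint(0, 1)                        # randomized instrument
    xi = 1.0 * zi + 1.0 * u + rng.gauss(0, 1)     # first stage: Z moves X
    yi = TRUE_EFFECT * xi + 2.0 * u + rng.gauss(0, 1)  # Z enters only via X
    z.append(zi)
    x.append(xi)
    y.append(yi)

ols_estimate = ols_slope(x, y)   # biased upward by the confounder
iv_estimate = wald_iv(z, x, y)   # should land near TRUE_EFFECT

print(f"true effect: {TRUE_EFFECT}")
print(f"naive OLS:   {ols_estimate:.3f}")
print(f"IV (Wald):   {iv_estimate:.3f}")
```

If the exclusion condition were violated in this sketch, for example by adding a direct `zi` term to `yi`, the IV estimate would absorb that extra effect and drift away from the true value.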

The second condition is where IV estimation fails most often, because it is a very strong claim to make about a complex system. So, in our case, what is the instrument, and why are we confident that it meets the second condition?

Our instrument

About a year ago, we ran an A/B test to evaluate our new “Recommended for You” feature for the Avatar Shop. We observed a huge impact on Avatar Shop participation. In other words, the experimental group a user belongs to strongly predicts their interaction with the Avatar Shop (first stage). We also observed an impact on participation time. And because this experiment was specifically designed to evaluate a change in the Avatar Shop and did not touch anything else on Roblox, we have every reason to believe that any change in participation time must be due to the change in shop participation (exclusion).

Our recommendation experiment is a great instrument because it has a large impact on shop participation (F-stat > 15000), and we have no reason to believe that it affects participation time through any other channel.
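For readers unfamiliar with the first-stage F-statistic, here is a hedged sketch of how it can be computed for a single instrument. The simulated data, effect sizes, and function names (`first_stage_f`, `simulate`) are illustrative assumptions, not our experiment data. A commonly cited rule of thumb treats a first-stage F below roughly 10 as a sign of a weak instrument.

```python
import random


def first_stage_f(z, x):
    """F statistic for regressing x on one instrument z plus an intercept.

    With a single regressor, F = R^2 / (1 - R^2) * (n - 2).
    """
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    s_zz = sum((a - mz) ** 2 for a in z)
    s_xx = sum((b - mx) ** 2 for b in x)
    s_zx = sum((a - mz) * (b - mx) for a, b in zip(z, x))
    r_squared = s_zx ** 2 / (s_zz * s_xx)
    return r_squared / (1 - r_squared) * (n - 2)


def simulate(first_stage_strength, n=20_000, seed=7):
    """Random binary instrument z and a variable x that it shifts."""
    rng = random.Random(seed)
    z = [rng.randint(0, 1) for _ in range(n)]
    x = [first_stage_strength * zi + rng.gauss(0, 1) for zi in z]
    return z, x


f_strong = first_stage_f(*simulate(first_stage_strength=1.0))
f_weak = first_stage_f(*simulate(first_stage_strength=0.01))

print(f"strong instrument F: {f_strong:.1f}")
print(f"weak instrument F:   {f_weak:.1f}")
```

The contrast between the two simulated instruments shows why a very large F-statistic is reassuring: the instrument explains a substantial share of the variation in the variable of interest.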

Having a good instrument means that we can estimate the causal effect of Avatar Shop participation on participation time without having to close the Avatar Shop for some of our users in a direct A/B test.


Using the IV estimation outlined above, we found a statistically significant positive causal relationship between our two variables. Specifically, a 1% increase in Avatar Shop participation results in a 0.08% (SE: 0.008%, p-value

We estimate that Avatar Shop participation has a much greater impact on the community participation of our newest users.

This is a very useful insight that can help us design the onboarding experience for new users. It is also a good opportunity to discuss an important limitation of IVs: they estimate the Local Average Treatment Effect (LATE) rather than the Average Treatment Effect (ATE) that a direct experiment would give. In other words, these estimates are specific to the users whose behavior is affected by our instrument, and therefore do not necessarily apply to the overall population. This distinction matters when the treatment effect is not homogeneous, as we have seen above. In practice, it is always safe to assume that treatment effects are heterogeneous, so IV estimates, even when they are internally valid, are not a perfect substitute for experiments. But sometimes they may be all we can do.
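The gap between LATE and ATE is easy to see in a toy simulation. Everything here is hypothetical: we assume two user types, “compliers” who take the treatment only when nudged by the instrument and “never-takers” who ignore it, with made-up effect sizes in `EFFECT` and a made-up `COMPLIER_SHARE`.

```python
import random


def mean(values):
    return sum(values) / len(values)


COMPLIER_SHARE = 0.3                               # assumed share of compliers
EFFECT = {"complier": 2.0, "never_taker": 0.5}     # assumed treatment effects

rng = random.Random(3)
z, d, y, individual_effects = [], [], [], []
for _ in range(100_000):
    kind = "complier" if rng.random() < COMPLIER_SHARE else "never_taker"
    zi = rng.randint(0, 1)                         # randomized instrument
    di = 1 if (kind == "complier" and zi == 1) else 0  # only compliers respond
    yi = EFFECT[kind] * di + rng.gauss(0, 1)
    z.append(zi)
    d.append(di)
    y.append(yi)
    individual_effects.append(EFFECT[kind])

# ATE: average effect if everyone were treated (known only in a simulation).
true_ate = mean(individual_effects)

# IV (Wald) estimate: change in outcome over change in treatment uptake.
y1 = [yi for zi, yi in zip(z, y) if zi == 1]
y0 = [yi for zi, yi in zip(z, y) if zi == 0]
d1 = [di for zi, di in zip(z, d) if zi == 1]
d0 = [di for zi, di in zip(z, d) if zi == 0]
iv_estimate = (mean(y1) - mean(y0)) / (mean(d1) - mean(d0))

print(f"population ATE:        {true_ate:.3f}")
print(f"IV estimate (LATE):    {iv_estimate:.3f}")
print(f"complier effect (set): {EFFECT['complier']}")
```

In this sketch the IV estimate tracks the complier effect rather than the population average, which is exactly the sense in which the estimate is “local”.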

Next steps

The antidote to IV’s LATE problem is to find more instruments and estimate a whole set of LATEs. The goal is to construct an estimate of the global average treatment effect by combining a series of local effect estimates. This is exactly what we plan to do next, and we can do it because we have experimented extensively with different aspects of the Avatar Shop. Each of these experiments should be a valid instrument for our purpose. As you can imagine, there are many cool and challenging analytical problems to solve. If that sounds like your kind of thing, we would love for you to join Roblox’s data science and analytics team.

Final thoughts on instrumental variables

We hope this love letter and introduction to instrumental variables demonstrates their power and sparks your further interest. Although this causal estimation method may be overused in certain settings, we think it is under-used in tech, where its assumptions are more likely to hold, especially when the instrument comes from an experiment. The further good news is that, because it has been around since the 1920s, there is a rich literature and an active, lively discussion about its correct implementation and interpretation.

— — —

Ujwal Kharel is a senior data scientist at Roblox. He works on the Avatar Shop to ensure its economy is healthy and prosperous.

Neither Roblox Corporation nor this blog endorses or supports any company or service. In addition, no guarantee or promise is made regarding the accuracy, reliability, or completeness of the information contained in this blog.

©2021 Roblox Corporation. Roblox, the Roblox logo, and Powering Imagination are our registered and unregistered trademarks in the United States and other countries.