Causal inference with Synthetic Control using Python and SparseSC

Causal Inference
Synthetic Control
Author

Aayush Agrawal

Published

September 19, 2022

Understanding the Synthetic Control method and using Microsoft’s SparseSC package to run synthetic controls on larger datasets.

What is the Synthetic Control Method?

I will keep this part short and focus on why data scientists should care about such methods and how to use them on larger datasets, based on my practical experience with the SparseSC package.

The Synthetic Control (SC) method is a statistical method used to estimate causal effects of binary treatments on observational panel (longitudinal) data. The method has received considerable coverage, being described as “the most important innovation in the policy evaluation literature in the last few years” and featured in a Washington Post article - Seriously, here’s one amazing math trick to learn what can’t be known. “SC is a technique to create an artificial control group by taking a weighted average of untreated units in such a way that it reproduces the characteristics of the treated units before the intervention (treatment). The SC acts as the counterfactual for a treatment unit, and the estimate of a treatment effect is the difference between the observed outcome in the post-treatment period and the SC’s outcome.”1
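
To make the idea concrete, here is the standard formulation (this is the generic SC estimator from the literature, not anything specific to SparseSC). With unit 1 treated and units 2, ..., J+1 as controls, the synthetic control estimate of unit 1’s untreated outcome at time t is

\hat{Y}^{N}_{1t} = \sum_{j=2}^{J+1} w_j Y_{jt}, \qquad w_j \ge 0, \qquad \sum_{j=2}^{J+1} w_j = 1

where the weights w_j are chosen so that the synthetic control reproduces the treated unit’s pre-intervention outcomes, and the estimated treatment effect at a post-intervention time t is

\hat{\tau}_{1t} = Y_{1t} - \hat{Y}^{N}_{1t}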

“One way to think of SC is as an improvement upon difference-in-difference (DiD) estimation. Typical DiD will compare a treated unit to the average of the control units. But often the treated unit does not look like a typical control (e.g., it might have a different growth rate), in which case the ‘parallel trend’ assumption of DiD is not valid. SC remedies this by choosing a smarter linear combination, rather than the simple average, to weigh more heavily the more similar units. SC’s assumption is if there are endogenous factors that affect treatment and future outcomes then you should be able to control them by matching past outcomes. The matching that SC provides can therefore deal with some problems in estimation that DiD cannot handle.”2
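
As a toy illustration of that difference (made-up numbers, purely for intuition - the weights below are hand-picked, whereas SC estimates them from the pre-period data):

import numpy as np

# Toy pre-period outcomes: rows are 3 control units, columns are 4 time periods
controls = np.array([[1.0, 1.1, 1.2, 1.3],
                     [2.0, 2.2, 2.4, 2.6],
                     [0.5, 0.6, 0.7, 0.8]])
treated = np.array([1.9, 2.1, 2.3, 2.5])

did_counterfactual = controls.mean(axis=0)  # simple average: [1.17, 1.30, 1.43, 1.57]
weights = np.array([0.0, 0.95, 0.05])       # SC-style weights: non-negative, sum to 1
sc_counterfactual = weights @ controls      # weighted average: [1.93, 2.12, 2.32, 2.51]

The simple average is nowhere near the treated unit’s pre-period outcomes, so the ‘parallel trend’ assumption is strained; the weighted combination tracks them closely.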

The resource I found most useful for understanding the math behind SC is Causal Inference for The Brave and True by Matheus Facure - Chapter 15.

Why should any Data scientist care about this method?

As a data scientist, you will often encounter situations where running an A/B test is not feasible because of -

  1. Lack of infrastructure
  2. Lack of similar groups for running an A/B test (e.g., when evaluating state policies, no other state is exactly equivalent to the treated one)
  3. Providing an unwanted advantage to one group over others. Sometimes running an A/B test can give an unfair advantage and lead you into antitrust territory. For example, what if Amazon charged different prices to different customers, or applied different margins to sellers, for the same product?

Even in these situations, stakeholders may still ask you to estimate the impact of certain changes/treatments, and synthetic controls can come to the rescue. For this reason, they are a valuable tool to keep in your algorithmic toolkit.

Problem Overview

We will use the Proposition 99 data to explain the use case for this approach and to demonstrate the SparseSC library and its key features. “In 1988, California passed a famous Tobacco Tax and Health Protection Act, which became known as Proposition 99. Its primary effect is to impose a 25-cent per pack state excise tax on the sale of tobacco cigarettes within California, with approximately equivalent excise taxes similarly imposed on the retail sale of other commercial tobacco products, such as cigars and chewing tobacco. Additional restrictions placed on the sale of tobacco include a ban on cigarette vending machines in public areas accessible by juveniles, and a ban on the individual sale of single cigarettes. Revenue generated by the act was earmarked for various environmental and health care programs, and anti-tobacco advertisements. To evaluate its effect, we can gather data on cigarette sales from multiple states and across a number of years. In our case, we got data from the year 1970 to 2000 from 39 states.”3

Code
# Importing required libraries
import pandas as pd
import numpy as np
import SparseSC
from datetime import datetime
import warnings
import plotly.express as px
import plotly.graph_objects as pgo
pd.set_option("display.max_columns", None)
warnings.filterwarnings('ignore')

Let’s look at the data

#Import data
data_dir = "https://raw.githubusercontent.com/OscarEngelbrektson/SyntheticControlMethods/master/examples/datasets/"
df = pd.read_csv(data_dir + "smoking_data" + ".csv").drop(columns=["lnincome","beer", "age15to24"])
df.head()
state year cigsale retprice
0 Alabama 1970.0 89.8 39.6
1 Alabama 1971.0 95.4 42.7
2 Alabama 1972.0 101.1 42.3
3 Alabama 1973.0 102.9 42.1
4 Alabama 1974.0 108.2 43.1

We have one row per state (the treatment unit) and year (year column), with the yearly per-capita cigarette sales in packs (cigsale column) and the cigarette retail price (retprice column). We are going to pivot this data so that each row is one treatment unit (state) and the columns hold the yearly cigsale values.

df = df.pivot(index="state", columns="year", values="cigsale")
df.head()
year 1970.0 1971.0 1972.0 1973.0 1974.0 1975.0 1976.0 1977.0 1978.0 1979.0 1980.0 1981.0 1982.0 1983.0 1984.0 1985.0 1986.0 1987.0 1988.0 1989.0 1990.0 1991.0 1992.0 1993.0 1994.0 1995.0 1996.0 1997.0 1998.0 1999.0 2000.0
state
Alabama 89.8 95.4 101.1 102.9 108.2 111.7 116.2 117.1 123.0 121.4 123.2 119.6 119.1 116.3 113.0 114.5 116.3 114.0 112.1 105.6 108.6 107.9 109.1 108.5 107.1 102.6 101.4 104.9 106.2 100.7 96.2
Arkansas 100.3 104.1 103.9 108.0 109.7 114.8 119.1 122.6 127.3 126.5 131.8 128.7 127.4 128.0 123.1 125.8 126.0 122.3 121.5 118.3 113.1 116.8 126.0 113.8 108.8 113.0 110.7 108.7 109.5 104.8 99.4
California 123.0 121.0 123.5 124.4 126.7 127.1 128.0 126.4 126.1 121.9 120.2 118.6 115.4 110.8 104.8 102.8 99.7 97.5 90.1 82.4 77.8 68.7 67.5 63.4 58.6 56.4 54.5 53.8 52.3 47.2 41.6
Colorado 124.8 125.5 134.3 137.9 132.8 131.0 134.2 132.0 129.2 131.5 131.0 133.8 130.5 125.3 119.7 112.4 109.9 102.4 94.6 88.8 87.4 90.2 88.3 88.6 89.1 85.4 83.1 81.3 81.2 79.6 73.0
Connecticut 120.0 117.6 110.8 109.3 112.4 110.2 113.4 117.3 117.5 117.4 118.0 116.4 114.7 114.1 112.5 111.0 108.5 109.0 104.8 100.6 91.5 86.7 83.5 79.1 76.6 79.3 76.0 75.9 75.5 73.4 71.4

Let’s observe how cigarette sales per capita trend over time in California compared with the other states.

Code
plot_df = df.loc[df.index == "California"].T.reset_index(drop=False)
plot_df["OtherStates"] = df.loc[df.index != "California"].mean(axis=0).values


fig = px.line(
        data_frame = plot_df, 
        x = "year", 
        y = ["California","OtherStates"], 
        template = "plotly_dark")

fig.add_trace(
    pgo.Scatter(
        x=[1988,1988],
        y=[plot_df.California.min()*0.98,plot_df.OtherStates.max()*1.02], 
        line={
            'dash': 'dash',
        }, name='Proposition 99'
    ))
fig.update_layout(
        title  = {
            'text':"Gap in per-capita cigarette sales(in packs)",
            'y':0.95,
            'x':0.5,
        },
        legend =  dict(y=1, x= 0.8, orientation='v'),
        legend_title = "",
        xaxis_title="Year", 
        yaxis_title="Cigarette Sales Trend",
        font = dict(size=15)
)
fig.show(renderer='notebook')

Fig 1 - Cigarette sales comparison between California and other states

As the chart above shows, there is a general decline in cigarette sales after the 1980s, and with the introduction of Proposition 99 the downward trend accelerated for the state of California. We cannot say for sure whether this acceleration is statistically significant; it is just something we observe by examining the chart.

To answer the question of whether Proposition 99 influenced cigarette consumption, we will use the pre-intervention period (1970-1988) to build a synthetic control group that mimics California’s cigarette sales trend. Then we will see how this synthetic control behaves after the intervention.

Fitting Synthetic Control using SparseSC package

At a high level, the SparseSC package provides two functions for fitting synthetic controls: fit() and fit_fast() -

  • fit() - This method computes the weights jointly, producing synthetic controls that are ‘optimal’. This is the approach used in most of the code/libraries I have found, but it is computationally expensive and takes a long time to run; hence, it does not scale to larger datasets.
  • fit_fast() - This method computes the weights separately by performing some non-matching analyses. It is much faster and often the only feasible option for larger datasets. The package authors recommend starting with fit_fast and moving to fit only if needed; a short sketch contrasting the two calls follows this list.
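
For reference, the two calls are interchangeable apart from the function name. This is a minimal sketch, assuming fit() accepts the same core arguments as fit_fast() (which matches the SparseSC documentation); since the features, targets, and treated_units arrays are only constructed in the next section, both calls are shown commented out:

# Joint optimization of the weights - 'optimal' but computationally expensive:
# sc_model = SparseSC.fit(features=features, targets=targets, treated_units=treated_units)

# Separable estimation - fast, the recommended starting point (used below):
# sc_model = SparseSC.fit_fast(features=features, targets=targets, treated_units=treated_units)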

The SparseSC.fit_fast() method requires at least three arguments -

  1. features - A NumPy matrix of input variables where each row represents a treatment/control unit (states in our case), each column is a pre-treatment period (1970-1988), and each value is the metric of interest (in this case, the cigsale value)
  2. targets - A NumPy matrix of input variables where each row represents a treatment/control unit (states in our case), each column is a post-treatment period (1989-2000), and each value is the metric of interest (in this case, the cigsale value)
  3. treated_units - A list of integers containing the row indices of the treated units
Note

Note that treated_units can contain multiple indices. Think of cases where the same treatment is applied to multiple groups: for example, if Proposition 99 had been rolled out in both California and Minnesota, treated_units would be [2, 15], the respective row indices of these states. A small sketch of this follows.
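
As a quick sketch of that hypothetical multi-state scenario (illustrative only; treated_units_multi is a name made up for this example, and Minnesota’s index depends on the alphabetical ordering of the 39 states in df):

## Hypothetical: collect row indices for a multi-state rollout
treated_states = ["California", "Minnesota"]
treated_units_multi = [idx for idx, val in enumerate(df.index.values) if val in treated_states]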

## Creating features (pre-period outcomes), targets (post-period outcomes), and the treated unit index
features = df.iloc[:,df.columns <= 1988].values
targets = df.iloc[:,df.columns > 1988].values
treated_units = [idx for idx, val in enumerate(df.index.values) if val == 'California'] # [2]

## Fit fast model for fitting Synthetic controls
sc_model = SparseSC.fit_fast( 
    features=features,
    targets=targets,
    treated_units=treated_units
)

Now that we have fitted the model, let’s get the synthetic control output using the predict() method.

result = df.loc[df.index == 'California'].T.reset_index(drop=False)
result.columns = ["year", "Observed"] 
result['Synthetic'] = sc_model.predict(df.values)[treated_units,:][0]
result.head(5)
year Observed Synthetic
0 1970.0 123.0 122.394195
1 1971.0 121.0 125.114849
2 1972.0 123.5 129.704372
3 1973.0 124.4 126.753988
4 1974.0 126.7 126.276394

Now that we have our synthetic control, we can plot it with the outcome variable of the State of California.

Code
fig = px.line(
        data_frame = result, 
        x = "year", 
        y = ["Observed","Synthetic"], 
        template = "plotly_dark",)

fig.add_trace(
    pgo.Scatter(
        x=[1988,1988],
        y=[result.Observed.min()*0.98,result.Observed.max()*1.02], 
        line={
            'dash': 'dash',
        }, name='Proposition 99'
    ))
fig.update_layout(
        title  = {
            'text':"Synthetic Control Assessment",
            'y':0.95,
            'x':0.5,
        },
        legend =  dict(y=1, x= 0.8, orientation='v'),
        legend_title = "",
        xaxis_title="Year", 
        yaxis_title="Per-capita cigarette sales (in packs)",
        font = dict(size=15)
)
fig.show(renderer='notebook')

Fig 2 - Assessment of Proposition 99 on the state of California using Synthetic Control

Note

In the pre-intervention period, the synthetic control does not reproduce the treated unit’s outcome exactly but follows the curve closely. This is a good sign, as it indicates that we are not overfitting. Also note that we see a clear divergence after the intervention (the introduction of Proposition 99 in 1988).
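
One quick way to put a number on that pre-period fit (an illustrative check of my own, not part of the SparseSC API) is the root-mean-squared error between the observed and synthetic series up to 1988:

## Illustrative pre-period fit check: RMSE between observed and synthetic California
pre = result[result.year <= 1988]
rmse = np.sqrt(np.mean((pre.Observed - pre.Synthetic) ** 2))
print(f"Pre-period RMSE: {rmse:.2f} packs")

A small RMSE relative to the scale of the series (around 100 packs per capita here) supports the visual impression of a close fit.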

With the synthetic control groups in hand, we can estimate the treatment effect as the gap between the treated and the synthetic control outcomes.

Code
result['California Effect'] = result.Observed - result.Synthetic
fig = px.line(
        data_frame = result, 
        x = "year", 
        y = "California Effect", 
        template = "plotly_dark",)
fig.add_hline(0)
fig.add_trace(
    pgo.Scatter(
        x=[1988,1988],
        y=[result["California Effect"].min()*0.98,result["California Effect"].max()*1.02], 
        line={
            'dash': 'dash',
        }, name='Proposition 99'
    ))

fig.update_layout(
        title  = {
            'text':"Difference across time",
            'y':0.95,
            'x':0.5,
        },
        legend =  dict(y=1, x= 0.8, orientation='v'),
        legend_title = "",
        xaxis_title="Year", 
        yaxis_title="Gap in Per-capita cigarette sales (in packs)",
        font = dict(size=15)
)
fig.show(renderer='notebook')

Fig 3 - Gap in per-capita cigarette sales in California w.r.t the Synthetic Control

Code
print(f"Effect of Proposition 99 w.r.t Synthetic Control => {np.round(result.loc[result.year==2000,'California Effect'].values[0],1)} packs")
Effect of Proposition 99 w.r.t Synthetic Control => -28.8 packs

Looking at the chart above, we can see that by the year 2000, Proposition 99 had reduced per-capita cigarette sales by roughly 29 packs. Next, we will check whether this effect is statistically significant.

Making Inference

In synthetic control analysis, to determine whether the effect we observed is significant, we run placebo tests. A placebo test takes a random untreated unit, pretends it was treated (with all the remaining units acting as controls), and fits a synthetic control to it to estimate a placebo effect. Repeating this over many untreated units gives us the distribution of effects we would see by chance, and we can then check whether the effect observed for the actual treated unit is significantly different from that placebo distribution. In the SparseSC package, the estimate_effects method does all of this for us automatically (a manual sketch of the placebo logic appears after the argument list below). The estimate_effects method takes a minimum of two arguments -

  1. outcomes - A NumPy matrix of input variables where each row represents a treatment/control unit (states in our case), the columns span both the pre-treatment and post-treatment periods, and each value is the metric of interest (in this case, the cigsale value)
  2. unit_treatment_periods - A vector giving the treatment period for each unit (use np.nan for a unit that is never treated); here, the treatment period is referred to by its numerical column index
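
Before calling it, here is a rough sketch of the placebo logic that estimate_effects automates (illustrative only: it refits fit_fast once per control state, so it is slow, and for simplicity it leaves California in each placebo’s donor pool, which a stricter test would avoid):

## Manual placebo loop (sketch; estimate_effects does this, with proper inference, for us)
placebo_gaps = []
for i in range(df.shape[0]):
    if i in treated_units:
        continue  # skip the actually-treated unit
    m = SparseSC.fit_fast(features=features, targets=targets, treated_units=[i])
    synth_i = m.predict(df.values)[i, :]
    placebo_gaps.append(df.values[i, -1] - synth_i[-1])  # observed minus synthetic in 2000

## Share of placebo gaps at least as extreme as California's year-2000 gap
ca_gap = result["California Effect"].values[-1]
print(np.mean(np.abs(placebo_gaps) >= np.abs(ca_gap)))

Now let’s run the built-in routine:
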
## Creating unit treatment_periods
unit_treatment_periods = np.full((df.values.shape[0]), np.nan)
unit_treatment_periods[treated_units] = [idx for idx, colname in enumerate(df.columns) if colname == 1988][0]

## fitting estimate effects method
sc = SparseSC.estimate_effects(
    outcomes = df.values,  
    unit_treatment_periods = unit_treatment_periods, 
    max_n_pl=50, # Number of placebos
    level=0.9 # Level for confidence intervals
)
print(sc)
Pre-period fit diagnostic: Were the treated harder to match in the pre-period than the controls were.
Average difference in outcome for pre-period between treated and SC unit (concerning if p-value close to 0 ): 
1.8073663793403947 (p-value: 0.9743589743589743)

(Investigate per-period match quality more using self.pl_res_pre.effect_vec)

Average Effect Estimation: -16.965374734951705 (p-value: 0.07692307692307693)

Effect Path Estimation:
 -2.712194133284143 (p-value: 0.6923076923076923)
-6.424223898557088 (p-value: 0.20512820512820512)
-4.18303325159971 (p-value: 0.5128205128205128)
-10.210104128966762 (p-value: 0.28205128205128205)
-10.198518971790051 (p-value: 0.2564102564102564)
-14.921163932323616 (p-value: 0.15384615384615385)
-18.441946863077938 (p-value: 0.1282051282051282)
-21.27793738802989 (p-value: 0.1794871794871795)
-22.47075245885469 (p-value: 0.1794871794871795)
-23.397467590595554 (p-value: 0.20512820512820512)
-26.13637789807555 (p-value: 0.05128205128205128)
-30.504443997992325 (p-value: 0.02564102564102564)
-29.67170704122487 (p-value: 0.05128205128205128)

 

The estimate_effects method returns an object which, when printed, shows the treatment effect for each post-treatment year along with the significance of the observed differences. The printed information can also be accessed programmatically through the pl_res_pre and pl_res_post attributes of the returned object.

print(f"Estimated effect of sales in California state in year 2000 because of preposition 99 is {np.round(sc.pl_res_post.effect_vec.effect[-1])}, \
with a p-value of  {np.round(sc.pl_res_post.effect_vec.p[-1],2)}")
Estimated effect of sales in California state in year 2000 because of preposition 99 is -30.0, with a p-value of  0.05

Conclusion

Here are some key takeaways -

  1. Synthetic control allows us to combine multiple control units to make them resemble the treated unit. With synthetic control, we can estimate what would have happened to our treated unit in the absence of treatment.
  2. Microsoft’s SparseSC library provides a fast and easy-to-use API for fitting synthetic controls and allows us to run placebo tests to estimate the significance of the observed effects.