Chapter 4: Directed Acyclic Graphs (DAGs)

Story: Drawing Causal Diagrams to Solve a Mystery

Imagine you're a public health researcher investigating why traffic accidents spike on rainy days. You've gathered a mountain of data: rainfall amounts, traffic congestion reports, road conditions, driver ages, car models, and more. But patterns seem tangled and unclear. One day, a colleague sketches something simple on a whiteboard: circles and arrows connecting the variables. Rain → Accidents. Rain → Traffic jam ← Accidents.

Instantly, the mystery becomes clearer. This "aha moment" is the power of causal diagrams — also called Directed Acyclic Graphs (DAGs) — to cut through complexity and reveal structure. Of course, real-world pathways are often much more complicated than this simple sketch. But for the purpose of learning, we will embrace simplicity before layering in the complexity.

Concept: Causal Graphs

What Is a DAG?

Nodes represent variables (like Rain, Traffic Jams, Accidents). They are the circles.
Directed edges (arrows) represent causal relationships, or pathways, as we love to call them.
Acyclic means no feedback loops (you can't return to the same node following arrows).
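
The "acyclic" property can be checked mechanically. The sketch below assumes networkx is installed; it builds the graph from the opening story and asks `nx.is_directed_acyclic_graph` whether it contains a cycle. The Traffic Jam → Rain edge is a made-up edge, added only to show what a violation looks like.

```python
import networkx as nx

# Edges from the opening story: Rain -> Accidents, Rain -> Traffic Jam <- Accidents
story = nx.DiGraph([
    ("Rain", "Accidents"),
    ("Rain", "Traffic Jam"),
    ("Accidents", "Traffic Jam"),
])
story_is_dag = nx.is_directed_acyclic_graph(story)

# Hypothetical feedback edge, added only to illustrate a cycle
looped = story.copy()
looped.add_edge("Traffic Jam", "Rain")
looped_is_dag = nx.is_directed_acyclic_graph(looped)

print("Story graph is a DAG:", story_is_dag)
print("Graph with feedback edge is a DAG:", looped_is_dag)
```

Running a check like this before any analysis is a cheap way to catch an arrow drawn in the wrong direction.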

Simple Examples:

Mediator: X → Z → Y (e.g., Exercise → Weight → Heart health). Weight is the mediator between exercise and heart health.
Confounder: X ← Z → Y (e.g., Exercise ← Socioeconomic status → Health outcome). Socioeconomic status affects both exercise and health outcomes; hence, we call it a confounder.
Collider: X → W ← Y (e.g., Traffic Jams → Emergency response ← Accidents). Traffic jams and accidents both influence emergency response, so emergency response is a collider.
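
The three structures differ only in the direction of the two arrows touching the middle variable. As a rough illustration, the hypothetical helper `role_of_middle` below (not part of any library) reads those directions from an edge list. The chain and collider examples come from the list above, and the fork uses the socioeconomic status example from the Practice section.

```python
def role_of_middle(edges, x, z, y):
    """Classify z on the three-node path x - z - y from the arrow directions."""
    e = set(edges)
    if (x, z) in e and (z, y) in e:
        return "mediator"      # x -> z -> y
    if (y, z) in e and (z, x) in e:
        return "mediator"      # y -> z -> x (same chain, read from the other end)
    if (z, x) in e and (z, y) in e:
        return "confounder"    # x <- z -> y
    if (x, z) in e and (y, z) in e:
        return "collider"      # x -> z <- y
    return "not on a path between x and y"

chain = [("Exercise", "Weight"), ("Weight", "Heart health")]
fork = [("Socioeconomic Status", "Exercise"), ("Socioeconomic Status", "Health Outcome")]
inverted_fork = [("Traffic Jams", "Emergency response"), ("Accidents", "Emergency response")]

print(role_of_middle(chain, "Exercise", "Weight", "Heart health"))
print(role_of_middle(fork, "Exercise", "Socioeconomic Status", "Health Outcome"))
print(role_of_middle(inverted_fork, "Traffic Jams", "Emergency response", "Accidents"))
```

The helper only handles three-node paths; in larger graphs a variable can play different roles on different paths.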

Why DAGs Matter

A DAG captures causal assumptions. It tells us:

What variables cause other variables.
What we must adjust for to estimate causal effects.
What pathways create bias if left open.

The Back-Door Criterion (informally):

When studying the causal effect of a variable X (say, exercise) on an outcome Y (say, heart health), we want to be sure that differences in Y are truly caused by differences in X — not by something else lurking in the background.

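A back-door path is any path between X and Y that begins with an arrow pointing into X, such as X ← W → Y. Such a path is non-causal: it can make X and Y move together even if X has no effect on Y. By contrast, X → Z → Y is a causal path that runs through a mediator, and X → W ← Y is a path through a collider, which stays blocked unless we adjust for W. The trick is knowing how to block every back-door path so that only the causal paths from X to Y remain open. We will address this in the next chapter.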
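
To make this concrete, here is a small sketch that lists back-door paths using the standard definition: a path between X and Y whose first edge points into X. It assumes networkx is installed. `backdoor_paths` is a hypothetical helper written for this example, and the graph is an illustrative one built from the letters used above. The helper only enumerates the paths; it does not decide whether each one is open or blocked.

```python
import networkx as nx

def backdoor_paths(G, x, y):
    """Return paths between x and y whose first edge points into x."""
    skeleton = G.to_undirected()          # ignore directions while walking
    paths = []
    for path in nx.all_simple_paths(skeleton, x, y):
        first_neighbor = path[1]
        if G.has_edge(first_neighbor, x): # arrow first_neighbor -> x
            paths.append(path)
    return paths

# Illustrative graph using the letters from the text
G = nx.DiGraph([
    ("X", "Y"),              # direct causal path
    ("X", "Z"), ("Z", "Y"),  # causal path through a mediator
    ("W", "X"), ("W", "Y"),  # common cause: creates a back-door path
])
found = backdoor_paths(G, "X", "Y")
print(found)
```

Only the path through the common cause W qualifies; the direct path and the mediator path both leave X along an outgoing arrow.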

Insight: The Power of Visualization

Returning to our traffic accident story: now that we know about causal diagrams, we can map the relationships visually.

Here’s the situation, written as arrows: Rain → Traffic Jam, Rain → Accidents, Accidents → Traffic Jam.

In this DAG:

Rain causes Traffic Jam.
Rain also causes Accidents directly.
Accidents cause Traffic Jam.

Notice how Traffic Jam is the point where two arrows meet: one from Rain and one from Accidents. This makes Traffic Jam a collider in the graph.

Understanding the Causal Paths

There are two paths between Rain and Accidents:

Direct Causal Path: Rain ⟶ Accidents

Rain directly makes roads slippery, leading to more accidents.

Collider Path (non-causal): Rain ⟶ Traffic Jam ⟵ Accidents

Rain increases Traffic Jams, and Accidents increase Traffic Jams.

What This Means:

Because Traffic Jam is a collider, this path is naturally blocked.
We do NOT adjust for colliders. Adjusting for Traffic Jam would actually open the path and create a spurious association between Rain and Accidents, making the analysis biased.
If we were estimating the effect of Rain on Accidents, we would adjust for confounders (common causes) if any existed — but NOT for colliders.
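
A short simulation can show how much damage adjusting for a collider does. This is only a sketch under stated assumptions: NumPy is installed, the relationships are linear, and every coefficient (including the "true" effect of 1.0) is made up for illustration. `ols_slopes` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Assumed linear data-generating process (coefficients are made up)
rain = rng.normal(size=n)
accidents = 1.0 * rain + rng.normal(size=n)          # Rain -> Accidents, true effect 1.0
traffic_jam = rain + accidents + rng.normal(size=n)  # Rain -> Traffic Jam <- Accidents

def ols_slopes(y, *predictors):
    """Least-squares slopes for y ~ intercept + predictors (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

unadjusted = ols_slopes(accidents, rain)[0]             # Accidents ~ Rain
adjusted = ols_slopes(accidents, rain, traffic_jam)[0]  # Accidents ~ Rain + Traffic Jam

print(f"Rain slope without Traffic Jam: {unadjusted:.2f}")
print(f"Rain slope adjusting for Traffic Jam: {adjusted:.2f}")
```

With these particular made-up coefficients, the unadjusted slope lands near the true effect, while the adjusted slope is pushed toward zero. Other coefficient choices would change the size of the bias but not the lesson.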

Final Takeaway:

Drawing a DAG helps you see:

Which variables you must adjust for (confounders),
Which variables you must not adjust for (colliders),
And how mistaken adjustment could introduce bias rather than remove it.

In this case, visualizing the causal structure immediately prevents a major error: adjusting for Traffic Jam would have introduced bias into the estimate of Rain's effect on Accidents.

Practice: Drawing and Analyzing DAGs in Python

Let's build simple DAGs using networkx and matplotlib in Python.

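# Step 0: Install (if needed) and import libraries
# In a notebook, run: !pip install networkx matplotlib
# From a terminal, run: pip install networkx matplotlib
import networkx as nx
import matplotlib.pyplot as plt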

# Define a function to plot DAGs neatly in a triangle format
def plot_dag(edges, title):
    G = nx.DiGraph()
    G.add_edges_from(edges)

    pos = {}
    if len(G.nodes) == 3:
        nodes = list(G.nodes)
        pos[nodes[0]] = (-1, 0)
        pos[nodes[1]] = (1, 0)
        pos[nodes[2]] = (0, 1.5)
    else:
        pos = nx.spring_layout(G, seed=42)

    plt.figure(figsize=(6,5))
    nx.draw(G, pos, with_labels=True, arrows=True, node_color='skyblue', node_size=2500, arrowstyle='->', arrowsize=15)
    plt.title(title)
    plt.show()

# --- Section 1: Confounder Example ---
# Fork structure: Socioeconomic Status influences both Exercise and Health Outcome
edges_confounder = [('Socioeconomic Status', 'Exercise'), ('Socioeconomic Status', 'Health Outcome')]
plot_dag(edges_confounder, 'Confounder Example: Fork Structure')

# --- Section 2: Mediator Example ---
# Chain structure: Exercise leads to Weight Loss which leads to Heart Health
edges_mediator = [('Exercise', 'Weight Loss'), ('Weight Loss', 'Heart Health')]
plot_dag(edges_mediator, 'Mediator Example: Chain Structure')

# --- Section 3: Collider Example ---
# Inverted fork: Traffic and Crime Rate both affect Police Presence
edges_collider = [('Traffic', 'Police Presence'), ('Crime Rate', 'Police Presence')]
plot_dag(edges_collider, 'Collider Example: Inverted Fork Structure')

# --- Section 4 (Advanced): Optional Manipulation ---
# Demonstrate what happens if you control for a collider
# (Normally this opens a path between Traffic and Crime Rate)

# The user will be guided to think: "Should I condition here or not?"

Comment: You can replace the variables to create your own DAG easily!
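
Section 4 above leaves the collider manipulation as a prompt. One way to explore it is a small simulation, sketched here under the assumptions that NumPy is installed and that Traffic and Crime Rate are independent standard-normal variables that both raise Police Presence. The numbers are invented for illustration, and "controlling" is approximated by looking only at cases with high Police Presence.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Made-up data: Traffic and Crime Rate are generated independently
traffic = rng.normal(size=n)
crime_rate = rng.normal(size=n)
# Collider: both variables increase Police Presence
police_presence = traffic + crime_rate + rng.normal(size=n)

# Correlation in the full sample (no conditioning)
corr_all = np.corrcoef(traffic, crime_rate)[0, 1]

# Correlation after conditioning on the collider (high Police Presence only)
high = police_presence > 1.0
corr_high = np.corrcoef(traffic[high], crime_rate[high])[0, 1]

print(f"Full sample correlation: {corr_all:.2f}")
print(f"Correlation where Police Presence is high: {corr_high:.2f}")
```

In the full sample the two variables are essentially uncorrelated. Within the high Police Presence group a negative correlation appears, even though neither variable affects the other: that is the path opened by conditioning on the collider.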

Task: Draw and Analyze Your Own DAG

Scenario

A researcher is studying the causal relationship between Exercise and Heart Health. Here's what we know:

Exercise reduces Weight.
Lower Weight improves Heart Health.
Genetics independently influences both Exercise and Heart Health.

Your goal is to build a clear causal diagram (DAG) to represent this situation.

Instructions

Step 1: Identify and List the Variables

Exercise
Weight
Heart Health
Genetics

Step 2: Draw the Causal Arrows

Which variable causes which?
Think carefully about direct effects vs indirect paths.

Step 3: Analyze Your DAG

Confounders: Which variables influence both the treatment (Exercise) and the outcome (Heart Health)?
Mediators: Which variables lie on the causal path between Exercise and Heart Health?
Adjustment Set: Based on the back-door criterion, which variables should you adjust for to estimate the causal effect of Exercise on Heart Health?

Bonus Challenge (Python)

Use Python to create your DAG visually!
Here's a simple template to get you started:
# Install networkx and matplotlib if needed
# pip install networkx matplotlib

import networkx as nx
import matplotlib.pyplot as plt

# Create the Directed Acyclic Graph (DAG)
G = nx.DiGraph()

# Add causal relationships (edges)
G.add_edges_from([
    ("Exercise", "Weight"),          # Exercise affects Weight
    ("Weight", "Heart Health"),       # Weight affects Heart Health
    ("Genetics", "Exercise"),         # Genetics affects Exercise
    ("Genetics", "Heart Health")      # Genetics affects Heart Health
])

# Layout and draw the graph
pos = nx.spring_layout(G, seed=42)  # Reproducible layout
nx.draw(
    G, pos,
    with_labels=True,
    node_size=2000,
    node_color="lightgreen",
    font_size=10,
    arrowsize=20
)
plt.title("Causal DAG: Exercise and Heart Health")
plt.show()

Reflection Questions

Is Weight a Confounder, Mediator, or Collider?
Should you control for Weight when estimating the direct effect of Exercise on Heart Health?
Why is Genetics critical to adjust for?
What could go wrong if you accidentally control for a collider?

What You Learned

DAGs are powerful tools to clarify causal assumptions.
Confounders, mediators, and colliders have distinct roles — knowing how to spot them matters.
Python lets you draw and analyze DAGs easily.
Careful thinking about structure prevents bias and supports credible causal claims.

Chapter 4b: Confounders and Mediators – Addressing the Issue Using Regression

Understanding Confounders and Mediators: The Key to Proper Causal Inference

In the world of causal inference, understanding the roles of confounders and mediators is essential for drawing valid conclusions from observational data. As we progress from Chapter 4, where we learned about the theoretical foundations of Directed Acyclic Graphs (DAGs) and the importance of identifying causal relationships, Chapter 4b will focus on how we can use regression analysis to address these challenges.

The Role of Mediators: When Is It Important to Control for Them?

Before diving into regression, let's first revisit the idea of mediators. Mediators are variables that lie on the causal pathway between the independent variable (or treatment) and the outcome. When a mediator is present, it essentially explains how or why the independent variable has an effect on the outcome. For example, in our dataset, let's say we're interested in understanding how violent crime rates (our independent variable) might influence property crime rates (our dependent variable). A possible mediator could be police presence in a community. If the increase in violent crime leads to more policing, which in turn influences property crime, then police presence is acting as a mediator.

The challenge with mediators is knowing when to adjust for them. The general rule of thumb is that we should not adjust for mediators if our goal is to estimate the total effect of the independent variable on the outcome. This is because adjusting for mediators can remove the indirect effect of the independent variable that operates through the mediator. However, if we want to isolate the direct effect of the independent variable, excluding the mediator's influence, then adjusting for it makes sense.

This nuance is often subtle. In the violent crime and property crime scenario, if we want to know how much of the effect of violent crime on property crime does not run through policing (the direct effect), we would control for police presence; the gap between the total effect and this direct effect then indicates how much is due to changes in policing. On the other hand, if we want to understand the total effect of violent crime on property crime, it is better to leave police presence out of the equation, as doing so captures the entire causal pathway.
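
The difference between the total and direct effect can be seen in a small simulation. This is a sketch under assumptions, not an analysis of the chapter's dataset: NumPy is assumed to be installed, the variables are simulated, the relationships are linear, and every coefficient is made up. `ols_slopes` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# Made-up linear model: Violent Crime -> Police Presence -> Property Crime,
# plus a direct arrow Violent Crime -> Property Crime
violent_crime = rng.normal(size=n)
police_presence = 0.8 * violent_crime + rng.normal(size=n)
property_crime = 0.5 * violent_crime - 0.5 * police_presence + rng.normal(size=n)

def ols_slopes(y, *predictors):
    """Least-squares slopes for y ~ intercept + predictors (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

# Mediator left out: slope estimates the total effect, 0.5 + 0.8 * (-0.5) = 0.1
total_effect = ols_slopes(property_crime, violent_crime)[0]
# Mediator included: slope estimates the direct effect, 0.5
direct_effect = ols_slopes(property_crime, violent_crime, police_presence)[0]

print(f"Total effect estimate:  {total_effect:.2f}")
print(f"Direct effect estimate: {direct_effect:.2f}")
```

Neither number is "wrong"; they answer different questions, which is why the choice to adjust for a mediator has to follow from the question being asked.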

Confounders: Why Adjusting for Them Is Crucial

Confounders, in contrast to mediators, are variables that distort the relationship between the independent variable and the outcome. A confounder is a third variable that affects both the independent variable (e.g., violent crime rates) and the outcome (e.g., property crime rates), creating a spurious association between the two. In other words, a confounder can make it appear as though there is a causal relationship between the independent and dependent variables when, in reality, the relationship is due to a third factor.

For example, suppose we suspect that socioeconomic status (SES) is a confounder in the relationship between violent crime rates and property crime rates. SES reflects broader social conditions such as poverty, education, and neighborhood stability, and these conditions likely influence both violent crime and property crime. If we don’t adjust for SES, we might mistakenly conclude that violent crime causes property crime, when in fact both might be driven by the same underlying factor (i.e., SES).

This is where regression analysis becomes invaluable. By adjusting for confounders in our regression model, we can get closer to the causal effect of the independent variable on the outcome, provided the relevant confounders are measured and the model's functional form is reasonable. Essentially, regression allows us to "control" for confounders: we estimate the effect of the independent variable while holding the confounders constant.

Regression as a Tool for Addressing Confounders

Now that we understand the importance of confounders and mediators, the next step is to discuss how we can use regression analysis to address them in practice.

In regression models, the goal is often to estimate the effect of an independent variable (e.g., violent crime rates) on a dependent variable (e.g., property crime rates) while controlling for other variables that might confound or mediate the relationship. The core idea is to adjust for confounders by including them as covariates in the regression model.

For example, imagine we want to study the relationship between violent crime rates and property crime rates, but we are concerned that variables like socioeconomic status and police presence could confound or mediate this relationship. The simplest way to adjust for confounders is to include them as additional predictors in the regression model.

Let’s say we’re interested in a linear regression where we predict property crime rates from violent crime rates, adjusting for potential confounders like homicide rates and socioeconomic status. This would look like:

Property Crime = β0 + β1(Violent Crime) + β2(Homicide Rate) + β3(Socioeconomic Status) + ϵ

In this equation:

β1 represents the effect of violent crime on property crime after adjusting for homicide rates and socioeconomic status.
β2 and β3 represent the effects of homicide rates and socioeconomic status on property crime.
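
The equation can be tried out on simulated data where the true coefficients are known. The sketch below assumes NumPy is installed and uses entirely made-up data and coefficients (a true β1 of 0.3); it is not fitted to the chapter's dataset. It estimates the model with `np.linalg.lstsq` and compares β1 with the slope from a regression that omits the two confounders.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Made-up confounders and linear relationships
ses = rng.normal(size=n)
homicide = rng.normal(size=n)
violent = 0.5 * homicide - 0.7 * ses + rng.normal(size=n)
property_crime = 0.3 * violent + 0.2 * homicide - 0.6 * ses + rng.normal(size=n)

# Adjusted model: Property Crime = b0 + b1*Violent + b2*Homicide + b3*SES + error
X_adj = np.column_stack([np.ones(n), violent, homicide, ses])
beta, *_ = np.linalg.lstsq(X_adj, property_crime, rcond=None)
b0, b1, b2, b3 = beta

# Unadjusted model: Property Crime = b0 + b1*Violent + error
X_raw = np.column_stack([np.ones(n), violent])
b1_unadjusted = np.linalg.lstsq(X_raw, property_crime, rcond=None)[0][1]

print(f"b1 adjusted:   {b1:.2f}")
print(f"b1 unadjusted: {b1_unadjusted:.2f}")
```

In this simulation the unadjusted slope overstates the effect because the confounders push violent crime and property crime in the same direction; the adjusted β1 lands close to the value used to generate the data.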

Practical Application with Your Data: An R Script for Regression

Let's now turn to the provided dataset to see how we can apply regression analysis to adjust for confounders and mediators. Here’s how you can structure an R script to perform regression analysis on your dataset.

# Load the necessary libraries
library(tidyverse)

# Load the dataset
data <- read.csv("path_to_your_data/chapter1log.csv")

# Define your dependent and independent variables
dependent_var <- "Property_sum"  # This could be Violent_sum or any other variable
independent_var <- "Violent_sum"  # Independent variable (e.g., violent crime)

# Choose the confounders and mediators to control for
confounders <- c("Homicide_sum", "Socioeconomic_status", "Burglary_per_100k")
mediators <- c("PolicePresence", "AggAssault_per_100k")

# Combine all variables for the regression model
all_vars <- c(dependent_var, independent_var, confounders, mediators)

# Subset the dataset for the variables of interest
data_subset <- data %>% select(all_of(all_vars))

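# Run a linear regression model adjusting for confounders and mediators.
# Note: with the mediators included, the coefficient on the independent variable
# estimates its direct effect. Drop `mediators` from `predictors` to estimate
# the total effect instead.
predictors <- c(independent_var, confounders, mediators)
model_formula <- reformulate(predictors, response = dependent_var)
model <- lm(model_formula, data = data_subset)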

# View the regression summary to assess the results
summary(model)

Directions for Readers

Step 1: Choose Your Variables: First, decide on your dependent variable (e.g., Property_sum, Violent_sum) and independent variable (e.g., Violent_sum, Property_sum).
Step 2: Select Confounders: Identify which variables might be confounding the relationship between the independent and dependent variables. Common confounders in crime-related datasets include factors like Homicide_sum, Socioeconomic_status, and Burglary_per_100k.
Step 3: Consider Mediators: Think about potential mediators in your analysis. For example, if you're interested in how violent crime affects property crime, police presence might mediate that effect.
Step 4: Run the Regression: Use the provided R script to run a regression model, adjusting for the confounders you've identified. Include the mediators only if you want the direct effect; leave them out to estimate the total effect.
Step 5: Interpret the Results: After running the regression, examine the coefficient of the independent variable and compare it with the coefficient from a model that omits the confounders. If the coefficient of violent crime changes substantially after adjusting for confounders like Homicide_sum or Socioeconomic_status, it suggests that those confounders were influencing the relationship between violent crime and property crime.

Task: Practice with the Data

Using the provided data, follow these steps:

Select your dependent variable (e.g., Property_sum, Violent_sum).
Choose an independent variable (e.g., Violent_sum, Property_sum).
Identify and include potential confounders (e.g., Homicide_sum, Socioeconomic_status, Burglary_per_100k).
Identify potential mediators (e.g., PolicePresence, AggAssault_per_100k).
Run the regression model and interpret the results.

By following these instructions and practicing with the provided dataset, you will gain a hands-on understanding of how to use regression analysis to address confounders and mediators in causal inference.
