Issues with controlling for ‘IQ’

And why controlling for things is harder than you might think.
statistics
cognition
Author

Giacomo Bignardi

Published

June 17, 2024

Many developmental science studies have the following setup.

We start with a cognitive skill that we think is important for a particular outcome. Let’s use the example of “number sense” measured using a non-symbolic number discrimination task. We hypothesise that this cognitive skill is particularly important for influencing a specific outcome, let’s say mathematical ability.

A problem with cognitive variables is that individuals who do well on one test of cognitive or academic performance generally do well on others. This phenomenon has been termed the positive manifold.

One explanation for the positive manifold is that individuals have some general cognitive ability (or “general intelligence”) that causes performance on all other cognitive skills.

So, going back to our example study, a standard way of testing whether our cognition of interest (number sense) is important for our outcome of interest (mathematics) is to run a multivariable regression analysis with number sense and “IQ” as predictors, and mathematics ability as the outcome variable. Usually, “IQ” is measured using a nonverbal reasoning assessment.

What could possibly go wrong? Let’s explore this using a causal diagram.

Residual Confounding

Let’s hypothesise that general cognitive ability causes one to have better number sense, nonverbal reasoning and mathematics skills, and that neither number sense nor nonverbal reasoning has a causal effect on mathematics. Let’s also assume that multiple other factors independently cause number sense, nonverbal reasoning, and mathematics skills. Because these other causes are not in our model, we can represent them as random “error” terms.

# Install ggdag if it is not already installed, then load it
if (!requireNamespace("ggdag", quietly = TRUE)) {
  install.packages("ggdag")
}
library(ggdag)

# Define the DAG structure
dag <- dagify(
  NS ~ GCA + Err_NS,
  NVR ~ GCA + Err_NVR,
  Math ~ GCA + Err_MA,
  coords = list(
    x = c(GCA = 0, NS = 1, NVR = 1, Math = 2, Err_NS = 1, Err_NVR = 1, Err_MA = 3),
    y = c(GCA = 0, NS = 1, NVR = -1, Math = 0, Err_NS = 2, Err_NVR = -2, Err_MA = 0)
  )
)

# Plot the DAG using ggdag
ggdag(dag, node_size = 20) +
  geom_dag_text() +
  theme_dag() + 
  ggplot2::labs(title = "Hypothesised Causal Diagram")

In this example, general cognitive ability (GCA) is a common cause of number sense (NS), nonverbal reasoning (NVR) and mathematics (Math). We will simulate data in which neither number sense nor nonverbal reasoning has a direct causal effect on mathematics. Therefore, if our analyses are appropriate, we should conclude that number sense has no specific causal effect on mathematics.

In the real world, we rely on proxy measures such as nonverbal reasoning because we cannot measure general cognitive ability directly. However, nonverbal reasoning is an imperfect measure of general cognitive ability because it has other causes (which we have termed “Err_NVR”), such as previous experience with abstract reasoning assessments, mood at the time of testing, background noise, etc.

As we will see, if we try to control for a covariate (e.g., general cognitive ability) using a proxy measure that imperfectly measures the covariate (e.g., nonverbal reasoning performance), we will not fully adjust for confounding. This artefact is called residual confounding.

We can simulate what happens in R when we use multivariable regression to test for an association between number sense and mathematics under two conditions. In the first condition, we control for general cognitive ability, and in the second condition we control for nonverbal reasoning.

To simulate this model in R, we start with exogenous variables (variables that do not have a cause in the model) and work towards our endogenous variables caused by our previously simulated variables:

Model Simulation Code

set.seed(10)

sample_size = 100000

gca      = rnorm(sample_size)
err_ns   = rnorm(sample_size)
err_nvr  = rnorm(sample_size)
err_math = rnorm(sample_size)

nvr = gca + err_nvr
ns  = gca + err_ns
math = gca + err_math

df_sim = data.frame(gca, nvr, ns, math)

Let’s run multiple regression analyses predicting mathematics:

model0 = lm(math ~ ns + gca, data = df_sim)
model1 = lm(math ~ ns + nvr, data = df_sim)

summary(model0)

Call:
lm(formula = math ~ ns + gca, data = df_sim)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3696 -0.6787 -0.0029  0.6750  4.1727 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.004004   0.003170  -1.263    0.207    
ns          -0.001263   0.003180  -0.397    0.691    
gca          0.999897   0.004489 222.725   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.003 on 99997 degrees of freedom
Multiple R-squared:  0.4999,    Adjusted R-squared:  0.4999 
F-statistic: 4.997e+04 on 2 and 99997 DF,  p-value: < 2.2e-16

summary(model1)

Call:
lm(formula = math ~ ns + nvr, data = df_sim)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0427 -0.7827 -0.0007  0.7824  4.6588 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.006802   0.003662  -1.858   0.0632 .  
ns           0.336593   0.002988 112.658   <2e-16 ***
nvr          0.329477   0.002988 110.272   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.158 on 99997 degrees of freedom
Multiple R-squared:  0.3329,    Adjusted R-squared:  0.3329 
F-statistic: 2.495e+04 on 2 and 99997 DF,  p-value: < 2.2e-16

In model0, with general cognitive ability as a covariate, the effect of number sense on math is close to 0 (beta ≈ .00). However, when we control for nonverbal reasoning instead of GCA, there is a highly significant effect of number sense on math (beta ≈ .33).

In the real world, we cannot control for general cognitive ability directly; we can only use proxy measures like nonverbal reasoning. However, because nonverbal reasoning is an imperfect measure of general cognitive ability, it doesn’t fully adjust for confounding! Residual confounding occurs even when we have measured all the relevant confounding variables, if those measurements contain some measurement error.

In this example, the reliability of nonverbal reasoning was relatively poor. We can measure its reliability by calculating the squared correlation with general cognitive ability. In this case the reliability is 50%.

Reliability of nonverbal reasoning for measuring GCA:

cor(gca, nvr)^2
[1] 0.4986983

Does increasing the reliability of our covariate help?

If we increase the reliability of our covariate (NVR), this will reduce the amount of residual confounding. However, even if reliability is excellent, if it is less than 100% reliable (100% reliability = no measurement error), we will still see some residual confounding.
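To see why, here is a sketch of the algebra under this post's simulation setup: all exogenous terms are independent with unit variance, and NVR = GCA + a · Err_NVR. The scaling factor a and the symbol ρ (the reliability of NVR) are notation introduced here for illustration.

```latex
% Reliability of NVR as a measure of GCA (squared correlation)
\rho = \frac{\operatorname{Cov}(GCA, NVR)^2}{\operatorname{Var}(GCA)\,\operatorname{Var}(NVR)} = \frac{1}{1 + a^2}

% OLS coefficient for NS in the regression Math ~ NS + NVR
\beta_{NS}
  = \frac{\operatorname{Var}(NVR)\operatorname{Cov}(NS, Math) - \operatorname{Cov}(NS, NVR)\operatorname{Cov}(NVR, Math)}
         {\operatorname{Var}(NS)\operatorname{Var}(NVR) - \operatorname{Cov}(NS, NVR)^2}
  = \frac{(1 + a^2) - 1}{2(1 + a^2) - 1}
  = \frac{a^2}{1 + 2a^2}
  = \frac{1 - \rho}{2 - \rho}
```

Under these assumptions, the coefficient is zero only when ρ = 1, and it is positive for any lower reliability. With ρ = 0.5 it equals 1/3, which matches the simulated estimate of roughly one third above.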

Let’s rerun the simulation, increase the reliability of NVR to 90%, and see what happens.

We can increase the reliability of nonverbal reasoning by reducing the variance of its error component. We can do this by scaling the err_nvr term, setting nvr = gca + 1/3*err_nvr.
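The factor of 1/3 follows from the reliability definition used earlier (the squared correlation between nvr and gca). As a quick derivation, assuming gca and err_nvr are independent with unit variance, and writing a for the scaling factor (a symbol introduced here for illustration):

```latex
\text{reliability} = \frac{\operatorname{Var}(gca)}{\operatorname{Var}(gca + a \cdot err\_nvr)} = \frac{1}{1 + a^2} = 0.9
\;\Longrightarrow\;
a^2 = \frac{1}{0.9} - 1 = \frac{1}{9}
\;\Longrightarrow\;
a = \frac{1}{3}
```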

set.seed(10)

sample_size = 100000

gca      = rnorm(sample_size)
err_ns   = rnorm(sample_size)
err_nvr  = rnorm(sample_size)
err_math = rnorm(sample_size)

nvr = gca  + 1/3*err_nvr # see https://www.wolframalpha.com/input?i=.9+%3D+1%2F%281%2Ba%5E2%29 
ns  = gca  + err_ns
math = gca + err_math

df_sim = data.frame(gca, nvr, ns, math)

cat("Reliability of NVR for measuring GCA = ",round(cor(nvr, gca)^2,3))
Reliability of NVR for measuring GCA =  0.9

model0 = lm(math ~ ns, data = df_sim)
model1 = lm(math ~ ns + nvr, data = df_sim)

summary(model1)

Call:
lm(formula = math ~ ns + nvr, data = df_sim)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3178 -0.7115 -0.0040  0.7091  4.0562 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.005570   0.003313  -1.681   0.0927 .  
ns           0.091569   0.003166  28.924   <2e-16 ***
nvr          0.816162   0.004244 192.296   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.048 on 99997 degrees of freedom
Multiple R-squared:  0.4538,    Adjusted R-squared:  0.4538 
F-statistic: 4.153e+04 on 2 and 99997 DF,  p-value: < 2.2e-16

In this case, we still observe some residual confounding even when we have an excellent measure of general cognitive ability. Number sense is still a highly significant predictor of math skills, but its effect is much smaller now (beta ≈ .09).
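To see the pattern across a wider range, the simulation can be repeated at several reliability levels. The following is a minimal sketch that assumes the same data-generating model as above (no true effect of number sense on math). The helper beta_ns_at() and the chosen reliability values are illustrative and not part of the original analysis.

```r
# Sketch: how the number-sense coefficient changes with the reliability of NVR.
set.seed(10)
n <- 100000

gca      <- rnorm(n)
err_ns   <- rnorm(n)
err_nvr  <- rnorm(n)
err_math <- rnorm(n)

ns   <- gca + err_ns
math <- gca + err_math   # number sense has no causal effect on math

# Hypothetical helper: fit math ~ ns + nvr for a target reliability of nvr
beta_ns_at <- function(reliability) {
  a   <- sqrt(1 / reliability - 1)   # error scaling that yields this reliability
  nvr <- gca + a * err_nvr
  unname(coef(lm(math ~ ns + nvr))["ns"])
}

reliabilities <- c(0.5, 0.7, 0.9, 0.99, 1)
sweep_results <- data.frame(
  reliability = reliabilities,
  beta_ns     = sapply(reliabilities, beta_ns_at)
)
print(round(sweep_results, 3))
```

In this setup the coefficient for number sense should shrink steadily as reliability increases, and approach zero only when NVR measures GCA without error.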

Solutions

As always, outlining the problem is easier than fixing it.

Better covariates. Increasing the reliability of our covariate (nonverbal reasoning) reduced the beta coefficient for number sense. Perfect reliability is rarely possible. Still, if we want a better observed measure of general cognitive ability, we should use more than one measure of cognitive or academic skills.
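As a rough illustration of why several measures help, the sketch below averages a few noisy indicators of GCA into a composite and uses that as the covariate. The number of tests (k = 4), their equal error variances, and all variable names are assumptions made for this example.

```r
# Sketch: averaging several noisy indicators of GCA gives a more reliable covariate.
set.seed(10)
n <- 100000
k <- 4   # number of cognitive tests (illustrative assumption)

gca  <- rnorm(n)
ns   <- gca + rnorm(n)
math <- gca + rnorm(n)   # number sense has no causal effect on math

# Each column is one hypothetical test: gca plus independent unit-variance error
tests     <- sapply(seq_len(k), function(i) gca + rnorm(n))
composite <- rowMeans(tests)

# Reliability as the squared correlation with GCA
reliability_single    <- cor(tests[, 1], gca)^2
reliability_composite <- cor(composite, gca)^2

# Coefficient for number sense when adjusting for one test vs the composite
beta_single    <- unname(coef(lm(math ~ ns + tests[, 1]))["ns"])
beta_composite <- unname(coef(lm(math ~ ns + composite))["ns"])

round(c(reliability_single = reliability_single,
        reliability_composite = reliability_composite,
        beta_single = beta_single,
        beta_composite = beta_composite), 3)
```

Under these assumptions the composite should be more reliable than any single test (about .80 versus .50), and the spurious coefficient for number sense should roughly halve. It still does not reach zero, because the composite is not a perfect measure either.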

Focus on effect sizes rather than p-values. Even if we do a good job with covariate adjustment, a little residual confounding may sneak in, and thus, a small relationship between our predictor and outcome may persist. With large datasets, even tiny effect sizes may be statistically significant.

A more convincing argument could be made if, instead of focusing on whether number sense is a statistically significant predictor, we focus on how strongly it predicts math skills. A better case could be made for its importance if it is a strong predictor of math before and after adjusting for a set of reliable and well-justified covariates. In our example above, the effect is pretty small after adjusting for our more reliable measure of GCA.
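One way to make this concrete is to report how much variance number sense explains over and above the covariate, alongside its p-value. The sketch below assumes the same data-generating model as the 90%-reliability simulation above. The choice of incremental R-squared as the effect size is an illustration, not the only option.

```r
# Sketch: a highly significant predictor can still have a tiny effect size.
set.seed(10)
n <- 100000

gca  <- rnorm(n)
nvr  <- gca + 1/3 * rnorm(n)   # roughly 90% reliable proxy for GCA
ns   <- gca + rnorm(n)
math <- gca + rnorm(n)         # number sense has no causal effect on math

reduced <- lm(math ~ nvr)        # covariate only
full    <- lm(math ~ ns + nvr)   # covariate plus number sense

# p-value for number sense in the full model
p_value_ns <- summary(full)$coefficients["ns", "Pr(>|t|)"]

# Incremental variance explained by adding number sense
delta_r2 <- summary(full)$r.squared - summary(reduced)$r.squared

cat("p-value for ns:", format.pval(p_value_ns), "\n")
cat("Incremental R-squared for ns:", round(delta_r2, 4), "\n")
```

With a sample this large the p-value is extremely small, yet in this setup number sense adds well under one percentage point of explained variance.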

Better models. We cannot assume that controlling for a single measure of general cognitive ability, such as nonverbal reasoning, will tell us anything about the specificity of cognition-outcome relationships. A fundamental problem with this approach is that research has consistently shown that while tests of nonverbal reasoning are good measures of general cognitive ability - they are not particularly special among cognitive measures. We found, for example, that vocabulary, reading and short-term memory tests can have even stronger relationships with general cognitive skills. Importantly, pretty much any cognitive or academic test can be considered a measure of general cognitive ability.

An alternative approach would be to use a latent variable modelling approach and model general cognitive ability as a latent variable that causes performance on all cognitive variables, and then test if there are residual associations between cognitive tests and outcomes. There are also fancier ways of modelling cognitive data that do not assume a single latent variable, arguably an over-simplification.
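As a simplified stand-in for a full latent variable (structural equation) model, the sketch below fits a one-factor model with base R's factanal() and inspects the residual correlations that the single factor leaves unexplained. The extra vocab test, the equal loadings, and the variable names are assumptions made for this example. A dedicated SEM package would be the more usual tool for formally testing residual associations.

```r
# Sketch: one-factor model of cognitive tests, then inspect residual correlations.
set.seed(10)
n   <- 100000
gca <- rnorm(n)

df_tests <- data.frame(
  ns    = gca + rnorm(n),
  nvr   = gca + rnorm(n),
  vocab = gca + rnorm(n),   # hypothetical additional cognitive test
  math  = gca + rnorm(n)
)

# Maximum-likelihood factor analysis with a single general factor
fit        <- factanal(df_tests, factors = 1)
loadings_g <- unclass(fit$loadings)   # 4 x 1 matrix of standardised loadings

# Correlations implied by the one-factor model
implied <- loadings_g %*% t(loadings_g)
diag(implied) <- 1

# What the general factor fails to explain
residual_cor <- cor(df_tests) - implied
round(residual_cor, 3)
```

Because the data here are generated from a single general factor, the residual correlation between ns and math should be close to zero. A sizeable residual correlation in real data would be the kind of evidence that points to a specific link beyond general cognitive ability.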

Experiments! Confounding is challenging to eliminate via covariate adjustment. Another approach would be to manipulate number sense experimentally, such as via a randomised training study where half the participants train their number sense skills and the other half do not. Because group assignment (i.e., group = 0 for the control group, group = 1 for the training group) is randomised, no confounding can occur, because GCA or other variables cannot cause group assignment. One study did just this and did not find that number sense training improves symbolic math skills.
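The logic of the randomised design can be sketched with the same kind of simulation. The size of the training effect on number sense (training_gain = 0.5) and the variable names are assumptions made for this example. As in the rest of the post, number sense is simulated to have no causal effect on math.

```r
# Sketch: randomised training removes confounding by GCA.
set.seed(10)
n <- 100000

gca   <- rnorm(n)
group <- rbinom(n, size = 1, prob = 0.5)   # 0 = control, 1 = training (randomised)

training_gain <- 0.5                        # assumed effect of training on number sense
ns   <- gca + training_gain * group + rnorm(n)
math <- gca + rnorm(n)                      # number sense has no causal effect on math

# Training changes number sense...
effect_on_ns <- unname(coef(lm(ns ~ group))["group"])

# ...but group differences in math estimate the causal effect, which is zero here
effect_on_math <- unname(coef(lm(math ~ group))["group"])

# Randomisation means group is unrelated to GCA
group_gca_cor <- cor(group, gca)

round(c(effect_on_ns = effect_on_ns,
        effect_on_math = effect_on_math,
        group_gca_cor = group_gca_cor), 3)
```

No covariate for GCA is needed in this design: the group difference in math should be near zero even though GCA strongly influences both number sense and math.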

When randomised experiments are impossible, we can sometimes leverage natural experiments or use other causal analysis methods too.

Final note

If our goal is to show that a specific domain of cognition is important for a specific outcome, simply showing a statistically significant association when controlling for nonverbal reasoning is a really limited approach. Using effect sizes or testing other structural hypotheses in a more principled way is much better.

There is nothing inherently wrong with nonverbal reasoning assessments, but we should be cautious when using them as proxy measures of general cognitive ability/intelligence/IQ.

Further Reading

Residual confounding is widely discussed in epidemiology & biostatistics, but there have been some nice articles written for psychology audiences:

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLoS ONE, 11(3), e0152719.

Kahneman, D. (1965). Control of spurious association and the reliability of the controlled variable. Psychological Bulletin, 64(5), 326.

On the topic of measuring “IQ” with nonverbal reasoning:

Gignac, G. E. (2015). Raven’s is not a pure measure of general intelligence: Implications for g factor theory and the brief measurement of g. Intelligence, 52, 71-79.

Applied Example Using Latent Variable Modelling Approach:

Bignardi, G., Mareva, S., & Astle, D. E. (2024). Parental socioeconomic status weakly predicts specific cognitive and academic skills beyond general cognitive ability. Developmental Science, 27(2), e13451.