Running head: INDIVIDUAL DIFFERENCES IN COGNITIVE TASKS 1
Why Most Studies of Individual Differences With Inhibition Tasks Are Bound To Fail
Jeffrey N. Rouder1, Aakriti Kumar1, & Julia M. Haaf2
1University of California, Irvine
2University of Amsterdam
Version 2, 6/2019
Author Note
We are indebted to Craig Hedge, Claudia von Bastian, and Alodie Rey-Mermet who
allowed us to reuse their individual-differences data sets. The Rmarkdown source code for
this paper is available at
https://github.com/PerceptionAndCognitionLab/ctx-inhibition/papers/revision. This source
code contains links to all data sets, all analyses, and code for drawing the figures and
typesetting the paper.
Correspondence concerning this article should be addressed to Jeffrey N. Rouder.
E-mail: jrouder@uci.edu

Abstract

Establishing correlations among common inhibition tasks such as the Stroop and flanker tasks has proven quite difficult despite many attempts. It remains unknown whether this difficulty occurs because inhibition is a disparate set of phenomena or whether the analytic techniques to uncover a unified inhibition phenomenon fail in real-world contexts. In this paper, we explore the field-wide inability to assess whether inhibition is unified or disparate. We do so by showing that ordinary methods of correlating performance, including those with latent variable models, are doomed to fail because of trial noise (or, as it is sometimes called, measurement error). We then develop hierarchical models that account for variation across trials, variation across individuals, and covariation across individuals and tasks. These hierarchical models also fail to uncover correlations in typical designs, and for the same reasons. While we can characterize the degree of trial noise, we cannot recover correlations in typical designs that enroll hundreds of people. We discuss possible improvements to study designs that may help uncover correlations, though we are unsure how feasible they are.

Keywords: Individual Differences, Cognitive Tasks, Hierarchical Models, Bayesian Inference

D; e.g., Bollen, 1989; Skrondal & Rabe-Hesketh, 2004).

There is a wrench, however, in the setup. Unfortunately, scores from experimental tasks correlate with one another far less than one might think a priori. An example is the lack of correlation between the Stroop task and the flanker task. While Friedman and Miyake (2004) found a healthy correlation of .18 between the tasks, subsequent large-scale studies from Hedge, Powell, and Sumner (2018), Pettigrew and Martin (2014), Rey-Mermet, Gade, and Oberauer (2018), and Von Bastian, Souza, and Gade (2015) have found correlations that range from -.09 to .03 and average -.03. The near-zero correlation between these two tasks is not an outlier. As a rule, effects in inhibition tasks show surprisingly low correlations (Rey-Mermet et al., 2018). Nor are the low correlations limited to inhibition tasks. Ito et al. (2015) considered several implicit attitude tasks used for measuring implicit bias. Here again, there is surprisingly little correlation among tasks that purportedly measure the same concept. This lack of correlation may also be seen in latent variable analyses. Factor loadings from latent variables to tasks are often dominated by a single task, indicating that there is little covariation to decompose (MacKillop et al., 2016).

The question of why these correlations are so low has been the subject of recent work by Draheim, Mashburn, Martin, and Engle (2019), Hedge et al. (2018), and Rey-Mermet et al. (2018), among others. On one hand, the low correlations could reflect underlying true task performance that is uncorrelated or weakly correlated. In this case, they indicate that performance on the tasks does not largely overlap, and that the tasks are indexing different mental processes. Indeed, this substantive interpretation is taken by Rey-Mermet et al. (2018), who argue that inhibition should be viewed as a disparate rather than a unified concept. By extension, different tasks rely on different and disparate inhibition processes.
On the other hand, the true correlations could be large but masked by measurement error. Several authors have noted the possibility of a large degree of measurement error. Hedge et al. (2018), for example, set out to empirically assess the reliability of task measures

by asking participants to perform a battery of tasks and to return three weeks later to repeat the battery. With these two measures, Hedge et al. (2018) computed the test-retest reliability of the tasks. The results were somewhat disheartening, with test-retest reliabilities for popular tasks ranging from .2 to .7. Draheim et al. (2019) argue that commonly used response time difference scores are susceptible to low reliability and other artifacts such as speed-accuracy tradeoffs. It has been well known for over a century that correlations among measures are attenuated in low-reliability environments (Spearman, 1904). Yet, how much attenuation can we expect? If it is negligible, then the observed low correlations may be interpreted as true indicators that the tasks are largely measuring uncorrelated mental abilities. But if the attenuation is sizable, then the underlying true correlation remains unknown. One of our contributions in this paper is to document just how big this attenuation is in common designs.

Figure 2 provides an example of attenuation. Shown in Panel A are hypothetical true difference scores (or true effects) for 200 individuals on two tasks. The plot is a scatterplot: each point is an individual; the x-axis value is the true score on one task, and the y-axis value is the true score on the other task. As can be seen, there is a large correlation; in this case it is 0.78. Researchers do not observe these true scores; instead they analyze difference scores from noisy trial data with the tabulation shown in Figure 1. Figure 2B shows the scatterplot of these observed difference scores (or observed effects). Because these observed effects include trial noise, the correlation is attenuated; in this case it is 0.38. While this correlation is statistically detectable, the observed value is dramatically lower than the true one. The amount of attenuation depends on critical inputs such as the number of trials and the degree of trial variability. Therefore, to get a realistic picture of the effects of measurement error, it is critical to obtain realistic values for these inputs.
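The attenuation in Figure 2 is easy to reproduce in simulation. The sketch below uses illustrative values of our own choosing (a 25 ms spread of true effects, 175 ms trial noise, 100 trials per condition, and a true correlation of .8); it draws correlated true effects for 200 people and adds the noise implied by averaging noisy trials:

```python
import numpy as np

rng = np.random.default_rng(1)

I, L = 200, 100            # participants, trials per condition
sigma_theta = 0.025        # SD of true effects across people (25 ms, in s)
sigma = 0.175              # trial SD, so eta = sigma_theta / sigma = 1/7
rho = 0.8                  # true correlation between the two tasks

# Correlated true effects for the two tasks
cov = sigma_theta**2 * np.array([[1.0, rho], [rho, 1.0]])
theta = rng.multivariate_normal([0.05, 0.05], cov, size=I)

# Observed effect = true effect + noise of a difference of two condition
# means, each mean based on L trials: noise variance is 2 * sigma^2 / L
obs = theta + rng.normal(0.0, sigma * np.sqrt(2.0 / L), size=(I, 2))

r_true = np.corrcoef(theta[:, 0], theta[:, 1])[0, 1]
r_obs = np.corrcoef(obs[:, 0], obs[:, 1])[0, 1]
print(round(r_true, 2), round(r_obs, 2))  # observed correlation is attenuated
```

With these assumed values the observed correlation lands near half the true one, mirroring the drop from Panel A to Panel B.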

there is no such thing as the reliability of a task or a correlation between tasks without reference to sample sizes. The second consequence is that the number of trials is far more important than the number of participants. The number of participants determines the unsystematic noise in the correlation; the number of trials determines the systematic downward bias. With few trials per task and many participants, researchers will have high confidence in a greatly biased estimate.

The realization that measurement error is primarily trial noise is wonderful news! It means that measurement error may be overcome by running many trials per participant per condition per task. Even more importantly, trial noise can be estimated and perhaps removed using statistical techniques. The hope is that with such techniques, it may be possible to obtain unbiased estimates of the correlations even in realistic designs with limited numbers of trials per person per task. For example, Behseta, Berdyyeva, Olson, and Kass (2009), Matzke et al. (2017), and Rouder and Haaf (2019) propose hierarchical statistical models to disattenuate correlations. The potential of such models is shown in Figure 2C. Here, a hierarchical model, to be discussed subsequently, was applied to the data in Figure 2B, and the resulting posterior estimates of participants' effects reveal the true, strong correlation.

Based on the demonstration in Figure 2C, we had come into this research with the hope of telling a you-can-have-your-cake-and-eat-it story. We thought that perhaps hierarchical models would allow for the accurate recovery of correlations in typical designs, providing an answer to whether inhibition is unified or disparate. Yet, the story we tell here is far more complicated. First, we study 15 previously published experiments to characterize the amount of measurement noise, true variability, and sample sizes in typical designs in inhibition-task research with individual differences. With these inputs, we then study correlation recovery through simulation. To foreshadow, overall estimates from hierarchical models do disattenuate correlations. But, in the process, they suffer from a large

degree of imprecision. It seems that in typical designs, one can use sample statistics and suffer massive attenuation or use a modeling approach and accept a large degree of imprecision. And this difficulty is why we believe most studies of individual differences with tasks are doomed to fail. This story is not the one we had hoped for, but it is a critical story for the community of individual-differences scholars to digest.

Spearman’s Correction for Attenuation

Before addressing the main question about recovery, we consider the Spearman (1904) correction for the attenuation of correlation from measurement error. In this brief detour, we assess whether Spearman's correction leads to the recovery of latent correlations among tasks in typical designs. The assessment provides guidance because the data generation in our simulations matches the assumptions of Spearman's correction well. If Spearman's correction cannot recover the latent correlations in realistic designs, these correlations may indeed be unrecoverable.

Spearman's derivation comes from decomposing observed variation into true variation and measurement noise. When reliabilities are low, correlations may be upweighted to account for them. In Spearman's classic formula, the disattenuated correlation between two variables x and y, denoted r*xy, is

r*xy = rxy / sqrt(rxx ryy),

where rxy is the sample correlation and rxx and ryy are the sample reliabilities.^1

^1 The estimation of reliability in tasks differs from the estimation of reliability in a classical test because tasks have replicates within people and conditions. The presence of these replicates may be leveraged to produce better estimates of error variability than when they are absent. Let Ȳik and sȲik be the sample mean and sample standard error for the i-th individual in the k-th condition, k = 1, 2. Let di = Ȳi2 − Ȳi1 be the effect for the i-th individual, and let Vd be the sample variance of these effects. This sample variance is the total variance, which is decomposed into true and error variances. Assuming an equal number of trials per condition, the error variance for the i-th person, denoted Vei, is s²Ȳi1 + s²Ȳi2. The estimate of error variance is simply the average of these individual error variances, or Ve = Σi Σk s²Ȳik / I. The reliability is r = (Vd − Ve) / Vd.
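Both the reliability estimate of Footnote 1 and Spearman's correction are short computations. The sketch below applies them to simulated trial data; all parameter values are illustrative assumptions, not estimates from the paper's data sets:

```python
import numpy as np

rng = np.random.default_rng(2)
I, L = 200, 100
sigma_theta, sigma = 0.025, 0.175   # illustrative values, in seconds

# Trial-level data for one task: congruent and incongruent conditions
theta = rng.normal(0.05, sigma_theta, I)                 # true effects
y_cong = rng.normal(0.8, sigma, (I, L))
y_incong = rng.normal(0.8 + theta[:, None], sigma, (I, L))

d = y_incong.mean(axis=1) - y_cong.mean(axis=1)          # individual effects
V_d = d.var(ddof=1)                                      # total variance
# Error variance: average of each person's summed squared standard errors
V_e = np.mean(y_cong.var(axis=1, ddof=1) / L
              + y_incong.var(axis=1, ddof=1) / L)
r_xx = (V_d - V_e) / V_d                                 # reliability (Footnote 1)

def disattenuate(r_xy, r_xx, r_yy):
    """Spearman's correction: upweight a sample correlation by reliabilities."""
    return r_xy / np.sqrt(r_xx * r_yy)
```

With these inputs the estimated reliability comes out near .5, which previews the roughly factor-of-two attenuation derived later.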

each task independently, so we may safely ignore j , the task subscript (we will use it subsequently, however). The model for one task is:

Yikℓ ∼ Normal(αi + xk θi, σ²),

where αi is the i-th individual's true response time in the congruent condition, xk = 0, 1 codes the condition (xk = 1 for incongruent), θi is the i-th individual's true effect, and σ² is the trial noise within an individual-by-condition cell. The critical targets are the θi's, and these are modeled as random effects:

θi ∼ Normal(μθ, σθ²),

where μθ describes the overall mean effect and σθ² is the between-person variation in individuals' true effects. Our targets, then, are the within-cell trial noise, σ², and the between-individual variance, σθ².

To analyze the model, priors are needed for all parameters. Our strategy is to choose scientifically informed priors (Dienes & Mclatchie, 2018; Etz, Haaf, Rouder, & Vandekerckhove, 2018; Rouder, Morey, & Wagenmakers, 2016; Vanpaemel & Lee, 2012) that anticipate the overall scale of the data. The parameters on baseline response times, in seconds, are αi ∼ Normal(0.8, 1). These priors are quite broad and place no substantive constraints on the data other than that baselines are somewhere around 800 ms plus or minus 2000 ms. The prior on variability is σ² ∼ Inverse Gamma(0.1, 0.1), where the inverse gamma is parameterized with shape and scale parameters (Rouder & Lu, 2005). This prior, too, is broad and places no substantive constraint on the data. Priors for μθ and σθ² were informed by the empirical observation that typical inhibition effects are in the range of 10 ms to 100 ms. They were μθ ∼ Normal(50, 100²) and σθ² ∼ Inverse Gamma(2, 30²), where the values are in milliseconds rather than seconds. A graph of these prior settings for μθ and σθ = sqrt(σθ²) is shown in Figure 3. These priors make the substantive assumption that effects are relatively small and are not arbitrarily variable across people. The scale setting on σθ² is important as

it controls the amount of regularization in the model, and the choice of 30 (on the ms scale) is scientifically informed (see Haaf & Rouder, 2017).
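One way to see what these prior settings imply is to sample from them and inspect the resulting spread of σθ. The sketch below assumes the shape-and-scale inverse gamma parameterization stated in the text, and uses the fact that if G ∼ Gamma(shape = a, scale = 1/b), then 1/G ∼ Inverse Gamma(a, b):

```python
import numpy as np

rng = np.random.default_rng(3)

# Prior draws (ms scale): mu_theta ~ Normal(50, 100^2) and
# sigma_theta^2 ~ Inverse Gamma(shape = 2, scale = 30^2 = 900)
mu_theta = rng.normal(50, 100, 10_000)
sigma_theta = np.sqrt(1.0 / rng.gamma(2, 1 / 900, 10_000))

# The implied prior spread of sigma_theta centers in the low tens of ms,
# consistent with effects that are small but variable across people
print(round(float(np.median(sigma_theta)), 1))
```

The prior median of σθ lands near 25 ms, in line with the regularization argument the text makes for the scale setting of 30.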

We applied this model to a collection of 15 experimental tasks from a variety of authors. Brief descriptions of the tasks are provided in the Appendix. The experiments were chosen based on the following criteria: I. Raw trial-level data were available and adequately documented; this criterion is necessary because the model analysis relies on raw data and cannot be performed with the usual summary statistics. II. The raw data could be shared; this research is offered in a fully open and transparent mode (Rouder et al., 2019), and you may inspect all steps from raw data to conclusions. III. The data come from an experimental setup with a contrast between conditions, i.e., between congruent and incongruent conditions. We think that, given our limited goal of getting a sense of values for simulations, these criteria are appropriate.

The results are shown in Table 1, and the specific values inform our subsequent simulations. The first three columns describe the sample sizes: the first column is the total number of observations across the two conditions after cleaning (see Appendix), the second column is the total number of individuals, and the third column is the average number of replicates per individual per condition. The fourth and fifth columns provide estimates of reliability. The column labeled "Full" is the sample reliability using all the observations in one group (see Footnote 1); the column labeled "Split" is the split-half reliability. Here, even and odd trials comprised two groups, and the correlation of individuals' effects across these groups was upweighted by the Spearman-Brown prophecy formula. Note that the former estimate is more accurate than the split-half estimate because the former uses variability information across trials, much as in ANOVA, whereas the latter does not. The next pair of columns shows the mean sample effect and the standard deviation of individuals' sample effects around this mean. These are sample statistics calculated in the usual way and do not reflect the model. The next two columns are standard deviation estimates from the

trial noise and variability across individuals. The second of these columns reflects the model's partition of variance, that is, what is left over after trial noise, given by σ, is accounted for. Given the assumptions of the model, it reflects only the variability across individuals. Hence, it is the far better value for simulation.
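The split-half column of Table 1 can be sketched as follows: even and odd trials form the two halves, the individuals' difference scores are correlated across halves, and the half-sample correlation is stepped up with the Spearman-Brown prophecy formula. This is a generic sketch of the procedure described in the text, not the paper's actual analysis code:

```python
import numpy as np

def split_half_reliability(y_cong, y_incong):
    """Split-half reliability of difference scores. y_cong and y_incong are
    (people x trials) arrays; even and odd trials form the two halves."""
    d_even = y_incong[:, 0::2].mean(axis=1) - y_cong[:, 0::2].mean(axis=1)
    d_odd = y_incong[:, 1::2].mean(axis=1) - y_cong[:, 1::2].mean(axis=1)
    r_half = np.corrcoef(d_even, d_odd)[0, 1]
    return 2 * r_half / (1 + r_half)   # Spearman-Brown step-up
```

The step-up is needed because each half uses only half the trials, and reliability grows with trial count.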

We provide a second argument that may be more intuitive for understanding the 25 ms value. Consider the possibility that all people truly respond faster in the congruent than in the incongruent condition. Or, restated, nobody has a negative true effect. This condition is called dominance in Rouder and Haaf (2018), and it is explored extensively in Haaf and Rouder (2017) and Haaf and Rouder (2019). The result from these studies is that dominance broadly holds. In the Stroop case, everyone Stroops; that is, in the large-trial limit, everyone has truly faster responses for congruent than for incongruent stimuli. If dominance holds and the true mean effect is small across the population, say 50 ms, then the variance between individuals cannot be too high. For if it were large, then some proportion of people would have to have negative true effects. Dominance, which is natural and seems to hold in almost all data sets we have examined, provides a limit on the size of the variability. Figure 3 provides a graph of true values with a spread of 25 ms. As can be seen, there is only minimal mass on negative true values, and the spread of true values seems to us appropriate for a true 50 ms effect.

Expected Attenuation

The above results are useful for understanding how much attenuation of the correlations we should expect with the usual analysis in Figure 1. We consider the case where in each task there are L trials per condition, a common trial variance σ², and a common true variance σθ². The expected classical estimate, ρ*, is given by

ρ* = ρ ( Lσθ² / (Lσθ² + 2σ²) ).

This equation is most useful if written with the ratio η = σθ/σ, interpreted as a ratio of signal (true variability) to noise (trial noise). The attenuation factor, ρ*/ρ, is then

ρ*/ρ = L / (L + 2/η²).

The last column of Table 1 shows the value of η for the various studies; the values range from 1/11 to 1/3, with η = 1/7 corresponding to our typical case. Figure 4 shows the dependence of the attenuation factor on the number of trials (L) for various values of signal to noise. As can be seen, with the usual approach of tabulating participant-by-task scores, we expect attenuation by a factor of about 1/2.
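The attenuation factor is a one-line computation. As a sanity check on the claim above, the typical values from Table 1 (L = 100 trials per condition and η = 1/7) give a factor near 1/2:

```python
def attenuation_factor(L, eta):
    """Expected ratio rho*/rho for sample difference-score correlations,
    with L trials per condition and signal-to-noise ratio
    eta = sigma_theta / sigma."""
    return L / (L + 2 / eta**2)

# Typical case: L = 100, eta = 1/7 -> 100 / (100 + 98), about 1/2
print(round(attenuation_factor(100, 1 / 7), 2))
```

The same function reproduces the range implied by the table: for η = 1/11 the factor drops below 1/3, and for η = 1/3 it rises above .8.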

Model-Based Recovery of Correlations Among Tasks

The critical question is then whether accurate estimation of correlation is possible. The small simulation in the introduction, which was based on the above typical settings for two tasks and a true population correlation of .80, showed that naive correlations among sample effects were greatly attenuated and that Spearman's correction was unstable. We now assess the recoverability of true latent correlations with the hierarchical models used to simulate the data, for several values of the true correlation.

A Hierarchical Model for Correlation

Here we develop a hierarchical trial-level model for many tasks that explicitly models the covariation in performance among them. A precursor to this model is provided in Matzke et al. (2017) and Rouder and Haaf (2019). The difference is that these previous models are applicable for only two tasks and one correlation coefficient. They are not applicable to several tasks and coefficients.


Two Tasks

The first simulation is for two tasks. Using the typical sample sizes discussed above, each hypothetical data set consisted of 80,000 observations (200 people × 2 tasks × 2 conditions × 100 replicates per condition). One might hope that with such a large sample size, and with the goal of estimating just a single correlation, the true population correlation, ρ, might be recoverable. Supporting this hope is the success of the single run in Figure 4C. On the other hand, given the large degree of measurement noise and the instability of Spearman's correction (Figure 4B), it seems plausible that ρ may not be recoverable. For the simulations, the true correlation across the two tasks was varied over three levels: .2, .5, and .8. For each level, 100 data sets were simulated and analyzed.
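A single run of this simulation design can be sketched as follows. The generating values are the typical settings assumed above, and the correction uses the reliability estimate from Footnote 1; this is an illustration of the naive and Spearman-corrected estimators only, not of the hierarchical model:

```python
import numpy as np

def simulate_two_tasks(rho, I=200, L=100, mu=0.05,
                       sigma_theta=0.025, sigma=0.175, seed=0):
    """Simulate one two-task data set and return (naive, Spearman-corrected)
    estimates of the true correlation rho."""
    rng = np.random.default_rng(seed)
    cov = sigma_theta**2 * np.array([[1.0, rho], [rho, 1.0]])
    theta = rng.multivariate_normal([mu, mu], cov, size=I)
    d = np.empty((I, 2))
    rel = np.empty(2)
    for t in range(2):
        y_cong = rng.normal(0.8, sigma, (I, L))
        y_incong = rng.normal(0.8 + theta[:, t][:, None], sigma, (I, L))
        d[:, t] = y_incong.mean(axis=1) - y_cong.mean(axis=1)
        V_d = d[:, t].var(ddof=1)
        V_e = np.mean(y_cong.var(axis=1, ddof=1) / L
                      + y_incong.var(axis=1, ddof=1) / L)
        rel[t] = (V_d - V_e) / V_d        # reliability, as in Footnote 1
    r_naive = np.corrcoef(d[:, 0], d[:, 1])[0, 1]
    r_corrected = r_naive / np.sqrt(rel[0] * rel[1])
    return r_naive, r_corrected
```

Repeating such runs across seeds and true values of ρ yields the pattern reported next: naive estimates are biased low, and corrected estimates are better centered but variable.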

Figure 5A shows the results. Naive correlations from participant-by-task sample means are shown in red. As expected, these correlations suffer a large degree of attenuation from trial noise. Correlation estimates from Spearman’s correction are shown in green. These values are better centered though some of the corrected values are greater than 1.0. The correlation estimates from the hierarchical model are shown in blue.

Overall, the correlation estimates from Spearman's correction and the hierarchical model have less bias than the naive sample-effect correlations. Yet, the estimates are quite variable. For example, consider correlations when the population value is .2. The model estimates range from -0.21 to 0.54 and miss the target with an RMSE of 0.15. Spearman-corrected estimates are slightly better, with an RMSE of 0.14 for this case. Overall, though, this variability is quite high, especially given the large number of observations. We would not have confidence in substantive conclusions with it.

Are there risks in using model-based recovery? We see in simulation that the model

and Spearman-corrected recovery is variable. One potential problem is that in any one study, researchers using the model may inflate the values of correlations. The attenuation in the naive correlations is conservative in that recovered correlations are never inflated; rather, they are dramatically deflated. In this regard, we can think of naive correlations as having a fail-safe quality where high-value correlation estimates are avoided at the draconian expense of not detecting true high correlations. Spearman-corrected correlations do not share this fail-safe orientation. The variability in estimation results in values that are both inflated and deflated.

The critical question is about model-based recovery. Figure 5A shows only posterior mean estimates. Yet, in the Bayesian approach, the target is not just the posterior mean but the entirety of the posterior distribution. Figures 5B-D show the posterior 95% credible intervals for all runs with true correlations of .2, .5, and .8, respectively. There are two noteworthy trends. First, the 95% credible intervals contain the true value on about 90% of the simulation runs. This means that the posterior variability is relatively well calibrated and provides reasonably accurate information about the uncertainty in the correlation. Second, there is a fair amount of uncertainty, meaning that the analyst knows that correlations have not been well localized. This lack of localization provides the needed hedge against over-interpreting inflated values. With the Bayesian model-based estimates, at least we know how uncertain we are in localizing true correlations. With the Spearman correction, we have no such knowledge.

Six Tasks

We explored correlations across six tasks. Each hypothetical data set consisted of 240,000 observations. To generate a wide range of correlations, we used a one-factor model to simulate individuals’ true scores. This factor represents the individual’s inhibition ability. This ability, denoted zi , is distributed as a standard normal. Tasks may require more or less of the individuals’ inhibition ability. Therefore, task loadings onto this factor zi are variable
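The one-factor generating model described above can be sketched as follows. The loading values and residual spread below are purely illustrative assumptions (the paper's actual settings are not shown in this excerpt); the point is only how a single latent ability zi induces a range of true correlations across tasks:

```python
import numpy as np

rng = np.random.default_rng(7)

I, J = 200, 6                      # people, tasks
z = rng.normal(0.0, 1.0, I)        # latent inhibition ability (one factor)

# Hypothetical loadings and residual SD, on the seconds scale of true effects
lam = np.array([0.020, 0.015, 0.010, 0.018, 0.012, 0.008])
resid_sd = 0.010

# True effects: common factor contribution plus task-specific variation
theta = 0.05 + z[:, None] * lam + rng.normal(0.0, resid_sd, (I, J))

# Implied true correlation between tasks j and k:
#   lam[j]*lam[k] / sqrt((lam[j]**2 + resid_sd**2) * (lam[k]**2 + resid_sd**2))
```

Because the implied pairwise correlations depend on the products of loadings, a single factor with varied loadings generates the wide range of true correlations the simulation requires.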

Analysis of Rey-Mermet, Gade, and Oberauer (2018)

To assess real-world correlation recovery, we re-examined the flanker and Stroop tasks in Rey-Mermet et al.’s battery of inhibition tasks. The authors included two different types of Stroop tasks (a number Stroop and a color Stroop task, see the Appendix for details) and two different types of flanker tasks (a letter flanker and an arrow flanker task, see the Appendix for details). The question then is about the correlation across the tasks.^3

The top three rows of Figure 8 show the estimated correlations from sample effects, Spearman's correction, and the hierarchical model. Given the previous simulation results, it is hard to know how much credence to give these estimated correlations. In particular, it is hard to know how to interpret the negative correlation between the arrow flanker and color Stroop tasks.

To better understand what may be concluded about the range of correlations, we plot the posterior distributions of the correlations (Figure 9A). These distributions are unsettling. The variation in most of these posteriors is so wide that firm conclusions are not possible. The exception is the null correlation between the number and color Stroop tasks, which seems to be somewhat well localized. The surprisingly negative correlation between the color Stroop and arrow flanker tasks comes from a posterior so broad that the 95% credible interval is [-0.27, 0.39]. Here, all we can say is that very extreme correlations are not feasible. We suspect this limited result is not news.

^3 One of the elements that makes the analysis complicated is how to exclude low-performing participants. In the previous analysis, where each task was analyzed in isolation, we retained all participants in a task who performed with over 90% accuracy on that task. In the current analysis, however, we must have the same participants for all four tasks. We decided to retain those participants who had over 90% accuracy on all four tasks. With this strict criterion, we retained only 180 of the original 289 participants. The most noticeable effect of this exclusion is that the reliability for the arrow flanker task was reduced from .87 to .56. The fact that the reliability changes so much indicates that the high reliability was driven by a few participants with very large difference scores. This cutoff differs from Rey-Mermet et al. (2018), who used a 75% accuracy criterion and thus included many more participants.

Analysis of Rey-Mermet et al. (2018) provides an opportunity to examine how

hierarchical models account for variation across trials as well as variation across people. Figure 9B shows sample effects across individuals for the color Stroop and arrow flanker tasks, the two tasks that were most negatively correlated. There is a far greater degree of variation in individuals' effects for the color Stroop task than for the arrow flanker task. The model estimates (Figure 9C) reflect this difference in variation. The variation in the arrow flanker task is so small that it can be accounted for by trial variation alone. As a result, the hierarchical model shows almost no individual variability. In contrast, the variability in the color Stroop task is large, and the main contributor is true variation across individuals rather than trial variation. Hence, there is relatively little shrinkage in the model estimates. The lack of variation in the arrow flanker task gives rise to the uncertainty in the recovered correlation between the two tasks.
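The shrinkage pattern in Figure 9C can be mimicked with a simple empirical-Bayes style sketch. This illustrates the principle only, not the paper's full hierarchical model: observed effects are pulled toward their mean by a weight comparing estimated true variance with trial-noise variance.

```python
import numpy as np

def shrink_effects(d, trial_var, L):
    """Shrink observed difference scores d toward their mean. trial_var is
    the within-cell trial variance sigma^2; each condition mean uses L trials."""
    noise_var = 2.0 * trial_var / L            # noise variance of a difference score
    signal_var = max(d.var(ddof=1) - noise_var, 0.0)
    w = signal_var / (signal_var + noise_var)  # shrinkage weight in [0, 1)
    return d.mean() + w * (d - d.mean())
```

When the observed spread barely exceeds what trial noise alone predicts (as for the arrow flanker), the weight is near zero and estimates collapse toward the mean; when true variation dominates (as for the color Stroop), the weight is near one and there is little shrinkage.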

General Discussion

A basic question facing researchers in cognitive control is whether inhibition is a unified phenomenon or a disparate set of phenomena. A natural way of addressing this question is to study the pattern of individual differences across several inhibition tasks. In this paper, we have explored whether correlations across inhibition tasks may be recovered. We consider typically large studies that enroll hundreds of participants. The answer is negative—correlations are difficult to recover with the accuracy that would allow for a definitive answer to this basic question. This statement of poor recovery holds for hierarchical models that are extended to the trial level.

Why this depressing state of affairs occurs is fairly straightforward. Relative to trial noise, there is little true individual variation in inhibition tasks. To see why this is so, consider an average effect, say one that is 50 ms. In inhibition tasks like Stroop and flanker, we can safely make a dominance assumption: nobody truly has a negative effect (Haaf & Rouder, 2017). That is to say, nobody truly identifies incongruent stimuli faster than