



















Assessing test-retest reliability of psychological measures: persistent methodological problems

Abstract

Psychological research and clinical practice rely heavily on psychometric testing for measuring psychological constructs that represent symptoms of psychopathology, individual difference characteristics, or cognitive profiles. Test-retest reliability assessment is crucial in the development of psychometric tools, helping to ensure that measurement variation is due to replicable differences between people regardless of time, target behaviour, or user profile. While psychological studies testing the reliability of measurement tools are pervasive in the literature, many still discuss and assess this form of reliability inappropriately with regard to the specified aims of the study or the intended use of the tool. The current paper outlines important factors to consider in test-retest reliability analyses, common errors, and some initial methods for conducting and reporting reliability analyses to avoid such errors. The paper aims to highlight a persistently problematic area in psychological assessment, to illustrate the real-world impact that these problems can have on measurement validity, and to offer relatively simple methods for improving the validity and practical use of reliability statistics.

Key Words: Reliability Analysis, Test-Retest Reliability, Psychometric Testing, Measurement Reliability, Limits of Agreement
Assessing test-retest reliability of psychological measures: persistent methodological problems

Psychometrics is defined by Rust and Golombok (2009) as the science of psychological assessment. Psychological measures assess latent factors such as personality, emotional state, or cognition, via a set of observed variables, and the science of psychometrics is concerned with the quality, validity, reliability, standardization, and removal of bias in such measurement tools (Rust & Golombok, 2009). The vast body of psychological literature utilizing this method of measurement is testament to its value and popularity. However, a great deal of work is required to design and evaluate a new measurement tool, to try to ensure that it measures what it intends to, and does so each time it is used. Just as we want to be sure that physical or mechanical tools are giving us the right information every time we use them, we should be equally concerned that the measuring tools we rely upon in research and clinical practice are accurate and dependable. Therefore, consideration of validity and reliability is essential in the development of any new psychological measuring tool.

Validation of psychometric tools ensures that measurements are accurate and meaningful for their target population. Generally, assessments of validity have been well conducted in published psychological research. For instance, multidisciplinary input has long been reported in the development of items and tools (e.g., Bennett & Robinson, 2000; Meyer, Miller, Metzger, & Borkovec, 1990; Steptoe, Pollard, & Wardle, 1995), iterative approaches are usually taken to the refinement of item inclusion, and typically, assessments of both content and performance validities (e.g., construct, criterion-related) are reported (e.g., Garner, Olmstead, & Polivy, 1983; Goodman, 1997, 2001; Pliner & Hobden, 1992).

In contrast, appropriate methods for assessing the reliability of new psychometric measuring tools across time, context, and user (i.e., test-retest reliability) have been more scarcely reported in the psychological literature. This is despite the relatively large number of
Whilst there are many different meanings ascribed to the term ‘reliability’ across scientific disciplines, ‘test-retest’ reliability refers to the systematic examination of consistency, reproducibility, and agreement among two or more measurements of the same individual, using the same tool, under the same conditions (i.e., when we don’t expect the individual being measured to have changed on the given outcome). Test-retest studies help us to understand how dependable our measurement tools are likely to be if they are put into wider use in research and/or clinical practice. When a measurement tool is used on a single occasion, we want to know that it will provide an accurate representation of the patient or participant so that the outcome may be used for practical purposes (e.g., diagnostics, differentiation of individuals or groups). When a measurement tool is used on multiple occasions (e.g., to compare baseline and follow-up) we want to know that the tool will give accurate results on all occasions, so that observed changes in outcome can be attributed to genuine change in the individual, rather than instability in the measurement tool; this is particularly relevant when assessing the efficacy of treatments and interventions. Finally, when a measurement tool is used to assess different groups (e.g., patients receiving different treatments, different characteristics), we want to know that the tool is accurately measuring all individuals so that any group differences may be considered genuine and not an artifact of measurement. Although demonstrating validity is the key to knowing that the right thing is being assessed with any given tool, assessing validity is only truly possible once it has been established that a tool is measuring something in the same way each time it is used.

In the context of test reliability studies, there are two approaches to understanding the comparability/reliability of test scores – we’ll refer to them in this paper as ‘relative consistency’ and ‘agreement’ – that hold very different definitions of what it means for measurements to be ‘reliable’. Relative consistency, also termed ‘rank-order stability’ (Chmielewski & Watson, 2009), means that the relative position or rank of an individual within a sample is consistent across raters/times, but systematic differences in the raw scores given to individuals by different raters or at different times are unimportant. For example, one assessor may score the first three people in a sample as 100, 105, and 107 for IQ, and the second may score the same three people, at the same time, as 105, 110, and 112. Even though the raw scores given by the two raters are not the same, the difference in rating is consistent across all three participants and they maintain the same rank relative to one another; therefore, the IQ measure would be considered to have relative reliability across raters. In contrast, agreement is concerned with the extent to which the raw observed scores obtained by the measurement tool match (or agree) between raters or time-points, when measuring the same individual in the absence of any actual change in the outcome being measured.

If the relative ordering of individuals within a given sample is of greater importance or use than the observed differences between individuals (e.g., finishing position in a race) then assessing the relative consistency between measurements may be suitable. However, this is not typically the case when assessing the test-retest reliability of standardized measuring tools such as psychometric questionnaires. In this case, the aim is to try and make objective measurements that are unaffected by the time or place of measurement, or by attributes of the individual making the measurement. Once the tool is applied in practice, we want to be confident that any given measurement is accurate, and that any differences in outcome observed within a study or clinical practice are due to real changes in an individual, or genuine differences between individuals/groups. Therefore, the purpose of reliability studies in these contexts is to determine the extent to which repeated measurements agree (i.e., are the same), over time, rater, or context (i.e., test-retest), when used to assess the same unchanged individual. In such a case, it is necessary to assess absolute differences in scores, since these provide a direct measure of score stability at an individual level. Aside from the mere presence/absence of stability, absolute score differences also permit the assessment of
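As a concrete illustration of the distinction, the short sketch below (Python, using invented scores that extend the IQ example above; these are not data from the paper) shows two raters whose rankings of a sample are identical, and therefore perfectly ‘relatively consistent’, even though their raw scores never agree.

```python
# Minimal sketch (invented scores, not study data): relative consistency
# versus agreement for two raters who differ by a constant 5 points.
import numpy as np
from scipy import stats

rater_a = np.array([100, 105, 107, 112, 118, 95])
rater_b = rater_a + 5            # second rater scores everyone 5 points higher

# Relative consistency: the rank ordering is identical, so rank-based
# statistics are perfect.
rho, _ = stats.spearmanr(rater_a, rater_b)
print(f"Spearman rho = {rho:.2f}")                         # 1.00

# Agreement: the raw scores never match; every pair differs by 5 points.
diffs = rater_b - rater_a
print(f"Within-pair differences: {diffs}")                 # all 5
print(f"Exact agreement rate: {(diffs == 0).mean():.0%}")  # 0%
```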
Correlation

Correlation is not agreement. A common misconception is that high correlation between two measurements equates to agreement between them. In reality, quite incongruent paired measurements can produce strong and highly statistically significant correlation coefficients, despite the observed agreement between these measurements being very poor. Parametric correlation coefficients (Pearson’s product moment correlations), which are frequently presented in reliability studies, use a coefficient between -1 and 1 to quantify how consistently one variable increases (or decreases) as another variable increases, according to how close points lie to any straight line. This can be seen by plotting the measurements against one another and adding a line of best fit. In contrast, agreement in scores means that the two or more results produced for each individual are the same. To illustrate agreement on a scatter plot, the points must lie not on any straight line, but on the line of equality specifically, where the intercept is 0 and the slope of the line is 1 (Streiner et al., 2014). The difference between correlation and agreement is demonstrated by the data in Table 1, taken from a laboratory study of adult food preference. These data show the ratings given by participants when presented with the same food on two occasions. Despite the relative stability of food preferences in adulthood, we see that, even relative to the measurement scale, there are large differences (range 15.7 to 226.7) between ratings given on the two occasions.

Table 1

The scatterplot in Figure 1 illustrates just how far away paired ratings are from agreement, since very few points lie on or close to the dashed line of equality. Despite this clear disparity in ratings, highlighted in both the plot and the absolute score differences, the Pearson’s correlation coefficient for these data is 0.93 (p < 0.001), which would undoubtedly be reported as a very strong association.

Figure 1

Correlation conceals systematic bias. Correlation coefficients are standardized statistics that always fall between -1 (perfect negative association) and 1 (perfect positive association). The units and magnitude of the variables being compared are irrelevant in the calculation of the coefficient, and coefficients are not sensitive to mean differences or changes in scores; as such, coefficients will mask systematic biases (the amount that one measurement differs from another) between measurements/measurers. What this means for test-retest reliability is that even very large differences between test and retest values, which may represent significant intra-rater instability or inter-rater difference, will not be detected by correlation analysis if the differences are consistent across a sample. In practice, this means that critical factors affecting measurement reliability, such as order effects (practice, boredom, fatigue, etc.) and user interpretation, may never be identified. The values in Table 2 can be used as an example here; this table presents scores given by two teachers double marking a computer-based exam task and the differences between the scores for each of the 14 students. The table also presents a third set of transformed scores used to exemplify a large difference (bias) in marking. If the way that pairs of measurements increase and decrease relative to one another is constant, the correlation coefficients between measurements will be exactly the same whether no bias, a small bias (e.g., around 1.5 points on average), or a very large bias (e.g., around 46 points on average) is present. Whilst we are unlikely to see repeated measures differing by such a margin as teachers A and C in real-life data, this more extreme example is used to illustrate an important point. In real-world contexts, systematic bias can occur if a measurement tool is open to interpretation by the specific user, or where
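To make the masking of bias tangible, the sketch below simulates double-marked exam scores in the spirit of the Table 2 example (the marks are invented, not the actual Table 2 data): adding either a small or a large average bias to one marker leaves Pearson’s r essentially unchanged, while the paired differences expose the bias immediately.

```python
# Sketch with simulated exam marks (not the Table 2 data): correlation is
# blind to systematic bias between markers, but paired differences are not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
teacher_a = rng.normal(60, 10, size=14).round(1)
teacher_b = teacher_a + rng.normal(1.5, 1.0, size=14)   # small bias (~1.5 points)
teacher_c = teacher_a + rng.normal(46.0, 1.0, size=14)  # large bias (~46 points)

for name, scores in [("B (small bias)", teacher_b), ("C (large bias)", teacher_c)]:
    r, _ = stats.pearsonr(teacher_a, scores)
    mean_diff = np.mean(scores - teacher_a)
    print(f"A vs {name}: r = {r:.2f}, mean difference = {mean_diff:.1f}")
# Both correlations come out around 0.99, even though teacher C marks roughly
# 46 points higher than teacher A on average.
```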
to compare reliability of growth measurements from a sample of children aged 3-5 years with a sample of children aged 3-10 years, the latter group would be far more variable than the former, so a larger correlation coefficient would be produced for the 3-10 year olds even if agreement in absolute growth measures was the same for both samples. This would also be relevant when comparing reliability estimates from clinical and non-clinical populations, where variation in psychological outcome measures may be highly disparate between groups. As such, the researcher may find that the tool appears more reliable in the non-clinical group than the clinical group (or vice versa), when in fact the absolute differences in scores in each group are comparable. It is important to note that this specific issue for test-retest analysis does not arise as a result of narrow or incomparable samples (though these have their own inherent issues if they are unrepresentative), but as a direct result of the use of a relative method (i.e., correlation) to estimate reliability; therefore, it can be overcome by examining absolute differences in scores.

The above issues surrounding correlation analysis also apply to regression analyses when used to assess agreement, since simple regression of one measurement onto another is also based upon association. These problems are particularly hazardous when data are not plotted and examined visually, and reliability is endorsed based on statistical output alone. An expectation of high agreement between measures may also lead to less rigorous consideration of raw data and statistical results.

Statistical tests of difference

Reliance on traditional statistical testing and p-values can be a hindrance to reliability analysis; “performing a test of significance for a reliability coefficient is tantamount to committing a type III error – getting the right answer to a question no one is asking” (Streiner, 2007; Streiner et al., 2014). While there are ongoing debates around the use and/or over-reliance on p-values in research generally, the specific issue in this context is that the null hypotheses against which many such statistical tests are compared are relatively meaningless when assessing reliability. Perhaps the greatest issue relevant to test-retest reliability analysis is the use of hypothesis-driven tests of difference, such as the paired t-test. The common fallacy is that, if a test finds no significant difference between measurements, then the measurements agree, but this is not the case (Altman & Bland, 1995). Finding a difference to be ‘significant’ simply means that systematic variability between the measurements (i.e., between raters, conditions, or time-points) outweighs the variability within measurements (i.e., between the individuals in the sample). Therefore, even large differences between repeated measurements, which indicate very poor agreement, can be statistically non-significant if the sample being tested is heterogeneous. The inverse is also true; very similar test-retest scores, which should be seen as demonstrating high reliability, may show a statistically significant difference in a homogeneous sample.

A related error in reliability analyses is the belief that the average (mean) difference between two or more conditions is adequate to quantify agreement between individual pairs of scores. This error is demonstrated by the data in Table 3, which presents another example of laboratory food (pizza) preference ratings (0-20 scale) from 34 participants assessed on two occasions. Table 3 also includes the within-pair differences for scores, the mean score for each time-point, and the mean within-pair difference.

Table 3

Relative to the scale of measurement, the absolute differences between ratings are large and variable, ranging from -4.75 to 8.85; and yet, the average within-pair difference is only 0.71. This value suggests far greater similarity in the data than is actually the case. Calculating the mean difference in scores can mask notable disparity between paired
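The point can be checked directly from the summary statistics reported for these data (the mean within-pair difference of 0.71 quoted above, and the standard deviation of the differences, 3.29, quoted in the Limits of Agreement example later in the paper). The sketch below reconstructs the implied paired t-test; it uses only those published summaries, not the raw ratings.

```python
# Reconstructing the paired t-test implied by the Table 3 summary statistics
# (n = 34, mean within-pair difference 0.71, SD of differences 3.29).
import math
from scipy import stats

n, mean_diff, sd_diff = 34, 0.71, 3.29
se = sd_diff / math.sqrt(n)              # standard error of the mean difference
t = mean_diff / se                       # ~1.26
p = 2 * stats.t.sf(abs(t), df=n - 1)     # ~0.22, two-sided

print(f"t({n - 1}) = {t:.2f}, p = {p:.2f}")
# Despite individual differences ranging from about -4.8 to +8.9 on a 0-20
# scale, the paired t-test finds no 'significant' difference - a reminder that
# a non-significant test of difference is not evidence of agreement.
```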
similar/different the observed scores obtained from a tool are. In contrast, suitable methods for analysing test-retest reliability examine the difference(s) between measurements for each case in the sample at an individual level, and assess whether or not the absolute differences between scores obtained by the tool fall within an acceptable range according to the tool’s specific clinical, scientific, or practical field of use. Unlike relative consistency, this relies on having an agreed or directly observable unit of measurement for the outcome score. A specific cut-off value (size of difference) up to which measurements may be considered to agree should be identified and justified by the researcher before viewing the data, to avoid biasing the reliability analyses. Establishing reliability in this way facilitates more in-depth examination of the data (e.g., the size and consistency of differences across a sample) and hence more thorough evaluation of reliability. It also permits the creation and validation of reference values and cut-off scores, for diagnosis and classification and for understanding a single outcome score for an individual; something which is precluded in relative measures of reliability since systematic scoring differences are permissible.

Select suitable methods

Limits of Agreement. Bland-Altman Limits of Agreement (LOA) (Bland & Altman, 1986) is a statistical method typically used to assess agreement between two repeated numeric measurements (i.e., test-retest scores, or comparison of methods). LOA are based on descriptive statistics for paired data and are typically accompanied by a plot of the data to aid data checking and interpretation. The limits themselves represent the upper and lower boundaries of the middle 95% range of the observed data (within-pair differences), constructed around the mean within-pair difference as mean ± 1.96(SD). For improved interpretation and inference beyond the sample, confidence intervals are also constructed around the upper and lower LOA. Confidence intervals around the LOA will be wider than those around the mean by a factor of 1.71, when samples are not small (Bland & Altman, 1999). Assuming normality of the data, this gives a range of values in which we are 95% confident the population limit should lie. The ‘population’ is a hypothetical scenario in which all possible measurement differences could be measured, but it provides a practical indication of the variability/precision of measurements that we might expect to see if the tool were implemented widely (e.g., a new clinical or research assessment tool).

An associated Bland-Altman plot sees the average of the two paired measurements plotted on the x axis, against the difference between the two measurements on the y axis. The plot is used to examine the data distribution and to screen for outliers and heteroscedasticity (where the level of agreement between measurements depends on, or is proportionate to, the size of the measurement). When constructing the LOA plot, a horizontal reference line is first added to show the mean within-pair difference; we hope to see this line sitting as close to zero as possible. Points spread evenly either side of zero show random error variability in which measurements do not differ on average. Points lying around any other positive or negative value would indicate systematic bias in the measurements, and if the amount that points vary from the mean line differs across the range of measurements, this suggests that the data are heteroscedastic. Heteroscedasticity may be dealt with via data transformation to permit statistically valid calculation of LOA (Bland & Altman, 1999). However, it would be essential to try to determine the source of heterogeneity, and to discuss the implications of this data pattern for reliability and wider application of the measurement tool.

If we use the food preference ratings presented in Table 3 as an example, we saw previously that the mean within-pair difference for these data was 0.71. Using the standard deviation (3.29) and sample size (n = 34) we can calculate the standard error (0.56) and a 95% confidence interval for the mean (-0.39, 1.81). The LOA, which represent an interval containing 95% of the observed differences, can be calculated as -5.73 (95% CI -7.64, -3.81) to 7.16 (95% CI 5.24, 9.07); confidence intervals for the LOA are based on a standard error
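For readers who want to reproduce the arithmetic, the sketch below recalculates the limits and their confidence intervals from the summary statistics quoted above (mean difference 0.71, SD 3.29, n = 34), using z = 1.96 and the large-sample standard error for the limits described by Bland and Altman (1999); small discrepancies from the reported values reflect rounding of the mean and SD.

```python
# Bland-Altman Limits of Agreement from summary statistics (Table 3 example:
# mean within-pair difference 0.71, SD of differences 3.29, n = 34).
import math

mean_d, sd_d, n = 0.71, 3.29, 34

se_mean = sd_d / math.sqrt(n)            # SE of the mean difference (~0.56)
ci_mean = (mean_d - 1.96 * se_mean, mean_d + 1.96 * se_mean)

lower_loa = mean_d - 1.96 * sd_d         # ~ -5.74
upper_loa = mean_d + 1.96 * sd_d         # ~  7.16

se_loa = math.sqrt(3 * sd_d**2 / n)      # ~1.71 x se_mean (Bland & Altman, 1999)
ci_lower = (lower_loa - 1.96 * se_loa, lower_loa + 1.96 * se_loa)
ci_upper = (upper_loa - 1.96 * se_loa, upper_loa + 1.96 * se_loa)

print(f"Mean difference = {mean_d:.2f}, 95% CI ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
print(f"LOA: {lower_loa:.2f} (95% CI {ci_lower[0]:.2f}, {ci_lower[1]:.2f}) "
      f"to {upper_loa:.2f} (95% CI {ci_upper[0]:.2f}, {ci_upper[1]:.2f})")
```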
confidence interval tells us that we can be 95% confident that measurement differences should not exceed 9.07 in the wider population of all measurements.

The disparity between the sample and confidence interval indexes of difference or agreement presented above (7.16 vs. 9.07) illustrates how confidence intervals can alter our conclusions about reliability beyond what is observed in the data, and highlights why it is so important to quantify precision for all estimates. For example, if researchers working with the data had chosen 10 as the maximum difference permitted for this tool to show agreement, we would be confident that our tool was reliable; the chosen cut-off exceeds both the sample and population limits. If instead the cut-off had been 5, we would be quite confident in concluding that our tool was not reliable, since differences observed in the sample and inferred for the population exceed this margin. The most difficult scenario is when the cut-off lies between the two indexes. For example, if the cut-off had been 8, we would have to discuss the implications of our uncertainty around reliability. The observed data do not exceed this value, but reliability is not confidently supported in the context of the wider population. The only way to minimize differences between sample and population estimates is to study large samples, thus reducing the width of confidence intervals around the LOA.

Intraclass Correlation. While there remain frequent problems with reliability analyses in psychology, the use of Intraclass Correlation Coefficients (ICC) (Shrout & Fleiss, 1979) has been seen in psychological literature for some time (e.g., Angold & Costello, 1995; Egger et al., 2006; Grant et al., 2003; Kernot, Olds, Lewis, & Maher, 2015; March, Sullivan, & Parker, 1999; Silverman, Saavedra, & Pina, 2001). Unlike Pearson’s (interclass) correlation, ICC is an acceptable measure of reliability between two or more measurements on the same individual/case. Despite the name and the presence of a coefficient to quantify reliability, ICC is actually based on a ratio of rater, participant, and error sources of measurement variability (derived from ANOVA models). This does mean that ICC coefficients are, like other inferential tests, influenced by sample homogeneity; when variability between measurements is constant, the more alike the sample is, the lower the ICC will be (Bland & Altman, 1990; Lee et al., 2012). Therefore, ICC coefficients derived from samples whose outcome variances differ, such as non-clinical and clinical samples, should not be compared directly. For example, if a depression measure was used in a non-clinical sample we would expect a modest range of scores with many cases scoring close to zero, but this same tool applied to a sample of depressed individuals would likely produce a much greater range of scores. In this case, the clinical sample would obtain a higher ICC coefficient than the more homogeneous non-clinical sample, in the absence of any difference in the tool’s reliability. This factor does not discredit ICC as a method of reliability analysis, but highlights the importance of evaluating reliability using a representative sample drawn from a relevant population (i.e., in which the tool will be used) (Bland & Altman, 1990). It also emphasizes the need to consider sample variance when interpreting ICC coefficients and differences in reliability observed between samples and populations.

ICC coefficients quantify the extent to which multiple ratings for each individual (within-individual) are statistically similar enough to discriminate between individuals, and should be accompanied by a confidence interval to indicate the precision of the reliability estimate. Most statistical software will also present a p-value for the ICC coefficient. This p-value is obtained by testing sample data against the null hypothesis that measurements within-person are no more alike than between-people (i.e., that there is no reliability). In contrast, reliability studies aim to answer the functional question ‘are the repeated measurements made using a tool similar enough to be considered reliable?’. As such, the p-value provided is, in most cases, of little practical use or relevance.

Though many authors report simply that ‘ICC was used’, there are in fact six different ICC types to suit different theoretical and methodological study designs (Atkinson & Nevill,
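For illustration, the sketch below implements the Shrout and Fleiss (1979) ICC(2,1) and ICC(3,1) formulas directly from a two-way ANOVA decomposition; the marks are invented (they are not the Table 2 data), with one rater showing a small bias and another a large one, so the contrast between the agreement and consistency forms is visible.

```python
# From-scratch sketch of Shrout & Fleiss (1979) ICC(2,1) (absolute agreement)
# and ICC(3,1) (consistency), computed from two-way ANOVA mean squares.
# The marks below are invented for illustration, not the Table 2 data.
import numpy as np

def icc_2_and_3(scores):
    """scores: (n subjects x k raters) array of ratings."""
    n, k = scores.shape
    grand = scores.mean()
    subj_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    msr = k * np.sum((subj_means - grand) ** 2) / (n - 1)    # subjects (rows)
    msc = n * np.sum((rater_means - grand) ** 2) / (k - 1)   # raters (columns)
    resid = scores - subj_means[:, None] - rater_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))           # residual error

    icc2 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # agreement
    icc3 = (msr - mse) / (msr + (k - 1) * mse)                        # consistency
    return icc2, icc3

rng = np.random.default_rng(0)
base = rng.uniform(40, 70, size=14)                       # 'true' student ability
marks = np.column_stack([base, base + 1.5, base + 40]) + rng.normal(0, 1, (14, 3))

icc2, icc3 = icc_2_and_3(marks)
print(f"ICC(2,1) agreement   = {icc2:.2f}")  # low: the large bias counts against reliability
print(f"ICC(3,1) consistency = {icc3:.2f}")  # much higher: systematic rater differences ignored
```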
teacher A, while teacher C scored on average 46 points higher than A. We saw previously that correlation fails to recognize systematic bias, and as such the correlations for A with B and A with C were both 0.99 and highly statistically significant. If we now assess these data with ICC type 2 (assuming a sample of teachers were used to assess the random sample of 14 students) to look at agreement, we find that good reliability is demonstrated for teachers A and B who marked similarly (ICC (single measures) = 0.97), and appropriately, very poor reliability is shown for teachers A and C who marked differently (ICC (single measures) = 0.16). When all three teachers are added into the ICC model we see a negligible increase to 0.18, suitably reflecting poor reliability across all three raters. As expected, when ICC type 3 is run, which treats raters as fixed and allows for systematic variability between raters, the result is a considerably higher ICC coefficient of 0.49, which would be higher still if the bias between raters, however large, was consistent. This again highlights the deficiency of assessing consistency rather than agreement for test-retest types of reliability.

An ICC coefficient can also be accompanied by an ICC plot, which sees the sample cases plotted on the x axis, outcome scores on the y axis, and different point characters used for each rater/rating. ICC plots illustrate the size and nature of observed differences between raters/ratings, and the clustering of scores within person relative to variability across the sample, which aid the practical interpretation of statistical results. For example, Figure 3 presents the exam marking data from Table 2 for teachers A and B; from this plot we see that teacher A scores consistently lower than teacher B, indicating a small bias, but in most cases the marks are similar. In contrast, Figure 4 presents the Table 2 data for all three teachers together. This plot clearly shows that teacher C marks much higher than teachers A and B, representing a large positive bias, and hence poor reliability.

Figure 3

Figure 4

Improved Reporting

In any study, the aims of the research and the methods used to meet those aims should be clearly outlined; it is insufficient to present vague conclusions verified only by statistical output (e.g., ‘good reliability was shown, r(100) = 0.85, p = 0.006’). The purpose of test-retest reliability studies is to provide evidence that a tool will measure the same thing, in the same way, each time it is used (validity assessment then tells us if it is measuring the right thing). If the methods used to evidence this reliability are not sufficiently explained to validate their use, or the evidence is not presented in the context of a wider population (i.e., no confidence intervals), then the evidence is compromised, or absent altogether. Statements such as ‘ICC was used to assess reliability’ are common, despite important differences between ICC models and the implications of their selection. Such reporting provides no evidence of mindful selection of methods, and may lead the reader to infer that software default settings were used, which vary between packages and may not be appropriate. For example, the default ICC type in IBM SPSS Statistics (version 22) is a two-way mixed effects model for consistency (type 3 ICC). This model is liable to give the highest ICC coefficient of all three main types, but is only appropriate to use when a fixed group of raters is used, and consistent differences between those raters are unimportant. This is contrary to test-retest studies that aim to examine agreement between measurements. It should also be clearly specified and justified when an average of raters is used rather than assessing across individual raters, since this will always inflate the resulting reliability coefficient.

Problems regarding the justification of analytical choices in ICC also extend to correlations and inferential tests of difference. Often, the application of these tests is stated, but neither a rationale for their selection, nor an explanation of how the results demonstrate
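Where software is used, the reporting can also be made explicit in the analysis code itself. The sketch below is a hedged example assuming the third-party Python package pingouin (whose intraclass_corr function returns all six Shrout and Fleiss forms); the data file and column names are invented, and the exact output format should be checked against the package documentation.

```python
# Hedged sketch: selecting and reporting a specific ICC form rather than
# relying on software defaults. Assumes the third-party 'pingouin' package;
# the CSV file and the column names ('student', 'teacher', 'mark') are invented.
import pandas as pd
import pingouin as pg

# Long format: one row per (student, teacher) observation.
df = pd.read_csv("exam_marks_long.csv")   # hypothetical data file

icc = pg.intraclass_corr(data=df, targets="student",
                         raters="teacher", ratings="mark")

# Report the form that matches the design and question, e.g. ICC(2,1):
# two-way random effects, absolute agreement, single rater/measurement -
# rather than the default consistency-based, averaged, or fixed-rater forms.
print(icc.loc[icc["Type"] == "ICC2", ["Description", "ICC", "CI95%"]])
```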