When considering our self-esteem measure, it is important to understand the distinct ways classical test theory (CTT) and item response theory (IRT) approach measurement. Although both aim to assess constructs such as self-esteem accurately, they differ in how they conceptualize scores, reliability, and item performance. From a CTT perspective, the focus is primarily on the total score obtained from a test. CTT assumes that each observed score is the sum of a true score and an error component (Cook, 2013; Kline, 2005). For example, if we administered a 20-item self-esteem survey, CTT would emphasize the total score and assess reliability using methods such as Cronbach's alpha; it would not closely evaluate how individual items function across different levels of self-esteem. In CTT, item difficulty and discrimination are also sample-dependent, meaning results may change depending on the group tested (Putnick & Bornstein, 2016).
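To make the CTT logic concrete, here is a minimal sketch that simulates responses under the observed = true + error assumption and computes Cronbach's alpha from the item variances and the total-score variance. The sample size, item count, and simulated data are purely illustrative, not drawn from any published analysis.

```python
import numpy as np

def cronbachs_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) response matrix."""
    k = responses.shape[1]                          # number of items
    item_vars = responses.var(axis=0, ddof=1)       # per-item sample variances
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustration only: simulate a 20-item, 4-point self-esteem survey
rng = np.random.default_rng(0)
true_score = rng.normal(size=500)            # latent "true" self-esteem
error = rng.normal(size=(500, 20))           # item-level error components
raw = true_score[:, None] + error            # CTT: observed = true + error
items = np.clip(np.round(raw + 2.5), 1, 4)   # coerce onto a 1-4 rating scale
print(f"Cronbach's alpha = {cronbachs_alpha(items):.2f}")
```

Note that alpha summarizes the test with a single reliability coefficient, which is precisely the CTT emphasis on the total score rather than on individual items.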
In contrast, item response theory (IRT) offers a more detailed and dynamic view. IRT treats item responses probabilistically, examining how each question functions at different levels of the trait being measured, such as self-esteem (Edelen & Reeve, 2007). Rather than assuming one error term for the whole test, IRT acknowledges that measurement error varies across trait levels. For example, some items on a self-esteem scale may work well for individuals with moderate self-esteem but not for those with very high or very low self-esteem. Park and Park (2019) conducted a Rasch analysis, a form of IRT, on the Rosenberg Self-Esteem Scale with individuals with intellectual disabilities. Their findings showed that while all items demonstrated acceptable fit statistics, the difficulty level of the items varied significantly, highlighting that certain items may not adequately assess self-esteem across all ability levels. Notably, Park and Park (2019) found that the 4-point rating scale functioned well, suggesting that responses were captured effectively even in a specialized population. These results reinforce how IRT provides insights that CTT would not easily detect, such as whether an item is too easy or too difficult for a given population, and demonstrate how Rasch modeling helps ensure that items maintain consistent meaning across diverse groups.
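As a rough illustration of what "probabilistic" means here, the dichotomous Rasch model expresses the chance of endorsing an item as a logistic function of the gap between a person's trait level (theta) and the item's difficulty (b). The difficulties below are invented for illustration; they are not the values Park and Park (2019) reported.

```python
import numpy as np

def rasch_prob(theta: float, b: float) -> float:
    """Dichotomous Rasch model: P(endorse) = 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

easy_item, hard_item = -1.5, 1.5        # hypothetical item difficulties (logits)
for theta in np.linspace(-3, 3, 7):     # trait (self-esteem) levels in logits
    print(f"theta={theta:+.1f}  P(easy item)={rasch_prob(theta, easy_item):.2f}  "
          f"P(hard item)={rasch_prob(theta, hard_item):.2f}")
```

Running this shows the Park and Park (2019) point in miniature: an easy item is endorsed by nearly everyone above a low trait level, so it tells us little about people with moderate or high self-esteem, while a hard item discriminates only near the top of the range.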
Furthermore, CTT implies that the same test is equally reliable for all respondents, whereas IRT shows that reliability can vary across levels of self-esteem. With IRT, we can better tailor test items to match a participant's trait level, potentially through adaptive testing methods (Cook, 2013). Hence, CTT gives us a straightforward, broad understanding of overall test performance, while IRT allows us to examine the precision of individual items at different levels of the self-esteem trait. Both frameworks are valuable but offer different insights into how we interpret our self-esteem measurement.
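The varying-precision point can be sketched with the Rasch test information function: each item contributes information P(1 - P), which peaks where theta matches the item's difficulty, and the standard error of measurement is 1/sqrt(information). Again, the item difficulties below are hypothetical, chosen only to show the pattern.

```python
import numpy as np

def item_information(theta: np.ndarray, b: float) -> np.ndarray:
    """Rasch item information: I(theta) = P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

theta = np.linspace(-3, 3, 7)                 # trait levels in logits
difficulties = [-2.0, -0.5, 0.0, 0.5, 2.0]    # hypothetical item difficulties
test_info = sum(item_information(theta, b) for b in difficulties)
for t, info in zip(theta, test_info):
    print(f"theta={t:+.1f}  information={info:.2f}  SEM={1 / np.sqrt(info):.2f}")
```

Under these assumptions, precision is highest near the middle of the trait range and drops at the extremes, which is exactly the pattern that motivates adaptive testing: present each respondent with the items that are most informative at their estimated trait level.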
References
Cook, K. F. (2013, January 24). A conceptual introduction to item response theory: Part 1. The logic of IRT scoring [Video]. YouTube. https://www.youtube.com/watch?v=SrdbllMYq8M
Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research, 16(1), 5–18. https://doi.org/10.1007/s11136-007-9198-0
Kline, T. J. B. (2005). Psychological testing: A practical approach to design and evaluation. Sage Publications.
Park, J. Y., & Park, E. Y. (2019). The Rasch analysis of Rosenberg Self-Esteem Scale in individuals with intellectual disabilities. Frontiers in Psychology, 10(1), 1–10. https://doi.org/10.3389/fpsyg.2019.01992
Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41(1), 71–90. https://doi.org/10.1016/j.dr.2016.06.004