Item Response Theory

Terms modified by Item Response Theory

  • item response theory analysis
  • item response theory models

  • Selected Abstracts


    Interactive Graphics for Computer Adaptive Testing

    COMPUTER GRAPHICS FORUM, Issue 8 2009
    I. Cheng
    K.3.1 [Computing Milieux]: Computers and Education - Computer Uses in Education; I.3.8 [Computing Methodologies]: Computer Graphics - Applications. Abstract Interactive graphics are commonly used in games and have been shown to be successful in attracting the general audience. Instead of computer games, animations, cartoons, and videos being used only for entertainment, there is now an interest in using interactive graphics for 'innovative testing'. Rather than traditional pen-and-paper tests, audio, video and graphics are being conceived as alternative means for more effective testing in the future. In this paper, we review some examples of graphics item types for testing. As well, we outline how games can be used to interactively test concepts; discuss designing chemistry item types with interactive 3D graphics; suggest approaches for automatically adjusting difficulty level in interactive graphics-based questions; and propose strategies for giving partial marks for incorrect answers. We study how to test different cognitive skills, such as music, using multimedia interfaces; and also evaluate the effectiveness of our model. Methods for estimating the difficulty level of a mathematical item type using Item Response Theory (IRT) and of a molecule construction item type using Graph Edit Distance are discussed. Evaluation of the graphics item types through extensive testing on students is described. We also outline the application of using interactive graphics over cell phones. All of the graphics item types used in this paper were developed by members of our research group. [source]


    Revising the Cannabis Use Disorders Identification Test (CUDIT) by means of Item Response Theory

    INTERNATIONAL JOURNAL OF METHODS IN PSYCHIATRIC RESEARCH, Issue 3 2010
    Beatrice Annaheim
    Abstract Cannabis use among adolescents and young adults has become a major public health challenge. Several European countries are currently developing short screening instruments to identify 'problematic' forms of cannabis use in general population surveys. One such instrument is the Cannabis Use Disorders Identification Test (CUDIT), a 10-item questionnaire based on the Alcohol Use Disorders Identification Test. Previous research found that some CUDIT items did not perform well psychometrically. In the interests of improving the psychometric properties of the CUDIT, this study replaces the poorly performing items with new items that specifically address cannabis use. Analyses are based on a sub-sample of 558 recent cannabis users from a representative population sample of 5722 individuals (aged 13–32) who were surveyed in the 2007 Swiss Cannabis Monitoring Study. Four new items were added to the original CUDIT. Psychometric properties of all 14 items, as well as the dimensionality of the supplemented CUDIT, were then examined using Item Response Theory. Results indicate the unidimensionality of the CUDIT and an improvement in its psychometric performance when three original items (usual hours being stoned; injuries; guilt) are replaced by new ones (motives for using cannabis; missing out on leisure time activities; difficulties at work/school). However, improvements were limited to cannabis users with a high problem score. For epidemiological purposes, any further revision of the CUDIT should therefore include a greater number of 'easier' items. Copyright © 2010 John Wiley & Sons, Ltd. [source]


    Evaluation of a computer-adaptive test for the assessment of depression (D-CAT) in clinical application

    INTERNATIONAL JOURNAL OF METHODS IN PSYCHIATRIC RESEARCH, Issue 1 2009
    Herbert Fliege
    Abstract In the past, a German computerized adaptive test, based on Item Response Theory (IRT), was developed for assessing the construct of depression [Computer-adaptive test for depression (D-CAT)]. This study aims at testing the feasibility and validity of the real computer-adaptive application. The D-CAT, supplied by a bank of 64 items, was administered on personal digital assistants (PDAs) to 423 consecutive patients suffering from psychosomatic and other medical conditions (78 with depression). Items were adaptively administered until a predetermined reliability (r ≥ 0.90) was attained. For validation purposes, the Hospital Anxiety and Depression Scale (HADS), the Centre for Epidemiological Studies Depression (CES-D) scale, and the Beck Depression Inventory (BDI) were administered. Another sample of 114 patients was evaluated using standardized diagnostic interviews [Composite International Diagnostic Interview (CIDI)]. The D-CAT was quickly completed (mean 74 seconds), well accepted by the patients, and reliable after an average administration of only six items. In 95% of the cases, 10 items or fewer were needed for a reliable score estimate. Correlations between the D-CAT and the HADS, CES-D, and BDI ranged between r = 0.68 and r = 0.77. The D-CAT distinguished between diagnostic groups as well as established questionnaires do. The D-CAT proved an efficient, well accepted and reliable tool. Discriminative power was comparable to that of other depression measures, while the CAT is shorter and more precise. Item usage raises questions of balancing the item selection for content in the future. Copyright © 2009 John Wiley & Sons, Ltd. [source]
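
    The stopping rule described above (items are administered until a reliability of at least 0.90 is reached) corresponds in IRT terms to continuing until the standard error of the ability estimate drops to roughly sqrt(1 - 0.90) ≈ 0.32. The following is a minimal sketch of such an adaptive loop under a two-parameter logistic model; the item bank, parameter values, and function names are illustrative assumptions, not the actual D-CAT implementation.

    ```python
    import numpy as np

    def p_2pl(theta, a, b):
        """Probability of a positive response under the 2PL model."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    def eap_estimate(responses, a, b, grid=np.linspace(-4, 4, 81)):
        """EAP ability estimate and posterior SD given responses to items (a, b)."""
        prior = np.exp(-0.5 * grid ** 2)              # standard normal prior
        like = np.ones_like(grid)
        for u, ai, bi in zip(responses, a, b):
            p = p_2pl(grid, ai, bi)
            like *= p ** u * (1 - p) ** (1 - u)
        post = prior * like
        post /= post.sum()
        theta_hat = (grid * post).sum()
        se = np.sqrt(((grid - theta_hat) ** 2 * post).sum())
        return theta_hat, se

    def run_cat(bank_a, bank_b, answer, se_target=np.sqrt(1 - 0.90), max_items=10):
        """Adaptive loop: administer the most informative unused item, stop at SE <= target."""
        used, responses = [], []
        theta, se = 0.0, np.inf
        while len(used) < max_items and se > se_target:
            p = p_2pl(theta, bank_a, bank_b)
            info = bank_a ** 2 * p * (1 - p)          # 2PL item information at current theta
            info[used] = -np.inf                      # exclude items already administered
            j = int(np.argmax(info))
            used.append(j)
            responses.append(answer(j))               # caller supplies the examinee's response
            theta, se = eap_estimate(responses, bank_a[used], bank_b[used])
        return theta, se, used

    # Illustrative 64-item bank and a simulated examinee with true theta = 0.8
    rng = np.random.default_rng(1)
    a, b = rng.uniform(0.8, 2.0, 64), rng.normal(0.0, 1.0, 64)
    answer = lambda j: int(rng.random() < p_2pl(0.8, a[j], b[j]))
    print(run_cat(a, b, answer))
    ```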


    Estimation of Item Response Theory Parameters in the Presence of Missing Data

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 3 2008
    Holmes Finch
    Missing data are a common problem in a variety of measurement settings, including responses to items on both cognitive and affective assessments. Researchers have shown that such missing data may create problems in the estimation of item difficulty parameters in the Item Response Theory (IRT) context, particularly if they are ignored. At the same time, a number of data imputation methods have been developed outside of the IRT framework and been shown to be effective tools for dealing with missing data. The current study takes several of these methods that have been found to be useful in other contexts and investigates their performance with IRT data that contain missing values. Through a simulation study, it is shown that these methods exhibit varying degrees of effectiveness in terms of imputing data that in turn produce accurate sample estimates of item difficulty and discrimination parameters. [source]


    Comparing the Difficulty of Examination Subjects with Item Response Theory

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 2 2008
    Oksana B. Korobko
    Methods are presented for comparing grades obtained in a situation where students can choose between different subjects. It must be expected that the comparison between the grades is complicated by the interaction between the students' pattern and level of proficiency on one hand, and the choice of the subjects on the other hand. Three methods based on item response theory (IRT) for the estimation of proficiency measures that are comparable over students and subjects are discussed: a method based on a model with a unidimensional representation of proficiency, a method based on a model with a multidimensional representation of proficiency, and a method based on a multidimensional representation of proficiency where the stochastic nature of the choice of examination subjects is explicitly modeled. The methods are compared using the data from the Central Examinations in Secondary Education in the Netherlands. The results show that the unidimensional IRT model produces unrealistic results, which do not appear when using the two multidimensional IRT models. Further, it is shown that both the multidimensional models produce acceptable model fit. However, the model that explicitly takes the choice process into account produces the best model fit. [source]


    The Impact of Omitted Responses on the Accuracy of Ability Estimation in Item Response Theory

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 3 2001
    R. J. De Ayala
    Practitioners typically face situations in which examinees have not responded to all test items. This study investigated the effect on an examinee's ability estimate when an examinee is presented an item, has ample time to answer, but decides not to respond to the item. Three approaches to ability estimation (biweight estimation, expected a posteriori, and maximum likelihood estimation) were examined. A Monte Carlo study was performed and the effect of different levels of omissions on the simulee's ability estimates was determined. Results showed that the worst estimation occurred when omits were treated as incorrect. In contrast, substitution of 0.5 for omitted responses resulted in ability estimates that were almost as accurate as those using complete data. Implications for practitioners are discussed. [source]
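
    The two treatments compared above can be written directly into the likelihood: an omitted response is scored either as 0 (incorrect) or as 0.5, in which case it contributes half credit to both the correct and incorrect terms. The sketch below maximizes that log-likelihood over a grid under a two-parameter logistic model; the item parameters and response pattern are made-up illustrations, not the study's simulation design.

    ```python
    import numpy as np

    a = np.array([1.2, 0.9, 1.5, 1.1, 0.8])        # illustrative discriminations
    b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])       # illustrative difficulties
    resp = [1, 1, None, 0, None]                    # None marks an omitted item

    def mle_theta(resp, omit_score):
        """Grid-search ML estimate of theta with omits scored as omit_score (0.0 or 0.5)."""
        u = np.array([omit_score if r is None else float(r) for r in resp])
        grid = np.linspace(-4, 4, 801)
        loglik = np.zeros_like(grid)
        for k, t in enumerate(grid):
            p = 1.0 / (1.0 + np.exp(-a * (t - b)))
            loglik[k] = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
        return grid[np.argmax(loglik)]

    print("omits scored 0  :", round(mle_theta(resp, 0.0), 2))
    print("omits scored 0.5:", round(mle_theta(resp, 0.5), 2))
    ```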


    Testing Measurement Invariance Using Item Response Theory in Longitudinal Data: An Introduction

    CHILD DEVELOPMENT PERSPECTIVES, Issue 1 2010
    Roger E. Millsap
    Abstract Item response theory (IRT) consists of a set of mathematical models for the probabilities of various responses to test items as a function of item and person characteristics. In longitudinal data, changes in measured variables can only be interpreted if important psychometric features of the measured variables are assumed invariant across time. Measurement invariance is invariance in the relation of a measure to the latent variable underlying it. Measurement invariance in longitudinal studies concerns invariance over time, and IRT provides a useful approach to investigating longitudinal measurement invariance. Commonly used IRT models are described, along with the representation of measurement invariance in IRT. The use of IRT for investigating invariance is then described, along with practical considerations in using IRT for this purpose. Conceptual issues, rather than technical details, are emphasized throughout. [source]
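
    As a concrete instance of the models referred to above, the two-parameter logistic item response function gives the probability of endorsing item j at occasion t as a function of the latent trait. Longitudinal measurement invariance then amounts to the item parameters carrying no time subscript, so that change over time is expressed through the latent variable alone. This is one common formalization, not the article's own notation:

    ```latex
    % 2PL item response function at occasion t
    P(X_{jt} = 1 \mid \theta_t) = \frac{1}{1 + \exp\{-a_{jt}(\theta_t - b_{jt})\}}
    % Measurement invariance over time: a_{jt} = a_j and b_{jt} = b_j for all t,
    % so that observed change is carried by \theta_t alone.
    ```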


    Symptom features of postpartum depression: are they distinct?

    DEPRESSION AND ANXIETY, Issue 1 2008
    Ira H. Bernstein Ph.D.
    Abstract The clinical features of postpartum depression and depression occurring outside of the postpartum period have rarely been compared. The 16-item Quick Inventory of Depressive Symptomatology-Self-Report (QIDS-SR16) provides a means to assess core depressive symptoms. Item response theory and classical test theory analyses were conducted to examine differences between postpartum (n=95) and nonpostpartum (n=50) women using the QIDS-SR16. The two groups of females were matched on the basis of age. All met DSM-IV criteria for nonpsychotic major depressive disorder. Low energy level and restlessness/agitation were major characteristics of depression in both groups. The nonpostpartum group reported more sad mood, more suicidal ideation, and more reduced interest. In contrast, for postpartum depression sad mood was less prominent, while psychomotor symptoms (restlessness/agitation) and impaired concentration/decision-making were most prominent. These symptomatic differences between postpartum and other depressives suggest the need to include agitation/restlessness and impaired concentration/decision-making among screening questions for postpartum depression. Depression and Anxiety 0:1–7, 2006. Published 2006 Wiley-Liss, Inc. [source]


    Item response theory: applications of modern test theory in medical education

    MEDICAL EDUCATION, Issue 8 2003
    Steven M Downing
    Context: Item response theory (IRT) measurement models are discussed in the context of their potential usefulness in various medical education settings such as assessment of achievement and evaluation of clinical performance. Purpose: The purpose of this article is to compare and contrast IRT measurement with the more familiar classical measurement theory (CMT) and to explore the benefits of IRT applications in typical medical education settings. Summary: CMT, the more common measurement model used in medical education, is straightforward and intuitive. Its limitation is that it is sample-dependent, in that all statistics are confounded with the particular sample of examinees who completed the assessment. Examinee scores from IRT are independent of the particular sample of test questions or assessment stimuli. Also, item characteristics, such as item difficulty, are independent of the particular sample of examinees. The IRT characteristic of invariance permits easy equating of examination scores, which places scores on a constant measurement scale and permits the legitimate comparison of student ability change over time. Three common IRT models and their statistical assumptions are discussed. IRT applications in computer-adaptive testing and as a method useful for adjusting rater error in clinical performance assessments are overviewed. Conclusions: IRT measurement is a powerful tool used to solve a major problem of CMT, that is, the confounding of examinee ability with item characteristics. IRT measurement addresses important issues in medical education, such as eliminating rater error from performance assessments. [source]


    Psychopathy in adolescent female offenders: an item response theory analysis of the Psychopathy Checklist: Youth Version

    BEHAVIORAL SCIENCES & THE LAW, Issue 1 2006
    Crystal L. Schrum M.A.
    The present study examined the applicability of the PCL:YV items to a sample of detained adolescent girls. Item response theory (IRT) was used to analyze test and item functioning of the PCL:YV. Examination of IRT trace lines indicated that the items most discriminating of the underlying construct of psychopathy included "callousness and a lack of empathy", "conning and manipulation", and "a grandiose sense of self-worth". Results from the analyses also demonstrated that the items least discriminating in this sample, or least useful for identifying psychopathy, included "poor anger control", "shallow affect", or engaging in a "serious violation of conditional release". Consistent with previous research (Cooke & Michie, 1997; Hare, 2003), interpersonal and affective components of psychopathy provided more information than behavioral features. Moreover, although previous research has also found affective features to provide the most information in past studies, it was interpersonal features of psychopathy in this case, followed by affective features, that provided greater levels of information. Implications of these results are discussed. Copyright © 2006 John Wiley & Sons, Ltd. [source]
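
    The sense in which highly discriminating items "provide more information" has a precise IRT meaning: under a two-parameter logistic model, the information an item supplies at trait level theta is a^2 P(theta)(1 - P(theta)), so items with large discrimination dominate the test information curve. A small sketch with hypothetical parameters (not the PCL:YV calibration) makes the contrast visible:

    ```python
    import numpy as np

    def item_information(theta, a, b):
        """2PL item information: a^2 * P * (1 - P)."""
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return a ** 2 * p * (1 - p)

    theta = np.linspace(-3, 3, 7)
    # Hypothetical parameters: a strongly vs. a weakly discriminating item of equal difficulty
    strong = item_information(theta, a=2.0, b=1.0)
    weak = item_information(theta, a=0.7, b=1.0)
    for t, s, w in zip(theta, strong, weak):
        print(f"theta={t:+.1f}  I_strong={s:.3f}  I_weak={w:.3f}")
    ```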


    Diagnostic utility of the Quick Inventory of Depressive Symptomatology (QIDS-C16 and QIDS-SR16) in the elderly

    ACTA PSYCHIATRICA SCANDINAVICA, Issue 3 2010
    P. M. Doraiswamy
    Doraiswamy PM, Bernstein IH, Rush AJ, Kyutoku Y, Carmody TJ, Macleod L, Venkatraman S, Burks M, Stegman D, Witte B, Trivedi MH. Diagnostic utility of the Quick Inventory of Depressive Symptomatology (QIDS-C16 and QIDS-SR16) in the elderly. Objective: To evaluate psychometric properties and comparative ability of the Montgomery-Åsberg Depression Rating Scale (MADRS) vs. the Quick Inventory of Depressive Symptomatology, Clinician-rated (QIDS-C16) and Self-report (QIDS-SR16) scales to detect a current major depressive episode in the elderly. Method: Community and clinic subjects (age ≥ 60 years) were administered the Mini-International Neuropsychiatric Interview (MINI) for DSM-IV and the three depression scales in random order. Statistics included classical test and Samejima item response theories, factor analyses, and receiver operating characteristic methods. Results: In 229 elderly patients (mean age = 73 years, 39% male, 54% current depression), all three scales were unidimensional and had nearly equal Cronbach α reliability (0.85–0.89). Each scale discriminated persons with major depression from the non-depressed, but the QIDS-C16 was slightly more accurate. Conclusion: All three tests are valid for detecting geriatric major depression, with the QIDS-C16 being slightly better. The self-rated QIDS-SR16 is recommended as a screening tool as it is the least expensive and least time consuming. [source]
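
    The receiver operating characteristic analysis used above summarizes how well each scale's total score separates currently depressed from non-depressed subjects; the area under the curve equals the probability that a randomly chosen case scores higher than a randomly chosen non-case, and it can be computed directly from the two groups' scores. A minimal sketch with invented scores, not the study data:

    ```python
    from itertools import product

    def auc(case_scores, control_scores):
        """Mann-Whitney estimate of the area under the ROC curve."""
        pairs = list(product(case_scores, control_scores))
        wins = sum(1.0 if c > n else 0.5 if c == n else 0.0 for c, n in pairs)
        return wins / len(pairs)

    # Invented depression-scale totals for cases (major depression) and controls
    cases = [14, 11, 17, 9, 13, 16]
    controls = [4, 7, 3, 9, 5, 6]
    print(round(auc(cases, controls), 2))
    ```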


    Variable reporting and quantitative reviews: a comparison of three meta-analytical techniques

    ECOLOGY LETTERS, Issue 5 2003
    Marc J. Lajeunesse
    Abstract Variable reporting of results can influence quantitative reviews by limiting the number of studies for analysis, and thereby influencing both the type of analysis and the scope of the review. We performed a Monte Carlo simulation to determine statistical errors for three meta-analytical approaches and related how such errors were affected by numbers of constituent studies. Hedges' d and effect sizes based on item response theory (IRT) had similarly improved error rates with increasing numbers of studies when there was no true effect, but IRT was conservative when there was a true effect. Log response ratio had low precision for detecting null effects as a result of overestimation of effect sizes, but high ability to detect true effects, largely irrespective of number of studies. Traditional meta-analyses based on Hedges' d are preferred; however, quantitative reviews should use various methods in concert to improve representation and inferences from summaries of published data. [source]
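
    For reference, the two conventional effect-size metrics compared above have simple closed forms. Hedges' d is the standardized mean difference with a small-sample bias correction, and the log response ratio is the log of the ratio of group means; the forms below are standard textbook versions, not tied to this study's simulation settings:

    ```latex
    % Hedges' d: bias-corrected standardized mean difference
    d = J\,\frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad
    s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}, \qquad
    J \approx 1 - \frac{3}{4(n_1 + n_2 - 2) - 1}
    % Log response ratio
    \ln R = \ln\!\left(\bar{x}_1 / \bar{x}_2\right)
    ```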


    The performance of the Japanese version of the K6 and K10 in the World Mental Health Survey Japan

    INTERNATIONAL JOURNAL OF METHODS IN PSYCHIATRIC RESEARCH, Issue 3 2008
    Toshi A. Furukawa
    Abstract Two new screening scales for psychological distress, the K6 and K10, have been developed using item response theory and shown to outperform existing screeners in English. We developed their Japanese versions using the standard backtranslation method and included them in the World Mental Health Survey Japan (WMH-J), which is a psychiatric epidemiologic study conducted in seven communities across Japan with 2436 participants. The WMH-J used the WMH Survey Initiative version of the Composite International Diagnostic Interview (CIDI) to assess 30-day Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) disorders. Performance of the two screening scales in detecting DSM-IV mood and anxiety disorders, as assessed by the areas under receiver operating characteristic curves (AUCs), was excellent, with values as high as 0.94 (95% confidence interval = 0.88 to 0.99) for K6 and 0.94 (0.88 to 0.995) for K10. Stratum-specific likelihood ratios (SSLRs), which express screening test characteristics and can be used to produce individual-level predicted probabilities of being a case from screening scale scores and pretest probabilities in other samples, were strikingly similar between the Japanese and the original versions. The Japanese versions of the K6 and K10 thus demonstrated screening performances essentially equivalent to those of the original English versions. Copyright © 2008 John Wiley & Sons, Ltd. [source]
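
    The stratum-specific likelihood ratios mentioned above turn a screening score into an individual predicted probability through Bayes' rule on the odds scale: pretest odds times the SSLR for the examinee's score stratum gives the posttest odds. A small sketch of that calculation; the SSLR and pretest-probability values are illustrative, not those reported for the K6:

    ```python
    def posttest_probability(pretest_prob, sslr):
        """Posttest probability of being a case, given a pretest probability and the
        stratum-specific likelihood ratio for the observed score stratum."""
        pretest_odds = pretest_prob / (1.0 - pretest_prob)
        posttest_odds = pretest_odds * sslr
        return posttest_odds / (1.0 + posttest_odds)

    # Illustrative values: 5% 30-day prevalence, score stratum with an SSLR of 10
    print(round(posttest_probability(0.05, 10.0), 3))   # -> 0.345
    ```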


    Marginal maximum likelihood estimation of item response theory (IRT) equating coefficients for the common-examinee design

    JAPANESE PSYCHOLOGICAL RESEARCH, Issue 2 2001
    Haruhiko Ogasawara
    A method of estimating item response theory (IRT) equating coefficients by the common-examinee design with the assumption of the two-parameter logistic model is provided. The method uses the marginal maximum likelihood estimation, in which individual ability parameters in a common-examinee group are numerically integrated out. The abilities of the common examinees are assumed to follow a normal distribution but with an unknown mean and standard deviation on one of the two tests to be equated. The distribution parameters are jointly estimated with the equating coefficients. Further, the asymptotic standard errors of the estimates of the equating coefficients and the parameters for the ability distribution are given. Numerical examples are provided to show the accuracy of the method. [source]
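
    The equating coefficients estimated here are the slope and intercept of the linear transformation that places parameters expressed on one test's ability scale onto the other's. Under the two-parameter logistic model, a common parameterization (shown as a general sketch, not the paper's exact notation) is:

    ```latex
    % Linear scale transformation with equating coefficients A and B
    \theta^{*} = A\,\theta + B, \qquad a_j^{*} = a_j / A, \qquad b_j^{*} = A\,b_j + B
    % The item response function is unchanged because
    % a_j^{*}(\theta^{*} - b_j^{*}) = a_j(\theta - b_j).
    ```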


    The Existential Loneliness Questionnaire: Background, development, and preliminary findings

    JOURNAL OF CLINICAL PSYCHOLOGY, Issue 9 2002
    Aviva M. Mayers
    We described the background and the development of a new measure of existential loneliness, the Existential Loneliness Questionnaire (ELQ). Specifically, we analyzed the items of the preliminary version of the ELQ (ELQ-P) using methods based on item response theory (the Rasch model) and examined the convergent and discriminative validity of the ELQ in a sample of 47 HIV-infected women. Item analysis produced an ELQ version consisting of 22 items that were internally consistent and performed well in measuring an underlying construct conceptualized as existential loneliness. In addition, the ELQ discriminated well between symptomatic and asymptomatic HIV-infected women. The ELQ correlated strongly with measures of depression, loneliness not identified as existential, and purpose-in-life, and moderately strongly with a measure of hopelessness. Holding constant depression scores, the correlation between the ELQ and loneliness not identified as existential was significantly attenuated. Limitations of the study include the small sample size, which precluded an analysis of the dimensional structure of the ELQ. © 2002 Wiley Periodicals, Inc. J Clin Psychol 58: 1183–1193, 2002. [source]


    A Comparison of Item Fit Statistics for Mixed IRT Models

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 3 2010
    Kyong Hee Chon
    In this study we examined procedures for assessing model-data fit of item response theory (IRT) models for mixed format data. The model fit indices used in this study include PARSCALE's G2, Orlando and Thissen's S-X2 and S-G2, and Stone's χ2* and G2*. To investigate the relative performance of the fit statistics at the item level, we conducted two simulation studies: Type I error and power studies. We evaluated the performance of the item fit indices for various conditions of test length, sample size, and IRT models. Among the competing measures, the summed score-based indices S-X2 and S-G2 were found to be the sensible and efficient choice for assessing model fit for mixed format data. These indices performed well, particularly with short tests. The pseudo-observed score indices, χ2* and G2*, showed inflated Type I error rates in some simulation conditions. Consistent with the findings of current literature, PARSCALE's G2 index was rarely useful, although it provided reasonable results for long tests. [source]
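
    Orlando and Thissen's summed-score index, which performed best here, compares observed and model-expected proportions correct for an item within each summed-score group. Its usual form (our notation, not the article's) is:

    ```latex
    S\text{-}X^2_i = \sum_{k=1}^{n-1} N_k\,
    \frac{\left(O_{ik} - E_{ik}\right)^2}{E_{ik}\left(1 - E_{ik}\right)}
    % N_k: number of examinees with summed score k; O_{ik}, E_{ik}: observed and
    % model-predicted proportions answering item i correctly in that score group.
    ```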


    Multidimensional Linking for Tests with Mixed Item Types

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 2 2009
    Lihua Yao
    Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none have used a combination of MC and CR item types. Thus, this research explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used real data-derived parameters from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying conditions of anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple versus complex structured items, using a sample size of 3,000. It was found that for a well chosen anchor set, the parameters recovered well after equating across all populations, even for anchor sets composed of as few as six items. [source]


    Skills Diagnosis Using IRT-Based Latent Class Models

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2007
    Louis A. Roussos
    This article describes a latent trait approach to skills diagnosis based on a particular variety of latent class models that employ item response functions (IRFs) as in typical item response theory (IRT) models. To enable and encourage comparisons with other approaches, this description is provided in terms of the main components of any psychometric approach: the ability model and the IRF structure; review of research on estimation, model checking, reliability, validity, equating, and scoring; and a brief review of real data applications. In this manner the article demonstrates that this approach to skills diagnosis has built a strong initial foundation of research and resources available to potential users. The outlook for future research and applications is discussed with special emphasis on a call for pilot studies and concomitant increased validity research. [source]


    Generating Dichotomous Item Scores with the Four-Parameter Beta Compound Binomial Model

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 3 2007
    Patrick O. Monahan
    A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), first four moments of total score distribution, and coefficient alpha of three real data sets consisting of educational achievement test scores. The technique replicated real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model (and much better than the 1PL model) and is therefore a promising alternative simulation technique. This 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models. [source]


    Generalizability in Item Response Modeling

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 2 2007
    Derek C. Briggs
    An approach called generalizability in item response modeling (GIRM) is introduced in this article. The GIRM approach essentially incorporates the sampling model of generalizability theory (GT) into the scaling model of item response theory (IRT) by making distributional assumptions about the relevant measurement facets. By specifying a random effects measurement model, and taking advantage of the flexibility of Markov Chain Monte Carlo (MCMC) estimation methods, it becomes possible to estimate GT variance components simultaneously with traditional IRT parameters. It is shown how GT and IRT can be linked together, in the context of a single-facet measurement design with binary items. Using both simulated and empirical data with the software WinBUGS, the GIRM approach is shown to produce results comparable to those from a standard GT analysis, while also producing results from a random effects IRT model. [source]


    A Mixture Model Analysis of Differential Item Functioning

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 2 2005
    Allan S. Cohen
    Once a differential item functioning (DIF) item has been identified, little is known about the examinees for whom the item functions differentially. This is because DIF focuses on manifest group characteristics that are associated with it, but do not explain why examinees respond differentially to items. We first analyze item response patterns for gender DIF and then illustrate, through the use of a mixture item response theory (IRT) model, how the manifest characteristic associated with DIF often has a very weak relationship with the latent groups actually being advantaged or disadvantaged by the item(s). Next, we propose an alternative approach to DIF assessment that first uses an exploratory mixture model analysis to define the primary dimension(s) that contribute to DIF, and secondly studies examinee characteristics associated with those dimensions in order to understand the cause(s) of DIF. Comparison of academic characteristics of these examinees across classes reveals some clear differences in manifest characteristics between groups. [source]


    Using Patterns of Summed Scores in Paper-and-Pencil Tests and Computer-Adaptive Tests to Detect Misfitting Item Score Patterns

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 2 2004
    Rob R. Meijer
    Two new methods have been proposed to determine unexpected sum scores on sub-tests (testlets), both for paper-and-pencil tests and computer adaptive tests. A method based on a conservative bound using the hypergeometric distribution, denoted p, was compared with a method where the probability for each score combination was calculated using a highest density region (HDR). Furthermore, these methods were compared with the standardized log-likelihood statistic with and without a correction for the estimated latent trait value (denoted as l*z and lz, respectively). Data were simulated on the basis of the one-parameter logistic model, and both parametric and non-parametric logistic regression were used to obtain estimates of the latent trait. Results showed that it is important to take the trait level into account when comparing subtest scores. In a nonparametric item response theory (IRT) context, an adapted version of the HDR method was a powerful alternative to p. In a parametric IRT context, results showed that l*z had the highest power when the data were simulated conditionally on the estimated latent trait level. [source]
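
    The standardized log-likelihood statistic used as a benchmark above is the person-fit index lz: the log-likelihood of the observed response pattern, standardized by its model-implied mean and variance, with l*z denoting the variant that corrects for the latent trait being estimated rather than known. One common way of writing it (a general sketch, not the article's notation):

    ```latex
    l_0(\theta) = \sum_{i=1}^{n}\left[u_i \ln P_i(\theta) + (1 - u_i)\ln\{1 - P_i(\theta)\}\right],
    \qquad
    l_z = \frac{l_0(\hat{\theta}) - \mathrm{E}\!\left[l_0(\hat{\theta})\right]}
               {\sqrt{\mathrm{Var}\!\left[l_0(\hat{\theta})\right]}}
    ```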


    Assessing Goodness of Fit of Item Response Theory Models: A Comparison of Traditional and Alternative Procedures

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2003
    Clement A. Stone
    Testing the goodness of fit of item response theory (IRT) models is relevant to validating IRT models, and new procedures have been proposed. These alternatives compare observed and expected response frequencies conditional on observed total scores, and use posterior probabilities for responses across θ levels rather than cross-classifying examinees using point estimates of θ and score responses. This research compared these alternatives with regard to their methods, properties (Type 1 error rates and empirical power), available research, and practical issues (computational demands, treatment of missing data, effects of sample size and sparse data, and available computer programs). Different advantages and disadvantages related to these characteristics are discussed. A simulation study provided additional information about empirical power and Type 1 error rates. [source]


    Comparing Multidimensional and Unidimensional Proficiency Classifications: Multidimensional IRT as a Diagnostic Aid

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 3 2003
    Cindy M. Walker
    This research examined the effect of scoring items thought to be multidimensional using a unidimensional model and demonstrated the use of multidimensional item response theory (MIRT) as a diagnostic tool. Using real data from a large-scale mathematics test, previously shown to function differentially in favor of proficient writers, the difference in proficiency classifications was explored when a two- versus a one-dimensional confirmatory model was fit. The estimate of ability obtained when using the unidimensional model was considered to represent general mathematical ability. Under the two-dimensional model, one of the two dimensions was also considered to represent general mathematical ability. The second dimension was considered to represent the ability to communicate in mathematics. The resulting pattern of mismatched proficiency classifications suggested that examinees found to have less mathematics communication ability were more likely to be placed in a lower general mathematics proficiency classification under the unidimensional than the multidimensional model. Results and implications are discussed. [source]


    Identification and Evaluation of Local Item Dependencies in the Medical College Admissions Test

    JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2002
    April L. Zenisky
    Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test and these were due to test speededness or context dependence (related to passage structure). Also, the results highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may impact on the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory. [source]
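
    One widely used screen for local item dependence (offered here as a general illustration, not necessarily one of the detection methods reviewed in this article) is Yen's Q3: the correlation between IRT residuals for a pair of items, which should be near zero when responses are conditionally independent given ability. A minimal sketch with invented data:

    ```python
    import numpy as np

    def q3(u_i, u_j, p_i, p_j):
        """Yen's Q3: correlation of IRT residuals for an item pair.
        u_*: 0/1 responses; p_*: model-predicted probabilities at each examinee's
        estimated ability. Large positive values suggest local item dependence."""
        d_i = np.asarray(u_i) - np.asarray(p_i)
        d_j = np.asarray(u_j) - np.asarray(p_j)
        return float(np.corrcoef(d_i, d_j)[0, 1])

    # Invented responses and fitted probabilities for two passage-based items
    u1, p1 = [1, 0, 1, 1, 0, 1], [0.8, 0.3, 0.6, 0.9, 0.4, 0.7]
    u2, p2 = [1, 0, 1, 1, 0, 0], [0.7, 0.2, 0.5, 0.8, 0.3, 0.6]
    print(round(q3(u1, u2, p1, p2), 2))
    ```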


    Design, validation, and use of an evaluation instrument for monitoring systemic reform

    JOURNAL OF RESEARCH IN SCIENCE TEACHING, Issue 6 2001
    Kathryn Scantlebury
    Over the past decade, state and national policymakers have promoted systemic reform as a way to achieve high-quality science education for all students. However, few instruments are available to measure changes in key dimensions relevant to systemic reform such as teaching practices, student attitudes, or home and peer support. Furthermore, Rasch methods of analysis are needed to permit valid comparison of different cohorts of students during different years of a reform effort. This article describes the design, development, validation, and use of an instrument that measures student attitudes and several environment dimensions (standards-based teaching, home support, and peer support) using a three-step process that incorporated expert opinion, factor analysis, and item response theory. The instrument was validated with over 8,000 science and mathematics students, taught by more than 1,000 teachers in over 200 schools as part of a comprehensive assessment of the effectiveness of Ohio's systemic reform initiative. When the new four-factor, 20-item questionnaire was used to explore the relative influence of the class, home, and peer environment on student achievement and attitudes, findings were remarkably consistent across 3 years and different units and methods of analysis. All three environments accounted for unique variance in student attitudes, but only the environment of the class accounted for unique variance in student achievement. However, the class environment (standards-based teaching practices) was the strongest independent predictor of both achievement and attitude, and appreciable amounts of the total variance in attitudes were common to the three environments. © 2001 John Wiley & Sons, Inc. J Res Sci Teach 38: 646–662, 2001 [source]


    A primer on classical test theory and item response theory for assessments in medical education

    MEDICAL EDUCATION, Issue 1 2010
    André F De Champlain
    Context: A test score is a number which purportedly reflects a candidate's proficiency in some clearly defined knowledge or skill domain. A test theory model is necessary to help us better understand the relationship that exists between the observed (or actual) score on an examination and the underlying proficiency in the domain, which is generally unobserved. Common test theory models include classical test theory (CTT) and item response theory (IRT). The widespread use of IRT models over the past several decades attests to their importance in the development and analysis of assessments in medical education. Item response theory models are used for a host of purposes, including item analysis, test form assembly and equating. Although helpful in many circumstances, IRT models make fairly strong assumptions and are mathematically much more complex than CTT models. Consequently, there are instances in which it might be more appropriate to use CTT, especially when common assumptions of IRT cannot be readily met, or in more local settings, such as those that may characterise many medical school examinations. Objectives: The objective of this paper is to provide an overview of both CTT and IRT to the practitioner involved in the development and scoring of medical education assessments. Methods: The tenets of CTT and IRT are initially described. Then, main uses of both models in test development and psychometric activities are illustrated via several practical examples. Finally, general recommendations pertaining to the use of each model in practice are outlined. Discussion: Classical test theory and IRT are widely used to address measurement-related issues that arise from commonly used assessments in medical education, including multiple-choice examinations, objective structured clinical examinations, ward ratings and workplace evaluations. The present paper provides an introduction to these models and how they can be applied to answer common assessment questions. Medical Education 2010: 44: 109–117 [source]
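
    The contrast drawn in this primer can be condensed into two lines: classical test theory decomposes the observed score into a true score plus error and expresses reliability as a variance ratio, whereas item response theory models the probability of each item response as a function of the latent proficiency. One standard sketch (not the article's notation), using the three-parameter logistic model as the IRT example:

    ```latex
    % Classical test theory: observed = true + error; reliability as a variance ratio
    X = T + E, \qquad \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}
    % Item response theory (3PL): response probability as a function of proficiency
    P(X_j = 1 \mid \theta) = c_j + (1 - c_j)\,\frac{1}{1 + \exp\{-a_j(\theta - b_j)\}}
    ```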


    Cutaneous allodynia in the migraine population

    ANNALS OF NEUROLOGY, Issue 2 2008
    Richard B. Lipton MD
    Objective: To develop and validate a questionnaire for assessing cutaneous allodynia (CA), and to estimate the prevalence and severity of CA in the migraine population. Methods: Migraineurs (n = 11,388) completed the Allodynia Symptom Checklist, assessing the frequency of allodynia symptoms during headache. Response options were never (0), rarely (0), less than 50% of the time (1), ≥50% of the time (2), and none (0). We used item response theory to explore how well each item discriminated CA. The relations of CA to headache features were examined. Results: All 12 questions had excellent item properties. The greatest discrimination occurred with CA during "taking a shower" (discrimination = 2.54), wearing a necklace (2.39) or ring (2.31), and exposure to heat (2.1) or cold (2.0). The factor analysis demonstrated three factors: thermal, mechanical static, and mechanical dynamic. Based on the psychometrics, we developed a scale distinguishing no CA (scores 0–2), mild (3–5), moderate (6–8), and severe (≥9). The prevalence of allodynia among migraineurs was 63.2%. Severe CA occurred in 20.4% of migraineurs. CA was associated with migraine-defining features (eg, unilateral pain: odds ratio, 2.3; 95% confidence interval, 2.0–2.4; throbbing pain: odds ratio, 2.3; 95% confidence interval, 2.1–2.6; nausea: odds ratio, 2.3; 95% confidence interval, 2.1–2.6), as well as illness duration, attack frequency, and disability. Interpretation: The Allodynia Symptom Checklist measures overall allodynia and subtypes. CA affects 63% of migraineurs in the population and is associated with frequency, severity, disability, and associated symptoms of migraine. CA maps onto migraine biology. Ann Neurol 2007 [source]
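
    The scoring rule described above maps each item's frequency response to 0, 1, or 2 points and classifies the summed checklist score with the reported cutoffs (0-2 no CA, 3-5 mild, 6-8 moderate, 9 or more severe). A short sketch of that scoring; the response labels and the example answers for the 12 items are hypothetical:

    ```python
    # Points per response option, following the scoring described in the abstract
    POINTS = {"never": 0, "rarely": 0, "<50% of the time": 1, ">=50% of the time": 2, "none": 0}

    def asc12_severity(responses):
        """Sum the item points and classify cutaneous allodynia severity."""
        total = sum(POINTS[r] for r in responses)
        if total <= 2:
            return total, "no CA"
        if total <= 5:
            return total, "mild"
        if total <= 8:
            return total, "moderate"
        return total, "severe"

    # Hypothetical answers for the 12 checklist items
    answers = ["never", "rarely", "<50% of the time", ">=50% of the time",
               ">=50% of the time", "never", "<50% of the time", "rarely",
               "never", ">=50% of the time", "never", "<50% of the time"]
    print(asc12_severity(answers))   # -> (9, 'severe')
    ```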