Item Difficulty: Selected Abstracts
A Closer Look at Using Judgments of Item Difficulty to Change Answers on Computerized Adaptive Tests
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2005, Walter P. Vispoel

Recent studies have shown that restricting review and answer change opportunities on computerized adaptive tests (CATs) to items within successive blocks reduces time spent in review, satisfies most examinees' desires for review, and controls against distortion in proficiency estimates resulting from intentional incorrect answering of items prior to review. However, restricting review opportunities on CATs may not prevent examinees from artificially raising proficiency estimates by using judgments of item difficulty to signal when to change previous answers. We evaluated six strategies for using item difficulty judgments to change answers on CATs and compared the results to those from examinees reviewing and changing answers in the usual manner. The strategy conditions varied in terms of when examinees were prompted to consider changing answers and in the information provided about the consistency of the item selection algorithm. We found that examinees fared best on average when they reviewed and changed answers in the usual manner. The best gaming strategy was one in which the examinees knew something about the consistency of the item selection algorithm and were prompted to change responses only when they were unsure about answer correctness and sure about their item difficulty judgments. However, even this strategy did not produce a mean gain in proficiency estimates. [source]

Can Examinees Use Judgments of Item Difficulty to Improve Proficiency Estimates on Computerized Adaptive Vocabulary Tests?
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2002, Walter P. Vispoel

Recent simulation studies indicate that there are occasions when examinees can use judgments of relative item difficulty to obtain positively biased proficiency estimates on computerized adaptive tests (CATs) that permit item review and answer change. Our purpose in the study reported here was to evaluate examinees' success in using these strategies while taking CATs in a live testing setting. We taught examinees two item difficulty judgment strategies designed to increase proficiency estimates. Examinees who were taught each strategy and examinees who were taught neither strategy were assigned at random to complete vocabulary CATs under conditions in which review was allowed after completing all items and when review was allowed only within successive blocks of items. We found that proficiency estimate changes following review were significantly higher in the regular review conditions than in the strategy conditions. Failure to obtain systematically higher scores in the strategy conditions was due in large part to errors examinees made in judging the relative difficulty of CAT items. [source]
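To make the mechanism these two studies probe concrete, here is a minimal sketch of a generic Rasch-based CAT loop (hypothetical item pool and a crude fixed-step scoring rule; not the algorithm used in the studies above). Because each new item's difficulty is matched to the running proficiency estimate, an examinee who notices that the next item feels easier has indirect evidence that the previous answer was scored as incorrect, which is the signal the difficulty-judgment strategies try to exploit:

    import numpy as np

    rng = np.random.default_rng(0)
    pool_b = rng.normal(0.0, 1.0, 200)   # hypothetical Rasch item difficulties
    true_theta = 0.8                     # simulated examinee proficiency
    theta_hat = 0.0                      # CAT's running estimate
    used = set()

    def next_item(theta_hat):
        # Select the unused item whose difficulty is closest to theta_hat.
        candidates = [i for i in range(len(pool_b)) if i not in used]
        return min(candidates, key=lambda i: abs(pool_b[i] - theta_hat))

    for step in range(10):
        i = next_item(theta_hat)
        used.add(i)
        p_correct = 1.0 / (1.0 + np.exp(-(true_theta - pool_b[i])))  # Rasch model
        correct = rng.random() < p_correct
        # Crude fixed-step update standing in for ML/EAP scoring:
        theta_hat += 0.5 if correct else -0.5
        # The leak: the *next* item's difficulty tracks theta_hat, so a
        # noticeably easier item hints that the last answer was scored wrong.
        print(f"step {step}: item b = {pool_b[i]:+.2f}, "
              f"correct = {correct}, theta_hat = {theta_hat:+.2f}")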
Classroom experiments on the effects of different noise sources and sound levels on long-term recall and recognition in children
APPLIED COGNITIVE PSYCHOLOGY, Issue 8 2003, Staffan Hygge

A total of 1358 children aged 12-14 years participated in ten noise experiments in their ordinary classrooms and were tested for recall and recognition of a text exactly one week later. Single and combined noise sources were presented for 15 min at 66 dBA Leq (equivalent noise level). Single source presentations of aircraft and road traffic noise were also presented at 55 dBA Leq. Data were analysed between subjects since the first within-subjects analysis revealed a noise after-effect or an asymmetric transfer effect. Overall, there was a strong noise effect on recall, and a smaller but significant effect on recognition. In the single-source studies, aircraft and road traffic noise impaired recall at both noise levels. Train noise and verbal noise did not affect recognition or recall. Some of the pairwise combinations of aircraft noise with train or road traffic noise, with one or the other as the dominant source, interfered with recall and recognition. Item difficulty, item position and ability did not interact with the noise effect. Arousal, distraction, perceived effort, and perceived difficulty in reading and learning did not mediate the effects on recall and recognition. Copyright © 2003 John Wiley & Sons, Ltd. [source]

Manipulating Processing Difficulty of Reading Comprehension Questions: The Feasibility of Verbal Item Generation
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2005, Joanna S. Gorin

Based on a previously validated cognitive processing model of reading comprehension, this study experimentally examines potential generative components of text-based multiple-choice reading comprehension test questions. Previous research (Embretson & Wetzel, 1987; Gorin & Embretson, 2005; Sheehan & Ginther, 2001) shows that text encoding and decision processes account for significant proportions of variance in item difficulties. In the current study, Linear Logistic Latent Trait Model (LLTM; Fischer, 1973) parameter estimates of experimentally manipulated items are examined to further verify the impact of encoding and decision processes on item difficulty. Results show that manipulation of some passage features, such as increased use of negative wording, significantly increases item difficulty in some cases, whereas others, such as altering the order of information presentation in a passage, did not significantly affect item difficulty but did affect reaction time. These results suggest that reliable changes in difficulty and response time through algorithmic manipulation of certain task features are feasible. However, non-significant results for several manipulations highlight potential challenges to item generation in establishing direct links between theoretically relevant item features and individual item processing. Further examination of these relationships will be informative to item writers as well as test developers interested in the feasibility of item generation as an assessment tool. [source]
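For orientation, the LLTM cited above constrains each Rasch item difficulty to be a linear combination of scored item features; in a standard formulation (notation assumed here, not taken from the abstract):

    \beta_i = \sum_{k=1}^{K} q_{ik}\, \eta_k + c

where q_ik is the score of item i on feature k (for example, the amount of negative wording), eta_k is that feature's contribution to difficulty, and c is a normalization constant. Estimating the eta_k from experimentally manipulated items is what lets a study of this kind test whether a feature manipulation moves difficulty as predicted.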
Activities of daily living in persons with intellectual disability: Strengths and limitations in specific motor and process skills
AUSTRALIAN OCCUPATIONAL THERAPY JOURNAL, Issue 4 2003, Anders Kottorp

As there is a wide range of abilities among clients with intellectual disability, occupational therapists should use assessments of activities of daily living that specify clients' strengths and limitations to guide and target interventions. The aim of the present study was to examine whether activities of daily living performance skills differ between adults with mild and moderate intellectual disability. Three hundred and forty-eight participants with either mild intellectual disability (n = 178) or moderate intellectual disability (n = 170) were assessed using the Assessment of Motor and Process Skills to examine the quality of their activities of daily living skills. The overall activities of daily living motor and activities of daily living process hierarchies of skill item difficulties remained stable between groups. Although participants with moderate intellectual disability had more difficulty overall with activities of daily living motor and process skills, they were able to carry out some of these activities as well as participants with mild intellectual disability did. The findings are discussed in relation to the planning of specific interventions to improve the ability of clients with intellectual disability to carry out activities of daily living. [source]

Commentary: A Response to Reckase's Conceptual Framework and Examples for Evaluating Standard Setting Methods
EDUCATIONAL MEASUREMENT: ISSUES AND PRACTICE, Issue 3 2006, E. Matthew Schulz

A look at real data shows that Reckase's psychometric theory for standard setting is not applicable to bookmark and that his simulations cannot explain actual differences between methods. It is suggested that exclusively test-centered, criterion-referenced approaches are too idealized and that a psychophysics paradigm and a theory of group behavior could be more useful in thinking about the standard setting process. In this view, item mapping methods such as bookmark are reasonable adaptations to fundamental limitations in human judgments of item difficulty. They make item ratings unnecessary and have unique potential for integrating external validity data and student performance data more fully into the standard setting process. [source]

Factor and item-response analysis of DSM-IV criteria for abuse of and dependence on cannabis, cocaine, hallucinogens, sedatives, stimulants and opioids
ADDICTION, Issue 6 2007, Nathan A. Gillespie

Aims: This paper explored, in a population-based sample of males, the factorial structure of criteria for substance abuse and dependence, and compared qualitatively the performance of these criteria across drug categories using item-response theory (IRT).
Design: Marginal maximum likelihood was used to explore the factor structure of criteria within drug classes, and a two-parameter IRT model was used to determine how the difficulty and discrimination of individual criteria differ across drug classes.
Participants: A total of 4234 males born from 1940 to 1974 from the population-based Virginia Twin Registry were approached to participate.
Measurements: DSM-IV drug use, abuse and dependence criteria for cannabis, sedatives, stimulants, cocaine and opiates.
Findings: For each drug class, the pattern of endorsement of individual criteria for abuse and dependence, conditioned on initiation and use, could be best explained by a single factor. There were large differences in individual item performance across substances in terms of item difficulty and discrimination. Cocaine users were more likely to have encountered legal, social, physical and psychological consequences.
Conclusions: The DSM-IV abuse and dependence criteria, within each drug class, are not distinct but are best described in terms of a single underlying continuum of risk. Because individual criteria performed very differently across substances in IRT analyses, the assumption that these items measure equivalent levels of severity or liability with the same discrimination across different substances is unsustainable. Compared to other drugs, cocaine use is associated with more detrimental effects and negative consequences, whereas the effects of cannabis and hallucinogens appear to be less harmful. Implications for other drug classes are discussed. [source]
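For reference, the two-parameter IRT model named in the Design section above is conventionally written as (notation assumed, not the paper's):

    P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp[-a_i(\theta_j - b_i)]}

where theta_j is person j's liability, b_i is the difficulty of criterion i (the liability at which the endorsement probability reaches .5), and a_i is its discrimination (how sharply that probability rises around b_i). The Findings above amount to saying that the same criterion receives very different (a_i, b_i) pairs under different drug classes.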
Estimation of Item Response Theory Parameters in the Presence of Missing Data
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 3 2008, Holmes Finch

Missing data are a common problem in a variety of measurement settings, including responses to items on both cognitive and affective assessments. Researchers have shown that such missing data may create problems in the estimation of item difficulty parameters in the Item Response Theory (IRT) context, particularly if they are ignored. At the same time, a number of data imputation methods have been developed outside of the IRT framework and been shown to be effective tools for dealing with missing data. The current study takes several of these methods that have been found to be useful in other contexts and investigates their performance with IRT data that contain missing values. Through a simulation study, it is shown that these methods exhibit varying degrees of effectiveness in terms of imputing data that in turn produce accurate sample estimates of item difficulty and discrimination parameters. [source]
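As a concrete, deliberately simple example of the kind of method such simulation studies compare, the sketch below fills missing binary responses by person-mean imputation before calibration (the data and missingness rate are hypothetical; the study evaluates several methods, not this one specifically):

    import numpy as np

    rng = np.random.default_rng(1)
    resp = rng.integers(0, 2, size=(500, 20)).astype(float)  # fake 0/1 responses
    mask = rng.random(resp.shape) < 0.10                     # 10% missing at random
    resp[mask] = np.nan

    # Person-mean imputation: replace each missing cell with the examinee's
    # proportion correct on observed items, rounded to keep responses binary.
    person_means = np.nanmean(resp, axis=1, keepdims=True)
    imputed = np.where(np.isnan(resp), np.round(person_means), resp)

    # Classical item difficulty (proportion correct) before IRT calibration:
    print(np.round(imputed.mean(axis=0), 2))

More principled alternatives (for example, multiple imputation or EM-based approaches) follow the same pattern: complete the response matrix first, then estimate the IRT item parameters.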
Increasing the Homogeneity of CAT's Item-Exposure Rates by Minimizing or Maximizing Varied Target Functions While Assembling Shadow Tests
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 3 2005, Yuan H. Li

A computerized adaptive testing (CAT) algorithm that has the potential to increase the homogeneity of CAT's item-exposure rates without significantly sacrificing the precision of ability estimates was proposed and assessed in the shadow-test (van der Linden & Reese, 1998) CAT context. This CAT algorithm was formed by a combination of maximizing or minimizing varied target functions while assembling shadow tests. Four target functions were used separately in the first, second, third, and fourth quarters of the CAT. The elements used in the four functions were (a) a random number assigned to each item, (b) the absolute difference between an examinee's current ability estimate and an item difficulty, (c) the absolute difference between an examinee's current ability estimate and an optimum item difficulty, and (d) item information. The results indicated that this combined CAT fully utilized all the items in the pool, reduced the maximum exposure rates, and achieved more homogeneous exposure rates. Moreover, its precision in recovering ability estimates was similar to that of the maximum item-information method. The combined CAT method produced the best overall results compared with the other individual CAT item-selection methods. The findings from the combined CAT are encouraging; future uses are discussed. [source]
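All four element types listed above are cheap to compute for any candidate item, which is what makes cycling between them across test quarters practical. A minimal sketch of the four scoring rules (hypothetical 2PL pool, our variable names; the "optimum difficulty" offset is a stand-in, and the signs are flipped on the two distance rules so that maximizing each score matches the abstract's minimize-or-maximize framing):

    import numpy as np

    rng = np.random.default_rng(2)
    a = rng.uniform(0.5, 2.0, 300)       # hypothetical discriminations
    b = rng.normal(0.0, 1.0, 300)        # hypothetical difficulties
    theta_hat = 0.3                      # current ability estimate
    b_opt = theta_hat + 0.1              # stand-in "optimum item difficulty"

    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    criteria = {
        "random":               rng.random(300),         # (a) random number per item
        "closeness_to_theta":   -np.abs(theta_hat - b),  # (b) |theta_hat - b|, minimized
        "closeness_to_optimum": -np.abs(b_opt - b),      # (c) |theta_hat - b_opt|, minimized
        "information":          a**2 * p * (1.0 - p),    # (d) 2PL Fisher information
    }
    for name, score in criteria.items():
        print(f"{name:>22s}: best item = {int(np.argmax(score))}")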
A Sex Difference by Item Difficulty Interaction in Multiple-Choice Mathematics Items Administered to National Probability Samples
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 1 2001, John Bielinski

A 1998 study by Bielinski and Davison reported a sex difference by item difficulty interaction in which easy items tended to be easier for females than males, and hard items tended to be harder for females than males. To extend their research to nationally representative samples of students, this study used math achievement data from the 1992 NAEP, the TIMSS, and the NELS:88. The data included students in grades 4, 8, 10, and 12. The interaction was assessed by correlating the item difficulty difference (b_male - b_female) with item difficulty computed on the combined male/female sample. Using only the multiple-choice mathematics items, the predicted negative correlation was found for all eight populations and was significant in five. An argument is made that this phenomenon may help explain the greater variability in math achievement among males as compared to females and the emergence of higher performance of males in late adolescence. [source]
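The interaction statistic in this study reduces to a single correlation. The sketch below simulates the reported pattern, in which the male-female difficulty gap (b_male - b_female) is positive for easy items and negative for hard ones, so it correlates negatively with overall difficulty (all numbers hypothetical):

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(3)
    b_all = rng.normal(0.0, 1.0, 60)                   # pooled item difficulties
    # Simulated interaction: the gap shrinks as items get harder.
    b_diff = -0.15 * b_all + rng.normal(0.0, 0.1, 60)  # b_male - b_female

    r, pval = pearsonr(b_all, b_diff)
    print(f"r = {r:.2f}, p = {pval:.3g}")              # expected: r < 0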
Effects of Response Format on Difficulty of SAT-Mathematics Items: It's Not the Strategy
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 1 2000, Irvin R. Katz

Problem-solving strategy is frequently cited as mediating the effects of response format (multiple-choice, constructed response) on item difficulty, yet there are few direct investigations of examinee solution procedures. Fifty-five high school students solved parallel constructed response and multiple-choice items that differed only in the presence of response options. Student performance was videotaped to assess solution strategies. Strategies were categorized as "traditional" (those associated with constructed response problem solving, e.g., writing and solving algebraic equations) or "nontraditional" (those associated with multiple-choice problem solving, e.g., estimating a potential solution). Surprisingly, participants sometimes adopted nontraditional strategies to solve constructed response items. Furthermore, differences in difficulty between response formats did not correspond to differences in strategy choice: some items showed a format effect on strategy but no effect on difficulty; other items showed the reverse. We interpret these results in light of the relative comprehension challenges posed by the two groups of items. [source]

Witness confidence and accuracy: is a positive relationship maintained for recall under interview conditions?
JOURNAL OF INVESTIGATIVE PSYCHOLOGY AND OFFENDER PROFILING, Issue 1 2009, Mark R. Kebbell

A large positive correlation between eyewitness recall confidence and accuracy (C-A) is found in research when item difficulty is varied to include easy questions. However, these results are based on questionnaire responses. In real interviews, the social nature of the interview may influence C-A relationships, and it is the interviewer's perception of the accuracy of a witness that counts. This study was conducted to investigate the influence of these factors on recall of a video. Three conditions were used, with the same questions in each. Participants in condition 1 (self-rate questionnaire condition, n = 20) were given a questionnaire that required them to answer questions and rate confidence on a scale. Pairs of participants in condition 2 (self-rate interview condition, n = 40) were given the role of eyewitness or interviewer; eyewitnesses were asked questions by an interviewer and responded orally with answers and confidence judgements on a Likert scale. Participants in condition 3 (interviewer-rate interview condition, n = 40) were tested in the same way as in condition 2 but provided confidence judgements in their own words, and interviewers independently rated each confidence judgement on the Likert scale. The experiment showed high C-A relationships, particularly for 'absolutely sure' responses. The main effect of the social interview condition was to increase confidence in correct answers but not in incorrect answers. However, the advantage of this effect was tempered by the fact that, although observers can differentiate between confident and less confident answers, less extreme confidence judgements were ascribed. Copyright © 2009 John Wiley & Sons, Ltd. [source]

Exploring alternative conceptions from Newtonian dynamics and simple DC circuits: Links between item difficulty and item confidence
JOURNAL OF RESEARCH IN SCIENCE TEACHING, Issue 2 2006, Maja Planinic

Croatian 1st-year and 3rd-year high-school students (N = 170) completed a conceptual physics test. Students were evaluated with regard to two physics topics: Newtonian dynamics and simple DC circuits. Students answered test items and also indicated their confidence in each answer. Rasch analysis facilitated the calculation of three linear measures: (a) an item-difficulty measure based upon all responses, (b) an item-confidence measure based upon correct student answers, and (c) an item-confidence measure based upon incorrect student answers. Comparisons were made with regard to item difficulty and item confidence. The results suggest that Newtonian dynamics is a topic with stronger student alternative conceptions than DC circuits, which is characterized by much lower student confidence in both correct and incorrect answers. A systematic and significant difference between mean student confidence on Newtonian dynamics and DC circuits items was found in both student groups. Findings suggest some steps for physics instruction in Croatia as well as areas of further research for those in science education interested in additional techniques of exploring alternative conceptions. © 2005 Wiley Periodicals, Inc. J Res Sci Teach 43: 150-171, 2006 [source]
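The Rasch model underlying these measures (and the other Rasch analyses on this page) places person ability and item difficulty on a single logit scale; in standard notation (ours, not the paper's):

    P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}

Because theta_j and b_i enter only through their difference, the calibrated difficulty and confidence measures are linear and directly comparable across items, which is what the "three linear measures" above rely on.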
Item response theory: applications of modern test theory in medical education
MEDICAL EDUCATION, Issue 8 2003, Steven M Downing

Context: Item response theory (IRT) measurement models are discussed in the context of their potential usefulness in various medical education settings, such as assessment of achievement and evaluation of clinical performance.
Purpose: The purpose of this article is to compare and contrast IRT measurement with the more familiar classical measurement theory (CMT) and to explore the benefits of IRT applications in typical medical education settings.
Summary: CMT, the more common measurement model used in medical education, is straightforward and intuitive. Its limitation is that it is sample-dependent, in that all statistics are confounded with the particular sample of examinees who completed the assessment. Examinee scores from IRT are independent of the particular sample of test questions or assessment stimuli. Also, item characteristics, such as item difficulty, are independent of the particular sample of examinees. The IRT characteristic of invariance permits easy equating of examination scores, which places scores on a constant measurement scale and permits the legitimate comparison of student ability change over time. Three common IRT models and their statistical assumptions are discussed. IRT applications in computer-adaptive testing, and as a method useful for adjusting rater error in clinical performance assessments, are overviewed.
Conclusions: IRT measurement is a powerful tool used to solve a major problem of CMT, that is, the confounding of examinee ability with item characteristics. IRT measurement addresses important issues in medical education, such as eliminating rater error from performance assessments. [source]

Assessment of the upper limb in acute stroke: The validity of hierarchal scoring for the Motor Assessment Scale
AUSTRALIAN OCCUPATIONAL THERAPY JOURNAL, Issue 3 2010, Rebekah L. Pickering

Background/aim: Stroke is the greatest contributor to disability in Australian adults, and much of this disability results from a stroke-affected upper limb. This study aimed to determine the validity of hierarchal scoring for the upper limb subscale of the Motor Assessment Scale (UL-MAS) in acute stroke using Rasch analysis.
Method: This study applied Rasch analysis to 40 UL-MAS assessment results across 25 subjects to determine the validity of the hierarchy of the three upper limb subsets: upper arm function (subset 6), hand movements (subset 7) and advanced hand activities (subset 8). Rasch analysis examines the relationship between 'item difficulty' and 'person ability' and produces an output that represents the difficulty of each item in relation to the others.
Results: As hypothesised, the hierarchy was upheld within subset 6. In subset 7, the hierarchy was not upheld: item 3 was the least difficult, followed by items 1, 4, 2, 5 and 6 in order of increasing difficulty. In subset 8, the hierarchy was also not upheld: item 1 was the least difficult, followed by item 6, then items 2 and 5 of equal difficulty, and then items 3 and 4 of equal difficulty.
Conclusions: Hierarchal scoring is not supported for subsets 7 and 8, and future research is required to explore the validity of alternate scoring methods. At present, the authors recommend that the UL-MAS be scored non-hierarchally, meaning that every item within the subsets is scored regardless of its place within the hierarchy (UL-MAS-NH). [source]
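A quick way to probe whether a difficulty hierarchy is "upheld" across groups, in the spirit of the Rasch comparisons in the last two abstracts, is to estimate item difficulties separately in each group and check that the rank orders agree. The sketch below substitutes a crude logit-of-proportion-failing approximation for a full Rasch calibration (simulated data; real analyses would use dedicated Rasch software):

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(4)

    def simulate(n, mean_theta, b):
        # Generate 0/1 responses from a Rasch model for one group.
        theta = rng.normal(mean_theta, 1.0, size=(n, 1))
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        return (rng.random(p.shape) < p).astype(int)

    b_true = np.linspace(-1.5, 1.5, 8)       # 8 items, ordered easy to hard
    group_a = simulate(170, 0.5, b_true)     # e.g., higher-ability group
    group_b = simulate(170, -0.5, b_true)    # e.g., lower-ability group

    def crude_difficulty(resp):
        # Logit of the proportion failing each item approximates Rasch b.
        p = resp.mean(axis=0).clip(0.01, 0.99)
        return np.log((1 - p) / p)

    rho, _ = spearmanr(crude_difficulty(group_a), crude_difficulty(group_b))
    print(f"rank agreement of difficulty hierarchies: rho = {rho:.2f}")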