Test Items (test + item)

Distribution by Scientific Domains

Medical Sciences	44%
Education	36%
Psychology	8%
Chemistry	5%
2 Other Domains	7%

Selected Abstracts

Impact of Elaboration on Responding to Situational Judgment Test Items

INTERNATIONAL JOURNAL OF SELECTION AND ASSESSMENT, Issue 4 2008
Filip Lievens
Although faking has been identified as a potential problem in situational judgment tests (SJTs), no studies have investigated proactive approaches for controlling faking in SJTs. Therefore, this study examined the impact of elaboration on responding to SJT items. Elaboration was operationalized as reason-giving. Two hundred and forty-seven master students were assigned to either an honest or a fake condition, and to a non-elaboration or an elaboration condition. Results showed that elaboration decreased the effect of faking for items with high familiarity. Elaboration on familiar items also decreased the percentage of fakers in the top of the distribution. Next, participants in the elaboration condition rated the SJT significantly higher in terms of allowing them to present themselves more realistically and to demonstrate their knowledge, skills, and abilities. Finally, there were no significant differences in participants' satisfaction with the SJT across the elaboration and non-elaboration condition. [source]

A SIBTEST Approach to Testing DIF Hypotheses Using Experimentally Designed Test Items

JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2000
Daniel M. Bolt
This paper considers a modification of the DIF procedure SIBTEST for investigating the causes of differential item functioning (DIF). One way in which factors believed to be responsible for DIF can be investigated is by systematically manipulating them across multiple versions of an item using a randomized DIF study (Schmitt, Holland, & Dorans, 1993). In this paper, it is shown that the additivity of the index used for testing DIF in SIBTEST motivates a new extension of the method for statistically testing the effects of DIF factors. Because an important consideration is whether or not a studied DIF factor is consistent in its effects across items, a methodology for testing item × factor interactions is also presented. Using data from the mathematical sections of the Scholastic Assessment Test (SAT), the effects of two potential DIF factors, item format (multiple-choice versus open-ended) and problem type (abstract versus concrete), are investigated for gender. Results suggest a small but statistically significant and consistent effect of item format (favoring males for multiple-choice items) across items, and a larger but less consistent effect due to problem type. [source]

Predicting Cognitive Impairment in High-Functioning Community-Dwelling Older Persons: MacArthur Studies of Successful Aging

JOURNAL OF AMERICAN GERIATRICS SOCIETY, Issue 6 2002
Joshua Chodosh MD, MSHS
OBJECTIVES: To examine whether simple cognitive tests, when applied to cognitively intact older persons, are useful predictors of cognitive impairment 7 years later. DESIGN: Cohort study. SETTING: Durham, North Carolina; East Boston, Massachusetts; and New Haven, Connecticut, areas that are part of the National Institute on Aging Established Populations for Epidemiological Studies of the Elderly. PARTICIPANTS: Participants, aged 70 to 79, from three community-based studies, who were in the top third of this age group, based on physical and cognitive functional status. MEASUREMENTS: New onset of cognitive impairment as defined by a score of less than 7 on the Short Portable Mental Status Questionnaire (SPMSQ) in 1995. RESULTS: At 7 years, 21.8% (149 of 684 subjects) scored lower than 7 on the SPMSQ. Using multivariate logistic regression, three baseline (1988) cognitive tests predicted impairment in 1995. These included two simple tests of delayed recall: the ability to remember up to six items from a short story and up to 18 words from recall of Boston Naming Test items. For each story item missed, the adjusted odds ratio (AOR) for cognitive impairment was 1.44 (95% confidence interval (CI) = 1.16–1.78, P < .001). For each missed item from the word list, the AOR was 1.20 (95% CI = 1.09–1.31, P < .001). The Delayed Recognition Span, which assesses nonverbal memory, also predicted cognitive impairment, albeit less strongly (odds ratio = 1.06 per each missed answer, 95% CI = 1.003–1.13, P = .04). CONCLUSIONS: This study identifies measures of delayed recall and recognition as significant early predictors of subsequent cognitive decline in high-functioning older persons. Future efforts to identify those at greatest risk of cognitive impairment may benefit by including these measures. [source]
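
As a reading aid for the per-item odds ratios quoted in this abstract: in a logistic regression, an odds ratio reported per unit compounds multiplicatively, so missing k items multiplies the odds by AOR^k. The short sketch below assumes the effect is linear on the log-odds scale over the whole range of missed items and that all other covariates are held fixed. The function name is illustrative and does not come from the study.

```python
def odds_multiplier(odds_ratio_per_item: float, items_missed: int) -> float:
    """Multiplicative change in odds after missing `items_missed` items.

    Assumes a logistic model in which each missed item adds the same
    amount to the log-odds, so the per-item odds ratios multiply.
    """
    return odds_ratio_per_item ** items_missed


# Adjusted odds ratios quoted in the abstract (per missed item).
STORY_AOR = 1.44
WORD_LIST_AOR = 1.20

for missed in (1, 2, 3):
    story = odds_multiplier(STORY_AOR, missed)
    words = odds_multiplier(WORD_LIST_AOR, missed)
    print(f"{missed} missed: story x{story:.2f}, word list x{words:.2f}")
```

These are changes in odds, not in probability; converting to a probability would need the baseline odds, which the abstract does not report per covariate pattern.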

The Use of Generalizability (G) Theory in the Testing of Linguistic Minorities

EDUCATIONAL MEASUREMENT: ISSUES AND PRACTICE, Issue 1 2006
Guillermo Solano-Flores
We contend that generalizability (G) theory allows the design of psychometric approaches to testing English-language learners (ELLs) that are consistent with current thinking in linguistics. We used G theory to estimate the amount of measurement error due to code (language or dialect). Fourth- and fifth-grade ELLs, native speakers of Haitian-Creole from two speech communities, were given the same set of mathematics items in the standard English and standard Haitian-Creole dialects (Sample 1) or in the standard and local dialects of Haitian-Creole (Samples 2 and 3). The largest measurement error observed was produced by the interaction of student, item, and code. Our results indicate that the reliability and dependability of ELL achievement measures is affected by two facts that operate in combination: Each test item poses a unique set of linguistic challenges and each student has a unique set of linguistic strengths and weaknesses. This sensitivity to language appears to take place at the level of dialect. Also, students from different speech communities within the same broad linguistic group may differ considerably in the number of items needed to obtain dependable measures of their academic achievement. Whether students are tested in English or in their first language, dialect variation needs to be considered if language as a source of measurement error is to be effectively addressed. [source]

Estimating food intakes in Australia: validation of the Commonwealth Scientific and Industrial Research Organisation (CSIRO) food frequency questionnaire against weighed dietary intakes

JOURNAL OF HUMAN NUTRITION & DIETETICS, Issue 6 2009
C. Lassale
Abstract Background: There is a dearth of knowledge about the foods that Australian adults eat and a need for a flexible, easy-to-use tool that can estimate usual dietary intakes. The present study aimed to validate a commonly used Australian Commonwealth Scientific and Industrial Research Organisation (CSIRO) food-frequency questionnaire (C-FFQ) against two 4-day weighed food records (WFR), as the reference method. Methods: The C-FFQ, as the test item, was administered before the WFR. Two 4-day WFR were administered 4 weeks apart. Under-reporting was established using specific cut-off limits and estimated basal metabolic rate. Seventy-four women, aged 31–60 years, were enrolled from a free-living community setting. Results: After exclusion for under-reporting, the final sample comprised 62 individuals. Correlations between protein intake from the WFR and urinary urea were significant. Overall agreement between FFQ and WFR was shown by 'levels of agreement' (LOA) and least products regressions. There was presence of fixed and proportional bias for almost half the nutrients, including energy, protein, fat and carbohydrates. For most of the nutrients that did not present bias, the LOA were 50–200%. Agreement was demonstrated for percentage dietary energy protein and fat; carbohydrate; and absolute amounts of thiamine, riboflavin, magnesium and iron. However, relative intake agreement was fair to moderate, with approximately 70% of (selected) nutrients exact or within ±1 quintile difference. Conclusion: The C-FFQ is reasonable at measuring percentage energy from macronutrients and some micronutrients, and comprises a valuable tool for ranking intakes by quintiles; however, it is poor at measuring many absolute nutrient intakes relative to WFR. [source]

Comparisons between a mixing ability test and masticatory performance tests using a brittle or an elastic test food

JOURNAL OF ORAL REHABILITATION, Issue 3 2009
T. SUGIURA
Summary: A variety of chewing tests and test items have been utilized to evaluate masticatory function. The purpose of this study was to compare a mixing ability test with masticatory performance tests using peanuts or gummy jelly as test foods. Thirty-two completely dentate subjects (Dentate group, mean age: 25·1 years) and 40 removable partial denture wearers (RPD group, mean age: 65·5 years) participated in this study. The subjects were asked to chew a two-coloured paraffin wax cube as a test item for 10 strokes. Mixing Ability Index (MAI) was determined from the colour mixture and shape of the chewed cube. Subjects were asked to chew 3 g portions of peanuts and a piece of gummy jelly for 20 strokes, respectively. Median particle size of chewed peanuts was determined using a multiple-sieving method. Concentration of dissolved glucose from the surface of the chewed gummy jelly was measured using a blood glucose meter. Pearson's correlation coefficient was used to test the relationships between the MAI, median particle size and the concentration of dissolved glucose. Mixing Ability Index was significantly related to median particle size (Dentate group: r = −0·56, P < 0·001, RPD group: r = −0·70, P < 0·001), but not significantly related to glucose concentration (Dentate group: r = 0·12, RPD group: r = 0·21, P > 0·05). It seems that ability of mixing the bolus is more strongly related to the ability of comminuting brittle food than elastic food. [source]

Objective and subjective hardness of a test item used for evaluating food mixing ability

JOURNAL OF ORAL REHABILITATION, Issue 3 2007
N. M. SALLEH
Summary: The aim of this study was to compare objective and subjective hardness of selected common foods with a wax cube used as a test item in a mixing ability test. Objective hardness was determined for 11 foods (cream cheese, boiled fish paste, boiled beef, apple, raw carrot, peanut, soft/hard rice cracker, jelly, plain chocolate and chewing gum) and the wax cube. Peak force (N) to compress each item was obtained from force–time curves generated with the Tensipresser. Perceived hardness ratings of each item were made by 30 dentate subjects (mean age 26·9 years) using a visual analogue scale (100 mm). These subjective assessments were given twice with a 1 week interval. High intraclass correlation coefficients (ICCs) for test–retest reliability were seen for all foods (ICC > 0·68; P < 0·001). One-way ANOVA found a significant effect of food type on both the objective hardness score and the subjective hardness rating (P < 0·001). The wax cube showed significantly lower objective hardness score (32·6 N) and subjective hardness rating (47·7) than peanut (45·3 N, 63·5) and raw carrot (82·5 N, 78·4) [P < 0·05; Ryan–Einot–Gabriel–Welsch (REGW)-F]. A significant semilogarithmic relationship was found between the logarithm of objective hardness scores and subjective hardness ratings across twelve test items (r = 0·90; P < 0·001). These results suggest the wax cube has a softer texture compared with test foods traditionally used for masticatory performance test, such as peanut and raw carrot. The hardness of the wax cube could be modified to simulate a range of test foods by changing mixture ratio of soft and hard paraffin wax. [source]

Mental tests and fossils

JOURNAL OF THE HISTORY OF THE BEHAVIORAL SCIENCES, Issue 4 2004
Richard A. Littman
This article investigates the origins of the intelligence test item known as the Ball and Field in Lewis M. Terman's Stanford Revision of the Binet-Simon Intelligence Scale. The question was initially raised by the resemblance of paleontological ocean bed floor tracings left by ancient creatures to the responses produced by children given the Ball and Field Test. A version of the Ball and Field Test was invented by Clifton F. Hodge, one of Terman's graduate school instructors who devised it as a result of his observations about how birds and other animals navigated and found their way. He then tested how humans and children located hidden objects and found that, in many ways, animals and humans used similar strategies for getting home or finding objects. © 2004 Wiley Periodicals, Inc. [source]

Modelling approaches to compare sorption and degradation of metsulfuron-methyl in laboratory micro-lysimeter and batch experiments

PEST MANAGEMENT SCIENCE (FORMERLY: PESTICIDE SCIENCE), Issue 12 2003
Maik Heistermann
Abstract Results of laboratory batch studies often differ from those of outdoor lysimeter or field plot experiments, with respect to degradation as well as sorption. Laboratory micro-lysimeters are a useful device for closing the gap between laboratory and field by both including relevant transport processes in undisturbed soil columns and allowing controlled boundary conditions. In this study, sorption and degradation of the herbicide metsulfuron-methyl in a loamy silt soil were investigated by applying inverse modelling techniques to data sets from different experimental approaches under laboratory conditions at a temperature of 10 °C: first, batch-degradation studies and, second, column experiments with undisturbed soil cores (28 cm length × 21 cm diameter). The column experiments included leachate and soil profile analysis at two different run times. A sequential extraction method was applied in both study parts in order to determine different binding states of the test item within the soil. Data were modelled using ModelMaker and Hydrus-1D/2D. Metsulfuron-methyl half-life in the batch-experiments (t1/2 = 66 days) was shown to be about four times higher than in the micro-lysimeter studies (t1/2 about 17 days). Kinetic sorption was found to be a significant process both in batch and column experiments. Applying the one-rate-two-site kinetic sorption model to the sequential extraction data, it was possible to associate the stronger bonded fraction of metsulfuron-methyl with its kinetically sorbed fraction in the model. Although the columns exhibited strong significance of multi-domain flow (soil heterogeneity), the comparison between bromide and metsulfuron-methyl leaching and profile data showed clear evidence for kinetic sorption effects. The use of soil profile data had significant impact on parameter estimates concerning sorption and degradation. The simulated leaching of metsulfuron-methyl as it resulted from parameter estimation was shown to decrease when soil profile data were considered in the parameter estimation procedure. Moreover, it was shown that the significance of kinetic sorption can only be demonstrated by the additional use of soil profile data in parameter estimation. Thus, the exclusive use of efflux data from leaching experiments at any scale can lead to fundamental misunderstandings of the underlying processes. Copyright © 2003 Society of Chemical Industry [source]
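
To put the two half-lives quoted in this abstract side by side, the sketch below converts each to a rate constant and a remaining fraction. It assumes simple single first-order decay, C(t) = C0 · exp(−k·t) with k = ln 2 / t1/2. That is a simplification for illustration only: the study itself fitted kinetic sorption models in ModelMaker and Hydrus, and the function and constant names here are hypothetical.

```python
import math


def rate_constant(half_life_days: float) -> float:
    """First-order rate constant k (1/day) from a half-life: k = ln(2) / t_half."""
    return math.log(2) / half_life_days


def fraction_remaining(half_life_days: float, t_days: float) -> float:
    """Fraction of the initial amount left after t_days under first-order decay."""
    return math.exp(-rate_constant(half_life_days) * t_days)


# Half-lives quoted in the abstract (days).
BATCH_HALF_LIFE = 66.0
LYSIMETER_HALF_LIFE = 17.0

# Ratio of the two half-lives (the abstract describes it as "about four times").
half_life_ratio = BATCH_HALF_LIFE / LYSIMETER_HALF_LIFE

for label, t_half in (("batch", BATCH_HALF_LIFE), ("micro-lysimeter", LYSIMETER_HALF_LIFE)):
    left = fraction_remaining(t_half, 30.0)
    print(f"{label}: k = {rate_constant(t_half):.4f} per day, {left:.1%} left after 30 days")
```

The ratio 66/17 is roughly 3.9, which matches the abstract's "about four times higher".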

2162: New aspects of the Slug Mucosal Irritation (SMI) assay: Detecting ocular stinging, itching and burning sensations

ACTA OPHTHALMOLOGICA, Issue 2010
J LENOIR
Purpose Our eyes are one of the most important senses. They are very sensitive and irritations may occur easily. A screening method for ocular discomfort would be very helpful in the development and refinement of formulations. In the past, the Slug Mucosal Irritation (SMI) assay demonstrated a relation between an increased mucus production (MP) in slugs and an elevated incidence of stinging, itching and burning (SIB) in human eyes. The aim of this study is to compare subjective ocular discomfort caused by shampoos evaluated in volunteers with results of the SIB-procedure. Methods The stinging potency of 1 artificial tear and 10 shampoos was evaluated with the SIB-procedure by placing 3 slugs per treatment group 3 times on 100 µl of the test item. After each 15 min contact period, MP was measured. Evaluation of the results is based upon the total MP during 3 repeated contact periods. Experiments were repeated 3 times. A Human Eye Irritation test with the same test items will be set up (12-period cross-over study, 24 volunteers, study approved by an independent Commission for Medical Ethics, associated with Ghent University Hospital). The participants are dripped 10 µl of a 5% or 10% shampoo dilution in water or the artificial tear in 1 eye, while in the other eye 10 µl of water is administered. The evaluation of the test substances is done both by participants and the ophthalmologist at several time points. Conclusion With the obtained results we will be able to improve the newly developed protocol and examine the predictability with reference to non- and mildly irritating formulations in humans. We hope to conclude that the SIB-procedure is a good tool to predict clinical ocular discomfort. [source]

Effects of back care education in elementary schoolchildren

ACTA PAEDIATRICA, Issue 8 2000
G Cardon
The purpose of this study was to investigate the effects of a back care education programme, consisting of six sessions of 1 h each, in fourth- and fifth-grade elementary schoolchildren. Testing consisted of a practical performance and a back care knowledge test. Forty-two subjects and 36 controls performed a pre-test and were tested within 1 wk after the programme. To monitor effects and follow-up effects on a larger sample, 82 different pupils were tested within 1 wk after the programme and 116 other children 3 mo after. Both larger samples were compared with one group of 129 controls. Interrater reliability for the test items of the practical assessment was high; intraclass correlation coefficients varied from 0.785 to 0.980. In the pre/post design study, interaction between time and condition was significant for the sum score of the practical assessment and for the knowledge test (p < 0.001), with higher scores for the intervention group (15% improvement for the knowledge test score, 31.6% for the practical sum score). Significantly higher sum scores for the knowledge test and for all practical assessment items were found in the intervention groups, tested within 1 wk and 3 mo after the programme, in comparison with the control group (p <0.001). Conclusion: The effectiveness of a primary educational prevention programme on back care principles was demonstrated in this study. Effectiveness, long-term outcomes and behavioural changes need further evaluation to optimize back care prevention programmes for elementary schoolchildren. [source]

Accommodating variability in voice and foreign accent: flexibility of early word representations

DEVELOPMENTAL SCIENCE, Issue 4 2009
Rachel Schmale
In six experiments with English-learning infants, we examined the effects of variability in voice and foreign accent on word recognition. We found that 9-month-old infants successfully recognized words when two native English talkers with dissimilar voices produced test and familiarization items (Experiment 1). When the domain of variability was shifted to include variability in voice as well as in accent, 13-, but not 9-month-olds, recognized a word produced across talkers when only one had a Spanish accent (Experiments 2 and 3). Nine-month-olds accommodated some variability in accent by recognizing words when the same Spanish-accented talker produced familiarization and test items (Experiment 4). However, 13-, but not 9-month-olds, could do so when test and familiarization items were produced by two distinct Spanish-accented talkers (Experiments 5 and 6). These findings suggest that, although monolingual 9-month-olds have abstract phonological representations, these representations may not be flexible enough to accommodate the modifications found in foreign-accented speech. [source]

Alignment of Mathematics State-Level Standards and Assessments: The Role of Reviewer Agreement

EDUCATIONAL MEASUREMENT: ISSUES AND PRACTICE, Issue 2 2007
Noreen M. Webb
This article examines the role of reviewer agreement in judgments about alignment between tests and standards. We used case data from three state alignment studies to explore how different approaches to incorporating reviewer agreement change alignment conclusions. The three case studies showed varying degrees of reviewer agreement about correspondences between objectives and test items. Moreover, taking into account reviewer agreement in the analyses sometimes had a marked effect on alignment conclusions. We discuss reasons for differences across case studies and alignment approaches, as well as implications for future alignment efforts. [source]

Use of Knowledge, Skill, and Ability Statements in Developing Licensure and Certification Examinations

EDUCATIONAL MEASUREMENT: ISSUES AND PRACTICE, Issue 1 2005
Ning Wang
The task inventory approach is commonly used in job analysis for establishing content validity evidence supporting the use and interpretation of licensure and certification examinations. Although the results of a task inventory survey provide job task-related information that can be used as a reliable and valid source for test development, it is often the knowledge, skills, and abilities (KSAs) required for performing the tasks, rather than the job tasks themselves, which are tested by licensure and certification exams. This article presents a framework that addresses the important role of KSAs in developing and validating licensure and certification examinations. This includes the use of KSAs in linking job task survey results to the test content outline, transferring job task weights to test specifications, and eventually applying the results to the development of the test items. The impact of using KSAs in the development of test specifications is illustrated from job analyses for two diverse professions. One method for transferring job task weights from the job analysis to test specifications through KSAs is also presented, along with examples. The two examples demonstrated in this article are taken from nursing certification and real estate licensure programs. However, the methodology for using KSAs to link job tasks and test content is also applicable in the development of teacher credentialing examinations. [source]

Distinguishing between task and contextual performance for nurses: development of a job performance scale

JOURNAL OF ADVANCED NURSING, Issue 6 2007
Jaimi H. Greenslade
Abstract Title. Distinguishing between task and contextual performance for nurses: development of a job performance scale. Aim. This paper is a report of the development and validation of a new job performance scale based on an established job performance model. Background. Previous measures of nursing quality are atheoretical and fail to incorporate the complete range of behaviours performed. Thus, an up-to-date measure of job performance is required for assessing nursing quality. Methods. Test construction involved systematic generation of test items using focus groups, a literature review, and an expert review of test items. A pilot study was conducted to determine the multidimensional nature of the taxonomy and its psychometric properties. All data were collected in 2005. Findings. The final version of the nursing performance taxonomy included 41 behaviours across eight dimensions of job performance. Results from preliminary psychometric investigations suggest that the nursing performance scale has good internal consistency, good convergent validity and good criterion validity. Conclusion. The findings give preliminary support for a new job performance scale as a reliable and valid tool for assessing nursing quality. However, further research using a larger sample and nurses from a broader geographical region is required to cross-validate the measure. This scale may be used to guide hospital managers regarding the quality of nursing care within units and to guide future research in the area. [source]

A novel approach for screening discrete variations in organic synthesis

JOURNAL OF CHEMOMETRICS, Issue 5 2001
Rolf Carlson
Abstract In this paper we present a general strategy for screening discrete variations in organic synthesis. The strategy is based upon principal properties, i.e. principal component characterization of the constituents defining the reaction system. The first step is to select subsets of test items from each class of constituents defining the reaction space, i.e. substrates, reagents, solvents, catalysts, etc., so that the selected items from each class cover the properties considered. The second step is to construct a candidate matrix which contains all possible combinations of the items in the subsets. This matrix is a full multilevel factorial design. The third step is to assign a tentative model for the screening experiment and to construct the corresponding candidate model matrix. The fourth step is to select experiments to yield an experimental design that spans the variable space efficiently and that also gives good estimates of the model parameters. We present an algorithm that uses singular value decomposition to select experiments. The proposed strategy is then illustrated with an example of the Fischer indole synthesis. Copyright © 2001 John Wiley & Sons, Ltd. [source]
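
The second step described in this abstract, building a candidate matrix of all combinations of the selected items, can be sketched in a few lines. The class names and item labels below are placeholders invented for illustration, not the constituents used in the paper, and the later steps (the tentative model matrix and the selection of experiments by singular value decomposition) are not attempted here.

```python
from itertools import product

# Hypothetical subsets of test items, one list per class of constituent.
subsets = {
    "substrate": ["S1", "S2", "S3"],
    "reagent": ["R1", "R2"],
    "solvent": ["V1", "V2", "V3", "V4"],
}


def candidate_matrix(item_subsets):
    """Return every combination of one item per class (a full factorial design).

    Each row is a dict mapping class name to the chosen item.
    """
    names = list(item_subsets)
    levels = (item_subsets[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*levels)]


candidates = candidate_matrix(subsets)
print(len(candidates), "candidate experiments")
print(candidates[0])
```

The number of rows is the product of the subset sizes (3 × 2 × 4 here), which is why a selection step is needed before running experiments.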

A computer-assisted test design and diagnosis system for use by classroom teachers

JOURNAL OF COMPUTER ASSISTED LEARNING, Issue 6 2005
Q. He
Abstract Computer-assisted assessment (CAA) has become increasingly important in education in recent years. A variety of computer software systems have been developed to help assess the performance of students at various levels. However, such systems are primarily designed to provide objective assessment of students and analysis of test items, and focus has been mainly placed on higher and further education. Although there are commercial professional systems available for use by primary and secondary educational institutions, such systems are generally expensive and require skilled expertise to operate. In view of the rapid progress made in the use of computer-based assessment for primary and secondary students by education authorities here in the UK and elsewhere, there is a need to develop systems which are economic and easy to use and can provide the necessary information that can help teachers improve students' performance. This paper presents the development of a software system that provides a range of functions including generating items and building item banks, designing tests, conducting tests on computers and analysing test results. Specifically, the system can generate information on the performance of students and test items that can be easily used to identify curriculum areas where students are under performing. A case study based on data collected from five secondary schools in Hong Kong involved in the Curriculum, Evaluation and Management Centre's Middle Years Information System Project, Durham University, UK, has been undertaken to demonstrate the use of the system for diagnostic and performance analysis. [source]

Testing Features of Graphical DIF: Application of a Regression Correction to Three Nonparametric Statistical Tests

JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2006
Daniel M. Bolt
Inspection of differential item functioning (DIF) in translated test items can be informed by graphical comparisons of item response functions (IRFs) across translated forms. Due to the many forms of DIF that can emerge in such analyses, it is important to develop statistical tests that can confirm various characteristics of DIF when present. Traditional nonparametric tests of DIF (Mantel-Haenszel, SIBTEST) are not designed to test for the presence of nonuniform or local DIF, while common probability difference (P-DIF) tests (e.g., SIBTEST) do not optimize power in testing for uniform DIF, and thus may be less useful in the context of graphical DIF analyses. In this article, modifications of three alternative nonparametric statistical tests for DIF, Fisher's χ² test, Cochran's Z test, and Goodman's U test (Marascuilo & Slaughter, 1981), are investigated for these purposes. A simulation study demonstrates the effectiveness of a regression correction procedure in improving the statistical performance of the tests when using an internal test score as the matching criterion. Simulation power and real data analyses demonstrate the unique information provided by these alternative methods compared to SIBTEST and Mantel-Haenszel in confirming various forms of DIF in translated tests. [source]
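
For context on the traditional Mantel-Haenszel procedure that this abstract uses as a point of comparison, the sketch below computes the Mantel-Haenszel common odds ratio for one item across matched score levels. It is not the regression-corrected procedure the article proposes. The stratum counts are made up for illustration, and the function name is hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio for one item.

    `strata` is an iterable of 2x2 tables, one per matched score level, each
    given as (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    A value of 1 indicates no uniform DIF; values above 1 favour the
    reference group.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no stratum has b * c > 0")
    return numerator / denominator


# Made-up counts for two score levels of 100 examinees each.
example_strata = [(40, 10, 30, 20), (30, 20, 20, 30)]
alpha_mh = mantel_haenszel_odds_ratio(example_strata)
print(f"MH common odds ratio: {alpha_mh:.3f}")
```

Because the statistic pools a single odds ratio over all score levels, it summarises uniform DIF only, which is the limitation the abstract points to.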

Comparison of the Performance of Varimax and Promax Rotations: Factor Structure Recovery for Dichotomous Items

JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 1 2006
Holmes Finch
Nonlinear factor analysis is a tool commonly used by measurement specialists to identify both the presence and nature of multidimensionality in a set of test items, an important issue given that standard Item Response Theory models assume a unidimensional latent structure. Results from most factor-analytic algorithms include loading matrices, which are used to link items with factors. Interpretation of the loadings typically occurs after they have been rotated in order to amplify the presence of simple structure. The purpose of this simulation study is to compare the ability of two commonly used methods of rotation, Varimax and Promax, in terms of their ability to correctly link items to factors and to identify the presence of simple structure. Results suggest that the two approaches are equally able to recover the underlying factor structure, regardless of the correlations among the factors, though the oblique method is better able to identify the presence of a "simple structure." These results suggest that for identifying which items are associated with which factors, either approach is effective, but that for identifying simple structure when it is present, the oblique method is preferable. [source]

The Impact of Omitted Responses on the Accuracy of Ability Estimation in Item Response Theory

JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 3 2001
R. J. De Ayala
Practitioners typically face situations in which examinees have not responded to all test items. This study investigated the effect on an examinee's ability estimate when an examinee is presented an item, has ample time to answer, but decides not to respond to the item. Three approaches to ability estimation (biweight estimation, expected a posteriori, and maximum likelihood estimation) were examined. A Monte Carlo study was performed and the effect of different levels of omissions on the simulee's ability estimates was determined. Results showed that the worst estimation occurred when omits were treated as incorrect. In contrast, substitution of 0.5 for omitted responses resulted in ability estimates that were almost as accurate as those using complete data. Implications for practitioners are discussed. [source]
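
The contrast this abstract draws between scoring omits as incorrect and substituting 0.5 can be illustrated with a toy maximum likelihood estimate. The sketch below assumes a Rasch (one-parameter logistic) model with known item difficulties and uses a plain grid search. The abstract does not name the IRT model, and the difficulties and response pattern are invented, so this shows only the direction of the effect and does not reproduce the study's Monte Carlo design.

```python
import math


def rasch_prob(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def log_likelihood(theta, difficulties, scores):
    """Log-likelihood of item scores; fractional scores (e.g. 0.5) are allowed."""
    total = 0.0
    for b, u in zip(difficulties, scores):
        p = rasch_prob(theta, b)
        total += u * math.log(p) + (1.0 - u) * math.log(1.0 - p)
    return total


def ml_ability(difficulties, responses, omit_score):
    """Grid-search ML ability estimate; None marks an omitted response."""
    scores = [omit_score if r is None else float(r) for r in responses]
    grid = [-4.0 + 0.01 * i for i in range(801)]  # theta from -4 to 4
    return max(grid, key=lambda t: log_likelihood(t, difficulties, scores))


# Invented example: five items, two of them omitted.
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
responses = [1, 1, None, 0, None]

theta_omit_wrong = ml_ability(difficulties, responses, omit_score=0.0)
theta_omit_half = ml_ability(difficulties, responses, omit_score=0.5)
print(f"omits scored 0:   theta = {theta_omit_wrong:.2f}")
print(f"omits scored 0.5: theta = {theta_omit_half:.2f}")
```

Scoring omits as 0 pulls the estimate downward relative to the 0.5 substitution. With a fractional score the function being maximised is strictly a pseudo-likelihood, which is part of why this is offered as a sketch only.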

An Empirical Investigation Demonstrating the Multidimensional DIF Paradigm: A Cognitive Explanation for DIF

JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 2 2001
Cindy M. Walker
Differential Item Functioning (DIF) is traditionally used to identify different item performance patterns between intact groups, most commonly involving race or sex comparisons. This study advocates expanding the utility of DIF as a step in construct validation. Rather than grouping examinees based on cultural differences, the reference and focal groups are chosen from two extremes along a distinct cognitive dimension that is hypothesized to supplement the dominant latent trait being measured. Specifically, this study investigates DIF between proficient and non-proficient fourth- and seventh-grade writers on open-ended mathematics test items that require students to communicate about mathematics. It is suggested that the occurrence of DIF in this situation actually enhances, rather than detracts from, the construct validity of the test because, according to the National Council of Teachers of Mathematics (NCTM), mathematical communication is an important component of mathematical ability, the dominant construct being assessed. However, the presence of DIF influences the validity of inferences that can be made from test scores and suggests that two scores should be reported, one for general mathematical ability and one for mathematical communication. The fact that currently only one test score is reported, a simple composite of scores on multiple-choice and open-ended items, may lead to incorrect decisions being made about examinees. [source]

Comparisons between a mixing ability test and masticatory performance tests using a brittle or an elastic test food

Objective and subjective hardness of a test item used for evaluating food mixing ability

Knowledge and Skills for PISA – Assessing the Assessment

JOURNAL OF PHILOSOPHY OF EDUCATION, Issue 1 2007
NINA BONDERUP DOHN
This article gives a critique of the methodology of OECD's Programme for International Student Assessment (PISA). It is argued that PISA is invalidated by the fact that the methodology chosen does not constitute an adequate operationalisation of the question of inquiry. Therefore, contrary to the claims of PISA, PISA is not an assessment of the 'knowledge and skills for life' of students, but only of 'knowledge and skills in assessment situations'. Even this latter form of assessment is not fully reliable, however, because of problems at the level of concrete test items and because of an inherent confusion of relative and absolute evaluation. [source]

Using data mining to predict K-12 students' performance on large-scale assessment items related to energy

JOURNAL OF RESEARCH IN SCIENCE TEACHING, Issue 5 2008
Xiufeng Liu
This article reports a study on using data mining to predict K-12 students' competence levels on test items related to energy. Data sources are the 1995 Third International Mathematics and Science Study (TIMSS), 1999 TIMSS-Repeat, 2003 Trends in International Mathematics and Science Study (TIMSS), and the National Assessment of Educational Progress (NAEP). Student population performances, that is, percentages correct, are the object of prediction. Two data mining algorithms, C4.5 and M5, are used to construct a decision tree and a linear function to predict students' performance levels. A combination of factors related to content, context, and cognitive demand of items and to students' grade levels is found to predict student population performances on test items. Cognitive demands make the most significant contribution to the prediction. The decision tree and linear function agree with each other on predictions. We end the article by discussing implications of findings for future science content standard development and energy concept teaching. © 2007 Wiley Periodicals, Inc. J Res Sci Teach 45: 554-573, 2008 [source]

Exploring alternative conceptions from Newtonian dynamics and simple DC circuits: Links between item difficulty and item confidence

JOURNAL OF RESEARCH IN SCIENCE TEACHING, Issue 2 2006
Maja Planinic
Croatian 1st-year and 3rd-year high-school students (N = 170) completed a conceptual physics test. Students were evaluated with regard to two physics topics: Newtonian dynamics and simple DC circuits. Students answered test items and also indicated their confidence in each answer. Rasch analysis facilitated the calculation of three linear measures: (a) an item-difficulty measure based upon all responses, (b) an item-confidence measure based upon correct student answers, and (c) an item-confidence measure based upon incorrect student answers. Comparisons were made with regard to item difficulty and item confidence. The results suggest that students hold stronger alternative conceptions in Newtonian dynamics than in DC circuits, a topic characterized by much lower student confidence on both correct and incorrect answers. A systematic and significant difference between mean student confidence on Newtonian dynamics and DC circuits items was found in both student groups. Findings suggest some steps for physics instruction in Croatia as well as areas of further research for those in science education interested in additional techniques of exploring alternative conceptions. © 2005 Wiley Periodicals, Inc. J Res Sci Teach 43: 150-171, 2006 [source]

Performance of students in project-based science classrooms on a national measure of science achievement

JOURNAL OF RESEARCH IN SCIENCE TEACHING, Issue 5 2002
Rebecca M. Schneider
Reform efforts in science education emphasize the importance of supporting students' construction of knowledge through inquiry. Project-based science (PBS) is an ambitious approach to science instruction that addresses concerns of reformers. A sample of 142 10th- and 11th-grade students enrolled in a PBS program completed the 12th-grade 1996 National Assessment of Educational Progress (NAEP) science test. Compared with subgroups identified by NAEP that most closely matched our student sample, White and middle class, PBS students outscored the national sample on 44% of NAEP test items. This study shows that students participating in a PBS curriculum were prepared for this type of testing. Educators should be encouraged to use inquiry-based approaches such as PBS to implement reform in their schools. © 2002 Wiley Periodicals, Inc. J Res Sci Teach 39: 410-422, 2002 [source]

Kindergarten Predictors of Math Learning Disability

LEARNING DISABILITIES RESEARCH & PRACTICE, Issue 3 2005
Michèle M. M. Mazzocco
The aim of the present study was to address how to effectively predict mathematics learning disability (MLD). Specifically, we addressed whether cognitive data obtained during kindergarten can effectively predict which children will have MLD in third grade, whether an abbreviated test battery could be as effective as a standard psychoeducational assessment at predicting MLD, and whether the abbreviated battery corresponded to the literature on MLD characteristics. Participants were 226 children who enrolled in a 4-year prospective longitudinal study during kindergarten. We administered measures of mathematics achievement, formal and informal mathematics ability, visual-spatial reasoning, and rapid automatized naming and examined which test scores and test items from kindergarten best predicted MLD at grades 2 and 3. Statistical models using standardized scores from the entire test battery correctly classified 80-83 percent of the participants as having, or not having, MLD. Regression models using scores from only individual test items were less predictive than models containing the standard scores, except for models using a specific subset of test items that dealt with reading numerals, number constancy, magnitude judgments of one-digit numbers, or mental addition of one-digit numbers. These models were as accurate in predicting MLD as was the model including the entire set of standard scores from the battery of tests examined. Our findings indicate that it is possible to effectively predict which kindergartners are at risk for MLD, and thus the findings have implications for early screening of MLD. [source]

Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments

MEDICAL EDUCATION, Issue 2 2008
Marie Tarrant
Context: Multiple-choice questions (MCQs) are frequently used to assess students in health science disciplines. However, few educators have formal instruction in writing MCQs and MCQ items often have item-writing flaws. The purpose of this study was to examine the impact of item-writing flaws on student achievement in high-stakes assessments in a nursing programme in an English-language university in Hong Kong. Methods: From a larger sample, we selected 10 summative test papers that were administered to undergraduate nursing students in 1 nursing department. All test items were reviewed for item-writing flaws by a 4-person consensus panel. Items were classified as 'flawed' if they contained ≥ 1 flaw. Items not containing item-writing violations were classified as 'standard'. For each paper, 2 separate scales were computed: a total scale which reflected the characteristics of the assessment as administered and a standard scale which reflected the characteristics of a hypothetical assessment including only unflawed items. Results: The proportion of flawed items on the 10 test papers ranged from 28-75%; 47.3% of all items were flawed. Fewer examinees passed the standard scale than the total scale (748 [90.6%] versus 779 [94.3%]). Conversely, the proportion of examinees obtaining a score ≥ 80% was higher on the standard scale than the total scale (173 [20.9%] versus 120 [14.5%]). Conclusions: Flawed MCQ items were common in high-stakes nursing assessments but did not disadvantage borderline students, as has been previously demonstrated. Conversely, high-achieving students were more likely than borderline students to be penalised by flawed items. [source]
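
The two scales described in the Methods can be sketched as follows. This is only an illustration: the item flags, the scored responses, and the helper `percent_score` are hypothetical, and the abstract does not state the pass mark or how scores were scaled. Percent correct is assumed.

```python
def percent_score(responses, include):
    """Percent correct over the items flagged True in `include`."""
    kept = [r for r, keep in zip(responses, include) if keep]
    return 100.0 * sum(kept) / len(kept)


# Hypothetical 10-item paper; True marks an item judged to be flawed.
flawed = [False, True, False, True, False, False, True, False, False, False]
all_items = [True] * len(flawed)      # total scale: the paper as administered
unflawed = [not f for f in flawed]    # standard scale: unflawed items only

# Hypothetical scored responses (1 = correct, 0 = incorrect).
examinees = {
    "borderline": [1, 1, 0, 1, 0, 0, 1, 1, 0, 0],
    "high": [1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
}

# For each examinee: (total-scale score, standard-scale score).
scores = {
    name: (percent_score(r, all_items), percent_score(r, unflawed))
    for name, r in examinees.items()
}

for name, (total, standard) in scores.items():
    print(f"{name}: total scale {total:.1f}%, standard scale {standard:.1f}%")
```

The made-up examinees are chosen to mirror the two reported patterns: one scores lower once flawed items are removed, the other scores higher. They are not drawn from the study's data.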

Developing an Oral Communication Strategy Inventory

MODERN LANGUAGE JOURNAL, Issue 2 2006
YASUO NAKATANI
This study focuses on how valid information about learner perception of strategy use during communicative tasks can be gathered systematically from English as a foreign language (EFL) learners. First, the study attempted to develop a questionnaire for statistical analysis, named the Oral Communication Strategy Inventory (OCSI). The research project consisted of 3 stages: an open-ended questionnaire to identify learners' general perceptions of strategies for oral interaction (N = 80); a pilot factor analysis for selecting test items (N = 400); and a final factor analysis to obtain a stable self-reported instrument (N = 400). The resulting OCSI includes 8 categories of strategies for coping with speaking problems and 7 categories for coping with listening problems during communication. The applicability of the survey instrument was subsequently examined in a simulated communicative test for EFL students (N = 62). To validate the use of the instrument, participant reports on the Strategy Inventory for Language Learning (SILL) were compared with the result of the OCSI. When combined with the oral test scores, it was revealed that students with high oral proficiency tended to use specific strategies, such as social affective strategies, fluency-oriented strategies, and negotiation of meaning. [source]