Reading Your Test Analysis Report

The purpose of this bulletin is to help instructors answer two common questions about their multiple-choice tests: "How well did my test work?" and, more importantly, "How can I improve this test?" This memo describes the information contained in the test analysis report from the Evaluation and Examination Service (EES). Suggestions are offered for using this information to improve classroom tests.
Identification, Page 1 (sample)

The bottom of the first page shows the number of items on the test that were scored, how the test was scored, and contact information for EES.

Item Analysis, Pages 2 and 3 (sample)
The difficulty index is something of a misnomer because it actually reflects how easy an item was for the class (the percent of students answering it correctly). For tests that are intended to differentiate among students, maximum differentiation can be achieved in tests of moderate difficulty (i.e., when items are answered correctly by 50% to 80% of the group). Difficulty indices diverging widely from this recommended range may indicate the need for item revision. For tests intended to show students' levels of content mastery, however, high item difficulty values are not uncommon, since it is anticipated that most (possibly all) students should answer each item correctly.

The item discrimination index (DISCRIM.) indicates whether the students who score well overall on the exam are answering the item correctly. The generally accepted value for DISCRIM. is +.40 or above. Values of +.40 to +1.00 indicate that students who performed well on the total test performed substantially better on the item than did students who performed poorly on the total test. Values on the order of +.20 to +.39 are acceptable but indicate that the item probably should be reviewed. Very low values (less than +.20) or negative values indicate that an item needs either revision or replacement. A negative value indicates a particularly undesirable situation in which students who earned high scores on the test performed more poorly on that item than students who earned low scores on the test. In such instances, the instructor is best advised to review the item for ambiguities that may have made a distracter more plausible to higher-achieving students than the keyed response.

The values of both the difficulty and discrimination indices are fairly unstable for small groups. For this reason, item statistics are less useful for making item revisions when classes of fewer than 20 students are involved.
Discrimination indices tend to be related to difficulty indices such that maximum discrimination (.40 or larger) is likely only for the middle range of difficulty values (50%-80%). Difficulty indices higher or lower than 50%-80% usually produce lower discrimination values.
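
The two item statistics described above can be illustrated with a short Python sketch. It is not the EES scoring program: the function names and the six-student data set are hypothetical, and the discrimination value is computed as a point-biserial correlation, a simpler relative of the biserial correlation that the report uses for DISCRIM. The total score here also includes the item itself, which is an assumption of this sketch.

```python
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """Percent of students who answered the item correctly (0 to 100)."""
    return 100.0 * sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item score and the total score.

    Assumption: this stands in for the report's biserial DISCRIM. value;
    the two statistics are related but not numerically identical.
    """
    p = mean(item_scores)          # proportion answering correctly
    sd = pstdev(total_scores)      # population SD of total scores
    if p in (0.0, 1.0) or sd == 0.0:
        return 0.0                 # undefined case; this sketch reports 0
    right = mean(t for i, t in zip(item_scores, total_scores) if i == 1)
    wrong = mean(t for i, t in zip(item_scores, total_scores) if i == 0)
    return (right - wrong) / sd * (p * (1 - p)) ** 0.5


# Hypothetical data: 1 = correct, 0 = incorrect, one entry per student.
item = [1, 1, 1, 0, 0, 0]
total = [10, 9, 8, 5, 4, 3]
print(item_difficulty(item))                 # 50.0
print(round(point_biserial(item, total), 2))
```

With only six students the result is far less stable than the 20-student guideline above would require; the data are kept small only so the arithmetic can be checked by hand.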
Three items have been chosen from the item analysis tables on pages 2 and 3 to illustrate the interpretation of the test item statistics. An evaluation such as this can be useful for diagnosing problems with test items so that poor items can be revised for use on future examinations.

Item 1. Ninety-two percent of the group answered the item correctly, indicating that this is a very easy item. Consequently, the item did not discriminate well between high and low test scorers. A clue to why the item was too easy can be found in the response count row. Responses four and five were not selected at all, and the other two distracters were chosen by only one student each. The correct answer to the item was very obvious, either because most students learned the concept measured by the item or because the distracters were so obviously implausible. If this item is being used as a check to ensure that the majority of students understood this concept, you would probably keep it in your item bank. However, for tests where the purpose is to distinguish among students' grasp of the concept, this item should be replaced or its distracters rewritten.

Item 5. Though this item is within the moderate difficulty range, it is a "negative discriminator." That is, the correct answer was chosen by more students with low total test scores than students with high total test scores. Part of the problem seems to be response option #2. The value of DISC(R) for option #2 is high and positive (.54) when it should be negative. There is something about the wording of option #2 that makes it more attractive as the correct answer for high scorers than the actual correct response. This situation is often the result of an incorrect scoring key.

Item 12. The difficulty index of 58% indicates that the item is of moderate difficulty, neither too easy nor too difficult. The discrimination index of .66 is very high and, therefore, very good.
It means that students with high test scores tended to answer correctly and those with low scores tended to miss the item. This interpretation is verified by the values of DISC(R) for the distracters, which are all negative. All of the distracters seem to be functioning well.

Distribution of Scores, Page 4 (sample)

The SCORE column lists the raw score values from the highest obtained score to the lowest. The FREQUENCY column indicates the number of students who received each particular score. The CUMULATIVE FREQUENCY DISTRIBUTION shows how many students scored at or below a particular score value. The cumulative frequency distribution is also used in computing the percentile ranks shown in the next column. The PERCENTILE RANK tells the student what percentage of students earned a score equal to or less than their own.

The STANDARD SCORE column contains the linear T-score that corresponds to each of the raw scores. Like percentile ranks, these standard scores can be used to interpret relative group standing for a given student. Since the mean of linear T-scores is always 50 and the standard deviation is 10, this information can be used to judge how far above or below the mean a given student has performed. Standard scores are also more useful than raw scores when several scores are to be weighted and combined for a final course grade. The use of standard scores ensures that the desired weights will in fact be used. An explanation of how to use standard scores for grading is provided in EES Technical Bulletin #5, Assigning Course Grades.

Distribution of Item Difficulties, Page 5 (sample)
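
Returning briefly to the PERCENTILE RANK and STANDARD SCORE columns of the score distribution on page 4, the following Python sketch shows one way those two values could be derived from a list of raw scores. The function names and the ten scores are hypothetical. The percentile rank follows the definition given in this bulletin (percent of students at or below a score); the use of the population standard deviation for the T-score is an assumption, since the bulletin does not say which form EES uses.

```python
from statistics import mean, pstdev


def percentile_rank(score, all_scores):
    """Percent of students whose score is equal to or less than `score`."""
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100.0 * at_or_below / len(all_scores)


def linear_t_score(score, all_scores):
    """Linear T-score: mean 50, standard deviation 10.

    Assumption: the population standard deviation is used.
    """
    return 50.0 + 10.0 * (score - mean(all_scores)) / pstdev(all_scores)


# Hypothetical raw scores for a class of ten students.
scores = [22, 20, 20, 18, 15, 15, 15, 12, 10, 8]

for raw in sorted(set(scores), reverse=True):
    print(raw, percentile_rank(raw, scores), round(linear_t_score(raw, scores), 1))
```

A raw score equal to the class mean always maps to a T-score of 50, and each 10 T-score points correspond to one standard deviation, which is what makes scores from different tests comparable when they are weighted and combined.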
Distribution of Item Discriminations, Page 5

Test Summary, Page 5

The STANDARD DEVIATION is an index of the degree to which test scores vary within a group. The larger the standard deviation (for tests of the same length), the more variability there is in the test scores. The ideal standard deviation is about one-sixth of the effective range. To the extent that the actual standard deviation is less than this ideal value, the test scores are not able to discriminate as well as we might prefer. For example, a 25-item test should have a standard deviation of about 25/6 = 4.17. In our example, the standard deviation is 4.51, which reflects the test's effectiveness in spreading scores out to better discriminate among the students' abilities.

The RELIABILITY is a number that tells how consistent students' scores would be if you were able to give them a similar test with different items. Every time an instructor puts a test together, he or she uses items that are merely a sample from thousands of possible items. You want the test scores to give you the same information no matter which sample from the "thousands" you elected to use. This reliability coefficient is a number between 0.00 and 1.00 and can be interpreted like a correlation coefficient. The closer the value is to 1.00, the more reliable, or error-free, the test scores are.

The STANDARD ERROR OF MEASUREMENT provides another way of representing the accuracy of test scores. Its use requires the assumption that some component of error is part of any person's test score. This error can result from the content of the questions sampled and/or day-to-day factors that influence individual performance (e.g., memory fluctuation or guessing). All test scores contain some degree of error and, consequently, should be treated as an estimate of achievement level. The size of the standard error of measurement is an indication of how far from the actual test score the "real" error-free score might be.
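
The arithmetic behind these summary statistics can be sketched in a few lines of Python. The ideal standard deviation follows the one-sixth rule stated above. The standard error of measurement uses the conventional classical test theory formula, SD times the square root of (1 minus reliability); the bulletin does not state which formula EES applies, so treat that as an assumption. The reliability of 0.80 is a made-up value for illustration, not a figure from the sample report.

```python
import math


def ideal_standard_deviation(effective_range):
    """About one-sixth of the effective score range."""
    return effective_range / 6.0


def standard_error_of_measurement(sd, reliability):
    """Classical formula SD * sqrt(1 - reliability) (assumed, not from EES)."""
    return sd * math.sqrt(1.0 - reliability)


# 25-item test from the bulletin's example.
print(round(ideal_standard_deviation(25), 2))               # 4.17

# SD of 4.51 is from the sample report; 0.80 is a hypothetical reliability.
print(round(standard_error_of_measurement(4.51, 0.80), 2))
```

Under this formula the standard error shrinks toward zero as the reliability approaches 1.00 and equals the full standard deviation when the reliability is 0.00.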
The standard error of measurement for an average of scores on tests of the same length tends to be smaller than the standard error of measurement for any single test (i.e., one can be more sure that a student's observed score falls in a narrower band around the hypothetical error-free score). It is generally recommended that instructors not use a single test as the basis for grading (i.e., grades should not be assigned to individual components) but instead use the total (or average) of a number of tests given during the semester.

The MEAN DIFFICULTY of the test is defined as the arithmetic average of the item difficulties.

The MEAN DISCRIMINATION of the test is the average of the item biserial correlations (DISCRIM.) between the keyed response and the total score. (Correlations are transformed to standardized z scores, averaged, and then transformed back to a correlation.)

The Name Roster, Page 6 (sample)

The Error Listing, Pages 7 and 8 (sample)

Users who wish to have further assistance in interpreting the test analysis report should stop in or contact the Evaluation and Examination Service scoring office at 319-335-0359 or examscoring@uiowa.edu.

