Reading Your Test Analysis Report
The purpose of this bulletin is to help instructors answer two common questions about their multiple-choice tests: "How well did my test work?" and, more importantly, "How can I improve this test?" It describes the information contained in the test analysis program output from the Evaluation and Examination Service (EES) and offers suggestions for using this information to improve classroom tests.
Identification, Page 1 (sample)
Below the key are important messages that indicate which information has been saved for other EES programs. The COMPOSITE program (Technical Bulletin #24) is a record-keeping program that stores scores from tests and projects and combines them to form a total or composite score after each scoring request and at the conclusion of a course. The ITEMBANK program (Technical Bulletin #25) allows a user to create an extensive library of test items and item statistics that can be used to quickly and accurately produce tests. In the example given, scores have been saved for the Composite program but item statistics have not been saved for the Itembank program.
Item Analysis, Pages 2 and 3 (sample)
The difficulty index is somewhat of a misnomer because it actually reflects how easy an item was for the class: it is the percent of students answering the item correctly. For tests that are intended to differentiate among students, maximum differentiation can be achieved in tests of moderate difficulty (i.e., when items are answered correctly by 50% to 80% of the group). Difficulty indices diverging widely from this recommended range may indicate the need for item revision. For tests intended to show students' levels of content mastery, however, high item difficulty values (that is, high percentages correct) are not uncommon, since it is anticipated that most (possibly all) students should answer each item correctly.
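As an illustration of the definition above, the following Python sketch computes a difficulty index from a list of student responses and checks it against the 50% to 80% range. The function names and the class of 25 students are hypothetical and are not taken from the EES program.

```python
def difficulty_index(responses, key):
    """Percent of students whose response matches the keyed answer."""
    if not responses:
        raise ValueError("at least one response is required")
    correct = sum(1 for r in responses if r == key)
    return 100.0 * correct / len(responses)


def in_moderate_range(difficulty, low=50.0, high=80.0):
    """True when the index falls in the 50%-80% range recommended above."""
    return low <= difficulty <= high


# Hypothetical class of 25: 23 students chose the keyed option (1),
# and one student each chose options 2 and 3.
responses = [1] * 23 + [2, 3]
index = difficulty_index(responses, key=1)
print(index, in_moderate_range(index))  # 92.0 False
```

An index of 92 falls above the moderate range, so on a test meant to differentiate among students this item would be a candidate for review.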
The item discrimination index (DISCRIM.) indicates whether the students who are scoring well overall on the exam are answering the item correctly. The generally accepted value for DISCRIM. is +.40 or above. The values of +.40 to +1.00 indicate that students who performed well on the test performed substantially better on the item than did students who performed poorly on the total test. Values on the order of +.20 to +.39 are acceptable, but indicate that the item probably should be reviewed. Very low values (less than .20) or negative values indicate that an item is in need of either revision or replacement. A negative value indicates a particularly undesirable situation in which students who earned high scores on the test performed more poorly on that particular item than those students who earned low scores on the test. In such instances, the instructor is best advised to review the item for ambiguities which may have made a distracter more plausible for higher achieving students than the keyed response.
The values of both the difficulty and discrimination indices are fairly unstable for small groups. For this reason, item statistics are less useful for making item revisions when classes of fewer than 20 students are involved.
Discrimination indices tend to be related to difficulty indices such that maximum discrimination (.40 or larger) is likely only for the middle range of difficulty values (50%-80%). Difficulty indices higher or lower than 50%-80% usually produce lower discrimination values.
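The Test Summary section of this bulletin identifies DISCRIM. as a biserial correlation between the keyed response and the total score, but it does not give the exact formula the EES program uses. The Python sketch below is therefore an assumption: it computes the textbook point-biserial correlation, converts it to a biserial correlation, and applies the interpretive ranges described above. All function names are hypothetical, and the population standard deviation is assumed.

```python
from statistics import NormalDist, mean, pstdev


def point_biserial(item_correct, total_scores):
    """Correlation between a 0/1 item score and the total test score."""
    p = sum(item_correct) / len(item_correct)  # proportion answering correctly
    sd = pstdev(total_scores)                  # population SD (an assumption)
    if p in (0.0, 1.0) or sd == 0:
        raise ValueError("undefined when all students score alike")
    right = [s for c, s in zip(item_correct, total_scores) if c]
    wrong = [s for c, s in zip(item_correct, total_scores) if not c]
    return (mean(right) - mean(wrong)) / sd * (p * (1 - p)) ** 0.5


def biserial(item_correct, total_scores):
    """Biserial correlation derived from the point-biserial value."""
    r_pb = point_biserial(item_correct, total_scores)
    p = sum(item_correct) / len(item_correct)
    # Height of the standard normal curve at the point dividing p from 1 - p.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return r_pb * (p * (1 - p)) ** 0.5 / y


def review_advice(discrim):
    """Apply the interpretive ranges given in this bulletin."""
    if discrim >= 0.40:
        return "good"
    if discrim >= 0.20:
        return "acceptable, review"
    return "revise or replace"


# Hypothetical data: 1 = answered the item correctly, 0 = missed it.
item = [1, 1, 0, 1, 0, 1, 0, 0]
totals = [24, 22, 21, 19, 17, 15, 12, 10]
print(review_advice(biserial(item, totals)))
```

With very small groups, as in this toy example, the resulting values are unstable and a biserial correlation can even exceed 1.00, which is one more reason to interpret item statistics cautiously for small classes.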
Three items have been chosen from the item analysis tables on pages 2 and 3 to illustrate the interpretation of the test item statistics. An evaluation such as this can be useful for diagnosing problems with test items so that poor items can be revised for use on future examinations.
Item 1. Ninety-two percent of the group answered the item correctly, indicating that this is a very easy item. Consequently, the item did not discriminate well between high and low test scorers. A clue to why the item was too easy can be found in the response count row. Responses four and five were not selected at all, and the other two distracters, two and three, were chosen by only one student each. The correct answer to the item was very obvious, either because most students learned the concept measured by the item or because the distracters were so obviously implausible. The distracters should be rewritten.
Item 5. Though this item is within the moderate difficulty range, it is a "negative discriminator." That is, the correct answer was chosen by more low-scoring than high-scoring students. Part of the problem seems to be response option #2. The value of DISC(R) for option #2 is high and positive (.54) when it should be negative. There is something about the wording of option #2 that makes it more attractive as the correct answer for high scorers than the actual correct response.
Item 12. The difficulty index of 58% indicates the item is of moderate difficulty, neither too easy nor too difficult. The discrimination index of .66 is very high and, therefore, very good. It means that students with high test scores tended to answer correctly and those with low scores tended to miss the item. This interpretation is verified by the values of DISC(R) for the distracters, which are all negative. All of the distracters seem to be functioning well.
Distribution of Scores, Page 4 (sample)
The SCORE column lists the raw score values from the highest obtained score to the lowest. The FREQUENCY column indicates the number of students who received each particular score. The CUMULATIVE FREQUENCY DISTRIBUTION shows how many students scored at or below a particular score value. The cumulative frequency distribution is also used in computing the percentile ranks shown in the next column. The PERCENTILE RANK tells the student what percentage of students earned a score equal to or less than their own.
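The columns described above can be reproduced from a list of raw scores. The Python sketch below follows the definitions in this section (cumulative frequency counts students at or below a score, and percentile rank is that count as a percentage of the group). The function name and the scores are hypothetical, and the EES program may round or define percentile ranks slightly differently.

```python
from collections import Counter


def score_distribution(scores):
    """Return (score, frequency, cumulative frequency, percentile rank) rows,
    listed from the highest obtained score to the lowest."""
    n = len(scores)
    freq = Counter(scores)
    rows = []
    cumulative = 0
    for score in sorted(freq):          # accumulate from the lowest score up
        cumulative += freq[score]
        rows.append((score, freq[score], cumulative, 100.0 * cumulative / n))
    rows.reverse()                      # report order: highest score first
    return rows


# Hypothetical scores for four students.
for row in score_distribution([10, 9, 9, 7]):
    print(row)
# (10, 1, 4, 100.0)
# (9, 2, 3, 75.0)
# (7, 1, 1, 25.0)
```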
The STANDARD SCORE column contains the linear T-score which corresponds to each of the raw scores. Like percentile ranks, these standard scores can be used to interpret relative group standing for a given student. Since the mean of these standard scores is always 50 and the standard deviation is always 10, this information can be used to judge how far above or below the mean a given student has performed. Standard scores are also more useful than raw scores when several scores are to be weighted and combined for course grading purposes. The use of standard scores ensures that the desired weights will in fact be used. With raw scores, those grading components having larger standard deviations are given more weight.
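A linear T-score with a mean of 50 and a standard deviation of 10 can be sketched in Python as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption, since this bulletin does not say which form the EES program uses. The second function illustrates the weighting point made above: once each component is converted to T-scores, the intended weights apply regardless of each component's raw-score spread.

```python
from statistics import mean, pstdev


def t_scores(raw_scores):
    """Convert raw scores to linear T-scores (mean 50, SD 10)."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population SD (an assumption)
    if sd == 0:
        raise ValueError("T-scores are undefined when all scores are equal")
    return [50 + 10 * (x - m) / sd for x in raw_scores]


def weighted_composite(components, weights):
    """Weighted average of T-scores across several grading components.

    components: one list of raw scores per component, students in the
    same order in every list.
    """
    converted = [t_scores(c) for c in components]
    total_weight = sum(weights)
    return [
        sum(w * t[i] for w, t in zip(weights, converted)) / total_weight
        for i in range(len(components[0]))
    ]


# Hypothetical data: a 30-point quiz and a 300-point project.
quiz = [10, 20, 30]
project = [100, 200, 300]
print(weighted_composite([quiz, project], weights=[1, 3]))
```

In this example the project has ten times the raw-score spread of the quiz, yet each student's composite depends only on relative standing within each component and on the chosen weights.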
Distribution of Item Difficulties, Page 5 (sample)
Distribution of Item Discriminations, Page 5
Test Summary, Page 5
The STANDARD DEVIATION is an index of the degree to which test scores vary within a group. The larger the standard deviation (for tests of the same length), the more variability there is in the test scores. The value of the ideal standard deviation should be about one-sixth of the effective range. To the extent that the actual standard deviation is less than this ideal value, the test scores are not able to discriminate as well as we might prefer. For example, a 25-item test should have a standard deviation of about 25/6 = 4.17. In our example, the standard deviation is 4.51, which reflects the test's effectiveness in spreading scores out to better discriminate between the students' abilities.
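The one-sixth rule of thumb can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the effective range is taken to be 25 for a 25-item test, as in the example above.

```python
def ideal_sd(effective_range):
    """Rule-of-thumb ideal standard deviation: one-sixth of the range."""
    return effective_range / 6


def spread_report(actual_sd, effective_range):
    """Compare an observed standard deviation with the ideal value."""
    target = ideal_sd(effective_range)
    return {
        "ideal_sd": round(target, 2),
        "actual_sd": actual_sd,
        "meets_ideal": actual_sd >= target,
    }


print(spread_report(actual_sd=4.51, effective_range=25))
# {'ideal_sd': 4.17, 'actual_sd': 4.51, 'meets_ideal': True}
```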
The RELIABILITY is a number which tells how consistent students' scores would be if you were able to give them a similar test with different items. Every time an instructor puts a test together, he or she uses items that are merely a sample from thousands of possible items. You want the test scores to give you the same information no matter what sample from the "thousands" you elected to use. This reliability coefficient is a number between 0.00 and 1.00 and can be interpreted like a correlation coefficient. The closer the value is to 1.00, the more reliable or error-free are the test scores. (If you bought a commercially prepared achievement test, you would expect it to have a reliability of at least .90.)
The STANDARD ERROR OF MEASUREMENT provides another way of representing the accuracy of test scores. Its use requires the assumption that some component of error is part of any person's test score. This error can result from the content of the questions sampled and/or day-to-day factors which influence individual performance (e.g., memory fluctuation or guessing). All test scores contain some degree of error and, consequently, should be treated as an estimate of achievement level. The size of the standard error of measurement is an indication of how far from the actual test score the "real" error-free score might be. The standard error of measurement for an average of scores on tests of the same length tends to be smaller than the standard error of measurement for any single test (i.e., one can be more sure that a student's observed score falls in a narrower band around the hypothetical error-free score). It is generally recommended that instructors not use a single test as the basis for grading (i.e., grades should not be assigned to individual components), but instead make use of the total (or average) for a number of tests given during the semester.
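This bulletin does not state how the EES program computes the standard error of measurement. A common classical formula is the standard deviation multiplied by the square root of one minus the reliability, and the Python sketch below assumes that formula. The function names and the reliability of .80 are hypothetical. The second function assumes independent errors across tests, which is what makes the error of an average smaller than the error of a single test.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical SEM: SD * sqrt(1 - reliability) (assumed formula)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0.00 and 1.00")
    return sd * math.sqrt(1.0 - reliability)


def sem_of_average(sems):
    """SEM of the average of several tests, assuming independent errors."""
    return math.sqrt(sum(s * s for s in sems)) / len(sems)


def score_band(observed_score, sem):
    """Band of one SEM on either side of an observed score."""
    return (observed_score - sem, observed_score + sem)


# SD of 4.51 as in the sample report; the reliability of .80 is hypothetical.
sem = standard_error_of_measurement(sd=4.51, reliability=0.80)
print(round(sem, 2))                       # 2.02
print(sem_of_average([2.0, 2.0, 2.0, 2.0]))  # 1.0
```

Under these assumptions, averaging four tests that each have a standard error of 2 points halves the error, which is consistent with the advice above to base grades on several tests.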
The MEAN DIFFICULTY of the test is defined as the arithmetic average of the item difficulties. The MEAN DISCRIMINATION of the test is the average of the item biserial correlations (DISCRIM.) between the keyed response and the total score. (Correlations are transformed to z values using Fisher's transformation, averaged, and then transformed back to a correlation.)
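The averaging procedure described in parentheses can be sketched in Python, assuming the transformation is Fisher's z (the inverse hyperbolic tangent), which is the usual way to average correlations. The function names and the sample values are hypothetical.

```python
import math


def mean_difficulty(difficulties):
    """Arithmetic average of the item difficulty indices."""
    return sum(difficulties) / len(difficulties)


def mean_discrimination(discriminations):
    """Average correlations by way of Fisher's z transformation."""
    if any(abs(r) >= 1 for r in discriminations):
        # atanh is undefined at +/-1 and beyond.
        raise ValueError("each correlation must lie strictly between -1 and 1")
    z_values = [math.atanh(r) for r in discriminations]  # r -> z
    return math.tanh(sum(z_values) / len(z_values))      # mean z -> r


print(mean_difficulty([92.0, 58.0]))                  # 75.0
print(round(mean_discrimination([0.20, 0.80]), 2))    # 0.57
```

Note that the result for .20 and .80 is about .57 rather than the simple average of .50, because the transformation gives more weight to correlations far from zero.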
The Name Roster, Page 6 (sample)
The Error Listing, Pages 7 and 8 (sample)
Users who wish to have further assistance in interpreting the test analysis report should stop in or call the Evaluation and Examination Service scoring office at 319-335-0359.