Reading Your Test Analysis Report

            The purpose of this bulletin is to help instructors answer two common questions about their multiple-choice tests: "How well did my test work?" and, more importantly, "How can I improve this test?"  It describes the information contained in the test analysis report from the Evaluation and Examination Service (EES) and offers suggestions for using that information to improve classroom tests.

Test Analysis Report

The test analysis report contains the following information:
1. An item frequency count table showing the number of students who selected each possible response and the item difficulty and discrimination indices.
2. A distribution of test scores with the corresponding percentile ranks and standard scores.
3. Frequency distributions for the item difficulty and discrimination indices.
4. A summary of the test statistics.
5. An alphabetical test score roster.
6. An item error list for each student.
    A sample test analysis report is included at the end of this technical bulletin.  Reference will be made to the sample report page number as each section of the test analysis is described.

           Identification, Page 1 (sample)
            The first page of the report contains test identification information used by EES personnel.  The string of numbers appearing under the word "KEY" should be reviewed by instructors to verify that the keyed responses used for both scoring and test analysis are correct.  If an error is detected, contact the EES front desk (319-335-0356 or exam-service@uiowa.edu) to arrange for rescoring.           

            The bottom of the first page lists the number of items scored, how the test was scored, and contact information for EES.

            Item Analysis, Pages 2 and 3 (sample)          
            This section of the report contains a table for each test item.  Item numbers in this section correspond to the item numbers used on the test.  Each table contains two parts -- a set of item statistics and a response count.

            Three item statistics appear on the left side of each table:
1. DIFFICULTY is the difficulty index for the item, the percentage of students who answered the item correctly.
2. DISCRIM. (discrimination index) is the correlation of the item score with the total test score.  A high positive discrimination (near +1.0) means that more of the high-scoring students chose the correct answer; a negative discrimination means that more low-scoring students chose the correct answer.  Negative discriminations often indicate a miskeyed item.
3. OMITS is the number of students who failed to mark an answer for the item.  An omit occurs when a student does not make a mark for the item, when the student's mark is made so lightly that the scanner reads the mark as an erasure, or when a student marks two answers for the same item.

            The difficulty index is somewhat of a misnomer because it actually reflects how easy an item was for the class (the percentage of students answering correctly).  For tests that are intended to differentiate among students, maximum differentiation can be achieved in tests of moderate difficulty (i.e., when items are answered correctly by 50% to 80% of the group).  Difficulty indices diverging widely from this recommended range may indicate the need for item revision.  For tests intended to show students' levels of content mastery, however, high item difficulty values are not uncommon since it is anticipated that most (possibly all) students should answer each item correctly.
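
            As a rough illustration of the definition above, the difficulty index can be reproduced from scored item responses.  The sketch below is hypothetical Python with invented data (item scores coded 1 for correct, 0 for incorrect); it is not the EES scoring program.

    # Minimal sketch (hypothetical data): difficulty index = percent answering correctly.
    item_scores = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # 1 = correct, 0 = incorrect

    def difficulty(scores):
        """Percent of students who answered the item correctly."""
        return 100.0 * sum(scores) / len(scores)

    print(difficulty(item_scores))   # 70.0 -> within the recommended 50%-80% range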

            The item discrimination index (DISCRIM.) indicates whether the students who are scoring well overall on the exam are answering the item correctly.  The generally accepted value for DISCRIM. is +.40 or above.  Values of +.40 to +1.00 indicate that students who performed well on the test performed substantially better on the item than did students who performed poorly on the total test.  Values on the order of +.20 to +.39 are acceptable, but indicate that the item probably should be reviewed.  Very low values (less than .20) or negative values indicate that an item is in need of either revision or replacement.  A negative value indicates a particularly undesirable situation in which students who earned high scores on the test performed more poorly on that particular item than those students who earned low scores on the test.  In such instances, the instructor is best advised to review the item for ambiguities which may have made a distracter more plausible for higher-achieving students than the keyed response.

            The values of both the difficulty and discrimination indices are fairly unstable for small groups.  For this reason, item statistics are less useful for making item revisions when classes of fewer than 20 students are involved.  Discrimination indices also tend to be related to difficulty indices: maximum discrimination (.40 or larger) is likely only for the middle range of difficulty values (50%-80%), and difficulty indices outside that range usually produce lower discrimination values.
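
            Because the discrimination index is defined as the correlation of the item score with the total test score, it can be sketched as an ordinary Pearson correlation between a 0/1 item score and students' total scores (for a 0/1 variable this is the point-biserial correlation).  The data below are invented for illustration only.

    import math

    def pearson_r(x, y):
        """Pearson correlation between two equal-length lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    # Hypothetical data: one item's 0/1 scores and the same students' total test scores.
    item  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
    total = [24, 22, 21, 20, 18, 15, 14, 13, 11, 9]

    # A positive value means the higher-scoring students tended to get the item right.
    print(round(pearson_r(item, total), 2))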

            The response count on the right half of each item table has three lines labeled RESPONSE, COUNT, and DISC(R):
1. The RESPONSE line shows the five (or ten) numbered choices from which students selected their answers.  The asterisk (*) indicates the keyed choice, or correct answer, for the item.
2. The COUNT line shows the number of students who selected each response choice.
3. A discrimination index for each response choice appears on the third line, after the label DISC(R).  An item which discriminates well between high- and low-scoring students will have a large, positive DISC(R) value for the keyed response and zero or negative DISC(R) values for the distracters; a small illustration follows this list.  The DISC(R) for the keyed response is equal to the discrimination index for the item (DISCRIM.).
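
            DISC(R) can be thought of the same way as DISCRIM.: for each response choice, students who picked that choice are scored 1 and everyone else 0, and that indicator is correlated with the total test score.  The sketch below uses invented responses and totals and the standard-library correlation function (Python 3.10+); it is an illustration of the idea, not the EES computation.

    from statistics import correlation   # Pearson correlation, Python 3.10+

    # Hypothetical data: responses[i] is the choice (1-5) marked by student i.
    responses = [2, 2, 1, 2, 3, 1, 4, 2, 5, 1]
    totals    = [24, 22, 21, 20, 18, 15, 14, 13, 11, 9]

    for choice in range(1, 6):
        indicator = [1 if r == choice else 0 for r in responses]
        if 0 < sum(indicator) < len(indicator):   # skip choices picked by nobody or everybody
            print(choice, round(correlation(indicator, totals), 2))
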
Strategies for making items more discriminating include the following:
1. Inspect the item in question to make sure it does not contain an unintended clue which aids the less-knowledgeable but test-wise student.
2. Inspect the response discrimination indices, DISC(R), to identify those responses which are not contributing to item discrimination.  For any item, the keyed response should have a positive DISC(R) index, preferably .4 or greater, and the distracters (wrong answers) should have zero or negative DISC(R) indices.  If a correct or incorrect response does not meet these guidelines then it should be considered for possible revision.
3. Revise the distracters by making use of common student errors or misconceptions to distinguish more accurately the better prepared from the less prepared students.
4. Make the distracters more similar to the correct response so that finer discrimination is required of the students.  Asking students who responded correctly to an item why they selected that answer may also be useful in identifying "wrong" routes to the "right" answer.

            Three items have been chosen from the item analysis tables on pages 2 and 3 to illustrate the interpretation of the test item statistics.  An evaluation such as this can be useful for diagnosing problems with test items so that poor items can be revised for use on future examinations.

            Item 1.  Ninety-two percent of the group answered the item correctly, indicating that this is a very easy item.  Consequently, the item did not discriminate well between high and low test scorers.  A clue to why the item was too easy can be found in the response count row.  Responses four and five were not selected at all, and the other two distracters were chosen by only one student each.  The correct answer to the item was very obvious, either because most students learned the concept measured by the item or because the distracters were so obviously implausible.  If this item was being used as a check to ensure that the majority of students understood this concept, you would probably keep it in your item bank.  However, for tests whose purpose is to distinguish among students' grasp of the concept, this item should be replaced or the distracters rewritten.

            Item 5.  Though this item is within the moderate difficulty range, it is a "negative discriminator."  That is, the correct answer was chosen by more students with low total test scores than students with high overall test scores.  Part of the problem seems to be response option #2.  The value of DISC(R) for option #2 is high and positive (.54) when it should be negative.  There is something about the wording of option #2 that makes it more attractive as the correct answer for high scorers than the actual correct response. This situation is often the result of an incorrect scoring key.

            Item 12.  The difficulty index of 58% indicates the item is of moderate difficulty, not too easy and not too difficult.  The discrimination index of .66 is very high and, therefore, very good.  It means that students with high test scores tended to answer correctly and those with low scores missed the item.  This interpretation is verified by the values of DISC(R) for the distracters which are all negative.  All of the distracters seem to be functioning well.

            Distribution of Scores, Page 4 (sample)
            The Distribution of scores table indicates the range of test scores obtained by the students, the number of students obtaining each score value, the PERCENTILE RANK of each score, and the corresponding STANDARD SCORE.  A histogram of the scores provides a visual representation of the score distribution.

            The SCORE column lists the raw score values from the highest obtained score to the lowest.  The FREQUENCY column indicates the number of students who received each particular score.  The CUMULATIVE FREQUENCY DISTRIBUTION shows how many students scored at or below a particular score value.  The cumulative frequency distribution is also used in computing the percentile ranks shown in the next column.  The PERCENTILE RANK tells the student what percentage of students earned a score equal to or less than their own.
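
            The percentile ranks can be reproduced from the frequency distribution as described above.  The sketch below uses invented raw scores and assumes the percentile rank is simply the cumulative percentage of students at or below each score; the exact rounding used on the EES report may differ.

    from collections import Counter

    scores = [21, 19, 19, 18, 17, 17, 17, 15, 14, 12]   # invented raw scores
    n = len(scores)
    freq = Counter(scores)

    cumulative = 0
    for score in sorted(freq):                      # accumulate from the lowest score up
        cumulative += freq[score]
        percentile_rank = 100.0 * cumulative / n    # percent scoring at or below this score
        print(score, freq[score], cumulative, round(percentile_rank))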

            The STANDARD SCORE column contains the linear T-score which corresponds to each of the raw scores.  Like percentile ranks, these standard scores can be used to interpret relative group standing for a given student.  Since the mean of linear T-scores is always 50 and the standard deviation is 10, this information can be used to judge how far above or below the mean a given student has performed.  Standard scores are also more useful than raw scores when several scores are to be weighted and combined for a final course grade.  The use of standard scores ensures that the desired weights will in fact be used.  An explanation of how to use standard scores for grading is provided in EES Technical Bulletin #5, Assigning Course Grades.
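
            A linear T-score is the raw score rescaled so that the class mean becomes 50 and the standard deviation becomes 10, i.e., 50 + 10 x (score - mean) / SD.  The sketch below shows that rescaling with invented scores; whether EES uses the population or sample standard deviation is not stated, so the population form is assumed here.

    from statistics import mean, pstdev

    scores = [21, 19, 19, 18, 17, 17, 17, 15, 14, 12]   # invented raw scores
    m, sd = mean(scores), pstdev(scores)                # pstdev: population standard deviation (assumed)

    def t_score(x):
        """Linear T-score: mean 50, standard deviation 10."""
        return 50 + 10 * (x - m) / sd

    for x in sorted(set(scores), reverse=True):
        print(x, round(t_score(x), 1))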

            Distribution of Item Difficulties, Page 5 (sample)
            This table shows how many items occurred within each of the difficulty ranges.  Item difficulty is the percentage of the group who answered the item correctly.  For classroom tests designed to differentiate among students' levels of achievement, items should be moderate in difficulty, in the 50-80% range.

There are several reasons why a test item may turn out to be more difficult than expected:
1. the correct answer may be marked incorrectly on the scoring key;
2. the item might be worded ambiguously; or
3. the item content may be a part of the course content that students did not learn.
On the other hand, some items turn out to be easier than expected, again for a variety of reasons:
1. the students may have learned the content particularly well;
2. the incorrect choices may be obviously incorrect to students; or
3. some fault in the wording may provide a clue to the correct answer.

            Distribution of Item Discriminations, Page 5
            This table displays the distribution of item discriminations.  Item discrimination indices should be as high as possible, but certainly above .20.  An item discrimination near zero often occurs when the item is worded ambiguously and low-scoring students choose the "correct" response as often as high-scoring students, frequently for the wrong reasons.  The most frequent reason for a negative discrimination is that the item has been keyed incorrectly.  Always check items that have a negative item discrimination index to determine if the scoring key was filled in correctly.

            Test Summary, Page 5
            The MEDIAN, sometimes called the middle score, is the score above which and below which half of the group scored.  The MEAN, the arithmetic average of the scores, is a good indicator of the overall difficulty of the test.  When the values of the median and the mean are similar, one can conclude that the numbers of extremely high scores and extremely low scores are about the same.  However, when many of the scores are quite low or quite high, the median and mean will be quite different.  The mean is used to judge how close the test is to "moderate difficulty".  The standard for moderate difficulty, the score midway between the perfect score and the chance score, will vary across tests.  Because students should be able to get some items correct, even if they guess, the chance score or guessing score is considered the lowest score on a multiple-choice or true-false test.  For example, a test composed of 25 items with five choices per item has a chance score of 1/5 x 25 = 5.  The score midway between a chance score and a perfect score is (5 + 25)/2 = 15.  The mean for the test provided is 13.00, so this test is a little more than moderately difficult.
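
            The arithmetic in the example above can be written out directly.  The sketch below simply restates the bulletin's example (25 five-choice items, observed mean 13.00); no values beyond those in the example are assumed.

    # The bulletin's example: 25 items, five choices each, observed mean of 13.00.
    n_items, n_choices, observed_mean = 25, 5, 13.00

    chance_score = n_items / n_choices                  # 1/5 x 25 = 5
    moderate_midpoint = (chance_score + n_items) / 2    # (5 + 25) / 2 = 15

    print(chance_score, moderate_midpoint)
    print(observed_mean < moderate_midpoint)            # True -> a little more difficult than "moderate"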

            The STANDARD DEVIATION is an index of the degree to which test scores vary within a group.  The larger the standard deviation (for tests of the same length), the more variability there is in the test scores.  The ideal standard deviation should be about one-sixth of the effective range.  To the extent that the actual standard deviation is less than this ideal value, the test scores are not able to discriminate as well as we might prefer.  For example, a 25-item test should have a standard deviation of about 25/6 = 4.17.  In our example, the standard deviation is 4.51, which reflects the test's effectiveness in spreading scores out to better discriminate among students' abilities.
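
            That comparison can be restated as a short check.  The sketch below mirrors the bulletin's own example, in which the 25-point range is divided by six; it does not define "effective range" beyond what the example shows.

    # Mirrors the bulletin's example: ideal SD is about one-sixth of the effective range.
    effective_range = 25              # the 25-item range used in the bulletin's example
    ideal_sd = effective_range / 6    # about 4.17
    actual_sd = 4.51                  # standard deviation reported for the sample test

    print(round(ideal_sd, 2), actual_sd, actual_sd >= ideal_sd)   # True -> scores are spread out well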

            The RELIABILITY is a number which tells how consistent students' scores would be if you were able to give them a similar test with different items.  Every time an instructor puts a test together he or she uses items that are merely a sample from thousands of possible items.  You want the test scores to give you the same information no matter what sample from the "thousands" you elected to use.  This reliability coefficient is a number between 0.00 and 1.00 and can be interpreted like a correlation coefficient.  The closer the value is to 1.00, the more reliable or error-free are the test scores. 
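
            The report does not state which reliability formula is used.  For multiple-choice items scored right/wrong, one common internal-consistency estimate is Kuder-Richardson formula 20 (KR-20); the sketch below shows that formula with an invented 0/1 item-score matrix, purely as an illustration and not as the EES computation.

    from statistics import pvariance

    # Invented 0/1 item scores: rows are students, columns are items.
    matrix = [
        [1, 1, 1, 0, 1],
        [1, 1, 0, 1, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]

    k = len(matrix[0])                                    # number of items
    totals = [sum(row) for row in matrix]                 # each student's total score
    p = [sum(row[j] for row in matrix) / len(matrix) for j in range(k)]   # item difficulties (0-1)
    pq_sum = sum(pi * (1 - pi) for pi in p)

    kr20 = (k / (k - 1)) * (1 - pq_sum / pvariance(totals))
    print(round(kr20, 2))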

            The STANDARD ERROR OF MEASUREMENT provides another way of representing the accuracy of test scores.  Its use requires the assumption that some component of error is part of any person's test score.  This error can result from the content of the questions sampled and/or day-to-day factors which influence individual performance (i.e., memory fluctuation, guessing, etc.).  All test scores contain some degree of error and, consequently, should be treated as an estimate of achievement level.  The size of the standard error of measurement is an indication of how far from the actual test score the "real" error-free score might be.  The standard error of measurement for an average of scores on tests of the same length tends to be smaller than the standard error of measurement for any single test (i.e., one can be more sure of a student's observed score falling in a narrower band around the hypothetical error-free score).  It is generally recommended that instructors do not use a single test as the basis for grading (i.e., grades should not be assigned to individual components), but make use of the totals (or average) for a number of tests given during the semester.
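
            The standard error of measurement is conventionally computed from the standard deviation and the reliability as SEM = SD x sqrt(1 - reliability).  The sketch below assumes that conventional formula and uses the 4.51 standard deviation from the sample summary together with a hypothetical reliability value; read the actual reliability from your own report.

    import math

    sd = 4.51            # standard deviation from the sample test summary
    reliability = 0.80   # hypothetical value for illustration only

    sem = sd * math.sqrt(1 - reliability)
    print(round(sem, 2))

    # An observed score is then read as an estimate, e.g. roughly one SEM on either side.
    observed = 18
    print(observed - sem, observed + sem)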

            The MEAN DIFFICULTY of the test is defined as the arithmetic average of the item difficulties.  The MEAN DISCRIMINATION of the test is the average of the item bi-serial correlations (DISCRIM.) between the keyed response and the total score.  (Correlations are transformed to standardized z scores, averaged, and then transformed back to a correlation.)
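
            The averaging described above (transforming the correlations to z values, averaging, and transforming back) corresponds to Fisher's r-to-z transformation; treating the "standardized z scores" as that transformation is an assumption here.  The item statistics below are invented for illustration.

    import math

    difficulties    = [92, 58, 71, 64, 45]               # invented item difficulty indices (percent correct)
    discriminations = [0.12, 0.66, 0.41, 0.35, -0.05]    # invented item discrimination indices

    mean_difficulty = sum(difficulties) / len(difficulties)

    # Fisher r-to-z: z = atanh(r); average the z values; convert back with tanh.
    z_values = [math.atanh(r) for r in discriminations]
    mean_discrimination = math.tanh(sum(z_values) / len(z_values))

    print(round(mean_difficulty, 1), round(mean_discrimination, 2))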

            The Name Roster, Page 6 (sample)
            Page 6 of the sample report contains the roster of all students taking the test, listed alphabetically.  The PERCENT CORRECT score is the student's score as a percentage of the maximum possible score.  The PERCENTILE RANK column gives the percentile rank, the percent of students scoring at or below each particular score.  The STANDARD SCORE is a linear T-score, as described above.

            The Error Listing, Pages 7 and 8 (sample)
            The last section of the report contains the optional error listing, which can be sorted alphabetically.  For each student, the error list indicates the items the student missed and their incorrect responses.  For example, the 4/5 entry for Pamela Ard means she incorrectly answered item 4 by marking choice 5.

            Users who wish to have further assistance in interpreting the test analysis report should stop in or contact the Evaluation and Examination Service scoring office at 319-335-0359 or exam-scoring@uiowa.edu.    

 

Sample Test Analysis Report (pages 1-8)
