Reading Your Test Analysis Report

The purpose of this bulletin is to help instructors answer two common questions about their multiple-choice tests: "How well did my test work?" and, more importantly, "How can I improve this test?"  This memo describes the information contained in the test analysis program output from the Evaluation and Examination Service (EES) and offers suggestions for using this information to improve classroom tests.

Sample Test Analysis Report

The test analysis output contains the following information:
1. An item frequency count table showing the number of students who selected each response for an item and the item difficulty and discrimination indices.
2. A distribution of test scores with the corresponding percentile ranks and standard scores.
3. Frequency distributions for the item difficulty and discrimination indices.
4. A test statistics summary for the entire test.
5. Test score rosters in name order and also ID order.
            A sample report from the test analysis program is reproduced as EXHIBIT I  at the end of this technical bulletin.  Reference will be made to the sample report page number (located in the lower center) as each section of the test analysis is described.

           Identification, Page 1 (sample)
            The first page of the report contains identification for use by EES personnel.  The string of numbers appearing under the word "KEY" should be reviewed by instructors to verify that the keyed responses used for both scoring and test analysis are correct.  If an error is detected, the EES receptionist should be contacted to arrange for rescoring.           

            Below the key are important messages which indicate information that has been saved for other EES programs.  The COMPOSITE (Technical Bulletin #24) is a record keeping program that stores scores from tests and projects and combines them to form a total or composite score after each scoring request and at the conclusion of a course.  The ITEMBANK program (Technical Bulletin #25) allows a user to create an extensive library of test items and item statistics that can be used to quickly and accurately produce tests.  In the example given, scores have been saved for the Composite program but item statistics have not been saved for the Itembank program.

            Item Analysis, Pages 2 and 3 (sample)          
            This section of the report contains a table for each test item.  Item numbers in this section correspond to the same item number used on the test.  Each table contains two parts -- a set of item statistics and a response count.

Three item statistics appear on the left side of each table:
1. DIFFICULTY is the difficulty index for the item or the percent of the group that answered the item correctly.
2. DISCRIM. (discrimination index) is the correlation of the item score with the total test score.  A high positive discrimination (near +1.0) means that all the high-scoring students chose the correct answer; a negative discrimination means that more low-scoring than high-scoring students chose the correct answer.  Negative discriminations often indicate a miskeyed item.
3. OMITS is the number of students who failed to mark an answer for the item.  An omit occurs when a student does not make a mark for the item, when the student's mark is made so lightly that the test scoring machine reads the mark as an erasure, or when a student marks two answers for the same item.

            The difficulty index is somewhat of a misnomer because it actually reflects how easy an item was for the class (the percent of students answering correctly).  For tests that are intended to differentiate among students, maximum differentiation can be achieved in tests of moderate difficulty (i.e., when items are answered correctly by 50% to 80% of the group).  Difficulty indices diverging widely from this recommended range may indicate the need for item revision.  For tests intended to show students' levels of content mastery, however, high item difficulty values are not uncommon since it is anticipated that most (possibly all) students should answer each item correctly.

            The item discrimination index (DISCRIM.) indicates whether the students who are scoring well overall on the exam are answering the item correctly.  The generally accepted value for DISCRIM. is +.40 or above.  The values of +.40 to +1.00 indicate that students who performed well on the test performed substantially better on the item than did students who performed poorly on the total test.  Values on the order of +.20 to +.39 are acceptable, but indicate that the item probably should be reviewed.  Very low values (less than .20) or negative values indicate that an item is in need of either revision or replacement.  A negative value indicates a particularly undesirable situation in which students who earned high scores on the test performed more poorly on that particular item than those students who earned low scores on the test.  In such instances, the instructor is best advised to review the item for ambiguities which may have made a distracter more plausible for higher achieving students than the keyed response.

            The values of both the difficulty and discrimination indices are fairly unstable for small groups.  For this reason, item statistics are less useful for making item revisions when classes of fewer than 20 students are involved.  Discrimination indices tend to be related to difficulty indices such that maximum discrimination (.40 or larger) is likely only for the middle range of difficulty values (50%-80%).  Difficulty indices higher or lower than 50%-80% usually produce lower discrimination values.
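For instructors who keep item-level results in electronic form, the two indices can be approximated outside the EES program.  The Python sketch below assumes each item is scored 1 (correct) or 0 (incorrect) and uses an ordinary Pearson (point-biserial) correlation for DISCRIM.  The exact formula used by the EES program, including any correction for guessing, is not given in this bulletin, so the function names and the six-student data set are illustrative assumptions only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # Correlation is undefined when one variable never varies
        # (e.g., every student answered correctly); report 0 here.
        return 0.0
    return sxy / sqrt(sxx * syy)

def item_statistics(item_correct, total_scores):
    """Return (difficulty in percent, discrimination) for one item.

    item_correct: 1 or 0 for each student (assumed scoring).
    total_scores: total test score for each student, same order.
    """
    difficulty = 100.0 * sum(item_correct) / len(item_correct)
    discrimination = pearson(item_correct, total_scores)
    return difficulty, discrimination

# Hypothetical data for six students (not taken from the sample report).
item_correct = [1, 1, 1, 0, 1, 0]
total_scores = [24, 21, 19, 12, 15, 9]

difficulty, discrim = item_statistics(item_correct, total_scores)
print(f"DIFFICULTY = {difficulty:.0f}%  DISCRIM. = {discrim:+.2f}")
```

With so few students the resulting values would be quite unstable, which is the same caution given above for classes of fewer than 20.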

            The response count on the right half of each item table has three lines labeled RESPONSE, COUNT, and DISC(R):
1. The RESPONSE line simply shows the five (or ten) number choices from which students selected their answer.  The asterisk (*) indicates the keyed choice or correct answer for the item.
2. The COUNT line shows the number of students who selected each response.
3. A discrimination index for each response choice appears on the third line after the word DISC(R).  An item which discriminates well between good and poor students will have a large and positive DISC(R) value for the keyed response and zero or negative DISC(R) values for the distracters.  The DISC(R) for the keyed response is always similar to the DISCRIM. for the item and will be identical when no correction for guessing is used.

Strategies for making items more discriminating include the following:
1. Inspect the item in question to make sure it does not contain an unintended clue which aids the less-knowledgeable but test-wise student.
2. Inspect the response discrimination indices, DISC(R), to identify those responses which are not contributing to item discrimination.  For any item, the keyed response should have a positive DISC(R) index, preferably .4 or greater, and the distracters (wrong answers) should have zero or negative DISC(R) indices.  If a correct or incorrect response does not meet these guidelines then it should be considered for possible revision.
3. Revise the distracters by making use of common student errors or misconceptions to distinguish more accurately the better prepared from the less prepared students.
4. Make the distracters more similar to the correct response so that finer discrimination is required of the students.  Asking students who responded correctly to an item why they selected that answer may also be useful in identifying "wrong" routes to the "right" answer.
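The DISC(R) guidelines in strategy 2 can be checked mechanically.  The sketch below is one possible approach, not the EES program's method: it assumes DISC(R) is the Pearson correlation between "chose this response" (1 or 0) and the total score, and it flags a keyed response below .4 or a distracter above zero.  The function names, the .4 default, and the eight-student data set are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either sequence never varies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def response_discriminations(choices, total_scores, options):
    """Assumed DISC(R): correlation of choosing each option with total score."""
    disc_r = {}
    for option in options:
        chose = [1 if c == option else 0 for c in choices]
        disc_r[option] = pearson(chose, total_scores)
    return disc_r

def flag_responses(disc_r, key, min_key_disc=0.4):
    """List options that break the guidelines: key below min, distracter above 0."""
    flagged = []
    for option, value in disc_r.items():
        if option == key and value < min_key_disc:
            flagged.append(option)
        elif option != key and value > 0:
            flagged.append(option)
    return flagged

# Hypothetical item keyed 1, where the high scorers preferred option 2.
choices = [2, 2, 2, 1, 1, 1, 3, 1]
total_scores = [24, 22, 20, 14, 12, 10, 11, 9]

disc_r = response_discriminations(choices, total_scores, options=[1, 2, 3])
for option, value in disc_r.items():
    print(f"response {option}: DISC(R) = {value:+.2f}")
print("review these responses:", flag_responses(disc_r, key=1))
```

A flagged response is only a prompt to reread the item; the wording itself still has to be judged by the instructor.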

            Three items have been chosen from the item analysis tables on pages 2 and 3 to illustrate the interpretation of the test item statistics.  An evaluation such as this can be useful for diagnosing problems with test items so that poor items can be revised for use on future examinations.

            Item 1.  Ninety-two percent of the group answered the item correctly, indicating that this is a very easy item.  Consequently, the item did not discriminate well between high and low test scorers.  A clue to why the item was too easy can be found in the response count row.  Responses four and five were not selected at all, and the other two distracters, two and three, were chosen by only one student each.  The correct answer to the item was very obvious, either because most students learned the concept measured by the item or because the distracters were so obviously implausible.  The distracters should be rewritten.

            Item 5.  Though this item is within the moderate difficulty range, it is a "negative discriminator."  That is, the correct answer was chosen by more low-scoring than high-scoring students.  Part of the problem seems to be response option #2.  The value of DISC(R) for option #2 is high and positive (.54) when it should be negative.  There is something about the wording of option #2 that makes it more attractive as the correct answer for high scorers than the actual correct response.

            Item 12.  The difficulty index of 58% indicates the item is of moderate difficulty, not too easy and not too difficult.  The discrimination index of .66 is very high and, therefore, very good.  It means that students with high test scores tended to answer correctly and those with low scores missed the item.  This interpretation is verified by the values of DISC(R) for the distracters which are all negative.  All of the distracters seem to be functioning well.

            Distribution of Scores, Page 4 (sample)
            The distribution of student scores indicates the range of test scores obtained by the students, the number of students obtaining each score value, the PERCENTILE RANK of each score, and the corresponding STANDARD SCORE.  A histogram of the score distribution has been added, using an equals sign (=) to represent each student who achieves a particular test score.  Because of space limitations in printing, if more than 30 students achieve any one score on a test, the entire histogram is a proportional representation.  Otherwise, each = represents one student.

           What makes a good or bad distribution of scores depends on the purpose for creating the test.  For tests intended to discriminate among students the following criteria are appropriate:
1. If you have to give a range of grades, you will want a range of scores on the test.  In an extremely bad case, if every student received exactly the same score, the test would not help you at all in giving different grades to different students.
2. Generally you do not want all the students clustered near the top of the possible score range, that is, many students with nearly perfect scores.  This frequently happens when the items are testing recall of simple facts or relationships.  Items that test more complex learning require students to make inferences, analyze processes, or apply knowledge in new or novel situations.  Such questions would likely not be answered correctly by all the students.
3. If all of the students are clustered at the low end of the scale, you are probably asking questions that are too hard.  Some hard questions may be desirable if you wish to challenge your most able students, but there is no benefit derived from overwhelming all the students.

            The SCORE column lists the raw score values from the highest obtained score to the lowest.  The FREQUENCY column indicates the number of students who received each particular score.  The CUMULATIVE FREQUENCY DISTRIBUTION shows how many students scored at or below a particular score value.  The cumulative frequency distribution is also used in computing the percentile ranks shown in the next column.  The PERCENTILE RANK tells the student what percentage of students earned a score equal to or less than their own.
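The four columns described above can be rebuilt from a list of raw scores.  The following sketch follows the wording of this bulletin, taking the percentile rank as the percent of students scoring at or below each value; some programs use a midpoint definition instead, and which one the EES program applies is not stated here.  The function name and the scores are made up for illustration.

```python
from collections import Counter

def score_distribution(scores):
    """Return rows of (score, frequency, cumulative frequency, percentile rank).

    Rows run from the highest obtained score to the lowest.  Cumulative
    frequency counts students scoring at or below the score, and the
    percentile rank is that count as a percent of the group.
    """
    counts = Counter(scores)
    n = len(scores)
    rows = []
    cumulative = n  # everyone scores at or below the highest score
    for score in sorted(counts, reverse=True):
        percentile_rank = 100.0 * cumulative / n
        rows.append((score, counts[score], cumulative, percentile_rank))
        cumulative -= counts[score]
    return rows

# Hypothetical raw scores for eight students.
rows = score_distribution([20, 18, 18, 15, 15, 15, 12, 10])
print("SCORE  FREQ  CUM FREQ  PCT RANK")
for score, freq, cum, pct in rows:
    print(f"{score:5d}  {freq:4d}  {cum:8d}  {pct:8.1f}")
```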

            The STANDARD SCORE column contains the linear T-score which corresponds to each of the raw scores.  Like percentile ranks, these standard scores can be used to interpret relative group standing for a given student.  Since the mean of these standard scores is always 50 and the standard deviation is always 10, this information can be used to judge how far above or below the mean a given student has performed.  Standard scores are also more useful than raw scores when several scores are to be weighted and combined for course grading purposes.  The use of standard scores ensures that the desired weights will in fact be used.  With raw scores, those grading components having larger standard deviations are given more weight.
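A linear T-score rescales each raw score so that the group mean becomes 50 and the standard deviation becomes 10.  The sketch below shows that conversion under one assumption not confirmed by this bulletin: it divides by the population standard deviation (`statistics.pstdev`), whereas the EES program may use a different divisor.  The function name and scores are hypothetical.

```python
from statistics import mean, pstdev

def t_scores(raw_scores):
    """Convert raw scores to linear T-scores (mean 50, standard deviation 10).

    Assumes the population standard deviation and at least two distinct
    scores; if every score is identical the deviation is zero and no
    T-score can be computed.
    """
    m = mean(raw_scores)
    sd = pstdev(raw_scores)
    if sd == 0:
        raise ValueError("all scores are identical; T-scores are undefined")
    return [50 + 10 * (x - m) / sd for x in raw_scores]

# Hypothetical raw scores.
raw = [10, 12, 14, 16, 18]
for x, t in zip(raw, t_scores(raw)):
    print(f"raw {x:2d} -> T {t:5.1f}")
```

Because every test converted this way has the same mean and spread, weights applied to T-scores behave as intended when several tests are combined.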

            Distribution of Item Difficulties, Page 5 (sample)
            This table shows how many items occurred within each of the difficulty ranges.  Item difficulty is the percentage of the group who answered the item correctly.  For classroom tests designed to differentiate among students' levels of achievement, items should be moderate in difficulty, with difficulty values of about 60% or 70%.

There are several reasons why a test item may turn out to be more difficult than expected:
1. the correct answer may be marked incorrectly on the scoring key;
2. the item might be worded ambiguously; or
3. the item content may be a minor or trivial part of the course content that students did not learn.

On the other hand, some items turn out to be easier than expected, again for a variety of reasons:
1. the students may have learned the content particularly well;
2. the incorrect choices may be obviously incorrect to students; or
3. some fault in the wording may provide a clue to the correct answer.

            Distribution of Item Discriminations, Page 5
            This table displays the distribution of item discriminations.  Item discrimination indices should be as high as possible, but certainly above .20.  An item discrimination near zero often happens because the item is worded ambiguously, and low-scoring people are choosing the "correct" response for different reasons, many of them for the wrong reasons.  The most frequent reason for a negative discrimination is that the item has been keyed incorrectly.  ALWAYS CHECK ITEMS THAT HAVE A NEGATIVE ITEM DISCRIMINATION INDEX TO DETERMINE IF THE SCORING KEY WAS FILLED IN CORRECTLY.

            Test Summary, Page 5
            The MEDIAN, sometimes called the middle score, is the score above which and below which half of the group scored.  The MEAN, the arithmetic average of the scores, is a good indicator of the overall difficulty of the test.  When the values of the median and the mean are similar, one can conclude that the numbers of extremely high scores and extremely low scores are about the same.  However, when many of the scores are quite low or quite high, the median and mean will be quite different.  The mean is used to judge how close the test is to "moderate difficulty".  The standard for moderate difficulty, the score midway between the perfect score and the chance score, will vary across tests.  Because students should be able to get some items correct even if they guess, the chance score or guessing score is considered the lowest score on a multiple-choice or true-false test.  For example, a test composed of 25 items with five choices per item has a chance score of 1/5 x 25 = 5.  The score midway between the chance score and the perfect score is (5+25)/2 = 15.  The mean for the test provided is 13.00.  Because this mean falls below 15, the test is a little more difficult than moderate.
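The moderate-difficulty arithmetic above generalizes to tests of other lengths.  The short sketch below assumes one point per item and the same number of choices for every item; the function name is hypothetical.

```python
def moderate_difficulty_target(num_items, choices_per_item):
    """Return (chance score, midpoint between chance score and perfect score).

    Assumes one point per item and an equal number of choices on every item.
    """
    chance_score = num_items / choices_per_item
    midpoint = (chance_score + num_items) / 2
    return chance_score, midpoint

# The 25-item, five-choice example from the text.
chance, target = moderate_difficulty_target(25, 5)
sample_mean = 13.00  # mean reported for the sample test

print(f"chance score = {chance}, moderate-difficulty target = {target}")
if sample_mean < target:
    print("The mean is below the target: harder than moderate difficulty.")
else:
    print("The mean is at or above the target: moderate or easier.")
```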

            The STANDARD DEVIATION is an index of the degree to which test scores vary within a group.  For tests of the same length, the larger the standard deviation, the more variability there is in the test scores.  The value of the ideal standard deviation should be about one-sixth of the effective range.  To the extent that the actual standard deviation is less than this ideal value, the test scores are not able to discriminate as well as we might prefer.  For example, a 25-item test should have a standard deviation of about 25/6 = 4.17.  In our example, the standard deviation is 4.51, slightly above this value, which reflects the test's effectiveness in spreading scores out to better discriminate among the students' abilities.

            The RELIABILITY is a number which tells how consistent students' scores would be if you were able to give them a similar test with different items.  Every time an instructor puts a test together he or she uses items that are merely a sample from thousands of possible items.  You want the test scores to give you the same information no matter what sample from the "thousands" you elected to use.  This reliability coefficient is a number between 0.00 and 1.00 and can be interpreted like a correlation coefficient.  The closer the value is to 1.00, the more reliable or error-free are the test scores.  (If you bought a commercially prepared achievement test, you would expect it to have a reliability of at least .90).
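This bulletin does not say which reliability coefficient the EES program reports.  Purely as an illustration of how such a coefficient can be obtained from item data, the sketch below computes Kuder-Richardson formula 20 (KR-20), a common choice for tests scored 1 or 0 per item; treating it as the program's method would be an assumption.  The function name and the five-student data set are hypothetical.

```python
from statistics import pvariance

def kr20(item_matrix):
    """Kuder-Richardson formula 20 for items scored 1 (right) or 0 (wrong).

    item_matrix: one row per student, one column per item.
    Uses the population variance of the total scores.
    """
    n_students = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    total_variance = pvariance(totals)
    # Sum over items of p * (1 - p), where p is the proportion correct.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_students
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)

# Hypothetical results: five students, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"KR-20 reliability = {kr20(responses):.2f}")
```

A real classroom test would have far more students and items; with a data set this small the coefficient says little about the test itself.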

            The STANDARD ERROR OF MEASUREMENT provides another way of representing the accuracy of test scores.  Its use requires the assumption that some component of error is part of any person's test score.  This error can result from the content of the questions sampled and/or day-to-day factors which influence individual performance (e.g., memory fluctuation, guessing, etc.).  All test scores contain some degree of error and, consequently, should be treated as an estimate of achievement level.  The size of the standard error of measurement is an indication of how far from the actual test score the "real" error-free score might be.  The standard error of measurement for an average of scores on tests of the same length tends to be smaller than the standard error of measurement for any single test (i.e., one can be more sure of a student's observed score falling in a narrower band around the hypothetical error-free score).  It is generally recommended that instructors not use a single test as the basis for grading (i.e., grades should not be assigned to individual components), but make use of the totals (or average) for a number of tests given during the semester.
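In classical test theory the standard error of measurement is usually obtained from the standard deviation and the reliability as SD x sqrt(1 - reliability).  The bulletin does not state that the EES program uses this formula, so the sketch below is an assumption; the reliability of .80 is also invented for illustration, while 4.51 is the standard deviation quoted for the sample test.

```python
from math import sqrt

def standard_error_of_measurement(standard_deviation, reliability):
    """Classical estimate: SD * sqrt(1 - reliability).

    reliability must lie between 0.00 and 1.00.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0.00 and 1.00")
    return standard_deviation * sqrt(1.0 - reliability)

# SD from the sample test; the reliability value is hypothetical.
sem = standard_error_of_measurement(4.51, 0.80)
print(f"standard error of measurement = {sem:.2f} raw-score points")
```

Under this formula a more reliable test gives a smaller standard error for the same standard deviation, which matches the interpretation given above.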

            The MEAN DIFFICULTY of the test is defined as the arithmetic average of the item difficulties.  The MEAN DISCRIMINATION of the test is the average of the item bi-serial correlations (DISCRIM.) between the keyed response and the total score.  (Correlations are transformed to standardized z scores, averaged, and then transformed back to a correlation.)
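The averaging step in the parenthetical note resembles the Fisher z transformation, in which each correlation r is converted with the inverse hyperbolic tangent, the transformed values are averaged, and the average is converted back.  Whether the EES program uses exactly this transformation is an assumption; the function name and the item values below are hypothetical.

```python
from math import atanh, tanh

def mean_discrimination(correlations):
    """Average correlations on the Fisher z scale and transform back.

    Each value must lie strictly between -1 and +1, because the
    transformation is undefined at exactly -1 or +1.
    """
    z_values = [atanh(r) for r in correlations]
    return tanh(sum(z_values) / len(z_values))

# Hypothetical item discrimination indices.
item_discriminations = [0.66, 0.42, 0.15, -0.10, 0.38]
print(f"MEAN DISCRIMINATION = {mean_discrimination(item_discriminations):.2f}")
```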

            The Name Roster, Page 6 (sample)
            Page 6 of the sample report contains the roster of all students taking the test, listed alphabetically by last name, with a blank line after each set of eight scores for clarity.  The PERCENT CORRECT score is the student's score as a percent of the maximum possible score.  The PERCENTILE RANK column gives the percentile rank for each student -- the percent of students in the class who earned the same score as, or a lower score than, a particular student.  The STANDARD score used here, also called a linear T-score, is useful for comparing individual students' performances on different tests and for weighting the scores before combining them for course grading purposes.

            The Error Listing, Pages 7 and 8 (sample)
            The last section of the report contains the optional error listing, which can be sorted either by the last 5 digits of the student's ID number (appropriate for posting) or alphabetically (individual slips can be cut and handed out to students).  For each student, the error list indicates the item number and the student's incorrect choice.  For example, the 4/5 in Pamela Ard's listing means she answered item 4 incorrectly by marking choice 5.  This feature is optional, and you will need to ask for one of the two types of error lists each time you bring a test in to be scored.

            Users who wish to have further assistance in interpreting the test analysis report should stop in or call the Evaluation and Examination Service scoring office at 319-335-0359.    

 

EXHIBIT I: Sample Test Analysis Report, pages 1-8
