Difficulties Met In Scoring Essay Test

How the HiSET® Exam is Scored

ETS provides scoring for the HiSET® exam on a continuing basis. Scores are processed daily and results are transmitted to the state or jurisdiction as they become available.

Computer-delivered Tests:

For computer-delivered tests, scores are available in approximately:

  • three business days for multiple-choice tests
  • five business days for essay tests

Paper-delivered Tests:

Scoring for paper-delivered tests is pending receipt of returned answer sheet documents from the test center. Once they are received at ETS, scores are available in approximately:

  • three business days for multiple-choice tests
  • five business days for essay tests

Multiple-choice test scoring

Multiple-choice answer sheets are scored by machine, which gives virtually 100 percent scoring accuracy.

  • Each correct answer is worth one raw point.
  • The total raw score is the number of questions answered correctly on the full test.
  • The scaled score is computed from the total number of raw points in a way that adjusts for the difficulty of the questions.

Essay test scoring

Each essay response is scored by trained, supervised scorers who follow strict scoring procedures. Written responses to each question are read and scored by two or more qualified scorers specifically trained to score the responses to that question.

Viewing and printing scores

Scores are available online in the form of printable reports. To view their scores, test takers should sign in to their accounts and go to the View Scores page. Through your account, you can view scores or print official Comprehensive Score Reports for any test taker who has tested in your state. Official score reports can also be sent to Designated Institutions by calling ETS Customer Service.

Two different reports

Scores are reported in two different types of reports: the Comprehensive Score Report and Individual Test Reports.

Comprehensive Score Report

The official HiSET Comprehensive Score Report is the cumulative record that includes the highest score for each subtest. It is updated each time the test taker takes another subtest if the score improves, and contains the following test-taker information:

  • Contact information (name)
  • ETS ID
  • Report date
  • Test date(s)
  • Whether or not the test taker has taken all five HiSET subtests
  • Whether the test taker met the three HiSET passing criteria
    • Scored at least 8 out of 20 on each subtest
    • Scored at least 2 out of 6 on the essay
    • Achieved a total scaled score on all five HiSET subtests of at least 45 out of 100
  • Cumulative record of highest scaled score(s) on each subtest
  • Indication of whether the test taker passed the HiSET exam or not


Individual Test Report

Individual Test Reports are issued each time a test taker completes a subtest. They contain the following test-taker information:

  • Name
  • ETS ID
  • Test date
  • Subtest scaled score result
  • Minimum scaled score result required to pass
  • Whether or not the test taker achieved the minimum scaled score to pass
  • Whether the test taker demonstrated college and career readiness by achieving a scaled score of at least 15 out of 20 on any subtest
  • Performance Summary by individual competency

Because there is one for each subtest, test takers will see more than one of these in their accounts.


Understanding scores

  • HiSET scores are reported on a 1–20 score scale in 1-point increments.
  • ETS has set a national passing score of 8 for each of the five subtests and a combined score of 45 to pass the HiSET exam. Note: some states set their own passing score.
  • Test takers must achieve a minimum score of 2 on the Language Arts – Writing essay to pass.
  • The HiSET exam also has a College and Career Readiness (CCR) minimum scaled score of 15 for each subtest except the Language Arts – Writing essay test. The CCR score for the Language Arts – Writing essay is 4.
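
The national criteria listed above can be expressed as a short check. The following is a minimal sketch, assuming the thresholds stated in this section (states may set their own); the function names and the subtest keys in the example are hypothetical and are not part of any ETS system.

```python
# Thresholds as stated in the text; individual states may use different values.
SUBTEST_MIN = 8    # minimum scaled score on each subtest (1-20 scale)
ESSAY_MIN = 2      # minimum Language Arts - Writing essay score (out of 6)
TOTAL_MIN = 45     # minimum combined scaled score across the five subtests
CCR_MIN = 15       # College and Career Readiness scaled score per subtest
NUM_SUBTESTS = 5


def passes_hiset(scaled_scores, essay_score):
    """Return True if all three national passing criteria are met.

    scaled_scores: dict mapping a subtest name to its best scaled score.
    essay_score: the essay score on the Language Arts - Writing subtest.
    """
    if len(scaled_scores) != NUM_SUBTESTS:
        return False  # not all five subtests have been taken
    return (
        all(score >= SUBTEST_MIN for score in scaled_scores.values())
        and essay_score >= ESSAY_MIN
        and sum(scaled_scores.values()) >= TOTAL_MIN
    )


def ccr_subtests(scaled_scores):
    """Return the subtests on which the CCR scaled score was reached."""
    return sorted(name for name, score in scaled_scores.items() if score >= CCR_MIN)


# Illustrative record with hypothetical subtest keys.
example = {"reading": 12, "writing": 9, "mathematics": 8,
           "science": 15, "social_studies": 10}
print(passes_hiset(example, essay_score=3), ccr_subtests(example))
```

Note that scoring exactly 8 on every subtest gives a total of 40, which is below the combined minimum of 45, so the per-subtest and combined criteria are checked separately.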

For more information, see the HiSET® Test-taker Bulletin.

Automated scoring may be thought of as the machine grading of constructed responses that are not amenable to approaches relying on exact matching (such as correspondence with a list of key words; Bennett & Zhang, 2016). These answers are not suitable for exact-matching approaches in that the specific form(s) and/or content of the correct answer(s) is not known in advance. Automated scoring has been employed in multiple content areas including mathematics, science, and the English language arts (e.g., for writing and speaking ability). Regardless of the content area, these scoring methods typically focus on the extraction and aggregation of features from the constructed responses.

In scoring essay responses, which is the subject of this paper, natural language processing methods are commonly used for feature extraction (e.g., grammatical error detection and word association). After feature extraction, evidence is combined, usually by assigning weights to the various response features and then aggregating the weighted feature values. Weights can be set by a panel of experts or derived by regressing the human ratings on the features. The model is then applied to predict the score that a human rater would have given to an unseen essay.
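
The aggregation step described above can be illustrated with a weighted sum. This is a minimal sketch, not the e-rater model: the feature names, weights, and intercept are hypothetical values of the kind an expert panel might set, and clipping the result to the reporting scale is an added assumption.

```python
# Hypothetical feature weights and intercept, for illustration only.
WEIGHTS = {"grammar_errors": -0.8, "vocabulary": 1.2, "organization": 1.5}
INTERCEPT = 3.0


def aggregate_score(features, weights=WEIGHTS, intercept=INTERCEPT,
                    low=1.0, high=6.0):
    """Combine feature values into one score as intercept + weighted sum.

    The result is clipped to [low, high]; the clipping is an assumption
    made here so that the output stays on the reporting scale.
    """
    raw = intercept + sum(weights[name] * value for name, value in features.items())
    return max(low, min(high, raw))


# Feature values for one unseen essay (hypothetical, already normalized).
essay = {"grammar_errors": 0.5, "vocabulary": 0.25, "organization": 0.5}
print(aggregate_score(essay))
```

When weights are instead derived by regression, the same aggregation applies; only the source of the weights changes.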

Unusual Responses in Automated Essay Scoring

Among the measurement issues that have not been fully addressed in the field of automated scoring is the detection of unusual responses. In multiple-choice testing, an unusual response typically refers to an unexpected pattern across answers for an examinee (e.g., incorrect answers to easy questions and correct answers to more difficult questions). A large body of research exists on the detection of unusual response patterns via person-fit statistics (e.g., Karabatsos, 2003; Reise & Due, 1991; Rupp, 2013). For automated scoring, however, the focus is on the characteristics of a single but complex item response rather than on a pattern across responses.

For our present purposes, unusual responses are defined to be those answers that are not suitable for machine scoring due to response characteristics that the scoring system cannot accurately handle but that experienced human raters can more often effectively process. Among the characteristics that may lead to an unusual response are off-topic content, foreign language, unnecessary text repetition, random keystrokes, extensive copying or paraphrasing from source materials, prememorized text, unusually creative content (e.g., highly metaphorical), unexpected organization or format (e.g., a poem), or text segments that cannot be processed because the automated scoring system itself is imperfect.

Several aspects of this definition merit comment. First, this definition suggests that these responses result from the interaction between the limitations of the scoring system and the behavior of examinees with respect to a type of assessment task. Such limitations may be particular to an automated scoring system or more generally linked to the state of the art. Second, the definition indicates that one common indicator of unusualness in a response should be a disagreement between machine and human scores because the response has characteristics that more frequently exceed the current capabilities of the machine than those of experienced human raters. Last, the definition makes no assumptions with respect to the examinee's intentions (e.g., purposeful attempts to “game” the system). Such an inference is not necessary for identification and handling of unusual responses.

Whereas there is a large literature on unusual response in multiple-choice testing, there is only limited research for automated essay scoring. In one investigation, Powers, Burstein, Chodorow, Fowles, and Kulich (2001) attempted to trick a machine scoring system by repeating the same paragraphs many times so as to increase text length. In another study, Higgins, Burstein, and Attali (2005) developed a method based on vocabulary patterns to detect off-topic responses. More recently, Chen, Zhang, and Bejar (in press) proposed a method to improve the prediction accuracy of off-topic responses. Finally, several recent publications addressed gameability in automated scoring (e.g., Bejar, VanWinkle, Madnani, Lewis, & Steier, 2013; Higgins & Heilman, 2014). Most of the above investigations were experiments in which scores were compared before and after some manipulation, such as increasing the complexity of the vocabulary and adding shell language that did not necessarily connect to the content.

Detection of Unusual Responses

Whether automated scoring is sensitive to atypical responses and how it processes them affects how the resulting scores can be interpreted and used. Our confidence will understandably be diminished if the scoring system fails to handle those responses effectively.

Unusual responses may be detected at two times in the processing stream. The first occasion is the prescreening stage before scoring occurs. Intentionally or not, individuals may generate answers that are nonsensical or otherwise atypical. Such atypical responses may be blank, have random keystrokes, be off topic, or have unusual linguistic structure. Advisory flags are commonly used to detect such answers during the prescreening stage. In some instances, these responses can still be processed automatically without human involvement. Examples include an empty submission, an essay with language different from the target language, or an essay consisting of a complete copy of the prompt text. In other instances, the unusual response is sent to a human rater for processing, bypassing the automated scoring system entirely.

The second occasion is the post hoc screening stage. Because automated scoring cannot fully measure some of the higher level aspects of the essay-writing construct, most consequential testing programs employ both human and machine scoring methods. Answers that are possibly inappropriate for machine scoring indicated by, for example, low human–machine agreement, can be detected and sent to additional human raters. This supplementary processing is typically triggered by a difference between the machine and human scores of more than a predetermined threshold set by policy decision.

The cost and time required for human scoring have motivated many large-volume testing programs to consider automated scoring as a primary method (e.g., Common Core State Assessments; Educational Testing Service [ETS], 2014a; Partnership for Assessment of Readiness for College and Careers, 2010; SMARTER Balanced Assessment Consortium, 2010). For this scenario to be workable, the efficacy of prescreening and post hoc methods must first be established. For prescreening, the evidence would include confirming that the advisory flags accurately identify responses likely to be inaccurately scored automatically. For post hoc screening, it might include creating means for predicting the chances that a response would have produced a sizeable human–machine disagreement had it been judged by a human rater. Specifically, if machine-scoring difficulty can be accurately predicted, human raters can be involved only if the automated scores were deemed potentially problematic. The effectiveness of this particular approach, however, has not been sufficiently studied.

Prescreening Advisory Flags in e-rater®

The automated scoring system used in this study was e-rater® v13.1 (Attali & Burstein, 2006). Developed at ETS, e-rater has been used in various testing programs for purposes ranging from classroom assessment to graduate and professional school admissions (ETS, 2014b; ETS, 2014c). In most testing programs, the e-rater scoring model is calibrated through the multiple linear regression of human ratings onto text features such as vocabulary sophistication; essay development and organization; and absence of grammar, mechanics, usage, and style errors.

The e-rater system uses several prescreening advisory flags to identify responses that the system is likely to misscore. For the present investigation, we analyzed eight advisories available for the essay tasks we examined. Each advisory indicates some questionable aspect of an essay submission (see Table 1). These questionable aspects would be expected to occur in most writing assessment programs and, as a result, comparable prescreening mechanisms have been commonly included in other automated scoring systems (Foltz, Laham, & Landauer, 1999; Page, 2003; Vantage Learning, 2012).

Advisory flag | Label | Description
#1 | Repetition | May contain too many repetitions of words, phrases, sentences, or text sections
#2 | Insufficient development | May not show enough development on topic or concept, or may provide insufficient evidence to support the claims
#3 | Off topic | May not be relevant to the assigned topic
#4 | Restatement of prompt text | Appears to be a restatement of the prompt text with few additional concepts
#5 | Too short | May be too short to be reliably automatically scored
#6 | Too long | May be too long to be reliably automatically scored
#7 | Unusual organization | May contain unusual organizational elements that cannot be recognized by the automated scoring system
#8 | Excessive number of problems | May contain unusually large amount of errors in grammar, mechanics, style, and usage, which may result in unreliable automated scores

In some of the assessment programs that use e-rater, a type of post hoc screening has also been implemented (i.e., in addition to prescreening advisories like those above). That post hoc screening entails evaluating the discrepancy between the automated score and a human rater's score for each response. When the human–machine discrepancy exceeds a given threshold, a second human rating is solicited. Whereas the specific thresholds employed in operational settings have not been reported, prior research has evaluated thresholds from 0.5 to 1.5 on a 5- or 6-point holistic scoring scale (e.g., Zhang, Breyer, & Lorenz, 2013).
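
The discrepancy rule described above amounts to a simple comparison. The sketch below assumes a threshold of 1.0, chosen only because it lies within the 0.5 to 1.5 range examined in prior research; operational thresholds are not reported, and the function names are hypothetical.

```python
def needs_second_rating(human_score, machine_score, threshold=1.0):
    """Return True when the human-machine discrepancy exceeds the threshold."""
    return abs(human_score - machine_score) > threshold


def route_for_adjudication(responses, threshold=1.0):
    """Return the IDs of responses that should go to a second human rater.

    responses: iterable of (response_id, human_score, machine_score) tuples.
    """
    return [rid for rid, human, machine in responses
            if needs_second_rating(human, machine, threshold)]


# Hypothetical responses on a 6-point scale.
batch = [("a", 4, 2.6), ("b", 4, 3.4), ("c", 3, 3.0)]
print(route_for_adjudication(batch))
print(route_for_adjudication(batch, threshold=0.5))
```

Lowering the threshold sends more responses to a second rater, which is the cost trade-off that makes the choice a policy decision.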

Purpose of This Study

This investigation evaluates the usefulness of approaches for detecting unusual responses as a step toward supporting the use of automated scoring as a primary method. Although various prescreening advisory flags have been integrated into automated essay scoring systems (e.g., Intelligent Essay Assessor, Pearson Education, 2010; IntelliMetric, Vantage Learning, 2012), little research on the utility of those advisory flags has been published. In this investigation, we studied the usefulness of such prescreening flags. In addition, as a precursor to developing a general post hoc screening method, we investigated whether the size of the human–machine discrepancy could be predicted.

Research Questions

We asked two research questions, one concerning the prescreening stage and the other, post hoc screening.

Research Question 1. Are the advisory flags at the prescreening stage useful in detecting responses that the machine is likely to score differently from human raters?

RQ1.1 Is the mean absolute human–machine discrepancy greater for flagged than for nonflagged responses?

RQ1.2 Is human–machine agreement lower for flagged than for nonflagged responses?

Research Question 2. For responses that pass through prescreening, can the size of the human–machine discrepancy be predicted well enough to support an effective postscreening mechanism?

The rationale for posing these research questions is related to supporting the use of automated scoring as a primary method. An answer to the first question will give a measure of the utility of the advisory flags by indicating whether unusual responses (in terms of lower levels of machine–human agreement) can be detected at the prescreening stage for routing to human raters. Note that lower levels of agreement for unusual responses are expected by definition. This expectation arises because the purpose of flagging is to indicate a type of response that the machine can only score with higher-than-acceptable uncertainty due to known system limitations relative to well-monitored and carefully trained human raters. As an example, an essay that contains a well-formulated argument but has many grammatical and mechanical errors would be expected to receive a lower machine score because of the machine's relative inability to judge content and quality of argument.

For essays that get through prescreening, an answer to the second question will suggest whether mechanisms could be created to predict which of those essays would have been likely to produce low human–machine disagreement had they been processed by humans. Accurately identified, such essays could bypass human review entirely, facilitating the sole use of automated scoring.

It is important to note that data relating to the identification of unusual responses is only one piece of evidence needed for evaluating the validity of automated scores for given purposes. Depending upon a test's purpose, other important evidence relates to the automated scoring model (e.g., the construct relevance of the features), generalization (i.e., the degree to which scores on one task associate with scores on other tasks from the universe), external relations (i.e., the degree to which expected relationships with indicators of different and similar constructs are observed), population invariance (i.e., the extent to which scores operate similarly across demographic groups), and impact on learning and teaching practice (Bennett & Zhang, 2016).



We used writing responses collected from four essay tasks given in two large-scale, high-stakes testing programs. In one task, examinees were asked to express their opinion on a common issue. In a second task, examinees were asked to compose a synthesis of a short article and an audio recording. The score scale for these first two tasks ranged from integer 1 to 5. In a third task, examinees were asked to evaluate an argument by assessing the claims and evidence it provided. Finally, in a fourth task, examinees were asked to construct an argument on a given issue with reasons and examples to support their views. The score scale for these two latter tasks ranged from integer 1 to 6. Included in this study were 71 different prompts for Task 1, 72 prompts for Task 2, 76 prompts for Task 3, and 76 prompts for Task 4.

Data Set

Essay responses were collected between April 2013 and March 2014. The total number of responses was approximately 871,000 for Task 1, 873,000 for Task 2, 516,000 for Task 3, and 520,000 for Task 4. Responses that were flagged accounted for about 5%, 9%, 13%, and 4% of the total sample for the four tasks, respectively. All responses were scored by at least one human rater and e-rater, while a subset was further graded by a second randomly assigned human rater (n = 40,851 for Task 1, 40,426 for Task 2, 20,355 for Task 3, and 20,153 for Task 4). For Research Question 2, a subset of the total sample was used to examine the extent to which human–machine discrepancy could be predicted. For each task, all responses were included except the ones automatically flagged by the testing programs' prescreening processes.

Data Analyses

Because advisories are intended to detect responses that the machine would not be expected to effectively score, responses with advisory flags should generate lower human–machine agreement than responses triggering no advisory flag. Consequently, for Research Question 1, we compared the means of the absolute differences in human–machine discrepancy between flagged and nonflagged responses separately for each of the advisories using Cohen's d. Absolute difference was employed because positive and negative discrepancies can cancel out, hiding large differences between scoring methods.
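
The comparison described above can be sketched from summary statistics. The form below is the common pooled-standard-deviation version of Cohen's d; the text does not state which variant was used, so this choice is an assumption, and the function names are hypothetical.

```python
import math


def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Cohen's d from group summaries, using the pooled standard deviation."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)


def mean_signed_and_absolute(discrepancies):
    """Return (mean signed discrepancy, mean absolute discrepancy)."""
    n = len(discrepancies)
    return sum(discrepancies) / n, sum(abs(d) for d in discrepancies) / n


# Why absolute values are used: opposite-signed discrepancies cancel out.
print(mean_signed_and_absolute([1.0, -1.0, 1.0, -1.0]))

# Flagged vs. nonflagged summaries (mean, SD, n) for one advisory flag,
# taken from the rounded values reported for Flag #2 in Task 1.
print(round(cohens_d(1.17, 0.92, 1019, 0.50, 0.39, 821048), 2))
```

With the rounded summary statistics reported for Advisory Flag #2 in Task 1, this form gives a value close to the tabled d of 1.71. Because the nonflagged group is so much larger, the pooled standard deviation is dominated by that group.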

We next compared the machine–human agreements between the flagged and the nonflagged groups. For this purpose, we used the Pearson correlation coefficient (r), quadratically weighted kappa (QWK), and standardized mean score difference (SMD), with the pooled variance of the machine and human scores as the denominator. The first two statistics denote human–machine agreement at the individual response level, and the last statistic (SMD) reflects distributional differences.
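
The three agreement statistics named above can be sketched in plain Python. Several choices here are assumptions rather than details given in the text: machine scores are taken as already rounded to integers for QWK, the SMD is computed as machine mean minus human mean, and its denominator is the square root of the average of the two sample variances.

```python
import math
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """QWK for integer scores, with weights (i - j)^2 / (k - 1)^2."""
    k = max_score - min_score + 1
    n = len(human)
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, machine):
        observed[h - min_score][m - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagreement = expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement += weight * observed[i][j]
            expected += weight * hist_h[i] * hist_m[j] / n
    return 1.0 - disagreement / expected


def standardized_mean_difference(human, machine):
    """(machine mean - human mean) over the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(human) + variance(machine)) / 2)
    return (mean(machine) - mean(human)) / pooled_sd


# Hypothetical scores on a 5-point scale.
human = [1, 2, 3, 4, 5]
machine = [1, 2, 3, 4, 4]
print(pearson_r(human, machine),
      quadratic_weighted_kappa(human, machine, 1, 5),
      standardized_mean_difference(human, machine))
```

Under this sign convention, a negative SMD means the machine scored lower on average than the human raters.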

For Research Question 2, a two-step procedure was employed. In the first step, we investigated the extent to which the machine had difficulty scoring responses. Scoring difficulty was evaluated in several ways, each of which employed the output from cumulative logistic regression of the human ratings on the linguistic features extracted by the machine (Haberman & Sinharay, 2010). First, we evaluated the squared correlation between human scores and the machine scores produced by this regression. A high squared correlation would suggest that the machine had produced scores that tracked human ratings well. Second, we computed the mean squared error (MSE) between machine and human scores. A low MSE would suggest a close correspondence between machine and human scores. Third, we compared the conditional dispersion (CD) of the responses with the MSE:

where H is the human score, M is the e-rater machine score resulting from the cumulative logistic regression model, E(H|M) is the expected human score given M, and P(H|M) is the probability of H given M. Because CD reflects the estimated expected response dispersion under the model and MSE depicts the observed values (though not assuming the model holds), CD and MSE values should be comparable to one another, with large differences suggesting a lack of consistency in machine scoring. For the purposes of this study, we considered an absolute difference greater than or equal to 0.10 as notable.
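
Going only by the symbol definitions above, one plausible reading is that CD is the model-implied variance of the human score around its conditional expectation, averaged over responses, while MSE is the observed mean squared difference. The following is a reconstruction under that assumption and should not be taken as the original equation; N (the number of responses), i (a response index), and h (a score category) are symbols introduced here for illustration.

```latex
\mathrm{CD} = \frac{1}{N}\sum_{i=1}^{N}\sum_{h}\bigl[h - E(H \mid M_i)\bigr]^{2}\,P(H = h \mid M_i),
\qquad
E(H \mid M_i) = \sum_{h} h\,P(H = h \mid M_i),
\qquad
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(H_i - M_i\bigr)^{2}
```

On this reading, CD depends only on the model's predicted probabilities, whereas MSE uses the observed human scores, which is why a gap between the two would signal inconsistency between the model and the data.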

Last, for each response, we used the probability produced by the regression for each of the human-score categories. The standard deviation of those probabilities was computed for each response. A response that was difficult for the machine to score would be anticipated to have a very small standard deviation, indicating that the probability of assigning a score category was approximately equal across the range. On a 5-point scale (in Tasks 1 and 2), a response for which the probabilities were equal for all categories would have a standard deviation of 0, whereas a response with a score-category probability of 1 would have a standard deviation of approximately 0.45. This latter response would have a single score category predicted with certainty, implying no scoring difficulty. Similarly, on a 6-point scale (for Tasks 3 and 4), a response with equal probabilities across all score categories would have a standard deviation of 0, and when one of the six score-categories receives a probability of 1, the standard deviation would be approximately 0.41. To summarize results across the data set, the mean and range of the standard deviations were computed, and the distribution was examined.
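
The boundary values quoted above can be checked directly. The sketch below uses the sample standard deviation (n − 1 denominator); the text does not say which formula was used, but this version reproduces the quoted maxima, whereas the population formula would give roughly 0.40 and 0.37. The function name is hypothetical.

```python
from statistics import stdev


def probability_spread(probs):
    """Sample standard deviation (n - 1 denominator) of category probabilities."""
    return stdev(probs)


# Equal probabilities across categories: hardest case for the machine.
uniform_5 = [0.2] * 5
uniform_6 = [1 / 6] * 6

# One category predicted with certainty: no scoring difficulty.
certain_5 = [1.0, 0.0, 0.0, 0.0, 0.0]
certain_6 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

for label, probs in [("uniform, 5-point", uniform_5),
                     ("uniform, 6-point", uniform_6),
                     ("certain, 5-point", certain_5),
                     ("certain, 6-point", certain_6)]:
    print(label, round(probability_spread(probs), 2))
```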

To evaluate whether the machine had trouble judging responses at different score levels, we computed both MSE within each score level and the correlation of the standard deviation of the probabilities with human scores. For purposes of computing MSE, depending on the scale of the human scores, eight or 10 score levels were created using the machine scores, running from 1 to 5 (for Tasks 1 and 2) or 1 to 6 (for Tasks 3 and 4), in increments of 0.5. The MSE between human and machine scores was computed using both the overall sample and the double human-scored sample. In addition, the MSE between the two human ratings was computed and contrasted with the machine–human MSE. This contrast was made to detect the extent to which scoring difficulty was also manifest in human ratings, which are commonly known to have limitations (e.g., scale shrinkage and inconsistency; Zhang, 2013).
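
The score-level breakdown described above can be sketched by binning responses on the machine score. How the levels were formed is not stated; the sketch assumes half-point intervals starting at the bottom of the scale, which yields eight levels on a 1 to 5 scale and 10 on a 1 to 6 scale. The function names are hypothetical.

```python
from collections import defaultdict


def count_score_levels(low, high, step=0.5):
    """Number of half-point intervals between the scale endpoints."""
    return round((high - low) / step)


def mse_by_score_level(human, machine, low, high, step=0.5):
    """Mean squared human-machine error within each machine-score level.

    Levels are intervals [low, low + step), [low + step, low + 2 * step), ...,
    with the top of the scale included in the last interval. Returns a dict
    keyed by the lower bound of each non-empty level.
    """
    n_levels = count_score_levels(low, high, step)
    squared_errors = defaultdict(list)
    for h, m in zip(human, machine):
        clipped = min(max(m, low), high)
        level = min(int((clipped - low) / step), n_levels - 1)
        squared_errors[level].append((h - m) ** 2)
    return {low + level * step: sum(errs) / len(errs)
            for level, errs in sorted(squared_errors.items())}


# Hypothetical scores on a 5-point scale.
human = [1, 2, 3, 5]
machine = [1.2, 1.4, 3.0, 5.0]
print(count_score_levels(1, 5), count_score_levels(1, 6))
print(mse_by_score_level(human, machine, low=1, high=5))
```

The same function can be applied to two human ratings in place of the human and machine scores to obtain the human–human contrast.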

In the second step, a linear regression model was calibrated to predict the size of the absolute discrepancy between human scores and the machine scores resulting from the cumulative logistic regression. The predictors were the e-rater linguistic features, advisory flags not used by the testing program for prescreening, and two more linguistic features. These two features indicated the overlap in vocabulary of the target essay with essays at different score levels. This predictive model was assessed using the Pearson correlation coefficient between the predicted and observed human–machine disagreements. A high correlation would suggest that the size of the disagreement could be predicted and possibly employed as a component in a postscreening process.
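
The second step can be sketched as an ordinary least-squares fit followed by a correlation check. This is a minimal illustration in plain Python, not the authors' implementation: the two predictors in the example (one feature value and one advisory-flag indicator) and all data values are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_ols(rows, targets):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefs, rows):
    """Predicted values for each row of predictors."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in rows]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: (feature value, flag indicator), human and machine scores.
predictors = [(0.1, 1.0), (0.4, 0.0), (0.8, 1.0), (0.3, 0.0), (0.9, 0.0), (0.5, 1.0)]
human = [3, 4, 2, 4, 5, 3]
machine = [3.4, 3.8, 3.1, 3.9, 4.2, 3.6]

abs_discrepancy = [abs(h - m) for h, m in zip(human, machine)]
coefs = fit_ols(predictors, abs_discrepancy)
r = pearson_r(predict(coefs, predictors), abs_discrepancy)
print(coefs, r)
```

In practice the correlation would be evaluated on responses not used for fitting; the in-sample value shown here overstates predictive accuracy.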

The indices described above were computed for the overall sample, as well as for the top five test-center countries/territories based on examinee volume.


Results for Research Question 1

Table 2 shows the results for comparing the mean absolute value of the human–machine discrepancy between flagged and nonflagged response groups. This comparison shows the degree to which human and machine scores disagree at the level of individual responses.

Advisory flag | Flagged N | Flagged mean (SD) | Nonflagged N | Nonflagged mean (SD) | d
Task 1
#1 | 12,051 | 0.54 (0.42) | 821,048 | 0.50 (0.39) | 0.10
#2 | 1,019 | 1.17 (0.92) | 821,048 | 0.50 (0.39) | 1.71
#3 | 21,648 | 0.56 (0.47) | 821,048 | 0.50 (0.39) | 0.15
#4 | 804 | 0.49 (0.37) | 821,048 | 0.50 (0.39) | −0.03
#5 | 147 | 1.34 (1.11) | 821,048 | 0.50 (0.39) | 2.15
#6 | 94 | 0.97 (0.56) | 821,048 | 0.50 (0.39) | 1.21
#7 | 9,915 | 0.51 (0.39) | 821,048 | 0.50 (0.39) | 0.03
#8 | 552 | 1.36 (0.93) | 821,048 | 0.50 (0.39) | 2.20
Task 2
#1 | 12,404 | 0.85 (0.66) | 798,341 | 0.78 (0.60) | 0.12
#2 | 689 | 2.28 (1.50) | 798,341 | 0.78 (0.60) | 2.49
#3 | 50,714 | 0.93 (0.77) | 798,341 | 0.78 (0.60) | 0.25
#4 | 531 | 0.80 (0.63) | 798,341 | 0.78 (0.60) | 0.03
#5 | 257 | 1.95 (1.74) | 798,341 | 0.78 (0.60) | 1.95
#7 | 4,004 | 0.86 (0.63) | 798,341 | 0.78 (0.60) | 0.13
#8 | 593 | 2.06 (1.33) | 798,341 | 0.78 (0.60) | 2.13
Task 3
#1 | 3,575 | 0.58 (0.47) | 448,756 | 0.52 (0.42) | 0.14
#2 | 2,687 | 1.16 (0.86) | 448,756 | 0.52 (0.42) | 1.51
#4 | 47,812 | 0.50 (0.40) | 448,756 | 0.52 (0.42) | −0.05
#5 | 33 | 2.12 (1.41) | 448,756 | 0.52 (0.42) | 3.81
#6 | 161 | 0.52 (0.48) | 448,756 | 0.52 (0.42) | 0.00
#7 | 7,363 | 0.52 (0.41) | 448,756 | 0.52 (0.42) | 0.00
#8 | 65 | 1.41 (1.08) | 448,756 | 0.52 (0.42) | 2.12
Task 4a
#2 | 2,591 | 0.55 (0.39) | 499,573 | 0.45 (0.35) | 0.28
#4 | 10,243 | 0.48 (0.36) | 499,573 | 0.45 (0.35) | 0.08
#5 | 52 | 0.49 (0.21) | 499,573 | 0.45 (0.35) | 0.12
#6 | 360 | 0.52 (0.45) | 499,573 | 0.45 (0.35) | 0.21
#7 | 7,017 | 0.44 (0.33) | 499,573 | 0.45 (0.35) | −0.02
#8 | 127 | 0.60 (0.41) | 499,573 | 0.45 (0.35) | 0.44

As the table indicates, Advisory Flags #2 (insufficient development) and #8 (excessive number of problems) showed practically important, albeit small, effects across all four writing tasks (d values greater than 0.20). Advisory Flag #5 (too short) produced such effects for all but Task 4, whereas Advisory Flag #6 (too long) showed effects for only Tasks 1 and 4, and Advisory Flag #3 (off topic) showed an effect for only Task 2. No practically important effects were found for Advisory Flags #1 (repetition), #4 (restatement), and #7 (unusual organization) for any task (d values smaller than 0.20).

Table 3 presents three additional agreement statistics between human and machine scores for flagged responses and nonflagged responses. Included in the table are the human–machine SMD, Pearson correlation coefficient (r), and QWK.

Advisory flag | N | QWK | SMD | r
Task 1
#1 | 12,051 | 0.65 (0.64, 0.66) | 0.15 (0.14, 0.16) | 0.70 (0.69, 0.71)
#2 | 1,019 | 0.58 (0.55, 0.61) | −0.63 (−0.68, −0.58) | 0.83 (0.81, 0.85)
#3 | 21,648 | 0.76 (0.75, 0.77) | −0.19 (−0.20, −0.18) | 0.80 (0.80, 0.81)
#4 | 804 | 0.59 (0.54, 0.64) | 0.03 (−0.03, 0.09) | 0.64 (0.59, 0.68)
#5 | 147 | 0.05 (0.01, 0.08) | −1.38 (−1.59, −1.16) | 0.28 (0.12, 0.41)
#6 | 94 | 0.13 (0.06, 0.21) | 1.69 (1.48, 1.90) | 0.51 (0.35, 0.65)
#7 | 9,915 | 0.64 (0.63, 0.65) | −0.21 (−0.22, −0.19) | 0.71 (0.70, 0.72)
#8 | 552 | 0.17 (0.13, 0.20) | −1.46 (−1.55, −1.36) | 0.49 (0.43, 0.55)
Nonflagged group | 821,048 | 0.66 (0.66, 0.66) | 0.00 (0.00, 0.01) | 0.70 (0.70, 0.70)
Task 2
#1 | 12,404 | 0.59 (0.58, 0.60) | 0.31 (0.30, 0.31) | 0.63 (0.62, 0.64)
#2 | 689 | 0.22 (0.19, 0.26) | −1.43 (−1.51, −1.35) | 0.64 (0.59, 0.68)
#3 | 50,714 | 0.62 (0.62, 0.63) | −0.24 (−0.25, −0.23) | 0.67 (0.66, 0.67)
#4 | 531 | 0.52 (0.47, 0.57) | 0.33 (0.25, 0.41) | 0.58 (0.52, 0.63)
#5 | 257 | 0.13 (0.10, 0.16) | −1.22 (−1.37, −0.23) | 0.39 (0.28, 0.49)
#7 | 4,004 | 0.57 (0.55, 0.59) | −0.09 (−0.12, −0.06) | 0.59 (0.57, 0.61)
#8 | 593 | 0.14 (0.11, 0.17) | −1.66 (−1.75, −1.57) | 0.48 (0.42, 0.54)
Nonflagged group | 798,341 | 0.59 (0.59, 0.60) | 0.02 (0.01, 0.02) | 0.62 (0.62, 0.62)
Task 3
#1 | 3,575 | 0.72 (0.70, 0.73) | 0.11 (0.09, 0.14) | 0.76 (0.75, 0.78)
#2 | 2,687 | 0.44 (0.43, 0.46) | −0.89 (−0.93, −0.86) | 0.78 (0.77, 0.80)
#4 | 47,811 | 0.72 (0.71, 0.72) | 0.03 (0.03, 0.04) | 0.76 (0.75, 0.76)
#5 | 33 | 0.05 (0.00, 0.09) | −1.95 (−2.39, −1.50) | 0.37 (0.03, 0.63)
#6 | 161 | 0.14 (0.00, 0.27) | 0.38 (0.18, 0.57) | 0.29 (0.14, 0.43)
#7 | 7,363 | 0.64 (0.62, 0.65) | 0.05 (0.03, 0.07) | 0.69 (0.68, 0.70)
#8 | 65 | 0.16 (0.08, 0.24) | −1.28 (−1.57, −1.00) | 0.45 (0.24, 0.63)
Nonflagged group | 448,756 | 0.73 (0.73, 0.73) | −0.02 (−0.02, −0.01) | 0.76 (0.76, 0.77)
Task 4
#2 | 2,591 | 0.77 (0.75, 0.78) | −0.40 (−0.43, −0.37) | 0.86 (0.85, 0.87)
#4 | 10,243 | 0.75 (0.74, 0.75) | −0.16 (−0.17, −0.15) | 0.80 (0.79, 0.81)
#5 | 52 | 0.29 (0.26, 0.33) | −1.48 (−1.85, −1.11) | 0.41 (0.13, 0.65)
#6 | 360 | 0.36 (0.28, 0.44) | 0.33 (0.21, 0.44) | 0.53 (0.45, 0.60)
#7 | 7,017 | 0.70 (0.68, 0.71) | 0.10 (0.08, 0.12) | 0.76 (0.75, 0.77)
#8 | 127 | 0.45 (0.33, 0.57) | −0.91 (−1.01, −0.70) | 0.59 (0.45, 0.72)
Nonflagged group | 499,573 | 0.77 (0.77, 0.77) | −0.04 (−0.04, −0.03) | 0.81 (0.76, 0.77)

For the SMD, all flagged groups produced values noticeably greater in magnitude than the nonflagged groups, with few exceptions (e.g., Advisory Flag #4, restatement of prompt text, in Tasks 1 and 3). Across all four tasks, the largest differences were for Advisory Flags #2 (insufficient development), #5 (too short), and #8 (excessive number of problems), each of which identified responses for which the machine gave a notably lower score on average than did the human raters. These advisories are also the ones that functioned most effectively in terms of d.

With respect to the Pearson correlation coefficient, the values for two advisory flags (#5: too short and #8: excessive number of problems) were considerably lower for the flagged groups than for the nonflagged group for all four tasks. Advisory Flag #6 (too long) showed a similar pattern except for Task 2 (where no response was flagged by the advisory). Among the remaining five advisories, smaller differences were apparent for #4 (restatement of prompt text) in Task 1 and #7 (unusual organization) in Tasks 3 and 4. The other three advisories (i.e., #1: repetition, #2: insufficient development, and #3: off topic) had machine–human agreement for the flagged group that was equal to or higher than that for the nonflagged group.

Finally, generally similar results were found for the QWK statistic for all but Advisory Flag #2 (insufficient development). For this advisory, the QWK values were considerably lower for the flagged group than for the nonflagged group for Tasks 1 to 3, a result that was not observed in the r statistic.

Results for Research Question 2

The second research question concerned whether the size of the human–machine discrepancy for a response could be predicted. This question was addressed through a two-step process, with the first step being an evaluation of the extent to which the machine had difficulty in scoring. This step was undertaken because if little difficulty was encountered, human–machine discrepancy would be rare and hard to predict.

Several indicators of machine-scoring difficulty were examined. The two middle columns in Table 4 show (a) the squared multiple correlation between human scores and the machine scores produced by the cumulative logistic regression (R²), and (b) the MSE between human and machine scores. These indices are given for the overall sample and for the top five countries/territories based on test-taker volume. For the total sample, the R² was 0.50 for Task 1, 0.40 for Task 2, 0.60 for Task 3, and 0.67 for Task 4. Except for Task 2, the R² suggests a reasonably strong relationship between machine and human scores. However, clear differences are evident among subgroup populations on this index, suggesting some variation with respect to scoring difficulty. For example, the R² ranged from as low as 0.44 to as high as 0.57 in Task 1, from 0.31 to 0.42 in Task 2, from 0.34 to 0.57 in Task 3, and from 0.36 to 0.64 in Task 4. A further examination of the countries/territories revealed that native English-speaking countries tended to have higher levels of R² than did non-English-speaking countries/territories. In contrast to R², relatively little variation was observed for MSE (which is sensitive to differences in scores for individual responses, as opposed to differences in response ordering).

Table 4

               Task 1                                      Task 2
Group          N         R²     MSE           r            N         R²     MSE           r
Total sample   854,401   0.50   0.34 (0.50)   −0.40        854,667   0.40   0.78 (1.06)   −0.07
China          263,940   0.44   0.32 (0.47)   −0.38        264,870   0.36   0.78 (1.06)   −0.05
USA            122,522   0.54   0.32 (0.48)   −0.39        121,046   0.42   0.77 (1.06)   −0.21
Korea          66,594    0.49   0.37 (0.59)   −0.38        67,761    0.38   0.79 (1.09)   0.06
India          57,948    0.46   0.41 (0.55)   −0.40        57,544    0.31   0.79 (1.03)   0.33
Japan          45,280    0.57   0.31 (0.47)   −0.37        45,739    0.42   0.80 (1.09)   −0.29

               Task 3                                      Task 4
Group          N         R²     MSE           r            N         R²     MSE           r
Total sample   507,547   0.60   0.38 (0.60)   −0.62        512,439   0.67   0.28 (0.43)   −0.53
USA            323,011   0.56   0.39 (0.64)   −0.60        325,125   0.64   0.27 (0.41)   −0.55
India          76,511    0.51   0.39 (0.56)   −0.54        77,870    0.52   0.32 (0.48)   −0.30
China          47,319    0.34   0.31 (0.45)   −0.44        48,089    0.36   0.34 (0.51)   −0.22
Korea          6,125     0.51   0.28 (0.41)   −0.54        6,252     0.52   0.29 (0.43)   −0.38
Canada         5,441     0.57   0.39 (0.59)   −0.58        5,560     0.62   0.31 (0.46)   −0.56
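The contrast between the two indices can be illustrated with a small sketch. It assumes that, with a single machine score as the predictor, the squared multiple correlation reduces to the squared Pearson correlation; the helper names and the toy scores are hypothetical and are not taken from the study.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mse(x, y):
    """Mean squared error between paired scores."""
    return mean((a - b) ** 2 for a, b in zip(x, y))


# Toy data (hypothetical): the machine preserves the ordering of the
# human scores exactly but is half a point too low on every response.
human = [2.0, 3.0, 3.0, 4.0, 5.0]
machine = [h - 0.5 for h in human]

r_squared = pearson_r(human, machine) ** 2  # ordering is perfect
error = mse(human, machine)                 # every response is off by 0.5
```

Here the squared correlation is at its maximum even though every individual score disagrees, while the MSE registers the half-point gap. This is the sense in which MSE responds to per-response differences and R² to response ordering.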

Not shown in Table 4 is a third scoring-difficulty indicator, the standard deviation of the probabilities assigned to each score level by the cumulative logistic regressions. For any given response, this value can range from 0, which reflects the most difficulty in distinguishing among score categories, to approximately 0.45 on a 5-point scale (Tasks 1 and 2) and 0.41 on a 6-point scale (Tasks 3 and 4), which reflects no difficulty. The mean standard deviations of the probabilities were 0.27 (SD = 0.03) for Task 1, 0.18 (SD = 0.04) for Task 2, 0.25 (SD = 0.02) for Task 3, and 0.25 (SD = 0.03) for Task 4, with Task 2 showing the most scoring difficulty. Figure 1 shows the distributions of the standard deviations by task. As the figure indicates, for Tasks 1, 3, and 4, most examinees fell in the upper half of the range of possible values, implying a relative lack of scoring difficulty. In contrast, for Task 2 most cases fell in the lower half of the range, indicating some level of machine-scoring difficulty.
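The endpoints of this range can be checked with a short sketch. It assumes the indicator is the sample standard deviation (n − 1 denominator) taken across the category probabilities for one response; that choice is an inference, made because it reproduces the stated maxima of about 0.45 and 0.41, whereas the population formula would give 0.40 and about 0.37. The function name is hypothetical.

```python
from statistics import stdev


def probability_sd(probs):
    """Sample SD across the score-category probabilities of one response."""
    return stdev(probs)


# Probabilities spread evenly over five categories: the model cannot
# distinguish among score levels, so the SD is at its minimum.
flat_5 = probability_sd([0.2, 0.2, 0.2, 0.2, 0.2])

# All probability on a single category: no difficulty, so the SD is
# at its maximum for the given number of categories.
peaked_5 = probability_sd([1.0, 0.0, 0.0, 0.0, 0.0])
peaked_6 = probability_sd([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

print(round(flat_5, 2), round(peaked_5, 2), round(peaked_6, 2))
```

Under this assumption, the reported task means of 0.18 to 0.27 sit between a flat distribution and one concentrated on a single score level.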

We also examined scoring difficulty as a function of score level. Two indices were evaluated. One was the Pearson correlation coefficient between the human scores and the standard deviation of the probabilities for the score categories yielded by the cumulative logistic regression. This index is shown in the r columns of Table 4.

For all but Task 2, based on the sample as a whole, r ranged from −0.62 to −0.40, indicating a moderate relationship between machine-scoring difficulty and score level. Because a lower standard deviation of the probabilities signals more difficulty, the negative sign means that the higher the score, the greater the difficulty. This index also had negative values for all of the top five subpopulations, though for some countries (e.g., China, Korea, India), the relationship was weaker than for others (e.g., USA and Canada). For Task 2, however, the association of scoring difficulty with score level in the overall population was negligible and varied considerably from one subpopulation to the next.

The second index used to investigate the association of scoring difficulty with score level was conditional MSE (based on the machine scores; Table 5), an individual-level indicator of machine–human disagreement. Consistent with the correlational analysis above, the largest MSEs were at the upper end of the scale on the scoring rubrics (i.e., the 3.5-to-4.0 and 4.0-to-4.5 ranges for Task 1 and the 4.5-to-5.0 and 5.0-to-5.5 ranges for Tasks 3 and 4). For Task 2, however, the largest MSEs occurred around the middle of the score scale: the 2.0-to-2.5, 2.5-to-3.0, and 3.0-to-3.5 ranges. This curvilinear result is in line with the limited correlation between the standard deviation of the probabilities and score level reported above for Task 2.

Table 5

                Task 1                                    Task 2
Machine score   N         MSE                             N         MSE
[1.0, 1.5)      6,725     0.18 (0.29)   0.17 (0.10)       24,297    0.30 (0.66)   0.32 (0.17)
[1.5, 2.0)      14,048    0.30 (0.41)   0.30 (0.01)       47,569    0.65 (0.81)   0.73 (0.08)
[2.0, 2.5)      45,244    0.32 (0.43)   0.32 (0.02)       97,544    0.87 (0.95)   0.89 (0.02)
[2.5, 3.0)      150,423   0.28 (0.46)   0.29 (0.02)       181,903   0.91 (1.14)   0.90 (0.01)
[3.0, 3.5)      303,175   0.30 (0.49)   0.32 (0.04)       254,056   0.82 (1.16)   0.84 (0.02)
[3.5, 4.0)      232,939   0.41 (0.52)   0.40 (0.00)       187,485   0.72 (1.02)   0.75 (0.03)
[4.0, 4.5)      90,171    0.43 (0.57)   0.39 (0.01)       57,000    0.61 (0.87)   0.61 (0.05)
[4.5, 5.0]      11,676    0.36 (0.64)   0.29 (0.06)       4,813     0.47 (0.85)   0.39 (0.07)

                Task 3                                    Task 4
Machine score   N         MSE                             N         MSE
[1.0, 1.5)      4,770     0.26 (0.36)   0.16 (0.09)       5,373     0.25 (0.33)   0.14 (0.10)
[1.5, 2.0)
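The conditional MSE described above can be sketched as a simple binning step: group responses by half-point machine-score range, as in Table 5, and average the squared human–machine difference within each range. This is an illustrative sketch only. The function name, its parameters, and the toy data are hypothetical, and it assumes every machine score lies within the scale limits.

```python
from collections import defaultdict


def conditional_mse(human, machine, lo=1.0, hi=5.0, width=0.5):
    """Mean squared human-machine difference within each machine-score bin.

    Bins are half-open [start, start + width), except the top bin,
    which also includes the scale maximum `hi`.
    """
    n_bins = int(round((hi - lo) / width))
    sums = defaultdict(float)
    counts = defaultdict(int)
    for h, m in zip(human, machine):
        # Clamp so that a machine score equal to `hi` lands in the top bin.
        idx = min(int((m - lo) / width), n_bins - 1)
        sums[idx] += (h - m) ** 2
        counts[idx] += 1
    return {
        (lo + i * width, lo + (i + 1) * width): sums[i] / counts[i]
        for i in sorted(counts)
    }


# Toy data (hypothetical): three responses on a 5-point scale.
human_scores = [3, 4, 5]
machine_scores = [3.2, 3.4, 5.0]
by_bin = conditional_mse(human_scores, machine_scores)
```

For a 6-point scale such as the one used for Tasks 3 and 4, the same function could be called with a different `hi` value; empty bins are simply omitted from the result.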
