Difference between revisions of "Automated Essay Scoring"

From Penn Center for Learning Analytics Wiki
Jump to navigation Jump to search
Line 13: Line 13:
* A later version of automated scoring models for evaluating English essays, or e-rater
* A later version of automated scoring models for evaluating English essays, or e-rater
* E-rater gave significantly lower score than human rater when assessing African-American students’ written responses to issue prompt in GRE
* E-rater gave significantly lower score than human rater when assessing African-American students’ written responses to issue prompt in GRE
* E-rater gave  better scores for test-takers from Chinese speakers (Mainland China, Taiwan, Hong Kong) and Korean speakers when assessing TOEFL (independent prompt) and  GRE essays
* E-rater gave  better scores for test-takers from Chinese speakers (Mainland China, Taiwan, Hong Kong) and Korean speakers when assessing TOEFL (independent prompt)  
* E-rater gave lower scores for Arabic, Hindi, and Spanish speakers when assessing their written responses to independent prompt in TOEFL
* E-rater gave lower scores for Arabic, Hindi, and Spanish speakers when assessing their written responses to independent prompt in TOEFL
* E-Rater system correlated comparably well with human rater when assessing TOEFL and GRE essays written by male and female students
* E-Rater system correlated comparably well with human rater when assessing TOEFL and GRE essays written by male and female students

Revision as of 07:33, 19 May 2022

Bridgeman et al. (2009) page

  • Automated scoring models for evaluating English essays, or e-rater
  • E-Rater gave significantly better scores than human rater for 11th grade essays written by Hispanic students and Asian-American students than White students
  • E-Rater gave significantly better scores than human rater for TOEFL essays (independent task) written by speakers of Chinese and Korean
  • E-Rater correlated poorly with human rater and gave better scores than human rater for GRE essays (both issue and argument prompts) written by Chinese speakers
  • E-Rater system performed comparably accurately for male and female students when assessing their 11th grade essays, TOEFL, and GRE writings


Bridgeman et al. (2012) pdf

  • A later version of automated scoring models for evaluating English essays, or e-rater
  • E-rater gave significantly lower score than human rater when assessing African-American students’ written responses to issue prompt in GRE
  • E-rater gave better scores for test-takers from Chinese speakers (Mainland China, Taiwan, Hong Kong) and Korean speakers when assessing TOEFL (independent prompt)
  • E-rater gave lower scores for Arabic, Hindi, and Spanish speakers when assessing their written responses to independent prompt in TOEFL
  • E-Rater system correlated comparably well with human rater when assessing TOEFL and GRE essays written by male and female students



Ramineni & Williamson (2018) pdf

  • Revised automated scoring engine for assessing GSE essay
  • E-rater gave African American test-takers significantly lower scores than human raters when assessing their written responses to argument prompts
  • The shorter essays written by African American test-takers were more likely to receive lower scores as showing weakness in content and organization



Wang et al. (2018) pdf

  • Automated scoring model for evaluating English spoken responses
  • SpeechRater gave a significantly lower score than human raters for German
  • SpeechRater scored in favor of Chinese group, with H1-rater scores higher than mean