Difference between revisions of "Automated Essay Scoring"

From Penn Center for Learning Analytics Wiki
Jump to navigation Jump to search
Line 1: Line 1:
Bridgeman, Trapani, and Attali (2009) [https://d1wqtxts1xzle7.cloudfront.net/52116920/AERA_NCME_2009_Bridgeman-with-cover-page-v2.pdf?Expires=1652902797&Signature=PiItDCa9BN8Mey1wuXaa2Lo0uCVp4I245Xx14GPKAthX7YREEF2wT8HEjmwwiSL~rn9tB21kL6zYIFrL3b44oyHfw5ywE1GQmGSeLCcK7f0WyfQUzYDXbQqWzJCInX9t3QvPKN05XK37iAn7SrEI5iN2HQcYmeF3B0fhtLdszf2-5TtPfT1dNwtdo8A30Z4xxtt~gIBQwXYtNhtPbv3idaZPUZe3lZf6kGGweKj3q9-yuyPc8VkXg7Tc72AOUlQqpjb2TDPH7vze1xLbg3Q1~YxYJnHWhvIINkbAadTLitQZvKhJOmV-it2pNEqqzrnwwl5~gqcgX180xVd89z81iQ__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA pdf]
Bridgeman, Trapani, and Attali (2009) [https://d1wqtxts1xzle7.cloudfront.net/52116920/AERA_NCME_2009_Bridgeman-with-cover-page-v2.pdf?Expires=1652902797&Signature=PiItDCa9BN8Mey1wuXaa2Lo0uCVp4I245Xx14GPKAthX7YREEF2wT8HEjmwwiSL~rn9tB21kL6zYIFrL3b44oyHfw5ywE1GQmGSeLCcK7f0WyfQUzYDXbQqWzJCInX9t3QvPKN05XK37iAn7SrEI5iN2HQcYmeF3B0fhtLdszf2-5TtPfT1dNwtdo8A30Z4xxtt~gIBQwXYtNhtPbv3idaZPUZe3lZf6kGGweKj3q9-yuyPc8VkXg7Tc72AOUlQqpjb2TDPH7vze1xLbg3Q1~YxYJnHWhvIINkbAadTLitQZvKhJOmV-it2pNEqqzrnwwl5~gqcgX180xVd89z81iQ__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA pdf]


* E-rater gave significantly higher score for 11th grade essays written by Asian American and Hispanic students, particularly, Hispanic female students
* Automated scoring models for evaluating English essays, or e-rater
* The score difference between human rater and e-rater was significantly smaller for 11th grade essays written by White and African American students.
* E-Rater gave significantly better scores for 11th grade essays written by Hispanic students and Asian-American students than White students
* E-rater gave slightly lower score for GRE essays (argument and issue) written by Black test-takers while e-rated scores were higher for Asian test-takers in the U.S
* E-Rater gave significantly better scores for TOEFL essays (independent task) written by speakers of Chinese and Korean
 
* E-Rater correlated poorly with human rater and give better scores for GRE essays (both issue and argument prompts) written by Chinese speakers
* E-rater gave significantly higher score for students from China and South Korea than 14 other countries when assessing independent writing task in Test of English as a Foreign Language (TOEFL)
* E-Rater system performed comparably accurately for male and female students when assessing their 11th grade essays, TOEFL, and GRE writings
* E-rater gave slightly higher scores for GRE analytical writing, both argument and issue prompts, by students from China whose written responses tended to be the longest and below average on grammar, usage and mechanics
 
* E-rater performed accurately for male and female students when assessing 11th grade English essays and independent writing task in Test of English as a Foreign Language
* While feature-level score differences were identified across gender and ethnic groups (e.g. e-rater gave better scores for word length and vocabulary level but less on grammar and mechanics when grading 11th grade essays written by Asian American female students), the authors called for larger samples  to confirm the findings
 
 


Bridgeman, Trapani, and Attali (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]
Bridgeman, Trapani, and Attali (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]

Revision as of 15:19, 18 May 2022

Bridgeman, Trapani, and Attali (2009) pdf

  • Automated scoring models for evaluating English essays, or e-rater
  • E-Rater gave significantly better scores for 11th grade essays written by Hispanic students and Asian-American students than White students
  • E-Rater gave significantly better scores for TOEFL essays (independent task) written by speakers of Chinese and Korean
  • E-Rater correlated poorly with human rater and give better scores for GRE essays (both issue and argument prompts) written by Chinese speakers
  • E-Rater system performed comparably accurately for male and female students when assessing their 11th grade essays, TOEFL, and GRE writings

Bridgeman, Trapani, and Attali (2012) pdf

  • A later version of automated scoring models for evaluating English essays, or e-rater
  • E-rater gave particularly lower score for African-American, and American-Indian males, when assessing written responses to issue prompt in GRE
  • The score was significantly lower when e-rater was assessing GRE written responses to argument prompt by African-American test-takers, both males and females.
  • E-rater gave slightly higher scores for test-takers from Chinese speakers (Mainland China, Taiwan, Hong Kong) and Korean speakers when assessing written responses to independent prompt in Test of English as a Foreign Language (TOEFL)
  • E-rater gave slightly lower scores for Arabic, Hindi, and Spanish speakers when assessing their written responses to independent prompt in TOEFL
  • E-rater gave significantly higher scores for test-takers from Mainland China than from Taiwan, Korea and Japan when assessing their GRE writings which tended to be below average on grammar, usage, and mechanics but longest response
  • The score difference between human rater and e-rater was marginal when written responses to GRE issue prompt by male and female test-takers were compared
  • The difference in score was significantly greater when assessing written responses to GRE argument prompt, as e-rater gave lower score for male test-takers, particularly for African American, American Indian, and Hispanic males, when assessing written responses to GRE argument prompt


Ramineni & Williamson (2018) pdf

  • Revised automated scoring engine for assessing GSE essay
  • E-rater gave African American test-takers significantly lower scores than human raters when assessing their written responses to argument prompts
  • The shorter essays written by African American test-takers were more likely to receive lower scores as showing weakness in content and organization



Wang et al. (2018) pdf

  • Automated scoring model for evaluating English spoken responses
  • SpeechRater gave a significantly lower score than human raters for German
  • SpeechRater scored in favor of Chinese group, with H1-rater scores higher than mean