Difference between revisions of "Automated Essay Scoring"

From Penn Center for Learning Analytics Wiki
(Added Litman et al. (2021))
 
(17 intermediate revisions by 2 users not shown)

Latest revision as of 11:33, 4 July 2022

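Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring page]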

  • E-rater, an automated scoring model for evaluating English essays
  • E-rater gave significantly higher scores than human raters to 11th grade essays written by Hispanic and Asian-American students
  • E-rater gave significantly higher scores than human raters to TOEFL essays (independent task) written by speakers of Chinese and Korean
  • E-rater correlated poorly with human raters and gave higher scores than human raters to GRE essays (both issue and argument prompts) written by Chinese speakers
  • E-rater was comparably accurate for male and female students on 11th grade, TOEFL, and GRE essays
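
Findings like these rest on two per-group quantities: how far machine scores sit above or below human scores, and how well the two correlate. The following is a minimal sketch of that comparison, not the procedure used in the study. The function names, the toy records, and the choice to standardize by the overall standard deviation of human scores are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def subgroup_report(records):
    """records: iterable of (group, human_score, machine_score) tuples.

    Returns, per group, the machine-minus-human mean difference divided by
    the overall SD of human scores (an assumed convention), and the
    human-machine correlation within that group.
    """
    records = list(records)
    overall_sd = stdev([human for _, human, _ in records])
    by_group = defaultdict(list)
    for group, human, machine in records:
        by_group[group].append((human, machine))

    report = {}
    for group, pairs in by_group.items():
        humans = [h for h, _ in pairs]
        machines = [m for _, m in pairs]
        report[group] = {
            "n": len(pairs),
            "std_diff": (mean(machines) - mean(humans)) / overall_sd,
            "r": pearson(humans, machines),
        }
    return report


# Hypothetical toy data: group "B" is scored higher by the machine.
records = [
    ("A", 3, 3), ("A", 4, 4), ("A", 5, 5), ("A", 2, 2),
    ("B", 3, 4), ("B", 4, 5), ("B", 2, 3), ("B", 5, 5),
]
report = subgroup_report(records)
for group, stats in report.items():
    print(group, stats)
```

A positive `std_diff` means the machine scores the group higher than human raters do; a low `r` means the machine ranks that group's essays differently from human raters.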


Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]

  • A later version of e-rater, an automated scoring model for evaluating English essays
  • E-rater gave significantly lower scores than human raters to African-American students’ written responses to the GRE issue prompt
  • E-rater gave higher scores to Chinese speakers (Mainland China, Taiwan, Hong Kong) and Korean speakers on TOEFL essays (independent prompt)
  • E-rater gave lower scores to Arabic, Hindi, and Spanish speakers on their written responses to the TOEFL independent prompt
  • E-rater correlated comparably well with human raters on TOEFL and GRE essays written by male and female students



Ramineni & Williamson (2018) [https://onlinelibrary.wiley.com/doi/10.1002/ets2.12192 pdf]

  • Revised automated scoring engine for assessing GRE essays
  • E-rater gave African American test-takers significantly lower scores than human raters did on written responses to argument prompts
  • Shorter essays written by African American test-takers were more likely to receive lower scores for showing weakness in content and organization



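Wang et al. (2018) [https://www.researchgate.net/publication/336009443_Monitoring_the_performance_of_human_and_automated_scores_for_spoken_responses pdf]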

  • Automated scoring model for evaluating English spoken responses
  • SpeechRater gave significantly lower scores than human raters to German students
  • SpeechRater scored students from China higher than human raters did, with H1-rater scores higher than the mean


Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]

  • Automated essay scoring models inferring text evidence usage
  • All algorithms studied have less than 1% of error explained by whether the student is female or male, whether the student is Black, or whether the student receives free/reduced-price lunch
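
One way to read "share of error explained by group membership" is as the fraction of variance in scoring error (machine score minus human score) that lies between the two groups. That fraction equals the R² from regressing error on a binary group flag. The sketch below computes it under that assumption; it is not necessarily the analysis the authors ran, and the function name and toy data are hypothetical.

```python
from statistics import mean


def share_of_error_explained(errors, in_group):
    """Fraction of scoring-error variance accounted for by a binary group flag.

    errors:   machine score minus human score, one value per student
    in_group: True/False flags of the same length (e.g. student is female)
    Computed as between-group sum of squares over total sum of squares.
    """
    grand_mean = mean(errors)
    total_ss = sum((e - grand_mean) ** 2 for e in errors)
    if total_ss == 0:
        return 0.0  # no variation in error, so nothing to explain

    between_ss = 0.0
    for flag in (True, False):
        members = [e for e, g in zip(errors, in_group) if g == flag]
        if members:
            between_ss += len(members) * (mean(members) - grand_mean) ** 2
    return between_ss / total_ss


# Hypothetical toy data: errors for eight students, four in each group.
errors = [1, -1, 0, 2, -2, 0, 1, -1]
flags = [True, True, True, True, False, False, False, False]
share = share_of_error_explained(errors, flags)
print(f"{share:.1%} of error variance is between groups")
```

Under this reading, a value below 0.01 would correspond to the "less than 1%" threshold reported above.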