National Origin or National Location

From Penn Center for Learning Analytics Wiki
Latest revision as of 19:13, 1 September 2024

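Švábenský et al. (2024) [pdf: https://educationaldatamining.org/edm2024/proceedings/2024.EDM-posters.82/2024.EDM-posters.82.pdf]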

  • Classification models predicting whether a student's grade will be below average (“unsuccessful”) or equal to or above average (“successful”)
  • Investigated bias related to university students' regional background in the Philippines
  • Demographic groups defined by the location (one of five) from which students accessed online courses in Canvas
  • Bias evaluation using AUC, weighted F1-score, and MADD showed consistent results across all groups; no unfairness was observed
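Two of the fairness metrics named above can be sketched in a few lines. This is a minimal, stdlib-only illustration, not Švábenský et al.'s code: AUC is computed per group using the Mann-Whitney formulation, and MADD is assumed to follow its histogram-based definition (the sum over probability bins of the absolute difference between two groups' normalized prediction densities, ranging from 0 for identical distributions to 2 for non-overlapping ones). The function names, the `n_bins` default, and the toy inputs are hypothetical.

```python
from collections import defaultdict


def auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_group_auc(groups, labels, scores):
    """AUC computed separately for each demographic group (e.g. access location)."""
    split = defaultdict(lambda: ([], []))
    for g, y, s in zip(groups, labels, scores):
        split[g][0].append(y)
        split[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in split.items()}


def madd(probs_a, probs_b, n_bins=100):
    """Model Absolute Density Distance between two groups' predicted probabilities.

    Assumption: histogram-based definition; 0 = identical densities, 2 = no overlap.
    """
    def density(probs):
        counts = [0] * n_bins
        for p in probs:
            counts[min(int(p * n_bins), n_bins - 1)] += 1  # p == 1.0 goes to the last bin
        return [c / len(probs) for c in counts]

    return sum(abs(a - b) for a, b in zip(density(probs_a), density(probs_b)))


# Hypothetical toy data: two access locations, binary "successful" labels.
groups = ["loc1", "loc1", "loc1", "loc1", "loc2", "loc2", "loc2", "loc2"]
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]
print(per_group_auc(groups, labels, scores))
print(madd(scores[:4], scores[4:], n_bins=10))
```

Comparing per-group AUCs checks whether predictive accuracy differs across groups, while MADD compares the full distributions of predicted probabilities, so the two can disagree; a weighted F1-score per group would be computed over the same split.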


Li et al. (2021) [pdf: https://arxiv.org/pdf/2103.15212.pdf]

  • Model predicting student achievement on the PISA standardized examination
  • The U.S.-trained model was less accurate for students from countries with lower national development scores (e.g., Indonesia, Vietnam, Moldova)


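Wang et al. (2018) [pdf: https://www.researchgate.net/publication/336009443_Monitoring_the_performance_of_human_and_automated_scores_for_spoken_responses]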

  • Automated scoring model for evaluating English spoken responses
  • SpeechRater gave significantly lower scores than human raters to German students
  • SpeechRater gave higher scores than human raters to Chinese students, with H1-rater scores higher than the mean
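Subgroup differences between automated and human scores like these are often summarized as a standardized mean difference (SMD). The sketch below assumes one common form, (mean machine score - mean human score) divided by the pooled standard deviation sqrt((s_m² + s_h²) / 2). Whether Wang et al. used exactly this formula is an assumption, and the function names and sample scores are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def smd(machine, human):
    """Standardized mean difference: positive means the machine scores higher.

    Needs at least two scores per side and non-zero pooled spread.
    """
    pooled_sd = math.sqrt((stdev(machine) ** 2 + stdev(human) ** 2) / 2)
    return (mean(machine) - mean(human)) / pooled_sd


def smd_by_group(groups, machine, human):
    """SMD computed within each subgroup (e.g. test-taker country)."""
    split = defaultdict(lambda: ([], []))
    for g, m, h in zip(groups, machine, human):
        split[g][0].append(m)
        split[g][1].append(h)
    return {g: smd(ms, hs) for g, (ms, hs) in split.items()}


# Hypothetical spoken-response scores on a 1-4 scale.
groups = ["G1"] * 4 + ["G2"] * 4
machine = [2.0, 2.5, 3.0, 2.5, 3.5, 3.0, 3.5, 4.0]
human = [3.0, 3.0, 3.5, 3.0, 3.0, 2.5, 3.0, 3.5]
print(smd_by_group(groups, machine, human))
```

A negative SMD for a group corresponds to the machine scoring that group lower than humans do, and a positive SMD to the reverse.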


Ogan et al. (2015) [pdf: https://link.springer.com/content/pdf/10.1007/s40593-014-0034-8.pdf]

  • Multi-national models predicting learning gains from students' help-seeking behavior
  • Models built only on U.S. data or on combined data performed extremely poorly for Costa Rica
  • Models generally performed better when built on and applied to the same country, except for the Philippines, where the model built on Philippine data was slightly outperformed by the model built on U.S. data


Bridgeman et al. (2012) [pdf: https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true]

  • A later version of e-rater, an automated scoring model for evaluating English essays
  • E-rater gave higher scores than human raters to Chinese speakers (Mainland China, Taiwan, Hong Kong) and Korean speakers on the TOEFL independent-prompt essay
  • E-rater gave lower scores to Arabic, Hindi, and Spanish speakers on their written responses to the TOEFL independent prompt


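Bridgeman et al. (2009) [page: https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring]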

  • E-rater, an automated scoring model for evaluating English essays
  • E-rater gave significantly higher scores than human raters to TOEFL independent-task essays written by speakers of Chinese and Korean
  • For GRE essays written by Chinese speakers (both issue and argument prompts), e-rater correlated poorly with human raters and gave higher scores than they did