Using generalizability theory to investigate the variability and reliability of EFL composition scores by human raters and e-rater



Keywords:

EFL writing assessment, generalizability theory, score variability, score reliability, automated writing evaluation


Using generalizability theory (G-theory) as a theoretical framework, this study aimed to investigate the variability and reliability of holistic scores assigned by human raters and e-rater to the same EFL essays. Eighty argumentative essays written on two different topics by tertiary-level Turkish EFL students were scored holistically by e-rater and by eight human raters who had received detailed rater training. The results showed that e-rater and the human raters assigned significantly different holistic scores to the same EFL essays. G-theory analyses revealed that the human raters assigned considerably inconsistent scores to the same EFL essays even though they had received detailed rater training, and that more reliable ratings were obtained when e-rater was integrated into the scoring procedure. Some implications for EFL writing assessment practices are given.
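
As a rough illustration of the kind of G-theory analysis summarized above, the following Python sketch estimates variance components and reliability coefficients for a simplified, fully crossed persons-by-raters design. It is a sketch under stated assumptions, not the authors' procedure: the one-facet design (the topic facet is ignored), the function names, and the dictionary keys are all hypothetical choices made for illustration.

```python
from statistics import mean


def g_study_p_x_r(scores):
    """Estimate variance components for a fully crossed persons x raters design.

    `scores` is a list of rows, one row per essay (person), one column per rater.
    Assumption: one facet only (raters); the study's topic facet is not modeled.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    rater_means = [mean(row[j] for row in scores) for j in range(n_r)]

    # Sums of squares for persons, raters, and the residual (interaction + error).
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pr = ss_total - ss_p - ss_r

    # Mean squares.
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square solutions; negative estimates are set to zero.
    return {
        "p": max((ms_p - ms_pr) / n_r, 0.0),
        "r": max((ms_r - ms_pr) / n_p, 0.0),
        "pr,e": max(ms_pr, 0.0),
    }


def g_coefficient(components, n_raters):
    """Generalizability coefficient for relative decisions with n_raters raters."""
    rel_error = components["pr,e"] / n_raters
    denom = components["p"] + rel_error
    return components["p"] / denom if denom > 0 else 0.0


def phi_coefficient(components, n_raters):
    """Dependability (phi) coefficient for absolute decisions with n_raters raters."""
    abs_error = (components["r"] + components["pr,e"]) / n_raters
    denom = components["p"] + abs_error
    return components["p"] / denom if denom > 0 else 0.0
```

Calling `g_coefficient` and `phi_coefficient` with different values of `n_raters` on the same estimated components corresponds to a decision study: it shows how the projected reliability would change if more or fewer raters scored each essay.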





How to cite

Sari, E., & Han, T. (2022). Using generalizability theory to investigate the variability and reliability of EFL composition scores by human raters and e-rater. Porta Linguarum Revista Interuniversitaria De Didáctica De Las Lenguas Extranjeras, (38), 27–45.