Using generalizability theory to investigate the variability and reliability of EFL composition scores by human raters and e-rater
DOI: https://doi.org/10.30827/portalin.vi38.18056
Keywords: EFL writing assessment, generalizability theory, score variability, score reliability, automated writing evaluation
Abstract
Using generalizability theory (G theory) as its theoretical framework, this study investigated the variability and reliability of the holistic scores assigned by human raters and e-rater to the same English as a foreign language (EFL) essays. Eighty argumentative essays written on two different prompts by Turkish tertiary-level EFL students were scored holistically by e-rater and by eight human raters who had received detailed rater training. The results showed that e-rater and the human raters assigned significantly different holistic scores to the same EFL essays. The G theory analyses revealed that the human raters assigned considerably inconsistent scores to the same EFL essays even though they had received detailed rater training, and that more reliable ratings were obtained when e-rater was integrated into the scoring procedure. Some implications for EFL writing assessment practices are offered.
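As background for the G theory results summarized above, the following is a minimal worked formula (an illustration assumed here, not reported by the study) for the generalizability coefficient in a fully crossed persons-by-raters ($p \times r$) design, where $\sigma^2_p$ is the variance component for essays (persons), $\sigma^2_{pr,e}$ is the confounded person-by-rater and residual variance, and $n_r$ is the number of raters included in the decision study:

\[
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_{pr,e}}{n_r}}
\]

Under this formula, the relative error term shrinks as the number of consistent raters increases, which illustrates why integrating a highly consistent rater such as e-rater into the scoring procedure can yield more dependable scores, in line with the pattern the abstract reports.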
License
Authors who publish in this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of its authorship and of its initial publication in this journal.
- Authors may enter into separate, additional agreements for the non-exclusive distribution of the version of the work published in the journal (for example, depositing it in an institutional repository or publishing it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to share their work online (for example, in institutional repositories or on their own website) before and during the submission process, as this can lead to productive exchanges as well as earlier and greater citation of the published work (see The Effect of Open Access).