Using generalizability theory to investigate the variability and reliability of EFL composition scores by human raters and e-rater
DOI: https://doi.org/10.30827/portalin.vi38.18056
Keywords: EFL writing assessment, generalizability theory, scoring variability, scoring reliability, automated writing evaluation (AWE)
Abstract
Using generalizability theory (G-theory) as a theoretical framework, this study investigated the variability and reliability of holistic scores assigned by human raters and e-rater to the same EFL essays. Eighty argumentative essays written on two different topics by tertiary-level Turkish EFL students were scored holistically by e-rater and by eight human raters who had received detailed rater training. The results showed that e-rater and the human raters assigned significantly different holistic scores to the same EFL essays. G-theory analyses revealed that the human raters assigned considerably inconsistent scores to the same essays despite their detailed rater training, and that more reliable ratings were attained when e-rater was integrated into the scoring procedure. Implications for EFL writing assessment practices are discussed.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V. 2.0. The Journal of Technology, Learning and Assessment, 4(3), 3-30. https://doi.org/10.1002/j.2333-8504.2004.tb01972.x
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Baker, B. A. B. (2010). Playing with the stakes: A consideration of an aspect of the social context of a gatekeeping writing assessment. Assessing Writing, 15, 133–153. http://dx.doi.org/10.1016/j.asw.2010.06.002
Barkaoui, K. (2010). Do ESL essay raters’ evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1), 31-57.
Bauer, M. I., & Zapata-Rivera, D. (2020). Cognitive foundations of automated scoring. In D. Yan, A. A. Rupp, & P. W. Foltz (Eds.), Handbook of automated scoring (pp. 13-28). Chapman and Hall/CRC.
Blood, I. (2011). Automated essay scoring: A literature review. Studies in Applied Linguistics and TESOL, 11(2), 40-64.
Brennan, R. L. (2001). Generalizability theory: Statistics for social science and public policy. New York: Springer-Verlag.
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27-40. https://doi.org/10.1080/08957347.2012.635502
Briesch, A. M., Swaminathan, H., Welsh, M., & Chafouleas, S. M. (2014). Generalizability theory: A practical guide to study design, implementation, and interpretation. Journal of School Psychology, 52(1), 13-35. http://dx.doi.org/10.1016/j.jsp.2013.11.008
Brown, H. D. (2004). Language assessment: Principles and classroom practice. New York, NY: Pearson/Longman.
Burstein, J., Braden‐Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., ... & Wolff, S. (1998). Computer analysis of essay content for automated score prediction: A prototype automated scoring system for GMAT analytical writing assessment essays. ETS Research Report Series, 1998(1), i-67. http://dx.doi.org/10.1002/j.2333-8504.1998.tb01764.x
Chang, Y. (2002). EFL teachers' responses to L2 writing. Reports Research (143). Retrieved from http://files.eric.ed.gov/fulltext/ED465283.pdf on March 23, 2015
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater’s performance on TOEFL essays (Research report No. 73). Princeton, NJ: Educational Testing Service.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.
Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students’ written work. International Journal of English Studies, 10(2), 121-142. https://doi.org/10.6018/ijes/2010/2/119231
Elliot, S. (2001). Applying IntelliMetric Technology to the scoring of 3rd and 8th grade standardized writing assessments (RB-524). Newtown, PA: Vantage Learning.
Elorbany, R., & Huang, J. (2012). Examining the impact of rater educational background on ESL writing assessment: A generalizability theory approach. Language and Communication Quarterly, 1(1), 2-24.
Engber, C. A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4(2), 139-155. https://doi.org/10.1016/1060-3743(95)90004-7
Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with Latent Semantic Analysis. Discourse Processes, 25(2-3), 285-307. https://doi.org/10.1080/01638539809545029
Güler, N., Uyanık, G. K., & Teker, G. T. (2012). Genellenebilirlik kuramı [Generalizability theory]. Ankara: Pegem Akademi Yayınları.
Han, T. (2013). The impact of rating methods and rater training on the variability and reliability of EFL students' classroom-based writing assessments in Turkish universities: An investigation of problems and solutions (Doctoral dissertation). Atatürk University, Erzurum, Turkey.
Harsch, C., & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach. Assessing Writing, 17(4), 228-250.
Heaton, J. B. (2003). Writing English language tests. USA: Longman.
Hoang, G. T. L., & Kunnan, A. J. (2016). Automated essay evaluation for English language learners: A case study of MY Access. Language Assessment Quarterly, 13(4), 359-376. https://doi.org/10.1080/15434303.2016.1230121
Hockly, N. (2019). Automated writing evaluation. ELT Journal, 73(1), 82-88. https://doi.org/10.1093/elt/ccy044
Homburg, T. J. (1984). Holistic evaluation of ESL composition: Can it be validated objectively? TESOL Quarterly, 18(1), 87-108.
Huang, J. (2008). How accurate are ESL students’ holistic writing scores on large-scale assessments? - A generalizability theory approach. Assessing Writing, 13(3), 201-218. http://dx.doi.org/10.1016/j.asw.2008.10.002
Huang, J. (2012). Using generalizability theory to examine the accuracy and validity of large scale ESL writing assessment. Assessing Writing, 17(3), 123-139. http://dx.doi.org/10.1016/j.asw.2011.12.003
Huang, S. J. (2014). Automated versus Human Scoring: A Case Study in an EFL Context. Electronic Journal of Foreign Language Teaching, 11.
Hyland, K. (2003). Second language writing. New York, NY: Cambridge University Press.
James, C. L. (2006). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing, 11, 167-178. https://doi.org/10.1016/j.asw.2007.01.002
Johnson, R.L., Penny, J.A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. New York: The Guilford Press.
Kieffer, K. M. (1998, April). Why generalizability theory is essential and classical test theory is often inadequate. Paper presented at the Annual Meeting of the Southwestern Psychological Association, New Orleans, LA.
Latifi, F. S., & Gierl, M. J. (2020). Automated scoring of junior high essays using Coh-Metrix features: Implications for large-scale language testing. Language Testing. https://doi.org/10.1177/0265532220929918
Lee, Y. W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes. ETS Research Report Series, 2005(1), i-76. https://doi.org/10.1002/j.2333-8504.2005.tb01991.x
Lee, Y.-W., Kantor, R., & Mollaun, P. (2002). Score dependability of the writing and speaking sections of new TOEFL. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA. Abstract retrieved on December 11, 2012 from ERIC. (ERIC No. ED464962)
Li, Z., Link, S., Ma, H., Yang, H., & Hegelheimer, V. (2014). The role of automated writing evaluation holistic scores in the ESL classroom. System, 44, 66-78. https://doi.org/10.1016/j.system.2014.02.007
Lim, G. S. (2009). Prompt and rater effect in second language writing performance assessment (Doctoral dissertation, The University of Michigan). Retrieved from http://deepblue.lib.umich.edu on March 23, 2015
Liu, S., & Kunnan, A. (2016). Investigating the application of automated writing evaluation to Chinese undergraduate English majors: A case study of WriteToLearn. CALICO Journal, 33(1), 71-91. https://doi.org/10.1558/cj.v33i1.26380
Popham, W. J. (1981). Modern educational measurement. Englewood Cliffs, NJ: Prentice-Hall.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shermis, M. D., & Burstein, J. (2003). Automated Essay Scoring: A cross disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum.
Shermis, M. D., Burstein, J., Higgins, D., & Zechner, K. (2010). Automated essay scoring: Writing assessment and instruction. In P. Peterson, E. Baker, & B. McGaw (Eds.), International encyclopedia of education (3rd ed., pp. 20-26). Oxford, UK: Elsevier. https://doi.org/10.1016/B978-0-08-044894-7.00233-5
Shermis, M. D., Koch, C. M., Page, E. B., Keith, T. Z., & Harrington, S. (2002). Trait ratings for automated essay grading. Educational and Psychological Measurement, 62, 5-18. https://doi.org/10.1177/001316440206200101
Shi, L. (2001). Native- and Nonnative-Speaking EFL Teachers’ Evaluation of Chinese Students’ English Writing. Language Testing, 18(3), 303-325.
Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters’ background and training on the reliability of direct writing tests. The Modern Language Journal, 76(1), 27-33. https://doi.org/10.1111/j.1540-4781.1992.tb02574.x
Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native English-speaking, and ESL students? Journal of Second Language Writing, 5(2), 163-182. https://doi.org/10.1016/S1060-3743(96)90023-5
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 1-24. https://doi.org/10.1191/1362168806lr190oa
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197-223. http://dx.doi.org/10.1177/026553229401100206
Weigle, S. C. (2002). Assessing writing. United Kingdom: Cambridge University Press.
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
Zehner, F., Goldhammer, F., & Sälzer, C. (2018). Automatically analyzing text responses for exploring gender-specific cognitions in PISA reading. Large-scale Assessments in Education, 6(1), 1-26.