Using generalizability theory to investigate the variability and reliability of EFL composition scores by human raters and e-rater

Authors

Sari, E., & Han, T.

DOI:

https://doi.org/10.30827/portalin.vi38.18056

Keywords:

EFL writing assessment, generalizability theory, scoring variability, scoring reliability, automated writing evaluation (AWE)

Abstract

Using generalizability theory (G-theory) as the theoretical framework, this study investigated the variability and reliability of holistic scores assigned by human raters and e-rater to the same EFL essays. Eighty argumentative essays written on two different topics by tertiary-level Turkish EFL students were scored holistically by e-rater and by eight human raters who had received detailed rater training. The results showed that e-rater and the human raters assigned significantly different holistic scores to the same EFL essays. G-theory analyses revealed that the human raters scored the same essays considerably inconsistently despite the detailed rater training, and that more reliable ratings were attained when e-rater was integrated into the scoring procedure. Implications for EFL writing assessment practices are discussed.
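A G-theory analysis of this kind decomposes observed score variance into components for the object of measurement (the essays/writers), the measurement facets (e.g., raters), and their interactions, and then forms generalizability coefficients from those components. As a rough illustrative sketch only (not the authors' actual design, data, or software, and using hypothetical scores), the Python snippet below estimates variance components and the relative (Eρ²) and absolute (Φ) coefficients for the simplest single-facet persons × raters crossed design; the study itself presumably involves additional facets such as topic and rater type.

```python
# Minimal G-theory sketch for a persons-by-raters (p x r) crossed design
# with one score per cell. Hypothetical data; not the study's analysis.
import numpy as np

def g_study(scores, n_raters_decision=None):
    """scores: 2-D array, rows = persons (essays), columns = raters."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA without replication
    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square equations give the variance components
    var_pr_e = ms_pr                        # interaction + error (confounded)
    var_p = max((ms_p - ms_pr) / n_r, 0.0)  # persons (object of measurement)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)  # raters (severity/leniency)

    n_prime = n_raters_decision or n_r      # D-study: raters to generalize to
    g_rel = var_p / (var_p + var_pr_e / n_prime)          # relative (Erho^2)
    phi = var_p / (var_p + (var_r + var_pr_e) / n_prime)  # absolute (Phi)
    return {"var_p": var_p, "var_r": var_r, "var_pr_e": var_pr_e,
            "g_coefficient": g_rel, "phi": phi}

# Hypothetical holistic scores: 6 essays scored by 3 raters on a 1-6 scale.
scores = np.array([[4, 5, 4],
                   [2, 3, 2],
                   [5, 5, 6],
                   [3, 2, 3],
                   [4, 4, 5],
                   [1, 2, 2]], dtype=float)
print(g_study(scores, n_raters_decision=2))
```

In such a decomposition, a large rater or person-by-rater component relative to the person component signals inconsistent scoring, while increasing the number of raters in the D-study (or adding a highly consistent automated score) raises the coefficients, which is the general logic behind the reliability comparisons reported in the abstract.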


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V. 2.0. The Journal of Technology, Learning and Assessment, 4(3), 3-30. https://doi.org/10.1002/j.2333-8504.2004.tb01972.x

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Baker, B. A. B. (2010). Playing with the stakes: A consideration of an aspect of the social context of a gatekeeping writing assessment. Assessing Writing, 15, 133–153. http://dx.doi.org/10.1016/j.asw.2010.06.002

Barkaoui, K. (2010). Do ESL essay raters’ evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1), 31-57.

Bauer, M. I., & Zapata-Rivera, D. (2020). Cognitive foundations of automated scoring. In Handbook of automated scoring (pp. 13-28). Chapman and Hall/CRC.

Blood, I. (2011). Automated essay scoring: A literature review. Studies in Applied Linguistics and TESOL, 11(2), 40-64.

Brennan, R. L. (2001). Generalizability theory: Statistics for social science and public policy. New York: Springer-Verlag.

Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27-40. https://doi.org/10.1080/08957347.2012.635502

Briesch, A. M., Swaminathan, H., Welsh, M., & Chafouleas, S. M. (2014). Generalizability theory: A practical guide to study design, implementation, and interpretation. Journal of School Psychology, 52(1), 13-35. http://dx.doi.org/10.1016/j.jsp.2013.11.008

Brown, H. D. (2004). Language assessment: Principles and classroom practices. New York, NY: Pearson/Longman.

Burstein, J., Braden‐Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., ... & Wolff, S. (1998). Computer analysis of essay content for automated score prediction: A prototype automated scoring system for GMAT analytical writing assessment essays. ETS Research Report Series, 1998(1), i-67. http://dx.doi.org/10.1002/j.2333-8504.1998.tb01764.x

Chang, Y. (2002). EFL teachers' responses to L2 writing. Retrieved from http://files.eric.ed.gov/fulltext/ED465283.pdf on March 23, 2015

Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater’s performance on TOEFL essays (Research report No. 73). Princeton, NJ: Educational Testing Service.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.

Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students’ written work. International Journal of English Studies, 10(2), 121-142. https://doi.org/10.6018/ijes/2010/2/119231

Elliot, S. (2001). Applying IntelliMetric Technology to the scoring of 3rd and 8th grade standardized writing assessments (RB-524). Newtown, PA: Vantage Learning.

Elorbany, R., & Huang, J. (2012). Examining the impact of rater educational background on ESL writing assessment: A generalizability theory approach. Language and Communication Quarterly, 1(1), 2-24.

Engber, C. A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4(2), 139-155. https://doi.org/10.1016/1060-3743(95)90004-7

Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with Latent Semantic Analysis. Discourse Processes, 25(2-3), 285-307. https://doi.org/10.1080/01638539809545029

Güler, N., Uyanık, G. K., & Teker, G. T. (2012). Genellenebilirlik kuramı [Generalizability theory]. Ankara: Pegem Akademi Yayınları.

Han, T. (2013). The impact of rating methods and rater training on the variability and reliability of EFL students' classroom-based writing assessments in Turkish universities: An investigation of problems and solutions. Atatürk University, Erzurum, Turkey.

Harsch, C., & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach. Assessing Writing, 17(4), 228-250.

Heaton, J. B. (2003). Writing English language tests. USA: Longman.

Hoang, G. T. L., & Kunnan, A. J. (2016). Automated essay evaluation for English language learners: A case study of MY Access. Language Assessment Quarterly, 13(4), 359-376. https://doi.org/10.1080/15434303.2016.1230121

Hockly, N. (2019). Automated writing evaluation. ELT Journal, 73(1), 82-88. https://doi.org/10.1093/elt/ccy044

Homburg, T. J. (1984). Holistic evaluation of ESL composition: Can it be validated objectively? TESOL Quarterly, 18(1), 87-108.

Huang, J. (2008). How accurate are ESL students’ holistic writing scores on large-scale assessments? - A generalizability theory approach. Assessing Writing, 13(3), 201-218. http://dx.doi.org/10.1016/j.asw.2008.10.002

Huang, J. (2012). Using generalizability theory to examine the accuracy and validity of large scale ESL writing assessment. Assessing Writing, 17(3), 123-139. http://dx.doi.org/10.1016/j.asw.2011.12.003

Huang, S. J. (2014). Automated versus human scoring: A case study in an EFL context. Electronic Journal of Foreign Language Teaching, 11.

Hyland, K. (2003). Second language writing. New York, NY: Cambridge University Press.

James, C. L. (2006). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing, 11, 167-178. https://doi.org/10.1016/j.asw.2007.01.002

Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. New York: The Guilford Press.

Kieffer, K. M. (1998, April). Why generalizability theory is essential and classical test theory is often inadequate. Paper presented at the Annual Meeting of the Southwestern Psychological Association, New Orleans, LA.

Latifi, F. S., & Gierl, M. J. (2020). Automated scoring of junior high essays using Coh-Metrix features: Implications for large-scale language testing. Language Testing. https://doi.org/10.1177/0265532220929918

Lee, Y. W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes. ETS Research Report Series, 2005(1), i-76. https://doi.org/10.1002/j.2333-8504.2005.tb01991.x

Lee, Y.-W., Kantor, R., & Mollaun, P. (2002). Score dependability of the writing and speaking sections of new TOEFL. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA. Abstract retrieved on December 11, 2012 from ERIC (ERIC No. ED464962).

Li, Z., Link, S., Ma, H., Yang, H., & Hegelheimer, V. (2014). The role of automated writing evaluation holistic scores in the ESL classroom. System, 44, 66-78. https://doi.org/10.1016/j.system.2014.02.007

Lim, G. S. (2009). Prompt and rater effects in second language writing performance assessment (Doctoral dissertation, The University of Michigan). Retrieved from http://deepblue.lib.umich.edu on March 23, 2015

Liu, S., & Kunnan, A. (2016). Investigating the application of automated writing evaluation to Chinese undergraduate English majors: A case study of WriteToLearn. CALICO Journal, 33(1), 71-91. https://doi.org/10.1558/cj.v33i1.26380

Popham, W. J. (1981). Modern educational measurement. Englewood Cliffs, NJ: Prentice-Hall.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

Shermis, M. D., & Burstein, J. (2003). Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum.

Shermis, M. D., Burstein, J., Higgins, D., & Zechner, K. (2010). Automated essay scoring: Writing assessment and instruction. In P. Peterson, E. Baker, & B. McGaw (Eds.), International encyclopedia of education (3rd ed., pp. 20-26). Oxford, UK: Elsevier. https://doi.org/10.1016/B978-0-08-044894-7.00233-5

Shermis, M. D., Koch, C. M., Page, E. B., Keith, T. Z., & Harrington, S. (2002). Trait rating for automated essay scoring. Educational and Psychological Measurement, 62, 5-18. https://doi.org/10.1177/001316440206200101

Shi, L. (2001). Native- and nonnative-speaking EFL teachers’ evaluation of Chinese students’ English writing. Language Testing, 18(3), 303-325.

Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters’ background and training on the reliability of direct writing tests. The Modern Language Journal, 76(1), 27-33. https://doi.org/10.1111/j.1540-4781.1992.tb02574.x

Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native English-speaking and ESL students? Journal of Second Language Writing, 5(2), 163-182. https://doi.org/10.1016/S1060-3743(96)90023-5

Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 1-24. https://doi.org/10.1191/1362168806lr190oa

Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197-223. http://dx.doi.org/10.1177/026553229401100206

Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.

Williamson, D., Xi, X., & Breyer, J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13. https://doi.org/10.1111/j.1745-3992.2011.00223.x

Zehner, F., Goldhammer, F., & Sälzer, C. (2018). Automatically analyzing text responses for exploring gender-specific cognitions in PISA reading. Large-scale Assessments in Education, 6(1), 1-26.

Published

2022-06-01

How to Cite

Sari, E., & Han, T. (2022). Using generalizability theory to investigate the variability and reliability of EFL composition scores by human raters and e-rater. Porta Linguarum: An International Journal of Foreign Language Teaching and Learning, (38), 27–45. https://doi.org/10.30827/portalin.vi38.18056

Issue

No. 38 (2022)

Section

Articles