Design and Validation of a System of Categories to Assess the Written Discourse of L2 English Learners

This study is part of a broader research project focused on written discourse analysis and its application in English language teaching. This paper aims to design and evaluate the content validity of an instrument (a system of categories) built to analyse the written discourse in L2 (English) of 112 Spanish students in Upper Secondary Education. The study was carried out from a qualitative approach using the technique of content analysis and the Delphi technique with five experts in the field. The results show that the category system has content validity, as it meets the criteria that ensure the scientific rigour of qualitative methodology: credibility, transferability, dependency and confirmability. Moreover, there is high interrater reliability on the criteria, contrasted with Kendall's W test (>.70), as well as in its application in the coding process, which was confirmed with Cohen's Kappa test (.658). It can be concluded that this instrument is a valid tool to help EFL teachers and secondary school students in the teaching and learning of written discourse in English as a foreign language.

knowing more about the language but also becoming more able at using the language". This is also set out in the official Spanish curriculum for second language teaching. This skill is a necessity in today's world, and it makes sense that more attention should be paid to how students learn to write and how teachers learn to teach writing. Learning how to write cannot be left to chance. Thus, the role of explicit teaching in raising foreign language writing achievement is a key factor.
Employing an instrument such as the system of categories that we present here for the analysis of the written production in English is of great help. Therefore, the main aim of this paper is to describe the process required to design and validate this system of categories by using content analysis, a qualitative research technique.

METHODOLOGICAL ASPECTS
In the light of the objective indicated above, this part describes the different steps taken to design the system of categories in addition to the actions taken to validate this system. Thus, the following sections are intended to describe in-depth the research design, the participants who took part in the study and those who helped to validate the system of categories, the instrument and the procedure implemented to collect the data, as well as the data analysis process aimed at exposing the credibility, transferability, dependency and confirmability of the system of categories (Kyngäs & Kääriäinen, 2020).

Research design
Formulating a pertinent research question and defining the design of the research are two crucial steps in the research process. Therefore, in this section, we describe the research design concerning the two main issues of this paper: the design of the system of categories and its validation.
Regarding the design of the system of categories, the written productions of the students are the main source of information for its creation. Thus, in order to discern the meaning of certain significant features of the texts, the qualitative content analysis technique was employed (Kvale, 2011; Flick, 2012). Content analysis is a technique for the analysis of trends in communication contents (Souza, Ferreira, & Gomes, 2017). It is characterized by:
• Objectivity: the possibility of reaching the same results in different studies carried out by different researchers.
• Consistency: the categories fully capture the content of the text.
• Generalization: conclusions can be drawn that can be directly extrapolated.
• Quantification: the absences and presences of certain aspects in the study can be quantified.
In relation to the content or construct validation of the system of categories, it does not make sense to evaluate the validity of qualitative research with the traditional criteria used in the quantitative paradigm (Ruiz-Olabuénaga, 2012). Although there is still no unanimous terminology, qualitative research has moved away from positivist terms and guarantees the trustworthiness of its results according to criteria such as credibility, instead of internal validity, which looks at the "truth" value of the research; transferability, instead of external validity, which considers the applicability of the results to other contexts; dependency, instead of reliability, which studies the consistency of the data; and confirmability, instead of objectivity, which focuses on neutrality (Kyngäs & Kääriäinen, 2020).
Thus, the validation process was carried out from a qualitative approach, using the technique of content analysis through the Delphi technique of expert judgement. Three rounds of expert consultation were carried out until an acceptable rate of agreement was reached, which was contrasted with Kendall's W test; the application of the coding process was confirmed with Cohen's Kappa test.

Participants
In this section, we present two groups of participants: those who were involved in the design process of the system of categories, that is, the 112 secondary school students, and those involved in the validation process of the system of categories.

Participants involved in the design process of the system of categories
The participants who took part in the wider research mentioned above were 112 students enrolled in two public schools in Granada, one located in the city of Granada and the other in a town of the province. They were selected by means of intentional (deliberate) non-probability sampling, according to criteria of accessibility, personal interest, relevance and adequacy (Tójar, 2006). The written production of these students was the main source of information used for the creation of the system of categories.

Participants involved in the validation process of the system of categories
The validation process of the system of categories was carried out by applying a Delphi-based expert judgement method. Thus, five expert judges from various Spanish universities assessed the content of the system of categories: two specialists in Language and Literature Teaching, one in Pedagogy and two in Research Methods, whose teaching and research experience ranges from 9 to 22 years (Table 1).

Instrument to collect the written production of the students
As previously mentioned, the students' written production was the main source of information for the creation of the system of categories. Therefore, the instrument selected to collect this information was a written test previously designed by Madrid and Hughes (2011). It consists of three parts that correspond to three different types of writing: 1) a short email, 2) a story about an accident the student had, and 3) the student's opinion about the school uniform for a school magazine. After an extensive review of the literature, this test was selected for its flexibility in helping to assess the written discourse of the students. Moreover, using a previously employed instrument can create a coherent dialogue and a more comprehensive analysis of the phenomenon in question (Berry, Poortinga, Segall & Dasen, 2002).

Data collection procedure
In this section, we present two different data collection procedures. The first one is related to the data collection procedure of the written production of the students who participated in the design process of the system of categories. The second part is related to the data collection procedure of the experts' feedback (Delphi technique).

Gathering the written production of the students
This section describes the data collection procedure carried out to gather the written production of the students. Following Souza, Ferreira and Gomes (2017), for practical purposes the data collection procedure was divided into three phases: the exploratory phase, the fieldwork phase and the treatment of the information.
The exploratory phase is the preliminary research that clarifies the procedure of the investigation and prepares the ground for the fieldwork. In this phase, the problem and the objectives of the research are delimited, and the study is developed at a theoretical and methodological level; the instrument is chosen and described, and the timeframe, the setting and the participants of the research are set out. The second phase, the fieldwork itself, consists of putting into practice the theoretical construction elaborated in the previous stage. The third stage, the analysis and treatment of the material, refers to the set of procedures for assessing, understanding and interpreting the empirical data, articulated with the theory on which the project was based.
In the exploratory phase, the instrument was selected. The fieldwork started in the second phase. For this purpose, an action plan was drawn up in order to obtain access to the participating schools. Obtaining the necessary permissions was a laborious task, as parental authorisation and voluntary participation were required, and the participants' anonymity was guaranteed, since the ethical code is crucial in any research. Once this process was finished, the test was administered to a total of 112 students by one of the researchers of this study. Two days were needed to collect all the data. It was agreed that the teachers and managers responsible for the two different programmes would receive a report with the results.

Gathering feedback from the experts (Delphi technique)
The validation process of the system of categories was carried out by applying a Delphi-based expert judgement method. The Delphi technique is a process used to survey and collect the opinions of experts on a particular subject. Linstone and Turoff (1975) provide a basic definition: "Delphi may be characterized as a method for structuring a group communication process so that the process is effective in allowing a group of individuals, as a whole, to deal with a complex problem" (p. 3).
The process implemented to collect the opinions of the five experts was carried out in three rounds, where the different experts shared their opinions and perspectives on the system of categories. The general tone was favourable, with a positive assessment, alluding to the coherence presented in the system. However, various doubts and weaker points were clarified, such as the exclusivity of certain traits.
These suggestions allowed us to delimit and define the system of categories in a more exhaustive way. This improvement guaranteed us, as we mentioned at the beginning, a useful and valid instrument for the research and for its application in the study of written discourse in a second language. The final result of the category system is presented in table 4.
For clarification, the process described above is presented in Figure 1.

Data analysis process
The design of the system of categories was carried out through the qualitative technique called content analysis. Qualitative content analysis is not merely a classification of informants' opinions; it is the discovery of their social codes from stories, symbols and observations. It is the search for understanding and interpretation that provides a unique and contextualised contribution by the researcher (Souza, Ferreira & Gomes, 2017).
This part corresponds to the third phase, the treatment of the information (mentioned in the section above), in which we proceed to the ordering, classification and analysis of the data for the creation of the system of categories. It started once the fieldwork phase had finished.
When the information was collected from the 112 students, the next step was to register the data using the NVivo 11 qualitative data analysis software. This software allows researchers to deal with a large amount of data, improves the validity of the qualitative research and simplifies the coding and categorising task. The qualitative content analysis technique requires a reflective process to extract the most relevant information from the text by subsuming it into categories and storing it separately for later processing (Navarro & Díaz, 1995). This is a continuous process of categorising and codifying the information, returning to the raw data, reflecting on the analysis and starting the process once again (Rodríguez, Gil & García, 1996; Bardin, 1996). In this process, reading and rereading is a key aspect, as it is the only way to get a sense of the whole. The analysis of the raw data thus progresses toward the identification of the themes that emerge from the participants' written production. These themes have to be divided into smaller units, known as meaning units (Erlingsson & Brysiewicz, 2017). The following step is to condense these meaning units into codes. Codifying is a process that allows segregating, grouping and regrouping similar data into categories. Coding the meaning in a text into categories makes it possible to quantify how often specific topics are addressed, and the frequency of topics can be compared and correlated with other measurements (Kvale, 2011). The data collection is finished when redundancy of information is obtained and the categories are saturated; this provides more credibility and transferability to the results (Ruiz-Olabuénaga, 2012).
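The quantification step described above can be illustrated with a short sketch: once each meaning unit has been assigned a code, code frequencies are simple counts. The coded units below are invented for illustration; the code labels follow those used later in Table 4.

```python
from collections import Counter

# Hypothetical sequence of codes assigned to the meaning units of one text
coded_units = ["CEAC", "CDH", "CEAC", "CDC", "CEAT", "CDH", "CEAC"]

# Frequency of each code, most frequent first
frequencies = Counter(coded_units)
print(frequencies.most_common())
```

Frequencies obtained this way can then be compared across students or correlated with other measurements, as Kvale (2011) suggests.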
The process of elaboration of the system of categories integrates different procedures amongst which three approaches are central: deductive category development, inductive category application, and validation of the system of categories. These three approaches are described in the results section.

RESULTS
Regarding the deductive approach, the analysis starts with relevant findings that serve as a guide for the initial process of codification and categorisation. In this respect, it is worth mentioning that the main theoretical sources of the system of categories are the Communicative Competence model of Canale and Swain (1980) and the classifications of errors of Vázquez (1991) and Bueno, Carini and Linde (1992). By using and adapting the information derived from these authors, we were able to establish the main categories and codes of our study, which are shown in the following figure.
These four categories, Linguistic Competence (LC), Sociolinguistic Competence (SC), Strategic Competence (StC) and Discourse Competence (DC), and their respective subcategories attempted to cover the different aspects of written discourse. Nonetheless, this was just the first step in the process of elaboration of the system of categories. The next step was to apply these initial categories to the written productions in order to observe whether they were adjusted to the nature of the research. To do this, we started the categorisation and codification process from an inductive approach. This process is based on two rules: the categories and codes have to be defined in advance, and these definitions can be changed during the elaboration process, before the validation of the system of categories.
These definitions serve to establish a selection criterion; therefore, in the reading and rereading process, all the information that fitted into the category definitions was accepted and the rest was ignored. This procedure allowed us to create new categories, modify them or even eliminate them from the initial system. As mentioned above, this process was completed using qualitative analysis software, NVivo 11.
Furthermore, it is worth mentioning the difficulty of ensuring the exclusivity of the categories, as each reading unit should fit into only one category on a given scoring dimension. Despite this difficulty, the concern was resolved satisfactorily.
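The exclusivity rule can be checked mechanically once the coding has been registered: each reading unit must carry exactly one code per scoring dimension. A minimal sketch, with invented unit identifiers and codes:

```python
# Hypothetical coding output: each reading unit mapped to the codes assigned to it
coding = {
    "unit_01": ["CEAC"],
    "unit_02": ["CDH"],
    "unit_03": ["CDC", "CDA"],   # violates exclusivity: two codes on one unit
    "unit_04": ["CEAT"],
}

# Units that break the one-code-per-unit rule
violations = {unit: codes for unit, codes in coding.items() if len(codes) != 1}
print(violations)
```

A report like this makes the offending units easy to revisit during the rereading cycles.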
As previously mentioned, the process of elaboration of the system of categories integrates different procedures. The deductive category development and the inductive category application have already been explained; we therefore proceed to describe the validation process of the system of categories.
Once the category system was created, it was submitted to a content validation process, which refers to how well an instrument measures all facets of a given construct (Oluwatayo, 2012), by applying the Delphi technique of expert judgement. The validation was carried out by five experts in the field in a discussion group; their profiles are described in Table 1. To carry out this process, the objectives of the study were explained to the experts, and they were given a questionnaire that included the categories and codes with their definitions and examples taken from the written productions.
In order to establish the internal reliability of the category system, the five expert judges were asked to evaluate the categories and subcategories using a Likert scale (from 0 to 5), according to the following criteria:
• Homogeneity: each category is obtained from the same principles used for the whole categorisation.
• Exhaustiveness: the categories account for all the material analysed.
• Exclusivity: the content of the material analysed cannot be classified in more than one category.
• Concreteness: the categories are easily understood and are not expressed in abstract terms that admit different meanings.
• Appropriateness: the categories are adapted to the content and to the objective to be reached.
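Ratings of this kind are summarised per criterion with descriptive statistics (mean and standard deviation), as reported for each round. A sketch with invented scores; the real round-three data are in Table 2:

```python
import statistics

# Hypothetical ratings: one row per judge, one column per criterion (0-5 scale)
criteria = ["Homogeneity", "Exhaustiveness", "Exclusivity",
            "Concreteness", "Appropriateness"]
ratings = [
    [4, 3, 3, 4, 4],
    [3, 4, 3, 4, 5],
    [4, 3, 4, 5, 4],
    [5, 4, 3, 4, 4],
    [4, 3, 3, 4, 5],
]

# Mean and standard deviation per criterion
for i, name in enumerate(criteria):
    scores = [row[i] for row in ratings]
    print(f"{name}: M = {statistics.mean(scores):.2f}, "
          f"SD = {statistics.stdev(scores):.2f}")
```

Criteria whose mean falls below a chosen threshold can then be flagged for improvement proposals in the next round.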
The experts carried out three rounds to make the modifications they considered relevant in order to improve the instrument. They were also asked to give improvement proposals for the worst-rated items. The development of the process can be seen in the section of data collection process (figure 1).
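The concordance statistic used in the analysis that follows, Kendall's coefficient of concordance W, can be computed directly from the judges' ratings. A minimal sketch that assumes no tied ranks within a judge; real analyses (e.g. in SPSS) apply a tie correction:

```python
def kendalls_w(ratings):
    """Kendall's coefficient of concordance W.

    ratings[j][i] is the score judge j gave to item i.
    Assumes no ties within a judge's ratings (no tie correction applied).
    """
    m = len(ratings)       # number of judges
    n = len(ratings[0])    # number of items rated

    def ranks(scores):
        # rank 1 = lowest score
        order = sorted(range(len(scores)), key=scores.__getitem__)
        r = [0] * len(scores)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rank_rows = [ranks(row) for row in ratings]
    # Sum of ranks per item, and squared deviations from the mean rank sum
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = sum(totals) / n
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Two hypothetical judges who rank four items identically: W = 1.0
print(kendalls_w([[1, 4, 2, 3], [10, 40, 20, 30]]))  # -> 1.0
```

W ranges from 0 (no agreement) to 1 (perfect agreement among the judges' rankings).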
For the validation of the category system, the analysis of means and the degree of interrater reliability was carried out by applying Kendall's W test for ordinal data. These data were obtained from the expert judgement using descriptive statistics (mean and standard deviation). This process was repeated for the three rounds implemented; however, Table 2 shows only the data of the last round, due to space limitations. Analysing the results following the consensus reached in round three, it can be observed that the mean scores obtained for each category were higher than 3, between 3.20 and 4.20, on the Likert-type scale described above. The results of Kendall's W test with a significance index p ≤ .05 led us to reject the null hypothesis, that is, there is agreement among the experts, with the results shown in Table 3. Therefore, all the criteria present a high concordance (W > .70), which indicates that the category system has good internal reliability. In light of these results, the definitive category system was established, as shown in Table 4.

It refers to the lack of knowledge of a set of conventions for writing in the target language. It can be caused by interlingual or intralingual transferences.

Code-switching (CEAC)
It refers to the use of words in L1 when writing in L2.

Transference (CEAE)
It refers to the use of non-existing words in L2 caused by negative transference from L1.

Literal translation (CEAT)
It refers to the use of expressions whose source, the mother tongue, causes a deviation in the target language.

Coherence (CDH)
It refers to the lack of ability of the user or learner to arrange coherent stretches of language.

Cohesion (CDC)
It refers to the lack of logical relations when producing a sentence. It can be identified by the lack of cohesive devices.

Textual adequacy (CDA)
It refers to the lack of adaptation to a communicative situation.

Sociolinguistic competence (CS)
Sociolinguistic adequacy (CSA) It refers to the knowledge of sociolinguistic rules and cultural patterns that allow the user or learner to formulate adequate linguistic interventions.
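The definitions above can be sketched as a nested mapping, which is convenient when registering codes in analysis software. Only the subcategories quoted in this excerpt are included, and the grouping of the CEA* codes under Linguistic Competence is our assumption based on the code prefixes, not something stated in the excerpt:

```python
# Partial sketch of the category system (definitions abridged from Table 4;
# the grouping of CEA* codes under Linguistic Competence is assumed)
category_system = {
    "Linguistic competence": {
        "CEAC": "Code-switching: use of L1 words when writing in L2",
        "CEAE": "Transference: non-existing L2 words from negative L1 transfer",
        "CEAT": "Literal translation: L1 expressions causing deviation in L2",
    },
    "Discourse competence": {
        "CDH": "Coherence: inability to arrange coherent stretches of language",
        "CDC": "Cohesion: lack of logical relations; absent cohesive devices",
        "CDA": "Textual adequacy: lack of adaptation to the situation",
    },
    "Sociolinguistic competence": {
        "CSA": "Sociolinguistic adequacy: command of sociolinguistic rules",
    },
}

# Reverse index: from code to (parent category, definition)
code_index = {code: (parent, definition)
              for parent, subs in category_system.items()
              for code, definition in subs.items()}
print(code_index["CDH"][0])
```

A reverse index like `code_index` lets a coder look up any code's parent category and definition during the coding task.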
Once the content validity of the category system had been established and the interrater reliability verified, we proceeded to check the agreement between judges in the application of the category system. This means that whenever the same text is evaluated with the same category system (the instrument), the same results should be obtained. To do so, Cohen's Kappa coefficient for nominal data was used with the help of the statistical program SPSS v25.
For this purpose, two expert judges were given the same text together with the category system to code it. Once codified, the numbers of agreements and disagreements were registered in a matrix in SPSS v25 and Cohen's Kappa coefficient was calculated, establishing the following hypotheses:
H0: the degree of agreement is 0, i.e., there is no agreement.
H1: there is a significant agreement among evaluators, i.e., K > 0.
As shown in the results of Tables 5 and 6, H0 is rejected, as the significance level is lower than .05. Thus, it is concluded that there is agreement among the evaluators, with a satisfactory strength of association (K = .65).
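The agreement statistic reported above can be reproduced with a short function, computed by hand rather than in SPSS. The two coders' code sequences below are invented for illustration:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders assigning nominal codes to the same units."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed proportion of units on which the coders agree
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each code's marginal proportions
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical coders agreeing on 3 of 4 units
print(cohens_kappa(["CDH", "CDC", "CDH", "CSA"],
                   ["CDH", "CDC", "CDH", "CDH"]))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it is preferred over simple percentage agreement for nominal coding.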

DISCUSSION AND CONCLUSIONS
Writing is one of the least practised skills, due to the difficulty it entails for both teaching and learning (Nunan, 1991; Alcaráz Varó, 2000; Palmer Silveira, 2001). It can be considered a tool of recognised value that consolidates and reinforces other skills and constitutes an essential element in the learning of a language, as Marsh (2000), Porte (1996) and McLaren, Madrid and Bueno (2005) indicate.
Numerous studies have pointed out the effectiveness of teaching writing explicitly (Manchón & Roca de Larios, 2007), and the use of a system of categories covering the different aspects of written discourse is therefore highly recommendable: it has proved to be a great learning strategy for students, cognitively attractive, and it allows them to learn more about their strengths and weaknesses in the writing skill (Cassany, 2005).
The results of the reliability and validity testing suggest that the system of categories demonstrates adequate reliability and validity for use by EFL teachers and students to assess the different aspects that comprise written discourse in English.
As has been shown throughout the process, different reliability techniques have been used, so the results allow us to consider that the study meets the criteria introduced above:
• Credibility: triangulation with pre-existing scientific literature, consultation of various types of documentation (in order to contextualise the data obtained), triangulation of instruments for collecting textual information (email, story and opinion about the school uniform), review of the information obtained, and analyses carried out on different occasions and by different experts in the field.
• Transferability: the entire research process has been described in detail, listing and indicating all the data (in short, all the elements that have allowed us to describe or interpret the data), and the steps in the data analysis process have been presented.
• Dependency and confirmability: the control process has been carried out by expert judges who are not related to the research.
Nonetheless, this study must be considered in light of two main limitations. First, the categories and subcategories of the system were created in an inductive-deductive process whose main source were texts provided by upper secondary school students (4th year). This may limit the transferability of the instrument to lower levels, although its transferability to similar contexts is guaranteed. Second, there is a gap in the literature on the process of content validation for systems of categories; hence, finding the best way to carry it out was a challenging task.
Despite these limitations, the results indicate that the system of categories is a valid, useful and valuable instrument, as it has shown good internal reliability and satisfactory agreement among experts. Moreover, this study provides relevant information for English teachers and professors interested in how to treat the different aspects of written discourse in a second language, and it offers a better understanding of the design and validation process of an instrument such as a system of categories.
In short, this study presents the design and validation process of an instrument for the analysis of the written discourse of L2 English learners. This instrument is a system of categories whose main objective is to provide EFL teachers with a linguistic and discursive resource that will help them to clarify and solve some everyday problems that they encounter when trying to promote the learning of the written language. It is not only a valid instrument for teachers but also for L2 learners, since analysing their own written production through this system of categories allows them to observe the different shortcomings they have in written discourse.