Learning foreign languages through content and language integrated learning in physical education: A systematic review

ABSTRACT: The goal of this systematic review was to describe and assess studies that used content and language integrated learning (CLIL) programmes to teach a foreign language via physical education in a school setting. Thirty-five articles met the selection criteria. Two types of studies were identified: (a) low-intensity CLIL programmes (only the physical education subject was used) and (b) high-intensity CLIL programmes (various academic subjects were combined). No studies were found in the preschool education stage. Most of the research was implemented in Spain. Games and sports were the most frequently used contents, and English was the most commonly used foreign language. High-intensity CLIL programmes improved students' overall proficiency in a foreign language to a greater degree. Finally, it is not clear whether physical education classes conducted using CLIL have a positive or negative effect on students' moderate-to-vigorous physical activity (MVPA) levels.


Introduction
Learning both a foreign language (FL) and the contents of a specific academic subject at the same time is a framework that has become increasingly common in the education field. There are multiple methodological programmes to choose from: content-based instruction, bilingual programmes, language X as a medium of instruction, game-based projects, language immersion programmes, etc. (Merino & Lasagabaster, 2015; Pérez-Cañado, 2012). However, content and language integrated learning (CLIL) is the most favoured approach in Europe.
CLIL has been described as a pedagogical approach that focuses on two goals: learning the academic subject content and learning the FL, which represents the medium of instruction for the content (Coyle et al., 2010). The CLIL approach was designed to help improve FL competence without having any pernicious effect on the students' L1 or the content learning (Lasagabaster & Ruiz de Zarobe, 2010). CLIL can be easily integrated through different academic subjects (Merino & Lasagabaster, 2017). However, physical education (PE) is considered ideal, as it promotes learning in a playful and interactive manner.
Many researchers link the benefits of CLIL in PE to the communication and interaction that is provided through movement and play, creating an effective platform for learning a FL (Coral & Lleixà, 2013; Zurita-Ortega et al., 2019). Furthermore, PE via CLIL has been recognised as a comprehensive approach that embodies the principles of learning, teaching motor skills via a FL, and fostering cognition and citizenship. It also takes into consideration students' motivation for physical activity and provides support to develop both motor and language skills. PE and sports can also have a positive influence on the physical, cognitive, and social domains of a child's development, and on their lifestyle (World Health Organization, 2010), which, in turn, can be beneficial for FL learning.
A systematic review on learning a FL through PE has been conducted. However, that review was more general, because it incorporated different teaching approaches to the learning of a FL through PE, and it only concluded that CLIL was the most commonly used approach for teaching academic subjects through a FL in Europe. Furthermore, the number and diversity of studies on CLIL and PE have increased exponentially since 2017 (25 studies), which calls for a recent and more extensive analysis (i.e., PE contents, outcomes). To promote this framework (CLIL + PE) and bring it closer to teachers and researchers, this systematic review aimed to describe and assess studies that used CLIL programmes to teach a FL via PE in a school setting.

Search limits
A systematic search of six electronic databases (Web of Science, PubMed, Scopus, SportDiscus-EBSCO, ERIC and Google Scholar) covering January 2007 to April 2020 was conducted based on the PRISMA protocol for systematic reviews of the Cochrane Collaboration (Moher et al., 2015). The starting date was chosen because Rottmann (2007) first linked CLIL and PE to FL learning in 2007, and further research has been conducted since then. These databases were selected because they included PE articles developed in the school context and published in peer-reviewed journals. The protocol for the systematic review was registered on PROSPERO (CRD42019126972) and is published elsewhere.
The search strategies used a combination of the following keywords, classified in four categories: (a) PE (physical activity, exercise, sports and PE), (b) CLIL, (c) FL (multilingual, bilingual, language teaching, language learning and FL), and (d) population (child, adolescent, pupil, youth and student). Additionally, the Boolean operators 'AND' and 'OR' were used.
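As an illustration only, the keyword categories above could be combined into a single Boolean search string as sketched below. The review does not state exactly how terms were grouped, so joining terms with OR inside a category and joining categories with AND is an assumption, as are the expansions of "PE" and "FL" to their full names and the helper name `build_query`.

```python
# Keyword categories as listed in the search strategy.
# Assumption: "PE" and "FL" are expanded to their full names.
KEYWORD_GROUPS = {
    "PE": ["physical activity", "exercise", "sports", "physical education"],
    "CLIL": ["CLIL"],
    "FL": ["multilingual", "bilingual", "language teaching",
           "language learning", "foreign language"],
    "population": ["child", "adolescent", "pupil", "youth", "student"],
}


def build_query(groups):
    """OR the terms within each category, then AND the categories together.

    Assumption: this grouping is one plausible reading of the strategy,
    not the authors' documented search string.
    """
    clauses = []
    for terms in groups.values():
        quoted = [f'"{term}"' for term in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


query = build_query(KEYWORD_GROUPS)
print(query)
```

Each database has its own query syntax (field tags, truncation symbols), so a string like this would normally need adapting per database.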

Selection criteria
All relevant articles included in this systematic review met the following criteria: (a) studies published in peer-reviewed journals, as these types of records have already been vetted by publishing bodies and experts in the field for their quality and relevance, (b) the intervention study had to promote and assess the learning or the teaching of a FL through CLIL, (c) studies that included qualitative and/or quantitative methods and findings, (d) research connected to school contexts (e.g. PE lessons, activity breaks or after-school programmes), (e) participants were students between three and 18 years of age and teachers, (f) studies which included CLIL programmes developed in an educational setting, and (g) studies published in English or Spanish, due to resource constraints. Duplicated documents, opinion articles, books, conference articles or theses were eliminated at the first level of exclusion. Studies that did not meet the abovementioned criteria were excluded at the second level. Moreover, the methodological quality of the systematic review was evaluated using the 11-item AMSTAR checklist, a measurement tool to assess the methodological quality of systematic reviews (Shea et al., 2007).
The inclusion criteria were first applied during a double-screening process, during which two reviewers independently screened each title and abstract and recorded the primary reason for rejection, if any. Inter-rater reliability for all screened records was then assessed via the Kappa statistic at 0.83, exceeding levels expected by chance. Disagreements regarding application of the inclusion criteria were resolved through co-author consultation. The full text of records retained after screening was then assessed for eligibility according to the inclusion criteria. Reasons for rejecting studies at this stage are documented in Figure 1.
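For readers unfamiliar with the Kappa statistic, the following minimal sketch shows how Cohen's kappa is computed for two reviewers making include/exclude decisions. The screening counts are hypothetical and chosen only for illustration; they are not the data behind the 0.83 reported above, and the function name `cohens_kappa` is likewise an assumption.

```python
def cohens_kappa(both_include, a_only, b_only, both_exclude):
    """Cohen's kappa for two raters and a binary include/exclude decision."""
    total = both_include + a_only + b_only + both_exclude
    # Observed agreement: share of records on which both reviewers agree.
    observed = (both_include + both_exclude) / total
    # Marginal inclusion rates of each reviewer.
    a_include = (both_include + a_only) / total
    b_include = (both_include + b_only) / total
    # Agreement expected by chance from the marginal rates.
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    return (observed - expected) / (1 - expected)


# Hypothetical counts (NOT taken from the review).
kappa = cohens_kappa(both_include=40, a_only=5, b_only=5, both_exclude=150)
print(f"kappa = {kappa:.2f}")
```

Kappa equals 1 under perfect agreement and 0 when agreement is no better than chance, which is why a value is interpreted relative to chance rather than as a raw percentage of agreement.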
To increase sensitivity and identify any additional relevant material, the bibliographies of all eligible records were examined through a hand-searching process. Unlike the original database search, which was limited to peer-reviewed, scholarly records, the hand-search also considered working papers and report citations. Given the acknowledgment of these records within peer-reviewed, scholarly literature, as well as their clear relevance to our research question, excluding these grey literature studies may otherwise have been considered a methodological weakness. Records identified through hand-searching underwent the same screening and full text assessment process as records identified through database searching. The above-cited Kappa statistic for inter-rater reliability includes both database and hand-search records.
The systematic search process and the number of results in each database are shown in Figure 1. During the selection process, the database search found a total of 6,080 articles (6,288 with duplicates). After the first level of exclusion, 1,437 original, potentially relevant articles remained. After assessing the title, abstract, introduction and/or context, 1,402 were discarded at the second level of exclusion. Finally, 35 articles were included in the final review. A narrative review method was employed to synthesize these studies. Data extraction was performed systematically by two authors to create two synthesis tables (Table 2 and Table 3) with comparable information related to the risk of bias of the Cochrane Collaboration protocol (CCP), the focus and assessed variables, the stage of education and age, the country where the research was developed, the number of schools involved, the sample size, the duration of the intervention, the PE contents involved, the FL used, the type of analysis (quantitative or qualitative), the measurement instruments and the outcome(s).
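The counts in the selection process can be checked with simple arithmetic, as in the sketch below. The reported figures come from the text; the derived figures (duplicates removed and records removed at the first level) are not stated in the review and rest on the assumption that the 6,080 total already excludes duplicates.

```python
# Figures reported in the selection process.
records_with_duplicates = 6288
records_found = 6080
remaining_after_first_level = 1437
discarded_at_second_level = 1402
included_in_review = 35

# Derived figures (assumption: 6,080 is the count after de-duplication).
duplicates_removed = records_with_duplicates - records_found
removed_at_first_level = records_found - remaining_after_first_level

# Consistency check: second-level exclusions should leave the final sample.
remaining = remaining_after_first_level - discarded_at_second_level
assert remaining == included_in_review

print(f"duplicates removed: {duplicates_removed}")
print(f"removed at first level: {removed_at_first_level}")
print(f"included: {remaining}")
```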

Data extraction and reliability
Following the STROBE reporting guideline (O'Brien et al., 2014; Von Elm et al., 2007), the following information was extracted from the original studies: (a) bibliographic information (author, title, year and form of publication), (b) description of the study sample (age, number, sex and grade of the group of participants), (c) theoretical framework and aim of the study, (d) task description of the intervention and control group and intervention type (physical and theoretical components), (e) intervention name, risk of bias, didactical method, duration and frequency, (f) assessed variables and measurement instruments used, and (g) study results. These data were analysed and described in order to obtain relevant information.
Based on the seven criteria of the Cochrane Collaboration protocol (Table 1), this systematic review assessed the risk of bias, which is defined as the risk of a systematic error occurring in study results or inferences (e.g. the study does not specify whether students are randomly assigned to an intervention or to a control group). This can lead to either under- or overestimation of the intervention effects. An assessment of the risk of bias can help to interpret variation in results and to balance the assumptions and conclusions drawn from these variations (Higgins et al., 2011). This protocol was designed in the field of health and health care research. However, different studies have demonstrated the appropriateness and usefulness of this protocol in the social science field, concluding that there are certain types of observational studies on the effects of non-healthcare interventions for which there is in fact close agreement with the corresponding randomised controlled trials (Konnerup & Kongsted, 2012; Cook et al., 2008). In the process of judging the risk of bias, each study was assessed by two reviewers, and discrepancies were resolved by agreement, citing relevant text passages and offering a justified judgement. Studies were noted to have a low risk of bias if the requested criteria did not show any indication of bias. If there was insufficient information given about the criteria, it was described as an unclear risk of bias. A high risk of bias was assigned to studies showing highly biased decisions in one or more of the domains considered, which was thought to seriously influence the results of the study. Within the category "blinding of participants and personnel", it was still considered a low risk of bias if teachers were not blinded.

Bias domain | Source of bias | Support for judgment
--- | --- | ---
Selection bias: Systematic differences between baseline characteristics of the groups that are compared | Random sequence generation (RSG) | Description of the detailed method to generate the allocation sequence to allow an assessment of whether it should produce comparable groups. Judgement according to the adequacy with which the allocation sequence was generated
Selection bias: Systematic differences between baseline characteristics of the groups that are compared | Allocation concealment (AC) | Description of the detailed method to conceal the allocation sequence to determine whether intervention allocations could have been foreseen in advance of, or during enrolment. Judgement according to whether the allocation was concealed adequately
Performance bias (PB): Systematic differences between groups | Blinding of participants and personnel | Description of measures used to blind participants and personnel from knowledge of which intervention a participant received; information on whether the intended blinding was effective
Detection bias (DB): Systematic differences between groups in how outcomes are determined | Blinding of outcome assessors | Description of measures used to blind the outcome assessors from knowing to which intervention group a participant belonged. Judgement according to whether knowledge of the intervention means allocated was hidden from assessors
Attrition bias (AB): Systematic differences between groups in withdrawals from a study | Incomplete outcome data | Description of the completeness of outcome results for each main outcome, including attrition and exclusions from the analysis. Report of attrition and exclusions in each intervention group and description of reasons for attrition/exclusions
Reporting bias (RB): Systematic differences between reported and unreported findings | Selective outcome reporting | Description of how the possibility of selective outcome reporting was examined, and what was found. Judgement according to whether hints of selective outcome reporting were found
Other sources of bias (OSB) | Anything else, ideally prespecified | Concerns about bias that were not addressed in one of the other domains

Results
Overall, 35 studies were included in the systematic review. Two types of studies were found in the systematic search process and are grouped in two different tables. Table 2 shows Low-CLIL interventions (low-intensity CLIL programmes; only the physical education subject was used). Table 3 groups High-CLIL interventions (high-intensity CLIL programmes; various academic subjects were combined: maths, natural sciences, social sciences (history and/or geography), arts and PE).

Discussion and conclusions
The aim of this systematic review was to describe and assess studies that used CLIL programmes to teach a FL via PE in a school setting. The number of studies has increased exponentially since 2017 (25 studies, 71.43%).
An analysis of the risk of bias revealed that all the studies had at least two fields with a low risk: performance bias (blinding of participants and personnel) and reporting bias (selective outcome reporting). The selection bias criteria (random sequence generation: 34.48%; allocation concealment: 27.58%) showed a high risk of bias in some studies. Most of the studies (30, 85.71%) revealed a low risk of detection bias (blinding of outcome assessment). Attrition bias (incomplete outcome data) revealed a low risk of bias in the majority of the studies (32, 91.43%). The other sources of bias assessed were the lack of certain English skills in the tests used to evaluate participants' overall FL proficiency, brief intervention programmes and small intervention groups. However, most studies (25, 71.43%) reported an unclear risk in other sources of bias.
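The proportions that are reported together with study counts in the paragraph above can be recomputed from those counts and the 35 included studies, as in this short sketch. The dictionary keys and the helper name `share` are illustrative; the selection bias percentages are omitted because the text gives no counts for them.

```python
TOTAL_STUDIES = 35

# Study counts reported in the risk-of-bias analysis.
reported_counts = {
    "low risk, detection bias": 30,
    "low risk, attrition bias": 32,
    "unclear risk, other sources of bias": 25,
}


def share(count, total=TOTAL_STUDIES):
    """Percentage of the included studies, rounded to two decimals."""
    return round(100 * count / total, 2)


percentages = {label: share(count) for label, count in reported_counts.items()}
for label, value in percentages.items():
    print(f"{label}: {value}%")
```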
Results showed two types of studies: (a) Low-CLIL (low-intensity CLIL programmes) and (b) High-CLIL (high-intensity CLIL programmes). Regarding the students' overall proficiency in a FL, Merino & Lasagabaster (2017) concluded that no significant differences emerged between learners that participated in non-CLIL and low-intensity CLIL programmes. However, significant benefits were found among students that participated in high-intensity CLIL programmes through the combination of different academic subjects (maths, natural sciences, social sciences (history and/or geography), arts and PE) (Merino & Lasagabaster, 2017). Taking into consideration all the academic subjects included within the curriculum, PE is ideal for developing CLIL and FL learning as it provides a comprehensive education and encourages motivation, participation and interactive learning amongst the students through movement and play (Coral & Lleixà, 2013; Zurita-Ortega et al., 2019).
Research seems to confirm that CLIL has a higher success rate than traditional methods in developing both written and oral FL skills (García-Calvo & Salaberri, 2018; Gil-López et al., 2019).
It is important to highlight that no studies were found in preschool education. Incorporating a FL during the early stages of childhood education is a positive, feasible and recommended way for students to learn and advance (Coyle et al., 2010). It is also important to integrate FL learning with different school subjects, since the objective is for students to view the FL as something natural (Rodríguez & Varela, 2004).
Games and sports were the most frequently used PE contents when learning a FL through CLIL because this type of content encourages communication amongst students to improve output skills (writing and speaking) (Gil-López et al., 2019). Therefore, just as Salvador-García et al. (2018) applied the Teaching Games for Understanding model through a touch-rugby lesson plan, applying pedagogical models through the incorporation of games and sports is a powerful and innovative line of research in the learning of a FL through CLIL.
Although the original goal of CLIL was to promote multilingualism, English has proved to be the most commonly used language (Coyle et al., 2010;Merino & Lasagabaster, 2015).
In this systematic review, the majority of the research was implemented in Spain using the English language. This shows the growing importance of learning English as a FL through CLIL in this country.
Regarding urban/rural setting, Alejo & Piquer-Píriz (2016) highlighted the advantages that urban learners had over rural learners when learning a FL. Randhawa & Michayluk (1975) concluded that urban settings offered a better physical environment, were more intellectually stimulating, and met the learners' needs to provide a satisfying learning experience. Nevertheless, Fan & Chen (1998) indicated that students from rural schools performed as well as, if not better than, the students in metropolitan schools in math, science, reading, and social studies. This research reinforces the idea that the differences between rural and urban students cannot be attributed only to the school settings (Pavón, 2018). Regarding students' physical activity level, results are scarce and contradictory. Coral et al. (2017) and Martínez-Hita & García-Cantó (2017) found lower MVPA levels than recommended (60 minutes of MVPA daily) to produce positive health effects (Martínez et al., 2012; World Health Organization, 2010), while Salvador-García et al. (2019), Salvador-García et al. (2020a) and Salvador-García et al. (2020b) concluded that physical activity levels were higher using CLIL. It is important that the use of a FL in PE does not reduce students' MVPA, and more research is needed.
Finally, Ní Chróinín et al. (2016) concluded that students' PE content learning was restricted by their FL knowledge. However, other researchers disagree. Hernando et al. (2018) concluded that using CLIL through English as a FL does not entail lower learning of the specific PE curriculum. Housen (2002) alluded to successful learning outcomes in content-subject areas (history, geography, arts, music, religion and economics) that were taught in a FL, and Jabrun (1997) found that after a year, FL immersion maths students outperformed mainstream students, and FL immersion science students performed as well as their mainstream counterparts. This study concluded that FL immersion students were the most efficient learners (Coyle et al., 2010). Therefore, more research is needed to assess students' PE content learning when using CLIL through a FL.

Limitations and areas for future research
This systematic review is subject to certain methodological limitations. While it attempted to capture an exhaustive list by using 14 different terms in the systematic search, some relevant records may not have been captured. The database search was also limited to records in English and Spanish, thus potentially excluding important publications in other languages. Finally, publication bias within the literature is well documented and can be a serious source of bias in systematic reviews (Petticrew & Roberts, 2006). This bias generally results in an underreporting of research with null or negative findings.
The magnitude of the impacts found in the studies included in this review was not expressed in such a way as to permit meta-analysis of overall impact. Only some studies (Fernández-Barrionuevo & Baena-Extremera, 2018; Merino & Lasagabaster, 2017) provided information on effect sizes. Most of the studies only reported tendencies without a statistical approach. Therefore, it would be beneficial for researchers to include data on effect sizes for quantitative and mixed research to help provide sufficient information for a future meta-analysis.
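As one example of the kind of effect size that could be reported, the sketch below computes Cohen's d (standardised mean difference with a pooled standard deviation) for two independent groups. Cohen's d is chosen here only for illustration; the reviewed studies may have used other measures. The test scores and group names are hypothetical and do not come from any included study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Hypothetical FL test scores (NOT data from the reviewed studies).
clil_scores = [72, 75, 78, 81, 84]
non_clil_scores = [68, 71, 74, 77, 80]

d = cohens_d(clil_scores, non_clil_scores)
print(f"Cohen's d = {d:.2f}")
```

Reporting group means, standard deviations and sample sizes alongside such a value would give a future meta-analysis what it needs to pool results across studies.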
With respect to our findings, the quality of some of the included studies is questionable, despite the fact that they can be found in bibliographic databases and have gone through a peer review process. Quality research (and evidence-based practice) should be thoroughly reviewed and analysed.
Lastly, the studies in our review were relatively geographically concentrated, primarily in Western Europe, due to the inclusion criteria and language restrictions. Different countries face unique challenges and the findings from one country may not be generalizable elsewhere. Therefore, more research is required in other continents to empirically deepen the knowledge of how to implement, understand and advance this teaching-learning approach.