Assessing Dual Language Learners of Spanish and English: Development of the QUILS: ES

Introduction and Objectives: Developing a language screener for Dual Language Learners presents numerous challenges. We discuss possible solutions for theoretical and methodological problems often encountered in the development of such a test and illustrate possible solutions using a newly developed language screener for Dual Language Learners. Materials and Methods: The process for developing, validating and norming the screener is also offered as a potential model for the development of other assessments for Dual Language Learners throughout the world. The twelve types of subtests are described within the areas of Vocabulary, Syntax, and Process. Results and Conclusions: Results from the Tryout and Norming phase on 362 Dual Language Learners aged 3 to 5;11 years are presented, together with the results of item selection via IRT, validity, and reliability testing. The advantage of using Best Scores is highlighted as a useful measure that helps identify children who are at risk of encountering language difficulties that will impact their academic success. Importantly, knowledge is found to be distributed across the languages.


The difficulties of Dual Language (English/Spanish) Screening
The general need for a language screener for preschool children is based on research findings that proper instruction and intervention are likely to be more effective in younger children, and overlooked problems can have long-term consequences for children's success in academics and life (Glogowska, Roulstone, Enderby, & Peters, 2000;Law, Kot, & Barnett, 1999;Ramey & Ramey, 1998;Roberts & Kaiser, 2015;Wake et al., 2011). Even by 3 years of age, the effects of lower language competence are evident: for example, children with poor communication skills are less sought after as conversational partners and more likely to be ignored or excluded by their peers (Rice, 1993). These children then fall further behind socially and tend to develop poor self-esteem as they advance through childhood (e.g., Conti-Ramsden & Botting, 2004;Craig, 1993;Jerome, Fujiki, Brinton, & James, 2002;Lindsay & Dockrell, 2000). Even short-term gains in language ability can enhance social relationships and mitigate the negative impact of language delay on behavioral, social, and emotional development (Olswang, Rodriguez, & Timler, 1998;Paul, 1996;Robertson & Weismer, 1999). Although several screeners are available for monolingual English speakers in the US, Dual Language Learners have been largely neglected as a group, and are often misidentified as having language problems based on testing only one language (Peña, Gillam, Bedore, & Bohman, 2011;Gillam, Peña, Bedore, Bohman, & Mendez-Perez, 2013). It is well documented that there is over-identification of English language learners (ELLs) as having language delays (Artiles, Rueda, Salazar, & Higareda, 2005), but under-identification is also a problem, where SLPs do not trust that a test is adequate to assess a language in the process of being learned (ref). A screener is necessary to assess whether a bilingual child has a language difference or potentially a language disorder.
In the US, there is a critical need to develop linguistically appropriate and valid assessment tools for children growing up in homes where they are exposed to English and Spanish (Barrueco, Lopez, Ong, & Lozano, 2012). Some children are primarily exposed to Spanish at home, but a large proportion will be raised in an environment in which both languages are used (Rojas, Iglesias, Bunta, Miller, Goldenberg, & Reese, 2016). Assessing the progress of dual language learning children is difficult for two reasons. First, children are arrayed along a continuum of bilingualism, from knowing mostly Spanish to knowing mostly English, with every alternative in between, thus making it hard to find norms in either language that treat all children fairly. Second, what Dual Language Learners know in each language remains obscure. It has been known for many years that vocabulary is distributed across the languages of children exposed to two languages, and not just at the very start, where children might resist having two words for one referent (Pearson & Fernandez, 1994;Pearson, 1998;Core, Hoff, Rumiche, & Señor, 2013;Mancilla-Martinez, & Vagh, 2013). There is evidence even up to college age that students have different vocabulary items in each language, with many words that do not have corresponding lexical items in the other language (Dong, Gui, & MacWhinney, 2005). What children store is distributed across the two languages. One purpose of the present report is to demonstrate that it is not just vocabulary that is distributed in young Dual Language Learners, but also syntactic development, and even the ease with which children learn new forms and words, or the process of learning. A dual-language learning child must be assessed in both of their languages to understand whether they are at risk of a language delay or disorder. Thus, the QUILS: ES assesses both languages, and it also provides a metric to evaluate the child's overall language competence.
The test-development process reported here might also serve as a schema for others looking to create dual language screeners for different language combinations, either for the US or other countries with a significant population of children learning two languages at an early age. The principles of test construction, choice of measures and methods of sampling, reliability and validity, should transcend the particular languages involved.

Challenges and Solutions
There are specific challenges in developing an adequate language screener for Dual Language Learners, and we highlight five below, together with the solutions we have devised from the process of developing a new screener, the Quick Interactive Language Screener: English-Spanish (QUILS: ES).

First Challenge and Solution: Persistent language problems are hard to identify early.
Some children are identified as "late talkers" at age 2 or 3 years based on their low language production. However, research suggests many of these children go on to develop language within the typical range (Dollaghan, 2013;Leonard, 2014;Rescorla, 2000). Language comprehension may provide a better predictor of which children will continue to have problems (Leonard, 2014;Thal & Bates, 1988) and require intervention. Parents and teachers can spot a child who is not speaking, but not all children who are late talkers require intervention; some children who appear to have language delays can comprehend language. Comprehension measures are at the cutting edge of children's linguistic capability (Hirsh-Pasek & Golinkoff, 1996;Seidl, Hollich, & Jusczyk, 2003;Weisleder & Fernald, 2009;Friend, Smolak, Liu, Poulin-Dubois, & Zesiger, 2018). Thus, it is essential to probe children's language comprehension because it may serve as a more sensitive measure of language skill than children's language production.
Relying on language production (what children say) can be problematic because young children may have limited expressive capacities and are often reluctant to demonstrate their full expressive potential in an assessment context with an unfamiliar examiner (Brown, 1973). With comprehension measures, the burden of communication with an examiner the child does not know can be reduced. In addition, the response demands of comprehension (in the case of the QUILS: ES, touching the correct picture on a screen) are much lower than those of production and do not require examiners to make judgments in the face of children's early, nonstandard pronunciations. The QUILS: ES invites children to play a game in which there are brightly colored pictures and animated scenes. It circumvents the problem of coaxing children to speak or to answer questions posed by a stranger. Children engage with the touchscreen computer or tablet in a way that is fun and yet reveals their language skill. The QUILS: ES presents items to children on a touchscreen, and the items are narrated automatically in the appropriate language. After a few training items that teach the child how to touch the screen, the test unfolds with a few interspersed animated GIFs that congratulate the child on their efforts and encourage the child to keep going.
These advantages of a comprehension instrument accrue to young children whether they are dual language learning or not. All children picked out as being at risk by such a screening tool will also need assessment of their production skills in a more thorough clinical workup.

Second Challenge and Solution: Assessments must examine the ability to learn as well as the products of learning.
Results from research on monolingual children show that oral language skills at age 3, including syntax as well as vocabulary, contribute to reading outcomes in first grade regardless of socioeconomic status (SES; NICHD ECCRN, 2005). Likewise, vocabulary and syntactic ability in prekindergarten are unique predictors of language variability in third grade (LARRC, 2015;Pace, Alper, Burchinal, Golinkoff, & Hirsh-Pasek, 2019). However, assessments have not incorporated more recent research that underscores the importance of evaluating the processes by which children learn language in addition to the products of language learning: syntax and vocabulary. That is, existing screeners and assessments measure what the child knows with little attention to how the child learns (Hirsh-Pasek, Kochanoff, Newcombe, & de Villiers, 2005).
Process measures that have become popular include dynamic assessment (Peña) and response to intervention ( ). In the current context, we assess the process of learning in a single test, not over time, by designing items that test how well children can learn new word meanings (a process called fast mapping), exploit the syntactic contexts in which new words appear, and extend words to new contexts, all of which jointly contribute to children's skills as language learners (Fisher, 1996;Golinkoff, Jacquet, Hirsh-Pasek, & Nandakumar, 1996;Seymour, Roeper, & de Villiers, 2004). In addition to assessing vocabulary and syntax, the QUILS: ES focuses on the process, in both languages, by which children learn language; that is, their proficiency at learning new vocabulary items and generalizing syntactic information in new contexts. For example, a child may have fewer vocabulary words than peers (e.g., perhaps due to limited exposure to language models) but be in line with his or her age group in terms of vocabulary acquisition skills, such as quickly acquiring a new word after a limited number of exposures. Children who have low scores in acquired vocabulary and syntax, for example, but prove capable at the process of learning new items and structures, have the machinery to learn language and perhaps only lack exposure to more high-quality language interactions. Those who are poor at language learning and have low levels of acquired vocabulary and syntax are more likely to need further assessment to determine eligibility or a remediation plan to bolster their existing language skills.
Our solution consisted of creating two distinct, although parallel, sections (English and Spanish) that assess product (the vocabulary and syntax that the child knows) and process (the child's ability to learn new vocabulary and syntactic structures). Each section (English or Spanish) of the QUILS: ES is arranged according to the three areas described below: Vocabulary, Syntax, and Process. Each area measures different types of language knowledge (e.g., prepositions), and the specific items are not the same in each of the two sections (e.g., "la muñeca está arriba del regalo" versus "the girls are between the motorcycles"). The screener uses animations to provide a more precise depiction of an event sequence that may be challenging for young children to glean from still pictures of actions or event sequences.
Table 1
Two illustrations are provided in Figure 1. These are stills of the final scene, but there is animation preceding this to allow the child to see the events unfold in time.

Figure 1 An illustration from clausal connectives (CC) in English and Spanish.
Question: Who ate the food before the cat jumped on the table?
"Who slid down the slide before the bus came?"

Third Challenge and Solution: Assessments must be applicable to the population assessed
The procedure by which we arrived at the final selection of items for QUILS: ES happened in multiple stages. All of the items on the QUILS: ES were chosen by experts in the science of child language development and are based on the most current research in language acquisition.
During item development and creation, native English and Spanish-speaking experts evaluated each item, ensuring that the items 1) were feasible for English-monolingual, Spanish-monolingual, and Spanish-English bilingual children, and 2) did not discriminate between children who spoke different dialects of English or Spanish. All items were chosen to be adaptable to English or Spanish, rather than relying on simple translation, and only words that were neutral across Spanish dialects were considered for inclusion in the screener. In addition, the use of obvious cognates, or words that overlap in form and meaning across languages such as the English cafeteria and Spanish cafetería, was avoided. This design prevents a speaker of Spanish from scoring correctly on an English item because of his or her Spanish knowledge rather than English knowledge of the word. Foils (i.e., the incorrect alternative answers) all represent choices children might plausibly make if they were guessing or had a false idea about the meaning of the word or sentence. These ideas were grounded in research studies wherever possible (e.g., Golinkoff, Bailey, & Wenger, 1992).

Fairness across Dialects
The QUILS: ES was designed with linguistic and cultural fairness in mind by selecting items through careful testing to be culturally and dialectally neutral in both languages.

Multi-step Process to Match Item Levels across Sections
To find appropriate items that would allow matching level items across the English and Spanish sections, the QUILS: ES development process occurred in four main phases over 5 years: 1) Item First Item Tryouts on the bilingual test guided our assignment of items to each language.
By examining performance on items against general child ability level across all of the items, we assessed whether an item behaved well or not. The rule was that an item "behaved well" if the more able children passed it, and the less able children failed it. We examined each item to see if the children who passed it had a total score that exceeded the total score of the children who chose one of the foils. By this means, we selected the items with the best discrimination between ability levels in each language, and chose which items were more successfully discriminating in English than Spanish or vice versa.
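The "behaved well" rule described above can be illustrated with a minimal sketch. It assumes each child's response to an item is recorded as correct or incorrect (any foil counts as incorrect) and that a total score across all items is available. The function names and the toy data are hypothetical and are not taken from the QUILS: ES analysis itself.

```python
from statistics import mean


def discrimination_gap(responses, totals):
    """Mean total score of children who passed an item minus the mean
    total score of children who chose a foil.

    responses: list of bool, True if the child chose the correct picture.
    totals:    list of each child's total score across all items.
    Returns None when everyone passed or everyone failed, because the
    comparison is then undefined.
    """
    passers = [t for r, t in zip(responses, totals) if r]
    foil_choosers = [t for r, t in zip(responses, totals) if not r]
    if not passers or not foil_choosers:
        return None
    return mean(passers) - mean(foil_choosers)


def behaves_well(responses, totals):
    """An item 'behaves well' (in this sketch) when passers outscore foil choosers."""
    gap = discrimination_gap(responses, totals)
    return gap is not None and gap > 0


# Toy data (hypothetical): five children's total scores.
totals = [40, 35, 30, 12, 10]
good_item = [True, True, True, False, False]   # abler children pass
poor_item = [False, True, False, True, True]   # abler children fail
```

Comparing the gap for the same item content in each language is one simple way to see whether an item discriminates better in English or in Spanish; the IRT-based item selection reported for the screener is a more formal treatment of the same idea.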

Inclusion Criteria for the Normative Sample
The normative sample for the QUILS: ES included children 3 (3;0) through 5 (5;11) years old with no reported visual or hearing difficulties who were screened in their child care centers, preschools, kindergartens, and Head Start programs in Massachusetts, Pennsylvania, Delaware, Florida, and Nebraska. Children who spoke a language other than English or Spanish were not included in the sample. A Language Questionnaire completed by parents (ref) or school-supplied information was used to determine the degree to which English or Spanish were used. Since the normative sample was designed to be representative of dual language learning Spanish-English children in this age range in the United States, it likely includes some children who had language disorders (see Table 2). The percentage of mid-SES families approximates the percentage reported in the 2014 U.S. census data for Hispanic females. The rate of attaining an education level of an associate's degree and above was 26.1% in 2015 (NCES 2015). However, that figure includes women who achieve a degree later in life. If one looks at the rate of completion of bachelor's degrees or higher among Hispanic females in the years from 2006 to 2016, the rate is between 12.9 and 16.6% (U. S. Census Bureau, 2016).
Table 2 Composition of the norming sample for the QUILS: ES

Final norming sample: Total N = 362
Demographic data for race were available for 66.6% of the final bilingual sample: 55.8% were White, 6.6% were Black/African American, 1.4% were multiracial, 0% were Asian, and 1.9% were other races. Additionally, 82% of parents reported whether or not their child was of Hispanic origin; of those who reported on it, 91.2% self-identified as being of Hispanic origin.

Fourth Challenge and Solution: Knowledge is Distributed across Languages.
A crucial decision in the design of the new screener for Spanish-English Dual Language Learners was to assess both languages in an equivalent way, so as to assess what a child knew in each language, and also overall. Our approach to capturing the child's overall language uses their best score in each of the language areas assessed, and compares their performance to other Dual Language Learners. It would not be appropriate to compare these children's language skill to monolingual English or monolingual Spanish speakers who have only heard a single language.
Therefore, screening bilingual children in both of their languages and using their best score provides us with information about whether children are developing language at an appropriate rate for their age.
Why are the Best Scores important for assessment of dual language learners? First, because they make it possible to develop peer group comparisons for children who vary in whether they are stronger in English than Spanish or vice versa, namely, across the broad continuum of types of dual language learner. Second, because Best Scores consider that a child may know one feature in one language (let's say negation) and another feature in the other language, and hence be disadvantaged if only one language is assessed. With Best Scores, we see whether they have controlled that language feature generally. Third, the point here is not to emphasize how strong the skills are overall, despite the word Best. A child whose Best Scores lie outside the range of his peers, even peers along this varied continuum, reveals a deficit that is of clinical concern, because he does not show understanding in either language.

Distributed Knowledge
The performance across the various subtest types provides useful information about what a given child knows already, though based on a very small sample of items. Nevertheless, for our purposes the patterns of responses reveal the important fact that a child's knowledge is distributed across the languages. As Figure 1 reveals, these two sample children show quite different profiles of which subtests they find easy and hard in Spanish versus English. It is not just knowledge of particular lexical items that is distributed in a Dual Language Learner, but also syntax and process. Table 1.

Best Scores across English and Spanish
The Best Score uses the maximum score on each subtest type from each language to get an overall view of the child's functioning (Peña, Gutiérrez-Clellen, Iglesias, Goldstein, & Bedore, 2018;de Villiers, 2015). Best Scores capture the fact that a bilingual child's knowledge can be distributed between their two languages (Peña, Bedore, & Zlatic-Giuta, 2002).

[Panel: Example 2; columns: English, Spanish]

For each subtest type, the higher score achieved in a language was included in the child's total score. The comparison was of proportions correct, as the numbers of items in each area varied. These total scores provide Best area scores (e.g., Best Process, Best Vocabulary, Best Syntax) and Best Total scores.
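The Best Score logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the screener's scoring code: the subtest names, the item counts, and the choice to average subtest proportions within an area and overall are all hypothetical, since the text specifies only that the higher proportion correct across the two languages is taken for each subtest type.

```python
from statistics import mean


def best_scores(english, spanish, areas):
    """Compute Best Scores from per-language subtest results.

    english, spanish: dict mapping subtest name -> (n_correct, n_items).
    areas:            dict mapping area name -> list of subtest names.
    Proportions correct are used because item counts can differ.
    Averaging within areas and overall is an assumption of this sketch.
    """
    best_by_subtest = {}
    for subtest in english:
        eng_correct, eng_items = english[subtest]
        spa_correct, spa_items = spanish[subtest]
        best_by_subtest[subtest] = max(eng_correct / eng_items,
                                       spa_correct / spa_items)
    best_by_area = {
        area: mean(best_by_subtest[s] for s in subtests)
        for area, subtests in areas.items()
    }
    best_total = mean(best_by_subtest.values())
    return best_by_subtest, best_by_area, best_total


# Hypothetical child: stronger on nouns in English, on wh-questions in Spanish.
english = {"nouns": (4, 5), "wh_questions": (1, 4), "fast_mapping_nouns": (2, 4)}
spanish = {"nouns": (3, 5), "wh_questions": (3, 4), "fast_mapping_nouns": (2, 4)}
areas = {
    "Vocabulary": ["nouns"],
    "Syntax": ["wh_questions"],
    "Process": ["fast_mapping_nouns"],
}

by_subtest, by_area, total = best_scores(english, spanish, areas)
```

In this toy case the child's Best Syntax score comes from Spanish and the Best Vocabulary score from English, which is the distributed pattern that a single-language score would miss.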
It is evident that the two children presented in Figure 2 differ in what they find easy or hard in each language. But is every case unique, or are there similarities across the group? One troubling question in a comparison of this sort is how we could match the level of sophistication of items in Spanish to those in English. For example, despite the piloting and first Tryout work, we might have accidentally chosen a harder set of verbs in English, or a more difficult set of scenarios for conjunctions in Spanish. If that were true, then the Best Scores would give the pattern away, because there would be uniformity as to which language the children did better in for a given subtest. On the other hand, if this varies, then the pattern must be due to something other than the difficulty of the items chosen.
To answer this question, we derived difference scores on each subtest, i.e., English minus Spanish. Then we added the subtests together for each general area: Vocabulary, Syntax, and Process. A positive score means English was superior to Spanish for that skill, and a negative score means Spanish was better. The differences across the whole sample were tested using a one-sample t-test where, hypothetically, the expected value is zero if the children as a group knew both languages equally. In fact, there are significant differences across the subtests, with four favoring Spanish (verbs, prepositions, wh-questions, and fast mapping adjectives) and the remaining eight favoring English. However, the differences in general are very close to zero (mean = .02, or 2% difference), with a large standard deviation (.36).
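The one-sample test on difference scores can be sketched as follows. The data are invented for illustration, the function name is hypothetical, and the sketch stops at the t statistic and degrees of freedom; obtaining a p-value would require a t distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(diffs, expected=0.0):
    """t statistic and degrees of freedom for testing whether the mean
    of the difference scores departs from `expected` (zero here, i.e.
    no overall advantage for either language)."""
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample SD, n - 1 denominator
    t = (mean(diffs) - expected) / standard_error
    return t, n - 1


# Hypothetical English-minus-Spanish proportion differences for five children.
diffs = [0.2, -0.1, 0.4, 0.0, 0.1]
t_value, df = one_sample_t(diffs)
```

A positive t here indicates a mean difference in the English direction and a negative t a difference in the Spanish direction, matching the sign convention of the difference scores.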
To the extent that a subtest changes valence across time, it must be that the child is acquiring knowledge that allows them to score higher in the other language. For almost all subtests, there is a significant drift towards English skills being better than Spanish skills from age 3;0 to 5;11. This is in keeping with the children's attendance at largely English-speaking day-cares and preschools. Taking a wider lens, difference scores for the summed subtests in Vocabulary, Syntax, and Process show a broader pattern in which Spanish Vocabulary (though only verbs and prepositions) dominates, whereas Syntax and Process shift earlier to an English preference. A repeated measures ANOVA with the three area scores as the dependent variable and age and gender as the independent variables revealed a significant difference across the different areas (F(1,357) = 43.97, p<.001, ηp² = .12), and a small but significant interaction with age (F(1,357) = 3.14, p<.05, ηp² = .18). Vocabulary is different in profile from the other two areas since children do better on the Spanish items, but all show the same movement across age towards English. Figure 2 shows the change across age in which general area children do better in Spanish or English. It is clear that some of the variability is predicted by age and experience in English versus Spanish.

Fifth Challenge and Solution: Assessments must be psychometrically sound
Screening instruments have to pass certain psychometric standards to be useful for practitioners, and these include establishing that they have sufficient validity and reliability.

Construct Validity
Validity of an instrument is examined to ensure a test is actually measuring what it claims to measure. That is, do the items on the QUILS: ES form a coherent set (construct validity)? A screener must be based on phenomena that expert researchers, teachers, and other educators regard as linguistically significant and educationally meaningful for children in the age range being examined. Without adequate theoretical and empirical backing to establish construct validity, no screener or test can be considered adequate.

Concurrent/Convergent Validity
The PLS-5 has two components, Expressive Communication (EC) and Auditory Comprehension (AC), and provides a total score in each language. To prepare the data for the validity analyses, a total score was derived for the QUILS: ES by adding together the 45-item scores in each language. To compare with the standard scores of the PLS and PPVT, these totals were then converted to standard (Z) scores by age group. Bivariate correlations between the QUILS: ES in English and the PLS-total English reveal a moderately high correlation (r(44) = .693, p<.001).
Bivariate correlations between the QUILS: ES total Spanish scores and the PLS-total in Spanish reveal a smaller but still highly significant correlation (r(44) = .449, p<.002).
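The two steps described above, standardizing totals within age group and then correlating them with another measure, can be sketched as follows. The function names and the toy scores are hypothetical, and the use of the sample standard deviation within each age group is an assumption of this sketch rather than a documented detail of the analysis.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def z_by_age_group(scores, age_groups):
    """Convert raw totals to z scores using the mean and sample SD of
    each child's own age group (each group needs at least two children
    with non-identical scores)."""
    grouped = defaultdict(list)
    for score, group in zip(scores, age_groups):
        grouped[group].append(score)
    stats = {g: (mean(v), stdev(v)) for g, v in grouped.items()}
    return [(s - stats[g][0]) / stats[g][1]
            for s, g in zip(scores, age_groups)]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical raw totals for three 3-year-olds and three 4-year-olds.
raw_totals = [10, 20, 30, 40, 50, 60]
ages = [3, 3, 3, 4, 4, 4]
z_scores = z_by_age_group(raw_totals, ages)
```

Standardizing within age group removes the overall age trend, so a subsequent correlation with another test's standard scores reflects children's relative standing among same-age peers rather than shared growth with age.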
As part of the concurrent validity testing, 44 other children completed the QUILS: ES and the BESOS: the Bilingual English/Spanish Oral Screener. This test designed for ages 4 to 7 contains Morphosyntax (BESOS-MS) and Semantics (BESOS-S) subtests in both English and Spanish (Lugo-Neris, Peña, Bedore, & Gillam, 2015). We looked at the inter-correlations between the Spanish and English BESOS with the Spanish and English QUILS: ES, shown in Table 9.7. Since the BESOS has only been normed for ages 4 and up, we only included in the analyses the 29 children (out of 44 total) who were older than 4.

Internal Reliability
A test must also have internal integrity. The items on the test must form a coherent set that intercorrelate even though the items may vary in difficulty. To ensure this for the QUILS: ES, an analysis called Rasch modeling was used (Rasch, 1960;Wright & Stone, 1979). Demonstrating that a test's items have internal consistency is another metric of reliability. Cronbach's (1951) coefficient alpha is used to calculate this. Coefficient alpha provides a lower bound value of test reliability and is considered to be a conservative estimate of a test's reliability (Allen & Yen, 1979;Carmines & Zeller, 1979;Reynolds, Livingston, & Willson, 2009). [...] the DIFs cancel out, neither of the groups is disadvantaged by including these items (Nandakumar, 1993).
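For readers less familiar with coefficient alpha, the standard formula is alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. A minimal sketch follows; the function name and the toy response matrix are hypothetical, and this is not the analysis code used for the QUILS: ES.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's coefficient alpha.

    item_scores: list of rows, one per child, each row a list of item
    scores (e.g. 1 = correct, 0 = incorrect). All rows must have the
    same length, and total scores must not all be identical.
    """
    k = len(item_scores[0])
    item_variances = [variance([row[i] for row in item_scores])
                      for i in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of four children to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
```

Alpha rises as items covary more strongly relative to their individual variances, which is why it serves as an index of how coherently the items hang together.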

Test-Retest Reliability
A second session of QUILS: ES testing was administered four to six weeks after the initial QUILS: ES testing to establish test-retest reliability. Children received both English and Spanish portions of the QUILS: ES after their initial QUILS: ES session, in the same order in which the initial QUILS: ES was administered. As with the initial QUILS: ES session(s), for the retest, the two language portions of the QUILS: ES were given within two weeks of each other. Using Best Scores as the measure, the test-retest reliability was high (.89).
The instrument has good internal reliability, test-retest reliability, and validity against other accepted measures like the BESOS, the PLS, and the PPVT(English). QUILS: ES has not yet been fully tested on a clinical population of children with language delays, though that work is underway in two clinics and the results are promising. We need to establish the specificity and sensitivity of the test for clinical use, but its use as a screener in educational settings is not precluded and should provide useful information.
At the completion of both sections of the test, the QUILS: ES provides several kinds of automatic reports, designed for parents, teachers, and schools at different levels of specificity and formality, covering the child's individual language scores, their norms, percentiles, and an evaluation of risk status based on their overall performance. A sample "Student Brief Report" is provided in Appendix B.

Discussion
In this paper we have addressed five significant issues that need to be tackled by designers of a screening instrument for Dual Language Learners. We argue that construct validity is essential: SLPs, linguists, psychologists, and experts on language acquisition and disorder need to collaborate to choose appropriate areas of assessment. These areas should reflect linguistic properties that are diagnostic of the stages of development in early childhood, but also their use in everyday life, for example in preparation for the demands of schooling. That is why we emphasize assessment of how children can learn new things, not just a sample of what they already know. It is important that the sample match the group for whom it is designed, both in terms of adequate representation across SES and, on the screener, that the items are neutral with respect to culture and dialect. Given the way dual languages are represented in the mind, we emphasize that the scores take into account distributed knowledge.
We addressed each in turn and presented the solutions we adopted in the making of a new screener for Spanish-English bilinguals in the US. The results demonstrate that a touchscreen screener for bilingual Spanish-English learners is a viable option for fair testing of children aged 3-6 years in the US. It is a self-contained test, where narration and scoring are automatic, making it broadly useful even in areas where the number of bilingual SLPs is low relative to the population of children in need of screening. We would like to test it in wider arenas such as Latin America, where the Spanish section might prove useful even with monolingual learners of Spanish.
The screener emphasizes the use of the Best Score as a fair index of a Dual Language Learner's competence with language development. There are three new findings here. First, we demonstrate that there is distributed knowledge in Dual Language Learners not just in vocabulary but in syntax and process indices too. Second, differences between subtests begin to switch over the course of the preschool years towards English, though Spanish retains strength in the areas of Vocabulary. Third, these changes are predictable from parental reports of the proportional use of the different languages in the home.
Finally, we recognize that this is a screener with potential extension to even younger children. We recently completed work on a touchscreen assessment (BabyQUILS) with simpler language subtests that includes vocabulary, syntax, and process items, and is normed on US two-year-olds (N=440) who are monolingual in English (de Villiers et al., 2019). In the process of collecting data, some children (N=83) were tested whose exposure in the home was to other languages as well, and many of their scores approached the normal range for English, especially by 30-36 months. This gives us confidence that a screener is a future possibility for two-year-old Dual Language Learners, who could reveal their full linguistic knowledge distributed across different versions of the assessment. The existing work on this younger age range focuses heavily on vocabulary, so a broader assessment that included grammar and process would be a valuable addition to the research base.