Research to Establish the Validity, Reliability, and Clinical Utility of a Comprehensive Language Assessment of Mandarin

Purpose: With no existing gold standard for comparison, challenges arise for establishing the validity of a new standardized Mandarin language assessment normed in mainland China. Methods: A new assessment, the Diagnostic Receptive and Expressive Assessment of Mandarin (DREAM), was normed with a stratified sample of 969 children (ages 2;6–7;11) in multiple urban and non-urban regions in northern and southern China. In this study of 230 children, the sensitivity and specificity of DREAM were examined against an a priori judgment of disorders. External validity was assessed using two indices of language production for different age groups. Results: External validity was assessed against spontaneous language indices (correlations range r=0.6 to 0.7, all p<.01) and narrative indices (overall r=0.45, p<.01). Sensitivity (0.73) and specificity (0.82) of DREAM are moderate to good using a priori judgment as the standard. Sensitivity improved to 0.95, while specificity remained at 0.82, when spontaneous language and narratives were added to a priori judgment to define typicality. Divergent validity was moderate with non-linguistic indices. Conclusions: DREAM holds promise as a diagnostic test of Mandarin language impairment for children.


The Background Problem
Children with specific language impairment (SLI) are clinically defined as having language skills one standard deviation or more below the age-group mean, with a non-verbal IQ of 85 or above, and without medical or neurological diagnoses, such as hearing impairment (Rice, 2013; Tomblin et al., 1997). It is a matter of some controversy whether there is a genuine category under the term SLI, or whether more attention needs to be paid instead to the problem of language impairment in a broader sense (Reilly et al., 2014). Regardless of the answer to this question, children with hearing loss, or with genetic disorders such as Down syndrome, are easily recognized by the medical profession as needing language services, whereas children who only have a language problem are likely to be missed.
The prevalence of SLI for kindergartners in the upper Midwestern region of the United States was found to be 7.4% overall, 6% for females, and 8% for males (Tomblin et al., 1997). Assuming that the impairment is a worldwide phenomenon with some significant genetic component (Rice, 2013), applying the United States prevalence rate to the 2010 Chinese census (China Data Center, 2012) would suggest that there are approximately 230,000 children with SLI in China between five and six years of age alone who are currently in need of identification and rehabilitation services. The profession of speech-language pathology is just beginning in mainland China, so there are few trained clinical professionals with the linguistic knowledge to assess a child with language impairments, and there is a lack of language assessments that meet validity and reliability standards. There have been some attempts by pediatricians to develop language screeners and assessments for early identification of a possible language disorder among mainland Chinese children. For example, there is a checklist, the Infant and Child Language Development Screener (Zhang, Jin, & Shen, 2003), for early functional language communication from birth to 36 months that has items such as "makes noise while smiling" and "is able to speak a simple sentence". This screener has been tested on over 8,000 children in Shanghai, China and is currently being used by multiple pediatric hospitals in China. The Mandarin MacArthur-Bates Communicative Development Inventories (Tardif, Fletcher, Zhang, Liang, & Zuo, 2008) was adapted from the MacArthur-Bates Communicative Development Inventories (Fenson et al., 2007) and normed only for the Beijing region. It is a vocabulary checklist completed by parents of children between birth and 30 months.
However, researchers have reported that late talkers often catch up to their peers in language skills by 3 to 5 years of age (Leonard, 2014), while children with a language disorder do not. As a result, it is important that diagnostic tests be available for children older than three years. The problem in China is that, until now, there have been no formal, standardized, and comprehensive language assessment tools normed in mainland China that meet psychometric standards (Friberg, 2010) for diagnosing language impairment when an overt medical diagnosis, such as hearing impairment, is not present.

Designing a New Test for Mandarin
To fill the need in China, the DREAM test was developed as a standardized, norm-referenced language assessment representative of Mandarin-speaking children aged 2;6-7;11 in mainland China. We first discuss the specific challenges associated with designing a new test in a language that has no existing assessments, and how items were developed for DREAM.
First, it is difficult to begin the process of developing a new test when there are no existing systematic assessments. In English, language acquisition has been studied in great depth for about the past 50 years, leading to a very good empirical knowledge base of what children at different ages know and can answer in natural circumstances of language interaction (e.g., the extensive databases in CHILDES: MacWhinney & Snow, 1985). In parallel with naturalistic studies, experimental work has been carried out testing particular aspects of knowledge, such as relations among lexical concepts, understanding of tense, or production of relative clauses (Crain & Thornton, 1998; de Villiers & Roeper, 2011; Guasti, 2004; McDaniel, McKee, & Cairns, 1996). Furthermore, large numbers of assessments have been developed and normed, including tests of vocabulary for age 2 to 90 years (Dunn & Dunn, 2007), and batteries of syntax, morphology, semantics, and pragmatics (Hresko, Reid, & Hammill, 1999; Semel, Wiig, & Secord, 1996; Zimmerman, Steiner, & Pond, 2002).
Hence, in developing a new assessment in English, a wealth of gold standard data already exists against which to examine the validity of the new items. Nevertheless, even in the United States, issues arise about the appropriate standard for children who are dual language learners (Iglesias, 2015) and for children who speak non-mainstream dialects of English (de Villiers & de Villiers, 2010).
Not only are there no gold standard assessments in Mandarin, but the knowledge base about normal language acquisition is considerably less well established. Yet it is a mistaken approach to translate assessment tests from English into another language, because languages differ in significant ways (Peña, 2007). Semantic constructs may be expressed straightforwardly in one language but in a highly complex way, or not at all, in another language. Grammar development can take quite different paths in different languages (Slobin, 1986).
To begin to solve this problem, a large variety of test items was designed and piloted in the Beijing/Tianjin area by a team composed of linguistic experts in Mandarin from mainland China, western-trained bilingual speech-language pathologists, and experts in assessment development.
The items were selected to represent the variety of linguistic forms and structures that are acquired by typically developing children from age 2 through 8 who speak Mandarin (Cheng, 1988; Lee, 1982, 1986, 1992; Lee & Naigles, 2008; Li, Huang, & Hsiao, 2010; Liu, 2009; Liu & Ning, 2009; Zhou, 2002; Zhou & Crain, 2009, 2011). Items were chosen not only by close attention to the empirical and theoretical literature on language development, but also by considering evidence on the nature of language deficits in childhood. The research team took care to reflect properties of Mandarin that could create challenges for a child with language impairment (Cheung, 2009; Fung, 2009).
Mandarin is radically different from English in several ways. Inflectional and derivational morphology, standbys for English language tests, are virtually nonexistent in Mandarin, where, in contrast, compounding of morphemes is common. Unlike English, the time markers in Mandarin focus on Aspect, not Tense. Discourse allows the dropping of both subject (S) and object (O) noun phrases, and even the verb (V). Wh-phrases do not move; they remain in situ. Nouns often require classifiers, but there are no determiners. The passive takes several forms, but controversy rages about its underlying structure. About the only thing Mandarin and English grammars have in common is a preference for S-V-O word order.
The content of the items went through a rigorous selection and testing procedure. The first focus was on the unique aspects of Mandarin (e.g., classifiers) for which small-scale experimental studies have been conducted, allowing estimates of the age at which the constructions might be mastered, and also clues about the likely errors a child might make in production or comprehension (Chien, Lust, & Chiang, 2003; Li et al., 2010).
Thirdly, a specific area of focus was to consider measures of process. The problem with measures of what has been achieved is that they can fail to distinguish children who have genuine difficulties in learning the language, despite adequate exposure, from those who are delayed because their exposure to sufficiently varied and complex language has been inadequate (Hirsh-Pasek, Kochanoff, Newcombe, & de Villiers, 2005; Rice, Buhr, & Nemeth, 1990).
Fourthly, we considered the variation in the language, even in the Mandarin spoken in different regions. According to Sun (2006), China is home to at least seven mutually unintelligible dialects and numerous additional regional languages mostly associated with minority peoples. However, Mandarin seems to be the place to start because it is China's educational policy in urban and suburban regions for teachers to use Mandarin exclusively in schools for children age 3 and above. Parents filled out a questionnaire regarding the child's exposure to and use of other dialects and languages. Only children who were reported to speak Mandarin could be included, though it was common for members of the household to speak other local dialects. Much care was taken to avoid lexical items or expressions that might be biased against or in favor of certain dialects and not others. As a check, the assessment was administered across different urban and suburban regions, as well as northern and southern dialect regions, and analyses were conducted to finalize the items on the test by ensuring that they were not biased by the dialect the child spoke or the region of China from which the child came. A differential item functioning (DIF) analysis was used to determine whether an item on an assessment may be biased against a subgroup of respondents due to characteristics of the item that are not related to the construct being assessed (Zumbo, 2007). Items were flagged for possible DIF using two approaches, Mantel-Haenszel (Holland & Thayer, 1988) and item response theory (Rasch DIF in Winsteps; Linacre, 2014). Approximately 3% of the originally developed item pool was flagged and modified based on this process. Revised items were subsequently included in standardization and tested again for evidence of DIF to ensure that the item modifications were effective.
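As an illustration of the Mantel-Haenszel approach mentioned above, the following minimal sketch computes the common odds ratio and the ETS delta-scale statistic (MH D-DIF) for a single item. It is not the procedure actually run for DREAM; the function name, the grouping of children into matched ability bands, and the example counts are all hypothetical.

```python
import math


def mantel_haenszel_dif(strata):
    """Return (common odds ratio, MH D-DIF) for one item.

    strata: list of (ref_correct, ref_wrong, focal_correct, focal_wrong)
    counts, one tuple per matched ability band (e.g., total-score band).
    """
    num = 0.0
    den = 0.0
    for ref_ok, ref_bad, foc_ok, foc_bad in strata:
        n = ref_ok + ref_bad + foc_ok + foc_bad
        if n == 0:
            continue  # skip empty ability bands
        num += ref_ok * foc_bad / n  # reference correct x focal wrong
        den += ref_bad * foc_ok / n  # reference wrong x focal correct
    alpha_mh = num / den  # raises ZeroDivisionError if den is 0
    # ETS delta metric: negative values mean the item is harder for the
    # focal group than for matched children in the reference group.
    d_dif = -2.35 * math.log(alpha_mh)
    return alpha_mh, d_dif


# Hypothetical counts for one item in two ability bands (not study data):
# reference group = northern region, focal group = southern region.
example_strata = [(30, 10, 24, 16), (45, 5, 40, 10)]
odds_ratio, delta = mantel_haenszel_dif(example_strata)
```

An odds ratio near 1 (delta near 0) suggests that matched children from the two groups find the item equally difficult; items far from that value would be candidates for review.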
Nevertheless, we fully recognize that this is only the first of a series of potential tests that will be needed to maximize fair testing across China.
A final consideration in item design was the performance of the items throughout the age range. Within any given type, it was necessary to design easy and difficult items that might discriminate at different points in development.

Standardization
Detailed information about the piloting and standardization of the DREAM test is included in the assessment manual, which is available upon request from the first author. After extensive piloting and tryout in different regions within mainland China, a final set of items was selected by subjecting the data to Rasch analyses (a variety of item response theory; Embretson & Reise, 2000). In a Rasch analysis, "ability" is defined by the number of items a child gets correct, and "item difficulty" by the number of children who get the item incorrect. As in classical test theory, the overall quality of the test is judged by the degree to which the items cohere in providing the same estimate of a child's relative ranking.
Rasch psychometrics have the advantage of being less affected by the properties of the sample.

Well-behaved test items are selected that spread across the range of abilities in terms of their discriminative potential, and are not misfits; for example, they do not create undesirable U-shaped developmental curves.
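For readers unfamiliar with the model, the sketch below shows the standard dichotomous Rasch formulation, in which the raw counts described above are converted to ability and difficulty estimates on a common logit scale. It is a textbook illustration rather than the Winsteps procedure used for DREAM; the function names are hypothetical, and the outfit mean-square is only one of several fit statistics that can flag a misfitting item.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response; both arguments are in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def outfit_mean_square(responses, abilities, difficulty):
    """Unweighted mean of squared standardized residuals for one item.

    responses: 0/1 scores, one per child; abilities: matching logit estimates.
    Values near 1 indicate that the item behaves as the model expects.
    """
    total = 0.0
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)


# A child whose ability equals the item difficulty has a 50% chance of success.
p_equal = rasch_probability(0.0, 0.0)
```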
The linguistic components were compiled into four composites: Expressive score, Comprehension score, Syntax score (adding across modes) and Semantics score (adding across modes). The table in Appendix A shows a breakdown of the major parts of the test and the number of items finalized for each of the receptive and expressive subcomponents. It also lists some examples from the subtests.
The nationally representative standardization sample consisted of 969 Mandarin-speaking children between the ages of 2;6 and 7;11, with equal numbers of boys and girls.
Between 2;6 and 5;11 years, half-year age groups were distinguished, with year-long age groups for 6- and 7-year-olds. Sampling included multiple cities and suburbs in both the northern and southern regions of China, and was stratified by age, gender, urban versus suburban residence, region, and highest primary caregiver education level, according to the most recent census data (China Data Center, 2012).

Reliability and Validity Studies and the Establishment of Clinical Utility
Data from standardization provided estimates of the internal consistency reliability of the DREAM Total scale (Cronbach's alpha=0.94; N=969) and test-retest reliability over a 2-4 week period (r=0.85; N=60).
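The internal consistency statistic reported above can be computed directly from an item-by-child score matrix. The following is a minimal sketch of the usual Cronbach's alpha formula; the function name and the toy data are hypothetical and are not taken from the DREAM standardization files.

```python
from statistics import variance


def cronbach_alpha(scores):
    """scores: list of per-child lists of item scores (all the same length)."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across children.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the children's total scores.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 0/1 item scores for five children on four items.
toy_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
alpha = cronbach_alpha(toy_scores)
```

Alpha approaches 1 when children who pass one item tend to pass the others, which is the sense in which the items "cohere".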
In the normative sample, females demonstrate a 4.2 point advantage on DREAM total scaled score, on average. These results are in keeping with the general finding world-wide that boys are more likely to have language delay than girls (Snowling, Duff, Nash, & Hulme, 2015; Zambrana, Pons, Eadie, & Ystrom, 2014). In DREAM data this advantage is less pronounced at younger ages (<4.0 years, 1.2 points), and more so at older ages (>=4.0 years, 5.6 points), in keeping with other findings that boys may have more persistent language disorder.
The item development process and dialect differential analysis indicated that DREAM has good content validity. The standardization phase revealed that DREAM has appropriate psychometric properties, including internal reliability and test-retest reliability, with a large sample of children aged 2;6-7;11 years.

The Present Study
The quality of a test is also judged by its ability both to identify children in need of language intervention services (sensitivity), and not to select those who do not need such services (specificity). The study we report next will evaluate the sensitivity and specificity of DREAM in assessing children with potential language impairment. The question that needs to be addressed is how to establish the groups of typical and atypical children in order to judge the specificity and sensitivity. We assess two approaches. One uses the judgment of pediatricians, who usually make such decisions in China, solely based on a thorough parent interview. The second adds to this by using language samples to further refine the pediatricians' decisions.
Language sampling and linguistic analyses are often recommended to assess language impairment when there is no established instrument, but the process is time consuming and requires expertise. Nevertheless, results from language sampling can be used for examining convergent validity of a new instrument, especially when there are no other suitable measures (Bedore, Peña, Gillam, & Ho, 2010). The samples also provided a way to refine the discrimination of the pediatricians' decisions.

Participants
Three hundred children aged 2;6-7;11 were recruited for this study at a major urban pediatric hospital, Shanghai Children's Medical Center in China. This hospital serves as a central agency for evaluating children with developmental problems, including language development. All 300 children received a regular physical examination from the pediatricians at the Medical Center. If a concern about the child's communication was expressed by the parents or teachers, pediatricians then evaluated the child through a thorough informal interview addressed to the parents. Ninety-four children were classified as possibly having atypical language development according to pediatricians' judgment based on the parent interview without any assistance from comprehensive language assessments. These children were termed the a priori atypical group. A parallel sample of a priori typical children (N=136) with approximately the same demographic characteristics consisted of children whose parents and/or teachers reported no concerns about language development. Seventy children were excluded because they were reported to have autism, a neurological diagnosis, genetic disorders, intellectual disability (<60), severe cerebral palsy, a hearing loss, or blindness. Table 1 summarizes demographics and other characteristics of this sample.
Table 1 here

Procedures
All children received the DREAM test, administered via a tablet, with a standardized narration provided by a female Mandarin speaker who works in a professional capacity on a children's radio program. Each child heard pre-recorded questions while viewing pictures, and responded by touching the screen in the comprehension part and by giving a verbal response in the expressive part. When the child gave a verbal answer, the test administrator recorded responses by touching the corresponding word, picture, or phrase choice buttons on the tablet screen. Test administration averaged about 45 minutes and took place in a quiet place in the child's school or preschool. Five examiners received a full-day training from a bilingual speech-language pathologist certified by the American Speech-Language-Hearing Association, and a two-day practicum, until each examiner demonstrated competency in administering all tests used in this study. All children were tested by the trained examiners under the supervision of two speech-language pathologists certified by the American Speech-Language-Hearing Association.
In the absence of an existing language test in China, elicited language samples were chosen as the accepted standard for convergent validity. A reasonable body of literature provided guidance as to what to expect at different ages in Mandarin language development, and the assessment research team made use of these resources (Cheng, 1988; Lee, 1982, 1986, 1992; Li et al., 2010; Lin, 1986; Liu, 2009; Miao, 1986; Zhou, 2002). Convergent validity was investigated using language samples collected in the same test session. Given the wide age band, it was necessary to use different means of language sample collection for two broad age groups. For younger children aged 2;6 to 4;5, a spontaneous language sample was collected from a play session designed to elicit varied kinds of talk with the examiner (see Appendix B). Children aged 4;6 to 7;11 received three wordless pictured narratives to describe (the Mandarin Expressive Narrative Test, MENT), targeting aspects of grammar and semantics likely to be discriminating at this later age, to gain a broader perspective on language skills (see Appendix C for details). Very little relevant work has been published on Mandarin narratives in mainland China (Zhou & Zhang, 2010), though there are some small-scale studies in Taiwan (Chang, 2004). Once these narratives were recorded, a group of linguistically trained researchers listened to the samples and coded each utterance along a series of dimensions. In each case, specific criteria were used to score the child's language along a 0-3 point scale. There were 16 overall indices formed by the five scales for each of the three stories plus a composite measure of adequacy of answers to the questions following the stories.
Cronbach's alpha was 0.82, suggesting that together these form a good scale. An overall narrative score was then derived by summing these together.
Several other measures were collected on the same children with the purpose of providing more information about their nonverbal intellectual capacities such as spatial reasoning (PTONI), executive function (Day-Night Stroop), and short-term auditory memory (Forward digit span). The details of these are provided in Appendix D.

Sensitivity and Specificity
In order to estimate sensitivity and specificity, one needs a true or gold-standard classification status of typical or atypical for each child. Given the early state of speech-language diagnostics in China, no such true classification was available in the current study.
Instead, a priori judgment status was used. Sensitivity represents the probability that a child who is judged to be atypically developing will receive a score below the DREAM test's at-risk cut-score. Specificity represents the probability that a child who is judged to be typically developing will receive a DREAM score at or above the at-risk cut-score. Sensitivity and specificity values will vary depending on the cut-score selected.
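The two definitions above translate into a simple tally of agreements and disagreements between the a priori judgment and the test result. The sketch below is illustrative only; the function name and the six-child toy data are hypothetical and do not reproduce the study's counts.

```python
def sensitivity_specificity(judged_atypical, below_cut):
    """Both arguments are equal-length lists of booleans, one entry per child."""
    tp = fn = tn = fp = 0
    for atypical, flagged in zip(judged_atypical, below_cut):
        if atypical and flagged:
            tp += 1  # judged atypical and scored below the cut-score
        elif atypical:
            fn += 1  # judged atypical but not flagged by the test
        elif flagged:
            fp += 1  # judged typical but flagged by the test
        else:
            tn += 1  # judged typical and not flagged
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data for six children (not study data).
judged = [True, True, True, False, False, False]
flagged = [True, True, False, False, False, True]
sens, spec = sensitivity_specificity(judged, flagged)
```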
To improve the usability of DREAM to identify at-risk students, a simple decision rule was sought. Instead of implementing a different cut-score for each DREAM scale, a rule was identified where a single cut-score was applied to all the scale scores simultaneously. For a single cut-score, an optimal balance of sensitivity and specificity was determined to occur when a child received a scale score of <80 on any one of the five DREAM test scales. This represents approximately 1.33 SDs below the mean, which is consistent with other at-risk cut-points in the literature.
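The decision rule can be sketched as follows. The scale names are placeholders, and the standard-score metric (mean 100, SD 15) is an assumption inferred from the statement that 80 lies about 1.33 SDs below the mean; the percentile conversion further assumes normally distributed scores.

```python
from statistics import NormalDist

CUT_SCORE = 80  # at-risk cut-score reported in the text


def at_risk(scale_scores, cut=CUT_SCORE):
    """scale_scores: dict mapping each available scale name to its score.

    A child is flagged if ANY available scale score falls below the cut.
    """
    return any(score < cut for score in scale_scores.values())


def z_and_percentile(score, mean=100.0, sd=15.0):
    """Convert a standard score to a z-score and a normal-curve percentile.

    mean=100 and sd=15 are assumed values, not taken from the DREAM manual.
    """
    z = (score - mean) / sd
    return z, NormalDist().cdf(z) * 100


# Hypothetical child with only two scales completed.
child = {"Syntax": 78, "Semantics": 95}
flag = at_risk(child)
z, pct = z_and_percentile(CUT_SCORE)
```

Because the rule operates on whichever scales are available, it can still be applied when a child does not complete the full administration.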
The approach was intentionally designed not to rely on the Total Scale score alone, as this is not available if the child does not complete the entire test administration. Furthermore, unusual difficulty on one of the scales alone could warrant a classification of at-risk. Table 2 shows the findings for various cut-scores applied to the set of DREAM scale scores available to the practitioner.
Table 2 here
A second analysis tightened the criteria for atypical language development to improve the a priori classification. In addition to being referred for likely language problems by the pediatrician based solely on parent interview, the child also had to exhibit poor performance (a z-score more than 1.25 SDs below the mean) on the language sample measures, whether that was the play session or the narrative. To count as typically developing, the child had to score above that level and also not have an a priori judgment of disorder.
One hundred and eight children satisfied these dual criteria as typical or atypical, and a further statistical analysis was conducted to look at the sensitivity and specificity of their DREAM scores. As might be expected, the extra refinement of a priori classification improved sensitivity dramatically, to 95%. The DREAM test missed very few children who were classified as potentially language impaired and had poor language sample measures. Specificity did not improve, staying at 82%.

In this study, 70% of the a priori atypical children were males and 30% females.
However, there would be no reason to expect gender differences in the DREAM scores for children within the a priori atypical group. In fact, there were no statistically significant differences in DREAM total scaled scores attributable to gender within the a priori atypical group.

Evidence for External Validity
The study incorporated several additional measures discussed in Appendix D, to explore the construct validity of the DREAM scales. Correlations of DREAM scales with external measures are reported based on the different measures administered for two age ranges.
For ages 2;6-4;5, correlations among the DREAM, PTONI, Digit Span, Executive Function, and Spontaneous Language are provided in Table 3. The following Spontaneous Language measures were computed: a Grammar score based on complexity of sentences, a Vocabulary score based on variety and types of words, and a Morpheme score based on the range of likely grammatical morphemes observed. These were combined into a Total Spontaneous Language score by averaging the z-scores of the three measures.
Table 3 here
For ages 4;6-7;11, correlations among the DREAM, PTONI, Digit Span, Executive Function, and Narrative are provided in Table 4. In this particular sample, the correlations among the DREAM scales are very high. The correlations with PTONI are in the moderate range, indicating discriminant validity between DREAM (a general language measure) and PTONI (a cognitive measure specific to visual-spatial reasoning).
Table 4 here

The intercorrelation matrix among the narrative indices was examined, this time taking the totals across the three stories for each index a-e (see Table 5). Results were mixed, with the most highly intercorrelated index being reference specification and the least effective index being descriptions of characters' desires. However, different indices may contribute useful information at different points over this broad age span, as found in other work on narrative. The individual scores were converted to z-scores by age band. Then a total narrative z-score was composed of the average of these component z-scores, to give them equal weight.
Table 5 here
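The equal-weight composite described above can be sketched in a few lines. This is an illustration under stated assumptions, not the scoring code used in the study: the function names and index labels are hypothetical, z-scores are computed with the sample standard deviation, and the input is assumed to contain only children from a single age band.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize raw scores within one age band."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def composite_z(index_scores):
    """index_scores: dict mapping index name -> list of raw scores.

    Every list covers the same children, in the same order, from ONE age band.
    Returns one composite per child: the mean of that child's z-scores,
    so each index carries equal weight regardless of its raw-score range.
    """
    z_by_index = [z_scores(vals) for vals in index_scores.values()]
    return [mean(child) for child in zip(*z_by_index)]


# Hypothetical raw totals for three children on two narrative indices.
band = {"reference": [1, 2, 3], "desires": [10, 20, 30]}
totals = composite_z(band)
```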

A Priori Judgment Status
This paper has discussed two approaches for classifying children as at-risk: a priori judgment of disorder without comprehensive language assessments, and any scale score <80. In the absence of a standard for atypical status, this section explores how well each approach relates to other skills such as Narrative production and Spontaneous language. Table 6 provides evidence that atypical classification defined by the DREAM cut-scores is more highly related to Narrative production and Spontaneous language measures than a classification based on a priori judgment. All of the DREAM atypical status correlations are significantly higher statistically (Lee & Preacher, 2013) than those for a priori atypical status (at p<.05 or lower). Note that being judged a priori typical or atypical is not a predictor of children's Narrative performance as the correlation is not statistically significant (p>.05).
Table 6 here

Discussion
The predictive validity of the new DREAM test was assessed in several ways. First, its sensitivity and specificity were evaluated against an a priori judgment status. In the absence of other comprehensive language assessments, it is unlikely that the children judged as atypically developing all meet the definition of pure language impairment. Despite that qualification, a respectable level of sensitivity and specificity was achieved if any DREAM component standard score was set at 80, or approximately the 9th percentile. This score is within the range of expected level of language impairments estimated to exist worldwide (Leonard, 2014;Tomblin et al., 1997). Sensitivity was much higher (0.95) if an extra criterion was added to a priori status, namely, whether the children also fell into the normal range or below -1.25 SDs on the indices of spontaneous language or narratives.
Specificity was still only moderate, however. This would be expected if the new test were to measure properties of language about which non-linguist professionals would be unaware, such as quantifier scope or verb complement structures. Neither would these properties necessarily be picked up in spontaneous language or even narratives, which are biased towards lexical items and structures that a child can use with confidence. As a result of both factors, a well-designed and demanding linguistic test is likely to pick out more children with subtle difficulties, categorizing fewer children as "typical". Therefore, the specificity against the a priori judgment is moderate. In a case such as this with no alternative gold standard, relatively lower specificity does not necessarily mean that the test is not doing its job.
Secondly, the validity was assessed by comparing the standardized test results to language samples, argued to be the best alternative in the absence of another gold standard language test. For the younger cohort, this was elicited in various ways in a play session that encouraged a variety of language forms and uses. The results showed excellent correlations with the DREAM subscores and total score. In addition, the a priori status was more weakly associated with the spontaneous language measures than the DREAM scores, reflecting the fact that a priori judgment is not yet as refined as instruments that have been carefully designed to measure linguistic content and avoid test bias. For the older cohort, spontaneous language was considered unlikely to reveal subtle language properties. For that reason, narratives were elicited using wordless picture stories that the children were encouraged to tell. DREAM scores correlated with narrative scores in a way that a priori judgment failed to do. The implication is that some children may have subtler problems that are manifest under careful testing but are concealed in ordinary communication with family and in school. Other children may be mistaken as having language problems in this age range. Though the narrative measure (MENT) added useful information, the index still has more variance than is desirable and needs refinement. It served its purpose here as providing further validity for DREAM in this older age group where there is too much unconstrained variability in ordinary spontaneous language.
Other measures that tap general cognitive abilities, such as visual pattern making (PTONI), executive function (Day-Night Stroop), and short-term auditory memory (Digit Span), proved to be modestly related to the DREAM standard scores. For the younger group, these indices were less well correlated with DREAM than the spontaneous language measures were.
However, for the older group, the narrative and cognitive measures both correlated with DREAM to about the same degree. The difference needs further exploration, as there were possibly floor effects for the younger children on the cognitive measures. We could interpret the pattern to suggest that language development is increasingly intertwined with other cognitive skills as children master the fundamentals and begin to use language for wider purposes. Though language has a considerable link to and dependence on general intelligence, it is not fully reducible to a general cognitive skill. For example, sentence repetition was not only related to Digit Span, but also to syntax comprehension. Repeating a sentence correctly requires grammatical knowledge, not just short-term memory for sounds. However, a child with weak auditory memory may show language difficulties as a result. Other work has shown that children with language impairment often have difficulties in executive function skills (Henry, Messer, & Nash, 2012;Sabbagh, Xu, Carlson, Moses, & Lee, 2006), but the direction of effect is not clear.
As children acquire language they begin to use it as a tool for control of memory, rehearsal, and planning, but it is undoubtedly true that language learning itself requires auditory memory and controlled attention. The general implication is that each of these tasks provides useful information about a child's functioning, but the language test gives information of a specific sort relevant for language-based therapy, as it reveals the child's state of linguistic competency.

Limitations
Developing a standardized, norm-referenced assessment for mainland China does not end with the demonstration that the test meets appropriate standards, and at the present time the sample size is still small relative to the population of China. It will be necessary to expand the range of children who take this and other assessments to investigate whether such tests can play a satisfactory role in all of the diverse circumstances that affect children in need of language intervention.
In addition, the issue of bilingualism and bi-dialectalism needs to be directly addressed in future work. The current norming took careful account of regional and dialectal influences in the areas studied, and found relatively little to adjust. Nevertheless, the concern is that, especially below age 3, there may be children who are just beginning to be exposed to Mandarin. It is therefore unwise to compare their skills to those of children who have been native Mandarin speakers from the first year. Even in this age range the spontaneous language was remarkably confirmatory of the child's level of attainment, but both may underestimate the language skills of the youngest bilingual children. Interesting work is underway in the United States (Iglesias, 2015; Peña, Gutierrez-Clellen, Iglesias, Goldstein, & Bedore, 2014) and Europe (Armon-Lotem, 2012) to derive the best practice for evaluating bilingual children for speech and language disorders, and more work is needed in China on this front.
On a final note, the assessment of children for speech and language disorders cannot happen in a vacuum: the educational and pediatric care systems must be prepared for the consequences of such identification by training therapists and by designing and testing appropriate interventions and reevaluations (Rogers et al., 2012). China's progress has been rapid in this regard, and it will be vitally important to match the preparation of the therapists to the sophistication of the instruments, particularly with respect to knowledge about language acquisition and linguistics.
Table notes: (a) PTONI was administered only to ages 3;0 and above. (*) Correlation is significant at the 0.05 level (2-tailed).

Point to where the cat is.

Appendix B: Language Samples in Spontaneous Play
For the spontaneous language sample, the researchers used a variety of toys and pictures with the child to elicit language, including descriptions, and not just naming. For instance, the child was shown pictures of illogical or unusual situations, such as a boy riding a tricycle with a square-shaped front wheel, and was asked whether the boy could pedal forward and why or why not. The session was also arranged so that certain things went wrong; for example, the activity of coloring a picture would be thwarted by the lack of the crayon that the child was instructed to use. This provided appropriate functional opportunities for requests, questions, or negations. This strategy has worked well in other tests for young children (Peña, Gutierrez-Clellen, Iglesias, Goldstein, & Bedore, 2014).
Once the 15-minute language samples were recorded, they were played back to a group of linguistically skilled researchers who listened for certain specific properties in the language sample, covering word use, grammatical complexity, and morphology. The diversity of vocabulary was assessed on a five-point scale derived in part from previous work on developmental milestones (Hao, Shu, Xing, & Li, 2008). The grammatical complexity was also assessed on a five-point scale, derived by consideration of the complexity of the clause types used, much as in work on the IPSYN in English. Based on previous language acquisition studies of the emergence of grammatical morphemes of aspect and classifiers (Zong, 2011), the morphemes heard in the transcript were checked off and the variety of morphemes used was then totaled. Each checklist was designed to represent simpler or earlier forms and then increasingly complex forms, based on empirical knowledge about Mandarin use by young children. An overall spontaneous language score was derived by adding together the points from these different aspects.
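The additive scoring just described can be illustrated with a minimal sketch. It assumes that each five-point scale runs from 1 to 5 and that the morphology component is simply the count of distinct morphemes checked off; the text specifies neither detail, and the class name, field names, and example morphemes below are hypothetical.

```python
from dataclasses import dataclass, field

# Assumption: the five-point scales run from 1 to 5 (the text gives no bounds).
SCALE_MIN, SCALE_MAX = 1, 5


@dataclass
class SpontaneousSample:
    """Ratings for one child's language sample (hypothetical structure)."""
    vocabulary_diversity: int        # rating on the five-point vocabulary scale
    grammatical_complexity: int      # rating on the five-point grammar scale
    morphemes_heard: set = field(default_factory=set)  # distinct morphemes checked off


def overall_spontaneous_score(sample: SpontaneousSample) -> int:
    """Add the points from the three aspects, as the text describes."""
    for name in ("vocabulary_diversity", "grammatical_complexity"):
        rating = getattr(sample, name)
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"{name} must be between {SCALE_MIN} and {SCALE_MAX}")
    # Variety of morphemes = number of distinct morphemes heard in the sample.
    morpheme_variety = len(sample.morphemes_heard)
    return sample.vocabulary_diversity + sample.grammatical_complexity + morpheme_variety


# Illustrative values only, not data from the study.
example = SpontaneousSample(3, 4, {"了", "过", "个"})
print(overall_spontaneous_score(example))
```

Because the morpheme count is unbounded while the two ratings are capped, the three components do not carry equal weight in this sketch; whether the original checklist capped the morphology total is not stated.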

Appendix C: Narratives
Narratives were recorded and analyzed for the presence of properties typically developed between the ages of 4 and 8 years, based on cross-linguistic work. The measures focus on markers of temporal cohesion, reference specificity to distinguish characters for the listener, and the landscape of consciousness or mental state references about the characters. As in the Diagnostic Evaluation of Language Variation test, children saw three short picture stories, one picture at a time, and were then asked to start at the beginning of the sequence and describe what happened. One sequence represented a classic theory-of-mind scenario (Leslie, 1987; Wellman, Cross, & Watson, 2001), in which a character sees something placed in one location, leaves the scene, and then another character moves the object to a new location, out of sight. The first character returns, with a thought balloon to indicate that the character now wants the object. The second and third stories added more and different elements, including unintended mistakes and small dramas of deception. These were also designed to allow children to describe events either at a purely action level, or at the level of characters' desires, motives, beliefs, and emotions. The transition from one form of storytelling to the other is a major development in this age range, 4;6-7;11 (Burns, de Villiers, Pearson, & Champion, 2012). After each story, the researcher asked the child some questions designed to promote such causal explanations, such as "Why is he looking there?" or "Why couldn't the boy climb down the tree?", to further elicit complex elements such as desires and emotions.
The dimensions coded included higher levels of grammatical complexity appropriate for this age range, including mental verbs plus complements and complex use of sentence connectives such as "while" and "before". A level of overall quality was also marked, having to do with the adequacy of a single picture description. In addition, indices of how well the child specified referents and whether the child used sophisticated time reference were scored (Burns et al., 2012). The sophistication of references to emotion, desire, and mental states was also coded, as an index of the child's ability to employ Theory of Mind to describe the characters. Finally, one or two questions requiring a causal explanation were asked after each story.
The non-linguistic tests needed to be culturally neutral and to provide a snapshot of the child's intellectual abilities of a different kind than linguistic testing reveals. The tests chosen included the Day-Night Stroop executive function test (Gerstadt, Hong, & Diamond, 1994), a version of the color Stroop test (Golden, 1978) adapted for young children.
A visual-spatial intelligence test, the Primary Test of Nonverbal Intelligence (PTONI) (Ehrler & McGhee, 2008), was used as an index of nonverbal intelligence. The PTONI is especially appropriate for testing children who have underdeveloped verbal and/or motor skills.
The publishers claim a certain cultural neutrality, in that PTONI directions are provided in eight alternative languages (including Mandarin), making it an appropriate assessment of intelligence for children from diverse language backgrounds, according to Ehrler and McGhee (2008). It was normed on a culturally and ethnically diverse demographic sample of 1,010 children from 38 states in the United States. The test format requires a child to look at a series of pictures and to point to the one picture that does not belong with the others. Items are arranged in order of difficulty. The PTONI provides standard scores, percentile ranks, and age equivalents. Though not normed in mainland China, PTONI is a non-verbal test and is purported to be largely free of cultural bias.
Finally, short-term auditory memory was measured using a forward digit span task (backward digit span has some advantages in tapping working memory but would have been unsuitable for children younger than five). This task was designed so that there were four items of each length, for example, four 4-digit strings, four 5-digit strings, and so on. Increasingly long digit strings were presented until the child failed to get 75% of the items at that length correct. The highest level at which the child achieved 3 or 4 items correct was taken as the child's digit span.
This procedure closely mimics the procedure used in standardized tests such as WISC-V (Wechsler, 2014) and the Differential Ability Scales (Elliott, 2006). It is also parallel to earlier research that examined digit span in China. That research found scores to be slightly higher than Western samples (Chen & Stevenson, 1988). Properties of the Mandarin digits were considered in creating a digit span test for Mandarin. All Mandarin digits are monosyllabic, but some of them rhyme, so we chose strings in which rhymes were not adjacent, and in which there were no difficult phonetic sequences. As in other digit span tests, we checked carefully to ensure that no number sequences were frequent idioms (as with 911 in English).
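The stopping and scoring rule for the digit span task can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' scoring software: the function name and input format are hypothetical, and returning a span of 0 when the child fails the shortest length is an assumption, since the text does not cover that case.

```python
def digit_span(correct_by_length, items_per_length=4, pass_fraction=0.75):
    """Return the longest string length passed before the first failed length.

    correct_by_length maps string length (number of digits) to the number of
    items repeated correctly at that length (hypothetical input format).
    A length is passed when at least 75% of its items (3 of 4) are correct.
    """
    span = 0  # assumption: 0 means the child did not pass even the shortest length
    for length in sorted(correct_by_length):
        n_correct = correct_by_length[length]
        if not 0 <= n_correct <= items_per_length:
            raise ValueError(f"invalid number correct at length {length}: {n_correct}")
        if n_correct / items_per_length >= pass_fraction:
            span = length  # 3 or 4 of 4 correct: this length counts
        else:
            break  # testing stops at the first length the child fails
    return span


# Illustrative record only: passes lengths 2 to 4, fails length 5.
record = {2: 4, 3: 4, 4: 3, 5: 1}
print(digit_span(record))
```

Any lengths listed after the first failed length are ignored here, which mirrors the description that presentation stops once the child falls below the 75% criterion.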