Introduction Corpus, Tools and Method Three analyses Reproducible Identification of Pragmatic Universalia in CHILDES Transcripts GNU meets OpenScience Daniel Devatman Hromada123 daniel@wizzion.com 1 Université Paris 8 / Lumières École Doctorale Cognition, Langage, Interaction Laboratoire Cognition Humaine et Artificielle 2 Slovak University of Technology Faculty of Electronic Engineering and Informatics Department of Robotics and Cybernetics 3 Universität der Künste Fakultät der Gestaltung, Berlin Conclusion Introduction Corpus, Tools and Method Table of Contents 1 Introduction Psycholinguistics Reproducibility Universalia 2 Corpus, Tools and Method 3 Three analyses 4 Conclusion Three analyses Conclusion Introduction Corpus, Tools and Method Three analyses Conclusion Developmental Psycholinguistics DP Is a science which uses experimental methods of developmental psychology in order to study acquisition, learning and development of linguistic structures and processes in human children. Multiple epistemological and methodological problems include: 1 child’s behaviour is often very instable 2 the very fact of being subjected to experiment impact child’s responses 3 the invasivity problem These problems do not exist when researcher decides to observe instead of experiment! Introduction Corpus, Tools and Method Three analyses Conclusion Reproducibility The Hallmark Principle Reproducibility ”Non-reproducible single occurrences are of no significance to science” (Popper, 1992) Experimentator-independent reproducibility can be attained iff: 1 all experimentators use the same dataset 2 use the same (or least very similiar) set of tools 3 the first experimentator faithfully protocols the usage of such tools 4 other experimentators follow the protocol 5 analysis is deterministic Introduction Corpus, Tools and Method Three analyses Conclusion Universalia Pragmatic and Ontogenetic Universalia Linguistic Universal A pattern that occurs systematically across natural languages. Most common lists of universals, like those of Greenberg (1963), concern syntax, morphology or semantics. Pragmatic Universal A L.U. related to pragmatic (extralinguistic context, deictics, etc.) facet of linguistic communication. Ontogenetic Universalia Introduce the temporal dimension (age). Introduction Corpus, Tools and Method Table of Contents 1 Introduction 2 Corpus, Tools and Method Corpus Tools Method 3 Three analyses 4 Conclusion Three analyses Conclusion Introduction Corpus, Tools and Method Three analyses Corpus CHILDES CHILDES Child Language Data Exchange System (MacWhinney&Snow, 1985) http://childes.psy.cmu.edu/data http://wizzion.com/CHILDES/ (mirror from 6th Feb 2016) 1 more than 50 years of tradition 2 cca 30000 transcripts 3 more than 1.5 GigaBytes of mostly textual data 4 at least 26 languages, dialects or language combinations 5 major terran language-groups (indo-european, ugro-finic, semitic, altaic, east-asian, south-asian) represented 6 Creative Commons BY-NC-SA licence Conclusion Introduction Corpus, Tools and Method Three analyses Conclusion Corpus CHAT format CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions. (MacWhinney, 2016; http://childes.talkbank.org/manuals/chat.pdf). @Begin @Languages: eng @Participants: CHI Eve Target_Child , MOT Sue Mother , FAT David Father @ID: eng|Brown|CHI|1;6.|female|||Target_Child||| @ID: eng|Brown|MOT|||||Mother||| @ID: eng|Brown|FAT|||||Father||| @ID: eng|Brown|RIC|||||Investigator||| @ID: eng|Brown|COL|||||Investigator||| @Date: 29-OCT-1962 *MOT: one two three four . %mor: det:num|one det:num|two det:num|three det:num|four . %act: tests tape recorder *CHI: one two three . [+ IMIT] Introduction Corpus, Tools and Method Three analyses Conclusion Tools GNU + PERL + R The idea is to perform the analysis with solely publicly-available open-source command-line tools. GPR combo GNU: grep, sort, uniq, sed, wc (runs in bash and connected through pipes) PERL: regular expressions are part of language syntax R: vectors, matrices, plotting First command wget -P CHILDES -e robots=off –no-parent –accept ’.cha’ -r http://wizzion.com/childes/CHILDES flat Introduction Corpus, Tools and Method Three analyses Method Pre-processing Populate filenames with age information mkdir aged; grep -P ’\|\d;\d’ *| grep Child | perl -n -e ’chomp; ‘cp $1 aged/$2-$3-$1‘ if /^(.*?):.*0?(\d+);0?(\d+)/;’ ; rm *.cha Remove noise perl -ni -e ’print if $_!~/^\*(MOT|CHI):\t(xxx|www) ?\./’ aged/* Extract Child and Motherese utterances mkdir CHI; cp aged/* CHI; sed -i ’/\*CHI/! d’ CHI/*; mkdir MOT; cp aged/* MOT; sed -i ’/\*MOT/! d’ MOT/*; Yields 5 833 656 CHI utterances contained in 29180 transcripts 3 798 005 MOT utterances contained in 13590 transcripts Conclusion Introduction Corpus, Tools and Method Three analyses Conclusion Method Metrics Main metrics: Probability PX that signifiant X shall occur in the utterance. PX = FX /Nutterances where FX is the absolute number of occurences of X in CHILDES section and the normalization factor Nutterances denotes the number of utterances of the CHILDES section. Probability values are mutually comparable. Introduction Corpus, Tools and Method Table of Contents 1 Introduction 2 Corpus, Tools and Method 3 Three analyses 1st analysis: Laughing 2nd analysis: Second Person Singular 3rd analysis: First Person Singular 4 Conclusion Three analyses Conclusion Introduction Corpus, Tools and Method Three analyses Conclusion 1st analysis: Laughing Laughing Objective Verify whether observed tendency (Hromada, 2016, Conceptual Foundations) of mothers to laugh less is in interaction with older toddlers is specific to English, or whether it is a culture-independent invariant. Both &=laughs and =!laughing tokens are used by diverse CHILDES transcribers, so we simply use for occurences of laugh token. grep laugh MOT/*French*|grep -o -P ’\-French\-.+\-’| sort|uniq -c;grep laugh MOT/*Farsi*|grep -o -P ’\-Farsi\-.+\-’| sort|uniq -c;grep laugh MOT/*Japanese*|grep -o -P ’\-Japanese\-.+\-’ |sort|uniq -c;grep laugh MOT/*Chinese* |grep -o -P ’\-Chinese\-.+\-’ | sort | uniq -c ; wc -l MOT/*Eng*|perl -e ’while (<>){s/MOT\///;/(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}’ >MOT.Eng.N Introduction 1st analysis: Laughing Plot Corpus, Tools and Method Three analyses Conclusion Introduction Corpus, Tools and Method Three analyses Conclusion 1st analysis: Laughing Some observations For english, french and farsi children: marked decrease of maternal laughing between first and third year of age (english, french, farsi) little children laugh more often than their mothers but older children laugh less frequently than their mothers significant correlations between MOT and CHI in English (Pearson’s cor.coeff 0.933, p = 7.886e-05) and in Farsi (corr. coef. 0.972, p-value=0.02735). Almost significant in French (p=0.053, cor. coef = 0.947) In regards to laughing, Indo-European mothers and children seem to follow different ontogenetic trajectories than their Japanese and Chinese counterparts ⇒ no culture-independent Universal ? Introduction Corpus, Tools and Method Three analyses Conclusion 2nd analysis: Second Person Singular 2nd Person. Sg. Pronouns Language-specific CHILDES sub-corpora are matched by following Perl-Compatible regular expressions (PCREs): The absolute frequency FX of cases when PCREX matched is assessed as usually: grep -i -P "[\t ]you[’ ]" MOT/*Eng*| perl -n -e ’/MOT\/(\d+)-(\d+)/; print "$1 $2\n"’ |uniq -c >exp2.MOT.Eng.F Subsequently, FX /Nutterances division and plotting are realized in R. (c.f. http://wizzion.com/code/jadt2016/childes.R for the trivial R-code snippet) Introduction Corpus, Tools and Method 2nd analysis: Second Person Singular Plot Three analyses Conclusion Introduction Corpus, Tools and Method Three analyses Conclusion 2nd analysis: Second Person Singular Some observations One can observe, in English in motherese, ”you” is used in cca every fifth utterance significant correlation between CHI and MOT time series (Pearson’s cor. coeff. = 0.768, t = 3.393, df = 8, p-value = 0.009451; Kendall’s tau = 0.6, T = 36, p-value = 0.016671; Spearman’s rho = 0.733, S = 44, p-value = 0.02117) One can observe, in all languages Marked increase in maternal usage of 2nd. p. sg. between 1st and 4th year of age has been observed in case of all six studied languages (representing three distinct language groups). children use 2nd. p. sg. less often than mothers (only exception: Farsi between 2 and 3) ⇒ ontogenetic Universal ? Introduction Corpus, Tools and Method Three analyses Conclusion 3rd analysis: First Person Singular 1st Person. Sg. Pronouns Language-specific CHILDES sub-corpora are matched by following Perl-Compatible regular expressions (PCREs): The absolute frequency FX of cases when PCREX matched is assessed as usually: grep -i -P "[\t ]I[’ ]" MOT/*Eng*| perl -n -e ’/MOT\/(\d+)-(\d+)/; print "$1 $2\n"’ |uniq -c >exp3.MOT.Eng.F Subsequently, FX /Nutterances division and plotting are realized in R. (c.f. http://wizzion.com/code/jadt2016/childes.R for the trivial R-code snippet) Important: focus on ALL transcripts of a given language. Introduction Corpus, Tools and Method 3rd analysis: First Person Singular Plot Three analyses Conclusion Introduction Corpus, Tools and Method Three analyses Conclusion 3rd analysis: First Person Singular Some observations ALL: around 3 years of age, children tend to pronounce 1.p.sg much more frequently than their mothers ALL: steep decline between 6th and 7th year of age (offset of ”egocentric” stage?) ENGLISH: significant correlation between usage of mothers and children Significant intercultural correlations french and chinese children (p=0.02474) english and french children (p=0.002425) polish and hebrew children (p=0.048) polish and french children (p=0.048) ⇒ language-independent ontogenetic trajectory of usage of 1.p.sg? Introduction Corpus, Tools and Method Table of Contents 1 Introduction 2 Corpus, Tools and Method 3 Three analyses 4 Conclusion Three analyses Conclusion Introduction Corpus, Tools and Method Three analyses Conclusion Methodological conclusion Combination of command-line (no GUI!) open-source (for free!) fast * deterministic utils (grep, uniq, ...) and languages (PERL, R) yields a 100% reproducible methodology for very little cost. Experimental protocol automatically stored in .history (or .bash history) and .RHistory files: no need to reinvent the wheel! * 3rd analysis executed on one sole core of 3.2 Ghz PC with 8GB RAM and CHILDES data stored on a SSD disk was over in less than 15 seconds Introduction Corpus, Tools and Method Three analyses Conclusion Epistemological conclusion Developmental Psycholinguistics + Natural Language Processing + Big Data + OpenScience = la textometrie psycholinguistique Manifest: to perform state-of-the-art research without expensive tools and apparati to study ontogeny of soul and language in a non-invasive fashion to share all that can be shared Introduction Corpus, Tools and Method Psycholinguistic conclusion Piaget eut raison. Three analyses Conclusion Introduction Corpus, Tools and Method Three analyses Merci pour votre attention. Questions ? Conclusion