Evolutionary Modeling of the Ontogeny of Linguistic Structures
Rigorous Thesis Examination
Daniel D. Hromada (1,2)

1 Slovak University of Technology, Faculty of Electronic Engineering and Informatics, Department of Robotics and Cybernetics
2 Université Paris 8, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle

4-11-2014

doc. Ing. Ivan Sekaj, PhD.
prof. Ing. Vladimír Kvasnička, DrSc.

Administrative Position Development
- 2010: enrolled for a PhD at the École Doctorale Cognition, Langage, Interaction of University Paris 8
- 2011: double-PhD scholarship attributed by the French government; enrollment at STU as an external doctoral student
- 2012: inter-university convention signed by the Dean of FEI STU and the President of Paris 8; start of the scholarship
- 2013: summer semester in Paris; presentation by prof. Tijus in Bratislava
- 2014: summer semester in Paris; end of the scholarship

The thesis is to be written and defended in English.

An Interdisciplinary Enterprise

(Slide diagram: a concept map linking the central topic, Evolutionary Modeling of L-Structure Ontogeny, with the surrounding fields and notions: Natural Language Processing, Computational Linguistics, Formal Language Theory, Developmental Psycholinguistics, Evolutionary Computation, Universal Darwinism; part-of-speech induction, grammar induction, categorization, optimization, induction, generalization, production vs. perception, Language Acquisition Device, grammar systems, motherese and toddlerese, bootstrapping, populations, genetic algorithms, evolutionary strategies, fitness functions and landscapes, heredity, variation, selection, stochastic systems, adaptation, trial and error.)

Universal Darwinism

Definition
A general theoretical framework aiming to explain the emergence and optimization of diverse complex phenomena in terms of the interaction of three basic processes:
1. information variation
2. information selection
3. information replication

UD-consistent disciplines
- biology (Darwin 1859, Mendel 1866) and genetics (Morgan 1916, Watson & Crick 1953)
- sociobiology (Hamilton 1974, Wilson 1978) and evolutionary psychology (Cosmides & Tooby 1997)
- memetics (Dawkins 1976, Blackmore 2000)
- evolutionary epistemology
- neural darwinism
- evolutionary computation, artificial life, ...
- evolutionary linguistics

Evolutionary Epistemology

An ambiguous definition
Evolutionary Epistemology aims to explain the source, existence, nature, scope and diversity of forms of knowledge in evolutionary terms. Two possible interpretations:
1. biological evolution of cognitive and mental faculties in animals and humans
2. knowledge per se evolves by selection and variation

The second interpretation can be further analyzed:
1. knowledge can emerge by variation & selection of ideas shared by a group of mutually interacting individuals (Popper 1972)
2. knowledge can emerge by variation & selection of cognitive representations within one individual

Genetic Theories of Learning and Creativity
"Genetic" not in the contemporary (i.e. DNA-related) sense, but as related to "origins" (genesis) and "heredity" (genus).

Piaget's Genetic Epistemology
1. aims to explain how human cognitive systems (CS) develop from birth onwards
2. CS pass through a series of stages; every stage involves equilibration of cognitive schemas
3. schemas change through processes of assimilation and accommodation

Campbell-Simonton's Theory of Creativity
1. scientific discovery and creativity can be explained in terms of blind variation and selective retention (Campbell 1970)
2. "How do human beings create variations? One perfectly good Darwinian explanation would be that the variations themselves arise from a cognitive variation-selection process that occurs within the individual brain." (Simonton 1990)
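To make the variation-selection-replication triad concrete, the following minimal sketch (illustrative only, not taken from the thesis) implements a blind-variation-and-selective-retention loop over bitstrings; the target pattern, population size and mutation rate are arbitrary assumptions:

    import random

    # Minimal blind-variation-and-selective-retention (BVSR) loop.
    # All parameters (TARGET, POP, P_MUT) are illustrative assumptions.
    TARGET = [1] * 20              # an arbitrary "environmentally fit" pattern
    POP, P_MUT = 30, 0.05

    def fitness(ind):
        # selection criterion: agreement with the environment
        return sum(a == b for a, b in zip(ind, TARGET))

    population = [[random.randint(0, 1) for _ in TARGET] for _ in range(POP)]
    for generation in range(100):
        # selective retention + replication: fitter variants leave more copies
        population.sort(key=fitness, reverse=True)
        offspring = [p[:] for p in population[:POP // 2] for _ in (0, 1)]
        # blind variation: undirected bit-flips
        for ind in offspring:
            for i in range(len(ind)):
                if random.random() < P_MUT:
                    ind[i] ^= 1
        population = offspring

    print(max(fitness(i) for i in population))  # approaches len(TARGET)

Even this crude loop exhibits the three UD processes in isolation, which is the sense in which the disciplines listed above are "UD-consistent".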
Neural Darwinism
Edelman (1987) postulated that complex adaptations in the brain arise through a process similar to natural selection.

Another variant of ND is the theory of Changeux and Dehaene (1989): "the production and storage of mental representations, including their chaining into meaningful propositions and the development of reasoning, can also be interpreted, by analogy, in variation-selection (Darwinian) terms within psychological time-scales."

Fernando et al. (2012) propose two "toy models ... of a means by which a higher-order unit of neuronal evolution above the synaptic level may be able to replicate."

Figure: reproduced from Fernando et al. (2012).

Evolutionary Computation

Definition
"Evolutionary computation uses computational models of evolutionary processes as key elements in the design and implementation of computer-based problem solving systems" (Spears et al., 1993)

- genetic algorithms (cf. next slide)
- evolutionary programming (stronger genotype-phenotype distinction, FSAs, little recombination)
- evolutionary strategies (more recombination, self-adaptation, other nature-inspired approaches)
- genetic programming (searches not for solutions but for programs)

Grammatical evolution
A variant of genetic programming which uses evolutionary search to discover the specific sequences of application of production rules which generate program code yielding the desired solutions.

- swarm intelligence (Kennedy & Eberhart, 2001)
- artificial life (no exogenous fitness function: Tierra, AVIDA, etc.)

Genetic Algorithms

Canonic GA (Holland, 1975)
- Encoding: binary vector
- Initial population: randomly generated
- Selection: fitness-proportionate, $p_i = f_i / \sum_{j=1}^{N} f_j$
- Crossover: one-point
- Mutation: bit-flip with probability p (0.001)

    randomly initialize population
    evaluate; select
    repeat
        crossover; mutate
        evaluate; select
    until stop

Schema theorem
A schema is a subset of strings with similarities at certain positions. The schema theorem states that short, low-order (i.e. with few fixed positions) schemata with above-average fitness increase exponentially in successive generations:

$E[m(H, t+1)] \geq \frac{m(H, t)\, f(H)}{a_t}\, [1 - p]$

where m(H,t) is the number of strings belonging to schema H at generation t, f(H) is the observed average fitness of schema H, a_t is the observed average fitness at generation t, and p is the probability that crossover or mutation will disrupt H.

Convergence to the global optimum
Rudolph (1994) has proven that CGAs are certain to converge to the global optimum only if they "keep track of the best solution found over time" (i.e. involve a form of elitism).
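A runnable sketch of the canonic GA just described, with the OneMax fitness function as an illustrative stand-in (string length, population size and generation count are likewise assumptions); it combines fitness-proportionate selection, one-point crossover, bit-flip mutation and a single elite survivor motivated by Rudolph's (1994) result:

    import random

    # Canonic GA sketch (Holland 1975). OneMax fitness, L, N and GENS are
    # illustrative assumptions, not the setup used in the thesis.
    L, N, P_MUT, GENS = 32, 100, 0.001, 200

    def fitness(ind):
        return sum(ind)  # OneMax: count of 1-bits

    def select(pop):
        # fitness-proportionate selection: p_i = f_i / sum_j f_j
        return random.choices(pop, weights=[fitness(i) for i in pop], k=1)[0]

    pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(N)]
    for gen in range(GENS):
        elite = max(pop, key=fitness)              # elitism (cf. Rudolph 1994)
        nxt = [elite[:]]
        while len(nxt) < N:
            p1, p2 = select(pop), select(pop)
            cut = random.randrange(1, L)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ (random.random() < P_MUT) for b in child]  # bit-flip
            nxt.append(child)
        pop = nxt

    print(fitness(max(pop, key=fitness)))  # typically reaches or nears L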
Evolutionary Language Game (Nowak et al., 1999)
Let's have a population of N agents. Each agent is described by an r × c associative matrix A. A's entry a_ij specifies how often the individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j.

From this associative matrix A one can derive:
- the active "speaker" matrix S, by normalizing A's rows: $s_{ij} = a_{ij} / \sum_{n=1}^{c} a_{in}$
- the passive "hearer" matrix H, by normalizing A's columns: $h_{ij} = a_{ij} / \sum_{n=1}^{r} a_{nj}$

Subsequently, we can imagine two individuals A and A', the first having the language L(S, H), the second having the language L'(S', H'). The payoff related to the communication of the two individuals is calculated as follows:

$F(A, A') = \sum_{i=1}^{r} \sum_{j=1}^{c} s_{ij} h'_{ji} = \mathrm{Tr}(S H')$

And the fitness of the individual A with regard to all other members of the population P can be obtained as follows:

$f(A) = \frac{1}{|P| - 1} \sum_{A' \in P,\; A' \neq A} F(A, A')$

By implementing EC, these fitness values can subsequently direct the evolution of the population toward states where the individual matrices are more optimally "aligned". In ELG, this alignment represents the situation when hearer and speaker mutually understand each other, i.e. the speaker has encoded meaning M by sound S and the hearer has subsequently decoded sound S as meaning M.

Evolutionary Language Game #2
- parent-child information transfer is modelled by a matrix-sampling procedure
- the parameter k specifies the number of repetitions during the matrix sampling
- all experiments used N=100 agents with memomes of size 5×5 (i.e. their associative matrices could encode at most 5 "sounds" and 5 "meanings")
- convergence to the globally optimal state is assured only if matrix sampling involves a small but nonzero amount of noise!
- a beautiful model of how "language" is sure to arise ex nihilo in communities wherein information transfer between individuals exists
- Nowak et al. (1999) and Kvasnička & Pospíchal (2007) use it to illuminate the emergence of language in the phylogeny of the species Homo sapiens sapiens; but couldn't an analogous approach be used to model the transfer from mother to child in ontogeny?
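A small numpy sketch of the ELG quantities above, assuming square 5×5 associative matrices as in the experiments (the random population and the tiny offset avoiding division by zero are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)

    def speaker(A):
        # row-normalization: s_ij = a_ij / sum_n a_in
        return A / A.sum(axis=1, keepdims=True)

    def hearer(A):
        # column-normalization: h_ij = a_ij / sum_n a_nj
        return A / A.sum(axis=0, keepdims=True)

    def payoff(A, A2):
        # F(A, A') = sum_ij s_ij * h'_ji = Tr(S H')
        return np.trace(speaker(A) @ hearer(A2))

    # illustrative population of 5x5 "memomes"
    P = [rng.random((5, 5)) + 1e-9 for _ in range(100)]
    f = [sum(payoff(A, B) for B in P if B is not A) / (len(P) - 1) for A in P]
    print(round(float(np.mean(f)), 3))

In an EC run these f-values would serve as the fitness driving the population toward mutually "aligned" matrices.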
Evolutionary Linguistics

Definition
The scientific study of the origins and development of language, as well as of the cultural evolution of languages.

- Schleicher's (1853) language-tree theory (Stammbaumtheorie)
- lack of fossil records, difficult to verify empirically; banned by the Société de Linguistique de Paris in 1866
- revived at the end of the 20th century (cf. Pinker & Bloom, 2011)
- quantitative comparative linguistics, phylogenetic trees...
- focuses on phylogeny and not on ontogeny

Why EL should focus on ontogeny
"We are not very well informed about the psychology of Neanderthal man or about the psychology of Homo siniensis of Teilhard de Chardin. Since this field of biogenesis is not available to us, we shall do as biologists do and turn to ontogenesis. Nothing could be more accessible to study than the ontogenesis of these notions. There are children all around us." (Piaget, 1975)

Formal Language Theory
- An alphabet A is a finite, nonempty set of symbols.
- A word (or string) over an alphabet A is a finite sequence of symbols from A.
- A* is the set of all words over A.
- A language L over A is a subset of A*.
- A grammar G is a quadruple G = (N, T, P, S), where N is the nonterminal alphabet, T is the terminal alphabet, S ∈ N is the axiom and P is the set of rewriting (production, substitution) rules, written as x → y.

Grammars are called:
- REGULAR when all rules in P have the form A → α or A → αB, with α ∈ T and A, B ∈ N
- CONTEXT-FREE when all rules have the form X → x, where X ∈ N and x ∈ A*_G (A_G being the total alphabet N ∪ T)
- CONTEXT-SENSITIVE when P contains only rules of the form x1 X x2 → x1 w x2, with x1, x2, w being strings over A_G and X ∈ N

The language L GENERATED by grammar G is the set of all sequences of terminals which can be derived from the axiom S by recursive application of the rules in P. A language L can be PARSED by grammar G if, for every sequence s ∈ L, there exists at least one sequence of applications of production rules which, when applied in inverse fashion (i.e. substituting the left side of a production rule for its right side), ends at the axiom S.

Grammar Systems

Introduction
A Grammar System is a set of grammars working together, according to a specified protocol, to generate a language.
- a syntactic theory of multi-agent, distributed and parallel systems
- multiple independent grammars share their productions in a "string environment" (analogous to AI "blackboard" approaches)
- the environment can change on its own (so-called "eco-grammar" systems) or not (language colonies)

Figure: Reproduced from Kelemen's (2004) article "Miracles, colonies, and emergence".

Natural Language Processing
- uses computers to process human languages
- implements AI, data-mining, information-retrieval and machine-learning methods (both supervised and unsupervised)
- the first and ultimate NLP challenge was posed by Turing (1950)
- other problems: anaphora resolution, automatic summarization, discourse analysis, machine translation, morphological segmentation, named-entity recognition, natural language understanding, POS induction and tagging, parsing, question answering, sentiment analysis, speech recognition, word-sense disambiguation, etc.
- in NLP, statistics often plays a more important role than FLT
- in NLP, methods based on artificial neural networks, Naive Bayes and SVMs are predominant; EC is much less used

POS Induction and Grammar Induction

Part-of-speech induction
The goal is to group the tokens present in a pure-text corpus C into clusters corresponding to the diverse parts of speech (nouns, verbs, adjectives, etc.).

Grammar induction
The goal is to infer, from a pure-text corpus C, a grammar G which could have generated the corpus C.

POS-i and GI problems are strongly intertwined: clusters discovered by POS-i can be denoted by non-terminal symbols. A toy example:

C: John loves Mary. Mary hates John. Mary sleeps. John weeps.
ideal grammar: N → John | Mary ; V → love | hate | sleep | weep ; S → N Vs | N Vs N (Vs = V + third-person "s")
least general grammar: S → C (a single rule rewriting S as the whole corpus)
most general grammar: S → A*
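A minimal sketch (illustrative, not the thesis implementation) showing that the "ideal grammar" above both covers the toy corpus and over-generates beyond it, by exhaustively expanding S → N Vs | N Vs N over the induced clusters:

    from itertools import product

    # Toy version of the "ideal grammar" induced from the corpus above.
    # The encoding (plain lists and string templates) is an illustrative choice.
    N = ["John", "Mary"]
    V = ["love", "hate", "sleep", "weep"]   # "Vs" = verb + "s"

    def generate():
        # S -> N Vs  |  S -> N Vs N
        for n, v in product(N, V):
            yield f"{n} {v}s."
        for n1, v, n2 in product(N, V, N):
            yield f"{n1} {v}s {n2}."

    corpus = {"John loves Mary.", "Mary hates John.",
              "Mary sleeps.", "John weeps."}
    generated = set(generate())
    print(corpus <= generated)        # True: the grammar covers C
    print(len(generated - corpus))    # 20 sentences of over-generation

The over-generated sentences ("Mary weeps John." etc.) illustrate why GI fitness must also penalize utterances not accepted by the environment.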
Learning of Semantic Categories

How can machines work with semantic categories?
Semantic categories (i.e. concepts) can be characterized:
- by extensive (listing the instances) or ostensive (pointing the finger) definition
- in terms of sufficient and necessary features
- as (convex) subspaces of an N-dimensional semantic feature space (Gärdenfors 2004)
- as prototypes (points) within such spaces

Principle(s) behind the construction of semantic spaces
- in neuroscience: "neurons that fire together, wire together" (Hebb 1964)
- in linguistics: "a word is characterized by the company it keeps" (Harris 1954)
- in philosophy: "the meaning of a word is its use in the language" (Wittgenstein 1953)

Conjecture
The development of vocabulary in human children is a variant of the multi-class classification problem and as such can be simulated by an algorithm creating and partitioning semantic feature vector spaces.

Developmental Psycholinguistics
Developmental Psycholinguistics (DP) is a scientific discipline studying the changes occurring in the human faculty of understanding and production of natural languages. As such, it is closely related to developmental psychology (a sub-field of psychology) and developmental linguistics (a sub-field of linguistics).

Language Development (DEF)
Language development (LD) - or the ontogeny of natural language L in a human individual H - is a constructivist process gradually transforming L into an ever more optimized communication channel facilitating the exchange of information between H and her social surroundings.

- language is social and pragmatic (it allows children to manipulate the objective world)
- comprehension precedes production: C-representations offer preliminary targets for P-productions
- the physiological predispositions of language are innate but useless without triggering epigenetic stimuli
- children are not "ideal learners" (in the sense of Gold's theorem)
- brains simultaneously encode multiple language registers and grammars

Motherese
- parents modify their language in order to make themselves understood
- higher pitch (267 Hz in comparison to 198 Hz), slower tempo, greater rhythmicity, longer pauses between utterances
- "much of the speech addressed to babies consists of short, routine, repetitive utterances produced with great consistency and frequency in the same contexts, day after day" (Clark 2003)
- repetitions are three times more frequent in speech to two-year-olds than in speech to ten-year-olds

Figure: Reproduced from Trevarthen (1993).

Toddlerese
- a baby's expressive faculties start with a 1-bit communication channel (need soothing / don't need soothing)
- they get more subtle and fine-grained with time: more information transmitted with less signal
- babbling starts at ca. 8 months of age, first as repetition of the same syllable (mamamama, babababa); later the syllables start to vary within the sequence (babadadabebe)
- around 1 year: consistent vocalizations in specific contexts (protowords)
- children tend to be quite accurate in their first productions, but later versions of the same words appear to be further from adult targets
- "continuous exploration, experimentation, practice and intense involvement with linguistic structure" (Labov, 1978)
- upon closer inspection, LD reveals a constantly changing series of small experiments in which the child progressively scrutinizes and tries out different options (Clark 2003)
- a lot of variability in children's word forms (Ferguson and Farwell 1975)
- first grammars form around "pivot" words, e.g. "mama auch, tato auch, nana auch, baba auch" (S → N auch ; N → mama | tato | nana | baba)
- toddlerese: 10-30 months

Quantitative Laws of Language Acquisition

Piotrowski law
In linguistic phylogeny as well as in ontogeny (e.g. sentence length, vocabulary size), development follows the logistic equation $p(t) = \frac{c}{1 + a e^{-bt}}$. Note that in ecology the same equation is considered to yield the law of population growth (Lotka, 1920).

Figure: reproduced from Baixeries et al. (2013).

Zipf's law
Zipf (1949) showed that if the most frequent word in a text is assigned rank 1, the second most frequent word rank 2, etc., then the frequency f(r) of a word of rank r obeys $f(r) \propto r^{-\alpha}$ (i.e. follows a power-law distribution). Recent analyses of the CHILDES corpus (Baixeries et al., 2013) indicate that the exponent α depends on age: it is much higher in small children and decreases fast during early childhood.
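A minimal sketch of how such a rank-frequency exponent can be estimated from a transcript: an ordinary least-squares fit of log f(r) against log r (the toy sample and the fitting choice are illustrative assumptions, not the procedure of Baixeries et al.):

    import math, re
    from collections import Counter

    def zipf_alpha(text):
        # descending rank-frequency profile of the tokenized text
        freqs = sorted(Counter(re.findall(r"\w+", text.lower())).values(),
                       reverse=True)
        xs = [math.log(r) for r in range(1, len(freqs) + 1)]
        ys = [math.log(f) for f in freqs]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        # least-squares slope of log f(r) vs. log r; alpha = -slope
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        return -slope

    # toy child-directed sample; a real analysis would use CHILDES transcripts
    sample = "mama auch tato auch nana auch baba auch mama mama hau hau"
    print(round(zipf_alpha(sample), 2))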
Common Aspects of Both LD and Evolution

Axiomatic
1. Convergence: different trajectories, same result
2. Variation: children PLAY, children forget
3. Non-monotonicity: locally "correct" behaviours are lost

Hypothetic
1. Adaptation: gradual convergence of LT towards LM, and possibly of GT towards GM
2. Replication: both vertical (repetition) and horizontal (non-local storage)
3. Parallel coexistence of schemas
4. Selection: correct behaviours are rewarded

Subject and Method

Subject
My own daughter, 0-30 months (0-2;6).

Method
A phenomenological method based principally on amazed observation. Long-term journal. Little or no experimental (artificial) interaction beyond natural and normal scenarios.

Cognitive Crossover Cases

Case 1 - Banan
"Banan" (banana) was called "baja" at (1;6) and "anan" at (1;10). At (1;11) the following interaction took place:
F: banan ; C: anan
F: banan ; C: anan
F: baja ; C: bajan
F: bajan ; C: banan

Case 2 - Olol
Very intensive "Krtko & Orol" period between 1;10 and 1;11; the word "OLOL" was used with high frequency on a regular basis. During one pre-sleep monologue, the subject said "KOLOL" when enumerating the names of her friends from the crèche, one among them being named Nikol.

Bilingual crossovers
- oči + Augen ("eyes") = oge
- opica + Affe ("monkey") = api
- voda + Wasser ("water") = vava
- etc.

Quantitative Observations

Corpus
CHILDES - Child Language Data Exchange System (MacWhinney and Snow 1984)
1. more than 130 corpora of transcribed child verbal interactions
2. more than 20 languages

Variation operators whose impact shall be analyzed
1. Substitutions - papija → babija → mamija
2. Reduplications - hau-hau
3. Omissions - vlak → ak

Method
Matching with Perl-compatible regular expressions (Hromada 2011). Reduplication, for example, can be easily detected with the regexp (\d{2,})\1 (see the sketch at the end of this section). Note that strings can evolve by substituting substrings for other strings, and that a substitution rule is itself also a string.

Grammar Induction
- inducing not one monolithic grammar but populations of individual grammars
- the fitness function promotes individuals which 1) match patterns present in the environment, 2) generate utterances which shall be "accepted" by the environment, and 3) minimize the number of utterances which shall not be accepted by the environment
- an individual grammar is encoded as an ordered sequence of production rules

Corpus: #Mutter# #Vater#
Grammar 1: Vat → A ; #A → B ; er# → A ; #Mu → A ; tA → A ; axioms: AA, BA
Grammar 2: er# → B ; #Va → A ; #Mu → A ; tB → B ; er# → B ; axiom: AB

Concept Construction
Attaching meanings to words is interpreted as supervised learning of a multi-class classifier. In the most recent experiments I cross over four ideas in order to create it:
- RANDOM PROJECTION - exploiting the Johnson-Lindenstrauss lemma to project the problem into a D-dimensional space
- BINARIZATION - transposition of the problem from real-valued spaces to binary (Hamming) spaces (Hromada 2014)
- THEORY OF PROTOTYPES - every category C can be characterized by a prototype P_C which is as close as possible to the members of C and as far as possible from the members of other categories (Rosch 1973)
- EVOLUTIONARY COMPUTATION - which searches for the set of K prototypes P_1, ..., P_K that maximizes the "prototype fitness function":

$F(I) = \sum_{i=1}^{N} \sum_{C=1}^{K} \begin{cases} H(P_C, i) & \text{if } L_i \neq C \\ -H(P_C, i) & \text{if } L_i = C \end{cases}$

where H(P, i) is the Hamming distance between the binary vector denoting prototype P and the document i contained in the training set of cardinality N, and L_i is the label of document i (the two cases correspond to the "bad" prototypes P_B of other categories and the "good" prototype P_G of the document's own category). Every individual is a binary vector obtained by concatenation of the vectors of all K prototypes, |I| = D * K.

Concept Construction - Preliminary Results
- trained on the training part and evaluated on the testing part of the 20newsgroups corpus (K=20)
- LSB parameters D=128, S=3, I=2
- CGA (N=100, P_M=0.001, one-point crossover) with 1/8 elitism

Figure: Evolutionary induction of semantic prototypes - training.
Figure: Evaluation of induced prototypes against the testing set.

The algorithm seems to perform better than the "deep learning" Semantic Hashing method of Salakhutdinov & Hinton (2009). Evolutionary Computation algorithms are capable of generalization and can thus be considered a case of Machine Learning.
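A minimal sketch of backreference-based reduplication detection, as referenced in the Method paragraph above; note that the cited pattern (\d{2,})\1 uses \d, suggesting digit-encoded strings, so the \w variant applied here to raw word forms is an illustrative adaptation:

    import re

    # Detect reduplication via a backreference: a group of >= 2 characters
    # immediately followed by a copy of itself. Illustrative adaptation of
    # the (\d{2,})\1 pattern cited above.
    REDUP = re.compile(r"(\w{2,})\1")

    for token in ["hau-hau", "mamamama", "babadadabebe", "vlak"]:
        m = REDUP.search(token.replace("-", ""))
        print(token, "->", m.group(1) if m else None)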
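And a minimal sketch of the prototype fitness function on binarized vectors (the tiny random dataset and the values of D, K, N are illustrative, not the 20newsgroups setup):

    import random

    random.seed(1)
    D, K, N = 16, 3, 30  # illustrative dimensionality, categories, documents

    def hamming(u, v):
        return sum(a != b for a, b in zip(u, v))

    # toy binarized documents, each with a label in 0..K-1
    docs = [([random.randint(0, 1) for _ in range(D)], random.randrange(K))
            for _ in range(N)]

    def fitness(individual):
        # individual = concatenation of K prototype vectors, |I| = D * K
        protos = [individual[c * D:(c + 1) * D] for c in range(K)]
        f = 0
        for vec, label in docs:
            for c, p in enumerate(protos):
                d = hamming(p, vec)
                # reward distance from "bad" prototypes (+), closeness
                # to the "good" prototype of the document's own label (-)
                f += d if label != c else -d
        return f

    ind = [random.randint(0, 1) for _ in range(D * K)]
    print(fitness(ind))

Maximizing this fitness with a CGA of the kind described earlier yields the prototype sets evaluated in the preliminary results.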
Thesis
1. At some level of abstraction, the ontogeny of syntactic and semantic categories is a process consistent with the tenets of Universal Darwinism.
2. Representations in the human mind are subject to variation, selection and replication.
3. In young children this process is not yet completely internalized (Vygotsky 1934) and is thus visible to an external observer.
4. Evolutionary Computation is a means by which this process can be successfully simulated in silico.

Merci. Thank you.