Slovak University of Technology in Bratislava
Faculty of Electrical Engineering and Information Technology
Department of Control and Cybernetics
&
University Paris 8
École Doctorale Cognition, Langage, Interaction
Cognition Humaine et Artificielle

Thesis for rigorous examination

Daniel D. Hromada

Topic of doctoral Thesis : Evolutionary models of ontogeny of linguistic categories and rules

Advisors : doc. Ing. Ivan Sekaj, PhD. (ivan.sekaj @ stuba.sk) ; prof. Charles Tijus (tijus @ univ-paris8.fr)
Form of study : external and under double supervision
Study began : September 2010 at University Paris 8 ; September 2011 at Slovak University of Technology
Study program : Cybernetics (Slovak University of Technology) ; Psychology (University Paris 8)
Study discipline : 9.2.7 Cybernetics (Slovak University of Technology) ; Cognitive Psychology (University Paris 8)

Abstract

Language development is a process by means of which a human baby constructs the competence needed to encode & decode meanings in the language of her parents. Computationally, it can be described as a trinity of mutually interconnected problems : clustering of all tokens which the baby has heard into 1) semantic and 2) grammatical categories ; and 3) discovery of grammatical rules allowing members of diverse equivalence classes to be combined into syntactically correct and meaningful phrases. The theoretical, « psycholinguistic » claim of our Thesis is that, similarly to theories which explain the emergence of cultural or creative thinking as the result of an evolutionary process occurring within an individual mind, the emergence of linguistic representations and faculties within a human individuum can also be considered a case where the basic tenets of Universal Darwinism apply. The practical, « cybernetic » aim of the Thesis is to create computational models of concept learning, part-of-speech induction and grammar induction which have performance comparable to existing models but are based principally on evolutionary algorithms. It shall be argued that the « fitness function », which determines the « survival rate » of « candidate grammars » emerging and disappearing in the baby’s mind, should be based upon the idea that the most fit grammar G is the one which « minimizes the distance » between the utterances successfully parsed from the linguistic environment E by application of grammar G and the utterances potentially generated by grammar G.

Keywords : evolutionary computing, language acquisition, genetic epistemology, part-of-speech induction, grammar induction, optimal clustering, machine learning, concept construction, grammar systems, motherese, toddlerese

List of most important abbreviations

CC – Concept Construction
EA – Evolutionary Algorithms
EP – Evolutionary Programming
ES – Evolutionary Strategies
ET – Evolutionary Theory
GA – Genetic Algorithms
GE – Genetic Epistemology
GI – Grammar Induction | Grammar Inference
LA – Language Acquisition
LD – Language Development
NLP – Natural Language Processing
POS-i – Part-of-Speech Induction
POS-t – Part-of-Speech Tagging
UD – Universal Darwinism

Convention

« Italics » – citation
( x | y | z ) – disjuncted token, i.e.
read token « (neural|quantum) darwinism » as « neural darwinism ; quantum darwinism »

Table of Contents

0. Introduction
1. Universal Darwinism
1.1. Biological evolution
1.2. Evolutionary Psychology
1.3. Memetics
1.4. Evolutionary Epistemology
1.5. Individual Creativity
1.6. Genetic Epistemology
1.7. Evolutionary computation
1.7.1. Genetic algorithms & fitness landscapes
1.7.2. Evolutionary programming & evolutionary strategies
1.7.3. Genetic programming
1.7.4. Grammatical evolution
1.7.5. Tierra
2. Language development
2.1. Ontogeny of semantic categories (concepts)
2.2. Ontogeny of formal categories (parts-of-speech)
2.3. Ontogeny of grammars (grammar induction)
3. Computational Models of Text Processing
3.1. Concept construction
3.1.1. Non-evolutionary model of CC
3.1.2. An evolutionary model of CC
3.2. Part-of-speech induction and part-of-speech tagging
3.2.1. Non-evolutionary models of POS-i
3.2.2. Evolutionary models of POS-i & POS-t
3.3. Grammar induction
3.3.1. Non-evolutionary models of grammar induction
3.3.2. Evolutionary models of grammar induction
3.4. Evolutionary Language Game
4. Remark concerning the Theory of Grammar Systems
5. Conclusive remarks
6. Bibliography

0. Introduction

A general form of Evolutionary Theory (ET) postulates that entities evolve and adapt to their environment by a process of accumulation of information. Such a generalized theory – often referred to as « Universal Darwinism » – can be, and often is, applied in scientific disciplines as diverse as biology, linguistics or even anthropology and psychology. Since the principal concepts and tenets of ET can be easily formalised into stochastic « evolutionary » algorithms, ET can yield not only a theoretical framework but also a computational experimental methodology for any scientific discipline whose basic concepts and principles can be « reduced » into an ET-consistent form.

The aim of my doctoral Thesis is to demonstrate empirically – i.e. by means of computational experiments – that certain phenomena observed by « developmental linguists » and « psycholinguists » can be explained in terms of the principles of Universal Darwinism and as such can be modelled by « computational linguists » and « Natural Language Processing (NLP) engineers » who would found their computational models upon the methods offered by the Evolutionary Computing paradigm. More concretely, I shall try to indicate that « evolutionary » optimization can be used to yield solutions to at least three problems of language development:

1) induction of semantic categories, i.e. construction of « concepts »
2) induction of the part-of-speech grammatical categories of words of natural languages, i.e. the problem of how equivalence classes like « nouns », « verbs », « adjectives » etc. are constructed by the language-acquiring agent
3) grammar induction, i.e. the problem of how an agent can acquire a grammar from a corpus or its environment

It shall be indicated that the term « language-acquiring agent » can be interpreted both as an organic agent (e.g. a human baby) trying to learn the language of its environment (e.g. its parents) and as a computational agent (e.g. a Turing Machine) inducing the structural properties of the language which generated the corpora with which the agent has been confronted. In other terms, it shall be indicated that ET is generalizable to such an extent that its correct implementation may allow two systems based upon the Darwinian principles « replicate, mutate, select » to converge to the same optimal or quasi-optimal categories regardless of whether the substrate by means of which they compute is organic or not. The first chapter will present more closely the above-mentioned basic principles of the universal ET doctrine and enumerate certain scientific disciplines for which ET furnishes a useful theoretical framework.
Aside from biology, where the role of ET is evident, the discipline of « evolutionary psychology » shall be mentioned principally in order to alert the reader that our aims are not limited to those posited by evolutionary psychology. « Memetic theory », on the contrary, shall elucidate our ultimate aim more precisely, since it already introduces a novel level of representation, « a meme », supposed to be « the basic unit of imitation », and as such offers an interesting starting point for any Darwin-consistent theory of evolution of non-organic (e.g. cultural) structures and artefacts. It is, however, the constructionist « genetic epistemology » (GE) of Jean Piaget which shall resonate even more strongly with our aims – since what GE ultimately postulates is that the human psyche – with all its linguistic, moral, object-manipulating faculties – passes through a sequence of « stages ». For it is our belief that such Piagetian « stages » can be explained, in computational terms, as « quasi-optimal attractors » within a very complicated « search space » of the agent’s internal representations, and that a sort of evolutionary process occurs not only on a social-memetic level between agents imitating each other, but, in the first place, within the agent (him|her|it)self. This Thesis is our attempt to place this « learning = evolution » belief on the solid ground of complexity theory.

The second chapter will address the topic of language development (LD). The topic is so vast and deep that only the most fundamental subproblems (i.e. vocabulary development, acquisition of part-of-speech categories and acquisition of grammars) shall be briefly described, and some basic notions like « variation set » or « motherese » will be introduced. We shall try to avoid the dispute between diverse linguistic doctrines and schools (e.g. nativists, cognitivists, comparativists) ; focus shall be put upon points of consensus supported by empirical evidence.

While the goal of the first chapter is to furnish the theoretical framework and the goal of the second is to specify the problem, it is the third chapter which deals with concrete computational attempts to unify the two. A major part of the chapter shall deal with the question of evaluation of diverse inductive models. Some of the most successful computational models of part-of-speech induction (POS-i) and grammar induction (GI) shall be mentioned in order to pave the way for the evolutionary ones. As shall be indicated by this section, attempts to apply evolutionary algorithms (EAs) to solve the POS-i and GI problems are, despite the good results reported in the literature, very rare. In specific subsections of the chapter, we shall mention certain models, both psycholinguistic and computational, which justify our claim that the process of ontogeny of the linguistic faculty can be not only interpreted but also modelled as a process of evolutionary optimization of cognitive structures.

1. Universal Darwinism

Universal Darwinism (UD) is a scientific paradigm regrouping diverse scientific theories extending the Darwinian theory of evolution and natural selection (Darwin 1859) beyond the domain of biology. It can be understood as a generalized theoretical framework aiming to explain the emergence of many complex phenomena in terms of the interaction of three basic processes: 1) variation 2) selection 3) retention.
According to the UD paradigm, the interaction of these three components yields a « universal algorithm valid not only in biology, but in all domains of knowledge where we can extract informational entities – replicators, which are able to reproduce themselves with variations and which are subject to selection » (Kvasnička and Pospíchal 1999). This generic algorithm is nothing else than traditional Evolutionary Theory (ET) which, when considered as substrate-neutral, can be applied to such a vast number of scientific fields that it has been compared to a kind of « universal acid » which « eats through just about every traditional concept, and leaves in its wake a revolutionized world-view, with most of the old landmarks still recognizable, but transformed in fundamental ways » (Dennett 1996). As of 2013, the existing scientific disciplines which could be labeled as UD-consistent include: biology ; evolutionary (art | psychology | music | linguistics | ethics | economics | anthropology | epistemology | computation) ; sociobiology ; memetics ; (quantum | neural | psycho) darwinism ; artificial life and many others. In regards to the overall aim of our Thesis, some of the most relevant instances of UD are described in the following sub-sections.

1.1. Biological evolution

Evolutionary Theory was born when young Charles Darwin realised that the « gradation and diversity of structure » (Darwin 1906), which he had encountered among the mockingbirds of the Galapagos islands, could be explained by the natural tendency of species to « adapt to a changing world ». In parallel to Darwin’s work, which was gradually clarifying the notion of variability and its close relation to environment-originated selective pressures, Gregor Mendel was assessing statistical distributions of the colours of flowers of his garden peas in Brno, finally converging on the fundamental principles of heredity. But it was only in 1953 that the double-helix structure of the material substrate of heredity of biological species – the DNA molecule – was described (Watson and Crick 1953).

In simple terms: in the DNA molecule, information is encoded as a sequence of nucleotides. Every nucleotide can contain one of four nucleobases and thus ideally carries 2 bits of information. A continuous sequence of three nucleotides gives a « triplet » which, when interpreted by the intracellular « ribosome » machinery, can be « translated » into an amino acid. Sequences of amino acids yield proteins which interact with one another in biochemical cascades. The result is a living organism with its particular phenotype, aiming to reproduce its genetic code. If at a given time T there are two organisms A and B whose genetic codes differ to such an extent that their phenotypes differ, and if the phenotype of organism A increases the probability of A’s survival and reproduction in the external world W, while B’s phenotype diminishes such probability, we say that A is better adapted to world W than B, or, more formally, that fitness(A) > fitness(B). Evolutionary Theory postulates that, in the case of a lack of resources in world W, descendants of organism B shall gradually, after multiple generations, be substituted by descendants of the more fit organism A. This is so because during every act of reproduction, the material reason for having a more fit phenotype – the DNA molecule – is transferred from parent to offspring, and the whole process is cumulative across generations. It can, however, happen that the world W changes.
Or a random (stochastic) event – a gamma ray, the presence of a free radical – can occur which tampers with A’s genetic code. Such an event – called « mutation » – shall result, in the majority of cases, in a decrease of A’s fitness. Rarely, however, mutations can also increase it. Another event which can transform the genetic sequence is called « crossover ». It can be formalised as an operator which substitutes one part of the genetic code of organism A with the corresponding sequence of organism B, and vice versa, the part of B with the corresponding part of A. It is indeed especially the crossover operation, first described in the article of T. H. Morgan (Morgan 1916), which is responsible for the « mixing of properties » in the case of a child organism issued from two parent organisms.

In more concrete terms: the genetic code of such « diploid » organisms is always stored in pairs of chromosomes. Each chromosome in the pair is issued from either the father or the mother organism which, during the process of meiosis, divide their normally diploid cells into haploid gamete cells (i.e. sperm in the case of the father and eggs in the case of the mother). It is especially during the first meiotic phase that crossover occurs, the contents of the DNA sequences of the two grand-parents being mixed and mapped during the crossover operation into the chromosome contained in the gamete which, if lucky, shall fuse with the gamete of the other parent in the act of fertilization.

Figure 1 : Two types of crossover operation. Figures reproduced from (Morgan 1916).

The resulting « zygote » is again diploid and contains a mix of fragments of the genetic code originally present in the cells of all four grand-parents of the nascent organism. The zygote subsequently divides exponentially into a growing number of cells which differentiate from each other according to instructions contained in the genetic code, triggered by biochemical signals coming from the cell’s internal as well as external environment. If the genetic code endows the organism with such properties as will allow it to survive in its environment until its own reproduction, approximately half of the genetic information contained in its DNA shall be transferred to the offspring organism. If not, the information as such shall disappear from the population due to its incompatibility with the environment.

1.2. Evolutionary Psychology

It was already Darwin who posited that ET shall have a profound impact upon psychology: « In the distant future I see open fields for far more important researches. Psychology will be based on a new foundation, that of the necessary acquirement of each mental power and capacity by gradation. » (Darwin 1859) While two possible interpretations of this idea of Darwin exist, Evolutionary Psychology (Ev.Psych.) focuses only on the first one. It aims to explain diverse faculties of the human soul & mind in terms of selective pressures which moulded the modular architecture of the human brain during millions of years of its phylogenetic history. Its central premises state: « The brain's adaptive mechanisms were shaped by natural and sexual selection. Different neural mechanisms are specialized for solving problems in humanity's evolutionary past » (Cosmides and Tooby 1997). In more concrete terms, Evolutionary Psychology explains quite successfully phenomena as diverse as the emergence of cooperation and altruistic behaviour (Hamilton 1963) ; male promiscuity and parental investment (Trivers 1972) ; or even the obesity of the current Anglo-Saxon population (Barrett 2007).
All this and much more is explained as a result of the adaptation of homo sapiens sapiens (and all its biological ancestors) to the dynamism of its ever-changing ecological and social niche. Thus, in the long run, Ev.Psych. tends to explain and integrate all innate faculties of the human mind within the evolutionary framework. The problem with Ev.Psych., however, is that in its grandiose aim to « assemble out of the disjointed, fragmentary, and mutually contradictory human disciplines a single, logically integrated research framework for the psychological, social, and behavioral sciences » (Cosmides and Tooby 1997), it can sometimes happen that Ev.Psych. posits as innate, and thus explainable in terms of biological natural selection, cognitive faculties which are not innate but acquired. Thus it may be more often than rarely the case that whenever it comes to the famous nature vs. nurture controversy (Galton 1875), evolutionary psychologists tend to defend the nativist cause even where doing so means committing an epistemological fallacy [1]. And what makes things even worse for the discipline of Evolutionary Psychology as currently practised is that the aforementioned precognition of Darwin has, aside from the nativist & biological one, also another interpretation. Id est, when Darwin spoke about mental powers and capacities acquired by gradation, one cannot exclude that he was speaking not only about gradation in phylogeny, but also in ontogeny.

1.3. Memetics

The theory of memes, or memetics, is, in a certain sense, a counter-reaction to Evolutionary Psychology’s aim to explain human mental and cognitive faculties in terms of innate propensities. Similarly to Ev.Psych., memetics is also issued from the discipline of sociobiology, which was supposed to be « the extension of population biology and evolutionary theory to social organization » (Wilson 1978). But unlike both Ev.Psych. and sociobiology, memetics does not aim to explain diverse (cultur|psychologic|soci)al phenomena solely in terms of evolution operating upon biochemical genes, but also in terms of evolution being realised on the plane of more abstract information-carrying replicators called « memes » (Dawkins 2006). The basic definition of classical memetic theory is: « a meme is a replicator which replicates from brain to brain by means of imitation » (Blackmore 2000). These replicators are somehow represented in the host brain as some kind of « cognitive structure », and if ever externalised by the host organism – no matter whether in the form of a word, song, behavioral schema or an artefact – they can get copied into another host organism endowed with the device to integrate such structures [2]. Similarly to genes, which often network themselves into mutually supporting auto-catalytic networks (Kauffman 1996), memes can also form more complex memetic complexes, « memplexes », in order to augment the probability of their survival in time. Memes can thus perform informational crossovers with one another (syncretic religions, new recipes from old ingredients or DJ mixes can be nice examples of such memetic crossover) or they can simply mutate, either because of noise present during the imitation (replication) process, or due to other entropy-related, decay-like factors related to the ways in which active memes are ultimately stored in brains or other information-processing devices.
Memetic theory postulates that the cumulative evolutionary process applied upon such information-carrying structures shall ultimately lead to the emergence of such complex phenomena as culture, religion or language.

1.4. Evolutionary Epistemology

Epistemology is a philosophical discipline concerned with the source, nature, scope, existence and diversity of forms of knowledge. Evolutionary epistemology (EE) is a paradigm which aims to explain these by applying the evolutionary framework. But under the one EE label, at least two distinct topics are, in fact, addressed:

1) EE1, which aims to explain the biological evolution of cognitive and mental faculties in humans and animals
2) EE2, which postulates that knowledge itself evolves by selection and variation

EE1 can thus be considered a sub-discipline of Ev.Psych. and, as such, is subject to the Ev.Psych.-directed criticism presented above. EE2, however, is closer to memetics, since it postulates the existence of a second replicator, i.e. of an information-carrying structure which is not materially encoded by a DNA molecule. The distinction between EE1 and EE2 can also be characterised in terms of « phylogeny » and « ontogeny ». Given the definition of phylogeny as the « process which shapes the form of the species » and contrasting it with ontogeny defined as the « process shaping the form of the individual », we find it important to reiterate that while EE1 is more concerned with knowledge as a result of the phylogenetic moulding of DNA, EE2 points more in the direction of « ontogeny ». In fact, EE2 paves the way for at least two further sub-interpretations:

EE2-1 Knowledge can emerge by variation & selection of ideas shared by a group of mutually interacting individuals (Popper 1972)
EE2-2 Knowledge can emerge by variation & selection of cognitive structures within one individuum

It is worth noting that while the so-called recapitulation theory stating that « ontogeny recapitulates phylogeny » (Haeckel 1879) is considered discredited by many biologists and embryologists, it is still held as valid by many researchers in the human and cognitive sciences observing a « strong parallelism between cognitive development of a child and … stages suggested in the archeological record » (Foster 2002), a hundred years after one of Darwin’s contemporaries noted: « Education is a repetition of civilization in little » (Spencer 1894).

[1] If ever we accept the notion of falsifiability as an important criterion of acceptance or rejection of a scientific hypothesis (Popper et al. 1972), many hypotheses issued from Ev.Psych. would have to be rejected because, being based in a distant past which is almost impossible to access, they are less falsifiable than hypotheses explaining the same phenomena in terms of empirical data observable in the present.
[2] In neurobiological terms, the faculty to imitate and hence to integrate memes from the external environment is often associated with so-called « mirror neurons » (Rizzolatti and Craighero 2004).

1.5. Individual Creativity

In fact, evolutionary epistemology was born with the attempt of D. T. Campbell to explain both creative thinking and scientific discovery in terms of « blind variation and selective retention » of thoughts (Campbell 1960). Departing from the introspective works of the mathematician Henri Poincaré, who stated: « To create consists precisely in not making useless combinations and in making those which are useful and which are only a small minority.
Invention is discernment, choice... Among chosen combinations the most fertile will often be those formed of elements drawn from domains which are far apart... What is the cause that, among the thousand products of our unconscious activity, some are called to pass the threshold, while others remain below? » (Poincaré 1908), Campbell suggests that what we call creative thought can be described as a Darwinian process whereby previously acquired knowledge blindly varies in the unconscious mind of the creative thinker and only some of the resulting structures are subsequently selectively retained. As (Simonton 1999) puts it: « How do human beings create variations? One perfectly good Darwinian explanation would be that the variations themselves arise from a cognitive variation-selection process that occurs within the individual brain. »

1.6. Genetic Epistemology

« The fundamental hypothesis of genetic epistemology is that there is a parallelism between the progress made in ... organization of knowledge and the corresponding formative psychological processes. Well, now, if that is our hypothesis, what will be our field of study? Of course the most fruitful, most obvious field of study would be reconstituting human history: the history of human thinking in prehistoric man. Unfortunately, we are not very well informed about the psychology of Neanderthal man or about the psychology of Homo siniensis of Teilhard de Chardin. Since this field of biogenesis is not available to us, we shall do as biologists do and turn to ontogenesis. Nothing could be more accessible to study than the ontogenesis of these notions. There are children all around us. » (Piaget 1974)

Strictly speaking, Piaget’s developmental theory of knowledge, which he himself called Genetic Epistemology (GE), may seem utterly non-Darwinian. In fact it is not even concerned with biochemical genes: Piaget uses the term « genetic » to refer to a more general notion of « heredity », defined as a structure’s tendency to guard its identity through time. The basic structural primitives of Piagetian theory are behavioral « schemas », which can be defined as « a basic set of experiences and knowledge that has been gained through personal experiences that define how things should be and act in the person's environment. As the child interacts with their world and acquires more experiences these schemes are modified to make sense, or used to make sense of the new experience » (Bee and Boyd 2003). There are two ways in which such schemes can be modified. Either they « assimilate » data from the external environment ; or, if such assimilation is not possible because the child’s cognitive system simply cannot match the perceived external datum with an internal pre-existing category, the process of « accommodation » takes place, which transforms the internal category to match the external datum. Ultimately, the set of schemes gets so outdated or so altered by past modifications that it is no longer useful. Whenever such « equilibration » occurs, the old set of schemas is rejected and the child tends to « start fresh with a more up-to-date model » (Bee and Boyd 2003), thus attaining a new substage or stage of its development.
In the Piagetian system – which is based on very precise yet exhaustive observations of dozens of children, including his own – the order of stages is fixed, and it is very difficult, or even fully impossible, for the evolving psyche to attain the pre-operational stage 2 or the concrete operational stage 3 if it has not mastered all that is to be mastered during the sensorimotor stage 1. Given the fact that the GE paradigm involves:

• heredity – schemes are structures which tend to keep their identity in time
• variation – schemes are altered by environment-driven assimilation or accommodation [3]
• selective pressures – only those schemas which are best adapted to the environment and/or form the most functionally fit complexes with other schemas shall pass through the equilibration milestone

it can be briefly stated that Piaget’s GE could be aligned with ET and UD. What is more, it may be the case that the notion of Piagetian stages is consistent with the notion of attractors, or locally optimal states, whose emergence is, according to complex systems theory (Kauffman 1996; Flake 1999), inevitable in a system as complex as a child’s psyche definitely is.

[3] Note that in terms of the theory of evolutionary computation, one can relate the Piagetian notion of assimilation to an operator of local variation which attracts the cognitive system to a locally optimal agreement with its environment, while accommodation suggests an interpretation in terms of more global variation operators (like crossover), which could potentially allow the cognitive system to reach a state of global equilibrium in regards to its environment.

1.7. Evolutionary computation

We have already mentioned (c.f. 1.1.) that evolution, as defined within UD, can be thought of as a universal, generic algorithm. Not only can « evolutionary theory » serve us to explain diverse phenomena around us, it can also be exploited for finding solutions to diverse problems. Thus it is of no surprise that many researchers in informatics realized that not only can the evolutionary process be encoded as an informatic algorithm, but that such algorithms could be useful as heuristics which could potentially lead to the discovery of useful (quasi-)optimal solutions to a wide range of diverse problems. The first explorations in the domain were Rechenberg’s « evolutionary strategies » (Rechenberg 1973) and Holland’s « genetic algorithms » (Holland 1975) which, along with « evolutionary programming » (Fogel et al. 1966) and « genetic programming », form the « evolutionary computation » subdiscipline of computer science. All four approaches differ from classical optimization methods in the following aspects:

1. using a population of potential solutions in their search
2. using an explicit « fitness » instead of function derivatives
3. using probabilistic, rather than deterministic, transition rules (Kennedy et al. 2001)

Figure 2: Basic genetic algorithm schema. Reproduced from (Pohlheim 1996).

1.7.1. Genetic algorithms & fitness landscapes

The basic principle of « genetic algorithms » is illustrated in Figure 2. The core component of every genetic algorithm is the objective « fitness function », able to attribute a cardinal value or ordinal rank to any individual in the population of potential solutions. In other terms, the fitness function yields the criterion according to which one candidate individual is evaluated as a « more fit » solution, in regards to the problem under study, than other potential solutions present in the population. The population is the set of individual solutions.
Every individual solution is encoded as a vector of values (also called « chromosome » or « genome ») which can vary in time. The designer’s choice of how problem solutions are encoded in chromosomal vectors, e.g. the type (Boolean ? Integer ? Float ? Set ?) of the different elements of the vector, is also a crucial one and can often determine whether the algorithm shall succeed or fail. In every generation – i.e. in every iteration of the algorithmic cycle represented by the circle in Figure 2 – all N individuals in the population are evaluated by the fitness function. Every individual thus obtains a « fitness » value, which subsequently governs the « selection » procedure choosing a subset of individuals from the current generation as those whose genetic information shall reproduce into the next generations. In our Thesis we plan to exploit especially « fitness proportionate selection » as the selection operator. This operator, also called the « roulette wheel » operator, transforms the fitness f_i of individual i into the probability p_i of its survival by means of the formula:

p_i = \frac{f_i}{\sum_{j=1}^{N} f_j}

where N is the number of individuals in the population. Once the « most fit » candidates are selected by the selection operator, they are subsequently mutually recombined by means of « crossover » operators and/or modified by means of « mutation » operators. Many different types of selection, mutation and crossover operators exist; for their overview c.f. (Sekaj 2005). For the purpose of this work, let us just note that the probabilities of occurrence of mutation or crossover have to be fairly low, otherwise no fitness-increasing information could be transferred among generations and the whole system would tend to exhibit non-converging, chaotic behaviour (Nowak et al. 1999). Another useful strategy, which guarantees that the maximal fitness shall either increase or at least stay constant, is called elitism. In order to implement this strategy, one simply keeps one (or more) individual(s) with the highest fitness unchanged for the next generation, thus protecting « the best ones » from variations which would, most probably, decrease rather than increase their fitness [4]. Yet another widely used approach reinforces the selection pressure by removal of the weakest individuals. Both the elitist « survival of the fittest » and the contrary « removal of the weakest » are often combined.

[4] Note that in nature, elitism is often but not always the case. For it can happen that, due to stochastic factors, the most fit individuals die before they succeed in reproducing themselves.
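A minimal sketch of the mechanisms just described is given below (in Python, used here purely for illustration ; the operator implementations and all names are our own choices for a toy bit-string problem, not the operators that shall be used in the Thesis itself). It combines roulette-wheel selection, one-point crossover, bit-flip mutation and elitism:

import random

def roulette_wheel_select(population, fitnesses):
    # implements p_i = f_i / sum_j f_j ; assumes non-negative fitnesses
    total = sum(fitnesses)
    pick = random.uniform(0, total)
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if pick <= cumulative:
            return individual
    return population[-1]          # numerical edge case

def one_point_crossover(a, b):
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def bit_flip_mutation(chromosome):
    i = random.randrange(len(chromosome))
    mutant = chromosome[:]
    mutant[i] = 1 - mutant[i]
    return mutant

def next_generation(population, fitness_fn, elite_count=1,
                    p_crossover=0.7, p_mutation=0.01):
    fitnesses = [fitness_fn(ind) for ind in population]
    # elitism: the best individuals survive unchanged
    ranked = sorted(population, key=fitness_fn, reverse=True)
    offspring = [ind[:] for ind in ranked[:elite_count]]
    while len(offspring) < len(population):
        parent_a = roulette_wheel_select(population, fitnesses)
        parent_b = roulette_wheel_select(population, fitnesses)
        child = (one_point_crossover(parent_a, parent_b)
                 if random.random() < p_crossover else parent_a[:])
        if random.random() < p_mutation:
            child = bit_flip_mutation(child)
        offspring.append(child)
    return offspring

# toy usage: maximize the number of 1-bits ("one-max" problem)
population = [[random.randint(0, 1) for _ in range(16)] for _ in range(30)]
for generation in range(50):
    population = next_generation(population, fitness_fn=sum)
print(max(population, key=sum))

Note that the low default mutation probability and the single elite individual reflect the two points made above: variation must stay rare enough to preserve fitness-increasing information, while elitism guarantees that the maximal fitness never decreases.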
The selection of the most fit individuals from the old generation, their subsequent replication and/or recombination and diversification, yields a new generation. Because individuals with lower fitness have been either completely or at least partially discarded by the selection process, one can expect that the overall fitness of the new generation shall be higher than the fitness of the old generation. With a little bit of luck, one can also hope that the most fit individuals of the new generation shall be somewhat more fit than the most fit individuals discovered in the old generation – this can happen if ever a « benign » mutation has occurred, i.e. a modification which has moved the individual from a lower point on the « fitness landscape » to a somewhat higher one.

The notion of a fitness landscape, first introduced in (Wright 1932), is a metaphor useful for the understanding and explanation of diverse evolutionary phenomena. The landscape is depicted as a mountain range with peaks of varying height. The height at any point on the landscape corresponds to its fitness value; i.e. the higher the point, the greater the fitness of an individual represented by that point of the landscape. In such a representation, the evolution of the organism into more and more « fit » forms can be depicted as a movement up-hill, towards the closest peak (i.e. a local optimum) or towards the highest peak of the whole landscape (i.e. the global optimum). Figure 3 illustrates a fitness landscape of a very simple organism with only one gene (whose potential values are encoded by the illustration’s X axis).

Figure 3: Possible fitness landscape for a problem with only one variable. Horizontal axis represents the gene’s value, vertical axis represents fitness.

Every arrow on the figure represents one possible individual. Its length represents the variation which can be brought in by the mutation operator. The fact that individuals always tend to move « upwards » indicates that selection pressures are involved. It has to be added that without the implementation of the crossover operator, the globally optimal state (encoded by point C) could not be attained by individuals who did not originate on the slopes of C. Only some sort of crossover operator could ensure that individuals who attained the local optima (encoded by peaks A, B, D) could be mutually recombined (for example B with D) in a way that shall allow them to leave the locally stable states and approach the globally optimal C. The fact that genetic algorithms, thanks to « crossover » operators, can combine two individuals from diverse sectors of the fitness landscape allows them to find solutions to problems where heuristics based on « gradient descent » would fail.

1.7.2. Evolutionary programming & evolutionary strategies

Evolutionary programming (EP) and evolutionary strategies (ES) are methods whose overall essence is very similar to GAs. There are, however, some subtle differences among the approaches. In EP, mutation is the principal and often the only variation operator. While recombination is rarely used, « operators are freely adapted to fit the problem at hand » (Kennedy et al. 2001). EP algorithms often double the size of the population by mixing children with parents and then halve the population by selection. The tournament selection operator is often used. Another difference is that while GAs were developed in order to optimize the numeric parameters of the mathematical function under study – and variation thus directly modifies the genotype – in EP, one mutates the genotype but evaluates the fitness according to the phenotype. EP is thus often used for the construction & optimization of structures like « finite state automata » (Fogel et al. 1966). A self-adaptation approach (Bentley 1999), allowing for mutation of the parameters of the evolution itself – e.g. the mutation rate – is also frequently used. Such an approach of « evolving the evolution » is also used in ES, which were discovered – in parallel with but independently of Holland’s GAs – by (Rechenberg 1973). The biggest difference between EP and ES is thus the fact that ES often recombines its individuals before mutating them. A popular and well-performing strategy (a minimal sketch in code follows the list below) thus seems to be:

1. Initialize the population.
2. Perform recombination using P parents to form C children [5].
3. Perform mutation on all children.
4. Evaluate the children population and select P members from it.
5. If the termination criterion is not met, go to step 2; terminate otherwise.

[5] A frequently used C/P ratio is 7.
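The following sketch illustrates this loop, again in illustrative Python ; the intermediate recombination and Gaussian mutation operators shown here are simple placeholder choices of our own, not Rechenberg’s original operators:

import random

def evolution_strategy(initial_parents, fitness_fn, children_per_parent=7,
                       sigma=0.1, generations=100):
    # recombine P parents into C children, mutate all children,
    # then select the P best children as the next parent population
    parents = initial_parents
    P = len(parents)
    C = children_per_parent * P     # frequently used C/P ratio is 7
    for _ in range(generations):
        children = []
        for _ in range(C):
            # intermediate recombination of two random parents
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2.0 for x, y in zip(a, b)]
            # Gaussian mutation of every gene
            child = [x + random.gauss(0.0, sigma) for x in child]
            children.append(child)
        parents = sorted(children, key=fitness_fn, reverse=True)[:P]
    return max(parents, key=fitness_fn)

# toy usage: maximize f(v) = -(v1^2 + v2^2), whose optimum lies at (0, 0)
start = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(10)]
best = evolution_strategy(start, lambda v: -(v[0] ** 2 + v[1] ** 2))
print(best)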
Given the fact that in our Thesis we shall often: 1) encode the problem of linguistic category induction by non-numeric chromosomes, and 2) evaluate the fitness of individuals by means of additional « phenotypic algorithms », we consider the works of Fogel & Rechenberg to be of particular importance for our study.

1.7.3. Genetic programming

Contrary to GAs, EP and ES, which operate upon chromosomes (vectors) of fixed length containing numeric/boolean/character values, individuals evolved by means of Genetic Programming (GP) encode programs of arbitrary length and complexity. In other terms, one may state that while the above-mentioned EC methods look for the most optimal solution of a given problem, GP tends to produce a hierarchical tree structure encoding a sequence of instructions (i.e. a program) able to yield optimal solutions to a whole range of problems. Simply said: GP is a way in which computer programs can automatically « discover » new and useful programs. The most important thing to do in order to prepare a GP framework is to specify how the resulting individuals (programs) shall be encoded. The original choice of the founder of the discipline, John Koza, was to encode all individuals as trees of LISP S-expressions composed of sub-trees which are themselves also LISP S-expressions. Within such arborescent S-expressions, the terminal nodes (i.e. leaf nodes where the branches end) represent the program’s variables and constants, while the non-terminal nodes (i.e. internal tree points) represent diverse functions contained in the function set (e.g. arithmetic functions like +, -, *, / ; mathematical functions like log, cos ; boolean functions like AND, OR, NOT ; conditional operators like if/else etc.).

Figure 4 illustrates how, during the initial run of the algorithm, an individual – calculating, for example, the square root of X+5 – could be randomly generated by implementing the following procedure:

1) The « root » of the program tree is randomly chosen from the function set ; it is the function sqrt.
2) The function sqrt has only one argument (arity(sqrt) = 1), therefore it will take only one input, from the randomly determined functor + (addition).
3) The functor + takes two inputs (arity(+) = 2), therefore the tree bifurcates into two lines at this node. It randomly chooses the constant 5 as the first argument, and the variable X as the second argument.

Figure 4: Sequence of steps constructing the program sqrt(x+5).

Note that in step 3, both arguments were chosen from the terminal set. Had they been chosen from the function set, the tree would bifurcate further. In order to prevent such growth of trees ad infinitum, a limiting « maximal tree depth » parameter is more often than not implemented in GP scenarios. Once such a program has been generated, one can evaluate its fitness by confronting it with diverse input arguments and comparing its output with a golden standard. Such random-program generation & evaluation is repeated for all N initial candidate programs; subsequently the most fit individuals are selected and varied. While GP’s selection techniques can sometimes closely resemble the selection techniques used in GAs, variation operators are often of an essentially different nature. This is so because in GP it is not individual genomes as linear sequences that are mutated or crossed over, but rather complex, hierarchical networks of expressions, as the sketch below illustrates.
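The sketch (illustrative Python ; the nested-list tree representation and all helper names are our own stand-ins for Koza’s LISP S-expressions) shows a random tree generator in the spirit of the procedure above, an interpreter, and the sub-tree crossover discussed next:

import random, math, copy

FUNCTIONS = {'sqrt': 1, '+': 2, '-': 2, '*': 2}   # functor -> arity
TERMINALS = ['x', 5]

def random_tree(max_depth):
    # a tree is either a terminal or a list [functor, arg_1, ..., arg_arity]
    if max_depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    functor = random.choice(list(FUNCTIONS))
    return [functor] + [random_tree(max_depth - 1)
                        for _ in range(FUNCTIONS[functor])]

def evaluate(tree, x):
    # sqrt takes abs() of its argument, respecting the closure property
    # discussed below
    if tree == 'x':
        return x
    if not isinstance(tree, list):
        return tree                                # a numeric constant
    args = [evaluate(child, x) for child in tree[1:]]
    if tree[0] == 'sqrt':
        return math.sqrt(abs(args[0]))
    return {'+': args[0] + args[1], '-': args[0] - args[1],
            '*': args[0] * args[1]}[tree[0]]

def all_nodes(tree, path=()):
    # yield (path, subtree) for every node; a path is a tuple of child indices
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_nodes(child, path + (i,))

def subtree_crossover(a, b):
    # replace a random sub-tree of (a copy of) `a` by a random sub-tree of `b`
    path, _ = random.choice(list(all_nodes(a)))
    _, donor = random.choice(list(all_nodes(b)))
    if not path:
        return copy.deepcopy(donor)
    child = copy.deepcopy(a)
    node = child
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(donor)
    return child

# the individual of Figure 4, sqrt(x+5), evaluated at x = 4
individual = ['sqrt', ['+', 'x', 5]]
print(evaluate(individual, 4))                     # prints 3.0
print(subtree_crossover(individual, random_tree(2)))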
In the case of crossover, for example, one exchanges a whole sub-tree encoded within one individual for a sub-tree encoded within another one. GP-based solutions cannot be expected to function correctly if they do not satisfy the theoretical properties of closure and sufficiency. In order to fulfill the closure condition, each function from the non-terminal set must be able to successfully operate both on the output of any function in the non-terminal set and on any value obtainable by a member of the terminal set. Even the behaviour of some simple operators thus has to be adjusted a priori (e.g. return 1 in case of division by zero) in order to assure the correct functioning of the resulting program. The sufficiency property, on the other hand, demands that the set of functors and terminals be sufficiently exhaustive ; otherwise the solution could not be found. One cannot, for example, hope to discover the equation generating the Mandelbrot set if the initial set of terminals does not contain the notion of an imaginary number, nor the function set any other explicit or implicit reference to the notion of the complex plane. Thus, while the closure constraint delimits the upper bound beyond which the discovery of the solution is not feasible, the sufficiency constraint delimits the lower bound of the minimal set of « initial components » which have to be defined a priori so that the discovery of the adequate program is at least theoretically possible. Other theoretical notions, as well as diverse subtleties of practical implementation (special operators, methods of distributing the initial population in the search space, fitness function proposals, domains of application, etc.), are to be found in possibly the most important GP monography (Koza 1992).

1.7.4. Grammatical evolution

Grammatical Evolution (Gr.Ev) is a variant of GP in the sense that it also uses evolutionary computing in order to automatically generate computer programs. The most important difference between Gr.Ev and GP is that while GP operates directly upon phenotypic trees representing the program’s code itself (for example in the form of LISP expressions), Gr.Ev uses the evolutionary machinery for the purpose of generating grammars, which subsequently generate the program code. In Formal Language Theory, a grammar is represented by the tuple {N, T, P, S} where N denotes the set of non-terminals, T the set of terminals, S is a start symbol which is a member of N, and P denotes the set of production rules that substitute elements of N by elements of N, T or their combinations [6]. Consider a grammar exhaustive enough to encode programs able to perform an arbitrary number of operations of addition or subtraction of two variables:

N = {<expr>, <op>, <var>}
T = {+, -, x, y}
S = <expr>
P = {
  <expr> ::= <expr><op><expr> | <var>
  <op>   ::= + | -
  <var>  ::= x | y
}

[6] This is the case for so-called context-free and context-sensitive grammars.

Such a grammar contains three non-terminals: the non-terminal <op>, which can be substituted by either the terminal + or the terminal - ; the non-terminal <var>, which can be substituted by either the terminal x or the terminal y ; and the non-terminal <expr>, which can be substituted either by the non-terminal <var> or by the sequence of non-terminals <expr><op><expr>. The fact that in this last production the non-terminal <expr> is present both on the left and on the right side of the substitution rule gives this grammar the possibility to recursively generate an infinite number of expressions like:

x+x, x+y, y+x, y+y, x-x, x-y, y-y, y-x, x+x+x, x+x-x, x+x+y, x+x-y, x-x+y-y, x-x-y+y+x, y+y+x+x+y-x etc.
Thus, even a very simple grammar with only four terminal symbols and three non-terminal symbols, to each of which are associated only two production rules, can theoretically produce an infinite number of distinct individual programs able to perform basic arithmetic operations with two variables. The generation of a given resulting expression is determined by the order of application of specific production rules, starting with the non-terminal start symbol S. Such a sequence of applications of production rules is called a derivation. For example, in order to derive the individual « x+x », one has to apply the production rules in the following order:

<expr> → <expr><op><expr> → <var><op><expr> → x<op><expr> → x+<expr> → x+<var> → x+x

while the individual « y-x » would be generated if the starting symbol S were expanded by the following sequence of production rules:

<expr> → <expr><op><expr> → <var><op><expr> → y<op><expr> → y-<expr> → y-<var> → y-x

In Grammatical Evolution, it is this « order of application of production rules » which is encoded in the individual chromosome. In other terms, individual chromosomes encode when and where distinct production rules shall be applied. Figure 5 illustrates more closely, and puts into analogy with biological systems, the sequence of transformations which every binary chromosome undergoes during the process of unfolding into a fully functional program.

Figure 5: Sequence of transformations from genotype to phenotype in both Gr.Ev and biological systems. Figure reproduced from (O’Neill & Ryan 2003).

It can be easily inferred from the above-displayed schema that the approach of Gr.Ev is quite intricate and involves multiple steps of information processing. The whole process starts with a binary chromosome subsequently split into 8-bit codons, each of which yields an integer specifying which production rule to use at a given moment of the program’s generation (a minimal sketch of this genotype-to-phenotype mapping is given below). On many different layers, the « generation » process as implemented in Gr.Ev introduces and implements very original ideas, such as:

1. « Degenerate genetic code » – similarly to « nature’s choice » to encode one amino acid by means of many different triplets, one can encode the application of a unique production rule by more than one codon.
2. « Wrapping » – under certain conditions, the whole genome can be « traversed » more than once during the process of phenotypic expression. A specific codon can thus be used more than once during the compilation of a single individual.

The rationale for the usage of such « biologically inspired tricks » is presented more closely in the work of the founders of the Grammatical Evolution field (O’Neill & Ryan 2003). They claim that the focus on the genotype-phenotype distinction, especially in combination with the implementation of the « degenerate code » and « wrapping » notions, can result in a compression of representation (& a subsequent reduction of the size of the program search-space) and account for phenomena like « neutral mutation », well observed in biological systems, whereby a mutation occurs in the genotype but does not have any effect upon the resulting phenotype. Another important advantage mentioned by O’Neill and Ryan is that the Gr.Ev approach makes it very easy to generate programs in any arbitrary language. This is due to the versatility and generality of the notion of « grammar ».
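The following minimal sketch (illustrative Python ; the 8-bit codons and the modulo rule-selection follow the mechanism described in (O’Neill & Ryan 2003), while the function names, the wrapping limit and the leftmost-first rewriting are simplifying assumptions of our own) maps a binary genotype onto an expression of the example grammar above:

GRAMMAR = {
    '<expr>': ['<expr><op><expr>', '<var>'],
    '<op>':   ['+', '-'],
    '<var>':  ['x', 'y'],
}

def derive(chromosome_bits, start='<expr>', max_wraps=2):
    # each 8-bit codon picks a production via `codon % number_of_rules`,
    # so many codons map onto the same rule ("degenerate code"); when the
    # genome is exhausted, it is traversed again ("wrapping")
    codons = [int(chromosome_bits[i:i + 8], 2)
              for i in range(0, len(chromosome_bits) - 7, 8)]
    expression, used = start, 0
    while '<' in expression:
        if used >= len(codons) * max_wraps:
            return None    # individual failed to map; would get minimal fitness
        # the leftmost non-terminal is rewritten first
        lhs = expression[expression.index('<'):expression.index('>') + 1]
        rules = GRAMMAR[lhs]
        chosen = rules[codons[used % len(codons)] % len(rules)]
        expression = expression.replace(lhs, chosen, 1)
        used += 1
    return expression

# toy genome of six codons [2, 1, 0, 4, 3, 5]; derives the phenotype 'x+y'
genome = '00000010' '00000001' '00000000' '00000100' '00000011' '00000101'
print(derive(genome))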
When compared with the traditional GP technique, Gr.Ev was outperformed in a scenario where one had to find solutions to the problem of symbolic regression. But in more complex scenarios like « symbolic integration » or the « Santa Fe ant trail », or in a scenario where one had to discover the most precise « caching algorithm », Gr.Ev significantly outperformed GP. The seminal work of (O’Neill and Ryan 2003) also presents some other interesting examples of practical applications of Gr.Ev, for example in the domain of financial market prediction. We note that while in many points (« grammar », « evolution ») the work of O’Neill and Ryan significantly overlaps with ours, their aims differ considerably from those that shall be presented in our Thesis. More concretely, while Gr.Ev tends to offer a very general toolbox for generating useful computer programs in an arbitrary programming language, used for solving arbitrary problems, our Thesis shall deploy the evolutionary computation machinery to shed some light upon diverse facets of one sole problem: that of « Natural Language Development ». Another important difference between the approach of Gr.Ev and the one we shall present in our Thesis is that while in Gr.Ev grammars are considered to be « generative devices », i.e. tools used for the generation of programs, in our Thesis we shall use them as both « generative » and « parsing » devices. Another, even more fundamental difference is due to the fact that while « At the heart of GE lies the fact that genes are only used to determine which rule is applied when, not what the rules are » (O’Neill and Ryan 2003), the evolutionary model of language induction proposed in our Thesis shall aim to determine not only the order of application of the rules, but also the content of the rules themselves.

1.7.5. Tierra

Another example of how one can materialise evolutionary principles within an in silico framework is offered by Tierra, an artificial life simulation environment programmed between 1990 and 2001 by Thomas S. Ray and his colleagues. Since Ray is an ecologist, his objective was not to develop an EC-like model in order to find or optimise solutions to a given problem ; rather, he aimed to create a system where artificial entities could spontaneously evolve, co-evolve and potentially create whole artificial ecosystems. An artificial entity in Tierra’s framework (Ray 1992) is a program composed of a sequence of instructions chosen from an instruction set containing 32 quite traditional assembler instructions, somewhat tuned by the author so that their usage would facilitate « replication » of the code. Every artificial entity runs in its own « virtual CPU », but its code stays encoded in the « soup », i.e. a piece of RAM which is potentially read-accessible to all other entities as well. Rare « cosmic ray » mutations flip the bits of the « soup » from time to time ; more variation is ensured by bit-flipping during the procedure whereby the entity replicates (i.e. copies) its code from the « mother cell » section of the soup to the « daughter cell » section. Selection is in a certain sense emulated by a so-called Reaper process which tends to stop the execution of programs which are either too old or contain too many flawed instructions. Other than that, there is nothing which resembles the traditional notion of an exogenously defined « fitness function » : within Tierra, the survival (or death) of diverse species of programs is a direct consequence of a species’ ability (or inability) to obtain access to limited resources (CPU & memory).
Thus, after one seeds the initially empty soup with a manually constructed individual containing 80 instructions allowing it to copy its code into the daughter cell of the memory, after the memory has been filled and the battle for resources has started, and once mutations have generated a sufficient amount of variation, one can observe the emergence of dozens of new forms of replicable programs. Some of them being parasites, and some of them being able to create algorithmic counter-measures against parasites, one can literally observe the emergence of an artificial yet living ecological system. It is therefore hardly surprising that Tierra could automatically evolve, among others, an individual containing just 22 instructions capable of replication – that is, a replicator almost 4 times shorter than the replicator manually programmed by the conceptor of the system and injected into the initial « soup ». Currently the most famous descendant of Tierra is the AVIDA system (Ofria and Wilke 2004). Contrary to Tierra, however, every AVIDA individual is encapsulated within its own virtual CPU and memory space. Tierra’s Darwinian metaphor [7] of computer programs evolving by means of fighting for limited resources is thus not so strictly followed.

[7] http://life.ou.edu/pubs/tierra/node3.html

2. Language development

Language development (LD) is a constructionist process which endows humans with the capacity for transferring information to, and obtaining information from, other humans by means of verbal communication. The term « language development » shall be used in preference to « language acquisition » in order to mark the fact that the child does not only passively « acquire » the language from environmental input, but rather gradually builds it, in interaction with its environment. Sometimes the term « language learning » shall be used as well to denote the same process. In our Thesis we shall focus only on modeling the development of the « first language », i.e. we shall aim to present a computational and evolutionary model of the process by means of which a human baby learns the language of its closest social environment.

The child’s closest social environment is her parents, most notably her mother. Hundreds of studies have been conducted to study the nature of « motherese », a special simplified language between mothers and their children (M. Harris 2013). While many studies point in divergent directions, they more or less agree that « maternal speech has certain characteristics that distinguish it from speech to other adults. These characteristics are in essence simplicity, brevity and redundancy. » What is more, it seems to be a well-established fact that there exists a reciprocal link between the complexity of motherese and the complexity of the child’s production. In other terms, mothers adjust their language according to the stage of the child’s linguistic development. Other studies also indicate the existence of a causal link between the quantity and simplicity of motherese utterances on the one hand and the child’s linguistic development on the other. More concretely, studies like that of (Furrow et al. 1979) indicate that a child’s confrontation with frequent and simple utterances facilitates its linguistic development, while a more complex style can slow its development down. Other studies, like that of Ellis & Wells (1980), add that « children who showed the earliest and most rapid language development received significantly more acknowledgments, corrections, prohibitions and instructions from their parents ».
This causal link between the mother’s linguistic productions and the child’s developing linguistic competence shall play an important role when we discuss the « fitness function problem ». More concretely, we shall try to integrate into our computational models the idea that the fitness function evaluating the performance of the child’s internal categorization mechanism and/or candidate grammar shall be external to the child : the fitness function shall be given by the mother’s behaviour.

2.1. Ontogeny of semantic categories (concepts)

Natural language furnishes a communication channel for the exchange of meanings. Meaning (also called « signifié » in traditional linguistics) is intentional : it refers to some external entity (also called « referent »). Within the language L, a meaning M can be denoted by a token (also called « signifiant »), and it is by the exchange of physical (phonic in the case of spoken language, graphemic in the case of written language, etc.) manifestations of these tokens that the producer (speaker|writer) and the receiver (hearer|reader) communicate. Traditionally, the meaning of a word, i.e. its « semantics », was often considered as something almost « sacred » and impossible to formalize by mathematical means. The maximum which could be done, and had been done from Aristotle until the middle of the 20th century, was to define a concept in terms of lists of « necessary and sufficient features ». Two types of features were considered to be both necessary and sufficient for the definition of the majority of concepts : the first specifying the concept’s genus (or superordinate concept), and the second specifying the particular property (differentia) which distinguishes the concept from the other members of the same genus. Thus, for example, « dog » could be defined as a domesticated (differentia) canine (genus). An important property of such a system of concepts was that it allowed no ambiguous or fuzzy border cases : the logical « law of excluded middle » guaranteed that all entities which were not both canine and domesticated at the same time (e.g. a chihuahua which passed all her life in the wilderness) could not be called a dog. The change of paradigm came slowly with the works of late Wittgenstein 8, but especially with the empirical studies of Eleanor Rosch (Rosch 1999), who realized not only that concepts are often defined by bundles of features which are neither necessary nor sufficient, but also that the degree with which a feature can be associated with a concept often varies. Subsequently, Rosch proposed a « prototype theory » of semantic categories whose basic postulate is that some members of the category (or some instances of the concept) can be more « central » in relation to the category (resp. concept) than others. Prototype theory, as well as other theoretical and empirical advances, in combination with the development of information-processing technologies, has paved the way to an operationalization of semantics which allows us to transform meanings of words into mathematically commensurable entities. In computational semantics, the meaning of a token X observable within a language corpus C is often characterized as a vector of relations which X holds with other tokens observable within the corpus. The set of such vectors associated to all tokens observable in C yields a « semantic space », which is a vector space within which one can effectuate diverse numeric and|or geometric operations. In short, concepts can be operationalized as geometric entities (Gärdenfors 2004).

8 « For a large class of cases of the employment of the word ‘meaning’—though not for all—this way can be explained in this way: the meaning of a word is its use in the language. » (Wittgenstein 2009)
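As a minimal illustration of such an operationalization (the toy corpus below is our own, not a specific published model), each word can be characterized by its co-occurrence counts with every other word, and semantic proximity can then be measured geometrically, e.g. by the cosine :

import numpy as np

corpus = [["the", "dog", "barks"], ["the", "cat", "meows"],
          ["the", "dog", "bites"], ["the", "cat", "purrs"]]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
space = np.zeros((len(vocab), len(vocab)))  # rows = tokens, columns = context features

for sentence in corpus:
    for w in sentence:
        for context in sentence:
            if context != w:
                space[index[w], index[context]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "dog" and "cat" end up relatively close: they keep similar company
print(cosine(space[index["dog"]], space[index["cat"]]))

Such a co-occurrence matrix is, of course, only the crudest possible « semantic space » ; it merely concretizes the idea that meanings become commensurable once they are encoded as vectors.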
« In the most simple case, the vector which denotes the concept X can be calculated as a linear combination of the vectors of the concepts in whose context X occurs » (Hromada 2013a). This is an algebraic form of the famous « distributional hypothesis » stating that « a word is characterized by the company it keeps » (Z. S. Harris 1954), which can be considered to be the central dogma of statistical semantics. The distributional hypothesis is in a certain sense a variation on the old « associationist » explanation of the functioning of the mind, which stated that the essence of mind is somehow related to the mind’s ability to create relations, i.e. associations, between successive mental states. Both the mind’s faculty to create associations (considered by philosophers like Hume and Locke to be the primary faculty of the mind) as well as the distributional hypothesis that the meaning of a symbol X can be defined in terms of the meanings of the symbols with which X co-occurs can, we believe, be neurologically explained in terms of the postulate first stated by the neurologist Hebb : « The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated', so that activity in one facilitates activity in the other. » (Hebb 1964) One can assume that 1) if not only single neurons but, mutatis mutandis, also whole neural circuits are governed by Hebb’s rule, and 2) if distinct words Wx and Wy are somehow processed and represented by distinct neural circuits Nx and Ny, THEN it shall follow that whenever a hearer hears (or a speaker speaks) the two-word phrase WxWy, the ensemble of material (synaptic?) relations between Nx and Ny shall get reinforced. In more geometrical terms, on a more « mental » level, such a « rapprochement » of Nx and Ny would be characterized by a convergence of the geometrical representations of both circuits towards their common geometrical centroid. Thus, after processing the phrase WxWy, the vectorial representations of both Nx and Ny will be closer to each other than before hearing (or generating) the phrase. In our Thesis we shall presuppose that an associationist principle similar to the one described above is indeed at work whenever a human mind constructs a concept. We use the term « concept » synonymously with the term « semantic class » : we define both concepts and semantic classes as either subspaces of a « semantic vector space », or as centroid points of such subspaces. Theoretically, there are multiple (and possibly infinitely many) ways how a cognitive system can internally represent an external environment E (or, in the case of a computational linguistic agent, a corpus C) as a « semantic space » S of dimensionality D. It is important to notice that the overall partitioning of the cognitive system’s vector space determines how the system classifies the world. If the system’s ability to correctly classify the world determines the reproductive fitness of the organism within which the cognitive system is embedded, one can state that the topology of the internally represented semantic space can quite directly influence the organism’s fitness.
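The centroid-convergence idea described above can be formalized in a few lines (the vectors and the learning rate below are our own toy stand-ins for whatever neural representations are actually involved) : whenever the phrase WxWy is processed, the representations of both words are pulled towards their common centroid.

import numpy as np

rng = np.random.default_rng(0)
vectors = {"ring": rng.normal(size=8), "bell": rng.normal(size=8)}
RATE = 0.1   # how strongly one co-occurrence reinforces the association

def hear_phrase(w1, w2):
    # Hebbian-style update: both representations move towards their centroid
    centroid = (vectors[w1] + vectors[w2]) / 2
    vectors[w1] += RATE * (centroid - vectors[w1])
    vectors[w2] += RATE * (centroid - vectors[w2])

def distance(w1, w2):
    return np.linalg.norm(vectors[w1] - vectors[w2])

print(distance("ring", "bell"))   # before
hear_phrase("ring", "bell")
print(distance("ring", "bell"))   # strictly smaller afterwards

After each exposure the distance shrinks by the factor (1 − RATE), which is exactly the geometric counterpart of the reinforcement postulated above.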
Consider, for example, the reproductive fitness of a member of a prey species which sometimes mis-classifies a predator species as a sexual mate, and compare it to the fitness of such an individual of a prey species whose semantic space is optimized so that the probability of such a mis-classification is practically reduced to zero. The question whether such « semantic space optimization » occurs during the phylogeny of the human species, or whether it occurs principally during the early years of a child’s development (i.e. ontogeny), is a variant of the « nature vs. nurture » (Galton 1875) debate between « nativists », who bet on the « innateness » of certain faculties of the human psyche (c.f. the discussion of Evolutionary Psychology above), and empiricists, who believe that practically all knowledge we dispose of and use in everyday life is acquired from the environment. Being aware of the results of studies suggesting that children of very small age dispose of knowledge concerning basic relations among physical objects, or even of social and moral skills (Haidt 2012), we consider it unwise to label the nativist position as a priori invalid. On the other hand, being aware of the force with which processes like socialisation, acculturation and learning mould the psyche of an adult individual, we shall definitely consider as true the statement that « the topology of the semantic space represented within the cognitive system of a human individual can be optimized by supervised assimilation of knowledge encoded in the surrounding environment ». Notwithstanding the answer to the nature & nurture question in regard to the human faculty of categorization, the part of our Thesis devoted to « evolutionary models of concept construction » shall simply suggest that something like optimization of semantic spaces by means of evolutionary computing is, indeed, possible.

2.2. Ontogeny of formal categories (parts-of-speech)

Words of a language can also be partitioned into classes independently of their semantic content. For example, while there is practically no manifestly evident semantic feature shared by words like « apple » and « process », they can both be considered as belonging to the same category of « nouns ». The principal reason for this is the fact that within a sentence like, for example, « This apple makes me happy », one can freely substitute « process » for « apple » and still obtain a grammatically correct sentence. Sometimes the formal categories and the semantic categories partially overlap. Such is the case, for example, in many Indo-European languages, where one often finds « feminine » nouns marked with markers of one formal group and « masculine » nouns marked with markers of another group. An even more extreme case of such an « overlap » of semantic and formal categorization processes was observed among the Dyirbal aborigines of Australia, who use the same determiner « balan » (in a certain sense analogous to the German article « die ») in front of all nouns referring to « women, fire and dangerous things » (Lakoff 1990). In the modern linguistic tradition, however, semantic and formal categories are considered to be independent of each other. There exist multiple dimensions along which linguistic tokens can be categorized into formal classes. Most importantly, the membership of a word W in a class C can principally be inferred from : 1) its position in regard to other words ; 2) its morphology (i.e. its internal composition with all prefixes, word root, suffixes etc.)9. A minimal sketch of both feature types follows.
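The sketch below is our own toy illustration of these two sources of information : positional features (identities of the neighbouring words) and morphological features (here, final character n-grams standing in for suffixes).

def extract_features(sentence, i, suffix_len=2):
    word = sentence[i]
    return {
        "left": sentence[i - 1] if i > 0 else "<BOS>",            # word-order feature
        "right": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
        "suffix": word[-suffix_len:],                              # morphological feature
    }

sentence = ["this", "apple", "makes", "me", "happy"]
print(extract_features(sentence, 1))  # {'left': 'this', 'right': 'makes', 'suffix': 'le'}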
It is also important to realize that the same token can belong to many different categories at the same time, and that the relations between the categories themselves can be either inclusive, for « nested » categories, or « orthogonal ». Thus, for nested categories, the membership of, for example, the German token « die Schönheit » in the « gender » subcategory « feminine » immediately implies that it also belongs to the part-of-speech « noun ». On the other hand, the sole fact that it is « feminine » does not inform us whether it should be attributed to the « nominative » or the « accusative » subsubcategory of the grammatical subcategory « case ». Thus, the subcategories of « case » and « gender », while both being « nested » within the part-of-speech category « nouns », are orthogonal to each other10.

9 C.f. (Hromada 2014a) for a comparative study assessing the impact of morphology and word-order features upon POS-induction in Bulgarian, Czech, Estonian, Farsi, English, Hungarian, Polish, Romanian, Russian and Slovak.

10 The theoretical importance of the existence of this distinction in regard to current formal grammar models of natural languages shall be further elaborated in the full version of the Thesis.

On the most abstract level, linguistic tokens can be categorized into two principal 0-level formal categories of « functional » and « lexical » items. The set of functional items is closed, and it contains such parts-of-speech as determiners, conjunctions, pronouns and prepositions. On the other hand, the classes of « lexical items » are open and include meaning-carrying parts-of-speech like nouns, verbs, adverbs, adjectives etc. The study of (Shi et al. 1999) offers evidence that even newborn children (1-3 days old !) react differently to lexical and functional words and are thus « able to categorically discriminate these sets of words based on a constellation of perceptual cues that distinguish them ». Once children are able to distinguish functional words from lexical ones, the process of ontogeny of formal categories can proceed towards the development of part-of-speech categories. While it would definitely be mistaken to state that all languages of the world can be partitioned into & mapped upon the part-of-speech categories known from English or other Indo-European languages (i.e. nouns, adjectives, pronouns, verbs, adverbs, prepositions, conjunctions, interjections), linguists generally agree that some kind of « noun »-resembling and « verb »-resembling categories are to be observed in all systems of human verbal communication. It is undoubtedly the case that between birth and circa two years of age, prototypes of such part-of-speech clusters are being formed within the child’s cognitive system. This has to be so because, around the age of 2, children usually start to apply specific rules to specific items (i.e. start to conjugate verbs or decline nouns). Subsequently, the learning of much more subtle distinctions, related to grammatical categories like genus, casus and numerus for nouns, or modus, tempus etc. for verbs, can take place. For diverse case studies concerning the acquisition of formal categories, c.f. (Y. E. Levy, Schlesinger, & Braine, 1988). The acquisition of both semantic and formal linguistic categories is facilitated by so-called « variation sets » (VS). One observes a linguistic variation set whenever the same word or cluster of words occurs in identical or slightly varied form within multiple consecutive utterances.
Not only are nursery rhymes and lullabies filled with such « alternations in maternal self-repetitions » (Hoff-Ginsberg 1986) ; VS are also highly frequent in standard « motherese ». In Turkish, for example, VS seem to make up approximately 20% of child-directed speech (Küntay and Slobin 1996), and very similar proportions are also reported for the English language (Brodsky et al. 2007). Note that the notion of « variation set » can be interpreted in terms of evolutionary theory, given that :
• maternal self-repetition can be interpreted as a form of « replication in time », whereby every single utterance is considered to be an independent individual ;
• the alteration of form between subsequent utterances can be interpreted as the result of a variation operator influencing the mother’s production of new sentences.
In the context of our attempts to explain language development in terms of evolutionary theory and to suggest its validity by means of an evolutionary computation model, we find pertinent the insight that « the image that best characterizes the young language learner is that of a multilevel analyzer who is working with several types of analysis simultaneously, with different degrees of success, as learning progresses » (Levy 1988). It may be stated that the reason why categorization processes develop in the first place is the cognitive system’s tendency to optimize its functions and structures. As Maratsos put it : « Once the speaker hears just one grammatical use of a new word which suffices to identify its membership in a category, he can refer to the whole system of rules involving this category » (Maratsos 1988). Thus, both semantic as well as formal categories can reduce the cost of processing and storing information by and within the cognitive system.

2.3. Ontogeny of grammars (grammar induction)

The partitioning of words into grammatical categories can be useful only if it is accompanied by the development of grammatical rules which combine the members of diverse categories in order to produce meaningful sentences. We reiterate that, strictly formally, a grammar is defined as the tuple {N, T, P, S} where N denotes the set of non-terminals, T the set of terminals, S is a symbol which is a member of N, and P denotes the set of production rules that substitute elements of N by elements of N, T or their combinations. Within such a formal framework, the problem of partitioning words into diverse grammatical categories can be thought of as equivalent to the problem of the discovery of production rules which 1) associate members of T (words) with members of N (labels of distinct categories) ; 2) combine elements of N in order to produce new elements of N. In fact, the problems of the construction of formal categories and of the discovery of grammatical rules are mutually intertwined ; some researchers go even so far as to state : « Category symbols, whether in phrase structure rules or in the lexicon, are logically equivalent to the rules written on them, and as such are completely system-dependent : They are shorthand descriptions of the rule system as a whole. By anyone’s theory, young children’s linguistic system does not possess all the features of the endstate system. In other words, their language cannot be described by the same grammar as the adult system » (Ninio 1988).
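To make the tuple definition above concrete, the following toy sketch (the non-terminals, terminals and production rules are our own minimal examples) implements a grammar G = {N, T, P, S} together with the two rule types just mentioned, and uses it as a « generative device » :

import random

N = {"S", "NP", "V", "Det", "Noun"}
T = {"the", "a", "dog", "cat", "sleeps", "barks"}
P = {                                   # production rules
    "S": [["NP", "V"]],                 # combine elements of N
    "NP": [["Det", "Noun"]],
    "Det": [["the"], ["a"]],            # associate members of T with members of N
    "Noun": [["dog"], ["cat"]],
    "V": [["sleeps"], ["barks"]],
}

def generate(symbol="S"):
    if symbol in T:
        return [symbol]
    expansion = random.choice(P[symbol])    # pick one rule rewriting the symbol
    return [w for s in expansion for w in generate(s)]

print(" ".join(generate()))   # e.g. "a cat barks"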
In the literature, the development of language is often described as a process composed of three « stages », which can be further subdivided in the following manner : « Pregrammatical : a. Rote-learning – item-based acquisition is manifested in the use of formally unanalyzed units or chunks ; b. Initial modifications – formal alternations apply to a small number of highly familiar, good exemplars ; Structure-bound : c. Interim schemata – transitional or bridge strategies take the form of productive, but nonnormative rules ; d. Grammaticization – structure-bound rules are those of the endstate grammar ; Discourse-oriented : e. Convention and variety – grammatical rules are deployed with appropriate, discourse-sensitive lexical restrictions, stylistic alternations, usage conventions, register distinctions etc. » (Berman 1988). In our Thesis, we shall put aside the intricacies of the third, « Discourse-oriented » stage and shall focus on the « Pregrammatical » and « Structure-bound » stages. More concretely, we shall aim to explain the acquisition of words and word chunks in phase a. as the result of a « crossover » between structures present in the environment and structures represented within the cognitive system ; while the gradual emergence of categories and associated production rules, observable during phases b., c. and d., shall be explained not only in terms of an informatic crossover of structures present in the environment and represented in the cognitive system, but also as the result of purely internal replication, variation and decay, proper to the cognitive system, and resulting in a complexity-increasing « battle for resources » among the structures represented within it. We are convinced that by the introduction of such « cognitive-system internal variation operators » like « entropy-induced decay » (associated with the phenomenon of « forgetting ») and « structural merging » (associated with the phenomenon of « dreaming »), we can, for example, offer a very simple & natural yet effective solution to the so-called « overgeneralization »11 problem. Overgeneralizations of grammatical rules are most often observable in phases c. & d. (i.e. between 2 and 4 years of age), whenever the child applies a production rule beyond the scope of its validity. The most famous example of overregularization in English is that practically all children apply the rule V_past → V_present + ed to all verbs. Thus, especially during MLU stages 4 and 5 12, they generate past forms like « throwed » or « braked » which are not correct. What is fascinating about the problem of overregularization is not only that all children eventually start to employ the irregular past forms again, so that the errors are not reproduced anymore, but especially the fact that children often used the correct « irregular form » even before (i.e. in the one-word phases a. and b.), and only later converged to the incorrect overregularization : « Initially, children’s uses of -ed past tense are all accurate. They may say melted or dropped, but not, as they later do, runned and breaked » (Maratsos 1988). We see an important analogy between observations of such a sequence of correct/incorrect/correct behaviour and the general behaviour of evolutionary systems, which also often « reject » locally optimal solutions and descend into fitness landscape valleys in order to subsequently climb towards more optimal states.

11 According to the domain (formal, semantic), the problem is also sometimes named « overextension », « overregularization » or the problem of « overinclusive grammar ».

12 MLU means « Mean Length of Utterance » and is a measure traditionally used in developmental psycholinguistics for assessing a child’s linguistic performance at a given age. In the period when the child produces one-word utterances like « mama », « tato », MLU is considered to be 1 ; later, when the child starts to say two-word utterances like « mama nene », MLU increases towards 2, etc.
Thus, we believe that the term « conflict » present in the following principle can also be interpreted in an evolutionary sense : « Whenever a newly acquired specific rule (i.e. a rule that mentions a specific lexical item, like throw, make, allow, report) is in conflict with a previously learned general rule (i.e. a rule that would apply to that lexical item but also to many others of the same class), the specific rule eventually takes precedence » (Braine 1971). MacWhinney uses a similar term, « competition », to label his Competition Model of linguistic competence : « The competition model assumes that lexical elements and components to which they are connected can vary in their degree of activation. Activation is passed along connections between nodes. During processing, items are in competition with one another. In auditory processing …, in allomorphic processing …, in the processing of role relations, in polysemy …, the item that wins out in a given competition is the one with the greatest activation » (MacWhinney 1987). If one could interpret the last phrase of the above citation as « the component which has the greatest activation has the greatest fitness and thus the highest probability of being replicated within the cognitive system », one could consider MacWhinney’s connectionist model to be an evolutionary one, and thus pointing in our direction. But since that is not the case, and since it seems that MacWhinney’s model does not, at least not explicitly, involve any processes of replication, nor sources of random variation, nor does it explicitly work with « populations of grammars », we are obliged to look for another theoretical framework which could more easily integrate such notions. It may be the case that the so-called theories of « Grammar Systems » (Csuhaj-Varjú 1994) and « Language Colonies » (Kelemen and Kelemenová 1992) could furnish such a framework for our attempt to explain the ontogeny of grammar in a human individual as an evolutionary process. Both will be introduced in part 4 of this text.

3. Computational Models of Text Processing

The majority of the models and algorithms presented in this chapter are the results of the intellectual work of computational linguists working in the domain of « Natural Language Processing » (NLP). In NLP, one processes data encoding natural (human) languages with computational methods which often involve machine learning, data mining, information retrieval, statistical inference or artificial intelligence (AI) algorithms. Among the principal objectives of NLP one can include : 1) to allow machines to « understand » and|or work with meanings ; 2) to develop an autonomous artificial agent (Hromada, 2012) able to pass the Turing Test (Turing 2008) ; and 3) to elucidate, by means of computational simulations, possible ways how the human cognitive system treats natural language. The computational aim of our Thesis overlaps especially with NLP’s third objective. Such an aim brings with it many complex problems which are not easy to tackle ; thus, in order to reduce their amount and complexity, we shall reduce the notion of « Natural Language » to the notion of « text ».
It is true that in doing so, we shall completely ignore the phonetic, phonological and prosodic aspects of language, which have been, during practically all of human history, the principal way how human speakers encoded their messages in order to transfer them to other human hearers. It is only during the last few centuries that communication by means of text became prominent, and only within the last decades has it become dominant, mainly because of the increasing role of computers in our lives. This is at least partially so because computers are essentially machines built for the processing of sequences of discrete symbols, and that is what a text is : a sequence of discrete symbols. This stands in contrast to the flux of spoken language, which is also a sequence, but one composed of units whose boundaries are often unclear and whose features overlap.

3.1. Concept construction

We define the « concept construction » (CC) problem as an open-class variant of the « classification » or « categorization » problem. In the classical, « closed-class » categorization problem, the objective is to assign a label denoting membership in a category C1 to a set of objects disposing of a particular combination of properties (also called « features » in the AI community), and to assign to categories C2, C3 etc. other objects disposing of different features. The problem of « binary classification », where only two categories are involved, is well studied, and dozens of diverse algorithms exist which allow one to train, in a machine learning scenario, classification models (« classifiers ») which will subsequently quite successfully classify those objects of the « testing set » which were absent from the « training set ». In NLP, one often solves the classification problem by means of so-called « Support Vector Machines » (SVM). During the training of an SVM, the algorithm tries to discover a hyperplane « that has the largest distance to the nearest training data point of any class » (Vapnik et al. 1997). SVMs belong to the group of « linear classifiers », which all base their classification decisions on linear combinations of the characteristics (features) of the objects-to-be-classified. Other machine learning algorithms as diverse as Linear Discriminant Analysis, Naive Bayes classifiers, logistic regression or the perceptron also belong to the group of linear classifiers. « Multiple class » variants of these algorithms also exist, allowing for the classification of objects into more than 2 categories. In the case of all these algorithms, however, all the classes-to-be-looked-for are known in advance ; the datapoints in the training set are labeled with labels belonging to a finite set, and after the training, during the testing phase, one’s objective is simply to attribute the correct label to a new object. While the object itself was most probably not present in the training set and is thus « new », the finite set of all class/category labels-to-be-attributed is well known from the very beginning of the training. In this sense, all the algorithms mentioned above address the closed-class variant of the classification problem. On the contrary, in the open-class variant of the classification problem, one can potentially be asked, in the testing phase, to attribute to an object which was not present during the training phase a label which was also not present in the training phase. In other terms, in the open-class variant of the classification problem, one knows in advance neither the number nor even the nature of the categories which are to be constructed.
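Before turning to the concrete models, a schematic sketch may clarify the open-class setting (the vectors below are random stand-ins for learned ones, and the feature and label names are our own) : because objects and labels live in one common vector space, a label never seen as a label during training can still be attributed, as long as its constituent features were observed.

import numpy as np

rng = np.random.default_rng(1)
feature_vectors = {f: rng.normal(size=16) for f in
                   ["genetic", "algorithm", "syntax", "corpus", "neuron"]}

def embed(tokens):
    # linear combination of known feature vectors; unknown tokens ignored
    known = [feature_vectors[t] for t in tokens if t in feature_vectors]
    return np.mean(known, axis=0)

article = embed(["genetic", "algorithm", "corpus"])
labels = {"evolutionary computing": embed(["genetic", "algorithm"]),
          "neurolinguistics": embed(["neuron", "syntax"])}  # label unseen as such in training

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(max(labels, key=lambda l: cos(article, labels[l])))  # closest label wins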
3.1.1. Non-evolutionary model of CC

One possible way to address the problem of Concept Construction (which we consider to be an instance of the « open-class classification problem » as defined above) is described as follows :
1. During the (train|learn)ing phase, use the training corpus to create a D-dimensional semantic vector space, i.e. attribute vectors of length D to all members of the set of entities E (word fragments, words, documents, phrases, patterns) which includes everything observable within the training corpus.
2. During the testing phase :
2.1 characterize the object (text) O by a vector $\vec{o}$ calculated as a linear combination of the vectors of those features which are observable in O and whose vectors were learned during the training phase ;
2.2 characterize the labels-to-be-attributed L1, L2, … by vectors $\vec{l}_1, \vec{l}_2$, … ;
2.3 associate the object O with the closest label. In case we use the cosine metric, we minimize the angle between $\vec{o}$ and the label vectors, i.e. we look for $\arg\max_x \cos(\vec{o}, \vec{l}_x)$.
Note that in order to make this approach functional, two important conditions have to be fulfilled. Primo, the vectors associated to the entities observable within the training corpus must be commensurable, i.e. they have to be of the same dimensionality and be members of the same vector space. Secundo, the set of all entities E observed during learning has to be sufficiently exhaustive, so that potentially any novel label or object which appears during the testing phase can be at least partially characterized in terms of the members observable during the training phase. The first condition of « entity commensurability » is not fulfilled by many vector space models, which often yield multiple spaces for entities of different « types ». In such models, « word » entities are often encoded as rows of a matrix, while « context » or « document » entities, i.e. the entities within which the word entities occur, are encoded as columns of the same matrix, or are encoded in a completely different matrix. On the contrary, algorithms like Random Indexing (RI) or Reflective Random Indexing (RRI) construct semantic vector spaces from the initial textual corpora in such a way that everything they encounter, be it syllables, words or whole documents, is ultimately represented as rows of the same matrix. RI and RRI also have other advantages, which are more closely described elsewhere (Sahlgren 2005; Cohen et al. 2010; Hromada 2013b). For the purpose of this text, let us just underline the fact that both RI and RRI can be quite computationally efficient, since they are able to « project » the semantic relations hidden in the text upon a vector space of restrained dimensionality. Theoretically, this is permitted by the so-called Johnson-Lindenstrauss lemma, stating that « a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved » (Johnson and Lindenstrauss 1984).

Figure 6: Description of the DEFT2012 system for automatic attribution of keywords to scientific articles. Figure reproduced from poster.

In 2012, a hybrid system with an RRI semantic component at its very core was deployed in the francophone datamining competition DEFT2012 (El Ghali et al. 2012). The goal of the competition was to create an automatic NLP system which would be able to attribute to scientific articles the same keywords as were attributed by their authors.
In other terms, the goal was to artificially simulate the cognitive activity of « attributing a conceptual label » to a scientific article. The tricky thing about the problem was that it was not a standard « closed class » classification problem, but indeed an « open class » problem, since there were many keyword labels which were not present in the training set, yet were to be attributed in the testing scenario. Figure 6 illustrates the relations among the diverse components of this hybrid system. As may easily be seen, a whole « artillery » of diverse NLP tools like POS-taggers, lemmatizers and chunkers was deployed in order to yield a sufficiently exhaustive set of features, from which two distinct semantic spaces were composed by means of RRI. The resulting semantic spaces were subsequently post-optimized by combining probabilistic Bayesian Networks and production rules. In the first, simpler task of the DEFT2012 competition, the system attained an F-score of 94.8%. The task was simpler because a list of candidate labels was furnished with the training corpus, and subsequently another list of candidate keywords was furnished with the testing corpus. The system attained an F-score of 58.7% in the second, more difficult task, where no such lists were given. In both tasks it outperformed the systems deployed by the other 9 participants of the competition.

3.1.2. An evolutionary model of CC

Task 4 of the 2014 edition of the datamining competition Défi en Fouille Textuelle (DEFT) was understood as an instance of the classification problem with an open number of classes. More concretely, the challenge was to create an artificial system which would be able to attribute a specific member of the set of all class labels to the scientific articles of the testing corpus. A training corpus of 208 scientific articles, presented in diverse sessions of diverse editions of the annual TALN/RECITAL conference, was furnished to facilitate the training of the model. To solve this problem, we have proposed an algorithm consisting of two nested components, as represented in Figure 7. The inner component, which we call Reflective Space Indexing (RSI), is responsible for the construction of the vector space. Its input is a genotype, i.e. the list of D features which trigger the whole reflective process ; its output, a phenotype, is a D-dimensional vector space consisting of vectors for all features, objects (documents) and classes. The inner component is « reflective » in the sense that it multi-iteratively characterizes not only objects in terms of their associated features, but also features in terms of their associated objects. RSI’s principal parameter is the number of dimensions of the resulting space (D). The input of RSI is a vector of length D whose D elements denote the D « triggering features », the initial conditions to which the algorithm is sensitive in the initial iteration. After the algorithm has received such an input, it subsequently characterizes every object O (document) by a vector of values which represent the frequencies of the triggering features in the object O. Initially, every document is thus characterized as a sort of bag-of-triggering-features vector. Subsequently, the vectors of all features (i.e. not only the triggering ones) are calculated as the sum of the vectors of the documents within which they occur, and a new iteration can start. In it, the initial document vectors are discarded and new document vectors are obtained as the sum of the vectors of the features which are observable in the document. The whole process can be iterated multiple times until the system converges to a stationary state, but it is often the second and the third iteration which yield the most interesting results. Note also that what applies to features and objects applies, mutatis mutandis, also to class labels.
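The following fragment is a sketch of the reflective iteration just described, reconstructed from the prose above ; the function and parameter names are ours, and the handling of class labels is omitted for brevity.

import numpy as np

def rsi(documents, triggering_features, iterations=2):
    # iteration 0: each document = counts of the D triggering features
    doc_vecs = {d: np.array([doc.count(f) for f in triggering_features], float)
                for d, doc in documents.items()}
    feat_vecs = {}
    all_features = {f for doc in documents.values() for f in doc}
    for _ in range(iterations):
        # features = sum of the vectors of the documents in which they occur
        for f in all_features:
            feat_vecs[f] = sum(doc_vecs[d] for d, doc in documents.items() if f in doc)
        # documents = sum of the vectors of their (all, not only triggering) features
        for d, doc in documents.items():
            doc_vecs[d] = sum(feat_vecs[f] for f in doc)
    return doc_vecs, feat_vecs

documents = {"doc1": ["gene", "mutation", "fitness"],
             "doc2": ["grammar", "syntax", "gene"]}
doc_vecs, feat_vecs = rsi(documents, triggering_features=["gene", "grammar"])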
Figure 7: Diagram of the DEFT2014 model, embedding the construction of semantic spaces within an evolutionary framework.

For the purposes of DEFT2014, every individual RSI run consisted of 2 iterations and yielded a 200-dimensional space. The enveloping outer component is a fairly simple evolutionary algorithm whose task is to find the most « fit » combination of features for performing the classification task. In every « generation », the evolutionary component injects multiple individual lists of triggering features (i.e. « genomes ») into the inner component and subsequently evaluates the fitness of the resulting vector spaces. It then mutates, selects and crosses over the genotypes which had yielded the vector spaces wherein the classification was most precise. The evolutionary component of the system was conceived as a sort of feature selection mechanism. The objective of the optimization was to find such a genotype, i.e. such a list of « triggering features », as would subsequently lead to the discovery of a most classification-friendly vector space. As is common in the evolutionary computing domain, the whole process was started by the creation of a random population of individuals. Each individual was fully described by a genome composed of 200 genes. Initially, every gene is assigned a value randomly chosen from the pool of 5849 feature types observable in the training corpus. In DEFT2014’s Task 4 there were thus 5849^200 possible individual genotypes one could potentially generate, and we consider it important to underline that the classificatory performance of the phenotypes, i.e. of the vector spaces generated by RSI from the genotypes, can also vary substantially. What is more, our observations indicate that by submitting the genotypes to evolutionary pressures, i.e. by discarding the least « fit » genomes and promoting, varying and replicating the most fit ones, one also augments the classificatory performance of the resulting phenotypical vector space. In other terms, the search for a vector space which is optimal in regard to subsequent partitioning or clustering can be accelerated by means of evolutionary computation. During the training, the evaluation of the fitness of every individual in every generation proceeded in the following manner :
• pass the genotype as an input to RSI (D=200, I=2)
• within the resulting vector space, calculate the cosines between all document and class vectors
• attribute the N documents with the highest score to every class label (N was furnished for both the testing and the training corpus)
• calculate the precision in regard to the training corpus golden standard ; this precision is considered to be equivalent to the individual’s fitness.
The size of the population was 50 individuals. In every generation, after the fitness of all individuals had been evaluated, 40% of the new individuals were generated from the old ones by means of a one-point crossover operator, whereby the probability of an individual being chosen as a parent was proportional to the individual’s fitness. The rest of the new population was generated from the old one by a combination of fitness-proportionate selection and mutation occurring with 0.01 probability. Mutation was implemented as the replacement of a value in a genome by another value, randomly chosen from the pool of 5849 feature types.
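A schematic reconstruction of this outer loop follows ; the helper precision_of is a placeholder standing in for the whole RSI-plus-cosine-attribution pipeline, and the feature names are synthetic, so this is a sketch of the described procedure rather than the actual DEFT2014 code.

import random

POOL = [f"feat{i}" for i in range(5849)]   # observable feature types
GENOME_LEN, POP_SIZE = 200, 50
CROSSOVER_SHARE, MUTATION_RATE = 0.4, 0.01

def precision_of(genome):                  # placeholder fitness (= precision)
    return random.random()

def select(population, fitnesses):         # fitness-proportionate selection
    return random.choices(population, weights=fitnesses, k=1)[0]

population = [[random.choice(POOL) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for generation in range(30):
    fitnesses = [precision_of(g) for g in population]
    offspring = []
    while len(offspring) < POP_SIZE * CROSSOVER_SHARE:     # 40%: one-point crossover
        mom, dad = select(population, fitnesses), select(population, fitnesses)
        cut = random.randrange(1, GENOME_LEN)
        offspring.append(mom[:cut] + dad[cut:])
    while len(offspring) < POP_SIZE:                       # rest: selection + mutation
        clone = list(select(population, fitnesses))
        offspring.append([random.choice(POOL) if random.random() < MUTATION_RATE
                          else gene for gene in clone])
    population = offspring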
Advanced techniques like parallel evolutionary algorithms or parameter auto-adaptation were not used in the study. The algorithm succeeded in optimizing the vector space generated from the training corpus, attaining a precision of 87%. However, the resulting model over-fit the training corpus and failed to be fully transferable to the testing corpus. Possibly due to an implementation error (c.f. (Hromada 2014b) for a closer discussion), the model thus achieved only 27% precision when confronted with the testing data. While definitely more performant than a random baseline, our approach was the least performant among the 5 participants of DEFT2014. Notwithstanding the failure of our model in DEFT2014, we consider important our observation that « by evolutionary selection of the chromosome of features which initially « trigger » the reflective process one can, indeed, optimize the topology and hence the classification performance of the resulting vector space » (Hromada 2014b).

3.2. Part-of-speech induction and part-of-speech tagging

The term part-of-speech induction (POS-i) designates the process which endows a human or an artificial agent with the competence to attribute POS labels (like “verb”, “noun”, “adjective”) to any linguistic token observable in the agent’s linguistic environment. POS-i can be understood as a « partitioning problem », since one’s objective is to partition the initial set of all tokens occurring in a corpus C (which represents the agent’s linguistic environment E) into N subsets (partitions, clusters) whose members would correspond to the grammatical categories defined by the gold standard. Because one does not use any information about the « ideal » gold standard grammatical categories during the training phase, and uses it only for the final evaluation of the performance of the model, POS-i is considered to be an « unsupervised » machine learning problem. POS-i’s « supervised » counterpart is the problem of POS-tagging. In POS-tagging, one trains the system by serving it, during the training phase, a sequence of couples (word W, tag T), where the tag T is the label denoting the grammatical category to which the word W belongs. POS-tagging is thus simpler than POS-i, where no information about the ideal labels is furnished during the learning. The training of POS-tagging systems is of particular importance especially for languages in which many word forms can potentially belong to many part-of-speech categories (in English, for example, almost any noun can also play the role of a verb ; a token like « still » can be interpreted as a substantive, verb, adjective and even adverb, its POS category being determined by its context). On the contrary, in morphologically rich languages, where such « homonymy of forms » is present to a lesser degree and the relations between word types and classes are less ambiguous, one can often train a POS-tagging system simply by memorizing an exhaustive list of (W, T) couples.

3.2.1. Non-evolutionary models of POS-i

The paradigm currently dominating the POS-i domain was fully born with the article published by Brown et al. in 1992. Brown and his colleagues applied the information-theoretic notion of « mutual information »,

$M(w_1 w_2) = \log \frac{\Pr(w_1 w_2)}{\Pr(w_1)\,\Pr(w_2)}$,

upon all bigrams (i.e. sequences of two words) composed of tokens w_1, w_2, and subsequently devised a merging algorithm able to group words into classes in such a way that the mutual information within a class would be maximized.
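As a toy illustration of this quantity (the miniature corpus below is our own, and probabilities are estimated by simple relative frequencies), one can compute the mutual information of any bigram directly from unigram and bigram counts :

import math
from collections import Counter

corpus = "the dog barks the dog sleeps the cat sleeps".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni, n_bi = len(corpus), len(corpus) - 1

def mutual_information(w1, w2):
    # M(w1 w2) = log( Pr(w1 w2) / (Pr(w1) * Pr(w2)) )
    p_bigram = bigrams[(w1, w2)] / n_bi
    p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_bigram / (p1 * p2))

print(mutual_information("the", "dog"))   # positive: "the dog" is a cohesive bigram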
In the two decades since the publication of the study of Brown et al., their approach has inspired hundreds of studies : be it hidden Markov Models tweaked with variational Bayes (Johnson, 2007), Gibbs sampling (Goldwater & Griffiths, 2007), morphological features (Berg-Kirkpatrick, Bouchard-Côté, DeNero, & Klein, 2010; Clark, 2003) or graph-oriented methods (Biemann, 2006) : all such approaches, and many others, consider the co-occurrence of words within n-gram sequences to be the primary source of relevant information for the subsequent creation of part-of-speech clusters. In all these models, one aims to discover the ideal parameters of Markovian statistical models, often employing the so-called Expectation-Maximization (EM) algorithm to discover the optimal partitioning. Unfortunately, EM is unable to quit locally optimal states once they have been discovered. Notwithstanding this disadvantage, the comparative study of (Christodoulopoulos et al. 2010) suggests that probabilistic models of part-of-speech induction can indeed be very performant. POS-i can also be realized by means of the k-means clustering algorithm, or one of its variants. The K-means algorithm (Karypis 2002) partitions N observations, described as vectors in a D-dimensional space, into K clusters by attributing every observation to the cluster with the nearest centroid (i.e. mean). If one considers these centroids to denote prototypes of the categories in whose centers they are located, then one can consider the k-means algorithm to be consistent with the « prototype theory of categorization » as proposed by Rosch. Table 1 illustrates a simple K-means partitioning of the tokens present in the English version of Orwell’s 1984.

Table 1. K-means clustering of tokens according to both suffixal and co-occurrence information. Table partially reproduced from (Hromada 2014c).

Cluster :   0     1     2     3     4     5     6
Noun    :   10    568   97    13    1173  608   1977
Verb    :   3     67    668   1011  67    958   97

In this example case, we have clustered all the tokens observable in the corpus into 7 clusters according to features both internal to the token (i.e. suffixes) and external (i.e. co-occurrence with other tokens). Note that even in such a simple model, where no machine learning or optimization was performed, the K-means algorithm somehow succeeds in distinguishing verbs from nouns. As is shown in Table 1, whose rows represent the “gold standard” tags and whose columns denote the artificially induced clusters, even such a naïve computational model has assigned 83.6% of the nouns to clusters 1, 4 and 6, while assigning 91.8% of the verbs to clusters 2, 3 and 5.
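A sketch in the spirit of this experiment follows (the tokens and features are our own toy choices, and scikit-learn’s KMeans is assumed to be available ; the real study used richer suffixal and co-occurrence features from a full corpus) :

import numpy as np
from sklearn.cluster import KMeans

tokens = ["walked", "talked", "jumped", "dogs", "cats", "tables"]

def featurize(word):
    return [word.endswith("ed"),   # suffixal feature suggesting a past verb
            word.endswith("s"),    # suffixal feature suggesting a plural noun
            len(word)]             # crude stand-in for further features

X = np.array([featurize(t) for t in tokens], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(tokens, labels)))   # verb-like and noun-like tokens separate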
3.2.2. Evolutionary models of POS-i & POS-t

The usage of evolutionary computing in NLP is, in comparison with other methods like neural networks, Hidden Markov Models, Conditional Random Fields or SVMs, still very rare. This is also the case for NLP’s sub-problem of part-of-speech tagging, and thus we are aware of only one attempt to use genetic algorithms to train a part-of-speech tagger : in (Araujo 2002), a system of POS-t involving crossover and mutation operators is proposed. What is particularly interesting about Araujo’s system is that a separate evolutionary process is run for every separate sentence of the test corpus. The training corpus, on the other hand, serves mainly as a source of statistical information concerning the co-occurrences of diverse words and tags in diverse word & tag contexts. This information concerning the « global » statistical properties of the training corpus is later exploited in the computation of fitness. Let us take, for example, the phrase « Ring the bell ». Since words like « ring » and « bell » are in English sometimes used as verbs and sometimes as nouns, such a sentence can be tagged in at least 4 different ways :

N D13 N
V D V
N D V
V D N

Such sequences of tags constitute the individual members of Araujo’s initial population of chromosomes. In languages like English, where almost every word can be attributed to more than one POS category and the number of possible tag sequences therefore increases with the length of the phrase-to-be-tagged, one will most probably be obliged to choose such initial individuals randomly. The fitness of every individual possibly tagging a sentence of n words is subsequently calculated as the sum of the accuracies of the tags (genes) on positions i :

$F = \sum_{i=0}^{n} f(g_i)$

The accuracy of an individual gene g_i is calculated as :

$f(g_i) = \log \left( \frac{context_i}{all_i} \right)$

whereby the values of context_i and all_i are extracted from the training table which was constructed during the training phase, and represent the frequency of occurrence of the word w_i within specific (context_i) and all (all_i) contexts. Once fitness is evaluated, fitness-proportional crossing-over (50%) and mutation (5%) are realized.

13 We denote, by the non-terminal symbol D, the category of « determiners », to which the article « the » also belongs.

Notwithstanding the fact that Araujo does not seem to have used any other selection mechanism, in less than 100 generations the populations seemed to converge to sequences of tags which were more than 95% correct in regard to the gold standard. This is a result comparable to other POS-tagging systems, but obtained at a lower computational cost. It is also worth noting that Araujo’s experiments indicate that working solely with the contextual window WL, W, WR, i.e. just looking one word to the left and one word to the right, seems to yield, in the case of POS-tagging of English, higher scores than extracting data from larger contextual spans. When it comes to the « unsupervised » variant of the POS-t problem, id est the problem of part-of-speech induction, up to this date there have been, as far as we know, no attempts to address the POS-i problem by means of evolutionary computing. For this reason, and because we see strong analogies between the problems of CC and POS-i, our Thesis shall aim to solve this problem with a model similar to the one which we have presented in part 3.1.2 of this work.

3.3. Grammar induction

The input of the Grammar Induction (GI) process is a corpus of sentences written in a language L ; its output is, ideally, a grammar (i.e. a tuple G = {S, N, T, P} as defined in the above chapters) or at least a model able to generate the sentences of L, including such sentences as were not present in the initial training corpus. The nature of the resulting grammar is closely associated with the content of the initial corpus, as well as with the nature of the inductive (learning) process. According to their « expressive power », all grammars can be located somewhere on a « specificity – generality » spectrum. On one extreme of the spectrum lies the grammar having the following production rules :

1 → 2*
2 → a | b | c | … | Z

whereby * means « repeat as many times as you want ».
This very compact grammar can potentially generate any text of any size, and as such it is very general. But exactly because it can accept any alphabetic sequence, and thus does not have any « discriminatory power » whatsoever, such a grammar is completely useless as an explication of the system of any natural language. On the other extreme lies a completely specific grammar which has just one rule :

1 → C

where C stands for the full, literal content of the corpus. This grammar contains exactly what the corpus C contains and is thus not compact at all (it is even two symbols longer than C). Such a grammar is not able to encode anything other than the sequence which was literally present in the training corpus, and is therefore also useless for any scenario where novel sentences are to be generated (or accepted). The objective of the GI process is to discover, departing solely from the corpus C (which is written in the language L), a grammar which is neither too specific nor too general. If it is too general, it shall « overgeneralize », i.e. it shall be able to generate (or accept) sentences which would not be considered grammatically correct by a common speaker of L. If it is too specific, it shall not be able to represent all the sentences contained in C or, if it shall, it shall not be able to generate (or accept) any sentence which is considered to be a sentence of L but was not present in the initial training corpus C.

3.3.1. Non-evolutionary models of grammar induction

One of the first serious computational models of GI is the « Syntagmatic – Paradigmatic » (SNPR) model presented in (Wolff 1988). Its core algorithm is presented in Table 2.

TABLE 2. Outline of Processing in the SNPR Model (reproduced from Wolff, 1988)
1. Read in a sample of language.
2. Set up a data structure of elements (grammatical rules) containing, at this stage, only the primitive elements of the system.
3. WHILE there are not enough elements formed, do the following sequence of operations repeatedly:
BEGIN
3.1 Using the current structure of elements, parse the language sample, recording the frequencies of all pairs of contiguous elements and the frequencies of individual elements. During the parsing, monitor the use of PAR elements to gather data for later use in the rebuilding of elements.
3.2 When the sample has been parsed, rebuild any elements that require it.
3.3 Search amongst the current set of elements for shared contexts and fold the data structures in the way explained in the text.
3.4 Generalize the grammatical rules.
3.5 The most frequent pair of contiguous elements recorded under 3.1 is formed into a single new SYN element and added to the data structure. All frequency information is then discarded.
END

We consider the SNPR model to be of particular importance because of its aim to explain the process of Grammar Induction as a sort of cognitive optimization : « The central idea in the theory is that language acquisition and other areas of cognitive development are, in large part, processes of building cognitive structures which are in some sense optimal for the several functions they have to perform » (Wolff 1988). Wolff also associates his « cognitive optimization hypothesis » with the « law of cumulative complexity » postulated in the study (Brown 1973), which is considered to be one of the great classics of the language development literature : « if one structure contains everything that another structure contains and more, then it will be acquired later than that other structure » (Wolff 1988). A grammar resulting from such a contact between a language sample and the SNPR inducing mechanism is displayed in figure 7.
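The compression step 3.5 of the outline above can be illustrated in a few lines (our own minimal sketch, not Wolff’s implementation) : the most frequent pair of contiguous elements is repeatedly replaced by a new SYN element, shrinking the description of the sample.

from collections import Counter

def compress(sequence, n_new_elements=2):
    rules = {}
    for k in range(n_new_elements):
        pairs = Counter(zip(sequence, sequence[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent contiguous pair
        symbol = f"SYN{k}"                     # new SYN element
        rules[symbol] = (a, b)
        out, i = [], 0
        while i < len(sequence):               # rewrite the sample using it
            if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) == (a, b):
                out.append(symbol); i += 2
            else:
                out.append(sequence[i]); i += 1
        sequence = out
    return sequence, rules

sample = list("abcabcabd")
print(compress(sample))   # e.g. ('a','b') -> SYN0, then ('SYN0','c') -> SYN1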
In Wolff’s theory, optimization is further understood as compression. Within the SNPR model, such compression is realized in part 3.5 of the algorithm, where the most frequent pair of contiguous elements (either terminals or non-terminals) is replaced by a new non-terminal symbol. For this reason, the size of the grammar able to generate the initial language sample ideally decreases with every cycle of the model’s « while » loop, until the process converges to a state where there is no redundancy left to « compress ».

Figure 7: Grammar induced by the SNPR model. Figure reproduced from (Wolff, 1988).

Wolff proposes that Grammar Induction is a process which should maximize the coding capacity (CC) of the resulting grammar while minimizing its size 14. He defines the ratio CC/MDL as the grammar’s efficiency, and it may be the case that within a more evolutionary framework, where one would work with populations of grammars, a very similarly defined notion of efficiency could be used as the core component of the fitness function. Unfortunately, Wolff’s 1988 SNPR model is not evolutionary, since it involves neither stochastic factors nor the notion of multiple candidate solutions. Wolff’s SNPR is simply confronted with the language sample, deterministically compresses redundancies in a way that sometimes resembles human grammar (and sometimes does not), subsequently gets stuck in a local optimum, and there is no way to get out of it. Another famous model of GI is that of (Elman 1993). Contrary to Wolff’s algorithm, which is principally « symbolic », Elman’s model is a « connectionist » one. More concretely, Elman succeeded in training a simple recurrent neural network which was « trained to take one word at a time and predict what the next word would be. Because the predictions depend on the grammatical structure (which may involve multiple embeddings), the prediction task forces the network to develop internal representations which encode the relevant grammatical information. » (Elman 1993). The most important finding of Elman’s study seems to be the evidence for the so-called « less is more » hypothesis (Newport 1990), which Elman himself labels with the phrase « the importance of starting small » : « Put simply, the network was unable to learn the complex grammar when trained from the outset with the full “adult” language. However, when the training data were selected such that simple sentences were presented first, the network succeeded not only in mastering these, but then going on to master the complex sentences as well. » (Elman 1993). Something similar occurred also when he tuned the capacity of the « internal memory » of his networks rather than the corpus itself. Elman observed : « If the learning mechanism itself was allowed to undergo “maturational changes” (in this case, increasing its memory capacity) during learning, then outcome was just as good as if the environment itself had been gradually complicated. » Thus, not only do the results of Elman’s computational model point in the same direction as many developmental and psycholinguistic studies of « motherese » (c.f. the citations from Harris in part 2 of this work) ; they also show the importance of gradual physiological changes for the ultimate mastering of the maternal language. He goes even so far as to state that the prolonged infancy of human children may go hand in hand with the fact that only humans develop language to the extent we do :

14 In current research, it is more common to speak about a grammar’s Minimal Description Length (MDL).
« In isolation, we see that both learning and prolonged development have characteristics which appear to be undesirable. Working together, they result in a combination which is highly adaptive » (Elman 1993). Notwithstanding these interesting results, which are not to be underestimated, we see two disadvantages of Elman’s approach. Primo, as is often the case for connectionist neural networks, the resulting model is somewhat difficult to interpret : given the training constraints mentioned above, the network seems to predict quite well the next word in the phrase, but it is not evident why it does what it does. Elman himself dedicates a major part of his article to descriptions of his attempts to understand how his « blackbox » functions. Secundo, Elman confronted his model only with artificial corpora, i.e. corpora generated from manually created grammars. Thus, his model accounts only for a limited subset of the properties of one language (English) and as such is still quite far from a full-fledged solution to the problem of natural-language GI. The last model we present in this brief overview, called « Automatic Distillation of Structure » (ADIOS), seems to be touched by this second disadvantage to a lesser extent since, as its authors state : « In grammar induction from large-scale raw corpora, our method achieves precision and recall performance unrivaled by any other unsupervised algorithm. It exhibits good performance in grammaticality judgment tests (including standard tests routinely taken by students of English as a second language) and replicates the behavior of human subjects in certain psycholinguistic tests of artificial language acquisition. Finally, the very same algorithmic approach also is proving effective in other settings where knowledge discovery from sequential data is called for, such as bioinformatics. » (Solan et al. 2005). ADIOS is a graph-based model. It considers each sentence to be a path in a directed pseudograph (i.e. loops and multiple edges are allowed), each sentence being delimited by special « begin » and « end » vertices. Every lexical entry (i.e. every word type) is also a vertex of the graph ; thus, if two or more sentences share the same word X, their paths cross at the vertex VX ; if they contain the same subsequence XY, their paths share the common subpath (edge) VXVY, etc. The authors of ADIOS describe their algorithm as follows : « The algorithm generates candidate patterns by traversing in each iteration a different search path (initially coinciding with one of the original corpus sentences), seeking subpaths that are shared by a significant number of partially aligned paths. The significant patterns (P) are selected according to a context-sensitive probabilistic criterion defined in terms of local flow quantities in the graph... Generalizing the search path, the algorithm looks for an optional equivalence class (E) of units that are interchangeable in the given context [i.e., are in complementary distribution]. At the end of each iteration, the most significant pattern is added to the lexicon as a new unit, the subpaths it subsumes are merged into a new vertex, and the graph is rewired accordingly... The search for patterns and equivalence classes and their incorporation into the graph are repeated until no new significant patterns are found. » (Solan et al. 2005). In other terms, ADIOS starts with a so-called Motif Extraction (MEX) procedure which looks for bundles of the graph’s subpaths which obey certain conditions.
The authors of ADIOS describe their algorithm as follows : « The algorithm generates candidate patterns by traversing in each iteration a different search path (initially coinciding with one of the original corpus sentences), seeking subpaths that are shared by a significant number of partially aligned paths. The significant patterns (P) are selected according to a context-sensitive probabilistic criterion defined in terms of local flow quantities in the graph... Generalizing the search path, the algorithm looks for an optional equivalence class (E) of units that are interchangeable in the given context [i.e., are in complementary distribution]. At the end of each iteration, the most significant pattern is added to the lexicon as a new unit, the subpaths it subsumes are merged into a new vertex, and the graph is rewired accordingly... The search for patterns and equivalence classes and their incorporation into the graph are repeated until no new significant patterns are found. » (Solan et al. 2005). In other terms, ADIOS starts with a so-called Motif Extraction (MEX) procedure which looks for bundles of the graph’s subpaths obeying certain conditions. Once such « patterns » are found, they are subsequently « substituted » by non-terminal symbols and the graph is « rewired » to incorporate the newly constructed non-terminals. Such a « pattern distillation » procedure of generalization bootstraps itself until no further rewiring is possible. The output of the whole process is a rule grammar combining patterns (P) and their equivalence classes (E) into rules, able to generate even phrases which were not present in the initial corpus. An example of how ADIOS progressively discovers more and more abstract combinatorial patterns is presented in Figure 8.

Figure 8: Equivalence classes and production rules induced from English language samples by the ADIOS algorithm. Figure reproduced from (Solan et al. 2005)

ADIOS is undoubtedly one of the most performant GI systems currently in existence. It combines statistical, probabilistic and graph-theoretic notions with the notion of a rule-based grammar, and as such it is also of great theoretical interest. On the other hand, ADIOS does not involve any source of stochasticity ; it seems to be purely deterministic and as such incapable of dealing with the highly probable convergence towards locally optimal grammars. In confrontation with some partial corpora this may not cause any problems but, we predict, without any stochastic variation whatsoever, ADIOS cannot account for more than a few « advanced », real-life properties of natural languages, and as such it may share the destiny of the SNPR model.

3.3.2. Evolutionary models of grammar induction

Multiple authors have proposed to solve the GI problem with different variants of evolutionary computing. In the following paragraphs we shall describe five different approaches :

1) Tomita’s (1982) hill-climbing induction of finite state automata
2) Dupont’s (1994) GIG method for inference of regular languages
3) Evolution of stochastic context-free grammars as presented by Keller & Lutz (Keller and Lutz 1997)
4) The evolutionary method of (Aycinena et al. 2003), inducing grammars from POS tags of nine different English language corpora
5) The genetic algorithm of Smith & Witten (Smith and Witten 1995) for inducing a LISP s-expression grammar from a simple corpus of English sentences

Tomita’s 1982 paper can be considered one of the first empirical studies of grammatical inference. The study focused on the inference of grammars of 14 different regular languages – often called « Tomita languages » in the subsequent literature – by means of deterministic finite state automata. Tomita first encoded any possible finite state machine with n states in the following manner :

((A1, B1, F1) (A2, B2, F2) ... (An, Bn, Fn))

whereby every block « (Ai, Bi, Fi) corresponds to the state i, and Ai and Bi indicate the destination states of the 0-arrow and the 1-arrow from the state i, respectively. If A or B is zero, then there is no 0-arrow or 1-arrow from the state i, respectively. Fi indicates whether state i is one of the final states or not. If Fi is equal to 1, the state i is one of the final states. The initial state is always state 1 » (Tomita, 1982). Thus, for example, the string ((1 2 1) (3 1 1) (4 0 0) (3 4 1)) encodes the finite state automaton illustrated in Figure 9.

Figure 9: Finite state automaton matching all strings over (1 + 0)* without an odd number of consecutive 0's after an odd number of consecutive 1's. Figure reproduced from (Tomita 1982)

Such an encoding allowed Tomita to subsequently apply his hill-climbing approach. Hill-climbing can be considered a precursor of the more developed genetic programming, since it employs both random mutations, to explore the surrounding search space, and a sort of selection, which always prefers to use, in the following iteration of the algorithm, those individual solutions for which the value of the evaluation function E increases. Tomita’s definition of E is very simple :

E = r − w

« where r is the number of strings in the right-list accepted by the machine, and w is the number of strings in the wrong-list accepted by the machine ». The right-list is a positive sample corpus while the wrong-list is the negative sample. Thus, if a random mutation transforms an individual Xn into an individual Xn+1 so that E(Xn+1) > E(Xn), i.e. if an automaton is discovered which matches more positive sequences, or fewer negative sequences, or both, it is Xn+1 which will be mutated in the next cycle of the algorithm.
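Both the encoding and the evaluation function can be illustrated with a short Perl sketch ; the sample strings below are our illustrative guesses, not Tomita’s actual right- and wrong-lists :

```perl
# Toy sketch of Tomita's FSA encoding and of E = r - w.
use strict; use warnings;

# ((1 2 1)(3 1 1)(4 0 0)(3 4 1)) : [0-arrow, 1-arrow, final?] per state
my @fsa = ([1,2,1], [3,1,1], [4,0,0], [3,4,1]);

sub accepts {
    my ($fsa, $string) = @_;
    my $state = 1;                             # the initial state is state 1
    for my $bit (split //, $string) {
        $state = $fsa->[$state - 1][$bit];     # follow the 0-arrow or 1-arrow
        return 0 if $state == 0;               # missing arrow : reject
    }
    return $fsa->[$state - 1][2];              # accept iff a final state
}

my @right = ('', '0', '1', '00', '011');       # toy positive sample
my @wrong = ('010', '0110');                   # toy negative sample
my $E = grep({ accepts(\@fsa, $_) } @right)
      - grep({ accepts(\@fsa, $_) } @wrong);
print "E = $E\n";
```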
Tomita’s approach cannot be considered fully evolutionary, because he used neither populations nor any kind of cross-over operator. For this reason, Tomita’s regular grammar-inferring algorithm sometimes got stuck in local maxima from which there was no way out. Notwithstanding this imperfection – of which Tomita himself was well aware – his work served, and still serves, as an important landmark on the path to full-fledged GI. Dupont (1994), for example, also focused his study on the induction of 15 different regular Tomita languages. In his formally very sound work, he defines the problem of inference of regular languages as the problem of finding an optimal partition of the state space of a finite « maximal canonical automaton » (MCA) able to accept the sentences of the positive sample. The fitness function also takes into account the system’s tendency to reject the sentences contained in the negative sample. By using a so-called « left-to-right canonical group encoding », Dupont succeeds in representing diverse individual automata in a very concise way, which allows him to subsequently evolve them by means of structural mutation (« the structural mutation consists of a random selection of a state in some block of a given partition followed by the random assignment of this state to a block », e.g. MUTATE({{1,3,5},{2},{4}}) → {{1,5},{2,3},{4}}) and structural crossover (« the structural crossover consists of the union in both parent partitions of a randomly selected block », for example CROSS({{1,4},{2,3,5}}, {{1,3},{2},{4},{5}}) → {{1,3,4},{2,5}}, {{1,3,4},{2},{5}}).
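The two structural operators can be sketched as follows ; the Perl code below is our reconstruction from the two quoted examples, not Dupont’s GIG implementation :

```perl
# Toy sketch of Dupont's structural operators on state-space partitions.
use strict; use warnings;

# a partition is a list of blocks ; a block is a list of states
sub mutate {
    my @p = map { [@$_] } @{$_[0]};                 # deep-copy the partition
    my $from = int rand @p;
    my $s = splice @{$p[$from]}, int rand @{$p[$from]}, 1;  # remove a state
    @p = grep { @$_ } @p;                           # drop the block if empty
    push @{$p[int rand @p]}, $s;                    # reassign it to some block
    return \@p;
}

sub cross {
    my ($p1, $p2) = @_;
    my $s = 1 + int rand 5;                         # assume states 1..5
    my %B;                # union of the two parental blocks containing $s
    for my $p ($p1, $p2) {
        for my $block (@$p) {
            @B{@$block} = () if grep { $_ == $s } @$block;
        }
    }
    my @children;
    for my $p ($p1, $p2) {        # in each parent, gather B into one block
        my @rest = grep { @$_ }
                   map { [grep { !exists $B{$_} } @$_] } @$p;
        push @children, [ [sort { $a <=> $b } keys %B], @rest ];
    }
    return @children;
}

# with $s == 1, this reproduces the CROSS example quoted above
my ($c1, $c2) = cross([[1,4],[2,3,5]], [[1,3],[2],[4],[5]]);
for my $c ($c1, $c2) {
    print join(' ', map { '{' . join(',', @$_) . '}' } @$c), "\n";
}
```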
Because « the search space size dramatically increases with the size of the positive sample, making the correct identification more difficult when we have a larger positive information on the language », Dupont also proposed an incremental procedure which allows the search process to start from a smaller yet pertinent region of the search space. The procedure goes as follows : « first sort the positive sample I+ in lexicographical order. Consequently, the shortest strings are first taken into account. Starting with the first sentence of I+, we construct the associated MCA(I+) and we search for the optimal partition of its state set under the control of the whole negative sample I−. Let A1 denote the derived automaton with respect to this optimal partition. Let snext denote the next string in I+. If snext is already accepted by A1, we skip it. » (Dupont 1994). Otherwise, the automaton A1 is extended so that it also covers snext. The search under the control of the whole negative sample is then restarted, and the whole process is repeated until all sentences from the positive sample have been considered. With a population size of 100 individuals, a maximum of 2000 evaluations, a crossover rate of 0.2, a mutation rate of 0.01 per bit and the semi-incremental procedure implemented, Dupont’s approach attained, on average, a classification rate of 94.4%. For five of the fifteen Tomita languages, grammars were constructed which attained 100% accuracy (i.e. accepted all sentences from the positive sample and rejected all strings from the negative sample). The results also indicated that when the semi-incremental procedure is applied, the sample size has a positive influence upon the accuracy of the inferred grammars – a bigger sample yields more accurate grammars. While Tomita’s results indicate, and Dupont’s results further confirm, the belief that induction of grammars by means of evolutionary computing is a plausible undertaking, they do so only with regard to the simplest type of grammars – the regular ones. The grammars of natural languages, however, are definitely not regular, and models of GI of the more expressive « context-free » (CFG) or « context-sensitive » grammars are needed. Keller and Lutz employed a genetic algorithm to evolve the parameters of stochastic context-free grammars (SCFG) of 6 different languages. SCFGs are similar to traditional CFGs15, but extended with a probability distribution, so that a probability value in the range [0,1] is associated with every production rule of the grammar. These values are called the SCFG’s parameters, and it is these values which the algorithm of Keller & Lutz aims to optimize by means of GAs. Their approach involves the following steps : « 1. Construct a covering grammar that generates the corpus as a (proper) subset. 2. Set up a population of individuals encoding parameter settings for the rules of the covering grammar. 3. Repeatedly apply genetic operations (cross-over, mutation) to selected individuals in the population until an optimal set of parameters is found. » (Keller and Lutz 1997). Their fitness function F(G) is based on the idea of Minimal Description Length (MDL). More formally, Keller & Lutz aimed to maximize

F(G) = Kc / (L(C|G) + L(G))

by minimizing the denominator, which is defined as the number of bits needed to encode the grammar G (i.e. L(G)) plus the number of bits needed to encode the corpus C given the grammar G (i.e. L(C|G)). The numerator Kc is just a corpus-dependent normalization factor assuring that the value of the fitness stays in the range [0,1].

15 « In formal language theory, a context-free grammar (CFG) is a grammar in which every production rule is of the form V → w, where V is a single non-terminal symbol, and w is a string of terminals and/or non-terminals. The term « context-free » expresses the fact that non-terminals can be rewritten without regard to the context in which they occur » (Choubey and Kharat 2009)
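A toy computation of such an MDL-style fitness might look as follows ; the two description-length estimators are crude uniform-code stand-ins, not the coding scheme actually employed by Keller & Lutz :

```perl
# Toy sketch of the MDL-style fitness F(G) = Kc / (L(C|G) + L(G)).
use strict; use warnings;

sub bits { my $n = shift; return log($n) / log(2) }   # bits to pick 1 of n

# L(G) : cost of encoding the grammar = total symbols * bits per symbol
sub grammar_length {
    my ($rules, $alphabet_size) = @_;
    my $symbols = 0;
    $symbols += 1 + scalar(@{$_->[1]}) for @$rules;   # lhs + rhs symbols
    return $symbols * bits($alphabet_size);
}

# L(C|G) : cost of encoding the corpus, approximated by -log2 P(sentence|G)
sub corpus_length {
    my ($probs) = @_;                                 # toy parse probabilities
    my $total = 0;
    $total += -log($_) / log(2) for @$probs;
    return $total;
}

my @rules = ( ['S', ['a','S','b']], ['S', ['a','b']] );   # a^n b^n grammar
my $Lg  = grammar_length(\@rules, 4);                 # alphabet {S,a,b,#}
my $Lcg = corpus_length([0.5, 0.25, 0.25]);
my $Kc  = 10;                                         # normalization constant
printf "F(G) = %.3f\n", $Kc / ($Lg + $Lcg);
```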
When confronted with positive samples of circa 16,000 strings (typically of length 6 or 8) of 6 different context-free languages :

1. EQ : the language of all strings consisting of equal numbers of a’s and b’s
2. the language a^n b^n (n ≥ 1)
3. BRA1 : the language of balanced brackets
4. BRA2 : balanced brackets with two sorts of bracketing symbols
5. PAL1 : palindromes over {a,b}
6. PAL2 : palindromes over {a,b,c}

their algorithm converged, in the majority of cases, to such combinations of SCFG parameters as allowed them to accept more than 95% of the strings present in the positive sample. Such results indicate that genetic algorithms can be used as a means of unsupervised inference of the parameters of stochastic context-free grammars. Note that Keller & Lutz confronted their algorithm, during both training and testing, only with the positive sample. While doing so for training is justifiable – since the objective of their study was to see whether grammars can be inferred solely from positive evidence – not doing so during the testing phase makes it uncertain to what extent their inferred grammars overgeneralize. Another big disadvantage with regard to the aims of our Thesis is the simple fact that their approach also seems to be computationally very costly (« the number of parses that must be considered increases exponentially with the number of non-terminals »). And since they confronted their algorithms only with corpora composed of sentences of artificial, not natural, languages, we shall not try to imitate their approach of « tuning SCFG parameters » in our Thesis. By being context-free and not simply regular, the grammars studied by Keller & Lutz or by (Choubey and Kharat 2009) can be considered more similar to the grammars of natural languages. Nonetheless, languages composed of palindromes and sequences of balanced brackets are still a long way off from natural languages, and the question « to what extent are results concerning GI of artificial languages applicable to GI of natural languages ? » is far from answered. Rather than trying to answer it, we now proceed to a discussion of two approaches in which evolutionary GI has been applied to natural language sentences. The first method, proposed in (Aycinena et al. 2003), focused on the induction of CFG grammars from nine different part-of-speech tagged natural language corpora. The sentences contained in these corpora, thus composed of sequences of part-of-speech tags (c.f. Section 3.2), were used as positive examples, while randomly generated sequences of POS tags yielded the negative examples. The initial population was composed of linear encodings of randomly generated context-free grammars ; for example, the string SABABCBCDCAE would represent this CFG :

S → AB
A → BC
B → CD
C → AE

During the evaluation of an individual grammar, one first tries to parse both the positive and the negative corpora with that grammar and subsequently calculates the final fitness by applying the following formula :

F(α) = γ^max(0, |α| − |P|) · C(α) − δ · I(α)

« where P is the set of preterminals, C(α) is the number of parsed sentences from the corpus, I(α) is the number of sentences parsed from the randomly generated corpus, δ is the penalty associated with parsing each sentence in the randomly generated corpus, and γ is the discount factor used for discouraging long grammars » (Aycinena et al. 2003).
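Both the linear encoding and the fitness formula are easy to sketch ; in the following Perl fragment, all numeric values are invented for illustration :

```perl
# Toy sketch of the linear CFG encoding of Aycinena et al. (2003) :
# consecutive triples of symbols are read as rules LHS -> RHS1 RHS2.
use strict; use warnings;

my $genome = 'SABABCBCDCAE';
my %grammar;
while ($genome =~ /(.)(.)(.)/g) {           # read the string triple by triple
    push @{$grammar{$1}}, "$2$3";
}
for my $lhs (sort keys %grammar) {
    print "$lhs -> ", join(' | ', @{$grammar{$lhs}}), "\n";
}

# fitness as quoted above : F = gamma^max(0,|a|-|P|) * C(a) - delta * I(a)
my ($gamma, $delta, $preterminals) = (0.95, 1.0, 8);   # illustrative values
my ($parsed_pos, $parsed_neg)      = (42, 3);          # toy parse counts
my $len     = length $genome;
my $excess  = $len - $preterminals > 0 ? $len - $preterminals : 0;
my $fitness = $gamma ** $excess * $parsed_pos - $delta * $parsed_neg;
printf "F = %.2f\n", $fitness;
```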
In their study, Aycinena et al. placed a randomly generated population of 100 individual grammars on a two-dimensional 10 × 10 torus grid. Subsequently, they applied the following select-breed-replace strategy : « 1. Select an individual randomly from the grid 2. Breed that individual with its most fit neighbor to produce two children 3. Replace the weakest parent by the fittest child » (Aycinena et al. 2003). In their framework, « cross-over is accomplished by selecting a random production in each parent. Then a random point in these productions is selected and cross-over is performed, swapping the remainder of the strings after the cross-over points ». Every symbol of a resulting string can subsequently be mutated (mutation rate = 0.01) : « A mutation is simply the swapping of a non-terminal or pre-terminal with another non-terminal or pre-terminal » (Aycinena et al. 2003).

Figure 10: Grammars evolved from nine different POS-tagged corpora. Figure reproduced from (Aycinena et al., 2003)

Figure 10 shows the number of generations each run was able to complete, the grammar G that last evolved, the percentage of positive examples parsed by G, the percentage of negative examples parsed by G, and G’s fitness. While the results displayed there may seem encouraging, the authors noticed that in the majority of cases their approach « gives a grammar that is very capable of detecting whether a sentence is valid in English, but it has not learned much English structure ». In other terms, Aycinena et al. succeeded in breeding grammars which have a certain discriminatory power but are practically useless as models of the English language. They go so far as to state, in the final paragraph of their work, that « It is still possible that English grammar is too complex to be learned from a corpus of words » and that other external clues are necessary for successful GI of English. A big disadvantage of the above-mentioned algorithm is also the fact that its input consisted of sequences of already attributed POS tags and not of sequences of the words themselves. Thus, even if the approach discovered some interesting grammars, the reproach could justifiably be made that it only re-discovered the rules of the tagging system which was used in the first place. From the perspective of our Thesis, another disadvantage of Aycinena et al.’s approach is that it is anything but a model of grammar development in a human child. For it is evident (c.f. Section 2) that children learn the grammar of their language in an incremental fashion – they are not confronted with the whole corpus from the very beginning. Nor does the corpus stay identical after each iteration of the learning process. On the contrary : as the child grows, its linguistic environment – the corpus – also grows, both in length and in complexity. An interesting evolutionary approach to GI which both tries to create its own non-terminal categories and takes such « incrementality » into account is presented in the work of (Smith and Witten 1995). In their scenario, candidate grammars are evolved after the presentation of every new sentence. Grammars have the form of LISP s-expressions, whereby AND represents a concatenation of two symbols (i.e. a syntagmatic node) and OR represents a disjunction (i.e. a paradigmatic node). The whole process is started as follows : « The GA proceeds from the creation of a random population of diverse grammars based on the first sample string. The vocabulary of the expression is added to an initially empty lexicon of terminal symbols, and these are combined with randomly chosen operators in a construction of a candidate grammar... If the candidate grammar can parse the first string, it is parsed into the initial population ». Figure 11 displays two sample grammars for the sentence « the dog saw a cat ».
Figure 11: Two simple grammars covering the sentence "the dog saw a cat". Figure reproduced from (Smith & Witten, 1995)

The s-expression sequences representing individual grammars are subsequently mutated. Couples of parent grammars can also switch their nodes – the probability of being chosen for such a cross-over is inversely proportional to the grammar’s size : shorter grammars are preferred. Cross-over is non-destructive ; the parents thus also persist. The reproduction events are grouped into cycles, and at the end of each cycle the population of candidate grammars is confronted with a new sentence from the sample of positive evidence. In their article, (Smith and Witten 1995) show how, after the presentation of the sentences « the dog saw a cat », « a dog saw a cat », « the dog bit a cat », « the cat saw a cat », « the dog saw a mouse » and « a cat chased the mouse », their system naturally converged to a grammar which had quite correctly subsumed determiners like « a », « the » under one group of OR nodes, verbs like « chased », « saw », « bit » under another, and nouns like « dog », « cat », « mouse » under yet another. The grammar which they finally obtain is not ideal but, as they argue, it could improve if confronted with new sentences : « It is an adaptive process whereby the model is gradually conditioned by the training set. Recurring patterns help to reinforce partial inferences, but intermediate states of the model may include incorrect generalizations that can only be eradicated by continued evolution. This is not unlike the developing grammar of a child which includes mistakes and overgeneralisations that are slowly eliminated as their weaknesses are made apparent by increasing positive evidence. » (Smith and Witten 1995). While strongly agreeing with the above citation, we nonetheless cannot ignore certain drawbacks of Smith & Witten’s approach. Most importantly, by using LISP s-expressions as a way of representing their grammars, they ultimately end up with highly bifurcated binary trees (since the arity of the AND|OR operators is 2). Thus, one can easily subordinate two terminals to one non-terminal (e.g. OR(cat,dog)), but in the case of three subordinated terminals, one is obliged to use a complex expression involving three non-terminals (e.g. OR(OR(cat,dog),OR(mouse,NULL))). Therefore, in such an s-expression based representation, any class having more than two members is necessarily represented by a longer sequence → is more prone to mutation → is strongly « handicapped » with regard to the much shorter expressions subordinating just two nodes. Another drawback of Smith & Witten’s work which cannot be ignored is the fact that, while they used English sentences to train their system, the sentences were very simple, and the relevance of their findings to GI of « natural » English is more than disputable. In fact, they seem to achieve, with quite complex evolutionary machinery, even less than Wolff’s deterministic SNPR model had achieved almost a decade before. Notwithstanding these two drawbacks, we nonetheless consider particularly inspiring their attempt to solve the problem of GI of natural languages by uniting, in one framework, the notions of adaptability, evolvability and statistical sensitivity to recurring patterns.
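The representational cost of classes with more than two members can be made concrete with a small sketch (our illustration, not Smith & Witten’s LISP code) :

```perl
# Toy sketch of s-expression grammars and of the arity-2 cost discussed above.
use strict; use warnings;

# nested arrayrefs stand in for LISP s-expressions
my $two   = [ OR => 'cat', 'dog' ];                   # two-member class
my $three = [ OR => [ OR => 'cat', 'dog' ],
                    [ OR => 'mouse', 'NULL' ] ];      # three members already
                                                      # need three OR nodes
sub size {                                            # count the tree's nodes
    my $node = shift;
    return 1 unless ref $node;
    my ($op, @kids) = @$node;
    my $n = 1;
    $n += size($_) for @kids;
    return $n;
}
printf "2-member class: %d nodes; 3-member class: %d nodes\n",
       size($two), size($three);
```

The longer expression offers proportionally more loci for mutation, which is exactly the « handicap » noted above.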
We summarize : all five above-mentioned approaches indicate that evolutionary computing can potentially yield useful solutions to the problem of Grammar Induction of both artificial (regular, context-free) and natural language grammars. The length of the candidate grammar is frequently used as an input argument of the fitness function. Note also that the solutions of both Dupont and Smith & Witten use a sort of « incremental » procedure whereby individual solutions gradually adapt to every new sentence. Dupont’s findings in particular are reminiscent of what was already said about the « importance of starting small » when discussing the works of Elman and Harris. On the other hand, none of the above-mentioned models was confronted with a corpus of child-directed (i.e. « motherese ») or child-originated utterances. One objective of our Thesis shall be to fill this gap.

3.4. Evolutionary Language Game

The Evolutionary Language Game (ELG), first proposed in (Nowak et al. 1999), is a stunningly simple yet mathematically tractable stochastic model addressing the question : « How could a coordinated system of meanings & sounds evolve in a group of mutually interacting agents ? ». In the simplest terms, the model can be described as follows. Let us have a population of N agents. Each agent is described by an n × m associative matrix A. A’s entry aij specifies how often the individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j. From this matrix A one can derive the active « speaker » matrix P by normalizing its rows,

pij = aij / Σk aik

and the passive « hearer » matrix Q by normalizing A’s columns :

qji = aij / Σk akj

The entries pij of the matrix P denote the probability that, for an agent-speaker, object i is associated with sound j. The entries qji of the matrix Q denote the probability that, for an agent-hearer, sound j is associated with object i. Subsequently, we can imagine two individuals A and A′, the first one having the language L(P, Q), the other having the language L′(P′, Q′). The payoff related to the communication of two such individuals is, within Nowak’s model, calculated as follows :

F(A, A′) = Σi Σj pij q′ji = Tr(PQ′)

And the fitness of the individual A with regard to all other members of the population P can be obtained as :

f(A) = (1 / (|P| − 1)) ΣA′∈P, A′≠A F(A, A′)

After the fitness values are obtained for all population members, one can easily apply traditional evolutionary computing methods in order to direct the population toward more optimal states, i.e. states where the individual matrices are mutually « aligned ». In Nowak’s framework this alignment represents the situation when hearer and speaker mutually understand each other, i.e. the speaker has encoded meaning M by sound S and the hearer has subsequently decoded sound S as meaning M. ELG beautifully illustrates how such an alignment of sound-meaning matrices – a mutually shared communication protocol – can emerge practically ex nihilo, given that some « mutual learning » mechanism is involved which allows information to be transferred from one individual to another. This is attained by creating a blank « student » matrix and then filling its elements, by means of a stochastic « matrix sampling » procedure, in such a way that the resulting student matrix will partially correspond to, or be aligned with, the matrices of one or more pre-existing « teachers ».
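For concreteness, the following Perl sketch computes P, Q and the payoff Tr(PQ′) for two toy 2 × 2 association matrices ; it is a minimal re-implementation of the formulas above, not Nowak’s code :

```perl
# Toy sketch of the ELG payoff F(A,A') = Tr(P Q').
use strict; use warnings;

sub speaker {            # P : normalize the rows of A
    my $A = shift;
    return [ map { my $r = $_; my $s = 0; $s += $_ for @$r;
                   [ map { $_ / $s } @$r ] } @$A ];
}
sub hearer {             # Q : normalize the columns of A (Q is m x n)
    my $A = shift;
    my @Q;
    for my $j (0 .. $#{$A->[0]}) {
        my $s = 0; $s += $A->[$_][$j] for 0 .. $#$A;
        $Q[$j] = [ map { $A->[$_][$j] / $s } 0 .. $#$A ];
    }
    return \@Q;
}
sub payoff {             # F(A,A') = sum_ij P[i][j] * Q'[j][i] = Tr(P Q')
    my ($A, $A2) = @_;
    my ($P, $Q2) = (speaker($A), hearer($A2));
    my $f = 0;
    for my $i (0 .. $#$P) {
        $f += $P->[$i][$_] * $Q2->[$_][$i] for 0 .. $#{$P->[$i]};
    }
    return $f;
}
my $A  = [ [3, 1], [0, 4] ];      # toy observation counts of agent A
my $A2 = [ [2, 0], [1, 3] ];      # toy observation counts of agent A'
printf "F(A,A') = %.3f\n", payoff($A, $A2);
```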
Further discussion and experiments with ELG are described in (Kvasnička and Pospíchal) and (Hromada 2012). All these studies point in the same direction : they suggest not only that the emergence of a mutually shared communication protocol practically ex nihilo is possible whenever there exists a means of transferring information among individuals, but also that the presence of a certain low amount of noise during the learning process is the only way to make certain that the system will converge to a « communicatively optimal » state. The role of the ELG model within the context of our Thesis is quite open. For while it is the case that ELG sheds some light upon the question of the emergence of language within a community of symbolically interacting agents, it does not principally address the problem of language learning by a concrete individual. Thus, ELG is a model of macroscopic phylogeny rather than of microscopic ontogeny : it addresses the problem of how small communities of homo habilis could, in the distant evolutionary past, gradually converge to a system of signs within which, for example, « baubau » could mean a banana and « wauwau » a lion. But it does not address the problem of how today’s human baby learns the complex language of her mother. On the other hand, it is not completely hors propos to imagine a slight variation of Nowak’s model wherein one population of matrices would be fixed (representing the linguistic competence of a teacher or mother organism) while a second population of matrices would represent the linguistic competence of a « child ». Given a fitness function which would somehow succeed in representing the degree of alignment between such « mother » and « child », we postulate that something like the child’s language competence could spontaneously emerge.

4. Remark concerning the Theory of Grammar Systems

A branch of Formal Language Theory which could be of particular use for the purposes of our Thesis is devoted to the study of Grammar Systems (GS). A GS is a « set of grammars working together, according to a specified protocol, to generate a language » (Jiménez-López 2000). Thus, contrary to classical Formal Language Theory, within which one grammar generates one language, in GS several grammars work together in order to generate one language. Grammar Systems can therefore be considered a sort of multi-agent variant of the traditional « monolithic » formal grammar theory. The very nature of multi-agent systems often implies cooperation, communication, distribution, modularity, parallelism, or even emergence of complexity. For example, Figure 12 illustrates a very simple bimodular « language colony » variant of a GS.

Figure 12: Language colony of two finite grammars cooperating to generate an infinite language. Figure reproduced from (Kelemen 2004)

By allowing the finite grammar components to communicate through a common symbolic environment16, one ultimately generates a language which is infinite ! (Kelemen 2004) applies the term « miracle » to such behaviour, which is very common in the world of GS.

16 A common symbolic environment shared by the different modules plays the central role in practically all variants of Grammar Systems. It is reminiscent of the role which « short term memory » or « working memory » plays in cognitive psychology.
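The « miracle » can be illustrated by a toy colony of two components, each of which knows only a single finite rewriting, yet whose alternation over a shared string derives ever longer sentences ; this is an illustration of the principle only, not Kelemen’s example :

```perl
# Toy sketch of a two-component "language colony" acting on a shared string.
use strict; use warnings;

my %component = (
    c1 => sub { $_[0] =~ s/X/aY/ },     # component 1 knows only X -> aY
    c2 => sub { $_[0] =~ s/Y/bX/ },     # component 2 knows only Y -> bX
);
my $environment = 'X';                  # the shared symbolic environment
for my $step (1 .. 6) {
    my $c = $step % 2 ? 'c1' : 'c2';    # the components take turns
    $component{$c}->($environment);
    print "after $c: $environment\n";
}
```

Each component alone generates a finite language ; through the shared environment, their cooperation yields derivations of unbounded length.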
Since the Theory of Grammar Systems is formally very well developed – most notably thanks to the life-long work of Erzsébet Csuhaj-Varjú and substantial contributions by Gheorghe Păun and Jozef Kelemen – it is impossible for us to introduce the formalism of GS Theory in closer detail within the limited scope of this text. This will be done in the final version of our Thesis, should we decide to pursue our research in this direction. If that is the case, we will often refer to the doctoral Thesis of (Jiménez-López 2000), which contains many persuasive arguments for the application of GS to the study of natural human languages. On the other hand, the Thesis of Jiménez-López is limited by the fact that it mostly proposes to use Grammar System Theory as a framework explaining the final, i.e. « adult », linguistic competence, and not as a framework which could elucidate the very process of language development and language acquisition17. The only attempt reported in the literature to use the Grammar System apparatus for grammatical inference is that of (Sosík and Štýbnar 1997). Contrary to other GS authors, who focus principally on the productive (i.e. generative) aspects of GS, Sosík & Štýbnar focused on the language-accepting properties of GS. In a hybrid connectionist-symbolic architecture, they used a « neural pushdown automaton » to infer language colonies covering some simple artificial context-free languages, such as balanced-parenthesis or palindrome languages. As far as we know, no attempt has been reported in the literature to solve the problem of grammar induction of natural languages by means of evolutionary optimization of Grammar Systems.

17 In terms of Grammar System Theory, it seems more appropriate to speak about « language emergence ».

5. Thesis

The Thesis hereby introduced is realized under the double supervision of the Department of Cybernetics at the Slovak University of Technology (STU) and the « cognitive psychology » laboratory affiliated with University Paris 8 (P8). Ideally, both the « engineering » approach common to STU and the more cognition-oriented « experimental » approach of P8 should be equally reflected in the final Thesis. In order to do so, the Thesis shall in fact introduce multiple « theses », some of which address more « theoretical », psychology- and linguistics-related phenomena and problems. But due to its affiliation with STU, the text shall also introduce more concrete, pragmatic and operational theses aiming to offer a computationally and formally sound affirmative answer to the question : « Can language development be modelled as an evolutionary process ? »

5.1. Theoretical Thesis

At first, a child has to learn
• how to segment the world into groups of discrete objects and processes
• how to segment the phonetic flux into sequences of discrete linguistic tokens
The subsequent problem of language development can be analyzed as a trinity of sub-problems :
1) vocabulary development (learning of mappings between objects and tokens)
2) induction of grammatical categories
3) induction of grammatical rules
These tasks are deeply and strongly intertwined. Without the ability to segment the world into objects, there are no stable referents to which linguistic tokens could refer. Without the ability to perceive recurrent tokens, there are no conventional symbols with which a child could denote specific objects. Without vocabulary development (which relates to the induction of semantic classes, called « concept construction » in the text above), there is no need for grammatical rules or categories. Without grammatical categories, grammatical rules are just a senseless tautological formal game and there is no way to distinguish useful grammars from useless ones.
Without useful grammars, vocabulary development shall halt at some locally optimal level of a « pidgin » language. Left on their own, these problems pose a variant of the chicken & egg problem which seems almost impossible to tackle. A baby’s brain, however, resolves these problems with such elegance that one is tempted to say they do not even exist. The aim of the Thesis which shall follow is to demonstrate that if one interprets the above-mentioned set of problems in terms of
• parent-child communication (imitation)
• partitioning of vector spaces (categorization)
• gradual accommodation and assimilation of knowledge (generalization)
one can subsequently state the key theoretical Thesis we aim to defend :

Tt : the process of language development is an auto-organizing and potentially evolutionary process

Note the word « potentially » : in order for the process to be labeled « evolutionary », the following conjectures have to be validated :

C1) Not only imitation but also repetition is a form of replication : information replicates not only between brains but also within a brain.
C2) The fitness of a linguistic structure is related to its ability to represent a certain recurrent aspect of the agent’s environment : if a cognitive structure matches some aspect of the environment, it gets activated ; by being activated, it augments its probability of being (at least partially) replicated.
C3) The problems of both generalization and overgeneralization are to be solved by variation|decay operators endogenously transforming the information represented in the memory of the language-inducing system.

Acceptance of the above-mentioned conjectures leads us to a model of language development based not on the tuning of parameters of a single monolithic grammar, but rather on a population of « microgrammars » : a « language colony » (Kelemen and Kelemenová 1992) of mutually communicating, co-operating, decaying and replicating sequences of production rules unceasingly trying to match the language of the linguistic environment. We postulate that if such an environment has certain properties of « motherese », a linguistic competence – an ability to generate utterances in ever more complex « toddlerese » – shall spontaneously emerge. Thus, three notions will be of utmost importance in the Thesis which we hereby introduce : « motherese », « microgrammar » and « matching ». The corpus of « motherese », more concretely the CHILDES corpus (MacWhinney 2000), will be considered a sufficiently adequate image of the initial stages of the child’s linguistic environment. The development of the child’s linguistic competence will be explained in terms of gradual evolution of individual « microgrammars », i.e. chromosomes whose genes can be understood as individual production rules. Last but not least, the notion of « matching » shall furnish the first principle which could potentially allow us to explain the mystery of language acquisition as an evolutionary process :

P1 : « If an (internal) rule R or substitutional schema S succeeds in matching some aspect of the (external) environment, then it shall be replicated into another microgrammar »
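A minimal sketch of how P1 might operate over a population of microgrammars follows ; all rules and data are illustrative assumptions, not the algorithm of the final Thesis :

```perl
# A toy illustration of principle P1 (hypothetical rules and data).
use strict; use warnings;

my @population = (
    [ 'the (\w+)' ],                            # microgrammar 0
    [ 'a (\w+)'   ],                            # microgrammar 1
    [ '(\w+) cat' ],                            # microgrammar 2
);
my $utterance = 'the dog saw a cat';            # aspect of environment E

for my $g (0 .. $#population) {
    my @rules = @{ $population[$g] };           # snapshot of microgrammar g
    for my $rule (@rules) {
        next unless $utterance =~ /$rule/;      # does rule R match E ?
        my $other;
        do { $other = int rand @population } while $other == $g;
        push @{ $population[$other] }, $rule;   # P1 : replicate the gene
    }
}
print "microgrammar $_ : @{ $population[$_] }\n" for 0 .. $#population;
```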
5.2. Operational Thesis

The operational Thesis (TO) is stated as follows :

TO : « There exists an evolutionary algorithm A which, when confronted with a corpus of motherese language (LM) as its input, can produce a toddlerese grammar (GT) able to generate an LM-resembling toddlerese language LT »

The term « evolutionary » means that the algorithm A shall involve incremental replication, mutation and selection of information-representing structures. More concretely, these information-representing structures, i.e. genomes, shall be ordered sequences of genes, whereby each gene shall contain an individual substitution rule. Thus, every individual genome shall represent a « microgrammar » aiming to transform a linguistic token (i.e. a sequence of terminals) currently observable in the environment into a sequence of non-terminals. Whenever such a « successful parse » occurs, the principle P1 shall apply and the useful genes shall be reproduced into other individual microgrammars. This could potentially cause the microgrammars to gradually adapt their structures to those of the environment. On the other hand, in order to prevent excessive adaptation, a variation operator shall also be integrated into the algorithm A, vaguely modeling the well-known phenomenon of « forgetting ».

5.3. The organization of the Thesis

The Thesis shall be composed of five parts, each of which consists of multiple major chapters. Every chapter consists of an introduction and a conclusion, respectively preceding and following more specific sub-chapters which can fractally branch into sub-sub-chapters etc. All such parts, chapters, sub-chapters etc. can be considered « non-terminal » nodes of the structure presented by this text. The first part, labeled « Theses », is the stem of the whole text. It will introduce multiple theses at varying degrees of generality, all of which shall be – in one way or another – more directly addressed in subsequent sections. In order to weave the basic conceptual fabric, some definitions of terms like « evolution » and « language learning » shall also be offered along the path delimited in Section 1. All variants of the thesis shall be briefly related to other cognitive sciences. The second branch, labeled « Theoretical position », is composed of chapters dedicated to Universal Darwinism, Developmental Psycholinguistics and Natural Language Processing. In these chapters, the theses presented in the first chapter shall be more deeply interpreted and contextualized in terms of the respective disciplines. The third branch, labeled « Observations », will describe multiple longitudinal observations of one concrete human child. In certain cases, the generalizability of such individual observations shall be verified or falsified by means of text-mining the CHILDES corpora. Subsequent interpretations in terms of the evolutionary theoretical framework shall follow. The penultimate branch, called « Simulations », shall present multiple computational models addressing four problems related to the language acquisition process : 1) the problem of segmentation, 2) the problem of induction of grammatical categories, 3) the problem of induction of grammatical rules, 4) the problem of concept induction. A specific chapter will be dedicated to each problem, in which existing solutions shall be described. Special focus shall be put on evolutionary solutions, where they exist. To each of the four above-mentioned problems we shall try to offer our own unique evolutionary solution, and we shall subsequently discuss its performance.
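A minimal Perl skeleton of the evolutionary loop which such an algorithm A might employ is given below ; every corpus line, operator and parameter is an illustrative assumption rather than a commitment of the final implementation :

```perl
# Minimal skeleton of an evolutionary loop for an algorithm such as A.
use strict; use warnings;

my @motherese  = ('more milk', 'where is the ball', 'look at the dog');
my @population = map { [ random_rule() ] } 1 .. 10;   # initial microgrammars

for my $generation (1 .. 100) {
    for my $utterance (@motherese) {
        for my $grammar (@population) {
            # selection & replication (P1) : a matching rule spreads
            my ($match) = grep { $utterance =~ /$_/ } @$grammar;
            if (defined $match) {
                my $other = $population[ int rand @population ];
                push @$other, $match if @$other < 20;    # bounded growth
            }
            # variation (« forgetting ») : occasionally drop a random rule
            splice @$grammar, int rand @$grammar, 1
                if @$grammar > 1 and rand() < 0.05;
        }
    }
}
print scalar(@{$_}), " rules\n" for @population;

sub random_rule {                        # a gene : a random token bigram
    my @tokens = qw(more milk the ball dog where look at is);
    return join ' ', map { $tokens[int rand @tokens] } 1 .. 2;
}
```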
PERL source codes of diverse versions of the algorithm A shall also be attached, in order to allow the reproducibility of our results by other scientists. The conclusive branch, labeled « Synthesis », shall primarily discuss the results obtained in the parts « Observations » and « Simulations ». If the results turn out to be consistent with the theory, the work shall end with an attempt to integrate the theses Tt and TO in one unified framework. If unsuccessful, the potential reasons for the failure shall be analysed.

6. Bibliography

Araujo, Lourdes. 2002. Part-of-speech tagging with evolutionary algorithms. In Computational Linguistics and Intelligent Text Processing, 230–239. Heidelberg, Germany: Springer.
Aycinena, Margaret, Mykel J. Kochenderfer, and David Carl Mulford. 2003. An evolutionary approach to natural language grammar induction. Final paper, Stanford CS224N, June.
Barrett, Deirdre. 2007. Waistland: A (R)evolutionary View of Our Weight and Fitness Crisis. New York, NY: WW Norton & Company.
Bee, Helen L., and Denise Roberts Boyd. 2003. The Developing Child. Boston, MA: Allyn & Bacon.
Bentley, Peter. 1999. Evolutionary Design by Computers. San Francisco, CA: Morgan Kaufmann.
Berman, Ruth A. 1988. Word class distinctions in developing grammars. Categories and Processes in Language Acquisition: 45–72.
Blackmore, Susan. 2000. The Meme Machine. Oxford, England: Oxford University Press.
Braine, Martin D. S. 1971. On two types of models of the internalization of grammars. The Ontogenesis of Grammar: 153–186.
Brodsky, Peter, H. R. Waterfall, and Shimon Edelman. 2007. Characterizing motherese: On the computational structure of child-directed language. In Proceedings of the 29th Cognitive Science Society Conference, ed. D. S. McNamara and J. G. Trafton, 833–838.
Brown, Roger. 1973. A First Language: The Early Stages. Cambridge, MA: Harvard University Press.
Campbell, Donald T. 1960. Blind variation and selective retentions in creative thought as in other knowledge processes. Psychological Review 67: 380.
Choubey, Nitin S., and Madan U. Kharat. 2009. Grammar induction and genetic algorithms: An overview. Pacific Journal of Science and Technology 10: 884–888.
Christodoulopoulos, Christos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 575–584.
Cohen, Trevor, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics 43: 240–256.
Cosmides, Leda, and John Tooby. 1997. Evolutionary psychology: A primer. Retrieved from http://www.cep.ucsb.edu/primer.html.
Csuhaj-Varjú, Erzsébet. 1994. Grammar Systems: A Grammatical Approach to Distribution and Cooperation. Yverdon, Switzerland: Gordon and Breach Science Publishers.
Darwin, Charles. 1859. On the Origin of Species. London, England: John Murray.
Darwin, Charles. 1906. The Voyage of the Beagle. London, England: J. M. Dent & Sons.
Dawkins, Richard. 2006. The Selfish Gene. Oxford, England: Oxford University Press.
Dennett, Daniel C. 1996. Darwin’s Dangerous Idea: Evolution and the Meanings of Life. New York, NY: Simon & Schuster.
Dupont, Pierre. 1994. Regular grammatical inference from positive and negative samples by genetic search: the GIG method. In Grammatical Inference and Applications, 236–245. Heidelberg, Germany: Springer.
El Ghali, Adil, Daniel Hromada, and Kaoutar El Ghali. 2012.
Enrichir et raisonner sur des espaces sémantiques pour l’attribution de mots-clés. JEP-TALN-RECITAL 2012: 77.
Elman, Jeffrey L. 1993. Learning and development in neural networks: The importance of starting small. Cognition 48: 71–99.
Flake, G. W. 1999. The Computational Beauty of Nature. Cambridge, MA: MIT Press.
Fogel, Lawrence J., Alvin J. Owens, and Michael J. Walsh. 1966. Artificial Intelligence through Simulated Evolution. New York, NY: John Wiley & Sons.
Foster, Mary LeCron. 2002. Symbolism: the foundation of culture. Companion Encyclopedia of Anthropology: 366. Routledge.
Furrow, David, Katherine Nelson, and Helen Benedict. 1979. Mothers’ speech to children and syntactic development: Some simple relationships. Journal of Child Language 6: 423–442.
Galton, Francis. 1875. English Men of Science: Their Nature and Nurture.
Gärdenfors, Peter. 2004. Conceptual Spaces: The Geometry of Thought. Cambridge, MA: MIT Press.
Haeckel, Ernst Heinrich Philipp August. 1879. The Evolution of Man. Vol. 1. [s.n.].
Haidt, Jonathan. 2012. The Righteous Mind: Why Good People Are Divided by Politics and Religion. Random House.
Hamilton, William D. 1963. The evolution of altruistic behavior. The American Naturalist 97: 354–356.
Harris, Margaret. 2013. Language Experience and Early Language Development: From Input to Uptake. Psychology Press.
Harris, Zellig S. 1954. Distributional structure. Word 10: 146–162.
Hebb, Donald Olding. 1964. The Organization of Behavior: A Neuropsychological Theory. John Wiley & Sons.
Hoff-Ginsberg, Erika. 1986. Function and structure in maternal speech: Their relation to the child’s development of syntax. Developmental Psychology 22: 155.
Holland, J. H. 1975. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Ann Arbor, MI: University of Michigan Press.
Hromada, Daniel Devatman. 2012. Variations upon the theme of Evolutionary Language Game. Unpublished manuscript. Slovak University of Technology.
Hromada, Daniel Devatman. 2013a. Geometrizácia ontológií - prípadová štúdia SNOMED. Unpublished manuscript. Slovak University of Technology.
Hromada, Daniel Devatman. 2013b. Random Projection and Geometrization of String Distance Metrics. In Proceedings of the Student Research Workshop associated with RANLP, 79–85. Hissar, Bulgaria.
Hromada, Daniel Devatman. 2014a. Comparative study concerning the role of surface morphological features in the induction of part-of-speech categories. In Proceedings of the TSD 2014 conference. Heidelberg, Germany: Springer.
Hromada, Daniel Devatman. 2014b. Introductory experiments with evolutionary optimization of reflective semantic vector spaces. In TALN-RECITAL-DEFT 2014. Marseille.
Hromada, Daniel Devatman. 2014c. Conditions for cognitive plausibility of computational models of category induction. In Proceedings of the 15th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. Heidelberg, Germany: Springer.
Jiménez-López, M. D. 2000. Grammar systems: A formal-language-theoretic framework for linguistics and cultural evolution. PhD dissertation. Tarragona, Spain: Rovira i Virgili University.
Johnson, William B., and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26: 1.
Karypis, George. 2002. CLUTO – a clustering toolkit. DTIC Document.
Kauffman, Stuart. 1996. At Home in the Universe: The Search for the Laws of Self-Organization and Complexity.
Oxford, England: Oxford University Press.
Kelemen, Jozef. 2004. Miracles, colonies, and emergence. In Formal Languages and Applications, 323–333. Heidelberg, Germany: Springer.
Kelemen, Jozef, and Alica Kelemenová. 1992. A grammar-theoretic treatment of multiagent systems. Cybernetics and Systems 23: 621–633.
Keller, Bill, and Rudi Lutz. 1997. Evolving stochastic context-free grammars from examples using a minimum description length principle. In 1997 Workshop on Automata Induction, Grammatical Inference and Language Acquisition.
Kennedy, James, and Russell C. Eberhart. 2001. Swarm Intelligence. San Francisco, CA: Morgan Kaufmann.
Koza, J. R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Vol. 1. Cambridge, MA: MIT Press.
Küntay, Aylin, and Dan I. Slobin. 1996. Listening to a Turkish mother: Some puzzles for acquisition. Social Interaction, Social Context, and Language: Essays in Honor of Susan Ervin-Tripp: 265–286.
Kvasnička, Vladimír, and Jiří Pospíchal. Evolúcia jazyka a univerzálny darvinizmus. In Myseľ, inteligencia a život. Bratislava: Slovenská Technická Univerzita.
Lakoff, G. 1990. Women, Fire, and Dangerous Things. Chicago, IL: University of Chicago Press.
Levy, Yonata. 1988. The nature of early language: Evidence from the development of Hebrew morphology. Categories and Processes in Language Acquisition: 73–98. Lawrence Erlbaum Associates.
MacWhinney, Brian. 1987. The competition model. Mechanisms of Language Acquisition: 249–308.
MacWhinney, Brian. 2000. The CHILDES Project: Tools for Analyzing Talk. Transcription, Format and Programs. Vol. 1. Lawrence Erlbaum Associates.
Maratsos, Michael. 1988. The acquisition of formal word classes. Categories and Processes in Language Acquisition: 31–44. Lawrence Erlbaum Associates.
Morgan, Thomas Hunt. 1916. A Critique of the Theory of Evolution. Princeton, NJ: Princeton University Press.
Newport, Elissa L. 1990. Maturational constraints on language learning. Cognitive Science 14: 11–28.
Ninio, Anat. 1988. On formal grammatical categories in early child language. Categories and Processes in Language Acquisition. Lawrence Erlbaum Associates.
Nowak, M. A., J. B. Plotkin, and D. C. Krakauer. 1999. The evolutionary language game. Journal of Theoretical Biology 200: 147–162.
Ofria, Charles, and Claus O. Wilke. 2004. Avida: A software platform for research in computational evolutionary biology. Artificial Life 10: 191–229.
O’Neill, Michael, and Conor Ryan. 2003. Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language. Genetic Programming Series, Vol. 4. Heidelberg, Germany: Springer.
Piaget, Jean. 1974. Introduction à l’épistémologie génétique. Paris, France: PUF.
Pohlheim, Hartmut. 1996. GEATbx: Genetic and evolutionary algorithm toolbox for use with MATLAB documentation. Retrieved from http://www.geatbx.com/docu/algindex.html.
Poincaré, Henri. 1908. L’invention mathématique.
Popper, Karl Raimund. 1972. Objective Knowledge: An Evolutionary Approach. Oxford, England: Clarendon Press.
Ray, Thomas S. 1992. Evolution, ecology and optimization of digital organisms. Santa Fe, NM: Santa Fe Institute.
Rechenberg, Ingo. 1973. Evolutionsstrategie – Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart, Germany: Fromman-Holzboog.
Rizzolatti, Giacomo, and Laila Craighero. 2004. The mirror-neuron system. Annual Review of Neuroscience 27: 169–192.
Rosch, Eleanor. 1999. Principles of categorization. Concepts: Core Readings: 189–206.
Cambridge, MA: MIT Press.
Sahlgren, Magnus. 2005. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE. Vol. 5.
Sekaj, I. 2005. Evolučné výpočty a ich využitie v praxi. Bratislava, Slovakia: Iris.
Shi, Rushen, Janet F. Werker, and James L. Morgan. 1999. Newborn infants’ sensitivity to perceptual cues to lexical and grammatical words. Cognition 72: B11–B21. doi:10.1016/S0010-0277(99)00047-5.
Simonton, Dean Keith. 1999. Creativity as blind variation and selective retention: Is the creative process Darwinian? Psychological Inquiry 10: 309–328.
Smith, Tony C., and Ian H. Witten. 1995. A genetic algorithm for the induction of natural language grammars. In Proceedings of the IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing, 17–24.
Solan, Z., D. Horn, E. Ruppin, and S. Edelman. 2005. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences 102: 11629.
Sosík, Petr, and Leoš Štýbnar. 1997. Grammatical inference of colonies. In New Trends in Formal Languages, 236–246. Heidelberg, Germany: Springer.
Spencer, Herbert. 1894. Education: Intellectual, Moral, and Physical. C. W. Bardeen.
Tomita, Masaru. 1982. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, 105–108.
Trivers, R. L. 1972. Parental investment and sexual selection. In Sexual Selection and the Descent of Man, 1871–1971, ed. B. Campbell, 136–179. Chicago, IL: Aldine.
Turing, A. M. 2008. Computing machinery and intelligence. Parsing the Turing Test: 23–65.
Vapnik, V., S. E. Golowich, and A. Smola. 1997. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 9.
Wilson, Edward O. 1978. What is sociobiology? Society 15: 10–14.
Wittgenstein, L. 2009. Philosophical Investigations. Wiley-Blackwell.
Wolff, J. Gerard. 1988. Learning syntax and meanings through optimization and distributional analysis. Categories and Processes in Language Acquisition 1.
Wright, Sewall. 1932. The roles of mutation, inbreeding, crossbreeding and selection in evolution. In Proceedings of the Sixth International Congress on Genetics, 1: 356–366.