Prolegomena Paedagogica

INTRAMENTAL EVOLUTION AND ONTOGENY OF TODDLERESE & FOUR SIMULATIONS

1st and 2nd volume of PhD dissertation by MSc. Bc. et Bc. Daniel Devatman Hromada

Daniel Devatman Hromada: Intramental Evolution & Ontogeny of Toddlerese, Propedeutica Didactica I., © June 2016

WHAT THIS TEXT IS NOT

This text is not a formal work in a mathematico-logical sense. Its aim is not to introduce the Theory of natural language, nor even a set of theorems whose validity could be proven by blind application of rules of symbolic substitution upon a pre-defined set of solid definitions and «self-evident» axioms. Assuming that

• indeed incomplete is every formal system whose explanatory power is at least equivalent to the explanatory power of the formal system of basic arithmetics (Gödel, 1931)

• the explanatory power of any natural language system is at least as exhaustive as that of any conceivable arithmetic system

we consider the temptation to explain natural languages in strictly formal terms to be a potentially counterproductive one.

Nor is this text a product of an analytical approach to science. It shall not limit itself to the study of a sub-problem of a problem which can sometimes be observed when one confronts the world with the particular terminology and methodology of a sub-branch of a highly specialized discipline. It does not, in Nietzschean terms, devote itself to the study of «the brain of the leech». In other terms, this text does not aim to attain knowledge – whatever it is – by the reductionist act of focusing one's attention upon one sole boring fragment of "truth".

Knowing that truth is complex, sometimes contextually bound and more often than not simply beyond the reach of an individual observer, we do NOT pretend that ALL hypotheses presented on the following pages are apodictically and universally true. For any hypothesis is just a piece of a bigger picture, and it is this picture itself which is supposed to represent Reality – i.e. to be «true» – and not the pieces. On their own, theses and hypotheses are just indices helping the scientist to find his way on a path to such a bigger picture. Thus, even invalid hypotheses can serve a productive purpose if ever they succeed in transposing the scientist into realms where (s)he has never been before. And as we shall try to indicate and re-indicate throughout this whole text, it is indeed «by descending into & traversing through the valley of falsehoods» that the researcher can ultimately attain a perspective which is «higher» (i.e. more «optimal») than the original one.

This principle, we believe, applies both to a baby language learner as well as to the evolution of a scientist's knowledge and, possibly, to the evolution of science in general.

WHAT THIS TEXT IS

This text is an attempt to elucidate «the mystery» of acquisition and development of linguistic competence in terms of evolutionary and complexity theory. Thus, it is principally a multidisciplinary scientific essay. By being «scientific», its aim has to be either analytic or synthetic; and since we have already stated that our main aim is not analytic, it follows that the goal of this essay is of a synthetic nature. More concretely, the synthesis in question aims to involve the following scientific disciplines: artificial intelligence and artificial life, cognitive psychology, developmental psycholinguistics, evolutionary computing, natural language processing, theory of complexity, and universal darwinism.
A mindmap localizing the central topic of this text within the wider scientific context is presented on Figure 1 (c.f. the list of Acronyms on page 355 if some abbreviations are unclear or ambiguous).

[Figure 1: Central notions of this dissertation – a mindmap relating notions like Machine Learning, NLP, Computational Linguistics, the Language Acquisition Device, Developmental Psycholinguistics, Universal Darwinism and Evolutionary Computation to the central node «Intramental Evolution of Linguistic Structures».]

To demonstrate the validity of our perspective, this text shall present three different proofs-of-thesis. The theoretical proof-of-thesis shall consist in making reference to & aligning with multiple theories scattered among different disciplines of the cognitive sciences. Ideally, many seemingly unrelated phenomena could thus be brought under the clef-de-voûte of one scientific paradigm. The observational / empiric proof-of-thesis shall aim to align the thesis with seemingly trivial observations of the linguistic behaviour of a certain human subject. Finally, the computational / experimental proof-of-thesis shall hopefully illustrate that diverse problems of language acquisition are computationally solvable if ever an evolutionary component is introduced.

Last but not least, this text is also a dissertation work with which we aspire to the attribution of the title Philosophiae Doctor. For this reason, all chapters of this first volume contain a certain quantity of remarks which partially surpass the informatic, cognitive and/or psycholinguistic paradigms and point in the direction of philosophy in general, and of epistemology in particular.

HOW IS THE TEXT ORGANIZED

The text is composed of two volumes which, taken together, contain four parts. The first volume consists of three parts, the second volume (Hromada, 2016d) consists of only one. Each part is divided into chapters. Every chapter consists of an introduction and a conclusion, respectively preceding and following more specific subchapters, which can fractally branch into sub-subchapters etc. All such parts, chapters, subchapters etc. can be considered to be «non-terminal» nodes of the structure presented by this text.

[margin: Basic structure of the text]

The first part, labeled «Theses», is the stem of the whole text. It will introduce multiple theses at varying degrees of generality which shall all be – in one way or another – more directly addressed in subsequent sections. In order to weave the basic conceptual fabric, some definitions of terms like «evolution» and «language learning» shall also be offered. All variants of the thesis shall be briefly related to other cognitive sciences.

The second branch, labeled «Paradigms», is composed of chapters dedicated to Universal Darwinism, Developmental Psycholinguistics and Computational Linguistics. In these chapters, the theses presented in the first chapter shall be more deeply interpreted and contextualized in terms of the respective disciplines.

The third branch, labeled «Observations», will describe multiple longitudinal observations of one concrete human child. Subsequent interpretations in terms of the evolutionary theoretical framework shall follow.

The ultimate branch, called «Simulations», shall present multiple computational models addressing three problems related to the language acquisition process. That is:

1. the problem of concept induction
2. the problem of induction of grammatical categories
3. the problem of induction of grammatical rules

A specific chapter, in which existing solutions shall be described, will be dedicated to each problem. Special focus shall be put on evolutionary solutions, if they exist. To each of the above-mentioned problems we shall try to offer our own unique evolutionary solution, and subsequently we shall discuss its performance. PERL source codes shall also be attached and published under the mrGPL licence in order to facilitate the reproducibility (Hromada, 2016e) of results by other scientists.

[margin: Text's nodes and their attributes]

As a whole, the text hereby presented can thus be considered to be a tree with four major branches which bifurcates all the way down to «terminal» (i.e. leaf) nodes. To all nodes of such a «tree» shall also be attributed one among the following types:

DEF (Definition) – an intensive or extensive definition (or a combination of both) of a term used throughout the book
TXT (Text) – a longer piece of text, often dedicated to one specific hypothesis, topic, theory or model; this is the default node type
OBS (Observation) – a transcription of an item from the observation journal
APH (Aphorism) – a comment presenting the author's stance in regards to a topic raised in a Text or Observation node; more subjective than TXT
SRC (Source code) – a snippet of PERL source code

The type of the node is specified in its title. Preceding the title is a unique numeric identifier which can serve as an anchor for cross-references. Thus, a text dedicated to Piaget's Genetic Epistemology which is contained in the fourth section of chapter eight will be introduced with the following expression:

8.4.4. Genetic Epistemology (TXT)

The end of every node is marked by an expression containing the node's numeric ID, its title and the token END. The above-described node will thus be terminated with the following expression:

8.4.4. Genetic Epistemology END

Because the nodes can be embedded within each other, such a syntax is needed to exclude any ambiguity. C.f. 1.0 and its relation to the embedded nodes 1.0.1 and 1.0.2 for a concrete example of such embedding[1].

Margin-notes shall also be employed to facilitate even further the orientation within the text and the cross-referencing between its diverse parts. Such a note shall usually be placed at the margin of the text whenever a new topic is addressed.

[margin: This is a self-referential margin-note]

IN WHAT LANGUAGE IS THE TEXT WRITTEN?

This dissertation is written in a language which shares the majority of its morphological, lexical and syntactic features with modern standard English[2]. Thus, the majority of words are English words and the majority of sentences can easily be parsed by a standard English-language parser. But it has to be noted that this text was not written by a native English speaker. Written mainly in Germany and deposited at a French university by a child of a Slovak mother and a Czech father, inspired by the compactness and eloquence of classic (i.e. Latin, Greek and Sanskrit) treatises, and often aiming to denote very subtle distinctions and novel meanings: all this often led to a Sapir-Whorfian feeling that the communication of certain thoughts is inconsistent with certain well-established schemas and rules. If ever such a situation occurred, it was the communicative intention and not the rule which was prioritized: hence the origin of many seemingly agrammatical constructions present in this work.
Thus, aside from multitudes of involuntary typos, and aside from multitudes of omitted and/or misplaced articles – a Slavic speciality – this work also exposes the reader to a certain amount of errors which are, in fact, not "bugs" but "features". In certain cases, italics and bold were used to mark the moments whereby the author intentionally broke an existing schema – or invented a new one – in order to emphasize a certain aspect of the-intention-to-be-communicated.

[1] Without this embedding, the arborescent structure of this Thesis would be reminiscent of Wittgenstein's Tractatus. But because this embedding is implemented, the structure resembles more a context-free (10.2.2) form of a valid XML document.

[2] Within the context of this dissertation, standard English is principally understood in terms of set theory as the union of British and American English. Given that it is defined as a union and not an intersection, both variants (i.e. American as well as British) are accepted as valid and used interchangeably in cases where the two languages diverge (e.g. both British "optimise" as well as American "optimize" can be accepted).

Part I

THESES

In the distant future I see open fields for far more important researches. Psychology will be based on a new foundation, that of the necessary acquirement of each mental power and capacity by gradation. — Charles Darwin

In this part we shall posit and discuss multiple theses whose validity or invalidity we shall try to demonstrate in subsequent parts of this dissertation. After a brief discussion of the Initial Thesis "mind evolves", the sense of the Hard Thesis "learning is a form of evolution" shall be more thoroughly criticized by exploring the conditions of its validity. The Soft Thesis "learning can be successfully simulated by means of evolutionary computation", the Softer Thesis "learning of natural language can be successfully simulated by means of evolutionary computation" and the Softest Thesis "learning of first language can be successfully simulated by means of evolutionary computation" shall be postulated next. At last, the Operational Thesis "learning of first language from its textual representations can be successfully simulated by means of evolutionary computation" shall turn out to be sufficiently concrete to become an object of computational simulations. Definitions of the terms mind, to evolve, evolution, brain, 2nd law of thermodynamics, evolutionary computation, natural language, first language and child shall also be provided. Aside from all that, a so-called "alternative" hypothesis concerning the non-local storage of information in the human brain shall also be introduced.

1 INITIAL THESIS

Mind evolves.

This is the Initial Thesis (IT) whose validity we hereby undertake to demonstrate. In order to do so, both terms of the statement are to be properly defined.

[margin: Definition of substantive "mind"]

1.1 mind (def)

An auto-organising set of structures and processes determining the characteristic behaviour of an individual.

end mind 1.1

[margin: Definition of verb "to evolve"]

1.2 to evolve (def)

Oxford Dictionary definition:

1. Develop gradually
2. Develop in time as a result of natural selection
3. (chemistry) Give off gas or heat

Etymological definition:

• 1640s: "to unfold, open out, expand," from Latin evolvere "to unroll," especially of books; ... from ex- "out" + volvere "to roll".
• 1832: "to develop by natural processes to a higher state"

end to evolve 1.2

[margin: IT seems to be a tautology]

IT means that an auto-organising set of structures and processes determining the characteristic behaviour of an individual is endowed with the propensity to gradually attain higher states of complexity. Hence, not only the structures stocked in and by the mind, but also the very processes which act in the mind are to be understood as subject to transformation.

The fact that the predicate "to evolve" is conjugated in the indicative mood, 3rd person singular, of the present simple tense suggests that the statement tends to denote a state-of-affairs independent of the temporal context within which the evolution of mind occurs. Thus, it can be reproached that IT is too general and potentially tautological. Since it is difficult to see how such a statement could, per se, become an object of positivist endeavour, let us now discuss IT's less tautological variants.

end initial thesis 1

2 HARD THESIS

The Hard Thesis (HT) is expressed as follows: «learning is a form of evolution».

The term evolution, as presented in HT, is to be understood in terms of the generalized form of Darwin's theory which is called Universal Darwinism (UD). In such a framework, evolution can be defined as follows:

[margin: Definition of substantive "evolution"]

2.1 evolution (def)

Evolution is a durative process emergent in any finite-resourced environment containing a population of information-encoding entities which:

1. Reproduce
2. Need resources for their reproduction
3. Vary because of inaccuracies inherent to the reproduction process

end evolution 2.1

[margin: How evolution "works"]

If ever there exists a causal relation between the information these entities encode (genotype) and the means by which they exploit the environment's resources (phenotype), the population will be led to a gradual optimisation of its relations with the environment, i.e. it will discover ways to exploit resources more efficiently than before. In this sense, the next generations of information-encoding entities shall be better «adapted» to their common environment.

[margin: Evolution has many forms]

It is important to realize that the notion of «evolution», as hereby defined, goes far beyond the traditional Darwinian theory, which was concerned with just one instance of evolution, namely the biological one. Some phenomena which could be interpreted or even modelled as instances of systems whose functioning is consistent with the precepts postulated by Universal Darwinism shall be discussed in somewhat closer detail in Chapter 8.

[margin: Logical necessity of evolution]

For a Universal Darwinist, evolution is not an empirical but a logical necessity. It has to necessarily occur within any system fulfilling the above-mentioned conditions. The emergence of evolution in a system fulfilling the above-mentioned conditions is independent of the concrete form of the «natural laws» & physical constants which determine the particularities of such a system.

[margin: HT is about ontological equivalence]

The Hard Thesis states that the psycho-pedagogical process of «learning» can not merely be interpreted and simulated as an evolutionary process. The Hard Thesis states that learning IS functionally equivalent to an evolutionary process. That, on an ontological level, «learning» is an instance of an evolutionary process and therefore IS an evolutionary process. In the UD-consistent sense.
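Before turning to the definition of «learning», the durative process described in definition 2.1 can be made palpable in a few lines of code. The following PERL fragment is a minimal illustrative sketch of ours – it is not one of the simulations presented later in this dissertation, and all its identifiers and parameter values are ad hoc assumptions – showing how reproduction, resource-limitation and variation jointly suffice for adaptation to emerge:

  use strict;
  use warnings;

  # a population of information-encoding entities: plain 8-bit strings
  my @population = map { join '', map { int rand 2 } 1 .. 8 } 1 .. 20;

  # the environment sustains at most 20 entities per generation
  my $resources = 20;

  for my $generation (1 .. 50) {

      # genotype -> phenotype causality: an entity's efficiency in
      # exploiting resources is here simply the number of 1-bits it encodes
      my %fitness;
      $fitness{$_} = ($_ =~ tr/1//) for @population;

      # selection: only the better-adapted half gets to reproduce
      my @parents = (sort { $fitness{$b} <=> $fitness{$a} } @population)[0 .. $resources / 2 - 1];

      # reproduction with variation: copying is inherently inaccurate
      my @offspring;
      for my $parent (@parents) {
          for (1 .. 2) {
              my $child = $parent;
              if (rand() < 0.1) {    # a rare copy error flips one random locus
                  my $locus = int rand length $child;
                  substr($child, $locus, 1) = substr($child, $locus, 1) eq '1' ? '0' : '1';
              }
              push @offspring, $child;
          }
      }

      # finite resources: the old generation is entirely replaced
      @population = @offspring;
  }

  # after a few dozen generations the population consists almost
  # exclusively of entities close to the optimum '11111111'
  print join(' ', @population), "\n";

Nothing in this sketch presupposes silicon: any substrate satisfying the three conditions of definition 2.1 would do.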
[margin: Definition of substantive / participle "learning"]

2.2 learning (def)

Learning is a mind-transforming, information-processing, constructivist and embodied process.

end learning 2.2

[margin: Attributes of learning]

The attribute «mind-transforming» denotes the finality of learning – it means that both the contents as well as the processes which determine the characteristic behaviour of an individual agent can be modified by means of learning. The attribute «information-processing» denotes the modality of learning – it implies that learning always involves the processing of information, namely the assimilation, accommodation, encoding, storage or decoding of information. The term «constructivist» suggests that learning is gradual and can potentially bootstrap itself. The term "embodied" suggests that learning could succeed only with great difficulty if it were not embedded in an individual monadic entity which keeps track – in one way or another – of its own trajectory.

The last term of HT is defined, in accordance with tradition, as follows:

2.3 form (def)

«Form is the possibility of structure.» (Wittgenstein, 1922)

end form 2.3

Given the definitions 2.1, 2.2 and 2.3, the Hard Thesis – presented as a conjunction of the terms «learning is a form of evolution» – can be considered to be true iff the following statements are true as well:

2.4 first condition of ht's validity (def)

Learning involves the reproduction of information-encoding entities.

end first condition of ht's validity 2.4

2.5 second condition of ht's validity (def)

These learning-enabling information-encoding entities consume resources in order to reproduce.

end second condition of ht's validity 2.5

2.6 third condition of ht's validity (def)

The process of reproduction of information-encoding entities can be influenced by stochastic phenomena which cause an unpredictable structural variation.

end third condition of ht's validity 2.6

2.7 fourth condition of ht's validity (def)

The resources of the environment within which the learning occurs are finite.

end fourth condition of ht's validity 2.7

The Hard Thesis, as proposed until now, defines learning in general, and as such can be said to describe the form of «learning» of both human and artificial minds. In this general sense it will be used in the majority of the text which shall follow. For the rest of chapter 2, however, we shall discuss «learning» as related solely to humans.

The material substrate of human learning[1] is the brain, and no positivist theory of learning can thus be considered adequate if it ignores the brain's essential attributes. We list these essential attributes in the following definition.

[1] Cellular memory being an exception with which we cannot deal here.

2.8 brain (def)

The human brain is a physical (i.e. four-dimensional) object of organic origin which consumes biochemical energy in order to process and/or store information in a non-local, highly parallel, and to a certain extent also plastic, equipotent and holographic fashion robust to both endogenous and exogenous perturbations.

end brain 2.8

[margin: Of validity of connectionist Level of Abstraction]

The fact that the brain disposes of the above-mentioned properties is usually explained in terms of «neural» connectionist theories whose validity is well demonstrated by multitudes of anatomical observations and clinical experiments. And it is indeed true that, when observed through a microscope, at a strictly material «level of abstraction» (LoA[2]), the brain is nothing else than a ball-sized walnut of wetware consisting of approximately one hundred billion neural cells. Subsequently, when one adopts a more computational LoA, one easily comes to the conclusion that the substrate of mutually interconnected neural cells can indeed yield a device capable of strongly parallelized computation. «Neural networks», «backpropagation», «stochastic gradient descent» – all these notions offer us useful conceptual tools which enable us to bridge the «objective» material reality of the brain with the information-processing, i.e. «computational», faculties of the mind.

[2] C.f. Philosophy of Information (Floridi, 2011, pp. 46-58) for a more exhaustive definition of «Level of Abstraction».

The ability to «learn» can, of course, be considered to be such a computational faculty. And the Hard Thesis states that above the «material» and «computational» LoAs from which the brain can be interpreted, there exists also a scientifically sound «evolutionary» LoA at which learning can be conceived, both functionally and structurally, as an instance of an evolutionary process – i.e. a process involving reproduction, variation and selection of information-carrying entities. If HT is valid, the functions of one and the same brain could thus ideally be interpreted through the prism of the «material», «computational» and «evolutionary» LoAs at the same time.

An «evolutionary» LoA can be considered to be scientifically sound only if it does not contradict empiric knowledge – in the case of HT, it should not contradict the anatomical and clinical knowledge concerning the brain. Nor should it contradict the connectionist «computational» theory. What we already know about «brain» and «learning» should rather be consistent with the meaning of HT. Is it the case? It can be, if ever the conditions of HT's validity (c.f. 2.4 – 2.7) are found to be consistent with current neuroscientific knowledge.

[margin: Of brain's consumption of resources]

The last condition, «the resources of the environment within which the learning occurs are finite», seems not to pose a problem, since both the environment about which we speak here – the brain itself – and its material and energetic resources are finite: even in the case of a most abnormal human being, a brain simply cannot consume more than 25-30 % of one's energy. Hence, it is impossible for a human brain, as an energy-consuming system, to go beyond the upper bound of circa 500 kilocalories per day (Mink and Blumenschine, 1981). In this sense, what holds for energy holds, mutatis mutandis, also for the limits of the nurturing chemical substances which the brain must metabolize in order to keep its vital functions in equilibrium. Their quantity is limited – even in the case of a well-nourished, healthy individual, the brain's material resources are finite.

The third condition, i.e. «the process of reproduction of information-encoding entities can be influenced by stochastic phenomena which cause an unpredictable structural variation», also does not seem to be very problematic when we consider the fact that the reproduction does physically occur within its environment – i.e. in the brain – and that this environment is an energy- & information-processing system. It is not problematic because of the 2nd law of thermodynamics.
2.9 2nd law of thermodynamics (def)

«Every process occurring in nature proceeds in the sense in which the sum of the entropies of all bodies taking part in the process is increased. In the limit, i.e. for reversible processes, the sum of the entropies remains unchanged.» (Planck, 1926)

end 2nd law of thermodynamics 2.9

[margin: Of brain and heat]

The human brain, when understood as a physical system, is not an exception to this law. Nor are its components – lobes, neural circuits, neurons, axons, dendrites, receptors, proteins etc. Whenever and wherever information is processed, energy transforms its form and some residual heat is generated. Heat is energy with increased entropy – in its essence it is kinetic energy kicking the surrounding molecules in all directions. As such, it can induce unexpected «unpredictable structural variation» of the molecular substratum of the brain's tissue. Thus, the very fact that the brain is an energy-consuming device implies the possibility of decay and loss of the information encoded in the brain's materia.

[margin: Of intracerebral sources of variation]

Heat aside, the brain is also confronted with other sources of «unpredictable structural variation». From quantum phenomena, free radicals and the different toxins contained in food and air, to the purely cognitive noise entering the brain through sensory channels – both the brain's processes and its structures are constantly confronted with endogenous & exogenous sources of «unpredictable structural variation». If a sort of replication of information-encoding entities were to take place in the brain, it would be highly improbable that it would not also be subject to such variation. Thus, when it comes to the human brain, we consider the third condition of HT's validity (2.6) to be fulfilled.

By its very definition, any activity of a material system involves consumption of energy, and learning, understood as an «information-processing, mind-transforming, constructivist process» (2.2), is not an exception. Thus, in the case of a material system like the brain, the second condition of HT's validity, i.e. «learning involves information-encoding entities which consume resources in order to reproduce» (2.5), can be considered to be necessarily valid if the first condition of HT's validity, i.e. «learning involves the reproduction of information-encoding entities» (2.4), is itself valid.

[margin: Of indirect evidence for intracerebral reproduction of information]

But here the thing gets complicated since, as far as we know, the existence of such «reproduction of information-encoding structures» within the brain has not been, as of November 2014, demonstrated with sufficient certitude. At least not directly[3]. But note that such reproduction of information has been at least indirectly implied since the 1950s, when neuroanatomic observations – primarily concerned with the effects of brain lesions upon the resulting behaviour of the brain – demonstrated that information in the brain is stored in a non-local fashion. As Karl Lashley, one of the greatest neuroscientists of the 20th century, who spent most of his life studying equipotentiality (i.e. the capacity of any part of a functional area to solve a particular task), once put it:

«The equivalence of different regions of the cortex for retention of memories points to multiple representations. Somehow, equivalent traces are established throughout the functional area.» (Lashley, 1950, p. 28)

[3] In 8.6 we shall see some theories interpreting certain neural phenomena not only as «reinforcement» but also as «reproduction of information».

There are at least two possible interpretations of such «non-local storage of information» based on "equivalent traces" and/or "multiple representations".
The first one is «connectionist»:

2.10 connectionist explanation of non-locality (txt)

Information stored in the brain cannot be localized at one particular spatial locus because it is spatially distributed among multiple synapses of the neural network.

end connectionist explanation of non-locality 2.10

In other terms, the connectionist interpretation states that a material representation of a cognitive structure S (or a cognitive function F) cannot be localized to this place «here», because it is also partially encoded «there» and «there» and «even there». From the «connectionist» perspective, it is indeed this distribution, this decentralization of information among synaptic weights, which gives a neural network both its robust character as well as its capacity for generalization.

But there exists also a second interpretation of the fact that information in the brain is not stored at one specific place:

2.11 alternative explanation of non-locality (txt)

Information stored in the brain cannot be localized at one particular spatial locus because it is materially encoded at multiple loci.

end alternative explanation of non-locality 2.11

From this other perspective, the brain stores the material representation of a cognitive structure S (or a cognitive function F) in multiple alternative places and|or in multiple forms. A trivial example illustrating the essential difference between the two approaches is presented on Figure 2, which visualises the "connexionist" and "alternative" representations of a corpus containing four tokens "MABA" and one token "MAPA".

[Figure 2: Distinction between "connectionist" (a) and "alternative" (b) representations of the same data. It is evident that the latter allows for more structural variation than the former.]

Far from being mutually exclusive with the first interpretation of the brain's non-locality, such an «alternative» representation has one advantage and one disadvantage. Given that each particular locus encodes a particular instance of structure S (or function F), i.e. S1, S2, S3, any individual instance can be modified while leaving all the other instances intact. Every instance is thus an independent individual with an individual history: that is a non-negligible advantage. The disadvantage is that, in order to get encoded, such "alternative" representations need more space than "compressed" connectionist representations, which superpose distinct instances one atop another in order to yield one ultimate representation.

Note that such diverse individual instances could well be confounded by an external observer who – had (s)he not been equipped with a fine-grained-resolution imaging apparatus – could easily believe to witness only the activation of one and only one neural circuit S. But closer inspection shall reveal – so we speculate – that the same stimulus and the same response is to be followed, respectively preceded, by the activation of distinct neural loci. Such an observation could potentially be interpreted as empiric evidence for the «alternative» interpretation of the non-local encoding of information in the brain.

It is true that, from a certain point of view, such an «alternative» way of storing information at multiple cerebral loci could be considered redundant. But redundancy does not necessarily mean suboptimality.
[margin: Of utility of redundancy in organic systems]

In the body of a multicellular organism, for example, the complete genetic code is stored in the nucleus of practically every single cell (erythrocytes and thrombocytes of higher vertebrates excepted). And it is verily this very fact – that every cell contains the schema for the whole – which gives, among other properties, to such an organism a somewhat «miraculous» capacity to regenerate itself. This being said, it can be further speculated whether the «miraculous» property of the brain called «plasticity» – i.e. the fact that the brain can, to some extent, restore the original knowledge even if some part of the brain was damaged or even fully lesioned – can also be explained, mutatis mutandis, in terms of redundant storage of information at multiple loci.

Now back to the question discussing the possibility of reproduction of information-carrying structures within the brain. If we accept that the «alternative» hypothesis concerning the brain's faculty to store information non-locally is at least partially valid, we may subsequently pose the question: «but how come that multiple individual instances of information S are stored at distinct loci L1, L2, L3?». A possible answer – «because sometimes, somehow[4], information from L1 is copied into L2» – could pave the way to experiments whose objective shall be to verify the 1st condition of HT's validity (2.4), demanding that learning should somehow involve the reproduction of information-encoding entities.

[4] For example during phases of «dreaming» or other activities of "repeating" and "rehearsing".

Note that for the purposes of the level of abstraction at which the Hard Thesis is postulated, it is secondary whether the replication of an information-encoding structure is materially realized as the creation of new material synapses, or as the synchronization of firings of neural circuits, or as the modification of oscillatory properties of certain fields, or as something completely different. The only thing, we believe, which is currently needed to offer ultimate neuroscientific evidence for the statement «learning is a form of evolution» is to directly observe the spontaneous intracerebral reproduction of one concrete chunk of information, from one locus to another. More formally, such a «reproduction» could be considered as taking place if, at the spatiotemporal locus T1L1, one would observe the emergence of a «child» representation R1 which is at least partially isomorphic with a «parent» representation R0 which has already been observed at the spatiotemporal locus T0L0 and is still observable at some spatial locus at time T1. Such an ensemble of observations would indicate that at least some part of the information was copied from L0 to L1 in a way which leaves the original representation R0 practically intact.

But until such neuroscientific evidence is given, the first condition of HT's validity cannot be considered as sound on empirical grounds. This logically implies the consequence that the whole Hard Thesis must be – given the current state of neuroscientific knowledge – considered as nothing else than a speculative conjecture. The only thing which we can do to make this dissertation less speculative, and hence more scientific, is to soften the Thesis by reducing the scope of the domain upon which it applies.
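Before the Thesis is softened, the difference between interpretations 2.10 and 2.11 – and the reason why only the latter yields individuals upon which an intracerebral evolution could operate – can be made palpable by a deliberately naive PERL sketch. It is an illustration of ours of the MABA/MAPA example of Figure 2, not a model of any actual neural mechanism, and all its names are ad hoc assumptions:

  use strict;
  use warnings;

  my @corpus = ('MABA', 'MABA', 'MABA', 'MABA', 'MAPA');

  # (a) "connectionist" storage: the five tokens are superposed into one
  # compressed representation (here reduced to mere type frequencies)
  my %connectionist;
  $connectionist{$_}++ for @corpus;          # MABA => 4, MAPA => 1

  # (b) "alternative" storage: every token is materially encoded at its
  # own locus, as an independent individual with an individual history
  my @alternative = @corpus;

  # only (b) permits one instance to vary - i.e. to be reproduced with a
  # copy error - while leaving all the other instances intact
  substr($alternative[2], 2, 1) = 'M';
  print "@alternative\n";                    # MABA MABA MAMA MABA MAPA

The price paid by the «alternative» store is space; what is gained is precisely a population of independently modifiable instances – the raw material which definition 2.1 requires.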
3 SOFT THESIS

The Soft Thesis (ST) is expressed as follows: «learning can be successfully simulated by means of evolutionary computation»

[margin: ST postulates explanatory and not ontological adequacy]

ST simply postulates a sort of explanatory adequacy between «learning» and «evolutionary computation». It does not, as HT does, express a statement about the ontological position of «learning»; it does not state what learning «is». It simply states that the behavior of a system whose functioning is in agreement with the principles of «evolutionary computation» could resemble the behavior of a system which is considered to be "learning".

3.1 evolutionary computation (def)

«Evolutionary computation uses computational models of evolutionary processes as key elements in the design and implementation of computer-based problem solving systems.» (Spears et al., 1993)

end evolutionary computation 3.1

[margin: Of EC and material substrate]

Evolutionary computation (EC) can thus be considered to be a subdiscipline of informatics. This does not mean that the principles of EC should be relevant only to the realm of silicon-based computers. It is so because informatics aims to yield a general theoretical framework for the description of information-processing systems, that is, a theory which could ideally be applied to both silicon-based (e.g. computers) and neuron-based (e.g. brain) computational devices[1]. In practice, however, hypotheses related to informatic science are best studied and most applied in relation to silicon-based universal Turing machines. Here are the reasons why it is so:

[margin: Advantages of EC simulations in silico]

• Minimal ethical concerns: it is considered ethically completely acceptable to program one's computer; it is less so to do that with one's neighbor, or his intestinal flora.

• Full initial control: a programmer can control practically all initial states of his informatic model as well as the initial form of the rules according to which the system shall subsequently behave.

• Reduced cost: construction, execution and evaluation of a model in silico is generally much less resource-demanding than construction, execution and evaluation of such a model in vitro or in vivo.

[1] And potentially to other types of computational devices. As of 2014, particularly promising seem to be the devices developed in the discipline of biomolecular computing. Note that the very essence of these devices (e.g. DNA-computers) is particularly favorable to problem-solving by means of evolutionary computation.

Since EC is a subdiscipline of informatics, it follows that the above-described utility of silicon-based machines for informatics is also appreciable in the domain of evolutionary computing. In fact, especially due to moral and security concerns, in silico seems to be the only way in which living evolution can be empirically studied on a time scale directly perceivable and interpretable by practically any human observer able to run a program on a computer. For this reason, when the Soft Thesis relates EC to «learning», it is principally a silicon-based computer which is supposed to be the subject of the «learning» process. With the exception of Part iii – where we shall mainly discuss the learning process as instantiated in human children – the computer shall, in the rest of this dissertation, be understood as an entity capable of learning. In 8.7 we shall discuss EC in somewhat closer detail.
There, we shall also introduce the most important EC paradigms, like «genetic algorithms» (8.7.1), «evolutionary strategies» (8.7.2) and «genetic programming» (8.7.3). But the particularities of these diverse approaches are not of great interest for the subject which concerns us in this chapter, that is: the elucidation of the meaning of the Soft Thesis. In order to do so, the term «successfully simulated» should be defined.

3.2 successful simulation (def)

A process P can be said to be «successfully simulated» by a system S iff the way in which the outputs oS1, oS2, ... oSn of the system S (given the inputs iS1, iS2, ... iSn) are generated is isomorphic, at a certain Level of Abstraction, to the way in which the process P reacts to the stimuli iP1, iP2, ... iPn when generating the outputs oP1, oP2, ... oPn. The morphism iPx → iSx can be understood as a representational mapping of inputs from the domain of the process P (i.e. «reality») into the domain of the simulation S.

end successful simulation 3.2

[margin: Of stimuli and input]

In less formal terms, a simulating system can be said to perform a «successful simulation» if and only if it tends to react to sequences of its inputs in the same way as the process-which-is-simulated reacts to the sequences of stimuli with which it is confronted. Note that, in order to distinguish the two, we use the term "stimulus" when we speak about the data entering the original physical process-which-is-simulated, and we use the term "input" when we speak about the data which enters the simulation.

In light of this definition, the Soft Thesis practically postulates that by implementing the precepts of Evolutionary Computation (Section 8.7), one can construct computational models which shall gradually transform inputs into outputs in a way that would be, for an external observer, indistinguishable from the mappings gradually produced by the process of «learning».

[margin: Morphisms among morphisms]

Let it be underscored that the above-mentioned definition speaks not only about simulating the outputs (results) of the process; it speaks also about the manner by means of which such results are obtained. It demands not only external but also internal adequacy between the simulation and the process which is being simulated. That is, NOT ONLY should the simulation yield the outputs which are the most accurate – i.e. which resemble the most the observable behaviours of the system – BUT ALSO should it execute the input → output mapping in a similar way. In the case of tentatives aiming to simulate human cognitive processes, we find it useful to speak about such "internal adequacy" in terms of cognitive plausibility.

3.3 cognitive plausibility (def)

«We label as "cognitively plausible" a model which tends to address some basic function/skill of human cognitive system not only by simulating, in a sort of "black-box apparatus", the mapping of inputs (stimuli, corpus data etc.) upon outputs (observed behaviors, results etc.), but also tends to faithfully represent – at least when interpreted from a certain LoA – the way how the respective function/skill is accomplished by a real human mind.» (Hromada, 2014b)

end cognitive plausibility 3.3
We believe that it is often pertinent to ask the question "is the computational model M of process P cognitively plausible?". In the case of the process of "learning" and its computational "machine learning" (ML) counterparts, an analysis through the prism of "cognitive plausibility" could potentially yield surprising results: while many ML models perform more than well in tasks which were previously the domain of exclusively human learning, they are far from being cognitively plausible.

[margin: Of researchers and engineers]

The extent to which the model successfully simulates the real process (i.e. its performance) and the extent to which the model does so in a way similar to the human mind (i.e. its cognitive plausibility) demarcate two independent axes which are not to be confounded. Engineers interested only in attaining the best results (i.e. the most adequate outputs, given the inputs) can often ignore the manner by means of which a natural system solves a given problem. On the other hand, researchers aiming to understand the functioning of the natural system are often more ready to accept a lesser performance of their model if ever it seems to exhibit the same properties and faculties as the natural system. Only in rare cases do such engineering (i.e. result-oriented) and scientific (i.e. knowledge-oriented) axes converge.

end soft thesis 3

4 SOFTER THESIS

An important question was left unanswered during our discussion of the Soft Thesis. That is: what shall be the object of the learning which is supposed to be successfully simulable by means of Evolutionary Computation? What shall be the nature of the stimuli iP1, iP2, ... iPn entering the learning process we aim to simulate? To concretely address this question, we are, once again, obliged to soften the Thesis somewhat more, thus obtaining the Softer Thesis (S²T), which can be expressed as follows:

«learning of natural language can be successfully simulated by means of evolutionary computation»

Contrary to ST, which relates EC to a very broadly defined notion of «learning», S²T specifies the object of «learning» which is supposed to be EC-simulable. It is the learning of natural languages.

4.1 natural language (def)

Natural language is a system composed of prosodic, phonetic, phonologic, morphologic, syntactic, semantic and pragmatic structures and principles which allows human beings to encode messages in a way that is comprehensible to other human beings.

end natural language 4.1

Further definitions related to natural language, notably those of the -ic terms, shall be presented in 9.2, for they are not inevitably needed for the elucidation of S²T's meaning. What we consider of bigger importance here is to introduce the reasons which have motivated us to study evolutionary computation in relation to the learning of natural languages.

[margin: Of essence of humanity]

4.2 why natural language? (aph)

Among all the faculties which distinguish man from other animals, the mastery of language is potentially the most salient one. This was already well known to the ancients, among which Aristotle, for example, defined man as ζῶον λόγον ἔχον, «an animal which word has». Centuries later, Wittgenstein (1953) indicated that the whole of philosophy, and potentially even more, can be understood as the realisation of some sort of perennial «language game»...
During the century which followed after de Saussure (1916) presented linguistics as a mainly positivist study of diverse forms of linguistic structures, linguistics had refined its methodology and terminology in a way such broad and deep that currently -as of 2014- among all other sciences studying one specific domain of human activity, linguistics has practically no equal in both quantity and quality of scientific knowledge which has been already accumulated. Thus, one reason why we have chosen to focus on the natural language is purely pragmatic one: natural languages are well-studied. For us it principially means that we are not obliged to «reinvent the wheel» and can instead use the already existing methodology and terminology, refer to past observations and experiments and potentially exploit the established corpora. Notably the discipline of developmental psycholinguistics, with its focus on the process of «language development» (Section 9.1) as well as an increasingly popular discipline of Natural Language Processing (NLP, Section 10.3), located on the border between linguistics and computer science, seem to be of particular importance in regards to potential proof of validity of S2 T. The second reason for focusing our interest on natural language is related to the role which natural language seems to play in development of every healthy human individual. This role is considered to be non-negligeable by those who consider the language to be the very fundament of human society ; and is considered to be vital by those who know that on its own, i.e. without society’s protective matrix, a human individual – and especially a human child – simply could not survive and/or develop full capacities of a self-realized member of homo sapiens sapiens species. Simply stated, language is a phenomen present in all cultures and as such can be considered to be the anthropological constant par excellence. By having already mentioned philosophy, anthropology, and linguistics, we consider it important to underline that the topic of natural language seems to be recurrent in all cognitive sciences. Neuroscience, for example, had fully established itself as an empiric science the very day when Broca (1861) realized that the damage of brain’s inferior frontal lobe of the dominant hemisphere leads to problems in production of language (he was later followed by Wernicke (1874) who noticed that the damage of superior temporal gyrus leads to troubles in language comprehension). Language plays also important role in both psychotherapy and psychology. In both Freundian and Jungian psychanalysis, in Rogerian person-centered psychotherapy, in Frankl’s logotherapy or individual 15 Of linguistics 1st reason 2nd reason Role of language in psychology 16 softer thesis Figure 3: Cognitive Hexagram The centroid of the hexagram 3rd reason psychology Adler (1976) and possibly in many other psychotherapeutic systems, language is considered to be therapeutic tool of utmost importance. What is more, in a very sound psychological "theory of multiple intelligences", as articulated by Gardner (1985a), is man’s faculty to understand and produce linguistic utterances important enough so that it merits to obtain the label of «verbal-linguistic» intelligence. Along with six other intelligences, this «linguistic intelligence» is considered to be the basic computational module of human cognitive system. 
Also within a theory coming from a different (Russian) tradition, that of Vygotsky (1987), language is considered to be a crucial component of man's psyche: in Vygotsky's framework, in fact, thinking itself is understood as a so-called inner speech.

[margin: The centroid of the hexagram]

All these arguments lead us to the belief that natural language is a topic which is localized very close to the centroid of the hexagram delimiting the object of study of all cognitive sciences (depicted on Figure 3). In one way or another, explicitly or implicitly, all cognitive sciences deal with natural language.

[Figure 3: Cognitive Hexagram]

[margin: 3rd reason]

On their own, these two reasons, «language is well studied» and «language is central», would yield, we believe, a sufficient answer to the question «Why does S²T relate evolutionary computing with the learning of natural language and not, for example, with the learning of deer-hunting or the learning of swimming?». But there is another, AI-related, reason for which we consider the study of language learning to be of particular importance in relation to evolutionary computing and/or computer science. More concretely, similarly to Turing (1950), who saw in language a means of addressing the question «Can machines think?» in an answerable way, we see in natural language a potentially first solid bridge between the realms of artificial and human beings.

end why natural language? 4.2

end softer thesis 4

5 SOFTEST THESIS

The Softest Thesis (S³T) is expressed as follows:

«Ontogeny of toddlerese can be successfully simulated by means of evolutionary computation.»

In this definition, the term "ontogeny" is used in a sense practically synonymous with "learning", the sole difference between the two being our intention to mark the notion that toddlerese is not only passively learnt, but that it emerges and is actively constructed. When it comes to toddlerese itself, it is hereby defined as:

5.1 toddlerese (def)

Toddlerese is a transitory protovariant of a natural language which is transferred from the minds of human adults into the mind of a child by means of repetitive exchange of sequences of contextualized symbols.

end toddlerese 5.1

[margin: Toddlerese is a transitory language]

Thus, the term "toddlerese" has a meaning similar to that of widely used terms like "first language" or "mother language". But contrary to these terms – which are used to denote not only the language which develops but also, and mainly, the end-state language resulting from such development – the term "toddlerese" is conceived to denote only a certain transitory state, or a sequence of states, in the development of such a "first language". In other terms, the mother language stays active in man's mind for the rest of (her|his) life, but the toddlerese language LT gradually disappears, or at least becomes latent, in parallel with the child's cognitive and physiological development away from the toddler state. The term "protovariant" is used to mark even more both this temporariness as well as its function as a base for the full-fledged language which shall unfold from LT in mid-childhood and later.

[margin: Age range of toddlerese]

More concretely, we define – for the purpose of this Thesis – toddlerese as the language LT emergent from the child's interactions with the world within the temporal interval (0, 2;6) years, id est between birth and two and a half years of age[1].

[1] In order to facilitate bridging between computer science and developmental psycholinguistics, we shall not use the decimal notation, but a year;month;week notation to speak about the child's age (e.g. 2;3;1 when speaking about a child which is two years, 3 months and one week old).

[margin: Of repetitivity and reproduction]

Another important notion included in the definition of «toddlerese» is repetitivity. The repetition of a symbol S can be understood as a sort of «reproduction» along the temporal axis, and in the following chapters we shall often interpret phenomena which repeat themselves not only as reactivations of the original schema, but rather in terms of the activity of multiple schemas which are being reproduced. We repeat; we restate; we reiterate: at a certain LoA, repetition can be understood as a form of reproduction.

[margin: The mirror metaphor]

But the most important terms of definition 5.1 are those of «transfer» and «exchange». Initially, these terms seem to denote divergent concepts: the term «transfer» carries with itself the connotation of a somewhat unidirectional movement from the origin (the mind of the parent) to the destination (the mind of the baby), while the term «exchange» denotes a bidirectional process whereby neither of the interactors plays the dominant role and both dispose of the faculty to partially influence or fully transform the behaviour of the other. But they can be reconciled through the metaphor of a «mirror». At first sight, a mirror is a completely passive device simply reflecting the objects which project (transfer) their shapes onto its surface. But by the very fact that the «mirror mirrors», it has also the power to influence the behaviour of the one who is looking into it, and thus to enter into exchange with the other. It is important to realize that since mirrors can be constructed differently, they can mirror things differently – the image they offer in exchange is thus not only dependent upon the-object-they-reflect, but is also determined by the material and by the way the mirror was physically forged[2].

[2] By interpreting the «tabula rasa» hypothesis as a particular case of the mirror metaphor hereby introduced, one could partially align the empirist and nativist doctrines.

[margin: Child mirrors its parents]

Something similar holds, mutatis mutandis, when it comes to the transfer of linguistic competence from the parent to the child. By means of diverse neural mechanisms (e.g. «mirror neurons» (Rizzolatti et al., 2008)), the child's plastic brain assimilates information from its environment. Among the objects of such assimilation we also count
The child somehow «parses» such information, processes, understands it and acts accordingly. This action is subsequently projected into external environment by diverse means – most prominent of which are undoubtably child’s vocal tract and child’s facial expressions – and by these means is the very environment transformed. Minds of parents including. We precise that by introducing the metaphor of the mirror we do not, of course, want to state that child is just a receptive informationassimilating entity passively reflecting its external environment. Such a statement would be completely contradictory to the fact of ceaseless activity which every healthy child continuously demonstrates. This fact of childs activity being in fact so salient, we propose to integrate it in the very definition of what the term «child » means: 5.2 child (def) Child plays. end child 5.2 It is by game that child mirrors the world; by playing the game which is pure activity without finality. Child sees around (her|him)self the world in movement, then understands that (s)he can also move and thus (s)he moves. Child’s way of mirroring is thus principially mirroring by playful action and it is by playful action that the child exerts influence in and upon its environment3 . Pages which shall follow, and notably the Part iii, shall furnish further illustrations of what we mean by «playful action» in regards to both language learning and evolutionary tâtonnement. Other computational language games shall also be introduced, mostly in form of programs able to induce sets of classes (10.4.7) or transcription rules (??) from diverse textual corpora. All programs shall apply the principles inherent to « evolutionary computing » in order to furnish some data validating (or falsifying) the hypothesis S3 T. On the other hand, none of the programs will be able to account for phonetic or pragmatic layers of languages under study. For this reason we are obliged to delimit, for the last time, the scope of our Thesis. end softest thesis 5 3 Notions of «game» and «playfulness» are not the same for adults and children. Adults often consider as hazardous activities which children consider as a game and vice versa, children often consider as serious the sandbox activities which are not at all perceived as such by adults. The transfer of adequate categories «game» and «serious» is an important goal of socialisation and possibly learning in general. 19 6 O P E R AT I O N A L T H E S I S The Operational Thesis (OT) is defined as follows: «Learning of toddlerese from its textual representations can be successfully simulated by means of evolutionary computation.» OT is thus very similar to the softest thesis, the only difference being the specification of the modality of representation of inputs in confrontation with which the toddlerese is supposed to be learnable, in simulation, by means of evolutionary computation. It is precised that such learnable modality is «textual». 6.1 text (def) Sequence of discrete graphemic symbols representing morphosyntactic and semantic contents of natural language utterances. end text Text does not have phonetic, prosodic and pragmatic layers. 6.1 This definition principially states that text encodes only subset of information which a normal « hearable » utterance contains. That is : semantic information related to its meaning and sense, and morphpsyntactic information related to its grammatical composition. C.f. sections 9.2.3 and 9.2.4 for discussion of «morphosyntax» and «semantics» respectively. 
By specifying the modality of data with which it shall operate, OT has drastically reduced the scope of applicability of the softest thesis. More concretely, by defining «text» as the modality of representation with which we shall confront our computational models, we have left aside the phonetic, phonologic, prosodic and pragmatic aspects of language. That is, aspects of language which have been - during practically all human history - crucial whenever the «speaker» intended to pass information to the «hearer». It is only during the last few centuries that communication by means of text became prominent, and only within the last decades has it become dominant, mainly because of the increasing role of computers in our lives. This is at least partially so because computers are essentially machines built for processing of sequences of discrete symbols, and that is what a text is – a sequence of discrete symbols. Contrary to the flux of spoken language, which is also a sequence, but one composed of units whose boundaries are often unclear and whose features overlap.

But the fact that practically no prosodic1, phonetic or pragmatic information shall be involved in our computational simulations does not mean that these simulations will not be concerned with natural language. On the contrary – it is evident, from the experience of every reader, that text indeed is a «communication system which human beings use to express information in a way comprehensible to other human beings» (4.1). In other terms: if the message is clear and if the productive linguistic competence of the writer overlaps with the receptive linguistic competence of the reader, the message shall make it possible for the reader to understand the writer. In this sense, text can be considered as a valid and functional modality of representation of natural language.

However, the question «Can text be also considered as a modality of representation sufficient for learning of language, and most notably of first language?» is still an open one. While some existing computational models indicate that at least for certain subproblems of language learning, like POS-induction (10.5) and grammar induction (10.6), the answer can be «yes», empiric observations of first language learning of human children also suggest that prosody and phonology play a crucial role (9.2.1) and that to ignore them would mean to miss a crucial component of the language learning process. But since children who are deaf, and thus without any access whatsoever to prosody or phonology, are able to learn sign language - and since sign language resembles text in the sense that it is visual and sequential - the operational reduction of language to text is potentially not a completely unreasonable one.

Thus, the operational definition language → text shall be principally used in sections dedicated to computational simulations of language learning. In other sections, however, this reduction shall not be applied and language will be most often discussed in its full extent, i.e. involving its phonetic, prosodic and pragmatic facets.

end operational thesis 6

1 One can argue that exclamation (!) or question (?) signs add a certain prosody to text since they can possibly represent an increasing or decreasing tone or accent. This is, however, debatable, because prosodic cues are present «along» the whole utterance while punctuation signs are normally located only at the sentence-final position.
7 SUMMA I

In the preceding sections we have introduced multiple theses which we consider as valid. These theses were discussed in deductive order, i.e. from the most general to the most specific one. The discussion started with the initial thesis «mind evolves» and the definition of mind as an «auto-organising set of structures and processes». Because such a thesis is so general that one may suspect that it is in fact more a tautological statement-of-faith than a verifiable hypothesis, a so-called Hard Thesis was subsequently introduced, stating that «learning is a form of evolution». Learning was principally defined as an information-processing constructionist process and it was further specified that the term «evolution» is meant in a Darwin-consistent sense, i.e. as an adaptive process based on reproduction, variation and selection of information-carrying structures. What was not yet explicitly said, however, is that both evolution and learning share an important feature: they involve trials and errors.

7.1 trial and error (def)
Most fundamental heuristics based on repetitive confrontation of system's activity with external and internal constraints and demands.
end trial and error 7.1

It is generally believed that in learning, trial events are related to other trial events only in a serial, vertical manner - one trial follows another one in time. On the other hand, in an evolutionary process, trials are related to other trials not only in a serial (i.e. one generation follows another) but also in a parallel (i.e. a generation consists of multiple individuals) manner. The principal sense of the Hard Thesis is to state that such a distinction is illusory and that the learning process almost always involves a sort of horizontality, a sort of population of parallelly co-existing structures which underlie and determine the observable manifestation of an individual "trial". What is more, HT postulates that, as in evolution, so in learning are such individual structures endowed with the faculty to reproduce the information which they encode into another locus. It was further postulated that

1. if ever a stochastic phenomenon can cause variation of the information content of an individual entity E generated by the reproduction process

2. if ever the information encoded by entity E influences the amount of resources consumed during the reproduction

3. and if ever such multi-iterative reproduction occurs within an environment having only a finite amount of resources

then, with logical necessity, a sort of adaptation of entities to their environment shall follow.

After proposing four conditions under which HT can be considered as plausible, it was further discussed whether the human brain could be potentially considered as such an "environment" for a sort of intracerebral evolutionary process. The brain was primarily defined as a finite physical object storing information in a non-local way. As a physical system, brain is subordinated to laws of physics like the 2nd law of thermodynamics: brain generates heat and heat can, with non-zero probability, cause variation of its own material content. Such variation of matter could subsequently result in variation of the information which the brain encodes. Thus, the very fact that brain is a finite physical system implies that the third and fourth conditions of HT's validity - when related to the learning faculty of the human brain - are to be considered as fulfilled.
Much more problematic are conditions 1 and 2 of HT's validity, relating to the question "does brain contain information-encoding structures able to reproduce?". Since reproduction of information-encoding entities has not yet been directly and irrefutably observed within the brain, conservative scientists are often reluctant to answer such a question in the affirmative. On the other hand, an "alternative" (2.11) explanation of the well-observed phenomenon of non-local storage of information implies that a process resembling reproduction - a process copying information to multiple loci - could, indeed, take place within the brain's wetware. It was also suggested that in natural ensembles like organisms, species or even societies, redundancy of information often makes systems more robust against unpredictable perturbations, and it was suggested that the same "robustness through redundancy" principle holds, mutatis mutandis, also for the human mind.

Unfortunately, the questions raised by HT are too wide to be addressed, in the extent they merit, in the limited scope of this dissertation. For this reason, the Hard Thesis is reduced into the soft form which states that learning can be simulated by means of evolutionary computation. ST thus does not postulate the ontological adequacy between the natures of evolutionary and learning processes - it simply postulates that computational models of the former can successfully simulate the latter. The notion of successful simulation was defined in terms of isomorphism between the input-to-output mapping of the simulation and the stimulus-to-reaction mapping of the process-which-is-simulated. The need to create not only externally but also internally adequate computational models of human faculties was also discussed. By introducing the notion of cognitive plausibility, we have proposed to focus not only on the result but also on the path which leads to the attainment of the result (Section 3.3).

Thus, when considering the realm of machines, ST postulates that there exists at least a certain class of problems - usually solved by means of traditional "machine learning" techniques - which could be also solved by means of evolutionary computation with similar or better results, and whose manner of functioning resembles the manner of functioning of the system which is simulated. A so-called Softer Thesis has subsequently specified that learning of natural languages is such a problem. Natural language was defined in a most liberal way as a "communication system which human beings use to express information to other human beings" (Section 4.1). Natural languages were chosen as the topic of our interest for three principal reasons: Primo, natural languages are well-studied. Secundo, natural languages are thematized, in one way or another, by all cognitive sciences. Tertio, the canonical (Turing's) method to answer AI's central question "Can machines think?" is principally a test evaluating machine's mastery in simulation of understanding and production of natural language utterances and discourses. Since the expression "learning of natural language" can cover too many phenomena, the S2T is further transformed into the Softest Thesis (S3T) which speaks only about the "learning of first language". First language is defined as a communication system transferred from the mind of the parent into the mind of a child by means of repetitive exchange of sequences of symbols.
The sequential and repetitive nature of first language was discussed, and the apparent contradiction between unidirectional "transfer" and bidirectional "exchange" was subsequently reconciled by means of the "mirror metaphor". Human child was defined in terms of its most distinctive propensity, i.e. the propensity to "play", to execute activity which lacks an absolute finality. The last thesis which has been presented is the Operational one. This specifies that the modality of representation, with which the "first language learning" evolutionary computation algorithms will be confronted, shall be textual. Given that text includes practically no phonetic, prosodic or pragmatic layer, the complexity of first language learning from text could be substantially reduced. The question whether such a reduction is not too strict was also addressed.

By positing 6 theses of varying degree of universality - i.e. Initial, Hard, Soft, Softer, Softest and Operational - we have delimited the level at which the rest of this dissertation shall operate. By defining terms like evolution, learning, form, brain, 2nd law of thermodynamics, evolutionary computation, successful simulation, cognitive plausibility, natural language, first language, child and trial & error, we have demarcated the basic form of a prism - a theory - through which one could see that the theses we posited hereby are, indeed, valid. This theoretical prism shall be polished in the following part.

Part II
PARADIGMS

Ideas are never static but develop across time and context, constantly cross-fertilizing with other currents of thought.
— Edwin F. Bryant

This part shall start the crossover of three seemingly unrelated scientific paradigms. In its initial chapter, devoted to Universal Darwinism, we shall introduce scientific disciplines and their respective theories which are either derived from - or at least consistent with - the Darwinian Theory of evolution, understood as gradual development of populations of information-carrying structures. Thus, not only biological evolution shall be discussed, but also evolutionary and genetic epistemology and psychology, memetics, neural darwinism and different branches and sub-branches of evolutionary computation.

In the subsequent chapter, devoted to Developmental Psycholinguistics, we shall introduce the fascinating field of study of acquisition of first language by human children. After defining a few necessary notions, we shall bring to the reader's attention a few widely accepted facts and give a brief historical overview of the most important language-acquisition theories. More concretely: associationist, behaviorist, nativist, constructivist and sociopragmatic theories shall be mentioned and thematised.

The last chapter of this part shall invite the reader into the realm of Computational Linguistics and Natural Language Processing. After a brief introduction into Formal Language Theory and its Grammar Systems Theory variant, the discussion shall be focused on the computational problems of concept construction, part-of-speech induction and grammatical inference. Some state-of-the-art computational models aiming to solve these problems shall be described in closer detail in order to pave the theoretical path towards future evolutionary models of first language acquisition.
8 UNIVERSAL DARWINISM

Universal Darwinism (UD) is a scientific paradigm regrouping diverse scientific theories extending the Darwinian theory of evolution and natural selection (Darwin, 1859) beyond the domain of biology. It can be understood as a generalized theoretical framework aiming to explain the emergence of many complex phenomena in terms of the interaction of three basic processes:

1. variation
2. selection
3. retention

According to the UD paradigm, interaction of these three components yields a «universal algorithm valid not only in biology, but in all domains of knowledge where we can extract informational entities – replicators, which are able to reproduce themselves with variations and which are subjects to selection» (Kvasnicka and Pospichal, 2007). This generic algorithm is nothing else than traditional Evolutionary Theory (ET) which, when considered as substrate-neutral, can be applied to such a vast number of scientific fields that it has been compared to a kind of «universal acid» which «eats through just about every traditional concept, and leaves in its wake a revolutionized world-view, with most of the old landmarks still recognizable, but transformed in fundamental ways» (Dennett, 1995).

UD is a source of both theoretical inspiration and practical precepts for many scientific disciplines, technological methods or artistic endeavours. The most prominent include:

1. biology
2. evolutionary art, e. psychology, e. music, e. linguistics, e. ethics, e. economics, e. anthropology, e. epistemology, e. computation
3. sociobiology (Wilson, 2000)
4. memetics (Blackmore, 2000)
5. quantum darwinism, neural darwinism, psycho darwinism
6. artificial life

et caetera. We shall now discuss some of them.

8.1 biological evolution

Evolutionary Theory was born when young Charles Darwin realised that the «gradation and diversity of structure» (Darwin and Bettany, 1890) which he had encountered among the mockingbirds of the Galapagos islands could be explained by a natural tendency of species to «adapt to a changing world». In parallel to Darwin's work, which was gradually clarifying the terms of variability and its close relation to environment-originated selective pressures, Gregor Mendel was assessing statistical distributions of colours of flowers of his garden peas in Brno, in order to finally converge to the fundamental principles of heredity. But it was only in 1953 that the double-helix structure of the material substrate of heredity of biological species – the DNA molecule – was described in the article of Watson et al. (1953).

In simple terms: in the DNA molecule, information is encoded as a sequence of nucleotides. Every nucleotide can contain one of four nucleobases and thus ideally carries 2 bits of information. A continuous sequence of three nucleotides gives a «triplet» which, when interpreted by the intracellular «ribosome» machinery, can be «translated» into an amino acid. Sequences of amino acids yield proteins which interact one with another in biochemical cascades. The result is a living organism with its particular phenotype, aiming to reproduce its genetic code.
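The informatic essence of this encoding can be illustrated by a minimal PERL sketch of the triplet-to-amino-acid «translation». The codon table below contains only five of the 64 mappings of the real genetic code, and the «gene» is a toy sequence of our own invention:

    # A fragment of the standard genetic code: every «triplet» of
    # nucleobases encodes one amino acid or a STOP signal.
    my %codon_table = (
        ATG => 'Met',     # methionine, also the canonical start codon
        TTT => 'Phe',     # phenylalanine
        GGC => 'Gly',     # glycine
        AAA => 'Lys',     # lysine
        TAA => 'STOP',    # one of the three stop codons
    );

    my $dna = 'ATGTTTGGCAAATAA';    # toy gene: 5 triplets, 15 nucleotides

    # Read the sequence triplet by triplet and «translate» it, as the
    # ribosome machinery does, into a chain of amino acids.
    my @protein;
    while ( $dna =~ /(...)/g ) {
        my $amino = $codon_table{$1} or last;
        last if $amino eq 'STOP';
        push @protein, $amino;
    }
    print join( '-', @protein ), "\n";    # prints: Met-Phe-Gly-Lys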
If, at a given time T, there are two organisms A and B whose genetic codes differ to such an extent that their phenotypes differ, and if ever the phenotype of organism A augments the probability of A's survival and reproduction in the external world W while B's phenotype diminishes such a probability, we say that A is better adapted to world W than B, or, more formally, that fitness(A) > fitness(B). Evolutionary Theory postulates that in case there is a lack of resources in world W, descendants of the organism B shall be gradually, after multiple generations, substituted by descendants of the more fit organism A. This is so because during every act of reproduction, the material reason for having a more fit phenotype – the DNA molecule – is transferred from parent to offspring, and the whole process is cumulative across generations.

It can, however, happen that the world W changes. Or a random (stochastic) event – a gamma ray, the presence of a free radical - can occur which would tamper with A's genetic code. Such an event – called «mutation» - shall result, in the majority of cases, in a decrease of A's fitness. Rarely, however, can mutations also increase it.

Another event which can transform the genetic sequence is called «crossover». It can be formalised as an operator which substitutes one part of the genetic code of the organism A with the corresponding sequence of organism B, and vice versa, the part of B with the corresponding part of A.

Figure 4: One-point and two-point crossovers. Figures reproduced from Morgan (1916).

It is indeed especially the crossover operation, first described in the article of Morgan (1916), which is responsible for the «mixing of properties» in case of a child organism issued from two parent organisms. In more concrete terms: the genetic code of such «diploid» organisms is always stored in X pairs of chromosomes. Each chromosome in the pair is issued from either the father or the mother organism which, during the process of meiosis, divide their normally diploid cells into haploid gamete cells (i.e. sperms in case of the father and eggs in case of the mother). It is especially during the first meiotic phase that crossover occurs, the content of the DNA sequences of two grand-parents being mixed and mapped during the crossover operation into the chromosome contained in the gamete which, if lucky, shall fuse with the gamete of another parent in the act of fecundation. The resulting «zygote» is again diploid and contains a mix of fragments of the genetic code originally present in the cells of all four grand-parents of the nascent organism. The zygote subsequently exponentially divides into a growing number of cells which differentiate from each other according to instructions contained in the genetic code, instructions which are triggered by biochemical signals coming from the cell's internal and external environment. If the genetic code endows the organism with properties that will allow it to survive in its environment until its own reproduction, approximately half of the genetic information contained in its DNA shall be transferred to the offspring organism. If not, the information as such shall disappear from the population with the death of the last individual who carries it.

end biological evolution 8.1

8.2 evolutionary psychology

We have already quoted Darwin's statement asserting that psychology in the distant future shall "be based upon a new foundation of the necessary acquirement of each mental power and capacity by gradation".
While two possible interpretations of this idea of Darwin's exist, the discipline of Evolutionary Psychology (EP) focuses only on the first one. It aims to explain diverse faculties of human soul & mind in terms of the selective pressures which moulded the modular architecture of the human brain during millions of years of its phylogenetic history. Its central premises state: «The brain's adaptive mechanisms were shaped by natural and sexual selection. Different neural mechanisms are specialized for solving problems in humanity's evolutionary past.» (Cosmides and Tooby, 1997)

In more concrete terms, Evolutionary Psychology explains quite successfully phenomena as diverse as the emergence of cooperation and altruistic behaviour (Hamilton, 1963); male promiscuity and parental investment (Trivers, 1972); or even the obesity of the current Anglo-Saxon population (Barrett, 2007). All this and much more is explained as a result of adaptation of homo sapiens sapiens (and its biological ancestors) to the dynamism of its ever-changing ecological and social niche. Thus, in the long run, EP tends to explain and integrate all innate faculties of the human mind in the evolutionary framework.

The problem with EP, however, is that in its grandiose aim to «assemble out of the disjointed, fragmentary, and mutually contradictory human disciplines a single, logically integrated research framework for the psychological, social, and behavioral sciences» (Cosmides and Tooby, 1997), it can sometimes happen that EP posits as innate, and thus explainable in terms of biological natural selection, cognitive faculties which are not innate but acquired. Thus it may be more often than rarely the case that whenever it comes to the famous "nature vs. nurture" (Galton, 1875) controversy, evolutionary psychologists tend to defend the nativist cause even where doing so means committing an epistemological fallacy1.

And what makes things even worse for the discipline of Evolutionary Psychology as it is currently performed is that the aforementioned Darwin's precognition has, aside from the nativist & biological one, also another interpretation. Id est, when Darwin spoke about mental powers and capacities acquired by gradation, one cannot exclude that he was speaking not only about gradation in the phylogeny of species, but also in the ontogeny of an individual.

end evolutionary psychology 8.2

1 If ever we accept the notion of falsifiability as an important criterion of acceptance or rejection of a scientific hypothesis (Popper, 1972), many hypotheses issued from EP would have to be rejected because, being based in the distant past which is almost impossible to access, they are less falsifiable than hypotheses explaining the same phenomena in terms of empiric data observable in the present.

8.3 memetics

The theory of memes, or memetics, is, in a certain sense, a counter-reaction to Evolutionary Psychology's aims to explain human mental and cognitive faculties in terms of innate propensities. Similarly to EP, memetics is also issued from the discipline of sociobiology, which was supposed to be «The extension of population biology and evolutionary theory to social organization» (Wilson, 2000).
But contrary to both EP and sociobiology, memetics does not aim to explain diverse cultural, psychological or social phenomena solely in terms of evolution operating upon biochemical DNA-encoded genes, but also in terms of evolution being realised on the plane of more abstract information-carrying replicators which Dawkins (1976) named «memes». The basic definition of the classical memetic theory is: «Meme is a replicator which replicates from brain to brain by means of imitation» (Blackmore, 2000). These replicators are somehow represented in the host brain as some kind of «cognitive structure» and, if ever externalised by the host organism – no matter whether in the form of a word, a song, a behavioral schema or an artefact – they can get copied into another host organism endowed with the device to integrate such structures2.

Similarly to genes, which often network themselves into mutually supporting auto-catalytic networks (Kauffman, 1995), memes can also form more complex memetic complexes, «memplexes», in order to augment the probability of their survival in time. Memes can thus do informational crossovers with one another (syncretic religions, new recipes from old ingredients or DJ mixes can be nice examples of such memetic crossover) or they can simply mutate, either because of the noise present during the imitation (replication) process, or due to other decay factors related to the ways how active memes are ultimately stored in brains or other information processing devices.

Memetic theory postulates that the cumulative evolutionary process applied upon such reproduction of information-carrying structures BETWEEN minds shall ultimately lead to the emergence of such complex phenomena as culture, religion or language. It can be thus considered to be mainly a theory of inter-mental reproduction of information. In complementarity with such a view, this dissertation claims the existence of reproduction of information WITHIN the individual mind. Thus, the theory hereby presented can be labeled as a theory of intra-mental memetics.

end memetics 8.3

2 In neurobiological terms, the faculty to imitate and hence to integrate memes from the external environment is often associated with «mirror neurons».

8.4 evolutionary epistemology

Epistemology is a philosophical discipline concerned with the source, nature, scope, existence and diversity of forms of knowledge. Evolutionary epistemology (EE) is a paradigm which aims to explain these by applying the evolutionary framework. But under the one EE label, at least two distinct topics are, in fact, addressed:

1. EE1 aims to explain the biological evolution of cognitive and mental faculties in humans and animals

2. EE2 postulates that knowledge itself evolves by selection and variation

EE1 can be thus considered as a sub-discipline of EP (Section 8.2) and, as such, is subject to EP-directed criticism. EE2, however, is closer to memetics since it postulates the existence of a second replicator, i.e. of an information-carrying structure which is not materially encoded by a DNA molecule.

The distinction between EE1 and EE2 can also be characterised in terms of «phylogeny» and «ontogeny». Given the definition of phylogeny as

8.4.1 phylogeny (def)
Process which shapes the form of a species.
end phylogeny 8.4.1
and contrasting it to ontogeny defined as

8.4.2 ontogeny (def)
Process which shapes the form of an individual.
end ontogeny 8.4.2

we find it important to reiterate that while EE1 is more concerned with knowledge as a result of the phylogenetic moulding of DNA, EE2 implies the moulding of non-DNA replicators in both phylogeny and ontogeny. Thus, the notion of EE2 can be subsequently analysed into two sub-notions:

• EE2-1: Knowledge can emerge by variation & selection of ideas shared by a group of mutually interacting individuals (Popper, 1972)

• EE2-2: Knowledge can emerge by variation & selection of cognitive structures within one individuum

This distinction is homologous to the distinction between inter- and intra-mental memetics, as discussed in Section 8.3.

It is worth noting that while the so-called recapitulation theory stating that «ontogeny recapitulates phylogeny» (Haeckel, 1879) is considered to be discredited by many biologists and embryologists, it is still held as valid by many researchers in human and cognitive sciences. In anthropology, for example, some scientists observe a «strong parallelism between cognitive development of a child and ... stages suggested in the archeological record» (Foiter, 2002). Also in relation to pedagogy it was observed that «education is a repetition of civilization in little» (Spencer, 1894).

8.4.3 individual creativity

In fact, evolutionary epistemology was born with the attempt of D.T. Campbell to explain both creative thinking and scientific discovery in terms of «blind variation and selective retention» (Campbell, 1960) of thoughts. Departing from the introspective works of the mathematician Henri Poincaré, who stated: «To create consists precisely in not making useless combinations and in making those which are useful and which are only a small minority. Invention is discernment, choice... Among chosen combinations the most fertile will often be those formed of elements drawn from domains which are far apart... What is the cause that, among the thousand products of our unconscious activity, some are called to pass the threshold, while others remain below?» (Poincaré, 1908), Campbell suggests that what we call creative thought can be described as a Darwinian process whereby previously acquired knowledge blindly varies in the unconscious mind of the creative thinker, and only some such structures are subsequently selectively retained.

The theory which interprets the creative process as an evolutionary one has been subsequently developed by Dean Simonton, who answers his rhetorical question "How do human beings create variations?" with a UD-consistent answer: «One perfectly good Darwinian explanation would be that the variations themselves arise from a cognitive variation-selection process that occurs within the individual brain.» (Simonton, 1999)

end individual creativity 8.4.3

8.4.4 genetic epistemology

«The fundamental hypothesis of genetic epistemology is that there is a parallelism between the progress made in ... organization of knowledge and the corresponding formative psychological processes. Well, now, if that is our hypothesis, what will be our field of study? Of course the most fruitful, most obvious field of study would be reconstituting human history: the history of human thinking in prehistoric man. Unfortunately, we are not very well informed about the psychology of Neanderthal man or about the psychology of Homo siniensis of Teilhard de Chardin.
Since this field of biogenesis is not available to us, we shall do as biologists do and turn to ontogenesis. Nothing could be more accessible to study than the ontogenesis of these notions. There are children all around us.» (Piaget, 1974)

When understood only superficially, Piaget's developmental theory of knowledge, which he himself called Genetic Epistemology (GE), may seem to be utterly non-Darwinian. Its concern is not the phylogeny of the human species; it is not even concerned with biochemical genes. In fact, during practically all his fecund life-long research, Piaget had focused solely on the study of the ontogeny of diverse cognitive faculties in human children. Thus, Piaget uses the term «genetic» to refer to a more general notion of «heredity», defined as a structure's tendency to guard its identity through time. These structures, which he called «schemas», can be defined as «a basic set of experiences and knowledge that has been gained through personal experiences that define how things should be and act in the person's environment. As the child interacts with their world and acquires more experiences these schemes are modified to make sense, or used to make sense of the new experience.» (Bee and Boyd, 2000)

There are basically two ways how such schemes can be modified. Either they «assimilate» data from the external environment. Or, if ever such assimilation is not possible - because the child's cognitive system simply cannot match the perceived external datum with an internal pre-existing category - the process of «accommodation» takes place, which transforms the internal category to match the external datum. Ultimately, the set of schemes gets so out-dated or so altered by past modifications that it is not useful anymore. Whenever such an «equilibration» occurs, the old set of schemas is rejected and the child tends to «start fresh with a more up-to-date model» (Bee and Boyd, 2000), thus attaining a new substage or stage of its development.

In the Piagetian system – which is based on very precise yet exhaustive observations of dozens of children, including his own – the order of stages is fixed and it is very difficult, or even fully impossible, for the evolving psyche to attain the pre-operational stage 2 or the concrete operational stage 3 if it has not mastered all that is to be mastered during the sensorimotor stage 1.

1. sensorimotor stage - repetitive but playful manipulation of objects without goal
2. egocentric stage - imitation of behavioral schemas of others without understanding of why it is done
3. cooperative stage - coordination of one's activity with one's environment
4. autonomous stage - understanding of procedures which allow to change rules governing one's environment

Given that the GE paradigm involves

• heredity – schemes tend to keep their identity in time
• variation – schemes are altered by the environment-driven assimilation or accommodation3
• selective pressures – only those schemas which are most well adapted to the environment and/or form the most functionally fit complexes with other schemas shall pass through the equilibration milestone

it can be briefly stated that Piaget's GE could be aligned with ET and UD. And what is more, it may be the case that the notion of Piagetian stages is consistent with the notion of attractors or locally optimal states whose emergence is, according to complex system theory (Kauffman, 1995; Flake, 1998), inevitable in a system as complex as child's brain, mind and psyche definitely is.

3 Note that in terms of EC, one can relate the Piagetian notion of assimilation to an operator of local variation which attracts the cognitive system to a locally optimal agreement with its environment, while accommodation suggests an interpretation in terms of more global variation operators (like cross-over), potentially allowing the CS to adapt to its physical and social environments in a more globally optimal way.
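The algorithmic skeleton of this assimilation/accommodation/equilibration cycle can be caricatured in a few lines of PERL. The sketch below is merely a toy model of our own making - schemas are reduced to numeric prototypes, «matching» to a fixed tolerance, and equilibration to a strain counter - but it makes the Piagetian loop concrete:

    # Toy Piagetian schema: a numeric prototype plus a count of how often
    # it had to be accommodated (i.e. transformed to match reality).
    my @schemas    = ( { prototype => 10, strain => 0 } );
    my $TOLERANCE  = 3;    # how far a datum may lie from a prototype
    my $MAX_STRAIN = 2;    # accommodations tolerated before equilibration

    for my $datum ( 11, 9, 20, 22, 40 ) {
        my ($schema) = grep { abs( $_->{prototype} - $datum ) <= $TOLERANCE } @schemas;
        if ($schema) {
            # Assimilation: the datum fits a pre-existing category.
            print "assimilated $datum into schema $schema->{prototype}\n";
        }
        else {
            # Accommodation: transform the closest category to match the datum.
            my ($closest) = sort {
                abs( $a->{prototype} - $datum ) <=> abs( $b->{prototype} - $datum )
            } @schemas;
            $closest->{prototype} = ( $closest->{prototype} + $datum ) / 2;
            $closest->{strain}++;
            print "accommodated schema towards $datum (now $closest->{prototype})\n";

            # Equilibration: an over-strained set of schemas is rejected and
            # the child «starts fresh with a more up-to-date model».
            if ( $closest->{strain} > $MAX_STRAIN ) {
                @schemas = ( { prototype => $datum, strain => 0 } );
                print "equilibration: starting fresh around $datum\n";
            }
        }
    }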
end genetic epistemology 8.4.4
end evolutionary epistemology 8.4

8.5 evolutionary linguistics

Analogically to Evolutionary Epistemology, the objects of interest of EL subdivide it into at least two branches:

• EL1: the study of the origin and development of faculties related to comprehension and production of linguistic signal by homo sapiens sapiens and its ancestors

• EL2: the study of the historical development of diverse languages

EL1 can be thus considered to be closely related to Evolutionary Psychology (Section 8.2) and discusses phylogenetic evolutionary phenomena taking place during hundreds of thousands of years, while EL2 can be said to take place in historical time (order of ten thousand years and less) and is thus closely related to disciplines like anthropology, culturology, comparative grammar and memetics. In simple terms, EL2 is dedicated to the study of linguistic ethnogeny.

8.5.1 ethnogeny (def)
Process which shapes the form of a human community.
end ethnogeny 8.5.1

EL2's central tenet that "language changes in time" is far from being new. Socrates believed that «...the primeval words (πρώτα ονόματα) have already been buried by people who wanted to embellish them by adding and removing letters to make them sound better, and disfiguring them totally, either for aesthetic considerations or as a result of the passage of time...» (Plato, 80BC), and the best of Plato's students was well aware that change can be expressed in terms of

• insertion
• deletion
• transposition
• substitution

(Aristotle, 42BC). Ancient syntacticians like Apollonius Dyscolus could subsequently apply such notions to describe particular linguistic phenomena (Householder, 1981).

It was, however, only centuries later that men of science realized that language change is far from being a linear degeneration of a primordial ideal, as the ancients had mostly believed. On the contrary: Sir William Jones discovered that Sanskrit is so similar to the Greek, Celtic, Gothic and Latin languages that they must all have «sprung from some common source, which perhaps no longer exists» (Jones, 1788). The subsequent realization that these similarities make it possible to cluster languages into hierarchical taxonomies, combined with the trivial fact that languages exchange their internal contents (e.g. word-borrowing), all this has led to an ever stronger belief that languages can be studied as living entities. Darwin himself was well aware of the parallelism between biology and linguistics: «The formation of different languages and of distinct species, and the proof that both have developed through a gradual process, are curiously parallel... We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation.» (Darwin, 1859)

Figure 5: Schleicher's Stammbaum of the family of Indo-European languages. Reproduced from Schleicher (1873).
Practically at the same time as Darwin was preparing the opus which was to change biology forever, the existence of such a parallelism was articulated on the linguistic side by Schleicher (1873) in his Stammbaum ("tree") theory of Indo-European languages. By publishing his theory, Schleicher had in fact triggered a completely new form of evolution - that is, the evolution of linguistic theories. In the dozen years that followed, the influx of articles related to the Stammbaumtheorie was so high that the Société linguistique de Paris decided, in 1866, to refuse any further articles on the subject. Which is somewhat a pity, because many theories which emerged during that period - for example "the wave theory" (Schmidt, 1872), taking into consideration not only temporal but also spatial (i.e. geographic) aspects of language spread - indeed anticipated the diffusion models which became prominent in biology only a century later.

One of the reasons for the Société's "ban" was the fact that languages, contrary to "biological species", do not leave fossil traces behind them, and therefore any endeavour to understand their distant past or even origin is only speculative and inconsistent with the empiric method of science. EL simply does not go well with the principle of scientific parsimony, the omnipresent Occam's razor. Notwithstanding this critique, which stays, we believe, as valid today as ever4, the advent of computers allowed EL2 to catch a second breath. An often criticized but nonetheless very important step in making EL computer-positive was the introduction of the "lexicostatistical" and glottochronological methodology originally based on cognate distance matrices (Swadesh, 1952). These numeric matrices, whose elements Mij denote the number of cognates - i.e. the number of similarly sounding words having the same meaning - subsequently allowed to computationally "discover" and (fals|ver)ify hypotheses concerning the kinship of existing or past languages. An article of Atkinson and Gray (2005), from which we partially reproduce Table 1 describing the parallelism between biological and linguistic evolution, offers a satisfactory introduction to some of EL2's state-of-the-art computational models, some of which pretend to unveil knowledge about the ancestry of languages as far back as the end of the last ice age (Pagel et al., 2013).

    biological evolution        linguistic evolution
    Discrete characters         Lexicon, syntax and phonology
    Homologies                  Cognates
    Mutation                    Innovation
    Horizontal gene transfer    Borrowing
    Hybrid plants               Creole languages

Table 1: Conceptual parallels between biological and linguistic evolution. Table partially reproduced from Atkinson and Gray (2005).

Section 8.7.5 shall discuss a so-called "Evolutionary Language Game" computational model. Since ELG addresses - and some may say that it also answers - the question "How may a coordinated system of sound-meaning mappings evolve ex nihilo in a community of mutually interacting agents?", it can be posited at the very border between EL1 and EL2. According to Pinker, who is one of the most famous proponents of the so-called "nativist" theory in developmental psycholinguistics (c.f. item 10.2), models like ELG «suggest ways of connecting the evolution of language to other topics in human evolution, allowing each to constrain the others» (Pinker, 2000).

4 C.f. the footnote in Section 8.2 or the citation of Piaget in Section 8.4.4 for other reformulations of the same critique.
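To give the lexicostatistical recipe a concrete shape, here is a minimal PERL sketch - our own illustration, not Swadesh's actual procedure - in which two meaning-aligned words count as «cognates» whenever their Levenshtein distance (the minimal number of insertions, deletions and substitutions transforming one word into the other) is small relative to their length; summing such pairs over a word list yields one element Mij of the cognate matrix. The word pairs are the Czech-Slovak examples which shall reappear below, and the 0.5 threshold is arbitrary:

    # Levenshtein distance: minimal number of insertions, deletions and
    # substitutions needed to turn string $s into string $t.
    sub levenshtein {
        my ( $s, $t ) = @_;
        my @d = ( [ 0 .. length $t ] );    # distance matrix, row 0 pre-filled
        for my $i ( 1 .. length $s ) {
            $d[$i][0] = $i;
            for my $j ( 1 .. length $t ) {
                my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
                my @candidates = (
                    $d[ $i - 1 ][$j] + 1,              # deletion
                    $d[$i][ $j - 1 ] + 1,              # insertion
                    $d[ $i - 1 ][ $j - 1 ] + $cost,    # substitution (or match)
                );
                ( $d[$i][$j] ) = sort { $a <=> $b } @candidates;
            }
        }
        return $d[ length $s ][ length $t ];
    }

    # Meaning-aligned toy word lists ("eight", "fog", "ant"):
    my @slovak = qw(osem hmla mravec);
    my @czech  = qw(osm mlha mravenec);

    # One element M_ij of the cognate matrix: the number of aligned pairs
    # whose relative edit distance falls under the cognacy threshold.
    my $cognates = 0;
    for my $k ( 0 .. $#slovak ) {
        my $dist = levenshtein( $slovak[$k], $czech[$k] );
        my $max  = length( $slovak[$k] ) >= length( $czech[$k] )
                 ? length( $slovak[$k] ) : length( $czech[$k] );
        $cognates++ if $dist / $max <= 0.5;
    }
    print "M(Slovak,Czech) = $cognates\n";    # all three pairs pass the threshold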
But there is another question related to the evolution of language which has not yet been sufficiently resolved by ELG nor by any other EL theory5. That is: "Why are languages subject to some types of changes and not to others?" Why indeed is the history of languages so full of insertions (e.g. "osm" in Czech and "osem" in Slovak), deletions (e.g. "mravenec" in Czech and "mravec" in Slovak), substitutions (e.g. all instances of what is the diphthong "ie" in Slovak are pronounced in Czech as the long vowel "í") and metathetic transpositions (e.g. "hmla" in Slovak and "mlha" in Czech)? Our answer to this question is as follows: because the changes observable in the ethnogeny of diverse languages, dialects and accents are, at their origin, triggered by "variation operators" inherent to every fundamental unit of any linguistic community, which is, of course, an individual human mind. Stated more simply: the reasons why language forms develop in the way they develop are to a great extent cognitive. Part iii shall present somewhat more concrete evidence of the activity of such operators of intramental variation, which potentially influence the process of language production in human children.

end evolutionary linguistics 8.5

5 We set aside the so-called neo-grammarian school of historical and comparativist philology, which believed that language change can be described in terms of sequences of universally applicable "laws which suffer no exceptions". We set them aside because we are strongly persuaded that evolution not only does suffer "exceptions" but, in fact, endorses them in order to be fully operational.

8.6 neural and mental darwinism

It was already the evolutionary biologist John Maynard-Smith who remarked that «there is a similarity between the dynamics of genetic selection and the operant conditioning paradigm of Skinner» (Maynard Smith, 1986)6. But it was only the book Neural Darwinism: The Theory of Neural Group Selection of the Nobel-prize winner Edelman (1987) which, as the first in the history of science, described in fine-grained detail how a process similar to evolution could be potentially instantiated within the human brain. Stated in one sentence, Edelman's theory postulates that «complex adaptations in the brain arise through process similar to natural selection» (Fernando et al., 2012). Stated in more fine-grained detail, the theory shows how epigenetically influenced interactions of "cell adhesion molecules" and "substrate adhesion molecules" can lead to the generation of a so-called primary repertoire. Synapses within diverse groups of this repertoire are subsequently, during postnatal ontogenesis, "differentially amplified" into a secondary repertoire by a process which is, according to Edelman, functionally equivalent to the process of selection as known in evolutionary theory. Edelman also believes that well-known processes like cell proliferation, cell migration, cell death, neurite branching or synaptic pruning are potentially also governed by analogic selective processes.

It is not possible for us to explain Edelman's tour de force in the limited scope of this section, and it would be, in fact, an act of scientific dishonesty to do so, since as computational linguists we do not feel competent to express any definite statement about truth or falsity in such an expert domain as neurology definitely is.

6 Skinner's behaviorist theory of verbal behaviour will be more closely discussed in 9.4.1.
But we nonetheless consider it important to emphasize that Edelman is definitely not alone in his view of things. Thus, for example, important authorities of the continental neurological tradition were not afraid to state that «the thesis we wish to defend...[is] that the production and storage of mental representations, including their chaining into meaningful propositions and the development of reasoning, can also be interpreted, by analogy, in variation-selection (Darwinian) terms within psychological time-scales.» (Dehaene and Changeux, 1989)

Figure 6: Possible mechanism of replication of patterns of synaptic connections between neuronal groups. Reproduced from Fernando et al. (2012).

It has to be noted, however, that both Edelman's "neural" and Dehaene's and Changeux's "mental" Darwinism describe processes fundamentally based on variation and selection, but not on replication, of information-encoding neuronal groups. It is a well-known fact that neural cells do not reproduce, and the possibility that the reproduction of neurons would yield a material basis for the existence of intracerebral replicators is thus a priori excluded. As is pointed out by Fernando et al. (2012), this fact in itself, however, does not mean that mental or neural darwinism are not evolutionary. They are evolutionary because one can postulate a sort of evolution for any system whose global development is governed by the famous Price's theorem (Price et al., 1970), which is - so the authors argue - also the case for the development of neuronal group structures.

The same authors also suggest a possible process of replication of information between neuronal groups. This process - fundamentally based upon a well-known form of Hebbian learning7 called "spike-timing dependent plasticity" (STDP) and upon the existence of a neural "topographic" map between the original replicans (circuit A) and the following replicandum (circuit B) - can be described as follows: «If a neuronal circuit exists in layer A and is externally stimulated to make its neurons spike, then due to a topographic map from layer A to layer B, neurons in layer B will experience similar spike pattern statistics as in layer A. If there is STDP in layer B between weakly connected neurons then this layer becomes a kind of causal inference machine that observes the spike input from layer A and tries to produce a circuit with the same connectivity, or at least that is capable of generating the same pattern of correlations.» (Fernando et al., 2012) The whole process is visualised on Figure 6.

While we strongly believe that such a mechanism does indeed operate in the human cortex, we reiterate what was already stated in 2.11: under the current state of knowledge, the existence of neural replicators is not indisputably demonstrated and stays speculative. But in regards to the overall objectives of this dissertation, this speculative nature of intracerebral replicators does NOT pose any hindrance. This is so because our aim is to apply evolutionary theory to explain linguistic phenomena. And linguistic phenomena are principally intangible, mental, high-order phenomena which are potentially irreducible to the tangible and physical phenomena labelled as "neural". On the other hand, it may be the case that a sort of theory of intramental evolution shall allow us to bridge the "explanatory gap" between the tangible and the intangible, the neural and the mental. Thus, for example, whenever we shall emit a hypothesis like "canonical babbling is a sort of replicatory process" (9.2.2), we thereby tacitly imply that neural mechanisms, like the one presented on Figure 6, are to be sought for in the Broca's area of one-year-old infants.

end neural darwinism 8.6

8.7 evolutionary computation

Evolution can be thought of as a universal, generic algorithm. But our growing knowledge of evolution serves not only descriptive and explanatory purposes: it is becoming normative. Thus, not only can «evolutionary theory» serve us to explain diverse phenomena around us, it can also be exploited for finding solutions to diverse problems. Many researchers in informatics have already realized that diverse "evolutionary precepts" offer useful heuristics making it possible to discover (quasi-)optimal ways out of a wide range of concrete practical issues.

7 The principle of Hebbian learning shall be more closely discussed in Section 9.4.1.
On the other hand, it may be the case that a sort of theory intramental evolution allow us to bridge the "explanatory gap" between tangible and intangible, neural and mental. Thus, for example, whenever we shall emit hypothesis like "canonical babbling is a sort of replicatory process" (9.2.2), we hereby tacitly imply that neural mechanisms, as the one presented on Figure 6, are to be sought-for in Broca’s area of one year old infants. end neural darwinism 8.6 8.7 Universal Algorithm evolutionary computation Evolution can be thought of as a universal, generic algorithm. But our growing knowledge of evolution serves not only descriptive and explanatory purposes. It is becoming normative. Thus, not only can « evolutionary theory » serve us to explain diverse phenomena around us, it can be also exploited for finding solutions to diverse problems. Many researchers in informatics have already realized that diverse of "evolutionary recepts" offer useful heuristics making it possible to discover (quasi)-optimal ways out of wide range of concrete practical isses. 7 Principle of Hebbian learning shall be more closely discussed in Section 9.4.1. 8.7 evolutionary computation Figure 7: Basic genetic algorithm schema. Reproduced from Pohlheim (1996) Evolutionary computing (3.1) approaches differ from classical optimization methods in following aspects : • using a population of potential solutions in their search • using probabilistic, rather than deterministic, transition rules » • using «fitness» instead of function derivatives Kennedy et al. (2001) First computational models which have the above-mentioned attributes were named « evolutionary strategies » by Rechenberg (1971), «genetic algorithms» by Holland (1975) and « evolutionary programming » by Fogel et al. (1966). These paradigms, along with the «genetic programming » paradigm later introduced by Koza (1992) constitute the most important sub-branches of «evolutionary computation» Sekaj (2005) branch of computer and informatic science. 8.7.1 genetic algorithms Basic principle of « genetic algorithms » is illustrated on Figure 7. GAs iteratively produce populations of data stuctures. Each individual data structure is a possible solution, population of every generation is thus a set of diverse solutions. Every individual solution is encoded as a vector of values (also called « chromosome » or « genome ») which can either vary or be copied verbatim from one to generation to the other. Designer choice related to the way how the problem solutions are encoded in chromosomal vectors, e.g. the type (Boolean ? Integer ? Float ? Set? ) of different elements of the vector is also a crucial one and can often determine whether the algorithm shall succeed or fail. In every generation – i.e. in every iteration of the algorithmic cycle represented by the circle on Figure 7 - all N individuals in the population are evaluated by the fitness function. Every individual thus obtains the « fitness » value, which subsequently governs the « selection » procedure choosing a subset of individuals from the current 41 42 universal darwinism generation as those, whose genetic information shall reproduce into next generations. More on fitness functions in 8.7.1. Another important design decision which every programmer of GAs have to do, is to choose the selection operator. An operator which is widely used, and which we shall also implement in all future EC simulations (c.f. 10.4.7, ?? & volume 2) is the «fitness proportionate selection». 
This operator, also called the «roulette wheel operator», normalizes the fitness $f_i$ of individual $i$ into the probability $p_i$ of its survival by means of the formula:

$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j}$

where N is the number of individuals in the population. Once these probabilities are calculated for the different individuals, one can use them to guide the process of selection of the individuals which shall be reproduced into the next generation. Minimal PERL source code for such a fitness proportional selection operator is:

Fitness Proportional Selection (SRC)

    sub fitness_to_proba {
        # Normalize raw fitness values into a probability distribution:
        # p_i = f_i / sum_j f_j
        my @weights = @_;
        my @dist    = ();
        my $total   = 0;
        foreach (@weights) { $total += $_; }
        for my $weight (@weights) {
            push @dist, $weight / $total;
        }
        return @dist;
    }

    sub weighted_rand {
        # Spin the «roulette wheel»: return index $i with probability $dist[$i].
        my @dist = @_;
        while (1) {    # retry protects against floating-point rounding at the tail
            my $rand = rand;
            my $i    = 0;
            for my $w (@dist) {
                return $i if ( $rand -= $w ) < 0;
                $i++;
            }
        }
    }

    # Usage example with three hypothetical fitness values:
    #   my @proba  = fitness_to_proba( 3, 1, 6 );   # (0.3, 0.1, 0.6)
    #   my $parent = weighted_rand(@proba);         # index 2 in ~60% of spins

end fitness proportional selection (src) 8.7.1.0

Another widely used selection operator is the so-called tournament selection, based on the repeated selection of the best individual of a randomly chosen subset of the population. The tournament selection operator offers multiple advantages: for example, by tuning the tournament size parameter one can easily adjust the selection pressures favoring or disfavoring fit candidates. And it can also be used in parallel computation scenarios.

Once the «most fit» candidates are selected by the selection operator, they are subsequently mutually recombined by means of «crossover» operators and/or modified by means of «mutation» operators. Many different types of selection, mutation and crossover operators exist. For the purpose of this work, let us just note that the probabilities of occurrence of mutation or crossover have to be fairly low, otherwise no fitness-increasing information could be transferred among generations and the whole system would tend to present non-converging chaotic behaviour (Nowak et al., 1999).

Another useful strategy, which guarantees that maximal fitness shall either increase or at least stay constant, is called elitism. In order to implement this strategy, one simply guards one (or more) individual(s) with the highest fitness unchanged for the next generation, thus protecting «the best ones» from variations which would, most probably, decrease rather than increase the fitness8. Yet another widely used approach reinforces the selection pressure by removal of the weakest individuals. Both the elitist «survival of the fittest» and the contrary «removal of the weakest» are often combined within the sequence of instructions which, altogether, form a genetic algorithm.

The selection of the most fit individuals from the old generation, their subsequent replication and/or recombination and diversification yields a new generation. Because individuals with lower fitness have been either completely or at least partially discarded by the selection process, one can expect that the overall fitness of the new generation shall be higher than the fitness of the old generation. With a little bit of luck, one can also hope that the most fit individuals of the new generation shall be a little bit fitter than the most fit individuals discovered in the old generation – this can happen if ever a «benign» mutation has occurred, i.e. a modification which has moved the individual from a lower point on the «fitness landscape» to a somewhat higher state.

end genetic algorithms 8.7.1

8 Note that in nature, elitism is often but not always the case. For it can happen that, due to stochastic factors, the most fit individuals die before they succeed to reproduce the information they encode. But, in such a case, are such individuals truly "the most fit"?

Fitness functions and fitness landscapes

The core component of every genetic algorithm is the objective «fitness function», able to attribute a cardinal value or an ordinal rank to any individuum in the population of potential solutions. In other terms, the fitness function yields the criterion according to which one candidate individuum is evaluated as a «more fit» solution, in regards to the problem under study, than other potential solutions present in the population. The choice of a good fitness function determines, more than anything else, the success or failure of a GA as a means to find the solution of the problem at hand. Ideally, the fitness function is a mathematical representation of the very essence of the problem which is to be solved.

For purely mathematical problems, the choice of the fitness function is straightforward - the fitness function is simply the function whose global optimum one wants to find. Also in many practical implementations - notably those of optimization of physical components - the fitness function is often evident: one can deduce it from well-established physical laws. But fitness functions for other problems are far from being certain. The "first language learning" which we aim to address in this dissertation belongs among such problems, since it is not trivial to answer the question: "which model of language is better (i.e. more fit): X, Y or Z?" Such an answer is strongly determined by the theoretical point of view one adopts: for an engineer preferring a sociopragmatic (9.4.4) or constructivist (9.4.3) theory of language acquisition, a model of language competence of a 12-month-old baby which generates utterances like "tato tek tete" would be considered more "fit" than a model generating utterances like "father, had your colorless green ideas slept furiously?" (Chomsky, 1957). Rather the contrary would be the case for an engineer who would decide to formalize his fitness function on the grounds of nativist (10.2) theories of language acquisition.

The notion of fitness landscape, first introduced by Wright (1932), is a metaphor useful for understanding, discussing and comparing diverse fitness functions. The landscape is depicted as a mountain range with peaks of varying height. The height at any point of the landscape corresponds to its fitness value; i.e. the higher the point, the greater the fitness of an individual represented by the given point of the landscape9. In such a representation, the evolution of the organism towards more and more «fit» forms can be depicted as a movement uphill, towards the closest peak (i.e. a local optimum) or towards the highest peak of the whole landscape (i.e. the global optimum). Figure 8 illustrates a fitness landscape of a very simple organism with only one gene (whose potential values are encoded by the illustration's X axis).

9 Note that to find an optimal solution of a problem with N variables, one has to look for it in an N-dimensional search space. This multi-dimensionality is what makes the search so difficult, since the number of possible solutions grows exponentially with the number of dimensions (i.e. variables of the problem).
end genetic algorithms 8.7.1 8 Note that in nature, elitism is often but not always the case. For it can happen that, due to stochastic factors, the most fit individuals die before they succeed to reproduce the information they encode. But, in such a case, are such individuals truly "the most fit"? 44 universal darwinism Fitness functions and fitness landscapes Of functional core What is fitness function? Fitness function as a design choice Fitness landscapes The core component of every genetic algorithm is the objective «fitness function» able to attribute a cardinal value or ordinal rank to any individum in the population of potential solutions. In other terms, the fitness function yields the criterium according to which one candidate individum is evaluated as «more fit» a solution, in regards to the problem under study, than other potential solutions present in the population. The choice of good fitness function determines, more than anything else, the success or failure of GA as a means to find the solution for the problem at hand. Ideally, the fitness function is a mathematical representation of the very essence of the problem which is to be solved. For purely mathematical problems, the choice of the fitness function is straightforward - fitness function is simply the function whose global optimum one wants to find. Also in many practical implementations - notably those of optimalization of physical components - the fitness function is also often evident: one can deduce it from well-established physical laws. But fitness functions for other problems are far from being certain. The "first language learning" which we aim to address in this dissertation belongs among such problem since it is not trivial to answer the question: "which model of language is better (i.e. more fit): X, Y or Z?". Such an answer is strongly determined by the theoretical point of view one adopts: an engineer prefering a sociopragmatic (9.4.4) or constructivist (9.4.3) theory of language acquisition, a model of language competence of 12-month old baby which generates utterances like "tato tek tete" would be considered to be more "fit" than model generating utterances like "father, had your colorless green ideas slept furiously?" (Chomsky, 1957). Rather contrary should be the case for an engineer who would decide to formalize his fitness function on the grounds of nativist (10.2) theories of language acquisition. The notion of fitness landscape, first introduced by Wright (1932) is a metaphor useful for understanding, discussion and comparison of diverse fitness functions. The landscape is depicted as a mountain range with peaks of varying height. The height at any point on the landscape corresponds to its fitness value; i.e. the higher the point, the greater the fitness of an individual represented by the given point of the landscape9 In such a representation, the evolution of the organism to more and more « fit » forms can be depicted as a movement uphill, towards the most closest peak (i.e. local optimum) or towards the highest peak of the whole landscape (i.e. global optimum). Figure 8 9 Note that to find an optimal solution of the problem with N variables, one has to look for it in the N dimensional search space. This multi-dimensionality is what makes the search so difficult since the number of possible solutions grows exponentially with the number of dimensions (i.e. variables of the problem). 
Figure 8 illustrates a fitness landscape of a very simple organism with only one gene (whose potential values are encoded by the illustration's X axis).

Figure 8: Possible fitness landscape for a problem with only one variable. Horizontal axis represents the gene's value, vertical axis represents fitness.

Every arrow on the figure represents one possible individual. Its length represents the variation which can be brought in by the mutation operator. The fact that individuals always tend to move « upwards » indicates that selection pressures are involved. It has to be added that without the implementation of the crossover operator, the globally optimal state (encoded by point C) could not be attained by individuals who did not originate on the slopes of C. Only some sort of crossover operator could ensure that individuals who attained the local optima (encoded by peaks A, B, D) could be mutually recombined (for example B with D) in a way that would allow them to leave the locally stable states and approach the globally optimal C. The fact that genetic algorithms, thanks to « crossover » operators, can combine two individuals from diverse sectors of the fitness landscape allows them to find solutions to problems where heuristics based on « gradient descent » would normally fail.

An important property of a fitness landscape is its "ruggedness". Some fitness functions can yield landscapes as flat as the Pannonian Plain: the algorithm will need a long time to find there a hill, if ever a hill there is. Others may yield landscapes as rugged as the mountains of northwest Vietnam: nothing is certain on such landscapes, where even the slightest mutation can produce a huge decrease or increase of fitness. Ideal landscapes are those which are rugged, but not too much: as on the slopes of the Himalayas, a steady progress towards some locally optimal – not necessarily the highest, but sufficiently high – vantage point can be assured. C.f. the NK theory introduced in Kauffman (1995) for further discussion of landscape ruggedness and of the ways how it can potentially be tuned.

end fitness functions and landscapes

Canonical Genetic Algorithms

A canonical genetic algorithm (CGA) is a genetic algorithm applied on populations (n-tuples) of binary strings (individuals) of length l. Each among the l bits is considered to be a gene, and each string of such genes is considered to be a potential solution of the problem which is to be solved. Given that the initial population is randomly generated, the CGA proceeds as follows:

Listing 1: Canonical Genetic Algorithm

    initialize the population
    determine the fitness of each individual
    perform selection
    repeat
        perform crossover
        perform mutation
        determine the fitness of each individual
        perform selection
    until some stopping criterion applies

In CGAs, fitness proportionate selection (8.7.1) is used as the selection operator. Mutation operates independently on each gene of each individual and consists of stochastic flipping of the current gene's value to its opposite. A "one-point crossover" (4) is most commonly used in CGAs; it consists of randomly choosing a section locus of the chromosome, dissecting the two selected parent individuals A and B along the section locus, and creating two children individuals C and D as concatenations of sections previously encoded in the two distinct parent organisms, i.e. $C = A_1 B_2$ and $D = B_1 A_2$.
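The two canonical variation operators just described can be sketched in a few lines of Perl; the helper names are ours, and the individuals are assumed to be bit-strings of equal length:

    sub one_point_crossover {
        my ($parent_a, $parent_b) = @_;
        my $locus = 1 + int rand(length($parent_a) - 1);   # random section locus
        my $child_c = substr($parent_a, 0, $locus) . substr($parent_b, $locus);  # C = A1 B2
        my $child_d = substr($parent_b, 0, $locus) . substr($parent_a, $locus);  # D = B1 A2
        return ($child_c, $child_d);
    }

    sub mutate {
        my ($individual, $rate) = @_;   # $rate has to stay fairly low (c.f. above)
        my @genes = split //, $individual;
        for my $gene (@genes) {
            $gene = 1 - $gene if rand() < $rate;   # stochastic bit flipping
        }
        return join '', @genes;
    }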
CGAs being thus defined (Holland, 1975; Goldberg, 1990), it has been demonstrated by Rudolph (1994) that such pure CGAs are unable to converge to the global optimum of the problem they tend to maximize. This is so because even if a CGA were able to discover the optimum, the unceasing activity of the mutation operators would force the system to depart from such an ideal state. On the other hand, if ever one implements the elitist trick of keeping the most fit individual, such convergence is assured. Thus, Rudolph's theoretical « analysis reveals that the convergence to the global optimum is not an inherent property of the CGA but rather is a consequence of the algorithmic trick of keeping track of the best solution found over time » (Rudolph, 1994). It is principally because of the CGA's

1. theoretical ability to converge to the global optimum
2. simplicity and architectural elegance

that our method of Evolutionary Localization of Semantic Attractors (ELSA, 10.4.7) is, in essentia, nothing else than a CGA endowed with an elitist strategy.

end canonical genetic algorithms

Parallel Genetic Algorithms

Parallel Genetic Algorithms (PGAs) add another level of complexity to traditional GAs. In PGAs, the global population of solutions is divided into multiple sub-populations which, most of the time, evolve independently from each other. One can understand such sub-populations as different societies or species evolving on isolated islands. Only during so-called "migratory periods" do the sub-populations communicate with each other, most often by means of "sending" their most fit individual to another, receptor sub-population. Grid (A, B), hierarchical (C), ring (D) and multi-hierarchical (E, F) architectures of such inter-insular migratory relations are depicted on Figure 9.

Figure 9: Different architectures of Parallel Genetic Algorithms. Reproduced from Sekaj (2004).

By introducing multiple independent populations, PGAs allow one to put into equilibrium the selective pressure (i.e. the preference for better individuals) and the population diversity (i.e. gene dissimilarity). In traditional single-population GAs, these two forces oppose each other: by increasing the selective pressure, an engineer reduces the diversity and thus exposes himself to the danger of converging "just" to a locally optimal state. On the other hand, by favoring too much diversity, one can significantly slow down the convergence rate. To find the equilibrium between these two forces is indeed an art. PGAs solve this tradeoff problem by allowing one to increase the selective pressure in one sub-population while augmenting the diversity of another. The gain seems to be particularly significant in the case of heterogeneous PGAs, whereby diverse sub-populations implement diverse search strategies. A further improvement, in the case of problems with "rugged" fitness landscapes, can be attained by introducing "sub-population re-initialisation" into the process: that is, the exchange of a population whose diversity is too low for a completely new, randomly generated population. Such « re-initialisation is able to remove differences between homogenous and heterogenous PGA's or between different PGA architecture types respectively. However, all the presented PGA modifications can speed up the search process and prevent the search algorithm from a premature convergence » (Sekaj, 2004). It seems that adding another level of complexity to GAs increases the probability of finding the globally optimal solution.
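The « migratory period » of the ring architecture (D on Figure 9) can be sketched as follows; the island representation and the helper names are our own illustrative assumptions, with best_of() standing for any routine returning a copy of an island's most fit individual:

    sub ring_migration {
        my @islands = @_;   # each island is a reference to its own population array
        my @emigrants = map { best_of($_) } @islands;   # assumed helper, see above
        for my $i (0 .. $#islands) {
            my $neighbour = ($i + 1) % @islands;        # clockwise neighbour on the ring
            push @{ $islands[$neighbour] }, $emigrants[$i];
        }
        return @islands;
    }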
It is true that even traditional single-population GAs explore the search space in multiple directions; in PGAs, however, such exploration is qualitatively augmented. By their faculty to centralize the decentralized, by their ability to speed up the convergence towards optimal solutions of diverse problems, as well as by allowing for the hierarchical stacking of independent information-processing units (in the study of Sekaj (2004), the hierarchical architectures C, E and F seem to be the most successful in approaching the global solutions of two specific mathematical functions), PGAs are reminiscent of so-called deep learning methods, principally based on the hierarchical stacking of diverse connectionist networks. What is more, by being partially localized and partially globally-integrative, PGAs may offer an interesting means of simulating certain functions of the human brain (c.f. 2.8), which seems to dispose of analogous properties.

end parallel genetic algorithms

8.7.2 evolutionary programming & evolutionary strategies

Evolutionary programming (E.Prog) and evolutionary strategies (E.Strat) are methods whose overall essence is very similar to GAs. There are, however, some subtle differences among the approaches. In E.Prog, mutation is the principal and often the only variation operator. While recombination is rarely used, « operators are freely adapted to fit the problem at hand » (Kennedy et al., 2001). E.Prog algorithms often double the size of the population by mixing children with parents and then halve the population by selection. The tournament selection operator is often used. Another difference is that while GAs were developed in order to optimize the numeric parameters of the mathematical function under study – variation thus directly modifying the genotype – in E.Prog one mutates the genotype but evaluates the fitness according to the phenotype. E.Prog is thus often used for the construction and optimization of structures like finite state automata (Fogel et al., 1966). A self-adaptation approach (Bentley, 1999), allowing for the mutation of the parameters of the evolution itself – e.g. the mutation rate – is also frequently used.

Such an approach of « evolving the evolution » is also used in E.Strat, which were discovered – in parallel with, but independently of, Holland's GAs – by Rechenberg (1971). The biggest difference between E.Prog and E.Strat is the fact that E.Strat often recombines its individuals before mutating them. A popular and well-performing strategy thus seems to be:

1. Initialize the population.
2. Perform recombination using P parents to form C children (a frequently used C/P ratio is 7).
3. Perform mutation on all children.
4. Evaluate the children population and select P members from it.
5. If the termination criterion is not met, go to step 2; terminate otherwise.

Given that in certain simulations (c.f. ??) we shall

1. encode solutions by means of non-numeric chromosomes
2. evaluate the fitness of individuals by means of additional « phenotypic algorithms »

we consider the works of Fogel & Rechenberg to be precursors of our approach.

end evolutionary programming & strategies 8.7.2

8.7.3 genetic programming

Contrary to GAs, E.Prog and E.Strat, which operate upon chromosomes (vectors) of fixed length filled with numeric/boolean/character values, individuals evolved by means of Genetic Programming (GP) encode programs of arbitrary length and complexity.
In other terms, one may state that while the above-mentioned EC methods look for the most optimal solution of a given problem, GP tends to produce a hierarchical tree structure encoding a sequence of instructions (i.e. a program) able to yield optimal solutions to a whole range of problems. Simply said: GP is a way how computer programs can automatically « discover » new and useful programs.

The most important thing to do in order to prepare a GP framework is to specify how the resulting individuals (programs) shall be encoded. The original choice of the founder of the discipline, John Koza, was to encode all individuals as trees of LISP S-expressions composed of sub-trees which are, themselves, also LISP S-expressions. Within such arborescent S-expressions, the terminal nodes (i.e. the leaf nodes where the branches end) represent the program's variables and constants, while the non-terminal nodes (i.e. internal tree points) represent diverse functions contained in the function set (e.g. arithmetic functions like +, -, *, /; mathematical functions like log, cos; boolean functions like AND, OR, NOT; conditional operators like if/else etc.).

Figure 10: Sequence of steps constructing the program sqrt(x+5).

Figure 10 illustrates how, during the initial run of the algorithm, an individual – calculating, for example, the square root of x+5 – could possibly be randomly generated by implementing the following procedure:

1. The « root » of the program tree is randomly chosen from the function set; it is the function sqrt.
2. The function sqrt has only one argument (arity(sqrt)=1), therefore it will take only one input, from the randomly determined functor + (addition).
3. The functor + takes two inputs (arity(+)=2), therefore the tree bifurcates into two lines in this node. It randomly chooses, as the first argument, the constant 5, and the variable x as the second argument.

Note that in step 3, both arguments were chosen from the terminal set. Had they been chosen from the function set, the tree would have bifurcated further. In order to prevent such growth of trees ad infinitum, a limiting « maximal tree depth » parameter is more often than not implemented in GP scenarios. Once such a program has been generated, one can evaluate its fitness by confronting it with diverse input arguments and comparing its output with a golden standard. Such random-program generation & evaluation is repeated for all N initial candidate programs; subsequently, the most fit individuals are selected and varied.

While GP's selection techniques can sometimes closely resemble the selection techniques used in GAs, the variation operators are often of an essentially different nature. This is so because in GP, it is not individual genomes or their linear sequences which are mutated or crossed-over, but rather complex and hierarchical networks of expressions. In the case of a cross-over, for example, one switches a whole sub-tree encoded within one individual for a sub-tree encoded within another one.

GP-based solutions cannot be expected to function correctly if they do not satisfy the theoretical properties of closure and sufficiency. In order to fulfill the closure condition, each function from the non-terminal set must be able to successfully operate both on the output of any function in the non-terminal set and on any value obtainable by a member of the terminal set. Even the behaviour of some simple operators thus has to be adjusted a priori (e.g. made to return 1 in the case of division by zero) in order to assure the correct functioning of the resulting program.
The sufficiency property, on the other hand, demands that the set of functors and terminals be sufficiently exhaustive; otherwise the solution could not be found. One cannot, for example, hope to discover an equation generating the Mandelbrot set if the initial set of terminals does not contain the notion of an imaginary number, nor the function set any other explicit or implicit reference to the notion of the complex plane. Thus, while the closure constraint delimits the upper bound beyond which the discovery of the solution is not feasible, the sufficiency constraint delimits the lower bound of the minimal set of « initial components » which have to be defined a priori, so that the discovery of the adequate program is at least theoretically possible. Other theoretical notions, as well as diverse subtleties of practical implementation (special operators, methods how to distribute the initial population in the search space, fitness function proposals, domains of application, etc.), are to be found in possibly the most important GP-concerning monography (Koza, 1992).

Grammatical evolution

Grammatical Evolution (Gr.Ev) is a variant of GP in the sense that it also uses evolutionary computing in order to automatically generate computer programs. The most important difference between Gr.Ev and GP is that while GP operates directly upon phenotypic trees representing the program's code itself (for example in the form of LISP expressions), Gr.Ev uses the evolutionary machinery for the purpose of generating grammars, which subsequently generate the program code.

In Formal Language Theory (c.f. also 10.2), a grammar is represented by the tuple {N, T, P, S} where N denotes the set of non-terminals, T the set of terminals, S is the start symbol, which is a member of N, and P denotes the set of production rules that substitute elements of N by elements of N, T or their combinations. Consider a grammar G exhaustive enough to encode programs able to perform an arbitrary number of operations of addition or subtraction of two variables:

Listing 2: An example of grammar G.

    N = {<expr>, <op>, <var>}
    T = {+, -, x, y}
    S = <expr>
    P = {
        <expr> -> <expr> <op> <expr> | <var>
        <op>   -> + | -
        <var>  -> x | y
    }

Such a grammar contains three non-terminals: the non-terminal <op>, which can be substituted for either the terminal + or the terminal -; the non-terminal <var>, which can be substituted for either the terminal x or the terminal y; and the non-terminal <expr>, which can be substituted either for a single non-terminal <var>, or for the sequence of non-terminals <expr> <op> <expr>. The fact that in this last production the non-terminal <expr> is present both on the left and on the right side of the substitution rule gives this grammar the possibility to recursively generate an infinite number of expressions (x+x, x+y, y+x, y+y, x-x, x-y, y-y, y-x, x+x+x, x+x-x, x-x+y-y, y+y+x+x+y-x, etc.). Even a very simple grammar – with only four terminal symbols and three non-terminal symbols, to each of which are associated only two production rules – can theoretically (i.e. if given an infinite amount of time for the application of production rules) produce an infinite number of distinct individual programs able to perform basic arithmetic operations with two variables. The generation of a given resulting expression is determined by the order of application of specific production rules, starting with the start symbol S. Such a sequence of applications of production rules is called a derivation.
For example, in order to derive the individual « x+x », one has to apply the production rules in the following order:

Listing 3: Production of expression x+x.

    S = <expr>
    <expr> ::= <expr> <op> <expr>
    <expr> ::= <var>             # <var> <op> <expr>
    <var>  ::= x                 # x <op> <expr>
    <op>   ::= +                 # x + <expr>
    <expr> ::= <var>             # x + <var>
    <var>  ::= x                 # x + x

while the individual « y-x » would be generated if ever the start symbol S were expanded by the following sequence of production rules:

Listing 4: Production of expression y-x.

    S = <expr>
    <expr> ::= <expr> <op> <expr>
    <expr> ::= <var>             # <var> <op> <expr>
    <var>  ::= y                 # y <op> <expr>
    <op>   ::= -                 # y - <expr>
    <expr> ::= <var>             # y - <var>
    <var>  ::= x                 # y - x

In Grammatical Evolution, it is this « order of application of production rules » which is encoded in the individual chromosome. In other terms, individual chromosomes encode when and where distinct production rules shall be applied. Figure 11 illustrates more closely, and puts into analogy with biological systems, the sequence of transformations which every binary chromosome undergoes during the process of its unfolding into a fully functional program.

Figure 11: Sequence of transformations from genotype to phenotype in both Gr.Ev and biological systems. Figure reproduced from O'Neil and Ryan (2003).

As Figure 11 indicates, the approach of Gr.Ev is quite intricate and involves multiple steps of information processing. The whole process starts with a binary chromosome which is subsequently split into 8-bit codons, each of which yields an integer specifying which production rule is to be used at a given moment of the program's generation. On many different layers does the « generation » process, as implemented in Gr.Ev, introduce and implement very original ideas like:

• « Degenerate genetic code » – similarly to « nature's choice » to encode one amino acid by means of many different triplets, one can encode the application of a unique production rule by more than one codon.

• « Wrapping » – under certain conditions, the whole genome can be « traversed » more than once during the process of phenotypic expression. A specific codon can thus be used more than once during the compilation of a single individual.

The rationale for the usage of such « biologically inspired tricks » is more closely presented in the work of the founders of the Grammatical Evolution field (O'Neil and Ryan, 2003). They claim that the focus on the genotype-phenotype distinction, especially in combination with the implementation of the « degenerate code » and « wrapping » notions, could result in a compression of representation (& a subsequent reduction of the size of the program search-space) and account for phenomena like « neutral mutation », well observed in biological systems, whereby a mutation occurs in the genotype but does not have any effect upon the resulting phenotype. Another important advantage mentioned by O'Neil and Ryan is that the Gr.Ev approach makes it very easy to generate programs in any arbitrary language. This is due to the versatility and generality of the notion of « grammar ». When compared with the traditional GP technique, Gr.Ev was outperformed in a scenario where one had to find solutions to the problem of symbolic regression. But in more complex scenarios like « symbolic integration », the « Santa Fe ant trail », or in a scenario where one had to discover the most precise « caching algorithm », Gr.Ev significantly outperformed GP. The seminal work of O'Neil and Ryan (2003) also presents some other interesting examples of practical applications of Gr.Ev, for example in the domain of financial market prediction.
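The genotype-to-phenotype mapping just described can be reduced, for the grammar G of Listing 2, to the following toy Perl sketch of ours: every codon, taken modulo the number of productions applicable to the leftmost non-terminal, decides which rule shall be applied (wrapping, degenerate code and other subtleties are omitted):

    my %rules = (
        expr => [ [qw(expr op expr)], ['var'] ],
        op   => [ ['+'], ['-'] ],
        var  => [ ['x'], ['y'] ],
    );

    sub derive {
        my @codons   = @_;          # the chromosome, already split into 8-bit integers
        my @sentence = ('expr');    # the start symbol S
        while (@codons) {
            # locate the leftmost non-terminal; stop when only terminals remain
            my ($pos) = grep { exists $rules{ $sentence[$_] } } 0 .. $#sentence;
            last unless defined $pos;
            my $options = $rules{ $sentence[$pos] };
            my $choice  = shift(@codons) % @$options;   # codon mod number of rules
            splice @sentence, $pos, 1, @{ $options->[$choice] };
        }
        return join ' ', @sentence;
    }

    # derive(0, 1, 2, 0, 1, 3) unfolds S into the phenotype « x + y »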
It is worth underlining that while in many points (« grammar », « evolution ») the work of O'Neil and Ryan significantly overlaps with ours, their aims significantly differ from our aim to interpret the process of language acquisition as an inherently evolutionary process. More concretely, while Gr.Ev tends to offer a very general toolbox for generating useful computer programs, written in an arbitrary programming language and used for solving arbitrary problems, we confront the evolutionary computation machinery with the task of shedding some light upon the diverse facets of one sole problem: that of the « learning of first language ». Another important difference between the approach of Gr.Ev and the one we shall present in our Thesis is that while in Gr.Ev, grammars are considered to be « generative devices », i.e. tools used for the generation of programs, in our Thesis we shall use them both as « generative » and as « parsing » devices. Another, even more fundamental difference is due to the fact that while « At the heart of GE lies the fact that genes are only used to determine which rule is applied when, not what the rules are » (O'Neil and Ryan, 2003), the evolutionary model of language-induction proposed in our Thesis shall aim to determine not only the order of application of the rules, but also the content of the rules themselves.

end grammatical evolution

end genetic programming 8.7.3

8.7.4 tierra

Another example of how one can materialise evolutionary principles within an in silico framework is offered by Tierra, an artificial life simulation environment programmed between 1990 and 2001 by Thomas S. Ray and his colleagues. Since Ray is an ecologist, his objective was not to develop an EC-like model in order to find or optimize solutions of a given problem; rather, he aimed to create a system where artificial entities could spontaneously evolve, co-evolve and potentially create whole artificial ecosystems. An artificial entity in Tierra's framework (Ray, 1992) is a program composed of a sequence of instructions chosen from an instruction set containing 32 quite traditional assembler instructions, somewhat tuned by the author so that their usage would facilitate the « replication » of the code. Every artificial entity runs in its own « virtual CPU », but its code stays encoded in the « soup », i.e. a piece of RAM which is potentially read-accessible to all other entities as well. Rare « cosmic ray » mutations flip the bits of the « soup » from time to time; more variation is ensured by bit-flipping during the procedure whereby the entity replicates (i.e. copies) its code from the « mother cell » section of the soup to the « daughter cell » section. Selection is, in a certain sense, emulated by a so-called Reaper process which tends to stop the execution of programs which are either too old or contain too many flawed instructions. Other than that, there is nothing which resembles the traditional notion of an exogenously defined « fitness function »: within Tierra, the survival (or death) of diverse species of programs is a direct consequence of a species' ability (or inability) to obtain access to limited resources (CPU & memory).
Thus, after one seeds the initially empty soup with a manually constructed individual containing 80 instructions allowing it to copy its code into the daughter cell of the memory; after the memory has been filled and the battle for resources has started; and once the mutations have generated a sufficient amount of variation, one can observe the emergence of dozens of new forms of replicable programs. Some of them being parasites, and some of them being able to create algorithmic counter-measures against parasites, one can literally observe the emergence of an artificial yet living ecological system. It is therefore little surprising that Tierra could automatically evolve, among others, an individual containing just 22 instructions, capable of replication – that is, a replicator almost 4 times shorter than the replicator manually programmed by the conceptor of the system and injected into the initial « soup ». Currently the most famous descendant of Tierra is the AVIDA system (Ofria and Wilke, 2004). Contrary to Tierra, however, every AVIDA individual is encapsulated within its own virtual CPU and memory space. Tierra's Darwinian metaphor of computer programs evolving by means of fighting for limited resources is thus not so strictly followed.

end tierra 8.7.4

8.7.5 evolutionary language game

The Evolutionary Language Game (ELG), first proposed by Nowak et al. (1999), is a stunningly simple yet mathematically feasible stochastic model addressing the question: « How could a coordinated system of meanings & sounds evolve in a group of mutually interacting agents? ». In most simple terms, the model can be described as follows: let's have a population of N agents. Each agent is described by an $r \times c$ associative matrix A. A's entry $a_{ij}$ specifies how often the individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j. From this associative matrix A one can derive the active « speaker » matrix S by normalizing its rows:

$$s_{ij} = \frac{a_{ij}}{\sum_{n=1}^{c} a_{in}}$$

and the passive « hearer » matrix H by normalizing A's columns:

$$h_{ij} = \frac{a_{ij}}{\sum_{n=1}^{r} a_{nj}}$$

The entries $s_{ij}$ of the matrix S denote the probability that, within the P-representations of an agent-speaker, object i is associated with sound j. The entries $h_{ij}$ of the matrix H denote the probability with which, within the C-representations of the hearer, sound j is associated with object i. (See the following chapter for a closer introduction of what C- and P-representations are.)

Subsequently, we can imagine two individuals A and A', the first one having the language L(S, H), the other having the language L'(S', H'). The payoff related to the communication of two such individuals is, within Nowak's model, calculated as follows:

$$F(A, A') = \sum_{i=1}^{r} \sum_{j=1}^{c} s_{ij} h'_{ji} = \mathrm{Tr}(S H')$$

And the fitness of the individual A in regard to all other members of the population P can be obtained as follows:

$$f(A) = \frac{1}{|P|-1} \sum_{\substack{A' \in P \\ A' \neq A}} F(A, A')$$

After the fitness values are obtained for all population members, one can easily apply traditional evolutionary computing methods in order to direct the population towards more optimal states, i.e. states where the individual matrices are mutually « aligned ». In Nowak's framework, this alignment represents the situation when hearer and speaker mutually understand each other, i.e. the speaker has encoded meaning M by sound S and the hearer has subsequently decoded sound S as meaning M.
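Under the assumption that the matrices are represented as arrays of array references, and that S and H' have already been obtained by the row- resp. column-normalizations given above, Nowak's payoff can be sketched in a few lines of Perl (the subroutine name is ours; the hearer matrix is stored sound-by-object, so that its entry [j][i] is the probability that sound j is decoded as object i):

    sub payoff {
        # F(A,A') = sum over objects i and sounds j of s_ij * h'_ji = Tr(S H')
        my ($S, $H_prime) = @_;   # speaker matrix of A, hearer matrix of A'
        my $f = 0;
        for my $i (0 .. $#$S) {                  # over objects
            for my $j (0 .. $#{ $S->[$i] }) {    # over sounds
                $f += $S->[$i][$j] * $H_prime->[$j][$i];
            }
        }
        return $f;
    }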
ELG beautifully illustrates how a mutually shared communication protocol can emerge from a population of randomly initialized sound-meaning associative matrices, if there is some « mutual associative reinforcement » mechanism involved. This mechanism allows information to be transferred from one individual to another. This is attained by creating a blank « student » matrix and then filling its elements, by means of a stochastic « matrix sampling » procedure, in a way such that the resulting student matrix shall partially correspond to – be aligned with – the matrices of the pre-existing « teacher » (or teachers). Further experiments with ELG are described in Kvasnicka and Pospichal (2007, 1999) and Hromada (2012b). All these studies point in the same direction and suggest not only that the emergence of a mutually shared communication protocol practically ex nihilo is possible whenever there exists a means of transfer of information among individuals, but also that without the presence of a certain low amount of noise during the learning process, the system as a whole would fail to converge to the « communicatively optimal » state. In other words, the ELG model indicates that the presence of noise – a minimal yet non-null amount of mal-transferred information – is necessary in order to assure that the population of mutually aligned sound-meaning matrices shall, sooner or later, converge to the most communicatively optimal state.

The role of the ELG model within the context of our Thesis is quite open. For while it is the case that ELG sheds some light upon the question of the emergence of language within a community of symbolically interacting agents, it does not principally address the problem of language learning by a concrete individual. Thus, ELG is rather a model of macroscopic phylogeny than of microscopic ontogeny: it addresses the problem of how small communities of homo habilis could, hundreds of thousands of years ago, gradually converge to a system of signs within which, for example, « baubau » could mean a banana and « wauwau » a lion. Or, in less fatal and more vital affairs, how such a system can be useful for synchronizing activities related to dating, mating etc., as represented on Figure 12.

Figure 12: A case whereby mutual alignment of sound-meaning mappings can be useful. Reproduced from Kvasnicka and Pospichal (2007)'s reproduction of Pinker (2000).

Unfortunately, ELG wasn't explicitly constructed to address the problem of ontogenetic alignment, id est the problem of how toddlerese adapts to motherese. But, we believe, it is not completely hors propos to imagine a slight variation of Nowak's model wherein one population of matrices would be much more stable (representing the linguistic competence of the mother, parent or teacher agent), while the second population of matrices would represent the linguistic competence of a « child ». Given that the fitness function would somehow succeed in representing the degree of alignment between such a « mother » and « child », we postulate that something like the child's language competence could spontaneously emerge, be distilled and induced from such an ontogeny-oriented variant of the Evolutionary Language Game.

end evolutionary language game 8.7.5

In this section we have discussed more closely diverse applications of Evolutionary Computing (as defined in Section 3.1), namely:

1. genetic algorithms (GA) and parallel genetic algorithms
2. evolutionary programming (E.Prog) and evolutionary strategies (E.Strat)
3. genetic programming (GP) and its variant, grammatical evolution (Gr.Ev)
4. an artificial ecology environment, Tierra
5. a model of ex nihilo induction of sound-meaning mappings called the Evolutionary Language Game (ELG)

While some of these applications may strongly differ from each other, they all materialize – sometimes in purely informatic or mathematical worlds, sometimes in worlds more material or even "social" – the basic premises of Universal Darwinism. They all implement, in one way or another, reproduction, selection and variation of populations of information-encoding entities. The content of

1. what these entities encode
2. the ways how they encode it and how it varies
3. the reasons why some structures are chosen into the next generation and some are not

varies substantially from application to application. But the trinity of principles – reproduction, selection, variation – is implemented in all of them; otherwise they could not be, ex vi termini, labeled as EC implementations.

Dozens of analytical studies – related to topics like fitness landscapes (8.7.1) or parallel genetic algorithms (8.7.1) – could, sooner or later, find their accomplishment in a general, formal and mathematical theory of evolution. The articulation of such a theory could yield a more rigorous base for the description of phaenomena which are nowadays explained in terms of the somewhat vague, speculative and conjectural doctrine of Universal Darwinism. For the one who would decide to establish such a theory, EC could furnish a tool as useful as Galileo's telescope was for Kepler.

As was already indicated, the aim of this dissertation is not to furnish, nor even to discuss, such a general theory. The aim is first to use the conceptual prism of the doctrine of Universal Darwinism in order to observe and interpret the phaenomena related to the topic of our interest – language acquisition. And subsequently – in order to furnish a sort of testimonium ex simulatione – to use the most simple evolutionary model possible to demonstrate that it may be useful to conceive the problem of language acquisition in terms of gradual optimization and co-evolution of populations of linguistic functions and structures.

We believe that for such a purpose, EC can furnish a very useful framework. The reason behind this belief is simple: during the few decades since its conception, EC-based systems have demonstrated their capability to find solutions to thousands of diverse problems and meta-problems. EC-based systems help designers and planners to invent optimal components, houses and cities; EC-based approaches are used to tune neural networks in robotic systems; EC-based systems help us not only to understand our world but also to change it. Simply stated: Evolutionary Computing works.

end evolutionary computing 8.7

Evolutionary Computing works because evolution itself works. And evolution – understood as gradual optimization of replicators – works because it is a logical necessity. Such is the doctrine of Universal Darwinism. The goal of this chapter was to furnish a brief overview of diverse scientific theories and paradigms based on, or inspired by, UD's explicatory power. First was mentioned biological evolution: it was the study of this form of evolution which gave birth to evolutionary theory. The discipline of Evolutionary Psychology was then discussed and partially criticized as being often too expansive in its aims.
It was reiterated that the aims of this dissertation are not those of Evolutionary Psychology: while Evolutionary Psychology tries to explain diverse human skills as results of biological evolution, the Hard Thesis postulates that human learning itself is an evolutionary process. Evolutionary epistemology, Campbell's & Simonton's explanation of individual creativity in terms of "blind variation and selective retention", and the notion of memetics were discussed as examples of evolution based on the reproduction, variation and selection of non-DNA replicators. It was further precised that, contrary to EE2-1 and traditional memetics, which study the evolution based on structures copied between brains, we shall tend to put the focus on the evolution going on within the brain.

The existence of a sort of 3rd replicator is thus posited. Aside from nucleic acids – which furnish the material base for the evolution of Nature – and aside from memes – which represent the basic units of the evolution of Culture – a third replicator is posited in order to explain certain properties of a mind (1.1) which learns. To honor Piaget's work in genetic epistemology (8.4.4), we tend to call such a replicator a "scheme". By being internal to both mind & brain, such "schemes" are very elusive, and it is of no surprise that they could potentially escape the attention of occidental "positivist" science. Even in the case of the other replicators, science took its time to recognize their nature and force. While breeding domesticated species for thousands of years, "science" was nonetheless ignorant of the principles of evolution until a sort of crossover between Mendel's and Darwin's ideae occurred. While being bombarded on a daily basis by propaganda memplexes and viral tweets, certain scholae still have a somewhat difficult time admitting the sheer existence of memes. And if the nature of such salient, objective, empiric phenomena escaped the analytic regard of scientific enquiry for such a long time, could anything be done – in the limited scope of this dissertation – to demonstrate the existence of such "subjective" schemes?

After putting aside introspection as an invalid method of validating hypotheses in a positivist way, we see only three possible means of proving the existence of such a 3rd replicator:

a. study of the reproduction of information within the brain by means of imaging techniques like fMRI, EEG etc.
b. study of "schemes" while they are still observable, i.e. before they are interiorized
c. computational simulations

Path A, the path of neurosciences, is too costly and thus beyond our reach. Hopefully it shall be undertaken by others with more resources and more patience. But luckily, the price of undertaking paths B & C is negligible, and it is thus in this direction that we shall proceed. For in order to make progress on path B, one just needs to observe the activity of minds which have not yet mastered the way to interiorize their perceptive and behavioral schemas into their subjective realm. Such minds are, according to Piaget, and even more so according to Vygotsky (1987), the minds of children. And when it comes to path C, nothing could serve us better than the EC branch of informatics (3.1). It has been suggested that EC is a sort of applied evolutionary theory: it can generate empiric proofs. Whenever a genetic algorithm discovers a useful solution which was not yet found, whenever a genetic programming scenario generates a piece of evolutionary art which the programmer hadn't even dreamt of, a tangible – and often beautiful – proof is furnished.
It is a proof of the belief that darwinists are definitely not further away than creationists from the knowledge of a noumenic Principle governing our phenomenal world.

end universal darwinism 8

DEVELOPMENTAL PSYCHOLINGUISTICS

Developmental Psycholinguistics (DP) is a scientific discipline studying the changes occurring in the human faculty of understanding and production of natural languages. As such, it is closely related to developmental psychology (a sub-field of psychology) and developmental linguistics (a sub-field of linguistics). While developmental psychology thematises the phaenomena of development of the human psyche, consciousness, mind, attention, reasoning, intellect, memory, perception, action, etc. on their own, DP does so always in relation to language. And contrary to linguistics, which often thematises language – or linguistic competence – as the product of some process P, DP ultimately strives to understand the process itself. In other terms, approaches common to DP « regard continuity of expression and function as critical clues to tracing the path children follow as they acquire language » (Clark, 2003).

We consider this distinction between "product versus process" to be of crucial importance for our tentative to align DP with UD. This is so because evolution itself is a process, and hence it would be impossible to align the two paradigms if ever the linguistic faculty were understood solely as a static product. In other terms, the alignment of DP and UD is possible only under the condition that DP's main object of interest is not a static product, but a dynamic process. As a name for such a process, we shall adopt the decision made by Harris (2013) and use the term « language development » in preference to the widely-used term « language acquisition ». A reason for this is the tentative to mark the fact that the child not only passively « acquires » the language from environmental input, but rather gradually builds it, in interaction with its environment. Sometimes the term « language learning » is also used to denote the same process; great care has to be taken, however, not to forget that the "implicit and natural" way how a child learns toddlerese differs substantially from the "explicit" drill used in the learning of second, third, foreign, etc. languages. This being said, we can now define the process which is, ex vi termini, the main object of interest of any developmental psycholinguist:

9.1 language development (def)

Language development (LD) – or the ontogeny of natural language L in human individual H – is a constructivist process gradually transforming L into an evermore optimized communication channel facilitating the exchange of information between H and her social surroundings.

end language development 9.1

The adjective constructivist indicates that LD should be, within the theory hereby introduced, considered as a process based on the gradual internalization and modification of mental representations, induced and re-induced by confrontations with external informations. Piaget's constructivist theory in relation to LD shall be more closely described in 9.4.3. Note also that by introducing the terms "exchange of information between H and her social surroundings", definition 9.1 places emphasis on the social aspects of human language.
By doing so, it embraces the so-called socio-pragmatic approach to LD (c.f. 9.4.4) more closely than the so-called generativist and nativist ones (c.f. 10.2). But the key component of LD's definition is the notion of "optimization". This notion, which goes hand-in-hand with the notion of "facilitation of the exchange of information", refers to the fact that, as language L develops – in infancy and beyond – it usually makes it possible to encode ever more precise information with a smaller quantity of signal. By integrating the notions of "optimization" and "facilitation of information", definition 9.1 thus ultimately states that language development is indeed a process which, if healthy and well-adapted to the environment, makes it possible to successfully exchange ever subtler and subtler meanings (signifiés) encoded by shorter – or at least not longer – sequences of articulated symbols.

An information can be successfully exchanged between a human sender and a receiver if and only if the following conditions are fulfilled:

1. C1: the sender is able to encode the information into the signal
2. C2: the signal can be decoded by the receiver
3. C3: the result of such decoding attracts the receiver's mind limitely close to the state intended and anticipated by the sender

One can speak about success in interpersonal communication only if the communicative act fulfills all of these conditions. As was already pointed out in 5.1, linguistic signals are usually strongly sequential and analysable into finite numbers of distinct, discrete elements. When the sender encodes his intention into such a sequence, (s)he is said to produce or generate the linguistic utterance. When the receiver decodes it, (s)he is said to parse the utterance. Ideally, when sufficiently strong a morphism exists between such meaning-encoding and signal-decoding interactors, the result of such parsing has, as a consequence, that precious moment which humans call "understanding".

Understanding, or comprehension, is closely related to condition C3. The fact that humans are able to understand each other – the fact that speaker and listener, writer and reader can share intentionality – is, according to usage-based theorists of LD, something which seems to be a unique propensity of the human species (c.f. 9.4.4 for a further introduction of usage-based theories).

It is important to realize that, in spite of disposing, at a certain level of abstraction, of a sort of symmetry, production and comprehension are nonetheless distinct processes. It is as with the movement of a hand, which involves different muscles when the hand goes up and different ones when the hand moves in the opposite direction; as with the human endocrine system, which uses one hormone to promote a certain activity and a completely different hormone to inhibit it; as with multitudes of other biological and cognitive phaenomena which seem to be mirror images of each other but in fact are not: production and comprehension are distinct. Distinct mechanisms are implemented to generate a sentence and distinct ones to parse it. Distinct brain regions are involved. Hearing is not speaking with the roles of speaker and hearer simply inverted: it is something fundamentally different.

The existence of such a mismatch between linguistic production and linguistic comprehension is so evident that many linguistic theories have ignored it, or at least set it aside as secondary. Practically all linguistic schools drawing inspiration from Formal Language Theory (FLT) (10.2), e.g. the generativist tradition, do not care much about this distinction. This is so because at the level of abstraction where FLT is postulated, parsing is practically the same thing as generation, and the only thing which differs is the direction in which the rules are applied. It is true that when parsing, the system proceeds from the surface structure towards the deep structure by always substituting the left side of a production rule for the right side; when generating, the system proceeds from the deep structure towards the surface structure by substituting the right sides of the production rules for the left sides. But all the rest – the content of the production rules, the alphabet, the lexicon, the very computational machinery – is the same. Given the more theoretical and much less empirical aims of FLT, one can understand the reasons why it practically ignores the mismatch between man's language comprehension and man's language production. One could even praise FLT's conceptors for the fact that, by pointing to the level of abstraction where production and comprehension meet, they point to some potentially fundamental unity. For adopting an attitude where production is a sort of inverted parsing can, indeed, yield some interesting and potentially useful computer programs. But to build psycholinguistic theories while ignoring the principle which every parent feels and every language teacher knows – such an attitude has to necessarily result in an inconsistent theory or a mal-functioning model. In order to avoid such an epistemologic disaster, a somewhat more mundane principle is hereby posited:
Practically all linguistic schools drawing inspiration from the Formal Language Theory (FLT) (10.2), e.g. the generativist tradition, do not care much about this distinction. This is so because at the level of abstraction where FLT is postulated, parsing is practically the same thing as generation and the only thing which differs is the direction in which rules are applied. . It is true that when parsing, system proceeds from surface structure towards deep structure by always substituting left-side of the production rule for the right-side; when generating, system proceeds from the deep structure towards the surface structure by substituting right-sides of the production rules for the left-sides. But all the rest - the content of production rules, the alphabet, the lexicon, the very computational machinery - is the same. Given more theoretical and much less empirical aims of FLT, one can understand the reasons why it practically ignores the mismatch between man’s language comprehension and man’s language production. One could even praise FLT’s conceptors for the fact that by pointing to the level of abstraction where production and comprehension meet, they point to some potentially fundamental unity. For adopting an attitude where production is a sort of inverted parsing can, ineed, yield some interesting and potentially useful computer programs. But to build psycholinguistic theories and ignore the principle which every parent feels and every language teacher knows, such an attitude has to necessarily result in an inconsistent theory or a mal-functioning model. In order to evit such an epistemologic disaster, a somewhat more mundane principle is being posited: Of asymmetry between production and comprehension Of symmetry between production and parsing Of human condition and insufficiency of FLT 9.1 language development (def) 9.1.1 central dogma of dp (def) C-representations precede P-representations. end central dogma Of C- and Prepresentations C-representations are not passive 9.1 Eve Clark, who had coined these (C|P)-representation terms further clarifies: « Children set up a representation for each new word or phrase they notice in the speech they hear, attach meaning to it, and adjust the representation in the light of further analyses. They can use it to access that meaning when they next encouter that form. As they hear more language, they add to their store of such representations. These representations for comprehension (C-representations) consist first of an acoustic template, to which children then add information about meaning, syntax, and use...Children also represent the information needed for producing each expression. For this, they need specifications for articulating the sounds in the target word or phrase. Their representations for production (P-representations), then, necessarily differ from C-representations.» (Clark, 2003) In simple terms, the central dogma states that humans understand language before they can speak it. A child comprehends what an airplane means long before it will be able to pronounce that word correctly. According to Clark, comprehension is always ahead of production, even in adult age when the mismatch is much less visible than in childhood. Important thing to realise is that utility of C-representations goes far beyond some passive involvement in recognition and comprehension of words and phrases. This is so, because C-representations can also influence and determine the direction of construction of Prepresentations. 
C-representations can provide targets with which P-representations are gradually aligned. This is how Clark describes the process: « How would this work? Suppose a child is trying to produce snow. If children can access the C-representation for snow, they can compare their own production with the C-representation, detect any mismatch, and repair their own utterance. The C-representation is a model of what the word should sound like so others can recognize it. Under this view, C-representations provide model targets for what children produce. They also provide the target that the product of a P-representation must match. So as children adjust their P-representations to match what they hear from others, they align them with their C-representations. It is this gradual alignment that mirrors changes in children's own production of words and phrases. » (Clark, 2003)

In other words, what Clark tacitly indicates is not only that human language development involves a gradual adaptation of one's internal representations (the C-ones) to structures observable in the external environment, but also that LD involves a sort of gradual adaptation of one set of internal representations (P) to another set (C). In the light of such a theory, many common phenomena, like canonical babbling (9.2.2), can be interpreted as highly useful and possibly inevitable ways how the infant's linguistic faculty tunes and bootstraps itself in a partially auto-programming and auto-poietic fashion.

Another reason why we consider the principle C-precedes-P to be of certain interest for this dissertation is that it indirectly addresses the debate we have already raised when discussing the hypothesis stating that "learning involves reproduction of information-encoding entities" (2.4). For if we accept that C precedes P, we have to accept, in the first place, that the C-representation of the word mama is somehow distinct from the P-representation of the same word. And more: if we accept that the C-representation of the word mama is distinct from the P-representation of the word mama, yet refers to the very same mother-referent in the external world, we have to accept that the information contained in the two representations has to be, at least partially, the same. We thus end up with two distinct representations, C and P, both originating in C and pointing to the referential content to which C alone referred when it was first set up. Couldn't this mean that the informational content of the locus which encodes C (e.g. in Wernicke's area) got replicated into an independent cortical locus which encodes P (e.g. in Broca's area)? Couldn't the neural basis of such a process be somewhat similar to the processes postulated by neural darwinists (8.6), for example the one depicted on Figure 6? We let the reader (him|her)self answer these and similar questions. Note, however, that answering these questions with "yes" would suggest that the statement which has been hereby labeled as the central dogma of developmental psycholinguistics does indirectly support the thesis that language development is a form of evolutionary process.
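If one wished to caricature this replicative reading of the central dogma in code, a toy of the following sort could do; it is emphatically our own illustrative sketch and not a model from the literature: a P-representation ("no") is randomly varied, and a variant is retained only when it lies closer – in Levenshtein edit distance, a crude stand-in for perceived acoustic mismatch – to the fixed C-representation target ("snow"). The distance() routine is the one exported by the CPAN module Text::Levenshtein.

    use Text::Levenshtein qw(distance);

    my $c_rep  = 'snow';   # the C-representation: the model target
    my $p_rep  = 'no';     # the P-representation: an early production attempt
    my @sounds = ('a' .. 'z');

    for my $generation (1 .. 1000) {
        my $variant = $p_rep;
        my $op = int rand 3;   # one blind variation per generation
        if ($op == 0 and length $variant) {    # substitution
            substr($variant, int rand length $variant, 1) = $sounds[rand @sounds];
        } elsif ($op == 1) {                   # insertion
            substr($variant, int rand(length($variant) + 1), 0) = $sounds[rand @sounds];
        } elsif (length $variant) {            # deletion
            substr($variant, int rand length $variant, 1) = '';
        }
        # selective retention: keep the variant only if it "sounds" closer to the target
        $p_rep = $variant if distance($variant, $c_rep) < distance($p_rep, $c_rep);
        last if $p_rep eq $c_rep;
    }

Blind variation and selective retention suffice, in this toy at least, to let the production converge onto the comprehension target – which is precisely the kind of intramental evolutionary loop this dissertation postulates.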
9.2 development of toddlerese

The goal of the following subsections is to present facts related to the development of multiple facets of toddlerese. In 5.1, toddlerese was defined as a proto-variant of the natural language, and natural language was defined as a system composed of prosodic, phonologic, morphologic, syntactic, semantic and pragmatic structures and principles (4.1). None of these layers is to be ignored by somebody aiming to have an adequate vision of the development of toddlerese. But taking into account all the scientific discussions which have, since the end of the 19th century, been preoccupied with the elucidation of the mystery of LD's universality, speed, and the fact that in the case of healthy individuals LD is practically always successful – taking into account all such scholastic schisms is not a path to knowledge either. Thousands of experiments and observations were done, hundreds of books published, dozens of theories and even whole doctrines were unleashed, sometimes sentencing whole generations of linguists into a scholastic hell filled with infinities, recursive rules and utterly inconvenient formalist games. In order to avoid such a destiny, the following paragraphs shall restrict themselves to a very "minimalist" presentation of a few evident or experimentally well-verified LD-pertaining facts. Thus, the brief exposé hereby introduced will only very rarely be concerned with any linguistic phenomena beyond the state of toddlerese, whose upper bound was operationalized, in 5, at 30 months (i.e. at age 2;6). And given that the "operational thesis" (6) restricted the scope of our interest to the textual modality of human interpersonal communication, we shall present in closer detail the psycholinguistic studies pertaining to the development of the morphosyntactic and semantic faculties. In contrast to these, the prosodic, phonic and pragmatic layers shall be described much more superficially than they rightfully merit. For when it comes to language as it was known to all our human predecessors, it was indeed the pragmatic and phonetic aspects which were at the core and inception of it all.

9.2.1 ontogeny of prosody, phonetics and phonology

Prosody is all that relates to the tempo, rhythm, stress and intonation of speech. Phonetics is concerned with the articulation, acoustics and audition of the physical properties of speech signs. Phonology, on the other hand, is less "material" and more "cognitive", in the sense that it is not concerned with such physical characteristics of phonemes as amplitude, frequency or timbre, but rather with the systems of abstract categories and rules whose existence is directly or indirectly observable in any human cognitive system which was exposed to phonemes and is able to perceive them.

Human beings are sensitive to language even in the prenatal period. The study of DeCasper and Spence (1986) has shown that new-born infants prefer to listen to a story which they have already "heard" in utero. Given the fact that in the uterus, frequencies above 1 kHz are weakened by transmission through maternal tissue, this preference can be explained principally in terms of prosodic, and not phonemic, information. Another study indicated that even 4-day-old newborns are able to distinguish between the mother language (e.g. French, in the case of French newborns) and a foreign language (Russian, English etc.), even when the speech is filtered by a 400 Hz low-pass filter (Mehler et al., 1988).
Clark summarizes the results of both studies in one statement: « what infants are attending to are the prosodic properties of the speech they have been exposed to prenatally » (Clark, 2003).

During approximately the first eight months which follow the birth, infants are capable of discriminating practically any phonetically plausible contrast between two sounds. But before attaining one year of age, children lose this capacity to distinguish practically any sound from any other, and their perceptual filters become more and more adapted to the phonology of the language spoken in their social environment. In other words, « infants can discriminate nonnative speech contrasts without relevant experience... there is a decline in this ability during ontogeny... data... shows that this decline occurs within the first year of life, and that it is a function of specific language experience » (Werker and Tees, 1984).

It shall be indicated multiple times in this dissertation that sometimes a loss or a limitation can serve a creative purpose. Such is also the case, we believe, with the above-mentioned loss of the capacity to distinguish practically any phoneme from any other. For by losing this capacity, an infant also gains something: she gains the capacity to distinguish language from non-language, the mother language from a language spoken by an alien passing by. When this problem is resolved, the child's cognitive system can focus more efficiently upon the upcoming problem: that of the discovery and extraction of recurring patterns in and from the speech stream.

A set of experiments performed by Jusczyk and his colleagues focused principally on infants' ability to "hear" such regularities. One type of regularities are prosodic ones, for example syllabic stress patterns (stronger stress on first syllables in English). Other regularities are, of course, due to repetitive occurrences of the same words. In order to remark that the word X was repeated, an infant has to be able to somehow identify the word as something which was already heard. The study of Jusczyk and Aslin, which focused on the perception of monosyllabic words, has shown that « some ability to detect words in fluent speech contexts is present by 7 and half months of age » (Jusczyk and Aslin, 1995). The same study has also indicated that 6-month-old infants still lack the ability to perceive (monosyllabic) words as perceptual units. At 9 months, infants are able to identify sequences of two and more syllables: « 9-month-olds appear to be capable of integrating sequential and suprasegmental information in forming wordlike (multisyllabic) phonological percepts, 6-month-olds are not » (Morgan and Saffran, 1995). Another study indicates that 9-month-olds also prefer to listen to words of their ambient (mother) language and not to words from another language (Jusczyk et al., 1993).

Before attaining the first year of age, children are thus able not only to discriminate, but also to identify familiar phonemic chunks of various sizes, extract them from the speech stream and potentially associate them with contextual information and other sensory modalities (visual, tactile etc.). It is therefore reasonable to assume that a 9-month-old healthy infant already disposes of dozens of C-representations which can be labeled as protolexical.
The ontogeny of the infant's faculty to produce intelligible verbal signals is no less fascinating. It starts, of course, with the cry of a newborn who is able to obtain any desired change of environment (food, warmth, diaper change etc.) with one loudly and adamantly expressed bit of information. But after circa 2 months, an infant starts to produce more gentle cooing sounds which seem to express, contrary to crying, the infant's satisfaction or agreement with the current state of the environment. In the three or four months which follow, these two modes of verbal production - crying and cooing - become ever more refined and are ever more accompanied by facial and gestural expressions. And sometimes - when the cooing vowel-like "ooo" and "aaah" are co-articulated with some occlusive consonants, thus forming sounds like "uuum", "baaa" or "maaa" - one can observe the occurrence of marginal babbling. And then, somewhere between six and ten months, comes canonical babbling.

Canonical babbling (DEF)
«Canonical babbling consists of short or long sequences containing just one consonant-vowel (CV) combination that is reduplicated or repeated.» (Clark, 2003)
end canonical babbling

Consonants occurring in canonical babbling are more often than not voiced and labial1 (b), labionasal (m), velar (g) and - a little bit later, when the child already has some teeth to block the airflux with - also dental (d). The few months which follow shall subsequently be dedicated to the variation of both the enveloping intonation contour of the babbling as well as the syllables contained in the babbling sequence: canonical "mamama" sequences shall thus evolve into sequences like "mamapapadadada?". At circa 1 year of age, «many babbled sequences sound compatible with the surrounding language using similar sound sequences, rhythm and intonation contours.» (Clark, 2003)

1 In our bachelor's thesis we advanced the hypothesis that the prominence of labial closures in early babbling is to be associated with suckling.

It is during the period of babbling that the first "words" appear. And according to a growing amount of evidence, the development of first words is a natural and continuous prolongation of the babbling phase. Elbers and Ton, for example, summarize their analysis of the monologues of a Dutch boy Thomas in the six weeks following the acquisition of his first word (1;3-1;5) with the conclusion: «new words may influence the character and the course of babbling, whereas babbling in turn may give rise to phonological preferences for selecting other new words» (Elbers and Ton, 1985). For example, the frequency of t-like consonants occurring in the babbling sequences increased significantly (from 15% to 40%) in the period when Thomas started to use his t-containing word ("aut(o)").

Toddlers thus seem to be selective in the word forms which they pronounce, «working first on what they can already do and only after that moving on to harder problems» (Clark, 2003). When they do not have enough practice with a certain sound or a word form, they tend to avoid it. This hypothesis was demonstrated by an ingenious experiment designed as follows: «during 10 bi-weekly experimental sessions, 12 children (1;0.21 - 1;3.15) were presented with 16 contrived lexical concepts, each consisting of a nonsense word and four unfamiliar referents.
For each child, eight words involved phonological characteristics which had been evidenced in production (IN) and eight had characteristics which had not been evidenced in production or selection (OUT)» (Schwartz and Leonard, 1982). The results of the experiment, presented in Table 2, made it evident that while children's ability to understand is independent of the form of the word-to-be-understood, toddlers and pre-toddlers prefer to "mention" mainly those things whose names contain only familiar phonetic forms (i.e. IN words).

                          IN words    OUT words
Produced spontaneously       33          12
Understood correctly         54          50

Table 2: Children avoid production of words with unknown characteristics. Reproduced from a table in Clark (2003) based on data in Schwartz and Leonard (1982).

To get from babbling to a rich spectrum of intelligible words is not an easy task. Every child uses her own unique strategy to solve it; every child traverses a different "path" in order to align her linguistic structures with those of her social environment. As the author of a thorough study comparing the acquisition of phonology by three children put it: «each of the three children is exhibiting a unique path of development with individual strategies and preferences and an idiosyncratic lexicon» (Ferguson and Farwell, 1975). We consider it important to underline that these paths are rarely a linear descent from random babbling to optimal (i.e. correct) pronunciation. As can be seen not only in the data collected by Ferguson & Farwell, it is often rather the contrary which is the case: «although the children tend to be quite accurate in their first production, their accuracy often declines over time, so later versions of the same words appear to be further from the adult targets» (Clark, 2003).

What seems to be common to all those paths, however, is that they flourish with variation. As William and Teresa Labov observed during the longitudinal observation (1;3-1;8) of their daughter Jessie, she revealed «continuous exploration, experimentation, practice and intense involvement with linguistic structure» (Labov and Labov, 1978). For 3 months of Jessie's life, this experimentation was concerned solely with the words "cat" and "mama"; overall she pronounced each of these terms at least 5000 times during the 5 months of observation. «In summary, what might be regarded as a rather flat plateau in Jessie's development, upon closer inspection, revealed a constantly changing series of small experiments where she progressively scrutinized and tried out different phonological options.» (Clark, 2003)

These small experiments can often be characterized in terms of the application (or non-application) of specific simplification routines. These routines, which we shall call "variation operators" in iii, can be coarsely divided into three big groups:

1. Substitutions
2. Assimilations
3. Transpositions

Substitutions are due to the simple replacement of one sound (or group of sounds) with another sound or group of sounds. Common are the voicing of initial voiceless consonants ("pie" pronounced as [bay]), the devoicing of final ones ([nop] <- "knob"), gliding ("ball" -> [baj]) etc. Also, children often do not pronounce some parts of the word at all.
These omissions - which can be understood as special cases of substitution whereby one sound is substituted for a "blank" or "non-terminal" sound which is not articulated - are also very common, especially in the case of consonants at initial ("tram" -> [am]) or final ("pes" -> [pe]) positions.

Assimilations «refer to the effect of sounds on those preceding or following them within a word or across word-boundaries» (Clark, 2003). Only one or a few features can be assimilated, e.g. in "orol" -> [olol], where the lateral feature of the final "l" overrides the trill feature of "r" (backward lateralization), or in "balon" -> [balol] (forward lateralization). But a whole cluster of features, or even whole sounds, can be assimilated as well: this is the case in particular in syllable reduplication, whereby one syllable completely overrides the other ("wasser" -> [vava]).

Another group of simplification procedures & variation operators are transpositions. Known as "metathesis" in historical and evolutionary linguistics (8.5) and analogous, mutatis mutandis, to so-called chiasms (Hromada (2011); Dubremetz (2013)) in rhetoric, these switches in order (AxB -> BxA) are already at play in the production of toddlers ("KOstOL" -> [okol]). All these examples shall be discussed in closer detail in iii. And in the second volume of this thesis, these cases shall be formalized and subsequently embedded as "variation operators" into evolutionary computation scripts; a first, minimal sketch of such operators is given below. But for the purpose of this exposé, let's just limit ourselves to the observation that the sequenced application of similar routines shall, in the course of ontogeny, allow the child to converge from quasi-random babbling to a correct articulatory program able to produce the word X.
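The following sketch shows how the three operator families could be rendered as naive string-rewriting functions. It is our own illustration: the replacement table and the chosen examples are invented, and a serious implementation would of course operate upon phonological feature bundles rather than raw letters.

```python
import random

def substitute(word, table=None):
    # Substitution: replace individual sounds according to a
    # child-specific table (here a hypothetical lateralization r -> l).
    table = table or {"r": "l"}
    return "".join(table.get(ch, ch) for ch in word)

def reduplicate(word):
    # Extreme assimilation: one syllable overrides the whole word,
    # as in "wasser" -> [vava].
    return word[:2] * 2

def transpose(word):
    # Transposition (metathesis): swap two segments, AxB -> BxA.
    if len(word) < 2:
        return word
    i, j = random.sample(range(len(word)), 2)
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

print(substitute("orol"))    # -> olol
print(reduplicate("vasa"))   # -> vava
print(transpose("kostol"))   # e.g. -> soktol
```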
end ontogeny of ppp

9.2.3 ontogeny of lexicon and semantics

The raison d'être of language is to communicate meanings. Semantics is the scientific discipline devoted to the study of meanings. Meaning - also called signifié in the tradition established by de Saussure (1916) - is a fairly abstract entity which only rarely, if ever, exists on its own. In language, meanings are always coupled with "signifiants", i.e. with material phonetic or graphemic forms which denote some specific meaning. Signifiant, signifié and information related to morphosyntactic properties (c.f. 9.2.4) form a triad which, taken all together, composes a word. In modern linguistics, words are sometimes considered to be members of a "lexicon". A lexicon is simply the set of all words internalized by and represented within the individual cognitive system. In DP, the process of acquisition of the lexicon is also known as vocabulary development.

We consider the process of vocabulary development to be, to a huge extent, reducible to the problem of construction of semantic categories. Under such a view, the problem of understanding a new word W can be understood as the problem of:

1. detection of recurrence of W in speech
2. establishment of a mapping|association between W and a corresponding semantic category C3
3. reducing or increasing the extension of C so that it is neither too specific nor too general

None of these problems is computationally trivial, but children nonetheless solve all of them with stunning swiftness and ease. We think that this is so because the human brain (2.8) is principally a pattern-detecting computational device whose principal objective, especially during the initial stages of ontogeny, is to subsume huge amounts of contextual multi-modal information under and into as-neatly-as-possible packaged categories.

3 We consider it important to make precise that within the theory hereby proposed, semantic categories - understood as points, regions or subspaces of some sort of "absolute semantic space" - can be shared, i.e. accessed by multiple mutually independent cognitive agents.

Under such a view, a word W, a signifiant, is not only a "label" for its respective conceptual category; it is also a stimulus triggering a completely involuntary categorization process. As Kyra Karmiloff and her mother Annette put it: «there is a dynamic feedback between developing cognitive skills and growing vocabulary, and words can act as an invitation to form a category» (Karmiloff and Karmiloff-Smith, 2009). Since we shall return later (10.4) to a more theoretic discussion of what semantic categories "are" in the computational sense, and of how the mapping between them and their labels can be constructed in a general computational system, let's just focus on the question: "What are the particular aspects of the acquisition of semantic categories constructed in human children?"

As infants gradually overcome the perceptual limitations of the newborn state, they tend to see the world ever more clearly. This subsequently makes it possible that «very young infants can and do perceive even the most subtle differences between and across category members. One study showed that three-month-olds could not only differentiate between cats and dogs (a between-category distinction), but also distinguish among different kinds of cat (a within-category distinction)» (Karmiloff and Karmiloff-Smith, 2009). During the first year of age, initial C-representations are formed by associating such representations of perceptual categories with co-occurrent representations of the most frequent and salient forms which the infant succeeds to detect and identify in her linguistic environment.

Interaction with such an environment - consisting mainly of mother, father, siblings or other "tutors" - is dynamic, repetitive and goal-oriented. Roger Brown describes it as a "word game" of which the child is a principal player: «The tutor names things in accordance with semantic customs of the community. The player forms hypotheses about the categorical nature of the things named. He tests his hypotheses by trying to name new things correctly. The tutor compares the player's utterances with his own anticipations of such utterances and, in this way, checks the accuracy of fit between his own categories and those of the player. He improves the fit by correction.» (Brown, 1958)

To understand what object is meant by what name is not an easy task. For how does a child know that the word "milk" means the life-strengthening liquid and not white color, liquid in general, something to drink, or the vessel in which it is stored? A possible answer is: by application of diverse lexical constraints (LCs). Among the multiple LCs mentioned in the literature, we consider these:

1. whole-object assumption
2. basic-level assumption
3. taxonomic assumption
4. mutual exclusivity and fast mapping constraints

to be of biggest importance during the toddler stage of LD. The whole-object assumption «presupposes that children already have categories of objects, such that objects can be represented as whole entities distinct from their locations or from their relations to other objects or places.» (Clark, 2003)
It is evident that endowing humans with such an assumption could be quite useful for our survival as a species: to be able to immediately perceive and label a lion as a lion is a more "fit" strategy than to invest computational resources in seeing the details of the lion's fur or whiskers. The same applies to the basic-level assumption: the ability to partition the world into basic-level categories (Rosch, 1999), which are not too general (above basic level), nor too specific (below basic level), is crucial to survival. In comparison to one's ability to categorize a shark as a shark, the ability to categorize these predators into below-basic-level categories as blue, white or tiger sharks, or as members of the above-basic-level category of chordates, is somewhat secondary.4

4 This does not apply to professional biologists and philosophers, of course.

Another LC which is quite closely related to Rosch's theory of basic-level categories and prototypes (10.4.1) is the taxonomic assumption, which presupposes that labels should a priori be extended to objects of the same kind and not to objects which are thematically related. Its validity was demonstrated by an experiment in which «children saw a series of target objects (e.g., dog), each followed by a thematic associate (e.g., bone) and a taxonomic associate (e.g., cat). When children were told to choose another object that was similar to the target ("See this? Find another one."), they as usual often selected the thematic associate. In contrast, when the instructions included an unknown word for the target ("See this fep? Find another fep."), children now preferred the taxonomic associate.» (Markman and Hutchinson, 1984)

While the above-mentioned LCs are useful heuristics for determining either the nature or the scope of categories-to-be-constructed, the LCs of fast mapping and mutual exclusivity are heuristics facilitating the discovery of the relation between the label (signifiant) and the semantic category (signifié). Thus, «the mutual exclusivity constraint stipulates that in a given language an object cannot have more than one name, so if the child already knows the word "car," he will not think a new word refers to cars. In other words, in the early stages of word learning, the child does not expect synonyms. The second constraint, fast mapping, stipulates that novel words map onto objects for which the child does not already have a name.» (Karmiloff and Karmiloff-Smith, 2009)

People: mommy (1;0), daddy (1;0), baby (1;3)
Food: banana (1;4), juice (1;4), cookie (1;4), apple (1;5), cheese (1;5)
Body parts: eye (1;4), nose (1;4), ear (1;5)
Clothing: shoe (1;4), sock (1;6), hat (1;6)
Animals: dog (1;2), kitty (1;4), bird (1;4), duck (1;4)
Vehicles: car (1;4), truck (1;6)
Toys: ball (1;3), book (1;4), balloon (1;4)
Household objects: bottle (1;4), keys (1;5)
Routines: bye (1;1), hi (1;2), no (1;3)
Activities: uh oh (1;2), woof (1;4), moo (1;4), ouch (1;4)

Table 3: Words produced by at least half of children in the monthly sample. Reproduced from a table in Clark (2003) based on data from Fenson et al. (1994).

Both lexical constraints of mutual exclusivity and fast mapping can be understood as direct implications of the principle of contrast, whose definition follows the sketch below.
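Before stating the principle itself, we offer a minimal sketch of how mutual exclusivity and fast mapping could be jointly operationalized. Everything in it - the function name, the dictionary representation of the lexicon, the uppercase referent tokens - is our own illustrative convention, not a claim about how children actually store words.

```python
def fast_map(utterance_words, visible_referents, lexicon):
    """Fast mapping under mutual exclusivity: a novel word is mapped
    onto the single visible referent for which no name is known yet.
    `lexicon` maps known word forms to referent tokens."""
    novel_words = [w for w in utterance_words if w not in lexicon]
    unnamed = [r for r in visible_referents if r not in lexicon.values()]
    if len(novel_words) == 1 and len(unnamed) == 1:
        lexicon[novel_words[0]] = unnamed[0]
    return lexicon

lexicon = {"car": "CAR"}
# the child sees a car and an unfamiliar object, hears the novel word "fep"
print(fast_map(["fep"], ["CAR", "FEP"], lexicon))
# -> {'car': 'CAR', 'fep': 'FEP'}
```

Since "fep" is the only unknown word and FEP the only unnamed visible referent, the mapping is established after a single exposure - which is precisely what the "fast" in fast mapping denotes.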
The Principle of Contrast (DEF)
«Every two forms contrast in meaning.» (Clark, 1987)
end the principle of contrast

The importance of this fairly trivial principle with regard to LD is not to be underestimated. The acquisition of any kind of form-meaning mappings can be significantly catalysed by the sole fact that PoC applies. Take, for example, an information-processing agent which knows only what "mama" means, but often hears the expression "mama a tato" when it simultaneously sees her mother and father. The discovery that the form "tato" denotes "father" would be trivial for an agent with PoC embedded among her information-processing procedures. And quasi impossible, or very (computationally) costly, for an agent without it.

Thus, with the aid of a very restricted number of heuristic-like principles and constraints, and in combination with contexts which repeat themselves day after day and week after week, small infants shall start to associate the first linguistic forms with the first conceptual categories. But the sole establishment of this association between the word and the category is not sufficient. The scope, the extent, the region of semantic space covered by and attributed to the specific category has to be delimited as well. Before this becomes the case, the child shall commit many errors of either insufficient or excessive generalisation. For example, in the case of insufficient generalisation, she shall sometimes apply a generic label ("dog") to denote just one specific canine ("Lessie"). And in the case of excessive generalisation, she shall denote with a label ("cat") even referents ("lynx") upon which such a label is not commonly applied by the child's linguistic community.

Clark (2003) offers a nice example, extracted from Kuczaj's transcripts5 contained within the CHILDES corpus, in which a child (2;4) learns a new word which shall help her narrow down a general verbal meaning:

I (wanted to have his orange peeled): Fix it.
T: You want me to peel it?
I: Uh-huh. Peel it.

5 In transcripts of conversations with children we shall label child-generated sentences with I (meaning "infant") and adult-generated sentences with T (meaning "tutor").

Section 13 shall present some more detailed results related to such micro-conversations resulting in a correction of the child's semantic category. For the time being, let's just suggest that such parental or sibling corrections could be quite easily integrated into a darwinian model of language ontogeny, either as a sort of selection or as a variation operator. Such exogenous, environment-originated perturbations gradually divide the infant's conceptual space into a structure of partitions functionally isomorphic to the structure of partitions embodied in the child's tutor. Table 4 illustrates in a very brief but nonetheless telling way an example of how the relations between a few labels and their subjacent categories changed in the ontogeny of one particular child.

word   initial and subsequent referents                          more appropriate word
papa   father/grandfather/mother (1;0)                           mama (1;3)
Mann   any man (1;2); pictures of adults (1;5); any adult (1;6)  Mann (1;5); Frau (1;7)
ball   ball (1;0); balloon (1;4)                                 balloon (1;10)

Table 4: A case of development of word|meaning mappings. Based on data in Barrett (1978).

Thus, an important «part of learning a word meaning is also learning what the extension of each term is, by learning what counts as a possible referent. Children also try out some words in ways that are hard to link to any identifiable use.
The target word itself may not be identifiable, and the general absence of adult comprehension typically leads to the word's being abandoned» (Clark, 2003). We propose to interpret this tendency to "abandon non-identifiable words" as a sort of selection. The cumulation of multitudes of such selective events, combined with the playful variation inherent to every healthy child, shall, so we argue, gradually attract the child's mind into a state where she has the language of her surroundings at her disposal.

And the learning of concepts is indeed gradual. Analyses of maternal journals and estimations suggest that at 12 months of age, children understand on average at least ten words (Menyuk et al., 1991). In the following months, the size of the lexicon increases only slowly; the topics of the words which the child understands and is subsequently able to produce are also quite restrained: «not surprisingly, young children talk about what is going on around them: the people they see every day; toys and household objects they can manipulate; food they themselves can control; clothing they can get off by themselves; animals and vehicles both of which move and so attract attention; daily routines and activities; and some sound effects» (Clark, 2003). Table 3 contains a list of words produced at a given age by at least 50% among the 1803 children whose parental reports were studied by Fenson et al. (1994).

From the perspective of the end-state language, many among these first words are specific-object-denoting nouns. But children often use them with the function of verbs or, more specifically, as imperatives. Thus, when saying "milk" a small child expresses her wish, want and need, meaning "wanting milk" or "make me get that bottle". Only months later shall such proto-imperatives be accompanied by proto-declarative statements meaning "look, mother, there is milk!". This distinction between proto-imperatives and proto-declaratives is not to be underestimated, since it seems to stem from the child's growing will to share information. Since we shall return later (9.4.4) to this properly human tendency to share information, intentionality and attention with others, let's just express our agreement with the statement that «Using language simply to share a common experience with the listener is particular to human communication. Animals tend only to use communication in a proto-imperative way» (Karmiloff and Karmiloff-Smith, 2009).

It is approximately in the period of gradual passage from proto-imperative to proto-declarative use of language, i.e. between 16-20 months, that the rate of acquisition of vocabulary shall start to accelerate. This phenomenon, known as vocabulary explosion or vocabulary spurt, starts to express itself when the child's productive vocabulary attains approximately 150 words and can be described as follows: «Prior to the vocabulary spurt, children learn on average about three words per week. But when they enter the vocabulary spurt stage, their learning of new words increases dramatically to about eight to ten words per day» (Karmiloff and Karmiloff-Smith, 2009). We shall discuss the phenomenon of vocabulary spurt in somewhat more quantitative terms in 10.1.2, during the discussion of the Logistic law.
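Anticipating that discussion, and assuming the standard logistic form (our choice of formulation, not a formula taken from the cited studies), the growth of a productive lexicon L(t) bounded by some capacity K can be written as:

\frac{dL}{dt} = r\,L\left(1 - \frac{L}{K}\right) \quad\Longrightarrow\quad L(t) = \frac{K}{1 + e^{-r\,(t - t_{0})}}

The growth rate is maximal when L = K/2, i.e. at the inflection point t_0: under such a reading, the "spurt" is not a separate mechanism switching on, but simply the steepest stretch of one continuous curve.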
Figure 13: Development of productive vocabulary in early (a) and late (b) toddlerese. Figures reproduced from Fenson et al. (1994).

Figure 13 shows the development of productive vocabulary in 1803 children as measured by the MacArthur infant and toddler communicative development inventories. Authors like Marchman and Bates (1994) interpret the occurrence of this and other similar LD-related phenomena in terms of the attainment of a critical mass.6 It seems indeed reasonable to postulate that some sort of qualitative change of the toddler's linguistic faculties occurs during the period when she enters the vocabulary spurt: for approximately in the same period, the toddler shall start to juxtapose words side by side and construct first phrases. And that marks the advent of morphology and syntax, which shall be discussed in the following section.

6 Phenomena which occur only when a certain critical mass is attained are best studied by the theory of complexity (c.f. Kauffman (1995) for a gentle introduction). Such phenomena are often related to a so-called "phase transition" (e.g. water → ice; fuel → reactor) which can be accompanied by not only a quantitative but also a qualitative transformation of the observed system.

Before we end this very brief overview of word learning among toddlers, let's just reiterate the finding that «one of the interesting characteristics of words is that their meanings do not remain static; they can change» (Karmiloff and Karmiloff-Smith, 2009). And this "change" is part of a process which is usually called "learning". And this "learning" can, we suggest, be plausibly interpreted as a particular case of an evolutionary process which, during ontogeny, divides one's semantic space (10.4) into categories which shall tend to overlap with the categories "out there".

9.2.4 ontogeny of morphosyntax

In traditional linguistics, the most fundamental meaning-carrying units of linguistic analysis are not individual words, but so-called morphemes. That is, prefixes, suffixes, word roots or other materially encoded signifiants encoding a particular signifié. The particular system of mutual interactions of diverse categories of morphemes yields a particular morphology. When the interaction of morphemes surpasses an individual word and multiple words are concatenated into a full-fledged utterance, information can also be contained in the way in which the diverse components (words) are ordered: in the utterance's syntax (σύν -> "together", τάξις -> "ordering"). Since in many languages the distinction between morphology and syntax seems to be very fuzzy7, some linguists prefer to speak simply about morphosyntax.

7 Take, for example, the German word of the year 1999 "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" which, in fact, is quite minimalist in comparison to words uttered by classical Sanskrit poets. Are the rules which govern the composition of such words the rules of morphology, or the rules of syntax?

Similarly to many other forms of human activity (e.g. object-manipulation, food-preparation, rituals etc.), human languages are compositional and combinatorial. Compositionality can be defined as follows:

Compositionality (DEF)
«The meaning of a signal is a function of the meaning of its parts, and how they are put together.» (Brighton et al., 2003)
end compositionality

while combinatoriality means that a theoretically infinite - yet practically vast but finite - set of complex constructions can be obtained by means of combining a finite amount of elements (morphemes). The evolutionary and computational advantages of the compositionality and combinatoriality of natural languages being addressed elsewhere
(Brighton et al. (2003); Kvasnicka and Pospichal (1999); Pinker (2000)), let's now focus on other characteristics related to the development of syntax in practically any healthy human child.

One such "universalium" is that there exists both inter-linguistic (i.e. children of different communities acquire different languages) and intra-linguistic (i.e. children of the same community acquire their language differently) variability. The variability of developmental trajectories is in fact so huge that one could plausibly argue that there are no two children in the world - not even twins8 - who would acquire language in an absolutely identical way. This is so because language-learning is strongly dependent on the individual perspective as well as on the context within which the learning occurs. The contexts from which a child acquires language structures involve not only auditive, but also visual, emotional, social, etc. dimensions, and the internalization of structures thus involves many factors. Since some of these factors are stochastic, language acquisition itself can NOT be a fully deterministic process. This is the second "universalium".

8 In this context, we consider it worth mentioning that twins often develop a sort of language of their own, or "idioglossia", whose potential conflict with the ambient language can slow down the twins' language development.

Aside from compositionality, combinatoriality, variability, context-boundedness and non-determinism, we consider it worth mentioning these other characteristics which could be considered as universal and axiomatic:

• graduality: children tend to master shorter structures before they master longer structures
• cumulativity: children tend to "build upon" what they already know
• specificity: children tend to learn individual patterns in individual contexts of social interaction
• repetitivity: scenes in which children acquire an individual structure X contain certain recurrent features
• inductivity: specific structures can be transcontextually crossed-over to yield structures corresponding to more general meanings than the ones with which the child was already confronted

Of course the list does not end here, and other properties - like recursivity (Chomsky, 1957), syllabicity (Jackendoff, 2002), the importance of substitution, or intention-sharing (Tomasello, 2009) - have been rightfully highlighted. Since some of these shall be discussed in 9.4 and 10.2, let's now lay the generalities aside and focus upon facts.

First, production: while still in their babbling phase, children first produce one-word "holophrases" which they succeed to fit into an individual intonational contour. As the intentions they want to communicate get more and more complex, children couple these with movements like approaching or running away; with gestures like pointing, nodding or shoulder-shrugging; or even with more complex manipulations like object bringing, throwing or showing. In sum, «gestures appear to help young children communicate before they can pronounce the longer phonological sequences required for combining words» (Clark, 2003).
As the temporal span of intonational contours increases9 and as the child improves her pronunciation of individual words - thus reducing the cognitive cost related to the phonetic aspects of the utterance - she succeeds, normally around circa 18 months of age, to fit multiple words under the vault of a single intonational contour, thus creating a first two-word construction.

9 Possibly because of the slowing-down of the "internal oscillator" observable in experiments with so-called spontaneous tempo (c.f. 9.2.6)

According to Tomasello (2009, pp. 104), this primordial "word combinations" stage has two distinctive features:

• they partition the scene into multiple symbolizable units
• they are composed only of concrete pieces of language, not categories

A concrete example, MAMA NENE (meaning "mother-breast"), shall be further discussed in 12.7.1. The child's ability to concatenate two words and integrate them into a single intonational contour is swiftly followed by the emergence of so-called pivot schemas.

Pivot schema (DEF)
A two-word schema in which «one word (the "pivot") recurs frequently in the same position in combinations, and the other word varies10» (Braine and Bowerman, 1976)
end pivot schema

10 The word "varies" put in italics by the author of this Thesis.

A canonical example of what is meant by pivot words and pivot schemas is presented in the table reproduced on Figure 14. This table lists all comprehensible two-word combinations, noted by the mother, which the toddler named Andrew spontaneously produced during the first five months after leaving the single-word stage.

Figure 14: Corpus of two-word utterances produced by the toddler Andrew. Reproduced from Braine and Bowerman (1976).

In Andrew's case, the pivot words are "more", "no", "all", "other", "there", "off", "all gone", "all done", "byebye", "hi" and "see". It can be immediately seen that pivot words tend to be juxtaposed with words belonging to specific linguistic categories ("more" with nouns, "all" with adjectives or participles). Where there are categories, there is generalisation, and where there is generalisation, there is productivity; and indeed, such productivity of pivot schemas was experimentally demonstrated in (Tomasello et al., 1997). In retrospect, its author concludes it as follows: «22-month-old children who were taught a novel name for an object knew immediately how to combine this novel name with other pivot-type words already in their dictionary» (Tomasello, 2009). Another interesting result of the same study was that «children combined the novel nouns productively with already known words much more often than they did the novel verbs – by many orders of magnitude» (Tomasello et al., 1997). But because categories like "nouns" and "verbs" are results of adult categorization of certain lexical phenomena and not necessarily categories pertinent to the child's own linguistic experience, let's just limit ourselves to the trivial observation that specific pivot words have affinity to words with specific features. And vice versa.
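Such positional affinities are easy to operationalize. The sketch below - a deliberately naive one of our own, with an invented threshold and a toy corpus loosely modeled on Andrew's combinations - extracts pivot candidates from a list of two-word utterances as words which recur in a fixed position while their partners vary:

```python
from collections import defaultdict

def find_pivots(utterances, min_partners=3):
    # (word, position) -> set of partner words observed with it
    slots = defaultdict(set)
    for first, second in utterances:
        slots[(first, 0)].add(second)
        slots[(second, 1)].add(first)
    # a pivot candidate keeps its position while its partner varies
    return {word: sorted(partners)
            for (word, _pos), partners in slots.items()
            if len(partners) >= min_partners}

corpus = [("more", "car"), ("more", "juice"), ("more", "sing"),
          ("boot", "off"), ("light", "off"), ("water", "off")]
print(find_pivots(corpus))
# -> {'more': ['car', 'juice', 'sing'], 'off': ['boot', 'light', 'water']}
```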
These mutual affinities between "constant" pivot words and their variable "complements" result in the emergence of populations of microsystems of productive order, which Tomasello (2009, pp. 117-127) calls "item-based constructions". When observing his daughter, Tomasello realized that: «almost all ... multi-word utterances during her second year of life revolved around the specific verbs or predicative terms involved. This was referred to as the Verb Island hypothesis since each verb seemed like its own island of organization in an otherwise unorganized language system...Within any given verb's development there was great continuity such that new uses of a given verb almost always replicated previous uses and then made one small addition or modification.» (Tomasello, 2009)

Other experiments have indicated the validity of the claim that the stage of "pivot schemas" naturally develops into a stage of such "constructional islands" of productive order. For example, the study of Pine and Lieven (1997) has shown that children between 1 and 3 years of age tended to use the determiner "the" juxtaposed with one set of nouns and the determiner "a" juxtaposed with another, with rare overlap between the sets. In a parallel study conducted with the same group of 12 toddlers, the same authors observed that 91.6% of the first 400 distinct utterances could be "traced back" to only 25 initial patterns (Lieven et al., 1997). Since many results of these studies are English-specific (e.g. the importance of prototypical constructions like "want+X", "verb+it" etc.), we consider it important to emphasize those conclusions of these authors which seem to point in a more "universal" direction: «Our metaphor would be of language developing initially as a number of different islands of organization which gradually link up...These islands are initially segments (either words or phrases) which the child has identified to the extent that she can start analysing other systematic relations between what comes before, after or within them...We think, rather, that the data can support a view of structure as emergent.» (Lieven et al., 1997)

Based on the research of Lieven and her colleagues as well as on his own, Tomasello (2009, p. 308) lists three basic operations by means of which a child can produce an utterance:

1. retrieval of a rote-learned concrete expression and the repetition of the same form as was already heard
2. retrieval of an utterance-level construction and its modification in order to fit the current situation
3. «combining constituent schemas» (Tomasello, 2009)

Note that the first operation can be aligned with the notion of "imitation" and thus "replication of information", the third can be interpreted as a "cross-over", and the second is - so states our Thesis - equivalent to what universal darwinists call "variation operators". Tomasello lists three principal means of structural modification, illustrated in the sketch which follows the list:

1. extension: concatenation of a constituent to the end or beginning of an expression (e.g. ich auch + Yoga -> ich auch Yoga)
2. injection: «inserting a new constituent into the middle of an utterance-level construction or expression (the way a German child might insert auch11 [too] into a schema where nothing had ever before appeared)» (Tomasello, 2009)
3. slot-filling: inserting new content into a slot of an item-based construction (e.g. Brot + essen X -> Brot essen)

11 C.f. 12.7.2 for a closer discussion of the productivity of "auch" in the case of one specific child.
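A minimal rendering of these three operators over tokenized utterances might look as follows. It is our own sketch: in a real model the "constituents" would of course be constructions with their own internal structure, not bare strings.

```python
def extend(utterance, constituent):
    # extension: concatenate a constituent to the end of an expression
    return utterance + [constituent]

def inject(utterance, constituent, position):
    # injection: insert a new constituent into the middle
    return utterance[:position] + [constituent] + utterance[position:]

def fill_slot(schema, filler, slot="X"):
    # slot-filling: replace the slot symbol of an item-based construction
    return [filler if token == slot else token for token in schema]

print(extend(["ich", "auch"], "Yoga"))     # -> ['ich', 'auch', 'Yoga']
print(inject(["ich", "Yoga"], "auch", 1))  # -> ['ich', 'auch', 'Yoga']
print(fill_slot(["X", "essen"], "Brot"))   # -> ['Brot', 'essen']
```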
It is evident that such "slots" are, in fact, categories, and they are denoted by what is in formal linguistics (10.2) called non-terminal symbols. As category-representing symbols, they are undoubtedly a consequence of a category-construction (CC) process12. In the long run, the output of the CC process should be a set of categories which are functionally equivalent to the categories shared by, and inherent to, the child's social surroundings. But what was already said about lexical and semantic categories holds also, mutatis mutandis, for the grammatical ones: before the gap between the ambient and the individual is bridged, before the structure of partitions inherent to the latter converges to a structure isomorphic to the former, discrepancies between the two systems are to be observed.

12 We prefer to speak about CC and not simply about "categorization" to mark the distinction between the process by means of which a category is built, and the process during which an already built category is used in order to "categorize" diverse blobs of stimuli observable in the world.

The most salient and best studied among such discrepancies is the group of phenomena labeled as "over-regularization". Traditionally, over-regularization is supposed to account for cases whenever the child applies a production rule beyond the scope of its validity. The most famous example of over-regularization in English is that in a certain stage of their development, practically all children tend to apply the rule VPast → VPresent+ed to all verbs. Thus, especially during the period when their mean length of utterance (MLU) is circa 4-5 words, children generate past forms like «throwed» or «breaked» which they have never (or very rarely) heard. Another interesting aspect of over-regularization is that children often use the correct forms BEFORE they start producing incorrect over-regularizations: «Initially, children's uses of -ed past tense are all accurate. They may say melted or dropped, but not, as they later do, runned and breaked.» (Maratsos, 1988)

Sooner or later - but more often sooner than later - practically all grammatical over-regularizing discrepancies are corrected and the child's linguistic behaviour is aligned with that of her surroundings. It is difficult to describe this fact without taking for granted "the principle of precedence of the specific":

The principle of precedence of the specific (DEF)
«Whenever a newly acquired specific rule (i.e. a rule that mentions a specific lexical item) is in conflict with a previously learned general rule (i.e. a rule that would apply to that lexical item but also to many others of the same class), the specific rule eventually takes precedence.» (Braine, 1971)
end principle of precedence
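The blocking dynamics which this principle describes can be compressed into a few lines - a toy sketch of our own, not a claim about the psychologically real lookup mechanism:

```python
def past_tense(verb, exceptions):
    # precedence of the specific: a stored item-specific (irregular)
    # form blocks the general V+ed rule whenever it is available
    if verb in exceptions:
        return exceptions[verb]
    return verb + "ed"

exceptions = {"throw": "threw", "break": "broke"}
print(past_tense("melt", exceptions))   # -> melted
print(past_tense("throw", exceptions))  # -> threw
print(past_tense("throw", {}))          # -> throwed (over-regularization)
```

Under such a reading, an over-regularizing child is not one who lacks the general rule, but one whose store of specific exceptions is momentarily empty, incomplete or unreliably accessed - and the rote-learned correct forms which precede over-regularization are simply exceptions acquired before the general rule itself.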
This principle is defined here in terms of "rules". But as shall be seen in 9.4, the notion of "rule" is crucial only for certain - and not all - theories of language and LD. Thus, too much focus on the notion of "rule" can turn out to be misleading, the more so in this section of our exposé, where our objective has been to focus more on empirical and less on theoretical considerations of the process of development of individual morphosyntactic representations.

The body of empirical research which has explored this or that facet of language acquisition is indeed vast. For example, even a simplistic synthesis concerning the developmental, cross-linguistic or clinical aspects of MLU would easily account for a monograph thick as a brick. But in this Thesis we cannot dedicate to this topic more space than that which is dedicated to Figure 15. The same holds for other fascinating LD-related topics like cue competition (MacWhinney, 1987) in both comprehension and production, or the acquisition of verbal skills related to negations, questions or word order: all these problems, and many others, are simply too specific to be addressed appropriately here, where only the most general principles of LD are sought to be addressed.

Figure 15: Mean length of utterances produced by English and Italian children of different ages (in months). Figures reproduced from Devescovi et al. (2005).

However, what should be addressed and re-addressed, emphasized and re-emphasized, is the importance of linguistic input. This is so because both the content and the distribution of linguistic input significantly influence the content and distribution of the resulting representations and structures. This observation may seem trivial, but it is less so when one realizes "how special" the content of child-directed input and its distribution are. Since the content shall be more closely discussed in section 9.3, let's now end this brief overview of the development of the child's LD with the question: is there some particular distributional, statistical, computational property of linguistic input which facilitates the internalization of morphosyntactic representations?

And the answer seems to be: yes there is, and it seems to be somehow related to the fact that the acquisition of linguistic representations is governed, similarly to many other cognitive functions, by the principle of distributed practice.

Principle of distributed practice (DEF)
«Given an equal number of exposures, distributed (or spaced) practice at a skill is almost always superior to massed practice.» (Tomasello, 2009)
end principle of distributed practice

In other words, humans in general and children in particular internalize better when they are confronted with the structure-to-be-internalized within the contexts of N different sessions (ideally on different days), and worse when they are confronted with it N times during the same session (on the same day). In relation to LD, this phenomenon was first noticed in a study by Schwartz and Terrell (1983), who observed that both the group of 1-3 year old children who heard a new word once per session and the group of children who heard it twice per session needed approximately 6-8 sessions to learn it. Thus, «when the absolute number of presentations was held constant, distributed (infrequent) presentations led to greater acquisition than massed (frequent) presentations.» (Schwartz and Terrell, 1983)

Similar results were subsequently obtained in studies of the acquisition of grammatical constructions. For example, Ambridge et al. (2006) conclude their study: «for grammatical constructions, children are more able to analogize across exemplars and extract a relational schema when those exemplars are more widely distributed in time than when they are temporally contiguous» (Ambridge et al., 2006).
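Why spacing should help can be glimpsed even in a deliberately crude decay-and-increment memory model. Every ingredient below - the parameter values, the "close half the gap" encoding rule, the exponential forgetting - is our own toy assumption, meant only to make the spaced/massed asymmetry concrete:

```python
import math

def strength_after(exposure_days, test_day, alpha=0.5, decay=0.3):
    """Toy memory model (all parameters invented): trace strength
    decays exponentially between events, and each exposure closes a
    fraction alpha of the gap to full strength."""
    s, t = 0.0, None
    for day in exposure_days:
        if t is not None:
            s *= math.exp(-decay * (day - t))  # forgetting since last event
        s += alpha * (1.0 - s)                 # (re-)encoding
        t = day
    return s * math.exp(-decay * (test_day - t))

print(strength_after([0] * 6, 6))             # massed: ~0.16
print(strength_after([0, 1, 2, 3, 4, 5], 6))  # spaced: ~0.59
```

With six exposures packed into one day, the trace saturates quickly and then decays for six days; with the same six exposures spread over six days, each new exposure re-encodes an already partially forgotten trace, and the strength measured at the test is several times higher.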
And since it is possible that something like the "principle of distributed practice" exerts its force not only in the acquisition of human verbal behaviour but also in the development and optimization of other cognitive functions and skills, the same authors conclude: «a single set of general learning and cognitive processes is responsible for the acquisition of both individual lexical items (the lexicon) and regular and irregular grammatical constructions (the grammar)» (Ambridge et al., 2006). Agreeing with such conclusions, of which Piaget would undoubtedly have been quite fond, we terminate this section with the expression of our belief that, no matter whether taking place in the lexical or the morphosyntactic domain, such processes are principally based on the iterative, gradual and non-deterministic optimization of populations of internal representations which replicate, vary and are subject to selection. A belief which we shall try to defend in what shall follow.

end ontogeny of morphosyntax

9.2.5 ontogeny of pragmatics

Pragmatics goes hand in hand with practice. In linguistics, pragmatics is all that is somehow involved in the production or comprehension of an utterance but is not contained within the utterance itself. Thus, pragmatics is all that encompasses and envelops the communicative act; the pragmatic layer contains all the context within which the utterance is exchanged.

It was already stated that language is a social enterprise, and the context within which natural language utterances are exchanged is thus principally a social context. In such a context, multiple human agents are in mutual interaction, and the exchange of sequences of linguistic symbols is only one among many other ways in which these interactors modify each other's mental states. Other important channels of communication between two prototypical human subjects - a mother and a child - are illustrated on Figure 16.

Figure 16: Some modalities of information exchange between a mother and her child. Reproduced from Trevarthen (1993).

But the extralinguistic context is not limited to facial expressions and gestures. Nor would the introduction of olfactory (pheromonal) or haptic communication make the notion of "context" complete. For the context par excellence is given by the very spacetime region within which the linguistic exchange takes place, the region which contains specific physical objects or embodies certain processes.

Somewhat contrary to what is displayed on Figure 16, linguistic communication is rarely dyadic. Much more often it refers to an object or a state-to-be-attained which is external to both members of the interacting couple. It is rarely by chance that two humans encounter each other: more often they go towards each other because they want to be with each other. Even if the object of such wanting can be a simple "being with the Other". What's more, both interactors have mental states and they both have intentions. And to make things even more complex, they use language in order to mutually modify their mental states. They use language in order to augment the probability that their intention shall be materialized. With the help, and through the act, of the Other. Thus, an infant whose cry/not-cry signal emitter is not appreciated anymore has to change her strategies.
She has to learn what formulas work best in what contexts; what should be said and when, where, in what order and how it should be said, so that the mental states of the Other are modified appropriately. A Herculean task extending well beyond childhood and puberty towards adolescence and beyond: pragmatic knowledge seems to be acquired from the very first until the very last breath of one's ontogeny. Having many forms - from the cry of a newborn to the wisdom of an old man; from benevolent lies to manipulative propaganda - the acquisition of pragmatic knowledge seems to be too difficult a nut to crack. Given this awareness of the diverse intricacies and complexities of the pragmatic layer, as well as the respect before the maxims (Grice, 1975) and values which are its foundation, and in agreement with the principle «Whereof one cannot speak, thereof one must be silent.» (Wittgenstein, 1922), we decide not even to try to discuss the computational aspects of pragmatics-related phenomena.

end ontogeny of pragmatics

9.2.6 physiological and cognitive development

One cannot speak about early development and ignore the vast amount of physiological and cognitive changes which children undergo. Between birth and the end of the toddler stage, the height of the child's body almost doubles and both the weight and the lung volume more than triple. Muscles strengthen, bones ossify. Fontanelles close, thus enveloping the brain within the fully enclosed resonator called the skull. Primordial reflexes appear and disappear. In the first 12 months alone, the average brain volume increases from 369 cubic centimeters to 961 cc. This increase, however, is not to be explained in terms of an increase in the quantity of neurons (gray matter) but in terms of an increase of glial cells. In the context of what was already said about Neural Darwinism (8.6), we consider it important to underline that during development, the number of neurons in fact decreases, due to the process known as "synaptic pruning".

It is also during the early years of development that the linguistic faculty gets entrenched in a specific hemisphere of the brain. While there is still an ongoing debate concerning diverse aspects of this process of "lateralization" (see Clark, 2003, pp. 387-391 for an overview), it is nonetheless commonly accepted that the hemisphere of installation of the developing linguistic faculty is determined in the first 20 months of age.

Among the hundreds of other neurological and physiological changes which shall occur with apodictic necessity in any healthy human child, there are three which we consider to be particularly important for linguistic development and yet tacitly undiscussed by psychodevelopmental linguists. The first is related to the relatively trivial fact that, in comparison to other primates, human teeth erupt very late (Holly Smith et al., 1994). This, on one hand, allows for much longer breast-feeding and hence temporally reinforced emotional and social bonding between the mother and the child while, on the other hand, it makes it impossible for a child to articulate sounds with dental or alveodental acoustic features. The child's ability to correctly generate the language of her surroundings is not only cognitively but also physiologically limited.
The second physiological change to which we would like to point attention, in the context of the ontogeny of the human linguistic faculty, is related to rhythmical behaviour. As suggested by practically the whole tradition of research dedicated to the psychology of rhythm, at least from Fraisse (1974) to Provasi et al. (2014), every human can be characterized, in every stage of his development, by a so-called Spontaneous Motor Tempo (SMT). By being both the tempo at which people tap when asked "Please tap on the table with your hand at the most natural speed" and, also, the tempo which people choose as most natural when asked to choose between multiple tapping sequences, SMT seems to be a fundamental cognitive phenomenon integrating both passive (perceptive or even C-structure) and active (productive or even P-structure) components. In regard to ontogeny, it is worth underlining that SMT tends to slow down with age, which can help children to maturate «from facile acquisition of relatively brief events, such as phonetic categories, to enhanced proficiency with longer events» (McAuley et al., 2006). It is also only with age that children acquire the faculty to process and generate rhythmic patterns in a wider range of tempos, i.e. tempos which significantly differ from the SMT of their endogenous tact-giving "oscillator". Given the importance of tempo for the control of repetitive or oscillatory activity - including not only language but also walking, which, coincidentally or not, appears approximately in the same period when children leave the phase of canonical babbling and enter the phase of word production - we believe that the study of SMT and other rhythm-related phenomena could be useful for any further study of human cognition. Within the scope of the current Thesis, however, these phenomena shall serve only a peripheral role.

Last but not least, the third non-negligible development occurring in early childhood is related to changes in the length, distribution and composition of sleep cycles. Thus, «the newborn infant spends two-thirds of each 24-h period asleep; by 6 months he spends half of his time asleep and half of his time awake. Sleep consolidation is another important aspect of infant's sleep development. By the age of 6 months sleep condensed into fewer periods of longer duration so that sleep periods are lengthened from 4 to 6 h.» (Gertner et al., 2002) But not only do newborns sleep significantly more than toddlers, who in turn sleep significantly more than older children and adults; not only do they sleep not in one nocturnal block, as adults do, but in multiple blocks, both diurnal and nocturnal; but - and this is an important "but" - they spend significantly more time in the dreaming "rapid eye movement" (REM) phase than they ever shall in the future: «REM sleep assumes a high proportion of total sleep in the first days of life and its amount and ratio diminish as maturation proceeds» (Roffwarg et al., 1966).

The beautiful line of research studying pre-sleep "crib talk" monologues (Nelson, 2006) aside, the relation of LD and sleep has not yet been studied to the extent it merits. Note, however, the words with which the author of a few among the very few experiments studying the impact of sleep upon the processing of linguistic stimuli concludes her results: «memory consolidation associated with sleep introduces flexibility into learning, such that infants recognize a pattern at test regardless of whether it is instantiated exactly as it was before.
Sleep then sustains the learning of previously encountered information in a form that enables children to generalize to similar but not identical cases, and it also introduces flexibility into learning» (Gómez, 2011). Consistently with such conclusions, we end this short exposé with the statement that, from the point of view of the theory of intramental evolution of linguistic representations which we hereby aim to introduce, such "memory consolidation" occurring during sleep could be interpreted in terms of the activity of mutation and cross-over operators acting upon, and mixing, already encoded structures.

end physiological and cognitive development

9.3 motherese

The child's closest social environment are her parents, most notably her mother. Hundreds of studies have been conducted to study the nature of «motherese», the special simplified child-directed register which mothers use when speaking with their children. While some studies point in divergent directions, they more or less agree that «maternal speech has certain characteristics that distinguish it from speech to other adults. These characteristics are in essence simplicity, brevity and redundancy» (Harris, 2013). Other characteristics generally associated with child-directed speech are:

1. higher pitch (the mean fundamental frequency of speech to 2-year-olds is circa 267 Hz, versus 198 Hz in the case of speech to adults)
2. exaggerated intonation (wider pitch range)
3. slower speech due to both more and longer pauses
4. repetitions and variation sets (9.3)

Clark (2003) summarizes the properties of child-directed speech as follows: «adults consistently produce shorter utterances to younger addressees, pause at the ends of their utterances around 90% of the time (50% in speech to adults), speak much more fluently, and frequently repeat whole phrases and utterances when they talk to younger children. They also use higher than normal pitch to infants and young children, and they exaggerate the intonation contours so that the rises and falls are steeper over a larger range (up to one-and-a-half octaves in English).» (Clark, 2003)

Multiple studies indicate the existence of a causal link between the quantity and simplicity of motherese utterances and the speed of the child's linguistic development. More concretely, it has been observed that «mothers' choice of simple constructions facilitated language growth» (Furrow et al., 1979), while a more complex style can slow development down. Other studies specify that «children who showed the earliest and most rapid language development received significantly more acknowledgments, corrections, prohibitions and instructions from their parents» (Ellis and Wells, 1980). Another means by which LD can be stimulated are variation sets.

Variation sets

A variation set (VS) is composed of two or more subsequent utterances which are all derived from one common item-based construction. «Variation sets are identified by three types of phenomena: (1) lexical substitution and rephrasing, (2) addition and deletion of specific referential terms, and (3) reordering of constituents.» (Küntay and Slobin, 2002)

The most simple form of a VS is a sequence of utterances U1...Ux sharing the same word W, which serves as a link between subsequent utterances:

MOT: we lost that piggy bit .
MOT: so that bit goes there .
MOT: and (.) we've lost that horsie bit which is a bit of a pain .
MOT: that bit goes there .
MOT: and that bit there .
MOT: shall we try and find the lost bits ?
MOT: found two bits .
Just a little more complex are VS where the "linking" between U1 ... UX is done first with one word (or construction) W1 which, at a certain moment, co-occurs with another word (or construction) W2 which subsequently "links" the following sentences. In the illustratory example above, extracted from the data/Eng-UK/Lara/1-11-27.30.cha transcript of the CHILDES corpus (MacWhinney, 2014), the words "bit" and "that" fulfill such a fixing role. More complex variation sets involve slight variation of longer expressions. For example, the transcript data/Eng-UK/Lara/1-11-27.30.cha of the same mother-daughter couple, taken 7 months later, contains the following VS:

MOT you have to sing to her , Lara , if you want her to go to sleep
MOT sing her a lullaby
MOT you have to sing her a song

In this VS, it is the whole expression "sing something to someone" which is varied, first by removal of the non-obligatory dative marker "to" and subsequently by variation of the object of singing, from a lullaby to a song.

Many researchers argue that exposure to variation sets can facilitate the acquisition of both semantic and syntactic categories and/or rules. For example, Küntay and Slobin (1996) observed that

1. the use of variation sets is positively correlated with the child's acquisition of certain verbs
2. VS make up cca 20% of child-directed speech.

A similar observation, i.e. that 1/5th of child-directed speech consists of variation sets, was made by Brodsky et al. (2007). In a study involving the analysis of the CHILDES corpus, the authors explain the advantages of VS in information-theoretic terms: « variation sets seem to be ideal environments for learning lexical items and constituent structures...a pair of utterances that have nothing in common is not informative, and neither is a pair of identical utterances. An optimally informative pair would therefore balance between overlap and change.» (Brodsky et al., 2007)

Note that the notion of « variation set » can be interpreted in UD-consistent terms, given that:

1. repetition is equivalent to « replication in time », and every single instance of the utterance can therefore be considered an independent, individual structure
2. alteration of form between subsequent utterances can be interpreted as a consequence of a variation operator influencing the production of new sentences

To illustrate the extent some variation sets can reach, the 1st appendix contains the longest variation game discovered in, and extracted from, the CHILDES corpus.

end variation sets 9.3

Studies like those of Harris (2013) suggest that the relation between the complexity of motherese and the complexity of the child's own production is in fact reciprocal. Thus, mothers adjust their language according to the stage of the child's linguistic development. In the context of the current Thesis, we propose to interpret this mutual convergence of motherese and toddlerese in terms of adaptation and co-evolution of two populations of linguistic structures. The mother adapts to the toddlerese of her child, the toddler adapts to motherese: both co-evolve. In the long run, it is the adult who leads the dance. This is so because internalization of the language of the Other is in the child's, and not the parent's, vital interest. Thus, the child's P-structures which correctly modify the adult's behaviour in an intended direction could be considered more fit, and thus more prone to intramental replication, than P-structures which do not yield the intended effect.
end child-directed speech 9.3

9.4 language acquisition paradigms

In practically no modern scientific discipline is the age-old trialectics between realists, nominalists and idealists as ardent as in psycholinguistics. Different perspectives and terminologies notwithstanding, it is nonetheless the eternal "problem of universals" which is being targeted. Travestied as rationalists, mentalists or nativists, one group does its best to convince the public that the intangible "general" is prior to the "specific"; in the other camp, the empiricists battle for their belief that the observable and specific is prior to the general. Over the course of centuries, hours of hand-waving, ink-spilling and dozens of metaphysical chimeras have accompanied the process whose unfolding is supposed to bring the scientific community ever closer to the "most fit" narrative about the origins and development of language in onto-, phylo- or even cosmo- (De Chardin et al., 1965) genesis. Let's now glance at a few among its most distinctive figures.

9.4.1 classical

One of the first attempts to describe the process of language acquisition was made by Saint Augustine in his Confessions: « Passing hence from infancy, I came to boyhood, or rather it came to me, displacing infancy. Nor did that depart,- (for whither went it?)- and yet it was no more. For I was no longer a speechless infant, but a speaking boy. This I remember; and have since observed how I learned to speak. It was not that my elders taught me words (as, soon after, other learning) in any set method; but I, longing by cries and broken accents and various motions of my limbs to express my thoughts, that so I might have my will, and yet unable to express all I willed, or to whom I willed, did myself, by the understanding which Thou, my God, gavest me, practise the sounds in my memory. When they named any thing, and as they spoke turned towards it, I saw and remembered that they called what they would point out by the name they uttered. And that they meant this thing and no other was plain from the motion of their body, the natural language, as it were, of all nations, expressed by the countenance, glances of the eye, gestures of the limbs, and tones of the voice, indicating the affections of the mind, as it pursues, possesses, rejects, or shuns. And thus by constantly hearing words, as they occurred in various sentences, I collected gradually for what they stood; and having broken in my mouth to these signs, I thereby gave utterance to my will.» (Augustine, 1838)

Not being a theory per se, Augustine's naive but honest reflection concerning the origins of his own mental faculties nonetheless had a non-negligible impact upon all theories of LD which followed. From our current perspective, Augustine's Confessio could most probably be labeled a precursor of the associationist school. This is so because Augustine principally explains the ontogeny of his semiotic faculty in terms of associations between "words" and "things". Associationism was one of those very rare prototheories of cognition which succeeded in surviving the millennium and a half which followed. David Hume, John Locke and J.S. Mill adhered, in one way or another, to the camp of those who were convinced that a great deal of mental phenomena - or possibly ALL phenomena - could principally be explained in terms of the mind's tendency to "link" its internally represented "signs" with things-in-the-world or with other "signs".
Thus, thanks to its usefulness in learning and its evidence in introspection, the associationist school succeeded in crossing the centuries to see the day when the neuropsychologist Donald Hebb postulated the material (i.e. neural) basis of what had before been considered solely mental phenomena: « When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell.» (Hebb, 1964) Since the discovery of this phenomenon, often summarized by the saying "cells that fire together, wire together; neurons that fire out of sync fail to link", Hebb's rule has helped to yield explanations of such neuroscientific challenges as the emergence of mirror neurons13 (Keysers and Perrett, 2004). The core idea behind many functional artificial neural network architectures - e.g. Hopfield networks (Hopfield, 1982) - is also essentially Hebbian. In 10.4.2, Hebb's postulate shall be mentioned as a potential explanation of the validity of the so-called "distributional hypothesis", the idea behind practically all efficient computational models of semantic vector spaces.

Behaviorism is another school of thought derived from the associationist school, and one in which Hebb's rule can play the role of the most fundamental principle. For, in its very essence, behaviorism simply substituted the notion of association between two signs with the notion of conditioning. By adopting the terminology of stimuli and reflexes, of rewards and punishments, and by renouncing all methods which were not strictly positivist and empiric, the behaviorist school renounced any attempt to understand the internals of the mind. Since behaviorist precepts worked - and worked not only when applied to Pavlov's dogs or Skinner's pigeons but to humans as well - and because science lacked both computers and subtle experimental neuroimaging apparati, the attempt to explain man's mind in terms of reinforcement of relations between stimuli and reflexes was the principal preoccupation of western psychology in the first half of the 20th century. And it would possibly have dominated until now if the central figure of the field, B.F. Skinner, hadn't decided to apply the behaviorist doctrine to the domain of linguistics. Skinner's book Verbal Behavior (Skinner, 1957) claimed that language is learned by operant conditioning, id est, that the child learns language because the expressions of her verbal behaviour - the fact that she utters X and not Y - are reinforced by parental rewards. For example: « In all verbal behavior under stimulus control there are three important events to be taken into account: a stimulus, a response and a reinforcement. These are contingent upon each other... The three term contingency ... is exemplified when, in the presence of a doll, a child frequently achieves some sort of generalised reinforcement by saying doll.» (Skinner, 1957) In Skinner's theory, the vectors of reinforcement are not necessarily just dolls, milk and breasts but can be quite abstract: parental attention itself can be rewarding; lack of it may punish.

13 It was already indicated, during our discussion of memetic theory, that mirror neurons are often mentioned in relation with imitation. But in the theory hereby introduced, they can also be understood as a neural substrate for bridging C-representations with P-representations.
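Skinner's three-term contingency is easily rendered as a toy computational sketch. The following fragment - our own illustration, not Skinner's formalism, with all names and numbers chosen arbitrarily - shows how reinforcement makes an apt response progressively dominate a repertoire:

import random
from collections import defaultdict

# Operant strength of each (stimulus, response) pair; all start equal.
strength = defaultdict(lambda: 1.0)

def respond(stimulus, repertoire):
    # A response is sampled proportionally to its current operant strength.
    weights = [strength[(stimulus, r)] for r in repertoire]
    return random.choices(repertoire, weights=weights)[0]

def reinforce(stimulus, response, reward=0.5):
    strength[(stimulus, response)] += reward

repertoire = ["doll", "milk", "mama"]
for _ in range(200):
    response = respond("sees-doll", repertoire)
    if response == "doll":            # the caregiver rewards the apt utterance
        reinforce("sees-doll", response)

print(strength[("sees-doll", "doll")])   # has grown far beyond the others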
By focusing on the child's and the parent's basal needs and wants, and on the behaviours with which they attain them, Skinner articulated a theory which has some overlaps with current interactionist and socio-pragmatic theories of LD (9.4.4). The idea of founding the fitness functions of our grammar induction systems not only on measures internal to the system, but also on the environment's responses to the system (14.4.1), is also traceable back to a similar behaviorist point of view.

9.4.2 generativists and nativists

Then came Chomsky. In his revolutionary Syntactic Structures (Chomsky, 1957) he proposed to adopt a rule-based, algebraic, transformationalist approach to explain the mystery of grammars able to generate an infinite number of utterances out of finite sets of elements. Two years later, young Noam gained in prominence by an overt and uncompromising review (Chomsky, 1959) of the aging Skinner's Verbal Behavior. More carpet bombing than review, this critique - in some circles considered the most influential rhetorical exercise of the 20th century - irreversibly hallmarked the rupture: the turn from the "behaviorist" to the "cognitive" approach to the study of language acquisition. Whether one rightfully praises Chomsky, like Gardner (1985b), or rightfully criticizes him, as does Tomasello (2009), one has to admit that he was among the first who attempted to interpret linguistic representations and processes as fundamentally computational phenomena. Thus, the surface structure of an utterance was to be understood as the output of a series of substitutional rules acting upon a certain deep structure offered as input. While the notion of substitution rules was already known to the Sanskrit scholar Panini more than two millennia before Chomsky, and was in the 19th century practically deified by the neogrammarians who spent non-negligible effort using the notion of a universally applicable rule to explain the mystery of historic language change (8.5), it was nonetheless Chomsky who, strongly influenced by his predecessors Jakobson and Harris, developed a theory whereby substitution rules are supposed to act also in individual human cognitive systems.

Panini's Grammar (APH)

Panini's (cca 400 BC) grammar is the oldest attested work in descriptive linguistics. Composed at the end of the Vedic period and at the beginning of the Classical period, it contains 3996 rules of Sanskrit morphosyntax and, to a lesser extent, also of semantics. It was transmitted orally - from masters to their students in myriads of schools spread throughout whole India - in the form of sutras, i.e. verses to be memorized. The grammar begins with the Shiva Sutras, an enumeration of 14 fundamental phonological classes from which one can generate 281 pratyāhāras, i.e. classes of second order, which are subsequently processed by application of one or more among the almost 4000 "Ashtadhyayi" substitution rules and meta-rules. In terms of modern linguistics, these 14 sutras list all Sanskrit terminal phonemes (16 vowels and 33 consonants) and associate them with anubandha labels (non-terminals). A PERL-consistent transcription of the Shiva Sutras follows: every line presents one sutra in the form of a substitution rule. Parentheses contain individual phonemes; the symbol enclosed between the second and third / denotes the anubandha.

Shiva Sutras (SRC)

s/ ( a | i | u ) / ṇ /
s/ ( ṛ | ḷ ) / k /
s/ ( e | o ) / ṅ /
s/ ( ai | au ) / c /
s/ ( ha | ya | va | ra ) / ṭ /
s/ la / ṇ /
s/ ( ña | ma | ṅa | ṇa | na ) / m /
s/ ( jha | bha ) / ñ /
s/ ( gha | ḍha | dha ) / ṣ /
s/ ( ja | ba | ga | ḍa | da ) / ś /
s/ ( kha | pha | cha | ṭha | tha | ca | ṭa | ta ) / v /
s/ ( ka | pa ) / y /
s/ ( śa | ṣa | sa ) / r /
s/ ha / l /

end shiva sutras (src) 9.4.2
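A pratyāhāra denotes the contiguous sequence of phonemes which starts at a given sound and ends at the sutra closed by a given anubandha. The extraction mechanism itself is simple enough to be sketched computationally; the following fragment is our own hypothetical illustration, with a crude ASCII transliteration of the phonemes and invented marker names, not Panini's notation:

# The 14 Shiva Sutras as (phonemes, anubandha) pairs, ASCII-transliterated.
SUTRAS = [
    (["a", "i", "u"], "N1"), (["r", "l"], "K"),
    (["e", "o"], "NG"), (["ai", "au"], "C"),
    (["ha", "ya", "va", "ra"], "T"), (["la"], "N2"),
    (["nya", "ma", "nga", "na1", "na"], "M"), (["jha", "bha"], "NY"),
    (["gha", "dha1", "dha"], "S1"), (["ja", "ba", "ga", "da1", "da"], "SH"),
    (["kha", "pha", "cha", "tha1", "tha", "ca", "ta1", "ta"], "V"),
    (["ka", "pa"], "Y"), (["sha", "sa1", "sa"], "R"), (["ha"], "L"),
]

def pratyahara(start, anubandha):
    # Collect phonemes from `start` up to the sutra closed by `anubandha`.
    collecting, result = False, []
    for phonemes, marker in SUTRAS:
        for p in phonemes:
            if p == start:
                collecting = True
            if collecting:
                result.append(p)
        if collecting and marker == anubandha:
            return result
    raise ValueError("no such pratyahara")

print(pratyahara("i", "K"))   # iK, the simple vowels: ['i', 'u', 'r', 'l']
print(pratyahara("a", "C"))   # aC, all vowels: ['a', 'i', ..., 'ai', 'au']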
A description of the means by which all 281 (14*3 + 13*2 + 12*2 + 11*2 + 10*4 + 9*1 + 8*5 + 7*2 + 6*3 + 5*5 + 4*8 + 3*2 + 2*3 + 1*1 - 14 - 10) pratyāhāras are to be generated from the list of classes above, and a description of the almost 4000 rules (e.g. vṛddhir ādaiC) which subsequently allow the production - so it is believed - of the whole corpus of one amongst the most complex languages ever known to man: all that surpasses the objectives of this dissertation. What does not surpass them, however, is the question: "How could Panini (or any lineage which preceded Panini) ever discover such a grammar?" Computationally, the task is enormous. Revelatory explanations - so popular in India - aside, we see only one possible answer: by means of intramental evolution.

end panini's grammar 9.4

Not only Panini, Jakobson and Harris influenced Chomsky, but also Turing, whose idea of a symbol-substituting machine crossed over with the discovery of the transistor, thus yielding a new and powerful generation of computers around the time Chomsky entered MIT. And it was in, by and through contact with computers that Chomsky understood the generativity of a rule which is applied many times and which can consider its own past outputs as its present or future inputs. Once recursivity was understood, there followed the insight that recursivity is present in natural languages as well, e.g. in expressions like: She knows that he knows that he knows that she knows... Et caetera, theoretically ad infinitum. For an old-school generativist, the very theoretical possibility of realizing such an infinite regress constitutes a sort of proof of the belief that grammars, understood as systems of rules which can be recursively applied upon sequences of symbols chosen from a finite alphabet, can ultimately generate infinite amounts of such sequences.14

That recursion, combined with substitution, can simulate and/or generate practically anything was already known to Cantor and Gödel, let alone Turing. But contrary to Gödelian proving-god-through-arithmetics and Turing's Enigma-breaking, Chomsky decided to program the "universal machine" with a goal in mind which his audience could understand: language. To make things formal - and contrary to centuries of knowledge which say otherwise - language was subsequently reduced to a set of sequences of symbols chosen from the alphabet (10.2). Other definitions, axioms and theorems followed, often with huge importance for the subsequent development of ever more complex assemblers, parsers and compilers of artificial languages. For example, it is difficult to imagine how informatics could have moved from assembler to C++ or PERL without having at its disposal the theoretical framework of the Chomsky-Schützenberger containment hierarchy of formal grammars. Thus, a formal system was developed which turned out to be useful for certain subdisciplines of informatics.

14 C.f. the "Halting problem" in (Hromada, 2008) for a closer theoretical discussion of this "infinitist" fallacy which is, we believe, the source of many problems which have haunted generative linguistics from the very moment of its conception.
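The generativity of recursion which so impressed the early generativists can be demonstrated in a few lines. The toy grammar below is our own illustration, not one proposed by Chomsky: a single recursive production suffices to generate an unbounded set of "she knows that..." sentences:

import random

# One recursive rule ("S" can rewrite into something containing "S")
# makes the set of generable sentences infinite.
RULES = {
    "S": [["NP", "knows that", "S"],    # recursive production
          ["NP", "sleeps"]],            # terminating production
    "NP": [["she"], ["he"]],
}

def generate(symbol="S"):
    if symbol not in RULES:                  # terminal symbol: emit as-is
        return symbol
    expansion = random.choice(RULES[symbol])
    return " ".join(generate(s) for s in expansion)

print(generate())   # e.g. "she knows that he knows that she sleeps"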
But as is often the destiny of formal systems which fail to delimit their domain of applicability, its proponents started to confound the map with the territory. As a result, practically a whole generation of linguists following the "discovery of the generativity of recursivity" got lost in the labyrinth of futile attempts to fit the expressive diversity of the natural into a monolithic framework able to account only for the simplest among the artificial. Thus, in a somewhat paradoxical turn of events, thousands of linguists were transformed into pigeons pecking X-bars, conditioned by the reinforcement principle of "publish or perish". New models and theories with names as binding as "Government and Binding" or as non-minimalist as the "Minimalist Program" were proposed and turned into full-fledged doctrines shared among castes of initiates. To give at least some meaning to the ever more esoteric symbol-substituting passe-temps, a noble quest was launched: the quest for a so-called Universal Grammar.

In its very essence, the notion of Universal Grammar (UG) is Chomsky's answer to the problem raised by Gold, related to the fact that from one specific language sample one can induce multiple grammars which are able to generate such a sample. But if multiple grammars can be obtained, how could a child know which one is the correct one? Chomsky's answer was: because her choice is constrained by "something" innate: her majesty the UG.15 One can explain "innate" either in creationist or in emergentist terms. Hoping that nativists do not belong to the first group, there is only one way the existence of UG could be explained: by an evolutionary process. This is, we believe, a point of reconciliation, a point of convergence between non-creationist yet nativist doctrines which postulate UG and the theory of intramental evolution as hereby introduced. On the other hand, there is also a significant point of divergence: while we suggest that it is more computationally feasible to produce language-generating and language-constraining representations during ontogeny, Chomsky's "nativist" disciples like Pinker (1994) spent a significant part of their careers arguing that UG is somehow produced by evolution which is phylogenetic. Simply stated: nativists believe that UG is encoded in DNA.

As was already indicated, the raison d'etre of UG is to constrain and direct the learning of grammar. The problem that English children learn the grammar of English and Chinese children learn the grammar of Chinese was "solved" by the so-called Principles & Parameters theory as follows: during acquisition of language LX, the child extracts from the utterances she hears a set PX of parameters specific to LX, and inserts these parameters PX into the UG, thus obtaining the specific grammar GX able to generate LX. Id est:

GX = UG(PX)

According to nativists, a child would not be able to learn GX without UG's intervention in the acquisition process. The necessity is supposed to be both empiric and theoretic.

15 Note that there is a certain symmetry between the couples (Turing Machine, Universal Turing Machine) and (Grammar, Universal Grammar). In both couples, the unicity of the latter furnishes a frame for the diversity of the former. But there is a difference as well: while a UTM allows one to emulate any Turing Machine, UG is supposed to constrain the set of relevant grammars.
The empiric necessity is related to the problem of the "poverty of stimulus": the claim that the utterances children hear are qualitatively incorrect and quantitatively insufficient to account for the fact that the child, indeed, learns the language of its environment. Unfortunately for nativists, a vast body of rigorous empiric research (Clark, 2003; Karmiloff and Karmiloff-Smith, 2009; Tomasello, 2009) in DP indicates that the notion of "poverty of stimulus" was nothing other than a chimera and that the reality of a healthy child surrounded by a healthy social environment is rather the contrary: one should speak not about poverty, but rather about the "power of stimulus". A scientist who agrees with the statement « in summary, child-directed speech and other sources of language – overheard speech, stories read aloud, speakers heard on radio or TV, for instance – provide such rich input that children should eventually learn enough of their language for all their needs » (Clark, 2003) can thus discount the poverty of stimulus as simply irrelevant.

On theoretical grounds, the necessity of having something like UG embedded in the human Language Acquisition Device (LAD) is often claimed to be a necessary consequence of Gold's Theorem, an important result obtained in the "learnability theory" sub-branch of formal language theory.

Refutation of Gold's Theorem (APH)

In short, the theorem postulated by Gold (1967) states that « Any class of languages with the Gold Property is unlearnable.» (Johnson, 2004) In Gold's formal system, a class C of languages has the Gold Property if and only if:

1. C contains a countable infinity of languages Li such that Li ⊂ Li+1 for all i > 0
2. C contains a further language L∞ such that for any i > 0, x is a sentence of Li only if x is a sentence of L∞, and x is a sentence of L∞ only if x is a sentence of Lj for some j > 0

Learnability, on the other hand, is defined in terms of:

1. an environment E, which is supposed to be an infinite sequence of sentences of the language to be learned
2. an ideal learner, which « learns L given E iff there is some time tn such that at tn and all times afterward, the learner correctly guesses that L is the target language present in the environment » (Johnson, 2004)

Gold's Theorem therefore simply states that an infinite16 set of mutually embedded languages is « unlearnable » by a system which never forgets and which is exposed to an infinite environment. It is difficult to see, however, how this purely theoretical relation between purely theoretical infinite sets can have any further implication for concrete individual languages, understood as finite sets of utterances exchanged in specific extralinguistic contexts. Given that the lifetime of an individual human learner is always finite, the linguistic environment of such a learner also has to be finite. The first condition of what "learnability" means in Gold's formal system is thus irrelevant to human learners. The second condition is irrelevant as well: the human mind is not a storage system which faithfully internalizes and stores, unmodified and for "all times afterward", every piece of linguistic data to which it was exposed. Humans forget; children all the more. It is therefore somewhat unclear how the very notions of learnability and unlearnability, as defined by Gold, could apply to human beings in general and to children in particular.
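The gist of the Gold Property can be made tangible with a toy construction of our own devising: take the class Li = { "a"*n : 1 ≤ n ≤ i } together with L∞ = { "a"*n : n ≥ 1 }. A learner that always guesses the smallest language consistent with its sample will, when fed an enumeration of L∞, revise its guess forever:

def smallest_consistent_guess(sample):
    # Conservative learner: guess L_i where i = length of the longest
    # sentence seen so far (the smallest consistent language).
    return max(len(s) for s in sample)

sample, guesses = [], []
for n in range(1, 11):           # environment: an enumeration of L_inf
    sample.append("a" * n)
    guesses.append(smallest_consistent_guess(sample))

print(guesses)   # [1, 2, 3, ..., 10]: the guess never stabilizes,
                 # so L_inf is never identified "in the limit"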
This being said, we find it appropriate to state that, more than an important proof telling us something about LD in human beings, the whole fuss about Gold's Theorem is rather evidence of how a multidisciplinary scientific endeavour can get stuck for decades in a blind alley just because of an inter-disciplinary quaternio terminorum (Sokol, 1998) fallacy. In other words, the term « unlearnable » in Gold's Theorem is just a term which Gold uses within his tautological statement to denote certain properties of certain infinite, hierarchically embedded sets of sequences of symbols and, as such, relates only abstractly to the concrete condition of human learners.

end refutation of gold's theorem 9.4.2

16 Taking Gold's Theorem seriously in regards to language learning is equivalent, mutatis mutandis, to the belief that children shall never learn basic arithmetics because, in order to understand addition, they would first have to be confronted with all integers between one and infinity.

A theoretical pillar of the necessity to postulate UG being thus somewhat undermined by the above aphorism, there are not many reasons not to use the lex parsimoniae of William of Occam to raze the notion of a DNA-encoded UG from the terminological toolbox of 21st-century linguistics. This does not mean that there are no faculties and features which are universally present among all human languages.17 Take, for example, the fact that in all human cultures people group consonants and vowels into specific clusters. Given the universality of the phenomenon, one is more than tempted to agree with authors like Jackendoff (2002) (a nativist of the second generation) and state that syllabization is a component of UG. But when one realizes that syllabization is potentially just a consequence of the concrete application of a deeper cognitive process, known as "chunking", upon articulatory programs constrained by the trivial fact that consonants cannot be pronounced without vowels, one can immediately ask whether syllabization - or any other component of the so-called UG - is in fact not a consequence of particular interactions of more general cognitive processes, the neurophysiological characteristics of human beings (9.2.6) and their particular social environments.

Somewhat contradictorily to what the term "universal" normally means, a faithful nativist would consider the following equation:

LD = General cognitive processes(Linguistic Input + Extralinguistic Input)

an inexcusable heresy. This is so because he considers UG to be the core component of a language-specific cognitive module and not a general cognitive process. For some nativists, there is only one domain where the mind uses rule-based grammars: language. Others may be ready to accept that some other domains of human activity - from walking, dancing, body exercising and mating, through music-generating, food preparation and object creation, to ritual-performing, healing or simple arithmetics - can also be rule-based and encapsulated in specific cognitive modules (Fodor, 1983), or can even have sorts of grammars of their own. Thus, while all nativists believe that the foundations of the language faculty are to be found in genes, only few would accept that the triggering of those very same genes could result in the activity of drumming or salsa-dancing as well.

17 Some universals like compositionality, graduality, specificity etc. were mentioned in section 9.2.4.
To summarize: more than half a century ago, the generativist approach to language threw an energetic spark into the muddying waters of structural linguistics, thus igniting a passionate interdisciplinary debate between linguistics and computer science. The strictly formalist, transformationalist approach failed to furnish a complete, consistent and elegant framework for the study of natural languages, but it significantly facilitated the construction and further development of artificial and programming languages. What failed to a greater extent, however, was the nativist enterprise aiming to discover DNA-encoded "innate" predispositions specific to language. In spite of two generations of effort by linguists, psychologists, geneticists and clinicians, no "language gene" was discovered, and the answers to the questions

• Which components of the Language Acquisition Device are innate?
• What is the nature of its core, the Universal Grammar?
• Which processes are purely language-specific and which are more general?

seem as obfuscated as they ever were, potentially because their terms mean slightly different things for computer scientists, logicians, psychologists and linguists. But the lack of answers to such questions notwithstanding, the orthodox nativist position adopted by Chomsky nonetheless resulted in a state profitable for everybody: at the beginning of the 21st century, the vast majority of cognitive scientists agrees that man's language faculty is a result of interaction between at least two major components:

1. cognitive and physiological characteristics tuned by the phylogenesis of the human species
2. the input to which the language learner is exposed during its prenatal and postnatal ontogeny

and the whole nature & nurture debate is no longer led in terms of a mutually exclusive either/or, but focuses on the degree and forms of mutual triggering and epigenetic interactions between innate and acquired programs. Let's now leave the discussion of those who emphasize the importance of the first component and focus on those who emphasize the role of the second: empiricists and constructivists.

end generativists and nativists 9.4.2

9.4.3 empiricists and constructivists

Empiricists argue that human knowledge arises principally from experience (ἐμπειρία). They thus explain the acquisition of a certain word
or expression in terms of the perceived contexts within which the child hears the given word or expression. The empiricist paradigm is thus quite similar to the associationist and, to a lesser extent, also the behaviorist paradigms mentioned above (9.4.1). But what about the acquisition of structures and principles which are not salient, evident or even perceivable at all? What about the acquisition of all those directly unperceivable entities - be they rules, schemas, patterns, templates or something else - which determine the result of linguistic comprehension and production, yet are not present per se in any utterance? What about all the word order, long-distance dependency or chiasmatic principles which definitely have to be somehow encoded in the mind - because they act - but are detectable only through the consequences of their actions? Because they express themselves only through their instances, and because these instances vary, pure empiricism has to encounter serious epistemological problems when explaining, solely through sensory experience, the acquisition of linguistic representations operating with and on more general levels of abstraction.

Through hundreds of years of both theoretical reflection and methodic experimentation, empiricists gradually evolved into constructivists. Being firmly rooted in the phenomenology of everyday human experience, constructivists do not deny the existence of more general representations. They simply state that all those concrete-surpassing quantifiers, rules, principles, categories, templates or schemas are as natural a consequence of the exposure of mind to repetitive, contextualized stimuli as photosynthesis is a natural consequence of the exposure of a plant's leaves to light. For constructivists and their connectionist descendants, the mind's hardware - the brain - is a generalization device par excellence, and therefore there is truly nothing mysterious about the fact that the mind is able to transcend the concrete and the arbitrary. Thus, contrary to their nativist counterparts who postulate that the child's mind tries to "deduce" the concrete grammar of her ambient language by entering specific parameters into the formal system of universal grammar, constructivists postulate that the child's mind in fact "induces" her grammar from and out of the myriads of specific utterances she hears.

In the famous Chomsky vs. Piaget debate, the position of the father of all constructivists could be characterized as follows: « Piaget maintains that one's linguistic structures are not defined by the genome, but instead, are 'constructed' by 'assimilating' (organizing) things in the environment in terms of pre-linguistic structures, and 'accomodating' (modifying) these as they prove insufficient. This mode of functioning, called 'reflective abstraction' is innate, as is some elementary reflex behaviour (e.g., sucking, grasping), but the cognitive structures, even the pre-linguistic ones, are not.» (Piatelli-Palmarini, 1980)

Aside from the two modes of schema application and modification which Piaget called 'assimilation' and 'accommodation', and which we have already discussed in the presentation of Piaget's Genetic Epistemology theory (8.4.4), and aside from the notion of 'reflective abstraction', which Piaget uses to explain cognitive development occurring well after the age of toddlerese, Piaget introduced other terminology which we consider particularly useful when aiming to explain certain facets of LD: circular reaction, schema coordination and interiorization. Circular reactions are related to the propensity of the cognitive system to repeat, reproduce and reactivate its schemas. Primary circular reactions occur between 1 and 4 months of age and are triggered by the child's discovery that acts originally performed by accident can bring about a pleasing consequence, which leads to subsequent repetition of the action. Secondary circular reactions occur between 4 and 8 months of age; they are still repetitive and habitus-forming but also involve external objects (e.g. switch switching). Schemas thus formed are subsequently mutually combined and recombined, coordinated and recoordinated, thus generating an ever bigger variety of behaviours and habits which the child finds useful, pain-reducing or simply pleasurable. According to Piaget's theory, the schemas of the first 18 months of age are principally sensorimotor.
But later, after the basic perception-action couplings have been mastered and optimized to a sufficient extent, the child leaves the "sensorimotor stage" and enters the "preoperational stage", wherein she starts to "internalize" the schemata. Internalization does not mean that the child simply creates neural representations of her sensorimotor couplings: such "neural substrate encoding" is axiomatic for any organism with a nervous system and takes place already in prenatal development. Internalization in Piaget's theory means that a child creates neural representations of mental substitutes - symbols - which themselves refer to certain sensory, motor and, later, also symbolic "realities". Internalization shall subsequently allow the child to execute certain operations purely mentally, without the need to materially realize them in physical reality: it is to a great extent thanks to internalization that the child can find the shortest way out of an unknown space without the need to physically toddle through all possible paths.

In parallel to Piaget, at the other border of Europe, but in a less Kantian and somewhat more "dialectical materialist" a space, the mind of Lev Vygotskij was slowly converging to practically identical conclusions: the process of internalization of schemas was to explain the ontogeny of thinking. Believing that « the internalization of socially rooted and historically developed activities is the distinguishing feature of human psychology, the basis of the qualitative leap from animal to human psychology » (Vygotsky, 1978), and knowing that language is potentially the most important exemplar among such "socially rooted and historically developed activities", Vygotskij went even further than Piaget and postulated that thinking is a form of internalized language: thoughts are inner-speech utterances.

There are, of course, subtle differences between the theories of Piaget and Vygotskij. For example, while Vygotskij's theory focuses more on the social and cultural forces behind the split between the outside "social" speech and the inner speech, Piaget's theory emphasizes the child's individual, egocentric, knowledge-constructing acts. But it would be false to state that Piaget wasn't aware of the importance of social aspects for the development of cognitive functions, as is evident, for example, from the statement « The individual would not come to organize his operations in a coherent whole if he did not engage in thought exchanges and cooperation with others.» (Piaget, 1947) One can reconcile such a point of view with Vygotskij's: « An operation that initially represents an external activity [e.g., egocentric speech] is reconstructed and begins to occur internally [e.g., private and internal speech]. . . An interpersonal process [e.g., social language] is transformed into an intrapersonal one [i.e., inner speech]. . . The transformation of an interpersonal process into an intrapersonal one is the result of a long series of developmental events.» (Vygotsky, 1978) Such a long series of developmental events is, according to the theory of intramental evolution, equivalent to an evolutionary process wherein diverse schemata are replicated through processes of internalization and articulation, and are selected according to their ability to induce intended changes in the (social) environment. In other terms: pragmatic, environment-related concerns are to be present in the function evaluating the survival and reproduction fitness of such structures, as the sketch below suggests.
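A hedged sketch of what such a fitness function might look like - our own toy construction anticipating the simulations of 14.4.1, with the Caregiver class, the weights and the lexicon all invented for illustration - combines an internal measure (coding economy) with the environment's response:

class Caregiver:
    # Toy environment: rewards utterances it can "parse" (here: known words).
    def __init__(self, lexicon):
        self.lexicon = set(lexicon)
    def reward(self, utterance):
        words = utterance.split()
        return sum(w in self.lexicon for w in words) / max(len(words), 1)

def fitness(utterance, environment, w_internal=0.3, w_social=0.7):
    internal = 1.0 / (1.0 + len(utterance.split()))  # economy: prefer brevity
    social = environment.reward(utterance)           # intended effect attained?
    return w_internal * internal + w_social * social

mom = Caregiver(["more", "milk", "doll", "look"])
print(fitness("more milk", mom))     # high: brief and socially effective
print(fitness("gloop blarg", mom))   # low: no effect on the environment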
end empiricists and constructivists 9.4.3

9.4.4 socio-pragmatic and usage-based paradigms

Piaget's and Vygotskij's theories are not theories of LD. They are much more general: they are theories of the development of knowledge and thought; they are theories of learning. In light of such generality, concrete particularities are secondary: thus, neither Piaget nor Vygotskij offers the specific quantitative values which would have to be defined by any engineer aiming to reproduce LD-like processes in silico. They rather offer a general gradual, environment-oriented and ludic framework within which one can do so: that alone suffices. But what about the concrete aspects of the child's social learning (Bandura and McClelland, 1977) of language? One can hardly speak about science if specific processes and representations are not experimentally explored and verified; there is no science where specific relations and correlations between variables and phenomena are not evaluated. Only an enterprise which allows one to find concrete answers to concrete questions by analyzing concrete phenomena is truly scientific.

Jerome Bruner was among the first scientists who performed a detailed analysis of a child's LD and interpreted their findings in "social environment" terms. By performing a longitudinal study focused on two boys - Richard and Jonathan - by visiting their homes every fortnight from when they were 5 (resp. 3) months old until they were 24 (resp. 18) months old, and by taking half-hour audio-video recordings of their play with their mothers, Bruner initiated a paradigm both rigorous and "natural", and completely non-violent (because conducted with the mother and in the home environment). During the first months, Bruner focused on games through which the child learns how to manage interactions with the closest social environments. Through games with a cloth clown which the mother makes gradually appear, disappear and reappear, or through the game of peek-a-boo, the infant gradually learns the basic conditions of social and participatory activities. Bruner concludes these observations with the words: « If the "teacher" in such a system were to have a motto, it would surely be: "where before there was a spectator, let there now be a participant".» (Bruner and Watson, 1983), indicating that such games help to establish the social conventions upon which later language use shall be based. Later, Bruner explored how the referential meaning of first words is born through mother-originated object-highlighting and child-originated pointing. Or he analyzed the motherese articulated during picture-book reading, discovering that « The variety of mother's utterance types in book reading is strikingly limited. She makes repeated use of four key utterance types, with a surprisingly small number of variant tokens of each. These types were (1) the Attentional Vocative, e.g. Look; (2) the Query, e.g. What's that?; (3) the Label, e.g. It's an X; and (4) the Feedback Utterance, e.g., Yes.» (Bruner and Watson, 1983) A complete list of such constructions which occurred more than once during the session at 1;1.1 is presented in Table 5. It is also worth noting that such utterances were observed to occur almost always in the sequence:

1. Attentional Vocative
2. Query
3. Label
4. Feedback

Some members of the sequence can be left out - e.g. the attentional vocative is left out when the mother simply responds to what the child does - but Bruner noticed that the order of the utterances is practically never switched. Along with the extra-linguistic context (e.g.
book-reading), such a sequence can be understood as the child's first format.

Table 5: Utterances classified as tokens of the four major types of the motherese. Reproduced from table 4.2 in (Bruner and Watson, 1983, pp. 79-80).

type / tokens               frequency
I. Attentional Vocatives        65
   Look!                        61
   Look at that!                 4
II. Query                       85
   What's that?                 57
   What are those?               8
   What are they doing?          6
   What is it?                   5
III. Label                     216
   X (= a stressed label)       91
   It's an X                    34
   That's an X                  28
   There is an X                12
   An X                         12
   That's X                      6
   There is X                    6
   Lots of X                     5
   They are X-ing                5
   More X                        3
   They are X                    3
   These are the X               3
   The X                         2
IV. Feedback                    80
   Yes                          50
   Yes, I know                   8
   It's not an X                 5
   That's it!                    3
   Isn't it?                     2
   Not X                         2
   No, it's not X                2

Format (DEF)

« A format is a standardized, initially microcosmic interaction pattern between an adult and an infant that contains demarcated roles that eventually become reversible.» (Bruner and Watson, 1983)

end format 9.4

Contrary to pivot schemas or variation sets (9.3), which are "individual" in the sense that they are constructed, stored and articulated by a single individual, formats are interactive, mutual and shared. It is also important to emphasize the extra-linguistic and pragmatic facets of such a microcosmic scene: « format is a routinized and repeated interaction in which an adult and child do things to and with each other » (Bruner and Watson, 1983). The successful unfolding of a "format" from its beginning until its very end is possible only if both participants succeed in focusing their attention upon the same object of interest. Such "joint attention" is the cornerstone not only of Bruner's theory, but also of Tomasello's. As a primatologist by training, Tomasello stresses the fact that humans are the only apes which use symbols

• to acknowledge the sharing of attention with others
• to redirect the attention of others to external objects, states or processes
• to modify the mental states of others.

In other terms, humans are capable of "intention-reading": they are capable of "joint attention" surpassing the dyadic I-You relation by integrating an external object (or mental state) into a triadic I-You-It relation (Buber, 1937). For Tomasello, intention-reading « is the foundational social-cognitive skill underlying children's comprehension of the symbolic dimensions of linguistic communication » (Tomasello, 2009). It is supposed to be a domain-general skill allowing not only linguistic communication but many other practices as well (rituals, tool and house manufacture, co-ordinated warfare, healing, non-reproductive mating etc.), and it is strongly intertwined with other phenomena studied by the theory of mind (imitation, perspective-taking etc.). The principal reason why intention-reading is supposed to be foundational is its ability to attribute function to diverse linguistic expressions or their components (e.g. words): « identifying the functional roles of the components of utterances is possible only if the child has some (perhaps imperfect) understanding of the adult's overall communicative intention—because understanding the functional role of X means understanding how X contributes to some larger communicative structure.» (Tomasello, 2009) Thus, not only the observable "microcosmos" within which the articulation of the utterance AXB took place, but also the "intention of the speaker" has to be understood if the child is to understand the function of X's use in her language. Which, according to Wittgenstein, is equivalent to the meaning of X.
Aside from the foundational processes of intention-reading and joint attention, the ambition of Tomasello's theory is to understand the nature of further cognitive processes involved in the LD process:

« 1. schematization and analogy, which account for how children create abstract syntactic constructions out of the concrete pieces of language they have heard
2. entrenchment and competition, which account for how children constrain their abstractions to those that are conventional in their linguistic community
3. functionally based distributional analysis, which accounts for how children form paradigmatic categories of various kinds of linguistic constituents » (Tomasello, 2009)

Crucial to the explanation of these processes is the nature of their input and output representations. Contrary to Bruner, whom the generativist Zeitgeist forced to interpret his data through the terminological prism of "deep" and "surface" structures, Tomasello conceived his theory in a period when generativism was already somewhat out of fashion. Thus, instead of a mysterious UG, an isolated lexicon, a monolithic set of transformation rules and omnipresent arborescent structures, his usage-based theory of language acquisition is based on "item-based constructions", "expressions", "schemas" and "templates" containing "slots" and variable through diverse operators. A big theoretical advantage of these forms of representation is their ability to encode multiple levels of generality at once. As such, they have no problem accounting for the acquisition of fixed or semi-fixed idiomatic expressions like ça va?, gonna, dunno or kick the bucket, which seem rule-generated but are in fact learned by rote. Generative models based on rules - supposed to be generally applicable - and a lexicon - whose members are supposed to be as concrete and atomic as possible - have huge difficulties with such fixed or semi-fixed entities.18 Usage-based models, on the other hand, have no problem whatsoever in accounting for the existence of such hybrid structures.

18 Take as an example ça va?, the French equivalent of How do you do? The form is not purely rule-generated, because the completely normal demonstrative pronoun ça cannot be substituted by other demonstratives (il, elle) without the whole losing its meaning completely. On the other hand, it cannot be a member of the lexicon either, because it is decomposable and the second component va (i.e. "goes") can, in some argot contexts, be substituted by specific verbs (i.e. ça tourne; ça roule).
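How such a hybrid, partially open structure might be represented is easy to sketch. The fragment below is our own illustration - the Construction class and its fields are invented for this purpose and are not Tomasello's notation - of a template mixing frozen items with an open slot and the set of items attested in it:

from dataclasses import dataclass

@dataclass
class Construction:
    template: list   # fixed items (terminals) and slot names, intermixed
    slots: dict      # slot name -> items attested in that slot so far

    def instantiate(self, **fillers):
        # Replace slot names with fillers; keep fixed items as they are.
        return " ".join(fillers.get(piece, piece) for piece in self.template)

# Semi-fixed idiom: the pronoun is frozen, the verb slot varies in argot.
ca_va = Construction(template=["ça", "V"],
                     slots={"V": {"va", "tourne", "roule"}})
print(ca_va.instantiate(V="roule"))   # "ça roule"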
As Tomasello puts it: « The impossibility of making a clear distinction between the core and the periphery of linguistic structure is a genuine scientific discovery, and it has far-reaching theoretical implications...it suggests that language structure emerges from language use, and that a community of speakers may conventionalize from their language use all kinds of linguistic structures—from the more concrete to the more abstract, from the more regular to the more idiomatic, and with all kinds of mixed constructions as well.» (Tomasello, 2009) In usage-based linguistics, the mind is free to mix terminals with non-terminals as is wanted, necessary and appropriate for the realization of a communicative intention framed by a specific situation.

This being said, we consider it futile to try to summarize all the details of Tomasello's broad and detailed synthesis. Instead, the reader is hereby invited to read the monograph itself: its reading should significantly facilitate the interpretation of what shall follow in Part ii and Part iii, and it could be used as a sort of prolegomena to these. For the purpose of the present exposé, let's just conclude with the following citation: « As children attempt to read the intentions of other persons as expressed in utterances, they extract words and functionally coherent phrases from these utterances, but they also create item-based constructions with open slots on the level of whole utterances. Few theorists of language acquisition deal with these humble creations, and those who have dealt with them (e.g., Braine, 1976) have not provided an account by means of which they evolve into more abstract and adult-like constructions.» (Tomasello, 2009) As a follow-up to this citation, the exploration - both phenomenological and computational - of the extent to which the evolution of populations of such "humble creations" could be characterized as a process involving intramental replication, variation and selection is now defined as the principal objective of this dissertation.

end socio-pragmatic and usage-based 9.4.4

Having thus glanced at the history of just a few among the legions of savants who spent part of their lives seeking, in one way or another, to propose an answer to the question "How is it possible that humans understand each other yet often do not agree with each other?", we conclude this brief overview with a simple truism which could, with a little bit of goodwill, reconcile all the above-mentioned positions: "Because we can do so and want to do so." And we note that the debate about what "can" and "want" mean in the context of "understanding" and "agreement" surpasses by far the scope of our current proposal.

end language acquisition paradigms 9.4

The intention behind the less than 50 pages of this chapter was to acquaint the reader with certain facts, concepts and theories related to language development. LD was principally defined as a "constructivist process" (9.1), and it was indicated that the ontogeny of language competence can be understood as a process of gradual optimization of one's linguistic structures and processes. The difference between "comprehension" and "production" of language was emphasized, and a "dogma" was postulated stating that, in the developing mind, language comprehension is to precede language production. Certain facts specific to all facets of linguistic competence were brought to the reader's attention. These were presented in order to indicate that certain domain-general processes - related to contrast detection, category construction aided by input-driven distributional analysis, schematization, pattern-matching etc. - operate on all levels, from the prosodic to the pragmatic and beyond. A gradual increase in the diversity and complexity of representations was observed in many different cases: in the learning of phonotemplates, in vocabulary development, in the construction of item-based constructions from pivot schemas etc. Input and social interactions were also often mentioned. Brown's "word game" and Bruner's "formats" were discussed in the case of vocabulary learning, and "variation sets" were said to facilitate the acquisition of morphosyntax. The special properties of "motherese" were praised for their ability to significantly reduce the computational complexity of the deciphering process.
This was the reason for the adoption of a somewhat critical stance towards formal "learnability" and "nativist" theories, which see a perfect learner in an imperfect environment there where we tend to see an imperfect learner in a perfect environment. For this reason we prefer to end this chapter with a citation from a book which, aside from (Clark, 2003; Tomasello, 2009), was our first guide in the ever-complex labyrinth of LD-related data and theories: « If grammar were innately specified in the infant brain and simply triggered by hearing the correct forms, why would it take so long to manifest itself? In such a case, one might expect grammar to be an inherent part of the child's output from the start. After all, by the time infants reach their first birthdays they will have had considerable exposure to linguistic input and, as we saw, they already have a significant receptive vocabulary. The average three-month-old has already had approximately 900 waking hours or 54,000 minutes of auditory input. And these calculations do not even take into account the last three months of intrauterine life...» (Karmiloff and Karmiloff-Smith, 2009) Hundreds of thousands of minutes when the child is still a toddler, millions of sequences of linguistic tokens pre-processed and pre-formatted by those-who-love-the-one-to-whom-they-speak: plenty of high-quality data to process by ever-evolving populations of cognitive schemata. Plenty of information to induce useful patterns from.

end developmental psycholinguistics 9.4

10 COMPUTATIONAL LINGUISTICS

Computational linguistics (CL) is a discipline positioned at the intersection between linguistics and informatics. The extent of this intersection is huge, because both informatics and linguistics have one important property in common: on the most formal level, they both deal with sequences of symbols. In this chapter, such an abstract and theoretical perspective shall be more closely discussed in the section dedicated to Formal Language Theory (10.2) and its modular counterpart, the theory of Grammar Systems (10.2.3). Subsequently, more "real-life" problems related to Natural Language Processing (10.3) shall be mentioned, with special focus being put on the problems of:

• geometrization of semantics, attained by projection of natural language corpora into an N-dimensional vector space
• part-of-speech tagging and part-of-speech induction, which make it possible to automatically attribute grammatical category membership to different tokens occurring in the corpus
• grammar induction, which makes it possible to infer the grammar GL of language L from the corpus CL

But before doing so, let's briefly discuss that sub-discipline of CL which is older than CL itself.

10.1 quantitative and corpus linguistics

Centuries before the first computers were invented, preceptors had already been counting words in different corpora. Panini and his disciples contemplated the Vedic corpus (9.4.2) in order to invent the most cognitively efficient means of transmission of the Corpus through human cerebral wetware without ever writing it down, Dominicans were creating concordance tables of biblical verses, Arabs analyzed the Quran and kabbalists the Torah: and it cannot be excluded that practically all members of these otherwise divergent currents found a particular pleasure in doing so.
The advent of computers changed such an opaque hermeneutic passe-temps of a few most devoted philologists into a full-fledged and highly empiric science. The symbol-reading and symbol-manipulating faculties of Turing machines, embedded in first thousands, then millions, then billions of transistor flip-flops, have made it possible to process all the words of one's library in a few seconds. Frequencies of occurrence of a word W - i.e. the answer to the question "How many times does the word W occur in corpus C?" - were evaluated for bigger and bigger corpora; probability distributions of relative frequencies - i.e. fW normalized by the number of all words in C - were assessed. And new evidence was given that natural language corpora contain regularities so salient that one can, or even must, explain them in terms of mathematical "laws".

10.1.1 zipf's law

The basic form of Zipf's law can be expressed by the equation:

fW ∗ rW ≈ C

where fW is the frequency of occurrence of word W in the corpus, rW is the word's rank in the table where all words of the corpus are sorted according to their frequency in descending order (i.e. the most frequent word has rank 1, the second rank 2, etc.) and C is a constant. In other terms, Zipf's law states that the frequency of a word is inversely proportional to its rank in the frequency table, which is equivalent to the statement that « the frequency of a word in a text and its rank is approximately linear when plotted on a double logarithmic scale » (Ferrer-i Cancho and Elvevåg, 2010). In terms of probability distributions, this law states that the frequencies of occurrence of words in the corpus are independent and identically distributed random variables with distribution

p(f) = α ∗ f^(−1−1/s)

id est, the "power law"/Pareto distribution with exponent s. G.K. Zipf was profoundly convinced that this regularity is an expression of a domain-general cognitive eco(nom|log)y principle of least effort. More concretely, he conjectured that the observed regularity is a consequence of the tendency of the linguistic system to attain the state of vocabulary balance, i.e. the state of equilibrium between two opposing forces, the force of unification and the force of diversification, which he characterized as follows: « on the one hand, the Force of Unification will act in the direction of decreasing the number of different words to 1, while increasing the frequency of that 1 word to 100%. Conversely, the Force of Diversification will act in the opposite direction of increasing the number of different words, while decreasing their average frequency of occurrence towards 1. Therefore number and frequency will be the parameters of vocabulary balance.» (Zipf, 1949)

The next generations of linguists - Chomsky included - and mathematicians - e.g. Benoit Mandelbrot - were less enthusiastic when it came to the importance which Zipf attributed to his "law". The point of conflict was not whether the frequencies of words in natural language texts follow the power law distribution: each new analysis demonstrated that this is, verily, the case. The argument arose when some authors started to consider Zipfian distributions a tautological necessity, a phenomenon emerging anytime, even in randomly generated artificial corpora.
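Whether on a natural or on a randomly generated corpus, the law's claim is easy to check computationally. The following minimal sketch (the corpus filename is a placeholder) computes the rank-frequency table and the product fW ∗ rW, which Zipf's law predicts to stay roughly constant:

from collections import Counter

# Hypothetical corpus file; any plain-text corpus would do.
corpus = open("corpus.txt").read().lower().split()
freqs = Counter(corpus)

for rank, (word, f) in enumerate(freqs.most_common(10), start=1):
    print(rank, word, f, rank * f)   # f*r should stay in the same ballpark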
It is very much related to the particular representation one chooses, i.e., rank as the independent variable» (Li, 1992) and reiterated Mandelbrot's remark that ZL is "linguistically very shallow". On the other hand, a more recent article convincingly demonstrates that a «good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages» (Ferrer-i Cancho and Elvevåg, 2010).

Lateral support for such a claim comes also from a study which focused on the "evolution" of ZL - notably of the exponent s - in language ontogeny. Given the statistically significant observations that «in children the exponent of the law tends to decrease over time while this tendency is weaker in adults... Our analysis also shows a tendency of the mean length of utterances (MLU), a simple estimate of syntactic complexity, to increase as the exponent decreases. The parallel evolution of the exponent and a simple indicator of syntactic complexity (MLU) supports the hypothesis that the exponent of Zipf's law and linguistic complexity are inter-related» (Baixeries et al., 2013), we add that it would be somewhat difficult to observe such ontogeny-related modifications of ZL's exponent if ever ZL was just a pure artefact owing its existence to one's choice of mathematical formalism.

At last but not least, we consider it worth reiterating that similar Zipf-Mandelbrot distributions were observed in sciences other than linguistics. In ecology, for example, the distribution of the number of species observed as a function of their abundance is understood as a zipfian phenomenon (Mouillot and Lepretre, 2000). Given that ecology is principially a science about equilibrium-seeking systems consisting of populations of entities which interact and replicate, we consider the fact that similar scaling phenomena operate both

• in the realm of words - and, Zipf would add, also in the realm of "meanings" because «words are tools that are used to convey meanings in order to achieve objectives... the reader may infer from the orderliness of the distribution of words that there may well be a corresponding orderliness in the distribution of meanings because, in general, speakers utter words in order to convey meanings» (Zipf, 1949)
• in ecology

to support the Thesis that certain neurolinguistic structures intramentally interact and replicate.

end zipf's law 10.1.1

10.1.2 logistic law

Another among the multiple "quantitative laws" which seems to be of particular interest for anyone aiming to understand and create evolutionary models of language ontogeny is the "logistic law", often known as Piotrowski's law. This law postulates that language development follows the logistic curve, formalizable into mathematical notation as

p(t) = 1 / (c + a · e^(−b·t))

whereby t denotes time, p(t) denotes the quantified value of an observable property of the linguistic system at time t, e is Euler's constant and a, b, c are parameters of the model. We consider it important to mention that what the research of Best (2006) and other participants in the "Göttingen project" indicates is that the law applies not only to ethnogenic, cultural and historic (i.e. Sprachwandel, c.f. 8.5) but also to ontogenic development of linguistic systems (i.e. Spracherwerb).

Figure 17: Logistic law in relation to historic and ontogenetic linguistic processes. Data taken from Best (2006).
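Read as a recipe, the formula above is trivial to evaluate. The following toy sketch - its parameter values are invented placeholders, not the estimates reported by Best (2006) - prints the characteristic first-slow-then-fast-then-slow-again development of the logistic curve:

use strict;
use warnings;

# p(t) = 1 / (c + a * exp(-b * t)), with invented illustrative parameters
my ($A, $B, $C) = (100, 1.0, 1.0);    # the a, b, c of the formula above

sub p {
    my $t = shift;
    return 1 / ($C + $A * exp(-$B * $t));
}

for my $t (0 .. 12) {
    printf "t=%2d  p(t)=%.4f\n", $t, p($t);
}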
Figure 17 illustrates examples of these two cases:

• the points on image (a) represent the increase of the amount of words of Arab origin in German between the 14th and the 20th century, while the line represents the ideal logistic curve with parameters (a=7.41, b=0.696, c=160)
• the right image (b) represents the gradual increase of Mean Length of Utterance (9.2.4)

While members of certain schools may argue adamantly that many phenomena in both ethnogeny and ontogeny can be explained or even modelised in terms of logistic curves, other data1 shall rightfully oblige others to express a certain scepsis towards the capacity of the logistic curve to cover practically ALL quantitative aspects of the language acquisition process, and to do so with sufficient statistic significance. On the other hand, some phenomena like the "vocabulary spurt" (c.f. 13) sometimes seem to follow the logistic first-slow-then-fast-then-slow-again development so faithfully that it would be unwise to a priori ignore such a salient, formal and high-order analogy between the ethno- and onto-geny of intra- and inter-personal linguistic eco-systems.

1 C.f., for example, Figure 15 to see some data which, while aiming to represent practically the same phenomena as Figure 17, seem unsubsumable under the logistic curve.

Be it as it may, instead of trying to adequately address the Haeckel-like conjecture that some processes of linguistic ontogeny are formally isomorphic to certain processes in linguistic ethnogeny, it seems more appropriate to focus the attention of the reader upon the fact that closing the previous paragraph with the plural form of the term eco-system was intentional. This is so because it was indeed ecology where logistic curve models were deployed for the first time: introduced in 1838 by Pierre-François Verhulst as a model whereby the reproduction of a population is proportional to both the existing population and the amount of available resources, and canonized later by (Lotka, 1925) as the law of population growth, the logistic curve is closely related to the predator-prey (or Lotka-Volterra) differential equations which are, even more than a hundred years after their conception, still considered a model of reference for the population dynamics of biological and ecological systems within which two or more species interact.

end piotrowski's law 10.1.2

In this brief overview of Corpus and Quantitative linguistics we have mentioned two hallmark "laws" postulated (or discovered?) by proponents of this discipline: Zipf's law and Piotrowski's logistic law. We have indicated that both of these laws have certain ontogeny-pertinent aspects which make them worthy of interest not only to researchers interested in historical linguistics, but also to those known as "psycholinguists". What's more, it was also indicated that there exists a certain analogy, a certain partage of features, between developmental psycholinguistics and ecology:

• not only the frequencies of words in corpora, but also species abundances in ecology are Zipf-distributed
• logistic curves are used to model not only the rate of (pro|intro)duction of new words into the corpus, but also the population dynamics of diverse mutually-interacting species within a specific ecosystem

Given that there exists a certain formal similarity between models of the dynamics occurring within ecological or linguistic systems, the transposition of certain principles from ecology into psycholinguistics may seem to be appropriate.
end quantitative and corpus linguistics 10.1

10.2 formal language theory

Formal Language Theory (FLT) is a computational theory of formal languages and formal grammars. Being rooted in apodictic definitions of computer science, mathematics and logic, its aim is to offer a solid, coherent and scientifically valid framework useful for

1. the design of new artificial (e.g. programming) languages
2. the elucidation of the structure and function of natural languages

No-one denies that when it comes to the first objective mentioned above, the practical utility of FLT-originated concepts and principles is demonstrated anytime a computer translates source code into machine code. It is true that without any solid theory thematising the rules of production and parsing of symbolic sequences, it would be highly problematic to proceed all the way from the romantic intuitions of lady Ada Lovelace, the notebooks of Gottlob Frege and Zuse's Plankalkül through assembler, C, C++ all the way to the parsers, linkers, and compilers of modern high-level programming languages like Python, PERL or R. But the capper evidence that FLT can also yield a framework useful for the attainment of the second goal is yet to be furnished. In spite of the effort initiated by Chomsky's focus on generativism (), that is, in spite of more than half a century of intellectual work of thousands of the most brilliant minds of their generation, no pure FLT-based model2 has been proposed which could account for the diversity of forms of even such a morphologically poor language as English. Sadly for science, sectarian disputes within the FLT community are of an envergure which makes it impossible to answer even the most trivial problems, like that of the positioning of natural languages within the Chomsky-Schützenberger hierarchy. This being said, let's just introduce the conceptual pillars upon which FLT stands.

2 By pure FLT model, we mean a model that does not contain any statistic components.

10.2.1 basic tenets (def)

FLT is based on the notions of symbols, sequences and sets. Thus,

1. alphabet A is defined as a finite set of symbols including the empty symbol ε
2. string S is defined as an ordered sequence of concatenated symbols contained in A
3. language L is defined as a set of strings over A
4. * (Kleene star) is a free monoid unary operator generating all possible strings over a certain alphabet:

A* = ∪_{i∈ℕ} A^i = {ε} ∪ A ∪ A² ∪ A³ ∪ A⁴ ∪ ...

A* therefore denotes the infinite set of all possible strings over A, and language L is either a subset of, or equivalent to, A*, i.e. L ⊆ A*.

Given this, the grammar G_L of language L is a means of characterizing which among the members of A* are to be contained in L, and which are not. In traditional FLT, Grammar is defined as follows:

Grammar and Rule (DEF)
A grammar G is a tuple {V_T, V_N, X, R}, where V_T is the set of terminal elements, V_N is the set of non-terminals, X - an "axiom symbol" - is a member of V_N (X ∈ V_N), and R is a finite set of N rules R = {r_1, r_2, ..., r_N}. A (rewriting|(produc|substitu)tion) rule r has the form foo → bar and fundamentally denotes a 2-ary substitution operation wherein the first operand foo is substituted by the second, bar, or vice versa.
end grammar and rule (def) 10.2.1

The expression vice versa is quite important here, for it denotes that a grammar can be useful in both
1. (produc|genera)tion of a terminal string-expression E of language L, which is started by the input of the "entry axiom" X, takes place when rules are applied in right-wise order (i.e. foos are substituted by bars) and is to be terminated only when the string does not contain any non-terminal symbols.

2. parsing / comprehension of a string (sentence), which is started by the input of E and terminates when some substitution transforms E or its derivates into X. In other terms, this scenario occurs when rules are applied in the reverse, left-wise order (i.e. bars are substituted by foos within the string) and terminates when the working string does not contain any terminal symbols.

Practically all currently widely used notations take this symmetry between the substituens (that-which-substitutes) bar and the substituendum (that-which-is-substituted) foo as granted. Table 6 illustrates "plain", "compressed" and "uncompressed" grammars written down in three common notations3.

3 In all notations we follow the common convention of denoting non-terminal symbols with uppercase characters (e.g. "B", "M", "S", "X") and terminal symbols with lowercase characters (e.g. "a", "b", "m" ...).

Note that the "uncompressed" and "compressed" grammars of G_L are equivalent only when it comes to the language they cover, but not in the way the G_L is represented. They are functionally but not structurally isomorphic. Thus, where "uncompressed" grammars represent a disjunction in terms of multiple trivial rules (for any disjunction, one has as many rules as there are disjunct elements), compressed grammars represent a disjunction by one rule only. The price, however, is the need to introduce the disjunctive symbol | and to use it every time a disjunction needs to be marked.

                S-notation     Backus-Naur notation    PERL-notation
Plain           X → mama       X := <mama>             s/X/mama/
                X → baba       X := <baba>             s/X/baba/
Compressed      X → SS         X := SS                 s/X/SS/
                S → ba|ma      S := <ba> | <ma>        s/S/ba|ma/
Uncompressed    X → MM         X := MM                 s/X/MM/
                X → BB         X := BB                 s/X/BB/
                M → ma         M := <ma>               s/M/ma/
                B → ba         B := <ba>               s/B/ba/

Table 6: Diverse notations of three grammars covering the language L = {"mama", "baba"}.

For a logic-oriented reader, it may be useful to conclude that FLT considers languages and their respective grammars to be equivalent to a sort of formal system. Thus, the set of all strings A* being understood as the set of all (i.e. both true and false) propositions, the strings belonging to language L can be understood as true theorems and the act of deriving them by G is equivalent to theorem proving.

end basic tenets 10.2.1

10.2.2 chomsky-schützenberger hierarchy (txt)

A language L can be classed according to the type and form of the rules which its respective grammar G_L contains. Undoubtedly the most common typology is the Chomsky-Schützenberger hierarchy of languages which sorts all possible languages into one among four classes. These are defined as follows:

1. unrestricted grammars contain rules which can contain any combination of terminals and non-terminals in both substituens and substituendum
2. context-sensitive grammars have rules of the form αAβ → αγβ with A a nonterminal and α, β and γ strings of terminals and|or nonterminals. Strings α and β may be empty; γ, however, must be nonempty.
3. context-free grammars have rules of the form A → γ with A being a nonterminal and γ being a string of terminals and/or nonterminals.
4. regular grammars have rules with one single non-terminal on the left side and one terminal with at most one juxtaposed non-terminal on the right side

These classes of languages are mutually embedded. Thus, the class of regular languages is a specific subset of the context-free languages, and it follows that while any regular language is a context-free one, not every context-free language is a regular one. Idem for embeddings of higher order: context-free languages are specific cases among the context-sensitive ones, and context-sensitive languages are just a certain specific subset of the vast "unrestricted" ocean of Type-0 languages.

The main categorization being so canonized, a great deal of FLT is occupied with the study of the algebraic and computational properties of these classes. It is thus known that languages produced by regular grammars can be recognized by finite state automata (FSAs), context-free languages can be recognized by non-deterministic push-down automata, context-sensitive languages are recognizable by means of linear-bounded non-deterministic Turing machines, while an arbitrary Type-0 language is not to be recognized by anything less complex than a Turing machine. It is indeed in such overlap regions between computer science and algebra where the conceptualization of things in terms of the C-S hierarchy finds its utmost utility. Utmost and practical: for as it was stated - but merits to be re-stated - purely theoretical explorations of the mutual relations between diverse types of grammars and diverse types of symbol-manipulating automata can and indeed do have serious material consequences: faster encoding and faster decoding mean faster machines.

But in relation to the diversity of expressions of natural languages, FLT taxonomies can be quite misleading. For example, as is nicely illustrated in the overview of (Jiménez López et al., 2000, pp. 87-97), even after decades of debates, "linguists" cannot even find an agreement on whether English alone fits into the class of context-free languages or whether it is more appropriate to consider it a priori a context-sensitive language. For experts coming from FLT-ignorant domains of linguistics - be it linguistic typology, comparative grammar etc. - such debates express nothing else than a sad waste of intellectual resources. Confronted on a daily basis with the astounding diversity of linguistic structures grounded in the substrate of their usage, adherents of such schools would at most dare to utter: "In the world of natural languages, nothing is certain and nothing is fixed. Aside from the fact that natural languages belong to the class of Type-0 languages. Maybe."

end c-s hierarchy 10.2.2

10.2.3 grammar system theory (txt)

A spin-off branch of Formal Language Theory which is of particular interest in regard to the overall objectives of this Thesis is devoted to the study of Grammar Systems (GS). A grammar system is a «set of grammars working together, according to a specified protocol, to generate a language» (Jiménez López et al., 2000). Thus, contrary to the definitions of canonic FLT in which one grammar generates one language, in GS several grammars work together in order to generate one language. Grammar Systems can therefore be considered a sort of multi-agent variant of traditional «monolithic» FLT. Such a multi-agent nature of GSs implies cooperation, communication, distribution, modularity, parallelism, or even the emergence of complexity.
Let’s take as an example the most simple among the GS, so-called "language colonies", defined in (Kelemenová and Csuhaj-Varjú, 1994) as follows: Language Colony (DEF) A language colony colony C is an (n+2)-tuple C = (T , R1 , ..., Rn , S), where 1. Ri = (Vi , Ti , Pi , Si ) , for every i, 1 6 i 6 n, is a regular grammar generating a finite language; Ri is called a component of C; 2. S = Si for some i, 1 6 i 6 n; S is called the startsymbol of C; S 3. T ⊆ i = 1n Ti is called the set of terminals of C S And the total alphabet of C is denoted by V, i.e. V = i = 1n (Ti ∪ Ni ) end language colony 10.2.3 Figure 18 illustrates a very simple bi-component (n=2) "language colony" variant of a GS. What is striking in case of even such a simplistic colony is that the very fact of sharing and exchange of strings between two otherwise finite regular grammars results in generation of an infinite language. Polylithic model 10.2 formal language theory Figure 18: Emergence of "miraculous" infinite generative capacity by means of interlock of two finite grammars. Figure reproduced from Kelemen (2004). Blackboard model Other Grammar Systems GST still mostly theoretical Let it be reiterated: by allowing two or more finite components to communicate through a common symbolic environment, one can generate a set of strings - a language - with potentially infinite cardinality ! Kelemen (2004) denotes such behaviour - which is very common in the world of GS - with the term «miracle». The cornerstone idea of not only language colonies but also of any other GS is that diverse "component" grammars share a common "environment". This environment is nothing else than a shared string whereupon and wherein diverse components grammars apply their rules of production. In analogy to class (population) of individual students which together solve the problem on the blackboard they see, the term "blackboard model" is often used to denote the idea. For psychologues this model can be somewhat reminiscent of "working memory" accessible and accessed by diverse independent and encapsulated cognitive modules. Computer scientists, on the other hand, may see some similarity with multiple computational threads accessing the same address space in the shared memory. Aside "language colonies" and GST introduces and precisely defines many other theoretical and formal constructs like "Cooperating Distributed Grammar Systems", "Parallel Communicating Grammar Systems" and "eco grammar systems". Notably due to life-long work of Erzsébet Csuhaj-Varju and substantial contributions by George Paun and Jozef and Alica Kelemens are these constructs developed in such a detail that it is practically impossible for us to introduce here, in extent and rigour they merit, the exact formalisms of GS theory in closer detail. Instead, we forward a potentially interested reader to the doctoral dissertation of Jiménez López et al. (2000) which contains many persuasive arguments for application of GS upon the study of natural human languages. On the other hand, the forereferred dissertation is limited by the fact that it mostly proposes to use the Grammar System Theory as a framework explaining the final, i.e. "adult" linguistic component, and not as a framework which could elucidate the very process of 123 124 computational linguistics language development and language acquistion. 
Aside from "language colonies", GST introduces and precisely defines many other theoretical and formal constructs like "Cooperating Distributed Grammar Systems", "Parallel Communicating Grammar Systems" and "eco grammar systems". Notably due to the life-long work of Erzsébet Csuhaj-Varjú and substantial contributions by George Paun and Jozef and Alica Kelemen, these constructs are developed in such detail that it is practically impossible for us to introduce here, in the extent and rigour they merit, the exact formalisms of GS theory. Instead, we forward a potentially interested reader to the doctoral dissertation of Jiménez López et al. (2000) which contains many persuasive arguments for the application of GS to the study of natural human languages. On the other hand, the forereferred dissertation is limited by the fact that it mostly proposes to use Grammar System Theory as a framework explaining the final, i.e. "adult", linguistic component, and not as a framework which could elucidate the very process of language development and language acquisition. In fact, we are not aware of any study which would use GST as a theoretical explanatory framework for the process of LD, nor of any tentative aiming to implement GST in concrete programs offering solutions to concrete practical "natural language processing" (NLP) problems.

end grammar systems 10.2.3

FLT unites set theory, algebra and the theory of formal systems into a highly abstract and subtle conceptual framework aiming to help us (and machines) to conceive more optimal sequences of operations within the realms of sequences (strings) of symbols. It introduces many useful notions like that of

1. terminal symbols, i.e. those symbols which materially occur in the articulated utterance (i.e. are parts of the "signifiant")
2. non-terminal symbols, i.e. those symbols which denote generic properties inherent in and specific to the utterance
3. substitution rules and grammars (10.2.1)

which are, in one form or another, to be found in all linguistic theories at least since Panini (9.4.2). One simply cannot have a linguistic theory - no matter whether general, descriptive, generative, psycholinguistic or developmental - without postulating both material observables (terminals) and non-material non-observables (non-terminals), plus something like a list of principles which relate the two.

Unfortunately, FLT was canonized in an era when computer scientists and computational linguists had to think about the allocation of every byte of memory.4 In such a context, CPU-register-manipulating recursive while-loops were considered magical means of generating large amounts of output from minimal input. Thus, a sort of obsession with the notion of recursivity was born, which led generativists to

1. tentatives to explain a huge part of (or all of) human linguistic creativity in terms of recursivity
2. ignorance of the role which memory plays not only in concrete situations of linguistic performance, but also for the overall stability of the system underlying one's linguistic competence

What's more, FLT is strictly about syntax. It is, ex vi termini, a self-encapsulated formal system, and any tentative to make reference to the world of semantics beyond syntax is predestined5 to put FLT into a state of irreversible havoc. For the world of meanings is the world of passionate contextual transpositions, useful metaphores, implicit ambiguities and fuzzy approximations; FLT, on the other hand, brings about the realm of evermore-abstract arborescent hierarchies of pure reason. Fitting one into the other, subsuming syntax to semantics or semantics to syntax, thus seems to be at least as absurd a problem as the good old egg-chicken dilemma.

4 However, the contemporary generation of computer scientists is not subjected to such constraints. Memory is cheap in a world where 640kb ought NOT to be enough for anybody.

end formal language theory 10.2

10.3 natural language processing

Natural Language Processing (NLP) is a field of artificial intelligence and linguistics which explores the machine's faculty to understand, produce and interact in natural languages. In contrast to both quantitative and corpus linguistics, which mainly concentrate on the discovery of general quantitative principles and sometimes on data-mining, as well as in contrast to FLT, whose ultimate challenge is purely theoretical, NLP is concerned with concrete, practical and real-life problems of verbal interaction between humans and machines.
As was already noted in Chapter 4, the so-called Turing's Test (TT) is - at least in the canonic6 form in which Alan M. Turing had proposed it - in its very essence nothing else than an NLP challenge. This is so because in the canonic TT, the interaction between the human tester and the artificial testee is mediated solely through the written verbal modality.

The task of creating an artificial system which would truly pass the TT is not as easy as Turing and early computer scientists had believed. Natural languages are multi-layered structures whose components mutually interact both with each other as well as with their external environments, the very personal identity of their host not excepted. Natural languages serve many goals - giving commands, transfer of information (or deceit), telling stories - and often exploit highly irregular means with which these goals are attained. Machines, on the other hand, are regular and ordered. If not programmed otherwise, they blindly follow the path towards the stationary state; if not programmed otherwise, they are unable to deal with any irregularity whatsoever. Thus, in order to bring the ordered world of machines together with the unpredictable world of living language, NLP engineers usually proceed step after step: one minute linguistic problem is understood, formalized and subsequently tackled in one's source code. Then another.

5 Take, as an example, the introduction of Θ roles into Chomsky's Government & Binding Theory.
6 C.f. Hromada (2012a) for a description of a taxonomy of TT-consistent scenarios allowing the evaluation of not only linguistic, but also emotional, spatial, visual, corporal, moral etc. intelligences of an artificial agent.

Indeed many are such problems:

• author attribution
• plagiarism detection
• named entity disambiguation
• word and/or morphological segmentation
• sentiment analysis
• relationship extraction
• rhetoric figure detection (Hromada, 2011)
• automatic summarization
• discourse analysis
• anaphora resolution
• parsing
• automatic translation
• natural language understanding
• natural language generation
• question answering

All these are just a few among the dozens of tasks which NLP experts aim to tackle. In practice, these are almost always solved by means of the adoption of NLP's ultimate methodology: machine learning.

10.3.1 machine learning

Machines can learn. That is, machines are able to discover the underlying general patterns and principles governing the concrete input data and can subsequently exploit such general knowledge in contact with data which they have never seen before. They «can use experience to improve performance or make accurate predictions» (Mohri et al., 2012). And in an ever-bigger number of domains, they do so still better and better than their human teachers.

Since the moment when machine learning (ML) was first defined, in relation to the game of checkers, as the «field of study which gives computers the ability to learn without being explicitly programmed» (Samuel, 1959), the discipline of ML has evolved to an extent which is hardly compressible into a single book (Mohri et al., 2012) and certainly incompressible into the limited scope of this subsection. This is so not only because the number of domains of ML's application grows from year to year, but firstly because the quantity of distinct ML methods is already counted in dozens, if not in hundreds.
The general framework - sometimes also called "learning theory" (LT) - however, stays the same. No matter whether in psychology or in computer science, LT principially studies how an information-processing system (e.g. a brain or a computer) processes, represents and stores data sensed from the external environment, how it internally transforms them and how the outputs of such transformations influence the subsequent activity of such a system (including the sensing and processing of future data). There is thus a system that learns (the learner system, LS), the learnt information (LI) and the process of learning (PL). Interactions among these three components, the question whether one should postulate fewer (e.g. in the case when LS ≠ PL) or more such components (e.g. in the case when the sensed data differ from the learnt information), and many other topics - some stemming from the neurosciences, others from pure mathematics - are all to be explored by a full-fledged LT.

A distinction which is most pertinent for the purposes of this Thesis - and, one may argue, for ML in general as well - is the distinction between supervised and unsupervised learning. Supervised learning, also called learning-with-Teacher7, is based upon the idea that a full cycle of a learning process consists of two stages:

1. training|learning stage - the LS is first exposed to a set of problems and their respective solutions, and then aims to create a model associating the two
2. testing|evaluation stage - the LS exploits the previously constructed model in order to furnish solutions to problems to which she wasn't exposed during the training stage. Its performance is then evaluated according to certain evaluation metrics.

7 Or learning-with-Oracle, if the Teacher system is able to correctly solve the problem (e.g. furnish the answer) immediately after it has received an input sufficiently describing the problem (e.g. a meaningful question).

In unsupervised learning, on the contrary, it is expected that the one-who-launches-the-program shall not furnish any explicit solution|answer-related information to the LS. The training phase is thus practically equivalent to the testing phase: both contain questions; neither contains answers. The LS is simply furnished a huge dataset - in unsupervised NLP practice, the dataset is almost always equivalent to a textual Corpus - and is asked to do something reasonable with it. Cluster the corpus contents into classes, for example.

While the distinction between supervised and unsupervised seems to be crystal-clear for anyone practicing the NLP fach, the "cognitive plausibility" of fully unsupervised learning is more than disputable. Primo, the distinction turns out to be problematic for any model of phenomena in which the very order of exposure - i.e. the fact that the corpus to which the LS was exposed contains first the token baba and only later the token mama - can significantly influence the learning process. Thus, for models for which the statement «the engineer's decision to confront the algorithm with corpus X and not Y, and to do so in the moment T1 and not T2, is already an act of supervision» (Hromada, 2014b) holds, the method cannot be considered strictly un-supervised even in the absence of any explicit answers. Secundo, in the case of the modeling of LD processes, one cannot say that the toddler undergoes "unsupervised" learning just because the input to which she is exposed does not contain any explicit corrections, cues or answers.
The very corpus is the answer and - from the toddler's point-of-view - the very authority of the adult who furnishes the corpus mints the corpus with the justification of its truthfulness and validity. The very notion of a "valid solution" or a "correct input-output mapping" loses a non-negligible part of its importance when one realises that the LSs which we aim to discuss here can be conditioned to perceive agrammatical and false utterances as grammatical and true. No matter whether it is the case of a child in the middle of the ego-centric stage or of a victim of a propaganda machinery, it is often NOT the adequacy with external reality, nor the consistency with as big a set of propositions as possible, which counts. Instead, it is the repetition, the frequency of co-occurrence, the self-referential and self-reinforcing set of references to the minimal "seeding" set of symbols, which counts and which directs the learning process.

Tertio, it is evident that both the accuracy as well as the speed of learning to solve a particular class of problems is, at least in the case of human learners, significantly catalyzed by the presence of a teacher skilled enough to adapt the input-to-be-taught to the momentaneous state of the LS. Vygotsky's "zone of proximal development" (Vygotsky, 1978) is too salient and too omnipresent a fact to be ignored: humans learn more efficiently with a skilled teacher. And this, as constructivists would also argue, is a domain-general fact which is to govern not only singing, drawing, cooking or bicycle riding but... all facets of natural language learning as well.

Principially for these "cognitive plausibility"-related reasons shall we attribute, in volume 2, a certain conceptual priority to supervised ML in evolutionary models of the ontogeny of toddlerese. But before doing so, let's focus on that which both the supervised and the unsupervised branches of ML have in common: evaluation.

Evaluation

It is in the degree of sharing and conventionality of formal, quantitative and objective means of evaluation that science can be distinguished from art and, to a lesser extent, engineering from science. And NLP, as principially more a skill than a science, is not an exception in this regard. To paraphrase the same thing somewhat differently: it is not from the existence of diverse means of evaluation, but from the partage commune of the need to evaluate and of the knowledge of the usefulness of such evaluations, that the very productive unity of NLP stems. Productive unity in the field there is, and diversity - luckily for the field's survival - is there as well. Thus, what holds for the already-mentioned diversity of NLP's learning methods holds also for the diversity of evaluation metrics. This is so because there exists no wide agreement about the meta-criterion which could help to decide what criteria exactly a good evaluation metrics should fulfill. Hence, aside from the fact that a good evaluation metrics should make the result of an arbitrary experiment as comprehensible as possible even to an un-initiated greenhorn, and aside from the observation that there, verily, exist evaluation metrics which describe certain classes of phenomena better than others, it should not be a priori accepted that there exists "the" evaluation metrics which is the best of all.
Things being as they are, the NLP and "information retrieval" communities often tend to use the traditional evaluation formulas for Precision and Recall:

Precision and Recall (DEF)

Recall = (number of retrieved relevant entities) / (total number of all relevant entities)

Precision = (number of retrieved relevant entities) / (total number of all retrieved entities)

end precision and recall 10.3.1

whereby the relevancy of a "relevant" document is defined in regard to an external, ideally manually annotated étalon (i.e. golden standard), corrected by a human judge and subsequently furnished to the LS by the teacher or evaluator. Precision thus, in a certain sense, carries information about how much the set X retrieved by the algorithm is stained with "false positives" which do not belong to X according to the golden standard. Recall, on the other hand, carries information about how many among the entities which are labelled as "true" in the golden standard were selected (i.e. labeled as "positives") by the algorithm. Values of both are always constrained to the interval [0,1] and can be further combined into their "harmonic mean", commonly known as F-score:

F = 2 · (precision · recall) / (precision + recall)

which also yields a score from the interval [0,1] whereby 0 is obtained by the worst possible and 1 by the ideally performing algorithm.

This being said, it should be evident that precision and recall are concepts useful especially in the case of binary classification tasks, i.e. tasks in which one aims to categorize a certain set of entities into two groups (i.e. a is X or not-X). Given that the notion of binary distinction is indeed a powerful one, it is not uncommon that some studies succeed to get crowned with laurel - thanks to some additional averaging - even when they use precision & recall based metrics for the evaluation of pure multiclass classification problems, i.e. problems where one aims to categorize a certain set of entities into N > 2 groups, or clusters.
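Before proceeding to multiclass measures, the retrieval-oriented definitions given above may be illustrated by a toy computation; the "golden standard" and the "retrieved" sets below are invented solely for the sake of the example:

use strict;
use warnings;
use List::Util qw(sum);

my %relevant  = map { $_ => 1 } qw(d1 d2 d3 d4);   # invented golden standard
my @retrieved = qw(d2 d3 d5);                      # invented algorithm output

my $true_positives = sum(map { $relevant{$_} ? 1 : 0 } @retrieved) // 0;

my $precision = $true_positives / @retrieved;        # 2/3
my $recall    = $true_positives / keys %relevant;    # 2/4
my $f_score   = 2 * $precision * $recall / ($precision + $recall);

printf "precision=%.3f recall=%.3f F=%.3f\n", $precision, $recall, $f_score;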
Different measures were developed which target specifically the problem of multiclass clustering, the most traditional among these being purity, defined as:

purity(Ω, C) = (1/N) · Σ_k max_j |ω_k ∩ c_j|

where Ω = {ω_1, ..., ω_K} is the set of K clusters hypothetized by the LS, C = {c_1, ..., c_J} denotes the J classes present in the golden standard and N is the total number of clustered entities (e.g. documents). During the estimation of purity, each among the K hypothetized clusters is assigned to the class which is most frequent in the cluster. The accuracy of the assignment is subsequently assessed by counting the number of correctly assigned documents and dividing it by the total number of documents N. Similarly to all notions closely introduced in this section, ideal results have a value of 1 while bad results shall be close to 0.

Purity aside, literally dozens of other measures of clustering accuracy have already been developed; see Rosenberg and Hirschberg (2007) for an overview of the most important among them. The same article also introduces a measure called V-measure, defined as:

V-measure (DEF)

h = 1 − H(C|Ω) / H(C)
c = 1 − H(Ω|C) / H(Ω)
V = ((1 + β) · h · c) / (β · h + c)

where H(C) denotes the entropy of the collection of classes; H(Ω) denotes the entropy of the collection of hypothetized clusters; H(C|Ω) denotes the conditional entropy of C given Ω and H(Ω|C) the conditional entropy of Ω given C; and β specifies the weight between h and c8.
end v-measure 10.3.1

8 β is often set to 1 in order to bias the value of V neither towards homogeneity nor towards completeness.

Aside from the fact that its values also come from the interval [0, 1], V-measure disposes of multiple properties which make it worthy of interest for anyone willing to use an elegant measure of cluster evaluation. Not only is V-measure a harmonic mean of h (also called "homogeneity") and c ("completeness"), and thus strongly reminiscent of F-score, it also has the property of being stable in regard to the variation of the number of clusters. For these, as well as for other reasons more closely elucidated in Rosenberg and Hirschberg (2007); Christodoulopoulos et al. (2010); Hromada (2014a), shall V-measure be used in the "part-of-speech induction" chapter of the 2nd volume of this Thesis.

In order to work, Recall, Precision, F-score, Purity and V-measure require a golden standard which, in NLP, often has the form of manually annotated corpora. These measures, based on "external criteria", must not be, ex vi termini, used to modulate the execution of an unsupervised learning process. In learning scenarios in which the only source of knowledge is a pure non-annotated dataset, one is obliged to evaluate the clustering only according to criteria inherent in the dataset itself. Many such "internal criteria" have already been discussed in the literature (e.g. the silhouette coefficient, the Dunn index, the Davies-Bouldin index); one more - the "prototypicity coefficient" - shall be introduced in volume 2.

Let's now move forward with just one little warning: in no way does the sketchy overview hereby presented pretend to be a complete overview of NLP evaluation techniques, let alone of the learning methods themselves. Given the amount of research being done in the domain, this is simply impossible. Thus, in order to restrict this exposé to a reasonable length, the topic of the evaluation of continuous, i.e. "regression", ML models was completely set aside and all attention was concentrated upon the evaluation of ML algorithms which tend to "learn" models composed of two or more discrete categories. This design choice was mainly motivated by the belief that it is more reasonable to aim to explain the functioning of language cognition in terms of categorization, and not in terms of regression9.

9 Note that in the reasoning that shall follow, operations acting upon continuous domains are not to be completely excluded. Take as an example the notions of 1) the temporal half-life (i.e. decay interval) of a cognitive schema, 2) the selection of the locally-nearest-neighbor according to a similarity defined in the cosine metrics.

end evaluation 10.3.1

At last but not least, it is important to mention that machine learning is able to yield programs and applications which work, and work very well. And it is indeed especially NLP which is, aside from "computer vision"10, a field in and for which ML is developed. It is thus not too surprising that recent days have seen, for example in the article of Karpathy and Fei-Fei (2014), the results of some quite successful efforts to unite the two.

10 C.f. (Hromada et al., 2010) for an older application of ML methodology in the training of smile-detection classifiers.

end machine learning 10.3.1

ML-inspired methodologies for:

1. the problem of the ontogeny of semantic categories (equivalent to supervised learning of word meanings)
2. the problem of the ontogeny of morphosyntactic categories (also known as part-of-speech induction)
3. the problem of the ontogeny of grammars (also known as grammar induction)

shall be described in closer detail in the following sections, as well as in Volume 2.

end natural language processing 10.3

10.4 semantic vector architectures

It was already mentioned (9) that natural language furnishes a communication channel for the exchange of meanings. Meaning («signifié») is intentional: it refers to some external entity («referent»). Within the language L, a meaning M can be denoted by a token («signifiant») and it is by the exchange of physical (phonic in the case of spoken language, graphemic in the case of written language etc.) manifestations of these tokens that the producer (speaker|writer) and the receiver (hearer|reader) communicate.

Traditionally, the meaning of a word, i.e. its «semantics», was often considered as something almost «sacred» and not-to-be-formalized by mathematical means. The maximum which could be done - and had been done since Aristotle until the middle of the 20th century - was to define a concept in terms of lists of «necessary and sufficient features». Two types of features were considered to be both necessary and sufficient for the definition of the majority of concepts: the first specifying the concept's genus (or superordinated concept) and the second specifying the particular property (differentia) which distinguishes the concept from the other members of the same genus. Thus, for example, «dog» could be defined as a domesticated (differentia) canine (genus). An important property of such a system of concepts was that it allowed no ambiguous or fuzzy border cases: the logical «law of excluded middle» guaranteed that all entities which were not both canines and domesticated at the same time (e.g. a chihuahua which passed all her life in wilderness) could not be called a dog. Even in contemporary CL practice, projects like WordNet (Miller, 1995) incarnate such an aristotelic view in the form of datasets organizing the items of the human lexicon in what is principially an arborescent hierarchy of sub- and super-ordinated terms (i.e. of hyponyms and hyperonyms).

The change of the classical paradigm came slowly with the works of late Wittgenstein11 but especially with the empirical studies of Eleanor Rosch. What these studies (e.g. Rosch (1999)) found out was that not only are concepts often defined by bundles of features which are neither necessary nor sufficient, but also that the degree with which a feature can be associated with a concept often varies. Subsequently, Rosch proposed a «prototype theory» of semantic categories whose basic postulate is that some members of the category (or some instances of the concept) can be more «central» in relation to the category (resp. concept) than others. Thus, in some cultures "rose" is more "flower" than "daisy"; in other cultures the contrary is the case.

10.4.1 category prototype (def)

A prototype P of the category C is a member of C which shall be retrieved with the highest probability whenever one queries C for its most salient concrete representative. Such a member of C is to be as similar as possible to all other members of C and as dissimilar as possible from the members or prototypes of other categories.

end category prototype 10.4.1
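Read operationally, this definition suggests a straightforward retrieval procedure. The sketch below is purely hypothetical - the member vectors are invented and the choice of the cosine as the similarity criterion is our assumption - and returns as prototype the member of C with the highest average similarity to all other members:

use strict;
use warnings;
use PDL;

sub cosine {
    my ($u, $v) = @_;
    return inner($u, $v) / (sqrt(inner($u, $u)) * sqrt(inner($v, $v)));
}

# return the member most similar, on average, to all other members
sub prototype {
    my %members = @_;    # name => PDL vector
    my ($best_name, $best_score);
    for my $x (keys %members) {
        my @sims = map { cosine($members{$x}, $members{$_})->sclr }
                   grep { $_ ne $x } keys %members;
        my $score = 0;
        $score += $_ for @sims;
        $score /= @sims if @sims;
        ($best_name, $best_score) = ($x, $score)
            if !defined $best_score or $score > $best_score;
    }
    return $best_name;
}

# invented example vectors: "rose" and "tulip" huddle together
print prototype(
    rose   => pdl(1.0, 0.9, 0.1),
    tulip  => pdl(0.9, 1.0, 0.2),
    cactus => pdl(0.2, 0.1, 1.0),
), "\n";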
Prototype theory, as well as other both theoretic and empirical advances like the formalization of the notion of similarity, have - in combination with the development of information-processing technologies - paved the way to an operationalization of semantics which allows to transform the meanings of words into mathematically commensurable entities. In modern semantics, concepts are operationalized as geometric entities. Thus, the meaning of a token X observable within a language corpus C is often characterized as a vector of the relations which X holds with other tokens observable within the corpus. The set of such vectors associated to all tokens observable in C yields a «semantic space» which is a vector space within which one can effectuate diverse numeric and|or geometric operations.

Since a methodological objective of this dissertation is to bridge developmental psycholinguistics with the computational one, we consider it important to underscore that in NLP practice, the transformation of a corpus C into a semantic feature space S is practically always based on the empirical validity of the "distributional hypothesis" (DH) which states that «a word is characterized by the company it keeps» (Harris, 1954)12.

11 «For a large class of cases of the employment of the word 'meaning'—though not for all—this word can be explained in this way: the meaning of a word is its use in the language» (Wittgenstein, 1953)
12 DH can also be restated in somewhat more algebraic terms: «In the most simple case can be the vector which denotes concept X calculated as a normalized linear combination of vectors of concepts in context of which X occurs.» (Hromada, 2014d)

The practical usefulness of DH in practically all models of the geometric operationalization of meaning is undisputable. But DH has also a non-negligible theoretical importance. For, stated as it is, it supports «associationist» theories based on the notion that the essence of mind is somehow related to mind's ability to create relations, i.e. associations, between successive states. In addition to what was said in (9.4.1), we suggest that both mind's faculty to create associations, as well as the distributional hypothesis "the meaning of symbol X can be defined in terms of the meanings of the symbols with which X co-occurs", can be neurologically explained in terms of the already-mentioned Hebb's postulate: «The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated', so that activity in one facilitates activity in the other» (Hebb, 1964)

One can assume that IF

1. Hebb's rule governs the activity of not only single neurons but also of neural ensembles
2. distinct words Wx and Wy are somehow processed and represented by distinct neural ensembles Nx and Ny

THEN it shall follow that whenever a hearer shall hear (or a speaker shall speak) the two-word phrase Wx Wy, the ensemble of material (synaptic?) relations between Nx and Ny shall get reinforced. In more geometrical terms, on a more «mental» level, such a «rapprochement» of Nx and Ny would be characterized by the convergence of the geometrical representations of both circuits towards their common geometrical centroid. Thus, after processing the phrase Wx Wy, the vectorial representations of both Nx and Ny will be closer to each other than before hearing (or generating) the phrase.
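This conjectured «rapprochement» can be caricatured in a few lines of code; everything below - the learning rate, the two-dimensional vectors, the number of exposures - is invented solely for illustration:

use strict;
use warnings;
use PDL;

my $alpha = 0.1;            # invented learning rate
my $wx = pdl(1.0, 0.0);     # invented initial representation of Wx
my $wy = pdl(0.0, 1.0);     # invented initial representation of Wy

for my $exposure (1 .. 5) {
    my $centroid = ($wx + $wy) / 2;
    $wx += $alpha * ($centroid - $wx);    # each exposure to "Wx Wy" pulls
    $wy += $alpha * ($centroid - $wy);    # both vectors towards the centroid
    my $distance = sqrt(sum(($wx - $wy) ** 2));
    printf "after exposure %d: distance = %.3f\n", $exposure, $distance->sclr;
}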
10.4.2 hebb-harris analogy (aph)

For a corpus linguist, the distributional hypothesis means, mutatis mutandis, the same thing as Hebb's law for a neuroscientist.

end h-h aphorism 10.4.2

We conjecture that an associationist principle, similar to the one described above, is indeed at work whenever a mind projects the stimuli perceived from the external world unto an internally represented semantic space. Such a «semantic vector space» can subsequently be divided, partitioned or tesselated into diverse subspaces, each of which represents a diverse semantic category, class or concept. Or maybe even more than just represents: such partitions are concepts.

The big advantage of approaches modelling the «geometry of thought» (Gärdenfors, 2004) is that they allow, among other things, to measure and assess similarities and distances between two or more concepts. By doing so, they seem to be much closer to actual human experience with meanings than other computational methods (expert systems, ontologies, RDF etc.) based principially on the application of logical rules of inference. For programs which work with concepts as if they were geometrical entities have no problem whatsoever to answer questions like "what is more similar to a dog - a cat or a wolf?". Such questions - which any child would love to answer - couldn't be answered by an expert system without the intervention of a human operator who would explicitly declare the criterion of similarity according to which the similarity is to be assessed. But a system considering all three terms - "dog", "cat", "wolf" - as being just labels denoting geometrical points would have no problem to do so, if ever it was already confronted with a corpus in which the three terms occurred. And given the fact that these geometric models make it possible to calculate, evaluate or compare similarities between meanings, it is of no surprise that these very models make it quite easy to create artificial simulations of such cognitively salient phenomena as analogies, metaphors (Lakoff, 1990) and intuitions.

Let's now glance at a few such NLP models which process meanings as if they were geometric entities.

10.4.3 bag-of-terms

Bag-of-Terms (BoT) models distinguish contained and containing entities. Most often, words are understood as the contained entities and sentences or whole documents as the containing ones. What is important for such Bag-of-Words (BoW) models is that the document D1 contains a certain set of words while the document D2 contains another set of words. Such quantitative information about the number of occurrences of diverse words in diverse documents can be used to construct vectorial representations of these documents. This is done by representing every distinct document with a row vector whose specific elements denote specific words. Table 7 illustrates this for three sentences13, considered as individual documents. The order of words and other aspects (e.g. morphosyntax, phonology, prosody) are considered irrelevant: in pure BoW, it is only the occurrence of the word that counts.

13 Sentences like these (meaning "mama has ema", "ema has mama" and "mama has mama") are often among the first used in Slovak language primers.
              mama   má   emu   ema   mamu
mama má emu    1     1     1     0     0
ema má mamu    0     1     0     1     1
mama má mamu   1     1     0     0     1

Table 7: Vectorial representations of three sentence-sized documents. Every distinct word yields a distinct column.

This, however, is not necessarily the case in BoTs which implement another definition of the "contained entity" - i.e. of the component term by means of which one characterizes the "containing" document. For one can also work with terms which are either smaller, bigger or utterly different from words. One can look for occurrences of syllables or, simpler yet, of distinct sequences of N characters (N-grams). The construction of vectorial representations based on the occurrence of 3-gram terms is presented in Table 8.

     "mam" "ama" "ma " "a m" "má " "á e" " em" "emu" "á m" " ma" "amu"
D1     1     1     1     1     1     1     1     1     0     0     0
D2     2     1     1     1     1     0     0     0     1     1     1

Table 8: Vectorial representations of the sentence-sized documents D1 = "mama má emu" and D2 = "mama má mamu". Every distinct character trigram yields a distinct column.

In this case, one can see that some information about the word order is also included in the vectorial representation. This is so because the word-dividing empty space character " " is also taken into account, which was not the case in the pure BoW presented in Table 7. On the other hand, by focusing on trigram features and not on whole words, one may observe the feature "mam" to occur twice in document D2. Hence X_{2,1} = 2.

No matter what definition of document and term one uses, one obtains, at the end, a list of N D-dimensional row vectors where N is the number of documents in the corpus and D is the number of distinct tokens observed in the corpus. One thus obtains a term-document matrix X. In NLP practice, it is common and recommendable to further process the values of such a matrix with the so-called term frequency–inverse document frequency (tf-idf) weighting scheme.

TF-IDF (DEF)
Let tf(t, d) denote the term frequency, i.e. the number of times the term t occurs in document d, and let idf(t, D), denoting the inverse document frequency, be obtained as follows:

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N denotes the total number of documents in the corpus and |{d ∈ D : t ∈ d}| denotes the number of documents in which t occurs. The term frequency–inverse document frequency (tf-idf) is then calculated as follows:

tfidf(t, d, D) = tf(t, d) · idf(t, D)

in order to yield a numerical weight reflecting how important a word is to a document contained in a corpus.
end tf-idf 10.4.3
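A minimal sketch of the weighting scheme just defined - a toy computation over the two sentence-documents of Table 8 treated as word-level documents, not the indexing routine of our simulations - might look as follows:

use strict;
use warnings;

my %docs = (
    D1 => [qw(mama má emu)],
    D2 => [qw(mama má mamu)],
);
my $N = keys %docs;    # total number of documents

# term frequencies per document and document frequencies per term
my (%tf, %df);
for my $d (keys %docs) {
    my %seen;
    for my $t (@{ $docs{$d} }) {
        $tf{$d}{$t}++;
        $df{$t}++ unless $seen{$t}++;
    }
}

# tf-idf weight of every term in every document
for my $d (sort keys %docs) {
    for my $t (sort keys %{ $tf{$d} }) {
        printf "%s %-5s %.3f\n", $d, $t, $tf{$d}{$t} * log($N / $df{$t});
    }
}

Note how "mama" and "má", which occur in both documents, obtain the weight tf · log(1) = 0: terms which occur everywhere discriminate nothing.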
Verily is tf-idf a very simple yet very effective means by which an NLP engineer can increase the accuracy of one's vectorial model. But it also has some disadvantages. Primo, it adds a second pass to the construction of term-document matrices, which can, especially in the case of BigData NLP, bring about certain computational and memory costs. Secundo, the cognitive plausibility of tf-idf models is still to be demonstrated. In other terms: while practically the whole history of NLP empirically demonstrates that tf-idf represents an information-processing component wherein the statistical properties of the whole influence the weights of individual associations, current psycho-linguistic knowledge seems to fail to identify a cerebral mechanism functioning as tf-idf's neural correlate. Be it as it may, tf-idf brings even more order and information into the metric space given by the entities represented by the term-document matrix. And given that these entities are already of a numeric, quantified nature, they can be commensurated. The distance between words can be obtained by measuring the distance between two column vectors; the distance between documents can be obtained by assessing the distance between two row vectors. Multiple metrics (e.g. the Jaccard index, the Euclidean distance, the cosine for real-valued vectors, the Hamming distance for binary ones etc.) are used in order to do so.

end bag-of-terms 10.4.3

10.4.4 latent semantic analysis (txt)

A major disadvantage of term-document occurrence matrices, as generated by BoW models, is their sparsity. Given, for example, a corpus containing N = 1 million documents and M = 50000 distinct terms, BoW postulates the existence of a rectangular term-document matrix with fifty billion elements. And given that only a relatively small subset of distinct words shall occur in any specific document, the vast majority of values in such a matrix shall be zero.

Latent Semantic Analysis (LSA) was one among the first solutions aiming to address this sparsity problem in the NLP scenario. By unfolding the formula known in algebra as the singular value decomposition (SVD):

X = U Σ V^T

it transforms the original term-document matrix X into orthogonal matrices U and V and a diagonal matrix Σ. By selecting the D biggest values of Σ and the vectors of U and V associated with these values, one can reduce the dimensionality of the original matrix X to only D dimensions with the smallest possible error (in the least-squares sense, as guaranteed by the Eckart-Young theorem).

Algebraic and dimensionality-reduction aspects aside, LSA was, in its time, revolutionary for one principal reason: it allowed to compare not only documents with documents and terms with terms, but also terms with documents. It also allowed for a means of optimization: one could tune the model's performance by modifying the dimensionality14.

The feats furnished by LSA were, at the time of its conception, so astounding that LSA's conceptors considered their model to be the answer to the problem of category induction and to the antique problem concerning the essence of knowledge in general, hence promoting their computational model to the status of «a solution to Plato's problem: the latent semantic analysis theory of knowledge» (Landauer and Dumais, 1997).

LSA is indeed able to furnish dense, low-dimensional vector spaces of semantic categories and concepts. It also seems to yield interesting solutions for dozens of other problems; let's mention, as an example, the problem of grapheme-to-phoneme conversion in speech synthesis (Bellegarda, 2005). And it is also true that the transition through the site http://lsa.colorado.edu has been and is - for at least one generation of all sorts of cognitive science students - an important, useful, and potentially obligatory rite of passage of their academic parcours.

But it is also true that LSA has certain drawbacks. Computationally speaking, LSA is costly because SVD is costly. And cognitively speaking, it is somewhat difficult to see how a human brain could perform such a precise deterministic operation as SVD, let alone the dimensionality optimization which should precede it15. As LSA's conceptors put it: «It still remains to understand how a mind or brain could or would perform operations equivalent in effect to the linear matrix decomposition of SVD and how it would choose the optimal dimensionality for its representations, whether by biology or an adaptive computational process.» (Landauer and Dumais, 1997)

We propose to address the problem by simply ignoring the SVD altogether and rather focusing on another means of dimensionality reduction: the random projection.
14 According to (Landauer and Dumais, 1997), an optimal dimensionality for the problem of concept induction from English language corpora is approximately 300.
15 Note that the dimensionality optimization could have occurred during development, either phylogenetic or ontogenetic, or both.

latent semantic analysis end

10.4.5 random indexing (txt)

Random Indexing (RI) is a method of representation of textual corpora with dense, low-dimensional vector spaces. In theory, RI is justified by a lemma of Johnson-Lindenstrauss whose corollary «states that if we project points in a vector space into a randomly selected subspace of sufficiently high dimensionality, the distances between the points are approximately preserved» (Sahlgren, 2005). In more formal terms, the dimensionality of an r×c-dimensional term-document occurrence matrix X can be reduced by projection through a c×d-dimensional random matrix R, whereby the target number of dimensions (d) is the parameter of the projection and is smaller than the initial number of columns (i.e. d ≪ c):

X′_{r×d} = X_{r×c} R_{c×d}

In NLP practice, the simplest yet quite efficient variant of creation of such a slightly distorted d-dimensional matrix X′ is implemented by the following procedure: «Given the set of N objects (e.g. documents) which can be described in terms of F features (e.g. occurrence of the string in the document), to which one initially associates a randomly generated d-dimensional vector, one can obtain a d-dimensional vectorial representation of any object X by summing up the vectors associated to all features F1, F2 ... observable within X. The original random feature vectors are generated in a way that out of d elements of the vector, only S among them are set to either -1 or 1 value. Other values contain zero. Since the "seed" parameter S is much smaller than the total number of elements in the vector (d), i.e. S ≪ d, initial feature vectors are very sparse, containing mostly zeroes, with occasional value of -1 or 1.» (Hromada, 2014c). The Perl Data Language (PDL)-compliant source code of the procedure is presented in Listing 5.

Listing 5: Random Indexing Source Code

use PDL;

my $dimensions = 1000;    # target dimensionality d
my $seed       = 5;       # number S of non-zero elements per initial vector

my %doc_vectors;
my %term_vectors;

# generate a sparse random initial vector: $seed elements set to -1 or 1
# at random, non-colliding offsets; zeroes everywhere else
sub generate_initvector {
    my $value;
    my %set;
    my $vec = zeroes $dimensions;
    for (1 .. $seed) {
        (rand > 0.5) ? ($value = 1) : ($value = -1);
        my $offset = round(($dimensions - 1) * rand);
        while (exists $set{$offset}) {
            $offset = round(($dimensions - 1) * rand);
        }
        $set{$offset} = $value;
        index($vec, $offset) .= $value;
    }
    return $vec;
}

# document vector = sum of the initial vectors of all terms it contains
# (@document_list is assumed to contain the corpus)
for my $document (@document_list) {
    my @words = split(/[^\w]/, $document);
    for my $word (@words) {
        $term_vectors{$word} = generate_initvector
            if !exists $term_vectors{$word};
        $doc_vectors{$document} = zeroes $dimensions
            if !exists $doc_vectors{$document};
        $doc_vectors{$document} += $term_vectors{$word};
    }
}

Simply stated, the vectorial representation of a document A is obtained as a simple linear combination^16 of the initial vectors associated to terms T1, T2, T3 ... observable in A. For any such term, a d-dimensional initial vector is randomly generated, containing d − S zero elements and S elements whose value is either -1 or 1. The output of this simple variant of RI is a set of d-dimensional document vectors which can be used to calculate similarity among the documents. Normalization of these vectors is needed when one uses the cosine metric.
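As a hypothetical usage example - assuming %doc_vectors has been filled by Listing 5 for the toy documents of Table 8 - the cosine similarity of two document vectors can be computed with PDL's inner product:

use PDL;

# cosine similarity of two PDL vectors
sub cosine {
    my ($x, $y) = @_;
    return inner($x, $y) / (sqrt(inner($x, $x)) * sqrt(inner($y, $y)));
}

my $similarity = cosine($doc_vectors{"mama má emu"},
                        $doc_vectors{"mama má mamu"});
print "cosine similarity: ", $similarity->sclr, "\n";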
But one can go further: for one can additionally "reflect" the whole process, forget the random vectors (initially attributed to individual terms) and now calculate the vectorial representation of the term Tx as a linear combination of the documents in which Tx occurs. After 2 or 3 iterations^17 of such "reflection of information" from documents to terms and vice versa, one obtains numeric representations of both documents and terms, projected into one holistic metric space. Thus, in the spaces generated by Reflective Random Indexing (RRI) (Cohen et al., 2010), there is no distinction of essence between words and documents or, more generally, between objects and the contexts of their use. All can be understood as points or vectors of the same d-dimensional space. Not only that, such geometric entities can also be interpreted in terms of subspaces: one can speak about the region whose centroid is the entity E, or one can speak about subspaces orthogonal to E's vector. Once the world of meanings is thus geometrized, verily many are the applications of such "vector symbolic architectures" (Widdows and Cohen, 2014).

16 Weighting the term vectors with related tf-idf values is strongly recommended.
17 Note that due to convergence properties of random projection, more than 2 or 3 iterations of the reflective process often tend to degrade the accuracy of RI's semantic discrimination. On the other hand, multi-iterative convergence of associationist matrices yields highly useful results in other NLP tasks, including the estimation of the "importance of the sign" (Hromada, 2009) commonly known as PageRank (Hromada, 2010a).

random indexing end

10.4.6 light stochastic binarization

The raison-d'être of all semantic space architectures is information|knowledge retrieval. No matter whether one encodes one's dataset in form of bag-of-words, LSA, RI or RRI vectors, the objective is often the same: to implement the model in real-life applications which are able to identify members of the dataset which are semantically closest to some user-specified query. And to do so in reasonable time. Thus, the computational complexity of the retrieval phase is at least as important as the computational complexity of the indexing (encoding) phase. More so in the BigData scenario where one aims to find a needle in a haystack of billions of documents.

In case of data of very low dimensionality (d < 10), the solution is quite straightforward: one can sort one's data and create indices for it, by use of binary trees or other indexing techniques^18. Unfortunately, because of the so-called "curse of dimensionality", it is practically impossible to create retrieval indices for entities of higher dimensionality. In layman's terms this is so because two entities close to each other in many dimensions can still be considered far from each other (because they are really far from each other in just a few dimensions); or because two entities far from each other in some dimensions can still be considered relatively close to each other (because they are quite close in many other dimensions). Thus, in huge-dimensional spaces, usage of indices (e.g. k-d trees) in retrieval can sometimes turn out to be more costly than a simple "linear" search in which one compares one's query with all vectors stored in the dataset. Given that the complexity of such a linear search is N ∗ d, given that one cannot reduce the size of one's dataset (i.e. N), and given that one accepts that the "curse of dimensionality" is inevitable in semantic spaces, one can still hasten, in silico, the retrieval by at least two possible means:
1. construct semantic spaces of the smallest possible (yet still sufficiently high to encode semantically relevant distinctions) dimensionality d
2. execute operations with binary vectors (instead of integer, float or complex ones)

Combination of these two means into one algorithm yields Light Stochastic Binarization (LSB).

18 Dataset indexing is often explained in terms of a huge library with one shelf containing a sorted cartotheque of cards which specify the book's position in the library.

The idea behind LSB is fairly trivial and is inspired by approaches like Locality Sensitive Hashing (LSH, Datar et al. (2004)) or Semantic Hashing (SH, Salakhutdinov and Hinton (2009)). In these hashing approaches, the objective is to use a "hashing function" able to attribute a short and concise binary vector (i.e. "a hash") to any document in the dataset in a way that if two documents are similar (or identical), their hashes will also be similar (or identical). In this sense, LSB can also be understood as a sort of hashing algorithm which simply uses Reflective Random Indexing (10.4.5) as its hashing function. Once a document (or a query) Q is transformed by RRI into its vectorial representation q, whose n-th element we denote with q_n, one obtains the resulting binary hash h by trivial thresholding:

h_n = 0 if q_n < 0
h_n = 1 if q_n >= 0

Expressed verbally, when the value generated by RRI is equal to or greater than zero, one puts 1 into the respective position of the binary hash, otherwise one puts 0.^19 At its very core, it is nothing else than a mapping of RRI's output integer|float range onto the binary range. A mapping which exploits a mathematically beautiful intuition of Sahlgren (2005) that the random projection - as performed by RI and RRI - should be seeded solely with values of -1 and 1.

The study (Hromada, 2014c) has indicated that in case of classification scenarios where low recall is allowed if high precision is attained, LSB yields results comparable to (or better than) both binarized LSA and the renowned deep-learning technique proposed by Salakhutdinov and Hinton (2009). Figure 19 displays these results for the problem of multiclass classification (C=20). All models thereby represented used dimensionality d = 128; the size of a document hash was thus exactly 16 bytes.

light stochastic binarization end

10.4.7 evolutionary localization of semantic attractors

Reflective procedure aside, LSB involves neither optimization nor machine learning components. But given that it produces the simplest data structures possible - id est, low-dimensional binary vectors - it can be easily embedded into more complex frameworks. Evolutionary Localization of Semantic Attractors (ELSA, Hromada (2015)) aims to do exactly that.

19 This trivial thresholding is applicable only in case of huge (BigData) corpora where the law of large numbers applies. C.f. Hromada (2014c) for LSB's variant usable in cases of smaller corpora.

Figure 19: Comparison of reflective LSB (I=2 iterations) and unreflective LSB (I=0) with Semantic Hashing and binarized Latent Semantic Analysis. Reproduced from Hromada (2014c).

ELSA is a result of embedding the LSB into an evolutionary computation framework.
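Before turning to ELSA's details, the thresholding step described above can be sketched in a few lines of plain Perl (a hypothetical illustration, not the code of Hromada (2014c)):

# map a real-valued RRI output vector onto its binary hash
# by thresholding at zero
sub binarize {
    return map { $_ >= 0 ? 1 : 0 } @_;
}

my @q    = (0.7, -1.2, 0.0, 3.4, -0.1);   # RRI output vector
my @hash = binarize(@q);                  # yields (1, 0, 1, 1, 0)

# packed as a bit string, a d=128 hash occupies exactly 16 bytes
my $packed = pack "b*", join "", @hash;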
More concretely, ELSA uses canonic genetic algorithms (8.7.1) to localize a set of category prototypes (10.4.1) best adapted to the document classes encoded in the training corpus. ELSA thus aims to address the problem of supervised document classification and as such expects to be trained with a corpus containing documents and associated category labels. It first processes the whole corpus with the LSB algorithm and, once documents are transformed into binary vectors, it starts to look for the optimal set of category prototypes. In ELSA, the search for category prototypes is equivalent to the discovery of such a set of prototypes which minimizes the function:

F(P) = α Σ_{t ∈ c_P} H(t, P) − β Σ_{f ∉ c_P} H(f, P)    (1)

whereby P denotes the vector representation of the prototype in the binary space, H denotes the Hamming distance^20, t denotes the vector representation of a "true" document belonging to the same class (c_P) as the prototype, f is the vector of a "false" document belonging to some other class of the training corpus, and α and β are weighting parameters. Thus, a candidate prototype P of category c_x is considered to be most fit if it is as close as possible (i.e. has the smallest Hamming distance) to all documents which are attributed to c_x in the training corpus, and as far as possible from documents which are not attributed to c_x in the training corpus. In ELSA, the solution to the multiclass classification problem is formalized as such a group of prototypes which minimizes the distance to members of the categories they should represent and maximizes the distance to others. Given that the training corpus divides its documents into |C| classes, and given that every document and every prototype can be represented as a d-dimensional binary vector, the chromosomes which are to be optimized by ELSA are binary vectors of length |C| ∗ d. The rest is work in progress. C.f. Hromada (2015) for a comparison of ELSA with binarized LSA, non-optimized LSB, or Semantic Hashing.

Given that ELSA introduces in one unified framework three components which we claim to be cognitively plausible, id est:

1. dimensionality reduction by means of random projection
2. theory of semantic prototypes
3. evolutionary computation

and given that its binary nature predestines it to execute very fast on any transistor-based computer, we shall aim to implement ELSA, in one way or another, in the majority of simulations described in volume 2.

20 The Hamming distance of two binary vectors h1 and h2 is the smallest number of bits of h1 which one has to flip in order to obtain h2.

elsa end 10.4.7

In this section we have presented multiple architectures which all have one thing in common: they succeed in transforming textual documents into geometric and|or mathematical entities. To keep the overview as simple and concise as possible, only scalars, vectors and matrices were discussed; the reader is to be reminded that other mathematical models of semantics exist which also involve tensors of higher order. Even the very introduction of these, however, surpasses by far the objectives of this Thesis. Thus, instead of a closer discussion of fascinating topics like the interrelations between "binding operators", "circular convolution", "complex numbers" and "quantum logic" (Widdows and Cohen, 2014), we have preferred to acquaint the reader with the idea that meanings are subspaces of d-dimensional semantic spaces.
Departing from the simple word-document occurrence matrices of the first bag-of-words models, passing through LSA's ambitions to answer perennial questions - What are ideas, how are they stored and how are they accessed? - and discussing other, more natural means of dimensionality reduction, we finally approach the Point where multiple divergent streams converge into one. But before exploring it somewhat further, let's see whether the realms of semantic and syntactic categories do not have something in common. In a computational sense, for example.

semantic vector architectures end 10.4

10.5 part-of-speech induction

The term part-of-speech induction (POS-i) designates the process which endows a human or an artificial agent with the competence to attribute POS-labels (like "verb", "noun", "adjective") to any linguistic token observable in the agent's linguistic environment. POS-i can be understood as a «partitioning problem» since one's objective is to partition the initial set of all tokens occurring in corpus C (which represents the agent's linguistic environment E) into N subsets (partitions, clusters) whose members would correspond to grammatical categories as defined by the gold standard. Because one does not use any information about the «ideal» gold standard grammatical categories during the training phase and uses it only for the final evaluation of the performance of the model, POS-i is considered to be an «unsupervised» machine learning problem.

POS-i's «supervised» counterpart is the problem of POS-tagging. In POS-tagging, one trains the system by serving it, during the training phase, a sequence of couples (word W, tag T) where tag T is the label denoting the grammatical category into which the word W belongs. POS-tagging is thus simpler than POS-i, where no information about ideal labels is furnished during the learning. Training of POS-tagging systems is of particular importance especially for languages where many word forms can potentially belong to many part-of-speech categories (in English, for example, almost any noun can also play the role of a verb; a token like «still» can be interpreted as substantive, verb, adjective and even adverb (Páleš, 1994), its POS-category being determined by its context). On the contrary, in morphologically rich languages where such a «homonymy of forms» is present in lesser degrees and relations between word types and classes are less ambiguous, one can often train the POS-tagging system by simply memorizing an exhaustive list of (W, T) couples.

10.5.1 non-evolutionary pos-i

The paradigm currently dominating the POS-i domain was fully born with the article published by Brown and his colleagues in 1992 (Brown et al., 1992). Brown and his colleagues applied the information-theoretic notion of «mutual information» M:

M(w1, w2) = log ( P(w1, w2) / (P(w1) P(w2)) )

upon all word bigrams (i.e. sequences of two tokens w1, w2 which co-occur with probability P(w1, w2)) and subsequently devised a merging algorithm able to group words into classes in a way that the mutual information within a class would be maximized. In the two decades since the publication of Brown's study Brown et al.
(1992), the word n-gram co-occurrence approach has inspired hundreds of studies: be it hidden Markov Models tweaked with variational Bayes, Gibbs sampling, morphological features, or graph-oriented methods - all such approaches and many others consider co-occurrence of words with n-gram sequences to be the primary source of relevant information for the subsequent creation of part-of-speech clusters. In all these models, one aims to discover the ideal parameters of Markovian statistical models, often employing a so-called Expectation-Maximization (EM) algorithm to discover the optimal partitioning. Unfortunately, EM is unable to quit locally optimal states once it has discovered them. Notwithstanding this disadvantage, the comparative study of Christodoulopoulos et al. (2010) suggests that probabilistic models of part-of-speech induction can indeed be very performant.

POS-induction can also be realized by means of the k-means clustering algorithm, or one of its variants. The K-means algorithm (MacQueen et al., 1967; Karypis, 2002) partitions N observations, described as vectors in D-dimensional space, into K clusters by attributing every observation to the cluster with the nearest centroid (i.e. mean). If one considers these centroids to denote prototypes of the categories in the center of which they are located, then one can consider the k-means algorithm to be consistent with the «prototype theory of categorization», as proposed by Rosch. Table 9 illustrates a simple K-means partitioning of tokens present in the English version of Orwell's 1984, as contained in Multext-East (Erjavec, 2004).

cluster   nouns   verbs
0            10       3
1           568      67
2            97     668
3            13    1011
4          1173      67
5           608     958
6          1977      97

Table 9: K-means clustering of tokens according to both suffixal and co-occurrence informations. Table partially reproduced from Hromada (2014b).

In this example case we have clustered all tokens observable in the corpus into 7 clusters according to features both internal to the token - i.e. suffixes^21 - and external - i.e. co-occurrence with other tokens. Note that even in such a simple model, where no machine learning or optimization was performed, the K-means algorithm somehow succeeds in distinguishing verbs from nouns. As is shown in Table 9, whose columns represent the "gold standard" tags and whose rows denote the artificially induced clusters, even such a naïve computational model has assigned 83.6% of nouns to clusters 1, 4 and 6 while assigning 91.8% of verbs to clusters 2, 3 and 5.

non-evolutionary pos-i end 10.5.1

10.5.2 evolutionary

Usage of evolutionary computing in NLP is - in comparison to other methods like neural networks, Hidden Markov Models, Conditional Random Fields or SVMs - still very rare. This is also the case for NLP's sub-problem of part-of-speech tagging: we are not aware of any tentative to resolve the POS-i problem with evolutionary means, and of only one tentative to use genetic algorithms to train a part-of-speech tagger:

21 That suffixes are of particular importance for POS-induction is more closely demonstrated in our article Hromada (2014a).

In Araujo (2002), a system of POS-t involving crossover and mutation operators is described. What is particularly interesting about Araujo's system is that a separate evolution process is run for every separate sentence of the test corpus. The training corpus, on the other hand, serves mainly as a source of statistical information concerning co-occurrences of diverse words and tags in diverse word & tag contexts.
This information concerning the «global» statistical properties of the training corpus is later exploited in the computation of fitness. Let's take, for example, the phrase «Ring the bell». Since words like «ring» and «bell» are in English sometimes used as verbs and sometimes used as nouns, such a sentence can be tagged in at least 4 different ways:

N D^22 N
V D V
N D V
V D N

Such sequences of tags yield the individual members of Araujo's initial population of chromosomes. In languages like English, where almost every word can be attributed to more than one POS category & the number of possible tag sequences therefore increases with the length of the phrase-to-be-tagged, one will most probably be obliged to choose such initial individuals randomly.

The fitness of every individual possibly tagging a sentence of n words is subsequently calculated as a sum of accuracies of the tags (genes) at positions i:

Σ_{i=0}^{n} f(g_i)

The accuracy of an individual gene g_i is calculated as:

f(g_i) = log ( context_i / all_i )

whereby the values of context_i and all_i are extracted from the training table which was constructed during the training phase and represent the overall frequency of occurrence of word w_i within specific (context_i) and all (all_i) contexts. Once fitness is evaluated, fitness-proportional crossing-over (50%) and mutation (5%) are realized. Notwithstanding the fact that Araujo doesn't seem to have used any other selection mechanism, in less than 100 generations populations seemed to converge to sequences of tags which were more than 95% correct in regard to the gold standard. This is a result comparable to other POS-tagging systems, but with lesser computational cost. It is also worth noting that Araujo's experiments indicate that working solely with the contextual window W_L, W, W_R, i.e. just looking one word to the left and one word to the right, seems to yield, in the case of POS-tagging of English, higher scores than extracting data from larger contextual spans.

22 The non-terminal symbol D denotes the category of determiners containing such elements as the articles "the", "a / an" etc.

When it comes to the «unsupervised» variant of the POS-t problem, id est the problem of part-of-speech induction, up to this date there have been - as far as we know - no tentatives to address the POS-i problem by means of evolutionary computing. For this reason, we shall aim to propose our own solution in volume 2.

evolutionary pos-i and pos-t end 10.5.2

pos-i and pos-t end 10.5

10.6 grammar induction

The input of the Grammar Induction (GI) process is a corpus of sentences written in language L; its output is, ideally, a grammar (i.e. a tuple G = (S, N, T, P) as defined in 10.2) or a language model able to generate sentences of L, including such sentences that were not present in the initial training corpus. The nature of the resulting grammar is closely associated to the content of the initial corpus as well as to the nature of the inductive (learning) process.

According to their «expressive power», all grammars can be located somewhere on a «specificity - generality» spectrum. On one extreme of the spectrum lies the grammar having the following production rules:

1 → 2*
2 → a | b | c ... Z

whereby * means «repeat as many times as you want». This very compact grammar can potentially generate any text of any size and as such is very general.
But exactly because it can accept any alphabetic sequence and thus does not have any «discriminatory power» whatsoever, such a grammar is completely useless as an explication of the system of any natural language. On the other extreme lies a completely specific grammar which has just one rule:

1 → <corpus>

This grammar contains exactly what corpus C contains and is thus not compact at all (it is even two symbols longer than C). Such a grammar is not able to encode anything else than the sequence which was literally present in the training corpus and is therefore also useless for any scenario where novel sentences are to be generated (or accepted).

The objective of the GI process is to discover, departing solely from corpus C (which is written in language L), a grammar which is neither too specific, nor too general. If it is too general, it shall «overregularize» (9.2.4), i.e. it shall be able to generate (or accept) sentences which the common speaker of L wouldn't consider as grammatical. If it is too specific, it shan't be able to represent all sentences contained in C or, if it shall, it shan't be able to generate (or accept) any sentence which is considered to be a sentence of L but was not present in the initial training corpus C.

10.6.1 existing non-evolutionary approaches

One of the first serious computational models of GI is Wolff's «Syntagmatic - Paradigmatic» (SNPR) model (Wolff, 1988). Its core algorithm is presented in Listing 6.

Listing 6: Outline of Processing in the SNPR Model (reproduced from Wolff (1988))

1. Read in a sample of language.
2. Set up a data structure of elements (grammatical rules) containing, at this stage, only the primitive elements of the system.
3. WHILE there are not enough elements formed, do the following sequence of operations repeatedly:
   BEGIN
   3.1 Using the current structure of elements, parse the language sample, recording the frequencies of all pairs of contiguous elements and the frequencies of individual elements. During the parsing, monitor the use of PAR elements to gather data for later use in rebuilding of elements.
   3.2 When the sample has been parsed, rebuild any elements that require it.
   3.3 Search amongst the current set of elements for shared contexts and fold the data structures in the way explained in the text.
   3.4 Generalize the grammatical rules.
   3.5 The most frequent pair of contiguous elements recorded under 3.1 is formed into a single new SYN element and added to the data structure. All frequency information is then discarded.
   END

We consider the SNPR model to be of particular importance because of its aim to explain the process of Grammar Induction as a sort of cognitive optimization: «The central idea in the theory is that language acquisition and other areas of cognitive development are, in large part, processes of building cognitive structures which are in some sense optimal for the several functions they have to perform» (Wolff, 1988). Wolff also associates his «cognitive optimization hypothesis» with Brown's «law of cumulative complexity» (c.f. REFREF) which Wolff paraphrases in the statement: «if one structure contains everything that another structure contains and more then it will be acquired later than that other structure» (Wolff, 1988).

Figure 20: Equivalence classes and production rules induced from English language samples by the SNPR model. Fig. reproduced from Wolff (1988).
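Step 3.5 - the formation of a new SYN element from the most frequent pair of contiguous elements - can be sketched as follows (a hypothetical plain-Perl illustration of the compression step, not Wolff's original code):

# one compression cycle: substitute the most frequent contiguous pair
# of elements by a new non-terminal (SYN) element
my @sample = qw(a b c a b d a b c);

my %pair_freq;
$pair_freq{"$sample[$_] $sample[$_+1]"}++ for 0 .. $#sample - 1;
my ($best) = sort { $pair_freq{$b} <=> $pair_freq{$a} } keys %pair_freq;
print "new rule: SYN1 -> $best\n";           # here: SYN1 -> a b

# rewrite the sample, replacing every occurrence of the pair
my @rewritten;
for (my $i = 0; $i <= $#sample; $i++) {
    if ($i < $#sample && "$sample[$i] $sample[$i+1]" eq $best) {
        push @rewritten, "SYN1";
        $i++;                                # skip the pair's second element
    } else {
        push @rewritten, $sample[$i];
    }
}
print "@rewritten\n";                        # SYN1 c SYN1 d SYN1 c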
The grammar resulting from such a contact between the language sample and the SNPR inducing mechanism is displayed on Figure 20. In Wolff's theory, optimization is further understood as compression. Within the SNPR model, such compression is realized in part 3.5 of his algorithm (sketched above), where the most frequent pair of contiguous elements (either terminals or non-terminals) is substituted by a new non-terminal symbol. For this reason, the size of the grammar able to generate the initial language sample ideally decreases with every cycle of the model's «while» loop until the process converges to a state where there is no redundancy left to «compress».

Wolff proposes that Grammar Induction is a process which should maximize the coding capacity (CC) of the resulting grammar while minimizing its size, i.e. its Minimal Description Length (MDL). He defines the ratio CC/MDL to denote the grammar's efficiency, and it may be the case that within a more evolutionary framework, where one would work with populations of grammars, a very similarly defined notion of efficiency could be used as the core component of the fitness function. Unfortunately, Wolff's 1988 SNPR model is not evolutionary since it involves neither stochastic factors nor the notion of multiple candidate solutions. SNPR is simply confronted with the language sample, deterministically compresses redundancies in a way that sometimes resembles human grammar (and sometimes does not), subsequently gets stuck in a local optimum, and there is no way out of it.

Another famous model of GI is that of Elman (1993). Contrary to Wolff's algorithm, which is principally «symbolic», Elman's model is a «connectionist» one. More concretely, Elman succeeded in training a simple recurrent neural network which was «trained to take one word at a time and predict what the next word would be. Because the predictions depend on the grammatical structure (which may involve multiple embeddings), the prediction task forces the network to develop internal representations which encode the relevant grammatical information.» (Elman, 1993). The most important finding of Elman's study seems to be the evidence for a so-called «less is more hypothesis» which Elman himself labels with the terms «importance of starting small»: «Put simply, the network was unable to learn the complex grammar when trained from the outset with the full "adult" language. However, when the training data were selected such that simple sentences were presented first, the network succeeded not only in mastering these, but then going on to master the complex sentences as well» (Elman, 1993).

Something similar occurred also when he tuned the capacity of the «internal memory» of his networks rather than the corpus itself. Elman observed: «If the learning mechanism itself was allowed to undergo "maturational changes" (in this case, increasing its memory capacity) during learning, then the outcome was just as good as if the environment itself had been gradually complicated» (Elman, 1993). Thus, not only do the results of Elman's computational model point in the same direction as many developmental and psycholinguistic studies of «motherese» (c.f. Section 9.3); they also show the importance of gradual physiological changes for the ultimate mastering of the maternal language.
He even goes so far as to state that the prolonged infancy of human children can possibly go hand in hand with the fact that only humans develop language to the extent we do: «In isolation, we see that both learning and prolonged development have characteristics which appear to be undesirable. Working together, they result in a combination which is highly adaptive» (Elman, 1993).

Notwithstanding these interesting results, which are not to be underestimated, we see two disadvantages of Elman's approach. Primo, as is often the case for connectionist neural networks, his resulting model is somewhat difficult to interpret: given the training constraints mentioned above, the network seems to predict quite well the next word in the phrase, but it is not evident why it does what it does. Elman himself dedicates a major part of his article to descriptions of his tentatives to understand how his «blackbox» functions. Secundo, Elman confronted his model only with artificial corpora, i.e. corpora generated from manually created grammars. Thus, his model accounts only for a limited subset of properties of one language (English) and as such is still quite far from a full-fledged solution to the problem of natural language GI.

The model called «Automatic Distillation of Structure» (ADIOS) seems to be touched by this second disadvantage to a lesser extent since, as Solan and his colleagues state: «In grammar induction from large-scale raw corpora, our method achieves precision and recall performance unrivaled by any other unsupervised algorithm. It exhibits good performance in grammaticality judgment tests (including standard tests routinely taken by students of English as a second language) and replicates the behavior of human subjects in certain psycholinguistic tests of artificial language acquisition. Finally, the very same algorithmic approach also is proving effective in other settings where knowledge discovery from sequential data is called for, such as bioinformatics.» (Solan et al., 2005)

ADIOS is a graph-based model. It considers sentences to be paths in a directed pseudograph (i.e. loops and multiple edges are allowed), each sentence being delimited by special «begin» and «end» vertices. Every lexical entry (i.e. a word type) is also a vertex of the graph; thus, if two or more sentences share the same word X, they cross in the vertex V_X; if they contain the same subsequence XY, their paths share the common subpath (edge) V_X V_Y, etc.

Figure 21: Equivalence classes and production rules induced from English language samples by ADIOS algorithm. Reproduced from Solan et al. (2005).

The authors of ADIOS describe their algorithm as follows: «The algorithm generates candidate patterns by traversing in each iteration a different search path (initially coinciding with one of the original corpus sentences), seeking subpaths that are shared by a significant number of partially aligned paths. The significant patterns (P) are selected according to a context-sensitive probabilistic criterion defined in terms of local flow quantities in the graph... Generalizing the search path, the algorithm looks for an optional equivalence class (E) of units that are interchangeable in the given context [i.e., are in complementary distribution].
At the end of each iteration, the most significant pattern is added to the lexicon as a new unit, the subpaths it subsumes are merged into a new vertex, and the graph is rewired accordingly... The search for patterns and equivalence classes and their incorporation into the graph are repeated until no new significant patterns are found» (Solan et al., 2005).

In other terms, ADIOS starts with a so-called Motif Extraction (MEX) procedure which looks for bundles of the graph's subpaths which obey certain conditions. Once such «patterns» are found, they are subsequently «substituted» for non-terminal symbols and the graph is «rewired» to incorporate such newly constructed non-terminals. Such a «pattern distillation» procedure of generalization bootstraps itself until no further rewiring is possible. The output of the whole process is a rule grammar combining patterns (P) and their equivalence classes (E) into rules, able to generate even phrases which weren't present in the initial corpus. An example of how ADIOS progressively discovers more and more abstract combinatorial patterns is presented on Figure 21.

ADIOS is undoubtedly one of the most performant GI systems which currently exist. It combines statistic, probabilistic and graph-theoretic notions with the notion of rule-based grammar and as such is also of great theoretical interest. On the other hand, ADIOS does not involve any source of stochasticity; it seems to be purely deterministic and as such incapable of dealing with the highly probable convergence towards locally optimal grammars. In confrontation with some partial corpora this may possibly not cause any problems but, we predict, without any stochastic variation whatsoever, ADIOS could not account for more than a few «advanced» & real-life properties of natural languages and as such shall possibly share the destiny of the SNPR model.

end non-evolutionary gi 10.6.1

10.6.2 existing evolutionary approaches

Multiple authors have proposed to solve the GI problem with different variants of evolutionary computing - in the following paragraphs we shall describe five different approaches:

1. hill-climbing induction of finite state automata (Tomita, 1982)
2. the GIG method for inference of regular languages (Dupont, 1994)
3. evolution of stochastic context-free grammars (Keller and Lutz, 1997)
4. an evolutionary method of inducing grammars from POS tags of nine different English language corpora (Aycinena et al., 2003)
5. the genetic algorithm of Smith & Witten (Smith and Witten, 1995) for inducing a LISP s-expression grammar from a simple corpus of English sentences

Tomita's 1982 paper can be considered to be one of the first empiric studies of grammatical inference. The study focused on inference of grammars of 14 different regular languages - which are often called «Tomita languages» in subsequent literature - by means of deterministic finite state automata. Tomita first encoded any possible finite state machine with n states in the following manner:

((A_1, B_1, F_1)(A_2, B_2, F_2) ... (A_n, B_n, F_n))

whereby every block «(A_i, B_i, F_i) corresponds to the state i, and A_i and B_i indicate the destination states of the 0-arrow and the 1-arrow from the state i, respectively. If A or B is zero, then there is no 0-arrow or 1-arrow from the state i, respectively. F_i indicates whether state i is one of the final states or not. If F_i is equal to 1, the state i is one of the final states.
The initial state is always state 1.» (Tomita, 1982) Thus, for example, the string ((1 2 1) (3 1 1) (4 0 0) (3 4 1)) encodes the finite state automaton illustrated on Figure 22.

Figure 22: Finite state automaton matching all strings over (1 + 0)* without an odd number of consecutive 0's after an odd number of consecutive 1's. Reproduced from Tomita (1982).

Such encoding allowed Tomita to subsequently apply his hill-climbing approach. Hill-climbing can be considered to be a precursor of the more extended genetic programming, since it employs both random mutations to explore the surrounding search-space and a sort of selection algorithm which always prefers to use, in the following iteration of the algorithm, such individual solutions for which the value of the evaluation function E increases. Tomita's definition of E is very simple:

E = r − w

«where r is the number of strings in the right-list accepted by the machine, and w is the number of strings in the wrong-list accepted by the machine» (Tomita, 1982). The right-list is a positive sample corpus while the wrong-list is the negative sample. Thus, if a random mutation transforms an individual X_n into an individual X_{n+1} so that E(X_{n+1}) > E(X_n) - i.e. if an automaton is discovered which matches more positive sequences, or fewer negative sequences, or both - it will be X_{n+1} which will be mutated in the next cycle of the algorithm.

Tomita's approach cannot be considered to be fully evolutionary because he did not use populations, nor did he employ any kind of cross-over operator. For this reason, Tomita's regular grammar-inferring algorithm did sometimes get stuck in local maxima from which there was no way out. Notwithstanding this small imperfection - of which Tomita himself was well aware - his work served, and still serves, the role of an important hallmark on the path to full-fledged GI.

Dupont (1994), for example, also focused his study on the induction of 15 different regular Tomita languages. In his formally very sound work, he defines the problem of inference of regular languages as a problem of finding an optimal partition of the state space of a finite «maximal canonical automaton» (MCA) able to accept the sentences from the positive sample. The fitness function also takes into account the system's tendency to reject the sentences contained in the negative sample. By using a so-called «left-to-right canonical group encoding», Dupont succeeds in representing diverse individual automata in a very concise way which allows him to subsequently evolve them by means of structural mutation («the structural mutation consists of a random selection of a state in some block of a given partition followed by the random assignment of this state to a block» (Dupont, 1994), e.g. MUTATE(((1, 3, 5), (2), (4))) → ((1, 5), (2, 3), (4))) and structural crossover («the structural crossover consists of the union in both parent partitions of a randomly selected block» (Dupont, 1994), for example ((1, 4), (2, 3, 5)) ⊗ ((1, 3), (2), (4), (5)) → ((1, 3, 4), (2, 5)) and ((1, 3, 4), (2), (5))).

Because «the search space size dramatically increases with the size of the positive sample, making the correct identification more difficult when we have a larger positive information on the language» (Dupont, 1994), Dupont has also proposed an incremental procedure allowing one to start the search process from a smaller yet pertinent region of the search space. The procedure unfolds as follows: «first sort the positive sample I+ in lexicographical order.
Consequently, the shortest strings are first taken into account. Starting with the first sentence of I+, we construct the associated MCA(I+) and we search for the optimal partition of its state set under the control of the whole negative sample I−. Let A1 denote the derived automaton with respect to this optimal partition. Let snext denote the next string in I+. If snext is already accepted by A1, we skip it.» (Dupont, 1994) Otherwise, the automaton A1 is extended so that it can also cover snext. The search under the control of the whole negative sample is then restarted and the whole process is repeated until all sentences from the positive sample have been considered.

With a population size of 100 individuals, a maximum number of 2000 evaluations, crossover rate 0.2, mutation rate/bit 0.01 and the semi-incremental procedure implemented, Dupont's approach attained, on average, a classification rate of 94.4%. For five among the fifteen Tomita languages, grammars were constructed which attained 100% accuracy (i.e. accepted all sentences from the positive sample and rejected all strings from the negative sample). Results have also indicated that if ever the semi-incremental procedure is applied, the sample size has a positive influence upon the accuracy of inferred grammars - a bigger sample yields more accurate grammars.

While Tomita's results indicate, and Dupont's results further confirm, the belief that induction of grammars by means of evolutionary computing is a plausible thing to do, they do so only in regard to the simplest type of grammars - the regular ones. Grammars of natural languages, however, are definitely not regular, and models of GI of more expressive «context free» (CFG) or «context sensitive» grammars are needed.

Keller and Lutz (1997) employed a genetic algorithm to evolve parameters of stochastic context-free grammars (SCFG) of 6 different languages. SCFGs are similar to traditional CFGs (see 10.2 for the definition of CFGs), but extended with a probability distribution, so that there is a probability value in the range [0, 1] associated to every production rule of the grammar. These values are called the SCFG's parameters and these are the values which the algorithm of Keller & Lutz aims to optimize by means of GAs. Their approach involves the following steps:

1. Construct a covering grammar that generates the corpus as a (proper) subset.
2. Set up a population of individuals encoding parameter settings for the rules of the covering grammar.
3. Repeatedly apply genetic operations (cross-over, mutation) to selected individuals in the population until an optimal set of parameters is found.

Their fitness function F(G) is based on the idea of Minimal Description Length (MDL). More formally, Keller & Lutz aimed to maximize:

F(G) = K_C / ( L(C|G) + L(G) )

by minimizing the denominator, which is defined as the number of bits needed to encode the grammar G (L(G)) plus the number of bits needed to encode corpus C, given the grammar G (L(C|G)). The numerator K_C is just a corpus-dependent normalization factor assuring that the value of fitness shall be in the range [0, 1]. When confronted with positive samples of cca 16000 strings (typically of length 6 or 8) of 6 different context-free languages:

1. EQ : language of all strings consisting of equal numbers of a's and b's
2. language a^n b^n (n > 1)
3. BRA1 : language of balanced brackets
4. BRA2 : balanced brackets with two sorts of bracketing symbols
5. PAL1 : palindromes over a,b
6. PAL2 : palindromes over a,b,c

their algorithms converged, in the majority of cases, to such combinations of parameters of their SCFGs which allowed them to accept more than 95% of the strings presented in the positive sample. Such results indicate that genetic algorithms can be used as a means for unsupervised inference of parameters of stochastic context-free grammars.

Note that Keller & Lutz confronted their algorithm, during both testing and training, only with the positive sample. While doing so for training is justifiable - since the objective of their study was to investigate whether grammars can be inferred solely from positive evidence - testing only on the positive sample makes it uncertain to what extent their inferred grammars overgeneralize. Another huge disadvantage in regard to the aims of our Thesis is the simple fact that their approach also seems to be very costly («number of parses that must be considered increases exponentially with the number of non-terminals» (Keller and Lutz, 1997)). And since they confronted their algorithms only with corpora composed of sentences of artificial and not natural languages, we shall not aim to imitate their approach of «tuning SCFG parameters» in our simulations.

By being context-free and not simply regular, the grammars studied in Keller and Lutz (1997) or (Choubey and Kharat, 2009) could be considered to be more similar to grammars of natural languages. Nonetheless, languages composed of palindromes and sequences of balanced brackets are still a far way off from natural languages, and the question «to what extent are results concerning GI of artificial languages applicable to GI of natural languages?» is far from being answered. Rather than trying to answer it, we now proceed to a discussion of two approaches where evolutionary GI has been applied upon natural language sentences.

The first method, proposed in Aycinena et al. (2003), focuses on the induction of CFG grammars from nine different part-of-speech-tagged natural language corpora. Sentences contained in these corpora, thus composed of sequences of part-of-speech tags (see Section 10.5), were used as positive examples, while randomly generated sequences of POS-tags yielded negative examples. The initial population was composed of linear encodings of randomly generated context-free grammars; for example, the string SABABCBCDCAE would represent this CFG:

S → AB
A → BC
B → CD
C → AE

During the evaluation of an individual grammar G, one would first try to parse both positive and negative corpora with the grammar G and subsequently calculate the final fitness by applying the following formula:

F = γ^{max(0, |α| − |P|)} C(α) − δ I(α)

«where P is the set of preterminals, C(α) is the number of parsed sentences from the corpus, I(α) is the number of sentences parsed from the randomly generated corpus, δ is the penalty associated with parsing each sentence in the randomly generated corpus, and γ is the discount factor used for discouraging long grammars.» (Aycinena et al., 2003)

In their study, Aycinena and her colleagues placed a randomly generated population of 100 individual grammars on a two-dimensional 10 x 10 torus grid. Subsequently, they applied the following select-breed-replace strategy:

1. Select an individual randomly from the grid
2. Breed that individual with its most fit neighbor to produce two children
3. Replace the weakest parent by the fittest child

In their framework, «cross-over is accomplished by selecting a random production in each parent.
Then a random point in these productions is selected and cross-over is performed, swapping the remainder of the strings after the cross-over points» (Aycinena et al., 2003). Every symbol of a resulting string can be subsequently mutated (mutation rate = 0.01). «A mutation is simply the swapping of a non-terminal or pre-terminal with another non-terminal or pre-terminal» (Aycinena et al., 2003).

Figure 23: Grammars induced from nine different POS-tagged corpora. Reproduced from Aycinena et al. (2003).

Figure 23 shows the number of generations each run was able to complete, the grammar G that last evolved, the percentage of positive examples parsed by G, the percentage of negative examples parsed by G, and G's fitness. While the results displayed above may seem encouraging, the authors have noticed that in the majority of cases their approach «gives a grammar that is very capable of detecting whether a sentence is valid in English, but it has not learned much English structure» (Aycinena et al., 2003). In other terms, Aycinena et al. succeeded in breeding grammars which have a certain discriminatory power but are practically useless as models of the English language. They even go so far as to state, in the ultimate paragraph of their work, that «It is still possible that English grammar is too complex to be learned from a corpus of words» (Aycinena et al., 2003) and that other external clues are necessary for successful GI of English.

A big disadvantage of the above-mentioned algorithm was also the fact that its input was sequences of already attributed POS-tags and not sequences of words themselves. Thus, even if the approach discovered some interesting grammars, a justified reproach could be made that in fact it only re-discovered the rules of the tagging system which was used in the first place. From the perspective of our Thesis, another disadvantage of Aycinena et al.'s approach is related to the fact that it is anything but a model of grammar development in a human child. For it is evident that children learn the grammar of their language in an incremental fashion - they are not confronted with the whole corpus from the very beginning. Nor does the corpus stay identical after each iteration of the learning process. On the contrary: as the child grows, its linguistic environment - the corpus - also grows. Both in length and complexity.

An interesting evolutionary approach to GI which both tries to create its own non-terminal categories and takes such «incrementality» into account is presented in the work of Smith and Witten (1995). In their scenario, candidate grammars are evolved after the presentation of every new sentence. Grammars have the form of LISP s-expressions whereby AND represents a concatenation of two symbols (i.e. a syntagmatic node) and OR represents a disjunction (i.e. a paradigmatic node). The whole process is started as follows: «The GA proceeds from the creation of a random population of diverse grammars based on the first sample string. The vocabulary of the expression is added to an initially empty lexicon of terminal symbols, and these are combined with randomly chosen operators in a construction of a candidate grammar... If the candidate grammar can parse the first string, it is parsed into the initial population.» (Smith and Witten, 1995)

Figure 24: Two simple grammars covering the sentence "the dog saw a cat". Fig. reproduced from Smith and Witten (1995).
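The binary AND|OR representation can be made concrete with a short plain-Perl sketch (a hypothetical illustration of the representation, not Smith & Witten's code) which enumerates all strings covered by such an s-expression grammar:

# enumerate all strings covered by a binary AND|OR s-expression grammar
sub expand {
    my ($node) = @_;
    return ($node) unless ref $node;           # terminal symbol
    my ($op, $l, $r) = @$node;
    my @left  = expand($l);
    my @right = expand($r);
    return (@left, @right) if $op eq 'OR';     # paradigmatic node
    my @out;                                   # syntagmatic node (AND)
    for my $x (@left) {
        for my $y (@right) { push @out, "$x $y"; }
    }
    return @out;
}

my $grammar = ['AND', ['OR', 'the', 'a'],
               ['AND', 'dog',
                ['AND', 'saw', ['AND', ['OR', 'a', 'the'], 'cat']]]];
print "$_\n" for expand($grammar);
# the dog saw a cat / the dog saw the cat / a dog saw a cat / a dog saw the cat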
Figure 24 displays two sample grammars for the sentence «the dog saw a cat». S-expression sequences representing individual grammars are subsequently mutated. Couples of parent grammars can also switch their nodes - the probability of being chosen for such cross-over is inversely proportional to the grammar's size: shorter grammars are preferred. Crossover is non-destructive; parents thus also persist. The events of reproduction are grouped in cycles; at the end of each cycle, the population of candidate grammars is confronted with a new sentence from the sample of positive evidence. In their article, Smith & Witten demonstrate how, after the presentation of the sentences «the dog saw a cat», «a dog saw a cat», «the dog bit a cat», «the cat saw a cat», «the dog saw a mouse» and «a cat chased the mouse», their system naturally converged to a grammar which had quite correctly subsumed determiners like «a», «the» under one group of OR nodes, verbs like «chased», «saw», «bit» under another, and nouns like «dog», «cat», «mouse» under yet another.

The grammar which they finally obtain is not ideal but, as they argue, it could get better if confronted with new sentences. «It is an adaptive process whereby the model is gradually conditioned by the training set. Recurring patterns help to reinforce partial inferences, but intermediate states of the model may include incorrect generalizations that can only be eradicated by continued evolution. This is not unlike the developing grammar of a child which includes mistakes and overgeneralisations that are slowly eliminated as their weaknesses are made apparent by increasing positive evidence.» (Smith and Witten, 1995)

While strongly agreeing with the above citation, we nonetheless cannot ignore certain drawbacks of Smith & Witten's approach. Most importantly, by using LISP's s-expressions as a way of representing their grammars, they ultimately have to end up with highly bifurcated binary trees (since the arity of the AND|OR operators is 2). Thus, one can easily subordinate two terminals to one non-terminal (e.g. OR(cat,dog)), but in the case of three subordinated terminals, one is obliged to use a complex expression involving three non-terminal nodes (e.g. OR(OR(cat,dog),OR(mouse,NULL))). Therefore, in such an s-expression-based representation, any class having more than two members is necessarily represented by a longer sequence → is more prone to mutation → is highly «handicapped» in regard to much shorter expressions subordinating just two nodes. Another drawback of Smith & Witten's work which cannot be ignored is related to the fact that while they used English language sentences to train their system, the sentences were very simple and the relevance of their findings to GI of «natural» English is more than disputable. In fact, they seem to achieve, with quite complex evolutionary machinery, even less than Wolff's deterministic SNPR model had achieved almost a decade before. Notwithstanding these two drawbacks, we nonetheless consider as particularly inspiring their approach aiming to solve the problem of GI of natural languages by uniting, in one framework, the notions of adaptability, evolvability and statistical sensitivity to recurring patterns.

We summarize: all five above-mentioned approaches indicate that evolutionary computing can potentially yield useful solutions to the problem of Grammar Induction of both artificial (regular, context-free) and natural language grammars.
The length of the candidate grammar is frequently used as an input argument of the fitness function. Note also that the solutions of both Dupont and Smith & Witten use a sort of «incremental» procedure whereby individual solutions gradually adapt to every new sentence. Especially Dupont's findings are reminiscent of what was already said about the «importance of starting small» when discussing the computational model of Elman (Section 10.6.1).

On the other hand, none of the above-mentioned models was confronted with a corpus of child-directed (i.e. «motherese») or child-originated utterances. The objective of our Thesis shall be to fill this gap.

end evolutionary models of gi 10.6.2

Aside from these non-evolutionary and evolutionary algorithms for grammar induction, there also exist first tentatives to solve the GI problem by means of Grammar Systems (10.2.3). The pioneering work in this regard is the study of Sosík and Štýbnar (1997). Contrary to the majority of GS-inspired authors, who focus on the productive (i.e. generative) aspects of GS, Sosík & Štýbnar focused on GS's language-accepting properties. In a hybrid connectionist-symbolic architecture, they used a «neural pushdown automaton» to infer a language colony (10.2.3) able to cover some simple artificial context-free languages like balanced-parenthesis or palindrome languages. While their results demonstrate that it is indeed viable to perform grammatical inference by means of grammar systems, the artificial nature of the input languages makes it difficult to see whether their approach could be of any use in modeling the acquisition of natural language.

This being said, we conclude with the statement that as of 2015, ADIOS (Solan et al., 2005) seems to be the only full-fledged computational model of unsupervised grammar induction which is

• publicly available (at least partially^23)
• capable of inducing grammars even from child-speech transcript input data (Brodsky et al., 2007)

For this reason we shall compare, in the second volume of this Thesis, results of our ELSA-based simulations with those induced by ADIOS.

end grammar induction 10.6

As of 2015, NLP is one of the "hottest" and most active sub-disciplines not only of computational linguistics, but also of computer and, potentially, cognitive sciences in general. Without being aware of it, the lives of billions of people are influenced on a daily basis by client platforms, applications, marketing bots or search engines which implement some kind of NLP technique. In NLP, accuracy - defined, for example, in terms of precision and recall (10.3.1) - is always important because it is easier for human users to interact with more accurate systems. But in real-life applications, accuracy is not the only constraint which has to be taken into account: speed and computational complexity of the task are also crucial.

23 A demo version of ADIOS can be downloaded from http://adios.tau.ac.il/download.html

To support our point, let's take the Turing test as an example: question-answering systems which need hours to generate the most accurate and valid answer shall not pass the test; the test shall be passed by machines which offer an approximate answer in a few seconds.
Hence, even in case of the challenge from which the whole discipline of NLP originates, accuracy of one's model is not a goal per se and is, in fact, useless if one forgets that the expression «natural language» does not denote a piece of dead, static corpus stored on one's disk, but rather a set of sequences of symbols always expressed in a context, and always expressed with an intention. end natural language processing 10.6.2

Computational linguistics is a symbiont of computer science and linguistics. In this chapter, we have explored its three principal components:

1. Quantitative and Corpus Linguistics (QCL), devoted to discovery of patterns and laws within linguistic corpora

2. Formal Language Theory (FLT), devoted to formalization of principles of syntax in terms of set theory and algebra

3. Natural Language Processing (NLP), devoted to amelioration of machines' faculty of processing the information which they exchange with human beings

During the introduction to QCL, Zipf's law («the frequency of a word is inversely proportional to its rank», i.e. f(r) ∝ 1/r) and the logistic law («the increase is first slow, then fast, then slow again») were discussed in somewhat closer detail. It was noted that both of these laws are relevant descriptive mechanisms for diverse diachronic processes, in linguistic ethnogeny as well as in linguistic ontogeny. The fact that both of these laws yield very successful models for description of ecological phenomena was also brought to attention.

The brief overview dedicated to FLT has offered only the very basic definitions: language L was defined as a potentially infinite set of strings of symbols chosen from a finite alphabet; grammar G_L was defined as a formal system containing rules of production able to generate, as its theorems, exactly all and only strings of L. Classes of regular, context-free, context-sensitive and unrestricted grammars were described and the usefulness of such a hierarchical view of things was mentioned, notably in relation to artificial (e.g. programming) languages. A brief excursion to multi-agent, non-monolithic, parallelized and modular «Grammar Systems» has illustrated that «miraculous» things – like the ability to generate an infinite language by the interlock of two finite grammars – can happen whenever individual component grammars share their input/output string environments.

The major part of the chapter was dedicated to NLP. Methodological aspects which NLP shares with the machine learning field of artificial intelligence were first pointed out. Subsequently, three classes of problems were addressed:

1. the problem of geometrization of meaning was principially presented as projection of semantic features into N-dimensional metric spaces

2. the problem of part-of-speech induction was principially presented as projection of morphosyntactic features into N-dimensional spaces + subsequent attribution of specific partitions of morphosyntactic metric spaces with specific non-terminal labels

3. the problem of grammatical induction was principially presented as a problem of part-of-speech induction + gradual optimization of content and order of substitution rules

A few exemplar solutions to these problems were mentioned, both deterministic and, if existing, also non-deterministic and evolutionary. It was noted that some encouraging results were already attained but that there is still plenty of work to be done. So let's do it. end computational linguistics 10.6.2

SUMMA II

Different paradigms have been presented in preceding chapters:

1. universal darwinism
2. developmental psycholinguistics

3. computational linguistics

the 1st offering the theoretical framework; the 2nd offering the data, the materia, the object of interest; the 3rd offering the method by which the validity of the theory in relation to the materia is to be ultimately demonstrated.

The framework: a theory of intramental evolution. Id est, a theory stipulating that not only genes or memes evolve, but that there exists yet another, third kind of evolutionary force which moulds man's destiny. An evolutionary force which is neither phylogenetic like the unceasing development of the DNA-molecule, nor ethnogenetic and cultural like the memetic evolution occurring between mutually communicating minds. An evolutionary force which is profoundly ontogenetic: a sort of process limited by the life span of the individual in whose mind the process occurs.

The materia, the object of interest: a mind of a child. Id est, a mind in constant change, an exploring mind, a playful mind. A mind that masters, in less than three years of existence and practically completely ex nihilo, the most fundamental structures of her mother language. Indeed, in less than three years do the representations encoding the universally perturbing cry of a newborn into our world evolve into ever more precise, robust and well-adapted prosodic, phonological, phonetic, morphosyntactic, semantic and pragmatic representations. Being unafraid of committing an error and feeling no shame nor guilt when doing so, a soul of an infant, previously so alien to our world, gradually and swiftly learns how to live in it. Gets grounded in it, gets informed how to live in it with us.

The method: a computational simulation. Id est, a simulation aiming to reproduce, in silico, at least a few key processes through which a child learns its mother language. A simulation that would succeed to partition the world of its representations into categories or clusters similar to those which an organic child would construct, if ever presented with the same data. A simulation able to discover and provide grammars whose products would be indistinguishable from utterances produced by normal human children in course of their daily interactions. If such a goal were to be attained by means of evolutionary computation, the success of such a simulation could be used as a non-invasive, indirect proof that a sort of ontogenetic, intramental evolutionary process governs the process of language acquisition in human children.

A theory of intramental evolution, a mind of a child and a computational simulation: among this trinity of cornerpoints embedded in a semantic space representing our current knowledge, one can observe overlapping regions, one can observe common topics. To start with, note the notion of a gradual yet continuous change: no matter whether in subdisciplines of UD, DP or CL, outputs of phase T_N serve as inputs for the next phase T_{N+1}. There exists an analogy between successive stages of a developing child and successive iterations of an NLP algorithm: both invest present energy into processing of knowledge attained in the past so that more accurate performance can be attained in the future. Cognitive representations continuously change but the processes which make the change possible are always present. Note that such gradually changing continuity does not exclude that, from time to time, paradigm-shifting, phase-transiting phenomena shall be observed.
On the contrary, such moments of global equilibration of the whole psycholinguistic system are necessarily implied by any theory that considers the child's linguistic faculty in moment T to be a nexus of parallel activity of many modular entities whose means of interaction are complex and potentially non-deterministic. The notion of «parallel activity» is thus equally crucial both for the theory, as well as for correct understanding of the observations and simulations that shall follow. That the human brain is a device which processes information is a well-known fact; the sequential nature of language can, however, lead one to the conclusion that language is processed in a monolithic, serial fashion. To a somewhat «monotheistic» conclusion that to every language utterance (in production) or to its understanding (in comprehension) there leads only one correct sequence of applications of rules extracted from one correct grammar. We consider such conclusions fallacious. Knowing how nature usually tends to proceed, we do not consider it necessary to postulate a cold, fixed, static, formal, universal and omnipresent order there, where much more local notions of dynamism, variation, interaction, exchange and convergence clearly suffice. Given that the notion of «convergence» is flexible enough to account for the fact that, in course of time, completely different species (e.g. humans and cephalopods) «obtained» an organ with identical function (e.g. the eye) by following two completely different evolutionary trajectories, we believe that it should also be flexible enough to explain the «mystery» of language acquisition: Children learn language by converging to it. And as we shall now proceed to demonstrate, it is through interaction with peers and parents that the point of convergence is to be discovered. end synthesis of part ii 11

Part III

OBSERVATIONS

A child's spontaneous remark is more valuable than all questioning in the world.
Jean Piaget

This part shall describe certain observations related to ontogeny of linguistic structures and interpret them in terms of the theory of intramental evolution. Its first chapter is principially a longitudinal qualitative study of one particular human child. At its beginning, a non-invasive, phenomenological, observational data-collecting method shall be described and a few salient moments of the subject's prenatal and postnatal development shall be mentioned. The major part of the study shall be devoted to the subject's linguistic development during the toddler period, id est between 10 and 30 months of age. Among others, some of the subject's first words, first phrases, first pivot grammars and first variation sets shall be presented. A set of «evolutionary» notions shall be developed and defined in order to facilitate the interpretation of the obtained data in evolutionary terms. Notions like intralexical|intraphrastic|interlinguistic crossover shall thus be introduced and multiple real-life cases shall be furnished for each notion. These notions will play an important role in the following chapter devoted to quantitative observations. When possible, they will be transcribed into the form of PERL-compatible regular expressions. The corpus of child-language transcriptions (CHILDES) shall be subsequently processed by such regexps in a series of simple and reproducible data-mining, pattern-extracting experiments. Ideally, patterns and statistical regularities shall be discovered which are not only language-specific but also language-independent.
That is, occurring not only in English but ideally in all languages attested in the CHILDES corpus.

12 QUALITATIVE

12.1 method and data collection

Limits of traditional method

In no domain of scientific endeavour are the limits of the Galileo-Cartesian dubitating yet experimental method as visible and problematic as in studies of subtle mental and psychic layers of human subjects. And in case of studies of human children, this problematic situation is marked to the very extreme: due to a sort of psychosocial uncertainty principle, the very act of observation significantly modifies the properties of the observed subject. Trying to fix a healthy, curious, vivid human child in an artificial experimental setting is plainly and simply contradictory to any tentative of evaluation of the child's natural behaviour.

Significance levels are arbitrary

Neither is the traditional quantitative «psychological» paradigm – in which one proves one's hypothesis through statistical comparison of a study group with a control group – reassuring. Even if all went well and one would succeed to solve the unsolvable and limit the influence of external and hidden variables to a very minimum, and even if all children would behave as expected during the experiment (a very improbable «if» indeed), and even if all subsequent statistical evaluation was sound and solid, one would end up with one null hypothesis, a few coefficients and a p-value. «So You state that those kids cross-over such linguistic structures and those others don't. And that the difference is significant because the p-value is 0.045. But, You know, our community has decided not to bow in front of the Fisher-defined p<0.05 significance level threshold (Fisher, 1925)» could be a provocative, yet valable denial of such a result.

Problem of experimental invasivity

Asides and above all such criticisms thrones the ethical problem of the invasivity of one's experiments. One cannot have a theory that postulates that any stimulus – no matter how small and ephemere – can influence the child's lifelong trajectory and still aim to prove such a theory by means of putting a child into artificial, non-human, mentally perturbing experimental conditions. Of course, in the mental world of experimentators who depart from the axiom that children neither feel nor reason, such a methodology is still allowed. Others can also somehow bridge the cognitive dissonance which necessarily follows. But for a scientist who departs from the belief that children feel and reason much more than adults ever shall – and such was, indeed, our bias of departure – experimental invasivity is an important κριτήριον which significantly constrains one's ways of doing responsible and sustainable science (Hromada, 2010b).

Given that our objectives were not (medi|clini)cal but rather those of recherche fondamentale, we have not found any reason which could potentially justify the use of any kind of invasivity. All these considerations + practically zero funding taken into account, the traditional quantitative methodology of experimental psycholinguistics was discarded as inappropriate cca in the 2nd year of our doctoral studies. Such a methodological design choice was further motivated by the information announcing the «good news» that a child was to be born, in whose closest presence we could spend the years to come.
This has put us into the position of savants like Piaget, Braine, Labov or Tomasello, who all had the honor and luck to confront their theories with years-lasting, longitudinal observations of their own children. Thus, our rejection of the purely Cartesian attitude seems not to have had disastrous consequences, neither for the validity nor for the reproducibility of the observations which have followed.

And what have followed is this: from the moment of the subject's birth (0;0;0), the author of this dissertation has kept a journal. The journal was first written as an objective «observation log» but quite soon (0;7) it obtained the form of a personal monologue addressed, in the 2nd person singular, to the adult person which the child herself shall, ideally, become. Entries in the journal have been written down according to a sort of biased, random sampling procedure: that is, whenever the subject generated an event which was sufficiently salient and whenever all other conditions were fulfilled (i.e. father observed the event or mother told about it to the father; journal was in the proximity; pen or pencil was in the proximity; observer had enough time to note the observation down, etc.), then and only then was the entry written down. Given such a relaxed methodology, 123 hand-written journal pages have been filled with 167 records (14 recorded by mother; 153 by father) before the subject attained the upper bound of the toddler period (2;6;0).1 What shall follow in this chapter are principially biased descriptions and biased interpretations of such biased observations.

12.1.1 biases

parvuli deûm regnum

In the retrospective analysis of the observation journal, which started at (2;6;0) and ended at (2;6;14), the observer is struck by the omnipresence of the following biases:

1. the observer considers the subject to be endowed with consciousness

2. the observer considers the subject to be somebody who shall evolve into a conscious adult

3. the observer is the parent of the child

4. the observer and the vast majority of other personnae mentioned in the journal seem to be strongly attached to the child with a bond which is difficult to describe without referring to the meaning of the word «love» (14.4.2)

5. the observer focused on noting down the observations which match his theory and was, in fact, unable to note down observations which do not match the theory

1 Asides the hand-written journal, cca 20 gigabytes of audiovisual material were collected, often in situations when the subject played, ate, was in REM-phase, danced or simply toddled and babbled. With two or three minor exceptions, this data shall not be published in the present work.

The disadvantage of such biases is that they distort the objective state-of-affairs. But this disadvantage can be reduced if such biases are known. And in case of biases 1-4, the disadvantage can even turn out to be advantageous: for these biases are well-known to the vast majority of those who were ever blessed with having a child. Thus, instead of making our observations more subjective, they help us to establish a common prism through which our communicative intention could potentially be understood. The 5th bias, of course, is problematic. From our current perspective there is little which can be done to combat such a sort of cognitive blindness which has made us ignore practically all data which does not fit our theory.
Thus, instead of solving the situation by pretending that we have observed all that was to be observed, we prefer to honestly admit that, in regards to all that could have been noted down but wasn't, the observer has often acted as a strongly biased, cognitively blind, hormonally reprogrammed Fachidiot. end biases 12.1.1

It has been known since time immemorial that a conscious, reflected, sattvic awareness of one's biases is a condition sine qua non of a viable and valable methodology. But it was Husserl and his followers who gave the method its western name by calling it the phenomenological method. It is, indeed, a sort of phenomenological methodology which can be understood as the method behind the words to come. end method 12.1

12.2 subject

The subject was conceived as a result of an emotionally charged yet fully conscious decision of two adult individuals. In the prenatal period, the mother had included consumption of magnesium-rich mineral waters and iron-containing supplements into her otherwise healthy, dairy&vegetable&fish-based diet. Pregnancy progressed without any major complications and in the 6th month (0;-3), the father could feel, during a week-lasting music festival, that the child was already able to attune its kicking with musical beats. Birth occured approximately three weeks before the expected term and was probably precipitated by the mother's passing in the proximity of an active asphalt-drilling machine. Birth itself lasted an exhausting 28 hours: the mother asked for an epidural injection after 23 hours of tentatives to mentally influence the extent of cervical dilation. From now on, the initial letters I.M. of her two names shall be used to refer to the healthy girl thus born.

Given that the first-born came to the world approximately two months before winter solstice, her first tentatives to move corresponded with the increased luminosity of longer days. Standard unfoldment of the universal sensori-motor algorithm followed: rotation from back to belly at (0;5), first unsuccessful crawling tentatives at (0;5;25), sitting on a chair at (0;7;14), crawling on four at (0;8), autonomous standing at (0;10) and the first step at (0;11;20). Lateralization expressed by a right-handed object-manipulation preference was noted down at (0;6;25). Eruption of the first teeth was noted at (0;9;17). In spite of this fact, breast-feeding continued until the material bond between mother and daughter was broken – after multiple unsuccessful tentatives – at (1;10) by a more or less bilateral agreement of both participants involved. In the toddler period, neither IM nor the members of her closest social surroundings suffered any serious illness or traumatic experience. IM can thus be considered as what is often known in the developmental literature as a «normal child». end subject 12.2

12.3 linguistic environment

IM's linguistic competence developed in a multilingual environment. Both parents are of Slovak (western Slavic) origin. However, since the mother spent more than half of her life in Germany, and since the child was born and raised in Germany, IM-directed «motherese» was at least 60% German-based. The father migrated to Germany just a few months before the child was born and was thus struggling with the problem of secondary language acquisition in practically the same period as the child was struggling with first language acquisition. Between themselves, the parents spoke mostly Slovak. Father's IM-oriented language was also mostly Slovak.
173 174 qualitative But in majority of other regular daily interactions, IM was mostly exposed to german. In non-negligeable amount of cases, IM could observe one or both of her parents verbally interact in czech, english, french, spanish and, in much lesser extent, polish, ukrorussian, sanskrit and tibetan (sorted in descending order according to structural exposure frequency). IM started going to creche two days after her first birthday (1;0;2). There, she was mainly surrounded by peers verbally interacting by means of german-ressembling idioglossias. end linguistic environment 12.3 12.4 crying and babbling After few months of more&more differentiated crying forms, "happy cooing" was, along with smiling, noted down at (0;2;18). Three months later, as soon as of (0;5;11) mother had noted down the presence of canonical babbling sequences bäh, bäh, bäh; dwn dwn dwn; mamamama. In the same record the mother conjectures that the sequence hop hop hop corresponds to knee-bending and tou tou tou corresponds to stretching of hands. Being more sceptical about IM’s ability to verbally communicate, paternal record from (0;5;25) observed in child’s production the presence of vocalizations with occlusive labial, velar, glottal and laryngal features. Glide-like dwndwn like and trill-like drndrn were also noted during the period. Paternal scepsis notwithstanding, a synchronicity between the overall context and child’s communicative intention had made the father to note down, already at (0;7;14), the hypothesis that bwí could potentially mean porridge [Breie]. First sequences composed of different syllables were observed at (0;7;23). At (0;10;13), babbling sequence of a sort tititatatetedededidi was recorded and a week later, syllables ma;pa;ba;ta;da;te;ti;ne;de; me; pe;be;we;bwe were enumerated as most salient. As late as (1;8;7) such canonical babbling was listed as one among multiple modes of communication: 1. crying of a hungry newborn 2. squeling disapproval of a pampered child 3. "mentor mode" (observable especially when IM communicated with smaller children, acommpanied with vivid gests) 4. melody singing (especially when in stroller or in bike sit) 5. canonical babbling 12.5 first words In spite of the fact that both bursts of cry as well as production of expressions highly repetitive yet gently variating syllabic streams was observed as far as the end of toddler period2 , we conclude that the babbling schemas had lost their dominant position not later than at (1;6). For at this period it became evident that at least certain forms of IM’s language had lost their private, idioglottic character. Convergence of IM’s neurolinguistic structures towards an optimal communicative system was on its way. end babbling 12.4 12.5 first words As was indicated in the previous section, mother had detected the sequence "mama mama" as early as of (0;2;18). Father had noted down the marked repetition of sequence m@m@ at (0;7;19) and few days later, at (0;8) had noted down that m@m@denotes disagreement. However, it was only at when (0;9;17) father had noted down that "it is possible that the term m@m@- which becomes more and more phonetically similar to MAMA3 - already denotes the mother not only as a source of food, but also as a person whom You love and whose presence makes You happy". Given that IM had often used, in following months, the word "mama" in contexts as diverse as 1. request for food 2. call for help 3. declaration of joy 4. looking at father’s photo (c.f. below) 5. 
5. approaching «home»4

it seems to be the case that even the meaning of such a fundamental signifiant is not completely fixed and varies in time. But given that IM's mother has practically always interpreted such a term as a signal which made her personally and immediately concerned, the term potentially got quite fixed and served as a sort of label denoting IM's mother. Later, already in the two-word phase and after she had «discovered» that every peer in the creche has his own distinct «mama», IM started to denote her one and only mother with the term «MAJNE MAMA».

2 C.f., for example, a sequence recorded between the 30th and 70th second of the video downloadable at http://wizzion.com/im/latebabbling.avi. Recorded at (2;5;8).

3 We shall use upper-case letters to mark such signifiers which most probably already encoded a specific meaning. Lower-case transcriptions shall represent sequences whose meaning, at the moment of production, seemed to be absent or highly ambiguous.

4 This was noticed at (1;7;12) when the father used the «Google Streetview» application to perform a small experiment. IM could see, on the monitor, the streets she already knew from real life. Once the walk ended in front of the entrance to the house where IM lives, IM pointed to the monitor and cried «MAMA!».

When it comes to the paternal term, which is «tato» in Slovak, the father had noted down the production of the sequence «tata» as soon as (0;7;23), the mother at (0;8;21). The first indication that IM's brain associates the term with the father was furnished by the mother who, during a trip to the seacoast where the father was absent (0;9;9), saw IM looking at father's photo, uttering «TATO» and then observing the sea for a long time, silent. Three months later, at (1;0;23), such a romantic view was somewhat perturbed by father's observation that IM used the term «mama» when looking at a photo on which only the father was depicted. Thus, it was only after months-lasting experimentation with the pronunciation of dental occlusives in sequences like «ata» (0;7, 1;0;1), «ada; dada» (0;8;21), «toto; tete; tata» that the father noted, at (1;3;16), that «the most popular word is currently TATO and it is quite possible that it also means what it is supposed to mean, since You often say it either when I disappear from Your view, or when You want something from me».

A first non-parental term whose repetitive usage was considered as worth recording both by the mother (0;8;21) and soon afterwards by the father (0;8;29) was ENTE. Since at such an early age IM had used this term – which means «duck» in German – in an exclusive and strongly repetitive fashion when she was confronted with books with ducks, bathtub ducks as well as real organic instances of the species Anas platyrhynchos, it can be stated that IM succeeded to create a cognitive representation of the word ENTE whose extension strongly overlaps with the one held by IM's social surroundings. This can potentially be explained as a consequence of the «duck-feeding and duck-observation» rituals in which IM participated on a regular weekly basis since the third week of her life. But given that ducks are often mentioned in «lists of first words» (c.f. Table 3) or «first word combinations» (c.f. for example (Braine and Bowerman, 1976, pp. 23, 32, 44, 49)) presented by other western authors, one is tempted to state that IM's obsession with the form ENTE is to be explained not only as a sort of caprice of the ontogeny of an individual psyche, but can also have cultural or even phylogenetic roots.
At (1;0;23) the father noted down that IM often uses the term BABA when speaking to and/or demanding the presence of her grandmother. This is consistent with the fact that the term is a Slovak colloquialism denoting a grandmother, or an old woman in general5. Later, the term was often used as a part of the fixed construction «HALO BABA» (1;2;21, 1;4;10), potentially imitated by rote from mother's telephone talks to her own mother.

5 Note that in many languages, the term «baba» is often associated to meanings which C.G. Jung would most probably understand as instances of the archetype of «old and wise authority». Thus, asides its well-known use as a honorific in Sanskrit, Persian, Turkish or Arabic, the term baba denotes an old and wise man among the Shona people of Zimbabwe or the Yoruba people of Nigeria, and potentially in other ethnics as well.

People: MAMA (0;9), TATO (M0;9), BABA (1;3)
Food: BAJA/ANAN [banana] (F1;5), MI [milk] (F1;5), BROT [bread] (F1;4)
Body parts: NENE (F1;0;23), HÁE [hair] (1;5)
Places: KITA [creche] (F1;4), ŠPIPA [playground] (F1;5)
Animals: ENTE [duck] (MF0;8), uau-uau [dog] (F1;5), mjau [cat] (F1;5)
Toys: BAJ [ball] (1;6), TEDY [teddy-bear] (1;6)
Household objects: KE [keys] (1;5)
Routines: halo (F1;4), e-e [refusal] (M1;4), najn [no] (M1;5)
Activities: papa [to eat] (1;2), hají [sleep] (1;5), daj [give!] (F1;5), auke [sway!] (F1;5)

Table 10: IM's productive lexicon before attainment of 18 months. Words in the brackets denote the most plausible meaning, as decoded by either father (F) or mother (M). Compare with Table 3.

Table 10 contains the list of words noted down before IM attained one and a half years of age. The list is fairly standard and resembles other such lists reported in the literature. Food- and game-related imperatives were common, as well as animal-like onomatopoeias. In the majority of cases, an initially idioglottic, private sound-form of a produced word developed in a sense which would ideally match the «ideal» sound-form of the parents. IM's C- and P-structures adapted to her surroundings. There occurred, however, multiple cases where C-structures of the parents adapted to private P-structures of the child. Most salient among these was the case of the word NENE, noted down quite early (F6 1;0;23), referring to mother's «breast». The mother swiftly included the paedologism into her own productive lexicon, as her (M1;5;24) journal entry documents, where she used the term as a component of the wider declinated expression «meine nene».

6 From now on, all references to the observation log shall be preceded by the consonant specifying the author of the entry, e.g. (f)ather or (m)other.

12.5.1 nene & taboo (aph)

Humans are essentially mammals. In a healthy, normal situation, the first communicative channel between the child and the world passes through mamelles de mama. And indeed, many are the indices that the bond created by and during breast-feeding can significantly influence the ontogeny of the child's cognitive and linguistic structures (Hromada, 2009). It is thus somehow surprising to see that the topic of breast and breast-feeding is either ignored or tacitly cast aside by major figures of contemporary DP. Indeed, one shall not find a single occurrence of the word «breast» in (Tomasello, 2009) or (Karmiloff and Karmiloff-Smith, 2009).
Also in Pinker's Language Instinct, which pretends to introduce The New Science of Language and Mind, the breast is mentioned only once, in a context quite unrelated to ontogeny («Proto-Indo-European melg "to milk" resembles Proto-Uralic malge "breast" and Arabic mlg "to suckle"» (Pinker, 1994)). Thus the only monography which somehow saves the score and mentions the breast in a developmental context is (Clark, 2003) where, in table 4.2 on page 83, the term «nenin», produced by a French child, is translated as breast.7 end nene & taboo 12.5.1

The fact that the term breast seems to be taboo for contemporary psycholinguists is even more striking when one realizes that it was already one of the fathers of the discipline, Roman Jakobson, who pointed out that «often the sucking activities of a child are accompanied by a slight nasal murmur, the only phonation which can be produced when the lips are pressed to mother's breast or to feeding bottle and the mouth is full. Later, this phonatory reaction to nursing is reproduced as an anticipatory signal at the mere sight of food and finally as a manifestation of a desire to eat, or more generally, as an expression of discontent and impatient longing for missing food or absent nurser, and any ungranted wish. Since the mother is la grande dispensatrice, most of the infant's longings are addressed to her, and children gradually turn the nasal interjection into a parental term, and adapt its expressive make-up to their regular phonemic pattern.» (Jakobson, 1960)

Asserting that our observations of IM's interactions confirmed Jakobson's insight, we propose the following developmental analysis of IM's πρώτα ονόματα: the left part of Figure 25 suggests that the development of the structures MAMA and NENE can be understood in terms of a general process during which the sucking reflex extends into a vocalized labial P-structure (M@M@). Subsequently, this «centroid» schema differentiates into two schemas, MAMA and NENE. The right part of the figure conjectures that such a differentiation can be explained in terms of replication, variation and selection:

Figure 25: First differentiation between the whole and its part (a) and its evolutionary explanation (b). [The figure depicts the innate schema M@M@ replicating into multiple M@M@ copies which subsequently mutate into MAMA resp. NENE.]

7 Note that none of IM's parents was aware that the term NENE means «breast» in French argot. This was «discovered» only post hoc, after the term NENE was already unambiguously used and understood by all family members. Given that the same signifiant was found out to denote the same referent in two independent language systems (i.e. IM's idioglossia and French argot), the theory of the «arbitrariness of the sign» (de Saussure, 1916) is to be partially revisited.

1. first, the initial structure (M@M@) gets reproduced

2. some of the resulting replicas are subject to mutation (a shift towards open vowels in case of the emergence of MAMA, a shift towards alveolar nasals in case of NENE)

3. structures which turn out to be useful (e.g. they increase the probability of being breast-fed) get reinforced, fixed and succeed to survive in time (contrary to «less fit» structures not resulting in the fulfilment of the child's communicative intention)

This being said, we shall now focus on other phenomena of IM's linguistic development which seem to fit into such an evolutionary framework; a minimal computational rendering of the differentiation scenario is sketched below.
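The differentiation scenario of Figure 25b can, as announced, be rendered as a minimal replicate-mutate-select loop. The Python sketch below is purely illustrative: the mutation table, the mutation rate and the set of environmentally reinforced forms are our own toy assumptions, not measured quantities.

import random

# Toy rendering of Figure 25b: the schema M@M@ is replicated, replicas
# occasionally mutate, and only replicas "understood" by the environment
# (i.e. reinforced, e.g. by resulting in being breast-fed) survive;
# all other replicas fall back on their parent form.
MUTATIONS = {"@": ["a", "e"], "m": ["n"]}   # assumed directions of phonic drift
UNDERSTOOD = {"mama", "nene"}               # forms reinforced by caretakers

def mutate(form, rate=0.5):
    return "".join(random.choice(MUTATIONS[c])
                   if c in MUTATIONS and random.random() < rate else c
                   for c in form)

population = ["m@m@"] * 8                   # replication of the innate schema
for cycle in range(20):
    replicas = [mutate(f) for f in population]
    population = [r if r in UNDERSTOOD else p   # selection step
                  for r, p in zip(replicas, population)]

print(set(population))   # usually contains 'mama' and/or 'nene' besides 'm@m@'

end first words 12.5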
12.6 repetitions and replications

Repetitio est mater studiorum et repetitio replicatio est. Repetition is a form of replication (3). It may be argued, of course, that this formula is not always valid: take as an example an agent without any memory whatsoever which just executes random movements and, by sheer caprice of hasard, repeats the same movement it has already executed sometime in the past. But in case of agents with a mnemonic substrate powerful enough to project the temporal onto the spatial (e.g. the human brain), we see no a priori reason why the formula should be rejected. Hence, repetition of information is a form of replication of information. By repeating information, children's brains replicate information. We distinguish two major types of processes behind replications:

1. intersubjective replications

2. intrasubjective replications

As everything in the human mind, these processes mutually interact. But in early development, so we argue, they can be discriminated as independent. In an intersubjective replication, a structure S is articulated, performed and|or expressed by two or more distinct subjects. Thus, when mother's saying of the word «TATO» is followed by the child's utterance of the same word, one observes a minute intersubjective replication. Intersubjective repetition can thus be understood as equivalent to imitation. One observes an intrasubjective replication whenever a structure S is articulated, performed and|or expressed by one subject in two distinct moments. A replication of the syllable MA in the word MAMA can be understood as one amongst its most simple cases. Canonical babbling or many among the Piagetian «circular reactions» can also be understood as expressions of such a general cognitive process.

In IM's case, it was around her first birthday when the interplay between these major processes started to express itself in observable forms of verbal interaction. More concretely, at (f1;0;7), IM produced a bi-syllabic MAMA after hearing a bi-syllabic MAMA and a tri-syllabic MAMAMA after hearing a tri-syllabic MAMAMA. Her internal and potentially innate tendency to repeat was exposed to parsable and reproducible stimuli: the result was one among the first bipartite micro-dialogues noted down. The interplay between the two processes became more salient half a year later when IM started to consistently use her private words in recurrent contexts. Parents could therefore quite easily decode the «meanings» of such intralexically repetitive terms as BIBIBIBI (f1;6;12) [in presence of a «baby»], ANAN (f1;10;15) [when requesting a «banana»], VAVA (f1;8;0) [when playing with «water»] or NANA (f1;8;6) [when looking into the mirror]8. Given that these words do not exist per se in either German or Slovak (but, as will be shown in 12.10.1, some can be understood as cross-over forms between the two languages), they had gradually disappeared from IM's lexicon. The disappearance of these pre-syntactic, protolexical structures notwithstanding, intrasubjective replication did not cease to play an important role in the development of IM's linguistic faculty. Consistently with what is known in the literature, such repetitions prevailed whenever IM became aware of the existence of a new form whose articulation was to be perfectioned and mastered.

8 Later, at (f1;11;16), it was noted down that IM tended to use the term ICH when she was the agent of the action and NANA when she was the receptor or benefactor of the action.
For example, after having understood that the difficult-to-pronounce form AUTOBUS refers to instances of a large, noisy, useful yet dangerous species, IM produced (f2;0;2) the term 63 times in less than 30 minutes. Given that during this time interval there was sometimes no autobus in sight, and given that the articulatory sequences were sometimes interrupted by minutes-lasting pauses or by sequences dedicated to other topics, one is obliged to explain such loops in terms of structures and processes whose temporal span extends well beyond the millisecond- or second-span of the standard Millerian short-term memory.

At the end of IM's toddler period, we constate that plain intrasubjective replications are more and more rare. Sometimes they still occur when the child is playing alone, especially with water or her child-resembling puppets. Or they occur in situations where the term is too difficult to pronounce on its own (i.e. IM's pronunciation of SAMBASAMBhAVA when exposed to the picture of the buddhist saint Padmasambhava (f2;9;3)). And some intrasubjective repetitions are still observable in communicative scenarios (e.g. saying MEINE, MEINE in order to emphasize that a certain toy or food should not be taken away). Whether these cases still represent a rudiment of a subjacent cognitive processus, or whether they are simply expressions of structures which were culturally acquired9, opens an argument which we have no intention to enter.

9 Note that rhetorical figures as diverse as antanaclasis, epizeuxis, conduplicatio, anadiplosis, anaphora, epistrophe, mesodiplosis, diaphora, epanalepsis, diacope or chiasm all exploit, in one way or another, the impact of repetition upon one's C-structures. Note also that «reduplication» is a phenomenon observed in practically all major language families of the world.
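Anticipating the regular-expression transcriptions announced in the introduction to this Part, the simplest intralexical form of intrasubjective replication, i.e. reduplication, can already be captured by a single back-referencing PCRE. A minimal sketch (the token list is our own toy example), using Python's PCRE-like re module:

import re

# A syllable-sized unit (two or more characters), repeated one or more
# times, makes up the whole token: MAMA (MA+MA), VAVA and BIBIBIBI match,
# whereas TATO does not (TA differs from TO).
REDUP = re.compile(r"^(\w{2,})\1+$")

for token in ["MAMA", "VAVA", "BIBIBIBI", "NENE", "TATO", "AUTOBUS"]:
    print(token, bool(REDUP.match(token)))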
end repetitions 12.6

12.7 first constructions

While repetitive sequences can be rightfully considered as «constructions» because they contain multiple juxtaposed (con-) elements (structions), we label as «constructions» only such expressions which fulfill the following conditions:

1. they contain sequences of two or more elements which, when taken alone, are distinct from each other

2. the basic elements are at least as complex as a morpheme or a syllable

Under such a definition, the sequence «ma» is not a construction because it does not fulfill the second condition («m» and «a» are neither morphemes nor syllables); and the sequence «mama» is not a construction because its basic elements (two «ma» syllables), when taken alone, are not distinct from each other. Under such a definition, products of plain intrasubjective replication are not to be considered as «constructions». What is needed in order to obtain «constructions» thus defined is not only replication, but also variation.

12.7.1 first word combinations

The first observed, decoded and registered multi-word combination which IM had uttered was: MAMA NENE (F1;4;25). The construction was uttered in the context of a request for breast-feeding. Note that without previous knowledge of what NENE (12.5.1) means, it would be impossible to decode the signal as a legitimate phrase on its own, and given its C1V1C1V1 C2V2C2V2 structure, it could even be considered, by an external observer unable to parse IM's idioglossia, to be a meaningless babbling fragment.

A month later, at (f1;4;30), IM uttered the expression TATO MAMA BABA ALA. Given that IM called her paternal grandmother by the nickname ALA, IM's mother immediately understood the 4-word (!) utterance as a nominal phrase meaning «father's mother is grandmother Alena». Three weeks later, at (f1;5;23), IM pronounced the utterance MAMATATO immediately after waking up, potentially requesting the attention of (or greeting?) both parents by means of a concatenation of both parental terms. At (f1;8;6) it was suspected that TATOTUTO means «here, father», since «tuto» is a standard local demonstrative of the Slovak language. But the advent of the full-fledged two-word stage was noted down at (f1;8;9) when IM said into the interphone «HALO TATO», while usually she was saying either HALO or TATO. Only two weeks later, at (f1;8;23), the mother immediately decoded the expression AJs NANA MAMA AKUKE as a ditransitive construction meaning «ice-cream me mother buy». Given that the utterance was indeed produced in the proximity of an ice-cream stand, and given that it was accepted by both parents at least since (m1;8;23) that AKUKE means «einkaufen» (e: to shop, s: nakúpiť), and taken for granted that IM uses the term NANA to refer to herself (c.f. the previous section), the father would have been obliged to set aside his scepticism and buy both girls an ice-cream if the younger one had not fallen asleep in the meantime. end first word combinations 12.7

12.7.2 first pivot(s)

«Names» like MAMA, TATO or NANA were sometimes used as components of longer and more complex constructions. They were also partially productive in the sense that some of these constructions (like MAMATATO) were never uttered by the parents and thus could not have been learnt by rote. But before the introduction of pivot words, the productivity of such nominal terms was highly restricted: they never occurred in more than a handful of constructions. Things changed with the arrival of the first pivot term; in IM's case, it was the term AUCH (meaning «too», «also»). Thus, an (f1;10;0) entry mentions the following constructions: TATO AUCH (as a father-addressed request to eat as IM does); MAMA AUCH; NANA AUCH (when requesting to eat the same food as the parents eat); ENTE AUCH (when feeding ducks). The pivot AUCH can thus be understood as a productive «seed» of the following micro-grammar:

(MAMA | TATO | NANA | ENTE) AUCH

Table 11: IM's seeding grammar: AUCH at the ultimate position.

Depending on the context and the agent-term of the construction, the pivot carried meanings as diverse as the imperative «You (father) do that (eat) as I do», the declarative «I do it (put clothes on) as You do» or «They (ducks) also want to eat». In general, it seems that the term was quite closely related to the fact of imitation and/or to the intention that the activities of two distinct agents should be aligned. Thus, the next recorded constructions were:

ICH AUCH NACH HAUZE (f1;10;15 – when wanting to go home)
AKE, NANA AKE (f1;10;17 – when seeing father swinging on a seesaw; AKE = g: schaukeln, e: to swing)
YOGA (f1;10;30)

Table 12: Seeding grammar extended: AUCH in the central position.
Note, however, that the term AUCH allowed IM to articulate thoughts encompassing realities well beyond the here&now. For example, when watching the scene of her favorite animated movie in which the benevolent mole bottle-feeds an orphaned eagle, IM declared: ICH AUCH MI (f1;11;1), meaning something like «I am also used to drink milk». Or putting the milk-bottle on mother's breasts and saying NENE AUCH (f1;11;5). Or, when reading a book about Babar the elephant (f1;11;5), IM stated ICH AUCH AM (AM = tram) when observing a picture on which Babar the elephant takes the tram, potentially intending to declare that she also takes the tram; a few pages later, ICH AUCH LIEALO (LIEALO = s: lietadlo, e: airplane)10 was declared when observing a picture on which Babar exercises yoga at the airport. During the same evening reading session it was, however, IM's act of uttering ICH AUCH, accompanied with pointing to the image of the Eiffel tower, which made both parents feel utterly perplexed. Not only because IM had indeed visited the Eiffel tower more than 2 months before, but also because no «ICH AUCH» was uttered during the lecture of the subsequent pages, on which Babar exercises yoga in Yosemite park, near the Golden Gate bridge etc.

10 In a sort of cognitive and phonotactic economy par excellence, IM had consistently used the Slovak signifier LIEtAdLO when mentioning airplanes in her otherwise germanophone constructions. Cognitive: airplanes were strongly associated with departures and arrivals of the Slovak-speaking TATO. Phonotactic: it is definitely easier for a child to pronounce a word full of laterals and vowels than the German «Flugzeug», containing such phenomena as a voiced velar occlusive juxtaposed with an affricate. Being reassured that she masters the syllable «LIE» well, IM had later consistently preferred to use the term LIEnKA (meaning «ladybird») instead of the German «Marienkäfer».

Given the recurrence of the construction ICH AUCH, one would be tempted to state that it was this longer complex, and not the simple AUCH, which was the true pivot. But this was not the case since, more often than not, AUCH agglutinated to and with other agential terms than the simple ICH. Thus, TATO AUCH LIEtAdLO (f1;11;2) was uttered when observing airplanes in the sky; the expression TATO AUCH UHE (UHE = g: Schuhe, e: shoes) ordered the father to put on his shoes. What's more, in her pre-sleep monologue of (f1;11;4), IM spontaneously generated all utterances given by the paradigm:

(ICH | MAMA | TATO) AUCH (AJA [e: egg] | KUCHEN [e: cake])

Table 13: Another AUCH-centered paradigm.

and did so in a repetitive and combinatorial fashion (i.e. produced all 6 combinations) normally common to scholastic methods or text-books in secondary language acquisition. This being said, both parents unanimously agree that IM's first pivot «strong» enough to structure around itself a whole system of constructions was the intersubjective term AUCH. This pivot was only slightly antecedent to the gain of force of another pivot, namely the egocentric MAJnE (d: meine, s: moje, e: my), expressed at (m1;11;2) in such utterances as MAJnE MAMA or MAJnE MIAU. Soon after, these phrases were also cried out from sleep: MAJnE MIAU at (f2;0;0, f2;0;21), MAJnE MAMA at (f2;0;21). But it was already at (f1;11;21) that this «pivot of personal property» was strong enough to cause IM to cry out the expression MAJnE UHE (my shoes!) amidst the REM-phase of one of her sleeping cycles. Somewhat contrary to other children reported in the literature, IM started to use her term MEA (meaning d: mehr, e: more) as a productive pivot only relatively late. Often, she simply used other means (including the usage of AUCH or NOCH) to express longing for bigger quantities of food or for the reproduction of a certain action.
end first pivot(s) 12.7.2

12.7.3 first micro-grammars

Once pivot words had helped IM to «understand» the meaning-specifying expressive force behind the act of juxtaposition of specific tokens, IM swiftly and naturally proceeded to the application of such a «combinatorial trick» in other contexts and for other uses. Asides the proto-islands of order structured around AUCH and MAJnE, instances acceptable by the following micro-grammars were noted down (f2;0;7) as most salient and recurrent:

Agent → MAMA | TATO | NANA | ICH | BABA | BEJBY
Patient → MIAU | METE
Food → BROT | AJA | ANAN
Drink → MI | VAVA
Action → HAJI | ESSEN | TRINKEN
G1 → Agent AUCH
G2 → MAJNE Patient
G3 → Food ESSEN
G4 → Drink TRINKEN
G5 → Action MACHEN
G6 → Agent KOM Action     (2)

Grosso modo, this proto-grammar already includes references to those actions (eating, drinking) and agents (parents, self) which are most vital for IM's survival. But in rules G5 and G6, one can already observe «the seed» of a much more general knowledge: a knowledge that certain precise actions can be «made» (G5) and, in a sort of half-imperative, half-causative fashion, other agents can be incited to «come» and actualize them (G6). From such knowledge, the child is only one cognitive step away from the reflected and conscious meta-knowledge of the fact that it is by language and language alone that such precise incitations can be made. From there on, the whole evolution of IM's syntactic P-structures has become complex, filled with non-monotonic returns, asynchronic detours, parallel developments and both intra- and inter-insular population dynamics. Since a sufficient accounting of such development would demand a book of its own, let's now shift away from the terminology of «grammars» towards a more dynamic terminology speaking about «mutations», «crossovers» and «life»; before doing so, a minimal generative rendering of proto-grammar (2) is sketched below.
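In the Python sketch below, the encoding of the rules as a dictionary and the random expansion strategy are our own illustrative choices, not a claim about IM's cognitive machinery:

import random

# Proto-grammar (2): non-terminals map to lists of alternative expansions,
# each expansion being a sequence of terminals and/or non-terminals.
RULES = {
    "Agent":   [["MAMA"], ["TATO"], ["NANA"], ["ICH"], ["BABA"], ["BEJBY"]],
    "Patient": [["MIAU"], ["METE"]],
    "Food":    [["BROT"], ["AJA"], ["ANAN"]],
    "Drink":   [["MI"], ["VAVA"]],
    "Action":  [["HAJI"], ["ESSEN"], ["TRINKEN"]],
    "G1": [["Agent", "AUCH"]],
    "G2": [["MAJNE", "Patient"]],
    "G3": [["Food", "ESSEN"]],
    "G4": [["Drink", "TRINKEN"]],
    "G5": [["Action", "MACHEN"]],
    "G6": [["Agent", "KOM", "Action"]],
}

def generate(symbol):
    """Expand one randomly chosen derivation of `symbol`."""
    if symbol not in RULES:                    # terminal token
        return [symbol]
    expansion = random.choice(RULES[symbol])
    return [tok for sym in expansion for tok in generate(sym)]

# e.g. 'BABA AUCH', 'MAJNE MIAU' or 'TATO KOM HAJI'
print(" ".join(generate(random.choice(["G1", "G2", "G3", "G4", "G5", "G6"]))))

end first micro-grammars 12.7.3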
But luckily for all the parents of the Earth, the positives seem to outweigh the negatives. Thus, most of the time, verbal interaction with children is simply beautiful, comforting and -let’s not forget the another important aspect motivating all parties involved - are child’s first linguistics constructions perceived and felt as cutely and adorably funny. end first constructions 12.7 12.8 mutations Mutations (from lat. mutare "to change") are basic atomic units of change. Mutations occur in time; in informatic terms mutations are events caused by transition of information-encoding substrate from one state into another. Given that the physical nature of substrate of linguistic representations is still speculative, and in great extent unknown (8.6), we shall present, in the following paragraphs, just a handful of illustrations of such transitions occuring in ontogeny of IM’s linguistic structures and processes. 12.8.1 context-free substitutions Context-free substitutions are mutations characterized by substitution (replacement) of each occurence of the original symbol So rigin 11 If the statement « The limits of my language are the limits of my world» (Wittgenstein, 1922) is true, than agent’s exploration of limits of her language equivauts the exploration of limits of her world. 12.8 mutations with exactly one instance of the target symbol St arget. Given that all occurences of So rigin are substituted, CSM operators are, so to say, agnostic of substituents position. A first example was already given: transition M@M@ → MAMA (c.f. 25) can be explained as a substitution of a central vocalization @ for a more marked A, i.e. as a result of application of a rule @ → a. Other particularly illustrative example of a CSM was given by IM on three consecutive days, during which she was observed to utter sequences of a form BABIJÁ (f1;4;16) MAMIJÁ (f1;4;17) PAPIJÁ (f1;4;18) Within the framework of the theory hereby proposed, such transitions could be explained by mutation of the content attributed to non-terminal Clab,occ within the template: Clab,occ aClab,occ ijá which is equivalent, at certain level of abstraction, to substitutions b → m and m → p which most probably occured in IM’s mind during the first (resp. second) night between the observations. Note, however, that in spite of being labeled as context-free, even these mutations are not "global". It would be utterly false to believe that the fact that every B within the construction BABIJÁ was substituted by P resulted in the situation whereby IM ceased to pronounce the sound B alltogether. This was, of course, not the case and the sound "B" did not disappear from IM’s repertoire. Thus, in regard to the local "template" in which it occured, the substitution could be considered as context-free. But not more: the mutation had practically no impact beyond the local micro-grammar within which it took place. Or, to come back to the example of the primary differentiation Figure 25, the fact that @ was replaced by A in case of insula slowly converging to meaning of "mother" did not have any impact whatsoever upon the fact that within the insula slowly converging to meaning of "breasts" another mutation (i.e. MA → NE) took place. This is so because in moment of the mutation, both insulae were already materially encoded in at least partially distinct neural loci. To summarize: context-free mutations are mutations which alter all instances of a certain symbol. 
But the scope of their action is still constrained to only a specific template | insula | micro-grammar 12 . Or a restricted group of these. end context-free mutations 12.8 12 In following sections we shall use terms template, micro-grammar and insula in a mutually interchangeable, synonymic fashion to mark the fact these notions are computationally equivalent 187 188 qualitative 12.8.2 First vocatives context-sensitive substitution The scope of impact of a context-sensitive mutation is also constrained to a specific template or to a strongly restricted group of these. But in addition to this constraint, scope of applicability of CS-mutation is also limited by the context | position | neighborhood within the template itself. To illustrate with first well-documented CS-substitution: during her stay by IM’s czech-speaking BABA, the mother has documented IM’s production of expressions MAMI and BABI (m1;4;9). Emergence of these forms, which are completely correct vocatives in czech, could be explained by a context-sensitive mutation A$ → I$ 13 occuring in IM’s mind. In the observation journal, mother had commented the phenomenon: I suppose these came because of my calls "Babi" tu my own grand-mother and "Mami" to my mother.. Further analysis can unveil, however, that acquisition of such vocatives could have been synergetically catalyzed by the presence of a dog called DEXI and a cat JESI in grand-mother’s appartement. Since it was one among first IM’s exposures to animal life and since IM did not hesitate to establish not only visual, but also verbal (by production of onomatopees like HAU-HAU and MIAU-MIAU) and haptic communicative interlock, it is undoubtable that representations (i.e. signifiees) of both pets attained a highly salient status within IM’s mind. And given that in czech language vocative forms of I-terminated animal pet names are identic to the nominative forms, it cannot be excluded that the very presence and saliency of pet-denoting −I$ protonominals had stimulated IM’s nominative-to-vocative transition within more general a class of living beings. Thus, IM’s success in mastering of vocatives seems to be result of interplay of three mechanisms: 1. an endogenous mutation which caused the −A$ → −I$ transition within certain among IM’s private P-structures 2. exogenous gold-standard structures (i.e. persons in IM’s social environment which use −I$ nominals within certain contexts) 3. a cerebral mechanism reinforcing or even replicating such private representation which match public structures Nature of these three mechanisms correctly understood, one can see how development of practically any expressions - from initial babbling all the way through infantese, toddlerese, pupilese to the "correct" adult-like pronounciation - can be characterized as a sequence of such CSSs. In IM’s case, for example, one can see the trajectories along which the words for "milk", "water", "baloon" podded out of the initial babbling: 13 Consistently with the syntax of Perl Compatible Regular Expressions (PCREs) we shall denote the "ultimate position" with the dollar sign $. 12.8 mutations Context-sensitive substitutions (EXT) MiMi*14 → MI (f1;5;12) → MICH (f2;0;13) → MILCH UaUa* → VAVA (f1;8;10) → VASA (f2;1;8) → VAS BALaL* → BALOL (f1;10;30) → BALOND (f2;0;13) → BALON (f2;4;19) end context-sensitive substitutions 12.8.2.0 In these cases, mutations had often counteracted child’s tendency for elision, assimilation or fronting of certain phonemes at certain positions (9.2.1). 
In each example, the symbol → tends to denote a moment, or a group of moments, whereby IM's linguistic structures underwent a structural change, i.e. a mutation. In reality the situation is, of course, much more continuous and much less discrete than in our transcriptions. To describe the whole phonic development more closely, one would have to use a more refined transcription alphabet (e.g. the International Phonetic Alphabet), but even this one could be criticized as too coarse-grained for the task at hand. But no matter what transcription system one chooses, and independently even of whether one stays faithful to continuous reality or discretizes the phenomena into already existing boxes, one thing stays certain: IM's interiorization of any individual linguistic structure consisted of multiple intermediate steps.

end context-sensitive mutations 12.8

By stating that the development of any individual linguistic structure consists of multiple intermediate steps, we want to focus the reader's attention on the fact that not only P-structures and articulated signifiers develop, but - and this is important - any C-structure (i.e. conceptual signifié), as well as the structures relating the two, does so as well. In the preceding paragraphs we have focused mainly on the development of P-structures because their development is easier to assess. But this does not mean that the world of C-structures does not develop, i.e. that it is not subject to mutations. The contrary is, in fact, the case: in the course of her development, IM's innermost structures were constantly modified by a multitude of events of exogenous origin. By myriads of minute interactions and couplings of linguistic inputs with other auditive, visual, haptic, olfactory, gustatory, vestibular, nociceptive or proprioceptive inputs. By parental questions and parental corrections, and by the facts that a certain question and a certain correction were given in one context but not in another. But other, more endogenous factors related to playing, dreaming and φαντασία, well beyond the traditional adult notions of "abstraction and generalization", had to be active as well, in order to account for the emergence as well as the correction of such cases of poietic over-generalization as:

1. at (f1;11;8), saying ZONE (e: sun; d: Sonne) when seeing a full moon in the evening sky
2. at (f2;0;19), naming the circle of light projected by the lamp upon the bedroom's ceiling with the term BALONd
3. at (f2;3;19), saying LIENKA (e: ladybug) when seeing, in a picture book, a red ball with white dots
4. at (f2;5;8), using the term KUGEL (e: sphere) to describe a ping-pong ball (correctly called "BAL" a year before)
5. at (f2;6;15), answering NENE when asked to describe what is on a swimming-pool tile with two concentric circles
6. and the DING-DONG mystery

12.9 case study of semantic mutations: the ding-dong mystery (aph)

To demonstrate the arbitrariness of any system of categorization, or even of any epistemology, both Michel Foucault and Eleanor Rosch fondly cite the taxonomy fictitiously attributed by Jorge Luis Borges to an ancient Chinese encyclopedia entitled the Celestial Emporium of Benevolent Knowledge:

«On those remote pages it is written that animals are divided into (a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel's hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that resemble flies from a distance.» (Borges, 1952)

Such a Borgesian account is something which has to come, willy-nilly, to one's mind when confronted with the case of the DING-DONG mystery (DDM). Contrary to Borges's taxonomy, however, the DDM is not fictitious but rooted in the reality of facts. These are as follows:

The first mention of DING-DONG (f2;0;7) clearly places the term in the context of church bells. Indeed, IM had used the expression to express her will to be in the proximity of Bratislava's St. Martin cathedral exactly at 18:00, when the cathedral's bells ring the most. The same record mentions, however, that the term cannot refer to "church" in general, since another church building was labeled as OKOL (s: kostol).

That the concept develops started to become evident a month later (f2;1;4), when it was noted down, during a visit to the library, that IM had picked from a bookshelf a book about European history and labeled the building depicted on the front cover as DING-DONG. A week later, the (f2;1;8) record continues: "You are still occupied with the DING-DONG concept. It seems to denote all big buildings; today, for example, You saw the picture of a skyscraper and called it a DING-DONG."

A later (f2;2;1) log indicates that things started to get somewhat more complex. Thus, during a simple walk between Berlin's central station and Hackescher Markt, IM used the term DING-DONG when labeling the following objects:

• stone sculptures on the bridge
• a tower in the distance
• the fluttering German flag atop the Bundestag
• a building just next to the Bundestag
• the synagogue's golden dome
• the cross atop Berlin's cathedral
• the buildings of the Marienkirche and the Bode Museum

which indicates that at that period, the DING-DONG concept still overlapped with something similar to an adult concept of a "fancy piece of masonry" or a "building's top". The fact that the later (f2;2;11) log states "Still occupied by DING-DONG, You were completely fascinated by youtube videos of St. Martin's cathedral" seems to support the hypothesis.

While the (f2;2;18) log entry stated that "DING-DONG fades into background", it also stated that "from time to time, You still label something with that term: picture of castle in the book, two noodles stuck together...". The entry logged at (f2;5;7), i.e. 5 months after the initial use of the term, states: "with the word DING-DONG You have labeled the picture on the 'ace of staffs' Crowley's tarot card as well as a flute". And two weeks later (f2;5;21), it was written that "You are still occupied with DING-DONG.
It seems that You use it especially to denote spiky things; for example, green buoys on the Elbe river are DING-DONG. But the red ones, without the spike, are not." Approximately in the same period, the father also considered quite plausible the hypothesis that the term could also denote the property of being long (d: lang).

Given the importance of the term within IM's world, a small constructional island coalesced around it. At (f2;3;19), a recurrent usage of the construction dING-dONG mACHEN was noted down when building towery lego churches;15 at (f2;5;8), intense repetitive production of the expression ING ONG OJTET (d: lautet, e: rings) was noted down; and the (f2;5;23) entry recorded the following playful variations:

dING dONG lOJTET
dING dONG lOJTET
dING dONG lOJTET
lOJTET dING dONG
lOJTET dING dONG
lOJTET dING dONG

which repeatedly transgress even the most primitive subject-precedes-verb syntactic rule of the German language.

15 However, a general term for other constructions built from lego or wooden cubes was BAUT (f2;2;18, f2;3;19), potentially derived from the past participle of to build (d: gebaut). What's more, when asked to label diverse lego blocks, IM consistently used the term BAUK (f2;2;1). Note that such a term exists neither in Slovak nor in German.

Such indeed is the mystery of DING-DONG: unconcerned by the "correct word order", unconcerned even by appropriate, adequate and optimal conceptual boxes, the infantine mind plays the poetic game. Shamelessly, joyfully and naturally it plays the poetic game, and does so at all levels. Isn't that Borgesian?

end the ding-dong mystery 12.9

The above aphorism indicates that besides context-free and context-sensitive substitutions, yet other variation operators are at play in a developing mind. Not only formal but also semantic substitutions; not only the replacement of a symbol by an empty one, but also the diminution (or expansion) of the extension of a concept C (when C is understood as a set). Or, when concepts are understood in more geometric terms, mutations consisting of either an increase or a decrease in the volume of the C-localizing subspace, or of a translation of C's centroid to some other position.

But as was already indicated not only by IM's playful switch from the grammatical dINGdONG lOJTET to the agrammatical lOJTET dINGdONG, but first and foremost by the enumeration of different intralexical metatheses (12.9.1), yet another class of mutation operators seems to act within the developing mind: switchings of position within the sequence, permutations within the temporal order.

12.9.1 first transpositions

A transposition occurs when two or more elements of a bigger whole (e.g. phonemes within a word or words within a phrase) exchange their positions. The frequencies of occurrence of elements within the sequence thus do not change; their relative positions, however, do.

Already a relatively early transcription (f1;5;30) of IM's sometimes babbling, sometimes one-word "stream of consciousness" improvisations, produced at the breakfast table, contains sequences like ÁU, UÁ and ÍTÁ, ÉTÍ. Such were indeed IM's first attempts to switch the positions of two protophonemes in her protowords. IM's later productions indicated the activity of more complex (i.e. involving more than 2 transposed elements) metathesis-like reorganizations, such as:

Context-sensitive metatheses (EXT)

APUK (f2;1;10) → "kaput"
IPEK (f2;1;12) → Wipke
UKAKS (f2;1;24) → "Rucksack"
MAKTA (f2;6;0) → "matka"

etc.
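Such reorganizations can again be given a minimal computational form. The following Perl sketch is a hypothetical illustration of ours (the positions to be swapped are chosen by hand, whereas in a developing mind they would presumably be determined by articulatory and mnemonic pressures):

    #!/usr/bin/perl
    # Transposition (metathesis): two positions of a sequence exchange their
    # contents; the frequencies of the elements stay intact.
    use strict;
    use warnings;

    sub transpose {
        my ($form, $i, $j) = @_;
        my @symbols = split //, $form;
        @symbols[$i, $j] = @symbols[$j, $i];   # exchange positions $i and $j
        return join '', @symbols;
    }

    print transpose('AU', 0, 1), "\n";      # AU    -> UA   (cf. AU, UA above)
    print transpose('MATKA', 2, 3), "\n";   # MATKA -> MAKTA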
end context-sensitive metatheses 12.9.1.0

Given the prominence of such "errors" and mis-productions16 in IM's speech, we are tempted to state that a non-negligible amount of the transpositions commonly studied in evolutionary linguistics (e.g. e: "fog", s: "hmla", czech: "mlha") had their origin in slips of the tongue of individual toddlers, which were subsequently accepted and spread through the wider community.

16 From an adult-like point of view.

At the end of this development, IM started to permutate the positions not only of individual phonemes, but also of phonetic clusters or even of words within phrases. Thus, (f2;4;5) mentions a permutation

HUNDUNDMIAU
MIAUUNDHUND

undoubtedly stimulated by the symmetric, non-preferential binary coordinative UND (e: and). But as is demonstrated already by the following entry (f2;4;6), noted down during a game whose objective was to put diverse wooden animals into the correspondent slots:

DA IST KKO, IST DA KKO

as well as by an entry noted down a day later, during the marble game:

KUGEL EINE, DA IST KUGEL EINE

IM's propensity to permutate the order often disregarded even the most fundamental among the syntactic constraints of Germanic languages - that is, the constraint that the article (EINE) should precede the noun (KUGEL), and definitely not the other way around.

In the following chapter, dedicated to quantitative analyses of the CHILDES corpus, we shall aim to shed somewhat more light upon the question whether this situation - in which IM's urge to permutate word order was stronger than the most fundamental among the syntactic constraints - was peculiar to IM, who, as a partially Slavic person, potentially feels less bound by the need to correctly prefix substantives with determinants, or whether it is a more general trend present even among Germanic and Anglo-Saxon toddlers.

end first transpositions 12.9.1

In the above subsections we have presented multiple variation operators which, we believe, could rightfully be labeled as "mutations". Substitution of nothing with something, of something with nothing, of something with something else; expansion or diminution of extension; switches in positions, fillings of empty slots: in one way or another, all this was already known not later than after Aristotle and his followers (8.5). It was evident to Gödel and Turing as it is evident to the proponents of FLG: any computable number (resp. construable string of symbols) can be obtained by means of insertions, deletions, substitutions and transpositions.

Until now, practically nothing new in comparison with traditional cognitivist symbolic architectures. But everything changes when the most noble among all variation operators is introduced: the crossover.

end mutations 12.8

12.10 crossovers

Boldly speaking, crossover is the operator of unitas in diversitate. As indicated by the figures attached to our brief discussion of biological evolution (Figure 4) and fitness landscapes (Figure 8), the power of crossover consists in its ability to:

1. let two (or more) parent structures project their features upon one (or more) child structures
2. allow the evolving system to get out of locally optimal states (i.e. to fly away from the peaks into the unknown realms in between)

The second point implies that systems involving a crossover operator are able to continue evolving even there and then where other, gradient-following approaches are doomed to get stuck.
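Before turning to concrete observations, the formal core of the operator can be made explicit with a minimal Perl sketch. This is a hypothetical single-point string crossover of our own devising (the cut points are here chosen by hand, whereas in a genuine evolutionary system they would be chosen stochastically); the two example calls anticipate the forms PIJEN and BANÁN discussed below:

    #!/usr/bin/perl
    # Single-point crossover: the child schema inherits a prefix of one
    # parent schema and a suffix of the other.
    use strict;
    use warnings;

    sub crossover {
        my ($parent1, $parent2, $cut1, $cut2) = @_;
        return substr($parent1, 0, $cut1) . substr($parent2, $cut2);
    }

    print crossover('pije', 'trinken', 4, 6), "\n";   # pije x trinken -> pijen
    print crossover('BAJA', 'ANAN',    2, 1), "\n";   # BAJA x ANAN    -> BANAN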
Computationally speaking, crossover's ability to direct the search into regions of the useful unknown yields the ultimate coup-de-grâce which a model involving the operator apodictically gives to any model which does not involve it.

When it comes to the first point, consider, for example, a sort of semantic crossover:

(HUMAN) × (WINGS) → (ANGEL)

Without resorting to the crossover mechanism, and without falling into the trap of pseudoscientific divinatory explanations, we consider it very difficult - if not impossible - to offer a scientific account of a cognitive mechanism by means of which representations of entities without a material referent - all angels and chimeras, all centaurs and mermaids, all mythological visions as well as technoscientific insights - could have ever entered the mind of primordial man. And in whom else, if not in children, could one see such a process at work?

12.10.1 multilingual crossovers

Given the lucky coincidence (12.3) due to which the structures in IM's linguistic environment primarily consisted of structures coming from two distinct sub-branches (i.e. Germanic and Balto-Slavic) of the same language tree (Figure 5), IM was exposed, on a regular basis and in analogous contexts, to instances of constructions which were both similar and distinct at the same time. What resulted is a phenomenon of interlingual "mixing" which is well known to practically every parent of a healthy bilingual child. Let's now focus on the two most prominent types of such mixing.

Intralexical crossovers

A multilingual intralexical crossover is a mixing of two word-representing schemas S_L and S_J intersubjectively replicated from exogenous oracles using diverse languages L and J. Given that the schemas-to-be-combined are defined as word-representing, and given that their sources are oracles (e.g. parents, grandparents, teachers), they are expected to occur during:

1. cases of bi- or multilingual language acquisition (hence "multilingual")
2. the acquisition of base-level terms (i.e. signifiers) for base-level meanings and referents (hence "intralexical")

In the following exposure, crossovers shall be presented consistently with the following formula:

referent
german
slovak
TODDLERESE

whereby the first row shall contain the English term for the referent R, the second row the R-denoting term most frequently used by IM's mother, and the third row the R-denoting term used by IM's father. The last row of every example shall contain the transcript of IM's idioglottic production consistently (i.e. more than once) uttered in R-cooccurrent contexts.

The first three multilingual intralexical crossovers, all noted down at (f1;7;30), were:

eyes
augen
oči
OGE

and

water
vas@
voda
VAVA

and

shoes
šúhe
boty
OGHE

Another couple of salient crossovers was noted down during games with animal picture books: at (f1;8;11)

monkey
afe
opica
API

and at (f2;1;9)

elephant
elefant
slon
OLOOND

Besides substantives, other parts of speech were mixed as well. Verbs, for example (f1;8;24):

buy
ajnkaufen
nakúpiť
AKUKE

as well as possessives (f2;2;22):

my
majne
moje
MAJE

Many other cases were noted down in which IM had opted for a form sharing as many features as possible with the forms of both ambient languages. Thus the (f2;0;14) entry recorded

stick
štok
papek
AK

and it has to be added that in the weeks that followed, IM used the term AK and|or @K to denote practically any piece of wood she could easily carry and manipulate.
This was done in spite of initial parental attempts to correct her, and resulted, in fact, in a parental adaptation whereby the parents resigned and used the convenient term AK as well.17

17 The form @K withdrew into the background once IM mastered the pronunciation of the more correct forms OK (f2;6;13), TOK and ŠTOK. The form @K, however, soon reappeared in order to mean "wolf" (sk: vlk).

We believe that these examples illustrate that in many cases of production of new words, IM tended to:

1. produce forms with known characteristics
2. produce forms which are as close as possible to both parental forms

The first tendency seems to be the case for all healthy children, no matter whether they are raised in a monolingual or a multilingual environment (cf. Table 2 and the associated discussion of "preference" and "avoidance"). The second, centroid-form-seeking tendency is definitely most easily assessed in cases of multilingual acquisition, wherein the forms to be crossed over are distinct. To prove our point, we conclude this brief enumeration of IM's multilingual intralexical productions with a final example (f2;0;17):

drink
trinken
pije
PIJEN

as well as with a link, http://wizzion.com/thesis/videos/pijen.mp4, which, between 4:20 and 4:34 (as well as at 5:47), demonstrates our case.

end intralexical crossovers 12.10.1.0

Intraphrastic crossovers

Multilingual intraphrastic crossovers are crossovers which mix, within one construction, morphemes originating from multiple languages. They are well known to practically any person subjected to second-language acquisition: one wants to form a construction in language 1 but somehow "involuntarily" populates it with certain items proper to language 2.

IM started to produce her first intraphrastic mixes when she was still in the "pivot grammar" stage. It had thus often been the case that certain slots within constructions pivoted by the German AUCH were filled with an item of Slovak origin:

(f1;11;3) TATO AUCH LIEtAdLO (nom. sg. sk: "airplane")
(f1;11;4) ICH AUCH LIEtAdLO
(f2;0;0) NANA AUCH rUKY (acc. pl. sk: "hands")
(f2;0;7) LIENKA (nom. sg. sk: "ladybug") AUCH

Alongside these AUCH-pivoted intraphrastic crossovers, IM's production was also full of utterances composed of a Slovak noun and a German predicate. For example, during the period dedicated to the story of the mole and the eagle, the following utterances were very common:

(f1;11;12) OLOL (nom. sg. sk: "eagle") šAUEn (inf. de: "to watch")
KKO (nom. sg. sk: "mole") šAUEn
(f1;2;18) OLOL ÍflGt (3p. sg. pres. de: "to fly")

It is, however, debatable whether one could count such constructions as "interlexical" crossovers. This is so because in the concrete cases of usage, IM used the terms KKO, OLOL etc. similarly to terms like TATO, MAMA, BABA, i.e. as personal names. Not knowing any other instance of an eagle or a mole than the one which was presented to her, it seems more plausible to state that IM had juxtaposed language-agnostic names, and not language-specific nouns, alongside her German-originated predicates.

But 6 months later, with her toddler period coming to an end, a sudden phase transition in both the amount and the diversity of intraphrastic crossovers occurred. Thus, during an interval of only three days, the production of the following Germano-Slavic structures was observed:

1. (f2;5;13) TIETO (sk: "these") ČpANuCHy (sk: "sock pants") AJNgekAUFT (de: "bought")
2. (f2;5;15) WO (de: "where") IST (de: "is") mOTYl (sk: "butterfly")?
3. (f2;5;18) TO (sk: "that") JE (sk: "is") MAJNS (de: "mine")
4. (f2;5;18) NANDA tAM (sk: "there") BYVA (sk: "lives"), MAMA AUCH tAM BYVA
5. (f2;5;18) DA (de: "there") IST (de: "is") mUCHA (sk: "a fly")

Closer inspection of these examples reveals that sometimes IM used the Slovak (ex. 3 and 4) and sometimes the German (ex. 5) forms to express the meaning "there is". On the very same day, even a small multilingual grammar was noted down:

NICHt + { chrOBÁČIK (sk: "beetle") | mrAVČEK (sk: "ant") | ČMELJAK (sk: "bumblebee") }

Table 14: Interlinguistic micro-grammar.

Both parents were unaware of ever having produced such "negation in German + Slovak animate substantive" constructions. Given that it is highly unlikely that someone in IM's wider environment would have exposed her to such constructions, the sole explanation of their existence has to be sought among IM's endogenous cognitive processes. We agree with Piaget that at this stage, one among such processes can be the child's egocentricity and her tendency to playfully negate any information that comes from exogenous oracles. And this was, in IM's case, expressed notably by means of the German pivots NAJN and NICHt, whose productive affinity at this period was such that they succeeded in forming constructions even with non-Germanic words.

end intraphrastic crossovers 12.10.1.0

On the preceding pages we have presented a few cases of multilingual crossovers, i.e. crossovers between schemas embedded in distinct languages. Two main groups - intralexical and intraphrastic - were introduced in order to organize the presentation. We consider it highly plausible that besides these two types, the super-group of interlinguistic crossovers contains other types of operators as well. But instead of studying each one of them in detail, let's just close this brief discussion of bilingual acquisition with an aphorism stating that:

Of crossover and calques (APH)
If the reader has understood that the operators which we have labeled as "interlingual crossovers" could elucidate phenomena which traditional linguistics calls "calques" or even "faux amis", then the reader has understood us well.
end of crossover and calques 12.10.1.0

and now focus upon crossovers occurring not among elements of multiple languages, but among elements of one sole language.

end multilingual crossovers 12.10.1

12.11 monolingual crossovers

A monolingual crossover is a crossover between two or more input schemas which all originate from, and are extracted from, the same language L.

A schema is the most fundamental element of the theory hereby introduced. It is a template, a pattern, a sort of micro-grammar which, when embedded within a human brain or within a computational agent, can be useful for both comprehension and production. In comprehension, a schema's role is to "match" an external stimulus (e.g. a linguistic utterance). In production, e.g. when coupled with articulatory circuitry, a schema determines the process of generation and execution of a specific action (e.g. the pronouncing of a word or a phrase). Schemas themselves are composed of atomic features, and it is important to realize that, in theory, one individual schema can integrate in itself features of different types: conceptual, semantic, syntactic or morphophonologic features can all be considered constitutive elements of one individual schema S. In theory.

In spite of the fact that certain schemas S_X, S_Y can integrate in themselves "semantic" (i.e. signified) and "morphophonologic" (i.e.
signifier) components to an extent which strongly resembles the entities W_X, W_Y commonly known as "words", it would be a mistake to simply state that words are schemas and schemas are words. For it may be the case that a certain word is encoded by multiple schemas. Let's now glance at a few crossover types which indicate that this can, indeed, be the case.

12.11.1 intralexical

Given that porridge with bananas was her favorite breakfast, the word denoting banana (de: banáne, sk: banán) was among the first items in IM's lexical repertoire. Thus, at (f1;5;12) it was noted down that IM consistently used the P-schema BAJA to denote the fruit. A few months later, however, at (f1;10;8), it was observed that IM used the P-schema ANÁN to denote the same referent. A month later, as IM was still consistently using the incorrect pronunciation to ask for the fruit, the father tried to exogenously induce the correct pronunciation:

IM: ANAN, ANAN
F: banan, ba, ba, baba, banan
IM: ANAN

without success, since IM kept responding with the pronunciation ANÁN. But knowing that a few months before there used to be a period in which IM labeled the fruit with a schema correctly beginning with B, the dialogue continued:

F: baja
IM: BANAN

id est, IM pronounced a correct form which she was unable to pronounce otherwise. This pedagogical "success story" can be quite easily explained in terms of a monolingual crossover. Thus, knowing that IM used to produce the P-schema BAJA before, the father simply uttered the token which reactivated the latent schema. Subsequently, during a moment of practically instantaneous cognitive crossover, the latent schema mixed with the dominant one:

BAJA × ANÁN → BANÁN

and the correct "centroid form" of the two protoforms was obtained.

end intralexical crossovers 12.11.1

12.11.2 interlexical

It may be the case that the mind sometimes mixes together even schemas which encode different semantic contents. Thus the first case of crossover recorded by the father (f1;4;25) was the spontaneous usage of the vocative MAMI a minute or two after the reading of the book about the cat called MIMI:

MAMA × MIMI → MAMI

(cf. 12.8.2 for a description of other exogenous factors which had primed IM for the acquisition of vocatives).

Another couple of quite interesting crossovers was observed amidst IM's "eagle period".18 As was already mentioned, the word which dominated IM's production during this period was OLOL (sk: orol, en: eagle), and given the frequency of occurrence of the term in IM's production, it is undeniable that the P-schema OLOL$ was strongly activated. It may be for this reason that at (f1;10;30), IM's term for "balloon" was BALOL, which could be explained in terms of a crossover:

OLOL$ × BALÓN → BALOL

But since it could be argued that the production of the word BALOL could also be explained as an assimilation of the lateral feature by the terminating nasal consonant, and since we want to avoid confusion between causes and effects,19 let's just focus on the second case, which we consider particularly instructive. Thus it happened that at (f1;11;20), during her pre-sleep oratory, IM tried to list the names of all her kindergarten friends. But given that she forgot to mention her friend Nikol, IM's mother turned the monologue into a dialogue:

M: NIKOL
IM: KOLOL

thus producing a word which exists neither in German nor in Slovak, and doing so in a context which undeniably indicates that her communicative intention was to say "Nikol".
One could, of course, argue that IM produced such a term as a result of her avoidance of the syllable ni- in the initial position (demonstrated, for example, by her calling one of her friends KITA instead of Nikita), combined with reduplication. But if such were the whole explanation, one could hardly see why IM opted for the term KOLOL and not for *KOKOL. Thus, another force had to be at play, and we argue that it was the productive affinity of the schema OLOL$ and the subsequent crossover

OLOL$ × NIKOL → KOLOL

which took its toll.

18 During this period IM was exposed, at her own request, to dozens or potentially even hundreds of instances of the same narrative concerning the friendship of a benevolent mole and an orphaned eagle. The exposure was multimodal: IM sometimes watched the movie without commentaries, sometimes it was commented; sometimes the picture book was read, sometimes the story was narrated without any visual support whatsoever. Cf. http://wizzion.com/thesis/videos/olol.mp4

19 What was first? BALOL or OLOL?

Another interesting interlexical crossover was observed during another pre-sleep dialogue (f2;1;14). When IM was asked to describe the games she plays in her kindergarten with another kid, she answered with the word MAUEN. Given that such a word does not exist in the German language, and given that frequent usage of the terms "mahlen" (en: to draw, to paint) and "bauen" (en: to build) had been noted down already a month before (f2;0;16), it cannot be excluded that the term was the result of the following crossover:

paint and build
mahlen
bauen
MAUEN

and that, potentially, it had the meaning of both building (e.g. with lego, wooden cubes etc.) and painting (a common activity in IM's kindergarten) at the same time. If that was the case, IM's answer by means of the term MAUEN could potentially suggest that crossover may be useful not only for the explanation of the development of surface morphophonologic signifiers, but also for the explanation of much deeper semantics- and concept-related signifieds.

end interlexical crossovers 12.11.2

12.11.3 intraphrastic crossovers

Monolingual intraphrastic crossovers are operators which mix together components (e.g. morphemes) originating from different phrase-encoding schemas. Let's look at just one video,20 recorded at (f2;5;15), to see what this could mean. The video shows IM and her mother during a creative session initiated by stone-painting and terminated by the sticking of small artificial eyes onto the painted stones. Many interesting things happen in the video, including:

1. within 304 seconds, IM uses the fixed construction UNK@BLAU ("dark blue") 18 times in three "bursts"
2. at 4:21, IM produces a multilingual crossover VO (de: "where") ISt (de: "is") OKO (sk: "eye", nom. sg.); subsequently she is corrected by her mother, whom she imitates, producing the full Slovak construction ĎE jE OKO at 4:25

20 http://wizzion.com/thesis/videos/augen.mp4

In regard to monolingual intraphrastic mixing, it is already the first phrase,

ICH MAHLEN

pronounced at the 9th second, which is of certain interest. This is so because this phrase - agrammatical on its own due to the non-agreement of the pronoun (1p. singular) with the verb form (infinitive or 1p. plural) - can be understood as the result of a crossover of two grammatically correct phrases:

ich mahle × wir mahlen → ICH MAHLEN

The same holds, mutatis mutandis, for the incorrect pronunciations which came later, such as:

1. 03:08 VO ISt AUGEN? (where is eyes?)
2. 04:02 VO ISt mAJN AUGEN? (where is my eyes?)
3. 04:14 mAJN AUGEN VEG (my eyes (is) away)

Thus, all such syntactically incorrect constructions can easily be explained as the consequence of a crossover between correct forms which the child could easily have heard in her environment. For example:

wo ist auge? × wo sind augen? → VO ISt AUGEN?

This being said, we feel no need to spam the reader with other instances of such "monolingual intraphrastic" crossovers, produced by IM aplenty since ca. 2 years of age. Instead, we conclude with yet another aphorism:

Of crossover and overgeneralizations (APH)
If the reader has understood that the operators hereby labeled as "monolingual intraphrastic crossovers" could elucidate the phenomena which developmental linguistics labels as "overgeneralizations", then the reader has understood us well.
end of crossover and overgeneralizations 12.11.3.0

In other words, the notion of "monolingual intraphrastic crossover" can be a useful conceptual aid for anyone aiming to explain the problem of over-generalization or to construct a theory thereof.

end intraphrastic crossovers 12.11.3.0

Many among the above-mentioned cases of monolingual crossover were triggered, induced or even primed by an exogenous event (i.e. a parent asking or saying something). The detection of crossover forms of purely endogenous origin is much more complicated: it is easier for a parent to believe that the child speaks agrammatical and meaningless gibberish than to admit that the toddler communicates meanings to which he (the parent) no longer has access. For this reason we have restricted, with the exception of MAUEN, this introduction to purely surface crossovers between morphophonologic and syntactic P-schemas. To go deeper would be too speculative. This being said, we conclude this introduction with the remark that hearing or seeing a child produce a monolingual crossover is, verily, a revelatory event: it is as if, for a brief moment, one had indirectly regained access to the realm of long-forgotten knowledge.

end monolingual crossovers 12.11.3

On the preceding pages, we have used the term "crossover" to denote operators, acting within a cognitive system, which are able to yield a new child schema by means of the mixing of multiple parent schemas. It was tacitly indicated that the existence of many phenomena, including linguistic calques, creole languages or overgeneralizations, could be explained in terms of the activity of such operators in the human brain. The following table recapitulates the basic distinctions:

                Multilingual                 Monolingual
Intralexical    PIJE + TRINKEN = PIJEN       BAJA + ANÁN = BANÁN
Interlexical    ??? (difficult to assess)    BAUEN + MAHLEN = MAUEN
Intraphrastic   VO ISt OKO? (calques etc.)   VO ISt AUGEN? (overgeneralizations)

Table 15: Recapitulation of crossover types observed in IM's production.

Due to the abstract nature of the "operator" entities it aims to organize, this taxonomy can rightfully be criticized as both crude and arbitrary. Thus, for example, the distinction between multilingual and monolingual could be considered arbitrary by anyone asserting that the child is exposed to a multilingual linguistic environment (composed of, for example, motherese, fatherese, teacherese etc.) even in the case when all members of the social environment speak the same dialect.
The distinction between intraphrastic and intralexical could also be attacked on the sole ground that in many morphosyntactically rich languages, the very nature, or even the existence, of the distinction between the notions of lexeme and phraseme is not as straightforward as it may seem. But be that as it may, such theoretical hassles are of little use for the phenomenological objectives of this chapter. Aiming to stay as faithful as possible to our initial method of describing but not categorizing, we cast this taxonomy aside as secondary and specify that all the above-introduced ^int- terms21 were introduced and categorized not because we would be 100% sure that such operators indeed materially operate within the human brain, but because we hope that their introduction could potentially stimulate or even facilitate further discussions. One such discussion, concerning the assessment of crossover-like phenomena in the CHILDES corpus, shall soon be introduced.

21 Consistently with the syntax of PCRE, we shall use the symbol ^ to denote the initial position. The expression ^int hence matches all expressions prefixed by the trigram int.

end crossovers 12.11

12.12 other phenomena

Many unexpected and surprising events occur during such a complex and years-lasting process as language development definitely is. But since many among these phenomena are already exhaustively described in the literature, let's just briefly describe two observations which were, in a certain sense, "salient":

12.12.1 multilingual c-scheme mismatch

The journal log entry (f2;5;18) describes an interesting dialogue which happened one afternoon after the mother picked up IM from kindergarten:

FAT: ako bolo v (sk: "how was it in") Kite? (sk. locative of the German word meaning kindergarten)
IM: MAMA ABHOLEN (de: "mother pick up")
FAT: ako bolo v Kite?
IM: MAMA ABHOLEN
FAT: ako bolo v Kite?
IM: MAMA ABHOLEN

At first sight it is somewhat difficult to see why IM responded, three times in a row, with the answer "mama picked me up" to the question "how was it in kindergarten?". The thing, however, gets more lucid when one realizes that the sequence "ako bolo" and the sequence "abholen" have certain morphophonologic features in common. In other terms, both can in fact be matched by the following C-scheme:22

a.*?bh?olo?

22 The C-scheme is a valid Perl regular expression which matches both strings "ako bolo" and "abholen". In such regexps, the symbol "." matches any possible symbol, the symbol "*" means "match zero or more occurrences of the preceding element" and the symbol "?" means "match zero or one occurrence of the preceding element".

Given that it is evident that the notions associated with the event of "being picked up from kindergarten" are, within the child's mind, definitely more important than smalltalk questions about past events; and given that the father's question was terminated with the term "Kita", which was practically the only attribute of the term "abholen" to which IM was exposed on a frequent basis, IM's thrice-repeated answer was neither nonsense nor surprising. On the contrary, it was a meaningful and true answer to a question which her C-schemes processed as a question meaning something like "who picked You up from kindergarten?". Hence, not the term "slip of the tongue" but rather "slip of the ear" could be used to describe such a phenomenon.

We consider this case of multilingual perceptive parapraxis to be of particular interest because it can potentially result in a method, or even a set of experiments, allowing one to elucidate the problem of the development of C-schemas, which is, contrary to the development of directly observable P-schemas, quite difficult to empirically measure and assess.
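The C-scheme can also be verified mechanically. The following short Perl sketch (an illustration added here, not one of the thesis scripts) confirms that the regexp indeed matches both strings:

    #!/usr/bin/perl
    # Verify that the C-scheme matches both "ako bolo" and "abholen":
    # in "ako bolo", h? and the final o? match zero occurrences;
    # in "abholen", .*? matches the empty string and o? matches nothing.
    use strict;
    use warnings;

    my $cscheme = qr/a.*?bh?olo?/;
    for my $utterance ('ako bolo', 'abholen') {
        printf "%-10s : %s\n", $utterance,
            $utterance =~ $cscheme ? 'matched' : 'not matched';
    }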
end c-scheme mismatch 12.12.1

12.12.2 compression of information

Another interesting phenomenon was observed at (f2;6;18), at the onset of a period of increasing phrasal productivity. During a trip through the forest with another family with the following members: H = father, M = mother, J = older son, T = younger son, IM enumerated the list of people who should go home in the following manner:

ALLE NACH HAUZE (de: "all home")
AUCH T, M AUCH (de: "also T, M also")
PAPA AUCH J (de: "father too J")

What is striking is the last sentence, which was uttered after a few seconds of silence during which IM apparently tried to remember the name of J's father (i.e. H). Since she could not remember it (or avoided its pronunciation), she ultimately found her way out by producing

PAPA AUCH J

which, in correct German, would have to be the 6-word "auch J und auch sein papa". But not caring much about the correct rules of grammar, which would oblige her to articulate sentences twice as long as necessary, IM expressed the same communicative intention with three words only. Hence, at least in this case, the "optimizing" forces inviting her to express her intention with as few resources as possible were definitely stronger than the socializing and normative forces obliging her to produce only grammatically correct constructions.

end compression of information 12.12.2

In this manner we could continue, and discuss one entry of the observation log after another. For example, we could discuss not only IM's pre-sleep monologues, but also mention productions which she used to cry out of her sleep, or uttered immediately after waking up. We could focus on one meaning and describe the development of the labels which IM used to denote it. Or, as was the case when discussing the DING-DONG mystery, we could focus on one label and describe the development of its meanings. Or we could list IM's first adjectives, questions and syllogisms. Or publish digital versions of the observation log as well as of all other recorded materials. But given the present lack of IM's conscious and reflected consent to the publication of her personal data, we think it is now time to conclude this chapter dedicated to the development of this particular child.

end other phenomena 12.12

f(2;4;6)
ICH HABE AJN HUND
HABE AJN HUND
HABE AJN HUNDI
HABE AJN HUND
LA LA LA
DA BAUEN DA BAUEN
JA DA BAUEN TATO

end qualitative 12

13 quantitative

13.1 method

The method of the previous chapter mainly consisted of observations and interpretations thematizing the structures produced by one individual toddler. But knowing that science should always aim to unveil not only the individual and specific, but also, and especially, the universal and generic, a hard-core empiricist could rightfully reproach us that what was presented until now was maybe cute, but it was not science. Thus, in order to close and complement this methodological gap, all the effort presented in this chapter shall be subordinated to the two ultimate virtues of the Cartesian method. They are, of course:

1. reproducibility
2. quantification

Reproducibility is to be attained by the exact specification of the input data and by the publication of the computational machinery which transforms the data into information, or even knowledge.
More concretely, every analysis shall include a list of the corpora which were analysed, as well as the bash|PERL|R code which performed the analysis. Thus, instead of using traditional logico-mathematical formalisms, another formalism - a less theoretical and more practical one - shall be used: that of PERL and its regular expressions (regexps). When it comes to quantification, it shall be exactly the use of regexps which shall allow us to transform texts into numbers. By using regexps, which are in their essence nothing else than strings of characters able to match sets of strings of characters, it should be possible to identify, detect and measure the frequencies of occurrence of quite abstract patterns or schemas.

Summa summarum, the method of this chapter shall mix a little bit of data-mining with a little bit of statistics and information extraction in order to attain the goal commonly known as "knowledge extraction".

end method 13.1

13.2 data

The «Child Language Data Exchange System (CHILDES)» (MacWhinney, 2014) is undoubtedly the biggest publicly accessible collection of both recordings of child speech and their transcripts. Since its foundation in 1984 by Brian MacWhinney and Catherine Snow, CHILDES has attracted the interest of thousands of researchers from all over the world and has thus become the most important dataset for the nascent DP discipline. Given its open yet standardized design, CHILDES contains hundreds of megabytes of transcripts representing children's verbal productions and interactions in more than two dozen world languages. What's more, some of these transcripts include morphosyntactic annotations and/or audiovisual recordings which allow a more thorough contextualization of the otherwise pure-text transcripts.

Note, however, that not all transcripts downloaded from the site of the CHILDES project1 shall be analysed. Primo, both the directory "Frogs" and the directory "PhonBank-Phon" are to be removed from the workbench, since they do not contain .CHA transcripts made "in vivo". Secundo, all transcripts of children whose age is higher than the upper bound of toddlerese (i.e. >30 months) are also excluded from the analysis. This can be done by running the agesort.pl2 script, whose main functionality, however, is to divide the transcripts into two datasets:

1. PROTOTODD - the "prototoddlerese" dataset contains transcripts of children not older than 16 months
2. TODDLER - the "toddlerese" dataset contains transcripts of children between 16 and 30 months

1 $ wget -r --no-parent http://childes.psy.cmu.edu/data/
2 http://wizzion.com/thesis/code/childes/agesort.pl

The transcripts contained in the two datasets thus obtained follow the .CHA format, which stipulates that:

1. lines with child-originated speech are marked with the token *CHI
2. lines with mother-originated speech (motherese) are marked with the token *MOT
3. lines with father-originated speech (fatherese) are marked with the token *FAT

and in all the tables which follow, we shall apply the same CHI|MOT|FAT notation to denote the child, resp. the mother or the father. The distribution of the different line types is presented in Table 16.

           CHI        MOT       FAT
PROTOTODD  2248553    320454    13974
TODDLER    1453931    893357    154964

Table 16: Activity of different speakers in two age groups.

Every line of the .CHA file roughly represents a distinct and unique utterance. Thus, Table 16 suggests a first distinction between the two age groups: in the prototoddler period mothers in general produced 42% more utterances than children, while the ratio was more than inverted in the later group.4

4 The ratio 1453931/893357 ≈ 1.614 is quite close to the number φ ≈ 1.618, better known as the "golden ratio" or "golden section". Sapienti sat.
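The raw numbers behind Table 16 can be obtained with a few lines of Perl. The following sketch is our own illustrative re-implementation, not the published agesort.pl (whose actual logic may well differ), and the directory name toddler/ is an assumption:

    #!/usr/bin/perl
    # Count speaker-marked lines (*CHI, *MOT, *FAT) in a directory of
    # .CHA transcripts.
    use strict;
    use warnings;

    my %lines;
    for my $file (glob 'toddler/*.cha') {          # directory name is illustrative
        open my $fh, '<', $file or die "$file: $!";
        while (<$fh>) {
            $lines{$1}++ if /^\*(CHI|MOT|FAT):/;   # speaker-marked lines only
        }
        close $fh;
    }
    printf "%s %d\n", $_, $lines{$_} // 0 for qw(CHI MOT FAT);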
In comparison to mothers, fathers seem to play only a marginal role within both datasets; their presence, however, seems to be significantly higher in the case of the older group (FAT_PROTO/MOT_PROTO ≈ 4.3%, FAT_TODD/MOT_TODD ≈ 17.3%).

end data 13.2

13.3 universals

This section offers an analysis of CHILDES transcripts coming from different languages. Table 17 shows the number of distinct .CHA files (transcripts) which are to be analysed, as well as the languages in which they were spoken.

           ara  deu  eng   fra  jpn  rum  spa  tha  biling  other
PROTOTODD  54   176  1026  152  42   53   56   142  31      107
TODDLER    25   591  2505  410  235  42   140  46   801     1063

Table 17: Repartition of languages in the studied corpus.

It is thus evident that, all in all, CHILDES is strongly biased towards Indo-European languages in general and English in particular. This bias notwithstanding, we shall, in the following analyses, throw all the data into one bag, as if we were studying one sole language.

13.3.1 letters

Let's now run the script6 performing the most simple analysis possible, i.e. the measurement of the frequencies of occurrence of diverse graphemes (i.e. letters) within the utterances produced by children and their parents. This yields the distributions presented in Table 18.

6 http://wizzion.com/thesis/code/freq_1gram.pl

CHI-PROTO   CHI-TODDL   MOT-PROTO   MOT-TODDL    FAT-PROTO   FAT-TODDL
a 32187     a 108278    e 400499    e 1448021    a 21516     a 208869
e 21151     e 103443    o 371032    a 1220454    o 12563     e 195921
o 19400     o 700249    a 344629    t 1101875    n 10394     o 155729
h 17569     n 665280    t 319301    o 1069481    e 10258     t 132600
m 12472     i 654456    h 252510    i 865696     i 9571      n 127344
t 11557     t 611047    n 233620    n 853893     u 8399      i 126082
d 10969     h 482184    i 228587    h 801306     t 8063      s 107667
u 10949     s 438493    s 194898    s 755952     h 7273      h 87652
n 10668     r 383941    u 173597    r 607236     r 5230      r 85586
i 10068     m 364861    r 160583    u 532238     k 5191      u 85416
b 8768      u 338297    y 142311    l 501064     s 4995      l 68172
p 7200      d 323376    l 137550    d 450710     m 4689      d 64320
y 6754      l 310237    m 109696    m 351217     j 4071      m 52445
l 6704      c 234078    d 106698    y 345154     d 3876      y 44005
r 6501      k 226683    w 94381     c 290880     p 3683      c 41410
c 6163      p 192730    g 85473     g 282157     l 3555      k 39026
s 5589      g 189097    k 77850     w 280615     w 2825      g 35873
â 5328      y 186999    c 76826     k 234218     y 2436      p 34373
g 5233      b 184783    p 72531     p 216647     c 2379      w 30227
k 4581      w 138086    b 67397     b 191091     b 1632      b 26719
w 3857      j 107099    f 33203     f 123552     g 1581      v 18730

Table 18: 20 most frequent graphemes according to speakers and age groups.

It can be seen that no matter the speaker and no matter the age group, the vowels A, E and O are always among the four most frequent entities. But a closer inspection of the data can lead to the discovery of certain interesting developmental phenomena occurring between the groups whose contrast interests us the most, that is, CHI-PROTO and CHI-TODDL. It can thus be seen that the utterances of children not older than 15 months are dominated by occlusive consonants (H, M, T, D, N, P), and that other types of consonants, like fricatives (S), trills (R) or laterals (L), attain more dominant positions only in the later period.

A particularly instructive case seems to be the decrease in the ranking of the labionasal occlusive M. While this consonant is the 5th most
Given that all four FAT and MOT distributions consistently place M at rank 12 or 13, the phenomenon of "decrease of importance of M" - and in lesser extent also of P and B - can be potentially explained in terms of divergence from certain potentially innate labiotactic schemata (c.f. 12.5.1) and gradual convergence towards more socially determined articulations. We leave to readers’s ingenuity detection and discussion of other phenomena presented by the table, including mother’s preference for the vowel E and father’s and children’s preference for the vowel A. end letters 13.3.1 13.3 universals 13.3.2 213 n-grams Let’s now focus on distributions of N-grams, that is, the sequences of N letters. Since we have already presented the distribution of letters which is equivalent to distribution of 1-grams, Table 19 presents the distribution of 2-grams (bigrams) as assessed in 7697 transcripts which our script7 had analysed. Table 19: 20 most frequent bigrams according to speakers and age groups. CHI PROTO MOT TODDL PROTO FAT TODDL PROTO TODDL ’a_’ 6168 ’e_’ 341873 ’e_’ 167116 ’e_’ 562814 ’a_’ 6085 ’e_’ 72319 ’e_’ 5487 ’a_’ 273927 ’t_’ 108290 ’t_’ 383994 ’n_’ 4173 ’a_’ 53852 ’^ a’ 4688 ’n_’ 191424 ’_t’ 92762 ’s_’ 335865 ’e_’ 3722 ’s_’ 42541 ’^ b’ 4089 ’t_’ 173103 ’s_’ 89654 ’_t’ 311683 ’aa’ 3242 ’t_’ 39194 ’^ d’ 4072 ’o_’ 157271 ’th’ 89035 ’th’ 266289 ’oo’ 2663 ’n_’ 36965 ’^ m’ 3744 ’s_’ 153264 ’he’ 76868 ’he’ 246986 ’j_’ 2660 ’_t’ 34962 ’y_’ 3699 ’_t’ 123289 ’ou’ 73519 ’n_’ 243248 ’i_’ 2642 ’o_’ 34081 ’h_’ 3660 ’er’ 120991 ’n_’ 66505 ’a_’ 220702 ’o_’ 2448 ’an’ 25338 ’ma’ 3649 ’an’ 118561 ’re’ 57772 ’ha’ 198677 ’_t’ 2233 ’ d’ 24430 ’ah’ 3451 ’i_’ 116078 ’ha’ 57707 ’ou’ 186602 ’_n’ 2194 ’ a’ 24018 ’da’ 3085 ’he’ 113844 ’a_’ 57042 ’o_’ 183667 ’_m’ 2022 ’er’ 23690 ’n_’ 3072 ’in’ 111541 ’yo’ 55805 ’_d’ 182416 ’aj’ 1990 ’ s’ 23146 ’oo’ 3024 ’th’ 102953 ’y_’ 53855 ’er’ 180576 ’an’ 1981 ’i_’ 22763 ’ba’ 2716 ’_a’ 96622 ’an’ 51556 ’ a’ 174639 ’uu’ 1933 ’th’ 22579 ’h_’ 263 ’_d’ 94794 ’u_’ 50789 ’an’ 170761 ’t_’ 1880 ’he’ 22503 ’an’ 2361 ’ch’ 94477 ’er’ 49019 ’re’ 170253 ’_k’ 1866 ’r_’ 22353 ’o_’ 2357 ’_m’ 94073 ’at’ 48593 ’in’ 167113 ’u_’ 1813 ’ha’ 21978 ’^h’ 2289 ’h_’ 93204 ’o_’ 46667 ’at’ 160793 ’ha’ 1758 ’u_’ 21764 ’de’ 2254 ’ma’ 92728 ’_a’ 44251 ’ i’ 159637 ’ p’ 1712 ’ou’ 20780 ’^p’ 2219 ’en’ 92239 ’_y’ 43725 ’r_’ 158742 ’ii’ 1574 ’re’ 20519 ’t_’ 2188 ’ha’ 91373 ’on’ 41773 ’_s’ 156205 ’th’ 1574 ’en’ 19847 ’at’ 2056 ’r_’ 89388 ’_s’ 41663 ’u_’ 147883 ’na’ 1555 ’on’ 17843 In our notation, symbol ^ means "beginning of utterance" and symbol _denotes the pause between the words (normally denoted by a simple blank space) and is understood as a symbol in its own right. In general, vowels A and E at the ultimate word position tend to dominate the lists but in case of the group which interests us most, i.e. CHIPROT O they are followed by a group of bigrams denoting either vowel A or occlusives B, D, and M (and somewhat later also H and P( occuring at the initial position of whole utterance. 7 http://wizzion.com/thesis/code/freq_2gram.pl 214 quantitative It is also worth noting that for this group, the most frequent bigrams having the consonent-vowel (CV) syllabic form are MA, DA and BA and bigrams following the VC form are AH, AN and AT. We consider these findings as consistent with both data commonly reported in DP litterature, as well as with qualitative observations of IM’s first protowords (c.f. Table 10 like MAMA, DADA or BABA. 
As usual, we set aside other potentially interesting questions, like "is the predominance of the long vowels AA, OO, UU, II in prototoddler-directed fatherese a sheer artefact of the corpus8, or do these results point to a somewhat more profound phenomenon?", and hereby direct the reader's attention to Table 20, the output of the scripts assessing the frequencies of 3-grams.

8 These long vowel sequences seem to originate, to a great extent, from transcripts of Japanese and Tamil fatherese.

CHI-PROTO    CHI-TODDL    MOT-PROTO    MOT-TODDL     FAT-PROTO    FAT-TODDL
'^ba' 1692   '_th' 57081  '_th' 58738  'the' 140513  'aa_' 1994   '_th' 187369
'^ma' 1629   'er_' 51736  'you' 54163  'you' 127201  'aj_' 1990   'the' 140513
'mam' 1619   'en_' 51386  'the' 41636  'hat' 112441  '_th' 1151   'you' 127201
'ah_' 1448   'the' 51022  'ou_' 40205  're_' 104532  'ii_' 1148   'hat' 112441
'^da' 1428   're_' 48700  '_yo' 38424  '_yo' 100048  'an_' 1074   're_' 104532
'ama' 1323   '^ja' 41974  're_' 35344  'he_' 98865   'oo_' 1016   '_yo' 100048
'det' 1145   'in_' 40925  'hat' 34041  'ou_' 97958   'on_' 1016   'he_' 98865
'dad' 1118   'no_' 40168  'at_' 30480  'at_' 96754   '_ma' 980    'ou_' 97958
'aa_' 1030   '^no' 39940  'he_' 28589  'is_' 76056   '_na' 823    'at_' 96754
'et.' 998    'her' 39541  'her' 28009  'her' 74624   'aw_' 819    'is_' 76056
'^ah' 971    'ne_' 38460  'ere' 25958  'er_' 74623   're_' 805    'her' 74624

Table 20: 10 most frequent trigrams according to speakers and age groups.

In general, it can be stated that the trigram-related phenomena seem to extend quite naturally the phenomena which were already observed and discussed in relation to bigrams. Word-onset syllables BA, MA and DA thus dominate the prototoddlerese list. But since these trigrams are not fully qualified (they contain the meta-character ^), it can be stated that the most frequent trigrams with equally trigramic phonemic correlates are MAM, AMA, DET and DAD.

In the later period, i.e. in the CHI-TODDL transcripts, one can observe a bias towards the distribution of standard English, marked, of course, by the dominant position of the graphemic trigram (and phonemic bigram) denoting the most frequent word of the English language, the determiner THE. This bias notwithstanding, the word-onset syllables ^JA and ^NO appear at the highest positions of the list, for a reason which we can briefly elucidate only in a footnote.9

9 Execution of grep -P "CHI:\tno" ./toddler/* indicates that within the corpus of later toddlerese, the high frequency of ^no is principally caused by the augmentation of the child's egocentric tendency to answer questions in the negative. In the case of ^ja, execution of the command grep -P "CHI:\tja" ./toddler/* indicates that the situation is complicated by the fact that the grapheme J denotes different phonemes in different languages (compare "jagen" in German with "Jack" in English or "jagami" in Sanskrit). This complication notwithstanding, it seems that the high frequency of ^ja can, to a non-negligible extent, be explained in terms of the Balto-Slavo-Germanic "yes". Thus, for example, the sole transcript 11312/c-00023045-1 shows how the small German boy Leo answered 104 times with the word JA.

Leaving again the question of fatherese aside as a problem of its own, let's now look at motherese. In general, both distributions indicate that the corpus was strongly biased towards English. Thus, the obligatory THE is present (as well as its fragment _TH, preceded by the pause), as are the trigrams HAT and ERE, owing their high ranks to the highly frequent words what/that, resp. where/there/here, within which they occur. What is striking, however, is the position of the trigram YOU. While in frequency lists generated from "standard English" corpora10 the word You is the 17th most frequent and occurs ≈ 9.3 times less often than the most frequent word THE, in speech directed to younger infants it is the trigram You which dominates the list of fully qualified trigrams, occurring 1.3 times more often than THE!

10 https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000

Among all the phenomena observed until now, we consider the mothers' tendency to say You to be the most salient example of what we consider to be the very essence of motherese.
end n-grams 13.3.2

13.3.3 intrasubjective replications

It has already been repeated in multiple places (5, 12.6) that the interpretation of "repetition of information" as a sort of "replication of information" is one among the main tenets of the theory hereby presented. Thus, let's now try to assess the extent to which children repeat their own productions.

Intralocutory duplications

Intrasubjective duplications can be detected by searching for the repetition of a sub-string X within the enveloping utterance-string U. If X is a bigram, this can easily be done by matching the utterance with the regexp

$U =~ /(.{2})\1/g

and for any duplicated sub-string at least two characters long, the matching pattern is

$U =~ /(.{2,})\1/g     (3)

Note that these patterns match only adjacent repetitions, id est such cases where the two instances of the repeated substring are juxtaposed side by side. The script11 confronting the second pattern (i.e. length(X) >= 2) with child-produced utterances yields the outputs presented in Table 21.

11 http://wizzion.com/thesis/code/isipr.pl

CHI-PROTO     CHI-TODDL     MOT-PROTO    MOT-TODDL
ma 1117       ma 11756      ma 1266      ma 2633
pa 294        pa 4545       is_ 733      is_ 2408
da 290        ko 2970       bye 696      em 1904
bye 142       is_ 1651      no_ 547      mm 1338
an 140        ba 1412       it_ 532      pa 1283
ba 136        la 908        na 523       it_ 1177
ta 76         an 764        da 443       e_ 963
na 75         da 730        mm 382       na 820
ah 65         no_ 647       ba 336       a_ 700
woof_ 64      ta 616        an 254       an 692
uh 60         na 600        em 197       ba 505
open_ 59      do 588        boo 177      to_ 468
mommy_ 57     be 580        ing 167      ko 435
cou 53        bye 552       uh 160       no_ 384
vov 48        e_ 536        nyan 158     in 374
mm 45         bo 535        man 157      ing 349
ga 41         pi 468        ha 147       bye 328
he 40         ca 430        pa 143       er_ 319
no_ 40        in 387        nai 135      cher 311
book_ 39      cha 372       to_ 134      li 293
ha 39         ka 344        cou 132      _we 290

Table 21: Duplicated expressions and numbers of child-originated and child-directed utterances in which they occur.

Postponing the discussion12 of the specificities of the data hereby presented to a later date, let's focus on a scientifically more pertinent fact: the overall statistics of duplications. These are shown in Table 22, whose values were calculated by normalization by means of the formula

P(duplication) = ALL_matching / ALL_utterances

whereby ALL_matching denotes the number of utterances produced by CHI (resp. MOT) matchable by the regexp presented in Formula 3, and ALL_utterances denotes the number of all utterances uttered by the person.

12 "Mothers do not say woofwoofwoof as babies do, mothers say manmanman."
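Formula 3 translates into a complete, runnable filter in just a few lines. The following sketch is our own illustration, not the published isipr.pl (whose internals may differ); it estimates P(duplication) for the child-produced lines of the transcripts passed on the command line:

    #!/usr/bin/perl
    # Estimate P(duplication): the share of *CHI utterances containing an
    # adjacently duplicated substring of two or more characters (Formula 3).
    use strict;
    use warnings;

    my ($utterances, $matching) = (0, 0);
    while (my $line = <>) {
        next unless $line =~ /^\*CHI:\s*(.+)/;     # swap CHI for MOT to assess motherese
        $utterances++;
        $matching++ if $1 =~ /(.{2,})\1/;          # adjacent 2+gram duplication
    }
    printf "P(duplication) = %.3f\n",
        $utterances ? $matching / $utterances : 0;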
        CHI     MOT
PROTO   0.041   0.086
TODDL   0.066   0.058

Table 22: Probability that an utterance shall contain at least one adjacently duplicated 2+gram.

This table indicates that intralocutory duplication is most probably to be observed in motherese directed to younger children. Younger children, on the contrary, tend to produce fewer adjacently duplicated sequences13. In the later period, however, they tend to replicate, within one utterance, the fragments of their production more frequently than their mothers.

end intralocutory replications 13.3.3.0

Translocutory replications

Let's now focus on reproduction observed not within one individual utterance, but between two adjacent utterances. Given the speaker S who utters U1 before uttering U2, one can look for replication of patterns between U1 and U2 by simply

1. creating a new datastructure, a "couplet", which concatenates the two utterances and the divisor symbol #, i.e. couplet = concatenate(U1, #, U2)
2. matching the couplet with a regex like

$couplet =~ /(.{3,}).*#.*\1/g    (4)

and this is exactly what is being done by the 3rd line of the script14 whose outputs are in part presented in Table 23. Note that in contrast to Formula 3, the regexp in Formula 4 contains the expression {3,} and not {2,}. This means that in this analysis, we have been looking for repeated strings of three or more characters (3+grams). This design choice was made in order not to pollute the results with repeated bigrams, among which many (e.g. "th", "ch") represent in many languages just a sole phoneme, and whose repetition is thus highly probable. Other design choices are, of course, possible.

13 Or transcribers do not transcribe them as such.
14 http://wizzion.com/thesis/code/isitd.pl
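Again, a minimal sketch of ours (not the original isitd.pl) may illustrate the procedure; it assumes that the input contains the utterances of one single speaker, one per line and in temporal order:

#!/usr/bin/perl
# minimal sketch: probability that two adjacent utterances of one speaker
# share at least one 3+gram
use strict; use warnings;
my ($replicating, $couplets) = (0, 0);
my $previous;
while (my $utterance = <STDIN>) {
    chomp $utterance;
    if (defined $previous) {
        my $couplet = $previous . '#' . $utterance;   # divisor symbol #
        $couplets++;
        # a 3+gram occurring before the divisor must reoccur after it
        $replicating++ if $couplet =~ /(.{3,}).*#.*\1/;
    }
    $previous = $utterance;
}
printf "P(replication) = %.3f\n", $replicating / $couplets if $couplets;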
Table 23: Most frequent translocutory 3+grams.

CHI PROTO MOT TODDL PROTO TODDL det. 332 ja. 8160 you 8698 you 19229 kore. 223 no. 4461 the 3809 the 15302 maman. 210 the 3328 that 1554 what 557 mama. 182 da. 3071 here 1531 that 441 eh. 174 yeah. 2870 what 908 here 300 baby. 162 ein 1979 and 776 ing 2423 ball. 126 that 1670 t’s 627 and 2088 no. 121 aa. 1655 look 550 t’s 2056 daddy. 104 nein. 1536 ing 520 das 1612 aa. 102 en. 1535 there 393 there 1437 papa. 89 there. 1363 that’s 382 she 1399 mommy. 84 here 1190 her 366 her 1323 ooh 81 das 1099 your 344 that’s 1 up. 76 die 1088 where 338 tha 1179 dah 75 der 1006 come 325 ein 1093 ah. 73 this 973 see 300 ich 1091 da. 73 and 883 one 272 est 1069 dog. 67 you 830 yeah 261 der 1030 dada. 62 yes. 772 no. 249 one 1020 uhoh. 61 ich 698 n’t 1005

Some interesting phenomena pop up here. In general, the table is populated by deictic pronouns15, determiners, answer particles and various forms of positional adverbs. In motherese, an injunction to action appears from time to time in the form of a verb ("look", "come", "see"). And, of course, it is very probable that if the current motherese utterance contains the word "you", the next utterance shall contain it as well. The presence of motherese expressions like "t's" and "n't" also suggests the occurrence of first variation sets (that's vs. it's, isn't vs. don't etc.)

What's more, one can see quite clearly a distinction between the language of younger and older children. While the distribution of translocutory duplications of older children is quite similar to motherese16, this is in no way the case for younger children. Repetition of "abstract" deictics is quite rare and seems to be limited to a few particular children like Jens and Hiromi. On the other hand, the list of repeated tokens is dominated by words denoting concrete persons and objects ("maman", "mama", "baby", "ball", "daddy", "papa", "mommy", "dog") and particles with undefined content ("eh", "aa", "ooh", "ah", "uhoh") potentially referring to emotional states. Even the adverb/preposition "up" is present, sometimes probably serving the function of the injunction "raise me up!" or "look up!".

This being said, let's now look at the overall statistical properties of the distributions thus obtained:

        CHI    MOT
PROTO   0.08   0.37
TODDL   0.28   0.38

Table 24: Probability that both parts of an utterance couplet shall contain at least one identical 3+gram.

A significant increase in the amount of translocutory replications is observed when one compares the data of younger and older children. This is consistent with what was observed in the case of intralocutory duplications (Table 22), but here the phenomenon is even more marked. Motherese, on the contrary, seems to keep the property of repeating a 3+gram slightly more often than once in three utterance couplets.

end translocutory replications 13.3.3.0

Many minor phenomena aside, the preceding subsubsections have briefly shown:

1. a fast17 and frugal18 regexp-based method of extraction of repetitive patterns from huge corpora
2. that the language of children younger than 15 months contains fewer intralocutory resp. translocutory replications of 2+ resp. 3+gram sequences than the language of older toddlers

This being said, let's now focus on replication of structures which is to be observed not in and/or between utterances produced by one speaker but in utterances produced by multiple speakers.

end intrasubjective replications 13.3.3

15 DET is the Danish deictic meaning "that" and KORE is the Japanese deictic meaning "this". Transcripts of the Danish children Anne (e.g. 11312/c-00021705-1) and Jens (e.g. 11312/c-00021750-1) and of the Japanese girl Hiromi (e.g. 11312/c-00009753-1) seem to be to a great extent "responsible" for the high scores of these words.
16 The most salient exception to this being the tendency to repeatedly utter "ja." or "no.".
17 All presented analyses were performed in less than a minute on one single 2.5GHz core.
18 All scripts are shorter than 42 lines of pure PERL, including loading the corpora, cleaning them of metadata and most salient noise, parsing and printing the result.

13.3.4 intersubjective replications

Intersubjective replication is equivalent to imitation. It is observed if and only if two distinct subjects produce the same construction in a very limited timespan. To make things simple, this section shall be concerned only with detection of the most trivial intersubjective replications: those which immediately follow each other.

PROTO CHI-INIT | PROTO MOT-INIT | TODDL CHI-INIT | TODDL MOT-INIT
ball 74 | ball 42 | the 2038 | the 3058
baby 68 | baby 40 | that 1534 | here 1731
daddy 50 | here 33 | here 1045 | that 1175
up. 47 | det 26 | you 969 | you 997
guh 40 | apple 21 | no. 895 | ing 993
det 36 | byebye. 21 | what 764 | what 539
dada 33 | the 20 | yeah. 528 | one 502
more 33 | daddy 20 | ing 492 | there 447
that 30 | that 19 | there 466 | ein 436
byebye. 29 | down 19 | das 463 | and 361
book 29 | open 19 | ein 444 | t's 333
hi. 28 | mommy 18 | want 439 | das 314
water 25 | hi. 17 | one 348 | ich 285
car 25 | dada 17 | ja. 346 | det 261
down. 24 | book 16 | t's 342 | der 240
block 23 | block 16 | and 328 | want 234
open 23 | big 16 | non 327 | this 230
mama. 23 | guh 15 | yeah 326 | que 216
bottle 22 | you 15 | daddy 324 | est 209
big 20 | water 14 | hat's 303 | can 203
non 20 | car 14 | det 303 | die 184
uhoh. 19 | boo 14 | baby 275 | c'est 161
no. 18 | dad 13 | oh. 274 | see 158
agu 18 | bye. 13 | her 262 | it's 155
backpack 17 | okay 12 | where 245 | oh. 153
duck 17 | bye 12 | there. 230 | den 149
doggie. 17 | and 12 | mhm. 226 | they 144
apple 17 | uhoh. 12 | ich 219 | baby 138
here 16 | duck 12 | that's 217 | that's 138
dirty 16 | sticky 11 | can 213 | car 138

Table 25: Most frequent words replicated from child to mother (CHI-INIT) and mother to child (MOT-INIT).

Technically, the extraction is performed by means very similar to those which extract intrasubjective translocutory replications (c.f. previous section). The only difference being, of course, the origin of UTT1 and UTT2. While in the detection of intrasubjective replications UTT1 and UTT2 are uttered by the same person, in the case of intersubjective replications it cannot be so, and the additional condition speaker(UTT1) ≠ speaker(UTT2) has to be implemented in the code. Another thing which is to be carefully considered is the identity of the person who initiated the replication (i.e. uttered UTT1) in contrast to the identity of the person who reacted (i.e. uttered UTT2). On the following pages these shall be distinguished by the attributes INIT resp. REACT.

Listings generated by the script19 implementing such considerations are presented in Table 25. They indicate, among other things, that

• entities intersubjectively replicated and shared between mothers and younger toddlers tend to denote concrete physical referents ("ball", "baby", "book", "water", "car", "mama", "bottle", "backpack", "doggie", "apple", "block"), their properties ("big", "dirty", "sticky") or directions along the vertical axis ("down", "up")
• entities intersubjectively replicated and shared between mothers and older toddlers tend to encode more abstract linguistic entities (deictic pronouns, locative adverbs) as well as basic syntactic constructions ("that's", "c'est", "it's")
• children initiate exchanges about different "topics" than mothers do20

Overall statistical properties assessed by the script are presented in Table 26. These are: the number of couplets in which a child utterance precedes or follows the mother utterance (N_C); the number of couplets which have at least one 3+gram in common (N_R); and the probability that a MOT-CHI or CHI-MOT couplet will have at least one 3+gram in common (P_R|C = N_R / N_C).

        CHI-INIT                                     CHI-REACT
PROTO   N_C = 46795, N_R = 6005, P_R|C = 0.128       N_C = 46923, N_R = 4167, P_R|C = 0.088
TODDL   N_C = 378958, N_R = 130713, P_R|C = 0.344    N_C = 378712, N_R = 92340, P_R|C = 0.243

Table 26: Basic statistics concerning the replication of 3+grams between mother and child.

It may thus be seen that in both groups, toddlers initiate more intersubjective replications than they react to. Or, in other terms, that mothers tend to prefer reacting to topic changes over changing the topic themselves. It is as if mothers, not children, were adapting themselves to the currently addressed topic. But it can also be seen that this asymmetry is less marked in exchanges with older toddlers. For while mothers reproduce at least one trigram after approximately every third utterance, their children repeat at least one trigram approximately after every fourth utterance. In the younger group this is not so: the child reproduces a fragment of the mother's talk only once in every 12 utterances and the mother does so only once in every 8 utterances.

19 http://wizzion.com/thesis/code/tsr.pl
20 For example, in the younger toddler group mothers repeated 33 times the word "more" uttered by their child, but only in 9 cases was the word "more" uttered by a mother repeated by her child. Or, in the older group, mothers reproduced the "no." of their children in 895 cases; toddlers repeated the "no." of their mothers only in 133 cases.
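A minimal sketch of ours (not the original tsr.pl) illustrates the role of the speaker condition; it assumes input lines of the form "SPEAKER<tab>utterance" in temporal order:

#!/usr/bin/perl
# minimal sketch: probability that a couplet of two adjacent utterances of
# two distinct speakers shares at least one 3+gram, grouped by initiator
use strict; use warnings;
my (%n_c, %n_r);
my ($prev_spk, $prev_utt);
while (<STDIN>) {
    chomp;
    my ($spk, $utt) = split /\t/, $_, 2;
    if (defined $prev_spk and $prev_spk ne $spk) {   # speaker(UTT1) != speaker(UTT2)
        $n_c{$prev_spk}++;                           # couplet initiated by previous speaker
        $n_r{$prev_spk}++ if "$prev_utt#$utt" =~ /(.{3,}).*#.*\1/;
    }
    ($prev_spk, $prev_utt) = ($spk, $utt);
}
printf "%s-initiated: P(R|C) = %.3f\n", $_, ($n_r{$_} // 0) / $n_c{$_} for sort keys %n_c;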
These distinctions notwithstanding, we consider it worth mentioning that there seems to be, in fact, one thing common to both age groups: the ratio between the probabilities. That is, Table 26 indicates that, statistically speaking, it is ≈ 1.4 times more probable21 that a replication-containing couplet was initiated by the child and not by the mother.

end intersubjective replications 13.3.4

Thus ends our brief excursion through the realm of linguistic universalia. It could undoubtedly continue, for example by following the direction indicated by Table 27:

        CHI        MOT
PROTO   1148^22    3329^23
TODDL   2454^24    1319^25

Table 27: Distributions of occurrences of the marker for laughing in diverse subsets of the CHILDES corpus.

and a lot of ink could be spilled by tentatives trying to offer a serious, scientific, fully cartesian and p-value-endowed answer to the question "how is it possible that the CHA format's marker laugh is 2.5 times more frequent in transcripts of prototoddlerese when it contains 4013 fewer transcripts than the corpus of toddlerese?". But given the importance, intensity, diversity and perennial actuality of the topic (Aristotle, 5 BC), the role of laughing in the development of mind cannot be addressed in the current pamphlet to the extent it merits.

Instead of doing so, let's now fully admit that in the case of analyses of a corpus so strongly biased towards the English language as the one hereby studied, it is maybe wiser to stop babbling about "universals"26 and rather start assessing, evaluating and interpreting the central tenets of our theory in "English-specific" terms.

end universals 13.3

21 0.128/0.088 = 1.45; 0.344/0.243 = 1.42
22 $ grep laugh ./prototoddler/* |grep CHI |wc -l
23 $ grep laugh ./prototoddler/* |grep MOT |wc -l
24 $ grep laugh ./toddler/* |grep CHI |wc -l
25 $ grep laugh ./toddler/* |grep MOT |wc -l
26 Or do so elsewhere, c.f. (Hromada, 2016e).

13.4 english-specific

In this section we shall present the results of a few data-mining experiments which concerned only those parts of the CHILDES corpora which:

1. transcribe interaction between English-speaking adults and English-speaking children
2. also contain morphological and grammatical annotations (i.e. every utterance line is also followed by a %mor line and a %gra line)

Table 28 contains overall statistics27 of the datasets fulfilling these conditions, obtained by running the script langsort.pl28.

                  PROTO (<16 months)   TODDL (>16 months, <31 months)
Investigators     10                   35
Subjects          86                   288
Transcripts       330                  1335
CHI utterances    42229                293751
MOT utterances    196781               370972
CHI words         132927               1035341
MOT words         1076028              1921131

Table 28: Counts related to morphologically annotated English-language transcripts analyzed in this section.

As can be seen, the corpus still contains a non-negligible amount of data describing interactions of almost a hundred younger toddler subjects and almost three hundred older toddler subjects. Given that the data were collected and transcribed by dozens of diverse investigators, it can be expected that certain knowledge about generic tendencies could be attained if ever the data was to be processed in a stringently quantitative manner.

27 All values were obtained by means of the standard UNIX utility wc (e.g. the amount of letters in CHI utterances was obtained by executing the shell command $ grep -P '^CHI' ../toddl_english/*.cha |wc -c). Note that wc's operational definition of "word" (i.e. a continuous sequence of characters separated from other words by blank spaces) strongly overlaps with, but is nonetheless not completely equivalent to, what "word" means in linguistics.
28 http://wizzion.com/thesis/code/langsort.pl
Instructions and definitions of the CHILDES Manual (MacWhinney and Snow, 1991) should also be taken into account more strictly than was the case in the preceding "universals" section. Other details of text preprocessing are mentioned in the annex (??).

13.4.1 utterance-level constructions

Table 29 contains the most frequent utterance-level constructions obtained by launching one simple command29. That communication of younger children is dominated by non-linguistic behaviours (vocalizations, babbling, crying, laughing etc.) is hardly surprising. Nor is it much surprising that younger children tend to produce shorter utterances. Nor the fact that the vast majority of multiword motherese utterances are short fixed expressions (e.g. "come on", "that's right", "oh dear", "good girl").

Observation of certain similarities between the distributions of child-directed speech and speech produced by older children can lead one to the hypothesis that these distributions correlate. In order to verify the hypothesis, a simple script was programmed30 which merged two complete distributions into one table. Subsequently, Pearson correlation coefficients were calculated; they are presented in Table 30.

One may thus observe the existence of statistically significant (i.e. p < 0.05) correlations in all cases except one: no statistically significant correlation was observed between MOT-TODDL and CHI-PROTO. This seems reasonable, for how could the language of a young toddler a priori correlate with the language which the mother shall use when the child is older? In the reverse direction, however, a weak (cor = 0.022) but nonetheless statistically significant correlation is observed: thus, there exists a certain relation between the distribution of utterances in language produced by older children and the distribution of utterances in language heard by younger children. The strongest correlation, however, is to be observed between MOT-TODDL and CHI-TODDL, which can be potentially explained in terms of convergence of toddlerese towards "the golden standard" actualized by the language of the mother.

This being said, let's now conclude this brief overview of utterance-level distributions with Table 31 which presents the following quantities:

• N_d: the number of distinct utterances present in the corpus
• P_d: the probability that an utterance is distinct (N_d normalized by the number of all utterances, c.f. Table 28)
• H: the Shannon entropy of the utterance distribution, calculated31 as H = −Σ_i P(x_i) log2 P(x_i) (Shannon, 1948), where P(x_i) denotes the probability of occurrence of the i-th utterance (e.g. its relative frequency of occurrence)

Given that the Shannon entropy can be understood as a measure of uncertainty and unpredictability, it may be stated that the production of younger children yields the most predictable transcripts. The production of older children is much less predictable and every new utterance seems to bring about twice as much information content (≈ 13.7 shannons instead of 6.8). And utterances produced by mothers are even less predictable.

29 grep -h -P "CHI: [^0]" ./toddl_english/* | sort |uniq -c |sort -g -r
30 http://wizzion.com/thesis/code/correlator.pl
31 http://wizzion.com/thesis/code/entropycalc.pl
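A minimal sketch of ours (not the original entropycalc.pl) of such an entropy computation, assuming one utterance per input line:

#!/usr/bin/perl
# minimal sketch: Shannon entropy (in shannons) of an utterance distribution
use strict; use warnings;
my (%freq, $total);
while (<STDIN>) { chomp; $freq{$_}++; $total++; }
die "empty input\n" unless $total;
my $H = 0;
for my $count (values %freq) {
    my $p = $count / $total;        # relative frequency P(x_i)
    $H -= $p * log($p) / log(2);    # H = -sum_i P(x_i) log2 P(x_i)
}
printf "N_d = %d, H = %.3f\n", scalar(keys %freq), $H;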
end utterance-level constructions 13.4.1

13.4.2 pivot schemas

In item 9.4.4, a pivot schema was defined as a two-word schema in which one word ("the pivot") recurs frequently in the same position and the other word varies. In order to detect potential pivot words, let's define a sort of "pivoteness" score as:

score_pivoteness = F_N-gram × length(N-gram) = F_N-gram × N

which is to be calculated for every continuous N-gram which occurs in the corpus and has more than X characters (i.e. N > X). For example, if the corpus contained only 4 utterances consisting solely of the expression "dogs" and one utterance consisting of the expression "dog", and if the parameter X was set to 2, the score-attributing script32 would attribute the score 4 × 4 = 16 to the tetragram "dogs", the score 5 × 3 = 15 to the trigram "dog" and the score 4 × 3 = 12 to the 3gram "ogs". However, the bigrams "do", "og", "gs" as well as the unigrams "d", "o", "g", "s" would be ignored since the parameter X = 2.

Table 32 lists the top thirty 8+grams (i.e. X = 7)33 extracted from all CHILDES transcripts of English-speaking children not older than 2 years and 7 months34. As may be seen, more than half of the most salient pivots are onset expressions initiating the utterance (marked by the starting symbol ^) and the rest is divided between expressions which end the utterance ("in there", "'s that?", "on there", etc.) or are in the midst of it (" in the ", " on the ", "another", "little").

It is, however, quite probable that even among these pivot candidates there would be some which are not true pivots because they occur only in a restricted amount of contexts. But in a case like ours, when all contexts are known, such "false pivots" can be potentially identified by an algorithm which, for every pivot candidate C:

1. assesses the distribution of contexts35 D_C
2. calculates the Shannon entropy of D_C

and this is, indeed, the procedure actualized by the script pivotentropy.pl36 whose outputs37 are introduced in Table 33. As may be seen, the results presented in Table 33 are quite similar to the results already presented in Table 32. There exists, indeed, a statistically significant correlation between score_pivoteness and H_contextual (i.e. Spearman's non-parametric rank correlation test yields p-value < 2.2e-16, ρ = 0.474). Since the evaluation of score_pivoteness is less costly than that of H_contextual, and since the entropy values are, so to say, more precise than score_pivoteness, the fact that these two measures tend to correlate may turn out to be quite useful in applied NLP practice.

32 http://wizzion.com/thesis/code/exh.pl
33 Note that the choice of the parameter X was to a great extent arbitrary and only to a much lesser extent motivated by the "magical number seven, plus or minus two" postulated by Miller (1956).
34 The complete list of all 8+gram expressions and their associated pivoteness7 scores is available at http://wizzion.com/thesis/results/pivots_english_7.
35 In what shall follow, the term "context" means "two words to the right" if the pivot initiates the utterance, "two words to the left" if it terminates the utterance and "one word to the left and one word to the right" if it is in the midst of it.
36 http://wizzion.com/code/thesis/pivotentropy.pl
37 The list of 1000 pivot7 schemas with highest pivoteness and their CHILDES H_contextual entropies is downloadable at http://wizzion.com/thesis/results/pivot7_entropies_english.1000
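A minimal sketch of ours (not the original exh.pl) of the score_pivoteness computation, assuming one utterance per input line and ^ as the onset symbol:

#!/usr/bin/perl
# minimal sketch: frequency * length scores for all N-grams with N > X
use strict; use warnings;
my $X = 7;
my %freq;
while (my $utt = <STDIN>) {
    chomp $utt;
    $utt = '^' . $utt;                          # onset meta-character
    for my $n ($X + 1 .. length $utt) {         # every N-gram with N > X
        $freq{ substr $utt, $_, $n }++ for 0 .. length($utt) - $n;
    }
}
my %score = map { $_ => $freq{$_} * length $_ } keys %freq;
my @top = sort { $score{$b} <=> $score{$a} } keys %score;
print "$score{$_}\t$_\n" for @top[0 .. (@top < 30 ? $#top : 29)];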
Summa summarum, constructions extracted from CHILDES by means of the above-mentioned methods strongly resemble Bruner's "formats" (5).

end pivot schemas 13.4.2

13.4.3 pivot instances

Let's now focus on pivot instances, that is, on expressions which are matched by pivot schemas. We define: an utterance U instantiates the pivot schema P if and only if U can be matched by the P-representing pattern. In case we choose PERL regexes as a means of representation of pivot schemas, this definition can be formalized as

$U =~ /$P/

whereby =~ denotes the regex-matching operator. This notion is implemented by the script pivot_utterance_global.pl38 which, when initialized with a list of pivots as its input data, returns the frequencies of utterances which instantiate one among the ten pivots with such high informational content.

38 http://wizzion.com/thesis/code/pivot_utterance_global.pl
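A minimal sketch of ours (not the original script) of such pivot-instantiation counting; the pivot list given here is purely illustrative:

#!/usr/bin/perl
# minimal sketch: frequencies of utterances instantiating some pivot schema
use strict; use warnings;
my @pivots = map { qr/$_/ } ('^that\'s', '^I want', 'in there\.$');   # illustrative pivots
my %freq;
while (my $U = <STDIN>) {
    chomp $U;
    for my $P (@pivots) {
        if ($U =~ $P) {        # U instantiates the pivot schema P
            $freq{$U}++;
            last;
        }
    }
}
print "$freq{$_}\t$_\n" for sort { $freq{$b} <=> $freq{$a} } keys %freq;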
The most frequent among such pivot-instantiating utterances are listed in Table 34 along with their respective frequencies of occurrence. The list makes evident certain usage-oriented, egocentric (e.g. "I want it", "that's mine") and attention-sharing ("look at this one") tendencies potentially inherent to human toddlers.

But in order to be sure that this is indeed the case and not just an artefact of the method with which we treat the corpus, let's now slightly readjust the methodology: let's NOT throw all data coming from all children into one bag which is subsequently analyzed, but instead, let's keep all utterances well associated to their locutors in order to identify such utterances which are being spoken by the biggest number of individual locutors. Operationalization of such a methodology into a PERL script39 makes it possible to pose the following question: Which pivot-instantiating utterance was uttered by the biggest number of distinct children? The 45 top-ranking utterances are listed in Table 35 as an answer40.

As before, this more horizontal analysis indicates that toddlerese tends to be dominated by level-0 constructions:

1. encoding deictic focusing of attention on some object
2. expressing wanting or egocentric possession
3. asking for more information

or level-1 crossovers of such level-0 schemas like, for example, (another one) × (I want) → (I want another one). Q.E.D.

end pivot instances 13.4.3

13.4.4 pivot grammars

The three tables which follow aim to elucidate more closely the concrete substance of some pivots with big H_contextual41. Before leaving, we remind the reader of the fact that all "grammars" presented hereby are "intersubjective" in the sense that they were extracted from a corpus of transcripts produced by distinct children. Thus, it is more reasonable to label the micro-grammars hereby introduced as "social" rather than purely individual (e.g. "cognitive").

But given the size of the CHILDES sample which was analyzed, and given that the condition of "random sampling"42 would hold - which is not granted - it could be, more or less, expected that the "popularity" of the utterances hereby unveiled characterizes not only the English language as a mutually shared intersubjective entity, but could also characterize the intensity with which certain structures are encoded in the mind of an individual.

end pivot grammars 13.4.4

It seems as if a non-negligible amount of salient phenomena were revealed during the analysis of the English parts of the CHILDES corpus. Primo, distributions of utterance-level constructions indicated that

• communicative tentatives of prototoddlers are dominated by non-linguistic means (29)
• the distribution of utterances which mothers say to younger children significantly correlates with language produced by older children (30)
• productions of mothers and older toddlers are less predictable than productions of younger toddlers (31)

Secundo, analyses aimed at pivot schemas and their instances (35) suggest that

• the most salient (32) and potent (33) pivot schemas coalesce around expressions used for location-related and deictic "pointing" and expressions of "wanting"
• the most frequent instances of pivots tend to refocus the interlocutor's attention to something else ("another one"), reinforce the current situation ("I want some more"), or express the child's egocentric ownership ("that's mine")
• certain utterances can be easily explained as crossovers between two frequent pivots (i.e. "I want" × "another one" → "I want another one")

Tertio, closer inspection of certain pivot schemas instantiated in utterances of the biggest number of distinct children shows that

• non-abstract referents of the child's linguistic pointing are mainly animates (Daddy, elephant, cow, horse) and color attributes (green, red, yellow, orange, blue) (36)

39 http://wizzion.com/thesis/code/pivot_utterance_distinct.pl
40 http://wizzion.com/thesis/results/utterances_with_pivots_distinct_children_sorted
41 Quantities in square brackets denote the utterance's "popularity", i.e. the number of distinct children who have uttered the construction.
42 Id est, that CHILDES corpora in general represent a random sample of the child's normal verbal interactions.
• children "want" a drink, to see, and to play (37)
• the pivotal affinity of the adjective "little" is in part caused by concrete referents (piggy, baby, ball), in part by fixed expressions ("a little bit") and in part by expressions belonging to both classes ("Mary had a little lamb", "twinkle twinkle little star")43

This being said, the reader is cordially invited to explore the "results" files in order to identify other interesting (ir)?regularities potentially allowing us to increase the amount of knowledge we have about the Weltanschauung of a modal English-speaking toddler.

end english-specific 13.4

But many other, somewhat more universal "facts" were mined from the CHILDES corpus in the first half of this chapter. Aside from the fact that mothers interacting with younger children laugh significantly more often than mothers interacting with older children (27), our initial attention was captivated by the relatively frequent44 occurrence of the nasal labial consonant M in productions of younger children (18). When it comes to expressions composed of more than one signifier, one fact issuing from the analysis of child-directed motherese struck us as a particularly salient one: the use of the 2nd person singular pronoun "you" (13.3.2), significantly more common than in standard corpora.

Extraction of two or more replicas of 2+gram sequences juxtaposed to each other within one utterance led us to the conclusion that intralocutory duplications (13.3.3) are most frequently observed in motherese directed to younger children. Subsequently, the analyses of translocutory duplications - that is, repetitions spanning multiple utterances - revealed a structural distinction between the language of younger and older children: while prototoddlers use repetition of meaning-carrying "lexical" morphemes ("mama", "baby", "ball", "daddy"), repetitions of older toddlers are populated by members of the closed set of "grammatical" morphemes ("the", "this", "yes", "here") (23). The latter distribution being similar to distributions of the adult grammar, it was hypothesized that during the process of development, the child's language gradually adapts to the language system of surrounding linguoracles, especially the mother.

A following analysis of "intersubjective replications" - i.e. of cases where a word uttered by one locutor was immediately re-uttered by the other (25) - indicated that the hypothesis of a child unilaterally adapting to the mother is not sufficient. More concretely, the summary results presented in Table 26 led us to state that "mothers tend to prefer reacting to topic changes over changing the topic themselves". Thus, it seems that in the long run - during weeks, months and years - it is the child who adapts to the mother, but in a short span - in concrete scenes lasting seconds and minutes - it is the mother who adapts her topic, her focus, her attention to that of the child.

Aside from all these phenomena - and all others explicitly discussed in the preceding chapters - we find it important to repeat once more the methodological objective behind this chapter. That is, to show that both relevant and interesting "knowledge" can be extracted from CHILDES corpora by means of a simple, fast and unambiguously reproducible method of extraction of patterns attained by means of matching the corpus with PERL-compatible regular expressions.

end quantitative 13

43 The child's growing exteroceptic, proprioceptic and/or spatial awareness of the fact that she is a "little girl" also plays, of course, an important role.
44 I.e. in contrast with older children.

CHI MOT PROTO TODDL PROTO &=vocalize . 10683 yeah . 2337 1344 &=babble . 9126 no . 1180 &=nonspeech . 3856 oh . 1009 &=cry . 3003 894 &=involuntary . 381 &uh . 5206 TODDL &=involuntary . 6609 oh . 1969 yeah . 5219 no . 1785 &=nonspeech . 3852 yeah . mhm . 1458 okay . 3777 okay . 2769 yes . 1201 &hmm ? 2591 yes . 2640 there . 1011 &=speechplay . 2563 mhm . 379 &=laugh . 1204 huh ? 1000 come on ! 2002 right . 239 &ah . 1129 look . 985 here . 1900 &hmm ? 223 Mama . 816 that . 871 huh ? 1582 there . 206 &=cough . 773 here . 860 uhoh . 1576 what ? 106 Dada . 717 what’s that ? 846 no . 1456 that’s right . 104 &=labial . 601 Mummy . 764 &=laugh . 1377 what ? 1156 well . 86 &eh . 587 okay . 719 look ! 1153 look . 79 &=laughs . 514 uhhuh . 637 oh . 1144 come on . 67 ball . 489 yup [= yes] . 628 there you go . 1115 that’s it . 55 ooh@b . 422 that one . 558 come (h)ere ! 734 oh dear . 54 &=raspberry . 412 &mm . 476 that’s right . 726 thank_you . 54 &mm . 410 on there . 474 yay . 722 pardon ? 52 baby . 406 in there . 472 ahhah . 690 no ? 52 &u:h . 404 oh no . 472 whoa . 582 whoops . 49 Mommy . 396 this . 454 hello . 509 what is it ? 48 byebye . 389 more . 433 &mm . 464 what’s this ? 47 guh@b . 382 what ?
414 that’s right . 726 uhoh . 331 yay . 722 what’s this ? 47 up . 380 oh dear . 329 ahhah . 690 what is that ? 45 no . 375 baby . 317 whoa . 582 there you go . 42 oo@b . 354 car . 311 hello . 509 here . 38 &a:h . 337 I don’t know . 307 &mm . 464 oh no . 36 Mom . 318 me . 291 hey . 451 uhhuh . 35 uguh@b . 317 what’s this ? 281 what’s that ? 409 good girl . heh@b . 310

Table 29: Most frequent utterance-level constructions produced by English-speaking mothers and children in 2 phases of their development.

            MOT-PROTO                                          MOT-TODDL
CHI-PROTO   t = 28.4625, df = 32679, p ≤ 2.2e-16, cor = 0.155    t = 0.5, df = 74078, p = 0.6126, cor = 0.0019
CHI-TODDL   t = 5.9801, df = 70555, p = 2.24e-09, cor = 0.022    t = 317.27, df = 110006, p ≤ 2.2e-16, cor = 0.692

Table 30: Correlations between distributions of frequencies of utterances.

        CHI                                       MOT
PROTO   N_d = 3645, P_d = 0.0863, H = 6.824       N_d = 83267, P_d = 0.423, H = 14.2
TODDL   N_d = 120219, P_d = 0.41, H = 13.7        N_d = 199704, P_d = 0.538, H = 15.37

Table 31: Number of distinct utterances in diverse datasets and entropies of their distributions.

Score Pivot | Score Pivot | Score Pivot
18472 ^that’s | 6507 another | 4320 ^I can’t
16368 ^ what’s | 5950 ^that’s a | 4288 ’t know.
16360 ^ I want | 5841 want to | 4239 ^I wanna
13527 ^ where’s | 5810 on there. | 4235 ^ there’s a
10640 in there. | 5808 little | 3950 ^ that one
9632 in the | 4860 ^this is | 3790 going to
9513 ^there’s | 4760 that one. | 3740 ^I want to
8328 ’s that? | 4734 ^ another | 3600 ^ it’s a
7335 ^I don’t | 4667 ^ where’s the | 3591 , Mummy.
7320 on the | 4608 don’t know. | 3552 ^ here’s

Table 32: Thirty 8+grams with highest score_pivoteness.

Pivot | H_contextual
^that’s X Y | 9.25876131528133
^I want X Y | 8.96609935363205
^where’s X Y | 8.95606540894548
^there’s X Y | 8.79381971491988
X in the Y | 8.74578245657441
X on the Y | 8.65695029616604
X Y in there. | 8.5923250584143
^this is X Y | 8.34192768618957
^that’s a X Y | 8.20433784100614
X little Y | 8.01430314990973

Table 33: Ten CHILD-produced pivot7 schemas with highest contextual entropy (in shannons).

Utterance | Frequency
another one. | 143
I want it. | 84
where’s it gone? | 72
that’s it. | 69
what’s in there? | 68
that’s mine. | 67
where is it? | 65
I want another one. | 60
I can’t do it. | 49
look at this one. | 48

Table 34: CHILDES utterances most frequently instantiating some pivot7 schema.

Utterance Children | Utterance Children | Utterance Children
another one. 33 | what’s in there? 13 | little girl. 10
that’s mine. 27 | that’s right. 13 | I want another one 10
I want it. 27 | look at this. 13 | I can’t find it. 1
I want that. 26 | it’s all_gone. 13 | a little one. 10
that’s it. 25 | in the car. 12 | where go? 9
yes please. 23 | I like that. 12 | where are you? 9
I can’t do it. 22 | here’s one. 12 | there , look. 9
where is it? 19 | what’s this one? 11 | that’s red. 9
look at that. 17 | there’s one. 11 | I want this one. 9
go in there. 16 | and there. 11 | go in here. 9
I want that one. 15 | what’s in here? 10 | where’s this go? 8
where’s it gone? 14 | there’s another one. 10 | where’s other one? 8
that’s better. 14 | that’s green. 10 | that’s yellow. 8
little one. 14 | that’s Daddy. 10 | that’s a . 8
I want some more. 14 | put it in there. 10 | that one there. 8

Table 35: Pivot-instantiating CHILDES utterances pronounced by the biggest number of distinct children.

that’s mine [27] it [25] better [14] right [13] green [10] Daddy [10] red [9] yellow [8] orange [7] all [7] nice [6] blue [6] a elephant [6] a cow [6] you [5] my [5] horsie [5] good [5] a car [5] ...

Table 36: Most popular instances of pivot "^that’s X"
I want it. [27] that. [26] that one. [15] some more. [14] another one. [10] this one. [9] this. [8] some. [7] a drink. [7] two. [6] to see. [6] one. [6] more. [6] to play. [5] down. [5]

Table 37: Most popular instances of pivot "^I want X"

little one. [14] girl. [10] lamb. [5] boy. [5] bit. [5] man. [4] car. [4] piggy. [3] ball. [3] baby. [3] a little one. [10] boy. [3] bit. [3] box. [3] that little one [3]. twinkle twinkle little star. [4] Mary had a little lamb. [3]

Table 38: Most popular instances of pivot "X little Y"

14 SUMMA III

Ideas are never static but develop across time and context, constantly cross-fertilizing with other currents of thought. — Edwin F. Bryant

Hence ends the last part of the first volume of the Thesis aiming to offer certain fragments of evidence of the validity of the theory of intramental evolution. Two principal ways of acquisition of such fragments have been presented:

1. qualitative: holistic, naturalistic and multi-modal observations of the development of one specific child, from the prenatal period onwards
2. quantitative analyses of patterns recurrent in transcripts produced by hundreds of children immortalized in the CHILDES corpus

but their combination is, of course, also possible.

14.1 crossroads of thoughts

As is often the case in science, a crossover between the methods can also lead to interesting results. Thus, it was by means of PERL regex pattern matching that the following schema was detected in the behaviour of the son of Kuczaj (Kuczaj and Maratsos, 1975), named Abe (2;5.23):

Listing 7: Some lines from abe009.cha (PID 11312/c-00016245-1) transcript

301 *FAT: okay (.) here's another one a cow ate the carrot .
306 *CHI: cow ate the carrot .
310 *FAT: okay (.) now do this one the boy fell down the stairs .
315 *CHI: boy fell down the stairs .
322 *FAT: dinner was eaten by the boy .
325 *CHI: dinner eaten by the boy .
...
365 *FAT: the cow ate a carrot .
368 *CHI: cow ate a carrot .
372 *FAT: the little boy is happy .
375 *CHI: happy .
...
398 *FAT: a cow ate carrot the .
401 *CHI: a cow ate carrot the .
...
507 *FAT: the cow did not eat his dinner .
510 *CHI: cow didn't eat his dinner he can't get snacks .
516 *FAT: that's right no dinner (.) no snacks here's another game the elephant cannot go home .
523 *CHI: elephant can't go home .
526 *FAT: nobody hit me .
529 *CHI: nobody hit me .
533 *FAT: the boy did not eat any cookies .
536 *CHI: boy can't eat any cookies .
539 *FAT: the cow cannot eat grass .
542 *CHI: cow can't eat grass .
...
551 *FAT: the boy did not sleep .
554 *CHI: boy can't not sleep .
...
582 *FAT: the goat eat did his dinner .
585 *CHI: goat didn't eat his dinner .
588 *FAT: the boy not did eat any cookies .
591 *CHI: boy can't eat any cookies .
...
642 *FAT: we can play some more tomorrow too (.) okay .
647 *CHI: tomorrow too (.) boy can't eat his carrots boy can't eat his carrots
654 *MOT: do you want to go outside for awhile (.) Abe ?
659 *CHI: play outside (.) boy can't eat his carrots .

Closer inspection of the above-listed father-son interaction unveils multiple interesting phenomena:

Primo, Kuczaj's son consistently used the construction "boy can't" in cases where he should repeat his father's "boy did not". Well beyond the objectives of this Thesis is the question whether this phenomenon is to be explained by a mismatch on the level of passive, perceptive morphosyntactic C-structures (c.f. also 12.12.1) or whether it has more to do with a mismatch of productive P-structures.
But mismatch there is, and for a reason unknown, Abe was consistently crossing-over the external schemata of the form "boy did not X" with the private schema "boy can't X".

Secundo, other crossovers between external stimuli (e.g. utterances produced by external linguistic oracles like parents, peers or teachers) and the child's private world of needs, wants and protothoughts are to be observed on lines 510 and 647 of the transcript. In the first case, the father's utterance "cow did not eat his dinner" is augmented with Abe's private "he can't get snacks", which makes the father react to the "snack" topic1 without any preliminary intention to do so.

Even more important - for the purpose of verification of the theory hereby presented - is the crossover construction which emerges at the very end of the transcript, in the moment where the father closes the session with the words "we can play some more tomorrow", thus putting Abe in a position of a brief vacuum where anything can be said. The vacuum is immediately filled by Abe's production and replication (twice on line 647, once on line 659) of the construction "boy can't eat his carrots". Note that nowhere in the transcript had the father uttered a construction with "boy" as a subject and "carrot" as an object2. Thus, the expression with which Abe closes the language game seems to be his own invention, an invention which we consider to be the product of the crossover summarized in Table 39.

306 cow ate the carrot
361 cow ate a carrot
510 cow didn't eat his dinner he can't get snacks
542 cow can't eat grass
536 boy can't eat any cookies
554 boy can't not sleep
591 boy can't eat any cookies
647, 659 boy can't eat his carrots

Table 39: Interphrastic crossover behind Abe's "boy can't eat his carrots".

Given all this, a question can be posed: "why carrots?". Why not "dinner", "cookies", "cheese" or "grass", which are also used as direct objects of "eat-ing" mentioned in the transcript? Why was it the substantive "carrot" which, as Tomasello (2009) would say, had "filled the slot"? It may be the case that multiple cognitive processes and biases are to be taken into account in order to answer the question:

1. the primacy effect: the term "carrot" is the first concrete object of eating mentioned in Kuczaj's "repeat after me" language game
2. the frequency effect: Abe was three times exposed to his father's production of the term "carrot", i.e. more than in the case of "cookies" (2 times), "grass" (2 times) or "cheese" (once)
3. the perturbation effect: the term "carrot" was once heard (line 398) and once produced (line 401) in the syntactically anomalous construction "a cow ate carrot the"3
4. the semantic consistency effect: "boy can't eat his carrots" refers to a more plausible scenario than, for example, "boy can't eat his grass"
5. priming etc.

It seems to us evident that all these processes and biases are to be taken into account by anyone hoping to develop a reasonable theory of crossover among linguistic structures which does not contradict but rather naturally extends the cognitivist, connectionist and usage-based4 paradigms which dominate contemporary developmental psycholinguistics.

1 Snacks are also mentioned in other Abe transcripts: on line 45 of abe004.cha the mother urges the child to eat with the threat "okay (.) come eat or no snacks later on ." and on line 516 of abe017.cha Abe offers his father an apple "as a snack"
2 The word "carrot", in fact, does not occur in any other Abe transcript, only in abe009.cha
3 Exposure to such anomalous stimuli can potentially be assessed in terms of the P600 event-related potential. It cannot be excluded that such P600-related anomalies attain a higher level of salience and activation than terms occurring in coherent contexts.
But since the objectives addressed in the second volume of this work will be principally computational ones, let's now start concluding this theoretical volume with one sole principle which can be immediately deployed in a functional program.

14.1.1 the linguistic crossover principle

The fitness of the product of the crossover of (linguistic) schemes A and B is proportional to the fitness of A and B as well as to the amount of features which A and B share.

end the linguistic crossover principle 14.1.1

A more formal and geometric variant of this principle shall be furnished in the second volume of this work. For the time being, let's just elucidate that by the term "features" we mean not only the overlap between "semantic" features (c.f. "the semantic consistency effect" in the enumeration above) of two "parental" schemes, but also the overlap between prosodic, phonologic, morphologic, syntactic or even pragmatic characteristics of the schemes which are to be fused. As in the case of any creative, poietic act, the form and content, the program and the data, fuse. Thus, "AFE" and "OPICA" yield "API" (12.10.1), "BAJA" and "ANAN" yield "BANAN" (12.11.1) etc. not only because they denote the same meaning but also because they are phonetically similar. "MAHLEN" and "BAUEN" yield "MAUEN" (12.11.2) not only because their signifiants can be matched by the pattern /Labocc A*EN/ but also because within a certain subspace of the enveloping semantic space they tend to be quite close (i.e. they both denote object-manipulating, constructive, creative activities etc.).

4 And with a little bit of luck also mentalist ones.
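Stated as such, the principle can already be cast into code. The following is a minimal sketch of ours, with illustrative fitness values and with feature overlap crudely approximated by shared character bigrams - one among many conceivable feature spaces, not the operationalization promised for the second volume:

#!/usr/bin/perl
# minimal sketch of the linguistic crossover principle: fitness of the
# offspring of schemes A and B grows with parental fitness and feature overlap
use strict; use warnings;

sub overlap {                       # number of character-bigram types shared by two schemes
    my ($a, $b) = @_;
    my (%in_a, %seen, $shared);
    $in_a{ substr $a, $_, 2 } = 1 for 0 .. length($a) - 2;
    for my $i (0 .. length($b) - 2) {
        my $gram = substr $b, $i, 2;
        $shared++ if $in_a{$gram} and not $seen{$gram}++;
    }
    return $shared // 0;
}

sub crossover_fitness {             # proportionality turned into a product, for illustration
    my ($fit_a, $fit_b, $a, $b) = @_;
    return $fit_a * $fit_b * (1 + overlap($a, $b));
}

# hypothetical parental schemes with illustrative fitness values
print crossover_fitness(0.9, 0.8, "cow can't eat grass", "boy can't eat any cookies"), "\n";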
Hence, Abe joyfully utters "the boy can't eat his carrots" not only because "boy eats" is semantically closer to "carrot" than to "grass" but also because - on the morphosyntactic level - the expressions "cow can't eat grass" and "boy can't eat any cookies" are similar enough to induce the activation of a pathway like "X can't eat Y", subsequently filled with the most affine fillers ("boy" before "can't" and "carrot" after "eat").

Subtleties aside, the linguistic crossover principle can be further elucidated by the following aphorism:

14.1.2 of crossovers and analogies (aph)

If the reader has understood that the events which we have labeled as "linguistic crossovers" could elucidate phenomena to which traditional cognitive science refers by the term "analogy", then the reader has understood us well.

end of crossovers and analogies 14.1.2

... and the precept "whenever You notice an analogy or schematization, seek the implicit structural crossover behind it" can turn out to be a useful methodological "rule of thumb" for any researcher potentially interested in our proposal.

end crossroads of thoughts 14.1

14.2 axes of analysis

Diverse aspects of the crossovers produced by IM, Abe or other toddlers can be studied. Of non-negligible importance is the analysis in terms of the temporal interval between the last activation of the crossover's input schemata and the crossover's output product. Thus, in the case of Abe's carrots, minutes had to pass between Abe's productions of all the initial "carrot" and "boy can't" (input) expressions and his final "boy can't eat his carrots" (output). Many crossovers uttered by IM also had the property of mixing together schemata separated from each other and from their product by minutes of other content, c.f. (MAMA + MIMI → MAMI, 12.11.2). But sometimes - as in the case of PIJEN (12.10.1) - the timespan seemed to be even shorter and the crossover seemed to be occurring in short-term memory or even in a much more volatile phonologic buffer. And yet in other cases (12.11.1), a simple trick of letting the child hear "AJAN" caused a 5-month-old latent schema to get reactivated, fuse with the much more recent BAJA and form the globally optimal form BANAN.

Another important aspect is the origin of the input schemata. Analyzed from this perspective, one can state that the nature of the majority of crossovers noted down in this volume was of the following kind:

external × personal → CROSSOVER

whereby "external" denotes the schemata encoded in the stimuli to which the child is exposed (e.g. motherese utterances etc.) while "personal" denotes private and often unique idioglottic structures already encoded and productive within the mind of the given child. Crossovers between two or more purely "personal" schemata also seem possible. Unfortunately for empiric science, they are either impossible to access (as is the case with "dreaming"5) or difficult to recognize as what they truly are (e.g. certain babbling sequences etc.)

end axes of analysis 14.2

14.3 the source of variation

Encoded in the material substrate of the brain, schemata are subject to the same physical laws of entropy and decay as the brain and body itself. Cognitive schemata are not engraved onto some kind of eternally lasting crystal. Humans forget6. Forgetting is a form of variation and, as every form of variation, it can sometimes lead to disastrous loss of information. But not rarely, it can also cause one to discard previous "locally optimal" information, thus giving one the impetus to seek more globally optimal states.

Another source of variation inherent to the child is her tendency "to want another one" and "to play". While many phenomena related to craving and wanting more can be to a large extent explicated in terms of standard behaviorist theories (reinforcement, reward etc.) or the "3rd noble truth" already posited by Shakyamuni some 25 centuries ago (Lama et al., 2005), the child's ever-actual readiness to play does not cease to strike us with such intensity that even after months of observations and empiric research, we still consider our initial definition of the "child" (Section 5.2) as a reasonable and valid one.

What's more, our research has led us to the conviction that a modal toddler is much more a member of the species Homo Ludens (Huizinga, 1956) than of the species Homo Sapiens. And if there is one single thing which should potentially be reproached to the otherwise most advanced and complete theory of linguistic development - i.e. the usage-based theory of Tomasello - then let it be this one:

14.3.1 extending usage-based paradigm (txt)

That language development could not be possible without the child's ability to share attention with other humans is true.

5 For if there is a realm inaccessible to the reason of an adult man, then it is indeed the realm of toddlers' dreams.
6 Sometimes the tendency to forget is so strong that some researchers (c.f. 9.4.2) have even forgotten that humans forget
And it is also true that the recurrence and distribution of patterns among and within the diverse "usage scenarios" to which the child is exposed and in which she is supposed to act, all that is an indispensable prerequisite to the success of the whole process. But a similarly indispensable prerequisite is the child's tendency to play with sounds, words, sentences and whole contexts. To laugh, to sing, to talk to herself, to say "no" when the child already knows that the only word which her interlocutor does NOT want to hear is..."no". To playfully explore the limits of principles and rules and to do so in order to break them. To playfully explore the limits of one's world.

end extending usage-based paradigm 14.3.1

And to feel Joy during and because of all of that.

end the source of variation 14.3

14.4 from selection to replication

The principal source of variation thus elucidated, the theory of intramental evolution still lacks a component without which it could be neither formalized nor translated into a functional computer program. That is, the description of the bridge between the process of "selection" and the process of "replication". What is still missing is such a fitness function which could be pertinent to the process of language development. In other terms, what we still lack is a criterion by means of which one's language-processing system could evaluate which schemata (or their ordered sets) are "fit" for linguistic communication and which are not. We posit the following principle in order to fill this gap.

14.4.1 the principle of exogenous selection (def)

The more a schema S encoded in a cognitive system C matches the data produced by an external oracle O, the more probable it is that S shall replicate into another region of C.

end the principle of exogenous selection 14.4.1

Stated in more Piagetian and less probabilistic terms: whenever the schema succeeds in assimilating a linguistic stimulus produced by a person endowed with implicit authority7, the schema gets copied into another region of the child's mind. Stated even more simply, the principle can be compressed into the following precept:

14.4.2 mpr precept (aph)

Matching Pattern Replicates.

end mpr precept 14.4.2

And that's it. Given that within the brain of a child replicated schemata are practically immediately subjected to the forces of decay and (play|forget)ful variations, the three words of the MPR precept prepare the territory for the great deal of adaptation which could potentially follow.

Under this view, the computational burden related to information-processing, noise-filtering and the selection of structures is delegated to external oracles. By "uttering this and not that", by exposing the child's schemata to this "data" and not that "data", indeed by such indirect mediated means do the model persons influence the development of structures in the child's mind. Aside from a few dozen innate schemata, the mind of the nascent child is filled mainly with the unceasing swarming of images issued from the unknown realm of φαντασία. All the rest - including labels, rules and criteria - comes from outside, neatly packaged, preprocessed and preselected by caring oracles. It is in this sense that the adjective exogenous is to be understood.

end the principle of exogenous selection 14.4.2

7 The child experiences on a daily basis how persons like mother, father, grandparents, teachers, older siblings, older peers etc. succeed in solving problems which she is unable to solve on her own. In computational theory such problem-solvers able to yield an immediate and correct answer are called oracle machines (Turing, 1939). C.f. Clark (2010) for a discussion of how the involvement of certain oracles, called Minimally Adequate Teachers, can reduce the computational complexity of the problem of grammatical inference of context-free languages.
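Indeed, the precept is directly executable. What follows is a minimal toy sketch of ours - not the program promised for the second volume - in which every schema matching an oracle utterance replicates, and every replica is immediately exposed to (play|forget)ful variation:

#!/usr/bin/perl
# minimal sketch of the MPR precept: a Matching Pattern Replicates,
# and its replica may immediately decay (a crude form of variation)
use strict; use warnings;
my @population = ('mama', 'ba', 'can.t eat');               # illustrative schemata
my @oracle = ('mama is here', "the boy can't eat grass");   # illustrative oracle data
for my $utterance (@oracle) {
    my @replicas;
    for my $schema (@population) {
        next unless $utterance =~ /$schema/;                # Matching Pattern ...
        my $replica = $schema;                              # ... Replicates,
        substr($replica, int rand length $replica, 1) = ''
            if rand() < 0.1 and length $replica > 1;        # and may decay
        push @replicas, $replica;
    }
    push @population, @replicas;
}
print scalar(@population), " schemata after exposure: @population\n";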
Nothing precludes that in a healthy symmetric relation between the parent and the child, the parent can approach the child as if she were a computational oracle able to immediately solve certain types of problems. In such a case, adaptation and evolution lead to a sort of bilateral co-adaptive, co-evolutive interlock in which the child does for the parent what the parent does for the child. That is, by selecting the data-to-be-matched and exposing the parent to it, the child indirectly influences the population dynamics of the schemata encoded in the parent's mind. Et vice versa.

The willingness of many mothers to adapt to the topic proposed by their child (Table 25) as well as their readiness to perceive a fragile and powerless hominid not as an alien but as a "2nd person singular" (Table 23) leads us to the belief that authentic, non-superficial comprehension of "the Other" (Buber, 1937) - i.e. "love" - is not a privilege but rather an essential prerequisite of the successful co-adaptation of two minds and souls whose destinies are inexorably bound to each other.

Love (DEF)

«Strong positive emotional relation to persons, things, ideas or self. Conscious, effective, voluntary acceptation of the value of the other in one's life. Readiness to be hostage for the other (Levinas). The Platonic tradition accentuates that the less perfect is attracted by the more perfect (love as a longing for what one does not have, especially beauty). In the Christian tradition humans respond to the gift of existence (life, world, happiness, friendship, family) with love, id est by devotion to the well-being of the other, which does not await anything in return (love as devotion). Love expresses itself on all levels of human being, physical, personal and spiritual. It is the only solid bond between humans and the ultimate source of everything in the world which has real value.» (Sokol, 1998)

end love (def) 14.4.2

end from selection to replication 14.4

This being said, we end the first volume of this work with the expression of a simple hope. Of a hope that on the preceding pages we have already succeeded in furnishing some fragmented, preliminary and undoubtedly incomplete yet consistent evidence supporting the theory initiated by the two words forming the Thesis: Mind evolves.

end summa iii 14

Part IV

SIMULATIONS

There is an appealing symmetry in the notion that the mechanisms of natural learning may resemble the processes that created the species possessing those learning processes. — D.E. Goldberg and J. Holland

This part can be understood as a collection of four scientific articles. Each article describes a distinct simulation and can be read individually. The aspiration common to all articles is to provide different facets of a cognitively plausible, ex computatione et simulatione proof-of-concept for the theory of intramental evolution.

The zeroth simulation aspires to demonstrate that Evolutionary Computation (EC) can offer useful insights to an agent hoping to break the code of an unintelligible corpus (e.g. to help decode a riddle as cryptic as the Voynich Manuscript).
The first simulation aspires to demonstrate that EC can be a useful means of multiclass classification of textual documents according to their semantic content (and in a Big Data scenario could potentially lead to results as good as those produced by connectionist "deep learning" methods).

The second simulation aspires to demonstrate that EC can help to identify useful solutions to the problem of multiclass part-of-speech classification.

The third simulation aspires to demonstrate that EC can pave the way to induction of plausible micro-grammars from a solely positive corpus of motherese utterances.

15 BREAKING INTO UNKNOWN CODE

15.1 generic introduction

A cryptologue posed with an unbroken cipher is, in a certain sense, in a position similar to that of a child (P+19) which has just been born into our common world. Both the cryptologue and the child are confronted with novel constellations of symbols and features. Both assume that the data with which they are confronted - a motherese (P+90-93) utterance perceived by the child or a cipher studied by the cryptologue - ultimately carry a certain meaningful message. Both combine their ingenuity with relentless perseverance: both accept that the path to success leads through an ocean of trials and errors (P+22). Ultimately, they both transcend their initial state of limited knowledge and attain understanding: the child shall understand the world and the scholar shall understand the cipher.

This analogy between a child and a cipher-breaker can be pushed even further in case we speak about the cipher stored in the enigmatic medieval Voynich Manuscript (VM). This is so because VM contains a non-negligible amount of visual content and it can be rightfully speculated that if VM contains a cipher to be decoded, then the deciphering process (and its subsequent evaluation) shall be founded on the discovery of associations between VM's visual content and the adjacent "voynichese" script. This is - we believe - similar to the position of a visually non-impaired human child who acquires a non-negligible amount of information about her world and her language by means of associating the components of surrounding visual scenes with simultaneously heard phonemic sequences (e.g. "red ball in mama's hand").

This being said, let's now present the first implications of our "child as a cryptologue" analogy, as published in the article Hromada (2016a).

15.2 abstract

The Voynich Manuscript is a corpus of unknown origin written down in a unique graphemic system and potentially representing the phonic values of an unknown or potentially even extinct language. Departing from the postulate that the manuscript is not a hoax but rather encodes authentic contents, our article presents an evolutionary algorithm which aims to find the most optimal mapping between voynichian glyphs and candidate phonemic values. The core component of the decoding algorithm is a process of maximization of a fitness function which aims to find the most optimal set of substitution rules allowing one to transcribe the part of the manuscript - which we call the Calendar - into lists of feminine names. This leads to micro-grammars which allow us to consistently transcribe dozens among three hundred calendar tokens into feminine names: a result far surpassing both "popular" as well as "state of the art" tentatives to crack the manuscript. What's more, by using name lists stemming from different languages as potential cribs, our "adaptive" method can also be useful in the identification of the language in which the manuscript is written.
As far as we can currently tell, results of our experiments indicate that the Calendar part of the manuscript contains names from Balto-Slavic, Balkanic or Hebrew language strata. Two further indications are also given: primo, the highest fitness values were obtained when the crib list contained names with specific in-fixes at the token's penultimate position, as is the case, for example, for Slavic feminine diminutives (i.e. names ending with -ka and not -a). In the most successful scenario, 240 characters contained in 35 distinct voynichese tokens were successfully transcribed. Secundo, in the case of a crib stemming from the Hebrew language, the whole adaptation process converges to significantly better fitness values when transcribing voynichian tokens whose order of individual characters has been reversed, and when lists of feminine and not masculine names are used as the crib.

15.3 introduction

The Voynich Manuscript (VM) undoubtedly counts among the most famous unresolved enigmas of the medieval period. On approximately 240 vellum pages currently stored as manuscript (MS) 408 in Yale University's Beinecke Rare Book and Manuscript Library, VM contains many images apparently related to botany, astronomy (or astrology) and bathing. Written beside, above and below these images are bulks of sequences of glyphs. All this is certain. Also certain seems to be the fact that in 1912, VM was re-discovered by the Polish book-dealer Wilfrid Voynich in a large palace near Rome called Villa Mondragone. Alongside the VM itself, Voynich also found the correspondence - dating from 1666 - between the Collegio Romano scholar Athanasius Kircher and the contemporary rector of Charles University in Prague, Johannes Marcus Marci. Other attested documents - e.g. a letter from 1639 sent to Kircher by the Prague alchemist Georg Baresch - also indicate that during the first half of the 17th century, VM was to be found in Prague. The very same correspondence also indicates that VM was acquired by the famous patron of arts, sciences and alchemy, Emperor Rudolf II.1 Aside from this, one more fact can be stated with certainty: the vellum of VM was carbon-dated to the early 15th century (Hodgins, 2014).

15.3.1 pre-digital tentatives

Already during the pre-informatic era of the first half of the 20th century, dozens, if not hundreds, of men of distinction had invested non-negligible time of their lives into tentatives to decipher the "voynichese" script. Being highly popular in their time, many such tentatives - like that of Newbold, who claimed to "prove" that VM was encoded by Roger Bacon by means of a 6-step anagrammatic cipher (Newbold, 1928), or that of Strong (Strong, 1945), who claimed VM to be a 16th-century equivalent of the Kinsey Report - may seem to be, when looked upon through the prism of computer science, somewhat irrational2. C.f. (d'Imperio, 1978) for an overview of other 20th-century "manual" tentatives which resulted in VM-deciphering claims. After the description of these tentatives and after the presentation of an informationally very rich introduction to both VM and its historical context, d'Imperio adopts a skeptical stance towards all scholars who associated VM's origin with the personage of Roger Bacon3. In spite of the skeptic she was, d'Imperio had not a priori disqualified the set of hypotheses that the language in which the VM was ultimately written was Latin or medieval English. And such, indeed, was the majority of hypotheses which gained prominence all along the 20th century.4
1 Savants who passed through Rudolf's court included Johannes Kepler, Tycho Brahe or Giordano Bruno. The last one is known to have sold a certain book to the emperor for 600 ducats.

2 Note, for example, Strong's "translation" of one VM passage: "When the contents of the veins rip, the child comes slyly from the mother issuing with leg-stance skewed and bent while the arms, bend at the elbow, are knotted like the legs of a craw-fish." (Strong, 1945) Note also that such a translation was the product of a man who was "a highly respected medical scientist in the field of cancer research at Yale University" (d'Imperio, 1978).

3 "I feel, in sum, that Bacon was not a man who would have produced a work such as the Voynich manuscript...I can far more easily imagine a small society perhaps in Germany or Eastern Europe" (d'Imperio, 1978, 51)

4 Note that such pro-English and pro-Latin bias can be easily explained not by the properties of VM itself, but by the simple fact that the first batches of VM's copies were primarily distributed and popularized among Anglo-Saxon scholars of medieval philosophy, classical philology or occidental history.

15.3.2 post-digital tentatives

The first tentatives to use machines to crack the VM date back to the prehistory of the informatic era. Thus, already during the Second World War, the cryptologist William F. Friedman invited his colleagues to form an "extracurricular" VM study group - programming IBM computers for the sorting and tabulation of VM data was one among its tasks. Two decades later - and already in the position of the first chief cryptologist of the nascent National Security Agency - Friedman formed the 2nd Study Group. Again without ultimate success. One member of Friedman's 2nd Study Group was Prescott Currier, whose computer-driven analysis led him to the conclusion that VM in fact encodes two "statistically distinct" (Currier, 1970) languages. What's more, Currier seems to have been the first scholar who facilitated the exchange and processing of the Voynich manuscript by proposing a transliteration5 of voynichese glyphs into standard ASCII characters. This had been the predecessor of the European Voynich Alphabet (EVA) (Landini and Zandbergen, 1998), which has become a de facto standard when it comes to the mapping of VM glyphs upon a set of discrete symbols. The canonization of EVA, combined with the dissemination of VM's copies through the Internet, has allowed more and more researchers to transcribe the sequences of glyphs in the manuscript into ASCII EVA sequences. It is thanks to the laborious transcription work of people like Rene Zandbergen, Jorge Stolfi or Takeshi Takahashi that verification or falsification of VM-related hypotheses can nowadays be to a great extent automatized. For example, Stolfi's analyses of the frequencies of occurrence of different characters in different contexts have indicated that the majority of voynichese words seem to implement a sort of tripartite crust-core-mantle (or prefix, infix, suffix) morphology. A later study has indicated that the presence of such morphological regularities could be explained as the output of a mechanical device called the Cardan grille (Rugg, 2004). The "hoax hypothesis" is also supported by the study of Schinner (2007), who suggested that "the text has been generated by a stochastic process rather than by encoding or encryption of language". Pointing in a similar direction, the analysis also concludes that "glyph groups in the VM are not used as words".
On the other hand, a methodology based on "first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series", presented in (Amancio et al., 2013), led its authors to the conclusion that VM "is mostly compatible with natural languages and incompatible with random texts". Simply stated, the way diverse "words" are distributed among different sections of VM indicates that these words carry certain semantics. And this indicates that VM, or at least certain parts of it, is not a hoax.

5 In this article we distinguish transliteration and transcription. Transliteration is a bijective mapping from one graphemic system into another (e.g. a VM glyph is transliterated into ASCII's EVA subset). Transcription is a potentially non-bijective mapping between symbols on one side and sound- or meaning-carrying units on the other.

15.3.3 our position

Results of (Amancio et al., 2013) made us adopt the conjecture "VM is not a hoax" as a sort of fundamental hypothesis accepted a priori. Surely, as far as we stand, it cannot be excluded that VM is the work of an abnormal person, of somebody who suffered severe schizophrenia or was chronically obsessed by internal glossolalia (Kennedy and Churchill, 2005). Nor can it be excluded that the manuscript does not encode full-fledged utterances but rather lists of indices, sequences or proper names of spirits-which-are-to-be-summoned, or sutra-like formulas compressed in a sort of private pidgin or sociolect. But given VM's ingenuity, given the effort which the author had to invest into the conception of the manuscript, and given a sort of "elegant simplicity" which seems to permeate the manuscript, we have felt, since our very first contact with the manuscript, a sort of obligation to interpret its contents as meaningful. That is, as having the capability of denoting objects outside of the manuscript itself. As being endowed with the faculty of reference to the world (Frege, 1994) which we, 21st-century interpreters, still inhabit centuries after VM's most plausible date of conception. It is with such a bias in mind that our attention was focused upon a certain regularity which we have later decided to call "the primary mapping".

15.3.4 primary mapping

A condition sine qua non of any act of deciphering is the discovery of rules which allow one to transform an initially meaningless cipher into meaningful information. In the most trivial case, such deciphering is facilitated by a sort of Rosetta Stone (Champollion, 1822) which the decipherer already has at his disposal. Since both the cipher-text as well as the plain-text (also called "the crib") are explicitly given by the Rosetta Stone, the discovery of the mapping between the two is usually quite straightforward. The problem with VM is, of course, that it seems not to contain any explicit key which could help us to decipher its glyphs. Thus, the only source of information which could potentially help us to establish reference between VM's glyphs and the external world are VM's drawings. One such drawing, present atop folio f84r, is shown on Figure 26. Figure 26 displays twelve women bathing in eight compartments of a pool. Bathing women are a very common motif in VM and there seems to be nothing peculiar about them. The fact that word-like sequences are written above the heads of these women is also trivial.
Figure 26: Drawing from folio f84r containing the primary mapping.

One can, however, observe one regularity which seems to be interesting. That is, in case two women bathe in the same compartment, the compartment contains two word-like sequences. If one woman bathes in the compartment, there is only one word-like sequence written above her head. One figure - one word; two figures - two words. This principle is stringently followed and can be seen on other folios as well. What is more, the words themselves are sometimes similar, but they are not the same. Such trivial observations lead to a trivial conclusion: these word-like sequences are labels. And since these labels are juxtaposed to feminine figures, it seems reasonable to postulate that these labels are, in fact, feminine names. This is the primary mapping.

15.3.5 three conjectures

The method which shall be described in the following sections can be considered as valid only under the assumption that the following conjectures are valid:

1. "the primary mapping conjecture": voynichese words beside feminine figures are feminine names

2. "diachronic stability of proper names": proper names are less prone to diachronic change than other language units

3. "Occam razor": instead of containing a sophisticated esoteric cipher, VM simply transmits a text written in an unknown script

Further reasons why we consider "the primary mapping conjecture" as valid shall be given alongside our discussion of "the Calendar". When it comes to the conjecture postulating the "diachronic stability of proper names", we could potentially refer to certain cognitive peculiarities of how the human mind tends to treat proper names (Imai and Haryu, 2001). Or focus the attention of the reader on the fact that for practically every human speaker, one's own name undoubtedly belongs among the most frequent and most important tokens which one hears or utters during one's whole life. This results in a sort of stability against linguistic change and allows the name to cross the centuries with higher probability than words of lesser importance and frequency. But instead of pursuing the debate in such a direction, let's just point out that the successful decoding of the Mycenaean Linear B script (Ventris and Chadwick, 1953) would have been much more difficult if certain toponyms like Amnisos, Knossos or Pylos had not succeeded in carrying their phonetic skeleton through aeons of time. Last but not least, the "Occam razor conjecture" simply makes explicit the belief that a reasonable scientist should not opt to explain VM in terms of anagrams and opaque hermeneutic procedures if similar - or even more plausible - results can be attained when approaching VM as if it were a simple substitution cipher.

15.4 method

The core of our method is an optimization algorithm which looks for such a candidate transcription alphabet A_X which, when applied upon the list of word types occurring in VM's Calendar section, yields an output list whose members should ideally be present in another list, called the Crib. The optimization is done by an evolutionary strategy - an individual chromosome encodes a candidate transcription alphabet, and the fitness function is given as the sum of lengths of all tokens which were successfully transcribed from the Calendar to a specified Crib.

15.4.1 calendar

Six among the twelve words present on Figure 26 occur only on folio f84r.
Six others occur on other folios as well, and five of these six words occur also as labels near feminine figures displayed on the 12 folios of the section commonly known as the "Zodiac". It is in this way that our attention was directed from the limited corpus of the "primary mapping" towards the more exhaustive corpus contained in the Zodiac. Every page of the Zodiac displays multiple concentric circles filled with feminine figures. Attributes of these figures differ - some hold torches, some do not, some are bathing, some are not - but one pattern is fairly regular. Beside every woman there is a star, and beside every star there is a word. While some authors postulate that these words are names of stars or names of days, we postulate that these words are simply feminine names6. From Takahashi's transliterations of the twelve folios of the Zodiac we extract 290 tokens which instantiate 264 distinct word types. To avoid possible terminological confusion, we shall denote this list of 264 labels7 with the term Calendar. Hence, Zodiac is the term referring to folios f70v2 - f73v, while Calendar is simply a list of 264 labels. The total length of these 264 labels is 2045 letters. These characters are chosen from a 19-symbol (|A_cipher| = 19) subset of the EVA transliteration alphabet.

15.4.2 cribbing

Cribbing is a method by means of which the hypothesis that the Calendar contains lists of feminine names can potentially lead to the deciphering of the manuscript. For if the Calendar is indeed such a list, then one could use lists of existing and attested feminine names as hypothetical target "cribs". In cryptanalytic terms, the intuition that the Calendar contains feminine names makes it possible to perform a sort of known-plain-text attack (KPA). We say "a sort of" because in the case of VM, the "cribs" upon which we shall aim to map the Calendar are not known with 100% certainty. Hence, it is maybe more reasonable to understand the cribbing procedure as a plausible-plain-text attack (PPA). This being said, we label as "cribbing" a symbol-substituting procedure P_cribbing which replaces symbols contained in the cipher (i.e. in the Calendar) with symbols contained in the plain-text. Hence, not only the cipher but also the plain-text is an input of the cribbing procedure. Every act of execution of P_cribbing can be followed by an act of evaluation of the usefulness of P_cribbing in regards to its inputs. The ideal procedure would result in a perfect match between the rewritten cipher and the plain-text, i.e.

P_cribbing(cipher) == plain-text

On the other hand, a completely failed P_cribbing results in two corpora which do not have anything in common. And between the two extremes of the spectrum, between "the ideal" and "the completely failed", one can place multitudes of other procedures, some closer to the ideal than others. This makes room for optimization.

6 It cannot be excluded, however, that they are all of this at once. Note, for example, that in many central European countries, it is still a fairly common practice to attribute specific names to specific days of the year, i.e. "meniny".

7 Available at http://wizzion.com/thesis/simulation0/calendar.uniq

Listing 8: Discrete cross-over

#discrete crossover: the child inherits a gene from a randomly
#chosen parent only where the parents' alleles differ
my $child_genome;
my $i = 0;
for (@mother_genome) {
    if ($_ ne $father_genome[$i]) {
        rand > 0.5
            ? ($child_genome .= $mother_genome[$i])
            : ($child_genome .= $father_genome[$i]);
    } else {
        $child_genome .= $mother_genome[$i];
    }
    $i++;
}
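Listing 8 shows only the cross-over operator. For completeness, a minimal sketch of a complementary per-gene mutation operator is given below. It is our reconstruction based solely on the parameters reported in this chapter (per-gene mutation probability below 0.001, genes drawn from the crib alphabet) and not an excerpt from the published voynich.PERL script, so all names and the alphabet used are illustrative.

#sketch of per-gene mutation (illustrative, not the published script):
#each gene is replaced, with small probability, by a random symbol
#of the crib alphabet
my @crib_alphabet = split //, "abcdefghijklmnopqrstuvwxyz";  # hypothetical crib alphabet
my $mutation_rate = 0.001;                                   # per-gene probability
sub mutate {
    my ($genome) = @_;
    my @genes = split //, $genome;
    for my $gene (@genes) {
        $gene = $crib_alphabet[int rand @crib_alphabet]
            if rand() < $mutation_rate;
    }
    return join '', @genes;
}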
15.4.3 optimization

All experiments described in the next section of this article implement an evolutionary computation algorithm strongly inspired by the architecture of the Canonical genetic algorithm (CGA, P+46) (Holland, 1992; Rudolph, 1994). Hence, the initial population is randomly generated and fitness-proportionate (i.e. "roulette wheel", P+42) selection is used as the main selection operator. But contrary to CGAs, our optimization technique does not implement a classical single-point crossover but rather a sort of "discrete crossover" which takes place only in case the parent individuals have different alleles of a specific gene. Another reason why our solution can be considered to be more similar to evolutionary strategies (Rechenberg, 1971) than to CGAs is related to the fact that it does not encode individuals as binary vectors (P+48). Instead, every individual represents a candidate mono-alphabetic substitution cipher, the application of which could, ideally, transform the Calendar into a crib. More formally: given that the cipher is written in symbols of the alphabet A_cipher and given that the crib is written in symbols of the alphabet A_crib, each individual chromosome will have a length of |A_crib| genes and every individual gene can encode one among |A_cipher| values. The size of the search space is therefore $|A_{cipher}|^{|A_{crib}|}$. The search for optima in this space is governed by the fitness function:

$$F_{P_{cribbing}} = \sum_{w \in cipher \,\wedge\, P_{cribbing}(w) \in crib} length(w)$$

where w is a word type which occurs in the cipher (i.e. in the Calendar) and which, after being rewritten by P_cribbing, also matches a token in the input crib. Given that the expression length(w) simply denotes w's character length, the fitness of the candidate transcription procedure P_cribbing is thus nothing else than the sum of character lengths of all distinct labels contained in the Calendar which P_cribbing successfully maps onto the feminine names contained in the input crib.

Listing 9: Cipher2Dictionary adaptation fitness function

#Fitness Function
my $text = $calendar;
my $old  = "acdefghiklmnopqrsty";   # the 19 EVA symbols of the Calendar
my %translit;
@translit{split //, $old} = split //, $individual;
# core transcription of calendar content
$text =~ s/(.)/defined($translit{$1}) ? $translit{$1} : $1/eg;
my %matched;
for (split /\n/, $text) {
    my $token = $_;
    if (exists $crib{$token}) {
        # map the matched token back to its original EVA form
        my %antitranslit;
        @antitranslit{split //, $individual} = split //, $old;
        $token =~ s/(.)/defined($antitranslit{$1}) ? $antitranslit{$1} : $1/eg;
        my $t = $token;
        $matched{$t} = 1;
    }
}
# fitness of individual $i: sum of lengths of distinct matched labels
for (keys %matched) { $Fitness[$i] += length $_; }

15.5 experiments

Within the scope of this article, we present results of two sets of experiments which essentially differed in the choice of name-containing cribs. Other input values (e.g. Takahashi's transliteration of the Calendar used as the cipher) and evolutionary parameters (total population size = 5000, elite population size = 5, gene mutation probability < 0.001) were kept constant between all experiments and subexperiments. Each experiment consisted of ten distinct runs. Each run was terminated after 200 generations.

15.5.1 slavic crib

What we label as the "Slavic crib" is a plain-text list of feminine names which we had compiled from multiple sources publicly available on the Internet. Principal sources of names were websites of western Slavic origin.
This choice was motivated by the following reasons:

1. The oldest more or less certain trace of VM's trajectory points to the city of Prague - the center of western Slavic culture.

2. The orthography of western Slavic languages relatively faithfully represents the pronunciation. That is, there are relatively few digraphs (e.g. the bi-gram "ch", which denotes a voiceless velar fricative). Hence, the distance between the graphemic and the phonemic representations is not as huge as in the case of English or French.

3. Slavic languages have a rich but regular affective and diminutive morphology which is often used when addressing or denoting beloved persons by their first name.

The third reason is worth introducing somewhat further: in both eastern and western Slavic languages, a simple in-fixing of the unvoiced velar occlusive "k" before the terminal vowel "a" of a feminine name leads to the creation of a diminutive form of such a name (e.g. alena → alenka, helena → helenka etc.). The fact that this morphological rule is used both by western as well as eastern Slavs indicates that the rule itself can be quite old, may date to common Slavic or even pre-Slavic periods, and hence was quite probably in action already in the period when VM was written. For the purpose of this article, let's just note that application of the substitution s/a$/ka/ allowed us to significantly increase the extent of the "Slavic crib". Thus, we have obtained a list of 13815 distinct word types which are in quite close relation to the phonetic representations of feminine names used in Europe and beyond8. The alphabet of this crib comprises 38 symbols; hence there exist $19^{38}$ possible ways how symbols of the Calendar could be replaced by symbols of this crib.

8 The Slavic crib is publicly available at http://wizzion.com/thesis/simulation0/slavic_extended.crib

Figure 27: Evolution of individuals adapting labels in the Calendar to names listed in the Slavic crib.

Figure 27 shows the process of convergence from populations of randomly generated chromosomes towards more optimal states. In the case of the runs averaged in the "SUBSTITUTION" curve, the procedure P_cribbing consisted in a simple mapping of the Calendar onto the crib by means of the substitution cipher specified in the chromosome. But in the case of the runs averaged in the "REVERSAL + SUBSTITUTION" curve, the whole process was initiated by the reversal of the order of characters present within individual tokens of the Calendar (e.g. okedy → ydeko, otedy → ydeto etc.).

Let's now look at the contents of individuals which were "identified" by the optimization method. More concrete illustrations can also turn out to be quite illuminating. Hence, if the most elite individual of run 1 (i.e. the one with fitness 197) is used as a means of substitution of EVA characters contained in the Calendar, one will see the appearance of names like ALENA, ALETHE, ANNA, ATENKA, HANKA, HELENA, LENA etc. And when the last one (i.e. the one with fitness 240) is used, the resulting list shall contain tokens like AELLA, ALANA, ALINA, ANKA, ANISSA, ARIANNKA, ELLINA, IANKA, ILIJA, INNA, LILIJA, LILIKA, LINA, MILANA, MILINA, RANKA, RINA, TINA etc.

This being said, the observation that all reversal-implementing runs have converged to genomes which:

1. transcribe e in EVA as the nasal n

2. transcribe k in EVA as the velar k

3. transcribe t in EVA as the nasal n

4. transcribe y in EVA as the vowel a

5. transcribe a in EVA as a vowel (80% of the time as "i", 10% as "e", 10% as "o")

6.
transcribe l in EVA as either a liquid consonant (80% "l", 10% "r") or as "m" (10%)

...could also be of certain use and importance.

Table 40: Fittest chromosomes which map reversed tokens in the Calendar onto names of the Slavic crib (each row pairs a run's fitness with its fittest chromosome; chromosome positions correspond to the EVA symbols a c d e f g h i k l m n o p q r s t y).

15.5.2 hebrew crib

At this point, a skeptical mind could start to object that what our algorithm adapts to is in fact not the Calendar, but the statistical properties of the crib. And in the case of such a long and sometimes somewhat artificial list like Crib_Slavic, such an objection would be to a great extent justified. For the adaptive tendencies of our evolutionary strategy are indeed so strong that it would find a way to partially adapt the Calendar to any crib which is long enough9.

For this reason, we have decided to target our second experiment not at the biggest possible crib but rather at the oldest possible crib. And given that our first experiment has indicated that it seems to be more plausible to interpret labels in the Calendar as if they were written in reverse, id est from right to left, our interest was gradually attracted by the Hebrew language10. This led us to two lists of names:

• Crib_Hebrew-men contains 555 masculine names11

• Crib_Hebrew-women contains 283 feminine names12

Both lists were extracted from the website finejudaica.com/pages/hebrew_names.htm and were chosen because they did not contain any diacritics and hence transcribe Hebrew names in a similar way as they had been transcribed millennia ago.

9 This has been, indeed, shown by multiple micro-experiments which we do not report here due to lack of space. No matter whether we used cribs as absurd as a list of modern American names or the Enochian of John Dee and Edward Kelly, we could always observe a sort of adaptation marked by an increase of fitness. But it was never as salient as in the case of Crib_Slavic or Crib_Hebrew.

10 Other reasons why we decided to focus on Hebrew include: the important presence of the Jewish diaspora in the Prague of Rudolph the 2nd (c.f. the story of rabbi Loew and the Golem of Prague); the ritual bathing of Jewish women known as mikveh; the usage of VM-resembling triplicated forms (e.g. amen, amen, amen) in Talmudic texts; the attested existence of the so-called Knaanic language which seems to be principally a Czech language written in Hebrew script, et caetera et caetera.

11 http://wizzion.com/thesis/simulation0/jewish_men

12 http://wizzion.com/thesis/simulation0/jewish_women

Figure 28: Evolution of individuals adapting labels in the Calendar to names listed in the Hebrew cribs.

Figure 28 displays the summary of all runs which aimed to transcribe the Calendar with Hebrew names. As may be seen, the whole system converged to the highest fitness values when Crib_Hebrew-women was used in concordance with the reversal of the order of characters. In such a scenario, the minimal attained fitness was F_min(hebrew28,283,hfr) = 52 and the maximal attained fitness was F_max(hebrew28,283,hfr) = 63. The difference between the results of the hebrew, reverse batch of runs and the results of other batches is statistically significant (Welch two sample t-test, p-value < 7e-10).
Subsequently, a list of 283 tokens was randomly generated in such a way that the distribution of lengths of the randomly generated sequences was identical to the distribution of lengths of names in the Hebrew crib. Among 10 runs aiming to adapt the Calendar to such a random crib, the maximal attained fitness was F_max(random28,283,hfr) = 26. The statistical difference between the results of the batch of runs adapting to the valid character-reversed Hebrew crib hebrew28,283,hfr and the equidistributed randomly generated crib random28,283,hfr turned out to be strongly significant (Welch two sample two-sided non-paired t-test: t = 22.0261, df = 15.442, p-value = 4.384e-13).

The highest fitness value was attained by the cribbing procedure which first reverses the order of the characters whose EVA representations are subsequently substituted by the following chromosome. This chromosome transcribes the voynichese Calendar labels okam, otainy, otey, oty, otaly, okaly, oky, okyd, ched, otald, orara, otal, salal and opalg to feminine Hebrew names (i.e. Bina, Gabriela, Ghila, Gala, Galila, Galina, Gina, Degana, Diyna, Deliyla, Yedidya, Lila, Lilit and Alica).

Worth mentioning are also some other phenomena related to these transcriptions. One can observe, for example, that the label "otaly" - translated as Galina - is also present on folios f33v, f34r or f46v, which all contain drawings of torch-like plants. This is encouraging because the word "galina" is not only a Hebrew name, but also a substantive meaning "torch". Similarly, the word "lilit" is not only a name but also means "of the night". This word supposedly translates the voynichese token "salal" which is very rare - beside the Calendar it occurs only on the purely textual folio f58v and on folio f67v2 which, surprise!, may well depict circadian rhythms of sunrise, sunset, day and night. Or it could be pointed out that the huge majority of occurrences of the voynichese trigram "oky" (potentially denoting the name "gina", which also means "garden") is to be observed on herbal folios. Or the distribution of instances of "okam" (transcribed as "bina", which means "intelligence and wisdom"13) could, and potentially should, be taken into consideration. Or maybe not.

13 Note that "bina" is one among the highest sephirots, located at the north-western corner of the kabbalistic tree of life. In this context it is worth noting that the only partially readable EVA group "...kam" occurs as a third word near the north-western "rosette" of folio 85v2. Such considerations, however, bring us too far.

15.6 conclusion

In 2013, BBC Online announced a "Breakthrough over 600-year-old mystery manuscript". The breakthrough was to be effectuated by Stephen Bax who, in his article, describes the process of deciphering as follows: « ?» (?) What Bax does not add, unfortunately, is that the Voynich crossword puzzle is so big that anyone who looks at it closely enough can find in it small islands of order, local optima where a few characters seem to fit the global pattern. Thus, even if Bax had succeeded, as he states, in the "identification of a set of proper names in the Voynich text, giving a total of ten words made up of fourteen of the Voynich symbols and clusters", this would mean nothing else than that he had identified a locally optimal transcription alphabet.

In this article, we have presented two experiments employing two different lists of feminine names.
Both experiments have indicated that if labels in the Zodiac encode feminine names, then these have been originally written from right to left14. The first experiment led to the identification of multiple substitution alphabets which allow one to map 240 EVA letters, contained in 40 distinct words present in the Calendar, onto 35 feminine-name-resembling sequences enumerated among the 13815 items of Crib_Slavic. Results of the second experiment indicate that if ever the Calendar contains lists of Hebrew names, then these names would more probably be feminine rather than masculine. This is, as far as we can currently say, all that could potentially be offered as an answer to the question «Can Evolutionary Computation Help us to Crib the Voynich Manuscript?» (Hromada, 2016a). Everything else is - without help coming from experts in other disciplines - just speculation.

14 Note, however, that this does not necessarily imply that the scribe of VM (him|her)self had written the manuscript in right-to-left fashion. For example, in case (s)he was just reproducing an older source which (s)he didn't understand, his|her hand could trace movements from left to right while the very original had been written from right to left.

15.7 generic conclusion

Looked upon from a superficial point of view, the article presented in this "zeroth analysis" contains nothing else and nothing more than:

1. a very brief description of a particular enigma commonly known as the "Voynich Manuscript"

2. the introduction of a so-called "primary mapping" hypothesis potentially able to direct any future tentative to decipher the manuscript

3. a discussion of the inner workings of an "evolutionary algorithm" whose source code is hereby transferred to the public domain15

4. the presentation of fairly reasonable results obtained after confrontation of the manuscript with the algorithm, which takes lists of Slavic and Hebrew names at its input

15 http://wizzion.com/thesis/simulation0/voynich.PERL

What is meant by the attribute fairly reasonable is, of course, a place for argument. And contrary to legions of other researchers, we do not pretend that we have succeeded to "crack" the manuscript. We simply state that after being executed on a single core of a 1.8GHz CPU, a simple 160-line script written in pure PERL can yield, in just a few hours, intelligible transcriptions of "lattices of terms" contained in a previously unknown corpus. Thanks to a fairly trivial derivative of a Canonical Genetic Algorithm, an average home PC can closely approximate a brute-force search which would otherwise run for weeks (at least) even when executed on state-of-the-art computational clusters. Simply stated, our 0th simulation indicates that which has already been indicated many times before:

Evolution narrows down the search to regions where the most plausible hypotheses reside.

A non-negligible speed-up goes hand in hand with such narrowing-down. And it is evident that such a speed-up can be useful for any system which can invest only a limited amount of time and energy into its search for the most optimal hypothesis. It does not really matter whether the system of which we speak in this context is a PERL script, a child's mind or Nature herself: a problem-solving system which implements evolutionary principles tends to converge (Rudolph, 1994) to "the answer" in less time, and with fewer resources wasted, than a system which does not implement such principles.
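The magnitude of this narrowing-down can be made concrete with a back-of-the-envelope comparison between the brute-force search space and the number of fitness evaluations actually spent by the evolutionary strategy, assuming the parameters reported in section 15.5 (19-symbol cipher alphabet, 38-symbol crib alphabet, population of 5000, 200 generations, ten runs). The snippet below is purely illustrative and not part of the published script.

#back-of-the-envelope comparison: brute force vs. evolutionary search
my $brute_force  = 19 ** 38;         # all candidate substitution alphabets
my $evolutionary = 5000 * 200 * 10;  # individuals x generations x runs
printf "brute force:  %.1e candidate alphabets\n", $brute_force;   # ~3.9e+48
printf "evolutionary: %.1e fitness evaluations\n", $evolutionary;  # 1.0e+07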
At least as fascinating as her ability to speed things up is evolution's propensity to produce adaptations. The zeroth simulation is particularly instructive in this regard: as noted in footnote 9, the VM-to-crib transcribing EA produced certain results even in cases when cribs such as a "list of 20th-century American names" were used as target dictionaries. In spite of the absurdity of such cribs - for it is indeed highly improbable that VM initially contained names like Butch or Mitch - the EA succeeded in discovering certain inherent similarities between the two texts in order to exploit them in its future search. Thus, the main conclusion of the 0th simulation can be stated as follows:

Evolution is able to facilitate the search for an optimal mapping between distinct corpora encoded in distinct forms of representation.

In this simulation, the distinct forms of representation were the so-called EVA alphabet (into which VM is transliterated) and the phonemic alphabets common to Slavic or Hebrew languages. The mapping itself was nothing else than a simple substitution of one symbol from one alphabet by one symbol from another alphabet. A mapping - a hypothesis - was considered "the fittest" if it succeeded to transcribe the initially unintelligible EVA corpus into an intelligible list of names. Both the EVA corpus and the name list were the EA's inputs and thus, in a certain sense, "innate" to each individual run of the algorithm. What was "acquired" during the process was the set of mono-alphabetic substitution rules. The EA presented in the 0th simulation is thus an example of evolution which processes strictly "symbolic" representations. This will not be the case in the simulations which are now to follow: let's now descend to the realm of sub-symbolic (vectorial) entities in order to propose an evolutionary solution to the problem of category induction.

16 EVOLUTIONARY LOCALIZATION OF SEMANTIC PROTOTYPES

16.1 generic introduction

How does a child create mappings between "signifiers" and "signifieds" (de Saussure, 1916), between words and their meanings? How do concepts emerge in the mind of a child? These questions are addressed in many places of Conceptual Foundations. Be it during our discussion of the "ontogeny of lexicon and semantics" (P+72-78) or classical theories thereof (P+93-95), be it during the definition of "category prototype" (P+132) or in the Hebb/Harris analogy (P+133) suggesting a sort of equivalence between Hebb's law, well-known to neuroscientists, and the so-called "distributional hypothesis", well-known to linguists, it has been indicated in multiple places that what contemporary linguists label as "vocabulary development" is, in its essence, nothing else than a usage-based, goal-oriented, associationist process. And that Chomsky's critique of Skinner (P+95) is, in regards to the acquisition of meanings, quite inappropriate: in fact, it does not even apply. This is so because the first syntactic representations (P+173-179) are acquired, tuned and perfected later than the first semantic constructions (P+179-184).

And how could such "vocabulary development" be simulated by an engineer willing to do so? In an ideal world, such an engineer would have to have, at least, two things at his disposal:

• a corpus C representing the world of a modal toddler: it should contain representations of objects with many attributes (some of them could and should mutually overlap)

• an algorithm A capable of clustering objects into categories in a "cognitively plausible" (P+13) way (i.e.
similar to the way a child's mind does it)

Unfortunately, as of 2016, no such C is available, at least not in a textual form which could be processed by means of methods commonly used in computational linguistics (P+112-164). The corpus CHILDES (P+207-209) is as close as one can get to C but, and this is a non-negligible "but", CHILDES contains transcripts representing interactions within certain worlds BUT does not contain descriptive representations of these worlds themselves. And as we have noted elsewhere (Hromada and Gaudiello, 2014), the construction of such a corpus surpasses by far the possibilities of any individual engineer and thus also the possibilities of this dissertation.

Willing to develop A but without a proper C, one is obliged to approximate. In regards to simulations of the induction of meaning, a plausible approximation could be proposed as follows: let's suppose that text documents are "objects" and that groups of objects which have similar semantic content (i.e. refer to or speak about similar things) delimit a certain "semantic category". Under such a supposition - and under such a supposition only - can one reduce the problem of vocabulary development to a problem of multi-class categorization of documents. Under such ceteris paribus - and under such ceteris paribus only - can one pretend that the model first published in the article «Genetic Optimization of Semantic Prototypes for Multi-class Document Categorization» (Hromada, 2015) could, in the long run, potentially lead to full-fledged computational models of vocabulary development.

16.2 introduction

In computational theories and models of learning, one generally works with two types of models: regression and classification. While in regression models one maps a continuous input domain onto a continuous output range, in models of classification one aims to find mappings able to project input objects onto a finite set of discrete output categories. This article introduces a novel means of construction of a particular type of the latter kind of learning models. Due to the finite and discrete nature of its output range, classification - also called categorization by more cognition-oriented researchers - seems to be of utmost importance in any cognitively plausible (Hromada, 2014b) model of learning. But under these terms, two distinct meanings are confounded, and the term categorization thus often represents both:

1. the process of learning (e.g. inducing) categories

2. the process of retrieving information from already learned (induced) categories

which crudely correspond to the training, resp. testing phases of supervised learning algorithms. In the rest of this section we shall more closely introduce an approach combining the notions of category prototype, dimensionality reduction and evolutionary computing in order to yield a potentially "cognitively plausible" means of supervised machine learning of a multi-class classifier. We shall subsequently present the specificities of a Natural Language Processing (NLP) simulation which was executed in order to assess the feasibility of our approach. Results hence obtained shall subsequently be compared with the comparable "deep learning" semantic hashing technique of (Salakhutdinov and Hinton, 2009). The article shall be concluded with a few remarks integrating the whole research into more generic theories of neural and universal Darwinism.
16.2.1 geometrization of categories

In contemporary cognitive science, categories are often understood as entities embedded in a ∆-dimensional feature space (Gärdenfors, 2004). The most fundamental advantage of such models, whose computer science counterparts are the so-called "vector symbolic architectures" (VSAs) (Widdows and Cohen, 2014), is their ability to geometrize one's data, i.e. to represent one's data-set in a form which allows one to measure distances (similarities) between individual items of the data-set. Thus, even entities like "word meanings" or "concepts" can be geometrically represented, either as points, vectors or sub-spaces of the enveloping vector space S. One can subsequently measure distances between such representations, e.g. the distance of the meaning of the word "dog" from the meaning of "wolf" or "cat" etc.

Geometrization of one's data-set once effectuated, the space S can subsequently be partitioned into a set R of |C| regions, $R = \{R_1, R_2, ..., R_{|C|}\}$. In an unsupervised scenario, such partitioning is often done by means of diverse clustering algorithms, the most canonical among which is the k-means algorithm (MacQueen et al., 1967). Such clustering mechanisms often characterize a candidate cluster C_X in terms of a geometric centroid of the members of the cluster. The feasibility of a certain partition is subsequently assessed in terms of "internal clustering criteria" which often take into account the distances among such centroids. In the rest of this article, however, we shall aim to computationally implement a supervised learning scenario, and instead of working with the notion of a category's geometric centroid, our algorithm shall be based upon the notion of a category's prototype.

The notion of the prototype was introduced into science notably by the theory of categorization of Eleanor Rosch, which departed from the theoretical postulate that "the task of category systems is to provide maximum information with the least cognitive effort" (Rosch, 1999). In the seminal psychological and anthropological studies which followed, Rosch realized that people often characterize categories in terms of one of their most salient members. Thus, a prototype of category C_X can be most trivially understood as such a member of C_X which is the most prominent, salient member of C_X. For example, "apples" are prototypes of the category "fruit" and "roses" are prototypes of the category "flowers" in the western cultural context. But the studies of Rosch have also suggested another, more mathematical notion of how prototypes can be formalized and represented. A notion which is based upon the notion of closeness (e.g. "distance") in a certain metric space:

"items rated more prototypical of the category were more closely related to other members of the category and less closely related to members of other categories than were items rated less prototypical of a category" (Rosch and Mervis, 1975)

Given that this notion is essentially geometric, the problem of the discovery of a set of prototypes can potentially be operationalized as a problem of minimization of a certain fitness function. The fitness function, as well as the means by which it can be optimized, shall be furnished in section 16.3. But before doing so, let's first introduce certain computational tricks which allow one to reduce the computational cost of such a search for the most optimal constellation of prototypes.
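Before moving on, a minimal illustration of the Rosch and Mervis operationalization just quoted: the following sketch picks, out of a toy set of feature vectors, the member of a category which is closest to the other members of its own category and furthest from members of other categories. Both the Euclidean metric and the data layout are illustrative assumptions, not choices of the simulation described below.

#illustrative sketch of Rosch & Mervis prototypicality: the prototype
#of a category is its member closest to its own category and furthest
#from the other categories
sub euclid {
    my ($u, $v) = @_;
    my $s = 0;
    $s += ($u->[$_] - $v->[$_]) ** 2 for 0 .. $#$u;
    return sqrt $s;
}
sub prototype_of {
    my ($members, $others) = @_;     # two array-refs of vector-refs
    my ($best, $best_score);
    for my $m (@$members) {
        my $score = 0;
        $score += euclid($m, $_) for @$members;  # closeness to own category
        $score -= euclid($m, $_) for @$others;   # remoteness from the rest
        ($best, $best_score) = ($m, $score)
            if !defined($best_score) || $score < $best_score;
    }
    return $best;                    # the most prototypical member
}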
16.2.2 radical dimensionality reduction

There is potentially an infinite number of ways how a data-set D consisting of |D| documents can be geometrized into a ∆-dimensional space S. In NLP, for example, one often looks for occurrences of diverse words in the documents of the data-set (e.g. corpus). Given that there are |W| distinct words occurring in the |N| documents of the corpus, one used to geometrize the corpus by means of an N × W co-occurrence matrix M whose X-th row vector represents the X-th document N_X, whose Y-th column vector represents the Y-th word W_Y, and whose element at position M_X,Y represents the number of times W_Y occurred in N_X. Given the sparsity of such co-occurrence matrices, as well as for other reasons, such bag-of-words models are more or less abandoned in contemporary NLP practice for the sake of more dense representations, whereby the dimensionality of the resulting space, d, is much less than |W|: d ≪ |W|.

Renowned methods like Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) set aside because of their computational cost, we shall use the Light Stochastic Binarization (LSB) (Hromada, 2014c) algorithm to perform the most radical dimensionality-reducing geometrization possible. LSB is an algorithm issued from the family of algorithms based on so-called random projection (RP). The validity and feasibility of all these algorithms, be it Random Indexing (RI, (Sahlgren, 2005)) or Reflective Random Indexing (RRI, (Cohen et al., 2010)), is theoretically founded on the so-called Johnson-Lindenstrauss lemma, whose corollary states that "if we project points in a vector space into a randomly selected subspace of sufficiently high dimensionality, the distances between the points are approximately preserved" (Sahlgren, 2005). The methods of application of this lemma in concrete NLP scenarios being described in the references above, we precise that LSB can be labeled as the "most radical" variant of RP-based algorithms because:

• it tends to construct spaces with as small a dimensionality as possible (in LSB, d < 300; in RI or RRI models, d > 300)

• LSB tends to project the data onto binary and not real or complex spaces

It can, of course, be the case that such dimensionality reduction and binarization lead to a certain decrease in the discriminative accuracy of LSB-produced spaces. On the other hand, given that dimensionality reduction and binarization necessarily bring about a reduction of the computational complexity of any subsequent algorithm which could be used to explore the resulting space S, such a decrease of accuracy is to be all the more swiftly counteracted by subsequent optimization. The goal of this study is to explore whether such post hoc optimization of classifiers operating within dense, binary, LSB-produced spaces is possible, and whether the combination of the two can be used as a novel means of machine learning. But before describing such evolutionary optimizations more closely, let's precise that because of its low-dimensional and binary nature, LSB can also be understood as yielding a sort of "hashing function" aiming to attribute similar hashes to similar documents and different hashes to different documents. In this sense, LSB is similar to approaches like Locality Sensitive Hashing (LSH, Datar et al. (2004)) or Semantic Hashing (SH, Salakhutdinov and Hinton (2009)), often used, or at least presented, as solutions to the multi-class classification of Big Data corpora.
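The random-projection idea underlying LSB can be conveyed in a few lines of code. The fragment below is only a minimal illustration under stated assumptions - every word receives a sparse random "seed" vector with 4 non-zero values (the value reported in section 16.4), and a document's binary hash is obtained by summing the seed vectors of its words and thresholding the sums - and is in no way the LSB reference implementation, which is specified in (Hromada, 2014c).

#minimal illustration of random-projection binarization (not the LSB
#reference implementation): words receive sparse random seed vectors,
#documents receive binary hashes obtained by summing and thresholding
my $dim = 128;                     # target dimensionality (delta)
my %seed;                          # word => seed vector (array-ref)
sub seed_vector {
    my @v = (0) x $dim;
    $v[int rand $dim] += (rand() < 0.5 ? -1 : 1) for 1 .. 4;
    return \@v;
}
sub hash_document {
    my (@words) = @_;
    my @sum = (0) x $dim;
    for my $w (@words) {
        $seed{$w} //= seed_vector();
        $sum[$_] += $seed{$w}[$_] for 0 .. $dim - 1;
    }
    return join '', map { $_ > 0 ? 1 : 0 } @sum;   # binarization step
}
print hash_document(qw(red ball in mama s hand)), "\n";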
It is with the results of the latter, "deep learning" approach (SH) that we shall compare our own results in section 16.5.

16.3 genetic localization of semantic prototypes

Let $D = \{d_1, ..., d_{|D|}\}$ be a training data-set consisting of |D| documents, to each of which the training data-set attributes one among |L| corresponding members of the set of class labels $L = \{L_1, ..., L_{|L|}\}$. Let $\Gamma$ denote a tuple $\Gamma = (C_1, ..., C_{|L|})$ whose individual elements are sets containing the indices of those members of D to which the same label $L_l$ is attributed in the training corpus (e.g. $C_1 = \{3, 4, 5\}$ if the training corpus attributes its 1st label only to documents $d_3$, $d_4$ and $d_5$). Let $H = \{h_1, ..., h_{|D|}\}$ be a set of ∆-dimensional binary vectors attributed to members of D by a hashing function $F_H$, i.e. $h_X = F_H(d_X)$. Let S be the ∆-dimensional binary (Hamming) space into which members of H were projected by application of the mapping $F_H$. Then the classificatory pertinence $F_{CP}$ of the candidate prototype $P_K$ of the K-th class ($K \leq |C|$) can be calculated as follows:

$$F_{CP}(P_K) = \alpha \sum_{t \in C_K} F_{hd}(h_t, P_K) - \omega \sum_{f \notin C_K} F_{hd}(h_f, P_K) \qquad (5)$$

whereby $P_K$ denotes the position of the prototype in S, $F_{hd}$ denotes the Hamming distance1, $h_t$ denotes the hash of a "true" document belonging to the same class as the prototype, $h_f$ is the vector of a "false" document belonging to some other class of the training corpus, and $\alpha$ and $\omega$ are weighting parameters. In simpler terms, an ideal prototype of category C is as close as possible to members of C and as far away as possible from members of other categories. Given such a definition of an ideal prototype, an ideal |C|-class classifier I can be trained by searching for such a set $P = \{P_1, ..., P_{|L|}\}$ of individual prototypes which minimizes the overall classification pertinence:

$$I = \min \sum_{K=0}^{|L|} F_{CP}(P_K) \qquad (6)$$

In simpler terms, an ideal |C|-class classifier I is composed of |C| individual prototypes which are as close as possible to documents of their respective categories, and as far away as possible from all other documents. Equations 5 and 6 taken together, one obtains a fitness function which can be optimized by evolutionary computing algorithms. And given that one explores prototypical constellations embedded in a binary space, one can use canonical genetic algorithms (CGAs, Goldberg (1990)) for the optimization of the problem of discovery of the ideal constellation of most pertinent prototypes. We choose CGAs for three principal reasons. Primo, we choose CGAs mainly for their property, proven in Rudolph (1994), of converging to the global optimum in finite time if ever they are endowed with a best-individual-protecting, elitist strategy. Secundo, one can obtain a practically useful and exploitable increase in speed simply due to the fact that CGAs are conceived to process binary vectors, and do so on CPUs which are essentially built for processing such vectors. Tertio, CGAs offer a canonical, well-defined, "baseline" gateway to much more sophisticated evolutionary computing (EC) techniques and are well understood by both neophytes as well as the foremost experts of the EC community. For this reason, we consider it superfluous to describe in closer detail the inner workings of a CGA: instead, the references (Goldberg, 1990; Rudolph, 1994) are to be followed and read.
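In a binary space, both equation 5 and the Hamming distance on which it relies reduce to a handful of bit operations. The following sketch is a minimal illustration (with α = ω = 1), assuming hashes and prototypes are stored as packed Perl bit strings so that XOR and bit-counting can be performed with built-in string operators; it is not the code used in the reported experiments.

#minimal sketch of classificatory pertinence F_CP (equation 5),
#with alpha = omega = 1; hashes are packed bit strings
sub hamming {
    my ($h1, $h2) = @_;
    return unpack '%32b*', ($h1 ^ $h2);   # count of differing bits
}
sub pertinence {
    my ($prototype, $true_hashes, $false_hashes) = @_;
    my $f = 0;
    $f += hamming($_, $prototype) for @$true_hashes;    # own category
    $f -= hamming($_, $prototype) for @$false_hashes;   # other categories
    return $f;    # lower values are better: F_CP is to be minimized
}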
Given that the particular values of the mutation and cross-over parameters shall be specified in the following section, the only thing of which the reader now needs to be reassured is her correct understanding of the nature of the data structures which the algorithm hereby proposed shall implement in order to encode an individual |C|-class classifier: given that equation 5 defines a prototype candidate as a position in a ∆-dimensional Hamming space, and given that equation 6 stipulates that an ideal |C|-class classifier is to be composed of representations of |C| ideal prototype candidates, the data structure representing an individual solution can be constructed by a simple concatenation of |C| ∆-dimensional vectors. Thus, the individual members of the populations which the CGA shall optimize are, in essentia, nothing else than binary strings of length |C|*∆.

1 The Hamming distance of two binary vectors h1 and h2 is the smallest number of bits of h1 which one has to flip in order to obtain h2. It is equivalent to the number of non-zero bits in the XOR(h1, h2) binary vector.

16.4 corpus and training parameters

In order to be able to compare the performance of our algorithm with non-optimized LSB and SH, the same corpus and dimensionality parameters were chosen as those already reported in the previous studies (Salakhutdinov and Hinton, 2009; Hromada, 2014c). Thus, the dimensionality of the resulting binary hashes was ∆=128. Every document of the corpus was hence attributed a 16-byte long hash. The so-called "20newsgroups" corpus2 has been used. The corpus contains 18,845 postings taken from the Usenet newsgroup collection, divided into a training set containing 11,314 postings and a testing set of 7531 postings (|D_training| = 11313, |D_testing| = 7531). Both training and testing subsets are divided into 20 different newsgroups which each correspond to a distinct topic. Given that every distinct topic represents a distinct category label, |C| = 20. Documents of the corpus were subjected to a very trivial form of pre-processing: documents were split into word-tokens by means of the [^\w] separator. Stop-words contained in the PERL library Lingua::StopWords were subsequently discarded. The 3000 word types with the highest "inverse document frequency" value were used as initial terms, to each of which the initial random indexing iteration attributed 4 non-zero values. The hashing function was F_H = LSB(∆ = 128, Seed = 3, Iterations = 2), because there were 2 "reflective" iterations preceding the ultimate stage of "binarization". Once hashes were attributed to all documents of the corpus, the Hamming space S was considered as constructed and stayed unmodified during all phases of subsequent optimizations and evaluations. As a CGA-compliant algorithm, the optimization applied generated the new generation by crossing over two parent solutions chosen by the fitness-proportionate (i.e. roulette wheel) selection operator. Each among the 2560 (128*20) genes was subsequently mutated (i.e. the corresponding bit was flipped to its opposite value) with a probability of 0.1%. The population contained 200 individuals; the zeroth generation was randomly generated. An elitist strategy was implemented so that all individuals with equally best fitness survived intact the transition to the next generation. Parameters α and ω (c.f. equation 5) used in fitness estimation were both set to 1.

2 http://qwone.com/~jason/20Newsgroups/
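A minimal sketch of the pre-processing just described is given below: tokenization on non-word characters and removal of English stop-words via the Lingua::StopWords library named above. The IDF-based selection of the 3000 initial terms is only indicated by a comment, since its exact implementation is not reproduced here.

#minimal sketch of the document pre-processing of section 16.4
use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');   # hash-ref of English stop-words
sub preprocess {
    my ($document) = @_;
    return grep { length($_) && !$stopwords->{$_} }
           split /[^\w]+/, lc $document;
    # in the actual simulation, only the 3000 word types with the
    # highest inverse document frequency are subsequently kept
}
my @tokens = preprocess("Subject: re: space shuttle thermal tiles");
print "@tokens\n";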
Information concerning the category labels guided the optimization during the training phase. During the testing phase, such information was used only for evaluation purposes. Multiple independent runs were executed, and the values of precision and recall were averaged among the runs in order to reduce the impact of stochastic factors upon the final results.

16.5 evaluation and results

Every 250th generation, the classificatory accuracy of the individual solution with minimal overall classification pertinence (c.f. equation 6) was evaluated in regards to the 7531 documents contained in the testing part of the corpus. The following aspects of the classifier's performance were evaluated in order to allow comparison with the Precision-Recall curves presented in (Salakhutdinov and Hinton, 2009; Hromada, 2014c):

Precision = (number of retrieved relevant documents) / (total number of retrieved documents)

Recall = (number of retrieved relevant documents) / |D_testing|

The notion of relevancy is straightforward: an arbitrary document D_T contained in the testing corpus is considered to be relevant to a query document D_Q if and only if they were both labeled with the same category label, L_Q = L_T. On the other hand, the correct understanding of what is meant by "retrieved" is the key to the correct understanding of the core idea behind the functionality of the algorithm hereby proposed. That is: the prototypes induced by the CGA optimization are to be used as retrieval filters. We precise: given a hash h_Q of a query document d_Q, one can easily identify - among the |C| prototypes encoded as components of a quasi-ideal constellation I furnished by the CGA - such a prototype P_N which is nearest to h_Q. Subsequently, each among the N documents whose hashes are the N nearest neighbors of the prototype P_N should be considered as retrieved by d_Q. The prototypes discovered during the training phase therefore primarily specify, during the testing phase, which documents are to be considered as retrieved, and which not. For all LSB curves present on Figure 29, the size of such a retrieval neighborhood was set to N=2000. Also, in order to obtain viable precision-recall curves, the radius R = (0, ..., ∆ = 128) of the Hamming ball was used as a trade-off parameter.

Figure 29: Retrieval and 20-class classification performance in 128-dimensional binary spaces. Non-LSB results are reproduced from Figure 6 of the study (Salakhutdinov and Hinton, 2009), plain LSB from (Hromada, 2014c).

For every data-point of the plot on Figure 29, h_N was considered as retrieved by the query h_Q only if the Hamming distance between the query and the candidate document was smaller than R (hd(h_Q, h_N) < R). Points on the very left of the plot thus correspond to R=0 (i.e. h_Q and h_N collide), while points on the right correspond to R=128 (i.e. h_Q does not have a single bit in common with h_N). As the comparison of curves on the figure indicates, the biggest increase in performance is attained by the decision to use prototypes as retrieval filters. Thus, when one uses the most fit among 200 randomly chosen prototype constellations as a retrieval filter (c.f. curve CGA1(LSB)), one obtains significantly better results than when one does not use any prototypes at all (c.f. curve "Plain LSB"). If the process is followed by further genetic optimization (c.f. CGA500 for the situation after 500 generations), one observes a non-negligible increase of precision in the high-recall region of the spectrum.
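Before turning to the limits of the optimization, the retrieval-filter mechanism just evaluated can be condensed into a few lines. The sketch below is an illustration, not the code of the reported experiments; it assumes the hamming() helper from the earlier sketch, and it combines the two filters described in section 16.5: the N-nearest-neighborhood of the prototype nearest to the query, and the Hamming ball of radius R around the query.

#minimal sketch of prototype-based retrieval (N and R as in section 16.5)
sub retrieved_by {
    my ($query, $prototypes, $doc_hashes, $N, $R) = @_;
    # 1. prototype nearest to the query hash
    my ($nearest) = sort { hamming($query, $a) <=> hamming($query, $b) }
                    @$prototypes;
    # 2. the N nearest neighbours of that prototype form the retrieval pool
    my @pool = (sort { hamming($nearest, $a) <=> hamming($nearest, $b) }
                @$doc_hashes)[0 .. $N - 1];
    # 3. the Hamming ball of radius R around the query acts as a filter
    return grep { defined($_) && hamming($query, $_) <= $R } @pool;
}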
For every data-point of the plot on Figure 29, hN was considered as retrieved by the query hQ only if the hamming distance between the query and the candidate document was smaller than R, i.e. hd(hQ, hN) < R. Points on the very left of the plot thus correspond to R=0 (i.e. hQ and hN collide), while points on the right correspond to R=128 (i.e. hQ does not have a single bit in common with hN).

As comparison of the curves on the figure indicates, the biggest increase in performance is attained by the decision to use prototypes as retrieval filters. Thus, when one uses the most fit among 200 randomly chosen prototype constellations as a retrieval filter (c.f. curve CGA1(LSB)), one obtains significantly better results than when one does not use any prototypes at all (c.f. curve "Plain LSB"). If the process is followed by further genetic optimization (c.f. CGA500 for the situation after 500 generations), one observes a non-negligible increase of precision in the high recall region of the spectrum. But it can also be seen that the optimization has its limits: there is a slight decrease between the 500th and 1000th generation which potentially corresponds to a situation whereby the induced prototype constellation tends to over-fit the training data-set. This leads to a subsequent decrease in overall accuracy of classification of documents contained in the testing data-set.

Figure 29 also suggests that the genetic discovery of sets of prototypes - and their subsequent use as retrieval filters - seems to produce results which are better than those produced by both binarized Latent Semantic Analysis and SH. An exception to this is SH's 20% precision at the recall level of 51.2%. Note, however, that since on page 6 of their article, Salakhutdinov and Hinton (2009) claim to have used their hashes as retrieval filters of a neighborhood of size N=100, and given that the average size of a category in the 20newsgroups corpus is ≈ 390 documents, such a result is not even theoretically possible. This is so because even in the case where the classifying system would retrieve only relevant documents (i.e. precision would be 100%), the maximal attainable recall would still be just 100/390 ≈ 25.6%. Both authors were contacted by mail with a request to rectify a possible misunderstanding. Unfortunately, none of them replied.

16.6 conclusion

Results hereby presented indicate that supervised localization of constellations of semantic prototypes can significantly increase the accuracy of classifiers which use such constellations as retrieval filters. Given that the localization of such constellations is governed by the training corpus, but the increase is also significant when one confronts the system with a previously unseen testing corpus, we are allowed to state that our algorithm is capable of generalization. This was principally attained by a combination of the following ideas:

1. projection of documents into a low-dimensional binary space
2. definition of the fitness of a prototype in terms of distances to documents of its own category, as well as distances to documents of other categories
3. search for the fittest prototype constellations
4. use of the most fit prototype constellation as a sort of retrieval filter

In spite of its generalizing and thus "machine learning" capabilities, our algorithm is essentially a non-connectionist one. Thus, instead of introducing synapses between neurons, or speaking about edges between nodes of a graph - briefly, instead of speaking about deep learning of multi-layer encoders of stacks of Restricted Boltzmann Machines fine-tuned by back-propagation, as (Salakhutdinov and Hinton, 2009) do - we have found it preferable to reason in geometric and evolutionary terms. It is indeed due to this "geometric" perspective that the computational complexity of the algorithm is fairly low: ∆|D||C| for the evaluation of the fitness of one individual prototype constellation. In a future study, we aim to explore the performance of a slightly modified fitness function whose complexity ∆|D| + |C|² could be of particular interest in cases of huge data-sets (i.e. big |D|) with a fairly limited number of classes (|C|).

In practical terms, it is also advantageous that both the fitness function evaluation as well as the final retrieval assess distances in terms of the binary hamming distance measure. In both cases, one can use basic logical operations like XOR plus some basic assembler instructions to execute this sort of "conceptual geometry" with particular swiftness and ease.
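For illustration, PERL itself exposes this kind of operation: the unpack "%32b*" checksum template sums the bits of a string, so the XOR-plus-popcount computation of a hamming distance between two packed 128-bit hashes can be written as follows (a standard idiom, not the dissertation's own routine):

use strict;
use warnings;

# two 128-bit hashes packed into 16 bytes each
my $h1 = pack 'B128', '0110' x 32;
my $h2 = pack 'B128', '0011' x 32;

# Hamming distance = popcount(h1 XOR h2);
# '%32b*' sums all bits of the XORed string into a 32-bit checksum
my $xor     = $h1 ^ $h2;
my $hamming = unpack '%32b*', $xor;

print "Hd(h1, h2) = $hamming\n";   # prints 64 for these two hashes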
Given these properties, plus the fact that the hashes which are manipulated are fairly small (in one gigabyte of memory, one can store 16-byte hashes for more than 60 million documents), one can easily predict the existence of a future application-specific integrated circuit (ASIC) potentially executing billions of query2document comparisons per second.

Computational aspects aside, our primary motive in developing the algorithm hereby proposed was to furnish a sort of cognitively plausible (Hromada, 2014b) "experimental proof" for our doctoral Thesis which postulates that a sort of evolutionary process exists not only in the realm of biological species, but also in realms populated by "species" of a completely different kind. Id est, in realms of linguistic structures and categories, in realms of word meanings, concepts and, who knows, maybe even in the realm of mind itself. Being uncertain about whether we have demonstrated, with sufficient clarity, that it is reasonable to postulate not only neural (Edelman, 1987), but also intramental evolutionary processes, we conclude by saying that the formula hereby introduced offers a simple yet reasonably accurate method of solving the problem of multi-class categorization of texts.

16.7 generic conclusion

Speaking less concretely, this article shows that a model implementing evolutionary search within a certain type of vector space can bring practically applicable results. Given that the results obtained with training data are to a non-negligible extent transposable to testing data, one can consider such a model to instantiate a particular case of machine learning (P+125-130). The training data-set is labeled and the labels are exploited to direct the evolutionary search: hence, the algorithm can be understood as a supervised one.

Concretely speaking, this article shows how one can perform multi-class (N=20) classification of textual documents. Hence, newsgroup postings were considered as the entities which are to be classified, and occurrence frequencies of words contained within the postings were used as the features by means of which these entities are characterized.

And speaking less concretely again, this chapter indicates that evolutionary computation can provide the means to identify constellations of regions in a semantic space which roughly correspond to constellations of semantic categories³. Ideally, the process converges to a state where correct category labels are attributed to correct regions with correct extension. It is in this sense that the approach hereby introduced can be, mutatis mutandis, understood as a potential model of vocabulary development within an individual child. This is so because the aim of vocabulary ontogeny is analogical: one aspires to attribute correct phonic representations ("words", "signifiers", "labels") to correct regions of the conceptual space. As has been observed by other researchers (P+173) or illustrated by the Borgesian Ding-Dong Mystery (P+177-179), such a process of attributing appropriate handles to appropriate vessels is far from being a monotonic descent to the most optimal state. Rather, the process of acquisition of vocabulary is full of periods where a category is either too exhaustive or too specific, full of small adjustments, detours and returns. It is in this sense that the conjecture «learning of words is an evolutionary process» should be interpreted, and it is in this sense that the aspirations of the algorithm hereby introduced are to be understood.
3 Note that we use the terms "semantic category", "semantic class" and "concept" as synonyms.

17 EVOLUTIONARY INDUCTION OF A LIGHTWEIGHT MORPHOSEMANTIC CLASSIFIER

17.1 generic introduction

The aim of the previous chapter was to show that one can use evolutionary computation to induce sufficiently pertinent semantic categories from a corpus of text documents. Individual text documents were understood as "entities", words present within such documents were understood as their "features" and the topics¹ to which diverse documents were attributed were understood as "semantic categories". Analogies between such a process of induction of semantic categories and the process of "vocabulary development", occurring in practically every human being from birth until death, have also been drawn.

In this chapter we shall explore evolutionary models of induction of yet another type of categories which also play a non-negligible role in human linguistic communication. Id est, induction of grammatical categories. And given that the commonly used definition of a grammatical category (GC) as a grouping of language units sharing some common feature or function is very general and vague, this chapter shall focus on a particular type of GCs, that of "parts-of-speech" (e.g. "nouns", "verbs", "adjectives" etc.). There are three main "technical" reasons which motivate this choice:

• part-of-speech induction (POS-i, P+135-136) and POS-tagging are well-known NLP problems
• in spite of being well-known, relatively few researchers have proposed evolutionary means to solve these problems (P+137-139)
• certain transcripts within the CHILDES (P+196-222) corpus are tagged with POS-labels

and it is the 3rd reason which is to be understood as the most decisive one in regards to the "psycho-linguistic" aims of this dissertation. But the ultimate reason for which we have opted to focus on part-of-speech categories is a theoretical one: part-of-speech categories tend to integrate a word's semantic content with its grammatical function.

In other terms, the very information that "X belongs to the category of nouns" informs the one who already disposes of a certain notion of what a noun is that X most probably denotes a thing or a state. And the very information that "Y is a member of the category of verbs" suggests that Y most probably denotes a process or an activity. In this regard, the membership of the word W in the category C is an irreplaceable clue not only to W's function and position in the enveloping utterance, but also to W's meaning. This is maybe not so important when the meaning of W is already known, but in the case of a language-learning toddler, the ability to recognize that W ∈ C could significantly reduce her difficulties in solving the problem "to which components in a recently perceived scene should a novel W be associated?". Simply stated: POS-categories can help the child to bootstrap (Karmiloff and Karmiloff-Smith, 2009, pp.111-118) herself into the language.

But how does a child construct such categories in the first place? The aim of the article hereby introduced, recently submitted to the journal Computational Linguistics (Hromada, 2016c), is to propose an evolutionary answer.

1 Note the congruence between the fact that the word "topic" is derived from the Greek τόπος, which means "place", and the fact that in computational semantics a topic is literally understood as a "place" within the semantic space.
17.2 introduction

What is the essence of linguistic categories, how are such categories represented in the human mind and how do such representations develop? These are questions which have intrigued linguists and philosophers since time immemorial, questions of such elusive nature that any proposal aspiring to answer them has to be, per definitionem, only partial and incomplete. Such epistemological problems notwithstanding, contemporary computer science tends to offer an instructive answer: categories are classes, and classes can be operationalized as regions within a ∆-dimensional vector space S∆. Under such a definition, the training of a categorizing system (i.e. a "classifier") can be simulated as a search for the most accurate partitioning of S∆. This holds for categories in general and hence it also holds for linguistic categories in particular.

One possible way how such partitioning can be performed is offered by so-called Support Vector Machines (SVM, Cortes and Vapnik (1995)). The basic idea behind SVMs is simple: the algorithm aims to find a hyper-plane (also called a "decision boundary") which cuts the vector space into two sub-spaces, each of which shall ideally contain only data-points attributed to one class. But not only that: given that many such decision boundaries are often possible and identifiable, an SVM tends to identify the one which maximizes the gap (i.e. margin) between the data-points themselves and the boundary. The motivation behind such a choice is simple: the more the margin is maximized in regards to objects extracted from the training data-set, the more it can be expected that objects extracted from a previously unseen "testing data-set" shall also be projected onto the correct side of the boundary. And very often they indeed are: SVMs are able to generalize.

17.2.1 from planes to prototypes

In spite of their theoretical elegance, SVMs - as well as their neural network "perceptron" counterparts - have one important drawback. That is: SVMs and perceptrons look for a "plane" which cuts the space into partitions. But as is illustrated by Figure 31, data-to-be-classified is very often not "linearly separable": a linear decision boundary is nowhere to be found (Minsky and Papert, 1969). In SVM practice, the problem is often solved by applying a certain "kernel function" (Hofmann et al., 2008) which projects the initial data-set onto a space of higher dimensionality where - if the kernel was well chosen - the data can be separated. While kernel functions have other pleasing mathematical properties², they are highly abstract and of significant «mathematical slant» (Hofmann et al., 2008). This, we believe, makes it almost impossible that kernel-based models could ever be labeled as "cognitively plausible" (Hromada, 2014b). In other terms: it is highly improbable that the human cognitive and neurolinguistic system would implement as mathematically precise, pure and fragile a machinery as kernels definitely are.

In this article we shall argue that it is to a great extent possible to bypass the problem of "linear separability". This is to be attained by focusing one's attention on neighborhoods of points PA, PB, ..., PX supposedly representing categories A, B, ..., X instead of focusing it on linear boundaries BAB, BAX, BBX, ... which supposedly represent the distinction between A and B, A and X, etc. Hence, categories are to be defined in terms of their prototypes (Rosch and Mervis, 1975; Hromada, 2015).

2 The most prominent of which is related to the so-called "kernel trick" which can significantly speed up the classifier-training process.
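The resulting classification rule is almost trivially simple, as the following hedged PERL sketch illustrates (the prototype bit strings are invented for the sake of the example):

use strict;
use warnings;

sub hd {                        # Hamming distance of '0'/'1' strings
    my $x = $_[0] ^ $_[1];
    return $x =~ tr/\0//c;
}

# one 64-bit prototype per category (values purely illustrative)
my %prototype = (
    ACTION    => '0110' x 16,
    SUBSTANCE => '1010' x 16,
    PROPERTY  => '0001' x 16,
);

# a data-point is simply assigned to the category of its nearest prototype
sub classify {
    my ($hash) = @_;
    my ($best) = sort {
        hd($prototype{$a}, $hash) <=> hd($prototype{$b}, $hash)
    } keys %prototype;
    return $best;
}

print classify('1010' x 16), "\n";   # prints SUBSTANCE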
Prototypes themselves are points in S tending to satisfy the following condition: a point PC can be understood as an optimal prototype of a category C if and only if all data-points attributed to C are closer to PC than to any other prototype (PX, PY) simultaneously represented within the system.

In spite of its surface simplicity, the problem posed by this definition of "the optimal prototype" is not an easy one to tackle in a multi-class scenario: the constraint closer than any other simultaneously represented prototype substantially complicates the case. If this constraint weren't present, the problem of identification of "optimal prototypes" would be trivial: the prototype would simply be the centroid of all members of C. But the condition "closer than any other prototype" makes all components of the system mutually dependent on each other. In the end, one faces a problem somewhat analogous to the famous three-body problem in physics. That is, a problem of which it is well known that it is unsolvable by analytic means (Poincaré and Magini, 1899).

17.2.2 from prototypes to constellations

This article aims to demonstrate that the problem of discovery of constellations of optimal prototypes can be approximated by a nature-inspired non-connectionist method. In other terms, we shall use a relatively simple evolutionary algorithm in order to "induce" constellations of prototypes which are closer to the training data-points to which they should be close and further from the training data-points from which they should be far. Thus, an individual solution contains a position of each component prototype. Every individual has a genome of length |C|∆, whereby |C| denotes the number of distinct classes and ∆ is the dimensionality of the space within which the search is performed. As is common in evolutionary algorithms (EAs), these individual solutions are subjected to processes of replication, selection and variation across multiple generations. Notions of "far" and "close" are implemented directly in the fitness function so that the evolutionary search minimizes the number of incorrectly positioned "nearest prototypes". Ideally - id est, if the EA's parameters have been correctly specified and iff the problem of prototype constellation is optimizable at all - the system should converge to such a constellation of prototypes which could accurately classify both testing and training data.

17.2.3 from constellations to lightweight classifiers

Note that if EAs could discover and optimize such constellations, then these constellations would yield truly "lightweight" classifiers: a solution to the |C|-class classification problem of objects in ∆-dimensional space has length |C| ∗ ∆. To be even more radical, let us specify that the search shall operate within binary ∆ = 64 spaces, which means that the position of every data-point, as well as of every candidate prototype, can be defined by exactly 8 bytes. The 5-class classifiers presented in the next sections are thus described by no more and no less than 5 ∗ 8 = 40 bytes. Another reason why these classifiers can be considered as "lightweight" is the nature of the features used to project diverse textual tokens into such 64-dimensional Hamming spaces. Being aware of results issued from our previous empiric simulations (Hromada, 2014a), we have decided to use three features only, i.e.
• the suffix of the word W (i.e. the last three characters of the word-to-be-categorized)
• the suffix of the word WLeft (i.e. the word immediately preceding W)
• the suffix of the word WRight (i.e. the word immediately following W)

in order to transform tokens into geometric entities. No other feature has been used during the geometrization phase of the algorithm. All this in order to propose a nature-inspired model of induction of part-of-speech categories which is, we believe, at least as "minimalist" as Chomsky's "minimalist" program (Chomsky, 1995).

17.3 method

The algorithm presented in this article is very similar to the one presented in (Hromada, 2015). The procedure starts with the characterization of training-corpus entities (i.e. "words") in terms of their features (i.e. the "suffixes" of W, WL and WR). These features are subsequently used to project all entities into a 64-dimensional Euclidean space SE(64): this component is known as Random Indexing (Sahlgren, 2005). In the following steps, the whole "space" is reflected so that entities and features "implicitly connected" in the original corpus are pushed closer to each other than entities and features which are not so connected: this component is known as Reflective Random Indexing (Cohen et al., 2010). At last but not least, all vectors are "binarized" by a simple binary thresholding procedure known as Light Stochastic Binarization (Hromada, 2014c). All these steps yield a binary Hamming space SH(64). Once SH(64) is constructed, one can proceed to the localization of the most optimal constellations of category prototypes. This is done by a fairly standard evolutionary algorithm (EA) which is more closely described in 17.3.2. The most fit solutions obtained after a certain number of generations are subsequently confronted with data extracted from the testing corpus in order to assess the EA's capability beyond the training set.

17.3.1 corpus

This article is conceived as a part of a dissertation addressing the possibility of developing evolutionary models of induction of linguistic categories in (and by) human children. This makes the choice of the corpus quite straightforward: the corpus from which we shall aim to extract first linguistic categories is to be contained in the Child Language Data Exchange System (CHILDES, (MacWhinney and Snow, 1985)). However, not all among the 30 thousand transcripts contained in CHILDES (Hromada, 2016e) contain part-of-speech labels. The quality of the labels also varies: this is no surprise given that some transcripts were manually labeled and/or corrected by multiple annotators while other transcripts have been labeled only by automatic NLP tools (Sagae et al., 2007). For this reason we have ultimately focused our interest on one particular corpus: Brown's (Brown, 1973) transcriptions of verbal interactions of a girl named Eve. Primo, because Brown's work is seminal for the whole discipline of developmental psycho-linguistics. Secundo, because it is indeed the Eve section of Brown's corpus whose POS-labels have been, according to (Sagae et al., 2007), manually corrected by human annotators.

Classes

According to (Sagae et al., 2007), each token of the CHILDES corpus is labeled with one among 31 part-of-speech tags. However, the majority of these tags are used only very rarely and/or denote such categories (e.g. AUX for auxiliaries, REL for relativizers or CONJ for conjunctions) of words which encode only a little amount of semantic or deontic information.
It is certain that mastery of words belonging to categories like AUX, REL or CONJ plays an important role in the development of full-fledged adult-like competence. But given that an objective of our dissertation was to elucidate how evolutionary computation can simulate the bootstrapping of morphosyntactic categories from semantics (and vice versa), we have decided to focus on the induction of five classes only. These are enumerated in Table 41.

Class      | CHILDES POS tags | Example words
ACTION     | v, part, cop     | "think", "saying", "is"
SUBSTANCE  | n                | "cookies", "cow", "ball"
PROPERTY   | adj, qn          | "better", "blue", "three"
RELATION   | prep             | "on", "with", "to"
REFERENCE  | pro, det, art    | "I", "you", "this", "the"

Table 41: Five classes of interest, their corresponding CHILDES part-of-speech tags and some example word types which instantiate them.

What is common to these classes is that their member words very often denote visible and tangible entities, states and processes. Id est, when a child hears these words it can be the case that she also perceives their referents by other senses. Classification of words labeled with tags OTHER than "v", "part", "cop", "n", "adj", "prep", "pro", "art", "det", "qn" has been excluded from the following analysis. Primo,

• because such words do, more often than not, lack easily recognizable visual semantic contents and should not thus be mixed with words which encode such contents

secundo,

• because in the ontogeny of a normal child, items belonging to such more abstract classes are mastered later (i.e. after the "toddlerese" (P+17) stage) than words denoting concepts subsumed under the five classes listed in Table 41 (Tomasello, 2009)

tertio,

• because the problem of classification of words into 5 classes is, of course, less computationally complex and hence more tractable than the problem of classification into 31 classes

and finally,

• because it is far from certain whether categories like "auxiliaries" or "relativizers" are represented per se within the minds of normal verbally communicating humans, or whether such categories are simply abstractions developed by linguists for their own purposes.

All these arguments taken together made us renounce our tentatives to train a 31-class POS-classifier and made us focus on the training of a 5-class classifier only.

Pre-processing

10443 "motherese" utterances have been extracted from twenty transcripts of Brown's Eve corpus. These are very easy to detect because in CHILDES, every utterance is on a separate line and begins with a trigram denoting the locutor of the utterance (in the case of mothers, the trigram is MOT). The 10443 lines which follow these "motherese" utterances and begin with the marker %mor have been also extracted: these are the lines which contain the manually annotated POS-labels. Thus, 10443 line-couplets like this:

Listing 10: Motherese utterance from CHILDES corpus + associated morphological tier.
eve05.cha:*MOT: that s a duck .
eve05.cha-%mor: pro:dem|that cop|be&3S art|a n|duck .

have been obtained by executing a simple shell command³. Lines beginning with MOT and %mor have been subsequently merged by a PERL script enrich_pos.pl⁴, which yields the output exemplified by Listing 11, the primary data format of this simulation: each token is characterized on a separate line along with the utterance in which it occurred, as well as with its "gold standard" class-label which was attributed to it by the manual annotators.
3 cd Brown/Eve; grep -A3 -P '^MOT' *|grep -P '(MOT|%mor)'
4 Publicly available at URL http://wizzion.com/thesis/simulation2/enrich_pos.perl

Listing 11: Primary input format of this simulation.
that###REFERENCE###train###that s a duck .
s###ACTION###train###that s a duck .
a###QUANTIFIER###train###that s a duck .
duck###SUBSTANCE###train###that s a duck .

Individual columns are separated by the ### separator. The first column denotes the entity itself (the word token), the second column contains its class, the third column specifies whether the token occurred in the training or testing part of the corpus and the last column contains the whole context within which the token entity occurred (i.e. the enveloping utterance). Let us specify that the training corpus was extracted from the first 12 Eve transcripts (i.e. files eve01.cha - eve12.cha) which describe verbal interactions which occurred before Eve attained 2 years of age. The testing corpus, on the other hand, was composed of 8 files (eve13.cha - eve20.cha) transcribed down as Eve was 2 - 2.21 years old. The script enrich_pos.pl thus outputs 12453 training corpus tokens and 8746 testing corpus tokens instantiating 972 (training) and 934 (testing) word types. Almost one half (449) of the word types occurring in the testing corpus do not occur in the training corpus.

17.3.2 algorithm

This is the core of the model. It consists of two major components:

1. "vector space preparation" (VSP): a trivial suffix-extracting filter is used in order to project text from the primary input onto a 64-dimensional Hamming space
2. "evolutionary optimization": searches SH64 for the most discriminative constellations of prototypes

Vector Space Preparation

The approach which was used to "geometrize" the primary textual input shares its essential features with that of Random Indexing (Sahlgren, 2005) as well as with other Vector Symbolic Architectures (Cohen et al., 2012) based on so-called Random Projection (Hromada, 2013). We describe it elsewhere as follows: «Given the set of N objects which can be described in terms of F features, to which one initially associates a randomly generated d-dimensional vector, one can obtain a d-dimensional vectorial representation of any object X by summing up the vectors associated to all features F1, F2 observable within X. Initial feature vectors are generated in a way that out of d elements of the vector, only S among them are set to either -1 or 1 value. Other values contain zero. Since the "seed" parameter S is much smaller than the total number of elements in the vector (d), i.e. S « d, initial feature vectors are very sparse, containing mostly zeroes, with occasional values of -1 or 1.» (Hromada, 2014c).

Section 17.2.3 has already indicated the nature of the features which we shall use to initiate the process of geometrization of textual input. We reiterate: we shall characterize every token T with three principal features only:

1. T's own suffix⁵
2. the suffix of the token to T's right
3. the suffix of the token to T's left

5 What we label as the suffix SFXT of token T is, for the purpose of this text, equivalent to T's terminal character trigram (i.e. T's last three letters).

Asides this, only two other "lateral features" are used: token T has the feature INIT if it is the initial (i.e. first) token of the utterance. Conversely, it is endowed with the feature END if it is the last (i.e. terminal) token of the enveloping utterance. These 3 principal and/or two lateral features are extracted - during the initial phase of VSP - by the following feature-extracting snippet.
Listing 12: PERL code of the suffix-feature extractor

sub suffix3_featurefilter {
    my @f;
    my @wrdz = split / /, shift;   # utterance in 1st parameter
    my $nam  = shift;              # token of focus in the 2nd
    my ($index) = grep { $wrdz[$_] eq $nam } 0 .. $#wrdz;
    $index += 1;
    my $pos = 1;
    for my $w (@wrdz) {
        my $lw = lc $w;
        my $s  = substr $lw, -3;   # terminal character trigram
        my $n  = $index - $pos;
        $n = $n * -1;              # features with minus to the left
        push @f, $n.$s if (abs($n) < 2);   # main 3 features
        $pos++;
    }
    push @f, "INIT" if $index == 1;              # lateral feature
    push @f, "END"  if $index == scalar(@wrdz);  # lateral feature
    return @f;
}

For example, when the Random Indexing procedure makes the following call:

suffix3_featurefilter("that s a duck", "that")

it returns three features characterizing this concrete occurrence (i.e. token) of the word "that":

INIT 0hat 1s

Accordingly, the features −1hat, 0s, 1a would be used to characterize this instance of the token s, and the features −1a, 0uck, END would characterize this instance of duck. This is the last level of representation which can still be understood as "symbolic". Subsequently, Random Indexing associates a random, sparsely non-zero init vector to each distinct feature (e.g. INIT, END, 0hat, −1hat, 1s, 0s, −1s, −1a, 1a, 0a, 1uck, 0uck etc.) present in any motherese utterance of the Brown/Eve corpus. All in all, the presence of 1321 distinct features has been assessed in the training corpus.

Once features are extracted, things go geometric. Vector representations for individual tokens are obtained as sums of the vector representations of their associated features. Subsequently, the initial random feature vectors are discarded and the features themselves are characterized as sums of the vector representations of their associated tokens. This step marks the first "reflective" iteration of the process called Reflective Random Indexing (RRI). C.f. Cohen et al. (2010) for a closer description of how and why RRI works. For the purpose of this article, let us just note that the introduction of 2, at most 3, "reflective iterations" practically always improves the results of one's experiment. This is, in a sense, quite expected: for what the reflective process does is not only enrich the representations of entities (e.g. tokens, documents) with information about their features (suffixes, resp. word occurrences), but also enrich the representations of features with information about the entities within which they occur. For example, not only should the word thinking be characterized by the feature "ends with the suffix ing", but, conversely, the feature "ing" is in part characterized by its occurrence in the word thinking.

Note that all vectors produced by RI and RRI are euclidean. After every "reflection", vectors are normalized to unit length. After the last such reflection, each real-number element of each vector is transformed into a Boolean value by the binary thresholding process known as Light Stochastic Binarization (Hromada, 2014c). Such binarization is the last step of the vector space preparation. At its end, one obtains a binary vector "hash" tending to have a property common to other convergent⁶ hashing methods (Datar et al., 2004; Salakhutdinov and Hinton, 2009): similar inputs tend to have similar hashes.

6 A hashing function FH is said to be convergent if similarity between its inputs implies similarity of its outputs. On the other hand, FH is said to be "divergent" if similarity between inputs does not imply similarity between output hashes. Being of strongly divergent nature, functions like SHA2 or MD5 are not to be confounded with the convergent hashing which we discuss here.
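To give the reader a feel for the whole pipeline, the following toy PERL sketch condenses the Random Indexing summation and a naive sign-threshold binarization into a few lines. Note that the thresholding shown here is a deliberate simplification of Light Stochastic Binarization, and all names and parameters are merely illustrative:

use strict;
use warnings;

my $DIM  = 64;   # dimensionality Delta
my $SEED = 3;    # number of non-zero elements per init vector

# a sparse random init vector: mostly zeroes, $SEED elements set to +/-1
sub init_vector {
    my @v = (0) x $DIM;
    $v[ int rand $DIM ] = ( rand() < 0.5 ? -1 : 1 ) for 1 .. $SEED;
    return \@v;
}

my %feature_vec;   # every distinct feature keeps one init vector

# token vector = sum of the init vectors of its features; then binarize
sub token_hash {
    my (@features) = @_;
    my @sum = (0) x $DIM;
    for my $f (@features) {
        $feature_vec{$f} ||= init_vector();
        $sum[$_] += $feature_vec{$f}[$_] for 0 .. $DIM - 1;
    }
    return join '', map { $_ > 0 ? 1 : 0 } @sum;   # naive sign threshold
}

# tokens sharing features (here INIT and 0hat) obtain overlapping hashes
my $h1 = token_hash(qw(INIT 0hat 1s));
my $h2 = token_hash(qw(INIT 0hat 1a));
print "$h1\n$h2\n";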
The moment of attribution of a binary hash to each token occurring in the corpus marks the end of the "vector space preparation" phase of the algorithm. In the current model, this VSP occurs only once - at the beginning of the simulation - and is not repeated.

Evolutionary Optimization

The ensemble of all binary hashes obtained from the corpus yields a hamming space SH with fairly low dimensionality. This is technically very advantageous since measuring distances can be very swift in such spaces: calculating the hamming distance between two binary strings is definitely⁷ less costly than calculating a distance between two real (or even complex) vectors. The fact that we can measure distances swiftly is crucial for our evolutionary approach, for the measurement of distances constitutes the very core of the fitness function which is to be evaluated for every individual member of every single generation of every single run of the simulation. This is exemplified by the following snippet of PERL pseudo-code.

7 Or at least on an ordinary transistor-based 21st century Turing machine.

Listing 13: PERL pseudocode of the prototype-inducing fitness function

for @individual (@population) {
    my $fitness = 0;
    for $training_token (@training_tokens) {
        $training_token_hash  = $hashes{$training_token};
        $training_token_class = $correct_classes{$training_token};
        $true_prototype_distance = hamming_weight(
            $training_token_hash XOR $individual[$training_token_class]);
        for $incorrect_prototype ($incorrect_classes{$training_token}) {
            # the innermost cycle
            $fitness-- if (hamming_weight(
                $training_token_hash XOR $individual[$incorrect_prototype])
                <= $true_prototype_distance);
        }
    }
}

As may be seen, the innermost cycle of the fitness function evaluation contains three operations:

1. XOR between the vector of the training object ~o and the vector of a "false" prototype ~pF: this yields a new vector with true values on those positions where the elements of the input vectors differ

2. calculation of the hamming weight (i.e. the number of non-zero bits) of XOR's result⁸: this is equivalent to the hamming distance Hd(~o, ~pF)

3. penalization (decrementation of the fitness value) for every incorrect prototype ~pF which is not further from ~o than ~o's true prototype ~pT, i.e.

Hd(~o, ~pF) <= Hd(~o, ~pT)    (7)

8 An assembler routine for hamming_weight calculation exploiting the POPCNT instruction implemented (on the hardware level) in SSE4.2-compliant CPUs (Suciu et al., 2011) is accessible at URL http://wizzion.com/thesis/simulation2/popcount.asm

This concrete instance of the prototype-inducing fitness function can be further elucidated by the formula

Fobject(~i, ~o) = |PF| whereby PF = { ~px : ~px ≠ ~pT ∧ Hd(~o, ~px) <= Hd(~o, ~pT) }    (8)

which defines the object-wise fitness Fobject(~i, ~o) of an individual solution ~i, in regards to the vector representation of a training object ~o, as the number (i.e. the cardinality of the set) of "false" prototypes PF which are not further from ~o than ~o's corresponding (i.e. "true") prototype ~pT. Subsequently, the overall fitness of the individual chromosome ~i, in regards to each and every object occurring in a training corpus T, is the sum

Ftotal(~i) = − Σo∈T Fobject(~i, ~o)    (9)

The sum is inverted so that the whole function is a maximization one.
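A small worked example (with invented numbers) may help: suppose |C| = 3 and consider a training object ~o whose true prototype lies at Hd(~o, ~pT) = 5, while the two false prototypes lie at Hd(~o, ~p1) = 3 and Hd(~o, ~p2) = 7. According to equation 8, ~p1 satisfies the penalization condition while ~p2 does not, hence PF = {~p1} and Fobject(~i, ~o) = 1; the object ~o thus decreases Ftotal(~i) by exactly one.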
Under such a definition, the maximum fitness value is 0 and corresponds to the situation where all training corpus objects are closer to their true prototypes than to any other prototype. In theory, it may be the case that multiple global optima of such kind exist. In practice, and in the case of many vector spaces, such global optima may not exist at all and the fitness of any locally optimal state will have a negative value.

The fitness function thus defined, the form of representation of individual solutions is quite straightforward: an individual solution ~i encodes a constellation of all candidate prototypes of |C| categories. This means that, in regards to every single object ~o present in the training corpus T, ~i shall encode not only the "true" prototype ~pT associated to ~o by the training corpus. It shall also encode all prototypes which are not ~o's true prototypes and which - if ever located closer to ~o than ~pT - should be evaluated as members of the set of "false positives" PF. In practice, an individual solution ~i is represented as a vector or an ordered tuple which concatenates all its components. The number of possible distinct individuals is 2^(∆∗|C|), where ∆ is the dimensionality of the space and |C| is the number of classes. Since in our simulations we have focused on the partitioning of a 64-dimensional space into five classes (|C|=5), there exist potentially 2^(64∗5) = 2^320 constellations. The fitness landscape is thus finite, but its complete traversal seems to be impossible to execute in a reasonable amount of time⁹.

9 At least on clusters of ordinary transistor-based 21st century Turing machines.

Two evolutionary heuristics have been deployed in order to explore the landscape:

1. CANONIC: a heuristic strongly reminiscent of Canonical Genetic Algorithms (Goldberg, 1990)
2. MERGE1: an extension to CANONIC which merges independent runs of CANONIC into one big population and continues the evolution further

In both approaches, every generation starts with fitness evaluation for all individuals in the population. Subsequently, a so-called 2-way tournament selection operator (Sekaj, 2005) selects the members of the mating pool. The size of the mating pool equals the size of the population. Members of the new generation are obtained from the mating pool as follows: two parent (mother and father) individuals are randomly chosen from the mating pool in order to be subsequently "cut" at a randomly chosen point. The segment before the cut is taken from the mother, the segment after the cut is taken from the father, and a new offspring is obtained. Any gene of the offspring's genome can be mutated with 0.2% probability: a mutation is equivalent to the flipping of a bit. Elitism is not implemented and even the most fit individual can be subjected to decay.

There are thus only two aspects in which CANONIC and MERGE1 differ. One difference is the population size: in CANONIC, populations are fairly small (100 individuals) while MERGE1 implements somewhat bigger ones (1000 individuals). Both heuristics also differ in the way their initial populations are generated. In CGAs one departs ex nihilo, and the CANONIC heuristic is no exception to this rule: genes present in the gene pool of generation 0 are randomly generated.
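The selection and variation operators common to both heuristics can be sketched in a few lines of PERL (a hedged illustration under the parameters just described; the fitness table and all names are assumed):

use strict;
use warnings;

# 2-way tournament: pick two random individuals, keep the fitter one
sub tournament {
    my ($pop, $fitness_of) = @_;
    my ($a, $b) = map { $pop->[ int rand @$pop ] } 1 .. 2;
    return $fitness_of->{$a} >= $fitness_of->{$b} ? $a : $b;
}

# one-point crossover followed by 0.2% per-gene mutation
sub offspring {
    my ($mother, $father) = @_;
    my $cut   = int rand length $mother;
    my $child = substr($mother, 0, $cut) . substr($father, $cut);
    for my $i (0 .. length($child) - 1) {
        substr($child, $i, 1) = 1 - substr($child, $i, 1)
            if rand() < 0.002;
    }
    return $child;
}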
Things are slightly different in the case of the MERGE1 heuristic: MERGE1 is initiated by the populations yielded by different runs¹⁰ of CANONIC after 200 generations. CANONIC and MERGE1 taken together can thus be understood as a very primitive form of "parallel genetic algorithm" (PGA) (Sekaj, 2004). Under this view, 100 independent runs of CANONIC can be understood as independent nodes on the lower level of the hierarchy and MERGE1 as the node of the higher level. A "migration" from all low-level nodes occurs after 200 generations. There follows a big tournament in which the initial MERGE1 population is constituted. Subsequently, MERGE1 evolves further.

Parameters

VSP:              Input corpus = Brown-Eve motherese¹¹, Feature Filter = suffix3, Dimensionality ∆ = 64, Seed S = 3, Reflections I = 3
CANONIC:          Population size N = 100, Selection = Tournament, Crossover = One-point, Mutation rate M = 0.2%, Initial population = ex nihilo, Generations G = 200, Elitism E = 0, Runs R = 100
MERGE1:           Population size N = 1000, Selection = Tournament, Crossover = One-point, Mutation rate M = 0.2%, Initial population = results of CANONIC, Generations G = 300, Runs R = 6
Machine Learning: Classes |C| = 5

Table 42: Parameters of simulation 2.

10 Note that one common "vector space preparation" phase preceded all CANONIC runs. Hence, in spite of the fact that diverse runs of CANONIC followed different evolutionary trajectories, they always did so in the space S64 explored by the other runs as well. This makes it possible to "merge" the results of different runs.
11 Available at http://wizzion.com/thesis/simulation2/eve12-8-5classes.mot

17.3.3 evaluation

The accuracy of induced classifiers was primarily evaluated in terms of the quantity of correctly predicted category labels (i.e. true positives). Hence, a maximum score of 100% would correspond to the situation when all objects have been successfully classified. On the contrary, a classifier attributing category membership at random would have a precision of cca. 20% in the case of classification into 5 equidistributed classes. The overall classification accuracy of classifiers induced by the CANONIC and MERGE1 heuristics has been evaluated after every 10 generations of the training process. Asides this, each class has been explored individually in order to yield class-specific precision and recall values.

Three other classification methods have been evaluated in order to compare the evolutionary method with non-evolutionary approaches:

• CENTROID_HAMMING and CENTROID_EUCLIDEAN baselines
• MSVM (i.e. a Multi-class Support Vector Machine)

The two baseline approaches characterize every class by its centroid. In the CENTROID_HAMMING approach, the centroid CX of a category X is a hash obtained as an average of the hashes of all objects belonging to X. Things are similar in the case of CENTROID_EUCLIDEAN, the only difference being due to the fact that the elements of object and centroid vectors are now represented in their real-valued form. Id est, a representation issued from the last reflective iteration of the RRI component of the VSP phase of our algorithm.

At last but not least, the binary vector space issued from the VSP phase has been partitioned by means of an MSVM implemented in the open-source package MSVMPack (Lauer and Guermeur, 2011). Default settings of the package have been used: a linear kernel has been applied and the training of the MSVM2 (Guermeur and Monfrini, 2011) model has been stopped after converging to the 98% accuracy level.
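As we read it, the "averaging" of binary hashes used by the CENTROID_HAMMING baseline amounts to a bit-wise majority vote, which the following PERL sketch (our own illustrative reading, not the baseline's actual code) makes explicit:

use strict;
use warnings;

sub hamming_centroid {
    my (@hashes) = @_;          # '0'/'1' strings of equal length
    my $centroid = '';
    for my $i (0 .. length($hashes[0]) - 1) {
        my $ones = grep { substr($_, $i, 1) eq '1' } @hashes;
        # the centroid bit is set iff the majority of members set it
        $centroid .= ( 2 * $ones > @hashes ) ? '1' : '0';
    }
    return $centroid;
}

print hamming_centroid('1100', '1010', '1001'), "\n";   # prints 1000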
17.4 discussion of results

Table 43 summarizes the main results of the five compared methods.

Method               | Training corpus | Testing corpus
CENTROID_HAMMING     | 455 (42.12%)    | 412 (40.47%)
CENTROID_EUCLIDEAN   | 572 (52.96%)    | 533 (52.35%)
MEAN(GA_CANONIC)     | 631 (58.44%)    | 589 (57.88%)
MEAN(GA_MERGE1)      | 718 (66.51%)    | 657 (64.57%)
FITTEST(GA_MERGE1)   | 772 (71.48%)    | 699 (68.66%)
MSVM2                | 781 (72.31%)    | 736 (72.30%)

Table 43: Overall results of five different approaches. GA results have been averaged across diverse runs (R = 6*100 for CANONIC, R = 6 for MERGE1).

The smallest amount of correctly classified tokens was attained by the baseline CENTROID approaches: this was expected since these approaches do not include any optimization at all¹². The observation that CENTROID_HAMMING is less precise than CENTROID_EUCLIDEAN is also trivial: the transformation of real-valued vectors into binary ones brings about a non-negligible information loss. The worse performance of binary-based classifiers is a result of this information loss.

12 Note, however, that the classification accuracy of these models is still significantly superior to a random classifier.

Optimization, however, can significantly reduce or even counteract the impact of such loss. Hence, even the fairly simple CANONIC genetic algorithm discovers, in just five sweeps through the hamming space, constellations of prototypes whose precision is higher than that of the Euclidean centroids. This is exemplified by Figure 30, which plots the evolution of precision across generations.

Figure 30: Evolutionary optimization increases the precision of a multi-class classifier. Curves represent results averaged across diverse runs (R = 6*100 for CANONIC, R = 6 for MERGE1).

It may be seen that the introduction of a PGA-like approach - as exemplified by MERGE1 - results in a significant increase in the amount of precisely classified tokens. The score is still not as high as that of MSVM2 (compare 781 with 718 for the training corpus, resp. 736 with 657 in the testing corpus), but the jump between CANONIC and MERGE1 suggests that another PGA architecture, the introduction of elitism, or a different choice of parameters or operators could potentially result in a significant boost.

Table 44: MSVM2 training corpus confusion matrix.

      ACT  SUB  PROP  REL  REF
ACT   266   54    0    0    1
SUB    55  495    4    0    0
PROP   21   66   18    0    0
REL    20   12    1    2    0
REF    15   47    3    0    0

Table 45: MSVM2 testing corpus confusion matrix.

      ACT  SUB  PROP  REL  REF
ACT   271   38    4    0    1
SUB    55  450    8    1    0
PROP   21   62   15    0    0
REL    20    6    3    0    0
REF    20   38    5    0    0

Table 46: Training corpus confusion matrix produced by FITTEST(GA_MERGE1).

      ACT  SUB  PROP  REL  REF
ACT   278   28    4    4    7
SUB    56  427   34   18   19
PROP   19   39   43    3    1
REL    15    5    1   11    3
REF     9   35    7    1   13

Table 47: Testing corpus confusion matrix produced by FITTEST(GA_MERGE1).

      ACT  SUB  PROP  REL  REF
ACT   269   21    9    6    9
SUB    62  371   41   25   15
PROP   16   35   40    3    4
REL    15    3    4    5    2
REF    11   26    8    4   14

As may be seen in the confusion matrices shown in Tables 44 and 45, MSVM2 fails to correctly classify any testing corpus token attributed to the minor REL and REF categories (i.e. recall = 0%) and the situation is not better in the case of the PROP class either (testing recall 15.3%)¹³. On the other hand, this handicap is counteracted by MSVM's higher recall rates in regards to the dominating SUB and ACT classes. This could potentially suggest that MSVM still tends to behave like a good old "dualist" Support Vector Machine rather than a truly multi-class classifier.
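The class-wise figures discussed below follow directly from these matrices; the following PERL sketch, taking the training matrix of Table 46 as its input, reproduces for instance the 86.6% recall and ≈73.7% precision of the ACTION class:

use strict;
use warnings;

my @classes = qw(ACT SUB PROP REL REF);
my @m = ( [ 278,  28,  4,  4,  7 ],    # rows: gold classes
          [  56, 427, 34, 18, 19 ],    # columns: predicted classes
          [  19,  39, 43,  3,  1 ],
          [  15,   5,  1, 11,  3 ],
          [   9,  35,  7,  1, 13 ] );

for my $c (0 .. $#classes) {
    my $tp   = $m[$c][$c];
    my $gold = 0; $gold += $m[$c][$_] for 0 .. $#classes;   # row sum
    my $pred = 0; $pred += $m[$_][$c] for 0 .. $#classes;   # column sum
    printf "%-4s recall %5.1f%%  precision %5.1f%%\n",
        $classes[$c],
        $gold ? 100 * $tp / $gold : 0,
        $pred ? 100 * $tp / $pred : 0;
}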
The confusion matrices in Tables 46 and 47 indicate that FITTEST(GA_MERGE1) also performs quite well when it comes to the classification of tokens into the major categories ACTION (86.6% recall; 73.74% precision) and SUBSTANCE (73.74% recall; 79.96% precision). Asides this, it also attains 40% testing recall for the PROPERTY class and 22% testing recall for the REFERENCE class. This suggests that even categories of minor importance play a certain role in models induced by the evolutionary search for prototype constellations.

13 These low recall rates imply that the average F1 score of MSVM is, in fact, inferior to that of FITTEST(GA_MERGE1). This is the case for both the training (F_MSVM = 0.481; F_FITTEST(MERGE1) = 0.518) as well as the testing (F_MSVM = 0.426; F_FITTEST(MERGE1) = 0.474) phase.

17.5 conclusions

17.5.1 computational conclusion

Figure 31: Centroidal tessellation of twelve data-points belonging to three distinct classes. Dots represent data-points, crosses are category prototypes and colors denote category membership. Black lines denote tessellation boundaries.

Figure 31 displays a potential training data-set composed of twelve data-points attributed to three distinct classes. One can observe that it is not possible to draw a single straight line which would separate all data-points of one class from the data-points of the other classes. Hence, these data-points are plainly not separable by a linear boundary: many a researcher would be tempted to say that in order to classify such a data-set, one would be obliged to apply a certain kernel and project it into a space of higher dimensionality. This is, however, not necessary if one applies a machine learning strategy which looks for constellations of points instead of lines, planes or hyper-planes. Denoted on Figure 31 by crosses of different colors, such points - labeled as "category prototypes" - satisfy one simple condition: every data-point is closer to its prototype than to any other prototype.

The search for constellations of prototypes which satisfy such a condition can thus be understood as a problem closely related to the problem of Voronoi-Dirichlet tessellations (Aurenhammer, 1991). But contrary to such approaches, where the "seed" or "generator" points are given in advance, the positions of such points are, in our approach, induced by means of evolutionary computation.

The inductive process described in this article took place in a 64-dimensional binary space. The reasons behind this choice were of a pragmatic nature:

1. optimization involving the calculation of Hamming distances can be very fast, especially when implemented on arrays of dedicated Field Programmable Gate Arrays (Sklyarov and Skliarova, 2014) or Application Specific Integrated Circuits
2. binary hashes are a very concise form of representation: our approach could thus be useful in Big Data scenarios¹⁴

14 In the case of 64-bit hashes, one could potentially need as little as 800 Megabytes of storage volume in order to store hash representations of 100 million documents.

These reasons aside, nothing forbids one to bypass the "binarization procedure" and search for constellations of prototypes in a Euclidean space. It can be expected that the precision of classifiers induced in a Euclidean space would be higher than the precision of classifiers induced in binary spaces. However, since there is no free lunch, such a euclidean search would undoubtedly be more demanding when it comes to the consumption of both memory and computational resources. Shortcomings related to the decision to execute the search in a binary space notwithstanding, the obtained results are quite encouraging.
Hence, in a scenario aiming to classify tokens occurring in the Brown-Eve section of the CHILDES corpus into 5 morphosemantic classes, classifiers induced by evolutionary optimization identified almost as many true positives as a multi-class SVM (Lauer and Guermeur, 2011). In terms of the F1-score obtained as a harmonic mean of average recall and average precision, the performance of the most fit prototype constellation FITTEST(GA_MERGE1) turned out to be even higher than that of MSVM2. This, however, is more a residuum of the F1-score metrics than a result which would merit to be reported elsewhere than in footnote 13.

17.5.2 psycholinguistic conclusion

Table 48 lists the tokens located in the closest neighborhoods of the three major prototypes which have been encoded in the constellation FITTEST(GA_MERGE1).

P_ACTION               P_SUBSTANCE                P_PROPERTY
H  TOKEN       POS     H  TOKEN          POS      H  TOKEN         POS
10 pointing    part    10 penny          noun     18 *whistle      noun
10 tripped     v       10 tummy          noun     20 *bent         v
11 slipped     v       11 cracker        noun     21 *graham       noun
11 squashing   part    11 graham+cracker noun     21 tough         adj
12 *building   noun    11 key            noun     21 alright       adj
12 burped      v       11 *matter        v        22 other         adj
12 cutting     v       11 paddle         noun     22 *pitcher      noun
12 dripping    part    12 letter         noun     22 *sweetheart   noun
12 *drinker    noun    12 *numbers       v        22 *fixed        v
12 mix         v       12 paper          noun     23 *a            art
13 *hit        n       12 snowman        noun     23 *cough        noun
13 hit         v       12 nap            noun     23 *fun          noun
13 playing     v       13 bx             noun     23 *grannie_hart noun
13 dropped     v       13 face           noun     23 worse         adj
13 are         v       13 maam           noun     23 *lemon        noun
13 saw         v       13 purple         noun     23 little        adj
13 standing    part    13 soup           noun     23 *through      prep
13 swim        v       13 stove          noun     24 all_gone      adj
13 want        v                                  24 good          adj
13 wiped       v                                  24 bigger        adj
                                                  24 busy          adj

Table 48: Testing corpus tokens closest to the prototypes of ACTION, SUBSTANCE and PROPERTY encoded in the FITTEST(GA_MERGE1) constellation: Hamming distance H(token, prototype) and the token's CHILDES part-of-speech annotation. False positives are marked by an asterisk.

A subsequent inspection of the false positives present in Table 48 turns out to be quite instructive. Hence, the token "building", present in the utterance "what are you building here?" on line 5417 of the eve05.cha transcript, is clearly not a noun, as the CHILDES annotators supposed, but rather a participle - and hence an instance belonging to the ACTION class, as correctly predicted by FITTEST(GA_MERGE1). Idem for "hit", present in the utterance "did you hit your head?" on line 4145 of the eve01.cha transcript: the token is clearly not a noun, as postulated by the CHILDES annotators, but, as predicted, a verb and hence a member of the ACTION class. And one can continue: the token "matter", annotated on lines 2152 and 5688 of the CHILDES corpus as a verb, is clearly not a verb but a noun - and hence a member of the class SUBSTANCE - because it twice occurs in the utterance "what's the matter?". And in spite of the fact that CHILDES labels the token "numbers" as a verb, it is definitely not a verb when it occurs in the utterance "the numbers are going around too" (eve15.cha, line 6276). Et caetera, et caetera. Thus, in spite of the fact that the POS tokens in the Brown/Eve section of the CHILDES corpus are supposedly «annotated with high accuracy» (Sagae et al., 2007), it is, sometimes, not really the case.
In this regard, one would be tempted to state that, as of 2016 AD, the frontier between developmental and computational psycholinguistics still resembles a structure standing on clay feet. This is a first conclusion which could be potentially useful to any (comp|dev) psycholinguists willing to undertake the path initiated by the study hereby introduced.

The fact that our approach has allowed us to identify errors in the corpus which even humans did not succeed in identifying is indeed encouraging. And it is even more encouraging when one realizes how simple the feature set was which was used to construct the vector space in which all subsequent classifications took place. We repeat: every token T was primarily characterized by:

1. T's three last characters
2. the three last characters of the token which precedes T
3. the three last characters of the token which follows T

Asides this, the only other information taken into account concerned T's potential position at the very beginning or end of the utterance. The reason for departing from such a restricted feature set has been in part empiric (Hromada, 2014a). But there exist other, more profound reasons why we have initiated the training of a verbally interacting computational agent with a focus on suffix-like features. Primo: the "less is more" hypothesis, whose implication for neural-network-based processing of natural language has been so beautifully demonstrated by Elman (1993). Secundo: note Slobin's operating principle A: «Pay attention to the ends of words.» (Slobin, 1973) which, according to its author, is a "general developmental universal". In this regard, our analysis does indeed demonstrate that "ends of words" offer features strong enough to initiate a supervised process of induction of categories which have been, for the purpose of this article, labeled as "morphosemantic". And that the whole process can yield fruit even when a 5-class classifier is encoded in a representation as concise as a 40-byte long vector definitely is.

17.6 generic discussion

This chapter has presented an algorithm which succeeds in correctly classifying a significant amount of tokens into so-called "morphosemantic classes" (MS-classes). But why should one speak about such MS-classes instead of staying faithful to the well-established term "parts of speech"? The answer is simple: because MS-classes are sometimes not equivalent to parts-of-speech categories. For example, the MS-category labeled as "ACTION" includes not only verbs, but also participles. The motivation behind this distinction is quite simple: it may potentially make sense for an expert linguist to state that "eating" functions as a participle but "to eat" as a verb. However, a modal toddler of 20 months shall most probably turn out to be ignorant of such a distinction (Tomasello, 2009). For what counts for such a toddler is the fact that she can associate both the word "eat" and the word "eating" with the fact of simultaneously observing a certain invariant structural property of her¹⁵ surrounding environment (i.e. she observes the activity of putting something into one's mouth). Table 41 introduced the five initial MS-classes¹⁶.
These MS-categories have been defined very loosely within the limited scope of this study: all substantives were defined as belonging to the class SUBSTANCE, diverse verbal, participial and infinitival forms as those instantiating ACTION, adjectives and numerals were collapsed into the MS-class PROPERTY, everything which had something to do with pointing, specification and deictics was subsumed under REFERENCE, and prepositions were taken to instantiate the notion of RELATION. Said in more practical terms: the introduction of the notion of the MS-class allowed us to enrich a certain section of the CHILDES corpus (i.e. Brown's 20 transcripts of a girl named Eve) (Brown, 1973) with a certain amount of loosely semantic information.

Loosely, because MS-classes, as used in this chapter, are loosely constructed themselves. For it is not always true that POS-substantives always denote substances and POS-verba always denote actions: no serious linguist could defend such a general view in more than one article and still stay unostracized by the linguistic community. Loosely, but in regards to the "motherese" addressed to a modal toddler (P+17), also semantic. For what is more vital for an 18-month-old child: to understand & express the difference between the verb "eat" and the participle/property "eating", or rather to understand & express the difference between the act of eating and the object being eaten?

We summarize: the act of making a notational turn from the concept of "parts-of-speech" to the notion of the "morphosemantic class" led to an enrichment of the CHILDES corpus with a few bits of semantic information. A few bits maybe, but still more bits than noise. The subsequent coupling of this information with the morphological information contained in suffixes, followed by optimization by means of an evolutionary algorithm, allowed us to converge to very concise, 40-byte long multi-class classifiers. These classifiers have subsequently resulted in the identification of errors produced by much more complex and - so the authors pretend - also «highly accurate» (Sagae et al., 2007) POS-tagging systems supposedly corrected by multiple human annotators. These considerations make us believe that the notion of a morphosemantic classifier could be of certain use and applicability for any present or future researcher aiming to deploy, develop or fine-tune certain nature-inspired yet cognitively plausible (Hromada, 2014b) models of ontogeny of linguistic categories¹⁷.

15 To stay consistent with Conceptual Foundations as well as with other books of the psycholinguistic tradition, we refer to toddlers and children with the feminine pronouns "she", "her" etc.
16 We leave to the reader's own ingenuity the exploration of the extent to which these MS-classes could correspond to Aristotle's categories, or Kant's and Piaget's "forms of pure reason".
17 Proof-of-concept source code of this simulation is freely available at URL http://wizzion.com/thesis/simulation2/ELLA.tgz under the mrGPL licence.

17.7 second simulation bibliography

18 EVOLUTIONARY INDUCTION OF 4-SCHEMA MICROGRAMMARS FROM CHILDES CORPORA

18.1 general introduction

The first simulation has indicated that one can use evolutionary computation in order to partition a semantic feature space into regions which roughly correspond to certain "topics". The second simulation has shown how an evolutionary search succeeds in increasing the accuracy of so-called morphosemantic classifiers.
Both simulations differed in regards to the corpus which was analyzed (the 20 Newsgroups corpus in simulation 1, the CHILDES/Brown/Eve corpus in simulation 2) as well as in the feature set used to project the initial text into a binary vector space. However, both simulations:

1. were optimized by means of an evolutionary algorithm
2. succeeded in transposing knowledge present in the training set in order to correctly classify the elements of the testing set (i.e. generalization)
3. used a labeled corpus as the input of the learning process

Taken together, points two and three indicate that simulations 1 and 2 can be understood as particular instances of supervised machine learning. That is, a case of learning which demands more than exposition to the plain input corpus. In the case of supervised learning, one needs to have another, parallel source of information as well. Category labels which have been manually attributed by human annotators are the most common case of such a "parallel" source of information.

It may be the case, however, that certain problems do not necessitate exposure to such additional input at all. Such is, according to some linguists, also the problem of grammar induction (P+148-162), whereby one aims to infer a grammar of a language L solely from a corpus of utterances of L. Because of this, computational models of GI are considered to be particular cases of unsupervised machine learning [1].

This chapter shall aim to present one particular model of GI. That is, an evolutionary model strongly resembling the models presented in previous chapters. But also a model aspiring to induce certain generic "microgrammars" from nothing else than the Brown/Eve section of the CHILDES corpus. The article presented in this chapter has been submitted to the journal Evolutionary Computation (Hromada, 2016b).

[1] Note, however, that the very act of choosing, at the moment T0 (and not at T1), an input corpus CX (and not CY) can also be considered as an act of supervision. C.f. (Hromada, 2014b, 2016f) for further discussion of the "unsupervised" vs. "semi-supervised" dilemma.

18.2 introduction

The input of a Grammar Induction (GI) process is a corpus of sentences written in a language L; its output is, ideally, a grammar (P+117-P+124) or a transparent language model able to generate sentences of L, including sentences that were not present in the initial training corpus. In spite of the seemingly simple nature of the problem, the induction of grammars from natural language is quite a difficult nut to crack. Thus, symbolic models like the Syntagmatic-Paradigmatic GI (Wolff, 1988) or the graph-based ADIOS (Solan et al., 2005; Brodsky et al., 2007) do, indeed, attain interesting results in their efforts to extract English grammar from English corpora. But given the deterministic nature of these models, they tend to converge to certain local optima from which there is no way out. To make things worse, such models often do not dispose of means which would allow them to purge themselves of unwanted over-regularizations (P+83).

In this chapter, we shall present a GI model aiming to harness evolution's ability to discard the unwanted. What's more, we shall exploit the genotype - phenotype distinction (Fogel, 1995) in order to perform sub-symbolic variation of sets of symbolic sequences. By doing so, we shall obtain a model which integrates entities represented at two levels of abstraction:

1. sub-symbolic feature vector spaces
2. symbolic Perl-compatible regular expressions

Ideally, such a model could be both robust as well as flexible enough to find its middle path between grammars which cover just one thing and grammars which cover everything.

18.2.1 two extremes

The nature of the resulting grammar is closely associated with the content of the initial corpus as well as with the nature of the inductive (learning) process. According to their «expressive power», all grammars can be located somewhere on a «specificity - generality» spectrum. On one extreme lies the grammar having the following production rules:

1 → 2*
2 → a | b | c | ... | Z

whereby * means «repeat as many times as You Want» and | denotes disjunction. This very compact grammar can potentially generate any text of any size and as such is very general. But exactly because it can accept any alphabetic sequence and thus does not have any «discriminatory power» whatsoever, such a grammar is completely useless as an explication of the system of any natural language.

On the other extreme of the spectrum lies a completely specific grammar which has just one rule:

1 → <Corpus>

This grammar contains exactly what Corpus contains and is therefore not compact at all (in fact, it is even two symbols longer than Corpus). Such a grammar is not able to encode anything else than the sequence which was literally encoded in the training Corpus. Such a grammar is therefore completely useless for any scenario where novel sequences are to be generated (or accepted).

The objective of the GI process is to discover, departing solely from a Corpus (written in a language L), a grammar which is neither too specific nor too general. If it is too general, it shall «over-regularize» (P+83). That is: such a G shall be able to generate (or accept) sentences which the common speaker of L would never ever consider as grammatical. On the other hand, if G is too specific, it shan't be able to represent all sentences contained in Corpus or, if it shall, it shan't be able to generate (or accept) any sentence which is considered to be a sentence of L but was not present in the initial training Corpus.

18.2.2 definitions

G-Category (DEF)
Let's have a set of N objects (O1, O2, ..., ON) embedded within a ∆-dimensional space S (i.e. every object OX can be described by a vector o⃗X = (V1, V2, ..., V∆)). Then a geometrized category (G∆-category) C is defined as the content of an S-embedded ∆-dimensional sphere with

1. a centroid whose coordinates are given by a vector c⃗ = (C1, C2, ..., C∆)
2. a radius R

Under such a definition, all objects OY, OZ, ... positioned within the volume of C are to be understood as members of C.
end g-category 18.2.2.0

We reinforce: under this view, a G∆-category is a convex region within S (Gärdenfors, 2004) [2]. Concrete geometric properties of such a ball (e.g. the increase of its volume with the increase of its radius etc.) are, of course, dependent on the nature of the metric space in which the sphere is embedded (e.g. V(r) = (4/3)πr³ for 3E-categories, i.e. categories embedded within a 3-dimensional Euclidean space).

In our simulations 2 and 3, we have used the Light Stochastic Binarization (Hromada, 2014c) algorithm to project the initial objects onto positions within 128- or 64-dimensional binary Hamming spaces. We define categories within such spaces as follows:

H∆-Category (DEF)
An H∆-category is a Hamming ball within a ∆-dimensional Hamming space.
end H∆-category 18.2.2.0

[2] Those endowed with synesthesia could potentially visualize G-categories as ∆-dimensional pearls (Hesse, 1967) or balls of certain material, state and color.
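As an illustration of this definition: in a Hamming space, testing whether a word belongs to an H-category amounts to a bitwise XOR followed by a population count. The following sketch is our own illustration (with hamming_weight written out as an assumed helper, consistent with its later usage in Listing 15; the example values are the toy hashes of Table 49 below, treated as zero-padded integers):

    # number of set bits in an integer (population count)
    sub hamming_weight {
        my $bits = sprintf '%b', shift;   # binary representation
        return $bits =~ tr/1//;           # count the '1' characters
    }

    # membership in an H-category: Hamming distance below the category radius
    sub in_h_category {
        my ($word_hash, $centroid, $radius) = @_;
        return hamming_weight($word_hash ^ $centroid) < $radius;
    }

    # e.g. "duck" (0x7720) lies within a category centered at 0x7723 with
    # radius 7, since only 2 bits differ between the two hashes
    print in_h_category(0x7720, 0x7723, 7) ? "member\n" : "outside\n";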
Given that

1. the radius of an H∆-category cannot be higher than ∆ (for such a sphere would envelop the whole space S)
2. any integer from the range ⟨0, ∆⟩ can be represented with log₂ ∆ bits
3. log₂ 128 = 7 and log₂ 64 = 6

it is evident that one needs exactly 135 bits of information [3] in order to unambiguously specify a specific H128-category embedded in a 128-dimensional Hamming space. And one needs 70 bits of information in order to unambiguously specify an H64-category embedded in a 64-dimensional Hamming space. In this simulation, we shall juxtapose vectors representing diverse H64-categories in order to obtain more complex schemata.

N∆-Schema (DEF)
An N∆-schema is the result of the concatenation of N vectors g⃗1, g⃗2, ..., g⃗N whereby each vector g⃗1, g⃗2, ..., g⃗N represents a G-category located within a ∆-dimensional space S∆.
end N∆-schema 18.2.2.0

The focus of the current simulation shall be on the induction of schemata in the case where N = 4. Given that the basic units of such 4-schemata will be H64-categories, it can easily be seen that such 4-schemata can be encoded by no more and no less than 4 * 70 = 280 bits.
end definitions 18.2.2

[3] 128 bits to specify the coordinates of the centroid and 7 bits to specify the radius.

Under these definitions, the model and the simulation described in this text can be understood as a method which aims to infer - departing solely from a plain-text Corpus written in a language L_Corpus - a 4-schema (or a set of 4-schemata) able to generate utterances which were originally not in the Corpus but are nonetheless still syntactically correct utterances of the language L_Corpus.
end introduction 18.2

18.3 model

In its essence, the model presented in this simulation is reminiscent of the model presented in Chapter 17. Hence, during the phase of "vector space preparation", texts from English-language transcripts of CHILDES corpora are first projected into a 64-dimensional Hamming space H64. Subsequently, a search within H64 is realized by means of an evolutionary algorithm. There exists, however, a certain difference which ultimately causes the algorithm hereby presented to be an essentially non-supervised one. Thus, in the present situation, an HX-category increases the probability of its survival in time if and only if HX is contained in an utterance-like N-schema which matches as many utterances as possible.

18.3.1 vector space preparation

Listing 14: Perl code of the neighbor-word feature extractor

    sub word_juxtaposition_featurefilter {
        my @all_words = split / /, shift;
        my $word      = shift;
        my @features;
        my ($word_position) = grep { $all_words[$_] eq $word } 0 .. $#all_words;
        if ($word_position == 0) {                   # word begins the utterance
            push @features, "INIT";
            push @features, "1" . $all_words[$word_position + 1];
        } elsif ($word_position == $#all_words) {    # word ends the utterance
            push @features, "-1" . $all_words[$word_position - 1];
            push @features, "END";
        } else {
            push @features, "-1" . $all_words[$word_position - 1];
            push @features, "1" . $all_words[$word_position + 1];
        }
        return @features;
    }

The method known as Light Stochastic Binarization (LSB) (Hromada, 2014c) is used to project the input text onto H64.
Note, however, that the initial features slightly differ both from the approach presented in Chapter 16, which used word frequency distributions to project documents onto a resulting semantic space, and from the approach presented in Chapter 17, which used suffixal information to project words onto a resulting morphosemantic space. In contrast to both these methods, the feature extractor presented in Listing 14 focuses on two sources of information only: the identity of the word WL juxtaposed to the left side and of the word WR juxtaposed to the right side of the target word WX.

For example, the function call

    word_juxtaposition_featurefilter("this is a dog", "dog")

returns an array @features characterizing this concrete token of the word "dog" in terms of two features: -1a, END. In this case, the first feature encodes the fact that the token is preceded by the indefinite article "a" while the second feature encodes the fact that "dog" is the last token of the utterance. Similarly, the token "this" would be characterized by the features INIT, 1is; the token "is" would be characterized by the features -1this, 1a; and the token "a" would be characterized by the features -1is, 1dog.

Once each word of each utterance is characterized by its features, one follows a standard Random Indexing procedure (Sahlgren, 2005) in order to attribute to each distinct feature a distinct, randomly generated, 64-dimensional, sparsely non-zero "init" vector. Subsequently, the Euclidean representation of every word type WX is obtained as a sum (i.e. an unweighted linear combination) of the init vectors of the features with which WX is associated in the corpus. These Euclidean vectors are later normalized and enter the binarization procedure, which leads to concise 8-byte hashes having the property:

The more the words WX and WY tend to occur in similar contexts, the smaller the Hamming distance between LSB(WX) and LSB(WY) shall be.

It is, indeed, this property which shall potentially allow us to effectuate successful evolutionary searches within the H64 space, a space which could potentially be labeled as "morpho-syntactic".
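For readers unfamiliar with Random Indexing, the following sketch illustrates the init-vector summation just described. It is our own simplified illustration (with S = 3 non-zero components per init vector, c.f. the Seed parameter in Table 51), not the thesis code; normalization and the LSB binarization step are left out:

    my $DIM = 64;          # dimensionality of the target space
    my %feature_vec;       # feature => reference to its random "init" vector

    # sparse ternary init vector: S = 3 randomly placed +1/-1 components
    sub init_vector {
        my @v = (0) x $DIM;
        $v[ int rand $DIM ] = (rand() < 0.5 ? -1 : 1) for 1 .. 3;
        return \@v;
    }

    # Euclidean word vector = unweighted sum of its features' init vectors
    sub word_vector {
        my @features = @_;    # e.g. output of word_juxtaposition_featurefilter
        my @sum = (0) x $DIM;
        for my $f (@features) {
            $feature_vec{$f} //= init_vector();
            $sum[$_] += $feature_vec{$f}[$_] for 0 .. $DIM - 1;
        }
        return \@sum;         # to be normalized and binarized by LSB
    }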
end vector space preparation 18.3.1

18.3.2 bridging the sub-symbolic and symbolic realms

In order to better understand the model hereby presented, one needs to understand a certain distinction often implemented by proponents of evolutionary programming (Fogel, 1995) or of evolutionary strategies (Rechenberg, 1971). Id est, the distinction between the genotype and the phenotype.

Genotype
The information-encoding substrate potentially modifiable by variation and replication operators. Unambiguously translatable into a phenotype.
end genotype 18.3.2.0

Phenotype
The concrete manifestation of a specific genotype against which fitness can be evaluated. A distinct phenotype PX can potentially manifest multiple distinct genotypes.
end phenotype 18.3.2.0

Listing 15: Transcription of vector representations (genotypes) into regular expression phenotypes

     1   $regex = '';
     2   $extension = 1;
     3   for $component (0 .. $#genotype) {
     4       $cregex = '';
     5       $cextension = 0;
     6       $radius = $genotype_radius[$component];
     7       for $word (@all_words_in_corpus) {
     8           $word_hash = $word_hashes{$word};
     9           $word_hcategory_distance = hamming_weight($word_hash ^ $genotype[$component]);
    10           if ($word_hcategory_distance < $radius) {
    11               $cregex = $cregex ? $cregex . '|' . $word : '(' . $word;
    12               $cextension++;
    13           }
    14       }
    15       $regex .= $cregex . ')' if $cregex;
    16       $extension *= $cextension if $cextension;
    17   }
    18   $regex = '^' . $regex . '$';   # utterance-anchored

In the context of the current simulation, N-schemata (18.2.2) of length N = 4, i.e. 4-schemata, are to be understood as individual genotype instances. As is always the case in evolutionary computation, these schemata replicate, mutate, cross-over etc. But in order to get their fitness attributed, these genotypes have to be translated into phenotypes. Such a translation is realized by means of the procedure displayed in Listing 15.

The core idea of the genotype - phenotype translation is to be found on lines 9-11. On line 9, the Hamming distance between the hash of each among the N components of the candidate genotype N-schema and the hash of each word WX represented in the H64 vector space is evaluated. On line 10, the algorithm checks whether the obtained distance is smaller than the radius which is also included in the genotype. If yes, then the literal sequence of the signifiant of the word WX is injected into the resulting phenotype in such a way that the resulting phenotype is a syntactically correct Perl-Compatible Regular Expression (Wall et al., 1994; Hromada, 2011, 2016e).

In other terms, the code displayed in Listing 15 can be understood as a method of translation of sub-symbolic (feature-based) binary vector representations into symbolic representations known as regular expressions. For example, let's look at Table 49, which illustrates a small hypothetical CorpusMini containing only the words "that", "this", "it", "is" ... and their corresponding binary hashes [4].

Table 49: Words of a CorpusMini and hexadecimal representations of their potential hashes.

    word    hash   |   word    hash
    this    BABA   |   that    BABB
    it      BAAB   |   is      0F23
    are     0F11   |   a       C123
    the     C125   |   not     5FF5
    duck    7720   |   dog     7725

Then, if ever a 5-schema [5] like the one presented in Table 50 were identified by the evolutionary search, it would be translated into the regular expression:

Table 50: A candidate genotype which could potentially be induced from the hypothetical CorpusMini.

    syntagma    H1      H2      H3      H4      H5
    center      BABC    0F20    5FF0    C124    7723
    radius      17      5       7       3       7

    ^(this |that |it )(is )(not )(a |the )(dog|duck)$

which represents the microgrammar

    Utterance → Syn1 Syn2 Syn3 Syn4 Syn5
    Syn1 → this | that | it
    Syn2 → is
    Syn3 → not
    Syn4 → a | the
    Syn5 → dog | duck        (10)

potentially covering 12 distinct utterances [6]. It would, however, not match utterances of the sort "this are not the dog" because the Hamming distance between the word "are" and the centroid of the 2nd component is bigger than the radius of that very component (i.e. HD(LSB("are"), Centroid2) = HD(0F11, 0F20) = 9 > Radius2). In such a way, one can determine the exact form of a Perl-compatible regular expression (PCRE) by means of distance measurements in the underlying H64 space.

[4] As usual, 64-bit hashes are presented in hexadecimal format as sequences of four characters from the range 0-9A-F.
[5] In order to stay aligned with traditional linguistics, we shall sometimes use the term "syntagma" (resp. its abbreviated form "syn") as a synonym for the term "component".
[6] We shall further denote the maximal theoretical number of covered utterances with the term extension.
And given that PCREs are

1. strings of symbols which describe sets of strings of symbols
2. a sort of lingua franca of many engineers active in the domains of Natural Language Processing, data-mining or information retrieval
3. well-tuned and optimized by almost three decades of development by not only the Perl but also the C++, Python, or R communities
4. transparent to inspection by human examinators [7]

one can potentially start to see a certain usefulness in developing an architecture which can unambiguously transform sub-symbolic geometrized genotypes into comprehensible, symbolic, and manually modifiable PCRE phenotypes.

18.3.3 fitness function

The fitness of an N-schema NX is principally determined by two characteristics:

1. its extension E, or maximal theoretically possible sensitivity: a finite natural number representing the quantity (i.e. the cardinality of the set) of all utterances which could be matched by NX
2. its Corpus sensitivity Y: the number of utterances, present in the Corpus, which have been matched by NX

More formally: let's have an N-schema X composed of N H64-categories HX1, HX2, ..., HXN. Then X is said to have an overall extension E defined as the multiplicative product of the extensions of its individual categories:

    E_X = ∏_{k=1}^{N} I_{H_k}        (11)

whereby the individual extension I_{H_k} of the k-th category H_k is defined as the number of members of H_k, i.e. I_{H_k} = |H_k|, where |H_k| denotes the cardinality of the set of objects whose distance from the centroid h⃗_k is less than the radius of the category H_k. For example, the extension E of the 5-schema presented in Table 50 is 12 because I_{H1} * I_{H2} * I_{H3} * I_{H4} * I_{H5} = 3 * 1 * 1 * 2 * 2 = 12.

In contrast to E, which is more of an information-theoretic quantity, the sensitivity Y is a value which is always relative to a certain corpus:

    Y_X = |{ U ∈ Corpus : N_X matches U }|

This notion is further exemplified by the first line of the following listing.

Listing 16: Perl code behind the fitness function Fitness1

    use List::MoreUtils qw(true);   # true() counts elements for which the block holds

    my $sensitivity = true { /$regex/ } @corpus;
    # returns the number of utterances in @corpus matchable by $regex
    if ($sensitivity) {
        $f = ($sensitivity**2) / $extension;
    } else {
        $f = 0;
    }

Extension and sensitivity thus defined, the fitness value of the schema NX has been, for the purpose of the current simulation, defined as:

    Fitness_1(N_X) = (Y_X * Y_X) / E_X        (12)

The rationale behind our choice of this and not of another fitness function [8] is simple: given that we shall tend to maximize the fitness function, we put the extension into the denominator (i.e. the divisor) while putting the sensitivity into the numerator (i.e. the dividend). Thus aligned, it may be expected that the implementation of such a fitness function shall direct the evolutionary search towards schemata with both low extension as well as high sensitivity. For this reason, the sensitivity is squared in order to somewhat counteract the impact of the extension, which is itself a multiplicative product of the extensions of its components.

[7] Only 5 PCRE meta-characters are used in this article: ( denotes the beginning of a disjunctive group; ) denotes the end of a disjunctive group; | is a separator between two members of a disjunctive group; ^ denotes the beginning and $ the end of the expression.
[8] Many other fitness functions are, of course, possible and only very few of them have been tested. It cannot be excluded that more useful fitness functions shall be identified in the future. If not, then the fitness function Fitness1 hereby defined could potentially be thought of as an expression of a certain cognitive law. Such conjectures, however, would bring us too far.
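To see Fitness1 at work, consider the following toy computation of ours, in the spirit of the Table 50 example (the corpus and the consistently spaced variant of the phenotype regex are our own illustrations):

    # Toy corpus; the last utterance lacks "not " and is thus not matched.
    my @corpus = (
        "this is not a dog",
        "that is not a duck",
        "it is not the dog",
        "this is a dog",
    );
    my $regex = qr/^(this |that |it )(is )(not )(a |the )(dog|duck)$/;

    my $extension   = 3 * 1 * 1 * 2 * 2;              # E = 12 (Eq. 11)
    my $sensitivity = grep { /$regex/ } @corpus;      # Y = 3
    my $fitness     = $sensitivity
                    ? $sensitivity**2 / $extension    # Fitness1 = Y^2 / E (Eq. 12)
                    : 0;
    printf "E=%d Y=%d Fitness1=%.2f\n", $extension, $sensitivity, $fitness;
    # prints: E=12 Y=3 Fitness1=0.75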
18.3.4 evolutionary strategy

The INDUCTOR1 evolutionary algorithm implemented in this simulation is similar to the algorithm CANONIC presented in 17.3.2. The tournament operator is used as the main and only method of selection of fit individuals from the population into the mating pool. The size of the mating pool is equal to the population size and mutations of centroid coordinates are equivalent to "bit flipping". There exist, however, certain important differences which distinguish the algorithm hereby presented from CANONIC:

1. implementation of the phenotype-genotype distinction
2. evolution of both centroid coordinates as well as category radii
3. the zeroth population is not generated in a purely random fashion
4. crossover occurs only at specific locations
5. a re-focusing strategy is implemented

Taken together, these differences result in an algorithm endowed with certain characteristics of an evolutionary strategy (Rechenberg, 1971) or of evolutionary programming (Fogel, 1995).

18.3.5 evolution of both centroids and radii

As has already been indicated, the individual solutions identified by INDUCTOR1 are essentially nothing else than 4-schemata. That is, binary vectors which encode a syntagmatic sequence of four H64-categories. Given that an H64-category is defined in terms of both its center as well as its radius, INDUCTOR1 tries to identify not only the most optimal coordinates of a category's centroid (as was the case in Chapter 17), but also the most optimal "extension", which is principally represented by H's radius. Information about the radius of each category is thus also part of the chromosome and is encoded as an integer value from the range ⟨0, ∆⟩. The probability of mutation of a radius-encoding gene is 0.2%. If subjected to mutation, the radius is either decremented or incremented by 1: this corresponds to the category becoming less, resp. more exhaustive.

18.3.6 pseudo-random initialization of the 0th population

Every single individual of the initial population of N-schemata is generated as follows:

1. choose a random word W1 occurring in the corpus and retrieve its geometric coordinates w⃗1
2. define w⃗1 as the center of the first category H1
3. choose a random word W2 occurring in the corpus and retrieve its geometric coordinates w⃗2
4. define w⃗2 as the center of the second component H2
5. ...
6. choose a random word WN occurring in the corpus and retrieve its geometric coordinates w⃗N
7. define w⃗N as the center of the last syntagmatic component HN

Subsequently, a radius which is neither too big nor too small is attributed to each among the N components. In the case of INDUCTOR1, the radius was set to the value 13 [9] which, in the context of a 64-dimensional Hamming space, seems to denote a distance which is neither too small nor too big.

Thus, contrary to the ex nihilo initialization of CANONIC, which started the induction process from randomly generated positions of all centroids, INDUCTOR1's initial 0th population is only partially random. This is so because, at the end of the initialization process, the center of each component of every individual N-schema is identical with the position of a certain word present in the Corpus [10]. A sketch of this procedure follows.

[9] A big radius results in a big extension of the corresponding category and hence in many false positives. A small radius causes the category to have a small extension and hence to potentially miss many true positives.
[10] Such an approach significantly boosts the inductive process, which could otherwise have certain difficulties in booting itself up.
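The following is our own minimal sketch of this initialization (variable names such as %word_hashes and @all_words_in_corpus are assumed, echoing Listing 15; this is not the simulation's actual code):

    # Each centroid is seeded with the hash of a randomly chosen corpus word;
    # every radius starts at 13.
    sub init_individual {
        my ($n_components, $words, $hashes) = @_;   # $hashes: word => 64-bit hash
        my (@centroids, @radii);
        for (1 .. $n_components) {
            my $word = $words->[ int rand @$words ];   # random corpus word
            push @centroids, $hashes->{$word};         # its position becomes a center
            push @radii, 13;                           # neither too small nor too big
        }
        return { centroids => \@centroids, radii => \@radii };
    }

    # a population of 100 pseudo-random 4-schemata:
    my @population = map { init_individual(4, \@all_words_in_corpus, \%word_hashes) } 1 .. 100;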
18.3.7 locus-constrained cross-over

INDUCTOR1's cross-overs took place only at specific loci: namely at positions 64, 128 and 192 of the chromosome specifying the centers of the diverse G-categories. In more practical terms, such a design choice assured that the information specifying all coordinates of a G-category of the parent individual X was substituted by the information specifying all coordinates of another G-category encoded in another parent individual Y.

This distinction aside, the usage of cross-over in the INDUCTOR1 strategy has been fairly standard: every individual of a new generation was obtained as the result of a cross-over between two randomly chosen members of the mating pool.

18.3.8 re-focusing strategy

Another particular aspect is related to INDUCTOR1's ability to prioritize, with every new run, the induction of new schemata. In practice, this is attained by starting every new run with the execution of the code presented in Listing 17.

Listing 17: Perl code behind the re-focusing strategy

    @corpus = grep { !/$previous_fittest_schema/ } @corpus;

Literally speaking, this line of code removes from the corpus all utterances matched by the fittest N-schema of the previous run. This results in a gradual shrinking of the size of the corpus against which the fitness of all future candidate schemata shall be evaluated. In more general terms, the re-focusing strategy orients the process towards the inference of schemata from such utterances from which no schema has yet been induced [11]. And said in more "cognitive" terms, the algorithm invests more attention into the exploration of structural regularities within data which have not yet been explored. The resulting outer loop is sketched below.

[11] An inductive process lacking the re-focusing strategy would often "lock" itself onto the most salient patterns present in the corpus, which would result in distinct runs often converging to similar schemata.
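The outer loop implied by this strategy can be sketched as follows (evolve_one_run is a hypothetical stand-in for one complete evolutionary run returning the fittest phenotype regex; this is our own illustration, not the thesis code):

    my @working_corpus = @corpus;
    my @microgrammar;                       # collection of induced schemata
    for my $run (1 .. 100) {
        # one full evolutionary run over the (shrinking) working corpus
        my $fittest_regex = evolve_one_run(\@working_corpus);
        push @microgrammar, $fittest_regex;
        # re-focusing: covered utterances leave the working corpus
        @working_corpus = grep { !/$fittest_regex/ } @working_corpus;
        last unless @working_corpus;        # stop once everything is covered
    }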
Figure 32: Data flow among the main components of INDUCTOR. Lime color denotes components related to evolutionary optimization, royal blue color denotes components of the preliminary VSP phase.

18.4 simulation

The simulation presented in this section has implemented the evolutionary strategy INDUCTOR in order to induce sets of regexp-like rules from four-word English utterances contained in the CHILDES corpus. A diagram elucidating the relations between the main INDUCTOR components is visible on Figure 32. The simulation was invoked twice, once in a 64-dimensional space (INDUCTOR64) and once in a 128-dimensional space (INDUCTOR128). The vector space preparation phase (c.f. Section 18.3.1) yielded a vector space in which all subsequent INDUCTOR runs took place. Each among the 2 * 100 distinct runs of INDUCTOR was initialized by a pseudo-random generation of the zeroth population.

18.4.1 corpus

This article is conceived as a part of a dissertation addressing the possibility of developing evolutionary models of the induction of linguistic rules in (and by) human children. This makes the choice of the corpus quite straightforward: the corpus from which we shall aim to extract first linguistic categories is to be contained in the Child Language Data Exchange System (CHILDES, (MacWhinney and Snow, 1985)). Inspired by the "less is more" hypothesis (Elman, 1993), the input corpus used in the simulation hereby presented consisted of 1047 four-word "motherese" [12] utterances extracted from the English section of CHILDES [13]. No other data has been used to guide the inductive process.

[12] In CHILDES, lines containing motherese utterances begin with the marker *MOT.
[13] Available at http://wizzion.com/thesis/simulation3/utterances.4

18.4.2 parameters

Table 51: Parameters of the diverse components of the INDUCTOR algorithm.

    VSP                 Input corpus          CHILDES-English
                        Feature filter        word_juxtaposition
                        Dimensionality        ∆ = 64 or ∆ = 128
                        Seed                  S = 3
                        Reflections           I = 0
    INDUCTOR            Population size       N = 100
                        Selection             Tournament
                        Crossover             One-point
                        Mutation rate         M = 0.2%
                        Initial population    pseudo-random
                        Generations           G = 100
                        Elitism               E = 0
    Machine Learning    Runs                  R = 100
                        Syntagms              N = 4

18.5 observations

?? lists 100 regexp-like rules which have been evaluated as "fittest" at the end of the distinct INDUCTOR runs which took place in an H64 space. These hundred rules match 176 of the 1047 utterances present in the input corpus (16.8%). ?? lists 100 regexp-like rules which have been evaluated as "fittest" at the end of the distinct INDUCTOR128 runs. These runs took place in an H128 space. These hundred rules match 176 of the 1047 utterances present in the input corpus (15.8%).

As marked in both Appendices by the token GENERAL, INDUCTOR was also able to identify many completely grammatical 4-schemata which are able to accept (or generate) even utterances which have not been present in the input corpus. Such a generalization faculty was observed in 82% of the resulting individuals in the case of H64 and in 77% of the individual 4-schemata induced in H128. Among the individuals induced in H64, 32 have been manually evaluated as ALLGOOD, id est capable of accepting|generating only grammatically correct utterances of the English language. For example, the most fit schema of the sixth run of INDUCTOR64:

    ^(that )(is )(a )(bag|banana|basket|bridge|cherry|cow|gate|horse|kleenex|motorcycle|puzzle|rabbit|raccoon|shoe|spoon|story|timer|tractor)$

is able to accept|generate 18 grammatically correct English utterances in spite of the fact that only 5 among these 18 sentences have been explicitly present in the input corpus.

Excessive over-regularization was observed in the case of 21 individuals willing to accept|generate at least one WRONG utterance. Asides this, 4-schemata issued from 28 runs of INDUCTOR64 have been marked as DISPUTABLE. That is, as capable of accepting|generating utterances which would be classified as "ungrammatical" by an orthodox grammarian, but could nonetheless occur in real-life usage. These border cases include utterances like:

    where is the clever (individual 9)
    what are we joey (individual 18)
    there is what one (individual 34)
    there does he go (individual 55)
    what are you joey (individual 83)
    oh what is i (individual 87)

as well as utterances which are syntactically correct, but semantically doubtful:

    oh you are strawberries (individual 63)
    oh you are fries (individual 63)
    okay that is thumb (individual 91)

et caetera, et caetera. In the case of INDUCTOR128, 37 induced 4-schemata have been manually evaluated as ALLGOOD and 17 as DISPUTABLE.
18.5.1 diachronic observations

A deeper, time-oriented inspection of the processes taking place during individual runs can also be of certain interest.

Listing 18: First exemplar of a non-monotonic ontogenetic trajectory

    #ITERATION 30 FITNESS 1.333333
    ^(do )(you )(like )(candy|some|strawberries)$
    #ITERATION 40 FITNESS 1.14285714285714
    ^(do )(you )(like )(bananas|box|candy|cover|fell|ketchup|nana|not|papa|popsicles|some|sorry|strawberries|tired)$
    #ITERATION 50 FITNESS 1.8
    ^(do )(you )(like )(box|candy|ketchup|some|strawberries)$

On Listing 18 it may be seen that after 30 iterations, INDUCTOR1 had identified a 4-schema able to accept|generate the utterances "do you like candy", "do you like some" and "do you like strawberries". However, this schema was lost in the following 10 generations and the fitness fell from 1.33 to 1.14 [14]. Hence, an over-regular schema gained in prominence which was able to accept even such constructs as "do you like sorry" or "do you like tired". But in the following ten generations, the population dynamics of the whole system not only led to the correction of the previous errors, but even brought about an increase in fitness to 1.8, which went hand in hand with the schema's ability to match utterances like "do you like box" or "do you like ketchup".

Another run, presented on Listing 19, also exemplifies such non-monotonic, error-correcting aspects of the INDUCTOR1 algorithm. As may be seen, the incorrect utterance "what is he going" was acceptable by the fittest individual of the 40th and 50th iterations. This was corrected in the 60th generation, but further development brought about yet another batch of mistakes: utterances like "what is he cute" and "what is he share" were thus acceptable by the most fit individual of the 80th generation. This has been subsequently corrected and the run terminated, after 100 generations, with a GENERAL, ALLGOOD 4-schema.

Listing 19: Second exemplar of a non-monotonic ontogenetic trajectory

    #ITERATION 30 FITNESS 1.33333333333333
    ^(what )(is )(he )(doing|playing|saying)$
    #ITERATION 40 FITNESS 1.8
    ^(what )(is )(he )(doing|going|holding|playing|saying)$
    #ITERATION 50 FITNESS 1.5
    ^(what )(is )(he )(doing|drinking|going|holding|playing|saying)$
    #ITERATION 60 FITNESS 1.8
    ^(what )(is )(he )(doing|drinking|holding|playing|saying)$
    #ITERATION 70 FITNESS 2.25
    ^(what )(is )(he )(doing|holding|playing|saying)$
    #ITERATION 80 FITNESS 2.28571428571429
    ^(what )(is )(he )(called|cute|doing|holding|playing|saying|share)$
    #ITERATION 90 FITNESS 1.5
    ^(what )(is )(he )(doing|drinking|going|holding|playing|saying)$
    #ITERATION 100 FITNESS 2.25
    ^(what )(is )(he )(doing|holding|playing|saying)$

[14] This is, of course, due to the fact that INDUCTOR1 does not implement any form of elitism which would safeguard the fittest individuals from destructive variations.
18.6 conclusion

Almost one third (32%) of the 4-schemata - identified by INDUCTOR1 sweeping a 64-dimensional Hamming space representing 1047 English "motherese" utterances - produce only correct generalizations. The collection of all induced N-schemata yields what we call a "microgrammar". Such a microgrammar is more a construction-based (Fillmore et al., 1988; Lakoff, 1990) or usage-based (Tomasello, 2009) grammar than a grammar in the sense of Formal Language Theory (P+117-122) or in the sense commonly accepted by proponents of the generativist doctrine (Chomsky, 2002). But given that such a microgrammar (c.f. ??) is capable of generating more syntactically correct utterances than those which had been presented through the training corpus, one can still consider it to be, in certain regards, modestly generative.

We say "modestly" because the generative faculty is kept on a leash by evolution's tendency to discard such schemata which would be either too concrete (i.e. have low sensitivity Y) or too exhaustive (i.e. have high extension E). Hence, the thorny problem of over-generalization is - at least in the case of an algorithm implementing the INDUCTOR1 evolutionary strategy - not resolved by any a priori knowledge embedded in some kind of Chomskyan "Universal Grammar". Far from it: we propose to depart from the idea that grammar-inducing agents are not "ideal learners" in the sense of Gold's Theorem (Gold, 1967; Johnson, 2004). On the contrary: the process of grammar induction can only fully succeed if some information-encoding representations are, sometimes, irreversibly forgotten or subjected to variation. In this article, variation was attained by operators which:

1. mutate the coordinates of the centers of syntagmatic G-categories
2. mutate the radii of syntagmatic G-categories (i.e. increase or decrease a category's extension)
3. substitute G-categories from one N-schema with G-categories from another N-schema (i.e. locus-constrained crossover)

By causing these operators to perform their operations in a sub-symbolic vector space, and by evaluating the results of their activities on a symbol-sequence level, one can obtain a system able to induce simple 4-schema microgrammars from a simplified corpus of English "motherese" utterances which are four words long. This [15], however, is only the beginning.

[15] Proof-of-concept source code of this simulation is available at URL http://wizzion.com/thesis/simulation3/EGI.tgz under the mrGPL license.

18.7 general discussion

There is an appealing symmetry in the notion that the mechanisms of natural learning may resemble the processes that created the species possessing those learning processes.
— D.E. Goldberg and J. Holland

More generally and beyond syntax, the operators implemented in the 3rd simulation can be associated with the following psychological phenomena:

1. mutation of an N-schema - synaptic pruning (P+38), information decay, forgetting etc.
2. crossover between two N-schemata - related to creativity, dreaming (P+89-90) and phantasia

Other variation operators - corresponding to certain forms of

1. playing certain language games (Wittgenstein, 1953; Nowak et al., 1999), or "intrapsychic" (Brams, 2011) games
2. imitating certain phenomena observed in the linguistic behavior of human children (P+184-204)

could also be deployed.

Another subsequent enhancement of the GI method hereby introduced could potentially result from the introduction of additional feature sets. For example, one could take a fit N-schema X, decompose it into its component G-categories G1, G2, ..., GN and, if ever a certain component G-category Gα turns out to be disjunctive, enrich the vectorial representations of all its members with the information that they belong to Gα. For example, one could enrich the vectorial representations of the tokens "doing", "holding", "playing", "saying" with the information that they turned out to be subsumed under a G-category present in one quite fit 4-schema (c.f. Listing 19). And one could enrich the vectorial representations of the tokens "ketchup", "strawberries" etc. with the information that these tokens turned out to be subsumed under yet another G-category present in another schema (c.f. Listing 18). A sketch of such an enriched feature filter follows.
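A minimal sketch of our own (the category name and the %category_of map are hypothetical), reusing the feature filter of Listing 14:

    # Words known to belong to an induced disjunctive G-category receive an
    # additional feature naming that category: induced syntagms thus feed
    # the next round of geometrization.
    my %category_of = map { $_ => 'G_doing' } qw(doing holding playing saying);

    sub feedback_featurefilter {
        my ($utterance, $word) = @_;
        my @features = word_juxtaposition_featurefilter($utterance, $word);
        push @features, 'CAT' . $category_of{$word}
            if exists $category_of{$word};
        return @features;
    }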
Note that the introduction of such feature sets could be interpreted as the introduction of a feedback loop into the system. The essence of such a system could thus be considered to be not only linguistic, but also cybernetic (Wiener, 1961; Lorenz, 1973). It could be postulated that the introduction of such a feedback, bootstrapping (Hromada, 2014b; Karmiloff and Karmiloff-Smith, 2009, pp.111-118) loop into the system would not only result in the identification of more complex microgrammars, but would also cause the system to follow ontogenetic trajectories similar to those of children who undergo the so-called syntagmatic-paradigmatic shift (Nelson, 1977).

Figure 33: Data flow among the main components of an extended variant of INDUCTOR introducing a syntagmatic-paradigmatic feedback loop.

All such operators, features and feedback loops taken together, and coupled with

1. the fact that the brain (P+5) is a finite material object with finite resources which is subjected to the 2nd law of thermodynamics (P+7)
2. the fact that the linguistic input which the child receives is preprocessed by loving (P+241) and caring computational oracles (Turing, 1939; Clark, 2010) like mothers, fathers, care-takers etc.
3. the fact that the acquisition of language takes place in informationally very rich, contextually grounded, usage-based scenarios (Tomasello, 2009)

one cannot exclude that a sort of evolutionary, ecological, equilibrium-seeking process indeed takes place in the mind of a modal healthy language-acquiring toddler. And given that certain high-profile developmental linguists terminate their inquiry concerning the informatic properties of the language input with the conclusion that «internal mechanisms are necessary to account for the unlearning of ungrammatical utterances» (Marcus, 1993), we allow ourselves to conclude with the suggestion that the internal mechanism which Marcus mentions is, in reality, not a sort of universal grammar (P+98-101) black-box but instead a potentially "general cognitive process" (P+101, (Piaget, 1974)) whose very essence is to discard that which is non-functional: Evolution (P+3).

Part V

SUMMA

19 SUMMA SUMMARUM

The natural selection paradigm of such knowledge increments can be generalized to other epistemic activities, such as learning, thought and science.
— D.T. Campbell

The objective of this dissertation was to provide computational evidence for the "operational thesis" (P+20):

«Learning of toddlerese can be successfully simulated by means of an evolutionary algorithm processing textual representations of motherese.»

Given that

• the third simulation used no other input than the plain-text corpus of motherese utterances

and given that

• the third simulation resulted in the identification of schemata able to generate grammatically correct utterances which have not been present in the initial corpus

one may, in Popperian fashion, consider the "operational thesis" as temporarily unfalsified.
In this sense, we consider any future effort to falsify or verify "the softest thesis" (P+17-19):

«Ontogeny of toddlerese can be successfully simulated by means of evolutionary computation.»

as an effort worthy of interest. It is worth noting in this regard that certain notions, like that of a 4-schema or of a morphosemantic class, are not to be considered as ultimate elements of some sort of ewige Theorie but rather as temporary, limited building blocks of an architecture which is to be surpassed.

Surpassed by what? Maybe by models which introduce not only 4-schemata but also 2-schemata, 3-schemata, 5-schemata ... N-schemata. Or by procedures which integrate semantic, morphological and syntactic spaces within a single "linguistic" space SL. Given what we have seen until now, it cannot be a priori excluded that the results of certain types of evolution-inspired simulations taking place within such an SL would turn out to be consistent with "the softer hypothesis" (P+14-16) which states that

«learning of natural language can be successfully simulated by means of evolutionary computation»

But when speaking about optimization taking place within a linguistic space SL, shouldn't it also be possible to speak about optimization taking place within an even more generic space SG? For nothing prohibits that the category-inducing methods hereby introduced could be used to induce classifiers of partially or even fully non-linguistic entities. For example, a research project stemming from this dissertation may potentially explore the extent to which the evolutionary search for prototypes could be useful in Computer Vision: the only thing which would be fundamentally different would be the essence of the input entities (i.e. images and not texts) and of the features occurring in such entities (e.g. Haar features (Hromada, 2010c; Hromada et al., 2010) or others). In fact, nothing forbids the use of one among the three CI models hereby introduced whenever one needs to perform:

1. multiclass classification of entities (exemplified by the "supervised" simulations 1 and 2)
2. induction of rules from a positive corpus only (exemplified by the "unsupervised" simulation 3)

In other terms, the combination of the "vector space" and "evolutionary computation" components can be understood as a "generic optimization toolbox" (GOT) which could potentially be applied upon any set of features. It is, however, primarily the nature of the input corpus and the nature of the features which are extracted from the corpus which should most closely determine the nature of the categorization-performing agent thus induced. Hence, when applied upon data-sets describing "spatial" trajectories within a group of "labyrinths", one could aspire to induce rules allowing a certain robot, a certain automatized vehicle, or a certain sort of embedded artificial classifier system (Booker et al., 1989) to find its way out of a "labyrinth" it never saw before. Or - if one were to depart from so-called "morally relevant features" (Hromada and Gaudiello, 2014) - one could even hope to simulate the ontogeny of categories and rules of a somewhat different kind. That is, of categories and rules which are commonly labeled as "aesthetic" (i.e. beautiful / ugly), "moral" (i.e. good / bad), or "deontologic" (i.e. forbidden / allowed) (Hromada, 2016f).
Asides the "linguistic", "visual", "spatial" or "moral", the implementation of EML GOTs in the induction of other types of intelligence (Gardner, 2011) or of their combinations (Karpathy and Fei-Fei, 2015) in artificial agents and robots is also a task to be explored. If successful, it cannot be excluded that such explorations would potentially bring scientific and engineering communities one step closer to the deployment of metamodular (Hromada, 2012a) artificial agents able to:

1. integrate (Tononi, 2004) multi-modal (i.e. linguistic, visual, proprioceptive etc.) information
2. use a nature-inspired, evolutionary computational core to identify the most fit groupings of such information

By doing so, an ultimate ex computatio atque simulatio proof of the "soft thesis" (P+11-13):

«learning can be successfully simulated by means of evolutionary computation»

could, potentially, be given. To offer such a proof, however, is a task which by far surpasses the limits of any individual researcher. What is more, alternative machine learning paradigms (e.g. deep learning) currently predominate and it may be the case that the popularity of such approaches decreases the amount of attention which could - and should - be focused on the exploration of common grounds between computational models of learning and computational models of evolution.

Let's now enumerate certain advantageous properties of the evolutionary machine learning (EML) models which have been presented in simulations one, two and three. These EML models are:

1. functional: the function of the model is principally determined by the choice of the fitness function and of the selection/variation operators
2. alternative: at any moment TX, the learning system contains multiple alternative solutions of the problem (P+8-10)
3. population-based: the behavior of the learning system can be interpreted in terms of population dynamics (P+116)

Contrary to these, connectionist models are more "structural" than "functional"; they do not explicitly encode representations of diverse solutions and their convergence towards optimal states is more easily interpretable in terms of the differential "gradient descent" of "backpropagation" than in terms of population dynamics. What's more, by coupling the notion of evolution with that of a vector space, and by implementing a fairly trivial phenotype - genotype transcription (Section 18.3.2), one can obtain unsupervised EML models

1. bridging the sub-symbolic (vectorial) and symbolic (regexps and grammars) realms
2. transparent to investigation and modulation by a human investigator (i.e. easy to interpret and to teach)

Note that the property of being transparent to investigation and modulation is not a property which should be taken à la légère. For it could result in the creation of an inter-subjective bond between the artificial system which is being (investig|modul)ated and the human who (investig|modul)ates. In other terms, it could potentially result in the emergence of entities of non-organic origin who could, and should, be considered not only as objects of machine-learning but also as subjects of machine-teaching. Such considerations, however, bring us further than paradigms like machine learning or even computer science could ever bring us.
Such considerations bring us towards the meta-paradigm [1] of paedagogy and didactics (Komenský et al., 1991), which alone can demonstrate the validity and usefulness of the Theory of Intramental Evolution (Hromada, 2015). Such considerations bring us towards such regions of SG whereby the very "hard thesis" (P+2-10)

«Learning is a form of evolution»

could be evaluated as valid.

Valid or not, nothing forbids the sign-manipulating [2] mind (P+1) to realize a transposition (P+190-192) which savants like Bateson (Bateson, 2006) once realized. That is, a transposition between two terms, each of which denotes one big stochastic system; a transposition between "Mind" and "Nature"; a transposition which obliges one to state:

«Evolution is a form of learning [3]»

Such is, indeed, the ultimate result of the dissertation with which we aspire for the attribution of the title Philosophiae Doctor in both cybernetics as well as cognitive psychology. Such is, indeed, the result of the work commenced by the two words forming the "initial thesis" (P+1):

«Mind Evolves»

* **

[1] A scientific paradigm (Kuhn, 2012) transfers knowledge about a certain field of study. A scientific meta-paradigm transfers knowledge concerning the transfer of knowledge.
[2] «Thinking is essentially the activity of operating with signs.» (Wittgenstein, 1934)
[3] Lorenz (1973) states that the principal difference between learning and evolution is the ability of a learning system to "learn from one's own errors". A system which learns is supposed to have such an ability while a system which "only" evolves does not. But is it really always the case?

BIBLIOGRAPHY

Adler, A. (1976). Connaissance de l'homme. Payot.

Amancio, D. R., Altmann, E. G., Rybski, D., Oliveira Jr, O. N., and Costa, L. d. F. (2013). Probing the statistical properties of unknown texts: application to the voynich manuscript. PloS one, 8(7):e67310.

Ambridge, B., Theakston, A. L., Lieven, E. V., and Tomasello, M. (2006). The distributed learning effect for children's acquisition of an abstract syntactic construction. Cognitive Development, 21(2):174–193.

Araujo, L. (2002). Part-of-speech tagging with evolutionary algorithms. In Computational Linguistics and Intelligent Text Processing, pages 230–239. Springer.

Aristotle (-335 BC). Poetics: On Comedy. Unknown.

Aristotle (342 BC). On Coming-to-be & Passing-away. At the Clarendon Press.

Atkinson, Q. D. and Gray, R. D. (2005). Curious parallels and curious connections—phylogenetic thinking in biology and historical linguistics. Systematic biology, 54(4):513–526.

Augustine, S. (1838). Confessions. Book I.

Aurenhammer, F. (1991). Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR), 23(3):345–405.

Aycinena, M., Kochenderfer, M. J., and Mulford, D. C. (2003). An evolutionary approach to natural language grammar induction. Final Paper, Stanford CS224N, June.

Baixeries, J., Elvevåg, B., and Ferrer-i Cancho, R. (2013). The evolution of the exponent of zipf's law in language ontogeny. PloS one, 8(3):e53227.

Bandura, A. and McClelland, D. C. (1977). Social learning theory.

Barrett, D. (2007). Waistland: A (R)evolutionary View of Our Weight and Fitness Crisis. WW Norton & Company.

Barrett, M. D. (1978). Lexical development and overextension in child language. Journal of child language, 5(02):205–219.

Bateson, G. (2006). Mind and nature: A necessary unity (advances in systems theory, complexity, and the human sciences).

Bee, H. L. and Boyd, D. R. (2000).
The developing child. Allyn and Bacon, Boston.

Bellegarda, J. R. (2005). Unsupervised, language-independent grapheme-to-phoneme conversion by latent analogy. Speech Communication, 46(2):140–152.

Bentley, P. (1999). Evolutionary design by computers. Morgan Kaufmann.

Best, K.-H. (2006). Quantitative linguistik: Eine annäherung. 3., stark überarbeitete und ergänzte auflage.

Blackmore, S. (2000). The meme machine. Oxford University Press.

Booker, L. B., Goldberg, D. E., and Holland, J. H. (1989). Classifier systems and genetic algorithms. Artificial intelligence, 40(1):235–282.

Borges, J. L. (1952). El idioma analítico de john wilkins. Otras inquisiciones, pages 158–159.

Braine, M. D. (1971). On two types of models of the internalization of grammars. The ontogenesis of grammar, pages 153–186.

Braine, M. D. and Bowerman, M. (1976). Children's first word combinations. Monographs of the society for research in child development, pages 1–104.

Brams, S. J. (2011). Game theory and the humanities: bridging two worlds. MIT Press.

Brighton, H., Kirby, S., and Smith, K. (2003). Situated cognition and the role of multi-agent models in explaining language structure. In Adaptive agents and multi-agent systems, pages 88–109. Springer.

Broca, P. (1861). Remarque sur le siège de la faculté du langage articulé, suivie d'une observation d'aphémie (perte de la parole). Bulletin de la Société anatomique de Paris, 36:330–356.

Brodsky, P., Waterfall, H., and Edelman, S. (2007). Characterizing motherese: On the computational structure of child-directed language. In Proceedings of the 29th Cognitive Science Society Conference, ed. DS McNamara & JG Trafton, pages 833–38.

Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.

Brown, R. (1958). Words and things.

Brown, R. (1973). A first language: The early stages. Harvard U. Press.

Bruner, J. S. and Watson, R. (1983). Child's talk: Learning to use language. Oxford University Press, Oxford.

Bryant, E. F. (2009). The yoga sutras of patanjali.

Buber, M. (1937). I and thou. Clark, Edinburgh.

Campbell, D. T. (1960). Blind variation and selective retentions in creative thought as in other knowledge processes. Psychological review, 67(6):380.

Campbell, D. T. (1974). An essay on evolutionary epistemology. The philosophy of Karl Popper, pages 413–463.

Champollion, J. F. (1822). Observations sur l'obélisque Égyptien de l'Ile de Philae.

Chomsky, N. (1957). Syntactic structures. Mouton.

Chomsky, N. (1959). A review of bf skinner's verbal behavior. Language, 35(1):26–58.

Chomsky, N. (1995). The minimalist program, volume 28. Cambridge Univ Press.

Chomsky, N. (2002). Syntactic structures. Walter de Gruyter.

Christodoulopoulos, C., Goldwater, S., and Steedman, M. (2010). Two decades of unsupervised pos induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 575–584. Association for Computational Linguistics.

Clark, A. (2010). Distributional learning of some context-free languages with a minimally adequate teacher. In Grammatical Inference: Theoretical Results and Applications, pages 24–37. Springer.

Clark, E. (1987). The principle of contrast: A constraint on language acquisition. Mechanisms of language acquisition, pages 1–33.

Clark, E. V. (2003). First Language Acquisition. Cambridge University Press.

Cohen, T., Schvaneveldt, R., and Widdows, D. (2010). Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.

Cohen, T., Widdows, D., Schvaneveldt, R. W., Davies, P., and Rindflesch, T. C. (2012). Discovering discovery patterns with predication-based semantic indexing. Journal of biomedical informatics, 45(6):1049–1065.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3):273–297.

Cosmides, L. and Tooby, J. (1997). Evolutionary psychology: A primer. Evolutionary Psychology: a primer.

Currier, P. (1970–1976). Voynich ms. transcription alphabet; plans for computer studies; transcribed text of herbal A and B material; notes and observations. Unpublished communications to John H. Tiltman and M. D'Imperio, Damariscotta, Maine.

Darwin, C. (1859). The Origin of Species. J. Murray.

Darwin, C. and Bettany, G. T. (1890). Journal of researches into the natural history and geology of the countries visited during the voyage of HMS "Beagle" round the world: under the command of Capt. Fitz Roy, RN. Ward, Lock.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262. ACM.

Dawkins, R. (1976). The selfish gene. Oxford University Press, Oxford.

De Chardin, P. T., Wall, B., et al. (1965). The phenomenon of man, volume 383. Harper & Row, New York, NY, USA.

de Saussure, F. (1916). Cours de la linguistique générale.

DeCasper, A. J. and Spence, M. J. (1986). Prenatal maternal speech influences newborns' perception of speech sounds. Infant behavior and Development, 9(2):133–150.

Dehaene, S. and Changeux, J.-P. (1989). A simple model of prefrontal cortex function in delayed-response tasks. Journal of Cognitive Neuroscience, 1(3):244–261.

Dennett, D. C. (1995). Darwin's dangerous idea. The Sciences, 35(3):34–40.

Devescovi, A., Caselli, M. C., Marchione, D., Pasqualetti, P., Reilly, J., and Bates, E. (2005). A crosslinguistic study of the relationship between grammar and lexical development. Journal of Child Language, 32(04):759–786.

d'Imperio, M. E. (1978). The voynich manuscript: an elegant enigma. Technical report, DTIC Document.

Dubremetz, M. (2013). Vers une identification automatique du chiasme de mots. TALN-RÉCITAL 2013, page 150.

Dupont, P. (1994). Regular grammatical inference from positive and negative samples by genetic search: the gig method. In Grammatical Inference and Applications, pages 236–245. Springer.

Edelman, G. M. (1987). Neural Darwinism: The theory of neuronal group selection. Basic Books.

Elbers, L. and Ton, J. (1985). Play pen monologues: the interplay of words and babbles in the first words period. Journal of Child Language, 12(03):551–565.

Ellis, R. and Wells, G. (1980). Enabling factors in adult-child discourse. First Language, 1(1):46–62.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99.

Erjavec, T. (2004). Multext-east version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In LREC.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., Pethick, S. J., Tomasello, M., Mervis, C. B., and Stiles, J. (1994). Variability in early communicative development. Monographs of the society for research in child development, pages i–185.

Ferguson, C. A.
Cohen, T., Schvaneveldt, R., and Widdows, D. (2010). Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256. bibliography Cohen, T., Widdows, D., Schvaneveldt, R. W., Davies, P., and Rindflesch, T. C. (2012). Discovering discovery patterns with predication-based semantic indexing. Journal of biomedical informatics, 45(6):1049–1065. Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3):273–297. Cosmides, L. and Tooby, J. (1997). Evolutionary psychology: A primer. Evolutionary Psychology: a primer. Currier, P. (1970). 1976." voynich ms. transcription alphabet; plans for computer studies; transcribed text of herbal a and b material; notes and observations.". Unpublished communications to John H. Tiltman and M. D’Imperio, Damariscotta, Maine. Darwin, C. (1859). The Origin of Species. J. Murray. Darwin, C. and Bettany, G. T. (1890). Journal of researches into the natural history and geology of the countries visited during the voyage of HMS" Beagle" round the world: under the command of Capt. Fitz Roy, RN. Ward, Lock. Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262. ACM. Dawkins, R. (1976). The selfish gene. Oxford University Press Oxford. De Chardin, P. T., Wall, B., et al. (1965). The phenomenon of man, volume 383. Harper & Row New York, NY, USA:. de Saussure, F. (1916). Cours de la linguistique générale. DeCasper, A. J. and Spence, M. J. (1986). Prenatal maternal speech influences newborns’ perception of speech sounds. Infant behavior and Development, 9(2):133–150. Dehaene, S. and Changeux, J.-P. (1989). A simple model of prefrontal cortex function in delayed-response tasks. Journal of Cognitive Neuroscience, 1(3):244–261. Dennett, D. C. (1995). Darwin’s dangerous idea. The Sciences, 35(3):34– 40. Devescovi, A., Caselli, M. C., Marchione, D., Pasqualetti, P., Reilly, J., and Bates, E. (2005). A crosslinguistic study of the relationship between grammar and lexical development. Journal of Child Language, 32(04):759–786. d’Imperio, M. E. (1978). The voynich manuscript: an elegant enigma. Technical report, DTIC Document. 327 328 bibliography Dubremetz, M. (2013). Vers une identification automatique du chiasme de mots. TALN-RÉCITAL 2013, page 150. Dupont, P. (1994). Regular grammatical inference from positive and negative samples by genetic search: the gig method. In Grammatical Inference and Applications, pages 236–245. Springer. Edelman, G. M. (1987). Neural Darwinism: The theory of neuronal group selection. Basic Books. Elbers, L. and Ton, J. (1985). Play pen monologues: the interplay of words and babbles in the first words period. Journal of Child Language, 12(03):551–565. Ellis, R. and Wells, G. (1980). Enabling factors in adult-child discourse. First Language, 1(1):46–62. Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99. Erjavec, T. (2004). Multext-east version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In LREC. Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., Pethick, S. J., Tomasello, M., Mervis, C. B., and Stiles, J. (1994). Variability in early communicative development. Monographs of the society for research in child development, pages i–185. Ferguson, C. A. 
Fernando, C., Szathmáry, E., and Husbands, P. (2012). Selectionist and evolutionary approaches to brain function: A critical appraisal. Frontiers in Computational Neuroscience, 6.
Ferrer-i-Cancho, R. and Elvevåg, B. (2010). Random texts do not exhibit the real Zipf's-law-like rank distribution. PLoS One, 5(3):e9411.
Fillmore, C. J., Kay, P., and O'Connor, M. C. (1988). Regularity and idiomaticity in grammatical constructions: The case of let alone. Language, pages 501–538.
Fisher, R. A. (1925). Statistical methods for research workers. Genesis Publishing Pvt Ltd.
Flake, G. W. (1998). The computational beauty of nature: Computer explorations of fractals, chaos, complex systems, and adaptation. MIT Press.
Floridi, L. (2011). The Philosophy of Information. Oxford University Press.
Fodor, J. A. (1983). The modularity of mind: An essay on faculty psychology. MIT Press.
Fogel, D. B. (1995). Phenotypes, genotypes, and operators in evolutionary computation. In Evolutionary Computation, 1995, IEEE International Conference on, volume 1, page 193. IEEE.
Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial intelligence through simulated evolution.
Foster, M. L. (2002). Symbolism: The foundation of culture. Companion Encyclopedia of Anthropology, page 366.
Fraisse, P. (1974). Psychologie du rythme. Presses Universitaires de France, Paris.
Frege, G. (1994). Über Sinn und Bedeutung. Wittgenstein Studien, 1(1).
Furrow, D., Nelson, K., and Benedict, H. (1979). Mothers' speech to children and syntactic development: Some simple relationships. Journal of Child Language, 6(3):423–442.
Galton, F. (1875). English men of science: Their nature and nurture. D. Appleton.
Gärdenfors, P. (2004). Conceptual spaces: The geometry of thought. MIT Press.
Gardner, H. (1985a). Frames of mind: The theory of multiple intelligences. Basic Books.
Gardner, H. (1985b). The mind's new science. Basic Books.
Gardner, H. (2011). Frames of mind: The theory of multiple intelligences. Basic Books.
Gertner, S., Greenbaum, C. W., Sadeh, A., Dolfin, Z., Sirota, L., and Ben-Nun, Y. (2002). Sleep–wake patterns in preterm infants and 6 month's home environment: Implications for early cognitive development. Early Human Development, 68(2):93–102.
Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38(1):173–198.
Gold, E. M. (1967). Language identification in the limit. Information and Control, 10(5):447–474.
Goldberg, D. E. (1990). Genetic algorithms in search, optimization & machine learning. Addison-Wesley.
Goldberg, D. E. and Holland, J. H. (1988). Genetic algorithms and machine learning. Machine Learning, 3:95–99.
Gómez, R. L. (2011). Memory, sleep and generalization in language acquisition. Experience, Variation and Generalization: Learning a First Language, 7:261.
Grice, H. (1975). Logic and conversation. In P. Cole and J. Morgan (eds.), Syntax and Semantics, Volume 3: Speech Acts.
Guermeur, Y. and Monfrini, E. (2011). A quadratic loss multi-class SVM for which a radius–margin bound applies. Informatica, 22(1):73–96.
Haeckel, E. (1879). The evolution of man. London: Kegan Paul.
Hamilton, W. D. (1963). The evolution of altruistic behavior. American Naturalist, pages 354–356.
Harris, M. (2013). Language experience and early language development: From input to uptake. Psychology Press.
Harris, Z. S. (1954). Distributional structure. Word.
Hebb, D. O. (1964). The Organization of Behaviour: A Neuropsychological Theory. John Wiley and Sons.
Hesse, H. (1967). Das Glasperlenspiel: Versuch e. Lebensbeschreibung d. Magisters Ludi Josef Knecht samt Knechts hinterlassenen Schriften, volume 842. Suhrkamp.
Hodgins, G. (2014). Forensic investigations of the Voynich ms. In Voynich 100 Conference, www.voynich.nu/mon2012/index.html.
Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220.
Holland, J. H. (1975). Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. U Michigan Press.
Holland, J. H. (1992). Genetic algorithms. Scientific American, 267(1):66–72.
Holly Smith, B., Crummett, T. L., and Brandt, K. L. (1994). Ages of eruption of primate teeth: A compendium for aging individuals and comparing life histories. American Journal of Physical Anthropology, 37(S19):177–231.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558.
Householder, F. W. (1981). Apollonius Dyscolus: The Syntax of Apollonius Dyscolus, volume 23. John Benjamins Publishing.
Hromada, D. (2008). 23 comments to the Chomskian doctrine. Personal communication with D. Sportiche.
Hromada, D. (2010a). Quantitative intercultural comparison by means of parallel PageRanking of diverse national Wikipedias.
Hromada, D. D. (2009). Basen o jablku. Master's thesis, Faculty of Humanities, Charles University, Prague, Czech Republic.
Hromada, D. D. (2010b). Concepts of «invasivity» and «reversibility» and their relation to past, present and future techniques of neural imagery. Course work written for the Bordeaux team of Neural Imagery affiliated to Ecole Pratique des Hautes Etudes.
Hromada, D. D. (2010c). SMILED: Sourire naturel et sourire artificiel. De l'utilisation d'OpenCV pour le tracking, la reconnaissance des expressions faciales et la détection du sourire. Master's thesis, Ecole Pratique des Hautes Etudes, Paris, France.
Hromada, D. D. (2011). Initial experiments with multilingual extraction of rhetoric figures by means of Perl-compatible regular expressions. In RANLP Student Research Workshop, pages 85–90.
Hromada, D. D. (2012a). From age & gender-based taxonomy of Turing test scenarios towards attribution of legal status to meta-modular artificial autonomous agents. page 7.
Hromada, D. D. (2012b). Variations upon the theme of evolutionary language game. Written for prof. Vladimir Kvasnicka, downloadable at http://wizzion.com/papers/2012/.
Hromada, D. D. (2013). Random projection and geometrization of string distance metrics. In RANLP, pages 79–85.
Hromada, D. D. (2014a). Comparative study concerning the role of surface morphological features in the induction of part-of-speech categories. In Text, Speech and Dialogue, pages 46–52. Springer.
Hromada, D. D. (2014b). Conditions for cognitive plausibility of computational models of category induction. In Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 93–105. Springer.
Hromada, D. D. (2014c). Empiric introduction to light stochastic binarization. In Text, Speech and Dialogue, pages 37–45. Springer.
Hromada, D. D. (2014d). Geometrizacia ontologii - pripadova studia SNOMED. Written for doc. Mikulas Popper.
Hromada, D. D. (2015). Genetic optimization of semantic prototypes for multiclass document categorization. Submitted to the Elitech 2015 conference.
Hromada, D. D. (2016a). Can evolutionary computation help us to crib the Voynich manuscript? Submitted to the JADT 2016 conference.
Hromada, D. D. (2016b). Evolutionary induction of 4-schema microgrammars from CHILDES corpora. Submitted to the journal Evolutionary Computation.
Hromada, D. D. (2016c). Evolutionary induction of a lightweight morphosemantic classifier. Submitted to Computational Linguistics.
Hromada, D. D. (2016d). Evolutionary Models of Ontogeny of Linguistic Categories: Four Simulations. PhD thesis, Slovak Technical University and University Paris Lumieres.
Hromada, D. D. (2016e). Fast and frugal retrieval of linguistic universalia from CHILDES transcripts. Submitted to the JADT 2016 conference.
Hromada, D. D. (2016f). Narrative fostering of morality in artificial agents: Constructivism, machine learning and story-telling. In L'esprit au-delà du droit: Pour un dialogue entre les sciences cognitives et le droit. Mare et Martin.
Hromada, D. D. and Gaudiello, I. (2014). Introduction to moral induction model and its deployment in artificial agents. In Sociable Robots and the Future of Social Relations, pages 209–216. IOS Press.
Hromada, D. D., Tijus, C., Poitrenaud, S., and Nadel, J. (2010). Zygomatic smile detection: The semi-supervised Haar training of a fast and frugal system: A gift to the OpenCV community. In Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2010 IEEE RIVF International Conference on, pages 1–5. IEEE.
Huizinga, J. (1956). Homo Ludens: Vom Ursprung der Kultur im Spiel.
Imai, M. and Haryu, E. (2001). Learning proper nouns and common nouns without clues from syntax. Child Development, 72(3):787–802.
Jackendoff, R. (2002). Foundations of language: Brain, meaning, grammar, evolution. Oxford University Press.
Jakobson, R. (1960). Why "mama" and "papa". Essays in Honor of Heinz Werner.
Jiménez López, M. D. et al. (2000). Grammar systems: A formal-language-theoretic framework for linguistics and cultural evolution.
Johnson, K. (2004). Gold's theorem and cognitive science. Philosophy of Science, 71(4):571–592.
Jones, W. (1788). The third anniversary discourse, delivered 2 February 1786. Asiatick Researches, 1:415–431.
Jusczyk, P. W. and Aslin, R. N. (1995). Infants' detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29(1):1–23.
Jusczyk, P. W., Cutler, A., and Redanz, N. J. (1993). Infants' preference for the predominant stress patterns of English words. Child Development, 64(3):675–687.
Karmiloff, K. and Karmiloff-Smith, A. (2009). Pathways to language: From fetus to adolescent. Harvard University Press.
Karpathy, A. and Fei-Fei, L. (2014). Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306.
Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
Karypis, G. (2002). CLUTO: A clustering toolkit. Technical report, DTIC Document.
Kauffman, S. (1995). At home in the universe: The search for the laws of self-organization and complexity. Oxford University Press.
Kelemen, J. (2004). Miracles, colonies, and emergence. In Formal Languages and Applications, pages 323–333. Springer.
Kelemenová, A. and Csuhaj-Varjú, E. (1994). Languages of colonies. Theoretical Computer Science, 134(1):119–130.
Keller, B. and Lutz, R. (1997). Evolving stochastic context-free grammars from examples using a minimum description length principle. In 1997 Workshop on Automata Induction, Grammatical Inference and Language Acquisition. Citeseer.
Kennedy, G. and Churchill, R. (2005). The Voynich manuscript: The unsolved riddle of an extraordinary book which has defied interpretation for centuries. Orion Publishing Company.
Kennedy, J., Eberhart, R. C., and Shi, Y. (2001). Swarm intelligence. Morgan Kaufmann.
Keysers, C. and Perrett, D. I. (2004). Demystifying social cognition: A Hebbian perspective. Trends in Cognitive Sciences, 8(11):501–507.
Komenský, J. A., Okál, M., and Pšenák, J. (1991). Veľká didaktika: Didactica magna. Slovenské pedagogické nakladateľstvo.
Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection, volume 1. MIT Press.
Kuczaj, S. A. and Maratsos, M. P. (1975). What children can say before they will. Merrill-Palmer Quarterly of Behavior and Development, pages 89–111.
Kuhn, T. S. (2012). The structure of scientific revolutions. University of Chicago Press.
Küntay, A. and Slobin, D. I. (1996). Listening to a Turkish mother: Some puzzles for acquisition. Social Interaction, Social Context, and Language: Essays in Honor of Susan Ervin-Tripp, pages 265–286.
Küntay, A. and Slobin, D. I. (2002). Putting interaction back into child language: Examples from Turkish. Psychology of Language and Communication, 6(1).
Kvasnicka, V. and Pospichal, J. (1999). An emergence of coordinated communication in populations of agents. Artificial Life, 5(4):319–342.
Kvasnicka, V. and Pospichal, J. (2007). Evolúcia jazyka a univerzální darwinizmus. Myseľ, inteligencia a život.
Labov, W. and Labov, T. (1978). The phonetics of cat and mama. Language, pages 816–852.
Lakoff, G. (1990). Women, fire, and dangerous things: What categories reveal about the mind. University of Chicago Press.
Lama, D. et al. (2005). In the Buddha's words: An anthology of discourses from the Pali Canon. Simon and Schuster.
Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.
Landini, G. and Zandbergen, R. (1998). A well-kept secret of mediaeval science: The Voynich manuscript. Aesculapius, 18:77–82.
Lashley, K. (1950). In search of the engram. Symposia of the Society for Experimental Biology.
Lauer, F. and Guermeur, Y. (2011). MSVMpack: A multi-class support vector machine package. The Journal of Machine Learning Research, 12:2293–2296.
Li, W. (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6):1842–1845.
Lieven, E. V., Pine, J. M., and Baldwin, G. (1997). Lexically-based learning and early grammatical development. Journal of Child Language, 24(1):187–219.
Lorenz, K. (1973). Die Rückseite des Spiegels. R. Piper.
Lotka, A. J. (1925). Elements of physical biology.
MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. California, USA.
MacWhinney, B. (1987). The competition model. Mechanisms of Language Acquisition, pages 249–308.
MacWhinney, B. (2014). The CHILDES project: Tools for analyzing talk, Volume I: Transcription format and programs. Psychology Press.
MacWhinney, B. and Snow, C. (1985). The child language data exchange system. Journal of Child Language, 12(2):271–295.
MacWhinney, B. and Snow, C. (1991). CHILDES manual.
Maratsos, M. (1988). The acquisition of formal word classes. Categories and Processes in Language Acquisition, pages 31–44.
Marchman, V. A. and Bates, E. (1994). Continuity in lexical and morphological development: A test of the critical mass hypothesis. Journal of Child Language, 21(2):339–366.
Marcus, G. F. (1993). Negative evidence in language acquisition. Cognition, 46(1):53–85.
Markman, E. M. and Hutchinson, J. E. (1984). Children's sensitivity to constraints on word meaning: Taxonomic versus thematic relations. Cognitive Psychology, 16(1):1–27.
Maynard Smith, J. (1986). The problems of biology, volume 144. Oxford University Press, Oxford.
McAuley, J. D., Jones, M. R., Holub, S., Johnston, H. M., and Miller, N. S. (2006). The time of our lives: Life span development of timing and event tracking. Journal of Experimental Psychology: General, 135(3):348.
Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., and Amiel-Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29(2):143–178.
Menyuk, P., Liebergott, J., Schultz, M., Chesnick, M., and Ferrier, L. (1991). Patterns of early lexical and cognitive development in premature and full-term infants. Journal of Speech, Language, and Hearing Research, 34(1):88–94.
Miller, G. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81–97.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
Mink, J. and Blumenschine, R. (1981). Ratio of central nervous system to body metabolism in vertebrates: Its constancy and functional basis. Am J Physiol, 241(3):R203–R212.
Minsky, M. and Papert, S. (1969). Perceptrons. MIT Press.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of machine learning. MIT Press.
Morgan, J. L. and Saffran, J. R. (1995). Emerging integration of sequential and suprasegmental information in preverbal speech segmentation. Child Development, 66(4):911–936.
Morgan, T. H. (1916). A Critique of the Theory of Evolution. Princeton University Press.
Mouillot, D. and Lepretre, A. (2000). Introduction of relative abundance distribution (RAD) indices, estimated from the rank-frequency diagrams (RFD), to assess changes in community diversity. Environmental Monitoring and Assessment, 63(2):279–295.
Nelson, K. (1977). The syntagmatic-paradigmatic shift revisited: A review of research and theory. Psychological Bulletin, 84(1):93.
Nelson, K. (2006). Narratives from the crib. Harvard University Press.
Newbold, W. R. (1928). The Cipher of Roger Bacon.
Nowak, M. A., Plotkin, J. B., and Krakauer, D. C. (1999). The evolutionary language game. Journal of Theoretical Biology, 200(2):147–162.
Ofria, C. and Wilke, C. O. (2004). Avida: A software platform for research in computational evolutionary biology. Artificial Life, 10(2):191–229.
O'Neil, M. and Ryan, C. (2003). Grammatical evolution. Springer.
Pagel, M., Atkinson, Q. D., Calude, A. S., and Meade, A. (2013). Ultraconserved words point to deep language ancestry across Eurasia. Proceedings of the National Academy of Sciences, 110(21):8471–8476.
Páleš, E. (1994). Sapfo – parafrázovač slovenčiny. Veda, vydavateľstvo SAV.
Piaget, J. (1947). La psychologie de l'intelligence.
Piaget, J. (1965). The Moral Judgment of the Child. The Free Press.
Piaget, J. (1974). Introduction à l'épistémologie génétique. PUF, Paris.
Piatelli-Palmarini, M. (1980). Language and learning: The debate between Jean Piaget and Noam Chomsky.
Pine, J. M. and Lieven, E. V. (1997). Slot and frame patterns and the development of the determiner category. Applied Psycholinguistics, 18(2):123–138.
Pinker, S. (1994). The language instinct: The new science of language and mind, volume 7529. Penguin UK.
Pinker, S. (2000). Survival of the clearest. Nature, 404(6777):441–442.
Planck, M. (1926). Über die Begründung des zweiten Hauptsatzes der Thermodynamik. Sitzungsberichte der Preussischen Akademie der Wissenschaften.
Plato (380 BC). Republic.
Pohlheim, H. (1996). GEATbx: Genetic and evolutionary algorithm toolbox for use with Matlab. Documentation online at http://www.geatbx.com/docu/algindex.html (accessed May 2004).
Poincaré, H. (1908). L'invention mathématique.
Poincaré, H. and Magini, R. (1899). Les méthodes nouvelles de la mécanique céleste. Il Nuovo Cimento (1895-1900), 10(1):128–130.
Popper, K. R. (1972). Objective knowledge: An evolutionary approach. Clarendon Press, Oxford.
Price, G. R. et al. (1970). Selection and covariance. Nature, 227:520–521.
Provasi, J., Anderson, D. I., and Barbu-Roth, M. (2014). Rhythm perception, production, and synchronization during the perinatal period. Frontiers in Psychology, 5.
Ray, T. S. (1992). Evolution, ecology and optimization of digital organisms. Santa Fe Institute working paper.
Rechenberg, I. (1971). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Dr.-Ing. thesis, Technical University of Berlin, Department of Process Engineering.
Rizzolatti, G., Sinigaglia, C., and Anderson, F. T. (2008). Mirrors in the brain: How our minds share actions and emotions. Oxford University Press.
Roffwarg, H. P., Muzio, J. N., and Dement, W. C. (1966). Ontogenetic development of the human sleep-dream cycle. Science.
Rosch, E. (1999). Principles of categorization. Concepts: Core Readings, pages 189–206.
Rosch, E. and Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7(4):573–605.
Rosenberg, A. and Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL, volume 7, pages 410–420.
Rudolph, G. (1994). Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks, 5(1):96–101.
Rugg, G. (2004). An elegant hoax? A possible solution to the Voynich manuscript. Cryptologia, 28(1):31–46.
Sagae, K., Davis, E., Lavie, A., MacWhinney, B., and Wintner, S. (2007). High-accuracy annotation and parsing of CHILDES transcripts. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 25–32. Association for Computational Linguistics.
Sahlgren, M. (2005). An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, volume 5.
Salakhutdinov, R. and Hinton, G. (2009). Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978.
Samuel, A. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210.
Schinner, A. (2007). The Voynich manuscript: Evidence of the hoax hypothesis. Cryptologia, 31(2):95–107.
Schleicher, A. (1873). Die Darwinsche Theorie und die Sprachwissenschaft: Offenes Sendschreiben an Herrn Ernst Häckel, volume 2. Böhlau.
Schmidt, J. (1872). Die verwantschaftsverhältnisse der indogermanischen sprachen. Böhlau.
Schwartz, R. G. and Leonard, L. B. (1982). Do children pick and choose? An examination of phonological selection and avoidance in early lexical acquisition. Journal of Child Language, 9(2):319–336.
Schwartz, R. G. and Terrell, B. Y. (1983). The role of input frequency in lexical acquisition. Journal of Child Language, 10(1):57–64.
Sekaj, I. (2004). Robust parallel genetic algorithms with reinitialisation. In Parallel Problem Solving from Nature - PPSN VIII, pages 411–419. Springer.
Sekaj, I. (2005). Evolučné výpočty a ich využitie v praxi. Iris.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656.
Simonton, D. K. (1999). Creativity as blind variation and selective retention: Is the creative process Darwinian? Psychological Inquiry, 10(4):309–328.
Skinner, B. F. (1957). Verbal Behavior. Appleton-Century-Crofts.
Sklyarov, V. and Skliarova, I. (2014). Hamming weight counters and comparators based on embedded DSP blocks for implementation in FPGA. Advances in Electrical and Computer Engineering, 14(2):63–68.
Slobin, D. I. (1973). Cognitive prerequisites for the development of grammar. Studies of Child Language Development, 1:75–208.
Smith, T. C. and Witten, I. H. (1995). A genetic algorithm for the induction of natural language grammars. In Proc. of IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing, pages 17–24.
Sokol, J. (1998). Malá filosofie člověka: Slovník filosofických pojmů. Vyšehrad.
Solan, Z., Horn, D., Ruppin, E., and Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences of the United States of America, 102(33):11629–11634.
Sosík, P. and Štýbnar, L. (1997). Grammatical inference of colonies. In New Trends in Formal Languages, pages 236–246. Springer.
Spears, W. M., De Jong, K. A., Bäck, T., Fogel, D. B., and De Garis, H. (1993). An overview of evolutionary computation. In Machine Learning: ECML-93, pages 442–459. Springer.
Spencer, H. (1894). Education: Intellectual, moral, and physical. CW Bardeen.
Strong, L. C. (1945). Anthony Askham, the author of the Voynich manuscript. Science, 101(2633):608–609.
Suciu, A., Cobarzan, P., and Marton, K. (2011). The never ending problem of counting bits efficiently. In 10th RoEduNet International Conference, pages 1–4. IEEE.
Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts: With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society, pages 452–463.
Tomasello, M. (2009). Constructing a language: A usage-based theory of language acquisition. Harvard University Press.
Tomasello, M., Akhtar, N., Dodson, K., and Rekau, L. (1997). Differential productivity in young children's use of nouns and verbs. Journal of Child Language, 24(2):373–387.
Tomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105–108.
Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience, 5(1):42.
Trevarthen, C. (1993). The self born in intersubjectivity: The psychology of an infant communicating.
Trivers, R. (1972). Parental investment and sexual selection.
Turing, A. M. (1939). Systems of logic based on ordinals. Proceedings of the London Mathematical Society, s2-45:161–228.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, pages 433–460.
Ventris, M. and Chadwick, J. (1953). Evidence for Greek dialect in the Mycenaean archives. The Journal of Hellenic Studies, 73:84–103.
Vygotsky, L. S. (1978). Mind in society: The development of higher mental processes.
Vygotsky, L. S. (1987). Thinking and speech. The Collected Works of L. S. Vygotsky, 1:39–285.
Wall, L. et al. (1994). The Perl programming language.
Watson, J. D., Crick, F. H., et al. (1953). Molecular structure of nucleic acids. Nature, 171(4356):737–738.
Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1):49–63.
Wernicke, C. (1874). Der aphasische Symptomencomplex.
Widdows, D. and Cohen, T. (2014). Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of IGPL, page jzu028.
Wiener, N. (1961). Cybernetics or Control and Communication in the Animal and the Machine, volume 25. MIT Press.
Wilson, E. O. (2000). Sociobiology: The new synthesis. Harvard University Press.
Wittgenstein, L. (1922). Tractatus logico-philosophicus. Kegan Paul.
Wittgenstein, L. (1934). The blue book.
Wittgenstein, L. (1953). Philosophical Investigations. Blackwell.
Wolff, J. G. (1988). Learning syntax and meanings through optimization and distributional analysis. Categories and Processes in Language Acquisition, 1(1).
Wright, S. (1932). The roles of mutation, inbreeding, crossbreeding, and selection in evolution, volume 1.
Zipf, G. K. (1949). Human behavior and the principle of least effort. Addison-Wesley.

CONTENTS

i  theses
1  initial thesis
   1.1  mind (DEF)
   1.2  to evolve (DEF)
2  hard thesis
   2.1  Evolution (DEF)
   2.2  Learning (DEF)
   2.3  Form (DEF)
   2.4  First condition of HT's validity (DEF)
   2.5  Second condition of HT's validity (DEF)
   2.6  Third condition of HT's validity (DEF)
   2.7  Fourth condition of HT's validity (DEF)
   2.8  Brain (DEF)
   2.9  2nd law of thermodynamics (DEF)
   2.10  Connectionist explanation of non-locality (TXT)
   2.11  Alternative explanation of non-locality (TXT)
3  soft thesis
   3.1  Evolutionary computation (DEF)
   3.2  Successful Simulation (DEF)
   3.3  Cognitive Plausibility (DEF)
4  softer thesis
   4.1  Natural language (DEF)
   4.2  Why natural language? (APH)
5  softest thesis
   5.1  Toddlerese (DEF)
   5.2  Child (DEF)
6  operational thesis
   6.1  Text (DEF)
7  summa i
   7.1  Trial and Error (DEF)

ii  paradigms
8  universal darwinism
   8.1  Biological evolution
   8.2  Evolutionary Psychology
   8.3  Memetics
   8.4  Evolutionary Epistemology
      8.4.1  Phylogeny
      8.4.2  Ontogeny
      8.4.3  Individual Creativity
      8.4.4  Genetic Epistemology
   8.5  Evolutionary Linguistics
      8.5.1  Ethnogeny (DEF)
   8.6  Neural and Mental Darwinism
   8.7  Evolutionary computation
      8.7.1  Genetic algorithms
         Fitness Proportional Selection (SRC)
         Fitness functions and fitness landscapes
         Canonical Genetic Algorithms
         Parallel Genetic Algorithms
      8.7.2  Evolutionary programming & evolutionary strategies
      8.7.3  Genetic programming
         Grammatical evolution
      8.7.4  Tierra
      8.7.5  Evolutionary Language Game
9  developmental psycholinguistics
   9.1  Language Development (DEF)
      9.1.1  Central dogma of DP (DEF)
   9.2  Development of Toddlerese
      9.2.1  Ontogeny of prosody, phonetics and phonology
      9.2.2  Canonical Babbling
      9.2.3  Ontogeny of lexicon and semantics
         The Principle of Contrast (DEF)
      9.2.4  Ontogeny of morphosyntax
         Compositionality (DEF)
         Pivot schema (DEF)
         The principle of precedence of the specific (DEF)
         Principle of distributed practice (DEF)
      9.2.5  Ontogeny of pragmatics
      9.2.6  Physiological and cognitive development
   9.3  Motherese
         Variation sets
   9.4  Language Acquisition Paradigms
      9.4.1  Classical
      9.4.2  Generativists and Nativists
         Panini's Grammar (APH)
         Shiva Sutras (SRC)
         Refutation of Gold's Theorem (APH)
      9.4.3  Empiricists and Constructivists
      9.4.4  Socio-pragmatic and Usage-based paradigms
         Format (DEF)
10  computational linguistics
   10.1  Quantitative and corpus linguistics
      10.1.1  Zipf's law
      10.1.2  Logistic law
   10.2  Formal Language Theory
      10.2.1  Basic tenets (DEF)
         Grammar and Rule (DEF)
      10.2.2  Chomsky-Schützenberger hierarchy (TXT)
      10.2.3  Grammar System Theory (TXT)
         Language Colony (DEF)
   10.3  Natural Language Processing
      10.3.1  Machine learning
         Evaluation
         Precision and Recall (DEF)
         V-measure (DEF)
   10.4  Semantic Vector Architectures
      10.4.1  Category Prototype (DEF)
      10.4.2  Hebb-Harris Analogy (APH)
      10.4.3  Bag-of-Terms
         TF-IDF
      10.4.4  Latent Semantic Analysis (TXT)
      10.4.5  Random Indexing (TXT)
      10.4.6  Light Stochastic Binarization
      10.4.7  Evolutionary Localization of Semantic Attractors
   10.5  Part-of-speech induction
      10.5.1  Non-evolutionary POS-i
      10.5.2  Evolutionary
   10.6  Grammar induction
      10.6.1  Existing non-evolutionary approaches
      10.6.2  Existing evolutionary approaches
11  summa ii

iii  observations
12  qualitative
   12.1  Method and data collection
      12.1.1  Biases
   12.2  Subject
   12.3  Linguistic environment
   12.4  Crying and Babbling
   12.5  First words
      12.5.1  NENE & taboo (APH)
   12.6  Repetitions and Replications
   12.7  First constructions
      12.7.1  First word combinations
      12.7.2  First pivot(s)
      12.7.3  First micro-grammars
   12.8  Mutations
      12.8.1  Context-free substitutions
      12.8.2  Context-sensitive substitution
         Context-sensitive substitutions (EXT)
   12.9  Case study of semantic mutations: The DING-DONG mystery (APH)
      12.9.1  First transpositions
         Context-sensitive metatheses (EXT)
   12.10  Crossovers
      12.10.1  Multilingual crossovers
         Intralexical crossovers
         Intraphrastic crossovers
         Of crossover and calques (APH)
   12.11  Monolingual crossovers
      12.11.1  Intralexical
      12.11.2  Interlexical
      12.11.3  Intraphrastic crossovers
         Of crossover and overgeneralizations (APH)
   12.12  Other phenomena
      12.12.1  Multilingual C-scheme Mismatch
      12.12.2  Compression of Information
13  quantitative
   13.1  Method
   13.2  Data
   13.3  Universals
      13.3.1  Letters
      13.3.2  N-grams
      13.3.3  Intrasubjective replications
         Intralocutory duplications
         Translocutory replications
      13.3.4  Intersubjective replications
   13.4  English-specific
      13.4.1  Utterance-level Constructions
      13.4.2  Pivot schemas
      13.4.3  Pivot instances
      13.4.4  Pivot grammars
14  summa iii
   14.1  Crossroads of thoughts
      14.1.1  The linguistic crossover principle
      14.1.2  Of crossovers and analogies (APH)
   14.2  Axes of analysis
   14.3  The source of variation
      14.3.1  Extending usage-based paradigm (TXT)
   14.4  From selection to replication
      14.4.1  The principle of exogenous selection (DEF)
      14.4.2  MPR Precept (APH)
         Love (DEF)

iv  simulations
15  breaking into unknown code
   15.1  Generic Introduction
   15.2  Abstract
   15.3  Introduction
      15.3.1  Pre-digital tentatives
      15.3.2  Post-digital tentatives
      15.3.3  Our position
      15.3.4  Primary Mapping
      15.3.5  Three Conjectures
   15.4  Method
      15.4.1  Calendar
      15.4.2  Cribbing
      15.4.3  Optimization
   15.5  Experiments
      15.5.1  Slavic crib
      15.5.2  Hebrew crib
   15.6  Conclusion
   15.7  Generic Conclusion
16  evolutionary localization of semantic prototypes
   16.1  Generic Introduction
   16.2  Introduction
      16.2.1  Geometrization of Categories
      16.2.2  Radical Dimensionality Reduction
   16.3  Genetic Localization of Semantic Prototypes
   16.4  Corpus and Training Parameters
   16.5  Evaluation and Results
   16.6  Conclusion
   16.7  Generic Conclusion
17  evolutionary induction of a lightweight morphosemantic classifier
   17.1  Generic Introduction
   17.2  Introduction
      17.2.1  From planes to prototypes
      17.2.2  From prototypes to constellations
      17.2.3  From constellations to lightweight classifiers
   17.3  Method
      17.3.1  Corpus
         Classes
         Pre-processing
      17.3.2  Algorithm
         Vector Space Preparation
         Evolutionary Optimization
         Parameters
      17.3.3  Evaluation
   17.4  Discussion of Results
   17.5  Conclusions
      17.5.1  Computational conclusion
      17.5.2  Psycholinguistic conclusion
   17.6  Generic Discussion
   17.7  Second Simulation Bibliography
18  evolutionary induction of 4-schema microgrammars from childes corpora
   18.1  General Introduction
   18.2  Introduction
      18.2.1  Two extremes
      18.2.2  Definitions
         G-Category (DEF)
         H∆-Category (DEF)
         N∆-Schema (DEF)
   18.3  Model
      18.3.1  Vector Space Preparation
      18.3.2  Bridging the Sub-symbolic and Symbolic realms
         Genotype
         Phenotype
      18.3.3  Fitness Function
      18.3.4  Evolutionary Strategy
      18.3.5  Evolution of both centroids and radii
      18.3.6  Pseudo-random initialization of 0th population
      18.3.7  Locus-constrained cross-over
      18.3.8  Re-focusing strategy
   18.4  Simulation
      18.4.1  Corpus
      18.4.2  Parameters
   18.5  Observations
      18.5.1  Diachronic observations
   18.6  Conclusion
   18.7  General Discussion

v  summa
19  summa summarum

bibliography

LIST OF FIGURES
Figure 1: Main notions of this dissertation.
Figure 2: Distinction between "connectionist" (a) and "alternative" (b) representations of the same data. It is evident that the latter allows for more structural variation than the former.
Figure 3: Cognitive Hexagram.
Figure 4: One-point and two-point crossovers. Figures reproduced from Morgan (1916).
Figure 5: Schleicher's Stammbaum of the family of Indo-European languages. Reproduced from Schleicher (1873).
Figure 6: Possible mechanism of replication of patterns of synaptic connections between neuronal groups. Reproduced from Fernando et al. (2012).
Figure 7: Basic genetic algorithm schema. Reproduced from Pohlheim (1996).
Figure 8: Possible fitness landscape for a problem with only one variable. Horizontal axis represents the gene's value, vertical axis represents fitness.
Figure 9: Different architectures of Parallel Genetic Algorithms. Reproduced from Sekaj (2004).
Figure 10: Sequence of steps constructing the program sqrt(x+5).
Figure 11: Sequence of transformations from genotype to phenotype in both grammatical-evolution and biological systems. Figure reproduced from O'Neil and Ryan (2003).
Figure 12: A case whereby mutual alignment of sound-meaning mappings can be useful. Reproduced from Kvasnicka and Pospichal (2007)'s reproduction of Pinker (2000).
Figure 13: Development of productive vocabulary in early (a) and late (b) toddlerese. Figures reproduced from Fenson et al. (1994).
Figure 14: Corpus of two-word utterances produced by the toddler Andrew. Reproduced from Braine and Bowerman (1976).
Figure 15: Mean length of utterances produced by English and Italian children of different ages (in months). Figures reproduced from Devescovi et al. (2005).
Figure 16: Some modalities of information exchange between mother and her child. Reproduced from Trevarthen (1993).
Figure 17: Logistic law in relation to historic and ontogenetic linguistic processes. Data taken from Best (2006).
Figure 18: Emergence of "miraculous" infinite generative capacity by means of interlock of two finite grammars. Figure reproduced from Kelemen (2004).
Figure 19: Comparison of reflective LSB (with I=2 iterations) and unreflective LSB (I=0) with Semantic Hashing and binarized Latent Semantic Analysis. Reproduced from Hromada (2014c).
Figure 20: Equivalence classes and production rules induced from English language samples. Fig. reproduced from Wolff (1988).
Figure 21: Equivalence classes and production rules induced from English language samples by the ADIOS algorithm. Reproduced from Solan et al. (2005).
Figure 22: Finite state automaton matching all strings over (1 + 0)* without an odd number of consecutive 0's after an odd number of consecutive 1's. Reproduced from Tomita (1982).
Figure 23: Grammars induced from nine different POS-tagged corpora. Reproduced from Aycinena et al. (2003).
Figure 24: Two simple grammars covering the sentence "the dog saw a cat". Fig. reproduced from Smith and Witten (1995).
Figure 25: First differentiation between the whole and its part (a) and its evolutionary explanation (b).
Figure 26: Drawing from folio f84r containing the primary mapping.
Figure 27: Evolution of individuals adapting labels in the Calendar to names listed in the Slavic crib.
Figure 28: Evolution of individuals adapting labels in the Calendar to names listed in the Hebrew cribs.
Figure 29: Retrieval and 20-class classification performance in 128-dimensional binary spaces. Non-LSB results are reproduced from Figure 6 of the study (Salakhutdinov and Hinton, 2009), plain LSB from (Hromada, 2014c).
Figure 30: Evolutionary optimization increases the precision of a multi-class classifier. Curves represent results averaged across diverse runs (R = 6*100 for CANONIC, R = 6 for MERGE1).
Figure 31: Centroidal tessellation of twelve data-points belonging to three distinct classes. Dots represent data-points, crosses are category prototypes and colors denote category membership. Black lines denote tessellation boundaries.
Figure 32: Data flow among main components of INDUCTOR. Lime color denotes components related to evolutionary optimization, royal blue color denotes components of the preliminary VSP phase.
Figure 33: Data flow among main components of the extended variant of INDUCTOR introducing a syntagmatic-paradigmatic feedback loop.

LIST OF TABLES

Table 1: Conceptual parallels between biological and linguistic evolution. Table partially reproduced from Atkinson and Gray (2005).
Table 2: Children avoid production of words with unknown characteristics. Reproduced from a table in Clark (2003) based on data in Schwartz and Leonard (1982).
Table 3: Words produced by at least half of children in the monthly sample. Reproduced from a table in Clark (2003) based on data from Fenson et al. (1994).
Table 4: Case of development of word|meaning mappings. Based on data in Barrett (1978).
Table 5: Utterances classified as tokens of the four major types of motherese. Reproduced from table 4.2 in (Bruner and Watson, 1983, pp. 79–80).
Table 7: Vectorial representations of three sentence-sized documents. Every distinct word yields a distinct column.
Table 8: Vectorial representations of sentence-sized documents D1 = "mama má emu" and D2 = "mama má mamu". Every distinct character trigram yields a distinct column.
Table 9: K-means clustering of tokens according to both suffixal and co-occurrence information. Table partially reproduced from Hromada (2014b).
Table 10: IM's productive lexicon before attainment of 18 months. Words in the brackets denote the most plausible meaning, as decoded by either father (F) or mother (M). Compare with Table 3.
Table 11: IM's seeding grammar: AUCH at the ultimate position.
Table 12: Seeding grammar extended: AUCH in the central position.
Table 13: Another AUCH-centered paradigm.
Table 14: Interlinguistic micro-grammar.
Table 15: Recapitulation of crossover types observed in IM's production.
Table 16: Activity of different speakers in two age groups.
Table 17: Repartition of languages in the studied corpus.
Table 18: 20 most frequent graphemes according to speakers and age groups.
Table 19: 20 most frequent bigrams according to speakers and age groups.
Table 20: 10 most frequent trigrams according to speakers and age groups.
Table 21: Duplicated expressions and numbers of child-originated and child-directed utterances in which they occur.
Table 22: Probability that an utterance shall contain at least one adjacently duplicated 2+gram.
Table 23: Most frequent translocutory 3+grams.
Table 24: Probability that both parts of an utterance couplet shall contain at least one identical 3+gram.
Table 25: Most frequent words replicated from child to mother (CHI_INIT) and mother to child (MOT_INIT).
Table 26: Basic statistics concerning the replication of 3+grams between mother and child.
Table 27: Distributions of occurrences of the marker for laughing in diverse subsets of the CHILDES corpus.
Table 28: Counts related to morphologically annotated English-language transcripts analyzed in this section.
Table 29: Most frequent utterance-level constructions produced by English-speaking mothers and children in 2 phases of their development.
Table 30: Correlations between distributions of frequencies of utterances.
Table 31: Number of distinct utterances in diverse datasets and entropies of their distributions.
Table 32: Thirty 8+grams with highest score_pivoteness.
Table 33: Ten CHILD-produced pivot7 schemas with highest contextual entropy (in shannons).
Table 34: CHILDES utterances most frequently instantiating some pivot7 schema.
Table 35: Pivot-instantiating CHILDES utterances pronounced by the biggest number of distinct children.
Table 36: Most popular instances of pivot "^that's X".
Table 37: Most popular instances of pivot "^I want X".
Table 38: Most popular instances of pivot "X little Y".
Table 39: Interphrastic crossover behind Abe's "boy can't eat his carrots".
Table 40: Fittest chromosomes which map reversed tokens in the Calendar onto names of the Slavic crib.
Table 41: Five classes of interest, their corresponding CHILDES part-of-speech tags, and some example word types which instantiate them.
Table 42: Parameters of simulation 2.
Table 43: Overall results of five different approaches. GA results have been averaged across diverse runs (R = 6*100 for CANONIC, R = 6 for MERGE1).
Table 44: MSVM2 training corpus confusion matrix.
Table 45: MSVM2 testing corpus confusion matrix.
Table 46: Training corpus confusion matrix produced by FITTEST(GA_MERGE1).
Table 47: Testing corpus confusion matrix produced by FITTEST(GA_MERGE1).
Table 48: Testing corpus tokens closest to prototypes of ACTION, SUBSTANCE and PROPERTY encoded in the FITTEST(GA_MERGE1) constellation, with Hamming distance H(token, prototype) and the tokens' CHILDES part-of-speech annotations. False positives are marked in bold font.
Table 49: Words of CorpusMini and hexadecimal representations of their potential hashes.
Table 50: A candidate genotype which could potentially be induced from the hypothetic CorpusMini.
Table 51: Parameters of diverse components of the INDUCTOR algorithm.

ACRONYMS

CL     Computational Linguistics
DP     Developmental Psycholinguistics
EC     Evolutionary Computing
EL     Evolutionary Linguistics
ES     Evolutionary Strategy
ET     Evolutionary Theory
FLT    Formal Language Theory
GA     Genetic Algorithm
GE     Genetic Epistemology
GS     Grammar System
GI     Grammar Induction | Grammar Inference
HT     Hard Thesis
LA     Language Acquisition
LD     Language Development
MDL    Minimal Description Length
MLU    Mean Length of Utterance
ND     Neural Darwinism
NLP    Natural Language Processing
OT     Operational Thesis
POS-i  Part-of-Speech Induction
POS-t  Part-of-Speech Tagging
ST     Soft Thesis
S2T    Softer Thesis
S3T    Softest Thesis
UD     Universal Darwinism
VD     Vocabulary Development

ACKNOWLEDGEMENTS

Ideas presented in this dissertation are the result of a crossover of multiple sources of intellectual and cultural influence. Some of them are enumerated in the bibliography and some are mentioned in the acknowledgments sections of my bachelor and master dissertations.
Their individual names being listed elsewhere, the influence of the teachers - affiliated with Charles University in Prague, the National University of Mongolia and the University of Nice Sophia-Antipolis - as well as of the researchers affiliated to the 3rd Section of Ecole Pratique des Hautes Etudes (be it in Paris, Dijon, Caen, Aix en Provence or Bordeaux) is not to be underestimated. Also not to be underestimated is the influence of EPHE's 4th section. Although the unpredictability of life hasn't allowed me to pursue the "philologic track" longer than one year, I remain in a state of reconnaissance profonde towards prof. G-J. Pinault, J.E.M. Houben, D. Petit, A. LeMarechal, Fanny Meunier and others for showing me how "muddy" the formalisms of any theory are in comparison with the gems of poetry and language which have been, were, had been or simply are still alive.

University Paris 8 Saint-Denis - renamed to Paris Lumières amidst the work on this dissertation - is also to be praised as an entity without whose support this dissertation would not have seen the light of day. On one side, its teachers: S. Peperkamp, who introduced me to both the beauties and the intricacies of psycholinguistic disputes, or M. B. Jover, whose courses of epistemology reignited in me a long-forgotten interest in philosophy. On the other side, dozens of other fonctionnaires who allowed me - sometimes slowly, sometimes fast, but always with success - to pursue my investigations of unknown epistemic territories.

However, EPHE and Paris 8 are just drops on the surface of a structure much greater and more ancient: Lutetia Parisiorum, the city of Paris. Verily, if there is a three-dimensional mineral structure which is to be acknowledged as a primary source of inspiration for this work, then it is the web which surrounds Notre-Dame. Deeply grounded and firmly represented in my hippocampus, Paris is indeed the city where the first drafts of this dissertation were brought to life and discussed in wine-lasting debates with Simon Carrignon, Adil elGhali, Fabien Ruggieri, Ilaria Gaudiello, Jean-Marc Thiebaut, Mary Rougeux, Yann Leger, Christophe Chavatte, Kechadi Lagha, Anne Ronsheim, Barbarka Jarkovska, Maurice & Florence Benayoun, Ivan Bigorgne, Geoffrey Tissier, Jeremy Gardent, my first student Pauline Vallies and second student Geoffrey Vantalon, Jarmila Mendelova, Louise Hearsum, Jitka Pelechova, Ophelie Monsoreau, Julienne Michele and Julie Rocton. Help coming from Kristina Poliakova, Julian Bonnyaud, Mikaela Barankova & Monique Girodroux, François Jodelet and Anh Nguyen, the stream of good books coming from the antiquaire Andre from the bookshop on the corner of rue d'Ulm and rue Claude Bernard, as well as the support of the bouquinistes Julien & Lue from Quai Voltaire, were quite vital during periods when the dissertation existed only in posse and not yet in esse.

But it would be unjust to praise Paris and not address some praise also à la République tout entière. For without the help of her institutions and her establishments - as diverse as CROUS, CAF, ANPE, Cité des Sciences, Société d'exploitation de la Tour Eiffel (!), Mairie de Paris or Campus France - it would be highly implausible that an ordinary Bratislava boy could ever dedicate years of his life to pure science. In this regard, the roles of the French Ministry of Foreign Affairs, of the Embassy of France in Slovakia and of people like Mme.
Monika Saganova are of particular importance: their assistance ultimately allowed me to cover a significant part of my material needs with the scholarship of the French government for doctoral studies under double supervision. It is also thanks to Michal Oravec and Zuzana Dideková that such a double supervision became reality. By a strange coincidence of events, and independently of each other, they both attracted my attention to the fact that in my own country of origin, Slovakia, there already exists a well-established, firm and intellectually rich tradition of cybernetics in general and of evolutionary computation in particular. Hence I met Mr. Ivan Sekaj, who was not only willing to take me under his wing but also re-introduced me to the education system of my own homeland, made me program my first genetic algorithm and always somehow succeeded in adapting his agenda to my needs. It is thanks to him that I had the opportunity to get in contact with other "wizards from Mlynska Dolina", including prof. V. Kvasnicka or M. Popper.

None of these meetings and encounters would have taken place, however, if it hadn't been for one man: professor Charles Tijus. This is so because it was mainly Charles - assisted by Francois Jouen and Joelle Provasi - who kept alive the curriculum "Cognition Humaine et Artificielle" at EPHE/Paris 8; it was Charles who guided the direction of my Master's thesis and who, besides this, also managed the complexities of the ChART laboratory and of the research platform Lutin, where the germs of this dissertation were conceived. But besides all this, it was Charles who convinced me that pursuing the path of science is worth the effort, only to subsequently give me practically absolute liberté in finding my own method of such a pursuit.

Last but not least, my ultimate "thank you" is dedicated to a woman who has transformed herself, during 6 years of doctoral studies, from a completely unknown féerie into a virtual acquaintance into my guest into my host into my tourist guide into my friend into my love into my muse into the mother of our daughter into my fiancée into my wife. It is thanks to You, Lucia, and thanks to the thousands of numinous adjustments You make, that our Iolanda Maitreya sleeps her green ideas peacefully and not furiously, that our house is fragrant with myriad essences, and that this dissertation can hereby be considered finished.

colophon

This document was typeset using the typographical look-and-feel classicthesis developed by André Miede. The style was inspired by Robert Bringhurst's seminal book on typography "The Elements of Typographic Style". classicthesis is available for both LaTeX and LyX: http://code.google.com/p/classicthesis/ Happy users of classicthesis usually send a real postcard to the author; a collection of postcards received so far is featured here: http://postcards.miede.de/

Final Version as of November 17, 2016 (classicthesis version 1).

DECLARATION

I declare that this Thesis is the fruit of my own work and that all citations and references to external sources are explicitly marked.

Daniel Devatman Hromada, November 17, 2016

Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés

Adil El Ghali (1, 2), Daniel Hromada (1), Kaoutar El Ghali
(1) LUTIN UserLab, 30, avenue Corentin Cariou, 75930 Paris cedex 19
(2) IBM CAS France, 9 rue de Verdun, 94253 Gentilly
elghali@lutin-userlab.fr
ABSTRACT
Enriching and reasoning on semantic spaces for keyword extraction
This article presents a multi-modular hybrid system for the extraction of keywords from a corpus of scientific articles. The system is multi-modular because it integrates components executing transformations on 1) the morphosyntactic level (lemmatization and chunking) and 2) the semantic level (Reflective Random Indexing), as well as upon 3) more «pragmatic» aspects of the processed documents, modeled by production rules. The system is hybrid because it was able to address, without major modification, both tracks of the DEFT 2012 competition: the «reduced search-space» scenario of Track 1, whose objective was to map the content of a scientific article upon members of a «terminological list», as well as the more «real-life» scenario of Track 2, within which no such list was associated with the documents contained in the corpus. The system hereby presented obtained an F-score of 0.9488 for Track 1 and of 0.5874 for Track 2.

KEYWORDS: Keyword extraction, Semantic spaces, RRI, Bayesian network, Production rules, Chunking.

1 Introduction
The 2012 edition of the text-mining challenge DÉfi Fouille de Textes (DEFT) has as its theme the automatic identification of keywords indexing the content of articles published in scientific journals. Two tracks were proposed: in the first (Track 1) the keyword terminology is provided, while in the second (Track 2) the attribution of keywords has to be done without a terminology. To carry out this task we decided, in continuity with what we had realized in 2011 (El Ghali, 2011), to represent the meaning of the terms and documents of the corpus in semantic spaces built with the Reflective Random Indexing (RRI) variant. The choice of RRI, a variant of Random Indexing (RI) (Sahlgren, 2006), is motivated by the good properties of this method, inherited from RI and widely described in the literature (Cohen et al., 2010a). But one of its less known and less commented properties turned out to be particularly relevant for the problem posed in this edition of DEFT, namely the uniformity of the semantic space: the vectors constructed by RRI to represent the documents and the terms of the corpus are «comparable».

In the method we developed for this edition of DEFT, we wanted to answer two main questions:
1. What would be the contribution of shallow linguistic pre-processing to semantic spaces? And how could such pre-processing be compared to methods of semantic-space construction which capture elements of structure?
2. Can the scoring methods developed in the previous editions of DEFT be improved by using recent advances in artificial intelligence, notably rule-based reasoning and probabilistic graphs, encoding respectively general rules about the choice of keywords and uncertain information issued from the training corpus?

The first question imposed itself naturally, since a large part of the keywords provided for Track 1 are in fact multi-word expressions whose morphosyntactic and grammatical categories respect fairly simple rules. In order to handle keywords composed of several words, certain methods of representing texts in semantic spaces, such as BEAGLE (Jones and Mewhort, 2007), PSI (Cohen et al., 2009), or RRI with positional indexes (Widdows and Cohen, 2010), make it possible to encode information about word order. The second question was born from the fact that we had at our disposal information of different natures which could help to attribute keywords correctly: about semantics, about the distribution of keywords, about structure, about the journals from which the articles are issued, etc. This information could hardly be encoded in a single decision formalism. We therefore decided to define a decision procedure for the attribution of keywords which combines symbolic rules with Bayesian networks, by means of Probabilistic Production Rules (Aït-Kaci and Bonnard, 2011).

We chose to approach the two tracks of this year's challenge in an essentially identical manner; the same methods were used for both tracks. To do so, we constructed a terminology for Track 2. This terminology is a list of candidate keywords established using a semantic space and shallow linguistic pre-processing.

The article is organized as follows: we begin in Section 2 with an analysis of the corpus and of the information which can be extracted from it and which is useful for the keyword-attribution task. Then, in Section 3, we briefly recall the working principle of RRI, describe how the information issued from the linguistic pre-processing is incorporated into the semantic spaces, and also how the list of candidate keywords for Track 2 is constructed. In Section 4 we present the working principle of the decision procedure for the attribution of keywords. Finally, in Section 5, we detail the characteristics of each of the runs and discuss the results before concluding.

2 The Corpus

2.1 General statistics of the training corpus

2.1.1 Track 1
For Track 1, there are 140 documents in the training corpus. The documents come from 4 different journals, the identifier of the journal being encoded in the name of the XML file containing the article. The terminological list associated with the training corpus - i.e. the list containing all the unique terms chosen as a keyword for some document of the corpus - contains T_appr = 666 unique terms. The number of associated keywords is provided for each document of the training corpus as well as of the test corpus. In total, Σ_i N_appr,i = 754.
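As a sanity check, per-document keyword counts and the summary statistics reported below can be recomputed with a few lines of Python. This is a minimal sketch: the directory layout and the <motscles nb="..."> attribute are hypothetical stand-ins for the actual DEFT 2012 markup, which is not reproduced here.

```python
import statistics
from pathlib import Path
from xml.etree import ElementTree

# Hypothetical layout: one XML file per article; the number of keywords
# attributed to the article is assumed to sit in a <motscles nb="..."> tag.
counts = []
for xml_file in Path("deft2012/train_t1").glob("*.xml"):
    root = ElementTree.parse(xml_file).getroot()
    counts.append(int(root.find("motscles").get("nb")))

print("total  :", sum(counts))                    # 754 expected for Track 1
print("mean   : %.3f" % statistics.mean(counts))
print("median :", statistics.median(counts))
print("min/max:", min(counts), max(counts))
print("sd     : %.3f" % statistics.stdev(counts))
```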
On average, each article of the training corpus has:
mean(N_appr) = 5.386; median(N_appr) = 5; min(N_appr) = 1; max(N_appr) = 13; sd(N_appr) = 1.344

Given that Σ_i N_appr_i > T_appr, it is evident that some terms are defined as keywords for several articles: the 1 term – 1 article bijection principle does not apply. More precisely, in the training corpus, 604 keywords are associated with a single article, 46 with two, and 10 with three; four keywords (i.e. «identité», «interprétation», «enseignement de la traduction», «traduction») are each associated with four articles, while the term «humanitaire» is defined as a keyword for five articles and the term «mondialisation» for seven. Note also that among the 62 terms associated with more than one article, only 26 (i.e. 41.9%) are associated with articles belonging to more than one journal.

Preliminary frequency analyses also show that in 141 out of 740 cases, the keyword occurs neither in the body nor in the abstract of the article it is associated with. In other words, for more than 19% of the keywords, their occurrence frequency in the article is zero; it is thus more than evident that one has to go beyond «raw» frequencies if the keyword extraction system is to reach a precision above 80% (Figure 1 shows the occurrence frequencies of keywords in their associated documents).

The objective of Track 1 is therefore to design a system which, starting from the training corpus files containing D_appr × T_appr = 140 × 666 = 93,240 (document, term) couplings, is able to determine the couples actually established by the authors of the documents.

2.1.2 Track 2

The training corpus contains 142 documents. Contrary to Track 1, no terminological list is provided; the search space in which keyword candidates must be sought is therefore much larger. However, the number of keywords associated with each article is given. Thanks to these quantities, provided in a tag of the XML documents, we know without looking at the reference file that the distribution of the Σ_i N_appr_i = 763 associations between keywords and articles has the following properties:
mean(N_appr) = 5.411; median(N_appr) = 5; min(N_appr) = 3; max(N_appr) = 13; sd(N_appr) = 1.404

[Figure 1: About 19% (in red) of the training-corpus keywords do not appear in the documents to which they are attributed.]

The analysis of the reference file reveals that among the 681 terms covering the set of all keywords of the Track 2 training corpus, 627 are associated with a single article, 37 with two, 12 with three, two terms («humanitaire» and «didactique») with four articles, the terms «identité» and «culture» with five articles each, and the term «traduction» with eight documents. Since the information concerning the journal an article belongs to is available, we also know that among the 54 terms associated with more than one article, only 18 (i.e. 33.3%) are associated with more than one journal. The analysis of keyword frequencies in the associated articles yields results pointing in the same direction as those of Track 1: in 145 cases (19%), the keywords do not appear in the article with which they were associated!
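To make the descriptive statistics above concrete, here is a minimal Python sketch computing them from a list of per-document keyword counts; the example list, the absent_ratio helper and its input structures are hypothetical stand-ins for data parsed from the DEFT XML and reference files, not the actual pipeline:

from statistics import mean, median, stdev

# Hypothetical per-document keyword counts parsed from the corpus.
keyword_counts = [5, 4, 6, 5, 7, 3, 5]

print("sum    :", sum(keyword_counts))
print("mean   :", round(mean(keyword_counts), 3))
print("median :", median(keyword_counts))
print("min/max:", min(keyword_counts), max(keyword_counts))
print("sd     :", round(stdev(keyword_counts), 3))

def absent_ratio(docs, gold):
    """Share of gold keywords never occurring verbatim in their article.
    docs: {doc_id: text}, gold: {doc_id: [keywords]} (both hypothetical)."""
    absent = total = 0
    for doc_id, kws in gold.items():
        text = docs[doc_id].lower()
        for kw in kws:
            total += 1
            if kw.lower() not in text:
                absent += 1
    return absent / total  # around 0.19 on the DEFT 2012 training data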
2.2 General statistics of the test corpus

2.2.1 Track 1

The Track 1 test corpus contains D_test = 94 documents. Its terminological list contains 478 unique terms. Among these 478 candidate terms, 435 are associated with a single document, 34 with two different documents, four terms with three articles and four terms with four articles; the most successful keyword is «identité», itself associated with six articles. Among the 43 terms associated with more than one article, 20 (i.e. 46.5%) are associated with articles belonging to more than one journal. The distribution of the number of keywords associated with the articles of the Track 1 test corpus (Σ_i N_test_i = 537) has the following properties:
mean(N_test) = 5.712; median(N_test) = 5; min(N_test) = 1; max(N_test) = 12; sd(N_test) = 1.701

2.2.2 Track 2

The distribution of the Σ_i N_test_i = 484 keywords attributed to the 93 documents of the Track 2 test corpus is characterized by the following measures:
mean(N_test) = 5.204; median(N_test) = 5; min(N_test) = 2; max(N_test) = 10; sd(N_test) = 1.323
Consulting the reference files released after the end of the competitive phase of DEFT 2012 tells us that among the 35 terms associated with more than one article, only 10 (i.e. 28.6%) are associated with articles belonging to more than one journal.

2.3 What else can be learned from the corpus?

A quick pass through the training corpus and the terminology provided for Track 1 shows that, beyond frequencies, the keywords chosen by the authors follow a few rules:
– keywords differ from one another: authors only rarely use keywords that are very close to each other;
– they are quite often repeated in the introduction and the conclusion of the article;
– their morphosyntactic or grammatical category is very rarely verbal; the most frequent keywords are nouns (common or proper), adjectives or noun phrases.
Moreover, as one would expect, keywords are strongly semantically related to their document, as Figure 2 shows.

[Figure 2: Document–keyword similarities (min, max, mean) vs. document–terminology similarity (mean).]

3 Semantic spaces

Vector models of word meaning form a family of models which represent the semantic similarity between words as a function of the textual environments in which these words appear. The word co-occurrence distribution of the corpus is collected, analyzed and then transformed into a semantic space in which words are represented as vectors in a high-dimensional vector space. LSA (Landauer and Dumais, 1997), HAL (Lund and Burgess, 1996) and RI (Kanerva et al., 2000) are a few examples. These models rest on the distributional hypothesis of (Harris, 1968), which states that words appearing in similar contexts have similar meanings. Characterizing the unit of context is a problem common to all these methods, and its definition differs across models. For example, LSA builds a word–document matrix in which each cell a_ij contains the frequency of word i in context unit j. HAL defines a sliding window of n words which traverses the corpus word by word, and then builds a word–word matrix in which each cell a_ij contains the frequency with which word i co-occurs with word j within the previously defined window.
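As an illustration of the word-space idea just described, the following minimal Python sketch builds an LSA-style word–document frequency matrix over a toy corpus and compares word vectors with the cosine measure discussed below; the toy documents and all names are invented for the example:

import math
from collections import Counter

# Toy corpus: each document is a list of (already lemmatized) tokens.
docs = [["traduction", "culture", "identite"],
        ["traduction", "corpus", "semantique"],
        ["culture", "identite", "mondialisation"]]

vocab = sorted({w for d in docs for w in d})

# LSA-style word-document matrix: row = word, column = document,
# cell a_ij = frequency of word i in document j.
matrix = {w: [Counter(d)[w] for d in docs] for w in vocab}

def cosine(u, v):
    """Cosine of the angle between two vectors: the classical
    approximation of semantic similarity in word-space models."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(matrix["culture"], matrix["identite"]))    # high: shared contexts
print(cosine(matrix["culture"], matrix["semantique"]))  # low: disjoint contexts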
Different mathematical methods for extracting the meaning of concepts by reducing the dimensionality of the co-occurrence space are then applied to the frequency distribution stored in the word–document or word–word matrix. The primary objective of these mathematical treatments is to extract the «patterns» accounting for frequency variations and to eliminate what may be considered «noise». LSA employs a general method of linear decomposition of a matrix into independent components: singular value decomposition (SVD). In HAL, the dimensionality of the space is reduced by retaining a restricted number of principal components of the co-occurrence matrix. At the end of this dimensionality-reduction process, the similarity between two words can be computed in different ways. Classically, the cosine of the angle between the two vectors corresponding to two words, or two groups of words, is computed in order to approximate their semantic similarity.

3.1 Reflective Random Indexing

The semantic-space construction method we use is Reflective Random Indexing (RRI) (Cohen et al., 2010a), a recent method based on random projection which differs markedly from other semantic-space construction methods. Its particularities are (i) that it does not build a co-occurrence matrix, and (ii) that, contrary to other vector models of semantic representation, it does not require heavy statistical processing such as the SVD used by LSA. RRI is based on random projection (Vempala, 2004; Bingham and Mannila, 2001), which scales much better to large numbers of documents. The construction of a semantic space with RRI proceeds as follows:
– Create a matrix A(d × n) containing index vectors, where d is the number of documents or contexts and n the number of dimensions chosen by the experimenter. Index vectors are sparse, randomly generated vectors.
– Create a matrix B(t × n) containing term vectors, where t is the number of distinct terms in the corpus. Initialize all these vectors with zero values to start the construction of the semantic space.
– For every document of the corpus, each time a term τ appears in a document δ, add the index vector of δ to the term vector of τ. At the end of the process, terms appearing in similar contexts have accumulated similar index vectors.

The «reflective» aspect of RRI consists in replaying several cycles of the three steps of the algorithm, no longer starting from random vectors but from the index vectors obtained for the documents. These cycles smooth out the random aspect of the space, and the system generally converges after a small number of cycles.

3.1.1 Semantic Vectors

Several free implementations of RRI are available; we use the Semantic Vectors library (Widdows and Cohen, 2010), available at http://code.google.com/p/semanticvectors/.
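The three-step construction and its reflective cycles can be summarized in a short sketch. This is not the Semantic Vectors implementation, just a minimal Python rendering of the algorithm as described above (vector normalization and sparse storage are omitted for brevity):

import random
from collections import defaultdict

def random_index_vector(n=200, seeds=10):
    """Sparse ternary index vector: mostly zeros, a few +1/-1 entries."""
    v = [0.0] * n
    for i in random.sample(range(n), seeds):
        v[i] = random.choice((-1.0, 1.0))
    return v

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def train_rri(docs, n=200, cycles=2):
    """docs: {doc_id: [tokens]}. Returns (term_vectors, doc_vectors)."""
    doc_vecs = {d: random_index_vector(n) for d in docs}   # matrix A(d x n)
    for _ in range(cycles):
        # term vectors start at zero (matrix B(t x n)) and accumulate
        # the vector of every document in which the term occurs
        term_vecs = defaultdict(lambda: [0.0] * n)
        for d, tokens in docs.items():
            for t in tokens:
                term_vecs[t] = add(term_vecs[t], doc_vecs[d])
        # "reflective" step: rebuild document vectors from the term
        # vectors just obtained, smoothing out the initial randomness
        doc_vecs = {d: [0.0] * n for d in docs}
        for d, tokens in docs.items():
            for t in tokens:
                doc_vecs[d] = add(doc_vecs[d], term_vecs[t])
    return term_vecs, doc_vecs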
Semantic Vectors presents a number of advantages over other libraries implementing RRI. On the one hand, it offers an implementation of RRI based on positional indexes (Cohen et al., 2010a), which builds the semantic space not from the occurrences of a term in a document but from a sliding window in the manner of HAL; this version of RRI captures, beyond information about term semantics, structural information about term proximity. On the other hand, Semantic Vectors implements a number of similarity measures between groups of words, in particular (i) «quantum disjunction» (Cohen et al., 2010b), which builds a volume corresponding to several terms in the semantic space and computes the distance between this volume and other terms or documents of the space; and (ii) «tensor similarity», which takes an ordered sequence of terms as input and computes its similarity with other ordered sequences, thereby exploiting the order information provided by the positional indexes.

Semantic Vectors is used in numerous applications. We have used it in our DEFT participations since the 2009 edition. In tasks close to the one at hand, the library has been used to compare RRI with other semantic-space methods for discovering relations between terms in a corpus (Rangan, 2011).

3.2 Enriching semantic spaces with linguistic information

In the keyword attribution problem, a proportion of the terms used as keywords are groups of words. The semantics associated with a group of words in a semantic space is not as precise as the semantics associated with a single word: it includes components of those words as used in other contexts. To handle the semantics of such word groups, some methods for representing meaning in semantic spaces, such as BEAGLE (Jones and Mewhort, 2007), PSI (Cohen et al., 2009), or RRI with positional indexes (Cohen et al., 2010b; Widdows and Cohen, 2010), can encode word-order information. We wanted to test another method, based on a surface linguistic analysis of the text.

The principle of this method is to identify candidate word groups in the text through a chunking phase (Abney, 1991), and then to build equivalence classes of chunks which share a majority of identical words (after lemmatization) and which are semantically close, judged by the semantics, in a «classical» semantic space, of the words they contain. The corpus is then transformed by replacing all the chunks of an equivalence class by one representative of the class, and a new semantic space is built from this new corpus; in this space the chunk-class representatives are treated as words.

For the needs of Track 1, the chunker was trained to consider as chunks all the multi-word keywords of the provided terminology. For Track 2, the same chunker, together with the chunk-class construction procedure, is used to build a list of candidate keywords.
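A minimal sketch of the chunk-class construction may help fix ideas. The lemmatize and similar functions are placeholders for the actual lemmatizer and for a cosine test in a word-level semantic space; the overlap threshold and all names are illustrative, and the corpus rewriting is deliberately naive:

def chunk_signature(chunk, lemmatize):
    """Bag of lemmas of a chunk, e.g. 'espaces sémantiques' -> {espace, sémantique}."""
    return frozenset(lemmatize(w) for w in chunk.split())

def equivalence_classes(chunks, lemmatize, similar, min_overlap=0.5):
    """Group chunks sharing a majority of lemmas AND semantically close
    (judged by `similar`, e.g. cosine in a word-level RRI space)."""
    classes = []  # list of (representative, members)
    for c in chunks:
        sig = chunk_signature(c, lemmatize)
        for cls in classes:
            rep_sig = chunk_signature(cls[0], lemmatize)
            overlap = len(sig & rep_sig) / max(len(sig | rep_sig), 1)
            if overlap >= min_overlap and similar(c, cls[0]):
                cls[1].append(c)
                break
        else:
            classes.append((c, [c]))
    return classes

def rewrite_corpus(text, classes):
    """Replace every member of a class by the class representative, so
    the new semantic space treats each chunk class as a single 'word'."""
    for rep, members in classes:
        for m in members:
            text = text.replace(m, rep.replace(" ", "_"))
    return text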
4 Keyword assignment as a mixed decision procedure

4.1 A Bayesian network for keyword assignment

Analyzing a corpus of articles, we first seek to determine the sizes of the various keywords attached to a given article. In a second step, we try to establish the probabilities that these keywords belong to a pre-established list. For each document of the corpus we have the following information:
– the lengths of the abstract l and of the text L;
– the journal R in which the article appeared;
– the number of keywords n and their respective sizes n_1, ..., n_n (i.e. the number of words composing them);
– the similarities with the whole keyword lexicon (d_1, ..., d_N), N being the size of the terminology;
– the keywords (kw_1, ..., kw_n).

The goal is thus to find relations between the exogenous variables (l, L, R, n, d_1, ..., d_N) which allow us to predict the behavior of the endogenous variables (n_1, ..., n_n, kw_1, ..., kw_n). To this end, a suitable knowledge-modeling formalism is needed. Bayesian networks (Barber, 2012), being graphical models with underlying probabilistic representations, appear particularly well suited to our case.

A Bayesian network B is a pair (G, θ) where G is a directed acyclic graph whose nodes represent a set of random variables X = {X_1, ..., X_n}, and θ_i = [P(X_i | Pa(X_i))] is the matrix of conditional probabilities of node i given the state of its parents Pa(X_i). The interest of Bayesian networks is that their graphical and probabilistic structures support a modular representation of knowledge and an interpretation of the data that is both quantitative and qualitative: the graph of a Bayesian network schematically represents the relations between the variables of the system to be modeled, while the probability distributions quantify these relations.

The model we propose to build is a Bayesian network with discrete variables (the journal name R, the keywords kw_i, their number n, their sizes n_i) and continuous variables (the lengths of the abstract l and of the article L, and the similarities to the terminology). It is a mixed model, called a conditional Gaussian model, in which the distribution of the continuous variables conditioned on the discrete variables is a multivariate Gaussian. This implies that arcs may go from discrete nodes to continuous nodes, but not the other way around, except when the continuous nodes are observable (which is our case).

Note also that the number of variables n_1, ..., n_n and kw_1, ..., kw_n varies with the number of keywords n; since the number of nodes of a Bayesian network is fixed, we define n_1, ..., n_25 as the sizes of the keywords, with n_i = 0 if i > n, and kw_1, ..., kw_25 as the keywords, with kw_i = NULL if i > n.

To summarize, we have the following random variables, represented by the nodes of the Bayesian network we seek to build:
– R, the journal name (discrete variable with 4 possible values);
– l, the length of the abstract (continuous variable);
– L, the length of the article (continuous variable);
– n, the number of keywords (discrete variable with 25 possible values);
– n_1, ..., n_25, the sizes of the keywords (discrete variables with 11 possible values);
– d_1, ..., d_1062, the similarities to the set of keywords (continuous variables);
– kw_1, ..., kw_25, the keywords (discrete variables with 1062 possible values).

Observation of the distribution of documents across journals shows that it is similar in the training and test corpora, which implies that the bias introduced by this distribution will not impact the performance of the model to be built. The mean lengths of abstracts l and of articles L are of the same order of magnitude. These means are admittedly not identical in the training and test corpora, but they are distributed in the same way, i.e. the abstract (respectively article) lengths in the training corpus equal those of the test corpus up to a common factor. Note also that article lengths and abstract lengths are not distributed in the same way; this means that besides the obvious direct relation between these two variables, they probably share a common cause, which translates in the structure of the Bayesian network into the presence of a common parent.

The distributions of the number of words per article (respectively per abstract) can be approximated by Gaussian mixtures. These histograms are similar for the whole corpus and for the training corpus, which shows that the studied sample can be considered representative of the problem. However, the relative disparity observed between the test and training corpora will probably create a bias problem that must be taken into account during model construction. The histograms of the number of words per article (respectively per abstract) differ across journals; these variables are therefore directly linked to the identity of the journal. The resulting distributions have arbitrary shapes, yet we note that they can be approximated by Gaussian mixtures, which reinforces our choice of a conditional Gaussian model to represent these variables in a Bayesian network.

Observing the trend of the mean similarities to the keyword terminology across the different parts of the corpus, we note that it has the same shape (indeed almost the same curve) in all cases (whole corpus, training corpus, individual journals, ...). This allows us to suppose that keyword selection proceeds in strictly the same manner everywhere, so that the idea of building a mathematical model of it is perfectly coherent.

On the basis of these observations, we can propose a structure for the Bayesian network relating the variables of our problem. By convention, discrete variables are represented by square nodes, continuous variables by round nodes and observable variables by shaded nodes (Figure 3).

[Figure 3: Structure of the Bayesian network learned on the corpus.]
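As a toy illustration of the conditional Gaussian idea, the sketch below learns, for a single discrete-to-continuous arc such as R → l, one Gaussian per journal and evaluates the likelihood of an observed abstract length. The numbers are invented; this is a minimal sketch of the principle, not the model actually trained on the corpus:

import math
from collections import defaultdict

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_conditional_gaussian(samples):
    """samples: list of (journal, abstract_length). Learns, for the arc
    R -> l of the network, one Gaussian per value of the discrete parent."""
    by_journal = defaultdict(list)
    for journal, length in samples:
        by_journal[journal].append(length)
    params = {}
    for journal, lengths in by_journal.items():
        mu = sum(lengths) / len(lengths)
        var = sum((x - mu) ** 2 for x in lengths) / max(len(lengths) - 1, 1)
        params[journal] = (mu, math.sqrt(var) or 1.0)
    return params

# Toy training data: (journal id, abstract length in words).
params = fit_conditional_gaussian([("J1", 120), ("J1", 135), ("J2", 80), ("J2", 95)])
print(gaussian_pdf(128, *params["J1"]))  # likelihood of l given R = J1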
4.2 Combining statistical decisions with rule-based reasoning

Recent work in artificial intelligence on combining statistical decision methods with production-rule reasoning, such as the Probabilistic Production Rules (PPR) of (Aït-Kaci and Bonnard, 2011), offers a framework for modeling a decision procedure which takes into account both what is learned by the Bayesian network described above and the symbolic knowledge encoded in rules about keyword choice, of which we gave examples in Section 2.3.

The decision system built on PPR works by computing a score for each keyword for a given document. This computation uses rules which may call upon the Bayesian network. For example, the regularity «keywords differ from one another» can be translated into the production rule «if two keywords are close, then increase the score of the one with the higher probability of being a keyword of the document, and decrease the other», which is written:

IF similarity(kw1, kw2) > threshold AND bnproba(kw1|doc) > bnproba(kw2|doc)
THEN increase-score(kw1, doc) AND decrease-score(kw2, doc)

The rule system we used contains about fifteen rules; lack of space prevents us from detailing them here.
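Rendered as executable pseudo-code, such a rule might look as follows; similarity, bnproba, THRESHOLD and delta are placeholders for the semantic-space cosine, the posterior given by the Bayesian network of Section 4.1, and tuned constants, none of which are specified in the original paper:

THRESHOLD = 0.8  # hypothetical similarity threshold

def apply_distinctness_rule(candidates, doc, similarity, bnproba, scores, delta=0.1):
    """Production rule 'keywords differ from one another': when two
    candidates are too close, boost the one the Bayesian network judges
    more probable for this document and penalize the other."""
    for i, kw1 in enumerate(candidates):
        for kw2 in candidates[i + 1:]:
            if similarity(kw1, kw2) > THRESHOLD:
                if bnproba(kw1, doc) > bnproba(kw2, doc):
                    scores[kw1] += delta
                    scores[kw2] -= delta
                else:
                    scores[kw2] += delta
                    scores[kw1] -= delta
    return scores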
5 Submitted runs

Table 1 summarizes the runs submitted by our team. The results are very satisfactory for all the approaches we used, the mean F-score over all participants being 0.3575 for Track 1 and 0.2045 for Track 2. Note that the first runs of both tracks (1.1 and 2.1), which constitute our baselines, already give correct results in relatively short times.

Run   Precision   Recall   F-score   Time (s)
1.1   0.4618      0.4618   0.4618    2
1.2   0.9479      0.9497   0.9483    7590
1.3   0.7486      0.7486   0.7486    –
2.1   0.2438      0.2438   0.2438    26
2.2   0.3471      0.3471   0.3471    269
2.3   0.5879      0.5867   0.5873    12700

Table 1 – Submitted results: performance and execution time

5.1 Track 1

5.1.1 Run 1.1 – baseline: RRI and k-NN

In this run, which constitutes our baseline, we built an RRI semantic space from all the documents of the corpus (training + test), a document being the concatenation of the abstract and the body of the article. Then, for each document d of the test corpus, we retained as keywords the k nearest neighbors of the document within the terminology, k being the number of keywords for document d. The vector of a keyword kw_i composed of the words w_1, ..., w_n is obtained by summing the vectors of the words it contains:

$\vec{kw}_i = \sum_{j=1}^{n} \vec{w}_j$   (1)

5.1.2 Run 1.2 – RRI(chunks), BN and rules

In this run, which obtained the best result, we built an «enriched» semantic space as described in Section 3.2, but in which a document was represented by four vectors: one for the abstract, one for the body of the article, and two vectors for the first and last paragraphs of the article (which we took as approximations of the introduction and the conclusion). We then learned the Bayesian network described in Section 4.1, using the document–keyword distances obtained in this space. Finally, we used the decision procedure described in Section 4.2 to assign a score to each keyword; the retained keywords are the k with the highest scores (k being the number of keywords for the document).

5.1.3 Run 1.3

In this run we combined the results of runs 1.1 and 1.2, giving a slight preference to candidate terms which are longer than other candidate terms. We thus combined, for example, the candidate terms of run 1.1: Catalogne; Narotzky; conflit; contexte; district industriel; femmes; production traductionnelle; production écrite; réseau with the candidate terms of run 1.2: Espagne; Narotzky; anthropologie économique; district industriel; féminisme; histoire; réseaux de production; économie politique; économie régionale to obtain the candidate list of run 1.3: district industriel; réseaux de production; économie politique; production traductionnelle; anthropologie économique; Narotzky; économie régionale; production écrite; féminisme.

The score of a candidate was computed by the formula:

$score = F_r \times (l - F_a)$   (2)

where F_r is the relative frequency of the candidate term in the analyzed article, F_a is the absolute frequency of the candidate term in all the articles of the corpus, and l is the number of characters of the candidate term.
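A minimal sketch of formula (2), with single-word tokens for simplicity (the actual system also scores multi-word candidates); the toy data are invented:

def run13_score(term, article_tokens, corpus_tokens):
    """Formula (2): relative frequency in the article times
    (term length in characters minus absolute corpus frequency),
    favouring long, article-specific candidate terms."""
    f_rel = article_tokens.count(term) / max(len(article_tokens), 1)
    f_abs = corpus_tokens.count(term)
    return f_rel * (len(term) - f_abs)

article = ["district", "industriel", "district", "réseau"]
corpus = article + ["district", "femmes", "réseau", "réseau"]
print(run13_score("district", article, corpus))  # longer & rarer -> higher score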
5.2 Track 2

5.2.1 Run 2.1 – baseline: RRI and k-NN

This run is identical to the first run of Track 1 (5.1.1); the terminology, obtained by the method described in Section 3.2, contains 3000 candidate keywords.

5.2.2 Run 2.2 – RRI(PositionalIndex), tensor similarity and k-NN

In this second run we used the same terminology as for 2.1, but the semantic space was built using RRI over positional indexes. The computation of keyword vectors uses the tensor operator of Semantic Vectors. The keywords retained for a document d are the k nearest neighbors of d within the terminology, k being the number of keywords for document d.

5.2.3 Run 2.3 – RRI(chunks), BN and rules

This run is identical to the second run of Track 1 described in 5.1.2; the terminology, obtained by the method described in Section 3.2 and augmented with the keywords of the training corpus, contains 3270 candidate keywords.

5.3 Discussion

Runs 1.2 and 2.3 obtain the best results, which supports our initial hypotheses. The official runs do not allow us to compare the performance of the spaces «enriched» with chunks against the RRI spaces with positional indexes; we therefore performed an additional run 2.2bis, with an «enriched» space and k-NN, obtaining an F-score of 0.4186, which is noticeably better than run 2.2.

Recall that for run 1.3 we combined the results of 1.1 and 1.2, giving more weight to long candidate terms (this rule not having been included in the rule system described in Section 4.2). Given that the resulting F-score (0.7486) lies halfway between the F-scores of 1.1 and 1.2, we cannot really conclude as to the pertinence of this rule.

Conclusion

In this article we presented a system for attributing keywords to scientific articles, based on semantic spaces built with RRI. We then tried to improve the performance of the system in two ways: (i) by enriching the semantic spaces with information produced by a surface linguistic analysis, and (ii) by defining a decision procedure based on a combination of Bayesian networks and rule-based systems. The results obtained show that both hypotheses paid off: they noticeably improve upon the results obtained by an RRI-only approach (which already achieves honorable results).

References

ABNEY, S. (1991). Parsing by chunks. In Principle-Based Parsing. Kluwer Academic Publishers.
AÏT-KACI, H. and BONNARD, P. (2011). Probabilistic production rules. Technical report, IBM.
BARBER, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.
BINGHAM, E. and MANNILA, H. (2001). Random projection in dimensionality reduction: Applications to image and text data. In Knowledge Discovery and Data Mining, pages 245–250. ACM Press.
COHEN, T., SCHVANEVELDT, R. and RINDFLESCH, T. (2009). Predication-based semantic indexing: Permutations as a means to encode predications in semantic space. In Proceedings of the AMIA Annual Symposium, pages 114–118.
COHEN, T., SCHVANEVELDT, R. and WIDDOWS, D. (2010a). Reflective random indexing and indirect inference: A scalable method for the discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.
COHEN, T., WIDDOWS, D., SCHVANEVELDT, R. and RINDFLESCH, T. (2010b). Logical leaps and quantum connectives: Forging paths through predication space. In Proceedings of the AAAI Fall 2010 Symposium on Quantum Informatics for Cognitive, Social and Semantic Processes (QI-2010).
EL GHALI, A. (2011). Expérimentations autour des espaces sémantiques hybrides. In Actes de l'atelier DEFT'2011, Montpellier.
HARRIS, Z. (1968). Mathematical Structures of Language. John Wiley and Sons, New York.
JONES, M. N. and MEWHORT, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114(1):1–37.
KANERVA, P., KRISTOFERSON, J. and HOLST, A. (2000). Random indexing of text samples for latent semantic analysis. In GLEITMAN, L. and JOSHI, A., editors, Proceedings of the 22nd Annual Conference of the Cognitive Science Society, Mahwah. Lawrence Erlbaum Associates.
LANDAUER, T. K. and DUMAIS, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211–240.
LUND, K. and BURGESS, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments & Computers, 28(2):203–208.
RANGAN, V. (2011). Discovery of related terms in a corpus using reflective random indexing. In Proceedings of the Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings (DESI-4).
SAHLGREN, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Doctoral dissertation, Department of Linguistics, Stockholm University.
VEMPALA, S. S. (2004). The Random Projection Method, volume 65 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society.
WIDDOWS, D. and COHEN, T. (2010). The Semantic Vectors package: New algorithms and public tools for distributional semantics.
In Proceedings of the Fourth IEEE International Conference on Semantic Computing (IEEE ICSC2010).

Parallel Democracy Model and Its First Implementations in the Cyberspace

Daniel Devatman Hromada*

Abstract

The parallel democracy model is a variant of the traditional participative democracy approach, aiming to provide a new method of convergence toward quasi-optimal solutions of diverse perennial political and social challenges. Within the framework of the model, such challenges are operationalized into variables with which the authority can associate possible values. The value assigned to a variable at a moment T of the system's history is called its active value. Every couple {variable, active value} represents a functional property of a given society, and the genomic vector of such properties can not only describe, but also determine, the functioning of the society in question. In the majority of occidental societies, the authority to assign values to variables is delegated to a parliament, which assigns values to variables by means of aggregated voting. The government or other institutional bodies subsequently execute actions according to the activated values. In almost all modern political systems, the value-to-variable assignment is done in sequential (serial) order, for example by voting for one law proposal after another during a parliamentary session or a referendum. Any reform of the system —an update of multiple variables— is a costly process, because a change of value for almost any variable requires a new vote which, in the majority of societies, involves the relocation of vote-givers to a vote-giving location during a period dedicated to voting. The progress of quasi-perennial storage systems and databases, in combination with communication networks, makes it possible to aggregate and store information about the numbers of votes related to a potentially infinite number of variables. Therefore the method by means of which values are activated and assigned to variables does not necessarily need to be sequential. Given the condition that the act of voting is identical to the incrementation of the chosen value stored on the medium, it is no longer necessary for the decision-makers to meet at one point in time in order to give their vote to the value they want to activate. Even the condition that they must meet in one place is weakened, since the vote-giving place can be purely virtual. The kyberia.sk and kyberia.cz domains are domains where the first tentatives to implement such a system were realized «in vivo» for a limited set of variables. The aggregated voting of already registered kyberia.cz users determines the number of votes which the registration application of a new user needs to obtain in order to be accepted. Kyberia.sk senators can decide what motto shall be displayed at the top of the page —the option which receives the biggest number of votes becomes the active title. Since both variables are internal constants of kyberia's engine, no human intervention is necessary after a value becomes the variable's active value, and the system automatically reconfigures its own code.

Keywords: Parallel democracy model. Participative democracy. Vote aggregation. Auto-configuration of social networks. Genome of political bodies.

* STU / EPHE / Paris 8, hromi@kyberia.sk. Originally published in Teoria Politica, Nuova Serie, Annali III, 2013: 165-180.
1. From perennial challenges to properties of political systems

One cannot speak about human society while ignoring the innate nature of human beings. And because the innate nature of human beings changes only slowly in the course of the evolution of the homo sapiens sapiens species, and because at least some features of this «nature of human beings» —be it loving, learning, laughing, looking or listening— seem to be constant and omnipresent among all human beings, it is not unimaginable that this very human nature causes, and shall cause, certain challenges to appear and reappear in any human and/or transhuman society imaginable and conceivable. We label as «perennial» such challenges as cannot be ignored by any human society. In order to avoid the useless metaphysical debates which could stem from such a definition, we specify that in the rest of this article the following definition shall be adopted: «a perennial challenge (PC) is a challenge implicitly or explicitly addressed by all documented societies of human history».

To be more concrete, we may consider questions of the form «shall X be Y in our polis?» as representatives of such challenges, no matter whether X means «death penalty, slavery, wine-drinking, meat-eating or prostitution» and no matter whether Y means «forbidden, permitted, or obligatory». Throughout the course of human history, such Xs and Ys were, in one form or another, represented in the minds of all the individuals forming a given society. The thing which changed, the structure which evolved, was only the weighted network of associations among such X & Y representations. One of the main objectives of the social sciences of the last century was to unveil the structure of such networks (Bourdieu, 1984; Lévi-Strauss, 1967; Saussure et al., 1995), i.e. what Xs were connected with what Ys, and to find the raison d'être of such connections in the underlying totality of the graph-like system. Political science, on the other hand, focuses attention upon a different question: «who is the source of authority?», «who holds the power?», «WHO associates values to X in this polis?». The objective of this article is to present the Parallel Democracy Model (PDM), within the scope of which the answer to this question is «everybody», and to present its first naive implementations within the framework of the virtual communities kyberia.sk and kyberia.cz.

2. From properties of political systems to typed variables and their values

A property of a political system is, within the theoretical framework of the Parallel Democracy Model (PDM), defined by a couple {typed variable, active value}. A typed variable can be defined as a «uniquely defined conceptual entity (the variable) and a set (called its type), consisting of all the values that the entity may take» (Floridi, 2011). An active value is the member of the set of possible values of the variable which is assigned to the variable at time T.

Intuitively, a variable can be imagined as a box, with the label on the box being the variable's name and the content of the box being the value of the variable. One can have many boxes with many different labels —one can have many variables. One can have boxes for shoes and one can have boxes for food— there are many types of variables. Since the set of possible sets is infinite, infinitely many types of variables can exist. Only some of them are of particular practical interest for PDM, more concretely:
1. boolean – has two members: {true, false};
2. integer – the set of all integers;
3. real – the set of all real numbers;
4. probability – the set of all real numbers from the interval <0, 1>;
5. text – the set of all possible strings of symbols;
6. legality – a set of three members: {permitted, obligatory, forbidden};
7. formula – the set of all possible mathematical formulae.

As was indicated above, every property of a political system, when operationalized into a variable, represents a challenge with which every society and every polis has to deal, in one way or another. One can easily imagine a variable $immigrants (every variable is denoted by a $ prefix) representing the number of immigrants which a given polis is ready to integrate during a certain $interval of time. The variable $immigrants would be of type integer if it is defined in absolute terms; it would be of type real or probability if it is defined relative to the size of the polis (i.e. 0.2, or 1.2 if the polis is ready to increase its population up to 120% of its original size). A boolean-typed variable can encode a property of the polis which can either exist or not, e.g. $has_basileus, in order to encode the possibility that the polis has (or has not) its βασιλεύς; $deathpenalty_exists, in order to encode whether the death penalty is legal (or not) in the polis; or $has_pdm, in order to denote the difference between a polis which is fully PDM-compliant and one which is not. Other variables already implemented in real-life scenarios will be mentioned in the following paragraphs in order to clarify our point.

To every variable, an «active» value is assigned in every moment of the variable's history. It is possible to have «array» variables which are associated with more than one active value at a given moment; for pedagogic reasons we shall not, however, deal with such cases in the current article, and if ever the need to assign two values simultaneously should emerge, we shall assign them to two distinct scalar variables. Thus, within the following framework, a variable always contains one and only one active value at a given moment T of its history. We specify that the value assigned to the variable at moment T is the active value, in contrast to the other members of the set specifying the variable's type, which are, at moment T, just «potentially activable values». Analogically to the box engraved with the label $shoe, which today contains my winter shoes and which could, possibly, sometime in the future contain my summer shoes, one can imagine the variable $has_basileus to which the value «false» is assigned today (i.e. «false» is the active value) in a polis which lacks its basileus, but to which, possibly, the value «true» shall be assigned tomorrow (i.e. «true» is a potentially activable value). The rationale behind this analogy is simple: not to forget that values are assigned to the variable in a mutually exclusive way. Simply stated, there shall always be one and only one pair of shoes in the box at any given moment, and whether they are the boots or the sandals will undoubtedly determine the extent to which I'll feel comfortable after I decide to put the content of the box on and walk out into the rain.
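For readers who think in code, here is a possible minimal Python rendering of a PDM typed variable with mutually exclusive assignment of the active value; the class layout is purely illustrative and is not taken from the kyberia engine:

class TypedVariable:
    def __init__(self, name, type_members, active=None):
        self.name = name
        self.type = set(type_members)  # all values the variable may take
        self.active = active           # exactly one active value at time T

    def assign(self, value):
        """Mutually exclusive assignment: the new active value replaces
        the old one, which becomes merely 'potentially activable' again."""
        if value not in self.type:
            raise ValueError(f"{value!r} is not in the type of {self.name}")
        self.active = value

has_basileus = TypedVariable("$has_basileus", {True, False}, active=False)
has_basileus.assign(True)  # the polis crowns a basileus tomorrow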
3. From properties represented by variables to systems represented by vectors of variables

We believe that a political system —be it a polis or a sultanate, a republic or an empire— can be described in terms of the set of its properties. As we have indicated above, properties are often closely related to a challenge with which a society has to deal, in one way or another. Such a challenge can almost always be conceived as a question of the form «what value Y shall be assigned to variable X?». For example, the very ancient challenge of «whether or not to accept aliens in one's polis» can be formalized as a variable $accept_immigrants of type boolean, or in more evolved systems as a variable of type integer or real (cf. above) which has the value 0 (i.e. no immigrants) as its limit option among multitudes of other options. The way each society deals with a given challenge yields a property of that society and can be operationalized into a variable. In every moment of the system's history, a certain value is assigned to that variable, and the sequence —or rather the vector— of such values can yield a description of the system in question.

We specify that a vector of length N is an ordered sequence of N values; therefore, any element of a vector can represent a variable. Let's imagine, for example, a simplistic vector of length 2 defined as a sequence of 2 boolean variables [$has_basileus, $has_parliament]. Under such an interpretation, the vector having values [1, 0] represents autocracy; the vector [1, 1] represents either a constitutional monarchy or a presidential democracy in which the rights of the president are so strong that he can even be considered to be the basileus; the vector [0, 1] can represent a parliamentary democracy with the basileus role non-existent or reduced to ceremonial purposes; and the vector [0, 0] can represent a system with neither basileus nor parliament, e.g. an anarchy, oligarchy, mediarchy etc. We do not pretend that such a simplistic vector composed of 2 boolean variables could be of much practical use; we present it only as a paedagogic example whose secondary effect is to suggest that even a highly general and abstract functional level —described, in the case of a modern society, in the Constitution— could possibly be addressed by our formalism.

It is not completely hors propos to imagine a research program aiming to 1) enumerate all existing or known political systems, 2) find invariants among them, 3) operationalize those invariants into variables with sets of possible values, and 4) subsequently describe every unique political system by a unique vector of values. The output of such a «herculean task», if ever finished, would be a set of vectors —a dataset— describing the properties of political systems during different moments of recorded history. It may be the case that a purely mathematical or topological study of such vector ensembles (i.e. vector spaces) would indicate, among other things, that only a very small part of the search space «of all possible configurations of a political system» has been explored in the course of human history, and that multitudes of theoretically stable political configurations are still to be discovered.
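The toy vector above can be rendered as a simple lookup from genomes to labels; a hypothetical sketch, with labels taken from the paedagogic example:

# Describing polities as vectors over the two boolean variables
# [$has_basileus, $has_parliament] from the example above.
GENOME_LABELS = {
    (1, 0): "autocracy",
    (1, 1): "constitutional monarchy / strong presidential democracy",
    (0, 1): "parliamentary democracy",
    (0, 0): "anarchy, oligarchy, mediarchy, ...",
}

def describe(genome):
    return GENOME_LABELS.get(tuple(genome), "unexplored configuration")

print(describe([0, 1]))  # -> parliamentary democracy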
4. From descriptive vectors to normative vectors

What has been said until now shall hardly be of big surprise to the expert in political theory. In one way or another, every political or historical theory from Aristotle to Toynbee addresses the same problem: to describe how, by what means and by what individual or social body the values of the variables determining the properties of a given society have been set, and what the most common (cor)relations between different variable values are. The turning point occurs when one realizes that the formal framework presented hereby —i.e. the framework which allows us to formalize political systems into vectors of variables— can be exploited not only for descriptive & explicative purposes, but that it can be normative as well. In other terms: vector-like representations of political systems can help us not only to understand the systems under study; they can allow us to «run» them in an unprecedented way. But before we explain how this turn from descriptivity to normativity can occur in the case of political science, let's take some inspiration from biology.

Less than a century after Darwin's theory of evolution suggested that there exists a material substrate of heredity, such a substrate was discovered by Watson & Crick, having the form of the DNA molecule. In modern science, this molecule is conceptualized as a genome, which can be defined as an ordered sequence of genes. A gene can be defined as a «locatable region of genomic sequence corresponding to a unit of inheritance» (Pearson, 2006). From the point of view of this article, a gene is simply a variable which can code different values; for example the gene $eye_color is of type {green, blue, brown, grey, ...}. The values of the different variables determine the biochemical «unfolding» of the developmental procedure whose output is an individual living being phenotypically expressed by different properties.

Ceteris paribus, one can imagine that a society —with its laws, functions, rituals, institutions etc.— is a phenotypical expression of a vector of values —the genome— which is normally implicitly encoded in artifacts, in books of laws, or as distributed holographic information in the brains of the members of the society under study. In order to form a functional political body, every society has to integrate 1) a certain set of institutions which «execute» certain actions according to the active values of the variables (e.g. the tax collector will behave according to the value set in the variable $tax_rate), and 2) a certain set of procedures which specify how and by whom the different elements of the «society's genome» are updated. Often, these procedures are self-referential, in the sense that not only is their very execution governed by values encoded in the society's genomic vector (i.e. input parameters), but the result of their action (i.e. the output) can be formalized as an update of a value (or a set of values) of the same initial genomic vector.

As an example of such a vector update determined by the values in the very same vector, we may take a micro-society encoded by a vector of length 2, composed of a variable $tax_rate_updator having the two possible values {basileus, parliament} as its type, and a variable $tax_rate having a real value from the interval <0, 1> as its type. Subsequently, a procedure update_tax_rate() can be defined which, when executed, consults the information source referenced by the active value of the variable $tax_rate_updator in order to set the value of the variable $tax_rate.
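A sketch of this self-referential update, under the assumption that the consulted sources can be modeled as simple functions; all names and numbers are illustrative:

genome = {"$tax_rate_updator": "parliament", "$tax_rate": 0.23}

def ask_basileus():
    return 0.07  # the monarch decrees a 7% tax

def ask_parliament():
    return 0.25  # result of an aggregated parliamentary vote

SOURCES = {"basileus": ask_basileus, "parliament": ask_parliament}

def update_tax_rate():
    """Consults the source referenced by the active value of
    $tax_rate_updator and assigns the result to $tax_rate."""
    updator = genome["$tax_rate_updator"]     # input parameter from the genome
    genome["$tax_rate"] = SOURCES[updator]()  # output: update of the same genome

update_tax_rate()
print(genome)  # {'$tax_rate_updator': 'parliament', '$tax_rate': 0.25}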
Notice that the execution of the procedure can be fully automatic (procedures are denoted by a () suffix), since the only thing which is in reality going on all the time is the assignment of values to variables. We hope that at this point it is evident to the reader that, in such a case, the vector [basileus, 0.07] would be the genome of a political system where only the monarch has the right to change the tax rate from 7% of one's income, while the genome [parliament, 0.23] would encode a society whereby only the parliament has the right to change the current 23% tax rate.

We hope this small example makes it somewhat clearer that the formalism presented hereby can be exploited not only for descriptive purposes. Verily, the objective is not to offer «another formal descriptive framework» for the social sciences, for this has already been done in multitudes of papers. The objective is to indicate that in a world where many executive procedures like collect_taxes() can be fully automatized by computer programs or other artificial agents serving as tax collectors, the vectors we speak about can determine the functioning of the society, and do so in the strongest possible sense. To take our analogy from the natural sciences somewhat further, we specify that the inspiration for what will be presented in the rest of this article does not come from descriptive sciences like genetics or biochemistry; on the contrary, our aim is inspired by the constructive aims of genetic engineering.

5. From Serial Model to Parallel Model

Who can prove that from the earliest human societies until our present situation, the functioning of social bodies —be it a tribe of Pygmies, the macroanthropos (Plato, 2009) of classical Athens, or the European Union— has not been determined by some genome-like vector of values? It is true that the representational medium of the genomic vector changes: from the pure wetware (i.e. a brain or set of brains) of pre-literate societies, through the pillars of Ashoka the Great, towards the multilingual norms of the Union, stored and backed up in parallel in dozens of books, digital corpora or even cities. It seems, however, that one thing hasn't really changed, and that is the method by means of which the variables of utmost importance are updated in almost all existing societies. We label this method the central dogma of the Serial Model (SM) and define it as follows: «within the Serial Model, the values of the variables which determine the functioning of the society are updated in serial order, one after another».

Caesar gives a list to his scribe and orders him: «You go to the Forum Romanum, and first you engrave the first law into the stone; afterwards thou shalt engrave the second». The parliament meets, discusses one proposal, then votes on it, and only if the proposal is voted for by the majority shall the genome of the state be modified; afterwards another proposal is discussed and voted on. The university's council meets and discusses the order of the day, point after point, vote after vote. The temporal preposition «after» is crucial here.

But certain deviations from the central method exist even in a world governed by SM. Mostly they are due to the mutual independence of the institutions to which the «authority to update certain parts of the vector» has been assigned.
A circulaire de ministère can be distributed at the same moment a new law is passed. A coup d'état can occur in a society if the president tends to assign to variable X a different value than his strongest general does, or even if they both update different variables with such values that the two resulting vectors (president-generated vs. general-generated) diverge to such a degree of orthogonality (note that orthogonality is a geometric term) that they cannot be considered consistent anymore. Or, in a somewhat extreme but pedagogically useful case, one can imagine a Muslim scholar articulating a fatwa obliging nudity at noon in the midst of a sultanate which has just accepted sharia law. Theoretically, even under SM, such updates of different parts of the vector can occur on the very same day, even at the very same moment, because different agents can modify the values of different variables contained within the genomic vector.

Traditionally, such cases were considered to be a «bug» of the political system, and an extraordinary amount of intellectual power was invested, in the course of human history, into bug-proofing different systems by adding new watchdog institutions, or by proposing new sets of variables pretending to be hierarchically superordinate to already existing ones, as is the case for the Constitution. The final result, however, is that the length of society's genomic vector —i.e. the number of variables to be set— grows, becoming less and less comprehensible for the common human being whom it was supposed to serve in its very beginning (Hobbes, 2011), hence bringing with itself ever more room for disharmony and (cor)ruption.

But what may seem to be a bug when interpreted through the prism of the level of abstraction (Floridi, 2011) of the Serial Model can turn out to be a feature when another level of abstraction is involved in the interpretation. Such is the case for the Parallel Model, whose central dogma can be stated as follows: «Within the Parallel Model, a new value can get assigned to any variable in any moment, independently of the moment of assignment of a value to any other variable. Theoretically, the values of all variables can be changed in the very same moment or in any other moment».

Seemingly tautological and therefore useless, the preceding definition can nonetheless lead to an unprecedented sort of «transvaluation of all values» (Nietzsche, 1969) in the political domain. While all transformations in the political domain —be they small-scale reforms or full-fledged social revolutions— have simply updated the values of a few variables, of certain parts of societies' genomic vectors, a possible transition from SM to PM is not a change of content. It is a change of form; more concretely, it is realized by a transformation of the form of the procedure of voting.
6. From Parallel Model to Parallel Democratic Model

We believe that a transition from SM to PM is possible because of 1) the development of information-storage media which can be accessed for viewing and updating independently of temporal constraints, 2) the development of communication networks which allow us to access or update the informational content stored on such media independently of spatial constraints, and 3) such an extensive presence of information & communication technologies (ICT) that, at the beginning of the so-called 3rd millennium, a critical mass of the inhabitants of planet Earth can access (e.g. google) or even update (e.g. wikipedia) certain pools of informational content. In its very essence, the genomic vector —i.e. an ordered sequence of variables describing and governing the functioning of a political body— is a piece of informational content. Hence it can be stored on information-storage media and accessed or updated by means of ICTs.

To make the genomic vector of one's own polis, or at least certain parts of it, accessible and updatable by all, or at least by the biggest possible number of independent human agents, is the goal of all those who strive for participative democracy. However, even the most radical proponents of participative democracy sometimes fail to realize that the way a society stores and aggregates information strongly influences the way it can function as a political body. We have already addressed the question of storage when stating that the genomic vector of a pre-literate society was stored in a distributed fashion in the brains of a critical mass of the members of such societies (older members often had the decisive word in case of a «data check-sum error»), and indicated that a completely new system of legal formulae and institutions could have emerged from religious rituals (Coulanges, 2010) only because of the advent of writing and, later, of the printing press.

Contrary to writing, press or television, which allow many to get into passive contact with information stemming from a unique source of content, modern ICTs allow many to get into active contact with the medium encoding the informational content (McLuhan, 1965). Thanks to ICTs, shared information can be not only viewed but also updated by anyone. What's more, multiple updates of multiple informational contents can be realized in the same moment. This is of crucial importance for the implementation of PM.

It is debatable to what extent one can have a legal system which swiftly adapts to the ever-accelerating transformations of the external world if one relies solely on a printing press, where every law becomes widely known only after 1) the authority has stated the law, 2) the statement of the law has been published by means of a costly process of book preparation and printing, and 3) the book has been distributed and has reached its target. It is, however, non-debatable that, in the end, the local lawyer whose practice could be substantially transformed from the very moment he receives the new collection of laws shall have few possibilities to influence the edition of the next volume of the book. He can view, but he cannot update. And even if he could update —for example because he is lucky, virtuous or corrupt enough to be a member of parliament— his overall contribution to society's welfare is more than doubtful, since even with the best will possible he shall be, more often than not, obliged to attribute values to variables which do not concern his domain of expertise.

Voting is the most fundamental form of opinion aggregation; it is implemented in many social bodies in order to assign a certain value to a certain variable or set of variables. In its most common, SM-consistent form, the act of voting requires the voting agent to cast his vote at a voting place during a temporal interval dedicated to voting.
A vote concerns either only one variable (in case of the most simplistic yes/no referendum) or a bundle of variables (in case of passing a complex law in the parliament). Subsequently, the votes are aggregated by a voting committee (in case of elections) or by an automatic vote-aggregating device (in case of parliaments), and according to the result of the aggregation, the variable concerned by the voting is assigned (or not) a new value. Only afterwards can the body of vote-givers proceed to another vote.

13 Coulanges, 2010.
14 McLuhan, 1965.

Let's now imagine a voting scenario for PM. One can imagine, for example, a tribe inhabiting a village located in an environment so hostile that in every moment of the existence of the village, at least two thirds of the adult men are patrolling at different spots on the circumference of the village. It happens from time to time that some warriors die in battle, and sometimes their chieftain dies as well. Given a security constraint that forbids the majority of men to meet at one spot and vote, thus leaving the perimeter of the village unprotected, what method could assure that the village shall always have a chieftain respected by the biggest number of his comrades?

One can imagine the following answer: to every man of the tribe, a distinct color is associated, be it the color that only the man himself can mix. It does not really matter whether the knowledge of the color's preparation was revealed to a given individual during a certain rite of passage or whether it was transferred to him by his father; what matters is that any adult member of the tribe can use a distinct color as his unique identification token. In the middle of the village, there is a group of totems. One of the most central totems is divided into sections, for example stripes colored in different colors. The old legend states that once a man is able to mix his own distinct color, the spirit of the village shall allow him to do two things. Firstly, he can paint his stripe on the totem, hence creating his own section. Secondly, if ever he meets a man worthy of his respect, he can paint one and only one line into the stripe colored in the same color as the tattoo on the forehead of such a respectable man. And if ever, after engraving such a line into the chosen section of the totem (let's say green), one finds out that the chosen section contains more colored lines than any other section, one's duty is to go and seek as many comrades as one can find in order to tell them that the village's new chieftain is a man with a green tattoo on his forehead...

In this example, the totem represents a variable. When considered as a set, the group of all colors of the different sections of the totem $chief represents the type of that variable. When taken individually, every colored section of the totem represents a possible value of the variable. Lines on different sections represent the votes which a given possible value has obtained, and the section which has obtained the biggest number of votes —i. e. the stripe with the biggest number of distinctly colored lines on it— represents the «active value» of the variable $chief. The act of drawing a line corresponds to the act of voting, and the act of counting the lines within all the sections, with the subsequent choice of the section which contains the maximum number of lines, can be interpreted as the aggregation of votes.
There are several crucial aspects to notice in the above «totem» scenario. Primo, in spite of the fact that the majority of voters never meet at the same spot at the same time, they succeed in aggregating their votes, because they use the surface of the totem as an information storage medium. Secundo, the aggregation can possibly be executed after the cast of every individual vote, and thus just one vote can overthrow the current chieftain. Tertio, the totem in itself does not change after a new chieftain is elected; no information is lost, and therefore a chieftain who has just lost his chieftain status can quite easily regain it by obtaining two fresh votes —one which will put him into a tie with the present chieftain and another which shall put him into the lead again, hence starting a sort of «cat & mouse» game between the two chieftains. Quarto, any man can express his respect towards many possible chieftains by drawing lines into multiple stripes: it is not forbidden to give a vote to more than one candidate. One can also express one's respect to one candidate today and to another tomorrow; one can change one's mind. Quinto, it is forbidden to give more than one vote to one and the same candidate: once the respect has been expressed by drawing a line, it cannot be reinforced. From this point of view, all candidates are equal. Sexto, since no line is ever deleted from the totem, the votes of vote-givers who have already passed away can influence the result of the aggregation process until the moment when the totem-variable $chief falls into oblivion, for example due to the rising influence of other totem-variables which demand the attention of the inhabitants of the village. Septimo, given that every voter has a unique identification token, no further knowledge about the attributes of the village (e. g. the size of its population) is necessary.

This being said, it is now time to present the combination of PM with ICT-sustained participative democracy which results in a Parallel Democracy Model: «Parallel Democracy Model (PDM) is a framework allowing auto-configuration and self-adaptation of social bodies according to the aggregated collective will of the individuals who compose these bodies (e. g. virtual avatars, in case virtual networks are considered as such social bodies)» 15.

PDM aims to address several fallacies inherent to the most common variant of the SM, known as «parliamentary democracy». In case of PDM, there is no need for individuals to meet in the same moment in order to influence the functioning of the society. However, they still have to meet in one place —which can be of purely virtual nature. Many different variables related to the functioning of the social body are presented in this place simultaneously (i. e. in parallel) and in perennial fashion (i. e. from eternity to eternity). In the most free variant of PDM, any individual is free not only to vote for one or more possible values of a variable (i. e. to add a line on a stripe), but also to add a new possible value to the variable (i. e. to color a section of the totem with a new stripe), hence extending its type, or even to add a new variable (i. e. to erect a new totem). The act of voting is operationalized as the incrementation of the vote counter associated with a possible value on the storage medium.
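To make these mechanics concrete, here is a minimal Python sketch of a PDM variable modelled as a set of perennial vote counters. It is an illustration only: the names (PDMVariable, add_value, vote, active_value) are ours and do not stem from any existing implementation.

    from collections import defaultdict

    class PDMVariable:
        """A 'totem': a variable whose possible values accumulate votes.

        Votes are never deleted; the «active value» is simply the possible
        value with the highest counter at the moment of aggregation.
        """

        def __init__(self, name):
            self.name = name
            self.votes = defaultdict(int)   # possible value -> vote counter
            self.cast = set()               # (voter, value) pairs already cast

        def add_value(self, value):
            """Paint a new stripe on the totem: extend the variable's type."""
            self.votes.setdefault(value, 0)

        def vote(self, voter, value):
            """Draw one line: at most one vote per voter per possible value."""
            if (voter, value) in self.cast:
                return          # Quinto: respect, once expressed, is final
            self.cast.add((voter, value))
            self.votes[value] += 1

        def active_value(self):
            """Aggregation: can be re-run after every single vote (Secundo)."""
            return max(self.votes, key=self.votes.get) if self.votes else None

    # The polis as a whole is a dictionary of such variables; any agent may
    # update any of them at any moment, independently of the others (PM).
    polis = {"$chief": PDMVariable("$chief")}
    polis["$chief"].add_value("green")
    polis["$chief"].vote("warrior_1", "green")
    print(polis["$chief"].active_value())     # -> 'green'

Note how the aggregation is a mere argmax over the counters, which is why it can indeed be executed after every individual vote and why a dethroned chieftain is only ever two fresh lines away from regaining his status.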
Among other features, the most extreme variant of PDM makes it even possible that the vote of a person already dead can influence political reforms to come.

7. Description of the first attempts to implement the Parallel Democratic Model

Kyberia.sk is a virtual community founded by the author of this article in the year 2001. During the following years it succeeded in changing from a community of hackers, artists and philosophers into a mainstream social network, nonetheless guarding its local nature and its complete economic and political autonomy from the surrounding real-life environment. In 2008 it won the prize for the best Slovak Internet community, and in 2009 it forked from the Slovak cyberspace into the Czech cyberspace: a parallel project was launched on the kyberia.cz domain, exploiting a somewhat more evolved variant of the initial engine, which had meanwhile become open source and was published 16 under the AGPL license.

15 Hromada, 2012.

From the very beginning, one of the academic objectives of the Kyberia project was to furnish a certain virtual «in vivo» incubator for experiments with community-modeling. One such attempt was realized in the year 2003, when a new version of kyberia introduced a feature called «K». K, which was originally meant to abbreviate the term «karma» and later «kredit», became a sort of currency which 1) is distributed on a daily basis and in a certain amount to every registered user of kyberia, and 2) can be transferred by its owner to another data node (i. e. a submission, forum, blog, user, whatever). Further extensions like the K-wallet were added in the subsequent version 2.3 of kyberia's engine, thus making kyberia's K-based transaction system very similar to a normal economic system. Since the economic aspects of kyberia are of minor importance within the scope of this article, let's just clarify that the act of giving a K to a given node is very similar to what was later implemented on facebook in the form of the «I like» button. What is of importance, however, is that version 2.3 of kyberia's engine was the first 1) to attempt an implementation of PDM in order to alleviate the administrative burden placed on the shoulders of kyberia's administrators, and 2) to exploit the K-giving system as a method for casting votes.

The variable which was chosen as the first one to be subjected to PDM is the variable $page_title, whose type is text and whose «activated value» can be seen by any visitor of the page at the top of the page, in the browser's title bar (as of 15/11/2012 the $page_title is assigned the value «Remember, remember, the velvet November»). Let's inspect more closely how the value of this variable is assigned. There is a certain specific region of kyberia.sk called «Agora», where only users who were granted the status of a «senator» can give K. Within this «nodeshell» there is another «nodeshell» called «system configure» 17, where the system looks for variables and their values; in terms of the «totem scenario» from part 6 of this paper, it can be illustrated as that part of the village where the totems are erected. And within this «node» there is a node «title content» 18 which, for the automatic scripts of kyberia's engine, represents the variable $title_content. Into this variable node, any senator can add his own «child node» whose content is a «possible future value» of $title_content.
The act of adding such a «child node» into the node representing the variable $title_content is similar to the act of drawing a new stripe on the totem; the only difference is due to the variable's type: the cardinality of the type of all possible text strings is much bigger (infinitely bigger, in fact) than the finite number of possible chieftains in the village example presented above.

16 https://github.com/Kyberia/Kyberia-bloodline.
17 http://kyberia.sk/id/5604218/.
18 http://kyberia.sk/id/5604239/.

What follows is quite simple: the senators simply give Ks to one or more nodes whose content represents a possible value. They mark their line on their stripe of interest. Subsequently, every night at 2:23 AM, an automatic procedure update_title() is executed which checks which child node of the «title content» variable has obtained the biggest number of Ks, takes its content and assigns it as the value of the variable $title_content internal to kyberia's code. Such an «active value» will then be visible to all visitors of the kyberia.sk domain in the top part of their web browser.

An ignorant novice may consider it a waste of time to have such a seemingly complex machinery in order to perform such a simple change as assigning a new value to the title of the website. But the fact is that the «behind the scenes» machinery is not that complex —just a simple cron script containing 40 lines of simple php code— and is very universal: ANY global parameter of kyberia's code —be it the number of Ks distributed to users on a daily basis, the K-cost of adding a new node, or the number of Ks one has to obtain from other senators in order to become a senator— can be easily integrated into PDM by simply adding it to the «system configure» node located in the Agora of kyberia.sk.
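The actual nightly procedure is, as just stated, a short PHP cron script internal to kyberia's engine; the following Python fragment merely sketches the aggregation logic it is described as performing, with an invented node structure.

    # Illustrative sketch of the nightly aggregation described above; the
    # real update_title() is a ~40-line PHP cron script inside kyberia's
    # engine, so the node structure and names here are assumptions.

    def update_title(title_content_node, config):
        """Activate the content of the child node with the most Ks."""
        children = title_content_node["children"]  # senator-submitted values
        if children:
            winner = max(children, key=lambda child: child["k_count"])
            config["$title_content"] = winner["content"]

    # As if executed by cron every night at 2:23 AM:
    node = {"children": [
        {"content": "Remember, remember, the velvet November", "k_count": 12},
        {"content": "some other proposed title", "k_count": 7},
    ]}
    config = {}
    update_title(node, config)
    print(config["$title_content"])   # the new «active value» of the variable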
Since the conservative operators of the Slovak kyberia feel a certain reluctance to integrate more variables into PDM, more extensive «in vivo» experimentation is pursued within the scope of the much smaller, nonetheless much more liberal, domain of kyberia.cz. There, not only the title of the page (as of 21/12/2012 the $page_title = «mèδεις ageôμετρèτος eisitô μου tèn stegèn»), but 3 other variables can be set as well, the most interesting among them being the variable PDM_CONSTANT_REGISTRATION_K, which indirectly addresses the challenge «shall immigrants be accepted into our polis?», already described in the initial parts of this article as a challenge which has to be addressed by any human society. As a sufficiently big community, kyberia also has to address this challenge. Both the Czech and the Slovak kyberias share the feature that there is only one way one can become their member: 1) one has to apply for registration, and 2) one's application has to obtain a sufficient number of approval votes from already registered users (as is the case for kyberia.cz) or from senators (kyberia.sk). The number of needed registration-approving votes is addressed by the variable PDM_CONSTANT_REGISTRATION_K. Currently, the «active value» of this variable is 3 within the scope of the kyberia.cz domain, meaning that the registration application of a new user shall be approved only after it has received at least 3 K-votes. If such is the case, an automatic register_user() procedure will execute the necessary database transactions transforming the user's registration application into a full-fledged user node; subsequently the user is informed by email that he can enter the domain.

It is possible that, if ever the size of kyberia.cz grows, more and more users will propose or vote for still higher and higher values of the above-mentioned variable, in order to somehow regulate the influx of possible immigrants. On the other hand, it can also happen that the variable shall be assigned the value «zero»; in such a case a registration application could be approved even if it hasn't received any K-vote. Such a case could be quite dangerous, however, since it could lead to an uncontrollable influx of alternative egos, which are a true problem for every virtual community and for which kyberia.sk has found a partially successful solution by setting the acceptance threshold to 5 senator approval votes.
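A sketch of the approval logic just described, in the same illustrative Python; the real register_user() performs database transactions inside kyberia's engine, so everything below, from the data structures to the mail stub, is an assumption.

    PDM_CONSTANT_REGISTRATION_K = 3   # the current «active value» on kyberia.cz

    def send_email(address, body):
        print(f"mail to {address}: {body}")     # stand-in for a real mailer

    def register_user(application):
        """Turn an application into a user node once enough K-votes arrived.

        Note the danger discussed above: with the threshold set to 0,
        every application would pass immediately, unvoted.
        """
        if application["k_votes"] >= PDM_CONSTANT_REGISTRATION_K:
            application["status"] = "user"      # full-fledged user node
            send_email(application["email"], "you can now enter the domain")
            return True
        return False                            # keep awaiting approval votes

    register_user({"k_votes": 3, "email": "novice@example.org",
                   "status": "applicant"})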
8. From PDM to political engineering

Let's look more closely at the above-mentioned variable determining how many votes are needed in order to approve the registration application of a new user. We believe that it can illustrate the importance of a good choice of a variable's value in relation to the survival of a community or a society. As was already mentioned, if the value is too low, anyone can easily become a member of the community. In case of virtual communities, a system like facebook, which puts almost no constraints on user selection, can easily become a playground for toxic egos, causing the overall quality of content to go down. In case of real-life societies, such completely open societies can easily become a haven for free-riders or outlaws. But if the constraint determining the acceptance of a new member is too strict —i. e. the number of votes needed is too high— the system can easily get into a situation where fewer and fewer immigrants succeed in getting approved. This can potentially lead to the death of the community or society, especially in case of a significant user outflux (e. g. «locking out» of kyberia and investing the computational resources of one's brain into the construction of a google+ identity).

We believe that in the history of humanity it was not uncommon to see highly advanced societies perish just because the $immigration_rate variable had been assigned a non-optimal value, or because it was wrongly balanced with another set of variables contained in the society's genomic vector, e. g. those variables which determined the immigrants' subsequent integration into the society. The problem is, of course, that in a situation where no honest man can pretend to know in advance the optimal value of even a single variable, it is practically impossible to attain any kind of optimality in the case of complex sets of variables. The problem, in its very essence, is that human beings are unable to agree on what «optimality» means in case of sociopolitical bodies. For a computer scientist it is evident that there exist certain problems for which we shall never know the solution, nor even know whether we shall ever know their solution 19.

In a world where the aims to attain «the common good» and to spread «human dignity» 20 have indirectly led to the biggest demographic and ecological disasters of recorded history, one would tend to adopt a sceptical attitude expressed by the belief that the problem of the global optimality of political bodies is an unsolvable one. Yet it was many times advised that instead of falling into scepticism, it is wiser to observe in amazement 21 the wisdom of Nature. Be it in the ontogeny of a human baby or in the phylogeny of species, Nature maybe does not find the global solutions to «life, the universe and everything», but it succeeds in discovering stunningly elegant and simple local optima by means of very simple heuristics like «trial and error» and evolution.

19 Turing, 1936.
20 Mirandola, 1486.
21 Plato, 1986: 155d.

We believe that the reason why Nature succeeds in doing so is that it unceasingly permutes and mutates diverse information-carrying vectors, and that it always finds new ways —new mutation operators— to do so. As all experts in the domain of evolutionary algorithms (EA) know, the very method combining the ideas of 1) information conservation, 2) information replication and 3) information mutation can offer us sufficiently satisfactory solutions for a stunningly wide range of problems.
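A toy conserve-replicate-mutate loop in Python makes the point: it tunes a single society-like variable towards a local optimum it does not know in advance. The fitness function is, of course, a made-up stand-in; no one can write the real one, which is precisely the point of the preceding paragraphs.

    import random

    def fitness(threshold):
        """Made-up 'community survival' score, secretly peaking at 5."""
        return -(threshold - 5) ** 2

    def evolve(generations=50, offspring=8):
        value = random.uniform(0, 20)              # conservation: current state
        for _ in range(generations):
            mutants = [value + random.gauss(0, 1)  # replication + mutation
                       for _ in range(offspring)]
            best = max(mutants, key=fitness)       # selection
            if fitness(best) > fitness(value):     # trial and error
                value = best
        return value

    print(round(evolve(), 2))   # lands near 5 without knowing it in advance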
This article tries to suggest that the evolution of different configurations of political bodies can not only be described in a terminology not so distant from the one used by experts in EA; we also indicate that the act of making explicit the variables which determine the functioning of a given society —as is the case in PDM— could accelerate the search for a locally optimal political configuration. In our opinion, the advantage lies in the ability of PDM's vote-aggregation to harness the «wisdom of crowds» 22 better than the «classical» crowd-sourcing algorithm located at the very core of the different variations of the Serial Model. We may, of course, be wrong in our conclusions, but our «in vivo» social experiment with the kyberia communities hasn't furnished us with any reason to support the belief that systems based on SM-aggregation should be uncritically accepted and PDM-like variants a priori excluded. Verily, we believe that the only obstacle to a wider expansion of PDM seems to be SM's strong «social inertia», and not a flaw inherent to PDM itself.

As there is no order without conservation, there is no evolution without mutation. If, in case of political bodies, mutation can be operationalized as a modification of a variable's value, then it follows that methods of opinion-aggregation can be interpreted as mutation operators whenever the execution of such a method results in an update of a variable included in the society's genomic vector. Simply stated: as members of different virtual communities, as citizens of a polis aiming to apply the principles of participative democracy, or simply as holders of a passport of the Union, we all have the possibility to contribute to the final output of a variable-updating operator. Whether we want it or not, we are all co-engineers of the political body which envelops us like a mother's womb. Our actions contribute to mutations of the vector which generates this matrix, and this matrix subsequently influences our future actions and choices.

In the majority of cases, this dialectic between the agent and his sociopoliticohistoricoecolognomical environment is implicit and hidden behind strata of constitutions, laws and institutions. The objective of the model hereby proposed is to make at least certain parts of this dialectic more explicit. Our hope is that by making things —values, variables, vectors, models— explicit, we make them accessible to conscious reflexion. By making them accessible to conscious reflexion, and by subsequently transforming these structures according to this very reflexion, we let consciousness co-construct our shared world, hoping that consciousness and reason shall help us to reduce to zero the probability of participating in the construction of a world about which it was already stated: «this is not the world I love» 23. Hopefully, by reducing the possibility of one day waking into such a world, we shall gradually raise the feeling that the corner of the universe we slowly learn to inhabit... is our home 24.

22 Surowiecki, 2005.
23 Lévi-Strauss, 2005.
24 Kauffman, 1996.

Reducing and raising, incrementing and decrementing, encoding and decoding, analysing but also uniting —such are the tools of a conceptual engineer. Since in this article we have employed these tools for the purposes of political science, we find it appropriate to end this excursion by proposing the following definition of Political Engineering: «Political engineering is the science and the art of adjusting the values of the variables which determine the functioning of a political body», and we terminate with the conclusion that it is left to the engineer's own choice whether (s)he wants this adjustment to be done in accordance with the external environment, or with internal intentions.

Bibliography

Bourdieu, P. (1984). Distinction: A Social Critique of the Judgement of Taste, Cambridge, MA, Harvard University Press.
Coulanges, F. de (2010). La cité antique: Étude sur le culte, le droit, les institutions de la Grèce et de Rome (1893), Cambridge, Cambridge University Press.
Floridi, L. (2011). The Philosophy of Information, Oxford, Oxford University Press.
Hobbes, T. (2011). Leviathan (1651), Empire Books.
Hromada, D. D. (2012). Initiation to Parallel Democracy Model. Presented at the Fabrique de la Loi, École des Sciences Politiques, Paris.
Kauffman, S. (1996). At Home in the Universe: The Search for the Laws of Self-Organization and Complexity, Oxford, Oxford University Press.
Lévi-Strauss, C. (1967). Structural Anthropology, New York, Doubleday Anchor Books.
Lévi-Strauss, C. (2005). Interview with Claude Lévi-Strauss, Télévision France 2: http://www.youtube.com/watch?v=bT8sFygU8fY.
McLuhan, M. (1965). The Gutenberg Galaxy: The Making of Typographic Man, Toronto, University of Toronto Press.
Mirandola, G. P. D. (1971). Oration on the Dignity of Man (1486), Gateway.
Nietzsche, F. (1969). Umwertung aller Werte, Band 1, Deutscher Taschenbuch Verlag.
Pearson, H. (2006). Genetics: What is a gene?, in «Nature», 441, 398-401.
Plato (1986). Theaetetus: Part I of The Being of the Beautiful, Chicago, University of Chicago Press.
Plato (2009). The Republic, Cambridge, Cambridge University Press.
Surowiecki, J. (2005). The Wisdom of Crowds, New York, Anchor.
Turing, A. (1936). On Computable Numbers, with an Application to the Entscheidungsproblem, in «Proceedings of the London Mathematical Society», s2-42, 230-265.
Foreword from the Congress Chairs

For the Turing year 2012, AISB (The Society for the Study of Artificial Intelligence and Simulation of Behaviour) and IACAP (The International Association for Computing and Philosophy) merged their annual symposia/conferences to form the AISB/IACAP World Congress. The congress took place 2–6 July 2012 at the University of Birmingham, UK. The Congress was inspired by a desire to honour Alan Turing, and by the broad and deep significance of Turing's work to AI, the philosophical ramifications of computing, and philosophy and computing more generally. The Congress was one of the events forming the Alan Turing Year.

The Congress consisted mainly of a number of collocated Symposia on specific research areas, together with six invited Plenary Talks. All papers other than the Plenaries were given within Symposia. This format is perfect for encouraging new dialogue and collaboration both within and between research areas. This volume forms the proceedings of one of the component symposia. We are most grateful to the organizers of the Symposium for their hard work in creating it, attracting papers, doing the necessary reviewing, defining an exciting programme for the symposium, and compiling this volume. We also thank them for their flexibility and patience concerning the complex matter of fitting all the symposia and other events into the Congress week.

John Barnden (Computer Science, University of Birmingham), Programme Co-Chair and AISB Vice-Chair
Anthony Beavers (University of Evansville, Indiana, USA), Programme Co-Chair and IACAP President
Manfred Kerber (Computer Science, University of Birmingham), Local Arrangements Chair

Foreword from the Workshop Chairs

2010 marked the 60th anniversary of the publication of Turing's paper in which he outlined his test for machine intelligence. Turing suggested that consideration of genuine machine thought should be replaced by use of a simple behaviour-based process in which a human interrogator converses blindly with a machine and another human. Although the precise nature of the test has been debated, the standard interpretation is that if, after five minutes of interaction, the interrogator cannot reliably tell which respondent is the human and which the machine, then the machine can be qualified as a 'thinking machine'. Through the years, this test has become synonymous with 'the benchmark' for Artificial Intelligence in popular culture.

There is both widespread dissatisfaction with the 'Turing test' and widespread need for intelligence testing that would allow us to direct AI research towards general intelligent systems and to measure success. There are a host of test beds and specific benchmarks in AI, but there is no agreement on what a general test should even look like. Such a test, however, would seem exceedingly useful for the direction of research and funding. A crucial feature of the desired intelligence is the ability to act successfully in an environment that cannot be fully predicted at design time, i.e. the aim is to produce systems that behave robustly in a complex changing environment rather than in virtual or controlled environments. The more complex and changing the environment, however, the harder it becomes to produce tests that allow any kind of benchmarking.
Intelligence testing is thus an area where philosophical analysis of the fundamental concepts can be useful for cutting-edge research. There has recently been a growing interest in simulating and testing in machines not just intelligence, but also other human mental phenomena, like qualia. The challenge is twofold: the creation of conscious artificial systems, and the understanding of what human consciousness is and how it might arise. The appeal of the Turing Test is that it takes an abstract inner process and renders it as an observable behaviour; in this way, in principle, it allows us to establish a criterion with which we can evaluate technological artefacts on the same level as humans. New advances in the cognitive sciences and consciousness studies suggest it may be useful to revisit this test, which has been done through a number of symposia and competitions; consolidated efforts were attempted in 2010 and 2011 at the AISB Conventions through the TCIT symposia. This year's symposium forms the consolidated effort of a larger group of researchers in the field of machine intelligence to revisit, debate, and reformulate (if possible) the Turing test into a comprehensive intelligence test that may more usefully be employed to evaluate 'machine intelligence' during the 21st century.

The Chairs: Vincent C. Müller (Anatolia College/ACT & University of Oxford) and Aladdin Ayesh (De Montfort University)

With the Support of: Mark Bishop (Goldsmiths, University of London), John Barnden (University of Birmingham), Alessio Plebe (University of Messina) and Pietro Perconti (University of Messina)

The Program Committee: Raul Arrabales (Carlos III University of Madrid), Antonio Chella (University of Palermo), Giuseppe Trautteur (University of Napoli Federico II), Rafal Rzepka (Hokkaido University) … plus the Organizers Listed Above

The website of our symposium is at http://www.pt-ai.org/turing-test

Cite as: Müller, Vincent C. and Ayesh, Aladdin (eds.) (2012), Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World (AISB/IACAP Symposium) (Hove: AISB). Surname, Firstname (2012), 'Paper Title', in Vincent C. Müller and Aladdin Ayesh (eds.), Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World (AISB/IACAP Symposium) (Hove: AISB), xx-xx.

Table of Contents

Foreword from the Congress Chairs 3
Foreword from the Workshop Chairs 4
Daniel Devatman Hromada: From Taxonomy of Turing Test-Consistent Scenarios Towards Attribution of Legal Status to Meta-modular Artificial Autonomous Agents 7
Michael Zillich: My Robot is Smarter than Your Robot: On the Need for a Total Turing Test for Robots 12
Adam Linson, Chris Dobbyn and Robin Laney: Interactive Intelligence: Behaviour-based AI, Musical HCI and the Turing Test 16
Javier Insa, Jose Hernandez-Orallo, Sergio España, David Dowe and M. Victoria Hernandez-Lloreda: The anYnt Project Intelligence Test (Demo) 20
Jose Hernandez-Orallo, Javier Insa, David Dowe and Bill Hibbard: Turing Machines and Recursive Turing Tests 28
Francesco Bianchini and Domenica Bruni: What Language for Turing Test in the Age of Qualia? 34
Paul Schweizer: Could there be a Turing Test for Qualia? 41
Antonio Chella and Riccardo Manzotti: Jazz and Machine Consciousness: Towards a New Turing Test 49
William York and Jerry Swan: Taking Turing Seriously (But Not Literally) 54
Hajo Greif: Laws of Form and the Force of Function: Variations on the Turing Test 60

From Taxonomy of Turing Test-Consistent Scenarios Towards Attribution of Legal Status to Meta-modular Artificial Autonomous Agents

Daniel Devatman Hromada (Slovak Technical University, Faculty of Electrical Engineering and Information Technology, Institute of Control and Industrial Informatics, Bratislava, Slovakia; email: hromi@kyberia.sk)

Abstract. The original Turing Test is modified in order to take into account the age/gender of a judge who evaluates the machine, and the age/gender of a human with whom the machine is compared during evaluation. This yields a basic taxonomy of Turing Test-consistent scenarios which is subsequently extended by taking into account the type of intelligence being evaluated. Consistently with the Theory of Multiple Intelligences, nine basic intelligence types are proposed, and an example of a possible scenario for the evaluation of emotional intelligence in early stages of development is given. It is suggested that specific intelligence types can subsequently be grouped into a hierarchy at the top of which is seated an Artificial Intelligence labelled as «meta-modular». Finally, it is proposed that such a meta-modular AI should be defined as an Artificial Autonomous Agent and should be given all the rights and responsibilities according to the age of the human counterparts in comparison with whom the AI under question has passed the Turing Test.

[Body of the paper, pp. 7–11 of the proceedings; only Table 2 is recoverable:]

Table 2. Clustering of basic intelligence types into basic intelligence clusters
ETT infix | Intelligence group | Subordinated intelligence types
[?] | Corporal group | Organic; Spatial; Somato-sexual
BA | Babbling group | Moral; Emotional; Linguistic
SE | Sensual group | Mathematico-logical; Musical; Visual
My Robot is Smarter than Your Robot - On the Need for a Total Turing Test for Robots

Michael Zillich 1

Abstract. In this position paper we argue for the need of a Turing-like test for robots.
While many robotic demonstrators show impressive but often very restricted abilities, it is very difficult to assess how intelligent such a robot can be considered to be. We thus propose a test, comprised of a (simulated) environment, a robot, a human tele-operator and a human interrogator, that allows one to assess whether a robot behaves as intelligently as a human tele-operator (using the same sensory input as the robot) with respect to a given task.

1 Vienna University of Technology, Austria, email: zillich@acin.tuwien.ac.at

1 INTRODUCTION

The Turing Test [35] considered the equivalent of a brain in a vat, namely an AI communicating with a human interrogator solely via written dialogue. Though this did not preclude the AI from having acquired the knowledge that it is supposed to display via other means, for example extended multi-sensory interactions within a complex dynamic environment, it did narrow down what is considered as relevant for the display of intelligence. Intelligence however encompasses more than language. Intelligence, in all its flavours, developed to provide a competitive advantage in coping with a world full of complex challenges, such as moving about, manipulating things (though not necessarily with hands), hiding, hunting, building shelter, caring for offspring, building social contacts, etc. In short, intelligence needs a whole world to be useful in, which prompted Harnad to propose the Total Turing Test [19], requiring responses to all senses, not just formatted linguistic input. Note that we do not make an argument here about the best approach to explain the emergence of intelligence (though we consider it likely that a comprehensive embodied perspective will help), but only about how to measure intelligence without limiting it to only a certain aspect.

The importance of considering all aspects of intelligence is also fully acknowledged in robotics, where agents situated in the real world are faced with a variety of tasks, such as navigation and map building, object retrieval, or human-robot interaction, which require various aspects of intelligence in order to be successfully carried out in spite of all the challenges of complex and dynamic scenes. So robotics can serve as a testbed for many aspects of intelligence. In fact it is the more basic of the above aspects of intelligence that still pose major difficulties. This is not to say that there was no progress over the years. In fact there are many impressive robot demonstrators now displaying individual skills in specific environments, such as bipedal walking in the Honda Asimo [6] or quadruped walking in the Boston Dynamics BigDog [32], learning to grasp [25, 33], navigation in the Google Driverless Car, or even preparing pancakes [11]. For many of these demonstrators, however, it is easy to see where the limitations lie, and typically the designers are quick to admit that this sensor placement or that choice of objects was a necessary compromise in order to concentrate on the actually interesting research questions at hand. This makes it difficult, however, to quantitatively compare the performance of robots. Which robot is smarter: the pancake-flipping robot of [11] 2, the beer-fetching PR2 3 or the pool-playing PR2 4? We will never know.
A lot of work goes into these demonstrators, to do several runs at conferences or fairs and shoot videos before they are shelved or dismantled again, but it is often not clear what was really learned in the end. This is a shame, because certainly some challenges were met with interesting solutions. But the limits of these solutions were not explored within the specific experimental setup of the demo. So what we argue for is a standardised, repeatable test for complete robotic systems. This should test robustness in basic "survival" skills, such as not falling down stairs, not running into mirrors or getting caught in cables, as well as advanced tasks, such as object search, learning how to grasp, or human-robot interaction including natural language understanding.

2 www.youtube.com/watch?v=4usoE981e7I
3 www.willowgarage.com/blog/2010/07/06/beer-me-robot
4 www.willowgarage.com/blog/2010/06/15/pr2-plays-pool

2 RELATED WORK

2.1 Robot Competitions

Tests are of course not new in the robotics community. There are many regular robot challenges which have been argued to serve as benchmarks [12], such as RoboCup [24] with its different challenges (Soccer, Rescue, @Home), the AAAI Mobile Robot Competitions [1], or challenges with an educational background like the US FIRST Robotics Competitions [8] or EUROBOT [3]. Furthermore there are specific targeted events such as the DARPA Grand Challenges 2004 and 2005 and the DARPA Urban Challenge 2007 [2]. While these events present the state of the art and highlight particularly strong teams, they only offer a snapshot at a particular point in time. And although these events typically provide a strict rule book, with clear requirements and descriptions of the scenarios, the experiments are not repeatable and the test arena will be dismantled after the event (with the exception of simulations, of course). So while offering the ultimate real-world test in a challenging and competitive setting, and thus providing very important impulses for robotics research, these tests are not suitable because a) they are not repeatable, b) rules keep changing to increase difficulty and maintain a challenging competition, and c) the outcomes depend a lot on factors related to the team (team size and funding, quality of team leadership) rather than the methods employed within the robot.

2.2 Robotic Benchmarks

The robotics community realised the need for repeatable quantitative benchmarks [15, 21, 26, 27], leading to a series of workshops, such as the Performance Metrics for Intelligent Systems (PerMIS), the Benchmarks in Robotics Research and the Good Experimental Methodology in Robotics series, and initiatives such as the EURON Benchmarking Activities [4] or the NIST Urban Search And Rescue (USAR) testbed [7]. Focusing on one enabling capability at a time, some benchmarks concentrate on path planning [10], obstacle avoidance [23], navigation and mapping [9, 13], visual servoing [14], grasping [18, 22] or social interaction [34, 20]. Taking whole robotic systems into account, [16] propose benchmarking biologically inspired robots based on pursuit/evasion behaviour. Also [29] test complete cognitive systems in a task requiring the robot to find feeders in a maze and compete with other robots.

2.3 Robot Simulators

Robotics has realised the importance of simulation environments early on, and a variety of simulators exist.
One example is Player/Stage [17], a robot middleware framework and 2D simulation environment intended mostly for navigation tasks, together with its extension to a full 3D environment, Gazebo [5], which uses a 3D physics engine to simulate realistic 3D interactions such as grasping and has recently been chosen as the simulation test bed for the DARPA Robotics Challenge for disaster robots. [28] is another full 3D simulator, used e.g. for the simulation of robotic soccer players. Some simulators, such as [30] and [36], are specialised for the precise simulation of robotic grasping. These simulators are valuable tools for debugging specific methods, but their potential as a common testbed to evaluate complete robotic systems in a set of standardised tasks has not been fully explored yet.

In summary, we have on the one hand repeatable, quantitative benchmarks mostly tailored to sub-problems (such as navigation or grasping) and on the other hand competitions testing full systems at singular events, where both of these make use of a mixture of simulations and data gathered in the real world.

3 THE TOTAL TURING TEST FOR ROBOTS

What has not fully emerged yet, however, is a comprehensive test suite for complete robotic systems, maintaining a clearly specified test environment plus supporting infrastructure for an extended period of time and allowing performance evaluation and comparison of different solutions as well as measurement of their evolution over time. What this test suite should assess is the overall fitness of a robotic system to cope with the real world and behave intelligently in the face of unforeseen events, incomplete information etc. Moreover the test should ideally convey its results in an easily accessible form also to an audience beyond the robotics research community, allowing other disciplines such as Cognitive Science and Philosophy as well as the general public to assess the progress of the field, beyond eye-catching but often shallow and misleading demos. Harnad's [19] Total Turing Test provides a fitting paradigm, requiring that "The candidate [the robot] must be able to do, in the real world of objects and people, everything that real people can do, in a way that is indistinguishable (to a person) from the way real people do it." "Everything" will of course have to be broken down into concrete tasks with increasing levels of difficulty. And the embodiment of the robot will place constraints on the things it can do in the real world, which has to be taken into account accordingly.

3.1 The Test

The test would consist of a given scene and a set of tasks to be performed by either an autonomous robot or a human tele-operating a robot (based on precisely the same sensor data the robot has available, such as perhaps only a laser ranger and bumpers). A human interrogator would assign tasks to the robot, and also place various obstacles that interfere with successful completion. If the human interrogator cannot distinguish the performance of the autonomous robot from the performance of the tele-operated robot, the autonomous robot can be said to be intelligent with respect to the given task. Concretely, the test would have to consist of a standardised environment with a defined set of tasks, as is e.g. common in the RoboCup@Home challenges (fetch an item, follow a user). The test suite would provide an API, e.g. based on the increasingly popular Robot Operating System (ROS) [31], allowing each robot to be connected to it with moderate effort.
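As a schematic illustration of this protocol, the following Python fragment simulates a batch of trials in which an interrogator attributes anonymized runs to either the robot or the tele-operator; all names and the toy behaviour model are our assumptions, not part of the proposed test suite.

    import random

    def execute(task, actor):
        """Toy stand-in for a (simulated) run; returns an anonymized trace."""
        low = 0.4 if actor == "robot" else 0.7    # assumed quality ranges
        return {"task": task, "quality": random.uniform(low, 1.0)}

    def interrogate(trace):
        """Toy interrogator: clumsy runs look 'robotic' to it."""
        return "robot" if trace["quality"] < 0.6 else "tele-operator"

    actors = random.choices(["robot", "tele-operator"], k=100)
    wrong = sum(a != interrogate(execute("fetch_item", a)) for a in actors)
    print(f"wrong attributions: {wrong}/100")
    # ~50 wrong attributions out of 100 would mean chance level, i.e. the
    # autonomous robot is indistinguishable from the tele-operator here.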
Various obstacles and events could be made to interfere with the execution of these tasks, such as cables lying on the floor, closed glass doors, or stubborn humans blocking the way. Different challenges will pose different problems for different robots: for the popular omnidirectional drives of holonomic bases such as the Willow Garage PR2, cables on the floor represent insurmountable obstacles, while other robots will have difficulties navigating in tight environments.

3.2 Simulation

A basic building block for such a test suite is an extension of available simulation systems to allow fully realistic simulation of all aspects of robotic behaviour. The simulation environment would have to provide photo-realistic rendering with accurate noise models (such as lens flares or the poor dynamic range found in typical CCD cameras), beyond the visually pleasing but much too "clean" rendering of available simulators. Also the physics simulation will have to be very realistic, which means that the simulation might not be able to run in real time. Real time, however, is not necessarily a requirement for a simulation as long as the computation times of the employed methods are scaled accordingly. Furthermore, the simulation would also need to contain humans, instructing the robot in natural language, handing over items or posing as dynamic obstacles for navigation.

Figure 1 shows a comparison of a robot simulated (and in this case tele-operated) in a state-of-the-art simulator (Gazebo) with the corresponding real robot carrying out the same task autonomously as part of a competition [37]. While the simulation could in this case provide a reasonably realistic physics simulation (leading to objects slipping out of the hand if not properly grasped) and simulation of sensors (generating e.g. problems for stereo reconstruction in low-texture areas), more detailed simulations will be needed to capture more aspects of the real world.

Figure 1. Comparison of (tele-operated) simulation and (autonomous) real robot in a fetch and carry task.

3.3 Tasks and Stages

The test would be set up in different tasks and stages. Note that we should not require a robot to do everything that real people can do (as originally formulated by Harnad). Robots are after all designed for certain tasks, requiring only a specific set of abilities (language understanding, being equipped with a gripper, the ability to traverse outdoor terrain, etc.), and we are interested in their capabilities related to these tasks. The constraints of a given robot configuration (such as the ability to understand language) then apply to the robot as well as to the human tele-operator. Stages would be set up with increasing difficulty, such that a robot can be said to be stage-1 safe for the fetch and carry task (all clean, static environment) but to fail stage 2 in 20% of cases (e.g. unforeseen obstacles, changing lighting). The final stages would be a real-world test in a mock-up constructed to follow the simulated world. While the simulation would be a piece of software available for download, the real-world test would be held as an annual competition much like RoboCup@Home, with rules and stages of difficulty according to the simulation.
Note that unlike in RoboCup@Home, these would remain fixed, rather than change with each year.

3.4 Evaluation

The test would then have two levels of evaluation.

Pass/fail test This evaluation would simply measure the percentage of runs in which the robot successfully performs a task (at a given stage). This would be an automated assessment and would allow developers to continuously monitor the progress of their system.

Intelligence test This would be the actual Total Turing Test, with human interrogators assessing whether a task was performed (successfully or not) by a robot or by a human tele-operator. The score would be related to the percentage of wrong attributions (i.e. how often robot and tele-operator were indistinguishable). Test runs with human tele-operators would be recorded once and stored for later comparison with submitted robot runs. The requirement of collecting statistics from several interrogators means that this test is more elaborate and would be performed at longer intervals, such as during annual competitions. This evaluation then makes it possible to assess the intelligence of a robot (with respect to a given task) in coping with the various difficulties posed by a real environment.

The setup of tasks and stages makes it possible to map the abilities of a given robot. Figure 2 shows the scores of a fictional robot. The robot is equipped with a laser ranger and camera and can thus perform the navigation tasks as well as follow a human, but it lacks an arm for carrying objects or opening doors, as well as the communication capabilities required for the human guidance task.

Figure 2. Example scores for a fictional robot equipped with a laser ranger and camera, but no arm and no language capabilities. Panels (a) and (b) show scores on the pass/fail test and the intelligence test respectively.

As can be seen, the robot can be considered stage-1 intelligent with respect to the random navigation task (driving around randomly without colliding or getting stuck), i.e. it is indistinguishable from a human tele-operator driving randomly in the perfect simulated environment. It also achieves perfect success rates in this simple setting. Performance in the real world under perfect conditions (stage 4) is slightly worse (the simulation could not capture all the eventualities of the real world, such as wheel friction). Performance with added difficulties (such as small obstacles on the floor) decreases, especially in the real-world condition. Performance drops in particular with respect to the tele-operator, and so it quickly becomes clear to the interrogators which is the robot and which the tele-operator: the robot makes increasingly "stupid mistakes", such as getting stuck when there is an obvious escape. Accordingly, the intelligence score drops quickly. The robot can also be said to be fairly stage-1 and stage-4 intelligent with respect to navigation and human following, and slightly less intelligent with respect to finding objects. In this respect, modern vacuum cleaning robots (the more advanced versions including navigation and mapping capabilities) can be considered intelligent with respect to the cleaning task, as their performance there will generally match that of a human tele-operating such a robot. For more advanced tasks including object recognition, grasping or dialogue, the intelligence of most robots will quickly degrade to 0 for any stage beyond 1.
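To make the two levels of evaluation concrete, here is a minimal sketch of how the scores could be computed from recorded runs. The text only specifies that the intelligence score is "related to" the percentage of wrong attributions, so the particular mapping below (chance-level confusion scoring 1.0) is one assumed choice, and the record fields are hypothetical.

```python
def pass_fail_score(runs):
    """Fraction of runs in which the task was completed at a given stage.

    Each run is a dict such as {"task": "fetch", "stage": 1, "success": True}.
    """
    return sum(r["success"] for r in runs) / len(runs)


def intelligence_score(attributions):
    """Turing-style score from interrogator judgements.

    Each attribution is a dict like {"judged": "robot", "actual": "human"}.
    If interrogators are wrong half the time, robot and tele-operator are
    indistinguishable and the score saturates at 1.0; if they are never
    wrong, the robot is trivially identifiable and the score is 0.0.
    """
    wrong = sum(a["judged"] != a["actual"] for a in attributions)
    return min(1.0, 2.0 * wrong / len(attributions))


# Example: 6 of 20 attributions wrong -> score 0.6
print(intelligence_score(
    [{"judged": "robot", "actual": "robot"}] * 14
    + [{"judged": "robot", "actual": "human"}] * 6))
```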
4 CONCLUSION

We proposed a test paradigm for intelligent robotic systems, inspired by Harnad's Total Turing Test, that goes beyond current benchmarks and robot competitions. This test would provide a pragmatic definition of intelligence for robots: the capability to perform as well as a tele-operating human at a given task. Moreover, test scores would be a good indicator of whether a robot is ready for the real world, i.e. is endowed with enough intelligence to overcome unforeseen obstacles and to avoid getting trapped in "stupid" situations.

There are, however, several technical and organisational challenges to be met. Running realistic experiments will require simulators of considerably improved fidelity, but these technologies are becoming increasingly available, thanks in part to developments in the gaming industry. Allowing researchers to simply plug in their systems will require a careful design of interfaces to ensure that all capabilities are adequately covered. The biggest challenge might actually be the definition of environments, tasks and stages. This will have to be a community effort and draw on the experiences of previous benchmarking efforts.

ACKNOWLEDGEMENTS

The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement No. 215181, CogX, and from the Austrian Science Fund (FWF) under project TRP 139-N23 InSitu.

REFERENCES

[1] AAAI Mobile Robot Competition, http://www.aaai.org/Conferences/AAAI/2007/aaai07robot.php.
[2] DARPA Grand Challenge, http://archive.darpa.mil/grandchallenge.
[3] Eurobot, http://www.eurobot.org.
[4] EURON Benchmarking Initiative, www.robot.uji.es/EURON/en/index.html.
[5] Gazebo 3D multi-robot simulator, http://gazebosim.org.
[6] Honda ASIMO, http://world.honda.com/ASIMO.
[7] NIST Urban Search And Rescue (USAR), http://www.nist.gov/el/isd/testarenas.cfm.
[8] US FIRST Robotics Competition, www.usfirst.org.
[9] Benjamin Balaguer, Stefano Carpin, and Stephen Balakirsky, 'Towards Quantitative Comparisons of Robot Algorithms: Experiences with SLAM in Simulation and Real World Systems', in IROS Workshop on Benchmarks in Robotics Research, (2007).
[10] J. Baltes, 'A benchmark suite for mobile robots', in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000), volume 2, pp. 1101–1106, IEEE, (2000).
[11] Michael Beetz, Ulrich Klank, Ingo Kresse, Alexis Maldonado, Lorenz Mösenlechner, Dejan Pangercic, Thomas Rühr, and Moritz Tenorth, 'Robotic Roommates Making Pancakes', in 11th IEEE-RAS International Conference on Humanoid Robots, (2011).
[12] S. Behnke, 'Robot competitions - Ideal benchmarks for robotics research', in Proc. of IROS 2006 Workshop on Benchmarks in Robotics Research, (2006).
[13] Simone Ceriani, Giulio Fontana, Alessandro Giusti, Daniele Marzorati, Matteo Matteucci, Davide Migliore, Davide Rizzi, Domenico G. Sorrenti, and Pierluigi Taddei, 'Rawseeds ground truth collection systems for indoor self-localization and mapping', Autonomous Robots, 27(4), 353–371, (2009).
[14] Enric Cervera, 'Cross-Platform Software for Benchmarks on Visual Servoing', in IROS Workshop on Benchmarks in Robotics Research, (2006).
[15] R. Dillmann, 'Benchmarks for Robotics Research', Technical report, EURON, (2004).
[16] Malachy Eaton, J. J. Collins, and Lucia Sheehan, 'Toward a benchmarking framework for research into bio-inspired hardware-software artefacts', Artificial Life and Robotics, 5(1), 40–45, (2001).
[17] Brian P. Gerkey, Richard T. Vaughan, and Andrew Howard, 'The Player/Stage Project: Tools for Multi-Robot and Distributed Sensor Systems', in International Conference on Advanced Robotics (ICAR), pp. 317–323, (2003).
[18] Gerhard Grunwald, Christoph Borst, and J. Marius Zöllner, 'Benchmarking dexterous dual-arm/hand robotic manipulation', in IROS Workshop on Performance Evaluation and Benchmarking for Intelligent Robots and Systems, (2008).
[19] S. Harnad, 'Other Bodies, Other Minds: A Machine Incarnation of an Old Philosophical Problem', Minds and Machines, 1, 43–54, (1991).
[20] Zachary Henkel, Robin Murphy, Vasant Srinivasan, and Cindy Bethel, 'A Proxemic-Based HRI Testbed', in Proceedings of the Performance Metrics for Intelligent Systems Workshop (PerMIS), (2012).
[21] I. Iossifidis, G. Lawitzky, S. Knoop, and R. Zöllner, 'Towards Benchmarking of Domestic Robotic Assistants', in Advances in Human Robot Interaction, eds., Erwin Prassler, Gisbert Lawitzky, Andreas Stopp, Gerhard Grunwald, Martin Hägele, Rüdiger Dillmann, and Ioannis Iossifidis, volume 14/2004 of Springer Tracts in Advanced Robotics (STAR), chapter 7, 97–135, Springer, (2005).
[22] R. Jäkel, S. R. Schmidt-Rohr, M. Lösch, and R. Dillmann, 'Hierarchical structuring of manipulation benchmarks in service robotics', in IROS Workshop on Performance Evaluation and Benchmarking for Intelligent Robots and Systems with Cognitive and Autonomy Capabilities, (2010).
[23] J. L. Jimenez, I. Rano, and I. Minguez, 'Advances in the Framework for Automatic Evaluation of Obstacle Avoidance Methods', in IROS Workshop on Benchmarks in Robotics Research, (2007).
[24] Hiroaki Kitano, Minoru Asada, Yasuo Kuniyoshi, Itsuki Noda, and Eiichi Osawa, 'RoboCup: The Robot World Cup Initiative', in IJCAI-95 Workshop on Entertainment and AI/ALife, (1995).
[25] D. Kraft, N. Pugeault, E. Baseski, M. Popovic, D. Kragic, S. Kalkan, F. Wörgötter, and N. Krüger, 'Birth of the object: Detection of objectness and extraction of object shape through object action complexes', International Journal of Humanoid Robotics, 5(2), 247–265, (2008).
[26] Raj Madhavan and Rolf Lakaemper, 'Benchmarking and Standardization of Intelligent Robotic Systems', Intelligence, (2009).
[27] Performance Evaluation and Benchmarking of Intelligent Systems, eds., Raj Madhavan, Edward Tunstel, and Elena Messina, Springer, 2009.
[28] O. Michel, 'Webots: Professional Mobile Robot Simulation', International Journal of Advanced Robotic Systems, 1(1), 39–42, (2004).
[29] Olivier Michel, Fabien Rohrer, and Yvan Bourquin, 'Rat's Life: A Cognitive Robotics Benchmark', European Robotics Symposium, 223–232, (2008).
[30] Andrew Miller and Peter K. Allen, 'Graspit!: A Versatile Simulator for Robotic Grasping', IEEE Robotics and Automation Magazine, 11(4), 110–122, (2004).
[31] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng, 'ROS: an open-source Robot Operating System', in ICRA Workshop on Open Source Software, (2009).
[32] Marc Raibert, Kevin Blankespoor, Gabriel Nelson, Rob Playter, and The BigDog Team, 'BigDog, the Rough-Terrain Quadruped Robot', in Proceedings of the 17th World Congress of the International Federation of Automatic Control, pp. 10822–10825, (2008).
[33] Ashutosh Saxena, Justin Driemeyer, and Andrew Y. Ng, 'Robotic Grasping of Novel Objects using Vision', The International Journal of Robotics Research, 27(2), 157–173, (2008).
[34] Katherine M. Tsui, Munjal Desai, and Holly A. Yanco, 'Towards Measuring the Quality of Interaction: Communication through Telepresence Robots', in Proceedings of the Performance Metrics for Intelligent Systems Workshop (PerMIS), (2012).
[35] Alan Turing, 'Computing Machinery and Intelligence', Mind, 59, 433–460, (1950).
[36] S. Ulbrich, D. Kappler, T. Asfour, N. Vahrenkamp, A. Bierbaum, M. Przybylski, and R. Dillmann, 'The OpenGRASP Benchmarking Suite: An Environment for the Comparative Analysis of Grasping and Dexterous Manipulation', in IEEE/RSJ International Conference on Intelligent Robots and Systems, (2011).
[37] Kai Zhou, Michael Zillich, and Markus Vincze, 'Mobile manipulation: Bring back the cereal box - Video proceedings of the 2011 CogX Spring School', in 8th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 873–873, IEEE, (2011).
Interactive Intelligence: Behaviour-based AI, Musical HCI and the Turing Test

Adam Linson, Chris Dobbyn and Robin Laney1

1 Faculty of Mathematics, Computing and Technology, Dept. of Computing, Open University, UK. Email: {a.linson, c.h.dobbyn, r.c.laney}@open.ac.uk

Abstract. The field of behaviour-based artificial intelligence (AI), with its roots in the robotics research of Rodney Brooks, is not predominantly tied to linguistic interaction in the sense of the classic Turing test (or, "imitation game"). Yet, it is worth noting, both are centred on a behavioural model of intelligence. Similarly, there is no intrinsic connection between musical AI and the language-based Turing test, though there have been many attempts to forge connections between them. Nonetheless, there are aspects of musical AI and the Turing test that can be considered in the context of non-language-based interactive environments, in particular when dealing with real-time musical AI, especially interactive improvisation software. This paper draws out the threads of intentional agency and human indistinguishability from Turing's original 1950 characterisation of AI. On the basis of this distinction, it considers different approaches to musical AI. In doing so, it highlights possibilities for non-hierarchical interplay between human and computer agents.

1 Introduction

The field of behaviour-based artificial intelligence (AI), with its roots in the robotics research of Rodney Brooks, is not predominantly tied to linguistic interaction in the sense of the classic Turing test (or, "imitation game" [24]). Yet, it is worth noting, both are centred on a behavioural model of intelligence. Similarly, there is no intrinsic connection between musical AI and the language-based Turing test, though there have been many attempts to forge connections between them. The primary approach to applying the Turing test to music is in the guise of so-called "discrimination tests", in which human- and computer-generated musical output are compared (for an extensive critical overview of how the Turing test has been applied to music, see [1]). Nonetheless, there are aspects of musical AI and the Turing test that can be considered in the context of non-language-based interactive environments, in particular when dealing with real-time musical AI, especially interactive improvisation software (see, for example, [23] and [8]). In this context, AI for non-hierarchical human-computer musical improvisation, such as George Lewis' Voyager [16], and Turing's imitation game are both examples of "an open-ended and performative interplay between [human and computer] agents that are not capable of dominating each other" [21].
2 Background

It is useful here to give some context to the Turing test itself. In its original incarnation, the test was proposed as a thought experiment to explain the concept of a thinking machine to a public uninitiated in such matters [24]. Rather than a litmus test of whether or not a machine could think (which is how the test is frequently understood), the test was in fact designed to help make sense of the concept of a machine that could think. Writing in 1950, Turing estimates "about fifty years' time" until the technology would be sufficient to pass a real version of the test, and states his belief "that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted". Thus his original proposal remained a theoretical formulation: in principle, a machine could be invented with the capacity to be mistaken for a human; if this goal were accomplished, a reasonable person should accept the machine as a thinking entity. He is very clear about the behaviourist underpinnings of the experiment:

May not machines carry out something which ought to be described as thinking but which is very different from what a man does? This objection is a very strong one, but at least we can say that if, nevertheless, a machine can be constructed to play the imitation game satisfactorily, we need not be troubled by this objection.

He goes on to describe the "imitation game" as one in which the machine should "try to provide answers that would naturally be given by a man". His ideas became the basis for what eventually emerged as the field of AI. As Turing emphasised, the thought experiment consisted of an abstract, "imaginable" machine that, under certain conditions to ensure a level playing field, would be indistinguishable from a human from the perspective of a human interrogator [24].

Presently, when the test is actually deployed in practice, it is easy to forget the essential role of the designer, especially given the fact that the computer "playing" the game is, to an extent, thrust into the spotlight. In a manner of speaking, the interactive computer takes centre stage, and attention is diverted from the underlying challenge set forth by Turing: to determine the specifications of the machine. Thus, one could say that in addition to being a test for a given machine, it is also a creative design challenge for those responsible for the machine. The stress is on design rather than implementation, as Turing explicitly suggests imagining that any proposed machine functions perfectly according to its specifications (see [24], p. 449). If the creative design challenge were fulfilled, the computer would behave convincingly as a human, perhaps hesitating when appropriate and occasionally refusing to answer or giving incorrect answers such as the ones Turing imagines [24]:

Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621.
The implication of Turing's example is that the measure of success for those behind the machine lies in designing a system that is as stubborn and fallible as humans, rather than servile and (theoretically) infallible, like an adding machine.

3 Two threads unraveled

Two threads can be drawn out of Turing's behavioural account of intelligence that directly pertain to contemporary AI systems: the first concerns the kind of intentional agency suggested by his example answer, "count me out on this one"; the second concerns the particular capacities and limitations of human embodiment, such as the human inability to perform certain calculations in a fraction of a second and the human potential for error. More generally, the second thread has to do with the broadly construed linguistic, social, mental and physical consequences of human physiology. Indeed, current theories of mind from a variety of disciplines provide a means for considering these threads separately. In particular, relevant investigations that address these two threads, described in this context as intentional agency and human indistinguishability, can be found in psychology, philosophy and cognitive science.

3.1 Intentional agency

The first thread concerns the notion of intentional agency, considered here separately from the thread of human indistinguishability. Empirical developmental psychology suggests that the human predisposition to attribute intentional agency to both humans and nonhumans is present from infancy. Poulin-Dubois and Shultz chart childhood developmental stages over the first three years of life, from the initial ability to identify agency (distinguishing animate from inanimate objects) to the informed attribution of intentionality by inference of goal-directed behavior [22]. Csibra found that infants ascribed goal-directed behavior even to artificially animated inanimate objects, if the objects were secretly manipulated to display teleological actions such as obstacle avoidance [7]. Király et al. identify the source of an infant's interpretation of a teleological action: "if the abstract cues of goal-directedness are present, even very young infants are able to attribute goals to the actions of a wide range of entities even if these are unfamiliar objects lacking human features" [10].

It is important to note that in the above studies the infants were passive, remote observers, whereas the Turing test evaluates direct interaction. While the predisposition of infants suggests an important basis for such evaluation, more is needed to address interactivity. In another area of empirical psychology, a study of adults by Barrett and Johnson suggests that even a lack of apparent goals in a self-propelled (nonhuman) object can lead to the attribution of intentionality in an interactive context [2]. In particular, their test subjects used language normally reserved for humans and animals to describe the behaviour of artificially animated inanimate objects that appeared to exhibit resistance to direct control in the course of an interaction; when there was no resistance, they did not use such language. The authors of the study link the results of their controlled experiment to the anecdotal experience of frustration that arises during interactions with artifacts such as computers or vehicles that "refuse" to cooperate. In other words, in an interactive context, too much passivity by an artificial agent may negate any sense of its apparent intentionality.
This suggests that for an agent to remain apparently intentional during direct interaction, it must exhibit a degree of resistance along with the kind of adaptation to the environment that indicates its behaviour is being adjusted to attain a goal. These features appear to be accounted for in Turing's first example answer above: the answer is accommodating insofar as it is a direct response to the interrogator, but the show of resistance seems to enhance the sense of "intelligence". It is noteworthy that this particular thread, intentional agency, relates closely to Brooks' extension of intelligence to nonlinguistic, nonhuman intelligence, especially in relation to insect and other animal intelligence, which he has emulated in robotic form with his particular approach to AI (see [3]).

3.2 Human indistinguishability

The second thread, the idea that human capacities and limitations should be built into an AI system, strongly relates to many significant accounts of embodied, situated activity (see, for example, [9], [4] and [11]). These accounts focus on how the human body, brain, mind and environment fundamentally structure the process of cognition, which can be understood through observable behaviour. When dealing with AI, the focus on behaviour clearly ties back to Turing. These themes are also taken up in Brooks' behaviour-based AI approach, but, at least in his early research, he applies them primarily to nonhuman intelligence. In particular, he relates these themes to the kinds of adaptive behaviour described in the first thread.

The differing properties of the second thread come into sharper focus by returning to Turing's example, for a consideration of matters particular to humans. Although Turing's example of pausing and giving an incorrect answer is a clear example of a human limitation relative to a machine, it is possible to give an inverted example of human and machine competence that applies equally well. If the question posed to the machine were instead "Is it easy to walk from here to the nearest supermarket?", the machine's answer would depend on how its designers handled the notion of "easy to walk to". In this case, the machine must not only emulate humans' abstract cognitive limitations when solving arithmetical problems; it must also be able to respond according to human bodily limitations. One could easily imagine a failed machine calculation: the supermarket is at the end of a single straight road, with no turns, so the machine answers "yes, it is easy to walk to". But if the supermarket is very distant, or nearby but up a steep incline, then in order for the machine to give an answer that is indistinguishable from a human one, it must respond in a way that seems to share our embodied human limitations. Returning to the arithmetic example, as Doug Lenat points out, even some wrong answers are more human than others: "93 − 25 = 78 is more understandable than if the program pretends to get a wrong answer of 0 or −9998 for that subtraction problem" [14]. Although Lenat disputes the need for embodiment in AI (he prefers a central database of human common sense [13], which could likely address the "easy to walk to" example), it could be argued, following the above theoretical positions, that the set of humanlike wrong answers is ultimately determined by the "commonalities of our bodies and our bodily and social experience in the world" [11].
This second thread, which could also be characterised as the attempt to seem humanlike, is taken up in another nonlinguistic area of AI, namely musical AI. Some "intelligent" computer music composition and performance systems appear very close to achieving human indistinguishability in some respects, although this is not always their explicitly stated purpose. For example, Manfred Clynes describes a computer program that performs compositions by applying a single performer's manner of interpretation to previously unencountered material, across all instrumental voices [5]. He states that "our computer program plays music so that it is impossible to believe that no human performer is involved," which he qualifies by explaining the role of the human performer as a user of the software, who "instills the [musical performance] principles in the appropriate way". Taking an entirely different approach, David Cope argues that a Turing-like test for creativity would be more appropriate to his work than a Turing test for intelligence [6]. On the other hand, he has called his well-known project "Experiments in Musical Intelligence", and he also makes reference to "intelligent music composition". Furthermore, he states that his system generates "convincing" music in the style of a given composer (by training the system on a corpus of human-composed music), and one can infer that, in this context, "convincing" at least approximates the notion of human indistinguishability. With a more critical articulation, Pearce and Wiggins carefully differentiate between a test for what Cope calls "convincing" and a Turing test for intelligence [19]. As they point out, despite the resemblance of the two approaches, testing for intelligence is distinct from determining the "(non-)membership of a machine composition in a set of human composed pieces of music". They also note the significant difference between an interactive test and one involving passive observation.

4 Broadening the interactive horizon

One reason for isolating these two threads is to recast Turing's ideas in a wider social context, one that is better attuned to the contemporary social understanding of the role of technology research: namely, that it is primarily intended (or even expected) to enhance our lives. Outside the thought experiment, in the realm of practical application, one might redirect the resources for developing a successful Turing test candidate (e.g., for the Loebner Prize) and instead apply them toward a different kind of interactive system. This proposed system could be built so that it might be easily identified as a machine (even if occasionally mistaken for a human), which seemingly runs counter to the spirit of the Turing test. However, with an altered emphasis, one could imagine the primary function of such a machine as engaging humans in a continuous process of interaction, for a variety of purposes, including (but not limited to) stimulating human creativity and providing a realm for aesthetic exploration. One example of this kind of system is musical improvisation software that interacts with human performers in real time, in a mutually influential relationship between human and computer, such as Lewis' Voyager.
In his software design, the interaction model strongly resembles the way in which Turing describes a computer's behaviour: it is responsive, yet it does not always give the expected answer, and it might interrupt the human interlocutor or steer the interaction in a different direction (see [16]). In the case of an interactive improvising music system, the environment in which the human and computer interact is not verbal conversation but, rather, a culturally specific aesthetic context for collaborative music-making. In this sense, a musical improvisation is not an interrogation in the manner presented by Turing, yet "test" conversations and musical improvisations are both examples of free-ranging and open-ended human-computer interaction. Among other things, this kind of interaction can serve as a basis for philosophical enquiry and cognitive theory that is indeed very much in the spirit of Turing's 1950 paper [24] (see also [15] and [17]).

Adam Linson's Odessa is another intelligent musical system that is similarly rooted in freely improvised music (for a detailed description, see [18]). It borrows from Brooks' design approach in modelling the behaviour of an intentional agent, thus clearly taking up the first thread drawn out here. Significantly, it isolates this thread (intentional agency) for study by abstaining from a direct implementation of many of the available methods for human emulation (aimed at the second thread), thus resulting in transparently nonhuman musical behaviour. Nonetheless, initial empirical studies suggest that the system affords an engaging and stimulating human-computer musical interaction. As the system architecture (based on Brooks' subsumption architecture) is highly extensible, future iterations of the system may add techniques for approximating fine-grained qualities of human musicianship. In the meantime, however, further studies are planned with the existing prototype, with the aim of providing insights into aspects of human cognition as well as intelligent musical agent design.

5 Conclusion

Ultimately, whether an interactive computer system is dealing with an interrogator in the imitation game or musically improvising with a human, the system must be designed to "respond in lived real time to unexpected, real-world input" [17]. This responsiveness takes the form of what sociologist Andrew Pickering calls the "dance of agency", in which a reciprocal interplay of resistance and accommodation produces unpredictable emergent results over time [20]. This description of a sustained, continuous play of forces that "interactively stabilize" each other could be applied to freely improvised music, whether performed by humans exclusively or by humans and computers together. Pickering points out a concept similar to the process of interactive stabilisation, "heterogeneous engineering", elaborated in the work of his colleague John Law (see [12]); the latter, with its emphasis on productive output, is perhaps more appropriate to the musical context of free improvisation. Although these theoretical characterisations may seem abstract, they concretely pertain to the present topic in that they seek to address the "open-ended and performative interplay between agents that are not capable of dominating each other" [21], where the agents may include various combinations of humans, computers and other entities, and the interplay may include linguistic, musical, physical and other forms of interaction.
With particular relevance to the present context, Pickering applies his conceptual framework of agent interplay to the animal-like robots of Turing's contemporary, cybernetics pioneer Grey Walter, and to those of Brooks, designed and built decades later [21]. Returning to the main theme, following Brooks, "the dynamics of the interaction of the robot and its environment are primary determinants of the structure of its intelligence" [3]. Thus, independent of its human resemblance, an agent's ability to negotiate with an unstructured and highly dynamic musical, social or physical environment can be treated as a measure of intelligence closely aligned with what Turing thought to be discoverable with his proposed test.

REFERENCES

[1] C. Ariza, 'The interrogator as critic: The Turing test and the evaluation of generative music systems', Computer Music Journal, 33(2), 48–70, (2009).
[2] J. L. Barrett and A. H. Johnson, 'The role of control in attributing intentional agency to inanimate objects', Journal of Cognition and Culture, 3(3), 208–217, (2003).
[3] R. A. Brooks, Cambrian Intelligence: The Early History of the New AI, MIT Press, 1999.
[4] A. Clark, Being There: Putting Brain, Body, and World Together Again, MIT Press, 1997.
[5] M. Clynes, 'Generative principles of musical thought: Integration of microstructure with structure', Communication and Cognition AI, Journal for the Integrated Study of Artificial Intelligence, Cognitive Science and Applied Epistemology, 3(3), 185–223, (1986).
[6] D. Cope, Computer Models of Musical Creativity, MIT Press, 2005.
[7] G. Csibra, 'Goal attribution to inanimate agents by 6.5-month-old infants', Cognition, 107(2), 705–717, (2008).
[8] R. T. Dean, Hyperimprovisation: Computer-Interactive Sound Improvisation, AR Editions, Inc., 2003.
[9] H. Hendriks-Jansen, Catching Ourselves in the Act: Situated Activity, Interactive Emergence, Evolution, and Human Thought, MIT Press, 1996.
[10] I. Király, B. Jovanovic, W. Prinz, G. Aschersleben, and G. Gergely, 'The early origins of goal attribution in infancy', Consciousness and Cognition, 12(4), 752–769, (2003).
[11] G. Lakoff and M. Johnson, Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Thought, Basic Books, 1999.
[12] J. Law, 'On the social explanation of technical change: The case of the Portuguese maritime expansion', Technology and Culture, 28(2), 227–252, (1987).
[13] D. B. Lenat, 'Cyc: A large-scale investment in knowledge infrastructure', Communications of the ACM, 38(11), 33–38, (1995).
[14] D. B. Lenat, 'The voice of the turtle: Whatever happened to AI?', AI Magazine, 29(2), 11, (2008).
[15] G. Lewis, 'Interacting with latter-day musical automata', Contemporary Music Review, 18(3), 99–112, (1999).
[16] G. Lewis, 'Too many notes: Computers, complexity and culture in Voyager', Leonardo Music Journal, 33–39, (2000).
[17] G. Lewis, 'Improvising tomorrow's bodies: The politics of transduction', E-misférica, 4.2, (2007).
[18] A. Linson, C. Dobbyn, and R. Laney, 'Improvisation without representation: artificial intelligence and music', in Proceedings of Music, Mind, and Invention: Creativity at the Intersection of Music and Computation, (2012).
[19] M. Pearce and G. Wiggins, 'Towards a framework for the evaluation of machine compositions', in Proceedings of the AISB, pp. 22–32, (2001).
[20] A. Pickering, The Mangle of Practice: Time, Agency, and Science, University of Chicago Press, 1995.
[21] A. Pickering, The Cybernetic Brain: Sketches of Another Future, University of Chicago Press, 2010.
[22] D. Poulin-Dubois and T. R. Shultz, 'The development of the understanding of human behavior: From agency to intentionality', in Developing Theories of Mind, eds., Janet W. Astington, Paul L. Harris, and David R. Olson, 109–125, Cambridge University Press, (1988).
[23] R. Rowe, Machine Musicianship, MIT Press, 2001.
[24] A. M. Turing, 'Computing machinery and intelligence', Mind, 59(236), 433–460, (1950).

The ANYNT Project Intelligence Test Λone

Javier Insa-Cabrera1, José Hernández-Orallo2, David L. Dowe3, Sergio España4 and M. Victoria Hernández-Lloreda5

1 DSIC, Universitat Politècnica de València, Spain. email: jinsa@dsic.upv.es
2 DSIC, Universitat Politècnica de València, Spain. email: jorallo@dsic.upv.es
3 Clayton School of Information Technology, Monash University, Australia. email: david.dowe@monash.edu
4 PROS, Universitat Politècnica de València, Spain. email: sergio.espana@pros.upv.es
5 Universidad Complutense de Madrid, Spain. email: vhlloreda@psi.ucm.es
6 http://users.dsic.upv.es/proy/anynt/

Abstract. All tests in psychometrics, comparative psychology and cognition which have been put into practice either lack a mathematical (computational) foundation or lack the capability of being applied to any kind of system (humans, non-human animals, machines, hybrids, collectives, etc.); in fact, most of them lack both. In the past fifteen years, some efforts have been made to derive intelligence tests from formal intelligence definitions, or vice versa, grounded on computational concepts. However, some of these approaches have not been able to create universal tests (i.e., tests which can evaluate any kind of subject) and others have even failed to produce a feasible test. The ANYNT project was conceived to explore the possibility of defining formal, universal and anytime intelligence tests, with a feasible implementation in mind. This paper presents the basics of the theory behind the ANYNT project and describes one of the test prototypes that were developed in the project: test Λone.

Keywords: (machine) intelligence evaluation, universal tests, artificial intelligence, Solomonoff-Kolmogorov complexity.

1 INTRODUCTION

There are many examples of intelligence tests which work in practice. For instance, in psychometrics and comparative psychology, tests are used to evaluate the intelligence of a variety of subjects: children and adult Homo sapiens, other apes, cetaceans, etc. In artificial intelligence, we are well aware of some incarnations and variations of the Turing Test, such as the Loebner Prize or CAPTCHAs [32], which are also feasible and informative. However, they do not answer the pristine questions: what intelligence is and how it can be built. In the past fifteen years, some efforts have been made to derive intelligence tests from formal intelligence definitions, or vice versa, grounded on computational concepts. However, some of these approaches have not been able to create universal tests (i.e., tests which can evaluate any kind of subject) and others have even failed to produce a feasible test. The ANYNT project6 was conceived to explore the possibility of defining formal, universal and anytime intelligence tests, with a feasible implementation in mind.
In the ANYNT project we have been working on the design and implementation of a general intelligence test which can feasibly be applied to a wide range of subjects. More precisely, the goal of the project is to develop intelligence tests that are: (1) formal, by using notions from Algorithmic Information Theory (a.k.a. Kolmogorov Complexity) [24]; (2) universal, so that they are able to evaluate the general intelligence of any kind of system (human, non-human animal, machine or hybrid), each with an appropriate interface that fits its needs; and (3) anytime, so that the more time is available for the evaluation, the more reliable the measurement will be.

2 BACKGROUND

In this section, we present a short introduction to the area of Algorithmic Information Theory and the notions of Kolmogorov complexity, universal distributions, Levin's Kt complexity, and their relation to the notions of compression, the Minimum Message Length (MML) principle, prediction, and inductive inference. Then, we survey the approaches that have used these formal notions to give mathematical definitions of intelligence or to develop intelligence tests from them, starting from the compression-enhanced Turing tests, the C-test, and Legg and Hutter's definition of Universal Intelligence.

2.1 Kolmogorov complexity and universal distributions

Algorithmic Information Theory is a field in computer science that properly relates the notions of computation and information. The key idea is the notion of the Kolmogorov complexity of an object, which is defined as the length of the shortest program p that outputs a given string x on a machine U. Formally,

Definition 1 Kolmogorov Complexity

$$K_U(x) := \min_{p \,:\, U(p)=x} l(p)$$

where $l(p)$ denotes the length in bits of $p$ and $U(p)$ denotes the result of executing $p$ on $U$. For instance, if x = 1010101010101010 and U is the programming language Lisp, then $K_{Lisp}(x)$ is the length in bits of the shortest program in Lisp that outputs the string x. The relevance of the choice of U depends mostly on the size of x. Since any universal machine can emulate any other, it holds that for every two universal Turing machines U and V there is a constant $c(U, V)$, which depends only on U and V and not on x, such that for all x, $|K_U(x) - K_V(x)| \le c(U, V)$. The value of $c(U, V)$ is relatively small for sufficiently long x.

From Definition 1, we can define the universal probability for machine U as follows:

Definition 2 Universal Distribution
Given a prefix-free machine7 U, the universal probability of string x is defined as:

$$p_U(x) := 2^{-K_U(x)}$$

which gives higher probability to objects whose shortest description is small and lower probability to objects whose shortest description is large. Considering programs as hypotheses in the hypothesis language defined by the machine paves the way for the mathematical theory of inductive inference and prediction. This theory was developed by Solomonoff [28], formalising Occam's razor in a proper way for prediction, by stating that the prediction maximising the universal probability will eventually discover any regularity in the data. This is related to the notion of Minimum Message Length for inductive inference [34][35][1][33] and is also related to the notion of data compression.
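Since $K_U$ is uncomputable (a problem taken up immediately below), practical illustrations of these notions usually fall back on computable proxies. The sketch below uses off-the-shelf compression as a crude upper-bound stand-in for K and derives the corresponding "universal probability"; this is a standard expedient for illustration only, not the method used in the ANYNT project.

```python
import random
import zlib


def k_proxy(x: bytes) -> int:
    """Crude upper-bound proxy for K(x): compressed length in bits.
    The true Kolmogorov complexity is uncomputable."""
    return 8 * len(zlib.compress(x, 9))


def universal_prob_proxy(x: bytes) -> float:
    """Proxy for Definition 2's universal probability 2^(-K(x))."""
    return 2.0 ** (-k_proxy(x))


regular = b"10" * 500                                # long periodic string
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(1000))  # no pattern

# The periodic string compresses far better, so the proxy assigns it a
# shorter description and hence a higher universal probability, matching
# the intuition behind Definition 2.
print(k_proxy(regular) < k_proxy(noisy))                         # True
print(universal_prob_proxy(regular) > universal_prob_proxy(noisy))  # True
```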
One of the main problems of Algorithmic Information Theory is that Kolmogorov complexity is uncomputable. One popular solution to the problem of the computability of K for finite strings is to use a time-bounded or weighted version of Kolmogorov complexity (and, hence, of the universal distribution which is derived from it). One popular choice is Levin's Kt complexity [23][24]:

Definition 3 Levin's Kt Complexity

$$Kt_U(x) := \min_{p \,:\, U(p)=x} \left\{\, l(p) + \log time(U, p, x) \,\right\}$$

where $l(p)$ denotes the length in bits of $p$, $U(p)$ denotes the result of executing $p$ on $U$, and $time(U, p, x)$ denotes the time8 that U takes to produce x when executing p.

Finally, despite the uncomputability of K and the computational complexity of its approximations, there have been some efforts to use Algorithmic Information Theory to devise optimal search or learning strategies. Levin (or universal) search [23] is an iterative search algorithm for solving inversion problems based on Kt, which has inspired other general agent policies such as Hutter's AIXI, an agent that is able to adapt optimally9 in all environments where any other general-purpose agent can be optimal [17], and for which there is a working approximation [31][30].

7 For a convenient definition of the universal probability, we need the requirement of U being a prefix-free machine (see, e.g., [24] for details). Note also that even for prefix-free machines there are infinitely many other inputs to U that will output x, so pU(x) is a strict lower bound on the probability that U will output x (given a random input).
8 Here time does not refer to physical time but to computational time, i.e., computation steps taken by machine U. This is important, since the complexity of an object cannot depend on the speed of the machine where it is run.
9 Optimality has to be understood in an asymptotic way. First, because AIXI is uncomputable (although resource-bounded variants have been introduced and shown to be optimal in terms of time and space costs). Second, because it is based on a universal probability over a machine, and this choice determines a constant term which may be very important for small environments.
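Unlike K, Kt can be computed exactly by brute force once a (toy) reference machine is fixed. The machine below is invented purely for this illustration, and "time" is taken to be the number of emitted symbols; the particular Kt values it yields are, like any Kt values, relative to the chosen machine.

```python
import itertools
import math


def toy_machine(program: str) -> str:
    """A made-up reference machine for illustration: the first 4 bits of
    the program encode a repeat count n, the remaining bits a unit, and
    the output is the unit repeated n times."""
    if len(program) < 5:
        return ""
    n = int(program[:4], 2)
    return program[4:] * n


def kt(target: str, max_len: int = 16) -> float:
    """Brute-force Kt over the toy machine (Definition 3): the minimum of
    l(p) + log2(time) over all programs p that output `target`, taking
    the number of emitted symbols as the computation time."""
    best = math.inf
    for length in range(5, max_len + 1):
        for bits in itertools.product("01", repeat=length):
            program = "".join(bits)
            output = toy_machine(program)
            if output == target:
                steps = max(1, len(output))
                best = min(best, length + math.log2(steps))
    return best


# "100001" (6 bits) repeats the unit "01" eight times, so the 16-symbol
# target costs 6 + log2(16) = 10.0 on this machine.
print(kt("01" * 8))  # 10.0
```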
2.2 Developing mathematical definitions and tests of intelligence

Following ideas from A. M. Turing, R. J. Solomonoff, E. M. Gold, C. S. Wallace, M. Blum, G. Chaitin and others, between 1997 and 1998 some works were developed on enhancing or substituting the Turing Test [29] with inductive inference tests, using Solomonoff's prediction theory [28] and related notions, such as the Minimum Message Length (MML) principle. On the one hand, Dowe and Hajek [2][3][4] suggested the introduction of inductive inference problems in an induction-enhanced or compression-enhanced Turing Test (which they arguably called non-behavioural) in order to, among other things, completely dismiss Searle's Chinese room objection [27], and also because an inductive inference ability is a necessary (though possibly "not sufficient") requirement for intelligence. Quite simultaneously, similarly, and also independently, in [13][6] intelligence was defined as the ability to comprehend, giving a formal definition of the notion of comprehension as the identification of a "predominant" pattern in given evidence, derived from Solomonoff prediction theory concepts, Kolmogorov complexity and Levin's Kt.

The notion of comprehension was formalised by using the notion of a "projectible" pattern: a pattern that has no exceptions (no noise), and is thus able to explain every symbol in the given sequence (not only most of them). From these definitions, the basic idea was to construct a feasible test as a set of series whose shortest pattern has no alternative projectible patterns of similar complexity. That means that the "explanation" of the series had to be much more plausible than other plausible hypotheses. The main objective was to reduce the subjectivity of the test: first, because we need to choose one reference universal machine from an infinite set of possibilities; secondly, because, even after choosing one reference machine, two very different patterns could be consistent with the evidence, and if both have similar complexities, their probabilities will be close, making the series solution quite uncertain. With the constraints posed on patterns and series, both problems were not completely solved, but minimised.

k = 9 :  a, d, g, j, ...                    Answer: m
k = 12 : a, a, z, c, y, e, x, ...           Answer: g
k = 14 : c, a, b, d, b, c, c, e, c, d, ...  Answer: d

Figure 1. Examples of series of Kt complexity 9, 12, and 14 used in the C-test [6].

The definition was given as the result of a test, called the C-test [13], formed by computationally-obtained series of increasing complexity. The sequences were formatted and presented in a way quite similar to psychometric tests (see Figure 1) and, as a result, the test was administered to humans, showing a high correlation with the results of a classical psychometric (IQ) test on the same individuals. Nonetheless, the main goal was that the test could eventually be administered to other kinds of intelligent beings and systems. This was planned, but the work in [26] showed that machine learning programs could be specialised in such a way that they score reasonably well on some typical IQ tests. A more extensive treatment of this phenomenon and of the inadequacy of current IQ tests for evaluating machines can be found in [5]. This unexpected result confirmed that C-tests had important limitations and could not be considered universal in two ways: embracing the whole notion of intelligence (rather than perhaps only a part of it), and being applicable to any kind of subject (not only adult humans). The idea of extending these static tests to other factors, or of making them interactive and extensible to other kinds of subjects by the use of rewards (as in the area of reinforcement learning), was suggested in [7][8], but not fully developed into actual tests. An illustration of the classical view of an environment in reinforcement learning is seen in Figure 2, where an agent can interact through actions, rewards and observations.

Figure 2. Interaction with an Environment.

A few years later, Legg and Hutter (e.g. [21],[22]) followed the previous steps and, strongly influenced by Hutter's theory of AIXI optimal agents [16], gave a new definition of machine intelligence, dubbed "Universal10 Intelligence", also grounded in Kolmogorov complexity and Solomonoff's ("inductive inference" or) prediction theory. The key idea is that the intelligence of an agent is evaluated as some kind of sum (or weighted average) of performances in all possible environments (as in Figure 2).

10 The term "universal" here does not refer to the definition (or a derived test) being applicable to any kind of agent, but to the use of Solomonoff's universal distribution and to the view of the definition as an extremely general view of intelligence.
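Before turning to the definition itself, it may help to pin down the interaction protocol of Figure 2 in code form. The following is a generic sketch of the action/reward/observation loop; the interfaces are invented here for illustration and are not the ANYNT implementation.

```python
from abc import ABC, abstractmethod


class Environment(ABC):
    """One interaction step: the environment consumes an action and
    emits a reward and an observation (cf. Figure 2)."""

    @abstractmethod
    def step(self, action):
        """Return (reward, observation) after performing `action`."""


class Agent(ABC):
    @abstractmethod
    def act(self, reward, observation):
        """Choose the next action given the last reward and observation."""


def interact(agent, env, n_steps):
    """Run the action/reward/observation loop and collect the rewards."""
    rewards, reward, obs = [], 0.0, None
    for _ in range(n_steps):
        action = agent.act(reward, obs)
        reward, obs = env.step(action)
        rewards.append(reward)
    return rewards
```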
The definition based on the C-test can now be considered a static precursor of Legg and Hutter's work: the environment outputs no rewards, and the agent is not allowed to make an action until several observations have been seen (the inductive inference or prediction sequence). The point in favour of active environments (in contrast to passive environments) is that the former not only require inductive and predictive abilities to model the environment, but also some planning abilities to use this knowledge effectively through actions. Additionally, perception, selective attention and memory abilities must be fully developed. Not all of this is needed to score well in a C-test, for instance. While the C-test selects problems by (intrinsic) difficulty (which can be chosen to fit the level of intelligence of the evaluee), Legg and Hutter's approach selects problems by using a universal distribution, which gives more probability to simple environments. Legg and Hutter's definition, given an agent π, is as follows:

Definition 4 Universal Intelligence [22]

$$\Upsilon(\pi, U) = \sum_{\mu=1}^{\infty} p_U(\mu) \cdot E\left[ \sum_{i=1}^{\infty} r_i^{\mu,\pi} \right]$$

where µ is any environment coded on a universal machine U, π is the agent to be evaluated, and $r_i^{\mu,\pi}$ is the reward obtained by π in µ at interaction i. E is the expected reward in each environment, where environments are assigned a probability $p_U(\mu)$ using a universal distribution [28].

Definition 4, although very simple, captures one of the broadest definitions of intelligence: "the ability to adapt to a wide range of environments". However, this definition was not meant to be eventually converted into a test. In fact, there are three obvious problems with making it practical. First, we have two infinite sums in the definition: one is the sum over all environments, and the second is the sum over all interactions (the agent's life in each environment is infinite). And, finally, K is not computable. Additionally, we also have the dependence on the reference machine U. This dependence holds even though we consider an infinite number of environments: the universal distribution for a machine U could give the highest probabilities (0.5, 0.25, ...) to quite different environments than those given by another machine V. Despite all these problems, it could seem that just taking a random finite sample of environments, limiting the number of interactions or cycles of the agent with the environment, and using some computable variant of K would be sufficient to make it a practical test. However, on the one hand, this is not so easy, and, on the other hand, the definition has many other problems (some related and others unrelated). The realisation of these problems and the search for solutions in the quest for a practical intelligence test is the goal of the ANYNT project.
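The naive approximation just dismissed (a finite random sample of environments, a bounded number of interactions, and a computable stand-in for K) is at least easy to state in code, which makes the remaining difficulties easier to discuss. The sketch below is exactly that naive estimator; every interface in it is hypothetical.

```python
def estimate_upsilon(agent_factory, environments, complexity,
                     n_steps=100, n_episodes=10):
    """Naive Monte Carlo approximation of Definition 4.

    `environments` is a finite sample of environment objects exposing
    reset() and step(action) -> (reward, observation); `complexity(env)`
    is some computable stand-in for K(env), e.g. the length in bits of
    the environment's description. All interfaces here are hypothetical.
    """
    total = 0.0
    for env in environments:
        weight = 2.0 ** (-complexity(env))    # proxy for p_U(mu)
        returns = []
        for _ in range(n_episodes):
            agent = agent_factory()           # fresh agent per episode
            env.reset()
            reward, obs, acc = 0.0, None, 0.0
            for _ in range(n_steps):          # truncated infinite sum
                action = agent.act(reward, obs)
                reward, obs = env.step(action)
                acc += reward
            returns.append(acc)
        total += weight * (sum(returns) / len(returns))  # empirical E[.]
    return total
```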
3 ANYTIME UNIVERSAL TESTS

This section presents a summary of the theory in [11]. The reader is referred to that paper for further details.

3.1 On the difficulty of environments

The first issue concerns how to sample environments. Just using the universal distribution for this, as suggested by Legg and Hutter, will mean that very simple environments are output again and again; note that an environment µ with K(µ) = 1 will appear half of the time. Of course, repeated environments must be ruled out, but the sample would then almost become an enumeration from low to high K. This will still omit or underweight very complex environments, because their probability is so low. Furthermore, measuring rewards on very small environments will give very unstable results and be very dependent on the reference machine. And even ignoring this, it is not clear that an agent that solves all the problems of complexity lower than 20 bits, and none of those whose complexity is larger than 20 bits, is more intelligent than another agent that does reasonably well in every environment. This contrasts with the view of the C-test, which focuses on the issue of difficulty and does not make the probability of a problem appearing inversely related to this difficulty.

In any case, before going on, we need to clarify the notions of simple/easy and complex/difficult that are used here. For instance, just choosing an environment with high K does not ensure that the environment is indeed complex. As Figure 3 illustrates, the relation is unidirectional: given a low K, we can affirm that the environment will look simple; conversely, for an intuitively complex environment, K must necessarily be high.

Environment with high K  ⇐  Intuitively complex (difficult) environment
Environment with low K   ⇒  Intuitively simple (easy) environment

Figure 3. Relation between K and intuitive complexity.

Given this relation, only among environments with high K will we find complex environments, and, among the latter, not all will be difficult. From the agent's perspective, however, this is more extreme, since many environments with high K will contain difficult patterns that are never accessed through the agent's interactions. As a result, the environment will be probabilistically simple. Thus, giving most of the probability to environments with low K means that most of the intelligence measure will come from patterns that are extremely simple.

3.2 Selecting discriminative environments

Furthermore, many environments (either simple or complex) will be completely useless for evaluating intelligence, e.g., environments that stop interacting, environments with constant rewards, etc. If we are able to make a more accurate sample, we will be able to make a more efficient test procedure. The question here is to determine a non-arbitrary criterion for excluding some environments. For instance, Legg and Hutter's definition forces environments to interact infinitely, and since the description must be finite, there must be a pattern. This obviously includes environments such as "always output the same observation and reward"; in fact, such environments are not only possible but highly probable on many reference machines. Another pathological case is an environment that "outputs observations and rewards at random"; this, however, has high complexity if we assume deterministic environments. In both cases, the behaviour of any agent in these environments would be almost the same. In other words, they have no discriminative power and would therefore be useless for discriminating between agents. In an interactive environment, a clear requirement for an environment to be discriminative is that what the agent does must have consequences on rewards. Thus, we will restrict environments to be sensitive to agents' actions.
3.2 Selecting discriminative environments

Furthermore, many environments (either simple or complex) will be completely useless for evaluating intelligence, e.g., environments that stop interacting, environments with constant rewards, etc. If we are able to make a more accurate sample, we will be able to make a more efficient test procedure. The question here is to determine a non-arbitrary criterion for excluding some environments. For instance, Legg and Hutter's definition forces environments to interact infinitely, and since the description must be finite, there must be a pattern. This obviously includes environments such as "always output the same observation and reward". In fact, such environments are not only possible but highly probable on many reference machines. Another pathological case is an environment that "outputs observations and rewards at random" (which has high complexity if we assume deterministic environments). In both cases, the behaviour of any agent on these environments would be almost the same. In other words, they have no discriminative power, and would therefore be useless for discriminating between agents. In an interactive environment, a clear requirement for an environment to be discriminative is that what the agent does must have consequences on rewards. Thus, we will restrict environments to be sensitive to agents' actions. That means that a wrong action might lead the agent to a part of the environment from which it can never return (non-ergodic), but at least the actions taken by the agent can modify the rewards in that subenvironment. More precisely, we want an agent to be able to influence rewards at any point in any subenvironment. This does not imply ergodicity but reward sensitivity at any moment: the agent can never reach a point from which rewards are given independently of what it does (a dead-end).

3.3 Symmetric rewards and balanced environments

An important issue is how to estimate rewards. If we only use positive rewards, we find some problems. For example, an increase in the score may originate from really good behaviour in the environment or simply from the accumulation of rewards, since they are always positive. Instead, an average reward seems a better payoff function. Our proposal is to use symmetric rewards, which can range between −1 and 1:

Definition 5 Symmetric Rewards. We say an environment has symmetric rewards when $\forall i : -1 \leq r_i \leq 1$.

If we use symmetric rewards, we also expect environments to be symmetric or, more precisely, to be balanced in how they give rewards. This can be seen in the following way. In a reliable test, we would like many (if not all) environments to give an expected reward of 0 to random agents. This excludes both hostile and benevolent environments, i.e., environments where acting randomly will get more negative (respectively, positive) rewards than positive (respectively, negative) ones. In many cases it is not difficult to prove that a particular environment is balanced. Another approach is to set a reference machine that only generates balanced environments. Using this approach to rewards, we can use an average to estimate the results on each environment, namely:

Definition 6 Average Reward. Given an environment µ, with $n_i$ being the number of completed interactions, the average reward for agent π is defined as

$$v_\mu^\pi(n_i) = \frac{\sum_{i=1}^{n_i} r_i^{\mu,\pi}}{n_i}$$

Now we can calculate the expected value of the previous average (although the limit may not exist), denoted by $E(v_\mu^\pi)$, for an arbitrarily large value of $n_i$.
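Definitions 5 and 6 are straightforward to operationalise. The sketch below computes the average reward and adds an empirical balancedness check: run random agents and verify that their mean average-reward stays close to 0. The `env_factory` interface and the Monte Carlo check are assumptions of this illustration (the paper suggests proving balance, or constraining the reference machine, rather than testing for it):

```python
import random

def average_reward(rewards):
    """Definition 6: mean reward over the completed interactions."""
    return sum(rewards) / len(rewards)

def looks_balanced(env_factory, n_actions, episodes=500, steps=100, tol=0.05):
    """Empirical balance check: a random agent should earn an expected
    average reward of ~0 if rewards are symmetric in [-1, 1]."""
    means = []
    for _ in range(episodes):
        env = env_factory()
        rewards = [env.step(random.randrange(n_actions))[1]
                   for _ in range(steps)]
        means.append(average_reward(rewards))
    grand_mean = sum(means) / len(means)
    return abs(grand_mean) < tol, grand_mean

class CoinEnv:
    """Trivial balanced toy: step() returns (observation, reward)."""
    def step(self, action):
        return None, random.choice((-1.0, 1.0))

print(looks_balanced(CoinEnv, n_actions=2))   # (True, ~0.0)
```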
The test framework in [11] solves some of these issues (and many other problems with the measure): it uses a random finite sample of environments, it limits the number of interactions of the agent with the environment, it selects a discriminative set of environments, etc.

4 ENVIRONMENT CLASS

The previous theory, however, does not make the choice of an environment class; it just sets some constraints on the kind of environments that can be used. Consequently, one major open problem is to make this choice, i.e., to find a proper (unbiased) environment class which follows the constraints and, more difficult, which can be feasibly implemented. Once this environment class is identified, we can use it to generate environments to run any of the test variants. Additionally, it is necessary not only to determine the environment class, but also to fix the universal machine used to compute the Kolmogorov complexity of each environment, since the tests only use a (small) sample of environments and the sample probability is defined in terms of this complexity.

In the previous section we defined a set of properties required for making environments discriminative, namely that observations and rewards must be sensitive to the agent's actions and that environments must be balanced. Given these constraints, if we generate environments without restriction and then apply a post-processing sieve to select those which comply with all the constraints, we face a computationally very expensive or even incomputable problem. So the approach taken is to define an environment class that ensures these properties hold by construction. In any case, we have to be very careful, because we would not like to restrict the reference machine to comply with these properties at the cost of losing universality (i.e., the ability to emulate or include any computable function). And finally, we would like the environment class to be user-friendly for the kinds of systems we want to evaluate (humans, non-human animals and machines), but without any bias in favour of or against some of them. According to all this, we define a universal environment class from which we can effectively generate valid environments, calculate their complexity and consequently derive their probability.

4.1 On actions, observations and space

Back to Figure 2 again: actions are limited by a finite set of symbols A (e.g., {left, right, up, down}), rewards are taken from any subset R of rational numbers between −1 and 1, and observations are also limited by a finite set O of possibilities (e.g., the contents of an n × m grid of binary cells, or a set of light-emitting diodes, LEDs). We will use $a_i$, $r_i$ and $o_i$ to denote (respectively) action, reward and observation at interaction i. Apart from the behaviour of an environment, which may vary from very simple to very complex, we must first clarify the interface. How many actions are we going to allow? How many different observations? The very definition of environment makes both actions and observations finite sets of symbols. It is clear that the minimum number of actions has to be two, but no upper limit seems to be decidable a priori. The same happens with observations. Even choosing two for both, a sequence of interactions can be as rich as the expressiveness of a Turing machine allows.

Before getting into the details of the interface, we have to think about environments that can contain agents. This is not only the case in real life (where agents are known as inanimate or animate objects, animals being among the latter), but also a requirement for evolution and, hence, for intelligence as we know it. The existence of several interacting agents requires a space. The space is not necessarily a virtual or physical space; it can also be a set of common rules (or laws) that govern what the agents can perceive and what the agents can do. On top of this set of common rules, specific rules can be added for each agent. In the real world, this set of common rules is physics. All this has been extensively analysed in multi-agent systems (see e.g. [20] for a discussion). The good thing about thinking in terms of spaces is that a space entails the possible perceptions and actions: if we define a common space, many choices about observations and actions are already taken. A first (and common) idea for a space is a 2D grid. In a 2D grid, the observation is a picture of the grid with all the objects and agents inside. In a simple grid where agents and objects occupy cells, the typical actions are the movements left, right, up and down. Alternatively, of course, we could use a 3D space, since our world is 3D.
In fact, there are some results on intelligence testing (for animals or humans) with a 3D interface [25][36]. The problem of a 2D or 3D grid is that it is clearly biased in favour of humans and the many other animals which have hardwired abilities for orientation in this kind of space. Other kinds of animals, or handicapped people (e.g., blind people), might have difficulties in this type of space. Additionally, artificial intelligence agents would benefit greatly from hardwired functionalities for Euclidean distance and 2D movement, without any real improvement in their general intelligence. Instead, we propose a more general kind of space. A 2D grid is a graph with a very special topology, where concepts such as direction and adjacency hold. A generalisation is a graph where the cells are freely connected to some other cells with no particular predefined pattern. This suggests a (generally) dimensionless space. Connections between cells determine part or all of the possible actions, and observations and rewards can easily be shown graphically.

4.2 Definition of the environment class

After the previous discussion, we are ready to give the definition of the environment class. First we must define the space and objects, and from these the observations, actions and rewards. Before that, we have to define some constants that affect each environment. Namely, with $n_a = |A| \geq 2$ we denote the number of actions, with $n_c \geq 2$ the number of cells, and with $n_\omega$ the number of objects/agents (not including the agent which is to be evaluated and two special objects known as Good and Evil).

4.2.1 Space

The space is defined as a directed labelled graph of $n_c$ nodes (or vertices), where each node represents a cell. Nodes are numbered starting from 1, so cells are referred to as $C_1, C_2, \ldots, C_{n_c}$. From each cell we have $n_a$ outgoing arrows (or arcs), each of them denoted as $C_i \rightarrow_\alpha C_j$, meaning that action $\alpha \in A$ goes from $C_i$ to $C_j$. All the outgoing arrows from $C_i$ are denoted by $\hat{C}_i$. At least two of the outgoing arrows must lead to different cells. Formally, $\forall C_i : \exists r_1, r_2 \in \hat{C}_i$ such that $r_1 = C_i \rightarrow_{\alpha_m} C_j$ and $r_2 = C_i \rightarrow_{\alpha_n} C_k$ with $C_j \neq C_k$ and $\alpha_m \neq \alpha_n$. At least one of the outgoing arrows from a cell must lead to itself (typically denoted by $\alpha_1$, the first action). Formally, $\forall C_i : \exists r \in \hat{C}_i$ such that $r = C_i \rightarrow_{\alpha_1} C_i$. A path from $C_i$ to $C_m$ is a sequence of arrows $C_i \rightarrow C_j, C_j \rightarrow C_k, \ldots, C_l \rightarrow C_m$. The graph must be strongly connected, i.e., for every two cells $C_i, C_j$ there exists a path from $C_i$ to $C_j$.
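The constraints on the space are simple enough to enforce constructively. Below is a small sketch that generates a random space satisfying them; the rejection-sampling loop for strong connectivity is an assumption of this sketch, since the paper defines the class but does not fix a particular generation procedure:

```python
import random

def strongly_connected(arrows):
    """True iff every cell is reachable from cell 0 and cell 0 is
    reachable from every cell (standard forward/backward reachability)."""
    nc = len(arrows)
    def reaches_all(graph):
        seen, stack = {0}, [0]
        while stack:
            for dest in graph[stack.pop()]:
                if dest not in seen:
                    seen.add(dest)
                    stack.append(dest)
        return len(seen) == nc
    reverse = [[] for _ in range(nc)]
    for cell, row in enumerate(arrows):
        for dest in row:
            reverse[dest].append(cell)
    return reaches_all(arrows) and reaches_all(reverse)

def random_space(nc, na, rng=random):
    """arrows[c][a] = destination of action a taken in cell c.
    Enforces: action 0 is a self-loop; at least two outgoing arrows
    lead to different cells; the graph is strongly connected."""
    while True:
        arrows = [[c] + [rng.randrange(nc) for _ in range(na - 1)]
                  for c in range(nc)]
        if all(len(set(row)) >= 2 for row in arrows) \
                and strongly_connected(arrows):
            return arrows

print(random_space(nc=5, na=3))
```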
4.2.2 Objects

Cells can contain objects from a set of predefined objects Ω, with $n_\omega = |\Omega|$. Objects, denoted by $\omega_i$, can be animate or inanimate, but this can only be perceived through the rules each object has. An object is inanimate (for a period or indefinitely) when it performs action $\alpha_1$ repeatedly. Objects can perform actions following the space rules but, apart from these rules, they can have any behaviour, deterministic or not. Objects can be reactive and can be defined to act with different actions according to their observations. Objects perform one and only one action at each interaction of the environment (except for the special objects Good and Evil, which can perform several actions in a row). Apart from the evaluated agent π, as we have mentioned, there are two special objects called Good and Evil. Good and Evil must have the same behaviour. By the same behaviour we do not mean that they perform the same movements, but that they have the same logic or program behind them. Objects can share the same cell, except Good and Evil, which cannot be in the same cell. If their behaviour leads them to the same cell, then one (chosen randomly with equal probability) moves to the intended cell and the other remains at its original cell. Because of this, the environment becomes stochastic (non-deterministic). Objects are placed randomly in the cells at the initialisation of the environment. This is another source of stochastic behaviour.

4.2.3 Observations and Actions

The observation is a sequence of cell contents, with the cells ordered by their number. Each element in the sequence shows the presence or absence of each object, including the evaluated agent. Additionally, each cell which is reachable by an action includes the information of the action leading to it.

4.2.4 Rewards

Raw rewards are defined as a function of the position of the evaluated agent π and the positions of Good and Evil. For the rewards, we will work with the notion of trace and the notion of "cell reward", which we denote by $r(C_i)$. Initially, $r(C_i) = 0$

[...] complexities, and we analysed whether the obtained results correlated with the measure of difficulty. The results were clear, showing that the evaluation obtains the expected results in terms of the relation between expected reward and theoretical problem difficulty. It also showed reasonable differences with respect to other baseline algorithms (e.g., a random algorithm). All this supported the idea that the test and the environment class used are in the right direction for evaluating a specific kind of system. However, the main question was whether the approach was in the right direction in terms of constructing universal tests. In other words, it was still necessary to demonstrate whether the test serves to evaluate several kinds of systems and put their results on the same scale. In [18] we compared the results of two different systems (humans and AI algorithms) by using the prototype described in this paper and the interface for humans. We set both systems to interact with exactly the same environments. The results, not surprisingly, did not show the expected difference in intelligence between reinforcement learning algorithms and humans. This is explained by several reasons. One of them is that the environments were still relatively simple, and reinforcement learning algorithms could still capture and represent the whole state matrix for these problems with some partial success. Another reason is that the exercises were independent, so humans could not reuse what they were learning in some exercises for others, an issue where humans are supposed to be better than these simple reinforcement learning algorithms. Another possibility is the fact that the environments had very few agents, and the few agents that existed were not reactive. This makes the state space bounded, which is beneficial for Q-learning. Similarly, the environments had no noise. All these decisions were made on purpose to keep things simple, and also to be able to formally derive the complexity of the environments. In general, other explanations can be found as well, since the lack of other interactive agents can be seen as a lack of social behaviours, as we explored in subsequent works [12].
Of course, test Λ^one was just a first prototype, which does not incorporate many of the features of an anytime intelligence test and the measuring framework. In particular, the prototype is not anytime, so the test does not adapt its complexity to the subject being evaluated. Also, we made some simplifications to the environment class, causing objects to lose reactivity. Furthermore, it is very difficult to construct any kind of social behaviour by creating agents from scratch. These and other issues are being addressed in new prototypes, some of them under development.

6 CONCLUDING REMARKS

The ANYNT project aimed at exploring the possibility of formal, universal and feasible tests. As already said, test Λ^one is just one prototype that does not implement all the features of the theory of anytime universal tests. However, it is already very informative. For instance, the experimental results show that test Λ^one goes in the right direction, but it still fails to capture some components of intelligence that should put different kinds of individuals on the right scale. In defence of test Λ^one, we have to say that it is quite rare in the literature to find the same test applied to different kinds of individuals.^11 In fact, as argued in [5], relatively simple programs can get good scores on conventional IQ tests, while small children (with high potential intelligence) will surely fail. Similarly, illiterate people and most children would score very badly at the Turing Test, for instance. And humans are starting to struggle with many CAPTCHAs. All this means that many feasible and practical tests work because they are specialised for specific populations. As the diversity of subjects is enlarged, measuring intelligence becomes more difficult and less accurate. As a result, the mere possibility of constructing universal tests is still a hot question. While many may think that this question is irresoluble, we think that unless an answer to it is found, it will be very difficult (if not impossible) to assess the diversity of intelligent agents that are envisaged for the forthcoming decades. One way or another, there is clearly an ocean of scientific questions beyond the Turing Test.

^11 The only remarkable exceptions are the works in comparative psychology, such as [14][15], which are conscious of the difficulties of using the same test, with different interfaces, for different subjects.

ACKNOWLEDGEMENTS

This work was supported by the MEC projects EXPLORA-INGENIO TIN 2009-06078-E, CONSOLIDER-INGENIO 26706 and TIN 2010-21062-C02-02, and GVA project PROMETEO/2008/051. Javier Insa-Cabrera was sponsored by Spanish MEC-FPU grant AP2010-4389.

REFERENCES

[1] D. L. Dowe, 'Foreword re C. S. Wallace', The Computer Journal, 51(5), 523–560, Christopher Stewart Wallace (1933–2004) memorial special issue, (2008).
[2] D. L. Dowe and A. R. Hajek, 'A computational extension to the Turing Test', in Proceedings of the 4th Conference of the Australasian Cognitive Science Society, University of Newcastle, NSW, Australia, (1997).
[3] D. L. Dowe and A. R. Hajek, 'A computational extension to the Turing Test', Technical Report #97/322, Dept. of Computer Science, Monash University, Melbourne, Australia, 9pp, http://www.csse.monash.edu.au/publications/1997/tr-cs97-322abs.html, (1997).
[4] D. L. Dowe and A. R. Hajek, 'A non-behavioural, computational extension to the Turing Test', in International Conference on Computational Intelligence & Multimedia Applications (ICCIMA'98), Gippsland, Australia, pp.
101–106, (1998).
[5] D. L. Dowe and J. Hernandez-Orallo, 'IQ tests are not for machines, yet', Intelligence, 40(2), 77–81, (2012).
[6] J. Hernández-Orallo, 'Beyond the Turing Test', Journal of Logic, Language and Information, 9(4), 447–466, (2000).
[7] J. Hernández-Orallo, 'Constructive reinforcement learning', International Journal of Intelligent Systems, 15(3), 241–264, (2000).
[8] J. Hernández-Orallo, 'On the computational measurement of intelligence factors', in Performance Metrics for Intelligent Systems Workshop, ed., A. Meystel, pp. 1–8. National Institute of Standards and Technology, Gaithersburg, MD, U.S.A., (2000).
[9] J. Hernández-Orallo, 'A (hopefully) non-biased universal environment class for measuring intelligence of biological and artificial systems', in Artificial General Intelligence, 3rd International Conference AGI, Proceedings, eds., Marcus Hutter, Eric Baum, and Emanuel Kitzelmann, "Advances in Intelligent Systems Research" series, pp. 182–183. Atlantis Press, (2010).
[10] J. Hernández-Orallo, 'On evaluating agent performance in a fixed period of time', in Artificial General Intelligence, 3rd Intl Conf, ed., M. Hutter et al., pp. 25–30. Atlantis Press, (2010).
[11] J. Hernández-Orallo and D. L. Dowe, 'Measuring universal intelligence: Towards an anytime intelligence test', Artificial Intelligence, 174(18), 1508–1539, (2010).
[12] J. Hernández-Orallo, D. L. Dowe, S. España-Cubillo, M. V. Hernández-Lloreda, and J. Insa-Cabrera, 'On more realistic environment distributions for defining, evaluating and developing intelligence', in Artificial General Intelligence 2011, eds., J. Schmidhuber, K. R. Thórisson, and M. Looks, volume 6830 of LNAI, pp. 82–91. Springer, (2011).
[13] J. Hernández-Orallo and N. Minaya-Collado, 'A formal definition of intelligence based on an intensional variant of Kolmogorov complexity', in Proceedings of the International Symposium of Engineering of Intelligent Systems (EIS'98), pp. 146–163. ICSC Press, (1998).
[14] E. Herrmann, J. Call, M. V. Hernández-Lloreda, B. Hare, and M. Tomasello, 'Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis', Science, 317(5843), 1360–1366, (2007).
[15] E. Herrmann, M. V. Hernández-Lloreda, J. Call, B. Hare, and M. Tomasello, 'The structure of individual differences in the cognitive abilities of children and chimpanzees', Psychological Science, 21(1), 102, (2010).
[16] M. Hutter, Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability, Springer, 2005.
[17] M. Hutter, 'Universal algorithmic intelligence: A mathematical top→down approach', in Artificial General Intelligence, eds., B. Goertzel and C. Pennachin, Cognitive Technologies, 227–290, Springer, Berlin, (2007).
[18] J. Insa-Cabrera, D. L. Dowe, S. España-Cubillo, M. V. Hernández-Lloreda, and J. Hernández-Orallo, 'Comparing humans and AI agents', in Artificial General Intelligence 2011, eds., J. Schmidhuber, K. R. Thórisson, and M. Looks, volume 6830 of LNAI, pp. 122–132. Springer, (2011).
[19] J. Insa-Cabrera, D. L. Dowe, and J. Hernández-Orallo, 'Evaluating a reinforcement learning algorithm with a general intelligence test', in CAEPIA, Advances in Artificial Intelligence, volume 7023 of LNCS, pp. 1–11. Springer, (2011).
[20] D. Keil and D. Goldin, 'Indirect interaction in environments for multi-agent systems', Environments for Multi-Agent Systems II, 68–87, (2006).
Goldin, ‘Indirect interaction in environments for multi-agent systems’, Environments for Multi-Agent Systems II, 68–87, (2006). [21] S. Legg and M. Hutter, ‘A universal measure of intelligence for artificial agents’, in International Joint Conference on Artificial Intelligence, volume 19, p. 1509, (2005). [22] S. Legg and M. Hutter, ‘Universal intelligence: A definition of machine intelligence’, Minds and Machines, 17(4), 391–444, (2007). http://www.vetta.org/documents/UniversalIntelligence.pdf. [23] L. A. Levin, ‘Universal sequential search problems’, Problems of Information Transmission, 9(3), 265–266, (1973). [24] M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its applications (3rd ed.), Springer-Verlag New York, Inc., 2008. [25] F. Neumann, A. Reichenberger, and M. Ziegler, ‘Variations of the turing test in the age of internet and virtual reality’, in Proceedings of the 32nd annual German conference on Advances in artificial intelligence, pp. 355–362. Springer-Verlag, (2009). [26] P. Sanghi and D. L. Dowe, ‘A computer program capable of passing IQ tests’, in Proc. 4th ICCS International Conference on Cognitive Science (ICCS’03), Sydney, Australia, pp. 570–575, (July 2003). [27] J. Searle, ‘Minds, brains, and programs’, Behavioral and Brain Sciences, 3(3), 417–457, (1980). [28] R. J. Solomonoff, ‘A formal theory of inductive inference. Part I’, Information and control, 7(1), 1–22, (1964). [29] A. M. Turing, ‘Computing machinery and intelligence’, Mind, 59, 433– 460, (1950). [30] J. Veness, K. S. Ng, M. Hutter, and D. Silver, ‘Reinforcement learning via AIXI approximation’, in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), pp. 605–611, (2010). [31] J. Veness, K.S. Ng, M. Hutter, W. Uther, and D. Silver, ‘A Monte Carlo AIXI Approximation’, Journal of Artificial Intelligence Research, 40(1), 95–142, (2011). [32] L. Von Ahn, M. Blum, and J. Langford, ‘Telling humans and computers apart automatically’, Communications of the ACM, 47(2), 56–60, (2004). [33] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Ed. Springer-Verlag, 2005. [34] C. S. Wallace and D. M. Boulton, ‘A information measure for classification’, The Computer Journal, 11(2), 185–194, (1968). [35] C. S. Wallace and D. L. Dowe, ‘Minimum message length and Kolmogorov complexity’, Computer Journal, 42(4), 270–283, (1999). Special issue on Kolmogorov complexity. [36] D.A. Washburn and R.S. Astur, ‘Exploration of virtual mazes by rhesus monkeys ( macaca mulatta )’, Animal Cognition, 6(3), 161–168, (2003). [37] C.J.C.H. Watkins and P. Dayan, ‘Q-learning’, Mach. learning, 8(3), 279–292, (1992). AISB/IACAP 2012 Symposium: Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World 27 Turing Machines and Recursive Turing Tests José Hernández-Orallo1 and Javier Insa-Cabrera2 and David L. Dowe3 and Bill Hibbard4 Abstract. The Turing Test, in its standard interpretation, has been dismissed by many as a practical intelligence test. In fact, it is questionable that the imitation game was meant by Turing himself to be used as a test for evaluating machines and measuring the progress of artificial intelligence. In the past fifteen years or so, an alternative approach to measuring machine intelligence has been consolidating. 
The key concept for this alternative approach is not the Turing Test but the Turing machine, and some theories built upon it, such as Solomonoff's theory of prediction, the MML principle, Kolmogorov complexity and algorithmic information theory. This presents an antagonistic view to the Turing Test: intelligence tests which are based on formal principles, are not anthropocentric, are meaningful computationally, and in which the abilities (or factors) being evaluated can be recognised and quantified. Recently, however, this computational view has been touching upon issues which are somewhat related to the Turing Test, namely that we may need other intelligent agents in the tests. Motivated by these issues (and others), this paper links these two antagonistic views by bringing some of the ideas around the Turing Test to the realm of Turing machines.

Keywords: Turing Test, Turing machines, intelligence, learning, imitation games, Solomonoff-Kolmogorov complexity.

^1 DSIC, Universitat Politècnica de València, Spain. email: jorallo@dsic.upv.es
^2 DSIC, Universitat Politècnica de València, Spain. email: jinsa@dsic.upv.es
^3 Clayton School of Information Technology, Monash University, Australia. email: david.dowe@monash.edu
^4 Space Science and Engineering Center, University of Wisconsin - Madison, USA. email: test@ssec.wisc.edu

1 INTRODUCTION

Humans have been evaluated by other humans in all periods of history. It was only in the 20th century, however, that psychometrics was established as a scientific discipline. Other animals have also been evaluated by humans, but certainly not in the context of psychometric tests; instead, comparative cognition is nowadays an important area of research where non-human animals are evaluated and compared. Machines —yet again differently— have also been evaluated by humans. However, no scientific discipline has been established for this. The Turing Test [31] is still the most popular test for machine intelligence, at least for philosophical and scientific discussions. The Turing Test, as a measurement instrument and not as a philosophical argument, is very different from the instruments other disciplines use to measure intelligence in a scientific way. The Turing Test resembles a much more customary (and non-scientific) assessment, which happens when humans interview or evaluate other humans (for whatever reason, including, e.g., personnel selection, sports^1 or other competitions). The most relevant (and controversial) feature of the Turing Test is that it takes humans as a touchstone to which machines should be compared. In fact, the comparison is not performed by an objective criterion but assessed by human judges, which is not without controversy. Another remarkable feature (and perhaps a less controversial one) is that the Turing Test is set on an intentionally restrictive interaction channel: a teletype conversation. Finally, there are some features of the Turing Test which make it more general than other kinds of intelligence tests. For instance, it is becoming increasingly better known that programs can do well at human IQ tests [32][8], because ordinary IQ tests only evaluate narrow abilities and assume that narrow abilities accurately reflect human abilities across a broad set of tasks, which may not hold for non-human populations. The Turing Test (and some formal intelligence measures we will review in the following section) can test broad sets of tasks. We must say that Turing cannot be blamed for all the controversy.
The purpose of Turing's imitation game [37] was to show that intelligence could be assessed and recognised in a behavioural way, without the need for directly measuring or recognising other physical or mental phenomena such as thinking, consciousness, etc. In Turing's view, intelligence can just be seen as a cognitive ability (or property) that some machines might have and others might not. In fact, the standard scientific view should converge to defining intelligence as an ability that some systems (humans, non-human animals, machines, and collectives thereof) might or might not have, or, more precisely, might have to a greater or lesser degree. This view has clearly been spread by the popularity of psychometrics and IQ tests.^2 While there have been many variants and extensions of the Turing Test (see [33] or [31] for an account of these), none of them (and none of the approaches in psychometrics and animal cognition, either) have provided a formal, mathematical definition of what intelligence is and how it can be measured.

^1 In many sports, to see how good a player is, we want competent judges but also appropriate team-mates and opponents. Good tournaments and competitions are largely designed so as to return (near) maximal expected information.

^2 In fact, the notion of consciousness and other phenomena is today better separated from intelligence than it was sixty years ago. They are now seen as related but different things. For instance, nobody doubts that a team of people can score well in a single IQ test (working together). In fact, the team, using a teletype communication as in the Turing Test, can dialogue, write poetry, make jokes, do complex mathematics and all these human things. They can even do these things continuously for days or weeks, while some of the particular individuals rest, eat, go to sleep, die, etc. Despite all of this happening on the other side of the teletype communication, the system is just regarded as one subject. So the fact that we can effectively measure the cognitive abilities of the team, or even make the team pass the Turing Test, does not lead us directly to statements such as 'the team has a mind' or 'the team is conscious'. At most, we say this in a figurative sense, as we use it for the collective consciousness of a company or country. In the end, the 'team of people' is one of the best arguments against Searle's Chinese room and a good reference whenever we are thinking about evaluating intelligence.

A different approach is based on one of the things that the Turing Test is usually criticised for: learning.^3 This alternative approach requires a proper definition of learning, and actual mechanisms for measuring learning ability. Interestingly, the answer to this is given by notions devised from Turing machines. In the 1960s, Ray Solomonoff 'solved' the problem of induction (and the related problems of prediction and learning) [36] by the use of Turing machines. This, jointly with the theory of inductive inference given by the Minimum Message Length (MML) principle [39, 40, 38, 5], algorithmic information theory [1], Kolmogorov complexity [25, 36] and compression theory, paved the way in the 1990s for a new approach for defining and measuring intelligence based on algorithmic information theory. This approach will be summarised in the next section.
While initially there was some connection to the Turing Test, this line of research has been evolving and consolidating over the past fifteen years (or more), cutting all the links to the Turing Test. It has provided important insights into what intelligence is and how it can be measured, and has given clues to the (re-)understanding of other areas where intelligence is defined and measured, such as psychometrics and animal cognition. An important milestone of this journey has been the recent realisation that (social) intelligence is the ability to perform well in an environment full of other agents of similar intelligence. This is a consequence of experiments which show that when performance is measured in environments where no other agents coexist, some important traits of intelligence are not fully recognised. A solution for this has been formalised as the so-called Darwin-Wallace distribution of environments (or tasks) [18]. The outcome of all this is that it is increasingly an issue whether intelligence might be needed to measure intelligence. But this is not because we might need intelligent judges as in the Turing Test; rather, we may need other intelligent agents to become part of the exercises or tasks an intelligence test should contain (as per footnote 1). This seems to take us back to the Turing Test, a point some of us deliberately abandoned more than fifteen years ago. Re-visiting the Turing Test now is necessarily very different, because of the technical companions, knowledge and results we have gathered during this journey (universal Turing machines, compression, universal distributions, Solomonoff-Kolmogorov complexity, MML, reinforcement learning, etc.).

The paper is organised as follows. Section 2 gives a short account of the past fifteen years of definitions and tests of machine intelligence based on (algorithmic) information theory. It also discusses some of the most recent outcomes and positions in this line, which have led to the notion of the Darwin-Wallace distribution and the need for including other intelligent agents in the tests, suggesting an inductive (or recursive, or iterative) test construction and definition. This is linked to the notion of a recursive Turing Test (see [32, sec. 5.1] for a first discussion of this). Section 3 analyses the base case by proposing several schemata for evaluating systems that are able to imitate Turing machines. Section 4 defines different ways of doing the recursive step, inspired by the Darwin-Wallace distribution, along with ideas for making this feasible. Section 5 briefly explores how all this might develop, and touches upon concepts such as universality in Turing machines and potential intelligence, as well as some suggestions as to how machine intelligence measurement might develop in the future.

^3 This can be taken as further evidence that Turing did not conceive the imitation test as an actual test for intelligence: the issue of machines being able to learn was seen by Turing as inherent to intelligence [37, section 7], and yet the Turing Test is not especially good at detecting learning ability during the test.

2 MACHINE INTELLIGENCE MEASUREMENT USING TURING MACHINES

There are, of course, many proposals for intelligence definitions and tests for machines which are not based on the Turing Test.
Some of them are related to psychometrics, some others may be related to other areas of cognitive science (including animal cognition), and some others originate from artificial intelligence (e.g., competitions running on specific tasks such as planning, robotics, games, reinforcement learning, ...). For an account of some of these, the reader can find a good survey in [26]. In this section, we will focus on approaches which use Turing machines (and hence computation) as the basic component for the definition of intelligence and the derivation of tests for machine intelligence.

Most views of intelligence in computer science rest on a notion of intelligence as a special kind of information processing. The nature of information, its actual content and the way in which patterns and structure can appear in it can only be explained in terms of algorithmic information theory. The Minimum Message Length (MML) principle [39, 40] and Solomonoff-Kolmogorov complexity [36, 25] capture the intuitive notion that there is structure —or redundancy— in data if and only if the data are compressible, with the relationship between MML and (two-part) Kolmogorov complexity articulated in [40][38, chap. 2][5, sec. 6]. While Kolmogorov [25] and Chaitin [1] were more concerned with the notions of randomness and the implications of all this for mathematics and computer science, Solomonoff [36] and Wallace [39] developed the theory with the aim of explaining how learning, prediction and inductive inference work. In fact, Solomonoff is said to have 'solved' the problem of induction [36] by the use of Turing machines. He was also the first to introduce the notions of the universal distribution (the distribution of strings output by a UTM fed random input) and the invariance theorem (which states that the Kolmogorov complexities of a string calculated with two different reference machines differ by at most a constant which is independent of the string).

Chaitin briefly mentioned in 1982 the potential relationship between algorithmic information theory and measuring intelligence [2], but actual proposals in this line did not start until the late 1990s. The first proposal was precisely introduced on top of a Turing Test, as a response to Searle's Chinese room [35], in which the subject was forced to learn. This induction-enhanced Turing Test [7][6] could then evaluate a general inductive ability. The importance was not that any kind of ability could be included in the Turing Test, but that this ability could be formalised in terms of MML and related ideas, such as (two-part) compression. Independently and near-simultaneously, a new intelligence test (the C-test) [19][12] was derived from sequence prediction problems which were generated by a universal distribution [36]. The difficulty of the exercises was mathematically derived from a variant of Kolmogorov complexity, and only exercises with a certain degree of difficulty were included and weighted accordingly. These exercises were very similar to those found in some IQ tests, but here they were created from computational principles. This work 'solved' the traditional subjectivity objection to the items in IQ tests, since the continuation of each sequence was derived from its shortest explanation. However, this test only measured one cognitive ability, and its presentation was too narrow to be a general test.
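As a loose illustration of what C-test-style items look like, the toy generator below produces letter sequences from three hand-written rules and asks for the continuation. This is only a caricature: in the real C-test, items and their difficulties are derived from a variant of Kolmogorov complexity on a reference machine, not from a hand-picked rule set.

```python
# Hypothetical rules mapping a prefix to its next symbol.
RULES = {
    "constant":  lambda seq: seq[-1],               # a b b b b ...
    "alternate": lambda seq: seq[-2],               # a b a b a ...
    "successor": lambda seq: chr(ord(seq[-1]) + 1), # a b c d e ...
}

def make_item(rule, seed="ab", length=6):
    seq = list(seed)
    while len(seq) < length:
        seq.append(RULES[rule](seq))
    return "".join(seq[:-1]), seq[-1]   # shown prefix, correct continuation

for rule in RULES:
    prefix, answer = make_item(rule)
    print(f"{prefix} , ? -> {answer}")  # e.g. abbbb , ? -> b
```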
Consequently, these ideas were extended to other cognitive abilities in [14] by the introduction of other 'factors', and the suggestion of using interactive tasks where "rewards and penalties could be used instead", as in reinforcement learning [13]. Similar ideas followed relating compression and intelligence. Compression tests were proposed as a test for artificial intelligence [30], arguing that "optimal text compression is a harder problem than artificial intelligence as defined by Turing's". Nonetheless, the fact that there is a connection between compression and intelligence does not mean that intelligence can just be defined as compression ability (see, e.g., [9] for a full discussion of this). Later, [27] proposed a notion which they referred to as a "universal intelligence measure" —universal because of its proposed use of a universal distribution for the weighting over environments. The innovation was mainly the use of a reinforcement learning setting, which implicitly accounted for the abilities not only of learning and prediction but also of planning. An interesting point for making this proposal popular was its conceptual simplicity: intelligence was just seen as average performance over a range of environments, where the environments were selected by a universal distribution.

While innovative, the universal intelligence measure [27] showed several shortcomings stopping it from being a viable test. Among the problems: it requires a summation over infinitely many environments; it requires a summation over infinite time within each environment; Kolmogorov complexity is typically not computable; disproportionate weight is put on simple environments (e.g., with $1 - 2^{-7} > 99\%$ of the weight put on environments of size less than 8 bits, as also pointed out by [21]); it is static and not adaptive; and it does not account for time or agent speed. Hernández-Orallo and Dowe [17] revisited this to give an intelligence test that does not have these shortcomings. It was presented as an anytime universal intelligence test. The term universal here was used to designate that the test could be applied to any kind of subject: machine, human, non-human animal or a community of these. The term anytime was used to indicate that the test could evaluate any agent speed, that it would adapt to the intelligence of the examinee, and that it could be interrupted at any time to give an intelligence score estimate; the longer the test runs, the more reliable the estimate (the average reward [16]).

Preliminary tests have since been done [23, 24, 28] to compare human agents with non-human AI agents. These tests seem to succeed in bringing theory to practice quite seamlessly and are useful to compare the abilities of systems of the same kind. However, there are some problems when comparing systems of different kinds, such as humans and AI algorithms, because the huge difference between the two (with current state-of-the-art technology) is not clearly appreciated. One explanation for this is that (human) intelligence is the result of adaptation to environments where the probability of other agents (of lower or similar intelligence) being around is very high. However, the probability of generating another agent of even a small degree of intelligence just by the use of a universal distribution is discouragingly remote.
Even in environments where other agents are included on purpose [15], it is not clear that these agents properly represent a rich 'social' environment. In [18], the so-called Darwin-Wallace distribution is introduced, where environments are generated using a universal distribution for multi-agent environments, and where the agents that populate each environment are also generated by a universal distribution. The probability of having interesting environments and agents is very low in this first 'generation'. However, if an intelligence test is administered to this population and only those with a certain level are preserved, we may get a second population whose agents have a slightly higher degree of intelligence. Iterating this process, we obtain successive levels of the Darwin-Wallace distribution, where evolution is solely driven (boosted) by a fitness function which is just measured by intelligence tests.

3 THE BASE CASE: THE TURING TEST FOR TURING MACHINES

A recursive approach can raise the odds of environments and tasks exhibiting the kind of behaviour which is attributed to more intelligent agents. This idea of recursive populations can be linked to the notion of a recursive Turing Test [32, sec. 5.1], where the agents which have succeeded at lower levels can be used for comparison at higher levels. However, there are many interpretations of this informal notion of a recursive Turing Test. The fundamental idea is to eliminate the human reference from the test using recursion —either as the subject that has to be imitated or as the judge which is used to tell the subjects apart. Before giving some (more precise) interpretations of a recursive version of the Turing Test, we need to start with the base case, as follows (we use TM and UTM for Turing machine and universal Turing machine respectively):

Definition 1 The imitation game for Turing machines^4 is defined as a tuple ⟨D, B, C, I⟩:

• The reference subject A is randomly taken as a TM using a distribution D.
• Subject B (the evaluee) tries to emulate A.
• The similarity between A and B is 'judged' by a criterion or judge C through some kind of interaction protocol I. The test returns this similarity.

An instance of the previous schema requires us to determine the distribution D, the similarity criterion C and, most especially, how the interaction I goes. In the classical Turing Test, we know that D is the human population, C is given by a human judge, and the interaction is an open teletype conversation.^5 Of course, other distributions for D could lead to other tests, such as, e.g., a canine test, taking D as a dog population, and judges as other dogs which have to tell which is the member of the species, or perhaps even how intelligent it is (for whatever purpose, e.g., mating or idle curiosity). More interestingly, one possible instance for Turing machines could go as follows. We can just take D as a universal distribution over a reference UTM U, so $p(A) = 2^{-K_U(A)}$, where $K_U(A)$ is the prefix-free Kolmogorov complexity of A relative to U. This means that simple reference subjects have higher probability than complex subjects. Interaction can go as follows. The 'interview' consists of questions given as random finite binary strings $s_1, s_2, \ldots$ drawn using a universal distribution over another reference UTM, V. The test starts with subjects A and B receiving string $s_1$ and giving two sequences $a_1$ and $b_1$ as respective answers.
Agent B will also receive what A has output immediately after this. Judge C is just a very simple function which compares whether $a_1$ and $b_1$ are equal. After one iteration, the system issues string $s_2$. After several iterations, the score (similarity) given to B is calculated as an aggregation of the times $a_i$ and $b_i$ have been equal. This can be seen as a formalisation of the Turing Test where it is a Turing machine that needs to be imitated, and the criterion for imitation is the similarity between the answers given by A and B to the same questions. If subject B cannot be told or instructed about the goal of the test (imitating A), then we can use rewards after each step, possibly concealing A's outputs from B as well.

^4 The use of Turing machines for the reference subject is relevant and not just a way to link two things by their name, Turing. Turing machines are required because we need to define formal distributions on them, and this cannot be done (at least theoretically) for humans, animals or 'agents'.

^5 This free teletype conversation may be problematic in many ways. Typically, the judge C wishes to steer the conversation in directions which will enable her to get (near-)maximal (expected) information (before the time-limit deadline of the test) about whether or not the evaluee B is from D. One tactic for a subject which is not from D (and not a good imitator either) is to distract the judge C and steer the conversation in directions which will give judge C (near-)minimal (expected) information.

This test might seem ridiculous at first sight. Some might argue that being able to imitate a randomly-chosen TM is not related to intelligence. However, two issues are important here. First, agent B does not know who A is in advance. Second, agent B tries to imitate A solely from its behaviour. This makes the previous version of the test very similar to the most abstract setting used for analysing what learning is, how complex it is and whether it can be solved. First, it is tantamount to Gold's language identification in the limit [11]. If subject B is able to identify A at some point, then it will start to score perfectly from that moment on. While Gold was interested in whether this could be done in general and for every possible A, here we are interested in how well B does this on average for a randomly-chosen A from a distribution. In fact, many simple TMs can be identified quite easily, such as those which output the same string independently of the input. Second, and following this averaging approach, Solomonoff's setting is also very similar. Solomonoff proved that B could get the best estimations of A's behaviour if B used a mixture of all consistent models, weighted inversely by 2 to the power of their Kolmogorov complexity. While this may give the best theoretical approach for prediction, and perhaps for "imitation", it does not properly "identify" A. Identification can only properly be claimed if we have one single model which is exactly A. This distinction between one vs. multiple models is explicit in the MML principle, which usually considers just one single model: the one with the shortest two-part message encoding of said model followed by the data given this model.
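Putting the pieces of Definition 1 together, a minimal sketch of the interaction protocol might look as follows, with Turing machines replaced by plain Python callables (an assumption made purely for illustration) and the judge C reduced to counting agreements:

```python
import random

def imitation_score(reference, evaluee, n_questions=100,
                    rng=random.Random(1)):
    """One run of the base-case imitation game: both subjects answer
    random binary questions; the judge counts agreements; the evaluee
    sees the reference's answer after each round."""
    history, agree = [], 0
    for _ in range(n_questions):
        s = "".join(rng.choice("01")
                    for _ in range(rng.randint(1, 8)))  # question s_i
        a, b = reference(s), evaluee(s, history)
        agree += (a == b)
        history.append((s, a))        # B learns from A's observed behaviour
    return agree / n_questions        # aggregated similarity

# A trivial reference machine (echo) and an evaluee that identifies it.
reference = lambda s: s
def evaluee(s, history):
    if all(ans == q for q, ans in history):  # hypothesis: A echoes its input
        return s
    return "0"                               # fallback guess

print(imitation_score(reference, evaluee))   # 1.0: A is identified at once
```

Once the evaluee's hypothesis matches the reference's behaviour, it scores perfectly from that point on, which is exactly the identification-in-the-limit behaviour mentioned above.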
There is already an intelligence test which corresponds to the previous instance of Definition 1: the C-test, mentioned above. The C-test measures how well an agent B is able to identify the pattern behind a series of sequences (each sequence generated by a different program, i.e., a different Turing machine). The C-test does not use a query-answer setting, but the principles are the same.

We can develop a slight modification of Definition 1 by considering that subject A also tries to imitate B. This might lead to easy convergence in many cases (for relatively intelligent A and B) and would not be very useful for comparing A and B effectively. A significant step forward is to consider instead that the goal of A is to produce outputs that cannot be imitated by B. While clearly different, this is related to some versions of Turing's imitation game, where one of the human subjects pretends to be a machine. While there might be some variants here to explore, if we restrict the size of the strings used for questions and answers to 1 (which makes agreeing and disagreeing equally likely), this is tantamount to the game known as 'matching pennies' (a binary version of rock-paper-scissors where the first player has to match the head or tail of the second player, and the second player has to disagree with the head or tail of the first). Interestingly, this game has also been proposed as an intelligence test in the form of Adversarial Sequence Prediction [20][22] and is related to the "elusive model paradox" [3, footnote 211][4, p. 455][5, sec. 7.5]. This instance makes it more explicit that the distribution D over the agents that the evaluee has to imitate or compete with is crucial. In the case of imitation, however, there might be non-intelligent Turing machines which are more difficult to imitate/identify than many intelligent Turing machines, and this difficulty seems to be related to the Kolmogorov complexity of the Turing machine. And linking difficulty to Kolmogorov complexity is what the C-test does.

But biological intelligence is frequently biased towards social environments, or at least towards environments where other agents may eventually be around. In fact, societies are usually built on common sense and common understanding, but in humans this might rest on an evolutionarily-acquired ability to imitate other humans, not other intelligent beings in general. Some neurobiological structures, such as mirror neurons, have been found in primates and other species; these may be responsible for understanding what others do and will do, and for learning new skills by imitation. Nonetheless, we must say that human unpredictability is frequently impressive, and its relation to intelligence is far from being understood. Interestingly, some of the first analyses of this issue [34][29] linked the problem with the competitive/adversarial scenario, which is equivalent to the matching pennies problem, where the intelligence of the peer is the most relevant feature (if not the only one) for assessing the difficulty of the game, as happens in most games. In fact, matching pennies is the purest and simplest such game, since it reduces the complexity of the 'environment' (rules of the game) to a minimum.
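The matching-pennies instance is easy to simulate. In the sketch below (an illustration, not the Adversarial Sequence Prediction framework of [20] itself), a frequency-counting matcher plays against two opponents: a biased one, whose regularity it exploits, and a uniformly random one, against which the expected score is zero:

```python
import random

def matching_pennies(matcher, opponent, rounds=2000):
    """Matcher scores +1 when the bits agree, -1 otherwise.
    Each player chooses from the other's full history of moves."""
    hist_m, hist_o, score = [], [], 0
    for _ in range(rounds):
        a = matcher(hist_o)          # tries to match the opponent
        b = opponent(hist_m)         # tries (or not) to evade
        score += 1 if a == b else -1
        hist_m.append(a)
        hist_o.append(b)
    return score / rounds            # average reward in [-1, 1]

freq_matcher = lambda hist: max("01", key=hist.count) if hist else "0"
biased = lambda hist: "1" if random.random() < 0.7 else "0"
uniform = lambda hist: random.choice("01")

print(matching_pennies(freq_matcher, biased))   # ~0.4: bias is exploited
print(matching_pennies(freq_matcher, uniform))  # ~0.0: nothing to exploit
```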
4 RECURSIVE TURING TESTS FOR TURING MACHINES

The previous section has shown that introducing agents (in this case, agent A) into a test setting requires a clear assessment of the distribution used for introducing them. A general expression of how to make a Turing Test for Turing machines recursive is as follows:

Definition 2 The recursive imitation game for Turing machines is defined as a tuple ⟨D, C, I⟩, where tests and distributions are obtained as follows:

1. Set $D_0 = D$ and i = 0.
2. For each agent B in a sufficiently large set of TMs:
3. Apply a sufficiently large set of instances of Definition 1 with parameters ⟨$D_i$, B, C, I⟩.
4. B's intelligence at degree i is averaged from this sample of imitation tests.
5. End for.
6. Set i = i + 1.
7. Calculate a new distribution $D_i$ where each TM has a probability which is directly related to its intelligence at level i − 1.
8. Go to 2.

This gives a sequence of distributions $D_i$. The previous approach is clearly uncomputable in general, and still intractable even if reasonable samples, heuristics and step limitations are used. A better approach to the problem would be some kind of propagation system, such as Elo's rating system for chess [10], which has already been suggested in some works and competitions in artificial intelligence. A combination of a soft universal distribution, where simple agents would have slightly higher probability, and a one-vs-one credit propagation system such as Elo's rating (or any other mechanism which returns maximal expected information with a minimum of pairings) could feasibly aim at a reasonably good estimate of the relative abilities of a big population of Turing machines, including some AI algorithms amongst them, as sketched after this section.

What would this rating mean? If we are using the imitation game, a high rating would show that the agent is able to imitate/identify other agents of lower rating well, and that it is a worse imitator/identifier than other agents with higher rating. However, there is no reason to think that the relations are transitive and anti-reflexive; e.g., it might even happen that an agent with a very low ranking would be able to imitate an agent with a very high ranking better than the other way round. One apparently good thing about this recursion and rating system is that the start-up distribution can be very important from the point of view of heuristics, but it might be less important for the final result. This is yet another way of escaping from the problems of using a universal distribution for environments or agents, where very simple things take almost all the probability —as per section 2. Using difficulty as in the C-test, making adaptive tests such as the anytime test, setting a minimum complexity value [21], or using hierarchies of environments [22] where "an agent's intelligence is measured as the ordinal of the most difficult set of environments it can pass", are all solutions for this. We have just seen another possible solution, where evaluees (or similar individuals) can take part in the tests.
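A minimal sketch of the Elo-style propagation mentioned above: ratings are updated after random 1-vs-1 pairings, with each imitation contest stubbed out by a coin flip biased by a hidden 'skill' (the stub, the K-factor and the pairing policy are all assumptions of this illustration):

```python
import random

def elo_update(r_a, r_b, score_a, k=16):
    """Standard Elo update: score_a is 1.0 if A out-imitated B,
    0.0 if B won, 0.5 for a draw."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

skills = {name: random.random() for name in "ABCDEF"}   # hidden ground truth
ratings = {name: 1000.0 for name in skills}
for _ in range(5000):                                   # random 1-vs-1 pairings
    a, b = random.sample(sorted(skills), 2)
    win_prob_a = skills[a] / (skills[a] + skills[b])    # stub for a contest
    outcome = 1.0 if random.random() < win_prob_a else 0.0
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)

print(sorted(ratings, key=ratings.get, reverse=True))   # recovered ordering
print(sorted(skills, key=skills.get, reverse=True))     # true ordering
```

With enough pairings, the rating order tends to track the hidden skill order, which is the kind of relative-ability estimate the text aims at, without enumerating the whole population.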
5 DISCUSSION

The Turing Test, in some of its formulations, is a game where an agent tries to imitate another agent (or its species or population), which might (or might not) be cheating. If both agents are fair, and we do not consider any previous information about the agents (or their species or populations), then we have an imitation test for Turing machines. If one is cheating, we get closer to the adversarial case we have also seen. Instead of including agents arbitrarily, or assuming a priori that any agent has a given level of intelligence, a recursive approach is necessary. This is conceptually possible, as we have seen, although its feasible implementation needs to be carefully considered, possibly in terms of rankings after random 1-vs-1 comparisons.

This view of the (recursive) Turing Test in terms of Turing machines has allowed us to connect the Turing Test with fundamental issues in computer science and artificial intelligence, such as the problem of learning (as identification), Solomonoff's theory of prediction, the MML principle, game theory, etc. These connections extend to other disciplines such as (neuro-)biology, where the roles of imitation and adversarial prediction are fundamental, as in predator-prey games, mirror neurons, common coding theory, etc. In addition, this has shown that the line of research on intelligence tests derived from algorithmic information theory, and the recent Darwin-Wallace distribution, are closely related to this as well. This (again) links this line of research to the Turing Test, where humans have been replaced by Turing machines.

This sets up many avenues for research and discussion. For instance, the idea that the ability to imitate relates to intelligence can be understood in terms of the universality of a Turing machine, i.e., the ability of a Turing machine to emulate another. If a machine can emulate another, it can acquire all the properties of the latter, including intelligence. However, in this paper we have referred to the notion of 'imitation', which is different from the concept of a universal Turing machine, since a UTM is defined as a machine such that there is an input that turns it into any other pre-specified Turing machine. A machine which is able to imitate well is a good learner, one which can eventually identify any pattern in the input and use it to imitate the source. In fact, a good imitator is, potentially, very intelligent, since it can, in theory (and disregarding efficiency issues), act as any other very intelligent being just by observing its behaviour. Turing advocated learning machines in section 7 of the very same paper [37] where he introduced the Turing Test. Solomonoff taught us what learning machines should look like. We are still struggling to make them work in practice, and preparing to assess them.

ACKNOWLEDGEMENTS

This work was supported by the MEC projects EXPLORA-INGENIO TIN 2009-06078-E, CONSOLIDER-INGENIO 26706 and TIN 2010-21062-C02-02, and GVA project PROMETEO/2008/051. Javier Insa-Cabrera was sponsored by Spanish MEC-FPU grant AP2010-4389.

REFERENCES

[1] G. J. Chaitin, 'On the length of programs for computing finite sequences', Journal of the Association for Computing Machinery, 13, 547–569, (1966).
[2] G. J. Chaitin, 'Gödel's theorem and information', International Journal of Theoretical Physics, 21(12), 941–954, (1982).
[3] D. L. Dowe, 'Foreword re C. S. Wallace', Computer Journal, 51(5), 523–560, (September 2008). Christopher Stewart Wallace (1933–2004) memorial special issue.
[4] D. L. Dowe, 'Minimum Message Length and statistically consistent invariant (objective?) Bayesian probabilistic inference - from (medical) "evidence"', Social Epistemology, 22(4), 433–460, (October–December 2008).
[5] D. L. Dowe, 'MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness', in Handbook of the Philosophy of Science - Volume 7: Philosophy of Statistics, eds., P. S. Bandyopadhyay and M. R. Forster, pp. 901–982. Elsevier, (2011).
Hajek, ‘A non-behavioural, computational extension to the Turing Test’, in Intl. Conf. on Computational Intelligence & Multimedia Applications (ICCIMA'98), Gippsland, Australia, pp. 101–106, (February 1998).
[7] D. L. Dowe and A. R. Hajek, ‘A computational extension to the Turing Test’, in Proceedings of the 4th Conference of the Australasian Cognitive Science Society, University of Newcastle, NSW, Australia, (September 1997).
[8] D. L. Dowe and J. Hernández-Orallo, ‘IQ tests are not for machines, yet’, Intelligence, 40(2), 77–81, (2012).
[9] D. L. Dowe, J. Hernández-Orallo, and P. K. Das, ‘Compression and intelligence: social environments and communication’, in Artificial General Intelligence, eds., J. Schmidhuber, K. R. Thórisson, and M. Looks, volume 6830, pp. 204–211. LNAI series, Springer, (2011).
[10] A. E. Elo, The Rating of Chessplayers, Past and Present, volume 3, Batsford, London, 1978.
[11] E. M. Gold, ‘Language identification in the limit’, Information and Control, 10(5), 447–474, (1967).
[12] J. Hernández-Orallo, ‘Beyond the Turing Test’, J. Logic, Language & Information, 9(4), 447–466, (2000).
[13] J. Hernández-Orallo, ‘Constructive reinforcement learning’, International Journal of Intelligent Systems, 15(3), 241–264, (2000).
[14] J. Hernández-Orallo, ‘On the computational measurement of intelligence factors’, in Performance Metrics for Intelligent Systems Workshop, ed., A. Meystel, pp. 1–8. National Institute of Standards and Technology, Gaithersburg, MD, U.S.A., (2000).
[15] J. Hernández-Orallo, ‘A (hopefully) non-biased universal environment class for measuring intelligence of biological and artificial systems’, in Artificial General Intelligence, 3rd Intl Conf, ed., M. Hutter et al., pp. 182–183. Atlantis Press, extended report at http://users.dsic.upv.es/proy/anynt/unbiased.pdf, (2010).
[16] J. Hernández-Orallo, ‘On evaluating agent performance in a fixed period of time’, in Artificial General Intelligence, 3rd Intl Conf, ed., M. Hutter et al., pp. 25–30. Atlantis Press, (2010).
[17] J. Hernández-Orallo and D. L. Dowe, ‘Measuring universal intelligence: Towards an anytime intelligence test’, Artificial Intelligence Journal, 174, 1508–1539, (2010).
[18] J. Hernández-Orallo, D. L. Dowe, S. España-Cubillo, M. V. Hernández-Lloreda, and J. Insa-Cabrera, ‘On more realistic environment distributions for defining, evaluating and developing intelligence’, in Artificial General Intelligence, eds., J. Schmidhuber, K. R. Thórisson, and M. Looks, volume 6830, pp. 82–91. LNAI, Springer, (2011).
[19] J. Hernández-Orallo and N. Minaya-Collado, ‘A formal definition of intelligence based on an intensional variant of Kolmogorov complexity’, in Proc. Intl Symposium of Engineering of Intelligent Systems (EIS'98), pp. 146–163. ICSC Press, (1998).
[20] B. Hibbard, ‘Adversarial sequence prediction’, in Artificial General Intelligence 2008: Proceedings of the First AGI Conference, pp. 399–403. IOS Press, (2008).
[21] B. Hibbard, ‘Bias and no free lunch in formal measures of intelligence’, Journal of Artificial General Intelligence, 1(1), 54–61, (2009).
[22] B. Hibbard, ‘Measuring agent intelligence via hierarchies of environments’, Artificial General Intelligence, 303–308, (2011).
[23] J. Insa-Cabrera, D. L. Dowe, S. España-Cubillo, M.
Victoria Hernández-Lloreda, and José Hernández-Orallo, ‘Comparing humans and AI agents’, in AGI: 4th Conference on Artificial General Intelligence - Lecture Notes in Artificial Intelligence (LNAI), volume 6830, pp. 122–132. Springer, (2011).
[24] J. Insa-Cabrera, D. L. Dowe, and José Hernández-Orallo, ‘Evaluating a reinforcement learning algorithm with a general intelligence test’, in CAEPIA - Lecture Notes in Artificial Intelligence (LNAI), volume 7023, pp. 1–11. Springer, (2011).
[25] A. N. Kolmogorov, ‘Three approaches to the quantitative definition of information’, Problems of Information Transmission, 1, 4–7, (1965).
[26] S. Legg and M. Hutter, ‘Tests of machine intelligence’, in 50 Years of Artificial Intelligence, pp. 232–242. Springer-Verlag, (2007).
[27] S. Legg and M. Hutter, ‘Universal intelligence: A definition of machine intelligence’, Minds and Machines, 17(4), 391–444, (November 2007).
[28] S. Legg and J. Veness, ‘An Approximation of the Universal Intelligence Measure’, in Proceedings of the Solomonoff 85th Memorial Conference. Springer, (2012).
[29] D. K. Lewis and J. Shelby-Richardson, ‘Scriven on human unpredictability’, Philosophical Studies: An International Journal for Philosophy in the Analytic Tradition, 17(5), 69–74, (October 1966).
[30] M. V. Mahoney, ‘Text compression as a test for artificial intelligence’, in Proceedings of the National Conference on Artificial Intelligence, AAAI, pp. 970–970, (1999).
[31] G. Oppy and D. L. Dowe, ‘The Turing Test’, in Stanford Encyclopedia of Philosophy, ed., Edward N. Zalta. Stanford University, (2011). http://plato.stanford.edu/entries/turing-test/.
[32] P. Sanghi and D. L. Dowe, ‘A computer program capable of passing IQ tests’, in 4th Intl. Conf. on Cognitive Science (ICCS'03), Sydney, pp. 570–575, (2003).
[33] A. P. Saygin, I. Cicekli, and V. Akman, ‘Turing test: 50 years later’, Minds and Machines, 10(4), 463–518, (2000).
[34] M. Scriven, ‘An essential unpredictability in human behavior’, in Scientific Psychology: Principles and Approaches, eds., B. B. Wolman and E. Nagel, 411–425, Basic Books (Perseus Books), (1965).
[35] J. R. Searle, ‘Minds, brains and programs’, Behavioural and Brain Sciences, 3, 417–457, (1980).
[36] R. J. Solomonoff, ‘A formal theory of inductive inference’, Information and Control, 7, 1–22, 224–254, (1964).
[37] A. M. Turing, ‘Computing machinery and intelligence’, Mind, 59, 433–460, (1950).
[38] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Information Science and Statistics, Springer Verlag, May 2005. ISBN 0-387-23795-X.
[39] C. S. Wallace and D. M. Boulton, ‘An information measure for classification’, Computer Journal, 11(2), 185–194, (1968).
[40] C. S. Wallace and D. L. Dowe, ‘Minimum message length and Kolmogorov complexity’, Computer Journal, 42(4), 270–283, (1999).

What language for Turing Test in the age of qualia?

Francesco Bianchini (Dept. of Philosophy, University of Bologna, francesco.bianchini5@unibo.it) and Domenica Bruni (Dept. of Cognitive Science, University of Messina, dbruni@unime.it)

Abstract. What is the most relevant legacy of Turing for the epistemology of Artificial Intelligence (AI) and cognitive science? Of course, we could see it in the ideas set out in his well-known article of 1950, Computing Machinery and Intelligence. But how could his imitation game, and its subsequent evolution into what we know as the Turing Test, still be so relevant?
What we want to argue is that the core of the imitation game, as a method for evaluating research on intelligent artifacts, lies not specifically in (natural) language capability as a way of showing the presence of intelligence in a certain entity, but in the interaction between human beings and machines. Human-computer interaction is a particular field of information science for many important practical respects, but interaction between human beings and machines is the deepest sense of Turing's ideas on the evaluation of intelligent behavior and entities, within and beyond its connection with natural language. And from this point of view it could be methodologically and epistemologically useful for further research in every discipline involving machines and artificial artifacts, especially as concerns the very current subject of consciousness and qualia. In what follows we will try to argue for such a perspective by showing some fields in which interaction, in connection with different sorts of language, could be of interest in the spirit of Turing's 1950 article.

1 TURING, LANGUAGE AND INTERACTION

One of the most interesting ideas of Turing was a language-based test for proving the intelligence, or the intelligent behavior, of a program - in Turing's terms, of a machine showing an autonomous and self-produced intelligent behavior. Actually, Turing never spoke about a test, but just about an imitation game, using the concept of imitation as an intuitive one. This is a typical way of thinking for Turing, though, who had provided a method for catching the notion of computable function in a mechanical way through a set of intuitive concepts about fifteen years before [24]. As in the case of computation theory, Turing's aim in the 1950 article was to deal with a very notable subject in the easiest and most straightforward manner, avoiding involvement with more complex and specific theoretical structures based on field-dependent notions. In the case of the imitation game, the combination of the notion of "imitation" and of the use of natural language allowed Turing to express a paradigmatic method for evaluating artificial products, but it gave rise as well to an endless debate over the last sixty years about the suitability of this kind of testing of artificial intelligence. Leaving aside the problem concerning the correct interpretation of the notion of "imitation", we may first ask whether the role of language in the test is fundamental, or whether it is just connected to the spirit of the period in which Turing wrote his paper, that is, to the then-current behaviorist paradigm in psychology and to the centrality of natural language in the philosophy of the twentieth century. In other terms, why did Turing choose natural language in order to build a general frame for evaluating the intelligence of artificial, programmed artifacts? Is such a way of thinking (and researching) still useful? And, if so, what can we say about it in relation to further research in this field? As we said, the choice of natural language had the purpose of putting the matter in an intuitive manner. We human beings usually ascribe intelligence to other human beings through linguistic conversations, mostly carried out in a question-answer form.
Besides, Turing himself asserts in the 1950 article that such a method «has the advantage of drawing a fairly sharp line between the physical and the intellectual capacities of a man» [26]. This is the ordinary explanation of Turing's choice. But it is also true that, in a certain sense, the very first enunciation of the imitation game appears in a previous work by Turing where, ending his exposition on machine intelligence, he speaks about a «little experiment» regarding the possibility of a chess game between two human beings (A and C), and between a human being (A) and a paper machine worked by a human being (B). Turing asserts that if «two rooms are used with some arrangement for communicating moves, and a game is played between C and either A or the paper machine […] C may find it quite difficult to tell which he is playing. (This is a rather idealized form of an experiment I have actually done.)» [25]. Such a brief sketch of the imitation game in the 1948 paper is not surprising, because that paper is a sort of first draft of Turing's ideas of the 1950 paper, and it is even more considerable for some of its remarks, for example on self-organizing machines or on the possibility of machine learning. Moreover, it is not surprising that Turing speaks about machines referring to them as paper machines, namely just by their logical, abstract structure. It is another recurrent theme in Turing, one that recalls the human computor of the 1936 paper. What is interesting is the fact that the first, short outline of the imitation game is not based on language, but on a subject that is more early-artificial-intelligence-like, that is, the game of chess. So, (natural) language is not necessary for the imitation game from Turing's point of view, and yet the ordinary explanation of Turing's choice of language is still valid within such a framework. In other terms, Turing was aware not only that there are other domains in which a machine can apply itself autonomously - a trivial fact - but also that such domains are as suitable as natural language for the imitation game. Nevertheless, he chose natural language as paradigmatic. What conclusions can we draw from such remarks? Probably two. First, Turing was pretty definitely aware that the evaluation of artificial intelligence (AI) products, in a broad sense, would be a very difficult subject, maybe the most fundamental as regards the epistemology of AI and cognitive science, even if, obviously, he did not use such terms in 1950. Secondly, that the choice of language and the role of language in the imitation game are even more subtle than popular culture and the AI tradition usually assert. As a matter of fact, he did not speak about natural language in general but of a "question-answer method", a method that involves communication, not just language processing or production. So, from this point of view it seems that, for Turing, natural language processing or production are just some peculiar human cognitive abilities among many others, and are not basic for testing intelligence. What is basic for such a task is communication or, to use another, more inclusive term, interaction. But a specification is needed.
We are not maintaining that the capability of using language is not a cognitive feature, but rather that in Turing's view interaction is the best way to detect intelligence, and language interaction, by means of the question-answer method, is perhaps the most intuitive form of interaction for human beings. No interaction is tantamount to no possibility of identifying intelligence, and for such a purpose one of the two poles of the interaction must be a human being (see footnote 3). Furthermore, the «question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include» [26] and, leaving aside the above-mentioned point concerning Turing's explicit request to penalize in no way machines or human beings for their unshared features, we could consider this as the main aim of Turing, namely generalizing the testing of intelligence. Of course, such an aim anticipates one of the mainstreams of the AI that was soon to arise (see footnote 4), but it has an even wider range. Turing was not speaking, indeed, about problem solving, but trying to formulate a criterion and a method to show and identify machine intelligent behavior in different-field interaction with human beings. So, language communication seems to become both a lowest common denominator for every field in which it is possible to test intelligence and, at the same time, a way to carve out a single field or domain for testing intelligence from the point of view of interaction. We will now consider a few of them, in order to investigate and discuss whether they could be relevant for the qualia problem.

Footnote 3: A similar way of thinking seems to be suggested, as regards specifically natural language, by an old thought experiment formulated by Putnam, in which he imagines a human being learning by heart a passage in a language he does not know and then repeating it in a sort of stream of consciousness. If a telepath, knowing that particular language, could perceive the stream of consciousness of the human being who has memorized the passage, the telepath could think the human being knows that language, even though it is not so. What is lacking in the scene described in the thought experiment? A real interaction. As a matter of fact, the conclusion of Putnam himself is that: «the understanding, then, does not reside in the words themselves, nor even in the appropriateness of the whole sequence of words and sentences. It lies, rather, in the fact that an understanding speaker can do things with the words and sentences he utters (or thinks in his head) besides just utter them. He can answer questions, for example […].» [19]. And this appears to be very close to what Turing thought more than twenty years before.

Footnote 4: For example, consider the target of building a General Problem Solver pursued at length by Newell, Shaw and Simon [15, 16].

2 LANGUAGE TRANSLATION AS CULTURAL INTERACTION

A first field in which language and interaction are involved is language translation. We know that machine translation has been a very difficult target of computer science and AI from their origins up to the present day. The reason is that translation usually concerns two different natural languages, two tongues, and it is not merely an act of substitution. On the contrary, translation involves many different levels of language: syntactic and semantic levels, but also cultural and stylistic levels, which are very context-dependent. It is very difficult for a machine to find the correct word or expression to render in a specific language what is said in another language.
Many different approaches in this field, especially from computational linguistics, are available to address the problem of a good translation. But in any case, it is an operation that still remains improvable. As a matter of fact, if we consider some machine translation tools like Google Translate, there are generally syntactic and semantic problems in every product of such tools, even if, maybe, the latter are larger than the former. So, how can we test intelligence in this field concerning language? Or, in other terms, what could be a real test for detecting intelligence as regards translation? A mere improvement of the tools may not be satisfying. We could indeed think that, with the improvement of machine translation tools, we would have better and better outcomes in this field, but what we want is not a collection of excellent texts from the point of view of translation. What we want is a sort of justification of the word choices made in the empirical activity of translation. If we could have a program that is able to justify its choice of words and expressions in the act of translation, we could consider the problem of a randomly good choice of a word or an expression to be evaded. In a dialogue written as a personal tribute to Alan Turing, Douglas Hofstadter underlines a similar view. Inspired by the two little snippets of Turing's 1950 article [26], Hofstadter builds a (fictitious) conversation between a human being and a machine in order to show the falsity of simplistic interpretations of the Turing Test, which he summarizes in the following way: «even if some AI program passed the full Turing Test, it might still be nothing but a patchwork of simple-minded tricks, as lacking in understanding or semantics as is a cash register or an automobile transmission» [10]. In his dialogue, Hofstadter tries to expand the flavor of the second Turing snippet, where Mr Pickwick is compared to a winter's day [26]. Hofstadter's conversation has translation as its main topic, in particular poetry translation. Hofstadter wants to show how complex such a subject is, and how unlikely it is that a program could have a conversation of that type with a human being, and thus pass the Turing Test. By reversing the perspective, we can consider translation one of the language fields in which, in the future, it could be fruitful to test machine intelligence. But we are not merely referring to machine translation. We want to suggest that a conversation on the subject of a translation could be a target for a machine. Translation by itself, indeed, concerns many cultural aspects, as we said before, and the understanding and justification of which term or expression is suitable in a specific context of a specific language could be a very interesting challenge for a program; it would imply the program's knowledge of the cultural context of a specific language, and therefore the implementation of mechanisms for representing and handling two different language contexts. In Hofstadter's dialogue, much attention is devoted to the problem from a poetic point of view. We can get a flavour of the general issues involved by considering an extract from the dialogue, which is between two entities, a Dull Rigid Human and an Ace Mechanical Translator: «DRH: Well, of course, being an advanced AI program, you engaged in a highly optimized heuristic search. AMT: For want of a better term, I suppose you could put it that way.
The constraints I found myself under in my search were, of course, both semantic and phonetic. Semantically, the problem was to find some phrase whose evoked imagery was sufficiently close to, or at least reminiscent of, the imagery evoked by croupir dans ton lit. Phonetically, the problem was a little trickier to explain. Since the line just above ended with "stir", I needed an "ur" sound at the end of line 6. But I didn't want to abandon the idea of hyphenating right at that point. This meant that I needed two lines that matched this template: Instead of …ur…ing …… bed where the first two ellipses stand for consonants (or consonant clusters), and the third one for "in" or "in your" or something of the sort. Thus, I was seeking gerunds like "lurking", "working", "hurting", "flirting", "curbing", "squirming", "bursting", and so on — actually, a rather rich space of phonetic possibilities. DRH: Surely you must have, within your vast data bases, a thorough and accurate hyphenation routine, and so you must have known that the hyphenations you propose — "lur-king", "squir-ming", "bur-sting", and so forth — are all illegal… AMT: I wish you would not refer to my knowledge as "your vast data bases". I mean, why should that quaint, old-fashioned term apply to me any more than to you? But leaving that quibble aside, yes, of course, I knew that, strictly speaking, such hyphenations violate the official syllable boundaries in the eyes of rigid language mavens like that old fogey William Safire. But I said to myself, "Hey, if you're going to be so sassy as to hyphenate a word across a line-break, then why not go whole hog and hyphenate in a sassy spot inside the word?"» [10]. Poetry involves metrical structures, rhymes, assonances, alliterations and many other figures of speech [10]. But these constitute constraints that are easily mechanizable, by means of the appropriate set of data bases; in fact, a machine could be faster than a human being in finding, for example, every word rhyming with a given one (a minimal sketch of such a constraint search is given at the end of this section). So the problem is not whether we have to consider poetry or prose translation, and their differences, but that of catching the cultural and personal flavor of the text's author, whether within a figure-of-speech scheme or not. Poetry just has some further, but mechanizable, constraints. So, what remains outside such constraints? Is it the traditional idea of an intentionality of terms? We do not think so. The notion of intentionality seems always to involve a first-person, subjective point of view that is undetectable in a machine, as a long debate over the last thirty years seems to show. But if we consider the natural development of the intentionality problem, that of qualia (as subjective conscious experiences that we are able to express with words), maybe we could have a better problem and find a better field of investigation in considering translation as a sort of qualia communication. In other terms, a good terminological choice and a good justification of such a choice could be a suitable method for testing intelligence, even in its capability to express and understand qualia. And this could be a consequence of the fact that, generally speaking, translation is a sort of communication, a communication of contents from one particular language to another particular language; and in the end a context interaction.
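To make the point concrete, here is a minimal sketch (our illustration, not Hofstadter's) of one such mechanizable constraint: searching a word list for gerunds that fit the AMT's «…ur…ing» template, using a vowel-plus-r spelling as a crude proxy for the "ur" sound. The word list here is merely a stand-in; any dictionary file would do.

import re

# Crude spelling proxy for an "ur"-like sound followed by "-ing":
# some vowel, then "r", somewhere before the final "ing".
TEMPLATE = re.compile(r"^[a-z]*[eiou]r[a-z]*ing$")

def matching_gerunds(words):
    # Return the words fitting the template: one phonetic constraint
    # of the kind the AMT claims to search through mechanically.
    return [w for w in words if TEMPLATE.match(w)]

words = ["lurking", "working", "hurting", "flirting", "curbing",
         "squirming", "bursting", "singing", "reading"]
print(matching_gerunds(words))
# prints the seven "ur"-sounding gerunds, but not "singing" or "reading"

A real system would of course use a pronunciation dictionary rather than spelling, but the example shows why such constraints, unlike the cultural and personal flavor of a text, are easily mechanizable.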
3 INTERACTION BETWEEN MODEL AND REALITY

Another field in which the notion of interaction could be relevant from the point of view of the Turing Test is that of scientific discovery. In the long development of machine learning, some researchers implemented programs that are able to carry out generalizations from data structures within a specific scientific domain, namely scientific laws (see footnote 5). Even though they are very specific laws, they are (scientific) laws in all respects. Such programs were based on logical methods and, indeed, they could only arrive at a generalization from data structures; they were not able to obtain their outcomes from experimental conditions. More recently, other artificial artifacts have been built in order to fill this gap. For example, ADAM [8] is a robot programmed for producing outcomes in genetics with the possibility of autonomously managing real experiments. It has a logic-based knowledge base that is a model of metabolism, but it is also able to plan and run experiments to confirm or disconfirm hypotheses within a research task. In particular, it can set up experimental conditions and situations with a high level of resource optimization for investigating gene expression and associating one or more genes to one protein. The outcome is a (very specific but real) scientific law, or a set of them. We could say that ADAM is a theoretical and practical machine. It formulates a number of hypotheses about gene expression using its knowledge bases, which include all that we already know about gene expression from a biological point of view. It does the experiments to confirm or disconfirm every hypothesis, and then it carries out a statistical analysis to evaluate the results. So, is ADAM a perfect scientist, an autonomous intelligent artifact in the domain of science?

Figure 1. Diagram of the hypothesis generation-experimentation cycle for the production of new scientific knowledge, on which ADAM is based (from [21]).

Footnote 5: For example GOLEM. For some of its outcomes, see [14]; for a discussion see [5].

Of course, it is true that its outcomes are original in some cases; and it is also true that its creators, its programmers, do not see in it a substitute for scientists, but only an assistant for human scientists, even though a very efficient one, at least at the current phase of research, as happens in other fields like chess playing and music. What does ADAM lack to become a scientist? We could say that it lacks the possibility of controlling or verifying its outcomes from different points of view, for example from an interdisciplinary perspective. But this seems a merely practical limit, surmountable with a lot of additional scientific knowledge from different domains, given that ADAM has the concrete possibility of doing experiments. Yet, as regards this specific aspect, what is the reach of ADAM - or of other programs devoted to scientific discovery, like EVE, specialized in the pharmaceutical field - in conducting experiments? Or, which is the same thing, how far could it get in formulating hypotheses? It all seems to depend on its capacity for interaction with the real world. And so we could say that in order to answer the question of whether ADAM or other similar artificial artifacts are intelligent, we have to consider not only the originality of their outcomes, but also their creativity in hypothesis formulation, a task that is strictly dependent on their practical interaction with the real world.
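The cycle of Figure 1 can be rendered schematically as follows. This is our own abstraction of the loop described above, not code from the ADAM project, and all the callables (generate, run_experiment, consistent) as well as the background knowledge are placeholders for the corresponding components of a real robot scientist.

def discovery_cycle(background, generate, run_experiment, consistent, rounds=10):
    # Hypothesize -> experiment -> evaluate -> update knowledge, repeatedly.
    knowledge = set(background)
    for _ in range(rounds):
        for hypothesis in generate(knowledge):
            result = run_experiment(hypothesis)   # interaction with the real world
            if consistent(hypothesis, result):    # e.g. a statistical test
                knowledge.add(hypothesis)         # a new, very specific, law
    return knowledge

The decisive step, on the view defended here, is run_experiment: it is the one place where the model leaves its closed knowledge base and interacts with reality.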
Is this a violation of what Turing said we should not consider in order to establish whether a machine is intelligent, namely its "physical" difference from human beings? We think not. We think that interaction between a model of reality and reality itself, from a scientific point of view, is the most important aspect of scientific discovery, and that it could in the future be one of the ways in which to evaluate the results of artificial artifacts and their intelligence. As a matter of fact, science and scientific discovery take place in a domain in which knowledge and methods are widely structured, and the invention of new hypotheses and theories could reveal itself to be a task of combining previous knowledge, even knowledge expressed in some symbolic language, more than a creation from nothing. And the capability to operate such a combination could be the subjective perspective, the first-person point of view, of future machines.

4 EMOTION INTERACTING: THE CASE OF LOVE

Another field in which the notion of interaction could be relevant from the point of view of the Turing Test is that of the emotions, their role in the interaction with the environment, and the language used to transmit them. Emotions are cognitive phenomena. It is not possible to characterize them as irrational dispositions; rather, they provide us with all the necessary information about the world around us. Emotions are a way of relating to the environment and to other individuals. Emotions are probably a necessary condition for our mental life [2, 6]. They show us our radical dependence on the natural and social environment. One of the most significant cognitive emotions is love. Since antiquity, philosophers have considered love a crucial issue in their studies. Modern-day psychologists have discussed its dynamics and dysfunctions. However, it has rarely been investigated as a genuine human cognitive phenomenon. In its most common sense, love has been considered in poetry, philosophy, and literature as being something universal, but at the same time as a radically subjective feeling. This ambiguity is the reason why love is such a complicated subject matter. Now, we want to argue that love, by virtue of its rational character, can be studied in a scientific way. According to the philosophical tradition, human beings are rational animals. However, the same rationality that guides us in many circumstances sometimes creates difficult puzzles. Feelings and emotions, like love, are fortunately able to offer an efficient reason for action. Even if what "love" is defies definition, it remains a crucial experience in the ordinary life of human beings. It participates in the construction of human nature and in the construction of an individual's identity. This is shown by the universality of the feeling of love across cultures. It is rather complicated to offer a precise definition of "love", because its features include emotional states such as tenderness, commitment, passion, desire, jealousy, and sexuality. Love modifies people's way of thinking and acting, and it is characterized by a series of physical symptoms. In fact, love has often been considered a type of mental illness. How many kinds of love are there? In what relation do they stand? Over the past decades many classifications of love have been proposed. Social psychologists such as Berscheid and Walster [1], for example, in their cognitive theory of emotion, propose two stages of love.
The first has to do with a state of physiological arousal, and is caused by the presence of positive emotions, like sexual arousal, satisfaction, and gratification, or by negative emotions, such as fear, frustration, or being rejected. The second stage of love is called "tagging", i.e., the person defines this particular physiological arousal as a "passion" or "love". A different approach is taken by Lee [12] and Hendrick [7, 9]. Their interest is to identify the many ways we have of classifying love. They focus their attention on love styles, identifying six of them: Eros, Ludus, Mania, Pragma, Storge and Agape. Eros (passionate love) is the passionate love which gives central importance to the sexual and physical appearance of the partner; Ludus (game-playing love) is a type of love exercised as a game that does not lead to a stable, lasting relationship; Mania (possessive, dependent love) is a very emotional type of love which is identified with the stereotype of romantic love; Pragma (logical love) concerns the fact that lovers have a concrete and pragmatic sense of the relationship, using romance to satisfy their particular needs and dictating its terms; Storge (friendship-based love) is a style in which the feeling of love toward the other grows very slowly. Finally, it is possible to speak of Agape (all-giving selfless love), characterized by a selfless, spiritual and generous love, something rarely experienced in the lifetime of individuals. Robert Sternberg [20] offers a graphical representation of love called the "triangular theory". The name stems from the fact that the identified components are the vertices of a triangle. The work of the Yale psychologist deviates from previous taxonomies, in other words from the previous attempts made to offer a catalogue of the existing types of love. The psychological elements identified by Sternberg to articulate feelings of love are three: intimacy, passion, decision/commitment. The different forms of love that one may encounter in everyday life would result from a combination of each of these elements or the lack of them. Again, in the study and analysis of the feeling of love we encounter a list of types of love: non-love, liking, infatuation, empty love, romantic love, companionate love, fatuous love, consummate love. Philosophers, fleeing from any kind of taxonomy, approach the feeling of love cautiously, surveying it and perhaps even fearing it. Love seems to have something in common with the deepest of mysteries, i.e. the end of life. It leads us to question, as death does, the reality around us as well as ourselves, in the hope that something precious and important does not pass us by. But love is also the guardian of an evil secret which, when revealed, consists in the nonexistence of the love object, in that it is nothing but a projection of our own desires. Love is, according to Arthur Schopenhauer, a sequence of actions performed by those who know perfectly well that there is a betrayal in it, in that it does nothing else but carry out the painful event in which life consists. Thus love, too, has its veil of Maya, and once it is torn down, what remains? What remains is the instinctive imperative of the sexual reproduction of the species. Human nature has for Harry G. Frankfurt [4] two fundamental characteristics: rationality and the capacity to love.
Reason and love are the regulatory authorities that guide the choices to be made, providing the motivation to do what we do and constraining it by creating a space which circumscribes or outlines the area in which we can act. On one hand, the ability to reflect and think about ourselves leads to a sort of paralysis: the ability to reflect, indeed, offers the tools to achieve our desires but, at the same time, is often an impediment to their satisfaction, leading to an inner split. On the other, the ability to love unites all our fragments, structuring and directing them towards a definite end. Love, therefore, seems to be involved in the integration processes of personal identity. In The Origin of Species [3] Charles Darwin assigned great importance to sexual selection, arguing that language, in its gradual development, was subject to sexual selection, and recognizing in it the features of an adaptation that we could call unusual (such as intelligence or morality). The dispute that has followed concerning language and its origins has ignited the minds of many scholars and fueled the debate about whether language is innate or is, on the contrary, a product of learning. Noam Chomsky has vigorously fought this battle against the tenets of social science, maintaining that language depends on an innate genetic ability. Verbal language is a communication system far more complex than other modes of communication. There are strong referential concepts expressed through language that are capable of building worlds. Such features have been the main cause of the perception of language, within the community of scholars, as something mysterious, something that appeared suddenly in the course of our history. For a long time arguments concerning the evolution of language were banned, and the idea that such a phenomenon could be investigated and argued about according to the processes that drive the evolution of the natural world was considered to be of no help in understanding the complex nature of language. Chomsky was one of the main protagonists of this theoretical trend. According to Chomsky, the complex nature of language is such that it can be understood only through a formal and abstract approach such as the paradigm of generative grammar. This theoretical position rules out the possibility of a piecemeal approach to the study of language and the ability to use the theory of evolution to get close to understanding it. Steven Pinker and Paul Bloom, two well-known pupils of Chomsky, in an article entitled "Natural Language and Natural Selection", renewed the debate on the origin of language, stating that it is precisely the theory of evolution that provides the key to explaining the complexity of language. A fascinating hypothesis on language as a biological adaptation is the one that considers it an important feature in courtship. Precisely for this reason it would have been subject to sexual selection [13]. A good part of courtship has a verbal nature: promises, confessions, stories, statements, requests for dates are all linguistic phenomena. In order to woo, to find the right words, the right tone of voice and the appropriate arguments, you need to employ language. Even the young mathematician Alan Turing utilized the courtship form to create his imitation game, with the aim of finding an answer to a simple - but only in appearance - question ("can machines think?").
Turing formulated and proposed a way to establish this by means of a game that has three protagonists: a man, a woman and an interrogator. The man and the woman are together in one room; in another place is the interrogator, and communication is allowed through the use of a teleprinter. The ultimate goal of the interrogator is to identify whether on the other side there is a man or a woman. The interesting part concerns what would happen if, in the man's place, a computer were put that could simulate the communicative capabilities of a human being. As we mentioned before, the thing that Turing emphasizes in this context is that the only point of contact between human being and machine is linguistic communication. If the computer is capable of expressing a wide range of linguistic behavior appropriate to the specific circumstances, it can be considered intelligent. Among the behaviors to be exhibited, Turing includes kindness, the use of appropriate words, and autobiographical information. The importance of transferring autobiographical information to whoever stands in front of us, thus coating the conversation with a personal and private patina, the expression of shared interests, the use of kindness and humor: these are all ingredients typically found in the courtship rituals of human beings. It is significant that a way of demonstrating the presence of a real human being passed through a linguistic courtship, a mode of expression that reveals the complex nature of language and the presence within it of cognitive abilities. Turing asks: "Can machines think?", and we might answer: "Maybe, if they could get a date on a Saturday evening". To conclude, in the case of a very particular phenomenon such as love, one of the most intangible emotions, Turing pushes us to consider the role of language as fundamental. But love is a very concrete emotion as well, because of its first-person perspective. Nevertheless, in order to communicate it, we human beings too are compelled to express it in words as best we can, and at the same time we have only language for understanding the emotion of love in other entities (human beings, of course), together with every real possibility of making mistakes and deceiving ourselves. And so, if we admit the reality of this emotion also from a high-level cognitive point of view, one that involves intelligence and rationality, we have two consequences. The first is that only interaction reveals love; the second is that only natural language interaction, made of all the complex concepts that create a bridge between our feelings and those of another human being, reveals the qualia of the entity involved in a love exchange. Probably that is why Turing wanders through that subject in his imitation game. And probably the understanding of this kind of interaction could be, in the future, a real challenge for artificial artifacts provided with "qualia-detecting sensors", which cannot be so very different from qualia themselves.

5 A TURING TEST FOR THE HUMAN (BEING) BRAIN

A last way in which we could see interaction (connected to language) as relevant for testing intelligence in machines needs two perspective reversals. The first one concerns the use of Turing-Test-like methods to establish the presence of (a certain level of) consciousness in unresponsive brain-damage patients.
As a matter of fact, such patients are not able to use natural language to communicate as human beings usually do. So researchers try to find signs of communication that are different from language, like blinks of an eyelid, eye-tracking, simple command following, or responses to pain, and they try at the same time to understand whether these are intentional or automatic [22]. In such cases, neurologists are looking for signs of intelligence, namely of the capability of intentionally using cognitive faculties, through a behavioral method that overturns Turing's. In the case of machines and the Turing Test, the natural language faculty is the evidence of the presence of intelligence in machines; in the case of unresponsive brain-damage patients, scientists assume that the patients were able to communicate through natural language before the damage, and so that they were and are intelligent, because intelligence is a human trait. Thus, they look for bodily signs in order to establish a communication that is barred through the usual means. This is even more relevant if we consider vegetative-state patients, who are not able to perform any bodily movement. In recent years, some researchers have supposed that it is possible to establish communication with vegetative-state patients, a communication that would also show a certain level of consciousness, by means of typical neuroimaging techniques like fMRI and PET [17] (for a general presentation and discussion see also [18, 23]). In short, through such experiments they observed that some vegetative-state patients, unable to carry out any bodily response, showed brain activation very similar to that of healthy human beings when they were asked, through auditory instructions, to imagine themselves walking through their house or playing tennis. Even though the interpretation of such outcomes is controversial, because of problems regarding neuroimaging methodology and the very nature of conscious activity, if we accept them they would perhaps prove the presence of a certain level of consciousness in this kind of patient, namely the presence of consciousness in mental activities. They would prove, thus, the presence of intentionality in the patient's response, and not only of cognitive processes or activities, which could be just cognitive "islands" of mental functioning [11]. Such experimental outcomes could be very useful for building new techniques and tools of brain-computer interaction for people who are no longer able to communicate by natural language and bodily movements, even though many problems still have to be solved, from a theoretical and epistemological point of view, as regards the methodology and the interpretation of such results [23]. Is it a real communication? Are those responses a sign of awareness? Could those responses be real answers to external requests? Yet, what is important for our argument is the possibility of back-transferring these outcomes to machines, and this is the second reversal we mentioned before. As a matter of fact, these experiments are based on the assumption that human beings too are machines and that communication is interaction between mechanical parts, also in the case of subjective, phenomenal experiences, which are evoked by means of language but lack external signs. So, the challenging question is: is it possible to find a parallel in machines?
Is it possible to re-create in artificial artifacts this kind of communication, which is not behavioral but is still mechanical and detectable inside machines - virtual or concrete mechanisms - and is simultaneously a sign of consciousness and awareness in the sense of qualia? Is this sort of (non-natural-language) communication, if any, a way in which we could find qualia in programs or robots? Is it the sort of interaction that could lead us to the feelings of machines?

REFERENCES

[1] E. Berscheid, E. Walster, Interpersonal Attraction, Addison-Wesley, Boston, Mass., 1978.
[2] A.R. Damasio, Descartes' Error: Emotion, Reason, and the Human Brain, Putnam Publishing, New York, 1994.
[3] C. Darwin, On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life, Murray, London, 1859.
[4] H.G. Frankfurt, The Reasons of Love, Princeton University Press, Princeton, 2004.
[5] D. Gillies, Artificial Intelligence and Scientific Method, Oxford University Press, Oxford, 1996.
[6] P. Griffith, What Emotions Really Are. The Problem of Psychological Categories, Chicago University Press, Chicago, 1997.
[7] C. Hendrick, S. Hendrick, ‘A Theory and a Method of Love’, Journal of Personality and Social Psychology, 50, 392–402, (1986).
[8] R.D. King, J. Rowland, W. Aubrey, M. Liakata, M. Markham, L.N. Soldatova, K.E. Whelan, A. Clare, M. Young, A. Sparkes, S.G. Oliver, P. Pir, ‘The Robot Scientist ADAM’, Computer, 42, 8, 46–54, (2009).
[9] C. Hendrick, S. Hendrick, Romantic Love, Sage, California, 1992.
[10] D.R. Hofstadter, Le Ton beau de Marot, Basic Books, New York, 1997.
[11] S. Laureys, ‘The neural correlate of (un)awareness: lessons from the vegetative state’, Trends in Cognitive Sciences, 9, 12, 556–559, (2005).
[12] J. Lee, The Colors of Love, Prentice-Hall, Englewood Cliffs, 1976.
[13] G.F. Miller, The Mating Mind. How Sexual Choice Shaped the Evolution of Human Nature, Anchor Books, London, 2001.
[14] S. Muggleton, R.D. King, M.J.E. Sternberg, ‘Protein secondary structure prediction using logic-based machine learning’, Protein Engineering, 5, 7, 647–657, (1992).
[15] A. Newell, J.C. Shaw, H.A. Simon, ‘Report on a general problem-solving program’, Proceedings of the International Conference on Information Processing, pp. 256–264, (1959).
[16] A. Newell, H.A. Simon, Human Problem Solving, Prentice-Hall, Englewood Cliffs, NJ, 1972.
[17] A.M. Owen, N.D. Schiff, S. Laureys, ‘The assessment of conscious awareness in the vegetative state’, in S. Laureys, G. Tononi (eds.), The Neurology of Consciousness, Elsevier, pp. 163–172, 2009.
[18] A.M. Owen, N.D. Schiff, S. Laureys, ‘A new era of coma and consciousness science’, Progress in Brain Research, 177, 399–411, (2009).
[19] H. Putnam, Mind, Language and Reality. Philosophical Papers, Vol. 2, Cambridge University Press, Cambridge, 1975.
[20] R. Sternberg, ‘A Triangular Theory of Love’, Psychological Review, 93, 119–135, (1986).
[21] A. Sparkes, W. Aubrey, E. Byrne, A. Clare, M.N. Khan, M. Liakata, M. Markham, J. Rowland, L.N. Soldatova, K.E. Whelan, M. Young, R.D. King, ‘Towards Robot Scientists for autonomous scientific discovery’, Automated Experimentation, 2:1, (2010).
[22] J.F. Stins, ‘Establishing consciousness in non-communicative patients: A modern day version of the Turing Test’, Consciousness and Cognition, 18, 1, 187–192, (2009).
[23] J.F. Stins, S. Laureys, ‘Thought translation, tennis and Turing tests in the vegetative state’, Phenomenology and Cognitive Science, 8, 361–370, (2009).
[24] A.M. Turing, ‘On Computable Numbers, with an Application to the Entscheidungsproblem’, Proceedings of the London Mathematical Society, 42, 230–265, (1936); reprinted in: J. Copeland (ed.), The Essential Turing, Oxford University Press, Oxford, pp. 58–90, 2004.
[25] A.M. Turing, ‘Intelligent Machinery’, internal report of the National Physical Laboratory, (1948); reprinted in: J. Copeland (ed.), The Essential Turing, Oxford University Press, Oxford, pp. 410–432, 2004.
[26] A.M. Turing, ‘Computing Machinery and Intelligence’, Mind, 59, 433–460, (1950).

Could There be a Turing Test for Qualia?

Paul Schweizer (Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, EH8 9AD, UK)

Abstract. The paper examines the possibility of a Turing test designed to answer the question of whether a computational artefact is a genuine subject of conscious experience. Even given the severe epistemological difficulties surrounding the 'other minds problem' in philosophy, we nonetheless generally believe that other human beings are conscious. Hence Turing attempts to defend his original test (2T) in terms of operational parity with the evidence at our disposal in the case of attributing understanding and consciousness to other humans. Following this same line of reasoning, I argue that the conversation-based 2T is far too weak, and that we must scale up to the full linguistic and robotic standards of the Total Turing Test (3T). Within this framework, I deploy Block's distinction between Phenomenal-consciousness and Access-consciousness to argue that passing the 3T could at most provide a sufficient condition for concluding that the robot enjoys the latter but not the former. However, I then propose a variation on the 3T, adopting Dennett's method of 'heterophenomenology', to rigorously probe the robot's purported 'inner' qualitative experiences. If the robot could pass such a prolonged and intensive Qualia 3T (Q3T), then the purely behavioural evidence would seem to attain genuine parity with the human case. Although success at the Q3T would not supply definitive proof that the robot was genuinely a subject of Phenomenal-consciousness, given that the external evidence would now be equivalent with the human case, apparently the only grounds for denying qualia would be an appeal to difference of internal structure, either physical-physiological or functional-computational. In turn, both of these avenues are briefly examined.

1 INTRODUCTION

According to the computational paradigm which underpins cognitive science, Strong AI and various allied positions in the philosophy of mind, computation (of one sort or another) is held to provide the scientific key to explaining mentality in general and, ultimately, to reproducing it artificially. The paradigm maintains that cognitive processes are essentially computational processes, and hence that intelligence in the natural world arises when a material system implements the appropriate kind of computational formalism. So this broadly Computational Theory of Mind (CTM) holds that the mental states, properties and contents sustained by human beings are fundamentally computational in nature, and that computation, at least in principle, opens the possibility of creating artificial minds with comparable states, properties and contents.
Traditionally there are two basic features that are held to be essential to minds and which decisively distinguish mental from non-mental systems. One is representational content: mental states can be about external objects and states of affairs. The other is conscious experience: roughly and as a first approximation, there is something it is like to be a mind, to be a particular mental subject. As a case in point, there is something it is like for me to be consciously aware of typing this text into my desktop computer. Additionally, various states of my mind are concurrently directed towards a number of different external objects and states of affairs, such as the letters that appear on my monitor. In stark contrast, the table supporting my desktop computer is not a mental system: there are no states of the table that are properly about anything, and there is nothing it is like to be the table. And just as the notion of a mind can hardly be applied to a system with no representational states, so too many would claim that a system entirely devoid of conscious experience cannot be a mind. Hence if the project of Strong AI is to be successful at its ultimate goal of producing a system that truly counts as an artificially engendered locus of mentality, then it would seem necessary that this computational artefact be fully conscious in a manner comparable to human beings.

2 CONSCIOUSNESS AND THE ORIGINAL TURING TEST

In 1950 Turing [1] famously proposed an answer to the question 'Can a machine think?' in terms of what has since become universally referred to as the 'Turing test' (2T). In the test, a human interrogator can pose questions to the remaining two players, where the goal of the game is for the questioner to determine which of the two respondents is the computer. If, after a set amount of time, the questioner guesses correctly, then the machine loses the game, and if the questioner is wrong then the machine wins. Turing claimed, as a basic theoretical point, that any machine that could win the game a suitable number of times has passed the test and should be judged to be intelligent, in the sense that its behavioral performance has been demonstrated to be indistinguishable from that of a human being. In his prescient and ground-breaking article, Turing explicitly considers the application of his test to the question of machine consciousness. This is in section (4) of the paper, where he considers the anticipated 'Argument from Consciousness' objection to the validity of his proposed standard for answering the question 'Can a machine think?'. The objection is that, as per the above, consciousness is a necessary precondition for genuine thinking and mentality, and that a machine might fool its interlocutor and pass the purely behavioural 2T, and yet remain completely devoid of internal conscious experience. Hence merely passing the 2T does not provide a sufficient condition for concluding that the system in question possesses the characteristics required for intelligence and bona fide thinking. Hence the 2T is inherently defective. Turing's defensive strategy is to invoke the well-known and severe epistemological difficulties surrounding the very same question regarding our fellow human beings. This is the other minds problem: how do you know that other people actually have a conscious inner life like your own, and that you are not the only conscious being in the universe?
As Turing humorously notes, this type of 'solipsistic' view (although more accurately characterized as a form of other-minds skepticism, rather than full-blown solipsism), while logically impeccable, tends to make communication difficult, and rather than continually arguing over the point, it is usual to simply adopt the polite convention that everyone is conscious. Turing notes that on its most extreme construal, the only way that one could be sure that a machine or another human being is conscious and hence genuinely thinking is to be the machine or the human and feel oneself thinking. In other words, one would have to gain first-person access to what it's like to be the agent in question. And since this is not an empirical option, in ascribing conscious thought all we have to go on is behaviour. Hence Turing attempts to justify his behavioural test for concluding that a machine can think, and ipso facto has conscious experience, by claiming parity with the evidence at our disposal in the case of other humans. He therefore presents his anticipated objector with the following dichotomy: either be guilty of an inconsistency by accepting the behavioural standard in the case of humans but not computers, or maintain consistency by rejecting it in both cases and embracing solipsism. He concludes that most consistent proponents of the argument from consciousness would choose to abandon their objection and accept his test rather than be forced into the solipsistic position. However, it is worth applying some critical scrutiny to Turing's reasoning at this early juncture. Basically, he seems to be running epistemological issues together with semantical and/or factive questions which should properly be kept separate. One issue is what we mean by saying that a system has a mind, i.e. what essential traits and properties we are thereby ascribing to it; another is how we can know that a given system actually satisfies this description. Turing's behaviouristic methodology has a strong tendency to collapse these two themes, but it is important to note that they are conceptually distinct. In the argument from consciousness, the point is that we mean something substantive, something more than just verbal stimulus-response patterns, when we attribute mentality to a system. In this case the claim is that we mean that the system in question has conscious experience, and that this property is required for any agent to be accurately described in mental terms. So one could potentially hold that consciousness is what is meant by the term, and that:
(1) other human beings are in fact conscious;
(2) the computer is in fact unconscious even though it passes the 2T.
This could be the objective state of affairs that genuinely obtains in the world, and this is completely independent of whether we can know, with certainty, that premises (1) and (2) are actually true. Although epistemological and factive issues are intimately related and together inform our general practices and goals of inquiry, we could nonetheless still be correct in our assertions without being able to prove them. And if one held that consciousness was essential to genuine mentality, then one could seemingly deny that any purely behaviouristic standard was sufficient to test for whether a system had or was a mind. In the case of other human beings, we certainly take behaviour as evidence that they are conscious, but the evidence could in principle overwhelmingly support a false conclusion, in both directions. For example, someone could be in a comatose state where they showed no evidence of being conscious because they could make no bodily responses.
But in itself this would not entail the absence of consciousness: such a person might still be aware of what was going on and perhaps be able to report, retrospectively, on past events once out of their coma. And again, maybe some people really are zombies, or sleepwalkers, and exhibit all the appropriate external signs of consciousness while there is nothing going on inside; such possibilities cannot be ruled out a priori.

Historically, there has been disagreement regarding the proper interpretation of Turing's position regarding the intended import of his test. Some have claimed that the 2T is proposed as an operational definition of intelligence, thinking, etc. (e.g. Block [2], French [3]), and as such it has immediate and fundamental faults. However, in the current discussion I will adopt a weaker reading and interpret the test as purporting to furnish an empirically specifiable criterion for when intelligence can be legitimately ascribed to an artefact. On this reading, the main role of behaviour is inductive or evidential rather than constitutive, and so behavioural tests for mentality provide neither a necessary condition nor a reductive definition. At most, all that is warranted is a positive ascription of intelligence or mentality, if the test is adequate and the system passes. In the case of Turing's 1950 proposal, the adequacy of the test is defended almost entirely in terms of parity of input/output performance with human beings, and hence alleges to employ the same operational standards that we tacitly adopt when ascribing conscious thought processes to our fellow creatures. Thus the issue would appear to hinge upon the degree of evidence a successful 2T performance provides for a positive conclusion in the case of a computational artefact (i.e. for the negation of (2) above), and how this compares to the total body of evidence that we have in support of our belief in the truth of (1). We will only be guilty of an inconsistency or employing a double standard if the two are on a par and we nonetheless dogmatically still insist on the truth of both (1) and (2). But if it turns out to be the case that our evidence for (1) is significantly better than for the negation of (2), then we are not forced into Turing's dichotomy. And with the purely conversational 2T, there is clearly very little parity with the human case. We rely on far more than simply verbal behaviour in arriving at the polite convention that other human beings are conscious. In addition to conversational data, we lean very heavily on their bodily actions involving perception of the spatial environment, navigation, physical interaction, verbal and other modes of response to communally accessible non-verbal stimuli in the shared physical surroundings, etc. So the purely conversational standards of the 2T are not nearly enough to support a claim of operational parity with humans.

In light of the foregoing observations, in order to move towards evidential equivalence in terms of observable behaviour, it is necessary to break out of the closed syntactic bubble of the 2T and scale up to a full linguistic and robotic version of the test. But before exploring this vastly strengthened variation as a potential test for the presence of conscious experience in computational artefacts, in the next section I will briefly examine the notion of consciousness itself, since we first need to attain some clarification regarding the phenomenon in question before we go looking for it in robots.
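Before turning to the notion of consciousness itself, it may help to fix ideas on just how little the purely behavioural 2T actually measures. The following is a minimal Python sketch of the scoring protocol described in section 2; the judge/player interfaces (ask, answer, identify_machine) are hypothetical names introduced purely for illustration, not part of any cited proposal.

    # Minimal sketch of the 2T scoring protocol described in section 2.
    # The judge/player interfaces used here are hypothetical illustrations.
    import random

    def run_2t(judge, human, machine, n_sessions=100, n_questions=10):
        """Return the machine's win rate over repeated imitation games."""
        wins = 0
        for _ in range(n_sessions):
            # Hide the players behind randomly assigned labels A and B.
            labels = ["A", "B"]
            random.shuffle(labels)
            players = dict(zip(labels, [human, machine]))
            transcript = {label: [] for label in players}
            for _ in range(n_questions):
                question = judge.ask()
                for label, player in players.items():
                    transcript[label].append((question, player.answer(question)))
            # The machine wins a session exactly when the judge guesses wrong.
            guess = judge.identify_machine(transcript)  # returns "A" or "B"
            if players[guess] is not machine:
                wins += 1
        return wins / n_sessions

The point to note is that nothing in this loop constrains what goes on inside the respondents; only transcripts of question/answer pairs are ever scored, and this is precisely the feature that the argument from consciousness exploits.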
3 TWO TYPES OF CONSCIOUSNESS

Even in the familiar human case, consciousness is a notoriously elusive phenomenon, and is quite difficult to characterize rigorously. In addition, the word 'consciousness' is not used in a uniform and univocal manner, but rather appears to have different meanings in different contexts of use and across diverse academic communities. Block [4] provides a potentially illuminating philosophical analysis of the distinction and possible relationship between two common uses of the word, arguing that it connotes a number of different concepts and denotes a number of different phenomena. He attempts to clarify the issue by distinguishing two basic and distinct forms of consciousness that are often conflated: Phenomenal or P-consciousness and Access or A-consciousness. P-consciousness is experience: what makes a state phenomenally conscious is that there is something it is like to be in that state. More controversially, Block holds that P-conscious properties, as such, are distinct from any cognitive, intentional or functional property. The notoriously difficult explanatory gap problem in philosophical theorizing concerns P-consciousness, e.g. how is it possible that appeal to a physical brain process could explain what it is like to see something as red?

So we must take care to distinguish this type of purely qualitative, Phenomenal consciousness from Access consciousness, the latter of which Block sees as an information processing correlate of P-consciousness. A-conscious states and structures are those which are directly available for control of speech, reasoning and action. Hence Block's rendition of A-consciousness is similar to Baars' [5] notion that conscious representations are those that are broadcast in a global workspace. The functional/computational approach holds that the level of analysis relevant for understanding the mind is one that allows for multiple realization, so that in principle the same mental states and phenomena can occur in vastly different types of physical systems which implement the same abstract functional or computational structure. As a consequence, a staunch adherent of the functional-computational approach is committed to the view that the same conscious states must be preserved across widely diverse types of physical implementation. In contrast, a biologically oriented approach holds that details of the particular physical/physiological realization matter in the case of conscious states. Block says that if P = A, then the information processing side is right, while if the biological nature of experience is crucial then we can expect that P and A will diverge.

A crude difference between the two in terms of overall characterization is that P-consciousness content is qualitative while A-consciousness content is representational. A-conscious states are necessarily transitive or intentionally directed: they are always states of consciousness of something. However, P-conscious states need not be. On Block's account, the paradigm P-conscious states are the qualia associated with sensations, while the paradigm A-conscious states are propositional attitudes. He maintains that the A-type is nonetheless a genuine form of consciousness, and tends to be what people in cognitive neuroscience have in mind, while philosophers are traditionally more concerned with qualia and P-consciousness, as in the hard problem and the explanatory gap. In turn, this difference in meaning can lead to mutual misunderstanding. In the following discussion I will examine the consequences of the distinction between these two types of consciousness on the prospects of a Turing test for consciousness in artefacts.
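The access notion lends itself naturally to a computational gloss. The toy sketch below assumes a drastically simplified Baars-style global workspace (all class and function names are hypothetical illustrations, not drawn from Baars [5] or Block [4]). It renders A-consciousness as sheer availability: whatever wins the broadcast competition is thereby poised for the control of speech, reasoning and action, while nothing in the mechanism speaks to whether there is anything it is like to be the system.

    # A toy Baars-style global workspace: A-consciousness as availability.
    # All names are hypothetical; nothing here purports to produce qualia.
    from dataclasses import dataclass, field

    @dataclass
    class Representation:
        content: str        # e.g. "red object at left"
        salience: float     # competition weight for broadcast

    @dataclass
    class GlobalWorkspace:
        consumers: list = field(default_factory=list)  # speech, motor, ...

        def broadcast(self, candidates):
            # Only the most salient representation becomes A-conscious:
            # it alone is made globally available to every consumer process.
            winner = max(candidates, key=lambda r: r.salience)
            for consumer in self.consumers:
                consumer(winner)
            return winner

    def speech(r): print(f"say: I see {r.content}")
    def motor(r):  print(f"act: orient toward {r.content}")

    ws = GlobalWorkspace(consumers=[speech, motor])
    ws.broadcast([Representation("red object at left", 0.9),
                  Representation("faint hum", 0.2)])

The design point is exactly Block's: such a mechanism specifies what a representation can do (be accessed for report and control), not what it is like.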
4 THE TOTAL TURING TEST

In order to attain operational parity with the evidence at our command in the case of human beings, a Turing test for even basic linguistic understanding and intelligence, let alone conscious experience, must go far beyond Turing's original proposal. The conversational 2T relies solely on verbal input/output patterns, and these alone are not sufficient to evince a correct interpretation of the manipulated strings. Language is primarily about extra-linguistic entities and states of affairs, and there is nothing in a cunningly designed program for pure syntax manipulation which allows it to break free of this closed loop of symbols and demonstrate a proper correlation between word and object. When it comes to judging human language users in normal contexts, we rely on a far richer domain of evidence. Even when the primary focus of investigation is language proficiency and comprehension, sheer linguistic input/output data is not enough. Turing's original test is not a sufficient condition for concluding that the computer genuinely understands or refers to anything with the strings of symbols it manipulates, precisely because the machine has no relations and interactions with the objects and states of affairs in the real world that its words are supposed to be about. To illustrate the point: if the computer has no eyes, no hands, no mouth, and has never seen or eaten anything, then it is not talking about hamburgers when its program generates the string 'h-a-m-b-u-r-g-e-r' inside a closed loop of syntax. In sharp contrast, our talk of hamburgers is intimately connected to nonverbal transactions with the objects themselves. There are language entry rules, which take us from nonverbal stimuli to appropriate linguistic behaviours. When given the visual stimulus of being presented with a pizza, a taco and a kebab, we can produce the salient utterance "Those particular foodstuffs are not hamburgers". And there are language exit rules, which take us from words to appropriate nonverbal actions. For example, we can follow complex verbal instructions and produce the indicated patterns of behaviour, such as finding the nearest Burger King on the basis of a description of its location in spoken English. Mastery of both of these types of rules is essential for deeming that a human agent understands natural language and is using expressions in a correct and referential manner - and the hapless 2T computer lacks both.2 And when it comes to testing for conscious experience, we again need these basic additional dimensions of perception and action in the real world as an essential precondition.

The fundamental limitations of mere conversational performance naturally suggest a strengthening of the 2T, later named the Total Turing Test (3T) by Harnad [7], wherein the repertoire of relevant behaviour is expanded to include the full range of intelligent human activities. This will require that the computational procedures respond to and control not simply a teletype system for written inputs and outputs, but rather a well crafted artificial body. Thus in the 3T the scrutinized artefact is a robot, and the data to be tested coincide with the full spectrum of behaviours of which human beings are normally capable. In order to succeed, the 3T candidate must be able to do, in the real world of objects and people, everything that intelligent people can do. Thus Harnad expresses a widely held view when he claims that the 3T is "...no less (nor more) exacting a test of having a mind than the means we already use with one another...
[and, echoing Turing] there is no stronger test, short of being the candidate". And, as noted above, the latter state of affairs is not an empirical option: one cannot be the candidate being examined.3

Since the 3T requires the ability to perceive and act in the real world, and since A-conscious states and structures are those which are directly available for control of speech, reasoning and action, it would seem to follow that the successful 3T robot must be A-conscious. For example, in order to pass the test, the robot would have to behave in an appropriate manner in any number of different scenarios such as the following. The robot is handed a silver platter on which a banana, a boiled egg, a teapot and a hamburger are laid out. The robot is asked to pick up the piece of fruit and throw it out the window. Clearly the robot could not perform the indicated action unless it had direct information processing access to the identity of the salient object, its spatial location, the movements of its own mechanical arm, the location and geometrical properties of the window, etc. Such transitive, intentionally directed A-conscious states are plainly required for the robot to pass the test.

But does it follow that the successful 3T robot is P-conscious? It seems not, since on the face of it there appears to be no reason why the robot could not pass the test relying on A-consciousness alone. All that is being tested is its executive control of the cognitive processes enabling it to reason correctly and perform appropriate verbal and bodily actions in response to a myriad of linguistic and perceptual inputs. These abilities are demonstrated solely through its external behaviour, and so far, there seems to be no reason for P-conscious states to be invoked. So even if the 3T provides an adequate standard for intelligence and linguistic understanding in the actual world, the A-conscious robot could conceivably pass the 3T while at the same time there is nothing it is like to be the 3T robot passing the test. We are now bordering on issues involved in demarcating the 'easy' from the 'hard' problems of consciousness, which, if pursued at this point, would be moving in a direction not immediately relevant to the topic at hand. So rather than exploring arguments relating to this deeper theme, I will simply contend that passing the 3T provides a sufficient condition for Block's version of A-consciousness, but not for P-consciousness, since it could presumably be passed by an artefact devoid of qualia. Many critics of Block's basic type of view (including Searle [9] and Burge [10]) argue that if there can be states that are A-conscious but not P-conscious, then they are not genuinely conscious at all. Instead, A-'consciousness' is mere access, and is a form of consciousness only to the extent that it is parasitic upon P-conscious states.

2 Shieber [6] provides a valiant and intriguing rehabilitation/defense of the 2T, but it nonetheless still neglects crucial data, such as mastery of language exit and entry rules. Ultimately Shieber's rehabilitation in terms of interactive proof requires acceptance of the notion that conversational input/response patterns alone are sufficient, which premise I would deny for the reasons given. The program is still operating within a closed syntactic bubble.

3 See Schweizer [8] for an argument to the effect that even the combined linguistic and robotic 3T is still too weak as a definitive behavioural test of artificial intelligence.
So we could potentially have a 3T for A-consciousness, but then the pivotal question arises: is A-consciousness without associated qualitative presentations really a form of consciousness? Again, I will not delve into this deeper and controversial issue in the present discussion, but simply maintain that the successful 3T robot does at least exhibit the type of A-awareness that people in, e.g., cognitive neuroscience tend to call consciousness. But as stated earlier, 'consciousness' is a multifaceted term, and there are also good reasons for not calling mere A-awareness without qualia a full-fledged form of consciousness. For example, someone who was drugged or talking in their sleep could conceivably pass the 2T while still 'unconscious', that is, A-'conscious' but not P-conscious. And a human sleepwalker might even be able to pass the verbal and robotic 3T while 'unconscious' (again A-'conscious' but not P-conscious). What this seems to indicate is that only A-'consciousness' can be positively ascertained by behaviour. But there is an element of definitiveness here, since it seems plausible to say that an agent could not pass the 3T without being A-'conscious', at least in the minimal sense of A-awareness. If the robot were warned 'mind the banana peel' and it was not A-aware of the treacherous object in question on the ground before it, emitting the frequencies of electromagnetic radiation appropriate for 'banana-yellow', then it would not deliberately step over the object, but rather would slip and fall and fail the test.

5 A TOTAL TURING TEST FOR QUALIA

In the remainder of the paper I will not pursue the controversial issue as to whether associated P-consciousness is a necessary condition for concluding that the A-awareness of the successful 3T robot is genuinely a form of consciousness at all. Instead, I will explore an intensification of the standard 3T intended to probe more rigorously for evidential support of the presence of P-conscious states. This Total Turing Test for qualia (Q3T) is a more focused scrutiny of the successful 3T robot which emphasizes rigorous and extended verbal and descriptive probing into the qualitative aspects of the robot's purported internal experiences. So the Q3T involves unremitting questioning and verbal analysis of the robot's qualitative inner experiences, in reaction to a virtually limitless variety of salient external stimuli, such as paintings, sunsets, musical performances, tastes, textures, smells, pleasures and pains, emotive reactions...

Turing suggests a precursor version of this strategy in his 1950 discussion of the argument from consciousness, where he observes that the question of machine consciousness could be addressed by a sustained viva voce, in which the artefact is asked questions directly concerning its aesthetic and other types of qualitative reactions and judgements in response to open-ended questioning by the interrogator. Turing provides a conjectural illustration of the method in the guise of a 'sonnet writing' programme being quizzed by a human judge.

Interrogator: In the first line of your sonnet which reads "Shall I compare thee to a summer's day," would not "a spring day" do as well or better?
Witness: It wouldn't scan.
Interrogator: How about "a winter's day," that would scan all right.
Witness: Yes, but nobody wants to be compared to a winter's day.
Interrogator: Would you say Mr. Pickwick reminded you of Christmas?
Witness: In a way.
Interrogator: Yet Christmas is a winter's day, and I do not think Mr. Pickwick would mind the comparison.
Witness: I don't think you're serious. By a winter's day one means a typical winter's day, rather than a special one like Christmas.

And so on.... The above sample dialogue serves as a good reminder of just how difficult the original 2T really is (and consequently why it hasn't yet been passed). However, this conjectured scenario is still confined to a merely conversational setting of verbal inputs and verbal outputs, and hence falls far short of the behavioural potential of a full 3T edition, as well as the concomitant evidential standards applicable in the human case.

Plebe and Perconti [11] put forward a strengthened adaptation of a 2T-style viva voce, where the pivotal difference is that, in addition to merely linguistic inputs, the computer must now give appropriate and testable conversational reactions to uploaded images. This is an interesting and important augmentation of the original 2T, since the inputs are no longer strictly linguistic, and the test is aimed at evaluating verbally plausible responses to stimuli that, to us at least, have a phenomenal aspect. As an example of the method, Plebe and Perconti supply an excerpt from a hypothetical conversation.

Interrogator: Do you want to look at a picture of me?
Machine: Oh yes, thanks, let's upload that.
<.... uploading>
Machine: Mmmh, I see several people here, who are you?
Interrogator: Try to guess.
Machine: Well, I know you are blond and shy, so I would guess the second from the left.

This appears to be an order of magnitude jump over the purely verbal 2T, and accordingly its standards of satisfaction are even more futuristic. However, in terms of the ultimate goal of providing a test, the passing of which constitutes a sufficient condition for the presence of genuine conscious experience in an artefact, it should be noted that the inputs, at a crucial level of analysis, remain purely syntactic and non-qualitative, in that the uploaded image must take the form of a digital file. Hence this could at most provide evidence of some sort of (proto) A-awareness in terms of salient data extraction and attendant linguistic conversion from a digital source, where the phenomenal aspects produced in humans by the original (pre-digitalized) image are systematically corroborated by the computer's linguistic outputs when responding to the inputted code. Although a major step forward in terms of expanding the input repertoire under investigation, as well as possessing the virtue of being closer to the limits of practicality in the nearer term future, this proposed new qualia 2T still falls short of the full linguistic and robotic Q3T. In particular it tests, in a relatively limited manner, only one sensory modality, and in principle there is no reason why this method of scrutiny should be restricted to the intake of photographic images represented in digital form. Hence a natural progression would be to test a computer on uploaded audio files as well. However, this expanded 2T format is still essentially passive in nature, where the neat and tidy uploaded files are hand fed into the computer by the human interrogator, and the outputs are confined to mere verbal response. Active perception of and reaction to distal objects in the real world arena are critically absent from this test, and so it fails to provide anything like evidential parity with the human case.
And given the fact that the selected non-linguistic inputs take the form of digitalized representations of possible visual (and/or auditory) stimuli, there is still no reason to think that there is anything it is like to be the 2T computer processing the uploaded encoding of an image of, say, a vivid red rose. But elevated to a full 3T arena of shared external stimuli and attendant discussion and analysis, the positive evidence of a victorious computational artefact would become exceptionally strong indeed.

So the extended Q3T is based on a methodology akin to Dennett's [12] 'heterophenomenology': given the robot's presumed success at the standard Total Turing Test, we count this as behavioural evidence sufficient to warrant the application of the intentional stance, wherein the robot is treated as a rational agent harbouring beliefs, desires and various other mental states exhibiting intentionality, and whose actions can be explained and predicted on the basis of the content of these states. Accordingly, the robot's salient sonic emissions are interpreted as natural language utterances asserting various propositions and expressing assorted contents. For the reasons delineated above in section 4, I would argue that this interpretive step and application of the intentional stance to a mere artefact is not evidentially warranted in a limited 2T type of setting, and hence in the latter case the syntactic tokens emitted by the computer cannot be given the semantic value that we would attribute to utterances produced by our fellow human beings. However, following Dennett, the gathering of scientifically rigorous empirical evidence proceeds with neutrality regarding the possible conscious experience of the robot.

In this extended Q3T context, the artefact is queried not about uploaded images as 'qualitative' input data, but rather about its own subjective reactions and attendant introspective states. The robot is not hand fed selected digital files, but rather is presented with the same external stimuli as its human interlocutor, the latter of whom can then interrogate the robot regarding its purported qualitative experiences in response to the shared stimuli. Hence the human and the robot can both observe the same Dürer and the same Dali paintings, listen to the same Vivaldi and Schubert concertos, attend the same wine tasting event, eat the same brand of hamburger... And the robot must discourse on its subjective impressions and reactions like a true aesthete, where this discourse can be calibrated against the interrogator's own qualitative reactions to the shared stimuli. As an example of the method, an excerpt from a hypothetical Q3T interaction might look something like the following.

Interrogator: Isn't the sunset over the Pacific magnificent?
Robot: Yes, the colours are sublime. Let's go to the top of that rise for a better view... Excellent, now we have the full panorama.
Interrogator: What impression does it make on you?
Robot: As the glowing orange orb of the sun begins to touch the calm blue line of the horizon, yet another day is passing. Now the radiant disc begins to sink into the sea and the background crimson intensifies and grows warmer. It gives me a feeling of sadness, but at the same time a glimpse of ineffable beauty... Ouch!! Why the hell did you pinch me?
Interrogator: Just wanted to bring you back down to earth - you were starting to wax a bit too lyrical.
Robot: Thanks very much for your concern.
Could you please pass me the bottle of delicious chilled Chardonnay - I want to savour another glass along with the last rays of the setting sun.
Interrogator: Here you go.
Robot: Arrrgh, that tastes disgusting! - what happened to the wine?
Interrogator: Uhh, I just stirred in a little marmite when you weren't looking - wanted to see how you'd react. This is a Q3T, after all...

Even though a merely A-conscious robot could conceivably pass the verbal and robotic 3T while at the same time there is nothing it is like to be the robot passing the test, in this more focussed version of the 3T the robot would at least have to be able to go on at endless length talking about what it's like. And this talk must be in response to an open ended range of different combinations of sensory inputs, which are shared and monitored by the human judge. Such a test would be both subtle and extremely demanding, and it would be nothing short of remarkable if it could not detect a fake. And presumably a human sleepwalker who could pass a normal 3T as above would nonetheless fail this type of penetrating Q3T (or else wake up in the middle!), and it would be precisely on the grounds of such failure that we would infer that the human was actually asleep and not genuinely P-conscious of what was going on. If sufficiently rigorous and extended, this would provide extremely powerful inductive evidence, and indeed to pass the Q3T the robot would have to attain full evidential parity with the human case, in terms of externally manifested behaviour.

6 BEYOND BEHAVIOUR

So on what grounds might one consistently deny qualitative states and P-consciousness in the case of the successful Q3T robot and yet grant it in the case of a behaviourally indistinguishable human? The two most plausible considerations that suggest themselves are both based on an appeal to essential differences of internal structure, either physical/physiological or functional/computational. Concerning the latter case, many versions of CTM (the computational theory of mind) focus solely on the functional analysis of propositional attitude states such as belief and desire, and simply ignore other aspects of the mind, most notably consciousness and qualitative experience. However others, such as Lycan [13], try to extend the reach of Strong AI and the computational paradigm, and contend that conscious states arise via the implementation of the appropriate computational formalism. Let us denote this extension of the basic CTM framework to cover conscious experience as 'CTM+'. A proponent of a version of CTM+ might hold that qualitative experiences arise in virtue of the particular functional and information processing structure of the human brand of cognitive architecture, and hence that, even though the robot is indistinguishable in terms of input/output profiles, nonetheless its internal processing structure is sufficiently different from ours to block the inference to P-consciousness. So the non-identity of abstract functional or computational structure might be taken to undermine the claim that bare behavioural equivalence provides a sufficient condition for the presence of internal conscious phenomena. At this juncture, the proponent of artificial consciousness might appeal to a version of Van Gulick's [14] reply to 'missing qualia' style objections.
When aimed against functionalism, the missing qualia arguments generally assume a deviant realization of the very same abstract computational procedures underlying human cognition - a realization functionally equivalent to ours in all respects - and the position being supported is that consciousness is to be equated with states of the biological brain, rather than with any arbitrary physical state playing the same functional role as a conscious brain process. For example, in Block's [15] well known 'Chinese Nation' scenario, we are asked to imagine a case where each person in China plays the role of a neuron in the human brain, and for some (rather brief) span of time the entire nation cooperates to implement the same computational procedures as a conscious human brain. The rather compelling 'common sense' conclusion is that even though the entire Chinese population may implement the same computational structure as a conscious brain, there are nonetheless no purely qualitative conscious states in this scenario outside the conscious Chinese individuals involved. And this is then taken as a counterexample to purely functionalist theories of consciousness.

Van Gulick's counter-strategy is to claim that the missing qualia argument begs the question at issue. How do we know, a priori, that the very same functional role could be played by arbitrary physical states that were unconscious? The anti-functionalist seems to beg the question by assuming that such deviant realizations are possible in the first place. At this point, the burden of proof may then rest on the functionalist to try and establish that there are in fact functional roles in the human cognitive system that could only be filled by conscious processing states. Indeed, this strategy seems more interesting than the more dogmatic functionalist line that isomorphism of abstract functional role alone guarantees the consciousness of any physical state that happens to implement it. So to pursue this strategy, Van Gulick examines the psychological roles played by phenomenal states in humans and identifies various cognitive abilities which seem to require both conscious and self-conscious awareness, e.g. abilities which involve reflexive and meta-cognitive levels of representation. These include things like planning a future course of action, control of plan execution, and acquiring new non-habitual task behaviours. These and related features of human psychological organization seem to require a conscious self-model. In this manner, conscious experience appears to play a unique functional role in the integration and dissemination of information throughout the brain.

In turn, the proponent of artificial consciousness might plausibly claim that the successful Q3T robot must possess analogous processing structures in order to evince the equivalent behavioural profiles when passing the test. So even though the processing structure might not be identical to that of human cognitive architecture, it must nonetheless have the same basic cognitive abilities as humans in order to pass the Q3T, and if these processing roles in humans require phenomenal states, then the robot must enjoy them as well. However, it is relevant to note that Van Gulick's analysis seems to blur Block's distinction between P-consciousness and A-consciousness, and an obvious rejoinder at this point would be that all of the above processing roles in both humans and robots could in principle take place with only the latter and not the former.
Even meta-cognitive and 'conscious' self-models could be accounted for merely in terms of A-awareness. And this brings us back to the same claim as in the standard 3T scenario - that even the success of the Q3T robot could conceivably be explained without invoking P-consciousness per se, and so it still fails as a sufficient condition for attributing full blown qualia to computational artefacts.

7 MATTER AND CONSCIOUSNESS

Hence functional/computational considerations seem too weak to ground a positive conclusion, and this naturally leads to the question of the physical/physiological status of qualia. If even meta-cognitive and 'conscious' self-models in humans could in principle be accounted for merely in terms of A-awareness, then how and why do humans have purely qualitative experience? One possible answer could be that P-conscious states are essentially physically based phenomena, and hence result from or supervene upon the particular structure and causal powers of the actual central nervous system. And this perspective is reinforced by what I would argue (on the following independent grounds) is the fundamental inability of abstract functional role to provide an adequate theoretical foundation for qualitative experience.

Unlike computational formalisms, conscious states are inherently non-abstract; they are actual, occurrent phenomena extended in physical time. Given multiple realizability as a hallmark of the theory, CTM+ is committed to the result that qualitatively identical conscious states are maintained across widely different kinds of physical realization. And this is tantamount to the claim that an actual, substantive and invariant qualitative phenomenon is preserved over radically diverse real systems, while at the same time, no internal physical regularities need to be preserved. But then there is no actual, occurrent factor which could serve as the causal substrate or supervenience base for the substantive and invariant phenomenon of internal conscious experience. The advocate of CTM+ cannot rejoin that it is formal role which supplies this basis, since formal role is abstract, and such abstract features can only be instantiated via actual properties; they do not have the power to produce them. The only (possible) non-abstract effects that instantiated formalisms are required to preserve must be specified in terms of their input/output profiles, and thus internal experiences, qua actual events, are in principle omitted. So (as I have also argued elsewhere: see Schweizer [16, 17]) it would appear that the non-abstract, occurrent nature of conscious states entails that they must depend upon intrinsic properties of the brain as a proper subsystem of the actual world (on the crucial assumption of physicalism as one's basic metaphysical stance - obviously other choices, such as some variety of dualism, are theoretical alternatives). It is worth noting that from this it does not follow that other types of physical subsystem could not share the relevant intrinsic properties and hence also support conscious states. It only follows that they would have this power in virtue of their intrinsic physical properties, and not in virtue of being interpretable as implementing the same abstract computational procedure.

8 CONCLUSION

We know by direct first person access that the human central nervous system is capable of sustaining the rich and varied field of qualitative presentations associated with our normal cognitive activities.
And it certainly seems as if these presentations play a vital role in our mental lives. However, given the above critical observation regarding Van Gulick's position, viz., that all of the salient processing roles in both humans and robots could in principle take place strictly in terms of A-awareness without P-consciousness, it seems that P-conscious states are not actually necessary for explaining observable human behaviour and the attendant cognitive processes. In this respect, qualia are rendered functionally epiphenomenal, since purely qualitative states per se are not strictly required for a functional/computational account of human mentality. However, this is not to say that they are physically epiphenomenal as well, since it does not thereby follow that this aspect of physical/physiological structure does not in fact play a causal role in the particular human implementation of this functional cognitive architecture. Hence it becomes a purely contingent truth that humans have associated P-conscious experience. And this should not be too surprising a conclusion, on the view that the human mind is the product of a long course of exceedingly happenstance biological evolution. On such a view, perhaps natural selection has simply recruited this available biological resource to play vital functional roles, which in principle could have instead been played by P-unconscious but A-aware states in a different type of realization. And in this case, P-conscious states in humans are thus a form of 'phenomenal overkill', and nature has simply been an opportunist in exploiting biological vehicles that happened to be on hand, to play a role that could have been played by a more streamlined and less rich type of state, but where a 'cheaper' alternative was simply not available at the critical point in time. Evolution and natural selection are severely curtailed in this respect, since the basic ingredients and materials available to work with are a result of random mutation on existing precursor structures present in the organism(s) in question. And perhaps human computer scientists and engineers, not limited by what happens to get thrown up by random genetic mutations, have designed the successful Q3T robot utilizing a cheaper, artificial alternative to the overly rich biological structures sustained in humans. So in the case of the robot, it would remain an open question whether or not the physical substrate underlying the artefact's cognitive processes had the requisite causal powers or intrinsic natural characteristics to sustain P-conscious states. Mere behavioural evidence on its own would not be sufficient to adjudicate, and an independent standard or criterion would be required.4

So if P-conscious states are thought to be essentially physically based, for the reasons given above, and if the robot's Q3T success could in principle be explained through appeal to mere A-aware states on their own, then it follows that the non-identity of the artefact's physical structure would allow one to consistently extend Turing's polite convention to one's conspecifics and yet withhold it from the Q3T robot.
4 This highlights one of the intrinsic limitations of the Turing test approach to such questions, since the test is designed as an imitation game, and humans are the ersatz target. Hence the Q3T robot is designed to behave as if it had subjective, qualitative inner experiences indistinguishable from those of a human. However, if human qualia are the products of our particular internal structure (either physical-physiological or functional-computational), and if the robot is significantly different in this respect, then the possibility is open that the robot might be P-conscious and yet fail the test, simply because its resulting qualitative experiences are significantly different from ours. And indeed, a possibility in the reverse direction is that the robot might even pass the test and sustain an entirely different phenomenology, but where this internal difference is not manifested in its external behaviour.

REFERENCES
[1] A. Turing, 'Computing machinery and intelligence', Mind 59: 433-460, (1950).
[2] N. Block, 'Psychologism and behaviorism', Philosophical Review 90: 5-43, (1981).
[3] R. French, 'The Turing test: the first 50 years', Trends in Cognitive Sciences 4: 115-122, (2000).
[4] N. Block, 'On a confusion about a function of consciousness', Behavioral and Brain Sciences 18: 227-247, (1995).
[5] B. Baars, A Cognitive Theory of Consciousness, Cambridge University Press, (1988).
[6] S. Shieber, 'The Turing test as interactive proof', Nous 41: 33-60, (2007).
[7] S. Harnad, 'Other bodies, other minds: a machine incarnation of an old philosophical problem', Minds and Machines 1: 43-54, (1991).
[8] P. Schweizer, 'The externalist foundations of a truly total Turing test', Minds and Machines, DOI 10.1007/s11023-012-9272-4, (2012).
[9] J. Searle, The Rediscovery of the Mind, MIT Press, (1992).
[10] T. Burge, 'Two kinds of consciousness', in N. Block et al. (eds), The Nature of Consciousness: Philosophical Debates, MIT Press, (1997).
[11] A. Plebe and P. Perconti, 'Qualia Turing test: Designing a test for the phenomenal mind', in Proceedings of the First International Symposium Towards a Comprehensive Intelligence Test (TCIT), Reconsidering the Turing Test for the 21st Century, 16-19, (2010).
[12] D. Dennett, Consciousness Explained, Back Bay Books, (1992).
[13] W. G. Lycan, Consciousness, MIT Press, (1987).
[14] R. Van Gulick, 'Understanding the phenomenal mind: Are we all just armadillos?', in M. Davies and G. Humphreys (eds), Consciousness: Psychological and Philosophical Essays, Blackwell, (1993).
[15] N. Block, 'Troubles with functionalism', in C. W. Savage (ed), Perception and Cognition, University of Minnesota Press, (1978).
[16] P. Schweizer, Minds and Machines 12: 143-144, (2002).
[17] P. Schweizer, 'Physical instantiation and the propositional attitudes', Cognitive Computation, DOI 10.1007/s12559-012-9134-7, (2012).

Jazz and Machine Consciousness: Towards a New Turing Test

Antonio Chella1 and Riccardo Manzotti2
1 University of Palermo, Italy, email: antonio.chella@unipa.it
2 IULM University, Milan, Italy, email: riccardo.manzotti@iulm.it

Abstract. A form of Turing test is proposed, based on the capability of an agent to produce jazz improvisations at the same level as an expert jazz musician.

1 INTRODUCTION

The 'Essay in the Style of Douglas Hofstadter' [19], related to the system EMI by David Cope [11] [12], evokes a novel and different perspective on the Turing test. The main focus of the test should be creativity instead of linguistic capabilities: can a computer be so creative that its creations are indistinguishable from those of a human being? According to Sternberg [36], creativity is the ability to produce something that is new and appropriate. The result of a creative process is not reducible to some sort of deterministic reasoning; no specific chain of processing steps seems to account for creative activity, which appears rather as an emergent, holistic result [25].
Therefore, a creative agent should be able to generate novel artifacts not by following preprogrammed instructions, but by means of a real creative act. The problem of creativity has been widely debated in the field of automatic music composition. The previously cited EMI by David Cope, subject of the Hofstadter essay, produces impressive results: even for an experienced listener it is difficult to distinguish musical compositions created by such programs from those created by a human composer. There is no doubt that these systems capture some main aspects of the creative process, at least in music. However, one may wonder whether an agent can actually be creative without being conscious. In this regard, Damasio [14] suggests a close connection between consciousness and creativity. Cope himself in his recent book [13] discusses the relationship between consciousness and creativity. Although he does not take a clear position on this matter, he seems to favor the view according to which consciousness is not necessary for the creative process. In fact, Cope asks whether a creative agent needs to be aware of creating something, and whether it needs to experience the results of its own creations.

The argument from consciousness is typically adopted [3] to support the thesis that an artificial agent can never be conscious and therefore can never be really creative. But recently, there has been a growing interest in machine consciousness [8] [9], i.e., the study of consciousness through the design and implementation of conscious artificial systems. This interest is motivated by the belief that this new approach, based on the construction of conscious artifacts, can shed new light on the many critical aspects that affect the mainstream studies of consciousness in philosophy and neuroscience. Creativity is just one of these critical issues.

The relationship between consciousness and creativity is difficult and complex. On the one side, some authors claim the need for awareness of the creative act. On the other side, it is suspected that many cognitive processes that are necessary for the creative act may happen in the absence of consciousness. However, it is undeniable that consciousness is closely linked with the broader, less predictable and less automatic forms of cognition, like creativity. In addition, we could distinguish between the mere production of new combinations and the aware creation of new content: if the wind were to create (like monkeys on a keyboard) a melody indistinguishable from the "Va Pensiero" by Giuseppe Verdi, would it be a creative act? Many authors would debate this question [15].

In the following, we discuss some of the main features required for a conscious agent, such as embodiment, situatedness, emotions and the capability to have conscious experience. These features will be discussed with reference to musical expression, and in particular to a specific form of creative musical expression, namely jazz improvisation. Musical expression seems to be the form of artistic expression that, more than most others, is able to immediately produce conscious experience without filters. Moreover, differently from olfactory or tactile experiences, musical experience is a kind of structured experience. According to Johnson-Laird [20], jazz improvisation is a specific form of expertise of great interest for the study of the mind.
Furthermore, jazz is a particularly interesting case study in relation to creativity. Creativity in a jazz musician is very different from typical models of creativity. In fact, the creative process is often studied with regard to the production of new abstract ideas, as for example the creation of a new mathematical theory after weeks of great concentration. On the contrary, jazz improvisation is a form of immediate and continuous lively creation, closely connected with the external world made up of musical instruments, people, moving bodies, environments, audience and the other musicians.

2 CREATIVITY

There are at least two aspects of creativity that are worth distinguishing from the beginning: syntactic and semantic creativity. The first is the capability to recombine a set of symbols according to various styles. In this sense, if we have enough patience and time, a random generator will create all the books of the literary world (but without understanding their meaning). The second aspect is the capability to generate new meaning that will then be dressed in appropriate symbols. These two aspects correspond, to a good approximation, to the etymological difference between the terms intelligence and intuition. Intelligence is often defined as the ability to find novel connections between different entities, but intuition should be able to do something more, i.e., to bring in something that was previously unavailable. In short, the syntactic manipulation of symbols may occur without consciousness, but creativity does not seem to be possible without consciousness.

Machine consciousness is not only a technological challenge, but a novel field of research with both scientific and technological issues, such as the relationship between information and meaning, the ability for an autonomous agent to choose its own goals and objectives, the sense of self for a robot, the capability to integrate information into a coherent whole, and the nature of experience. Among these issues there is the capability, for an artificial agent, to create and to experience its own creations.

A common objection to machine consciousness emphasizes the fact that biological entities may have unique characteristics that cannot be reproduced in artifacts. If this objection is true, machine consciousness may not be feasible. However, this contrast between biological and artificial entities has often been over exaggerated, especially in relation to the problems of consciousness. So far, nobody has been able to satisfactorily prove that biological entities have characteristics that cannot be reproduced in artificial entities with respect to consciousness. In fact, at a meeting on machine consciousness in 2001 at Cold Spring Harbor Laboratories, the conclusion from Koch [23] was that no known natural law prevents the existence of subjective experience in artifacts. On the other hand, living beings are subject to the laws of physics, and yet are conscious, able to be creative and to have experience.

The contrast between classic AI (focused on the manipulation of syntactic symbols) and machine consciousness (open to considering the semantic and phenomenal aspects of the mind) holds in all its strength in the case of creativity. Is artistic improvisation - jazz improvisation in particular - a conscious process? This is an open question.
The musicologist Gunther Schuller [33] emphasizes the fact that jazz improvisation affects consciousness at all levels, from the minimal to the highest one. It is a very particular kind of creative process. Jazz improvisation has peculiar features that set it apart from traditional classical improvisation [29]: as part of Western classical music, improvisation is a kind of real time composition with the same rules and patterns as classical composition. On the contrary, jazz improvisation is based on a specific set of patterns and elements. The melody, the rhythm (the swing), the chord progressions are some of the issues that need to be analyzed and studied with stylistic and aesthetic criteria different from those of Western classical music [10].

3 EMBODIMENT

Embodiment does not simply mean that an agent must have a physical body, but also, and above all, that different cognitive functions are carried out by means of aspects of the body. The aspect of corporeality seems to be fundamental to musical performance, and not only to jazz improvisation. In this regard, Sundberg & Verrillo [38] analyzed the complex feedback that the body of a player receives during a live performance. In fact, auditory feedback is not sufficient to explain the characteristics of a performance. The movement of the hands on the instrument, the touch and the force needed to play the instrument, the vibrations of the instrument propagated through the fingers of the player, the vibration of the air perceived by the player's body - these are all examples of feedback guiding the musician during a performance. The player receives at least two types of bodily feedback: through the receptors of the skin and through the receptors of the tendons and muscles. Todd [39] posited a third feedback channel through the vestibular apparatus. Making music is essentially a body activity [26].

Embodiment is fundamental to jazz improvisation: can an agent without a body, such as software like EMI that runs on a mainframe, be able to improvise? Apparently not, because it would miss the bodily feedback channels described above. And, in fact, the results obtained by EMI in its Improvisation version are modest and based on ad hoc solutions. The same problem arises for consciousness: can software that runs on a mainframe be conscious? It does not seem that embodiment is a sufficient condition for consciousness, but it may be a necessary condition. Basically, a cognitive entity must be embodied in a physical entity. However, it is necessary to reflect deeply on the concept of embodiment. Trivially, a cognitive agent cannot exist without a body; even AI expert systems are embodied in a computer, which is a physical entity. On the other hand, having a body is not enough for an agent to be non-trivially embodied: the Honda ASIMO robot (http://asimo.honda.com), considered the state of the art of today's robotic technology, is an impressive humanoid robot, but its performances are essentially based on a standard controller in which the behaviors are almost completely and carefully defined in advance by its designers. In addition, biology gives us many examples of animals, such as cockroaches, whose morphology is complex and allows them to survive without sophisticated cognitive abilities. The notion of embodiment is therefore much deeper and more complex than we usually think. It is not enough that an agent have a body equipped with sophisticated sensors and actuators; other conditions must be met.
The concept of embodiment requires the ability to appreciate and process the different kinds of feedback from the body, just like an artist during a live musical performance.

4 SITUATEDNESS

In addition to having a body, an agent is part of an environment, i.e., it is situated. An artist, during a jam session, is typically situated in a group in which she has a continuous exchange of information. The artist receives and provides continuous feedback to and from the other players of the group, and sometimes even the audience, in the case of live performances. The classical view, often theorized in textbooks of jazz improvisation [10], suggests that during a session the player follows her own musical path, largely made up of a suitable sequence of previously learned patterns. This is a partial view of effective jazz improvisation. Undoubtedly, the musician has a repertoire of musical patterns, but she is also able to deviate from this path depending on the feedback she receives from other musicians or the audience, for example from suggestions from the rhythm section or signals of appreciation from the listeners.

Cognitive scientists (see, e.g., [20]) typically model jazz improvisation processes by means of Chomsky formal grammars. This kind of model appears problematic because it does not explain the complexity of the interaction between the player, the rest of the group and the audience. A more accurate model should take into account the main results from behavior-based robotics [5]. According to this approach, a musician may use a repertoire of behaviors that are activated according to the input she receives and according to an appropriate priority based on her musical sensibility; a minimal sketch of such an arbitration scheme is given below. Interesting experiments in this direction have recently been described in the literature. Roboser [27] is an autonomous robot that moves autonomously in an environment and generates sound events in real time according to its internal state and to the sensory input it receives from the environment. EyesWeb [6] is a complex system that analyzes body movements and gestures, with particular reference to emotional connotations, in order to generate sound and music in real time accordingly, and also to suitably control robots. Continuator [28] is a system based on a methodology similar to EMI but, differently from it, able to learn and communicate in real time with the musician. For example, the musician plays musical phrases and the system learns the style of the musician and continues and completes the phrases by interacting with the musician.

However, the concept of a situated agent, like the concept of embodiment, is a complex and articulated one. An effective situated agent should develop in tight integration with its surrounding environment so that, like a living being, its body structure and cognition are the result of a continuous and constant interaction with the external environment. A true situated agent is an agent that absorbs from its surroundings, changes according to it and, in turn, changes the environment itself. A similar process occurs in the course of jazz improvisation: the musicians improvise on the basis of their musical and life experiences accumulated and absorbed over the years. The improvisation is then based on present interaction and also, in the case of a jazz group, on past interactions with the rest of the group.
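The behavior-based picture just described can be made concrete. The following toy Python sketch shows priority-based arbitration over a repertoire of improvising behaviors; the behavior names, trigger predicates and priority rule are all hypothetical illustrations, not drawn from [5] or from any of the systems cited above.

    # Toy priority-based arbitration over a repertoire of behaviors,
    # echoing arbitration schemes from behavior-based robotics.
    # All behaviors and triggers are hypothetical illustrations.

    def pattern_behavior(ctx):  return "play next phrase from learned patterns"
    def respond_behavior(ctx):  return "imitate and vary the rhythm section's figure"
    def crowd_behavior(ctx):    return "build intensity for the audience"

    # (priority, trigger predicate, behavior) - higher priority wins.
    REPERTOIRE = [
        (1, lambda ctx: True,                                 pattern_behavior),
        (2, lambda ctx: ctx.get("band_cue", False),           respond_behavior),
        (3, lambda ctx: ctx.get("audience_energy", 0) > 0.8,  crowd_behavior),
    ]

    def improvise_step(ctx):
        # Select the highest-priority behavior whose trigger fires; the
        # default pattern-following behavior always fires as a fallback.
        active = [(p, b) for p, t, b in REPERTOIRE if t(ctx)]
        _, behavior = max(active, key=lambda pb: pb[0])
        return behavior(ctx)

    print(improvise_step({"band_cue": True, "audience_energy": 0.9}))

The design choice worth noting is that the learned pattern repertoire is only the lowest-priority default, while feedback from the band and the audience can pre-empt it at any step, which is precisely the corrective the authors propose to the grammar-based view.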
Improvisation is modified on the basis of suggestions received from other musicians and the audience, and in turn changes the performances of the other group musicians. A good jazz improvisation is an activity that requires a deeply situated agent.

5 EMOTIONS

Many scholars consider emotions as a basic element for consciousness. Damasio [14] believes that emotions form a sort of proto-consciousness upon which higher forms of consciousness are developed. In turn, consciousness, according to this frame of reference, is intimately related with creativity. The relationships between emotions and music have been widely analyzed in the literature, suggesting a variety of computational models describing the main mechanisms underlying the evocation of emotions while listening to music [21] [22].
In the case of a live performance as a jazz improvisation, the link between music and emotions is a deep one: during a Finally, the more complex problem for consciousness is: how can a physical system like an agent able to improvise jazz to produce something similar to our subjective experience? During a jam session, the sound waves generated by the musical instruments strike our ears and we experience a sax solo accompanied by bass, drums and piano. At sunset, our retinas are struck by rays of light and we have the experience of a symphony of colors. We swallow molecules of various kinds and, therefore, we feel the taste of a delicious wine. It is well known that Galileo Galilei suggested that smells, tastes, colors and sounds do not exist outside the body of a conscious subject (the living animal). Thus experience would be created by the subject in some unknown way. AISB/IACAP 2012 Symposium: Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World 51 A possible hypothesis concerns the separation between the domain of experience, namely, the subjective content, and the domain of objective physical events. The claim is that physical reality can be adequately described only by the quantitative point of view in a third person perspective while ignoring any qualitative aspects. After all, in a physics textbook there are many mathematical equations that describe a purely quantitative reality. There is room for quality content, feelings or emotions. Explaining these qualitative contents is the hard problem of consciousness [7]. Yet scholars as Strawson [37] questioned the validity of such a distinction as well as the degree of real understanding of the nature of the physical world. Whether the mental world is a special construct generated by some feature of the nervous systems of mammals, is still an open question. It is fair to stress that there is neither empirical evidence nor theoretical arguments supporting such a view. In the lack of a better theory, we could also take into consideration the idea inspired by externalism [31] [32] according to which the physical world comprehends also those features that we usually attribute to the mental domain. A physicalist must be held that if something is real, and we assume consciousness is real, it has to be physical. Hence, in principle, a device can envisage it. In the case of artificial agents for jazz improvisation, how is it possible to overcome the distinction between function and experience? Such a typical agent is made up by a set of interconnected modules, each operating in a certain way. How the operation of some or all of the interconnected modules should generate conscious experience? However, the same question could be transferred to the activity of neurons. Each neuron, taken alone, does not work differently from a software module or a chip. But it could remains a possibility: it is not the problem of the physical world, but of our theories of the physical world. Artificial agents are part of the same physical world that produce consciousness in human subjects, so they may exploit the same properties and characteristics that are relevant for conscious experience. In this regard, Tononi [41] proposed a theory supported by results from neuroscience, according to which the degree of conscious experience is related to the amount of integrated information. 
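To convey the informational flavor of "integration" (and only that), here is a toy computation of total correlation: the amount by which the joint state of a system carries less uncertainty than its parts taken separately. This simplified proxy is our illustrative choice; Tononi's actual measure Phi involves causal perturbations and a minimization over bipartitions of the system [1] [2]:

```python
# Toy illustration: "the whole carries more information than its parts."
# Total correlation TC = H(X1) + H(X2) - H(X1, X2) for a two-unit system.
from math import log2

def H(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    out = {}
    for state, p in joint.items():
        out[state[axis]] = out.get(state[axis], 0.0) + p
    return out

# Two binary units that tend to agree (an "integrated" pair).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
tc = H(marginal(joint, 0)) + H(marginal(joint, 1)) - H(joint)
print(f"total correlation = {tc:.3f} bits")  # positive: the units cohere

# For independent units the same quantity vanishes.
indep = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
print(H(marginal(indep, 0)) + H(marginal(indep, 1)) - H(indep))  # 0.0
```

For the correlated pair the quantity is positive, while for independent units it vanishes: informally, integration is what the whole "knows" beyond its parts.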
According to this framework, the primary task of the brain is to integrate information and, notably, this process is the same whether it takes place in humans or in artifacts such as agents for jazz improvisation. According to this theory, conscious experience has two main characteristics. On the one hand, conscious experience is differentiated, because the potential set of different conscious states is huge. On the other hand, conscious experience is integrated: in fact, a conscious state is experienced as a single entity. Therefore, the substrate of conscious experience must be an integrated entity able to differentiate among a big set of different states, and whose informational state is greater than the sum of the informational states of its component sub-entities [1] [2].

Building on this theory, Koch and Tononi [24] propose a potential new Turing test based on the integration of information: artificial systems should be able to mimic the human being not in language skills (as in the classic version of the Turing test), but rather in the ability to integrate information from different sources. Therefore, an artificial agent aware of its jazz improvisation should be able to integrate over time the information generated by its own played instrument and by the instruments of its band, as well as information from the body, i.e., the feedback from skin receptors, the receptors of the tendons and muscles and, possibly, the vestibular apparatus. Furthermore, it should also be able to integrate information related to emotions. Some of the early studies based on suitable neural networks for music generation [40] are promising steps towards implementing an information-integration agent.

However, we must emphasize that the implementation of a true information-integration system is a real technological challenge. In fact, the typical engineering technique for building an artifact is essentially based on the principle of divide et impera, which involves designing a complex system through its decomposition into smaller, easier subsystems. Each subsystem then communicates with the other subsystems through well-defined interfaces, so that the interaction between the subsystems happens in a very controlled way. Tononi's theory instead requires maximum interaction between the subsystems in order to allow effective integration. Therefore, new techniques are required to design effective conscious agents.

Information integration theory has raised heated debates in the scientific community. It could represent a first step towards a theoretically well-founded approach to machine consciousness. The idea of being able to find the consciousness equations which, like Maxwell's equations in physics, are able to explain consciousness in living beings and in artifacts, would be a kind of ultimate goal for scholars of consciousness.

7 CONCLUSIONS

The list of problems related to machine consciousness that have not been properly treated is long: the sensorimotor experience in improvisation, the sense of time in musical performance, the problem of the meaning of a musical phrase, the generation of musical mental images, and so on. These are all issues of great importance for the creation of a conscious agent for jazz improvisation, although some of them may overlap in part with the arguments discussed above. Although classical AI has achieved impressive results, and Cope's program EMI is a great example, so far these issues have been addressed only partially.
In this article we have discussed the main issues to be addressed in order to design and build an artificial agent that can perform a jazz improvisation. Physicality, situatedness, the capacity for emotions, and some form of experience are all problems inherent in the problem of consciousness. A new Turing test might be based on the ability to distinguish a jazz improvisation produced by an artificial agent, perhaps one able to integrate information in Tononi's sense, from an improvisation produced by an expert jazz musician. As should be clear, this is a very broad subject that significantly extends the traditional mind-brain problem. Machine consciousness is, at the same time, a theoretical and a technological challenge that forces us to deal with old problems through new and innovative approaches. It is possible, and we hope, that research on artificial consciousness will push researchers to re-examine many threads left hanging in Artificial Intelligence and cognitive science. "Could consciousness be a theoretical time bomb, ticking away in the belly of AI? Who can say?" (Haugeland [18], p. 247).

REFERENCES

[1] D. Balduzzi and G. Tononi, 'Integrated information in discrete dynamical systems: Motivation and theoretical framework', PLoS Computational Biology, 4, e1000091, (2008).
[2] D. Balduzzi and G. Tononi, 'Qualia: The geometry of integrated information', PLoS Computational Biology, 5, e1000462, (2009).
[3] M. Boden, The Creative Mind: Myths and Mechanisms, Second Edition, Routledge, London, 2004.
[4] C. Breazeal, Designing Sociable Robots, MIT Press, Cambridge, MA, 2002.
[5] R. Brooks, Cambrian Intelligence: The Early History of the New AI, MIT Press, Cambridge, MA, 1999.
[6] A. Camurri, S. Hashimoto, M. Ricchetti, A. Ricci, K. Suzuki, R. Trocca and G. Volpe, 'EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems', Computer Music Journal, 24, 57 – 69, (2000).
[7] D. Chalmers, The Conscious Mind: In Search of a Fundamental Theory, Oxford University Press, Oxford, 1996.
[8] A. Chella and R. Manzotti (eds.), Artificial Consciousness, Imprint Academic, Exeter, UK, 2007.
[9] A. Chella and R. Manzotti, 'Machine Consciousness: A Manifesto for Robotics', International Journal of Machine Consciousness, 1, 33 – 51, (2009).
[10] J. Coker, Improvising Jazz, Simon & Schuster, New York, NY, 1964.
[11] D. Cope, 'Computer Modeling of Musical Intelligence in EMI', Computer Music Journal, 16, 69 – 83, (1992).
[12] D. Cope, Virtual Music, MIT Press, Cambridge, MA, 2001.
[13] D. Cope, Computer Models of Musical Creativity, MIT Press, Cambridge, MA, 2005.
[14] A. Damasio, The Feeling of What Happens: Body and Emotion in the Making of Consciousness, Houghton Mifflin Harcourt, 1999.
[15] A. Danto, 'The Transfiguration of the Commonplace', The Journal of Aesthetics and Art Criticism, 33, 139 – 148, (1974).
[16] J.-M. Fellous and M. A. Arbib, Who Needs Emotions?: The Brain Meets the Robot, Oxford University Press, Oxford, UK, 2005.
[17] A. Gabrielsson and P.N. Juslin, 'Emotional Expression in Music Performance: Between the Performer's Intention and the Listener's Experience', Psychology of Music, 24, 68 – 91, (1996).
[18] J. Haugeland, Artificial Intelligence: The Very Idea, MIT Press, Bradford Books, Cambridge, MA, 1985.
[19] D. Hofstadter, 'Essay in the Style of Douglas Hofstadter', AI Magazine, Fall, 82 – 88, (2009).
[20] P.N. Johnson-Laird, 'Jazz Improvisation: A Theory at the Computational Level', in: Representing Musical Structure, P. Howell, R. West and I. Cross (eds.), Academic Press, London, 1991.
[21] P.N. Juslin and J.A. Sloboda (eds.), Handbook of Music and Emotion: Theory, Research, Applications, Oxford University Press, Oxford, UK, 2010.
[22] P.N. Juslin and D. Västfjäll, 'Emotional responses to music: The need to consider underlying mechanisms', Behavioral and Brain Sciences, 31, 559 – 621, (2008).
[23] C. Koch, 'Final Report of the Workshop Can a Machine be Conscious', The Banbury Center, Cold Spring Harbor Laboratory, http://theswartzfoundation.com/abstracts/2001_summary.asp (last access 12/09/2011).
[24] C. Koch and G. Tononi, 'Can Machines Be Conscious?', IEEE Spectrum, June, 47 – 51, (2008).
[25] A. Koestler, The Act of Creation, Hutchinson, London, 1964.
[26] J.W. Krueger, 'Enacting Musical Experience', Journal of Consciousness Studies, 16, 98 – 123, (2009).
[27] J. Manzolli and P.F.M.J. Verschure, 'Roboser: A Real-World Composition System', Computer Music Journal, 29, 55 – 74, (2005).
[28] F. Pachet, 'Beyond the Cybernetic Jam Fantasy: The Continuator', IEEE Computer Graphics and Applications, January/February, 2 – 6, (2004).
[29] J. Pressing, 'Improvisation: Methods and Models', in: Generative Processes in Music: The Psychology of Performance, Improvisation, and Composition, J. Sloboda (ed.), Oxford University Press, Oxford, UK, 1988.
[30] P. Robbins and M. Aydede (eds.), The Cambridge Handbook of Situated Cognition, Cambridge University Press, Cambridge, 2009.
[31] T. Rockwell, Neither Brain nor Ghost, MIT Press, Cambridge, MA, 2005.
[32] M. Rowlands, Externalism: Putting Mind and World Back Together Again, McGill-Queen's University Press, Montreal and Kingston, 2003.
[33] G. Schuller, 'Foreword', in: Improvising Jazz, J. Coker, Simon & Schuster, New York, NY, 1964.
[34] J.R. Searle, 'Minds, brains, and programs', Behavioral and Brain Sciences, 3, 417 – 457, (1980).
[35] A. Seth, 'The Strength of Weak Artificial Consciousness', International Journal of Machine Consciousness, 1, 71 – 82, (2009).
[36] R.J. Sternberg (ed.), Handbook of Creativity, Cambridge University Press, Cambridge, 1999.
[37] G. Strawson, 'Does physicalism entail panpsychism?', Journal of Consciousness Studies, 13, 3 – 31, (2006).
[38] J. Sundberg and R.T. Verrillo, 'Somatosensory Feedback in Musical Performance' (Editorial), Music Perception: An Interdisciplinary Journal, 9, 277 – 280, (1992).
[39] N.P. McAngus Todd, 'Vestibular Feedback in Musical Performance: Response to «Somatosensory Feedback in Musical Performance»', Music Perception: An Interdisciplinary Journal, 10, 379 – 382, (1993).
[40] P.M. Todd and D. Gareth Loy (eds.), Music and Connectionism, MIT Press, Cambridge, MA, 1991.
[41] G. Tononi, 'An Information Integration Theory of Consciousness', BMC Neuroscience, 5, (2004).

Taking Turing Seriously (But Not Literally)

William York (Indiana University, United States, wwyork@indiana.edu) and Jerry Swan (University of Stirling, Scotland, jsw@cs.stir.ac.uk)

Abstract. Results from present-day instantiations of the Turing test, most notably the annual Loebner Prize competition, have fueled the perception that the test is on the verge of being passed. With this perception comes the misleading implication that computers are nearing human-level intelligence. As currently instantiated, the test encourages an adversarial relationship between contestant and judge.
We suggest that the underlying purpose of Turing's test would be better served if the prevailing focus on trickery and deception were replaced by an emphasis on transparency and collaborative interaction. We discuss particular examples from the family of Fluid Concepts architectures, primarily Copycat and Metacat, showing how a modified version of the Turing test (described here as a "modified Feigenbaum test") has served as a useful means for evaluating cognitive-modeling research and how it can suggest future directions for such work.

1 INTRODUCTION: THE TURING TEST IN LETTER AND SPIRIT

The method of "postulating" what we want has many advantages; they are the same as the advantages of theft over honest toil.
– Bertrand Russell, Introduction to Mathematical Philosophy

Interrogator: Yet Christmas is a Winter's day, and I do not think Mr. Pickwick would mind the comparison.
Respondent: LOL
– Pace Alan Turing, "Computing Machinery and Intelligence"

If Alan Turing were alive today, what would he think about the Turing test? Would he still consider his imitation game to be an effective means of gauging machine intelligence, given what we now know about the Eliza effect, chatbots, and the increasingly vacuous nature of interpersonal communication in the age of texting and instant messaging? One can only speculate, but we suspect he would find current instantiations of his eponymous test, most notably the annual Loebner Prize competition, to be disappointingly literal-minded.

Before going further, it will help to recall Turing's famous prediction about the test from 1950:

I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning ([22], p. 442).

The Loebner Prize competition adheres closely to the outward form—or letter—of this imitation game, right down to the five-minute interaction period and (at least for the ultimate Grand Prize) the 70-percent threshold. (Of course, the year 2000 came and went without this prediction coming to pass, but that is not at issue here.) However, it is questionable how faithful the competition is to the underlying purpose—or spirit—of the game, which is, after all, to assess whether a given program or artifact should be deemed intelligent, at least relative to human beings (see [5] for more discussion of the distinction between human-like intelligence and other forms of intelligence in relation to the Turing test). More generally, we might say that the broader purpose of the test is to assess progress in AI, or at least that subset of AI that is concerned with modeling human intelligence. Alas, this purpose gets obscured when the emphasis turns from pursuing this long-term goal to simply "beating the test." Perhaps this shift in emphasis is an inevitable consequence of using a behavioral test: "If we don't want that," one might argue, "then let us have another test." Indeed, suggestions have been offered for modifying the Turing test (cf. [6], [7], [3]), but we still see value in the basic idea behind the test—that of using observable "behavior" to infer underlying mechanisms and processes.

1.1 Priorities and payoffs

The letter–spirit distinction comes down to a question of research priorities, of short-term versus long-term payoffs. In the short term, the emphasis on beating the test has brought programs close to "passing the Turing test" in its Loebner Prize instantiation.
Brian Christian, who participated in the 2009 competition as a confederate (i.e., one of the humans the contestant programs are judged against) and described the experience in his recent book The Most Human Human, admitted to a sense of urgency upon learning that "at the 2008 contest..., the top program came up shy of [passing] by just a single vote" ([1], p. 4). Yet in delving deeper into the subject, Christian realized the superficiality—the (near) triumph of "pure technique"—that was responsible for much of this success. But it is not clear that the Loebner Prize has steered researchers toward any sizable long-term payoffs in understanding human intelligence. After witnessing the first Loebner Prize competition in 1991, Stuart Shieber [20] concluded, "What is needed is not more work on solving the Turing Test, as promoted by Loebner, but more work on the basic issues involved in understanding intelligent behavior. The parlor games can be saved for later" (p. 77). This conclusion seems as valid today as it was two decades ago.

1.2 Communication, transparency, and the Turing test

The question, then, is whether we might better capture the spirit of Turing's test through other, less literal-minded means. Our answer is not only that we can, but that we must. The alternative is to risk trivializing the test by equating "intelligence" with the ability to mimic the sort of context-neutral conversation that has increasingly come to pass for "communication." Christian points out that "the Turing test is, at bottom, about the act of communication" ([1], p. 13). Yet given the two-way nature of communication, it can be hard to disentangle progress in one area (AI) from deterioration in others. As Jaron Lanier recently put it,

You can't tell if a machine has gotten smarter or if you've just lowered your standards of intelligence to such a degree that the machine seems smart. If you can have a conversation with a simulated person presented by an AI program, can you tell how far you've let your sense of personhood degrade in order to make the illusion work for you? ([13], p. 32).

In short, the Turing test's reliance on purely verbal behavior renders it susceptible to tricks and illusions that its creator could not have reasonably anticipated. Methodologies such as statistical machine learning, while valuable as computational and engineering tools, are nonetheless better suited to modeling human banality than they are to modeling human intelligence. Additionally, the test, as currently instantiated, encourages an adversarial approach between contestant and judge that does as much to obscure and inflate progress in AI as it does to provide an accurate measuring stick. It is our contention that a test that better meets Turing's original intent should instead be driven by the joint aims of collaboration and transparency.

2 INTELLIGENCE, TRICKERY, AND THE LOEBNER PRIZE

Does deception presuppose intelligence on the part of the deceiver? In proposing his imitation game, Turing wagered—at least implicitly—that the two were inseparable. Surely, a certain amount of cunning and intelligence is required on the part of humans who excel at deceiving others.
The flip side of the coin is that a degree of gullibility is required on the part of the person(s) being deceived. Things get more complicated when the deception is "perpetrated" by a technological artifact as opposed to a willfully deceptive human. To quote Shieber once again, "[I]t has been known since Weizenbaum's surprising experiences with ELIZA that a test based on fooling people is confoundingly simple to pass" (p. 72; cf. [24]). The gist of Weizenbaum's realization is that our interactions with computer programs often tell us less about the inner workings of the programs themselves than they do about our tendency to project meaning and intention onto artifacts, even when we know we should know better.

2.1 The parallel case of art forgery

For another perspective on the distinction between genuine accomplishment and mere trickery, let us consider the parallel case of art forgery. Is it possible to distinguish between a genuine artist and a mere faker? It is tempting to reply that in order to be a good faker—one good enough to fool the experts—one must necessarily be a good artist to begin with. But this sort of argument is too simplistic, as it equates artistry with technical skill and prowess, meanwhile ignoring originality, artistic vision, and other qualities that are essential to genuine artistry (cf. [14], [2]). In particular, the ability of a skilled art forger to create a series of works in the style of, say, Matisse does not necessarily imply insight into the underlying artistic or expressive vision of Matisse—the vision responsible for giving rise to those works in the first place. As philosopher Matthew Kieran succinctly puts it, "There is all the difference in the world between a painting that genuinely reveals qualities of mind to us and one which blindly apes their outward show" ([11], p. 21).

Russell's famous quote about postulation equating to theft helps us relate an AI methodology to the artistry–forgery distinction. Russell's statement can be paraphrased as follows: merely saying that there exists a function (e.g., sqrt()) with some property (e.g., sqrt(x)*sqrt(x) = x for all x >= 0) does not tell us very much about how to generate the actual sqrt() function. Similarly, the ability to reproduce a small number of values of x that meet this specification does not imply insight into the underlying mechanisms involved, relative to which the existence of these specific values is essentially a side effect. A key issue here is the small number of values: since contemporary versions of the Turing test are generally highly time-constrained, it is even more imperative that the test involve a deep probe into the possible behaviors of the respondent.

2.2 Thematic variability in art and in computation

Many of the Loebner Prize entrants (e.g., [23]) have adopted the methodologies of corpus linguistics and machine learning, so let us reframe the issue of thematic variability in these terms. We might abstractly consider the statistical machine-learning approach to the Turing test as being concerned with the induction of a generative grammar. In short, the ability to induce an algorithm that reproduces some themed collection of original works does not in itself imply that any underlying sensibilities that motivated those works can be effectively approximated by that algorithm. One way of measuring the "work capacity" of an algorithm is to employ the Kolmogorov complexity measure [21], which is essentially the size of the shortest possible functionally identical algorithm.
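For orientation (the text uses the notion informally), the standard definition can be stated in a line: relative to a fixed universal machine $U$,

$$K_U(x) \;=\; \min \{\, |p| \;:\; U(p) = x \,\}$$

that is, the length of the shortest program that makes $U$ output $x$. By the invariance theorem the choice of $U$ matters only up to an additive constant, and $K$ is not computable in general, so in practice it can only be upper-bounded or approximated.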
In the induction case, algorithms with the lowest Kolmogorov complexity will tend to be those that exhibit very little variability—in the limiting case, generating only instances from the original collection. This would be analogous to a forger who could only produce exact copies of another artist's works, rather than works "in the style of" said artist—the latter being the stock-in-trade of infamous art forgers Han van Meegeren [25] and Elmyr de Hory [10]. In contrast, programs from the family of Fluid Concepts architectures (see 4.1 below) possess relational and generative models that are domain-specific. For example, the Letter Spirit architecture [19] is specifically concerned with exploring the thematic variability of a given font style. Given Letter Spirit's (relatively) sophisticated representation of the "basis elements" and "recombination mechanisms" of form, it might reasonably be expected to have a high Kolmogorov complexity. The thematic variations generated by Letter Spirit are therefore not easily approximated by domain-agnostic data-mining approaches.

2.3 Depth, shallowness, and the Turing test

The artistry–forgery distinction is useful insofar as it offers another perspective on the issue of depth versus shallowness—an issue that is crucial in any analysis of the Turing test. Just as the skilled art forger is adept at using trickery to simulate "authenticity"—for example, by artificially aging a painting through various techniques such as baking or varnishing ([10], [25])—analogous forms of trickery tend to find their way into the Loebner Prize competition: timely pop-culture references, intentional typos and misspellings, strategic changes of subject, and so on (cf. [20], [1]). Yet these surface-level tricks have as much to do with the genuine modeling of intelligence as coating the surface of a painting with antique varnish has to do with bona fide artistry. Much like the art forger's relationship with the art world, the relationship between contestant programs and judges in the Loebner Prize is essentially adversarial, not collaborative. The adversarial nature of these contestant–judge interactions, we feel, is a driving force in the divergence of the Turing test, in its current instantiations, from the spirit in which it was originally conceived.

3 SOME VARIATIONS ON THE TURING TEST

The idea of proposing modifications to the Turing test is not a new one. In this section, we look at two such proposals—Stevan Harnad's "Total Turing Test" (and the accompanying hierarchy of Turing tests he outlines) and Edward Feigenbaum's eponymous variation on the Turing test—before discussing how they relate to our own, described below as a "modified Feigenbaum test."

3.1 The Total Turing Test

Harnad ([6], [7]) has outlined a detailed hierarchy of possible Turing tests, with Turing's own version occupying the second of five rungs on this hypothetical ladder. Harnad refers to this as the T2, or "pen-pal," level, given the strict focus on verbal (i.e., written or typed) output. Directly below this level is the t1 test (where "t" stands for "toy," not "Turing"). Harnad observed, a decade ago, that "all of the actual mind-modelling research efforts to date are still only at the t1 level, and will continue to be so for the foreseeable future: Cognitive Science has not even entered the TT hierarchy yet" ([7], §9). This is still the case today.
Just as the t1 test draws on "subtotal fragments" of T2, T2 stands in a similar relation to T3, the Total Turing Test. This test requires not just pen-pal behavior, but robotic (i.e., embodied) behavior as well. A machine that passed the Total Turing Test would be functionally (though not microscopically) indistinguishable from a human being. (The T4 and T5 levels, which make even greater demands, are not relevant for our purposes.) Clearly, there are fewer degrees of freedom—and hence less room for deception—as we climb the rungs on Harnad's ladder, particularly from T2 to T3. However, given the current state of the art, the T3 can only be considered an extremely distant goal at this point. It may be that the T2, or pen-pal, test could only be convincingly "passed"—over an arbitrarily long period of time, as Harnad stipulates, and not just the five-minute period suggested by Turing and adhered to in the Loebner Prize competition—by a system that could move around and interact with other people and things in the real world as we do. It may even be that certain phenomena that are still being modeled and tested at the t1 level—even seemingly abstract and purely "cognitive" ones such as analogy-making and categorization—are ultimately grounded in embodiment and sensorimotor capacities as well (cf. [12]), which would imply fundamental limitations for much current research. Unfortunately, such questions must be set aside for the time being, as they are beyond the scope of this paper.

3.2 The Feigenbaum test

The Feigenbaum test [3] was proposed in order to test the quality of reasoning in specialized domains—primarily scientific or otherwise technical domains such as astrophysics, computer science, and medicine. The confederate in the Feigenbaum test is not merely an ordinary human being, but an "elite scientist" and member of the U.S. National Academy of Sciences. The judge, who is also an Academy member and an expert in the domain in question, interacts with the confederate and the contestant (i.e., the program). Feigenbaum elaborates, "The judge poses problems, asks questions, asks for explanations, theories, and so on—as one might do with a colleague" ([3], p. 36). No time period is stipulated, but as with the Turing test, "the challenge will be considered met if the computational intelligence 'wins' one out of three disciplinary judging contests, that is, one of the three judges is not able to choose reliably between human and computer performer" (ibid.).

3.3 A modified Feigenbaum test

Feigenbaum's emphasis on knowledge-intensive technical domains is in keeping with his longtime work in the area of expert systems. This aspect of his test is incidental, even irrelevant, to our purposes. In fact, we go one step further with our "modified Feigenbaum test" and remove the need for an additional contestant beyond the program. Rather, the judge "interacts" directly with the program for an arbitrarily long period of time and evaluates the program's behavior directly—and qualitatively—on the basis of this interaction. (No pretense is made of the program passing for human, which would be premature and naive in any case.) What is relevant about the Feigenbaum test for our purposes is its emphasis on focused, sustained interaction between judge and program within a suitably subtle domain. Our modified Feigenbaum test stresses a similar type of interaction, though the domain—while still constrained—is far less specialized or knowledge-intensive than, say, astrophysics or medicine.
In fact, the domain we discuss below—letter-string analogies—was originally chosen as an arena for modeling cognition because of its balance of generality and tractability [9]. In other words, the cognitive processes involved in thinking and otherwise "operating" within the domain are intended to be more or less general and domain-independent. At the same time, the restriction of the domain, in terms of the entities and relationships that make it up, is meant to ensure tractability and plausibility—in contrast to dealing (or pretending to deal) with complex real-world knowledge of a sort that can scarcely be attributed to a computer program (e.g., knowledge of medicine, the solar system, etc.). In the following section, we argue on behalf of this approach and show how research carried out under this ongoing program represents an example of how one can take the idea of Turing's test seriously without taking its specifications literally.

4 TAKING TURING SERIOUSLY: AN ALTERNATIVE APPROACH

In an essay entitled "On the Seeming Paradox of Mechanizing Creativity," Hofstadter [8] relates Myhill's [17] three classes of mathematical logic to categories of behavior. The most inclusive category, the productive, is the one that is of central interest to us here. While no finite collection of rules suffices to generate all members of a productive set P (and no x ∉ P), a more expansive and/or sophisticated set of generative rules (i.e., creative processes) can approximate P with unbounded accuracy. In order to emphasize the role of such "unbounded creativity" in the evaluation of intelligence, we describe a modified Feigenbaum test restricted to the microdomain of letter-string analogies. An example of such a problem is, "If abc changes to abd, how would you change pxqxrx in 'the same way'?" (or simply abc → abd; pxqxrx → ???). Problems in this domain have been the subject of extensive study [9], resulting in the creation of the well-known Copycat model [16] and its successor, Metacat [15]. Before describing this test, however, we briefly discuss these programs' architectures in general terms.

4.1 Copycat, Metacat, and Fluid Concepts architectures

Copycat's architecture consists of three main components, all of which are common to the more general Fluid Concepts architectural scheme. These components are the Workspace, which is essentially the program's working memory; the Slipnet, a conceptual network with variably weighted links between concepts (essentially a long-term memory); and the Coderack, home to a variety of agent-like codelets, which perform specific tasks in (simulated) parallel, without the guidance of an executive controller. For example, given the problem abc → abd; iijjkk → ???, these tasks would range from identifying groups (e.g., the jj in iijjkk) to proposing bridges between items in different letter-strings (e.g., the b in abc and the jj in iijjkk) to proposing rules to describe the change in the initial pair of strings (i.e., the change from abc to abd). (See [16] for an in-depth discussion of codelet types and functions in Copycat.)
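As a rough illustration of the control regime just described (and only of that; the class and codelet names below are hypothetical, and the real Copycat involves far more machinery), a Coderack can be thought of as a pool of small tasks chosen stochastically by urgency, each of which may post further tasks:

```python
# A minimal, hypothetical sketch of a Coderack-style control loop: codelets
# are drawn with probability proportional to their urgency, and a running
# codelet may post follow-up codelets. Not Copycat's actual implementation.
import random

class Coderack:
    def __init__(self):
        self.codelets = []  # (urgency, function) pairs

    def post(self, urgency, fn):
        self.codelets.append((urgency, fn))

    def step(self):
        # Roulette-wheel selection biased by urgency.
        total = sum(u for u, _ in self.codelets)
        r, acc = random.uniform(0, total), 0.0
        for i, (u, fn) in enumerate(self.codelets):
            acc += u
            if r <= acc:
                self.codelets.pop(i)
                return fn(self)

def scan_for_groups(rack):
    print("looking for sameness groups, e.g. jj in iijjkk")
    rack.post(5, propose_bridge)  # follow-up work at higher urgency

def propose_bridge(rack):
    print("proposing a bridge, e.g. between b in abc and jj in iijjkk")

rack = Coderack()
rack.post(1, scan_for_groups)
while rack.codelets:
    rack.step()
```

The essential design point is the absence of an executive controller: which codelet runs next is a biased random draw, so perception emerges from many small, competing actions rather than from a fixed pipeline.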
Building on Copycat, Metacat incorporates some additional components that are not present in its predecessor's architecture, most notably the Episodic Memory and the Temporal Trace. As the program's name suggests, the emphasis in Metacat is on metacognition, which can broadly be defined as the process of monitoring, or thinking about, one's own thought processes. What this means for Metacat is an ability to monitor, via the Temporal Trace, events that take place en route to answering a given letter-string problem, such as detecting a "snag" (e.g., trying to find the successor to z, which leads to a snag because the alphabet does not "circle around" in this domain) or noticing a key idea. Metacat also keeps track of its answers to previous problems, as well as its responses on previous runs of the same problem, both via the Episodic Memory. As a result, it is able to be "reminded" of previous problems (and answers) based on the problem at hand. Finally, it is able to compare and contrast two answers at the user's prompting (see Section 4.3 below).

Philosophically speaking, Fluid Concepts architectures are predicated upon the conviction that it is possible to "know everything about" the entities and relationships in a given microdomain. In other words, there is no propositional fact about domain entities and processes (or the effect of the latter on the former) that is not in principle accessible to inspection or introspection. In Copycat, the domain entities range from permanent "atomic" elements (primarily, the 26 letters of the alphabet) to temporary, composite ones, such as the letter strings that make up a given problem (abc, iijjkk, pxqxrx, etc.); the groups within letter strings that are perceived during the course of a run (e.g., the ii, jj, and kk in iijjkk); and the bonds that are formed between such groups. The relationships include concepts such as same, opposite, successor, predecessor, and so on. A key aspect of the Fluid Concepts architecture is that it affords an exploration of the space of instantiations of those entities and relationships in a (largely) non-stochastic fashion—that is, in a manner that is predominantly directed by the nature of the relationships themselves. In contrast, the contextual pressures that give rise to some subtle yet low-frequency solutions are unlikely to have a referent within a statistical machine-learning model built from a corpus of Copycat answers, since outliers are not readily captured by gross mechanisms such as sequences of transition probabilities.

4.2 An example from the Copycat microdomain

To many observers, a letter-string analogy problem such as the aforementioned abc → abd; iijjkk → ??? might appear trivial at first glance. (Such problems may seem to bear a strong resemblance to the kinds of problems one might find on an IQ test. However, an important difference worth noting is that the problems in the Copycat domain are not conceived of as having "correct" or "incorrect" answers, though in many cases there are clearly "better" and "worse" ones. Rather, the answers are open to discussion, and the existence of subtle differences between the various answers to a given problem is an important aspect of the microdomain.) Yet upon closer inspection, one can come to appreciate the surprising subtleties involved in making sense of even a relatively basic problem like this one. Consider the following (non-exhaustive) list of potential answers to the above problem:

• iijjll – To arrive at this seemingly basic answer requires at least three non-trivial insights: (1) seeing iijjkk as a sequence of three sameness groups—ii, jj, and kk—not as a sequence of individual letters; (2) seeing the group kk as playing the same role in iijjkk that the letter c does in abc; and (3) seeing the change from c to d in terms of successorship and not merely as a change from the letter c to the letter d. The latter point may seem trivial, but it is not a given, and as we will see, there are other possible interpretations.

• iijjkl – This uninspiring answer results from simply changing the letter category of the rightmost letter in iijjkk to its successor, as opposed to the letter category of the rightmost group.
• iijjkd – This answer results from the literal-minded strategy of simply changing the last letter in the string to d, all the while ignoring the other relationships among the various groups and letter categories.

• iijjdd – This semi-literal, semi-abstract answer falls somewhere in between iijjll and iijjkl. On the one hand, it reflects a failure to perceive the change from c to d in the initial string in terms of successorship, instead treating it as a mere replacement of the letter c with the letter d. On the other hand, it does signal a recognition that the concept group is important, as it at least involves carrying out the change from k to d in the target string over to both ks and not just the rightmost one. This answer has a "humorous" quality to it, unlike iijjkl or iijjkd, due to its mixture of insight and confusion.

This incomplete catalog of answers hints at the range of issues that can arise in examining a single problem in the letter-string analogy domain. Copycat itself is able to come up with all of the aforementioned answers (along with a few others), as illustrated in Table 1, which reveals iijjll to be the program's "preferred choice" according to the two available measures. These measures are (1) the relative frequency with which each answer is given and (2) the average "final temperature" associated with each answer. Roughly speaking, the temperature—which can range from 0 to 100—indicates the program's moment-to-moment "happiness" with its perception of the problem during a run, with a lower temperature corresponding to a more positive evaluation.

Table 1. Copycat's performance over 1000 runs on the problem abc → abd; iijjkk → ???. Adapted from [16].

Answer   Frequency   Average Final Temperature
iijjll   810         27
iijjkl   165         47
iijjdd     9         32
iikkll     9         46
iijkll     3         43
iijjkd     3         65
ijkkll     1         43
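The gap between the literal-minded answer iijjkd and the group-aware answer iijjll can be made concrete in a few lines. The sketch below is a hypothetical illustration of the two rule interpretations discussed above, not of Copycat's actual stochastic mechanism:

```python
import itertools

def literal_change(target):
    # "Replace the last letter with d": the literal-minded reading.
    return target[:-1] + "d"

def successor(letter):
    return chr(ord(letter) + 1)

def group_aware_change(target):
    # Parse the string into runs of identical letters (sameness groups),
    # then apply successorship to the rightmost *group*.
    groups = ["".join(run) for _, run in itertools.groupby(target)]
    groups[-1] = successor(groups[-1][0]) * len(groups[-1])
    return "".join(groups)

print(literal_change("iijjkk"))      # -> iijjkd
print(group_aware_change("iijjkk"))  # -> iijjll
```

Even this toy contrast shows why "change the last letter to d" and "take the successor of the rightmost group" are genuinely different readings of the same initial change.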
4.3 The modified Feigenbaum test: from Copycat to Metacat

One limitation of Copycat is its inability to "say" anything about the answers it gives beyond what appears in its Workspace during the course of a run. While aggregate statistics such as those illustrated in Table 1 can offer some insight into its performance, the program is not amenable to genuine Feigenbaum-testing, primarily because it doesn't have the capacity to summarize its viewpoint. To the extent that it can be Feigenbaum-tested, it can only be so in response to what might be termed first-order questions (e.g., abc → abd; iijjkk → ???). It cannot answer second-order questions (i.e., questions about questions), let alone questions about its answers to questions about questions. In contrast, Metacat allows us to ask increasingly sophisticated questions of it, and thus can be said to allow for the sort of modified Feigenbaum-testing described in Section 3.3.

One can "interact" with the program in a variety of ways: by posing new problems; by inputting an answer to a problem and running the program in "justify mode," asking it to evaluate and make sense of the answer; and by having it compare two answers to one another. In doing the latter, the program summarizes its "viewpoint" with one of a set of canned (but non-arbitrary) English descriptions. For example, the preferred answer might be "based on a richer set of ideas," "more abstract," or "more coherent." The program also attempts to "explain" how the two answers are similar to each other and how they differ. For example, consider the program's summary of the comparison between iijjll and iijjdd in response to the aforementioned problem:

The only essential difference between the answer iijjdd and the answer iijjll to the problem abc → abd; iijjkk → ??? is that the change from abc to abd is viewed in a more literal way for the answer iijjdd than it is in the case of iijjll. Both answers rely on seeing two strings (abc and iijjkk in both cases) as groups of the same type going in the same direction. All in all, I'd say iijjll is the better answer, since it involves seeing the change from abc to abd in a more abstract way.

It should be emphasized that the specific form of the verbal output is extremely unsophisticated relative to the capabilities of the underlying architecture, indicating that it is possible to exhibit depth of insight while treating text generation as essentially a side effect. This contrasts sharply with contemporary approaches to the Turing test. For the sake of contrast, here is the program's comparison between the answers iijjll and abd, which illustrates some of the program's limitations in clumsily (and, of course, unintentionally) humorous fashion:

The only essential difference between the answer abd and the answer iijjll to the problem abc → abd; iijjkk → ??? is that the change from abc to abd is viewed in a completely different way for the answer abd than it is in the case of iijjll. Both answers rely on seeing two strings (abc and iijjkk in both cases) as groups of the same type going in the same direction. All in all, I'd say abd is really terrible and iijjll is very good.

Apart from the thin veneer of human agency that results from Metacat's text generation, the program's accomplishments—and just as importantly, its failures—become transparent through interaction.

4.4 Looking ahead

In order for it to actually pass an "unrestricted modified Feigenbaum test" in the letter-string analogy domain, what other questions might we conceivably require Metacat to answer? Here are some suggestions:

1. Problems that involve more holistic processing of letter strings. There are certain letter strings that humans seem to have little trouble processing, but that are beyond Metacat's grasp—for example, the string oooaaoobboooccoo in the problem abc → abd; oooaaoobboooccoo → ???. How are we so effortlessly able to "tune out" the o's in oooaaoobboooccoo? What would it take for a Metacat-style program to be able to do likewise?

2. Meta-level questions about sequences of answers. For example, "How is the relationship between answer A and answer B different from that between C and D?" Such questions could be answered using the declarative information that Metacat already has; all that would seem to be required is the ability to pose the question.
3. Questions pertaining to concepts about analogy-making in general, such as mapping, role, theme, slippage, pressure, pattern, and concept. Metacat deals implicitly with all of these ideas, but it doesn't have explicit knowledge or understanding of them.

4. An ability to characterize problems in terms of "the issues they are about," with the ultimate goal of having a program that is able to create new problems of its own—which would certainly lead to a richer, more interesting exchange between the program and the human interacting with it. Some work in this area was done in the Phaeaco Fluid Concepts architecture [4], but the issue requires further investigation.

5. Questions of the form, "Why is answer A more humorous (or stranger, or more elegant, etc.) than answer B?" Metacat has implicit notions, however primitive, of concepts such as succinctness, coherence, and abstractness, which figure into its answer comparisons. These notions pertain to aesthetic judgment insofar as we tend to find things that are succinct, coherent, and reasonably abstract to be more pleasing than things that are prolix, incoherent, and either overly literal or overly abstract. Judgments involving humor often take into account such factors, too, among many others. Metacat's ability—however rudimentary—to employ criteria such as abstractness and coherence in its answer evaluations could be seen as an early step toward understanding how these kinds of qualitative judgments might emerge from simpler processes. On the other hand, for adjectives such as "humorous," which presuppose the possession of emotional or affective states, it is not at all clear what additional mechanisms might be required, though some elementary possibilities are outlined in [18].

6. A rudimentary sense of the "personality traits" associated with certain patterns of answers. In other words, just as Metacat is able to compare two answers with one another, a meta-Metacat might be able to compare two sets of answers—and, correspondingly, two answerers—with one another. For example, a series of literal-minded or short-sighted answers might yield a perception of the answerer as being dense, while a series of sharp, insightful answers punctuated by the occasional obvious clunker might yield a picture of an eccentric smart-aleck.

Ultimately, however, the particulars of Copycat, Metacat, and the letter-string analogy domain are not so important in and of themselves. The programs merely serve as an example of a kind of approach to modeling cognitive phenomena, just as the domain itself serves as a controlled arena for carrying out such modeling. To meet the genuine intent of the Turing test, we must be able to partake in the sort of arbitrarily detailed and subtle discourse described above in any domain. As the foregoing list shows, however, there is much that remains to be done, even—to stick with our example—within the tiny domain in which Copycat and Metacat operate. It is unclear how far a disembodied computer program, even an advanced successor to these two models, can go toward modeling socially and/or culturally grounded phenomena such as personality, humor, and aesthetic judgment, to name a few of the more obvious challenges involved in achieving the kind of discourse that our "test" ultimately calls for.
At the same time, it is unlikely that such discourse lies remotely within the capabilities of any of the current generation of Loebner Prize contenders, nor does it even seem to be a goal of such contenders.

5 CONCLUSION

We have argued that the Turing test would more profitably be considered as a sequence of modified Feigenbaum tests, in which the questioner and respondent are to collaborate in an attempt to extract maximum subtlety from a succession of arbitrarily detailed domains. In addition, we have explored a parallel between the "domain-agnostic" approach of statistical machine learning and that of artistic forgery, in turn arguing that by requesting successive variations on an original theme, a critic may successfully distinguish mere surface-level imitations from those that arise via the meta-mechanisms constitutive of genuine creativity and intelligence. From the perspective we have argued for, Metacat and the letter-string-analogy domain can be viewed as a kind of Drosophila for the Turing test, with the search for missing mechanisms directly motivated by the specific types of questions we might conceivably ask of the program.

ACKNOWLEDGEMENTS

We would like to thank Vincent Müller and Aladdin Ayesh for their hard work in organizing this symposium, along with the anonymous referees who reviewed and commented on the paper. We would also like to acknowledge the generous support of Indiana University's Center for Research on Concepts and Cognition.

REFERENCES

[1] B. Christian, The Most Human Human, Doubleday, New York, 2011.
[2] D. Dutton, 'Artistic crimes', British Journal of Aesthetics, 19, 302–314, (1979).
[3] E. A. Feigenbaum, 'Some challenges and grand challenges for computational intelligence', Journal of the ACM, 50(1), 32–40, (2003).
[4] H. Foundalis, Phaeaco: A cognitive architecture inspired by Bongard's problems, Doctoral dissertation, Indiana Univ., Bloomington, 2006.
[5] R. French, 'Subcognition and the limits of the Turing test', Mind, 99, 53–65, (1990).
[6] S. Harnad, 'The Turing test is not a trick: Turing indistinguishability is a scientific criterion', SIGART Bulletin, 3(4), 9–10, (1992).
[7] S. Harnad, 'Minds, machines and Turing: the indistinguishability of indistinguishables', Journal of Logic, Language, and Information, 9(4), 425–445, (2000).
[8] D. R. Hofstadter, Metamagical Themas: Questing for the Essence of Mind and Pattern, Basic Books, New York, 1986.
[9] D. R. Hofstadter, Fluid Concepts and Creative Analogies, Basic Books, New York, 1995.
[10] C. Irving, Fake! The story of Elmyr de Hory, the greatest art forger of our time, McGraw-Hill, New York, 1969.
[11] M. Kieran, Revealing Art, Routledge, London, 2005.
[12] B. Kokinov, V. Feldman, and I. Vankov, 'Is analogical mapping embodied?', in New Frontiers in Analogy Research, eds., B. Kokinov, K. Holyoak, and D. Gentner, New Bulgarian Univ. Press, Sofia, Bulgaria, (2009).
[13] J. Lanier, You Are Not a Gadget, Alfred A. Knopf, New York, 2010.
[14] A. Lessing, 'What is wrong with a forgery?', Journal of Aesthetics and Art Criticism, 23(4), 461–471, (1965).
[15] J. Marshall, Metacat: A self-watching cognitive architecture for analogy-making and high-level perception, Doctoral dissertation, Indiana Univ., Bloomington, 1999.
[16] M. Mitchell, Analogy-Making as Perception: A Computer Model, MIT Press, Cambridge, Mass., 1993.
[17] J. Myhill, 'Some philosophical implications of mathematical logic', Review of Metaphysics, 6, 165–198, (1952).
[18] R. Picard, Affective Computing, MIT Press, Cambridge, Mass., 1997.
[19] J. Rehling, Letter Spirit (part two): Modeling creativity in a visual domain, Doctoral dissertation, Indiana Univ., Bloomington, 2001.
[20] S. Shieber, 'Lessons from a restricted Turing test', Communications of the ACM, 37(6), 70–78, (1994).
[21] R. J. Solomonoff, 'A formal theory of inductive inference, pt. 1', Information and Control, 7(1), 1–22, (1964).
[22] A. Turing, 'Computing machinery and intelligence', Mind, 59, 433–460, (1950).
[23] R. Wallace, 'The anatomy of A.L.I.C.E.', in Parsing the Turing Test, eds., R. Epstein, G. Roberts, and G. Beber, 1–57, Springer, Heidelberg, (2009).
[24] J. Weizenbaum, Computer Power and Human Reason, Freeman, San Francisco, 1976.
[25] H. Werness, 'Han van Meegeren fecit', in The Forger's Art, ed., D. Dutton, 1–57, Univ. of California Press, Berkeley, (1983).

Laws of Form and the Force of Function: Variations on the Turing Test

Hajo Greif (University of Klagenfurt, Austria, hajo.greif@aau.at)

Abstract. This paper commences from the critical observation that the Turing Test (TT) might not be best read as providing a definition or a genuine test of intelligence by proxy of a simulation of conversational behaviour. Firstly, the idea of a machine producing likenesses of this kind served a different purpose in Turing, namely providing a demonstrative simulation to elucidate the force and scope of his computational method, whose primary theoretical import lies within the realm of mathematics rather than cognitive modelling. Secondly, it is argued that a certain bias in Turing's computational reasoning towards formalism and methodological individualism contributed to systematically unwarranted interpretations of the role of the TT as a simulation of cognitive processes. On the basis of the conceptual distinction in biology between structural homology and functional analogy, a view towards alternate versions of the TT is presented that could function as investigative simulations into the emergence of communicative patterns oriented towards shared goals. Unlike the original TT, the purpose of these alternate versions would be co-ordinative rather than deceptive. On this level, genuine functional analogies between human and machine behaviour could arise in quasi-evolutionary fashion.

1 A Turing Test of What?

While the basic character of the Turing Test (henceforth TT) as a simulation of human conversational behaviour remains largely unquestioned in the sprawling debates it has triggered, there are a number of diverging interpretations as to whether and to what extent it provides a definition, or part of a definition, of intelligence in general, or whether it amounts to the design of an experimental arrangement for assessing the possibility of machine intelligence in particular. It thus remains undecided what role, if any, there is for the TT to play in cognitive inquiries. I will follow James H. Moor [13] and other authors [21, 2] in their analysis that, contrary to seemingly popular perception, the TT neither provides a definition nor an empirical criterion of the named kind. Nor was it intended to do so. At least at one point in Alan M. Turing's mostly rather informal musings on machine intelligence, he explicitly dismisses the idea of a definition, and he attenuates the idea of an empirical criterion of machine intelligence:

I don't really see that we need to agree on a definition [of thinking] at all.
The important thing is to try to draw a line between the properties of a brain, or of a man, that we want to discuss, and those that we don't. To take an extreme case, we are not interested in the fact that the brain has the consistency of cold porridge. We don't want to say 'This machine's quite hard, so it isn't a brain, and so it can't think.' I would like to suggest a particular kind of test that one might apply to a machine. You might call it a test to see whether the machine thinks, but it would be better to avoid begging the question, and say that the machines that pass are (let's say) 'Grade A' machines. [...] (Turing in a BBC radio broadcast of January 10th, 1952, quoted after [3, p. 494 f])

Turing then goes on to introduce a version of what has come to be known, perhaps a bit unfortunately, as the Turing Test, but was originally introduced as the "imitation game". In place of the articulation of definitions of intelligence or the establishment of robust empirical criteria for intelligence, we find much less ambitious, and arguably more playful, claims. One purpose of the test was to develop a thought-experimental, inductive approach to identifying those properties shared between the human brain and a machine which would actually matter to asking the question of whether men or machines alike can think: What is the common ground human beings and machines would have to share in order to also share a set of cognitive traits? It was not a matter of course in Turing's day that there could possibly be any such common ground, as cognition was mostly considered essentially tied to (biological or other) human nature. (In [1, p. 168 f], Margaret Boden notices that the thought that machines could possibly think was not even a "heresy" up to the early 20th century, as that claim would have been all but incomprehensible.) In many respects, the TT was one very instructive and imaginative means of raising the question whether the physical constitution of different systems, whether cold-porridge-like or electric-circuitry-like, makes a principled difference between a system with and a system without cognitive abilities. Turing resorted to machine simulations of behaviours that would normally be considered expressions of human intelligence in order to demonstrate that the lines of demarcation between the human and the mechanical realm are less than stable.

The TT is however not sufficient as a means for answering the questions it first helped to raise, nor was it so intended. Turing's primary aim for the TT was one demonstration, among others, of the force and scope of what he introduced as the "computational method" (which will be briefly explained in section 2). Notably, the computational method has a systematically rooted bias towards, firstly, considering a system's logical form over its possible functions and, secondly, towards methodological individualism. I will use Turing's mathematical theory of morphogenesis and, respectively, the distinction between the concepts of structural homology and functional analogy in biology as the background for discussing the implications of this twofold bias (in section 3). On the basis of this discussion, a tentative reassessment of the potentials and limits of the
If there is a systematic investigative role to play in cognitive inquiries for modified variants of the TT, these would have to focus on possible functions to be shared between humans and machines, and they would have to focus on shared environments of interaction rather than individual behaviours.

2 The Paradigm of Computation

Whether intentionally or not, Turing's reasoning contributed to breaking the ground for the functionalist arguments that prevail in much of the contemporary philosophies of biology and mind: An analysis is possible of the operations present within a machine or an organism that systematically abstracts from their respective physical nature. A set of operations identical on a specified level of description can be accomplished in a variety of physical arrangements. Any inference from the observable behavioural traits of a machine simulating human communicative behaviour, as in the TT, to an identity of underlying structural features would appear unwarranted. Turing's work was concerned with the possibilities of devising a common logical form for abstractly describing the operations in question. His various endeavours, from morphogenesis via (proto-)neuronal networks to the simulation of human conversational behaviour, can be subsumed under the objective of exploring what his "computational method" could achieve across a variety of empirical fields and under a variety of modalities. Simulations of conversational behaviours that had hitherto been considered an exclusively human domain constituted but one of these fields, investigated under one modality. Turing's computational method is derived from his answer to a logico-mathematical problem, David Hilbert's "Entscheidungsproblem" (the decision problem) in predicate logic, as presented in [8]. This problem amounts to the question whether, within the confines of a logical calculus, there is an unequivocal, well-defined and finite, hence at least in principle executable, procedure for deciding on the truth of a proposition stated in that calculus. After Kurt Gödel's demonstration that arithmetic is incomplete and that its consistency cannot be proven within the confines of arithmetic proper [7], the question of deciding on the truth of arithmetical propositions from within that same axiomatic system had to be recast as a question of deciding on the internal provability of such propositions. The – negative – answer to this reformulated problem was given by Turing [18] (and, a little earlier, by a slightly different method, by Alonzo Church). Turing's path towards that answer was based on Gödel's elegant solution to the former two problems, namely a translation into arithmetical forms of the logical operations required for deciding on the provability of a proposition within the system of arithmetical axioms. Accordingly, the method of further investigation was to examine the calculability of the arithmetical forms so generated. To decide on the calculability of the problem in turn, Turing introduced the notion of computability.
A mathematical problem is considered computable if the process of its solution can be broken down into a set of exact elementary instructions by which one will arrive at a determinate solution in a finite number of steps, and which could be accomplished, at least in principle, by human "computers".3 Even complex problems should thus become reducible to a set of basic operations. The fulfilment of the required routines demands an ability to apply a set of rules and, arguably, some mental discipline, but these routines are not normally considered part of the most typical or complex properties of human thought – and they can be mechanised, in a more direct, material sense, by an appropriately constructed and programmed machine. Hence, Turing's notion of "mechanical" was of a fairly abstract kind. It referred to a highly standardised and routinised method of solving mathematical problems, namely the computational method proper. This method could be equally applied by human, mechanical or digital "computers", or by any other system capable of following the required routines. Given this description of computability, the primary aim of Turing's models of phenomena such as morphogenesis, the organisation of the nervous system or the simulation of human conversation lies in finding out whether, how and to what extent their specific structural or behavioural patterns can be formally described in computational terms – and thus within the realm of mathematics. A successful application of the computational method to the widest variety of phenomena would have implications on higher epistemological or arguably even metaphysical levels, but, being possible implications, these are not contained within the mathematical theory.

3 I am following B. Jack Copeland [4] here on his definition of computability, as he makes a considerable effort at spelling out what notion of computability Turing was using in [18]. He thus hopes to stem the often-lamented flood of loose and misleading uses of that term in many areas of science.

3 The Relevance of Form and Function

The design of Turing's computational method intuitively suggests, but does not entail, that the phenomena in question are chiefly considered in their computationally modellable form. Turing focuses on the formal patterns of organic growth, on the formal patterns of neuronal organisation and re-organisation in learning, and on the logical forms of human conversation. The possible or actual functions of these formally described patterns, in terms of the purposes they do or may serve, are not systematically considered. A second informal implication of Turing's computational approach lies in his focus on the behaviour of isolated, individual systems – hence not on the organism in its environment, but on the human brain as a device with input and output functions.4 Such a focus on self-contained, individual entities was arguably guided by a methodological presupposition informed by the systematic goals of Turing's research: The original topics of his inquiry were the properties of elementary recursive operations within a calculus. Hence, any empirical test of the force and scope of the computational method, that is, any test of what can be accomplished by means of such elementary recursive operations, would naturally but not necessarily commence in the same fashion. In order to get a clearer view of this twofold bias, it might be worthwhile to take a closer look at the paradigm of Turing's computational method.
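Before turning to that paradigm, the notion of computability introduced above can be given a concrete shape. What follows is a minimal sketch written for illustration only – the machine, its rule table and its unary encoding are our inventions, not taken from Turing's papers. It computes the successor of a unary numeral by exactly the kind of exact elementary instructions described above:

```python
def run(tape, rules, state="start", blank="_"):
    """Follow (state, symbol) -> (state, symbol, move) rules until 'halt'."""
    cells, head = dict(enumerate(tape)), 0
    while state != "halt":
        symbol = cells.get(head, blank)
        state, written, move = rules[(state, symbol)]
        cells[head] = written
        head += {"R": 1, "L": -1}[move]
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# One elementary instruction per (state, symbol) pair: scan rightwards
# over the marks of a unary numeral, append one more mark, then halt.
successor_rules = {
    ("start", "1"): ("start", "1", "R"),
    ("start", "_"): ("halt", "1", "R"),
}

print(run("111", successor_rules))  # '1111', i.e. 3 + 1 in unary
```

Any system capable of looking up a rule, writing a symbol and moving the head – a disciplined human "computer" no less than an electronic machine – can execute such a table, which is precisely the fairly abstract sense of "mechanical" discussed above.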
That paradigm, in terms of elaboration, rigour and systematicity, is not to be found in his playful and informal imitation game approach to computer simulations of conversational behaviour. Instead, it is to be found in his mathematical theory of morphogenesis [20]. This inquiry was guided by Sir D'Arcy Thompson's On Growth and Form [17], influential in its time, and it was directed at identifying the basic chemical reactions involved in generating organic patterns, from an animal's growth to the grown animal's anatomy, from the dappledness or stripedness of furs to the arrangement of a sunflower's florets and the phyllotactic ordering of leaves on a plant's twigs. The generation of such patterns was modelled in rigorously formal-mathematical fashion. The resulting model was impartial to the actual biochemical realisation of pattern formation. It would only provide some cues as to what concrete reactants, termed "morphogens" by Turing, one should look out for.

4 For this observation, see, for example, [9, p. 85].

Less obviously but similarly important, Turing chose not to inquire into any adaptive function, in Darwinian terms, of the patterns so produced. These patterns may or may not serve an adaptive function, and what that function amounts to is of secondary concern at best. Explaining the generation of their form does not contribute to explaining that form's function, nor does it depend on that function. In this respect, too, Turing's thoughts appear to be in line with, if not explicitly endorsing, D'Arcy Thompson's skeptical view of the relevance of adaptation by natural selection in evolution. The formative processes in organisms are considered at least partly autonomous from Darwinian mechanisms. Whether the florets of a sunflower are patterned on a Fibonacci series, as they in fact are, or whether they are laid out in grid-like fashion, as they possibly cannot be according to the mathematical laws of form expounded by Turing, is unlikely to make a difference in terms of selective advantage. In turn however, natural selection may not offer a path to a grid-like pattern in the first place, while enabling, but arguably not determining, the Fibonacci pattern. In like fashion, the cognitive abilities of human beings or other animals would not in the first place be considered as adaptive abilities, defined in relation to challenges posed by their environments, but in their mathematically modellable form. Turing's bias towards form over function, in conjunction with his methodological individualism, created a difficulty in systematically grasping a relation that might look straightforward or even obvious to the contemporary reader, who is likely to be familiar with the role of populations and environments in evolution, and who might also be familiar with philosophical concepts of functions: the analogy of functions across different, phylogenetically distant species. In Turing's notion of decoupling logical form from physical structure, the seeds of the concept of functional analogy appear to be sown, however without growing to a degree of maturity that would prevent the premature conclusions often drawn from Turing's presentation of the TT. It is the condition of observable similarity in behaviour that has been prone to misguide both proponents and critics of the TT.
One cannot straightforwardly deduce a similarity of kind – in this case, being in command of a shared form of intelligence – from a similarity in appearance. A relation of proximity in kind could only be firmly established on the grounds of a relation of common descent, that is, from being part of the same biological population or from being assembled according to a common design or Bauplan. This is the ultimate skeptical resource for the AI critic who will never accept some computer's or robot's trait as the same as or equivalent to a human one. However convincing it may look to the unprejudiced observer, any similarity will be dismissed as a feat of semi-scientific gimmickry. Even a 1:1 replica of a human being, down to artificial neurones and artificial muscles made of high-tech carbon-based fibres, is unlikely to convince him or her. What the skeptic is asking for is a structural homology to lie at the foundation of observable similarities. In the biological discipline of morphology, the distinction between analogies and homologies was first systematically applied by Richard Owen, who defined it as follows:

"ANALOGUE." – A part or organ in one animal which has the same function as another part or organ in a different animal. "HOMOLOGUE." – The same organ in different animals under every variety of form and function. [15, p. 7, capitalisation in original]

This distinction was put on an evolutionary footing by Charles Darwin, who gave a paradigmatic example of homology himself when he asked: "What can be more curious than that the hand of a man, formed for grasping, that of a mole for digging, the leg of the horse, the paddle of the porpoise, and the wing of the bat, should all be constructed on the same pattern, and should include the same bones, in the same relative positions?" [5, p. 434] – where the reference of "the same" for patterns, bones and relative positions is fixed by their common ancestral derivation rather than, as for Owen and other Natural Philosophers of his time, by abstract archetypes. In contrast, an analogy of function of traits or behaviours amounts to a similarity or sameness of the purpose which a certain trait or behaviour serves, but which, firstly, may be realised in phenotypically variant form and which, secondly, does not have to be derived from a relation of common descent. For example, consider the function of vision in different species, which is realised in a variety of eye designs made from different tissues, and which is established along a variety of lines of descent. The most basic common purpose of vision for organisms is navigation within their respective environments. This purpose is shared by camera-based vision in robots, which arguably have an aetiology very different from any natural organism. Conversely, the same navigational purpose is served by echolocation in bats, which functions in an entirely different physical medium and under entirely different environmental circumstances, namely the absence of light. There are no principled limitations as to how a kind of function is realised and by what means it is transmitted. The way in which either variable is fixed depends on the properties of the (biological or technological) population and of the environment in question. In terms of determining its content, a function is fixed by the relation between an organism's constitution and the properties of the environment in which it finds itself, and thus by what it has to accomplish in relation to organic and environmental variables in order to prevail.
This very relation may be identical even where the constitution of organisms and the properties of the environment are at variance between different species. Perceiving spatial arrangements in order to locomote under different lighting conditions would be a case in point. In terms of the method by which a function is fixed, a history of differential reproduction of variant traits that are exposed to the variables of the environment in which some population finds itself will determine the functional structure of those traits. If an organism is endowed with a reproducible trait whose effects keep in balance those environmental variables which are essential to the organism's further existence and reproduction, and if this happens in a population of reproducing organisms with sufficient frequency (which does not even have to be extremely high), the effects of that trait will be its functions.5 Along the lines of this argument, an analogy of function is possible between different lines of descent, provided that the environmental challenges for various phylogenetically remote populations are similar. There are no a priori criteria by which to rule out the possibility that properties of systems with a common descent from engineering processes may be functionally analogous to the traits and behaviours of organisms. In turn, similarity in appearance is at most a secondary consequence of functional analogy. Although such similarity is fairly probable to occur, as in the phenomenon of convergent evolution, it is never a necessary consequence of functional analogy. The similarity that is required to hold between different kinds of systems lies in the tasks for whose fulfilment their respective traits are selected. Structural homology, on the other hand, requires neither a similarity of tasks nor a similarity of appearance, but a common line of descent from which some trait hails, whatever function it may have acquired later along that line, and whatever observable similarity it may bear to its predecessor. In terms of providing criteria of similarity that go beyond what can be observed on the phenotypical level, functional analogy trumps structural homology.

5 This is the case for aetiological theories of function, as pioneered by [23] and elaborated by [11].

4 The Turing Test as Demonstrative vs. Investigative Simulation

On the grounds of the above argument, the apparent under-definition of the epistemic role of the TT owes to an insufficient understanding of the possibilities and limitations of functional analogy in the AI debates: It is either confounded with homological relations, which, as there are no common lines of descent between human beings and computers, results in the TT being rejected out of hand as a test for any possible cognitive ability of the latter. Or analogous functions are considered coextensive with a set of phenotypical traits similar, qua simulation, to those of human beings. Either way, it shows that inferences to possible cognitive functions of the traits in question are not warranted by phenotypical similarity. Unless an analogy of function can be achieved, the charge of gimmickry against the TT cannot be safely defused. If however such an analogy can be achieved, the test itself would not deliver the evidence necessary for properly assessing that analogy, nor would it provide much in the way of a suggestion how that analogy could be traced.
One might be tempted to put the blame for this insufficient understanding of functional analogy on Turing himself – but that might be an act of historical injustice. Firstly, he did not claim functional analogies to be achieved by his simulations. Secondly, some of the linkages between the formal-mathematical models which he developed and more recent concepts of evolution that comprise the role of populations and environments in shaping organic functions were not in reach of his well-circumscribed theory of computation. They were not even firmly in place at the time of his writing. Much of contemporary evolutionary reasoning owes to the Modern Synthesis in evolutionary biology, which was only in the process of becoming the majority view among biologists towards the end of Turing's life.6 With the benefit of hindsight however, and with the clarification of matters that it allows, is there any role left for the TT to be played in inquiries into human cognition – which have to concern, first and foremost, the functions of human cognition? Could it still function as a simulation of serious scientific value? Or, trying to capture Turing's ultimate, trans-mathematical objective more precisely and restating the opening question of this paper: Could the TT still help to identify the common ground human beings and machines would have to share in order to also share a set of cognitive traits? For modified forms of that test at least, the answer might be positive. First of all, one should be clear about what kind of simulation the TT is supposed to be. If my reconstruction of Turing's proximate aims is valid, the imitation game was intended as a demonstrative simulation of the force and scope of the computational method, with no systematic cognitive intent. By many of its interpreters and critics however, it was repurposed as an investigative simulation that, at a minimum, tests for some of the behavioural cues by which people normally discern signals of human intelligence in communication, or that, on a maximal account, tests for the cognitive capacities of machines proper. The notions of demonstrative and investigative simulations are distinguished in an intuitive, prima facie fashion in [16, p. 7 f], but may not always be as clearly discernible as one might hope. Demonstrative simulations mostly serve a didactic purpose, in reproducing some well-known behaviours of their subject matter or "target" in a different medium, so as to allow manipulations of those behaviours' variables that are analogous to operations on the target proper. The purpose of flight simulators, for example, lies in giving pilots a realistic impression of the experience of flying an airplane. Events within the flight simulation call for operations on the simulation's controls that are, in their effects on that simulation, analogous to the effects of the same operations in the flight that is being simulated. The physical or functional structure of an airplane does not have to be reproduced for this purpose, nor, of course, the physical effects of handling or mishandling an in-flight routine. Only an instructive simile thereof is required. I hope to have shown that this situation is similar to what we encounter in the TT, as originally conceived. No functional analogy between simulation and target is required at all, while the choice and systematic role of observable similarities is contingent on the didactic purpose of the simulation.
An investigative simulation, on the other hand, aims at reproducing a selection of the behaviours of the target system in a fashion that allows for, or contributes to, an explanation of those behaviours' effects. In a subset of cases, the explanation of the target's functions is included, too. Here, a faithful mapping of the variables of the simulation's behaviours, and their transformations, upon the variables and transformations on the target's side is of paramount importance. No phenomenal similarity is required, and a mere analogy of effects is not sufficient, as that analogy might be coincidental. Instead, some aspects of the internal, causal or functional, structure of the target system will need to be systematically grasped. To this purpose, an investigative simulation is guided by a theory concerning the target system, while the range of its behaviours is not exhausted by that theory: Novel empirical insights are supposed to grow from such simulations, in a manner partly analogous to experimental practice.7 I hope to have shown that this is what the TT might seem to aim at, but does not achieve, as there is no underlying theory of the cognitive traits that appear to be simulated by proxy of imitating human conversational behaviour. An alternative proposal for an investigative role of the TT along the lines suggested above would lie in creating analogues of some of the cognitive functions of communicative behaviour. Doing so would not necessarily require a detailed reproduction of all or even most underlying cognitive traits of human beings. Although such a reproduction would be a legitimate endeavour taken by itself, though probably a daunting one, it would remain confined to the same individualistic bias that marked Turing's own approach. A less individualistic, and perhaps more practicable, approach might take supra-individual patterns of communicative interaction and their functions rather than individual minds as its target. One function of human communication, it may be assumed, lies in the co-ordination of actions directed at shared tasks. If this is so, a modified TT-style simulation would aim at producing, in evolutionary fashion, 'generations' of communicative patterns to be tried and tested in interaction with human counterparts. The general method would be similar to evolutionary robotics,8 but, firstly, placed on a higher level of behavioural complexity and, secondly, directly incorporating the behaviour of human communicators. In order to allow for some such quasi-evolutionary process to occur, there should be a reward neither for the machine passing the TT nor for the human counterpart revealing the machine's nature. Instead, failures of the machine to effectively communicate with its human counterpart, in relation to a given task, would be punished by non-reproduction, in the next 'generation', of the mechanism responsible for the communicative pattern, replacing it with a slightly (and perhaps randomly) variant form of that mechanism. In this fashion, an adaptive function could be established for the mechanism in question over the course of time.

6 For historical accounts of the Modern Synthesis, see, for example, [10, 6].
7 For this argument on the epistemic role of computer simulations, see [22].
8 For a paradigmatic description of the research programme of evolutionary robotics, see [14].
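To make the proposed selection regime more tangible, here is a minimal sketch of such a quasi-evolutionary loop, written under strong simplifying assumptions of ours: a "communicative pattern" is reduced to a vector of parameters, and the success score – which in the envisaged setup would emerge from co-ordinated interaction with human counterparts on a shared task – is replaced by a toy stand-in function.

```python
import random

def task_success(pattern):
    # Hypothetical stand-in for the shared human-machine task; in the
    # proposed setup this would be measured from joint interaction,
    # not computed from a known target.
    target = [0.7, 0.2, 0.9]
    return -sum((p - t) ** 2 for p, t in zip(pattern, target))

def mutate(pattern, sigma=0.05):
    """Reproduce a pattern in slightly (and randomly) variant form."""
    return [p + random.gauss(0.0, sigma) for p in pattern]

# A 'generation' of communicative patterns.
population = [[random.random() for _ in range(3)] for _ in range(20)]

for generation in range(50):
    ranked = sorted(population, key=task_success, reverse=True)
    survivors = ranked[: len(ranked) // 2]   # failures are not reproduced
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(len(ranked) - len(survivors))]

print(max(task_success(p) for p in population))
```

Note that nothing in the loop rewards passing for human or unmasking the machine; a pattern persists only insofar as it serves the shared task, which is how an adaptive function could accrue to it over the generations.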
Turing indeed hints at such a possibility when briefly discussing the "child machine" towards the end of [19, pp. 455–460] – a discussion that, in his essay, appears somewhat detached from the imitation game proper. For such patterns to evolve, the setup of the TT as a game of imitation and deception might have to be left behind – if only because imitation and deception, although certainly part of human communication, are not likely to constitute its foundation. Even on a fairly pessimistic view of human nature, they are parasitic on the adaptive functions of communication, which are more likely to be cooperative.9 Under this provision, humans and machines would be endowed with the task of trying to solve a cognitive or practical problem in co-ordinated, perhaps collaborative, fashion. In such a situation, the machine, intriguingly, would be conceived of neither as an instrument of human problem-solving nor as an autonomous agent that acts beyond human control. It would rather be embedded in a shared environment of interaction and communication that poses one and the same set of challenges to human and machine actors, with at least partly similar conditions of success. If that success is best achieved in an arrangement of symmetrical collaboration, the mechanisms of selection of behavioural patterns, the behavioural tasks and the price of failure would be comparable between human beings and machines. By means of this modified and repurposed TT, some of the functions of human communication could be systematically elucidated by means of an investigative simulation. That simulation would establish functional analogies between human and machine behaviour in quasi-evolutionary fashion.

9 For an account of the evolution of co-operative functions, see, for example, [12, ch. 2].

5 Conclusion

It might look like an irony that, while the author of the TT contributed to developing the notion of decoupling the function of a system from its physical structure, the very notion of functional analogy – by which, on the analysis presented in this paper, the common ground that human beings and machines would have to share in order to indicate what cognitive traits they may share can ultimately, at least in theory, be identified – did not enter that same author's focus. As indicated in section 4 above, putting the blame on Turing himself would be an act of historical injustice. At the same time however, my observations about the formalistic and individualistic biases built into Turing's computational method do nothing to belittle the merits of that method as such, as its practical implementations first allowed for devising computational models and simulations of a variety of functional patterns in a different medium, and as its theoretical implications invited systematic investigations into the physical underdetermination of functions in general. In some respects, it might have taken those biases to enter this realm in the first place.

References

[1] Margaret A. Boden, Mind as Machine: A History of Cognitive Science, Oxford University Press, Oxford, 2006.
[2] B. Jack Copeland, 'The Turing Test', Minds and Machines, 10, 519–539, (2000).
[3] The Essential Turing, ed., B. Jack Copeland, Oxford University Press, Oxford, 2004.
[4] B. Jack Copeland, 'The Church-Turing Thesis', in The Stanford Encyclopedia of Philosophy, The Metaphysics Research Lab, Stanford, spring 2009 edn., (2009).
[5] Charles Darwin, On The Origin of Species by Means of Natural Selection. Or the Preservation of Favoured Races in the Struggle for Life, John Murray, London, 1 edn., 1859.
[6] David J. Depew and Bruce H. Weber, Darwinism Evolving. Systems Dynamics and the Genealogy of Natural Selection, MIT Press, Cambridge/London, 1995.
[7] Kurt Gödel, 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I', Monatshefte für Mathematik, 38, 173–198, (1931).
[8] David Hilbert and Wilhelm Ackermann, Grundzüge der theoretischen Logik, J. Springer, Berlin, 1928.
[9] Andrew Hodges, 'What Did Alan Turing Mean by "Machine"?', in The Mechanical Mind in History, eds., Philip Husbands, Owen Holland, and Michael Wheeler, 75–90, MIT Press, Cambridge/London, (2008).
[10] Ernst Mayr, One Long Argument. Charles Darwin and the Genesis of Modern Evolutionary Thought, Harvard University Press, Cambridge, 1991.
[11] Ruth Garrett Millikan, Language, Thought, and Other Biological Categories, MIT Press, Cambridge/London, 1984.
[12] Ruth Garrett Millikan, Varieties of Meaning, MIT Press, Cambridge/London, 2004.
[13] James H. Moor, 'An Analysis of the Turing Test', Philosophical Studies, 30, 249–257, (1976).
[14] Stefano Nolfi and Dario Floreano, Evolutionary Robotics: The Biology, Intelligence and Technology of Self-Organizing Machines, MIT Press, Cambridge/London, 2000.
[15] Richard Owen, On the Archetype and Homologies of the Vertebrate Skeleton, John van Voorst, London, 1848.
[16] Philosophical Perspectives in Artificial Intelligence, ed., Martin Ringle, Humanities Press, Atlantic Highlands, 1979.
[17] D'Arcy Wentworth Thompson, On Growth and Form, Cambridge University Press, Cambridge, 2 edn., 1942.
[18] Alan M. Turing, 'On Computable Numbers, with an Application to the Entscheidungsproblem', Proceedings of the London Mathematical Society, s2-42, 230–265, (1936).
[19] Alan M. Turing, 'Computing Machinery and Intelligence', Mind, 59, 433–460, (1950).
[20] Alan M. Turing, 'The Chemical Basis of Morphogenesis', Philosophical Transactions of the Royal Society, B, 237, 37–72, (1952).
[21] Blay Whitby, 'The Turing Test: AI's Biggest Blind Alley?', in Machines and Thought, eds., Peter Millican and Andy Clark, volume 1 of The Legacy of Alan Turing, 53–62, Clarendon Press, Oxford, (1996).
[22] Eric B. Winsberg, Science in the Age of Computer Simulation, University of Chicago Press, Chicago, 2010.
[23] Larry Wright, 'Functions', Philosophical Review, 82, 139–168, (1973).

Battle of signifiants

Abstract

Memetics is a young science which is still establishing its relations with other sciences – we shall glance over a few of them. Since a huge number of the entities called memes also belong to the set of linguistic signs, the exchange between linguistics and memetics can be expected to be fecund. We shall therefore use basic linguistic terminology, especially the lexicological notion of the morpheme, defined as the «smallest unit having a meaning», as a source of metaphors which could be helpful in the interpretation of phenomena that could be labeled as «memetic».
By using this new theoretical framework, and by focusing only upon the relations among a few «allomemes» used for expressing happiness or amusement – for example emoticons like :-) and :) , or the morpheme «lol» – we shall try to offer the reader an illustrative example of our «back to the signs» method, which could possibly lead to the formalization of memetics, and thus to its firm establishment at the point of contact between the natural and the human sciences. We shall also show how the formal properties of a meme-sign – for example its length, or the number of possible modalities of its expression – influence its fitness, and thus its distribution in the population of hosts.

1. Introduction

"A scholar is just a library's way of making another library."
maxim of Daniel Dennett

1.1 Memetics

The inspiration for memetic science came from biology. The book The Selfish Gene, in which R. Dawkins introduced his meme concept as "a unit of cultural transmission or a unit of imitation" (Dawkins, 1976), was, in the first place, a book about the (socio)biological hypothesis that the basic components of the hereditary code – genes – «have their own interests» in the process of evolution. This «selfish interest of a gene» can sometimes be contrary to the interests of the other genes within the code, or even to the interest of the «hosting» entity as a whole. The «selfish» meme was thus created as an analogy to the «selfish» gene. Both memes and genes belong among the replicators – replicators being entities which have a tendency to make copies of themselves. Genes are molecular structures stored within the cellular nucleus; memes are – for the internalist school of memetics – somewhat vaguely defined «information structures» stored within the brain's synaptic networks, or are – for the school of externalists – «externalized» within material artifacts. Genes replicate by the processes of DNA/RNA translation and transcription; memes do it by completely different means – by the process of imitation. There are many other metaphoric parallels between genetics and memetics – the science of heredity was and will remain a great terminological and methodological inspiration. Some speak even of «memetic engineering»; others, inspired by virology, use the term «viruses of the mind» when speaking about memes like terrorism, drug addiction or... ideologies and religions. In nature, errors during the process of replication can be observed. These errors – mutations – lead to different properties of the replicated entities. If replication takes place within a system with limited resources which are essential for replication, the accumulation of differences caused by mutations will logically result in competition → selection → Darwinian evolution. Since the resources essential for meme replication – in other words, the «brain potential» of an individual or a population – are finite, the emergence of a second form of evolution is an inescapable consequence of the existence of these «brain-based» replicators. This «second form» of Darwinian evolution is, for some memeticians, a term synonymous with the cultural evolution known from anthropology. Memes are supposed to be for cultural evolution what genes are for the good old biological evolution of natural species. Memes are the basic units of cultural evolution. The ever-growing complexity of culture is a result of evolutionary forces acting upon memes.
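The chain competition → selection → Darwinian evolution described above can be condensed into a small simulation. The sketch below is ours, with arbitrary toy parameters: replicators are copied, with occasional errors, into a pool of fixed size, so any variant that replicates more reliably must crowd the others out.

```python
import random

POOL_SIZE = 200        # the finite "brain potential" of the population
MUTATION_RATE = 0.01   # probability of a copying error per replication

def fitness(variant):
    # Arbitrary toy criterion standing in for whatever makes a meme
    # more likely to be copied; not an empirical measurement.
    return 1.0 + 0.1 * variant.count("a")

population = ["ab"] * POOL_SIZE
for step in range(100):
    # Reproduction biased by fitness; the pool size stays fixed, so
    # every new copy of one variant displaces a copy of another.
    weights = [fitness(v) for v in population]
    population = random.choices(population, weights=weights, k=POOL_SIZE)
    # Copying errors (mutations) introduce new variants.
    population = [v + "a" if random.random() < MUTATION_RATE else v
                  for v in population]

print(max(set(population), key=population.count))  # the prevailing variant
```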
Whether memetics will become a firm science, or will lose its momentum and disappear like some bizarre postmodern pseudoscientific discipline, will strongly depend on progress within psychology and the cognitive sciences. If these disciplines are able to furnish memetics with some evidence of the material basis of replicators within the brain, memetics will flourish. While there is still-growing evidence that the entities within human memory are not passive aggregates of data, but active entities trying to get affirmed by repetitive expression,1 either in the form of an internal obsessive idea or an externalized habit (Hromada, 2007), the existence of memes within brains can be proven only indirectly. As one of the memetic pioneers puts it: «...instead of language based on a concrete mechanism of information storage, we must settle for an abstract representation of the information brainstored. Thus, memory abstractions form the basis for memetic evolution theory.» (Lynch, 1998) «The memetics movement split almost immediately into those who wanted to stick to Dawkins' definition of a meme as "a unit of information in the brain," and those who wanted to redefine it as observable cultural artifacts and behaviours.» (http://en.wikipedia.org/wiki/Memetics#Internalists_and_externalists). Since the only way to prove the existence of memes within the mind is introspection, the endeavours of psychology-oriented internalists, or intrapersonal memeticians, are doomed to battle with deep methodological problems. On the other side, the endeavours of sociology-oriented externalists, or interpersonal memeticians, can be based upon a huge amount of solid empirical data which can be analysed by well-known hypothesis-testing methods. The attitude of the interpersonal memetician is not to ask much about the events occurring within the brains of individuals, simply because these events are determined by too many variables. For an interpersonal memetician, a «meme» exists only if it is articulated, externalized, expressed into the world. By the act of expression, ephemeral brain content becomes an empirical object – a phoneme studied by the phonetician, a grapheme in a book, a character in a database or an artifact on a shelf. After measuring the populations and distributions of the different types of these objects, he tries to bring to light the dynamics inherent in these populations. In the ideal case, the dynamics of a memetic system can be explained by the properties of the system alone, reference to human beings as the hosts of the «memes» being only secondary. For such a scientist, it is the «will of the meme» which often determines the will of a man; much less often is it the other way around. This is the stance which will be applied in this article.

1 For an interpersonal memetician, a meme is conceived, in the first place, as an entity replicated by imitation among subjects. Intrapersonal memetics, on the contrary, studies memetic phenomena within one subject only, and thus the only way such a meme can be replicated within, or by, one subject is repetition. Studies of repetitive behaviour, like the circular reactions of babies (Piaget, 1961), lead to the establishment of intrapersonal memetics.

1.2 Linguistics

«For the sign always escapes, in a certain measure, the individual or social will – that is its essential characteristic» (Saussure, 1962).
If, within the preceding sentence, we commuted the word «sign» with the word «meme», we would obtain one of the central postulates of memetics. This citation, given by the founder of modern structuralism to his students almost a century ago, can be found within the chapter where Saussure tries to establish a new science called semiology, with the definition: «a science which studies the life of signs in the midst of social life». If we allowed ourselves to reduce memes to signs, would it not also be a beautiful definition of memetics? «A sign, broadly defined, designates an element X which represents another element Y, or serves as a substitute for it» (Niklas-Salminen, 1997). Accepting this definition, we may say that every «articulated meme» is a sign in the sense that it refers to the existence of the «brain structure» which gave birth to its articulation into the empirical world. Element X is an object or a set of objects within the world – a word, a picture, an observed behaviour – while element Y is a mental representation. The existence of this mental representation is the causa efficiens of the creation of element X. Since we could not have element X without element Y, the fact of X refers to the fact of Y.2 In such a way, memes as described in part 1.1 can always be conceived as signs. Thus, we can and we will find fruitful resemblances between semiology and memetics. Saussure himself was a linguist, in fact one of the best linguists of all time, and he developed the notion of a semiological science for one principal reason: he needed to situate the nascent linguistics somewhere. He situated it within semiology; linguistics thus became a «branch of semiology» dealing with linguistic signs. At the beginning of the 21st century, linguistics is, without any doubt, the most evolved semiological discipline. And more – it is a science whose object is phenomena produced by man, and thus a human science, but its methods are as rigorous as those of physics and its symbolic system is formalized – at least phonetics is, without any doubt, also a natural science. Within this article we will try to show that the highly evolved science of linguistics can play the role of a wise old sister for pubescent memetics. Just as linguistics reduces the wide field of interest of semiology to the field of linguistic signs alone, the «linguistic memetics» of this article will specialize only in «linguistic memes». The linguist analyzes language in relation to humans – he asks, for example, what functions language offers to people, and finds at least six answers (Jakobson, 1963; Yaguello, 1981) – while the memetician asks what language does for itself. Language can be understood as a huge, socially distributed complex of memes – S. Blackmore proposes the term «memeplex» (Blackmore, 1999). As we said earlier, a meme is a structure which copies itself from brain to brain. It can either be a rule (of behaviour), or the matter upon which this rule is applied.

2 The element X is in semiology called the signifier (le signifiant); the element Y – the mental representation, the concept – is called the signified (le signifié). «For the Post-Structuralists, the signifier is now the dominant unit and can be considered as analogous to the meme.» (Gatherer, 1997). Even though we do not consider ourselves to be post-structuralists only, this attitude will be applied in this article – the meme is for us much more identified with the signifier X than with the obscure signified Y.
In the current state of our knowledge, we do not dare to make any concrete observations or hypotheses when it comes to the grammatical rules of syntax. But since the attitude of the memetician is to expect that the «organization of brain structures» is imitated from «outside» rather than generated from «within», it can be expected that grammatical memetics could come into conflict with the generative grammarians of N. Chomsky. The other set of rules known to linguists, however – the so-called phonological rules – can serve as a wonderful source of data. Phonological rules are induced from corpora which mirror the language used within a given society during given periods of time. By comparing corpora from different times or different locations, one can observe changes within the speech habits of individuals – changes of accent, disappearances or additions of phonemes etc. Now, knowing that a habit is a meme, we can say, using the terms of memetics: looking at the corpora created and the rules induced by phonologists, one can observe the rise of new memes and their battle with the old ones for the time and energy of brains. One can observe the rise of a memetic infection, and its fall.

1.3 Lexical statistics and memetic dynamics

As we stated earlier, language is a memeplex3 consisting of two classes of memetic structures: memes acting as rules of grammar, and memes acting as the contents upon which these rules are applied – phonemes, morphemes and words. That grammatical rules and phonological habits are memes spread by imitation is obvious to anyone who has ever spent time doing fieldwork with babies. We hope to persuade the others during our future experiments, when our methods are more evolved and our grants higher than zero. In this textual experiment we will focus not on meme-rules, but on meme-contents.4 Moreover, since our corpus is textual and not auditory, it would be useless to focus our attention on the phonetic layer. We aim to find the basic units of memetic exchange, but to say that phonemes are the basic atoms of linguistic memetic exchange means, for us, taking the analysis too far. We are persuaded that the better candidates for memes lie at higher levels – memes are either morphemes or words.5 «The morpheme is defined as the smallest meaningful unit of the language» (Niklas-Salminen, 1997). The morpheme is sense embodied; it is a semantic domain condensed and expressed; it is the smallest possible signifier X for a signified Y. If we take, for example, the word «lover», the morpheme «lov» refers to the semantic domain related to acts of love, while the grammatical morpheme «er» refers to a semantic domain which could be described as «agent of the related action». Because the morpheme is, ex vi termini, a basic atom of linguistic exchange which has both a «signifier» and a «signified» side, it can possibly serve as a candidate for a meme.

3 The thesis that language is truly a memeplex – in other words, an entity which creates its own mechanisms catalysing its own replication – seems all the more true in light of the existence of institutions like, for example, CREDIF – Centre de recherche et d'études pour la diffusion du français.
4 Knowing that a content can also be described by a rule – by a rule of production of the given content – we do not see any major difference between the description of rules and the description of contents.
5 In fact there is one more unit which could possibly be a meme – the syllable. We leave it out as a potential candidate for a meme for the same reason as the phoneme: the syllable does not have a sense. Even though we will focus on signifiers and not signifieds, we cannot, and do not want to, pretend that the semantic level does not exist. It exists, and it is the raison d'être of language.
The problem with morphemes is that they are artificial constructs of linguists, similarly as atoms – however useful this notion is – are artificial constructs of physicists (and memes artificial constructs of memeticians ;). Normal humans, uninfected by linguistics, have a natural tendency to perceive the «word» as the basic meaning-carrying unit. Even though it is in fact not at all easy to define what a word is, and the ways of defining it are not at all evident (Niklas-Salminen, 1997), people would most probably divide a sentence into words, not morphemes, if asked to analyse the sentence into simpler elements which are not phonemes – even in the case of languages like French, where no accent or pause marks the difference between words and morphemes. And their answer as to why they did it like that could be a simple «because we felt it». The notion of the word as the «basic unit of something» is simply deeply rooted in man's mind. So it could also be a good candidate for a meme. We leave the question of whether it is the word or rather the morpheme which should be identified with the «meme» undecided within the scope of this article. The word is conceived as a lexical unit by the science called lexicology. This science, also relatively new but in any case in a more evolved state than memetics, could offer its fruits to that branch of memetics which deals with the evolution and distribution of memeplexes composed of linguistic signs. Here are a few lexicological terms whose application within memetics could be more than fruitful:

Word availability («disponibilité») – a «variable which depends not only on the subject's knowledge of the vocabulary of a given language, but also on the conditions under which it is expressed» (Niklas-Salminen, 1997). This notion, and the related studies of aphasics, could be useful especially in the sphere of intrapersonal, cognitive memetics. If we understand that the first essential step of memetic replication is expression, which necessarily begins with a word «coming to one's mind or mouth», we see that «word availability» can be identified with «the word's/meme's own tendency to get expressed and thus replicated».

Lexicon – the totality of all the words used within a given language. Virtually infinite. Potentially identifiable with the «memetic pool» within the frame of linguistic memetics.

Word frequency – the number of occurrences of a given word within a corpus. The concept of word frequency is the core term of lexical statistics. Lexical statistics «is the application of statistical methods to the description of a vocabulary» (Niklas-Salminen, 1997).
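Since word frequency is the operational core of everything that follows, a minimal sketch of the computation may be useful. The toy corpus is our invention, standing in for the real message database introduced in the next section.

```python
from collections import Counter
import re

# Reduce a corpus to word frequencies: tokenize, normalize case, count.
corpus = "the meme spreads and the meme mutates and the host imitates"
tokens = re.findall(r"\w+", corpus.lower())
frequencies = Counter(tokens)

print(frequencies.most_common(3))
# [('the', 3), ('meme', 2), ('and', 2)]
```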
It is important to notice that lexical statistics is a quantitative science, and thus mathematically formalizable. It is the discipline which aims to describe and compare textual corpora in terms of word frequencies, but these corpora are very often either artificial aggregates of «as much data as possible», like the Trésor de la langue française, or corpora composed of the artistic production of one author. With the progress of digital communication, the amount of real-life textual data grows fabulously – data in the form of emails, mobile phone messages, discussion forum submissions etc. We can, of course, analyse these data by the methods of lexical statistics, but our goal is different. We do not want to describe a vocabulary, nor to find out whether this or that man is the real or fictitious author of this or that book. We want to analyze the rates of change of meme frequencies within a corpus describing a real social system. We aim to discover the inherent dynamics of this system. Afterwards, we will be allowed to formulate hypotheses and generalize conclusions about what we have found. And last but not least, we want to create models which will allow us to predict the future state of a given social system.

2. The Experiment

The memes themselves are like fractals—they can apply to content as fine-grained as words, lines and study locations or as general as complete discourses, complete drawings and complete research articles. (Dirlam, 2003)

2.1 The corpus – the battlefield

Our corpus is our jewel. It is the database of a virtual community present on the domain http://kyberia.sk, the author of this article being its founder and senator. We have at our disposal all the data of this system from the day of its creation in 2001 until July 2007. This database has two principal sections: a section of «data nodes» – forums, articles, blogs, friendship commentaries etc. – of which there are 2,266,901 in total, created by 8,517 users; and a second section comprising the 3,385,647 messages exchanged among 3,870 users. Being only at the beginning of our research, we have decided to leave aside the much more complex6 section composed of data nodes, in order to focus on the less complex but (from the quantitative point of view) richer set – the messages exchanged among the users. Kyberia.sk, rated as «the first successful community server in Slovakia» (http://pocitace.sme.sk/clanok.asp?cl=3509119), is above all a dense social network. The relations among the users are far from being only virtual – it is not a set of anonymous persons, but a true human mini-society, a veritable «temporary autonomous zone» (Hakim Bey, 1984). The fact that, in order to become part of the domain, the community has to accept the registration request of the «novice» makes kyberia a «semi-autonomous» zone, which leads to the creation of a collective identity – «we are part of the tribe of kyberia». There is, evidently, a great number of methodological approaches that one could use, and infinitely many phenomena that one could study, with respect to the community of kyberia.

6 It is thanks to the complexity and multicontextuality of this system of data nodes that we allowed ourselves to propose the precursor of the first law of internal memetics: «The probability of the repeated articulation of a sign is inversely proportional to the period of time which has elapsed since the last articulation of that sign, and this independently of context» (Hromada, 2007). We have devoted other articles to the ethical problems of our research.
The lexicologist will find there the emergence of new words and affixes; the anthropologist will find, perhaps, an ever stronger tendency towards endogamy; the sociologist, knowing the composition of the community, could draw from it general hypotheses about the human society in general in which kyberia is immersed. He could doubtless do so, as the social network of Kyberia is a mirror of urban Slovak-Czech society, and the database of kyberia is a mirror of this social network. We will use this textual database as a sample of empirical data. To this sample we will apply the methods of lexical statistics, and we will try to interpret our results in memetic terms.

2.2 :-) & :) – The actors

Globally, one can predict that if two words are used in exactly the same contexts, one of them will tend to disappear or to change its meaning. (Niklas-Salminen, 1997)

In the current state of our knowledge, we cannot say whether the meme, as the content of a linguistic exchange, is to be identified with the word or with the morpheme; we will therefore focus on monomorphemic linguistic signs, which are morphemes and words at once. Even though our sample contains millions of messages created by thousands of users over more than 4 years, it is preferable to choose for our first analyses linguistic signs whose frequencies of use are rather high, for we want to apply statistical methods, and the more data we have, the more truth these methods will bring to our results. Since the community of kyberia is composed of users speaking several languages, we must either take this fact into account and never forget it while interpreting our results, or simply concentrate on universal linguistic signs which go beyond the domain of any individual language. As we agree with the old maxim «Simplicity is the ultimate sophistication» (Da Vinci), we have chosen the second path, which leads us naturally to the set of linguistic signs7 called «emoticons». An emoticon is defined as «a representation, in typographic characters, of an emotion» (http://fr.wikipedia.org/wiki/%C3%89motic%C3%B4ne). During their 25 years of existence, emoticons have managed to invade the brains of billions of inhabitants of the Earth. They are thus memes par excellence, well adapted to the milieu of human society thanks to their resemblance to the human face, and their extreme efficiency makes them worthy of our attention. We do not want to mix apples and oranges in our analyses, so we will look only at the «smileys» signifying «the emotion of amusement, laughter, happiness».
Emoticons such as :-( , ;-( or :( will be a priori excluded from our analyses, since their signified – in other words, the emotion they express, the inner state of the host – is different. In this first analysis we will even exclude emoticons like ;-) , ;) or ;-] , whose high frequency of use, in comparison with the rest of the Internet, is a particularity of kyberia. Although such an analysis would be fruitful, we are not sure whether, in the cases of ;-) and :-) for example, we are dealing with emoticons coding the same emotion – having the same function. In short, we are not sure whether or not they are synonymous emoticons. Our doubt stems from the difference between the characters coding the eyes: while the colon in :) looks «normal» to everybody, the semicolon in ;) often carries the connotation of «refined banter». In brief – it is possible that they are not synonyms. And our first step will be the analysis of synonyms – we define synonyms as two different signifiers whose semantic domains overlap, intersect, are almost identical. They serve the same function8 in the language and in life – they can be commuted without a change of meaning. From the memetic point of view, we say that synonyms are allomemes9 if they are present in the same host. The host, in memetics, is a human being whose brain contains the neural representation of a meme and who spreads the meme through his activity, voluntary or involuntary. We call expression the act of the host which transforms the mental representation into an empirical object, perceptible by other potential hosts. The emoticon :-) , which we will also call the «classical smiley», and the emoticon :) , which we will also call the «occidental smiley», are synonyms – they refer to the same semantic domain, to the same intention of a speaker who wants to use them to express positive emotions or to employ a «light» register. They are allomemes if they are present in the brain of one and the same speaker. How can we know whether they are really present there, without being obliged to crack open the skull of the latter? Simply. By observing the behaviour resulting from the expression of a meme by a subject, we can be sure that the brain of this subject contains an internal representation of this meme. If somebody has written :) anywhere and at any time in the past, and if we know it, we perceive him as a host of the occidental smiley – that is our methodological simplification.10 If, in another situation, the same subject expressed another meme, we would consider him a host of the latter as well.

7 We leave aside the terminological discussions of whether emoticons are in fact motivated linguistic signs or symbols. Their primordial forms – with the colon for the eyes, the dash for the nose and the parenthesis for the mouth – were, without doubt, strongly motivated. In other words, at the moment of their creation there was «a relation of formal resemblance between the form of the representing object and that of the represented object» (Niklas-Salminen, 1997); they were thus symbols. One of the aims of this essay is to show how the formal resemblance between signifier and signified withers away, increasing arbitrariness, thanks to the forces of memetic evolution.
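The methodological simplification just stated – whoever has ever written a given smiley is counted as its host – is straightforward to operationalize. The sketch below is ours; the message records and field names are hypothetical stand-ins for the kyberia message table.

```python
import re

# A user counts as a host of a smiley variant as soon as one of his
# messages contains it.
CLASSICAL = re.compile(r":-\)")   # :-) , the "classical" smiley
OCCIDENTAL = re.compile(r":\)")   # :)  , the "occidental" smiley
# (":-)" does not contain the substring ":)", so the two patterns
# cannot fire on the same token)

messages = [
    {"author": "alice", "text": "nice one :)"},
    {"author": "bob",   "text": "hehe :-) indeed"},
    {"author": "alice", "text": "see you :)"},
]

hosts = {"classical": set(), "occidental": set()}
for message in messages:
    if CLASSICAL.search(message["text"]):
        hosts["classical"].add(message["author"])
    if OCCIDENTAL.search(message["text"]):
        hosts["occidental"].add(message["author"])

print(hosts)  # {'classical': {'bob'}, 'occidental': {'alice'}}
```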
Today it is believed that the brain of a healthy subject can harbour a theoretically unlimited number of memes. If we know that a subject is the host of two words which play the same role in the subject's community (i.e. they are synonyms), these two words stand in an allomemic relation to each other for that subject. For a given subject[11] and a given period of time, we can measure the number of uses of these words in our corpus: their frequencies. In the case of synonyms, the allomeme whose frequency of use is higher is called dominant, the other recessive. If I know that the western smiley was expressed 23 times by subject X during the year 2007, while the classical smiley was expressed 42 times by the same subject over the same period, then the classical smiley is dominant over the western one.

[8] "The meaning of a word is its use" (Benveniste, 1974); "If we had to name anything which is the life of the sign, we should have to say that it was its use" (Wittgenstein, 1958).

[9] Allomeme is a neologism motivated by the term "allele", familiar to geneticists: "An allele is a given variant of a gene within a species. All alleles of one gene occupy the same locus (location) on the same chromosome" (http://fr.wikipedia.org/wiki/All%C3%A8le). Inspired by this definition, we define allomemes as "variants existing as neural representations in a human brain which represent the same intention or semantic domain and which are expressed through the act of expression by their host". Our notion is close to Boyd and Richerson's "cultural variant"; but whereas for them a "cultural variant" is a shorthand for "the information stored in human brains ... once people replace the concepts of folk psychology with scientific and reliable concepts" (Boyd & Richerson, 2005), the allomeme refers to expressed behaviour. Moreover, the concept of the allomeme becomes useful only where memes stand in opposition: it will aid our analyses only where there are at least two allomemic expressions of one semantic locus or one intention.

[10] He counts as a host even if he wrote the word by accident, for instance by making a mistake. It can indeed happen that he reproduces the same mistake, and what was at first only a mistake becomes a habit. In that case we speak of an involuntary memetic mutation, the kind which once lay at the root of almost all cultural diversity.

[11] Or for a set of subjects.

In Drawing 1 we see the first visualization of the kyberia data. Each column X represents the diachronic activity of one user; each row Y represents a period of time (one may say that it describes the community from the synchronic point of view). If during period Y user X more often used the classical smiley, we place turquoise at position X,Y; if the western smiley was his dominant allomeme, we place violet there; if he used neither of the two allomemes, we leave the position black. An image describing the changes of meme frequencies and distributions over time we call a memogram.

Drawing 1: A memogram of the kyberia community. The X coordinate represents a user, the Y coordinate a period of time; the colour at position X,Y represents the dominant allomeme of user X during period Y: violet for the dominance of the western smiley :) , turquoise for the classical smiley.
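The construction just described can be restated as a short program; the sketch below uses hypothetical per-user, per-period counts (the real analysis would read them from the kyberia database) and gives ties to the western smiley, a detail the text leaves open:

```python
# A minimal memogram sketch over invented counts -- not the original analysis code.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# counts[user][period] = (classical ":-)" count, western ":)" count); hypothetical data
counts = {
    "userA": {0: (5, 1), 1: (3, 4), 2: (0, 6)},
    "userB": {0: (0, 2), 1: (0, 0), 2: (1, 5)},
}

users = sorted(counts)
periods = sorted({p for u in counts for p in counts[u]})

# 0 = neither allomeme used (black), 1 = classical dominant (turquoise), 2 = western dominant (violet)
matrix = np.zeros((len(periods), len(users)), dtype=int)
for x, u in enumerate(users):
    for y, p in enumerate(periods):
        classical, western = counts[u].get(p, (0, 0))
        if classical == western == 0:
            matrix[y, x] = 0
        elif classical > western:
            matrix[y, x] = 1
        else:
            matrix[y, x] = 2  # ties go to the western smiley here (an assumption)

plt.imshow(matrix, cmap=ListedColormap(["black", "turquoise", "violet"]),
           vmin=0, vmax=2, aspect="auto")
plt.xlabel("user X"); plt.ylabel("period Y")
plt.show()
```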
We may notice a set of columns on the right which are entirely violet: a large number of users, roughly six tenths of all of them, for whom the classical smiley was never the dominant allomeme.

Drawing 2: A memogram visualizing the same data as Drawing 1, but sorted differently. A violet pixel represents dominance of the western smiley, a turquoise one dominance of the classical smiley.

In Drawing 2 we see the same data ordered differently. We may notice a small set of entirely violet columns, a small slope on the right, representing the users who never used the classical smiley at all: roughly one tenth of all of them. Let us now look at Drawing 3. Each red point represents the ratio Y between the population of active hosts[12] of the western smiley and the population of active hosts of the classical smiley during week X.

[12] An active host of a meme, for a given period, is a host who actually expressed the meme during that period. By contrast, a latent host is one who, as far as we know, expressed the meme before, but did not express it during the period under observation.

Drawing 3: The Y coordinate of each red point represents the number of people who expressed the western smiley during week X, divided by the number of people who expressed the classical smiley during the same week X.

We may notice that this ratio grows gradually over time. Applying one of the simplest methods of statistical analysis, the method of least squares (http://en.wikipedia.org/wiki/Least_squares), we computed the regression coefficient β = 0.027. Roughly speaking, abstracting away from the chaotic oscillations, the ratio between :) and :-) grows by 0.027 with every passing week.
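The regression itself takes only a few lines; in the sketch below the weekly ratios are invented placeholders standing in for the measured series behind Drawing 3:

```python
# Ordinary least-squares fit of the weekly western/classical host ratio -- illustrative data.
import numpy as np

weeks = np.arange(10)                                                  # week index X
ratio = np.array([3.0, 3.1, 2.9, 3.2, 3.3, 3.2, 3.5, 3.4, 3.6, 3.7])  # invented ratios Y

beta, intercept = np.polyfit(weeks, ratio, 1)  # slope and intercept of the least-squares line
print(f"regression coefficient beta = {beta:.3f} per week")
```

On the kyberia series this slope came out as β = 0.027.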
This gradual growth may be produced by:

- a decrease in the number of active hosts of the meme :-) ;
- an increase in the number of active hosts of the meme :) ;
- a combination of the preceding factors.

Drawings 1 and 2 show that there are many more users who have never written the classical smiley than users who have never written the western one. Could this fact be linked, to some extent, to the phenomenon of linear change we observe here? We maintain that it can, and we shall try to shed some light on this link. We claim that both phenomena result from the fact that the western smiley is a more stable form than the classical smiley. A mathematician of chaos would say that :) is a stronger attractor than :-) ; Darwinists might say that the "fitness" of the western smiley is higher than that of the classical one. In other words, it is more probable that someone will move from the classical smiley towards other allomemes, including the western smiley, than that someone will move from the western smiley towards other allomemes. We see at least two reasons for this alleged "stability":

1. Stability due to Boyd & Richerson's "frequency-based bias":[13] in Drawing 3 we see that, from the very beginning of the community's existence, the frequency of the western smiley was at least 3 times higher than that of the classical one. If the "frequency-based bias" algorithm influences human behaviour in some way, this initial difference in frequencies may have led to its own autocatalysis. If man really has a tendency to imitate the memes most common around him, the fact that :) was four times more widespread at the community's very beginning has the consequence that a new or undecided user adopts this variant, and that a user of :-) is subject to stronger pressure to change his habit than a user of :) .

[13] "Frequency-based bias: the use of commonness or rarity of a cultural variant as a basis for choice. For example, the most advantageous variant is often likely to be the commonest. If so, a conformity bias is an easy way to acquire the correct variant" (Boyd & Richerson, 2005).

2. Stability due to the properties of the sign/meme itself: the western smiley consists of two characters, the classical smiley of three. The host must therefore invest more energy to write :-) than to write :) . What is more, the probability of making a mistake while typing it is 3/2 times higher for :-) (three keystrokes instead of two, under a uniform per-keystroke error rate), so the host must also invest more energy into possible corrections. If the character which makes the difference (the nose, the dash) brought some advantage with it, this investment could make sense: "Speakers make this effort only insofar as it pays off" (Yaguello, 1991). But this effort does not pay off. The dash carries no new information. The colon informs the interlocutor that the string of characters to follow, if they are not alphanumeric, will be an emoticon; one may say that the colon plays an almost grammatical role. The parenthesis supplies the semantic content of the emoticon: is it an expression of sadness or of happiness? But the dash tells us nothing. It is merely a wreck of the primary motivation. It is redundant. It does not pay off: it will either find its own new distinctive signification, or disappear. Such are the laws of linguistics, and such are also the laws of memetics.
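The two mechanisms can be combined into a toy model; in the sketch below, every parameter (population size, initial shares, the chance that a dropped nose sticks as a habit) is invented, so it illustrates the direction of the argument rather than the kyberia data:

```python
# A toy model of the two stability mechanisms -- all parameters are invented.
import random

random.seed(1)
hosts = [":)"] * 600 + [":-)"] * 200  # initial 3:1 advantage of the western smiley
NOSE_LOSS = 0.01                      # chance that a typo dropping the nose sticks as a habit

for step in range(50_000):
    i = random.randrange(len(hosts))
    # frequency-based bias: copying a randomly chosen member favours the commoner variant
    hosts[i] = random.choice(hosts)
    # involuntary memetic mutation (cf. note 10): the three-keystroke form loses its nose
    if hosts[i] == ":-)" and random.random() < NOSE_LOSS:
        hosts[i] = ":)"

share = hosts.count(":)") / len(hosts)
print(f"western share after drift: {share:.2f}")
```

Both forces push in the same direction, so the classical smiley's share can only erode; removing the mutation line leaves pure conformity drift, under which either variant may still fixate by chance.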
Even though we do not wish to deprive the first explanation of the honour that legitimately belongs to it, we permit ourselves to focus our dear reader's attention on the second one. It is, in a certain sense, fundamentally different: while the explanation by "frequency-based bias" must appeal to the initial distribution of allomemes in the community, in the second case we need appeal only to the properties of the sign itself. Here is the first demonstration of how a strictly formal (and objective) property of a sign, the composition of the signifier, influences the number of infected hosts.

Whichever explanation is chosen, it should be noted that we have never resorted to explanations by a speaker's own will. Looking closely at Drawings 1 & 2, we notice cases where a user changed his habit at a given moment and never went back to the old habit again. This may well be a user who managed, by his own reason, to master the influence of the memetic forces around him. But such cases are exceptional and hard to attain. Far more often we watch a man succumb to the cultural forces around him. In fact, it may well be impossible for a human being not to succumb to them. Phonology, with its studies of the propagation and transformation of languages, dialects, accents and of the rules underlying them, gives us an abundance of examples. One cannot say no to an accent when one is immersed in it; one can hardly say no to a dictatorship of religion or ideology when everybody around believes in it. Interpersonal memetics thus studies social forces and their effects on the behaviour of human beings while abstracting from human interests. It is the interests and the properties of memes and their complexes that count, and that win in the long run. Memetics asks:

1. What are the causes[14] of the change from an old habit/rule/word X to a new habit/rule/word Y?
2. What are the causes for staying with the habit/rule/word Y?

[14] Cause and reason are not the same thing. Reasons are reflected upon; causes are blind, they follow natural law. Reasons pertain rather to logical and semantic causality, causes to physical causality. Reasons have their reasons; social phenomena have their causes.

And it tries to find the answers by studying the properties of the habits/rules/words themselves. It tries to determine the "fitness" of a given meme for a given environment and, from this knowledge, it tries to predict the future state of that environment.

2.3 The predator and the prey

"How many people are actually 'laughing out loud' when they send LOL?" (Crystal, 2001)

Drawing 4: The curves show the evolution of the monthly populations of active hosts of the memes "lol" (turquoise) and ":-)" (violet) in the kyberia.sk community. The red lines mark the principal "communication crises": whenever all the curves drop, we may speak of a communication crisis.

Its name is "lol". Nobody is sure what this signifier means. Wikipedia gives interpretations such as "laughing out loud", "laugh out loud", "lots of luck" or "lots of love". The author of this article long believed that "lol" meant "lots of laugh". No matter: lol is powerful even without a fixed meaning. For phoneticians it is merely a syllable beginning and ending with the same approximant "l" (a consonant that is Liquid, fLuid, Ludic), voiced by the close-mid back rounded vowel "o", a sound a little darker and less marked than "a" or "i", yet still very powerful. A signifier freed of its signified; an attractive form without content. It came into the world as a written form. In a certain sense it was lucky: before the Internet no such word existed in English, nor in German, nor in French. There was a hole, and it filled it. Only the word "l'Olanda" (the Netherlands in Italian) came close, but Italian has never been a language that shaped the evolution of the Internet. Let us now look at Drawing 4 to see its history in the kyberia.sk community. We may notice that during the first year the meme "lol" was less widespread than the classical smiley.
But the gap narrowed after the "first communication crisis",[15] and the population sizes became almost equal after the second crisis. Then we see a duel lasting several months, during which the classical smiley more or less managed to maintain a small lead. But after the very small third crisis, "lol" finally achieved a larger population than that of the classical smiley. The rise of the "lol" population is much steeper than that of :-) after the first "great traditional interruption of kyberia's functioning", designated as the fourth crisis, and likewise after the fifth and the sixth. We may therefore generalize and say that the meme "lol" spread much more effectively during communication crises. In other words, the fitness[16] of the meme "lol" during a communication crisis is higher than that of the meme ":-)". What might be the reasons for this phenomenon? We permit ourselves to offer a simple and potentially falsifiable hypothesis:[17] while the classical smiley and other emoticons can be expressed only in written form, the meme "lol" also has at its disposal another form, another modality of expression: the modality of speech. Indeed, one can observe an invasion of "lol" and its derived forms (cf. Appendix 1) into spoken language. A kyberia communication crisis affects only the functioning of the web system; the members of the community are therefore forced to resort to other means of communication if they want to exchange their memes. And not only do they want to exchange their memes: it is the very nature of memes to push their hosts towards their expression. It is the very nature of the human brain to want to express its contents, and if one modality of expression cannot be used, another one is used. When the kyberia software is down, memes which can be expressed only in writing are severely handicapped. Not only is proliferation into new hosts no longer possible; their power weakens within their past hosts. They will quite simply be forgotten. In other words: the more the frequency of a meme in the external world decreases → the less entrenched its neural representation in the hosts' brains[18] → the less available the word → the smaller the probability of a future expression → the further the frequency decreases. In short: an auto-inhibition. Other memes which do not suffer from this handicap and which fulfil more or less the same function soon come to take its place. The exchange of memes continued to exist during every crisis; it merely changed form: the graver the software crisis, the further the barycentre of communication shifted towards spoken communication.

[15] We call a "communication crisis" a period in the history of the system during which the total number of memes exchanged among the members of the system is reduced. In the case of kyberia's history, communication crises were often caused by server problems, by transitions to new versions of the software, or by other events rooted largely outside the kyberia system. In contrast to a communication crisis, the normal state of the kyberia system is characterized by growth of the total population, caused by the influx of new users.

[16] "The overall survival and proliferation rate of a meme m can be expressed as the meme fitness F(m), which measures the average number of memes at moment t divided by the average number of memes at the previous time step or 'generation' t − 1." (Principia Cybernetica Web, http://pespmc1.vub.ac.be/MEMEFITN.html)

[17] I thank my friend Lubos Iskra, whose idea it was to explain this phenomenon by the handicap of the written modality.

[18] A faithful reader might object that a decrease in the external frequency of a meme does not automatically lead to a weakening of the associated neural representation, provided that the internal frequency stays high, for example thanks to meditation. The objection is entirely pertinent within the framework of an intrapersonal memetics, but it lies beyond the scope of this article, which deals with questions raised by interpersonal memetics. We thank our reader for her exceptional understanding.
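Note 16's definition can be restated compactly; writing N(m, t) for the number of copies of meme m present at generation t, the fitness reads:

```latex
% Restatement of note 16: N(m,t) = number of copies of meme m at generation t
F(m) = \frac{N(m,\,t)}{N(m,\,t-1)}
```

On this reading, F(m) > 1 over a given period means the meme is proliferating and F(m) < 1 that it is receding; the claim above is that during a communication crisis the fitness of "lol" exceeds that of ":-)".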
This spoken communication naturally favoured the proliferation of "lol", which is a short and strong signifier; its fitness is truly great. After all, who would say "colon, dash, parenthesis" when meeting friends (and potential hosts) in a bar? Every personal encounter potentially served as an incubator of "lol". We are thus looking at the consequences of the simple fact that the meme "lol" can be expressed by the mouth, whereas the meme :-) can be expressed only by the activity of the hands. Here is the second demonstration of how a strictly formal (and objective) property of a sign itself, the composition of the signifier, influences the number of infected hosts, the success of a meme as a meme. We shall therefore end our little excursion with the first postulate of our little theory:

The fitness of a meme is proportional to the number of expression modalities through which that meme can be expressed. The consequences are most visible when one or several of the expression modalities are restricted.

3. Extroduction

Vaecitryaḿ prákrtadharmah samánaḿ na bhaviśyati: Diversity, not identity, is the law of nature. (Anandamurti, 1961)

We are building our terminology, we are formulating our first postulates; we are in the process of establishing a science. First an empirical science, since it is from empirical data that we draw our hypotheses. Then a formalized and mathematized science. A human science, for the object of its interest is man and his activity. A social science, for the object of its interest is man within social life. But also a natural science, insofar as we aspire to predict the future. Imagine the trio of tetragrams BRHM, ALLA and JHVH, and a human community incapable of articulating the consonant H. Ceteris paribus, we could predict one aspect of the future state of this community knowing only:

- the first postulate of our theory;
- the properties of the observed memes: JHVH contains two H's, BRHM one, ALLA none;
- the properties of the "environment": the hosts cannot produce glottal fricatives.

We would then know that, over a sufficiently long horizon, it is the form ALLA that will succeed in infecting the greatest number of host brains within this population; BRHM will come second, and JHVH will lose. All this because the speakers cannot pronounce an H.
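Under these three assumptions the prediction can even be computed; the sketch below implements a crude replicator dynamic in which every H multiplies a form's fitness by an arbitrary illustrative penalty of 0.5:

```python
# Toy replicator dynamics for the tetragram example -- the 0.5 penalty per H is invented.
PENALTY_PER_H = 0.5
memes = {"BRHM": 1, "ALLA": 0, "JHVH": 2}            # number of H's in each form

fitness = {m: PENALTY_PER_H ** h for m, h in memes.items()}
share = {m: 1 / 3 for m in memes}                    # equal initial shares

for generation in range(50):                         # shares grow with relative fitness
    total = sum(share[m] * fitness[m] for m in memes)
    share = {m: share[m] * fitness[m] / total for m in memes}

print(sorted(share.items(), key=lambda kv: -kv[1]))  # ALLA first, then BRHM, then JHVH
```

Whatever penalty between 0 and 1 is chosen, the ordering ALLA > BRHM > JHVH is forced by the H-counts alone.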
To draw any connection between this simplified example and the real world would be pushing things too far. "Ceteris paribus" is never attained in the real world, and certainly not in human life: so many variables, so many relations, so much chaos! The science being born here will reach its goal only on condition that its goal be a modest one. To hope that anyone could predict the future of humanity is a madman's dream! But if we remain modest, we may perhaps discover, or even build, islands of order within the chaos of data. Modest like the linguist who, after having looked at his corpus for years, says: "Ladies and gentlemen, the difference between closed O and open O is disappearing; one of them will therefore vanish entirely within a horizon of 23 years", we permit ourselves to state a banal piece of knowledge: "Ladies and gentlemen, the two experiments we have just carried out show that the classical smiley is dying."

We are only at the beginning of our story, and so we find it very hard to accept death, even the death of signs: our priority is life rather than death. We therefore propose to save the classical smiley by giving it a well-defined signified. Which one? Here is the answer, in the form of a metalinguistic definition: ":-) is for us an emoticon of sincere happiness; it is an attempt to describe the true state of our face at the moment we type it on our keyboard. Expressing it has cost us more energy than expressing a cheap ';)' or an aggressive 'lol', and we know well where we have invested that energy." If we follow our definition rigorously, we shall refill a wounded form with meaning. The classical smiley will not die for us. And if we succeed in spreading this definition among other magnificent beings, it will not die for them either. That day will be not only a day of rebirth for our old friend :-) , but also a great day for applied memetics, the true birth of "memetic engineering".

We cannot content ourselves with studying the past. It is against the test of the present and of the near future that we must confront our results. (Asimov, 1993)
References

Anandamurti, Sh. (1961). Ananda Sutram. http://en.wikipedia.org/wiki/Ananda_Sutram
Asimov, I. (1993). L'aube de la fondation / Forward the Foundation. Nightfall Inc.
Blackmore, S. (1999). The Meme Machine. Oxford: Oxford University Press.
Crystal, D. (2001). Language and the Internet. Cambridge: Cambridge University Press.
Dawkins, R. (1976). The Selfish Gene. Oxford: Oxford University Press.
de Saussure, F. (1972). Cours de linguistique générale. Paris: Éditions Payot.
Dirlam, D. K. (2003). Competing Memes Analysis. Journal of Memetics - Evolutionary Models of Information Transmission, 7.
Gatherer, D. (1997). Macromemetics: Towards a Framework for the Re-unification of Philosophy. Journal of Memetics - Evolutionary Models of Information Transmission, 1.
Hakim Bey (1984). The Temporary Autonomous Zone, Ontological Anarchy, Poetic Terrorism. http://www.hermetic.com/bey/taz_cont.html
Hromada, D. (2007). Moja prva rozprava o metode. http://node.nel.edu/?node_id=6823
Jakobson, R. (1963). Essais de linguistique générale. Paris.
Lynch, A. (1998). Units, Events and Dynamics in Memetic Evolution. Journal of Memetics - Evolutionary Models of Information Transmission, 2.
Niklas-Salminen, A. (1997). La lexicologie. Paris: Armand Colin.
Piaget, J. (1962). La psychologie de l'intelligence. Paris: Colin.
Richerson, P. J. and Boyd, R. (2005). Not by Genes Alone: How Culture Transformed Human Evolution. Chicago; London: University of Chicago Press.
Wittgenstein, L. (1958). The Blue and Brown Books (notes dictated to Cambridge students in 1933-35).
Yaguello, M. (1991). Alice au pays du langage. Paris: Éditions du Seuil.

Appendix 1 - Forms containing the trigram "lol", in upper or lower case. Inflected forms of Slovak words and existing names are in italics; particularly amusing forms are in bold.

Form        Expressions   Hosts   Comment
lol         21984         828
LOL         5009          426
lolo        1596          228
lola        201           87
lolitka     107           61
lolitky     107           60
megalol     197           49
LoL         289           46
Lol         90            45
lolita      74            45
LOLO        67            37
lolek       39            31      a character from the Polish cartoon Lolek & Bolek
lolovia     37            22
lolko       39            21
lolitku     25            19
lolu        23            17
lolka       39            17
lolitu      23            16
lolik       56            15
lolitiek    21            15
MEGALOL     48            15
lolino      54            15
Lola        19            15
lolol       34            14
loll        24            14
lololol     18            14
Lolo        22            14
lolitkou    13            13
lolity      15            13
megaLOL     19            12
lololo      19            12
lolinko     36            12
lole        23            12
Lolita      14            12
instalol    13            10
lolom       15            10

(The list continues with several hundred rarer forms, each attested by fewer than ten hosts, down to hapax legomena; among the glossed curiosities are skloly ‹skola = school›, pondelol ‹pondelok = Monday›, mohlol ‹mohol = he could›, oklolo ‹okolo = around›, zavlolat ‹zavolat = to call›, evlolucia ‹evolution› and trilologia ‹a trilology›.)
1 loluju 1 1 jololos 1 1 pololeziac 1 1 lolain 1 1 najdlolzitejsie 1 1 lolosh 1 1 aaaaaaaaaaaaaa aaaaaaaaaaaaa LOL 1 1 BROKOLOLO LOLOLICUU UUU 1 1 filologickou 1 1 nestretlol 1 1 LOLOLOLOO OOOOOL 1 1 psilolo 1 1 loloolololololo gopeeed 1 1 lolica 1 1 Lolipope 1 1 lolipap 2 1 lolinek 3 1 monglolsky 1 1 psychoLOLgie 1 1 LOLsosovu 1 1 nevadilolo 1 1 lol___jo 1 1 lolmao 1 1 lollloll 1 1 Skontrloluj 1 1 pololelegal 1 1 lolllll 1 1 pololeviej 1 1 lolitjek 1 1 lolpozdrav 1 1 trilolololobit 1 1 sololuitoch 1 1 neodoslalololol ololo 1 1 lolisko 1 1 marcelolm 1 1 lolaaaaaaaaaaa aa 1 1 pololegalni 1 1 sotalol 1 1 lolinovat 1 1 lolbot 1 1 lolololooooooo ol 1 1 ZLOl 1 1 pololegalnej 1 1 lolololololololll llll 1 1 olalola 1 1 dlolezite 1 1 lolvlastne 1 1 filologie 1 1 LOLako 1 1 Pololezmo 1 1 lol__se 1 1 ohlolit 1 1 neLOLuj 1 1 butylolakron 1 1 LOLyPOP 1 1 lolingonthefloo rlaughing 1 1 lolofisku 2 1 vyjdeLOLOLO LOLOLOLOL OLOLOLOLO LOLOLOLOL OLOLOLOL 1 1 lolacafe 1 1 vizulol 2 1 lollllllllllllllllllll llllllllllllllllllllll lllllllllllllllll 1 1 lolco 1 1 haloloo 1 1 lOOLLLol 1 1 lolitami 1 1 lolposh 1 1 lols 1 1 Lolovi 1 1 felolem 2 1 chcelol 1 1 sedelol 1 1 urlologii 1 1 elolvastam 1 1 trilolobeatom 1 1 elol 1 1 lolcetung 1 1 metoprolol 1 1 naklolko 1 1 rychlolahko 1 1 philology 2 1 hyperlol 1 1 mololko 1 1 kaloly 1 1 Mao Tse tung? JADT’ 18 PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON STATISTICAL ANALYSIS OF TEXTUAL DATA JADT’ 18 PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON STATISTICAL ANALYSIS OF TEXTUAL DATA (Rome, 12-15 June 2018) Vol. I UniversItalia 2018 PROPRIETÀ LETTERARIA RISERVATA Copyright 2018 - UniversItalia - Roma ISBN 978-88-3293-137-2 A norma della legge sul diritto d’autore e del codice civile è vietata la riproduzione di questo libro o di parte di esso con qualsiasi mezzo, elettronico, meccanico, per mezzo di fotocopie, microfilm, registra-tori o altro. Le fotocopie per uso personale del lettore possono tuttavia essere effettuate, ma solo nei limiti del 15% del volume e dietro pagamento alla SIAE del compenso previsto dall’art. 68, commi 4 e 5 della legge 22 aprile 1941 n. 633. Ogni riproduzione per finalità diverse da quelle per uso personale deve essere autorizzata specificatamente dagli autori o dall’editore. Program Committee Ramón Álvarez Esteban: Univ. of León, E Valérie Beaudouin: Telecom ParisTech, F Mónica Bécue: Poly. Univ. of Catalunya, E Sergio Bolasco: Sapienza Univ. of Rome, I Isabella Chiari: Sapienza Univ. of Rome, I François Daoust, UQÀM, Montreal, CDN Anne Dister, FUSL, Bruxelles / UCL, Louvain, B Jules Duchastel: UQÀM, Montreal, CDN Serge Fleury: Univ. Paris 3, F Cédrick Fairon: UCL, Louvain, B Luca Giuliano: Sapienza Univ. of Rome, I Serge Heiden, ENS, Lyon, F Domenica Fioredistella Iezzi, Univ. of Tor Vergata, I Margareta Kastberg, Univ. of Franche Comté, F Ludovic Lebart: CNRS / ENST, Paris, F Jean-Marc Leblanc: Univ. of Créteil, F Alain Lelu: Univ. of Franche Comté, F Dominique Longrée, Univ. of Liège, B Véronique Magri: Univ. of Nice Sophia-Antipolis, F Pascal Marchand: Univ. of Toulouse, F William Martinez: Univ. of Lisboa, P Damon Mayaffre: CNRS, Nice, F Sylvie Mellet: CNRS, Nice, F Michelangelo Misuraca: Univ. of Calabria, I Denis Monière: Univ. of Montréal, CDN Bénédicte Pincemin: CNRS, Lyon, F Céline Poudat: Univ. of Nice Sophia-Antipolis, F Pierre Retinaud: Univ. of Tolouse, F André Salem: Univ. Paris 3, F Monique Slodzian: Inalco, F Arjuna Tuzzi: Univ. 
of Padua, I Mathieu Valette: Inalco, F Organising Committee Domenica Fioredistella Iezzi: Univ. of Tor Vergata, I Sergio Bolasco: Sapienza Univ. of Rome, I Livia Celardo: Sapienza Univ. of Rome, I Isabella Chiari: Sapienza Univ. of Rome, I Francesca della Ratta: ISTAT, I Fiorenza Deriu: Sapienza Univ. of Rome, I Francesca Dolcetti: Sapienza Univ. of Rome, I Andrea Fronzetti Colladon: Univ. of Tor Vergata, I Francesca Greco: Sapienza Univ. of Rome, I Isabella Mingo: Sapienza Univ. of Rome, I Michelangelo Misuraca: Univ. of Calabria, I Arjuna Tuzzi: Univ. of Padua, I Maurizio Vichi: Sapienza Univ. of Rome, I Francesco Zarelli: ISTAT, I Local Organisation Francesco Alò, Giulia Giacco, Paolo Meoli, Vittorio Palermo, Viola Talucci Table of contents Introduction ............................................................................................................... XVII Acknowledgements ....................................................................................................XIX Invited Speakers GERMAN KRUSZEWSKI Memorize or generalize? Searching for a compositional RNN in a haystack Adam Liška ......................................................................................................... XXIII BING LIU Scaling-up Sentiment Analysis through Continuous Learning .................. XXIV PASCAL MARCHAND La textométrie comme outil d’expertise : application à la négociation de crise. ................................................................ XXV GEORGE K. MIKROS Author Identification Combining Various Author Profiles. Towards a Blended Authorship Attribution Methodology ............................................................. XXVI ROBERTO NAVIGLI From text to concepts and back: going multilingual with BabelNet in a step or two ....................................................................... XXVII Contributors MOTASEM ALRAHABI1, CHIARA MAINARDI1 Identification automatique de l’ironie et des formes apparentées dans un corpus de controverses théâtrales ........................................................................... 1 MOHAMMAD ALSADHAN, SASCHA DIWERSY, AGATA JACKIEWICZ, GIANCARLO LUXARDO Migrants et réfugiés : dynamique de la nomination de l'étranger ................... 10 R. ALVAREZ-ESTEBAN, M. BÉCUE-BERTAUT, B. KOSTOV, F. HUSSON, J-A SÁNCHEZ-ESPIGARES Xplortext, a R package. Multidimensional statistics for textual data science . 19 ELENA, AMBROSETTI, ELEONORA MUSSINO, VALENTINA TALUCCI L'evoluzione delle norme: analisi testuale delle politiche sull'immigrazione in Italia ........................................................................................................................... 26 VIII JADT’ 18 MASSIMO ARIA, CORRADO CUCCURULLO A bibliometric meta-review of performance measurement, appraisal, management research ............................................................................................. 35 LAURA ASCONE Textual Analysis of Extremist Propaganda and Counter-Narrative: a quantiquali investigation ................................................................................................... 44 LAURA ASCONE, LUCIE GIANOLA Analyse de données textuelles appliquée à des problématiques de sécurité et d'enquête judiciaire ................................................................................................. 52 SIMONA BALBI, MICHELANGELO MISURACA, MARIA SPANO A two-step strategy for improving categorisation of short texts ..................... 
60 CHRISTINE BARATS, ANNE DISTER, PHILIPPE GAMBETTE, JEAN-MARC LEBLANC, MARIE PERES Appeler à signer une pétition en ligne : caractéristiques linguistiques des appels ........................................................................................................................ 68 MANUEL BARBERA, CARLA MARELLO Newsgroup e lessicografia: dai NUNC al VoDIM .............................................. 76 IGNAZIA BARTHOLINI Techniques for detecting the normalized violence in the perception of refugee / asylum seekers between lexical analysis and factorial analysis...................... 83 PATRIZIA BERTINI MALGARINI, MARCO BIFFI, UGO VIGNUZZI Dal corpus al dizionario: prime riflessioni lessicografiche sul Vocabolario storico della cucina italiana postunitaria (VoSCIP) ............................................ 90 MARCO BIFFI Strumenti informatico-linguistici per la realizzazione di un dizionario dell’italiano postunitario ........................................................................................ 99 ANNICK FARINA, RICCARDO BILLERO Comparaison de corpus de langue « naturelle » et de langue « de traduction » : les bases de données textuelles LBC, un outil essentiel pour la création de fiches lexicographiques bilingues........................................................................ 108 FELICE BISOGNI, STEFANO PIRROTTA Il rapporto tra famiglie di anziani non autosufficienti e servizi territoriali: un'analisi dei dati esploratoria con l'Analisi Emozionale del Testo (AET) .... 117 ANTONELLA BITETTO, LUIGI BOLLANI Esperienza di analisi testuale di documentazione clinica e di flussi informativi sanitari, di utilità nella ricerca epidemiologica e per indagare la qualità dell'assistenza......................................................................................................... 126 GUIDO BONINO, DAVIDE PULIZZOTTO, PAOLO TRIPODI Exploring the history of American philosophy in a computer-assisted framework .............................................................................................................. 134 JADT’ 18 IX MARC-ANDRE BOUCHARD, SYLVIA KASPARIAN La classification hiérarchique descendante pour l’analyse des représentations sociales dans une pétition antibilinguisme au Nouveau-Brunswick, Canada .................................................................................................................... 142 LIVIA CELARDO, RITA VALLEROTONDA, DANIELE DE SANTIS,CLAUDIO SCARICI, ANTONIO LEVA Analysing occupational safety culture through mass media monitoring..... 150 BARBARA CORDELLA, FRANCESCA GRECO, PAOLO MEOLI,VITTORIO PALERMO, MASSIMO GRASSO Is the educational culture in Italian Universities effective? A case study ...... 157 MICHELE A. CORTELAZZO, GEORGE K. MIKROS, ARJUNA TUZZI Profiling Elena Ferrante: a Look Beyond Novels .............................................. 165 FABRIZIO DE FAUSTI, MASSIMO DE CUBELLIS, DIEGO ZARDETTO1 Word Embeddings: a Powerful Tool for Innovative Statistics at Istat .......... 174 Gibbons A. (1985). Algorithmic Graph Theory. Cambridge University Press. . 182 VIVIANA DE GIORGI, CHIARA GNESI Analisi di dati d’impresa disponibili online: un esempio di data science tratto dalla realtà economica dei siti di e-commerce ................................................... 183 ALESSANDRO CAPEZZUOLI, FRANCESCA DELLA RATTA, STEFANIA MACCHIA,MANUELA MURGIA, MONICA SCANNAPIECO, DIEGO ZARDETTO The use of textual sources in Istat: an overview ................................................ 
192 FRANCESCA DELLA RATTA, GABRIELLA FAZZI, MARIA ELENA PONTECORVO, CARLO VACCARI, ANTONINO VIRGILLITO Twitter e la statistica ufficiale: il dibattito sul mercato del lavoro ................. 200 SAMI DIAF Gauging An Author’s Mood Using Hidden Markov Chains ......................... 209 MARC DOUGUET Les hémistiches répétés ........................................................................................ 215 FRANCESCA DRAGOTTO, SONIA MELCHIORRE «Mangiata dall’orco e tradita dalle donne». Vecchi e nuovi media raccontano la vicenda di Asia Argento, tra storytelling e Speech Hate ............................. 223 CRISTIANO FELACO, ANNA PAROLA Il cosa e il come del processo narrativo. L’uso combinato della Text Analysis e Network Text Analysis al servizio della precarietà lavorativa ....................... 233 ANA NORA FELDMAN Hablando de crisis: las comunicaciones del Fondo Monetario Internacional 242 VALERIA FIASCO Brexit in the Italian and the British press: a bilingual corpus-driven analysis ...................................................................... 250 VIVIANA FINI, GIUSEPPE LUCIO GAETA, SERGIO SALVATORE Textual analysis to promote innovation within public policy evaluation .... 259 X JADT’ 18 ALESSIA FORCINITI, SIMONA BALBI A proposal for Cross-Language Analysis: violence against women and the Web ................................................................ 268 BEATRICE FRACCHIOLLA, OLINKA SOLENE DE ROGER La verbalisation des émotions ............................................................................. 276 LUISA FRANCHINA, FRANCESCA GRECO, ANDREA LUCARIELLO, ANGELO SOCAL, LAURA TEODONNO Improving Collection Process for Social Media Intelligence: A Case Study . 285 ANDREA FRONZETTI COLLADON, JOHANNE SAINT-CHARLES, PIERRE MONGEAU The impact of language homophily and similarity of social position on employees’ digital communication ..................................................................... 293 MATTEO GERLI Looking Through the Lens of Social Sciences: The European Union in the EUFunded Research Projects Reporting .................................................................. 300 LUCIE GIANOLA, MATHIEU VALETTE Spécialisation générique et discursive d’une unité lexical L’exemple de joggeuse dans la presse quotidienne régionale ................................................... 312 PETER A. GLOOR, JOAO MARCOS DE OLIVEIRA, DETLEF SCHODER The Transparency Engine – A Better Way to Deal with Fake News .............. 319 FRANCESCA GRECO, LEONARDO ALAIMO, LIVIA CELARDO Brexit and Twitter: The voice of people.............................................................. 327 FRANCESCA GRECO, GIULIO DE FELICE, OMAR GELO A text mining on clinical transcripts of good and poor outcome psychotherapies ..................................................................................................... 335 FRANCESCA GRECO, DARIO MASCHIETTI, ALESSANDRO POLLI DOMINIO: A Modular and Scalable Tool for the Open Source Intelligence 343 LEONIE GRÖN, ANN BERTELS, KRIS HEYLEN Is training worth the trouble? A PoS tagging experiment with Dutch clinical records..................................................................................................................... 351 FRANCE GUERIN-PACE, ELODIE BARIL Les outils de la statistique textuelle pour analyser les corpus de données d’enquêtes de la statistique publique .......................... 
359 SERGE HEIDEN Annotation-based Digital Text Corpora Analysis within the TXM Platform 367 DANIEL HENKEL Quantifying Translation : an analysis of the conditional perfect in EnglishFrench comparable-parallel corpus..................................................................... 375 DANIEL DEVATMAN HROMADA Extraction of lexical repetitive expressions from complete works of William Shakespeare ............................................................................................................ 384 JADT’ 18 XI OLIVIER KRAIF, JULIE SORBA Spécificités des expressions spatiales et temporelles dans quatre sous-genres romanesques (policier, science-fiction, historique et littérature générale) .... 392 CYRIL LABBE, DOMINIQUE LABBE Les phrases de Marcel Proust .............................................................................. 400 LUDOVICA LANINI, MARÍA CARLOTA NICOLÁS MARTÍNEZ Verso un dizionario corpus-based del lessico dei beni culturali: procedure di estrazione del lemmario ....................................................................................... 411 DANIELA LARICCHIUTA, FRANCESCA GRECO, FABRIZIO PIRAS, BARBARA CORDELLA, DEBORA CUTULI, ELEONORA PICERNI, FRANCESCA ASSOGNA, CARLO LAI, GIANFRANCO SPALLETTA, LAURA PETROSINI “The grief that doesn’t speak”: Text Mining and Brain Structure 419 GEVISA LA ROCCA, CIRUS RINALDI Icone gay: tra processi di normalizzazione e di resistenza. Ricostruire la semantica degli hashtag........................................................................................ 428 LUDOVIC LEBART Looking for topics: a brief review......................................................................... 436 GAËL LEJEUNE, LICHAO ZHU Analyse Diachronique de Corpus : le cas du poker .......................................... 444 JULIEN LONGHI, ANDRE SALEM Approche textométrique des variations du sens ............................................... 452 LAURENT VANNI1, DAMON MAYAFFRE, DOMINIQUE LONGREE ADT et deep learning, regards croisés. Phrases-clefs, motifs et nouveaux observables ............................................................................................................. 459 LUCIE LOUBERE Déconstruction et reconstruction de corpus... À la recherche de la pertinence et du contexte ......................................................................................................... 467 HEBA METWALLY L’apport du corpus-maquette à la mise en évidence des niveaux descriptifs de la chronologie du sens. Essai sur une Série Textuelle Chronologique du Monde diplomatique (1990-2008). ....................................................................................... 474 JUN MIAO, ANDRE SALEM Séries textuelles homogènes................................................................................. 491 SILVIO MIGLIORI, ANDREA QUINTILIANI, DANIELA ALDERUCCIO, FIORENZO AMBROSINO, ANTONIO COLAVINCENZO, MARIALUISA MONGELLI, SAMUELE PIERATTINI, GIOVANNI PONTI SERGIO BOLASCO, FRANCESCO BAIOCCHI, GIOVANNI DE GASPERIS TaLTaC in ENEAGRID Infrastructure................................................................ 501 ISABELLA MINGO, MARIELLA NOCENZI The dimensions of Gender in the International Review of Sociology. A lexicometric approach to the analysis of the publications in the last twenty years ........................................................................................................................ 
ADIEL MITTMANN, ALCKMAR LUIZ DOS SANTOS
The Rhythm of Epic Verse in Portuguese From the 16th to the 21st Century

DENIS MONIERE, DOMINIQUE LABBE
Le vocabulaire des campagnes électorales

CYRIELLE MONTRICHARD
Faire émerger les traces d'une pratique imitative dans la presse de tranchées à l'aide des outils textométriques

ALBERT MORALES MORENO
Evolución diacrónica de la terminología y la fraseología jurídico-administrativa en los Estatutos de autonomía de Catalunya de 1932, 1979 y 2006

CEDRIC MOREAU
Comment penser la recherche d'un signe pour une plateforme multilingue et multimodale français écrit / langue des signes française ?

JEAN MOSCAROLA, BORIS MOSCAROLA
ADT et visualisation, pour une nouvelle lecture des corpus. Les débats de 2ème tour des Présidentielles (1974-2017)

MAURIZIO NALDI
A conversation analysis of interactions in personal finance forums

STEFANO NOBILE
Analisi testuale, rumore semantico e peculiarità morfosintattiche: problemi e strategie di pretrattamento di corpora speciali

DANIEL PELISSIER
L'individu dans le(s) groupe(s) : focus group et partitionnement du corpus

BENEDICTE PINCEMIN, CELINE GUILLOT-BARBANCE, ALEXEI LAVRENTIEV
Using the First Axis of a Correspondence Analysis as an Analytical Tool. Application to Establish and Define an Orality Gradient for Genres of Medieval French Texts

CELINE POUDAT
Explorer les désaccords dans les fils de discussion du Wikipédia francophone

MATTHIEU QUIGNARD, SERGE HEIDEN, FREDERIC LANDRAGIN, MATTHIEU DECORDE
Textometric Exploitation of Coreference-annotated Corpora with TXM: Methodological Choices and First Outcomes

PIERRE RATINAUD
Amélioration de la précision et de la vitesse de l'algorithme de classification de la méthode Reinert dans IRaMuTeQ

LUISA REVELLI
Il parametro della frequenza tra paradossi e antinomie: il caso dell'italiano scolastico

PIERGIORGIO RICCI
How Twitter emotional sentiments mirror on the Bitcoin transaction network

CHANTAL RICHARD, SYLVIA KASPARIAN
Analyse de contenu versus méthode Reinert : l'analyse comparée d'un corpus bilingue de discours acadiens et loyalistes du N.-B., Canada

VALENTINA RIZZOLI, ARJUNA TUZZI
Bridge over the ocean: Histories of social psychology in Europe and North America. An analysis of chronological corpora

LOUIS ROMPRE, ISMAÏL BISKRI
Les « itemsets fréquents » comme descripteurs de documents textuels
CORINNE ROSSARI, LJILJANA DOLAMIC, ANNALENA HÜTSCH, CLAUDIA RICCI, DENNIS WANDEL
Discursive Functions of French Epistemic Adverbs: What can Correspondence Analysis tell us about Genre and Diachronic Variation?

VANESSA RUSSO, MARA MARETTI, LARA FONTANELLA, ALICE TONTODIMAMMA
Misleading information in online propaganda networks

ELIANA SANANDRES, CAMILO MADARIAGA, RAIMUNDO ABELLO
Topic modeling of Twitter conversations

FRANCESCO SANTELLI, GIANCARLO RAGOZINI, MARCO MUSELLA
What volunteers do? A textual analysis of voluntary activities in the Italian context

S. SANTILLI, S. SBALCHIERO, L. NOTA, S. SORESI
A longitudinal textual analysis of abstracts presented at the Italian Association for Vocational Guidance and Career Counseling conferences from 2002 to 2017

JACQUES SAVOY
A la poursuite d'Elena Ferrante

JACQUES SAVOY
Regroupement d'auteurs dans la littérature du XIXe siècle

STEFANO SBALCHIERO, ARJUNA TUZZI
What's Old and New? Discovering Topics in the American Journal of Sociology

NILS SCHAETTI, JACQUES SAVOY
Comparison of Neural Models for Gender Profiling

LIONEL SHEN
Segments répétés appliqués à l'extraction de connaissances trilingues

SANDRO STANCAMPIANO
Misurare, Monitorare e Governare le città con i Big Data

FADILA TALEB, MARYVONNE HOLZEM
Exploration textométrique d'un corpus de motifs juridiques dans le droit international des transports

JAMES M. TEASDALE
The Framing of the Migrant: Re-imagining a Fractured Methodology in the Context of the British Media

MARJORIE TENDERO, CECILE BAZART
Results from two complementary textual analysis software (Iramuteq and Tropes) to analyze social representation of contaminated brownfields

MATTEO TESTI, ANDREA MERCURI, FRANCESCO PUGLIESE
Multilingual Sentiment Analysis

JUAN MARTÍNEZ TORVISCO
A linguistic analysis of the image of immigrants' gender in Spanish newspapers

FRANCESCO URZÌ
Lo strano caso delle frequenze zero nei testi legislativi euroistituzionali

SYLVIE VANDAELE
Les traductions françaises de The Origin of Species : pistes lexicométriques

PIERRE WAVRESKY, MATTHIEU DUBOYS DE LABARRE, JEAN-LOUP LECOEUR
Circuits courts en agriculture : utilisation de la textométrie dans le traitement d'une enquête sur 2 marchés

MARIA ZIMINA, NICOLAS BALLIER
On the phraseology of spoken French: initial salience, prominence and lexicogrammatical recurrence in a prosodic-syntactic treebank Rhapsodie
Abstracts

FILIPPO CHIARELLO, GUALTIERO FANTONI, ANDREA BONACCORSI, SILVIA FARERI
What kind of contributions does research provide? Mapping issue based statements in research abstracts

FILIPPO CHIARELLO, GIACOMO OSSOLA, GUALTIERO FANTONI, ANDREA BONACCORSI, ANDREA CIMINO, FELICE DELL'ORLETTA
Technical sentiment analysis: predicting the success of new products using social media

FIORENZA DERIU, DOMENICA FIOREDISTELLA IEZZI
Citizens and neighbourhood life: mapping population sentiment in Italian cities

FRANCESCA DI CARLO, ROSY INNARELLA, BRIZIO LEONARDO TOMMASI
Vax network: profiling influential nodes with social network analysis on twitter

DAVIDE DONNA
Alteryx

VALERIO FICCADENTI, ROY CERQUETI, MARCEL AUSLOOS
Complexity of US President Speeches

PETER A. GLOOR
Measuring the Dynamics of Social Networks with Condor

IOLANDA MAGGIO, DOMENICA FIOREDISTELLA IEZZI, MATTEO FATIGHENTI
"BIG DATA" Words Trend Analysis using the multidimensional analysis of texts

MARIO MASTRANGELO
Itinerari turistici, network analysis e text mining

MARIA FRANCESCA ROMANO, GUIDO REY, ANTONELLA BALDASSARINI, PASQUALE PAVONE
Text Mining per l'analisi qualitativa e quantitativa dei dati amministrativi utilizzati dalla Pubblica Amministrazione

ALESSANDRO CESARE ROSA
Taglio cesareo e Vbac in Italia al tempo dei Big Data: una proposta di ulteriore contributo informativo

Introduction

The International Conference on the Statistical Analysis of Textual Data (JADT, Journées d'Analyse statistique des Données Textuelles) has reached its 14th edition. It was held for the third time in Rome, from 12 to 15 June 2018, organized by the DII (Department of Enterprise Engineering "Mario Lucertini") at Tor Vergata University of Rome and the DSS (Department of Statistical Sciences) at Sapienza University of Rome. This biennial conference has steadily gained importance since its first edition in Barcelona (1992), through the editions of Montpellier (1994), Rome (1996), Nice (1998), Lausanne (2000), Saint-Malo (2002), Louvain-la-Neuve (2004), Besançon (2006), Lyon (2008), Rome (2010), Liège (2012), Paris (2014) and Nice (2016). Every two years, the JADT conference has presented the state of the art concerning theories, problems, methods, algorithms, software and applications in several domains, all sharing a quantitative approach to the study of lexical, textual, pragmatic or discursive features of information expressed in natural language.
The proceedings of the 2018 conference collect 113 contributions by 243 scholars from 15 countries. The papers are open to all scholars and researchers working in the field of textual data analysis, ranging from lexicography to the analysis of political discourse, from information retrieval to marketing research, from computational linguistics to sociolinguistics, and from text mining to content analysis. The invited speakers focused on the central topics of the conference, discussing open and new themes such as machine learning algorithms for profiling social media users, new multilingual approaches, textometry, and authorship. The proceedings are ordered alphabetically by the surname of the first author of each contribution. This edition introduced several innovations with respect to the past. In a round table, we discussed the past, present and future of statistical analysis of textual data and text mining methods, examining the points of view of both universities and enterprises. The papers, which underwent a review process with two and sometimes three reviewers, are at most six pages long: the idea is that they are not yet in their final version, and that the exchange with other scholars during the conference will lead to improvements. For the first time, a selection of extended papers presented at the JADT conference will be published, after a further reviewing process, in a book published by Springer and in several special issues of acknowledged journals (Advances in Data Analysis and Classification, International Review of Sociology, Italian Journal of Applied Statistics, Social Indicators Research, RPC Rivista di Psicologia Clinica). Enhancing the papers discussed during the JADT conference will allow the scholarly community to keep its network of contacts active and its exchanges lively.

D. Fioredistella Iezzi, Livia Celardo, Michelangelo Misuraca

Acknowledgements

We express our gratitude to the 56 reviewers who offered their assistance in selecting and anonymously reviewing the papers of this volume: Massimo Aria, Barbara Baldazzi, Nadia Battisti, Valérie Beaudouin, Sergio Bolasco, Etienne Brunet, Mónica Bécue, Isabella Chiari, Livia Celardo, Michele Cortelazzo, Pasquale Del Vecchio, Francesca Della Ratta, Fiorenza Deriu, Anne Dister, Francesca Dolcetti, Annick Farina, Serge Fleury, Andrea Fronzetti, Luca Giuliano, Peter Gloor, Francesca Greco, Francesca Grippa, Serge Heiden, D. Fioredistella Iezzi, Antonio Iovanella, Sylvia Kasparian, Margareta Kastberg, Dominique Labbé, Ludovica Lanini, Alexei Lavrentev, Ludovic Lebart, Jean-Marc Leblanc, Alain Lelu, Dominique Longrée, Véronique Magri, Pascal Marchand, Damon Mayaffre, Sylvie Mellet, Silvia Micheli, Michelangelo Misuraca, Denis Monière, Gianluca Murgia, Pasquale Pavone, Bénédicte Pincemin, Céline Poudat, Pierre Ratinaud, Piergiorgio Ricci, Maria Francesca Romano, Johanne Saint-Charles, André Salem, Massimiliano Schiraldi, Max Silberztein, Maria Spano, Arjuna Tuzzi, Mathieu Valette, Ramón Álvarez Esteban. JADT2018 was held under the patronage of ISTAT (Istituto Nazionale di Statistica – National Institute of Statistics). We are also very grateful to the following sponsors: ISTAT, Le Sphinx, The Information Lab, Master in Data Science at Tor Vergata University, Prisma.
As regards the organisation of the conference, we would like to thank all the members of the local organising team: Francesco Alò, Silvia Castellan, Giulia Giacco, Paolo Meoli, Vittorio Palermo, Viola Talucci. Special thanks go to Livia Celardo, Isabella Chiari, Andrea Fronzetti Colladon, Francesca Della Ratta, Fiorenza Deriu, Francesca Dolcetti and Francesca Greco for the organisation of the special tracks on Official Statistics, Linguistics, Applications in the social and psychological domains, and Social Network and Semantic Analysis.

Invited Speakers

Memorize or generalize? Searching for a compositional RNN in a haystack
Adam Liška, German Kruszewski
Facebook – germank@fb.com

Abstract
Machine learning systems have made rapid progress in the past few years, as evidenced by the remarkable feats they have accomplished in fields as diverse as computer vision and reinforcement learning. Yet, impressive as these achievements are, they rely on learning algorithms that require orders of magnitude more data than a human learner would. This disparity could be rooted in many different factors. In this talk, we will draw on the hypothesis that compositional learning, that is, the ability to recombine previously acquired skills and knowledge to solve new problems, could be one important element of fast and efficient learning (Lake et al., 2017). In this direction, we will discuss our ongoing efforts towards building systems that can learn in compositional ways. Concretely, we will present a simple benchmark based on function composition to measure the compositionality of learning systems, and use it to draw insights into whether current learning systems learn, or can learn, in a compositional manner.
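As an illustration of what a function-composition benchmark of this kind can look like, here is a minimal sketch under our own assumptions: the atomic string operations and the train/test split are hypothetical and are not the authors' actual benchmark.

```python
# Hypothetical function-composition benchmark: a learner that has seen the
# atomic functions is tested on unseen two-step compositions of them.
import itertools

# Atomic "skills": simple string transformations.
ATOMS = {
    "reverse": lambda s: s[::-1],
    "double":  lambda s: s + s,
    "drop1":   lambda s: s[1:],
}

def compose(names):
    """Return the function applying the named atoms left to right."""
    def f(s):
        for n in names:
            s = ATOMS[n](s)
        return s
    return f

def make_split():
    """Train on atomic tasks, test on unseen 2-step compositions."""
    train = [((n,), ATOMS[n]) for n in ATOMS]
    test = [(names, compose(names))
            for names in itertools.permutations(ATOMS, 2)]
    return train, test

train, test = make_split()
for names, f in test:
    print(names, f("abc"))  # gold outputs a compositional learner must match
```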
Scaling-up Sentiment Analysis through Continuous Learning
Bing Liu
University of Illinois at Chicago – liub@uic.edu

Abstract
Sentiment analysis (SA), or opinion mining, is the computational study of people's opinions, sentiments, emotions and evaluations. Owing to numerous research challenges and almost unlimited applications, SA has been a very active research area in natural language processing and text mining. In this talk, I will first give a brief introduction to SA, and then discuss some major difficulties of the current technologies when one wants to perform sentiment analysis in a large number of domains, e.g., for all products sold in a large retail store. To tackle this scaling-up problem, I will describe our recent work on lifelong machine learning (LML, or lifelong learning), which tries to enable the machine to learn like humans: learning continuously, retaining or accumulating the knowledge learned in the past, and using that knowledge to help future learning and problem solving. This paradigm is well suited to SA and can help scale SA up to a large number of domains with little manual involvement.

La textométrie comme outil d'expertise : application à la négociation de crise
Pascal Marchand
Université de Toulouse – pascal.marchand@iut-tlse3.fr

Résumé
To assess the relevance of textometric practice in field settings and as an expertise tool, we study real exchanges involving the negotiators of police intervention forces in contexts of barricaded subjects, hostage takings, terrorism, or high-risk suicidal intent. We approach negotiation through the dynamics of lexical choice, seeking to map the lexicon, cluster text segments, and compare the profiles of speakers and situations. We thus propose to answer the following questions: Are there recurrent themes across crises? Is there a lexical chronology of a crisis? How are emotions managed? What is specific to "radicalised" situations? Objectifying the exchanges and bringing out formal sequences can then support diagnosis, with the aim of deriving concrete elements for debriefing and for formalising the practices of professional negotiators.

Author Identification Combining Various Author Profiles. Towards a Blended Authorship Attribution Methodology
George K. Mikros
National and Kapodistrian University of Athens – gmikros@gmail.com

Abstract
The aim of this presentation is to describe a new method for attributing texts to their real authors using combined author profiles, modern computational stylistic methods based on shallow text features (n-grams), and machine learning algorithms. Until recently, authorship attribution and author profiling were considered similar methods, with nearly identical feature sets and classification algorithms but different aims: the former identifies the author's identity, while the latter detects the author's characteristics such as gender, age or psychological profile. The two methods have been used independently, for different research aims and in different real-life tasks. In this talk, however, we will present a unified methodological framework in which standard authorship attribution and author profiling are combined, so that we can approach more effectively open or semi-open authorship attribution problems: a category, known as authorship verification, that is particularly difficult to tackle with present computational stylistic methods. More specifically, we will present preliminary research results from applying this blended methodology to a real semi-open authorship problem, the Ferrante authorship case. Using a corpus of 40 modern Italian literary authors compiled by Arjuna Tuzzi and Michele Cortelazzo of the University of Padua (Tuzzi & Cortelazzo, under review), we will explore the dynamics of author profiling by gender, age and region, and the various ways the extracted profiles can be combined to infer the identity of the real author behind Ferrante's books. Moreover, we will extend this methodology and validate its usefulness on social media texts using the English Blog Corpus (Argamon, Koppel, Pennebaker, & Schler, 2007). Using simulated authorship attribution scenarios (one in which the real author is included in the training data, and one in which the real author is missing from the training corpus), we will further evaluate the proposed blended methodology, which opens exciting new possibilities for investigating author identities in both closed and open authorship attribution tasks.
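The shallow-feature approach described here can be approximated with a generic character n-gram baseline. The following sketch is not Mikros's system; the texts and labels are placeholders, and the pipeline simply illustrates how n-gram features feed a standard classifier.

```python
# Generic character-n-gram authorship baseline (illustrative, not the
# blended methodology itself): tf-idf over character 2-3-grams plus a
# logistic-regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["first text by author A", "first text by author B",
         "second text by author A", "second text by author B"]
authors = ["A", "B", "A", "B"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),  # shallow features
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, authors)
print(clf.predict(["a disputed text"]))        # closed-set attribution
print(clf.predict_proba(["a disputed text"]))  # soft, profile-style scores
```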
From text to concepts and back: going multilingual with BabelNet in a step or two
Roberto Navigli
Sapienza University of Rome – roberto.navigli@uniroma1.it

Abstract
Multilinguality is a key feature of today's Web, and it is the feature that we leverage and exploit in our research work at the Sapienza University of Rome's Linguistic Computing Laboratory, which I will overview and showcase in this talk. I will describe the most recent developments of the BabelNet technology. I will introduce BabelNet Live, the largest continuously updated multilingual encyclopedic dictionary, and then discuss a range of cutting-edge industrial use cases implemented by Babelscape, our Sapienza startup company, including multilingual interpretation of terms, multilingual concept and entity extraction from text, and cross-lingual text similarity.

Contributors

Identification automatique de l'ironie et des formes apparentées dans un corpus de controverses théâtrales
Motasem Alrahabi (1), Chiara Mainardi (2)
(1) Université Paris-Sorbonne Abu Dhabi – motasem.alrahabi@gmail.com
(2) Université Sorbonne Nouvelle – chiara.mainardi@univ-paris3.fr

Abstract
This paper presents the results of an automatic analysis of a corpus of French texts about theatre debates (16th to 19th centuries). The purpose of this study is to highlight the important role of different forms of irony in the theatre controversy and to reveal the standpoint of authors and established authorities towards theatre performances. Despite the difficulty of the task, our research shows encouraging results. This unprecedented comparison of texts in which authors condemn or approve of the theatre enables a broader understanding of the authors' positions, arguments and rhetorical strategies relating to the theatre controversies.

Résumé
This article presents the results of our automatic analysis of a corpus of debates on the theatre (16th to 19th century). The aim of the study is to illustrate the important role played by the different forms of irony in the polemic around the theatre, and to bring out the position of the authors, or of the ancient authorities they cite, towards stage performances. The results obtained are encouraging despite the difficulty of the task; they allow us to compare, in an unprecedented way, the texts of the theatre's defenders with those of its detractors, and to better understand certain arguments and authorial strategies in the field of the controversy.

Keywords: irony, theatre, linguistic markers, semantic annotation, rule-based system.

1. Introduction
We propose an automatic analysis of a French-language corpus gathering debates on the theatre from the mid-16th century to the 1840s. Our aim is to illustrate the important role played by expressions of irony in the polemic around the theatre, and to bring out the position of the authors, or of the ancient authorities they cite, towards stage performances. We first present the linguistic resources we developed, the annotation tool used, and the corpus; we then comment on the results of the automatic analysis and, before concluding, we discuss the perspectives of this ongoing project.

2. Premises on irony
Irony is a linguistic device used to convey a message that is directly or indirectly opposed to what is literally said.
Widely studied in philosophy, rhetoric and linguistics (Berrendonner, Sperber and Wilson, Kerbrat-Orecchioni, Ducrot, Grice...), irony is a heterogeneous concept that is extremely difficult to define, owing to its many forms and to the complexity of the phenomena at play. Irony works through cues left by the speaker for the hearer, allowing the latter to grasp the speaker's intentions via more or less marked parallelisms, contradictions, exaggerations and hyperboles. These cues, often pragmatic or extralinguistic, are more or less evident, hence the importance of taking into account the context (referent, speaker, hearer...), shared knowledge, and social and cultural norms. The present study is a first step towards the automatic detection of the field of irony in our corpus. Aware of the difficulty of the task and of the lack of linguistic resources suited to our corpus and our goals, we adopted a symbolic approach, building on earlier work on the automatic annotation of enunciative modalities (Riguet and Alrahabi, 2017). Used in argumentative strategies, these observable markers help to express, or to report, irony and related phenomena (sarcasm, mockery, satire, derision...). Example: "De sorte qu'on ne peut mieux définir la Comédie, qu'une « assemblée de railleurs où personne ne se connait, et où chacun rit des défauts qui les rendent tous également coupables et ridicules »." [Lelevel, 1694] The markers used are mainly verbs such as se moquer, ironiser, parodier... By inspecting part of the corpus, we then enriched these resources with nouns, adjectives and adverbs, and classified the markers into sub-categories according to different semantic nuances: 1) ironie, dérision, se moquer, sarcastique, parodier...; 2) chicaner, taquiner, narguer...; 3) faire rire, comique, pitre, grotesque, idiot...; 4) mordant, piquant, pinçant, aigre...; 5) mépriser, dénigrer, sous-estimer, vilipender...; 6) calomnier, hypocrisie, ruse, malice...; etc. In all, we collected around 70 linguistic markers.

3. Methodology and technical choices
The automatic detection of irony is a difficult task, notably because of the multitude of linguistic means that express irony, or its related forms, often in subtle ways. Various computational studies have addressed the automatic detection of these phenomena (Joshi et al., 2016): rule-based approaches, statistical approaches and deep-learning approaches. In the present project we used Excom2 (Alrahabi, 2010), a rule-based annotation tool that gave us control over the annotation process and allowed us to progressively improve the relevance of the linguistic resources. For the system, the presence of an irony marker in a sentence triggers the associated rules, which explore the context and check for the presence or absence of complementary markers. In the following sentence, the presence of the adverb moqueusement in the context of a speech marker allows Excom2 to assign the label "Ironie" to the passage: « Il lui faut, dit-on moqueusement, cinq épithètes ! » [Corpus OBVIL]
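A minimal sketch of this marker-plus-context rule mechanism is given below. It is our own simplification in Python, not the Excom2 engine, and the marker lists are illustrative fragments of the categories described above.

```python
# Toy rule-based annotator: a sentence is tagged "Ironie" when an irony
# marker co-occurs with a complementary marker (here a speech marker)
# in the same sentence, mirroring the trigger-plus-context rule pattern.
import re

IRONY_MARKERS = {"moqueusement", "ironie", "railler", "parodier"}
SPEECH_MARKERS = {"dit-on", "dit", "répond", "écrit"}

def annotate(sentence):
    words = set(re.findall(r"[\w-]+", sentence.lower()))
    if words & IRONY_MARKERS and words & SPEECH_MARKERS:
        return "Ironie"   # the rule fires: irony marker + context marker
    return None

print(annotate("Il lui faut, dit-on moqueusement, cinq épithètes !"))
```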
Rules in Excom2 can be ordered by priority and can take the results of other rules as input. Before the annotation step, the tool segments the texts into sections, paragraphs and sentences. For irony, we created 8 rules associated with the different linguistic markers.

4. Corpus
The present study draws on texts with theatrophile or theatrophobe argumentation, as well as on texts adopting a "measured" strategy, which consists in denouncing the excesses of the stage in order then to convince the reader to preserve the intrinsic usefulness of the theatre. These three types of texts often follow a roundabout logic that can disconcert the reader: beneath the unfolding of the chapters one sometimes finds implicit connections, a very widespread use of irony, and negative sentences that markedly affect content detection. With its reprises, used to reiterate an argument or, on the contrary, to turn it against the adversary, this corpus of theatre controversies lends itself well to digital analysis. The corpus gathers 59 texts (about one million words) written in French from the mid-16th century to the 1840s (see the French bibliography constituting the full corpus of the Haine du Théâtre project: http://obvil.paris-sorbonne.fr/corpus/hainetheatre/bibliographie_querelle-france/). The texts were previously digitized and edited within the Labex OBVIL of Paris IV-Sorbonne and are freely available online, as part of the corpus of "La Haine du théâtre", a project directed within the Labex OBVIL by François Lecercle and Clotilde Thouret (Lecercle et al., 2016), http://obvil.paris-sorbonne.fr/projets/la-haine-du-theatre.

5. Evaluation
A first phase of tests on a sample of the corpus was needed to stabilize the identification and disambiguation rules. To assess the quality of the resulting annotations, we focused first on precision. We annotated another part of the corpus with Excom2 (7 documents, 215,675 words) and obtained 416 annotations. We then asked a person familiar with the works of this period to judge the system's outputs according to an annotation guide. For each annotation, the evaluator had to choose between "Correct", "Incorrect" and "I don't know". The evaluation criterion was: does the author of the text allude to irony in the sentence at hand? We obtained a precision of 93.9%.

6. Difficulties encountered
We ran into several difficulties. At the lexical level, few changes were made to our markers; one example is the word satire, which occurs under two spellings, satire (88 occurrences) and satyre (68 occurrences). In modern French the latter denotes the demigod companion of Dionysus or Bacchus, but in some texts that have not yet been modernized, written in 16th- or 17th-century French, it denotes "satire" in the broader sense. On the other hand, some markers are polysemous and generate noise, such as ridicule (437 occurrences, the most frequent marker), plaisanter (176 occurrences) and comique (131 occurrences). Example [Rousseau, 1758]: "Le ridicule est l'arme favorite du vice. C'est par elle qu'en attaquant dans le fond des cœurs le respect qu'on doit à la vertu, il éteint enfin l'amour qu'on lui porte."
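The precision figure above can be reproduced mechanically from the evaluator's judgements; the counts below are illustrative only, and the treatment of "I don't know" answers (excluded from the denominator) is our assumption, as the paper does not state how they were counted.

```python
# Precision over human judgements of the 416 system annotations
# (counts are invented placeholders, not the paper's actual tallies).
judgements = ["Correct"] * 390 + ["Incorrect"] * 25 + ["Je ne sais pas"] * 1

correct = judgements.count("Correct")
incorrect = judgements.count("Incorrect")
precision = correct / (correct + incorrect)   # "don't know" left out
print(f"precision = {precision:.3f}")
```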
Regarding 17th- and 18th-century syntax, we observed considerable complexity: sentences are sometimes very long (five lines or more), and punctuation is unstable, with several commas, semicolons and so on following one another within a single sentence. Moreover, the authors of our corpus use intricate turns of phrase. Very often these sibylline sentences are ironic, all the more so when they are interrogative.

7. Interpretation of the results
In the study of the debates on the theatre, expressions of irony are a fruitful way into the corpus. We first observe that, across the centuries covered by the Haine du Théâtre project (16th to 19th), the use of irony stays between 0.20% and 0.30% (1,265 annotations in total). We then analysed the irony markers by studying their relative presence per century, keeping only those whose share within a given century exceeds 5%.

Figure 1: Irony markers in the HdT corpus, weighted by century.

A considerable drop occurs in the 17th century. While it would be premature to draw hasty conclusions, we can already note that it is probably due to the assertion of religion, of the classical order and of state authoritarianism, which was insinuating itself into the minds of the writers of that period. From the 17th to the 19th century, by contrast, the values of these markers increase fairly steadily. In general, irony is used in the corpus as an ethical and stylistic device, which makes the authors quite effective in elaborating their vision of the quarrel. Whether theatrophobe or theatrophile, they can play with the nuances of the irony markers, hide a double meaning in their sentences, and say figuratively the opposite of what they communicate literally. For instance, we find a considerable presence of the lemma mépris in the 17th and 18th centuries, mainly reflecting a use of irony as a mechanism regulating social life. Notably, Conti and Voisin use an inoffensive humour against the excesses of the art and put propriety first: "Ceux qui vont aux Spectacles, non par hasard, mais de propos délibéré, et avec tant d'ardeur, qu'ils abandonnent l'Eglise par un mépris insupportable pour y aller, où ils passent tout le jour à regarder ces femmes infâmes, auront-ils l'impudence de dire qu'ils ne les voient pas pour les désirer" [Conti 1667, Voisin 1671]. Hypocrisie begins to be used in the 17th century, and its use declines over time (down to 1% in the 19th). This lemma essentially appears in sentences where irony is but "another name for misfortune" (Martin, 2009), a way of reinforcing the author's point of view: « L'hypocrisie est un vice privilégié, qui ferme la bouche à tout le monde, et qui jouit en repos d'une impunité souveraine. » [Coustel 1694]
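The per-century weighting used for Figure 1 can be sketched as follows; the counts, sub-corpus sizes and data structures are invented placeholders, not the corpus figures.

```python
# Per-century weighting of irony markers: counts are normalized by the
# size of each century's sub-corpus, and only markers whose share within
# a century exceeds 5% are kept (toy data).
counts = {                 # marker -> {century: occurrences}
    "ridicule": {"17e": 120, "18e": 210},
    "mépris":   {"17e": 90,  "18e": 60},
}
corpus_size = {"17e": 400_000, "18e": 350_000}   # tokens per century

for century, size in corpus_size.items():
    total = sum(c[century] for c in counts.values())
    for marker, c in counts.items():
        share = c[century] / total
        if share > 0.05:                    # 5% threshold within the century
            print(century, marker,
                  f"weighted={c[century] / size:.6f}", f"share={share:.2f}")
```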
Very widespread in the corpus is the use of irony as a satirical echo. The lemma calomnier, present in the texts from the 17th to the 19th century, is an example: "[…] cessez de calomnier vos contemporains selon l'usage immémorial de ceux qui profèrent de vaines paroles." [Senancour 1825]

Figure 2: Weighted values of the irony annotation in the corpus.

These first results allowed us to draw very interesting comparisons between the texts of the theatre's defenders and those of its detractors. Starting from the ironic expressions correctly identified as such, we counted occurrences and compiled statistics for each author of the annotated corpus. In relative terms, the authors who use irony markers the most belong to the "Rousseau quarrel" (mid-18th century). This remains to be analysed in perspective, but it can be linked to the use of irony in the 18th century, as several writings on Voltaire attest (Loriot, 2015). D'Alembert's words are telling in this respect and shed light on the role of irony [Alembert, 1759]: "Si la satire et l'injure n'étaient pas aujourd'hui le ton favori de la critique, elle serait plus honorable à ceux qui l'exercent, et plus utile à ceux qui en sont l'objet." The linguistic markers detected for this period belong to the semantic sphere of the ridiculous, satire, farce and the comic. (Note that the marker ironie and all its variants have only 11 occurrences in the corpus!) Other verbal markers, such as se moquer and plaisanter, appear in this quarrel and are shared with the writings of the previous controversy, dating from the mid-17th century. The irony values of the latter, whose theatrophobe representatives include Conti and Nicole, are however lower (0.06 vs. 0.17). Besides these verbal markers, we can cite noun categories such as le ridicule and le faire rire. In the same period Aubignac, an author of the offensive-defensive strategy, starts from a critique of the theatre to arrive at its defence. He draws on the usual markers of the 17th century, takes up his colleagues' words in his own sentences in order then to refute them, and resorts more specifically to ironic markers such as railler and idiot. Contemporary with Aubignac, the quarrel between Caffaro and Bossuet yields surprising results: while Caffaro uses few irony-related markers (0.05), Bossuet leads his contemporaries (0.27). Like the other authors, Bossuet draws on the markers of the comic and the ridiculous, as well as the verb plaisanter. Nevertheless, his results also contain words from the category of "piquant" markers [Bossuet, 1694]: "Il ne faut pas s'étonner que l'église ait improuvé en général tout ce genre de plaisirs [les spectacles…] à cause que communément, ainsi que nous l'avons remarqué, par sa bonté et par sa prudence, elle épargne la multitude dans les censures publiques : néanmoins parmi ces défenses, elle jette toujours des traits piquants contre ces sortes de spectacles, pour en détourner tous les fidèles."
We thus understand that, in judging the theatre incompatible with Christian morality, Bossuet favours a lively, biting style, backing the Church while denigrating the theatre's defenders. Research on the strategies of the theatre quarrel, while questioning each author's argumentative modalities and circumstantial aims, also reveals recurring ideas about how the theatre is regarded. The various texts share a number of commonplaces, such as the idea of perversion, temporal inflation, and economic and political arguments.

8. Discussion and perspectives
In this article we presented a rule-based approach to the automatic detection of irony and related forms in a corpus of debates on the theatre (16th to 19th century). The method we adopted provided abundant material and quantitative data for delimiting the object of study. Given the particularity of the linguistic phenomenon studied and the simplicity of our surface-analysis approach, we consider these first results very encouraging (93.9% precision), and they deserve to be pursued in order to extract the greatest benefit in terms of exploitation and precision. We plan to compute recall for the annotation and to identify the sources of the annotated segments (the speakers). One of our goals is also to annotate negative sentences and to analyse their association with irony (Mainardi et al., 2015), which would open up new lines of research in the digital humanities.

References
Alrahabi, M. (2010). EXCOM-2 : plateforme d'annotation automatique de catégories sémantiques. Applications à la catégorisation des citations en français et en arabe. PhD thesis, Université Paris-Sorbonne.
Joshi, A., Bhattacharyya, P., Carman, M. J. (2016). Automatic Sarcasm Detection: A Survey. ACM Comput. Surv., V, N, Article A (January 2016).
Lecercle, F., Mainardi, C., Thouret, C. (2016). Pour une exploration numérique des polémiques sur le théâtre. RHLF, n°116/4, dir. Didier Alexandre, Littérature et humanités numériques, PUF.
Loriot, C. (2015). Rire et sourire dans l'opéra-comique en France aux 18ème et 19ème siècles. Lyon, Symétrie.
Mainardi, C., Sellami, Z., Jolivet, V. (2015). A Semantic Exploration Method Based on an Ontology of 17th Century Texts on Theatre: la Haine du Théâtre. First International Workshop on Semantic Web for Cultural Heritage (SW4CH 2015), New Trends in Databases and Information Systems, 539, pp. 468-476, Communications in Computer and Information Science.
Martin, L. (2009). Le rire est une arme. L'humour et la satire dans la stratégie argumentative du Canard enchaîné. A contrario, 2009/2 (n° 12), 26-45.
Riguet, M., Alrahabi, M. (2017). Pour une analyse automatique du Jugement Critique : les citations modalisées dans le discours littéraire du XIXe siècle. DHQ: Digital Humanities Quarterly, 2017.

Migrants et réfugiés : dynamique de la nomination de l'étranger
Mohammad Alsadhan, Sascha Diwersy, Agata Jackiewicz, Giancarlo Luxardo
Praxiling UMR 5267 (Univ Paul Valéry Montpellier 3, CNRS)
muhammad.alsadhan@univ-montp3.fr, sascha.diwersy@univ-montp3.fr, agata.jackiewicz@univ-montp3.fr, giancarlo.luxardo@univ-montp3.fr

Abstract
The migrant crisis experienced by Europe in recent years gave rise to intense debates, both in the media and in politics.
We address here the question of the nomination used for the newcomers, which we propose to study by comparing two French nominalizations, migrant and réfugié. Using their combinatory profiles, we seek to highlight the contrast between the two terms and the changes in their semantics and axiological charge. To do so, we rely on a large corpus of texts covering a three-year period: the French parliamentary debates of the Assemblée Nationale. The comparative study of the combinatory profiles of the two terms shows that they have both shared and unshared collocates, and that their profiles tend, overall, to converge.

Résumé
In recent years, the migration crisis in Europe has sparked heated political and media debates. We are interested here in the question of how the newcomers are named, which we propose to study by comparing the two nominalizations migrant and réfugié. Starting from their combinatory profiles, we seek to bring out the contrast between these two terms and the changes in their semantics and axiological charge. To this end, we rely on a corpus covering a period of about three years: the debates of the Assemblée Nationale. The comparative study of the combinatory profiles associated with the two terms shows both shared and unshared collocates, and that their profiles tend, overall, to converge.

Keywords: political discourse, cooccurrences, diachronic data and hierarchical clustering, curve clustering.

1. Introduction
In 2015 the European Union experienced a massive arrival of non-European foreigners, which gave rise to formulas such as "crise migratoire" (migration crisis) and "crise des réfugiés" (refugee crisis). Against a background of sharply divided public opinion, this crisis led to contrasting political positions in each country concerned and to compromises that were hard to reach. The political and media debates focused first on the care of the victims and on the right of asylum to be granted to the newcomers, as well as on the fight against illegal smuggling networks, with "pro-immigration" and "anti-immigration" positions. But since the phenomenon is partly explained by the ongoing conflicts to the south and east of Europe, the question of how to designate the people concerned was raised. Whereas until then the "migrants" had mainly been seen as driven by economic prospects, it was observed that part of these people should instead be called "réfugiés" (refugees) or "demandeurs d'asile" (asylum seekers). Other terms, such as "clandestins", were also used. We seek here to question the dynamics of the nomination used in political debates. Starting from a corpus of parliamentary debates, we apply several classification procedures based on the diachronic nature of the data.

2. The corpus of parliamentary debates
We hypothesize that the discourses around the migration crisis use the two terms migrant and réfugié partly interchangeably, and partly in contexts where only one of the two is possible. We propose to bring out this distinction between several discursive uses through the neighbourhood of the two terms, and to assess its variation first across political discourse and then over time.
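The "neighbourhood" comparison announced here can be sketched as a simple windowed collocate count, a crude stand-in for the dependency-based profiles used later in the paper; the speech turn below is a toy example.

```python
# Windowed collocates of a target term: count the tokens occurring within
# +/- `width` positions of each occurrence of the target (toy data).
from collections import Counter

def window_collocates(tokens, target, width=5):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - width), i + width + 1
            counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

turn = ("nous devons accueillir le réfugié et le migrant "
        "dans une situation de crise").split()   # stand-in speech turn
print(window_collocates(turn, "migrant"))
print(window_collocates(turn, "réfugié"))
```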
The corpus treated below is built from the transcriptions of the public-session debates of the Assemblée Nationale from January 2014 to February 2017 (corresponding to the end of the 14th legislature). The textual data, published in XML and freely available at data.assemblee-nationale.fr, amount to about 28.6 million word tokens. They were transformed and enriched with linguistic annotations following the methodology described by Diwersy et al. (2018). Numerous metadata are defined on this corpus, but in what follows we concentrate on the date (month-year) associated with a basic structural unit corresponding to a speech turn (one deputy's intervention).

3. Chronological analysis
The evolution of the semantics of the terms migrant and réfugié can be studied by combining methods involving: (i) the frequencies of the two lemmas in the corpus; (ii) their collocational profiles, which can reveal specific semantic fields; (iii) the variation over time of the similarity of these collocational profiles, together with the characterization of each collocate's contribution to the evolution of the similarity scores obtained.

Figure 1: Relative frequencies of the two lemmas per quarter.

The evolution of the relative frequencies of the two lemmas per quarter is shown in Figure 1. It reveals parallel frequency curves, with a peak in the use of both terms around September 2015. The rank correlation between the two frequency series, measured by Kendall's tau, is significant here (about 0.74, for a p-value of 0.0005). In what follows, the time unit is the quarter, yielding analyses over 13 quarters for the period covered. To produce a more precise periodization, we implemented an approach combining dependency-relation annotations, the creation of lexicograms representing the quarterly collocational profiles of the two terms (ranked by the score of Fisher's exact test), and contiguity-constrained hierarchical agglomerative clustering (CAHC), cf. (Diwersy and Luxardo, 2016; Gries and Hilpert, 2008). A CAHC can be built in two ways:
- for each lemma, by computing the similarity between two successive quarters using the Pearson product-moment correlation coefficient;
- by computing the variation of the similarity between the vectors representing the collocational profiles of the two lemmas, using the standard deviation accumulated over two successive quarters.
The first method shows that the largest variations occur in the initial quarters, up to the peak of the crisis. The second method, which allows the comparison of the two terms to be shown in a single graph, is represented in Figure 2.

Figure 2: Contiguity-constrained clustering of the quarters, with the similarity curve superimposed.

This hierarchical clustering reveals seven stages (shown as seven grey zones). The evolution of the similarity score is shown by a curve superimposed on the dendrogram, confirming an overall growth from 0 to 0.2 (with a peak at 0.6). The transition from one period to the next is marked by an increase up to P03 (the third quarter of 2015, following the peak of the crisis), but with a decline from P03 to P05 and from P06 to P07.

Figure 3.
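Two of the computations just described can be sketched with toy series: Kendall's rank correlation between the two quarterly frequency series, and an agglomerative clustering in which only adjacent quarters may merge, standing in for the contiguity-constrained CAHC. The frequency values below are invented.

```python
# Kendall's tau between two quarterly series, and contiguity-constrained
# agglomerative clustering of the quarters (toy data).
import numpy as np
from scipy.stats import kendalltau
from sklearn.cluster import AgglomerativeClustering

migrant = np.array([1.0, 1.2, 2.5, 6.0, 3.0, 2.0, 1.5])   # freq per quarter
refugie = np.array([0.8, 1.0, 2.0, 7.0, 3.5, 2.5, 1.2])

tau, p = kendalltau(migrant, refugie)
print(f"Kendall tau = {tau:.2f}, p = {p:.4f}")

# Contiguity: quarter i may only join quarters i-1 and i+1.
n = len(migrant)
connectivity = np.eye(n, k=1) + np.eye(n, k=-1)
X = np.column_stack([migrant, refugie])
labels = AgglomerativeClustering(
    n_clusters=3, connectivity=connectivity, linkage="ward").fit_predict(X)
print(labels)   # contiguous runs of quarters = periods
```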
4. Evolution of the combinatory profiles and discursive orientations
This section aims to spell out the linguistic factors behind the statistical tendencies established in the previous part: on the one hand, the semantic differences between migrant and réfugié as they show up in their differential profiles; on the other, the main points concerning their distributional similarity. The differential profiles consist of the collocates exclusive to each of the two nouns, which therefore never contribute to the similarity of their combinatory profiles. Table 1 gives an overview restricted to the most salient collocates, those located in the first decile of the collocational inventories in terms of association score.

Table 1: Differential profiles formed by the first decile of the collocates exclusive to migrant and to réfugié.
migrant
- downstream dependencies: epithets: irrégulier, illégal, clandestin, âgé; noun complements: Calais, Calaisis, situation
- upstream dependencies (governing terms): as direct object of: dissuader, entasser, refouler, secourir; as noun complement of: retour, langue, déferlement, réadmission
réfugié
- downstream dependencies: epithets: politique, palestinien, afghan, vietnamien, irakien, cambodgien, persécuté, réinstallé; noun complement: guerre
- upstream dependencies (governing terms): as subject of: affluer; as noun complement of: statut, protection, (Haut-)Commissariat, qualité, relocalisation, distinction, concubin, défi
- coordination with: apatride, bénéficiaire, déplacé, migrant

Among the salient collocates of the noun réfugié, we first note the strong presence of a series of terms (statut, qualité; (Haut-)Commissariat; protection; apatride) that refer to the framework of international-law provisions imposing on the authorities a duty of assistance towards people whose departure from their usual place of residence is considered forced by an existential threat. (Lower deciles, not documented here, contain further collocates of the same series, such as statutaire or conventionnel.) Categorizing a person by means of the term réfugié therefore carries legal, administrative and political stakes, whose scope can be regulated, on the one hand, by explicit paradigmatic contrasts with other terms within a coordination (cf. the collocates apatride, bénéficiaire, déplacé and migrant) and, on the other, by secondary categorizations expressed by nominal expansions (epithets or noun complements) characterizing the causes of the forced departure. Through the modifiers of réfugié involving a causal relation (politique, persécuté; (de) guerre), a paradigm is built up, and ultimately a hierarchy of potentially legitimate or illegitimate causes, and of responses to their consequences. (This sub-categorization often goes hand in hand with a modalization of categorial membership, expressed by the epithet véritable, which, together with vrai and authentique, forms a series of collocates, belonging to the category of enclosures, exclusive to réfugié but represented at lower ranks of the cooccurrence inventory.) Alongside these modifiers, which directly denote the cause of the forced departure, we find a whole series of ethnonymic adjectives (palestinien, afghan, vietnamien, irakien, cambodgien) that denote it indirectly, relying on shared knowledge of the troubled history of these countries. This discursive environment shows that the word réfugié presents itself as the nomination of a legal status and is integrated into a positively oriented argumentation.
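How such a differential profile can be derived is sketched below: collocates are scored with Fisher's exact test, as in the lexicograms described earlier, and collocates attested for only one of the two nouns form its exclusive profile. All counts are toy values, not the corpus figures.

```python
# Differential profiles: Fisher-exact association scoring plus a set
# difference between the two nouns' collocate inventories (toy data).
from scipy.stats import fisher_exact

def assoc(k_pair, k_colloc, k_noun, n):
    """p-value of Fisher's exact test on the 2x2 collocation table."""
    table = [[k_pair, k_colloc - k_pair],
             [k_noun - k_pair, n - k_colloc - k_noun + k_pair]]
    return fisher_exact(table, alternative="greater")[1]

profiles = {"migrant": {"clandestin": 40, "afflux": 25},
            "réfugié": {"statut": 90, "afflux": 30}}

exclusive = {noun: set(p) - set(profiles[other])
             for noun, other, p in [("migrant", "réfugié", profiles["migrant"]),
                                    ("réfugié", "migrant", profiles["réfugié"])]}
print(exclusive)   # {'migrant': {'clandestin'}, 'réfugié': {'statut'}}
print(assoc(40, 60, 500, 100_000))   # score for one (noun, collocate) pair
```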
The collocates of migrant reveal a quite different semantic profile, in that this term puts at the centre of attention the question of (non-)conformity with legal provisions imposed on people whose stay in a territory other than their place of origin is considered the result of a displacement driven by utilitarian, and first of all economic, considerations. It is to this semantic dimension that the salient series of collocates irrégulier, illégal, clandestin and situation relates in the differential profile of migrant (situation contrasting, from this point of view, with statut and qualité, collocates exclusive to réfugié). Having inherited the aspectual traits of the -ant participle from which it is derived by conversion, the noun migrant presents the momentary stay of the person so qualified at a given place as one episode in an unaccomplished series of displacements. (Réfugié, by contrast, results from the nominalization of a past participle and is associated with the representation of a single, accomplished episode of displacement, viewed from its origin.) Through collocates such as dissuader, refouler and retour, this stay and these displacements are characterized as depending both on the will of the people on the move and on the benevolence or refusal of the authorities who potentially control them. Should we see here the inferential motivation of the negative evaluation conveyed by a term such as déferlement, in contrast to its axiologically more neutral variants afflux, flux and arrivée, which all belong to the shared collocates of migrant and réfugié? To better characterize the shared collocates that contribute most to the evolution of the distributional similarity of the two nouns, we applied the classification method proposed by Trevisani & Tuzzi (2016) to the time series of the products of each collocate's normalized association scores; these products are the terms of the sums (dot products) that constitute the similarity indices used.
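The decomposition just described can be made concrete: if each noun's profile is an L2-normalized vector of association scores, the quarterly similarity is their dot product, and each shared collocate contributes one term of the sum. The scores below are invented.

```python
# Per-collocate contribution to the dot-product similarity of the two
# nouns' normalized association-score vectors (toy data).
import numpy as np

collocates = ["accueillir", "accueil", "crise", "afflux", "millier"]
migrant = np.array([3.0, 2.0, 5.0, 4.0, 6.0])   # association scores
refugie = np.array([2.5, 1.0, 6.0, 3.0, 5.0])

m = migrant / np.linalg.norm(migrant)
r = refugie / np.linalg.norm(refugie)
contributions = m * r                  # one term per shared collocate
print(dict(zip(collocates, contributions.round(3))))
print("similarity =", contributions.sum().round(3))   # the dot product
```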
D’un point de vue sémantique, ces 5 collocatifs, qui, à différents moments de la série chronologique analysée, occupent les premiers rangs en termes de contribution aux scores de similarités respectifs, forment tout un condensé de la trame discursive impliquant les noms migrant et réfugié au cours de la période étudiée, avec :  millier et afflux, qui renvoient à une affluence perçue comme massive ;  crise, qui caractérise ce processus comme ayant atteint un point culminant à fort potentiel de déstabilisation ;  ainsi que accueillir et accueil qui se rapportent à la prise en charge des conséquences immédiates du processus concerné. Facteurs distributionnels de premier ordre, ces collocatifs placent migrant et réfugié dans un rapport paradigmatique associé à plusieurs dimensions sémantiques, qui, en vue des orientations argumentatives fortement divergentes instaurées par les deux noms (cf. supra), fait de leur choix un véritable enjeu discursif. 5. Conclusion et perspectives Les prolongements de cette étude exploratoire sont nombreux. En partant de Wihtol De Wenden (2016), il nous semble possible de construire un modèle d’analyse comportant cinq catégories qui sont autant de facettes du phénomène migratoire actuel : (i) origines et causes des migrations, (ii) profils des migrants, (iii) situation des migrants, (iv) gouvernance des migrations, (v) mobilité et restrictions migratoires. L’application de cette grille de lecture aux collocations impliquant les termes réfugié et migrant (ou encore leurs équivalents), peut s’avérer une piste de recherche prometteuse qui permet de donner aux résultats de l’analyse linguistique que nous venons d’effectuer une dimension transdisciplinaire, comme c’est par exemple le cas pour la différence entre facteurs « push » (poussant les individus à partir de leur pays) et « pull » (incitant les individus à venir dans un pays spécifique) établie par Wihtol de Wenden, différence qui se reflète dans la divergence fondamentale de l’orientation argumentative des programmes de sens propres aux noms étudiés, en ce que réfugié implique la notion de départ forcé alors que migrant évoque l’idée d’un déplacement volontaire. Si la figure du réfugié ou du migrant est essentiellement une construction politique (Wihtol De Wenden, 2016, p. 50) – ce que confirme d’ailleurs le profil collocationnel du terme correspondant tel qu’il se manifeste dans le corpus de discours parlementaire analysé - les différents (et nombreux) profils des personnes en déplacement peuvent être étudiés à partir des témoignages qu’elles livrent à propos de leur expérience migratoire. C’est l’objet d’une enquête menée auprès de Syriens arrivés en France depuis 2012, 18 JADT’ 18 qui se situe dans le prolongement du présent article et qui comporte à ce stade un volet uniquement qualitatif, dont les résultats préliminaires (Alsadhan et Richard, 2018) montrent que, lorsque le choix se présente, c’est bien le vocable réfugié qui est privilégié en tant qu’auto-désignant. Références Alsadhan, M., Richard A. (2018, à paraître). La réception des réfugiés Syriens du discours médiatico-politique identitaire français, in Sandré M., Richard A. & Hailon F. : Le discours politique identitaire face aux migrations, No 8 de la revue Studii de lingvistica. Diwersy, S., Luxardo, G. (2016). Mettre en évidence le temps lexical dans un corpus de grandes dimensions : l’exemple des débats du Parlement européen, in Mayaffre D., Poudat C., Vanni L., Magri V. & Follette P. (éds.) 
Diwersy, S., Frontini, F. and Luxardo, G. (2018, forthcoming). The Parliamentary Debates as a Resource for the Textometric Study of the French Political Discourse. In Proceedings of the ParlaCLARIN workshop, 11th Language Resources and Evaluation Conference (LREC 2018).
Gries, S. T. and Hilpert, M. (2008). The identification of stages in diachronic data: variability-based neighbour clustering. Corpora, 3(1): 59-81.
Trevisani, M. and Tuzzi, A. (2016). Analisi di dati testuali cronologici in corpora diacronici: effetti della normalizzazione sul curve clustering. In Mayaffre D., Poudat C., Vanni L., Magri V. and Follette P. (eds), JADT 2016 : Actes des 13es Journées internationales d'Analyse statistique des Données Textuelles, Nice. URL: http://lexicometrica.univ-paris3.fr/jadt/jadt2016/01ACTES/82630/82630.pdf.
Wihtol de Wenden, C. (2016). Migrations. Une nouvelle donne. Éditions de la Maison des sciences de l'homme, Paris.

Xplortext, an R package. Multidimensional statistics for textual data science

R. Alvarez-Esteban1, M. Bécue-Bertaut2, B. Kostov3, F. Husson4, J.-A. Sánchez-Espigares2
1 Universidad de León – ramon.alvarez@unileon.es
2 Universitat Politècnica de Catalunya – monica.becue@upc.edu; josep.a.sanchez@upc.edu
3 Institut d'Investigacions Biomèdiques August Pi i Sunyer – belchin3541@gmail.com
4 Agrocampus Ouest – husson@agrocampus-ouest.fr

Abstract
We present the package Xplortext for textual data science, which provides classical and novel features for textual analysis. Starting from the corpus encoded into a lexical table, aggregated or not, several problems are dealt with: revealing both document and word structures and their mutual relationships by applying correspondence analysis (CA); comparing the structures of several corpora by using multiple factor analysis for contingency tables (MFACT); uncovering complex relationships between words and contextual variables via CA for a simple or a multiple generalized aggregate lexical table (CA-GALT and MFA-GALT); clustering documents by means of a hierarchical clustering algorithm (HCA); and evaluating the evolution of the vocabulary along time by means of a chronologically constrained hierarchical clustering algorithm (CCHCA).
Keywords: Xplortext, R package, textual data, contextual data, correspondence analysis, multiple factor analysis for contingency tables, generalized aggregate lexical table, hierarchical clustering, contiguity-constrained hierarchical clustering, labeled tree.

1. Introduction
R offers numerous tools for textual data science. However, among them, multidimensional statistics is not as well represented as it should be. Xplortext, a new R package, intends to fill this gap. Its features are based on the exploratory approach to texts, in the line of the works by Benzécri (1981) and Lebart et al. (1998). The fundamental choice behind the design of Xplortext is to offer classical and novel textual analysis methods based on multidimensional statistics in a single package. The main issues considered were:
• classical multidimensional statistical methods, among which CA remains the core method;
• novel methods, favoring those able to jointly analyze textual and contextual data, so as to know not only who says what (taking here the title of a paper by Lebart) but also why he/she is saying it;
• numerous graphical outputs providing great flexibility in choosing the elements to be represented;
• specific methods for dealing with chronological corpora.

2. Example
The political speech corpus used as an example consists of 11 documents of about 10,000 occurrences each. These are the "investiture speeches" delivered by 6 Spanish presidential candidates from 1979 to 2011: Suárez (1979), Calvo-Sotelo (1981), González (1982, 1986, 1989 and 1993), Aznar (1996 and 2000), Zapatero (2004 and 2008) and Rajoy (2011).

3. Encoding the textual data and basic statistics
Xplortext takes advantage of functions of the R package tm to import the corpus. Mainly, plain text files (typically .txt) and spreadsheet-like files (.csv, .xls) are considered. By default, plain text and CSV files are assumed to use the native local encoding: usually latin1 on Windows, and utf8 on Mac or Linux. The encoding of the file can be given in the R read command; if necessary, the corpus can be saved in a known encoding beforehand. In any format, one row corresponds to one document. The text to analyze can fill one or several columns; the remaining columns provide information about the documents and are automatically imported as contextual (quantitative and/or qualitative) variables. Textual and contextual data must be located in the same file. Conversion to lower/upper case, number removal and punctuation removal are managed by Xplortext depending on the arguments of the TextData function. Stopwords can be taken into account using the lists provided either by Xplortext (issued from tm) or by the user. The importing step ends with the encoding of the corpus into a documents × words table (lexical table) and, possibly, a documents × repeated-segments table (segmental table). Another option is to ask for an aggregate lexical table according to the categories of a variable. Then, elementary indicators, such as the corpus and vocabulary sizes, are computed, and the word and repeated-segment indices are listed and represented by a histogram visualizing their frequencies (Fig. 1). Classical summaries of the contextual variables are given.

Figure 1: Most frequent words and repeated segments
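As a minimal sketch of this import step, the call below assumes a hypothetical CSV file speeches.csv with one row per speech, the text in a column named text and the contextual variables in the remaining columns; the argument names follow our reading of the Xplortext documentation and may differ across package versions (see ?TextData).

library(Xplortext)

speeches <- read.csv("speeches.csv", fileEncoding = "UTF-8")  # one row per document

# Build the documents-by-words lexical table (and, optionally, the
# documents-by-repeated-segments table); lower-casing, number removal and
# stopword filtering are handled by TextData itself.
td <- TextData(speeches, var.text = "text",
               lower = TRUE, remov.number = TRUE,
               stop.word.tm = TRUE, idiom = "es",   # Spanish stopword list (assumed argument)
               segment = TRUE)

summary(td)  # corpus size, vocabulary size, most frequent words and segments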
4. Correspondence analysis as a core method
Correspondence analysis (CA) is a core method in Xplortext, revealing both document and word structures and their mutual relationships.

4.1. CA and the content and form of a corpus
Both the content and the form of a corpus show up in CA results. In fact, content is better captured when placed back into the form, since "form is the substance that rises to the surface", in the words of Victor Hugo. Figure 2 shows the factor maps issued from a CA performed on the documents × words table. The trajectory of the speeches is revealed, bringing out the existence of three temporal poles. The represented words are the most contributive ones and have to be read along the trajectory. In this way, they clearly illustrate the three poles and allow us to capture the meaning of the evolution. Note that the confidence ellipses around the documents are very narrow.

Figure 2: Documents and the most contributive words on the first CA plane

4.2. Multiple factor analysis for contingency tables
When dealing with a multiple contingency table (a juxtaposition of several contingency tables), multiple factor analysis for contingency tables (MFACT; Bécue-Bertaut and Pagès, 2004; Bécue-Bertaut and Pagès, 2008), an extension of CA, turns out to be useful. Very different aims can be pursued: for example, comparing the document structures issued either from using different thresholds on word frequency (10, 20, 30 or 50; 4 lexical tables) or from keeping or discarding the tool words or the stopwords (2 lexical tables). MFACT offers a large number of graphical and numerical results, either similar to those of any principal component method (such as PCA or CA) or specific to the comparison of the structures induced on the rows by the groups of columns. Among the latter, the representation of the groups provides a synthetic tool: each group is represented by a single point, revealing the global dissimilarities between the group structures (Fig. 3).

Figure 3: Synthetic representation of the groups as issued from MFACT

4.3. Generalized aggregate lexical tables
Correspondence analysis on a generalized aggregate lexical table (CA-GALT; Bécue-Bertaut and Pagès, 2015; Bécue-Bertaut, Kostov and Pagès, 2014) deals with two paired tables (a frequency table and a contextual-variables table) observed on the same statistical units. In textual analysis, the frequency table is a lexical table and the statistical units are the documents. This method can be seen as a canonical correspondence analysis (CCA; ter Braak, 1986) approach to texts. It enables the study of the relationships between contextual variables and words while untangling the respective influences of the variables/categories on the lexical choices, so as to avoid spurious relationships. MFA-GALT (multiple factor analysis for a series of generalized aggregate lexical tables; Kostov, 2015) deals with several paired tables, possibly defined on several sets of statistical units, while the set of variables is common to all the contextual tables. In textual analysis, MFA-GALT compares the relationships between words and variables across these paired tables. A favored application concerns surveys answered in different languages by several samples, with the open-ended and closed questions common to all of them.
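A correspondingly minimal sketch of the CA step, reusing the TextData object td from the previous sketch: LexCA is the package's CA function, but the exact options shown here are assumptions to be checked against the package documentation.

# Correspondence analysis on the documents-by-words lexical table
res.ca <- LexCA(td, ncp = 5, graph = FALSE)

summary(res.ca)   # eigenvalues, document and word coordinates and contributions
plot(res.ca)      # first factorial plane (documents and words)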
5. Clustering algorithms
A classical hierarchical clustering algorithm (HCA) is included in Xplortext. Clustering starts from the document coordinates on the CA dimensions. An exhaustive description of the clusters is provided, extracting their characteristic words and looking for the differentiated behavior of the variables in the clusters. The number of clusters is derived from the hierarchical tree structure, and an automatic suggestion is made. A chronologically constrained hierarchical clustering algorithm (CCHCA) is also offered, in which only chronologically contiguous nodes can be grouped. Further, the tree is described by the chronological words, defined as follows: the characteristic words of each node are identified, but each word is finally associated with only one node, the one that it best characterizes. These words are used to label the nodes (Fig. 4). Although the tree could be used to determine clusters, its main role is to allow the evolution of the speeches and their vocabulary to be captured through a descending reading of the labels and nodes of the tree.

Figure 4: Labeled chronological tree

6. Work in progress
The following features will be included in the near future:
• Chronological clustering (Legendre et al., 1985) was proposed to divide a chronological series of species counts (operated at different moments) into homogeneous temporal parts. The same aggregation criterion as in chronologically constrained clustering is used, but a test is performed before aggregating two nodes to ensure their homogeneity. If homogeneity does not hold, the aggregation is not performed; as a result, the series may be divided into non-connected sub-series. This clustering method has been applied with benefit to the chronological series of words of a chronological corpus, allowing the corpus to be divided into non-connected homogeneous parts (Bécue-Bertaut et al., 2014).
• Regularized CA (Josse et al., 2016) allows a low-rank structure to be recovered from noisy data, such as textual data, by using regularization schemes via a simple parametric bootstrap algorithm.

7. Conclusion
Xplortext is published on CRAN. Bécue-Bertaut et al. (2018) present a series of applications of this package through several examples whose results are interpreted in detail. The corresponding datasets and scripts are published on the website http://xplortext.org.

References
Bécue-Bertaut, M. et al. (2018). Analyse textuelle avec R. Presses Universitaires de Rennes (PUR), Rennes.
Bécue-Bertaut, M., Kostov, B., Morin, A. and Naro, G. (2014). Rhetorical strategy in forensic closing speeches: multidimensional statistics-based methodology. Journal of Classification, 31: 85-106.
Bécue-Bertaut, M. and Pagès, J. (2004). A principal axes method for comparing multiple contingency tables: MFACT. Computational Statistics and Data Analysis, 45: 481-503.
Bécue-Bertaut, M. and Pagès, J. (2008). Multiple factor analysis and clustering of a mixture of quantitative, categorical and frequency data. Computational Statistics and Data Analysis, 52: 3255-3268.
Bécue-Bertaut, M. and Pagès, J. (2015). Correspondence analysis of textual data involving contextual information: CA-GALT on principal components. Advances in Data Analysis and Classification, 9: 125-142.
Bécue-Bertaut, M., Pagès, J. and Kostov, B. (2014). Untangling the influence of several contextual variables on the respondents' lexical choices: a statistical approach. SORT - Statistics and Operations Research Transactions, 38: 285-302.
Benzécri, J.-P. (1981). Pratique de l'Analyse des Données. Tome III. Linguistique & Lexicologie. Dunod, Paris.
Josse, J., Sardy, S. and Wager, S. (2016). denoiseR: a package for low rank matrix estimation. arXiv: 1602.01206.
Kostov, B. (2015). A principal component method to analyse disconnected frequency tables by means of contextual information (doctoral dissertation). Retrieved from http://upcommons.upc.edu/handle/2117/95759.
Lebart, L., Salem, A. and Berry, L. (1998). Exploring Textual Data. Kluwer.
Legendre, P., Dallot, S. and Legendre, L. (1985). Succession of species within a community: chronological clustering, with applications to marine and freshwater zooplankton. American Naturalist, 125: 257-288.
ter Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67: 1167-1179.

The evolution of norms: a textual analysis of immigration policies in Italy

Elena Ambrosetti1, Eleonora Mussino2, Valentina Talucci3
1 Associate Professor, Sapienza Università di Roma
2 Associate Professor, Stockholm University
3 Researcher, ISTAT

1. Introduction
In Southern European countries, migration policies tend to privilege issues related to the entry of immigrants (e.g. regular and irregular entries, regularizations and family reunification) over aspects related to integration (Pastore, 2004; Solé, 2004). This imbalance in policy action can be attributed to the will of immigration countries to control flows, block unauthorized entries and determine the number and composition of migrants. Migration policies directly regulate the outcome of entry (or non-entry) into the destination country and subsequently orient the paths by which the foreigners admitted to Italy are inserted into its economic, social and cultural fabric. By studying immigration policies from Italian unification to the present day, we can analyse how institutional language has been transformed over the years and across legislatures, tracing various aspects of international migration in Italy. This topic is particularly important because the choice of a type of language may influence the opinions and attitudes of the Italian population towards foreigners.

2. Migration policies in Italy
Although Italy became a country of immigration in the 1970s, it adopted its first piece of immigration legislation only in 1986, following its 1975 adhesion to Convention 143 of the International Labour Organization (ILO) and the increase in immigration flows during the 1980s. Law 943/1986 (the Foschi law) concerned first of all the status of workers, and also included family reunification and access to basic welfare (Colombo and Sciortino, 2004). The law was addressed to non-EU workers, with the aim of placing them on an equal footing with Italian and European Union workers (Nascimbene, 1988; Colombo and Sciortino, 2004). It also introduced a regularization for non-EU workers already present on the territory without regular documents. In February 1990, law 39/1990 (the Martelli law) was approved by the Italian Parliament under the pressure of increasing arrivals after the fall of the Iron Curtain and the imminent ratification of the Schengen Treaty (ratified in 1993 and entered into force in 1997). Unlike the earlier Foschi law, it addressed all categories of migrants, not only workers, and is therefore considered the first comprehensive law on migration.
Nevertheless, it is remembered mainly for the regularization of about 218,000 irregular migrants. Other significant aspects covered by the Martelli law include: the introduction of a visa requirement, with a consequent tightening of border controls that made it much harder to enter Italy; the annual planning of quotas of non-EU workers through the so-called Decreto Flussi; political asylum; and, finally, stricter conditions for obtaining and renewing residence permits. In 1995, law 489/1995 (the Dini law) was issued: it contained further restrictive measures for border control, a new regularization for irregular foreign workers, and the regulation of seasonal worker flows. Unlike the restrictive measures, which were never implemented because they were deemed contrary to the Constitution, the regularization was the real success of the Dini decree, with 248,000 foreigners regularized. In 1997, with the entry into force of the Schengen agreement, alignment with the common visa policy was introduced into the Italian legal system. Still on the subject of EU legislation, law 209/1998 ratified the Treaty of Amsterdam, which entered into force in Italy that year. In the same year the government approved the Consolidated Act on immigration and on the condition of foreigners, legislative decree 286/1998 (the Turco-Napolitano law). The aim of the law was to break with the past and move towards a structural, long-term management of the migration phenomenon. The law rested on four pillars (Zincone and Caponio, 2004):
1. Prevention of and fight against irregular immigration, notably through the introduction of the immediate expulsion of irregular migrants and of temporary detention centres for undocumented immigrants awaiting expulsion;
2. Labour migration: new arrivals of foreign workers are regulated by annual quotas set each year by the Ministry of Labour; a sponsor mechanism is introduced, whereby an Italian citizen or a resident foreigner guarantees the entry of a foreigner without a work contract;
3. Promotion of the integration of migrants already resident in Italy: creation of the National Fund for integration, dedicated to financing multicultural activities and anti-discrimination actions; introduction of the long-term residence permit (carta di soggiorno) for migrants resident in Italy for at least 5 years;
4. Granting of fundamental human rights, such as basic health care, to irregular migrants.
The Turco-Napolitano law carried out the regularization of 217,000 foreigners. In 2002 the Bossi-Fini law was introduced, with the purpose of amending the 1998 Consolidated Act in a restrictive direction. More specifically, the law modified the first two pillars. The new rules adopted a series of measures aimed at discouraging the permanent settlement of migrants, among which: the abolition of the sponsor system, the reduction of the period of validity of the residence permit, and the linkage of the validity of the residence permit to a work contract (the "contratto di soggiorno").
Moreover, a more repressive policy towards irregular migrants was adopted, including forced repatriation, more systematic police controls (including the patrolling of the Italian coasts), and the detention of those who remained on Italian territory longer than allowed by their residence permit (over-stayers). In line with the previous laws, law 189/2002 regularized 634,728 immigrants, the largest regularization ever adopted in Europe up to that time (Zincone, 2006). After 2002, few changes were made to migration legislation; they concern in particular measures to combat irregular immigration, regularizations of irregular migrants present on Italian territory, and the transposition of EU directives entailing changes to existing rules. The acquisition of citizenship by birth (ius sanguinis) and by residence (ius soli) was initially regulated by law 555/1912. The conditions were very restrictive: citizenship was granted only to the child of an Italian man and, under specific conditions, to the child of an Italian woman. Law 123/1983 introduced into Italian legislation the acquisition of citizenship by marriage and reformed acquisition by birth, granting the right of citizenship equally to the child of an Italian mother or father. The acquisition of Italian citizenship was further reformed by law 91/1992, which reserved particular rights to European citizens compared with non-Europeans. Citizenship by marriage was reformed in 2009 (law 94 of 15 June), extending the required period of residence in Italy from six months to two years from the date of marriage. In recent years there have been several attempts to introduce new citizenship legislation aimed at simplifying and shortening the acquisition of citizenship for second-generation migrants (born in Italy). As a first result, art. 33 of decree 69/2013 simplified the procedure for acquiring citizenship for foreigners born in Italy. Nevertheless, a new law on the matter is still lacking to date. Italian migration legislation has constantly been characterized by the lack of an active entry policy and by the continual attempt to slow down and hinder the legal and social rooting of the foreign population on Italian territory. The continual recourse to ex-post instruments such as regularizations, the use of quotas as a way of surfacing foreign workers already present on Italian territory rather than as a channel for the entry of new workers, and the strong emphasis placed by the political class and the media on the fight against illegal immigration are emblematic examples of how the migration phenomenon in Italy is dealt with in terms of containment and control rather than openness and integration. The foreign presence is still mostly regarded as transitory, and is perceived and managed as a response to contextual emergency events.

3. Data and methods
The textual data used for this work are all the normative chapters contained in the laws on migration approved in Italy from 1912 to 2014. The proposed methodology belongs to Content Analysis carried out through automated data-processing techniques.
It is performed by applying a set of routines, supported by specific software, in this case TaLTAC2 (Trattamento automatico Lessico testuale per l'Analisi del contenuto), which make it possible to partly or fully automate the exploration, description and processing of large amounts of data; in this way, sets of unstructured texts are transformed into sets of structured texts. Besides describing the contents of the text, the corpus can be analysed according to one or more variables available on the fragments, such as the year and the government majority (1). The extraction of distinctive information, identified through a p-value test, yields for each explanatory variable a list of keywords over- or under-represented with respect to a reference model. Moreover, through lexical correspondence analysis we look for a pattern that systematically relates the lemmas and the identified dimensions to the characteristics associated with each law.

(1) The majorities are: Casa delle libertà, centre-right: Berlusconi II government, XIV legislature (30 May 2001 - 27 April 2006); centre-right coalition: Berlusconi IV government, XVI legislature (29 April 2008 - 23 December 2012); grand coalition: XVII legislature, Letta and Renzi governments, centre-left and Alternativa popolare; independent: Dini government (17 January 1995 - 17 May 1996), technocratic government; independents: Monti government (16 November 2011 - 27 April 2013), technocratic government, XVI legislature; Liberal: Giolitti government (1911-1914), UL - PR - PDC - PD - UECI - CC, centre-right; L'Unione: centre-left, XV legislature (28 April 2006 - 6 February 2008), Prodi II government; Pentapartito: political coalition DC - PSI - PSDI - PRI - PLI, IX legislature; Quadripartito: political coalition DC - PSI - PSDI - PLI, X legislature; Ulivo: centre-left, XIII legislature.

4. Results
The laws were analysed as a single corpus, which satisfies the standard minimum-size criteria required for robust analyses. A first lexicometric analysis of the text, consisting of 150,714 occurrences and 8,113 graphic forms, is reassuring about its adequate extension: the proportion of distinct words over the total occurrences (V/N*100 = 5.383) departs considerably from the 20% threshold, thus meeting the minimum statistical-significance requirement for a corpus (Bolasco, 1999). Surprisingly, the level of sophistication of the language is not particularly high, as shown by the percentage of hapax (V1/V*100 = 28.350%) and by Zipf's coefficient a (1.325). Looking at the vocabulary, the first non-empty word is comma (1,529 occurrences), followed by numero (1,160) and articolo (1,066). The other theme words, i.e. the nouns appearing most frequently in the text, are straniero, decreto, Stato, disposizioni, ingresso, territorio and soggiorno. We then compared our vocabulary with the "lexicon of governmental programmatic discourse" (Bolasco, 1999) to determine how distinctive the language of our corpus is even with respect to a technical-legislative vocabulary. From this comparison we obtained a deviation score ("scarto") indicating how far each form is over- (positive) or under-represented (negative) with respect to the Bolasco (1999) reference model; the higher the deviation, the more distinctive the form is for the analysed text, i.e. the more it characterizes it.
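The indicators reported above are straightforward to compute from a tokenised corpus. The following self-contained base-R sketch shows the computation on a toy token vector; with the paper's figures (N = 150,714 occurrences and V = 8,113 graphic forms) the first ratio gives 5.383, and with V1 hapaxes the second gives the 28.350% reported above.

# Toy tokenised corpus standing in for the 150,714 occurrences of the laws
tokens <- c("comma", "numero", "articolo", "comma", "straniero", "decreto",
            "comma", "stato", "numero", "soggiorno", "ingresso", "territorio")

freq <- table(tokens)
N  <- length(tokens)   # occurrences (tokens)
V  <- length(freq)     # distinct graphic forms (types)
V1 <- sum(freq == 1)   # hapax legomena

c(types_over_tokens_pct = V / N * 100,   # V/N*100 (5.383 for the full corpus)
  hapax_pct             = V1 / V * 100)  # V1/V*100 (28.350 for the full corpus)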
Leaving aside the keywords tied to strictly legal vocabulary (such as decreto, lettera), which already emerged from the frequency hierarchy, the other main dimensions of the text can be analysed: besides the word straniero, the first dimension that emerges is that of the border (ingresso, territorio, frontiera, accesso, durata) and of the exercise of rights (regolamento, autorizzazione, disposizioni). But the largest dimension is the criminal one (pena, delitti, reato, reati, tribunale, sentenza, condanna, violazione, esecuzione). It is striking, on the other hand, that the under-represented words are governo, politica, pubblico and parlamento, i.e. precisely those linked to the legislative dimension. Starting from the hypothesis that the language changed over time, we carried out an analysis of specificities (see Table 1). When a word is over-represented we speak of a characteristic form (or positive specificity); conversely, when it is under-represented we speak of negative specificity. Forms without specificity in a given group are called banal, while those that are specific to no group are considered to belong to the basic vocabulary of the corpus (Bolasco, 1999).

Table 1: Positive specificities by year of legislation. One column per year (1912, 1986, 1990, 1992, 1995, 1998, 2000, 2002, 2004, 2007, 2008, 2009, 2010, 2011), each listing the word forms most over-represented in the laws of that year; for instance, cittadinanza heads the 1912 column and lavoro the 1986 one.
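The authors compute specificities with TaLTAC2; as an analogue, the sketch below implements the hypergeometric specificity score of the lexicometric tradition (Lafon's model), which is one standard way to obtain the positive and negative specificities behind Table 1. All counts in the example call are hypothetical.

specificity <- function(k, n, K, N) {
  # k: occurrences of the word in the sub-corpus (one year of legislation)
  # n: size of the sub-corpus;  K: occurrences in the whole corpus;  N: corpus size
  p_over  <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)  # P(X >= k)
  p_under <- phyper(k, K, N - K, n)                          # P(X <= k)
  if (p_over < p_under) -log10(p_over) else log10(p_under)   # > 0 over-, < 0 under-represented
}

# e.g. a word with 40 of its 120 corpus occurrences in a 5,000-token law
# (counts hypothetical; N is the real corpus size)
specificity(k = 40, n = 5000, K = 120, N = 150714)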
The specificities obtained by analysing the evolution of the language over time show that writing about migration began with citizenship and residence, progressively introduced concepts connected to work and to being a non-EU national, and eventually reached themes of integration on the one hand and of criminalization of the foreigner on the other. The lexical panorama has become richer over time but also more "extreme". This "extremization" could be the result of the different coalitions/majorities, and thus tied not only to a temporal dimension but even more to a political one; for this reason it is worth analysing the two dimensions together.

5. Lexical dimensions
The lexical correspondence analysis (2) was carried out on the first 50 lemmas extracted from the comparison between the verb lemmas of our vocabulary and those of the "lexicon of governmental programmatic discourse". Through correspondence analysis we summarized the diversity of the lexicon used in the different laws with respect to the year and the governing coalition. The first two factorial axes, projected in Figure 1, account for 46% of the explained variability. The first dimension, represented by the first factor, is temporal: with the exception of 1992 and 2007, all the laws approved after 2002 stand in opposition to the earlier ones. The second axis is characterized by the opposition of the Liberal party (Giolitti government, 1911-1914) and the Quadripartito to the other government majorities. The coordinates allow us to project the classes and the graphic forms onto the plane, and their positions allow us to identify and interpret the profiles according to the proximity of the points.

Figure 1: Lexical dimensions, representation of the first factorial plane

Looking at the quadrants in more detail, we can note that in the first, where the Quadripartito and the years 1992 and 2010 are located, the graphic forms that characterize the space refer to the cultural dimension: lingua and conoscenza. The graphic forms projected in the second quadrant, characterized by the years from 2002 onwards and by the Casa delle libertà, the grand coalition and the independents of the Monti technocratic government, mainly express aspects linked to crime (e.g. violazioni, delitti, reato, pena) and to the judiciary (e.g. norme, tribunale, giudice, esecuzione). Straddling the first and second quadrants we also find the centre-right coalition. In the third quadrant we find the years 1986, 1990, 1995, 1998 and 2007, together with the Dini technocratic government, L'Unione and the Pentapartito. The graphic forms in this quadrant identify the characteristics of residence permits: carta, durata, status, temporanea. The border dimension characterizes the fourth quadrant: territorio, frontiera, legale, autorizzazione. Straddling these two dimensions are the worlds of work and of associations, which are an integral part of the migratory path in Italy; it is thus not surprising that they characterize both the third and the fourth quadrant.

(2) Carried out with the SPAD program, specifically with the CORBIT method.
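The correspondence analysis itself was run in SPAD; an analogous first factorial plane can be obtained in R with the CA function of FactoMineR. In the sketch below, the lemma-by-law contingency table lex is a random placeholder for the real table crossing the 50 selected verb lemmas with the laws.

library(FactoMineR)

set.seed(2)
years <- c(1912, 1986, 1990, 1992, 1995, 1998, 2000,
           2002, 2004, 2007, 2008, 2009, 2010, 2011)
lex <- matrix(rpois(50 * 14, lambda = 5), nrow = 50,
              dimnames = list(paste0("lemma", 1:50), paste0("law", years)))

res.ca <- CA(lex, graph = FALSE)
res.ca$eig[1:2, ]          # variance explained by the first two axes (46% in the paper)
plot(res.ca, axes = c(1, 2))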
6. Conclusions
The aim of this work was to explore the legislative panorama on migration from a statistical perspective, in order to extract its characteristics and peculiarities. In this perspective, the linguistic and temporal differences, and above all those between the various executives, represent an interesting pool of information for investigating the semantic evolution of the norms. Although descriptive, this work is particularly important because the choice of a type of language may influence the opinions and attitudes of the Italian population towards foreigners. Our results show that the lexical panorama of Italian immigration legislation from 1912 to 2014 changed considerably. First, the specificities obtained by analysing the evolution of the language over time show that initially, when Italy was a country of emigration, migration legislation was characterized by themes such as citizenship and residence. From the 1980s, with the increase of inward migration flows, concepts connected to work and to being a non-EU national were progressively introduced. At the end of the 1990s, following the sharp increase in arrivals of foreigners in Italy, integration and family reunification entered the discourse. Finally, from the 2000s onwards the process of "criminalization" of the foreigner began, and themes such as security, push-back (respingimento) and removal (allontanamento) entered the specific vocabulary. Second, the factorial correspondence analysis confirmed that from 2002 (the Bossi-Fini law) there was a clear change in the language used in immigration legislation, which is increasingly characterized by themes linked to security and legality. The language used was certainly also influenced by other factors not considered here, such as the transposition of European immigration policies, the international geopolitical situation, and the increase in Islamist terrorist attacks since the attacks in the United States on 11 September 2001. With this work we have outlined a lexical panorama that has changed direction, orienting itself more and more towards themes of regulation and containment (expulsion, removal of irregular migrants). It confirms a negative approach to migration regardless of the government majority.

References
Bolasco, S. (1999). Analisi multidimensionale dei dati. Carocci, Roma.
Colombo, A. and Sciortino, G. (2004). Alcuni problemi di lungo periodo delle politiche migratorie italiane. Le Istituzioni del Federalismo, 5: 763-788.
Nascimbene, B. (1988). Lo Straniero nel diritto italiano. Giuffrè Editore, Milano.
Pastore, F. (2004). A community out of balance: nationality law and migration politics in the history of post-unification Italy. Journal of Modern Italian Studies, 9(1): 27-48.
Solé, C. (2004). Immigration policies in southern Europe. Journal of Ethnic and Migration Studies, 30(6): 1209-1221.
Zincone, G. and Caponio, T. (2004). Immigrant and immigration policy-making: the case of Italy. IMISCOE Working Paper Country Report. IMISCOE, Amsterdam.
Zincone, G. (2006). The making of policies: immigration and immigrants in Italy. Journal of Ethnic and Migration Studies, 32(3): 347-375.

A bibliometric meta-review of performance measurement, appraisal, management research

Massimo Aria1, Corrado Cuccurullo2
1 University of Naples Federico II – aria@unina.it
2 University of Campania L. Vanvitelli – corrado.cuccurullo@unicampania.it

Abstract
Performance measurement, appraisal, and management have become one of the most prominent and relevant research issues in management studies.
The emphasis on empirical contributions has resulted in voluminous and fragmented research streams. Synthesizing the research literature is therefore relevant for effectively using the existing knowledge base, advancing a line of research, and providing evidence-based insights. In this paper, we propose a bibliometric meta-review that offers a different knowledge base for a future research agenda, with implications also for teaching and practice. We analyze the performance management literature through a bibliometric analysis of the reviews recently published (2000-2017) in the scientific journals of domains such as Management, Business and Operations. The main purpose is to map and understand the intellectual structure of the field through co-citation analysis.

Keywords: science mapping, content analysis, bibliometrix, performance measurement.

1. Introduction
Performance measurement, appraisal, and management have become one of the most prominent and relevant research issues in management studies. They are an ongoing topic of conferences, books and journal articles, as well as of professional and popular grey literature. Research on these topics has been conducted in different sectors and for various organizations, including public and professional ones. While the number of academic publications on these topics is increasing at a rapid pace, the emphasis on empirical contributions has resulted in voluminous and fragmented research streams that hamper the ability to accumulate knowledge and actively collect evidence from previous research papers. Literature reviews are thus increasingly assuming a crucial role in synthesizing past research findings, in order to effectively use the existing knowledge base, advance a line of research, and provide evidence-based insight into the practice of exercising and sustaining professional judgment and expertise. Among the different qualitative and quantitative approaches to reviewing, bibliometrics has the potential to introduce a systematic, transparent, and reproducible review process based on the statistical measurement of science, scientists, and scientific activity. In this paper, we propose a bibliometric "review of reviews" (meta-review) that offers a different knowledge base for a future research agenda, with implications also for teaching and practice. The goal of this article is to find a path through, and take stock of, the existing knowledge in performance measurement, appraisal, and management research.

2. Research synthesis on performance measurement, appraisal and management
2.1 Overcoming semantic ambiguity
"Performance" is a complex concept and can be seen from different angles. It is a multi-dimensional construct, the measurement of which varies depending on a variety of factors. For example, it is important to determine whether the measurement objective is to assess performance outcomes or behavior, at the organizational or individual level, in financial or multidimensional terms (e.g. the balanced scorecard framework), as an intermediate or final consequence of a managerial action. In very general terms, performance is the contribution (the result, and the way the result is achieved) that an entity (individual, group of individuals, organizational unit, organization, program, or public policy) provides through its action towards achieving the aims and objectives, and the satisfaction of the needs, for which the organization was formed.
While measurement concerns performance indicators, and appraisal is the process of evaluating the performance of individuals and teams, performance management is a systematic process for improving organizational performance by developing the performance of individuals and teams. It is a means of getting better results by understanding and managing performance within an agreed framework of planned goals, standards and competency requirements.

2.2 The need for a meta-review
In this work we analyze the performance management literature through a bibliometric analysis of the literature reviews recently published (2000-2017) in the scientific journals of domains such as Management, Business and Operations. The main purpose is to map and understand the intellectual structure of this recent and evolving macro-topic through co-citation analysis, highlighting its internal clusters. The main contribution is a better understanding of the state of the art in terms of gaps, divergences, commonalities and tendencies along which the field is developing. We thus provide a map for scholars positioning their future research work, and for teachers introducing so vast a topic to students. This field of research is well suited to a bibliometric meta-review for the following reasons:
1. There is little consensus among scholars. For example, Franco-Santos et al. (2007) counted 17 different definitions of business performance measurement system, while Taticchi et al. (2010) counted almost 25 different frameworks.
2. The field is deeply multidisciplinary. The most widely cited authors come from a variety of disciplinary backgrounds, such as accounting, strategy, operations management and research, and human resources. The diversity of the scholars' backgrounds brings different research questions, theoretical bases and methodological approaches. The functional silos through which research on performance management is developing prevent a coherent and agreed body of knowledge from forming. Understanding in depth the intellectual structure of the field and its evolution is a relevant challenge for researchers.
3. There is a community of dedicated scholars around the world who share the same agenda (cohesion in dominant issues) but use divergent theoretical approaches and methods.
4. The field is still relatively immature. As it is relatively young, its limited professionalization is not surprising. In addition, there is no reference journal comparable to what the Strategic Management Journal is for strategy scholars. Here our study can be contributive, showing the gaps in the literature and providing some guidelines for researchers.
5. Commonly accepted performance management practices do not exist (Richard et al., 2009). In many contexts performance management is dysfunctional, although this problem has been known for more than 50 years (Ridgway, 1956). We still lack robust empirical and theoretical analyses of performance management frameworks and methodologies. Empirical investigations of the performance impact of frameworks, including the most widespread balanced scorecard, have failed to offer uncontroversial findings (Banker et al., 2000; Ittner et al., 2003; Neely et al., 2004). Some authors call for further, longitudinal studies to understand the social influences and implications, but they do not show which paths to follow.
6. Some publications have assumed seminal roles in the evolution of the scientific field. These articles, owing to their impact, are accelerating factors in the development of the field (Berry and Parasuraman, 1993).
It is therefore important to identify the most influential performance management articles published between 1991 and 2010, in order to better understand the state of the art and discover the linkages among authors.
7. The field has an extended spectrum and an increased intensity of research, but a large part of it also confirms the incompleteness and inconsistency of results. There are still various open issues and unsolved problems. This depends on the fragmentation of the field, on the different disciplinary memberships of researchers and on their cultural contexts. This diversity implies the use of different theories and methods and therefore also the emergence of different dominant themes.
8. A profound and rapid evolution is taking place. Not only has research shifted from financial performance to multidimensional performance, but a shift of scholars' attention from organizational to individual performance is under way. Moreover, another significant shift is ongoing: while earlier research was often normative, founded on economic rationality, more recent research is more analytical and explanatory (Cuccurullo et al., 2016).
The overwhelming volume and variety of new information, conceptual developments, and data are the milieu where bibliometrics becomes useful, by providing a structured, more objective and reliable analysis presenting the "big picture" of extant research.

3. Methods
Our bibliometric meta-review is a quantitative research synthesis of the reviews published on the same topic, which we conducted with bibliometrix (Aria and Cuccurullo, 2017), a tool developed in the R language that follows a classic logical bibliometric workflow.

3.1 Data collection
For data retrieval, we used the Social Science Citation Index (Indexes=SCI-EXPANDED, SSCI) of Clarivate Analytics Web of Science, the database of scientific knowledge most used by management scholars (Zupic and Čater, 2015). Our search terms were (TS=(("performance manag*") OR ("performance measur*") OR ("performance apprais*"))). We applied our search keywords to the timespan 2000-2017 and filtered the findings by language (English) and document type (Review), which yielded 783 reviews. We then refined the search by category (Management, Business, or Operations Research & Management Science) and obtained 167 reviews. Finally, we kept only the reviews published in the most authoritative journals, those ranked 3, 4 or 4* by ABS 2015: this excluded 31 journals, corresponding to 50 reviews. Our final dataset comprises 117 reviews.

3.2 Data analysis
Our effort at delineating the intellectual structure of the discipline involves author co-citation analysis (ACA), a bibliometric technique that uses a matrix of co-citation frequencies between authors as its input. This matrix is the basis for various types of analyses. The ability of ACA to reveal patterns of association between authors based on their co-citation frequencies makes it a promising methodology for understanding the evolution of an academic discipline. Authors working in a stream of research often cite one another as well as draw on common sources of knowledge. Further, their works are likely to be frequently co-cited (i.e., cited together) by other authors working on intellectually similar themes. The citations of seminal authors provide a basis for unraveling the complex patterns of association that exist among them, as well as for tracing the changes in intellectual currents taking place over time.
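The workflow just described can be restated compactly with the bibliometrix package cited above (Aria and Cuccurullo, 2017). In the sketch below, the file name is hypothetical and stands for the Web of Science export of the 117 reviews; the function calls follow the package's documented interface.

library(bibliometrix)

# Import the Web of Science export into a bibliographic data frame
M <- convert2df("wos_reviews.bib", dbsource = "isi", format = "bibtex")

results <- biblioAnalysis(M)   # descriptive indicators (in the spirit of Tables 1-3)
summary(results, k = 10)

# Co-citation network of cited references, the basis of Figure 1
NetMatrix <- biblioNetwork(M, analysis = "co-citation",
                           network = "references", sep = ";")
networkPlot(NetMatrix, n = 50, type = "fruchterman",
            Title = "Co-citation network of cited references")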
4. Findings
4.1 Descriptive analysis
Our dataset includes 117 reviews published in 46 journals since 2000 (Tables 1 and 3). They received 105 citations on average (Table 2). They show a fluctuating growth that reaches a peak every 5 years.

Table 1: Main information about the data
Articles: 117
Sources (journals, books, etc.): 46
Keywords Plus: 770; author's keywords: 383
Period: 2000-2017
Average citations per article: 105.1
Authors: 297
Authors of single-authored articles: 10
Co-authors per article: 2.65
Collaboration index: 2.79

Table 2: Top manuscripts per citations (TC = total citations; TC/year = citations per year)
1. Bharadwaj A.S. (2000), MIS Q.: TC 1280, TC/year 71.1
2. Diamantopoulos A. and Siguaw J.A. (2006), Brit. J. Manage.: TC 588, TC/year 49.0
3. Melo M.T. et al. (2009), Eur. J. Oper. Res.: TC 587, TC/year 65.2
4. Zhou P. et al. (2008), Eur. J. Oper. Res.: TC 429, TC/year 42.9
5. Wright P.M. and Boswell W.R. (2002), J. Manage.: TC 379, TC/year 23.7
6. Wright P.M. et al. (2005), Pers. Psychol.: TC 347, TC/year 26.7
7. Zacharatos A. et al. (2005), J. Appl. Psychol.: TC 305, TC/year 23.5
8. Adams R. et al. (2006), Int. J. Manag. Rev.: TC 302, TC/year 25.2
9. Gibson C. and Vermeulen F. (2003), Adm. Sci. Q.: TC 291, TC/year 19.4
10. Cardoen B. et al. (2010), Eur. J. Oper. Res.: TC 288, TC/year 36.0

Table 3: Most relevant sources (number of review articles per journal). J. of Management ranks first with 11 articles; it is followed by tied pairs of journals at ranks 2 (Int. J. of Operations & Production Management and Int. J. of Production Economics), 4 (European J. of Operational Research and Int. J. of Management Reviews), 6 (Int. J. of Human Resource Management and Int. J. of Production Research) and 8 (J. of Business Ethics and Strategic Management J.); six journals share rank 10 with 3 articles each: British J. of Management, J. of Applied Psychology, J. of Management Studies, Management Accounting Research, Omega-Int. J. of Management Science and Supply Chain Management.

4.2 Co-citation network and cluster analysis
The objective of our paper is to identify the intellectual structure of the performance measurement and management field. More specifically, our goals are to (1) delineate the subfields that constitute the intellectual structure of the field; (2) determine the relationships, if any, between the subfields; (3) identify authors who play a pivotal role in bridging two or more conceptual domains of research; and (4) graphically map the intellectual structure in a network space in order to visualize the spatial distances between intellectual themes. In extreme synthesis, Figure 1 shows:
1. A first cluster (red bubbles) represented by works concerning systems of multidimensional performance measurement and evaluation. At its centre we find prominent authors who contributed specific frameworks, such as the balanced scorecard (Kaplan and Norton, 1992, 1996) and the performance prism (Neely et al., 1995). Next to them we find the contribution of Ittner et al. (2003) on one of the great problems of multidimensional measurement: the balance between subjectivity and objectivity. Also central in the cluster is performance system design (Neely et al., 2000). At the upper and lower extremes of the cluster we find two further issues of multidimensional performance systems: strategic alignment (Chenhall, 2005) and guidelines for implementing systems (Bititci et al., 1997).
2. A second cluster (blue bubbles) concerns the currently prevailing perspective in the study of performance measurement and management: the strategic one. In particular, Figure 1 highlights the bridging contributions of two cornerstones of the resource-based view (Barney, 1991; Wernerfelt, 1984).
3. In front of this cluster, in the upper-left part of the map, we find another one (violet bubbles) that deals with theories such as agency theory (Eisenhardt, 1989; Jensen and Meckling, 1976), which constitute the main lens of investigation (Carpenter, 2003), and psychology (Kahneman, 1979).
4. Two other neighbouring clusters, located in the lower-left part of the map, concern human resources. The first (green bubbles) includes almost exclusively works published in Academy of Management journals; their preferred theme is perceptions of organizational performance (Delaney and Huselid, 1996). The second concerns participation in the appraisal process from a psychological perspective (Cawley et al., 1998; Keeping and Levy, 2000).
5. One last cluster is isolated and concerns operations research studies on performance measurement.

Figure 1: Co-citation network of cited references

References
Aria, M. & Cuccurullo, C. (2017). bibliometrix: an R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4): 959-975.
Barney, J. (1991). Firm resources and sustained competitive advantage. Journal of Management, 17(1): 99-120.
Bititci, U. S., Carrie, A. S., & McDevitt, L. (1997). Integrated performance measurement systems: a development guide. International Journal of Operations & Production Management, 17(5): 522-534.
Cawley, B. D., Keeping, L. M., & Levy, P. E. (1998). Participation in the performance appraisal process and employee reactions: a meta-analytic review of field investigations. Journal of Applied Psychology, 83(4): 615.
Chenhall, R. H. (2005). Integrative strategic performance measurement systems, strategic alignment of manufacturing, learning and strategic outcomes: an exploratory study. Accounting, Organizations and Society, 30(5): 395-422.
Cuccurullo, C., Aria, M., & Sarto, F. (2016). Foundations and trends in performance management: a twenty-five years bibliometric analysis in business and public administration domains. Scientometrics.
Delaney, J. T., & Huselid, M. A. (1996). The impact of human resource management practices on perceptions of organizational performance. Academy of Management Journal, 39(4): 949-969.
Eisenhardt, K. M. (1989). Agency theory: an assessment and review. Academy of Management Review, 14(1): 57-74.
Ittner, C. D., Larcker, D. F., & Meyer, M. W. (2003). Subjectivity and the weighting of performance measures: evidence from a balanced scorecard. The Accounting Review, 78(3): 725-758.
Jensen, M. C., & Meckling, W. H. (1976). Theory of the firm: managerial behavior, agency costs and ownership structure. Journal of Financial Economics, 3(4): 305-360.
Kaplan, R. S., & Norton, D. P. (1992). The balanced scorecard: measures that drive performance. Harvard Business Review, 70(1): 71-79.
Kaplan, R. S., & Norton, D. P. (1996). Using the balanced scorecard as a strategic management system. Harvard Business Review, 74(1): 75-85.
Keeping, L. M., & Levy, P. E. (2000). Performance appraisal reactions: measurement, modeling, and method bias. Journal of Applied Psychology, 85(5): 708.
Neely, A. (2005). The evolution of performance measurement research: developments in the last decade and a research agenda for the next. International Journal of Operations & Production Management, 25(12): 1264-1277.
Neely, A., Gregory, M., & Platts, K. (1995). Performance measurement system design: a literature review and research agenda. International Journal of Operations & Production Management, 15(4): 80-116.
Neely, A., Mills, J., Platts, K., Richards, H., Gregory, M., Bourne, M., & Kennerley, M. (2000).
Performance measurement system design: developing and testing a process-based approach. International Journal of Operations & Production Management, 20(10): 1119-1145.
Wernerfelt, B. (1984). A resource-based view of the firm. Strategic Management Journal, 5(2): 171-180.

Textual Analysis of Extremist Propaganda and Counter-Narrative: a quanti-quali investigation

Laura Ascone
Université de Cergy-Pontoise – laura.ascone@etu.u-cergy.fr

Abstract
This paper investigates the rhetorical strategies of jihadist propaganda and counter-narrative in English and French. Since jihadist propaganda aims both at persuading the Islamic State's sympathisers and at threatening its enemies, attention was focused on the way threat and persuasion are verbalised. As far as jihadist propaganda is concerned, the study was conducted on the Islamic State's two official online magazines: Dabiq, published in English, and Dar al-Islam, published in French. As for the counter-narrative, the corpus was composed of the articles published on the main English and French governmental websites. Combining quantitative and qualitative approaches allowed us to examine the general characteristics as well as the specificities of both jihadist propaganda and counter-narrative. The software Tropes was used to analyse the corpora from a semantic-pragmatic perspective. The statistical validity of the results was then verified and synthesised with the software packages Iramuteq and R. This study revealed that the rhetorical strategies vary both between jihadist propaganda and counter-narrative and between French and English.

Keywords: jihadist propaganda, counter-narrative, discourse analysis, threat, persuasion.

1. Introduction
The recent terrorist attacks by Daesh in Western countries have led researchers and experts to examine the islamisation of radicalism (Roy, 2016). Different studies have been conducted on the psychosociological contexts that may lead someone to adhere to the jihadist ideology (Benslama, 2016; Khosrokhavar, 2014), as well as on the role played by the Internet in the radicalisation process (Von Behr, 2013). Yet, even though terrorism would not exist without communication (McLuhan, 1978), the rhetorical strategies of jihadist propaganda have been neglected and remain unexplored. This research investigates the rhetorical strategies of both jihadist propaganda and counter-narrative published on the Internet in English and French. More precisely, the analysis focuses on the way threat and persuasion are expressed in jihadist discourse, as well as on the way the French government and international institutions face and counter jihadist propaganda. From a linguistic perspective, threat and persuasion are complex speech acts; therefore, pragmatics and, more specifically, Searle's (1969) speech act theory constituted the basis of this study. As far as jihadist propaganda is concerned, the analysis was conducted on the Islamic State's two official online magazines: Dabiq, published in English, and Dar al-Islam, published in French. As for the counter-narrative, the corpus was composed of the articles published on the main French and English institutional websites such as stopdjihadism.fr or counterjihadreport.com. The fact that jihadist propaganda and counter-narrative address different readerships led us to hypothesise that differences in both content and form might be identified between the two magazines, as well as among the different governmental websites.
Combining quantitative and qualitative approaches (Garric and Longhi, 2012; Rastier, 2011), that is, lexicometry and textometry for the quantitative approach, and the interpretation of the text in the light of the ideology behind it for the qualitative one, made it possible to examine the general characteristics as well as the specificities of both jihadist propaganda and counter-narrative. Following Marchand's (2014) work, the software Tropes was used to analyse the corpora from a semantic-pragmatic perspective. The results were then investigated qualitatively, and their statistical validity verified with the software packages Iramuteq and R. The combination of these two approaches made it possible to overcome the limitations imposed both by the software's automatic analysis and by the subjectivity of qualitative interpretation. By comparing the rhetorical strategies used in jihadist propaganda (Huyghe, 2011) and in counter-narrative, the aim of this research was to identify the linguistic differences between these two discourses and these two languages, in order to determine the rhetorical strategies that might prove efficient in countering jihadist propaganda. After the rhetorical pattern of jihadist propaganda has been presented, the linguistic characteristics of English and French counter-narratives will be examined. The jihadist and governmental rhetorical strategies will then be contrasted.

2. Corpus and methodology
2.1. Jihadist propaganda
The analysis of the rhetorical strategies in jihadist propaganda was conducted on Daesh's official online magazines Dabiq, published in English, and Dar al-Islam, published in French. Since these two magazines address a readership that has already adhered to the jihadist ideology, their goal is both to reinforce the reader's adhesion and to incite him/her to act in the name of the jihadist ideology. The reader is thus incited to adopt the behaviour a good Muslim should have, and to take revenge on those whom Daesh presents as responsible for the Muslims' humiliation, that is, the West. As far as Dabiq is concerned, the corpus investigated was composed of all the articles published in the first fourteen issues (i.e. 377,450 words). As for Dar al-Islam, the analysis was conducted on the first nine issues (229,762 words). To analyse the rhetorical strategies used in jihadist propaganda, a quanti-qualitative approach was adopted (Garric and Longhi, 2012; Rastier, 2011). More precisely, this iterative approach was composed of five stages. A first qualitative analysis of the jihadist ideology, the radicalisation process, and the linguistic characteristics of hate speech and propagandistic discourse was essential to the understanding of jihadist discourse as well as to the formulation of our first hypotheses. The second stage corresponded to a quantitative analysis whose goal was to verify the validity of these hypotheses: the corpus was examined with the software Tropes, which makes it possible to investigate a text from a semantic perspective. More precisely, based on a pre-established lexicon, the software identifies the themes tackled in the text and shows how these themes are linked to one another. The most frequent themes in both magazines are religion and conflict. However, in order to study the way threat and persuasion are expressed in the two corpora, a deeper qualitative analysis was conducted on the theme sentiment for the French corpus and feeling for the English one (third stage).
In other terms, the quantitative analysis constituted the basis for a qualitative study, which was then conducted only on the expressions conveying feelings. The nine issues of the French magazine count 318 sentiment-expressions, whereas Dabiq counts 705 feeling-expressions; because of this difference in corpus size, a normalisation was applied in order to make the results comparable. Then, a quantitative analysis was conducted with the software Iramuteq, an interface to the software R which performs statistical analyses of textual data based on Reinert's classification method. This way, it was possible to test the hypotheses and results issued by the qualitative study (fourth stage). Furthermore, a qualitative manual analysis of the first issue of both Dabiq and Dar al-Islam made it possible to identify the propositions conveying threat, persuasion, obligation, prohibition, and reward that had not been detected by the software Iramuteq. This way, it was possible to provide a lexicon specific to the corpus under investigation, covering items that the software had missed because of the special features of jihadist discourse (fifth stage). The combination and alternation of quantitative and qualitative approaches made it possible to examine Daesh's discourse in relation to the context in which it is produced (Valette and Rastier, 2006).

2.2. Counter-narrative
The analysis of the rhetorical strategies in French and English counter-narratives was conducted on the main governmental and institutional websites. The French corpus was composed of the articles published on www.stop-djihadisme.gouv.fr (the platform created after the first terrorist attacks in France in 2015), www.interieur.gouv.fr (the website of the Ministry of the Interior), and www.cpdsi.fr (the website of the Centre de Prévention contre les Dérives Sectaires liées à l'Islam). The corpus counts 115,950 words. As far as the English corpus is concerned, it was composed of the articles published on www.counterjihadreport.com (a news-aggregating website), www.consilium.europa.eu (the website of the European Council and of the Council of the European Union), www.ec.europa.eu (the website of the European Commission), and on the Radicalisation Awareness Network (a specific section of the website of the European Commission). The corpus counts 116,000 words. In order to conduct comparable analyses, the same quanti-qualitative approach was adopted. The qualitative analysis of the geopolitical context and of the different campaigns used to face and counter jihadist radicalisation was essential to the understanding of both French and English counter-narratives (first stage). Then, a quantitative analysis was conducted with the software Tropes, which made it possible to identify the most frequent themes. The themes religion and droit ("law") were the most present in the French corpus, whereas the themes education and communication were the most frequent in the English one (second stage). The third stage corresponded to the qualitative analysis that was conducted on the category sentiment for the French corpus (292 propositions) and feeling for the English one (370 propositions). A normalisation was then applied to compare jihadist and governmental discourse. A second quantitative analysis was then conducted with the software packages Iramuteq and R to test the results issued by the qualitative study (fourth stage).
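The normalisation step can be made concrete with the counts reported above. The following is a minimal sketch, assuming the rates are expressed per 10,000 words; the exact basis of the normalisation is not stated in the paper, so the per-10,000 convention is an illustrative choice.

```python
# Hypothetical illustration of the normalisation step: the raw counts of
# emotion expressions come from corpora of different sizes, so we rescale
# them to a common basis (here, a rate per 10,000 words).

corpora = {
    "Dar al-Islam (sentiment)": (318, 229_762),  # expressions, corpus size in words
    "Dabiq (feeling)":          (705, 377_450),
}

for name, (expressions, words) in corpora.items():
    rate = expressions / words * 10_000
    print(f"{name}: {rate:.1f} expressions per 10,000 words")

# Dar al-Islam: ~13.8 vs Dabiq: ~18.7, so the raw gap (318 vs 705)
# narrows considerably once corpus size is taken into account.
```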
The results of the analyses on jihadist propaganda and counter-narrative were then contrasted in order to compare the rhetorical strategies used in the two discourses.

3. The rhetorical strategies used in French and English jihadist propaganda
The quantitative analysis conducted with the software Tropes, and the qualitative study conducted on the categories sentiment and feeling, revealed the components of jihadist discourse. The propaganda of the Islamic State is based on five key concepts: threat, persuasion, reward, obligation, and prohibition. The assessment of inter-judge agreement was necessary to determine these five concepts as well as to categorise the different propositions selected by Tropes as objectively as possible. Each category was examined from both a quantitative perspective (i.e., its identification and distribution in the two magazines, Dabiq and Dar al-Islam, using the software packages Tropes and Iramuteq, and the corpus analysis toolkit AntConc) and a qualitative one (i.e., analysing each concept in relation to the context in which it was produced). Yet, these five concepts are not independent from one another. Rather, they are strongly linked.

Figure 1: the rhetorical pattern of jihadist propaganda

Figure 1 shows the rhetorical pattern of jihadist discourse. Since Dabiq and Dar al-Islam aim at manipulating the reader's behaviour, jihadist propaganda is based on obligations and prohibitions. Rewards, as well as guilty feelings towards the Muslims living in the Middle East, aim at leading the reader to respect these prescriptions. Not respecting them would mean facing negative consequences. Threat may then be expressed against the members of the Islamic State themselves and, more generally, against any Muslim. Obligations are also exploited to impose on the readership a hostile and violent attitude against Western countries, which is justified by the feeling of victimisation. Fighting against the Muslims' enemy is presented by jihadists as a heroic and valorising action and, therefore, a persuasive one. Furthermore, attractive factors are not only rewards for the reader's obedience; they are sometimes presented as independent of the reader's behaviour. In other terms, persuasion is presented as a positive and valorising act that, contrary to rewards, does not depend on whether or not the reader respects the imposed prescriptions. The sentence "Jihad is necessary to obtain Allah's forgiveness", for instance, presents an obligation ("it is necessary") and a reward that will be granted if the obligation is respected ("to obtain Allah's forgiveness"). However, this sentence expresses more than an obligation and a reward. Jihad, which is interpreted as attractive by jihadists, tends to be associated with terrorist attacks and, consequently, will be perceived as threatening by Western countries. Furthermore, this sentence implies that if the obligation is not respected, the individual will not obtain Allah's forgiveness. In other terms, this sentence indirectly expresses a threat against the readership too.

4. The rhetorical strategies used in French and English counter-narratives
The large number of Daesh's sympathisers and foreign fighters shows that the communicative and rhetorical strategies adopted in Daesh's propaganda have an important, persuasive impact on the readership.
On the contrary, the counter-narrative produced by the different governments to face and counter jihadist propaganda has been criticised for not being as efficient as jihadist propaganda. In the French corpus, 292 propositions conveying sentiment ("feeling") were identified, whereas 370 propositions conveying feelings were identified in the English one. The frequency of the five categories (i.e. of the propositions conveying threat, persuasion, reward, obligation, and prohibition) was calculated in the French and English corpora. The reward category is the only one that was more present in the French corpus than in the English one. Contrary to the Islamic State's propaganda, the propositions conveying rewards and prohibitions are almost absent in both French and English counter-narratives. What these two discourses have in common, on the other hand, is the high frequency of propositions conveying threat (Example 1).

1. "Terrorist groups will continue to exploit the refugee crisis in their propaganda, seeking to portray Western mistreatment of Muslims, and inciting fear by alleging that their supporters are being smuggled in amongst genuine refugees." (RAN website)

As Example 1 shows, threat tends to be associated with the other (i.e., the Islamic State), which implies that Western countries are presented as victims of the Islamic State. In the English corpus, 355 occurrences of the word victim(s) were identified. The corpus analysis toolkit AntConc showed that the most frequent collocate of this term is the word terrorism (57 co-occurrences). On the contrary, the French corpus, where the word victime/s occurs only 70 times, presents only 2 co-occurrences with the term terrorisme. Rather, French counter-narrative tends to talk about rescuing and helping victims (secours/aide aux victimes). Furthermore, differences were identified between websites in the same language. Figure 2 shows the under- and overuse of the most representative terms in two French governmental websites: stopdjihadisme and CPDSI. More precisely, based on a Chi2 dependence test, the graph shows the words that are significantly associated or "anti-associated" with the two websites. The figure revealed that the CPDSI website focuses more on the religious dimension. The words islam, jihad and jihadiste ("jihadist") are significantly associated with this sub-corpus. This implies that jihad and jihadiste are presented and interpreted as religious terms. On the contrary, the website of the stopdjihadisme campaign is characterised by an overuse of the words terroriste ("terrorist"), terrorisme ("terrorism"), Syrie ("Syria"), radicalisation ("radicalisation"), Irak ("Iraq"), français ("French"), and France ("France"). The overuse of these specific terms shows that the campaign and, consequently, its website focus more on the geopolitical dimension, where the radicalisation process is presented in relation to terrorism rather than to Islam.

Figure 2: under- and overuse of some key terms in French counter-narrative
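The kind of Chi2 dependence test behind Figure 2 can be sketched as a standard test of independence on a 2x2 contingency table of token counts. All counts in the sketch below are invented placeholders, not the paper's data, and scipy's chi2_contingency is used as a stand-in for the computation Iramuteq performs internally.

```python
# Does the word "jihad" occur significantly more often in one sub-corpus
# than in the other? Rows: the two sub-corpora; columns: occurrences of
# the target word vs. all other tokens. Counts are hypothetical.
from scipy.stats import chi2_contingency

observed = [
    [120, 115_830],  # CPDSI sub-corpus (hypothetical counts)
    [35,  115_965],  # stopdjihadisme sub-corpus (hypothetical counts)
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
# A small p-value indicates the term is significantly associated with
# (overused in) one of the two sub-corpora.
```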
5. Conclusion
This comparative analysis revealed that jihadist discourse and counter-narrative present both similarities and differences. As far as the differences are concerned, the frequency of the propositions conveying threats, persuasion, prohibitions, obligations, and rewards varied between these two discourses: they were more frequent in counter-narrative than in jihadist propaganda. The Islamic State's propaganda aims at reinforcing the reader's adhesion to the jihadist ideology, and at inciting him/her to act against its enemies in the name of that ideology. On the contrary, counter-narrative does not aim at reinforcing an ideology; rather, it aims at countering jihadist radicalisation. This difference was confirmed by the variation of the category frequencies between jihadist propaganda and counter-narrative. Despite this crucial difference, similarities between these two discourses were identified. More precisely, both discourses present the respective speakers' communities as victims of the other and, consequently, incite the readership to fight, whether violently or not, against the enemy. As far as the methodology is concerned, the procedures adopted made it possible to investigate the general and special features of both jihadist and governmental discourses. The results obtained in the quantitative analysis constituted the starting point for a qualitative analysis, which made it possible to identify the features that had not been detected by the software as well as to refine Tropes's pre-established lexicon.

References
Angenot, M. (2008). Dialogue de sourds. Traité de rhétorique antilogique. Paris: Mille et une nuits.
Benslama, F. (2016). Un furieux désir de sacrifice : le surmusulman. Paris: Editions du Seuil.
Garric, N., & Longhi, J. (2012). L'analyse de corpus face à l'hétérogénéité des données : d'une difficulté méthodologique à une nécessité épistémologique. Langage, (3): 3-11.
Huyghe, F.-B. (2011). Terrorismes : violence et propagande. Paris: Gallimard.
Khosrokhavar, F. (2014). Radicalisation. Paris: Editions de la maison des sciences de l'homme.
Marchand, P. (2014). Analyse avec IRaMuTeQ de dialogues en situation de négociation de crise : le cas Mohammed Merah. Communication présentée aux 12es Journées Internationales d'Analyse statistique des Données Textuelles, Paris.
McLuhan, M. (1978). The brain and the media: The "Western" hemisphere. Journal of Communication, 28(4): 54-60.
Rastier, F. (2011). La mesure et le grain : sémantique de corpus. Paris: Champion; diff. Slatkine.
Roy, O. (2016). Le djihad et la mort. Paris: Le Seuil.
Searle, J. (1969). Speech acts: an essay in the philosophy of language. London: Cambridge University Press.
Valette, M., & Rastier, F. (2006). Prévenir le racisme et la xénophobie : propositions de linguistes. Langues modernes, 100(2), 68.
Von Behr, I. (2013). Radicalisation in the digital era: the use of the Internet in 15 cases of terrorism and extremism.

Textual Data Analysis Applied to Security and Criminal Investigation Issues

Laura Ascone¹, Lucie Gianola¹
¹ AGORA, Université de Cergy-Pontoise – laura.ascone@etu.u-cergy.fr, lucie.gianola@u-cergy.fr

Abstract
This presentation investigates two cases of textual analysis applied to security contexts: the analysis of the rhetorical strategies adopted in the Islamic State's official online magazines, Dabiq, published in English, and Dar al-Islam, published in French; and the use of methods for the automatic extraction of named entities, together with the design of a textual exploration software tool for criminal analysis.
Keywords: textual data analysis, radicalisation, criminal analysis

1. Introduction
The rise of security concerns linked to the terrorist acts perpetrated around the world since the beginning of the 21st century has pushed researchers and public and social actors to look for new ways of analysing this phenomenon. In France, the humanities and social sciences have taken up the question, as shown by the organisation of several workshops on the topic ("Nouvelles figures de la radicalisation", Toulouse, April 2017; "Les SHS face à la menace", Cergy, September 2017; "Des sciences sociales en état d'urgence : islam et crise politique", Paris, December 2017). In this article we present two studies related to these security concerns: a study of Daesh's rhetoric from the point of view of its appeal to emotions in the magazines Dabiq (English) and Dar al-Islam (French), and a collaboration between the Pôle Judiciaire de la Gendarmerie Nationale (PJGN) and the Université de Cergy-Pontoise aimed at providing criminal-analyst teams with new tools for the textual analysis of judicial case files. The phenomenon of jihadist radicalisation has led researchers and practitioners to examine the psychosociological reasons underlying adhesion to the jihadist ideology (Khosrokhavar, 2014), as well as the strategies adopted by the extremist group to spread its propaganda messages (Lombardi, 2015). However, although they play a crucial role in jihadist propaganda, the rhetorical strategies aimed at threatening or persuading the various readers remain unexplored. The first part of this study therefore presents a quanti-qualitative analysis of the rhetorical pattern and of the emotions on which jihadist propaganda is based. In line with Marchand's (2014) work, the software packages Iramuteq and Tropes were used to study the corpus from a quantitative point of view. The results of this quantitative analysis then constituted the starting point of a qualitative analysis of the excerpts expressing emotions, in order to examine in more detail the rhetorical strategies of jihadist propaganda. The case of the analysis of judicial case files confronts us with a typical information-extraction problem involving the automatic recognition of named entities: our research work notably consists in designing the foundations of an ad hoc textual navigation tool. Although the needs of criminal analysts are similar to those of other application domains (voice-of-the-customer analysis, biomedical natural language processing, etc.), the context of the judicial investigation imposes new precision constraints on the extraction and on the way results are made available to the expert, that is, to the criminal analyst.
The social and institutional need for new approaches to documents of varied origins in judicial and security contexts allows us to demonstrate, through these two case studies, the relevance of already proven textual data analysis methods.

2. Description of jihadist rhetoric: the case of the magazines Dabiq and Dar al-Islam
2.1. Corpus and methodology
This research was conducted on Daesh's two magazines: Dabiq, published in English, and Dar al-Islam, published in French. Dabiq addresses Daesh's non-Arabic-speaking sympathisers, whereas Dar al-Islam, which is not a translation of Dabiq, addresses an exclusively French-speaking readership. This distinction led us to hypothesise that the two magazines differ in their content as well as in the form of the message they carry. However, both address a readership that has already adhered to the Islamist ideology. Their objective is therefore not to persuade the reader to approach Islamism, but to reinforce his or her adhesion and to lead him or her to act in the name of this ideology. To analyse the rhetorical strategies of jihadist discourse, a quanti-qualitative approach was adopted (Rastier, 2011). More precisely, this iterative approach consisted of four stages. A first qualitative analysis of the jihadist ideology, of the radicalisation process and of the linguistic characteristics of hate speech was essential to the understanding of jihadist discourse and to the formulation of the first hypotheses. The second stage corresponds to a quantitative analysis that made it possible to verify these hypotheses: the corpus was examined with the software Tropes (Ghiglione et al., 1998), which analyses a text from a semantic-pragmatic point of view on the basis of a pre-established lexicon, and identifies the most recurrent themes in the corpus as well as the way these themes are linked to one another. In order to analyse how jihadist discourse manages to persuade and threaten its various readers (Giro, 2014), a qualitative analysis was conducted on the themes sentiment, for the French corpus, and feeling, for the English corpus (third stage). In other words, the quantitative analysis constituted the starting point for a qualitative study, which was thus conducted on the utterances expressing emotions and feelings (Caffi and Janney, 1994). Finally, a last quantitative analysis was conducted with the software Iramuteq (Ratinaud and Marchand, 2012) which, based on the Reinert method, makes it possible, for example, to determine the under- and overuse of certain terms within the different corpora (fourth stage). The combination of qualitative and quantitative approaches made it possible to examine jihadist discourse in relation to the context in which it was produced (Valette and Rastier, 2006).

2.2. Results
The analysis of the utterances expressing emotions and feelings in Daesh's two official magazines made it possible to determine the rhetorical pattern on which jihadist propaganda is built. Since the objective of Dabiq and Dar al-Islam is to manipulate the reader's behaviour, Daesh's propaganda is founded on the imposition of obligations and prohibitions. The granting of rewards as well as the feeling of guilt aim to lead the reader to respect these prescriptions.
Conversely, any Muslim who does not respect these prescriptions will suffer negative consequences: he will be judged an apostate and will therefore be considered an enemy. Here we find the threat expressed by Daesh against Muslims themselves. Obligations are also exploited to impose on the reader a violent action against the West, justified and fuelled by the feeling of victimisation. Fighting the enemy is presented as a heroic and valorising action. By taking part in the fight against the West, the reader will have the impression of becoming a hero fighting in the name of a just and noble cause (De Bonis, 2015), and of seeing his weaknesses disappear (Rumman, Suliman et al., 2016). Moreover, by quoting Quranic verses concerning the victory of the Muslims, the author assures the reader that the Muslim community will be victorious over the enemy; the following excerpt is an example: "Allah will punish them by your hands, will cover them with ignominy, will give you victory over them and will heal the breasts of a believing people" (Dar al-Islam, no. 8). Victory over the enemy is perceived by jihadists as persuasive. However, this utterance, perceived as persuasive by jihadists, will be perceived as threatening by the West. Likewise, jihad, which is interpreted as persuasive by the members of the jihadist group since it gives access to Paradise, tends to be associated with terrorist attacks and thus to be perceived as threatening by Westerners. This double interpretation matches the definition of Perelman and Olbrechts-Tyteca (1988), who propose "to call persuasive an argumentation that claims validity only for a particular audience" (p. 36). Although Dabiq and Dar al-Islam present the same rhetorical pattern, their content varies considerably. This study revealed, for example, that the French magazine focuses its discourse on the figure of the other (i.e., of the enemy), whereas the English magazine focuses on the figure of the Muslim and, more particularly, on the conduct a good Muslim should adopt.

3. Textual analysis of judicial case files
Within an investigation team, the work of criminal analysts consists in reading and synthesising the documents of a case file (witness interviews, telephone and banking data, expert reports, etc.) in order to provide investigators and magistrates with a more global view of the information collected, by means of representation diagrams and syntheses (Rossy, 2011). Their intervention is required in complex cases such as cold cases or cases involving large networks, and provides investigators with new leads. At present, analysts rely on optical character recognition software, standard office tools (word processor, spreadsheet), and IBM's graphical representation software Analyst's Notebook. This tooling does not spare them a precise and time-consuming reading of the case file, aimed among other things at manually spotting and extracting the information relevant to the investigation, grouped into different types of entities which, once extracted, are arranged into graphical representations (chronological or relational).

3.1. Working corpus
The working corpus made available to us by the PJGN is a complete judicial case file, tried and solved, concerning a homicide.
Like any judicial case file, the dossier gathers a variety of documents: expert reports, investigation reports, records of interviews of witnesses and suspects, itemised telephone bills, banking data, photo boards, etc. We chose to focus our work on the sub-corpus composed of the interviews of witnesses and of persons held in custody. This choice was made as we became acquainted with the corpus and the domain, interviews representing the densest and least accessible mass of information in a case file: the number of interviews (in our case, 370 interviews for about 600,000 words) and their lack of structure hinder their processing with standard tools, unlike, for example, telephone data, which can be imported as such into Analyst's Notebook, or other data collected by the gendarmerie in structured forms.

3.2. Automatic named-entity detection
The notion of entity in criminal analysis corresponds to the notion of named entity (NE) in information extraction: a mono-referential linguistic unit that has the capacity to refer to a unique referent (Nouvel et al., 2015). In general, five types of entities are of interest to criminal analysts: persons, places, dates and times, vehicles, and telephone numbers. We set out to apply proven NE detection techniques to the documents of judicial case files, varying the approaches so as to best meet the constraints of each entity type. Two functionalities of the UNITEX software (Paumier, 2016) were put to work: the editing of grammars for the detection of dates, the use of a lexicon for the detection of cities, and the combination of a first-name lexicon and rules for person names. Telephone numbers, for their part, are detected with a regular expression. As things stand, we are therefore able to detect:
• normalised dates: "le 10 janvier 2017", "l'an deux mille dix-sept, le dix janvier", "le 10/01/2017";
• first and last names of persons: "Blanche Rivière", "Petit Noémie", "Michel E. Dupont";
• more than 36,000 cities listed in a lexicon [1].

[1] Available (January 2018) at: http://sql.sh/736-base-donnees-villes-francaises

The development of an approach for detecting vehicles remains open, because their mentions in the corpus combine several types of information:
• kind of vehicle: motorbike, scooter, van, car, etc.;
• make;
• mention of the model or of a body type (4X4, city car, saloon, estate, etc.);
• colours and distinctive signs (rust, livery, model year, etc.).
The delimitation of a vehicle mention cannot be reduced to the combination of a make and a model, as the following two examples from the corpus show:
• "Il s'agit d'un petit modèle comme une TWINGO pour vous donner le volume. Il était de couleur orangé. Il est petit car il a un petit coffre."
• "M. X. m'a cependant parlé d'un véhicule 4X4 conduit par un individu qui avait un fusil."
Vehicle detection will therefore lead us to consider a more complex detection approach than those already in place.
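As an illustration of the kind of patterns involved, the sketch below gives deliberately simplified Python regular expressions for two of the entity types. The authors' actual detection relies on UNITEX grammars and lexicons, so these patterns are illustrative assumptions, not their rules.

```python
# Simplified stand-ins for two detection rules: numeric and letter-month
# dates introduced by "le", and ten-digit French telephone numbers with
# optional separators.
import re

DATE = re.compile(
    r"\ble\s+\d{1,2}(?:/\d{2}/\d{4}|\s+"
    r"(?:janvier|février|mars|avril|mai|juin|juillet|août|"
    r"septembre|octobre|novembre|décembre)\s+\d{4})\b",
    re.IGNORECASE)

PHONE = re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b")

text = ("Entendu le 10 janvier 2017. Le témoin, joignable au 06 12 34 56 78, "
        "confirme les faits le 10/01/2017.")
print(DATE.findall(text))   # ['le 10 janvier 2017', 'le 10/01/2017']
print(PHONE.findall(text))  # ['06 12 34 56 78']
```

Spelled-out dates such as "l'an deux mille dix-sept, le dix janvier" are exactly what motivates grammar-based detection rather than plain regular expressions.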
3.3. Textual data analysis and criminal analysis: one and the same problem?
While automatic named-entity detection in the context of criminal analysis in the gendarmerie constitutes a standard NLP task, its potential contributions cannot be confined to purely technical aspects. The working methodology of criminal analysis rests on human interpretation for the production of hypotheses, and in this respect we relate it to textual data analysis (ADT) as defined by Ho-Dinh (2017): "With ADT, we place ourselves on the contrary in a perspective of knowledge construction, through the human interpretation of the results obtained thanks to computing tools for calculation and visualisation. Computing power thus comes in support of data exploration and mining. This fundamental difference makes it possible to produce qualitative, and not only quantitative, knowledge about the data." The continuation of our work is therefore oriented not only towards improving entity-detection results and introducing statistical approaches (TF-IDF, document clustering, etc.), but also towards developing a dedicated textual exploration interface, taking into account the specificities of the textual genre of the judicial case file (sorting the text according to its nature: header text, civil-status information) and allowing efficient navigation between detected entities, statistical measures and the original text. The methodology of criminal analysis and the practices of the profession might have to be revised accordingly, implying a phase of training criminal analysts in textometric methods.

4. Conclusion
We believe we have raised theoretical and technical perspectives for textual data analysis in the judicial and security domains, pertaining to discourse analysis as well as to NLP and textometry. In the case of Daesh's propaganda, the analysis and understanding of jihadist discourse could contribute to the formulation of a counter-discourse capable of facing and countering jihadist propaganda. Concerning textual analysis practices in criminal analysis, we hope that the implementation of automation techniques and of a textual exploration tool will make it possible to rethink the way information is accessed in criminal analysis, and will be the first step of a broader reflection on the collection and circulation of information and documents in the judicial process. These two case studies illustrate the relevance of humanities and social sciences approaches in the security and judicial context, which until now has mostly drawn on expertise from the so-called "hard" sciences (forensic medicine, biology, chemistry, computer science, etc.), grouped under the label of "forensic sciences". We hope that such contributions will strengthen ties and open the way to other projects associating judicial and defence institutions with researchers in the humanities and social sciences.

References
Caffi, C., & Janney, R. W. (1994). Toward a pragmatics of emotive communication. Journal of Pragmatics, 22(3), 325-373.
De Bonis, M. (2015). La strategia della paura. Limes, 11.
Ghiglione, R., Landré, A., Bromberg, M., & Molette, P. (1998). L'analyse automatique des contenus. Paris: Dunod.
Giro, M. (2015). Parigi: il branco di lupi, lo Stato Islamico e quello che possiamo fare. Limes.
Ho Dinh, O. (2017).
Caractérisation différentielle de forums de discussion sur le VIH en vietnamien et en français. Doctoral thesis, Inalco, Paris.
Marchand, P. (2014). Analyse avec Iramuteq de dialogues en situation de négociation de crise : le cas Mohammed Merah. Actes des 12èmes Journées internationales d'Analyse statistique des Données Textuelles (JADT), Paris, pp. 457-471.
Nouvel, D., Ehrmann, M., & Rosset, S. (2015). Les entités nommées pour le traitement automatique des langues. ISTE Editions.
Paumier, S. (2016). Unitex 3.1 user manual. http://www-igm.univ-mlv.fr/unitex
Perelman, C., & Olbrechts-Tyteca, L. (1988, 5th ed.). Traité de l'argumentation. Bruxelles: Editions de l'Université de Bruxelles.
Rastier, F. (2011). La mesure et le grain : sémantique de corpus. Paris: Champion; diff. Slatkine.
Ratinaud, P., & Marchand, P. (2012). Application de la méthode ALCESTE à de "gros" corpus et stabilité des "mondes lexicaux" : analyse du "CableGate" avec IraMuTeQ. Actes des 11èmes Journées internationales d'Analyse statistique des Données Textuelles (JADT), Liège, 13-15 June, pp. 835-844.
Rossy, Q. (2011). Méthodes de visualisation en analyse criminelle : approche générale de conception des schémas relationnels et développement d'un catalogue de patterns. Doctoral thesis, Université de Lausanne, Faculté de droit et des sciences criminelles.
Rumman, A., Suliman, M., et al. (2016). The Secret of Attraction: ISIS Propaganda and Recruitment. Translated by Ward, W. J., et al. Amman: Friedrich-Ebert-Stiftung.
Valette, M., & Rastier, F. (2006). Prévenir le racisme et la xénophobie : propositions de linguistes. Langues modernes, 100(2), 68.

A two-step strategy for improving categorisation of short texts

Simona Balbi¹, Michelangelo Misuraca², Maria Spano¹
¹ Università di Napoli Federico II – simona.balbi@unina.it, maria.spano@unina.it
² Università della Calabria – michelangelo.misuraca@unical.it

Abstract
Text categorisation allows a collection of documents to be organised with respect to their content. When we consider short texts, e.g. posts and comments shared on social media, this task is harder to achieve because we have few significant terms. Referring to higher-level structures that represent the concepts, or topics, occurring in the collection can improve the effectiveness of the procedure. In this paper, we propose a novel two-step strategy for text categorisation, in the frame of feature extraction. Concepts are identified by using network analysis tools, namely community detection algorithms. It is then possible to organise the document collection with respect to the different concepts and to describe the groups of documents with respect to terms. A case study about Pope Francis on Twitter is presented to show the effectiveness of our proposal.

Keywords: short texts, text categorisation, textual network, community detection

1. Introduction
The ever-increasing popularity of the Internet, together with the amazing progress of computer technology, has led to a tremendous growth in the availability of electronic documents. There is therefore great interest in developing statistical tools for the effective and efficient extraction of information on the Web, in a so-called Text Mining perspective. The most common reference model for representing documents in Text Mining is the so-called vector space model: a document is a vector in the (extremely sparse) space spanned by the terms. Documents are usually coded as bags-of-words, i.e. as unordered sets of terms, disregarding grammatical and syntactic roles.
The focus is on the presence/absence of a term in a document, and on its characterisation and discrimination power. In the knowledge discovery process, the core of most procedures is related to dimensionality reduction, via feature selection and/or feature extraction. Statistical tools enable effective feature extraction. One of the most interesting tasks in Text Mining is text categorisation, which consists in organising a collection of documents by grouping them with respect to their content. Here we propose a novel two-step strategy designed for the text categorisation of short documents, e.g. posts and comments shared on social media, where the task is harder to achieve because we have few significant terms. The basic idea is that textual data can be processed at different levels: in a feature extraction frame, we can consider single terms, or subsets of terms identifying different concepts. Concepts are identified by using network analysis tools, namely community detection algorithms. It is then possible to organise the document collection with respect to the different concepts and to describe the groups of documents with respect to terms. The effectiveness of our proposal is shown by analysing a set of tweets about Pope Francis, posted in November 2017.

2. Background and related work
The bag-of-words encoding is characterised by high dimensionality and an inherent data sparsity. According to Aggarwal and Yu (2000), the performances of text categorisation algorithms decline dramatically because of these aspects. A prior dimensionality reduction is therefore highly desirable. In pre-processing, feature selection and/or feature extraction are often used before applying any further analysis. Via feature selection, only a subset of the original vocabulary is considered, according to some criteria. Several feature selection techniques are reported in the literature, such as term strength (Yang, 1995), information gain (Yang and Pedersen, 1997), the Chi-squared statistic (Galavotti et al., 2000), and entropy-based ranking (Dash and Liu, 2000). Feature extraction (also known as feature reduction) is a process for extracting a set of new features from the original vocabulary by applying some functional mapping. Common feature reduction techniques include lexical correspondence analysis (Lebart et al., 1998) and latent semantic indexing (Deerwester et al., 1990). These techniques achieve dimensionality reduction by transforming the original terms into fewer linear combinations spanning lower-dimensional spaces; these combinations may not have a clear meaning, and the results are sometimes difficult to interpret. To cope with this limitation, here we consider a different viewpoint. Both feature selection and feature extraction are basically founded on the analysis of a documents x terms matrix, in which the generic element is the frequency of a term in a document, or another related weight representing the importance of the term. It is possible to recover part of the context of use of each term by constructing a terms x terms co-occurrence matrix. In general, each element of this latter matrix is the number of times two terms co-occur in the corpus. This particular data structure can be represented as a network, where each term is a vertex and each element of the matrix different from 0 is an edge.
The problem of reducing the original dimensionality and performing a feature extraction can then be seen as a community detection problem: terms used together define a concept, as in latent semantic indexing or correspondence analysis, but without any algebraic transformation. Differently from the approaches previously described, this method preserves the original meaning of the terms and allows a better readability of the results. A community in a network is a set of nodes which are densely interconnected and sparsely connected to other parts of the network (Wasserman and Faust, 1994). There is no universally accepted definition of a community, but it is well known that most real-world networks display community structures. When we consider networks of terms, communities of densely interconnected terms can be interpreted as topics. From a theoretical point of view, community detection is not very different from clustering, and many algorithms have been proposed. Traditional approaches are based on hierarchical or partitional clustering (e.g. Scott, 2000; Hlaoui and Wang, 2004). The most popular algorithm is the one proposed by Girvan and Newman (2004). The method is historically important because it marked the beginning of a new era in the field of community detection by introducing the notion of "modularity". Originally introduced to define a stopping criterion, modularity (nowadays referred to as Girvan and Newman's modularity) has rapidly become an essential element of many community detection methods, such as fast-greedy (Clauset et al., 2004), label propagation (Raghavan et al., 2007), and leading eigenvector (Newman, 2006). It measures the difference between the observed fraction of edges that fall within the given communities and the fraction expected under a random distribution of edges. For a comprehensive review of the community detection literature, see Fortunato (2010).

3. Problem definition and proposed method
Text categorisation groups the documents of a collection with respect to their textual content. When we consider short texts, this task is more difficult to achieve because we have few significant terms for characterising the different groups. The identification of high-level structures representing the concepts/topics occurring in the collection can improve the effectiveness of the grouping procedure. In this paper, a two-step strategy for improving the automatic organisation of a collection of documents is proposed. Let T = {d_1, ..., d_n} be a set of n document vectors in a p-dimensional term space, represented by a terms x documents matrix whose generic element t_ij is the number of occurrences of term i in document j (i = 1, ..., p; j = 1, ..., n). For the purpose of our analysis, we are only interested in whether term i occurs in document j or not. We therefore consider a binary matrix B, whose generic element b_ij is equal to 1 if term i occurs at least once in document j, and 0 otherwise. From the matrix B we derive the terms x terms co-occurrence matrix A by the product A = B B^T. The generic element a_ii' is the number of documents in which term i and term i' co-occur (i ≠ i'). An element a_ii on the principal diagonal represents the total number of documents in the collection containing term i. A is an undirected weighted adjacency matrix that can be used to analyse the relations existing among the different terms.
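The construction of B and A can be made concrete in a few lines. The following is a minimal sketch on an invented three-tweet toy corpus; scikit-learn's CountVectorizer is used here purely for convenience and is not part of the authors' toolchain.

```python
# Minimal sketch of the matrix construction just described: binarise the
# terms x documents matrix B, then obtain the terms x terms co-occurrence
# matrix as A = B B^T. The three toy "tweets" are invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["pope visits bangladesh",
        "pope speech on religion",
        "religion and culture speech"]

# binary=True gives b_ij = 1 iff the term occurs at least once in the
# document; sklearn returns documents x terms, so we transpose to get B.
vectorizer = CountVectorizer(binary=True)
B = vectorizer.fit_transform(docs).T.toarray()   # terms x documents

A = B @ B.T   # a_ii' = number of documents where terms i and i' co-occur
terms = vectorizer.get_feature_names_out()
print(dict(zip(terms, A.diagonal())))  # a_ii = document frequency of term i
```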
In order to detect the groups of terms defining concepts, we perform a community detection on the matrix A: each community can be seen as a concept/topic occurring in the collection. As said above, the greedy algorithm is based on the optimisation of a quality function known as modularity. Suppose the vertices are divided into communities such that vertex/term i belongs to community c_i. The modularity Q is defined as

Q = \frac{1}{2h} \sum_{i,i'} \left( a_{ii'} - \frac{\kappa_i \kappa_{i'}}{2h} \right) s(c_i, c_{i'})

where h is the total number of edges in the network, κ_i is the degree of term i, and the function s(c_i, c_{i'}) is 1 if c_i = c_{i'} and 0 otherwise. In practice, a value above about 0.3 is a good indicator of an interesting community structure in a network. The greedy algorithm falls in the general family of agglomerative hierarchical clustering methods. Starting from a state in which each term is the sole member of one of K concepts, the algorithm repeatedly joins concepts together in pairs, choosing at each step the join that results in the greatest increase in modularity. At the end of the detection process we obtain a terms x concepts matrix C, a complete disjunctive table whose element c_ik (k = 1, ..., K) is 1 if term i belongs to community k and 0 otherwise. The text categorisation is then performed with a clustering algorithm on the documents x concepts matrix T* = (T^T C) D_K^{-1}, where D_K is the diagonal matrix of the column marginal distribution of C. Each cell of T* contains the proportion of the terms of a concept occurring in a document.

4. A case study
Twitter is one of the most popular, and worldwide leading, social networking services. It can be seen as a blend of instant messaging, microblogging and texting, with brief content and a very broad audience. The embryonic idea was developed by considering the exchange of texts similar to the Short Message Service within a small group of users. As of the third quarter of 2017, it had 330 million monthly active users, with close to 500 million tweets sent daily (source: Twitter, Statista). Our aim is to categorise a set of tweets generated by the same hashtags with respect to the different concepts expressed in the collection itself.

4.1. Data description and pre-processing
Using the Twitter Archiver add-on for Google Sheets [1], we collected 24,588 tweets about Pope Francis, published between November 10th and December 7th, 2017. We used the hashtag #papafrancesco in the query, without any restriction on the language of the tweets. Moreover, we did not filter out the so-called retweets, so that some texts are replicated in the corpus. The pre-processing was performed in two steps. First, we stripped URLs, usernames, hashtags, emoticons and RT prefixes, and we normalised the tweets by removing special characters and any separators other than blanks. Second, on the 23,915 cleaned tweets we performed lemmatisation and grammatical tagging. The terms contained in tweets written in languages other than Italian were treated as noise. In the analysis we consider only nouns, because of their content-bearing role. Moreover, we delete from the vocabulary the terms occurring fewer than 10 times. We thus obtain a documents x terms matrix T with 23,915 rows and 1,603 columns, and the corresponding terms x terms co-occurrence matrix A.

4.2. Concept identification and categorisation process
We perform the community detection procedure on A in order to identify the concepts.
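Continuing the toy example, the whole chain of Section 3, from community detection on A to the Ward clustering of documents on T*, can be sketched as follows. networkx's greedy modularity maximisation stands in for the fast-greedy algorithm, scipy provides the hierarchical clustering, and all data are the invented toy matrices from the previous snippet; this is an illustration, not the authors' code.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.cluster.hierarchy import fcluster, linkage

# 1. Community detection on the co-occurrence graph (an edge wherever
#    a_ii' > 0; the case study additionally applies a threshold of 30).
p = A.shape[0]
G = nx.Graph()
for i in range(p):
    for j in range(i + 1, p):
        if A[i, j] > 0:
            G.add_edge(i, j, weight=int(A[i, j]))
communities = list(greedy_modularity_communities(G, weight="weight"))
# Values above ~0.3 indicate interesting community structure.
print("Q =", nx.algorithms.community.modularity(G, communities, weight="weight"))

# 2. Terms x concepts indicator matrix C: c_ik = 1 iff term i is in community k.
K = len(communities)
C = np.zeros((p, K))
for k, members in enumerate(communities):
    for i in members:
        C[i, k] = 1.0

# 3. Documents x concepts matrix T* = (B^T C) D_K^-1: each cell holds the
#    proportion of a concept's terms occurring in a document (the binary
#    toy matrix B plays the role of T here).
T_star = (B.T @ C) / C.sum(axis=0)

# 4. Ward hierarchical clustering of the documents on T*.
Z = linkage(T_star, method="ward")
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g. cut into 2 clusters
```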
For better highlighting the relations among the terms, we fixed a threshold of 30 on the co-occurrence values, deleting isolated terms. The greedy algorithm detected 38 different concepts. The high value of the modularity measure (Q = 0.648) supports the effectiveness of the procedure. In Table 1 we list, as an example, the terms belonging to some of the detected concepts.

[1] https://chrome.google.com/webstore/detail/twitter-archiver/pkanpfekacaojdncfgbjadedbggbbphi

Table 1 – Concepts detected in the collection with corresponding terms
Concept 2: scienza, sperimentazione, accanimento, responsabilità, malato, cura, eutanasia, …
Concept 7: bangladesh, religione, viaggio, cultura, myanmar, discorso, buddista, monaco, …
Concept 10: aborto, perversione, febbraio, don, pieri, colonizzazione, crimine, mafia
Concept 19: pensiero, figlio, papà, cecilia, moser, monte
Concept 23: dramática, miedo, josé, experimentan, condición, maría, marcada, incertidumbre
Concept 27: giornatamondialedeipoveri, aula, giovanni, paolo, preparazione, pranzo

It is interesting to note that the algorithm identifies the concepts not written in Italian (e.g., concept 23 contains Spanish terms) and the concepts not related to Pope Francis (e.g., concept 19 refers to a popular reality show). By selecting only the terms belonging to the different communities, we obtain a 19,799 x 38 matrix T*. On this matrix we perform a hierarchical clustering based on the Ward criterion. Figure 1 shows the histogram of the level indices obtained by the clustering. The indices represent the loss of inter-class inertia caused by each aggregation. The maximum gap in the distribution suggests considering a partition into 37 clusters.

Figure 1 – Histogram of the level indices calculated on the dendrogram's nodes

Because of the unsupervised nature of the approach, the quality of the results can be investigated only by looking at the clusters' composition. Due to the limit of 140 characters, each tweet can express one to three concepts at most. In Table 2 we can see the concepts occurring in the different clusters. The order of the concepts represents their importance in terms of statistical significance. The preliminary results seem to be very promising, but a deeper investigation has to be carried out in order to validate the proposal.

Table 2 – Clusters' size and composition
Cluster  Tweets  Concepts      Cluster  Tweets  Concepts      Cluster  Tweets  Concepts
1        120     6             14       8210    4, 7          27       51      30
2        506     15, 6, 9      15       536     1             28       150     36
3        95      9, 15         16       1348    32            29       163     37
4        62      12            17       1379    13            30       41      21
5        179     29            18       677     3             31       51      28
6        93      14            19       2699    2             32       102     22, 4
7        79      16            20       666     8, 7          33       71      26, 22
8        160     10            21       48      24, 20, 13    34       42      17, 11
9        445     5             22       155     20, 4, 24     35       288     11, 34
10       304     19, 18        23       242     38            36       125     34, 11
11       36      18            24       55      25            37       42      23, 11
12       66      31            25       71      33            Total    19799
13       335     27            26       107     35

5. Final remarks
The proposed strategy aims at categorising the documents of a collection by detecting high-level structures, i.e. concepts, as subsets of terms. The terms belonging to each concept are retained in the process and can be used for characterising the identified groups of documents. The tools are provided by network analysis, namely community detection algorithms. The strategy is suitable when dealing with short texts. Future developments of this work are devoted to setting the co-occurrence threshold automatically in the community detection step, and to evaluating alternative similarity indices for measuring the strength of the relations among terms.

References
Aggarwal, C. C. and Yu, P. S. (2000).
Finding generalized projected clusters in high dimensional spaces. Proceedings of SIGMOD'00, pp. 70-81.
Clauset, A., Newman, M. E. and Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111.
Dash, M. and Liu, H. (2000). Feature selection for clustering. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 110-121.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3): 75-174.
Galavotti, L., Sebastiani, F. and Simi, M. (2000). Feature selection and negative evidence in automated text categorization. Proceedings of KDD-00.
Hlaoui, A. and Wang, S. (2004). A direct approach to graph clustering. Neural Networks and Computational Intelligence: 158-163.
Lebart, L., Salem, A. and Berry, L. (1998). Exploring Textual Data. Springer Netherlands.
Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23): 8577-8582.
Newman, M. E. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2): 026113.
Raghavan, U. N., Albert, R. and Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3): 036106.
Scott, J. (2000). Social Network Analysis: a handbook. Sage, London.
Wasserman, S. and Faust, K. (1994). Social Network Analysis. Cambridge University Press.
Yang, Y. (1995). Noise reduction in a statistical approach to text categorization. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 256-263.
Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proceedings of ICML-97, pp. 412-420.

Calling on people to sign an online petition: linguistic characteristics of the calls

Christine Barats¹, Anne Dister², Philippe Gambette³, Jean-Marc Leblanc¹, Marie Peres¹
¹ Université Paris-Est, CEDITEC (EA 3119), Créteil, France – christine.barats@parisdescartes.fr, jean-marc.leblanc@u-pec.fr, marie.leblanc@u-pec.fr
² Université Saint-Louis - Bruxelles, Belgique – anne.dister@usaintlouis.be
³ Université Paris-Est, LIGM (UMR 8049), Champs-sur-Marne, France – gambette@u-pem.fr

Abstract
The analysis of the 12,522 petition texts of an online petition platform makes it possible to examine their linguistic characteristics. The use of statistical textual analysis tools brings to light several regularities in the ways people are called on to sign. We focus on lexical regularities, on forms of address, and on the modes of involvement of the signatories.

Keywords: statistical textual analysis, online petition, petition texts

1. Introduction
Online petition platforms extend and modify the act of petitioning (Contamin, 2001).
Following the line of research on the impact of online participation platforms on forms of digital writing and political engagement (Boure and Bousquet, 2011; Mabi, 2016; Badouard, 2017; Contamin, 2017), we propose to examine the characteristics of the call texts of a digital petition platform. The corpus we analysed comes from one of the main French-speaking online petition sites (lapetition.be). It comprises more than 12,500 petitions, which collected a total of 3.25 million signatures over the period from 31 October 2006 to 12 February 2015. The site offers 9 rubrics among which the petition initiator must classify his or her petition: Art et culture; Droits de l'Homme; Environnement, nature et écologie; Humour/Insolite; Loisirs; Politique; Protection animalière; Social; Autres. As we have shown elsewhere (Barats et al., 2016) and as recalled in Figure 1, the rubrics vary considerably both in the number of petitions (Figure 1) and in the length of the call texts, the number of signatures, and the number and volume of comments left by signatories. The choice of rubric is up to the petition's initiator and reflects an interpretation that varies from one initiator to another, but it gives rise to regularities internal to each rubric, which emerge from automated classifications of the corpus. In this article we concentrate exclusively on the texts of the calls, with particular attention to their incipit, in order to observe the lexical and syntactic regularities that characterise call texts across the whole corpus, but also by contrasting the rubrics. The 12,522 texts constitute a corpus of 2.6 million words.

Figure 1 – Distribution of the number of petitions per rubric: Humour/Insolite 397; Art et culture 652; Loisirs 795; Environnement, nature et écologie 1034; Protection animalière 1378; Droits de l'Homme 1738; Social 1806; Politique 2276; Autres 2446.

2. The most frequent words in the call texts
In order to identify the presence or absence of forms common to the call texts, we examined the openings of the calls, independently of the rubrics. The distribution of the first word of the calls does not follow a power law (the usual Zipf law), as the curve decreases more slowly. The openings of call texts thus display a distinctive frequent vocabulary. The 20 forms of this list appear in first position in more than half of the petition texts: nous, pour, bonjour, le, la, je, les, monsieur, pétition, l, il, a, depuis, non, en, cette, si, madame, contre, suite. If we now turn to the 200 most frequent forms in the call texts as a whole, we observe that the first conjugated verbs are est, sont, ont, soit, peut, demandons, faut, doit, avons, sommes, demande, sera, and the first lexical words pétition, enfants, pays, personnes, vie, Belgique, France, temps, animaux, monsieur, monde, place, projet, jour, droit, loi, politique, mois, travail, ville, ministre, gouvernement, citoyens, cas, Bruxelles, justice, président, lieu, site, chiens, situation, rue.
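The census of opening words can be reproduced in a few lines. The following is a toy sketch on four invented call texts (the actual corpus has 12,522 texts, and the authors' counts come from their own textometric tooling):

```python
# Take the first token of each call text and rank the openers by frequency.
from collections import Counter

calls = [
    "Nous demandons le retrait du projet.",
    "Bonjour, je lance cette pétition contre la fermeture.",
    "Pour la sauvegarde du parc, signez !",
    "Nous, citoyens, réclamons justice.",
]

first_words = Counter(call.split()[0].strip(",.!?").lower() for call in calls)
for word, count in first_words.most_common():
    print(word, count)
# e.g. nous 2, bonjour 1, pour 1 -- the paper then checks whether this
# rank/frequency distribution follows a Zipf-like power law.
```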
As figure 2 shows, ten forms appear both among the 30 most frequent words of the calls (stop words excluded) and among the 30 most frequent in first position: nous, pour, je, pétition, non, contre, j, vous, on, notre. Conversely, some words that appear with high frequency in first position are absent from the 200 most frequent words overall, or rank very low: bonjour (545), monsieur (313), madame (141), chers (111), stop (82), signez (80), mesdames (73), appel (60), voila (53), marre (45), messieurs (41), cher (40), voici (40), lettre (36), voilà (30), trop (30), oui (29), sauvons (24), test (23), aidez (22), salut (18). These include forms typical of direct address: bonjour, salut, madame and mesdames, monsieur and messieurs, or chers. The presence of bonjour or salut reflects the diversity of modes of address, which correspond to different registers and sometimes unexpected formulations. Online accessibility lowers the barrier to launching a petition: our corpus spans a continuum running from the most serious petitions — those echoed in the press, which would probably have existed without the online platform and which are signed by tens or hundreds of people — to very confidential, "just for fun" petitions, whose call text is very short and which collect few signatures. Bonjour appears with greater frequency in the rubric "Loisirs". The form test, for its part, reveals certain difficulties tied to the platform: the point is to test whether a petition can be put online, and the call text then consists of this single word.

Figure 2 – Shared-frequency chain visualization (Lechevrel and Gambette, 2016) of the 30 most frequent words, stop words excluded, in first position and across the petition texts.

Two presentatives (voici: 40 occurrences, voila/voilà: 83 occurrences) are frequently attested in first position of the petition calls, particularly in the rubrics "Loisirs" and "Humour". The enunciative value of these two forms differs somewhat. In many uses voilà is a mark of orality that introduces the message without fundamentally modifying its content, while remaining a presentative ("Voilà je suis une très grande fan du destin de Lisa", "Voilà les Tokyo Hôtel refont des tournées"…). Other uses are the product of deliberation ("Voilà, j'ai décidé de faire une pétition", "Voilà, je fais cette pétition") or have a concluding value ("Voilà pourquoi il faut avoir peur de l'avenir"). This last configuration is more frequent when voilà occupies a position other than first in the sentence ("Voilà le problème", "voilà pourquoi j'ai décidé de"…). A second category of use, in which voici and voilà take on the same values, with voici the more frequent, concerns temporal marking ("Voilà quelques années que l'on demande l'autorisation de porter des shorts", "Voici 22 mois que je suis papa").
Finally, voici like voilà (in much smaller proportions for the latter) acts as a presentative in a large number of uses ("Voilà le but de ma pétition", "voilà ma propre pétition", "voici une histoire comme tant d'autres", "voici une pétition à faire suivre", "voici le lien de ma pétition"…). With the imperative verbs signez, aidez and sauvons, the petition's initiator goes straight to the point: the aim is to push signatories to act through the act of petitioning. Stop, marre, trop and oui belong to the same movement: act, put an end to, encourage, and so on. To this list we add pour, the second most frequent word in first position. Together with contre, it is very clearly a characteristic mark of the petitioning posture: one opposes, one supports. In most rubrics, texts beginning with non or contre are half as numerous as those beginning with oui or pour, except in the rubric "Environnement", where they are more numerous. Our investigations will continue using the corpus annotation features offered by TextObserver, in order to take better account of the various contexts in which these forms occur and thereby strengthen their disambiguation. Imperative verbs are an interesting indicator of signatory involvement, which is also found in the use of the pronouns nous, vous and je, to which we now turn.

3. The involvement of signatories and petition initiators
The pronoun nous is heavily mobilized in our corpus: the most frequent word at the beginning of the calls, it is also the most used pronoun in the corpus as a whole. This nous is meant to mobilize: from the petition text onward, it includes the future signatories in the act of petitioning. Extracting the 10 most specific cooccurrents of the pronoun nous in first position, with the TextObserver tool (Barats et al., 2013), yields, in decreasing order of specificity: demandons, voulons, souhaitons, soussignés, citoyens, soutenons, réclamons, opposons, déclarons, appris. This pronoun very often introduces a demand or a denunciation, sometimes elements of context (cf. appris). One obviously cannot rule out that some of these nous refer only to the petition's initiators, without including the signatories. Nevertheless, the presence of the cooccurrents citoyens and soussignés, and our returns to the texts, show that the vast majority of nous include the signatories. A more detailed study is under way to quantify the different cases more precisely. A query by rubric confirms the quantitative importance of this inclusive nous, particularly in the rubrics "Environnement", "Politique" and "Social", as figure 3(a) shows.

Figure 3 – Number of petitions, per rubric, whose call text contains j', je or nous (a), and median word count of petition texts that do or do not contain these pronouns (b).

The pronoun je ranks fourth among the most frequent words at the beginning of the texts, and it is the third most mobilized pronoun across all the texts, after nous and vous.
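The specificity extraction described above is performed with TextObserver; a hedged stand-in is sketched below using Dunning's log-likelihood ratio, a common specificity measure (not necessarily the one TextObserver implements), comparing the vocabulary of petitions opening with nous against the rest of the corpus. The `texts` variable is again a placeholder.

```python
# Hedged sketch: rank cooccurrents of first-position "nous" by log-likelihood.
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table."""
    def h(*ks):  # sum of k*log(k/N) over the cells
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

texts = ["Nous demandons la fermeture ...", "Pour la sauvegarde ..."]  # placeholder
nous_first = [t for t in texts if t.split() and t.split()[0].lower() == "nous"]
others = [t for t in texts if t not in nous_first]

sub = Counter(w.lower() for t in nous_first for w in t.split())
ref = Counter(w.lower() for t in others for w in t.split())
n_sub, n_ref = sum(sub.values()), sum(ref.values())

scores = {w: llr(c, n_sub - c, ref[w], n_ref - ref[w]) for w, c in sub.items()}
for w, s in sorted(scores.items(), key=lambda x: -x[1])[:10]:
    print(w, round(s, 1))  # the 10 most specific forms of the "nous" subcorpus
```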
It is not rare for the two pronouns nous and je/j' to be used in the same call text, the initiator moving from personal experience to the mobilization of signatories, as in the following petition entitled "Contre la fermeture du Delhaize d'Herstal" (pet 14595): "Je trouve ça honteux de fermer un magasin qui est récompensé du meilleur rapport clients-Personnel! Il est temps de se serrer les coudes et de se battre jusqu'au bout! Ne nous laissons pas faire!!!!!".

Figure 4 – Percentage of petition texts that do or do not point to a URL (a) and that mention facebook (b), per rubric.

One way of moving from individual involvement to collective mobilization is to refer to other spaces where information is relayed on the web, which translates into the presence of URLs, sometimes targeting social networks. 11% of the calls contain URLs. The effect of the rubrics is confirmed: "Protection animalière" and "Environnement" contain the largest share of URLs (17%), as figure 4(a) shows. To probe this result further, we looked at the presence of the social network Facebook: 1.6% of petition texts point to it, as figure 4(b) shows. The rubric "Protection animalière" relies most on relays via Facebook pages, confirming a specific, transmedia mode of mobilization (Barats et al., 2016). The rubric "Politique" relies least on Facebook. Note, however, that the most-signed petition, on the unity of Belgium, launched in August 2007, proposed — once the petition closed — to gather on a website the photos of one of the demonstrations organized in November 2007. Petition texts thus reflect the articulation of different web devices in the dynamics of petitioning, which a strictly quantitative approach captures only partially. Looking at figure 3(a), one may be struck by the relatively large number, in each rubric, of petitions in which neither of these two pronouns appears, which may be the sign of less involving, more impersonal petitions. Indeed, fewer than 15% of these texts without nous or je/j' contain the pronoun vous. On closer inspection, the petition texts without nous or je/j' are, in every rubric, much shorter than those that include nous and/or je/j', as figure 3(b) shows.
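A minimal sketch of the kind of URL and Facebook detection underlying the percentages in figure 4; the regular expression and the sample texts are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch: share of petition texts containing a URL or mentioning facebook.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

texts = ["Signez ici : http://example.org/petition",
         "Voir notre page facebook pour plus d'infos"]  # placeholder data

with_url = sum(1 for t in texts if URL_RE.search(t))
with_fb = sum(1 for t in texts if "facebook" in t.lower())
print(f"URL: {100 * with_url / len(texts):.1f}%"
      f"  facebook: {100 * with_fb / len(texts):.1f}%")
```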
4. Conclusions and perspectives
Our analysis of the first words of petition call texts shows that the vocabulary used in this position displays more regularities tied to the specificities of petitioning than the texts taken as a whole. It allows us to identify several linguistic characteristics that sometimes vary across rubrics (personal pronouns, forms of address, URLs, etc.). The textometric approach sometimes reaches its limits, as with the ambiguity of nous, which may or may not include the petition's promoters or signatories, or with the positive or negative polarity of prepositions and verbs, which does not suffice to determine whether the petition expresses a demand or a denunciation. This work is a first step towards a systematic examination of other markers that serve to involve signatories, such as imperative verbs or determiners, with a view to relating them to the number of signatories and, possibly, to recommendations for drafting online petition texts.

References
Badouard R. (2017). Le désenchantement de l'internet. Désinformation, rumeur et propagande. Paris, FYP éditions.
Barats C., Leblanc J.-M. and Fiala P. (2013). Approches textométriques du web : corpus et outils. In Barats C., editor, Manuel d'analyse du Web en sciences humaines et sociales. Paris, Armand Colin.
Barats C., Dister A., Gambette Ph., Leblanc J.-M. and Peres M. (2016). Analyser des pétitions en ligne : potentialités et limites d'un dispositif d'études pluridisciplinaires. JADT 2016, Nice. http://lexicometrica.univ-paris3.fr/jadt/jadt2016/01-ACTES/83043/83043.pdf
Boure R. and Bousquet F. (2011). La construction polyphonique des pétitions en ligne. Le cas des appels contre le débat sur l'identité nationale. Questions de Communication, vol. 20: 293-316.
Contamin J.-G. (2001). Contribution à une sociologie des usages pluriels des formes de mobilisation : l'exemple de la pétition en France. Thèse de doctorat, Université Paris 1.
Contamin J.-G., Léonard T. and Soubiran T. (2017). Les transformations des comportements politiques au prisme de l'e-pétitionnement. Potentialités et limites d'un dispositif d'étude pluridisciplinaire. Réseaux, vol. 204(4): 97-131.
Lechevrel N. and Gambette P. (2016). Une approche textométrique pour étudier la transmission des savoirs biologiques au XIXe siècle. Nouvelles perspectives en sciences sociales, vol. 12(1): 221-253.
Mabi C. (2016). Analyser les dispositifs participatifs par leur design. In Barats C., editor, Manuel d'analyse du Web en sciences humaines et sociales. Paris, Armand Colin.

Newsgroups and lexicography: from the NUNC to the VoDIM

Manuel Barbera, Carla Marello
Università degli Studi di Torino – b.manuel@inrete.it; carla.marello@unito.it

Abstract
The VoDIM (Vocabolario dinamico dell'italiano moderno – Dynamic dictionary of modern Italian) represents a new development in recent Italian lexicography. In this paper we argue that the NUNC corpora (www.corpora.unito.it), which contain texts from newsgroups downloaded at the beginning of the twenty-first century, display aspects of "written-spoken" Italian. The NUNC can offer instances of new meanings of "old" words and of new collocational contexts. We discuss several examples taken from the corpora, such as the internationalism Umwelt, the collocation assolutamente sì, and the abbreviation clima for climatizzatore ('air conditioning').

Keywords: VoDIM, NUNC, lexicography, Italian

(Authorship note: §§ 2 and 3 are due to Manuel Barbera, §§ 4 and 5 to Carla Marello, and § 1 is to be ascribed to both, although the work was of course conceived together and both authors take full responsibility for it.)
1. Introduction
The VoDIM (Vocabolario dinamico dell'italiano moderno), a project led by the Accademia della Crusca (cf. http://www.accademiadellacrusca.it/it/eventi/crusca-torna-vocabolariolesicografia-dinamica-dellitaliano-post-unitario) involving eight research groups from as many Italian universities, among them the Turin group, will be an online, corpus-based dictionary of post-unification Italian, drawing also on other dictionaries acquired in digital form — such as the Tommaseo-Bellini, the fifth Crusca and the Battaglia — and designed to be queryable also as a "variable corpus" defined by the user. The corpora on which it rests therefore become essential. A first base reference corpus (whose results are not yet public: http://dizionariodinamico.it/prin2012crusca/dictionary) was produced under PRIN 2012 by the Crusca itself (in collaboration with the Universities of Catania, Firenze, Genova, Milano, Napoli, Piemonte Orientale, Tuscia and with the CNR), but it is of course insufficient on its own.

2. The NUNC
One corpus with which we suggest completing it is the NUNC-IT. The NUNC (homepage: http://www.bmanuel.org/projects/ng-HOME.html), conceived by Manuel Barbera (at bmanuel.org) and maintained by the same Turin group that participates in the VoDIM, are properly a multilingual suite of corpora intended to document the textual genre "newsgroup" at the beginning of the third millennium. Many versions have been implemented (including topic-specific ones), all reachable from the homepage. The result is not yet fully satisfactory; still, some use can already be made of them, as demonstrated by several contributions in Barbera et al. 2007 and in Costantino et al. 2009, to cite only the first uses of ten years ago.

A newsgroup is a freely accessible, cost-free online forum, available on the Internet, which takes the form of written texts, the posts, sent to an "electronic notice board" maintained on a network of servers (the newsservers that make up UseNet). The group's users can download, read and reply to posts, building chains (threads) of exchanges. Newsgroups are organized in a precise taxonomy, i.e. a system of thematic frames called "hierarchies", on a geographic-national and/or thematic basis. The advantages of this textual base for corpus linguistics are numerous and have been treated in Barbera, 2007 and Barbera and Marello, 2009; what interests us here in the first place is that they present an absolutely contemporary Umgangssprache, real and highly varied in registers and topics. As far as the VoDIM is concerned, many items — neologisms, technical terms, loanwords, etc. — are not attested in the Crusca base corpus, so the NUNC could prove a useful reservoir of contexts.

3. A case study: Umwelt
Consider for instance a technical loanword, the term Umwelt. Introduced (in German) by the biologist Jakob Johann baron von Uexküll (Estonian, but from a Baltic German family) as early as the title of his important work of 1909 (Umwelt und Innenwelt der Tiere), it soon entered the philosophical
tradition (starting from a 1914 review by Max Scheler): used by Heidegger in a course of 1929-30, it then became common currency (among others) in French with Gilles Deleuze, Maurice Merleau-Ponty and Jacques Lacan, and in Italian with Giorgio Agamben. But it is used above all in biology texts, naturally, and then in semiotics, where it was spread in the Sixties by Thomas Albert Sebeok (born Sebők Tamás) and lies at the basis of modern biosemiotics (cf. Kull, 2001). In the NUNC the term is repeatedly attested. (Quotations from the corpus are reproduced verbatim throughout: in particular, the tokenization of punctuation and apostrophes, all the "typing errors", and the orthographic idiosyncrasies typical of the genre are retained.)

Per Gadamer comprendere l ' esistenza - e qui c'è ancora Heidegger - significa prima di tutto pre-comprenderla , in quanto la comprendiamo con un linguaggio che non scegliamo , ma che , trascendentalmente , definisce già la realtà in cui ci muoviamo : l'Um-Welt , da un lato , e dall ' altro lato , il Mit-welt . Ma , Gadamer cerca di andare alla radice del movimento del pensiero del soggetto e tale origine sta nell ' esigenza di comprendere e farsi comprendere , cioè nel muoversi nell ' Umwelt e nel Mitwelt . Il fatto è che per Gadamer l ' Altro è visibile solo con gli " occhi nostri ", ciò con ciò che " siamo ", con la nostra " identità ", il nuovo si dà solo nel familiare . E in un certo senso è così . L ' altro è ciò che mi disturba che mi inquieta perchè non riesco a ridurlo al mio mondo : è un'eccedenza .

The preceding passage is an example of the technical-philosophical use of the term, which does not differ much from what one would find by excerpting the texts (and translations) of that tradition. More interesting is the following example:

Anche in Italia il consumo di televisione è vertiginosamente aumentato : […] . Oltre a due effetti di rilevanza individuale : - la caduta verticale della capacità di fissare l ' attenzione per più di un certo tempo ( se a un buon insegnante occorre anche un ' ora per sviluppare un dato argomento , gli spazi televisivi obbligati in novanta secondi troncano quello stesso argomento in modo irreparabile ) e - la perdita di interesse per la lettura - aspetti che coinvolgono per mimetismo inconscio ( vale a dire per l ' inconscio occupazione degli spazi mentali ad opera non solo delle immagini ma dell ' intera atmosfera televisiva che foggia l ' Umwelt dell ' uomo moderno ) anche persone che fruiscono della TV per tempi ben sotto la media - l ' esposizione allo " sbarramento " delle immagni televisive ha due rilevanti effetti sociali : - il conformismo applicato e - l ' ignoranza generalizzata . […]

This is a transferred use, clearly outside the "technical" fields in which the term spread. Lexicographically this is particularly relevant because it documents the migration of the loanword beyond its original domain, securing its passage into common — if cultivated, or relatively so — usage. For this type of evidence the NUNC can prove particularly useful.

4. Below and beyond the graphic word
Besides dating the appearance of particular lexemes or particular senses, the VoDIM also aims to attest the appearance of clippings and word combinations: the NUNC, in fact, present incipient uses that passed from speech into this early-millennium form of writing.
From the point of view of the statistical frequency of such uses, the data extracted from the NUNC corpora present difficulties due to the phenomenon of quoting, but they constitute a rich mine of first attestations: see, for example, the study by Onesti and Squartini, 2007 on the set phrase tutta una serie di, or Valle, 2006 on the early penetration of (more or less Italianized) anglicisms. As for clippings in particular, we provided a substantial collection of them in Allora and Marello, 2008. One example among many is clima as a clipping of climatizzatore; Marello had already devoted a short article to it (in the Corriere del Ticino of 23 September 2005) and had noted its presence in several posts from 2002 in NUNC-Motori. See the thread excerpt below, in which a casual conce for concessionario ('dealer') also appears; the latter has not achieved registration in dictionaries, unlike clima, whose date of first attestation is, according to the Zingarelli dictionary, 2000. (As noted above, corpus quotations are reproduced verbatim, including the errors present in the source: greater tolerance of careless typing, and the open acceptance of certain graphic-orthographic traits, are typical of this kind of computer-mediated communication.)

Qualcuno e' in grado di dirmi quanti grammi (olio/gas?) servono per la ricarica del clima per un CRD del 2002? Una spesa approssimativa? Grazie
Ciao a tutti, scusate se mi intrometto, ma oggi dopo giorni di dubbio ho chiamato il conce per lo stesso motivo di Massimo, 30 km per sentire un po' di aria fresca con il clima impostato a 5 gradi e macchina lasciata prima all'ombra

Beyond the graphic word, it can for example be interesting to document the uses of assolutamente sì (the subject of an article in the Corriere del Ticino of 21 January 2004): no fewer than 103 are found in the general NUNC. Here are two examples:

Ma ti senti tanto tanto tanto depressa ??? Ci dobbiamo preoccupare ? [>]… Oggi un pò meno , però devo dire che ho passato veramente dei brutti momenti. L ' importante è riprendersi , no ? Assolutamente sì ! Riprendersi e ripartire subito !

tu sei un troll ? […] No , perché il flame occasionale non fa di una persona un troll - werted è un troll ? Assolutamente sì , perché attua flame , insulti e provocazioni in modo sistematico e con offese che vanno oltre l ' ambito dello sfottò sportivo . In più utilizza tutte le tecniche tipiche del trollaggio , dal morphing al faking al flooding .

The same inquiry can be made for anche no, observing that in the vast majority of contexts it occurs as ma anche no.

5. Conclusions
A further factor that makes the NUNC valuable for the usage-minded linguist and lexicographer is their dialogic character, glimpsed above all in the examples presented in § 4. It is a pervasive phenomenon in the NUNC, usually realized in newsgroups as quoting (cf. Barbera, 2011 and Marello, 2007). Computationally this admittedly creates some problems (not yet fully solved), since the phenomenon of repeated text, if left unchecked, inevitably undermines the statistical picture, defeating any simple quantitative use of the corpora; textually, however, it is a phenomenon of great importance, especially when it can be exploited, as in the NUNC, through the possibility of widening contexts up to 2,000 words.
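To make the quoting problem concrete, here is a hedged sketch of quote-aware counting: lines beginning with ">" (the usual UseNet quoting convention) are dropped before counting a collocation such as assolutamente sì, so repeated quoted text does not inflate the frequencies. The NUNC themselves use a more elaborate treatment; the posts below are invented placeholders.

```python
# Hedged sketch: count a collocation only in the unquoted part of each post.
posts = ["> Ma ti senti depressa?\nAssolutamente sì! Riprendersi subito!",
         "Assolutamente sì, perché attua flame in modo sistematico."]  # placeholder

def unquoted(post):
    """Drop '>'-prefixed lines, i.e. text quoted from earlier posts."""
    return "\n".join(line for line in post.splitlines()
                     if not line.lstrip().startswith(">"))

target = "assolutamente sì"
count = sum(unquoted(p).lower().count(target) for p in posts)
print(count)  # -> 2: the quoted first line is excluded from the count
```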
The capacity of newsgroups to fix eminently oral uses in writing, transferring the fluidity of orality to a special type of writing and thus forming a sort of bridge between the two media, can prove particularly important for the VoDIM, precisely because the NUNC corpora record emerging tendencies in the Italian language. We dwelt on the diamesic peculiarity of this particular type of "written-spoken" language in Barbera and Marello, 2009, but here we cannot fail to stress the opportunity it could offer the VoDIM. The NUNC, as we said, are not yet perfect: the prototypes put online are only betas, but the will to perfect them is there, and it cannot be excluded that the VoDIM will provide the right occasion to do so.

Bibliography
Allora A. and Marello C. (2008). "Ricarica clima". Accorciamenti nella lingua dei newsgroup. In Cresti E., editor, Atti del IX Congresso della Società Internazionale di Linguistica e Filologia Italiana (SILFI): "Prospettive nello studio del lessico italiano" (Firenze, 14-17 giugno 2006). Cesati: vol. II, pp. 533-538.
Barbera M. (2007). Per la storia di un gruppo di ricerca. Tra bmanuel.org e corpora.unito.it. In Barbera M., Corino E. and Onesti C., editors, Corpora e linguistica in Rete. Guerra Edizioni: pp. 3-20.
Barbera M. (2011). Une introduction au NUNC : histoire de la création d'un corpus. In Ferrari A. and Lala L., editors, Variétés syntaxiques dans la variété des textes online en italien : aspects micro- et macrostructuraux. Université de Nancy II: pp. 9-36.
Barbera M. and Marello C. (2009). Tra scritto-parlato, Umgangssprache e comunicazione in rete: i corpora NUNC. In Antonini A. and Stefanelli S., editors, Per Giovanni Nencioni. Convegno internazionale di studi. Pisa - Firenze, 4-5 Maggio 2009. Le Lettere: pp. 157-186. Reprinted in Barbera M., Quanto più la relazione è bella: saggi di storia della lingua italiana 1999-2014. Bmanuel.org - Youcanprint, 2015: pp. 157-182.
Costantino M., Marello C. and Onesti C. (2009). La cucina discussa in rete. Analisi di gruppi di discussione italiani relativi alla cucina. In Robustelli C. and Frosini G., editors, Atti del convegno ASLI 2007 "Storia della lingua e storia della cucina. Parola e cibo: due linguaggi per la storia della società italiana". Modena, 20-22 settembre 2007. Cesati: pp. 717-727.
Kull K. (2001). Jakob von Uexküll: An introduction. Semiotica, vol. 134 (1/4): pp. 1-59.
Marello C. (2007). Does Newsgroups "Quoting" Kill or Enhance Other Types of Anaphors? In Korzen I. and Lundquist L., editors, Comparing Anaphors between Sentences, Texts and Languages. Samfundslitteratur Press: pp. 145-157.
Onesti C. and Squartini M. (2007). "Tutta una serie di". Lo studio di un pattern sintagmatico e del suo statuto grammaticale. In Barbera M., Corino E. and Onesti C., editors, Corpora e linguistica in Rete. Guerra Edizioni: pp. 271-284.
Valle L. (2006). Varietà diafasiche e forestierismi nell'italiano nei gruppi di discussione in rete. In López Díaz M. and Montes López M., editors, Perspectives fonctionnelles : emprunts, économie et variations dans les langues. S.I.L.F. 2004. XXVIII Colloque de la Société internationale de linguistique fonctionnelle, tenu à Saint-Jacques-de-Compostelle et à Lugo du 20 au 26 septembre 2004. Editorial Axac: pp. 371-374.
Zingarelli N. (2017). Lo Zingarelli 2017. Vocabolario della lingua italiana. Ed. Mario Cannella and Beata Lazzarini. Zanichelli.
Techniques for detecting normalized violence in the perception of refugees/asylum seekers, between lexical analysis and factor analysis

Ignazia Bartholini
Univ. of Palermo – ignazia.bartholini@unipa.it

Abstract
The theme of gender violence takes on a particular form when linked to the phenomenon of forced migration, where historical-cultural variants of a neo-patriarchal nature intersect with the cultural-religious orthodoxies the newcomers often carry with them. Studying gender violence in the context of globalized migration allows us to highlight three biases that mark Western discourse, concerning: the way its phenomenology is conceived as pre-modern (a); the detachment of the interpretation of violence from policies of intervention and counteraction (b); the treatment of gender asymmetries and of sexist representations and practices in the Mediterranean host society as residual (c). The factorial structure of the questionnaire was then investigated through Principal Components Analysis (PCA) followed by Oblimin rotation of the factorial axes, since a relation between the dimensions of the questionnaire was assumed. The reliability of the scales was verified with Cronbach's alpha coefficient.

Keywords: gender violence, forced migrations, sexist representation

1. Introduction
Over the last two decades, the field of border and migration management has been characterized by the increasing interrelatedness of discourses about control practices and about humanitarian issues (Walters 2011; Fassin 2010). Today, European policies seek to incorporate strategies to support forced migrants as key instruments for the protection of refugees (Moro 2012). Forced migration, which can also be addressed through the lens of gender (Hans 2008), is grafted onto a broader field of research, which includes welfare strategies, social representations and intercultural dynamics. According to the UNHCR, gender-based violence refers to "any act of gender-based violence that results in, or is likely to result in, physical, sexual or psychological harm or suffering to women, including threats of such acts, coercion or arbitrary deprivation of liberty, whether occurring in public or private life" (UNHCR 2008: 201). It can take, among others, the form of "rape, forced impregnation, forced abortion, trafficking, sexual slavery, and the intentional spread of sexually transmitted infections, including HIV/AIDS" (UNHCR 2008: 7, 10).
Violence occurs not only during the migratory journey at the hands of other refugees, but is also inflicted by public officers, government employees and aid-agency staff (Ferris 2007; Freedman 2015).

2. The numbers of the phenomenon
According to data of the Italian Ministry of Internal Affairs, between 2015 and 2016, 154,719 migrants disembarked in Italy, of whom 82,136 were asylum seekers. From January to March 2016, 9,307 migrants disembarked in Italy. Currently, migrants come mostly from Gambia, Senegal, Mali, Guinea, Ivory Coast, Morocco, Somalia, Sudan and Cameroon (source: ANSA). In January 2016 asylum seekers numbered 7,505, mostly from Pakistan (1,510), Nigeria (1,306), Afghanistan (665) and Gambia (625). Among these, 6,739 were men, 766 women, 292 unaccompanied minors and 199 minors. 6,507 requests have been reviewed so far, with the following outcomes: 190 people (3%) were granted refugee status, 698 (11%) obtained a subsidiary permit, 1,352 (21%) were granted humanitarian protection, and 4,266 (66%) were denied (source: Italian Ministry of Internal Affairs). In 2017 alone, 21,478 refugees/asylum seekers of 21 different nationalities transited through the Trapani-Milo hotspot, managed by the NGO "Badia Grande", one of the partners of the project "Provide" (source: Ministry of Interior). These include 16,010 men, 3,177 women and 2,291 children, divided into 1,787 males and 504 females. Last year, two researchers from the University of Palermo submitted a questionnaire of 36 items to 465 women temporarily hosted at the Trapani-Milo hotspot in Sicily.

3. Objectives of the research
The core question of the research concerns the identification of the subjective dimensions of violence from the side of the victims and of the operators, as well as the problems involved in building multicultural social constructions of violence. For this purpose, the research investigates a specific articulation of "migratory violence", which entails cultural specificities and contextual conditions, such as the journey and the time spent in reception facilities. In order to highlight topics and problems related to the social construction of gender violence, attention is paid to the victims' point of view concerning 'normalized' procedural violence, including through operational definitions of victims' first-reception treatment in institutional arenas. Furthermore, gender relations are shaped by the whole migration experience, and this leads to various forms of direct, indirect and structural violence; forms of gender-based violence are seen not only among refugees. Finally, refugees and asylum seekers may suffer structural violence in the form of social exclusion and discrimination (Jaji 2009; Crisp, Morris and Refstie 2012), secondary victimization (Pinelli 2011; Tognetti 2016), labour exploitation (Coin 2004), forced prostitution (Naggujja et al. 2014; Krause-Vilmar 2011) and sexual abuse (Crisp, Morris and Refstie 2012). The migratory violence to which women — as well as minors and LGBT people — are subjected therefore becomes a particular key for reading and interpreting intra- and intercultural gender relations. The first objective of the research was to assess the perception of the violence suffered by the women of the sample before and during the journey to the coast of Sicily.
The second objective was to identify effective interventions to reduce migrants' exposure to different types of violence and threat, to encourage access to physical and psychological services, to assist the victims of violence with integration, to support safe and culturally appropriate instruments, to provide support for families and stable settlement in the host country, and to promote concerted actions for reducing inequalities in access to resources.

4. Methodology
A1. Once the intersection of ethnicity, gender and socioeconomic status had been explored, an internalist perspective was employed, based on the analysis of narrative devices, i.e. the reports of the conversations that victims of migratory violence conduct with experts (linguistic and intercultural mediators, social workers, psychologists and lawyers, but also doctors and police officers) or with members of the third sector.
A2. Definitions of lived or experienced violence were collected through interviews with refugees and with operators of first- and second-reception centres who have particular acquaintance with the phenomenon.
Subsequently, the factorial structure of the questionnaire was investigated through Principal Components Analysis (PCA) followed by Oblimin rotation of the factorial axes, since a relation between the three dimensions of the questionnaire was assumed: a. daily life before the trip; b. gender dynamics and relationships among family members; c. normalized violence. The reliability of the scales was verified with Cronbach's alpha coefficient. In order to verify the hypothesis that there are statistically significant differences in the mean scores of the different dimensions, analyses of variance were carried out. Multivariate analysis of variance, together with a lexical analysis, allowed us to select: 1. the keywords present in the corpus of the questionnaire, using frequency indexes; 2. the meta-information contained within the text units; 3. the context units, through specific data arrays for content analysis. The paper we propose to present will describe the results of the research and the methodological affordances of the text-analysis tools used by the researchers involved.

5. Some research results
To identify the vulnerabilities of migrants, it was necessary to find appropriate instruments of analysis capable of capturing the needs of the victims of violence, in order to deal with them in a respectful, sensitive, professional and non-discriminatory manner. The women explained the need to receive a proper degree of assistance and stronger support and protection. The keywords most frequently used by the migrants were: protection, fear, opportunity, work, life. The content analysis, and the context units involved through specific data, describe the necessity of acknowledging that the women/asylum seekers may be victimized by other men even after their arrival in the reception centre, and the opportunity of putting in place specific procedures to prevent, identify and respond to the different forms of proximity gender-based violence.
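A hedged sketch of the PCA-plus-Oblimin and Cronbach's alpha workflow described in the methodology, assuming a 465 x 36 response matrix; the random data and the use of the `factor_analyzer` package (one common Python route to oblique rotation, installable with `pip install factor_analyzer`) are assumptions, not the authors' actual pipeline.

```python
# Hedged sketch: scale reliability and oblimin-rotated factor structure.
import numpy as np
from factor_analyzer import FactorAnalyzer  # assumed available

def cronbach_alpha(items):
    """items: (n_respondents, n_items) matrix of questionnaire scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()      # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(0)
X = rng.normal(size=(465, 36))  # placeholder for the real 465x36 responses
print("alpha:", round(cronbach_alpha(X), 3))  # near 0 on random data

# Three factors, mirroring the questionnaire's three assumed dimensions,
# with an oblique (Oblimin) rotation that allows correlated factors.
fa = FactorAnalyzer(n_factors=3, rotation="oblimin")
fa.fit(X)
print(fa.loadings_[:5])  # loadings of the first five items on the 3 factors
```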
6. Conclusion
The problems that refugees face require humanitarian responses and effective interventions (Dal Lago 1999; Colombo 2012; Camarrone 2016), such as the reduction of exposure to different types of violence and threat in the post-migration phase and access to physical and psychological services (Shamir 2005; Ambrosini 2010; Bartholini 2017). From this perspective, the Mediterranean represents a peculiar field of analysis of that normalized violence — procedural and proximal — which denies refugees/asylum seekers, minors and LGBT people the possibility of considering themselves as rights holders and as subjects of equal dignity and value. Moreover, the results of the content analysis show the necessity of stronger integration, with support strategies built on appropriate cultural and social practices, and of providing adequate support for families in a stable settlement in the host countries (Balibar 2012). Lastly, the research highlights the need for concerted action to reduce inequalities in access to resources (Robinson et al. 2006). Persecution related to gender violence may give rise to claims for international protection (Gilbert 2009). The Council of Europe Convention on preventing and combating violence against women (Istanbul Convention of 2011) and Directive 2012/29/EU, in establishing minimum standards on the rights, support and protection of victims, contribute to the obligation to "ensure access for victims and their family members to general victim support and specialist support, in accordance with their needs". Although member states are stepping up their work to streamline a gender understanding into public decision-making, policy and operations, this effort is not always reflected in asylum procedures.

References
Ambrosini M. (2010). Richiesti e respinti. L'immigrazione in Italia. Come e perché. Milano: il Saggiatore.
Balibar E. (2012). Strangers as enemies. Walls all over the world, and how to tear them down. Mondi Migranti, vol. 6, n. 1: 7-25. DOI: 10.3280/MM2012001001
Bartholini I. (2017). Migrations: A Global Welfare Challenge: Policies, Practices and Contemporary Vulnerabilities (with F. Pattaro Amaral, A. Silvera Samiento, R. Di Rosa). Edition Corunamerica, Barranquilla (Colombia), pp. 1-196.
Camarrone D. (2016). Hotspot di Lampedusa, la sindaca chiede al Ministero dell'interno una verifica urgente delle procedure UE. Diritti e frontiere, 8 gennaio 2016, http://dirittiefrontiere.blogspot.it/2016/01/la-verita-sul-sistema-hotspot.html
Coin F. (2004). Gli immigrati, il lavoro, la casa. Franco Angeli: Milano.
Colombo A. (2012). Fuori controllo? Miti e realtà dell'immigrazione in Italia. Bologna: Il Mulino.
Convenzione di Dublino (1990), http://www.camera.it/_bicamerali/schengen/fonti/convdubl.htm
Crisp J., Morris T. and Refstie H. (2012). Displacement in urban areas: new challenges, new partnerships. Disasters, 36(1): S23-S42.
Dal Lago A. (1999). Non Persone. L'esclusione dei migranti in una società globale. Milano: Feltrinelli.
Fassin D. (2010). La raison humanitaire. Une histoire morale du temps présent. Gallimard-Seuil-Hautes Études: Paris.
Gilbert L. (2009). Immigration as Local Politics: Re-Bordering Immigration and Multiculturalism through Deterrence and Incapacitation. International Journal of Urban and Regional Research, vol. 33, n. 1: 26-42. DOI: 10.1111/j.1468-2427.2009.00838.x
Jaji R. (2009). Refugee woman and the experiences of local integration in Nairobi, Kenya. University of Bayreuth: Bayreuth.
Krause-Vilmar J. (2011).
The Living Ain't Easy, Urban Refugees in Kampala. UN Report.
Ministero dell'Interno (2015). Rapporto sulla protezione internazionale in Italia 2015, http://www.interno.gov.it/sites/default/files/t31ede-rapp_prot_int_2015__rapporto.pdf
Naggujja Y. et al. (2014). From The Frying Pan to the Fire: Psychosocial Challenges Faced By Vulnerable Refugee Women and Girls in Kampala. Report of the Refugee Law Project.
Osti G. and Ventura F., editors (2012). Vivere da Stranieri in Aree Fragili. Napoli: Liguori.
Palidda S., editor (2011). Il discorso ambiguo sulle migrazioni. Messina: Mesogea.
Pinelli B. (2011). Attraversando il Mediterraneo. Il sistema campo in Italia: violenza e soggettività nelle esperienze delle donne. Lares, 77: 159-180.
Regolamento (CE) n. 343/2003 (Dublino II), http://eur-lex.europa.eu/legal-content/IT/TXT/?uri=URISERV%3Al33153
Regolamento UE n. 604/2013 (Dublino III), http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2013:180:0031:0059:IT:PDF
Robinson D. and Reeve K. (2006). Neighbourhood Experiences of New Immigration. Reflections from the Evidence Base. York: Joseph Rowntree Foundation.
Shamir R. (2005). Without borders? Notes on globalization as a mobility regime. Sociological Theory, vol. 23, n. 2: 197-217. DOI: 10.1111/j.0735-2751.2005.00250.x
Tognetti M. (2016). Donne e processi migratori fra continuità e cambiamento. ParadoXa, X(3): 69-88.
Walters W. (2011). Foucault and Frontiers: Notes on the Birth of the Humanitarian Border. In Bröckling U. (ed.), Governmentality: Current Issues and Future Challenges. Routledge: London.

From corpus to dictionary: first lexicographic reflections on the Vocabolario storico della cucina italiana postunitaria (VoSCIP)

Patrizia Bertini Malgarini¹, Marco Biffi², Ugo Vignuzzi³
¹ LUMSA – p.bertini@lumsa.it
² Università degli Studi di Firenze – marco.biffi@unifi.it
³ Sapienza Università di Roma – ugo.vignuzzi@uniroma1.it

Abstract
The Vocabolario storico della cucina italiana postunitaria (VoSCIP) is a historical dictionary of the language of cooking, a language that also played a considerable role in the shaping of a national linguistic model after the Unification of Italy. The dictionary is based on a representative corpus (currently 42 texts), but by its nature it is a work in progress, open and progressively growing. The first sample entries (such as cappelletti, anolini, tagliatelle, bagnomaria), presented at various conferences and in several articles, were based on a restricted corpus (28 texts) and brought to light some critical issues that made a further methodological reflection necessary. The aim of our paper is to discuss some aspects of these investigations and of this methodological reflection: a) the structure of the entry in differentiated form ("light" and "complex"); b) the treatment of the collocations emerging from the statistical analysis tools of the corpus; c) the lemmatization of compound words in the face of the morphological variation emerging from the diachronic depth of the corpus; d) the correct balance between the examples quoted in the entry and the possibility of a direct interrelation with the database.
Keywords: language of cooking, special languages, corpus linguistics, lexicography, vocabulary, Italian, historical dictionary

1. The VoSCIP
The "Vocabolario storico della cucina italiana postunitaria" (VoSCIP) was conceived to document the constitution and consolidation of a unitary culture and language of gastronomy in Italy after Unification. This need is well known to all specialists in the field (linguists, food historians, sociologists, etc.), and in our specific case it arose from a precise research perspective: examining the ways and means by which a "common" gastronomic Italian established itself, starting from Pellegrino Artusi and from the archetypal model of his enormously successful La scienza in cucina e l'arte di mangiar bene. The project "L'Italiano in cucina. Per un Vocabolario storico della lingua italiana della gastronomia" has been taken up by the Accademia della Crusca, which has included it among the studies contributing to its strategic project dedicated to the compilation of a dictionary of post-unification Italian. For the realization of the VoSCIP, the first step was to fix a representative corpus of texts, in which a pivotal role naturally falls to the Scienza in cucina: a corpus which, for reasons of practical feasibility, we decided to extend up to the Second World War and its immediate aftermath, with the hoped-for prospect of later moving the terminus ad quem to the present day (including, besides the printed texts after 1950, the various forms of "transmitted" production, from recipe collections on the web to blogs and social media). The main reference corpus currently comprises over a hundred volumes published between the end of the eighteenth century (we return shortly to the reasons for pushing back the terminus post quem) and 1950. The texts were selected using the main bibliographies of Italian gastronomic production of the period (precious, first of all, the one by Alberto Capatti accompanying the 2010 Rizzoli edition of Artusi's Scienza). We necessarily also had to take account of practical factors, first of all the availability of the works and above all their accessibility through the Academia Barilla, with which the Accademia della Crusca has signed a specific agreement for this purpose. At present, forty texts have been digitally acquired and marked up (XML/TEI).
Before proceeding, a due (and already announced) clarification on the terminus post quem: even though our primary objective is, as we said, to collect and describe the language of the post-unification Italian gastronomic tradition, in order to better document the origins of this Italian of the kitchen (especially as regards phraseology, i.e. first of all multiword units and collocations, but also proverbial sayings, idioms, etc.), we decided to include some of the most significant texts from between the late eighteenth and early nineteenth centuries, starting with the two redactions of the Apicio moderno and with Vincenzo Corrado's Cuoco galante. To the same end, we are also systematically excerpting everything pertinent to the semantic field of food in the Italian lexicographic tradition, starting from the five impressions of the Vocabolario degli Accademici della Crusca, from the Tommaseo-Bellini, from the Giorgini-Broglio, and above all from Alfredo Panzini's Dizionario moderno (first edition 1905). The interest of this dictionary, which offers a true panorama of Italian life and culture between the late nineteenth and the twentieth century, lies from our point of view precisely in the space given to new words, which already in the first edition Panzini catalogued as "scientific, technical, medical, philosophical, [foreign words, neologisms, sport words,] fashion, theatre, kitchen" words. Indispensable in the lexical field of food (as is well known) is the diatopic dimension, for which the VoSCIP will be able to draw on the important results of twentieth-century geolinguistic research, first of all the linguistic atlases: the AIS and the ALI, but also the ASLEF, the ALEPO, the ALT, the ALLI, and the precious materials being published for the ALS (among which at least the paradigmatic volume Ruffino 1995 should be mentioned). To verify the feasibility of our project we produced some pilot entries: we started from tagliatella, followed by agnelotto, cappelletto and anolino; in a quite different field we have very recently drafted the entry bagnomaria. Precisely the drafting of these entries, and in particular of the last one, bagnomaria, revealed some critical weaknesses of the entry model originally devised and made it necessary to rethink it in a way that fully exploits the resources of computer-aided (or computerized) lexicography and of the multimedia tools available today.

2. The database
The corpus texts were given a light XML/TEI markup, aimed above all at lexicographic purposes. To date, 42 texts covering the chronological span uniformly have been acquired, collated and marked up. The header provides the indications of author, title, place of publication, publisher, year, text type and diamesic indication, so that they can serve as the basis for filtering specific subcorpora. Within the text, the pages of each volume were marked (so that the transcriptions can be linked, case by case, to the facsimile reproduction of the original), as were any figures and the passages in languages other than Italian (so that they can be excluded from the lexicographer's queries). No markup was envisaged for foreign loanwords, which, like all other lexemes, must be properly analysed by the lexicographer in each of their contexts.
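A minimal sketch of the light XML/TEI header just described, built with the Python standard library. The field names follow the paper's list (author, title, place, publisher, year, text type, diamesic indication), but the concrete TEI mapping and the sample values are assumptions, not the project's actual schema.

```python
# Hedged sketch: emit a minimal TEI-like header for one corpus text.
import xml.etree.ElementTree as ET

def tei_header(author, title, place, publisher, year, text_type, diamesic):
    hdr = ET.Element("teiHeader")
    file_desc = ET.SubElement(hdr, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    pub = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub, "pubPlace").text = place
    ET.SubElement(pub, "publisher").text = publisher
    ET.SubElement(pub, "date").text = str(year)
    profile = ET.SubElement(hdr, "profileDesc")
    ET.SubElement(profile, "textClass").text = f"{text_type}; {diamesic}"
    return hdr

# Illustrative values only (the Artusi volume is in the corpus; the rest
# of the metadata here is stated as an example, not taken from the project).
hdr = tei_header("Pellegrino Artusi",
                 "La scienza in cucina e l'arte di mangiar bene",
                 "Firenze", "Landi", 1891, "ricettario", "scritto")
print(ET.tostring(hdr, encoding="unicode"))
```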
In a second phase of the markup of the first 42 texts, currently under way, the marking of the individual recipes and of their titles is also planned. The primary purpose of this markup is to obtain an open list of the recipes present in the corpus, which can then be compared with one another using dedicated algorithms operating on the forms present in the title. In this way it will be possible to trace a diachronic line for individual recipes and to follow the evolution of the language they contain. As for the computational treatment, it should be borne in mind that the database is an example of hybrid textuality, both in relation to the philological acquisition of the text and its queryability, and as regards the applicability of automatic lemmatization procedures. Since these are nineteenth- and twentieth-century texts, the chances of good results from the computational-linguistic tools developed nationally and internationally decrease progressively as one moves away from the present day back towards 1861, but even for nineteenth- and early twentieth-century texts the guarantees are sufficient. Given the particular nature of the database, its chronology and its lexicographic purpose, in balancing the management of resources it was therefore preferred not to invest in controlled lemmatization, which would have required corrective measures for nineteenth- and early twentieth-century language both in the machine dictionaries and in the machine morphologies currently in circulation (mostly anglophone-based, with all the limits this entails, and even in the best cases calibrated on recent written Italian; cf. Biffi 2016). The database (currently being tested in its beta version) can thus be consulted with a form-based search engine, enhanced by tools (wildcards, fuzzy search) that facilitate the identification of formal, morphological and grapho-phonetic variants, and by an automatic lemmatization based on existing machine morphologies (hence calibrated on contemporary written Italian, but still sufficiently functional for retrieving variant forms in nineteenth- and twentieth-century texts, especially for lexicographic purposes). The query platform provides specific functions for distance searches and collocations, and access to statistical data, both in tabular form and as heatmaps and tag clouds. With these characteristics, the database can moreover be fully homogeneous with those gravitating around the project Corpus di riferimento per un nuovo vocabolario dell'italiano moderno e contemporaneo. Fonti documentarie, retrodatazioni, innovazioni, funded under PRIN 2012 and coordinated by Claudio Marazzini, thus offering ample room for dialogue with the lexicographic tools connected to it.
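The wildcard and fuzzy search just mentioned can be approximated with the Python standard library, as in the hedged sketch below; `difflib` and `fnmatch` are stand-ins for the platform's own matching, and the form index is an invented example.

```python
# Hedged sketch: wildcard and fuzzy retrieval of variant forms.
import difflib
import fnmatch

forms = ["tagliatelle", "tagliatele", "taglierini",
         "bagnomaria", "bagno-maria", "bagno maria"]  # placeholder form index

# Wildcard search ("caratteri jolly"): * also matches the empty string,
# so all three graphic variants of bagnomaria are returned.
print(fnmatch.filter(forms, "bagno*maria"))

# Fuzzy search for graphic/morphological variants of a queried form.
print(difflib.get_close_matches("tagliatella", forms, n=3, cutoff=0.8))
```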
3. Entry structure and electronic dictionary
The structure devised for the entries naturally reflects the characteristics of historical dictionaries. Its architecture is as follows:

LEMMA + grammatical category
0.1. Forms attested in the text corpus (with all variants). The lemmatized form for the main entry is the one most widespread in current usage, following Tullio De Mauro's GRADIT, Grande dizionario italiano dell'uso, with its updates.
0.2. Essential etymological note.
0.3. First attestation in the corpus.
0.3.1. Numerical indication of frequency (for each form; in the indication of occurrences, the second figure, preceded by the sign +, refers to forms present in any indexes).
0.4. Geographical distribution of the variants. For now, only data from the AIS and the ALI are given; a footnote adds the comparison with the forms recorded by Touring Club Italiano 1931.
0.5. Linguistic/merceological notes (foreign loanwords; Italianisms in other languages). The bibliography for now concerns only the 'linguistic notes' and, as regards Italianisms in other languages, the DIFIT (available electronically at http://www.italianismi.org/difit-elettronico).
0.6. Summary of the senses.
0.7. Multiword units and phrases proper (with the first attestation in the corpus).
0.8. Cross-references (above all 'hyperlemmas', or, if one prefers, 'general' linking entries).
0.9. Lexicographic correspondences (= matches in dictionaries and in online lexicographic corpora), distinguishing etymological dictionaries (including the LEI) from descriptive ones (in chronological order, starting from the Tommaseo-Bellini).
1. First definition — Contexts
1.1. Subordinate definition — Contexts
1.2. Subordinate definition — Contexts
[...]
2. Second definition — Contexts
[...]

The entry recalls, with appropriate adaptations, that of the TLIO, Tesoro della Lingua Italiana delle origini, of the Istituto dell'Opera del Vocabolario Italiano of the CNR in Florence. The first experiments, on which for example the last sample entry, bagnomaria, is based (drawn from an initial version of the corpus limited to 28 texts), showed however that this structure risks being too heavy for the project to remain actually feasible. The emerging "dimensional" limits (clearly visible in Bertini Malgarini and Vignuzzi 2017) are tied above all to the richness of the examples and to the length of the quotations from other lexicographic tools. Both limits, however, can be addressed by increasing the interaction with the other connected and connectable tools: first of all by providing for a deep interaction between the textual database and the dictionary, both in the drafting of the entry and in its publication. In this way it will be possible to limit the number of examples quoted and then refer to a complete dossier of occurrences through the link with the digital corpus. With a view to creating open access to the text database, the question of the public use of copyrighted texts must be addressed. The issue had already been tackled within the PRIN 2008 group "Il portale della TV, la TV dei portali", and at the project's final conference Marina Pietrangelo — a researcher at ITTIG (Istituto di Teoria e Tecniche dell'Informazione Giuridica), specifically invited to speak on the legal use of audiovisual materials in research corpora — responded with a substantial green light, provided for by the law in the case of projects with exclusively research purposes and no economic exploitation (Pietrangelo 2017).
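To make the entry architecture above concrete, here is a hypothetical data structure mirroring fields 0.1-0.9 plus the numbered senses; the names, types and the sample values are illustrative assumptions, not the project's internal model.

```python
# Hedged sketch: a possible in-memory shape for a VoSCIP-style entry.
from dataclasses import dataclass, field

@dataclass
class Sense:
    definition: str
    contexts: list[str] = field(default_factory=list)
    subsenses: list["Sense"] = field(default_factory=list)

@dataclass
class Entry:
    lemma: str
    pos: str                                   # grammatical category
    attested_forms: dict[str, int]             # 0.1 / 0.3.1: variant -> frequency
    etymology: str                             # 0.2
    first_attestation: str                     # 0.3
    geo_distribution: str                      # 0.4 (AIS / ALI data)
    notes: str                                 # 0.5
    senses: list[Sense] = field(default_factory=list)            # 0.6 + 1., 2., ...
    phrasemes: list[str] = field(default_factory=list)           # 0.7
    cross_refs: list[str] = field(default_factory=list)          # 0.8
    lexicographic_refs: list[str] = field(default_factory=list)  # 0.9

# Illustrative instance only; figures and wording are placeholders.
e = Entry("bagnomaria", "s.m.", {"bagnomaria": 12, "bagno-maria": 3},
          "(placeholder etymological note)", "(placeholder first attestation)",
          "", "", [Sense("cooking technique: gentle heating in a water bath")])
print(e.lemma, list(e.attested_forms))
```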
References to the other dictionaries are likewise to be implemented through links to the electronic versions currently available online (for example the Tommaseo-Bellini in Tommaseo online; the editions of the Vocabolario degli Accademici della Crusca in Lessicografia della Crusca in rete; and finally the post-unification dictionary being built within the PRIN 2015 project 'Vocabolario dinamico dell'italiano post-unitario', coordinated by Claudio Marazzini). Procedures are still under study for handling the collocations emerging from the statistical tools used to analyse the corpus, and for lemmatizing compound words in the face of the morphological polymorphy produced by the corpus's diachronic depth. Within a true lexicographic workstation, all these tools will be integrated into a back-office system which, through stages of progressive evaluation and control, will lead to the direct online publication of each entry. Finally, precisely the potential interaction/integration with the future 'Vocabolario dinamico dell'italiano post-unitario' mentioned above has led the research group to design a variable lexicographic record: alongside the in-depth entry of the historical dictionary there will be an entry structured according to the specifications of a synchronic dictionary, for those headwords that still belong to present-day Italian usage, together with field-calibration tools that expert and non-expert users will be able to adjust so as to obtain a personalized entry each time. During the discussion an 'exemplary' entry of the VoSCIP will be presented and discussed, also with regard to the selection and organization of the lexicographic material and to its publication (online and in print).

References
Bertini Malgarini, P. and Vignuzzi, U. (2017). Bagnomaria nel Vocabolario storico della cucina italiana postunitaria (VoSCIP): http://permariag.wixsite.com/permariagrossmann/vignuzzi.
Biffi, M. (2016). Progettare il corpus per il vocabolario postunitario. In Marazzini, C. and Maconi, L. (eds), L'italiano elettronico. Vocabolari, corpora, archivi testuali e sonori. Accademia della Crusca, pp. 259-280.
Pietrangelo, M. (2016). Per un uso legale degli audiovisivi in corpora di ricerca. In Alfieri, G., Biffi, M. et al. (eds), Il portale della TV. La tv dei portali. Bonanno, pp. 171-185.
Ruffino, G. (1995). I pani di Pasqua in Sicilia. Un saggio di geografia linguistica e etnografica. Centro di Studi Filologici e Linguistici Siciliani.
Touring Club Italiano (1931). Guida gastronomica d'Italia. Touring Club Italiano [anastatic reprint 2003].

Tools
AIS = Jaberg, K. and Jud, J. (1928-1940). Sprach- und Sachatlas Italiens und der Südschweiz. Ringier, 8 vols. (It. transl. 1987: AIS. Atlante linguistico ed etnografico dell'Italia e della Svizzera meridionale, Unicopli). Also online as NavigAIS.
ALEPO = Telmon, T. and Canobbio, S. (1984-). Atlante linguistico ed etnografico del Piemonte occidentale.
ALI = Bartoli, M. G. et al. (1995-). Atlante Linguistico Italiano. Istituto Poligrafico e Zecca dello Stato.
ALLI = Moretti, G. et al. (1982-). Atlante Linguistico dei Laghi Italiani.
ALS = Ruffino, G. (1995-). Atlante Linguistico della Sicilia.
ALT = Giacomelli, G. (2000). Atlante Lessicale Toscano. LEXIS (on CD-ROM); now online as ALT-WEB.
ASLEF = Pellegrini, G. B. et al. (1972-). Atlante Storico-Linguistico-Etnografico Friulano.
Istituto di glottologia e fonetica dell'Università; Istituto di filologia romanza della Facoltà di lingue e letterature straniere dell'Università.
DIFIT = Stammerjohann, H. (2008). Dizionario di italianismi in francese, inglese e tedesco. Accademia della Crusca. Also online.
GRADIT = De Mauro, T. (2007). Grande Dizionario Italiano dell'Uso. UTET.
LEI = Pfister, M. and Schweickard, W. (1979-). Lessico Etimologico Italiano, published on behalf of the Commissione per la Filologia romanza. Reichert.
Lessicografia della Crusca in rete = Accademia della Crusca (2004). Lessicografia della Crusca in rete.
TLIO = Opera del Vocabolario Italiano (1997-). Tesoro della lingua italiana delle origini.
Tommaseo-Bellini = Tommaseo, N. and Bellini, B. (1861-1879). Dizionario della lingua italiana. Società L'Unione Tipografico-Editrice.
Tommaseo online = Accademia della Crusca (2015). Tommaseo online.

Strumenti informatico-linguistici per la realizzazione di un dizionario dell'italiano postunitario [Computational-linguistic tools for the creation of a dictionary of post-unification Italian]
Marco Biffi
Università degli Studi di Firenze – marco.biffi@unifi.it

Abstract
The paper focuses on some general problems concerning representative corpora for the compilation of dictionaries. It starts from the concrete case of the Vocabolario dell'italiano post-unitario which, owing to its hybrid nature, offers a complete view of the critical issues of both synchronic and historical lexicography. The concept of the Banca linguistica ('language bank') is therefore introduced: a platform bringing together different types of corpora, a meta-search engine over the existing databases, and tools for accessing the existing electronic dictionaries. A final section is devoted to the concept of the 'quantum relativity' of computational-linguistic data.

Summary
This contribution focuses on some general problems concerning the construction of representative corpora for dictionary making, starting from the concrete case of the Vocabolario dell'italiano post-unitario which, because of its hybrid nature, offers a complete picture of the critical issues of both synchronic and historical lexicography. The concept of the Banca linguistica is therefore introduced, a platform bringing together different types of corpora, a meta-search engine for querying the existing databases, and systems of integration with the existing electronic dictionaries. Finally, the concept of the 'quantum relativity' of the data that can be extracted from computational-linguistic research is discussed.

Keywords: corpus linguistics, Italian, synchronic dictionary, historical dictionary, electronic text, balancing, meta-search engine, language bank, quantum relativity, linguistic informatics, computational linguistics

1. Introduction
In this contribution I shall try to bring into focus some general problems concerning the construction of tools for dictionary making, starting from a specific case: the project for a 'hybrid' dictionary, at once historical and synchronic, on which a national research group coordinated by Claudio Marazzini is working. The project, whose final goal is a dictionary of post-unification Italian gathering the national linguistic heritage of the official language of the State from 1861 to the present, began with a first phase funded under PRIN 2012, Corpus di riferimento per un Nuovo Vocabolario dell'Italiano moderno e contemporaneo.
Fonti documentarie, retrodatazioni, innovazioni; it was then able to continue with a second round of funding under PRIN 2015, Vocabolario dinamico dell'italiano postunitario. Numerous Italian universities took part in the two projects: Piemonte Orientale, Milan, Genoa, Florence, Viterbo, Naples, Catania (the corpus project also involved the Istituto di Teoria e Tecniche dell'Informazione Giuridica, ITTIG, of the CNR in Florence; the dynamic-dictionary project also involves the University of Turin); the Accademia della Crusca collaborated as an external partner, and for the Accademia the post-unification dictionary is one of its three current strategic projects, alongside the Vocabolario dantesco and the Osservatorio degli italianismi nel Mondo (OIM). While the dynamics of using corpora for historical dictionaries are well known, especially after the experience of the TLIO, Tesoro della lingua italiana delle origini, of the Istituto dell'Opera del Vocabolario Italiano of the CNR in Florence, much less thought has been given to the practical implications of building a synchronic dictionary on a representative corpus; and entirely new is the case of a hybrid tool such as the post-unification dictionary, in which the critical issues of historical and synchronic computational lexicography mix, showing that one must rather move in the direction of articulated tools.

2. Critical issues in the physiognomy of a representative corpus of post-unification Italian
A first problem to be faced for a corpus designed to support a dictionary is its size. A quick check of the reference corpora for other European languages (in particular English and German, which have paid greater attention to this issue) shows that both the British National Corpus (10% of which consists of transcriptions of spoken English; cf. Cresti and Panunzi 2013: 36-37) and the DWDS-Kerncorpus (twentieth-century texts of five types: literature, 25%; newspapers, 25%; scientific prose, 20%; guides, recipe books and similar texts, 20%; transcribed spoken language, 10%; cf. Klein 2013: 18-19) have a size of about 100 million words. This was the size that, in the first decade of this century, defined standard-sized corpora (cf. Chiari 2007: 45, according to the table given there); indeed, 100 million words was the threshold dividing standard corpora from large ones. Taking into account the computational and methodological progress of recent years, some corrective is certainly in order; and such correctives do exist for both English and German, because the balanced corpora are flanked by thesauri. The BNC has recently been flanked by the Bank of English (a monitor corpus, in Sinclair's terminology, of complete texts totalling 650 million words; cf. Cresti and Panunzi 2013: 36-37); the Kerncorpus has been joined by several modern newspaper corpora (post-1995) and other smaller text collections, for a total of 2.6 billion words (and, on the diachronic side as well, the picture is being completed with the Deutsche Textarchiv, under construction since 2005 and now nearing completion, which gathers 1,500 carefully chosen books, usually first editions, and newspaper volumes from the period between 1650 and 1900; cf. Klein 2013: 18-19).
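For concreteness, genre quotas of the kind used for the DWDS-Kerncorpus translate directly into target word counts. A trivial worked sketch for a hypothetical 100-million-word balanced corpus, with the quota labels following Klein 2013 as quoted above:

```python
# Genre shares as reported for the DWDS-Kerncorpus (cf. Klein 2013).
quotas = {
    "literature": 0.25,
    "newspapers": 0.25,
    "scientific prose": 0.20,
    "guides, recipe books and similar": 0.20,
    "transcribed speech": 0.10,
}
corpus_size = 100_000_000  # words

for genre, share in quotas.items():
    print(f"{genre}: {int(share * corpus_size):,} words")
# literature and newspapers get 25,000,000 words each, and so on.
```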
As regards the collection of texts, the importance of what has been called the 'dimensional parabola of corpora' (Biffi 2016: 262) has already been stressed.

[Figure 1: the dimensional parabola of corpora.]

The analytic-geometric representation of this parabola shows the relationship between the language across the centuries (here, Italian) and the possibility of representing it with a corpus on the order of 100,000 words (kilowords), of millions of words (megawords), or of billions of words (gigawords). The possibility of building large corpora decreases the further back in time one goes, while it increases dramatically for the language of the present day, with sizes that are by now very large indeed: certainly not corresponding to everything produced in a given language, which is obviously impossible, but tending to come very close to it. The small size of corpora of the Italian of the past, as the curve underlines, is not only due to the objective fact that fewer texts survive from the past, but, decisively, to the fact that it is much harder to assemble texts of the past in electronic form so that they can be queried effectively. The difficulties stem from the limits of all the computational tools involved in building electronic corpora, which paradoxically converge in shaping this curve: the effectiveness of OCR (automatic optical character recognition), the effectiveness of machine morphologies for lemmatization, and the effectiveness of the search engines that are readily available at low cost; they therefore affect the processes of text acquisition, text processing, and querying/queryability alike (Biffi 2016: 263-267). For the past, the effects of the parabola make the problem of genuine representativeness of the reference corpus manageable. Indeed the TLIO, which covers the period from the origins to 1375, can start from a corpus gathering a substantial collection of vernacular texts of the period, ranging across the diatopic and diaphasic axes (and hence also guaranteeing great representativeness in diastratic terms). It rests on very solid foundations even though its size, on the corpus measurement scale, is not particularly large. Its small size has in fact made it possible to neutralize the 'parabolic' effects on the acquisition and processing of the electronic text (the texts, drawn from the main critical editions, could be carefully collated), as well as on the effectiveness of machine morphologies (the corpus was in practice lemmatized manually, with the help of semi-automatic procedures). The possibility of designing and building a custom lemma-based and form-based search engine then definitively removed the problems of querying and queryability. But it is clear that even moving up only slightly in the chronology, precisely because of the 'parabolic' effect, the problems increase dramatically.
As regards machine morphologies, for example, it would be advisable to recalibrate them according to the diachronic variation of morphological and morphosyntactic structures along the time axis (and experiments are being made: for instance on the morphology of Leonardo's language, in a project funded by the Biblioteca Leonardiana in Vinci, whose linguistic side I am curating); but the process is long and has never been tackled systematically, either methodologically or pragmatically. The reason, and this holds for every aspect of computational linguistics and, more generally, of what I prefer to call linguistic informatics, is the general tendency to work in small monads rather than to build a system, pooling skills and tools so as to broaden and refine the available technologies and make them ever more powerful. So today we have various tools, partly overlapping, partly complementary, but nothing truly shareable and improvable in an open-source fashion, which would allow efforts to be concentrated on what is really missing or weak. The 'piece' represented by machine morphologies is particularly significant: building a diachronic corpus for a historical dictionary means providing the first bricks for recalibrating the existing machine morphologies to the period under consideration; yet in no case has anyone thought of using these corpora of the past as a starting point for improving lemmatization procedures, which in turn would enhance lexicographic possibilities in a virtuous circle destined to refine the tools available to the scientific community. To return to the specifics of the dictionary of post-unification Italian: its hybrid character places it in a particularly delicate position because, insofar as it is diachronic, from 1861 to 2000, it suffers from the computational limits discussed above (even though, for example, optical character recognition gives excellent results in this chronological segment). But it becomes decidedly synchronic in the period 2000-2014, when we can create an enormous massive corpus (on the gigaword scale) with relative ease, simply by drawing on the web through data-crawling programs (web crawlers, or spiders), as is well demonstrated by RIDIRE (www.ridire.it, directed by Emanuela Cresti), a corpus of 1.3 billion words built with a controlled crawler that allowed a 'balancing' based on semantic domains (architecture and design, figurative arts, cinema, cooking, literature and theatre, fashion, music, religion, sport) and functional domains (administration and legislation, economy and business, news); a toy sketch of such balancing appears below.

3. From the representative corpus to the 'Banca linguistica'
From a theoretical point of view, the best choice for the reference corpus of the dictionary of post-unification Italian would be a balanced corpus on the megaword scale covering 1861-2014, flanked by a massive corpus on the gigaword scale for 2000-2014: a result which, as we have seen, is now achievable. The research group, however, started from a pre-existing situation of completed projects and studies already under way with solid results, so it was decided to make the most of the group members' experience, recovering all the materials each could bring to the project as a dowry, to be then expanded and consolidated with specific expertise.
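Returning for a moment to the domain-balanced crawling described above, such balancing can be sketched as sampling from domain-tagged documents until per-domain word targets are met. A minimal sketch under invented data; the domains and targets are illustrative, not RIDIRE's real parameters:

```python
import random
from collections import defaultdict

def balance_by_domain(docs, targets, seed=0):
    """docs: (doc_id, domain, n_words) tuples from a crawl;
    targets: desired word count per semantic domain."""
    random.seed(seed)
    pools = defaultdict(list)
    for doc in docs:
        pools[doc[1]].append(doc)
    sample = []
    for domain, target in targets.items():
        pool = pools[domain]
        random.shuffle(pool)
        filled = 0
        for doc in pool:
            if filled >= target:
                break
            sample.append(doc)
            filled += doc[2]
    return sample

docs = [("d1", "cucina", 1200), ("d2", "cucina", 900), ("d3", "sport", 2000)]
print(balance_by_domain(docs, {"cucina": 1500, "sport": 1800}))
```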
Coverage is therefore 'leopard-spotted', and it is consequently necessary to make maximal use, even for the chronological zone from 1861 to 2000, of a massive approach, which leads inevitably to the 'language bank', the thesaurus, from which a balanced corpus can then be extracted (or more than one, dynamically, also according to the needs of the editor drafting a given entry).

[Figure 2: the 'Banca linguistica' platform.]

The 'Banca linguistica' can be a platform making available various sub-corpora, in which all materials are gathered with semantic markup allowing subsequent balancing, with a 'central corpus' that will be the primary working basis for the lexicographer of the post-unification dictionary but will have to be continually calibrated against the data emerging from the consultation of the massive contemporary corpus and of the diachronic sub-corpora it contains. The platform will also have to communicate with the electronic dictionaries available for the period from 1861 to today: for the diachronic side, Tommaseo Online and the electronic version of the fifth edition of the Vocabolario degli Accademici della Crusca included in Lessicografia della Crusca in rete (in the hope that the agreement signed in September 2017 between UTET and the Accademia della Crusca for the digitization of the Grande Dizionario della Lingua Italiana bears fruit quickly); the synchronic dictionaries available online (the Sabatini Coletti, the De Mauro, the Treccani, and whatever else becomes available); and all the corpora of Italian present on the web, including the precious electronic archives of the main national newspapers (Biffi 2016: 272-273). It should not be forgotten that the panorama of Italian corpora is fairly broad (for a general picture see Cresti and Panunzi 2013, though the census needs to be refined). What has been lacking, however, naturally enough, is an organic policy of system building: we therefore have an extreme heterogeneity of tools, platforms and encodings (fortunately, at least for encoding, the centrifugal force has in recent years been progressively contained by the increasingly frequent, if not universal, recourse to XML/TEI), which forces the researcher to connect n times, on n platforms, with n different philosophies and n different engines, in order to carry out a comprehensive search. A meta-engine therefore becomes essential. A beta version of a meta-engine for Italian corpora has been built by the University of Florence research unit of the PRIN 2012 group, directed by me (www.metaricerche.it). As the section of the portal entitled 'Il metamotore' explains: 'The tools identified were classified according to the possible levels of integration: freely searchable corpora; corpora freely searchable after registration; corpora to be downloaded. A feasibility study was then drawn up to define a series of procedures for analysing the source tools, determining their level of integrability (which also depends on being able to interact with the technical staff of the individual database, following a "strategic" agreement on content sharing) and identifying the procedures to follow for each level. The architecture of the system, the reference technology and the query interface were then defined, at least for a first prototype version of the platform.'
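Architecturally, a meta-engine of this kind fans one query out to heterogeneous backends and merges the hits with their provenance. A minimal sketch with hypothetical adapters; the actual interfaces of www.metaricerche.it are not documented here, so everything below is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor

class CorpusAdapter:
    """One adapter per database; each hides that platform's query syntax."""
    name = "base"
    def search(self, query: str) -> list[dict]:
        raise NotImplementedError

class LITAdapter(CorpusAdapter):
    name = "LIT (Lessico Italiano Televisivo)"
    def search(self, query):
        # Hypothetical: a real adapter would call the database's endpoint.
        return [{"source": self.name, "hit": f"...{query}..."}]

class NewspaperAdapter(CorpusAdapter):
    name = "newspaper archive"
    def search(self, query):
        return [{"source": self.name, "hit": f"...{query}..."}]

def meta_search(query: str, adapters: list[CorpusAdapter]) -> list[dict]:
    """Fan the query out in parallel and merge hit lists with provenance."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda a: a.search(query), adapters)
    return [hit for hits in results for hit in hits]

print(meta_search("lingua", [LITAdapter(), NewspaperAdapter()]))
```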
The beta version integrates 8 databases, chosen as samples of the main types of integration level:
- Maximum level (an agreement was reached with the technical staff managing the database): LIR (Lessico dell'Italiano Radiofonico), LIS (Lessico dell'Italiano Scritto) and LIT (Lessico Italiano Televisivo), Accademia della Crusca.
- Basic level (the database was integrated in a window, pending greater interoperability): MIDIA (Morfologia dell'Italiano in DIAcronia), Università Roma Tre; CorDIC (Corpora Didattici Italiani di Confronto), Laboratorio Linguistico Italiano, Università degli Studi di Firenze.
- Minimum level (the database was integrated in a window with no possibility of greater interoperability): the archives of the newspapers «Corriere della Sera» and «La Repubblica».
If this tool can be strengthened to the point of bringing together in its result list all the textual databases currently available for Italian, the 'Banca linguistica' of the editor of the post-unification dictionary will offer centralized access to all existing corpora, to be integrated, modulated and balanced with the corpus assembled by the PRIN research group, with the massive corpus of contemporary Italian, and with the electronic lexicographic tools. Some critical issues remain which, if removed, would allow a further strengthening of the 'Banca linguistica'; they can only be recalled briefly here, point by point.
a) Most texts (for example recent literary ones) cannot be organized into searchable corpora because of copyright difficulties.
b) Diachronic corpus collections, with rare exceptions (such as the CEOD, Corpus Epistolare Ottocentesco Digitale), favour the high-register literary tradition. Representative samples of post-unification Italian already exist, such as the DIACORIS (25 million occurrences), but the gaps concerning special languages still need to be filled (as was attempted within the PRIN 2012 project).
c) It remains to be investigated how much material for the diachronic sections can be recovered (more or less automatically) from the web, thanks above all to the massive presence of nineteenth-century texts gathered in digital libraries such as Google Books and Archive.

4. Linguistic informatics and quantum relativity
If the starting point for compiling a dictionary is no longer a homogeneous, purpose-built reference corpus but a 'language bank' in which one must manage non-homogeneous, exogenous materials, it is worth recalling in this final section the importance of the 'quantum' implications of linguistic informatics (Biffi 2018: 545-549). When consulting databases (including in this category not only corpora but also the electronic editions of dictionaries), it is not hard to run into diffractions in the quantitative results (and hence in the qualitative ones, insofar as gaps may arise in the search for particular contexts). These are certainly explained in part by human errors introduced in the various phases of database construction (by the modern digital copyists, by the programmers, by the design), but also by the concurrence of precise, identifiable factors.
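Tokenization alone is enough to produce such diffractions: two defensible tokenizers disagree on the count for the same string. A minimal demonstration:

```python
import re

text = "dell'acqua, po' di sale; l'olio d'oliva."

tokens_a = text.split()                                 # whitespace only
tokens_b = re.findall(r"[a-zàèéìíòóù]+", text.lower())  # apostrophes and punctuation as boundaries

print(len(tokens_a), tokens_a)  # 6 tokens
print(len(tokens_b), tokens_b)  # 9 tokens: same text, two different 'measurements'
```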
In the contribution cited (Biffi 2018), the diffractions concern the number of lemmas returned by the three electronic versions of the 1612 Vocabolario degli Accademici della Crusca, and they can be traced back to differences in tokenization, different approaches to restoring the entries' intrinsic database structure, and different priorities in the restitution of the electronic text. In other databases the diffraction factors will probably be different, but one will always have to take into account the characteristics and architecture of the database, as well as of the search tools applied to it. Just as the exact sciences, from Heisenberg onwards, must allow for the indeterminacy introduced by the measuring instrument, so when consulting databases it is well to remember that the characteristics of the instrument of knowledge (here, the database) perturb the result of the search, forcing us into an inevitable 'quantum' approximation; a perturbation that can nonetheless be mastered, since the causes of diffraction can be reconstructed and the final result corrected, as happens with quantum mechanics wherever it must replace classical mechanics. To obtain scientific results from a database, then, one must know the characteristics of the instrument thoroughly and allow for its 'quantum' variability when reading the data. And when reading and handling the results it is necessary to be aware not only of which instrument was used, but also of the specific search procedures applied; in other words, one must constantly bear in mind the philological context of the computational search, exactly as one takes account of the variants in the apparatus when consulting a critical edition of a text.

References
Biffi, M. (2016). Progettare il corpus per il vocabolario postunitario. In Marazzini, C. and Maconi, L. (eds), L'italiano elettronico. Vocabolari, corpora, archivi testuali e sonori. Accademia della Crusca, pp. 259-280.
Biffi, M. (2018). Tra fiorentino aureo e fiorentino cinquecentesco. Per uno studio della lingua dei lessicografi. In Belloni, G. and Trovato, P. (eds), La Crusca e i testi. Lessicografia, tecniche editoriali e collezionismo librario intorno al Vocabolario del 1612. libreriauniversitaria.it, pp. 543-560.
Chiari, I. (2007). Introduzione alla linguistica computazionale. Laterza.
Cresti, E. and Panunzi, A. (2013). Introduzione ai corpora dell'italiano. Il Mulino.

Comparaison de corpus de langue « naturelle » et de langue « de traduction » : les bases de données textuelles LBC, un outil essentiel pour la création de fiches lexicographiques bilingues [Comparing 'natural language' and 'translation language' corpora: the LBC textual databases, an essential tool for creating bilingual lexicographic entries]
Annick Farina, Riccardo Billero
Università degli Studi di Firenze – annickfarina@unifi.it; riccardo.billlero@gmail.com

Abstract
The aim of this paper is to describe the work done to exploit the LBC database for the purpose of translation analysis, as a resource for editing the bilingual lexical sections of our dictionaries of Cultural Heritage (in nine languages). This database, made up of nine corresponding corpora, contains texts whose subject is cultural heritage, ranging from technical texts on art history to books on art appreciation, such as tour guides and travel books highlighting Italian art and culture. We will illustrate the different queries with the SketchEngine LBC French corpus, currently made up of 3,000,000 words.
Our particular interest here is in queries that not only guide translators' lexical choices but also precede the selection of bilingual quotations (from our Italian/French parallel corpus), and on which we rely for editing an optional element of the entry called 'translation notes'. For this we draw both on the work on 'universals of translation' described by Baker (1993) and on studies aimed at improving Translation Quality Assessment (TQA). We will show how a targeted consultation of the different corpora and sub-corpora that the database allows us to distinguish ('natural language' vs 'translation', 'technical texts' vs 'popularization texts' or 'literary texts') can help us identify approximations or translation errors, so as to build quality comparative lexicographic information.

Keywords: electronic lexicography, multilingual lexical resources, corpus linguistics

Résumé
This article describes our work on the LBC database as regards the analysis of translations as resources for drafting the bilingual part of our Heritage dictionaries (in the project's nine languages). The database contains distinct corpora for nine languages, made up of texts all related to the Italian heritage: technical texts from the various artistic fields, works of art criticism or art history, tourist guides, travel narratives, etc. We will illustrate various queries of the French corpus (currently about 3 million words) in SketchEngine. In particular, we focus on searches that guide us not only towards the selection of translation equivalents for certain terms but also precede the selection of bilingual citations (drawn from our future Italian/French parallel corpus), and on which we rely for drafting an optional element of the entry called 'translation notes'. In doing so we draw both on work on 'translation universals' (Baker 1993) and on studies aiming to improve translation quality (TQA: Translation Quality Assessment). We will show how a targeted consultation of the various corpora and sub-corpora that the database allows us to distinguish ('natural language' vs 'translated' texts, 'technical' vs 'popularizing' vs 'literary' texts) can help us detect approximations or translation errors, helping us build quality comparative lexicographic information.

Keywords: lexicography, multilingual lexical resources, linguistic corpora

1. Introduction
One of the main goals of the Lessico dei Beni Culturali project is to build monolingual dictionaries for nine different languages, geared to a precise use linked to a particular object: the description (and the translation of descriptions) of the Tuscan heritage, mainly in popularizing texts (tourist guides, museum websites, etc.). To this end we have built textual databases, which we keep extending as a work in progress and which will serve various tasks, from the creation of headword lists to the drafting of monolingual lexicographic/terminological entries and of translation records linking the headword lists of the different languages to one another (for a description of these databases cf. Billero et al. 2017).
It is the use of these textual databases for drafting bilingual translation entries that we illustrate here [Footnote 1: For the use of our databases in building the monolingual dictionaries, see the article by Nicolás and Lanini in this volume. We are currently building the headword lists for the various languages following the model they defined for Italian. The bilingual link between these lists will only become possible once our parallel databases have been built.], basing ourselves on the analysis of various SketchEngine queries (mainly statistical and contextual) of our LBC French corpus, currently about three million words. In particular we compare data from several comparable sub-corpora of 'natural language' texts and 'translated' texts. We also propose a first comparison of results from a subset of the Italian corpus with a subset containing the French translations of the same texts, which for the moment constitute fragmentary material, because we are still inserting texts with the aim of creating parallel databases of translations from Italian into all the project's languages. We will show how a targeted consultation of the different corpora and sub-corpora that the database allows us to distinguish (Italian 'natural language' vs French 'natural language', French 'translated' vs French 'natural language', French 'specialized texts' vs French 'popularization' vs French 'literary') can help us detect approximations or translation errors, helping us to build quality comparative lexicographic information.

2. Comparing 'natural language' and 'translated' corpora: a perspective halfway between descriptive and prescriptive translation studies
Drawing on analyses that do not regard the language of translation as a 'third code' (Frawley 1984), we consider that translated texts fit perfectly well within a single textual database of a given language, alongside 'natural language' texts. However, following proposals for using translation corpora for didactic purposes, both in language teaching and in translator training, it seems necessary to us to offer the possibility of consulting the database through distinct sub-corpora grouping texts of the two types, and to define evaluation criteria for the translated texts to be added to the database, by building separate corpora of translated texts in all the project's languages. These corpora are useful to us as translation-memory tools for working on the bilingual part of our lexicographic entries, in a perspective that is more prescriptive than descriptive.
As will be shown by our comparison of results from our LBC French database ('natural language' and 'translated') with a corpus of nearly 100,000 words, not yet integrated into the database and composed of 'popularizing' works translated into French (mainly tourist guides to Tuscany and museum websites), some of the texts of interest to us display characteristics assimilable to 'translationese' and could only distort database queries intended to attest forms or structures typical of French as it is written and spoken by the majority of its speakers without interference from another language.

2.1 Descriptive and prescriptive information in the LBC dictionaries: universals and deviations
Following Baker (1993), we start from the assumption that there exist translation universals, which will serve as a canvas for illustrating the various types of queries carried out on our sub-corpora and for comparing the results obtained. It is on these universals that we rely to provide the descriptive part of the detailed comparative lexicographic information in the bilingual part of our dictionaries. This information corresponds first of all to the observation of the parallel corpora, which provide attestations of translations of the lemmas (words or collocations) described by the dictionary, appearing in bilingual citations within the bilingual part of the entry. We analyse in particular:
- simplification (chiefly, in our corpus, the choice of hyperonyms to translate certain more specific terms), which gives rise in our dictionaries to an added piece of semantic information accompanying the proposed equivalent: the distinctive features of the lemma that are not rendered by the equivalent are indicated, with or without parentheses, after it (e.g. tavola translated as peinture (sur bois) and tavoletta as (petite) peinture (sur bois));
- levelling (failure to respect register, for example choosing technical terms rather than general-language words, and vice versa). All entries carry a usage label. Where a translation involves a change of register, the change is noted in the 'translation note' section, or appears in the section reserved for distinctive semantic indicators when several translations of the same lemma are possible with or without loss of register. This is the case, for example, of Italian tondo (unmarked) as against médaillon (unmarked) and the Italianism tondo (a technical term used mainly by art historians).
Baker also analyses explicitation, which is particularly frequent in the texts of interest to us because it is almost systematically employed when an Italianism is used, in particular for realia, which receive special treatment in our dictionaries (cf. Farina 2014, 2016).
It would be possible to search systematically for this type of data in our corpus by extracting all the occurrences of 'type de' or 'sorte de', or the elements given in parentheses (a minimal sketch of such an extraction follows below), but we have deliberately set this category aside: it is too strongly tied to the object described by our texts and to stylistic choices shared by the authors of 'natural language' texts and the translators within our database, and would therefore not allow us to illustrate, through a comparison of the two types of resources, linguistic constraints tied to translation operations [Footnote 2: The abundant use of Italianisms is a dominant feature of the tourist guides analysed, attributable to their authors' desire to give these texts a 'touch of Italianness' (Farina 2014: 61).]. We have also set aside 'normalization' or 'conservatism', which suits our material poorly, as it leaves little room for variation or exploration on the lexical and stylistic levels. Unlike Baker (1993: 243), who defines translation universals as 'features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems', we have adopted a rather prescriptive, or better didactic, perspective, taking into account the interference phenomena (influence of the source language on the target language) that are frequent in translation operations involving two closely related languages such as Italian and French, and in texts whose quality is far from homogeneous. Interference is in fact, in our view, at the root not only of many cases of simplification and of the levelling deviations found in our comparisons, but also of other manifestations assimilable to losses resulting from the translation operation, indeed to translation errors or inaccuracies. The TQA (Translation Quality Assessment) model and, in particular, the various types of quality metrics that can guide the translator towards improved fluency and accuracy can serve as our reference here (cf. 'Multidimensional Quality Metrics', Uszkoreit et al. 2013). These analyses lead us towards a position that may seem to run counter to a descriptive exploitation of corpora such as Baker's. In fact, it presents itself as an accompaniment allowing the user of our dictionaries to make choices, on the basis of a descriptive exploitation of the resources consulted, as already described, and of statistical data resulting from frequency analyses such as those presented below. The editor of the lexicographic entries may moreover decide, where appropriate, and where our analyses of these data lead us to identify errors or deviations that could be reduced, not to propose as an equivalent a form that appears in the database (while still recording it in the part of the entry providing statistics on the equivalents found), or to draft the optional 'translation note' section of our bilingual entries, advising users in their choices by explaining why certain forms may be preferable to others.
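The systematic extraction just mentioned is easy to prototype as a concordance search for explicitation markers. A minimal sketch over an invented snippet; the pattern and the sentence are illustrative only:

```python
import re

corpus = ("Le palazzo, sorte de palais urbain, domine la place. "
          "On y admire un tondo (type de médaillon sculpté).")

# Explicitation markers: 'type de', 'sorte de', and parenthetical glosses.
pattern = re.compile(r"\b(?:type|sorte)\s+de\s+[^.,;)]*|\([^)]*\)")

for match in pattern.finditer(corpus):
    print(match.group())
# -> 'sorte de palais urbain' and '(type de médaillon sculpté)'
```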
3. Natural language vs translated language: observing the corpus
Differences in the frequency of words or collocations found in comparable corpora containing French 'natural language' texts and texts translated into French can enable us to detect forms chosen under the influence of the source language.

3.1 Zero frequency in natural-language texts
We compared the word list of the LBC sub-corpus of popularizing texts written in French, containing 270,000 words, with a corpus (not yet integrated into the database) of texts of the same type but in translation, of 93,000 words, drawing up a list of the words present exclusively in the 'translated' sub-corpus.
- Errors. Most of the forms encountered are assimilable to errors: missing accents (cloitre), influence of Italian spelling on French (baroche), excessive orthographic 'Frenchification' (Caliari), use of a French translation where usage requires the Italian form (Sainte-Réparate designates in French the saint or the cathedral of Nice, but not the church of Santa Reparata in Florence; the French form is attested nowhere in the LBC database) or the reverse (Giove is never used in Italian in our corpus, where it is translated as Jupiter), and use of words that have nothing to do with the description of the Florentine heritage, probably because they correspond to a sense of the source word that applies to other contexts (coursive in a description of the Duomo of Florence, or panonceau to refer to the panels of the Gates of Paradise). Errors of this kind do not give rise to targeted information within the dictionaries, except when the error is very frequent (e.g. panonceau, present in several sources with a total of 8 occurrences, but not coursive, which has only one attestation).
- Levelling. One can distinguish forms corresponding to a 'pragmatic' or stylistic difference between French and Italian, which do not interest us from a lexicographic point of view, such as the use of mentionnons in several translated texts, found in none of the texts of the full database, or certain passé simple forms (décora, succéda) that are not used in popularizing texts in 'natural' French. These are forms reflecting different norms for the text types in the corpus: a more thorough analysis would probably reveal little or no attested use of 'nous' in tourist guides, infrequent use of the passé simple compared with the passé composé or the present, and so on. What interests us much more in this comparison is detecting forms which, while perfectly 'correct' in French, can be considered out of context with respect to attested usage in the same type of context in natural language. The difference in the use of an unattested word can produce the effect of an 'anachronism' (a difference in synchronic frequency of use). This is the case, for example, of the adjective grand-ducal and the past participle paraphé, whose Italian equivalents are more frequent in today's language than their literal French translations are. The register deviation can also apply in cases of a difference in 'technicality'.
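Computationally, the zero-frequency check of §3.1 is a vocabulary set difference. A minimal sketch, where two tiny strings stand in for the real 270,000- and 93,000-word sub-corpora:

```python
import re

def vocabulary(text: str) -> set[str]:
    return set(re.findall(r"[a-zàâçéèêëîïôûùüÿœ-]+", text.lower()))

natural = "le cloître du couvent est mentionné dans le guide"
translated = "le cloitre du couvent ; nous mentionnons le panonceau"

only_in_translation = vocabulary(translated) - vocabulary(natural)
print(sorted(only_in_translation))
# -> ['cloitre', 'mentionnons', 'nous', 'panonceau']
```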
The adjective autographe, present in several popularizing translated sources, is absent from the texts of the same type in our natural-language corpus, though a few occurrences appear in more specialized texts of the general corpus. The register difference will give rise to differentiated labelling of the source-language lemma and its attested translation.

3.2 Frequency differences between source texts and target texts
To illustrate simplification phenomena, we queried two sub-corpora of our LBC database consisting of 51 lives from G. Vasari's Le vite de' più eccellenti pittori, scultori e architettori (1568) and of their French translations (the Leclanché-Weiss translation, 1900). Since we cannot yet rely on statistics from the parallel translation databases (for a description of these databases cf. Zotti 2017), we concentrated on high-frequency French words, comparing their frequency with that of the closest Italian word (same sense, same distinctive features). This enabled us to detect frequency gaps that will prompt a more thorough study aimed at defining analogical networks in the two languages, giving us the possibility of proposing translation links that avoid a loss of precision. Tableau, for example, has a frequency of 2,232 per million words in our French sub-corpus, whereas quadro has a frequency of 793 per million words in the Italian sub-corpus containing the same texts in the original language. A large number of hyponyms of quadro are in fact translated by tableau in French. Even if this loss is probably compensated by the addition of distinctive features accompanying the word, we maintain that the translator could only gain in precision if we proposed other forms to render the sense of these various hyponyms.
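The per-million figures quoted for tableau and quadro come from the usual normalization of raw counts by sub-corpus size. A worked sketch; the raw counts and corpus sizes below are invented to reproduce the published rates, since the actual counts are not given:

```python
def per_million(raw_count: int, corpus_words: int) -> float:
    return raw_count / corpus_words * 1_000_000

# Illustrative values only: with ~0.9M-word sub-corpora, these raw counts
# would yield the rates reported in the text.
print(round(per_million(2009, 900_000)))  # ~2232 for 'tableau' (French)
print(round(per_million(714, 900_000)))   # ~793 for 'quadro' (Italian)
```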
4. Conclusion
The comparison of results concerning the frequency of forms within the LBC corpus has allowed us to illustrate the use of different sub-corpora to orient the information, both descriptive and normative, that we wish to provide in the bilingual part of our LBC dictionaries. 'We will consider, even if it remains to be demonstrated [...], that an over- or under-representation of a given linguistic phenomenon may correspond to a violation of the usage constraint [...] and that a good translation should tend towards a homogenization between the original language and the translated language.' (Loock et al. 2013: n.p.) The application of methods aimed at verifying translation quality, and the creation of tools based on critical analyses of existing translations, comparing them in particular with productions that do not pass through the mediation of another language, should allow an optimization of the naturalness and precision of translated texts, an essential objective for the dissemination of quality information.

Bibliography
Baker, M. (1993). Corpus Linguistics and Translation Studies. Implications and Applications. In Baker, M. et al. (eds), Text and Technology. Amsterdam/Philadelphia, Benjamins, pp. 233-250.
Billero, R. and Nicolás Martínez, M. C. (2017). Nuove risorse per la ricerca del lessico del patrimonio culturale: corpora multilingue LBC. CHIMERA Romance Corpora and Linguistic Studies, Vol. 4, No. 2, pp. 203-216. ISSN 2386-2629.
Farina, A. (2014). Descrivere e tradurre il patrimonio gastronomico italiano: le proposte del Lessico plurilingue dei Beni Culturali. In Chessa, F. and De Giovanni, C. (eds), La terminologia dell'agroalimentare. Milan, Franco Angeli, pp. 55-66.
Farina, A. (2016). Le portail lexicographique du Lessico plurilingue dei Beni Culturali, outil pour le professionnel, instrument de divulgation du savoir patrimonial et atelier didactique. Publif@rum, vol. 24. http://publifarum.farum.it/ezine_articles.php?id=335
Frawley, W. (1984). Prolegomenon to a theory of translation. In Frawley, W. (ed.), Translation: Literary, Linguistic and Philosophical Perspectives. Newark, University of Delaware Press, pp. 159-175.
Johansson, S. and Hofland, K. (1994). Towards an English-Norwegian parallel corpus. In Fries, U. et al. (eds), Creating and Using English Language Corpora. Amsterdam, Rodopi, pp. 25-37.
Loock, R. (2016). La Traductologie de corpus. Villeneuve-d'Ascq, Presses Universitaires du Septentrion.
Loock, R., Mariaule, M. and Oster, C. (2013). Traductologie de corpus et qualité : étude de cas. Tralogy, Session 5 - Assessing Quality in MT / Mesure de la qualité en TA. http://lodel.irevues.inist.fr/tralogy/index.php?id=188
Uszkoreit, H., Burchardt, A. and Lommel, A. (2013). A New Model of Translation Quality Assessment. Tralogy, Session 5 - Assessing Quality in MT / Mesure de la qualité en TA. http://lodel.irevues.inist.fr/tralogy/index.php?id=319
Zotti, V. (2017). L'integrazione di corpora paralleli di traduzione alla descrizione lessicografica della lingua dell'arte: l'esempio delle traduzioni francesi delle Vite di Vasari. In Zotti, V. and Pano, A. (eds), Informatica Umanistica. Risorse e strumenti per lo studio del lessico dei beni culturali. Firenze University Press.

Il rapporto tra famiglie di anziani non autosufficienti e servizi territoriali: un'analisi dei dati esploratoria con l'Analisi Emozionale del Testo (AET) [The relationship between families of non-self-sufficient elderly people and community services: an exploratory data analysis with Emotional Text Analysis (ETA)]
Felice Bisogni (1), Stefano Pirrotta (2)
(1) Associazione GAP - SPS Scuola di Psicoterapia Psicoanalitica - felice.bisogni@gmail.com
(2) Associazione GAP - SPS Scuola di Psicoterapia Psicoanalitica - stefanopirrotta@gmail.com

Abstract
In this paper the authors present a study commissioned by a local authority to explore the relationship between non-self-sufficient elderly people, their family members, and the community-based assistance services they use. The exploratory data analysis, conducted with Emotional Text Analysis (ETA) (Carli and Paniccia, 2002), was used to identify emotional and cultural factors related to the experience of assisting and being assisted at home and within community-based services. The ETA was carried out on a text corpus assembled by transcribing 45 audio-recorded interviews with non-self-sufficient elderly people and their family members, patients of general practitioners and/or users of community-based services (home-based and semi-residential). The interviews were processed with the T-Lab statistical software (Lancia, 2004), and ETA was applied to produce a cluster analysis. Four clusters of dense words emerged, related to one another on three factorial axes. From the factorial axes emerges an emotional representation of old age as a continuous alert tied to the risk of dying, and as a depressive prescription to survive linked to the expectation of being assisted within one's own family by virtue of 'blood ties'. Reciprocal control and contentiousness, and the desire to transgress the obligation of caregiving and of being cared for, are some of the relevant emotions emerging from the ETA.
The research results also show an emerging demand for a new assistance model, founded on the possibility of talking, playing and having fun with others. Finally, there emerges a demand for services that not only deal with medical problems but also provide psychological support and training to families, to develop relational competences and to build reliable relationships outside the family. In the conclusions of the paper, some considerations are highlighted regarding the relationships between the clusters on the factorial axes and between the clusters and the illustrative variables.

Abstract (Italian)
In this article the authors present a study, conducted with the Emotional Text Analysis (ETA) methodology (Carli and Paniccia, 2002) and commissioned by a local authority, aimed at exploring the emotional factors that organize the experience of the relationship between a group of non-self-sufficient elderly people and their family members and some local social and health services. The ETA was carried out on a text corpus assembled by transcribing 45 audio-recorded interviews with non-self-sufficient elderly people and their family members who use general-practice services and/or local social services (home-based or semi-residential). The interviews were processed with the T-Lab statistical software (Lancia, 2004) and ETA was applied to produce a cluster analysis. The analysis yielded 4 clusters of 'dense words' (Carli and Paniccia, 2002), related to one another on 3 factorial axes, which represent the shared emotional way in which the interviewees speak of their expectations of the services. The interpretation of the data revealed a relationship between family and elderly person in crisis over sharing desire and pleasure in being together. An emotional representation of old age emerges as a continuous alert in the face of the risk of dying and as a depressive prescription to survive, linked to the expectation of being assisted within one's own family by virtue of 'blood ties'. Against this stands the desire to transgress the family obligation to assist and to be assisted. The research results reveal a demand for new assistance models founded on the possibility of talking, playing and having fun, and a demand for services addressed not exclusively to medical problems but also to offering families psychological support and training, so as to develop relational competences and reliable relationships outside the family. The conclusions highlight some considerations concerning the relationship between the clusters on the factorial axes and between the clusters and the illustrative variables.

Keywords: Emotional Text Analysis (ETA), assistance, elders, family, community-based services.

1. Introduction
There are about 2.5 million non-self-sufficient elderly people in Italy. According to the most recent ISTAT projections (2017), the percentage of individuals aged 65 and over will grow by more than 10 percentage points by 2050, coming to make up 34% of the Italian population. The presence of a non-self-sufficient elderly person in the family will increasingly become a common experience for Italian families. Several studies have shown that organizing family assistance for non-self-sufficient elderly people entails significant emotional problems (Haley, 2003). A recent study
Un recente studio JADT’ 18 119 ha analizzato il testo di 26 interviste a familiari di anziani non autosufficienti con esperienza di assistenza da parte di un badante (Paniccia, Giovagnoli, Caputo, 2015). Dall’analisi del testo, condotta tramite la metodologia AET (Carli, Paniccia, 2002), è emerso come i sistemi di relazione familiari entrino in crisi contestualmente all’inattività e alla malattia dell’anziano. L'autrice afferma che la domanda delle famiglie ai servizi sia quella di non essere emarginate con il loro problemi entro il solo contesto familiare, per altro in cambiamento. “Sul piano della ricerca - afferma Paniccia - va sviluppata la differenza, proposta anche dagli intervistati, tra esplorazione dei vissuti degli anziani assistiti da un lato, degli altri membri della famiglia dall’altro”. In quest’ottica, la ricerca-intervento proposta risponde a questo invito, esplorando il vissuto e le attese di un gruppo di anziani non autosufficienti e loro familiari nei confronti di alcuni servizi territoriali. 2. Il progetto di ricerca-intervento psicosociale Il progetto di ricerca-intervento è stato realizzato dagli autori per conto dell'Associazione GAP, un’organizzazione che si occupa di ricerca e intervento psicosociale nell'ambito della disabilità. Il committente è stato un ente locale interessato a coinvolgere anziani non autosufficienti e loro familiari nella costruzione di nuovi modelli di assistenza coerenti con la domanda delle famiglie stesse. L'ente locale intendeva sviluppare un'offerta di servizi d'assistenza innovativi a fronte di cambiamenti sociali e culturali che stanno profondamente modificando l’organizzazione tradizionale della famiglia. Famiglia in passato maggiormente attrezzata al proprio interno per provvedere all'assistenza degli anziani. In tale contesto la ricerca intervento psicosociale è stato proposta come strumento di esplorazione del rapporto tra servizi d'assistenza rivolti agli anziani presenti nel territorio di competenza dell'ente committente e famiglie che a tali servizi si rivolgono. In tale contesto GAP a un gruppo di familiari e anziani non autosufficienti. Tutte le interviste sono state audio-registrate e trascritte in modo da ottenere il testo su cui è stata poi applicata l'Analisi Emozionale del Testo. In questa sede presentiamo i risultati dell'Analisi Emozionale del Testo applicata al testo prodotto trascrivendo 45 interviste a familiari e anziani non autosufficienti. 2.1. La raccolta dei dati Le interviste sono state realizzate a 45 familiari e anziani non autosufficienti in carico ai servizi di medicina generale o ai servizi di centro diurno per anziani fragili partner del progetto. Di questi circa il 60 % usufruivano di servizi di medicina generale insieme al servizio di centro diurno per anziani fragili. Il restante 40% utilizzava esclusivamente i servizi di medicina generale. Sono state realizzate 25 interviste ad anziani e 20 interviste a loro 120 JADT’ 18 familiari. Le interviste sono state trattate in un unico corpus e per questo in analisi è stata inserita la variabile illustrativa “ruolo dell’intervistato”, differenziando le interviste ad anziani da quelle a familiari. L'età media degli anziani intervistati è di 79 anni, mentre l'età media dei famigliari è di 60 anni. Gli intervistati sono stati scelti in ordine al criterio di coinvolgere nella ricerca chi ponesse ai servizi partner problemi complessi che i servizi stessi sentivano di avere difficoltà a prendere in carico. 
This was done on the hypothesis that the interviewees could later take part in a psychosocial intervention based on feeding back the research results and discussing them critically, so as to contribute to designing assistance models more in line with the problems experienced. Interviewees were offered an open, unstructured interview with a single prompt question followed by an invitation to say whatever came to mind. The prompt was the following: "Within a research-intervention project, we are interested in exploring the relationship between assistance services, the elderly and the families who turn to those services. In particular, we are interested in the point of view of family members and of the elderly. We add that we are also interviewing a group of general practitioners and of social-health service workers. We are interested in your experience; we would like to listen to you and gather what you have to say." The interviewers introduced themselves as professional psychologists, members of an association interested in building services for ageing and non-self-sufficiency. Interviewees were told that the research results would be shared with all those concerned in order to understand what initiatives to develop.

3. Methodology
Emotional Text Analysis (Carli, Paniccia, 2002) is a tool of psychosocial research-intervention, developed to explore the ways in which social groups emotionally and collectively symbolize a context or a theme, and how these symbolizations organize the group's behaviour. This methodology, founded on the principle of knowing in order to intervene, involves a process of exploration, analysis and critical discussion of the "local culture" shared within a given context, in relation to the theme under investigation. The use of ETA entails de-structuring the narrative process and the connections that constitute the intentional sense of the discourse in the text under analysis. The methodological approach rests on identifying groups of interrelated words that convey emotional meanings more than others: so-called "dense words". Operationally, we carried out the statistical and computational processing with the T-lab software (Lancia, 2004), choosing the unsupervised Thematic Analysis of Elementary Contexts strategy. The interviews were assembled into a single corpus of 14,053 tokens and 4,121 types, with 230 hapaxes. As for lexical richness, the TTR (Type/Token Ratio) is 0.293. We grouped the occurrences of "dense words" into lexemes; 856 of these were identified in the corpus and entered into the analysis. The number of classified elementary contexts is 1,423 (99.58% of the total of 1,429).
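As a minimal illustration, lexical-richness figures of this kind can be derived from a tokenized corpus along the following lines (a Python sketch; the tokenization itself is assumed to have been done upstream, e.g. by T-lab):

```python
from collections import Counter

def lexical_stats(tokens):
    """Basic lexical-richness measures for a tokenized corpus."""
    counts = Counter(tokens)
    n_tokens = len(tokens)                  # occurrences (running words)
    n_types = len(counts)                   # distinct forms
    n_hapax = sum(1 for c in counts.values() if c == 1)
    ttr = n_types / n_tokens                # Type/Token Ratio
    return n_tokens, n_types, n_hapax, ttr

# For the corpus above: 14,053 tokens and 4,121 types give TTR = 4121/14053 ≈ 0.293.
```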
The data processing carried out by the software involves the following steps: a) construction of a context units × lexical units data table (up to 150,000 rows × 3,000 columns), with presence/absence values; b) TF-IDF normalization and scaling of the row vectors to unit length (Euclidean norm); c) clustering of the context units (measure: cosine coefficient; method: bisecting K-means); d) pruning of the obtained partitions and, for each of them: e) construction of a lexical units × clusters contingency table; f) chi-square test applied to all the intersections of the contingency table; g) correspondence analysis of the lexical units × clusters contingency table. The statistical analysis identified several clusters corresponding to groupings of co-occurring words. The clusters are those with significant recurrence within the text, and they represent the most transversal dimensions characterizing the local culture under exploration.
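A minimal sketch of steps a) to f), using scikit-learn and SciPy as stand-ins for T-lab (the `segments` list, the cluster count and all names are illustrative placeholders, not the actual implementation):

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.cluster import BisectingKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

segments = ["...", "..."]   # elementary contexts (one string each); placeholder
K = 4                       # number of clusters retained by the analysis

# a)-b): presence/absence data table, TF-IDF weighting, unit-length rows.
vectorizer = TfidfVectorizer(binary=True, norm="l2")
X = vectorizer.fit_transform(segments)

# c): bisecting K-means; on unit-length rows, Euclidean K-means orders the
# segments in the same way as the cosine coefficient would.
labels = BisectingKMeans(n_clusters=K, random_state=0).fit_predict(X)

# e): contingency table of lexical units x clusters (presence counts).
presence = (X > 0).toarray().astype(int)
table = np.stack([presence[labels == k].sum(axis=0) for k in range(K)], axis=1)

# f): chi-square of each lexical unit's presence/absence against the partition
# (sparsely attested words may need pooling to keep expected counts positive).
sizes = np.bincount(labels, minlength=K)
for j, word in enumerate(vectorizer.get_feature_names_out()):
    observed = np.stack([table[j], sizes - table[j]])   # 2 x K table
    chi2 = chi2_contingency(observed)[0]
```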
L'assistenza viene inscritta entro il vincolo obbligante dell’essere una famiglia (etimologicamente da famulo, colui che serve, che si prende cura): emerge l’attesa che il ruolo famigliare implichi il dovere di occuparsi di chi non riesce a vivere da solo, preoccupandosi di garantire la sopravvivenza e occupandosi di bisogni inderogabili come mangiare e dormire. Emerge una rappresentazione infantilizzante dell’anziano che sollecita l'instaurarsi di rapporti di dipendenza e accudimento. In tale contesto la quotidianità, deprivata di desideri ed obbiettivi, sembra scorrere in modo depressivo in attesa di morire, con il rischio di una chiusura depressiva all'interno della famiglia. L'anzianità sembra identificata con la figura del vecchio morente che non ha più nulla da dare o da chiedere alla vita. L'unico riferimento alla vitalità entro il cluster è quello connesso a parole come nipoti e telefonare: laddove si allenta l'obbligo dell’assistenza sembra farsi spazio la possibilità di un rapporto piacevole e gratificante. JADT’ 18 123 4.2. Cluster 2: ricerca di servizi e domanda alla psicologia In questo cluster è rappresentato il processo di ricerca di servizi di assistenza. Si cercano centri, contesti estranei alla famiglia, che aiutino ad occuparsi dei problemi della persona non autosufficiente. Da un lato si guarda alla sua soggettività, dall'altro si rappresenta una ricerca affannosa di servizi fondata sull'angoscia di trovare soluzioni. La non autosufficienza è rappresentata come malattia. Ciò comporta un vissuto di urgenza e pericolo e la fantasia di dover contrastare qualcosa che mette a rischio la sopravvivenza. Su questo si chiama in causa il dottore, in ipotesi il medico di base, cui viene attribuita una competenza utile. Allo stesso tempo è chiamata in causa la psicologia cui viene richiesto un intervento di supporto. Si evoca in tal modo una prospettiva di intervento alternativa alla cura. Si chiede di essere aiutati a prepararsi e di essere accompagnati, di parlare con qualcuno poiché ci si sente impreparati, confusi.. A questo proposito i famigliari sembrano portatori di una domanda di ascolto e consulenza fondata sul parlare. Agli enti locali e del privato sociale gli intervistati si propongono come clienti, viene domandata l'articolazione di un'offerta di servizi, valorizzando dispositivi d'intervento di gruppo. 4.3. Cluster 1: funzione di controllo delegata alla badante e paura del cambiamento Il cluster è presente in percentuale statisticamente maggiore entro il testo delle interviste ai familiari (39%). Gli intervistati parlano del problema che vivono, situato nella casa, un contesto chiuso che offre riparo e che al contempo costringe. Da un lato si cercano vie di uscita e d'altro lato c'è difficoltà a lasciare, ad allontanarsi da rapporti protettivi e vincolanti. Viene rappresentato un contrasto tra queste emozioni e il vivere: emerge un sentimento di vita contrastata, per dirla con Canguilhem (1998). In tale contesto si è presi dalla fantasia di abbandonare: emerge l'emozionalità della colpa. Ciò avviene entro un contesto in cui la non autosufficienza viene trattata quale bisogno esclusivamente fattuale e pressante, 24 ore su 24. L'invecchiamento è rappresentato come evento che non lascia tregua, che tormenta e angoscia. In tale contesto si chiede l’intervento della badante per ripristinare il controllo, fare ordine. La badante è rappresentata come una necessità motivata dal bisogno. 
L'assistenza all’anziano è qualcosa a cui ci si sente costretti o da cui liberarsi, tertium non datur. Ma in questo cluster vediamo come vivendo l'invecchiamento come bisogno continuo e prescrivendo l'assistenza si generi colpa. Colpa connessa all’impotenza per il non riuscire a rapportarsi ai cambiamenti con cui la non autosufficienza confronta. 124 JADT’ 18 4.4. Cluster 4: domanda di costruzione di contesti dove parlare, giocare, apprendere. In questo cluster gli intervistati esprimono una domanda di contesti e rapporti fondati sull'apprendimento, il gioco e sulla parola. Emergono desideri e si riconoscono risorse che evocano la possibilità di trovare motivi per cui valga la pena vivere. Emerge una rappresentazione della vecchiaia caratterizzata da vitalità e desiderio di trasgredire. Si allenta la prescrittività dell'obbligo della sopravvivenza: la vecchiaia è anche creatività, possibilità di smarcarsi dagli obblighi rituali della vita sociale. Il riconoscimento del limite del tempo, l'avvicinarsi della fine, motiva la ricerca di esperienze piacevoli che diano senso alla vita. Si evoca il divertimento come obbiettivo alternativo al controllo e alla sorveglianza senza obbiettivi. Sottolineiamo come la domanda divertimento implichi il riconoscimento di una verità non scontata: che si è ancora vivi fino a cinque minuti prima di morire. 5. Conclusioni Per concludere proponiamo alcune considerazioni sul rapporto tra i cluster sui tre assi fattoriali. Ricordiamo che il cluster 3 e il cluster 2 sono in rapporto su polarità opposte del primo fattore, il cluster 1 e il cluster 4 sono in rapporto su polarità opposte del secondo fattore, mentre il cluster 1 e il cluster 3 sono in rapporto sul terzo fattore. Sul primo fattore emerge come la dimensione motivazionale che sostiene la domanda di servizi da parte della famiglia sia il desiderio di uscire dall’obbligo familiare. È il vissuto di obbligo e l’incapacità di condividere entro i rapporti desiderio ed interessi che spinge la famiglia in un'affannosa ricerca di interlocutori e professionisti esterni. Sul secondo fattore emergono diverse modalità di rapportarsi al problema della non autosufficienza. Su di un polo del fattore (cluster 1) la fattualizzazione dell'invecchiamento come bisogno continuo di assistenza che mette in pericolo la sopravvivenza mostra come i problemi associabili alla non autosufficienza non siano esplorati. Tali problemi sembrano piuttosto presunti dal familiare in modo autoreferenziale. L'emozionalità della colpa e la fantasia irrealizzabile di ristabilire il controllo su una situazione in cambiamento vissuta come persecutoria sono corollari di tale autorefenzialità sottesa dall'incompetenza a utilizzare i rapporti familiari come contesto di confronto e scambio sui problemi e sul da farsi. D'altro lato, sull'altro polo del secondo fattore il riconoscimento di limiti, quali ad esempio il tempo limitato della vita e l'ineluttabilità della fine, sembra fare spazio al riconoscimento del desiderio degli anziani di divertirsi anche concedendosi qualche trasgressione, come alternativa a sopravvivere in modo controllante e mortifero. Infine il terzo fattore suggerisce una relazione tra la dinamica di autorefenzialità dei rapporti familiari e la domanda di servizi emergente JADT’ 18 125 entro la cultura in analisi, a cui si chiede non soltanto di curare ma anche di aiutare la famiglia a sviluppare competenze e confrontarsi sui propri problemi. 
The research results suggest a demand for accompanying services that support the family, understood as the context of relationships between the non-self-sufficient person and his or her relatives, in recognizing desires and goals around which to organize care and cohabitation in the most pleasant, vital and enjoyable way possible.

References
Carli R., Paniccia R.M. (2002). L'analisi emozionale del testo. Franco Angeli, Roma.
Haley W.E. (2003). Family caregivers of elderly patients with cancer: understanding and minimizing the burden of care. The Journal of Supportive Oncology, 1(4 Suppl 2): 25-29.
ISTAT (2017). Demografia in cifre. Istituto Nazionale di Statistica, Roma - www.demo.istat.it.
Lancia F. (2004). Strumenti per l'analisi dei testi. Franco Angeli, Roma.
Paniccia R.M., Giovagnoli F., Caputo A. (2015). In-home elder care. The case of Italy: the badante. Rivista di Psicologia Clinica, (2): 60-83.

An experience of textual analysis of clinical documentation and of health information flows, useful for epidemiological research and for investigating the quality of care
Antonella Bitetto1, Luigi Bollani2
1 Azienda Socio Sanitaria Territoriale di Monza - a.bitetto@asst-monza.it
2 Università di Torino - luigi.bollani@unito.it

Abstract
This study is motivated by the now wide availability of clinical documentation stored in electronic form, used to track the patient's health status along the care path or to send information on the activities carried out to other institutions for administrative purposes. The diffusion of these methods makes available many biomedical collections of electronic data, easily accessible at low cost, that can be used for research purposes in observational epidemiological studies, in analogy with what was historically done in studies based on the review of medical records. However, since these collections are not organized according to specific survey schemes, they sometimes do not allow the index events to be discriminated with the necessary reliability between one source and another. It has long been believed that critical re-reading of the texts can partially remedy these informative shortcomings, with the aim of mapping, where possible, the words or segments contained in the texts onto statistically analysable categories. The recent transfer of these collections from paper to electronic form opens the possibility of carrying out this process automatically, reducing its time and costs and perhaps increasing its reliability. We propose to address the problem by presenting study criteria and an example of analysis based on an empirical experience, consistent with the needs of a biomedical context.

Keywords: textual analysis; electronic health data; medical thesaurus; analysis of lexical correspondences; emergency psychiatry
1. Introduction
The progressive dematerialization of clinical documentation (outpatient specialist assessments, Emergency Department reports, diagnostic test reports) and the implementation of health data flows for legal-administrative purposes (payment of services provided, updating of registries, INPS records, etc.) have made available information that can also be used for purposes other than those for which the data were collected. Computerized health data (EHR, electronic health records) are generally divided into: a) structured (e.g. recorded using controlled clinical terminologies such as the International Classification of Diseases, 10th revision (ICD10), or the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT)); b) semi-structured (e.g. laboratory tests and prescription information), following schemes that vary with locally adopted conventions; c) unstructured (e.g. clinical free text); and d) binary (e.g. image files such as X-rays and CT scans). The systematic character of these data collections, mostly organized by individual entities, makes them particularly valuable for several purposes of epidemiological research based on observational designs, both in the field of quality of care and in more classical epidemiology, which studies disease risks and outcomes (Mitchell et al., 1994). On the other hand, since these collections are organized for purposes other than quality monitoring or scientific research, they often need to be "treated" before they can be analysed with statistical methods. In the past this was done by having subject-matter experts re-read the medical records. Today there is a growing reliance on automatic text-analysis methods, which guarantee better standardization and review (Denaxas et al., 2017).
As an example, automatic text analysis of information flows and of electronic clinical documentation has made it possible to investigate fundamental therapeutic and safety areas such as the quality of nursing care and the occurrence of adverse events, including, among many others, domestic accidents, allergic reactions and drug side effects (Ehrenberg and Ehnfors, 1999; Coloma et al., 2011; Migliardi et al., 2004). Numerous classical epidemiological studies have also been produced, mostly concerning highly prevalent chronic diseases such as cardiovascular disease, diabetes or asthma, both abroad and in Italy (Gini et al., 2016; Vaona et al., 2017), in some cases highlighting unexpressed care needs or complications due to delays or inappropriate treatment (Persell et al., 2009; Ho et al., 2012). Some research has focused on mental disorders, the medical area chosen for the text-analysis experience presented below. In this field, electronic clinical documentation makes it possible to obtain low-cost information on broad population segments, including case series that would otherwise be difficult to recruit: this is the case of subjects in a prodromal phase at high risk of developing psychosis (Fusar-Poli et al., 2017) or of self-harming subjects (Zanus et al., 2017).

2. Methods
The classification of not-yet-studied corpora into statistically analysable categories is a controversial topic, but also a challenge that, in our view, justifies in-depth investigation of the methodological procedures to adopt. Below we propose a method for treating unstructured psychiatric medical texts, following criteria already partly used in previous work (Bitetto et al., 2017).

2.1. Corpus
The information comes from the reports of psychiatric consultations carried out at the Emergency Department of a large Lombard university hospital (1,250 accredited beds). The corpus is monolingual, in Italian, and is composed of short texts written by the psychiatrist on duty at the end of the emergency consultation. The reports are verified and stored by the hospital information service, certified ISO 9001/2015, which provided the corpus in anonymized form. We analysed 1,721 reports covering the period 01/01/2012 - 31/12/2012.

2.2. Linguistic filtering pre-treatment
The corpus underwent a linguistic filtering pre-treatment. From the 177,349 words in the original reports we removed punctuation, numbers, pronouns, articles, prepositions, proper nouns (including drug names) and words occurring fewer than 10 times. The result was a list of 1,679 distinct words, which was manually reviewed by an expert to select the terms able to describe mental-health problems/needs according to the structural model used by the HoNOS scale (Wing et al., 1998; Lora et al., 2001). This is a model for assessing mental-health status organized around problems rather than diagnoses, which are rarely reported in Emergency Department notes.
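The filtering pre-treatment just described can be sketched as follows, assuming a spaCy Italian pipeline as a stand-in for the tools actually used (model name, token handling and thresholds are illustrative):

```python
from collections import Counter
import spacy

nlp = spacy.load("it_core_news_sm")   # any Italian pipeline with POS tagging

# POS categories to drop: punctuation, numbers, pronouns, articles/determiners,
# prepositions and proper nouns (the latter also cover drug brand names).
DROP = {"PUNCT", "NUM", "PRON", "DET", "ADP", "PROPN"}

reports = ["...", "..."]              # the ED consultation reports (placeholder)
tokens = [t.text.lower() for doc in nlp.pipe(reports) for t in doc
          if t.pos_ not in DROP]

freq = Counter(tokens)
candidates = sorted(w for w, c in freq.items() if c >= 10)  # min frequency 10
# 'candidates' is the word list then reviewed manually against the HoNOS areas.
```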
The model distinguishes 12 "problems", summarized by the following items: H1 - overactive, aggressive behaviour; H2 - deliberate self-harm; H3 - problems linked to alcohol or drug use; H4 - cognitive problems; H5 - physical illness problems; H6 - problems linked to hallucinations and delusions; H7 - problems linked to depressed mood; H8 - other problems from other psychic symptoms; H9 - problems with significant relationships; H10 - problems with activities of daily living; H11 - problems with living conditions; H12 - problems with occupational and recreational activities. In this way a thesaurus was created, composed of 214 short phrases and 81 single words traceable to 11 clinical categories (H10 was excluded, given the lack of phrases traceable to it). The thesaurus also included words and acronyms identifying visits connected with "refusal of care". The text-filtering procedure based on the thesaurus (taking care not to include contexts in which the key word is negated) made it possible to reclassify 1,629 reports, which form the basis of the analysis.

2.3. Statistical analysis
Each report was examined for the presence/absence of every key word or phrase, so as to introduce, for each word, a binary coding over the whole set of texts. This coding was then extended to the items of the HoNOS classification by assessing, in each report, the presence of each item, determined by the presence of at least one key word associated with it (absence of the item being determined, conversely, by the absence of any associated key word). To represent the association between the items across the reports, a correspondence analysis (Benzécri, 1973) was then performed on the texts × HoNOS items table (to which the concept of refusal/interruption of care was added); moreover, in order to appreciate the relationships between words and the behaviours/problems expressed by the classification, the key words and phrases were added as supplementary points.
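To make the thesaurus-based binary coding concrete, here is a minimal sketch; the thesaurus entries and the negation cues are invented illustrations, not the 214-phrase thesaurus actually built:

```python
import re

# Invented mini-thesaurus: HoNOS item -> key words/phrases. The real thesaurus
# maps 214 short phrases and 81 single words onto 11 clinical categories.
THESAURUS = {
    "H1": ["aggressivo", "agitazione psicomotoria"],
    "H2": ["autolesionismo", "ideazione suicidaria"],
    "H7": ["umore depresso", "flessione del tono dell'umore"],
}
NEGATION_CUES = ("non ", "nega ", "assenza di ")   # illustrative only

def item_present(text, keys, window=25):
    """1 if any key occurs outside an (approximately detected) negated context."""
    low = text.lower()
    for key in keys:
        for m in re.finditer(re.escape(key), low):
            left = low[max(0, m.start() - window):m.start()]
            if not any(neg in left for neg in NEGATION_CUES):
                return 1
    return 0

def encode(report):
    """Binary presence/absence HoNOS vector for one report."""
    return {item: item_present(report, keys) for item, keys in THESAURUS.items()}
```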
3. Results
Table 1 shows the frequency distribution of the problem areas described and reclassified according to the HoNOS criteria.

Table 1 - HoNOS items and percentage of reports in which the behaviour/problem occurs

Item        H1     H2     H3     H4    H5     H6     H7     H8     H9    H11   H12   Refusal of care
% present   30.82  15.22  12.22  7.18  20.32  18.35  32.72  59.55  5.10  1.23  7.31  18.97

As expected, the reports describe above all the clinical manifestations of distress, with detailed descriptions of the observed symptoms, rather than environmental factors (H9, H11, H12). Among the symptoms, the most frequent are depressed mood (H7) and the class gathering all unspecified clinical manifestations, "other psychic symptoms" (H8). Descriptions of organic problems (physical symptoms, H5) are also very frequent, as expected, since psychiatric emergencies are managed in the general Emergency Department, where requests for opinions on visits linked to physical problems are more common than in a second-level outpatient clinic. The occurrence of violent and overactive behaviour (H1), one of the most typical emergencies in psychiatry, is also very high.

Figure 1 - Left: joint representation of the first 8 HoNOS items (psychic and physical symptoms); right: behavioural symptoms (H1, H2, H3), psychic symptoms (H6, H7, H8) and precipitating environmental factors (H9, H11, H12)

The left-hand plot of Figure 1 shows the results of the correspondence analysis of the symptom categories, the problem area most frequently found in the texts. The first factorial plane, shown in the plot, explains 34.17% of the total variance. Along dimension 1 (the horizontal axis), the symptom categories split into two groups. On the right we find problems linked to depressed mood (H7), close to other symptoms (H8), of which, as already noted, anxiety represents the largest area, and to physical symptoms (H5), confirming the probable psychosomatic origin of part of them. In the same grouping lie deliberately self-harming and suicidal behaviours (H2), which according to the literature are often associated with depression. At high values on the first axis, by contrast, are grouped psychotic symptoms (H6) and agitated behaviour (H1), in relation with refusal of care, with which they are indeed often associated. Problems linked to alcohol and drug abuse (H3) and those due to cognitive problems of neurological origin (H4) turn out instead to be independent of the other symptom categories, occupying the extremes of dimension 2 (the vertical axis). The same analysis is shown in Figure 2, which also projects the relevant words of the thesaurus.

Figure 2 - Joint representation of the first 8 HoNOS items, with supplementary representation of the key words/phrases used to identify the items

Returning to the right-hand plot of Figure 1, a second correspondence analysis was conducted on the psychic and behavioural symptom categories together with the precipitating environmental factors. Here the first factorial plane explains 30.33% of the total variance. The distribution of psychic symptoms along the horizontal axis confirms, as expected, the results of the analysis of the first subset of categories. Here one can note the tendency of alcohol- and drug-abuse problems (H3) to move towards the centre of the plot, near the other-symptoms category (H8), to which some of their manifestations may be related. As for the environmental factors, the data show a relationship between work problems (H12), symptoms of the depressive spectrum (H7) and deliberately self-harming conduct (H2). The Emergency Department may represent a first point of access for users with reactive conditions, even severe ones, secondary to occupational stress factors (burnout, reactive depression). The other categories of environmental problems (H9 and H11) lie at the extremes of dimension 2, showing a degree of independence from the occurrence of behavioural and psychic symptoms.
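The correspondence analyses just described can be reproduced in essence with a few lines of linear algebra. The following NumPy sketch (not the software actually used) computes principal coordinates, per-axis inertia shares and the projection of supplementary points such as the key words:

```python
import numpy as np

def correspondence_analysis(N):
    """Simple correspondence analysis of a contingency table N (rows x cols)."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U * s) / np.sqrt(r)[:, None]     # row principal coordinates
    G = (Vt.T * s) / np.sqrt(c)[:, None]  # column principal coordinates
    inertia = s**2 / (s**2).sum()         # share of total variance per axis
    return F, G, s, inertia

def supplementary_columns(N_sup, F, s, axes=2):
    """Project supplementary columns (e.g. key words counted over the same
    reports) onto the leading CA axes via the transition formula."""
    profiles = N_sup / N_sup.sum(axis=0)              # column profiles
    return profiles.T @ F[:, :axes] / s[:axes]
```

On a texts × HoNOS-items table, `inertia[:2].sum()` gives the share of variance explained by the first factorial plane (34.17% in the first analysis above).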
4. Conclusions
The empirical experience of automatic textual analysis of Emergency Department reports confirms its usefulness for investigating complex phenomena such as the clinical manifestations and risk factors of psychiatric emergencies. Correspondence analysis proves a simple and useful method for exploring the relationships among the dimensions under examination. Some problems nevertheless emerge concerning the quality of the information: having been collected for other purposes, it is overabundant in some areas (symptomatic manifestations) and lacking in others, such as the subject's degree of disability, which could not be analysed as a precipitating factor of the emergency. Such shortcomings might be overcome by acquiring information from other sources, as some researchers have done (Fusar-Poli et al., 2017). There would still remain the open problem of sharing and standardizing data-treatment methods across the various stages of the investigation, from the way information is collected and reports are written, to the creation of a standard thesaurus of psychiatric key words and phrases based on shared theoretical concepts and criteria.

References
Benzécri J.-P. (1973). L'analyse des données. Vol. 2. Dunod, Paris.
Bitetto A. et al. (2017). La consultazione psichiatrica in Pronto Soccorso come fonte informativa sui bisogni inespressi di salute mentale. Nuova Rassegna Studi Psichiatrici, vol. 15, novembre 2017.
Coloma P.M. et al. (2011). Combining electronic healthcare databases in Europe to allow for large-scale drug safety monitoring: the EU-ADR Project. Pharmacoepidemiology and Drug Safety, 20(1): 1-11.
Denaxas S. et al. (2017). Methods for enhancing the reproducibility of biomedical research findings using electronic health records. BioData Mining, 10: 31.
Ehrenberg A., Ehnfors M. (1999). Patient problems, needs, and nursing diagnoses in Swedish nursing home records. Nursing Diagnosis, 10(2): 65-76.
Fusar-Poli P. et al. (2017). Diagnostic and prognostic significance of Brief Limited Intermittent Psychotic Symptoms (BLIPS) in individuals at Ultra High Risk. Schizophrenia Bulletin, 43(1): 48-56.
Gini R. et al. (2016). Automatic identification of type 2 diabetes, hypertension, ischaemic heart disease, heart failure and their levels of severity from Italian General Practitioners' electronic medical records: a validation study. BMJ Open, 6(12): e012413.
Ho M.L. et al. (2012). The accuracy of using integrated electronic health care data to identify patients with undiagnosed diabetes mellitus. Journal of Evaluation in Clinical Practice, 18(3): 606-611.
Lora A. et al. (2001). The Italian version of HoNOS (Health of the Nation Outcome Scales), a scale for evaluating the outcomes and the severity in mental health services. Epidemiology and Psychiatric Sciences, 10(3): 198-204.
Migliardi A. et al. (2004). Descrizione degli incidenti domestici in Piemonte a partire dalle fonti informative correnti. Epidemiologia & Prevenzione, 28(1): 20-26.
Mitchell J. et al. (1994). Using Medicare claims for outcome research. Medical Care, 35: 589-602.
Persell S.D. et al. (2009). Electronic health record-based cardiac risk assessment and identification of unmet preventive needs. Medical Care, 47(4): 418-424.
Vaona A. et al. (2017). Data collection of patients with diabetes in family medicine: a study in north-eastern Italy. BMC Health Services Research, 17(1): 565.
Wing J.K. et al. (1998). Health of the Nation Outcome Scales (HoNOS). Research and development. The British Journal of Psychiatry, 172(1): 11-18.
Zanus C. et al. (2017). Adolescent admissions to emergency departments for self-injurious thoughts and behaviors. PLoS One, 12(1): e0170979.

Exploring the history of American philosophy in a computer-assisted framework
Guido Bonino1, Davide Pulizzotto2, Paolo Tripodi3
1 Università di Torino - guido.bonino@unito.it
2 LANCI, Université du Québec à Montréal - davide.pulizzotto@gmail.com
3 Università di Torino - paolo.tripodi@unito.it

Abstract
The aim of this paper is to check to what extent some tools for computer-assisted concept analysis can be applied to philosophical texts endowed with complex and sophisticated contents, so as to yield results that are significant not only because of the technical success of the procedures leading to the results themselves, but also because the results, though highly conjectural, are a direct contribution to the history of philosophy.

Keywords: philosophy, history of philosophy, paradigm, necessity, idealism, Digital Humanities, text analysis, computer-assisted framework

1. Computer-assisted concept analysis
The development of artificial intelligence poses a methodological challenge to the humanities. Many traditional practices in disciplines such as philosophy are increasingly integrating computer support. In particular, Concept Analysis (CA) has always been a common practice for philosophers and other scholars in the humanities. Thanks to the development of Text Mining (TM) and Natural Language Processing (NLP), computer-assisted text reading and analysis can provide the humanities with new tools for CA (Meunier and Forest, 2005), making it possible to analyze large textual corpora, which were previously virtually unassailable. Examples of computer-assisted analyses of large corpora in philosophy are Allard et al., 1963; McKinnon, 1973; Estève et al., 2008; Danis, 2012; Sainte-Marie et al., 2010; Le et al., 2016; Meunier and Forest, 2009; Ding, 2013; Chartrand et al., 2016; Pulizzotto et al., 2016; Slingerland et al., 2017.
134 JADT’ 18 Exploring the history of American philosophy in a computer-assisted framework Guido Bonino1, Davide Pulizzotto2, Paolo Tripodi3 2 1Università di Torino – guido.bonino@unito.it LANCI, Université du Québec à Montréal – davide.pulizzotto@gmail.com 3Università di Torino – paolo.tripodi@unito.it Abstract The aim of this paper is to check to what extent some tools for computerassisted concept analysis can be applied to philosophical texts endowed with complex and sophisticated contents, so as to yield results that are significant not only because of the technical success of the procedures leading to the results themselves, but also because the results, though highly conjectural, are a direct contribution to the history of philosophy Sommario Lo scopo di questo articolo è di verificare in che misura la computer-assisted concept analysis possa essere applicata a testi filosofici di contenuto complesso e sofisticato, in modo da produrre risultati significativi non solo dal punto di vista del successo tecnico delle procedure, ma anche in quanto i risultati stessi, sebbene altamente congetturali, costituiscono un contributo diretto alla storia della filosofia. Keywords: philosophy, history of philosophy, paradigm, necessity, idealism, Digital Humanities, Text Analysis, Computer-assisted framework 1. Computer-assisted concept analysis The development of artificial intelligence poses a methodological challenge to the humanities. Many traditional practices in disciplines such as philosophy are increasingly integrating computer support. In particular, Concept Analysis (CA) has always been a common practice for philosophers and other scholars in the humanities. Thanks to the development of Text Mining (TM) and Natural Language Processing (NLP), computer-assisted text reading and analysis can provide the humanities with new tools for CA (Meunier and Forest, 2005), making it possible to analyze large textual corpora, which were previously virtually unassailable. Examples of computer-assisted analyses of large corpora in philosophy are Allard et al., 1963; McKinnon, 1973; Estève et al., 2008; Danis, 2012; Sainte-Marie et al., 2010; Le et al., 2016; Meunier and Forest, 2009; Ding, 2013; Chartrand et al., 2016; Pulizzotto et al., 2016; Slingerland et al., 2017. The use of computer- JADT’ 18 135 assisted text analysis is also relevant for the distant reading approach, developed by Franco Moretti in the context of literature studies (Moretti, 2005; Moretti, 2013), but which we are convinced can be usefully extended to different fields (for the application to philosophy see the Conference “Distant Reading and Data-Driven Research in the History of Philosophy” held in Turin in 2017, http://www.filosofia.unito.it/dr2/). The main aim of this paper is to check to what extent some tools for computer-assisted CA can be applied to texts endowed with complex and sophisticated contents, so as to yield results that are significant not only because of the technical success of the procedures leading to the results themselves, but also because the results, though highly conjectural, are a direct contribution to the humanities. Philosophy, in particular the history of philosophy, seems to be a good case to be considered, because of the sophistication of its contents. Our main purpose is that of illustrating some of the different kinds of work that can be done in history of philosophy with the aid of computer-assisted CA. 2. Method 2.1. 
2. Method

2.1. The corpus
To understand how TM and NLP can assist work in the history of philosophy, some standard methods have been applied to a specific corpus, provided by ProQuest (www.proquest.com). The corpus is a collection of 20,751 PhD dissertations in philosophy defended in the US from 1981 to 2015. It therefore contains 20,751 documents: each document is a text comprising the title and the abstract of a dissertation, which are dealt with as a single unit of analysis. The corpus also contains some metadata, such as the author of the dissertation, the year of publication, the name of the supervisor, the university, the department, and so forth. In the present paper we are not going to exploit fully the wealth of information provided by these metadata, which are certainly worth being the subject of further research. However, we will use the crucial datum of the year of publication, which allows us to assume a diachronic (that is, historical) perspective on the investigated documents.

2.2. Data preprocessing
A preliminary step consists in a set of four preprocessing operations that extract the linguistic information needed for the analysis: 1) Part-of-Speech (POS) tagging; 2) lemmatization; 3) vectorization; 4) selection of sub-corpora according to Keyword In Context (KWIC) criteria. POS tagging and lemmatization are performed with the TreeTagger algorithm described by Schmid, 1994 and 1995. This operation annotates each word of each document with its morphological category. Some irrelevant categories (such as determiners, prepositions and pronouns) are eliminated; nouns, verbs, modals, adjectives, adverbs, proper nouns and foreign words are retained. The lemmatization process reduces a word to its lemma, according to the corresponding POS tag. At the end of this process we can identify 17,750 different lemmas, which are called types. The mathematical modeling of each document in a vector space is called vectorization. In such a model, each document is encoded by a vector whose coordinates correspond to the TF-IDF weights of the words occurring in that document; this weighting function computes the normalized frequencies of the words in each document (Salton, 1971). At the end of the process a matrix M is built, with 20,751 rows corresponding to the documents and 17,750 dimensions corresponding to the types. Finally, three sub-corpora are created on the basis of the KWIC criterion. These sub-corpora correspond to the sets of all the text segments in which one of three lexical forms, each conveying the meaning of a concept, appears: 'necessity', 'idealism' and 'paradigm'. The three concepts have been chosen because of the considerable diversity of their statuses: 'necessity' has always been a keyword of several sub-fields of philosophy; 'idealism' refers both to a historically determined philosophical current and to an abstract position in philosophy; 'paradigm' entered the philosophical vocabulary in relatively recent times, mainly after the publication of Kuhn, 1962, as a technical term in the philosophy of science. We obtain a set of 719 documents for 'necessity', 450 documents for 'idealism', and 975 documents for 'paradigm'.
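A minimal sketch of the vectorization and KWIC selection steps, with scikit-learn as a stand-in for the actual pipeline (document strings and variable names are placeholders; POS tagging and lemmatization, e.g. with TreeTagger, are assumed to have been applied upstream):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["title and abstract ...", "..."]   # 20,751 title+abstract units

# TF-IDF vectorization of the lemmatized documents.
vectorizer = TfidfVectorizer()
M = vectorizer.fit_transform(documents)         # ~20,751 x 17,750 here

def kwic_subcorpus(docs, keyword):
    """Select the documents in which a given lexical form occurs."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [i for i, d in enumerate(docs) if pattern.search(d)]

subcorpora = {w: kwic_subcorpus(documents, w)
              for w in ("necessity", "idealism", "paradigm")}
S = {w: M[idx] for w, idx in subcorpora.items()}   # the S_i sub-matrices
```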
2.3. Word-sense disambiguation process
For each sub-corpus, we identify the semantic patterns (usually, word co-occurrence patterns) associated with each lexical form, so as to discover the most relevant semantic structures of that concept. This is done by means of clustering, a common machine-learning method for pattern-recognition tasks (Aggarwal and Zhai, 2012). Clustering techniques applied to texts rest on two hypotheses: a contiguity hypothesis and a cluster hypothesis. The former states that texts belonging to the same cluster form a contiguous region quite clearly distinct from other regions, while the latter says that texts belonging to the same cluster have similar semantic content (Manning et al., 2009, pp. 289 and 350). For our purposes, clustering is an instrument of semantic disambiguation. In our experiment we use the K-means algorithm (Jain, 2010, p. 50), widely employed for word-sense disambiguation tasks (Pal and Saha, 2015). The main parameter to be tuned in K-means is k, which determines the number of centroids to be initialized. Each execution of the K-means algorithm generates a partition Pk with a number of clusters equal to k. Since each centroid is the "center vector" of its cluster, it can also be used to identify the most "prototypical" documents in a given cluster. To this end, a tool generally used to select relevant documents in Information Retrieval is employed: the cosine computed between a query vector and a group of "document vectors" (Manning et al., 2009). In this setting, each centroid of a partition Pk can be used as a query in order to identify the documents with the highest cosine values. Clustering has first been applied synchronically to the Si matrices with k = {2, 3, 4, ..., 50}, thus obtaining the most recurrent semantic patterns; then it has been applied diachronically, dividing each matrix into three periods (1981-1993, 1994-2003, 2004-2015) chosen so as to obtain sets of documents of similar cardinality. On each sub-matrix of Si, several clusterings with k = {2, 3, 4, ..., 50} were performed, in order to identify the temporal evolution of the most important semantic patterns associated with the three concepts under study. For each generated partition Pk we also perform the cosine computation, so as to obtain the most relevant PhD dissertations belonging to each cluster.
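The disambiguation procedure can be sketched as follows, assuming a scikit-learn K-means as a stand-in for the implementation actually used (`S` is a sub-corpus TF-IDF matrix; all names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def partition_with_prototypes(S, k, n_proto=5):
    """K-means partition P_k of a sub-corpus matrix S, plus the n_proto most
    'prototypical' documents per cluster (highest cosine to the centroid)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(S)
    prototypes = {}
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        sims = cosine_similarity(S[members],
                                 km.cluster_centers_[c].reshape(1, -1)).ravel()
        prototypes[c] = members[np.argsort(sims)[::-1][:n_proto]]
    return km.labels_, prototypes

# Explore partitions synchronically and per period, as in the paper:
# for k in range(2, 51):
#     labels, protos = partition_with_prototypes(S["necessity"], k)
```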
3. Analyses
In this section we present three analyses, focusing on three different concepts: paradigm, necessity and idealism. Each case illustrates a different kind of historical-philosophical result.

3.1. Necessity
After exploring, both synchronically and diachronically, several clusterings (with different k) associated with the concept of necessity, we have focused on a clustering with k = 18 over the period 1981-2015 (the clusters do not differ significantly across the three decades). It turns out that there are at least 16 clearly distinct and philosophically interesting meanings of 'necessity': two (perhaps distinct) theological notions; physical necessity; political necessity; necessity as investigated in modal logic and possible-world semantics; moral necessity; necessity as opposed to freedom in debates over determinism; the necessity of historical processes; metaphysical necessity; two notions of causal necessity (attacked by Hume); the necessity of life events; logical necessity; phenomenological necessity; the necessity of the Absolute (Hegel); the necessity of moral duty (Kant); the ancient concept of necessity; the necessity of law. In addition, there is a rather big cluster in which 'necessity' seems to occur mainly with its ordinary, not strictly philosophical meaning.

If the clustering we applied to 'necessity' were extended to a large number of philosophical words (chosen in our corpus by domain experts), that would be the first step towards the construction of a bottom-up vocabulary of philosophy, and ultimately of a data-driven philosophical dictionary, in which the different (though related) meanings of philosophical terms would be determined on the basis of actual use, rather than merely on the lexicographer's discernment. This lexicographic work is also an indispensable step if one wants to overcome the "concordance approach": it seems to us that this bottom-up lexicography could be a promising starting point for the construction of semantic networks.

3.2. Idealism
Unlike 'necessity', the term 'idealism' has different distributions in the decades 1981-1993, 1994-2003 and 2004-2015. We have considered only the largest clusters (> 10 documents), since for our purpose (reconstructing the main historical developments of American academic philosophy) isolated cases and minor tendencies are not relevant. The evolution of some clusters over the decades suggests interesting historical reflections. First, the cluster "Kant" is persistently important. In fact, it becomes more and more important, even in wider contexts, that is, in documents not directly devoted to Kant; this is shown by the rising trend of the cluster "Transcendental" (a term typically, but not always, directly connected with Kant). Second, the cluster "Hegel" disappears in the second decade and then reappears: is this a real phenomenon rather than a statistical artefact? How can it be explained? Third, the cluster "Realism" disappears in the third decade: is there a relationship between the return of "Hegel" and the disappearance of "Realism"? This is not the kind of question that comes naturally to the mind of the historian of philosophy on the basis of his or her knowledge of the well-known developments of recent American philosophy. The hypothesis can be formulated only thanks to some sort of defamiliarization (ostranenie) with respect to the received views in the history of philosophy. Yet it seems unlikely that philosophers in the last decade gave up speaking of realism. The received view may after all be correct: realism is more and more central in late analytic philosophy (think, for example, of the centrality of David Lewis) (Bonino and Tripodi, forthcoming). This is confirmed by other data, such as the number of occurrences of 'realis-' in the abstracts of the corpus: 1981-1993: 373 (5.76% of 6,471); 1994-2003: 465 (6.31% of 7,361); 2004-2015: 482 (5.6% of 8,585). Thus the focus on realism is still there in the third decade. One is therefore led to formulate an alternative hypothesis: philosophers ceased to speak of idealism in relation to realism:
It is usually maintained that statistical techniques are useful in that they restrict the space of possible interpretations (Mitchell, 1997), but in other cases, such as the one described in this section, at least in an early phase of the hermeneutic process, in virtue of their defamiliarizing impact they can also have the opposite effect: that of broadening that same space and discovering nouveaux observables (Rastier, 2011). 3.2. Paradigm This case study deals with the term ‘paradigm’ in the period 1981-2015. After exploring several k in the three decades, we focus on the synchronic analysis of the set of clusters with k=16. The first result that immediately stands out is that ‘paradigm’ occurs rather often: 995 documents, twice as many as ‘idealism’ (450), and considerably more than ‘necessity’ (719), a concept which is widely regarded as central in the recent history of Anglo-American philosophy. Using Google Ngram Viewer, and thus taking into account a generalist, non disciplinary corpus, it turns out that such a high frequency is peculiar to the philosophical discourse (the lowest value of ‘necessity’ is 0.0025%, which is higher than the highest value for ‘paradigm’, which is 0.0016%). Why does ‘paradigm’ occur so frequently? On the one hand, one could find this datum not so surprising, since ‘paradigm’ is a technical term in the philosophy of science, introduced by Kuhn, 1962 to refer to a set of methodological and metaphysical assumptions, examples, problems and solutions, a vocabulary, which are taken for granted, in a given period of normal science, by a scientific community. On the other hand, moving from a priori considerations to the examination of the data, a partly different landscape emerges: ‘paradigm’ seems to be a fashionable concept, which is used in a variety of contexts as a term that is neither technical nor simply ordinary. Only in cluster 8 has the term a straightforward technical use, derived from Kuhn’s philosophy of science. Each of the other clusters (1: theology, 2: music, 3: philosophy of law, 4: education; 5: nursing; 6: philosophy of religion; 7: moral philosophy; 9: bioethics, 10: spiritualism; 11: political theory; 12: self narrative; 13: theology; 14: Kant-Leibniz; 15 140 JADT’ 18 aesthetics; 16: philosophy and language in Wittgenstein, Heidegger etc.) does not correspond to a different meaning of the term ‘paradigm’, but simply to the application of the same concept to different fields. In most cases we have to do with non-technical contexts, in which ‘paradigm’ has neither its original grammatical meaning nor its ordinary, non-philosophical meaning (standard, exemplar). It seems to us that its meaning and use are generic and vague, rather than precise and technical; nonetheless, they evoke Kuhn: a quasi-Kuhnian vocabulary became fashionable; it entered many philosophical discourses, often more “humanistic” than “scientific” in spirit, and much less technical than the philosophy of science. This case study expresses an especially interesting kind of result obtainable by using TM and NLP techniques to assist research in history of philosophy: it shows how the interpretation of clusters fosters the discovery of terminological fashions as opposed to genuine conceptual developments. References Aggarwal C.C., and Zhai C.X. (2012). “A Survey of Text Clustering Algorithms.” In Mining Text Data, 77–128. Springer. Allard M. et al. (1963). Analyse conceptuelle du Coran sur carte perforées. Mouton. Bonino G. and Tripodi P. 
Chartrand L., Meunier J.-G. and Pulizzotto D. (2016). CoFiH: A heuristic for concept discovery in computer-assisted conceptual analysis. In Mayaffre D. et al. (eds.), Proceedings of the 13th International Conference on Statistical Analysis of Textual Data, vol. I, pp. 85-95.
Danis J. (2012). L'analyse conceptuelle de textes assistée par ordinateur (LACTAO); une expérimentation appliquée au concept d'évolution dans l'œuvre d'Henri Bergson. Université du Québec à Montréal (http://www.archipel.uqam.ca/4641/1/M12423.pdf).
Ding X. (2013). A text mining approach to studying Matsushita's management thought. Proceedings of the 5th International Conference on Information, Process and Knowledge, pp. 36-39.
Estève R. (2008). Une approche lexicométrique de la durée bergsonienne. Actes des journées de la linguistique de corpus, vol. 3: 247-258.
Jain A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8): 651-666.
Kuhn T.S. (1962). The Structure of Scientific Revolutions. University of Chicago Press.
Le N.T., Meunier J.-G., Chartrand L. et al. (2016). Nouvelle méthode d'analyse syntactico-sémantique profonde dans la lecture et l'analyse de textes assistées par ordinateur (LATAO). In Mayaffre D. et al. (eds.), Proceedings of the 13th International Conference on Statistical Analysis of Textual Data.
Manning C.D. et al. (2009). Introduction to Information Retrieval. Online edition. Cambridge University Press, Cambridge, UK.
McKinnon A. (1973). The conquest of fate in Kierkegaard. CIRPHO, 1(1): 45-58.
Meunier J.-G. and Forest D. (2005). Classification and categorization in computer-assisted reading and analysis of texts. In Cohen H. and Lefebvre C. (eds.), Handbook of Categorization in Cognitive Science, pp. 955-978. Elsevier.
Meunier J.-G. and Forest D. (2009). Lecture et analyse conceptuelle assistée par ordinateur: premières expériences. In Annotation automatique et recherche d'informations. Hermes.
Mitchell T.M. (1997). Machine Learning. McGraw-Hill.
Moretti F. (2005). Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.
Moretti F. (2013). Distant Reading. Verso.
Pal A.R. and Saha D. (2015). Word sense disambiguation: A survey. International Journal of Control Theory and Computer Modeling, 5(3).
Pincemin B. (2007). Concordances et concordanciers: de l'art du bon KWAC. XVIIe Colloque d'Albi. Langages et signification - Corpus en lettres et sciences sociales: des documents numériques à l'interprétation, pp. 33-42.
Pulizzotto D. et al. (2016). Recherche de "périsegments" dans un contexte d'analyse conceptuelle assistée par ordinateur: le concept d'"esprit" chez Peirce. JEP-TALN-RECITAL 2016, vol. 2, pp. 522-531.
Rastier F. (2011). La mesure et le grain. Sémantique de corpus. Champion.
Sainte-Marie M. et al. (2010). Reading Darwin between the lines: a computer-assisted analysis of the concept of evolution in the Origin of Species. 10th International Conference on Statistical Analysis of Textual Data.
Salton G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Upper Saddle River, NJ.
Schmid H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing. Manchester, UK.
Schmid H. (1995). Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop, pp. 47-50.
Slingerland E. et al. (2017). The distant reading of religious texts: A "big data" approach to mind-body concepts in early China. Journal of the American Academy of Religion: 1-32.

Descending hierarchical classification for the analysis of social representations in an anti-bilingualism petition in New Brunswick, Canada
Marc-André Bouchard, Sylvia Kasparian
Université de Moncton - emb1214@umoncton.ca; sylvia.kasparian@umoncton.ca

Abstract
In this article, we apply Jean-Blaise Grize's theoretical framework and Max Reinert's descending hierarchical classification to a corpus composed of comments published as part of a petition against institutional bilingualism in New Brunswick. Using Iramuteq, we point to the lexical worlds which constitute anti-bilingualism arguments.

Keywords: lexical worlds, social representations, schematization, descending hierarchical classification, online petition

1. Introduction
Every discourse analysis, as Jean-Blaise Grize acknowledges in Logique naturelle et communications (1998; 144-145), is confronted with the problem of the correspondence between discourse and representations. This is attributable in particular to the importance that discourse analysis gives to the communication situation, a factor that complicates the relation of correspondence between what one says and what one "really" thinks. In this article we propose to explore the intersection between discourse analysis and the study of representations, and we will try to show that, although the problem of the correspondence between discourse and individual representations remains difficult to solve, online petition corpora homogenize the discourse and act on the schematization constructed by the speaker, in such a way that logometric analyses can access some of the social representations at stake. To this end, we will use the Reinert method (a descending hierarchical classification originally popularized by the ALCESTE software) (1990), as implemented in the Iramuteq software (Ratinaud, 2009), which consists in identifying the lexical worlds of a corpus. Several authors, including Max Reinert himself, have already established links between this method and the field of social representation studies (1993; 13). Our contribution to the conversation is to apply this logometric methodology and the theoretical framework developed by Grize to a new type of corpus that has been gaining popularity since the beginning of the 21st century: online petitions. The example through which we will illustrate our theoretical exposition is the analysis, with Iramuteq, of the lexical worlds of an online petition launched in New Brunswick (Canada) in 2013 on the platform www.change.org, against the requirement of bilingualism as an employment criterion in the provincial public service.
2. Theoretical framework

According to Denise Jodelet, a social representation can be defined as "a form of knowledge that is socially elaborated and shared, has a practical aim, and contributes to the construction of a reality common to a social group" (1997; 53). Thus, as Serge Moscovici remarks, the study of representations calls for methods of observation rather than experimentation, given that a representation manifests itself "as a 'modelling' of the object that is directly readable in, or inferred from, various linguistic, behavioural or material supports" (idem; 61). Although it is a form of knowledge, a representation differs from scientific knowledge in that it derives from what Jean-Blaise Grize calls natural logic (Grize, 1997; 171-172), yielding a "common-sense knowledge" (Jodelet, 1997; 53). "Natural logic" here means a logic of a logico-discursive order, manifested in discourse through schematization, which "takes into account the contents and not only the forms of thought" (Grize, 1997; 171-172). According to Grize, schematization is articulated by five notions: [1] a schematization is the putting into discourse [2] of the point of view that a speaker A [3] forms, or has, [4] of a certain reality R; this putting into discourse is made for an interlocutor, or a group of interlocutors, B, [5] in a given situation of interlocution (idem). Grize thus proposes that all communication is a situation of interlocution, in which the speaker constructs a schematization according to his cultural preconstructions, his representations of the object in question, and his aims; this schematization is made up of images of the speaker, of the listener and of the object at stake, and it is then reconstructed by the listener according to the listener's own representations, cultural preconstructions and aims (Grize, 1993; 7). Schematization is therefore both partial and biased: "it is partial insofar as its author includes only what he judges useful to his aim, to the effect he wants to produce; it is biased since he arranges it so that B will receive it" (Grize, 1997; 175). As for aims, according to Patrick Charaudeau, discourses, particularly argumentative ones, pursue a double quest, plausibility and influence, whose success depends on the "socio-cultural representations shared by the members of a given group in the name of experience or knowledge" (1992; 784). Given this "double quest" of the argumentative mode, the representations of objects on which the speaker builds his schematization are chosen because the speaker supposes them to be shared by the addressee(s). Consequently, the analysis of the lexical worlds common to a group of speakers in the same communication situation can give us clues to the social representations that the group forms of an object of the social world. Indeed, according to Max Reinert, in a collective corpus a lexical world is an indicator of a reference space common to a group and "the index of a form of coherence linked to the specific activity of the enunciating subject" (Reinert, 1993; 13).
The descending hierarchical classification method (Reinert, 1990) provides a representation of these lexical worlds (or themes) in the form of classification tables obtained by crossing the context units (or segments) of a corpus with its lexemes. The hypothesis underlying the method is that "insofar as a collective representation expresses a certain regularity of structure within a class of singular representations [...] this regularity is due to the constraints of what we call 'a world'" (Reinert, 1993; 29-30). Taking into account the frequency and the environment of the forms of a corpus makes it possible not only to identify the lexical forms most likely to constitute indices of social representations, but also to define those lexical forms according to their co-text.

3. Corpus

The corpus analysed in this study comes from an online petition. Unlike a classic petition, an online petition allows signatories to publish, if they wish, a comment justifying their support for its title and description. The petition in question here, Stop the hiring discrimination against citizens who speak English only (https://www.change.org/p/the-government-of-new-brunswick-stop-the-hiring-discrimination-against-citizens-who-speak-only-english), was launched in 2013 on www.change.org. Its comments, besides being signed by their authors, are publicly accessible on the petition page itself. This particularity of the communication channel, which Contamin (2001) calls "a classic paradox of petitions", affects the addressee of the discourse: the addressee is not only the provincial government but also the general public. Online petition corpora thus homogenize discourse in the sense of Grize's model of communication. First, the group of speakers is in the same situation of interlocution (written monologues in argumentative mode) and is invited to share its point of view on the same reality (here, the institutional bilingualism of the province of New Brunswick). These discourses are produced for a general public, and the committed nature of a petition means that, in theory at least, only speakers sharing the point of view stated in the title are represented. The point of view shared by the contributors to our corpus is that requiring English-French bilingualism for jobs in the provincial public service constitutes discrimination against anglophone New Brunswickers, who are largely unilingual (fewer than 15% of them consider themselves bilingual, compared with a rate of over 70% in the francophone minority community). These discourses belong to a long-running debate within the New Brunswick population on institutional bilingualism, historically divided along linguistic lines: francophones favour state bilingualism and the advancement of language rights, whereas anglophones are more reticent. In all, by its close at the end of 2013, the petition Stop the hiring discrimination against citizens who speak only English had collected 7,758 signatures and a total of 2,372 comments, each ranging in length from one word ("jobs") to 304 words, for an average of 37.66 linguistic units per comment. The corpus contains 4,425 different forms representing a total of 89,338 occurrences.
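Summary statistics of this kind (number of comments, occurrences, distinct forms, mean comment length) are easy to recompute. The following minimal sketch assumes a plain-text file with one comment per line; the file name and the crude tokenizer are illustrative stand-ins, not the preprocessing actually performed by Iramuteq.

import re

def tokenize(text):
    # crude word tokenizer; Iramuteq's own segmentation differs in detail
    return re.findall(r"\w+", text.lower())

# assumed input: one petition comment per line
with open("petition_comments.txt", encoding="utf-8") as f:
    comments = [line.strip() for line in f if line.strip()]

tokens = [t for c in comments for t in tokenize(c)]
forms = set(tokens)

print(len(comments), "comments")
print(len(tokens), "occurrences;", len(forms), "distinct forms")
print("mean comment length: %.2f tokens" % (len(tokens) / len(comments)))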
The cleaned and standardized corpus was submitted to the Iramuteq software, which produced the dendrogram of classes constituting the lexical worlds of the comments, presented in the next section.

4. Analysis

The 89,338 occurrences (4,425 different forms) constituting our corpus are grouped into 3,492 lemmas, namely 2,954 active forms and 538 supplementary forms. The whole corpus is segmented into a total of 2,423 parts of roughly equal size (on average 36.87 forms per segment). The descending hierarchical classification computed with Iramuteq produces the graph presented in Figure 1.

Figure 1: Classification on simple text segments.

Reading Figure 1 shows that the first segmentation of the corpus yields Class 1 (in red), a class representing 30.3% of the classified segments and constituted by a lexicon we call the socio-political axis: it deals first with the "majority"/"minority" dynamic, which, judging by this list of forms, plays a leading role in the representations of Canada and its provinces. Besides a few forms relating to culture and language, one also notices a lexical field that seems to indicate political positioning in the corpus ("right", "common", "sense", "rule", "vote", "political", "equal"), while the verbs ("fight", "cater", "stand", "stop", "start", "push"), also political in nature, reinforce the hypothesis that this class is made up of segments expressing representations of Canadian society. Once Class 1 is constituted, the computation divides the remaining segments into two classes: Class 2 (in green), containing 31.7% of the classified segments, and Class 3 (in blue), with 38%. Collectively, these two classes stand apart from Class 1 by a lexicon rooted in personal experience rather than political opinion. In Class 2 this personal character appears in forms such as "home", "family", "child", "young" and "daughter". The verbs specify the context of this experience: "move", "find", "leave", "work", "live", "stay", "raise", "love" and "born"; as do some evaluative and/or axiological adjectives: "hard", "good", "decent" and "impossible". A few forms, besides "[new] brunswick", refer to another Canadian province, Alberta. The content of Class 2 thus constitutes the biographical axis, often joined to the theme of the exodus to Western Canada. The third and last class of the corpus (in blue) gravitates around the theme of work, more precisely of job seeking. It is also in this class that we find the only direct references to language, apart from the form "language" in Class 1: "bilingual", "speak" and "french". Certain forms specific to Class 3 suggest that it is, in part, more impersonal than Class 2: "employee", "person", "applicant" and "individual".
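The classification just described can be sketched computationally. The following minimal example bisects a set of segments along the first factorial axis of a correspondence analysis of the binary segment-by-form matrix; the actual Reinert algorithm in ALCESTE/Iramuteq instead searches for the cut that maximizes a chi-squared criterion and iterates the bisection with reassignment steps, all omitted here. The toy segments are invented for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def ca_first_axis(X):
    # row coordinates on the first axis of a simple correspondence analysis
    P = X / X.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return s[0] * U[:, 0] / np.sqrt(r)

def split_segments(X):
    # one Reinert-style bisection: order segments along the first CA axis
    # and cut at the median coordinate
    coords = ca_first_axis(X)
    cut = np.median(coords)
    return np.where(coords < cut)[0], np.where(coords >= cut)[0]

segments = ["speak french to get a job", "family had to move west",
            "minority rules over the majority", "jobs require bilingual staff",
            "my daughter left the province", "majority should have equal rights"]
X = CountVectorizer(binary=True).fit_transform(segments).toarray().astype(float)
left, right = split_segments(X)
print(left, right)

Applying the same split recursively to each resulting class, and stopping when classes become too small or too homogeneous, yields the dendrogram of lexical worlds of the kind shown in Figure 1.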
Starting from the classification on text segments, one can automatically scan all the segments of each class and give each a score according to the number of words representative of the class in which it falls, also taking into account the degree of representativeness of those forms. The two following segments are thus characteristic of Class 1: "discrimination of the english[-]speaking white majority populace should stop with the democratic system becoming more in play with majority rules as a true reflection of the people"; "we as a province cannot afford duplicate books in 2 languages to support a minority and the need to speak french in a majority speaking english province to have a job is ridiculous". In the segments characteristic of Class 1, a reversal of the classic power relation between a majority group and a minority group appears: here the anglophones are oppressed, while it is the francophones who are advantaged, who have the attentive ear of the government and, ultimately, who hold the bilingual labour market. This oppression would show in the difficulty unilingual anglophones have in finding a job, notably in the public service but perhaps also in the private sector. One immediately notices a representation of democracy reduced to majority rule (defined by H. B. Mayo (1957; 50) as "the principle that when there is a majority on a matter, then the wishes of the majority should prevail"), explicitly communicated in the first characteristic segment of Class 1. As for Class 2, here are two of its most characteristic segments: "it is very important to me because my daughter like 1000s of other working children here in new brunswick have had to leave their home province in order to find work because they only speak their own language of english."; and "i have been out of work for over a year. Unable to find a full time job due to bilingualism restrictions. Going to have to move west.". A recurrent motif thus appears in Class 2: to find a good job, or indeed any job, one must be bilingual, failing which one goes into exile, notably in Western Canada. These segments testify to a feeling of powerlessness but also of reluctance at the idea of leaving one's native province. Some characteristic segments of Class 2 deal with the personal experience of the commenter, who had to move, or believes he will have to move, to a non-bilingual province, while others recount the accomplished or anticipated exodus of their child(ren). In the Class 2 segments above, the poverty of the job market for anglophones is readily attributed to the linguistic factor. Finally, the characteristic segments of Class 3 are the following: "because this is a problem, i have 17 years' experience and 2 degrees and i can't even apply for the jobs i qualify for because it's mandatory bilingual positions when over 90% of the day is dealing in english, they won't even interview you unless you speak french"; and "the most qualified person for the job is not always hired because they are not bilingual". The various forms of the concept of "qualification", and others semantically linked to it, are omnipresent in these characteristic segments. Linguistic competence is evidently excluded from this concept.
Indeed, a person who speaks only English is presented as potentially as qualified as, and occasionally more qualified than, a bilingual candidate for a job requiring bilingualism. The scenario put forward, often hypothetical, tends to stage a unilingual person who would be more qualified than another for whom bilingualism is presented as the sole asset.

5. Conclusion

In sum, within this petition, the speakers put into discourse representations of institutional bilingualism in New Brunswick through three lexical worlds, presenting three facets of the perceived discrimination against anglophones in the public service. The first lexical world is socio-political and states general principles about what is just; the second is biographical and relates the personal effects of this discrimination; and the third deals with examples of how this discrimination manifests itself in the world of work. The sample of social representations of institutional bilingualism constituting our corpus thus exhibits a causal link between the bilingualism requirement for certain jobs and the difficulties of the provincial labour market. With the aim of convincing a general public, this point of view is presented from an angle at once ideological, personal and practical, invoking certain images of democracy, exodus and competence; images which, although relatively homogeneous in our corpus, would not necessarily be shared in the social representations of bilingual anglophones and of francophones.

Bibliography

Charaudeau, Patrick (1992). Grammaire du sens et de l'expression. Hachette.
Contamin, J.-G. (2001). Contribution à une sociologie des usages pluriels des formes de mobilisation : l'exemple de la pétition en France. Doctoral thesis, Université Paris 1.
Grize, Jean-Blaise (1998). Logique naturelle et communications. Presses Universitaires de France.
Jodelet, Denise (1997). Les représentations sociales. In Jodelet, ed., Les représentations sociales (5th ed.). Presses Universitaires de France.
Mayo, H. B. (1957). Majority Rule and the Constitution in Canada and the United States. Political Research Quarterly, vol. 10(1): 49-62.
Ratinaud, Pierre (2009). Iramuteq : interface de R pour les analyses multidimensionnelles de textes et de questionnaires. http://www.iramuteq.org.
Reinert, Max (1990). Alceste, une méthodologie d'analyse des données textuelles et une application. Bulletin de Méthodologie Sociologique, vol. 26(1): 24-54.
Reinert, Max (1993). Les "mondes lexicaux" et leur "logique" à travers l'analyse statistique d'un corpus de récits de cauchemars. Langage et société, vol. 66(1): 5-39.
Reinert, Max (1997). Postures énonciatives et mondes lexicaux stabilisés en analyse statistique de discours. Langage et société, no. 121/122: 189-202.

Analysing occupational safety culture through mass media monitoring

Livia Celardo (1), Rita Vallerotonda (2), Daniele De Santis (2), Claudio Scarici (2), Antonio Leva (2)
(1) Sapienza University of Rome
(2) INAIL Research – Headquarters for Research of the Italian National Institute for Insurance against Accidents at Work

Abstract

In recent years, a group of researchers within the Italian National Institute for Insurance against Accidents at Work (INAIL) has launched a pilot project on mass media monitoring, in order to find out how the press deals with the culture of safety and health at work.
To monitor the mass media, the Institute has created a relational database of news concerning occupational injuries and diseases, filled with information obtained from newspaper articles about work-related accidents and incidents, including the full text of the articles. The ultimate objective is to identify the main lines along which to run awareness-raising actions on safety and health at work. In the first phase of this project, 1,858 news articles regarding 580 different accidents were collected; for each injury, not only the news texts but also several variables were recorded. Our hypothesis is that journalists use a different language to narrate different kinds of accidents. To verify it, a text clustering procedure was applied to the articles, together with a lexical correspondence analysis; our purpose is to find language distinctions connected to groups of similar injuries. Identifying the various ways in which events are reported could provide new elements for describing safety knowledge, and could also ground collaborations with journalists aimed at enhancing communication and raising public attention to workers' safety.

Keywords: Occupational safety; Work-related accident; Text mining; Mass media.

1. Introduction

The study described here grew out of the collaboration between the Department of Social Sciences and Economics of Sapienza University of Rome and the Headquarters for Research of INAIL (Italian National Institute for Insurance against Accidents at Work), where since 2012 a team of researchers has developed the idea of monitoring the mass media with a view to preventing accidents at work (INAIL, 2015). With this in mind, those researchers built the so-called "Repertorio Notizie SSL" (News Repository on Occupational Safety and Health), a relational database of media news related to occupational injuries and diseases.
The objective of this project is to observe the culture of occupational safety and health communicated by mass media agencies, in order to identify new elements for increasing prevention against accidents at work. In this study we focus on the hypothesis that there are asymmetries in the language used to describe injuries depending on the characteristics of the event. To test it, we performed some automatic text analysis procedures on the repository data. The article is structured as follows: in Section 2, the News Repository is presented; in Section 3, the data and methodology are described; in Section 4, the results of the analyses are shown; in Section 5, conclusions are drawn.

2. The tool

The News Repository on Occupational Safety and Health (NeRO) is a tool created to allow analyses of the contents and texts of news related to occupational diseases and injuries. Our strategic objective is to increase public awareness and safety culture through a different approach, based also on the study of news articles, their composition and their communication dynamics. The first operational purpose is therefore to understand:
- which kinds of terms are used in news articles about accidents at work or occupational diseases;
- what inspires a title;
- how the same news is treated by different sources/media;
- how the news text can be interpreted in different ways depending on who communicates it;
- whether or not some specific aspects of the events are considered by the media.

Our study plans to analyse the cultural characteristics of mass media communication regarding occupational safety and health (OSH), observing the attitude of the mass media (and journalists) towards the subject and the way users perceive the news depending on which words are used. As mentioned before, NeRO is an ad hoc relational database centred on the gathering of newspaper articles about accidents at work, but it is also arranged to gather news on near misses, occupational diseases and incidents from all kinds of sources (press, television or radio). It involves several interconnected digital tables, which contain structured information (i.e. based on appropriate classifications) and unstructured information (i.e. textual). Information retrieval covers events that happened in Italy, drawing both on online sources and on directly consulted newspapers, since we exploited the Google Alert service (with suitable keywords) and a daily-newspaper subscription ("la Repubblica"). The reference unit is the event (for now restricted to accidents), to which different aspects and pieces of information are linked: one or more articles, one or more injured workers, and so on (a sketch of this relational structure is given after the list below). The data-entry interface consists of a series of thematic screens, starting from the opening one, which lists the already recorded events. These screens allow the following data to be entered, step by step:
- [Screen "Event"] Text containing the event description, date of the event, venue, company where the accident occurred (if appropriate), economic activity;
- [Screens "News"] Texts of each article related to the event, newspaper name (or press affiliation), news title, web URL, date of the article;
- [Screens "Worker" and sub-screens "Accident" and "Harms, disorders or diseases"] Injured worker's biographical data, information about the accident, type of injury, physical implication or resulting disease.
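As a purely illustrative sketch of this event-centred relational structure (not NeRO's actual schema; all field names here are our own guesses), the organization could be modelled as follows.

from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class News:
    newspaper: str              # name or press affiliation
    title: str
    text: str
    published: date
    url: Optional[str] = None

@dataclass
class Accident:
    injury_type: str
    harm: Optional[str] = None  # physical implication or resulting disease

@dataclass
class Worker:
    age: Optional[int] = None   # biographical data, reduced here to one field
    accident: Optional[Accident] = None

@dataclass
class Event:
    description: str
    event_date: date
    venue: str
    economic_activity: str
    company: Optional[str] = None
    news: List[News] = field(default_factory=list)       # one or more articles
    workers: List[Worker] = field(default_factory=list)  # one or more injured workers

The point of the event-as-reference-unit design is that the same accident can be linked to several articles from different sources, which is what makes cross-source comparisons of wording possible.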
3. Methodology and data

At the end of data collection, the repository contained 1,858 news items, related to almost six hundred different accidents. In order to analyse the content of the news texts in connection with the characteristics of the different events, we performed a content analysis using Reinert's method (Reinert, 1983) for a descending hierarchical partition. Starting from the co-occurrence matrix, this algorithm generates groups of lexical units (i.e. words) that most often co-occur in the texts. The lexical groups were then projected onto the factorial axes, together with the variable modalities, using lexical correspondence analysis (Lebart, Salem and Berry, 1997); in this way, we could observe how the language is connected to the features of the accidents. Finally, to better understand the differences between news texts, we analysed the specificities related to the modalities of the variables.

4. Main results and discussion

The cluster analysis performed on the news texts using Reinert's method, taking the articles as segments, produced three lexical groups (respectively the red, the blue and the green ones in Figure 1):
- Cluster 1 (56.5%): this group includes the words related to the description of the events, in terms of what happened;
- Cluster 2 (26.5%): here we have the terms connected to road accidents;
- Cluster 3 (17%): this group concerns the emotional aspects connected to the events.

We projected the lexical groups (Figure 1) and the modalities of the event variables (Figure 2) onto the first two factors obtained by the lexical correspondence analysis. As shown in Figure 2, the language used in the newspapers displays some interesting characterizations. Some variables, such as the economic activity and the accident site, show a strong lexical differentiation among their modalities; this means that the person narrating the event, i.e. the journalist, uses a specific language to describe the accident on the basis of these characteristics. The other variables showed no particular specificities, except the one related to the mortality of the accident. In fact, as shown in Figure 2, the variable "accident mortality" is best represented on the second factor, given the position and distance of the modalities "yes" and "no" from the origin. To better understand the lexical differences, we also analysed the specificities (Bolasco and De Mauro, 2013; Lafon, 1980; Lebart, Salem and Berry, 1997) for this particular variable.

Figure 1: Lexical groups. Figure 2: Lexical correspondence analysis.

Starting from the results shown in Table 1, we can observe a significant difference in the language used depending on whether the accident is fatal or not. The terms used in the case of a non-fatal event relate to the description of the injury, while in the case of a fatal accident the situation is completely different: the words refer to the emotional sphere of the event, so concepts like family or unpredictability are very often used to describe what happened.
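A specificity analysis of this kind can be sketched as follows, using the usual normal approximation of the hypergeometric model underlying Lafon's (1980) specificities; the word-count dictionaries are assumed to have been computed beforehand, and the exact test implemented in lexicometric software may differ in detail.

import math

def specificity_z(sub_counts, all_counts, sub_total, corpus_total):
    # z test-value per word: positive values mean the word is
    # over-represented in the subcorpus (e.g. fatal-accident articles)
    z = {}
    for word, F in all_counts.items():   # F: frequency of the word in the whole corpus
        f = sub_counts.get(word, 0)      # f: frequency of the word in the subcorpus
        p = F / corpus_total
        # mean and variance of the hypergeometric distribution
        var = sub_total * p * (1 - p) * (corpus_total - sub_total) / (corpus_total - 1)
        if var > 0:
            z[word] = (f - sub_total * p) / math.sqrt(var)
    return z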
Table 1: Analysis of the specificities for the variable "accident mortality" (z = test-value).

Fatal accident: No        z       Fatal accident: Yes       z
Hospital               59.17      Tragedy               35.68
Serious                58.84      Family                27.17
To transfer            54.90      Useless               23.62
Dangerous              28.38      To leave              19.84
Rescue                 24.13      Victim                18.68
Ambulance              24.09      Tragic                17.71
Leg                    23.12      Friend                14.95
Injury                 22.06      Band                  14.89
Trauma                 20.55      Condolence            12.65
Hand                   18.84      Province              12.15
Fracture               16.70      Son                   11.49
Helicopter             13.70      Wife                  11.48
Bus                    12.23      Escape                10.63
Crossroad              10.20      Mayor                  9.11

5. Conclusions

The project presented here showed how the News Repository on OSH (NeRO) can contribute to the analysis of occupational safety and health, even though some institutions already maintain databases of newspaper articles dealing with OSH. In addition to news texts, NeRO provides several systematized variables, making it possible to filter news according to various search criteria and, above all, to carry out organized studies and analyses on the textual data as well. In this paper we presented one of the studies implemented on the repository data using automatic text analysis. The results revealed that a large amount of information is contained within these data, although some information asymmetries are present. For that reason, it will be essential to set up a discussion with a network of journalists and other experts, in order to improve and enhance media communication. The challenge is to get out of the inner circle of prevention practitioners and build a bridge connecting the Institution to a more general public, also involving liaison organizations (such as trade unions and employers' associations).

References

Bolasco S. and De Mauro T. (2013). L'analisi automatica dei testi: fare ricerca con il text mining. Carocci Editore.
Iezzi D. F. (2012). Centrality measures for text clustering. Communications in Statistics – Theory and Methods, 41(16-17): 3179-3197.
INAIL (2015). Il monitoraggio dei mass media in materia di salute e sicurezza: strumenti per la raccolta e l'analisi delle informazioni.
Lafon P. (1980). Sur la variabilité de la fréquence des formes dans un corpus. Mots, 1(1): 127-165.
Lebart L., Salem A. and Berry L. (1997). Exploring Textual Data (Vol. 4). Springer Science & Business Media.
Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2): 187-198.

Is the educational culture in Italian Universities effective? A case study

Barbara Cordella, Francesca Greco, Paolo Meoli, Vittorio Palermo, Massimo Grasso
Sapienza University of Rome – barbara.cordella@uniroma1.it; francesca.greco@uniroma1.it; paolomeoli3@libero.it; vittorio.palermo2511@gmail.com; massimo.grasso@uniroma1.it

Abstract

The paper explores professors' and students' representations of professional training in Clinical Psychology in the Faculty of Medicine and Psychology of the Sapienza University of Rome, in order to understand whether the educational context supports students in developing their ability to enter the job market. To this aim, an Emotional Text Mining of the interviews of 30 students and 17 teachers of the Clinical Psychology Master of Science was performed. Both corpora underwent the analysis procedure implemented in T-Lab, i.e. a cluster analysis with a bisecting k-means algorithm followed by a correspondence analysis on the keyword-per-cluster matrix, and the results were compared.
The results show 4 clusters and 3 factors for each corpus, highlighting a relationship between student and professor representations. Both groups split the training process, separating the educational process from the professional one. The emotional text mining of the interviews turned out to be an enlightening tool, letting emerge the latent dimensions that shape the process and outcome of academic training, and it proved very useful for educational purposes.

Keywords: Education, Clinical Psychology, Job Market, Youth Unemployment, Emotional Text Mining.

1. Introduction

The problem of youth unemployment is highly relevant nowadays. In Italy, 25% of young people under 30 years of age are unemployed, and this percentage grows to 40% for the under-25s (McKinsey & Company, 2014). Why is this percentage so high? According to McKinsey's study (ibidem), the figure of 40% for youth unemployment does not depend on the economic cycle but on "structural causes". Among these causes, education is one of the relevant factors in youth unemployment, while also being a protective factor against poverty and for quality of life, as stated by ISTAT (2017). Graduates are less likely to become poor, although employability and wages depend on the type of degree: 80% of young graduates in psychology are employed after four years (Anpal Servizi, 2017), and psychologists are more likely to become entrepreneurs than employees. Most probably, the length of time needed to get into the job market is connected to the mismatch between the educational system and enterprise (McKinsey & Company, 2014): young people's skills are considered appropriate by 70% of schools and universities, but only by 42% of employers. The effectiveness of education depends in part on the representation of professional training that characterizes the university. Several studies have investigated students' representations in the Psychology Faculty in order to improve the training process (e.g., Carli et al., 2004; Paniccia et al., 2009).
Given the change in the educational plan that took place over the past decade, this study aims to understand whether the present educational context supports students in developing their ability to enter the job market, by performing an emotional text mining (Cordella et al., 2014; Greco, 2016) of interviews with students and teachers of the Master Degree in Clinical Psychology at the Sapienza University of Rome.

2. Methodology

We know that a person's behaviour depends not only on rational thinking but also, and sometimes above all, on an emotional and social way of mental functioning (Carli, 1990; Moscovici, 2005). Namely, people consciously categorize reality and, at the same time, unconsciously symbolize it emotionally (Fornari, 1976). These two thinking processes are the product of the double-logic functioning of the mind (Matte Blanco, 1981), which allows people to adapt to their social environment. According to this socio-constructivist approach, based on a psychodynamic model, unconscious processes are social, as people interactively generate and share the same emotional meanings. The socially shared emotional symbolization shapes interactions, behaviours, attitudes, expectations and communication processes; for this reason, the analysis of narrations allows the latent emotional meaning of a text to be captured (Salvatore & Freda, 2011). If the conscious process sets the manifest content of the narration, namely what is narrated, the unconscious process can be inferred from how it is narrated, that is, from the words chosen to narrate and their association within the text. We consider that people emotionally symbolize an event, or an object, and socially share this symbolization; the words they choose to talk about the event or object are the product of the socially shared unconscious symbolization (Greco, 2016). Accordingly, it is possible to detect the associative links between words in order to infer the symbolic matrix determining the coexistence of these terms in the text. To this aim, we performed a multivariate analysis based on a bisecting k-means algorithm (Savaresi and Boley, 2004) to classify the text, and a correspondence analysis (Lebart and Salem, 1994) to detect the latent dimensions shaping the cluster-per-keyword matrix. The interpretation of the cluster analysis results allows the elements characterizing the emotional representation of education to be identified, while the results of the correspondence analysis reflect its emotional symbolization (Cordella et al., 2014; Greco, 2016). The advantage of this approach is that the factorial space can be interpreted according to the polarization of words, thus identifying the emotional categories that generate professional training representations, and that the interpretation of the clusters is facilitated by exploring their relationships within the symbolic space.

3. Data collection and analysis

In order to explore the emotional representation of education in the Master of Science in Clinical Psychology, we interviewed 30 students (13% of the students) and 17 teachers (71% of the teachers) of the Sapienza University of Rome, on the basis of their voluntary participation. We used an open-question interview for both students and teachers. The students' interviews resulted in a medium-size corpus of 57,387 tokens, and the teachers' interviews in a small corpus of 28,746 tokens.
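The bisecting k-means step mentioned in the methodology can be sketched as follows, in the generic form described by Savaresi and Boley (2004): the largest cluster is repeatedly split in two until the requested number of partitions is reached. This is an illustration of the principle, not T-Lab's actual implementation.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, seed=0):
    # X: context-units-by-keywords matrix; returns one index array per cluster
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < n_clusters:
        # pick the currently largest cluster and split it in two
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

Splitting only the largest cluster is one common selection rule; variants instead split the cluster with the largest within-cluster dispersion.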
In order to check whether the data could be statistically processed, two lexical indicators were calculated: the type-token ratio and the hapax percentage (TTR students = 0.09, hapax students = 50.3%; TTR teachers = 0.147, hapax teachers = 53.8%). Given the size of the corpora, both lexical indicators highlight their richness and indicate that the analysis can proceed. First, the data were cleaned and preprocessed with the T-Lab software (Lancia, 2017) and the keywords were selected. Given the size of the corpora and the hapax percentage, the keywords were chosen according to the selection criteria proposed by Greco (Cordella et al., 2014; Greco, 2016): in particular, we used stems as keywords instead of types, filtering out the lemmas of the open questions of the interviews. Then, on the context-units-per-keywords matrix, we performed a cluster analysis with a bisecting k-means algorithm (Savaresi and Boley, 2004) limited to ten partitions, excluding all context units that did not contain at least two co-occurring keywords. The eta-squared value was used to evaluate the partitions and choose the optimal solution. To finalize the analysis, a correspondence analysis was performed on the keywords-per-clusters matrix (Lebart and Salem, 1994) in order to explore the relationships between clusters and to identify the emotional categories shaping professional training representations for both students and teachers.

4. Main results and discussion

The results of the cluster analysis show that the selected keywords allow, on average, the classification of 96% of the context units in both corpora. The eta-squared values were calculated on partitions from 3 to 9 clusters, and they show that the optimal solution is four clusters for both corpora. The correspondence analysis detected three latent dimensions. Tables 1 and 2 present the emotional map of professional training emerging from the teachers' and students' interviews, and the location of the clusters in the factorial space.

Table 1: Cluster coordinates on the factors of the teachers' corpus (the percentage of explained inertia is reported in brackets after each factor; CU in Cl = context units classified in the cluster).

Cluster (CU in Cl %)                   Factor 1 (26.53%)    Factor 2 (19.03%)    Factor 3 (14.56%)
                                       Motivation           Outcome              Role
1 Training Group (22.3%)               Group -0.21          Competence 0.51      Teacher -0.50
2 Clinical Training (33.7%)            Institution 0.33     Competence 0.23      Professional 0.39
3 Institutional Obligations (20.2%)    Institution 0.65     Degree -0.66         Teacher -0.38
4 Student Orientation (23.8%)          Group -0.79          Degree -0.39         Professional 0.16

The first factor of the teachers' corpus (Table 1) represents the motivation for teaching, focusing either on the group of students and their specific needs or on the Institution's generic goals; the second factor concerns the training outcome, the degree or professional skills; and the third factor reflects the role of the academic professor, who may represent himself as a teacher or as a professional. As regards the students' corpus (Table 2), the first factor represents the approach to the university experience, which can be perceived as an individual experience or a social (relational) one; the second factor describes how students experience vocational training, perceiving it either as the fulfilment of obligations or as the construction of professional skills requiring personal involvement; and the third factor reflects the outcome of educational training, which can focus on the development of professional skills or on the achievement of qualifications.
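Both the lexical indicators and the eta-squared criterion used above are easy to restate in code. The sketch below assumes a plain list of tokens and a dense context-units-by-keywords matrix; the exact formulas implemented in T-Lab may differ in detail (here the hapax percentage is computed over types).

from collections import Counter
import numpy as np

def lexical_indicators(tokens):
    # type-token ratio and hapax percentage
    freq = Counter(tokens)
    ttr = len(freq) / len(tokens)
    hapax_pct = 100 * sum(1 for n in freq.values() if n == 1) / len(freq)
    return ttr, hapax_pct

def eta_squared(X, clusters):
    # share of total variance explained by a partition (between-cluster SS / total SS);
    # clusters is a list of row-index arrays, as returned by bisecting_kmeans above
    grand = X.mean(axis=0)
    ss_tot = ((X - grand) ** 2).sum()
    ss_between = sum(len(idx) * ((X[idx].mean(axis=0) - grand) ** 2).sum()
                     for idx in clusters)
    return ss_between / ss_tot

Computing eta squared for each partition from 3 to 9 clusters and keeping the solution after which the gain levels off reproduces the kind of model selection described above.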
Table 2: Cluster coordinates on the factors of the students' corpus (the percentage of explained inertia is reported in brackets after each factor; CU in Cl = context units classified in the cluster).

Cluster (CU in Cl %)                   Factor 1 (23.2%)     Factor 2 (15.3%)      Factor 3 (14.0%)
                                       Approach             Training              Outcome
1 Idealized Product (27.6%)            Individual -0.56     Fulfilment 0.45       Skills -0.43
2 Professional Education (20.8%)       -0.04                Construction -0.63    Skills -0.24
3 Group Identity (26.3%)               Relational 0.69      Fulfilment 0.22       -0.01
4 Empty Degree (25.3%)                 Individual -0.32     0.01                  Qualifications 0.59

Table 3: Teachers' clusters (the percentage of context units classified in each cluster is reported in brackets; each keyword is followed by the number of context units, CU, in which it occurs).

Cluster 1 (22.3%) Training Group: studente 59, cercare 43, corso 43, teoria 32, lezione 21, modalità 21, organizzazione 20, intervento 19, relazione 17.
Cluster 2 (33.7%) Clinical Training: psicologia 94, lavoro 81, clinico 54, insegnare 36, contesto 29, problema 27, intervento 27, diverso 25, conoscenza 22, interno 22, modello 16.
Cluster 3 (20.2%) Institutional Obligations: scuola 29, persona 28, laurea 19, università 18, trovare 17, specializzazione 16, importante 16, entrare 15, scegliere 14, percorso 14.
Cluster 4 (23.8%) Student Orientation: domanda 42, idea 40, organizzazione 33, aggiungere 32, processo 30, rispetto 29, orientare 21, parlare 21, corso di laurea 20, attività didattiche 18.

The four clusters of both corpora are of different sizes (Tables 1 and 2) and reflect the representations of professional training (Tables 3 and 4). As regards the teachers' corpus (Table 3), the first cluster represents the group of students as a tool for teaching professional skills, focusing on the group process in which relational dynamics are experienced; the second cluster focuses on clinical training, i.e. teaching skills marketable in the job market; the third cluster focuses on the teachers' institutional obligations, regardless of the students' training needs; and the fourth cluster represents student orientation as a way to support students in managing their academic training, regardless of professional skills. As regards the students' corpus (Table 4), in the first cluster good training involves the students' adherence to lesson tasks, regardless of critical thinking about the theoretical model proposed; in the second cluster, learning professional skills is strictly connected to the ability to grasp and respond to market demand; the third cluster reflects the relevance of belonging to a group of colleagues that supports the construction of a professional identity which, unfortunately, seems unconnected to the development of professional skills; and the fourth cluster represents professional training as a process in which the achievement of the degree is the main goal, regardless of job market demand.
Table 4  Students’ Cluster (the percentage of context units classified in the cluster is reported between brackets) Cluster 1 (27,6%) Idealized Product CU keyword esperienza 116 triennale 44 percorso 43 professione 41 università 37 possibilità 35 capire 33 diverso 31 senso 30 vivere 25 Cluster 2 (20,8%) Professional Education keyword CU pensare 89 esame 71 psicologia 65 seguire 55 realtà 55 vedere 55 iniziare 53 triennale 53 lavoro 44 interessante 44 Cluster 3 (26,3) Group Identity keyword CU scelta 154 studiare 153 frequentare 104 rapporto 102 piacere 98 colleghi 97 parlare 74 organizzare 68 domanda 55 aggiungere 36 Cluster 4 (25,3%) Empty Degree keyword CU vivere 26 trovare 85 tesi 20 sentire 91 riuscire 30 prendere 33 persone 105 maniera 23 livello 35 laboratorio 18 CU = context units classified in the cluster. Students and teachers seem to have similar representations of the training process: the academic need of building a network, highlighted by the students’ cluster on group identity, and the teachers’ cluster on training group and student orientation; the relevance of achieving a qualification, highlighted by the students’ cluster on empty degree and the teachers’ cluster on institutional obligation; and the development of professional skills marketable in the job market reflected by the teachers’ cluster on clinical training and the JADT’ 18 163 students’ cluster on professional education in line with what it was found by Carli and colleagues (2004) and Paniccia and colleagues (2009) by means of a similar methodology, the emotional textual analysis (Carli et al., 2016). The awareness of the psychological demand of the labour market is an indicator of the professional training process’s effectiveness. Nevertheless, students and teachers split the academic achievement from the development of professional skills. This could be a critical aspect, possibly explaining young graduates’ difficulty in entering the job market, focusing more on academic context rather than on market demand. As a consequence, during the training process, students do not develop the connection between professional training (what they are learning) and professional skills (what they are going to do in the future). 5. Conclusion Although the study results could not be generalized, due to the participants’ selection criteria and the methodology we used, they highlight professional training representation characteristics, which are the elements influencing the rate of unemployment among young psychologists. Even though it is not possible to quantify the relevance of the characteristics of the representation, the emotional text mining, allowing for the identification of the words association explanatory of the education representation, allows for hypotheses definition and the identification of the resources and the issues pertaining the professional training in a specific context. The interpretation of the text mining results lets the social unconscious process emerge, setting the education useful to defining the type of psychological intervention able to support the representation transformation toward a more effective training process. In this particular case study, the intervention would aim to develop the connection between professional qualification achievement and the professional skills development, which are currently split. References Anpal Servizi (2017), L’inserimento occupazionale dei laureati in psicologia, dell’università La Sapienza di Roma, Direzione e studi analisi statistica - SAS. Carli R. 
Carli R. (1990). Il processo di collusione nelle rappresentazioni sociali. Rivista di Psicologia Clinica, 4: 282-296.
Carli R., Dolcetti F. and Dolcetti (2004). L'Analisi Emozionale del Testo (AET): un caso di verifica nella formazione professionale. In Purnelle G., Fairon C. and Dister A., editors, Actes JADT 2004: 7es Journées internationales d'Analyse statistique des Données Textuelles, pp. 250-261.
Carli R., Paniccia R.M., Giovagnoli F., Carbone A. and Bucci F. (2016). Emotional Textual Analysis. In L. A. Jason and D. S. Glenwick, editors, Handbook of methodological approaches to community-based research: qualitative, quantitative, and mixed methods. Oxford University Press.
Cordella B., Greco F. and Raso A. (2014). Lavorare con corpus di piccole dimensioni in psicologia clinica: una proposta per la preparazione e l'analisi dei dati. In Nee E., Daube M., Valette M. and Fleury S., editors, Actes JADT 2014 (12es Journées internationales d'Analyse statistique des Données Textuelles, Paris, France), pp. 173-184.
Fornari F. (1976). Simbolo e codice: dal processo psicoanalitico all'analisi istituzionale. Feltrinelli.
Greco F. (2016). Integrare la disabilità. Una metodologia interdisciplinare per leggere il cambiamento culturale. Franco Angeli.
ISTAT (2017). Rapporto annuale 2017. ISTAT.
Lancia F. (2017). User's Manual: Tools for text analysis. T-Lab version Plus 2017.
Lebart L. and Salem A. (1994). Statistique Textuelle. Dunod.
Matte Blanco I. (1981). L'inconscio come insiemi infiniti: saggio sulla bi-logica. Einaudi.
McKinsey & Company (2014). Studio ergo Lavoro: come facilitare la transizione scuola-lavoro per ridurre in modo strutturale la disoccupazione giovanile in Italia. McKinsey & Company, https://www.mckinsey.it/file/2785/download?token=a3VfesjU.
Moscovici S. (2005). Le rappresentazioni sociali. Il Mulino.
Paniccia R.M., Giovagnoli F., Giuliano S., Terenzi V., Bonavita V., Bucci F., Dolcetti F., Scalabrella F. and Carli R. (2009). Cultura locale e soddisfazione degli studenti di psicologia: una indagine sul corso di laurea "intervento clinico" alla Facoltà di Psicologia 1 dell'Università di Roma "Sapienza". Rivista di Psicologia Clinica, Supplemento n. 1: 1-49.
Salvatore S. and Freda M. F. (2011). Affect, unconscious and sensemaking: a psychodynamic, semiotic and dialogic model. New Ideas in Psychology, vol. 29, pp. 119-135.
Savaresi S. M. and Boley D. L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.

Profiling Elena Ferrante: a Look Beyond Novels

Michele A. Cortelazzo (1), George K. Mikros (2), Arjuna Tuzzi (3)
(1) University of Padova – cortmic@unipd.it
(2) National and Kapodistrian University of Athens – gmikros@isll.uoa.gr
(3) University of Padova – arjuna.tuzzi@unipd.it

Abstract

Elena Ferrante is a rather peculiar editorial and journalistic phenomenon: today she enjoys a wide international audience, yet there is surprisingly little scientific literature discussing her works. Since Elena Ferrante is the pseudonym of an anonymous writer, some investigators have already pursued her real identity and, at the moment, the main suspects that have emerged are Domenico Starnone, Marcella Marmo and Anita Raja.
The corpora collected to analyse Elena Ferrante's works and compare them with those of other authors are usually composed of novels; however, Marcella Marmo and Anita Raja are not novelists, and their works do not belong to genres comparable with novels. One of Elena Ferrante's books, La Frantumaglia, makes it possible to collect corpora of texts of different genres (letters, essays, interviews, etc.), which may include texts by authors never taken into consideration in research based on novelists. Nevertheless, these texts raise specific questions concerning their exploitability in traditional authorship attribution procedures, owing to their limited size. This study works on a corpus of texts other than novels by means of a machine learning approach, within the framework of methods for authorship attribution and profiling.

Keywords: authorship attribution, machine learning, profiling, stylometry, support vector machine

1. Introduction

In previous works, the novels signed by Elena Ferrante have been studied within the panorama of contemporary Italian literature, and these studies have shown that this author has a peculiar writing style with relevant individual traits. Moreover, in previous investigations the Italian writer showing the highest level of similarity with Elena Ferrante was Domenico Starnone (Galella, 2005; 2006; Gatto, 2016; Cortelazzo and Tuzzi, 2017; Tuzzi and Cortelazzo, 2018). In this study we aim to test further hypotheses and look at texts that do not belong to the genre of the novel. In this way we can consider new candidates for authorship attribution and profiling experiments, i.e. writers who are not exclusively novelists. A first reference can be made to Marcella Marmo and Anita Raja, two Italian women who have been suspected of being the hand hiding behind the pen name of Elena Ferrante, respectively by Marco Santagata (2016) and by Claudio Gatti (2016).
The corpus collected for this new study focuses on three main suspects (Marcella Marmo, Anita Raja, Domenico Starnone) and includes further suspected authors (Goffredo Fofi, Mario Martone, Valeria Parrella, Francesco Piccolo), authors who in previous analyses showed some common traits with Elena Ferrante's works (Gianrico Carofiglio, Clara Sereni), an author who provocatively claimed to be Elena Ferrante (Laura Buffoni), and members of the E/O publishing house (Sandro Ferri, Sandra Ozzola and the editorial board that is supposed to be the collective editor of the publisher's web pages).

2. Corpus

The corpus includes letters, interviews and further material written by different authors (Table 1) that can be compared with the texts included in the book La Frantumaglia by Elena Ferrante (2016). An innovative perspective has been adopted for analysing the texts: a machine learning (ML) approach based on a support vector machine (SVM) method that takes into consideration 13 authors for a classical authorship attribution (AA) task and different variables (gender, age, geographical area) for profiling tasks. The whole corpus adopted for this study is composed of 113 texts and includes 143,695 word tokens and 19,020 word types. In the classical ML perspective, the corpus is arranged into two groups: a "training set" and a "testing set". The training corpus (Table 1) includes 86 texts (87,458 word tokens), 78 written by 12 authors and 8 by a collective subject (E/O) that represents the editorial staff of the E/O publishing house. The corpus is balanced in terms of gender and partly balanced for age and geographical area (Table 2). Information about gender and age is not available (n.a.) for E/O, as it is presumed to be a group. The testing corpus includes 27 texts (6 essays, 7 interviews, 14 letters, for a total of 56,237 word tokens) signed by Elena Ferrante and collected in her book La Frantumaglia. Five texts are chapters of one large essay written as an answer to Giuliana Olivero and Camilla Valletti's questions (Ferrante, 2016).

Table 1: Authors and categories of texts included in the training corpus.

Author                 texts    tokens
Laura Buffoni              3     4,477
Gianrico Carofiglio        6     4,940
E/O                        8     3,955
Sandro Ferri               2     3,838
Goffredo Fofi              9     7,378
Marcella Marmo             5    12,991
Mario Martone             10     9,320
Sandra Ozzola              4     1,879
Valeria Parrella           7     4,676
Francesco Piccolo          6     5,529
Anita Raja                 4    13,617
Clara Sereni               2     2,271
Domenico Starnone         20    12,587
Tot                       86    87,458

Category     texts    tokens
article         53    42,124
essay            9    22,926
interview       12    15,480
letter           4     1,611
web              8     5,317
Tot             86    87,458

Since most stylometric measures and linguistic features are heavily influenced by text size, we decided to split our texts into equal-sized chunks. Both the training and the testing corpus were segmented into chunks of 200 words. After the chunking procedure, the training corpus inflated from 86 texts to 386 chunks of 200 words, and the testing corpus from 27 texts to 259 chunks of 200 word tokens. This enlargement also had the positive effect of making our sample space larger, giving us the opportunity to use a wider spectrum of linguistic features.

Table 2: Descriptive variables of the texts included in the training corpus.

Gender         authors    texts    tokens
n.a.                 1        8     3,955
f                    6       25    39,911
m                    6       53    43,592
Tot                 13       86    87,458

Age            authors    texts    tokens
n.a.                 1        8     3,955
>60 (old)            7       46    54,561
≤60 (young)          5       32    28,942
Tot                 13       86    87,458

Naples area    authors    texts    tokens
Naples               6       52    58,720
NoNaples             7       34    28,738
Tot                 13       86    87,458
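The 200-word chunking step described above can be sketched as follows; the whitespace tokenizer is a simplifying assumption, not necessarily the segmentation used by the authors.

def chunk_text(text, size=200):
    # split a text into consecutive fixed-size chunks; the trailing
    # remainder is dropped so that every chunk has the same length
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]

Dropping the remainder keeps all samples strictly comparable, at the cost of discarding up to 199 words per text.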
Table 2. Descriptive variables of the texts included in the training corpus

  Gender  authors  texts  tokens    Age          authors  texts  tokens    Naples area  authors  texts  tokens
  n.a.       1       8     3,955    n.a.            1       8     3,955    Naples           6      52    58,720
  f          6      25    39,911    >60 (old)       7      46    54,561    not Naples       7      34    28,738
  m          6      53    43,592    <=60 (young)    5      32    28,942
  Tot       13      86    87,458    Tot            13      86    87,458    Tot             13      86    87,458

3. Method
To investigate our research aims, we developed a feature-rich document representation model comprising the following feature groups: 1) Author Multilevel N-gram Profiles (AMNP): 1,500 features, 500 for each n-gram category (2-grams and 3-grams at the character level, and 2-grams at the word level); 2) Most Frequent Words in the corpus (MFW): 500 features. The first feature group (AMNP) provides a robust document representation which is language independent and able to capture various aspects of stylistic textual information. It has been used effectively in authorship attribution problems (Mikros et Perifanos, 2011; 2013) and in gender identification on longer texts (e.g. blog posts, cf. Mikros, 2013). AMNP consists of n-grams of increasing order at both the character and the word level. Since character and word n-grams capture different linguistic entities and function complementarily, we constructed a combined profile of character 2-grams and 3-grams and word 2-grams. For each n-gram we calculated its normalized frequency in the corpus and included the 500 most frequent entries per category, resulting in a combined vector of 1,500 features. The second feature group (MFW) can be considered a classic of the stylometric tradition: it rests on the idea that the most frequent words belong to the class of function words and are beyond the conscious control of the author, thus revealing his or her stylometric fingerprint. In this study we used the 500 most frequent words of the corpus. The features described above were used to train a machine learning classification algorithm, Support Vector Machines (SVM; Vapnik, 1995), in both a standard authorship classification task and three different author profiling tasks (author's gender, age, and geographical area). SVM is considered a state-of-the-art algorithm for text classification: it constructs hyperplanes in the feature space in order to provide a linear solution to the classification problem. In our trials we experimented with various kernels and ended up choosing the polynomial one, as it was the most accurate on our dataset. All the statistical models developed were evaluated using 10-fold cross-validation (90% training set – 10% testing set), and the reported accuracies are the means of the accuracies obtained in each fold. Since the feature space was sparse, we eliminated all features showing a variance close to zero, using the two following rules: the percentage of unique values was less than 20%, and the ratio of the most frequent to the second most frequent value was greater than 20. The near-zero variance feature removal shrank the number of employed features by 47.4% (from the initial 2,000 available features we kept 1,052).
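A minimal sketch of such a pipeline, assuming scikit-learn (the paper does not name its software; the L1-normalized term frequencies stand in for the paper's "normalized frequency" weighting, and the near-zero-variance filter, whose rules echo R's caret package, is omitted here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.svm import SVC

    def ngram_block(analyzer, n):
        # 500 most frequent n-grams, weighted by normalized frequency.
        return TfidfVectorizer(analyzer=analyzer, ngram_range=(n, n),
                               use_idf=False, norm="l1", max_features=500)

    features = FeatureUnion([
        ("char2", ngram_block("char", 2)),   # AMNP: character 2-grams
        ("char3", ngram_block("char", 3)),   # AMNP: character 3-grams
        ("word2", ngram_block("word", 2)),   # AMNP: word 2-grams
        ("mfw", ngram_block("word", 1)),     # 500 most frequent words
    ])
    pipe = Pipeline([("features", features),
                     ("svm", SVC(kernel="poly"))])  # polynomial kernel

    # With the real data, `texts` would be the 386 training chunks and
    # `authors` their 13 labels; the paper evaluates via 10-fold CV, e.g.
    # cross_val_score(pipe, texts, authors, cv=10).mean()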
4. Results
4.1. Authorship Attribution Results
For the standard authorship classification task (tab. 3), we first worked with the whole corpus as training dataset and obtained an average accuracy of 0.7098 (71%). Among the set of 13 candidates included in the corpus, a large share of the testing text chunks were attributed to Domenico Starnone (32%), Anita Raja (21%) and Mario Martone (21%).

Table 3. Attribution of the text chunks included in the testing corpus (whole and reduced corpus)

  Whole corpus                       Reduced corpus
  Author       chunks     %          Author         chunks     %
  Starnone        84     32%         Starnone          115    44%
  Raja            55     21%         Raja               73    28%
  Martone         55     21%         Martone            39    15%
  E/O             18      7%         E/O enlarged       32    12%
  Buffoni         16      6%         Tot               259   100%
  Parrella        15      6%
  Fofi             7      3%
  Piccolo          3      1%
  Carofiglio       2      1%
  Ferri            2      1%
  Marmo            2      1%
  Ozzola           0      0%
  Tot            259    100%

We deemed it useful to reduce the candidates to Starnone, Raja and Martone and to rearrange the E/O collective author into a new, enlarged version of the E/O group, pooling together all the members of the E/O publishing house (Sandro Ferri, Sandra Ozzola and the E/O staff). As an effect of this selection we obtained an improvement in the performance of the ML algorithm (+13%), with the average accuracy rising to 0.8408 (84%). With reference to this reduced version of the training corpus, which includes only four candidates, again most text chunks seem to belong to Domenico Starnone (44%) and Anita Raja (28%).

Table 4. Cross-classification matrix for the authorship attribution task (rows: whole corpus; columns: reduced corpus)

               Starnone   Raja   Martone   E/O enlarged    Tot
  Starnone         77       2        5           0           84
  Raja              3      48        0           4           55
  Martone          14       2       30           9           55
  E/O               1       2        0          15           18
  Buffoni           6       5        2           3           16
  Parrella          8       7        0           0           15
  Fofi              4       3        0           0            7
  Piccolo           2       0        0           1            3
  Carofiglio        0       2        0           0            2
  Ferri             0       0        2           0            2
  Marmo             0       2        0           0            2
  Ozzola            0       0        0           0            0
  Tot             115      73       39          32          259

From a cross-comparison of the results achieved with the whole and reduced versions of the training corpus (tab. 4), we observed that the text chunks attributed to Domenico Starnone and Anita Raja proved more stable and consistent, compared to the more unstable and weak role of Mario Martone. The existence of an action of the publishing house was confirmed in both versions, although in some cases a confusion of the E/O editors with Starnone's and Raja's hands is somewhat visible.

4.2. Profiling Results
The results achieved in the profiling tasks are more schematic, since the algorithm works with simpler dichotomous variables (tab. 5). With respect to gender, the ML algorithm obtained an average accuracy of 0.8000 (80%), and the automatic classification of the text chunks of the testing corpus suggested that the fragments of La Frantumaglia may come from different hands: at least a man (54%) and a woman (46%). Compared with gender profiling, the ML algorithm achieved a similar accuracy in the classification by age (0.8027, 80%) and geographical area (0.7850, 78%), and for the most part the text chunks appeared to be written by an older author (76%) from Naples (90%).

Table 5. Profiling of the text chunks included in the testing corpus

  Gender  chunks     %       Age          chunks     %       Naples area  chunks     %
  m         141     54%      >60 (old)      197     76%      Naples          233    90%
  f         118     46%      <=60 (young)    62     24%      not Naples       26    10%
  Tot       259    100%      Tot            259    100%      Tot             259   100%

5. Discussion and conclusions
Among the limitations and constraints of this method, first and foremost we have to take into account that the corpus mixes different genres (essays, interviews, newspaper articles, letters), and this surely affects our results. Texts show similarities when they are written by the same author or when they belong to the same genre, and these two effects are not easy to disentangle in our corpus.
Secondly, when the SVM is called to assign testing chunks to authors and/or categories, it always produces an attribution resulting from the formula generated by the ML algorithm (in other words, it never answers "do not know"). Results depend both on the quality of the texts and on the range of options offered during the training phase. As a consequence, we have to refer to the accuracy of the model and consider the classification as the best attribution among the options given by the set of reasonable candidates and available categories. Thirdly, La Frantumaglia represents an interesting set of texts signed by Elena Ferrante that are not ascribed to the genre of novels, and it enables new analyses comparing and contrasting the author's writing style with that of authors who are not strictly novelists. Nevertheless, we cannot be sure that all the texts included in La Frantumaglia were written by the same hand and, moreover, we do not know whether these texts were written by the author who actually wrote the novels signed by Elena Ferrante. From the authorship attribution viewpoint, more than one hand emerged as likely, and we can formulate some hypotheses. If we consider only the main suspects mentioned in the Introduction, Domenico Starnone and Anita Raja are confirmed; on the contrary, Marcella Marmo seems implausible. Mario Martone's role is an interesting suggestion, since the similarities of chunks taken from La Frantumaglia with his texts might be the indirect outcome of direct interactions between Martone and Ferrante (e.g. letters and interviews in which they both speak about the movie L'amore molesto). The role of the E/O staff is also intriguing, as it is easy to imagine the effect on writing style of one or more editors working as proofreaders, copyreaders and ghostwriters when Elena Ferrante has to answer the many interviews and letters collected by the publishing house. From the profiling experiments a composite picture of La Frantumaglia emerges: the procedure reveals once more the existence of different hands, suggests the involvement of at least a man and a woman, and draws the portrait of an author (single or collective) from Naples who is over 60 years old. Does the mystery about Elena Ferrante's work remain a mystery?

Acknowledgements
We thank Arianna Menin for providing us with the corpus of texts of La Frantumaglia collected for her three-year B.A. thesis in Communication (University of Padova, a.y. 2016/2017, supervisor prof. Arjuna Tuzzi).

References
Cortelazzo, M.A. and Tuzzi, A. (2017). Sulle tracce di Elena Ferrante: questioni di metodo e primi risultati. In Palumbo, G. (ed.), Testi, corpora, confronti interlinguistici: approcci qualitativi e quantitativi. Trieste: EUT – Edizioni Università di Trieste, pp. 11-25.
Ferrante, E. (2016). La Frantumaglia. Roma: E/O.
Galella, L. (2005). Ferrante-Starnone. Un amore molesto in via Gemito. La Stampa, 16 January 2005, p. 27.
Galella, L. (2006). Ferrante è Starnone. Parola di computer. L'Unità, 23 November 2006.
Gatti, C. (2016). Elena Ferrante, le «tracce» dell'autrice identificata. Il Sole 24 Ore – Domenica, 2 October 2016, pp. 1-2.
Gatto, S. (2016). Una biografia, due autofiction. Ferrante-Starnone: cancellare le tracce. Lo Specchio di carta. Osservatorio sul romanzo italiano contemporaneo, 22 October 2016. www.lospecchiodicarta.it
Mikros, G.K. (2013). Authorship Attribution and Gender Identification in Greek Blogs. In Obradović, I., Kelih, E. and Köhler, R.
(eds.), Selected Papers of the VIIIth International Conference on Quantitative Linguistics (QUALICO), Belgrade, Serbia, April 16-19, 2012. Belgrade: Academic Mind, pp. 21-32.
Mikros, G.K. and Perifanos, K. (2011). Authorship identification in large email collections: Experiments using features that belong to different linguistic levels. In Proceedings of the PAN 2011 Lab, Uncovering Plagiarism, Authorship, and Social Software Misuse, held in conjunction with the CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation, 19-22 September 2011, Amsterdam.
Mikros, G.K. and Perifanos, K. (2013). Authorship attribution in Greek tweets using multilevel author's n-gram profiles. In Hovy, E., Markman, V., Martell, C.H. and Uthus, D. (eds.), Papers from the 2013 AAAI Spring Symposium "Analyzing Microtext", 25-27 March 2013, Stanford, California. Palo Alto, California: AAAI Press, pp. 17-23.
Santagata, M. (2016). Elena Ferrante è …. La lettura – Corriere della Sera, 13 March 2016, pp. 2-5.
Tuzzi, A. and Cortelazzo, M.A. (2018). What is Elena Ferrante? A Comparative Analysis of a Secretive Bestselling Italian Writer. Digital Scholarship in the Humanities (online first version).
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer-Verlag.

Word Embeddings: a Powerful Tool for Innovative Statistics at Istat
Fabrizio De Fausti, Massimo De Cubellis, Diego Zardetto
ISTAT – Italian National Institute of Statistics – (defausti, decubell, zardetto)@istat.it

Abstract 1
In recent years, word embedding models have proven useful in many Natural Language Processing problems. These models are generated by unsupervised learning algorithms (like Word2Vec and GloVe) trained on very large text corpora. Their main purpose is to map words to vectors of a metric space in such a way that the resulting numeric representation of input texts effectively captures and preserves a wide range of semantic and syntactic relationships between words. In this paper we discuss word embedding models generated from huge corpora of raw text in Italian, and we propose an original graph-based methodology to explore, analyze and visualize the structure of the learned embedding spaces.

Abstract 2
This work illustrates the potential of Word Embedding models in the analysis of large collections of textual data and proposes an original graph-based method for exploring the semantic structure captured by the models.

Keywords: Word Embeddings, Word2Vec, Graphs, Text Summarization, Italian Tweets, NLP.

1. Introduction
Word embedding models represent a powerful tool whose output can feed subsequent machine learning tasks, like text classification, topic modeling and document similarity. This work shows how we built, tested and used word embedding models (based on the Word2Vec algorithm, see Section 2.1) to achieve the following objectives:
• Istat is currently collecting streaming Twitter data on a large scale. Word embedding models helped us devise domain-specific 'filters', namely sets of keywords that we used to filter out off-topic tweets with respect to the intended statistical production goal. Here we show the case of the so-called "Europe filter", meant to measure people's mood about the European Union.
• Istat is currently exploiting textual data automatically scraped from the websites of Italian enterprises in order to predict whether or not they perform e-commerce.
Given the huge corpus of noisy and unstructured texts derived from this web-scraping procedure, word embedding models allowed us: (i) to automatically create an "e-commerce pseudo-ontology" and to smartly summarize the input texts; (ii) to encode the summarized texts into a rich numeric representation in order to feed a Deep Learning classifier.

2. Methodology
In recent years, new successful algorithms for natural language modeling have been proposed, based on Neural Networks (e.g. Word2Vec and GloVe). Starting from very large corpora of raw text, these algorithms are able to create models that map words to low-dimensional vector spaces, called word embeddings (Mikolov et al., 2013a). Although these algorithms do not rely on any linguistic domain knowledge, nor on handcrafted syntactic and semantic relationships between words, they are surprisingly able to learn such relationships from raw data. Indeed, words that are strongly related from a syntactic and/or semantic point of view are mapped to vectors that are almost parallel to each other; conversely, words that are syntactically and/or semantically loosely related are mapped to nearly perpendicular vectors. Moreover, these models perform amazingly well when it comes to solving analogies between words, just as a human would. For example, if one asks a trained word embedding model «which word X completes the analogy [ 'Paris' : 'France' = 'Madrid' : X ]», the answer will very likely be X = 'Spain'. We mention here only one type of relationship (capital-nation), but word embedding models are able to capture a wide variety of relationships, such as male-female, singular-plural, superlative-comparative, synonym-antonym, politician-party, etc.

2.1 Word2Vec
Word2Vec (Mikolov et al., 2013b) is one of the most influential word embedding algorithms. It consists of a neural network trained to solve a predictive problem according to one of two approaches: predicting the central word given the other words of a context (Cbow), or predicting the words of the context given the central word (Skipgram). At the end of training, the predictive ability of the network is not used; instead, its internal structure (the weights of the network) is exploited to represent the coordinates of each word of the dictionary in the embedding space. While a large text corpus is the main input to Word2Vec, the algorithm also exposes several hyperparameters which can be tuned to improve the quality of the learned model. Some scholars (e.g. Levy et al., 2015) consider these hyperparameters key to understanding Word2Vec's superiority over previous language modeling techniques. The main hyperparameters of Word2Vec are:
• Embedding space dimension: the dimension of the vector space to which the words of the corpus are mapped;
• Window size: the width of the sliding window used to process the corpus, which defines how large the context is;
• Iterations: how many times the weights of the neural network are updated during training;
• Learning model: the approach used to train the neural network, either Cbow or Skipgram.
Of course, further factors affect the performance of a Word2Vec model:
• Size of the corpus: bigger corpora perform better than small ones;
• Quality of the corpus: very noisy, fragmented and poorly curated texts generally produce lower-quality embedding spaces.
At the end of the training phase, the quality of the learned word embedding model can be assessed through standard test functions.
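As an illustration, training such a model might look as follows with the open-source gensim library (an assumption: the paper does not state which implementation was used; the file names and hyperparameter values are merely examples):

    # Requires: pip install gensim (v4 API assumed)
    from gensim.models import Word2Vec

    # One pre-tokenized document (e.g. a tweet) per line of a text file.
    with open("tweets.txt", encoding="utf8") as f:
        sentences = [line.split() for line in f]

    model = Word2Vec(
        sentences,
        vector_size=200,   # embedding space dimension
        window=8,          # context window size
        epochs=15,         # training iterations
        sg=0,              # 0 = Cbow, 1 = Skipgram
    )
    model.wv.save("tweets.kv")  # keep only the learned word vectors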
Classical examples of such test functions are the word-similarity and word-analogy tests (see e.g. Pennington et al., 2014).

2.2 Exploring and visualizing big embedding models through graphs
As sketched in Section 2, word embedding algorithms transform words into vectors of a low-dimensional metric space. The dimension of this numeric space is usually set to values in the range 100-300 (see e.g. Mikolov et al., 2013a). When the input corpora are huge, taking into account inflected word forms, the output embedding model can contain hundreds of thousands of vectors. As a consequence, the full structure of the embedding model is very hard to analyze. Exploring and visualizing such models requires (i) reducing the dimensionality of the embedding space, and (ii) focusing on just a subset of vectors, namely those derived from the words most relevant for the analysis at hand. While traditional solutions exist for the first task, like PCA and t-SNE (van der Maaten et Hinton, 2008), no standard methods are available for the second. We propose here a new technique, based on graphs (Gibbons, 1985), that simultaneously addresses both needs: it selects just a subset of relevant words, adopting a filtering criterion based on their semantic proximity, and allows visualizing the resulting sub-model as a two-dimensional graph.

2.3 Building the graphs
Given a "node" vector/word v in the embedding space, let us define the base graph of width W, G_W(v). To build G_W(v), we connect v to its W nearest vectors/words in the embedding space (cosine distance is used). The base graph will thus have W + 1 nodes. Node v can be either the image of an actual word w, i.e. v = V(w), or the vector resulting from the sum of multiple words, say w1 and w2, i.e. v = V(w1) + V(w2). The idea is that, within the embedding space, the sum of word vectors can be exploited to disambiguate the meaning of polysemous words. An example is provided in Table 1, where the 5 words closest to the vector V('rome') are reported in the left panel, and the 5 words closest to the vector V('rome') + V('colosseum') + V('ancient') are reported in the right panel. Evidently, the addition of the words 'colosseum' and 'ancient' to the polysemous word 'rome' moves the semantic area explored by the base graph from a geographical to a historical sense.

Table 1. Word disambiguation by sum of vectors: the polysemous word is 'rome'.

  Closest 5 words   Cosine         Closest 5 words to V(rome)     Cosine
  to V(rome)        similarity     + V(colosseum) + V(ancient)    similarity
  turin             0.6818         roman                          0.5822
  palermo           0.6377         archeological                  0.5318
  naples            0.6212         pompei                         0.5250
  milan             0.6129         trastevere                     0.5217
  bologna           0.5857         trajan                         0.5189

Our approach builds a full output graph by iteratively combining N base graphs G_W. We devised three different methods to combine base graphs according to different exploration strategies, called Geometric, Linear and Geometric-Oriented; the corresponding pseudo-codes are provided in Table 2. Besides the width parameter W and the number of iterations N, all three methods require as input a set of seed words [seeds] to define the starting point for the exploration of the embedding model.

Table 2. Pseudo-codes of the proposed graph generation methods. Function find_leaves() returns all the nodes with zero outdegree; function shortestPath() calculates the shortest path between two nodes.
    Geometric([seeds], N, W):
        v = V(seed1) + V(seed2) + ...
        G_W(v)
        for iteration in [1, ..., N]:
            for leaf in find_leaves():
                G_W(V(leaf))

    Linear([seeds], N, W):
        v = V(seed1) + V(seed2) + ...
        G_W(v)
        for iteration in [1, ..., N]:
            for leaf in find_leaves():
                virtualNode_leaf = 0
                addEdge(leaf, virtualNode_leaf)
                for node in shortestPath(v, leaf):
                    virtualNode_leaf = virtualNode_leaf + node
                G_W(virtualNode_leaf)

    Geometric-Oriented([seeds], N, W):
        v = V(seed1) + V(seed2) + ...
        G_W(v)
        for i in [1, ..., N]:
            virtualNode_i = 0
            for leaf in find_leaves():
                addEdge(leaf, virtualNode_i)
                virtualNode_i = virtualNode_i + V(leaf)
            G_W(virtualNode_i)

As will be shown in Section 3, the Geometric method tends to expand the exploration range very quickly, rapidly losing the initial semantic focus provided by the seed words; the Linear method stays much more focused, but explores just a narrow sub-model; the Geometric-Oriented method provides a satisfactory compromise between the other two.

3. Application
3.1 Building word embedding models on large corpora of Italian tweets
Istat is currently collecting streaming Twitter data on a large scale. Italian tweets are captured provided that they pass at least one active 'filter'. Filters are simply sets of keywords deemed relevant for specific statistical production goals. For instance, the 'Social Mood on Economy' filter involves 60 keywords borrowed from the questionnaire of the Italian Consumer Confidence Survey, and collects about 40,000 tweets per day. We used a large collection of about 100 million Italian tweets to train Word2Vec with different settings of hyperparameters, thereby generating different embedding models. We subsequently analyzed the obtained models and tested their quality as discussed in Section 3.1.2. This way we managed to identify the best-performing set of hyperparameters to be used for the applications described in Sections 3.2 and 3.3.

3.1.1 Process
The data processing pipeline we implemented consists of the following steps:
• collection of Italian tweets through Twitter's streaming API as JSON files;
• parsing of the JSON files and storage of the tweets in a relational database;
• extraction from the database of the textual content of about 100 million tweets and export to a raw text file (corpus);
• preprocessing of the raw text (text cleaning and normalization);
• setting of the Word2Vec hyperparameters;
• training of Word2Vec on the tweets' corpus;
• testing of the learned word embedding model.

3.1.2 Benchmark and selection of the best hyperparameters
With the aim of identifying the best hyperparameters, we customized benchmark word-analogy tests contributed by Stanford University (Pennington et al., 2014), translating them into Italian and adding new word analogies involving specific terms of the Economics field. Note that our tests involved many groups of analogies, encoding a wide range of relationships between words, of both the syntactic and the semantic kind. As a measure of model goodness, we adopted the so-called "Top-1 accuracy" criterion: an analogy [a : b = c : x] is successfully solved by the learned model if and only if the closest (i.e. Top-1) embedding vector to V(c) - V(a) + V(b) is exactly V(x). We evaluated many output models generated by diverse settings of hyperparameters against our customized word-analogy tests, and eventually found the following optimal values: embedding space dimension = 200, window size = 8, iterations = 15, learning model = Cbow.
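A sketch of this Top-1 check, again assuming gensim vectors saved as above (the word lists are illustrative):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load("tweets.kv")

    def solves_analogy(a, b, c, x):
        # Closest vector to V(c) - V(a) + V(b), input words excluded.
        best, _ = wv.most_similar(positive=[b, c], negative=[a], topn=1)[0]
        return best == x

    analogies = [("parigi", "francia", "madrid", "spagna")]  # (a, b, c, x)
    top1_accuracy = sum(solves_analogy(*t) for t in analogies) / len(analogies)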
3.2 Design of the "Europe" filter
As already mentioned in Section 3.1, Istat collects only the Italian tweets that match at least one active filter. So far, the keywords defining the filters have been designed by subject-matter experts. In this section, instead, we illustrate how word embedding models can be exploited to develop new filters automatically, in a data-driven way. The idea is to leverage our graph-based exploration methodology to select the best keywords, starting from a few relevant seed words. In particular, on the occasion of the 60th anniversary of the Treaties of Rome, our objective was to capture the sentiment of Italian Twitter users about the European Union. Figures 1 and 2 show the graphs resulting from the Geometric-Oriented and Geometric methods, respectively. Note that both graphs were generated using the same seed words, namely: 'europa', 'ue', 'bruxelles', 'europea', 'unione', 'euro'. The Geometric-Oriented graph appears more compact, and its words are indeed closely related to the semantic area of the seed words. The Geometric graph, instead, finds many more words, which are clearly grouped into coherent clusters and represent a valuable semantic enrichment of the original seeds. Given its richness, this second graph was considered by subject-matter experts a very good candidate to play the role of "Europe" filter.

Figure 1: Geometric-Oriented(['europa', 'ue', 'bruxelles', 'europea', 'unione', 'euro'], 8, 8)

3.3 Text Summarization and Encoding
One ongoing Istat Big Data project aims at exploiting textual data automatically scraped from the websites of Italian enterprises in order to predict whether or not they perform e-commerce. To address this task, Deep Learning techniques are being used. Since the input scraped texts are huge and Deep Learning algorithms are computationally intensive, a preliminary text summarization step is in order. Besides increasing efficiency, the summarization algorithm should hopefully improve accuracy by increasing the signal-to-noise ratio of the input data. Word embedding models allowed us to achieve this goal with a purely data-driven approach. To guide the summarization, we leveraged word embeddings trained on the whole web-scraped corpus. We used the Linear graph illustrated in Figure 3 to select a set of marker words with high discriminative power for the detection of e-commerce, adopting as initial seeds the words 'carrello', 'shopping', 'online'. (These marker words constitute what we called an "e-commerce pseudo-ontology" in the Introduction.) To summarize the texts, only the input sentences containing marker words were retained. This way, we obtained a 92.2% reduction of the original noisy text, along with a substantial improvement in the performance of the Deep Learning classifier (+20%, as compared to marker words defined by subject-matter experts). Lastly, we relied again on word embeddings to encode the summarized texts and feed the Deep Learning classifier. Once more, our experiments show that word embedding models outperform more traditional text encoding approaches, like bag-of-words.

Figure 2: Geometric(['europa', 'ue', 'bruxelles', 'europea', 'unione', 'euro'], 3, 8)

Figure 3: Linear(['shopping', 'online', 'carrello'], 11, 8)
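A minimal sketch of the marker-word summarization step described above (the marker set below is a short stand-in for the list actually selected through the Linear graph):

    import re

    markers = {"carrello", "shopping", "online"}  # hypothetical short list

    def summarize(text):
        """Keep only the sentences containing at least one marker word."""
        sentences = re.split(r"(?<=[.!?])\s+", text)
        return " ".join(s for s in sentences
                        if markers & set(re.findall(r"\w+", s.lower())))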
4. Conclusions
The techniques for dealing with large corpora of texts can greatly benefit from recent technological advancements, and Word Embeddings are an example of this opportunity. Extensive evidence shows that Word Embedding models are indeed superior to more traditional text encoding methods like, e.g., bag-of-words. Ongoing work on textual Big Data at Istat makes extensive use of these new approaches, with very promising results.

References
Mikolov, T., Yih, W. and Zweig, G. (2013a). Linguistic Regularities in Continuous Space Word Representations. Proceedings of NAACL-HLT 2013, pp. 746-751.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013b). Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781.
Levy, O., Goldberg, Y. and Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3: 211-225.
Pennington, J., Socher, R. and Manning, C.D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of EMNLP 2014, pp. 1532-1543.
van der Maaten, L.J.P. and Hinton, G.E. (2008). Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, vol. 9: 2579-2605.
Gibbons, A. (1985). Algorithmic Graph Theory. Cambridge University Press.

Analisi di dati d'impresa disponibili online: un esempio di data science tratto dalla realtà economica dei siti di e-commerce
Viviana De Giorgi, Chiara Gnesi
Istat – degiorgi@istat.it; gnesi@istat.it

Abstract
This work describes the process of extracting, organising and analysing detailed information on firms that trade electronic equipment on the Alibaba.com site. The first part concerns the translation of unstructured information into variables organised in a statistical database, using size classes, indices, indicators and classifications. A company-product matching is realised by encoding a textual variable with an international classification, and an automated analysis is applied in order to explore, describe and analyse the corpus retrieved from the Internet. In the second part, a descriptive and econometric analysis shows that the demographic and economic information on enterprises available on Alibaba.com is highly significant for competitiveness on foreign markets.

Keywords: encoding, classification, textual analysis, regression model.

Sommario
This work develops a model to process, organise and analyse detailed information on the companies trading electronic equipment on the Alibaba.com portal. The first part concerns the process of transforming unstructured information into variables organised in a statistical database through the use of size classes, indices, indicators and classifications. A company-product matching was realised using an international classification, through the coding of a textual variable on which an automated analysis was applied in order to explore, describe and analyse the textual corpus retrieved from the Internet. The second part presents a descriptive and econometric analysis, whose results show that the Chinese portal carries demographic and economic information on companies that is highly significant for competitiveness on foreign markets.

Parole chiave: coding, classification, textual analysis, regression.

1. Introduction
This work grew out of ideas offered during the lectures of a university Master in Data Science (University of Rome Tor Vergata, Department of Enterprise Engineering "Mario Lucertini", academic year 2015/2016; we thank Francesco Borrelli, Valentina Talucci and Domenica Fioredistella Iezzi for their useful suggestions) and focuses on techniques for processing, managing and analysing data coming from freely accessible online sources.
L’approccio adottato è quello della singola impresa che vuole migliorare la propria competitività nel mercato di riferimento, analizzando i dati generati dai processi aziendali nel settore in cui è presente o mira a posizionarsi. A tal fine sono preziose le informazioni dettagliate e aggiornate sui volumi prodotti, transazioni, struttura e demografia delle imprese concorrenti, presenti nei siti di commercio elettronico. Il presente lavoro è stato sviluppato utilizzando i dati estratti attraverso un’intensa attività di web scraping dal portale Alibaba.com, con riferimento alle imprese operanti nel settore delle apparecchiature elettroniche. 2. Dai dati destrutturati alle variabili statistiche: costruzione del database Nel processo di trasformazione dell’informazione destrutturata acquisita online in variabili statistiche, un ruolo centrale riveste laclassificazione delle imprese a partire dal principale prodotto commercializzato. La variabile testuale – che corrisponde alla descrizione non codificata del prodotto commercializzato dalla società – è stata codificata secondo una classificazione di attività economica standardizzata a livello internazionale. Si è scelto l’elenco Prodcom con riferimento alle divisioni 26, 27 e 28, per un totale di 989 sottocategorie di prodotti3. L’attribuzione del codice Prodcom alla singola impresa è stata effettuata implementando un sistema di codifica ad hoc4 strutturato in step successivi. La fase iniziale consiste nella normalizzazione dei testi attraverso lo sviluppo Master universitario in Data Science, Università Tor Vergata, Dipartimento di Ingegneria dell’impresa "Mario Lucertini", anno accademico 2015/2016. Si ringraziano Francesco Borrelli, Valentina Talucci e Domenica Fioredistella Iezzi per gli utili suggerimenti. 2 L’acquisizione dei dati è stata effettuata nell’arco temporale che va dal 26 novembre 2016 al 7 gennaio 2017 dalla dott. Antonella Miele attraverso una attività di web scraping. I dati utilizzati sono relativi a 2.349 imprese presenti sul sito Alibaba.com e operanti nel settore delle apparecchiature elettroniche. 3http://ec.europa.eu/eurostat/ramon/nomenclatures/index.cfm?TargetUrl=LST_C LS_DLD&StrNom=PRD_2011&StrLanguageCode=EN&StrLayoutCode=HIERARCHI C# 4Non avendo a disposizione software già sviluppati utilizzabili, è stato implementato un sistema di codifica ad hoc utilizzando il software SAS. 1 JADT’ 18 185 di un parser applicato alla variabile testuale e alle descrizioni della classificazione utilizzata. Successivamente si è realizzato un matching tra i due campi, attraverso un algoritmo che identifica l’abbinamento tra stringhe, sfruttando il dizionario al massimo livello di dettaglio possibile5. Infine si è realizzato l’abbinamento impresa-prodotto, assegnando a ciascuna impresa un codice Prodcom che identifica univocamente il principale prodotto commercializzato6. Il sistema di codifica ha permesso la classificazione del 95% delle imprese: un 30% circa vende “computer e prodotti di elettronica e ottica, apparecchi elettromedicali, apparecchi di misurazione e orologi”, un quarto vende “apparecchiature elettroniche e apparecchiature per usodomestico non elettriche” e il 40% circa vende “apparecchiature elettriche diverse dalle precedenti” (Tavola 1). 
The analysis of residuals revealed that the main cause of failed matches is the presence on the Alibaba marketplace of highly specialised, cutting-edge products – electrical and otherwise – that are not present in the Prodcom list. Moreover, the abundance of acronyms, abbreviations and slang made the standardisation work particularly complex. After coding the textual variable, an automated analysis was applied to it in order to explore, describe and analyse the corpus retrieved from the Internet. To this end TaLTaC2 was used – a software for automatic text analysis in the twofold logic of Text Analysis and Text Mining, i.e. both analysis of the text and retrieval and extraction of information within it – which is particularly suited to the study of large unstructured textual data and of structured information linked to it. Textual analysis makes it possible to explore the structure of the text both as a corpus – a collection of mutually comparable textual fragments – and in relation to the coding attributed to it. A first insight is offered by lexicometric measures, a set of statistical measures and indices computed on the vocabulary and its frequency classes (Bolasco, 1999). The corpus consists of 25,395 occurrences, corresponding to the total number of word forms counted as units (Giuliano, 2004). The size of the vocabulary, 4,363 distinct word forms, reflects the sectoral specificity of the analysis. Consistently, the percentage lexical extension index (17.2) and the normalised Guiraud index (27.4) confirm that the vocabulary size is affected by a bias due to the specificity of the companies analysed. Nevertheless, the sector features a rather diversified range of products, as suggested by the share of hapax legomena, 50.7% (Table 2).

Table 2. Lexicometric measures on the corpus

  Lexicometric measure                    Value
  Occurrences – N                        25,395
  Distinct word forms – V                 4,363
  Type/Token – (V/N)*100                   17.2
  % of hapax – (V1/V)*100                  50.7
  Overall mean frequency – N/V              5.8
  Guiraud's G – V/sqrt(N)                  27.4
  Coefficient a                             1.2
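The measures of Table 2 can be reproduced with a few lines of code (a sketch; `tokens` is assumed to be the list of word tokens of the corpus):

    import math
    from collections import Counter

    def lexicometric_measures(tokens):
        counts = Counter(tokens)
        N = len(tokens)                                  # occurrences
        V = len(counts)                                  # distinct word forms
        V1 = sum(1 for c in counts.values() if c == 1)   # hapax legomena
        return {
            "type/token %": 100 * V / N,
            "% hapax": 100 * V1 / V,
            "mean frequency N/V": N / V,
            "Guiraud G": V / math.sqrt(N),
        }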
In effetti, l’indice Term Frequency Inverse Document Frequency (TFIDF) calcolato anche sul dizionario, ha consentito di evidenziare le caratteristiche peculiari dei prodotti venduti dalle imprese rispetto al panorama delle stesse che commercializzano prodotti elettronici.Inoltre, attraverso il confronto tra le forme grafiche del 7A tal fine, si è utilizzato TaLTaC2, un software per l’analisi automatica del testo nella duplice logica di Text Analysis e di Text Mining (TM), quindi sia come analisi del testo che come recupero e estrazione di informazione all’interno dello stesso JADT’ 18 187 corpus e quelle del dizionario della Prodcom, si è potuto operare un controllo indiretto sulla qualità della codifica di cui al precedente paragrafo utilizzando lo scarto standardizzato come proxy di significatività8. Tale misura consente, infine, di caratterizzare le imprese rispetto alla peculiarità dei prodotti che le contraddistinguono all’interno del settore di riferimento (figura 1). Figura 1: Parole chiave del corpus in base allo scarto standardizzato Ulteriori elaborazioni sui dati reperiti dal sito hanno consentito la creazione di ulteriori variabili statistiche. Tra queste: tenure – una proxy dell’anzianità dell’impresa, costruita a partire dall’anno di iscrizione al portale; addetti e fatturato medi – a partire dal valore medio delle classi di riferimento; qualità – una variabile dummy che segnala la presenza di una certificazione di prodotto; propensione all’export – come quota percentuale di esportazioni sul fatturato ; ricerca e sviluppo – in termini di addetti medi impiegati nelle attività innovative; efficienza – capacità di risposta dell’impresa alle esigenze dei clienti. Il database finale è costituito da 18 variabili, che afferiscono all’Anagrafica dell’impresa, all’Attività economica, al Commercio estero, alla Dimensione economica, alla Competitività e alla Ricerca & Sviluppo. 3. Analisi descrittiva ed ecometrica dei dati Ai dati descritti precedentemente sono state applicate le tecniche largamente adottate della ricerca statistica: un’analisi descrittiva del collettivo di riferimento, un’analisi multivariata di tipo esplorativo per la ricerca delle variabili da utilizzare in un modello econometrico e un modello di regressione che tenga conto della specificità dei dati9. Si riportano di seguito i principali risultati. Si è utilizzata la formula classicadellamisura di specificità in cui fi* è la frequenzarelativadella forma graficanell’elencoProdcom. 9Le informazionisulleimpresepresentisulsitovengonoaggiornate, anche se non si sa bene quando e come, e l’informazione dell’anno di riferimento è presente talvolta e solo per alcune variabili (per esempio il fatturato) 8 188 JADT’ 18 Per tutti i settori di attività, più della metà delle imprese si dichiara produttrice e venditrice, forse perché tale caratteristica tende a essere un parametro di scelta da parte di chi deve acquistare. Sono per lo più imprese medio-grandi, giovani, che in genere interagiscono con i clienti, con alte percentuali di export sul valore del fatturato, con presenza di dipendenti dedicati alla ricerca e sviluppo, disponibilità del certificato dei prodotti venduti. Cumulano un volume di esportazioni maggiore dell’80% le imprese che hanno più di 50 dipendenti, oppure sono nelle classi più elevate di fatturato, oppure rispondono almeno all’80% di richieste dal sito, o infine si dichiarano produttrici dei prodotti venduti. 
L’analisi condotta, e quindi il modello di regressione studiato, riguarda la dipendenza che il volume di esportazioni ha con le variabili presenti nel data set. Al fine della scelta delle variabili da utilizzare nel modello è stata effettuata un’analisi cluster gerarchica (SAS Institute Inc., 1999), scegliendo la variabile con minimo valore di 1-R2ratio10, e individuando le seguenti variabili: la produttività d’impresa, la variabile dimensionale data dal numero dei dipendenti occupati in ricerca e sviluppo e le tre variabili categoriche percentuale di risposta a richieste, attività economica e tipologia d’impresa. Le prime due risultano avere nel proprio cluster, nella suddivisione in 5 gruppi, il valore minimo di 1-R2ratio; tra le variabili categoriche invece si evidenziano quelle aventi minore correlazione own cluster con le altre variabili. Il modello implementato consente di stimare i valori della variabile dipendente “volume delle esportazioni” sulla base dei valori assunti/osservati da/per alcune variabili indipendenti. Anche come conseguenza dei risultati dell’analisi cluster descritta precedentemente, si è scelto di includere tra queste: il fatturato per dipendente, il numero di dipendenti di ciascuna impresa, la tipologia di prodotto a 2 cifre, la percentuale di risposta alle richieste di possibili acquirenti, la quota di dipendenti d’impresa occupati in ricerca e sviluppo, la tipologia di impresa e il numero di anni di attività. È stato stimato il seguente modello di regressione lineare (Rencher e Schaalje, 2008): + dip+ dove: (a) è il logaritmo naturale del volume di esportazioni; (b) 101-R^2 ratio=(1-R^2 own cluster)/(1-R^2 nextclosest), dove own cluster=correlazione con il proprio gruppo di variabilie nextclosest=correlazione con il gruppo più vicino + JADT’ 18 189 è il logaritmo della produttività; (c) della percentuale di risposta; (d) è il logaritmo è il logaritmo della quota di è la tipologia di impresa, (f) dipendenti occupati in ricerca e sviluppo; (e) ate è la tipologia di prodotto, (g) dip è il numero dei dipendenti. In presenza di una variabile dipendente con distribuzione log-normale11, l’applicazione di una trasformazione logaritmica alla variabile dipendente e alle variabili indipendenti continue ha come primo obiettivo di ottenere una distribuzione assomigliante a quella di una normale. Ciò implica, per i modelli lineari, la possibilità di estensione di tale ipotesi distributiva anche ai residui (ε) del modello e quindi consente di condurre in modo corretto i necessari test di significatività sui coefficienti stimati. Inoltre, la contemporanea trasformazione logaritmica delle variabili indipendenti (continue) consente di interpretare i valori dei coefficienti stimati direttamente in termini di elasticità. L’introduzione della variabile dip2 è utile per verificare l’esistenza di eventuali relazioni non lineari tra dip e la dipendente, ovvero per capire se all’aumento del numero di dipendenti corrisponda una crescita delle esportazioni progressivamente superiore/inferiore. È stato inoltre studiato un secondo modello (modello2) introducendo l’interazione tra la quota di dipendenti occupati nella ricerca e sviluppo e la variabile categoriale relativa alla tipologia d’impresa. Tale scelta è coerente con l’idea che il livello di attività in ricerca e sviluppo possa rappresentare una fonte di valore aggiunto maggiore per le imprese che producono rispetto a quelle che vendono soltanto. 
The results reported in Table 3 can be summarised as follows: (1) the relation between the dependent variable and the productivity measure is significantly positive; a 1% variation in turnover per employee corresponds, on average, to a variation of more than 1% in export volume; (2) exports are also positively correlated with the response rate to requests from the site and with the number of years of activity (coefficients always significant); (3) the estimates of the two coefficients for company size show that size increases export volume (as expected), but at progressively decreasing rates as the number of employees grows (decreasing returns to scale); (4) differential effects seem to exist between export volume and the types of products sold by sector of economic activity, but the coefficients are not always significant; (5) the dummies for company type show coefficients never significantly different from zero in the absence of interaction with the R&D proxy (model 1); (6) when interacted (model 2), the two types producer and producer/seller have a positive effect on exports (relative to the reference category seller-only), and R&D intensity appears to significantly increase exports only for producing companies; (7) the R&D variable is in fact negatively correlated with the dependent variable for companies operating exclusively in trade, and positively for manufacturing companies or those that also sell.

Table 3. Estimated parameters of the linear model (model 1 and model 2)

  Variable                                  model 1     model 2
  ln(fattxdip)                              1.024***    1.025***
  resp                                      0.002***    0.002***
  num_anni                                  0.022***    0.023***
  ate26 (ref.)                                  –           –
  ate27                                    -0.067*     -0.061
  ate28                                    -0.094***   -0.085**
  others                                    0.067       0.063
  dip                                       0.012***    0.012***
  dip²                                     -0.001***   -0.001***
  type: seller (ref.)                           –           –
  type: producer                           -0.008       0.358***
  type: producer/seller                    -0.055       0.161*
  ln(dip_in_rd/dip)                        -0.097***        –
  ln(dip_in_rd/dip) × seller                    –      -0.212***
  ln(dip_in_rd/dip) × producer                  –       0.188***
  ln(dip_in_rd/dip) × producer/seller           –       0.125***
  Constant                                  2.084***    2.291***
  N                                         1,913       1,913
  adjusted R²                               0.866       0.865

  *p<0.1; **p<0.05; ***p<0.01

The density functions of the observed and estimated dependent variable both show an approximately normal distributional shape: no significant differences emerge between the two models, which both provide a good approximation.

References
Bolasco, S. (1999). L'analisi multidimensionale dei dati. Roma: Carocci.
Giuliano, L. (2004). L'analisi automatica dei dati testuali. Software e istruzioni per l'uso. Milano: LED.
Rencher, A.C. and Schaalje, G.B. (2008). Linear Models in Statistics. Second Edition. Wiley.
SAS Institute Inc. (1999). Logistic Regression Modeling Course Notes. Cary, NC: SAS Institute Inc., pp. 56-57.
The use of textual sources in Istat: an overview
Alessandro Capezzuoli, Francesca della Ratta, Stefania Macchia, Manuela Murgia, Monica Scannapieco, Diego Zardetto
ISTAT – Istituto Nazionale di Statistica – nome.cognome@istat.it
(This work comes from a common effort: paragraph 1.1 was written by Manuela Murgia and Stefania Macchia, par. 1.2 by Alessandro Capezzuoli, par. 2 by Francesca della Ratta, and par. 3 by Monica Scannapieco and Diego Zardetto.)

Abstract 1
Text Mining techniques allow a more widespread use of textual materials also in Official Statistics. We show implementations and current pilots realised in Istat, with a focus on both techniques and applications. Initially, text mining techniques were used to manage complex taxonomies or to analyse open-ended questions; more recently, Big Data frameworks have made it possible to expand the range of data sources, to merge several of them, and to reduce response burden.

Abstract 2
Text Mining techniques allow an extensive use of textual data also in Official Statistics. We describe the implementations and experiments carried out at Istat in this field, focusing on the techniques used and the applications developed. Initially, Text Mining was used to manage taxonomies or to perform textual analysis of open-ended answers, while more recently the Big Data context has made it possible to widen the sources used and to integrate them, also with a view to containing response burden.

Keywords: text mining, official statistics, sentiment analysis

1. Automatic coding and semantic search of taxonomies
The first use of text mining techniques in Italian official statistics was aimed at managing complex classifications. Classifications consist of structured lists of mutually exclusive concepts, corresponding to codes that allow a partition of the population. When the identification of the code corresponding to a concept presents no ambiguity, closed questions can be used, with lists of items from which the one matching the response is selected. On the other hand, when codes belong to classifications that are complex in terms of structure, criteria and hierarchies, managing the taxonomy is a very difficult task that requires knowledge of the classification. Consider, for example, the classification of Occupation: to identify the code corresponding to each occupation it is necessary to consider different aspects, such as the level of competences, their scope, or the activities managed. This paragraph describes how, with the evolution of technologies, this activity has been performed in different ways, using different software tools.

1.1. Automatic coding
Until a few years ago, statistical survey questionnaires rarely used open questions allowing textual answers, because of the difficulty of processing them to provide a measure of the phenomenon. For some variables, however, this often could not be avoided: occupation, economic activity and education level must necessarily be coded according to official classifications, for either national or cross-national data comparison. In the past, verbal responses were coded manually, but this was very time-consuming, costly and error-prone, especially for large amounts of data (Macchia et Murgia, 2002).
For this reason Istat decided to adopt automated coding systems, which consist of two main parts: (i) a database (dictionary) and (ii) a matching algorithm. The dictionary is made of texts associated with numeric codes. The codes are those of official classifications and represent the possible values to be assigned to the verbal responses entering the coding process, while the texts are the labels expressing the concepts that the classifications associate with the codes. To improve coding results, dictionaries are enriched with common-language descriptions taken from answers to previous surveys. The matching algorithm is a "weighting algorithm" that assigns a weight to each word of the verbal response to be coded. The weight indicates how informative a word is, and depends on the word's frequency inside the dictionary: the higher its frequency, the lower its weight. The algorithm then compares the input response with all the texts in the dictionary, looking for a perfect match. If no exact match is found, it looks for a partial match with the most "similar" description, choosing the one with the highest weight. The efficiency of automated coding systems allowed Istat to use them not only to code responses of statistical surveys, but also to offer the coding service to a wider public – governmental or private institutions, private citizens – who need to associate free-text descriptions with official classification codes; think, for instance, of businesses that have to identify their economic activity code for declarations to the Chambers of Commerce. The coding service was then made available on the Istat website for the ATECO variable (the Italian version of NACE, the classification of economic activities). The software used for many years was ACTR (1998-2015), developed and distributed by Statistics Canada. In 2015 ACTR no longer worked on the new Istat IT platform and was replaced by CIRCE, which behaves like ACTR but was developed in house and is based on R (Murgia et al., 2016). The choice of R made it possible to create a coding package freely downloadable from the website, and also to offer a web service for the coding of the ATECO. The web service can easily be incorporated into other software applications: electronic questionnaires of Istat surveys, or software systems of external organizations.
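The weighting logic can be sketched as follows (a simplification of the ACTR/CIRCE behaviour described above; the dictionary entries are invented, and the weight formula is a stand-in for the systems' actual one):

    import math
    import re

    # Invented dictionary: normalised description -> classification code.
    dictionary = {
        "vendita al dettaglio di libri": "47.61",
        "commercio di autoveicoli": "45.11",
    }

    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    # Document frequency of each word across the dictionary descriptions.
    doc_freq = {}
    for desc in dictionary:
        for w in set(tokenize(desc)):
            doc_freq[w] = doc_freq.get(w, 0) + 1

    def weight(word):
        # The rarer a word is in the dictionary, the more informative it is.
        return math.log((1 + len(dictionary)) / (1 + doc_freq.get(word, 0)))

    def code(response):
        if response.lower() in dictionary:            # perfect match
            return dictionary[response.lower()]
        words = set(tokenize(response))               # else: partial match
        scores = {d: sum(weight(w) for w in words & set(tokenize(d)))
                  for d in dictionary}
        return dictionary[max(scores, key=scores.get)]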
Elasticsearch is well suited for the solution of some critical issues related to the use of statistical classifications in different fields (surveys, administrative registers, information systems, etc.), such as: • acquisition, storage, management and updates of classifications; • multilingual semantic search for coding; • sharing and dissemination of coding tools. Textual search is a very popular technique for users who seek information on the web. It does not require any special skill and users have already acquired through surfing the web and it is also suitable to search within statistical classifications and facilitate coding. The most common problem related to semantic searches within taxonomies concerns false-positive and falsenegative results. The search is usually done through SQL queries allowing users to perform two types of operations: "exact match" and "full text". String parsing algorithms can be associated to the SQL queries. JADT’ 18 195 A statistical classification can be indexed within Elasticsearch to perform complex and differentiated textual searches through DSL (Domain Specific Language) in JSON format. This solution permits to simplify the formulation of complicated SQL queries and makes the search system from any programming language usable. Elasticsearch allows users to manipulate large volumes of data thanks to an internal document management, completely independent from relational databases, and the opportunity to create distributed cluster. Istat experience in using this methodology has been very satisfactory. The coding systems related to the main statistical classifications (ISCO, NACE, ISCED, COFOG, COICOP) were included in several Istat surveys ("Labour Force Survey", multi-purpose survey "Aspects of daily life", "Consumer prices", etc.) and Information system on occupation. Easy to use, widgets have been developed to include coding systems within web questionnaires and web applications. 2. Open questions analysis Social research uses open questions also when category answers are not known or when researchers prefer to explore interviewees’ different points of view using their own categories. This approach offers a great opportunity to realize analysis in depth, but it is difficult to be applied with the largest sample used in official statistics. So it is generally preferred using open questions only in pilot survey or small samples, to explore the possible list of answers and to obtain the closed-end list for the final survey. As an example, Istat used this approach in a survey on the female participation in parliamentary life: in 2000 an open question was introduced in a quarterly Multipurpose survey and the list of answering categories obtained with textual analysis was used in the 2005 annual Multipurpose survey. However, in the early 2000s Text mining tools made it possible to analyse open questions also when codes does not belong to pre-defined classifications. The first example was introduced in Istat by Sergio Bolasco, who analysed the daily diaries collected in 2002-2003 Time use Survey to obtain a classification of some daily life actions (Bolasco et al., 2007). 
2. Open questions analysis
Social research also uses open questions when the answer categories are not known, or when researchers prefer to explore interviewees' different points of view using their own categories. This approach offers a great opportunity for in-depth analysis, but it is difficult to apply to the large samples used in official statistics. It is therefore generally preferred to use open questions only in pilot surveys or small samples, to explore the possible list of answers and obtain the closed-ended list for the final survey. As an example, Istat used this approach in a survey on female participation in parliamentary life: in 2000 an open question was introduced in a quarterly Multipurpose survey, and the list of answer categories obtained with textual analysis was used in the 2005 annual Multipurpose survey. However, in the early 2000s text mining tools made it possible to analyse open questions even when the codes do not belong to pre-defined classifications. The first example was introduced in Istat by Sergio Bolasco, who analysed the daily diaries collected in the 2002-2003 Time Use Survey to obtain a classification of some daily life actions (Bolasco et al., 2007).

This classification was obtained using the Entity Research by Regular Expression (RE) function of the tool Taltac2. This function represents an important turning point for the use of textual data in statistical surveys, because it made it possible to pass from the simple description of the words contained in a corpus (lexical analysis) to the classification of single records on the basis of the words contained in each of them (textual analysis). The single word is no longer the unit of analysis, as the RE function searches for, or counts, a particular word or combination of words within the entire record, putting the result in a new customized variable. The search for textual information is run by complex queries using regular expressions with Boolean operators (AND, OR, NOT), lexeme reductions (wildcards such as "*" and "?", e.g. contact* and customer?) and distances (LAGxx) between consecutive words, which make it possible to identify different expressions used to convey the same concept (contact* LAG3 customer? identifies sequences such as "to contact the customer", "contacts with customers", "I contact my main customers"; the value of the new variable could be "to contact customers"). This function was afterwards used in other Istat surveys. It was first used in the Survey on Occupations, developed in 2005-2006 and aimed at describing Italian labour market occupations, providing detailed information on each Occupational Unit. Researchers were also interested in the tasks in which workers are involved daily, which was asked through an open question: "What does your job consist of? Which are the activities you are involved in during your working day?". The aim was to provide each Occupational Unit with a list of semi-standardized activities, labelling in the same way similar activities expressed in different ways by respondents. A text categorization strategy was therefore used, adding to the final dataset an extra column variable with a synthesis of the activities stated by interviewees: the final result was a list of over 7,000 specific activities (della Ratta, 2009). A similar approach is currently used to check and correct the coding of economic activity carried out by interviewers in the Labour Force Survey: every quarter, 1,500 records out of the 24,000 responses collected in the survey and referring to specific NACE sections are analyzed. The correctness of the assigned codes is verified from a double perspective: not only by comparing the respondents' vocabulary reported in the response field of the question on economic activity with the specific dictionary of the official classification (NACE Rev. 2), but also by considering other extra information connected with this variable and collected in the same survey questionnaire. The process is completed with a thorough examination of data consistency in each session, to validate the corrections made and to assign the definitive code. Finally, errors are reported back to interviewers during specific training sessions in order to improve the whole data collection process, from the interview to the code assignment (della Ratta and Tibaldi, 2014). Other uses of text mining tools concerned the classification of the open questions of the online survey on the dimensions of well-being (della Ratta and Tinto, 2012), and the analysis of residual answers inserted in single questions ("Other, please specify"), which can improve the exhaustiveness of questionnaires and can be used in training activities for interviewers.
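In standard regular-expression syntax, the LAG operator described above can be emulated with a bounded gap of intervening words, as in the following Python sketch (the gap bound and the tokenization are simplifications of what Taltac2 actually does).

import re

# emulate "contact* LAG3 customer?": a word starting with "contact",
# then at most 3 intervening words, then "customer" plus at most one
# extra character (the "?" wildcard)
pattern = re.compile(r"\bcontact\w*\b(?:\W+\w+){0,3}?\W+customer\w?\b",
                     re.IGNORECASE)

examples = ["to contact the customer",
            "contacts with customers",
            "I contact my main customers",
            "the customer called us"]
for s in examples:
    print(s, "->", bool(pattern.search(s)))
# the first three match and could all be recoded as "to contact customers"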
In conclusion, the availability of text mining tools made it possible to process open questions independently of the size of the text, leaving one free to use unstructured data in official statistics, especially in recursive analyses in which text categorization strategies can be repeated several times.

3. Dealing with textual Big Data
In recent years, in line with European-level strategic directives, Istat has been exploring the potential of Big Data sources for official statistics. Many such sources, and notably those that seem the most promising so far, are made up of huge collections of unstructured and noisy texts. In current Istat projects, two types of unstructured sources were taken into account, namely: (i) textual data collected from the websites of Italian companies, obtained through automatic procedures of access and extraction performed on a large scale (hundreds of thousands of sites); (ii) messages in Italian publicly available on social networks, typically collected in streaming after a preliminary selection step performed using 'filters' (i.e. sets of keywords that a message must match to be deemed relevant). The contexts of use of textual data from company websites include the enrichment of information in statistical business registers and the potential replacement of questions from survey questionnaires. The possible uses of data from social networks mainly concern the production of high-frequency (e.g. daily) sentiment indices. At the moment, the experiments with social network data have focused on the Twitter platform and on the development of "specific" sentiment indices: the goal is to measure the Italian mood about topics or aspects of life that might be relevant for official statistics (such as the economic situation, the European Union, migration, the terrorist threat, and so on). The hope is that such sentiment indices can improve the quality of Istat's economic forecasting models, enrich existing statistical products (for example the BES) or create new statistical outputs in their own right. Among the processing techniques used for these sources, a particularly promising type consists of Word Embedding models. These models are generated by unsupervised learning algorithms (such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), both based on neural networks) trained on large collections of text documents. Their main objective is to map natural language words into vectors of a metric space, in such a way that the numerical representation of texts captures and preserves a wide range of syntactic and semantic relationships existing between words. Istat successfully tested Word Embedding models in both the application scenarios sketched above. In the first scenario, Word Embeddings have been exploited to automatically summarize the huge text corpora scraped from company websites, and subsequently to encode the summarized texts in order to feed a Deep Learning algorithm for downstream analysis (e.g. to predict whether a given enterprise performs e-commerce). In the second scenario, Word Embedding models have been leveraged both to design the 'filters' used to select relevant messages from Twitter and to evaluate the actual performance of the same 'filters' after data collection. (A more detailed focus on the processing of Twitter data is presented in the paper "Word Embeddings: a powerful tool for innovative statistics at Istat", submitted to this conference.) In the remainder of this section, a specific focus is provided on data scraped from enterprise websites.
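A minimal sketch of the Word Embedding step with gensim's Word2Vec follows (the parameter names follow gensim 4; older releases use size instead of vector_size). The corpus and the hyper-parameters are toy stand-ins, not the settings actually used at Istat.

import numpy as np
from gensim.models import Word2Vec

# in the real pipeline, sentences come from scraped website texts or
# filtered tweets after tokenization and word filtering
sentences = [["online", "shop", "cart", "checkout", "payment"],
             ["company", "contact", "address", "phone"],
             ["buy", "online", "payment", "credit", "card"]]

# skip-gram Word2Vec; parameters are purely illustrative
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1,
                 sg=1, epochs=50)

vec = model.wv["payment"]                  # dense vector for one word
print(model.wv.most_similar("payment"))    # nearest words in embedding space

# one simple way to encode a whole text before a downstream classifier:
# the average of its word vectors
doc = ["online", "payment", "checkout"]
doc_vec = np.mean([model.wv[w] for w in doc if w in model.wv], axis=0)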
The Istat sampling survey on Information and Communication Technologies (ICT) in enterprises aims at producing information on the use of the Internet and other networks by Italian enterprises for various purposes (e-commerce, e-skills, e-business, social media, e-government, etc.). In 2013, an Istat project started with the purpose of studying the possibility of estimating some of the indicators produced by the survey directly from the websites of the enterprises; these indicators included the online sale rate, the social media presence rate and the job advertisement rate. The idea was to use web scraping techniques, associated in the estimation phase with text and data mining algorithms, with the aim of replacing traditional instruments of data collection and estimation, or of combining them in an integrated approach (Barcaroli et al., 2015). The recently achieved results are very encouraging with respect to the use of such techniques (Barcaroli et al., 2017). The whole pipeline set up for this project includes:
• a scraping activity performed by ad-hoc developed software (RootJuice, available on GitHub: https://github.com/SummaIstat/RootJuice/);
• a storage step in which scraped data are stored in a NoSQL database, i.e. Apache Solr;
• a data preparation and text encoding step, performed in two different ways: 1. tokenization, word filtering, lemmatization and generation of a term-document matrix; 2. word filtering and word embeddings;
• an analysis step, performed via machine learning methods on each of the text encodings resulting from the previous step (see the sketch below).
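As a sketch of encoding variant 1 followed by the analysis step, the snippet below builds a term-document matrix with scikit-learn and trains a classifier to predict e-commerce from website text. The texts and labels are invented, lemmatization is omitted, and a logistic regression stands in for the Deep Learning algorithm used in the project.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# one short text per enterprise website (toy data); labels say whether
# the enterprise performs e-commerce
texts = ["buy online cart checkout payment",
         "company history mission contact",
         "shop online secure payment delivery",
         "our offices address phone email"]
labels = [1, 0, 1, 0]

# tokenization + word filtering + term-document matrix
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)     # sparse matrix: documents x terms

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["online payment and delivery"])))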
4. Conclusions and remarks
The techniques for dealing with large corpora of texts can greatly benefit from recent technological advances. Word Embeddings are an example of this opportunity, giving additional possibilities for using unstructured data in official statistics for the purpose of integrating analyses or reducing response burden. Extensive evidence shows that Word Embedding models are indeed superior to more traditional text encoding methods such as bag-of-words. Ongoing work on textual Big Data at Istat makes extensive use of these new approaches, with very promising results.

References
Barcaroli G., Nurra A., Salamone S., Scannapieco M., Scarnò M. and Summa D. (2015). Internet as Data Source in the Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, vol. 44, n. 2.
Barcaroli G., Scannapieco M. and Summa D. (2017). Massive Web Scraping of Enterprises Web Sites: Experiences and Solutions. 61st World Statistical Congress, ISI.
Bolasco S., Pavone P. and D'Avino E. (2007). Analisi dei diari giornalieri con strumenti di statistica testuale e text mining. In: Romano (ed.), I tempi della vita quotidiana, Istat, Roma, Argomenti, n. 32.
della Ratta-Rinaldi F. (2009). Il trattamento dei dati. In F. Gallo, P. Scalisi and C. Scarnera, L'indagine sulle professioni. Anno 2007. Contenuti, metodologia e organizzazione. Collana Metodi e Norme, n. 42, Roma, Istat.
della Ratta-Rinaldi F. and Tinto A. (2012). Le opinioni dei cittadini sulle misure del benessere. Risultati della consultazione online. Roma, Istat-Cnel.
della Ratta-Rinaldi F. and Tibaldi M. (2014). Sperimentazione di un sistema di controllo e correzione per la codifica dell'attività economica. Istat Working Papers, n. 4.
Macchia S. and Murgia M. (2002). Coding of textual responses: various issues on automated coding and computer assisted coding. Proc. of JADT 2002: 6es Journées internationales d'Analyse Statistique des Données Textuelles.
Mikolov T., Chen K., Corrado G. and Dean J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR.
Murgia M. and Prigiobbe V. (2016). La nuova applicazione di codifica web dell'ATECO 2007: WITCH, un web service basato sul sistema di codifica CIRCE. Istat Working Papers, n. 19.
Pennington J., Socher R. and Manning C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532-1543.

Twitter e la statistica ufficiale: il dibattito sul mercato del lavoro [Twitter and official statistics: the labour market debate]
Francesca della Ratta, Gabriella Fazzi, Maria Elena Pontecorvo, Carlo Vaccari, Antonino Virgillito
Istat – Istituto Nazionale di Statistica, Rome – Italy
(This work is the result of the authors' shared reflection; section 1 was written by Carlo Vaccari and Antonino Virgillito, section 2.1 by Francesca della Ratta, section 2.2 by Gabriella Fazzi and Maria Elena Pontecorvo, and the conclusions by all the authors.)

Abstract
The goal of the paper is to show the potential and the benefits of integrating big data analysis techniques with the techniques used for textual analysis, through the analysis of a corpus extracted from Twitter. The analysis develops a method already tested in other works (della Ratta, Pontecorvo, Virgillito and Vaccari, 2016 and 2017), starting from the collection of selected tweets through a list of hashtags defined according to the theme of interest. This procedure makes it possible to obtain, in a reasonable time, a selection of tweets of interest on which to apply textual analysis techniques, in order to describe the contents of the text and identify its main semantic contents. The paper analyzes the role of the National Institute of Statistics in the discussion on the labour market in the periods when Istat publishes its monthly and quarterly press releases on employment. The analysis, already conducted at the end of 2016, was replicated and refined in the same period of 2017, in order to show the distinctive elements of the labour market debate and to understand the changes in the perception of public opinion, also taking into account the changes in the economic situation and the political scenario.
Keywords: big data; text mining; Twitter; Istat; labour market

1. Big data and Twitter
Data from social networks are among the Big Data sources most used by researchers: the enormous diffusion of these websites, in which users generate large amounts of information, makes them potentially one of the most interesting sources for textual data as well. Twitter is a social network in which users write and read short messages called "tweets", normally visible to all users, who can also "subscribe" to the tweets of other users (becoming "followers"), forward ("retweet") single tweets to their own followers, or "like" other tweets. Twitter is today one of the most widespread social networks, and in 2017 it passed 300 million active users. According to Alexa (2018), Twitter is today the thirteenth most visited website in the world, and the eighth in the USA.
The aim of this work is to apply textual analysis techniques to a corpus extracted from Twitter, joining the two worlds of Big Data and textual analysis. Data collection from Twitter was carried out using a platform, the "Sandbox", which is the final result of the project "Big Data in Official Statistics", carried on within the High-Level Group on Modernisation of Official Statistics (HLG-MOS). (The results of the Sandbox project, coordinated by Virgillito in 2014 and by Virgillito and Vaccari in 2015, are described in UNECE (2014 and 2016).) The Sandbox is a web-based environment used for numerous experiments based on different data sources, such as visits to Wikipedia pages, foreign trade data from the UN Comtrade site, company websites searched for job advertisements and, indeed, tweets collected in various countries of the world. The Sandbox is still used today to carry out the experiments of the ESSnet on Big Data (https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/ESSnet_Big_Data), a European project coordinated by Eurostat on the use of Big Data in the production of official statistics. The tweets analyzed were collected through an online tool made freely available by Twitter (the Streaming API), queried through programs written in R and executed inside the Sandbox. This solution, although simple to use and immediate to implement, has limitations both in the amount of data that can be extracted and in the incomplete adherence of the data obtained to the filters set in the extraction phase, as explained in the next section. The collected tweets were stored in Elasticsearch, a database installed in the Sandbox and specialized in semi-structured data, which makes it possible to store large quantities of documents and to quickly extract subsets through keyword-based queries.

2. The analysis of posts on the labour market: Istat's impact
2.1 Corpus creation
To analyze the data extracted from Twitter, the method tested in previous works was replicated (della Ratta, Pontecorvo, Virgillito and Vaccari, 2016 and 2017). In this context, it was decided to focus the analysis on the role played by Istat in the diffusion of information on the labour market, automatically extracting a first set of tweets in the weeks in which Istat publishes the monthly and quarterly data on the labour market. This extraction, already carried out at the end of 2016, was replicated in the same period of 2017, starting from a rather broad query that yielded a corpus of 58,277 tweets for the period 28 November - 12 December 2017. The initial query was the following: "(istat OR inps OR #istat OR #inps OR #lavoro OR #occupati OR #disoccupati OR #disoccupato OR #jobsact OR #occupazione OR #disoccupazione OR #mercatodellavoro OR #poletti OR #cassaintegrazione)". The same query was then re-run in Elasticsearch on the first corpus of 58,277 tweets extracted through the Twitter API; this made a further selection possible, eliminating a great many tweets which, although extracted by the same query, did not contain the keywords, and revealing the incomplete accuracy of the free Twitter API in applying the extraction filters. A corpus of about 26 thousand tweets was eventually obtained, on which the subsequent selection was carried out. From this corpus all the hashtags with more than 14 occurrences were extracted (easily identifiable in the text thanks to the "#" symbol), and among them those strictly connected to the labour market discussion were singled out (Table 1). Using the software Taltac2, a corpus of 19,398 tweets containing at least one of the hashtags of interest was then extracted. This corpus was further cleaned by removing the tweets related to job offers (presence of the hashtags #offertalavoro and #annunciolavoro), considered not pertinent. This led to a corpus of 17,419 tweets, consisting of 283,000 occurrences and 18,000 word forms, with a lexical richness (type/token ratio) of 6.7%.
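The hashtag selection step can be sketched in a few lines of Python (the paper's actual processing was done with Elasticsearch queries and Taltac2). Note that lower-casing here merges case variants such as #JobsAct and #jobsact, which Table 1 keeps distinct; the tweets shown are invented.

import re
from collections import Counter

# tweets: texts already retrieved from the Streaming API and stored in
# Elasticsearch (toy examples here)
tweets = ["Nuovi dati #Istat su #occupazione e #lavoro",
          "#jobsact e #precariato, il dibattito continua",
          "#lavoro #roma #romalavoro cercasi commessa"]

counts = Counter(h.lower() for t in tweets for h in re.findall(r"#\w+", t))

# keep hashtags above the frequency threshold (14 in the paper), then
# select the tweets containing at least one hashtag of interest
frequent = {h for h, n in counts.items() if n > 14}
selected = [t for t in tweets
            if any(h.lower() in frequent for h in re.findall(r"#\w+", t))]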
Slightly more than a third of the tweets are original, while retweets make up 63% of the overall corpus, a larger share than in the 2016 corpus. To "measure" Istat's impact on the labour market debate, all the tweets in which the form "Istat" appears were labelled: 13.9% of the total, almost three times the share observed in 2016 (5%). While in 2016 the impact of the concurrent referendum debate had reduced the weight of comments on Istat figures within the labour market discussion, in 2017 the themes of the employment recovery and of its characteristics seem to have attracted users' attention to a greater extent. Moreover, the first release of the integrated annual report on the labour market probably increased the weight of comments on the data. While in 2016 references to Istat clustered around the official releases, in 2017 they are distributed more uniformly, with a peak at the quarterly press release of 7 December (in which the news of the all-time record of fixed-term workers had a wide echo; Figure 1).

Table 1 – Selection of hashtags (occurrences in parentheses): #lavoro (14,172), #jobsact (1,225), #occupazione (948), #JobsAct (861), #Jobsact (587), #disoccupazione (463), #povertà (414), #Poletti (278), #precari (265), #LAVORO (205), #ContailLavoro (201), #disoccupati (196), #licenziamento (190), #disoccupato (173), #Thyssenkrupp (164), #contaillavoro (158), #lavoratori (156), #Disoccupazione (149), #Melegatti (139), #GaranziaGiovani (124), #precariatodistato (110), #precariato (109), #pandoro (98), #articolo18 (53), #occupati (48), #Occupazione (42), #Cococo (42), #Discoll (41), #orientamento (41), #cassaintegrazione (40), #mercatodellavoro (37), #JobsActSempre (32), #smartworking (31), #thyssen (31), #RelazioniIndustrialiA (20), #poletti (20), #MercatoDelLavoro (19), #precarizzazione (19), #precarietà (19), #Smartworking (18), #voucher (17), #freelance (17), #Art18 (15), #dipendente (15), #ScuolaLavoro (15).

[Figure 1 – Share of references to Istat by day, 2016 and 2017. Istat release calendar: 28/11 Births and fertility; 1/12 Employed and unemployed, October (released on 30/11 in 2016), Quarterly national accounts; 5/12 Quarterly note on the economic trend; 6/12 Living conditions, income and tax burden of households; 7/12 The labour market (Q3); 11/12 The labour market (integrated annual report, 2017 only).]
The impact of the monthly employment press release was more modest, as its figures had turned out to be substantially stable (in 2016, by contrast, given the concurrence with the constitutional referendum, the monthly release had recorded the highest share of citations). A considerable volume was recorded on the occasion of the press release on living and income conditions (6/12) and of the presentation of the first integrated report on the labour market (11/12). The peak of 3 December, a Sunday, is anomalous: it was fed by a remarkable rate of retweets highly critical of the labour policies of the government of the time, probably due to the appearance of PD secretary Matteo Renzi on a popular evening television programme (Che Tempo Che Fa), which also focused on the labour market policies of recent years. The most cited was a tweet critical of the mechanism for counting the employed, together with other more political ones on the recent increase in fixed-term work.

2.2 The content of the corpus
The content of the corpus can be described using the key words, computed against standard Italian (Bolasco, 2013), which make it possible to delimit the areas of content. The first to emerge is contratti ("contracts"), with reference to the increase in fixed-term contracts. The over-representation of the term is driven by a limited number of tweets (13) which nevertheless received numerous retweets and which, taking up the Istat figure on contract duration, highlight the growth of precarious work (precariato), itself an over-used term (Figure 2). Other terms very present in the text are disuguaglianze ("inequalities") and esclusione ("exclusion"), used above all in a Caritas post taking up the poverty figure published on 6 December.

[Figure 2 – Tag cloud of the key words.]

The presence of very strong terms, connoting a harsh-toned debate, is striking: trucco ("trick"), fraudolenta ("fraudulent"), infamia ("infamy"), tossico ("toxic"), truffa ("scam"), schiavitù ("slavery"). Analyzing the contexts of use, each of these terms turns out to refer to a different episode (the "trick" of the data on the definition of employment; "fraudulent" labour supply; the government's "scam" on pensions; the "toxic" agreement, with reference to CETA; "infamy against labour", with reference to the Jobs Act), and it is precisely the strongest tweets that generate the highest numbers of retweets. Also significant are the terms used in tweets evoking stories, in which the statistical figure is replaced by the exemplary case, capable of generating empathy and, consequently, retweets. It is no coincidence that the most over-represented terms refer to a single tweet, about a worker struck by leukaemia who recovers but is fired anyway. Among the examples there is also that of a separated mother, dismissed by Ikea in Milan. References to the employment record and to those celebrating the employment figures are also sometimes reported critically; the exception is the reference to the youth unemployment rate, which is taken up neutrally by the Ansa news agency and retweeted many times. Considering the repeated segments (i.e. the sequences of words repeated in the text), four main semantic areas to which the tweets refer can be identified (Table 2). First of all there are the expressions referring to the mere diffusion of the news revolving around the theme of "work", which carry considerable weight also in terms of occurrences.
In particular, on the one hand the reference to the data published by Istat on poverty, birth rates and employment emerges; on the other, two segments stand out which refer to the current-affairs episodes already mentioned: the dismissal by Ikea of a separated mother with two small children, and that of an employee of a paint factory, which occurred after a long period of sick leave. Alongside the segments relating to the news, there are then the segments attributable to the comments of politicians, to legislative measures and to the first signs of the electoral campaign. These are counterbalanced by the tweets characteristic of the public debate, among which polemical or sarcastic notes are not lacking. Finally, although the file was partly cleaned of the hashtags attributable to job advertisements, some segments concerning the search for particular professional profiles nevertheless emerge. As can easily be guessed, moreover, some contents characterize more strongly the fragments in which explicit reference is made to Istat. Compared with the analysis carried out in the same period of the previous year, the analysis of specificities shows the prevalence of a more thematic than technical language when the institute is cited (dati, contratti, #povertà, disoccupazione), while the tweets that speak about work without citing Istat refer to news stories and politics (#jobsact, #pensioni, legge, licenziato, etc.), with fewer personal references to the figures who in 2016 were on the front line of the referendum campaign. Moreover, the analysis of concordances shows that the very reference to Istat is used in different contexts.

[Table 2 – Main repeated segments, organized into four semantic areas: the news (e.g. "dati #Istat", 419 occurrences; "esclusione sociale", 392; "madre separata", 195), political references (e.g. "Missione compiuta #JobsAct", 67; "Ministro #Poletti", 60; "campagna elettorale", 43), public debate and polemic (e.g. "Come ti trucco i dati"; "tutto da rifare"; "dovremmo ribellarci"), and job advertisements (e.g. "#lavoro #roma #romalavoro", 152; "kijiji lavoro", 53).]

Beyond the mere diffusion of the news and the comments on the figure on the increase in fixed-term contracts, there is also an instrumental use of the data as a yardstick for judging labour market policies [#Istat "record di occupati a termine: sono 2,8 milioni". ecco l'unico risultato oggettivo del #jobsact ("record of fixed-term employed: 2.8 million". Here is the only objective result of the #jobsact); continua a calare la #disoccupazione - i nuovi dati #Istat confermano le previsioni, un'altra ventata di ottimismo ("#unemployment keeps falling - the new #Istat data confirm the forecasts, another wave of optimism")]. Compared with 2016, the sarcastic tone of some tweets is directed less explicitly at Istat and more generally at the situation of the country [«record di #precari in Italia, 2,8 milioni. va tutto ben, madama la marchesa.. #lavoro #Istat #occupazione» ("record number of #precarious workers in Italy, 2.8 million. all is well, madame la marquise..")].
A polemical residue nevertheless remains about some definitions of employment and unemployment [«Ricordiamo che per #istat se si lavora un'ora retribuita a settimana si è considerati occupati. #supercazzola» ("Let us remember that for #istat, if you work one paid hour a week you are counted as employed. #supercazzola"); «Come ti trucco i dati #Istat sulla disoccupazione: il 14,6% dei contratti dura meno di 3 giorni, il 31% un mese» ("How I trick the #Istat unemployment data: 14.6% of contracts last less than 3 days, 31% one month")]. Finally, the evaluation of the tone of the text is of interest, made possible by the analysis of the positive and negative adjectives recognized within Taltac2. The ratio of negative to positive adjectives is 50.2%, a value denoting a medium level of criticality, equal to that found in the language of the press (Bolasco and della Ratta, 2004). (Technical terms referring to specific statistical aggregates which carry a negative connotation, such as disoccupato ("unemployed"), scoraggiato ("discouraged") or povero ("poor"), were in any case removed.) The level of criticality varies across the different days: it is lower on the days the press releases are published, especially the monthly one, while it is particularly high on 3 December, owing to the "noise" produced by retweets (retweets show a criticality of 63.6%), probably because of the greater success of polemical tweets. Among the most frequent negative adjectives are precari, fraudolenta, dannoso, fallito ("precarious", "fraudulent", "harmful", "failed").

3. Conclusions
The analysis carried out made it possible to refine a methodology for processing tweets: from the point of view of their extraction, the procedure used yielded a cleaner starting file on which to operate a selection based on the list of hashtags. The analysis of the text then made it possible to highlight the different contexts in which official statistics figures are referred to. The comparison between the results of the same corpus one year apart is particularly interesting. In the same period of the previous year, the discussion was strongly conditioned by the referendum debate, which probably "overturned" the discussion on labour themes. In the tweets of a year earlier the levels of criticality were higher and Istat's role smaller (13% presence today against 5% a year earlier). The tone of the text appears on the whole more neutral, with more references to Istat in its official role as a publisher of data and fewer as an object of mockery and polemic. As for content, in the discussion at the end of 2017 the debate on the effects of the Jobs Act and on the spread of precarious work seems to have carried more weight. The current corpus is also characterized by a wider use of retweets.

References
Alexa (2018). Twitter site overview, http://www.alexa.com/siteinfo/twitter.com.
Bolasco S. (2013). L'analisi automatica dei testi. Fare ricerca con il text mining. Roma, Carocci.
Bolasco S. and della Ratta-Rinaldi F. (2004). Experiments on semantic categorisation of texts: analysis of positive and negative dimension. In JADT 2004 - Le poids des mots, Actes des 7es Journées internationales d'Analyse Statistique des Données Textuelles. UCL, Louvain.
della Ratta-Rinaldi F., Pontecorvo M.E., Virgillito A. and Vaccari C. (2016). Big data and textual analysis: a corpus selection from twitter. Rome between the fear of terrorism and the Jubilee. In JADT 2016 - Statistical Analysis of Textual Data, Vol. 2, Nice.
della Ratta-Rinaldi F., Pontecorvo M.E., Virgillito A. and Vaccari C. (2017). The Role of NSIs in the Job Related Debate through Textual Analysis of Twitter Data. NTTS 2017, Brussels.
UNECE (2016). Big Data in Official Statistics.
http://www1.unece.org/stat/platform/display/bigdata/Big+Data+in+Official+Statistics.
UNECE (2014). Big Data in Official Statistics. http://www1.unece.org/stat/platform/display/bigdata/Big+Data+in+Official+Statistics.
Vaccari C. (2014). Big Data and Official Statistics. PhD Thesis, School of Science and Technologies, University of Camerino.

Gauging An Author's Mood Using Hidden Markov Chains
Sami Diaf
Hildesheim Universität – sami.diaf@uni-hildesheim.de

Abstract
This paper aims to gauge the mood of an author using a text-based approach built upon a lexicon score and a hidden Markov model. The text is tokenized into sentences, each given a polarity score, yielding three evaluative factors (positive, neutral and negative) which represent the observable states. The mood of the author is considered a latent state (good, bad) and is estimated via a hidden Markov model. Tested on a work of psychological fiction, Franz Kafka's novella Metamorphosis, this methodology shows an interesting linkage between the author's feelings and the intent of his writing.
Keywords: sentiment analysis, hidden Markov model, polarity.

1. Introduction
Sentiment analysis is defined as the general method for extracting subjectivity and polarity from a text, while semantic orientation refers to the polarity and strength of words, phrases or texts: a measure of subjectivity and opinion in the text, capturing an evaluative factor and the potency or strength of a given corpus toward a given subject (Taboada et al., 2011). Extracting sentiment automatically usually involves two main approaches (Taboada et al., 2011): a lexicon-based approach, built on computing the orientation of a document from the semantic orientation of its words or sentences, and a text-classification approach, stemming from supervised machine learning techniques, which involves building classifiers from labeled instances of texts or sentences. Lexicon-based models stress the importance of adjectives as an indicator of a text's semantic orientation and have been preferred in the linguistic context, as classifiers have yielded varying results across their areas of application (Taboada et al., 2011). Among the many lexicon-based approaches adopted in the academic field, the one implemented by Hu and Liu (2004) remains popular. It was built upon two hypotheses concerning semantic orientation: independence of context (prior polarity) and expression as a numerical value using an opinion lexicon. This article uses the polarity approach of Hu and Liu to build a sequence of evaluative factors (positive, neutral and negative), considered as realizations of an observable state x, and supposes that the mood of the author can be approached via a latent variable z taking two hidden states (good and bad). To this end, hidden Markov models (Murphy, 2012) are used to estimate the transition probabilities between hidden and observed states, as they capture long-range correlations in the sequence of data better than standard Markov models.

2. Polarity function
Polarity is defined as the measure of positive or negative intent in a writer's tone (Kwartler, 2017) and can be calculated by sophisticated or fairly straightforward methods, usually using two lists of words: one positive and one negative. Hu and Liu (2004) set up the architecture of the polarity function used to tag polarized words in the English language, and Rinker (2017) provides a detailed description of the polarity function and its computation. A context cluster of words is pulled around each polarized word, and its members are treated as valence shifters: words in the context cluster are tagged as neutral, negator, amplifier or de-amplifier. Each polarized word is then weighted according to a dictionary of positive/negative words and weights, and further weighted by the number and position of the valence shifters directly surrounding it. The final computation step is the sum of the context clusters divided by the square root of the word count, which yields an unbounded polarity score.
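The polarity computation is implemented in full by the qdap polarity() function cited below; the following Python sketch is a much-reduced toy version in which the lexicons, the amplification factor and the flat context window are illustrative assumptions, not the actual values of the Hu and Liu dictionary.

import math
import re

positive = {"good", "great", "happy"}      # stand-in for the opinion lexicon
negative = {"bad", "terrible", "sad"}
negators = {"not", "never", "no"}
amplifiers = {"very", "really"}

def polarity(sentence, window=4):
    words = re.findall(r"[a-z']+", sentence.lower())
    clusters = []
    for i, w in enumerate(words):
        if w in positive or w in negative:
            weight = 1.0 if w in positive else -1.0
            for c in words[max(0, i - window):i]:   # context cluster
                if c in negators:
                    weight = -weight                 # valence shifter: negate
                elif c in amplifiers:
                    weight *= 1.8                    # valence shifter: amplify
            clusters.append(weight)
    # sum of context clusters over the square root of the word count
    return sum(clusters) / math.sqrt(len(words)) if words else 0.0

def evaluative_factor(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"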
3. Application
To illustrate this framework, we took the English version of the novella Metamorphosis, written by Franz Kafka, published in 1915 under the title «Die Verwandlung» and freely available in the Project Gutenberg database. The work was translated into English by David Wyllie in 2002 and belongs to the psychological fiction category. The novella is broken down into sentences, a process called tokenization, and the polarity function is then computed for each sentence in order to construct a sequence of evaluative factors (positive, neutral or negative) according to the polarity score, as shown in Figure 1.

[Figure 1. Sequence of data corresponding to the polarity score of each sentence.]

This step generates 812 sentences, of which the positive and negative polarity scores represent 29.1% and 28.6% of the total respectively. The remaining sentences (42.3%) correspond to the neutral evaluative factor. Statistical tests show that the generated time series has its first two autocorrelations significantly different from zero and exhibits slightly persistent memory, as the estimated Hurst exponent is 0.587, significantly different from the value of 0.5 corresponding to a Brownian motion (Mandelbrot and Hudson, 2006). The probability transition matrix of the evaluative factors, estimated via maximum likelihood, shows that the associated Markov chain is irreducible with no persistent states, as shown in Figure 2.

[Figure 2. Probability transition matrix of the evaluative factors.]

We assume the mood of the author can be modeled via a latent variable Z taking two states (good and bad). Hence, we can build a hidden Markov model explaining the interactions between the observable states (positive, neutral and negative) and the latent, unobservable states (good and bad). To estimate the hidden Markov model, the transition matrix of the latent state is set uniformly, i.e. all its elements equal 0.5; the same applies to the initial latent vector. The emission matrix, which describes the links between the latent and the observable states, is instead set arbitrarily, as in Figure 3.

[Figure 3. Prior probability transition of the emission matrix.]

Given these priors, the hidden Markov model estimated using the Baum-Welch algorithm (Murphy, 2012) yields a starting probability vector slightly skewed towards the good mood (0.51) rather than the bad mood (0.49). The estimated transition and emission matrices are reported in Figures 4 and 5 respectively.

[Figure 4. Estimated transition matrix via the Baum-Welch algorithm.]
[Figure 5. Estimated emission matrix via the Baum-Welch algorithm.]
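A sketch of this estimation with the hmmlearn library (the class is CategoricalHMM in recent releases; in older versions the same behaviour was exposed as MultinomialHMM). The short observation sequence and the prior emission values below are toy stand-ins for the 812-sentence sequence and for the priors of Figure 3.

import numpy as np
from hmmlearn import hmm

# observed evaluative factors: 0 = negative, 1 = neutral, 2 = positive
obs = np.array([[1], [2], [0], [1], [1], [0], [2], [1]])

model = hmm.CategoricalHMM(n_components=2, n_iter=100, init_params="")
model.startprob_ = np.array([0.5, 0.5])           # uniform initial vector
model.transmat_ = np.full((2, 2), 0.5)            # uniform latent transitions
model.emissionprob_ = np.array([[0.2, 0.5, 0.3],  # arbitrary priors standing
                                [0.4, 0.4, 0.2]]) # in for Figure 3

model.fit(obs)                 # Baum-Welch (EM) re-estimation
states = model.predict(obs)    # Viterbi: most probable latent mood sequence
print(model.transmat_, model.emissionprob_, states, sep="\n")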
The results show significant links between writing without intent (the neutral state) and being in a good mood, and between negative intent and the bad mood. The most probable states, estimated via the Viterbi algorithm (Murphy, 2012), clearly show the dominance of the good state (71.4%) over the bad one (28.6%), as shown in Figure 6. These findings help clarify the nature of the story (thriller, novel, novella, ...) and the author's narrative style, which could be confirmed by analyzing his remaining works. Finally, it is worth noticing that this methodology could also be used to assess the accuracy of translations with respect to the original work, by comparing the similarities of the transition and emission probabilities of the corresponding hidden Markov models.

[Figure 6. Most probable states estimated via the Viterbi algorithm (Bad in red and Good in blue).]

4. Conclusion
This work expands the application field of semantic orientation by exploring a new probabilistic approach based on hidden Markov models and evaluative factors. The resulting outcomes help in understanding the author's mood by examining the linkage between the evaluative factors which express the author's mindscape through his writing. The emission probabilities between the latent states and the evaluative factors helped identify hidden structures linked to the psychological state of the author and to the development of the facts. This approach could be used as a check on translation accuracy, on condition of having a precise list of positive and negative words in the original language with which to compute the polarity score.

References
Hu M. and Liu B. (2004). Mining and summarizing customer reviews. Proceedings of the ACM SIGKDD, pp. 168-177.
Kwartler T. (2017). Text Mining in Practice with R. Wiley.
Mandelbrot B. and Hudson R.L. (2006). The Misbehavior of Markets: A Fractal View of Financial Turbulence. Basic Books.
Murphy K.P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
Project Gutenberg [www.gutenberg.org].
Rinker T. (2017). Polarity score (Sentiment Analysis) [https://www.rdocumentation.org/packages/qdap/versions/2.2.9/topics/polarity].
Silge J. and Robinson D. (2017). Text Mining with R: A Tidy Approach. O'Reilly.
Taboada M., Brooke J., Tofiloski M., Voll K. and Stede M. (2011). Lexicon-based Methods for Sentiment Analysis. Computational Linguistics, vol. 37, issue 2, pp. 267-307.

Les hémistiches répétés [Repeated hemistichs]
Marc Douguet
Université Grenoble Alpes – marc.douguet@univ-grenoble-alpes.fr

Abstract
In this paper, we propose to use the syllabic structure of the classical alexandrine in order to automatically identify textual recurrences in French 17th-century theater. The two hemistichs of 6 syllables each present a syntactical unity: consequently, extracting recurrent hemistichs is a way, on the one hand, to highlight idiomatic expressions characteristic of this period, and, on the other hand, to evaluate the influence of metric constraints on writing.

Résumé
In this article, we propose to use the metrical characteristics of the classical alexandrine to automatically detect textual recurrences in the corpus of 17th-century French theatre. The two 6-syllable hemistichs that make up the line indeed possess a syntactic unity: the frequent reuse of the same hemistichs therefore makes it possible, on the one hand, to bring out the stock phrases proper to this style of writing and, on the other, to evaluate the influence of metrical constraints on writing.
Keywords: repeated segments, metre, verse, textual recurrences

1. Introduction
The detection of repeated segments in a corpus is a particularly valuable tool for stylometric analysis: it makes it possible both to characterize the style proper to an author, a genre or a period, and to assess an author's originality with respect to his contemporaries, his ability to free himself or not from the stock phrases of his time (cf. notably Salem, 1987; Legallois, 2009; Delente and Legallois, 2016). From this point of view, the classical alexandrine presents a characteristic which, it seems to us, has not yet been fully exploited. The caesura divides the line into two hemistichs of equal length (6 syllables) which constitute units that are both rhythmic and syntactic. And these units are the object of numerous repetitions. Compared with the approach that consists in extracting all segments of n words in order to detect recurrences, this approach (which complements it) has a threefold advantage for the computational stylistics of poetry:
– it extracts only segments that already constitute syntactic units, and avoids having to sort the relevant results manually;
– it extracts segments which, whatever their number of words, have the same number of syllables and are therefore, in the poetic regime, of strictly comparable weight;
– it makes it possible to relate reflection on repetition to the analysis of versification and, in particular, to appreciate the constraint that the metre places on writing.

2. Methodology
We worked on a corpus of 200 plays in alexandrines published between 1630 and 1680, representative of the diversity of the dramatic genres of the period (tragedy, comedy, tragicomedy). (The list of plays, the scripts used and the complete results are available at https://github.com/marcdouguet/dheform.) The corpus is edited in XML-TEI, with markup describing the division into acts, scenes, speeches and verse lines. (The texts are available at https://github.com/dramacode/tcp5. They were provided to us by the "Bibliothèque dramatique" project (http://bibdramatique.parissorbonne.fr/), directed by Georges Forestier, and the "Théâtre classique" project (http://theatre-classique.fr/), directed by Paul Fièvre. We thank them both for making their XML sources available, without which this work would not have been possible.) We developed a syllable counter capable of splitting lines at the caesura and extracting each of the two hemistichs separately. It is more modest than other tools developed for the automatic analysis of verse (notably Beaudouin, 2002; Delente and Renault, 2015; Salvador, 2016), since it does not aim to place the boundary between two syllables within a word exactly. In order to produce a dictionary of diereses and synereses, we first trained it by checking the results manually. The syllable counter automatically recognizes as 12-syllable lines 99.98% of the 55,031 lines by Corneille previously verified to be alexandrines. The margin of error is due solely to the ambiguity of certain words whose pronunciation changes with their grammatical category (for example "content" and "fier", depending on whether they are verbs or adjectives). The corpus is composed of 332,938 lines, i.e. in theory 665,876 hemistichs. We retained only 624,597 of them, after excluding those distributed over several speeches. The number of occurrences of each hemistich is computed after removing punctuation and capitalization.
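The extraction and counting of hemistichs can be sketched as follows. This is not the authors' actual script (which is available in the repository cited above); count_syllables stands for the trained syllable counter just described and is assumed given.

import re
from collections import Counter

def hemistichs(line, count_syllables):
    # split an alexandrine at the caesura, i.e. after the 6th syllable;
    # count_syllables(word) is the trained syllable counter assumed here
    words, total, first = line.split(), 0, []
    for w in words:
        total += count_syllables(w)
        first.append(w)
        if total == 6:
            return " ".join(first), " ".join(words[len(first):])
    return None   # no clean caesura (a word straddles the 6th syllable)

def normalize(h):
    # occurrences are counted after removing punctuation and capitals
    return re.sub(r"[^\w\s'-]", "", h).lower().strip()

# counts = Counter(normalize(h) for line in corpus
#                  for h in (hemistichs(line, count_syllables) or ()))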
3. Frequency of repeated hemistichs
The phenomenon of the textual repetition of hemistichs is of a different order of magnitude from the similar phenomenon affecting whole lines. In our corpus, 499 lines are repeated at least once, i.e. only 0.1%. For someone with a thorough knowledge of the corpus, these repetitions can often be spotted manually, and critical editions point some of them out (one thinks in particular of the famous "Je suis maître, je parle, allez, obéissez" in Corneille's La Mort de Pompée, taken up in Molière's L'École des femmes). The stakes of such borrowings would deserve study (plagiarism, parody, quotation of one character by another, refrain effects, etc.). The repetition of hemistichs raises different issues, owing both to the brevity of the repeated segments and to the very large number of repetitions: 16% of the hemistichs of the corpus are repeated at least twice, and a hemistich appears on average 1.11 times. Writing in verse thus draws on a number of pre-existing stock phrases and idiomatic expressions, which the playwright combines in an original way. In addition to the quantitative counts, we also developed a reading interface (accessible at http://obvil.lip6.fr/dheform): the user can enter a text, whose repeated hemistichs are then highlighted by a colour code.

4. Analysis of the most frequent hemistichs
By way of example, the following table lists the 10 most frequent hemistichs of the corpus, with their number of occurrences and two examples in context:

en cette occasion (119): "Que me donne l'amour en cette occasion" / "N'offrez donc point, Seigneur, en cette occasion"
en l'état où je suis (98): "Que ferai-je, Philante, en l'état où je suis ?" / "Je ne réponds de rien en l'état où je suis."
pour la dernière fois (87): "Dites-lui de ma part pour la dernière fois" / "Pour la dernière fois je me jette à vos pieds."
à votre majesté (87): "Le respect que je dois à votre Majesté" / "Je me livre, grand Prince, à votre Majesté,"
que votre majesté (70): "Que votre Majesté le rappelait près d'elle." / "Ah ! Grand Roi, se peut-il que votre Majesté"
en cette extrémité (68): "Mettre tout en usage en cette extrémité ;" / "Quoi ? vous m'abandonnez en cette extrémité,"
je vous l'ai déjà dit (55): "Je vous l'ai déjà dit, sans vous parler de moi," / "Je vous l'ai déjà dit, j'estime votre flamme,"
une seconde fois (51): "Je renonce à choisir une seconde fois ;" / "J'en ferais un ingrat une seconde fois."
les armes à la main (42): "Les armes à la main, venez si bon vous semble," / "Laissez-nous lui parler les armes à la main,"
de votre majesté (41): "Qui vient offrir aux pieds de votre Majesté" / "Il tira des bienfaits de votre Majesté :"

If the analysis is widened to the 470 hemistichs with more than 10 occurrences, several categories of recurrence can be distinguished.
Many hemistichs are composed of a noun of three or more syllables, preceded by prepositions, conjunctions and determiners, and placed in the position of subject, noun complement or object. In this configuration, several variations around the same noun can be identified: "à votre majesté" (87 occurrences; from now on the number of occurrences of a hemistich is given in parentheses), "que votre majesté" (70), "de votre majesté" (41), "de générosité" (40), "la générosité" (26), "à ma confusion" (30), "cette confusion" (15), etc. The nouns concerned relate mainly to moral or political themes, characteristic of the dramatic writing of the 17th century. More interesting are the adverbial complements that insist on the exceptional character of the situation and on the emotional state of the speaker, thereby reinforcing the pathos of the discourse: "en cette occasion" ("on this occasion", 119), "en l'état où je suis" ("in the state I am in", 98), "en cette extrémité" (68), "en ce malheur extrême" (23), "en cette conjoncture" (22). Numerous expressions modalize the utterance: irritated insistence ("je vous l'ai déjà dit", "I have already told you", 55), certainty ("il n'en faut point douter" (37), "il n'en faut plus douter" (25)), preterition ("je ne vous dirai point" (40)). One may also note the series "pour la dernière fois" ("for the last time", 87), "une seconde fois" (51), "pour la première fois" (29), which links a dramatic situation to others, past or to come. Certain fixed syntagms have, on the contrary, a referential function: violence of relationships ("les armes à la main", "weapons in hand", 42; "un poignard dans le sein" (27)), instinct ("la voix de la nature", "the voice of nature", 19), power ("la suprême puissance" (25), "une entière puissance" (24), "un absolu pouvoir" (22)), etc. Temporal expressions are numerous, and may be associated with a general maxim describing the mores of the time ("dans le siècle où nous sommes" (17)) or with the urgency of a situation ("sans tarder davantage" (19)). The high frequency of "avant la fin du jour" ("before the end of the day", 31) shows the extent to which playwrights make the observance of the unity of time explicit in their works in order to heighten dramatic tension. Spatial expressions likewise refer either to universality ("sur la terre et sur l'onde" (16)) or, on the contrary, to the places frequently invoked in classical theatre ("dans son appartement" (20), "dans la chambre prochaine" (16)). These fixed expressions can often be regarded as "chevilles" (metrical fillers), where one clearly senses verbal invention submitting to the constraints of the metre. Two cases can be identified here. On the one hand, the semantic content of some adverbial hemistichs is sometimes very weak: "en cette occasion" or "en l'état où je suis" could just as well be deleted without harming the meaning of the text, or grafted onto any utterance. On the other hand, even when better anchored in the utterance, the fixed expressions we have noted ("la suprême puissance", "la voix de la nature") certainly owe their success to the fact that they fit easily into the mould of the alexandrine. It is here the recurrent apposition of an adjective (power will be "entière" or "suprême"), or the use of a figurative formula ("la voix de la nature" instead of "la nature"), that is justified by the constraints of versification.
It would be interesting to pursue this analysis by crossing it with Jakobson's theory of the poetic function of language, partly summed up in the following example: "Without its two dactylic words the combination 'innocent bystander' would hardly have become a hackneyed phrase." (1960: 358)

5. Verse and prose
In order to assess the specificity of poetic writing, we assembled a corpus of prose plays of the same period (11 tragedies by d'Aubignac, Baro and Puget de La Serre, and 9 comedies by Molière). (We wish to thank Bernard J. Bourque, who provided us with the digital version of his edition Abbé d'Aubignac, Pièces en prose, Tübingen, Gunter Narr Verlag, coll. «Biblio 17», 2012.) We counted the number of occurrences of each expression corresponding to a recurrent hemistich, relative to the respective sizes of the two corpora measured in number of words. Some "hemistichs" (the quotation marks are necessary here) are as frequent in verse as in prose, but there is no clear correlation between the two corpora, even though both belong to the dramatic genre. Now, the "hemistichs" found as frequently in prose as in verse, or even more frequently, are those that rest both on a single noun (long enough to fill the six syllables together with the determiners, prepositions and conjunctions preceding it) and that do not function as adverbial complements. The fact that they figure among the most frequent hemistichs in the verse corpus is simply explained by the fact that the noun in question is itself extremely frequent. By contrast, the fixed formulas that rest on an association of several terms and merely contribute a modalization are indeed over-represented in verse (for example "je vous l'ai déjà dit": 17 occurrences per million words in verse, 0 in prose; "il n'en faut point douter": 12 in verse, 0 in prose; "pour la dernière fois": 28 in verse, 9 in prose). These expressions, specific to verse theatre, thus do seem to owe their over-use to the need to pour the sentence into the mould of the alexandrine.

6. First and second hemistichs
One of the shortcomings of this approach is that it overrates the caesura to the detriment of the unity of the line, treating it as a break, a pause between two independent segments. Two pitfalls loom. On the one hand, one risks forgetting that the hemistich does not always constitute a relevant syntactic unit within a line. The playwrights of the 17th century often practise internal rejet, contre-rejet and enjambment (for example: "Le temps de cet orgueil me fera la raison", in Corneille's La Galerie du Palais). However, our project is above all lexical, not prosodic. Isolating the hemistichs is merely a way of bringing out idiomatic expressions, on the grounds that, exceptions notwithstanding, the caesura at the hemistich usually remains the strongest syntactic break in the line. Nor should one forget that elocution fuses the two hemistichs into a single movement, and that they are therefore not on the same level: a poem in alexandrines is not a sequence of hemistichs.
Here, the automatic analysis we carried out actually provides arguments in favour of the unity of the line, for it brings out several differences between first and second hemistichs, which complete and confirm Beaudouin's analyses (2002: 275-319) concerning the distribution of phonemes and morphosyntactic categories according to metrical position. They differ first of all in their repetition rate. 13% of the hemistichs placed in first position are used elsewhere in our corpus (whether in first or in second position), which is below the overall recurrence rate. By contrast, this percentage rises to 18% for hemistichs placed in second position. This divergence is easily explained by the fact that the second hemistich is subject not only to the constraint of the metre but also to that of the rhyme. If we consider the proportion of hemistichs beginning with a vocalic sound, we also observe an imbalance: 27% of first hemistichs, but 30% of second ones. The difference is small, but it seems to us to quantify the constraint posed by the presence of an e at the end of the first hemistich, which would be faulty if the second began with a consonantal sound. Thus, while the first hemistich may begin with any sound, a hemistich beginning with a vocalic sound is easier to place in second position than one beginning with a consonantal sound. Finally, the most frequent hemistichs are not the same in first and in second position. Some are used in either place ("en l'état où je suis", for example, appears 40 times in first position and 58 times in second), but a clear distribution is often observed: hemistichs modalizing the utterance come more often first ("je ne vous dirai point": 39 to 1; "je vous l'ai déjà dit": 52 to 3; "je vous le dis encor": 20 to 2), while hemistichs functioning as complements come second ("à votre majesté": 85 to 2; "de votre majesté": 40 to 1; "à mon ressentiment": 37 to 0).
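Position-dependent statistics of this kind reduce to simple counts once the hemistichs are extracted. A sketch follows, with a deliberately rough orthographic test for vocalic onsets (aspirated h and other phonetic subtleties would need real handling); firsts and seconds stand for the normalized hemistichs in first and second position.

from collections import Counter

def repetition_rate(these, others):
    # share of hemistichs in one position that occur elsewhere in the
    # corpus, in either position
    pool = Counter(these) + Counter(others)
    return sum(1 for h in these if pool[h] > 1) / len(these)

VOWELS = set("aeiouyàâéèêëîïôûù")   # rough orthographic test of the onset

def vocalic_onset_share(hems):
    return sum(h[0] in VOWELS for h in hems) / len(hems)

# repetition_rate(firsts, seconds) ~ 0.13
# repetition_rate(seconds, firsts) ~ 0.18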
Considering recurrences at the more abstract level of the syntactic motif (in the line of work of Ganascia, 2001; Longrée et al., 2008; Mellet et Longrée, 2013; Legallois et Prunet, 2015) will moreover allow us to group occurrences displaying a similar syntactic structure (« la voix de la nature », « le flambeau de la guerre », « les fruits de la victoire ») or centred on the same terms (« qu'on le/la/les fasse venir »). Finally, the relative frequency of these recurrent hemistiches strikes us as a particularly promising statistical tool for assessing the specificity of the writing style of a genre or an author, and for observing how these set phrases evolve over time.

References
Beaudouin V. (2002). Mètre et rythmes du vers classique. Corneille et Racine. Honoré Champion.
Delente É. et Legallois D. (2016). La répétition littérale dans Les Rougon-Macquart : présentation d'un phénomène peu connu. Excavatio, vol. 28.
Delente É. et Renault R. (2015). Outils et métrique : un tour d'horizon. Langages, vol. 199: 5-22.
Ganascia J.-G. (2001). Extraction automatique de motifs syntaxiques. In Maurel D. (ed.), TALN - RECITAL 2001 : 8e conférence annuelle sur le Traitement Automatique des Langues Naturelles.
Jakobson R. (1960). Closing statements: Linguistics and Poetics. In Sebeok T. A. (ed.), Style in Language. The Technology Press of MIT/John Wiley and Sons.
Legallois D. (2009). À propos de quelques n-grammes significatifs d'un corpus poétique du XIXe siècle. L'Information grammaticale, vol. 121: 46-52.
Legallois D. et Prunet A. (2015). Sequential patterns: a new corpus-based method to inform the teaching of language for specific purposes. Journal of Social Science, vol. 44: 127-140.
Longrée D., Luong X. et Mellet S. (2008). Les motifs : un outil pour la caractérisation topologique des textes. In Heiden S. et Pincemin B. (eds.), JADT 2008. 9es Journées internationales d'Analyse statistique des Données Textuelles, pp. 733-744.
Mellet S. et Longrée D. (2013). Le motif : une unité phraséologique englobante ? Étendre le champ de la phraséologie de la langue au discours. Langages, vol. 189: 65-79.
Salem A. (1987). Pratique des segments répétés. Essai de statistique textuelle. Klincksieck.
Salvador X.-L. (2016). Versification : outil d'analyse du mètre français (http://www.projetprada.fr/versification and https://gist.github.com/xavierLaurentSalvador).

«Mangiata dall'orco e tradita dalle donne» ("Eaten by the ogre and betrayed by women"). Old and new media narrate the Asia Argento affair, between storytelling and hate speech

Francesca Dragotto1, Sonia Melchiorre2
1 Università di Roma Tor Vergata – dragotto@lettere.uniroma2.it
2 Università della Tuscia – melchiorresmr@unitus.it

Abstract 1
Re-enacted and dissected in the national and international news, the narration of the rape denounced by Italian actress Asia Argento triggered a series of coming-outs revealing the violence perpetrated against other actors and actresses by prominent personalities of the Hollywood star system. Textually moulded between diffuse narration and the blink of a tweet, the story hooked the public while displaying, in the Italian media in particular, a morbid legitimation of victim blaming. Asia Argento became the object of hate speech revealing, in turn, a cultural palimpsest of lies and guilty silences deriving from stereotypes represented in comments of the crassest and basest order.
The present discussion therefore starts from a quantitative and qualitative analysis of texts, in English and Italian, reporting the story, and aims to reveal the similarities and differences between the language practices substantiating the discourse of violence. A second corpus, drawn from social networks, will also document the indignant reactions of internet users to this story and help identify the language patterns at the core of gender-based violence.

Abstract 2
Picked to the bone by national and international news coverage, the narration of the sexual violence denounced by the Italian actress Asia Argento acted as the detonator of an explosive series of coming-outs revealing analogous episodes suffered by other actresses and, to a lesser extent, actors, at the hands of prominent figures of the Hollywood star system. Poured into every textual mould between diffuse narration and the terse tweet, the plot of this affair has held, and still holds, significant media attention, feeding a debate which, in the Italian case, often proved more interested in finding reasons to legitimise victim blaming than in reconstructing the coordinates of the context, psychological above all, in which the violence allegedly took place. The object of innumerable discourses of hate, the account given by the Italian press is a useful object for investigating social sentiment towards stories of violence whose protagonists are famous people (women in particular), towards whom reactions of indignation are activated, mixed with the dregs of the basest gender stereotypes. Starting from the quantitative and qualitative analysis of a corpus of texts centred on this affair, produced in English and Italian, the authors set out to bring to light the points of contact and of separation between the different forms of news coverage, together with the lexical, semantic and pragmatic constellations that substantiated them. This analysis will be complemented by that of a second corpus, extracted from the rich social-media production of accounts, some individual, some belonging to groups known for their tireless activity of indignant commentary on current affairs. The final aim of the work is to intercept the possible linguistic and conceptual pattern of gender-based violence, testing its limits of validity within different systems and within different varieties of the same system.

1. The narration
Humiliated and offended. Such was the fate of the Italian actress Asia Argento, among the first to denounce the violence she suffered at the hands of the Hollywood film producer Harvey Weinstein. She had the courage to publicly expose her rapist together with some eighty other women who, like her, endured first a physical outrage and then an unprecedented media exposure. It is significant, from a narrative point of view, that the affair was triggered by a tweet and subsequently ricocheted through the media of the whole world. In the brief span of a tweet, Asia Argento revealed the names of all the women who had courageously denounced the violence perpetrated against them by a man who believed himself powerful and untouchable.
And so, from the victims' account springs a new narration in which the women become survivors, giving voice to their anger against a patriarchal, sexist and misogynist system, condensed into an already historic slogan: Me too, in which all the women of the world who have suffered violence recognised themselves. Two words then transubstantiated into Time's Person of the Year 2017, took flesh in the black-clad bodies of the actresses attending the 2018 Golden Globes, and finally turned into Time's Up, a refrain offered as an impulse to transform anger into a (re)constructive force, one that would, it was speculated, accompany Oprah Winfrey in a run for the White House. In Italy, meanwhile, it remains a struggle even to admit that the words used by the media in the Asia Argento case demonstrate the existence of a serious cultural problem. In our country toxic words, collectively known as hate speech, led to a shameful victim blaming of Asia Argento: a label that sounds euphemistic to Italian ears but ends up taking the form of an argumentative text from whose weave pour violence and media persecution (contemptuous, sneering irony in the best cases), directed not at the aggressors but at the victims of sexual violence. This tendency emerges clearly from even a cursory examination of texts retrievable from the web. A selection of them, in Italian and English, was collected here and then subjected to contrastive analysis. The analysis revealed a tendency to use terminology, systematically sustained by aggressive tones, that betrays a more complex collusion with a culture it would be hasty to dismiss as merely phallocentric, misogynist and shot through with locker-room male homosociality. Carrying meaning both for what and how it says and for what it leaves unsaid, the language of these texts (and of every text in general) is a door onto the ideological architecture that supports it and that supports the coordinates of those who use it: an architecture concealing a sclerotic world which, in the case at hand, pairs the aggressive attitude of the offender with the injured dignity of the offended, and which must be brought to light if we want to understand the dynamics guiding behaviour in this stretch of time in which social and communicative life is governed by the structures of social media. Pending the results of the analysis of a better-structured and more systematic corpus (with all the limits that systematicity applied to text in the cognitive sense may have), this first phase presents the most significant cores obtained by sampling. The fragments proposed were chosen because each is representative of a corpus with analogous characteristics.

1.1 Victim blaming
These are some of the questions put to Argento by G.M. Tammaro of La Stampa (15 October 2017), immediately after the actress's public denunciation.
It is hard not to recognise in them the centuries-old narrative scheme of the interrogation of the victim of violence (think, above all, of the first rape trial in history, that of 1612, directed at the painter Artemisia Gentileschi rather than at her aggressor Agostino Tassi): the text-genre consists of questions which the person who suffered the violence must answer in detail so as not to be accused of collusion with the predator.1 The elements considered relevant to the discussion are highlighted in bold.

1 http://www.lastampa.it/2017/10/15/italia/cronache/un-orco-mi-ha-mangiata-lacosa-pi-sconvolgente-i-tanti-attacchi-dalle-donnehUwq9t9TFgRHkmcjU8yhAL/pagina.html (last accessed 11/01/2018).

1. Why did you decide to reveal this story after so many years?
2. Don't you think that speaking earlier would have spared other women what you went through?
3. What hurt you the most?
4. And how do you react?
5. How did you live through these years of silence?
6. Do you still feel guilty about it?
7. What did you fear might happen to you had you pressed charges at the time of the events?
8. Fabrizio Lombardo, former head of Miramax Italia, denies having brought you to Harvey Weinstein, as you instead maintain.
9. After the first meeting in a hotel on the Côte d'Azur, did you begin a relationship with Weinstein?
10. Did Weinstein try to contact you again?
11. Did you accept?
12. What was Weinstein's attitude towards you?
13. How did your behaviour towards Weinstein change?
14. So you met on other occasions?
15. But then you decided to come forward in person: why?
16. In Italy not everyone sees it that way. Not everyone believes you. Not everyone is on your side.
17. You are also accused of having signed the petition in favour of Roman Polanski, under investigation for paedophilia.
18. Do you regret it?
19. After coming forward together with the other women and recounting what happened to you, what do you hope will happen?

Placed one after the other, the questions take the form of a narration in their own right, characterised by a constellation of terms and by a semantics centred on the victim not as such but as a witness who must provide explanations for what happened and justify her silence. The following, instead, are some of the statements made, in various capacities, by Mario Adinolfi, Vittorio Feltri and Vittorio Sgarbi, which bounced around numerous sites and newspapers across the world, among them the New Yorker first, then the Guardian and the Independent. The Guardian article reports, for example, the following words: "Far from being hailed as brave, Argento's allegations were initially treated in some Italian media outlets with a mix of scepticism and scorn", where one is struck by the pairing of brave, used by the journalist to describe Asia Argento, with the generalised attitude of scepticism and scorn (scorn also conveying the idea of refusal, the non-acceptance of something being proposed). The journalist then reports Asia Argento's words: "Here people don't understand. They'll say, 'oh it's just touching tits'. Well yeah, and this is a very grave thing for me. It is not normal. You can't touch me, I am not an object". The piece does not omit the detailed description of the violence suffered by the actress, nor Vittorio Feltri's offensive comment belittling the sexual act as merely oral sex (licking, not oral sex, in his rendering).
The most telling element of the article remains one of the journalist's concluding sentences: "For now, not a single fellow female actor who is well known has spoken out in support of her, even though the Italian film industry is rife with abuse", where rife with abuse points on the one hand to the repetition of such acts and, on the other (rife meaning 'packed full'), alludes to the collusive attitude of those who, by keeping a conspiratorial silence, fail to report them. In another article, in the Independent, again in Great Britain, Lydia Smith writes: "But she was subsequently criticised by some sections of the Italian media for not coming forward sooner about the alleged assaults, despite hesitation being common among survivors for fear of reprisals, among other reasons. […]". She then reports the interventions of Renato Farina published in Libero and his victim-blaming comments, aimed at making the victim culpable and typical of someone left far, far behind by a world moving fast:2 "Conservative newspaper Libero published an op-ed by Renato Farina, with the headline: 'First they give it away, then they whine and pretend to repent'".3

2 http://www.liberoquotidiano.it/news/opinioni/13264032/harvey-weinsteinrenato-farina-scandalo-sessuale-hollywood.html (last accessed 08/01/2018).
3 http://www.independent.co.uk/arts-entertainment/films/news/harveyweinstein-sexual-assault-asia-argento-flees-italy-public-condemn-speaking-outa8012511.html (last accessed 08/01/2018).

1.2 Hate speech
"If you report a rape in Italy, you are the whore." And again: "Only in Italy am I considered guilty of my own rape because I did not speak of it when I was 21," Asia Argento declared after the criticism and verbal aggression she received in the Italian media, including from celebrities, who insinuated or openly stated that "You can always say no...". On 13 October Asia Argento returned to the Weinstein case with a bitter tweet: "I reported a rape and for this I am considered a wh...". But the entertainment world addressed the matter in a way it is euphemistic to call prudent. "I know Asia Argento well and I respect her," said Vladimir Luxuria. "When I read that she recounted being forced into oral sex, my first reaction was solidarity. But when I read that, after suffering this violence, she made a film with him, walked the red carpet with him at Cannes and kept seeing him for five years, I told myself something didn't add up. Unfortunately, in these affairs one must have total credibility, otherwise a single piece of fake news is enough to throw everything into question: […]". Ottavia Piccolo, a much-admired stage and film actress, preferred to pass over the matter: "These things have always happened; I don't want to talk about them because I would risk saying nothing but banalities". Rita Dalla Chiesa, by contrast, tackled the subject without hesitation: "The fear of losing one's job can certainly be there. But if a person felt genuinely offended and traumatised and then, instead of fleeing, stays inside this negative circle, takes trains and planes and goes to hotel appointments, I would no longer speak of rape but of a conscious relationship". She then cited the words of Barbara Palombelli, with which she declared her agreement: "[…] You were raped? Then why do you say so years later? Too convenient. Let's not wage feminist battles over things that have nothing to do with feminism". "I must be a rare bird," Alba Parietti said instead, "but nothing of the kind has ever happened to me.
Sometimes one's attitude alone is enough to discourage a man. The central point of the problem is fear: women's eternal fear of men, of their power, of not being believed. I know powerful women managers who, when they get home, let their husbands beat them. Because this kind of attitude does not concern only show business but every working environment. With one aggravating factor: in show business you are not chasing a 1,200-euro-a-month job, but fame and success." On 26 October 2017 Guia Soncini, a columnist for the magazine Gioia, commented in the New York Times on the failure of Italian feminism, referring to the Asia Argento affair:4 "This episode is another example of my country just being male-run, sexist Italy […] This, in a country that has a total of zero national newspapers edited by women and zero female columnists in its main national papers. […] Where the reaction to Ms. Argento's account has been truly vicious has been on social media. And there, it has primarily come from women. […] What this tells us about Italian feminism isn't clear, but it's certainly ugly. […] There's something underripened about the state of feminism in my country". A pity that Soncini herself had posted a decidedly un-feminist tweet a few days earlier (10 October 2017): "I dream of a one-line piece on Weinstein. He may well be an old pig, but you all kept slinging it at him, as long as you thought it would pay", something that did not escape Asia Argento herself. The most direct attack was the one launched on Facebook by Selvaggia Lucarelli in a very long post:5 "Now. Frankly. You go to bed with a powerful lecher for years and don't say no for fear he might ruin your career. Legitimate. Twenty years later you whine in an American newspaper, recounting encounters you had as a consenting woman, which moreover took place at a more than adult age and required crossing oceans, with trips and travel arrangements to organise, painting them as 'abuse'. Less legitimate. At a glance, these are abuses a little too prolonged and planned to deserve the name. And if you are the first to say you did it so that your career would not be damaged, you are admitting you went for reasons of opportunity. Nobody is judging you, Asia Argento. But please. Champion of the victims of harassment, abuse and rape: no. Let's say you ended up in a putrid whirlpool of squalid quid pro quos and repented. Twenty years late, though."6

4 https://mobile.nytimes.com/2017/10/26/opinion/italian-feminism-asia-argentoweinstein.html?partner=IFTTT&_r=0&referer=https://t.co/pj6FLcp4Fx (last accessed 10/01/2018).
5 https://www.leggo.it/gossip/news/asia_argento_stuprata_da_weinstein_selvaggia_lucarelli_frigni_dopo_20_anni_foto_video_11_ottobre_2017-3295503.html (last accessed 10/01/2018).
6 https://www.leggo.it/spettacoli/cinema/asia_argento_weinstein_sfogo_twitter_12_ottobre_2017-3297028.html (last accessed 09/01/2018).

1.3 Stockholm syndrome
At the beginning of this year the media reported the news of yet another femicide in Italy. It was discovered, with astonishment, that the woman, burned alive by her partner, had repeatedly defended her aggressor. This attitude has a name: Stockholm syndrome, a syndrome that seems to afflict many women and whose effect should at least be weighed when explaining the reactions of the many women who responded by attributing responsibility for what happened to Argento, leaving her aggressor entirely out of it. Natalia Aspesi, feminist and woman of culture, maintained that "If you ask me for a massage in the office and I grant it, I cannot then be surprised at how it ends". And again: "Producers, at least for as long as I can remember such affairs, have always acted this way. And the girls, on the famous sofa, settled in knowingly.
They were in a hurry to arrive. And in an even greater hurry were their lawful mothers who, without scruples of any kind, happily threw onto that sofa their heiresses in search of a role, any role."7 "The exception to the rule," Aspesi adds, "is Sophia Loren, who married a producer to protect herself from unwanted attentions." Asked whether she is justifying Weinstein, she further replies: "I justify nothing. Feminism is still one of the most important missions for women all over the world, perhaps the most important of all. It is something I believed in and still believe in blindly. But it does not seem to me that these denunciations will allow it to make a decisive leap forward. I may be wrong, but I have my doubts". Her doubt: "That this is a fratricidal vendetta to get Weinstein out of the way. He was a producer as powerful as few and as filthy as very many others. That the story, known for decades, should have come out with this virulence only now, accompanied by dozens of testimonies, cannot be accidental". Completing the survey, an unsigned piece issued by ADN Kronos (13/10/2017) manages, with its title alone, to sum up the state of the controversy, Donne che odiano le donne, gogna social per Asia Argento ("Women who hate women: a social-media pillory for Asia Argento"): "[…] And in the morass of social-media comments accusing her by turns of opportunism, prostitution and sensationalism, those who strike hardest are, incredibly, above all women. Scrolling through the comments on the articles devoted to the case these days by the main newspapers, it is not difficult (indeed, it is impossible not) to run into the many insults hurled at the actress: they are written by mothers, grandmothers, girls, students, all convinced of Asia Argento's guilt, guilty in the best case, for those commenting, of having waited too long to speak or, in the worst, of having prostituted herself in exchange for a place in the Hollywood sun".8

1.4 The decision to leave Italy
"Newspapers 'slut-shamed' Asia Argento so badly over the Weinstein saga that she's leaving Italy,"9 the foreign press often reported in covering the evolution of the Asia Argento saga, judging her courageous and an inspiration to other women. Out of her homeland. "Part of the criticism from some Italian newspapers and social media users revolves around the counter-argument that these celebrities should have come forward years ago (we debunked this argument here). While these newspapers and internet users are hardly the only ones engaging in this form of victim-blaming, the violent tone used by some is alarming and astonishing […]." The piece then cites the case of Renato Farina. The reaction surprised the foreign press all the more because it had an easy point of comparison in the solidarity shown to the American actresses who had made analogous accusations against Weinstein.

7 https://www.vanityfair.it/news/approfondimenti/2017/10/11/weinsteincommento-natalia-aspesi (last accessed 11/01/2018).
8 http://www.adnkronos.com/fatti/cronaca/2017/10/13/donne-che-odiano-donnegogna-social-per-asia-argento_4KNSPMO49OoLtVvox04GWN.html
9 http://mashable.com/2017/10/18/asia-argento-harvey-weinstein-sexualharassment-slut-shaming/#YIIO i.0cNaql
When news of the voluntary expatriation reached Laura Boldrini, the President of the Chamber of Deputies addressed an appeal to the actress, asking her to desist from her intentions: «Resta in Italia, non mollare» ("Stay in Italy, don't give up").10 Long engaged in activities against violence against women, she commented from New York to the Corriere della Sera: "I have not had a chance to call Asia Argento because I am on mission in New York and Canada. But I send her this message: one must stay in Italy to strengthen solidarity among women. Asia, don't give up". She then added: "I detest the fact that Asia Argento should have to justify herself […]. This is the world upside down; it does not matter whether and when a woman decides to report an abuse. Those are her choices. The scandal is that a man of power, this Weinstein, felt free to jump on girls who wanted to work. This is the rotten system that must be uprooted". The President of the Chamber is herself, for that matter, no stranger to the web's denigrating action, which has often made her the target of avalanches of insults and violent words. Among others, Boldrini's intervention was reported by the daily Libero, which,11 on 19 October 2017, headlined Laura Boldrini: "Cara Asia Argento resta in Italia, le donne sono con te" ("Dear Asia Argento, stay in Italy, women are with you"), an article sparing in commentary but in which the language, disrespectful of gender and of Italian morphology (above all la presidenta, where the standard feminine is la presidente), communicates far more than many words would have: "'As far as harassment and rape are concerned,' la presidenta underlined, 'the problem is men and their behaviour […]'".

10 https://www.vanityfair.it/news/cronache/2017/10/19/caso-weinstein-lauraboldrini-asia-argento
11 http://www.liberoquotidiano.it/news/politica/13266009/laura-boldrini-caraasia-resta-in-italia-donne-sono-con-te-minigonna-uomini.html

2. Final considerations
Pending a scandal with the roles reversed, which, given the dominant cultural and linguistic stereotypes, would today foreshadow an entirely different kind of commentary, we shall limit ourselves to a handful of citations which, even if considerably enlarged, would not (the author is convinced) shift by a comma the state of polarisation that took shape in Italy from the very first days of the affair's diffusion: an oppositional polarisation recalling that typical of sports fandom and, more recently, of political faith, which seems to render people incapable of adopting, even provisionally, a perspective even partly different from their original one, and from which no comment seems able to escape.
For this reason, to prevent even the descriptive approach typical of text analysis from being accused of partisanship by one side or the other, the reference corpus of this work would need to be enlarged at least with: a quantitative and qualitative examination of all the tweets in Asia Argento's account, with reference to the profiles that generated them; an at least quantitative examination of the segments and contexts in which the term vittima ('victim') appears explicitly or is otherwise evoked; an examination of the contexts and forms used to speak of the offender; and the social-media activity arising from the coverage of the year's key moments concerning violence or gender claims, specifically regarding women, such as the day against violence against women or 8 March. Already carried out on a sample basis, the collection and analysis of messages shows a stubborn tendency to repeat one's own communicative acts impermeably, in content as much as in form and in the constellations of terms accompanying whatever focus is under discussion at the moment: an unmistakable sign of the position that the elements from which each constellation radiates occupy in the encyclopedia, consciousness and sensibility of the Italian-speaking linguistic community.

The what and the how of the narrative process. The combined use of Text Analysis and Network Text Analysis in the study of employment precariousness

Cristiano Felaco1, Anna Parola2
Università degli Studi di Napoli Federico II – cristiano.felaco@unina.it; anna.parola@unina.it

Abstract
This paper presents analytic procedures for the joint use of Text Analysis and Network Text Analysis. Text Analysis detects the main themes in the narrations and hence the processes of signification; Network Text Analysis tracks the relations between the linguistic expressions of a text, thereby identifying the paths along which thought flows. Used jointly, the two methods make it possible not only to explore the content of narrations but also, starting from the words and concepts with the greatest semantic weight, to identify the processes of signification. To this end, we present a study of how high-school students perceive employment precariousness in Italy. The lexical corpus was built from narrations collected from 2013 to 2016 on the Repubblica blog "Microfono Aperto".

Riassunto
This work presents the analytic procedures for a joint use of Text Analysis and Network Text Analysis techniques. The first captures the main themes addressed in the narrations and hence the processes of signification; the second traces the relations between the linguistic expressions of a text, identifying the paths of the flows of thought. The combined use of the two techniques thus makes it possible not only to explore the contents of the narrations but also, working on the words and concepts with the greatest semantic weight, to reconstruct the paths through which meaning is built. To this end we present a study aimed at understanding the perception of upper-secondary-school students regarding employment precariousness in Italy.
The textual corpus was created from the narrations collected from 2013 to 2016 on the Repubblica blog "Microfono Aperto".

Keywords: Thematic Analysis of Elementary Contexts; Network Text Analysis; Employment Precariousness; Students.

1. Introduction
Narration, and more specifically the act of narrating, is a process of constituting a textual weave endowed with sense and conveying meanings. Analysing texts makes it possible to grasp, on the one hand, the narrator's perception of a given topic and the process of signification attributed to the narrated experience, and, on the other, to understand the flows of thought, entering into the specifics of the words used and of their sequence. Textual statistics in the service of narrations therefore allows an in-depth recognition of the meaning of words and of the sense they carry (Bolasco, 2005). Among content-analysis techniques, the combined use of Text Analysis (TA) and Network Text Analysis (NTA) lends itself well to these purposes. If TA captures the themes addressed, the words chosen and used, and the dimensions of sense attributed to them (Lebart et al., 1998), that is, what is narrated, NTA offers a further deepening of how it is narrated. By analysing the position of words within the textual network, one can trace the words with the greatest semantic weight, thereby identifying the various paths and contexts of meaning (Hunter, 2014) through the study of the nature of the relations between terms. Starting from the assumption that the structure of relations between the words of a text may correspond to the mental models and cognitive maps deployed by the text's authors (Carley, 1997; Popping et Roberts, 1997), this method models language as a network of words and relations through the creation of a cognitive map (Popping, 2000). The concept is the (mental) nucleus represented by a term or a linguistic expression; terms can be related to one another, forming a statement. Statements sharing a common concept form an interdependent structure, thus creating a conceptual map, or textual network, made up of points (or nodes) representing the individual words (or concepts) and of lines, that is, the ties connecting them.

2. Methodology
The proposed approach first subjects the texts to a statistical analysis of textual data using the automatic-analysis software T-lab, and then analyses them from a network perspective using the software Gephi.

2.1 Pre-treatment of the texts
With the texts grouped into a single corpus, the first processing phase consists of a normalisation of the corpus and a customisation of the dictionary. The former aims to recognise words as graphic forms, which entails a transformation of the corpus (elimination of excess blank spaces, marking of apostrophes, lower-casing of capitals) and the creation of strings for multi-word expressions, sets of words whose unitary meaning cannot be derived from that of their components, leading to the creation of multiwords. The dictionary-customisation phase is carried out through the procedures of lemmatisation and disambiguation of the text, which rename the graphic forms as lemmas.
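As an illustration of this pre-treatment, here is a minimal sketch in Python, assuming spaCy's Italian model as a stand-in for T-lab's normalisation and lemmatisation (the file name and the frequency threshold are hypothetical); it also computes the lexical-tractability measures discussed in the next paragraph:

    import spacy
    from collections import Counter

    # Stand-in for T-lab's pre-treatment: lower-casing and lemmatisation
    # with spaCy's Italian model (python -m spacy download it_core_news_sm).
    nlp = spacy.load("it_core_news_sm")
    raw = open("corpus.txt", encoding="utf8").read()   # hypothetical path
    lemmas = [t.lemma_ for t in nlp(raw.lower()) if t.is_alpha]

    # Lexical-tractability checks: type/token ratio and hapax share.
    counts = Counter(lemmas)
    ttr = len(counts) / len(lemmas)                    # adequate below 0.2
    hapax_share = sum(c == 1 for c in counts.values()) / len(counts)

    # Keep only key words above a minimum occurrence threshold (here 5).
    keywords = {w for w, c in counts.items() if c >= 5}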
The disambiguation step selects homographic forms in order to disambiguate them; the lemmatisation step, starting from the recognition of forms sharing the same lexical root (lexeme) or belonging to the same lexical category, reduces every adjective and noun to the masculine singular, every verb to the present infinitive, and so on. Once this phase is complete, the lexical characteristics of the corpus are checked to assess its statistical tractability, verifying the type/token ratio (adequate below 0.2) and the hapax share (adequate below 50% for large corpora, and at slightly higher percentages for medium or small corpora). Before proceeding to the analysis, one must also review the list of keywords, created automatically by the software, together with their occurrences within the corpus, and set a minimum occurrence threshold, excluding from the analysis all words appearing fewer than n times. The choice of threshold depends on the lexical characteristics and on the size of the corpus under analysis. The keywords can thus be taken in full, reduced according to the occurrence threshold, or further reduced according to the aims of the research.

2.2 Analysis of the texts through Thematic Analysis of Elementary Contexts
The Thematic Analysis of Elementary Contexts, by means of a cluster analysis, makes it possible to construct and explore the contents of the corpus under analysis (Lancia, 2004). The clusters consist of sets of elementary contexts defined by the same patterns of keywords and are described through the lexical units that most strongly characterise those elementary contexts. The cluster analysis is performed with an unsupervised hierarchical method (the bisecting K-means algorithm), based on the co-occurrence of semantic traits. Specifically, the analysis procedure consists of: co-occurrence analysis through the creation of a context-units × lexical-units data table with presence/absence values; pre-treatment of the data via TF-IDF and scaling of each row vector to length 1 (Euclidean norm); use of the cosine and clustering via the bisecting K-means algorithm; comparative analysis with creation of the lexical-units × clusters contingency table; chi-square tests at the cluster × lexical-unit crossings. As for the partition criterion that determines the number of clusters, an algorithm is used that relies on the ratio between inter-cluster variance and total variance, taking as the optimal partition the one in which this ratio exceeds the 50% threshold. Interpreting the position the clusters occupy in the factorial space, and the words characterising them, makes it possible to identify the implicit relations organising the subjects' thought, capturing the narrator's point of view on the narrated event. The latter also comprises a series of evaluative elements, reflections, meanings and value judgements, as well as affective projections.
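A compact sketch of this clustering step, under stated assumptions: scikit-learn's TfidfVectorizer and BisectingKMeans (available from scikit-learn 1.1) stand in for T-lab's implementation, and the elementary contexts are read one per line from a hypothetical file:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize
    from sklearn.cluster import BisectingKMeans

    # One elementary context per line (hypothetical input file).
    contexts = open("contexts.txt", encoding="utf8").read().splitlines()

    # Presence/absence matrix weighted by TF-IDF.
    X = TfidfVectorizer(binary=True).fit_transform(contexts)
    X = normalize(X, norm="l2")   # each row vector scaled to Euclidean length 1

    # With unit-norm rows, Euclidean bisecting K-means ranks points
    # exactly as cosine similarity would.
    labels = BisectingKMeans(n_clusters=4, random_state=0).fit_predict(X)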
2.3 Network analysis
The second step of the analysis loads the corpus into the software Gephi. This software organises the lemmas in an adjacency matrix (lemma × lemma), allowing the creation of a 1-mode network, a useful tool for visualising the structure of relations between the lemmas, represented as circles or nodes and connected by directed lines. This technique makes it possible to grasp how the nodes are connected to one another, identifying neighbourhood zones and singling out the nodes that occupy prominent positions in particular sets or in the whole network. To this end, various centrality measures are computed, among them degree centrality, which indicates the words most frequently used in connection with other words within the narrations and in the various contexts of meaning. More precisely, the incidence of each node can be expressed both as in-degree, the number of arcs entering a point, thus identifying the so-called "predecessors" of each lexical unit, and as out-degree, the number of arcs leaving a point, showing instead its "successors". The relation between predecessors and successors within the textual network helps explain the semantic variety generated by the nodes. Another index used is betweenness centrality, a global centrality measure based on proximity, which expresses the degree to which a node stands "between" the other nodes of the graph. Nodes located in these zones of the network would exert a control function over information flows and a "passage" function, linking two or more sets of the network (Freeman, 1979). From the point of view of textual analysis, these lemmas indeed play a central role in the circulation of meanings within the network, acting as junction points from which different zones of the text connect and specific paths of meaning unwind, thereby defining the semantic variety of the narrations.
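These measures can also be reproduced outside Gephi. The following sketch assumes the networkx library and the lemma list from the pre-treatment sketch above, building a directed network from adjacent lemma pairs as one plausible operationalisation of the lemma × lemma adjacency described here:

    import networkx as nx

    # Directed 1-mode network: an edge lemma_i -> lemma_j for each adjacent pair.
    G = nx.DiGraph()
    for a, b in zip(lemmas, lemmas[1:]):      # `lemmas` from the pre-treatment sketch
        G.add_edge(a, b)

    in_deg = dict(G.in_degree())              # "absorbing" nodes: many predecessors
    out_deg = dict(G.out_degree())            # "source" nodes: many successors
    betw = nx.betweenness_centrality(G)       # junction points between text regions

    print(sorted(betw, key=betw.get, reverse=True)[:5])  # top betweenness lemmas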
3. Case study
We present a study conducted through the combined use of the two techniques, with the aim of understanding students' perception of the world of work in the Italian context. The latest available data show that Italy is among the European countries with the highest youth unemployment rate (Eurostat, 2017). Instability, precariousness and discontinuity of income make young people vulnerable to economic cycles, changing the nature and timing of the transition to the world of work and reducing the opportunities to develop satisfying life plans (Leccardi, 2006). Distrust affects the drivers of the transition, that is, the maintenance of high aspirations, the crystallisation of career goals and intensive job-search behaviour (Vuolo et al., 2012). For the study we used a source of textual data from the Repubblica blog "Microfono Aperto", on which upper-secondary-school students, between 2013 and 2016, responded to the prompt: "Four young people out of ten without a job. What do you think? Whose fault is it? What would you like to be done as soon as possible to guarantee you a dignified future?". Narrating oneself through the Net facilitates the process of reflection on oneself, on one's role and on one's relationship with what happens in the context in which the young person is inscribed. In a situation of distress over employment precariousness, the web can be a useful container for sharing the experience of precariousness, constituting an environment for the sharing and socialisation of one's experiences (Di Fraia, 2007).

3.1 Results
The corpus comprises 130 narrations (10,110 occurrences, 2,484 graphic forms, 1,590 hapaxes), with territorial provenance (north, centre, south) and type of school attended (technical-vocational institute and liceo) as descriptive variables, and satisfies the statistical criteria of tractability. The thematic analysis of elementary contexts produced four clusters (Fig. 1; Tab. 1), renamed CL1 "Looking at opportunities" (14.6%); CL2 "And the government?" (19.8%); CL3 "From dreams to crisis" (38.5%); CL4 "Looking for work, but where?" (27.1%). The narrations of the "Looking at opportunities" cluster revolve around sacrifices and opportunities; what emerges most clearly is the need for "activity", for actions put into practice in the present with a view to a better future. For this reason, the crisis is at the same time an opportunity that young people must seize to prove their abilities: "By now, from what one hears, everyone wonders about their own future. To guarantee that one day there will be more work, we must act NOW. […] But those looking for work must also fly low and settle, for the moment, for little, instead of staying at home in surrender. In my opinion young people must have the opportunity to show what they are worth, to show the world what they can be and to make everyone understand that they are capable, 'if they apply themselves', of doing any job, from the simplest to the most complex." The texts of the second cluster are more oriented towards the search for "blame" and towards a demand for solutions, chiefly from the State: "I think the State should give more room to young people, ensuring their protection and safeguarding. Members of parliament must preserve the rights and possibilities of every young person; we are the future of this state, and as such we need opportunities." The cluster "From dreams to crisis" points to the more inner dimension of being immersed in a society going through a moment of economic crisis. The students stress that the lack of work cancels dreams: "I am really worried; we all dream about what to be when we grow up, and knowing that 38.7% of young people cannot find work makes me indignant. Young people are the future, progress; they apply themselves […] We all know what the first article of our splendid constitution says, and yet it seems to be ignored. Young people must be given more chances, and our constitution taken into account, to open the doors to the future and make Italy better." The narrations of the last cluster cut across all the difficulties of looking for work (the frantic search, the firms that do not hire because of excessive taxes) and the necessity of going abroad: "Italy finds itself in a period of deep crisis, and if it does not recover economically, giving us young people back the chance to show those in charge that we have the ability and the will to work, Italy will lose all those young people and above all all those minds, who will go abroad in search of more favourable living conditions and above all greater work opportunities." The position of the descriptive variables shows a difference for territorial provenance and no difference for type of school attended.
While attending one school rather than another seems not to affect the perception of the world of work or the feelings of distrust, which are instead shared, territorial belonging carries weight. The north modality is, in terms of proximity, placed close to clusters 1 and 4, the centre to cluster 3, and the south to cluster 2. This indicates that students from the north tend more to problematise the phenomenon of precariousness and the difficult search for work, also stressing the opportunities young people have to prove their worth; the themes of those from the south lean more towards blaming the context, in line with the greater resonance of the discussion topic owing to the high incidence of youth unemployment; the narrations of students from the centre, instead, appeal more to their own inner experience.

Figure 1: Cluster analysis.

The resulting network comprises 259 nodes and 414 ties. A first detailed visualisation of the structure of relations between the lemmas shows the highest levels of degree centrality, with lavoro (work), giovani (young people), futuro (future), problema (problem) and possibilità (possibility) as the nodes with the most connections. Moreover, these same nodes also show the highest in-degree centrality values, "absorbing" nodes with more incoming than outgoing ties than any other point; the students tend to direct their discourse, and more generally the flow of thought, towards themes relating to work, both in terms of future possibilities and by analysing the problems tied to it. On the other hand, impegnare (to commit, in the sense of effort put in) and condizioni (conditions) represent the fulcrum from which the narration moves towards other words, "source" nodes with more outgoing than incoming ties than the remaining nodes of the network. The lemmas referring to the students' lived experience, to their states of mind regarding their present condition and an uncertain future working prospect, are those that play a central role in the circulation of meanings within the network, indeed showing the highest betweenness centrality values. In particular, disoccupato (unemployed), costringere (to force), rimanere (to remain) and scoraggiare (to discourage) are the nodes acting as the main junction points from which specific paths of meaning unwind: the different zones of the network, and hence different parts of the narrations, are linked by the lemmas revolving around the theme of the precariousness of the present, a situation of constraint and deep discouragement.

Figure 2: In-degree centrality, out-degree centrality and betweenness centrality.

4. Conclusions
The mixed use of TA and NTA makes it possible to draw a synthetic picture of the semantic structure, to understand what is being talked about but also how: the choice of words and the very order in which an idea or opinion on the topic at hand is presented. The joint use of the two techniques provides: (a) a synthesis of the information contained in the narrations; (b) an analysis of the themes addressed; (c) a focus on the structuring of sentences in terms of relations between lemmas. It thus allows thematic and content categories to be related as a latent structure, retracing the discursive process backwards.

Bibliography
Bolasco S. (2005). Statistica testuale e text mining: alcuni paradigmi applicativi. Quaderni di Statistica, vol. 7: 1-37.
Carley K.M. (1997). Extracting team mental models through textual analysis. Journal of Organizational Behavior, 18(1): 533-558.
Di Fraia G., ed. (2007). Il fenomeno blog. Blog-grafie: identità narrative in rete. Milano: Guerini e Associati.
Eurostat (2017). Statistics on young people neither in employment nor in education or training. Report.
Freeman L.C. (1979). Centrality in social networks: conceptual clarification. Social Networks, vol. 1: 215-239.
Hunter S. (2014). A novel method of network text analysis. Open Journal of Modern Linguistics, vol. 4(2): 350-366.
Lancia F. (2004). Strumenti per l'analisi dei testi. Milano: Franco Angeli.
Lebart L., Salem A. and Berry L. (1998). Exploring Textual Data. Dordrecht: Kluwer Academic Publishers.
Leccardi C. (2006). Redefining the future: youthful biographical constructions in the 21st century. New Directions for Child and Adolescent Development, vol. 113: 37-48.
Popping R. (2000). Computer-assisted Text Analysis. London: Sage.
Popping R. and Roberts C.W. (1997). Network approaches in text analysis. In Klar R. and Opitz O. (eds.), Classification and Knowledge Organization. Berlin, New York: Springer.
Vuolo M., Staff J. and Mortimer J.T. (2012). Weathering the great recession: psychological and behavioral trajectories in the transition from school to work. Developmental Psychology, vol. 48(6): 1759.

Talking about crisis: the communications of the International Monetary Fund

Ana Nora Feldman
Universidad Nacional de Luján – anafeldman@gmail.com

Abstract
The annual reports issued by the International Monetary Fund under the title "World Economic Outlook" from 2005 to 2012 are analysed in this paper using techniques of statistical analysis of textual data. Text exploration lets us see how the IMF describes the world crisis in its reports, highlighting its strengths and weaknesses in its role as ultimate guarantor of global economic balance. Much has been debated about whether the crisis was foreseen and what position the IMF took regarding its consequences. The denial of the crisis, acknowledged only in 2010, is consistent with the mission the International Monetary Fund sees itself as carrying out, lecturing governments on how to correct their economies (Weisbrot et al., 2009), all the while ignoring that "their prescriptions failed" (Stiglitz, 2002), since their "structural adjustment policies" … "produced hunger and unrest", benefiting those with more resources while "the poor sometimes sank deeper and deeper into misery". In particular, by processing the textual corpus with the Taltac2 software, developed by Prof. Sergio Bolasco of the Università di Roma "La Sapienza", we analyse the concepts and the language associated with them, as a contribution to "a significant debate on a variety of exclusions" … "spanning the political, economic and social fields" (Sen et Kliksberg, 2007), and considering that the World Economic Outlook reports can help us understand the IMF's behaviour in the context of the financial crisis. The texts analysed are written by technicians and bureaucrats possessing a high level of expertise and a skilful command of shared codes, and they are the product of a clear intention as to how the global economic situation, and the role of the Monetary Fund (and its technicians) within it, must be read.
These reports, as will be demonstrated, fulfil the goal of preaching the hegemonic conception of markets and policies, pursuing aims tied to communication and marketing strategies in order to align public opinion, officials and governments behind that conception. Along this line, the contradictions between the more political text (the introduction and the summary) and the technical text (the body of the publication) are also shown.

Resumen
With the help of statistical analysis of textual data, we analyse the annual reports of the International Monetary Fund published under the name "Perspectivas de la Economía Mundial" (World Economic Outlook) between 2005 and 2012. The aim is to show, in the texts, the way the IMF describes the crisis, bringing to light its strengths and weaknesses in its role as ultimate guarantor of global economic balance. Much has been debated about the capacity to foresee the crisis and about the Monetary Fund's position regarding its consequences. The denial of the crisis, acknowledged only in 2010, is consistent with the mission the IMF considers itself bound to fulfil, lecturing on how governments should correct their economies (Weisbrot et al., 2009), all the while ignoring that "their prescriptions failed" (Stiglitz, 2002), since "the structural adjustment policies" … "produced hunger and unrest", benefiting those with more resources while "the poor sometimes sank deeper and deeper into misery". In particular, with the help of Taltac2, developed by Prof. Sergio Bolasco of the Università di Roma "La Sapienza", we analyse the concepts and the associated language as a contribution to "a significant debate on a variety of exclusions" … "spanning the political, economic and social fields" (Sen et Kliksberg, 2007), in order to understand the IMF's behaviour in the context of the financial crisis. The texts analysed are written by technicians and bureaucrats possessing a high level of specialisation and a skilful command of shared codes, and they are the product of a clear intention as to how the global economic situation, and the role of the Monetary Fund (and its technicians) within it, must be read. These reports, as will be demonstrated, fulfil the goal of preaching the hegemonic conception of markets and policies, pursuing aims tied to communication and marketing strategies in order to align public opinion, officials and governments behind that conception. From this perspective, the contradictions between the more political text (the introduction and the summary) and the technical text (the body of the publication) are also shown.

Keywords: textual data analysis, content analysis, political language, economic and financial crisis.

1. Introduction
The economic-financial crisis that began in the United States in 2007 and later spread to Europe and other continents was acknowledged belatedly by the International Monetary Fund (IMF). Given that the Fund's mission is to anticipate the risks arising from economic crises and to offer recommendations on mitigation mechanisms, the question that imposes itself is: why, faced with the largest financial crisis since the Great Depression of the 1930s, did the Fund ignore the crisis, avoiding the declaration of a world-scale emergency?
From a political (and discursive) point of view, by denying the crisis the IMF prevented the activation of the mechanisms designed to confront problems of such magnitude. In this work we analyse, with textual data analysis techniques, the annual reports (World Economic Outlook) published over eight years (2005-2012). Congruences and contradictions allow us to analyse, from a different point of view, the political strategies of a Monetary Fund whose image as a valid and competent resource for rescuing endangered economies has been badly worn.

2. Corpus
The criterion for choosing the period under analysis was to gather information at different moments of the crisis: starting from a "moment zero" (before its appearance), passing through the stage at which the state of affairs was acknowledged, and finally considering the most important change in the policy the IMF had pursued until then, namely the shift from the "non-interventionist" neoliberal paradigm (no state action, letting the market regulate itself) to an active policy of assistance by governments (of the United States and the European Union) to "save the main firms, companies and banks in bankruptcy" (Rapoport et Brenta, 2010). From a content-analysis perspective (Krippendorff, 1969), a comparative analysis of these reports is carried out, seeking to establish how the IMF described the crisis and which themes it associated with it. The hypothesis is that this non-neutral language and content, made of technical and political criteria, reflect the agreement of what we have called the international community "of real weight" (Feldman, 1995).

3. Eight years of International Monetary Fund discourse
We have already worked on and presented various aspects of the Monetary Fund's communications in the face of this major economic and financial crisis: discourses that depend on the Managing Director of the day, and the use of lexicometry as a tool for interpreting the reports (Feldman, 2015a and b). In this work we analyse the questions of congruence and of the political use made of these annual publications: the ambiguity of the discourse, the difficulty of foreseeing and acknowledging (or denying) the crisis, its causes and consequences, and the repeated announcements of the end of the crisis (in 2012, 2013 and 2014), which drew criticism from all the blocs of countries more or less close to the IMF. The objective, then, is to identify the positions of the International Monetary Fund over time: to understand how the IMF speaks about this crucial subject, and how it keeps silent about it, as a contribution to "a significant debate" on exclusions "spanning the political, economic and social fields" (Sen et Kliksberg, 2007). Underlying this proposal is the idea that exploring and analysing texts by means of multidimensional exploratory statistics permits "an ecological conception of the treatment of qualitative data" (Bolasco, 2007). The software used is TALTAC.

3.1. The IMF's discourse
The corpus comprises a total of 1,056,336 words (occurrences). These are long texts (more than 300 pages including graphs and tables), averaging 132,042 occurrences each. Although the distribution across years is roughly similar, the 2008 report stands out, concentrating 16% of all occurrences.
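The basic lexicometric measures behind the tables below are straightforward to compute. Here is a minimal sketch, assuming one plain-text file per annual report (the file names are illustrative):

    import re

    years = range(2005, 2013)
    reports = {y: open(f"weo_{y}.txt", encoding="utf8").read().lower()
               for y in years}                       # hypothetical file names

    total = sum(len(t.split()) for t in reports.values())
    for y, text in reports.items():
        tokens = text.split()
        share = 100 * len(tokens) / total            # share of corpus occurrences
        richness = len(set(tokens)) / len(tokens)    # crude type/token richness
        crisis = len(re.findall(r"\bcrisis\b", text))
        print(f"{y}: {share:.1f}% of occurrences, richness {richness:.3f}, "
              f"'crisis' {10_000 * crisis / len(tokens):.1f} per 10,000 words")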
Table 1 - Lexicometric analysis

Just as 2008 stands out for its length, the 2009 report is the one that displays the greatest richness of vocabulary. In our experience (Feldman, 1995), the use of an elevated number of words in a report may indicate a situation of "unease", or the use of "distorted" language: more words must be used to describe something that has not yet been agreed upon among the technicians and has therefore not been adequately conceptualised.

Table 2 - Vocabulary richness

The distribution of the form "crisis" across the years is sufficiently illustrative of the use the IMF made of it over time.

Graph 1 - Distribution of the form "crisis" over time

3.2. Two levels of analysis: the Fund's reports year by year
If we consider only the Introduction and the Executive Summary (which we shall call "political texts"), which precede the body of the technical report (more than 300 pages of text and figures) of each report (which we shall call "technical-economic texts"), the former can be considered communication pieces with a wider public reach: there is a broad range of publics that "consume" the IMF's technical documents (economic journalists, economists, the general public) but do not normally read the full reports. It is often precisely these synthetic texts that have the greater effect in shaping international public opinion. Are there, then, differences and/or inconsistencies between the reports considered as a whole and the executive summaries and introductions? Through a reading of these texts and an analysis of the main statistically significant forms, we comment on their differences and similarities. Without presaging any crisis, in both 2005 and 2006 the texts show economic coherence, given the harmony between the contents of the first part and the statistically significant forms of the technical document: INFLACIÓN (inflation), INVERSIÓN (investment), AHORRO (saving) in 2005; PRODUCTIVIDAD (productivity) and SECTORES PRODUCTIVOS (productive sectors) in 2006. In 2007, the year the crisis began, the IMF starts to speak of an "uncertain and difficult period", and the statistically significant words refer above all to VOLATILIDAD (volatility); at the same time it speaks of growth, so an economic dissonance between the two parts is registered. The 2008 report, as already noted above, concentrates 16% of the total occurrences of the corpus. Here we find a discursive/economic dissonance, with the use of many terms unusual for the IMF, such as VIVIENDA (housing) and CAMBIO CLIMÁTICO (climate change), to describe the economic situation (dissonance and/or incongruence in the use of terms, cf. Feldman, 1995). With the crisis having already broken out in 2009, and under international pressure, the IMF has to begin explaining what it neither foresaw nor announced (see Graph 1 and Table 2). We find greater dissonance between text and context, and new significant forms: DESPLOME (collapse), ALARMAS (alarms). Trying to regain political leadership in 2010, after suffering numerous criticisms for its failure to foresee the crisis, the IMF produces a report in which, between its synthetic part and the technical document, we find political coherence and economic dissonance. Among the significant forms we find CRISIS.
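The measures behind Tables 1-2 and Graph 1 (corpus size, vocabulary, and the yearly distribution of one form) can be sketched as follows. This is an illustration only, not the TALTAC procedure; the file names are hypothetical and the tokenizer deliberately crude.

# Minimal lexicometric sketch: occurrences, vocabulary size and the yearly
# count of the form "crisis", assuming the eight reports are available as
# plain-text files named weo_2005.txt ... weo_2012.txt (hypothetical names).
import re
from collections import Counter

def tokenize(text):
    # lowercase word forms, keeping Spanish accented letters
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

for year in range(2005, 2013):
    with open(f"weo_{year}.txt", encoding="utf-8") as f:
        tokens = tokenize(f.read())
    counts = Counter(tokens)
    print(year,
          "occurrences:", len(tokens),
          "vocabulary:", len(counts),
          "type/token:", round(len(counts) / len(tokens), 3),
          "'crisis':", counts["crisis"])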
From 2011, the year in which we find the greatest distance between what is read in the Introduction and Executive Summary and the content of the full report, politics reappears. Having recovered its institutional space and its raison d'être, the 2012 texts show both political and economic coherence.

5. Conclusions
The Fund makes a contradictory reading of the economic indicators, with an unclear vision of the gravity and consequences of this crisis. Content analysis of the texts (speeches and reports), using tools of textual statistics, makes it possible to chart in an irrefutable way the contradictions and silences into which the IMF falls from the first symptoms of the crisis in 2007. The concepts poured into the "Perspectivas de la Economía Mundial" reports are the product "of a curious blend of ideology and bad economics, a dogma that at times seemed barely to veil vested interests", recommending "old, inadequate solutions" with brutal effects "on the peoples of the countries advised to apply them" (Stiglitz, 2002). These recipes failed on many occasions and produced extremely grave situations in several countries. A message, a sender, an object and a mission that failed, since the IMF did not fulfil its role of preventing the world from falling once again into a new Great Depression. The texts analysed provide some clues about the motivations of this failure. The contradictions brought to light, and the attempts to deny a reality that left no doubt about the magnitude of this crisis, reinforce the idea that there is a governance problem within the Monetary Fund and other international organisations.

Table 3 - Year-by-year analysis of the coherence and dissonance of the reports

References
Bolasco S., D'Avino E. and Pavone P. (2007). Analisi dei diari giornalieri con strumenti di statistica testuale e text mining. In I tempi della vita quotidiana. Un approccio multidisciplinare all'analisi dell'uso del tempo. ISTAT, Roma.
Feldman, A. (1995). Il concetto di sviluppo umano secondo le Nazioni Unite: analisi del contenuto. In Bolasco, S., Lebart, L. and Salem, A. (eds.), JADT 1995 - Analisi statistica dei dati testuali. Roma, CISU, 2 voll.
Feldman, A. (2015a). Análisis del posicionamiento del Fondo Monetario Internacional frente a la crisis del año 2007. Revista Latinoamericana de Opinión Pública, número 6 (2016). EDUNTREF, Buenos Aires.
Feldman, A. (2015b). Text Mining Strategies applied on the annual reports of the International Monetary Fund. A look at the crisis. In ISI 2015 World Statistics Congress, Rio de Janeiro.
Krippendorff, K. (1969). Theories and Analytical Constructs. In G. Gerbner, O.R. Holsti, K. Krippendorff, W.J. Paisley and P.J. Stone (eds.), The Analysis of Communication Content. New York, John Wiley & Sons, pp. 6 ff.
Lebart, L. and Salem, A. (2008). Statistique Textuelle. Dunod, Paris.
Nemiña, P. (2009). Aportes para un esquema de análisis del comportamiento del FMI en crisis financieras a partir de su actuación durante la crisis argentina (2001-2002). Documentos de Investigación Social número 8, ISSN 1851-8788. IDAES, UNSAM, Buenos Aires.
Rapoport, M. and Brenta, N. (2010). Las grandes crisis del capitalismo contemporáneo. Capital Intelectual, Buenos Aires.
Sen, A. and Kliksberg, B. (2007). Primero la Gente. Ediciones Deusto, 9th edition, Editorial Temas, Buenos Aires, Argentina.
Weisbrot, M., Cordero, J. and Sandoval, L. (2009).
Empowering the IMF: Should Reform be a Requirement for Increasing the Fund's Resources? Center for Economic and Policy Research, Washington, D.C., United States. www.cepr.net

Brexit in the Italian and the British press: a bilingual corpus-driven analysis
Valeria Fiasco
Università Roma Tre – valeria.fiasco@gmail.com

Abstract 1 (English)
The spread of English as the Lingua Franca of international communication has given rise to meaningful language-contact phenomena in the world's languages, such as loanwords and pseudo-loanwords: words from one language (the donor language) are adopted by another language (the recipient language), sometimes becoming naturalized (Gusmani 1973). From this perspective, it is interesting to observe their behaviour in real language use. In particular, this study investigates Anglicisms and pseudo-Anglicisms found in the newspaper discourse of Brexit by way of a bilingual corpus collected from two Italian newspapers, i.e. La Repubblica and Il Corriere della Sera, and two British newspapers, i.e. The Independent and The Guardian, selected for both their authoritativeness and their extensive readership. The exit of the United Kingdom from the European Union was chosen because it is a widely covered topic in both the Italian and the British press, thus providing abundant material for comparative analysis as well as useful data for exploring linguistic variation. The topic also lent itself to building an electronic corpus, retrieved from the digital archives of the newspapers' websites, on which an automated text analysis could be carried out. The corpus includes articles collected during the periods that both preceded and followed the Brexit referendum. The analysis follows a corpus-driven methodology, namely an approach that lets hypotheses emerge from corpus observation (Tognini-Bonelli 2001). The investigation was carried out by way of the software TalTac2, and the automated text analysis proved invaluable for investigating and monitoring the newspapers' vocabulary, which included technical terms from the fields of politics, economics and finance as well as general language words. In order to design and sample a representative corpus, the parameters proposed by Biber (1993) were used to identify descriptive criteria so as to select and balance the population. The aim of this study is to get an overview of the Brexit discourse in the vocabulary and terminology of the two countries' newspapers, by using text mining to compare and categorize the whole corpus as a collection of texts and then to cluster documents on the basis of the lexical similarity of their vocabulary, so as to establish semantic fields or conceptual areas. Furthermore, by way of lexical and textual analysis, this study also investigates Anglicisms and pseudo-Anglicisms in the Italian newspapers, identifying and analyzing a list of English words used in Italian. The two British newspapers serve as a reference corpus against which the list of Anglicisms extracted from the Italian corpus is compared. The articles retrieved from the British newspapers serve to find out which words are typical of each corpus and to identify pseudo-Anglicisms, namely new words that look like English forms even though they do not exist in English or, if they do exist, have a clearly different meaning.
Lastly, the data gathered from the bilingual corpus analysis were compared with other, wider corpora available in SketchEngine and on the Brigham Young University platform, in order to make generalizations about the distribution of Anglicisms and pseudo-Anglicisms in general language corpora.

Keywords: Bilingual Corpus, Textual Analysis, Anglicism, Linguistic Interference

Abstract 2 (Italian)
The spread and consolidation of English as the lingua franca of international communication has generated meaningful language-contact phenomena such as loanwords and false loanwords, i.e. words originally born in a model language that enter another language (the replica language), to which they are sometimes assimilated and adapted (Gusmani 1973). It is therefore interesting to observe their use and behaviour in authentic texts presenting the language in its current use. This study analyses Anglicisms and false Anglicisms in the newspaper discourse of Brexit, through a corpus drawn from the Italian newspapers La Repubblica and Il Corriere della Sera and the British newspapers The Guardian and The Independent, selected for their circulation and their authoritativeness. The choice of the topic of the United Kingdom's exit from the European Union was dictated by several factors, among them the wide coverage of the subject in the Italian and British press, which made it possible to create a corpus for a comparative analysis exploring linguistic variation. Since these newspapers offer an online edition with a searchable digital archive, they are particularly suited to building a corpus that can be examined through automated text analysis. The corpus is composed of articles collected during the period preceding and following the Brexit referendum, and the methodology used for the analysis is corpus-driven, that is, an exploratory approach in which hypotheses are formulated starting from the observation of the corpus (Tognini-Bonelli 2001). The software TalTac2 and automated text analysis were extremely valuable for examining and monitoring the press lexicon, which includes technical terms of politics, economics and finance alongside words belonging to the common lexicon. To design the corpus, the parameters proposed by Biber (1993) were used with the aim of identifying descriptive criteria for selecting and balancing the population within the corpus. The objective of this research is to offer an analysis of the lexicon and terminology used in the Brexit discourse in Italian and English newspapers through text mining, so as to compare the texts that make up the corpus, categorize them and group them on the basis of lexical similarities in order to identify semantic fields and conceptual areas.
Furthermore, the lexical and textual analysis allowed the identification of Anglicisms and false Anglicisms in the Italian newspapers, while the corpus of British newspapers served as a reference corpus: the list of Anglicisms extracted from the Italian corpus was compared with the data collected in the British corpus, in order to understand which words are typical of each language and to identify false Anglicisms, that is, words that have an English form but either do not exist in the original vocabulary or, if they do, have a completely different meaning. Finally, the data collected from the analysis of the bilingual corpus were subsequently compared with other, larger corpora available on SketchEngine and on the Brigham Young University platform, with the aim of making generalizations about the distribution of Anglicisms and false Anglicisms in non-specialist corpora.

Keywords: Bilingual corpus, textual analysis, Anglicism, linguistic interference

1. Introduction
The growing influence of English on many languages in the world represents the linguistic change produced by language contact. English is used in both academic and professional settings, revealing a pervasive presence of Anglicisms in European languages (Marazzini & Petralli 2015). This situation can be traced back to economic and trade developments, as well as political and social circumstances, of the past decades. Anglo-American globalization also exerts an influence on language, with an increasing number of EFL (English as a Foreign Language) and ESL (English as a Second Language) learners and the use of English as a Lingua Franca (ELF) for international communication, giving rise to the borrowing of an increasing number of Anglicisms, which have thus become a symbol of the American lifestyle and an expression of dynamism and progress. Pulcini, Furiassi and Rodríguez González (2012:1) use the term Anglicization to describe this growing impact of lexical borrowing of English origin on vocabulary and phraseology, a process that has prompted extensive research. Lexical borrowings adapt to their receiving language in various ways, from occasional coinages to integrated words, spreading from restricted circles to broader groups until they reach the totality of the speakers of the recipient language. Gusmani (1993:28) states that there are cases of complete acclimatization in which the speakers of the recipient language become so used to the foreign word that it is perceived to be part of the recipient language, e.g. film. One of the main sources of neologisms and borrowings is newspapers and magazines, which detect emerging trends in contemporary language and coin new words in a creative fashion. According to Beccaria (1983:65), newspapers are one of the main forums of exchange between written and spoken language, where different varieties coexist, for example bureaucratic, technical and literary language. Moreover, in newspapers the interaction between general and specialized language takes place, allowing specific terms to penetrate popular culture (Cabré 1999:17).

2. Research design
This paper stems from the assumption that the linguistic interference of English on Italian brings about significant effects, giving rise to lexical borrowing phenomena like Anglicisms and false Anglicisms, especially in newspaper language.
This bilingual corpus-driven analysis describes both the Italian and the British discourse of Brexit, with the aim of analyzing its vocabulary and terminology as used in both the Italian and the British press. By way of text mining, patterns and trends can be discovered that allow us to make connections between the two languages under investigation. We can identify Brexit's main themes and get a picture of how the corpus data are shaped and subdivided into text fragments corresponding to the sections of a newspaper article (title, subtitle, summary, text). We can also investigate the linguistic interference of English on Italian and the markedness relation between the Anglicisms/pseudo-Anglicisms retrieved in the Italian newspapers and their Italian equivalent words. The exit of the United Kingdom from the European Union was chosen because it is a historic and momentous event which has been the focus of attention of numerous newspapers, thus providing abundant material to collect in the corpus. The reason behind the choice of the two languages lies in the linguistic interference phenomena they are closely involved in: English performs the role of a highly productive donor language, while Italian is a recipient language under the influence of English. The bilingual corpus is made up of articles retrieved from two Italian newspapers, i.e. La Repubblica and Il Corriere della Sera, and two British newspapers, i.e. The Independent and The Guardian. They were selected for their authoritativeness, their extensive readership and the possibility of accessing their online archives with a free subscription. Moreover, they all dealt with the Brexit issue thoroughly. The corpus was compiled by downloading and storing all the articles about Brexit published in the online versions of these newspapers from June to October 2016, that is, the period that preceded and followed the Brexit referendum. The selected articles provide a brief but detailed overview of Brexit, even though they are not representative of all of the Italian and British press. The corpus is composed of two sub-corpora, the Italian and the British one. The Italian corpus includes 42 articles from La Repubblica and 42 articles from Il Corriere della Sera, for a total of 51,158 tokens, whereas the British corpus includes 31 articles from The Guardian and 31 articles from The Independent, for a total of 49,995 tokens. A difference can be observed in the number of articles that make up the overall corpus: since the average length of the British articles was greater than that of the Italian ones, fewer of them were needed to balance the token counts. On the whole, the corpus includes 146 articles and 101,153 tokens. The corpus was designed and sampled according to the parameters proposed by Biber (1993), in order to build a representative corpus and to identify descriptive criteria for selecting and balancing the population. The issue of whether a corpus is representative and reliable is essential, because the information included in the corpus and the way it is constructed are central in the corpus-driven approach, namely a method that lets hypotheses emerge from corpus observation (Tognini-Bonelli 2001). The automated text analysis of the corpus was carried out by way of the software TalTac2, in order to investigate the newspapers' vocabulary, to observe the behaviour of Anglicisms, and to make a detailed bilingual analysis.
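As a minimal illustration of the corpus-balance bookkeeping just described (not the author's actual pipeline), token and article counts per newspaper can be tallied from a hypothetical folder of plain-text articles:

# Token and article counts for the four sub-corpora; the folder layout
# corpus/<newspaper>/*.txt and the file naming are assumptions.
import pathlib
import re

def count_tokens(path: pathlib.Path) -> int:
    return len(re.findall(r"\w+", path.read_text(encoding="utf-8")))

papers = ["repubblica", "corriere", "guardian", "independent"]
tokens = {p: 0 for p in papers}
articles = {p: 0 for p in papers}
for p in papers:
    for f in pathlib.Path("corpus", p).glob("*.txt"):
        tokens[p] += count_tokens(f)
        articles[p] += 1

print("Italian sub-corpus:", tokens["repubblica"] + tokens["corriere"], "tokens")
print("British sub-corpus:", tokens["guardian"] + tokens["independent"], "tokens")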
In order to make generalizations about the distribution of Anglicisms and pseudo-Anglicisms in general language, and to retrace their routes from/into the donor and the recipient language, other general language corpora were consulted: Sketch Engine (British National Corpus, itTenTen16 and enTenTen13) and the online corpora available on the Brigham Young University website (News on the Web – NOW, Global Web-Based English – GloWbE, TIME Magazine Corpus). Furthermore, the software Iramuteq was used to carry out the cluster analysis of both corpora, to map them and to extract the semantic associations of words according to their similarity.

3. Results
In order to identify the main themes and semantic fields of the corpus, the cluster analysis grouped its lexical content so as to maximize the similarity or the dissimilarity of different groups of words. The analysis divided the Italian and the English corpus into 4 homogeneous clusters whose topics are economics and finance or European and British politics. The output graph was a dendrogram showing the association of all the words included in the two corpora according to their similarity. It grouped the words into two clusters: the first concerns economics/finance and the second is related to politics. The percentage of words included in the Italian economics cluster equals 31%, compared with 23% in the English economics cluster. In both corpora, the words from the semantic field of economics are homogeneously distributed, e.g. bank/banca, market/mercato, growth/crescita, fund/fondo, investor/investimento, rate/tasso. As for the politics cluster, both corpora subdivide the lexical content into three clusters. In the Italian corpus, the cluster of politics generates cluster 4 (23%), grouping the words concerning British politics, and the sub-clusters 1 and 3. Sub-cluster 1 (22%) regards European politics and the Brexit referendum, e.g. Unione, europeo, UE, negoziati, uscire, trattativa, while sub-cluster 3 (23%) is related to European policies linked to political integration and post-Brexit immigration policies, e.g. difesa, migrare, integrazione, emergenza. In the English corpus, the cluster of politics generates cluster 1 (26%), which corresponds to Italian cluster 3, e.g. movement, immigration, person, European, and two sub-clusters (2 and 3) about British politics. In particular, sub-cluster 3 is about the Leave campaign, e.g. Ukip, independence, break, Farage, whereas sub-cluster 2 is about the campaign for the United Kingdom to remain in the European Union, e.g. Cameron, conservative, labour, tory. Moreover, the dendrogram also shows who the main actors of this event are: the European Union, David Cameron, Nigel Farage, Theresa May, Boris Johnson, and Jeremy Corbyn. By way of its textual analysis, the software TalTac2 also identified the words occurring within the specific text fragments into which the corpus was subdivided and labelled, i.e. headline, sub-heading, lead, body. This analysis particularly focused on the headlines. On the whole, the most frequent lexical word in both corpora, Brexit, is mainly found in the headlines and in the body of Italian newspapers, while it can only be observed in the body of the British press. The concept of "exit, leaving the European Union" mainly appears in the body of the articles in the British press, while in Italian newspapers it is predominantly found in headlines.
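The word-grouping step behind such a dendrogram can be sketched as follows. This is a generic reconstruction with scikit-learn and scipy, not Iramuteq's exact procedure, and the four example documents are invented placeholders:

# Group words by the similarity of their distribution across documents,
# then cut the hierarchy into macro-clusters, as a dendrogram-style
# analysis does (economics/finance vs politics in the study above).
from sklearn.feature_extraction.text import CountVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "bank market growth fund rate investor sterling economy",
    "market fund rate bank growth recession economy pound",
    "referendum campaign vote leave remain Farage Cameron",
    "immigration movement European negotiation vote campaign",
]
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)          # documents x words (0/1)
W = X.T.toarray()                    # each word described by its documents
Z = linkage(W, method="ward")        # hierarchical clustering of the words
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 macro-clusters
for word, lab in sorted(zip(vec.get_feature_names_out(), labels), key=lambda p: p[1]):
    print(lab, word)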
The brief exploration of the headlines starts with the key topics expressed by the nouns in both the Italian and the English corpus. The topics refer to the domain of politics, the governance of the UK, the debate and negotiations between the two parties, and the problems arising from the exit of the United Kingdom from the European Union (e.g. referendum, European Union, leader, government, campaign, support / negoziato, collasso, rischio, leader, referendum). In particular, the most recurrent nouns in both the English and the Italian headlines mirror the themes addressed in the two corpora, i.e. politics: Brexit, EU referendum, Remain, vote / Brexit, premier, uscita; economics and finance: borsa, sterlina / pound. As for the verbs describing the actions, conditions or experiences linked to Brexit, they outline a delicate and unstable situation in both corpora, e.g. to vote, to fail, to resign, to face, to divide / uscire, crollare, affrontare, rischiare, intervenire. As far as the analysis of linguistic interference is concerned, the Italian corpus includes 174 Anglicisms (types) for a total of 1,096 occurrences (tokens), about 2.1% of the corpus. Among the types there are many hapax legomena: 91 of the 174 Anglicisms, approximately 52.3% of types. The 174 Anglicisms belong to the semantic fields of politics (22.5%), economics (27.5%), general language (45.5%) and newspaper language (4.5%). The list of Anglicisms extracted from the Italian corpus was then compared with the British one to check whether they were actually used in English and how: 81 Anglicisms out of 174 were found in the English corpus. The remaining 93 Anglicisms are real English words, except for neo-premier (58.64 per million words), which can be defined as a pseudo-Anglicism. It is a loanblend, or hybrid compound (Furiassi 2010:40), formed by the English word premier and the Greek-derived prefix neo-. These two lexical elements are individually used in English, but they are not used together. The prefix neo- can be found in English compounds referring to political movements, like neo-socialist and neo-fascist, or to art and philosophy, e.g. neo-baroque, neo-Aristotelian. The use and frequency of the compound neo-premier was compared with the Italian itTenTen16 corpus on SketchEngine. This online corpus displays two variants of the compound: the hyphenated word neo-premier (0.02 per million words) and neopremier (0.02 per million words). Conversely, a search for the same word in English corpora like the BNC, enTenTen13 or the NOW corpus did not produce any results. The most frequent Anglicisms in the Italian corpus are Brexit (309 tokens, 0.6%), referendum (111 tokens, 0.22%), premier (89 tokens, 0.17%) and leader (61 tokens, 0.12%). These words are particularly frequent in the British corpus as well: Brexit (232 tokens, 0.46%), referendum (157 tokens, 0.31%), leader (71 tokens, 0.14%). In particular, the word Brexit is productive in both the English and the Italian corpus, with numerous hyphenated compounds formed with Latin and Greek prefixes or English-derived morphemes. Some of them are common to both corpora, i.e. post-Brexit (English corpus 140 per million words, Italian corpus 58.6 per million words), hard-Brexit (English corpus 80 per million words, Italian corpus 58.6 per million words), pro-Brexit (English corpus 100 per million words, Italian corpus 39.1 per million words).
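The per-million-words figures quoted above follow from a simple normalisation over the sub-corpus sizes reported earlier; for instance, 58.64 pmw of neo-premier in the Italian sub-corpus corresponds, by back-computation, to 3 raw occurrences (3 / 51,158 × 10^6 ≈ 58.64). A one-function version:

# Normalised frequency (occurrences per million words), used to compare
# forms across corpora of different sizes; sizes are those reported above,
# raw counts are back-computed from the pmw figures, not taken from the paper.
def per_million(count: int, corpus_tokens: int) -> float:
    return count / corpus_tokens * 1_000_000

ITALIAN_TOKENS = 51_158
BRITISH_TOKENS = 49_995

print(round(per_million(3, ITALIAN_TOKENS), 2))   # 58.64 pmw, cf. neo-premier
print(round(per_million(7, BRITISH_TOKENS), 2))   # 140.01 pmw, cf. post-Brexit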
Other Brexit compounds, like pre-Brexit (39.1 per million words) and dopo-Brexit (19.5 per million words), are only found in the Italian corpus, while the compound anti-Brexit (40 per million words) is only included in the English corpus. As far as the word premier is concerned, in the English corpus it shows only 1 token (20 per million words), while its synonym, prime minister, has a frequency of 119 tokens (2,380 per million words). The occurrence of this compound was then compared with larger English corpora like the BNC, where Prime Minister is written both with capital letters (85.17 per million words) and in lowercase (8.33 per million words). The word premier is also present in the BNC, occurring with a frequency of 0.23 per million words, but it mainly occurs in the semantic field of football, i.e. as a modifier of the noun league in the collocation premier league. However, it is also found in the domain of politics as a noun co-occurring with the modifiers deputy and country. Conversely, in the Italian itTenTen16 corpus in SketchEngine, premier always occurs in the semantic field of politics. Two different uses of the words premier and prime minister can thus be observed in the two languages.

4. Conclusion
The aim of this paper has been to provide an outline of the Brexit discourse in the vocabulary and terminology used by two Italian and two important British newspapers. By way of cluster analysis, Brexit's main themes have been identified: economics, finance, European and British politics, and post-Brexit immigration policies. Another characteristic explored in this paper is the distribution of words across the various sections of newspaper articles, with a focus on the headlines. The analysis showed that the nouns included in newspaper headlines refer, for the most part, to Brexit's main political issues, even though some words from the field of economics can be found as well, whereas the verbs describe the difficult circumstances that both the European Union and the United Kingdom will face. As far as Anglicisms are concerned, the investigation highlighted that even though they are often used by newspapers, they represent only about 2% of the whole corpus. This percentage conforms to the most recent studies on Anglicisms in Italian by Serianni (2015), Cortellazzo (2015) and Scarpa (2015). The Anglicisms mirror the topic subdivision of the corpus: they mainly belong to the semantic fields of economics and politics, whereas almost half of them can be classified as general language words. In the Italian corpus, only one pseudo-Anglicism has been identified, i.e. neo-premier, and its status has been confirmed by numerous general English corpora. The analysis of Brexit-related Anglicisms provides a small but interesting contribution to the research on Anglicisms; it would therefore be interesting to keep collecting data about this historical event, so as to expand the two small corpora under investigation, make them as comprehensive as possible, and carry out an even more detailed contrastive analysis.

References
Biber D. (1993). Representativeness in Corpus Design. Literary and Linguistic Computing, vol. 8 (4): 243-257.
Bolasco S. (1999). Analisi multidimensionale dei dati. Carocci.
Bolasco S. (2013). L'analisi automatica dei testi. Carocci.
Cabré Castellví M. T. (1999). Terminology: Theory, methods and applications. John Benjamins Publishing Company.
Cortellazzo M.A. (2015). Per un monitoraggio degli anglicismi incipienti. In Marazzini C., Petralli A., La lingua italiana e le lingue romanze di fronte agli anglicismi. Accademia della Crusca.
Furiassi C. (2010). False Anglicisms in Italian. Polimetrica.
Görlach M. (2001). A dictionary of European Anglicisms. Oxford University Press.
Gusmani R. (1973). Analisi del prestito linguistico. Libreria scientifica editrice.
Gusmani R. (1993). Saggi sull'interferenza linguistica. Le lettere.
Hunston S. (2002). Corpora in Applied Linguistics. Cambridge University Press.
Lenci A., Montemagni S. and Pirrelli V. (2007). Testo e computer. Elementi di linguistica computazionale. Carocci.
Marazzini C., Petralli A. (2015). La lingua italiana e le lingue romanze di fronte agli anglicismi. Accademia della Crusca.
Pulcini V., Furiassi C. and Rodríguez González F. (2012). The Anglicization of European lexis. John Benjamins.
Scarpa F. (2015). L'influsso dell'inglese sulle lingue speciali dell'italiano. Edizioni Università Trieste.
Serianni L. (2015). Per una neologia consapevole. In Marazzini C., Petralli A., La lingua italiana e le lingue romanze di fronte agli anglicismi. Accademia della Crusca.
Sinclair J. (1991). Corpus Concordance Collocation. Oxford University Press.
Tognini-Bonelli E. (2001). Corpus Linguistics at work. John Benjamins Publishing Company.

Textual analysis to promote innovation within public policy evaluation
Viviana Fini (1), Giuseppe Lucio Gaeta (2), Sergio Salvatore (3)
(1) Ospedale Apuane, Massa – vivianafini@gmail.com
(2) Università di Napoli L'Orientale – glgaeta@gmail.com
(3) Università del Salento – sergio.salvatore65@icloud.com

Abstract
This paper illustrates the contribution of textual analysis to the research activities promoted by FORMEZ PA through the REVES (Reverse Evaluation to Enhance local Strategies) pilot project (*), which aims to innovate public policy evaluation. While evaluation usually embraces a policy/project viewpoint and adopts a sort of top-down approach consistent with the flow of rules/resources from policy makers to citizens, REVES reverses this perspective. Indeed, it aims to assess public policies' performance in intercepting and supporting development strategies promoted by citizens/local actors. One of the three case studies carried out by the REVES project focuses on Melpignano, a small municipality in the Puglia Region of Southern Italy. Semi-structured interviews were carried out with a sample of twenty policy actors (national, regional and local policy designers and policy implementers as well as policy beneficiaries) linked with this municipality. By using the TLab software, textual analyses of the responses were performed in order to identify their symbolic and latent components and to understand the actors' points of view about the world and, specifically, about local development. This allowed us to assess how similar concepts, such as civic participation, innovation and community, are used with profoundly different cultural meanings by the actors, which contributes to understanding public policies' difficulties in enhancing local strategies.

Keywords: Local cultures, textual analysis, innovation within evaluation.

(*) The evaluative research was carried out within the framework of the NUVAL Project, "Actions to support the activities of the National Evaluation System and Evaluation Units", implemented by Formez PA.
The case study was carried out by Viviana Fini and Vito Belladonna, under the scientific coordination of Laura Tagle, Serafino Celano, Antonella Bonaduce and Giuseppe Lucio Gaeta. Viviana Fini carried out the cultural analysis under the supervision of Sergio Salvatore and with the contribution of Giuseppe Lucio Gaeta.

Abstract (Italian)
The article describes the contribution of cultural research, conducted through textual analysis, to the implementation of the REVES (Reverse Evaluation to Enhance local Strategies) pilot research project, promoted by FORMEZ PA with the intent of innovating public policy evaluation. While the evaluation process traditionally follows the flow of financial resources and the implementation of rules/measures by local actors, REVES proposes a reversal of perspective, intending to evaluate the performance of public policies in intercepting and enhancing the development strategies autonomously elaborated by the territories. One of the project's case studies focuses on the Apulian town of Melpignano. Semi-structured interviews were conducted with a sample of 20 policy actors (policy makers and policy implementers active at the national, regional and local level, as well as potential policy beneficiaries) connected in various capacities with the town. With the help of the TLab software, textual analyses were conducted with the objective of highlighting the latent components that orient the interviewed actors' visions of the world and of development. This made it possible to assess how similar concepts, for instance civic participation, innovation and community, are employed by the actors with profoundly different cultural meanings, which contributes to understanding why public policies find it difficult to enhance locally elaborated strategies.

Keywords: Local cultures, textual analysis, innovation within evaluation.

1. Introduction
The article reports on the cultural investigation, carried out through textual analysis, conducted to support the innovation that the REVES project brought to the field of local development policy evaluation. With a reverse accountability approach, the project asked whether and how supra-local policies have been able to grasp and enhance the demands of specific local contexts, investigating the case study of Melpignano, a municipality in the province of Lecce known in the literature for having elaborated, proposed and implemented, over the last 30 years, an innovative vision and strategy of intervention concerning local development (Attanasi et al., 2011; Parmiggiani, 2013). We discuss here the results of the cultural investigation and the advantage that textual analysis brought to the project, allowing a reading that went beyond the content of the individual interviews and making it possible to grasp how similar concepts were sometimes used by the interviewees with culturally very different meanings.

2. The cultural investigation as a premise of the evaluative research
The research work carried out through textual analysis aimed at detecting the cultural dimensions that latently shaped visions of, and interventions on, local development.
This type of investigation falls within a psychological theoretical framework of psychodynamic and psycho-cultural orientation (Carli et al., 2002; Salvatore et al., 2011), which considers the behaviours and discourses of social actors as the expression of cultural dynamics that are only partly conscious and largely unconscious, latent (Matte Blanco, 1975; Fornari, 1979; Carli et al., 2002). What actors do, say and deem salient is, according to this approach, a function of a field of latent forces, a stable system of generalized meanings, which we call culture (Carli et al., 2002; Salvatore et al., 2011). The idea of organizing the evaluative actions around the results of the cultural investigation responded to the project's need to "construct" the object of inquiry starting from a deep understanding of the motivations underlying certain outcomes, as a consequence of the presence/absence of certain initiatives. The cultural investigation made it possible to formulate hypotheses about what brought closer together, or set apart, the action models of different policy actors, allowing their discourses to be classified in relation to the cultural variability that characterizes them and that defines the scenario within which each of them, without the mediation of rational thought, moved.

2.1 Textual analysis: analytical procedure
The method used for the textual analysis is based on the principle of lexical co-occurrences as a source for reconstructing the intra-textual context. This principle was defined within linguistics (Reinert, 1986) and subsequently elaborated in a psychological key (Carli & Paniccia, 2002; Lancia, 2004). In general terms the method, using the TLab software, transforms the lexical corpus into a digital co-occurrence matrix, which is in turn submitted to a multidimensional analysis procedure that extracts the semantic clusters active in the text (i.e. clusters of words co-occurring within the same sentences, and as such indicative of patterns of meaning), which are then submitted to interpretation. The adopted procedure segments the text into Elementary Context Units (ECUs), i.e. portions of text delimited by punctuation, ranging from a minimum of 250 to a maximum of 500 characters. Through a series of operations, the textual corpus is then transformed into a digital matrix representing the text in terms of the presence/absence of lemmas in the ECUs that compose it. The resulting matrix is submitted to a combined multidimensional analysis procedure, which joins Multiple Correspondence Analysis (MCA) and Cluster Analysis (CA). The MCA extracts the ways in which lemmas associate within the ECUs (that is, their intra-ECU co-occurrences). Each factorial dimension identified by the MCA represents a pattern of co-occurrences that recurs throughout the text, or in a sufficiently large portion of it. The factorial dimensions extracted by the MCA are then used as classificatory criteria by the subsequent CA, which groups ECUs (and lemmas) on the basis of their similarity, that is, on the basis of the combinations of words as they occur in the sentences of the text. A schematic sketch of this pipeline follows.
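For illustration only, the segment-reduce-cluster pipeline just described can be approximated in a few lines. This is not TLab's implementation: the placeholder text, the chosen segmentation heuristic, and the use of truncated SVD as a stand-in for MCA are all our assumptions.

# Approximate pipeline: (1) segment text into ECUs at punctuation within a
# 250-500 character window; (2) build a binary lemma x ECU matrix;
# (3) extract factorial dimensions (SVD here, standing in for MCA);
# (4) cluster the ECUs on their factorial coordinates.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

corpus_text = " ".join([
    "Development is the outcome of local actors joining shared projects and building community.",
    "Planning defines boundaries and allocates resources according to the rules of the programme.",
    "Participation means involving beneficiaries in producing the common good they care about.",
] * 4)  # placeholder interview material

def split_ecus(text, lo=250, hi=500):
    units, buf = [], ""
    for piece in re.split(r"(?<=[.;!?])\s+", text):
        buf = (buf + " " + piece).strip()
        if len(buf) >= lo:       # close a unit once the window is filled
            units.append(buf[:hi])
            buf = ""
    if buf:
        units.append(buf)
    return units

ecus = split_ecus(corpus_text)
X = CountVectorizer(binary=True).fit_transform(ecus)     # presence/absence of lemmas
coords = TruncatedSVD(n_components=2).fit_transform(X)   # factorial coordinates
clusters = KMeans(n_clusters=min(4, len(ecus)), n_init=10).fit_predict(coords)
print(len(ecus), "ECUs ->", clusters)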
The final result of the procedure is thus the identification of clusters of sentences that are similar to one another in that they are characterized by the co-presence of the same words; or, conversely, the identification of clusters of words that are similar in that they tend to be used together in the same sentences. Because of this computational characteristic, the identified clusters lend themselves to being interpreted as thematic nuclei, in that they are characterized by reference to a sufficiently stable aggregate of words (Lancia, 2005). The output of the analysis can be considered a representation of the cultural field characterizing the specific policy context (Carli et al., 2002), in which the latent dimensions that dynamize the field (Factors) and the variability of the different ways of thinking of the interviewed subjects (Clusters) become visible.

2.2 Reference population and sample
The reference population is made up of policy actors. The sample consists of 20 subjects who in various capacities have worked on local development, with whom an in-depth interview was conducted; they are considered key figures of the context under study according to the following illustrative variables: role (politicians, citizens, technicians); type of involvement in the policy (policy makers, policy designers, implementers, beneficiaries); level of belonging (local, supra-municipal, regional, national). As this is a pilot study, a cluster sample with non-proportional quotas (Blalock Jr., 1960) was preferred to a representative sample, drawing on the actors present within the contexts, distributed roughly equally across the three variables. This choice of sample made it possible to build hypotheses, rather than verify them, by bringing out the spectrum of cultural heterogeneity present within the reference population.

3. The main results of the cultural analysis
3.1 The Factors: the main latent dimensions of the cultural field
Three main factors were extracted. Their interpretation on the cultural level follows.

First Factor - Symbolization of the process of social regulation: proceduralized operativity vs valorized belonging. Invited to speak about their own vision of development, their role in relation to it, and the policies capable of promoting it, the subjects interviewed speak, in the first instance, of the way in which they regulate the relational process with their interlocutors. On the one hand (proceduralized operativity), the development of the territory is seen as the outcome of local actors' adherence to the value frame and the actions proposed by development policies. On the other, the reference is to building a shared feeling (valorized belonging), governing and administering concrete facts concerning people's lives, giving weight to the affective value of the bonds of belonging. These are two different models of social regulation, implying two alternative visions of development: technicality as a relational model that works on the assumption of a given context (Carli et al., 1999), where development can be achieved by decree, vs a model of social regulation that works experientially, where development is conceived as endogenous development of the system (Fini et al., 2015).

Second Factor - Forms of desire: safeguarding vs achievement.
In the second instance, the subjects interviewed speak of the drive that moves their action, that is, of the form of their desire. On the one hand (safeguarding), turning the community of belonging into a myth seems to respond to the desire to remove one's own history from contingency, an operation that offers "security" in exchange for "dependence". On the other (achievement), a dialectic between identity and otherness is placed at the centre, with "hope" and "future" taking the place of "security". In both cases "community" is a central lemma, but while in the safeguarding polarity the words it co-occurs with make it look like the value and purpose of action, in the second case it appears rather as a product to be built, dialogically, between inside and outside, old and new. These are two different ways of entering into a relationship with otherness: in the first case, what is unknown is adapted to what is already known; in the second, the known is used to explore the unknown.

Third Factor - Symbolization of the demand for development: substitutive function vs integrative function. In the third instance, the subjects interviewed speak of the demand for development. On the one hand, where the aim is to make the beneficiaries conform to the rules of planning, the rules become insurmountable orders; practitioners feel their role debased to mere compliance and feel powerless. On the other, the policy beneficiaries put themselves forward entrepreneurially, keeping in mind what is relevant to them and asking for rules that allow them to move within shared expectations. Two polarized demands for development emerge: the first underlies a model we could call "substitutive" (Carli, Paniccia, 1999), which attributes a high power to the policy, assessable as a finished product, and puts powerlessness in the place of desire; the second relates to a model we could call "integrative" (Carli, Paniccia, 1999), which expresses the desire to contribute to achieving the beneficiaries' objectives, in an interpenetration of functions and choices, and which thinks in terms of processes.

3.2 The main Clusters
The Cluster Analysis identified 4 main Clusters (Table 1).

Table 1 - Elementary Contexts: CL_1: 407 of 2,504 (16.25%); CL_2: 840 of 2,504 (33.55%); CL_3: 593 of 2,504 (23.68%); CL_4: 664 of 2,504 (26.52%).

C1. The words with the most significant chi-square value (reported in brackets) for this cluster are: tema (102.4); amministrazione (100.6); aspetto (83.9); processo (68.5); economico (66.4); contesto (64.4); imprenditoriale (62.3); azione (50.6); amianto (52); costruire (49.6); impresa (48); innovazione (43.5). We named C1 "Entrepreneurial governance of innovation", for the emphasis placed on innovation, considered as a process to be governed proactively.
C2. The most representative words are: io (277.2); tu (154.7); sindaco (80.5); parlare (63.4); trovare (62.1); sentire (56.7); persona (51.7); giorno (45.5); figlio (41.6); paese (34.9); riuscire (32.6). We named C2 "Involvement in the management of public affairs", for the emphasis placed on direct and personal participation, each with their own role and subjectivity, in the governance of the common good.
C3.
The most representative words are: cooperativo (224.7); comunità (182); notte (105.3); anno (103.1); Melpignano (91.2); fare (87.8); cittadino (83.5); acqua (83); bello (78); casa (75.7); pagare (68.9); euro (63.7); Taranta (60.7). We named C3 "Community as identity", for the emphasis placed on everything that made it possible to build Melpignano as a community that recognizes itself in the management of public affairs and in the valorization of popular tradition.
C4. The most representative words are: territorio (442.4); programmazione (191.7); sviluppo (179.4); area (173.9); regione (171.1); GAL (118.6); attività (104.3); intervento (102.6); livello (90.3); vasto (86.9); Puglia (77.8); governance (75.2). We named C4 "Planning as development", for the identification of the territory with administrative boundaries and the overlap between development and various forms of planning, as if defining boundaries and planning actions were in itself a guarantee of producing development.

3.3 Discussion
Table 2 shows the Cluster-Factor relationship.

Table 2 - Cluster-Factor relationship:
CL_01: Factor 1 = -22.2374; Factor 2 = 14.7017; Factor 3 = 63.7361
CL_02: Factor 1 = 37.0788; Factor 2 = 59.5785; Factor 3 = -22.7426
CL_03: Factor 1 = 60.9616; Factor 2 = -52.9475; Factor 3 = 0
CL_04: Factor 1 = -81.5382; Factor 2 = -11.9437; Factor 3 = -30.8565

The projection of the Clusters onto the factorial space made it possible to understand how similar concepts were used by the interviewees with culturally very different meanings. This is the case, for example, of C2 (achievement / valorized belonging quadrant and substitutive function / valorized belonging quadrant). The discourses of C2 concern being active in the management of public affairs. But their different positioning on the factorial space led us to hypothesize a different vision and, consequently, a different use of the theme of civic participation, a strategic topic for the local context and for development policies, closely connected with the activation of citizens. This hypothesis directed the subsequent explorations in a targeted way, and these brought to light, under the same umbrella, very different socio-organizational micro-processes: on the one hand, the policy beneficiary seen as a subject to be involved in the production of the good, exploring and valorizing their desire (consistently with the achievement / valorized belonging quadrant); here participation is considered the outcome of a dialogical construction. On the other (substitutive function / valorized belonging quadrant), beneficiaries alternately seen as passive users of a good produced by others, or as subjects to whom sovereignty is delegated, with participation treated as an instrument for strengthening systems of belonging. This evidence made it possible to go beyond the classic distinction in the literature between top-down and bottom-up processes (Bens, 2005; Sclavi, 2002) and, in feeding results back to the local subjects, to discuss with them how the existing gap lay rather in the different ways of taking charge of the otherness of the policy beneficiary's desire.
Thanks to this type of investigation, it was also possible to grasp how themes such as innovation and community, which in the interviews emerged contiguously as two local myths that to some extent overlap, actually revealed different cultural positionings. When C1-innovation prevails (for example: inventing a tradition such as the popular music festival La Notte della Taranta; introducing separate waste collection; promoting the use of photovoltaic panels among the citizenry), the practices recounted are oriented mainly by the importance attributed to achieving objectives (proceduralized operativity / achievement quadrant) and by the need to understand how to make innovations appealing to the citizenry (proceduralized operativity / integrative function quadrant). When instead the C3-community theme prevails (for example, promoting the development of a Community Cooperative), what seems to drive action is the idea of strengthening one's own system of belonging (valorized belonging / safeguarding quadrant, and valorized belonging / substitutive function quadrant). Finally, the projection of C4 onto the factorial space, in the proceduralized operativity / safeguarding and proceduralized operativity / substitutive function quadrants, made it possible to grasp the extent to which, within this cultural configuration, planning moves detached from contexts, even where the rhetoric of the programmes provides for instruments for listening to and involving policy beneficiaries. It should also be underlined that the illustrative variables polarized mainly on the first factor, proceduralized operativity vs valorized belonging: technicians on one side and citizens/politicians on the other; policy designers on one side and policy makers/beneficiaries on the other. These polarizations suggested a cultural closeness between policy makers/politicians and beneficiaries/citizens, highlighting how local politics, unlike central politics, is in a privileged position to understand demands and interpret the needs, limits and development potential of real contexts. Implementers, on the other hand, position themselves in opposition to policy makers, beneficiaries and policy designers. This made us reflect on their difficult buffer role, between the demands of the direct interlocutors of politics (beneficiaries, policy makers) and the requirements intrinsic to the programmes.

4. Conclusions
The cultural investigation carried out through textual analysis allowed the research team to construct the object of inquiry starting from elements that would otherwise have been difficult to identify, since the contents proposed by the interviewees appeared almost identical. Being able to grasp these substantial differences from the cultural point of view allowed us to carry out observations, interviews and discussions with the local actors about what we were coming to understand that were far more targeted and interesting, also for the local subjects themselves. Herein lies the real innovation that the cultural investigation allowed the REVES project to bring to the field of local development policy evaluation.

References
Attanasi, G., Giordano, G. (2011). Eventi, cultura e sviluppo. L'esperienza de "La Notte della Taranta". Milano: Egea.
Bens, I. (2005). Facilitating with ease! Core skills for facilitators, team leaders and members, managers, consultants and trainers. San Francisco: Jossey-Bass.
Blalock, Jr., H. M. (1960). Social Statistics.
New York: McGraw-Hill Book Company.
Carli, R., Paniccia, R.M. (1999). Psicologia della formazione. Bologna: Il Mulino.
Carli, R., Paniccia, R.M. (2002). L'Analisi Emozionale del Testo. Milano: Franco Angeli.
Fini, V., Belladonna, V., Tagle, L., Celano, S., Bonaduce, A., & Gaeta, L.G. (2016). Progetto Pilota di Valutazione Locale, Studio di Caso: Comune di Melpignano. Come Stato centrale, fondazioni e Regioni possono sollecitare la progettualità locale. Retrieved at http://valutazioneinvestimenti.formez.it/sites/all/files/2_reves_rapporto_caso_melpignano.pdf
Fini, V., Salvatore, S. (in press). The fuel and the engine. A general semio-cultural psychological framework for social intervention. In S. Schliewe, N. Chaudhary & P. Marsico (Eds.), Cultural Psychology of Intervention in the Globalized World. Charlotte (NC): Information Age Publishing.
Fornari, F. (1979). I fondamenti di una teoria psicoanalitica del linguaggio. Torino: Boringhieri.
Lancia, F. (2004). Strumenti per l'analisi dei testi. Introduzione all'uso di T-LAB. Milano: Franco Angeli.
Matte Blanco, I. (1975). L'inconscio come insiemi infiniti. Saggio sulla bi-logica. Torino: Einaudi.
Parmiggiani, P. (2013). Pratiche di consumo, civic engagement, creazione di comunità. Sociologia del lavoro, 132, 97-112.
Reinert, M. (1986). Un logiciel d'analyse textuelle: ALCESTE. Cahiers de l'Analyse des Données, 3.
Salvatore, S., & Zittoun, T. (2011). Outlines of a psychoanalytically informed cultural psychology. In S. Salvatore & T. Zittoun (Eds.), Cultural Psychology and Psychoanalysis in Dialogue. Issues for Constructive Theoretical and Methodological Synergies (pp. 3-46). Charlotte, NC: Information Age.
Sclavi, M. (2002). Avventure Urbane. Progettare la città con gli abitanti. Milano: Eleuthera.

A proposal for Cross-Language Analysis: violence against women and the Web
Alessia Forciniti, Simona Balbi
University of Naples Federico II – alessia.forc@libero.it

Abstract
The aim of the paper is to investigate the mood on the Web with respect to one of the most relevant Human Rights violations, without any geographic distinction: violence against women. While the literature studying the phenomenon is growing rapidly, the field of action is still fragile, and the open questions concern the relationship between public opinion and contextual factors. As a first look at the phenomenon, we aim at mapping gender violence on the Web, in a Big Data perspective. The peculiar problem we deal with consists in analysing short documents (tweets) written in six different European languages, on the occasion of a common event: the International Day for the Elimination of Violence against Women, 25 November 2017. For our statistical analysis, we choose a multi-linguistic, cross-national perspective. The basic idea is that there are some common, language-independent structures ("concepts"), which are declined in the different national natural-language expressions ("terms"). Investigating those structures (e.g. the factors of lexical correspondence analyses separately performed on the different collections) enables a double-level analysis, trying to understand and visualise national peculiarities and commonalities. The statistical tool is given by Procrustes rotations.

Keywords: Big Data, Text Mining, Cross-national study, Procrustes rotations

1.
1. Introduction
This paper proposes a statistical-linguistic analysis of the mood on the Web in relation to a social issue of universal relevance: violence against women (European Union Agency for Fundamental Rights (FRA), 2014; ONU and United Nations Population Fund, 2016, 2017). Social media are becoming an important platform for the collective thought of society and therefore represent an interesting container of content to study. The constant growth of unstructured information on the Web makes text mining applications increasingly important for extracting knowledge about such phenomena. This work addresses public opinion on gender-based violence in Europe as a reply to a common event: the International Day for the Elimination of Violence against Women (United Nations, General Assembly, 1999), 25 November 2017. The proposed method of analysis is a multilingual, cross-national study of the contents extracted from Twitter through Web scraping techniques. The features of the data (Wu X., Wu G-Q., Zhu et al., 2014) suggest an analysis in terms of Big Data (Zielinski et al., 2012). Considering the requirements of comparative research (Finer, 1954; Lijphart, 1975), the number of case studies does not exceed six European countries: three Western countries, the United Kingdom (UK), Italy and France, and three Eastern countries, Bulgaria, the Czech Republic and Romania.

The research raises several methodological issues: it requires the treatment of multilingual corpora (tweets are written in six different languages), and not all the languages treated in this study are typical of textual data mining applications. The implications concern a careful pre-processing step (cleaning the corpora of URLs and emoticons), the absence of any package or software that includes a stop-word list for all the languages investigated here, and the choice of an appropriate system of weights for the analysis units given the nature of the data (short messages of up to 140 characters). The accuracy of these choices is crucial for the quality of the investigation. This work therefore has not only a cognitive function with respect to the phenomenon, but also represents an opportunity to test the scientific method. The cross-linguistic perspective is given by the projection onto a factorial plane of the most frequent terms for pairs of countries. In order to visualise national peculiarities and commonalities, the factors obtained in the two different natural languages are projected, pairwise, onto a common reference space through Procrustes rotations.

2. Theoretical Framework
In order to visualise the relationships between documents and between terms, a factorial approach is commonly adopted in textual data analysis. The starting point is a lexical table cross-tabulating terms and documents (in this case, terms and tweets). This study proposes a Procrustes analysis as an efficient geometric technique for aligning lexical matrices. Our research produces six lexical tables (X1, ..., X6), one per case study. There is an extremely wide multivariate analysis literature devoted to the problem of comparing and synthesising the information contained in two or more matrices. An interesting way of approaching the problem consists in comparing geometrical configurations in some Euclidean space (Gordon, 1981).
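In standard notation, which the paper leaves implicit, this comparison can be framed as the orthogonal Procrustes problem (cf. Gower, 1975): given two centred and standardized configurations X and Y of n corresponding points in p dimensions, find the rotation Q minimising

  min_{Q : Q'Q = I} || X - YQ ||_F^2 ,   with  m^2 = sum_{i=1..n} || x_i - y_i^ ||^2 ,

where y_i^ is the i-th row of the rotated matrix YQ. The statistic m^2 is the sum of the squared distances between corresponding points, the quantity plotted as "Procrustes errors" in Figure 1 below.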
In our case, Correspondence Analysis (CA) is performed on the six tables; it visualises the major themes and suggests similarities and peculiarities between countries. In order to have a measure of this similarity for a pair of countries, we can compute the sum of the squared distances between corresponding points in the two configurations (the quantity m^2 above). The data structure consists of two matrices, X (n, p) and Y (n, p). X is the lexical table having in its rows the n tweets into which the corpus is organised, and in its columns content-bearing words selected among the most frequent terms in the corpus for one country. Y is the lexical table having in its rows the n tweets and in its columns the content-bearing words selected in the natural language of the other country. Through the CA performed on each corpus, we compute the principal coordinates and create two matrices, X1 and Y1, which represent the coordinate matrices of each language. The coordinate matrices have been standardised and normalised, so that no "rescaling" factor is necessary.

3. Data extraction: Web Scraping
Social media are a potentially infinite source of user data, and Twitter is one of the most widely used social networks worldwide. Twitter is a micro-blogging service whose messages (called tweets) are up to 140 characters long. Web scraping is the process of automatically extracting data from the Web through an Application Programming Interface (API) supported by software (or by packages connected to software). For our research, data extraction was conducted with the Twitter API and R, respecting parameters common to each country: a keyword translated into the six different languages, the specification of the language, the geocode (in order to exclude urban semantics deriving from dialects or territorial slang, which change the common sense of words) and, finally, the sample size (with technical limits: it is possible to extract up to n = 3200 tweets per day). The monitoring period is a week around the International Day for the Elimination of Violence against Women, from 23 November to 30 November 2017.

4. Knowledge extraction
Setting aside a detailed description of the methodological issues involved in pre-processing the multilingual and multimedia content, the argument focuses on the results, which represent one of the most interesting developments of our proposal. One note, however, deserves attention: given the structure and length of each document (tweet), the system of weights for the elementary units is the simple term frequency, w_ij = tf_ij, the number of occurrences of term i in document j. The canonical tools of textual data analysis, such as the occurrence counts of the most frequent terms, do not in this case provide a useful means of comparing countries. Other statistical tools allow us to go deeper into the phenomenon, such as the factorial approach.

4.1. Procrustes analysis for a cross-language study
The scientific method this research intends to test is Procrustes analysis, performed by overlapping two different configurations. The configurations to be compared are two normalised CA coordinate matrices.
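A minimal R sketch of this pipeline, assuming two hypothetical lexical tables lex_it and lex_uk whose columns are matched translation pairs (the paper states the analysis was done in R, but does not name the packages):

  library(ca)     # correspondence analysis
  library(vegan)  # procrustes()

  ca_it <- ca(lex_it)   # lex_it, lex_uk: hypothetical lexical tables
  ca_uk <- ca(lex_uk)   # (rows = tweets, columns = content-bearing words)

  # Principal coordinates of the terms = standard coordinates * singular values
  pc_it <- ca_it$colcoord[, 1:2] %*% diag(ca_it$sv[1:2])
  pc_uk <- ca_uk$colcoord[, 1:2] %*% diag(ca_uk$sv[1:2])

  # Standardize so that no rescaling factor is needed, then rotate UK onto Italy
  pc_it <- scale(pc_it)
  pc_uk <- scale(pc_uk)
  fit <- procrustes(pc_it, pc_uk, symmetric = TRUE)

  plot(fit)        # the "Procrustes errors" plot (cf. Figure 1)
  residuals(fit)   # per-term residuals, e.g. the outlying "abuse"

The rows of pc_it and pc_uk must correspond one-to-one (translation pairs), since the Procrustes rotation compares the two configurations point by point.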
[Figure 1. Procrustes errors: comparison between Italy and the UK. Scatter of term points (e.g. violence, women, rights, abuse, violenza, donne, abusi, ...) on Dimension 1 x Dimension 2; only the caption and axis labels are recoverable from the original plot.]

The graphic representation allows us to observe the Procrustes errors between the two configurations: the points of Italy's normalised principal coordinate matrix and those of the United Kingdom's, where the UK matrix is the rotated one. Beyond the descriptive statistics on the residual scores, the graph shows a concentration of points of both X1 and Y1 around the origin of the axes, so we can state that there is not a wide distance between X1 and Y1. The Procrustean approach confirms the similarity suggested by the CA maps of the UK and Italy (Figure 2 and Figure 3); although the third quadrants of the Italian and UK maps suffer from a dense overlapping of statistical entities, one can note similar topics located in nearly the same position in the multidimensional space.

[Figure 2. Correspondence Analysis map for the UK.]

Furthermore, through CA it is possible to investigate language-independent structures ("concepts"), which are declined in the different national natural-language expressions ("terms"). In other words, even when terms are not exact translations of one another, from Italian to English or vice versa, the conceptual aspect does not change. Studying the vocabulary of each country, we can consider the conceptual aspect, create thematic groupings and label the clusters. The Procrustes errors and the Correspondence Analysis allow us to observe the location of the statistical entity "abuse". In the Procrustes errors plot (Figure 1) this term is distant from the other statistical units; it therefore represents a Procrustes residual. The same consideration emerges from the CA maps (Figures 2 and 3): although the word "abuse" is the direct translation of the corresponding Italian term, its location in the multidimensional space is different. The "joint terms space" (Figure 4) of the Italy-UK comparison shows that terms that are exact translations lie almost close together in the projected factorial space, e.g. "women", "violence", "international day" and "rights".

[Figure 3. Correspondence Analysis map for Italy. Axes: Dimension 1 (21.1%) x Dimension 2 (13.6%); term points include violenza, donne, genere, government, rights, abuse, ...]

Finally, confirming the Procrustes errors plot (Figure 1) and the CA maps (Figures 2 and 3), the unit "abuse" (despite being an exact translation) is more distant than the corresponding translations of the other investigated context.
The visualisations of the Procrustes Correspondence Analysis and of the "joint terms space" test the similarity between Italy and the United Kingdom from a cross-linguistic perspective. Their graphic intelligibility confirms the concordance between the two profiles with respect to public opinion on violence against women.

[Figure 4. Joint terms space, Italy-UK.]

Overall, the visualisations lead us to the assertions above, while individually they allow us to investigate specific aspects of the linguistic peculiarities. The "joint terms space" confirms the overlapping of statistical units (between countries) around the origin of the axes, as does the Procrustes errors graph. There is therefore no large difference between Italy and the UK. The closeness between "terms" of different languages located in the same reference space recalls the thematic groupings brought out by CA.

5. Conclusion and perspectives
In this paper we faced the problem of comparing corpora when one is not the translation of the other. Some investigations (e.g. the comparison between the UK and Italy) indicate that the Procrustes approach is a valid tool for cross-language study. However, the cross-national investigations carried out for all the case studies bring out some limits related to the semantics of the countries' natural-language expressions. It is possible that some terms that are natural-language expressions of one country do not coincide with the translation of the expressions of another country. For example, in the same Italy-UK case, "reformer" can indicate the political aspect that the UK expresses through terms such as "legislation" or "government". Different terms (in natural-language expressions) could thus be ascribed to common conceptual labels, since they actually belong to the same semantic category. A future perspective is to resolve these semantic problems between countries by performing an analysis focused on thematic axes.

References
Balbi and Misuraca (2006). Procrustes techniques for text mining. In Zani et al. (Eds.), Data Analysis, Classification and the Forward Search, pp. 227-234. Berlin, Heidelberg: Springer.
Bolasco, S. (1999). Analisi multidimensionale dei dati. Metodi, strategie e criteri d'interpretazione. Roma: Carocci.
Bolasco, S. (2005). Statistica testuale e text mining: alcuni paradigmi applicativi. Quaderni di Statistica, Vol. 7, pp. 1-37.
European Union (2017). Report on equality between women and men in the EU.
Feldman et al. (1998). Mining text using keyword distributions. Journal of Intelligent Information Systems, Vol. 10, Issue 3, pp. 281-300.
Finer (1954). Metodo, ambito e fini dello studio comparato dei sistemi politici. Studi politici, III, 1, pp. 26-43.
FRA, European Union Agency for Fundamental Rights (2014). Report summary: Violence against women: an EU-wide survey. Results at a glance. Publications Office of the European Union.
Gower, J. C. (1975). Generalised Procrustes analysis. Psychometrika, vol. 40: 33-51.
Lijphart (1975). The comparable-cases strategy in comparative research. Comparative Political Studies, VIII, pp. 161-174.
Wu X., Wu G-Q., Zhu et al. (2014). Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, Vol. 26, Issue 1.
Zielinski et al. (2012). Multilingual analysis of Twitter news in support of mass emergency events. Multilingual Twitter Analysis for Crisis Management.
La verbalisation des émotions
Béatrice Fracchiolla, Olinka Solène De Roger
University of Lorraine in Metz
beatrice.fracchiolla@univ-lorraine.fr; olinka-solene.de-roger8@etu.univ-lorraine.fr

Abstract
Our study concerns the correlation between the perception of negative emotions and the discursive productions used to express them. It is based on 26 transcribed oral interviews (13 men and 13 women) analysed with Lexico3. We study the way healthy volunteers react verbally to the conditioned production of negative emotions after viewing the "stop jihad" video produced by the French government and broadcast on television after the 2015 attacks. The interviews were collected between November 2016 and February 2017 within the framework of the COREV project (understanding verbal violence in reception). (Footnote 1: The COREV project (2016-2017), which allowed us to build the corpus studied here, is a collaboration of the CNRS, the University of Lorraine and the Pitié-Salpêtrière hospital, aimed at a comparative analysis of the neurophysiological, emotional and discursive responses to exposure to (verbal) violence before/after sleep and before/after waking.) In parallel, following an identical protocol, we showed another, "neutral" video to the same people in order to have a control condition. All subjects saw both videos, but in different orders, at 11-hour intervals. Following our Lexico3 analysis methodology, we extracted the linguistic data giving an overview of the emotional feelings perceived by the volunteers after viewing each (neutral or violent) video, and propose a synthetic map of them. The analysis was conducted with three tools for the statistical analysis of textual data offered by Lexico3: the search for specificities according to the partitions using the PCLC tool (Main Lexicometric Characteristics of the Corpus), concordances, and ventilation graphs by partition. The overall analysis of the results shows, first, that emotions are distributed according to the nature of the videos (neutral video: positive and/or neutral emotions; violent video: negative emotions) and that the violent video provokes longer speech than the neutral one. Second, while the intensity of perceived emotions seems to differ from person to person, we show that it is also globally correlated with the order in which the videos are shown: the responses and the construction of the discourses display a positive or negative intensity of emotions depending on which video is seen first. Likewise, the analysis seems to show that the reception of violence urges volunteers to express themselves more about their feelings: can we see here a correlation between discursive productivity and negative emotions, a form of verification of the French proverb that "happy people have nothing to say"?

Résumé
Our study concerns the correlation between the perception of negative emotions and the discursive productions that express them. It is based on 26 individual oral interviews, transcribed for analysis with Lexico3 (13 men and 13 women). We study how healthy volunteers react verbally to the conditioned production of negative emotions after viewing the government's "stop-djihad" video, broadcast on television after the 2015 attacks.
The interviews were collected between November 2016 and February 2017 within the framework of the COREV project (understanding verbal violence in reception; see footnote 1). In parallel, following an identical protocol, we showed another, "neutral" video to the same people in order to have a control condition. All subjects saw both videos, but in different orders, at 11-hour intervals. Following our Lexico3 analysis methodology, we extracted the linguistic data giving an overview of the emotional feelings perceived by the volunteers after viewing each (neutral or violent) video, and proposed a synthetic map of them. The Lexico3 analysis was conducted with three of its tools for the statistical analysis of textual data: the search for specificities according to the partitions using the PCLC tool (Principales Caractéristiques Lexicométriques du Corpus), concordances, and ventilation graphs by partition. The overall analysis of the results shows, first, that emotions are distributed according to the nature of the videos (neutral video: positive and/or neutral emotions; violent video: negative emotions) and that the violent video elicits longer speaking time than the neutral one. While the intensity of perceived emotions seems to differ from person to person, we show here that it is also relative to the order in which the videos are shown. Lexical and discursive indices allow us to verify that subjects who saw the jihad video first react with more positive emotions to the "neutral" video and, conversely, that those who saw the neutral video first react with more negative emotions during the screening of the stop-djihad video. In other words, we observe in the responses and in the construction of the discourses a correlation of positive or negative emotional intensity as a function of the video seen first. Likewise, the analysis seems to show that the reception of violence challenges the volunteers and pushes them to express themselves more about their feelings: can we see here a correlation between discursive productivity and negative emotions, a form of verification of the proverb according to which "happy people have nothing to say"?

Keywords: verbal violence, discourse analysis, emotions, textual statistical analysis, Lexico3

1. Introduction
In this study, we are interested in the way subjects confronted with violent material verbally externalise their emotions. In the experiment designed for this purpose, we worked on different types of emotional responses obtained from 26 subjects who viewed a "violent" video (the "stop-djihad" video broadcast by the French government after the 2015 attacks, hereafter video V) and a "neutral" video (on the new Languedoc-Roussillon-Midi-Pyrénées region, hereafter video N). The multimodal protocol used to collect our data was carried out in a hospital setting (in the unit of, and in collaboration with, Professor Isabelle Arnulf, neurologist, director of the sleep pathology unit of the Pitié-Salpêtrière hospital, professor of neurology at Université Pierre et Marie Curie (UPMC), laboratory ICM UMR 7225).
We collected several semi-directive individual interviews on emotional feelings before and after viewing the different videos, as well as numerous neurovegetative data. This research, supported by the CNRS interdisciplinarity mission between November 2016 and December 2017, aimed more particularly at understanding the perception of verbal violence in healthy subjects (Fracchiolla et al., 2013). The experiment allows us both to highlight some of the markers of verbal emotional externalisation and to compare the types of responses to videos V and N. The present publication deals exclusively with the verbal dimension of the externalisation of emotions, once the corpus of interviews with our subjects had been transcribed and studied with the Lexico3 software. Our approach here is more specifically to ask whether the words we use to express ourselves are in line with what we think and, above all, with the emotions we feel. Our corpus thus consists of 26 interviews divided into two groups as follows: Group 1 saw the videos in the order 1/ video N, 2/ video V; Group 2 saw the videos in the reverse order, 1/ video V, 2/ video N. (Footnote 4: One of the main research criteria was to see whether emotions were better integrated over an 11-hour interval by day or by night. All subjects therefore saw both videos twice, with 11 hours between screenings: 13 subjects in the order video V morning and evening and N evening and morning, and 13 subjects, conversely, in the order video N morning and evening and V evening and morning.)

2. Manifestations of an "emotioned" discourse
2.1. Analysis of the PCLC
The breakdown of the corpus according to the "video" partition with the PCLC tool (Main Lexicometric Characteristics of the Corpus) shows the specificities of this first partition, by video and by group. The interviewers' interventions are not included.

Table 1: Main characteristics of the "video" partition

Group 1
Part            V1 N1   V1 N2   V1 Neutral   V2 Dj1   V2 Dj2   V2 Djihad   Group 1 total
Occurrences      8295   33359        41654     7872    40191       48063           89717
Forms            1227    2926         4153     1224     3325        4549            8702
Hapax             689    1538         2227      685     1679        2364            4591
Max frequency     300    1049         1349      260     1225        1485            2834
Form               de      de           de       de       de          de              de

Group 2
Part            V1 Dj1   V1 Dj2   V1 Djihad   V2 N1   V2 N2   V2 Neutral   Group 2 total
Occurrences      12794    35405       48199    5790   36002        41792           89991
Forms             1677     2966        4643     961    3013         3974            8617
Hapax              906     1492        2398     517    1561         2078            4476
Max frequency      368     1096        1464     168    1205         1373            2837
Form                Et       Je          Je      La      Je           Je              Je

For Group 1 (N first and V second), the most frequent form is "de", whereas for Group 2 it is "je". The characteristics are roughly equivalent whichever video is screened first. Whatever the video and whatever the order, for both groups we note that the first exposure to a video provokes fewer reactions (speech = number of forms) than the second, which is a priori due to the fact that interviews 2 (evening) and 3 (next morning) contain a video-recall interview before the second screening, and are therefore longer.
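Lexico3 computes these statistics internally; as a rough illustration only, the same summary could be rebuilt in R from a hypothetical token table tok, with one row per running word and columns form and part:

  # Occurrences, forms, hapax, maximum frequency and top form per part
  pclc <- do.call(rbind, lapply(split(tok$form, tok$part), function(f) {
    tab <- table(f)
    data.frame(occurrences = length(f),
               forms       = length(tab),
               hapax       = sum(tab == 1),
               freq_max    = max(tab),
               top_form    = names(which.max(tab)))
  }))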
However, whatever the order of presentation, all subjects taken together speak more (about 7,000 occurrences more) about video V (stop-djihad) than about video N. A tendency thus emerges whereby confrontation with violence would provoke first-person ("je") speech and a greater need to talk.

2.2. Analysis of the "emotioned" lexicon
Recognised as specific, instantaneous "moments", emotions are defined as "a physical and/or psychological reaction due to a situation", whose effect can sometimes last more or less over time depending on its intensity (Colletta & Tcherkassof, 2003; see also Bourbon, 2009; Feldman et al., 2016; or Fiehler, 2002). To study the lexicon of emotions, we grouped into lists the words identified in the corpus, and checked against their concordances, as relating to the expression of four of the six basic emotions according to Ekman (1972): joy, anger, sadness and fear (here called worry). This choice of four emotions, and of the term "worry" ("inquiétude") instead of "fear" ("peur"), was made to match the BMIS tests (self-assessment scales of emotional state) administered to the volunteers before and after each video screening. The terms of the "emotioned" lexicon are gathered below by "groups of forms"; for example, agréable+ covers agréable(s)(ment). The French items are kept as found in the corpus:

Happiness/Joy: adoucit; agréable+; allégresse; ambiance+; amusé+; apaisant+; bon+; calme+; content+; désir+; emballer+; émerveillé; émouvoir+; excitant+; fière; gai+; heureux+; jaloux*; joie+; marrant+; paisible; ravi; serein+; surpris+
Anger: aberrant+; agacée+; agressé+; blasé+; chiffonne; choc/choquer+; colère; énerver+; fâcher; frappant+; furieux; haine; hard; heurté+; horreur+; horripile+; hostile+; irriter+; révolter+; saoulé
Worry/Fear: agitation+; angoissant+; anxiété+; apeuré+; crainte; effraiement*; effrayant+; flippant+; gêne+; incompréhensible+; nerveux+; perdre+; peur+; stressant+; terreur
Sadness: affecter+; affreux+; attristé+; bouleversé+; déception/déçu+; dégoût+; déprimant+; dérange+; désolant+; impuissance; malheureusement; malheureux; mélancolique; navrée; peine+; triste+

We merged positive and neutral emotions into a single group, which explains why terms such as apaisante, calme, serein are listed under "joy": they do not mean feeling joy, but their axiology is evaluated as positive because it expresses a certain emotional neutrality (Kerbrat-Orecchioni, 1980). Similarly, the term jaloux ("jealous") in the "joy" column invites questions: jealousy is normally associated with the expression of a negative desire, of the order of worry and anger; but here it conveys desire, as the context shows: "… ça faisait, ça faisait très envie et ça rendait un peu jaloux" ("it really made you want it, and it made you a bit jealous"). Here jaloux, like envie, expresses a positive desire, tending towards well-being, contrary to its intrinsic semantic axiology. Likewise, the term chiffonne ("to bother, to vex") is a negative emotion that ought rather to be placed in the worry column; but in context it corresponds here to anger (énerver would be a synonym): "… ça me, ça me chiffonne un peu de voir ce genre de, de, de vidéo à chaque fois" ("it bothers me a bit to see this kind of video every time").
Finally, the neologism effraiement*, a masculine noun built on the verb effrayer ("to frighten"), is here associated with fear, allowing us to classify it in the worry column: "un petit peu de peur et, et d'effraiement" ("a little bit of fear and, and of frightenedness"). (Footnote 5: One may question here, at a broader level, the very principle of neological creation in relation to emotional context, which can manifest itself in verbal production as well as in the body through various perturbations: stuttering, intonation, altered breathing, losing one's words, etc. (see Plantin, 2016); the creation of neologisms would be one lexical manifestation of such perturbations.) Generally speaking, a fine-grained study would require a developed, in-context analysis of all the terms listed here; this is the subject of another publication.

3. Evaluating emotions in context
The concordance analysis of the emotioned lexicon identified above reveals significant elements with the "before" sort, synthesised in the table below. These results were complemented by ventilation graphs.

Table 2: Synthesis of the adverbs and adverbial locutions accompanying expressions of emotion

                          Joy   Anger   Worry   Sadness
un (petit) peu             10      37      37        36
un (peu) plus               8       0       4         0
(encore/beaucoup) plus     20      27       8         9
aussi                       0       2       2         0
assez                       5       9       2         0
plutôt                      8       8       1         2
moins                       7       5       0         0
pas très                    8       0       0         0
pas                        12       0       0         7
très                       13       0       1         0
vraiment                    0       3       4         0
autant                      0       0       3         0
surtout                     0       0       0         4

[Figure 1: Histogram of the adverbial locutions found near expressions of emotion (relative frequencies).]

The interactional context of the study, in which interviewees are asked to evaluate the emotions they felt, generates, as can be seen, responses almost systematically accompanied by adverbs or adverbial locutions expressing a positive, equivalent or negative intensity. Significantly, we then note an accentuation of positive intensity when it comes to expressing joy ("encore/beaucoup/plus" 20 times, "très" 13 times), whereas "un (petit) peu" is omnipresent to significantly attenuate the negative emotions felt (anger, worry, sadness). The second graphic projection shows that when joy is expressed, it is expressed in a more varied way than the negative emotions. These results indicate that for the corpus under study, which concerns the reception of violent discourse, the expression of intensity corresponds to attenuation. One can see, for example, that worry and sadness are the emotions that most attract the intensity locution "un peu", which tends to restrict the intensity of the emotion perceived by the speaker (Coupin, 1995). It is also possible that this is because these are more diffuse emotions, harder to characterise clearly than joy and anger, which are fairly easy to identify when felt. This is confirmed by the fact that positive emotions are accompanied by adverbial locutions marking strong intensity (encore/beaucoup; plus and très): speakers express their joy with certainty and are not afraid to say it. Significantly, this is also the case for the expression of anger, which seems to be the emotion most characterised adverbially, both by attenuating and by intensifying elements ("un (petit) peu" 37 occ.
and "encore/beaucoup/plus" 27 occ.), which can be interpreted as expressing the fact that the volunteers are not particularly happy to find themselves exposed to video V twice, and show it in this way. Context appears fundamental here: anger is linked in one way or another to a form of powerlessness in the face of the terrorist attacks, of the images shown (which are more or less directly related, depending on the subject, to the attacks and the state of emergency) and of the situation of Syrian civilians.

[Figure 2: Ventilation graphs by partition, videos V and N.]

The ventilation graphs by video partition (V and N) show the emotions expressed by the volunteers according to the videos viewed. Negative emotions (anger, worry, sadness) are high for V; conversely, joy is fairly high for N. We note a variation in emotions between the first and second viewing of the videos: the verbalisation of negative emotions tends to decrease at the second viewing (V1 to V2), whereas positive emotions increase from V1 to V2. The reverse phenomenon is also observed: positive emotions decrease from N1 to N2, and negative ones increase from N1 to N2, as shown in the table below.

Table 3: Summary of the ventilation graphs for partitions V1 and V2

Group 1    V1 = N   V2 = DJ   |V1 - V2|      Group 2    V1 = DJ   V2 = N   |V1 - V2|
Joy           159       154           5                     245      259          14
Anger         153       215          62                     167      105          62
Worry         145       202          57                     100       43          57
Sadness        84       134          50                     124       74          50

Conclusion
The subjects' reactions show, as expected, that video V generates negative emotions and video N positive ones. However, the intensity of the emotions expressed tends to be influenced by the order in which the videos are seen: in Group 1, joy is expressed 159 times; it is expressed 259 times for N in Group 2. When volunteers see video V first, their emotional reactions seem to tend statistically in the opposite direction to those observed in the reverse order: the verbal expression of happiness tends to be higher when they see video N after V, and the expression of anger, worry and sadness is clearly lower. The study of the emotioned lexicon suggests that subjects feel more well-being when they see video N after V, as a relief or appeasement that comes after a violent scene. When video N is seen first, however, a certain emotional stress factor remains, probably because the subjects are discovering the experiment and do not know what they are going to see.

References
Bourbon, B. (2009). L'expression des émotions & des tendances dans le langage. University of Michigan Library.
Colletta, J.-M. et Tcherkassof, A. (2003). Les émotions. Cognition, langage et développement. Belgique: Mardaga.
Coupin, C. (1995). La quantification de faible degré : le couple peu/un peu et la classe des petits opérateurs. Thèse de doctorat, dir. Oswald Ducrot, EHESS.
Feldman Barrett, L., Lewis, M., Haviland-Jones, J. M. (2016). Handbook of Emotions, Fourth Edition. The Guilford Press.
Fiehler, R. (2002). How to do emotions with words: Emotionality in conversations. In Fussell, S. (ed.), The Verbal Communication of Emotions. London: Lawrence Erlbaum, pp. 87-107.
Fracchiolla, B., Moïse, C., Romain, C. et Auger, N. (2013). Violences verbales. Analyses, enjeux et perspectives.
Rennes: Presses Universitaires de Rennes.
Kerbrat-Orecchioni, C. (1980). L'énonciation. La subjectivité dans le langage. Paris: A. Colin.
Perrin, L. (2016). La subjectivité de l'esprit dans le langage. In Rabatel, A. et al. (éds), Sciences du langage et neurosciences (Actes du colloque de l'ASL 2015). Lambert-Lucas, 189-209.
Plantin, Ch. (2011). Les bonnes raisons des émotions. Principes et méthode pour l'étude du discours émotionné. Berne: Peter Lang.

Improving Collection Process for Social Media Intelligence: A Case Study
Luisa Franchina(1), Francesca Greco(2), Andrea Lucariello(3), Angelo Socal(4), Laura Teodonno(5)
(1) AIIC (Associazione Italiana esperti in Infrastrutture Critiche) President - blustarcacina@gmail.com
(2) Sapienza University of Rome - francesca.greco@uniroma1.it
(3) Hermes Bay Srl - a.lucariello@hermesbay.com
(4) Hermes Bay Srl - a.socal@hermesbay.com
(5) Hermes Bay Srl - l.teodonno@hermesbay.com

Abstract
Social Media Intelligence (SOCMINT) is a specific branch of Open Source Intelligence. Open Source Intelligence (OSINT) consists in the collection and analysis of information gathered from public, or open, sources. Social Media Intelligence allows data to be collected from social media websites (such as Facebook, Twitter, YouTube, etc.). Both OSINT and SOCMINT are based on the Intelligence Cycle. This paper aims to illustrate the advantages gained by applying text mining to the collection phase of the intelligence cycle in order to perform threat analysis. The first step in detecting information related to a specific target is to define a consistent set of keywords. Web sources are varied and characterised by different writing styles, and repeating this process manually for each source can be very inefficient and time-consuming. Text mining software was therefore used to automate the process and reach more reliable results. A partially automated procedure was developed to gather information on a specific topic from the social medium Twitter. The procedure consists in manually searching a small set of keywords to be used for a specific threat analysis. The twitteR package of R Statistics was then used to gather tweets, which were collected in a corpus and processed with the T-Lab software in order to identify a new list of keywords according to their occurrence and association. Finally, an analysis of the advantages and drawbacks of the developed method is provided.
Keywords: Social Media Intelligence, Twitter, text mining, data collection

1. Introduction
"Open Source Intelligence [OSINT] is the discipline that pertains to intelligence produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement" (Headquarters Department of the Army, 2010, p. 11-1). OSINT is mainly used in the framework of national security, by law enforcement to conduct investigations, and in the business field to gather important information. Social Media Intelligence (SOCMINT) is a specific branch of OSINT that focuses on social media. In recent years, with the spread of the Internet and the high amount of readily accessible data giving a picture of the actual state of things, the importance of OSINT and SOCMINT has grown, making them key enablers of decision and policy making. To bring the best out of such a flow of data, the intelligence process must take place as a systematic approach structured around clear steps: planning and direction; collection; processing; analysis and production; dissemination. These stages, each of which is vital, make up the Intelligence Cycle (CIA - Central Intelligence Agency, 2013). In order to automatically collect data from both the Web and social media, OSINT dashboards are being developed (Brignoli et Franchina, 2017). This paper describes the contribution provided by automated support tools in the collection phase of the Intelligence Cycle from a social medium (Twitter) on a phenomenon of interest. To capture the real essence of the available text and turn publicly collected data into valuable and reliable knowledge, text mining techniques were implemented. Text mining plays a relevant role here, as it enables the detection of meaningful patterns for extracting knowledge from textual data. As stated by Feldman and Sanger: "Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns" (Feldman et Sanger, 2007, p. 1).

2. The use of Twitter
Twitter is a popular social medium, a microblog used mainly for real-time information and communication. With social media becoming the main tool for informational exchange, in October 2017 Twitter reached about 330 million users (Statista, 2018). Twitter's specific characteristics make it particularly suitable for SOCMINT purposes. Contents can be accessed by anyone, with no need to create an account. Its users interact with short messages called "tweets", whose length is limited to 280 characters and which can be embedded, replied to, liked and unliked.
The quick nature of tweets, easily compared to SMS (Short Message Service) messaging, fosters the use of acronyms and slang and provides a real-time feel, as tweets carry the first reactions to an event. Phrasing can be simple in structure or involve a large number of hapaxes. Having become one of the most important web applications, Twitter provides a large amount of data and therefore constitutes a vital source for Social Media Intelligence. Thanks to its characteristics (potential reach, one-on-one conversation, promotional impact), Twitter has gained importance over the years in different social fields, from politics to media communication and terrorism. As a result, it is commonly considered a valuable source for monitoring social phenomena and their changing patterns.

3. Case Study
This section illustrates how text mining tools can be integrated into the SOCMINT data collection phase. The aim of the procedure is to select a suitable and limited list of keywords allowing for effective and efficient information retrieval in support of the analyst's work. In this case study, the analyst was interested in collecting tweets on the criminal and antagonist threat macro-theme, which is related to many specific topics such as critical infrastructures or telecommunications. The collection process has to identify a list of keywords able to collect the messages concerning, for example, "the criminal and antagonist threat in relation to critical infrastructures". The process can be illustrated as a cycle of four steps: selection of keywords related to the specific topic, performed by the analyst; tweet collection; text mining; and verification and definition of the keyword list (Figure 1).

[Figure 1: The four-step cycle of the automated process for Twitter data collection.]

3.1. Keywords selection
The first step is performed by the analyst and consists in defining a suitable list of words that can be used to collect tweets related to a specific theme, which in our example is critical infrastructures. To each topic X corresponds a set of keywords defining it (X1, X2, …, Xn), e.g. railway, station, airport. The topic is covered by the union of all these sets: X = X1 ∪ X2 ∪ … ∪ Xn.

3.2. Tweets collection
Once the keywords are selected, the second step consists in collecting data from the Twitter repository, e.g. using the twitteR package of R Statistics (Gentry, 2016), in order to identify the keywords that allow the collection of a certain amount of tweets, which in our example was more than one hundred per day. A word could perfectly represent the topic but be rarely used in actual messages, resulting in a small sample of tweets. The aim of this step is to find the words that allow for an effective data collection (n ≥ 100), eliminating those that are rarely used in the messages (n < 100). This makes information retrieval more effective, as the number of keywords that can be used is limited.
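As a sketch of this collection step, the call below uses the twitteR package cited above; the keyword, geocode and date values are illustrative placeholders, not the ones actually used by the authors:

  library(twitteR)
  # setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  tweets <- searchTwitter("infrastrutture critiche",   # candidate keyword (hypothetical)
                          n = 3200,                    # extraction limit per day
                          lang = "it",
                          geocode = "41.9,12.5,600km", # centre/radius territorial filter
                          since = "2018-01-01",        # hypothetical window
                          until = "2018-01-10")

  length(tweets)                                       # keep the keyword only if n >= 100
  corpus <- sapply(tweets, function(t) t$getText())    # raw texts for pre-processing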
3.3. Text Mining
After the keywords' data-collection efficacy was checked, a collection of the messages of the previous ten days, the maximum retrieval window of the Twitter repository, was performed, including retweets (49.3%). The large corpus (tokens = 284,253) of 19,491 tweets was cleaned and pre-processed with the T-Lab software (Lancia, 2017) in order to build a vocabulary (types = 19,765; hapaxes = 8,947) and a list of content words (nouns, verbs, adverbs, adjectives) (Table 1). The list of content words was then checked in order to identify new keywords and extend the list.

Table 1: The first 20 lemmas of the list (lemma and number of occurrences n)

stazione     6066    elettrico   2226    treno        1198    via           825    ferrovia     659
aeroporto    4734    nuovo       1581    regione      1025    Milano        731    repubblica   632
impianti     3605    rifiuti     1536    Zingaretti   1022    autorizzare   720    giorni       627
Roma         3337    comune      1317    aiutare       896    Italia        679    centrale     605

In order to perform a content analysis, keywords were selected. In particular, we used lemmas as keywords, filtering out lemmas with fewer than ten occurrences. Then, on the tweets-by-keywords matrix, we performed a cluster analysis with a bisecting k-means algorithm (Savaresi et Boley, 2004), limited to twenty partitions and excluding all the tweets that did not contain at least two co-occurring keywords. The eta squared value was used to evaluate and choose the optimal solution. The results of the cluster analysis show that the keyword selection criteria allow the classification of 98.53% of the tweets. The eta squared value was calculated for the partitions from 3 to 19 clusters, and it shows that the optimal solution is 13 clusters (η² = 0.19) (Figure 2; a sketch of this computation closes this subsection). The analyst then inspected the lexical profile of each cluster in order to detect the words useful for focusing data collection by means of Boolean operators. This procedure allows the identification of a short list of the most used words (about 20) with regard to both the macro-theme and the related topic. The keyword list was then further reduced to a set of five meaningful words for each intersection of the macro-theme with a specific topic. This reduction stems from the fact that using a larger number of words led to an exponential increase in the false-positive rate.

[Figure 2: Eta squared difference per partition.]

As mentioned above, although this methodology effectively extracts the most frequently used words, with regard to Twitter it is still necessary to test the keywords in order to reduce the "noise" they produce, which can never be eliminated entirely. In other words, this methodology shapes the keyword set on the basis of the redundancies in users' language, but keyword quality should still be tested in the Twitter search engine in order to reach an acceptable rate of both false positives and false negatives. These words made up the vocabulary used to identify the intersection between the macro-theme and a specific topic, i.e. in the first case "the criminal and antagonist threat with regard to critical infrastructures", in the second case "the criminal and antagonist threat with regard to telecommunications", etc. The identified words are combined with an OR relationship, for example: terrorism OR attack OR attack at station OR airport OR railway. The intersection between the cluster "criminal and antagonist threat" and the topic "critical infrastructure" is synthesised by the formula C = A ∩ B, where A is the set of tweets matching the cluster "criminal and antagonist threat", B is the set matching "critical infrastructure", and C is their intersection, i.e. "the criminal and antagonist threat with regard to critical infrastructures". Figure 3 shows an example.

[Figure 3: An example of a possible set of words defining the intersection of the cluster "criminal and antagonist threat" with the topic "critical infrastructure".]
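For the partition-evaluation step described above, a minimal R sketch of the η² computation (assuming a hypothetical numeric matrix coords of tweet coordinates and a list partitions of cluster assignments for k = 3..19) might be:

  # Eta squared: share of total variance explained by the clustering,
  # summed over the columns of the coordinate matrix.
  eta_squared <- function(coords, cluster) {
    ss <- sapply(seq_len(ncol(coords)), function(j) {
      x <- coords[, j]
      grand <- mean(x)
      between <- sum(tapply(x, cluster, function(g) length(g) * (mean(g) - grand)^2))
      total <- sum((x - grand)^2)
      c(between, total)
    })
    sum(ss[1, ]) / sum(ss[2, ])
  }

  eta <- sapply(partitions, eta_squared, coords = coords)
  diff(eta)   # the "eta squared difference per partition" of Figure 2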
3.4. Verification test
Finally, the list of keywords was tested on the Open Source Intelligence dashboard. The collected tweets were analysed in order to assess how reliably the list monitors the desired phenomena.

4. Conclusion
The developed process reflects the reliability of text mining software in supporting the information-gathering process for Social Media Intelligence purposes. The vocabulary identified for four different clusters, each covering a specific topic, is currently being tested on an advanced dashboard in order to evaluate its reliability. The role of the analyst, however, remains fundamental. The relationship between the OSINT dashboard and analysts must be complementary: the dashboard plays a key role in gathering a large amount of tweets, but the analyst's support is still necessary for choosing the suitable keywords to be uploaded into the database, in order to make information collection more effective. Indeed, an OSINT dashboard cannot understand Twitter users' metaphors and similes: the choice of keywords must be made in accordance with the monitoring targets. It should be recalled that the Italian language is very complex, and users' language may not refer to the chosen target. A practical example: some keywords that usually refer to criminal threats ("bomba", bomb, or "furto", theft) can also be used in Italian to refer to other concepts, from football to business offers ("bomba" might describe a goal scored with a powerful strike; "furto" might mean that a particular offer is overpriced). Another important issue that cannot be solved without analysts concerns ironic tweets: the dashboard collects all the information uploaded into the database, but it cannot separate tweets into ironic and non-ironic by interpretation. To conclude, as dashboards do not understand the textual meaning of words, analysts are required to complement the dashboard's capabilities, being the only ones able to interpret the specific meaning of words.

References
Brignoli, M. A., and Franchina, L. (2017). Progetto di Piattaforma di Intelligence con strumenti OSINT e tecnologie Open Source. Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), Venice, Italy, pp. 232-241.
CIA, Central Intelligence Agency (2013). Kids' Zone. https://www.cia.gov/kids-page/6-12th-grade/who-we-are-what-we-do/the-intelligence-cycle.html
Feldman, R. and Sanger, J. (2006). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
Gentry, J. (2016). R Based Twitter Client. R package version 1.1.9.
Headquarters Department of the Army (2010). FM 2-0 Intelligence: Field Manual. https://fas.org/irp/doddir/army/atp2-22-9.pdf
Lancia, F. (2017). User's Manual: Tools for text analysis. T-Lab version Plus 2017.
Savaresi, S. M. and Boley, D. L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.
Statista (2018). Twitter: number of monthly active users 2010-2017. https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/

The impact of language homophily and similarity of social position on employees' digital communication
Andrea Fronzetti Colladon, Johanne Saint-Charles, Pierre Mongeau

1. Introduction
Knowledge creation and organizational communication are fundamental assets for obtaining strategic competitive advantage (Tucker, Meyer, & Westerman, 1996), and in modern organizations most of this happens through digital communication.
We know that the way employees use digital communication can predict their engagement level (Gloor, Fronzetti Colladon, Giacomelli, Saran, & Grippa, 2017) as well as future business performance (Fronzetti Colladon & Scettri, 2017). Hence there is a need to better understand what affects employees' participation in internal communication, in order to foster its efficacy and to deliver effective messages and campaigns in the most strategic way. Based on the idea of homophily, this paper examines whether employees' participation in their organization's intranet is linked with their similarity in discourse and in network position. Communication, digital or not, encompasses both the language people use to communicate and the interactions and relationships they have (Tietze, Cohen, & Musson, 2003; White, 2011). In the last two decades, scholars have explored how people's discourse and relationships are intertwined, notably through the lens of social network analysis. (Footnote 1: Discourse is defined here as "a general term that applies to either written or spoken language that is used for some communicative purpose" (Ellis, 1999, p. 81).) Among others, these studies have shown that social relationships or interactions between people are linked to the similarity of the words and expressions they use (Basov & Brennecke, 2018; Nerghes, Lee, Groenewegen, & Hellsten, 2015; Roth & Cointet, 2010; Saint-Charles & Mongeau, 2018). Gloor and colleagues have also proposed a framework to study online social dynamics in which language plays an important role, especially with regard to the dimensions of sentiment, emotionality and complexity (Gloor et al., 2017). Such results align with the notion of homophily, the tendency to relate to others on the basis of similarities (Lazarsfeld & Merton, 1954), now acknowledged as an important factor in the constitution of social networks (Mcpherson, Smith-Lovin, & Cook, 2001). It is assumed that similarity leads to the development of relationships, since similarity is linked to attraction towards the other (Montoya & Horton, 2013). Considering digital communication, Brown, Broderick, & Lee (2007) and Yuan & Gay (2006) showed that tie strength and computer-mediated interaction increase with homophily. Most studies have explored similarities with regard to sociodemographic variables, but several authors have expanded this to a wide range of variables including attitudes, psychological traits, values, etc., as latent homophily factors (Lawrence & Shah, 2007; Shalizi & Thomas, 2011). Hence, given that interaction in digital communication happens through written text, we assume that the discourse similarity of employees' messages is a key homophilic determinant of employees' interactions in the internal digital communication network. Similarity can also be observed with regard to network position. Indeed, occupying an equivalent position in a network has been shown to lead to similar outcomes (attitudes, points of view, roles, etc.) (Borgatti & Foster, 2003; Burt, 1987). In the study of large online networks, actors' similarity in centrality has proven useful for identifying role similarity in the network (Roy, Schmid, & Tredan, 2014). According to Gloor et al. (2017), it is also important to investigate the dynamic evolution of social positions.
Rotating leaders, for example, have proven to play a very important role in online communities, supporting their growth and participation (Antonacci, Fronzetti Colladon, Stefanini, & Gloor, 2017). In sum, the "homophily phenomenon" has been widely demonstrated through the study of various types of similarity. This paper explores the phenomenon in the context of the use of an internal digital communication system in an organization, using discourse and network position similarity measures; our overall hypothesis is that the two are correlated and that both are correlated with interactions.

2. Research Design and Methodology
We analyzed the digital communications of about 1,600 employees working for a large multinational company, mainly operating in Italy. This company has a widely used intranet social network, structured as an online forum, where only employees can interact, exchanging opinions and ideas through the sharing of news and comments. We extracted and analyzed more than 23,000 posts (news and comments), written in Italian over a period of one and a half years. Users were mostly male (68%), and a small part of them also played the role of content managers (7%). The first step in our analysis was to build the social network representing the forum interactions. This network is made of N nodes, one for each forum user, and M edges. In general, there is an edge between two nodes if the corresponding employees had at least one interaction; for example, they exchanged knowledge or opinions through subsequent comments, or one answered a question of the other. We then calculated the similarity measures for both discourse and network position. Based on what was presented above, we looked at five aspects of discourse similarity: word use, sentiment, emotionality, complexity and length. Additionally, we studied employees' connectivity and interactivity, as suggested by Gloor and colleagues (2017). Length is simply calculated as the average number of characters in an employee's forum posts, after removing stop-words and punctuation via a script written in the Python programming language with the package NLTK (Perkins, 2014). Sentiment expresses the positivity or negativity of forum posts and is calculated with the machine learning algorithm included in the social network and semantic analysis software Condor (Gloor, 2017). Sentiment varies between 0 and 1, where 0 represents a totally negative post and 1 a totally positive one. Emotionality expresses the variation from neutral sentiment and is computed by Condor using the formula presented by Brönnimann (2014); posts that convey less neutral expressions, either positive or negative, are considered more emotional. Lastly, complexity represents the deviation from common language and is calculated as the probability of each word of a dictionary appearing in the forum posts (Brönnimann, 2014); when rare terms appear more often, complexity is higher. This last measure was also obtained from Condor. Concerning the study of employees' positions in the social structure, we referred to network centrality measures (Freeman, 1979), using the two well-known metrics of degree and betweenness centrality.
Degree centrality measures the number of direct links of a node, i.e. the number of people an employee interacted with in the online forum. Betweenness centrality, on the other hand, takes into account the indirect links of a node and counts how many times a social actor lies in between the paths that interconnect his/her peers. Betweenness centrality is calculated by considering the shortest network paths that interconnect every possible pair of nodes and counting how many times these paths include a specific employee (i.e. the node for which the betweenness centrality is calculated). Employees' interactivity was operationalized by calculating rotating leadership. This variable counts the oscillations in the betweenness centrality of a social actor, i.e. the number of times betweenness centrality changed, reaching local maxima or minima. If an employee maintains a static position, his/her rotating leadership is zero. On the other hand, we have rotating leaders when people oscillate between central and peripheral positions, activating or taking the lead of some conversations and then leaving space to other people in the network.

As control variables, we could access employees' gender and forum role (content manager or not). Even if gender homophily is not always supported by social network studies, it is very often used as a control variable, as it has been shown that gender can influence online social communication and behavior (Thelwall, 2008, 2009). Similarly, we control for the content manager role, as we expect different behaviors when employees have the assignment of informally moderating the intranet social network.

All the variables presented above were first calculated at the node level and subsequently transformed into similarity matrices. Like a network adjacency matrix, a similarity matrix is made of N rows and columns, where each row and column represents a specific employee. For categorical attributes (gender and being a content manager or not), we have a value of 1 in a cell of the matrix if the two corresponding employees share the same attribute (for example, they are both female), and 0 otherwise. For continuous variables, we populated the matrices with the absolute value of the differences in individual actor scores (lower values thus indicating greater similarity).

3. Results
In general, we notice a prevalence of male employees, even though most forum content managers are female (many of them working in the internal communication department, which is mostly staffed by women). Being a content manager is also associated with more central and dynamic network positions: content managers have on average higher scores of degree and betweenness centrality, and they rotate more. In other words, they interact with more people, often act as brokers of information and, in general, do not keep a static dominant position after having fostered a conversation. As described in the previous section, we measured similarity with respect to several characteristics of employees: their gender, content manager role, use of language, centrality and interactivity. Text similarity shows the strongest association with digital communication (ρ = 0.48): employees who more frequently use the same vocabulary communicate more with one another. Apart from gender and sentiment, homophily effects appear significant for all the other variables included in our study. Employees who are more similar with respect to their use of language, degree of interactivity and network position tend to interact more with one another.
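The operationalizations described above can be illustrated with a short sketch. It is a minimal illustration, not the code used in the study: the toy graph, the attribute values and the tf-idf treatment of word-use similarity (an assumption in line with the similarity and stemming works cited in the references: Huang, 2008; Jivani, 2011; Tata & Patel, 2007) are all invented for the example.

```python
# Minimal sketch of the node-level measures and similarity matrices
# described above (toy data; not the study's code or data).
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One node per forum user; an edge if two employees interacted at least once.
G = nx.Graph([("anna", "marco"), ("anna", "luca"), ("marco", "sara")])
degree = nx.degree_centrality(G)            # direct links of each node
betweenness = nx.betweenness_centrality(G)  # brokerage on shortest paths

def rotating_leadership(series):
    """Count oscillations (local maxima/minima) in a betweenness series."""
    return sum(
        (b > a and b > c) or (b < a and b < c)
        for a, b, c in zip(series, series[1:], series[2:])
    )

# Word-use similarity: tf-idf cosine over each employee's concatenated posts
# (an assumed operationalization).
posts = ["buongiorno a tutti", "buongiorno collega", "report mensile"]
text_sim = cosine_similarity(TfidfVectorizer().fit_transform(posts))

def categorical_similarity(values):
    """1 if two employees share the attribute (e.g. gender), 0 otherwise."""
    v = np.asarray(values)
    return (v[:, None] == v[None, :]).astype(int)

def continuous_difference(scores):
    """Absolute difference of actor scores; lower values mean more similar."""
    v = np.asarray(scores, dtype=float)
    return np.abs(v[:, None] - v[None, :])

gender_sim = categorical_similarity(["F", "M", "M"])
length_diff = continuous_difference([120.0, 95.5, 300.2])
```

Matrix-level associations such as the reported ρ = 0.48 are typically assessed with permutation procedures for dyadic data, in the spirit of the QAP approach of Krackhardt (1988) cited in the references.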
As per the agreed privacy arrangements, we are prohibited from revealing the company name or other details that could help in its identification. It might be useful to replicate our research to see if our findings are confirmed in different business contexts. Future studies could include more control variables, particularly those which are supposed to produce homophily effects – such as employees' age (Kossinets & Watts, 2009). Having more accurate timestamps could also help in the assessment of average response time, to see if more reactive users tend to cluster. As ours was mainly an association study, we advocate further longitudinal research, which could tell us which actor-similarity effects can be considered significant antecedents of digital communication.

Our findings have practical implications both for company managers and for administrators of online communities. For example, if a company wants to attract the attention of employees to a strategic topic, in the light of our results it appears vital to choose a language close to that of the target people. Employees' participation in conversations can be fostered by online messages aligned with the general use of language and by choosing social ambassadors who have network positions similar to the target.

References
Antonacci, G., Fronzetti Colladon, A., Stefanini, A., & Gloor, P. A. (2017). It is Rotating Leaders Who Build the Swarm: Social Network Determinants of Growth for Healthcare Virtual Communities of Practice. Journal of Knowledge Management, 21(5), 1218–1239. https://doi.org/10.1108/JKM-11-2016-0504
Basov, N., & Brennecke, J. (2018). Duality beyond Dyads: Multiplex patterning of social ties and cultural meanings. Research in the Sociology of Organizations, in press.
Borgatti, S. P., & Foster, P. C. (2003). The network paradigm in organizational research: A review and typology. Journal of Management, 29(6), 991–1013. https://doi.org/10.1016/S0149-2063(03)00087-4
Brönnimann, L. (2014). Analyse der Verbreitung von Innovationen in sozialen Netzwerken. Master's thesis, University of Applied Sciences Northwestern Switzerland. Retrieved from http://www.twitterpolitiker.ch/documents/Master_Thesis_Lucas_Broennimann.pdf
Brown, J., Broderick, A. J., & Lee, N. (2007). Word of mouth communication within online communities: Conceptualizing the online social network. Journal of Interactive Marketing, 21(3), 2–20. https://doi.org/10.1002/dir.20082
Burt, R. S. (1987). Social Contagion and Innovation: Cohesion versus Structural Equivalence. American Journal of Sociology, 92(6), 1287–1335. https://doi.org/10.1086/228667
Ellis, D. G. (1999). From Language to Communication. New York, NY: Routledge.
Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification. Social Networks, 1, 215–239.
Fronzetti Colladon, A., & Scettri, G. (2017). Look Inside. Predicting Stock Prices by Analysing an Enterprise Intranet Social Network and Using Word Co-Occurrence Networks. International Journal of Entrepreneurship and Small Business, in press. https://doi.org/10.1504/IJESB.2019.10007839
Gloor, P. A. (2017). Sociometrics and Human Relationships: Analyzing Social Networks to Manage Brands, Predict Trends, and Improve Organizational Performance. London, UK: Emerald Publishing Limited.
Gloor, P. A., Fronzetti Colladon, A., Giacomelli, G., Saran, T., & Grippa, F. (2017). The Impact of Virtual Mirroring on Customer Satisfaction. Journal of Business Research, 75, 67–76. https://doi.org/10.1016/j.jbusres.2017.02.010
Huang, A. (2008). Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008) (pp. 49–56). Christchurch, New Zealand.
Jivani, A. G. (2011). A Comparative Study of Stemming Algorithms. International Journal of Computer Technology and Applications, 2(6), 1930–1938.
Kossinets, G., & Watts, D. J. (2009). Origins of Homophily in an Evolving Social Network. American Journal of Sociology, 115(2), 405–450. https://doi.org/10.1086/599247
Krackhardt, D. (1988). Predicting with networks: Nonparametric multiple regression analysis of dyadic data. Social Networks, 10(4), 359–381.
Lawrence, T. B., & Shah, N. (2007). Homophily: Meaning and Measures. Paper presented at the International Network for Social Network Analysis (INSNA) conference, Corfu, Greece.
Lazarsfeld, P. F., & Merton, R. K. (1954). Friendship as a Social Process: A Substantive and Methodological Analysis. Freedom and Control in Modern Society, 18, 18–66.
McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1), 415–444. https://doi.org/10.1146/annurev.soc.27.1.415
Montoya, R. M., & Horton, R. S. (2013). A meta-analytic investigation of the processes underlying the similarity-attraction effect. Journal of Social and Personal Relationships, 30(1), 64–94. https://doi.org/10.1177/0265407512452989
Nerghes, A., Lee, J.-S., Groenewegen, P., & Hellsten, I. (2015). Mapping discursive dynamics of the financial crisis: A structural perspective of concept roles in semantic networks. Computational Social Networks, 2(16), 1–29. https://doi.org/10.1186/s40649-015-0021-8
Perkins, J. (2014). Python 3 Text Processing with NLTK 3 Cookbook. Birmingham, UK: Packt Publishing.
Roth, C., & Cointet, J. P. (2010). Social and semantic coevolution in knowledge networks. Social Networks, 32(1), 16–29. https://doi.org/10.1016/j.socnet.2009.04.005
Roy, M., Schmid, S., & Tredan, G. (2014). Modeling and measuring graph similarity: The case for centrality distance. In Proceedings of the 10th ACM International Workshop on Foundations of Mobile Computing, FOMC 2014 (pp. 47–52). New York, NY: ACM. https://doi.org/10.1145/2634274.2634277
Saint-Charles, J., & Mongeau, P. (2018). Social influence and discourse similarity networks in workgroups. Social Networks, 52, 228–237. https://doi.org/10.1016/j.socnet.2017.09.001
Shalizi, C. R., & Thomas, A. C. (2011). Homophily and contagion are generically confounded in observational social network studies. Sociological Methods and Research, 40(2), 211–239. https://doi.org/10.1177/0049124111404820
Tata, S., & Patel, J. M. (2007). Estimating the selectivity of tf-idf based cosine similarity predicates. ACM SIGMOD Record, 36(2), 7–12. https://doi.org/10.1145/1328854.1328855
Thelwall, M. (2008). Social networks, gender, and friending: An analysis of MySpace member profiles. Journal of the American Society for Information Science and Technology, 59(8), 1321–1330. https://doi.org/10.1002/asi.20835
Thelwall, M. (2009). Homophily in MySpace. Journal of the American Society for Information Science and Technology, 60(2), 219–231. https://doi.org/10.1002/asi.20978
Tietze, S., Cohen, L., & Musson, G. (2003). Understanding Organizations through Language. London, UK: SAGE. https://doi.org/10.4135/9781446219997
Tucker, M. L., Meyer, G. D., & Westerman, J. W. (1996). Organizational communication: Development of internal strategic competitive advantage. Journal of Business Communication, 33(1), 51–69. https://doi.org/10.1177/002194369603300106
White, H. C. (2011). Identité et contrôle. Une théorie de l'émergence des formations sociales. Paris: Éditions de l'École des hautes études en sciences sociales.
Yuan, Y. C., & Gay, G. (2006). Homophily of network ties and bonding and bridging social capital in computer-mediated distributed teams. Journal of Computer-Mediated Communication, 11(4), 1062–1084. https://doi.org/10.1111/j.1083-6101.2006.00308.x

Looking Through the Lens of Social Sciences: The European Union in the EU-Funded Research Projects Reporting

Matteo Gerli
University for Foreigners of Perugia – matteogerli81@gmail.com

Abstract
In the last decades, European integration and scientific production have come to be deeply intertwined as a result of the Europeanization of many research activities. On one side, European institutions promote the realization of research projects aimed at developing a type of knowledge "close" to the end users' interests; on the other side, the resulting knowledge contributes to conditioning the practices that take place in the European and national institutions, according to a circular process that brings the innovations to feed back into the system that expresses them. The purpose of this paper is to explore this relationship by examining two peculiar scientific products realized by researchers operating within the broad domain of the Socio-economic Sciences and Humanities (SSH) as part of the research projects financed by the Seventh Framework Programme (2007-2013) of the European Union: final reports and policy briefs. In other words, it aims to analyse all reports as a whole using automatic text analysis tools, while incorporating some supplementary variables which help to define the broader context of scientific production.

Keywords: European Union, International Research Projects, Socio-economic Sciences and Humanities, Textual Data Exploration, Quantitative Discourse Analysis, IRaMuTeQ.

1. Introduction
The European Research Policy plays a strategic role for thousands of researchers and research institutions operating within the EU borders. Owing to the concomitant decrease in national public funds for scientific activities (see, for instance, Vincent-Lancrin, 2006; 2009), the European research agenda has dramatically increased its appeal among scholars and, consequently, its ability to have an impact on the directions and processes of scientific knowledge production. Indeed, starting in the 1990s, the European Commission has equipped itself with new means to combine and manage, on the basis of medium- to long-term planning cycles, the whole set of scientific and technological initiatives financed by the European budget: the framework programme (Ippolito, 1989; Ruberti and André, 1995; Guzzetti, 1995; Menéndez and Borrás, 2000; Borrás, 2000; Banchoff, 2002; Cerroni and Giuffredi, 2015). In short, the underlying logic is that of the programmatic intersection between research activities and other European policies, so that the promotion of scientific excellence complements the need to foster the creation of cross-border and interdisciplinary collaborations intended to produce a type of knowledge "close" to the end users' interests.
As observed in previous studies (Adler-Nissen and Kropp, 2015), European integration and scientific production have come to be deeply intertwined: on one side, the progress of the integration process influenced (and still influences) research activities through the promotion of particular forms of knowledge and research questions (as far as we are concerned, mainly through the realization of cross-national and cross-disciplinary research projects); on the other side, the resulting knowledge contributes to conditioning the practices that take place in the European and national institutions, according to a circular process that brings the innovations to feed back into the system that expresses them. The Social Sciences and Humanities, which are less directly involved in the production of knowledge with a clear practical usability, are by no means unconcerned by this kind of phenomenon. In this regard, the Journal of European Integration has recently published a special issue on the relationship between the social sciences and European integration, hosting some important articles that have highlighted the existence of several "crossroads" between the European Union's and the scientific community's "itineraries"1. Rosamond (2015), for instance, observed how certain theories of political and economic integration (in particular that of the Hungarian economist Béla Balassa, on the economics side, and neofunctionalism, on the political science side) had informed the "strategic narrative" adopted by the European Commission during the 60s and 70s to legitimize its newly formed institutional role and its economic policy position, in a rather peculiar two-way traffic of influence, with economic integration being theorized while it was happening. Deem (2015) pointed out the existence of a relationship between the birth of a new field of higher education studies, the simultaneous evolution of national university systems and the launch of the so-called Bologna process at the European level. Vauchez analysed, through a sociogenetic approach, the historical process through which the acquis communautaire «has been formulated, stretched, criticized, revised and finally naturalized as the most rigorous and objective measure of Europe against other possible methods» (2015: 196), thanks to the work of those who have been defined as "methodological entrepreneurs", that is, European officials who politically invested in and succeeded in establishing Europe's cognitive and technical equipment.

1 Journal of European Integration, 37 (2015).

Looking beyond such individual cases, what is really relevant to our purpose is the underlying idea of the possibility of studying science production from a sociological point of view, basically by rejecting what was traditionally regarded as an internal/external division (Adler-Nissen and Kropp, 2015: 161-163), and thus admitting that even scientific and academic concepts can be formulated in conjunction with political-economic ambitions and practical problems (see Böhme et al., 1983; Funtowicz and Ravetz, 1993; Slaughter and Leslie, 1997; Gibbons et al., 1994; Ziman, 2000; Albert and McGuire, 2014), such as those mentioned above.
This does not mean that science is equal to politics or economics (Breslau, 1998); what it does mean is that, in order to understand science production, one needs to recognize that "non-academic" resources (such as, for instance, financial or material resources, ideas and beliefs, symbolic resources, political or normative resources, people, etc.) may overstep scientific boundaries and be used for the production of new knowledge. Bourdieu (1975, 1984, 1990, 1992, 1994, 1995, 2001) described this phenomenon through the concept of "field interrelations". In a few words, the social world is composed of multiple semi-autonomous fields: basically, microcosms characterized by different stakes, rules of the game and particular resources which one needs to possess to gain access to the game itself and its specific advantages. He conceptualized these spheres as partially independent, by which he meant that, even though each field develops its own institutions, hierarchies, problems, and tacit or explicit rules, they necessarily interact and affect each other. This is particularly true for cultural fields (art, cinema, religion, science, journalism, etc.), since they are structurally dependent on and subordinated to the political and economic fields. Going straight to the point, this is to say that, when dealing with a sociological analysis of a cultural product (e.g. a text), one can neither consider just its formal characteristics nor limit oneself to its context of production. Instead, one should use a "relational approach", taking into account both the internal features of the product and its external determinants.

In engaging with this broad issue, this paper will try to further contribute to the understanding of the topic by examining two peculiar scientific products realized by researchers operating within the broad domain of the Socio-economic Sciences and Humanities (SSH) as part of the research projects financed by the Seventh Framework Programme (2007-2013) of the European Union: final reports and policy briefs. By using automatic text analysis tools, it will statistically explore the contents of such documents not per se, but in connection with some variables which help to define the broader context of production. Given its exploratory character, this study does not have strong hypotheses to be tested. Nevertheless, following Bourdieu's approach, it aims to offer an original perspective through which to observe the relationship between the field of the social sciences and the public policy field of the European Union (Gerli, 2017).

2. The corpus and methodology
Unlike the studies discussed earlier, which are mainly based on micro-sociological observation, our investigation consists of a macro-sociological analysis of a quite large corpus of 46,513 graphic forms, amounting to 3,025,960 occurrences. It is an ad-hoc constructed corpus: it contains 360 texts, of which 205 are final reports and 155 policy briefs, collected from the digital database CORDIS2, the main institutional source of information related to the research projects financed by the European Union. The choice to focus on these documents is not accidental, but depends on their strict relevance to our research objectives.
In fact, both include a summary of the project results and conclusions, with a description of their potential socio-economic impact (EC 2010), even though the policy brief is specifically designed for policy makers (both European and national), while the final report is addressed to a wider audience, which may (at least potentially) include lay people as well. In this perspective, they represent an effective "shortcut" through which to empirically observe the way in which the research groups awarded a grant "actualized" the inputs they received from the Commission. That is, to return to the previous discussion, to analyse how European institutions and social scientists contribute together to the definition and resolution of some EU-related issues.

With regard to the methodology, both simple and multivariate analyses were performed with the IRaMuTeQ software (Lebart et al., 1998; Bolasco, 2013). In particular, the lexicographical analysis was used for a first exploration of the corpus, that is, to identify and format text units, turn texts into text segments (TS) and classify words by their frequency. The multivariate analysis, instead, was performed to detect the associations between textual data and the following supplementary variables, related to what in the 7FP was defined as macro-activity (MA) and financing scheme (FS)3. Going into more detail, the 7FP included eight macro-activities: Growth, employment and competitiveness in a knowledge society (MA1); Combining economic, social and environmental goals in Europe: towards sustainable development (MA2); Major trends in society and their implications (MA3); Europe in the world (MA4); The citizen in the European Union (MA5); Socio-economic and scientific indicators (MA6); Foresight studies (MA7); Strategic activities (MA8). As for the financing schemes, the 7FP included five main types, which differed from each other in research team size and in the type of purposes to be achieved (the first three mainly focused on the development of new knowledge, while the last two were mainly intended for the coordination and support of research activities and policies): Small or medium-scale focused project (FS1); Small or medium-scale focused research project aimed at international cooperation (FS2); Large-scale integrating project (FS3); Coordination action (FS4); Support action (FS5). Additionally, we also took into account the starting year of the project and the geographic area in which the coordinating institution was located. As a whole, our sample (of non-probabilistic type) involves 223 research projects out of the 251 realized in 2007-2013 (equal to 88.8%) and broadly covers all the macro-activities and financing schemes mentioned above.

2 http://cordis.europa.eu/projects/home_it.html.
3 For more details: Decision No 1982/2006/EC of the European Parliament and of the Council of 18 December 2006.

In Tab. 1, a description of the corpus and its main subsets is provided.

Tab. 1: Description of the corpus

Type           Number of texts   Graphic forms   Occurrences
Final report   205               42,047          2,441,168
Policy brief   155               19,795          584,792
Corpus         360               46,513          3,025,960

3. The main findings
At first glance, the most frequent "full" words used in the SSH research reports do not provide particularly relevant insights. The first ten (social, policy, research, European, project, EU, countries, public, national, Europe) concern the "general context of meaning" where discourses on Europe and related issues took shape.
These are ten words that, without having a clear disciplinary connotation, define some "semantic coordinates" common to all the research projects carried out. Interestingly enough, the wide use of the words country/countries (freq. = 10,531) and national (freq. = 5,527), compared with the words European (freq. = 9,190), EU (freq. = 8,563) and Europe (freq. = 5,408), attests to the great importance of the "national" level of analysis, mainly in a comparative perspective. Scrolling down the list, we can also recognise some typical words of the socio-economic lexicon (economic, market, growth, employment, financial), of the socio-political lexicon (people, education, State, young, groups, cultural, society, governance), and of the methodological one, namely related to the operative context of the research activities (data, case, results, impact, analysis, study). Yet these are terms that, at this early stage of the analysis, do not provide any clear "message".

On a closer look, however, we can identify some specific words which are, in a broad sense, linked to the political macro-orientations defined by the Lisbon Strategy (European Council 2000), demonstrating the "osmosis" existing between European institutions and the social sciences. Here are some examples: innovation (freq. = 5,793), a cornerstone of industrial competitiveness and economic growth (EC 2003, 2006); development (freq. = 5,176), to be understood, among its various meanings, mainly as sustainable development (EC 2005, 2009); education (freq. = 3,490) and knowledge (freq. = 3,221), which, together with the already mentioned innovation, represent the "three sides" of the so-called "knowledge triangle" – from the European Commission's perspective, the ground for greater economic and social dynamism.

For the aims of this study, the geographical scope of the research activities is also of particular interest. Indeed, the most frequent toponyms refer to EU-based countries. Among these, the five main sponsors and recipients of the framework programmes (Germany, UK, France, Italy and Spain) are placed at the top of the ranking. As for the extra-European countries, several of them are located in Asia (e.g. China, Japan, India, Vietnam and Thailand), North Africa (Morocco, Tunisia, Egypt and Libya) and South America (Brazil, Argentina, Colombia, Peru and Chile). This is indicative of a globalization process which is affecting both European institutions and researchers, expanding their interests ("political" for the former, "scientific" for the latter) beyond the European borders. What matters is that they are moving together, insofar as we can suppose the existence of a clear synergy between the emergence of a new multipolar area of political, commercial and cultural influence, in which the European Union is now required to act, and the production of knowledge on topics with a potential "global" added value.

3.1 The main semantic groups and their connections with the "context"
To go deeper into the analysis, and to explore the relationship between the selected texts and some variables related to their context of production, we performed a Descending Hierarchical Analysis (DHA). Indeed, this method allowed us, first, to identify clusters with similar vocabulary within text segments and, then, to visualize them in conjunction with the supplementary variables (Camargo and Justo 2013; Curbelo, 2017). In Fig. 1, the output of the DHA is summarised.
Fig. 1: Dendrogram of the top-down hierarchical classification (Reinert method) of the corpus

As can easily be seen in Fig. 1, the DHA algorithm allowed the identification of five clusters, each with its own specific semantic content. Following Reinert (1987), they can be interpreted as "lexical worlds", namely specific semantic structures which, in our case, refer to different and even competing scientific representations of the European Union and related issues. The second cluster has the largest representation (26.8% of the SSH discourse) and identifies a semantic sphere characterized by a language mainly oriented towards political and social issues. Indeed, the most central word in this cluster is political, followed by cultural, identity, citizenship, border, conflict, citizen, State and so on. Immigration (migrant) and related issues appear to be particularly relevant as well. The fifth cluster (24.1%) delineates a quite peculiar semantic sphere based on a set of words (such as project, conference, research, university, workshop, dissemination, website, etc.) strictly linked with the management and realization of European research projects and, more generally, with scientific research and related activities. The first cluster, third in terms of representativeness (19%), refers to the relationship between economic development and environmental protection, with the most central word being innovation, followed by development, economic, sustainable, environmental, change, rural and so on. This interpretation seems to be supported by the presence of several words that refer to the need for a change with respect to a situation perceived as not desirable (change, impact, strategy, challenge, need, solution, improve, step, etc.). The third cluster (16.2%), instead, covers a semantic area mainly related to the economy and the market. It is a language that involves two main branches, that of the real economy (income, price, household, wage, firm, energy, poverty, etc.) and that of finance (financial, bank, risk, monetary, credit), but above all it is characterized by the large presence of technical terms and acronyms (gdp, estimate, asset, inflation, emu, Eurozone, insurance, macroeconomic, etc.). Finally, the fourth linguistic cluster (13.9%) includes words essentially associated with the relationship between education, training and employment, as shown by the presence of terms such as young, person, child, school, education, aspiration, background, vocational and compulsory. It is a cluster that differs from the others in the greater concreteness of its language, as shown by the recurring use of words referring to "concrete" social actors (child, parent, student, teacher, mother, friend, volunteer, etc.).

Fig. 2, resulting from a Lexical Correspondence Analysis (LCA), shows the relationship between clusters (left side) and between clusters and the supplementary variables (right side). The main aim here was to verify whether or not SSH discourse exhibits clear evidence of "adaptability" with regard to the macro-activities and the financing schemes, as defined by the European Commission.
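The cluster/category associations reported in Figs. 3 and 4 below rest on chi-square statistics computed on segment counts; the sketch below shows, as a purely hypothetical illustration with toy counts (not the study's data), a global test on a cluster-by-category table, while the per-category values in the figures come from analogous contingency comparisons.

```python
# Hypothetical chi-square test of cluster/macro-activity association.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: DHA clusters 1-5; columns: example macro-activities MA1, MA2, MA5.
segments = np.array([
    [120, 410,  60],
    [ 90, 100, 520],
    [380, 350,  70],
    [ 60,  80,  90],
    [110,  90,  85],
])
chi2, p, dof, _ = chi2_contingency(segments)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4g}")
```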
Fig. 2: Association between clusters and supplementary variables

The first two factors together summarize 67.5% of the total inertia: the first one (39.97%) marks a clear opposition between cluster 5 (positive half-plane) and the other four clusters (negative half-plane); the second factor (27.47%), instead, highlights a significant opposition between clusters 1 and 3 (positive half-plane) and clusters 2 and 4 (negative half-plane). As a whole, we can distinguish three different (partially autonomous) semantic contexts, arising from the association between the "cultural" and "socio-political" discourses (third quadrant), between the "economic" discourse and that on "innovation" and "sustainable development" (fourth quadrant), and finally the discourse on "research activities" (between the first and second quadrants).

As far as the relationship between discourses (clusters) and supplementary variables is concerned, Figs. 3 and 4 show the most significant categories (those with a larger chi-square and a lower p-value) for the "macro-activity" and "financing scheme" variables. As shown in the first figure, the MA1 and MA2 categories are only significant in the definition of clusters 1 (innovation) and 3 (economics); MA5 is the most relevant for cluster 2 (politics); similarly, the MA3 category is the only significant one for cluster 4 (culture); and finally, the MA4 and MA8 categories predominate in cluster 5 (research activities). In short, these results strongly support the thesis of adaptability, insofar as the different scientific representations of the European Union that emerged from the analysis proved strongly associated with the macro-activities defined by the European Commission.

Fig. 3: Chi2 significance of variable "macro-activity" by cluster

Cluster   Category   Chi2     %      p-value
1         MA2        1226.7   25.7   <0.0001
1         MA7        762.9    36.5   <0.0001
2         MA5        5220.0   54.8   <0.0001
3         MA1        1282.4   28.9   <0.0001
3         MA2        1414.2   27.0   <0.0001
4         MA3        5238.5   33.0   <0.0001
5         MA4        839.9    33.6   <0.0001
5         MA8        534.9    43.7   <0.0001

On the other hand, the role of the "financing scheme" variable proved much less significant in discriminating the five clusters, except for categories FS4 and FS5, which are the most significant for cluster 5, and category FS1, which instead clearly prevails in cluster 4. Nothing relevant emerged in relation to the variables "geographic area" and "starting year".

Fig. 4: Chi2 significance of variable "financing scheme" by cluster

Cluster   Category   Chi2     %      p-value
1         FS2        186.3    25.7   <0.0001
1         FS3        145.1    24.7   <0.0001
2         FS1        487.6    29.0   <0.0001
3         FS1        286.5    17.6   <0.0001
4         FS1        1245.0   16.7   <0.0001
5         FS4        2195.0   51.5   <0.0001
5         FS5        1583.2   58.5   <0.0001

4. Conclusions
The findings presented herein indicate a close relationship between the programmatic framework defined by the Commission and the contents of the final reports and policy briefs, supporting the thesis of a co-construction of European integration (Adler-Nissen and Kropp, 2015). The scientific discourse has come to be structured around a few semantic macro-aggregates arising from the DHA, which in turn proved to be associated with the supplementary variables examined in the LCA. Furthermore, the SSH linguistic space shows a clear cleavage between the economic discourse and the cultural discourse, which points to a lack of interaction between these two spheres. From a more "general" point of view, all this means that, in connecting the social sciences field with the policy field, the European research projects produced a scientific discourse that, on the whole, is structurally homologous with the "space of possibilities" inherent to the 7FP.
References
Adler-Nissen R., Kropp K. (2015). A Sociology of Knowledge Approach to European Integration: Four Analytical Principles. Journal of European Integration, 37(2): 155-173.
Albert M., McGuire W. L. (2014). Understanding Changes in Academic Knowledge Production in a Neoliberal Era. Political Power and Social Theory, 27: 33-57.
Banchoff T. (2002). The Politics of the European Research Area. ACES Working Paper 3, Paul H. Nitze School of Advanced International Studies.
Böhme G., Van den Daele W., Hohlfeld R., Krohn W., Schäfer W. (1983). Finalization in Science. The Social Orientation of Scientific Progress. Dordrecht: Reidel.
Bolasco S. (2013). L'analisi automatica dei testi. Fare ricerca con il text mining. Roma: Carocci.
Borrás S. (2000). Science, Technology and Innovation in European Politics. Research Paper n. 5, Roskilde University.
Bourdieu P. (1975). The Specificity of the Scientific Field and the Social Conditions of the Progress of Reason. Social Science Information, 6: 19-47.
Bourdieu P. (1984). Homo academicus, trad. it. (2013) Homo academicus. Bari: Edizioni Dedalo.
Bourdieu P. (1992). Les règles de l'art, trad. it. (2013) Le regole dell'arte. Milano: Il Saggiatore.
Bourdieu P. (1994). Raisons pratiques. Sur la théorie de l'action, trad. it. (2009) Ragioni pratiche. Bologna: Il Mulino.
Bourdieu P. (1995). Champ politique, champ des sciences sociales, champ journalistique, trad. it. (2010) Campo politico, campo delle scienze sociali, campo giornalistico. In Cerulo M. (a cura di), Sul concetto di campo in sociologia. Roma: Armando.
Bourdieu P. (2001). Science de la science et réflexivité, trad. it. (2003) Il mestiere di scienziato. Milano: Mondolibri.
Breslau D. (1998). In Search of the Unequivocal: The Political Economy of Measurement in U.S. Labor Market Policy. London: Praeger.
Camargo B. V., Justo A. M. (2013). R Interface for Multidimensional Analysis of Texts and Questionnaires, IRaMuTeQ tutorial, available at: http://www.iramuteq.org.
Cerroni A., Giuffredi R. (2015). L'orizzonte di Horizon 2020: il futuro europeo nelle politiche della ricerca. Futuri, 6: 29-39.
Curbelo A. A. (2017). Analysing the (Ab)use of Language in Politics: the Case of Donald Trump. Working Paper n. 2. University of Bristol: SPAIS.
Deem R. (2015). What is the Nature of the Relationship between Changes in European Higher Education and Social Science Research on Higher Education and (Why) Does It Matter? Journal of European Integration, 37(2): 263-279.
European Commission (2010). Communicating research for evidence-based policymaking. Bruxelles: Directorate-General for Research.
European Commission (2003). Politica dell'innovazione: aggiornare l'approccio dell'Unione Europea nel contesto della Strategia di Lisbona. COM(2003) 112 definitivo, 11.03.2003.
European Commission (2005). Comunicazione della Commissione al Consiglio e al Parlamento europeo sul riesame della strategia per lo sviluppo sostenibile. Una piattaforma d'azione. COM(2005) 658 definitivo, 13.12.2005.
European Commission (2006). Mettere in pratica la conoscenza: un'ampia strategia per l'innovazione per l'UE. COM(2006) 502 definitivo, 10.05.2006.
European Commission (2009). Integrare lo sviluppo sostenibile nelle politiche dell'UE: riesame 2009 della strategia dell'Unione Europea per lo sviluppo sostenibile. COM(2009) 400 definitivo, 24.07.2009.
Funtowicz S., Ravetz J. (1993). Science for the Post-Normal Age. Futures, 25: 735-755.
Gerli M. (2017). Il campo sociale dei progetti di ricerca europei. Il caso delle SSH.
Studi Culturali, 1: 127-150.
Gibbons M., Limoges C., Nowotny H., Schwartzman S., Scott P. and Trow M. (1994). The New Production of Knowledge. London: Sage.
Guzzetti L. (1995). A Brief History of European Union Research Policy. Luxembourg: Publications Office of the European Communities.
Ippolito F. (1989). Un progetto incompiuto. La ricerca comune europea: 1958-88. Bari: Edizioni Dedalo.
Lebart L., Salem A., Berry L. (1998). Exploring Textual Data. New York: Kluwer Academic.
Menéndez L. S., Borrás S. (2000). Explaining Changes and Continuity in EU Technology Policy: The Politics of Ideas. In Dresner S. and Gilbert N. (eds), Changing European Research System. Aldershot: Ashgate.
Reinert M. (1987). Classification descendante hiérarchique et analyse lexicale par contexte: application au corpus des poésies d'Arthur Rimbaud. Bulletin de Méthodologie Sociologique, 13: 53-90.
Rosamond B. (2015). Performing Theory/Theorizing Performance in Emergent Supranational Governance: The Live Knowledge Archive of European Integration and the Early European Commission. Journal of European Integration, 37(2): 175-191.
Ruberti A., André G. (1995). Uno spazio europeo della scienza. Riflessioni sulla politica europea della ricerca. Firenze: Giunti.
Slaughter S., Leslie L. L. (1997). Academic Capitalism: Politics, Policies and the Entrepreneurial University. Baltimore: The Johns Hopkins University Press.
Vauchez A. (2015). Methodological Europeanism at the Cradle: Eur-lex, the Acquis and the Making of Europe's Cognitive Equipment. Journal of European Integration, 37(2): 193-210.
Vincent-Lancrin S. (2006). What is Changing in Academic Research? Trends and Future Scenarios. European Journal of Education, 41(2): 169-202.
Vincent-Lancrin S. (2009). Finance and Provision in Higher Education: A Shift from Public to Private? Higher Education to 2030 (vol. 2), Centre for Educational Research and Innovation: OECD.
Ziman J. (2000). Real Science: What It Is, and What It Means. Cambridge-New York: Cambridge University Press.

Generic and Discursive Specialization of a Lexical Unit: The Example of joggeuse in the French Regional Daily Press

Lucie Gianola (1), Mathieu Valette (2)
(1) Université de Cergy-Pontoise – lucie.gianola@u-cergy.fr
(2) Institut National des Langues et Civilisations Orientales – mvalette@inalco.fr

Abstract
In this paper, we study the distribution of lexical items designating outdoor sport practitioners (joggeur/joggeuse, randonneur/randonneuse, runneur/runneuse, promeneur/promeneuse) in order to identify links between gender, semantic themes and genre in French press discourse. The corpus is sampled from articles published in regional daily newspapers. In press discourse, we observe a convergence between gender and genre through the actualized semantic classes.

Keywords: Press discourse, textometrics, semantic class, genre, gender

1. Introduction
We propose a study in textual lexicology on the distribution of selected lexical units in a corpus of press texts. The study was not carried out from a corpus-driven perspective, as is often the case in textometrics, but with a corpus-based approach (Biber, 2009) in which the observables were defined in advance. Our objective is indeed to focus on the designations of practitioners of outdoor sports along an opposition of gender: joggeur vs joggeuse, randonneur vs randonneuse, runneur vs runneuse, promeneur vs promeneuse. The aim is to identify the correlations between gender, isotopies and textual genres in the journalistic discourse of the French regional daily press.

2. Research Question
2.1. Summation of genre and discourse isotopies into signifieds
Textual lexicology consists in analysing the lexicon on the basis of the textual conditions of its production. It rests on the hypothesis that lexical units are subject to a set of intertextual and infratextual constraints of the same nature as diffuse, non-lexicalized semantic forms, constraints which condition their regimes of production and interpretation. Previous work has set out the theoretical conditions for a textual analysis of the lexicon, focused mainly on the study of semantic neology – or neosemy (Rastier and Valette, 2009) – and of diffuse semantic forms undergoing synthetic lexicalization, or protosemy (Valette, 2010a, 2010b). Here, we study the systematic use of a given lexical unit in a specific textual genre and the impact of this use on its semantics. Indeed, any word placed in a text receives semantic determinations from it, which are liable to modify its signified (afference of semes). Under the hypothesis that the signified is a lexicalized semantic form (Valette, 2010b), we will consider that the semes of a text's isotopies can propagate to the signified of a lexical unit through the summation process described by Rastier (2006). This has been observed for domain isotopies (for example, the re-domanialization of a lexical unit in cases of neosemy), but can generic isotopies (relating to textual genre) or discursive isotopies (relating to discourse) transform the signified of a word in the same way as domain isotopies do? This is the question we shall try to answer here.

2.2. The corpus
The corpus was built along two axes, lexical and discursive: we used 8 forms as keywords to collect texts exclusively from journalistic discourse and, more precisely, from the regional daily press, without regard to textual genre. The corpus was collected semi-automatically with a web-page retrieval script, then cleaned and de-duplicated manually, in order to discard articles consisting of AFP wire copy reproduced from one newspaper to another. The script, based on the Linux cURL command, is fed a list of URLs collected from the newspapers' websites through queries on the Google search engine (site:sitename form, supplemented by the negative term -blade in the case of "runner" to filter out articles about the film Blade Runner). Between 100 and 130 URLs were collected for each form.
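As an indication of what this retrieval step looks like, here is a rough Python transposition of the cURL-based download (illustrative only; the URL-list file name is hypothetical, and the actual script was a shell one):

```python
# Approximate Python equivalent of the cURL retrieval step: download each
# collected URL and save the raw HTML for later manual cleaning.
import pathlib
import requests

urls = pathlib.Path("urls_joggeuse.txt").read_text().split()  # hypothetical URL list
out = pathlib.Path("raw/joggeuse")
out.mkdir(parents=True, exist_ok=True)
for i, url in enumerate(urls):
    try:
        html = requests.get(url, timeout=10).text
        (out / f"{i:04d}.html").write_text(html, encoding="utf-8")
    except requests.RequestException as err:
        print(f"skipped {url}: {err}")  # dead links are simply skipped
```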
The cleaning phase removed page headers, tables of contents, side links, image captions, etc., keeping only the title and body of each article. The corpus is organized into eight sub-corpora corresponding to the 8 forms under study – Joggeur, Joggeuse, Promeneur, Promeneuse, Randonneur, Randonneuse, Runner, Runneuse – whose statistics are presented in the following table.

Table 1: Number of words per sub-corpus

Sub-corpus    Number of words
Joggeur       40,671
Joggeuse      48,285
Randonneur    35,162
Randonneuse   31,931
Promeneur     44,497
Promeneuse    31,009
Runner        22,212
Runneuse      31,367
Total         285,134

The articles come mainly from regional daily newspapers such as Nice Matin, Ouest-France, L'Est Républicain, La Dépêche du Midi, La Montagne, Corse-Matin and La Provence. Collection was not restricted to any particular section but covered each newspaper as a whole, and no time limit was set.

3. Analyses1
3.1. General observations
A preliminary factorial analysis (Figure 1) of the parts of speech alone shows a marked opposition on axis 1 between the Runner and Runneuse sub-corpora and the other sub-corpora. This gap can be explained by the textual genres of the sub-corpora under consideration. Indeed, as shown by the pioneering work of Biber (1988) and, in its wake, that of Malrieu and Rastier (2001), the local variables constituted by parts of speech are particularly stable genre markers. Here, it appears that Runner and Runneuse belong to the genre of sports event reports, while the other 6 sub-corpora consist overwhelmingly of faits divers (news-in-brief stories). In other words, most of the lexical units chosen for our queries, which correspond to outdoor sporting practices, do not belong – or only marginally – to the vocabulary of the sports genres of journalistic discourse.

1 The corpus was analysed with the textometry software TXM (http://textometrie.ens-lyon.fr/) (Heiden et al. 2010).

The factorial correspondence analysis on forms with a frequency of at least 10 occurrences shows a very different distribution. Runner and Runneuse are still very close, but the same now holds for Randonneur and Randonneuse (henceforth Randonneur·se) (Figure 2). The Joggeur, Promeneur and Promeneuse sub-corpora lie at the intersection of the axes and will be examined individually, while Joggeuse stands apart.

3.2. Analysis of the constituent semantic classes
The analysis of the specificities (forms) of the groupings thus constituted indicates the contexts in which the different forms are instantiated. The a priori very homogeneous Randonneur·se grouping displays a vocabulary associated with mountain accidents. This corpus is structured into 3 main semantic classes:
- accidents: « chute », « mortelle », « mètre », « avalanche », « fracture », « cheville », « hôpital », « blessée », « trauma », « glisser », etc.;
- disappearances: « disparu », « alerte », « retrouvé », « emporté », « inquiet », etc.;
- rescue: « PGHM » (Peloton de gendarmerie de haute montagne), « hélicoptère », « Dragon » (a helicopter model), « évacué·e », « pompiers », « CRS », « secouriste », « secteur », « équipe », « sauveteur », « secourir », etc.
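The factorial analyses in section 3.1 were produced with TXM; for readers who want to see the mechanics, the following stand-alone sketch (invented counts, assumed implementation) reproduces the standard computation of row coordinates in a correspondence analysis:

```python
# Correspondence analysis via SVD of standardized residuals (toy data).
import numpy as np

# Toy contingency table: sub-corpora (rows) x parts of speech (columns).
counts = np.array([
    [1200,  800, 400, 300],   # Runner
    [1150,  820, 390, 310],   # Runneuse
    [ 900, 1100, 600, 250],   # Joggeur
    [ 880, 1150, 620, 240],   # Joggeuse
], dtype=float)

P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, sig, _ = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sig) / np.sqrt(r)[:, None]        # principal coordinates
print(row_coords[:, :2])  # sub-corpus positions on factorial axes 1 and 2
```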
The Promeneur and Promeneuse sub-corpora essentially relate 3 types of events:
- walks themselves: « sentier », « phare », « littoral », « patrimoine », « chemin », etc.;
- accidents, essentially hunting accidents: « chasseurs », « chasse », etc.;
- discoveries: « macabres », « corps », « cadavre », « tronc », « jambe », « squelette », « ossement », « obus », « pépite », etc.

The Joggeur sub-corpus contains only one main semantic class, that of accidents involving no human third party: « arrêt », « malaise », « crise cardiaque », « algues vertes », animal attacks (« rapace », « aigle », « buse »), « sulfure d'hydrogène », « H2S », « intoxication », « toxique », « gaz ». It should be noted that this class is not actualized in the Joggeuse sub-corpus.

The two remaining sub-corpora – the very homogeneous grouping of Runner and Runneuse (henceforth Runneur·se) and Joggeuse – deserve our full attention. From an ontological point of view, jogging and running are similar forms of running belonging to the domain of sport. But their uses in journalistic discourse differ quite markedly. In the Runneur·se grouping, which, as we have seen, essentially contains articles reporting sports events, the vocabulary is structured around the following semantic classes:
- a definitional class: the hypernym « sport », the synonym « coureur », etc. The Runneur·se sub-corpus is thus the only one whose sense corresponds to the signification;
- competition: « course », « marathon », « semi-marathon », « trail », « triathlon », « championnat », « inscription », « départ », « épreuve », « km », « victoire », « podium », « médaille », « sponsors », etc.;
- injuries: « blessure », « foulure », « ampoule », « contracture », etc.

It also contains two semantic classes linked to the techniques associated with the practice:
- equipment: « équipement », « baskets », « chaussures », « brassière », « connectés », « GPS » or « montre GPS », etc.;
- training: « entrainement », « préparation », « fractionné », « cardio », « conseils », « performances », « yoga » (as a complementary activity meant to prevent injuries), etc.

It should be noted that the Runneuse sub-corpus stands out through its mention of charity sports events linked to the fight against breast cancer: « octobre rose », « prévention ». Conversely, the joggeuse of the eponymous sub-corpus is by no means an athlete, yet her textual characterization is remarkably precise: she is a woman attacked during her jog, and the semantic classes actualized in this sub-corpus belong to crime, law and judicial investigation:
- attacks: « meurtre », « tentative », « agressée », « agression sexuelle », « viol », « enlèvement », « tuée »;
- attackers: « homme », « suspect », « meurtrier », « présumé », « portrait-robot », « violeur », « exhibitionniste »;
- judicial proceedings: « enquêteurs », « avocats », « cour », « procureur », « réquisition », « réclusion », « prison », « accusé », « interpellé », « agresseur », « condamné », « procédure », « instruction », « ADN », etc.

3.3. Synthesis
At the close of this analysis, we choose to concentrate on the mirror-image definitions of the joggeuse and the runneuse, leaving aside the other lexical units detailed above.
The generic and discursive isotopies that make up the semantic fabric of the articles in which these two forms occur give rise, by summation, to the construction of two antagonistic signifieds. The joggeuse appears as: 1. /isolated/ (she runs alone); 2. /vulnerable/ (she is defenceless against an attacker); and, whatever happens, since the fait divers genre demands it, 3. /victim/ (she is attacked, raped, killed). Conversely, the runneuse is: 1. /accompanied/ (she runs in collective sports events); 2. /secured/ (by technology, notably GPS watches that help manage effort and optimize performance, and by regular training; the injuries she suffers appear benign, moreover, compared with the risks run by the joggeuse); 3. /a competitor/ (she takes part in competitions).

4. Conclusion
In this article, we have tried to show how the semantic grounds arising from genres and discourses can modify, by summation, the signifieds of the lexical units used in them. For two lexical units sharing a priori an identical referent – a woman practising running – actualization in a journalistic corpus brings out very different semantic contents. The point is not to claim that joggeuses are necessarily women in danger, but the regularity with which the word joggeuse is actualized in the press as a /victim/, /vulnerable/ and /isolated/, could in time affect the perception of a practice whose media reality is exclusively macabre. In other words, in press discourse, for women, jogging is a dangerous practice and the joggeuse a victim of assault, whereas the runneuse is an athlete involved in social events and running a safe and rewarding practice.

References
Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, D. (2009). Corpus-Based and Corpus-driven Analyses of Language Variation and Use. In B. Heine and H. Narrog (eds), The Oxford Handbook of Linguistic Analysis, 159–191. Oxford: Oxford University Press.
Heiden, S., Magué, J.-P. and Pincemin, B. (2010). TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement. In S. Bolasco et al. (eds), Journées internationales d'Analyse statistique des Données Textuelles, vol. 2, 1021–1032.
Malrieu, D. and Rastier, F. (2001). Genres et variations morphosyntaxiques. Traitement automatique des langues, 42(2): 547–577.
Rastier, F. (2006). Passages. Corpus, 6: 125–152.
Rastier, F. and Valette, M. (2009). De la polysémie à la néosémie. Le français moderne, 77: 97–116.
Valette, M. (2010a). Propositions pour une lexicologie textuelle. Zeitschrift für Französische Sprache und Literatur, 37: 171–188.
Valette, M. (2010b). Méthodes pour la veille lexicale. In L. Messaoudi et al. (eds), Sur les dictionnaires. Kénitra: Publication du laboratoire Langage et société, Université Ibn Tofail, 251–272.

The Transparency Engine – A Better Way to Deal with Fake News
Peter A. Gloor (1), Joao Marcos de Oliveira (2), Detlef Schoder (3)
(1) MIT Center for Collective Intelligence, Cambridge, MA – pgloor@mit.edu
(2) Galaxyadvisors, Aarau, Switzerland – jmarcos@galaxyadvisors.com
(3) University of Cologne, Germany – schoder@wim.uni-koeln.de

Abstract
We introduce the "Transparency Engine", a social network search engine that separates fact from fiction by exposing (1) the hidden "influencers" and (2) their "tribes". Our goals are to quantify the influence and relevance of persons, concepts, or companies on institutions, issues or industries by tracking the dynamics and changes in the observed environment. In particular, we visualize the networks of influence for a given social or economic ecosystem, thus providing a tool for both the scientific community and the general public (including journalists, or anyone interested in checking news) to track the diffusion of new ideas, both good and bad. In particular, the Transparency Engine exposes the hidden influencers behind fake news. We propose a unique solution which combines three subsystems we have been developing over the last five years: (I) Powergraph, (II) Tribefinder, and (III) Swarmpulse. The Powergraph displays the degree and power of a spreader's position by reconstructing her/his (social) network via websites and social position in the Twitter universe. The Tribefinder exposes the tribal echo chambers on Twitter nurturing fake news items through social media mining, thus allowing the news consumer to develop an informed opinion about the motivation of the spreaders of fake news. This is done by mining the Twitter word usage of tribe members with neural networks using TensorFlow. The Swarmpulse system finds the most relevant fake and non-fake news on Wikipedia and Twitter by combining their emergent patterns.

Keywords: Fake News, Transparency Engine, News, Truth, Belief System, Machine Learning, Big Data

1. Introduction
According to independent investigations, Russian misinformation and fake news spread by Western conspiracy theorists on social media may have contributed to the outcome of the Brexit vote1 and the election of Donald Trump2. Misinforming news has become a significant threat to societal discourse and opinion formation. Mechanisms to deal with this type of fake news by making it transparent are urgently needed. The goal of this project is to understand the concept of "fake news" in the context of forming collective awareness through social media. The concept of truth is dependent on a personal belief system. On the other hand, conspiracy theories and satire are nothing new, and people who WANT to believe them have always embraced them. Categorizing news as "fake news" happens when it goes against one's innermost and most passionate beliefs. The more somebody is embedded in a predefined belief system, the more likely they are to believe fake news. For instance, people who use Facebook as their major news source are more likely to believe fake news (Silverman & Singer-Vine, 2016). What mental processes are happening when we embrace fake news? When embedded in a particular belief system, individuals categorize news as fake immediately when they read it, because they do not want to believe it; similarly, they immediately categorize news as true when it fits perfectly into their belief system. For instance, Trump followers label mainstream news as "fake news", while mainstream news labels news from Trump followers as "fake news".

1 Londongrad – Russian Twitter trolls meddled in the Brexit vote. Did they swing it? The Economist, Nov. 23rd, 2017.
2 https://en.wikipedia.org/wiki/Russian_interference_in_the_2016_United_States_elections
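To make the Tribefinder idea concrete, the following sketch trains a small classifier that assigns a tribe label from word usage. It is an invented stand-in, not the system itself: the tweets, labels and architecture are illustrative, and only the general approach (word-usage features fed to a TensorFlow neural network) follows the description above.

```python
# Illustrative stand-in for tribe classification from word usage (toy data).
import tensorflow as tf

tweets = [
    "climate change is a hoax invented by elites",
    "new marathon personal best this weekend",
    "vaccines cause autism wake up people",
    "peer reviewed study confirms the warming trend",
]
tribes = [0, 1, 0, 1]  # invented tribe labels

# Turn word usage into tf-idf features, then train a small dense network.
vectorize = tf.keras.layers.TextVectorization(max_tokens=500, output_mode="tf_idf")
vectorize.adapt(tweets)
X = vectorize(tf.constant(tweets))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, tf.constant(tribes), epochs=10, verbose=0)
print(model.predict(X).argmax(axis=1))  # predicted tribe per tweet
```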
2. Related Work
There are many approaches to creating more transparency in societal discourse; indeed, this may be seen as the core task of quality journalism. Most if not all of these approaches, however, are not well supported by IT tools, do not scale well, and many do not reveal the algorithms they apply. Fact-checking Web sites such as Wikitribune, Snopes.com, PolitiFact, and FactCheck.org, and corporate/proprietary initiatives like Facebook's fake news detection tools, mostly rely on human volunteers and/or paid staff to do the fact checking, which has major disadvantages:
- human bias: fact checkers might have a "leftist" or "right-wing" bias;
- non-scalable: the human pool of fact checkers is by definition restricted;
- deferred access: a machine can check any news item immediately, 24/7, without the expensive detective work of a human fact checker;
- non-replicable: as the fact checking is done by different users, the reader cannot understand why a certain item has been categorized in a particular way.

Among the automated approaches, the Klout score (www.klout.com) gives a metric for the social media influence of a person. However, the Klout score has to be requested manually by a user who wants one, so it is heavily skewed towards self-promoters. Another solution for finding the social media profiles of users is to leverage the Google Knowledge Graph (https://en.wikipedia.org/wiki/Knowledge_Graph), which has been employed in theoretical work by Ciampaglia et al. (2015) for fact checking by measuring the shortest-path distance between related concept nodes (a minimal sketch of this idea follows at the end of this section). Another approach uses machine learning to identify fake news: for instance, Ott et al. (2011) showed that machine learning based on word usage beats humans by wide margins at identifying fake reviews on TripAdvisor, by computing feature vectors from the text of the reviews. More generally, Youyou et al. (2015) have shown that, to identify (tribal) attributes of people, having a computer look at their Facebook likes through machine learning is more reliable than human judgment. A similar research question is addressed when identifying Twitter bots based on their networking patterns and word usage. For instance, Botcheck (botcheck.me) and Botometer (https://botometer.iuni.iu.edu/#!/) (Varol et al. 2017) estimate the likelihood that any Twitter account is a bot, based on its numbers of followers and friends, tweeting dynamics, and the content of its tweets.
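As a rough illustration of the knowledge-graph idea of Ciampaglia et al. (2015), the sketch below scores a candidate statement by the shortest-path distance between its subject and object in a small concept graph. The toy graph and the scoring function are our own illustrative assumptions, not the authors' published implementation.

```python
# Sketch: knowledge-graph fact checking via shortest-path distance,
# after the idea of Ciampaglia et al. (2015). The graph and the scoring
# function are illustrative assumptions, not the published method.
import networkx as nx

# Toy concept graph: nodes are entities, edges are known relations.
G = nx.Graph()
G.add_edges_from([
    ("Barack Obama", "Democratic Party"),
    ("Democratic Party", "United States"),
    ("Barack Obama", "Honolulu"),
    ("Honolulu", "Hawaii"),
    ("Hawaii", "United States"),
])

def plausibility(graph, subject, obj):
    """Score a (subject, obj) claim: short paths through the knowledge
    graph suggest a plausible statement, missing paths do not."""
    try:
        d = nx.shortest_path_length(graph, subject, obj)
    except nx.NetworkXNoPath:
        return 0.0
    return 1.0 / (1.0 + d)  # higher for nearby concepts, -> 0 with distance

print(plausibility(G, "Barack Obama", "Democratic Party"))  # adjacent: 0.50
print(plausibility(G, "Honolulu", "Democratic Party"))      # two hops: 0.33
```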
3. Motivation – How Influencers Spread Fake News
Today's online social media consumers are exposed to a cacophony of fact and fiction as never before. "It is true, I read it on the Internet" is unfortunately a prominent way for information to spread. For example, immediately after the 2016 US Presidential elections, in early November 2016, Hillary Clinton was accused of running a pedophile ring out of a pizza restaurant in Washington. Called "pizzagate", this news item became a favorite call to arms among right-wing extremists and Donald Trump supporters, leading one incensed fanatic to drive a few hundred miles from Salisbury, North Carolina to Washington DC and fire his automatic gun into the pizza restaurant. The origin of this fake news story has been well documented: it started from a white supremacist Twitter account and was then picked up by the conspiracy news Web site of Sean Adl-Tabatabai, where it fell on the willing ears of the American right. Just as Google has revolutionized the way we access information, our proposed Transparency Engine intends to change the way we look at such information, by exposing the hidden influencers, like Sean Adl-Tabatabai, who inject new information into the public discourse.

3.1 The concept of tribes and how they perceive information
Besides knowing the sources of rumors, it is essential to also know the (political) orientation of these influencers. Quantum physics suggests that there are many different universes, with our current world embedded in just one out of infinitely many others. Looking at radically different interpretations of the same news item, it seems we are indeed living in different quantum universes. These different universes can be grouped into "tribes" (Sloterdijk 2011). Each of these tribes has its own reality, defining fact and fiction for the members of the tribe. Previous research (De Oliveira et al. 2017) has exemplified this idea. What is fact for one tribe is fiction for another tribe. It all depends on the tribe, and on what the members of the tribe WANT to believe. Examples are the denial of human-influenced global warming, the explanation of evolution through "intelligent design", or the alleged causal relationship between vaccination and autism, where some tribes perceive the related issues as "fact" and "truth" whereas other tribes perceive the objectively same issues as "fiction", "lies" or "fake news", thus creating an "alternate reality".

In contrast to the power of states and corporations, the growing power and dynamics of networks is mostly invisible. Unlike in hierarchical structures, the central influencers in networks are hard to identify with the "naked eye". What matters for spreading any news – fact or fake – is the influence of the spreader. The main way to quantify the influence of spreaders is their position in a given network and, with it, the power to "multiply" the word to larger audiences. More specifically, the degree and power of the spreaders' position can be measured by reconstructing their (social) network via their Web sites and their social position, for example in the Twitter universe (and other social networking platforms), thus measuring the influence of Web sites and of Twitter accounts on a specific topic.

Figure 1: Twitter retweet network for "pizzagate" (left), and Twitter influence network (right).

Pizzagate only spread because a moderately influential spreader, Sean Adl-Tabatabai, discovered the original tweet and posted it on his conspiracy news Web site. Figure 1 illustrates how social media analysis can increase trust and transparency by visualizing the echo chambers of fake news about pizzagate, using our social media analysis system Condor (Gloor 2017). The picture on the left shows the Twitter network about pizzagate: each node is a person tweeting, and a link between two people means that one person retweets a tweet sent by the other, or mentions the other in a tweet. There is a large cluster in the center of the network, made up of believers in the fake news. They are reinforcing each other, increasing the traffic in their echo chamber. The few supporters of Hillary Clinton, trying to debunk the fake news, are pushed aside; their tweets are ignored by the large echo chamber of conspiracy theory believers. The people in the periphery (the "asteroid belt") are tweeting into the void, as their tweets are ignored by friends and foes alike. Using an influencer algorithm (Gloor 2017) shows that the discourse about pizzagate on Twitter is dominated by Trump followers (the picture on the right). Our algorithm makes somebody an influencer if the words she or he uses are picked up by others and spread quickly through the network. As the right-hand picture in Figure 1 shows, there is just one voice of reason left, while the proponents of pizzagate reinforce each other much more, with a cluster of influential spreaders of wild ideas in the center and other conspiratorialists in the periphery of the cluster, being retweeted by hundreds of like-minded others (shown as "parachutes" in the graph).
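The retweet-and-mention network just described can be assembled and ranked with standard graph tooling. In the sketch below, PageRank serves as a generic stand-in for the proprietary Condor influencer algorithm (Gloor 2017); the sample tweets are invented.

```python
# Sketch: build a retweet/mention network and rank likely influencers.
# PageRank here is a generic stand-in for the Condor influencer
# algorithm (Gloor 2017); the sample tweets are invented.
import networkx as nx

tweets = [  # (author, retweeted_or_mentioned_user)
    ("alice", "conspiracy_hub"),
    ("bob", "conspiracy_hub"),
    ("carol", "conspiracy_hub"),
    ("dave", "alice"),
    ("erin", "debunker"),
]

# Directed edge from the retweeter to the original author:
# influence flows toward accounts whose words get picked up.
G = nx.DiGraph()
G.add_edges_from(tweets)

scores = nx.pagerank(G)
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{user:15s} {score:.3f}")  # echo-chamber hubs rank highest
```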
4. Our Solution – Transparency Engine
We introduce the "Transparency Engine", a social network search engine that separates fact from fiction by exposing the hidden influencers and their "tribes" behind fake news. Just as Google has revolutionized the way we access information, the Transparency Engine changes the way we look at such information, by exposing the hidden influencers. Our goals are fourfold: (1) quantify the influence and relevance of persons, concepts, and companies on institutions, issues or industries; (2) qualify the dynamics and changes in the observed environment; (3) visualize the networks of influence for a given social or economic ecosystem; (4) provide a tool to track the diffusion of new ideas, both good and bad.

4.1 Powergraph
Our solution combines three subsystems we have been developing over the last five years (Fuehres et al. 2012, de Oliveira et al. 2016, de Oliveira et al. 2017): Powergraph, Tribefinder, and Swarmpulse. Powergraph measures the importance of "notable" people, as defined by Wikipedia, by calculating the number of other Wikipedia people pages that can be reached within two degrees of separation from a particular people page on Wikipedia. This is a proxy for social capital, as it basically measures the influence of the people a person is connected to. The system also identifies those people with Twitter accounts by matching them with sources of information like Wikidata and the Google Knowledge Graph.

Figure 2: Sample Powergraph for "global warming".

Figure 2 illustrates our prototype version of the Powergraph, showing the social network of the most influential people on "global warming", based on their Wikipedia and Twitter presence. We find, not surprisingly, that Donald Trump and the former US presidents are most influential. We measure the importance of people by calculating the number of other Wikipedia people pages and Twitter friendship networks that can be reached within two degrees of separation from a particular people page. This is a proxy for social capital, as it basically measures the influence of the people a person is connected to (Fuehres et al. 2012).
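The two-degrees-of-separation reach that Powergraph uses as a social-capital proxy reduces to counting the nodes within graph radius 2. A minimal sketch follows, assuming an undirected link graph of people pages (the graph itself is invented).

```python
# Sketch: two-degree reach as a social-capital proxy, as in Powergraph.
# The link graph below is an illustrative assumption.
import networkx as nx

links = nx.Graph()
links.add_edges_from([
    ("A", "B"), ("B", "C"), ("C", "D"),  # A reaches B (1 hop), C (2 hops)
    ("A", "E"), ("E", "F"),              # ... and E, F
])

def two_degree_reach(graph, person):
    """Number of other people pages reachable within two hops."""
    ego = nx.ego_graph(graph, person, radius=2)
    return len(ego) - 1  # exclude the person itself

for p in links.nodes:
    print(p, two_degree_reach(links, p))
```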
4.2 Tribefinder
The second component of our system, Tribefinder (de Oliveira et al. 2017), identifies the tribal affiliations of the opinion leaders on any news item. To assign a tribe to an influencer, our system analyzes their word usage, using deep learning. An integral component of the Tribefinder system is "TribeCreator", a subsystem that automatically helps the user find people who belong to a newly defined tribe by looking at profile self-descriptions, the content of tweets, and followers and Twitter friends. For example, if a user wants to create a tribe of treehuggers (people who love nature), they can search for people with profile descriptions that match the idea of this tribe ("nature lover", "I love nature", "nature", etc.), for people who follow pages about nature, or who tweet about nature. In a second step, we calculate the vocabulary that these influentials use in their tweets. This vocabulary is then matched against the vocabulary of any Twitter user, calculating their tribal affiliations. Knowing the tribal affiliations of the thought leaders for a news item allows readers to position the news item correctly, deciding for themselves whether they want to trust news coming from a particular influencer.
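The vocabulary-matching step can be sketched with TF-IDF and cosine similarity, used here as a simple stand-in for Tribefinder's TensorFlow classifier; all tribe corpora and tweets below are invented.

```python
# Sketch: assign a user to the tribe whose seed vocabulary their tweets
# resemble most. TF-IDF + cosine similarity is a simple stand-in for
# Tribefinder's TensorFlow classifier; all texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tribe_corpora = {
    "treehuggers": "nature lover forest hiking green planet trees",
    "petrolheads": "engine horsepower racing turbo fuel lap circuit",
}

user_tweets = "spent the weekend hiking in the forest, what a green planet"

vec = TfidfVectorizer()
docs = list(tribe_corpora.values()) + [user_tweets]
X = vec.fit_transform(docs)

# Similarity of the user's text (last row) to each tribe corpus.
sims = cosine_similarity(X[-1], X[:-1]).ravel()
best = max(zip(tribe_corpora, sims), key=lambda kv: kv[1])
print(best)  # ('treehuggers', ...)
```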
4.3 Swarmpulse
The third component of our system is Swarmpulse (de Oliveira et al. 2016). Swarmpulse finds the most recently edited Wikipedia pages and uses Twitter to see which people are talking about those subjects. This system helps users serendipitously spot the most recent news items they were not aware of, and then check their influencer network on the Powergraph and calculate their tribal affiliations with Tribefinder.

5. Conclusion
The best approach to fact checking is a critical, well-informed mind. Our world needs more powerful ways and tools to support the critical mind, and transparency is a key enabler for this. The Transparency Engine thus provides the foundation for informing the critical mind. The global Powergraph will display the power network of the one million globally most influential people on Wikipedia people pages and the most popular Twitter users, and will allow all other Twitter users to position themselves within the context of the Powergraph. The Tribefinder will show the "truth of tribes" by creating tribes through their use of language on social media, assigning each influencer to one or more tribes and showing the tribal affiliations in the Powergraph. Swarmpulse will build an index of the most recent significant news by combining new edits on Wikipedia with the most popular tweets from influential twitterers, and will show the actors involved through the Powergraph. The landscape of transparency-generating approaches calls for a scientific, open approach such as the Transparency Engine proposes. Our aim is to contribute substantially to popularizing and democratizing fact checking for the whole world. Everyone should be enabled to do this easily and simply by themselves!

References
Ciampaglia, G. L., Shiralkar, P., Rocha, L. M., Bollen, J., Menczer, F., & Flammini, A. (2015). Computational fact checking from knowledge networks. PLoS ONE, 10(6): e0128193.
de Oliveira, J., & Gloor, P. (2016). The Citizen IS the Journalist: Automatically Extracting News from the Swarm. In Designing Networks for Innovation and Improvisation: Proceedings of the 6th International COINs Conference (Springer Proceedings in Complexity), Rome, Italy, June 9-11, 2016.
de Oliveira, J., & Gloor, P. (2017). GalaxyScope: Finding the "Truth of Tribes" on Social Media. In Proceedings of the 7th International COINs Conference (Springer Proceedings in Complexity), Detroit, September 11-14, 2017.
Fuehres, H., Gloor, P., Henninger, M., Kleeb, R., & Nemoto, K. (2012). Galaxysearch: Discovering the Knowledge of Many by Using Wikipedia as a Meta-Search Index. In Proceedings of Collective Intelligence 2012, Cambridge, MA, April 18-20.
Gloor, P. (2017). Sociometrics and Human Relationships: Analyzing Social Networks to Manage Brands, Predict Trends, and Improve Organizational Performance. Emerald Publishing, London.
Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. (2011). Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 309-319.
Silverman, C., & Singer-Vine, J. (2016). Most Americans who see fake news believe it, new survey says. BuzzFeed News; https://www.buzzfeed.com/craigsilverman/fake-news-survey
Sloterdijk, P. (2011). Bubbles: Microspherology. MIT Press.
Varol, O., Ferrara, E., Davis, C. A., Menczer, F., & Flammini, A. (2017). Online human-bot interactions: Detection, estimation, and characterization. arXiv preprint arXiv:1703.03107.
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences (PNAS).

Brexit and Twitter: The voice of people
Francesca Greco, Leonardo Alaimo, Livia Celardo (Sapienza University of Rome – francesca.greco@uniroma1.it; leonardo.alaimo@uniroma1.it; livia.celardo@uniroma1.it)

Abstract 1
Nowadays there is an increase in Euroscepticism among EU citizens, as shown by the development of ultra-nationalist parties in the European states. Regarding European Union membership, public opinion is divided in two. The 2016 British referendum, in which citizens chose to "Leave", shaking public opinion, and the following general election in June 2017, in which the British Europeanist parties prevailed, in contrast with the 1975 British referendum where 72% of citizens had chosen to "Remain", are clear examples of this fracture. There are still few studies investigating Brexit discourses within social media, and most of them focus on the 2016 British referendum. For this reason, this exploratory research aims to identify how Brexit and the EU are discussed on Twitter today, through a text mining approach. We collected all the tweets containing the terms "Brexit" and "EU" over a period of 10 days. Data collection was performed with the TwitteR package, resulting in a large corpus to which we applied multivariate techniques in order to identify the contents and the sentiments behind the shared comments.

Abstract 2
In recent years there has been an increase in Euroscepticism among EU citizens, as witnessed by the development of ultra-nationalist parties in several European states. On the theme of "Europe", public opinion is divided between Europeanists and Eurosceptics. A clear example of this division is given by recent British events: in the 2016 referendum, British citizens chose to "leave" the EU, shaking public opinion, while the subsequent general election of June 2017 saw the affirmation of the main pro-European parties. There are still few studies in the literature investigating how the theme of Brexit in relation to the EU is dealt with in social media, since most of them focus on the causes and potential effects of the June 2016 vote. In this sense, this exploratory research aims to identify how Brexit and the European Union are being discussed on Twitter at this historical moment, through automatic text analysis. To this end, all messages containing the terms "Brexit" and "EU" were collected for 10 days using the TwitteR package, obtaining a large corpus to which multivariate techniques were applied, in order to identify the contents and sentiments related to the theme under examination.
Keywords: Brexit, Twitter, Emotional Text Mining.

1. Introduction
There is a growing increase in Euroscepticism among EU citizens nowadays, as shown by the development of ultra-nationalist parties in the European states. Regarding European Union membership, public opinion is divided between Eurosceptics and pro-Europeans, as shown by the 2016 British referendum ("Brexit"), in which 52% of citizens chose to "Leave". As further evidence of this division, the following general election of June 2017 saw the affirmation of the main Europeanist parties (especially the Labour Party), and the results led to a hung Parliament. Brexit has shaken European public opinion, as it revealed the strength of the anti-Europeanist trend. During the 60th anniversary of the Treaties of Rome in 2017, millions of citizens expressed their support for the EU by participating in Europeanist demonstrations in many European cities. One useful starting point for explaining the result of Brexit is to focus on the electoral issue: the relationship between the UK and Europe. This has always been a central and rather controversial issue in the British public debate. The media, public opinion and the political class have always been deeply critical and sceptical about European integration. This position influences citizens' attitudes towards the Union, which is not only considered distant and inadequate for resolving everyday issues (immigration, unemployment, and so on), but is often perceived as their major cause, by limiting the political and economic power of the United Kingdom. The electoral outcome created disbelief all over the world. Britain is the home of the term Euroscepticism (Spiering 2004, p. 127). But, while it is clear that a large proportion of UK residents are sceptical about Europe, it is not clear that this position coincides with the wish to leave the EU; Euroscepticism should not be confused with this wish. Szczerbiak and Taggart (2008) distinguished two different types of Euroscepticism: Hard Euroscepticism, a principled opposition to the EU and European integration, and Soft Euroscepticism, where concerns about one (or a number) of policy areas lead to the expression of qualified opposition to the EU. Although there are several studies exploring British Euroscepticism, only a few of them investigate Brexit discourses within social media. For this reason, we decided to perform a quantitative study in which the online discourses regarding Brexit and the EU are analysed using two different approaches, Content Analysis and Emotional Text Mining. The aim is to explore not only the contents but also the sentiments shared by users on Twitter. For this paper, we used one of the most important and best-known microblogging tools, Twitter.
It is an online platform for sharing real-time, character-limited communication with people of similar interests which, in 2017, counted over 300 million users and an average of about 500 million tweets sent per day.

2. Data collection and analysis
In order to explore the sentiments and contents on Brexit and the EU in Twitter communications over ten days, we scraped all the messages in English produced from September 22nd to October 2nd, 2017, containing both the words Brexit and EU. The data extraction was carried out with the TwitteR package of R Statistics (Gentry, 2016). We collected 221,069 messages, including 83% retweets, from which two samples of tweets were extracted. The first, used for the sentiment analysis, is composed of 99,812 messages, where the retweets were limited to a threshold of 31, resulting in a large corpus of 1,601,985 tokens; for the second, used for the content analysis, we excluded all the retweets, resulting in a large corpus of 37,318 tweets and 618,255 tokens. In order to check whether it was possible to statistically process the data, two lexical indicators were calculated: the type-token ratio and the hapax percentage (TTRcorpus1 = 0.02; Hapaxcorpus1 = 39.8%; TTRcorpus2 = 0.04; Hapaxcorpus2 = 52.31%). Given the large size of the corpus, both lexical indicators highlighted its richness and indicated the possibility of proceeding with the analysis.
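The two lexical indicators used here are simple to compute; a minimal sketch follows (whitespace tokenization is a simplifying assumption, and the example corpus is invented; real pipelines typically lemmatize first).

```python
# Sketch: type-token ratio and hapax percentage, the two lexical
# indicators used above. Whitespace tokenization is a simplifying
# assumption; the example corpus is invented.
from collections import Counter

corpus = "brexit talks stall as eu and uk trade blame over brexit bill"
tokens = corpus.lower().split()

counts = Counter(tokens)
types = len(counts)                                # distinct word forms
ttr = types / len(tokens)                          # type-token ratio
hapax_pct = 100 * sum(1 for c in counts.values() if c == 1) / types

print(f"TTR = {ttr:.2f}, hapax = {hapax_pct:.1f}% of types")
```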
2.1 Emotional text mining
We know that people's sentiments depend not only on their rational thinking but also, and sometimes most of all, on the emotional and social functioning of the mind. If the conscious process sets the manifest content of the narration, that is, what is narrated, the unconscious process can be inferred from how it is narrated, that is, from the words chosen to narrate and their association within the text. Accordingly, it is possible to detect the associative links between words in order to infer the symbolic matrix determining the coexistence of these terms in the text (Greco, 2016). To this end we perform a multivariate analysis based on a bisecting k-means algorithm to classify the text (Savaresi et Boley, 2004), and a correspondence analysis to detect the latent dimensions setting the cluster-per-keywords matrix (Lebart et Salem, 1994), by means of the T-Lab software. The interpretation of the cluster analysis results allows us to identify the elements characterizing the emotional representation of Brexit, while the results of the correspondence analysis reflect its emotional symbolization. Through the interpretation of the clusters, we classify the emotional representations into positive, neutral and negative sentiments, determining the percentage of messages for each sentiment modality. To this end, the first corpus was cleaned and pre-processed with the software T-Lab (T-Lab Plus version, 2017) and keywords were selected. In particular, we used lemmas as keywords instead of types, filtering out the lemmas Brexit and EU and those of the low rank of frequency (Greco, 2016). Then, on the tweets-per-keywords matrix, we performed a cluster analysis with a bisecting k-means algorithm limited to twenty partitions, excluding all the tweets that do not have at least two keyword co-occurrences. The percentage of explained variance (η) was used to evaluate and choose the optimal partition. To finalize the analysis, a correspondence analysis on the keywords-per-clusters matrix was performed in order to explore the relationships between clusters and to identify the emotional categories setting the Brexit representations.

2.2 Content analysis
Content analysis is a technique used to investigate the content of a text; in text mining, many methods exist to analyse it automatically. One of these is text clustering, where the corpus is split into different subgroups based on word/document similarities (Iezzi, 2012). In this paper, a text co-clustering approach (Celardo et al., 2016) is used. The objective is to classify rows and columns simultaneously, in order to identify groups of texts characterized by specific contents. To do that, the data were pre-processed with the Iramuteq software, lemmatizing the texts and removing stop words and terms with a frequency lower than 10. The weighted term-document matrix was then co-clustered through the double k-means algorithm (Vichi, 2001); the number of clusters for both rows and columns was fixed using the Calinski-Harabasz index.
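The bisecting k-means step of Section 2.1 can be sketched as a loop of plain 2-means splits; T-Lab's exact variant may differ (e.g., in the criterion used to pick the cluster to split). The tiny document-term matrix below is invented.

```python
# Sketch: bisecting k-means over a tweets-per-keywords matrix, the
# clustering step of Section 2.1. T-Lab's exact variant may differ;
# the tiny document-term matrix is invented.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([  # rows: tweets, cols: keyword counts
    [2, 0, 1, 0], [1, 0, 2, 0], [0, 2, 0, 1],
    [0, 1, 0, 2], [2, 1, 1, 0], [0, 2, 1, 2],
], dtype=float)

def bisecting_kmeans(X, n_clusters):
    clusters = [np.arange(len(X))]  # start with one cluster of all rows
    while len(clusters) < n_clusters:
        # Split the largest cluster (other variants use within-cluster SSE).
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

for c in bisecting_kmeans(X, 3):
    print(c)  # row indices of each of the three final clusters
```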
3. Emotional text mining: main results and discussion
The results of the cluster analysis for ETM show that the 655 selected keywords allow the classification of 88.6% of the tweets. The percentage of explained variance was calculated on partitions from 3 to 19, and it shows that the optimal solution is six clusters (η = 0.057). The correspondence analysis detected five latent dimensions. Table 1 summarizes the emotional map of Brexit and the EU emerging from the English tweets, showing how the clusters are placed in the factorial space produced by the five factors. The first factor represents the political and economic domain where Brexit seems to have its main impact; the second factor reproduces the possible solutions of Brexit: a separation or a new agreement; the third factor represents the national or European level of reaction to Brexit; the fourth factor is blame, distinguishing the blaming of politicians from the blaming of the willingness to be independent; and the fifth factor is political leadership, distinguishing old and new policies.

Table 1: Correspondence analysis results. Five factors (explained inertia: Factor 1, 27.5%; Factor 2, 24.3%; Factor 3, 19.8%; Factor 4, 15.6%; Factor 5, 12.9%), each opposing a negative pole (NP) and a positive pole (PP) of keywords with their absolute contributions.

The six clusters are of different sizes and reflect the representations of Brexit (Table 2), which correspond to three different sentiments: positive, negative for domestic reasons, and negative for foreign ones (Table 1).

Table 2: Clusters, with the percentage of context units (CU) classified in each:
Cluster 1, Good Choice (10.0% CU)
Cluster 2, EU Reaction (14.9% CU)
Cluster 3, Uncertain Future (20.9% CU)
Cluster 4, British Leadership (13.4% CU)
Cluster 5, Divorce Conditions (19.2% CU)
Cluster 6, Informed Choice (21.7% CU)

The first cluster represents the choice to leave the EU as a good option, underlining the need to proceed; the second cluster focuses on the EU political reaction in fixing the divorce conditions, perceiving EU political representatives as unfavourable and therefore threatening; the third cluster represents Britons' hope of improving their economic condition by leaving the EU as naive; the fourth cluster represents the old British political leadership as incompetent, unable to protect and adequately inform Britons in order to support them in remaining in the EU; the fifth cluster reflects the negotiation of the divorce conditions, perceiving the negotiation as unfair and the costs of leaving the EU as a punishment; and the sixth cluster represents Brexit as an informed choice of the Britons, highlighting that its consequences belong to the policy domain, which should respect the citizens' choice.

By interpreting the clusters, we detected six different representations of Brexit that correspond to three different sentiments (Table 1). We considered as positive (31.7%) the representations of Brexit as a Good Choice or an Informed Choice, and as negative all the other representations (68.3%). Among the negative clusters, we distinguished negativity according to the origin of the problem: Uncertain Future and British Leadership are negative for domestic reasons (34.2%), that is, the lack of competence of the UK political leadership; EU Reaction and Divorce Conditions are negative due to foreign factors (34.1%), as after Brexit the EU seems to be perceived as vindictive and, therefore, threatening.
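The correspondence analysis behind the factorial map of Table 1 can be sketched as an SVD of the standardized residuals of the keywords-per-clusters contingency table; the tiny matrix below is invented, and T-Lab's implementation may differ in details such as scaling.

```python
# Sketch: correspondence analysis of a keywords-per-clusters matrix via
# SVD of standardized residuals. The contingency table is invented;
# T-Lab's implementation may differ in scaling details.
import numpy as np

N = np.array([  # rows: keywords, cols: clusters (co-occurrence counts)
    [20,  2,  1],
    [15,  3,  2],
    [ 2, 18,  4],
    [ 1, 16,  5],
    [ 3,  2, 12],
], dtype=float)

P = N / N.sum()                      # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses
E = np.outer(r, c)                   # expected proportions under independence
S = (P - E) / np.sqrt(E)             # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

inertia = sv**2 / (sv**2).sum()      # share of inertia per latent dimension
row_coords = (U * sv) / np.sqrt(r)[:, None]  # principal coordinates of keywords
print("explained inertia:", np.round(inertia, 3))
print("keyword coordinates (dim 1):", np.round(row_coords[:, 0], 2))
```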
4. Content analysis: main results and discussion
The pre-processing phase, applied to the second corpus, allowed us to identify a set of 1,957 keywords, representing 97% of the tweets; on the term-document matrix of dimension 1,957 × 36,383 we then calculated the Calinski-Harabasz index in order to define the number of clusters for rows and columns. After calculating the index values for partitions from 2 to 10 on each dimension, the Calinski-Harabasz index suggested classifying the words into three groups and the tweets into five groups. Table 3 shows the centroids of the clusters.

Table 3: Centroids matrix (word clusters × document clusters).
                  Doc cl. 1 (55%) | Doc cl. 2 (20%) | Doc cl. 3 (12%) | Doc cl. 4 (11%) | Doc cl. 5 (2%)
Word cluster 1 |       0.005      |      0.003      |      0.004      |      0.000      |      0.000
Word cluster 2 |       0.002      |      0.063      |      0.003      |      0.149      |      0.012
Word cluster 3 |      -0.002      |      0.000      |      0.090      |     -0.003      |      0.309

Table 4: Word groups (first words listed by frequency of occurrence).
Cluster 1, Negotiation: stay, Junker, ambassador, cry, track, surge, peer, shape, turmoil, survive
Cluster 2, Economic Transformation: leave, move, transition, late, deal, trade, retain, post, Macron, urge
Cluster 3, British Identity: home, sound, cake, plan, datum, live, finish, Id, idea, national

As shown in Table 3, the algorithm identified five blocks of specificities: the first cluster of words is connected to the first group of tweets; the second is specific to the second and the fourth clusters of tweets; and the third is related to the third and the fifth groups of tweets. The groups of words are presented in Table 4. The first group of words relates to the need to define new rules and settlements within the negotiation, and it represents more than half of the tweets; it has no strong specificities related to the texts but, compared with all the document clusters, it seems to be most connected to those words. For the other two groups of words, on the other hand, there are more marked specificities: the second cluster of words concerns the definition of new economic agreements and is connected to 31% of the tweets, while the third, related to the need to specify a new identity after Brexit, is representative of 14% of the corpus documents.
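The simultaneous row/column classification described above can be sketched with an off-the-shelf co-clustering routine; SpectralCoclustering is used here as an accessible stand-in for the double k-means algorithm (Vichi, 2001) and, unlike double k-means, it imposes the same number of row and column clusters. The matrix below is invented.

```python
# Sketch: co-clustering a term-document matrix, as in the double k-means
# step above. SpectralCoclustering is a stand-in for double k-means
# (Vichi, 2001) and uses equal row/column cluster counts; data invented.
import numpy as np
from sklearn.cluster import SpectralCoclustering

X = np.array([  # rows: terms, cols: documents (weighted frequencies)
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [0, 0, 4, 5, 1],
])

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print("term clusters:    ", model.row_labels_)
print("document clusters:", model.column_labels_)
```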
5. Conclusions
The results of the two analyses showed a strong relationship between the terms "Brexit" and "EU", not only in terms of sentiment but also in terms of content. In line with the literature, the sentiment analysis revealed the presence of both positive and negative opinions with respect to the exit of the United Kingdom from the EU. On the other hand, the analysis of the contents showed that Twitter communications on Brexit focus primarily on the concept of negotiation. The remaining messages take into account both the economic features of Brexit and the need to redefine the national identity. To conclude, the results of the two analyses revealed that Brexit is a theme with a strong emotional charge, mostly negative. British people seem to focus their attention on three issues: the new settlement, the economic consequences, and the national identity. These subjects are treated both positively and negatively by users, probably because of the lack of cohesion within the country.

References
Celardo, L., Iezzi, D. F., & Vichi, M. (2016). Multi-mode partitioning for text clustering to reduce dimensionality and noises. In Proceedings of the 13th International Conference on Statistical Analysis of Textual Data.
Gentry, J. (2016). R Based Twitter Client. R package version 1.1.9.
Greco, F. (2016). Integrare la disabilità. Una metodologia interdisciplinare per leggere il cambiamento culturale. Franco Angeli.
Hobolt, S. (2016). The Brexit vote: a divided nation, a divided continent. Journal of European Public Policy, 23(9): 1259-1277.
Iezzi, D. F. (2012). Centrality measures for text clustering. Communications in Statistics – Theory and Methods, 41(16-17): 3179-3197.
Lebart, L., & Salem, A. (1994). Statistique Textuelle. Dunod.
Savaresi, S. M., & Boley, D. L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.
Spiering, M. (2004). British Euroscepticism. In Harmsen, R., & Spiering, M. (eds), Euroscepticism: Party Politics, National Identity and European Integration. Editions Rodopi B.V.
Szczerbiak, A., & Taggart, P. (2008). Opposing Europe? The Comparative Party Politics of Euroscepticism. Volume 1: Case Studies and Country Surveys. Oxford University Press.
Vichi, M. (2001). Double k-means clustering for simultaneous classification of objects and variables. Advances in Classification and Data Analysis: 43-52.

A text mining on clinical transcripts of good and poor outcome psychotherapies
Francesca Greco (Sapienza University of Rome & Prisma S.r.l., francesca.greco@uniroma1.it), Giulio de Felice (Sapienza University of Rome & NCU University, giulio.defelice@uniroma1.it), Omar Gelo (University of Salento & Sigmund Freud University, omar.gelo@unisalento.it)

Abstract
The text mining of clinical transcripts is broadly used in psychotherapy research, but it is limited to top-down approaches, with a-priori vocabularies that code the transcripts according to a predetermined theoretical framework. Nevertheless, the semantic level that a word or clinical intervention can assume depends on the relational field in which the discourse is produced. Thus, bottom-up approaches seem particularly meaningful for addressing such a relevant issue. With the aim of investigating possible similarities and differences between good outcome and poor outcome psychotherapies, we applied a multivariate analysis to the transcripts of eight single cases of brief experiential psychotherapy (four good outcome vs four poor outcome cases), in order to identify the general core themes and their differences according to therapy outcome. The results showed a significant difference between good and poor outcome cases in the number of context units classified in two of the six core themes (clusters) (χ², df = 5, p < 0.01). These findings show how this bottom-up technique of text analysis of clinical transcripts turns out to be an enlightening tool for letting their latent dimensions emerge, setting the clinical process and outcome, and therefore provides a very useful tool for clinical purposes.

Abstract
The analysis of clinical transcripts has been widely used in psychotherapy research, although it is mostly based on the use of a dictionary that allows the text to be coded according to predetermined criteria. However, the polysemy that a word, or a clinical intervention, can assume depends on the relational field in which the discourse is produced. Therefore, bottom-up approaches seem particularly useful in addressing this issue. In order to investigate the elements characterizing clinical transcripts with positive and negative outcomes, a multivariate analysis was carried out on a corpus composed of eight transcripts of brief psychotherapy (four with positive outcome and four with negative outcome), in order to identify the general core themes and the distribution of context units across the different themes as a function of therapy outcome. The results highlighted a significant difference between cases with a positive outcome and those with an unfavourable outcome (χ², df = 5, p < 0.01), showing how the automatic text analysis of the transcripts of clinical interviews can be a useful tool for bringing out the latent dimensions organizing the process and the outcome, thus constituting a useful tool for clinical purposes.
Keywords: Emotional Text Mining, clinical transcripts, psychotherapy outcome.

1. Introduction
The text mining of clinical transcripts is very broadly used in psychotherapy research, but it is limited to top-down approaches in which a-priori vocabularies code them according to a predetermined theoretical framework. Nevertheless, the semantic level that a word, or a clinical intervention, can assume depends on the relational field in which the discourse is produced. Thus, bottom-up approaches seem particularly meaningful for addressing such a relevant issue. Psychotherapy can be considered a dynamic communicative exchange between the client and the therapist (e.g., Gelo et Salvatore, 2016). Within such an exchange, the content (i.e., the semantics) of what is said plays a primary role. Thus, the textual analysis of therapy transcripts may represent a very useful tool for psychotherapy process researchers as well as for clinicians (Gelo et al., 2013; Salvatore et al., 2017). In the field of psychotherapy research, some methods of text mining have been developed and applied, such as the Therapeutic Cycle Model (Mergenthaler, 2008) and Referential Activity (Bucci et al., 1992). Following a top-down approach, these methods use predefined content categories to semantically classify units of text. Each of these categories corresponds to a thematic dictionary containing all the words indicative of the content represented by that category. Even though these top-down methods of text mining allow for a reliable and valid investigation of the therapeutic process, they present a major limitation: they disregard the contextual nature of linguistic meaning (Carli et al., 2004; Salvatore et al., 2012). In fact, the meaning of a word is polysemic and depends on the way it combines with other words in the communicative interaction, i.e., on its association with other words. Grounded in these considerations, there has recently been a development of text mining approaches which, by means of their bottom-up logic, allow for a context-sensitive textual analysis (e.g., Salvatore et al., 2012; 2017; Cordella et al., 2014; Greco, 2016). The aim of this study is to investigate possible similarities and differences between good outcome and poor outcome psychotherapy cases by applying Emotional Text Mining (Cordella et al., 2014; Greco, 2016). Our assumption is that it is possible to detect the associative links between words in order to infer the symbolic matrix determining the coexistence of the terms in the text. To this end, we perform a multivariate analysis based on a bisecting k-means algorithm (Savaresi et Boley, 2004) to classify the text, and a correspondence analysis (Lebart et Salem, 1994) to detect the latent dimensions setting the cluster-per-keywords matrix. The interpretation of the cluster analysis allows for the identification of the elements characterizing the core themes of the treatment, while the results of the correspondence analysis reflect the emotional symbolization characterizing the therapeutic exchange. The advantage of such an approach is that the factorial space is interpreted according to word polarization, thus identifying the emotional categories that generate the core themes, and that the interpretation of the clusters is facilitated by exploring their relationships within the symbolic space (Greco et al., 2017).

2. Data collection and analysis
2.1 Data collection
The sample of the present study was drawn from the York Depression Study I, a randomized clinical trial assessing the efficacy of brief experiential therapy for depression (Greenberg et Watson, 1998; Watson et al., 1998). (We are grateful to Dr. Les Greenberg for providing us with the files of the transcripts for these cases.) From the original sample, we initially selected the six best outcome cases and the six worst outcome cases based on the Reliable Change Index of the Beck Depression Inventory (BDI; Beck et al., 1988). We then excluded four cases due to missing session transcripts. Our final sample thus comprised a total of eight cases, four with good outcomes and four with poor outcomes. The treatment length was between 15 and 20 sessions (M = 17.62; SD = 1.38), for a total of 141 sessions. Patients (one man and seven women; M = 37.1 years old) met the criteria for major depressive disorder assessed by means of the Structured Clinical Interview for DSM-III-R (SCID; Spitzer et al., 1989). Therapists (seven women and one man; M = 5.5 years of therapeutic experience) had six months of training in experiential psychotherapy (Greenberg et al., 1993). The transcripts were collected in a large corpus of 1,090,234 tokens. In order to check whether it was possible to statistically process the data, two lexical indicators were calculated: the type-token ratio and the percentage of hapax (TTR = 0.01; hapax = 35.3%). They highlighted the richness of the corpus, indicating the possibility of proceeding with the analysis.

2.2 Data analysis
First, the data were cleaned and pre-processed with the software T-Lab and keywords were selected. In particular, we used lemmas as keywords instead of types. We selected all the lemmas in the medium rank of frequency (upper frequency threshold = 933), and those of the low rank of frequency down to the threshold of 17 occurrences, that is, the average number of sessions attended by the patients (Greco, 2016). Then, in order to identify the core themes common to all the psychotherapies, we performed a cluster analysis on the keywords-per-context-units (CU) matrix by means of a bisecting k-means algorithm (Savaresi et Boley, 2004), limited to ten partitions, excluding all the CU that did not have at least two keyword co-occurrences. The eta squared value was used to evaluate and choose the optimal solution. To finalize the text mining, we performed a correspondence analysis on the keywords-per-clusters matrix (Lebart et Salem, 1994) in order to explore the relationships between clusters and to identify the emotional categories setting the psychotherapeutic process. The interpretation of the factorial space was performed according to the procedure proposed by Cordella and colleagues (2014), in which each keyword is considered only in the factor with the greatest absolute value. To finalize the analysis, we performed a chi squared test on the cluster-per-therapy-outcome contingency table, calculating the standardized residuals (SR) in order to identify the differences between good outcome and poor outcome clinical transcripts in terms of core themes.
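The final chi squared test with standardized residuals can be sketched as follows; the counts are invented (the real table has six clusters by two outcomes), but the computation mirrors the one described above.

```python
# Sketch: chi-squared test on the cluster-by-outcome contingency table
# with standardized residuals, as described above. The counts are
# invented; the real table has six clusters x two outcomes.
import numpy as np
from scipy.stats import chi2_contingency

# rows: clusters, cols: (good outcome, poor outcome) context units
table = np.array([
    [120, 110], [150, 140], [200, 310],
    [130, 120], [180, 170], [330, 210],
])

chi2, p, dof, expected = chi2_contingency(table)
residuals = (table - expected) / np.sqrt(expected)  # standardized residuals

print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.4f}")
print(np.round(residuals, 1))  # |residual| > 2 flags the diverging cells
```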
3. Main results and discussion
The results of the cluster analysis show that the 1,351 selected keywords allow for the classification of 56.6% of the context units. The high proportion of unclassified context units is due to the richness of the transcripts in paraverbal interactions (i.e., mhm, yeah, etc.). The eta squared value was calculated on partitions from 3 to 9, and it showed six clusters to be the optimal solution (η² = 0.034). Table 1 shows the emotional map emerging from the clinical transcripts, representing the location of the clusters in the factorial space produced by the interpretation of the five factors. The first factor reflects the patient's positioning, which can be passive or active; the second factor refers to the relationship, which can be familiar or unfamiliar, i.e., a person facing something new and unpredictable; the third factor represents the communication content, which can be emotional or concrete; the fourth factor reflects the outcome of the therapeutic work, that is, the patient's empowerment or making sense of the patient's experiences; and the fifth factor distinguishes daily issues, concerning everyday life, from relational ones, concerning loved ones. (In the negative pole of the fifth factor, Daily Issues, we find words such as house, stay, TV, rule, street, teacher, move out, neighbour, pounds; in the positive pole we find words such as mother, life, problem, sister, relationship.)

Table 1: Factorial space representation (percentage of explained inertia in brackets after each factor; CU = context units classified in the cluster).
Cluster (label, CU%) | Factor 1 (26.7%) Positioning | Factor 2 (25.8%) Relationship | Factor 3 (21.5%) Content | Factor 4 (14.5%) Outcome | Factor 5 (11.5%) Issues
1 Family Structure (11.6%) | Passive 0.20 | Familiar -0.56 | Emotional -0.16 | -0.01 | Daily -0.32
2 Transformative Process (12.1%) | Active -0.46 | Unfamiliar 0.29 | 0.06 | To empower -0.35 | Daily -0.16
3 Concrete Thinking (16.1%) | Passive 0.84 | Unfamiliar 0.34 | Concrete 0.42 | To empower -0.19 | 0.05
4 Therapeutic Relationship (22.4%) | Active -0.25 | Familiar -0.18 | Concrete 0.41 | To understand 0.28 | Relational 0.16
5 Relational Issues (14.6%) | 0.04 | Familiar -0.14 | Emotional -0.47 | To empower -0.18 | Relational 0.45
6 Feelings (23.1%) | 0.06 | Unfamiliar 0.58 | Emotional -0.43 | To understand 0.49 | Daily -0.14

Table 2: Psychotherapy core themes (keywords with the number of context units, CU, in which they occur).
Cluster 1, Family Structure: home 525, kid 371, house 290, father 241, husband 213, child 205, parent 194, stay 190, live 179
Cluster 2, Transformative Process: start 507, able to 504, change 438, different 396, situation 288, point 237, go on 216, mind 213, trying 183
Cluster 3, Concrete Thinking: hear 455, money 326, dollar 267, accept 205, pay 196, listen 175, believe 135, matter 130, sell 126
Cluster 4, Therapeutic Relationship: week 699, sense 675, day 438, bad 432, angry 381, call 253, night 189, morning 169, set 162
Cluster 5, Relational Issues: mother 399, life 335, problem 333, hard 292, care 268, deal 252, family 237, relationship 233, Father 153
Cluster 6, Feelings: understand 416, hurt 300, important 298, person 231, hard 213, pain 195, support 185, inside 170, strong 168
The six clusters are of different sizes (Table 1) and reflect the core themes of the brief psychotherapies (Table 2). The first cluster describes the family structure, with its roles and places; the second cluster reflects the transformative process characterizing a psychotherapy; the third cluster highlights a way of thinking that could be defined as concrete, often rational and frequently concerning economic issues; the fourth cluster represents the therapeutic relationship, made of concrete limits and of the process of making sense of personal experiences; the fifth cluster reflects the relational issues of the patient's private life; and the sixth cluster refers to the process of detecting, recognizing and understanding feelings, characterizing internal emotional experiences.

There is a significant difference between the good and poor outcome therapies in the number of context units classified in each cluster (χ², df = 5, p < 0.01). In particular, the differences lie in the relevance of two of the six core themes: concrete thinking and feelings. While the good outcome brief psychotherapies are characterized by a high number of context units classified in the cluster feelings (SR = 6.8) and a low number classified in the cluster concrete thinking (SR = -5.8), the poor outcome psychotherapies are characterized by a high number of context units classified in the cluster concrete thinking (SR = 6.8) and a low number classified in the cluster feelings (SR = -7.0). That is, patients seem to dwell upon their emotional experiences in the good outcome psychotherapies, while in the poor outcome psychotherapies they tend to dwell upon facts, probably without connecting them to their emotional experiences. Given that this analysis classified the interactions between patients and therapists, the therapy outcome could derive both from the patient's ability to deal with feelings and from the therapist's ability to support the patient in doing so.

The above-mentioned differences between good and poor outcome cases are coherent with findings obtained on the same sample by means of a principal component analysis of the transcripts coded according to three dictionaries: abstract language, positive emotional language, and negative emotional language (de Felice et al., 2018). In that study, differences in the correlation matrices between good outcome and poor outcome cases were evident. The most obvious one concerned the dynamic in which the patient made use of abstract/concrete language, interpreted very positively in poor outcome cases and very negatively in good outcome cases; in the latter, it was probably, and correctly, considered a patient's defense mechanism to be addressed. This was confirmed by the use of positive and negative emotional language, inversely proportional to abstraction, only in poor outcome cases.

4. Conclusion
In the clinical literature, talking about concrete events without any sort of emotional involvement is a defence mechanism that goes under the name of rationalization, and it represents a way of protecting the mind from painful feelings by using an abstract, intellectual and often concrete attitude in dealing with them. While the good outcome psychotherapeutic relationships seem capable of addressing the emotional content lying under the surface of the psychotherapeutic field (i.e., the therapist's use of negative emotional language), the poor outcome dynamics seem to be completely wrapped up in a process of avoiding it.
Both the PCA (de Felice et al., 2018) and the text analysis of the clinical transcripts confirmed the difficulty, in poor outcome psychotherapies, of working on the patient's emotional aspects. This bottom-up technique of text analysis of clinical transcripts turned out to be an enlightening tool for letting their latent dimensions emerge, arranging the clinical process and outcome, and therefore provides a very useful tool for clinical purposes.

References
Beck, A. T., Steer, R. A., & Garbin, M. G. (1988). Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clinical Psychology Review, 8: 77-100.
Bucci, W., Kabasakalian-McKay, R., & RA Research Group (1992). Scoring referential activity. Ulm, Germany: Ulmer Textbank.
Carli, R., Dolcetti, F., & Dolcetti (2004). L'Analisi Emozionale del Testo (AET): un caso di verifica nella formazione professionale. In Purnelle, G., Fairon, C., & Dister, A. (eds), Actes JADT 2004: 7es Journées internationales d'Analyse statistique des Données Textuelles, pp. 250-261.
Cordella, B., Greco, F., & Raso, A. (2014). Lavorare con corpus di piccole dimensioni in psicologia clinica: una proposta per la preparazione e l'analisi dei dati. In Née, E., Daube, M., Valette, M., & Fleury, S. (eds), Actes JADT 2014: 12es Journées internationales d'Analyse statistique des Données Textuelles (Paris, France, June 3-6, 2014), pp. 173-184.
de Felice, G., Orsucci, F., Mergenthaler, E., Gelo, O., Paoloni, G., Scozzari, A., Serafini, G., Andreassi, S., Vegni, N., & Giuliani, A. (2018). What differentiates good and poor outcome psychotherapies? A statistical mechanics approach to psychotherapy research. Nonlinear Dynamics, Psychology and Life Sciences. Submitted.
Gelo, O. C. G., & Salvatore, S. (2016). A dynamic systems approach to psychotherapy: A meta-theoretical framework for explaining psychotherapy change processes. Journal of Counseling Psychology, 63(4): 379-395.
Gelo, O. C. G., Salcuni, S., & Colli, A. (2013). Text analysis within quantitative and qualitative psychotherapy process research: introduction to special issue. Research in Psychotherapy: Psychopathology, Process and Outcome, 15: 45-53.
Greco, F. (2016). Integrare la disabilità. Una metodologia interdisciplinare per leggere il cambiamento culturale. Franco Angeli.
Greco, F., Maschietti, D., & Polli, A. (2017). Emotional text mining of social networks: The French pre-electoral sentiment on migration. Rivista Italiana di Economia Demografia e Statistica, 71(2): 125-136.
Greenberg, L., Rice, L., & Elliott, R. (1993). Facilitating Emotional Change: The Moment-by-Moment Process. Guilford Press.
Greenberg, L. S., & Watson, J. C. (1998). Experiential therapy of depression: differential effects of client-centered relationship conditions and process experiential interventions. Psychotherapy Research, 8: 210-224.
Lebart, L., & Salem, A. (1994). Statistique Textuelle. Dunod.
Mergenthaler, E. (2008). Resonating minds: A school-independent theoretical conception and its empirical application to psychotherapeutic processes. Psychotherapy Research, 18(2): 109-126.
Salvatore, S., Gelo, O., Gennaro, A., Metrangolo, R., Terrone, G., Pace, V., Venuleo, C., Venezia, A., & Ciavolino, E. (2017). An automated method of content analysis for psychotherapy research: A further validation. Psychotherapy Research, 27(1): 38-50.
Salvatore, S., Gennaro, A., Auletta, A. F., Tonti, M., & Nitti, M. (2012). Automated method of content analysis: A device for psychotherapy process research. Psychotherapy Research, 22(3): 256-273.
Savaresi, S. M., & Boley, D. L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.
Spitzer, R., Williams, J., Gibbons, M., & First, M. (1989). Structured Clinical Interview for DSM-III-R. American Psychiatric Association.
Watson, J. C., Greenberg, L. S., & Lietaer, G. (1998). The experiential paradigm unfolding: Relationship and experiencing in therapy. In Greenberg, L. S., Watson, J. C., & Lietaer, G. (eds), Handbook of Experiential Psychotherapy. Guilford Press.
DOMINIO: A Modular and Scalable Tool for Open Source Intelligence
Francesca Greco (Sapienza University of Rome, Prisma S.r.l., francesca.greco@uniroma1.it), Dario Maschietti (Prisma S.r.l., d.maschietti@prismaprogetti.it), Alessandro Polli (Sapienza University of Rome, alessandro.polli@uniroma1.it)

Abstract
Prisma has developed an innovative technology for Open Source Intelligence (OSINT) which aims to provide a solution for those knowledge management processes that require the intervention of a human operator, unaided by information technology (IT) support, in one or more stages of the procedure. Such intervention involves a considerable expenditure of time and resources that could be reduced through the use of an IT tool, partially or totally automating entire stages of the procedure. DOMINIO is a platform that implements tools for the automatic aggregation of online information, its analysis, the possible alignment with traditional databases, and its representation through infographic and georeferencing tools, in order to generate a report. This paper describes the platform architecture, the main algorithms used in the content analysis stage, and possible directions of development.

Abstract
Prisma has developed an innovative technology aimed at Open Source Intelligence (OSINT) intended to respond to knowledge management needs that require the intervention of a human operator, unaided by information technology (IT) support, in one or more stages of the procedure. Such intervention entails a considerable expenditure of time and resources, which could be reduced through the use of an IT tool that partially or totally automates entire stages of the procedure. DOMINIO is a platform that implements tools for the automatic aggregation of online information, its analysis, the possible alignment with traditional databases, and its representation through infographic and georeferencing tools, with the aim of generating reports. This paper describes the platform architecture, the main algorithms adopted in the content analysis stage, and possible directions of development.

Keywords: knowledge management, Open Source Intelligence tool, Information Technology

1. Introduction
There is a close link between data management and knowledge on the one hand, and between knowledge and innovation on the other. The growing mass of unstructured information from disparate channels (search engines, RSS feeds, social networks) and from traditional databases entails the need to drastically simplify the preparation, analysis and reporting stages required to structure the information. In fact, only structured information translates into knowledge. Knowledge, in turn, is a major driver of innovation and, properly managed, it translates into a competitive advantage.
The idea at the basis of the OSINT (Open Source Intelligence) tool stems from needs expressed by analysts, mainly those working in the sentiment analysis and opinion mining industry. However, this idea is comprehensive enough to encompass all those knowledge management activities, similar to the former, which require the intervention of a human operator, unaided by IT (Information Technology) support, in one or more stages of the procedure, an intervention which consumes a great deal of time and resources. Although machine learning systems are starting to spread in high-end solutions, the available technology is still characterized by significant limitations, especially in the presence of unstructured information. In particular, supervised machine learning systems require the intervention of an operator in the initial stages of the procedure and, more generally, any automated system applied to the analysis of a text is still unable to identify complex cognitive functions (for example, irony). Of course, these problems are immanent in many fields of OSINT, and they also affect the reporting stage, which requires the direct involvement of the analyst, unaided by IT. The availability of an IT tool that minimizes human operator intervention by partially or totally automating entire stages of the procedure would therefore yield substantial advantages: time savings, increased productivity and, as a result, a more efficient allocation of human and financial resources. Prisma has developed an innovative OSINT technology which aims to fix the problems briefly described above. The platform implements tools for the automatic aggregation of online information, its analysis, its alignment with traditional databases, and its representation through infographic and georeferencing tools, with the aim of also automating the drafting of the final report. This paper describes the architecture of the platform, the main analysis modules and possible directions of development.

2. Platform Architecture
DOMINIO is an OSINT (Open Source Intelligence) platform that automatically aggregates information from online and traditional databases, analyses it and generates reports on a user-defined subject. The platform collects information by querying several channels: search engines (Google, Yahoo, Bing), social networks (Facebook, Twitter, Google+), RSS feeds, blogs (Blogger, Wordpress, Tumblr) and traditional databases. The goal of DOMINIO is to build a structured set of contents, as broad as possible, and to carry out a wide range of qualitative and quantitative analyses. DOMINIO stores these contents in a non-relational database (MongoDB, 2018; Morphia, 2018), classifying the documents by channel of origin (Twitter, Facebook, RSS, etc.) to ensure the homogeneity of the collections. Among the options, the DOMINIO user can run queries on demand or in continuous mode. The on-demand option carries out an asynchronous search, while the continuous mode makes it possible to aggregate data periodically and to track a subject over an extended time span. DOMINIO's architecture allows the user to switch from one mode to the other; the availability of two search modes overcomes the trade-off between accuracy of analysis and speed of processing.
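The paper does not publish DOMINIO's storage code or document schema; the following minimal sketch, in Python with pymongo, only illustrates the kind of channel-classified storage and on-demand retrieval described above. The collection names, field names and the track_subject helper are hypothetical, not the actual implementation.

# Hypothetical sketch of channel-classified storage in MongoDB,
# in the spirit of DOMINIO's data layer (not the actual implementation).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
db = client["osint"]                               # hypothetical database name

def store_document(channel: str, subject: str, text: str) -> None:
    """Store one scraped document in the collection of its origin channel."""
    db[channel].insert_one({
        "subject": subject,
        "text": text,
        "retrieved_at": datetime.now(timezone.utc),
    })

def track_subject(subject: str, channels: list) -> list:
    """On-demand retrieval: collect all stored documents about a subject."""
    results = []
    for channel in channels:
        results.extend(db[channel].find({"subject": subject}))
    return results

store_document("twitter", "migration", "Example tweet text ...")
hits = track_subject("migration", ["twitter", "facebook", "rss"])
print(len(hits), "documents found")

Keeping one collection per channel, as sketched here, is one simple way to preserve the homogeneity of the collections mentioned in the text; a continuous mode would simply re-run the same retrieval on a schedule.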
With regard to one or more subjects selected by the operator, DOMINIO performs synchronous or asynchronous searches on a set of Internet channels, such as search engines (Google, Yahoo, Bing), social networks (Facebook, Twitter, Google+), RSS feeds and blogs (Blogger, Wordpress, Tumblr). The user can also extend the search to the Deep Web, through specific search engines such as Torch or Grams. Moreover, to meet specific information needs, DOMINIO can match these search results with the information obtainable from traditional databases to support many types of analysis (brand reputation, country risk assessment, opinion polls, cyber security, etc.), considerably increasing the operability and flexibility of the tool. Among the traditional databases already available, DOMINIO includes:
• IHS Jane's (2018), which provides updates on the military and political situation, terrorist acts, civil wars and transportation systems for most of the countries in the world;
• Bureau van Dijk (2018), which collects firm data on ratings, shareholdings, equity investments and M&A;
• MIG (a geographic information database drawn up by one of the authors).
In addition, for specific information purposes, DOMINIO can interface with Enterprise Resource Planning databases (such as SAP, Oracle, etc.) through market tools (Business Objects, QlikView). The search results are recalled by the analyst, who operates from a CMS (Content Management System) application to manage the structured set of contents and to conduct a wide range of qualitative and quantitative analyses (from simple summary statistics to sophisticated multivariate analyses and text and opinion mining techniques). The statistical methods implemented in DOMINIO are chosen by the Prisma research team according to a set of criteria that privileges the suitability of an algorithm to automate entire stages of the procedure, in accordance with the original design idea. Moreover, the modular architecture of DOMINIO, described briefly below, allows a quick integration of the latest analysis tools and innovative methodologies produced in the academic field. Once the content analysis stage is completed, the CMS application generates a micro-site containing the results (geo-referenced maps, summary statistics, multivariate analysis results, textual and semantic sentiment analysis, etc.). After selecting a graphic layout for the final report, the analyst only has to write notes and final remarks. The possibility of including features generating automatic and/or autocompleted comments, customizable by the user, is also being studied. Once this last stage is completed, the report is ready for online publication, for traditional diffusion in pdf format, or for linking to external services. From an architectural point of view, DOMINIO is designed following the most modern criteria of modular software design, with the parallel development of the platform's modules. In short, in order to ensure greater fault tolerance and high safety standards, the system is divided into three independent logical units (cf. Figure 1):
• DOMINIO Engine Unit (MEU), which implements the features of 1) scraping information from the sources mentioned above (web, social networks, RSS feeds, traditional databases); 2) storage of results in the MEDB database; 3) qualitative and quantitative analysis;
• DOMINIO RESTurl Unit (MRESTU), which receives requests from the MCMS unit, verifies their consistency and forwards them to the ME unit.
Upon receiving the response, it enriches it with additional fields (username, token, etc.) and returns it to the MCMS client. The MRESTU unit contains the database (MRESTDB) for user profiling;
• DOMINIO Content Management System Unit (MCMSU), which manages the reporting stage and the archiving of reports according to predefined criteria (organization by topic, chronologically, by template, etc.).

Figure 1 - DOMINIO General Overview

3. Main analysis modules
3.1. Country Threat Assessment
The Country Threat Assessment module supports the Company Intelligence and Security analyst in the country risk assessment process. Through a responsive interface, it aggregates information from the major global industry databases (e.g., IHS Jane's), giving an assessment of external and internal risk, of risk due to political and socio-economic factors, and of potential outbreaks or revolutionary movements for 192 different countries. Country Threat Assessment is integrated with intelligence information updated weekly for each country. Through an automatic report, the data is aggregated into a single file, optimizing the timing of risk assessment and providing a solid foundation for any further detailed analysis. DOMINIO offers the possibility of a full or partial information download and the generation of an automatic report, thus streamlining the drafting process.

3.2. Due Diligence
The Due Diligence module supports the Economic Intelligence analyst in the process of business valuation in relation to suppliers, partners and customers. The aspects analysed in the module include assessments of profitability and financial performance as well as creditworthiness. Through a simple and intuitive interface, the module aggregates information from leading industry databases and returns an economic, financial and credit risk profile for hundreds of millions of businesses around the world. The Due Diligence module also allows an assessment of individuals, through the analysis of politically exposed persons, returning an automatic report that integrates the main aspects of each business and its economic risk analysis.

3.3. Open Source Intelligence
This module aggregates large amounts of data from the major social networks (Facebook, Twitter, Youtube) and the main Italian newspapers, based on keywords predetermined by the analyst; it returns a statistical representation of the main trending topics and generates an output of structured data for subsequent multivariate analysis. Furthermore, the module allows the geo-referencing of content, highlighting geographically localized signals useful to the analyst. As for each of DOMINIO's modules, automatic reports can be generated.

3.4. Geographic Information Module
This module analyses the information inferable from a dataset of basic statistical information and related indicators, with reference to a multitude of subjects, nine of which are currently under development. The basic statistical information refers to the division of the Italian territory into provinces, covering a time period between 1995 and the latest available year, which for some subject areas is the current year or, more frequently, the year before the current one. The dataset will support a wide range of applications, from forecasting and scenario analysis to counterfactual and spatial analysis.
3.5. Text Mining Module
This module performs the automatic analysis of textual data using statistical methods (Lebart and Salem, 1994; Feldman and Sanger, 2006; Bolasco, 2013) in order to extract structured information. The main statistical methods for the analysis of textual data implemented in DOMINIO are: factor analysis (correspondence analysis, multiple correspondence analysis); cluster analysis (k-means, bisecting k-means, fuzzy clustering, etc.); network analysis; Markov analysis; pattern recognition. For example, during the French presidential campaign of 2017 we analysed the sentiment about migration, which was one of the most debated themes. We performed an Emotional Text Mining (Greco et al., 2017) in order to explore the emotional content of the Twitter messages concerning migration written in French in the last two weeks before the first round of the 2017 presidential election. The aim was to analyse the opinions, feelings and shared comments, classifying the contents and the sentiments. We retrieved the messages from the Twitter repository, collecting a sample of over one hundred thousand tweets. The large corpus of 2,154,194 tokens (TTR = 0.01; hapax percentage = 40.4) underwent a multivariate analysis based on a bisecting k-means algorithm (Savaresi and Boley, 2004) to classify the text, and a correspondence analysis (Lebart and Salem, 1994) to detect the latent dimensions structuring the cluster-per-keywords matrix. The advantage of this approach is that the factorial space can be interpreted according to word polarization, thus identifying the emotional categories that generate the representations of migration, and that the interpretation of clusters is facilitated by exploring their relationships within the symbolic space (Greco, 2016). The interpretation of the results allowed for the detection of seven representations of migrants, corresponding to three different sentiments: positive (42%), negative for the community (45%), and negative for migrants (13%). We considered as negative the representations of migrants as squatters, invaders, terrorists, trafficked slaves and migration victims, and as positive the sport heroes and the EU solidarity target. Among the negative clusters, we distinguished negativity according to the direction of the action: squatters, terrorists and invaders are negative for the community, while trafficked slaves and migration victims are negative for the migrants themselves (see Greco et al., 2017). Moreover, it was possible to highlight the connection between real-life events and the production of tweets. While the terrorist attack in the centre of Paris three days before the first round of voting had only slightly modified the production of messages, the candidates' interviews had a higher impact. This suggests that media coverage was more important than the terrorist attack in driving the production of messages (see Greco et al., 2017).
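The authors' Emotional Text Mining pipeline itself is not listed in the paper; the sketch below, assuming scikit-learn (version 1.1 or later for BisectingKMeans), only illustrates the two steps named above: clustering the documents with a bisecting k-means and building the cluster-per-keywords matrix that a correspondence analysis would then decompose. The toy tweets and all parameter choices are invented for illustration.

# Minimal sketch of the clustering step of an Emotional Text Mining analysis:
# bisecting k-means over a bag-of-words matrix, then a cluster x keyword
# contingency table to be passed to a correspondence analysis.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import BisectingKMeans  # scikit-learn >= 1.1

tweets = [
    "les migrants envahissent nos villes",   # toy examples, not real data
    "solidarite avec les migrants",
    "victimes de la traite des etres humains",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)         # documents x keywords

n_clusters = 2                               # 7 clusters in the study above
labels = BisectingKMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)

# Cluster-per-keywords matrix: one row per cluster, keyword frequencies.
table = np.zeros((n_clusters, X.shape[1]))
for cluster in range(n_clusters):
    table[cluster] = np.asarray(X[labels == cluster].sum(axis=0)).ravel()

for cluster in range(n_clusters):
    top = table[cluster].argsort()[::-1][:3]
    print(cluster, [vectorizer.get_feature_names_out()[i] for i in top])

The resulting table is exactly the kind of cluster-per-keywords matrix on which the correspondence analysis cited above (Lebart and Salem, 1994) detects the latent emotional dimensions.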
4. Conclusion
The innovative aspect that characterizes DOMINIO is its ability to aggregate data of different types and from different information channels automatically, simply and transparently. Moreover, its structure allows for the integration of the latest analytical tools and innovative methodologies produced in academia. By means of an automated reporting system, the analyst is supported in the assessment of risk and in the collection of information in the geopolitical and economic fields and from open sources. The set of modules allows the analyst to generate knowledge from an ever-growing amount of data, streamlining the processes of assessment and risk reduction.

References
Bolasco S. (2013). L'analisi automatica dei testi: Fare ricerca con il text mining. Carocci.
Bureau van Dijk (2018). A Moody's Analytics Company. Bureau van Dijk, https://www.bvdinfo.com/it-it/home
Feldman R. and Sanger J. (2006). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
Greco F. (2016). Integrare la disabilità. Una metodologia interdisciplinare per leggere il cambiamento culturale. Franco Angeli.
Greco F., Maschietti D. and Polli A. (2017). Emotional text mining of social networks: The French pre-electoral sentiment on migration. RIEDS, 71(2): 125-136.
IHS Jane's (2018). Jane's Information Group. IHS Jane's, http://www.janes.com
Lebart L. and Salem A. (1994). Statistique Textuelle. Dunod.
MongoDB (2018). MongoDB for GIANT ideas. MongoDB, https://www.mongodb.com
Morphia (2018). The Java Object Document Mapper for MongoDB. MongoDB, https://mongodb.github.io/morphia/
Savaresi S.M. and Boley D.L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.

Is training worth the trouble? A PoS tagging experiment with Dutch clinical records

Leonie Grön, Ann Bertels, Kris Heylen
KU Leuven – leonie.gron@kuleuven.be; ann.bertels@kuleuven.be; kris.heylen@kuleuven.be

Abstract
Part-of-speech (PoS) tagging is a core task of Natural Language Processing (NLP), which crucially influences the output of advanced applications. For the tagging of specialized language, such as that used in Electronic Health Records (EHRs), the domain adaptation of taggers is generally considered necessary, since the linguistic properties of such sublanguages may differ considerably from those of general language. Previous research suggests, though, that the net benefit of domain adaptation varies across languages. Therefore, in this paper, we present a case study to evaluate the effect of training with in-domain data on the tagging of Dutch EHRs.

Keywords: Electronic Health Records; Part-of-Speech tagging; medical sublanguage; Dutch

1. Background
EHRs are valuable resources for data-driven knowledge-making. To unlock the relevant information from free text, domain-specific NLP systems are required. Such systems must deal with a text genre characterized by a high density of specialized terms, including non-canonical variants, and by non-standard syntactic constructions. These properties affect all further steps in a processing pipeline, starting from core tasks such as PoS tagging. Since PoS values are important features for further processing, the output of many systems, such as tools for term extraction and term-to-concept mapping (e.g. Doing-Harris et al., 2015; Scheurwegs et al., 2017), crucially depends on the accuracy of the PoS tags assigned in the first place. Processing suites such as cTAKES (e.g. Savova et al., 2010), which have been developed specifically for the medical domain, are known to boost tagging performance. As most tools are only available for English, though, systems dealing with other languages, such as Dutch, must start the domain adaptation from scratch. Typically, this process involves the re-training of an existing tool on hand-coded data, which is time- and labor-intensive.
Besides, evidence from German challenges the widely-held belief that domain training is a prerequisite for good tagging performance (Wermter and Hahn, 2004). Given these considerations, we conduct a pilot study to investigate the potential benefit of domain adaptation for the PoS tagging of Dutch EHRs. Firstly, we assess the impact of training with a hand-coded clinical dataset on the accuracy of an off-the-shelf tagger. Secondly, we evaluate how the difference in accuracy affects the output of a term extraction method based on PoS patterns.

2. Related Work
For the PoS tagging of clinical writing, the main challenges reside in the particular linguistic properties of the genre, both at the lexical and the syntactic level. On the one hand, EHRs contain a high proportion of specialized terminology and idiosyncrasies, including misspellings and non-canonical abbreviations; a tagger developed for general language will thus encounter a high number of out-of-vocabulary words (Knoll et al., 2016). To complicate this matter, the PoS distributions in clinical corpora differ from those found in general language, which may be detrimental to the statistical classification of unknown or ambiguous tokens (Pakhomov et al., 2006). On the other hand, EHRs are typically composed in a telegraphic style, characterized by the omission of functional syntactic elements; the lack of linguistically informative context may prevent the accurate prediction of PoS transitions within n-grams (Coden et al., 2005). At the same time, the average sentence length in EHRs is relatively short; the high number of inter-sentential transitions may pose additional pitfalls for an out-of-domain tagger (Pakhomov et al., 2006). Most previous research thus agrees that the use of off-the-shelf taggers on clinical writing is highly prone to errors, which are likely to be propagated through the different levels of an application (Ferraro et al., 2013). Therefore, many state-of-the-art systems use an annotated set of EHRs for training. The creation of training materials comes at a cost, though, and entails a range of methodological challenges in itself, such as the creation of suitable guidelines and tagsets (Albright et al., 2013). To circumvent these issues, alternative ways of domain adaptation have been explored, including the integration of a domain-specific vocabulary and the exploitation of morphological features to classify unknown words (Knoll et al., 2016). However, languages other than English may present a different case: in an early study, Wermter and Hahn (2004) come to the conclusion that in German, taggers trained on newswire perform very well on EHRs. This surprising finding can be partly attributed to the rich inflectional system of the language, which lends itself to the prediction of PoS categories. On the other hand, the low complexity of the medical sublanguage may be a factor: in their study, the general training data subsumed all PoS transitions found in the clinical test data, so that the tagger was sufficiently equipped to handle the latter.

3. Methods
3.1. Corpus and manual tagging
Our study is based on the analysis of a mixed sample of EHRs, containing a total of 375 documents.
As detailed in Table 1, the subsets of this sample differ with regard to their medical subdomain, institutional origin and document structure: the EN and RD sets cover only one medical specialty, whereas the DL, SP and GP sets are less homogeneous; the DL, EN and RD sets were composed at a single institution, while the documents in the GP and SP sets are drawn from a multi-source database, Integrated Primary Care Information (IPCI), which contains EHRs from medical practices all across the Netherlands. Finally, the EHRs in four subsets (DL, GP, RD, SP) had been split into shorter fragments to comply with privacy standards; these documents are therefore much shorter than those in the EN set, which count 204.2 tokens on average. All EHRs are tokenized with the NLTK tokenizer (http://www.nltk.org/_modules/nltk/tokenize.html) and manually labelled by the authors, using the Universal Tagset (Petrov et al., 2012). Finally, for each subset, the EHRs are split into a training and a test set, containing 67% and 33% of the files respectively.

Table 1: Overview of the subsets of our file sample. The columns specify the name of the subset, the document types, the origin and number of institutions involved in their creation, the number of documents, the absolute length in tokens, and the average document length in tokens.

Subset  Document types                                             Origin                    Nr. of sources  Nr. of documents  Subset length  Average document length
DL      Clinical discharge letters                                 EMC Rotterdam             One             88                3597           40.88
EN      EHRs from endocrinology                                    UZ Leuven                 One             80                16337          204.2
GP      EHRs from general practitioners                            IPCI (Vlug et al., 1999)  Multiple        60                1431           23.85
RD      EHRs from radiology                                        EMC Rotterdam             One             60                1441           24.02
SP      Specialist letters from various fields (e.g. cardiology)   IPCI (Vlug et al., 1999)  Multiple        87                4784           54.99
Σ                                                                                                            375               27590          73.57

3.2. Evaluation
3.2.1. Effect of domain training on tagging performance
Firstly, we assess the impact of using in-domain data for training on tagging accuracy. For evaluation, we use the state-of-the-art Perceptron Tagger (http://www.nltk.org/_modules/nltk/tag/perceptron.html). This tagger uses context tokens as well as suffix features for classification. As Knoll et al. (2016) show, this configuration outperforms a primarily sequential tagger, as used by Wermter and Hahn (2004), on clinical data. The pre-compiled model for Dutch is trained on the Alpino Treebank (van Noord, 2006). In addition, we build a domain-specific model based on the manually labelled training set. Then, we feed both models into the tagger to classify the test set. To measure the accuracy of each model, we calculate the precision, i.e. the proportion of tags that match those in the manually labelled gold standard. (The Alpino model uses a more fine-grained tagset than the Universal Tagset used for the manual tagging. To enable the comparison across models, the redundant labels from Alpino are mapped to the respective categories of the Universal Tagset, e.g. adj, comparative → ADJ.) To compare the effect across the different subsets, we calculate the gain in precision achieved with the domain model relative to the precision achieved with the Alpino baseline.
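The training and evaluation loop described here is straightforward to reproduce with NLTK; the following sketch assumes the hand-labelled EHRs are already available as lists of (token, tag) sentences. The toy sentences are invented for illustration and are not from the authors' corpus.

# Minimal sketch of training NLTK's averaged-perceptron tagger on in-domain
# sentences and measuring tagging precision against a gold standard.
from nltk.tag.perceptron import PerceptronTagger

# Hypothetical hand-labelled clinical sentences, Universal Tagset style.
train_sents = [
    [("zwelling", "NOUN"), ("enkel", "NOUN")],
    [("diabetische", "ADJ"), ("retinopathie", "NOUN")],
]
gold_sents = [
    [("oedeem", "NOUN"), ("enkel", "NOUN")],
]

# load=False skips the pre-trained English model; we train from scratch.
tagger = PerceptronTagger(load=False)
tagger.train(train_sents)

correct = total = 0
for gold in gold_sents:
    tokens = [tok for tok, _ in gold]
    predicted = tagger.tag(tokens)
    correct += sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    total += len(gold)

print("precision:", correct / total)

Evaluating the pre-compiled Alpino baseline would follow the same loop, simply loading that model instead of training one.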
3.2.2. Effect of tagging performance on term recognition and extraction
Secondly, we quantify the effect of tagging performance on pattern-based term recognition. For the identification of term candidates, we use a set of PoS sequences that are characteristic of termhood in the domain. Similar to Scheurwegs et al. (2017), we focus on complex nominals, i.e. nouns surrounded by one or more modifiers; Table 2 provides some examples of such patterns.

Table 2: Examples of PoS patterns used for term retrieval. The left column lists the target tag sequence, the middle and right columns provide Dutch examples and English translations of term candidates.

PoS pattern            Dutch example               English translation
adjective noun         'diabetische retinopathie'  diabetic retinopathy
noun adposition noun   'syndroom van Apert'        syndrome of Apert
noun noun              'zwelling enkel'            swelling ankle

Using a sliding-window approach, we iterate through the three tagged versions of the test set, i.e. the manually tagged gold standard, the version tagged with the Alpino model and the version tagged with the domain model. We identify all PoS sequences that match the pre-specified patterns, and extract the respective tokens for manual validation. For each version, we calculate the precision as the proportion of true positives, i.e. domain-specific phrases, relative to the total list of matches. (To qualify as domain-specific, a phrase must contain at least one noun that has a concept entry in the clinical terminology SNOMED-CT (International Release July 2017; http://browser.ihtsdotools.org/). For instance, 'echografie rechterschouder' echography right shoulder, which refers to a clinical procedure, would count as a true positive; the general expression 'pak koekjes' bag of biscuits would not.) To assess the individual effect size, we also calculate the relative gain in precision for each subset.
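A sliding-window matcher over tagged sentences can be written in a few lines; the sketch below is a simplified stand-in for the authors' extraction step, with the pattern list taken from Table 2 and an invented tagged sentence as input.

# Minimal sliding-window extraction of term candidates whose PoS sequence
# matches one of the target patterns from Table 2.
PATTERNS = [
    ("ADJ", "NOUN"),          # adjective noun
    ("NOUN", "ADP", "NOUN"),  # noun adposition noun
    ("NOUN", "NOUN"),         # noun noun
]

def extract_candidates(tagged):
    """tagged: list of (token, tag) pairs; returns matched token sequences."""
    tags = [tag for _, tag in tagged]
    candidates = []
    for pattern in PATTERNS:
        width = len(pattern)
        for i in range(len(tags) - width + 1):
            if tuple(tags[i:i + width]) == pattern:
                candidates.append(" ".join(tok for tok, _ in tagged[i:i + width]))
    return candidates

# Invented example sentence, tagged with the Universal Tagset.
sentence = [("syndroom", "NOUN"), ("van", "ADP"), ("Apert", "NOUN"),
            ("met", "ADP"), ("diabetische", "ADJ"), ("retinopathie", "NOUN")]
print(extract_candidates(sentence))
# ['diabetische retinopathie', 'syndroom van Apert']

The extracted candidates would then be validated manually against SNOMED-CT, as described above, to separate true positives from out-of-domain phrases.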
‘kleine boterhammen’ small sandwiches). Then, we calculate the proportion of error types among the false positives provided by both models. With the Alpino model, the vast majority of errors (74.4%) is based on false PoS tags. About 18.2% of the proposed term candidates are out-of-domain, while only a small portion (7.3%) of errors is caused by mistakes in segmentation. Conversely, with the domain model, most false positives (49.7%) are out-of-domain terms; errors in tagging and segmentation account for 30.1% and 20.2% respectively. Table 3 : Precision of PoS tagging and term extraction across subsets. The first column specifies the subset. The second and third column provide the percentage of correct tags assigned by the domain model and the Alpino model respectively; the fourth column contains the relative increase in precision. The remaining three columns provide the corresponding values for the extraction task. Term extraction PoS tagging % % Prec Prec % domain subset % Prec domain model % Prec Alpino % increase model Alpino increase DL 89.62 EN GP 76.61 16.99 7.33 2.64 177.87 86.82 67.5 28.62 21.48 8.04 167.1 79.81 61.76 29.23 3.28 0.84 291.31 RD 88.98 74.1 20.08 8.89 3.31 168.52 SP 83.68 54.5 53.53 5.52 2.26 144.09 Σ 85.78 66.9 29.69 9.3 3.42 189.78 4. Discussion Overall, the positive effect of domain adaptation is evident: Using clinical data for training improved the accuracy of PoS assignments and, as a consequence, the output of the term extraction method. Based on our results, we do not see a clear relation between the amount of training data and the global level of precision: For PoS tagging, DL and RD, which are among the smaller subsets, score highest; on the other hand, for the term extraction task, EN, which is the largest subset, produces the best results by far. This indicates that the benefit of training hinges on linguistic and semantic qualities, rather than the mere quantity of the data. In particular, tagging performance correlates with the homogeneity and wellformedness of the data. The homogeneity depends, on the one hand, on the medical field: A dataset such as RD, which is confined to one clinical JADT’ 18 357 specialty, only makes reference to a fairly limited number of medical concepts; by contrast, a more heterogeneous set, such as SP, covers a wider range. Besides, the number of institutions involved in data creation plays a role: In an EHR sample provided by a single hospital, such as EN, it is likely that preferred terms and phrases are perpetuated throughout the dataset. By contrast, in a set drawn from a multi-source database, such as GP, the potential for variation is higher. Both these factors affect the overall size of the vocabulary, which, in turn, determines the complexity of the tagging task. The well-formedness, on the other hand, depends mainly on the EHR type. The GP set, for instance, contains mostly notes intended for internal documentation; these notes are written in an informal style, whereby function words and suffixes may be left out or truncated. As these features usually serve as predictors for PoS classification, their omission may cause a drop in tagging performance. While the global level of precision is thus lowest in conceptually and lexically EHR samples, such as GP and SP, the relative benefit of domain adaptation is the greatest here. 5. Conclusion We conclude that the training with in-domain data benefits the output of PoS taggers for clinical Dutch. 
Especially if the file sample covers different subdomains, or if the language used deviates strongly from the standard, the potential gain in performance is great. At the same time, considerable training efforts may be required to achieve only marginal improvements. Depending on the scope of the project and the composition of the sample, it may thus be preferable to implement a cheaper alternative, for instance by integrating a domain dictionary into the tagger.

Acknowledgements
This work was supported by Internal Funds KU Leuven.

References
Albright D., Lanfranchi A., Fredriksen A., Styler W.F., Warner C., Hwang J.D., Choi J.D. et al. (2013). Towards Comprehensive Syntactic and Semantic Annotations of the Clinical Narrative. J Am Med Inform Assoc vol. 20: 922-30.
Coden A.R., Pakhomov S.V., Ando R.K., Duffy P.H. and Chute C.G. (2005). Domain-Specific Language Models and Lexicons for Tagging. J Biomed Inform vol. 38: 422-30.
Doing-Harris K., Livnat Y. and Meystre S. (2015). Automated Concept and Relationship Extraction for the Semi-Automated Ontology Management (SEAM) System. J Biomed Semantics vol. 6 (15): 1-15.
Fan J.-W., Prasad R., Yabut R.M., Loomis R.M., Zisook D.S., Mattison J.E. and Huang Y. (2011). Part-of-Speech Tagging for Clinical Text: Wall or Bridge between Institutions? In AMIA Annu Symp Proc, pp. 382-91.
Ferraro J.P., Daumé H.I., DuVall S.L., Chapman W.W., Harkema H. and Haug P.J. (2013). Improving Performance of Natural Language Processing Part-of-Speech Tagging on Clinical Narratives through Domain Adaptation. J Am Med Inform Assoc vol. 20: 931-39.
Knoll B.C., Melton G.B., Liu H., Xu H. and Pakhomov S.V.S. (2016). Using Synthetic Clinical Data to Train an HMM-Based POS Tagger. In 2016 IEEE-EMBS (International Conference on Biomedical and Health Informatics), pp. 252-55.
van Noord G. (2006). At Last Parsing Is Now Operational. In Proceedings of TALN 2006, pp. 20-42.
Pakhomov S.V., Coden A. and Chute C.G. (2006). Developing a Corpus of Clinical Notes Manually Annotated for Part-of-Speech. Int J Med Inform vol. 75: 418-29.
Petrov S., Das D. and McDonald R. (2012). A Universal Part-of-Speech Tagset. In Calzolari N., Choukri K., Declerck T., Doğan M.U., Maegaard B., Mariani J., Moreno A., Odijk J. and Piperidis S., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pp. 2089-96.
Savova G.K., Masanz J.J., Ogren P.V., Zheng J., Sohn S., Kipper-Schuler K.C. and Chute C.G. (2010). Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, Component Evaluation and Applications. J Am Med Inform Assoc vol. 17: 507-13.
Scheurwegs E., Luyckx K., Luyten L., Goethals B. and Daelemans W. (2017). Assigning Clinical Codes with Data-Driven Concept Representation on Dutch Clinical Free Text. J Biomed Inform vol. 69: 118-27.
Vlug A.E., van der Lei J., Mosseveld B.M., van Wijk M.A., van der Linden P.D., Sturkenboom M.C. and van Bemmel J.H. (1999). Postmarketing Surveillance Based on Electronic Patient Records: The IPCI Project. Methods Inf Med 38 (4/5): 339-44.
Wermter J. and Hahn U. (2004). Really, Is Medical Sublanguage That Different? Experimental Counter-Evidence from Tagging Medical and Newspaper Corpora. In Fieschi M., Coiera E. and Li Y.-C.L., editors, Proc. of the 11th World Congress on Medical Informatics (MEDINFO 2004), pp. 560-64.

Textual statistics tools for analysing survey data corpora from official statistics

France Guérin-Pace, Elodie Baril
Institut national d'études démographiques

Abstract
For more than 20 years, textual statistics methods have allowed us to explore and analyze data from official statistics surveys and the different corpora they contain: answers to an open question, associated words, significant life events. Based on three corpora of data, namely the Populations-Espaces de vie-Environnements survey (Ined, 1992), the EuroBroadMap survey on representations of Europe in the world (2009) and, more recently, the Information et Vie quotidienne survey on adult reading skills (INSEE, 2011), we demonstrate the diverse uses of these methods and their richness for characterizing corpus content in relation to the individual characteristics of respondents as well as to the other survey questions.
More recently, we have used these methods to post-code the events collected in the IVQ survey. We present the results of this work, together with the benefits and limitations of the textual statistics approach.

Keywords: textual statistics, open-ended questions, associated words corpus, post-coding.

1. Corpora of varied nature
Introducing open-ended questions into a general population survey is always a challenge for survey designers, even though textual statistics methods have long proven their value and effectiveness for processing them. Delineating the contours and accepted meaning of a catch-all word was the objective of introducing the open question "If I say environment, what does that evoke for you?" in the Populations-Espaces de vie-Environnements survey carried out in 1992 (INED) on a sample of 6,000 people representative of the French population. One of the objectives was to examine which representations people construct of the very notion of environment. A somewhat different collection technique was adopted, for example, in the EuroBroadMap survey conducted in 2009 in 18 countries. Surveying nearly 10,000 students around the world about their representations of Europe was one of the objectives of this European project. A central element of this design was to collect the words the students associate with Europe (the question asked was: "Which words do you associate most with 'Europe'? Choose at most 5 words.") after asking them to draw, according to their perception, its boundaries on a world map. Unlike the previous corpus, the words were not proposed as a list; it is the spontaneous representations that are collected. The advantage of this associated-words technique is that it constrains the format of the answers more strongly and yields a more homogeneous corpus. One of the main difficulties of this corpus is the language in which the associated words are collected. To partly solve this problem, we chose to have the answers translated into English for each country at data-entry time, following precise instructions (for reasons of cost and time, partners were instructed to translate the associated words into English themselves while keying in the questionnaires; the first textual analyses revealed inconsistencies and required going back to the questionnaires in their original language). Another form of qualitative material worth collecting in surveys concerns life events. For demographers, collecting elements of individual life courses has a highly relevant explanatory dimension, whether these are turning points, ruptures within biographical trajectories, or one-off elements without long-term consequences (Laborde et al., 2007).
This is what we set up in the Information et Vie quotidienne (IVQ) survey (Guérin-Pace, 2009). Significant events can be collected in an open or a closed form. The advantage of collecting them through a closed question is that systematic comparisons can be made, insofar as all respondents answer the same question. We introduced into the survey, as a closed question, the most frequently cited events (divorce or separation of the parents, death of a close relative, health problem, etc.). Events collected in a "closed" form do not allow all themes to be addressed, particularly those concerning sensitive subjects (cases of violence, for example). Collection by enumeration indeed quickly becomes intrusive, and sometimes inappropriate, for people who are not concerned. Moreover, with this approach, one assumes the a priori traumatic nature of an event without knowing whether Ego experienced it as such during childhood (Laborde et al., 2007). We therefore chose to complement this questioning with the following open question: "Did you experience another significant event during your childhood? If so, which one?". Nearly a quarter of respondents report an "other significant event" from their childhood in answer to this question. Among them, one in two mentions a death, one in ten an event linked to a health problem, and, in the same proportion, a situation of violence experienced during childhood (Baril and Guérin-Pace, 2016).

Table 1: Description of the corpora analysed

Survey                                           Corpus                         Number of responses  Number of occurrences  Number of distinct words
Populations-Espaces de Vie-Environnement (1992)  Environment                    4596                 28716                  2130
EuroBroadMap (2009)                              Words associated with Europe   9343                 40800                  5111
Information et Vie Quotidienne (2011)            Significant childhood events   3167                 15993                  2161

2. An underestimated step: reading the words of the corpus and lexical statistics
An essential first step of analysis is reading the lexicon of the most frequent words of a survey corpus. This lexicon alone gives an overview of the tone of the vocabulary (positive or negative) and of the registers addressed. For example, in the corpus of words associated with Europe, the first word with a pejorative connotation only appears in 26th position (colonialism). Reading the most frequent events indicates, for its part, the individual or collective (most often historical) character of the perceived events. For international or repeated surveys, lexical statistics make it possible to compare the richness of the vocabulary in a meaningful way. Thus, in the "Europe" corpus, comparing the proportions of distinct words (Figure 1) provides interesting information. It appears that the students surveyed in the countries furthest from the European Union (Cameroon, China, Russia, Brazil, India) have a more consensual or shared vision of Europe than those in countries which are members of it, or at its margins.
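The lexical statistics mentioned here (number of occurrences, number of distinct words and their ratio) are easy to reproduce; the following sketch computes them for toy lists of associated words. The per-country comparison of Figure 1 would simply apply the same ratio to each survey country; the word lists below are invented for illustration.

# Minimal lexical statistics of an associated-words corpus: occurrences,
# distinct words and lexical diversity, computed per country (toy data).
from collections import Counter

corpus = {  # hypothetical answers to "Which words do you associate with Europe?"
    "France":   ["culture", "euro", "history", "euro", "union"],
    "Cameroon": ["euro", "euro", "union", "euro", "union"],
}

for country, words in corpus.items():
    occurrences = len(words)
    distinct = len(Counter(words))
    print(country, "-", occurrences, "occurrences,", distinct, "distinct,",
          "diversity =", round(distinct / occurrences, 2))

A lower proportion of distinct words (as in the hypothetical "Cameroon" list) corresponds to the more consensual, shared vocabulary discussed above.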
Figure 1: Diversity of the words associated with Europe by survey country. Source: EuroBroadMap survey (2009).

3. Bringing out the content of an open question from the complete lexical table
Another application of textual analysis methods to a corpus of answers to an open question consists in extracting the lexical worlds following the Alceste methodology. A descending hierarchical classification (DHC) performed on the table crossing the answers to the open question with the lexicon associated with the word "environment" brings out two fundamentally different approaches to the notion of environment (Figure 2). The first addresses the environment through a cognitive approach concerning a physical and social space (quality of life, local surroundings, etc.), while the second approach is more symbolic or imaginary (iconography of nature, sensation of well-being).

Figure 2: The lexical worlds of the "environment" corpus (Alceste). In Guérin-Pace F., 1997.

4. Crossing spontaneous answers with closed questions
The limits of interpretation of an open question lie in the impossibility of interpreting what was not mentioned by the respondents. Complementing the design with closed questions remedies this. Following the open question, we therefore introduced two closed questions proposing a list of nouns and adjectives that the respondent could associate, or not, with the word "environment" (the questions were worded as follows: "Here is a list of nouns (adjectives). Which ones seem to you to be linked to the notion of environment? (For each one, answer yes or no)."). The joint observation of the answers to these two modes of questioning, through a multiple correspondence analysis (MCA) on the lexical table, enriches the analysis of the "spontaneous" content in the light of the closed representations. We thus observe (Figure 3) that the opposition between an environment made of "relations" and an environment made of "nature" (horizontal axis) is accompanied, for example, by the choice or refusal of words and adjectives describing urban nuisances. On the vertical axis, the opposition between an environment conceived as immediate proximity and an environment based on the relations between "man and his milieu" corresponds to an associated vocabulary that reinforces this perception. Close to the first perception, we find the words "house-yes", "friendly-yes", "security-yes" and "planet-no".

Figure 3: Proximity between the forms of the "environment" corpus and the proposed associations (Guérin-Pace F. and Garnier B., 1995). Reading: close to the words "health" or "liberty" cited in answer to the open question, we find "no" answers to the association of the word environment with the words "city" or "violence".
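Correspondence analysis, used throughout these applications to project words and answer modalities into a common factorial space, can be sketched directly from its definition. The snippet below runs a plain CA on a tiny invented contingency table via the SVD of the standardized residuals, using numpy only; it is a didactic stand-in, not the software actually used by the authors.

# Plain correspondence analysis of a small words x modalities contingency
# table (toy data), via the SVD of the standardized residuals.
import numpy as np

N = np.array([[12,  3],   # rows: words from the open question (invented)
              [ 4, 10],
              [ 8,  8]], dtype=float)

P = N / N.sum()                       # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]   # principal row coordinates
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

print("row coordinates on axis 1:", np.round(row_coords[:, 0], 3))
print("column coordinates on axis 1:", np.round(col_coords[:, 0], 3))

Plotting the row and column coordinates on the first two axes reproduces the kind of factorial plane read in Figure 3, with words and yes/no modalities positioned in the same space.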
5. Post-coding significant childhood events with textual statistics
Another, more recent application of these methods, the post-coding of answers to an open question, may seem to contradict the very spirit of textual statistics. More precisely, it consists in post-coding the events collected in the Information et Vie quotidienne (IVQ) survey. To do so, we performed a descending hierarchical classification (DHC) on the complete lexical table crossing the answers to the question "Did you experience other significant events?" with the lexicon of the corpus. We retain a partition into five classes, within which we observe a first dichotomy between events of a collective nature (the Algerian war, May 1968, etc.) and a set of classes evoking events of an individual nature: death, illness, accident and violence (Figure 4). To these five classes we added two further classes: a class labelled "Refusal", grouping all the answers that indicate the respondent's unwillingness to detail the significant event to the interviewer (while having answered affirmatively to the question "Did you experience another significant event?"); and a class labelled "Other", in which we grouped the unclassified answers (nearly 90% of the 3,167 answers to this question are classified). We then sought to refine this typology by specifying the actors possibly involved in the events. For example, within the "Illness" class (class 2), we filtered by means of a family vocabulary (father, mother, brother, sister, aunt, friend, etc.) and constructed four distinct sub-modalities according to the persons concerned.

Figure 4: Typology of significant childhood events. Source: IVQ survey, Iramuteq (Reinert method classification).

We proceeded in the same way for the "violence" class, this time distinguishing the persons concerned by the event and its possible perpetrator. We finally obtain a typology built on the open and closed questions, composed of 43 items (Baril and Guérin-Pace, 2016), which could be reused for other national surveys. In conclusion, these various applications to varied corpora from official statistics surveys highlight the diversity of the contributions of textual statistics methods. Today, more and more national surveys address sensitive topics (violence, precarity, illiteracy, etc.). The use of open-ended questions thus proves indispensable, allowing the researcher to objectify his or her approach. Textual statistics methods are essential in this perspective.

References
Baril E. and Guérin-Pace F. (2016). Compétences à l'écrit des adultes et événements marquants de l'enfance : le traitement de l'enquête Information et vie quotidienne à l'aide des méthodes de la statistique textuelle. Economie et statistique, n°490, pp. 17-36.
Guérin-Pace F. (2009). Illettrismes et parcours individuels. Economie et statistique, n°424-425.
Brennetot A., Emsellem K., Guérin-Pace F. and Garnier B. (2013). Dire l'Europe à travers le monde. Les mots des étudiants à travers l'enquête EuroBroadMap. Cybergeo: European Journal of Geography.
Guérin-Pace F. and Collomb P. (1998). Les contours du mot environnement : Enseignements de la statistique textuelle. L'Espace Géographique, n°1, pp. 41-52.
Guérin-Pace F. (1997). La statistique textuelle : un outil exploratoire en sciences sociales. Population, n°4, pp. 865-888.
Laborde C., Lelièvre E. and Vivier G. (2007). Trajectoires et événements marquants, comment dire sa vie : Une analyse des faits et des perceptions biographiques. Population, vol. 62(3), pp. 567-585.
Annotation-based Digital Text Corpora Analysis within the TXM Platform

Serge Heiden
Université de Lyon, ENS de Lyon, IHRIM – UMR5317, CNRS – slh@ens-lyon.fr

Abstract
This paper presents new developments in the TXM textual corpora analysis platform (http://textometrie.org) towards direct text annotation functionalities. Some annotations are related to a web-based external historical ontology called SyMoGIH, others to co-reference information between words or to word properties like part of speech or lemma. The paper discusses the methodological stakes of unifying in a single framework the production and the analysis of those annotations with the traditional ones already available in TXM, corresponding to the XML markup of the text sources and to the linguistic annotations automatically added to texts by NLP tools.

Keywords: textometry, TXM, digital text representation, XML, TEI, annotation, ontology, co-reference, part of speech, digital hermeneutic circle.

1. Introduction
TXM (Heiden, 2010) is a software platform offering textual corpora analysis tools. It is delivered as a standard desktop application for Windows, Mac and Linux, and as a web portal server application (http://textometrie.org). Its analysis tools combine qualitative tools like word lists, concordancing or text edition navigation (close reading) with synthetic quantitative tools like factorial analysis, clustering, keywords or statistical co-occurrence analysis (distant reading). To be able to work on texts, the platform first imports the corpus sources to build a rich internal representation of texts through the following general workflow:
a) first, the "base text" of each text is established: this operation implements "digital philology" principles and consists of decoding information in the various formats of the source documents (TXM can analyze three main types of corpora: corpora of written texts, possibly as paginated editions including facsimile images; corpora of transcribed recordings, possibly time-synchronized with the audio or video source; and parallel multilingual corpora aligned at the level of a textual structure such as the sentence or the paragraph) in order to decide primarily where the text limits, the internal structure boundaries, and the words and punctuation marks of the text are. The result is represented in a pivot XML format especially designed for TXM, called "XML-TEI TXM", which extends the standard encoding recommendations of the Text Encoding Initiative consortium (TEI Consortium, 2017);
b) then, natural language processing (NLP) tools are optionally applied to the base text to automatically add linguistic information like sentence boundaries, grammatical category (pos = part of speech) and lemma of words, by e.g. TreeTagger (Schmid, 1994). As NLP tools generally don't take XML as input, the pivot representation is first converted to raw text for NLP processing, and the results are added back into the XML-TEI TXM representation;
c) finally, a specialized representation of texts is built into TXM for efficient execution of its tools (by indexing for search engines and text edition rendering).
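The exact XML-TEI TXM schema is not reproduced in this paper; the sketch below, using Python's standard ElementTree, only illustrates how word-level annotations of the kind produced in steps a) and b) can be read back from a TEI-style pivot file. The element and attribute names are simplified assumptions, not the actual TXM format.

# Reading word tokens and their linguistic annotations from a simplified,
# hypothetical TEI-style pivot document (not the actual XML-TEI TXM schema).
import xml.etree.ElementTree as ET

PIVOT = """<text>
  <s n="1">
    <w pos="DET" lemma="le">Les</w>
    <w pos="NOUN" lemma="loi">loix</w>
  </s>
</text>"""

root = ET.fromstring(PIVOT)
for sentence in root.iter("s"):
    for w in sentence.iter("w"):
        # Each <w> carries both the surface form and the NLP annotations.
        print(w.text, w.get("pos"), w.get("lemma"))

Encoding both manual markup and automatic annotations as attributes on the same word elements is what lets the analysis tools described next treat them indiscriminately.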
From the point of view of TXM, the results of the NLP tools in b) are seen as automatic annotations added to the initial XML-TEI TXM representation of texts built in a), and the XML tags of the initial XML-TEI TXM representation in a) can be seen as manual annotations added to the base text (or raw text), typically edited philologically with the help of specialized XML editors (like Oxygen XML Editor, https://www.oxygenxml.com) outside of TXM when the source is in XML format, or as automatic annotations added by TXM when converting from some other format into XML-TEI TXM. All TXM tools apply indiscriminately to all types of annotation regardless of their origin (automatic or manual). Thus, TXM implements a traditional workflow combining a "text source encoding and annotation" step with an "application of analysis tools to annotated texts" step. The text analysis tools use text annotations (for example word pos) to offer their services and produce their results (for example the concordance of all infinitive verbs). The workflow is unidirectional, and the whole of it must be passed through again completely if any annotation needs to be corrected. To add or correct annotations, the user has to edit the sources or the annotations outside of TXM. For example, word properties can be exported from the XML-TEI TXM representation, edited in a spreadsheet and inserted back into the texts before re-import (see for example this tutorial based on TXM macros: https://groupes.renater.fr/wiki/txm-users/public/tutoriel_correction_mots). This paper introduces new services developed in TXM to annotate texts directly from within the results view of specific tools, for a better integration of philological and analytic work.

2. Annotation services in TXM
The new annotation services concern both adding and correcting information, and all the annotations edited are meant for further exploitation by the usual TXM tools.

2.1. SyMoGIH annotation by concordance
The first new service, developed in partnership with the LARHRA research laboratory in history (http://larhra.ish-lyon.cnrs.fr), is based on the annotation of concordance pivots: any sequence of words composing the pivots can be annotated with any semantic category coming from the SyMoGIH historical ontology framework (Beretta, 2015; http://symogih.org/?lang=en). (Pivots can also optionally be annotated with simple keywords or with key-value pairs, managed by TXM in a local repository.) In this architecture, the SyMoGIH web platform hosts the ontology of historical facts and knowledge, and concordances provide the user interface to link identifiers of those data to text spans for further analysis. As an illustration, see in Figure 1 the annotation of the "Faculté de droit d'Aix" entity (of id CoAc13562) in unverified OCRed texts of the "Bulletin administratif de l'Instruction publique" corpus (see the Bibliothèque historique de l'éducation (BHE) project: http://www.persee.fr/collection/bhe). TXM's internal management of those annotations is equivalent to a re-import of the current pivot representation of the annotated texts. After re-import (after saving annotations), the new annotations are available for all TXM tools to work on, like any original "annotation" of the texts (internal structures and their properties, word properties, etc.).

2.2. URS annotation in text edition
The second new service is based on the manual annotation of word sequences inside text editions with elements of a Unit-Relation-Schema (URS) annotation model. URS-type annotations are designed to encode discourse entities like co-reference chains in texts (Schnedecker, Glikman, & Landragin, 2017).
In a URS model, Units (or entities) have any number of properties and can be linked together by the two other annotation types: Relations, having any number of properties (1-to-1 relation type), and Schemas, having any number of properties (1-to-n relation type). All types and properties of units, schemas and relations are definable in the annotation model before and during annotation. The types and properties are chosen by the user; they are not limited to co-reference chains.

Figure 1: TXM screenshot of a Concordance of a "Faculté de droit d'Aix" word sequence pattern to annotate (top) and of browsing SyMoGIH semantic categories to use for the annotation (bottom).

The original URS model was designed and developed in the Glozz (Widlöcher & Mathet, 2009) and Analec (Landragin, Poibeau, & Victorri, 2012) software. It is being integrated into TXM through the text edition reading tool for a project funded by the French National Research Agency (ANR) called DEMOCRAT (http://www.agence-nationale-recherche.fr/en/anr-funded-project/?tx_lwmsuivibilan_pi2%5BCODE%5D=ANR-15-CE38-0008). As an illustration, see in Figure 2 the annotation of the "ses loix" word sequence with a unit of type MENTION, of grammatical category "GN.POS" and referent "les lois de la divinité", in the first chapter of the 1755 edition of De l'esprit des lois by Montesquieu. TXM's internal management of those annotations can be represented as new XML-TEI stand-off annotations anchored to the word elements of the XML-TEI TXM representation of texts (Grobol, Landragin, & Heiden, 2017).

Figure 2: TXM screenshot of the edition of the first page of De l'esprit des lois with units of type MENTION highlighted in yellow and the selected unit in bold (top) and the current values of the properties of the selected unit (bottom).

2.3. Word properties annotation by concordance
The third service will be based on the annotation of concordance pivot words: a word present in the pivots of a concordance will be able to be annotated with properties. The primary goal of this service is to annotate and correct the grammatical properties and lemma of the word elements of the XML-TEI TXM representation of texts. This development is carried out for a project co-funded by the ANR and the Deutsche Forschungsgemeinschaft (DFG) called PaLaFra (http://www.agence-nationale-recherche.fr/en/anr-funded-project/?tx_lwmsuivibilan_pi2%5BCODE%5D=ANR-14-FRAL-0006).

2.4. Editing XML sources
Finally, we are developing the possibility to directly edit the XML sources from within TXM through an internal XML editor. This editor will eventually be accessed through TXM tools as a "back to source" operation similar to the current "back to text" operation (for example from a concordance line to a text edition page).

3. Discussion
By using a common XML-TEI pivot representation for the internal management of corpora across all the annotation services, TXM unifies transcription and annotation activities in a single framework. In this framework, annotations represent manual (user), semi-automatic (machine+user) or automatic (machine) interpretation results used further for analysis and interpretation work.
2.3. Word properties annotation by concordance
The third service will be based on the annotation of concordance pivot words: a word present in the pivots of a concordance will be able to be annotated with properties. The primary goal of this service is to annotate and correct the grammatical properties and lemmas of the word elements of the XML-TEI TXM representation of texts. This development is carried out for a project co-funded by the ANR and the Deutsche Forschungsgemeinschaft (DFG) called PaLaFra (http://www.agence-nationale-recherche.fr/en/anr-funded-project/?tx_lwmsuivibilan_pi2%5BCODE%5D=ANR-14-FRAL-0006).

2.4. Editing XML sources
Finally, we are developing the possibility to directly edit the XML sources from within TXM through an internal XML editor. This editor will eventually be accessed through TXM tools as a "back to source" operation similar to the current "back to text" operation (for example, from a concordance line to a text edition page).

3. Discussion
By using a common XML-TEI pivot representation for the internal management of corpora across all the annotation services, TXM unifies transcription and annotation activities in a single framework. In this framework, annotations represent manual (user), semi-automatic (machine+user) or automatic (machine) interpretation results, used further for analysis and interpretation work. The reflexive nature of the resulting text analysis workflow is schematized in figure 3. Texts are first digitized by OCR, transcribed or converted from digital formats. They are then philologically corrected and established through manual XML-TEI encoding, and then automatically processed by NLP tools while being imported into TXM to produce the TXM internal corpus model. Corpus analysis is then assisted by TXM tools applied to the corpus model. The pivot representation that gathers all annotations produced by annotation tools is figured as the node labeled "Pivot rep.", and the interpretation workflow itself is figured as a digital hermeneutic circle.

Figure 3: Digital hermeneutic circle integration into TXM. Legend: red box = automatic annotation activity; black box = tool; blue box = manual annotation activity; green box = TXM corpus data model; purple disk = data representation; black arrow = activity; green arrow = annotation equivalence.

4. Conclusion
All the new annotation services integrated into TXM together build a comprehensive annotation-based digital text corpora analysis platform. From an epistemological point of view, the integration of different annotation models and tools into the platform should help its users to better define what comes from the source corpus they analyze and what comes from their own or others' interpretation work.

This work was funded by the ANR and the DFG under grant numbers ANR-15-CE38-0008 (DEMOCRAT project) and ANR-14-FRAL-0006 (PaLaFra project).

References
Beretta, F. (2015). Publishing and sharing historical data on the semantic web: the SyMoGIH project – symogih.org. Presented at the Workshop: Semantic Web Applications in the Humanities. Retrieved from https://halshs.archives-ouvertes.fr/halshs-01136533
Grobol, L., Landragin, F., & Heiden, S. (2017). Interoperable annotation of (co)references in the Democrat project. Presented at the Thirteenth Joint ISO-ACL Workshop on Interoperable Semantic Annotation. Retrieved from https://hal.archives-ouvertes.fr/hal-01583527/document
Heiden, S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation (pp. 389–398). Institute for Digital Enhancement of Cognitive Development, Waseda University. Retrieved from http://halshs.archives-ouvertes.fr/halshs-00549764/en/
Landragin, F., Poibeau, T., & Victorri, B. (2012). ANALEC: a New Tool for the Dynamic Annotation of Textual Data (pp. 357–362). Presented at the International Conference on Language Resources and Evaluation (LREC 2012). Retrieved from https://halshs.archives-ouvertes.fr/halshs-00698971/document
Schmid, H. (1994). Probabilistic Part-Of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing (Vol. 12).
Schnedecker, C., Glikman, J., & Landragin, F. (2017). Les chaînes de référence : annotation, application et questions théoriques. Langue française, (195), 5–16. https://doi.org/10.3917/lf.195.0005
TEI Consortium. (2017). TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium. Retrieved from http://www.tei-c.org/Guidelines/P5
Widlöcher, A., & Mathet, Y. (2009). La plate-forme Glozz : environnement d'annotation et d'exploration de corpus. In Actes de la 16e Conférence Traitement Automatique des Langues Naturelles (TALN'09), session posters (p. 10). Senlis, France. Retrieved from https://hal.archives-ouvertes.fr/hal-01011969
Quantifying Translation: an analysis of the conditional perfect in an English-French comparable-parallel corpus
Daniel Henkel
Université Paris 8 Vincennes St-Denis – dhenkel@univ-paris8.fr

Abstract
The frequency of the conditional perfect in English and French was observed in an 8-million-word corpus consisting of four 2-million-word comparable and parallel subcorpora, tagged by POS and lemma, and analyzed using regular expressions. Intra-linguistically, the Wilcoxon-Mann-Whitney test was used to compare authors and translators. Frequencies in source and target texts were evaluated using Spearman's correlation test to identify interlinguistic influences. Overall, the past conditional in English was found to have a stronger influence in the translation process.

Résumé
La fréquence du conditionnel parfait en anglais et en français a été observée dans un corpus de 8 millions de mots comprenant quatre sous-corpus comparables et parallèles de 2 millions de mots chacun, étiquetés par catégorie grammaticale et par lemme, et analysés par expressions rationnelles (regex). Le test de Wilcoxon-Mann-Whitney a servi pour comparer les auteurs et traducteurs, tandis que la corrélation entre textes-sources et -cibles a été évaluée au moyen du coefficient de corrélation de Spearman. Globalement, l'influence du conditionnel parfait en anglais sur le processus traductionnel paraît plus sensible.

Keywords: corpus, translation, regular expressions, statistical analysis, Wilcoxon-Mann-Whitney, Spearman, conditional perfect

1. Introduction
Since Corpus-based Translation Studies (CBTS) first began to gain momentum around the turn of the 21st century, differences have consistently been shown between corpora of translated English, French and other languages in comparison with untranslated reference corpora in the same languages. The hybrid nature of translated texts is thus now widely acknowledged as an established fact among specialists in the field (albeit with some divergence of opinion as to whether such differences are best interpreted as evidence of source-language interference or as consequences of the translation process regardless of the source language, i.e. characteristics inherent in the 'third code' or 'translationese'; cf. Koppel & Ordan, 2011), so much so that any further proof might seem superfluous. These studies have focused on phenomena such as the use of 'that' to introduce subordinate clauses (Olohan & Baker, 2000), contractions (Olohan, 2003), manner-of-motion verbs (Cappelle, 2012) and existential predications (Loock & Cappelle, 2013), most often in terms of their overall frequency. (Olohan (2002) apparently subscribes to Stubbs' (2001) view that "corpus linguistics […] investigates relations between frequency and typicality, and instance and norm. It aims at a theory of the typical," while nonetheless encouraging investigation of individual translators' styles in her conclusion; the predominance of this approach is confirmed again over a decade later by Loock (2013), who observes that "many studies within the CBTS framework still solely rely on overall quantitative analyses to establish differences between original and translated languages.") Such comparisons have provided valuable insights about the languages involved and the translation process. Little consideration has been given so far, however, to the fact that each language-system consists of many individual styles or idiolects which gravitate around a common center, but individually exhibit widely differing characteristics. In other words, while the variation from one author or translator to another is inherent in the very nature of corpus linguistics, this dimension remains absent from the equation in many, if not most, corpus-based translation analyses.

2. Methods
Two important terminological distinctions must be made at the outset. The first is between ex nihilo, a.k.a. 'original', English (En0) and French (Fr0), i.e. discourse in each language produced independently of any known prior influence, as opposed to English-translated-from-French (EtrF) and French-translated-from-English (FtrE), which will be used to refer to translations into each language, based on a pre-existing work in the other language, and therefore potentially subject to inter-linguistic influences.
The second distinction is between two sorts of bilingual corpora, 'comparable' and 'parallel'. In keeping with the clarification offered by McEnery & Xiao (2007), the term 'comparable corpus' will hereafter refer to a bilingual corpus consisting of two subcorpora of ex nihilo English and French texts, which are therefore not translations of one another, but which share a certain number of common characteristics, whereas the term 'parallel corpus' will designate a corpus made up of one sub-corpus of ex nihilo works in a source language and another sub-corpus consisting of the translations of those same works into the target language.

The corpora used in this study were compiled from public-domain works available in electronic format (.epub, .mobi, .html or .txt), the translations of which were also available in electronic format via publicly available sources (primarily Project Gutenberg). Common criteria based on size and date (whenever several works by the same author were available, preference was given either to the most recent or to the one with the highest word-count; in general, date was given precedence over size, except where a major difference in word-count was found between works published within a relatively close interval) were then used to select 20 works by 20 different authors in En0 and the same number in Fr0, so as to obtain, first of all, two reference subcorpora comparable in terms of date, size, discourse type and diversity:

Table 1: Summary of characteristics for the comparable En0 and Fr0 subcorpora. Word-counts were estimated using the text editor Geany, after replacing punctuation with whitespace, given that punctuation has been found to artificially inflate word-counts in French as compared to English.
Subcorpus 1, En0 (n=20) – Word-counts: max. 199,976 (Collins, The Moonstone); min. 59,771 (Mansfield, The Garden-party); median 99,558 (Wells, The War in the Air); total 2,114,517. Dates: max. 1928 (Woolf, Orlando); min. 1868 (Collins, The Moonstone); median 1901 (Kipling, Kim).
Subcorpus 2, Fr0 (n=20) – Word-counts: max. 192,521 (Zola, Les trois villes: Paris); min. 62,539 (Rolland, Les précurseurs); median 90,873 (Leroux, La chambre jaune); total 2,083,787. Dates: max. 1921 (Leblanc, Les dents du tigre); min. 1866 (Gaboriau, L'affaire Lerouge); median 1901 (Bazin, Les Oberlé).

The translations of these works were then compiled into two subcorpora of EtrF and FtrE, so as to produce an 8m-word 'super-corpus' consisting of four 2m-word subcorpora, designed to be both comparable and parallel and thereby provide a basis for three types of comparisons:
– between En0 and Fr0, in order to establish benchmark data for each language,
– between EtrF and En0, so as to ascertain whether the linguistic indicator
under investigation, i.e. the conditional perfect, has a similar distribution in EtrF compared to En0, and likewise for FtrE in comparison with Fr0,
– between source and target texts, to determine whether correlations exist between the parallel subcorpora (i.e. EtrF~Fr0 and FtrE~En0) which could be taken as evidence of interlinguistic interference.

All of the texts were cleaned of metatext, tagged for POS and lemma in TreeTagger, and interrogated in TextSTAT using the following regular expressions to target the conditional perfect.

English (all verbs):
d) (((w|c|sh)ould)|('d)|(might)|(ought))(e?st)?/\S+( \S+/RB[RS]?/\S+)*( to/\S+)?( ((ha|')ve|of)/\S+)( \S+/RB[RS]?/\S+)* \S+/V[BHV][ND]/

French (verbs taking AVOIR as an auxiliary; verbs taking ÊTRE; reflexive constructions):
e) \S+/VER:cond/avoir( \S+/ADV/\S+)* \S+/VER:pper
f) \S+/VER:cond/être( \S+/ADV/\S+)* \S+/VER:pper/(r[eé])?(aller|(ad|de|inter|par|pro|sur)?venir|rester|demeurer|(ap|dis)?paraître|naître|mourir|décéder|arriver|partir|tomber|monter|descendre|passer|rentrer|retourner|sortir)
g) ((je/\S+( \S+/ADV/\S+)* m[e']/\S+)|(tu/\S+( \S+/ADV/\S+)* t[e']/\S+)|(nous/\S+( \S+/ADV/\S+)* nous/\S+)|(vous/\S+( \S+/ADV/\S+)* vous/\S+)|(s[e']/\S+))( en|y/\S+)* \S+/VER:cond/être( \S+/ADV/\S+)* \S+/VER:pper/

The results obtained from these queries were converted into frequencies per 1,000 words (freq./1k) for each author or translator and analyzed using the Wilcoxon-Mann-Whitney and Spearman tests as described in the following section.

3. Results and analysis
The data collected from each of the subcorpora are presented in the following tables and summarized in Fig. 1.

Table 2a: Conditional perfect frequencies in En0 (author, Cond.Pf. n, words n, freq./1k).
Buchan 139 102022 1.36
Burnett 78 84093 0.93
Collins 326 199976 1.63
ConanDoyle 108 105040 1.03
Cox 142 114352 1.24
Eliot 319 164456 1.94
Hardy 254 153076 1.66
Hope 115 83189 1.38
Joyce 26 69225 0.38
Kipling 109 107601 1.01
Lewis 58 83799 0.69
London 57 100816 0.57
Mansfield 67 59771 1.12
Reid 200 94254 2.12
Stevenson 81 70366 1.15
Stoker 127 161255 0.79
Wallace 135 101948 1.32
Wells 54 99558 0.54
Wilde 76 79412 0.96
Woolf 76 80308 0.95
max: 2.12, min: 0.38, median: 1.03

Table 2b: Conditional perfect frequencies in EtrF (translator, Cond.Pf. n, words n, freq./1k).
Tr.Barbusse 48 116179 0.41
Tr.Bazin 74 76312 0.97
Tr.Benoît 41 64301 0.64
Tr.Flaubert 125 175678 0.71
Tr.France 66 76830 0.86
Tr.Gaboriau 335 170870 1.96
Tr.Gourmont 76 69399 1.1
Tr.Hugo 104 125428 0.83
Tr.Huysmans 46 130181 0.35
Tr.Leblanc 112 128493 0.87
Tr.Leroux 127 74920 1.7
Tr.Loti 15 65837 0.23
Tr.Massenet 42 57736 0.73
Tr.Maupassant 45 76070 0.59
Tr.Mirbeau 76 101959 0.75
Tr.Proust 408 198721 2.05
Tr.Rolland 27 65872 0.41
Tr.Vanderem 80 95884 0.83
Tr.Verne 89 63760 1.4
Tr.Zola 179 205503 0.87
max: 2.05, min: 0.23, median: 0.83

Table 2c: Conditional perfect frequencies in Fr0 (author, Cond.Pf. n, words n, freq./1k).
Barbusse 47 114877 0.41
Bazin 41 78395 0.52
Benoît 33 67915 0.49
Flaubert 108 149808 0.72
France 20 71998 0.28
Gaboriau 53 120464 0.44
Gourmont 60 73000 0.82
Hugo 18 118095 0.15
Huysmans 22 132824 0.17
Leblanc 47 130277 0.36
Leroux 78 90873 0.86
Loti 15 72386 0.21
Massenet 45 76711 0.59
Maupassant 46 75598 0.61
Mirbeau 59 117035 0.5
Proust 296 170105 1.74
Rolland 11 62539 0.18
Vanderem 44 91476 0.48
Verne 50 76890 0.65
Zola 141 192521 0.73
max: 1.74, min: 0.15, median: 0.5

Table 2d: Conditional perfect frequencies in FtrE (translator, Cond.Pf. n, words n, freq./1k).
Tr.Buchan 69 105082 0.66
Tr.Burnett 74 80743 0.83
Tr.Collins 138 198988 0.69
Tr.ConanDoyle 119 117280 1.01
Tr.Cox 194 130967 1.48
Tr.Eliot 120 168125 0.71
Tr.Hardy 217 151435 1.43
Tr.Hope 99 82966 1.19
Tr.Joyce 49 72739 0.67
Tr.Kipling 68 124885 0.54
Tr.Lewis 80 96211 0.83
Tr.London 49 86378 0.57
Tr.Mansfield 82 68674 1.19
Tr.Reid 120 93025 1.29
Tr.Stevenson 64 76757 0.83
Tr.Stoker 167 176623 0.95
Tr.Wallace 97 87316 1.11
Tr.Wells 74 108529 0.68
Tr.Wilde 63 82430 0.76
Tr.Woolf 56 87475 0.64
max: 1.48, min: 0.54, median: 0.83

Fig. 1: Distributions of conditional perfect frequencies in En0, EtrF, FtrE and Fr0.

As is readily apparent from Fig. 1, the conditional perfect is used more frequently in En0 than in Fr0, which, aside from one extreme outlier (Proust), is situated below the 1st quartile of En0. EtrF and FtrE (as usual) occupy an intermediate zone, with practically identical medians (0.83) which are both inferior to Q1 in En0 and superior to Q3 in Fr0. The most striking difference is between authors in Fr0 and translators, who use the conditional perfect almost twice as often in FtrE. As a result, the entire distribution in FtrE is superior to the median for Fr0, with 75% of FtrE (Q2-Q4) in the same range as the top quartile (Q4) of Fr0. Wilcoxon-Mann-Whitney confirms that a similar disparity could hardly occur by chance (U=337, n1=n2=20, p=0.0002) and that it is therefore reasonable to infer that – notwithstanding the considerable amount of variation that can be observed from one author or translator to another – FtrE and Fr0 are clearly different with respect to their use of the conditional perfect. Between EtrF and En0, however, the difference is less obvious. Although the interquartile range for EtrF (0.63-1) is noticeably lower than in En0 (0.9-1.37), there is nonetheless a great deal of overlap between the two distributions, and Wilcoxon-Mann-Whitney (U=135, n1=n2=20, p=0.08) indicates that the risk of error is too great to say with confidence whether any substantial difference exists between EtrF and En0 in their use of the conditional perfect.

To what extent such differences may be attributed to the influence of the analogous forms in the source texts can be assessed statistically, as illustrated in Fig. 2a and 2b:

Fig. 2a: Frequency of conditional perfect forms in FtrE vs. En0 (ρ=0.47, p=0.036).
Fig. 2b: Frequency of conditional perfect forms in EtrF vs. Fr0 (ρ=0.57, p=0.009).

In both cases, Spearman's correlation test reveals a statistically significant (p<0.05) positive correlation (ρ=0.57 for EtrF/Fr0, ρ=0.47 for FtrE/En0) of moderate strength, which somewhat unexpectedly obtains a higher score for EtrF/Fr0. (Spearman's was preferred due to the presence of outliers; Pearson's r yields an almost identical result for FtrE/En0, and a somewhat stronger coefficient (r=0.67) for EtrF/Fr0, with similar p-values in both cases.)
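As an illustration of the measurement pipeline, the sketch below applies regex d) to one invented line of TreeTagger-style word/POS/lemma output, then runs the two statistical tests with SciPy. The three small frequency vectors are simply the first five per-author rows of Tables 2c, 2d and 2b, so the numbers are real but the samples are far too small to be conclusive.

    # Pocket version of the pipeline: (1) the paper's English regex (d) on
    # TreeTagger-style "word/POS/lemma" tokens, (2) Wilcoxon-Mann-Whitney on
    # two frequency samples, (3) Spearman on paired source/target values.
    import re
    from scipy.stats import mannwhitneyu, spearmanr

    COND_PERF_EN = re.compile(
        r"(((w|c|sh)ould)|('d)|(might)|(ought))(e?st)?/\S+"
        r"( \S+/RB[RS]?/\S+)*( to/\S+)?( ((ha|')ve|of)/\S+)"
        r"( \S+/RB[RS]?/\S+)* \S+/V[BHV][ND]/")

    tagged = "i/PP/i would/MD/would never/RB/never have/VH/have believed/VVN/believe it/PP/it"
    n_hits = len(COND_PERF_EN.findall(tagged))          # -> 1 in this toy sentence
    freq_per_1k = 1000.0 * n_hits / len(tagged.split())

    fr0  = [0.41, 0.52, 0.49, 0.72, 0.28]   # Barbusse ... France (Table 2c)
    ftre = [0.66, 0.83, 0.69, 1.01, 1.48]   # Tr.Buchan ... Tr.Cox (Table 2d)
    etrf = [0.41, 0.97, 0.64, 0.71, 0.86]   # Tr.Barbusse ... Tr.France (Table 2b)

    u, p_u = mannwhitneyu(fr0, ftre, alternative="two-sided")  # inter-style contrast
    rho, p_rho = spearmanr(fr0, etrf)                          # source~target correlation
    print(n_hits, round(freq_per_1k, 2), u, p_u, rho, p_rho)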
These correlations of similar strength suggest an intuitively plausible tendency to translate individual instances of the conditional perfect in one language by the analogous form in the other language, in both directions and in roughly similar proportions (although this remains to be verified by manual examination of translation segments). Such a hypothesis would help to explain why the medians and interquartile ranges observed in EtrF and FtrE occupy a middle zone between En0 and Fr0, but it does little to account for the greater disparity between FtrE and Fr0 as opposed to EtrF and En0. Other contextual parameters may well be involved, or perhaps the higher frequency of the conditional past in En0 exerts a sort of subliminal effect on translators, who then use it more freely in FtrE, with or without a syntactic counterpart in the corresponding En0 segment.

4. Conclusion
These findings demonstrate how quantitative analysis of translated parallel corpora, in comparison with untranslated comparable corpora, can be used both to identify disparities between target texts and the target language as represented in an ex nihilo corpus, and to assess the influence of the source texts on the target texts. Such relationships are often asymmetrical: in this case, the correlation between the original French conditional perfect and the translations into EtrF is stronger, while the higher frequency of conditional perfect forms in English, though less strongly correlated on a text-to-text basis, nonetheless fosters a style of French-translated-from-English which is markedly different from ex nihilo French. While the exact mechanisms involved will require further investigation, the conditional perfect in English appears to exert a stronger influence in the translation process than the corresponding form in French.

References
Hu K. (2016). Introducing corpus-based translation studies. Springer.
Koppel M. and Ordan N. (2011). Translationese and Its Dialects. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1318–1326, June 19-24, 2011.
Kruger A., Wallmach K. and Munday J. (Eds.) (2011). Corpus-based translation studies: Research and applications. Bloomsbury Publishing.
Loock R. (2013). Close encounters of the third code. In Lefer M.-A. and Vogeleer S., eds, Interference and normalization in genre-controlled multilingual corpora, Belgian Journal of Linguistics 27: 61-86.
Olohan M. (2002). Comparable corpora in translation research. In LREC Language Resources in Translation Work and Research Workshop Proceedings, pp. 5-9.
Zanettin F. (2013). Corpus methods for descriptive translation studies. Procedia - Social and Behavioral Sciences, 95, 20-32.
Hüning M. TextSTAT 2.9c © 2000/2014, Niederländische Philologie, Freie Universität Berlin, http://neon.niederlandistik.fu-berlin.de/en/textstat/
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Schmid H. TreeTagger, Universität Stuttgart, http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Extraction of lexical repetitive expressions from complete works of William Shakespeare
Daniel Devatman Hromada
Universität der Künste, Berlin, Germany – daniel at udk dash berlin dot de

Abstract
Rhetoric tradition has canonized dozens of repetition-involving figures of speech.
Our article shows how hitherto ignored repetition-involving schemata can be identified by translating so-called "entangled numbers" into back-referencing regular expressions. Each regex is subsequently exposed to all utterances in all works of William Shakespeare, allowing us to pinpoint 3667 instances of 172 distinct repetitive schemata.

Keywords: rhetoric stylometry, figures of speech, repetition, chiasm, entangled numbers, regular expressions, William Shakespeare, non-zipfian distribution

Résumé
On montre comment on peut identifier des figures de style jusqu'ici inconnues. Le but en question est atteint grâce au fait qu'on peut concevoir un certain groupe de figures de style tel un nombre ayant quelques propriétés particulières. Une fois découverts et énumérés, ces nombres peuvent être transcrits en expressions régulières qui peuvent ensuite être exposées à un corpus textuel. Dans le cas de notre étude préliminaire, il s'agissait du corpus de William Shakespeare.

Mots clés : stylométrie rhétorique, figures de style, répétition, chiasme, nombres intriqués, expressions régulières, William Shakespeare

1. Introduction
Masterpieces of literature and drama abound with repetitions. Rhetorics abounds with repetitions; successful oratories abound with repetitions. Many a schema and a figure exists which exploits repetition: e.g. a polysyndeton and an anaphora, an anadiplosis and an epistrophe, a symploce and an antanaclasis, a paronomasia and an antimetabole. And alliterations and paregmenons, and polyptotons, epizeuxes or even a good old psittacism? Many are such schemata, many are such figures. Woe to the one who thinks he knows them all!

Our article presents a way of enumerating many a new schema involving one or more repetitions of one or more lexical signifiers. The procedure starts with the theoretical insight that at least a certain subset of the set of all such schemata is easily enumerable. This insight is subsequently transcribed into an algorithm enumerating the natural numbers which satisfy the relevant properties. Once identified, these numbers are translated into Perl-Compatible Regular Expressions exploiting back-references and negative lookaheads.

1.1. Computational rhetorics and its roots
In literature studies it is fairly common to speak about so-called "rhyme schemes" like AAAA for monorhymes, ABAB for alternate rhyme, ABBA for enclosed rhymes, etc. It is therefore barely surprising that analogous formalisms – that is, formalisms that involve alphabetic indices – have been adopted by scholars aiming to formalize a subgroup of rhetoric figures, known as the group of schemes. For example, Harris and DiMarco (2009) use the following formalism: [W]a … [W]b … [W]b … [W]a to denote the rhetoric figure known as antimetabole. Subsequent studies in automatized chiasm identification pursue a similar route and often use formulae like ABXBA, ABCBA, ABCXCBA to denote schemata corresponding to utterances such as "Drake love loons. Loons love Drake.", "All as one. One as all." (Hromada, 2011) or "In prehistoric times women resembled men, and men resembled women." (Dubremetz & Nivre, 2015).
Table 1: The 14 lowest E-numbers, their corresponding alphabetic representations and some corresponding Shakespearean expressions. (Note that sometimes one single word is attributed the role of a distinct "brick", sometimes a concatenation of two or even more words assumes such a role. As will be indicated in sections two and three, this behaviour is not a bug, but an anticipated property of our method.)
11 / AA / "we split we split"
111 / AAA / "we split we split we split"
1111 / AAAA / "justice justice justice justice"
1122 / AABB / "gross gross fat fat"
1212 / ABAB / "to prayers to prayers"
1221 / ABBA / "my hearts cheerly cheerly my hearts"
11111 / AAAAA / "so so so so so"
11122 / AAABB / "great great great pompey pompey"
11212 / AABAB / "come come buy come buy"
11221 / AABBA / "high day high day freedom freedom high day"
11222 / AABBB / "o night o night alack alack alack"
12112 / ABAAB / "too vain too too vain"
12121 / ABABA / "come hither come hither come"
12122 / ABABB / "come buy come buy buy"

1.2. Entangled numbers
The set of entangled numbers (or E-numbers) is a subset of the set of natural numbers (i.e. integers). Entangled numbers are defined as "words of length n over an alphabet of size 9 that are in standard order and which have the property that every letter that appears in the word is repeated" (OEIS, 2016). Note that the term word, as used in the preceding as well as in following citations, is used in the mathematician's sense, meaning something like "a sequence of symbols": "A word is in 'standard order' if it has the property that whenever a letter i appears, the letter i-1 has already appeared in the word. This implies that all words begin with the letter 1." (Arndt et Sloane, 2016). Hence, numbers like 22 or 33 are not entangled numbers because they are not in "standard order", and numbers like "12" or "121" are not entangled because some (or all) of their digits are not repeated. The fourteen smallest (i.e. with the lowest numeric value) entangled numbers and their corresponding alphabetic transcriptions are enumerated in Table 1. Given that entangled numbers are natural numbers, they can easily be enumerated by an incremental algorithm starting at one and iterating towards infinity. Once enumerated (OEIS, 2016), we can bridge the realm of numbers with the realm of text and apply our method.
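The incremental enumeration just described can be implemented directly; the sketch below simply walks the integers and tests the two defining properties (OEIS A273978 remains the authoritative list):

    # Enumerate entangled numbers: digits in "standard order" and every
    # occurring digit repeated at least once. Sketch only.
    def is_entangled(n: int) -> bool:
        s = str(n)
        top = 0
        for ch in s:
            d = int(ch)
            if d == 0 or d > top + 1:   # digit 0 never occurs; standard order
                return False
            top = max(top, d)
        return all(s.count(ch) >= 2 for ch in set(s))   # every letter repeated

    print([n for n in range(11, 12123) if is_entangled(n)])
    # [11, 111, 1111, 1122, 1212, 1221, 11111, 11122, 11212, 11221, 11222,
    #  12112, 12121, 12122]  -- the fourteen smallest, as in Table 1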
What's more, when we combine the backreference with a negative lookahead operator – traditionally expressed by the formula (?!) - we can make sure that a so-called non-identity principle is also satisfied. That is : "Each distinct digit corresponds to distinct content" For example, by translating the E-number 121 into the regex (.{2,23}) (?!\1)(.{2,23}) \1 we can make sure that the content matched by the brick denoted by digit 2 shall be different from the content matched by the brick denoted by digit 1. Thus, a phrase "no no no" shall not be matched by such a regex while an expression "no yes no" shall. Going somewhat further, an E-number 12321 - which could be understood as an instance of chiasm or antimetabole ABXBA - is to be translated into regex (.{2,23}) (?!\1)(.{2,23}) (?!\1\2)(.{2,23}) \2 \1 whereby the disjunctive backreference contained in the negative lookahead 2 These are the only variable parameters of our method. 388 JADT’ 18 (?!\1\2) assures that the content matched brick no.3 - corresponing to filler X - shall be different from content matched by the brick representing digit 1 as well as the brick representing digit 2. 3. Corpus & Processing A digital, unicode-encoded version of Craig's edition of "Complete works of William Shakespeare" has been downloaded from a publicly available Internet source3 . This corpus contains 17 txt files stored in the sub-folder "comedies", 10 txt files stored in the sub-folder "tragedies" and 10 txt files stored in the sub-folder "historical". Texts were subsequently split into utterances by interpreting closing tags (e.g. , etc.) as utterance separator. Even more concretely, one can simply consider the slash symbol / to be the utterance separator. Only two further text-processing steps have been executed during the initialization phase of the experiment hereby presented. Primo, content of each utterance has been put into lowercase. Secundo, non-alphabetic symbols (e.g. dot, comma, exclamation mark etc.) have been replaced by blank spaces. We are aware that such replacement could potentially lead to certain amount of loss of prosody- or pathos- encoding information. However, we consider this step as legitimate because the objective of our experiment was to focus on repetition of lexical units4. Pre-processing code once executed, identification of expressions containing diverse types of lexical repetitions is as simple as matching each Shakespearean utterance with each regex. 4. Results All in all, 3667 instances of a repetitive expressions have been detected in Shakespeare's complete works. These were contained in 2295 distinct utterances and corresponded to 172 distinct schemata. Among these, 71 matched more than one instance: these schemata could thus potentially correspond to a certain cognitive pattern or a habitus in Shakespeare's mind. Table 2 contains summary information concerning 23 schemata matching at least five distinct utterances. 3 http://www.lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip 4 Regexes matching repetitions of phonotactic clusters, syllables, or phrases, are also possible. We prefer, however, not to focus on this topic within the limited scope of this conference proposal. JADT’ 18 389 Table 2: Repetitive schemata matching at least 23 distinct utterances present in collected works of William Shakespeare. 
3. Corpus & Processing
A digital, unicode-encoded version of Craig's edition of the "Complete works of William Shakespeare" has been downloaded from a publicly available Internet source (http://www.lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip). This corpus contains 17 txt files stored in the sub-folder "comedies", 10 txt files stored in the sub-folder "tragedies" and 10 txt files stored in the sub-folder "historical". Texts were subsequently split into utterances by interpreting closing tags as utterance separators; even more concretely, one can simply consider the slash symbol / to be the utterance separator. Only two further text-processing steps have been executed during the initialization phase of the experiment presented here. Primo, the content of each utterance has been put into lowercase. Secundo, non-alphabetic symbols (e.g. dot, comma, exclamation mark, etc.) have been replaced by blank spaces. We are aware that such replacement could potentially lead to a certain loss of prosody- or pathos-encoding information. However, we consider this step legitimate because the objective of our experiment was to focus on the repetition of lexical units (regexes matching repetitions of phonotactic clusters, syllables, or phrases are also possible; we prefer, however, not to focus on this topic within the limited scope of this conference proposal). Once the pre-processing code is executed, identification of expressions containing diverse types of lexical repetitions is as simple as matching each Shakespearean utterance with each regex.

4. Results
All in all, 3667 instances of repetitive expressions have been detected in Shakespeare's complete works. These were contained in 2295 distinct utterances and corresponded to 172 distinct schemata. Among these, 71 matched more than one instance: these schemata could thus potentially correspond to a certain cognitive pattern or habitus in Shakespeare's mind. Table 2 contains summary information concerning the schemata with the most matches.

Table 2: Repetitive schemata matching at least 23 distinct utterances present in the collected works of William Shakespeare (instances / E-number / example).
2332 / 11 / "bestir bestir"
525 / 1212 / "to prayers to prayers"
170 / 111 / "ha ha ha"
100 / 123123 / "cover thy head cover thy head"
48 / 12121 / "come hither come hither come"
35 / 1221 / "fond done done fond"
32 / 12341234 / "let him roar again let him roar again"
32 / 1122 / "with her with her hook on hook on"
30 / 1111 / "great great great great"
23 / 121212 / "come on come on come on"

Another phenomenon may be found noteworthy by a reader interested in the purely quantitative aspects of our research. It concerns the relation between the length of the E-number (i.e. the amount of corresponding bricks) and the number of utterances matched by such numbers. In the case of trivial repetitions, this relation seems to be plainly Zipfian. For example, Shakespeare's dramas seem to contain 2332 duplications (e.g. E=11), 170 triplications (E=111), 30 tetraplications (E=1111), 8 pentaplications (E=11111), two hexaplications (E=111111), one heptaplication (E=1111111) and zero octaplications.

Table 3: Comparison of frequencies of occurrence of schemata of a certain length (in digits).
Digits: 2, 3, 4, 5, 6, 7, 8, 9
Theoretical: 1, 1, 4, 11, 41, 162, 715, 3425
Matched: 2332, 170, 622, 91, 211, 56, 86, 67

It is worth mentioning, however, that the generic relation between the length (in digits) of an E-number and the amount of utterances it matches seems not to be Zipfian. As indicated by Table 3, an observed preference for repetitive expressions including two, four, six or eight bricks cannot be explained in terms of the number-theoretical distribution of E-numbers themselves. For example, there exist eleven E-numbers with five digits and forty-one E-numbers of length six. However, when exposed to the Shakespeare corpus, regexes generated from six-digit E-numbers seem to match 211 utterances while five-brick-long regexes match only ninety-one of them. Whether this observed asymmetry is an artefact of our method, or whether it is due to a sort of cognitive bias, a sort of preference for balanced repetitions within the Poet's mind, places us in front of a question which we do not dare to tackle here.

5. Conclusion
The insight that a certain class of repetition-based schemata can be enumerated allows us to generate myriads of hitherto unseen Perl-Compatible Regular Expressions which involve back-references and negative lookaheads (we remind the reader that PCREs are much more powerful than so-called regular grammars: regular grammars are unable to backreference, while for PCREs backreferencing is a completely legal act). In the end, such regexes have been exposed to a corpus containing the collected works of William Shakespeare. Matching all utterances with all regexes generated out of all 4360 E-numbers with less than 10 digits lasted 9555 seconds in the case of Shakespearean comedies, 6607 seconds in the case of tragedies and 6900 seconds in the case of historical dramata – all this on one single core of a 1.4 GHz CPU. This approach allowed us to pinpoint 3667 instances, in utterances matching at least one among 172 distinct repetitive schemata (see https://refused.science/rhethorics/shakespeare-regex/matches.csv, licensed under CC BY-NC-SA, for the list of all matched utterances, including information about the respective entangled numbers, theater pieces, genres (comedy / tragedy / drama) and the dramatis personae). 23 among these schemata matched at least 5 distinct utterances, and 71 among them matched at least two utterances. This may potentially point to a sort of neurolinguistic habit residing in the opaque sphere between the syntactic and lexical layers. We believe that at least some of these "figures" could be of certain interest not only for scholars trying to understand the inner intricacies of Shakespeare's genius, but also for addressing more generic topics in fields as distinct as digital humanities, computational rhetorics, discourse stylometry or even more general cognitive sciences.

References
Aho, A. V. (2014). Algorithms for finding patterns in strings. Algorithms and Complexity, 1:255.
Arndt, J., Sloane, N. J. A. (2016).
Counting words that are in "standard order". The On-Line Encyclopedia of Integer Sequences. https://oeis.org/A278984/a278984.txt.
Dubremetz, M., Nivre, J. (2015). Rhetorical figure detection: the case of chiasmus. On Computational Linguistics for Literature, page 23.
Friedl, J. E. F. (2002). Mastering regular expressions. O'Reilly Media, Inc.
Harris, R., DiMarco, Ch. (2009). Constructing a rhetorical figuration ontology. In Persuasive Technology and Digital Behaviour Intervention Symposium, pages 47–52. Citeseer.
Hromada, D. D. (2011). Initial experiments with multilingual extraction of rhetoric figures by means of PERL-compatible regular expressions. In RANLP Student Research Workshop, pages 85–90.
OEIS (2016). List of words of length n over an alphabet of size 9 that are in standard order and which have the property that every letter is repeated at least once. https://oeis.org/A273978

Spécificités des expressions spatiales et temporelles dans quatre sous-genres romanesques (policier, science-fiction, historique et littérature générale)
Olivier Kraif, Julie Sorba
Univ. Grenoble Alpes, LIDILEM – olivier.kraif@univ-grenoble-alpes.fr; julie.sorba@univ-grenoble-alpes.fr

Abstract
In this paper, we test whether classifications of phraseological units based on recurring-tree and n-gram methods are effective for distinguishing novel subgenres from one another. Our results confirm that these two methods are relevant for expressions relating to space and time in our corpora.

Résumé
Notre objectif est de tester les classifications des phraséologismes, opérées par les méthodes des ALR et des SR, dans le but de distinguer des sous-genres romanesques les uns des autres. Dans nos corpus, nos résultats confirment la pertinence de ces classifications pour les deux champs de l'espace et du temps.

Keywords: ngram, recurring trees, novel genres, phraseology

1. Introduction
Our study, which falls within the framework of exploratory analysis of textual data, concerns contemporary French novels collected within the ANR-DFG PhraseoRom project. This corpus (more than 110 million words for French) is partitioned into several subcorpora corresponding to different literary subgenres (crime fiction, science fiction, fantasy, historical novel, romance, general fiction). Our objective is to characterize these textual genres and subgenres through the specific phraseological units they contain. Following Boyer, we postulate that "each genre comprises a certain number of subsets, series founded on the reuse of identical components" (1992, p. 91). Insofar as extended phraseology is interested in everything that is "prefabricated" in lexical sequences, it constitutes a privileged point of entry for bringing these "series" to light.
For this study, we specifically retain four subgenres: science-fiction novels (SF), crime novels (POL), historical novels (HIST) and novels of so-called general literature (GEN). Text mining uses the technique of Recurring Lexico-syntactic Trees (ALR; Kraif & Diwersy, 2012; Kraif, 2016), whose validity has already been shown by the identification of specific phraseological units in scientific texts (Tutin & Kraif, 2016). We further propose to compare this extraction technique with that of repeated segments (Salem, 1987), since ALRs have shown better handling of syntactic variability when spotting routines, but sometimes prove deficient in identifying surface-frozen segments, owing to the dependency model employed. In earlier work, we showed how ALRs made it possible to spot recurring patterns built around specific expressions strongly tied to the thematic component of the subgenres in question: this was the case for "scène de crime" in POL (Kraif, Novakova & Sorba, 2016). Here, we concentrate on expressions less directly tied to the reference universes of the subgenres (crime, love, science, etc.), in order to bring out less predictable traits. This is why we chose to select the sequences – very often adverbial – linked to the expression of time and space. We will first present the results obtained in previous work (part 2), then describe our experimental methodology (part 3). Finally, we will set out and discuss our observations (part 4) before proposing conclusions and perspectives for our study (part 5).

2. Previous work
Lefer, Bestgen & Grabar (2016) rely on the extraction of n-grams of 2 to 4 words to characterize three textual genres: European parliamentary debates, press editorials and scientific articles. These authors use correspondence analysis to identify the most typical expressions and draw contrastive observations from them concerning the expression of certainty and opinion. For our part, we analyzed generic contrasts on a qualitative level by identifying ALRs in corpora of crime and science-fiction novels, on the basis of specificity measures (Kraif, Novakova & Sorba, 2016). We also used ALR extraction to automatically classify, in a supervised approach, POL, SF and GEN subcorpora (Chambre & Kraif, 2017). This preliminary work showed that ALRs gave better results than the other categories of features (punctuation, morphosyntax, lexicon), and made it possible to classify 98% of the corpus texts correctly from a selection of discriminant features. Most of these features belonged to precise lexical fields tied to the reference universes of each subgenre, such as those of the 'telephone' (le numéro de portable, passer un coup de fil, etc.) or the 'car' (à travers le pare-brise, démarrer en trombe, etc.) for POL. In addition, temporal expressions (e.g., for POL, à huit heures, vingt et une heures, au bout de X minutes) and highly varied spatial indications (e.g., for SF, par la voie, dans le territoire, dans la sphère, dans l'espace, la zone de) were brought to light.
Here we propose an extension of this experiment, on the one hand by studying spatial and temporal expressions, and on the other hand by adding the subgenre of historical novels (HIST), in order to determine whether these classes of expression are sufficient to differentiate the four subgenres (POL, SF, GEN, HIST).

3. Methodology
For each subgenre, our corpus comprises a sample of roughly 8 million words, corresponding to about 70 works by some forty authors (cf. Table 1). These works are all later than 1950, and the majority of them were first published after 2000. The classification of the works into genres was carried out a priori according to editorial criteria, based on the collections in which they were published.

Table 1: Composition of the corpus (authors / novels / size in words).
POL: 46 / 69 / 8 008 395
SF: 36 / 75 / 8 001 582
HIST: 38 / 70 / 8 015 933
GEN: 46 / 69 / 8 008 395

Figure 1: ALR representing the expression en une fraction de seconde.

To identify the phraseological expressions characteristic of the different subgenres, we use two extraction methods (a naive sketch of the second is given below):
- the ALR method: our corpora being parsed into dependencies with XIP (Aït-Mokhtar et al., 2002), these ALRs are subtrees respecting criteria of frequency (here ≥ 10 occurrences), dispersion (here ≥ 10 different authors, belonging to at least 3 different subgenres) and size (here ≥ 3 nodes and ≤ 8 nodes). Moreover, during the search for these ALRs, an association measure is computed so as to retain only the nodes significantly associated with the rest of the tree. Figure 1 shows an example of an ALR corresponding to the expression en une fraction de seconde.
- the repeated segments method (SR; Salem, 1987): we applied the same criteria of dispersion and size (≥ 3 and ≤ 8), so as to compare the two methods in fine. SRs are made up of sequences of lemmas (obtained with XIP), not of inflected forms. This latter method is simpler to implement and requires few linguistic resources, although it raises problems of combinatorial explosion (cf. part 4).
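A naive version of such a repeated-segments extractor can be sketched as follows (toy thresholds and input; a realistic corpus requires safeguards against the combinatorial explosion discussed in part 4):

    # Lemma n-grams of length 3-8, filtered by frequency and author
    # dispersion, mirroring the criteria stated above. Illustrative only.
    from collections import Counter, defaultdict

    def repeated_segments(texts, n_min=3, n_max=8, min_freq=10, min_authors=10):
        """texts: iterable of (author, [lemma, lemma, ...]) pairs."""
        freq, authors = Counter(), defaultdict(set)
        for author, lemmas in texts:
            for n in range(n_min, n_max + 1):
                for i in range(len(lemmas) - n + 1):
                    seg = tuple(lemmas[i:i + n])
                    freq[seg] += 1
                    authors[seg].add(author)
        return {seg: c for seg, c in freq.items()
                if c >= min_freq and len(authors[seg]) >= min_authors}

    corpus = [("A", "le salle de bain être grand".split()),
              ("B", "vers le salle de bain".split())]
    print(repeated_segments(corpus, min_freq=2, min_authors=2))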
In a second step, we apply keyword filtering so as to retain only the sequences linked to the two subdomains under study, namely the expression of time and of space. The keywords for space are names of places, natural spaces, geographic description, distance measures and adverbs of place, selected after a first sampling of the extracted ALRs:
- SPACE keywords: cave, salon, hôpital, immeuble, bâtiment, camp, restaurant, village, route, rue, quai, chaussée, terrasse, ministère, parc, bureau, carlingue, maison, toit, chambre, hôtel, palais, rez-de-chaussée, entrée, pont, escalier, chemin, place, salle, jardin, seuil, cour, couloir, colline, sentier, sol, rive, rivage, plage, rivière, mont, montagne, mer, océan, lac, bois, forêt, espace, endroit, coin, pays, continent, frontière, direction, cap, sud, est, nord, ouest, confins, mètre, kilomètre, année-lumière, hectare, acre, loin, proche, près de, au bord de, orée, distance.

The keywords for time designate moments of the day and of the year, units of measure and conventional divisions into periods (nouns, adverbs and adverbial locutions):
- TIME keywords: matin, soir, soirée, après-midi, nuit, jour, temps, fois, moment, instant, toujours, jamais, parfois, souvent, autrefois, jadis, tôt, tard, longtemps, brièvement, immédiatement, subitement, tout à coup, tout de suite, aujourd'hui, demain, hier, lendemain, maintenant, heure, minute, seconde, journée, semaine, mois, an, année, décennie, siècle, millénaire, printemps, été, automne, hiver.

These lists make no claim to exhaustiveness, and the filtering performed produces both silence and noise, owing to ambiguities. These nonetheless remain marginal (according to a manual sampling, noise is below 10%). To identify the sets of features relevant from the point of view of the subgenres, we feed these expressions (ALRs or SRs) into an automatic classification system. In so doing, we pursue a double objective: on the one hand, to verify that our a-priori classes are coherent and correlated with objectifiable criteria; on the other hand, to identify these criteria in the form of sets of discriminant features for classification.
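For readers outside the Weka ecosystem, this classification setup can be approximated in Python as sketched below. SMO trains a support-vector classifier, so scikit-learn's LinearSVC is used here as a stand-in, and the matrix X and labels y are random placeholders standing in for the 166 novels and their feature frequencies.

    # Rough analogue of the SVM/SMO evaluation described in part 4:
    # 10-fold cross-validated accuracy plus Kappa agreement with the
    # a-priori genre labels. All data below are fake placeholders.
    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score, cross_val_predict
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(166, 1000)).astype(float)  # ALR/SR frequencies (fake)
    y = rng.integers(0, 4, size=166)                      # 0=POL 1=SF 2=HIST 3=GEN

    clf = LinearSVC()
    accuracy = cross_val_score(clf, X, y, cv=10).mean()
    kappa = cohen_kappa_score(y, cross_val_predict(clf, X, y, cv=10))
    print(f"accuracy={accuracy:.3f}  kappa={kappa:.3f}")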
4. Results and discussion
In a first step, we extracted the 6,000 most frequent ALRs over the whole corpus. Performing a classification on these features, with an SVM model optimized by SMO (using the Weka platform, Eibe et al., 2016), we obtain, in 10-fold cross-validation, a precision of 74% (123 out of 166), with a Kappa of 0.65, which corresponds to very good agreement with the reference classification. The confusion matrix (cf. Table 2) shows that the two best-classified genres are SF (93.1%) and POL (79.5%). The genre GEN obtains the lowest precision (64%), with frequent confusions with POL and HIST; HIST, for its part, is frequently confused with GEN. Examination of the most discriminant ALRs shows, as might be expected, the strong presence of certain themes in POL, HIST and SF (the car, crime and the telephone for POL; war and religion for HIST; outer space and technological artefacts for SF) and the absence of salient features in GEN.

4.1 Selection of TIME+SPACE features
When we select the features linked to the expression of time alone (about a thousand), we obtain a degradation relative to the previous results, with an overall precision of 48.8% and a Kappa of 0.31, signifying weak agreement between the a-priori classification and the automatic one. Spatial expressions, for their part (we obtain 1,560 of them, but retained the 1,000 most frequent so as to have comparable results), obtain somewhat better results, though still worse than the unfiltered features: precision is 59.6%, with a Kappa of 0.46 corresponding to moderate agreement. When the TIME and SPACE ALRs are selected jointly, we obtain a slight improvement over the classification with SPACE alone: 61.4% (102 instances correctly classified out of 166), with a fairly good Kappa of 0.48. The confusion matrix (cf. Table 2) shows that POL obtains the best precision (69%) and GEN the worst (55.9%). If we select the most discriminant features (SfcSubsetEval attributes with the BestFirst method in Weka), we obtain a set of 54 attributes. The classifying power of these attributes on our corpus can be assessed, indicatively, by feeding them back into an SMO classification: we then obtain a very slightly higher overall precision (62%), but it is interesting to note that the marked genres POL, SF and HIST are very well classified on the basis of these features (precision of 85.7% for HIST, 84% for SF, 75.7% for POL), with a strong degradation for GEN (43.4%), as shown by the confusion matrix below (Table 2).

Table 2: Confusion matrices for the classifications with (1) all features, (2) the filtered ALRs (TIME+SPACE) and (3) the selected ALRs. Rows = reference genre; columns = SF, POL, GEN, HIST.
(1) All features (6,000 most frequent ALRs):
SF: 27 2 2 5
POL: 1 35 9 1
GEN: 1 5 32 8
HIST: 0 2 7 29
(2) TIME+SPACE (2,571 filtered features):
SF: 18 5 6 7
POL: 5 29 12 0
GEN: 3 3 33 7
HIST: 3 5 8 22
(3) TIME+SPACE, selection of 54 features:
SF: 21 2 13 0
POL: 3 28 15 0
GEN: 1 6 36 3
HIST: 0 1 19 18

Detailed examination of the 54 selected features reveals several salient points:
- generally speaking, ALRs relating to space are very largely in the majority, with 33/54 against 17/54 for time, after elimination of noise (4/54).
- considering the features specific to HIST, the spatial expressions designate above all places of power (la place forte, de son palais, salle du palais, salle du château, pénétrer dans la grande salle) and the sea (sur la mer, de la mer), while the temporal expressions refer to a long temporality (au bout de quelques mois, règne de X années, avoir le temps) and to absolute or relative datings (du Ne siècle, venir le lendemain, à trois heures de l'après-midi).
- for POL, by contrast, the temporal expressions indicate clock datings (à 8 heures, 21 heures) and short durations (une vingtaine de secondes). The spatial expressions, which are numerous, indicate rooms and interior spaces (de la salle de bain, vers la salle de bain, entrer dans le bureau, vers le bureau, dans le coin), urban places (aller à l'hôtel, passer à l'hôpital, à l'hôpital), and vague localizations (dans le coin in the sense of "in the vicinity").
- for SF, the temporal expressions are more numerous (7/18) than in the other subgenres. They refer to durations extreme in their length (milliers d'années, de mille ans) or their brevity (une fraction de seconde, un centième de seconde). For space, we find expressions of numerically specified distances (dizaines de mètres, centaine de mètres, plusieurs centaines de mètres), expected references to interstellar space (dans l'espace, à travers l'espace, être dans l'espace, voyager dans l'espace, flotter dans l'espace) and to space-time, and expressions with sol (sur le sol, sous-sol).
- for GEN: the only specific expression appearing among the selected features is chemin de traverse.

4.2 Comparison with repeated segments
We were not able to extract the totality of the SRs of 3 to 8 words for the whole corpus, because of problems of combinatorial explosion (about 40,000,000 SRs generated for 100 texts of the corpus). We therefore retained the SRs containing the keywords selected for TIME and SPACE, keeping the 1,000 most frequent SRs so as to have feature sets comparable to the filtered ALRs. Better results are obtained than for the ALRs, with a precision of 66.7% for SPACE and 58.3% for TIME, against 59.6% and 48.8% respectively.
For TIME+SPACE, a certain degradation is observed, with precision falling to 64.1%. At this stage of our observations, these quantitative results are difficult for us to interpret, since selecting the best feature set for SPACE yields more or less the same expressions as with the ALRs (lemma sequences, hence the uninflected forms): le chambre de, le cour de, à le cour, dans le espace, le salle de bain, de le espace, dans son bureau, de le immeuble, le maison et, à le hôtel de, centaine de mètre, sur le bureau, sur le place de, le palais de, dans le grand salle, de bureau de, de le salle de bain, sur son bureau, cour de France, en route pour, dans mon bureau, dans tout le direction, un dizaine de mètre, de son pays, à le rue, dans le sous-sol, quitter le salle, dans un restaurant, sur le rivage, mètre plus bas, vers le bureau, route vers le, dizaine de mètre de, un kilomètre de, à ministère de, dans le espace et, de un montagne, le espace et le. The two methods therefore converge in qualitative terms, extracting the same expressions. Nevertheless, further investigation will be needed to interpret correctly the fact that SRs obtain better quantitative results.

5. Conclusion and perspectives
This study confirms that phraseological expressions are good descriptors for subgenre classification (Chambre & Kraif, 2017). Indeed, even if the results obtained here from the subset of spatial and temporal expressions are appreciably lower than those obtained from features more directly tied to the reference universe of each subgenre (61.4% vs 98%), these less informative expressions nevertheless allow novels to be classified satisfactorily into the marked subgenres POL, SF and HIST. For the category of general novels (GEN), on the other hand, they are not discriminant. Our method also brings out generic specificities proper to the two fields SPACE and TIME (places of power in HIST vs interiors and urban places in POL; extreme durations and distances in SF). Finally, from this selection of spatio-temporal expressions, the repeated-segments method produces a more precise subgenre classification than the ALRs. This point, difficult to interpret from our first qualitative observations, requires deeper study. These results encourage us to pursue the exploration of other lexical fields at the margins of the reference universes of each subgenre, in order, on the one hand, to refine our methodology and, on the other, to target the elements at the heart of phraseology.

References
Aït-Mokhtar S., Chanod J.-P. and Roux C. (2002). Robustness beyond Shallowness: Incremental Deep Parsing. Natural Language Engineering, 8:121-144.
Boyer A.-M. (1992). La paralittérature. Presses Universitaires de France.
Chambre J. et Kraif O. (2017). Identification de traits spécifiques du roman policier et de science-fiction. Communication présentée aux Journées Internationales de la Linguistique de Corpus - JLC2017, Grenoble, 05.07.2017.
Eibe F., Hall M. A. and Witten I. H. (2016). The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition.
Kraif O., Novakova I. et Sorba J. (2016). Constructions lexico-syntaxiques spécifiques dans le roman policier et la science-fiction. Lidil, 53 : 143-159.
Kraif O. et Diwersy S. (2012).
Le Lexicoscope : un outil pour l'étude de profils combinatoires et l'extraction de constructions lexico-syntaxiques. Actes de la conférence TALN 2012, pp. 399-406.
Lefer M.-A., Bestgen Y. et Grabar N. (2016). Vers une analyse des différences interlinguistiques entre les genres textuels : étude de cas basée sur les n-grammes et l'analyse factorielle des correspondances. Actes de la conférence conjointe JEP-TALN-RECITAL 2016, pp. 555-563.
Tutin A. et Kraif O. (2016). Routines sémantico-rhétoriques dans l'écrit scientifique de sciences humaines : l'apport des arbres lexico-syntaxiques récurrents. Lidil, 53 : 119-141.
Salem A. (1987). Pratique des segments répétés. Essai de statistique textuelle. Klincksieck.

Les phrases de Marcel Proust
Cyril Labbé (1), Dominique Labbé (2)
(1) Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France (cyril.labbe@imag.fr)
(2) Univ. Grenoble Alpes, PACTE (dominique.labbe@umrpacte.fr)

Abstract
Analysis of sentence lengths in Marcel Proust's A la recherche du temps perdu. Counting standards and the various available measures are presented. For most of his reading time, the reader of this novel is confronted with very long and syntactically complex sentences. A comparison with other writers shows that these sentences are atypical but not unique, and that some of their characteristics can be observed in a number of other works, some of which are cited in the Recherche du temps perdu.

Résumé
Analyse des longueurs de phrases dans A la recherche du temps perdu de Marcel Proust. Présentation des normes de dépouillement et des différentes mesures possibles. Durant la majorité de sa lecture, le lecteur se trouve confronté à des phrases très longues et syntaxiquement complexes. Une comparaison avec un large panel d'écrivains montre qu'il s'agit d'un phénomène exceptionnel mais pas unique et que certaines caractéristiques se retrouvent dans quelques œuvres dont certaines sont citées dans la Recherche du temps perdu.

Keywords: lexicometry, stylometry, sentence length, French literature, Proust

1. Introduction
Are Marcel Proust's (1871-1922) sentences exceptional? The question has mostly been treated from a qualitative angle (notably Curtius 1970). A few quantitative estimates exist (Bureau 1976, Brunet 1981, Milly 1986), with diverging results, for reasons that will be explained at the beginning of this paper. Above all, we present a statistical comparison with other writers which will make it possible to judge how exceptional the Proustian sentence is. The analysis of sentences raises several of the problems confronting lexicometry (statistics applied to language). First of all, there is the choice of the reference edition. Indeed, for the Recherche du temps perdu this choice exists and introduces a slight uncertainty concerning the punctuation of the work (discussion in Ferré 1957 and Serça 2010), especially for the last three volumes. We have held to the general principle that the authoritative version is the last one revised by the author or, failing that, the one closest to his death: here, the original Gallimard edition (appendix 1). Moreover, this original edition imposes itself since it is in the public domain and can be freely communicated to researchers wishing to reproduce our results and take this analysis further.
Le mot et la phrase Le mot est défini comme l’occurrence d’un vocable, c’est-à-dire une entrée dans le lexique de la langue française selon la norme présentée par Muller 1963. Cette norme est fondée notamment sur la nomenclature de Hatzfeld et al. 1898. Son implémentation est décrite dans Labbé 1990. Par exemple, "aujourd’hui", "parce que" ou "Saint-Loup" sont des mots uniques et non deux "formes graphiques". Il y a 1 449 "parce que" dans la Recherche, soit plus d’un mot pour mille ; et 787 fois "Saint-Loup" (l’un des principaux personnages du roman). A l’inverse, les formes graphiques "le", "la", "les" ont deux entrées (pronom ou article) ; "du" ou "des" sont la contraction de deux entrées du lexique - préposition "de" et article "le". En fonction de la norme retenue (vocable ou formes graphiques), le nombre de mots dans un texte peut varier de près 10%. Selon cette "norme Muller", la Recherche compte 1 327 859 mots (N dans la suite) et 21 836 vocables différents. Quant à la phrase, il y a un accord général pour la définir comme l’empan de texte dont le premier mot comporte une majuscule initiale et qui se trouve compris entre deux ponctuations majeures. Les ponctuations majeures sont le point, les points d’interrogation et d’exclamation, les points de suspension. Cependant, aucun de ces 4 signes typographiques ne marque automatiquement une fin de phrase : - le point dans « M. Verdurin » ne termine pas une phrase même s’il est suivi d’un mot à majuscule initiale. Il y a dans la Recherche 3 152 « monsieur » écrits "M.". C’est le deuxième substantif le plus fréquent dans la Recherche (juste derrière "Mme"), soit 2,4 pour mille mots. Ce point "non-terminal" se retrouve dans les initiales que Proust utilise pour "anonymiser" certains noms (Mme X.) ou derrière des abréviations (etc.). - dans la Recherche, plus de trois points d’interrogation sur 10 sont internes à la phrase (721). - il y a 1 201 points d’exclamation internes à la phrase et 190 points de suspension également dans cette situation. Proust a plusieurs fois déclaré son hostilité envers ces derniers mais il les utilise parfois. Par exemple : « La duchesse émit très fort, mais sans articuler : « C’est l’... i Eon l... b... frère à 402 JADT’ 18 Robert. » (la Prisonnière). Cette rapide discussion permet de comprendre la solution adoptée : un automate détermine les fins de phrase et, en cas de doute, l’opérateur choisit : fin de phrase ou ponctuation interne ? A condition que l’opérateur suive toujours la même norme, le dépouillement est fait sans erreur et, surtout, les résultats obtenus sur un auteur sont comparables à ceux de tous les autres. Ce recensement établit le nombre de phrases de la Recherche (voir tableau en annexe). P = 37 336 phrases. Comment caractériser ces phrases en fonction de leurs longueurs ? 3. Les indices statistiques usuels. Les P phrases sont rangées par longueur croissante, dans des classes d’intervalles égaux (ici 1 mots). Par exemple, la première classe (1 mot, généralement une exclamation) contient 124 phrases, soit 0,37% du total. L’effectif de chaque classe est ainsi recensé et son poids relatif est calculé. Ce recensement fournit les informations suivantes : - Etendue de la distribution : 1 à 931 mots. La plus longue phrase est celle sur les homosexuels au début de Sodome et Gomorrhe. Les phrases de la Recherche ne sont pas réparties uniformément sur cet intervalle. 
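A sketch of the semi-automatic census described above may help: an automaton proposes sentence ends and the doubtful cases are left to the operator. This is our illustration, not the authors' program; the abbreviation list is a deliberately tiny stub and the real counting norm (Labbé 1990) is far richer.

```python
import re

# Illustrative abbreviation list; the actual norm is far richer than this stub.
ABBREVIATIONS = {"M.", "Mme.", "etc.", "cf."}
MAJOR = re.compile(r"\.\.\.|[.?!…]")  # the four "ponctuations majeures"

def candidate_sentence_ends(text):
    """Yield (offset, certain) pairs for each major punctuation mark.
    certain=False flags the doubtful cases left to the human operator,
    mirroring the semi-automatic procedure described above."""
    for m in MAJOR.finditer(text):
        words = text[:m.end()].split()
        token = words[-1] if words else m.group()
        # "M. Verdurin", "etc.", initials such as "X.": never sentence ends.
        if token in ABBREVIATIONS or re.fullmatch(r"[A-Z]\.", token):
            continue
        # A following capital makes a sentence end likely; anything else
        # (e.g. an internal "?" or "!") is submitted to the operator.
        certain = bool(re.match(r"\s+[A-ZÀ-Ý«]", text[m.end():]))
        yield m.end(), certain
```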
La seconde plus longue – celle sur les chambres au début de Combray – compte 542 mots ; la troisième (le salon des Verdurins dans la Prisonnière) : 430 ; la quatrième (l’église de Combray) : 399. Ensuite, il n’y a plus de "trou" important dans l’étalement des longueurs. - Le mode est la classe la plus peuplée, ou longueur de phrase que le lecteur a le plus de chance de rencontrer : 11 mots. Il y a donc, dans la Recherche, une prédominance des phrases courtes et syntaxiquement simples. Il en est ainsi dans la plupart des textes en français. - La médiane est la valeur de la variable pour l’individu du milieu ou individu "médian". Dans les P phrases rangées par longueurs, l’individu médian est celui qui occupe la place (P+1)/2. Lorsque l’effectif total de la population (P) est pair, la médiane est la moyenne des valeurs de la variable pour les 2 individus situés de part et d’autre. Dans un texte étendu comme la Recherche, la médiane se trouve dans une classe dont l’effectif est assez élevé. Dans ce cas, la valeur est interpolée en divisant l’intervalle de la classe où se situe l’individu médian par l’effectif de cette classe. Dans la Recherche, ce calcul aboutit à une médiane de 26,28 mots. Etant donné que la variable "longueur de phrase" ne prend que des valeurs entières, les décimales indiquent le sens de l’arrondi et la position de la borne. La longueur médiane des phrases de la Recherche est donc de 26 mots. Ou encore la moitié des phrases ont une longueur inférieure ou égale à 26 mots et l’autre moitié une longueur supérieure à 26. JADT’ 18 403 - La moyenne (N/P) : 35,57 mots. A cet indice est associée une déviation "standard" des valeurs de la variable autour de la moyenne (écart-type) : racine carrée de la variance (moyenne des carrés des écarts de chaque valeur de la variable à la moyenne arithmétique). L’écart type de la longueur des phrases de la Recherche est de 31,42 mots. La dispersion des valeurs autour de la moyenne mesurée par le coefficient de variation relative : rapport de l’écart-type à la moyenne arithmétique (ici 89%). Etant donné l’effectif considéré (37 336 phrases), si les valeurs de la variable "longueur de phrase" étaient distribuées normalement autour de la moyenne (cas d’une population homogène), ce coefficient serait d’environ 4%. Autrement dit, les observations sont extrêmement dispersées. Dans ce cas, la moyenne n’est pas représentative de la série et, en particulier, il n’est pas possible de considérer que cette moyenne se situe à peu près "au milieu" de la population. Dès que la dispersion relative approche les 50% de la moyenne, celle-ci est située dans la partie basse de l’étendue de la distribution qui est fortement asymétrique. Le profil de la distribution des longueurs de phrases dans la Recherche est donné par la figure 1 dans laquelle l’effectif relatif de chaque classe est représenté par la hauteur du bâton correspondant (histogramme). Figure 1. Histogramme de la distribution des longueurs phrases D’une part, le graphique s’interrompt à la classe 200+ mots et le bâton pour cette classe – à l’extrême-droite du graphique - correspond aux 96 phrases longues de 200 mots et plus (0,3% du total des phrases mais 2,1% de la surface du texte). Le graphique complet est encore plus étalé sur la droite, la grande masse des phrases apparaissant serrées sur la gauche… D’autre part, le bâton le plus haut correspond au mode principal (11 mots) mais l’on observe de nombreux modes secondaires (17, 20, 24, etc.) : plusieurs 404 JADT’ 18 populations sont donc mélangées. 
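These indices are straightforward to compute from the raw list of sentence lengths. A minimal sketch (ours, not the authors' code; the grouped-median interpolation is one standard reading of the procedure described above):

```python
from collections import Counter
from statistics import mean, pstdev

def central_indices(lengths):
    """Mode, interpolated median, mean, standard deviation and relative
    coefficient of variation for a list of sentence lengths (in words)."""
    P = len(lengths)
    counts = Counter(lengths)
    mode = counts.most_common(1)[0][0]
    ordered = sorted(lengths)
    med_class = ordered[(P - 1) // 2]   # class containing the median individual
    below = sum(c for v, c in counts.items() if v < med_class)
    # Grouped-data interpolation over a class of width 1 (lower bound v - 0.5).
    median = (med_class - 0.5) + (P / 2 - below) / counts[med_class]
    m, sd = mean(lengths), pstdev(lengths)
    return {"mode": mode, "median": round(median, 2), "mean": round(m, 2),
            "std": round(sd, 2), "cv_percent": round(100 * sd / m, 1)}
```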
La plupart des phénomènes sociaux présentent des caractéristiques semblables et, en premier lieu, la distribution des revenus ou des patrimoines. Dans de pareils cas, l’analyse ne se contente pas des valeurs centrales. Elle se centre sur la distribution du caractère étudié (ici la surface du texte) au sein de la population (ici les phrases).

4. L’inégal partage de la surface du texte entre les phrases
Ce renversement de perspective présente un avantage : la surface de texte correspond grosso modo à la durée de la lecture. Deux méthodes sont possibles pour l’évaluer.

4.1 Quantile et médiale
Les phrases étant classées par longueurs croissantes, la surface du texte qu’elles couvrent est découpée en masses égales (tableau 1).

Tableau 1. Partage de la surface du texte en fonction de la longueur des phrases

  Surface divisée en quantiles           Longueur (mots)   % des phrases (cumulé)
  Premier décile (10 %)                       18,58               33,8
  Deuxième décile (20 %)                      26,70               49,6
  Premier quartile (25 %)                     29,53               54,5
  Troisième décile (30 %)                     33,30               60,6
  Quatrième décile (40 %)                     41,35               70,1
  Deuxième quartile – médiale (50 %)          49,93               77,5
  Sixième décile (60 %)                       60,20               84,6
  Septième décile (70 %)                      72,93               89,7
  Dernier quartile (75 %)                     81,13               92,3
  Huitième décile (80 %)                      90,57               94,2
  Neuvième décile (90 %)                     121,00               97,8

Dans ce tableau, le premier décile est la borne supérieure de l’intervalle comprenant les phrases les plus courtes couvrant en tout 10 % de la surface du texte, et la borne inférieure du 2e décile. Il indique que les phrases de longueur inférieure ou égale à 18 mots couvrent 10 % du texte et représentent plus du tiers du total des phrases (33,8 %). Le lecteur n’y passe au mieux qu’un dixième du temps de la lecture. Or c’est au-dessus de cette longueur que l’on commence à rencontrer des phrases syntaxiquement complexes. Autrement dit, au mieux, le lecteur de la Recherche se trouve face à des phrases simples pendant un dixième de sa lecture (ou il est face à des phrases plus ou moins complexes pendant les neuf dixièmes !). A l’opposé, 2,2 % des phrases (700) comptent plus de 121 mots (9e décile). Elles couvrent également 10 % du texte, c’est-à-dire la même surface que le tiers évoqué ci-dessus. Cela signifie que le lecteur de la Recherche passe (au moins) autant de temps à lire des phrases très longues – dont la construction est nécessairement complexe – qu’il n’en consacre à la masse des phrases les plus brèves et structurellement simples. Dans cette perspective, la valeur centrale la plus caractéristique est la longueur de la phrase qu’il faut atteindre pour avoir lu la moitié du texte. Pour éviter les confusions, cette seconde médiane est appelée médiale (Ml). Elle correspond à la borne haute du cinquième décile (ou du deuxième quartile). Dans la Recherche, elle est égale à 49,93 mots, soit 50 mots. Le tableau indique que 77,5 % des phrases (près de 8 sur 10) sont inférieures à cette médiale. Autrement dit, le lecteur de la Recherche passe au moins la moitié de son temps confronté à des phrases de 50 mots et plus, ce dont la plupart des lecteurs n’ont guère l’habitude. Malgré le talent de l’écrivain, c’est évidemment cela que les lecteurs retiennent.

4.2 Mesure de l’inégalité
Deuxième méthode : un indice unique mesure l’inégale répartition de la surface du texte entre les phrases (en fonction de leurs longueurs). Deux calculs sont proposés :
- l’écart relatif entre la médiale (49,93) et la médiane (26,28), soit (Ml – Me)/Me = 0,90. Autrement dit, la médiale est de 90 % supérieure à la médiane (pour des comparaisons avec d’autres écrivains, voir l’annexe 2).
Cet écart considérable suffit à attester la prédominance des phrases longues dans la Recherche.
- le second calcul est utilisé en science économique pour étudier la distribution des revenus ou des patrimoines. Il s’agit de l’indice de Gini, qui mesure l’écart entre la situation réelle et celle qui serait observée en cas d’égale répartition du caractère (ici la surface du texte) entre les individus (les phrases) composant le livre. En cas d’équirépartition, toutes les phrases de la Recherche auraient la longueur moyenne (≈ 36 mots). Pour chaque centile, on calcule la proportion de la surface de texte couverte et l’écart par rapport à ce que serait cette surface dans l’hypothèse d’équirépartition. L’indice de Gini est la somme de ces écarts. Ici, il est égal à 55,4 %. Autrement dit, dans la Recherche, les longueurs de phrases s’écartent de plus de 55 % de ce qui serait constaté dans une population homogène.
Le « diagramme de Gini » permet de visualiser cette situation. Les phrases étant rangées par longueurs croissantes, on compte le nombre qu’il faut lire pour atteindre 1 % de la surface (premier centile), puis 2 %, etc. jusqu’à 100 %. Les valeurs observées pour chaque centile sont reportées sur la figure 2, où la diagonale représente l’hypothèse d’équirépartition. L’indice de Gini est la surface comprise entre la diagonale et la courbe. Deux auteurs contemporains, et importants pour M. Proust, sont ajoutés sur le diagramme afin d’en illustrer les propriétés.

Figure 2. Diagramme de concentration (Gini) de la surface de la Recherche sur les phrases longues, comparée à celle de J. Barbey d’Aurevilly et de A. France.

Ce diagramme permet de comprendre pourquoi la médiane ou la moyenne rendent mal compte des distributions fortement asymétriques comme les longueurs de phrase. Par exemple, les deux tiers des phrases ont des longueurs inférieures à la moyenne et pourtant ces phrases ne couvrent qu’à peine plus d’un tiers du texte (34,5 %). La figure 2 montre également que, si les phrases de la Recherche sont singulières par rapport à certains écrivains du XIXe – à commencer par A. France, qui aurait fourni le modèle de Bergotte (Levaillant 1952) –, elles semblent très proches de quelques livres comme Une vieille maîtresse (1851) de Barbey d’Aurevilly, écrivain que Proust cite à plusieurs reprises (Rogers 2000). C’est la dernière question abordée dans cette communication.

5. Singularité de Proust ?
Pour juger de cette singularité : à qui le comparer ? Et comment décider si les écarts constatés sont statistiquement significatifs ? Premièrement, il faut comparer Proust à lui-même. Un de ses ouvrages se trouve dans le domaine public : Les Plaisirs et les jours (1896), dont les valeurs centrales sont indiquées en première ligne dans le tableau 2.

Tableau 2. Caractéristiques des phrases des Plaisirs et les jours comparées à la Recherche

                          Etendue   Mode   Médiane   Moyenne   Médiale   Me/Ml   Gini
  Plaisirs et jours        1-250      7     21,30     27,87     37,16    0,754   0,542
  Recherche                1-931     11     26,28     35,57     49,93    0,900   0,554

Toutes ces valeurs sont significativement inférieures à celles observées dans la Recherche. Cependant, l’indice de Gini indique que le jeune Proust avait déjà tendance à concentrer une proportion importante du texte dans les phrases longues. Deuxièmement, il faut comparer Proust aux auteurs qu’il cite explicitement ou par allusion, non seulement dans la Recherche (Nathan 1968) mais aussi dans ses autres œuvres et dans sa correspondance (Chantal 1967).
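The comparisons of this section (Tableau 2 above, Annexe 2 in appendix) rest on one fixed battery of indicators per work. A self-contained sketch, with two simplifications flagged in the docstring; this is our illustration, not the authors' program:

```python
from collections import Counter

def author_profile(lengths):
    """Columns of Tableau 2 and Annexe 2 for one work, computed from its
    list of sentence lengths. Two simplifications: the plain median is
    used instead of the interpolated one, and the Gini index is the
    standard trapezoidal (per-sentence) version of the per-centile sum
    described above."""
    ordered = sorted(lengths)
    n, total = len(ordered), sum(ordered)
    mode = Counter(ordered).most_common(1)[0][0]
    mediane = ordered[(n - 1) // 2]
    cum, mediale, area = 0, None, 0.0
    for L in ordered:
        prev = cum / total
        cum += L
        if mediale is None and cum >= total / 2:
            mediale = L                          # "seconde médiane" (section 4.1)
        area += (prev + cum / total) / (2 * n)   # slice under the Lorenz curve
    return {"etendue": (ordered[0], ordered[-1]), "mode": mode,
            "mediane": mediane, "moyenne": round(total / n, 2),
            "mediale": mediale,
            "me_ml": round((mediale - mediane) / mediane, 3),  # column "Me/Ml"
            "gini": round(1 - 2 * area, 3)}
```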
Dans la Recherche, Racine et Mme de Sévigné sont les plus cités, puis en seconde position : Balzac et Saint-Simon ; en troisième : Chateaubriand, Hugo, Molière, Musset, Sand et Vigny. La singularité des phrases théâtrales (Labbé & Labbé 2010) ne permet pas de comparer la Recherche (qui est un roman) avec les pièces produites par Molière, Hugo, Musset, Racine ou Vigny. Enfin, il faut le comparer aux autres romanciers contemporains : ont été ajoutés les principaux écrivains du XIXe et du début du XXe - comme Bourget, Giraudoux, Flaubert, Maupassant, Zola – et quelques auteurs moins connus mais singulièrement proches de Proust. L’annexe 2 présente un échantillon des résultats. Chaque écrivain est singulier et parfois les indices peuvent varier selon ses oeuvres. La Recherche se situe dans la partie haute pour tous les indices et notamment pour la propension à concentrer une proportion importante du texte dans les phrases les plus longues (Gini). Cependant, on observe des caractéristiques supérieures à celle de Proust dans quelques œuvres - Huysmans (A rebours), les frères Goncourt (Mme Gervaisais) - ou proches dans Barbey d’Aurevilly, mais aussi dans les Lettres de Mme de Sévigné ou les Mémoires de SaintSimon. 6. Conclusions Lorsque, dans une population – ici les phrases d’un texte -, un caractère (la surface de ce texte) est très inégalement réparti, la moyenne et la dispersion standard sont de peu d’utilité. L’indice statistique le plus éclairant est la seconde médiane ou médiale. Pour mesurer le degré de dispersion de la série autour de cette valeur centrale, de nombreux indices sont concevables, notamment les rapports entre quantiles extrêmes. Cependant, le rapport entre médiane et médiale, ou l’indice de Gini paraissent les plus aptes à donner une indication de la concentration du caractère sur une proportion 408 JADT’ 18 plus ou moins restreinte de la population totale. Ces indices montrent que, durant la majorité du temps, le lecteur de la Recherche se trouve confronté à des phrases très longues (50 mots et plus) et syntaxiquement complexes. Ils confirment que M. Proust a une propension à concentrer une proportion importante du récit dans les phrases les plus longues. Ces conclusions ont été acquises grâce à un dépouillement rigoureux, à des indices statistiques adaptés et à une vaste base de textes traités selon les mêmes procédures. A ce prix, la statistique lexicale peut être une auxiliaire utile de l’analyse littéraire. Enfin, dans une œuvre littéraire, il n’existe pas un type de phrase unique mais plusieurs qui ont chacun leurs particularités lexicales et stylistiques (Monière et al. 2008 ; Labbé & Labbé 2010). Une prochaine publication présentera ces types de phrases avec leurs singularités lexicales, stylistiques et thématiques. Elle répondra aussi à une question pendante : comment déterminer que les écarts entre œuvres et auteurs sont ou non significatifs ? References Brunet E. (1981). La phrase de Proust. Longueur et rythme. Travaux du cercle linguistique de Nice, p. 97-117. Bureau C. (1976). Marcel Proust ou le temps retrouvé par la phrase. Linguistique fonctionnelle et stylistique objective. Paris : PUF, p. 178-231. Curtius E.-R. (1971). Etude de lilas. Le rythme des phrases. In Tadié J.-Y. (dir.). Lectures de Proust. Paris : A. Colin. Milly J. (1975). La phrase de Proust. Des phrases de Bergotte aux phrases de Vinteuil. Paris : Larousse. Ferré A. (1957). La ponctuation de M. Proust. Bulletin de la Société des Amis de Marcel Proust, 7, p 171-192. 
Hatzfeld A., Darmesteter A. et Thomas A. (1898). Dictionnaire général de la langue française du commencement du XVIIe siècle jusqu’à nos jours. Paris : Delagrave.
Labbé C. et Labbé D. (2010). Ce que disent leurs phrases. In Bolasco S., Chiari I. et Giuliano L. (eds), Proceedings of the 10th International Conference Statistical Analysis of Textual Data. Rome : Edizioni Universitarie di Lettere Economia Diritto, vol. 1, p. 297-307.
Labbé D. (1990). Normes de saisie et de dépouillement des textes politiques. Grenoble : Cahiers du CERAT.
Levaillant J. (1952). Note sur le personnage de Bergotte. Revue des sciences humaines, janvier-mars 1952, p. 33-48.
Milly J. (1986). La longueur des phrases dans "Combray". Paris-Genève : Champion-Slatkine.
Monière D., Labbé C. et Labbé D. (2008). Les styles discursifs des premiers ministres québécois de Jean Lesage à Jean Charest. Canadian Journal of Political Science / Revue canadienne de science politique, 41:1, p. 43-69.
Muller C. (1963). Le mot, unité de texte et unité de lexique en statistique lexicologique. Langue française et linguistique quantitative. Genève-Paris : Slatkine-Champion, 1979, p. 125-143.
Nathan J. (1969). Citations, références et allusions de Marcel Proust dans A la recherche du temps perdu. Paris : Nizet (première édition : 1953).
Rogers B. (2000). Proust et Barbey d’Aurevilly. Le dessous des cartes. Paris : Champion.
Serça I. (2010). Les coutures apparentes de la Recherche. Proust et la ponctuation. Paris : Champion.

Annexe 1. Corpus : A la Recherche du temps perdu (Marcel Proust, Paris, Gallimard, 1919-1927)

  Livre                                          Longueur    Vocabulaire   N phrases
  Combray                                          79 906        6 502        1 727
  Un amour de Swann                                84 142        5 859        2 226
  Noms de pays : le nom                            19 434        2 823          374
  Du côté de chez Swann (1919)                    183 482        9 347        4 327
  Autour de Mme Swann                              91 451        6 532        2 511
  Noms de pays : le pays                          134 192        8 283        3 334
  A l’ombre des jeunes filles en fleur (1919)     225 643       10 396        5 845
  Le côté de Guermantes 1                          75 494        6 281        1 903
  Le côté de Guermantes 2, chapitre 1              84 354        6 368        2 781
  Le côté de Guermantes 2, chapitre 2              89 727        6 707        2 700
  Le côté de Guermantes (1920-21)                 249 575        6 707        7 384
  Sodome et Gomorrhe                               13 512        2 476          271
  Sodome et Gomorrhe 2, chapitre 1                 30 699        3 779        2 082
  Sodome et Gomorrhe 2, chapitre 2                117 774        7 822        3 056
  Sodome et Gomorrhe 2, chapitre 3                 57 603        5 311        1 811
  Sodome et Gomorrhe 2, chapitre 4                  8 137        1 373          250
  Sodome et Gomorrhe (1921-22)                    227 725       10 972        7 470
  La prisonnière (1923)                           173 409        9 062        5 124
  La fugitive (1925)                              115 866        6 456        3 255
  Le temps retrouvé (1927)                        152 159        8 708        3 931
  Dernier volume (posthume)                       441 434       13 518       12 310
  Total général (A la recherche du temps perdu) 1 327 859       21 837       37 336

Annexe 2. Longueur des phrases chez quelques écrivains antérieurs ou contemporains de Proust

                             Etendue  Mode  Médiane  Moyenne  Médiale  Me/Ml  Gini
  Recherche                      931    11    26,28    35,57    49,93  0,900  0,554
  Balzac                         391    10    17,27    21,88    29,00  0,680  0,511
  Barbey d’A. (Chevalier)        192     7    21,92    29,4     43,00  0,964  0,557
  Barrès                         195     8    17,86    21,94    28,59  0,601  0,497
  Bourget                        201     7    16,62    21,34    29,58  0,780  0,539
  Chateaubriand (Mémoires)       195    22    24,46    28,5     34,28  0,401  0,437
  Daudet                         203     5    13,14    17,84    25,26  0,923  0,549
  Dumas                          243     7    14,90    20,28    29,00  0,947  0,567
  Flaubert                       231     7    13,75    18,37    25,24  0,837  0,528
  France                         394     8    15,79    19,98    26,06  0,651  0,504
  Gautier*                       282    18    27,11    33,07    41,90  0,546  0,493
  Giraudoux*                     466     4    18,60    25,77    37,76  1,031  0,580
  Goncourt (Gervaisais)          670     8    24,17    34,05    51,47  1,130  0,597
  Goncourt (Journal)             373     3    19,80    25,37    37,62  0,900  0,580
  Hugo*                          828     6    11,39    16,89    23,68  1,079  0,561
  Huysmans (A rebours)           254    28    44,24    51,49    65,82  0,488  0,557
  Maupassant*                    168     6    14,44    18,98    26,39  0,828  0,542
  Musset*                        197    16    19,56    23,82    29,57  0,512  0,485
  Nerval*                        136    12    19,93    24,21    31,27  0,569  0,499
  Saint-Simon                    361    18    27,89    34,15    44,14  0,523  0,506
  Sand (Champi)                  117    21    22,11    26,19    32,56  0,473  0,477
  Sévigné (Lettres)              307    11    25,72    31,99    40,96  0,593  0,490
  Stendhal                       235    18    20,18    23,92    29,79  0,477  0,463
  Vigny*                         315    17    20,82    27,47    37,41  0,797  0,538
  Zola                           153     8    15,80    19,91    25,66  0,624  0,491
  * Uniquement les romans

Verso un dizionario corpus-based del lessico dei beni culturali: procedure di estrazione del lemmario
Ludovica Lanini1, María Carlota Nicolás Martínez2
1 Università degli Studi di Roma La Sapienza – ludovica.lanini@uniroma1.it
2 Università degli Studi di Firenze – cnicolas@unifi.it

Abstract
The vocabulary of Italian cultural heritage has become a crucial object of interest for different categories of users from a number of countries. However, there are no satisfactory multilingual lexical resources available. The present work moves in that direction. The aim of the paper is twofold: on the one hand, it describes the LBC database, a resource for developing a multilingual electronic dictionary of cultural heritage terms, made up of comparable corpora from nine languages; on the other hand, a corpus-based method for building a comprehensive headword list is proposed.

Keywords: electronic lexicography, multilingual lexical resources, corpus linguistics

1. Introduzione
Di fronte a un interesse crescente, a livello internazionale, per il lessico italiano dei beni culturali, emerge oggi l’esigenza, da parte di diverse categorie di utenti, di risorse elettroniche multilingui relative al patrimonio culturale; nonostante ciò, allo stato attuale, non sono disponibili strumenti multilingui adeguati. Il progetto LBC (Lessico dei Beni Culturali) si propone di affrontare il problema, sviluppando una banca dati testuale comprendente corpora specialistici e comparabili per nove lingue (cinese, francese, inglese, italiano, portoghese, russo, spagnolo, tedesco, turco). Fine ultimo è la creazione di un dizionario multilingue del lessico dei beni culturali a base testuale, che abbia come principali utenti studiosi del settore, ma anche traduttori e operatori turistici. L’approccio corpus-based viene applicato sin dal processo di definizione del lemmario, focus specifico del contributo.

2. La Banca dati LBC
La Bd-LBC (Banca dati LBC) è un database testuale multilingue progettato per essere rappresentativo del lessico dei beni culturali: per il suo disegno si è considerato l’italiano quale punto di partenza, ma si è pensato anche al valore aggiunto derivante dalla possibilità di stabilire relazioni tra le diverse lingue.
L’italiano viene scelto come punto di riferimento in virtù della sua 412 JADT’ 18 centralità nello sviluppo storico del lessico dei beni culturali; molti testi non italiani relativi a tale dominio hanno inoltre lo sguardo rivolto proprio verso le tecniche e i monumenti realizzati in Italia. La prima fase di lavoro, dedicata alla raccolta dei materiali, è partita dunque dai testi italiani che sono alla base della storia dell’arte e dalle relative traduzioni, ma anche da opere in altre lingue, applicando una metodologia di studio che facesse leva sulle potenziali sinergie plurilingui. Per dare fondamento alla struttura del corpus (Cresti et Panunzi 2013:57), la rappresentatività della risorsa è stata definita fin dall’inizio attraverso dei criteri di campionamento dei testi (Billero et Nicolás 2017: 208): «la rilevanza storico-culturale dell’opera dell’ambito specifico di studio (ad es. testi di Vitruvio o Leonardo); la diffusione internazionale di un’opera relazionata con l’ambito di studio (es. libri di Vasari); il prestigio dato a livello internazionale al patrimonio italiano da parte di un’opera (es. testi di Stendhal o Ruskin); la specificità dell’argomento in rapporto alla storia dell’arte italiana ed in particolare della Toscana (es. Burckhardt) ». Si è in questo modo delimitato un nucleo di testi di base condivisi tra lingue, tale da rendere il corpus parzialmente parallelo, cui si sono aggiunti via via testi peculiari per ogni lingua. La progettazione del database ha previsto inoltre una macrostruttura omogenea per i diversi corpora, che condividono i metadati associati a ogni testo, a partire dai quali viene generato automaticamente un nome di file univoco. Per quanto riguarda la microstruttura, la regola fondamentale è stata quella di rispettare il testo originale, mantenendo eventuali note, divisione in capitoli e tratti ortografici arcaici. Seguendo tali regole strutturali, ogni squadra di lavoro, specificamente rivolta a una delle lingue, ha avviato lo sviluppo dei singoli corpora (Corpus LBC-francese, Corpus LBC-inglese, etc.), sottoposti a un’operazione di validazione della digitalizzazione da parte di professori e studenti competenti nelle diverse lingue. La banca dati, così disegnata, presenta un’omogeneità in grado di favorire il lavoro lessicografico: la forte coesione strutturale tra corpora permette infatti di operare davvero in parallelo. Tra gli obiettivi del progetto vi è anche quello di implementare strumenti informatici di gestione e interrogazione dei corpora, che consentano ai membri del gruppo di effettuare ricerche ed estrarre dati sull’uso lessicale, fondamentali per lo svolgimento del lavoro lessicografico. Si è dunque realizzato un software online, per ora accessibile ai soli membri dell’unità di ricerca, ma in prospettiva disponibile anche per gli utenti, che consenta la consultazione dei corpora, sia in chiave monolingue che multilingue. Nella ricerca di soluzioni per l’implementazione di un’installazione del corpus su apposito server Internet, si è optato per l’ultima release di NoSketchEngine, versione open source di Sketch Engine. JADT’ 18 413 3. Il dizionario LBC: processo di definizione del lemmario La banca dati, così elaborata, si pone quale risorsa di base per lo sviluppo di un dizionario elettronico multilingue del lessico dei beni culturali, che possa risultare strumento utile soprattutto in ambito traduttivo e turistico. 
In vista della particolare utenza e applicazione, l’intento è quello di fornire una risorsa lessicografica che presenti le seguenti caratteristiche: - trattamento dei lemmi più “problematici” del dominio, con inclusione a lemma di nomi propri ed espressioni multiparola, categorie lessicali generalmente assenti dalle risorse, tuttavia di particolare rilevanza in virtù delle difficoltà traduttive e del forte carico culturale; - attenzione per l’aspetto più prettamente pratico e referenziale del lessico della cultura, con apertura a quelle voci di arti e mestieri tradizionalmente trascurate dalla lessicografia italiana, nonché interesse rivolto alle persone, alle opere e ai luoghi fisici della storia culturale, più che al carattere teorico e mentale (Harris, 2003) ed estetico generale (De Mauro, 1971) che ha a lungo connotato il lessico artistico, in particolare quello della critica d’arte; - inclusione non solo di nomi, ma anche di verbi, di norma esclusi dalle risorse terminologiche, qui ritenuti di interesse per rendere conto di tecniche e pratiche; - impianto corpus-based, non solo per la selezione, descrizione e traduzione dei lemmi, con individuazione degli equivalenti a partire dall’analisi di concordanze bilingui, ma anche per l’offerta all’utente, entro la scheda lessicografica, di esempi e citazioni testuali reali. L’approccio corpus-based viene adottato sin dal processo di definizione del lemmario, sviluppato a partire dal corpus LBC-italiano. Il metodo proposto prevede la combinazione di tre ordini eterogenei di dati: dato lessicografico; dato testuale quantitativo; dato testuale qualitativo. Il dato di origine lessicografica, assunto sullo sfondo a frame di riferimento, viene dunque incrociato con il dato testuale, tanto di livello quantitativo keyword e liste di frequenza- quanto di livello qualitativo -prodotto di ricerche mirate su corpus e di osservazione dei contesti. Per quanto riguarda le risorse adottate, la fonte lessicografica scelta è il Grande Dizionario Italiano dell’Uso (De Mauro, 2007), la più estesa risorsa lessicografica esistente per la lingua italiana, mentre alla banca dati LBC viene affiancato, quale corpus generale di riferimento, il corpus Paisà (www.corpusitaliano.it), costruito nel 2010 tramite web-crawling e raccolta mirata di documenti da specifici siti web, per un totale di 250 milioni di token, inteso come rappresentativo della lingua e cultura comune contemporanea (Lyding et al., 2014). Indirettamente, viene assunto come corpus di riferimento anche itTenten16, il corpus per la lingua italiana implementato in Sketch Engine, interamente raccolto tramite web-crawling nel 2016 414 JADT’ 18 (5.864.495.700 token). Riguardo agli strumenti impiegati, l’adozione di un software di corpus management e query all’avanguardia come Sketch Engine (www.sketchengine.co.uk) risulta infatti cruciale per il processo di lavoro, descritto di seguito nel dettaglio. 3.1 Fasi di lavoro La prima operazione è consistita nell’estrazione dal corpus LBC di una lista di parole chiave (2000), applicando la funzione keywords di Sketch Engine: le keyword vengono ordinate in base al keyness score, dato dal rapporto tra la frequenza normalizzata della parola nel focus corpus (LBC) e la sua frequenza normalizzata in un corpus generale (itTenten16), previa applicazione di una costante, denominata simple math parameter1 (Kilgariff et al., 2014). 
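The keyness score just described reduces to one smoothed ratio of normalised frequencies. A minimal sketch, in which focus_fpm and ref_fpm are assumed to be dictionaries of frequencies per million tokens (our names, not Sketch Engine's API):

```python
def keyness(word, focus_fpm, ref_fpm, k=100):
    """'Simple maths' keyness score (Kilgariff et al., 2014, cited above):
    ratio of the word's normalised frequency (per million tokens) in the
    focus corpus (here LBC) and in the reference corpus (itTenten16),
    smoothed by the constant k; k=100 is the value the authors apply."""
    return (focus_fpm.get(word, 0.0) + k) / (ref_fpm.get(word, 0.0) + k)

# Ranking the vocabulary by decreasing keyness yields the kind of
# 2000-item candidate list returned by Sketch Engine's keywords function:
# top = sorted(vocab, key=lambda w: keyness(w, lbc_fpm, ittenten_fpm),
#              reverse=True)[:2000]
```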
Alla lista delle keyword è stata affiancata la lista di matrice lessicografica, estratta dal Gradit selezionando l’insieme dei lemmi etichettati con marca [TS] (tecnico-specifico) per arte, pittura, scultura e architettura, per un totale di 2515 lemmi, di cui molti (370) multiparola. In maniera inattesa, dal confronto tra le due liste emergono solo 24 coincidenze. Risultando poco pulita, la lista delle keyword è stata sottoposta a uno spoglio manuale, che ha ridotto i 2000 lemmi candidati a 219, primo vero lemmario di base (comprendente nomi propri come Mantegna, arcaismi come fregiatura, tecnicismi come nicchia). Si è proceduto a questo punto a una serie di confronti, a partire dalla lista di frequenza lemmatizzata del corpus LBC, come sintetizzato in Tabella 1. L’incrocio con la lista del Gradit ha restituito 272 lemmi comuni, di cui 235 sono stati accolti previo controllo. Il lavoro di confronto con il corpus generale Paisà ha seguito invece due linee di sviluppo: lo studio dei lemmi caratterizzati da più alta differenza di frequenza relativa con peso maggiore in LBC (i primi 600), da cui sono emersi 77 lemmi di interesse (figura, Firenze, Raffaello) e lo spoglio dei lemmi presenti in LBC ma non in Paisà, che ha permesso di individuarne 62 (tecnicismi come scalea e imbasamento, numerosi arcaismi e varianti arcaiche come scarpellino, Florenzia, Buonarruoto). L’insieme delle voci della lista Gradit assenti in LBC (ben 2243) è stato inoltre sottoposto a un esame puntuale, che ha portato ad aggiungere al lemmario 1629 lemmi2. Il corpus LBC è in effetti in fase di sviluppo, per cui molte aree di interesse (per esempio il dominio dell’arte contemporanea) non risultano ancora adeguatamente rappresentate: la lista del Gradit può offrire in questa direzione materiali utili, in attesa dell’ampliamento del corpus. Dalla convergenza dei lemmi accolti è stato così possibile arrivare alla definizione di un primo lemmario, per un totale di 2147 lemmi.

1 A seconda dei bisogni dell’utente e della natura dei corpora, la costante può essere modificata per restituire una lista con candidati a frequenza maggiore o minore, con 100 come valore consigliato per ottenere parole del vocabolario core e rumore minimo, qui applicato.
2 Non si sono accolti: lemmi astratti, propri della critica d’arte (asemanticità); lemmi riferiti a movimenti e tendenze generali (astrattismo); aggettivi o avverbi. Si

Tabella 1

  Risorse (lemmi)                            Lemmi di interesse                                         Lemmi estratti   Lemmi accolti
  Lista LBC (8388) / Lista Gradit (2515)     Lemmi comuni                                                    272              235
  Lista LBC (8388) / Lista Paisà (1032178)   Lemmi con differenza di frequenza relativa significativa        600               77
  Lista LBC (8388) / Lista Paisà (1032178)   Lemmi presenti in LBC assenti in Paisà                         1139               62
  Lista keywords                                                                                            2000              219
  Lista Gradit (assenti in LBC)                                                                             2243             1629
  TOT. 2222 (-75 lemmi ripetuti) = 2147

[…]

… (W, H ≥ 0) (Lee and Seung, 1999; Berry et al., 2007; after Paatero & Tapper, 1994; see also Gaujoux, 2010). In the topic modeling context, the main output of NMF is a set of topics characterized by lists of words (software ‘scikit-learn’ [Python] by Grisel O., Buitinck L., Yau C.K.; in: Pedregosa et al., 2011). LDA (Latent Dirichlet Allocation) (Blei et al., 2003; Griffiths et al., 2007) is a generative statistical model (involving unobserved topics, words and documents) devised to uncover the underlying semantic structure of a collection of texts (documents, supposed to be a mixture of a small number of topics). The method is based on a hierarchical Bayesian analysis of the texts (package R: ‘topicmodels’, and software ‘scikit-learn’ [Python]).
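Both methods being available in scikit-learn, as stated above, the shape of such a run can be sketched as follows. Apart from the topic counts and the word-frequency threshold of 2 used in the study, every parameter is an illustrative default rather than the authors' setting:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

def nmf_lda_topics(docs, n_topics=10, n_words=15):
    """Fit NMF (on tf-idf weights) and LDA (on raw counts) and return,
    for each method, the n_words highest-weighted words per topic."""
    tfidf = TfidfVectorizer(min_df=2)          # word-frequency threshold: 2
    nmf = NMF(n_components=n_topics,
              random_state=0).fit(tfidf.fit_transform(docs))
    counts = CountVectorizer(min_df=2)
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=0).fit(counts.fit_transform(docs))
    topics = {}
    for name, model, vec in (("NMF", nmf, tfidf), ("LDA", lda, counts)):
        vocab = vec.get_feature_names_out()
        topics[name] = [[vocab[i] for i in comp.argsort()[::-1][:n_words]]
                        for comp in model.components_]
    return topics
```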
At this stage, we have limited our investigation to six techniques out of a great number of approaches likely to identify topics. Among these approaches let us mention the direct use of CA without fragmentation of the texts, the techniques of clustering (used in FCA and LOA) which contain many more methods and variants, the already mentioned Alceste methodology (Reinert, 1986). The present piece of research evidently needs to be extended. In fact, each method involves also a series of parameters (threshold of frequency for the words; preprocessing options such as lemmatization/stop words; size of fragments or context units, number of iterations). The following experiment limited to six methods will be tersely summarized. A thorough investigation would need many more pages. 4. Excerpts from the list of 49 topics (limited to two topics per method) The number of topics detected by each of the six selected methods varies between six and ten. Only two topics are printed below for each method. 4.1 Rotated Factor Analysis (Rotation Oblimin). (2 topics out of 6) RFA1 eyes see bright lies best form say days RFA2 beauty false old face black now truth seem 4.2 FCA (Fragmented Correspondence Analysis) (2 topics out of 7) FCA1 beauty truth muse age youth praise old eyes glass long seen lies false time days FCA2 night day bright see look sight 4.3 Logarithmic Analysis (Spectral mapping) (2 topics out of 8) LOA1 summer away youth sweet state hand seen age rich beauty time hold nature death LOA2 pen decay men live earth verse muse once life hours make give gentle death 4.4 Latent Semantic Analysis (2 topics out of 8) LSA1 time heart beauty more one eyes eye now myself art still sweet world LSA2 end grace leave words lie spirit change shame self could ever decay write 4.5 NMF topics (2 topics out of 10) NMF0: love true new hate sweet dear say prove lest things best like ill let know fair soul NMF1: beauty fair praise art eyes old days truth sweet false summer nature brow black live 4.6 Latent Dirichlet Allocation LDA (2 topics out of 10) LDA0 summer worse praise nature making time like increase flower let copy JADT’ 18 441 rich year die LDA1 sing sweets summer hear love music eyes bear single confounds prove shade eternal. 5. A synthesis of produced topics How to compare the complete lists of topics, since neither the order of topics, nor the order of words within a topic are meaningful? We deal here with real ‘bags of words’ exemplified by the excerpts of lines in section 4. We will add the eight a priori themes defined in table 1. Each a priori theme corresponds to a subset of sonnets. That subset will be described by its characteristics words. We can then perform a clustering of these 57 topics/themes (49 + 8). The technique of additive trees (Sattath and Tversky, 1977; Huson and Bryant, 2006) seems to be the most powerful tool for synthesizing in compact form these 57 topics/themes (figure 2). Let us recall one important property of additive trees: the real distance between two points can be read directly on the tree as the shortest path between the two points. Ideally, we expect to find a tree with as many branches as there are real topics in the corpus, each branch of the additive tree being characterized by seven labels: six labels corresponding to the six methods briefly described above, plus one label corresponding to one a priori theme. Such situation occurs when each method has uncovered the same real topics. 
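The distance underlying this synthesis, per the caption of Figure 2 below, is the chi-square distance between the lexical profiles of the topics/themes. A sketch, assuming a topics-by-words count matrix; the additive tree itself is then computed with SplitsTree and is not reproduced here:

```python
import numpy as np

def chi2_distances(F):
    """Chi-square distances between lexical profiles. F is the
    topics x words matrix of word counts (57 rows in the present case);
    the resulting distance matrix is what an additive-tree program is
    then fed with."""
    F = np.asarray(F, dtype=float)
    profiles = F / F.sum(axis=1, keepdims=True)   # row (lexical) profiles
    masses = F.sum(axis=0) / F.sum()              # column masses f.j / f..
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / masses).sum(axis=2))
```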
The observed configuration is not that good, but we can distinguish between six and nine main branches, which is probably the order of magnitude of the number of different topics. We note also that several different methods often participate in the same branch, which suggests that the branch corresponds to a real topic discovered by almost all six methods. Let us mention that a similar additive tree performed on the 49 topics (not involving the eight a priori themes) produces approximately the same branches. Thus, the eight a priori themes can be considered here as illustrative elements, serving only as potential identifiers of the branches. It is remarkable that the eight a priori themes (boxed labels) are well distributed over the whole of Figure 2. If we except the branch of the tree located in the upper right part of the display, to the right of the label “Young man”, all the main branches have as a counterpart one of the a priori themes. As an example of interpretation, the branch in the lower center part of Figure 2, [NMF7, LOA4, RFA3, LDA7, LSA5], is clearly linked to the a priori topic named Rivalry (see section 2.2), with a concurrence of five methods out of six. Most of the branches of the additive tree could be interpreted likewise. The upper right branch, identified by none of the a priori themes, may represent an unforeseen topic. More research and expertise in Elizabethan poetry are required to confirm that we are dealing here with an undetected new theme. To conclude, we can only observe that each of the methods involved, be it ancient or modern, may contribute to detecting topics… and that exploratory tools are essential to visualize the complexity of the process and to assess the obtained results.

Figure 2. Additive tree describing the links between the 49 topics provided by the 6 selected methods and the 8 a priori themes. The identifiers are those of section 4 for the 6 selected methods: the first three letters indicate the method, followed by the index of the produced topic. The distance between two topics is the chi-square distance between their lexical profiles. Threshold of frequencies for words: 2. The boxed identifiers of the a priori themes are those (possibly shortened) of table 1.

References
Alden R.M. (1913). Sonnets and a Lover's Complaint. New York: Macmillan.
Berry M.W., Browne M., Langville A.N., Pauca V.P. and Plemmons R.J. (2007). Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1): 155-173.
Blei D., Ng A. and Jordan M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022.
Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K. and Harshman R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.
Garnett J.-C. (1919). General ability, cleverness and purpose. British Journal of Psychology, 9: 345-366.
Griffiths T.L., Steyvers M. and Tenenbaum J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2): 211-244.
Huson D.H. and Bryant D. (2006). Application of Phylogenetic Networks in Evolutionary Studies. Molecular Biology and Evolution, 23(2): 254-267. Software available from www.splitstree.org.
Kazmierczak J.-B. (1985). Analyse logarithmique : deux exemples d'application. Revue de Statistique Appliquée, 33(1): 13-24.
Lee D.D. and Seung H.S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401: 788-791.
Lebart L.
(2012). Articulation entre exploration et inférence. In : JADT_2012. Dister A., Longree D., Purnelle G., Editors. Presse Universitaire de Liège. Lewi P.J. (1976). Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim. Forsch. in: Drug Res. 26, 1295-1300. Paterson D. (2010). Reading Shakespeare Sonnets. Faber & Faber Ltd. London. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot M. and Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research , 12, 2825-2830. Reinert, M. (1986). Un logiciel d’analyse lexicale: [ALCESTE]. Cahiers de l’Analyse des Données, 4, 471–484. Sattath S. and Tversky A. (1977). Additive similarity trees. Psychometrika, vol. (42), 3: 319-345. Shakespeare, W. (1901). Poems and sonnets: Booklover's Edition. Ed. The University Society and Israel Gollancz. New York: University Society Press. Shakespeare Online. Dec. 2017. Spearman C. (1904). General intelligence, objectively determined and measured. Amer. Journal of Psychology, 15, 201-293. Gaujoux R.et al. (2010). A flexible R package for nonnegative matrix factorization. In: BMC Bioinformatics 11.1 (2010): 367. Thurstone L. L. (1947). Multiple Factor Analysis. The Univ. of Chicago Press, Chicago. 444 JADT’ 18 Analyse Diachronique de Corpus : le cas du poker Gaël Lejeune1, Lichao Zhu2 1 STIH, Sorbonne Université – gael.lejeune@sorbonne-universite.fr 2 LLSHS, Université Paris XIII – lichao.zhu@univ-paris13.fr Abstract In this paper we will investigate a diachronic corpus. We want to highlight how people’s mentalities evolve regarding the gambling especially the poker game and how the evolution is correlated with the way that the game is considered in press articles. We study plain or metaphorical meanings of the terms in question by using clustering and statistical methods in order to detect changes of meanings in a relatively large period of time. Résumé Dans cet article nous nous intéressons à l'étude diachronique de corpus de presse dans le but d'illustrer des évolutions dans la vision de la société sur les jeux d'argent et de hasard ainsi que sur les joueurs. Nous utilisons des méthodes de statistique textuelles et de clustering pour détecter les grandes tendances visibles sur noter échelle de temps en nous focalisant sur le poker . Nous montrons que si le regain de popularité du jeu de poker se traduit par un traitement médiatique plus important, les métaphores exploitant la notion de poker restent très fréquentes. Keywords: analyse diachronique, corpus, jeux d'argent et de hasard 1. Introduction L'analyse diachronique de corpus opère sur un champ assez large. Nous pouvons en juger par exemple en observant les nombreux travaux sur l'évolution des langues, travaux qui passionnent aussi bien la communauté scientifique (Dediu & de Boer 2016) que les médias si l'on se fie par exemple à l’intérêt renouvelé porté par ceux-ci sur l’évolution des dictionnaires. Dans le champ purement scientifique, les intérêts dans le domaine embrassent tous les niveaux de l'analyse linguistique même si la morphologie (Macaulay 2017) et le lexique (la néologie par exemple chez Gérard et al. 2014). La sémantique est un autre aspect des études diachroniques notamment pour étudier les représentations mentales des locuteurs (Hamilton et al. 2016). 
Le travail présenté ici s'intéresse à une autre catégorie de représentations mentales qui est l'image que certaines activités ludiques peuvent prendre au cours du temps. Nous nous intéressons ici à un jeu d'argent et de hasard qui JADT’ 18 445 a connu une sorte de nouvelle jeunesse ces dernières années : le jeu de poker. Dans ce travail, nous nous inspirons de l’analyse de l’usage du lexique dans (Hamilton et al. 2016), nous souhaitons examiner l’évolution de l’usage d’un mot, d’un terme particulier au cours du temps. Ce travail, même si notre ambition est moins large, peut se rattacher aux études sur la néologie sémantique (Sablayrolles 2002) ou néosémie (Rastier et Valette 2009). Pour illustrer l’intérêt que représente le poker en tant que phénomène de société, nous pouvons considérer le retentissement autour du Moneymaker Effect1 ou encore cette citation du journal Le Monde daté du 22 janvier 2007 qui illustre le changement d’image de ce jeu: « Considéré il y a encore peu de temps comme un jeu sulfureux se jouant dans les arrière-salles de bars louches ou dans des appartements huppés à l'abri des regards indiscrets, le poker fait une entrée en force à la télévision ». En particulier, dans sa variante à la mode Texas Hold'Em, le poker est redevenu un jeu dont on parle et dont on parle plutôt positivement. Notre objectif est d’une part de mesurer à quel point ce regain d’attention a pu se traduire par une amélioration de l’image du jeu de poker en général. D’autre part, il s’agit de voir dans quelle mesure les usages métaphoriques du terme poker, plutôt connotés “négativement” (poker menteur, coup de poker2…) ont pu évoluer conjointement à cette plus grande popularité du jeu lui même. Dans la section 2 nous présenterons le corpus que nous avons constitué pour cette étude. Puis, nous proposerons dans les deux sections suivantes une analyse statistiques des prédicats puis une analyse sous forme de clustering. Enfin, nous présenterons nos conclusions et perspectives. 2. Présentation de notre corpus d’étude De manière à pouvoir s’affranchir des variations de choix éditoriaux entre journaux, nous nous avons souhaité nous concentrer sur une seule publication. Nous avons choisi le Monde ce qui nous permettais d’exploiter des articles dont la publication s’étalent sur 30 ans : 1988-2017. Pour la partie 1988-2005 nous avons utilisé le corpus du monde distribué par ELRA3, nous avons restreint aux textes contenant le terme poker. Pour les années 2006 à 2017 nous avons extrait d’Europresse4 les articles qui comportait le terme poker. Dans les deux cas nous avons considéré toutes les variantes possibles dans la casse. Nous avons ainsi obtenu 3528 textes dont la répartition dans le Par exemple : http://www.slate.com/articles/news_and_politics/explainer/ 2011/06/the_moneymaker_effect.html 2 Dans le sport par exemple, on remarque des contextes de « tentative désespérée », « dernière chance » ... 3 http://catalog.elra.info/product_info.php?products_id=438&language=fr 4 http://www.europresse.com/fr/ 1 446 JADT’ 18 temps est présentée Figure 1. Nous pouvons observer que le nombre d’articles a connu une chute entre 2005 et 2006. Ceci semble être dû au fait que nous passions à ce moment précis d’une étude du corpus complet du monde tel qu’existant auprès d’ELRA à une étude fondée sur la base Europresse. De fait, sur nos critères de recherche, la base Europresse ne totalise que 47 articles pour 2003 (contre 129 dans le corpus ELRA), 62 pour 2004 (contre 117) et 67 articles pour 2005 (contre 117). 
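The selection and counting step described in this section amounts to a case-insensitive match and a per-year tally. A hedged sketch: the (year, text) input format is our assumption, not the actual layout of the ELRA or Europresse exports.

```python
import re
from collections import Counter

POKER = re.compile(r"poker", re.IGNORECASE)   # all case variants of the term

def poker_articles(articles):
    """Keep the articles containing the term, tally them per year, and
    report the share of selected articles with a single occurrence.
    `articles` is assumed to be an iterable of (year, text) pairs."""
    kept = [(year, len(POKER.findall(text)))
            for year, text in articles if POKER.search(text)]
    per_year = Counter(year for year, _ in kept)
    single = sum(1 for _, n in kept if n == 1)
    return per_year, single / max(len(kept), 1)
```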
Les contraintes respectives d‘utilisation de ces deux sources de données nous ont interdit de pouvoir disposer d’un corpus dont la constitution soit constante. Nous nous sommes efforcés de s’affranchir de ce biais en adaptant notre méthodologie (notamment le clustering). Figure 1 : Répartition du nombre d'articles par année Nous avons 4353 occurrences du terme recherché, leur répartition est instructive (Figure 2) : la très grande majorité des articles (2834/3528 soit 80,33%) ne comporte qu’une seule occurrence. Nous pensons que ceci est le reflet de deux tendances. D’une part le sujet de l’article est rarement le poker pour lui même, il est question d’un personnage qui par ailleurs joue au poker par exemple. D’autre part, cette rareté de la répétition révèle un usage massivement métaphorique, en effet comme l’a montré (Lejeune 2013) une métaphore perd de sa force en étant répétée. Si un terme est répété, il est très probable qu’il soit employé dans son sens plein. Si cette observation était faite sur des noms de maladies infectieuses, il nous semble que ceci est avant tout lié au genre de texte et que cela s’applique également ici. Si nous allons un peu plus loin, nous pouvons faire l’hypothèse que la métaphore peut être filée, mais qu’elle est rare dans les articles expositifs. D’autre part, dans le cas peu probable d’une métaphore filée, les conventions stylistiques impliquent de changer le terme employé, le journaliste utilisera plutôt des termes du même champ lexical. JADT’ 18 447 Figure 2 : Répartition des d'articles selon le nombre d’occurrences du terme « poker » La répartition des articles entre ceux qui comportent une et une seule occurrence et ceux qui en comportent plusieurs montre des variations importantes dans le temps (Figure 3). Si l’on observe des périodes de 5 ans, on peut se rendre compte que le nombre d’articles comprenant plusieurs occurrences de “poker” représente 15% des articles sélectionnés sur la période 1988-1992, se pourcentage descend à 10% jusqu’en en 2003 puis remonte progressivement pour finalement rester au-dessus de 20% à partir de 2004-2008 avec une pointe à 30% pour les périodes 2007-2011 à 2009-2013. Figure 3 : Répartition par année des articles selon le nombre d’occurrences 3. Prédicats et séquences figées Dans la théorie linguistique lexique-grammaire de M. Gross (1975) et de G. Gross (2012), les prédicats sont considérés comme les noyaux d’une phrase capables de disposer d’arguments, grâce à leurs propriétés transformationnelles et distributionnelles. Parmi les apports de cette théorie figurent le « schéma d’arguments » et les « prédicats appropriés ». Nous relevons dans notre corpus les contextes gauches et droits des séquences 448 JADT’ 18 figées « partie de poker » et « coup de poker » afin de distinguer leurs emplois métaphoriques et non métaphoriques. Ce travail est fait en étudiant le premier verbe précédant ou suivant l’expression (sans remonter au-delà d’une phrase). Nous montrons dans les tableaux 1 et 2 les 20 verbes les plus fréquents pour chaque contexte se trouvent le plus fréquemment dans ces contextes (20 dans les contextes gauches, 20 dans les contextes droits). 
Tableau 1 : Effectif des verbes dans le contexte gauche de « [partie|coup] de poker » : être (76), jouer (62), faire (15), tenter (14), gagner (11), avoir (11), ressembler (10), prendre (9), tenir (8), lancer (8), perdre (7), voir (6), partir (6), engager (6), agir (5), réussir (4), livrer (4), remporter (3), organiser (3), mener (2).

Tableau 2 : Effectif des verbes dans le contexte droit de « [partie|coup] de poker » : être (98), avoir (75), jouer (16), pouvoir (13), devoir (8), gagner (7), engager (7), venir (6), livrer (6), faire (5), vouloir (4), voir (4), tenter (4), tenir (4), réussir (4), prendre (4), monter (4), bluffer (4), aller (4), retrouver (3).

Hormis les verbes « être » et « avoir », qui sont susceptibles d’être des verbes auxiliaires ou semi-auxiliaires, pour les autres verbes on peut se trouver dans trois cas de figure :
a) verbe support ;
b) prédicat approprié : le sens littéral de l’expression peut être activé ;
c) prédicat non approprié : le sens métaphorique de l’expression est activé.
Le cas des verbes supports n’est pas pertinent pour notre étude. Pour le second cas, nous observons que le verbe jouer, prédicat approprié pour les deux séquences décrites, est très souvent lié à un usage métaphorique. Dans le troisième cas, de loin le plus fréquent, les verbes « tenter », « s’engager », « réussir », « mener », « lancer », voire « remporter », ne sont pas tout à fait congruents avec le sens premier de la séquence, c’est-à-dire qu’ils ne sont pas des prédicats appropriés au sens propre du jeu de poker. Des occurrences de ces verbes dans le corpus confirment cette intuition :
Il leur fallait lancer la partie de poker que Bonn et Paris s’apprêtent à jouer sur le GATT (1993)
les enjeux de la partie de poker qui s’engagera mercredi à la mi-journée lorsque l’ambassadeur [...] (2017)
[ils] avaient pu croire un moment que leur coup de poker allait réussir. (1989)
[Celui qui] est davantage connu pour ses coups de poker financiers continue à mener sa stratégie (2015)
Elle venait de remporter la partie de poker menteur qui constitue l’essentiel des premiers hectomètres. (1995)

4. Étude des champs lexicaux par clustering
Si les séquences « partie de poker » et « coup de poker » sont ambiguës dans le sens où elles figurent dans des champs lexicaux différents, on peut se demander ce qu’il en est des champs lexicaux du terme « poker » en général. Pour étudier cette question, nous avons réalisé un clustering de notre corpus. Nous avons utilisé l’implantation des k-moyennes (K-means) de la bibliothèque Python scikit-learn. Nous avons fixé le nombre de clusters K à 10 (selon la méthode du coude, ou elbow method, la valeur optimale se situait entre 9 et 12) et le nombre maximal d’itérations à 400 ; la mesure des poids est le tf-idf. Nous avons extrait tous les n-grammes de mots avec n allant de 1 à 3, puis seulement nous avons utilisé une stop-list. De sorte que, par exemple, « de » n’était pas gardé en tant que tel mais que nous le retrouvions dans « coup de poker » ou « loi de Robien ». Nous avons tout d’abord travaillé sur le corpus lemmatisé, puis nous avons observé que les résultats étaient semblables sans lemmatisation ; nous avons donc supprimé ce pré-traitement (une esquisse de cette chaîne de traitement est donnée à la fin de cette section). Nous allons maintenant décrire chaque cluster en donnant la proportion du corpus qu’il couvre ainsi que les 10 termes les plus significatifs.
Cluster 0, « sport et poker 1 » : 3,1 % (club, football, équipe, Ligue, France, championnat, saison, joueurs, OM, Marseille). Ce cluster comporte deux volets : l’un sur les « coups de poker » dans les championnats de football et l’autre où il est question des championnats de poker eux-mêmes.
Cluster 1, « politique » : 18,79 % (ministre, président, politique, gouvernement, pays, État, premier ministre, premier, États, faire). Un cluster autour de l’action politique, notamment au niveau européen. Un exemple intéressant de métaphore (filée) ici : « M. Erdogan remet tout en jeu, comme un joueur de poker fait tapis ».
Cluster 2, « fourre-tout » : 38,01 % (être, bien, film, vie, entre, Jean, monde, France, temps, homme). Le seul de nos clusters qui n’ait pas d’unité ni de tendance thématique ; ici, les expressions contenant poker sont pour moitié métaphoriques.
Cluster 3, « culture_1 » : 5,13 % (film, Booker Prize, roman, prix, livres, livre, littéraire, base, prix littéraire, attribué). Ce cluster rassemble les livres ayant trait au poker ; les expressions liées sont prises dans leur sens littéral (l’expression « coup de poker » y est quasi absente).
Cluster 4, « finance » : 4,2 % (Vivendi, marché, groupe, Bourse, marches, actionnaires, titres, taux, millions, fonds, terme, milliards, prix). Il se caractérise uniquement par des thématiques associées au domaine de la finance et notamment aux coups de poker boursiers.
Cluster 5, « sport et poker 2 » : 5,04 % (Coupe, match, équipe, joueurs, France, club, football, finale, francs, PSG). Nous avons ici un cluster sur le sport, où environ la moitié des articles concernent toutefois le poker lui-même.
Cluster 6, « industrie du poker » : 12,96 % (jeux, paris, ligne, marché, milliards, euros, millions, Internet, dollars, Bourse). Ici nous avons tout ce qui est lié à l’industrie du poker et notamment à l’essor des jeux d’argent sur Internet (dont le poker a été un fer de lance).
Cluster 7, « sport » : 3,26 % (Tour, numéros, France, coureur, étape, peloton, course, équipe, Tour de France, maillot). Nous avons ici des usages, massivement métaphoriques, dans le domaine du sport (principalement le cyclisme). Un exemple avec le terme spécialisé flop : « [P.A. Bosse] avait trouvé cette image [...] : Si on compare le 1500 m au poker, il a un flop d’avance. »
Cluster 8, « culture_2 » : 7,14 % (blues, musique, CD, rock, John Lee Hooker, jazz, album, guitare, musiciens, scène). Un usage métaphorique dans le domaine de la musique, avec des expressions telles que « poker face », « poker perdant »...
Cluster 9, « culture_3 » : 2,38 % (Dracula, Bram Stoker, vampire, roman, film, fantastique, Christie, Coppola, comte, Frankenstein). Le cluster 3 était centré sur le domaine littéraire ; ici, il est question de cinéma et particulièrement des personnalités liées au poker. L’usage y est surtout littéral.
Pour ce qui est de la répartition temporelle, il est très intéressant de noter que le cluster 6 (l’industrie du poker) devient le second plus important, derrière le cluster 2, à partir de 2005 (popularisation des jeux d’argent sur Internet) et plus encore à partir de 2010 (légalisation des paris en ligne). Le cluster 0 (sport et poker) devient plus important à partir de 2004, d’autant qu’en son sein la thématique poker y est alors largement majoritaire.
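The clustering pipeline of section 4 can be sketched with the scikit-learn implementation the authors cite. The stop-list below is an illustrative stub (the actual list is not given) and the tokenisation details are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative stub: the authors' French stop-list is not given.
STOP = {"de", "la", "le", "les", "un", "une", "des", "et", "à", "en"}

def cluster_articles(texts, k=10):
    """Word n-grams with n from 1 to 3, tf-idf weighting, then K-means
    with K=10 and at most 400 iterations. Filtering the feature list
    *after* n-gram extraction keeps "coup de poker" while dropping the
    bare stop word "de", as described above."""
    vec = TfidfVectorizer(ngram_range=(1, 3))
    X = vec.fit_transform(texts)
    keep = [i for i, g in enumerate(vec.get_feature_names_out())
            if g not in STOP]
    X = X[:, keep]
    km = KMeans(n_clusters=k, max_iter=400, n_init=10, random_state=0)
    return km.fit_predict(X)
```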
Nous avons observé que dans plus de 80 % des cas, le terme poker n'apparaissait qu'une fois dans les textes. Nous avons montré que ceci était dû à un usage principalement métaphorique (on ne répète pas une métaphore), mais aussi au fait que le poker est rarement le sujet central de l'article. Cette tendance change quelque peu à partir de 2005, le poker devenant lié à des championnats et des retransmissions télévisuelles plutôt qu'à des tripots et des casinos. Enfin, nous avons montré que les usages métaphoriques relevaient très majoritairement de 3 domaines : la finance, la politique et le sport.

References
Dediu D. and de Boer B. (2016). Language evolution needs its own journal. Journal of Language Evolution, 1(1): 1-6.
Gérard C., Falk I. and Bernhard D. (2014). Traitement automatisé de la néologie : pourquoi et comment intégrer l'analyse thématique ? Actes du 4e Congrès mondial de linguistique française (CMLF 2014), Berlin, pages 2627-2646.
Gross, M. (1975). Méthodes en syntaxe : régime des constructions complétives. Hermann.
Gross, G. (2012). Manuel d'analyse linguistique : approche sémantico-syntaxique du lexique. Presses Universitaires du Septentrion.
Hamilton W.L., Leskovec J. and Jurafsky D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proc. of the Association for Computational Linguistics Conference (ACL) 2016.
Lejeune G. (2013). Veille épidémiologique multilingue : une approche parcimonieuse au grain caractère fondée sur le genre textuel. Thèse de doctorat en informatique de l'Université de Caen.
Macaulay, M. and Salmons, J. (2017). Synchrony and diachrony in Menominee derivational morphology. Morphology, 27: 179.
Rastier, F. and Valette, M. (2009). « De la polysémie à la néosémie ». Le français moderne (S. Mejri, éd., La problématique du mot), 77: 97-116.
Sablayrolles, F. (2002). « Fondements théoriques des difficultés pratiques du traitement des néologismes ». Revue française de linguistique appliquée, VII-1: 97-111.

Approche textométrique des variations du sens
Julien Longhi 1, André Salem 2
1 Université de Cergy-Pontoise, France – julien.longhi@u-cergy.fr
2 Université de la Sorbonne nouvelle, France – salem@msh-paris.fr

Abstract
The use of textometric methods relies on the hypotheses, firstly, that stable units exist (forms, lemmas or their graphical approximations) and, secondly, that occurrences of these forms can be retrieved from different parts of a corpus. Once automatic counting is performed, more sophisticated textometric methods can be employed to focus on textual variations (repeated segments, collocations, etc.) that occur around the same unit but in different contexts found within the corpus. This approach leads to the identification of semantic variations in relation to the context of each occurrence, as highlighted through automatic segmentation. We will illustrate this by using examples of repeated segments within the corpus that contain the n-gram /ennemi/, taken from a widely-studied chronological text series.

Résumé
Pour pouvoir mettre en œuvre les méthodes de la textométrie, il est indispensable de postuler, dans un premier temps, l'existence d'unités stables (formes, lemmes ou leurs approximations graphiques), dont on recensera ensuite les occurrences dans les différentes parties du corpus étudié.
Une fois les dépouillements automatiques réalisés, il est cependant possible d'utiliser des méthodes textométriques plus élaborées pour accéder aux variations textuelles (segments répétés, cooccurrences, etc.) qui peuvent se réaliser autour d'une même forme dans chacun des contextes particuliers du corpus. Cette démarche permet d'accéder au repérage de variations sémantiques qui se rapportent à chacune des occurrences des formes produites par la segmentation automatique. Nous illustrons notre démarche à l'aide d'exemples prélevés dans les parties d'une série textuelle chronologique largement étudiée : des segments répétés du corpus qui contiennent le n-gram /ennemi/.

Keywords: unité textométrique, sémantique, variation du sens

1. Introduction
Notre étude s'inscrit dans une perspective de prise en compte des dynamiques du sens à l'œuvre dans les discours, qui tiendrait compte de la variation, de l'hétérogénéité, ou encore de l'articulation entre topologie textuelle et discursive, sens et profilage. Le sens se construit dans différents champs où il est susceptible de paraître, et s'analyse « par le contexte, sous forme d'indices de position liés aux modalités de sa mise en place dans le champ » (Cadiot et Visetti, 2001), la caractérisation sémantique se faisant alors sur la base de la composition et décomposition des profils disponibles. L'automatisation du dépouillement de vastes corpus de textes, à des fins textométriques, nécessite au contraire que le repérage des unités de décompte puisse être confié à des machines. Pour pouvoir mettre en œuvre les méthodes de la textométrie, il est indispensable de postuler, dans un premier temps, l'existence d'unités stables (lexèmes, lemmes ou leurs approximations graphiques), dont on recensera ensuite les occurrences dans différentes parties du texte. Cette manière de faire permet d'étudier la répartition de chacune des unités dans un corpus, ou encore de rapprocher les différents contextes qui contiennent chaque unité textométrique. Ces simplifications, incontournables dans le premier temps de l'analyse, nous éloignent de l'étude du sens de chacune des occurrences que l'on peut élaborer dans chaque contexte particulier. Cependant, une fois les premiers dépouillements automatiques réalisés, il est possible d'utiliser des méthodes textométriques plus élaborées pour accéder aux variations textuelles qui peuvent se réaliser autour d'une même forme dans le corpus (segments répétés, cooccurrences, etc.). C'est ce croisement de perspectives et ce va-et-vient entre approche empirique et théorisation sémantique que nous souhaitons mettre à l'épreuve dans la présente étude.

2. Application au corpus Duchesne
Pour illustrer notre démarche, nous appliquons ces méthodes à l'étude de la ventilation, dans les différentes parties d'une série textuelle chronologique largement étudiée, des segments répétés du corpus qui contiennent le n-gram /ennemi/.

2.1. Rappels sur l'analyse de la série chronologique Duchesne
La série chronologique Père Duchesne a déjà fait l'objet de nombreuses analyses textométriques [1].

[1] Le corpus Père Duchesne est constitué par la réunion d'un ensemble de livraisons du journal Le Père Duchesne de Jacques-René Hébert, parues entre 1793 et 1794. Pour une description plus avancée de ce corpus, on consultera, par exemple, (Salem, 1988). Les analyses dont nous rendons compte ci-dessous ont été effectuées à l'aide du logiciel Lexico5. Cedric Lamalle, William Martinez et Serge Fleury ont largement contribué au développement des fonctionnalités de ce logiciel. Les auteurs tiennent à les en remercier.
Nous avons montré, en particulier, que les typologies réalisées à partir d'une partition de ce corpus en huit périodes, correspondant chacune à un mois de parution, mettaient en évidence un renouvellement lexical fortement lié à l'évolution dans le temps. On peut vérifier, sur la figure 1, que les parties correspondant aux périodes successives de parution sont proches sur les facteurs issus de l'analyse du tableau (8 parties x 1420 formes dont la fréquence dépasse dix occurrences) [2].

[2] Ce phénomène, connu sous le nom d'effet Guttman, a été largement décrit par Guttman (1941, 1946, 1950), Benzécri (1973) et Van Rijckevorsel (1987).

La méthode des segments répétés permet de repérer toutes les occurrences de suites de formes graphiques qui apparaissent plusieurs fois dans un corpus de textes (Lafon et Salem, 1983 ; Salem, 1986). Pour la présente étude, nous avons constitué un ensemble d'unités textuelles qui contient, outre les formes graphiques ennemi et ennemis, tous les segments répétés qui contiennent l'une ou l'autre de ces formes. On a projeté sur la figure 1, en qualité d'éléments supplémentaires, cet ensemble de segments. La position sur ce graphique des différents segments montre que ces unités ne sont pas employées de manière uniforme tout au long des périodes.

Figure 1 : Duchesne. Les segments contenant la séquence ennemi sur le plan des deux premiers facteurs issus de l'analyse du tableau 8 parties x 1420 formes (F>=10)

Guide de lecture pour la figure 1 : la figure fournit la représentation des huit parties du corpus Duchesne sur les deux premiers axes issus d'une analyse des correspondances, réalisée sur l'ensemble des formes dont la fréquence dépasse 10 occurrences. Les segments répétés du corpus contenant la séquence de caractères /ennemi/ ont été projetés sur ce même plan, en tant qu'éléments supplémentaires. La figure a été allégée des segments redondants (ex. : segments contenus dans des segments plus longs). Certains des éléments superposés par l'analyse ont été très légèrement déplacés afin de rendre la figure plus lisible.

Ainsi, par exemple, le segment plus cruels ennemis trouve toutes ses occurrences au début du corpus, alors que celles du segment ennemis de la liberté sont plutôt concentrées vers la fin. L'analyse des projections des différents segments qui contiennent le n-gram /ennemi/ va nous permettre de dégager des contextes dont la distribution diffère fortement entre le début et la fin de la période temporelle couverte par le corpus.
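À titre purement indicatif, l'inventaire des segments répétés contenant /ennemi/ et leur ventilation dans les huit parties peuvent se schématiser ainsi en Python. Esquisse naïve, sans rapport avec l'implémentation de Lexico5 ; le nom parties (liste des huit textes mensuels) est une hypothèse.

    from collections import Counter

    def ngrammes(tokens, n_min=2, n_max=5):
        # toutes les suites de n formes graphiques consécutives
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])

    # parties : liste des huit textes correspondant aux huit périodes (hypothèse)
    tokens_par_partie = [texte.lower().split() for texte in parties]
    freq_corpus = Counter(g for tokens in tokens_par_partie for g in ngrammes(tokens))

    # segments répétés (fréquence >= 2 dans l'ensemble du corpus) contenant /ennemi/
    retenus = [g for g, f in freq_corpus.items()
               if f >= 2 and any(m in ("ennemi", "ennemis") for m in g)]

    # ventilation par partie : tableau (segments x 8 parties), projetable
    # en éléments supplémentaires sur le plan factoriel de l'AFC
    compteurs = [Counter(ngrammes(tokens)) for tokens in tokens_par_partie]
    ventilation = {" ".join(g): [c[g] for c in compteurs] for g in retenus}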
2.2. L'évolution du contexte de la forme ennemi(s)
On peut estimer que le contenu sémantique de la forme ennemi(s) conserve une valeur relativement stable tout au long des périodes couvertes par le corpus que nous étudions. Le chercheur confronté à l'analyse de ces textes retrouvera sans peine, lors de l'examen de chacune des occurrences du terme, les principaux traits sémantiques décrits dans un dictionnaire de langue à propos de ce lexème (opposé, hostile, etc.). Cependant, l'analyse de ces mêmes contextes montre qu'il en va tout autrement pour ce qui concerne les référents auxquels la forme renvoie, dans chaque période particulière.

Aux plus cruels ennemis, plus mortels ennemis, ennemis du dehors (les puissances étrangères, les expatriés) des périodes du début succèdent bientôt les ennemis du dedans et du dehors, expressions qui peuvent s'analyser comme une dénonciation du fait que les ennemis du dehors ne constituent pas le seul danger et qui opèrent donc une modification manifeste du référent de départ. Par la suite, la mention des ennemis de l'intérieur complètera la notion d'ennemis du dedans. Il faut noter que les ennemis de l'intérieur sont de plus en plus souvent précédés de l'article défini les, qui les désigne comme une réalité dont l'existence est présupposée (elle n'est plus à démontrer). Progressivement, nos ennemis deviennent vos ennemis, puis les ennemis. Dans la dernière période, les ennemis, désormais désignés de manière préférentielle au pluriel, ne sont plus qualifiés par leur localisation ou par leur rapport aux destinataires du message (nos/vos ennemis) mais par des valeurs supposées communes auxquelles ils sont censés s'opposer : ennemis du peuple, ennemis de la république, ennemis de la révolution, ennemis de la liberté, ennemis de l'égalité.

3. La sémantique de ennemi(s)
Les variations constatées montrent que la forme ennemi(s) prend différents sens selon les contextes dans lesquels elle s'inscrit, en ce qu'ils sont associés à des référents distincts. Plutôt que de représenter le sens comme la somme des cooccurrences constatées, nous souhaitons analyser ces valeurs comme un sous-ensemble prélevé sur un ensemble de valeurs acquises. Les espaces sémantiques déterminés et caractérisés par l'analyse statistique jouent un rôle fondamental : au-delà des synonymies ou des polysémies, ils se renouvellent « en étant confronté aux textes – ce qui impliquerait de prêter attention à d'autres corrélations » (Visetti, 2004 : 11). La description sémantique que nous proposons s'inscrit dans le champ de la sémantique lexicale [3], du côté des approches qui envisagent la construction des référents comme extrinsèque. Cependant, alors que ces approches mobilisent en général des analyses phrastiques et travaillent sur des exemples forgés, nous introduisons une perspective statistique qui précède la représentation du sens. La description de l'objet ennemi(s) n'est pas séparée des rapports que l'on entretient avec lui, et sa description suppose une prise en compte différenciée de ses propriétés extrinsèques (relatives à ces rapports) et de ses propriétés intrinsèques (supposées stables et indépendantes).

[3] Cadiot et Némo (1997 : 127-128)

Figure 2 : Niveaux et unités d'analyse

L'intérêt de cette démonstration textométrique est pour nous de fournir des résultats concrets et matériels pour l'analyse des sens d'une unité lexicale. Ceci a plusieurs conséquences pour la mise en œuvre d'une sémantique soucieuse de l'exploitation des constats empiriques :
1) la représentation des variations du sens en contexte nous a permis d'identifier la manière dont les propriétés sont introduites et attribuées dans le corpus. Le référent change au fil du temps, puisque les ennemis, initialement définis comme du dehors et introduits par nos, deviennent vos ennemis, et se présentent finalement sous la forme ennemi(s) de + N. Se manifeste alors « le besoin d'être déterminé par un complément du nom, ou son équivalent, qui indique avec quoi le terme "relatif" se trouve mis en relation », cette complémentation explicitant « ainsi la référence identitaire » (Steuckardt, 2008).
2) L'évolution dans le corpus au fil du temps permet de rendre compte de la dynamique sémantique à l'œuvre, laquelle rend compte diachroniquement des évolutions de sens. La textométrie permet ainsi de saisir les processus, et donc de donner du sens à la dimension potentiellement « hétéroclite » des propriétés des référents. Ainsi, au plan linguistique, le passage du référent 1 au référent 2 se fait par l'intermédiaire d'une transformation des propriétés de ennemi(s) : défini de manière situationnelle (du dehors) et relative (nos, nos plus cruels), il acquiert des propriétés plus polémiques (vos, du dedans et du dehors), pour s'intégrer ensuite dans un processus discursif qui construit le référent (ennemi de + N : ennemi de la liberté ; ennemi du peuple), par l'introduction de termes à forte charge axiologique. Le référent introduit alors un point de vue, qui n'est pas strictement géographique ou institutionnel, mais aussi politique et idéologique. L'approche statistique dévoile, en outre, que c'est le pluriel qui est prioritairement mobilisé.

4. Conclusion
De manière désormais classique, les méthodes de la textométrie permettent de mettre en évidence les variations du vocabulaire qui surviennent au cours des périodes successives d'une même série textuelle chronologique. Dans la présente étude, nous avons appliqué les méthodes d'analyse statistique multidimensionnelle (AFC) à l'étude d'un ensemble particulier, celui des segments répétés réunis sur la base du fait qu'ils contenaient tous une même unité graphique (en l'occurrence, le n-gram /ennemi/). La confrontation des segments ainsi sélectionnés nous permet d'observer des variations autour des formes graphiques ennemi et ennemis. L'analyse de ces variations dans le temps nous conduit à distinguer des référents qui varient en fonction des périodes réunies dans le corpus. Au-delà des séries textuelles chronologiques, la méthode que nous avons présentée est susceptible de recevoir des applications dans l'étude de nombreux types de corpus. L'extraction semi-automatique des unités dont les contextes varient fortement en fonction des parties d'un corpus textuel peut également être envisagée.

References
Benzécri J.-P. et coll. (1981). Pratique de l'analyse des données. Linguistique et lexicologie. Dunod.
Cadiot P. and Nemo F. (1997). Propriétés extrinsèques en sémantique lexicale. Journal of French Language Studies, 7(2) : 127-146.
Cadiot P. and Visetti Y.-M. (2001). Pour une théorie des formes sémantiques. PUF.
Guttman L. (1941). The quantification of a class of attributes: a theory and method of scale construction. In P. Horst, The prediction of personal adjustment. SSRC, New York.
Lafon P. and Salem A. (1983). L'inventaire des segments répétés d'un texte. Mots. Les langages du politique, 6 : 161-177.
Lamalle C., Martinez W., Fleury S. and Salem A. (2002). Les dix premiers pas avec Lexico3. Outils lexicométriques. http://www.cavi.univ-paris3.fr/Ilpga/ilpga/tal/lexicoWWW
Lebart L. and Salem A. (1994). Statistique textuelle. Dunod.
Longhi J. (2008). Objets discursifs et doxa. Essai de sémantique discursive. L'Harmattan, coll. « Sémantiques ».
Rastier F. (2011). La mesure et le grain. Sémantique de corpus. Honoré Champion, coll. « Lettres numériques ».
Salem A. (1987). Pratique des segments répétés. Klincksieck.
Salem A. (1988). Approches du temps lexical. Mots. Les langages du politique, 17 : 105-143.
Steuckardt A. (2008). Les ennemis selon L'Ami du peuple, ou la catégorisation identitaire par contraste. Mots. Les langages du politique [En ligne], 69 | 2002. http://journals.openedition.org/mots/10023
Van Rijckevorsel J. (1987). The application of fuzzy coding and horseshoes in multiple correspondence analysis. DSWO Press.
Visetti Y.-M. (2004). Le Continu en sémantique : une question de formes. Texto !, juin 2004. http://www.revuetexto.net/Inedits/Visetti/Visetti_Continu.html

ADT et deep learning, regards croisés. Phrases-clefs, motifs et nouveaux observables
Laurent Vanni 1, Damon Mayaffre 1, Dominique Longrée 2
1 UMR 7320 : Bases, Corpus, Langage – prenom.nom@unice.fr
2 L.A.S.L.A. – prenom.nom@uliege.be

Abstract
This contribution confronts ADT and machine learning. The extraction of statistical key-passages is undertaken following several calculations implemented in the Hyperbase software. An evaluation of these calculations according to the filters applied (taking into account only positive specificities, only substantives, etc.) is given. The extraction of key passages obtained by deep learning - passages that have the best recognition rate at the time of a prediction - is then proposed. The hypothesis is that deep learning is of course sensitive to the linguistic units on which the computation of the statistical key sentences is based, but also sensitive to phenomena other than frequency and to other complex linguistic observables that ADT has more difficulty taking into account, as would be the case with underlying patterns (Mellet et Longrée, 2009). If this hypothesis is confirmed, it would on the one hand permit better understanding of the black box of deep learning algorithms and on the other hand offer the ADT community a new point of view.

Résumé
Cette contribution confronte ADT et deep learning. L'extraction de passages-clefs statistiques est d'abord proposée selon plusieurs calculs implémentés dans le logiciel Hyperbase. Une évaluation de ces calculs en fonction des filtres appliqués (prise en compte des spécificités positives seulement, prise en compte de substantifs seulement, etc.) est donnée. L'extraction de passages-clefs obtenus par deep learning - c'est-à-dire des passages qui ont le meilleur taux de reconnaissance au moment d'une prédiction - est ensuite proposée. L'hypothèse est que le deep learning est bien sûr sensible aux unités linguistiques sur lesquelles le calcul des phrases-clefs statistiques se fonde, mais sensible également à d'autres phénomènes que fréquentiels et à d'autres observables linguistiques complexes que l'ADT a plus de mal à prendre en compte - comme le seraient des motifs sous-jacents (Mellet et Longrée, 2009). Si cette hypothèse se confirmait, elle permettrait d'une part de mieux appréhender la boîte noire des algorithmes de deep learning et d'autre part d'offrir à la communauté ADT de nouveaux points de vue.

Keywords: ADT, deep learning, phrase-clef, motif, spécificités, nouveaux observables

1. Introduction
Pour des raisons techniques avant tout, l'ADT s'est constituée à partir des années 1960 autour du token, c'est-à-dire du mot graphico-informatique. Depuis lors, la discipline n'a cessé de varier et d'élargir ses observables, convaincue que le token seul rendait difficilement compte du texte dans sa complexité linguistique. Ainsi, la tokenisation en particules graphiques élémentaires reste l'acte informatique premier des traitements textométriques, et le calcul des spécificités lexicales reste l'entrée statistique privilégiée de nos parcours interprétatifs.
Cependant, la recherche d'unités phraséologiques élargies et complexes, caractérisantes et structurantes des textes, est devenue le programme d'une discipline désormais adulte. Historiquement, dès 1987, le calcul des segments répétés (Salem, 1987) ou des n-grams a représenté une avancée puisque les segments significatifs du texte, de taille indéterminée, étaient automatiquement repérés ; et aujourd'hui, la détection automatique, non supervisée, de motifs (Mellet et Longrée, 2009 ; Quiniou et al., 2012 ; Mellet et Longrée, 2012 ; Longrée et Mellet, 2013) - objets linguistiques complexes à empans variables et discontinus - apparaît un enjeu décisif. C'est dans cette perspective que cette contribution travaille et met à l'épreuve l'idée de passages-clefs du texte, tels qu'ils sont implémentés dans les deux versions d'Hyperbase (locale, développée par Etienne Brunet, et web, développée par Laurent Vanni) que l'UMR Bases, Corpus, Langage produit en collaboration avec le LASLA. La démonstration se fait en deux temps. D'abord, nous proposons une extraction statistique de passages-clefs, avec évaluation de leur pertinence interprétative sur un corpus français et un corpus latin. Ensuite, une confrontation méthodologique avec le deep learning est mise en œuvre puisque le traitement deep learning attribue, après apprentissage, les passages de texte à leur auteur avec un taux de réussite éprouvé : par déconvolution, nous repérons alors au sein de ces passages les zones d'activation, en soupçonnant qu'il s'agit, d'un point de vue linguistique, de motifs remarquables.

2. Les passages-clefs en ADT
2.1. Terminologie
Si nous préférons le terme de passage-clef à celui de phrase-clef, c'est que les traitements ici présentés n'ont pas de modèle syntaxique, et que la ponctuation forte qui délimite habituellement la phrase est un jalon utile mais non nécessaire à nos traitements. La notion de passage a été fortement théorisée par (Rastier, 2007) dans un article éponyme et désigne une « grandeur » du texte dont la valeur textuelle, c'est-à-dire interprétative, est patente. Un passage est donc un morceau de texte jugé suffisamment parlant, notamment par sa taille qui gagne à dépasser le mot, le segment voire la phrase, pour prétendre rendre compte d'un texte. Le passage-clef, quant à lui, s'appuie sur la définition rastiérienne mais est une unité de surcroît textométrique, c'est-à-dire une unité dont la pertinence est calculable et l'extraction automatique.

2.2. Implémentations
Les logiciels ADT comme Hyperbase, Dtm-Vic ou Iramuteq implémentent des calculs et l'extraction de passages-clefs. Dans tous les cas, les calculs proposés reposent sur l'examen des mots spécifiques (Lafon, 1984) : grosso modo, plus un passage concentre de spécificités, plus ce passage est jugé remarquable. Nous présentons ici deux types d'approche sur des passages arbitrairement constitués de 50 mots : un calcul naïf et sans filtre, dans lequel tous les mots du passage sont considérés, et un calcul filtré par nos connaissances linguistiques (sélection a priori des mots à considérer). Une évaluation de ces deux types d'approche est ensuite donnée.

2.3. Calcul sans filtre
Dans le cadre des études contrastives habituelles en ADT, l'indice de spécificité de chaque mot (Lafon, 1984) est sommé, qu'il soit positif ou négatif, en postulant que si les mots positifs (les mots sur-utilisés par un auteur, par exemple) doivent promouvoir le passage, il est légitime que les mots négatifs (les mots sous-utilisés par un auteur) doivent le handicaper. Chaque passage du corpus se trouve ainsi doté d'un super-indice de spécificité, et Hyperbase fait remonter en bon ordre les passages les plus caractéristiques des textes comparés.
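À titre purement indicatif, ce calcul naïf peut se schématiser comme suit. L'esquisse approche l'indice de spécificité de Lafon (1984) par un score hypergéométrique signé et somme ces indices sur des passages de 50 mots ; elle ne reproduit pas le code d'Hyperbase.

    from collections import Counter
    import math
    from scipy.stats import hypergeom

    def indices_specificite(tokens_sous_corpus, tokens_corpus):
        # Indice par mot, dans l'esprit de Lafon (1984) : -log10 de la probabilité
        # hypergéométrique d'une fréquence au moins aussi extrême, signé
        # (positif = sur-emploi, négatif = sous-emploi).
        freq_corpus, freq_sous = Counter(tokens_corpus), Counter(tokens_sous_corpus)
        T, t = len(tokens_corpus), len(tokens_sous_corpus)   # tailles corpus / sous-corpus
        spec = {}
        for mot, F in freq_corpus.items():
            f = freq_sous.get(mot, 0)
            if f >= F * t / T:                               # sur-emploi : queue droite
                p = hypergeom.sf(f - 1, T, F, t)
                spec[mot] = -math.log10(max(p, 1e-300))
            else:                                            # sous-emploi : queue gauche
                p = hypergeom.cdf(f, T, F, t)
                spec[mot] = math.log10(max(p, 1e-300))
        return spec

    def passages_clefs(tokens_sous_corpus, spec, taille=50):
        # Super-indice d'un passage = somme des indices de ses 50 mots.
        # Variante « filtrée » (cf. 2.4) : ne sommer que les indices positifs,
        # voire les seuls substantifs (étiquetage morphosyntaxique en amont).
        scores = []
        for i in range(0, len(tokens_sous_corpus) - taille + 1, taille):
            passage = tokens_sous_corpus[i:i + taille]
            scores.append((sum(spec.get(m, 0.0) for m in passage), i))
        return sorted(scores, reverse=True)      # passages les plus caractéristiques d'abord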
Ainsi, pour le français, sur le corpus de la présidentielle française 2017, le passage-clef le plus fortement spécifique d'E. Macron (versus les autres candidats) est le suivant :

[...] nous croyons dans l'innovation, dans la transformation écologique et environnementale, parce que nous voulons réconcilier cette perspective et l'ambition de nos agriculteurs, parce que nous croyons dans la transformation digitale, parce que nous sommes pour une société de l'innovation, parce que nous voulons […]

Quoique naïf, le calcul apparaît performant puisque l'interprétabilité sociolinguistique de ce passage est évidente : de fait, Macron s'est fait élire sur un discours dynamique (voulons, innovation (deux fois), transformation (deux fois), digitale) et un discours rassembleur susceptible de transcender le clivage gauche/droite (nous (5 fois), réconcilier).

2.4. Calcul filtré
Par connaissances linguistiques et statistiques, le calcul peut être raffiné. Par exemple, seules les spécificités positives – et parmi elles, les spécificités les plus fortes – peuvent être considérées, au motif qu'un objet s'identifie mieux par ses qualités que par ses défauts. Ensuite, les mots outils (conjonctions, déterminants) peuvent être écartés : ils présentent le double inconvénient d'avoir de très hautes fréquences (potentiellement déterminantes pour le calcul des spécificités) et d'être peu parlants d'un point de vue sémantico-thématique. Et encore, la catégorie grammaticale peut être choisie : par exemple, seuls les noms propres et communs, parfois plus chargés de sens, sont pris en compte. Ainsi, pour le latin, un passage-clef de Jules César, contrasté à de nombreux auteurs contenus dans la base du LASLA, est le suivant :

[...] partes Galliae uenire audere quas Caesar possideret neque exercitum sine magno commeatu atque molimento in unum locum contrahere posse sibi autem mirum uideri quid in sua Gallia quam bello uicisset aut Caesari aut omnino populo Romano negotii esset his responsis ad Caesarem relatis iterum ad eum Caesar […]

De fait, ce passage de la Guerre des Gaules peut être effectivement considéré comme très représentatif de l'œuvre de César. On relève des noms propres connus (Galliae, Caesar, Gallia) ou des noms communs correspondant à la réalité militaire du moment (bello, commeatu). Toutefois, la méthode ne permet pas de repérer des structures caractéristiques de la langue et du style de César, comme par exemple une proposition participiale marquant la transition entre épisodes dans une négociation : His responsis ad Caesarem relatis, « Ces réponses ayant été rapportées à César ».

2.5. Evaluation
Calcul naïf ou calcul élaboré : nous récapitulons quelques performances. Dans un corpus contrastif, nous calculons le score de super-spécificité de chaque passage en fonction des différents auteurs comparés (Tableau 1). Par exemple, pour le français, sans aucun filtre, 58% des passages du corpus de la présidentielle sont attribués justement à leur auteur ; et en ne considérant que les spécificités positives, le score descend à 52%. A l'opposé, en imposant le double filtre de la catégorie grammaticale (seulement les substantifs) et de l'indice de spécificité (seulement les spécificités positives), nous élevons le taux de bonne attribution à 89% pour le français et 82% pour le corpus latin du LASLA.

Tableau 1. Taux d'attribution ADT et taux de prédiction deep learning
3. Deep learning : à la recherche de nouveaux marqueurs linguistiques
3.1. Convolution et déconvolution, les principes
Le découpage du texte en segments de taille fixe est une méthode qui peut aussi être utilisée pour entraîner un réseau de neurones. Chaque segment devient alors une image d'un texte que le réseau va utiliser pour apprendre (Ducoffe et al., 2016) et faire ensuite de la prédiction. Sur nos deux corpus de référence (français et latin), les taux de précision convergent rapidement et atteignent le même niveau que ceux obtenus avec l'ADT (Figure 1). Si nous connaissons les paramètres à faire varier pour optimiser la détection des passages-clefs ADT, ceux issus du deep learning sont complètement non supervisés et découverts automatiquement par le réseau. L'idée des réseaux à convolution est de proposer un modèle capable de faire automatiquement une abstraction performante des données [1]. La convolution utilise pour cela un mécanisme de filtres qui va lire le texte avec une fenêtre coulissante pour extraire à chaque fois une partie de la matière linguistique présente dans la fenêtre (Figure 2). Avec des centaines de filtres de tailles différentes, le texte est lu en utilisant tous les empans linguistiques possibles, et le mécanisme de back-propagation [2] finit par accorder un certain poids à certains éléments du texte qui le pousse à prendre la bonne décision. Le deep learning est souvent considéré comme une boîte noire, faute de pouvoir mettre en évidence précisément ces éléments. Nous avons donc ici concentré nos efforts sur la déconvolution. Ce mécanisme, utilisé notamment en analyse d'images, permet de démêler le réseau et de lui redonner une forme interprétable par l'humain. Notre modèle est composé d'une couche de pré-apprentissage (Mikolov et al., 2013) pour la représentation des mots en vecteurs, d'une couche de convolution (Kim, 2014), d'un max-pooling pour compresser l'information et enfin d'un réseau classique de perceptron à une couche cachée pour la classification (Figure 2). La déconvolution est en fait une simple copie partielle de ce réseau (jusqu'à la convolution) à laquelle on ajoute à la fin une transposée de la convolution. On copie bien sûr le poids de chaque neurone après l'entraînement dans cette copie de réseau, et on obtient un nouveau réseau dont la couche de sortie correspond au résultat de chaque filtre de la convolution. Une simple somme de ces filtres pour chaque mot nous donne un indice d'activation du mot dans son contexte. Au final, nous observons ici des zones de texte s'activer plus ou moins suivant l'importance que leur a accordée le réseau.

[1] L'abstraction des données peut être considérée comme les saillances lexicales d'un texte qui lui donnent une identité propre.
[2] Correction de l'erreur à chaque phase d'apprentissage.

Figure 2. Convolution et déconvolution d'un passage du discours d'E. Macron
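À titre purement indicatif, l'architecture et la déconvolution décrites peuvent se schématiser ainsi avec Keras. Esquisse simplifiée, et non le code d'Hyperbase : un seul filtre de taille 3 y remplace les centaines de filtres de tailles différentes, et les dimensions comme les noms E, X et y sont purement illustratifs.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    L, DIM, V, AUTEURS = 50, 300, 30000, 5                # dimensions purement illustratives
    E = np.random.rand(V, DIM).astype("float32")          # à remplacer par des vecteurs word2vec
    X = np.random.randint(0, V, size=(100, L)).astype("int32")   # segments encodés (factices)
    y = np.random.randint(0, AUTEURS, size=(100,))               # auteurs (factices)

    entree = layers.Input(shape=(L,), dtype="int32")
    emb = layers.Embedding(V, DIM, embeddings_initializer=tf.keras.initializers.Constant(E),
                           trainable=False)(entree)       # pré-apprentissage (Mikolov et al., 2013)
    conv_layer = layers.Conv1D(128, 3, padding="same", activation="relu")  # convolution (Kim, 2014)
    conv = conv_layer(emb)
    pool = layers.GlobalMaxPooling1D()(conv)              # max-pooling : compression de l'information
    cache = layers.Dense(128, activation="relu")(pool)    # perceptron à une couche cachée
    sortie = layers.Dense(AUTEURS, activation="softmax")(cache)
    modele = Model(entree, sortie)
    modele.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    modele.fit(X, y, epochs=1, verbose=0)                 # apprentissage (réduit ici à une époque)

    # Déconvolution : copie partielle du réseau (jusqu'à la convolution), complétée
    # d'une transposée de la convolution qui reprend les poids appris.
    deconv_layer = layers.Conv1DTranspose(DIM, 3, padding="same", use_bias=False)
    lecteur = Model(entree, deconv_layer(conv))
    deconv_layer.set_weights([conv_layer.get_weights()[0]])   # mêmes poids, opération transposée
    # somme des filtres pour chaque mot = indice d'activation du mot dans son contexte
    indices = tf.reduce_sum(lecteur(X[:1]), axis=-1)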
3.2. Résultats et perspectives
A la lecture des résultats, nous voyons que le modèle identifie, sans surprise, des mots que le traitement statistique avait calculés comme spécifiques. Mais pas seulement. Certaines zones éclairées par le réseau semblent relever d'une nouvelle forme de lecture du texte. Nous pouvons illustrer ce constat avec un extrait des vœux d'E. Macron le 31 décembre 2017 :

[...] une transformation en profondeur de notre pays advienne à l'école pour nos enfants, au travail pour l'ensemble de nos concitoyens, pour le climat, pour le quotidien de chacune et chacun d'entre vous. Ces transformations profondes ont commencé et se poursuivront avec la [...]

Dans ce passage, les mots transformation et notre, fortement spécifiques de Macron, sont activés : ici, il n'y a pas de plus-value heuristique par rapport à l'ADT. De même, le segment répété chacune et chacun, très spécifique, est repéré par le réseau. Mais il y a aussi les mots pays et advienne, qui ne sont pas statistiquement spécifiques de Macron et qui ont pourtant fortement contribué à la reconnaissance du passage. Si l'on regarde maintenant les activations autour de ces mots ciblés, on voit que c'est une expression formée de plusieurs mots, pas forcément contigus, qui est repérée par le réseau. Il semble donc que le deep learning ait identifié des structures phraséologiques ou motifs linguistiques sensibles aux occurrences et à leur organisation syntagmatique. Plus loin, la visualisation du passage dans son ensemble met au jour une topologie textuelle ou un rythme auxquels le deep a été sensible (Figure 3).

Figure 3. Déconvolution : observation de la topologie d'un passage

4. Conclusion
L'ADT et le deep learning ne sont peut-être pas des continents étrangers l'un à l'autre (Lebart, 1997). Cette contribution, en croisant approche statistique et réseau de neurones, nous a permis d'identifier des passages-clefs et peut-être des motifs susceptibles de nourrir nos traitements textuels. Si les observables qui ont présidé à la détection de passages-clefs par l'ADT (les spécificités lexicales) sont connus et éprouvés, les zones d'activation du deep learning semblent relever de nouveaux observables linguistiques. Rappelons que la matière linguistique et la topologie des passages ne sauraient renvoyer au hasard : les zones d'activation permettent d'obtenir des taux de reconnaissance de plus de 90 % sur le discours politique français et de 85 % sur le corpus du LASLA, soit des taux équivalents ou supérieurs aux taux obtenus par le calcul statistique des passages-clefs. Reste désormais à améliorer le modèle et à en comprendre tous les aboutissants mathématiques comme linguistiques. La première amélioration que l'on se propose désormais d'implémenter est l'injection d'informations morphosyntaxiques dans le réseau afin de mettre à l'épreuve des motifs linguistiques toujours plus complexes.

References
Ducoffe, M., Precioso, F., Arthur, A., Mayaffre, D., Lavigne, F. et Vanni, L. (2016). Machine learning under the light of phraseology expertise: use case of presidential speeches, de Gaulle - Hollande (1958-2016). Actes de JADT 2016, pages 155-168.
Kim, Y. (2014). Convolutional neural networks for sentence classification. EMNLP, pages 1746-1751.
Lafon, P. (1984). Dépouillements et statistiques en lexicométrie. Genève-Paris, Slatkine-Champion.
Lebart, L. (1997). Réseaux de neurones et analyse des correspondances. Modulad (INRIA Paris), 18, pages 21-37.
Longrée, D. et Mellet, S. (2013). Le motif : une unité phraséologique englobante ? Étendre le champ de la phraséologie de la langue au discours. Langages, 189, pages 65-79.
Mellet, S. et Longrée, D. (2009). Syntactical motifs and textual structures. Belgian Journal of Linguistics, 23, pages 161-173.
Mellet, S. et Longrée, D. (2012). Légitimité d'une unité textométrique : le motif. Actes de JADT 2012, pages 715-728.
Mikolov, T., Chen, K., Corrado, G. et Dean, J. (2013). Efficient estimation of word representations in vector space. ArXiv : 1301.3781.
Quiniou, S., Cellier, P., Charnois, T. et Legallois, D. (2012). Fouille de données pour la stylistique : cas des motifs séquentiels émergents. Actes de JADT 2012.
Rastier, F. (2007). Passages. Corpus, 6, pages 25-54.
Salem, A. (1987). Pratique des segments répétés. Essai de statistique textuelle. Paris : Klincksieck.

Déconstruction et reconstruction de corpus... À la recherche de la pertinence et du contexte
Lucie Loubère
Lerass, Université de Toulouse – lucie.loubere@iut-tlse3.fr

Abstract
Faced with corpora made of large sets of texts, we propose a selection method based on the identification of text segments relevant to a topic by successive classifications, then on the recomposition of the corpus with all the texts having at least one relevant segment. This approach makes it possible to preserve the contextualizations and narrative discourses surrounding a theme while excluding off-topic texts.

Résumé
Face aux corpus constitués de grands ensembles de textes, nous proposons une méthode de sélection, basée sur l'identification de segments de textes pertinents pour une thématique par classifications successives, puis recomposition du corpus avec l'intégralité des textes ayant au moins un segment pertinent. Cette démarche permet ainsi de conserver les contextualisations et discours narratifs entourant une thématique tout en excluant les textes hors sujet.

Keywords: Big corpus, Reinert classification, Iramuteq

1. Introduction
La multiplication d'outils d'extraction de contenus numériques ou l'abonnement des universités aux bases de données de presse sont autant de raisons favorisant la création de corpus de grande taille. À ces facilités grandissantes s'opposent de nouvelles difficultés. L'hétérogénéité des contenus mis à disposition par une communauté, les algorithmes de recherche des bases de données, ou simplement les limites d'ambiguïté des requêtes ajoutent de nombreux bruits à nos corpus. Nous proposerons ici une méthode s'appuyant sur une identification de contenu par classifications successives (Ratinaud et Marchand, 2015), puis une régénération du corpus par concaténation de l'intégralité des articles contenant au moins un segment de texte (ST) dans le matériel identifié comme pertinent.

2. Problématique
La sélection de corpus par classifications successives, en utilisant comme unité le segment de texte, permet d'obtenir un sous-corpus pertinent pour une thématique (Loubère, 2014 ; Ratinaud et Marchand, 2015). Cependant, lorsque le corpus de départ est constitué de textes au contenu narratif structuré et délimité (article de presse, blog, argumentaires dans une concertation…), ce processus peut supprimer les éléments périphériques au thème étudié. Ces contenus restent pourtant pertinents pour la compréhension de l'objet d'étude, mais peuvent être classés avec le bruit des textes hors sujet dès les premières étapes de sélection. L'objectif de cette méthode est donc d'exclure le bruit des textes hors sujet tout en conservant le contexte d'évocation de la thématique principale.

3. Méthodologie
Le processus proposé ici se décompose en trois étapes :
1) Numérotation des textes par un identifiant en métadonnée ;
2) Extraction des segments de texte propres à notre thématique par classifications successives. Cette étape repose sur la classification hiérarchique descendante (CHD) de type Reinert (Reinert, 1983) proposée par le logiciel Iramuteq (Ratinaud, 2009). En permettant de faire émerger les mondes lexicaux, ce traitement nous permet de sélectionner les segments concernant notre thématique, puis de les re-soumettre à une CHD afin de préciser le corpus. Cette étape est reconduite jusqu'à avoir une classification dont toutes les classes concernent la thématique étudiée ;
3) Recomposition du corpus par concaténation des articles apparaissant au moins une fois dans l'extraction finale de l'étape 2 (voir l'esquisse ci-après).
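À titre purement indicatif, les étapes 1) et 3) peuvent se schématiser ainsi en Python, la CHD de l'étape 2) restant effectuée dans Iramuteq. Les noms articles et segments_finaux sont des hypothèses ; la ligne étoilée « **** *id_N » correspond au format de métadonnées d'Iramuteq.

    # 1) Numérotation : chaque article reçoit un identifiant en métadonnée,
    #    sous forme de ligne étoilée au format attendu par Iramuteq.
    with open("corpus_numerote.txt", "w", encoding="utf8") as fichier:
        for i, article in enumerate(articles):            # articles : liste des textes (hypothèse)
            fichier.write(f"**** *id_{i}\n{article}\n\n")

    # 2) Classifications successives : exports depuis Iramuteq des segments des
    #    classes pertinentes, re-soumis à une CHD jusqu'à stabilité thématique.

    # 3) Recomposition : concaténation de l'intégralité des articles ayant au
    #    moins un segment dans l'extraction finale.
    ids_retenus = {id_article for id_article, segment in segments_finaux}  # couples (id, ST) : hypothèse
    with open("corpus_recompose.txt", "w", encoding="utf8") as fichier:
        for i, article in enumerate(articles):
            if f"id_{i}" in ids_retenus:
                fichier.write(f"**** *id_{i}\n{article}\n\n")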
4. Exemple empirique
Dans les parties qui suivront, nous présenterons une mise en application de cette méthode sur un corpus utilisé lors de notre thèse (Loubère, 2018). Il est constitué d'une extraction d'articles de presse quotidienne nationale (Libération, L'Humanité, Le Monde, La Croix, Le Figaro) portant sur la thématique du numérique éducatif, du 01/01/2000 au 31/12/2014. Afin de couvrir le plus d'informations possible, la requête exécutée sur la base de données d'Europresse retournait tous les articles contenant au moins un terme éducatif dans la liste : collège, lycée, école, éducation, et au moins un terme numérique dans la liste : numérique, informatique, multimédia, TICE.

4.1. Les classifications successives
Cette extraction retourna 18 804 articles, auxquels nous avons retiré 875 doublons. Le corpus exploité ici est donc constitué de 17 929 articles représentant 450 815 segments de texte, sur lesquels nous avons apposé en métadonnée le numéro de l'article source. Nous allons présenter ici les classifications successives. Nous avons effectué une CHD de 20 classes en phase 1 et un minimum de 1000 ST par classe ; nous obtenons 16 classes représentant 99,72 % du corpus. Le résultat obtenu est présenté sur le dendrogramme en illustration 1.

Illustration 1 : dendrogramme de la première CHD

Ce premier découpage montre une séparation en trois blocs. Le premier est composé des personnalités publiques, le second est composé par des thématiques extérieures à notre sujet. En effet, de nombreux articles contiennent les termes de notre requête sans être pour autant dans le domaine éducatif (ou numérique). Ainsi, les classes 9 et 8 regroupent les actualités ou dossiers portant sur le domaine de la culture. Nous citerons comme exemple non exhaustif d'article de ce domaine un article du journal Le Monde commentant les sorties cinématographiques, dans lequel nous relèverons « les enfants privés d'école jouant dans les rues » et, pour un autre film, « les décors numériques ». Nous retrouvons sur le même principe les classes 6, 5 et 13, traitant des conflits armés détruisant les lycées et relatant une infériorité numérique. Enfin, le troisième bloc présente une classe centrée sur le numérique (classe 12), deux classes centrées sur l'éducatif (11 et 10) et deux classes sur l'aspect législatif et économique (classes 1 et 2). Afin de pouvoir affiner ces thématiques et les possibles interactions, nous avons choisi de conserver le bloc entier, soit les segments composant les classes 1, 2, 10, 11, 12 et 14. L'export précédent nous a permis d'obtenir 194 966 segments de texte, sur lesquels nous avons effectué une deuxième CHD de 15 classes en phase 1 et un seuil minimal de 100 ST. Nous obtenons 14 classes portant sur 99,97 % des segments. Le résultat est présenté en illustration 2. Ce deuxième découpage reprend une structure en trois groupes.
Ici, nous relevons le contexte économique du marché du numérique (classes 14, 5 et 6).

Illustration 2 : dendrogramme de la deuxième CHD

Le second bloc (classes 4, 3, 7, 8, 10) est constitué des différents discours témoins de la numérisation de la société. Le troisième groupe, séparé du reste du corpus par le premier facteur, est centré sur le champ éducatif. Les trois premières classes à se détacher partagent un discours sur l'après-formation et le recrutement (classes 9, 2 et 1). La classe 11, constituant 10,3 % du corpus, est centrée sur l'éducation primaire et secondaire, alors que la classe 12 porte sur l'enseignement supérieur et la recherche. Notre étude portant sur le système scolaire secondaire, nous ne conserverons que la classe 11 pour l'étape suivante. L'export de cette dernière constitue un corpus de 20 167 segments de texte, sur lesquels nous avons effectué une CHD de 15 classes en phase 1 et un minimum de 100 ST par classe. Nous obtenons 8 classes rapportant 99,22 % des segments. Ce dendrogramme, structuré en deux blocs, nous montre une séparation entre un discours centré sur l'aspect structurel de l'éducation (classes 8, 6, 4, 3) et celui traitant de l'enseignement (classes 2, 1, 5, 7).

Illustration 3 : dendrogramme de la troisième CHD

Dans la partie structurale, nous retrouverons les segments de texte traitant des réformes sous un angle gouvernemental (classe 8), suivis de tout le discours se regroupant autour des aspects temporels, comme le temps de travail mais également les rythmes scolaires (classe 6). La classe 3 constitue un discours sociologique sur l'éducation ; nous y retrouvons de nombreuses statistiques étudiant les répartitions sociales dans les différents cursus. Enfin, la classe 4 traite des établissements scolaires dans leur diversité. Les autres classes portent toutes sur le domaine pédagogique : la classe 7 concerne les contenus d'enseignement ; la classe 5 traite de la mise en place d'outils numériques parascolaires (jeux éducatifs, fiches de révision), alors que la classe 2 est centrée sur la mise en place de formations à distance. Enfin, la classe 1 porte le discours sur le numérique dans l'éducation : les mots-clés employés dans notre requête y sont tous surreprésentés. Nous ne conserverons donc que les segments composant cette classe. L'extraction de cette dernière classe nous permet d'obtenir 2072 segments, sur lesquels nous avons effectué une CHD de 20 classes en phase 1 avec un seuil de 100 ST par classe. Cette classification nous a montré une réelle stabilité de la thématique. En effet, les 8 classes exposées portent chacune sur un aspect du numérique éducatif.

Illustration 4 : dendrogramme de la quatrième CHD

4.2. Classification du corpus recomposé
Le corpus recomposé des 2902 articles contenant au moins un segment de texte dans la classe 1 de la troisième CHD est constitué de 72 460 segments. Une CHD de 20 classes en phase 1 et un minimum de 800 ST par classe nous donne le dendrogramme suivant :

Illustration 5 : dendrogramme de la CHD sur le corpus recomposé

Nous y retrouvons donc, au-delà des discours sur l'utilisation du numérique dans les établissements, un discours sur l'économie reflétant le marché du numérique éducatif et les frais engendrés par les dotations des établissements ; un discours à la frontière de la culture et de l'éducation, avec les formations de ces domaines empreintes de numérique.
Mais également un discours sur l'actualité géopolitique mondiale, contextualisant des initiatives où le numérique apporte des solutions éducatives lors de ségrégations ethniques ou d'éloignements géographiques. Tous ces mondes lexicaux constituent des éléments du discours social sur notre sujet, qu'une étude réduite aux segments ciblés lors des CHD successives ne permettrait pas d'explorer.

5. Conclusion
Le principe des CHD successives, s'il nous permet d'accéder finement aux segments contenant le discours sur le numérique éducatif, nous éloigne d'une compréhension globale du sujet. En effet, interroger les bases de données de presse sur une longue période et une sélection de presse généraliste apporte une quantité importante de documents hors contexte. Ces données portent des éléments contextuels communs avec les articles traitant de notre sujet (personnalités politiques, discours économique…) ; la proximité lexicale des segments de ces champs structure les classes de discours communes aux articles portant sur notre sujet ou non. Cette hétérogénéité, associée à l'insécurité d'un grand ensemble (Geffroy et Lafon, 1982) nous empêchant une connaissance du corpus antérieure à l'analyse lexicométrique, conduit « à tracer un peu trop vite une autoroute » (Geffroy et Lafon, 1982, p. 140) jusqu'à notre classe 1 finale. Ce phénomène questionne la constitution d'un corpus sur une dimension architextuelle, alors même que l'outil de classification utilisé ici joue sur un niveau intertextuel et cotextuel (Rastier, 2015), rapprochant des passages de textes en fonction de leur structure lexicale. La présence de textes aux sujets hétéroclites fait ressortir de façon précoce des thématiques, indépendamment de leur hypothétique poids dans le corpus qu'aurait constitué une sélection de textes centrés sur notre sujet. Ainsi, les segments traitant de sujets de politique générale ou exposant le contexte social d'un pays dans les articles traitant du numérique éducatif sont classés avec ceux des articles hors sujet. Cette difficulté éloigne le chercheur de la compréhension d'un discours. La démarche que nous venons de présenter nous permet de nous rapprocher d'un positionnement de textomètre (Pincemin, 2012), sélectionnant les segments pertinents par une démarche inductive, mais en conservant l'unité sémantique du texte dans la construction du corpus final.

Bibliography
Geffroy, A. et Lafon, P. (1982). L'insécurité dans les grands ensembles. Aperçu critique sur le vocabulaire français de 1789 à nos jours d'Etienne Brunet. Mots, 5(1) : 129-141.
Loubère, L. (2014). Le traitement des TICE dans les discours politiques et dans la presse. Présenté aux 12èmes Journées internationales d'Analyse statistique des Données Textuelles.
Pincemin, B. (2012). Sémantique interprétative et textométrie. Texto ! Textes et Cultures, 17(3) : 1-21.
Rastier, F. (2015). Arts et sciences du texte. Paris : Presses universitaires de France.
Ratinaud, P. (2009). IRAMUTEQ : Interface de R pour les Analyses Multidimensionnelles de TExtes et de Questionnaires. Consulté à l'adresse http://www.iramuteq.org
Ratinaud, P. et Marchand, P. (2015). Des mondes lexicaux aux représentations sociales. Une première approche des thématiques dans les débats à l'Assemblée nationale (1998-2014). Mots. Les langages du politique, (2) : 57-77.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2) : 187-198.

L'apport du corpus-maquette à la mise en évidence des niveaux descriptifs de la chronologie du sens. Essai sur une Série Textuelle Chronologique du Monde diplomatique (1990-2008)
Heba Metwally
Université d'Alexandrie, Égypte – heba.metwally77@gmail.com

Abstract
Chronological corpora, and particularly time series (Lebart et Salem, 1994), organize the textual data in corpora according to their natural sequence in time. Today, scholars increasingly work with chronological corpora, following the democratization of access to big data. Lexicometry develops into stylometry, textometry and logometry, and statistical data analysis integrates the observation of cooccurrential systems and lexical networks in their complexity. This improves the analysis of semantic contents according to their localisation in the semantic strata. This contribution aims to enhance the description of the chronology of meaning. The study is based on a corpus of more than 5000 articles (ca. 11 million tokens) published in the Monde diplomatique between January 1990 and December 2008. To analyze big chronological corpora, we propose a scale model of the chronological corpus, obtained by compressing the initial corpus to its most frequent nouns. The compression procedure is duplicated in the four sub-corpora of relevant semantic stability. We obtain two descriptive levels of chronology: the synthetic level of dominant contents and the analytical level of the four chronological phases of meaning. The two levels are intended to respond to different investigations on time and meaning. Working on sets of scale models that are either connected horizontally (chronological sequence) or vertically (the synthetic perspective clarified by an analytic perspective) enlarges our field of observation and deepens our understanding of chronological data in particular and of the unfolding of text in general.

Keywords: chronological corpus – logometry – logogenesis – clustering – method Reinert – corpus semantics – media analysis

Résumé
474 JADT’ 18 L’apport du corpus-maquette à la mise en évidence des niveaux descriptifs de la chronologie du sens. Essai sur une Série Textuelle Chronologique du Monde diplomatique (1990-2008). Heba Metwally Université d’Alexandrie, Égypte – heba.metwally77@gmail.com Abstract Chronological corpora and particularly time series (Lebart et Salem 1994) organize the textual data in corpora according to their natural sequence in time. Today, scholars are interfacing increasingly with chronological corpora following the democratization of access to big data. The lexicometry develops into stylometry, textometry and logometry. And statistical data analysis integrates the observation of co-occurrential systems and lexical networks in their complexity. This improves the analysis of semantic contents according to their localisation in the semantic strata. This contribution aims to enhance the description of the chronology of meaning. The study is based on a corpus of more than 5000 articles (ca 11 millions of tokens) published in the Monde diplomatique between January 1990 and December 2008. To analyze big chronological corpora we propose a scale model of the chronological corpus by compressing the initial corpus to its most frequents nouns. The compression procedure is duplicated in the four sub-corpuses of relevant semantic stability. We obtain two descriptive levels of chronology: the synthetic level of dominant contents and the analytical level of the four chronological phases of meaning. The two levels are intended to respond to different investigations on time and meaning. Working on sets of scale models that are either connected horizontally (chronological sequence) or vertically (the synthetic perspective clarified by an analytic perspective) enlarges our field of observation and deepens our understanding of chronological data in particular and the unfolding of text in general. Keywords: chronological corpus – logometry – logogenesis – clustering – method Reinert – corpus semantics – media analysis JADT’ 18 475 Résumé Les corpus chronologiques et a fortiori les Séries Textuelles Chronologiques (Lebart et Salem, 1994) organisent les données textuelles dans le corpus selon leur enchaînement naturel dans le temps. La banalisation des corpus textuels et l’accès facilité et accéléré au big data multiplient les corpus chronologiques, puisque finalement toute production textuelle s’étale dans le temps. La lexicométrie – au sens classique – doublée de la stylométrie, de la textométrie voire de la logométrie, et la statistique occurrentielle enrichie par un outillage cooccurrentiel (Viprey, 1997), (Mayaffre, 2014), la voie est ouverte aujourd’hui à une observation améliorée des contenus sémantiques qui gagnent en visibilité grâce aux tentatives parfois incontrôlées de leur objectivation. Cette contribution a pour objectif de contribuer à la description de la chronologie des contenus sémantiques. On s’appuie sur un corpus d’articles du MD (1990-2008). On compte plus de 5000 articles et plus de 11 millions d’occurrences. On propose pour cela le recours à un corpus-maquette, une compression du corpus chronologique intégral à partir des noms les plus fréquents. Cette démarche de compression est reproductible dans les souscorpus des périodes de stabilité sémantique. On obtient deux niveaux descriptifs de la chronologie, à savoir le niveau global, synthétique des contenus dominants et le niveau subordonné, analytique des sens particuliers des phases transitoires du discours. 
Les deux niveaux répondent à un questionnement différent sur le temps, en multipliant les pistes d'interrogation et en articulant le niveau synthétique et son niveau analytique.

Mots-clés : corpus chronologique – logométrie – logogénétique – classification – méthode Reinert – sémantique de corpus – analyse de discours médiatique

1. Introduction
Dans la tradition lexicométrique, les STC (Séries Textuelles Chronologiques) problématisent les investigations sur le temps [1]. Ce type de corpus est né, dans les études à caractère historique, du questionnement sur le changement dans le discours au fil du temps. Et les travaux d'André Salem [2] témoignent de l'intérêt porté à la description des corpus textuels chronologiques. Pour ce faire, André Salem généralise les STC, décrit la particularité des sorties machines des analyses statistiques qu'elles produisent (AFC ; calcul de spécificités), introduit la notion de « temps lexical » et conçoit une gamme de calculs visant, dans un premier temps, la « mise en évidence et la mesure du stock lexical au cours du temps » (Salem, 1988 : 118) et, dans un second temps, la caractérisation des périodes dans une STC.

[1] « Nous appelons séries textuelles chronologiques ces corpus homogènes constitués par des textes produits dans des situations d'énonciation similaires, si possible par un même locuteur, individuel ou collectif, et présentant des caractéristiques lexicométriques comparables. » (Lebart et Salem, 1994 : 217)
[2] Cf. (Salem, 1988, 1991, 1993, 1994)

Plus généralement, la particularité des STC est de concilier la linéarité du texte et du temps, et la sérialité du corpus. Si tous les corpus sont partitionnés en séries pour permettre la comparaison, ces séries ont l'avantage de conserver l'ordre naturel des textes qui s'échelonnent – sans conflit – dans le corpus et dans le temps. Aujourd'hui, le champ des observables est constamment élargi grâce à l'évolution des outils informatiques et au progrès de la tokenisation, pour embrasser progressivement des niveaux descriptifs textuels que le chercheur filtre ou articule à sa guise. La lexicométrie est enrichie et mise à jour par la textométrie et la logométrie, dont le projet est de dépasser la lexie vers les textes, le discours et le sens. Le sens est objectivable grâce à la formalisation de la cooccurrence et à son baptême comme unité minimale de contextualisation, i.e. de sens (Mayaffre, 2008). Dès lors, la statistique occurrentielle se double de la statistique cooccurrentielle. La cooccurrence devient unité de décompte généralisée à laquelle s'appliquent les calculs statistiques traditionnels (Brunet, 2012). Des applications d'ADT de tradition benzécriste se développent pour appréhender les réseaux lexicaux dans leur complexité. La cooccurrence généralisée (Viprey, 1997, 2005, 2006) se donne une visée exploratoire et la méthode Alceste (Reinert, 1983, 1993) procède à la démarche classificatoire des réseaux lexicaux structurants des textes. C'est dans ce cadre des progrès de la méthodologie et de la technologie qu'une sémantique de corpus (Rastier, 2011) est envisageable. Ce champ d'investigation intéresse naturellement les études chronologiques, qui peuvent désormais observer le mouvement des contenus sémantiques dans le temps pour comprendre l'impact du temps dans la thématisation d'une Série Textuelle Chronologique [3].

[3] Ce point précisément constitue la problématique de notre thèse de doctorat intitulée « Les thèmes et le temps dans Le Monde diplomatique (1990-2008) », soutenue le 11 décembre 2017 à l'Université Côte d'Azur (UCA) à Nice.
Pour l'objectivation des fonds sémantiques [4] du discours, on sollicite la méthode Alceste implémentée dans le logiciel libre Iramuteq (Ratinaud et Marchand, 2012), qui s'articule à Hyperbase. Pour une visualisation améliorée des topics du discours, on propose de recourir à une maquette du corpus et de ses sous-corpus. Au sens propre, la maquette est une représentation en trois dimensions, à échelle réduite, qui reste fidèle dans ses proportions. Ici, dans le cas des corpus textuels, la maquette est une compression du corpus intégral, qui se réduit à ses noms les plus fréquents. A partir d'une STC du Monde diplomatique (1990-2008), cette contribution se donne deux objectifs. Dans un premier temps, elle vise à mettre en exergue les deux niveaux descriptifs complémentaires de la chronologie du sens, à savoir la chronologie des contenus dominants (3.) et la logogénétique (4.), tout en relevant l'intérêt de l'étude conjointe de ces deux niveaux. Dans un second temps, il s'agit également de mettre à l'épreuve notre proposition de la maquette. On recherche une visualisation améliorée des contenus sémantiques structurants grâce au recours à une maquette, reproduction grossière et fidèle des textes, dont l'usage spécifique sera illustré dans les lignes suivantes.

[4] Les fonds sémantiques sont les isotopies ou les macrostructures sémantiques sur lesquelles se détachent les formes sémantiques que sont les thèmes. Cf. (Rastier, 2011 : 24)

2. Du corpus intégral à la maquette du sens et du temps
Le choix du Monde diplomatique pour l'étude de l'évolution du sens s'appuie sur la richesse et la stabilité de son contenu. La période couverte par cette étude marque un moment historique important, à savoir le monde après la chute du Mur de Berlin. En plus, cette période se caractérise par une continuité éditoriale [5]. Bref, nous avons affaire à un discours stable, sans complexe, qui à l'examen multidimensionnel épouse un schéma évolutif classique, sans ruptures [6]. On estime que la stabilité du discours est un facteur indispensable à l'étude de l'évolution, celle-ci reposant principalement sur la continuité. La finalité de ce travail, à savoir l'étude de la chronologie du sens d'un gros corpus textuel, préside à la conception de la maquette.

[5] Il s'agit du mandat d'Ignacio Ramonet, qui est directeur de la publication de janvier 1990 à mars 2008.
[6] Par examen multidimensionnel, on entend l'AFC de la distance entre les textes, qui dans le cas des données sérielles reproduit une forme parabolique baptisée parabole Guttman, symbolique du mouvement linéaire des données ordonnées dans le temps. Cf. (Salem, 1991)
We posit here, without generalizing, that media discourse, by its informative vocation and its reference to the world, structures its content in a privileged way around nouns. The nominal class (common nouns and proper nouns) is the largest grammatical class in the corpus, covering 28.9% of the corpus surface. It also shows distributional stability across the CTS. Its absolute numerical weight and its balanced distribution satisfy the criterion of statistical representativeness. (In our doctoral work (Metwally, 2017) we studied the contents of the frequency classes of the full corpus in order to understand the numerical hierarchy of the lexicon, and we analyzed the grammatical structure of the data and its distribution within the CTS.) A comparison with other corpora (Labbé and Monière, 2003; Mayaffre, 2004) of the lists of the most frequent lemmas sorted by part of speech likewise confirms the characterizing power of the nominal class in general and of proper nouns in particular.

We therefore rely on the nominal class and on the frequency argument to reduce the full corpus to its 380 most frequent nouns. The procedure leaves the partitions of the corpus and the sequence of the texts intact, so as to respect the sequential structure of the texts and the chronological design of the corpus. Both guarantee the authenticity of the textual corpus; only their preservation allows us to examine the working hypothesis that governed the corpus design. To explain this philological operation, simple in its principle: it consists in placing a mask over the entire text except for the 380 most frequent nouns.

The procedure is then repeated within subcorpora of semantic stability. Semantic stability can be measured endogenously by computing the distance between texts on the basis of the minimal form of thematic signification, the cooccurrence. The intertextual distance computed on the cooccurrences among the nouns of the maquette reveals four periods, which define four subcorpora, themselves reduced in turn to maquettes. This endogenous periodization establishes the semantic time according to which the full corpus and its maquette are remodeled. (We allow ourselves to speak of semantic time by analogy with André Salem's lexical time (1988): semantic time is the rhythm according to which semantic contents organize themselves over time, measured here by the intertextual distance computed on cooccurrences.) Table 1 below summarizes the lexical structure of the corpus, the subcorpora and their maquettes. Each maquette covers approximately 9.8% of the surface of its original corpus; this stability of numerical representativeness licenses comparison across the data.

Table 1: Summary of the lexical structure of the corpus, the subcorpora and their maquettes

corpus / subcorpus    size (N = occurrences)    vocabulary (V = words)    maquette (V = nouns)    maquette (size)
1990-1993             2,697,013                 67,989                    307                     266,439
1994-1997             2,402,434                 67,571                    282                     218,643
1998-2001             2,552,998                 70,954                    290                     229,119
2002-2008             3,765,908                 86,032                    375                     382,298
1990-2008             11,418,356                140,690                   380                     1,115,311

We thus end up with a complex two-level apparatus: the global level of the semantic contents of the whole chronological span under study, whose dynamics can be examined (section 3); and the analytical level, chronologically ordered, of the stable semantic phases, which allows both the observation of the movement of semantic contents and their confrontation with the global synthetic level (section 4).
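The maquette reduction itself requires very little machinery. The following sketch is purely illustrative: it assumes tokens already POS-tagged as (form, pos) pairs with an invented tagset, keeps the N most frequent nouns, and masks everything else while preserving token order and the period partition, as described above.

from collections import Counter

def build_maquette(periods, n_nouns=380, noun_tags=("NOUN", "PROPN")):
    # periods: dict period -> list of (form, pos) tokens in text order
    freq = Counter(
        form
        for tokens in periods.values()
        for form, pos in tokens
        if pos in noun_tags
    )
    keep = {form for form, _ in freq.most_common(n_nouns)}
    # Mask everything except the selected nouns; order and partitions intact.
    return {
        period: [form for form, pos in tokens
                 if pos in noun_tags and form in keep]
        for period, tokens in periods.items()
    }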
The study of the semantic backgrounds becomes feasible by mobilizing cooccurrence statistics, which bring out the relevant semantic structures. The descending hierarchical classification (DHC) applied to the maquette and to its sub-maquettes makes observable the stabilized "lexical worlds" (Reinert, 1993, 2008) of the global meaning and of its transitory phases (see the dendrograms in Figures 1, 3 and 4).

3. The dynamics of the dominant contents

The usual approach in chronological studies rests first on a static study of the global meaning, followed by a dynamized view. The static views are a provisional methodological artifice intended to bring out the semantic contents that stabilize at the end of a dynamic movement. Grasping the global meaning answers the question of the dominant, consensual contents, those that persist from one period to the next and survive across 19 years of article production.

To analyze the semantic structure of the maquette, we feed Iramuteq the global maquette, in which the 380 most frequent nouns follow one another on the syntagmatic axis in their order of appearance, and whose partitions give the corpus a chronological structure matched to its semantic time. Iramuteq then cuts the text into text segments of configurable size. The choice of the extent of the text segments (TS) is crucial, since the TS are the utterances analyzed and classified by the Alceste method. For these context units we estimated a run of 10 nouns in the maquette corpus to be the equivalent, in the full corpus, of a contextual window of 33 words, an estimate based on the share of the nominal class in the whole corpus, 28.9% (see Metwally, 2017). We thereby target an intermediate space between the sentence and the paragraph. Once Alceste is run, it performs a DHC crossing the TS and the nouns, classifying the TS by their dominant lexical character.

Figure 1: The lexical worlds of the maquette (1990-2008). (A few verbs can be spotted in the class lists: partir, produire, revenir, sentir, passer. This is an error due to a lemmatization performed by Iramuteq despite our attempts to prevent it; the items are in fact the nouns parti, produit, revenu, sens, passé.)

We impose demanding parameter settings on the algorithm so as to obtain a sufficiently rich reading grid. Of 15 classes requested, 8 are stabilized at the end of phase 1 (Figure 1). The machine outputs of the DHC are numerous. The dendrogram representation corresponds to the classification stricto sensu, and it is enriched with additional information that highlights the DHC. We begin with a rapid identification of the semantic structure of the discourse and of the hierarchy of information. Through its binary logic of representation, the dendrogram opposes the economic contents, the largest with 41.5% of the classified TS, to the non-economic contents. The latter divide into political themes (35.2% of the classified TS) and the themes of Man (23.3% of the classified TS), socio-cultural themes dealing with historical and cultural subjects and with societal questions.
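The segmentation into 10-noun text segments and the splitting logic of the descending classification can be sketched as follows. This is a deliberately minimal illustration of the principle, one correspondence-analysis split on a presence/absence segment-by-noun matrix; it is not a reimplementation of Iramuteq's Alceste procedure, which stabilizes classes over repeated splits and parameter settings.

import numpy as np

def segments(nouns, size=10):
    # Cut the maquette into consecutive 10-noun text segments (TS).
    return [nouns[i:i + size] for i in range(0, len(nouns) - size + 1, size)]

def reinert_split(segs):
    # Presence/absence TS x noun matrix, as in the Alceste method.
    vocab = sorted({w for seg in segs for w in seg})
    idx = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(segs), len(vocab)))
    for i, seg in enumerate(segs):
        for w in set(seg):
            X[i, idx[w]] = 1.0
    # Correspondence analysis of the matrix; split segments on the first axis.
    P = X / X.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    axis1 = U[:, 0] * sv[0] / np.sqrt(r)   # row coordinates, first CA axis
    return axis1 > 0                       # the two candidate "lexical worlds"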
Following the descending logic of the classification, specialized classes stabilize that further characterize the three identified semantic domains. Within the economic classes, a socio-economic class devoted to questions of employment and work takes shape (class 8: « emploi », « travail », « chômage », « salaire », « syndicat »); it stands apart from the two macro-economic classes, one dealing with domestic economies and the economic machinery of countries (class 2: « développement », « industrie », « concurrence », « secteur »), the other with the world economy (class 7), covering finance and the economic performance of countries on the world market (« dollar », « banque », « dette », « prix », « croissance »). Attached to the same branch as the political themes, the lexical worlds of Man display a variation that separates philosophical and/or ideological questions about history and culture (class 1: « histoire », « siècle », « monde », « culture », « sens », « conscience », « passé ») from the everyday life of human beings in this world (class 6: « femme », « enfant », « victime », « quartier », « violence », « police », « vie », « école »).

If the analysis of meaning necessarily involves a provisional suspension of the serial structure of the corpus, querying the partitions of the maquette on their share in the lexical classes restores the temporality that defines the corpus. Projecting the classes onto the periods of semantic stability brings out the dynamics of the classes and the thematization of each period, finally allowing inferences about the evolution of meaning. The lexical classes follow different tendencies over time. The themes of power (classes 4 and 5) form an important informational axis that undergoes hardly any quantitative variation. The class of international politics (class 3) shows an exceptional positive peak in the last period.

Figure 2: Periods and classes of the maquette (deviations in chi-squared)

It is the economic and the socio-historical contents that are traversed by two opposite evolutionary logics. Ordering the positive bars highlights the large positive peaks, exclusive to the first two periods, of two economic classes. This importance fades progressively, and in the last period the largest deficits are precisely those of the economic classes. Facing the regression of the economic contents, progression is reserved to the socio-historical contents (classes 1 and 6). A changing thematic color follows from one period to the next. The economic contents that mark the 19 years following the fall of the Berlin Wall come mostly from the first two periods, while the two following periods show socio-historical centers of interest, mingled in the third period with economic themes and in the last period with the global events of international politics.

To the naked eye, the histogram of the dynamics of the global meaning divides into two distinct and asymmetric evolutionary moments. Quantitatively, the over-use seen in the first half of the series is never matched by a comparable over-use in the second half. Qualitatively, the majority contents of the first part are technical contents belonging to the most important informational axis, a technical axis of macro-level visions.
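Before turning to the second half of the series, note that period profiles of the kind shown in Figure 2 rest on a simple computation: signed chi-squared deviations of a contingency table crossing lexical classes and periods. A minimal sketch, with placeholder counts rather than the paper's data:

import numpy as np

counts = np.array([          # rows: lexical classes, columns: periods
    [420, 390, 260, 210],    # e.g. an economic class
    [180, 200, 260, 340],    # e.g. a socio-historical class
])
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
# Signed chi-squared contribution per cell: positive = over-represented.
deviation = np.sign(counts - expected) * (counts - expected) ** 2 / expected
print(deviation.round(1))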
By contrast, the dominant contents of the second half of the series are more varied and express a growing interest in philosophical and humanistic subjects. A general movement seems to shift the focus from the world order toward human beings and the meaning of their lives in the world.

Here the description of the chronology of meaning reaches its limits, for the dominant contents observed here are precisely the consensual ones, those that find expression in every period in proportions that ultimately guarantee their quantitative superiority. The dynamic movement of these contents thus leads back to a question about the periods specific to them. That said, we posit that the dynamics of the dominant contents necessarily rests on the particular meanings of those periods. The study of the subordinate level of the genetics of discourse (immediately below) is certainly instructive for a more detailed analysis of the semantic specificity of each period. The study of the formation of meaning also informs us about the relation between the particular, temporary meaning and the general, dominant one; it is indispensable for completing and clarifying our observations on evolution.

4. Logogenetics, or the genetics of discourse

The word logogenetics takes up the English logogenesis, whose meaning and interest Halliday (1994) spells out in the following terms:

"It is helpful to have a term for this general phenomenon – i.e. the creation of meaning in the course of the unfolding of text. We shall call it logogenesis, with 'logos' in its original sense of 'discourse' (see Halliday & Matthiessen, 1999: 18; Matthiessen, 2002b). Since logogenesis is the creation of meaning in the course of the unfolding of a text, it is concerned with patterns that appear gradually in the course of this unfolding; and the gradual appearance of patterns is, of course, not limited to single texts but is rather a property of texts in general instantiating the system of language." (Halliday, 1994: 601)

Logogenetics, the genetics of discourse, allows us to reconnect with the linguistic models that run through the text and contribute to its formation. Concretely, in the time-ordered observation and confrontation of the DHCs of the four sub-maquettes we see a great opportunity to reconstruct the semantic models proper to the periods of semantic stability, the models that found the general movement of meaning and its stabilization at the global level over time. We reuse the parameters of the DHC of the global maquette on the four sub-maquettes and obtain the dendrograms below (Figures 3 and 4). A careful examination of the internal structure of the sub-maquettes of meaning can offer analytical reading grids for the dominant contents, for their dynamics and for their formation. We cannot exhaust the heuristic value of these dendrograms here, and we confine ourselves to underscoring the main contribution of this approach to the description of meaning, without claiming to carry out an in-depth analysis of meaning; the latter would have to rest on a systematic study of the lexical networks, which exceeds the objective of this contribution.

Figure 3: The lexical worlds of the first two periods

Figure 4: The lexical worlds of the last two periods

The first point to underline is the permanence of the fundamentals of the discourse and the fixed number of lexical worlds that stabilize from one period to the next.
This stability of the semantic structure ratifies the pertinence of studying evolution, which necessarily takes place within a stable environment. Let us observe the evolution of the hierarchy of information from one period to the next. The graph below (Figure 5) reports the weight of each semantic domain within the classified TS. The comparison is instructive from period to period, and between the level of the sub-maquettes and the higher level of the global maquette.

Figure 5: Evolution of the weight of the fundamentals of the discourse over time (in percentages)

Whatever the period, the political contents remain the most dominant. Examining the internal distribution of the political classes, we note the weight of the international-politics classes, which consistently number two (Figures 3, 4), as against the global level, which has only one (Figure 1, class 3). It is the scale of the international-politics classes in the sub-maquettes that gives the political themes their superiority; and yet this is not the case at the global level. The reason lies mainly in the conjunctural nature of international events: the American wars of the first and last periods, the security questions topical in Europe after the fall of the Berlin Wall, the Kosovo war in the third period, and the Israeli-Palestinian conflict with its variants and its ebbs and flows over time (see the contents of the lexical classes, Figures 3, 4). So many events are specific to certain periods, and not all of them manage to stabilize at the global level so as to characterize the 19 years; hence the predominance of political contents in the sub-maquettes and their retreat at the global level.

The economic contents, by contrast, show the opposite tendency. At the global level they occupy the top of the hierarchical pyramid, with three classes. At the subordinate level of the sub-maquettes they come second, and fall to third place in the last period. The number of their classes fluctuates between three and one. Curiously, it is the maximal variety in the number of economic classes that ends up stabilizing at the global level. Unlike the international-politics themes, the economic themes have more durable prolongations: it suffices to observe the dendrograms of the sub-maquettes to locate in time the sources of the three economic classes of the global maquette.

As the evolution of the hierarchy of information clearly shows (Figure 5), the socio-historical themes keep growing, overtaking the economic themes in the last period. This finding is fully compatible with the dynamics of the global meaning (Figure 2), where we observed the record deficits of the economic themes and the significant over-use of the socio-historical classes. Note also that the latter grow both quantitatively and qualitatively: it is exclusively in the last period that two socio-historical classes appear. In this last period, class 6, characterized by « enfant » and « femme », resembles class 6 of the global maquette (Figure 1), whereas the neighbouring class (class 2), lexicalized by « science », « recherche », « individu », « pratique », has no lexical equivalent at the global level. These are emergent contents with no precedent in the CTS.
The vocabulary of class 2 sits halfway between the societal and the social. The most characteristic TS of the class sheds light on its rhetorical particularity. On the occasion of the 2007 G8 Summit, whose theme was "growth and responsibility", Le Monde diplomatique published a tract calling for a generalized cultural revolution. We widen the observation window beyond the limits of the TS in order to better identify the semantic content. (While the TS is limited to a run of 10 nouns among the 380 most frequent nouns of the corpus, reading does not stop at the boundaries of the TS: it starts from them. According to Rastier (2007), the passage, an islet of relevance, "has no fixed bounds, and its span obviously depends on the point of view that determined its selection" (p. 31). Our settings target the paragraph, i.e. the period, which belongs to the mesotextual level, the locus where themes are observed and objectified; reading then follows, along the syntagmatic axis, the development of a theme from one TS to the next. In the original layout, only the nouns specific to class 2 were set in red.)

"When, here again, will worldwide initiatives be launched by a few courageous countries (we are waiting for France) to wrong-foot the old temptation of subjugating research to objectives dictated by a few manipulators, and to send researchers, on the contrary, to the assault of the new vital questions: such as, in the human sciences, the forms of anthropological, political and democratic legitimacy that would suit a world-society in formation; such as, in the technological sciences, the necessary break with the great energy-devouring systems, a break that would tomorrow allow societies (local, urban, regional) to secure their food and energy autonomy without withdrawing from the worldwide conversation made possible by the instantaneous circulation of data? In short, the worst reflexes of defensive solidarity can no longer hide the questions that are now immediately planetary: the question we shall no longer hesitate to call simply nature, that support of terrestrial life which has become the principal point of resistance against the mirage of money-value; the question of culture, identity-related and artistic as much as scientific, which constitutes, at least on a par with a material production that is now technologized, a vast universe of essential activities whose open logic cannot be subjugated to industrial or financial yield without peril for civilized humanity and for its democratic plurality; and finally the crucial question of societies more autonomous with respect to the techno-chrematistic whirlwind, societies that will in the future be so many sources of more stable employment, of activities less wasteful of energy and less polluting, and also of political conversations closer to the citizens." (August 2007)

The most specific TS belongs to a passage that calls for a generalized cultural revolution, one that takes on the most urgent societal and civilizational questions and promotes alternative solutions. The revolution is that of scientific culture: what is urgent is an overhaul of the dominant, single-track thinking in every domain.
Everything is to be reinvented: theories of reference for a world-society other than globalization, economic theories in the service of societies and of human beings, other, bioethical technologies that respect nature, all this in order to remain faithful to democratic culture. This passage gives an idea of the semantic color of this class, exclusive to the last period and escaping the global meaning. More generally, the socio-historical contents undergo a qualitative turn over time. On the dendrograms (Figures 3, 4) one can identify their free placement between the political themes and the economic themes from one period to the next: in the first two periods, the questions about history and the human condition are mobilized by the political situation, whereas the regressing economic contents of the last two periods attract the socio-historical themes.

5. Conclusion

Relating the semantic structure of the sub-maquettes to the dynamics of the dominant contents illuminates the formation of the global meaning and its logic. In other words, the dynamization of the global meaning through the projection of the lexical classes onto the chronology constitutes an intermediate level between the level of the sub-maquettes, that of the stable semantic phases and their particular meanings, and the synthetic level of meaning, which finally stabilizes at the global level after the accumulation of particular meanings.

What we wished to illustrate here is, precisely, the value of resorting to a maquette: a reasoned reduction of the corpus to its most frequent nouns, a scale model replicated in the subcorpora of semantic stability. This device, coupled with cooccurrence statistics targeting the structuring lexical networks, gives rapid access to the semantic backgrounds, the first condition for practicing a corpus semantics. The maquette marks out a corpus semantics that proceeds from the global to the local (Rastier, 2001). More concretely, if the cooccurrence is the minimal interpretant grasped within the passage (Rastier, 2007), we have assigned it the task of measuring semantic time in order to determine the phases of semantic stability in which the stabilized lexical worlds (Reinert, 1993, 2008) can be observed. The latter are the maximal interpretants objectifiable at the level of the maquette and the sub-maquettes.

The maquette as we conceive it does not amount to a generalizable model but to a generalizable use, a use which, for each corpus, contributes to reconstructing its semantic model, whatever its specificity, and to fulfilling the purpose of its design. Here, in the case of chronological corpora, the maquette reconciles the study of meaning with the study of time. While the former requires delinearizing the text and capturing its non-sequential structure, the latter follows the sequential organization of the texts. As a device designed for a predefined use, the maquette integrates the study of the non-sequential into the sequential and erases the false contrast between them.

References

Brunet E. (2008). Les séquences (suite). JADT 2008.
Brunet E. (2012). Nouveau traitement des cooccurrences dans Hyperbase. Corpus (11).
Halliday M. A. (1994). Introduction to Functional Grammar. London: Edward Arnold.
Lebart L. and Salem A. (1994). Statistique textuelle. Paris: Dunod.
Mayaffre D. (2008a). Quand 'travail', 'famille', 'patrie' co-occurrent dans le discours de Nicolas Sarkozy.
Etude de cas et réflexion théorique sur la cooccurrence. JADT 2008.
Mayaffre D. (2008b). De l'occurrence à l'isotopie. Les co-occurrences en lexicométrie. Sémantique & syntaxe (9).
Mayaffre D. (2014). Plaidoyer en faveur de l'Analyse des Données co(n)textuelles. Parcours coocurrentiels dans le discours présidentiel français (1958-2014). JADT 2014.
Metwally H. (2017). Les thèmes et le temps dans Le Monde diplomatique (1990-2008). Doctoral thesis, Université Côte d'Azur.
Rastier F. (2001). Arts et sciences du texte. Paris: PUF.
Rastier F. (2007). Passages. Corpus (6), pp. 25-54.
Rastier F. (2011). La mesure et le grain. Sémantique de corpus. Paris: Champion.
Ratinaud P. and Marchand P. (2012). Application de la méthode ALCESTE aux « gros » corpus et stabilité des « mondes lexicaux » : analyse du « CableGate » avec IRAMUTEQ. JADT 2012.
Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2), pp. 187-198.
Reinert M. (1993). Les « mondes lexicaux » et leur « logique » à travers l'analyse statistique d'un corpus de récits de cauchemars. Langage et société (66), pp. 5-39.
Salem A. (1988). Approches du temps lexical. Statistique textuelle et séries chronologiques. Mots (17), pp. 105-143.
Salem A. (1991). Les séries textuelles chronologiques. Histoire & Mesure, VI (1/2), pp. 149-175.
Salem A. (1993). De travailleurs à salariés. Repères pour une évolution du vocabulaire syndical (1970-1993). Mots (63), pp. 74-83.
Salem A. (1994). La lexicométrie chronologique. In Actes du colloque de lexicologie politique « Langages de la Révolution ». Paris: Klincksieck.
Viprey J.-M. (2005). Corpus et sémantique discursive : éléments de méthode pour la lecture des corpus. In A. Condamines (ed.), Sémantique et corpus. Paris: Lavoisier.
Viprey J.-M. (2006). Structure non-séquentielle des textes. Langages (183).

Séries textuelles homogènes

Jun Miao 1, André Salem 2
1 Université Lumière de Lyon 2, France – miaojun@miaojun.net
2 Université de la Sorbonne nouvelle - Paris 3, France – salem@msh-paris.fr

Abstract
Textometric methods, widely used for the study of large corpora, are applied here to a set of small texts which nonetheless present homogeneous characteristics. Our study focuses on a chronological textual series consisting of the reports of the successive congresses of the CCP (Chinese Communist Party) during the period 1982-2017. The textometric methods are first used to highlight the changes that occurred at the 2017 congress. Secondly, we apply these same methods to subcorpora consisting of collections of fragments, automatically extracted from each congress and related to the same topic. The subcorpora thus constituted make it possible to observe, with greater efficiency, the contextual variations that occur over time around the same type. The method can be extended to any corpus consisting of fragment systems that present a certain level of homogeneity.

Keywords: Textual series, Chinese political speeches, homogeneous subcorpora

Résumé
Nous appliquons ici des méthodes textométriques, largement utilisées pour l'étude de vastes corpus, à des ensembles de textes dont la taille est réduite mais qui présentent de fortes caractéristiques d'homogénéité. Notre étude porte sur une série textuelle chronologique constituée par les rapports successifs des congrès du PCC (Parti Communiste Chinois) durant les années 1982-2017.
Les méthodes textométriques sont d'abord mises en œuvre pour mettre en évidence les changements survenus lors du congrès de 2017. Dans un deuxième temps, nous appliquons ces mêmes méthodes à des sous-corpus, constitués par la réunion de fragments extraits de chacun des congrès et relatifs à un même thème. Les sous-corpus ainsi constitués permettent d'observer avec une efficacité accrue des variations contextuelles qui surviennent au fil du temps autour d'une même forme-pôle. La méthode peut être appliquée à tout corpus constitué de systèmes de fragments présentant une certaine homogénéité entre eux.

Mots-clés : séries textuelles, discours politique chinois, sous-corpus homogènes

1. Introduction

(The analyses reported below were carried out with the Lexico5 software. Cedric Lamalle, William Martinez and Serge Fleury contributed substantially to the development of its functionalities; the authors wish to thank them.)

The growth of textometric capacities now makes it possible to explore with profit extremely large and often heterogeneous collections of texts. We have nonetheless insisted, along with others, on the value of applying these same methods to corpora made up of textual productions with strong characteristics of homogeneity, which are necessarily smaller for that reason (Salem, 1991). Beyond the chronological series from which we borrow our examples, the approach we present can be applied to corpora of various kinds.

For several decades, the National Congress of the Chinese Communist Party (CCP) has been held once every five years. It is the Party's highest body, in which important decisions are announced. (The article by Salem and Wu (2008) is a chronological study covering all CCP congresses from the Party's foundation in 1921 up to 2012. Beyond the chronological evolutions it brought to light, that study shows the heterogeneous character of the congress form when considered on so large a scale.) Over the last decade, commentaries and quantitative analyses of the CCP congress texts, more or less grounded in statistical methods, have multiplied in the press and on various websites.

The corpus we study gathers the texts produced at the CCP congresses between 1982 and 2017. For reasons we analyze, the texts produced during this period are highly homogeneous in size as well as in the themes they address and in the style they employ. We begin with a classical study of the chronological series PCC1982-2017, divided by congress, in order to bring out variations in vocabulary use. We then propose a method that allows, in our view, the closest possible study of variations in the immediate context of a given term.

2. Chronological analysis of the PCC1982-2017 series

The corpus thus constituted totals 1,151,338 occurrences for 7,365 distinct forms. (The continuous textual sequence of the Chinese texts, composed of juxtaposed characters (scriptio continua, in which words are not separated by spaces), was submitted to the NLPIR automatic segmenter (Zhang, 2016), very widely used in the Chinese-speaking world, to be segmented into graphic words.) The division by congress partitions the corpus into eight parts. The lengths of the parts range, for each congress, between 2,400 and 2,900 occurrences. The form of maximal frequency is always 的 (de, DE1), whose marked decline across the congresses can be verified. (We explain in a parallel study how this progressive decline can be related to the evolution of writing style.)
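The preprocessing just described, segmenting Chinese scriptio continua into graphic words before any counting, can be sketched as follows. The paper uses the NLPIR segmenter; the widely available jieba tokenizer is substituted here purely as a stand-in, and the file names are invented.

from collections import Counter
import jieba   # stand-in for the NLPIR segmenter used in the paper

corpus = {}    # congress year -> list of graphic-word tokens
for year in (1982, 1987, 1992, 1997, 2002, 2007, 2012, 2017):
    with open(f"pcc_{year}.txt", encoding="utf-8") as f:  # invented paths
        corpus[year] = [w for w in jieba.cut(f.read()) if w.strip()]

# e.g. the relative frequency of 的 (de, DE1) congress by congress
for year, tokens in corpus.items():
    counts = Counter(tokens)
    print(year, round(counts["的"] / len(tokens) * 100, 2), "%")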
2.1 The 2017 congress

When a new congress completes a pre-existing chronological series, the specificity method answers the question: what are the main lexical evolutions that occurred at the last congress of the series? This is an operation of lexical monitoring. The specificity computation applied to the 2017 congress flags positive specificities whose content is distinctly lexical: 时代 (shídài, era, S +24), 治理 (zhìlǐ, to govern, S +21), 生态 (shēngtài, ecology, S +15), 梦 (mèng, dream, S +14). (In our examples, the native Chinese form is followed by its pinyin transcription and then by a gloss, which cannot claim the status of a satisfactory translation for every occurrence of the term. A specificity coefficient of the form S +/- xx indicates the degree of positive or negative specificity of the form in the part of the text considered.) Conversely, the negatively specific forms for this same period are rather grammatical forms, such as 的 (de, DE1, S -38), 这 (zhè, this, S -22) and 地 (de, DE2, S -14).

The same computation applied to the repeated segments of the corpus refines the picture of the changes at this congress. The spotlighting of the term 新 时代 (xīn shídài, new era), used 36 times at the 2017 congress, has been widely commented on by the analysts who examined this text. (Many articles published on that occasion explicitly mentioned the frequency, 36 occurrences, of the formula 新 时代, e.g. Vandepitte (2017); other sites invited readers to rank the congresses by the frequency of several terms repeated in each congress (Qian, 2017).) A systematic survey of the strongly specific segments for this period brings out repeated sequences, some of which may have escaped the commentators, and which likewise constitute neologisms with respect to the preceding congresses: 新 时代 中国 特色 社会主义 (socialism with Chinese characteristics for the new era; 13 occ., S +12), 治理 体系 (governance system; 13 occ., S +12). More remarkable in our eyes, certain expressions that were extremely common in the preceding periods have completely disappeared from the text of the last congress. Such is the case, for example, of segments like 有 中国 特色 (to possess Chinese characteristics; 0 occ., S -7) and 有 中国 特色 社会主义 (socialism with Chinese characteristics, in its older phrasing; 0 occ., S -5).

The specificity analysis also makes it possible to locate the parts of the text in which lexical renewal proves particularly important. In Figure 1, a section map has been drawn for each congress divided into chapters. Sections appear darker the more occurrences they contain of terms specific to the last congress. The representation shows that, within the last congress, renewal does not proceed uniformly.
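The repeated segments on which the second computation operates are recurring word n-grams; once counted, they can be scored with the same specificity model as single forms. A minimal, illustrative sketch:

from collections import Counter

def repeated_segments(tokens, n_max=5, min_freq=2):
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(2, n_max + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {seg: f for seg, f in counts.items() if f >= min_freq}

# e.g. repeated_segments(corpus[2017]).get(("新", "时代")) would return the
# frequency of xin shidai ("new era") in the 2017 congress.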
Part of the vocabulary specific to the 2017 congress was already largely present in the two preceding congresses. The map also makes it possible to locate precisely the chapters of the last congress that undergo the strongest lexical renewal.

Figure 2 below shows the evolution of vocabulary in the last period by combining a factorial representation of all the congresses with the specificities computed for the last congress. An analysis carried out on the eight congresses brings out the gradualness of the lexical changes. The positively specific forms of the last part have been projected as supplementary elements. This type of representation can be articulated with the section maps presented above to illustrate the lexical changes.

3. Using document structure

In each text of the original edition of the congresses, editorial markers (subheadings, numbering of subsections, etc.) allow a division into smaller units that we shall call chapters. Each chapter corresponds to the treatment of a particular theme (economic development, international outlook, state of the armed forces, etc.). At each congress these themes are addressed in turn, often in a similar order, which invites a global description of the organization of these congress texts.

Figure 1: Locating the characteristic portions of the last congress (2017). Reading guide: on the left, a section map built from the division into chapters; each row groups the chapters of one congress. The darkest squares correspond to the chapters most loaded with forms specific to the last congress (S+ > 10). Below the map, the text of the second chapter of the last congress, flagged as particularly loaded with specific forms:

同志 们 : ¶ 现在 , 我 代表 第十八 届 中央 委员会 向 大会 作 报告 . ¶ 中国共产党 第十九 次 全国 代表大会 , 是 在 全面 建成 小康 社会 决胜 阶段 * 中国 特色 社会主义 进入 新 时代 的 关键 时期 召开 的 一 次 十分 重要 的 大会 . ¶ 大会 的 主题 是 : 不 忘 初心 , 牢记 使命 , 高举 中国 特色 社会主义 伟大 旗帜 , 决胜 全面 建成 小康 社会 , 夺取 新 时代 中国 特色 社会主义 伟大 胜利 , 为 实现 中华民族 伟大 复兴 的 中国 梦 不懈 奋斗 /... /. ¶

Figure 2: Positive specificities of the 2017 congress highlighted in the correspondence analysis. Reading guide: the successive congresses spread out over time along a parabola. This spread results from the substantial renewal of vocabulary from one congress to the next. The forms most specific to the last congress have been projected as supplementary elements.

3.1 Analysis by chapters

When the same corpus, divided this time into chapters, is submitted to typological analyses, chapters corresponding to the same themes but belonging to different congresses show a strong tendency to cluster together, because they use similar vocabularies. The chronological structure brought out by the analysis by congress gives way, in this case, to a typology of a thematic order. Figure 3 shows the results of a correspondence analysis performed on the PCC1982-2017 corpus divided into 89 chapters. In this figure the chapter identifiers consist of two parts: the first number indicates the congress from which the chapter is taken, the second the chapter's position within the congress.
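The factorial representations of Figures 2 and 3, a correspondence analysis with the most specific forms projected as supplementary elements, can be sketched in a few lines of linear algebra. This is a generic CA implementation, not Lexico5's; the second function applies the usual transition formula to place an extra form on existing axes without letting it shape them.

import numpy as np

def ca(table):
    # table: parts x forms contingency matrix
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U * sv / np.sqrt(r)[:, None]      # principal coordinates of parts
    cols = Vt.T * sv / np.sqrt(c)[:, None]   # principal coordinates of forms
    return rows, cols, sv

def supplementary_col(col_counts, rows, sv):
    # Transition formula: project a supplementary form onto the axes
    # computed by ca() without letting it influence them.
    profile = col_counts / col_counts.sum()  # distribution over the parts
    return (profile @ rows) / sv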
As Figure 3 shows, chapters corresponding to the same theme tend to cluster strongly.

Figure 3: Correspondence analysis of the corpus divided into chapters

By way of example, we have enlarged the portions of the graph corresponding to two thematic groups: (a) the one country, two systems group, which corresponds to a constant political orientation of the CCP, reaffirmed at every congress; (b) a group of chapters corresponding to the analysis of international relations, likewise an unavoidable moment of every congress from the 14th onward.

3.2 The thematic subcorpus "one country, two systems"

The next step consists in reiterating these same analyses on reduced subcorpora that gather only the chapters relating to a given theme. The textometric analyses performed on these homogeneous subcorpora yield particularly legible results, and in the analysis of this type of corpus the chronological dimension returns to the foreground. The subcorpus gathering the passages relating to the theme one country, two systems counts only some two thousand occurrences over all the congresses. The analysis of the forms that appear specifically in the contexts of this term nevertheless shows a clear evolution of its immediate context. The 1987 congress presents the formula as a principle to be implemented. In the following congresses the verbs maintain and continue appear (2002), then implement unswervingly (2007). In 2017, the point is to apply the principle one country, two systems fully and with precision. Figure 4 shows a projection of the various segments containing the expression onto the analysis carried out on the subcorpus. (The graph has been slightly modified for greater legibility: redundant segments were discarded and superimposed points slightly displaced.)

Figure 4: Lexical variations around the expression one country, two systems

4. Conclusion

Our experiments lead us to conclude that textometric analysis carried out on groupings of homogeneous fragments, collected around one theme across the years covered by a chronological series, produces results whose interpretation proves particularly easy. The great lexical homogeneity of the fragments brought together then allows very fine variations to be observed; it largely compensates for the reduced size of the corpus, a priori unfavourable for textometric studies. Beyond chronological textual series alone, the method can be used for any kind of corpus, in a wide variety of languages, provided that homogeneous thematic subsets can be distinguished.

References

Miao J. (2012). Approches textométriques de la notion de style du traducteur. Analyses d'un corpus parallèle français-chinois : Jean-Christophe de Romain Rolland et ses trois traductions chinoises. Doctoral thesis supervised by André Salem, Paris 3.
Qian G. (2017). 中共历届党代大会报告语象分析 (Lexical analyses of the reports of all congresses of the Chinese Communist Party). Lianhe Zaobao, 19 November 2017.
Salem A. (1991). Les séries textuelles chronologiques. Histoire & Mesure, VI (1/2), pp. 149-175.
Salem A., Wu L.-C. (2008). Essai de textométrie politique chinoise.
In André Salem and Serge Fleury (eds.), Lexicometrica – Explorations textométriques, Vol. 1. URL: http://lexicometrica.univ-paris3.fr/numspeciaux/special8.htm (accessed 5 February 2017).
Vandepitte M. (2017). Quatre choses à savoir sur la Chine – dans le cadre du XIXème congrès du Parti. Translated from Dutch into French by Anne Meert. Investig'Action, 15 November 2017. URL: goo.gl/8fgSkq (accessed 25 November 2017).

Software used:
Zhang H.P. (2017). NLPIR Chinese automatic segmenter. URL: http://www.nlpir.org/
Salem A. (2017). Lexico 5 textometric analysis tool. URL: http://www.lexi-co.com/index.html

TaLTaC in ENEAGRID Infrastructure

Silvio Migliori 1, Andrea Quintiliani 1, Daniela Alderuccio 1, Fiorenzo Ambrosino 1, Antonio Colavincenzo 1, Marialuisa Mongelli 1, Samuele Pierattini 1, Giovanni Ponti 1, Sergio Bolasco 2, Francesco Baiocchi 3, Giovanni De Gasperis 4
1 ENEA DTE-ICT – silvio.migliori@enea.it
2 Sapienza Università di Roma
3 Staff TaLTaC – info@taltac.it
4 Dip. DISIM, Università dell'Aquila

Abstract
The aim of this joint ENEA-TaLTaC project is to enable the TaLTaC user community and digital humanists to have remote access to the TaLTaC software through the ENEAGRID infrastructure. ENEA's research activities on the integration of language technologies (multilingual text mining software and lexical resources) into the ENEA distributed digital infrastructure provide a "community Cloud" approach within a digital collaborative environment and on an integrated platform of tools and digital resources, for the sharing of knowledge and the analysis of textual corpora in the economic and social sciences and the e-Humanities. Access to the TaLTaC software in its Windows and Linux versions will exploit the high computational capacity (800 Teraflops) of the e-infrastructure, which users access as a single virtual supercomputer.

Riassunto
Obiettivo del progetto congiunto ENEA-TaLTaC è consentire alla comunità degli utenti TaLTaC e ai ricercatori nelle Digital Humanities l'accesso remoto al software TaLTaC attraverso l'infrastruttura digitale ENEAGRID. Le attività di ricerca dell'ENEA sull'integrazione delle tecnologie linguistiche (software di Text Mining per testi multilingue e risorse lessicali) in ENEAGRID forniscono un approccio "community Cloud" in un ambiente collaborativo digitale e su una piattaforma integrata di strumenti e risorse digitali, per la condivisione delle conoscenze e l'analisi di corpora testuali in Scienze Economiche e Sociali ed e-Humanities. L'accesso al software TaLTaC in versione Windows e Linux sfrutterà l'elevata capacità computazionale (800 Teraflops) dell'infrastruttura di calcolo, a cui gli utenti accedono come ad un unico supercomputer virtuale.

Keywords: Text Mining Software, Cloud Computing, Digital Humanities, Socio-Economic Sciences, Big Data.

1. Introduction

"TaLTaC in CLOUD" is a joint ENEA-TaLTaC project for the set-up of an ICT portal on the ENEA distributed e-infrastructure (Ponti et al., 2014) hosting the TaLTaC software (Bolasco et al., 2016, 2017). (The ENEAGRID infrastructure is based on several software components which interact with each other to offer an integrated distributed system, giving access to all these resources as a single virtual system, with an integrated computational availability of about 16,000 cores provided by several multiplatform systems.) Users will access the TaLTaC software (Windows and Linux versions) in a remote and ubiquitous way, together with the computational power (800 Teraflops) of ENEA's distributed ICT resources, as a single supercomputer. The aim of this joint project is to enable the TaLTaC user community and digital humanists to have remote access to the TaLTaC software through the ENEAGRID infrastructure, integrating ICT into digital cultural research.
ENEAGRID offers a digital collaborative environment and an integrated platform of tools and resources assisting research collaborations, for sharing knowledge and digital resources and for storing textual data. In this virtual environment, the TaLTaC software evolves from a stand-alone uniprocessor application toward a multiprocessor design, integrated in an ICT research e-infrastructure. Furthermore, it evolves toward implementing lexical and semantic knowledge and e-resources for ancient languages, addressing research needs and implementing solutions for Digital Humanities communities as well.

2. TaLTaC Software

The TaLTaC software package, conceived at the beginning of the 2000s, has been progressively developed to date in three major releases: T1 (2001), T2 (2005) and T3 (2016). It is widespread in the text analysis community in Italy and abroad, with over 1,000 licenses held by some two hundred entities, including university departments, research institutions and other organizations. The 2018 release of the software, T3, implemented the following priority objectives: (i) the processing of big data (around a billion words), achieving independence from the dimensions of the text corpora, limited only by hardware resources; (ii) the automatic extraction of results from text parsing (tokenization) on multiple layers: layer zero (text in the original version), layer 1 (recognition of words with automatic correction of accents), layer 2 (pre-recognition of the most common Named Entities), layer 3 (reconstruction of pre-defined multiwords); (iii) computing speed, taking advantage of the multi-core processing readily available on current computers (personal or cloud).

Table 1 shows the processing times of three parsings, up to layer 2, for increasingly large corpora on a PC (1 core and 8 cores) and on ENEAGRID. Preliminary results on ENEAGRID (1 core, CRESCO) show that as corpus size increases the saving of time grows even larger. TaLTaC was installed in the ENEAGRID infrastructure, but the computational capabilities of the HPC system are not yet exploited, because the current version of the software does not support multi-core. The present ENEAGRID capabilities therefore allow only multi-user access and computation; future versions of the software will be tested for multi-core capabilities in order to exploit the real power of ENEA's ICT high-performance computing.

Table 1. Preliminary results: processing times of three parsings on PC and on ENEAGRID.

corpus                               tokens (millions)   file size (GB)   MAC i7 (7th gen.) 1 core (min)   MAC i7 8 cores (min)   8 cores / 1 core   ENEAGRID 1 core, CRESCO (min)
1 "La Repubblica" (100k articles)    74                  0.41             3.4                              1.1                    0.33               3.5
2 "La Repubblica" (400k articles)    284                 1.55             13.0                             3.8                    0.29               13.2
3 Italian and French press           535                 2.89             37.4                             8.8                    0.24               41.3
4 Various press collection           1,138               6.18             88.2                             14.0                   0.16               54.7

For the characteristics of the technological architecture of the TaLTaC3 platform, see previous works
(Bolasco et al., 2016, 2017), which can be summarized as follows:

a1) HTML5 for the GUI, with jQuery and its derived JavaScript frameworks encapsulating the GUI user-interaction functions for the macOS and Cloud solution;
a2) a native Windows .NET desktop application;
b) JSON (JavaScript Object Notation) as the inter-module language standard, a structured and agile format for data exchange in client/server applications;
c) Python/PyPy, an advanced script/compiled programming language, mostly used for textual data analysis and natural language processing in the CORE back end;
d) No-SQL: the high-performance key/value data store Redis, adopted for the persistence of vocabularies and linguistic resources;
e) RESTful interfaces as the standard for data exchange over the HTTP web protocol;
f) multi-processing, exploiting multi-core hardware in the best possible way by distributing processing among different CPU cores.

The choice of the Python language allowed the development of a cross-platform computational core running on Windows, Linux and macOS. In particular, the overall system of software processes runs smoothly on a Linux-based cloud computing facility such as ENEAGRID. Furthermore, the Python code compiled through the 64-bit PyPy just-in-time compiler allows very efficient macro operations over large sets of data stored as hash dictionaries, so that the upper limits of performance and capacity are set only by the physical limits of the host machine in terms of RAM, number of cores and OS kernel scheduler. In our tests, each node of the ENEAGRID infrastructure hosted a single Redis instance and 24 logical cores, with 16 GB of RAM (a minimal sketch of this Redis-plus-multiprocessing pattern is given below).
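As an illustration of items c), d) and f) above, the following sketch persists a vocabulary as a Redis hash after counting tokens in parallel across CPU cores. It assumes a local Redis server; the key name and the toy tokenizer are invented and do not reflect TaLTaC's internal API.

from collections import Counter
from multiprocessing import Pool
import re
import redis   # assumes a Redis server on localhost:6379

def count_tokens(text):
    return Counter(re.findall(r"\w+", text.lower()))

def build_vocabulary(documents, redis_key="vocab:demo"):   # invented key
    with Pool() as pool:                 # one worker per available CPU core
        partial_counts = pool.map(count_tokens, documents)
    vocab = sum(partial_counts, Counter())
    redis.Redis().hset(redis_key, mapping=dict(vocab))  # persist as a hash
    return vocab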
3. ENEAGRID Infrastructure

ENEA's activities are supported by its ICT infrastructure, which provides advanced services such as High Performance Computing (HPC), Cloud and Big Data services, and communication and collaboration tools. These advanced ICT services build on ENEA's research and development in HPC, high-performance networking and data management, including the integration of large experimental facilities, with special attention to public services and industrial applications. As far as High Performance Computing is concerned, ENEA manages and develops ENEAGRID, a computing infrastructure distributed over six ENEA research centres, totalling about 16,000 cores and a peak computing power of 800 Tflops. The HPC clusters are mostly based on conventional Intel Xeon CPUs, with the addition of some accelerated systems such as Intel Xeon Phi and Nvidia GPUs. Storage resources include RAID systems for a total of 1.8 PB, in SAN/Switched and SRP/InfiniBand configurations. Data are made available by distributed, high-performance file systems (AFS and GPFS). The ENEA Portici Centre became one of the most important Italian HPC centres in 2008 with the CRESCO project (Computational RESearch Center for COmplex Systems). The CRESCO HPC clusters are used in many of ENEA's main research and development activities, such as energy, atmosphere and sea modelling, bioinformatics, materials science, critical infrastructure analysis, fission and fusion nuclear science and technology, and complex systems simulation. In 2015 and 2016 the CRESCO clusters provided more than 40 million core hours each year to ENEA researchers and technologists and to their external partners (external users account for about 30% of the total machine time). CRESCO6, the new HPC cluster recently installed in Portici in the framework of the 2015 ENEA-CINECA agreement, provides a peak computing power of 700 Tflops and is based on the new 24-core Intel SkyLake CPU. Its nodes will be connected by the new Intel OmniPath high-performance network, providing 100 Gbps of bandwidth. The ENEA ICT department also provides general-purpose communication, computing and collaboration tools and services, such as network management, e-mail, video conferencing and VoIP services, and cloud computing and storage. User-friendly access to scientific and technical applications (such as Ansys, Comsol, Nastran, Fluent) is provided by dedicated web portals (virtual laboratories) relying on optimized remote data access tools such as NX technology.

4. TaLTaC in the ENEAGRID Infrastructure

4.1 Software installation and access on the ENEA e-infrastructure

The TaLTaC software is available on Windows and Linux through ENEAGRID via AFS, a geographically distributed file system that allows remote access to every computing node of the HPC CRESCO systems and of the cloud infrastructure from anywhere in the world. This provides three capabilities: (i) data mining, sharing and storage; (ii) the ICT services necessary for the efficient use of HPC resources, collaborative work, visualization and data analysis; (iii) the deployment of the software and its settings for future data processing and analysis. Moreover, the availability of the software on the ENEA ICT infrastructure benefits from the advantages of AFS, such as scalability, redundancy and backup. Through ACL rules it is possible to manage the accessibility of the software for the community of users, in compliance with the licence policies that will be put in place. Two options are provided for running TaLTaC: the first is to use the applications installed in the Windows system, and the second is to use FARO2 – Fast Access to Remote Objects (the general-purpose web interface to hardware and software capabilities) to directly access the applications installed in the Linux environment, which refer to the data in AFS.

4.1.1 TaLTaC2 (Windows) via remote desktop access

The TaLTaC2 software is available on Windows Server 2012 R2 by remote desktop access to a virtual machine reachable through the general-purpose, intuitive ThinLinc interface. All users involved in the project activities can access the server, but only the person in charge of developing and installing the application can obtain administrator privileges. For this reason, AFS authentication is always required. Every TaLTaC2 user with AFS credentials can access ENEAGRID to run the software and to manage data in their own AFS areas via the web and from any remote location. In the AFS environment, an assigned disk area with large memory capacity is defined. This area is mainly used for the storage and sharing of large amounts of data (less than 200 MB), i.e. analyses, reports and documents produced by running the software on a single processor, in serial mode, or for future parallel data mining applications.

4.1.2 TaLTaC3 (Linux) on the CRESCO system

On the CRESCO systems, accessible from the ENEAGRID infrastructure, TaLTaC3 is available on CentOS Linux nodes, making it possible to leverage the overall computing power dedicated to the activities of the TaLTaC and Digital Humanities communities. Every user can start their own work session, allocating a node with a reserved Redis instance and as many computing cores as needed.
Performance improvements are obtainable through parallelization, so that a single user can exploit the full capacity of the assigned node in terms of number of computing cores. The TaLTaC3 package is started automatically by a shell script as the user logs in to the node. In the current beta version, the open-source Mozilla Firefox web browser provides the user interface. Access to the TaLTaC3 portal uses the ThinLinc remote desktop visualization technology, which allows an almost transparent remote session on the HPC system, including the graphical user interface, thanks to built-in features such as load balancing, accelerated graphics and platform-specific optimizations. Input and output data can be accessed through the ENEAGRID file systems and are therefore easily uploaded and downloaded.

4.2 Case studies

The ENEA distributed infrastructure (and its cloud services) enables the management of the research process in the economic and social sciences and the Digital Humanities, providing technological solutions and tools to academic departments and research institutes: building and analyzing collections to generate new intellectual products or cultural patterns, data or research processes, building teaching resources, and enabling collaborative work and interdisciplinary knowledge transfer.

4.2.1 The TaLTaC user community

The current (2018) TaLTaC community has aggregated over the years users from the computer laboratories of automatic text analysis and text mining, including those run within institutional bachelor's and master's degree courses, plus Ph.D. students from doctoral programmes at the universities of Rome "La Sapienza" and "Tor Vergata", Padua, Modena, Pisa, Naples and Calabria (an estimated total of over 1,300 students over the last eight years). A further set of users subscribed to specific tutorial courses dedicated to TaLTaC (more than 60 courses, for a total of 750 participants). A call about the opportunity of using the software "remotely" via the ENEA distributed computing facilities received manifestations of interest from 40 departments and other research institutes.

4.2.2 The Digital Humanities community as TaLTaC user

In collaboration with academic experts, ENEA has focused on Digital Humanities projects in text mining and analysis of the ancient writing systems of the Near East, and used TaLTaC2 to perform quantitative linguistic analysis on cuneiform corpora transliterated into the Latin alphabet (Ponti et al., 2017). Cuneiform was used by a number of cultures in the ancient Near East to write 15 languages over 3,000 years. The cuneiform corpus is estimated to be larger than the corpus of Latin texts, yet only about one tenth of the extant cuneiform texts have been read even once in modern times. This huge corpus and the restricted number of experts motivate the use of text mining and analysis, clustering algorithms and social network analysis in the TIGRIS Virtual Lab for Digital Assyriology (TIGRIS – Toward Integration of e-tools in GRId Infrastructure for e-aSsyriology; http://www.afs.enea.it/project/tigris/indexOpen.php; http://www.laboratorivirtuali.enea.it/it/prime-pagine/ctigris), a virtual research environment implemented in the ENEA research e-infrastructure. In the TIGRIS V-Lab, researchers perform basic tasks to extract knowledge from cuneiform corpora (e.g. dictionary extraction with word lists of toponyms, chrononyms, theonyms and personal names, grammatical and semantic tagging, concordances, corpus annotation, lexicon building, grammar writing, etc.).
5. Conclusions

Researchers and their collaborators can use the computational resources in ENEAGRID to perform their work regardless of the location of the specific machine or of the hardware/software platform employed. ENEAGRID offers computation and storage resources and services in a ubiquitous and remote way. It integrates a cloud computing environment and exports: a) remote software (e.g. TaLTaC); b) Virtual Labs: thematic areas accessible via the web, where researchers can find sets of software (and documentation regarding specific research areas); c) remote storage facilities (with the OpenAFS file system). In this virtual environment, the TaLTaC software evolves from a uniprocessor design toward a multiprocessor one, integrated in an ICT research e-infrastructure. This project leads TaLTaC to evolve from stand-alone software (allowing Text Mining & Analysis to search for linguistic constructions in textual corpora and show the results in a table or concordance list) to software that is "always and anywhere on", providing an interface where users can visualize results, create interpretative models, collaborate with others, combine different textual representations, store data and co-develop research practices. Furthermore, this project reflects the shift from an individual-researcher approach to a collaborative research-community approach, leading to community-driven software design, tailor-made to the needs of specific research communities, and to Community Cloud Computing. This interdisciplinary knowledge transfer enables the creation of new knowledge from big (cultural and socio-economic) data, in both modern and ancient languages.

TIGRIS - Toward Integration of e-tools in GRId Infrastructure for e-aSsyriology: http://www.afs.enea.it/project/tigris/indexOpen.php ; http://www.laboratorivirtuali.enea.it/it/prime-pagine/ctigris

References

Bolasco, S., Baiocchi, F., Canzonetti, A., De Gasperis, G. (2016). "TaLTaC3.0, un software multi-lessicale e uni-testuale ad architettura web", in D. Mayaffre, C. Poudat, L. Vanni, V. Magri, P. Follette (eds.), Proceedings of JADT 2016, CNRS University Nice Sophia Antipolis, Volume I, pp. 225-235.

Bolasco, S., De Gasperis, G. (2017). "TaLTaC 3.0 A Web Multilevel Platform for Textual Big Data in the Social Sciences", in C. Lauro, E. Amaturo, M.G. Grassia, B. Aragona, M. Marino (eds.), Data Science and Social Research: Epistemology, Methods, Technology and Applications (series: Studies in Classification, Data Analysis, and Knowledge Organization), Springer, pp. 97-103.

Ponti, G., Palombi, F., Abate, D., Ambrosino, F., Aprea, G., Bastianelli, T., Beone, F., Bertini, R., Bracco, G., Caporicci, M., Calosso, B., Chinnici, M., Colavincenzo, A., Cucurullo, A., Dangelo, P., De Rosa, M., De Michele, P., Funel, A., Furini, G., Giammattei, D., Giusepponi, S., Guadagni, R., Guarnieri, G., Italiano, A., Magagnino, S., Mariano, A., Mencuccini, G., Mercuri, C., Migliori, S., Ornelli, P., Pecoraro, S., Perozziello, A., Pierattini, S., Podda, S., Poggi, F., Quintiliani, A., Rocchi, A., Sciò, C., Simoni, F., Vita, A. (2014). "The Role of Medium Size Facilities in the HPC Ecosystem: The Case of the New CRESCO4 Cluster Integrated in the ENEAGRID Infrastructure". In: Proceedings of the International Conference on High Performance Computing and Simulation, HPCS (2014), ISBN: 978-1-4799-5160-4.
Ponti, G., Alderuccio, D., Mencuccini, G., Rocchi, A., Migliori, S., Bracco, G., Negri Scafa, P. (2017). "Data Mining Tools and GRID Infrastructure for Text Analysis", in R. De Boer and J.G. Dercksen (eds.), Private and State in the Ancient Near East: Proceedings of the 58th Rencontre Assyriologique Internationale, Leiden, 16-20 July 2012. Eisenbrauns. LCCN 2017032823 (print) | LCCN 2017034599 (ebook) | ISBN 9781575067858 (ePDF) | ISBN 9781575067841.

ENEAGRID: http://www.ict.enea.it/it/hpc
Laboratori Virtuali: http://www.ict.enea.it/it/laboratori-virtualixxx/virtual-labs
TIGRIS Virtual Lab: http://www.afs.enea.it/project/tigris/indexOpen.php
TaLTaC: www.taltac.it

The dimensions of Gender in the International Review of Sociology. A lexicometric approach to the analysis of the publications in the last twenty years

Isabella Mingo, Mariella Nocenzi
Sapienza University of Rome – isabella.mingo@uniroma1.it; mariella.nocenzi@uniroma1.it

Abstract 1 (in English)
The social sciences and, specifically, sociological research have progressively assumed the gender factor as one of the strategic keys to understanding contemporary phenomena. Indeed, whether as a variable for socio-statistical analysis or as a characterizing trait of individual identity, it is a decisive factor in the interpretation of deep social transformations, and it inspires sociologists' self-reflection about the analytical tools of their discipline. This contribution proposes, through a lexicometric approach, an analysis of the articles published in the last two decades by the oldest journal of sociology, published by Routledge. The main aim is to highlight the different ways in which gender issues are articulated in the international sociological research presented in the repertoire of the International Review of Sociology and to outline, both on the lexical level and on the topical level, the changes that occurred over time.

Abstract 2 (in French, Italian or Spanish)
Le scienze sociali e, nello specifico, la ricerca sociologica hanno progressivamente assunto il fattore del genere come una delle più strategiche chiavi di lettura dei fenomeni contemporanei. Si tratta, infatti, di un fattore che, quale variabile per l'analisi socio-statistica o come tratto caratterizzante dell'identità individuale, si rivela dirimente nell'interpretazione delle profonde trasformazioni sociali in atto e spunto per un'autoriflessione degli stessi sociologi sugli strumenti di analisi della loro disciplina. Il contributo propone, mediante un approccio lessico-metrico, un'analisi degli articoli pubblicati nelle ultime due decadi dalla più antica rivista di sociologia, edita da Routledge, con l'obiettivo di evidenziare i diversi modi con cui il concetto di genere viene declinato nelle ricerche sociologiche internazionali presentate nel repertorio dell'International Review of Sociology e di delineare, sia sul piano lessicale che su quello delle tematiche, i cambiamenti intervenuti nel corso del tempo.

Keywords: Gender, International Review of Sociology, Lexicometric Analysis, Textual Analysis, Social Change, Sociological Analysis

1. Introduction and the hypothesis of the paper

Ever since 1955, when in an influential paper the American scholar John Money (Money et al., 1955) coined the term gender to denote "those things that a person says or does to disclose himself or herself as having the status of boy or man, girl or woman", the social sciences have developed entire subfields and a wide range of topics to analyse it, with a variety of research methods.
Sociologists, in particular, have outlined specific theoretical approaches and have conducted many detailed studies to understand, first of all, what gender is and how it differs from sex. They share the view that while sex is a biological classification based on body parts, gender is a social classification based on one's identity, presentation of self, behavior, and interaction with others. Sociologists, hence, view gender as a learned behavior and a culturally produced identity and, for these reasons, define it as a "social" category. It has always been a very relevant category for the critical analysis of social construction, because one of the most important social structures is status, and one of the most strategic statuses is precisely gender. In the last decades, sociological theories and research based on gender have become more and more widespread, articulated, and integrated with other subfields of sociology and of the other social sciences. One of the most representative indicators of this development and specialization is not only the common recognition, and subsequent institution, of the sociology of gender as a subfield of sociology, but also the increasingly frequent use of gender as a reference concept in all the other sociological theoretical approaches to the analysis of the social system. The sociology of gender has itself studied many topics, with multiple research methods, including identity, social interaction, power and oppression, and the interaction of gender with race, class, culture, religion, and sexuality, among others. This paper aims to observe and, where possible, to interpret this progressive diffusion and specialization in the use of gender as a theoretical and research category through the publications of the International Review of Sociology, a sociological journal edited by Routledge with worldwide online and paper circulation, during the last two decades. This journal, the oldest review in the field of sociology in Europe, founded by René Worms in 1893 in Paris, still maintains – as its "Aims and scope" state – «the traditional orientation of the journal as well as of the world's first international academic organization of sociology, which started as an association of contributors to the International Review of Sociology: it assumes that sociology is not conceived apart from economics, history, demography, anthropology and social psychology. Rather, sociology is a science which aims to discover the links between the various areas of social activity and not just a set of empty formulas. Thus, the International Review of Sociology provides a medium through which up-to-date results of interdisciplinary research can be spread across disciplines as well as across continents and cultures» (see the journal's "Aims and scope" page: https://www.tandfonline.com/action/journalInformation?show=aimsScope&journalCode=cirs20). The authors propose to highlight the different ways in which gender issues are articulated in international sociological research, through an analysis of the articles published in the last two decades (1997-2017) in the International Review of Sociology. We consider the last two decades of publication not only because of the better accessibility of the International Review of Sociology catalogue.
For sociology, indeed, recent gender studies and research have registered a deeper specialization in terms of connection with other disciplines, unusual applications of the gender approach to some social phenomena, and the exploration of new research frontiers (multiple gender identities, gender-sensitive data arrangement, the non-alignment of sex and gender statuses, et similia).

2. Data and Methods

The analysis of the International Review of Sociology papers was carried out mainly through a lexicometric approach, integrated with hermeneutic analysis, useful both in the first and in the last phase of the study. The first phase concerned the collection of the corpus, while the last one concerned the interpretation of the results obtained from quantitative and automatic procedures. The lexicometric analyses, supported by the software IRaMuTeQ (an open-source package, distributed under the GNU GPL license, based on the R statistical software and on the Python language; it has currently reached version 0.7 alpha 2 and is still under development: Ratinaud, 2009), were carried out to extract the most relevant forms/lemmas and to apply some exploratory techniques for identifying the main lexical-textual dimensions, the relationships between some keywords, the recurring topics, and possible differences over the period analysed.

2.1. The Corpus: Selection Criteria and Preliminary Analysis

The texts analyzed in this study have been collected from the archive of the International Review of Sociology, considering the papers published from 1997 to 2017. In the first stage, all the papers containing the term gender in the title, abstract, body text and/or keywords were extracted. These numbered 235, distributed over the past 20 years, as shown in Table 1. Then, only those papers were selected which present a relevant reference to gender as a theoretical or empirical category – and not merely as part of the title of a cited source, as a statistical variable, or as a synonym – in order to outline meaningful remarks for the aims of each article. This selection was supported by a hermeneutic analysis, based on a careful reading of the papers to evaluate the centrality of gender issues in their hypotheses and theses, as well as in the implementation of the theoretical and/or empirical methodologies. The selected papers numbered 67, distributed over the past 20 years, as shown in Table 1.

Table 1 - Extracted and Selected Papers

Period      Extracted Papers (EP)   Selected Papers (SP)   SP/EP%
1997-1999   19                      2                      10.53
2000-2002   18                      3                      16.67
2003-2005   22                      3                      13.64
2006-2008   21                      3                      14.29
2009-2011   45                      20                     44.44
2012-2014   55                      15                     27.27
2015-2017   55                      21                     38.18
Total       235                     67                     28.51

The incidence of the selected papers on the extracted ones (SP/EP%) highlights the increased relevance of the term gender over time: it is used more and more often as an analytic category in sociological research, rather than as a synonym or to indicate only a demographic characteristic of individuals. The corpus submitted to the subsequent analyses therefore includes the 67 selected papers and has the following lexicometric measurements: size N = 495,470 tokens; word types V = 21,680; type/token ratio TTR = 4.38%; Hapax/V = 41.56%; Hapax/N = 1.82%. These characteristics show that the corpus can be considered sufficiently large for a quantitative analysis (Bolasco, 1999, p. 203).
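These measurements follow directly from the token and type counts. A minimal sketch of how they can be computed (generic code, not the IRaMuTeQ implementation):

```python
from collections import Counter

def lexicometric_measures(tokens):
    """Compute corpus size, vocabulary size, TTR and hapax ratios."""
    counts = Counter(tokens)
    N = sum(counts.values())            # corpus size (tokens)
    V = len(counts)                     # vocabulary size (types)
    hapax = sum(1 for f in counts.values() if f == 1)
    return {
        "N": N,
        "V": V,
        "TTR (%)": 100 * V / N,
        "Hapax/V (%)": 100 * hapax / V,
        "Hapax/N (%)": 100 * hapax / N,
    }

print(lexicometric_measures("the cat sat on the mat".split()))
```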
2.2. Strategy of Analysis

The analyses of the corpus, carried out with IRaMuTeQ, are the following:
1. Lexicon analysis: exploration of the lexicon used in the corpus and identification of theme words/lemmas;
2. Analysis of the specific lexicon: identification of specific words/lemmas by time period and by the gender of the author(s);
3. Correspondence analysis: extraction of lexical dimensions starting from the Aggregated Lexical Table (ALT) Lemma/Texts (Lebart, Salem 1994), in which the texts were identified according to the year of publication (Y = 1997, ..., 2017) and the gender of the author(s) (G = 1-Female; 2-Male; 3-Male and Female);
4. Cluster analysis: identification of the main topics through descending hierarchical analysis (Reinert, 1983) applied to the Binary Lexical Table (BLT) Text segments/Lemmas;
5. Similarity analysis: description of the clusters obtained in point 4, through a graphic representation starting from the proximity matrix between forms or lemmas.

References
Bolasco S. (1999). Analisi multidimensionale dei dati. Metodi, strategie e criteri d'interpretazione. Roma: Carocci.
Lebart L., Salem A. (1994). Statistique textuelle. Paris: Dunod.
Money, John; Hampson, Joan G.; Hampson, John (1955). "An Examination of Some Basic Sexual Concepts: The Evidence of Human Hermaphroditism". Bulletin of the Johns Hopkins Hospital, 97 (4), pp. 301-319.
Ratinaud, P. (2009). IRaMuTeQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires. http://www.iramuteq.org.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les Cahiers de l'Analyse des Données, 8, 187-198.

The Rhythm of Epic Verse in Portuguese From the 16th to the 21st Century

Adiel Mittmann, Alckmar Luiz dos Santos
Universidade Federal de Santa Catarina (Florianópolis, Brazil)
adiel@mittmann.net.br, alckmar@gmail.com

Abstract
The verses of most epic poems in Portuguese have been written following the example of the Italian endecasillabo: a verse whose last stressed syllable is the tenth, which usually means, in both Italian and Portuguese, that most verses have a total of eleven syllables. In addition to the tenth, other syllables may be stressed within the verse as well, and the specific distributions of stressed and unstressed syllables make up different rhythmic patterns. In this article, we investigate how such patterns were used in six epic poems written in Portuguese, ranging from the 16th to the 21st century, for a total of 52,412 verses. In order to analyze such a large number of verses, we used Aoidos, an automatic scansion tool for Portuguese. By using supervised and unsupervised machine learning, we show that, though the influence of earlier poets (especially Camões) is ever present, poets favor different rhythmic patterns, which can be regarded as their rhythmic signature.
Keywords: Epic poetry, Portuguese, Scansion.

Résumé
Les vers de la plupart des épopées en portugais ont été écrits à l'instar de l'endecasillabo italien : un vers dont la dernière syllabe accentuée est la dixième, ce qui signifie généralement, en italien et en portugais, que la plupart des vers ont onze syllabes au total. En plus de la dixième, d'autres syllabes peuvent aussi être accentuées dans ce vers, chaque combinaison de syllabes accentuées et non accentuées représentant un standard rythmique.
Dans cet article, nous examinons comment ces standards ont été utilisés dans six épopées écrites en portugais, du XVIème au XXIème siècles, dans un total de 52 412 vers. Pour analyser une telle quantité de vers, nous avons employé Aoidos, un outil automatique de scansion pour le portugais. En utilisant des apprentissages supervisés et non supervisés, nous concluons que, encore que l'influence de poètes précédents (surtout celle de Camões) se fasse toujours remarquer, chaque poète préfère de différents standards rythmiques, qui peuvent être considérés comme sa signature rythmique.
Mots-clés : Epopée, Portugais, Scansion.

1. Introduction

Poets are frequently compared to one another, but over the centuries such comparisons have rarely been made objectively, especially with respect to verse structures. When critics state that a poet has followed the steps of another too closely and has therefore produced unoriginal and derivative work, they can seldom rely on objective facts. Works such as that of Chociay (1994), who manually analyzed and tabulated more than 1,500 verses, are not the rule but the exception. It is indeed a tedious and tiresome task for any human to carry out; but looking at a great amount of text from afar and extracting relevant information from it constitutes a core element of distant reading (Moretti, 2013).

Table 1: Poems included in the corpus. The code is derived from the poem's title.

Code  Author                         Born in    Poem            Year  Verses
L     Luís de Camões                 Portugal   Os Lusíadas     1572   8,816
M     Francisco de Sá de Meneses     Portugal   Malaca          1634  10,656
C     Santa Rita Durão               Brazil     Caramuru        1781   6,672
A     Fagundes Varela                Brazil     Anchieta        1875   8,484
B     Carlos Alberto Nunes           Brazil     Os Brasileidas  1938   8,504
F     José Carlos de Souza Teixeira  Brazil     Famagusta       2016   9,280
                                                Total                 52,412

In this article, we turn our attention to the verse most commonly used in epic poetry in Portuguese, the decassílabo, which was borrowed from Italian. (In both Italian and Portuguese, this kind of verse always has its 10th syllable stressed and typically has a total of eleven syllables, since most words in both languages have a stress on the penult. However, in Italian this verse is called endecasillabo because of the total number of syllables, whereas the Portuguese term decassílabo emphasizes the fact that the 10th is the last stressed syllable in the verse.) It is the verse used by Dante in his Divina Commedia and by Petrarch in his Canzoniere. Stressed syllables are distributed in the verse according to certain rules; in particular, the 10th syllable (which defines the length of the verse) must always be stressed. Other syllables may also be stressed, producing many possible rhythmic patterns—which are, both in Portuguese and Italian, required to have their 6th or, less commonly, their 4th syllable stressed (Versace, 2014). We identify such patterns by indicating the syllabic positions that are stressed within a given verse, so that a pattern like 3-6-10 means that the 3rd, 6th and 10th syllables are stressed. We are interested in tracking which rhythmic patterns poets have favored over the centuries and whether such patterns are characteristic of each poet. For this purpose, we have assembled a corpus consisting of six poems, whose publication dates range from the 16th to the 21st century, for a total of 52,412 verses (about 300,000 words).
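To make the pattern notation concrete, a pattern such as 3-6-10 can be encoded as a fixed-length binary vector over the ten syllabic positions (a generic sketch, not the Aoidos code):

```python
def pattern_to_vector(pattern: str, length: int = 10) -> list[int]:
    """Encode a rhythmic pattern like '3-6-10' as a 0/1 vector of stressed positions."""
    stressed = {int(p) for p in pattern.split("-")}
    return [1 if i in stressed else 0 for i in range(1, length + 1)]

print(pattern_to_vector("3-6-10"))    # [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
print(pattern_to_vector("2-4-6-10"))  # [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
```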
In order to analyze such a quantity of verses, we have used our automatic scansion tool, Aoidos (Mittmann et al., 2016), which is capable of scanning thousands of verses in a few seconds and of producing rhythmic information. The next section describes the corpus we used in our experiments; Section 3 reports the results obtained with our analyses; finally, Section 4 presents our conclusions and discusses future work.

2. Corpus

The poems chosen to compose the corpus for this article are summarized in Table 1. We adopted two criteria in order to select these poems. Firstly, we searched for an important—and thus well known—or exemplary epic poem in each century, from the 16th up to the present. Secondly, we required trustworthy and reliable digital editions; in one case (the 17th century), we produced a digital edition especially for this article, since no suitable candidate was found. Camões' poem Os Lusíadas is by far the most important epic poem ever written in Portuguese. Its influence can be felt, for instance, even in 20th-century lyrical poets such as Jorge de Lima. Meneses' Malaca Conquistada and Durão's Caramuru follow the Camonean model very closely: they use identical rhyme schemes, they have a similar argument and they celebrate a protagonist in like manner. Nevertheless, we would like to investigate whether the two authors innovated with respect to rhythm, even though they kept the overall model of the Camonean epic. These three poems in our corpus were written by Portuguese citizens (Durão was born in colonial Brazil and died before the country's independence), while the remaining three poems were written by Brazilian poets.

Figure 1: Scansion produced by Aoidos (the third stanza of Os Lusíadas, showing each verse's syllable division and rhythmic pattern).

Fagundes Varela's Anchieta, a Romantic piece of the 19th century, would not at first glance be an epic poem, since its subject is the telling of New Testament stories to Brazilian Indians by the priest José de Anchieta. However, as the historian Maria Aparecida Ribeiro and others remark, Anchieta is a kind of "religious epopee" (Ribeiro, 2003), which draws our attention to the Romantic effort to renew the ancient models inherited from Classical or Neoclassical literature (although it clearly returns to the Greek epic model, as it does not adopt stanzas of regular size). Despite some important differences in the narrative logic, the verses reproduce the most important invariants of the genre: the honoring of a protagonist (Anchieta) and the use of the decassílabo (blank ones, in this case). As for Carlos Alberto Nunes' Os Brasileidas, this poem also presents some invariants that characterize the traditional epic poem: blank decassílabo verses; several cantos, beginning with the proposition; the intention of celebrating an individual hero, in this case Antônio Raposo Tavares, a 17th-century Brazilian trailblazer.
In addition to the absence of rhymes, and in order to emphasize the differences in relation to the Camonean epic style, there is no regular stanza division in any of the nine cantos (ten, if we consider the epilogue), as in Anchieta; stanzas may vary significantly, from seven up to sixty-five or more verses. Finally, regarding Famagusta, by José Carlos de Souza Teixeira, one quickly notices that it is a curious combination of traditional epic elements from different ages. In addition to the epic intention of celebrating a historical event and some sort of heroic action, its formal elements are, to say the least, very heterogeneous. For instance, it takes the Camonean eight-verse stanza but adopts a different rhyme scheme, resulting no longer in the well-known ottava rima (ABABABCC), but in the medieval Sicilian stanza called strambotto romagnuolo (ABABCCDD), scarcely used in Brazilian literature. (The Brazilian-born baroque poet Manoel Botelho de Oliveira did use this stanza in some madrigals written in Spanish, such as this one: Si Cupido me inflama, / Si desdeñas mi empleo; / En amorosa llama, / En nieve desdeñosa el Etna veo, / Con amor, y tibieza / Tenemos su firmeza, / Y en disonancia breve / Suspiro fuego yo, tu brotas nieve.)

3. Analysis

In order to analyze the corpus, we used Aoidos, an automatic scansion tool for Portuguese (Mittmann et al., 2016), much like Métromètre (Beaudouin and Yvon, 2004) and Anamètre (Delente and Renault, 2015) for French. Starting from the written word, Aoidos produces a phonetic transcription for each verse and then applies many rules (such as elision or syncope) to produce a series of alternative scansions. By examining the poem as a whole, the system then selects the most appropriate alternative and, by applying a set of heuristics, proposes a rhythmic pattern for each verse. The scansions generated by Aoidos have been manually verified to be correct in 99.0% of cases (Mittmann, 2016). Figure 1 shows the output produced by the system for the 3rd stanza of Camões' Os Lusíadas.

Table 2: Rhythmic pattern usage (%) for each poem (rows L, M, C, A, B, F; one column per rhythmic pattern with an average usage of at least 1%, such as 2-6-10, 2-4-6-10, 3-6-10, 1-3-6-10 and 3-6-8-10).

A total of 42 different rhythmic patterns were found among all 6 poems. Table 2 shows how frequently patterns with an average usage of at least 1% were employed in each poem. In each row, the bold number indicates the pattern most favored by that row's poem. Although some patterns, such as 3-6-8-10 and 1-3-6-10, remain more or less constant, many others display a wide range of relative usage: pattern 2-6-10 ranges from 7.1% to 16.2%, and pattern 1-4-8-10 from 0.1% to 3.1%. Whereas Camões (L) does seem to set the tone for the following poems, there are clear differences when one considers patterns such as 2-4-6-10 and 2-4-8-10.
In fact, pairs such as Malaca (M) and Caramuru (C), or Anchieta (A) and Os Brasileidas (B), are more similar to each other than Camões' Os Lusíadas (L) is to any other poem. Looking at the numbers from one century to the next, a change of more than 5% can be seen twice: from Caramuru (C) to Anchieta (A) there was a decrease of 5.1% for the pattern 2-4-6-8-10, and from Os Lusíadas (L) to Malaca (M) the pattern 2-4-8-10 increased in usage by 5.4%. An interesting question arises at this point: do smaller parts of the poems reflect the overall distribution shown in Table 2? In other words, given a smaller part of a poem, could we tell from which work it was taken simply by looking at its rhythmic signature? To answer this question, we divided each poem into its cantos, for a total of 72 divisions, with an average of 727.9 verses per canto. We then extracted the usage frequency of the rhythmic patterns, thus producing a feature vector for each canto. By iteratively clustering such vectors, we obtained the dendrogram shown in Figure 2; complete linkage was used. Each canto in the figure is indicated by a letter (the poem code) and a number (the canto number within the poem). Cantos from the same poem are also displayed with the same color. The closer to the center two branches link together, the more different the cantos they contain are.

Figure 2: Dendrogram built from all cantos of all poems.

We can immediately see that, in general, cantos that belong to the same poem are located next to each other. All cantos of Camões' Os Lusíadas (L), in particular, are tightly grouped in their own branch. It is also interesting to note that, except for Famagusta (F), whenever a smaller group of cantos from the same poem was placed far from the larger group, there is a certain order: it was the first three cantos of Caramuru (C) that were separated; the last four of Anchieta (A); and the first two of Os Brasileidas (B). Two cantos from Famagusta (F1 and F16) are only linked with other nodes at a great distance; this stems from the fact that these two cantos are the shortest ones in the whole corpus: the first canto has only 24 verses, the sixteenth 112. Such small amounts of verses produce poor feature vectors. In order to further investigate how well the cantos reflect the poems, we employed a nearest centroid classifier. In this case, each of the 72 feature vectors (the rhythmic signatures of the cantos) was labeled with the poem it belongs to. We then used stratified k-fold cross-validation, with k = 4 and 100 repetitions, to assess the classifier's performance. The mean precision obtained was 96.5%, the mean recall 95.9% and the mean F1 score 95.5%; the mean accuracy was 95.6%. This means that, given a sample of 54 cantos (because k = 4), the classifier guesses the right poem for the other 18 cantos in about 96% of the cases.
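Both experiments map onto standard library calls; the paper's acknowledgments mention Scikit-learn for the classifier. A minimal sketch, using a random stand-in for the 72 x 42 matrix of per-canto pattern frequencies (the real labels have unequal counts per poem; equal counts are assumed here for brevity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((72, 42))              # stand-in: per-canto pattern frequencies
y = np.repeat(list("LMCABF"), 12)     # stand-in: poem label of each canto

# Complete-linkage hierarchical clustering (basis of the Figure 2 dendrogram).
Z = linkage(X, method="complete")
print("linkage matrix shape:", Z.shape)

# Nearest centroid classification with stratified 4-fold cross-validation.
scores = []
for _ in range(100):                  # 100 repetitions, as in the paper
    cv = StratifiedKFold(n_splits=4, shuffle=True)
    scores.extend(cross_val_score(NearestCentroid(), X, y, cv=cv))
print(f"mean accuracy: {np.mean(scores):.3f}")
```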
4. Conclusion

The frequency with which poets employ certain patterns of stressed and unstressed syllables in their verses can be regarded as a rhythmic signature, at least in epic poems, the subject of this article. In this work, we have subjected 72 individual cantos to a hierarchical clustering technique (Figure 2), which shows that rhythmic patterns do reflect an author's preferences (unconscious as they might be). Furthermore, a nearest centroid classifier obtained a mean accuracy of 95.6%, which is also evidence for the existence of a rhythmic signature. This kind of analysis is made possible by automatic scansion systems such as Aoidos, which allow a large number of verses (more than 50,000 in this case) to be scanned and analyzed. Although Camões, whose poem Os Lusíadas is the oldest in our corpus, has influenced newer generations of poets, this article shows that, at least rhythmically, each poet in our corpus took their own path. In fact, Camões' verses are the ones most easily distinguished from the others (see Figure 2). Lesser-known poems, such as Malaca or Os Brasileidas, have not failed to produce rhythmic signatures that, in most cases, set them apart from other works. In addition to the rhythmic signature, we would like to investigate, in the future, additional features that could be extracted from verses and used in stylometric analyses. In particular, the decassílabo usually falls into one of two categories: either the 6th syllable has the dominant stress or, less commonly, the 4th; in the former case, the verse is heroic; in the latter, Sapphic. A verse whose rhythmic pattern includes the 6th syllable but not the 4th is heroic; but one that includes both the 6th and the 4th could be either heroic or Sapphic. It would be interesting to resolve this ambiguity and evaluate how well these categories characterize a poet's style. Although this article has only considered epic poems, there is no reason to believe that rhythmic signatures are limited to this genre. In the future, we would like to explore how well the approach shown here fares when applied to other verses and other genres.

Acknowledgments
For the nearest centroid classifier we employed Scikit-learn (Pedregosa et al., 2011). For the dendrogram, we used Dendextend (Galili, 2015) and Circlize (Gu et al., 2014).

References
Beaudouin, Valérie and Yvon, François (2004). "Contribution de la métrique à la stylométrie". 7èmes Journées internationales d'Analyse statistique des Données Textuelles (2004), pp. 107-118.
Chociay, Rogério (1994). A Identidade Formal do Decassílabo em "O Uraguai". Revista de Letras 34, 229-243.
Delente, Éliane and Renault, Richard (2015). Projet Anamètre : le calcul du mètre des vers complexes. Langages 3.199, 125-148.
Galili, Tal (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics 31 (22), 3718-3720.
Gu, Zuguang et al. (2014). circlize implements and enhances circular visualization in R. Bioinformatics 30 (19), 2811-2812.
Mittmann, Adiel (2016). "Escansão Automática de Versos em Português". PhD thesis. Universidade Federal de Santa Catarina.
Mittmann, Adiel, Wangenheim, Aldo von, and Luiz dos Santos, Alckmar (2016). "Aoidos: A System for the Automatic Scansion of Poetry Written in Portuguese". 17th International Conference on Intelligent Text Processing and Computational Linguistics (2016).
Moretti, Franco (2013). Distant Reading. London: Verso.
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825-2830.
Ribeiro, Maria Aparecida (2003). Anchieta no Brasil: Que Memória? História Revista 8, 21-56.
Versace, Stefano (2014). A Bracketed Grid account of the Italian endecasillabo meter. Lingua 143, 1-19.

Le vocabulaire des campagnes électorales

Denis Monière (Université de Montréal, denis.moniere@umontreal.ca)
Dominique Labbé (PACTE CNRS - Université de Grenoble, dominique.labbe@umrpacte.fr)

Abstract
After serving a first presidential term, V. Giscard d'Estaing, F. Mitterrand,
J. Chirac and N. Sarkozy were candidates for a second term. In this study, their electoral speeches are compared with their presidential ones by means of vocabulary specificity scores. It appears that this calculation is mainly biased by grammatical categories and word frequency. We present modifications of the classical formulae which make it possible to neutralize the influence of grammatical categories and, at least partially, that of word frequency. Electoral discourse privileges the verb over the noun; such speech is more personalized than governmental discourse, and it focuses on the country and its inhabitants, the rest of the world being pushed into the background. Finally, in recent years, the polemical dimension has become predominant.

Résumé
Après un premier mandat présidentiel, V. Giscard d'Estaing, F. Mitterrand, J. Chirac et N. Sarkozy ont été candidats à un deuxième mandat. On compare leurs discours électoraux avec leurs discours présidentiels à l'aide des spécificités du vocabulaire. Il apparaît que ces spécificités dépendent surtout des catégories grammaticales et des effectifs des mots. On présente des modifications du calcul classique qui permettent de neutraliser l'influence des catégories grammaticales et, au moins partiellement, celle des fréquences. Le discours électoral privilégie le verbe au détriment du nom, il est plus personnalisé que le discours au pouvoir, il se centre sur le pays et ses habitants, le reste du monde passant au second plan. Enfin, ces dernières années, la dimension polémique devient prédominante.

Keywords: lexicometry; political discourse; French presidential campaigns; specific vocabulary; spécificités du vocabulaire.

1. Introduction

Le discours électoral diffère-t-il du discours de gouvernement et en quoi ? La réponse est difficile car il faut neutraliser l'effet des personnalités et des conjonctures pour isoler, dans le discours, l'effet des choix stratégiques du locuteur. L'idéal serait de pouvoir étudier les mêmes hommes à peu près simultanément dans les deux positions de gouvernant puis de candidat. Le corpus des discours des présidents français depuis 1958 remplit ces deux conditions (présentation du corpus dans Arnold et al., 2016). En effet, pour cinq présidents (C. de Gaulle, V. Giscard d'Estaing, F. Mitterrand, J. Chirac et N. Sarkozy), ce corpus contient leurs interventions lorsqu'ils étaient présidents et leurs discours de campagne pour leur réélection. Certes, en 1965, de Gaulle n'a pratiquement pas fait campagne (Labbé 2005), mais ses successeurs ne l'ont pas imité en 1981, 1988, 2002 et 2012 (corpus en annexe). Pour comparer ces corpus, le calcul des « spécificités » semble l'outil le plus adapté (Lafon 1980 et 1984). Il rapporte le vocabulaire d'un sous-ensemble de textes (sous-corpus) à un corpus de référence. Mais il se heurte à une double difficulté : la spécificité éventuelle d'un vocable est liée à sa catégorie grammaticale et à sa fréquence d'emploi (Labbé, Labbé 1994 ; Monière et al. 2005), comme nous allons le vérifier d'abord avec le cas de Sarkozy en 2012 (sur cette campagne : Labbé, Monière 2013). Dès lors, la mesure des spécificités doit neutraliser, autant que possible, ces deux inconvénients.

2. Les catégories grammaticales du discours électoral

Le discours présidentiel de Sarkozy s'étend de son investiture (16 mai 2007) au 12 février 2012 (annonce de sa candidature). La campagne s'étend jusqu'au soir du second tour (6 mai 2012).
Le corpus complet (P) compte 1074 interventions, soit au total 3 221 259 mots avec 21 602 vocables différents. À partir de sa déclaration de candidature, Sarkozy est intervenu 110 fois (sous-corpus E), soit 369 808 mots et un vocabulaire de 8 511 vocables différents. Ces interventions sont d'abord marquées par un net changement de style (tableau 1).

Tableau 1. Densités des catégories grammaticales dans les interventions de Sarkozy lors de la campagne de 2012 comparées à ses interventions comme président 2007-2012 (en ‰)

Catégories               P-E (corpus - sous-corpus)   E (sous-corpus)   Variation (%)
Verbes                   159.2    169.4    +6.4
Futurs                   7.0      7.2      +1.6
Conditionnels            3.2      2.8      -11.2
Présents                 82.9     89.3     +7.7
Imparfaits               6.4      6.4      -0.2
Passés simples           0.6      0.3      -55.2
Participes passés        20.8     23.8     +14.6
Participes présents      2.1      2.1      +2.9
Infinitifs               36.3     37.6     +3.6
Noms propres             27.9     23.0     -17.3
Substantifs              178.4    176.0    -1.3
Adjectifs                54.0     46.6     -13.7
Adj. participes passés   5.2      4.5      -13.1
Pronoms                  124.3    132.6    +6.7
Pronoms personnels       65.4     69.6     +6.5
Déterminants             181.6    182.5    +0.5
Articles                 131.9    128.1    -2.9
Nombres                  18.7     20.9     +11.9
Possessifs               14.5     17.0     +17.3
Démonstratifs            7.6      7.8      +2.7
Indéfinis                8.9      8.7      -2.4
Adverbes                 67.1     68.9     +2.7
Prépositions             150.1    145.6    -3.0
Coordinations            29.1     25.4     -12.7
Subordinations           25.9     27.9     +8.0

Dans le discours présidentiel, on rencontre 159 verbes en moyenne pour 1 000 mots ; dans les discours électoraux, cette proportion passe à 169 ‰, soit une augmentation de +6,4 %, écart significatif avec moins d'une chance sur 10 000 de se tromper. Les lignes suivantes donnent le détail des temps et des modes. Le recul le plus significatif concerne le conditionnel (le discours électoral ne doit pas connaître le doute). En revanche, le participe passé connaît l'augmentation la plus forte (le président sortant peut difficilement éviter de défendre sa gestion). Les pronoms, les adverbes et les conjonctions de subordination évoluent dans le même sens que les verbes : ils sont réunis dans le « groupe du verbe ». À l'inverse, les substantifs, adjectifs, articles et prépositions suivent la tendance opposée : « groupe du nom ». Le tableau 2 donne les densités des deux groupes chez les quatre présidents.

Tableau 2. Densités des groupes du verbe et du nom (en ‰) dans les discours électoraux (E) comparés aux discours présidentiels (P-E)

                                 P-E      E        Variation (%)   Indice
Sarkozy (2007-2012)
  Groupe du verbe                376.6    398.9    +5.9            +
  Groupe du nom                  621.1    599.2    -3.5            -
Giscard d'Estaing (1974-1981)
  Groupe du verbe                351.5    392.5    +11.7           +
  Groupe du nom                  646.1    604.5    -6.4            -
Mitterrand (1981-1988)
  Groupe du verbe                386.4    427.1    +10.5           +
  Groupe du nom                  611.0    569.8    -6.7            -
Chirac (1995-2002)
  Groupe du verbe                329.5    333.2    +1.1            +
  Groupe du nom                  668.8    665.1    -0.6            -

Chez tous les présidents en campagne, il se produit une augmentation du groupe du verbe et un recul de celui du nom. Statistiquement, ces mouvements sont significatifs (avec α = 1 %). L'écart le plus fort est observé chez Giscard d'Estaing, puis chez Mitterrand. Cependant, Chirac tranche sur les autres avec une densité du verbe beaucoup plus faible et une campagne présidentielle presque aussi distanciée que ses interventions lors de son premier mandat, marqué par une cohabitation de cinq ans (1997-2002) avec un Premier ministre socialiste (Jospin). Dans son discours électoral, la densité des verbes augmente nettement (+3,6 %) mais se trouve en partie compensée par un recul des pronoms, ce qui accentue le caractère dépersonnalisé des propos de Chirac, à l'opposé des trois autres. En conséquence, pour les quatre présidents, les principaux verbes apparaissent en spécificités positives du discours électoral et il ne s'en trouve que quelques-uns en spécificités négatives. Il en est de même pour les pronoms et les adverbes. La situation inverse se constate pour les adjectifs, les substantifs, etc. Autrement dit, si un mot appartient à une catégorie sous-employée dans le sous-corpus (par rapport à sa densité d'utilisation dans le corpus entier), ce vocable a toute chance d'apparaître dans les spécificités négatives (et positives dans le cas inverse). Il est possible de neutraliser ce biais.
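À titre d'illustration, le calcul de ces densités en ‰ et de leur variation relative peut s'écrire comme suit ; esquisse générique (et non le code des auteurs), avec des effectifs de catégories inventés, seuls les totaux étant ceux du texte :

```python
def densites_pour_mille(effectifs: dict, total: int) -> dict:
    """Densité (pour 1 000 mots) de chaque catégorie grammaticale."""
    return {cat: 1000 * n / total for cat, n in effectifs.items()}

# Effectifs fictifs, pour illustrer le calcul (valeurs inventées).
P_moins_E = {"Verbes": 454_000, "Substantifs": 509_000}   # corpus - sous-corpus
E = {"Verbes": 62_600, "Substantifs": 65_100}             # sous-corpus électoral

# Totaux tirés du texte : 3 221 259 - 369 808 = 2 851 451 mots pour P-E.
d_pe = densites_pour_mille(P_moins_E, 2_851_451)
d_e = densites_pour_mille(E, 369_808)
for cat in d_pe:
    variation = 100 * (d_e[cat] - d_pe[cat]) / d_pe[cat]
    print(f"{cat}: {d_pe[cat]:.1f} ‰ -> {d_e[cat]:.1f} ‰ ({variation:+.1f} %)")
```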
3. Neutralisation de la catégorie grammaticale

Le calcul standard est le suivant. Soit :
- le corpus de référence (P), long de Np mots ;
- le sous-corpus E, long de Ne mots, dont on recherche les spécificités par rapport à P ;
- un vocable i avec Fip occurrences dans P et Fie occurrences dans E.

Si sa répartition est uniforme, ce vocable apparaîtra Eie(u) fois dans le sous-corpus E :

$$E_{ie}(u) = F_{ip} \times U \quad \text{avec} \quad U = \frac{N_e}{N_p} = \frac{369\,808}{3\,223\,570} \approx 0{,}113 \qquad (1)$$

La probabilité pour que le vocable i soit observé Fie fois dans E suit une loi hypergéométrique de paramètres Fip, Fie, Ne, Np :

$$P(X = F_{ie}) = \frac{\dbinom{F_{ip}}{F_{ie}} \dbinom{N_p - F_{ip}}{N_e - F_{ie}}}{\dbinom{N_p}{N_e}} \qquad (2)$$

L'indice de spécificité (S) est la somme des probabilités – calculées avec (2) – de survenue des valeurs entières de X variant de 0 à Fie :

$$S = P(X \le F_{ie}) = \sum_{j=0}^{F_{ie}} P(X = j) \qquad (3)$$

Si, au seuil α, Fie excède Eie(u), le vocable est « spécifique plus » (S+) ; S- dans le cas contraire. Avec ce calcul, la plus grande partie des verbes usuels de Sarkozy apparaissent donc en S+ de sa campagne électorale et la majorité des substantifs en S-, parce que, dans ses discours électoraux, la première catégorie est privilégiée par rapport au discours de gouvernement où elle est moins utilisée (à l'inverse des substantifs). Pour corriger ce biais, le calcul prend en compte les catégories grammaticales (g). La modification est présentée dans : Monière, Labbé, Labbé 2005 ; Mayaffre 2006 et Monière, Labbé 2012. Soit Nge et Ngp le nombre de mots appartenant à la catégorie grammaticale G, respectivement dans le sous-corpus E et dans le corpus entier P. Les formules (1) et (2) deviennent :

$$E_{ie}(u) = F_{ip} \times U \quad \text{avec} \quad U = \frac{N_{ge}}{N_{gp}} \qquad (4)$$

$$P(X = F_{ie}) = \frac{\dbinom{F_{ip}}{F_{ie}} \dbinom{N_{gp} - F_{ip}}{N_{ge} - F_{ie}}}{\dbinom{N_{gp}}{N_{ge}}} \qquad (5)$$

Les formules (4) et (5), appliquées aux quatre corpus, aboutissent à un équilibre relatif, au sein de chaque catégorie, entre les S+ et les S-. Ces formules neutralisent donc la liaison entre spécificités et densité des catégories grammaticales. Comme indiqué dans Monière & Labbé 2012, cette modification change drastiquement la liste des « mots spécifiques », mais elle laisse subsister la liaison entre spécificité et fréquence.
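L'indice S de la formule (3), comme sa variante par catégorie (formules 4 et 5), correspond à la fonction de répartition d'une loi hypergéométrique. En voici une esquisse minimale (code générique, indépendant des outils des auteurs ; l'exemple chiffré est fictif) :

```python
from scipy.stats import hypergeom

def indice_specificite(F_ie: int, F_ip: int, N_e: int, N_p: int) -> float:
    """Indice S de la formule (3) : P(X <= F_ie) pour une loi
    hypergéométrique de population N_p, F_ip « succès », N_e tirages."""
    return hypergeom.cdf(F_ie, N_p, F_ip, N_e)

def indice_par_categorie(F_ie: int, F_ip: int, N_ge: int, N_gp: int) -> float:
    """Variante (4)-(5) : la population est restreinte à la catégorie g."""
    return hypergeom.cdf(F_ie, N_gp, F_ip, N_ge)

# Exemple fictif : un vocable vu 60 fois dans E pour 300 occurrences dans P.
S = indice_specificite(60, 300, 369_808, 3_223_570)
print(f"S = {S:.4f}")  # S proche de 1 : suremploi (S+) ; proche de 0 : sous-emploi (S-)
```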
Dans le corpus Sarkozy, cela représente JADT’ 18 527 plus de la moitié du vocabulaire (54 % des vocables). Autrement dit, seulement 46% du vocabulaire peut être S+ ; - le "seuil de spécificité négative" correspond à la situation suivante : un vocable i absent de E (Fie = 0) alors qu’on en attend au moins 5 (Eie(u) ≥ 5). En pratique, cela signifie que son effectif dans P est égal ou supérieur à 5*1/U, soit ici 40. Autrement dit, pour le discours électoral de Sarkozy, 83% du vocabulaire de P ne peut apparaître en S-. Dès lors, les vocables dont les effectifs dans P sont compris entre 5 et 39 peuvent être S+ mais pas S- dans E. On s’attend donc à ce qu’il y ait plus de vocables S+ que S-. 5. Liaison entre spécificité et fréquence 9 876 vocables apparaissent 5 fois ou plus dans P. Si ce corpus était homogène (hypothèse nulle H0), une distribution normale des vocables laisserait attendre - avec = 1% - environ 100 vocables spécifiques. Le tableau 3 compare les résultats observés et attendus (avec H0). Tableau 3. Effectifs des vocables classés par catégories grammaticales et par spécificités Verbes Mots à majuscule Substantifs Adjectifs Pronoms Adverbes Déterminants Prépositions conjonc. Total & Effectifs (Fip ≥ 5) 1 540 1 501 4 175 2 065 52 411 72 60 H0 15 15 42 21 1 4 1 1 S+ 176 112 455 140 18 20 21 21 S143 142 468 115 13 57 12 9 Total S 319 254 923 255 31 77 33 30 9 876 100 963 959 1 922 Il y a donc vingt fois plus de vocables spécifiques que n’en laisse attendre H0 (répartition homogène des mots entre corpus et sous-corpus). A priori, cela signifie simplement que discours électoral et discours de gouvernement sont fortement contrastés. En fait, ce décalage provient essentiellement des vocables les plus fréquents (tableau 4 et Figure 1). Tableau 4. Proportion des vocables spécifiques de E dans l’ensemble du vocabulaire (P) classé par fréquence absolues. Classe de fréquence (P) 5-9 10-14 15-19 20-29 Vocables spécifiques de E dans la classe 64 68 55 89 Total vocables de P dans la classe 2 759 1 237 757 987 Proportion des vocables de P spécifiques de E 2,3 5,5 7,3 9,0 528 JADT’ 18 30-49 50-99 100-199 200-499 500+ Total 143 317 332 398 473 1 939 997 1 054 799 686 640 9 916 14,3 30,1 41,6 58,0 73,9 19,6 Figure 1. Liaison entre la spécificité et la fréquence Au-dessus du seuil de spécificité positive (ici 40), la proportion de vocables spécifiques est directement corrélée avec la fréquence : la courbe suit la diagonale du tableau et le coefficient de détermination de Y par X est égal à 0,997, ce qui indique une liaison rigide et linéaire. Il en est toujours ainsi : plus un vocable donné est fréquent dans un corpus, plus il a de chances d’être "spécifique" à l’une quelconque des parties de ce corpus. Cette dépendance peut être interprétée de deux manières. D’une part, l’essentiel des choix thématiques seraient véhiculés par les vocables les plus fréquents et la variation dans leurs fréquences d’emploi seraient la principale manifestation de ces choix. Cependant, dès que le corpus atteint une certaine longueur, l’observateur se trouve noyé dans des listes qui contiennent la plus grande part du vocabulaire usuel, ce qui en rend l’interprétation difficile. D’autre part et à l’inverse, on peut penser que le raisonnement probabiliste qui sous-tend ce calcul - doit être adapté à cette liaison manifeste entre spécificité et fréquence. 6. 
Neutralisation de la liaison entre fréquence et spécificité Les limites des classes de fréquence du tableau 5 et de la figure 1 ont été fixées selon une échelle proche d’une progression géométrique, ce qui assure JADT’ 18 529 aux classes des effectifs sinon égaux du moins suffisamment proches et importants. Ceci correspond à une particularité dite "loi de Zipf" - ou "ZipfMandelbrot" - selon laquelle le nombre d’occurrences d’un mot dans un texte est lié à son rang dans la distribution des fréquences (Zipf 1935 ; Mandelbrot 1957). Dès que le corpus atteint une longueur suffisante (au moins un demi-million de mots) et que le sous-corpus est égal à au moins d’un dixième du corpus, on peut découper le vocabulaire en quelques classes de fréquence. Pour un corpus de la dimension de celui de Sarkozy (et des trois autres présidents), trois classes suffisent : vocables "rares" (inférieurs à 100 occurrences) ; "fréquents" (de 100 à moins de 500) ; "très fréquents" (500 et plus). Dans ces trois classes, les vocables sont classés par catégorie grammaticale puis en fonction de leur indice de spécificité et, dans chacune des classes, seuls les plus caractéristiques sont retenus. Le tableau 5 donne les 5% les plus caractéristiques du discours électoral de Sarkozy comparé à son discours présidentiel, pour trois catégories grammaticales. Tableau 5. Spécificités les plus remarquables du discours électoral de Sarkozy par rapport à son discours présidentiel (par catégories grammaticales en trois classes de fréquence) <100 Vocables significativement sur-employés : Verbes : voler, cotiser, détester, casser, éduquer, suspendre, démolir Mots à majuscule Mélenchon, Le Pen, Substantifs honte, rassemblement, héritier, socialiste, colère, délit, amalgame 100 – 499 adresser, bénéficier, apprendre, souffrir, supprimer, régulariser François, Polynésie, Hollande, Schengen, TVA jeunesse, souffrance, gauche, destin, erreur, étranger, salaire, outremer Vocables significativement sous-employés : Verbes admirer, illustrer, progresser, témoigner, expérimenter, inaugurer, évoquer, marquer, associer Mots à majuscule Bush, Poutine, Roumanie, Quatar Russie, Inde, Iran, Barroso Substantifs refondation, coalition, scientifique, lycéen processus, visite, équipe, conférence, planète, gouvernance, alliance, 500+ dire, vouloir, parler, vivre, proposer, changer, respecter, défendre France, Français, Corse travail, entreprise, droit, république, vie, emploi, ami, enfant, territoire, peuple, être, devoir, savoir, comprendre, trouver, attendre, remercier, essayer Afrique, G20, Méditerranée, Merkel, Paris, Chine pays, monsieur, président, état, ministre, politique, gouvernement, question 530 JADT’ 18 Chez Sarkozy, le discours électoral est affaire de volonté, il se centre sur le pays, ses habitants mais aussi l’adversaire – la gauche, Hollande - dont il dénonce les amalgames et les erreurs. Les spécificités négatives indiquent que le discours électoral n’est pas affaire de devoir ou de connaissance ; il "oublie" le reste du monde et ses dirigeants, les institutions du pays comme le gouvernement et les ministres, etc. 7. Conclusions Lorsqu’un président entre en campagne, il doit descendre dans l’arène et adopter un discours de combat qui se caractérise avant tout par une augmentation de la densité des verbes, une forte personnalisation et un recul de la place accordée aux substantifs et aux adjectifs. Ces caractéristiques se retrouvent dans les discours électoraux des Premiers ministres canadiens (Monière, Labbé 2010). 
Cependant, en campagne ces derniers insistent sur le "nous" car, dans un système parlementaire, il s’agit de faire élire une majorité de députés, alors que les présidents français privilégient le "je"… Enfin, ces dernières années en Amérique du nord comme en France, la forte présence de la construction négative et la désignation des adversaires (noms propres) soulignent le caractère polémique du discours électoral. Le calcul des spécificités – tel qu’il est utilisé en analyse des données textuelles – enregistre la catégorie grammaticale du vocable analysé et sa fréquence d’emploi et non pas les choix thématiques du locuteur. La neutralisation de la catégorie grammaticale est aisée si les mots ont été étiquetés. En revanche, l’effet de la fréquence est susceptible de plusieurs interprétations. Toutefois, si l’on souhaite ne pas être enseveli sous les listes produites par le calcul classique, la solution réside dans le classement des vocables en classes de fréquence –selon une échelle géométrique - et, au sein de chacune de ces classes, dans la sélection des vocables les plus singuliers. A ce prix, les singularités d’un sous-corpus peuvent être identifiées sans avoir à effectuer des tris discutables dans des listes trop longues. References Arnold E., Labbé C. & Monière D. (2016). Parler pour gouverner : Trois études sur le discours présidentiel français. Grenoble : Laboratoire d'Informatique de Grenoble, 2016. Labbé C., Labbé D. (1994). Que mesure la spécificité du vocabulaire ? Grenoble : CERAT, décembre 1994. Reproduit dans Lexicometrica, 3, 2001. Labbé D., Monière D. (2010). Quelle est la spécificité des discours électoraux? Le cas de Stephen Harper. Canadian Journal of Political Science, 43:1, p. 69– 86. Labbé D., Monière D. (2013). La campagne présidentielle de 2012. Votez pour JADT’ 18 531 moi ! Paris : l’Harmattan. Lafon P. (1980). Sur la variabilité de la fréquence des formes dans un corpus. Mots, 1, p. 127-165. Lafon P. (1984). Dépouillements et statistiques en lexicométrie. Genève-Paris : Slatkine-Champion. Mandelbrot B. (1957). Étude de la loi d'Estoup et de Zipf Fréquences des mots dans le discours. Apostel L et al. Logique, langage et théorie de l'information. Paris, PUF, p. 22-53. Mayaffre D. (2006). Faut-il pondérer les spécificités lexicales par la composition grammaticale des textes ? Tests logométriques appliqués au discours présidentiel sous la Vème République. Condé C., Viprey J.-M. Actes des 8e Journées internationales d'Analyse des données textuelles. Besançon : Presses universitaires de Franche Comté, II, p. 677-685. Monière D., Labbé C., Labbé D. (2005). Les particularités d'un discours politique : les gouvernements minoritaires de Pierre Trudeau et de Paul Martin au Canada. Corpus, 4, p.79-104. Monière D., Labbé D. (2012). Le vocabulaire caractéristique du Premier ministre du Québec J. Charest comparé à ses prédécesseurs. Dister A. et al. (éds). Proceedings of the 11th International Conference on Textual Data Statistical Analysis. Liège : LASLA - SESLA, p.737-751. Zipf G. K. (1935). La psychobiologie du langage. Paris : CEPL, 1974. 532 JADT’ 18 Faire émerger les traces d’une pratique imitative dans la presse de tranchées à l’aide des outils textométriques Cyrielle Montrichard ELLIADD, UBFC – cyrielle.montrichard@edu.univ-fcomte.fr Abstract The main goal of this paper is to show how textometric tools can help to reveal the imitative usage of genres. During the Great War, soldiers must not criticize the hierarchy or the governement. 
The trench press was written by and for French soldiers and features a great number of media and literary genres. Moreover, we assume that writers used a number of discursive schemes to convey implicitly their point of view on the war, the government and the "sacred union" discourse that had become mainstream in the public space from the very beginning of the war. Therefore, a corpus of this press seems to be the perfect place to investigate the notion of imitative usage of genres. To put the results given by the textometric tools into perspective, we use a sample corpus from the French national press.

Résumé
L'objectif de cette contribution est d'interroger la pratique imitative des genres médiatiques et littéraires. Pour ce faire, nous mobilisons un corpus de presse de tranchées dans lequel se déploient de nombreux genres et sous-genres. Portant notre attention tout particulièrement sur les genres des dépêches et du roman-feuilleton, nous montrons, en comparant ce corpus à un corpus échantillon de textes parus dans la presse quotidienne nationale, en quoi la presse de tranchées copie les genres instaurés dans la presse civile. La seconde partie interroge le corpus au niveau syntagmatique pour tenter de faire émerger les registres ludiques et satiriques ayant cours dans cette presse.

Keywords : presse écrite, genre, pratique imitative, première guerre mondiale, presse de tranchées.

1. Introduction

La presse de tranchées est un type de document né pendant la première guerre mondiale. Cette presse a la particularité d'être écrite par et pour les combattants (Audoin-Rouzeau, 1986). La censure ainsi que le discours doxique d'union sacrée tenant place dans l'espace public durant la période du conflit ne permettent pas aux locuteurs d'exprimer ouvertement leur opinion (Forcade, 2016). L'objectif de cette communication est de montrer comment émergent les registres ludiques et satiriques dans la presse de tranchées à travers l'inscription de discours dans des genres faisant écho à la matrice générique médiatique et littéraire. Comment repérer, à l'aide des outils textométriques, les traces discursives d'une pratique imitative des genres médiatiques et littéraires dans la presse de tranchées ? Cette communication vise à interroger la « pratique imitative », c'est-à-dire les « différentes formes ou genres qui permettent à un auteur de produire un texte (T2) attribué, sérieusement ou non, et de manière plus ou moins explicite, au modèle dont il s'est inspiré (T1) » (Aron, 2013). Pour ce faire, nous avons réuni en corpus cinq titres de presse de tranchées au format XML-TEI, pour plus de 500 000 occurrences, permettant une analyse du discours outillée. À l'aide des outils textométriques et de la plateforme TXM (Heiden et al., 2010), nous proposons de montrer comment les textes s'inscrivent dans les codes établis des genres médiatiques et littéraires et les reprennent. Ensuite, nous proposons des pistes d'analyse visant à faire émerger le registre ludique ou satirique utilisé par les rédacteurs pour détourner le genre.

2. Contexte de la recherche et présentation du corpus

Notre étude propose d'investir la notion de pratique imitative. Cette dernière est proche de l'hypertextualité et de l'imitation (Genette, 1982), c'est-à-dire la reproduction d'un style, d'une manière. En analyse du discours, D. Maingueneau (1984) a investi la notion de pastiche, confirmant que celui-ci peut s'opérer sur un genre.
For G. Genette (1982), however, pastiche is associated primarily with a playful function; since, in our study, the question between satirical and playful registers remains open, we confine ourselves to the notion of "imitative practice". To our knowledge, no work has yet examined imitative practice through tool-assisted discourse analysis. Xavier Garnerin (2009), himself a pasticheur, tries to pin down the methods of pasticheurs, which in his view lie "between analysis and intuition", a remark that captures how difficult it is for the researcher to expose systematically the links between an imitating text T2 and its model T1. We propose to put textometric tools to the test in an attempt to perceive the imitative practice of genres.

Our corpus comprises five trench-press titles published between 1915 and 1918. We set up variables to investigate genres and sub-genres (Malrieu and Rastier, 2002). The genre variable splits the corpus into two parts: the literary genre (287,184 tokens, 747 articles) and the media genre (216,534 tokens, 1,005 articles). For a fine-grained study, we also categorised the texts by sub-genre, distinguishing serialised novels, short stories, poems, etc. within the literary genre, and news briefs, filets, dispatches, échos, faits divers, etc. within the media genre. Space does not allow us to analyse each of these sub-genres separately, so we concentrate on one literary sub-genre, the roman-feuilleton, and one media sub-genre, the dépêche. To put the results into perspective, we built a sample corpus of 38 dispatches published between 1915 and 1918 in two national dailies (Le Petit Journal and Le Matin) and three serialised novels: Delly's Entre deux âmes (1912), published in L'Écho de Paris, and Gaston Leroux's Le Château noir (1914) and Confitou (1916), published in Le Matin. This sample corpus mainly serves to observe syntactic constructions and the place of morphosyntactic categories in the two sub-genres, so sample sizes are not decisive here.

3. Anchoring in the media and literary discursive moulds
In this part we show how the texts take up the codes established in the press and in literature, through the study of morphosyntactic categories and of the lexicon.

3.1. Morphosyntactic categories
The correspondence-analysis (AFC) plot below shows the distribution of morphosyntactic categories (row points, in blue) in the literary sub-corpus partitioned into sub-genres (column points, in red). Axis 1 accounts for 60.63% of the structure of the plot and appears to be organised by verb tense: on the left one finds verbs in the present and future, while on the right one finds the past tenses (simple past, imperfect). The roman-feuilleton lies on the side of the past tenses, consistent with the characteristics of a genre that uses narrative tenses. Moreover, comparing the percentage distribution of verb tenses in the trench press and in the national daily press (PQN), the tenses employed prove very close.
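The AFC just described can be reproduced in outline from a contingency table crossing morphosyntactic categories and sub-genres. The NumPy sketch below performs a basic correspondence analysis via the SVD of standardized residuals; the input table and any labels are placeholders, not the paper's data.

```python
import numpy as np

def correspondence_analysis(counts):
    """Minimal CA: rows = morphosyntactic categories, columns = sub-genres.
    Returns row/column coordinates on the first two axes plus the share of
    inertia per axis (the paper reports 60.63% for axis 1)."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2 / (sv ** 2).sum()
    row_coords = (U * sv) / np.sqrt(r)[:, None]          # POS categories (blue points)
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]       # sub-genres (red points)
    return row_coords[:, :2], col_coords[:, :2], inertia[:2]
```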
Figure 1. AFC of the morphosyntactic categories of the literary sub-corpus partitioned into sub-genres, in the trench-press corpus.
Figure 2. Tense distribution per hundred verbs in the serialised novels published in the trench press (left) and in the PQN (right).

On the media-genre side, the specificity calculation over morphosyntactic categories shows that the dispatches have a positive score for common nouns (+2), while adverbs and personal pronouns are under-used (scores of -5.4 and -8.7 respectively). These results connect directly with the characteristics of the dispatch: "[...] the author of the dispatch submits to a model of representation that must dispense with the stylistic resources proper to literature: no dialogue, no internal focalisation, no comment on the reported event." (Kalifa et al., 2011: 738). This explains the under-use of adverbs and personal pronouns, often used to introduce comment, while the objectivation of information and enunciative effacement favour nominal over verbal categories (Rabatel, 2004). Moreover, the figure below shows how close the use of morphosyntactic categories is between the dispatches of the trench press and those of the PQN.

Figure 3. Proportion of the major morphosyntactic categories used in dispatches published in the trench press (bottom) and in the PQN (top).

The distribution of morphosyntactic categories thus suggests that the civilian press and the trench press use the same categories, genre by genre.

3.2. Lexicon and repeated segments
In the press of the early twentieth century, a dispatch usually opens with a line giving the place and day of the event. The dispatches in our trench-press corpus follow this rule and take over this staging of information, as numerous place names with positive specificity show: "Londres" (+4.9), "Paris" (+4.2), "Berlin" (+2.3), etc. The PQN dispatches confirm this tendency, with an average of four place names per article. "The effacement of the author relies first on a two-stage enunciative system: either the source of the event is indicated, always referring to a neutral point of view, or the event is reported directly, without explicit mention of the source." (Kalifa et al., 2011: 738). The improvised soldier-journalists often mention a source, visible in the over-use of the graphical forms "communiqué" (score +16.5) and "dépêche" (+2). Turning to repeated segments, seven dispatches of L'Argonnaute open with "Communiqué officiel de l'intérieur téléphoné par [...]". On the civilian-press side, we find the forms "dépêche" and "annonce" (9 and 6 occurrences respectively), as well as "Havas" (17 occurrences). For the trench-press roman-feuilleton, terms again indicate respect for the staging of the novel in "chapitre" (score +49) and for the serial format, with "suite" (+37.4) and "suivre" (+22.4). Inventories of this kind can be sketched as below.
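A repeated-segments inventory can be built naively as follows; the frequency threshold is a free parameter (set to 5 here by analogy with common lexicometric practice), and real tools such as TXM or Lexico3 implement this far more efficiently.

```python
from collections import Counter

def repeated_segments(tokens, max_len=8, min_freq=5):
    """Naive inventory of repeated segments: every word n-gram of length
    2..max_len occurring at least min_freq times."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(2, max_len + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {seg: f for seg, f in counts.items() if f >= min_freq}
```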
4. Locating the imitative practice
At this stage of our study we have shown the proximity between the trench press and the PQN, but neither the study of grammatical categories nor the lexical study has brought out the playful and/or satirical registers that signal an imitation rather than a mere inscription in the genre. It therefore appears necessary to run queries broader than the lemma yet more precise than those conducted so far on morphosyntactic categories: a search at the syntagmatic level imposes itself.

4.1. Over-used syntactic constructions in the dispatches
We queried the whole corpus, partitioned by sub-genre, for syntactic constructions chaining two morphosyntactic categories. The first syntagms with positive specificity confirm what we had already observed: the category preposition followed by a proper noun scores +10.3, and a return to the text confirms that this corresponds to the presentation of the place of the event ("à Londres", "de Paris", etc.). We also find a construction that induces a passive (a present-tense verb followed by a past participle), once again indicating enunciative effacement (Rabatel, 2004). Among the positive specificities we find the combination noun followed by adjective (score +2.3); the resulting list contains 74 syntagms. Fourteen of them (19%) drew our attention, either because of the implausibility of what is said ("homme volant", "provision inépuisable") or because their presence makes no sense in the genre in which they occur ("bicyclette usagée", "cellules nerveuses", "chauffage central", "crayon ennemi"). Note also the syntagm "agence Ivile", playing on homonymy with "agence civile". Returning to the text clarifies how the writers use these syntagms, often playing on double meanings:

"Plusieurs saucisses boches (de Francfort) ont été capturées à la devanture d'un charcutier par un audacieux homme volant." (L'Argonnaute, 15 March 1916)

In 1916 the syntagm "saucisses boches" could refer to two signifieds: the pork product or the enemy projectile. The utterance is built on this ambiguity, reinforced by the presence of the noun "charcutier" and of the past participle "capturées", each pointing to a different interpretation. Likewise, the "homme volant" can be read either as a brigand who has stolen charcuterie or as a man able to fly who has captured the enemy projectiles before impact (the French "voler" meaning both "to steal" and "to fly"). This example shows how the writers, through a playful register, create complicity with a readership that shares the same references.
Another example introduces the idea of a satirical register, with criticism of the dominant discourse of the public space:

"[...] Paris, 31 avril [...] Rue du Paon-Blanc (14 h.) Paris gronde. Le régime a vécu. Vive la révolution ! Les bains de la Samaritaine sont en état de siège. Le syndicat de la Grande Presse n'autorise plus que la parution d'un bulletin relatant le Communiqué. La censure s'est tranchée la gorge avec ses ciseaux. L'héroïsme sacré fait battre les cœurs. [...] C'est l'union sacrée. Concierges, locataires et propriétaires s'embrassent aux portes des immeubles." (Le Rigolboche, 10 May 1917)

The article challenges censorship and the Parisian festivities, and also echoes the disagreements between landlords and mobilised tenants, thereby contesting the "sacred union" discourse while re-investing its very words (Authier-Revuz, 1984). Searching for syntagms thus lets us enter the corpus at text level and perceive what, in the articles, seems to subvert the genre for playful and satirical ends.

4.2. Over-used syntactic constructions in the roman-feuilleton
The serialised novel occupies an important place in the press of the nineteenth and twentieth centuries (Kalifa et al., 2011), and the conflict did not change the place of this fiction. "The war very quickly penetrated the 'rez-de-chaussée' [the serial section at the foot of the page], and the roman-feuilleton, in the form of patriotic narratives, turned into an instrument designed to sustain and intensify the population's mobilisation in favour of the war effort." (Erbs, 2016: 740). This is what the combatants, who received and read the civilian press, were given to read (Gilles, 2013). As for the dispatches, we first searched for two-token syntagms through specificities over grammatical categories; for the roman-feuilleton these searches were not fruitful, so we extended the search to three tokens. The syntagmatic construction "verb in the simple past + determiner + noun", with a score of +52, drew our attention. Of the 130 syntagms, 24 (14%) stood out. First, we spotted syntagms apparently built on fixed expressions in which one term has been altered, such as "fouilla l'horizon" or "coupa la pipe"; we also spotted syntagms that seem to make no sense, such as "revêtit l'ampleur" or "trancha les jours".

"Alors une colère terrible parut animer l'Armada toute entière. Proue baissée, les navires foncèrent sur le pirate boche... Cependant une première torpille alla frôler par bâbord le vaisseau amiral ; une deuxième, lancée trop haut, coupa la pipe du commandant qui, flegmatiquement, sortit d'un étui une cigarette qu'il ajusta au tuyau mutilé de sa pipe. [...]" ("Krotufex", Le Rigolboche, 10 December 1917)

The torpedo literally cuts the commander's pipe, whereas in such a context one would have expected him to "casser sa pipe" (to kick the bucket). This points to the playful register, with its play on the fixed expression, but certainly also to a satirical one, offering a critique of the patriotic serialised novels that describe bloody battles in which the hero never succumbs. Studying the same syntagms in the PQN roman-feuilleton sub-corpus, we find an abundance of nouns referring to body parts ("leva les yeux", "prit la main", "secoua la tête", "tendit la main"): six of the first ten syntagms have this characteristic. The body is also present in these syntagms in the trench press, but there they once again seem surrealistic and used for playful ends, copying the genre while subverting it: "cala les joues", "déchaussa son pied", "frotta la mandibule", "tomba le torse", etc.
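The syntagmatic queries of sections 4.1 and 4.2 amount to matching short sequences of morphosyntactic categories over a tagged corpus. A minimal sketch follows, with purely illustrative tag names that would have to be mapped onto the actual tagset of the tagger used.

```python
from collections import Counter

def pos_pattern_matches(tagged, pattern):
    """Collect the word sequences whose tag sequence equals `pattern`,
    e.g. ("NOUN", "ADJ") for section 4.1 or ("VERB_PS", "DET", "NOUN") for
    the simple-past pattern of section 4.2. `tagged` is a list of
    (word, tag) pairs; the tag names are illustrative only."""
    k = len(pattern)
    hits = Counter()
    for i in range(len(tagged) - k + 1):
        words, tags = zip(*tagged[i:i + k])
        if tags == tuple(pattern):
            hits[" ".join(words)] += 1
    return hits

# The per-sub-genre frequency of a pattern can then be fed to a specificity
# index (as sketched earlier) to reproduce scores such as +2.3 or +52.
```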
5. Conclusion
Our contribution aimed to investigate imitative practice with textometric tools, on a singular trench-press corpus set against a sample corpus drawn from the PQN. We first showed how the genres are imitated through the reuse of codes established in the civilian press. To bring out the traces of an imitative practice, it proved necessary to interrogate the corpus at the syntagmatic level with the textometric software TXM. In our study, this search brought out the playful and satirical registers current in the trench press, an enunciative space where the implicit and complicity matter greatly, given censorship but also the particular ties between readers and writers. It would be interesting to see whether the same methodology, applied to other texts and genres, yields similar results.

References
Aron, P. (2013). Le pastiche et la parodie, instruments de mesure des échanges littéraires internationaux. In Gauvin, L. (ed.), Littératures francophones : Parodies, pastiches, réécritures. ENS Éditions.
Audoin-Rouzeau, S. (1986). 14-18, les combattants des tranchées : à travers leurs journaux. A. Colin.
Authier-Revuz, J. (1984). Hétérogénéité(s) énonciatives. Langages, 73: 98-111.
Erbs, D. (2016). Le roman-feuilleton français et le serial britannique pendant le premier conflit mondial, 1912-1920. Doctoral thesis.
Forcade, O. (2016). La censure en France pendant la Grande guerre. Fayard.
Garnerin, X. (2009). Le pastiche, entre intuition et analyse. Modèles linguistiques, 60: 77-91.
Genette, G. (1982). Palimpsestes. Seuil.
Gilles, B. (2013). Lectures de poilus : livres et journaux dans les tranchées, 1914-1918. Éd. Autrement.
Heiden, S., Magué, J.-P. and Pincemin, B. (2010). TXM : une plateforme logicielle open-source pour la textométrie – conception et développement. In Bolasco, S. et al. (eds), Proc. of JADT 2010 (10th International Conference on the Statistical Analysis of Textual Data), pp. 1021-1032.
Kalifa, D., Régnier, P., Thérenty, M.-É. et al. (2011). La civilisation du journal : histoire culturelle et littéraire de la presse française au XIXème siècle. Nouveau Monde éditions.
Maingueneau, D. (1984). Genèses du discours. Mardaga.
Malrieu, D. and Rastier, F. (2002). Genres et variations morphosyntaxiques. Traitement automatique des langues, 42: 548-577.
Rabatel, A. (2004). Effacement énonciatif et effets argumentatifs indirects dans l'incipit du Mort qu'il faut de Semprun. Semen, 17: 111-148.

Diachronic evolution of legal-administrative terminology and phraseology in the Statutes of Autonomy of Catalonia of 1932, 1979 and 2006
Albert Morales Moreno
Università Ca' Foscari Venezia / Université de Genève – albert.morales@unige.ch

Abstract
During the first half of 2017, research was carried out at the Institut de Lingüística Aplicada of the Universitat Pompeu Fabra, thanks to a grant from the Generalitat de Catalunya's Institut d'Estudis de l'Autogovern, in order to study diachronically the Statutes of Autonomy of Catalonia (EAC) approved in 1932, 1979 and 2006. As in other countries and traditions, the negotiation of such an important law is a challenge in the historical moment in which it occurs, in both legal and political terms (see Abelló (2007) for the 1932 EAC, Sobrequés (2010) for the 1979 EAC and Serrano (2008) for the 2006 EAC).
We take lexicometrics as our analytical methodology and the communicative theory of terminology (Cabré, 1999) as the grounds of our research, in order to study the use of legal and administrative terminology with respect to the assignment of competences from a diachronic approach. Specifically, we are interested in combining the study of repeated segments with the study of specificities to identify the key terms, offices and institutions of each EAC, as well as the use of certain locutions between 1932 and 2006 in Catalan statutory discourse.

Keywords: discourse analysis, legal discourse, Catalan statute of autonomy, repeated segments, terminology, diachronic analysis

1. Introduction
This article presents a study framed within a broader project of diachronic analysis of normative drafting in Catalan. In that research, carried out thanks to postdoctoral funding from the Institut d'Estudis de l'Autogovern of the Generalitat de Catalunya, the Statutes of Autonomy of Catalonia (EAC) of 1932, 1979 and 2006 were studied through lexicological, statistical, terminological, translation-oriented and pragmatic analyses. Here we focus, from a terminological standpoint, on repeated segments, to assess whether they are a valid strategy for tracing the evolution of specialised phraseology in a specialised field such as law, through the repeated segments specific to each EAC. We also compare these units from a diachronic point of view. After an exhaustive lexicometric study of the corpus, we selected specialised terminological units (UTE) from the legal-administrative domain that help establish the competences of Catalonia in the various EAC, with terms such as competència/es, correspon or atribució/ons. The analysis starts from the statistical indices produced by the lexicometric exploration carried out with Lexico3.6, and its theoretical framework is the Communicative Theory of Terminology (Cabré et al., 1999).
2. The EAC of 1932, 1979 and 2006
A statute of autonomy should first be defined as a unit of constitutional law: the "basic institutional norm of the autonomous communities" (Diccionario del español jurídico (DEJ), Real Academia Española). Many jurists regard a community's statute as functionally "equivalent to the constitution of a member state of a federation, because it regulates the autonomous institutions, establishes the competences they must have and cannot be modified by any other law, whether autonomous or state-level: it can only be reformed by the procedure the Statute itself provides, a characteristic proper to constitutions and not to laws" (Albertí et al. 2002:111). The Statute, then, "has the rank of a state organic law, forms part of the block of constitutionality and is subject to aggravated procedures of approval and reform, and its provisions enjoy reinforced guarantees that ordinary legislation does not provide" (Pons and Pla 2007:187).

In Catalonia, the recovery of self-government began in the early twentieth century with the Mancomunitat. Within that institution, a first draft Statute of autonomy was written, although it was never debated, "because on 27 February 1919 parliamentary sessions were suspended as a consequence of the Canadenca strike" (Fontana 2014:327). Owing to the convulsive historical development of the following years and the dictatorship of Miguel Primo de Rivera, the autonomist projects stalled. Only in 1931, with the Republic, was the first EAC drafted; that text was debated in the Cortes in May 1932. Abelló states that it envisaged "the insertion of Catalonia in a federal republic" (2007:35) and describes it as "moderate" (2007:44). Despite the cuts it suffered, "it became a useful tool which, with the recovery of the Catalan institutions of self-government, would enable a legislation of Catalonia's own, even if a limited one" (Abelló 2007:187). The Generalitat de Catalunya exercised these competences only briefly: after the events of 6 October 1934 the 1932 EAC was partially suspended, and with the arrival of Franco's troops in Catalonia, Franco approved the law repealing the EAC on 5 April 1938. Under Franco's dictatorship the State was conceived from a recentralising standpoint and, as already noted, the autonomy of the communities was abolished. Only after the dictator's death on 20 November 1975 did Spain and Catalonia begin, according to Sobrequés (2010:11), the process that was to change their history: the Transition. During it, the constitutional pact of 1978 was sealed (the Constitution entered into force on 29 December of that year) and the legal foundations of the State of autonomies were laid, with an order that, through the statutes of autonomy (at least in theory), grants the autonomous governments considerable self-government. Drafting began on 8 September 1978, and the final text was approved by referendum on 25 October 1979.
At the beginning of the twenty-first century, however, a considerable sector of the Catalan social and political spectrum perceived the 1979 EAC as an exhausted model (the so-called Argullol doctrine, which proposed a less centralist rereading of the Spanish Constitution), and it was quickly established that "there is a significant number of competences which, despite being included in the Statute of autonomy, have not been the object of legislative development" (BOPC 2002:89). For that reason, after the 2003 regional elections, the three-party coalition of PSC, ERC and ICV-EUiA began, in 2004, the parliamentary procedure for statutory reform. This involved a first negotiation leading to approval in the Parlament de Catalunya on 30 September 2005, and a second negotiation for approval in the Cortes Generales (in that second phase, as shown in Morales (2015), the most significant changes were made). The final text was approved in parliament on 10 May 2006, the day on which the plenary of the Senate approved the new Statute with 128 votes in favour, 125 against and 6 abstentions. On 31 July 2006, Federico Trillo-Figueroa y Martínez-Conde, together with 98 other PP deputies, filed an appeal of unconstitutionality against most articles of the new Statute (Bosch 2013:44) because, among other reasons, "it applied the term nation to Catalonia, imposed Catalan, established a series of rights and duties that restricted the liberties of the citizens of Catalonia [...] and questioned the unity of Spain" (Segura 2013:217-218). On 28 June 2010 the Constitutional Court made public part of ruling 31/2010 on the constitutionality of the Statute, declaring some parts of the 2006 EAC unconstitutional. According to many political scientists and historians, that date is key to contemporary political history, because "it was the day of the sentimental rupture with Spain, the day on which [many Catalans] became convinced that Catalonia and the citizens of Catalonia had no place in Spain" (Segura 2013:32); for many citizens it meant the leap from autonomism to independentism, without passing through nationalism (Segura 2013:241). The corpus assembled is therefore representative for studying diachronically the evolution of statutory discourse in Catalan across the Statutes approved over the course of history. Finally, following André Salem (1991:149), this corpus constitutes a "chronological textual series", since the texts are linguistically and pragmatically comparable across a time span that supports conclusions about the evolution of Catalan statutory discourse over the last eighty years.

3. Theoretical and methodological framework
Since the restoration of the institutions of self-government, there have been numerous initiatives, both public and private, to modernise Catalan normative discourse; noteworthy are the work of the Grupo de Estudios de Técnica Legislativa (GRETEL), the Direcció General de Política Lingüística, TERMCAT, the Escola d'Administració Pública de Catalunya and the Parlament de Catalunya. The model followed is that of Québec, adopting, and adapting, the guidelines of Sparer and Schwab's Rédaction des lois : rendez-vous du droit et de la culture. According to Montolío, the opportunity was taken to renew this tradition: "A special case is that of the other official languages of the Spanish State (Galician, Basque and Catalan).
For these three languages, the renewal of legal language has been driven by an additional motivation: the will to recreate a legal tradition truncated by forty years of prohibition. Among them, the renewal of Catalan legal language stands out." (Montolío and Albertí 2012:99). For that reason, the criteria and principles on which the normalisation of Catalan legal language rests are economy, clarity and precision of expression (DGPL 1999:7).

The lack of exhaustive linguistic studies of a component of Catalan normative discourse as essential as its Statute of autonomy motivated this work. It arises from the need to analyse the successive EAC in force to date by combining textual statistics and discourse analysis in a diachronic perspective, on the basis of a consolidated discipline: corpus linguistics. As reviewed in Morales (2015:101-175), such methodologies have been used to study similar texts. To guarantee an objective, pertinent and representative selection of the units of analysis based on statistical criteria, our work starts from lexicometry, a school that has served to characterise, among others, the vocabulary of socio-political figures and of social and historical movements. Within lexicometry, our approach is a formalist one, since our basic unit of analysis is the graphical form. We then normalised the text (following methodologies such as those of Arnold (2008:110) and Menuet (2006:157)) to correct forms with grammatical or typing errors and to avoid duplicate counts due to minimal orthotypographic differences. Finally, we inserted into our corpus the structural marks required by Lexico3.6 to identify the different EAC. Of the program's many functionalities, the study of concordances, of repeated segments and of specificities yielded especially interesting results. After the first lexicometric exploration, we analysed some key terms identified through the repeated-segments analysis, to see whether this allows us to characterise the phraseology and terminology proper to the field.
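As a rough illustration of this preparation step, the sketch below normalises the text and prefixes each statute with a structural mark. The <eac=...> key syntax is only a stand-in for Lexico3-style markers and should be checked against the tool's manual (Lamalle et al., 2003); the normalisation choices are ours, in the spirit of Arnold (2008) and Menuet (2006).

```python
import re
import unicodedata

def normalize(text):
    """Light normalisation before counting: unify Unicode composition,
    apostrophes and whitespace so that trivially variant spellings are
    not counted as distinct forms."""
    text = unicodedata.normalize("NFC", text).replace("\u00ad", "")  # soft hyphens
    text = text.replace("\u2019", "'")                               # curly apostrophe
    return re.sub(r"\s+", " ", text).strip()

def tag_partitions(statutes):
    """Prefix each statute with a structural mark so the tool can build
    the 1932/1979/2006 partition; `statutes` is a list of (year, text)."""
    return "\n".join(f"<eac={year}>\n{normalize(text)}" for year, text in statutes)
```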
4. Analysis
The analysed corpus has the following main lexicometric characteristics (identifiers 01_1932, 02_1979, 03_2006):

            EAC 1932        EAC 1979         EAC 2006         Total
Tokens      4,242 (7.7%)    10,580 (19.3%)   40,011 (73.0%)   54,833 (100%)
Forms       1,009           1,766            3,457            4,226
Hapax       606             935              1,546            1,804

Given the obvious differences in size, we applied, thanks to Professor Arjuna Tuzzi, statistical techniques that take size into account in the expected representativeness and selection computations, following, among others, Tuzzi (2003:128-129) and van Gijsel, Speelman and Geeraerts (2005:1).

This part of the analysis concentrates on the already mentioned repeated segments (SR), i.e. sequences of forms repeated with a frequency above 5. The lexicometric exploration yielded 2,398 repeated segments; we focus on some of the most significant. Their distribution by length is as follows:

Length 2: 1,282 sequences (e.g. de Barcelona; les llibertats; la coordinació)
Length 3: 660 (de la Constitució; de seguretat pública; en aquest Estatut)
Length 4: 281 (les lleis de Catalunya; a les Corts Generals; el president o presidenta)
Length 5: 98 (de conformitat amb les lleis; els poders públics han de)
Length 6: 31 (correspon a la generalitat la competència; d'acord amb allò que)
Length 7: 23 (sens perjudici d'allò que disposa; el president o presidenta de la generalitat)
Length 8: 10 (els poders públics han de vetllar per la; impost sobre la renda de les persones físiques)
Length 9: 7 (en una votació final sobre el conjunt del text; en el diari oficial de la generalitat de Catalunya)
Length 10: 11 (correspon a la generalitat la competència exclusiva en matèria de)
Length 11: 4 (de l'apartat 1 de l'article 149 de; la carta dels drets i els deures dels ciutadans de Catalunya)

Of the 20 most frequent, for instance, only five were of interest for our linguistic study as units with full semantics, such as la Generalitat, de Catalunya or la competència. Besides isolating segments such as de les quals (10), els altres (23), la resta (17), les quals (18), en el termini (25) or la seva (57), which could interest stylometric or authorship-attribution research, we analyse below some of the units with higher frequency. The system identified, for example, syntagms corresponding to offices and institutions provided for in the Statutes, such as les Corts (46) (and les Corts Generals (33)), Poder Judicial (46), la Comissió Mixta d'Afers Econòmics i Fiscals Estat-Generalitat (14), l'Agència Tributària de Catalunya (10), el Consell de Justícia de Catalunya (19), el Govern (50), el President (38), el President o Presidenta de la Generalitat (26), la Unió Europea (31) and el Parlament de Catalunya (24). It thus gave good results for identifying syntagms corresponding to highly lexicalised units such as offices and institutions.

One of the most frequent SR is correspon a la Generalitat, distributed as follows in the corpus:

SR: correspon a la Generalitat   EAC 1932   EAC 1979   EAC 2006
FA                               1          9          144
FR (x10,000)                     2.4        8.5        36.0

Its use is, as can be seen, paradigmatic of the 2006 EAC (E+11), with negative specificity in the 1932 (E-05) and 1979 (E-07) EAC. As shown in Morales (2018, in press), the domain of competence attribution, of which this repeated segment is one of the most characteristic linguistic expressions, at least in contemporary statutory drafting, is among those showing the most singularities in the 2006 EAC and the most diachronic change in the corpus studied.

Another of the most frequent SR (105 occurrences) is la Constitució, distributed as follows:

SR: la Constitució   EAC 1932   EAC 1979   EAC 2006
FA                   17         42         46
FR (x10,000)         40.1       39.7       11.5

In most cases these are contexts referring to a specific article of the 1978 Spanish Constitution: formulas that serve to restrict the statutory scope and establish a cross-reference to the Spanish Carta Magna. Interestingly, the specificity analysis shows a positive specific use of this SR in the 1932 (E+04) and 1979 (E+07) EAC.
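The FA / FR rows of these tables follow directly from the subcorpus sizes. A minimal sketch (the relative frequency is per 10,000 tokens, as above):

```python
def segment_profile(seg_counts, sizes):
    """Absolute frequency (FA) and relative frequency per 10,000 tokens (FR)
    of one repeated segment in each subcorpus; both arguments are dicts
    keyed by subcorpus identifier."""
    return {k: {"FA": seg_counts.get(k, 0),
                "FR": round(10_000 * seg_counts.get(k, 0) / sizes[k], 1)}
            for k in sizes}

# With the corpus sizes of section 4:
# segment_profile({"EAC1979": 9, "EAC2006": 144},
#                 {"EAC1979": 10_580, "EAC2006": 40_011})
# -> FR = 8.5 and 36.0, matching the row for 'correspon a la Generalitat'.
```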
Other legislative cross-references identified through the study of repeated segments are aquest Estatut (96), l'article 149 (de la Constitució) (26) and Title V of the EAC itself (12). The corpus being legislative, the analysis also identified as SR numerous units of legal-administrative language following the pattern determiner + noun or noun + adjective, such as l'article, l'estatut, la legislació, una llei, llei orgànica, administracions públiques, l'administració, aquest article, comunitat autònoma, de catalunya, de seguretat, del règim jurídic, disposició addicional, domini públic, dret civil, el control, el foment, el règim, els àmbits, els articles, els deures, els mecanismes, els principis, els procediments, els processos, la llei, la llengua, la majoria, la normativa, la propietat, la salut, les activitats, les actuacions, les administracions, les administracions públiques, les comunitats, les empreses, les entitats, les iniciatives, les matèries, les normes, les organitzacions, les polítiques, les universitats, llei del parlament, polítiques públiques, règim jurídic, serveis públics, serveis socials, tributs estatals and una llei del parlament.

The aspect in which this study yielded the most interesting results is, without a doubt, that of the locutions most used in some of the EAC, in some cases with specialised use. Units we studied in depth include en matèria de/d', si escau, d'acord amb, en tot cas, en els termes que and sens perjudici. The SR en tot cas shows specificity in the 2006 EAC: its use is positively specific there (E+05) and negatively specific in the 1932 (E-04) and 1979 (E-03) EAC. Its 95 occurrences are distributed as follows:

SR: en tot cas    EAC 1932   EAC 1979   EAC 2006
FA                –          10         85
FR (x10,000)      –          9.5        21.2
Specificity       E-04       E-03       E+05

In the thesis (Morales 2015:398-400) this clause was shown to be quite common in contemporary Catalan statutory discourse, and its uses were described there. The Parlament's style guide, a basic reference for contemporary statutory drafting, defines it thus: "en tot cas: adverbial locution, equivalent to en qualsevol cas, which can be used with concessive value or in the sense of 'in all cases'. When it has the latter sense, for reasons of clarity and precision, it is preferable to replace it with sempre or en tots els casos or, where appropriate, to dispense with it." (SAL 2014:272)

Another clause identified through the SR analysis is en els termes, distributed in the corpus as follows:

SR: en els termes   EAC 1932   EAC 1979   EAC 2006
FA                  –          12         63
FR (x10,000)        –          11.3       15.7
Specificity         E-04       –          E+03

The specificity analysis indicates positively characteristic use in the 2006 EAC, while the others show either no specificity (1979 EAC) or negative specificity (E-04, in the 1932 EAC).
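Reading concordances, as done next for en els termes, can be sketched with a minimal keyword-in-context routine:

```python
def kwic(tokens, segment, span=6):
    """Keyword-in-context lines for a multi-word segment: the kind of
    concordance reading applied below to 'en els termes'."""
    target = segment.split()
    k = len(target)
    out = []
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] == target:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + k:i + k + span])
            out.append(f"{left:>45}  [{segment}]  {right}")
    return out
```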
A close reading of the concordances shows that it appears above all in contexts such as en els termes que disposin/determini/estableix and the like (en els termes establerts...). Note that the 1979 EAC shows more variety in the use of this clause (its 12 occurrences display 12 different realisations), while the 2006 EAC shows less variation: of the 63 contexts in which it appears, those accumulating the most occurrences are en els termes que estableix/estableixen/estableixi/estableixin + [les lleis, la legislació...] (41) and en els termes que determinin/determinen + [la llei orgànica, la legislació...] (7). A higher degree of fixation is thus observed. Larger corpora would have to be analysed to verify this hypothesis, but the tendency towards a more fixed statutory discourse in the 2006 EAC seems confirmed. In most repeated segments, however, we observe a different linguistic behaviour between the 1932 and 1979 EAC on the one hand and the 2006 EAC on the other. These results therefore confirm the hypothesis raised initially and corroborated by the intertextual-distance study carried out by Dr. Arjuna Tuzzi (Università degli Studi di Padova).

Another identified segment equivalent to a locution is sens perjudici, which presents the following distribution in the corpus:

SR: sens perjudici   EAC 1932   EAC 1979   EAC 2006
FA                   1          28         23
FR (x10,000)         2.4        26.5       5.7
Specificity          –          E+09       E-06

As already seen in the section devoted to the pronoun allò, this SR sometimes forms part of the locution sens perjudici d'allò que. Carles Viver Pi-Sunyer (2007:37) relates the use of this clause to the following legislative technique: "The Statute of Andalusia and the proposals of the Canary Islands and Castilla-La Mancha apply the same technique as the Statute of Catalonia, although in some cases not all the sub-matters considered exclusive in the Catalan Statute have the same status in the other three. By contrast, the statutes or draft statutes of the Valencian Community, Aragon, the Balearic Islands and Castilla y León do not identify exclusive sub-matters within material domains in which the State has so far been able to enact basic legislation; yet in other cases, as we have seen, they declare competences exclusive 'sens perjudici' (without prejudice to) basic State competences, domains in which the State can clearly set bases because the Constitution expressly says so." (Viver Pi-Sunyer 2007:37)

Although we found sens perjudici appearing 23 times in the 2006 EAC, the literature indicates that its drafting brought an innovation in legislative technique concerning precisely this clause, as Ernest Benach states: "It seems to me that [the 2006 EAC] is important for 'its new legislative technique of assuming competences, which renounces the "sens perjudici" clause and opts for a casuistic and detailed definition, within each area of competence, of sub-matters or competence profiles'. And I would add that no one can be surprised that, after twenty-five years of suffering the prejudices of the 'sens perjudici', the drafters of the proposal for the new Statute opted for a modern legislative technique that states with clarity the scope of the Generalitat's competences." (Benach 2006:20)
This change, then, proves to be a product of the modernisation of legislative drafting, aimed at obtaining in the 2006 EAC a broader shielding of competences than had been achieved with the 1979 EAC.

5. Conclusions
The study presented here belongs, as already noted, to a broader postdoctoral research project carried out during the first half of 2017 at the Institut Universitari de Lingüística Aplicada of the Universitat Pompeu Fabra, funded by the Institut d'Estudis de l'Autogovern de la Generalitat de Catalunya. In that study we carried out several linguistic analyses (lexical richness, intertextual distance, specificities, etc.) of a corpus of Catalan legal discourse comprising the Statutes of Autonomy of Catalonia approved in 1932, 1979 and 2006. As noted, we analysed the repeated segments (SR) generated by the Lexico3.6 lexicometric analysis. Since it returned 2,398 results, many of which were not representative for studying the evolution of normative discourse from a discourse-analysis standpoint, we chose to analyse qualitatively some of the SR showing specificity in one of the subcorpora. The study also identified the lexical and terminological units most used in Catalan statutory drafting, as well as the institutions and offices regulated in the EAC. We found that Correspon a la Generalitat is an SR specific to the 2006 EAC which has become, as analysed in Morales (2018, in press), one of the formulaic structures most used in the drafting of laws in Catalan. Likewise, while in the 2006 EAC the syntagm la Constitució shows negative specificity, in the other two EAC it is used above statistical expectation; broader research will be needed to understand this evolution in Catalan statutory drafting. The area where the present research proved most useful is the identification of locutions, in some cases used as specialised knowledge units (UCE, in the terminology of Cabré (1999)). The most characteristic, positively, of the 2006 EAC are en tot cas and en els termes que, whereas sens perjudici tended to be used more in the drafting of the 1979 EAC; the literature identifies the motivations for these changes. Thus, by crossing two lexicometric analyses obtained with Lexico3.6 (repeated segments and specificities), this study identified linguistic units (locutions, terms and polylexical units of statutory and legal-administrative discourse, as well as offices and institutions) that evolved in Catalan normative discourse over the period 1932-2006. In future research we will extend this study of n-grams to phraseological units and formulaic structures, which appear likely to yield interesting results for describing Catalan statutory discourse from a chronological perspective.

Bibliography
[BOE] Boletín Oficial del Estado (1978). Constitución española. Madrid: Agencia Estatal Boletín Oficial del Estado.
"Moció 187/VI del Parlament de Catalunya, sobre l'exercici de l'autogovern." Butlletí Oficial del Parlament de Catalunya. 366. Barcelona: Parlament de Catalunya, 2002. 89. [DGPL] Direcció General de Política Lingüística. Criteris de traducció de textos normatius del castellà al català. Barcelona: Generalitat de Catalunya. Departament de Cultura, 1999. [SAL] Serveis d’Assessorament Lingüístic. Llibre d’estil de les lleis i altres textos del Parlament de Catalunya. Barcelona: Parlament de Catalunya, 2014. Abelló Güell, Teresa. El debat estatutari del 1932. Barcelona: Parlament de Catalunya, 2007. Albertí, Enoch, et al. Manual de dret públic de Catalunya. Barcelona: Generalitat de Catalunya. Institut d'Estudis Autonòmics, 2002. Arnold, Edward. "Le sens des mots chez Tony Blair (people et Europe)." JADT 2008: actes des 9es Journées internationales d’Analyse statistique des Données Textuelles, Lyon, 12-14 mars 2008: proceedings of 9th International Conference on Textual Data statistical Analysis, Lyon, March 12-14, 2008. Eds. Heiden, Serge, Bénédicte Pincemin and Liliane Vosghanian. Lió: Presses Universitaires de Lyon, 2008. 109-19. Benach, Ernest. L'Estatut: una aposta democràtica i moderna: Barcelona, 7 de novembre de 2005. Barcelona: Parlament de Catalunya, 2006. Bosch, Jaume. De l'Estatut a l'autodeterminació: esquerra nacional, crisi econòmica, independència i Països Catalans. Barcelona: Base, 2013. Cabré Castellví, M. Teresa. La terminología. Representación y comunicación. Elementos para una teoría de base comunicativa y otros artículos. Sèrie Monografies, 3. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, 1999. Fontana, Josep. La formació d'una identitat. Una història de Catalunya. Vic: JADT’ 18 555 Eumo Editorial, 2014. Lamalle, Cédric, et al. Manuel d'utilisation. Lexico3 (Version 3.41 - Février 2003). París: SYLED–CLA2T. Université de la Sorbonne nouvelle–Paris 3, 2003. Menuet, Laëtitia. "Le discours sur l’espace judiciaire européen: analyse du discours et sémantique argumentative." Université de Nantes, 2006. Montolío, Estrella, and Enoch Albertí. Hacia la modernización del discurso jurídico: contribuciones a la I Jornada sobre la Modernización del Discurso Jurídico Español. Barcelona: Publicacions i Edicions de la Universitat de Barcelona, 2012. Morales Moreno, Albert. "Estudi lexicomètric del procés de redacció de l’Estatut d’Autonomia de Catalunya (2006)." Tesi doctoral no publicada. Universitat Pompeu Fabra, 2015. Pons, Eva, and Anna M. Pla. "La llengua en el procés de reforma de l'Estatut d'autonomia de Catalunya." Revista de Llengua i Dret.47 (2007): 183-226. Real Academia Española. Consejo General del Poder Judicial. "[DEJ] Diccionario del español jurídico." Madrid. Salem, André. "Les séries textuelles chronologiques (1)." Histoire et mesure.VI1/2 (1991): 149-75. Salem, André, M. Teresa Cabré, and Lydia Romeu. Vocabulari de la lexicometria: català, castellà, francès. Barcelona: Centre de Lexicometria, Divisió de Ciències Humanes i Socials, 1990. Segura, Antoni. Crònica del catalanisme: de l'autonomia a la independència. Barcelona: Angle Editorial, 2013. Sobrequés, Jaume. L'Estatut de la Transició: l'Estatut de Sau (1978-1979). Barcelona: Parlament de Catalunya, 2010. Tuzzi, Arjuna. L’analisi del contenuto. Introduzione ai metodi e alle tecniche di ricerca. Roma: Carocci, 2003. van Gijsel, Sofie, Dirk Speelman, and Dirk Geeraerts. "A Variationist, Corpus Linguistic Analysis of Lexical Richness." 
Viver Pi-Sunyer, C. (2007). Les competències de la Generalitat a l'Estatut de 2006: objectius, tècniques emprades, criteris d'interpretació i comparació amb els altres estatuts reformats. In Viver i Pi-Sunyer, C. et al. (eds), La distribució de competències en el nou Estatut. Barcelona: Institut d'Estudis Autonòmics, pp. 13-52.

How should the search for a sign be designed for a multilingual, multimodal written French / French Sign Language platform?
Cédric Moreau
Grhapes EA 7287 - INS HEA - UPL – cedric.moreau@inshea.fr

Abstract
This article examines access to the signs of French Sign Language (LSF) within a corpus taken from the collaborative platform Ocelles, from a bijective multilingual French/LSF perspective. There is currently no monolingual dictionary in any sign language (SL), so deaf users must master the written language of the country in order to access SL content. Most of the available tools rest on a hypothetical conceptual relationship of equivalence between the signs of SLs and the words of the dominant vocal languages. This approach originates in studies that ask deaf speakers to translate a vocal-language lexeme, out of context, into the signed language. Such corpora are subsequently used to build inventories of minimal pairs, in which handshapes, locations and movements are widely represented. This approach thus anchors a phonological hypothesis of SLs in which the above "sign - word" equivalence is dominant and decisive in the design of dictionaries. Our study lies within a completely different paradigm: the semiological model, which stems from the description of a typology and the identification of the three main transfer structures (size-and-form, situational, and personal). According to Cuxac, the signer can thus "make visible" the experience by relying on maximal resemblance between the sign sequence and the experience, or use the lexical unit, which bears no resemblance to the referent. This model, which is also integrative, takes into account the diachronic link existing within the language, under the influence of pressures between transfer structures and lexical units. The morphemic approach to the study of lexical units is in this case legitimate, since their compositionality does not rely on a strict phonology but, in the first place, on a complex morphology. We first present our paradigm and the origins of the Ocelles multilingual and multimodal platform (written, oral and signed languages), on which our written French/LSF corpus is built. We then describe a process likely to enable users to search for an LSF signifier and to link this result to that of the corresponding written French signifier.
Keywords: collaborative platform, multilingualism, multimodality, French Sign Language, LSF, deaf, sign search, semiological model, Ocelles

1. Introduction
When a sign-language speaker wants to access a resource in his or her own language, notably to look up a definition in a sign-language (SL) dictionary, two obstacles arise. The first is that very few of the tools presented as digital sign-language dictionaries are more than lexicons. Of 105 sites inventoried on the web, a majority use the label "dictionary", yet only 17 of them provide written definitions, and of those 17 only 7 give definitions in SL. The number of genuine SL dictionaries is therefore extremely small; moreover, the number of definitions does not exceed 5,000, very far from the 135,000 offered by the online Larousse dictionary (Moreau, 2012). The second obstacle concerns the difficulty, for deaf users, of accessing the very content of such a dictionary. In the vast majority of cases, the proposed entries are closely tied to knowledge of the country's written language. Mastery of that language is thus a prerequisite, which constitutes a major obstacle for deaf people whose first language is an SL and whose written language, often poorly mastered, is a second language. Of the 7 sites mentioned above, only 2 offer an entry point via the linguistic parameters of the SL (Moreau, 2012).
This question takes on particular resonance when we consider how SLs are transmitted. Transmission is not hereditary: about 95% of deaf people have hearing parents who, for the most part, do not sign. The language is therefore learnt in varied contexts, at any age, often without the stable reference of a close adult. The remaining 5% or so are deaf children of deaf parents, parents who themselves mostly belong to the previous category and come from hearing families; only 0.02% of the signing deaf population has a genealogy of three successive generations of deaf signers. The norms of SL acquisition therefore cannot be compared with those of hearing people (Cuxac and Pizzuto, 2010). Moreover, French Sign Language (LSF), marked by more than a century of prohibition as a language of instruction, has been recognised as a language of the Republic only since 2005. It is in this context that the collaborative, multilingual and multimodal Ocelles project was born (https://ocelles.inshea.fr, a project under the aegis of and with support from the Délégation générale à la langue française et aux langues de France (DGLFLF) and the Ministère de l'Éducation nationale), with the ambition of defining all concepts, in every field of knowledge and in every language, written, oral or signed (Moreau, 2017).

2. Two paradigms in confrontation
2.1. A phonological hypothesis for SLs
From a study of preschool communication between deaf toddlers and their hearing environment, Susan Goldin-Meadow identified the creation of gestures called "home signs" (Goldin-Meadow and Mylander, 1991; Goldin-Meadow, 2003). In trying to communicate with those around them, these children produce such gestures in the perceptual-practical world. These productions support the hypothesis of pre-linguistic conceptual stabilisations, unlike the productions of hearing children of the same age, for whom no such link between language and perceptual-practical knowledge exists. Once in school, these children then come into contact with an institutionalised sign language. According to Goldin-Meadow, insofar as the signifying forms of institutionalised sign languages have phonological status, the components of the child's home signs would then lose their morpheme status and become phoneme equivalents. This hypothesis can be taken as the starting point of the confrontation between two paradigms: iconicity is then compared to illustrative co-verbal gesture, relegated to the rank of pantomime outside any linguistic phenomenon. Most current sign-language "dictionaries" fall within this paradigm. Their entries are mostly defined from a hypothetical conceptual equivalence between the words of the dominant vocal languages and the lexematic units (UL) of signed languages (Fusellier-Souza, 2006). This methodology originates in studies that ask deaf speakers to translate a vocal-language lexeme, out of context, into the signed language.
The corpus thus collected is then used to build an inventory of minimal pairs in which handshapes, locations and movements are widely represented (Klima and Bellugi, 1979).

2.2. A morphemic hypothesis of SL
Our work belongs to an entirely different paradigm, in which the consequence of deafness is no longer a mere change of channel. The possibility of both saying and showing, specific to the visuo-gestural channel, has given sign languages an architecture different from that of vocal languages. According to Cuxac (2000), two discursive strategies of enunciation coexist in LSF: through the visuo-gestural channel, the signer chooses either to say without showing or to say while showing. He can thus "give to see" the experience by relying on the maximal resemblance between the sign sequence and the experience, or use an LU bearing no resemblance to the referent. The semiological model (Cuxac and Pizzuto, 2010) originates in the description of a typology and in the identification of the three main transfer structures:
- the volumes of entities (size-and-shape transfers (TTF));
- the movements of actants relative to stable locatives, in the image of a four-dimensional environment (the three spatial dimensions plus time) recreated in front of the signer (situational transfers (TS));
- the entity intended by the signer, who then becomes that entity (personal transfers (TP)) (Cuxac, 2000; Sallandre, 2003).

Real or imaginary experiences are thus anamorphosed by the signer. The semiological model therefore takes into account the diachronic link that exists within the language under the pressure exerted between transfer structures and LU, a link sometimes found in the etymology of certain LU. A morphemic approach to the study of LU is legitimate in this case, since their compositionality derives not from a phonology in the strict sense but, first and foremost, from a complex morphology. When a sign (transfer or LU) is produced, the signer's whole body takes on semantic value through the organization of the morphemic elements that compose it: gaze, facial expression, posture, head orientation, handshape, movement, location (Stokoe et al., 1965), orientation (Friedman, 1977; Liddell, 1980; Moody, 1980; Yau, 1992).

3. Salient elements in sign search for a multilingual and multimodal written French / LSF platform
3.1. Context of a sign search in a bilingual written language / SL corpus
As contributions accumulate, the collaborative Ocelles project links definitions of concepts to several signifiers, whether textual, spoken or signed. Entries are not constrained by the language of origin, and the architecture unfolds as users contribute. A textual entry may therefore take the form of a word or of a phrase when the deposited item originates in a transfer structure of the sign language. Current work therefore concerns the kind of sign indexing needed to support the search for a sign within a bilingual written language / SL corpus.

3.2. Automating the indexing
A sign is indexed via its corresponding textual entry.
There is as yet no automatic indexing of a dynamic collaborative corpus of SL signs that could serve as the basis of a search engine for an LU or a transfer operating directly on the linguistic parameters of SL. The very nature of the video signal, which is highly complex to analyse, rules out automatic indexing. Beyond the loss of three-dimensional information caused by projecting 3D space onto the 2D space of video, such work would require fine-grained tools for analysing and recognizing the various bodily components involved in parallel, at very different spatial and temporal scales; the available tools were developed for vocal languages, which are linear and single-source, not for SL (Braffort and Dalle, 2012).

3.3. Current situation and limits
Today, access via the linguistic parameters of SL signs relies mostly on the handshape. Of the 105 surveyed sites offering SL signs, only 18 allow direct access to a sign from the linguistic parameters of the sign language, without recourse to a written language. Of these 18, 17 offer entry by handshape (the number of such manual entries ranges from 9 to 211 depending on the site), 6 by movement, 10 by location, and 1 by symmetry, lip pattern and facial expression (Moreau, 2012). This phonological indexing of SL, with such a spread in the number of possible handshapes (from 9 to 211), raises the question of how to handle the potential error of a user looking for a sign perceived in discourse (the majority of cases, given the oral character of SL). Moreover, choosing among 211 handshapes, the user has one chance in 211 of picking the right one, and 210 chances in 211 of being wrong.

3.4. Description and search criteria
Indexing therefore cannot rest on a strictly phonological approach alone and must accommodate possible errors. Our hypothesis is that, in a morphemic approach, certain linguistic units involved in the formulation of a sign are salient for the signer (Moreau, 2012). Our approach is based on collaborative indexing, which captures the perceptions of the signers themselves. The process is the following:
- identification of the type(s) of transfer (TS / TP / TTF) used in producing the sign, or, for a lexematic unit, of the transfer in which it may originate;
- iterative selection of key images on which a description of the salient linguistic units is based (Thom, 1988);
- a finer description of the retained units is then proposed.

Even if linguistic structures cannot today be assumed to be familiar to all contributors, they cannot be ignored either. Two approaches are envisaged. The first, inherent in the platform's primary purpose, consists in offering definitions of these concepts so as to familiarize users progressively with their use; a succession of possible, increasingly precise anamorphoses is then proposed. This approach is consistent with the use of any tool requiring a minimum of prerequisites, just as a dictionary requires the alphabet. The second approach takes these gaps into account by inscribing the process in a continuum, allowing a contribution based on the selection and then the description of images that are, from the user's point of view, representative of the sign. It is thus the whole set of macro-to-microscopic descriptions supplied by each contributor that serves as the basis for weighting the salient linguistic units. These data are then reused as search criteria for a sign.
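A minimal sketch of how such collaborative weighting could work. All names, the description format and the scoring scheme are hypothetical illustrations, not the Ocelles implementation:

```python
from collections import Counter

# Hypothetical contributor descriptions of one sign: each contributor
# records the transfer type(s) perceived and the linguistic units found
# salient in the key images he or she selected.
contributions = [
    {"transfers": ["TTF"], "units": ["handshape:flat", "location:chest"]},
    {"transfers": ["TTF", "TS"], "units": ["handshape:flat", "movement:arc"]},
    {"transfers": ["TTF"], "units": ["handshape:flat", "location:chest"]},
]

def weight_units(contribs):
    """The more contributors mention a unit, the more salient it is
    assumed to be; weights are normalized mention frequencies."""
    counts = Counter()
    for c in contribs:
        counts.update(c["transfers"])
        counts.update(c["units"])
    total = sum(counts.values())
    return {unit: n / total for unit, n in counts.items()}

def score(query_units, weights):
    """Rank candidate signs by summed weights, so a user mistaken about
    one parameter can still retrieve the intended sign."""
    return sum(weights.get(u, 0.0) for u in query_units)

weights = weight_units(contributions)
print(score(["handshape:flat", "location:chest"], weights))
```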
Conclusion

ADT and visualization: a new way of reading corpora
The second-round debates of the French presidential elections (1974-2017)
Jean Moscarola¹, Boris Moscarola²
¹ Université Savoie Mont Blanc, ² Le Sphinx-Développement

Abstract
The progress of textual data analysis leads from a statistical and lexical description of corpora to their semantic analysis. Software thus offers qualitative researchers the opportunity to ground their interpretations in surrogates that summarize the corpus, or to code the corpus automatically. Finally, data visualization offers the reader an experience of the corpus, creating the conditions for critical control. This approach is illustrated by the analysis of the second-round presidential debates, carried out with DataViv, the new Sphinx module.

Keywords: discourse analysis, lexical statistics, semantic analysis, data visualization, Sphinx software

1. Introduction
ADT (textual data analysis), born of an encounter between literary research and statistics, has moved from the study of great works to that of mass media and political communication. With big data and the semantic web, it has been enriched by the new tools of AI and now addresses all types of corpora. In the human sciences, content analysis developed at the junction of purely qualitative research and quantitative methods, but with no explicit relation to ADT. This paper addresses researchers and qualitative analysts who remain reluctant to use ADT tools. It draws on the corpus of the second-round presidential debates and uses Sphinx's new DataViv application to illustrate a new reading experience.

2. Methods and techniques
2.1 From digital humanities to artificial intelligence
Computers have long been used to digitize the great corpora of literature (Frantext). The 1960s thus saw the birth of the digital humanities (Burdick) and the use of statistics to characterize the style of great authors or to attribute anonymous works to them (Muller). Then, in the 1970s, statisticians founded the French school of textual data analysis, which resonated with structuralism and discourse analysis (Beaudouin).
In the 1960s, another path was opened in the United States with the construction of computerized thesauri (Stone) used to code the content of mass media. These approaches are at the origin of the techniques presented below. They were enriched in the 2000s by progress in linguistic engineering and natural language processing (Véronis).

2.2 Textual data analysis
The statistical examination of texts has evolved from counting words to studying their associations. In the tradition of concordancers, the way was opened to the search for repeated segments (Lebart) that stud political (Marchand) or advertising (Floch) discourse. Computer graphics, cognitive maps (Eden) and word clouds give a visual representation of these concordances. The influence of contexts and the search for lexical specificities complement these global descriptions (Brunet, Lebart). Factorial analysis methods (Benzécri) offer a synthesis between the rigidity of repeated segments and the disorder of word clouds. By revealing affinities between frequently associated terms, they provide a structural analysis of texts, popularized by factorial maps that lay out the lexical universes revealing the themes of a text; it is then up to the analyst to read them semiotically. Dually to the identification of lexical universes, Reinert proposed grouping units of meaning (answers, sentences or word sequences, etc.) to create a partition from several factorial analyses used to progressively define homogeneous classes. This method, implemented in the ALCESTE software that gave it its name, has been taken up and enriched by other packages (IRaMuTeQ, Sphinx). Similar approaches are found among English-speaking researchers. Latent semantic analysis (Landauer) shifts attention from the observation of co-occurrences to the search for latent dimensions measured by factorial axes. Frame analysis, formulated by Goffman, interprets the use of certain key words and their relations as "diffuse conceptualizations"; such frames are one way of interpreting lexical universes.

2.3 Linguistics
Originally, software recognized only graphical forms (sequences of letters containing no separator), without managing to distinguish singular from plural or the different inflections of the same verb. Lemmatization was a major step forward: the various spellings of a word are replaced by its lemma, the infinitive for verbs, the masculine singular for nouns and adjectives. The analysis of morphosyntactic properties then made it possible to distinguish "full words" by grammatical status: nouns give the objects of texts or speeches, adjectives the judgements and opinions, verbs the actions. The search for syntagms identifies the expressions specific to a domain, the most expressive forms of concordances (Mayaffre).

2.4 Semantics
Semantics is concerned with meaning, moving from the level of signifiers to that of signifieds. Despite their theoretical interest, works in general linguistics did not lead to applications; it is corpus linguistics that marks the real rise of semantic analysis. The idea is to model the knowledge of particular domains as signifieds defined by the set of signifiers attached to them (Saussure).
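A minimal sketch of this idea of coding a text against categories defined extensionally by their attached signifiers. The categories and word lists are invented for illustration; the real systems discussed next rely on large curated resources:

```python
# Hypothetical thesaurus: each conceptual category (signified) is
# documented by a set of words (signifiers).
thesaurus = {
    "employment": {"emploi", "chômage", "salarié", "travail"},
    "citizenship": {"citoyen", "vote", "république", "nation"},
}

def code_text(tokens, thesaurus):
    """Count, for each category, how many tokens document it."""
    return {category: sum(token in words for token in tokens)
            for category, words in thesaurus.items()}

tokens = "le chômage pèse sur chaque salarié et chaque citoyen".split()
print(code_text(tokens, thesaurus))
# {'employment': 2, 'citizenship': 1}
```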
As early as the 1960s, the General Inquirer project developed at Harvard computational resources for automatically coding media content; its dictionaries are still available. WordNet®, the large lexical database of English developed at Princeton University, generalizes this approach, improving the efficiency of dictionaries through semantic networks; WordNet can be regarded as a generalized thesaurus reflecting the corpora on which it is built. These ideas have been taken up by semantic engines. In the 2000s, linguistic engineering and natural language processing (Normier) went beyond the purely lexical approach by specifying thesauri (Da Silva) with ontologies (Gruber) and semantic networks (Godard). The thesaurus defines the tree of conceptual categories: the signifieds. Ontologies consist of the lists of words that document these categories: the signifiers. Semantic networks specify the assignment of terms to thesaurus categories according to the links observed in reference corpora: the referents. With the rise of social networks, it finally became essential to grasp the tone of messages capable of making or unmaking reputations. The 2010s thus saw the appearance of natural language processing applications for synthesizing the opinions expressed on the web, which became known under the names "sentiment analysis" and "opinion mining" (Thelwall). These analyses complement the recognition of thesaurus categories by rating texts according to their positive or negative orientation on a scale comparable to an opinion measure. Textual data analysis has thus evolved from a descriptive, statistical and lexical approach to a semantic approach founded on knowledge modelling. Made very accessible by software (Boughzala), it is a resource for qualitative research, as we now illustrate on a political corpus.

3. Contributions of ADT to corpus analysis
3.1 The example of the second-round debates
The analysis of political discourse is an ADT classic (Marchand, Mayaffre). Debate transcripts, analysed at different levels (speakers, turns, sentences), are treated as data to reveal the style, lexical structures, ideas and opinions that characterize them. The corpus of the 7 second-round debates covers 43 years of political life, from 1974 to 2017. It is analysed at https://www.sphinxonline.net/debats/1974-2017/analyse.htm, which presents in detail what this article can only sketch. Our aim is to illustrate the methods just described and to discuss their relevance for qualitative research. The reader is invited to explore the site first-hand, an experience richer than the overview that follows:
- The candidates' statements are precise: definite articles appear in 2 sentences out of 3. The shifters "je" and "vous" are used more frequently than "nous".
- The expressions "premier ministre", "assemblée nationale", "pouvoir d'achat", "général de gaulle" and "milliard d'euro" dominate over the whole period.
- The map of lexical universes shows an opposition between the evocation of political life on one side and the vocabulary of economy and society on the other.
- Of the 11 themes identified by automatic classification, the themes "Government, Majority", "Country, French", "Year, Nuclear" and "Company, Employee" come first.
- The main concepts recognized by the thesaurus of the application used¹ are "Vote", "Civilization", "Employment and wages", "Fiscal policy", "Citizenship"...
- The tone is neutral in half of the interventions; among the rest, positive stances are slightly more frequent.

Reference to the candidates and periods completes this global description.
- To each his style: Jospin, Royal and Mitterrand stand out for their use of "je"; Chirac for the more collective "nous"; Marine Le Pen addresses her opponent ("vous") when she is not addressing the audience; Macron shows the most balanced usage.
- The key words over-represented in each period clearly mark the change of century: "politique", "gouvernement", "problème" in the 20th century; "entreprise", "emploi", "européen" in the 21st.
- The thematic categories of the lexical classification are associated with groups of candidates: Sarkozy, Royal and Hollande develop the themes "Company, Employee", "Law", "Crisis, Priority" and "Power, President"; Mitterrand and Giscard d'Estaing "Socialist, Communist" and "Government, Majority"; Macron and Le Pen "Unemployment, Employment" and "French, Country".
- Finally, the concepts of the semantic analysis clearly distinguish the periods: "Vote", "Civilization", "Degree of liberalism" in the 20th century; "employment", "citizenship", "fiscal policy" in the 21st.

¹ The Larousse thesaurus (Péchon, 1994) integrated into SphinxIQ2.

3.2 Contribution to purely qualitative analysis
These results, described at greater length in the online application, can be used in the spirit of purely qualitative research provided they are taken as part of a descriptive and exploratory process whose value lies solely in the researcher's ability to read and interpret them (Moscarola). The key words, clouds, maps, classifications and concepts proposed by the software are surrogates of the corpus. They bear the trace of the mental models (Johnson-Laird) and of the representations and social influences addressed by speech act theory (Austin) and by sociolinguistics. ADT provides a kind of radioscopy of these and helps to understand them better. It also allows a distanced reading that escapes the risk of recursivity (Dumez), or makes it possible to control that risk: the lexical or semantic surrogates on which the researcher bases his interpretations can be communicated, exposing his reading to the criticism of a discussion grounded in shared elements.

3.3 Contribution to content analysis
ADT can also be seen as a modality of traditional content analysis (Berelson, Bardin). It differs from it by the automatism of an "artificial reading" that identifies categories established statistically, by learning, or from a thesaurus. One thus recovers the inductive approach that interprets a posteriori the structures revealed by factorial analyses, or recognizes in the corpus the concepts of the thesaurus. Each unit of meaning can then be coded in a variable that "measures" meaning and can be used with the classical procedures of quantitative analysis.
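A minimal sketch of this coding step. The theme lexicons and speaker data are invented for illustration; an actual analysis would take its classes from a Reinert-style classification or from a thesaurus:

```python
from collections import Counter

# Hypothetical theme lexicons derived from a lexical classification.
themes = {
    "Company, Employee": {"entreprise", "salarié", "emploi"},
    "Government, Majority": {"gouvernement", "majorité", "ministre"},
}

def code_unit(tokens):
    """Assign a meaning unit (e.g. one sentence) to its best-matching theme."""
    scores = {t: sum(tok in lex for tok in tokens) for t, lex in themes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

# Once coded, each sentence is one observation of a categorical variable
# that can be cross-tabulated with speaker, year, and so on.
corpus = [
    ("Mitterrand", "le gouvernement et sa majorité ont échoué".split()),
    ("Sarkozy", "chaque entreprise doit créer un emploi".split()),
]
print(Counter((speaker, code_unit(tokens)) for speaker, tokens in corpus))
```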
In our example, one can thus look for the lexical or semantic elements that explain or discriminate the candidates' political affiliations.

3.4 Back to the text and data visualization
Despite its practical and scientific interest, lexical or semantic ADT carries two major risks: the systematic error to which machine reading exposes the analyst, and the abusive reduction imposed by the researcher's choices, whether in the framing of the problem or in the results he chooses to report. The first risk can be avoided by returning to the text for a verification reading. This is the only way for the researcher and his reader to check the meaning of the lexical elements or the relevance of the concepts and evaluations identified by the semantic engines. This possibility appears with hypertexts, and it is all the more necessary as infographics (word clouds, maps) make the representations ever more eloquent. So-called navigation methods facilitate this return to the text and can be enriched by entries coming from the lexical and semantic codings or from elements of the visual representations. Generalized lexical navigation, in the spirit of data visualization (Faulx-Briole), thus gives the reader direct access to the verbatims associated with the words of a cloud or a map, with the categories of an automatic classification, or with the concepts and evaluations of a semantic analysis. For example: which verbatims correspond to the use of the words "gens" or "français", do they lean left or right, what does the concept "citizenship" cover, and is it tied to a period or specific to certain candidates? Back to the text, but to the context as well: the analysis of political discourse has been a pioneer in this field. On 15-03-2012, Le Monde published a dynamic infographic giving access to the candidates' campaign speeches (Véronis). The observatory of political discourse (Mayaffre) is another example: starting from a word cloud synthesizing the content of the speeches, it details their meanings through verbatims and specifies their usage by candidate. With this type of application, the qualitative researcher can complement the communication of his results and interpretations by giving access to the corpus through an interactive navigation offered to the reader, who can then check the author's interpretations and extend them with his own conjectures. This is what we offer at https://www.sphinxonline.net/debats/1974-2017/analyse.htm, which presents the surrogates and syntheses leading to the conclusion that a profound transformation of political debate began at the turn of the century. The reader can test these tendencies, feed a critical discussion or prompt new explorations and conjectures. The software thus produces results while giving the reader the possibility of discussing them, which is the essence of the scientific approach.

Bibliography
Bardin, L. (1977). L'Analyse de contenu. PUF.
Beaudouin, V. (2016). Retour aux origines de la statistique textuelle : Benzécri et l'école française de l'analyse de données. JADT 2016.
Benzécri, J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker.
Berelson, B. (1952). Content Analysis in Communication Research. Glencoe: Free Press.
Boughzala, Y., Hervé, H. and Moscarola, J.
(2014). Sphinx Quali : un nouvel outil d'analyses textuelles et sémantiques. JADT, Université de Paris.
Brunet, E. (2016). Apports des technologies modernes à l'histoire littéraire. HAL.
Burdick, A., Drucker, J. et al. (2012). Digital Humanities. MIT Press.
Da Silva, L. (2006). Thésaurus et systèmes de traitement automatique de la langue. Documentation et bibliothèques.
Dupuy, P.-O. and Marchand, P. (2016). Les débats de l'entre-deux-tours de l'élection présidentielle française (1974-2012). Mots. Les langages du politique.
Eden, C. (1988). Cognitive mapping. European Journal of Operational Research.
Faulx-Briole, A. (2017). Datavisualisation et tableaux de bord interactifs. Solution Business.
Floch, J.-M. (1988). The contribution of structural semiotics to the design of a hypermarket. International Journal of Research in Marketing, 4(3), Semiotics and Marketing.
Goffman, E. (1974). Frame Analysis: An Essay on the Organization of Experience. Harper and Row.
Gruber, T. (1992). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human-Computer Studies.
Johnson-Laird, P. N. (1983). Mental Models: Toward a Cognitive Science of Language, Inference and Consciousness. Harvard University Press.
Landauer, T. K., Foltz, P. W. and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes. Routledge.
Lebart, L. and Salem, A. (1988). Analyse de données textuelles. Dunod.
Marchand, P. (2016). Les représentations sociales dans le champ des médias. In G. Lo Monaco, S.
Mayaffre, D. (2005). Analyse du discours politique et logométrie : point de vue pratique et théorique. Langage et société, 114.
Mayaffre, D. (2014). Plaidoyer en faveur de l'analyse de données co(n)textuelles. Parcours cooccurrentiels dans le discours présidentiel français. Actes JADT, Nice.
Moscarola, J. (2018). Faire parler les données. Éditions EMS.
Muller, C. (1979). Étude de statistique lexicale. Le vocabulaire du théâtre de Pierre Corneille. Paris: Slatkine.
Normier, B. (2007). L'apport des technologies linguistiques au traitement et à la valorisation de l'information textuelle. ADBS.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2), pp. 187-198.
Stone, P. J., Dunphy, D. C., Smith, M. S. and Ogilvie, D. M. (1966). The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
Thelwall, M. (2017). Sentiment Analysis for Small and Big Data. SAGE.
Véronis, J. (2014). Le traitement automatique des corpus oraux. In Traitement automatique des langues. Hermès.

A conversation analysis of interactions in personal finance forums
Maurizio Naldi
University of Rome Tor Vergata – maurizio.naldi@uniroma2.it

Abstract
Interactions on a personal finance forum are investigated as a conversation, with post submitters acting as speakers. The presence of dominant positions is analysed through concentration indices. Patterns in replies are analysed through the graph of replies and the distribution of reply times.

Keywords: Personal finance; Conversation analysis; Concentration indices.

1. Introduction
Decisions concerning personal finance are often taken by individuals not just on the basis of factual information (e.g., a company's official financial statements or information about the past performance of funds), but also considering the opinions of other individuals. Nowadays, personal finance forums on the Internet have often replaced friends and professionals in that role.
In those forums, the interaction occurs among people who typically do not know one another personally and have very little personal information (if any) about the other participants. Nevertheless, they often create online communities that can bring value to all participants [1]. Examples of such forums are SavingAdvice (http://www.savingadvice.com/forums/) or Money Talk (http://www.money-talk.org/board.html). The actual influence of such forums on individuals' decisions has been investigated in several papers, considering, e.g., how the level of activity on forums impacts stock trading levels [2], how participation in such forums pushes towards more risk-seeking behaviour [3], or introducing an agent-based model to determine how individual competences evolve through the interaction [4]. It has been observed that such forums may be employed by more aggressive participants to manipulate more inexperienced ones [5], establishing a dominance over the forum. In addition to being undesirable for ethical reasons, such an influence is often contrary to the very rules of the forum. Here we investigate the subject by adopting a different approach from the semantic analysis of [5]. In particular, we investigate the presence of imbalances in the online discussion and the dynamics of the interaction between participants. The rationale is that participants wishing to manipulate others would try to take control of the discussion by posting more frequently and being more reactive. For that purpose we employ two datasets extracted from the two most popular personal finance threads on the SavingAdvice website. For the purpose of the analysis, a thread is represented as the sequence of participants taking turns, with the date and time of each post attached. We conduct a conversation analysis, wishing to assess whether: 1) there are any dominant participants (in particular the thread starter); 2) repetitive patterns appear, such as sustained monologues or sparring matches between two participants; 3) replies occur on a short time scale. The paper provides the following contributions:
- through the use of concentration indices we find that, though no dominance exists, the top 4 speakers submit over 60% of the posts (Section 3);
- both recurring reply sequences and monologues appear (Section 4);
- reply times can be modelled by a lognormal distribution, with 50% of the posts being submitted no more than 14 or 23 minutes (for the two datasets respectively) after the previous one (Section 4).

2. Datasets
We consider the two most popular threads on the SavingAdvice website. The topics are the following, with an identifying short name between parentheses:
1. Should struggling families tithe? (Struggling)
2. African-American Personal Finance Gurus (Guru)
The main characteristics of those datasets are reported in Table 1. For each thread we identify the set of speakers S = {s_1, s_2, ..., s_n}, i.e., the individuals who submit posts. We also identify the set of posts P = {p_1, p_2, ..., p_m} and a function F : P → S that assigns each post to its submitter. For each speaker we can therefore compute the number of posts he/she submitted. Using the indicator function 1(·), the number of posts submitted by the generic speaker s_i is

N(s_i) = \sum_{j=1}^{m} \mathbf{1}(F(p_j) = s_i)    (1)

Table 1: Datasets
Thread       Creator             No. of speakers   No. of posts
Struggling   jpg7n16             25                155
Guru         james.hendrickson   18                104
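A minimal sketch of the computation in (1) and of the per-speaker shares used in the next section, on an invented toy thread in place of the real datasets:

```python
from collections import Counter

# A thread as the sequence of submitters taking turns (toy data).
posts = ["alice", "bob", "alice", "carol", "alice", "bob"]

# N(s_i): the number of posts per speaker, i.e. formula (1).
counts = Counter(posts)   # Counter({'alice': 3, 'bob': 2, 'carol': 1})

# The per-speaker fractions of posts play the role of market shares
# in the concentration indices of Section 3.
m = len(posts)
shares = {speaker: n / m for speaker, n in counts.items()}
print(counts, shares)
```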
3. Dominance in a thread
In this section we wish to examine whether some dominance emerges in a thread. We adopt concentration indices borrowed from the field of industrial economics. We analyse dominance by considering the frequency of posts: an individual (or a group of individuals) is dominant if it submits most of the posts. We first examine how posts are distributed by looking at the rank-size plot: after ranking speakers by the number of posts they submit, the frequency of posts is plotted vs the rank of the speaker. In Figure 1, we see that a roughly linear relationship appears between log N(i) and the rank i, so that a power law N(i) = k/i^α (a.k.a. a generalized Zipf law) may be assumed to apply, where k is a normalizing constant and α is the Zipf exponent (see, e.g., [6]), measuring the slope of the log-linear curve and hence the imbalance between the contributions of the speakers. By performing a linear regression, we get a rough estimate of α, reported in Table 2.

Table 2: Concentration measures
Thread       Zipf exponent   HHI      CR4
Struggling   0.2545          0.1220   61.94%
Guru         0.2501          0.1396   67.31%

As more general indices to assess a dominant position, we borrow two from industrial economics: the Hirschman-Herfindahl Index (HHI) [7, 8, 9] and the CR4 [10, 11]. For a market where n companies operate, with market shares v_1, v_2, ..., v_n, the HHI is

HHI = \sum_{i=1}^{n} v_i^2    (2)

The HHI satisfies the inequality 1/n ≤ HHI ≤ 1, where the lowest value corresponds to no concentration (a perfect equidistribution of the market) and the highest value represents monopoly. Therefore, the larger the HHI, the larger the concentration. The CR4 instead measures the percentage of the whole market owned by the top four companies; similarly, the higher the CR4, the heavier the concentration. In our case, the fraction of posts submitted by a speaker can be considered as his/her market share, so that the HHI can be redefined as

HHI = \sum_{i=1}^{n} \left( \frac{N(s_i)}{m} \right)^2    (3)

and the CR4 as

CR4 = \sum_{i=1}^{4} \frac{N(s_{(i)})}{m}    (4)

where s_(1), ..., s_(4) are the four speakers submitting the most posts. For our datasets we get the results reported in Table 2. According to the guidelines provided by the U.S. Department of Justice, the point of demarcation between unconcentrated and moderately concentrated markets is set at HHI = 0.15 [12]. Since the values in Table 2 are below that threshold, we cannot conclude that there is a significant concentration phenomenon. However, the CR4 index shows that the top 4 speakers submit more than 60% of all the posts. Delving deeper into the top 4, we also see that the most frequent speaker typically contributes around 1/4 of the overall number of posts, which represents a major influence. In the Struggling dataset, the most frequent speaker is the thread originator (with 22.6% of the posts), while that is not true in the Guru dataset, where the most frequent speaker contributes 26.9% of the posts and the originator just 2.88%.

Fig. 1: Rank-size plot

4. Replies
After examining dominance, we turn to interactions. In this section we analyse the pattern of replies, looking for recurrences in the sequence of replies and examining the time elapsed before a post is replied to. We build a graph representing how speakers reply to each other. We consider each post as a reply to the previous one. We build the replies graph by setting a link from a node A to a node B if the speaker represented by node A has replied at least once in the thread to a post submitted by the speaker represented by node B.
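A minimal sketch of this construction, again on toy data; networkx is one possible choice of library, not necessarily the one used by the author:

```python
import networkx as nx

# The thread as a turn sequence: each post replies to the previous one.
posts = ["alice", "bob", "alice", "alice", "carol", "bob"]

G = nx.DiGraph()
for prev, cur in zip(posts, posts[1:]):
    # Edge cur -> prev: cur replied to prev. A self-loop marks a
    # monologue (the same speaker posting twice in a row).
    G.add_edge(cur, prev)

# Bidirectional links reveal couples of speakers replying to each other.
mutual = [(a, b) for a, b in G.edges if a < b and G.has_edge(b, a)]
print(list(G.edges), mutual)
```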
The resulting graphs are shown in Figure 2, ordered from the core to the periphery by decreasing node degree, laid out on concentric rings. Here the degree of a node represents the number of speakers to which it replies. In both cases, an inner core of highly connected nodes appears, representing the speakers who reply to most other speakers. Reply patterns emerge as bidirectional links (couples of speakers who reply to each other). Loops instead represent monologues, i.e., speakers submitting two or more posts in a row.

Fig. 2: Replies graph

Further, we are interested in how fast the interactions between contributors to the thread are. We define the reply time as the time elapsing between a post and the subsequent one. The main statistics of the reply time are reported in Table 3. In both datasets the mean reply time is around 1 hour, but 50% of the replies take place within either 14 minutes (Guru dataset) or 23 minutes (Struggling dataset), i.e., with a much smaller turnaround. There is therefore a significant skewness to the right. A more complete view of the variety of reply times is obtained by modelling the probability density function. In Figure 3, we report the curves obtained through a Gaussian kernel estimator, an exponential model, and a lognormal model (whose parameters have been estimated by the method of moments). By applying the Anderson-Darling test, we find that the exponential hypothesis is rejected at the 5% significance level, while the lognormal one is not rejected, with a p-value as high as 0.72 for the Struggling dataset and 0.076 for the Guru dataset.

Fig. 3: Reply time

Table 3: Reply time statistics (in minutes)
Thread       Mean   Median   Standard deviation   95% percentile
Struggling   70.5   23       156.2                254.7
Guru         58.9   14       112.7                406.7

5. Conclusions
We have analysed two major threads within a personal finance forum as a conversation between submitters acting as speakers, searching for dominance and interaction patterns. Though no significant concentration exists, the top four speakers submit over 60% of the posts. Patterns of interaction emerge as the presence of several couples of speakers who reply to each other, several monologues, and short reply times (with 50% below 14 and 23 minutes for the two datasets, though a significant distribution tail is present).

References
[1] Arthur Armstrong and John Hagel. The real value of online communities. Knowledge and Communities, 74(3):85-95, 2000.
[2] Robert Tumarkin and Robert F. Whitelaw. News or noise? Internet postings and stock prices. Financial Analysts Journal, 57(3):41-51, 2001.
[3] Rui Zhu, Utpal M. Dholakia, Xinlei Chen, and René Algesheimer. Does online community participation foster risky financial behavior? Journal of Marketing Research, 49(3):394-407, 2012.
[4] Loretta Mastroeni, Pierluigi Vellucci, and Maurizio Naldi. Individual competence evolution under equality bias. In 2017 European Modelling Symposium (EMS), Nov 2017.
[5] John Campbell and Dubravka Cecez-Kecmanovic. Communicative practices in an online financial forum during abnormal stock market behavior. Information & Management, 48(1):37-52, 2011.
[6] Maurizio Naldi and Claudia Salaris. Rank-size distribution of teletraffic and customers over a wide area network. Transactions on Emerging Telecommunications Technologies, 17(4):415-421, 2006.
[7] Stephen A. Rhoades. The Herfindahl-Hirschman Index. Fed. Res. Bull., 79:188, 1993.
[8] Maurizio Naldi. Concentration indices and Zipf's law. Economics Letters, 78(3):329-334, 2003.
[9] Maurizio Naldi and Marta Flamini. Censoring and distortion in the Hirschman-Herfindahl Index computation.
Economic Papers: A Journal of Applied Economics and Policy, 2017.
[10] I. Pavic, F. Galetic, and Damir Piplica. Similarities and differences between the CR and HHI as an indicator of market concentration and market power. British Journal of Economics, Management and Trade, 13(1):1-8, 2016.
[11] Maurizio Naldi and Marta Flamini. Correlation and concordance between the CR4 index and the Herfindahl-Hirschman index. SSRN Working Paper Series, 2014.
[12] The U.S. Department of Justice and the Federal Trade Commission. Horizontal Merger Guidelines, 19 August 2010.

Textual analysis, semantic noise and morphosyntactic peculiarities: problems and pre-processing strategies for special corpora
Stefano Nobile
Sapienza Università di Roma – stefano.nobile@uniroma1.it

Abstract
The proliferation of text analysis techniques has made possible the combined use of different software packages, each suited to specific analysis and research needs. However, the opportunities offered by the various packages do not mitigate a fundamental problem inherent in the characteristics of some peculiar corpora. Perfectly suited to the analysis of carefully written texts in a supervised style, these packages nevertheless cannot eliminate certain issues. Among the most common are those tied to the morphosyntactic rules of the language and to the semantic noise they carry. Problems of "noise", such as that generated in spontaneous conversations, require many precautions in the preparation of the corpus. The situation is exacerbated with Twitter, whose ease of access and message download has produced analyses that are not always adequately supported from the theoretical point of view. Poems and songs present a similar problem: there, the problem derives from the structure of the communicative genre itself, whose rhetorical devices inflate the critical mass generated by some words, thereby affecting the computation of several relevant parameters and making the results less readable. What strategies can adequately prepare the corpora to be analysed in these two particular situations? This contribution proposes some strategies for operating in these conditions, highlighting the advantages on the empirical level as well as the repercussions on the theoretical one.
Keywords: semantic noise, poetry, song, rhetoric, corpus pre-processing, constructivism vs. realism

1. Semantic noise and peculiar textual corpora
The multiplication of textual analysis techniques has given researchers in this field the possibility of using, even in combination, different software packages, each with its own peculiarities answering specific analysis and research needs. However, the breadth of opportunity offered by the many packages on the market (T-Lab, Taltac, Spad-T, R, etc.) does not mitigate a fundamental problem inherent in the characteristics of some peculiar corpora: the distortions attributable to the semantic noise generated both by elements irrelevant from the point of view of content and by redundancies that alter the balance of power between words. Perfectly suited to the analysis of carefully written texts in a supervised style, such as newspaper articles or institutional material, these packages nevertheless cannot rescue the user from the difficulties arising in circumstances which, more or less in step with the spread of social networks, have become hegemonic in the production of texts on the web. Among these circumstances, one of the most common concerns what might be called written orality, which pays little or no attention to the morphosyntactic rules of the reference language and therefore carries a high share of semantic noise, understood here in relation to obtaining a readable and processable form of text. Problems of "noise" like that generated in spontaneous conversations, found on the web in the most disparate forms, require of the researcher a series of precautions in preparing the corpus, so as to avoid obtaining strongly distorted data. That is to say, contracted linguistic forms (cmq, nn, xké), expressive devices meant to render the tones of speech (belloooo, bravaaaa), quite peculiar graphical signs (Ã, ò, ðŸ, ù, é, 🠳), redundancies, retweets, text not in ASCII format, hashtags, multimedia links and markup language are all terms of a sum whose result is a proliferation of semantic noise, to whose effects are added the distortions affecting the derived indices (lexical sophistication and lexical extension) as well as the corpus measures (occurrences, graphical forms, hapaxes). This situation is exacerbated with Twitter, whose ease of access and message download has for some time been producing analyses that are not always adequately supported from the theoretical point of view (Ebner, Altmann and Softic, 2011).
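A minimal sketch of the kind of normalization involved; the substitution list and patterns are illustrative only, and a real pipeline would need a far richer inventory:

```python
import re

# Illustrative expansions of contracted forms found in Italian tweets.
CONTRACTIONS = {"cmq": "comunque", "nn": "non", "xké": "perché"}

def normalize(tweet):
    text = re.sub(r"^RT\s+", "", tweet)        # drop the retweet marker
    text = re.sub(r"https?://\S+", "", text)   # drop links
    text = re.sub(r"[@#](\w+)", r"\1", text)   # unwrap mentions and hashtags
    text = re.sub(r"(\w)\1{2,}", r"\1", text)  # belloooo -> bello
    return " ".join(CONTRACTIONS.get(w, w) for w in text.split())

print(normalize("RT @amico cmq nn è belloooo #estate http://t.co/xyz"))
# amico comunque non è bello estate
```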
Indeed, it happens ever more often that "the high degree of automatism of the procedures and the strong tendency towards statistical modelling can expose textual analysis to research styles marked by a naive pursuit of objectivity through the obsessive extremization of numerical computation applied to texts, with the consequent grave loss of the role of context" (Tipaldo, 2014: 191; italics added). The need to squeeze a text into 140 characters (doubled only from November 2017, but the substance does not change) leads users to find solutions, sometimes conventional, sometimes original, for fitting the text within the set limits, just as people did when SMS had limited character sets and were relatively expensive. Hence the production of a considerable quantity of semantic noise that makes "natural" textual data hard to process. To the cases just reviewed, today very widespread, we may add equally peculiar yet entirely different corpora, such as those of poems and songs (Nobile, 2012). In texts of this nature, the problem derives from the very construction of these communicative genres. When they resort to certain rhetorical devices (anadiplosis, epanalepsis, polyptoton, anaphora, epanadiplosis and others), they inflate the critical mass generated by some words. This ends up affecting the computation of several relevant parameters (typical and exclusive specificities, lexical extension, lexical sophistication, the rank of individual words, comparison with peculiar lexicons, etc.), making the results less readable. A very frequent case, here taken to its extreme, is the following: in a song, some words are repeated obsessively, out of strictly rhythmic necessity or to heighten the hook effect. This is what happens, to give a single example within the narrow margins allotted to this contribution, in the song Pino (fratello di Paolo), where the word Pino appears no fewer than 60 times within a few seconds, inevitably inflating all the modalities of the variables (artist, decade of publication, musical macro- and micro-genre, gender) to which this single song is linked (Nobile, 2012). An analogous argument holds for the use of rhetorical figures; among the many, we may take anaphora as an example. Anaphora is a rhetorical figure consisting in the repetition of one or more words at the beginning of a sentence or a verse. Although it can also be found in prose, it is in poetry and song that it optimizes its expressive potential. Among the countless possible examples, one is Vai in Africa, Celestino!, a song released by the singer-songwriter Francesco De Gregori in 2005: pezzi di stella, pezzi di costellazione / pezzi d'amore eterno, pezzi di stagione / pezzi di ceramica, pezzi di vetro / pezzi di occhi che si guardano indietro / pezzi di carne, pezzi di carbone / pezzi di sorriso, pezzi di canzone / pezzi di parola, pezzi di parlamento / pezzi di pioggia, pezzi di fuoco spento. Here it is the word pezzi ("pieces") that appears a considerable number of times thanks, precisely, to the rhetorical device of anaphora.
The case of literature is of course no different; by way of example we can invoke the celebrated Canto III (canto and canzone, indeed...) of Dante's Inferno: Per me si va ne la città dolente / per me si va ne l'etterno dolore / per me si va tra la perduta gente. Poetry and song, then, can present structural characteristics that affect the text mining performed by the various software packages, insofar as they yield altered numerical information. Although the redundancy of some terms does not necessarily imply an upheaval of the syntagmatic axis (Bolasco, 2005), i.e. of the possibility of reconstructing the sense of the text by a criterion of adjacency of words within elementary contexts, it can compromise the sense expressed by the data on the frequency of full words, on the peculiarities (both endogenous, expressible in terms of specificities, and exogenous, translatable in terms of peculiar language) and on the number of graphical forms. What strategies, then, are available to the researcher to adequately prepare the corpora to be analysed in these two particular situations, i.e. a flood of graphical signs and repeated words? It is certainly not enough to clean the text orthographically, nor to expunge from it all those signs, such as emoticons or Twitter's own communicative syntax, that interfere with many analysis parameters. Nor, on the other hand, can the corpus be "domesticated" to the point of distorting its essential character, i.e. the spontaneity of quasi-speech in the first case and the morphosyntactic and rhetorical structure in the second.

2. Corpus pre-processing strategies
The solutions to the kinds of problems just described vary according to the nature of the problem, the computing skills of the user and the analytical perspective adopted by the researcher, and will depend on the combination of these three dimensions. Let us examine them. Cleaning the characters of natural texts depends largely on the user's computing skills, over and above the capabilities of the software used. To date, a user without advanced computing skills cannot do an impeccable cleaning job on very "dirty" textual corpora such as those coming from Twitter. While a large dose of patience will help in using a spell-checker to purge the text of the typos typical of "natural", unsupervised texts, such a user will inevitably collide with the swarm of special characters mentioned above. Three solutions are available. The basic level consists in manually replacing, in bulk, all the graphical signs to be corrected, paying attention, when using an ordinary word processor, to upper and lower case. This operation is all the longer and more tedious the longer, the more complex and the richer in hypertextual references the corpus to be cleaned is. In some cases, packages such as Taltac include a built-in function for removing certain special characters. A second solution is to program macros (or, alternatively, to use external programs) that solve the same kind of problem. This solution is more effective in terms of the final result, but equally demanding in terms of the skills and time required.
The third solution is, on paper, the one that best optimizes the cost/benefit ratio. It consists in exploiting research programs whose very goal is the cleaning of web-born texts used for textual analysis. Projects such as Readability or CleanEval (Baroni et al., 2008) go in this direction, but they in turn present two orders of problems: one tied to costs, the other to effective accessibility. Both, moreover, show flexibility problems with respect to the different formats of the corpora to be processed (Claridge, 2007; Petri and Tavosanis, 2009). The question of the treatment of corpora that owe their peculiarity to their underlying structure, while posing no major computing problems, is more complex and implies decisive choices on the researcher's part. The researcher will in fact have to make gnoseological and theoretical choices with respect to the aims pursued, well aware that these decisions will inevitably affect the empirical findings. In other words, the researcher confronted with textual material born not as prose but as verse essentially has to choose between a faithful, "photographic" representation of the characteristics of the corpus under examination and one that instead discounts the redundancies and all those elements that can inflate some parameters of the corpus, from the count of graphical forms down to the hapaxes. In the first case, the outcomes of the analysis will bear the impact not only of the rhetorical and morphosyntactic elements that may characterize the song form or the poem form, but above all of the refrain. Accepting this perspective means treating certain sections of text, and the elements of it that thicken certain terms through the authors' choices about the verses, as elements which, precisely because they are repeated, deserve to stand out parametrically in the analysis of the corpus. Can we say that in such a case the results are deceptive? It depends, precisely, on the perspective one intends to adopt. A hyperrealist representation leads us to choose the first formula, that of maximum philological rigour, of absolute zeal: to a certain amount of words, however endlessly repeated, must correspond the real frequency value of those words, with all that this implies in terms of relations between words, frequencies and identification of topics within the corpus. At the opposite end, the researcher may have excellent reasons to lean towards a constructivist perspective, by virtue of which the datum is shaped not by the actual frequency of words, with the redundancies that some corpora carry for the reasons already stated, but by the text purged of its redundant elements. An example that should make palpable the implications of the two options and the difference between them can be drawn from a recent study of the lyrics of Italian songs, which updates, in a more markedly sociolinguistic direction, earlier work of mine (Nobile, 2012).
From the corpus¹ gathering the lyrics of the artists who placed one or more records in the top sixty of the sales charts between the 1960s and 2016, we select the two with the highest number of chart entries²: Mina (170 songs) and Renato Zero (177). From each of the two corpora we extract, after lemmatization and normalization of the text, the full words. We can then assign a rank to each of them according to its number of occurrences in two situations: one in which the text is reported exactly as sung (hence with all the elements of redundancy discussed above) and one in which it has been purged of those elements which, owing to the very structure of the song, cause the substantial repetition of certain terms³. The comparison between the two rankings, carried out for each of the two artists, calls for Spearman's rank correlation coefficient (ρ). The values obtained from the two comparisons are of undoubted interest: for Mina, Spearman's ρ is 0.61; for Renato Zero, 0.68. This information alone tells us how much the cleaning of the text and the semantic noise generated by redundancies can produce more than tangible consequences in the structuring of the data to be processed: a word with a low rank is more likely to be selected among the key words, to appear as a term specific to a certain subset, to emerge as a word that stands out by virtue of the rank it occupies in reference dictionaries (De Mauro et al., 1993) and, therefore, to rise to the status of an indicator of the linguistic peculiarity of a given speaker or of a certain aggregated unit of analysis. Thus, in Mina's corpus the word specchio ("mirror"), once the refrains are sacrificed, shifts by 165 rank positions, and the word rabbia ("anger") loses 100 positions between the two treatments of the corpus. Similarly, in Renato Zero's corpus the word identikit loses 226 positions depending on whether or not the corpus is purged of redundancies: it occurs in a single song (Io uguale io), repeated countless times. The same goes for the word fame ("hunger"), which loses 183 positions: although, unlike identikit, it cuts across the whole of the Roman singer-songwriter's songbook, it recurs a considerable number of times as the hook of the song C'è fame.

¹ The corpus consists of the lyrics of 5,940 songs, amounting to 1,321,994 occurrences, 43,855 distinct graphical forms, 22,160 full words and 1,905 hapaxes.
² For the sampling criteria, see Nobile, 2012: 51-53, or Nobile, L'italiano della canzone dagli anni sessanta a oggi. Una prospettiva sociolinguistica, forthcoming.
³ The text purged of duplicated verses was produced with an Excel function (Data, Remove duplicates), keeping fixed the references to the individual songs and to the different artists, so as to avoid removing duplicated verses regardless of these two reference parameters.
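A minimal sketch of this comparison with scipy; the two toy rank vectors stand in for the real rankings of the full and deduplicated corpora:

```python
from scipy.stats import spearmanr

# Ranks of the same full words in the raw corpus (refrains kept) and in
# the deduplicated corpus (repeated verses removed): toy values.
rank_raw   = [1, 2, 3, 4, 5, 6]
rank_dedup = [2, 1, 6, 3, 4, 5]

rho, pvalue = spearmanr(rank_raw, rank_dedup)
print(round(rho, 2))  # values close to 1 mean the refrains barely matter
```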
3. Conclusions
In these pages we have seen that the ease of access to a cyclopean quantity of textual material retrievable on the web is not matched by an equally effortless possibility of analysing it. On the one hand, this material incorporates the typical features of so-called natural language and, as such, runs into not only the common problems of machine learning and text mining (the most frequent of which can be seen, for example, in machine translators or in speech recognition programs), but also those created by the overdose of increasingly widespread elements such as emoticons, special characters, and so on. To these problems one can add others, belonging to the domain of poetry and song, which make a particularly careful and well-considered pre-treatment phase of the texts necessary before they are submitted to analysis. The article has tried to show how the epistemological choices made upstream by the researcher have, in the case of the peculiar linguistic forms discussed here, significant repercussions on the empirical results themselves. Moreover, lexicometric operations on textual material with strong semantic noise risk – if not adequately supported by a cleaning of the corpus which is anything but straightforward – producing results in which the share of semantic noise may even exceed that of the text carrying actual meaning (Nobile, 2016).

References
Baroni M., Chantree F., Kilgarriff A. and Sharoff S. (2008). CleanEval: A competition for cleaning webpages. In Proceedings of the 6th Conference on Language Resources and Evaluation (LREC), pp. 638-643. ELDA.
Bolasco S. (2005). Statistica testuale e text mining: alcuni paradigmi applicativi. Quaderni di Statistica, 7: 17-53.
Chiari I. (2007). Introduzione alla linguistica computazionale. Laterza.
Claridge C. (2007). Constructing a corpus from the web: message boards. In M. Hundt, N. Nesselhauf and C. Biewer, eds, Corpus Linguistics and the Web, pp. 87-108. Rodopi.
De Mauro T., Mancini F., Vedovelli M. and Voghera M. (1993). Lessico di frequenza dell'italiano parlato. EtasLibri.
Ebner M., Altmann T. and Softic S. (2011). @twitter analysis of #edmedia10 – is the #informationstream usable for the #mass. Form@re, 11(74): 36-45.
Lancia F. (2004). Strumenti per l'analisi dei testi. FrancoAngeli.
Nobile S. (2012). Mezzo secolo di canzoni italiane. Una prospettiva sociologica (1960-2010). Carocci, Roma.
Nobile S. (2016). Consenso e dissenso. Le reazioni degli elettori ai post dei candidati. In Morcellini M., Faggiano M.P. and Nobile S., eds, Dinamica Capitale. Traiettorie di ricerca sulle amministrative 2016, pp. 115-138. Maggioli.
Pandolfini V. (2017). Il sociologo e l'algoritmo. L'analisi dei dati testuali al tempo di Internet. FrancoAngeli.
Petri S. and Tavosanis M. (2009). Building a Corpus of Italian Web Forums: Standard Encoding Issues and Linguistic Features. JLCL, 24(1): 115-128.
Tipaldo G. (2014). L'analisi del contenuto e i mass media. Il Mulino.
The Individual in the Group(s): Focus Groups and Corpus Partitioning
Daniel Pélissier
Université Toulouse 1 Capitole – daniel2.pelissier@ut-capitole.fr

Abstract
Lexicometric analyses of focus groups depend in particular on the researcher's choice of corpus partitioning. After proposing a typology of possible partitionings, we present the results of an experiment applying one of these approaches to a corpus of ten focus groups. These analyses highlight some contributions and limitations of lexicometry compared to conversational analysis.

Keywords: focus groups, partitioning, individual, group.

1. Introduction
Lexicometry first studied written discourse (newspaper articles, political speeches, etc.) and answers to open-ended questions (Lebart and Salem, 1988), before turning to transcribed oral conversations (Rouré and Reinert, 1993; Bonneau and Dister, 2010). The latter are indeed more delicate to analyse, because the texts are generally shorter and display particular syntactic patterns. Focus groups belong to this family of data, while posing the specific problem of a large number of participants. According to some authors, this type of data is difficult to analyse with lexicometric software (Duchesne and Haegel, 2014). Yet lexicometric analysis has been used in several studies (Guerrero et al., 2009; Grésillon et al., 2012; Hulin, 2013; Bengough et al., 2015; Brangier et al., 2015), and methodological articles have examined the effectiveness of lexicometric treatments (Dransfield et al., 2004; Peyrat-Guillard et al., 2014). The possibility of processing focus groups with lexicometry is thus established. However, the specific contributions of a quantitative approach remain to be clarified in a field dominated by qualitative approaches, foremost among them conversational analysis. For example, the link between focus groups and social representations has been emphasized (Jovchelovitch, 2004), and Reinert's (1983) descending hierarchical classification (DHC) forms lexical worlds (Ratinaud and Marchand, 2015) whose nature is close to social representations. In this article we focus on the place of the individual in the group(s), an issue that lexicometry makes approachable through a suitable set of variables. But such an analysis requires preparing the corpus with specific methods. We therefore first present a typology of preparation methods for a focus group corpus, complementing the analyses of Peyrat-Guillard et al. (2014) and highlighting those centred on the individual. We then analyse the results of an experiment with one of these methods, showing how it affords an understanding of the individual's discourse within the group(s).

2. Typology of partitionings of a focus group corpus
Before starting the lexicometric treatment of focus groups, the corpus requires specific preparation.
Indeed, certain partitioning decisions will determine which lexicometric methods can be employed and which analyses are possible. The moderator's turns are often removed from the focus group data (Guerrero et al., 2009; Peyrat-Guillard et al., 2014), since in a focus group their role is to keep the exchanges flowing without steering them. It may nevertheless be advisable to compare the results with and without the moderator's interventions (Peyrat-Guillard et al., 2014). The second question concerns the partitioning of the corpus derived from the focus group. Several methods exist. A first possibility is to analyse the focus group as a single entity, without taking the exchanges between individuals into account. Either each focus group constitutes one text with no distinction between individuals (Dransfield et al., 2004) – the argument put forward by users of this method is that it simplifies the statistical analyses, although this is not self-evident, the number of segments being stable – or the focus group is partitioned into themes on the basis of a content analysis (Bengough et al., 2015), an approach which makes it possible, for instance, to compare the results of a thematic analysis with the one the lexicometric software proposes to the researcher. The second family of partitionings is the one that seeks to preserve the exchanges of the focus group. Either the partition is centred on the individuals, the so-called decrowded approach (Peyrat-Guillard et al., 2014), in which the turns of each individual are gathered together (Guerrero et al., 2009); or each turn is treated as a text, the so-called crowded approach (Peyrat-Guillard et al., 2014). Each of these methods has advantages and drawbacks. We do not believe that any single partition should be favoured: the decision depends on the analyses the researcher envisages, given his or her research question. In this article we concentrate on the second family, which makes it possible to study the individual in the group(s) and not only the themes addressed.

3. Results of the experiment with speaker-based partitioning
We were able to experiment with these partitioning methods on a focus group corpus drawn from research we conducted among recent higher-education graduates (bac+3 and bac+5 levels). The focus group discussions concerned the digital recruitment communication of banks, and these young graduates exchanged views on the devices companies use to recruit. We moderated and then transcribed 10 focus groups of 6 to 7 people, i.e. 67 speakers in total.

3.1. Corpus preparation and partitioning
Once the texts had been prepared (anonymization, integration of proper nouns (BNP, Facebook, etc.) into the dictionary, adaptation of the dictionary to the specificities of the discourse, etc.), we decided to remove the researcher's interventions, since they remained neutral with respect to the discourse of the young graduates we wished to analyse. We then created a partition by speech turns according to the following principle (variables in square brackets):

[Groupe1, Ingénieurs, NUM1, 18ans, masc]: il y a des choses marquantes, il y a un site web où on n'a pas beaucoup d'informations et un autre site où il y a beaucoup d'informations.
[Groupe1, Ingénieurs, NUM2, 20ans, masc]: je suis d'accord avec toi.

Indeed, we wanted to identify individual discourses within the focus groups and to be able to associate profile variables with each speaker.
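The paper describes this association as performed by an automatic script. A minimal sketch of what such a script could look like is given below; the file names, the speaker-label pattern (NUM1, NUM2, …) and the profile table are all assumptions, and the output follows the starred-line metadata format expected by Iramuteq, illustrated further below.

import csv
import re

# Hypothetical inputs: a transcript with lines like
# "NUM1: il y a des choses marquantes..." and a CSV of speaker profiles
# (columns: num, formation, groupe, sexe, participation, initial).
profiles = {row["num"]: row for row in csv.DictReader(open("speakers.csv"))}

with open("transcript.txt", encoding="utf8") as src, \
     open("corpus_iramuteq.txt", "w", encoding="utf8") as out:
    for line in src:
        m = re.match(r"(NUM\d+):\s*(.+)", line.strip())
        if not m:
            continue  # moderator turns, assumed unlabelled, are skipped
        num, text = m.groups()
        p = profiles[num]
        # One starred metadata line per speech turn, then the turn itself.
        out.write(f"**** *num_{num} *formation_{p['formation']} "
                  f"*groupe_{p['groupe']} *sexe_{p['sexe']} "
                  f"*participation_{p['participation']} *initial_{p['initial']}\n")
        out.write(text + "\n")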
The variables used (Table 1) were determined according to our research hypotheses and their accessibility, and were then associated by an automatic script with each speaker turn.

Table 1. Focus group variables associated with the speakers.

Num  Variable code   Values                                Source                               Description
1    num             1, 2, 3, etc.                                                              Number of each speaker
2    formation       3IL: engineering school;                                                   Name of the group's programme
                     LPB: professional banking degree
3    groupe          1, 2, 3, etc. (10 groups in total)                                         Group number
4    sexe            M, F                                                                       Speaker's sex
5    participation   TA (very active), A (active),         SONAL statistics, based on the       Quantitative indicator of each
                     PA (not active)                       number of turns                      speaker's participation
6    initial         STS, IUT                              Data from the training institution   Speakers' initial education

The corpus was thus presented in the following form for use in Iramuteq (Ratinaud, 2009):

**** *num_44 *formation_LPB *groupe_1 *sexe_M *participation_A *initial_STS
moi je veux bien commencer. Quand je suis allé sur le site de la SG, …

The characteristics of the corpus obtained and processed with the Iramuteq software are the following: 1,876 texts, ranging from a single form (Oui, for example) for the shortest to 126 forms or 280 occurrences for the longest; 40,404 occurrences and 2,094 forms in total; 21.54 occurrences per text on average; hapaxes account for 41.26% of the forms. Each text thus corresponds to one speaker's turn in a focus group.

3.2. Methodological choices
While Reinert's DHC is the method most often cited in the literature (Duchesne et al., 2010; Grésillon et al., 2012; Hulin, 2013; Peyrat-Guillard et al., 2014; Brangier et al., 2015; Freitas and Luis, 2015, etc.), other techniques are also involved, such as factor analysis (Dransfield et al., 2004; Guerrero et al., 2009) or, more rarely, similarity analysis (Bengough et al., 2015). Our choice of the Reinert classification is linked to our research hypotheses, which relate the discourse of these young graduates to social representations; Reinert's (1983) DHC indeed favours the identification of social representations (Ratinaud and Marchand, 2015). We ran several simple DHCs on text segments, varying the number of classes requested and the minimum number of segments per class. We chose to retain the forms with a frequency above 3 (687 forms in this case), in order to focus the processing on the most frequent forms. At the end of these simulations, we retained a DHC with 15 classes and a rate of classified segments of 83.63%.

3.3. Example of variable use: groups and degree of participation
Since each turn has been associated with context variables, the chosen method makes it possible to test the link between the groups and each identified class. Thus, for this focus group corpus, class 1 (Chi² = 20.82, job seeking) and class 12 (Chi² = 16.76, newspaper articles) are associated with the 3IL students, while class 7 (Chi² = 32.17, Dupuy) and class 13 (Chi² = 11.44, benefits and values) are rather linked to the banking degree group (Fig. 1).

Figure 1. Chi² per class for the variable 'formation'.

Likewise, the participation variable (Table 1 and Fig. 2) made it possible to associate certain classes with this characteristic. The results of the DHC thus allow us to formulate a hypothesis about the degree of consensus surrounding a social representation.
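The class/variable associations reported above rest on a chi-square statistic computed per class and per variable modality. A minimal sketch of that computation for one class and one modality follows; the counts are invented for illustration, and scipy's chi2_contingency stands in for Iramuteq's internal computation.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: segments classified into class 8 or elsewhere,
# split by the 'participation' value of their speaker (PA vs. other).
#                  in class 8   in other classes
table = np.array([[ 42,          310 ],    # PA (not very active) speakers
                  [ 118,        1530 ]])   # other speakers

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"Chi2 = {chi2:.2f}, p = {p:.3f}")
# A positive association holds when the observed count in the
# (class, modality) cell exceeds the expected count.
print("over-represented:", table[0, 0] > expected[0, 0])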
Figure 2. Association of class 8 with the participation variable.

Indeed, class 8, about the size of the organization, is associated with the speakers who participated little overall (variable PA (not very active), Chi² = 4.19; Fig. 2), as is class 3 (mobility). The discussions about job seeking (class 1), the Dupuy bank (class 7) or the rankings of websites and testimonials are dominated by the most active speakers (variable TA (very active): Chi² = 5.69 for class 1, and variable A (active): Chi² = 7.51 for class 7). These can be perceived as more conflictual or more engaged. The exchanges about size thus left more room for the less active speakers, with more consensual, less conflictual discussions than for less stabilized representations. This hypothesis then points to the possible structure of this social representation, built around a stable central core, which would require further studies to be confirmed.

3.4. Identifying individual discourses through correspondence analysis (CA)
The partitioning carried out also makes it possible to identify individuals whose discourses differ (Fig. 3), thanks to a CA performed after a Reinert DHC. In this case, two individuals mainly stand out: 17 and 37. Returning to the text confirms this identification. Another benefit is that it also brings out groupings of individuals that differ from their assignment to a given focus group. By highlighting sets of speakers, the CA offers an approach that crosses the boundary of each focus group and proposes an analysis of the individual in the groups.

Figure 3. CA based on the DHC, displaying the variables (F1/F2, 19.57% of inertia).

4. Conclusion
The lexicometric methods used to analyse focus groups depend in particular on the corpus partitioning carried out upstream. In our research, associating variables with each speaker turn made it possible to identify sub-groups of individuals within the focus groups, discourses of isolated individuals, and sub-groups spanning several focus groups, none of which were obvious during the exchanges themselves. This approach nevertheless has certain limitations. First, the automated variable-association procedure used in this experiment does not allow one to track the evolution of themes during the discussion; a variable recording the position of the speech turns would have been necessary for that. Second, the identification of individuals relied on a CA that explains a small share of the variance (19.57%), so the causes of the singularity of the discourses are difficult to relate to the DHC. Finally, other methods could have been brought to bear (analysis of anti-profiles, specificities, similarities, etc.). Without replacing conversational analysis, which contributes its own nuances, certain lexicometric methods can thus help us understand the corpus differently and complement the understanding of this type of rich and deep data, in particular by crossing the boundary of each focus group and facilitating a transversal approach to meaning.

Acknowledgements: thanks to Pascal Marchand, Pierre Ratinaud and Lucie Loubère for their introduction to lexicometry and to Iramuteq.
References
Bengough T., Bovet E., Bécherraz C., Schlegel S., Burnand B. and Pidoux V. (2015). Swiss family physicians' perceptions and attitudes towards knowledge translation practices. BMC Family Practice, December: 1-12.
Bonneau J. and Dister A. (2010). Logométrie et modélisation des interactions discursives, l'exemple des entretiens semi-directifs. Journées internationales d'Analyse statistique des Données Textuelles, pp. 253-264.
Brangier E., Barcenilla J., Bornet C., Roussel B., Vivian R. and Bost A. (2015). Prospective ergonomics in the ideation of hydrogen energy usages. In Proceedings of the 19th Triennial Congress of the IEA, Melbourne, pp. 1-2.
Dransfield E., Morrot G., Martin J.-F. and Ngapo T.-M. (2004). The application of a text clustering statistical analysis to aid the interpretation of focus group interviews. Food Quality and Preference, 15(4): 477-488.
Duchesne S. and Haegel F. (2014). L'entretien collectif. Armand Colin, Paris.
Duchesne S., Haegel F., Frazer E., Van Ingelgom V., Garcia G. and Frognier A.-P. (2010). Europe between integration and globalisation: social differences and national frames in the analysis of focus groups conducted in France, francophone Belgium and the United Kingdom. Politique Européenne, 30(1): 67-105.
Freitas E. A. M. and Luis M. A. V. (2015). Perception of students about alcohol consumption and illicit drugs. Acta Paul Enferm., 28(5): 408-414.
Grésillon E., Cohen M., Lefour J., Goeldner L. and Simon L. (2012). Les trames vertes et bleues habitantes : un cheminement entre pratiques et représentations. L'exemple de la ville de Paris (France). Développement Durable et Territoires, 3: 2-17.
Guerrero L., Guàrdia M. and Xicola J. (2009). Consumer-driven definition of traditional food products and innovation in traditional foods. A qualitative cross-cultural study. Appetite, 52(2): 345-354.
Hulin T. (2013). Enseigner l'activité « écriture collaborative ». Tic&société, 7(1): 89-116.
Jovchelovitch S. (2004). Contextualiser les focus groups : comprendre les groupes et les cultures dans la recherche sur les représentations. Bulletin de Psychologie, 57(3): 245-261.
Lebart L. and Salem A. (1988). Analyse statistique des données textuelles. Dunod, Paris.
Peyrat-Guillard D., Lancelot Miltgen C. and Welcomer S. (2014). Analysing conversational data with computer-aided content analysis: The importance of data partitioning. Journées internationales d'Analyse statistique des Données Textuelles, pp. 519-530.
Pélissier D. (2016). Pourquoi et comment utiliser la lexicométrie pour l'analyse de focus groups ? Présence numérique des organisations, 11/07/2016.
Ratinaud P. (2009). Iramuteq. Lerass.
Ratinaud P. and Marchand P. (2015). Des mondes lexicaux aux représentations sociales. Une première approche des thématiques dans les débats à l'Assemblée nationale (1998-2014). Mots. Les Langages du Politique, 108(2): 57-77.
Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les Cahiers de l'Analyse des Données, 8(2): 187-198.
Rouré H. and Reinert M. (1993). Analyse d'un entretien à l'aide d'une méthode d'analyse lexicale. Journées internationales d'Analyse statistique de Données Textuelles. ENST, Paris, pp. 418-42.

Using the First Axis of a Correspondence Analysis as an Analytical Tool. Application to Establish and Define an Orality Gradient for Genres of Medieval French Texts
Bénédicte Pincemin1, Céline Guillot-Barbance2, Alexei Lavrentiev3
Univ. Lyon, CNRS, IHRIM UMR5317 – benedicte dot pincemin at ens-lyon dot fr; celine dot guillot at ens-lyon dot fr; alexei dot lavrentev at ens-lyon dot fr

Abstract
Our corpus of medieval French texts is divided into 59 discourse units (DUs) which cross text genres and spoken vs non-spoken text chunks (as tagged with q and sp TEI tags). A correspondence analysis (CA) performed on selected POS tags indicates orality as the main dimension of variation across DUs. We then design several methodological paths to investigate this gradient as computed by the first CA axis. Bootstrap is used to check the stability of observations; gradient-ordered barplots provide both a synthetic and an analytic view of the correlation of any variable with the gradient; a way is also found to characterize the gradient poles (here, more-oral or less-oral poles) not only with the POS used for the CA analysis, but also with words, in order to get a more precise and lexical description. This methodology could be transposed to other data with a potential gradient structure.

Keywords: textometry, Old French, represented speech, spoken genres, methodology, correspondence analysis, 1D model, data visualization, XML TEI, TXM software, DtmVic software.

1. Linguistic issue and preparation of textual data
We investigate spoken language features of Medieval French in a corpus composed of 137 texts (4 million tokens), taken from the Base de français médiéval(1). The corpus is annotated with part-of-speech (POS) tags at the word level; speech quotation chunks and speech turns are marked up using TEI XML tags at an intermediate level between sentences and paragraphs; and every text can be situated in a 32-genre typology (Guillot et al., 2017). Our hypothesis is that the features of orality may be related to text chunks representing speech, and also to text genres, as for instance some text genres are intended for oral performance. In order to perform a textometric analysis (Lebart et al., 1998) on our XML-TEI annotated data, we use the TXM open-source corpus analysis platform (Heiden, 2010; Heiden et al., 2010)(2). We divide our corpus into 59 discourse units (DUs) obtained by splitting every genre into parts which represent speech on the one hand, and the remaining parts on the other hand (some text genres have no spoken passages). Discourse unit labels, like q_rbrefLn for instance, combine four pieces of information: (i) the first letter is either q for quoted speech chunks, sp for speech turns, or z for remaining (non-oral) chunks; (ii) then we have the short name of the text genre (here, rbref means "récit bref", i.e. short narrative); (iii) the uppercase letter stands for the domain(3); (iv) the last character indicates whether this DU is represented in our corpus by one (1), two (2) or more (n) texts. We linguistically represent our texts with the POS tags(4) they use(5). The reliability of POS tags was measured in a previous study (Guillot et al., 2015) for a subset of 7 texts in which tags had been manually checked. For the present analysis, we eliminate low-frequency POS tags (freq. < 1,500), which include many high error rate tags and do not carry much weight in the quantitative analysis. For the remaining high error rate tags (with more than 25% wrong assignments), we measure their influence on the correspondence analysis (CA) by checking their contribution to the first axis. Then we remove the proper nouns category (NOMpro), which shows both a high error rate and a high contribution to the first axis (14.66%). A new correspondence analysis enables two additional improvements from a linguistic perspective. We remove compound determiners (DETcom, PRE.DETcom, like ledit) as they emerged at the end of the 13th century, so that they introduce a singular and substantial diachronic effect (high contributions on the first axis). Moreover, the second axis describes mainly the association between psalms (z_psautierRn) and possessive adjectives (ADJpos): this corresponds to very specific phrases with some distinctive nouns (la meie aneme, li miens Deus, la tue misericorde), and the adjective is equivalent to a possessive determiner in other contexts, so we merge the two categories (DETADJpos). We finally get a contingency table crossing 59 DUs with 33 POS tags to explore with a CA.

1 Base de français médiéval: http://bfm.ens-lyon.fr
2 Textometry Project and TXM software: http://textometrie.org
3 There are 6 domains: literature (L), education (D for "didactique"), religion (R), history (H), law (J for "juridique"), practical acts (P).
4 We use the Cattex2009 tagset, designed for Old French: http://bfm.ens-lyon.fr/spip.php?article176.
5 We exclude punctuation, editorial markup and foreign words. CQL query: [fropos!="PON.*|ETR|OUT|RED"]
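Under standard assumptions, the CA itself reduces to a singular value decomposition of the standardized residuals of the DU × POS contingency table. The sketch below (plain numpy, hypothetical input file) recovers the first-axis coordinates used throughout the paper as the orality gradient, along with the contributions used to screen the tags; it is an illustration of the method, not the authors' TXM workflow.

import numpy as np

# Hypothetical input: a 59 x 33 contingency table of POS counts
# per discourse unit, rows = DUs, columns = POS tags.
N = np.loadtxt("du_pos_counts.tsv")
P = N / N.sum()                               # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)           # row and column masses
# Standardized residuals: (P - r c') / sqrt(r c')
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the DUs on axis 1: the orality gradient.
axis1 = (U[:, 0] * sv[0]) / np.sqrt(r)
# Contribution of each DU to axis 1 (mass times squared coordinate,
# divided by the axis inertia).
ctrb1 = r * axis1**2 / sv[0]**2
# Same on the column side, to check which POS tags drive the axis.
col1 = (Vt[0, :] * sv[0]) / np.sqrt(c)
ctrb1_pos = c * col1**2 / sv[0]**2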
2. Linguistic and methodological results from correspondence analysis
Our study reveals that the first axis can in fact be interpreted as an orality gradient. The factorial map (Fig. 1) shows z_ DUs on the left-hand side of the first axis, opposed to q_ and sp_ DUs on the right-hand side. Some genres intended for oral performance move to the right with the speech chunks (especially plays – dramatiqueL, dramatiqueR), whereas genres related to written processing (especially practical acts (P): charters, etc.) move to the left with the out-of-speech chunks. As this opposition matches the first axis, orality appears as the first contrastive dimension for Old French (as regards POS frequencies), as it is in Biber's experiments with English (Biber, 1988), with the same kind of linguistic features (Table 1). As a second result, DUs can then be sorted according to their degree of orality, from "less oral" to "more oral" (see the Appendix(6)). Peculiar positions (for didactic dialogues or psalms, for instance) can be explained by a formal use of language dictated by the rules of the genre. The linguistic analysis of the DU gradient is detailed in (Guillot-Barbance et al., 2017)(7).

Figure 1. CA map of the 59 DUs (TXM). 21 DUs with low representation quality (squared cosine on the 1 × 2 plane < 0.3) and no significant contribution to this plane (ctrb1 < 2% and ctrb2 < 2%) have been filtered out (macro CAfilter.groovy), so that the figure is clearer.

6 The Appendix is available online as a related file of this paper in the HAL archive: https://halshs.archives-ouvertes.fr/halshs-01759219
7 Improvements made to the statistical processing in 2018 (management of the second axis with ADJpos and DETpos merging, confidence ellipses) strengthen the linguistic interpretation published in 2017; no significant change is observed on the gradient given by the first axis, according to the four zones defined by the analysis, except for a few points which are not related to this axis (low squared cosine).
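The cos²/contribution filter applied in Figure 1 can be reproduced outside TXM. The following lines are a small continuation of the CA sketch above (they reuse the names S, U, sv, r, axis1 and ctrb1 defined there) and apply the thresholds quoted in the caption; this is again an illustrative reconstruction, not the CAfilter.groovy macro itself.

# Squared cosines of each DU on the 1x2 plane (quality of representation):
# cos2(i) = (f_i1^2 + f_i2^2) / d_i^2, where d_i^2 is the squared
# chi-square distance of row i to the centroid.
axis2 = (U[:, 1] * sv[1]) / np.sqrt(r)
d2 = (S**2).sum(axis=1) / r                 # squared distances to centroid
cos2_plane = (axis1**2 + axis2**2) / d2
ctrb2 = r * axis2**2 / sv[1]**2
# Keep a DU if it is either well represented on the plane
# or contributes significantly to one of the two axes.
keep = (cos2_plane >= 0.3) | (ctrb1 >= 0.02) | (ctrb2 >= 0.02)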
A bootstrap validation (Dupuis & Lebart, 2008; Lebart & Piron, 2016) is applied to evaluate the stability of DU positions on the first axis (Figure 2). Sizes of ellipses in the 1×2 map are correlated with the sizes of the DUs: the fewer words there are in a DU, the less data the statistics process, and the larger the confidence ellipse (Table 1). Only five DUs are ascribed a big ellipse showing their uncertain position (Figure 2): all of them are DUs of about ten to about a hundred words, which correspond to very singular linguistic usages and are neither representative nor relevant for this overall linguistic analysis. The orality gradient is thus confirmed through a statistical validation on our data.

Figure 2. CA map of the 17 DUs with the largest confidence ellipses (DtmVic). The two largest ones (q_proverbesD2, q_lapidaireD2) could not be drawn; the next three largest (q_commentaireD1, q_dialogueD2, q_sermentJ1) show that these DU positions cannot be interpreted; the other, smaller ellipses indicate that the 54 remaining DU positions on axes 1 and 2 are stable.

Table 1. The eight POS with the highest contributions on the first axis, for both sides.

"Less oral" pole                                "More oral" pole
PRE          preposition                        PROper      personal pronoun
NOMcom       common noun                        ADVgen      general adverb
PRE.DETdef   preposition + definite determiner  ADVneg      negative adverb
VERppe       past participle                    VERcjg      finite verb
DETdef       definite determiner                PROadv      adverbial pronoun (en, y)
DETcar       cardinal determiner                DETADJpos   possessive determiner or adjective
VERppa       present participle                 CONsub      subordinating conjunction
CONcoo       coordinating conjunction           VERinf      infinitive verb

The 2D factorial map provides a synthetic and efficient visualization. The display of the second axis reveals that the "more oral" pole is more compact, more consistent, than the "less oral" pole, which is more heterogeneous (the squared cosine values corroborate this). But what we want to stress in this methodological paper is that the main linguistic result is provided solely by the interpretation of the first axis. Benzécri illustrated the same kind of approach by using a 1D CA to reveal the hierarchy of characters in Racine's Phèdre (1981: 68). This method emphasizes the analytic power of CA, which separates the data (by the mathematical means of Singular Value Decomposition) into "deep" components (factors), just as a prism breaks light up into its constituent spectral colors. Despite its main use as a 2D illustration of a corpus structure in the textual data analysis field, CA is much more than a suggestive visualization or a quick sketch.

3. Complementary tools to analyse a 1D gradient in textual data
We now test new means to gain insight into the causes of this gradient in our data.

3.1. Gradient-ordered barplot
The first method we propose is to visualize the evolution of POS frequencies along the orality gradient, using a specificity bar-plot chart where the DU order on the x-axis is given by the DU order on the first CA axis: this display visually reveals how strongly a POS is correlated with speech or non-speech features, and details its affinity with each DU. For instance, personal pronouns are typical of the more-oral pole: this is displayed as a rising profile (Figure 3), and one can easily find out which DUs have an outlying use of this POS. Conversely, a POS like adjectives (Figure 4), which is not correlated to the orality gradient, gets a chart with no overall pattern.

Figure 3. Gradient-ordered specificity barplot for the personal pronoun, as an example of a POS which is correlated to the first axis. For readability reasons, the height of the specificity bars is capped at 20.
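TXM's specificity score is based on the hypergeometric model (Lafon's specificities). The sketch below is a minimal reconstruction of the gradient-ordered barplot under that assumption; it reuses the count matrix N and the axis1 coordinates from the CA sketch above, and pos_tags (the list of the 33 POS labels) is an assumed variable.

import matplotlib.pyplot as plt
from scipy.stats import hypergeom

def specificity(k, n, K, T):
    # Signed -log10 of the hypergeometric tail probability of observing
    # k tokens of a POS in a part of size n, given K tokens of that POS
    # in a corpus of T tokens.
    if k >= n * K / T:                        # over-represented: right tail
        return -np.log10(hypergeom.sf(k - 1, T, K, n))
    return np.log10(hypergeom.cdf(k, T, K, n))

pos = "PROper"
j = pos_tags.index(pos)                       # pos_tags: assumed label list
T, K = N.sum(), N[:, j].sum()
order = np.argsort(axis1)                     # DUs sorted along the gradient
scores = [specificity(N[i, j], N[i].sum(), K, T) for i in order]
scores = np.clip(scores, -20, 20)             # cap bars for readability
plt.bar(range(len(order)), scores)
plt.xlabel("DUs ordered by CA axis 1 (less oral to more oral)")
plt.ylabel(f"specificity of {pos}")
plt.show()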
Figure 4. Gradient-ordered specificity barplot for adjectives, as an example of a POS which is not correlated to the first axis. For readability reasons, the height of the specificity bars is capped at 20.

3.2. Back-to-text close reading: representative words for each side of the first axis
The second methodological innovation concerns obtaining lexical information about orality characteristics in our texts. We select two sets of DUs based on their squared cosine scores on the first CA axis, so as to represent the more-oral (cos² on axis 1 > 0.4) and less-oral (cos² on axis 1 > 0.35) poles (Table 2). The cos² thresholds are adjusted to get two balanced sets with enough different DUs to ensure adequate representativeness. Then a specificity computation, which statistically characterizes the distribution of words across these two sets, reveals lexical features of the more-oral and less-oral poles, showing typical words as they can be read in the texts. Light is thus shed on the quantitative result through qualitative observations.

Table 2. Representative DUs.
Less-oral pole: z_journalJ2, z_plaidsP1, z_commentaireD1, z_diversP1, z_registreP2, z_lettreH1, z_dialogueD2, z_rvoyageL1.
More-oral pole: q_romanLn, sp_dramatiqueR1, q_rbrefLn, q_bestiaireD2, sp_dramatiqueLn, q_lyriqueLn, z_lyriqueLn, q_chroniqueHn, sp_lyriqueLn, q_hagiographieRn, q_romanDn, q_mémoiresHn.

Table 3a. Adjectives typical of the less-oral subcorpus. Table 3b. Adjectives typical of the more-oral subcorpus.

Our example sheds light on the uses of adjectives: whereas adjectives are not related to the orality gradient as a category (Figure 4), they show strong associations at the lexical level (Table 3). Represented speech makes heavy use of terms of address introducing speech turns (bel, douz – and their formal variants: biaus, biax, etc.) and of evaluative adjectives (grant, mal, boen). For the less-oral pole, there are more POS-tagging errors; adjectives are more diverse and often associated with a subset of DUs: for instance, present, saint, maistre are typical of two texts.
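The word-level characterization of the two poles can be sketched with the same hypergeometric specificity function defined in the barplot sketch above. The token list (pairs of DU label and word) is an assumed data structure, and only the less-oral pole is shown; the more-oral pole is symmetric.

from collections import Counter

# Hypothetical input: tokens is an iterable of (du_label, word) pairs
# for the whole corpus.
less_oral = {"z_journalJ2", "z_plaidsP1", "z_commentaireD1", "z_diversP1",
             "z_registreP2", "z_lettreH1", "z_dialogueD2", "z_rvoyageL1"}

pole = Counter(w for du, w in tokens if du in less_oral)
whole = Counter(w for du, w in tokens)
n, T = sum(pole.values()), sum(whole.values())
# Score each word of the pole against its frequency in the whole corpus,
# reusing specificity() from the barplot sketch.
typical = sorted(whole, key=lambda w: -specificity(pole[w], n, whole[w], T))
print(typical[:20])   # candidate words for the less-oral pole (cf. Table 3a)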
4. Conclusion
In this contribution, we have shown several ways to take into account the limits of real data, especially textual data: managing POS-tag reliability (§1), a validation process to identify where data is lacking (§2), and refining the morphosyntactically based analysis with lexical information (§3). But our main objective is to establish a methodology for revealing and studying any gradient-like deep structuration of data. A simple seriation (as illustrated in Dupuis & Lebart, 2008) could provide the same results for the first step, as it generates the same ordered view of the data. But CA gives much more information, qualifying the relation of each variable to the gradient with indicators like contributions and squared cosines. Interpretation can go further: CA coordinates are controlled with bootstrap and confidence ellipses, gradient-ordered barplot visualizations are efficient for analysing in detail the relationship of any individual variable to the overall gradient, and the gradient poles can be illustrated by words, which add a concrete and textual account of the deep structure. Thus, on our corpus of French medieval texts, we discover that orality is the main contrastive dimension and that it characterizes represented speech as well as text genres. The methodology could be applied to other data, and it is already entirely implemented using tools freely available to the scientific community.

This research has benefited from the PaLaFra ANR-DFG project (ANR-14-FRAL-0006), for corpus extension and POS evaluation. We are also very grateful to Ludovic Lebart for his inspiring comments on a preliminary presentation of this research, and for the DtmVic software, which has evolved to take into account the quantitative particularities of our data.

References
Benzécri J.-P. et al. (1981). Pratique de l'Analyse des données, tome 3. Linguistique & lexicologie. Dunod, Bordas, Paris.
Biber D. (1988). Variation across speech and writing. Cambridge University Press.
Dupuis F. and Lebart L. (2008). Visualisation, validation et sériation. Application à un corpus de textes médiévaux. In Heiden S. and Pincemin B., eds, Actes JADT 2008, Presses universitaires de Lyon: 433-444.
Guillot C., Heiden S., Lavrentiev A. and Pincemin B. (2015). L'oral représenté dans un corpus de français médiéval (9e-15e) : approche contrastive et outillée de la variation diasystémique. In Kragh K. J. and Lindschouw J., eds, Les variations diasystémiques et leurs interdépendances dans les langues romanes – Actes du Colloque DIA II, Éditions de linguistique et de philologie, Strasbourg: 15-28.
Guillot-Barbance C., Pincemin B. and Lavrentiev A. (2017). Représentation de l'oral en français médiéval et genres textuels. Langages, 208: 53-68.
Heiden S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In Otoguro R. et al., eds, PACLIC 24, Waseda University, Sendai: 389-398.
Heiden S., Magué J.-Ph. and Pincemin B. (2010). TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement. In Bolasco S. et al., eds, Statistical Analysis of Textual Data – Proceedings of JADT 2010, Edizioni Universitarie di Lettere Economia Diritto, Rome: 1021-1031.
Lebart L. and Piron M. (2016). Pratique de l'Analyse de Données Numériques et Textuelles avec Dtm-Vic. L2C, http://www.dtmvic.com.
Lebart L., Salem A. and Berry L. (1998). Exploring Textual Data. Kluwer Academic Publishers, Boston.

Exploring Disagreements in the Discussion Threads of the French-language Wikipedia
Céline Poudat
Université Côte d'Azur, CNRS, BCL, France – poudat@unice.fr

Abstract
This article concentrates on the exploration of French Wikipedia talk pages, with a focus on conflicts. We developed a typology of speech acts expressing disagreement, including direct and explicit forms (je ne suis pas d'accord / je suis en désaccord) as well as indirect acts, which are moreover the most widespread: disagreement is a negative reaction that may threaten the face of the addressee, and for this reason it tends to be expressed indirectly in order to protect faces in interaction. A subset of the Wikiconflits corpus (Poudat et al., 2016) was annotated according to the typology, and we carried out a first exploration of the data using statistical methods.
Keywords: Wikipedia, CMC corpora, conflicts, disagreements, pragmatics, semantic annotation, text statistics.

1. Introduction
This study focuses on the exploration of one of the greatest successes of Web 2.0: the Wikipedia encyclopedia, which brings together thousands of contributors across the world, yet paradoxically remains little examined by linguistic studies, certainly because of the complexity of the object, which multiplies versions, page types and text genres. We are specifically interested in the threads of the talk pages of the French-language Wikipedia, with a particular focus on conflicts. Mostly addressed by the social sciences (cf. Kittur and Kraut, 2008, 2010; Auray et al., 2009; Sumi et al., 2011; Borra et al., 2014), conflicts in Wikipedia have rarely been described from a linguistic point of view. We propose to describe them by means of a speech-act annotation, distinguishing between markers of (dis)agreement and markers of conflict: while not every disagreement turns into a conflict, a conflict often arises from a disagreement. Two enterprises of annotating conflictual interactions in Wikipedia have been carried out in recent years (Bender et al., 2011; Ferschke et al., 2012), but they did not concern French and were positioned within a different framework. The present paper concentrates specifically on the exploration of disagreement markers in Wikipedia, from its most direct and explicit expression (je ne suis pas d'accord / je suis en désaccord) to its most indirect – and in fact most usual – forms; disagreement is indeed rather expressed indirectly, so as to preserve one's own face and that of the other. After presenting the working corpus (2.), we describe the exploratory typology we developed and the markers we annotated manually (3.). We finally present some of the regularities observed (4.).

2. Wikiconflits: conflictual pages and threads
The working corpus on which our study is based comprises a subset of the Wikiconflits corpus (Poudat et al., 2016), namely all the discussions around six articles identified by Wikipedia as conflictual: Igor et Grichka Bogdanoff, Chiropratique, Éolienne, Histoire de la logique, Psychanalyse and Quotient intellectuel. The conflictuality of each thread was assessed and annotated with a three-value variable: non-conflictual threads are categorized C0, C1 signals the presence of a disagreement, and C2 the presence of a conflict in the thread.
Table 1. Working corpus.

Page                     Tokens    Messages   C0 threads   C1 threads   C2 threads
Bogdanoff                 73864         493           30           16           20
Chiropratique             29919         226            5            3           12
Éolienne                  13454         152            2            7            0
Histoire de la logique     3358          46            4            2            0
Psychanalyse             102338         878           54           39           34
Quotient intellectuel     20059         170           10           20           12

Disagreements and conflicts are two forms of verbal confrontation, with the difference that disagreement is a reactive act expressing a negative reaction to a previously expressed assertion (Kerbrat-Orecchioni, 2016), whereas conflict is an aggressive act, implying the presence of at least one attack-reply sequence characterized by the use of markers of verbal violence and of speech acts that are aggressive towards the addressee's face (Poudat and Ho-Dac, 2018). These definitions must be refined with respect to the very particular genre embodied by the Wikipedia discussion, whose main function is to allow the editors of an article to coordinate and to clarify their possible differences. The encyclopedic article is thus the primary field of cooperation between contributors, the discussion acting rather as the backstage of the writing process; many regular Wikipedia users are in fact unaware that these discussions exist. In other words, the article is the primary genre, the discussion being a linked, non-autonomous genre. The disagreements and conflicts observed there lean on the article, which led us to observe, for instance, that a disagreement could bear on a passage of the article, considered in that case as a contestable assertion. In the same way, a conflict can originate during the writing of the article, via a contentious deletion or revert, which may give rise to an aggressive reply on the talk page. Note that we leave aside non-verbal conflicts and edit wars, which have been widely studied by the social sciences. Threads categorized C1 bear the verbal trace of a disagreement, while threads labelled C2 contain at least one overt attack on the face of one of the thread's contributors. This annotation is of course not self-evident, and often required us to arbitrate between the content of a message and its position in the interaction thread. A message can express disagreement or be aggressive without receiving an answer, while a contributor can disagree with an existing point of view which is not that of one of his or her co-enunciators. We retained only the disagreements or attacks directed at the co-enunciator(s)/co-editor(s): a passage that is very aggressive towards a third-party author or article, for example, was not considered conflictual.

3. Disagreement as a speech act: types and markers
We then concentrated on the manual annotation of the speech acts expressing disagreement, developing a typology suited to the characteristics of the working corpus. Since disagreement is an act expressing a negative reaction, it is potentially threatening for the face of the addressee; this is why it is generally expressed indirectly. The figures in our corpus are eloquent: 82% of the acts expressing disagreement are indirect, while nearly half of the directly expressed disagreements are softened or minimized.
The two most recurrent types of indirect expression of disagreement we observed consist in (i) resorting to concession in order to stage a partial agreement, and (ii) expressing one's disagreement by explicitly positioning oneself as the evaluative source (personnellement, je ne pense pas que… ; j'avoue ne pas comprendre, etc.). As indicated in Table 2, we chose to annotate concessions accompanied by an explicit agreement, such as « Ok, mais des solutions existent (développement de pales furtives absorbant les ondes radars) » (Éolienne discussion), which perhaps explains why we ultimately obtain only a small number of them (9 occ.). The indirect expression of disagreement seems to significantly favour the secondary acts of incomprehension (48 occ.) and of the expression of an opinion (29 occ.). For comparison, we systematically annotated the explicit manifestations of agreement we encountered. Unlike disagreement, agreement is, in our culture, a positive act for the face of the addressee. Rarely used indirectly, it tends to be intensified rather than attenuated (je suis tout à fait d'accord). We find 57 acts of explicit agreement in the corpus; by comparison, there are three times as many forms expressing disagreement, which is probably due to the conflictual dimension of the corpus. Finally, it must be stressed that more than two thirds of the 270 discussion threads considered contained none of the observed forms, which is not surprising: a quarter of the threads contain only one message, while we kept the threads categorized as harmonious for the sake of contrast.

Table 2. Typology of disagreement.

Attribute: polarity. Values: agreement (je suis d'accord) / disagreement (Je suis contre l'avis de X).
Attribute: type. Values: explicit / implicit (see the indirect acts below).
  Explicit agreement: je suis d'accord, je suis pour X, favorable à X, tout à fait de votre avis, je suis de ton avis, OK pour X…
  Explicit disagreement: pas d'accord, en désaccord, je ne suis pas favorable, je suis contre, totalement contre.
Attribute: attenuation. Values: yes / no.
  Attenuation of an explicit agreement: je suis assez d'accord.
  Attenuation of an explicit disagreement: Nous sommes en désaccord (mineur) sur un point (mineur).
Attribute: indirect act. Values:
  Concession (only explicit-agreement acts accompanied by a concession were retained): D'accord pour refuser le paragraphe ajouté à partir d'arkiv ; en revanche la suppression de la participation d'AR à la mission ne me semblait pas déraisonnable (Bogdanoff discussion).
  Opinion (positioning oneself as the evaluative source): « Personnellement, je pense que non », je ne crois pas, je ne pense pas… Keywords: personnellement, pense, crois, trouve.
  Emotion (rare in the corpus as an expression of disagreement): j'ai été personnellement choqué par les affirmations gratuites comme "de gauche/de droite" dès le début de l'article, que je pense tout à fait intempestives et parfaitement corrélées à la hauteur du QI du contributeur et aux théories raciales de Rushton (QI discussion).
  Doubt: Je doute de la pertinence de ce passage dans cet article. Keywords: certain, sûr, doute.
  Incomprehension: Encore une fois, je ne comprends pas le problème. Ce n'est pas du tout une question de vocabulaire secondaire (Bogdanoff discussion).
  Strong negative assertion: Je ne vois pas bien quel rapport ta source a avec ce constat. (Psychanalyse discussion)
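The explicit markers and keyword cues listed in Table 2 lend themselves to a first automatic pre-annotation pass, which a human annotator would then validate. A minimal sketch follows; the cue inventories are abridged from Table 2, the message structure is hypothetical, and this is not the annotation procedure actually used in the study, which was manual.

import re

# Abridged cue inventories taken from Table 2; a real pre-annotation
# pass would need the full lists and manual validation.
PATTERNS = {
    ("disagreement", "explicit"): r"pas d'accord|en d[ée]saccord|je suis contre",
    ("agreement", "explicit"):    r"je suis (tout à fait )?d'accord|je suis pour|OK pour",
    ("disagreement", "opinion"):  r"personnellement|je ne (pense|crois|trouve) pas",
    ("disagreement", "doubt"):    r"je doute|pas (certain|s[ûu]r)",
    ("disagreement", "incomprehension"): r"je ne comprends pas|je ne vois pas",
}

def preannotate(message):
    # Return the candidate (polarity, type) labels found in a message.
    return [label for label, pat in PATTERNS.items()
            if re.search(pat, message, re.IGNORECASE)]

print(preannotate("Encore une fois, je ne comprends pas le problème."))
# [('disagreement', 'incomprehension')]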
4. Analyses
The annotated corpus was then submitted to various textual data analysis methods, in order to explore its characteristics and to bring out the relations between the types of disagreement and the situation of the thread: harmonious, dissonant or conflictual. As Figure 1 shows, the threads identified as the site of a disagreement (C1) are those containing the most significant number of agreement and disagreement markers. Conversely, the threads identified as conflictual contain significantly fewer marks of explicit agreement and fewer marks of disagreement. We are thus reassured as to the coherence of our annotation.

Figure 1. Distribution of agreement and disagreement types across thread types (Hyperbase Web data).

In order to assess more precisely the structure of the set of annotations applied to the texts, we performed a Principal Component Analysis (PCA) on the table of annotation counts, taking the discussion thread as the textual unit. We had to make certain adjustments: (i) discarding the threads containing no annotation at all; (ii) isolating some overly marginal variables (e.g. the 2 occurrences of the emotion value); and (iii) distinguishing, among the remaining observations, those to be used as active versus supplementary variables. Thus, the variables bearing the attenuation attribute were included for illustrative purposes only. In total, the PCA was performed on a small data set: 98 threads × 8 active variables (and 13 supplementary variables). Interestingly, the PCA brings out a size factor: all the observations are positively correlated with one another and therefore cluster on the same side of the first factorial axis. Some discussion threads have high values on all variables, while others have low values on all of them. If we turn to factors 2 and 3 (Figure 2), onto which the degree of conflictuality and the pages of the corpus are projected for illustrative purposes, we observe an opposition between agreement and disagreement and, to a lesser extent, between explicit and implicit on factor 2. Agreements and explicit acts would lie on the side of harmony and of disagreement, while disagreements in general, and indirect disagreements in particular, would be more characteristic of conflict. This last remark, which will have to be tested and confirmed on larger data sets, seems interesting to us: do implicit disagreement markers go hand in hand with conflict markers? Is there a negative correlation between the explicit expression of disagreement and personal attacks?

Figure 2. Factors 2 and 3 of the PCA – 98 threads × 8 active variables – DtmVic.
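The active/supplementary distinction described above can be sketched as follows: active variables shape the axes, supplementary ones are only correlated with them afterwards. This is an illustrative numpy reconstruction (hypothetical input file and column layout), not the DtmVic computation itself.

import numpy as np

# Hypothetical input: annotation counts per thread; the first 8 columns
# are the active variables, the remaining 13 the supplementary ones.
X = np.loadtxt("thread_annotation_counts.tsv")
X = X[X[:, :8].sum(axis=1) > 0]          # discard threads with no annotation
active, supp = X[:, :8], X[:, 8:]

# Standardized PCA of the active variables.
Z = (active - active.mean(0)) / active.std(0)
U, sv, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * sv                           # thread coordinates on the factors

# A size factor shows up as axis-1 loadings all sharing the same sign.
print("axis-1 loadings:", np.round(Vt[0], 2))

# Supplementary variables: correlate them with the factors instead of
# letting them shape the axes.
for k in (1, 2):                          # factors 2 and 3 (0-indexed)
    r = [np.corrcoef(supp[:, j], scores[:, k])[0, 1]
         for j in range(supp.shape[1])]
    print(f"factor {k + 1} correlations:", np.round(r, 2))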
5. Conclusion and perspectives
We have thus proposed a first typology of the acts expressing disagreement in French; this typology was developed within a more general project of exploring conflicts in Wikipedia. A second typology, centred on markers of verbal violence and presumed characteristic of conflict, is under development and will form a system with the disagreement typology, in order to bring out the characteristics of conflictual interactions in Wikipedia and in CMC. As regards the annotation presented here, a guide is currently being written; each marker will be validated and evaluated by means of a Cohen's kappa. The typology is still being improved; a third form of indirect expression of disagreement we observed consists in neutralizing it by shifting the focus onto a proposal or a suggestion, i.e. a positive speech act (ne vaudrait-il pas mieux… ? Il faudrait peut-être d'abord définir ce qu'on entend par…). This type of sequence, more complex to identify because more ambiguous, is currently being integrated. Finally, suitable interpretive paths remain to be devised for exploring this type of annotated data with our textual-statistics methods; this is also one of the avenues we have been pursuing in recent years, in our work (Poudat and Landragin, 2017) and within the CORLI consortium.

References
Auray N., Hurault-Plantet M., Poudat C. and Jacquemin B. (2009). La négociation des points de vue : une cartographie sociale des conflits et des querelles dans le Wikipédia francophone. Réseaux, 154(2): 15-50.
Bender E.M., Morgan J.T., Oxley M., Zachry M., Hutchinson B., Marin A. and Ostendorf M. (2011). Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages. In Proceedings of the Workshop on Languages in Social Media, pp. 48-57. Association for Computational Linguistics, Stroudsburg, PA, USA.
Borra E., Weltevrede E., Ciuccarelli P., Kaltenbrunner A., Laniado D., Magni G. and Venturini T. (2014). Contropedia – the Analysis and Visualization of Controversies in Wikipedia Articles. In Proceedings of The International Symposium on Open Collaboration, pp. 34:1. New York, NY, USA.
Ferschke O., Gurevych I. and Chebotar Y. (2012). Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 777-786. Association for Computational Linguistics, Stroudsburg, PA, USA.
Kerbrat-Orecchioni C. (2016). Le désaccord, réaction « non préférée » ? Le cas des débats présidentiels. Cahiers de praxématique, 67.
Poudat C. and Ho-Dac L.-M. (2018). Désaccords et conflits dans le Wikipédia francophone. In Travaux linguistiques du Cerlico, Presses Universitaires de Rennes (in press).
Poudat C. and Landragin F. (2017). Explorer un corpus textuel. Méthodes – Pratiques – Outils. Collection Champs linguistiques, De Boeck, Louvain-la-Neuve.
Poudat C., Grabar N., Paloque-Berges C., Chanier T. and Kun J. (2017). Wikiconflits : un corpus de discussions éditoriales conflictuelles du Wikipédia francophone. In Wigham C.R. and Ledegen G., eds, Corpus de communication médiée par les réseaux : construction, structuration, analyse, pp. 19-36. Collection Humanités numériques, L'Harmattan, Paris.
Sumi R., Yasseri T., Rung A., Kornai A. and Kertész J. (2011). Edit wars in Wikipedia. In Proceedings of the ACM WebSci'11, Koblenz, Germany, pp. 1-3.
Textometric Exploitation of Coreference-annotated Corpora with TXM: Methodological Choices and First Outcomes
Matthieu Quignard1, Serge Heiden2, Frédéric Landragin3, Matthieu Decorde2
1 ICAR, CNRS, University of Lyon – matthieu.quignard@ens-lyon.fr
2 IHRIM, ENS Lyon, CNRS, University of Lyon – {slh,matthieu.decorde}@ens-lyon.fr
3 Lattice, CNRS, ENS Paris, University Sorbonne Nouvelle, PSL Research University, USPC – frederic.landragin@ens.fr

Abstract
In this article we present a set of measures – some of which can lead to specific visualisations – with the objective of enriching the possibilities of exploration and exploitation of annotated data, in particular coreference chains. We first present a specific use of the well-known concordancer, here adapted to display the elements of a coreference chain. We then present a histogram generator that allows, for example, the display of the distribution of the various coreference chains of a text, given a value from the annotated properties. Finally, we present what we call progression charts, whose purpose is to display the progress of each chain throughout the text. We conclude on the interest of these (interactive) modes of visualization for making the annotation phase more controlled and more effective.

Keywords: coreference chain, corpus annotation, annotation tool, visualisation tool, exploration tool, statistical analysis of textual data.

1. Introduction
The manual annotation of a textual corpus with referring expressions (Charolles, 2002) and coreference chains (Schnedecker, 1997; Landragin & Schnedecker, 2014) requires suitable tools. A coreference chain can cover the whole text; it is therefore a linguistic object for which the existing means of visualization and exploration are few and often perfectible. The MMAX2 tool (Müller & Strube, 2006) allows for visualizing the links between referring expressions using arrows which connect markables. The GLOZZ tool (Mathet & Widlöcher, 2009) offers several means of visualization: with arrows, like MMAX2, or with specific marking in the margin or the middle of the text. The ANALEC tool (Landragin et al., 2012) and its specific extension for coreference chains (Landragin, 2016) propose a graphic metaphor based on a succession of coloured dots. This allows the analyst to configure visual parameters, for instance the colour, which can be linked to any of the annotated properties. This type of visualization makes it possible to see at a glance the structural differences between the different reference chains of a text.
2. Linguistic objects and methodology
In continuity with previous works (Heiden, 2010; Landragin, 2016), we present here a set of measures – some of which lead to specific visualisations – whose objective is to enrich the possibilities of exploring and exploiting annotated data. We focus in particular on annotations of discursive phenomena like coreference, i.e. annotations that are necessarily described on two levels:
1. the markable, a group of contiguous words to which labels are assigned, for instance using a feature structure;
2. the set of markables, or the links between markables, as is the case for any chain of annotations: anaphoric chains, chains of textual organizers, chains of textual structure elements, etc. A feature structure can also be assigned at level 2, i.e. to the set or to the links.

3. A concordancer adapted to annotation chains
As a first visualization mode, we reuse the very classic concordancer to display the elements which constitute a coreference chain. The use of such a visualization tool, well established in the corpus exploration community (Poudat & Landragin, 2017), seemed natural for visualizing chains of annotations. The latest version of TXM (Heiden, 2010) thus includes a concordancer which makes it possible to display in a column all the elements (e.g. referring expressions) of a chain (e.g. a coreference chain), with left and right contexts for each element. Compared to the visualisation choices of MMAX2 (Müller & Strube, 2006) and GLOZZ (Mathet & Widlöcher, 2009), i.e. arrows linking markables displayed directly on the text, this concordancer has the advantage of regrouping all the relevant information in a small graphic space.

Fig. 1: Concordancer with the elements of a coreference chain, dedicated to a character named "Caillette".

Fig. 1 shows the list of all referring expressions to the character 'Caillette'. Sorted in textual order, the concordancer shows the alternation between proper nouns, pronouns, possessives, etc. The concordancer may also be sorted along a given property of the markable, e.g. its POS label. This representation may then be exploited to check whether the POS annotation is consistent or not.
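As an illustration of the principle (not TXM's actual Java implementation), a minimal sketch, assuming the text is tokenized and each mention of a chain is stored as a (start, end) token span, might look as follows:

```python
# Minimal chain-oriented concordancer sketch: one KWIC line per mention,
# with left and right context, sorted in textual order.
def chain_concordance(tokens, chain_spans, context=5):
    """tokens: list of word strings; chain_spans: list of (start, end) token indices."""
    lines = []
    for start, end in sorted(chain_spans):          # textual order
        left = " ".join(tokens[max(0, start - context):start])
        pivot = " ".join(tokens[start:end])
        right = " ".join(tokens[end:end + context])
        lines.append(f"{left:>40} | {pivot:^25} | {right}")
    return lines

tokens = "la jeune Caillette sourit puis elle quitta la salle sans un mot".split()
caillette_chain = [(1, 3), (5, 6)]                  # "jeune Caillette", "elle"
print("\n".join(chain_concordance(tokens, caillette_chain)))
```

Sorting the spans on an annotated property (e.g. POS of the pivot) instead of position reproduces the property-based sorting described above.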
4. Histograms for visualising distributions of annotation chains
A second, also very traditional, mode of visualization is the histogram (bar plot). The user can select one or several properties – the determination of the referring expressions, for instance, or the type of referent – and launch calculations on their occurrences: cross-counts, correlation computations and so on. TXM now includes a histogram generator, which makes it possible, for example, to display the distribution of coreference chains throughout the text, as well as the distribution of chains according to the number of referring expressions they include. These calculations and their associated visualizations provide TXM with integrated functionalities which, in other state-of-the-art tools, required the development of scripts to export the relevant data and exploit them in an external tool like a spreadsheet. Figure 2 compares the distribution of grammatical categories of referring expressions in three texts. Although all three texts are encyclopedic, the Discours of Bossuet shows a particular profile, with a high number of proper nouns (GN.NAM).

Fig. 2: Comparative barplots of grammatical categories used by reference units in three texts: Bossuet, "Discours sur l'histoire universelle" (1681), Diderot, "Essais sur la peinture" (1759-1766), Montesquieu, "Esprit des lois" (1755).

5. Progression charts for annotation chains
A third (new) mode of visualization consists in graphically showing the progress of each chain throughout the text. The principle is simple, but the possibilities of exploring and exploiting the generated graph are numerous. In a two-dimensional chart whose abscissa represents the linearity of the text, chains are displayed point by point (cf. Fig. 3): each occurrence of a referring expression raises the ordinate of the corresponding point by one notch. The resulting broken lines are all ascending, but can vary considerably between areas of progression and flat areas. When they are visualized simultaneously, it is possible to detect the parts of the text where several referents compete, or on the contrary those where several referents appear alternately. Zooming (in and out) as well as focussing features allow the characteristics of each point to be inspected, thus enriching the exploration possibilities of these progression charts and of the underlying coreference chains.

Fig. 3: Progression graph of the main coreference chains at the beginning of "Essais sur la peinture" by Denis Diderot. The dots highlighted with symbols correspond to referring expressions with low accessibility.
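The construction of such a chart is easy to reproduce. A minimal sketch, assuming each chain is given as a list of token offsets of its mentions (illustrative only, not TXM code):

```python
# "Progression chart" sketch: for each coreference chain, the y-value
# increases by one at every new mention, plotted against the mention's
# position in the text.
import matplotlib.pyplot as plt

def progression(chain_positions):
    """chain_positions: token offsets of the chain's mentions."""
    xs = sorted(chain_positions)
    ys = list(range(1, len(xs) + 1))   # each mention raises the line one notch
    return xs, ys

# hypothetical chains with mention offsets
chains = {"Caillette": [3, 40, 41, 90, 200, 210], "le roi": [10, 150, 320]}
for name, positions in chains.items():
    xs, ys = progression(positions)
    plt.step(xs, ys, where="post", label=name)
plt.xlabel("position in text (tokens)")
plt.ylabel("cumulative mentions")
plt.legend()
plt.show()
```

Steep segments correspond to zones where a referent is heavily mentioned; long flats correspond to zones where it disappears from the discourse.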
6. Discussion
What these new visualization modes have in common is not only that they propose visual representations which are easy to understand (and possibly interactive, when one of the properties can be modified on the fly) and that these representations can be viewed directly in TXM, with no need to export annotated data to external tools, but also that they facilitate the analyst's detection of intruders, outliers and deviant examples – for instance potential annotation errors. It can be a referring expression which does not belong in the currently visualised chain; it may be a peak or a suspect flat zone in one of the generated histograms; it may be a zone with a very steep slope (or a very long flat) in a progression chart. In all three cases, the analyst can directly access the suspicious annotation in order to verify it and, of course, to modify it. The integration of the measurements and their visualizations in TXM allows this immediate return to the corpus annotation phase, which is particularly effective when the corpus is being annotated manually.

7. Conclusion and future works
One can say that it is by annotating that we see the mistakes we make, but we still need appropriate tools to detect these errors. With the new possibilities of interaction proposed here, we hope to be taking a significant step in this direction. The first tests we have carried out demonstrate the relevance of our approach.

References
Charolles M. (2002). La référence et les expressions référentielles en français. Ophrys, Paris, France.
Heiden S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, Sendai, Japan, Institute for Digital Enhancement of Cognitive Development, Waseda University, pp. 389-398, available at halshs.archives-ouvertes.fr/halshs-00549764.
Landragin F. (2016). Conception d'un outil de visualisation et d'exploration de chaînes de coréférences. Proceedings of the 13th International Conference Journées d'Analyse statistique des Données Textuelles (JADT 2016), Nice, France, pp. 109-120.
Landragin F., Poibeau T. and Victorri B. (2012). ANALEC: a New Tool for the Dynamic Annotation of Textual Data. Proceedings of LREC 2012, Istanbul, Turkey, pp. 357-362.
Landragin F. and Schnedecker C., editors (2014). Les chaînes de référence. Volume 195 of the Langages journal, Armand Colin, Paris, France.
Müller C. and Strube M. (2006). Multi-level annotation of linguistic data with MMAX2. In Braun S., Kohn K. and Mukherjee J., editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, Peter Lang, Frankfurt, Germany.
Poudat C. and Landragin F. (2017). Explorer un corpus textuel : méthodes, pratiques, outils. Champs Linguistiques. De Boeck Supérieur, Louvain-la-Neuve.
Schnedecker C. (1997). Nom propre et chaîne de référence. Klincksieck, Paris, France.
Widlöcher A. and Mathet Y. (2012). The Glozz platform: a corpus annotation and mining tool. In Concolato C. and Schmitz P., editors, Proceedings of the ACM Symposium on Document Engineering (DocEng'12), Paris, France, pp. 171-180.

Improving the Precision and Speed of the Clustering Algorithm of the Reinert Method in IRaMuTeQ

Pierre Ratinaud
LERASS, Université de Toulouse – ratinaud@univ-tlse2.fr

Abstract
This work presents a proposal to improve the accuracy and execution speed of the divisive hierarchical clustering (DHC) algorithm used by the Reinert method implemented in the free software IRaMuTeQ. The DHC of the Reinert method is a series of bi-partitions of a presence/absence matrix that crosses text segments and words. In the original version of this algorithm, after each partition, the largest of the remaining classes is selected to be split. We propose to replace the selection criterion for the classes to be partitioned with a criterion of homogeneity. The complete rewriting of this part of the IRaMuTeQ code has also been an opportunity to improve its speed, by implementing part of the code in C++ and parallelizing the procedure. An experiment carried out on 6 corpora shows that the new algorithm based on these principles is indeed more precise and faster.

Keywords: Reinert method, divisive hierarchical clustering, IRaMuTeQ, precision
1. Introduction
The Reinert method aims to bring out the different themes running through a textual corpus. Its greatest originality is surely the divisive hierarchical clustering (DHC) algorithm proposed by Reinert (1983). After recalling the various stages of this type of analysis, we propose a modification of this clustering algorithm aimed at improving the precision of the whole procedure. The proposed change concerns the criterion used to select sub-matrices after each partition. The description of this new procedure is completed by an experiment on 6 corpora in French and English, comparing the new version of the algorithm with the old one. The results we present do show an increase in the precision of the algorithm, whose rewriting also yielded an increase in execution speed. Before starting this presentation, it seems necessary to recall that the DHC is not the only particularity of the Reinert method.

2. From corpora to matrices
Another originality of this procedure is the unit used in the clustering. In most situations, clustering does not operate on whole texts but at a finer granularity: the units clustered are text segments. In the IRaMuTeQ software (Ratinaud, 2014; Ratinaud & Marchand, 2012), the size of these segments is set by default to 40 occurrences and their splitting takes punctuation into account: the splitting rule thus tries to produce units of homogeneous size (around 40 occurrences) while respecting the "natural" segmentation of texts marked by punctuation. A second originality worth noting is the distinction between content words and function words: most of the time, only content words (verbs, adverbs, adjectives and nouns) are considered in these analyses. Corpora can then be represented as matrices crossing text segments and content words, whose cells code the presence (1) or absence (0) of each form in each segment. Table 1 shows such a matrix for a corpus of 10 text segments (noted i1 to i10) and 9 forms (noted j1 to j9).

Table 1: Example of a matrix crossing text segments (rows) and forms (columns)

       j1  j2  j3  j4  j5  j6  j7  j8  j9
i1      1   1   1   1   0   0   0   0   0
i2      0   0   0   0   1   1   1   1   1
i3      0   0   1   0   1   0   1   0   0
i4      1   0   1   0   1   0   0   0   1
i5      0   0   1   0   1   0   1   0   0
i6      1   1   1   1   0   0   0   0   1
i7      0   0   0   0   1   1   1   1   0
i8      1   0   1   0   1   0   0   0   0
i9      0   0   1   0   1   0   1   0   1
i10     0   0   1   0   1   0   1   0   0

The matrix in Table 1 is a much simplified example of what happens in practice: matrices generated from textual corpora are much larger and much sparser (the proportion of 1s is very low). We write N for the total number of 1s in the matrix. The goal of the clustering is to reorganize this matrix into sub-groups of segments that maximize the following properties:
1. the grouped segments must be homogeneous among themselves: the method must bring together text segments that resemble each other, i.e. segments that tend to contain the same words;
2. the resulting sets must be heterogeneous from one another: the groups of segments must be as different from each other as possible.
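A minimal sketch of this segmentation and matrix construction, assuming a tokenized text and a fixed vocabulary of content words (illustrative only, not IRaMuTeQ's actual code; the punctuation set is an assumption):

```python
# Split a token stream into segments of roughly `target` occurrences,
# preferring punctuation breaks, then build the segments x forms
# presence/absence matrix described above.
import numpy as np

def segment(tokens, target=40):
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        # close the segment once it is long enough and we hit punctuation
        if len(current) >= target and tok in ".;!?":
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def presence_matrix(segments, vocabulary):
    X = np.zeros((len(segments), len(vocabulary)), dtype=int)
    index = {form: j for j, form in enumerate(vocabulary)}
    for i, seg in enumerate(segments):
        for tok in seg:
            if tok in index:
                X[i, index[tok]] = 1   # presence/absence, not counts
    return X
```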
Illustration 1 proposes a split of the matrix of Table 1 into 4 classes respecting these criteria.

Illustration 1: Split of the matrix of Table 1 into 4 classes: [i1,i6], [i4,i8], [i3,i5,i9,i10] and [i2,i7], each class grouping identical or near-identical rows of Table 1.

The "quality" of this solution can be assessed by computing the chi2/N of the reduced table (Reinert, 1983). In this example, the optimal solution would be obtained by separating rows i6, i4, i2 and i9 from their classes and letting each form its own class: the resulting 8-class solution would then summarize all the information contained in the matrix of Table 1.

Table 2: Reduced table of the clustering of Illustration 1

                   j1  j2  j3  j4  j5  j6  j7  j8  j9
Σ [i1,i6]           2   2   2   2   0   0   0   0   1
Σ [i4,i8]           2   0   2   0   2   0   0   0   1
Σ [i3,i5,i9,i10]    0   0   4   0   4   0   4   0   1
Σ [i2,i7]           0   0   0   0   2   2   2   2   1

3. The DHC of the Reinert method
Recall that the method for building these classes automatically relies on a series of bi-partitions, each based on a correspondence analysis (CA). The first split is obtained by searching, along the first factor of this CA, for the two sub-matrices that maximize the chi2/N of the reduced table. The resulting partition is then improved by moving each row of the table from one class to the other and recomputing the chi2/N of the reduced table; every move that increases the chi2/N value is kept, and this step loops until no move increases the value any further. A last step removes the forms (columns) that are statistically under-represented in the matrices (on the basis of a chi2 test). This procedure (bi-partition of the matrix, row moves, column removal) constitutes one partition of the DHC. The DHC as a whole repeats this procedure as many times as needed to reach the configured number of terminal classes: n-1 partitions are needed to build n terminal classes. After each partition, in its original formulation, the algorithm selects the largest of the classes built so far (the one containing the most rows) to be split in turn.
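A simplified sketch of one such bi-partition, assuming a presence/absence numpy matrix and approximating the first CA factor by an SVD of the standardized residuals (illustrative only; the actual IRaMuTeQ implementation differs and is written in R/C++):

```python
import numpy as np

def chi2_over_N(X, labels):
    """chi2/N of the 2 x m reduced table obtained by summing rows per class."""
    if labels.sum() in (0, len(labels)):
        return -np.inf                          # a class must not be empty
    T = np.vstack([X[labels == 0].sum(0), X[labels == 1].sum(0)]).astype(float)
    T = T[:, T.sum(0) > 0]                      # drop columns absent from both
    N = T.sum()
    E = np.outer(T.sum(1), T.sum(0)) / N        # expected counts
    return ((T - E) ** 2 / E).sum() / N

def bipartition(X):
    # 1) order rows along the first factor (SVD of standardized residuals ~ CA)
    P = X / X.sum()
    r, c = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
    S = (P - r @ c) / np.sqrt(r @ c)
    scores = np.linalg.svd(S, full_matrices=False)[0][:, 0]
    order = np.argsort(scores)
    # 2) best cut point along that ordering
    best, labels = -np.inf, None
    for k in range(1, len(order)):
        lab = np.zeros(len(order), dtype=int)
        lab[order[k:]] = 1
        q = chi2_over_N(X, lab)
        if q > best:
            best, labels = q, lab
    # 3) hill-climbing: keep any single-row move that raises chi2/N
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            labels[i] ^= 1                      # try moving row i
            q = chi2_over_N(X, labels)
            if q > best:
                best, improved = q, True
            else:
                labels[i] ^= 1                  # undo the move
    return labels, best
```

The column-removal step (dropping under-represented forms) is omitted here for brevity.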
Table 3 shows, in a deliberately caricatural way, a matrix for which this strategy does not lead to a satisfactory result.

Table 3: A problematic matrix

      j1  j2  j3  j4  j5
i1     1   1   0   0   0
i2     1   1   0   0   0
i3     1   1   0   0   0
i4     0   0   1   1   0
i5     0   0   0   1   1

If we submitted this matrix to the DHC described above, the first partition would create one class made of rows i1, i2 and i3 (noted [i1,i2,i3]) and another made of rows i4 and i5 (noted [i4,i5]). The first class, being the largest, would be selected to be partitioned in turn. Yet it is obvious that there is no information left to extract from it, since its rows are all identical; only the separation of rows i4 and i5 could, in this example, improve the quality of the result. To achieve that, the most heterogeneous remaining class ([i4,i5]) should have been selected rather than the largest one ([i1,i2,i3]). One must therefore realize that, in the currently available version of the method, the clustering algorithm assumes that the largest matrix is also the most heterogeneous. We believe that some corpora do not satisfy this property, and that at various moments of a clustering the largest remaining matrix may well not be the most heterogeneous one.

4. A new way of chaining the partitions
It then seems relevant to test, after each partition phase, the homogeneity of the remaining matrices so as to select the most heterogeneous one. Since the correspondence analysis computed for each partition yields the chi2 of the whole matrix, we used this property to redesign the algorithm. In this new version, after each partition, the CA and the chi2 of the two resulting matrices are computed beforehand. For each matrix we derive a homogeneity index that takes into account the chi2 of the matrix, its size and the total number of forms; it amounts to multiplying the chi2 of the matrix by the ratio of the 1s it contains:

$$h(M) = \chi^2(M) \times \frac{n_1(M)}{N}$$

where $n_1(M)$ is the number of 1s in the sub-matrix $M$ and $N$ the total number of 1s. This method makes it possible no longer to assume that the largest matrix is the most heterogeneous, but to test this heterogeneity. Its drawback is that it requires systematically computing the CA on practically all the matrices produced; without further modification, the procedure would be much slower than the previous version of the algorithm. In order to speed up these analyses, the theoretical rewriting of the algorithm was accompanied by a search for performance gains, in two directions:
- The most computation-intensive parts were rewritten in C++ via the Rcpp (Eddelbuettel et al., 2017) and RcppEigen (Bates, Eddelbuettel, Francois & Yixuan, 2017) R packages. The parts concerned are the search for the partition maximizing chi2/N after the CA, and the reassignment of rows.
- These two parts being a sequence of chi2 computations on a single matrix, it was possible to parallelize them to take advantage of the multi-core nature of most modern processors. Computations are thus distributed over the cores/threads of the machine via the parallel and doParallel (Calaway, Microsoft Corporation, Weston & Tenenbaum, 2017) R packages.

These changes in fact required the complete rewriting of the algorithm of the Reinert method in IRaMuTeQ.
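A hypothetical sketch of the modified scheduling, reusing the bipartition function from the previous sketch and a priority queue keyed on the heterogeneity index:

```python
# Instead of always splitting the largest remaining class, split the one
# with the highest heterogeneity index h = chi2(M) * n_ones(M) / N_total.
import heapq
import numpy as np

def chi2_of_matrix(X):
    X = X[:, X.sum(axis=0) > 0]
    X = X[X.sum(axis=1) > 0, :]
    n = X.sum()
    E = np.outer(X.sum(axis=1), X.sum(axis=0)) / n
    return ((X - E) ** 2 / E).sum()

def reinert_plus(X, n_classes, bipartition):
    """bipartition(M) -> (labels, quality); see the previous sketch."""
    N_total = X.sum()
    h = lambda M: chi2_of_matrix(M) * M.sum() / N_total
    heap = [(-h(X), 0, X)]          # max-heap via negated index; int tie-breaker
    counter = 1
    while len(heap) < n_classes:
        neg_h, _, M = heapq.heappop(heap)
        if M.shape[0] < 2:          # nothing left to split
            heapq.heappush(heap, (neg_h, counter, M))
            break
        labels, _ = bipartition(M)
        for part in (M[labels == 0], M[labels == 1]):
            counter += 1
            heapq.heappush(heap, (-h(part), counter, part))
    return [M for _, _, M in heap]  # the terminal classes
```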
5. Experiment
To test the benefits of this new procedure in terms of precision and speed, an experiment was run on 6 different corpora. We combined large corpora (the most likely to show disproportions among the themes they contain) with a smaller one. The characteristics of these corpora are summarized in Table 4.

Table 4: Description of the corpora used in the experiment.

The dataconf corpus consists of titles and abstracts of computer-science conferences; it is in English only. 20Newsgroup (http://qwone.com/~jason/20Newsgroups/) is another English corpus, gathering 20 discussion lists on very diverse themes (Lang, 1995). lemondefr is a corpus of articles from the web site of Le Monde (http://www.lemonde.fr); it is in French. Ssm, for "same sex marriage", is a corpus of American and British press articles on same-sex marriage, built by Nathalie Paton. AN2011 corresponds to the 2011 transcripts of the debates of the French National Assembly (Ratinaud & Marchand, 2015). Finally, the corpus noted LRU gathers 100 articles of the French daily press on the "liberté et responsabilité des universités" law. The experiment consists in running both versions of the clustering algorithm on the matrices extracted from these corpora and comparing the quality of the results. The number of terminal classes was set to 100 for the "large" corpora and 30 for the "small" one. In one setting the algorithm uses the size criterion to select the matrices to be partitioned; in the other it uses the homogeneity criterion. The results take the form of charts showing the evolution of the quantity of information extracted after each partition, the value reported being the chi2/N of the reduced table of the classes. In the charts of Illustration 2, red curves give the values obtained with the old version of the algorithm (noted Reinert) and blue curves those obtained with the new version (noted Reinert++); a higher value corresponds to a better partition quality. The bar chart gives the percentage of increase or decrease of partition quality of the new algorithm, taking the old one as reference: green bars signal a quality increase and red bars a decrease. For the new version of the algorithm, 6 cores were allotted to the procedure (the tests were run on a MacBook Pro 11,3 with an Intel i7-4960HQ processor). These results show quite clearly that the new version of the algorithm increases clustering precision in the majority of cases. They also show that this quality gain is tied to the distribution of themes in the corpora: not all corpora benefit from this evolution in the same way. Note also that on the LRU corpus there is practically no difference between the two methods; the 1 to 3% precision loss at various moments of the clustering on this corpus is entirely negligible and should be attributed to rounding differences between the R code and the C++ code. At the opposite end, some corpora, such as 20newsgroup, show precision gains of up to 15%.

Illustration 2: Comparison of the results between the old version (Reinert) and the new version (Reinert++) of the clustering algorithm.

Illustration 3 shows that on the sizeable corpora the performance gain brought by the move to C++ and parallelization lies between a factor of 4 and a factor of 6; in other words, the new algorithm is up to 6 times faster on the machine on which these computations were run.
Illustration 3: Comparison of analysis times between the old version (Reinert) and the new version (Reinert++) of the algorithm, with the performance gain per corpus (AN2011, dataconf, 20newsgroup, lemondefr, ssm).

6. Conclusion
In this work we propose a new formalization of the divisive hierarchical clustering procedure of the Reinert method. Starting from the hypothesis that, in some corpora and at some moments of these clusterings, the most heterogeneous class is not necessarily the largest one, we propose to change the criterion governing the order in which the matrices are partitioned, from a size criterion to a homogeneity criterion. The results of an experiment on 6 corpora show that voluminous corpora do benefit from this change. These results are also an invitation to pursue investigations of this method. The procedure will be implemented in the next version of the IRaMuTeQ software; the use of the homogeneity criterion will be optional, so that users can revert to the old behaviour.

References
Bates D., Eddelbuettel D., Francois R. and Yixuan Q. (2017). RcppEigen: "Rcpp" Integration for the "Eigen" Templated Linear Algebra Library (Version 0.3.3.3.1). Available at https://cran.r-project.org/web/packages/RcppEigen/index.html
Calaway R., Microsoft Corporation, Weston S. and Tenenbaum D. (2017). doParallel: Foreach Parallel Adaptor for the "parallel" Package (Version 1.0.11). Available at https://cran.r-project.org/web/packages/doParallel/index.html
Eddelbuettel D., Francois R., Allaire J. J., Ushey K., Kou Q., Russell N., … Chambers J. (2017). Rcpp: Seamless R and C++ Integration (Version 0.12.14). Available at https://cran.r-project.org/web/packages/Rcpp/index.html
Lang K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 331-339.
Ratinaud P. (2014). IRaMuTeQ : Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires (Version 0.7 alpha 2) [Windows, GNU/Linux, Mac OS X]. Available at http://www.iramuteq.org
Ratinaud P. and Marchand P. (2012). Application de la méthode ALCESTE à de « gros » corpus et stabilité des « mondes lexicaux » : analyse du « CableGate » avec IRaMuTeQ. In Actes des 11e Journées internationales d'Analyse statistique des Données Textuelles (JADT 2012), Liège, Belgique, pp. 835-844. Available at http://lexicometrica.univ-paris3.fr/jadt/jadt2012/Communications/Ratinaud,%20Pierre%20et%20al.%20-%20Application%20de%20la%20methode%20Alceste.pdf
Ratinaud P. and Marchand P. (2015). Des mondes lexicaux aux représentations sociales. Une première approche des thématiques dans les débats à l'Assemblée nationale (1998-2014). Mots. Les langages du politique, 2015(108), 57-77.
Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, VIII(2), 187-198.
Reinert M. (1990). ALCESTE : Une méthodologie d'analyse des données textuelles et une application : Aurélia de Gérard de Nerval. Bulletin de méthodologie sociologique, (26), 24-54.
The Frequency Parameter between Paradoxes and Antinomies: the Case of Scholastic Italian

Luisa Revelli
Università della Valle d'Aosta – l.revelli@univda.it

Abstract
Emblem of a formal register, the linguistic variety proposed as a model in the Italian school system ever since National Unity is characterized by a lasting artificiality and a strong unwillingness to innovate, even within a frame of progressive slow changes along its historical development. That is why the lexical frequencies recorded for "Scholastic Italian" can appear inherently inconsistent, contrasting with the basic vocabulary, and even contradictory compared with other apparently similar varieties of Italian. Consequently, studying their configuration requires analysis models capable of interpreting quantitative data (volume figures) in the light of the complexity of the paradigmatic relations between competing solutions and of the composite connections between the number and type of meanings exhibited in actual use. Taking as a case study the Scholastic Italian used by teachers during the first 150 years of the national school system, and starting from the data collected in the diachronic corpus of CoDiSV, this contribution aims at verifying the opportunities and critical points of lexicometric analysis applied to such a linguistic variety: addressed to an unsophisticated audience, yet characterized by a specialized point of view; of high aspirations, but influenced by educational needs; constantly evolving and yet always recalcitrant to the solicitations of the contemporary language.

Keywords: Scholastic Italian; lexical frequency; lexicometry; basic vocabulary.
1. Introduction
At once addressed to a naive audience and specialized in perspective, of high aspiration yet conditioned by didactic needs, constantly evolving and nevertheless constantly recalcitrant to the solicitations of the living, contemporary language, the Scholastic Italian (henceforth IS) proposed by teachers during the first one hundred and fifty years of the national school seems to provide a good test bed for bringing out the critical areas that arise when lexicometric parameters are applied to polygenetic and constitutionally inhomogeneous linguistic varieties (for an overview of the stable and evolving characteristics of IS, see De Blasi 1993, Cortelazzo 1995, Benedetti & Serianni 2009, Revelli 2013). In IS, indeed, an ideal of expressive richness pursued through a marked hostility towards any form of redundancy, repetition or generic wording pushes towards an ostentatious, studied variatio; but the simultaneous need to teach young pupils literacy by steering them towards specific members of synonymic series held to be more correct, appropriate or ornamental tends, in fact and in the opposite direction, to reduce the range of available expressive options. The need to convey disciplinary knowledge through language, moreover, requires the use of metalanguages, technical terms and semantic senses that seem to further destabilize the lexical reservoir of IS, moving it significantly away from the basic vocabulary of Italian. To what extent and in what terms this actually happens is what we set out to verify below, integrating the available lexicometric and quantitative data with some qualitative reflections. For IS, the lexical base taken as reference is a frequency lexicon built by the present author (Revelli 2013) from an initial corpus of 830 primary-school notebooks written in the Aosta Valley area between the end of the 19th century and the first years of the 21st. The 2,022 terms composing this basic vocabulary were identified after a balanced selection of the documents, divided into twenty-year chronological subcorpora, was submitted to computational processing with the aim of identifying the extent of diachronic variation in the linguistic canons proposed as models by teachers; the text types considered are exercise instructions, composition titles, dictations, corrections, assessments and marks documented in the pupils' notebooks. Alongside the concordances, a first-pass vocabulary of 152,151 occurrences (tokens) was obtained, reduced to 18,898 forms (types) and 11,751 lemmas (the vocabulary and the concordances of the corpus were obtained, after annotation and lemmatization, with the T-LAB software designed and distributed by Franco Lancia; for the principles and methodology adopted, see Revelli 2013). A further selection then led to the identification of the 2,022 nouns, adjectives and verbs considered panchronic because stably settled in the basic vocabulary of Scholastic Italian (henceforth VoBIS), i.e. attested with more than five occurrences in at least four of the six chronological repertories, or in three non-consecutive ones.
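For readers who wish to reproduce such counts on their own data, a minimal sketch (the author used T-LAB; the (form, lemma) input format shown here is an assumption, not the corpus's actual layout):

```python
# Count occurrences (tokens), distinct forms (types) and distinct lemmas
# in a lemmatized corpus given as (form, lemma) pairs.
corpus = [("andava", "andare"), ("scuole", "scuola"), ("scuola", "scuola")]

tokens = len(corpus)
types = len({form for form, _ in corpus})
lemmas = len({lemma for _, lemma in corpus})
print(tokens, types, lemmas)   # 3 2 2
```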
The term of comparison is the 2016 edition of the Nuovo Vocabolario di base della lingua italiana (henceforth NVdB) by Isabella Chiari and Tullio De Mauro (https://www.internazionale.it/opinione/tullio-de-mauro/2016/12/23/il-nuovo-vocabolario-di-base-della-lingua-italiana), which distributes the roughly 7,000 words statistically most frequent and most accessible to 21st-century Italian speakers over three reservoirs: the fundamental lexicon (FO, about 2,000 very-high-frequency words used in 86% of discourse and texts), the high-usage lexicon (AU, about 3,000 frequently used words covering 6% of occurrences) and the high-availability lexicon (AD, about 2,000 words "used only in some contexts but understandable by all speakers and perceived as having an availability equal to or even greater than the most used words"). The choice of this base, which also includes frequencies for spoken varieties and is temporally located after the period considered for the school lexicon, answers the need to verify whether, and to what extent, the written model offered by the latter may have influenced the configuration of subsequent actual usage.

2. Basic vocabularies compared: frequencies in the NVdB and in the VoBIS
Comparing the lexical reservoirs of the two repertories allows some first general observations: of the 2,022 lemmas of the VoBIS, 1,784 are matched in the NVdB, distributed 53% in the FO reservoir, 26% in AU and 9% in AD. Without going into the convergences shared by the two vocabularies, it is worth signalling that behind many cases of apparent coincidence of the frequency distributions lie significant differences, mostly induced by the tendency of IS towards semantic narrowing or, in some cases, semantic redetermination: among the many words that take on specific school senses (e.g. diario, interrogazione, nota, pensierino, voto), some entirely lose their anchoring to their meanings in the common language, as happened to tema, which has come to identify no longer a subject or topic to be treated but the product of a specific text type.
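Such a comparison is straightforward to reproduce. A minimal sketch, assuming hypothetical file names and a simple lemma-per-line / tab-separated format (not the actual VoBIS or NVdB distribution formats):

```python
# Given the VoBIS lemma list and the NVdB with its usage bands (FO/AU/AD),
# compute how the VoBIS lemmas distribute over the bands.
from collections import Counter

def band_coverage(vobis_lemmas, nvdb_bands):
    """nvdb_bands: dict lemma -> 'FO' | 'AU' | 'AD'."""
    counts = Counter(nvdb_bands.get(lemma, "absent") for lemma in vobis_lemmas)
    total = len(vobis_lemmas)
    return {band: round(100 * n / total, 1) for band, n in counts.items()}

vobis = [line.strip() for line in open("vobis_lemmas.txt", encoding="utf8")]
nvdb = dict(line.strip().split("\t") for line in open("nvdb.tsv", encoding="utf8"))
print(band_coverage(vobis, nvdb))
# expected shape: {'FO': 53.0, 'AU': 26.0, 'AD': 9.0, 'absent': 12.0}
```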
As for the 238 words absent from the NVdB (12%), they can be grouped into categories useful for bringing into focus various critical issues in the comparative application of the frequency parameter. A first, substantial group exclusive to IS consists of logonyms characteristic of the metalinguistic nomenclature of the school apparatus, such as alfabetico, apostrofo, coniugazione, preposizione, etc. We observe that, despite their polysemic potential, many of these – such as coniugare, derivato, imperfetto, possessivo, primitivo – reach very high frequencies within IS exclusively in their role as labels for metalinguistic reflection (for instance, dimostrativo, always preceded by aggettivo or pronome, never combines with atto, gesto, etc.): quantitative representativeness therefore does not imply that pupils come into contact with the various senses these terms may carry, but corresponds instead to an insistent specialization motivated by didactic needs. A second group consists of terms typical of literacy-teaching contexts: mainly nouns referring to concrete referents of little everyday prominence, whose written form guides and requires knowledge of counter-intuitive conventions that are nonetheless fundamental for correct orthographic encoding and decoding – for example words like acquaio, acquavite and acqueo, evidently introduced not out of pressing thematic necessity but to consolidate correct graphematic representations. To teaching purposes linked to the school subjects, or more generally to thematic choices characteristic of the educational context, are to be attributed the high frequencies of several terms of the historical-geographical domain (legione, vetta), of words describing the natural world (arto, astro) and rural life (semina, vendemmia), and of series of verbs (castigare, disobbedire), adjectives (diligente, ordinato) and abstract nouns (umiltà, penitenza) belonging to the typical formulary of civic or moral education and of school assessment. Unlike the NVdB, moreover, because of its diachronic design the VoBIS lemmary includes numerous archaisms: in some cases formal variants now abandoned (e.g. annunziare for annunciare) or dispreferred (ubbidire for obbedire); terms whose referents the social changes of the last fifty years have made superfluous or anachronistic (manto, ricamatrice); and members of outdated or formal synonymic pairs or series that only in the school context are, or were for longer, preferred over competitors felt by speakers as more current (persuadere for convincere; e.g. bambagia, cagionare, figliolo, focolare, garzone, uscio). Precisely in this last category the IS reservoir departs most visibly from the current lexicon, systematically privileging solutions absent from the NVdB at the expense of those documented there, in many cases with the FO usage mark (e.g. appetito for fame, ardere for bruciare, sciupare for rovinare, etc.). Continuing with the missing matches between the two repertories: while the absence from the NVdB of somewhat affected school words such as diletto, garbato, vezzo and soave is unsurprising, it is striking that terms apparently stable over time and of pan-Italian diffusion are not included: zoonyms such as bue, elefante, formica; names of fruit usually present on Italian tables such as fragola, noce and uva; and common concrete nouns such as carezza, martello, ombrello.
The lack of matches in the NVdB for terms of this kind can only partly be interpreted in a properly sociolinguistic dimension: while it is true that – given the audience it addresses – IS refers more frequently to themes and referents of material and experiential culture than do linguistic varieties addressed to and produced by adult speakers, it is equally true that in theory all such words, all the more so if welcomed and conveyed by the school, should belong to that heritage of "words we may happen never, or almost never, to say, let alone write, but which are linked to objects, facts and experiences well known to all adults in everyday life" (De Mauro 1980: 148). We would therefore expect to find matches at least within the AD reservoir, whose nature, however, De Mauro himself repeatedly declared to be elusive – conjectural rather than statistical (the preface to the NVdB specifies that the AD words "were derived from the list of 2,300 high-availability words of the old VdB, submitted to groups of university students to eliminate the words no longer felt as of major use and to admit new words felt as highly available"). And indeed, probably not even the most imposing and meticulous quantitative analyses can hope to eliminate the inevitable factors of unpredictability and accidentality of frequency. In the case sampled here, which for IS does not rest on a source corpus of fully satisfactory size, gaps concerning terms for which one would expect matches also appear when the perspective is reversed, i.e. starting from the NVdB lemmary: even widening the horizon to the entire vocabulary of the corpus, what is missing is not only AD-marked terms but also fundamental words that are probably known to children yet do not appear in the sample for merely accidental reasons (exemplary of these margins of chance is the case of ethnonyms, some of which are missing from the CoDiSV – e.g. cinese, iugoslavo – although it documents very many others of at least apparently similar diffusion, e.g. giapponese, inglese). Certainly motivated and intentional, by contrast, are specific types of omission easily identifiable as specific to IS: neologisms and luxury loanwords, which the teachers' models – perhaps in some cases also for orthographic reasons – tend to reject even when stably acclimatized in standard Italian (jeans, quiz, smog); terms referring to concepts deemed unseemly for an immature audience (aborto, droga, sesso); slang, vulgar expressions, insults and imprecations (coglione, culo, ruttare); discriminatory appellatives (ebreo, nano, negro); but also words prudently avoided because felt to be potentially factious, propagandistic, or at least ideologically and politically oriented. On this last aspect, which embodies the intimacy of the relations between lexicon, school, social climate and cultural temper, no generalizations can however be made, since the clues to the different shapes the phenomenon has taken over time, including very recently, have to be intercepted in the low or non-existent frequencies rather than in the high frequencies of the basic lexicon.
3. Conclusions and perspectives
As we set out to show, the quali-quantitative examination of IS confirms that, although it displays diachronic traits of modernization, the linguistic model proposed by teachers is characterized by the stable use of terms foreign to the basic vocabulary and by the parallel avoidance of current terms deemed inadequate, unseemly, or simply worn out by a use felt to be excessive. The study of the data further reveals an abundant presence of logonyms and labels typical of, or exclusive to, didactic and grammatical metalanguage; the use of hapaxes often confined to occasional specific exercise types yet didactically indispensable for their strategic role; and recourse to a formulary within which current terms, through processes of semantic redetermination, take on senses different from the usual ones, specializing in relation to tasks and communicative routines typical of the educational context. For these reasons, the lexical frequencies documented in the varieties of IS appear partly inconsistent internally, discordant with those of the basic vocabulary, and contradictory with respect to the evidence traceable in apparently similar varieties of Italian: studying their configurations therefore requires models of analysis capable of interpreting quantitative data in the light of the complexity of the paradigmatic relations between potential competing solutions, as well as of the composite relations between the number and type of senses attested in concrete contextual uses. In this direction, already partly explored in psycholinguistic and language-teaching studies of comprehension processes and text readability, a systematic comparison between the IS lexicon and that of the VdB, conducted on chronologically harmonized corpora, seems able to open further lines of research in at least two specific domains of inquiry. The first, more properly acquisitional in perspective, should aim to verify the actual outcomes of prolonged exposure at school age to the share of IS words foreign to the basic vocabulary: in this direction, given the incremental and adaptive nature of lexical learning but also the fading effect that lack of practice can exert on acquired competence, one could try to answer questions such as the following. How much does the insistence with which a term is present in the input offered within IS actually affect its use in domains distinct from it and experienced later? To what extent can the conceptualization of a given sense of a term conveyed by teaching condition (positively or negatively) the subsequent acquisition of further, different meanings for that same term? In what terms do the preferential solutions and paradigmatic choices proposed by IS prove successful, at least at the level of receptive competence, in competition with the statistical analyses that speakers carry out on other varieties and in potentially more meaningful contexts?
In che misura la concettualizzazione relativa una determinata accezione di un termine veicolata dall’insegnamento può condizionare (positivamente o negativamente) la successiva acquisizione di significati ulteriori e diversi per quello stesso termine? In che termini le soluzioni preferenziali e le scelte paradigmatiche proposte dall’IS risultano vincenti, almeno a livello di competenza ricettiva, nella concorrenza con le analisi statistiche che i parlanti sperimentano su altre varietà e in contesti potenzialmente più pregnanti? E in questo senso, quanto può essere JADT’ 18 633 percepito come autorevole, significativo, dotato di rilevanza comunicativa il modello lessicale scolastico in un Paese in cui l’italiano è diventato lingua materna per la gran parte dei cittadini e la concorrenza di input – non soltanto lessicale - proveniente da fonti alternative alla scuola appare quantitativamente strabordante? Un secondo ambito d’indagine, al precedente correlato ma di prospettiva principalmente lessicografica, potrebbe invece essere indirizzato ad esplorare l’ipotesi che una parte del vocabolario scolastico di base possa essere considerata denominatore comune delle competenze lessicali possedute dai parlanti adulti alfabetizzati, e venire impiegata soprattutto come punto di riferimento per la definizione del vocabolario di alta disponibilità. In questo senso, le oggettive difficoltà di identificazione di quelle “parole che riteniamo più comuni, psicologicamente e culturalmente, ma che poi hanno in realtà una frequenza minima, vicina a zero, soprattutto nell’uso scritto” (De Mauro 2004: 142) potrebbero essere in parte superate facendo riferimento a quella porzione di bagaglio lessicale condiviso e acquisito, se non attraverso altri canali, per il tramite dell’IS: seppure statisticamente poco rilevanti nelle produzioni adulte, i termini a chiunque familiari perché proposti con frequenze elevate e funzioni significative nell’italiano per i bambini – ad esempio i termini tipicamente indicati sugli alfabetieri (oca), usualmente utilizzati per l’insegnamento delle particolarità ortografiche (camoscio), presenti nelle denominazioni più diffuse di giochi e tipologie esercitative (cruciverba), in fiabe e racconti (carrozza), corrispondenti a discipline (geografia) o routines scolastiche (giustificazione) – potrebbero probabilmente superare qualunque prova di elicitazione sui parlanti e quindi, seppur difficilmente rintracciabili nel lessico adulto, essere selezionate per entrare nel vocabolario di base con attribuzione della marca AD. Anche in questo caso, certamente, per evitare insidie e ambiguità semantiche andrebbero individuati dispositivi utili ad accertare la fenomenologia delle accezioni effettivamente attive nonché a verificare e interpretare criticamente le relazioni intercorrenti tra la frequenza dell’input lessicale (e semantico) in ingresso e la frequenza dell’output lessicale (e semantico) fattuale ma anche potenziale, in un modello descrittivo che – nel contemplare un’interazione dialettica, dinamica e comparativa tra le dimensioni della ricettività, produttività e disponibilità e attribuendo i giusti pesi a quella delicata e complessa combinazione di quantità e qualità che De Mauro (1994: 97) felicemente ebbe modo di battezzare binomio indispensabile – consenta di distinguere gli autentici dai solo apparenti paradossi della frequenza. 634 JADT’ 18 Riferimenti bibliografici Benedetti G. e Serianni L. (2009). Scritti sui banchi. L'italiano a scuola fra alunni e insegnanti. Roma, Carocci. 
Chiari I. and De Mauro T. (2012). The new basic vocabulary of Italian: problems and methods. Rivista di statistica applicata / Italian Journal of Applied Statistics, vol. 22 (1): 21-35.
Cortelazzo M. (1995). Un'ipotesi per la storia dell'italiano scolastico. In Antonelli Q. and Becchi E., editors, Scritture bambine, Roma-Bari: Laterza, pp. 237-252.
De Blasi N. (1993). L'italiano nella scuola. In Serianni L. and Trifone P., editors, Storia della lingua italiana, vol. I "I luoghi della codificazione". Torino: Einaudi, pp. 383-423.
De Mauro T. (1980). Guida all'uso delle parole. Roma: Editori Riuniti.
De Mauro T. (2004). La cultura degli italiani. Edited by Francesco Erbani. Roma-Bari: Laterza.
De Mauro T. (2005). La fabbrica delle parole. Torino: Utet Libreria.
Revelli L. (2013). Diacronia dell'italiano scolastico. Roma: Aracne.

How Twitter Emotional Sentiments Mirror on the Bitcoin Transaction Network

Piergiorgio Ricci
Tor Vergata University – piergiorgio.ricci@gmail.com

Abstract
Bitcoin is the first and most popular decentralized cryptocurrency. It was launched in 2008 by Satoshi Nakamoto, the name used by the unknown person or people who designed the Bitcoin system and created its original reference implementation. It is based on blockchain technology, which is considered one of the most promising technologies for the future: it is more than an instrument of finance and will likely disrupt many industries, from banking to governance, in the coming years. This research explores a geolocalized subset of the Bitcoin blockchain and compares it with Twitter communication on the topic, in order to discover what people living in different geographical areas think about the Bitcoin cryptocurrency and to assess the potential relationship between the characteristics of the language adopted by Twitter users in posts containing the keyword "Bitcoin" and the structure of the geolocalized blockchain. It also answers a variety of interesting questions about the national use of Bitcoin.

Keywords: Bitcoin, Blockchain, Cryptocurrency, Social Network Analysis, Semantic Analysis.

1. Introduction
The Bitcoin cryptocurrency is based on blockchain technology, which consists of an open, distributed ledger where all transactions occurring in the system are recorded in a verifiable and permanent way (Narayanan, Bonneau, Felten, Miller and Goldfeder, 2016). Transactions are organized in blocks, which are generated periodically and linked using cryptographic techniques (SHA-256) (Drainville, 2012). Each block needs to be validated by a peer-to-peer network following a specific protocol for validating new blocks. Once stored, data cannot be tampered with without altering all subsequent blocks, an activity that requires the collusion of the network majority (Nakamoto, 2008). This approach complies with consensus theory, a social theory which holds that social change and innovation can be achieved without conflict and that the social system is fair. In fact, Bitcoin's protocol relies on a strong social consensus among all participants in the system, each of whom represents a node of the network and runs software with the aim of enforcing the rules they agree on. The Bitcoin network is decentralized and does not require trust in a third party, such as a bank or a government institution.
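The tamper-evidence property described here can be illustrated with a minimal sketch of a hash-linked chain (illustrative only; real Bitcoin blocks also contain Merkle roots, nonces and difficulty targets):

```python
# Each block stores the SHA-256 hash of its predecessor, so changing any
# block breaks every hash link that follows it.
import hashlib
import json

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = [{"prev": "0" * 64, "txs": ["genesis"]}]
for txs in (["a->b: 1 BTC"], ["b->c: 0.4 BTC"]):
    chain.append({"prev": block_hash(chain[-1]), "txs": txs})

chain[1]["txs"] = ["a->b: 100 BTC"]              # tamper with an old block
print(block_hash(chain[1]) == chain[2]["prev"])  # False: the link is broken
```

Repairing the forgery would require recomputing every subsequent block, which is what makes majority collusion necessary.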
For sure, it represents a new concept of money (Evans, 2014), and the main purpose of this work is to find out what people living in different geographical areas think about the Bitcoin cryptocurrency, and to assess the potential relationship between the characteristics of the language used in Twitter posts on the topic and the structure of the geolocalized Bitcoin blockchain. The research analyzes correlations and causalities between social network metrics computed on the geolocalized Bitcoin transaction network and Bitcoin-related emotional signals intercepted by analyzing Twitter users' posts grouped by country. In particular, we consider it important to discover whether there is a specific kind of communication adopted by Twitter users belonging to countries that hold certain transaction-network centrality values. In other words, the core question to be answered concerns the existence of a correlation between a country's centrality in the Bitcoin transaction network and the characteristics of the language used in Twitter Bitcoin posts by its citizens. To achieve this purpose, two datasets representing Bitcoin transactions and Twitter communication related to Bitcoin have been collected and classified on a geographical basis. Prior research has focused on economic aspects (Ron and Shamir, 2012) and on structural properties of the Bitcoin transaction network (Lischke and Fabian, 2016; Fleder, Kester and Pillai, 2015), but it has rarely considered the relationship between transactions and social media communication. This study also answers a variety of interesting questions about the national use of Bitcoin and about how Twitter users perceive it through the communication signals posted on the Twitter microblogging platform. One of the most widely accepted use cases for Bitcoin has to do with payments for digital content (Grinberg, 2012), and at present the Bitcoin system is used only by early adopters and innovators within the population.

2. Data set
2.1 Bitcoin dataset
In order to analyze and compare the network of Bitcoin transactions and the related user sentiment on Twitter, two different datasets have been built using a series of application programming interfaces (APIs) available on the web. The first dataset to be extracted was the Bitcoin transaction network, which is publicly available from many free web services (such as Blockchain.info) or through a Bitcoin client that downloads and stores the whole transaction history, known as the blockchain (Moser, 2013). In order to reduce and manage its complexity, a subset of the blockchain composed of more than 2 million transactions from July 2013 to July 2017 has been collected. These transactions were imported through the Blockchain Data API service, which provides query functionality for Bitcoin block and transaction data, with requests regarding single blocks, single transactions and block heights.

Fig. 1: Example of a transaction with multiple inputs and outputs (www.blockchain.info).

Fig. 2: Word cloud related to the USA Twitter dataset.

These transactions were geolocalized using the ipinfo.io web service and stored in a NoSQL database. Geolocalization consisted in discovering the countries involved in each transaction and was carried out by scraping transaction IP addresses (Kaminsky, 2011). Each transaction block contains a set of transactions and is characterized by the following attributes: flow identifier, transaction hash, timestamp, origin country, destination country, sender, receiver and total amount (Ober, Katzenbeisser and Hamacher, 2013). Since each transaction can have multiple input and output addresses (Reid and Harrigan, 2012), transactions were decomposed into transaction flows. To attach geographical information to each transaction, the service provided by the ipinfo.io website was used: it offers a web interface where it is possible to retrieve the origin country of an IP address given as input.
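A minimal sketch of this geolocation step (the JSON endpoint shape is an assumption based on ipinfo.io's public interface; a real pipeline would also need an API token, rate limiting and caching):

```python
# Query ipinfo.io for the country of each relayed IP address and enrich
# the decomposed transaction flows with origin countries.
import json
import urllib.request

def country_of(ip):
    with urllib.request.urlopen(f"https://ipinfo.io/{ip}/json") as resp:
        return json.load(resp).get("country")    # e.g. "US", "DE"

flows = [{"tx_hash": "ab12...", "relayed_by_ip": "8.8.8.8", "amount": 0.5}]
for flow in flows:
    flow["origin_country"] = country_of(flow["relayed_by_ip"])
print(flows)
```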
Each transaction block contains a set of transactions and is characterized by the following attributes: flow identifier, transaction hash, timestamp, origin country, destination country, sender, receiver and total amount (Ober et al., 2013). Since each transaction can involve multiple input and output addresses (Reid and Harrigan, 2012), transactions have been decomposed into transaction flows. In order to attach geographical information to each transaction, the service provided by the ipinfo.io website has been used: it offers a web interface where it is possible to retrieve the origin country of an IP address provided as input.

2.2 Twitter dataset
A set of tweets from 10 different countries containing the word "Bitcoin" has been collected for analysis. Sentiment analysis has been conducted using the software Condor (MIT Center for Collective Intelligence), which automatically recognizes sentiment in English, Spanish, German, French, Italian and Portuguese and allows fetching tweets restricted to a given geolocation. It also allows computing the sentiment of posts by means of semantic analysis techniques. This dataset is partially misaligned with the first one for technical reasons.

3. Research methodology
The research combines social network analysis (SNA) and semantic analysis, with a particular focus on the relationship between the main indicators of these two fields calculated on the dataset.

3.1 Social Network Analysis
Using a social network analysis approach, several strategies are possible to examine the structure of the Bitcoin transaction network. For the analysis, some of the most common measures of centrality have been selected. Most of them were proposed by Freeman (1979) and are also analyzed in other social network analysis works (De Nooy et al., 2011). They are briefly described in the following subsections, and a sketch of their computation follows.

3.1.1 Degree centrality
This measure is based on the degree, i.e. the number of nodes directly attached to the node for which it is computed. In directed networks, two different measures can be calculated: indegree, the number of ties directed to the node, and outdegree, the number of ties that the node directs to others. The all-degree is the sum of the two. For a generic node $a$ in a directed graph,

$C_D(a) = k_a^{in} + k_a^{out}$,

where $k_a^{in}$ counts the number of incoming ties and $k_a^{out}$ the number of outgoing ties (in the weighted variant, ties are counted with their weights). A node with a high degree centrality is central in the network structure and tends to influence the others.

3.1.2 Closeness centrality
Closeness centrality is the inverse of the distance of a node from all the others in the graph, based on the shortest paths between each pair of nodes. For a node $a$ in a graph with $N$ nodes,

$C_C(a) = \frac{1}{\sum_{b \neq a} d(a, b)}$,

where $d(a, b)$ is the number of edges in the shortest path linking $a$ and $b$. Closeness centrality is normalized as $C'_C(a) = (N - 1)\, C_C(a)$. This measure can be considered a proxy of the speed with which a social actor can reach the others.

3.1.3 Betweenness centrality
This measure considers the shortest paths connecting every other pair of nodes and is higher when a node lies more frequently on them. For a network with $N$ nodes, the betweenness centrality of node $a$ is

$C_B(a) = \sum_{s \neq a \neq t} \frac{\sigma_{st}(a)}{\sigma_{st}}$,

where $\sigma_{st}$ is the number of shortest paths linking nodes $s$ and $t$, and $\sigma_{st}(a)$ is the number of those paths that pass through node $a$.
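The three centrality measures above can be computed directly with the networkx library; the toy edge list below is ours, standing in for the country-level Geographical Network of transaction flows described in the next paragraph:

```python
import networkx as nx

# Toy country-level transaction graph; edge weights are BTC amounts.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("US", "DE", 5.0), ("DE", "US", 3.0),
    ("US", "ES", 1.0), ("ES", "DE", 0.5),
])

# All-degree: incoming plus outgoing ties.
degree = {n: G.in_degree(n) + G.out_degree(n) for n in G}

# Closeness: inverse of total shortest-path distance, normalized by N-1.
closeness = nx.closeness_centrality(G)

# Betweenness: share of shortest paths passing through each node
# (networkx normalizes this by default).
betweenness = nx.betweenness_centrality(G)

for n in G:
    print(n, degree[n], round(closeness[n], 3), round(betweenness[n], 3))
```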
The social network indicators described above can be used to analyze the structure and dynamics of the geographical Bitcoin network. Once the target set of transactions was collected and enriched with geographical information, two directed graphs were modeled. In the first, identified as the Generic Network, each node represents a Bitcoin address owned by a user belonging to a specific country, and each link indicates a transaction of a certain amount (the weight of the link) between two different addresses. In the second, called the Geographical Network, each node represents a country and links stand for transactions, which can involve a single country or different countries. All the network metrics used in this study have been computed on the Geographical Network, obtained by merging Generic Network transactions on a geographical basis.

3.2 Semantic Analysis
Semantic analysis turns text into data for analysis, by applying natural language processing techniques and analytical methods (Hu et al., 2013). The communication indicators used here are briefly described in the following subsections.

3.2.1 Sentiment
This indicator describes whether messages are positive or not. Its value ranges from 0, for very negative messages, to 1, for very positive ones. It is computed as the average score over the whole text of a message.

3.2.2 Emotionality
This variable expresses the degree of emotion of an individual text fragment and is involved in the computation of sentiment.

3.2.3 Complexity
This indicator measures the rarity of words, i.e. the likelihood that a single word will occur in a text. It is higher when a text contains many rare words.

4. Results
The aim of this study is to find out whether the characteristics of Twitter communication related to Bitcoin reflect the structure of the geolocalized blockchain network. The analysis combines the most important social network centrality metrics (degree, closeness and betweenness centrality) with language indicators measuring the characteristics of the textual data used in Twitter communication. On the one hand, centrality metrics measure the importance, influence or power of a node in the network and are widely applied in social network analysis; on the other, language indicators identify whether communication about Bitcoin is positive or not, how emotional it is, and how complex its word usage is. Country rankings for each social network indicator were calculated and correlated with the national rankings of Twitter sentiment, complexity and emotionality computed on tweets containing the keyword "Bitcoin". Spearman's correlation, computed on a set of 10 countries with a high number of transactions and tweets, shows a significant correlation between the centrality measures computed on the geographical blockchain network and the language used on the Twitter microblogging platform (a sketch of this computation follows). In particular, the communication of people belonging to the most central countries in the Bitcoin network, e.g. Germany and the USA, is more complex and less emotional than that of peripheral country nodes.
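The reported rank correlation can be reproduced with scipy; the ten country scores below are placeholders we invented for illustration, not the study's data:

```python
from scipy.stats import spearmanr

# Hypothetical national scores for ten countries (placeholder values).
degree_centrality = [9.1, 8.4, 7.7, 6.9, 6.2, 5.5, 4.8, 4.1, 3.3, 2.6]
emotionality      = [0.21, 0.35, 0.28, 0.40, 0.33, 0.52, 0.47, 0.58, 0.61, 0.66]

# Spearman's rho works on the rankings induced by the scores.
rho, p_value = spearmanr(degree_centrality, emotionality)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")  # negative rho expected
```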
This is probably due to a deeper knowledge of the Bitcoin phenomenon in the most innovative countries, as shown by their word clouds. In fact, they tweet more and with quite technical language (e.g. they discuss technical aspects such as blockchain forks), while the others, for example Spain, appear frightened by the diffusion of the new cryptocurrency.

Fig. 3 - Spearman's correlations calculated on national rankings (N = 10, 2-tailed):
  Emotionality vs. Degree Centrality: rho = -0.638, p = .047
  Complexity vs. Degree Centrality: rho = -0.693, p = .026
  Both correlations are significant at the 0.05 level.

5. Conclusion and future works
The analysis highlights the geographical distribution of Bitcoin transactions and shows national differences in its adoption, revealing the major businesses and markets. In particular, the most central countries in the Bitcoin transaction network are characterized by a positive and fairly complex language, while peripheral countries use a more emotional language and the sentiment of their people about Bitcoin is fairly variable. This result leads to the interpretation that Twitter emotional sentiments mirror the Bitcoin transaction network, which could be an interesting signal for investors and entrepreneurs interested in the development of new payment systems based on Bitcoin technology and in the choice of the country in which to start up. The main findings of the study could be applied to the national regulation of crypto-payments as well as to the assessment of the economic and financial impact of cryptocurrencies; future work includes investigating the principal barriers to the mass adoption of the Bitcoin cryptocurrency.

References
De Nooy W., Mrvar A. and Batagelj V. (2011). Exploratory Social Network Analysis with Pajek (2nd ed.). Cambridge University Press.
Drainville D. (2012). An Analysis of the Bitcoin Electronic Cash System. University of Waterloo.
Evans D. (2014). Economic Aspects of Bitcoin and Other Decentralized Public-Ledger Currency Platforms. University of Chicago Coase-Sandor Institute for Law & Economics Research Paper No. 685.
Fleder M., Kester M. and Pillai S. (2015). Bitcoin Transaction Graph Analysis.
Freeman L.C. (1979). Centrality in social networks: conceptual clarification. Social Networks, 1, 215-239.
Grinberg R. (2012). Bitcoin: An Innovative Alternative Digital Currency. Hastings Science & Technology Law Journal.
Hu X., Tang L., Tang J. and Liu H. (2013). Exploiting social relations for sentiment analysis in microblogging. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM.
Kaminsky D. (2011). Black Ops of TCP/IP. Black Hat & Chaos Communication Camp.
Lischke M. and Fabian B. (2016). Analyzing the Bitcoin Network: The First Four Years. MDPI AG.
Moser M. (2013). Anonymity of Bitcoin Transactions. Münster Bitcoin Conference.
Nakamoto S. (2008). Bitcoin: A Peer-to-Peer Electronic Cash System.
Narayanan A., Bonneau J., Felten E., Miller A. and Goldfeder S. (2016). Bitcoin and Cryptocurrency Technologies: A Comprehensive Introduction. Princeton University Press.
Ober M., Katzenbeisser S. and Hamacher K. (2013). Structure and Anonymity of the Bitcoin Transaction Graph. Future Internet. MDPI.
Reid F. and Harrigan M. (2012). An Analysis of Anonymity in the Bitcoin System. Springer.
Ron D. and Shamir A. (2012). Quantitative Analysis of the Full Bitcoin Transaction Graph. IACR Cryptology ePrint Archive.
Content analysis versus the Reinert method: a comparative analysis of a bilingual corpus of Acadian and Loyalist speeches from N.B., Canada

Chantal Richard (1), Sylvia Kasparian (2)
(1) University of New Brunswick, Canada – chantal.richard@unb.ca
(2) Université de Moncton, New Brunswick, Canada – sylvia.kasparian@umoncton.ca

Abstract
In this paper we compare two methods of thematic analysis by applying them to the same corpus. Specifically, we compare the results of the classification of context units using the Reinert method in IRAMUTEQ with a content analysis (manually coded themes) analyzed using SPHINX in 2012. The bilingual corpus consists of two sub-corpora: speeches from the Conventions nationales acadiennes (in French) and centennial commemorative speeches by Loyalists (in English). Our goal is to determine whether the Reinert method of distribution by class confirms, contradicts, or enhances a traditional content or thematic analysis.

Keywords: content analysis, IRAMUTEQ, Reinert method, descending hierarchical classification.

1. Introduction
At JADT 2012, we presented a content analysis of the main themes of a bilingual corpus drawn from the database Vocabulaires identitaires. This database gathers speeches in French and in English dealing with the collective identity of two diasporic peoples of New Brunswick, Canada: the Acadians and the Loyalists. Since 2012, the database has grown from 74 to 1,525 texts. A more efficient procedure was therefore needed, and we chose the Reinert method of descending hierarchical classification. Before undertaking the analysis of the larger corpus, we wanted to compare the Reinert method with the results of the 2012 content analysis by applying it to the original corpus of 74 texts. This article shows to what extent the Reinert method enhances traditional content analysis for studying the lexical worlds, or discourse universes, of these two sub-corpora.

2. Content analysis and the Reinert method
Before proceeding to the analysis, we briefly define the two types of analysis and explain our methodological approach.

2.1 Content analysis
By content analysis we mean a "method of classification or codification into various categories of the elements of the document analyzed, in order to bring out its different characteristics and better understand its exact and precise meaning" (L'Écuyer, 1987: 50).
In other words, an exhaustive reading of the corpus makes it possible to choose units of classification and to generate a categorization in the form of tables to be processed statistically; the interpretation of the results of the statistical analysis then yields a description of the themes identified. This is the methodology used in our first study of the corpus, carried out with the SPHINX and HYPERBASE software packages in order to extract the keywords of the sub-corpora. Tables 1 and 2 below present the themes and some of the keywords that constitute them.

2.2 The Reinert method
The Reinert method of descending hierarchical classification, as adapted in the IRAMUTEQ software, was applied to our corpus following the procedure described by Ratinaud and Marchand (2012). This method identifies elementary context units according to the internal organization of the lemmatized text, which is then distributed into classes by successive bipartitions. As for the content analysis, we analyzed the sub-corpora separately by language. The resulting classifications were contrasted with the earlier results obtained through content analysis.

3. Corpus
The 34 speeches of the Conventions nationales acadiennes, delivered from 1881 to 1890, constitute the French-language Acadian corpus, which comprises 56,368 words. At that time, the Acadians were undertaking a social reorganization through the choice of national symbols. The Loyalists of New Brunswick, for their part, were a group of royalist Americans who fled the country after Independence and settled in New Brunswick, where they celebrated their centennial in 1883. The 40 speeches of the Loyalist centennial, published between 1882 and 1887, form the English-language corpus of 69,610 words.

4. Analysis
The contrastive analysis of the results obtained by the two methods is presented by sub-corpus, showing first the thematic table with some of the keywords generated by the content analysis, followed by the dendrogram produced by IRAMUTEQ.

4.1 Corpus of the Conventions nationales acadiennes (French)

Table 1: Themes and keywords extracted from the Acadian sub-corpus by content analysis (keywords are left in the original French):
- Unifying event (symbols): fête, convention, drapeau, adopter, distinct, monument, assemblée, tricolore, légitime, étoile...
- Progress and future: avancement, intérêts, droits, développement, sauvegarde, surmonter, triomphant, amélioration, combattre...
- References to the past: colonie, histoire, perdu, ancêtres, origine, persécutés, misère, pères, mort, larmes, souvenir, infortune, ruine...
- (Inter)national relations: compatriotes, anglais, union, sympathie, ennemi, confédération, américains, fusion, puissance, Louisiane, préjugés...
- Characteristics associated with the people: grand, bonheur, malheur, honneur, noble, digne, devoir, petit, courage, difficultés, persévérance, faible, pauvre, humble...
- Race, ethnicity and culture: peuple, nation, race, patriotisme, sang, Acadie, patrie, âmes, usages, traits...
- Religion: saint, religieuses, frères, foi, patron, Dieu, Marie, Église, Assomption, chrétien...

The distribution into classes by the Reinert method in IRAMUTEQ first separates class 6 from the other classes. This class is represented by a lexicon centred on the choice of an Acadian national holiday, the first objective of this great patriotic gathering. A second partition separates classes 3 and 4 from classes 2, 1 and 5.
Class 4 is characterized by a lexicon of values associated with religion, while class 3 illustrates values associated with a traditional way of life attached to the past. The link between the two reveals that, for the Acadians of the time, the traditional way of life was strongly tied to the Catholic religion. If classes 3 and 4 refer to the past, classes 2, 1 and 5 instead suggest a gaze turned toward the future, notably in the domains of material and intellectual progress (class 2), the francophone press (1) and education (5).

Figure 1: Dendrogram CHD1 produced by IRAMUTEQ: descending hierarchical classification by the Reinert method for the Acadian corpus.

As for the comparison with the themes identified by traditional content analysis (Table 1), several parallels can be drawn. Class 6 shares a substantial number of forms with the theme "Unifying event" of the content analysis, notably the keywords common to both methodologies: fête, adopter, drapeau, tricolore and distinct. Classes 3 and 4 can likewise be related to the themes "Religion" and "References to the past" of Table 1; these two classes also contain some words listed under the theme "Characteristics associated with the people". Classes 2, 1 and 5 share part of their lexicon with the theme "Progress and future" of Table 1.

What, then, does the Reinert method contribute to our analysis? In this case, it is pertinent to ask what it does not bring out: notably, the content analysis categories "National and international relations" and "Race, ethnicity and culture" (although some forms such as "sang" and "Acadie" are found in classes 3 and 4). These two themes are the closest to the researchers' own interests, which suggests probable human interference. Moreover, the order of the partitions proposed by IRAMUTEQ, which separates class 6 and distributes the five other classes between past and future, is very revealing of a paradoxical discourse juxtaposing social progress with the preservation of an identity anchored in the past, something which had not emerged from the traditional thematic content analysis.

4.2 Corpus of the centennial commemorations of the N.B. Loyalists

Table 2: Themes and keywords extracted from the Loyalist sub-corpus by content analysis (HYPERBASE and SPHINX):
- Unifying event (commemoration): anniversary, commemorate, memorial, Parrtown, Victoria, 1883, 18th, Institute, Regiment...
- Progress and future: advancement, building, cities, commerce, development, establishment, factories, harbour, hotels, industrial...
- References to the past: war, 1783, forefathers, abandoned, bitterness, choice, confiscated, defence, hardship, heroes, duty, Israelites, rugged, struggle...
- National and international relations: alliance, annexation, commonwealth, constitution, Independence, monarchy, government, King, Mother, protection...
- Characteristics associated with the people: active, brave, brotherhood, conservative, determination, intelligent, deserving, strength...
- Race, ethnicity and culture: civil, civilized, humanity, race, superior, anglo-saxon, yanks, elevate, blood...
- Religion: God, bibles, bless, Christian, churches, devotion, faith, morality, temperance...

Seven classes are proposed in the dendrogram produced by IRAMUTEQ for the English-language Loyalist corpus. A first partition separates classes 3 and 2 from all the other classes.
Class 3 is composed of military references to figures, places and dates, and class 2 gathers a lexicon designating the associative structures responsible for preserving memory; both classes are characterized by a large number of proper nouns. Class 7 is next distinguished by its legal terms attached to the British Empire and its colonies. Class 6, for its part, is made up of a lexicon centred on natural resources and material or commercial progress, suggesting a vision of the domination of nature by human beings. Class 1 deals with the moral and religious values prized by the Loyalists. Finally, classes 4 and 5 are very close, and designate respectively the circumstances of the Loyalists' departure from the United States out of loyalty to the British crown, and the celebration of their success as founders of a new province (New Brunswick) one hundred years later.

Figure 2: Dendrogram CHD1 produced by IRAMUTEQ: descending hierarchical classification by the Reinert method for the Loyalist corpus.

The classes thus obtained can be compared with the themes of Table 2. For example, class 1 (moral and religious values) shares its lexicon with the themes "Religion" and "Characteristics associated with the people". Class 4 (circumstances of departure) is very similar to the theme "References to the past", and class 5 (celebration of success) can also be set in parallel with "Unifying event: commemoration" as well as with the theme "Race, ethnicity and culture" extracted by the content analysis. Classes 2 (associative structures) and 3 (military references) can be related to the theme designated in Table 2 as "Unifying event: commemoration". Class 7 (British Empire and its colonies) comes close to the theme "National and international relations" without being identical to it, and class 6 (natural resources and progress) resembles the theme "Progress and future", but with certain distinctions, notably the inclusion of words referring to nature within the theme of material progress.

The originality of the distribution into classes by IRAMUTEQ lies partly in the juxtaposition of past and present in classes 3 (military references to the past) and 2 (associations preserving memory through commemorative activities), as well as in classes 5 (celebration of success) and 4 (circumstances of departure), which are, in a way, their echo. Moreover, the categories established in the content analysis proved incomplete, and the lexicon is reorganized by the descending hierarchical classification. According to the partitions of the Reinert method, the legal terms (parliament, act, law, etc.) are grouped with the terms referring to the British crown and its colonies, whereas they had not been identified in our 2012 study. Likewise, the words designating the natural world (forest, ocean, tree, etc.) sit alongside the lexicon of material and commercial progress in the dendrogram, which was not intuitive on human reading but highly revealing. It is precisely in these apparent contradictions that the most nuanced, and therefore the most judicious, interpretations of a textual corpus appear.
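To make the successive-bipartition idea of section 2.2 concrete, here is a deliberately simplified Python sketch, not IRAMUTEQ's actual implementation (whose segmentation, lemmatization and χ²-based criterion are richer): context units are split by the sign of their coordinate on the first factor of a correspondence analysis of the units × forms table, and the split can be recursed on each class to obtain a dendrogram. All names are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def ca_first_axis(X):
    # First correspondence-analysis axis of a counts matrix X (rows = units).
    P = X / X.sum()
    r = P.sum(axis=1, keepdims=True)           # row masses (n x 1)
    c = P.sum(axis=0, keepdims=True)           # column masses (1 x m)
    S = (P - r @ c) / np.sqrt(r @ c)           # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, 0] * s[0]) / np.sqrt(r[:, 0])  # row principal coordinates

def bipartition(units):
    # One Reinert-style split: units whose first-axis signs differ
    # belong to different "lexical worlds"; recurse for a full dendrogram.
    X = CountVectorizer(binary=True).fit_transform(units).toarray().astype(float)
    coord = ca_first_axis(X)
    left = [u for u, z in zip(units, coord) if z < 0]
    right = [u for u, z in zip(units, coord) if z >= 0]
    return left, right

units = ["la fête le drapeau", "le drapeau tricolore",
         "le progrès et l'avenir", "l'éducation et le progrès"]
print(bipartition(units))
```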
5. Conclusion
Beyond the ability to process larger corpora in several languages, what then are the advantages of applying the Reinert method to our bilingual corpus? In short, the distribution into classes led us to revise and nuance the results of the original content analysis. While the partitions sometimes resemble the themes identified in 2012, the Reinert method has the advantage of revealing the links between classes through its gradual partitions, regardless of language, which allowed us to observe a temporal past/future distribution in the Acadian sub-corpus and a past/present distribution in the Loyalist sub-corpus. Moreover, the context units do not rest on preconceptions or internal dictionaries, but on a distribution of lexical worlds that respects the internal organization of the corpora, which yielded a reorganization of the lexicon and the inclusion of words that did not figure in the original analysis. It is precisely the inclusion of this apparently paradoxical lexicon that leads to a more objective and finer analysis. For example, the proximity of nature and material progress in the Loyalist speeches suggests a vision of the domination of nature by human beings, while the Acadian speeches aim at social, economic and commercial progress while seeking to preserve an identity anchored in the past. Our observations on the patriotic speeches of the Loyalists and the Acadians at the end of the 19th century are thus considerably enriched by the Reinert method as integrated in the IRAMUTEQ software.

Note: This article benefited from an Insight grant of the Social Sciences and Humanities Research Council of Canada. We also thank Marc-André Bouchard for his technical assistance.

References
Baulac Y. and Moscarola J. SPHINX Solutions d'enquêtes et d'analyses de données. www.lesphinx-developpement.fr.
Brunet É. HYPERBASE. Laboratoire UMR 6039 Bases Corpus Langage, Université de Nice-Sophia Antipolis. http://ancilla.unice.fr/~brunet/pub/logiciels.html.
L'Écuyer R. (1987). L'analyse de contenu : notion et étapes. In Deslauriers, J.P., editor, Les méthodes de la recherche qualitative. Presses de l'Université du Québec, pp. 49-64.
Ratinaud P. (2009). IRAMUTEQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires. http://www.iramuteq.org.
Ratinaud P. and Marchand P. (2012). Application de la méthode ALCESTE aux « gros » corpus et stabilité des « mondes lexicaux » : analyse du « CableGate » avec IRAMUTEQ. In Dister A., Longrée D., Purnelle G., editors, Actes/Proceedings of JADT 2012 (11es Journées internationales d'Analyse statistique des Données Textuelles), pp. 845-857.
Richard C. and Kasparian S. (2012). Vocabulaire de l'identité nationaliste : analyse lexicale et morphosyntaxique des discours acadiens et loyalistes entre 1881 et 1890 au N.-B., Canada. In Dister A., Longrée D., Purnelle G., editors, Actes/Proceedings of JADT 2012 (11es Journées internationales d'Analyse statistique des Données Textuelles), pp. 845-857.
Richard C., Bourque D., Brown A., Conrad M., Davies G., Francis C., Huskins B., Kasparian S., Marquis G., Mullally S. Database: Vocabulaires identitaires/Vocabularies of Identity. https://voi.lib.unb.ca

Bridge over the ocean: Histories of social psychology in Europe and North America.
An analysis of chronological corpora

Valentina Rizzoli, Arjuna Tuzzi
University of Padova – valentina.rizzoli@phd.unipd.it; arjuna.tuzzi@unipd.it

Note: This study is a new development of an interdisciplinary research project funded by the University of Padova, fund CPDA145940 (2014), "Tracing the History of Words. A Portrait of a Discipline Through Analyses of Keyword Counts in Large Corpora of Scientific Literature" (P.I. Arjuna Tuzzi).

Abstract
Since the European Association of Social Psychology (EASP, initially called the European Association of Experimental Social Psychology) was established in 1966, what was then considered "European" social psychology has been working to affirm its own identity by presenting a distinctive brand to the rest of the world in general and to North America in particular. This study compares European and U.S. social psychology through the analysis of the papers published by two of the main journals in the field: the Journal of Personality and Social Psychology and the European Journal of Social Psychology. All the abstracts of the two journals' papers (from the first publication to the last one in 2016) have been collected. By means of a (lexical) correspondence analysis (SPAD software), the existence of a latent temporal pattern in keyword occurrences was explored. Furthermore, in order to detect, retrieve and compare the main topics the journals dealt with over time, an analysis based on Reinert's method was conducted (IRaMuTeQ and R software). The results show that, even if some typical features distinguish "European" from "American" social psychology, some publication trends seem to converge. The results are also discussed with reference to the contribution of these methods to studying the history (or histories) of a discipline.

Keywords: diachronic corpora, chronological textual data, text clustering, correspondence analysis, Reinert's method, history of social psychology

1. Introduction
It is widely held that what is called "modern social psychology" came from Europe with the migration of scholars during the Second World War, and that it started to develop mainly in the United States. Moscovici and Markova (2006) referred to an American indigenous tradition competing with a newer Euro-American tradition; they did not intend to argue that a socio-psychological tradition was born in Europe and brought to America, but rather that a genuinely American tradition began with the work of the immigrant Lewin and his new students. While social psychology was developing prosperously in the U.S., there were scholars working on social psychology in Europe, but there was no European school (Moscovici, 1999). The establishment of the European Association of (Experimental) Social Psychology (EASP, initially EAESP) in 1966 was fundamental to the development of a "European" social psychology. EASP presented a distinctive brand of the discipline to the rest of the world in general and to North America in particular, by providing a voice for a more "social" social psychology (http://www.easp.eu/about/). To consider an "American" and a "European" social psychology as two completely separate and counterposed entities would be wrong, since there was a clear mutual influence between them. Moreover, the first EASP meeting, which fostered the birth of EAESP, was an initiative of U.S. scholars (cf. Moscovici and Markova, 2006). By "American" social psychology we usually refer to the indigenous U.S.
tradition, explicated by Floyd Allport's work in 1924, which considers social psychology as part of general psychology and pays more attention to the individual. "European" social psychology usually refers to the Euro-American tradition promoted by the EASP, which regards social psychology as strictly connected to neighbouring disciplines such as sociology and anthropology and accords a greater role to social and cultural aspects (http://www.easp.eu/about/). This contribution consists of an empirical analysis that starts from the study of scientific production. Over time, scientific journals shape the history of a discipline, as they include the objects, fields of application and methods that contribute to delineating its trajectory. Thus, an in-depth understanding of the past and the temporal evolution of a discipline can be achieved by analysing the scientific debate inside relevant scientific journals (Trevisani and Tuzzi, 2015; 2018). We have taken into account the European Journal of Social Psychology (EJSP) and the Journal of Personality and Social Psychology (JPSP). The former is an official publication of the EASP and represents the association's voice worldwide. The JPSP belongs to the American Psychological Association, which represents the most widespread community of psychologists in the United States and beyond: it is an important scientific reference that provides guidelines in Europe as well. In terms of visibility and prestige, the JPSP is considered one of the most relevant journals in the field. The main aim is to observe and compare the trajectory of the two journals' publications and to reflect on what contribution these methods can provide for the study of the history of a discipline. We particularly intend: 1) to portray the temporal pattern of the main concepts debated in the past and covered today by EJSP and JPSP; 2) to detect, retrieve and compare the main topics these journals dealt with over time.

2. Methods
All the available abstracts of the two journals have been included in two corpora, collected from different acknowledged sources and checked against the journals' websites. For EJSP, a total of 2,559 items was collected, covering 46 years, from the very first issue in 1971 (Volume 1, Issue 1) to the latest of 2016 (Volume 46, Issue 7). For JPSP, 9,568 items were downloaded, covering 52 years, from 1965 (Volume 1, Issue 1) to 2016 (Volume 111, Issue 6). Items without any abstract were deleted (e.g., editorials, mastheads, errata, acknowledgements). The EJSP corpus is composed of 2,195 abstracts, the JPSP corpus of 9,536 abstracts. To improve the homogeneity of the corpora we privileged British spelling in EJSP (e.g., we replaced "analyzed" with "analysed") and American spelling in JPSP. The corpora were normalised only by replacing uppercase with lowercase letters. The lexicometric measures showed a good level of redundancy, which is fundamental when working with frequencies (Lebart, Salem and Berry, 1998; Tuzzi, 2003; Bolasco, 2013). Multi-words (MWs) with frequencies ≥ 5 for the EJSP corpus and ≥ 10 for the JPSP corpus (which is considerably larger) were recognised, selected and treated as textual units. We resorted to a procedure for automatic information retrieval that recognises repeated informative sequences, e.g. an adjective followed by a noun, as in "social psychology", which produce a MW (Pavone, 2010); a toy version of this frequency-based detection is sketched below.
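The following minimal Python sketch (ours, not Pavone's actual procedure, which is more sophisticated) illustrates the core idea of frequency-thresholded multi-word detection: collect adjacent word pairs across abstracts and keep those at or above the journal-specific threshold.

```python
from collections import Counter

def multiwords(abstracts, min_freq):
    # Count adjacent word pairs across all abstracts and keep frequent ones.
    pairs = Counter()
    for text in abstracts:
        tokens = text.lower().split()
        pairs.update(zip(tokens, tokens[1:]))
    return {" ".join(p): f for p, f in pairs.items() if f >= min_freq}

abstracts = ["social psychology of groups", "experimental social psychology"]
print(multiwords(abstracts, min_freq=2))  # {'social psychology': 2}
```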
Two encyclopaedias of social psychology (Manstead et al., 1995; Baumeister and Vohs, 2007) and the keyword indexes available during the downloading process provided further MWs. In order to depict the structure of the association between years and words and to establish the existence of a chronological dimension, a (lexical) correspondence analysis (CA) was conducted on two matrices: 5,784 words by 46 years (rows by columns) for the EJSP corpus and 8,349 by 52 for JPSP. To detect a set of relevant topics included in the journals and observe their temporal development, an analysis based on Reinert's method (1983) was conducted. Topics can be defined as "lexical worlds" (Reinert, 1993), i.e. groups of words referring to a class of meaning. The result, obtained through a hierarchical descending classification, is a dendrogram that groups units into classes mirroring a similar lexical context. Textual data were processed with the dedicated software Taltac2, and statistical analyses were conducted with the SPAD, IRaMuTeQ and R software packages. (A sketch of how the words-by-years matrix underlying the CA can be assembled follows.)
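Assembling the words-by-years contingency tables described above is straightforward; the sketch below (function and variable names are ours) builds such a matrix from a list of (year, abstract text) pairs, on which a CA can then be run with any standard implementation:

```python
from collections import Counter, defaultdict

def words_by_years(records):
    # records: list of (year, abstract_text) pairs.
    counts = defaultdict(Counter)          # word -> {year: frequency}
    for year, text in records:
        for token in text.lower().split():
            counts[token][year] += 1
    years = sorted({y for y, _ in records})
    words = sorted(counts)
    # Dense contingency table: one row per word, one column per year.
    matrix = [[counts[w][y] for y in years] for w in words]
    return words, years, matrix

records = [(1971, "attribution in experiments"), (2012, "migration and identity")]
words, years, matrix = words_by_years(records)
```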
3. Results
By means of CA we can observe a clear-cut temporal dimension in both corpora (Figure 1). The keywords that contributed most to the factorial solution show which concepts typify each time span.

Figure 1 - First factorial plane of the correspondence analysis of EJSP (left side) and JPSP (right side). Projection of years.

In the EJSP (Figure 1, left side), the first period (1971-1990) is strongly characterised by words that refer to experimental design. This is the period mainly concerned with the study of aggression, risk taking, dissonance, and attribution theory. The keywords of the subsequent period (the Nineties) seem to be related to social change, characterised by the study of social influence, categorization, and words referring to Moscovici's and Tajfel's theories, which marked the European production: social representations, minority influence and the minimal group paradigm. In the following years (2000s) attention turned to the self, ingroup/outgroup relations and social cognition, with the study of stereotypes, emotions, motivation, agency/communion, and so on. In recent years (2011-2016), mainly social issues (e.g., gender, migration, environment, religion) and everyday life concerns are highlighted. As regards the JPSP (Figure 1, right side), in the first decade considered (1965-1976) the main contribution is given by words such as reinforcement, verbal reinforcement, conditioning, and so on, which together refer to behaviourism. At the same time, we can observe the occurrence of words pertaining to game theories and conflict/cooperation, as well as to aggression and dissonance theory. Physiological measurements (e.g., heart rate) and experiments (experimental) are also visible. The second period runs from the late Seventies to the late Eighties; its distinctive words are masculinity/femininity and other terms that evoke motivational theories. Moreover, the presence of words related to personality is evident, and it becomes stronger in the following period, the Nineties, in which mood, personality, individual differences, memory and the self represent the main contribution; issues about gender and women are also noteworthy. The last period starts from the 2000s and shows many references to explicit/implicit and intimate relationships. Moreover, specific words about positive psychology (life satisfaction, goal pursuit, and so on) and culture (cultural, culture) are relevant.

The analysis conducted by means of Reinert's method reveals the presence of nine different lexical worlds in EJSP (79.64% of the abstracts were classified; Figure 2).

Figure 2 - EJSP classes and their distributions over years (unsupervised clustering method).

Following the order of the classes from the bottom to the top of Figure 2, a brief outline of their contents is provided below. Class 1 (red) concerns attribution and methodological issues (e.g., method, statistical, model). Class 9 (fuchsia) contains words related to impression formation, categorisation and stereotypes. Both of these classes show decreasing trends without disappearing. Class 6 (light blue) mainly includes words related to gender studies and implicit measures (e.g., prime, IAT). Class 5 (water blue) concerns moods and regulatory focus theory. These two classes show increasing trends. Class 8 (purple) concerns studies on aggression (mainly comparing male/female subjects involved in an experiment). This class was initially hegemonic in the field and then disappeared over time. Class 7 (blue) includes game theories and studies on cooperation and competition, and shows a decreasing trend. Class 2 (orange) concerns politics and culture (mainly cross-cultural studies) and is an ever-present topic, as is Class 4 (green), which concerns social identity theory and ingroup/outgroup dynamics. Class 3, which concerns the applications of that theory (e.g., migration), shows a clearly increasing trend. As regards JPSP, the analysis shows the presence of eleven clusters (76.08% of the abstracts were classified; Figure 3). Following the order of the classes from the bottom to the top of Figure 3: Class 7 (light blue) concerns consensus formation and attribution, and seems to be an ever-present topic. Class 6 (water blue) contains processes regarding memory, stereotypes and categorisation, and is particularly recurrent in the Nineties and 2000s. Class 3 (grey) contains studies on the self, emotion and motivation and shows a clearly increasing trend, becoming one of the most relevant topics nowadays. Classes 11 (fuchsia), 10 (lilac) and 1 (red) concern, respectively, studies on aggression and physiological measurements, on dissonance and opinion change, and on males and females involved in experimental studies. They were predominant in the first years considered and then disappeared. Class 9 (purple) concerns culture (mainly comparing Western and Eastern ones) and politics; it shows an increasing trend, although it is not among the main topics nowadays. Class 2 (orange) includes words regarding measurements and their validity (e.g., scale, reliability, test-retest) and shows a stable trend. Class 8 (blue) contains words related to interpersonal differences (based on gender, or studied through twin studies); it seems to remain constant, even if with a slightly decreasing trend. Class 5 (water green) is represented by words concerning health (mental and physical) and how to cope with related problems. Class 4 (green) concerns romantic and couple relationships. Both of these classes show increasing trends.

Figure 3 - JPSP classes and their distributions over years (unsupervised clustering method).
4. Discussion and conclusions
The aim of the present study is to compare American and European social psychology, offering food for thought on the contribution of the methods used to studying the histories of a discipline. Thanks to these preliminary results, we succeeded in highlighting the history of a discipline from the particular point of view of its actual scientific production. In the first years considered, some similarities between the contents tackled in the two journals can be noticed (e.g., dissonance theory and aggression). The main differentiation that emerged concerns the stronger attention to the individual and personality in JPSP, on the one hand, and the different impact of Tajfel's and Moscovici's contributions on the psychology of groups and of Moscovici's work on social representations, on the other. This was particularly evident in the '80s and '90s. The predominant approach of social cognition seems to be a common feature, as do methods and research designs that mainly refer to the experimental method, and topics concerning cross-cultural studies and politics. As regards the topics identified, some common publication trajectories were highlighted: for example, Class 8 in EJSP and Class 11 in JPSP, concerning studies on aggression, which were predominant in the first decades and later declined; Class 1 in EJSP and Class 7 in JPSP, concerning studies on attribution; and Class 2 in EJSP and Class 9 in JPSP, related to culture and politics. Similar contents but different trajectories are shown by Class 9 in EJSP and Class 6 in JPSP. The main difference between the journals is observed in the JPSP classes concerning personality, health, coping, and romantic and couple relationships (8, 5, 4), and the EJSP classes concerning ingroup/outgroup processes, intergroup contact and applied concerns (4, 3). It is worth recalling the core of the difference between American and European social psychology: the attention to the individual in the American tradition and to the social in the European one. That difference manifests itself as a greater attention to social issues in EJSP and to individual-related studies (e.g., interpersonal relations, personality) in JPSP. Two histories of publications in social psychology have been traced, one North American and the other European. Their differentiation is historically well known in the community, but empirical works contributing to that debate are fewer. This is an example of the contribution that the quantitative analysis of textual data can provide to the study of the history of a discipline, also known as digital history.

References
Allport, F. (1924). Social Psychology. Boston, MA: Houghton Mifflin.
Baumeister, R. F., & Vohs, K. D. (2007). Encyclopedia of Social Psychology. Thousand Oaks, CA: Sage.
Lebart, L., Salem, A., & Berry, L. (1998). Exploring Textual Data. Dordrecht: Springer. doi:10.1007/978-94-017-1525-6
Manstead, A. S., Hewstone, M. E., Fiske, S. T., Hogg, M. A., Reis, H. T., & Semin, G. R. (1995). The Blackwell Encyclopedia of Social Psychology. Blackwell Reference/Blackwell Publishers.
Moscovici, S. (1999). Ringraziamento. In Laurea Honoris Causa in Psicologia a Serge Moscovici. Università degli studi di Roma "La Sapienza": Centro Stampa d'Ateneo.
Moscovici, S., & Markova, I. (2006). The Making of Modern Social Psychology. Cambridge: Polity.
Pavone, P. (2010). Sintagmazione del testo: una scelta per disambiguare la terminologia e ridurre le variabili di un'analisi del contenuto di un corpus. In S. Bolasco, I. Chiari, & L. Giuliano (Eds.), Statistical Analysis of Textual Data: Proceedings of the 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, pp. 131-140. LED.
Ratinaud, P. (2014). Visualisation chronologique des analyses ALCESTE : application à Twitter avec l'exemple du hashtag #mariagepourtous. In Actes des 12es Journées internationales d'Analyse statistique des Données Textuelles. Paris Sorbonne Nouvelle-Inalco.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2), 187-198.
Reinert, M. (1993). Les « mondes lexicaux » et leur « logique » à travers l'analyse statistique d'un corpus de récits de cauchemars. Langage & Société, 66, 5-39.
Trevisani, M., & Tuzzi, A. (2015). A portrait of JASA: The history of Statistics through analysis of keyword counts in an early scientific journal. Quality and Quantity, 49, 1287-1304.
Trevisani, M., & Tuzzi, A. (2018). Learning the evolution of disciplines from scientific literature: A functional clustering approach to normalized keyword count trajectories. Knowledge-Based Systems, 146, 129-141.

"Frequent itemsets" as descriptors of textual documents

Louis Rompré (1), Ismaïl Biskri (2)
(1) Université du Québec à Trois-Rivières – rompre.louis@courrier.uqam.ca
(2) Université du Québec à Trois-Rivières – ismail.biskri@uqtr.ca

Abstract
Automated classification is one of the preferred approaches applied to the problem of organizing information. The classification process is based on the identification and evaluation of descriptors that characterize the information. It is usually necessary to discover them through an analysis of the raw data, and words are generally the units considered in this analysis. In this paper, we propose to use frequent itemsets as descriptors. We present how they can be identified and used to define a level of similarity between texts. The experiments conducted demonstrate the potential of the proposed approach for defining similarity between texts and linking news items broadcast on the web.

Keywords: Classification, Frequent itemset, Descriptor, Document, Text.

1. Introduction
The digitalization of documents has facilitated the diffusion of information. As soon as an event occurs, multiple articles are written and spread across the various digital platforms. Many textual documents published on the web consist of only a few hundred words, and a rich description can be obtained only by consulting several documents. Different documents may address the same subject, and each of them is likely to contain complementary information.
However, the quantity of available data and its lack of structure limit our capacity to capture this information, hence the need for tools facilitating access to it. Automatic classification is one of the strategies applied to the problem of organizing information. A classification process applied to textual documents, whether automated or not, organizes the documents so that those sharing similarities are grouped together. The resulting organization can be used to guide, for example, information retrieval, knowledge extraction, summarization support, and so on. Many automatic classifiers have been published. Comparing these classifiers to determine their performance is a complex and, above all, subjective task: a classifier may perform well on one particular set of documents and generate noisy classes on another. The pertinence of a classification is judged by the homogeneity of the resulting classes, but this criterion is relative, since a class is examined in the light of the examiner's research objectives and knowledge of the domain. The desired quality of an automated classification system is its ability to target the relevant information within the documents and to determine how this information can be used to establish a level of similarity between them. Numerical classification rests on the identification and evaluation of descriptors that make it possible to distinguish one class from another. Choosing one descriptor over another amounts to taking a position on the nature of the generated results: it influences the behaviour of the classifier, because the presence or absence of a descriptor is a clue for identifying the class to which a document belongs. For textual classification, the word is often used as the discriminating descriptor (McCallum and Nigam, 1998). When several words appear at comparable frequencies in two documents, the documents are considered similar. However, documents commonly share a large number of words even when they deal with different subjects. The mere presence of these words therefore carries little information, and its usefulness for establishing the level of similarity between documents is limited. Nevertheless, the relations these words maintain with others can bring to light particularities specific to certain documents, and these relations can be used to establish the level of similarity between documents.

2. Association rules
The recent development of association rules stems from Agrawal's work on knowledge extraction from transactional data (Agrawal et al., 1993). Agrawal proposed to extract relations between items that co-occur in commercial transactions: for example, customers who buy items x and y also buy item z. Since then, the approach has been transposed to other domains, association rules being applicable wherever the concept of transaction can be defined. Let $T = \{t_1, t_2, \ldots, t_n\}$ be a set of transactions; the elements that compose the transactions are called items. An item is a piece of data whose nature depends on the domain at hand.
For example, items can correspond to descriptors extracted from a piece of music (Rompré et al., 2017), to descriptors extracted from an image (Alghamdi et al., 2014), or simply to words extracted from a text (Zaïane and Antonie, 2002). Thus, a transaction can be defined simply as a subset of descriptors. Let $I = \{i_1, i_2, \ldots, i_d\}$ be a set of d distinct items; every subset that can be generated from these items is called an itemset. For a set I of size d, the number of possible itemsets is $2^d$ (Tan et al., 2002), i.e. exponential in the size of I. Since the objective of association rule extraction is to discover hidden relations, there is no prior clue as to which items to consider, so the search space is the set of all possible itemsets. Even if it is theoretically possible to create $2^d$ itemsets from a set of size d, in practice many combinations appear rarely or simply never in the transactions, and can therefore be ignored. Support is the measure used to identify the itemsets to ignore. The support of an itemset X is the percentage of the transactions of T that contain X. It is written $s(X)$ and given by equation 3.1, where n is the total number of transactions in T and $\sigma(X)$ is the raw support. The raw support of an itemset is the number of transactions of T that contain X, as given by equation 3.2 (a direct transcription in code is given below):

$s(X) = \frac{\sigma(X)}{n}$  (3.1)

$\sigma(X) = |\{t_j \in T \mid X \subseteq t_j\}|$  (3.2)

An itemset is considered frequent when its support is greater than or equal to a predetermined threshold. Let X and Y be two frequent itemsets such that $X \cap Y = \emptyset$; an association rule, written $X \rightarrow Y$, expresses a co-occurrence relation between these itemsets. By convention, the first term is called the antecedent and the second the consequent. An association rule is judged to be of quality according to a previously fixed interest measure m and minimal threshold: a rule $X \rightarrow Y$ is of quality if $m(X \rightarrow Y)$ reaches the threshold. The quantity of rules generated, their pertinence and their usefulness depend strongly on the chosen measures and minimal thresholds. The evaluation of interest measures for association rules has been the subject of several studies (Le Bras et al., 2010; Geng and Hamilton, 2006; Tan et al., 2002). Although several variants exist, the extraction of association rules is generally performed with the Apriori algorithm (Agrawal and Srikant, 1994) or FP-Growth (Han et al., 2000); other algorithms are presented in Fournier-Viger et al. (2017). The two main difficulties in extracting association rules are memory management and the computational effort required to search for frequent itemsets; controlling the number of items to consider remains the best way to address them. For two decades, a number of studies have applied association rules to classification (Liu et al., 1998; Zaïane and Antonie, 2002; Bahri and Lallich, 2010). The classifiers resulting from this work produce results that rival those obtained with other approaches such as decision trees (Mittal et al., 2017). The main advantage of association-rule-based classifiers is that the knowledge they exploit to guide the classification process can easily be interpreted.
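The following is a direct Python transcription of equations (3.1) and (3.2), assuming transactions are represented as Python sets (function names and the toy transactions are ours):

```python
def raw_support(itemset, transactions):
    # sigma(X): number of transactions containing every item of X (eq. 3.2).
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # s(X) = sigma(X) / n (eq. 3.1).
    return raw_support(itemset, transactions) / len(transactions)

T = [{"nucléaire", "pyongyang"}, {"nucléaire", "corée"}, {"nucléaire"}]
print(support({"nucléaire"}, T))               # 1.0
print(support({"nucléaire", "pyongyang"}, T))  # ~0.33
```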
Thus, a classifier that exploits association rules can be used to identify relevant descriptors. The approaches proposed in the literature generally involve rules of the form $X \rightarrow c_k$, where X is a set of descriptors and $c_k$ a similarity class. Documents are treated as transactions, while descriptors (keywords, word frequencies, etc.) and classes are treated as items. Given a set of descriptors $D = \{d_1, \ldots, d_m\}$ and a set of labels $C = \{c_1, \ldots, c_k\}$ representing the classes, a set of documents can be represented as transactions combining descriptors and class labels. This representation implies that the similarity classes to which the documents belong are known beforehand: a training set is built and used to train the classifier, and the association rules extracted during the training phase are then used to predict the class of new documents. This process generally requires considerable effort, and the results depend on the set used to train the classifier.

3. Methodology
Like association-rule-based classifiers, our approach exploits frequent itemsets to describe documents; however, it requires no training phase. Frequent itemsets are extracted from each document and compared, and the degree of similarity between two documents is a function of the number of frequent itemsets they share. The hypothesis behind this approach is that when words frequently co-occur within the sentences of a text, those words are representative of that text. Thus, by considering a few frequent itemsets, it is possible to identify the specific themes addressed in the documents. The proposed approach comprises four steps (sketched in code below).

The first step segments the documents to prepare them for the extraction of frequent itemsets. Documents are treated as sets of transactions in which the sentences are the transactions and the words the items. The number of distinct words liable to appear in a set of textual documents is theoretically of the order of the vocabulary size of the language in which they are written; the Office québécois de la langue française estimates the French vocabulary at more than 500,000 words. Since 500,000 words could in principle generate $2^{500000}$ itemsets, conditions must be imposed on the input texts to control the number of words. As the diversity of a lexicon grows with the size of a text, we limit the input texts to a few thousand words.

The second step reduces the number of items, and hence the search space, for the extraction of frequent itemsets. Words judged to carry little information are removed from the transactions, using a list of 502 stop words; digits and punctuation marks are also removed.

The third step extracts the frequent itemsets, using the Apriori algorithm. An effort is made to obtain a restricted number of frequent itemsets: the search is performed iteratively, with the minimum support initially set to a high value. When the number of extracted frequent itemsets is lower than 10, the minimum support is decreased by 0.1; the process stops when the number of itemsets obtained exceeds 10 or the minimum support falls below 0.1.

The last step establishes the degree of similarity between documents: the frequent itemsets describing the documents are compared, and the more itemsets two documents share, the more similar they are judged to be.
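A minimal sketch of steps 3 and 4, in which a naive miner (adequate only for short texts) stands in for the authors' Apriori implementation; the iterative lowering of the minimum support follows the procedure described above, and all names are ours:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup, max_size=3):
    # Naive miner: enumerate candidate itemsets of 2..max_size items
    # and keep those whose support reaches minsup (Apriori would prune).
    n = len(transactions)
    items = sorted(set().union(*transactions))
    found = set()
    for k in range(2, max_size + 1):
        for cand in combinations(items, k):
            s = sum(1 for t in transactions if set(cand) <= t) / n
            if s >= minsup:
                found.add(frozenset(cand))
    return found

def describe(transactions, start=0.9):
    # Step 3: lower minsup by 0.1 until > 10 itemsets or minsup < 0.1.
    minsup = start
    while minsup >= 0.1:
        itemsets = frequent_itemsets(transactions, minsup)
        if len(itemsets) > 10:
            return itemsets
        minsup = round(minsup - 0.1, 1)  # avoid float drift
    return itemsets

def similarity(desc_a, desc_b):
    # Step 4: documents sharing more frequent itemsets are more similar.
    return len(desc_a & desc_b)
```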
When the number of frequent itemsets extracted is lower than 10, the minimum support is decreased by 0.1. The process stops when the number of itemsets obtained exceeds 10 or the minimum support falls below 0.1. The last step establishes the degree of similarity between the documents: the frequent itemsets used to describe the documents are compared, and the more frequent itemsets two documents share, the more similar they are judged to be.

4. Experiments and discussion

To evaluate the proposed approach, several experiments were carried out with an application we developed in Python. We used a corpus of about one hundred news articles published on the web. This corpus is distinctive in that it presents the same news items from the angle of different press agencies: it gathers articles from 6 different sources, each containing between 500 and 1,500 words. These articles are perfectly suited to the conditions of the proposed approach. In our experiments, we measured the discriminating power of frequent itemsets, comparing the classifications produced when the descriptors are frequent itemsets with those produced when the descriptors are words. The results suggest that frequent itemsets can be used to refine the description of a class. For example, the word {avions} describes 15% of the articles in the corpus; although these articles all relate to aviation, they deal with 4 different subjects. Our experiments show that using frequent itemsets as descriptors can describe the content of these articles more precisely. Figures 1 and 2 show the precision obtained with frequent itemsets and with words only, respectively. Note that when only words are considered, the similarity classes generated are less homogeneous: articles dealing with subjects other than aviation are included in them.

Figure 1: Precision with frequent itemsets. Figure 2: Precision with words.

Figure 3 shows the similarity matrix produced for the articles dealing with the North Korean crisis. The first column contains the article identifier, the second indicates the subject, and the following columns give the number of frequent itemsets shared by the articles; the diagonal corresponds to the number of frequent itemsets extracted for each article. Figure 3 is representative of the results observed. Fewer than 10 frequent itemsets were extracted for half of these articles; nevertheless, all of them were assigned to the same class.

Figure 3: Similarity matrix of the documents dealing with the North Korean crisis.

Although they deal with the same subject, some articles share few frequent itemsets with the other articles that form the class. This is explained by the lexicon they use. Performance could possibly be improved by adding a lemmatization step; however, some relations remain difficult to establish automatically. For example, document 45 contains the itemsets {nucléaire, pyongyang} and {nucléaire, washington}, while document 46 contains the itemsets {nucléaire, corée} and {nucléaire, américaine}.
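A minimal sketch of steps 3 and 4 follows, using the mlxtend implementation of Apriori in place of the authors' own Python application; the 10-itemset target and the 0.1 support bounds follow the description above, while the function and variable names are illustrative.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    def document_itemsets(sentences, start=0.9, step=0.1, floor=0.1, target=10):
        """Step 3: iteratively lower the minimum support until at least
        `target` frequent itemsets are found or the support floor is reached.
        `sentences` is one document, given as a list of word lists."""
        te = TransactionEncoder()
        onehot = pd.DataFrame(te.fit(sentences).transform(sentences),
                              columns=te.columns_)
        supp = start
        found = apriori(onehot, min_support=supp, use_colnames=True)
        while len(found) < target and round(supp - step, 2) >= floor:
            supp = round(supp - step, 2)
            found = apriori(onehot, min_support=supp, use_colnames=True)
        return set(found["itemsets"])

    def similarity(doc_a, doc_b):
        """Step 4: number of frequent itemsets shared by two documents."""
        return len(document_itemsets(doc_a) & document_itemsets(doc_b))

Computing `similarity` for every pair of documents yields a matrix of shared-itemset counts of the kind shown in Figure 3.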
The results presented here are only a sample of the knowledge extracted with the proposed approach. Besides being easy to interpret, frequent itemsets describe the content of documents more precisely than words alone.

5. Conclusion

We have proposed an unsupervised approach for establishing relations between text documents. The proposed approach relies on frequent itemsets. These descriptors express the co-occurrence of words within the sentences of a text. Frequent itemsets tend to be more discriminating than words alone and can therefore help enhance the description of a class. One advantage of the proposed method is that the results it produces are easy to interpret. The experiments carried out suggest that frequent itemsets, as defined here, are sufficiently informative to establish coherent links between documents. Several applications can be envisaged: among others, the proposed approach could serve as a preprocessing step for navigation between documents, for annotation, for information filtering, etc.

References

Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In Proc. of the SIGMOD Conference on Management of Data, pp. 207-216.
Agrawal, R., and Srikant, R. (1994). Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases, pp. 487-499.
Alghamdi, R. A., Taileb, M., and Ameen, M. (2014). A new multimodal fusion method based on association rules mining for image retrieval. In 17th IEEE Mediterranean Electrotechnical Conference (MELECON), pp. 493-499.
Bahri, E., and Lallich, S. (2010). Proposition d'une méthode de classification associative adaptative. In 10èmes Journées Francophones d'Extraction et Gestion des Connaissances (EGC 2010), pp. 501-512.
Fournier-Viger, P., Lin, J. C. W., Vo, B., Chi, T. T., Zhang, J., and Le, H. B. (2017). A survey of itemset mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
Geng, L., and Hamilton, H. J. (2006). Interestingness measures for data mining: A survey. ACM Computing Surveys (CSUR), 38(3): 9.
Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns without candidate generation. ACM SIGMOD Record, 29(2): 1-12.
Le Bras, Y., Meyer, P., Lenca, P., and Lallich, S. (2010). Mesure de la robustesse de règles d'association. In QDC 2010.
Liu, B., Hsu, W., and Ma, Y. (1998). Integrating classification and association rule mining. In Knowledge Discovery and Data Mining, pp. 80-86.
McCallum, A., and Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, Vol. 752, pp. 41-48.
Mittal, K., Aggarwal, G., and Mahajan, P. (2017). A comparative study of association rule mining techniques and predictive mining approaches for association classification. International Journal of Advanced Research in Computer Science, 8(9).
Rompré, L., Biskri, I., and Meunier, J.-G. (2017). Using association rules mining for retrieving genre-specific music files. In Proc. of FLAIRS 2017, pp. 706-711.
Tan, P. N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 32-41.
Zaïane, O. R., and Antonie, M. L. (2002). Classifying text documents by associating terms with text categories. In Australian Computer Science Communications, 24(2): 215-222.

Discursive Functions of French Epistemic Adverbs: What can Correspondence Analysis tell us about Genre and Diachronic Variation?

Corinne Rossari, Ljiljana Dolamic, Annalena Hütsch, Claudia Ricci, Dennis Wandel
University of Neuchâtel – corinne.rossari@unine.ch

Abstract
Our aim is to describe the discursive functions of a set of French epistemic adverbs by establishing their combinatory profiles on the basis of their co-occurrence with different connectors. We then compare these profiles using correspondence analysis in order to find evidence of genre and diachronic variation. The use of these adverbs is explored in contexts of informative discourse within two distinctly different genres – contemporary written press and encyclopedic discourse – as well as within two diachronic spans.

Keywords: epistemic adverbs, connectors, co-occurrences, correspondence analysis, genre variation, diachronic variation

1. Introduction
Our aim is to analyze the genre and diachronic variation of the discursive functions of French epistemic adverbs (E-ADV). By discursive function we mean the rhetorical aim of the utterance in which the adverb occurs: counter-argument, argument, or conclusion (cf. Roulet et al., 1991). Our paradigm of E-ADVs consists of the following items: certainement, certes, peut-être, probablement, sans doute and sûrement (the selection is based on Roulet's (1979) paradigm of epistemic assertive adverbs). The functions of these adverbs are explored in contexts of informative discourse within two distinctly different genres: contemporary written press and encyclopedic discourse. The former is represented by three daily newspapers: Le Monde (2008, 20 410 766 tokens), Le Figaro (2008, 10 795 373 tokens) and Sud-Ouest (2002, 29 763 988 tokens). In the latter, we consider two diachronic spans: the 18th century, represented by Diderot & d'Alembert's Encyclopédie (DDA, 29 940 181 tokens), and the 21st century, represented by the 2005 edition of the Encyclopédie Universalis (UNI, 49 859 864 tokens) and by a random sample of the 2015 version of Wikipédia (WIKI, 50 396 345 tokens). We first proceed to an analysis based on the combinatory profile of each E-ADV (section 2) in our corpus of contemporary written press; then, after having pinpointed what such an analysis can and cannot show, we use a more holistic approach based on correspondence analysis (section 3).

2. Analysis of Combinatory Profiles
In order to identify the discursive functions of the E-ADVs considered here, we searched for connectors (C) specifically co-occurring with each of these E-ADVs within a 20-token span. We chose a 20-token span rather than a sentence span because a connector's combinatory profile can go beyond sentence boundaries. We define connectors as linguistic forms linking segments of discourse. Such a functional category is not part of the tagset of the platform we used. We therefore made our query by searching for three different categories – adverbs, subordinating conjunctions and coordinating conjunctions – and then manually filtered the resulting forms, keeping those which proved to function as connectors.
For all our sub-corpora, each of these adverbs is thus assigned a series of specific connectors within constructions of the type "E-ADV…C1/C2/Cn" and "C1/C2/Cn…E-ADV", which represent its discursive combinatory profile (we adapt the term combinatory profile used by Blumenthal et al. (2005) and Blumenthal (2008; 2012)). We call each sequence within a combinatory profile a discourse movement, as we consider it to have specific, rhetorically motivated discursive aims. These aims (mentioned in section 1) are signaled by the connectors co-occurring specifically with an E-ADV: néanmoins and mais signal that the utterance preceding them is a counter-argument to the utterance they introduce; donc and finalement signal that the utterance they introduce is a conclusion; car and parce que signal that the utterance they introduce is an argument in favor of the utterance preceding them. Tables 1-3 show the discursive combinatory profiles in the three sub-corpora of contemporary press (Le Monde 2008; Le Figaro 2008; Sud-Ouest 2002). The significance of each co-occurrence of a connector with an E-ADV is calculated using log-likelihood (LL); although LL can be calculated directly on the BTLC platform, we used the platform to extract the corresponding frequencies and calculated the LL with R. All the corpora used were supplied by the platform BTLC (Base Textuelle Lexicostatistique de Cologne), conceived by Sascha Diwersy (Diwersy, 2014), and were constituted within the French-German projects Presto (http://presto.ens-lyon.fr) and Emolex (http://emolex.u-grenoble3.fr).

Tables 1-3: Log-likelihood scores (threshold: 10.83; all scores equal to or above the threshold are marked in bold). Each table crosses the connectors (car, donc, finalement, mais, néanmoins, parce que) with the E-ADVs (certainement, certes, peut-être, probablement, sans doute, sûrement), giving the LL score and raw co-occurrence frequency for the connector occurring to the left (L) and to the right (R) of the adverb, for Le Monde (2008), Le Figaro (2008) and Sud-Ouest (2002) respectively. [The full score matrices are not reproduced here.]
The data lead to the following observations. (i) Although the E-ADVs belong to the same semantic class, each has its own specific combinatory profile. (ii) Certain E-ADVs share comparable combinatory profiles: sans doute and peut-être share an almost identical set of specific connectors; more frequently, several E-ADVs essentially share only one or more specific connectors (for instance the connector mais for certainement, sûrement, peut-être and sans doute). (iii) Certain E-ADVs stand out for their unique combinatory features: certes is almost exclusively associated with mais, but only with mais_R, and with a notably higher log-likelihood score than the other E-ADVs; probablement is also associated with only a few connectors, but with a low log-likelihood score, close to the threshold of 10.83. (iv) There is homogeneity in the significant associations for each E-ADV across the three sub-corpora of contemporary press. However, preceding studies – Rossari et al. (2016) and Rossari & Salsmann (2017) – show that the E-ADVs' combinatory profiles vary across genres and diachronic periods: contrary to what is observed for the press genre, in DDA and UNI the association peut-être…mais is less significant than the association mais…peut-être. For instance, in DDA, no significant association certes…mais is observed, while the association sans doute…mais in the same corpus proves to be highly significant. The analysis of combinatory profiles (based on the log-likelihood significance measure; cf. Blumenthal et al., 2005) allows for one-to-one comparison of the different sequences of the type E-ADV…C and C…E-ADV. Thus, the associations of each E-ADV with each connector can easily be compared across corpora representing different newspapers, but also across different genres and diachronic periods. It is also possible to compare the associations of different E-ADVs with one or a few connectors. However, this method has certain insufficiencies when it comes to comparing all of these variables simultaneously in a holistic view: it never takes into account all variables at the same time (e.g. frequencies, log-likelihood scores, the paradigm of E-ADVs, the paradigm of connectors). Moreover, using a threshold (in our case 10.83) to decide whether an association is significant is useful for traditional collocation analysis.
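For readers who wish to reproduce the scoring, here is a minimal Python sketch of a signed log-likelihood ratio (Dunning's G2) for one E-ADV–connector pair. The 2×2 counting scheme is a standard collocation setup and an assumption on our part, since the paper only states that frequencies were extracted from BTLC and the LL computed in R; the counts in the example are hypothetical.

    import math

    def signed_log_likelihood(k11, k12, k21, k22):
        """Dunning's G2 for a 2x2 contingency table:
        k11 = spans containing both the E-ADV and the connector,
        k12 = spans with the E-ADV only, k21 = connector only, k22 = neither.
        The sign is negative when the pair co-occurs less often than
        expected, matching the negative scores reported in Tables 1-3."""
        n = k11 + k12 + k21 + k22
        observed = (k11, k12, k21, k22)
        expected = ((k11 + k12) * (k11 + k21) / n,
                    (k11 + k12) * (k12 + k22) / n,
                    (k21 + k22) * (k11 + k21) / n,
                    (k21 + k22) * (k12 + k22) / n)
        g2 = 2 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)
        return g2 if k11 >= expected[0] else -g2

    # Hypothetical counts for one certes ... mais pair in one sub-corpus:
    print(round(signed_log_likelihood(979, 964, 50565, 500000), 2))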
But our goal is also to represent the use of each E-ADV in its typical discourse movements, in contrast to its non-typical discourse movements. It thus seems counterproductive that all sequences (E-ADV…C / C…E-ADV) which are not statistically significant for certain E-ADVs are not taken into account when establishing their combinatory profiles, since these non-significant cases play an important role in characterizing the overall use of the E-ADVs and connectors. In order to allow for a holistic approach, we propose to use correspondence analysis (CA) (Greenacre, 2017).

3. Correspondence Analysis (CA)
The correspondence analysis presented in this section was performed using the R software and the package "ca" (Nenadić & Greenacre, 2007). (1) In DDA, representing the 18th century, certes has a use which stands out: certes left and right of mais differs clearly from all other E-ADVs as to its associations with the connectors. Certes is not typically used with any other connector analyzed and, most importantly, its association is not stronger with mais on its right than with mais on its left. Conversely, in all other five sub-corpora (encyclopedic and press corpora), which represent the 21st century, there is an important difference between the use of certes right and left of mais: while certes_L is strongly linked to mais, certes_R is not. (2) In all six sub-corpora, mais appears to be opposed to all other connectors when it comes to its associations with E-ADVs. Its central position appears to be linked to its high frequency, indicating its high contribution to the horizontal axis; this is confirmed by the correspondence analysis indicators. (3) An association between sans doute_L and parce que can be observed in DDA and WIKI, whereas in UNI the adverb and the connector appear to be in the opposite relation. This behavior indicates that variation has to be expected even within the encyclopedic sub-corpus, based on at least two parameters: on the one hand, the diachronic parameter is involved in some discursive uses of E-ADVs – certes_L and certes_R show no difference as to their association with mais in DDA, consistently with the different meaning of certes at that time, whereas only certes_L is associated with mais in all other sub-corpora; on the other hand, some convergence between DDA and WIKI could be interpreted as showing similarities in writing style. (4) The results of the correspondence analysis show that in all sub-corpora of one particular genre, in most cases, the same E-ADVs are strongly associated with the same connector or group of connectors (donc and finalement; car and parce que; mais); this phenomenon is particularly pronounced in the sub-corpora representing the written press. The connector mais differs the most from the other connectors in the strength of its associations: although mais is associated with most E-ADVs, its association is strong with only a few of them in all sub-corpora (certes_L being the only constant), while most other connectors have a higher number of strong associations. This indicates that certain discourse movements (such as E-ADV…car / parce que) seem to be rather regular, whereas certes…mais proves to be a special association, although only in the 21st-century corpora. (5) The behavior of néanmoins in the Figaro 2008 corpus should be interpreted with caution, since the two axes describe only 10% of its variation.
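The paper performs CA with the R package "ca"; as a rough illustration of what that computation does, here is a NumPy sketch of basic correspondence analysis applied to a contingency table crossing E-ADV_L/R profiles (rows) with connectors (columns). The table N and all names are placeholders, not the authors' data.

    import numpy as np

    def correspondence_analysis(N):
        """Basic CA of a contingency table N: returns the principal row
        and column coordinates on the first two axes and the share of
        inertia (variation) each axis describes."""
        P = N / N.sum()                       # correspondence matrix
        r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses
        S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
        U, sv, Vt = np.linalg.svd(S, full_matrices=False)
        rows = (np.diag(r ** -0.5) @ U) * sv      # e.g. certes_L, certes_R, ...
        cols = (np.diag(c ** -0.5) @ Vt.T) * sv   # e.g. mais, donc, car, ...
        inertia = sv ** 2 / (sv ** 2).sum()
        return rows[:, :2], cols[:, :2], inertia[:2]

The `inertia` values correspond to the share of variation described by each axis, the quantity invoked in observation (5) above.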
4. Perspectives
Our first attempt to use correspondence analysis to study different discourse movements has provided promising results regarding the genre and diachronic variation of the discursive functions of French epistemic adverbs. We intend to extend our analysis in three directions. First, we would like to enlarge our corpora to see whether this allows us to extend the paradigm of connectors, so as to give a better overview of the different discourse movements that exist and to better represent the different discursive functions of the E-ADVs that we have found. It would be especially interesting to cover different diachronic spans of the press, allowing for a study of possible changes within this specific genre. Likewise, other text types may be considered in order to better represent possible variation between genres. Second, through the comparative analysis of the discursive combinatory profiles of each E-ADV, we aim to identify regularities concerning the rhetorical purpose of the sequence in which the E-ADV typically occurs, by understanding its motivation. For instance, beyond the difference between a counter-argument, an argument and a conclusion, there is a more fundamental difference between discourse movements used with the rhetorical aim (i) of presenting a content as being in the discursive background (when the E-ADV is followed by mais), (ii) of introducing a content which the speaker considers most relevant (when the E-ADV is introduced by mais or donc), and (iii) of adding evidence to a relevant content (when the E-ADV follows car or parce que). Third, in order to confirm the reliability and precision of the positions on the correspondence analysis planes, we intend to apply bootstrap validation (Lebart, 2010).

Figures 1-6: Correspondence analysis scatter plots for the six corpora.

References
Blumenthal P. (2008). Combinatoire des prépositions : approche quantitative. Langue française, 157: 37-51.
Blumenthal P. (2012). Particularités combinatoires du français en Afrique : essai méthodologique. Le français en Afrique, 27: 55-74.
Blumenthal P., Diwersy S. and Mielebacher J. (2005). Kombinatorische Wortprofile und Profilkontraste. Berechnungsverfahren und Anwendungen. Zeitschrift für romanische Philologie, 121: 49-83.
Diwersy S. (2014). Corpus diachronique de la presse française : base textuelle créée dans le cadre du projet ANR-DFG PRESTO. Institut des Langues Romanes, Université de Cologne.
Greenacre M. J. (2017). Correspondence Analysis in Practice. 3rd ed. Boca Raton: Chapman.
Lebart L. (2010). Validation techniques for textual data analysis. Statistica Applicata - Italian Journal of Applied Statistics, 22(1): 37-51.
Nenadić O. and Greenacre M. J. (2007). Correspondence Analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software, 20(3): 1-13.
Rossari C., Hütsch A., Ricci C., Salsmann M. and Wandel D. (2016). Le pouvoir attracteur de mais sur le paradigme des adverbes épistémiques : du quantitatif au qualitatif. In Mayaffre D. et al. (eds), Proceedings of the 13th International Conference on Statistical Analysis of Textual Data, II: 819-823.
Rossari C. and Salsmann M. (2017). Étude quantitative des propriétés dialogiques des adverbes épistémiques. Actes des 9èmes Journées Internationales de la Linguistique de corpus: 87-93.
Roulet E. (1979). Des modalités implicites intégrées en français contemporain. Cahiers Ferdinand de Saussure, 33: 41-76.
Roulet E., Auchlin A., Moeschler J., Schelling M. and Rubattel C. (1991). L'articulation du discours en français contemporain. 3rd ed. Bern: Lang.

Misleading information in online propaganda networks

Vanessa Russo, Mara Maretti, Lara Fontanella, Alice Tontodimamma
D'Annunzio University of Chieti-Pescara – russov1983@gmail.com, mara.maretti@unich.it, lara.fontanella@unich.it, alicetontodimamma@gmail.com

Abstract
Nowadays, the spread of inaccurate, false or misleading information over the digital space is amplified by the increasing use of social networks and social media. In many cases, misleading information can be linked to propaganda activity aimed at supporting offline organizations. In such cases, online pages conveying unintentionally (misinformation) or intentionally (disinformation) inaccurate information are embedded in a network system composed of political and ideological advertising. In this paper, we discuss the different structures of the online networks linked to official pages of different political parties. The networks were identified through Social Network Analysis.

Keywords: misinformation, disinformation, propaganda activity, Social Network Analysis

1. Background: misinformation and disinformation online
The development of the digital space has brought a new form of web-mediated communication, which can be defined by the following main features. Web communication can be thought of as a participative act: it is part not of a broadcast system (McLuhan, 1962) but of a networkcast system. A web content generates connections, denoted as "affinity networks" (Rainie and Wellman, 2012; Castells, 2000), based on the sharing of a given content. In this network system, web communication yields temporary consensus areas based on alliances between users with respect to the shared contents. Moreover, web communication favors a mobilization of skills that generates new paths of social action and collective projects (Lévy, 2002). In the digital space, content validity relies on the activism and interest of digital users, and every opinion "has citizenship rights" (Quattrociocchi and Vicini, 2016; Mocanu et al., 2015). In this framework, misinformation and disinformation processes share the previous characteristics. Furthermore, the accidental or deliberate propagation of false information is strictly linked to a "loss of disintermediation" (Jenkins, 2006). According to this theory, one of the most important effects of web-mediated communication is the loss of traceability of official information sources.
In fact, phenomena like Wikipedia, social media sites or news blogs produce a culture of unofficial knowledge, creating a virtuous circle of free sources on the one hand, and a vicious circle of misleading information on the other. Disinformation and misinformation processes can both be related to fake news and hate speech. "Fake news" or "junk news" refers to web sources that are completely invented or simply distorted. In the digital space, anyone can gain access to different information sources and can also create information content with low costs and high distribution potential. Furthermore, the propagation of fake news can develop into a viral system, dominated by the high sharing power of a few recurring themes. The hate speech phenomenon is usually linked to the sharing and commenting of fake news. The Web 3.0 era is permeated by hatred, mainly directed at immigrants, political parties and homosexual people. Although hater activity concerns specific themes, it has become a fundamental part of the redefinition of the digital public sphere (Lévy, 2002).

2. Research Design and Methodology
Online disinformation and misinformation have become a propaganda activity supporting offline organizations: in many cases, online fake news and hate speech are contained within a network system consisting of political and ideological advertising. This tendency gained particular attention during Trump's election campaign (Ott, 2017). The Computational Propaganda Research Project, promoted by Oxford University, aims at investigating «how tools like social media bots are used to manipulate public opinion by amplifying or repressing political content, disinformation, hate speech, and junk news». Woolley and Howard (2017), mapping computational propaganda in different countries, analyzed tens of millions of posts on seven different social media platforms, referring to elections, political crises and national security incidents. Each case study takes into account qualitative, quantitative and computational evidence collected between 2015 and 2017. In this framework, following a computational approach (Lazer et al., 2009), our research aims at identifying and comparing propaganda policy networks. For this purpose, we investigated the networks in which different political Facebook Like pages are embedded. More specifically, we selected the following Facebook Like pages related to political institutional information: "Ricostruiamo il centro destra" (centre-right wing), "Di Battista Alessandro" (Five Star Movement) and "Partito Democratico" (centre-left wing). Exploiting Social Network Analysis and focusing on each of the chosen pages, we detected the online networks. The adjacency matrices were built considering the "likes" as links. The analysis was implemented using the free and open-source NodeXL extension of the Microsoft Excel spreadsheet (Hansen et al., 2011). For each network, we present the centrality measures, which describe the extent to which a particular vertex can be said to be in the "middle" of the network. In particular, betweenness centrality measures how often a given vertex lies on the shortest path between two other vertices. Vertices with high betweenness may have considerable influence within a network by virtue of their control over information passing between others. As pointed out by Hansen et al.
(2011), these measures can be thought of as a kind of "bridge" score: a measure of how much removing a node would disrupt the connections between the other vertices in the network. Closeness centrality captures the average distance between a vertex and every other vertex in the network; NodeXL implements the inverse of the average distance, so that higher closeness values indicate more central vertices. The eigenvector centrality metric takes into consideration not only how many connections a vertex has (i.e., its degree), but also the degree of the vertices it is connected to: a node with few connections can have a very high eigenvector centrality if those few connections are themselves very well connected. These centrality measures allowed us to identify the most relevant nodes of each network. The identified Facebook Like pages were classified into "official pages" and "junk pages" according to their contents. Junk information is strictly linked to so-called post-truth politics, meaning a political culture in which truth is no longer significant or relevant and «objective facts are less influential in shaping public opinion than appeals to emotion and personal belief» (Oxford Dictionaries, 2016). In this context, the term junk information refers to fake news, conspiracy theories, hate speech, misinformation and deliberately misleading disinformation. Accordingly, Facebook Like pages containing posts, comments or images conveying this kind of information were classified as "junk pages". It is worth noticing that in the identified networks we did not retrieve hybrid forms, that is, pages composed of both official and junk contents.

3. Preliminary results
The network built from the Facebook Like page "Ricostruiamo il centro destra" is depicted in Figure 1. This social media network, linked to a centre-right political view, is composed of 159 nodes, comprising both institutional and junk pages (e.g. "unitaliasenzacomunisti", "SapereEundovere"). The centrality values, given in Table 1 for the pages with the highest levels of betweenness centrality, highlight a connection between junk and institutional nodes; furthermore, the influence of junk pages in the network is very pronounced.

Figure 1: NodeXL social media network diagram of relationships derived from the Facebook Like page "Ricostruiamo il centro destra".

Table 1: Network derived from the Facebook Like page "Ricostruiamo il centro destra": centrality measures for the vertex pages with the highest levels of betweenness.

Vertex | Betweenness | Closeness | Eigenvector
ricostruiamocentrodestra | 22644.000 | 0.004 | 0.009
unitaliasenzacomunisti | 10986.000 | 0.003 | 0.009
SapereEundovere | 10044.000 | 0.003 | 0.000
radionewsinformazionelibera | 1087.000 | 0.002 | 0.000
italianinonsonorazzistisonostanchidiquestainvasione | 777.000 | 0.002 | 0.000

A similar situation was detected for the Five Star Movement. This network, represented in Figure 2, is composed of 664 nodes, again comprising both institutional and junk pages. In this case, the junk pages are specific to the Five Star Movement, and the institutional pages are the personal pages of political candidates. The Five Star Movement network shows three big clusters in which the central node (WIlM5s) is a junk page.

Figure 2: NodeXL social media network diagram of relationships derived from the Facebook Like page "Di Battista Alessandro".
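The paper computes these measures in NodeXL; for readers working outside Excel, an equivalent computation is straightforward with networkx. This is a sketch under the assumption that the page-to-page "like" relations are available as an edge list; the file name is hypothetical.

    import networkx as nx

    # Hypothetical input: one "liker liked" pair of page names per line.
    G = nx.read_edgelist("like_network_edges.txt")

    betweenness = nx.betweenness_centrality(G, normalized=False)  # raw scores, as in Tables 1-3
    closeness = nx.closeness_centrality(G)   # inverse average distance, as in NodeXL
    eigenvector = nx.eigenvector_centrality(G)

    # Top pages by betweenness, in the layout of Tables 1-3.
    for page in sorted(betweenness, key=betweenness.get, reverse=True)[:6]:
        print(page, round(betweenness[page], 3),
              round(closeness[page], 3), round(eigenvector[page], 3))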
Table 2: Network derived from the Facebook Like page "Di Battista Alessandro": centrality measures for the vertex pages with the highest levels of betweenness.

Vertex | Betweenness | Closeness | Eigenvector
MassimoEnricoBaroni | 281353.000 | 0.001 | 0.032
WIlM5s | 172430.333 | 0.001 | 0.024
sorial.giorgio | 143457.000 | 0.001 | 0.013
dibattista.alessandro | 3405.667 | 0.001 | 0.006
pierrecantagallo89 | 1324.000 | 0.001 | 0.001
perchevotarem5s | 702.000 | 0.001 | 0.003

The social media network of relationships derived from the Facebook Like page "Partito Democratico" does not show the features found in the previous networks: the network related to the centre-left party is composed only of institutional propaganda pages.

Figure 3: Centrality measures for the social media network of relationships derived from the Facebook Like page "Partito Democratico".

Table 3: Network derived from the Facebook Like page "Partito Democratico": centrality measures for the vertex pages with the highest levels of betweenness.

Vertex | Betweenness | Closeness | Eigenvector
partitodemocratico.it | 46486.100 | 0.002 | 0.024
enricoletta.it | 28853.657 | 0.002 | 0.047
scalfarotto | 24167.162 | 0.002 | 0.038
giannipittella | 23136.533 | 0.001 | 0.018
giovanidem | 19798.000 | 0.001 | 0.011
palazzochigi.it | 12633.519 | 0.001 | 0.009

4. Community clusters
The mapping of the propaganda pages resulted in different network structures. To classify these structures, we use the model elaborated by Smith et al. (2014), who define a taxonomy of the social networks derived from conversations on Twitter. The authors define six types of networks: polarized crowd, tight crowd, community cluster, brand cluster, broadcast network and support network (see Figure 5).

Figure 5: Diagrams of the differences in the six types of social media networks (Smith et al., 2014).

In this framework, the centre-right social media network shows a conformation similar to a mixture of the Polarized Crowd and the Support Network. On the one hand, the Polarized Crowd model is characterized by two groups, polarized on specific opinions and sharing few connections; on the other hand, the Support Network model consists of a central node that sends information to the peripheral nodes. The Five Star Movement network adheres more closely to the Tight Crowd and Support Network structures; the Tight Crowd is composed of highly connected nodes and specific shared themes. Finally, the Democratic Party network reflects the structure of a Community Cluster, which is organized in many cliques that share specific topics of conversation.

5. Conclusions and future works
In this preliminary phase of our research, we considered the network structures related to online propaganda linked to different political areas. Our analysis allowed us to highlight the differences between the networks and to cast the reconstructed networks into the taxonomy proposed by Smith et al. (2014). In addition, in two of the three analyzed social networks we found junk pages contributing to the disinformation and misinformation processes by spreading fake news and indulging in hate speech. The cluster structures of those two networks, leading to closed circles of highly polarized information, facilitate the diffusion process of misleading information.
Based on these preliminary results, future work will focus on the textual analysis of the posts and comments shared on the retrieved junk pages, in order to identify the main topics discussed. To this end, text mining and machine learning techniques will be exploited.

References
Castells M. (2000). The Rise of the Network Society. Oxford: Blackwell Publishers.
Hansen D. L., Shneiderman B. and Smith M. A. (2011). Analyzing Social Media Networks with NodeXL: Insights from a Connected World. Morgan Kaufmann.
Jenkins H. (2006). Fans, Bloggers and Gamers: Exploring Participatory Culture. New York University Press.
Lazer D., Pentland A., Adamic L., Aral S., Barabási A. L., Brewer D., Christakis N., Contractor N., Fowler J., Gutmann M., Jebara T., King G., Macy M., Roy D. and Van Alstyne M. (2009). Life in the network: the coming of computational social science. Science, 323(5915): 721-723.
Lévy P. (2002). Cyberdémocratie. Essai de philosophie politique. Paris: O. Jacob.
McLuhan M. (1962). The Gutenberg Galaxy: The Making of Typographic Man. University of Toronto Press.
Mocanu D., Rossi L., Zhang Q., Karsai M. and Quattrociocchi W. (2015). Collective attention in the age of (mis)information. Computers in Human Behavior, 51: 1198-1204.
Ott B. L. (2017). The age of Twitter: Donald J. Trump and the politics of debasement. Critical Studies in Media Communication, 34(1): 59-68.
Oxford Dictionaries (2016). Word of the Year 2016 is... https://en.oxforddictionaries.com/word-of-the-year/word-of-the-year-2016.
Quattrociocchi W. and Vicini A. (2016). Misinformation. Guida alla società dell'informazione e della credulità. Franco Angeli.
Rainie L. and Wellman B. (2012). Networked: The New Social Operating System. MIT Press.
Smith M., Rainie L., Shneiderman B. and Himelboim I. (2014). Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters. Pew Research Internet Project, February 20. http://www.pewinternet.org/2014/02/20/mapping-twitter-topic-networks-from-polarized-crowds-to-community-clusters/
Woolley S. C. and Howard P. N. (2017). Computational Propaganda Worldwide: Executive Summary. Working Paper 2017.11. Oxford, UK: Project on Computational Propaganda. comprop.oii.ox.ac.uk. 14 pp.

Topic modeling of Twitter conversations

Eliana Sanandres, Camilo Madariaga, Raimundo Abello
Universidad del Norte – esanandres@uninorte.edu.co, cmadaria@uninorte.edu.co, rabello@uninorte.edu.co

Abstract
Topic modeling provides a useful method for finding symbolic representations of ongoing social events. It has received special attention from social researchers, particularly among cultural sociologists, in the last decade (DiMaggio et al., 2013; Sanandres and Otálora, 2015). During this time, Twitter has acted as the most common platform for people to share narratives about social events (Himelboim et al., 2013). This study proposes LDA (Latent Dirichlet Allocation) based topic modeling of Twitter conversations to determine which topics are shared on Twitter in relation to social events. The dataset for this study was constructed from public messages posted on Twitter related to the financial crisis of the National University of Colombia. Over an eight-week period, we downloaded all tweets that included the hashtag #crisisUNAL (UNAL is the Spanish acronym of the university) using the Twitter API interface.
We analyzed over 45,000 tweets published between 2011 and 2015, using the R package topicmodels to fit the LDA model in five steps. First, we transformed the tweets into a corpus, which we exported to a document-term matrix; the terms were stemmed, and stop words, punctuation marks, numbers and terms shorter than three letters were removed. Second, we used the mean term frequency-inverse document frequency (tf-idf) over the documents containing each term to select the vocabulary: we only included terms with a tf-idf value of at least 0.1, which is slightly below the median, to ensure that overly frequent terms were omitted. Third, we defined the number of topics k by estimating the log-likelihood of the model for each number of topics from 1 to 300, and selected k = 12 because it had the highest log-likelihood value (LL = -198000). Fourth, we ran the LDA model for k = 12 topics. Fifth, we labeled the k = 12 topics by choosing the top N terms ranked by their probability within each topic. This article illustrates the strength of topic modeling for analyzing large text corpora and provides a way to study the narratives that people share on Twitter.

Keywords: topic modeling, LDA, Twitter

1. Introduction
This article presents a way to analyze large amounts of textual data from Twitter conversations in an efficient and effective way. Specifically, we explain how to capture the narratives that people share on Twitter about social events, reduce their complexity, and provide plausible explanations. This is a research concern that has received special attention among social researchers (Kovanović et al., 2015; Yang et al., 2011; Newman and Block, 2006; Griffiths and Steyvers, 2004), particularly among cultural sociologists, who face the methodological challenge of working qualitatively with large amounts of data (Sanandres and Otálora, 2015; Eyerman et al., 2011; Alexander, 2004). In this paper we propose an LDA (Latent Dirichlet Allocation) based topic model to address this challenge. Topic modeling is a useful approach because the sets of terms found within topics index discursive environments or frames that define patterns of association between a focal issue and other constructs (DiMaggio et al., 2013). These patterns of association are to be interpreted as symbolic representations of ongoing social events, which represent claims about the shape of social reality, its causes, and the responsibility for action such causes imply (Alexander, 2004). We applied an LDA-based model to Twitter conversations about the financial crisis of the National University of Colombia to examine how the debate over this crisis was framed on Twitter, from 2011, when it emerged, until 2015. We analyzed over 45,000 tweets, illustrating the strength of topic modeling for the analysis of large text corpora as a way to study narratives shared on Twitter.

2. Background: The financial crisis of the National University of Colombia
Over the last decade, Colombian academics and representatives of the government have recognized that budget limitations are the major constraint on the response of public universities to the increasing demands of society. To face this problem, the government proposed to reform the entire system of higher education (Ministry of National Education, 2010).
The intention was to find new sources of money for higher education, enable more people to attend college, encourage transparency and good governance in the education sector, and improve the quality of higher education. One of the most controversial proposed changes was the opening of the education sector to private investment by for-profit companies (El Espectador, 2011). This was immediately rejected by public universities, which claimed that the proposed reform would lead to a full-scale privatization of the higher education system (Semana, 2011). At the public National University of Colombia, the largest higher education institution in Colombia, some students and professors claimed that the reform offered no clear solution to the financial crisis of the university. They explained that the university had been using a funding model whose sources of support were mixed between the state and external resources, and that since 2004 this model had borne dwindling state support and ever-increasing costs to be covered by external resources. They showed that government transfers had decreased from 70% in 2004 to 64% in 2013, while the external resources produced by activities such as tuition fees, non-formal education courses, and academic extension services, among others, had increased from 30% to 36% in the same period (National University of Colombia, 2014). This statement reopened the debate on the financial crisis of the National University of Colombia and became a Twitter trending topic with the hashtag #CrisisUnal (UNAL is the Spanish acronym for the name of the university).

3. The financial crisis of the National University of Colombia on Twitter
Here, we investigate how the financial crisis of the National University of Colombia was framed on Twitter. It may be asked why we should care about Twitter conversations on this topic; it should be considered, however, that Twitter conversations can offer clues to what the university community is thinking and doing about the crisis. A central advantage of using Twitter for analysis is that it covers topics in real time, producing a large amount of data that can be used to look at people's perceptions and narratives of particular events. Twitter also provides a practical way to examine collective experience related to a topical event, to study behaviors and attitudes where social desirability bias may occur in official surveys, and to collect large amounts of data on a limited budget (Himelboim et al., 2013). Twitter conversations also illustrate the views of the reading public and show dominant viewpoints, which emerge quickly and are difficult to change (Xiong and Liu, 2014). We collected every tweet published between 2011 and 2015 that contained any reference to the financial crisis of the National University of Colombia with the hashtag #CrisisUNAL. We chose this period in order to track Twitter conversations around this topic from the time it became a Twitter trend in 2011 through 2015 (the last year in which we collected data). Our collection formed a corpus of over 45,000 tweets. In the next section we describe how we used topic modeling.

4. Method
Topic modeling is a machine-learning method used to discover hidden thematic structures in large collections of documents. In this work we used LDA, a widely used topic modeling method (Jelodar et al., 2017; Fligstein
The intuition behind LDA is that documents exhibit multiple topics. A topic is formally defined as a distribution of words over a fixed vocabulary (Blei, 2012). For LDA, topics must be specified before any data are generated. For each document in the collection, this method generates the words in a two-stage process. During the first stage, it randomly chooses a distribution over topics (step 1). In the second stage, for each word in the document, it randomly chooses a topic from the distribution over topics in step 1 (step 2a), and a word from the corresponding distribution over the vocabulary (step 2b). At the end, each document exhibits topics in different proportions (step 1) and each word in each document is drawn from one of the topics (step 2b), where the selected topic is chosen from the per-document distribution over topics (step 2a) (Blei, 2012). To run the LDA model, we followed five steps. First, we transformed the tweets into a corpus and exported this corpus to a document-term matrix; the terms were stemmed and the stop words, punctuation, numbers and terms shorter than three letters were removed. Second, we used the mean term frequency-inverse document frequency (tf-idf) to select the vocabulary. We only included terms with a tf-idf value of at least 0.1, which is a bit less than the median, to make sure that the most frequent terms were omitted. Third, we defined the number of topics k by estimating the log-likelihood of the model for each topic number, from 1 to 300 topics; we selected k = 12 as having the highest log-likelihood value (LL = -198000). Fourth, we run the LDA model for k = 12 topics. Fifth, we labeled the k = 12 topics previously identified by choosing the top N terms, ranked according to the probability of that topic. For this we used the R package topicmodels. 5. Results Table 1 displays the 12-topic solution and lists the 10 highest-ranking terms for each topic. We call attention to four sets of topics: six topics concerned with social protest (dark shading), three topics on educational reform (medium shading), two topics calling for investment (light shading), and one topic emphasizing the role of the National University of Colombia in the Colombian peace process (no shading). To more easily interpret the topics, after reviewing the list of terms we examined those tweets that exhibited each topic with the highest probability. 5.1 Protest topics Protest topics are the focus of the Twitter conversations on the financial crisis in the National University of Colombia. Topic 1 covers the protests of the education workers. The most highly ranked terms were sintraunal (the labor 688 JADT’ 18 union covering all workers at public universities), protest, strike, campus, riot, gas, blocked, and wall. The tweets in which this topic was strongly represented locate protests in national and international contexts with terms like nation and clacso (Latin American Council of Social Sciences), indicating that the protests were a matter of concern in Colombia and in Latin America. Topic 3 also refers to the protests of the education workers. Some of the top words are sintraunal, gases, wall, and block. This topic frequently exhibits tweets that show negative aspects of protests, such as confrontation, death, and bombs. 
5. Results
Table 1 displays the 12-topic solution and lists the 10 highest-ranking terms for each topic. We call attention to four sets of topics: six topics concerned with social protest (Topics 1, 2, 3, 4, 10 and 12), three topics on the educational reform (Topics 5, 6 and 8), two topics calling for investment (Topics 7 and 9), and one topic emphasizing the role of the National University of Colombia in the Colombian peace process (Topic 11). To interpret the topics more easily, after reviewing the lists of terms we examined the tweets that exhibited each topic with the highest probability.

Table 1: 12-topic solution (10 highest-ranking terms per topic).
Topic 1: sintraunal, protest, strike, campus, riot, gas, blocked, wall, nation, clacso
Topic 2: agricultural, strike, graffiti, hate, block, bombs, terrorists, crash, delinquents, guevara
Topic 3: sintraunal, gases, wall, block, undefined, bombs, hood, criticism, death, confrontation
Topic 4: agrarian, protest, movement, mobilization, participation, people, bombs, poor, assembly, disturbance
Topic 5: solidarity, no to the reform, justice, march, respect, charge, help, block, upedagogica, studying
Topic 6: no to the reform, universities, listen, sciences, confrontation, media, classrooms, abandoned, mobilization, block
Topic 7: defend, university, improvement, campus, crisis, infrastructure, cement, hospital, architecture, sociology
Topic 8: no to the reform, propose, threat, oblivion, save, closed, blocked, abnormality, upedagogica, uncertainty
Topic 9: stamp, demand, support, public, university, strike, resources, deserve, financial, pride
Topic 10: intimidation, blocked, abandoned, public, eviction, strike, che, graffiti, protest, worker
Topic 11: peace, process, mobilization, research, studying, participation, talks, intellectuals, solidarity, civil
Topic 12: revolutionary, victory, popular, campus, strike, eviction, denounce, deserve, abandonment, took

5.1 Protest topics
Protest topics are the focus of the Twitter conversations on the financial crisis of the National University of Colombia. Topic 1 covers the protests of education workers. The most highly ranked terms are sintraunal (the labor union covering all workers at public universities), protest, strike, campus, riot, gas, blocked, and wall. The tweets in which this topic was strongly represented locate the protests in national and international contexts, with terms like nation and clacso (the Latin American Council of Social Sciences), indicating that the protests were a matter of concern both in Colombia and in Latin America. Topic 3 also refers to the protests of education workers; some of its top terms are sintraunal, gases, wall, and block. This topic frequently appears in tweets showing negative aspects of the protests, with terms such as confrontation, death, and bombs. Topics 2 and 4 refer to the agricultural sector protests. While Topic 4 is related to the mobilization of people to take part in these protests, Topic 2 emphasizes the participation of terrorists and delinquents in agricultural strikes. In this context, social protest is associated with the Argentine Marxist revolutionary Ernesto Che Guevara. Che is also mentioned in Topic 10, which deals with the protests of the working class and the intimidation of protesters; the most highly ranked terms in this topic are intimidation, blocked, abandoned, public, eviction, worker, strike, che, graffiti, and protest. Finally, Topic 12 covers the revolutionary cause of social protest and includes the terms revolutionary, victory, popular, campus, and strike.

5.2 Anti-reform topics
Three topics deal with the reform of higher education proposed by the government. According to the terms included in Topic 5, public universities reject this reform and call for justice and respect; terms in this topic include solidarity, no to the reform, justice, march, and respect, and tweets representing this topic show strong solidarity among public universities, especially from the Universidad Pedagógica (upedagogica). Topic 8 is also related to the rejection of the planned educational reform in order to save public education; it includes terms like no to the reform, propose, threat, oblivion, and save, and the Universidad Pedagógica (upedagogica) is mentioned as well. In the same way, Topic 6 indicates that public universities reject the reform of higher education, mobilize to denounce the government's abandonment, and demand to be listened to; some of the terms in this topic are no to the reform, universities, listen, sciences, confrontation, media, classrooms, abandoned, mobilization, and block.

5.3 Investment topics
Topics 7 and 9 cover demands for investment to face the crisis. Topic 7 calls for infrastructure investment. Many tweets in which this topic is prominent focus on the infrastructure crisis of the campus buildings, in particular the sociology and architecture buildings and the university's hospital; the top terms include defend, university, improvement, campus, crisis, infrastructure, cement, hospital, architecture, and sociology. Topic 9 plays a similar role in investment demands, focusing on the pro-National University of Colombia stamp, created to acquire financial resources to improve the university facilities. Some tweets containing this topic highlight the role of the university as a source of national pride; the top-ranked terms include stamp, demand, support, public, university, resources, financial, strike, deserve, and pride.
Topic 9 plays a similar role in investment demands focusing on the pro-National University of Colombia stamp, created to acquire financial resources to improve the university facilities. Some tweets containing this topic highlight the role of the University as a national pride. The top ranked terms include stamp, demand, support, public, university, resources, financial, strike, deserve, and pride. 5.4 Peace topic Topic 12 represents the integration of the crisis in the National University of Colombia into a broader frame of national concern associated with the Colombian peace process. The top-ranked terms are peace, process, mobilization, research, studying, participation, talks, intellectuals, solidarity, and civil. Tweets in which this topic was strongly represented are related to the role of the university as facilitator in peace talks among the government, rebel groups involved in the Colombia’s internal armed conflict (which began in the mid-1960s and is currently in negotiation, in a process known as the Colombian peace process), intellectuals, and representatives of civil society. 690 JADT’ 18 6. Conclusions Producing an interpretable way to study Twitter conversations efficiently and effectively is only the beginning. The solution of this issue presents meaningful categories to address the analytic question that motivated the study: how was the financial crisis in the National University of Colombia framed on Twitter? The 12-topic solution showed that it was framed through four categories: protest, anti-reform, investment, and peace. Each topic constitutes a frame, in that it includes terms calling attention to particular ways in which the crisis under study may arouse controversy: protest frames emphasize public displays, demonstrations and the civil disobedience of the working class; anti-reform frames refer to the rejection of the reform of higher education by public universities; investment frames focus on investment demands to face the crisis; and the peace frame draws attention to the role the National University of Colombia played in acting as a facilitator in the Colombian peace process. Each of these frames represents a discursive environment for the financial crisis, which broadcasts not just the structural characteristics of the crisis (investment demands and education reform), but also symbolic representations of ongoing social events (workers protests and peace process), which can be seen as claims about ongoing social processes and demands of reparation. These results provide substantive insight into Twitter conversations about the financial crisis in the National University of Colombia. Using LDA to discover topics allowed us to locate two narratives: one focused on the structural characteristics of the crisis and the other concerned with symbolic representations of ongoing social events surrounding that crisis. For cultural sociologists, this is only the beginning of the analysis. A topic model allows a starting point to be found, which in this case is the structure of Twitter data. Used properly, with appropriate validation, topic models are valuable complements to other interpretive approaches, offering new ways to extract topics and make sense of online data. References Alexander, J. (2004). Toward a theory of cultural trauma. In Alexander, J., Eyerman, R., Giesen, B., Smelser, N. and Sztompka, P. Cultural trauma and collective identity. Univ of California Press. Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4): 77–84. 
What volunteers do? A textual analysis of voluntary activities in the Italian context
Francesco Santelli, Giancarlo Ragozini, Marco Musella
University of Naples Federico II
francescosantelli@unina.it marcomusella@unina.it

Abstract
The complex phenomenon of volunteering has mainly been analyzed in the economic literature with respect to its "economic value added", i.e. the capability of these activities to increase the level of productivity of specific goods or services. In this paper the point of view switches: voluntary organizations are analyzed as places of job-market innovation, where new jobs arise and where people acquire new skills. Volunteering can thus be thought of as a factor of "social innovation". In order to analyze the contents of voluntary work, we use data from the Istat survey "Multiscopo, Aspetti della vita quotidiana" (Multi-purpose survey, daily life aspects) for the year 2013. In our textual analysis, we use the information included in the open answers in which people describe the tasks they performed individually as volunteers.
After stemming, lemmatization, and cleaning, the data were analyzed by means of community detection based on Semantic Network Analysis, in order to discover patterns of jobs, and through Correspondence Analysis on Generalized Aggregated Lexical Tables (CA-GALT), in order to discover profiles of volunteers. In particular, we look for differences by gender, age, educational level, region of residence and type of voluntary association.
Keywords: Text Mining, Volunteers, Lexical Correspondence Analysis, Semantic Network Analysis

1. Introduction
Volunteer work differs from traditional forms of work in several respects. Nevertheless, most authors approaching the volunteering phenomenon are interested mainly in the economic value that this sector is able to add to the labour market (Ironmonger, 2000; Salamon et al., 2011), considering it a special case of job within the framework of economic theory. From this point of view, volunteering is assumed to be a peculiar sector of production with a considerable number of rules and dynamics that diverge from standard work patterns, but one that is still able to provide goods and services to the community like all the other sectors, thereby increasing the overall economic value of society. In this work the focus is instead on a different perspective: volunteering is considered as a laboratory of social innovation embedded in the labour market. The main idea is that volunteering is based on different guidelines and different principles (Zamagni, 2005); therefore, it can develop new professional profiles and modify pre-existing ones. A Text Mining approach is applied to the open-ended answers given by volunteers, assuming that their self-descriptions are a consistent proxy of the volunteering world. The empirical statistical analysis makes use of two tools chosen for their capability to profile both groups of words and clusters of volunteers. The latter, in the Italian context, are analyzed in parallel with the traditional categories of classic labor theory. It will be shown that most of the determinants of the segmentation of professions (Colombo, 2003), such as gender, age or geographic area of origin, can be adopted in this framework as well.

2. Data and statistical approach
Data are taken from the 2013 Istat survey "Multiscopo, Aspetti della vita quotidiana" (Multi-purpose survey, daily life aspects) (Istat, 2013). It is a large annual sample survey that covers the resident population in private households, interviewing a sample of about 20,000 households and about 50,000 people with the P.A.P.I. technique. The main dimensions of the questionnaires concern education, work, family and social life, spare time, political and social participation, health, lifestyle and access to services. From the whole sample, we selected about 5,000 persons who declared themselves involved in volunteering and who answered the open-ended questions about their voluntary activities and about whether they carried them out within an organization or by themselves. The core of the statistical text-mining procedure focuses on these brief descriptions of their own volunteering jobs. We analyzed the descriptions along with the available socio-demographic variables: gender, age, geographic macro-area and educational level.
Given the definition of volunteering (Istat, 2013; Wilson, 2000), several descriptions were erased from the database as they do not belong to voluntary activities (e.g., people donating blood to the AVIS organization, or people providing help to family members). After this preliminary procedure for deleting inappropriate or missing answers, the valid number of volunteers is 4,254 out of the original 5,000. Before the analysis, we performed a preliminary transformation of the original lexical data by removing punctuation and stop-words, and by stemming the words, i.e. deleting all derivational and inflectional suffixes (Lovins, 1968; Willett, 2006). Thus, all words that evolved from the same root are considered identical after the stemming. For this task we used the Porter stemming algorithm as implemented in the R package tm (Meyer et al., 2008). After this preliminary step, in order to discover groups of activities that can be described as jobs we applied Semantic Network Analysis (van Atteveldt, 2008; Drieger, 2013), and in order to profile voluntary jobs with respect to socio-demographic dimensions we used Correspondence Analysis on Generalized Aggregated Lexical Tables (CA-GALT) (Kostov et al., 2015). The former is an extension of Social Network Analysis that treats text as a graph structure: each word is a node, and the ties between words are undirected links weighted by the count of co-occurrences (how many times the words appear together in the same answer). Groups of terms corresponding to semantic clusters can be found through community detection algorithms (Fortunato, 2010). We used the Fast Greedy method, which is suited to undirected, weighted edges (Clauset et al., 2004). The CA-GALT method, on the other hand, allows us to jointly analyze, in a multiple correspondence framework, both the lexical table and the socio-demographic profiles, combining the document-term matrix and the matrix containing the individual characteristics.

3. Main findings of the analysis
After the preliminary transformations, the overall corpus shows a high degree of heterogeneity, with 1,649 different words, and a level of sparsity close to 100%, due to the large number of documents and their shortness. The term-frequency distribution has a median equal to 2 and a 0.75 percentile equal to 4. Given the sparsity, we focus the analysis on the most frequent words that profile and describe voluntary activities, taking into account only words above the 0.90 percentile (frequency equal to 11), ending up with a vocabulary of 175 words. The most used of them are organizz (to organize, or organization), which appears 296 times, assistent (assistant) with 225 occurrences, attiv (activity), which occurs 215 times, then assoc (association), aiut (to help) and volontar (volunteer and derived words). These terms can be considered rather generic and could relate to several aspects of the volunteers' community, without additional informative power to profile volunteers. They are followed by terms describing specific fields of intervention: sport, fond (fund), event, bambin (child/children), anzian (senior/old).
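A minimal R sketch of the preprocessing pipeline just described (lowercasing, punctuation and stop-word removal, Porter stemming, and filtering on frequency) might look as follows; the character vector answers holding the open-ended responses is a hypothetical placeholder, not the authors' code.

# Build a stemmed document-term matrix from the open-ended answers.
# 'answers' is a hypothetical character vector of survey responses.
library(tm)
corp <- VCorpus(VectorSource(answers))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("italian"))
corp <- tm_map(corp, stemDocument, language = "italian")
dtm  <- DocumentTermMatrix(corp)
findFreqTerms(dtm, lowfreq = 11)  # terms above the 0.90 percentile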
Further, some of them express just one semantic meaning and can be considered bi-grams (Collins, 1996): croce rossa (red cross), croce verde (green cross), croce bianca (white cross), protezione civile (civil protection/defense), vigili fuoco (firefighters), capo scout (scoutmaster). We merged them in what follows. Applying the Semantic Network and the community detection algorithm to these data, we found 7 groups/communities. In Fig. 1 we plot the semantic network along with the communities, in which words are colored according to their community. It is possible to identify a set of "jobs" related to typical charity organizations, mainly in a religious context: the care of old and hospitalized people (ospedal, malat, assistenz, ascolt, accud, cur, sostegn; orange), the education and animation of disadvantaged children, mainly in religious organizations (insegn, parrocc, scuol, orator, cateches, anim; purple), and the food and clothing drives and their distribution to the poor (cibo, vestiar, caritas, raccolt, aliment, mens, pover; green). Another large group is related to the executives and officers of organizations and to cultural-event organizers (organizz, event, cultural, membr, consigl, dirigent, reunion; blue). Related to this large group we found the musicians (black), characterized by suon, band, musical. Finally, the last important area of the network is associated with organized volunteers on the territory (vigilefuoc, protezionecivil, territor, croceross, soccors, ambul; red). The coaches are mixed with this group (squadr, allen, calc, pallavol; brown). All these activities are mainly carried out in non-religious organizations and are not directly related to charity aims. Analyzing the categories and lexical CA in Fig. 2, it is possible to profile individuals according to their demographic status. In this context no real clustering procedure is performed; as in classical Correspondence Analysis, the two spaces, units and variables, are linked, taking into account that words close to a specific category are more likely to occur for people belonging to that category. A clear gender gap emerges: men are related to sports activities, playing music in bands, driving (mainly ambulances) and administration tasks. Women are more involved in providing services to individuals (taking care of children and old people) and in carrying out food and clothing drives for the poor. Geographic differences emerge as well: volunteers from the North-East and North-West describe their activities as manutenzion, dirigent, addett, consigl, showing a higher organizational level. The South and the Islands are more related to a female style of volunteering, with a predisposition for religious organizations, mainly aimed at assistance. Educational level and age also have an impact: the lowest level of education, crossed with age, profiles a group of old and less educated volunteers involved in religious volunteering. The most educated people mainly carry out administrative tasks. The central age group (35-64) shows, on the other hand, an average profile close to the origin of the axes, as do people from Central Italy.

Figure 1: Semantic network: different colors for the different communities identified by the Fast Greedy algorithm. The size of the words and the width of the edges are proportional to the weights.
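The network construction and Fast Greedy community detection described above can be sketched in R with the igraph package; the binary document-term matrix m and all variable names below are illustrative assumptions, not the authors' code.

# Word co-occurrence network and Fast Greedy community detection.
# 'm' is a hypothetical binary document-term matrix (answers x words).
library(igraph)
cooc <- t(m) %*% m  # word-by-word co-occurrence counts
g <- graph_from_adjacency_matrix(cooc, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
comm <- cluster_fast_greedy(g, weights = E(g)$weight)
split(V(g)$name, membership(comm))  # words grouped by community

For the CA-GALT step, the CaGalt() function of the FactoMineR package (Kostov et al., 2015) can then be applied to the lexical table together with the socio-demographic variables.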
4. Discussion and conclusion
As introduced in the first section, the aim of this work is to present a general perspective on volunteer work in Italy, under the assumption that it can be studied in a way analogous to how the labour market is studied in the classic economic literature. Some authors have already shown how it follows the rule of supply and demand under given conditions (e.g., Wolff et al., 1993), and volunteering organizations also make use of marketing strategies similarly to business companies (Dolnicar and Randle, 2007). The two statistical tools presented in the previous section give the empirical analysis different and somewhat complementary hints. The communities in the Semantic Network of Fig. 1 are based on the connection level between words, without taking into account other previously known characteristics of the individuals. The communities thus discovered are groups of words that define several activities, and hence clusters of jobs in specific fields. In the second analysis, the two spaces built by CA-GALT, individuals and categories, stress how segmentation is clearly present in volunteering as in the labour market: the words used (and thus the activities done) change with gender, education, age and macro-area, in a way equivalent to standard jobs. It thus gives an overview of the relationships between words (as descriptions of activities) and categories (socio-demographic variables). Summing up, both analyses highlight how complex and heterogeneous volunteering is; they show that the people involved are in some cases highly skilled, often using competencies trained over their lives. Generally, they are able to describe their activities thoroughly, explaining openly the aims of their voluntary jobs. The Text Mining analysis presented in this work could help to identify needs of the population that are not adequately satisfied, under the assumption that volunteers spend their time and use their skills to give something to individuals who strongly demand it, in a framework similar to the supply-and-demand mechanism. Furthermore, to obtain a more exhaustive overview for future policies, a likely next step would be to look at the other side: another survey should ask people why they seek help from volunteers. This would lead to a better understanding of the real needs of individuals who are not fully satisfied with the assistance they receive, especially from official welfare institutions.

Figure 3: CA-GALT for both terms (blue) and categories (red). By overlapping the two factor maps it is possible to profile clusters of individuals.

References
Amati, F., Musella, M. and Santoro, M. (2015). Per una teoria economica del volontariato (Vol. 1). G. Giappichelli Editore, Torino.
Clauset, A., Newman, M. E., and Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111.
Collins, M. (1996). A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 184-191.
Colombo, A. (2003). Razza, genere, classe. Le tre dimensioni del lavoro domestico in Italia. Polis, 17(2), 317-344.
Dolnicar, S. and Randle, M. (2007). The international volunteering market: Market segments and competitive relations. International Journal of Nonprofit and Voluntary Sector Marketing, 12(4), 350-370.
Drieger, P.
(2013). Semantic network analysis as a method for visual text analytics. Procedia - Social and Behavioral Sciences, 79, 4-17.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3-5), 75-174.
Istat (2013). Indagine Multiscopo sulle famiglie: aspetti della vita quotidiana. Retrieved from http://www.istat.it/it/archivio/91926
Ironmonger, D. (2000). Measuring volunteering in economic terms. In Volunteers and Volunteering, The Federation Press, Sydney, pp. 56-72.
Kostov, B., Bécue-Bertaut, M. and Husson, F. (2015). Correspondence analysis on generalised aggregated lexical tables (CA-GALT) in the FactoMineR package. R Journal, 7(1), 109-117.
Lovins, J. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1-2), 22-31.
Meyer, D., Hornik, K., and Feinerer, I. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1-54.
Salamon, L., Sokolowski, S. and Haddock, M. (2011). Measuring the economic value of volunteer work globally: Concepts, estimates, and a roadmap to the future. Annals of Public and Cooperative Economics, 82(3), 217-252.
van Atteveldt, W. (2008). Semantic Network Analysis: Techniques for Extracting, Representing, and Querying Media Content. BookSurge Publishers, Charleston SC.
Willett, P. (2006). The Porter stemming algorithm: then and now. Program, 40(3), 219-223. doi: https://doi.org/10.1108/00330330610681295
Wilson, J. (2000). Volunteering. Annual Review of Sociology, 26(1), 215-240.
Wolff, N., Weisbrod, B. A., and Bird, E. J. (1993). The supply of volunteer labor: The case of hospitals. Nonprofit Management and Leadership, 4(1), 23-45.
Zamagni, S. (2005). Gratuità e agire economico: il senso del volontariato. Working paper presented at the Aiccon meeting, Bologna.

A longitudinal textual analysis of abstracts presented at the Italian Association for Vocational Guidance and Career Counseling's conferences from 2002 to 2017
S. Santilli, S. Sbalchiero, L. Nota, S. Soresi
University of Padova
sara.santilli@unipd.it, stefano.sbalchiero@unipd.it, laura.nota@unipd.it, salvatore.soresi@unipd.it

Abstract
This new century is characterized by phenomena such as globalization, internationalization, and rapid technological advances that influence people's lives and the ways in which they seek and do their jobs. Changing the shape of organizations changes the shape of careers. To better account for the complexities of work due to the latest socio-economic crisis, the Life Design paradigm, a new paradigm for career theory in the 21st century (Savickas et al., 2009), has recently been developed; it represents the third wave of career theory and practice. The first wave emerged as the psychology of occupations in the first half of the 20th century, to match people to jobs. The second wave comprised the psychology of careers, ascending at mid-20th century to manage work and other life roles across the lifespan. The main aim of the present study was to illustrate the changes in theory, techniques and measures that emerged in Italian vocational guidance and career counseling psychology, through the analysis of the abstracts presented at the Italian Association for Vocational Guidance and Career Counseling's conferences. The corpus was composed of 1,250 abstracts collected from 2002 to 2017.
In order to compare and contrast the main semantic areas over time, a topic analysis by means of Reinert's method (1983) was conducted (using IRaMuTeQ and R software) to detect the clusters of words that characterized the different orientations over time. The results show that career counseling theories and techniques evolved over time to better assist workers in adapting to fluid societies and flexible organizations, and to better help clients design their lives in the 21st century.
Keywords: longitudinal textual analysis, career counseling, vocational psychology

1. Introduction
In Western countries, the economic recession that characterized the years 2008-2009 led to a dramatic loss of jobs throughout the Union's private sector. Furthermore, the fast-moving global economy and phenomena such as globalization, internationalization, and rapid technological advances influence people's lives and the ways in which they seek and do their jobs. The world of work is in general much less clearly defined or predictable, and employees face greater challenges in coping with work transitions (Savickas et al., 2009). Therefore, life in the 21st century requires new models and methods to deal with new issues such as uncertainty, inequalities, poverty, immigration, and precariousness in the labor market, and with their worrying consequences for individual and relational wellbeing. For these reasons, existing traditional career guidance assumptions have been swept away, together with other certainties, by the sudden changes that have taken place in the world of work and in the economic field. To better account for the complexities of work, the Life Design paradigm, a new paradigm for career theory and intervention in the 21st century (Savickas et al., 2009), has been developed. The psychology of life design advances a contextualized epistemology emphasizing human diversity, uniqueness, and purposiveness in work and career, to make a life of personal meaning and social consequence. Rather than matching self to occupation, it reflects a third wave of career theory and practice. The first wave emerged as the psychology of occupations in the first half of the 20th century, to match people to jobs. The second wave comprised the psychology of careers, ascending at mid-20th century to manage work and other life roles across the lifespan. The third wave arose as the psychology of life design, to make meaning through work and relationships. The main aim of the present study was to illustrate the longitudinal changes that emerge in the Italian context regarding the models and theoretical paradigms that drive vocational guidance and career counseling, through the analysis of the abstracts presented at the Italian Association for Vocational Guidance and Career Counseling's conferences. Specifically, we analyzed differences between the abstracts presented before the economic recession (from 2002 to 2008) and during/after the economic recession (from 2009 to 2017) in the topics related to research, theories, and practice. The corpus was composed of 1,250 abstracts collected from 2002 to 2017.

2. Corpus and method
All the abstracts have been collected by the Italian Association for Vocational Guidance and Career Counseling (SIO). SIO represents, at the national and international level, a focal center in which the main scholars and practitioners converge, gather, share and compare the theories and practices of vocational guidance and career counseling.
The abstracts from the first SIO conference (2002) to the latest one (2017) were collected. No abstracts were collected for the years 2003, 2007, 2014 and 2016, because SIO did not organize national conferences in those years. The corpus is composed of 1,250 abstracts. The corpus was pre-processed by means of IRaMuTeQ and R software (Ratinaud, 2009; Sbalchiero and Santilli, 2017). The corpus was normalized by replacing uppercase with lowercase letters, and punctuation, numbers and stop words were removed because they are not significant for analyzing the content of the abstracts. The pre-processing steps were useful to reduce redundancy and to provide homogeneity among forms. The lexicometric measures (Tab. 1) indicate that it is plausible to apply statistical analysis of textual data to the corpus (Lebart et al., 1998). The corpus is composed of 20,932 word-types and 462,034 word-tokens.

Tab. 1: Lexicometric characteristics of the corpus
Number of texts: 1,250
Word-types (V): 20,932
Word-tokens (N): 462,034
Hapax (V1): 8,902
(V/N)*100 = Type/Token ratio: 4.53
(V1/V)*100 = Percentage of hapax: 42.53

Using the Reinert method (Reinert, 1983), we extracted a series of 'lexical worlds'. The texts were divided into elementary content units of similar length; the algorithm then operates on a 'words x units' matrix. The classification of units makes it possible to identify and extract only the parts of texts relating to the same topic, so for each cluster the list of the most significant words, calculated using the chi-square measure, is identified (Reinert, 1993; Sbalchiero and Tuzzi, 2016; Sbalchiero and Santilli, 2017).

3. Results
The analysis conducted by means of Reinert's method detected five different lexical worlds, as the dendrogram shows (Fig. 1). The method identifies the lexical worlds quite well, since 98.42% of the abstracts have been classified and the words in the same semantic area are semantically associated, i.e. they refer to the same issue. Specifically, the first class of the present corpus refers to career counselors' professional knowledge, skills, resources and training. The second class refers to the principal variables and constructs related to vocational guidance and career counseling, such as self-efficacy, personality, coping, intelligence, emotions, satisfaction, and optimism. The third class includes the statistical measures and instruments used in vocational guidance to assess people's career self and personality. The fourth class refers to context variables, to the supports and barriers to inclusion, and to the rights of people with vulnerabilities (people with disabilities, psychologically sidelined people, etc.). The fifth class includes the guidance services, projects and career guidance activities provided by local centers (university, region, province). As already mentioned, differences between the abstracts presented before the economic recession (pre-crisis: from 2002 to 2008) and during/after the economic recession (post-crisis: from 2009 to 2017) were analysed. These two periods in vocational guidance history are distinctive because the stable employment and secure organizations of the pre-crisis period gave way, post-crisis, to a new social arrangement of flexible work and fluid organization, causing people tremendous distress and making it difficult to comprehend careers with theories that emphasize stability rather than mobility. Furthermore, it seemed interesting to analyze whether differences could be found in the theories and techniques presented in the abstracts pre- and post-crisis.
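The lexicometric quantities of Tab. 1 above can be recomputed in a few lines of base R, as in this sketch; abstracts is a hypothetical character vector holding the 1,250 normalized texts.

# Word-types, word-tokens, hapax and the derived ratios of Tab. 1.
# 'abstracts' is a hypothetical character vector of normalized abstracts.
tokens <- unlist(strsplit(tolower(abstracts), "[^[:alpha:]]+"))
tokens <- tokens[nzchar(tokens)]
freq <- table(tokens)
N  <- length(tokens)  # word-tokens
V  <- length(freq)    # word-types
V1 <- sum(freq == 1)  # hapax legomena
c(TTR = 100 * V / N, hapax_pct = 100 * V1 / V)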
To differentiate between papers presented pre- and post-crisis, a specific procedure was used based on the chi-square association of semantic classes (Ratinaud, 2014) over the two periods of time (Fig. 2). The classes related to the pre-crisis period are classes three and five, characterised by the statistical measures and instruments used in vocational guidance to assess people, and by the guidance services, projects and career guidance activities. The post-crisis period is characterized by class four, which refers to context variables, to the supports and barriers to inclusion, and to the rights of people with vulnerabilities.

Fig. 1: Cluster dendrogram and list of the most relevant words for each lexical world (in descending order according to the chi-square value of each class).
Fig. 2: Comparison between pre-crisis and post-crisis papers.

These results highlight that the topics in the pre-crisis abstracts are more oriented towards "people", focusing on assessment and measurement with a statistical background. In the post-crisis period, the attention of counsellors is more oriented toward the "environment" in which people live and the relation between people and their context, so the uniqueness and the vulnerability of people are considered in relation to social and work inclusion. Finally, in order to compare and contrast the main semantic areas over time, the classes were analysed using the chi-square association of semantic classes and their distribution over the years (Fig. 3); a minimal sketch of this kind of chi-square procedure is given below.

Fig. 3: Comparison among classes and their distributions over the years.

In addition to the classes already analyzed in the pre- and post-crisis periods, the comparison among classes and their distributions over the years also highlights class 1 and class 2, which can be considered evergreen in the vocational guidance and career counseling field because they are present throughout almost the entire period considered. Class 1 refers to career counselors' professional knowledge, skills, and competences. Class 2 refers to variables and constructs related to vocational guidance and career counseling, such as self-efficacy, coping, life satisfaction, and positive attitudes.

4. Conclusions and discussion
The aim of the present study was to highlight the changes in theory, techniques and measures that emerged in Italian vocational guidance and career counseling psychology, through the analysis of the abstracts presented at the Italian Association for Vocational Guidance and Career Counseling's conferences. The results show five different lexical-world classes, related to career counselors' professional knowledge; the variables and constructs of vocational guidance and career counseling; the measures and instruments to assess people's career self and personality; the context variables supporting the inclusion of people with vulnerabilities; and the career guidance services and centers. Differences between the abstracts presented before the economic recession (pre-crisis: from 2002 to 2008) and during/after the economic recession (post-crisis: from 2009 to 2017) were also analysed. The results show that career counseling theories and techniques evolved over time to better assist workers in adapting to fluid societies and flexible organizations, and to better help clients design their lives in the 21st century.
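A rough base-R sketch of the chi-square association between classes and periods referred to above follows; cls and period are hypothetical vectors giving, for each classified text unit, its Reinert class and its pre-/post-crisis period.

# Cross-tabulate classes against periods; large positive standardized
# residuals flag the classes over-represented in a given period.
# 'cls' and 'period' are hypothetical vectors of equal length.
tab <- table(cls, period)
test <- chisq.test(tab)
test$stdres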
In fact, while in the abstracts related to the pre-crisis period emphasis is given to all those guidance activities that consider it particularly important to allow the person to collect information about their own characteristics and needs before advancing decision-making hypotheses (measures and instruments for assessment), in the abstracts related to the post-crisis period attention is paid to the "contexts" where people live. Career guidance practices that are limited to the analysis of "attitudes" and "interests" are considered obsolete, while current policies, challenges, socio-economic conditions, and the way in which vulnerability is conceptualised are inputs from the environment which act at various levels and to which scholars should pay attention (Shogren, Luckasson, & Schalock, 2014). The evolution of the social sciences that revolve around guidance is undoubtedly a very complex phenomenon. Career scholars and practitioners should support people's needs taking into account the organizational and environmental context in which they develop and take shape. Currently, career guidance theories and models are numerous and not always named and defined in the same way by the various authors and scholars. For these reasons it is important to analyze and understand the different models developed over time, in order to activate a continuous comparison in the field of career counselors' competences that produces precise trajectories regarding the constructs to develop in people through the programs and activities provided by career services. Indeed, noteworthy is the result highlighting that the classes referring to vocational guidance and career counseling are present throughout the entire period considered. Nevertheless, these are just some of the results, and further analyses will be useful for examining the peculiarities that these specific classes assume during the years considered, in order to identify the specific skills and constructs that characterized different historical periods. It could also be important to compare the results that emerged in the Italian context with those of other European and North American contexts, to generalize the results obtained.

References
Lebart, L., Salem, A., and Berry, L. (1998). Exploring Textual Data. Kluwer Academic Publishers, Dordrecht.
Ratinaud, P. (2014). Visualisation chronologique des analyses ALCESTE : application à Twitter avec l'exemple du hashtag #mariagepourtous. In Actes des 12es Journées internationales d'Analyse statistique des Données Textuelles. Paris Sorbonne Nouvelle-Inalco.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2), 187-198.
Reinert, M. (1993). Les «mondes lexicaux» et leur «logique» à travers l'analyse statistique d'un corpus de récits de cauchemars. Langage & Société, 66, 5-39.
Shogren, K. A., Luckasson, R. and Schalock, R. L. (2014). The definition of "context" and its application in the field of intellectual disability. Journal of Policy and Practice in Intellectual Disabilities, 11(2), 109-116.
Savickas, M. L., Nota, L., Rossier, J., Dauwalder, J. P., Duarte, M. E., Guichard, J., ... and Van Vianen, A. E. (2009). Life designing: A paradigm for career construction in the 21st century. Journal of Vocational Behavior, 75, 239-250.
Sbalchiero, S. and Santilli, S. Some introductory methodological notes. In L. Nota and S. Soresi (Eds.), For a Manifesto in Favor of Inclusion.
Florence: Hogrefe Editore.
Sbalchiero, S. and Tuzzi, A. (2016). Scientists' spirituality in scientists' words. Assessing and enriching the results of a qualitative analysis of in-depth interviews by means of quantitative approaches. Quality & Quantity, 50(3), 1333-1348.

A la poursuite d'Elena Ferrante
Jacques Savoy
Université de Neuchâtel (Suisse) – Jacques.Savoy@unine.ch

Abstract
The objective of an authorship attribution model is to determine, as accurately as possible, the true author of a document, literary excerpt, threatening email, legal testimony, etc. Recently a tetralogy called My Brilliant Friend was published under the pen-name Elena Ferrante, first in Italian and then in translation into several languages. Various names have been suggested as the possible true author (e.g., Milone, Parrella, Prisco, etc.). Based on a corpus of 150 contemporary Italian novels written by 40 authors, two computer-based authorship attribution methods were employed to answer the question "Who is the secret hand behind Elena Ferrante?" To achieve this objective, the nearest-neighbor (k-NN) approach was applied to the 100 to 2,000 most frequent tokens using the Delta model. We conclude that Domenico Starnone is the true author behind Elena Ferrante's pseudonym. As a second approach, using the entire vocabulary, Labbé's model confirms this finding.
Keywords: Authorship attribution, corpus linguistics.

1. Introduction
The publication of L'amica geniale (2011) opens a tetralogy about life in Naples since the 1950s. This series of novels has met with astonishing success, particularly in the United States. However, the stated author, Elena Ferrante, is a pseudonym whose true identity has not been revealed. Scholars and journalists have proposed several names, relying on possible stylistic similarities or on the claim that the author must know post-war Naples, or even be a woman (for example, Erri De Luca, Francesco Piccolo, Michele Prisco, Fabrizia Ramondino, ...). On the basis of the royalties paid, the journalist C. Gatti (Gatti, 2016) claims that Ferrante's pen is held by Anita Raja (wife of the writer Domenico Starnone). No thorough scientific study has addressed this question, but a first sketch indicates that the true author would be Domenico Starnone (Tuzzi et al., 2018).
The identification of the true author of these novels recalls the investigations into the Gary-Ajar relationship in France in the 1970s. In the English-speaking world, the publication of The Cuckoo's Calling (2013) under the signature of R. Galbraith is a similar affair, since the true author was J. K. Rowling (Juola, 2016). The discovery of an unpublished poem likewise raises the question of its true author (Thisted & Efron, 1987; Craig & Kinney, 2009). To lift the veil on Ferrante's exact identity, our study has at its disposal a corpus of 150 contemporary Italian novels. Moreover, we rely on two recognized authorship attribution methods (Juola, 2006) that have been the subject of several studies. Indeed, for evidence to be admissible in court it must possess several characteristics (Chaski, 2013), such as corresponding to best practices in the field, having been tested, and being verifiable and replicable. Finally, we make the hypothesis that the true author behind the Ferrante signature is indeed one of the 39 Italian writers present in our corpus (closed-set attribution).

2. Related work
To determine the identity of a writer, three main paradigms have been proposed (Juola, 2006; Stamatatos, 2009). First, stylometric measures assumed to be invariant for each author have been used, such as the mean sentence length or the ratio of vocabulary size to document size (TTR) (Rexha et al., 2016). Faced with texts of varying sizes, these measures turn out to be unstable (Baayen, 2008). Second, lexical choices make it possible to differentiate authors, both in the selection of words and in their frequency of occurrence ("Le style c'est l'homme", as Buffon said in 1753). To this end, Mosteller & Wallace (1964) propose a semi-automatic selection of the most pertinent word types. Burrows (2002) chooses the most frequent words, in particular the function words (determiners, prepositions, conjunctions, pronouns and auxiliary verbs). The latter have the advantage of being more strongly related to the author's style than to semantics. Such a list comprises between 50 and 1,000 of the most frequent word types (Hoover, 2007), or even the entire vocabulary (Labbé, 2007). Other authors propose defining such a list a priori (Hughes et al., 2012). On this basis, each text is represented by the relative occurrence frequencies of the selected word types. A distance (or similarity) measure then estimates the proximity of two texts, and attribution is usually established according to the nearest-neighbor rule. Third, machine learning models (Stamatatos, 2009) can select the most pertinent features (words, word or letter bigrams, part-of-speech tags, emoticons, etc.). A classifier is then trained to generate the profiles of the candidate authors (Naïve Bayes, logistic regression, SVM, deep learning (Kocher & Savoy, 2017), etc.). Finally, the text of doubtful attribution is represented in the same way, and the name of the most similar profile is returned as the answer.

3. The corpus of contemporary Italian novels
Thanks to the efforts of A. Tuzzi and M. Cortelazzo (University of Padova), the PIC corpus (Padova Italian Corpus) was created in 2017. This collection contains 150 Italian novels covering the period from 1987 to 2016.
As Table 1 indicates, this corpus contains works by 40 authors (including Elena Ferrante with seven texts). During its creation, authors originating from Naples and its region were favored (10 of the 40 names in Table 1), as were women (12, versus 27 men). The corpus contains 9,609,234 tokens, with an average of 64,061 words per work (only one text comprises fewer than 10,000 tokens). The average length of the novels signed by Ferrante amounts to 88,933 words. Finally, editorial control was applied to eliminate non-textual elements (running titles, page numbers, etc.), together with a spelling check. The corpus therefore contains texts from the same period and language, of the same literary genre, and of verified quality. On 7 September 2017, a workshop bringing together seven research teams was held at the University of Padova, during which the name of Domenico Starnone was unanimously identified as the author behind the works of Elena Ferrante. To reach this conclusion, our approach relies on the following techniques.

Table 1: Writers included in the corpus, with gender (M/F) and number of novels
Affinati (M, 2); Ammaniti (M, 4); Bajani (M, 3); Balzano (M, 2); Baricco (M, 4); Benni (M, 3); Brizzi (M, 3); Carofiglio (M, 9); Covacich (M, 2); De Luca (M, 4); De Silva (M, 5); Faletti (M, 5); Ferrante (?, 7); Fois (M, 3); Giordano (M, 3); Lagioia (M, 3); Maraini (F, 5); Mazzantini (F, 4); Mazzucco (F, 5); Milone (F, 2); Montesano (M, 2); Morazzoni (F, 2); Murgia (F, 5); Nesi (M, 3); Nori (M, 3); Parrella (F, 2); Piccolo (M, 7); Pincio (M, 3); Prisco (M, 2); Raimo (M, 2); Ramondino (F, 2); Rea (M, 3); Scarpa (M, 4); Sereni (F, 6); Starnone (M, 10); Tamaro (F, 5); Valerio (F, 3); Vasta (M, 2); Veronesi (M, 4); Vinci (F, 2).

4. Identifying the author behind the Elena Ferrante signature
Our study begins by applying the Delta model (Burrows, 2002), in which the selection of stylistic features corresponds to the k most frequent word types. However, no precise limit for the parameter k is given, and previous work (Savoy, 2015) underlines that values between 200 and 500 tend to bring the best performance. Once this limit is fixed, the Delta method estimates a Z score for each word type $t_i$ based on its relative frequency (denoted $rtf_{ij}$ for term $t_i$ in document $D_j$), as indicated by Equation 1, where $mean_i$ denotes the mean frequency of the word type and $s_i$ its standard deviation:

$Z(t_i, D_j) = \frac{rtf_{ij} - mean_i}{s_i}$   (1)

For each author, all of his or her texts are concatenated to generate the author profile $A_j$. Finally, the distance between the representation of the text to be attributed (denoted Q) and each author profile $A_j$ is computed as the mean absolute difference of the Z scores over the k selected word types (Equation 2):

$\Delta(Q, A_j) = \frac{1}{k} \sum_{i=1}^{k} \left| Z(t_i, Q) - Z(t_i, A_j) \right|$   (2)

The candidate authors can then be sorted, the smallest distance signaling the most probable author. Table 2 gives the first three authors for k = 200, 300 and 500. In the last column (Stopword), the selected word types correspond only to the function words of Italian (k = 307). Table 2 informs us about the attribution of the novel L'amica geniale (2011). Considering the six other works, the same name appears in first place. Likewise, if the number of word types is set to 50, 100, 150, 250, 400, 1,000, 1,500 or 2,000, we always find Starnone in first place, and this for all of Ferrante's works.
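A minimal R sketch of Equations 1 and 2 follows; the matrix rtf of relative frequencies over the k most frequent word types (one row per text or author profile) and the row name "q" for the disputed text are illustrative assumptions, not the study's code.

# Burrows' Delta: z-score each word type column-wise (Eq. 1), then
# average the absolute z-score differences with each author profile (Eq. 2).
# 'rtf' (texts x word types) and the row name "q" are hypothetical.
z <- scale(rtf)  # column-wise standardization (mean 0, sd 1)
delta <- apply(z[rownames(z) != "q", , drop = FALSE], 1,
               function(a) mean(abs(z["q", ] - a)))
sort(delta)[1:3]  # the three most probable authors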
A finer analysis of the distances in Table 2 indicates that the difference (in percentage) between the distances at the first and second ranks is clearly greater than that between the second and third ranks. Thus, for k = 200, the difference between 0.524 and 0.686 amounts to 30.9%, while that between 0.686 and 0.700 is only 2.0%. The first proposed name clearly stands out from the others. In a second series of experiments, we grouped all the novels attributed to Elena Ferrante into a single text (or profile). Varying the number of word types over 50, 100, 150, 200, 250, 300, 400, 500, 1,000, 1,500 and 2,000, Starnone always appears in first place among the authors with the strongest similarity to the Elena Ferrante profile.

Table 2: Ranked lists of the most probable authors for L'amica geniale (Delta method)
Rank   k = 200          k = 300          k = 500          Stopword
1      0.524 Starnone   0.515 Starnone   0.505 Starnone   0.421 Starnone
2      0.686 Veronesi   0.684 Brizzi     0.686 Veronesi   0.640 Milone
3      0.700 Balzano    0.719 Veronesi   0.710 Brizzi     0.660 Veronesi

As a second authorship attribution model, Labbé's (2007) approach suggests using the entire vocabulary. In this case, the distance between two texts A and B (denoted D(A,B) in Equation 3) depends on the absolute frequencies of the word types in the two texts (denoted $tf_{iA}$ and $tf_{iB}$, with i = 1, 2, ..., k). The variable $n_A$ (or $n_B$) indicates the length of text A (in number of tokens). As the two texts are not of identical size, the frequencies of the longer one (B in Equation 3) are multiplied by the ratio of the sizes:

$D(A,B) = \frac{1}{2\, n_A} \sum_{i=1}^{k} \left| tf_{iA} - tf_{iB} \cdot \frac{n_A}{n_B} \right|$, with $n_A \le n_B$   (3)

The values D(A,B) lie between 0 (the same words with identical frequencies) and 1 (no word in common). Applying this method, a distance is computed between each pair of novels, and the pairs are sorted from the smallest to the largest distance. The PIC corpus generates (150 x 149) / 2 = 11,175 pairs; an excerpt is given in Table 3. In this table, first place corresponds to the two most similar works, in our case two novels written by Ferrante, namely Storia di chi fugge e di chi resta (Id: 51, 2013) and Storia della bambina perduta (Id: 52, 2014). The two other novels of the tetralogy follow, from the second to the fourth rank, with Storia del nuovo cognome (Id: 50, 2012) and L'amica geniale (Id: 49, 2011). In fifth position we find two works by Faletti, Niente di vero tranne gli occhi (Id: 42, 2004) and Io sono Dio (Id: 44, 2009), then two novels by Veronesi (Id: 145, Caos calmo (2009) and Id: 147, Terre rare (2014)). At small distances, the pairings occur between works written by the same author within a fairly short time interval.

Table 3: Ranked list of the most similar novels (Labbé method)
Rank   Distance   Work 1           Work 2
1      0.140      51 Ferrante      52 Ferrante
2      0.148      50 Ferrante      51 Ferrante
3      0.155      49 Ferrante      50 Ferrante
4      0.157      50 Ferrante      52 Ferrante
5      0.165      42 Faletti       44 Faletti
6      0.166      145 Veronesi     147 Veronesi
...    ...        ...              ...
43     0.228      47 Ferrante      127 Starnone
...    ...        ...              ...
63     0.241      108 Raimo        147 Veronesi

As the distance increases, the probability that the two linked works are by the same author decreases.
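Equation 3 translates into a short R function, sketched here on hypothetical aligned absolute-frequency vectors tfA and tfB defined over the union of the two vocabularies.

# Labbé's intertextual distance (Eq. 3): rescale the longer text's
# frequencies by the size ratio, then sum the absolute differences.
# 'tfA' and 'tfB' are hypothetical aligned frequency vectors.
labbe <- function(tfA, tfB) {
  if (sum(tfA) > sum(tfB)) { tmp <- tfA; tfA <- tfB; tfB <- tmp }
  nA <- sum(tfA); nB <- sum(tfB)
  sum(abs(tfA - tfB * nA / nB)) / (2 * nA)
}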
The first apparently incorrect link occurs at rank 43, with a work by Ferrante (Id: 47, I giorni dell'abbandono (2002)) paired with one by Starnone (Id: 127, Eccesso di zelo (1993)). A pairing between these two authors also appears at ranks 44, 53 and 54, before another type of error is discovered at position 63, linking a novel written by Raimo (Id: 108, Il peso della grazia (2012)) and one by Veronesi (Id: 147, Terre rare (2014)). Further pairings between Ferrante and Starnone then appear at ranks 65, 69, 71, 72, 73 and 74: in total, ten pairs between these two authors and only a single one involving other writers. Knowing that Ferrante is a pseudonym, the strong stylistic similarity with Starnone makes the latter a first-rate candidate.

5. Analysis
Lexical choices are not the result of chance, and each author has preferences that are detectable by stylistic measures. The closeness between Ferrante and Starnone can also be explained by analyzing a few examples. In our corpus, Ferrante's seven novels correspond to 6.5% of its size, while 6.4% consists of Starnone's ten works. If the occurrence frequencies of certain words deviate from these proportions in the same direction for both authors, we can bring their styles together. The noun padre (father) appears 9,815 times in the PIC corpus: 833 occurrences (8.5% of the total) in Ferrante's works and 1,170 (11.9%) in Starnone's. This word is clearly used more frequently by these "two" authors. Similarly, the word madre (mother) has a frequency of 8,246 in the corpus, with 1,104 occurrences (13.4%) under Ferrante's pen and 762 (9.2%) under Starnone's. Other function words show similar distributions. Thus the word persino (even) appears 1,351 times in the PIC collection, 266 of them (19.7%) in Ferrante and 205 (15.2%) in Starnone. Note that this term can also be written perfino (with an occurrence frequency of 20 in Ferrante and 18 in Starnone). For Ferrante and Starnone we see a preference for one of the forms, while other authors use only one of the two spellings (Baricco: only perfino; Tamaro: only persino). Finally, some writers ignore both words (Covacich, Parrella) or use them very rarely (De Luca or Balzano). As complementary examples, certain words are used only by Ferrante and Starnone, such as risatella (little laugh; 16 occurrences in Ferrante, 4 in Starnone) or contraddittoriamente (contradictorily; Ferrante: 6, Starnone: 9). For an Italian writer, the lexicon may include forms drawn from dialect, such as the Neapolitan term strunz (stronzo in standard Italian). This term appears 85 times in the corpus, with 63 occurrences in Starnone's novels and 18 in Ferrante's (and twice in De Silva and Raimo). Some word n-grams also turn out to be more frequent in Ferrante and Starnone, such as no essere che (not to be that), which appears 23 times (100%) in the corpus, 6 of them (26.1%) under Ferrante's pen and 7 (30.4%) under Starnone's. Together these two authors account for more than 56% of the occurrences of this sequence.

6. Conclusion
This study relies, on the one hand, on two recognized authorship attribution methods and, on the other, on a corpus of 150 contemporary novels written by 40 authors.
As stylistic features, we retained the 100, 150, 200, 250, 300, 400, 500, 1,000, 1,500 and 2,000 most frequent words for the Delta method (Burrows, 2002). With these different values, the first name returned as the probable author is always Domenico Starnone, and this for the seven novels published under the name Ferrante. Relying on the entire vocabulary and Labbé's (2007) method, the same conclusion is reached. Analyzing a few lexical choices reveals close relations between Starnone and Ferrante. For example, the word persino is over-used in the novels of both authors, while the second form perfino appears only rarely; in other writers, one usually finds a preference for one of the two terms or the absence of both. Finally, following the workshop held in Padova on 7 September 2017, which designated Domenico Starnone as the writer behind the Ferrante signature, Starnone denied being the true author (Fontana, 2017).

Acknowledgements
This research was made possible thanks to A. Tuzzi and M. Cortelazzo, who provided us with the PIC corpus.

References
Baayen, H.R. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press, Cambridge.
Burrows, J.F. (2002). Delta: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267-287.
Chaski, C. (2013). Best practices and admissibility of forensic author identification. Journal of Law and Policy, 21(2), 333-376.
Craig, H., and Kinney, A.F. (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press, Cambridge.
Fontana, E. (2017). Lo scrittore Domenico Starnone: "Io non sono Elena Ferrante". Il Giornale, 9 September.
Gatti, C. (2016). La véritable identité d'Elena Ferrante révélée. BibliObs, 2 October 2016.
Hoover, D.L. (2007). Corpus stylistics, and the styles of Henry James. Style, 41(2), 160-189.
Hughes, J.M., Foti, N.J., Krakauer, D.C., and Rockmore, D.N. (2012). Quantitative patterns of stylistic influence in the evolution of literature. Proceedings of the National Academy of Sciences, 109(20), 7682-7686.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233-334.
Juola, P. (2016). The Rowling case: A proposed standard analytic protocol for authorship questions. Digital Scholarship in the Humanities, 30(1), i100-i113.
Kocher, M., and Savoy, J. (2017). Distributed language representation for authorship attribution. Digital Scholarship in the Humanities, to appear.
Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14(1), 33-80.
Mosteller, F., and Wallace, D.L. (1964). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Addison-Wesley, Reading.
Rexha, A., Klampfl, S., Kröll, M., and Kern, R. (2016). Towards a more fine grained analysis of scientific authorship. In Proceedings ECIR 2016, pp. 26-31.
Savoy, J. (2015). Comparative evaluation of term selection functions for authorship attribution. Digital Scholarship in the Humanities, 30(2), 246-261.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.
Tuzzi, A., and Cortelazzo, M. (2018). What is Elena Ferrante? A comparative analysis of a secretive bestselling Italian writer. Digital Scholarship in the Humanities, to appear.
Regroupement d'auteurs dans la littérature du XIXe siècle
Jacques Savoy
Université de Neuchâtel (Suisse) – Jacques.Savoy@unine.ch

Abstract
This paper presents the author clustering problem, in which a set of n texts written by several distinct authors must be regrouped into k clusters, each corresponding to a single author. The proposed model can use different distance measures and feature sets (composed of the most frequent word types). The evaluation is based on a French corpus composed of 200 excerpts of novels written during the 19th century. Varying the parameter settings, the evaluation indicates that better performance is achieved with words than with letter n-grams. The cosine distance achieves lower performance levels compared to the Tanimoto (L1) or Matusita (L2) functions. Text size plays an important role in the effectiveness of the solution: a size of 10,000 tokens produces significantly better results than sizes from 5,000 down to 500 tokens. A more detailed analysis provides explanations for the stylistic aspects of some authors.
Keywords: Automatic classification, unsupervised machine learning, authorship attribution.

1. Introduction
The authorship attribution problem (Juola, 2006) is attracting growing interest with the multiplication of electronic channels. The presence of anonymous or pseudonymous messages raises numerous forensic challenges (Olsson, 2008; Chaski, 2013), for example slanderous chats or threatening emails. Yet more classical questions also deserve attention, such as determining the true identity of the novelist Elena Ferrante (Gatti, 2016) or the relations between Shakespeare and his co-authors (Michell, 1996; Craig & Kinney, 2009). In this context, this paper presents the problems linked to author clustering, with an application to nineteenth-century French literature. The problem can be summarized as follows: given a set of n novel excerpts, one must regroup them into k disjoint classes, each containing all the writings of the same author. This problem was posed during the CLEF-PAN 2016 and 2017 evaluation campaigns (Stamatatos et al., 2016), but the test collections were not made public. A major difficulty of this problem is the absence of training data.

2. Related work
To identify the author of a text, three families of approaches have been proposed (Juola, 2006).
En premier lieu, des mesures stylométriques supposées invariantes ont été évoquées, comme la longueur moyenne des phrases ou la taille du vocabulaire par rapport à la longueur du document (rapport TTR) (Rexha et al., 2016). Toutes ces mesures possèdent l'inconvénient d'être instables face à des textes de tailles différentes (Baayen, 2008). Une deuxième famille d'approches se fonde sur le vocabulaire. Mosteller & Wallace (1964) proposent de sélectionner de manière semi-automatique les vocables les plus pertinents. Burrows (2002) sélectionne les mots les plus fréquents et, en particulier, les mots fonctionnels (déterminants, prépositions, conjonctions, pronoms et verbes auxiliaires). Ces derniers possèdent l'avantage d'être plus fortement reliés au style de l'auteur qu'à la sémantique. Cette liste comprendra entre 50 et 1 000 vocables les plus fréquents (Hoover, 2007). D'autres auteurs proposent de définir a priori une telle liste (Hughes et al., 2012). Ainsi, chaque texte peut être représenté par les fréquences d'occurrence de ces vocables. Ensuite, une mesure de distance (ou de similarité) permet d'estimer la proximité de deux textes. L'attribution s'établit habituellement selon la règle du plus proche voisin. Troisièmement, des modèles d'apprentissage automatique (Stamatatos, 2009) permettent de sélectionner les attributs (mots, bigrammes de mots ou de lettres, POS, émoticônes, etc.) possédant le meilleur pouvoir discriminant. Ensuite, un classifieur est entraîné sur un ensemble d'apprentissage (SVM, régression logistique, etc.). Cependant, dans le cadre du regroupement d'auteurs, aucune donnée d'entraînement n'est disponible, rendant caduques de telles approches. Dès lors, pour résoudre ce problème, des approches proposent de déterminer en premier lieu le nombre k d'auteurs sur l'ensemble des n écrits (Stamatatos et al., 2016). Cette valeur fixée, on applique un algorithme de classification k-means afin d'identifier les différents groupes de textes. Par itération, le nombre k d'auteurs peut être affiné. Comme second paradigme, la distance entre chaque paire d'écrits est calculée, puis on applique un algorithme de classification hiérarchique (Lebart et al., 1998) pour former les grappes de documents. Dans cette étude, nous suivrons cette seconde stratégie de résolution, choix qui nous a permis d'obtenir le deuxième rang lors de la dernière campagne d'évaluation PAN-CLEF 2016.

3. Corpus de test et méthodologie d'évaluation
L'évaluation empirique tient une place importante en attribution d'auteur. Comme les corpus des campagnes PAN-CLEF 2016 et 2017 n'ont pas été rendus publics, nos évaluations seront basées sur une collection extraite de la littérature française du XIXe siècle. Ce corpus, nommé St-Jean1, contient 200 extraits de romans écrits par 30 auteurs (entre 1801 (Chateaubriand, Atala) et 1901 (Régnier, Les Rencontres de Monsieur de Bréot)). Ce nombre d'écrivains et de textes étant élevé, la tâche demeure ardue. Chaque auteur est représenté par au moins trois extraits (avec un maximum de treize pour Balzac) provenant d'un à six romans, et aucun écrivain ne représente plus de 5 % du corpus. Chaque extrait contient en moyenne 10 073 formes (min : 10 026 ; max : 10 230 ; écart-type : 25). Au total, ce corpus contient 2 014 641 formes pour 51 661 vocables extraits de 67 romans. Disposant de n textes, notre approche produira une liste ordonnée de liens entre textes avec une indication de la distance entre eux. Un exemple est présenté dans le tableau 1.
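À titre d'illustration, la production d'une telle liste ordonnée de liens peut s'esquisser comme suit. Il s'agit d'une esquisse minimale de notre part, et non du code utilisé pour les expériences : nous supposons des textes déjà tokenisés et reprenons, à titre d'exemple, la distance de Manhattan (norme L1) sur les fréquences relatives des m vocables les plus fréquents.

```python
from collections import Counter
from itertools import combinations

def liens_ordonnes(textes, m=300):
    """textes : dict {identifiant: liste de tokens}.
    Retourne les paires de textes triées par distance croissante,
    comme dans le tableau 1."""
    # vocables les plus fréquents, calculés sur l'ensemble du corpus
    compte_global = Counter(tok for toks in textes.values() for tok in toks)
    vocables = [v for v, _ in compte_global.most_common(m)]
    # chaque texte est représenté par les fréquences relatives de ces vocables
    profils = {}
    for ident, toks in textes.items():
        n, c = len(toks), Counter(toks)
        profils[ident] = [c[v] / n for v in vocables]
    # distance de Manhattan (L1) entre chaque paire de textes
    liens = [(sum(abs(x - y) for x, y in zip(pa, pb)), a, b)
             for (a, pa), (b, pb) in combinations(profils.items(), 2)]
    return sorted(liens)
```

La classification hiérarchique s'opère ensuite à partir de cette matrice de distances.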
Avec ce corpus, la solution se compose de 30 groupes requérant la présence de 670 liens intra-auteurs. Comme mesure d'évaluation, nous reprenons la précision moyenne (AP) (la moyenne des précisions obtenues pour chaque lien pertinent), mesure usitée lors des campagnes PAN-CLEF 2016 et 2017. Ainsi, une valeur unique de performance reflète la qualité de chaque modèle de classification. Comme seconde mesure, la valeur HP (haute précision) indique le nombre de liens correctement établis depuis le début de la liste jusqu'à la présence du premier lien erroné. Dans notre tableau 1, la valeur HP = 168 signale que les 168 premiers liens sont justes.

1 Ce corpus a été créé par D. Labbé et est disponible (www.unine.ch/clc/home/corpus.html) soit sous la forme de textes, soit lemmatisé. Les encodages UTF-8 et Windows sont disponibles.

Tableau 1 : Exemple d'un extrait d'une liste ordonnée selon la distance (Tanimoto)

  Rang   Distance   Texte 1          Texte 2
  1      0,239      51 Flaubert      62 Flaubert
  2      0,242      3 Flaubert       20 Flaubert
  3      0,248      29 Sand          115 Sand
  4      0,248      122 Staël        140 Staël
  5      0,253      125 Fromentin    159 Fromentin
  6      0,255      37 Flaubert      62 Flaubert
  7      0,256      132 Régnier      162 Régnier
  ...    ...        ...              ...
  169    0,324      42 Maupassant    51 Flaubert

4. Sélection des attributs et mesure de distance
Afin de regrouper les documents selon leur auteur, nous devons les représenter en fonction de leur style et non en fonction des thèmes qu'ils abordent. Comme mentionné précédemment, plusieurs études ont démontré que les vocables les plus fréquents constituent des attributs pertinents pour détecter le style d'un auteur. Dans le cadre de l'attribution d'auteur, le thème pourrait perturber des affectations correctes lorsque, par exemple, deux auteurs abordent des sujets similaires. Pour cerner les aspects stylistiques, une étude récente a démontré que tenir compte des 200 à 300 mots les plus fréquents (Savoy, 2015) apporte de bonnes performances comparées à d'autres fonctions de sélection (rapport des cotes, gain d'information, chi-carré, etc.). Sur la base du corpus St-Jean, les mots les plus fréquents sont : de (4,11 % des occurrences), et (2,44 %), la (2,36 %), le (1,94 %) et à (1,9 %). Comme alternative, plusieurs études proposent de recourir aux fréquences des lettres et des bigrammes de lettres et, plus généralement, des n-grammes, afin de distinguer les différents styles (Kjell, 1994), (Juola, 2006). On remarquera toutefois que les composantes stylistiques et thématiques seront toutes les deux présentes dans la génération de tels n-grammes. Dans cette étude, la distinction entre majuscules et minuscules est ignorée et les signes de ponctuation sont éliminés. Par contre, on tiendra compte du fait qu'une lettre débute ou termine un mot. Le nombre maximal d'attributs s'élève à (27 x 27) + 27 = 756. Pour la langue française, on retrouve 594 (ou 78,6 %) des combinaisons possibles dans notre corpus. Les lettres françaises les plus fréquentes sont : e (15,6 % des lettres), s (8,3 %), a (8,3 %), i (7,5 %) et t (7,2 %). En indiquant l'espace par _, les bigrammes de lettres les plus usuels sont : e_ (5,1 % des bigrammes), s_ (3,5 %), t_ (2,7 %), _d (2,4 %) et _l (1,8 %). Dès que chaque document est représenté par m mots (ou n-grammes de lettres), on peut calculer sa distance avec les autres entités du corpus. Le choix de cette fonction de distance (ou de similarité) peut s'opérer selon des critères théoriques (par exemple, symétrie, inégalité triangulaire) ou empiriques (efficacité).
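À titre d'illustration, voici une esquisse des quatre mesures de distance comparées ci-après, appliquées à deux vecteurs de fréquences relatives. Les formules exactes figurent dans (Kocher & Savoy, 2017) ; nous reprenons ici des formes usuelles — en particulier, pour Tanimoto, une variante L1 pour vecteurs positifs — qui relèvent de notre hypothèse et non du texte original.

```python
import numpy as np

def manhattan(a, b):   # norme L1
    return np.abs(a - b).sum()

def tanimoto(a, b):    # variante L1 usuelle pour vecteurs positifs (hypothèse)
    return np.abs(a - b).sum() / np.maximum(a, b).sum()

def matusita(a, b):    # norme L2 appliquée aux racines carrées des fréquences
    return np.sqrt(((np.sqrt(a) - np.sqrt(b)) ** 2).sum())

def cosinus(a, b):     # 1 - similarité cosinus
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
```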
Basée sur le profilage d'auteur, une étude récente (Kocher & Savoy, 2017) indique qu'aucune mesure de distance ne s'avère toujours la meilleure. Par contre, un groupe restreint permet d'obtenir de bonnes performances, comme la distance de Manhattan ou de Tanimoto, basées sur la norme L1, ou celle de Matusita (norme L2). Nous avons repris ces mesures en y ajoutant la distance du cosinus. Ces quatre mesures respectent la symétrie et l'inégalité triangulaire (Kocher & Savoy, 2017). Dans la définition de ces mesures de distance, les lettres majuscules indiquent les vecteurs représentant les documents. Les minuscules (ai, bi) correspondent aux fréquences relatives des termes sélectionnés.

5. Évaluation
Notre première évaluation concerne l'efficience des différentes mesures de distance ainsi que la performance selon le nombre de vocables les plus fréquents retenus comme attributs. Le tableau 2 indique les valeurs de précision moyenne (AP) et de haute précision (HP) en représentant les textes par les 100 à 1 000 vocables les plus fréquents, ou par tout le vocabulaire. La dernière ligne et la dernière colonne renseignent sur la moyenne des AP.

Tableau 2 : Précision moyenne (AP) et haute précision (HP) selon diverses mesures de distance avec des représentations construites entre 100 vocables et tout le vocabulaire

             Manhattan     Tanimoto      Matusita      Cosinus       Moyenne
  Attributs  AP     HP     AP     HP     AP     HP     AP     HP     AP
  100        0,674  185    0,695  192    0,655  181    0,626  152    0,663
  200        0,692  186    0,708  193    0,687  222    0,628  145    0,679
  300        0,705  190    0,720  196    0,727  244    0,629  148    0,695
  500        0,720  186    0,735  189    0,750  212    0,627  149    0,708
  1 000      0,730  183    0,743  186    0,745  204    0,617  142    0,709
  Tout       0,713  166    0,672  168    0,568  135    0,599  142    0,691
  Moyenne    0,706  183    0,712  187    0,689  200    0,621  146    0,681

Ces résultats indiquent que les différences de précision moyenne restent faibles entre les mesures de Manhattan, Tanimoto et Matusita. Toutes les trois s'avèrent supérieures au cosinus. En considérant la haute précision (HP), Matusita tend à apporter une meilleure efficacité. Reste à déterminer a priori cette valeur maximale, sans connaître les attributions correctes. Enfin, une représentation par les 300 à 500, voire 1 000 vocables les plus fréquents fournit les meilleurs taux de succès. En remplaçant les vocables par des n-grammes de lettres (performances indiquées dans le tableau 3), les valeurs de performance s'avèrent inférieures à celles des vocables. La variation des taux de succès entre une combinaison d'uni- et de bigrammes de lettres (deuxième ligne du tableau 3) et des séquences plus longues s'avère peu élevée. Par contre, les temps de traitement s'accroissent rapidement (8,2 minutes pour les uni- et bigrammes, plus de 4 heures pour les 5-grammes, comparées à 3 minutes avec les 500 mots les plus fréquents). Enfin, la fonction cosinus retourne les performances les moins bonnes. Nos premières évaluations se fondaient sur l'ensemble du texte disponible, soit environ 10 000 mots. Si l'on réduit cette taille à 5 000, voire à 500 mots, les taux de réussite obtenus sont indiqués dans le tableau 4. La première ligne est reprise du tableau 2, puis les tailles décroissent comme le signale la première colonne. La réduction moyenne des performances est reprise dans la dernière colonne. Ainsi, en réduisant les textes à 5 000 mots, la baisse moyenne s'élève à 25,8 %. Si l'on doit œuvrer avec des longueurs de 1 000 à 500 mots, les taux de réussite s'avèrent faibles, générant une réduction de 80 à 90 %. Est-il vraiment raisonnable d'effectuer des attributions d'auteur avec de telles tailles ?
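L'extraction des bigrammes de lettres avec marquage des frontières de mots, décrite à la section 4, peut s'esquisser comme suit ; le tableau 3 ci-après en présente les performances. Cette esquisse relève de notre interprétation : nous supposons un repli naïf des lettres accentuées sur les 26 lettres de base, conformément au décompte de 27 symboles (espace compris) donné plus haut.

```python
import re
import unicodedata
from collections import Counter

def bigrammes_lettres(texte):
    """Fréquences relatives des bigrammes de lettres ; _ marque le début
    et la fin de chaque mot, la ponctuation est éliminée."""
    # minuscules et suppression des accents (26 lettres + frontière = 27 symboles)
    t = unicodedata.normalize("NFD", texte.lower())
    t = "".join(c for c in t if not unicodedata.combining(c))
    bigrams = Counter()
    for mot in re.findall(r"[a-z]+", t):
        forme = "_" + mot + "_"
        bigrams.update(forme[i:i + 2] for i in range(len(forme) - 1))
    total = sum(bigrams.values())
    return {bg: n / total for bg, n in bigrams.items()}
```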
Tableau 3 : AP et HP selon diverses mesures de distance avec des n-grammes de lettres

             Manhattan     Tanimoto      Matusita      Cosinus       Moyenne
  n-grams    AP     HP     AP     HP     AP     HP     AP     HP     AP
  uni & bi   0,559  139    0,559  139    0,503  128    0,538  94     0,540
  3-gram     0,527  108    0,527  108    0,471  130    0,476  108    0,500
  4-gram     0,570  153    0,570  153    0,507  147    0,481  112    0,532
  5-gram     0,587  177    0,587  177    0,541  181    0,543  73     0,565
  6-gram     0,588  200    0,588  200    0,557  188    0,415  36     0,588
  Moyenne    0,566  155    0,566  155    0,506  147    0,510  97     0,545

Tableau 4 : AP et HP selon diverses mesures de distance avec des textes de tailles différentes (représentation sur la base de 300 vocables)

             Manhattan     Tanimoto      Matusita      Cosinus       Moyenne   Différence
  Taille     AP     HP     AP     HP     AP     HP     AP     HP     AP
  10 000     0,705  190    0,720  196    0,727  244    0,629  148    0,695
  5 000      0,526  55     0,545  58     0,526  85     0,466  74     0,516     -25,8 %
  2 500      0,326  31     0,342  39     0,306  35     0,284  11     0,315     -54,8 %
  1 000      0,152  4      0,152  2      0,116  1      0,141  3      0,140     -79,8 %
  500        0,093  2      0,089  2      0,079  3      0,086  2      0,087     -87,5 %

En analysant la liste triée obtenue avec la fonction Matusita et en représentant les textes par les 300 vocables les plus fréquents, les distances les plus faibles se retrouvent entre des extraits de la même œuvre. La distance la plus faible se trouve avec le roman Les Rencontres de Mr de Bréot (1901) de Régnier ; on trouve ensuite Bouvard et Pécuchet (1881) de G. Flaubert, Delphine (1803) de Mme de Staël, Mme Bovary (1857) de G. Flaubert et La Petite Fadette (1832) de G. Sand. Si l'on analyse les appariements les plus difficiles entre deux œuvres du même auteur, les romans Graziella (1852) et Geneviève (1863) de A. de Lamartine constituent le lien le plus distant. Ensuite, on rencontre La double Maîtresse (1900) de H. de Régnier, Aurélia (1855) et Les Illuminés (1852) de G. de Nerval, puis Le père Goriot (1833) et La Maison Nucingen (1838) de H. de Balzac. Ces auteurs peuvent adopter des styles assez dissemblables, rendant une attribution plus ardue. Parmi les œuvres dont le style est perçu comme proche par la machine mais qui sont écrites par deux auteurs distincts, on trouve en tête Bel-Ami (Maupassant, 1885) et Mme Bovary (Flaubert, 1857), puis Volupté (Sainte-Beuve, 1834) et Dominique (Fromentin, 1862), Notre Cœur (Maupassant, 1890) et Mme Bovary (Flaubert, 1857), et enfin L'Assommoir (Zola, 1879) et Mme Bovary (Flaubert, 1857).

6. Conclusion
Parmi les fonctions de distance, notre étude indique que le cosinus n'apporte pas de bons résultats. Par contre, les différences de performance entre les fonctions Manhattan, Tanimoto et Matusita demeurent faibles. Afin de cerner une partie importante du style des auteurs, le recours à une représentation sur la base de vocables s'avère plus efficient que le recours aux n-grammes de lettres (pour n variant de 1 à 6). Représenter le style avec les 300 à 500 vocables les plus fréquents s'avère pertinent. Lorsque l'on compare la précision moyenne (AP) et la haute précision (HP), le choix des paramètres optimaux diffère quelque peu d'une mesure de performance à l'autre. Notons que l'AP ne punit pas sévèrement les erreurs d'affectation, erreurs qui entraînent immédiatement une baisse de la valeur HP. Enfin, la taille des textes joue un rôle essentiel dans une attribution d'auteur, et des valeurs inférieures à 1 000 mots ne permettent que des affectations souvent douteuses. Parmi les auteurs retenus, le style du roman Mme Bovary se rapproche de celui de Maupassant (Bel-Ami) ou de Zola (L'Assommoir).

Remerciements
L'auteur remercie D. Labbé pour avoir mis à sa disposition le corpus St-Jean.
Références
Baayen, H.R. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press, Cambridge.
Burrows, J.F. (2002). Delta: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.
Chaski, C. (2013). Best practices and admissibility of forensic author identification. Journal of Law and Policy, 21(2):333-376.
Craig, H., & Kinney, A.F. (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press, Cambridge.
Gatti, C. (2016). La véritable identité d'Elena Ferrante révélée. BibliObs, 2 octobre 2016.
Hoover, D.L. (2007). Corpus stylistics, and the styles of Henry James. Style, 41(2):160-189.
Hughes, J.M., Foti, N.J., Krakauer, D.C., & Rockmore, D.N. (2012). Quantitative patterns of stylistic influence in the evolution of literature. Proceedings of the National Academy of Sciences (PNAS), 109(20):7682-7686.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334.
Kjell, B. (1994). Authorship determination using letter pair frequency features with neural network classifier. Literary and Linguistic Computing, 9(2):119-124.
Kocher, M., & Savoy, J. (2017). Distance measures in author profiling. Information Processing & Management, 53(5):1103-1119.
Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14(1):33-80.
Lebart, L., Salem, A., & Berry, L. (1998). Exploring Textual Data. Kluwer, Dordrecht.
Michell, J. (1996). Who Wrote Shakespeare? Thames and Hudson, New York (NY).
Mosteller, F., & Wallace, D.L. (1964). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Addison-Wesley, Reading.
Muller, C. (1992). Principes et méthodes de statistique lexicale. Honoré Champion, Paris.
Olsson, J. (2008). Forensic Linguistics. Continuum, London.
Rexha, A., Klampfl, S., Kröll, M., & Kern, R. (2016). Towards a more fine grained analysis of scientific authorship. Proceedings ECIR 2016, pp. 26-31.
Savoy, J. (2015). Comparative evaluation of term selection functions for authorship attribution. Digital Scholarship in the Humanities, 30(2):246-261.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.
Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., & Potthast, M. (2016). Clustering by authorship within and across documents. Working Papers, CLEF-2016.

What's Old and New? Discovering Topics in the American Journal of Sociology1

Stefano Sbalchiero, Arjuna Tuzzi
University of Padova – stefano.sbalchiero@unipd.it; arjuna.tuzzi@unipd.it

Abstract
Nowadays, the field of text mining seems to be very active in dealing with the increasing mass of available digital texts, and several algorithms have been proposed to analyze and synthesize the vast amount of data that today represents a challenging source of information overload. Topic modeling is a collection of algorithms useful for discovering themes, i.e. topics, in unstructured text. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) was one of the first topic modeling algorithms, and since then the field has remained active and many variants and other algorithms have been suggested.
The present study considers a topic as an indicator of the relevance of a research area in a specific time-span, and its temporal evolution pattern as a way to identify paradigm changes in terms of theories, ideas, forgotten topics, evergreen subjects and new emerging research interests. The study aims to contribute to a substantive reflection in Sociology by exploring the temporal evolution of topics in the abstracts of articles published by the American Journal of Sociology in the last century (1921-2016). Within the classical LDA perspective, the study also focuses on topics with a significantly increasing or decreasing trend (Griffiths et Steyvers, 2004). The results show different shifts that involved relevant reflections on various issues, from the early debate on the "institutionalization" process of Sociology as a scientific discipline to recent developments of sociological topics that clearly indicate how sociologists have reacted to new social problems.

Keywords: Chronological corpus, History of Sociology, Academic Journals, Text Mining, Latent Dirichlet Allocation

1 This study was supported by the University of Padova, fund CPDA145940 (2014) "Tracing the History of Words. A Portrait of a Discipline Through Analyses of Keyword Counts in Large Corpora of Scientific Literature" (P.I. Arjuna Tuzzi, University of Padova).

1. Introduction: topic modeling
As evidenced by the literature on topic modeling (Blei et al., 2003; Ponweiser, 2012; Grimmer et Stewart, 2013; Griffiths et Steyvers, 2004), text mining approaches can mitigate the problem of analysing huge collections of textual data which, as they increase in number and size, complicate all information processing. From a methodological point of view, since the topics emerge directly from the data, text mining approaches can tone down some problems about the role of analysts in coding and interpreting the content hidden in corpora, e.g. research bias or errors that notoriously affect most approaches in comparative and quanti-qualitative research (Strauss et Corbin, 1990; Corbetta, 2003). A popular approach to extracting information by summarizing the main contents embedded in a relevant collection of texts in digital form is known as topic modeling (Blei et Lafferty, 2009), which is essentially a collection of algorithms that are exploited to discover themes, i.e. topics, in unstructured and complex texts. Latent Dirichlet Allocation (LDA) is one of the first topic modeling algorithms, namely a "generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words" (Blei et al., 2003, p. 996). LDA is a technique that facilitates the automatic discovery of themes in a collection of documents. Since a text document can deal with different topics, and the words that occur in that document reflect a set of possible topics, in "statistical natural language processing, one common way of modeling the contributions of different topics to a document is to treat each topic as a probability distribution over words, viewing a document as a probabilistic mixture of these topics" (Griffiths et Steyvers, 2004, p. 5228). Actually, we cannot directly observe topics but only documents and words, as topics are part of the latent and hidden text structure.
The model infers the latent topic structure from the observed words and documents: this is LDA's generative process, which recreates (generates) the documents of the corpus by assigning the probability of topics (their relevance) to documents and the probability of words to topics. The result is a probabilistic distribution of topics over documents, each topic characterized and described by a cluster of co-occurring words (Blei et al., 2003). This list of words enables the researcher to interpret the meaning of all the generated topics. For the purposes of the present study, the temporal variable is crucial to analyse the direction and evolution of topics, particularly to the extent that they have a direct relationship with the most significant shifts in the development of Sociology as a discipline over time. For these reasons, we propose an LDA-based topic detection procedure, as this "method discovers a set of topics expressed by documents, providing quantitative measures that can be used to identify the content of those documents, track changes in content over time" (Griffiths et Steyvers, 2004, p. 5228). An additional estimation procedure exploits a meta-variable (year) to explore topic trends: LDA offers the opportunity to estimate the slope of a linear model that represents the distribution of topics by year. The model permits the identification of "hot and cold topics" (Griffiths et Steyvers, 2004), i.e. topics with significantly increasing (hot) or decreasing (cold) trends through time.

2. Corpus and data
The American Journal of Sociology (AJS), established in 1895 as the first U.S. scholarly journal in its field, can be considered one of the world's preeminent journals and a leading voice for research in the social sciences. The journal fosters pathbreaking work from all areas of sociology, with an emphasis on theory building and innovative methods. AJS is a multi-disciplinary journal that strives to speak to a "general sociological reader" and is open to sociologically informed contributions from anthropologists, statisticians, economists, educators, historians, and political scientists. Manuscripts are subjected to a double-blind review process, and published articles are considered representative of the best current theoretical and methodological debates. Our corpus includes all the abstracts of the papers published by AJS that have been retrieved from popular archives (Scopus and Web of Science) and the journal webpages. We decided to work on the abstracts since they provide concise information about the main contents of all articles. With regard to selection criteria, they were based on the following consideration: when abstracts did not provide any information about the content or did not refer to relevant scientific contributions (e.g. editorials, mastheads, errata, acknowledgements, rejoinders, notes, announcements, corrections, lists of consultants, obituaries, etc.), we decided to disregard them in further analyses. The corpus is composed of 3,992 abstracts, collected for a period of almost a century (mean: 41 per year), from Volume No. 27, Issue No. 1 (1921) to the latest, No. 121, Issue No. 4 (2016).
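The authors fitted their models with R packages (see footnote 3 below). As an illustration only, the core of the procedure — fitting an LDA model and estimating a per-topic linear trend over years to separate hot from cold topics — can be sketched as follows. This is a minimal sketch under our own assumptions (scikit-learn's LDA implementation, no significance testing), not the authors' actual code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def hot_and_cold(abstracts, years, k=30):
    """abstracts: list of strings; years: publication year of each abstract.
    Returns one slope per topic: the linear trend of the topic's mean share
    by year (most positive = 'hot' topics, most negative = 'cold')."""
    X = CountVectorizer(stop_words="english").fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    theta = lda.fit_transform(X)                       # document-topic shares
    theta = theta / theta.sum(axis=1, keepdims=True)
    years = np.array(years)
    yrs = np.array(sorted(set(years.tolist())))
    # mean topic distribution for each year, then a linear fit per topic
    by_year = np.vstack([theta[years == y].mean(axis=0) for y in yrs])
    return np.array([np.polyfit(yrs, by_year[:, t], 1)[0] for t in range(k)])
```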
The collected texts have relevant contents for the purpose of the present analysis, based on the following consideration and hypothesis: if we consider a topic as an indicator of the relevance of a research area in a specific time-span, then the temporal evolution pattern of subject matters can portray the main paradigm changes in terms of theories, ideas, forgotten topics, evergreen subjects and new emerging research interests in Sociology. The corpus has been pre-processed by means of the TaLTaC2 software package. After tokenization (the identification of words by chopping character sequences into pieces), the corpus has been normalized by replacing uppercase with lowercase letters. An automatic search procedure identified relevant multi-words (MWs), i.e. informative sequences of words (Pavone, 2010) repeated at least five times in the corpus (849 MWs in total). This procedure retrieved the most interesting MWs in the abstracts (e.g. united states, fr. 395; social structure, fr. 115; social science, fr. 101; labor market, fr. 89; social change, fr. 78) and contributed to increasing the amount of information conveyed by sequences of words2. Then, the corpus has been processed by means of R software packages3: punctuation marks and numbers have been removed, as well as some grammatical words (articles, conjunctions, prepositions, pronouns). The corpus is composed of 24,418 word-types and 512,410 word-tokens (tab. 1), and the measures show that there is a sufficient level of redundancy to proceed with statistical analyses of textual data (Lebart et al., 1998; Trevisani et Tuzzi, 2015; Bolasco, 2013).

Table 1. Basic lexical measures of the corpus of AJS abstracts
  (V) word-types                          24,418
  (N) word-tokens                         512,410
  (V/N)*100 = type/token ratio            4.76
  (V1/V)*100 = percentage of hapax        47.08

3. Topic detection
As the LDA algorithm "fits" the terms in the documents into a number of topics that must be specified a priori, this represents an important and sensitive decision that affects results and findings: few topics will produce broad subjects and mixed-up contents, while too many topics will produce minimal subjects and results too detailed to be readable and interpretable. To set the number of topics in a data-driven manner, we have the opportunity to calculate different metrics (Arun et al., 2010) and estimate the optimal number of topics (Griffiths et Steyvers, 2004) by means of the maximum log-likelihood of LDA for a number of topics ranging from 2 to 50 (Fig. 1).

2 A MW that did not appear at least 5 times in the corpus, that is about once every 20 years, was not considered important; 417 MWs appeared with a frequency equal to or greater than 10.
3 The analyses were implemented with the R packages tm, lda and topicmodels.

Fig. 1: Fitting the model: log-likelihood calculated for increasing number of topics

The best number of topics is the one with the highest value of log-likelihood, which is around 30 and can be established as the optimal number of topics. Figure 2 shows the general trend of all 30 topics as depicted by the fitted model. A clue to how these topics change over time is given by 30 panels, each with a topic trend line, showing which topics have positive or negative trends.
All of the topics are ordered by slope: decreasing topics appear in the first panels (top left), and increasing ones in the last panels (bottom right). Since the main aim of this study is to detect the temporal evolution of old, new and emerging topics in Sociology, we can restrict ourselves to a limited number of topics that show prototypical temporal patterns (Ponweiser, 2012; Griffiths et Steyvers, 2004).

Fig. 2: Temporal patterns of the 30 topics in Sociology sorted by slope of linear models

Consistent with the idea that topics show different trends and embrace theoretical, conceptual, and methodological shifts, the analysis of time-dependent phenomena identifies three specific temporal patterns of topics: topics whose trajectory has grown and is increasing over time (28, 4, 2, 27, 15, 11); topics whose trajectory decreased (7, 3, 21, 9, 13, 18); and topics whose peak-like journey (meteor) was high only in a specific interval of time (14, 17, 28, 15) or shows more irregular temporal trajectories.

4. What's old and new in Sociology?
To focus on the major increasing or decreasing topics from 1921 to 2016, we explored the contents of the five coldest and five hottest topics. Figure 3 provides the top terms for these topics. The group of coldest topics corresponds on the one hand to the methodological development of sociological perspectives, and on the other hand to some specific objects of research. These topics were very popular between roughly the 1920s and the 1950s. First of all, the debate on the "institutionalization" process of Sociology as a scientific discipline characterized the early period (topic 7). The main need was to create a strong scientific and knowledge base from the development of the ideas advanced by the "founding fathers", e.g. Durkheim. At the same time, the debate on the "measurement" of social phenomena arose. The issue of migration between cities and farms (topic 3) by economic and social groups gave rise to the net law of rural-urban social selection. Scientific social reflection about health and illness (topic 21) emerged, using empirical data to evaluate how social life affects morbidity and mortality rates (and vice versa), together with efforts for a better-educated public and improved health legislation. The development of psychological sociology (topic 9) and the general progress of psychological interpretations of social processes and institutions have decreased over time; research in this tradition has been criticized because it mainly exemplified the biological background of social interpretations, also supplied by the impulse from the Darwinian doctrine. Class culture, conflict and leisure (topic 13) were popular issues in the 1930s and 1950s: industrialization had raised many questions, from class conflict to the growth of after-work leisure hours, providing new insights for social thought. The group of hottest topics (Fig. 4) is related to articles that focus on a wide range of empirical case studies underlining the most significant changes that have occurred since the mid-1960s.

Fig. 3: Decreasing topics: the five coldest (significant neg., p level 0.005)
Fig. 4: Increasing topic attention: the five hottest (significant pos., p level 0.005)

In those years, the gender revolution (topic 11), ethnic discrimination (topic 2), mobilization, power and élites (topic 15), protests and social movements (topic 27), and the "measurement" of social phenomena in a post-positivist fashion, especially until the 70s (topic 4), offered sociologists the opportunity to deal with the social effervescence of a particular historical moment. These hot topics indicate the 'birth' and recent development of some sociological topics that clearly show how Sociology (as a discipline) and sociologists have reacted to new social problems. In conclusion, through the topic detection analysis of the abstracts of articles, different shifts that involved reflections on various issues have been identified. During the twentieth century, Sociology expanded its scope and influence, and motivated many research studies as well as a diversification of the field. Other studies have offered a remarkable theoretical contribution to the historical 'shape' of Sociology as a discipline (Kalekin-Fishman et Denis, 2012), even in a critical perspective (Turner, 1998), either emphasizing the content of the various domains of sociology (Scott et Desfor Edles, 2011; Blau, 2004), or working specifically within the intellectual ground of American Sociology since the mid-nineteenth century (Calhoun, 2007). Even if these show an interesting round of paradigmatic reflection in Sociology, there has been a lack of research studies that trace the history of Sociology through empirical data and evidence on fast-moving sociological topics over time. To the extent that the history of Sociology is a continuous approach to the Sociology of the present, a new way of reading the history of a discipline is to rely on topic detection in articles published in mainstream journals, which mirror the sociological scientific debate of a specific historical moment. We analysed these trends exploiting topics as they emerged from a text corpus and highlighted two distinct directions of topics, characterized by different theoretical and methodological implications, that coexist within the same period considered: the hot-increasing and the cold-decreasing topics. The results show how Sociology has become one of the main social sciences providing fresh thinking about a whole range of topics affecting the public sphere and, as a consequence, how the discipline developed, shifting priorities in universities and the social research agenda towards specialization and fostering the birth of a wide range of subdisciplines over time. This is just the tip of the iceberg: further analyses will shed light on many more aspects that need deeper reflection.

References
Arun R., Suresh V., Veni Madhavan C.E. and Narasimha Murthy M.N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.), Advances in Knowledge Discovery and Data Mining. Springer, Berlin Heidelberg, pp. 391-402.
Blau J.R. (2004). The Blackwell Companion to Sociology. Malden, MA: Blackwell.
Blei D.M., Ng A. and Jordan M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022.
Blei D.M. and Lafferty J.D. (2009). Topic Models. In A. Srivastava, M. Sahami (eds.), Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Press.
Bolasco, S. (2013). L'analisi automatica dei testi. Fare ricerca con il text mining. Carocci, Rome.
Calhoun C. (2007). Sociology in America: A History. Chicago: University of Chicago Press.
Corbetta P. (2003). Social Research: Theory, Methods and Techniques. SAGE Publications Ltd., London.
Griffiths T. and Steyvers M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 101(Supplement 1): 5228-5235.
Grimmer J. and Stewart B.M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297.
Kalekin-Fishman D. and Denis A. (2012). The Shape of Sociology for the 21st Century: Tradition and Renewal. London, SAGE.
Lebart, L., Salem, A. and Berry, L. (1998). Exploring Textual Data. Kluwer Academic Publishers: Dordrecht.
Pavone, P. (2010). Sintagmazione del testo: una scelta per disambiguare la terminologia e ridurre le variabili di un'analisi del contenuto di un corpus. In S. Bolasco, I. Chiari and L. Giuliano (eds.), Statistical Analysis of Textual Data: Proceedings of the 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, pp. 131-140. LED.
Ponweiser M. (2012). Latent Dirichlet Allocation in R. Vienna University of Business and Economics.
Scott A. and Desfor Edles A. (2011). Sociological Theory in the Contemporary Era: Text and Readings. Thousand Oaks, Pine Forge Press.
Strauss, A.L. and Corbin, J. (1990). Basics for Qualitative Research: Grounded Theory Procedures and Techniques. Newbury Park, Sage.
Trevisani, M. and Tuzzi, A. (2015). A portrait of JASA: the History of Statistics through analysis of keyword counts in an early scientific journal. Quality & Quantity, 49(3): 1287-1304.
Trevisani, M. and Tuzzi, A. (in press). Learning the evolution of disciplines from scientific literature. A functional clustering approach to normalized keyword count trajectories. Knowledge-Based Systems.
Turner S. (1998). Who's Afraid of the History of Sociology? Swiss Journal of Sociology, 24: 3-10.

Comparison of Neural Models for Gender Profiling

Nils Schaetti, Jacques Savoy
Université de Neuchâtel - Rue Emile-Argand 11 - CH-2000 Neuchâtel - Switzerland

Abstract
This paper describes and evaluates two neural models for gender profiling on the PAN@CLEF 2017 tweet collection. The first model is a character-based Convolutional Neural Network (CNN) and the second an Echo State Network-based (ESN) recurrent neural network with various features. We applied these models to the gender profiling task of the PAN17 challenge and demonstrate that they can be applied effectively to gender profiling. As features, we propose using pre-trained word vectors, part-of-speech (POS) tags and function words (FW) for the ESN model, and a character 2-gram matrix with punctuation marks, smilies, and beginning and ending 2-grams for the deep learning model. We finally compared these strategies to a baseline and found that an ESN model based on GloVe pre-trained word vectors achieves the highest success rate and outperforms both the baseline and the character-based CNN model.

Keywords: Author Profiling, Gender Profiling, Deep Learning, Convolutional Neural Network, Reservoir Computing, Echo State Network, Natural Language Processing

1. Introduction
In the age of big data, a large number of applications are based on an exponentially growing amount of various data, such as pictures, videos, articles, links and blogs shared directly from computers, websites, smartphones and sensors.
Social networks and blogs are the new platforms of communication based on fast interactions, generating a large variety of content with its own characteristics. These contents are difficult to compare with traditional texts, such as novels and articles. This issue raises new questions: Can we determine whether the author of a textual content is a man or a woman? Can we identify the author's place of origin, his age group or (part of) his psychological profile? Answering these questions can help solve current issues of the social network era, such as fake news, plagiarism and identity theft. Author profiling is, therefore, a particularly pertinent subject of interest. In addition, author profiling is central to applications involving marketing, security and forensics. For example, forensic linguistics and police investigation forces would like to know specific defining characteristics, such as the gender, the age group and the socio-cultural background of the author of harassing messages. Applied to marketing, companies and resellers could make use of these profile characteristics when targeting their consumers' preferences, based on the analysis of individual consumers' social network posts and online product consulting. In order to extract this information, classic statistical methods are employed, as they have proven to be effective for text classification. Deep learning has gained increasing popularity over the last decade, becoming a "breakthrough" technology in image recognition and computer vision. Yet, it faces difficulties in natural language processing (NLP) tasks. But recurrent neural networks (RNN), as well as long short-term memory (LSTM) networks, have obtained better results in such tasks. In this view, we therefore decided to test such an approach on the gender profiling task with two neural models, one based on Convolutional Neural Networks (CNN) and 2-grams of characters, and the other on the Reservoir Computing paradigm. Finally, we compare them to a baseline composed of both a random and a naive Bayes classifier. This paper is organized as follows. Section 2 introduces the data set used to train and test both of the models and the methodology used for evaluation. Section 3 describes and evaluates our deep-learning model. Section 4 introduces the proposed echo state network-based reservoir computing model. Section 5 compares the results with the baseline. In the last section, we draw conclusions on our findings and possible future improvements.

2. Methodology
To compare our two models on the gender profiling task, we needed a common ground composed of the same dataset and evaluation measures. To create this common ground, the PAN CLEF evaluation campaign was launched [1], allowing multiple research groups to propose and compare profiling algorithms with the same methodology. For the PAN CLEF 2017 evaluation campaign, four test collections of tweets written in several languages, including English, were generated. Based on these collections, the challenge was to classify Twitter profiles per language variety (e.g., UK vs. US English) and gender. We were then able to use this common ground for our two models and compare their capacities on the gender profiling task. The dataset was collected on Twitter and is composed of tweets from different authors, with 100 per author. For each author, a label indicates the correct gender (male, female).
The collection included 3,600 authors residing in the United States, Great Britain, Ireland, New Zealand, Australia and Canada, 600 per country and 1,800 for each gender group, for a total of 360,000 tweets. The table below summarizes the dataset properties.

  Authors    3,600
  Tweets     360,000
  Genders    male: 1,800; female: 1,800

The overall performance of a model is based on the accuracy on the gender profiling task. The accuracy is the number of authors whose gender is correctly classified divided by the total number of authors. Based on the data depicted in the table above, a random baseline will produce an accuracy rate of 0.5 (or 50%).

3. Character N-gram Matrix-based Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a variety of feed-forward artificial neural network inspired by the visual cortex [2]. In our first model, we applied a CNN to a character bigram representation matrix for an author in a collection. The first figure shows the structure of the representation matrix. For each letter, one can find one row. In the first position, the relative frequency of this letter is provided. Then, from left to right, the matrix is composed of the relative frequencies of each character bigram (e.g., at row "t" and column "h", the relative frequency of the bigram "th" is given). The third part is optional and composed of the relative frequencies of ending character bigrams; finally, the last part is the same optional matrix representing the starting character bigrams of each word. This matrix representing an author is the input for the CNN. The first two layers are composed of 20 and 10 kernels respectively, with a size of 5 × 5. These layers are followed by a drop-out layer. The last two are linear layers based on ReLU. The outputs are finally obtained by a softmax function and give the author's predicted class. The predicted class is therefore the class with the highest corresponding output from this function. The training set is composed of 90% of the dataset and the remaining 10% is used to estimate the performance. This procedure is repeated 10 times with non-overlapping test sets to obtain the 10-fold cross-validation estimator. The following table reports the resulting accuracy per alphabet (∗ and † mark statistically significant improvements; cf. Section 5).

  Matrix / Alphabet              English    + Punctuation    + Punctuation & Smilies
  Bigrams                        75.26%     76.16%           76.51%
  + starting bigrams             76.02%∗    77.63%∗†         77.50%∗
  + ending bigrams               75.94%     77.22%†          77.25%
  + starting & ending bigrams    76.12%     77.83%†          78.33%∗†

4. Echo State Network-based Reservoir Computing models
4.1. Echo State Networks
The Echo State Network was introduced in [3] and corresponds to the first equation below. In this model, the highly non-linear dimensional vector x_t, denoting the activation vector at time t, is defined by

  x_{t+1} = (1 − a) · x_t + a · f(Win · u_{t+1} + W · x_t + Wbias)

where x_t ∈ R^{Nx}, with Nx the number of neurons in the reservoir. The scalar a represents the leak rate, allowing the network's dynamics to be adapted to the task to be learned. The input signal is represented by the vector u_t of dimension Nu, multiplied by the input weight matrix Win ∈ R^{Nx×Nu}. In addition, the matrix W ∈ R^{Nx×Nx} stores the internal weights. Finally, Wbias is the bias, and usually the initial vector is fixed to x_0 = 0, corresponding to a null state. The network's output ŷ is defined by

  ŷ_t = g(Wout · x_t)

and the learning phase consists of finding the values of the matrix Wout ∈ R^{Ny×Nx}, e.g., by applying the ridge regression method. This matrix is defined by

  Wout = Y · X^T (X · X^T + λ · I)^{−1}

where Y ∈ R^{Ny×T} is the matrix containing each target output ŷ_t for t = 1, 2, ..., T, where T denotes the training size and Ny the number of outputs (categories). Similarly, the matrix X ∈ R^{Nx×T} stores the reservoir states x_t obtained during the training phase. Finally, the parameter λ is a regularization factor.

4.2. From texts to temporal signals
In order to apply an ESN to text classification, we must first transform the input texts into a temporal signal. In this study, we have evaluated three signal converter methods. Each word sequence in a text (e.g. "to the citizens of") can be viewed as a sequence of word vectors (WV) (e.g., vec(to), vec(the), vec(citizens), vec(of), each vector extracted from word embeddings pre-trained with GloVe), of Part-Of-Speech (POS) vectors (size: number of POS tags), or of function word (FW) vectors (size: number of FW). As output, the ESN generates the vector y_{t,g} with g ∈ {male, female}, denoting the probability that the tokens in the ESN's memory at time t have been written by a man or a woman. We then end up with an output temporal signal of gender probabilities (over t = 1, 2, ..., T), and the final predicted class of a document is the one with the highest average across time.

4.3. State-Gram
In addition, the output layer can take account of more than one state to estimate the class probabilities. A state-gram value of 2 means that the training is performed not only on a single x_t, but on x_{t−1} ∪ x_t. Such a model was effective for handwritten digit recognition [4].

5. Results
In the table of Section 3, one can see the results of the deep-learning CNN model with different vocabularies and with starting and ending bigrams. The statistical tests indicate that the starting bigrams can significantly improve the performance with respect to the base model (first row). The combination of starting and ending bigrams (last row) shows a significant improvement only for the vocabulary composed of punctuation marks and smilies. The best result (78.33%) is achieved by a CNN model with punctuation and smilies, with starting and ending character bigrams. The left plot in the second figure shows the three features (WV, POS, FW) with a leak rate between 0.01 and 1.0. Using the same three feature sets, the right-side plot indicates the accuracy rate obtained by the state-gram model with values between 1 and 5. With a solid line, the best leak-rate parameter value is used, and with the dotted curves, a leak-rate value of 1 was used. Overall, Figure 2 indicates that the pre-trained word vectors (WV) are the best feature set, with a maximum value of 80.81% at a leak rate of 0.01. As the best accuracy rates are obtained with a leak rate between 0.01 and 0.05 (left plot in Figure 2), we can conclude that the author profiling task has very slow temporal dynamics. The right-side plot shows that no significant improvement is achieved by increasing the value of the state-gram parameter for the best leak-rate parameter value. Moreover, a high value of Ns decreases the performance for the POS feature. The performance slightly increases for a leak-rate parameter value of 1, but these results show that the leak-rate parameter is a better lever to increase the accuracy rates. The following table compares the accuracy rates achieved by a random classifier and the naive Bayes model together with the CNN and ESN models (with Nx = 1,000).

  Classifier                                      10-CV success rate
  Random baseline                                 50.0 %
  Naive Bayes classifier baseline                 75.5 %
  CNN 2-grams + starting-grams + ending-grams     78.3 %
  ESN on GloVe with Nx = 1000                     80.6 %
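As an illustration of Section 4, here is a minimal echo state network with a ridge-regression readout, following the update rule and the Wout formula given above. This is a sketch under our own assumptions (uniform random weights, spectral radius rescaled to 0.9, tanh activation, identity output function g, one-hot targets), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class ESN:
    """Leaky echo state network with a ridge-regression readout."""

    def __init__(self, nu, nx, ny, a=0.01, lam=1e-4):
        self.a, self.lam = a, lam
        self.win = rng.uniform(-0.5, 0.5, (nx, nu))       # input weights
        w = rng.uniform(-0.5, 0.5, (nx, nx))              # internal weights
        self.w = w * (0.9 / np.abs(np.linalg.eigvals(w)).max())  # spectral radius 0.9
        self.wbias = rng.uniform(-0.5, 0.5, nx)
        self.wout = np.zeros((ny, nx))

    def states(self, u):
        """u: sequence of input vectors (T x Nu); returns states (T x Nx)."""
        x, xs = np.zeros(len(self.wbias)), []
        for ut in u:
            # x(t+1) = (1 - a) x(t) + a tanh(Win u(t+1) + W x(t) + Wbias)
            x = (1 - self.a) * x + self.a * np.tanh(
                self.win @ ut + self.w @ x + self.wbias)
            xs.append(x)
        return np.array(xs)

    def fit(self, u, y):
        """y: one-hot targets (T x Ny); Wout = Y X^T (X X^T + lambda I)^-1."""
        X, Y = self.states(u).T, np.asarray(y).T
        self.wout = Y @ X.T @ np.linalg.inv(
            X @ X.T + self.lam * np.eye(X.shape[0]))

    def predict(self, u):
        # average the per-token class scores over time, as in Section 4.2
        return (self.states(u) @ self.wout.T).mean(axis=0)
```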
6. Conclusion
This paper presents a comparison of two neural models: a character-based CNN model and an echo state network (ESN) model with POS tags, function words (FW) or pre-trained word vectors (WV) as possible feature sets. Based on the CLEF-PAN 2017 dataset, the best CNN model achieves a success rate of 78.3% with a feature set composed of the vocabulary, the punctuation marks and smilies. The best ESN model obtains a success rate of 80.6% with 1,000 neurons and a leak rate of 0.01. Based on our experimental setting, this model achieves the best performance. In comparison, the naive Bayes classifier obtains a success rate of 75.5%, and the average and best performances for the gender profiling task at PAN 2017 were respectively 75.88% and 82.5%. Our results indicate that the two models can significantly improve the accuracy rate on the gender profiling task. Moreover, they demonstrate that a simple model such as the echo state network, thanks to its simple linear regression algorithm, can achieve a higher success rate than a more complex model such as a character-based CNN. This higher result level can be explained by the recurrent architecture of the ESN model, allowing it to take word order into account. In the future, we want to explore more features for the ESN, as well as word vectors pre-trained for Twitter applications, to hopefully achieve better performance. We will also apply classical and deep ESN architectures to other natural language processing tasks, such as authorship identification and author diarization.

References
Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. Working Notes Papers of the CLEF, 2016.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Herbert Jaeger. The "echo state" approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.
Nils Schaetti, Michel Salomon, and Raphaël Couturier. Echo state networks-based reservoir computing for MNIST handwritten digits recognition. In Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC), 2016 IEEE Intl Conference on, pages 484-491. IEEE, 2016.

Segments répétés appliqués à l'extraction de connaissances trilingues

Lionel SHEN
Université Sorbonne Nouvelle - Paris 3 – lionel.shen@sorbonne-nouvelle.fr

Abstract
In a context of globalized societies, multilingualism is becoming an economic and social phenomenon. Translation constitutes a crucial element for communication. A good translation guarantees the quality of the transmission of all information. However, faced with the challenge of multilingual information monitoring, can we simply use translation? With the advent of the digital age and the integration of all new technologies, corporate governance is undergoing a complete metamorphosis. One of the priorities remains the efficient exploitation of accumulated big data. This paper aims to highlight the specificity and efficiency of the Repeated Segments tool through knowledge discovery in trilingual thematic corpora (French, English and Chinese).
Résumé
Dans un contexte de sociétés mondialisées, on peut parler de multilinguisme ou encore de plurilinguisme. Aujourd'hui, la frénésie autour du phénomène des mégadonnées et le multilinguisme sont en train de métamorphoser tous les services et les comportements de notre époque. La traduction devient alors un élément capital pour la communication entre les peuples. Une bonne traduction garantit la qualité de la transmission de toutes les informations. Cependant, devant la gageure que constitue le projet de réaliser une veille multilingue, peut-on utiliser simplement la traduction ? Cet article s'articule autour d'explorations de corpus thématiques trilingues appliquées à l'extraction de connaissances et tente de mettre en lumière la spécificité et l'efficacité des cooccurrences en trois langues : français, anglais et chinois.

Keywords: segments répétés, textométrie, veille multilingue, multilinguisme, fouille d'informations, text-mining, cooccurrences, poly-cooccurrences

1. Introduction
Le monde, qui utilise des centaines de langages depuis des millénaires, a formalisé les mots et les grammaires pour transcrire, enseigner et transmettre sur des supports les savoirs, les faits et les pensées. Des hiéroglyphes aux idéogrammes, en passant par les alphabets, ces représentations diffusent ainsi l'image du monde à travers les époques, les évolutions, les mœurs et les courants de pensée. Cela représente aujourd'hui des centaines de milliards de mots dans des corpus différents, avec des occurrences variables. Il n'est pas possible à un être humain d'aborder par lui-même la masse des publications archivées ou en circulation. Seul l'usage de l'informatique peut, à présent, dans le cadre de la mondialisation, permettre un balayage massif des séquences des corpus, nécessaire à l'étude des occurrences et des usages des mots, au moins dans les langues essentielles diffusant le savoir, l'information et la communication entre les humains. L'utilité de ces recherches est étendue, allant des besoins sociaux, humains et scientifiques aux guerres économiques, en passant par les médias et les enjeux stratégiques des politiques. C'est la capacité à détecter, enregistrer, analyser et comprendre dans les meilleurs délais qui va permettre aux différentes forces de pouvoir d'anticiper les décisions et d'agir efficacement. Cette force de veille, implantée de manière continue et basée sur des outils performants, élaborés et mis en œuvre par des chercheurs, des informaticiens, des stratèges et des économistes, sous l'autorité des décideurs, va donc construire les forces de demain, parfois à l'échelle de la planète. Dans un contexte de sociétés mondialisées, on peut parler de multilinguisme ou encore de plurilinguisme. Aujourd'hui, la frénésie autour du phénomène Big data et le multilinguisme sont en train de métamorphoser tous les services et les comportements de notre époque. La traduction devient alors un élément capital pour la communication entre les peuples. Une bonne traduction garantit la qualité de la transmission de toutes les informations. Cependant, devant la gageure que constitue le projet de réaliser une veille multilingue, peut-on utiliser simplement la traduction ? Cet article s'articule autour d'explorations de corpus thématiques trilingues appliquées à l'extraction de connaissances et tente de mettre en lumière la spécificité et l'efficacité de l'outil Segments répétés en trois langues.
2. Corpus
Pour constituer ce travail, deux types de corpus sont mobilisés : un corpus comparable (nommé ENRG) et un corpus parallèle (nommé CLRG), composés de données textuelles extraites des discours de presse ainsi que de ceux des ONG. La construction de ces deux corpus s'effectue autour de trois thèmes d'actualité : l'environnement, l'énergie et le changement climatique. Elle s'opère à partir d'articles de journaux issus de nos trois sphères de communication, à savoir Le Monde pour la France (4 817 articles), le NYT pour les États-Unis (3 993 articles) et 1 200 médias pour la Chine (14 509 articles), comme le présentent les deux figures (figure 1 et figure 2) ci-dessous. Les données textuelles extraites du corpus comparable proviennent des discours de la presse, tandis que celles du corpus parallèle sont issues de ceux des ONG.

Figure 1 : volumétrie du corpus comparable ENRG
Figure 2 : volumétrie du corpus parallèle CLRG

Quant à l'aspect temporel des données du corpus comparable, il diffère selon les sources et couvre des périodes plus ou moins étendues : de 1999 à 2012 pour Le Monde, de 2005 à 2012 pour le NYT, de 2008 à 2013 pour les médias chinois. Concernant le corpus parallèle, les articles datent de 2006 à 2014. La figure 3 ci-dessous montre les différentes périodes couvertes par les médias retenus.

Figure 3 : périodes couvertes par les corpus ENRG et CLRG

Les dépouillements sont réalisés à l'aide des outils de la textométrie, notamment grâce aux analyses factorielles des correspondances (AFC), aux spécificités du modèle hypergéométrique, aux segments répétés, aux réseaux cooccurrentiels et poly-cooccurrentiels ou encore à la carte des sections. Les caractéristiques locales et globales, les convergences, les divergences et les particularités de ces différents corpus ont été mises successivement en évidence. Après avoir présenté rapidement les deux corpus utilisés, nous allons nous polariser sur l'outil Segments répétés, appliqué d'abord au corpus parallèle puis au corpus comparable. Nous nous intéresserons plus particulièrement, dans cet article, à la spécificité des segments répétés appliqués à l'extraction de connaissances multilingues. Comme le souligne André Salem, « l'outil prend toute sa valeur lorsque l'unité linguistique traitée n'est pas le mot, mais le segment répété (suite de mots d'une longueur 2, 3, 4, 5) » (Salem, 1987). Nous rappelons qu'« un segment répété est une suite de formes dont la fréquence est supérieure ou égale à 2 dans le corpus ». Nous émettons l'hypothèse suivante : l'outil Segments répétés serait plus performant en chinois qu'en anglais et qu'en français.

Corpus parallèle : segments répétés anglais-chinois
Nous examinons maintenant les segments les plus répétés obtenus à partir des deux volets (anglais-chinois) du corpus parallèle CLRG.

Tableau 1 : segments les plus répétés du corpus parallèle CLRG

Le tableau 1 ci-dessus illustre les 14 segments les plus répétés de CLRG. Nous constatons que la fréquence des segments répétés du volet anglais est beaucoup plus élevée que celle du chinois. Par exemple, la fréquence du segment climate change est de 2 468 dans le volet anglais, tandis que dans le volet chinois, la fréquence est de 830. La signification des segments répétés du volet anglais révèle peu d'informations intéressantes : les mots-outils ou les mots syntaxiques sont les plus répétés, et un seul thème relatif à notre recherche est présent, climate change.
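L'inventaire des segments répétés rappelé ci-dessus (suites de 2 à 5 formes dont la fréquence est supérieure ou égale à 2) peut s'esquisser comme suit. Il s'agit d'une esquisse minimale de notre part, et non de l'outil textométrique utilisé dans cette étude : nous supposons un corpus déjà tokenisé et ignorons, par simplification, les frontières de phrases que les inventaires classiques ne franchissent pas.

```python
from collections import Counter

def segments_repetes(tokens, lmin=2, lmax=5, seuil=2):
    """Inventaire des segments répétés : suites de lmin à lmax formes
    dont la fréquence est supérieure ou égale à `seuil` dans le corpus."""
    comptes = Counter()
    for l in range(lmin, lmax + 1):
        for i in range(len(tokens) - l + 1):
            comptes[tuple(tokens[i:i + l])] += 1
    # on ne retient que les segments effectivement répétés
    return {seg: f for seg, f in comptes.items() if f >= seuil}
```

Appliqué à chaque volet du corpus et trié par fréquence décroissante, cet inventaire produit des listes comparables à celles des tableaux 1 et 2.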
Parallel corpus: English-Chinese repeated segments

We now examine the most repeated segments obtained from the two components (English and Chinese) of the parallel corpus CLRG.

Table 1: most repeated segments of the parallel corpus CLRG

Table 1 above lists the 14 most repeated segments of CLRG. The frequencies of repeated segments are much higher in the English component than in the Chinese one. For example, the frequency of the segment climate change is 2,468 in the English component, against 830 in the Chinese component. The repeated segments of the English component yield little interesting information: function words and syntactic words are the most repeated, and only one theme relevant to our research is present, climate change. The repeated segments in Chinese, by contrast, reveal the true themes of our research: greenhouse gas, climate change, renewable energy, new. Two types of repetition thus appear: grammatical words for English, and content words for Chinese. Recall that the heavy repetition of grammatical words is what drives the large number of occurrences in English: the more intensive the use of grammatical words, the higher the occurrence count. This asymmetry of repeated segments between the two components is entirely normal, since the syntactic structures of the two languages are completely different. The fact that the texts are translations of one another in no way implies a symmetric use of segments repeated in the same way in both languages. A pre-processing step removing function words from the English would, however, give more meaning to the study of repeated segments (Shen, 2016).

André Salem's remarks support our hypothesis, as do those of Damon Mayaffre. "The analysis of recurrent neighbourhoods makes it possible to use repeated segments to document statistical analyses based on simple forms. One also finds a typological analysis carried out on the basis of repeated segments." (Salem, 1986). "Even less than the frequency of a word, the recurrence of segments cannot naively be attributed to chance: either it points to a syntactic constraint, or it indicates a semantic determination or choice. Put briefly, the word is a graphical unit, most often ambiguous, without explicit sense, not even endowed with signification; the segment, by contrast, becomes a linguistic unit that carries meaning" (Mayaffre, 2007).

These results of the bilingual (English-Chinese) study of parallel repeated segments, and their analysis, show that for the same information stated and described in two languages, event-related and thematic repetition is more salient in Chinese, owing to its limited use of anaphora (Shen, 2016). Its content is also more diversified, since we recover our main research themes.

We now turn to the study of repeated segments in the comparable corpus ENRG, composed of three sub-corpora: the French sub-corpus ENRG-FR, the American sub-corpus ENRG-US and the Chinese sub-corpus ENRG-CN.

Comparable corpus: trilingual repeated segments (French, English/US, Chinese)

Table 2: most repeated segments of the comparable corpus ENRG

Table 2 above presents the 16 most repeated segments of ENRG. As with the parallel corpus, our first observation is a particularly salient thematic repetition in the Chinese sub-corpus (ENRG-CN). For example, the segment reduce emissions, with a frequency of 12,554, ranks as the most repeated segment in ENRG-CN, while such forms are absent from the top of the table for the other two sub-corpora. These forms do exist there, but rank far lower among the repeated-segment results. The other sub-themes represented by repeated sequences, such as low carbon content, wind energy, photovoltaic, etc., directly linked to energy and climate change, are likewise brought out in Table 2. For the French and American sub-corpora, only grammatical words or function words appear among the most repeated segments.
This phenomenon is due essentially to the mechanism of anaphora, or to the deictic mechanism, which is not the same in French and in American English (Shen, 2016). We note, however, that in Chinese it is key terms that are repeated, whereas in English and French it is often named entities (proper names, toponyms, etc.).

3. Conclusion

In the process of trilingual knowledge extraction, we can conclude that repeated segments bring the most salient characteristics to light far more effectively in Chinese than in the two Western languages. Two types of repetition appear: grammatical words for French and English, and content words for Chinese. We further emphasize that cooccurrences and poly-cooccurrences also make it possible to extract knowledge from the corpus through the co-presence of distant forms. According to Mayaffre, "the study of repeated segments offers an alternative to lemmatization. It makes it possible to disambiguate terms formally and, above all, endogenously, within the corpus, rather than by (arbitrary) reference to the dictionary or to the language" (Mayaffre, 2007). Rightly so: given the strong presence of function words, cooccurrences and poly-cooccurrences, as compared with repeated segments, make it possible to recover non-contiguous repeated sequences across sentences or paragraphs (a sketch of such a window-based pair count is given below). From the repeated-segment results on the two corpora, we can state that the repeated-segments tool has the advantage of rapidly extracting key information in Chinese, whereas in French and in English the mechanism of cooccurrences and poly-cooccurrences brings out information that traditional means (concordances, for example) cannot detect. The repeated-segments tool is therefore a fundamental asset for multilingual information mining.
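The cooccurrence counts alluded to in this conclusion can be illustrated with a span-based pair count. A minimal sketch, assuming sentence-level windows and pre-tokenized text; both assumptions are ours, not a description of the tools actually used:

```python
from collections import Counter
from itertools import combinations

def cooccurrences(sentences, min_freq=2):
    """Count unordered pairs of distinct forms co-present in the same sentence."""
    pairs = Counter()
    for sent in sentences:
        for pair in combinations(sorted(set(sent)), 2):
            pairs[pair] += 1
    return {p: f for p, f in pairs.items() if f >= min_freq}

sents = [["réduire", "les", "émissions", "de", "carbone"],
         ["il", "faut", "réduire", "nos", "émissions"]]
print(cooccurrences(sents))  # the non-contiguous pair ('réduire', 'émissions') scores 2
```

Unlike a repeated segment, the pair need not be contiguous, which is what makes this view robust to the heavy interleaving of function words in French and English.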
Bibliography

Bonnafous S. and Tournier M. (1995). Analyse de discours, lexicométrie, communication et politique. Langages, 29e année, n°117, Paris, Larousse, pp. 67-81.
Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Paris, Armand Colin/Masson, 254 p.
Habert B. and Zweigenbaum P. (2002). Problèmes épistémologiques : Régler les règles. TAL, Paris, Association pour le traitement automatique des langues, vol. 43, n°3, pp. 83-105.
Lafon P. (1981). Analyse lexicométrique et recherche des cooccurrences. Mots, n°3, octobre 1981, pp. 95-148.
Lafon P. and Salem A. (1983). L'inventaire des segments répétés d'un texte. Mots, n°6, mars 1983, pp. 161-177.
Lamalle C. and Salem A. (2002). Types généralisés et topographie textuelle dans l'analyse quantitative des corpus textuels. In A. Morin and P. Sébillot (eds), JADT 2002. Saint-Malo: IRISA-INRIA, vol. 1, pp. 403-411.
Lebart L. and Salem A. (1995). Statistique textuelle. Paris: Dunod.
Longrée D., Luong X., Mellet S. (2004). Temps verbaux, axe syntagmatique, topologie textuelle : analyses d'un corpus lemmatisé. In G. Purnelle, C. Fairon, A. Dister (eds), JADT04. Louvain: Presses universitaires de Louvain, vol. 2, pp. 743-752.
Longrée D., Luong X., Mellet S. (2006). Distance intertextuelle et classement des textes d'après leur structure : méthodes de découpage et analyses arborées. In J.-M. Viprey (ed.), JADT'06. Besançon: Presses universitaires de Franche-Comté, vol. 2, pp. 643-654.
Mayaffre D. (2004). Paroles de président. Jacques Chirac (1995-2003) et le discours présidentiel sous la Vème République. Paris: Champion.
Mayaffre D. (2007). L'analyse de données textuelles aujourd'hui : du corpus comme une urne au corpus comme un plan. Retour sur les travaux actuels de topographie/topologie textuelle. Lexicometrica, pp. 1-12.
Rastier F. (2001). Arts et sciences du texte. Paris: Puf.
Salem A. (1986). Segments répétés et analyse statistique des données textuelles. Histoire & Mesure, vol. 1, n°2, pp. 5-28.
Salem A. (1987). Pratique des segments répétés. Essai de statistique textuelle. Paris: Klincksieck.
Shen L. (2016). Méthodes de veille textométrique multilingue appliquées à des corpus de l'environnement et de l'énergie : « Restitution, prévision et anticipation d'événements par poly-résonances croisées ». Thèse : Sciences du langage, Université Sorbonne Nouvelle – Paris 3, octobre 2016, 474 p.
Viprey J. (2005a). Philologie numérique et herméneutique intégrative. In J.-M. Adam and U. Heidmann (eds), Sciences du texte et analyse de discours. Genève: Slatkine, pp. 51-68.
Viprey J. (2005b). Corpus et sémantique discursive : éléments de méthode pour la lecture du corpus. In A. Condamines (ed.), Sémantique et corpus. Paris: Lavoisier, pp. 245-276.
Viprey J. (2006). Structure non-séquentielle des textes. Langages, 163, pp. 71-85.

Measuring, Monitoring and Governing Cities with Big Data

Sandro Stancampiano
Istat – stancamp@istat.it

Abstract

Several new data sources are being investigated for the production process of official statistics. This paper describes the results of an analysis of online reviews of four points of interest in Rome, Italy. The reviews, collected from the web using web scraping and data wrangling techniques, were written by tourists and visitors during 2017. The general aim of this research is to extract useful information to help civil servants and citizens in decision-making processes. As part of this study, 9,227 documents (each document being a review) were automatically collected, stored in a database and used to build the corpus. The paper classifies the reviews and qualifies the sentiment of the texts using text mining tools and techniques.

Keywords: big data, Internet as data source, text mining, cluster analysis, web scraping.
1. Introduction

This project investigates how Big Data can be used to produce official statistics in support of public administration. Istat has included this theme, shared at the European level, in its three-year plan for thematic and methodological research (https://www.istat.it/it/files/2011/07/Piano-strategico-2017-2019.pdf, pp. 27-28). Istat is considering the possibility of using Big Data in its data production process, so as to soften the trade-off between timeliness and accuracy (Alleva, 2016).

2. Background of the research

This work focuses on the management of cultural heritage, investigating non-conventional data sources by means of multivariate exploratory techniques (Bolasco, 2014). The aim is to show the enormous potential of data available on the web for producing statistics that optimize decision-making processes. The results can assist administrators in managing the services dedicated to visitors of the cultural assets present in the territory. The experiment, a pilot project that can be replicated and extended on a large scale, uses textual analysis (text mining) to extract information from data downloaded from the web by web scraping. Cluster analysis is used to uncover regularities in the texts under examination; this technique, applied through the IRaMuTeQ software, makes it possible to define the distance between the objects to be classified (Ceron et al., 2014).

3. Objective and research hypotheses

Among the many websites on which users produce content, Tripadvisor was chosen. Registered users write reviews of the places they have visited, sharing their experiences (Iezzi and Mastrangelo, 2012). Four of the most famous attractions of Rome, visited daily by large numbers of tourists, were selected: the Colosseum, the Pantheon, the Trevi Fountain and Piazza Navona. The Colosseum, with over six million visitors, once again drove the 2016 growth in revenues of Italian museums (http://www.beniculturali.it/mibac/export/MiBAC/sitoMiBAC/Contenuti/MibacUnif/Comunicati/visualizza_asset.html_892096923.html) and the primacy of the Lazio region in that ranking. Many visitors leave ratings of the places, adding remarks on the state of conservation of the assets and on the services and shortcomings they noticed. We hold that analysing these comments makes it possible to derive valuable information. The analysis yielded a hierarchical classification of the reviews based on the terms used more than average with respect to the monument variable.

4. Corpus and method

The comments were collected into a database with the Diogene application (a piece of software developed in Java for data wrangling tasks), designed according to the OOA/D paradigm and built with an agile methodology (Larman, 2005). The same software was used to create the corpus of reviews. The 9,227 reviews collected, published between 1 January and 31 December 2017, break down as follows: Colosseum 3,483 (37.8%), Piazza Navona 1,020 (11%), Trevi Fountain 2,829 (30.6%) and Pantheon 1,895 (20.5%). We proceeded first with a lexical analysis, deriving information useful for the subsequent textual analysis aimed at locating text units relevant to the aims of the present study (Bolasco, 2013). A sketch of the collect-and-store step appears below.
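Diogene itself is not publicly documented here, so the following Python sketch only illustrates the generic collect-and-store step described above; the URL, the CSS selector and the table schema are placeholders of ours, not Tripadvisor's real markup or the paper's actual schema:

```python
import sqlite3
import requests
from bs4 import BeautifulSoup

def collect(url, selector, monument, db="reviews.db"):
    """Download one page, extract review texts and store them in SQLite."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS review (monument TEXT, body TEXT)")
    html = requests.get(url, timeout=30).text
    for node in BeautifulSoup(html, "html.parser").select(selector):
        con.execute("INSERT INTO review VALUES (?, ?)",
                    (monument, node.get_text(strip=True)))
    con.commit()
    con.close()

# Hypothetical call: both the URL and the selector are illustrative placeholders.
collect("https://example.org/reviews", "div.review-text", "Colosseo")
```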
The analysis identified groups of words that are internally homogeneous and mutually heterogeneous with respect to the "concepts" expressed in the reviews. The corpus analysed comprises 9,227 texts, 1,788,819 occurrences, 11,891 forms and 366 hapaxes (3.08% of the forms, 0.02% of the occurrences), with an average of 193.87 occurrences per text. The lexical richness of the corpus is very low (V/N*100 = 0.66%, computed with the formula RL = V/N, where V is the size of the vocabulary and N the total number of words in the text): despite the large size of the text, the vocabulary is limited. Looking at the 30 active forms with the highest absolute frequency, we note that the language used favours nouns and adjectives over verbs. The adjectives express positivity (bello, bellissima, grande) and the nouns relate to the enjoyment of the assets under study (monumento, piazza, visita, luogo, consiglio, interno), as do the verbs (visitare, fare, vedere, dire, entrare, trovare). The richness figures are easy to reproduce; see the sketch below.
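A sketch of the computation; the function below derives all the quantities from token counts, and the toy check simply plugs in the corpus figures reported above:

```python
from collections import Counter

def lexical_stats(tokens):
    counts = Counter(tokens)
    N = sum(counts.values())                      # total occurrences (tokens)
    V = len(counts)                               # distinct forms (types)
    hapax = sum(1 for f in counts.values() if f == 1)
    return {"N": N, "V": V, "hapax": hapax, "richness_pct": 100 * V / N}

# Reproducing the reported lexical richness from the figures above:
print(round(100 * 11891 / 1788819, 2))  # -> 0.66
```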
5. The reviewers and the reviews

The data on the day of the week on which each review was written show a tendency for users to set down the details of their experiences in the middle of the week, with a preference for Wednesdays (see Figure 5.1). People devote their weekends to visiting cultural sites and prefer to describe what they saw and experienced on Tuesdays, Wednesdays and Thursdays. Over the period under study, reviews of the four sites averaged 741 per month, with a minimum of 572 in June and a maximum of 1,129 in January. Figure 5.2 shows that the first months of the year, from January to April, concentrate the largest number of reviews (over 42% of the total).

Figure 5.1: Number of reviews per day of the week (January – December 2017)
Figure 5.2: Number of reviews per month (January – December 2017)

6. Cluster Analysis

Cluster analysis allows us to group the statistical units by maximizing the cohesion and homogeneity of the words included in each group while minimizing the logical link between words assigned to different groups/classes.

Figure 6.1: Dendrogram of the classes by similarity

The dendrogram (Figure 6.1) shows the division of the corpus into 4 classes. The words contained in each class identify the types of topic dealt with in the corpus, applying the Alceste methodology proposed by Max Reinert and implemented in the IRaMuTeQ software (IRaMuTeQ performs multidimensional analyses of texts and provides a graphical interface to R, a data-processing environment particularly efficient for analysing large datasets). Figure 6.2 shows the words belonging to the four groups and how they are arranged on the factorial plane; this visualization clarifies the meaning of the classes identified.

The group of words in red (65.4%), concentrated around the origin, is composed of the most used terms, which cut across the whole corpus and hence all four sites: theme words such as roma, simbolo, monumento, città, storia, the verbs visitare, vedere, tornare, dire, and nouns and adjectives such as bello, emozione, luce, bellezza, which express positivity and actions linked to the visit. Class 2, in green (10.9%), represents comments published by people who pay attention to what happens on site and give priority to safety, legality and the quality of the services they find. Distinctive words include venditore, abusivo, presenza, peccato, fastidioso, ordine, municipale, polizia, fischietto. There are also many references to commercial activities (bar, bancarella, locale, ristorante, gelateria, trattoria), with particular regard to what one can eat (aperitivo, pizza, granita, gelato, vino) and how (tavolino, tavolo, panchina). This group of words reflects considerations that are not strictly tied to the cultural visit but rather to everything surrounding a tourist excursion. Class 3, in light blue (12.7%), covers themes linked to economic and practical aspects that can in some cases cause discomfort during the visit. Words such as acquistare, prenotare, saltare, fila, coda, interminabile, biglietto, pagare, guida, audioguida, gratis, costo, euro, ticket emerge. The underlying topics concern ticket prices, waiting to get in and the modalities of the visit, with positive or negative connotations depending on the particular situation described by the user. Class 4, in purple (11%), represents those who describe and recount the experience from a cultural point of view, citing historical events, places and figures. The most used words are tomba, re, raffaello, sanzio, chiesa, colonna, fiume, barocco, agone, agnese, borromini, savoia, papa, pagano, cristiano. These are references to places of worship and works (Sant'Agnese in Agone, the Fountain of the Four Rivers, the tombs of the kings kept in the Pantheon, etc.), to the artists who created them (Raffaello Sanzio and Borromini above all), and to the history and the social and cultural context of the sites visited.

Figure 6.2: The arrangement of the words on the factorial plane

The arrangement of the terms on the factorial plane, irrespective of the groups, brings out the continuum of the visit, which begins with booking, the ticket office and the purchase, followed by queuing to enter and the appreciation of the monument's beauty, before visiting and immersing oneself in the artistic aspects and the history of the place. A rough stand-in for this kind of classification is sketched below.
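IRaMuTeQ implements Reinert's descending hierarchical classification; as a substitute for readers without it, the following sketch builds a binary document-term matrix and cuts an agglomerative (ascending) tree into four classes. It is not the Alceste algorithm itself, and the four one-line "reviews" are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = ["fila coda biglietto euro",        # practical/economic aspects
        "venditore abusivo polizia",       # safety and services
        "chiesa barocco raffaello tomba",  # cultural narration
        "roma monumento storia bello"]     # cross-cutting positive terms
X = CountVectorizer(binary=True).fit_transform(docs).toarray()
Z = linkage(X, method="ward")                  # hierarchical tree over the texts
print(fcluster(Z, t=4, criterion="maxclust"))  # class label for each text
```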
7. Conclusions and future developments

The themes brought to light are of clear interest to public administrators, who can hear directly from citizens what the main problems are from the users' point of view. On the basis of this kind of analysis, the decision-maker can assess whether and how to intervene to improve the management of cultural sites and assets. The information flow starts from citizens, who at the end of the process can obtain tangible benefits thanks to the very data they fed into the network. The process described in this work shows a classic use of Big Data: data produced for a specific purpose are subsequently used to achieve other objectives, contributing undeniable added value (Rudder, 2015). The text mining techniques applied have made it possible to exploit information that would otherwise have remained unused. Further and deeper analyses can be conducted with the same methodology and software used in this work. Monitoring can continue, enlarging the corpus for a longitudinal analysis of these same monuments, or studying other cities and other cultural assets, in order to improve management policies and optimize decision-making processes.

References

Alleva G. (2016). Più forza ai dati: un valore per il Paese. Relazione di apertura della 12a conferenza nazionale di statistica.
Bolasco S. (2014). Analisi multidimensionale dei dati. Metodi, strategie e criteri d'interpretazione. Carocci editore.
Bolasco S. (2013). L'analisi automatica dei testi. Fare ricerca con il text mining. Carocci editore.
Ceron A., Curini L., Iacus S.M. (2014). Social Media e Sentiment Analysis. L'evoluzione dei fenomeni sociali attraverso la Rete. Springer Italia.
Iezzi D.F. and Mastrangelo M. (2012). Il passaparola digitale nei forum di viaggio: mappe esplorative per l'analisi dei contenuti. Rivista Italiana di Economia, Demografia e Statistica, 66 (3-4), pp. 143-150.
Larman C. (2005). Applicare UML e i Pattern. Analisi e progettazione orientata agli oggetti. Luca Cabibbo (ed.), Pearson Education Italia.
Rudder C. (2015). Dataclisma. Chi siamo quando pensiamo che nessuno ci stia guardando. Mondadori.

Textometric Exploration of a Corpus of Legal Grounds (motifs) in International Transport Law

Fadila Taleb, Maryvonne Holzem
Université Rouen Normandie – fadila.taleb@etu.univ-rouen.fr, maryvonne.holzem@univ-rouen.fr

Abstract

Within the framework of research whose objective is to respond to a need formulated by the IDIT (Institut du Droit International des Transports), namely to help interpret the case-law texts contained in its database, we seek to highlight interpretive paths considered as modal scenarios. We propose here a preliminary textometric analysis in order to define the linguistic profile of the corpus and to detect certain repeated segments that may represent a relevant constraint to complete and enrich the interpretive paths identified in the case law.

Keywords: textometry, interpretive path, modal scenario, repeated segments, legal grounds, transport law.

1. Introduction

1.1. Context

Within a multidisciplinary project, PlaIR (Plateforme d'Indexation Régionale), computer scientists, linguists and legal scholars address the question of assistance with the interpretation of the case-law holdings of the IDIT database; our aim is that of instrumented assistance centred on the user's activity (cf. the work of the ʋ group: Holzem and Labiche, 2017).
From a linguistic standpoint, our task prior to any implementation consists in studying court decisions with the aim of understanding their structure, the argumentative mechanisms at work, and the movements of textual transformation likely to trigger interpretive paths that can assist the reading of these decisions. Our research departs from predictive models, predictive justice or legaltech, which, under the influence of big data and Machine Learning, produce litigation outcomes on an algorithmic basis. On this point, we share the fears of many jurists that legaltech might "itself become a new form of justice" (Garapon, 2017). What is at stake here is a textual (and intertextual) practice understood as a regime of transformation and interpretation. Our research therefore belongs to jurilinguistics, and its objective is to understand, through a linguistic approach and the study of the textual material, court decisions and the argumentative strategies they deploy, and thereby to assist their interpretation.

1.2. Research question and hypothesis

To assist interpretation, we seek to identify the argumentative strategies implemented by the judge, in particular the way the judge integrates and takes charge of the discourse of others (the parties to the trial, the experts, the legislator, etc.). Our hypothesis rests on earlier research (Holzem, 2014, and Taleb, 2014, the latter a research master's thesis in language sciences defended in June 2014: "Étude du scénario modal et du syllogisme juridique pour la compréhension du processus de production du texte. Cas des textes du droit."), which showed the value of taking linguistic modalities into account, following the model developed in Gosselin (2010), for constructing an interpretive path (Rastier, 2001) envisaged here as a modal scenario capable of assisting interpretation. But before undertaking such a textual analysis directly on the full texts, we needed to grasp, in its entirety and in its specificities, the linguistic profile of our study corpus. To this end we carried out an in-depth textometric analysis with the TXM software. In the course of our textometric investigations, we came to appreciate the importance of certain functions offered by these tools for detecting, for example, repeated segments (sequences of identical graphical forms attested several times in the text), which can represent a relevant constraint to complete the interpretive paths identified through the study of modalities. The objective of this article is to present, in broad outline and for reasons of space, the textometric analysis carried out on our corpus.

2. Corpus and methodology

2.1. Overall description

Following Rastier (2011), we adopted genre as the defining criterion of the reference corpus. It gathers texts (court decisions) belonging to judicial discourse (discourse produced by, or within, courts, to be distinguished from legal discourse, which designates, among other things, the domains of law or its sources, statutes, regulation, etc.: the one concerns the creation of law, the other its application) and to the jurisprudential genre (jurisprudence here meaning the "set of decisions rendered by the courts of a country, over a certain period, in a certain manner", Dictionnaire du vocabulaire juridique, LexisNexis, 2017, p. 322).
Taking up the corpus typology proposed by B. Pincemin (1999) and adopted by Rastier (2011), we distinguished four levels of corpus: (i) an existing/latent corpus (archives, for Rastier), which in our research corresponds to the IDIT database; (ii) a reference corpus, the set of digitized documents in the IDIT case-law holdings; (iii) a study corpus containing a delimited number of decisions selected for the needs of our research; and finally (iv) a distinguished corpus (election corpus, or sub-corpus for Rastier) corresponding to specific passages of the texts studied, called "les motifs" (the grounds). The grounds constitute the heart of the judgment, the judge setting out "(…) the reasons of fact and of law that justify the decision (…)" (Cohen and Pasquino, 2013). Our interest in this textual zone is doubly motivated. First, our objective is to locate the key moments of transformation in the judgment, in order to identify the argumentative strategies at work and thus assist their interpretation. Second, the statement of grounds is a component common to all decisions of all courts (which is not the case for the other components: the account of the dispute, for instance, does not appear in rulings of the Cour de cassation which, as the supreme court, oversees the correct application of legal norms and regards the trial judges' assessment of the facts as sovereign). It must meet a twofold requirement, logical and persuasive: the one stems from the imposed syllogistic form of legal reasoning, the other from the need to persuade the decision's audience (composed, certainly, of the parties directly concerned by the decision, but also of the judges of other courts and of a still wider public, the universal addressee), so as to avoid appeals and have the legal solution accepted as the only one possible.

2.2. Quantitative characteristics

The textual volume of the study corpus is 878,848 occurrences for 22,456 forms. The sub-corpus of grounds alone accounts for nearly half of the occurrences of the study corpus: it contains 393,092 occurrences for 14,742 forms. The asymmetry in the distribution of forms across the delimited zones shows the importance and the role of the grounds in court decisions: they are their raison d'être, and every judge is under an obligation to give reasons for the judgment.

2.3. Encoding and pre-processing

Our corpus has the advantage of being accessible online. However, the set of texts in PDF format is not homogeneous: some documents come from image formats that have not been through OCR (Optical Character Recognition, the step needed to decipher the shapes and render them here as letters).
Since the PDF format is not supported by TXM, we first converted the files (using OCR for the annotated and scanned ones) to TXT format, and then, in a second step, encoded them in XML, following the recommendations of the TEI (Text Encoding Initiative, a standard set of recommendations for encoding digital documents) for textual data. This encoding allows finer navigation within the corpus thanks to peritextual metadata, such as the type of court (commercial court (TC), court of appeal (CA), Cour de cassation (CC)), the date and the place, and intratextual metadata, such as those marking specific parts of the texts. We identified four main parts: facts, pleas, grounds, conclusions (faits, moyens, motifs, conclusions). The grounds and the conclusions are present in all the decisions studied. The facts are absent from CC rulings, and the pleas are not always flagged as such in CA rulings, being often recalled in the facts zone in the form of indirect discourse. The following figure represents the different phases of preparing the corpus before its textometric processing:

Figure 1: The stages of corpus preparation

For the move from TXT to XML-TEI we created tags specific to the genre of the corpus under study (such as <motifs>, <faits>, etc.). We used semi-automatic encoding by means of a tagger designed specifically for our study by Eric Trupin, associate professor of computer science at the LITIS laboratory (Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes, Université Rouen Normandie). This indispensable corpus-preparation step was both time-consuming and delicate: handling handwritten annotations and cleaning older documents. A minimal sketch of the encoding step is given below.
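The snippet below emits one decision with genre-specific XML tags. The tag and attribute names follow the four parts and metadata mentioned above, but the exact schema used in the project is an assumption of ours:

```python
import xml.etree.ElementTree as ET

def encode(decision_id, juridiction, date, parts):
    """Wrap the four textual zones of one decision in genre-specific XML tags."""
    text = ET.Element("text", id=decision_id, juridiction=juridiction, date=date)
    body = ET.SubElement(text, "body")
    for name in ("faits", "moyens", "motifs", "conclusions"):
        if name in parts:  # e.g. CC rulings have no 'faits' zone
            ET.SubElement(body, name).text = parts[name]
    return ET.tostring(text, encoding="unicode")

print(encode("ca-rouen-2013-10-03", "CA", "2013-10-03",
             {"motifs": "Attendu que ...", "conclusions": "Par ces motifs ..."}))
```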
3. Textometric exploration of the distinguished corpus: the grounds zone

3.1. Occurrence study: lexical specificities

A first contrastive study using a flagship textometric treatment, the computation of specificities, gives an overall view of the lexical characteristics of the distinguished corpus, the grounds. (The specificity computation implemented in TXM rests on the hypergeometric law developed by Lafon (1984); the relevance threshold for a distribution is set at 2: at +2 the specificity index is positively significant, at -2 negatively significant, and an index between the two is unremarkable.) The table below lists the 20 forms most specific to this zone, sorted in decreasing order of its specificity index:

Figure 2: lexical specificities of the grounds zone

We note here the heavy use of occurrences characteristic of judicial discourse and constitutive of the grounds zone: Attendu, que, Considérant, attendu (in our corpus the form Considérant appears only in the CA rulings; its absence from the CC rulings is therefore itself significant), and likewise the connectors Mais and donc. Together these markers play a specific role here: they punctuate the judge's argumentation and secure its syllogistic progression. The heavy use of the future tense, represented by the verbs être (sera: 22.9) and condamner (condamnera: 14.6), is not surprising: before pronouncing the final verdict in an exclusively directive act (an utterance reserved for the operative-provisions zone), the judges first set out in the grounds zone the results of their argumentation (as the overuse of the verb résulte (19.6) also shows): "Le jugement entrepris sera confirmé en ses autres dispositions qui ne sont pas critiquées" (CA Rouen, 03/10/2013); "Le tribunal condamnera Monsieur le capitaine du […] ;" (TC de Rouen, 15/12/2003). The significant use of other words, such as équité, marchandises, inéquitable, reflects the theme of the texts studied: transport law. The significant use of the negation adverbs ne (+50.3) and pas (+38.5) reveals a particular characteristic of legal argumentation: faithful to the Spinozist principle Determinatio negatio est, negation manifests a replicative and resultative value (i.e. a referential bearing, answering something stated previously that no longer holds), preparatory to the legal transformation of the utterance. A sketch of the specificity computation follows.
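One common formulation of the Lafon-style specificity index takes its sign from over- or under-representation and its magnitude from the log of a hypergeometric tail probability. A sketch under that assumption; the exact convention in TXM may differ in detail, and the figures in the example are invented:

```python
from math import log10
from scipy.stats import hypergeom

def specificity(f, F, t, T):
    """f: frequency of the form in the subcorpus; F: its corpus frequency;
    t: subcorpus size in tokens; T: corpus size in tokens."""
    p_over = hypergeom.sf(f - 1, T, F, t)    # P(X >= f): over-representation
    p_under = hypergeom.cdf(f, T, F, t)      # P(X <= f): under-representation
    return -log10(p_over) if p_over <= p_under else log10(p_under)

# Toy check using the corpus/subcorpus sizes reported in section 2.2:
print(specificity(f=40, F=60, t=393092, T=878848))  # comfortably above the +2 threshold
```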
La marchandise dépréciée se trouve alors revalorisée (axiologique positif du point de vue juridique) par le changement des co-occurrents à droite (valeur CIF (coût + assurance + fret)). 18 Voire les travaux pionniers de A. Ducrot (1984) sur les valeurs argumentatives de Mais. 19 Les modalités axiologiques sont propres aux jugements de valeur de nature morale, idéologique et/ou légale. (Gosselin, 2010). 762 JADT’ 18 4. Conclusion À travers cette contribution nous avons voulu montrer l’intérêt que représente une étude textométrique pour l’appréhension de son corpus d’étude. Si notre objectif principal, celui de mettre au jour des parcours interprétatifs nommés scénarios modaux (Taleb 2015), est difficilement envisageable en se limitant à une stricte étude textométrique (car elle repose sur l’étude modale propre à chaque texte). L’approche textométrique s’est avérée néanmoins pertinente pour décrire et cerner le profil linguistique du corpus. Son principe différentiel essentiel du point de vue sémantique, nous a incitées à adopter cette démarche d’analyse contrastive indispensable. L’analyse contextuelle à plusieurs paliers nous a permis le repérage de constructions lexicales répétitives, comme l’exemple des « Mais attendu » exposé ici, qui se révèlent être des moments clés du jugement et donc parcours interprétatifs corrélatifs à une transformation modale. Références Cohen M. et Pasquino P. (2013). La motivation des décisions de justice, entre épistémologie sociale et théorie du droit. Le cas des Cours souveraines et des cours constitutionnelles. CNRS, New York University, University of Connecticut. Ducrot A. (1982). Le dire et le dit. Les Éditions de minuit, Paris. Garapon A. (2017). Les enjeux de la justice prédictive. La semaine juridique LexisNexis, N°12: 47-52. Gosselin L. (2010). Les modalités en français. La validation des représentations. Amsterdam-New-York : Rodopi B.V. Holzem M. (2014). Le Parcours interprétatif sous l’angle d’une transformation d’états modaux, dans Numes Correia C. et Coutinho M. A. (eds), Estudos Linguisticos : Linguistic studies , n° 10, p. 283-295. Holzem M. Labiche J (2017) Dessillement numérique : énaction, interprétation, connaissances. Bruxelles, Bern, Berlin : PIE Peter Lang. Lafon P. (1984). Dépouillements et Statistiques en Lexicométrie. SlatkineChampion. Pincemin B. (1999). Diffusion ciblée automatique d’informations : conception et mise en œuvre d’une linguistique textuelle pour la caractérisation des destinataires et des documents, Thése de Doctorat en Linguistique, Universit. Paris IV Sorbonne, chapitre VII. Rastier F. (2001). Art et science du texte. Puf. Rastier 2011 Rastier F. (2011). La mesure et le grain. Paris, Éditions Honoré Champion. Taleb F. (2015). Les modalités linguistiques pour aider à l’interprétation de textes juridiques. Actes Interface TAL IHM (ITI'2015), 22ème Congrès TALn, Caen. JADT’ 18 763 The Framing of the Migrant: Re-imagining a Fractured Methodology in the Context of the British Media. James M. Teasdale Sapienza University of Rome - teasdale.1650019@studenti.uniroma1.it Abstract 1 This study analyses the portrayal of migrants and migration in the British press over two periods, using frame analysis as a foundation methodology, while attempting to improve upon the methodology used in similar studies. The study holds the ‘frame’ to be the key organising feature in the portrayal of migrants and these frames can be located through a cluster analysis of textual data. 
4. Conclusion

With this contribution we have sought to show the value of a textometric study for apprehending one's study corpus. Our main objective, bringing to light interpretive paths called modal scenarios (Taleb, 2015), is hard to achieve within a strictly textometric study, since it rests on the modal analysis specific to each text. The textometric approach nevertheless proved relevant for describing and delimiting the linguistic profile of the corpus. Its differential principle, essential from the semantic point of view, prompted us to adopt this indispensable contrastive approach. Contextual analysis at several levels enabled us to locate repetitive lexical constructions, such as the "Mais attendu" example presented here, which turn out to be key moments of the judgment and hence interpretive paths correlated with a modal transformation.

References

Cohen M. and Pasquino P. (2013). La motivation des décisions de justice, entre épistémologie sociale et théorie du droit. Le cas des Cours souveraines et des cours constitutionnelles. CNRS, New York University, University of Connecticut.
Ducrot O. (1984). Le dire et le dit. Les Éditions de Minuit, Paris.
Garapon A. (2017). Les enjeux de la justice prédictive. La semaine juridique, LexisNexis, n°12: 47-52.
Gosselin L. (2010). Les modalités en français. La validation des représentations. Amsterdam-New York: Rodopi B.V.
Holzem M. (2014). Le parcours interprétatif sous l'angle d'une transformation d'états modaux. In Numes Correia C. and Coutinho M.A. (eds), Estudos Linguisticos / Linguistic Studies, n°10, pp. 283-295.
Holzem M. and Labiche J. (2017). Dessillement numérique : énaction, interprétation, connaissances. Bruxelles, Bern, Berlin: PIE Peter Lang.
Lafon P. (1984). Dépouillements et statistiques en lexicométrie. Slatkine-Champion.
Pincemin B. (1999). Diffusion ciblée automatique d'informations : conception et mise en œuvre d'une linguistique textuelle pour la caractérisation des destinataires et des documents. Thèse de doctorat en linguistique, Université Paris IV Sorbonne, chapter VII.
Rastier F. (2001). Arts et sciences du texte. Puf.
Rastier F. (2011). La mesure et le grain. Paris: Éditions Honoré Champion.
Taleb F. (2015). Les modalités linguistiques pour aider à l'interprétation de textes juridiques. Actes Interface TAL IHM (ITI'2015), 22ème Congrès TALN, Caen.

The Framing of the Migrant: Re-imagining a Fractured Methodology in the Context of the British Media

James M. Teasdale
Sapienza University of Rome - teasdale.1650019@studenti.uniroma1.it

Abstract

This study analyses the portrayal of migrants and migration in the British press over two periods, using frame analysis as a foundational methodology, while attempting to improve upon the methodology used in similar studies. The study holds the 'frame' to be the key organising feature in the portrayal of migrants, and these frames can be located through a cluster analysis of textual data. The first aim of the work is to ascertain how far location and time affect the deployment of one frame or another and what these frames consist of, and thereby to provide a detailed analysis of how migration is portrayed in the British press: a focus sorely lacking in previous frame analysis studies to date. The study demonstrates that six frames can be identified over the two periods, four thematic and two episodic. The 'negative' and 'positive' migrant frames were both present in the first period, as the 'local' focus provided ideal ground for the former's deployment: the subject was located closer to home and was depicted as a threat. The second period saw the dominance of the 'positive' migrant frame following the death of Alan Kurdi and the corresponding conceptual shift to the 'global', which removed the subject from the immediate border and placed them in a wider context. This was coupled with the overlap of the domestic responsibility frame with the 'positive' migrant frame, as the two became intimately linked in the second period, while the European responsibility frame also arose. This demonstrates that the hegemony of one frame can be challenged, but only if the corresponding situation is 'drastic' enough to allow it.

Keywords: migration, frame analysis, cluster analysis, British media, text mining
1. Introduction

1.1 Frame analysis and the migration crisis

Over the last two decades frame analysis has become an increasingly popular tool for analysing the portrayal of a subject in the media, due to its ability to demonstrate the latent and manifest meaning of the news and the recurring themes and elements shared between individual texts (Zhongdang and Kosicki, 1993). According to Entman, 'framing essentially involves selection and salience. To frame is to select some aspects of a perceived reality and make them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described.' (Entman, 1993). A reality is presented to the audience, a reality that can be considered a package of information whose constituent parts together form the frame being deployed (Gamson et al., 1983). One frame is distinguishable from another precisely because this collective package is the sum of its parts. These parts are defined as framing devices and reasoning devices, which are discovered alongside one another, thereby indicating the presence of one frame or another. Framing devices can consist of metaphors, visual images, lexical choices, stereotypes, idioms, etc. (Tankard et al., 2001), which in turn support reasoning devices within the same frame that define the problem, assign responsibility, pass judgement and present possible solutions (Entman, 1993). As a relatively new approach, and apart from its shared inheritance from cognitive psychology (Bartlett, 1932), anthropology (Bateson, 1972) and the seminal work of Erving Goffman (Goffman, 1974), frame analysis remains a fluid approach lacking empirical and methodological consistency across studies. Some authors have even questioned whether the school can, in and of itself, be considered a paradigm, given this diversity (D'Angelo, 2002: 871; Entman, 1993: 51). This paper is not concerned with that contention, but it does strive to arrive at a methodology which incorporates various elements of previous techniques into a complementary approach, thereby minimising the criticism normally fired at the more extreme approaches deployed in the past for their perceived rigidity and shortcomings.

To date very little frame analysis has been directed towards migration, especially in the British context. Despite the migration crisis showing no signs of abating, the response of Europe has generally been characterised by two approaches: (i) strengthening internal and external borders to restrict movement throughout Europe; (ii) disrupting attempted crossings of the Mediterranean. Britain is particularly interesting within this context, not only as a state which has consistently tried to curb entry at an official level, but also because of the media's and public's keen obsession with migration, ultimately exemplified in the Brexit referendum. The media can be considered central to this response. Whether one considers it to embody public opinion or elite opinion, it is nonetheless an incarnation of a country's position and can be seen as acting as an arbiter of that country's opinion. The current migration crisis is as complex as it is pressing, and the 'reality' presented by the media should not be seen as natural, ready to be recorded and transmitted from one human being to another, but rather as something constructed and then transmitted, in line with constructivist theory (Goffman, 1974). The media is therefore able to set the agenda and frame the debate on the migration crisis, in turn affecting the reality in the mind of the population and government. This paper has two aims. The first is to develop a methodology which combines previous qualitative and quantitative approaches in order to improve validity and reliability, while the second is to use said methodology to ascertain how migration is portrayed by the British media and how far this portrayal is affected by factors such as time and geographical focus.
The current migration crisis is as complex as it is pressing, and the ‘reality’ presented by the media should not be seen as natural, ready to be recorded and transmitted from one human being to another, but rather as something that is constructed and then transmitted according to constructivist theory (Goffman 1974). The media is therefore able to set the agenda and frame the debate on the migration crisis, in turn affecting the reality in the mind of the population and government. This paper has two aims in mind. The first is to develop a methodology which combines previous qualitative and quantitative approaches in order to improve validity and reliability while the second is to use said methodology to ascertain how migration is portrayed by the British media and how far this portrayal is affected by factors such as time and geographical focus. 766 JADT’ 18 2. Methodology The study’s methodology was constructed with historical criticisms directed at frame analysis in mind; either that the process is too qualitative and therefore lacks reliability, or that it is conducted too quantitively, and therefore lacks reliability. The first step was to collect the data, which was obtained manually from four daily British newspapers’ online archives (the Daily Express, the Guardian, the Telegraph and the Daily Mail), and included all newspaper articles which included ‘migration’, ‘migrant’, ‘refugee’ etc. in the title, or whose content largely dealt with such topics. The two periods of investigation are 28th to 31st July 2015 and 2nd to 6th September 2015, these dates were chosen in order to ascertain whether frames could be consistently identified across two periods, even in the short term, but also to investigate whether dominant frames can be challenged if events are deemed drastic enough (the tragic death of Alan Kurdi became the dominant news story in the second period, whereas the first was primarily concerned with the Calais crisis). In total 505 were gathered, 160 for the first period and 345 for the second. The quantitative aspect of the study consists of a computer assisted approach, by using cluster analysis to process the data and indicate the presence of ‘frames’. Because, as mentioned above, framing is considered to be the grouping and salience of certain elements to the neglect of others, one can consider the cluster generated by a computer to precisely be a direct indication of the presence of one frame or another, as words are the primary form framing elements assume. The software used was the R program in conjunction with the Iramuteq interface. The clustering method used is that of Reinert (Reinert 1983), whose conception of clusters as a ‘cognitiveperceptive framework’ lends itself perfectly to frame analysis, concerned as it is with discerning different representations of a perceived reality. The second, more qualitative step of the study, was to conduct a deep read of all the texts, where the researcher intuitively coded texts and created a frame matrix which allowed an awareness of the context of the text as well as those framing and reasoning devices which seemed re-occurring and therefore significant. Combined, this allowed the reliability of the initial cluster analysis generated by the computer to be complemented by the in depth familiarity of the researcher, which provided a validity to the interpretation of results. JADT’ 18 767 3. Results Figure 1. Cluster analysis for first period Figure 2. 
3. Results

Figure 1: Cluster analysis for the first period
Figure 2: Cluster analysis for the second period

The two cluster analyses each identify three distinct clusters, yet those identified in the second period vary dramatically with respect to the first. The first period under investigation generated three clusters, which have been labelled the Refugee Cluster (red), the Migrant Cluster (green) and the Calais Crisis Cluster (blue). The second period, however, produced three different clusters: the Migration as a Domestic Issue Cluster (red), the Migration as a European Issue Cluster (green) and the Migrant Crisis Cluster (blue). At first glance these results seem to refute the basis of framing theory: that frames are not produced by the journalist but are deployed from the cultural repertoire the journalist cognitively holds in common with the rest of society (Goffman, 1974). If framing theory is correct, then in the space of one month it would be impossible for frames to mutate completely, and one would expect the clusters identified in the first period to be identical to those found in the second. However, if one distinguishes between issue-specific and generic frames, and between episodic and thematic frames (de Vreese, 2005), the two cluster groups are far more similar than first meets the eye. For instance, the first period produced two frames predominantly concerned with the figure of the migrant and two differing portrayals of the migrant: the migrant as a helpless victim and the migrant as an opportunistic individual. Both can be considered thematic frames, as these clusters do not refer to one story but rather represent a thematic perspective. The third frame, however, can be categorised as an issue-specific frame, concerned as it is only with the Calais crisis, the 'Jungle' camp and the stories of migrants attempting to enter the Channel Tunnel. The second period, similarly, consists of two thematic frames (one that considers migration an issue for the British government, and one that considers it to belong to the realm of European governance) and one episodic frame (the stories relating specifically to the death of Alan Kurdi and to the migrants attempting to move through Hungary and Austria in the early days of September 2015). If the two episodic frames are laid aside, four remain: the 'negative' migrant frame, the 'positive' migrant frame, the domestic responsibility frame and the European responsibility frame. What is interesting to note in the second period is that the 'positive' migrant frame from the first period does not disappear, but overlaps with, and bolsters/is bolstered by, the arising domestic responsibility frame. For example, many of the key terms of the 'positive' migrant frame (vulnerable, refugee, conflict, persecution, support, receive, community, etc.) are emblematic of those found in the so-called domestic responsibility frame (vulnerable, refugee, sanctuary, hazardous, save, help, etc.). This means that rather than 'disappearing', the frame which represents migrants as individuals in need has been combined with the arising domestic responsibility frame. However, this does not account for the disappearance of the 'negative' migrant frame. The reason for this lack of presence, and likewise for the merging of the 'positive' migrant frame and the domestic responsibility frame in the second period, is the shock events linked to the tragic death of Alan Kurdi on September 2nd 2015.
The event seems to have made the deployment of the 'negative' migrant frame untenable in the second period, while the 'positive' migrant frame persisted as the period proved more fertile for this perspective. This is one reason why the two frames overlapped in the second period: the outrage and shock at the death of a toddler ultimately led to the solution of the 'positive' migrant frame being located in the domestic responsibility frame. Interestingly, this overlap did not occur with the European responsibility frame, which may be due to British political actors (the majority of those interviewed across the articles) actively positioning themselves as ready to help migrants in order to show themselves in a positive light. Another interesting finding is how location affected, or at least was linked to, the change in hegemony between the 'positive' and 'negative' migrant frames. In the first period, the obsession with the Calais crisis (demonstrated by the presence of the corresponding episodic frame) seemingly provided conceptual ground in which the 'negative' migrant frame could flourish, whereas the second period, dominated as it was by news of the death of Alan Kurdi (and by the presence of a more international episodic frame), ensured the continued presence of the 'positive' migrant frame. One reason could be that when the migrant is located nearer the British border, the 'negative' migrant frame (characterised by terms such as arrest, siege, repel, overwhelm) is more easily deployed, owing to the greater unease about foreign migrants entering the country, whereas when the focus was positioned more globally this unease was overcome by the moral shock of Alan Kurdi's death, lessening the appropriateness of the previous frame.

Despite demonstrating some continuity of frames across the two periods, that geographical focus affects the deployment of one frame or another, and that shock events can shift the frames in play to a great extent, the study is not without shortcomings. Firstly, the two time periods, each limited to four days, greatly reduce the data available. This in turn makes it impossible to understand how robust the identified frames are across an extended period of time, and whether other frames come into play depending on the specific moment or the dominant news story. One solution could be to extend the time frame, but this might in turn reduce validity and insight, given the limitations of the researcher in dealing with the data to the same extent as a computer. The second issue, as already mentioned, is determining precisely the characteristics of one frame in relation to another. One possible solution would be to predetermine the terms identified as framing elements or reasoning devices as variables in the cluster analysis, which would limit the identification of episodic frames in favour of thematic frames and, over a longer period, more clearly define the continuation, and the fluctuation in presence, of identified frames. The drawback, however, is that the subjectivity of the researcher arguably enters at too early a stage and harms the validity of the methodology.
A third point is that, although the cluster analysis did capture many of the framing devices (as they are commonly exhibited as words), it was unable to capture all of them (for instance accompanying images) and was largely unable to identify the presence of reasoning devices (as the unit of analysis needs to be bigger than single word choice).

References
Bartlett, F. (1932). Remembering: A Study in Experimental and Social Psychology. Cambridge University Press.
Bateson, G. (1972). Steps to an Ecology of Mind: Collected Essays in Anthropology, Psychiatry, Evolution, and Epistemology. University of Chicago Press.
D'Angelo, P. (2002). News Framing as a Multiparadigmatic Research Program: A Response to Entman. Journal of Communication, 52(4): 870-888.
Entman, R.M. (1993). Framing: Toward Clarification of a Fractured Paradigm. Journal of Communication, 43(4): 51-58.
Gamson, W.A. and Lash, K.E. (1983). The Political Culture of Social Welfare Policy. In S.E. Spiro and E. Yuchtman-Yaar (eds), Evaluating the Welfare State: Social and Political Perspectives. Academic Press.
Goffman, E. (1974). Frame Analysis: An Essay on the Organization of Experience. Harper and Row.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2): 187-198.
De Vreese, C.H. (2005). News Framing: Theory and Typology. Information Design Journal and Document Design, 13(1): 51-62.
Pan, Z. and Kosicki, G.M. (1993). Framing Analysis: An Approach to News Discourse. Political Communication, 10(1): 55-75.
Tankard, J.W. and Severin, W.J. (2001). Communication Theories: Origins, Methods and Uses in the Mass Media, 5th Edition. Pearson.

Results from two complementary textual analysis software packages (Iramuteq and Tropes) to analyze social representations of contaminated brownfields
Marjorie Tendero¹, Cécile Bazart²
¹University of Rouen – CREAM and Agrocampus Ouest – marjorie.tendero@agrocampusouest.fr
²University of Montpellier – CEE-M – cecile.bazart@umontpellier.fr

Abstract
The aim of this paper is to demonstrate the complementarity of two types of textual analysis software, Iramuteq and Tropes, in analyzing a corpus of data extracted from an open-ended question of a national cross-sectional survey. The descending hierarchical classification performed with Iramuteq leads to fewer and more homogeneous discourse classes than the reference fields produced with Tropes. The reference fields, in turn, reveal how the themes of the corpus identified with Iramuteq are articulated.

Résumé
Cette communication présente l'apport complémentaire de deux logiciels d'analyse de contenu, Iramuteq et Tropes, pour analyser les représentations sociales à partir de réponses données à une question ouverte dans un questionnaire d'enquête. Il montre que les classifications hiérarchiques descendantes opérées à l'aide du logiciel Iramuteq peuvent être approfondies de façon complémentaire à l'aide des classifications sémantiques par univers de références et l'outil scénario du logiciel Tropes. Les classes de discours sont moins nombreuses et plus homogènes que les univers de références mis en évidence par le logiciel Tropes. Ces derniers montrent l'articulation des thématiques du corpus.

Keywords: Brownfield; Classifications; Iramuteq; textual data analysis; Tropes.

1. Introduction
Content analysis covers the techniques that allow a systematic and objective analysis of written and oral communication.
It is a multidisciplinary approach combining quantitative and qualitative methods, with a wide range of application domains: communication sciences, sociology, psychology, computer science and economics, for example. These techniques study the structure of a text or a discourse, as well as its logic, in order to bring out the context in which it is produced and its actual meaning on the basis of objective data. They make it possible to process answers to open-ended questions by grounding the interpretation of the phenomenon under study in quantitative, objective criteria (Garnier and Guérin-Pace 2010). When analysing answers to open-ended questions, one advantage of these methods is that they avoid the biases linked to a posteriori thematic coding. The approach is nevertheless open to criticism, in particular regarding the steps required to prepare the corpus, run the analyses and interpret the results. During corpus preparation, a lemmatisation may be carried out; yet lemmatisation sometimes conflates forms whose use, in a given context, leads to misreadings (Lemaire 2008). This is the case when a plural form is lemmatised to the singular. Moreover, the expression dictionaries used by the software may fail to capture modality markers such as negation (Fallery and Rodhain 2007). Furthermore, differences interpreted in terms of content may in fact stem from social differences in the way individuals express themselves orally or in writing. Problems of homonymy, polysemy and synonymy can therefore lead to the construction of distinct lexical classes which are heterogeneous in form but in fact very similar in substance; this is the case of opinions expressed through periphrases, paraphrases or ellipses. Particular attention must therefore be paid to the treatment of ambiguities in order to avoid any misinterpretation. For this reason, it is worth combining two complementary approaches, and hence two different content analysis software packages, which helps ensure the validity of the results (Vander Putten and Nolen 2010; Lejeune 2017). This is, for example, what was done on a corpus of interviews to compare the Nvivo and Wordmapper packages (Peyrat-Guillard 2006). In this paper we highlight the complementary contribution of the Iramuteq and Tropes packages for analysing the social representations associated with polluted brownfields, based on the answers given to an open-ended question in a survey administered at national level to 803 individuals living in a municipality affected by this type of land. The next section presents the methodology adopted, the data collected and the analyses carried out. In the third section we present the results obtained with Iramuteq, and in the fourth section those obtained with Tropes. In a final section we discuss the complementary contributions of these two packages for the study of social representations through the analysis of answers to an open-ended question.
2. Methodology
We designed a questionnaire to study individual perception of the risk of soil pollution, together with the representations and perceptions relating to urban brownfields and their redevelopment. The questionnaire was administered to residents of municipalities affected by a polluted brownfield¹. In total, 803 complete responses were collected across 503 municipalities affected by the presence of a polluted brownfield. To analyse the social representations associated with polluted brownfields, we used the following open-ended question: "What do you associate with the expression friches urbaines (urban brownfields)?". We carried out a textual data analysis, since this technique is particularly well suited to the study of individual or social representations, as it captures the representational and cognitive dynamics of a phenomenon (Abric 2003; Beaudouin and Lahlou 1993; Kalampalikis 2005; Negura 2006). All questions were compulsory. Nevertheless, not all participants managed to answer them: some responses were merely a string of characters entered to move on to the next question. In addition, this open-ended question was located in the second part of the questionnaire, which was relatively long; this resulted in attrition. We therefore excluded these responses from our analysis. In total, 539 responses could be retained, i.e. 67.12% of the responses collected. The data were first formatted for analysis with the IRaMuTeQ software (Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires), version 0.7 alpha 2. This is free software developed by Pierre Ratinaud at LERASS (Laboratoire d'Études et de Recherche Appliquées en Sciences Sociales) and distributed under the GNU GPL (v2) licence (Baril and Garnier 2015; Ratinaud and Déjean 2009). Table 1 shows an extract of the responses analysed.

Table 1: Extract of the analysed corpus. Each line carries a respondent identifier, the associated metadata modalities (e.g. percept_eleve, danger_oui, confiance_non, sexe_f, age_4059, reg_als) and the free-text answer. The four answers in the extract read: "Abandonnée, sale, nuisible"; "Zones non_habité"; "Un jardin en ville, laissé à l'abandon."; "zone abandonnée, zone polluée ville".

¹ These municipalities were identified from an extraction of the BASOL database on polluted (or potentially polluted) sites and soils calling for preventive or remedial action by the public authorities.
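For readers unfamiliar with the corpus format shown in Table 1, the following is a minimal sketch of how such a file can be produced; the variable names and answers are taken from the extract above, and the output file name is an assumption.

```python
# Write survey answers in Iramuteq's corpus format: one '****' header line
# of starred variable_modality pairs per text, followed by the raw answer.
answers = [
    {"id": "0001", "sexe": "h", "age": "4059", "reg": "centre",
     "texte": "Abandonnée, sale, nuisible"},
    {"id": "0003", "sexe": "f", "age": "4059", "reg": "als",
     "texte": "Un jardin en ville, laissé à l'abandon."},
]

with open("corpus_friches.txt", "w", encoding="utf-8") as out:
    for a in answers:
        out.write(f"**** *id_{a['id']} *sexe_{a['sexe']} "
                  f"*age_{a['age']} *reg_{a['reg']}\n")
        out.write(a["texte"] + "\n\n")
```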
The analysed text corpus has the characteristics described in Table 2 below.

Table 2: Descriptive statistics of the analysed "friche" corpus
Number of responses: 539
Number of words (occurrences): 2,177
Mean number of words per response: 4.04
Number of active forms (total): 1,537
Number of supplementary forms (total): 640
Number of hapax: 275
Number of forms: 482
Number of distinct active forms: 402
Number of distinct supplementary forms: 80

We compare the following analyses: the descriptive statistics and descending hierarchical classification carried out with Iramuteq, and the reference universes and scenario tool of Tropes. The latter is a semantic text analysis package created in 1994 by Pierre Molette and Agnès Landré on the basis of Rodolphe Ghiglione's work on propositional discourse analysis (Molette, Landré and Ghiglione 2013).

3. Results of the Iramuteq analysis
3.1. Descriptive statistics
The table below lists the terms most frequently used by respondents (frequency ≥ 20) when evoking polluted brownfields. The latter are "terrains" (plots of land, 99 occurrences) or "zones" (36) left to "abandon" (abandonment, 106). They are plots on which former "usines" (factories, 29), now "désaffectées" (disused, 17), once stood.

Table 3: Most frequently used terms (descriptive statistics from Iramuteq)
Active form | Count | Type
Abandon | 106 | Noun
Terrain | 99 | Noun
Laisser | 63 | Verb
Abandonner | 49 | Verb
Ville | 46 | Noun
Zone | 36 | Noun
Terrain vague | 34 | Noun
Usine | 29 | Noun
Pollution | 28 | Noun
Ancien | 28 | Adjective
Espace | 25 | Noun
Bâtiment | 25 | Noun
Sol | 20 | Noun

3.2. Descending hierarchical classification
65.49% of the responses are classified into four categories. Table 4 below indicates the significance of the terms associated with each class. The first class groups terms referring to former industrial activities. The second class relates to the problem of waste management in urban areas, evoking "décharges" (dumps), "saletés" (filth) and "pollution". The third class corresponds to terms characterising this type of space. The fourth class refers to the natural spaces that brownfields represent, in particular in the case of agricultural plots left fallow.

4. Complementary results provided by Tropes
We formatted the corpus for analysis with Tropes. The analysis of reference universes allows us to bring out the main themes of the text by grouping terms into classes of semantic equivalents. Table 5 below presents the results obtained for the reference universes with Tropes. The classifications are given in decreasing order and indicate the number of terms attached to each universe. These classifications do not always cover all the terms used in the corpus: only the most significant nouns of the text appear in them. It is nevertheless possible to configure these classifications through the software's scenario mode; Figure 1 shows an extract of it.
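To make the notion of reference universes concrete, here is a toy illustration of the grouping principle implemented in Tropes: terms are mapped onto broader semantic classes through a dictionary. The mini-dictionary below is hand-made from the examples of Table 5; Tropes ships its own, much larger, dictionaries of semantic equivalents.

```python
# Map corpus terms onto 'reference universe'-like semantic classes and
# count how many terms fall into each class.
from collections import Counter

universe_of = {
    "ville": "Ville", "taudis": "Ville", "zone urbaine": "Ville",
    "usine": "Industrie", "zone industrielle": "Industrie",
    "déchet": "Déchet", "détritus": "Déchet",
    "jachère": "Agriculture", "culture": "Agriculture",
}

def reference_universes(terms):
    """Count corpus terms per semantic universe; unknown terms are skipped."""
    return Counter(universe_of[t] for t in terms if t in universe_of)

print(reference_universes(["usine", "déchet", "ville", "jachère", "friche"]))
```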
5. Discussion and conclusion
Table 6 details the respective advantages and constraints of these two software packages for analysing the social representations of polluted brownfields. In particular, the semantic classification by reference universes and the scenario tool bring out classes that are more numerous and less homogeneous than those of the descending hierarchical classification performed with Iramuteq.

Table 4: Results of the descending hierarchical classification with Iramuteq (most characteristic active forms per class, with χ² where legible; p < 0.05 unless marked NS)
Class 1 (39.7%), former industrial activities: abandonner (58.95), usine (42.73), ancien (29.1), bâtiment (28.82), industriel (22.13), polluer (17.24), désaffecté (15.66), site (15.66), immeuble (14.05), zone (13.72), industrie (10.9), lieu (10.87), non_construit (9.29), endroit (7.83), vieux (7.72).
Class 2 (15%), waste management problems in urban areas: pollution (151.38), sol (59.79), followed by saleté, décharge, terre and culture.
Class 3 (33.7%), abandoned and unused area: terrain (107.94), abandon (84.27), followed by laisser, friche, milieu_urbain, sauvage, ville, repos, désert, aller and non_utilisé (the last NS).
Class 4 (11.6%), agricultural fallow space: espace (114.47), nature (62.29), vert (46.45), libre (41.31), non_exploité (38.60), champ (30.79), non_entretenu (24.89), entretenir (5.81) and non_cultivé (2.71, NS).

Table 5: Main reference universes associated with the corpus (number of attached terms and examples of associated terms)
First set of reference universes: Ville (74: ville, taudis, zone urbaine), Lieu (59: zone), Habitat (55: bâtiments, immeubles, logement, appartements), Industrie (50: zone industrielle, industrie, usine), Immeuble (36: bâtiments, immeuble), followed by Pollution, Écologie, Déchet, Agriculture and Terre (between 20 and 33 terms each).
Second set of reference universes: Ville (73: ville, taudis, milieu urbain, zone urbaine), Lieu (59: site, zone, lieu), Industrie (50: industrie, zone industrielle, usines), followed by Bâtiment, Pollution, Plantes, Déchet, Agriculture and Terre (between 20 and 39 terms each); associated terms include polluant, pollution, dépotoir; végétation, herbe, ronce; déchet, détritus; jachère, cultures; sols, terre.

Figure 1: Extract of the scenarios in Tropes (ascending order)

This tool makes it possible to deepen and validate the interpretation based on the descending hierarchical classification carried out with Iramuteq. The two packages therefore appear complementary. This complementarity nevertheless remains to be verified on other types of corpus (interviews, for example). Finally, to study the social representations of polluted brownfields among populations affected by this type of site, it would be interesting to identify the emotional and affective lexicon used, for instance with EMOTAIX (Piolat and Bannour 2009). This would make it possible to better capture the affective dimension of behavioural intentions towards this type of site.
Table 6: Comparison of the functionalities of Iramuteq and Tropes for the analysis of social representations
Text segmentation: Iramuteq, text segments; Tropes, canonical propositions.
Text style: Tropes only.
Staging (mise en scène): Tropes only.
Episodes and bursts (rafales): Tropes only.
Classifications: Iramuteq, descending hierarchical classification; Tropes, reference universes.
Scenario tool: Tropes only.
Descriptive statistics: Iramuteq only.
Similarity analysis: Iramuteq; indirectly available in Tropes through word graphs (area or star layouts).
Specificity analysis and correspondence factor analysis: Iramuteq only.
Prototypical analysis: Iramuteq only.
Main asset for the study of social representations: Iramuteq, richness of the analyses and results; Tropes, less constraining corpus formatting.
Main drawback for the study of social representations: Iramuteq, lengthy corpus formatting; Tropes, automated lemmatisation and classification yielding results that are not very readable.

References
Abric, J.-C. (2003). Méthodes d'étude des représentations sociales. ERES.
Baril, É. and Garnier, B. (2015). Utilisation d'un outil de statistiques textuelles : IRaMuteQ 0.7 alpha 2. Interface de R pour les analyses multidimensionnelles de textes et de questionnaires. Institut National d'Études Démographiques.
Beaudouin, V. and Lahlou, S. (1993). L'analyse lexicale : outil d'exploration des représentations. Cahier de Recherche C (48): 25-92.
Fallery, B. and Rodhain, F. (2007). Quatre approches pour l'analyse de données textuelles : lexicale, linguistique, cognitive, thématique. In XVIème Conférence de l'Association Internationale de Management Stratégique. Montréal, Canada.
Garnier, B. and Guérin-Pace, F. (2010). Appliquer les méthodes de la statistique textuelle. Les collections du CEPED (Centre Population et Développement). Paris: CEPED.
Kalampalikis, N. (2005). L'apport de la méthode Alceste dans l'analyse des représentations sociales. In J.-C. Abric (ed.), Méthodes d'étude des représentations sociales, 147-163. ERES.
Lejeune, C. (2017). Analyser les contenus, les discours, ou les vécus ? À chaque méthode ses logiciels ! In Les méthodes qualitatives en psychologie et sciences humaines de la santé, Dunod, 203-224.
Lemaire, B. (2008). Limites de la lemmatisation pour l'extraction de significations. In 9èmes Journées Internationales d'Analyse Statistique des Données Textuelles, 725-732. Lyon, France.
Molette, P., Landré, A. and Ghiglione, R. (2013). Tropes. Version 8.4. Manuel de référence. http://tropes.fr/doc.htm.
Negura, L. (2006). L'analyse de contenu dans l'étude des représentations sociales. SociologieS, Théories et recherches (October).
Peyrat-Guillard, D. (2006). Alceste et WordMapper : l'apport complémentaire de deux logiciels pour analyser un même corpus d'entretien. In Journées d'Analyse Statistique des Données Textuelles, 725-736. Besançon, France.
Piolat, A. and Bannour, R. (2009). EMOTAIX : un scénario de Tropes pour l'identification automatisée du lexique émotionnel et affectif. L'Année Psychologique, 109(4): 655. https://doi.org/10.4074/S0003503309004047.
Ratinaud, P. and Déjean, S. (2009). IRaMuTeQ : implémentation de la méthode ALCESTE d'analyse de texte dans un logiciel libre. Modélisation Appliquée aux Sciences Humaines et Sociales (MASHS), 8-9.
Vander Putten, J. and Nolen, A.L. (2010). Comparing Results from Constant Comparative and Computer Software Methods: A Reflection About Qualitative Data Analysis. Journal of Ethnographic and Qualitative Research, 5: 99-112.
Acknowledgements
We thank Jean-Marc Rousselle for administering this questionnaire online with LimeSurvey. This survey benefited from the financial support of SRUM 2015, the University of Montpellier, the CEE-M (LAMETA), ADEME, the Région Pays-de-la-Loire and the CREAM (University of Rouen).

Multilingual Sentiment Analysis
Matteo Testi¹, Andrea Mercuri¹,², Francesco Pugliese¹,³
¹Deep Learning Italia – m.testi@deeplearningitalia.com
²Tozzi Institute – a.mercuri@deeplearningitalia.com
³Italian National Institute of Statistics – francesco.pugliese@istat.it

Abstract
In recent years, Sentiment Analysis (SA) has attracted significant attention in different areas of research and business. This is because "sentiments" can influence the opinions of product vendors, politicians and the public at large. The sentiments of users are generally categorised into three classes: negative, positive or neutral. Lately, more and more Deep Learning (DL) models have been applied to SA thanks to their capacity for automatic high-dimensional feature extraction. However, supervised DL models are data-hungry, and the shortage of sentiment data sets in specific languages (other than English) is a serious issue. In order to address this multilingual shortage of training sets we propose a very deep Recurrent Convolutional Neural Network (RCNN) which achieves state-of-the-art accuracy in sentiment classification. By extracting keywords from the final max-pooling layer we are able to create a corpus of domain-specific keywords. Exploiting these "discriminative" extracted words, we scrape a long sequence of sentences (in two different languages) with which to feed a Neural Machine Translation model. A sequence-to-sequence model with attention and beam search has been implemented to translate sentences from one language (i.e. English) into another (i.e. Italian). As an example, we train our RCNN on an English Twitter sentiment training set and extract keywords to generate the machine translation model. During the test stage, we translate our test sentences (i.e. tweets) into another language for which we have a poor training set (i.e. Italian). Results highlight a significant accuracy gain of this technique with respect to a model trained exclusively on a poor training set expressed in a language other than English.

Keywords: sentiment, analysis, multilingual, deep, learning, recurrent, convolutional, neural, machine, translation

1. Introduction
In recent years, Sentiment Analysis (SA) has attracted significant attention in different areas of research and business. This is mainly due to the fact that "sentiments" (which are exhibited on the web by users) can affect the opinions of product vendors, politicians and readers in general, namely public opinion. According to one of the most accredited definitions, Sentiment Analysis is the field of study that analyses people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organisations, individuals, issues, events, topics, and their attributes (Qurat Tul Ain et al., 2017; Liu, 2012). This user point of view is usually expressed in the unstructured form of an opinion, review, news item, disapproval, etc. The rising demand for SA comes from the need to summarise the general direction of user opinions from social media (Haenlein et Kaplan, 2010).
In fact, aggregate data from Sentiment Analysis can represent valuable information for orienting decisions in politics, digital marketing or finance. SA thus arises as a multidisciplinary field joining computational linguistics, information retrieval, semantics, natural language processing and artificial intelligence in general (Aydogan et Akcayol, 2016). Ultimately, SA can be seen as the process of automatically categorising utterances into three different classes: negative, positive or neutral. Generally these sequences of text come from social networks, opinion websites, e-commerce feedback, etc. Twitter is one of the most useful microblogging platforms for Sentiment Analysis and Opinion Mining, since it offers a very good API for downloading tweets and it is very popular amongst different categories of people (Pak et Paroubek, 2010). Traditionally, SA is a text classification problem and relies on two kinds of approaches: a) "lexicon-based", which is usually applied to problems without a training set; this technique generally makes use of a fixed number of keywords to orient the classification process by means of classifiers such as k-Nearest Neighbours (k-NN) or Hidden Markov Models (HMM); b) "machine learning-based", where the extracted features typically consist of Part-of-Speech (POS) tags, n-grams, bi-grams, uni-grams and bags-of-words, and classification can be performed by Naïve Bayes or Support Vector Machines (SVMs) (Singh et al., 2016); a minimal sketch of this second approach is given at the end of this section. Traditional lexicon-based approaches are no longer effective in combination with modern textual Big Data corpora, especially as far as sentiment is concerned. Machine learning approaches, on the other hand, can be supervised or unsupervised (less common) and can be automated over enormous corpora of data, which is a critical requirement for reliable Sentiment Analysis. Deep Learning is a branch of Machine Learning proposed by G.E. Hinton in 2006 which adopts Deep Neural Networks for text classification (Hinton et Salakhutdinov, 2006). Deep Learning enhances traditional neural networks by introducing thousands of neurons, millions of connections, new regularisation techniques (dropout, data augmentation, batch normalisation), new pre-processing (skip-gram, word embeddings, etc.) and different new models, both supervised and unsupervised: Convolutional Neural Networks (CNN) (Krizhevsky et al., 2012), Deep Belief Networks (DBN) (Hinton et al., 2006) and many more. Lately, more and more Deep Learning (DL) models have been applied to SA thanks to their capacity for automatic high-dimensional feature extraction (Vateekul and Koomsubha, 2016). For instance, in Financial Sentiment Analysis (FinTech), Deep Learning has contributed to investigating how to harness different media and financial resources in order to improve the accuracy of stock price forecasting (Day et Lee, 2016). The experimental results show how news sentiment categorisation, by means of Deep Neural Networks, has different effects on investors and their investments. However, SA is a challenging field due to the lack of supervised data and to the inherently subjective nature of sentiments. In this work we tackle one of the biggest problems for modern machine learning-based Sentiment Analysis: the shortage of data sets in specific less common languages (Italian, German, etc.).
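The sketch announced above follows. It is a minimal illustration of the "machine learning-based" baseline described in the text (n-gram features plus an SVM), assuming scikit-learn is available; the tweets and labels are placeholders.

```python
# Bag-of-words / n-gram features fed to a linear SVM: the classical
# machine learning-based sentiment baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

tweets = ["I love this phone", "worst service ever",
          "great match today", "this update is terrible"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # uni-grams and bi-grams
    LinearSVC(),
)
clf.fit(tweets, labels)
print(clf.predict(["what a terrible phone"]))  # classify an unseen message
```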
In order to address the classification of sentiments we examined some state-of-the-art text classifiers. Many deep learning models have been employed in Sentiment Analysis previously, such as those devised at Stanford University: Recursive Neural Networks (RNNs) (Socher et al., 2011b) and Recursive Neural Tensor Networks (RNTNs) (Socher et al., 2013). Furthermore, Stanford released the Sentiment Treebank, the first corpus with fully labelled parse trees for training RNTNs. RNTNs reach an accuracy ranging from 80% up to 85.4% on the Sentiment Treebank's test set. Although recursive models are very efficient in terms of constructing sentences' sentiment representations, their performance heavily depends on the performance of the textual tree construction. Constructing such a textual tree exhibits a time complexity of at least O(n^2), where n is the length of the text. For this reason, we decided to make use of a Recurrent Convolutional Neural Network (RCNN) (Lai et al., 2015), which achieves a rather competitive accuracy in sentiment classification with respect to recursive models. RCNNs exploit a recurrent structure to capture as much contextual information as possible when learning word representations, which may introduce considerably less noise compared to traditional window-based neural networks. Moreover, a time complexity of O(n) is a big added value of RCNNs. To support Multilingual Sentiment Analysis, a Neural Machine Translation (NMT) model has been employed to translate sentences from one language (i.e. English) into another (i.e. Italian). Basically, an NMT model is a neural network structured in an encoder-decoder pattern, which has emerged as a competitive alternative to traditional Statistical Machine Translation (SMT). The encoder consists of two independent recurrent networks: a "forward" one, which reads the sentence in its natural order, and a "backward" one, which reads the sentence in reverse order. The decoder, instead, is an RNN capable of composing the translated sentence. This sequence-to-sequence model can be trained on a training set made of pairs of sentences: the first expressed in the source language and the second in the target language (Cho et al., 2014).

2. Materials and Methods
The novelty of our Recurrent Convolutional Neural Network, with respect to the original paper, is that we introduced two recurrent models called Long Short-Term Memories (LSTM) instead of simple RNNs. These two LSTMs bi-directionally scan the text. The topology of the RCNN (see Fig. 1) is intentionally designed to capture the context of each word (see the original paper for further details); a code sketch of this topology is given after Fig. 1. The RCNN has been trained on a corpus of 1.6 million tweets composed from various SemEval training sets (Strapparava et Mihalcea, 2007) and divided into positives (800k) and negatives (800k). To feed textual sequences into the neural network we add a pre-trained embedding layer on top (Mikolov et al., 2013). The embedding layer, which has been pre-trained on an English Wikipedia corpus, transforms indexed words into numerical vectors. Embedding vectors are characterised by a semantic relationship amongst them according to a chosen metric, a cosine distance in this case. The size of the embedding vectors is 300.

Figure 1: The structure of the RCNN scanning the sentence "A sunset stroll along the South Bank affords an array of stunning vantage points" (Lai et al., 2015).
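Below is a minimal Keras sketch of an RCNN-style classifier in the spirit of Lai et al. (2015) and of the bi-directional LSTM variant described above: bi-LSTM contextual states are concatenated with the word embeddings, passed through a dense layer and max-pooled over time. The vocabulary size, sequence length and layer widths are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab, seq_len, emb_dim = 50_000, 50, 300

words = layers.Input(shape=(seq_len,), dtype="int32")
emb = layers.Embedding(vocab, emb_dim)(words)            # pre-trained in the paper
ctx = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(emb)
rep = layers.Concatenate()([ctx, emb])                   # [left ctx; word; right ctx]
latent = layers.TimeDistributed(layers.Dense(128, activation="tanh"))(rep)
pooled = layers.GlobalMaxPooling1D()(latent)             # element-wise max over words
probs = layers.Dense(2, activation="softmax")(pooled)    # positive / negative

model = tf.keras.Model(words, probs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```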
During the training stage, the RCNN achieves 84% accuracy on a validation set (20% of the original dataset). On a test set of 380 tweets (provided by SemEval), the model returns around 82% accuracy on positive tweets and 78% on negatives, with approximately 80% overall on a mixed tweet set. We followed the settings recommended in the original paper for the hyper-parameter selection. Finally, we modified the RCNN in order to extract the most significant keywords, i.e. those the model relies on to drive the sentiment classification. Basically, the third layer, the max-pooling layer, relies on an element-wise "max" function: the k-th element of the pooled vector y(3) is the maximum of the k-th elements of the latent word representations y(2)_i, taken over all word positions i = 1, ..., n (Lai et al., 2015). The most "discriminative" words for the sentiment classification are those most frequently selected in the max-pooling layer. Hence, we extracted the indices of the words corresponding to the maximum activation values identified within the third layer (see the sketch at the end of this section). During training we collected 3.2 million keywords, namely two for each tweet: the most important and the second in order of significance. Many of the resulting keywords are duplicated or altered for multiple reasons: they might belong to a common slang or contain typing errors. We therefore removed duplicates and matched the rest against the embedding corpus containing 2.5 million words of the English language. This process yielded 85,000 correct English keywords. Using these keywords as seeds, we scraped a long sequence of English sentences, in many different forms of expression, together with their Italian translations, from a contextual translation website such as Reverso Context (context.reverso.net). This stage led to a training set of 800,000 English-Italian sentence pairs and a validation set of 50,000 pairs. A multi-level sequence-to-sequence model with attention and beam search has been implemented and trained on this set of pairs (see Fig. 2) (Bahdanau et al., 2014; Luong et Manning, 2016).

Figure 2: Multiple-level encoder-decoder (Luong et Manning, 2016).

"Attention-based" models enable the decoder to "focus" on some words rather than others, selectively orienting itself towards a more efficient combination of words in the target-language sentences (Chorowski et al., 2015). "Beam search" is a greedy algorithm maximising the probability of the output words (Britz et al., 2017). The NMT model was trained with an embedding matrix randomly initialised and trained within the same process. The embedding vector size was 512. Both encoder and decoder are made of two LSTM cells with a hidden state size of 512. The training algorithm was Stochastic Gradient Descent (SGD) with batches of size 32, an initial learning rate of 1 and a decay factor of 0.5 starting from the 5th epoch, plus early stopping to reduce overfitting. The beam search width was set to 5. Fig. 3 reports some resulting translations from Italian to English on a test example.

Figure 3: Some translations from Italian to English produced by the neural model we trained.

At the same time, we trained the RCNN model on the most popular Italian sentiment polarity training set of tweets, SentiPolc 2016 (Barbieri et al., 2016), which is made of 7,000 annotated tweets and 300 test tweets. In this case (Italian language) our model reaches 45% validation accuracy and 43% on the test set. For the embedding layer we adopted a language model pre-trained on an Italian Wikipedia embedding corpus.
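The sketch announced above follows: for each feature of the max-pooling layer, find which word position supplied the maximum activation, then count how often each word wins. Here `latent` stands in for the (n_words x n_features) pre-pooling activations of one tweet; random data replaces a real forward pass, and the array sizes are illustrative.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
words = ["a", "sunset", "stroll", "along", "the", "south", "bank"]
latent = rng.normal(size=(len(words), 128))   # y(2): one row per word

winning_positions = latent.argmax(axis=0)     # which word maximises each feature
counts = Counter(words[i] for i in winning_positions)

# The two most frequently selected words are kept as the tweet's keywords.
print(counts.most_common(2))
```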
3. Results
We tested the English RCNN model on the same Italian SentiPolc 2016 test set translated into English by our neural machine translation model. The results highlight a boost in performance: 78% accuracy on the test set versus the 43% of the Italian-trained RCNN model, proving that our strategy of stacking NMT and RCNN models is successful.

4. Conclusion
Despite the imperfections of the Neural Machine Translation, which produces translations with some errors, the RCNN is tolerant of minimal errors and is able to keep accuracy at high levels on a test set. This is because the RCNN was previously trained on a solid and huge English corpus of tweets. This entire process of keyword extraction, specific to the task of sentiment classification from the training set, is a fully novel approach to tackling the problem of the lack of sentiment training sets in other languages. The keywords allow a domain-specific training set to be generated for the Neural Machine Translation. Arguably, we believe this way of stacking NMT and RCNN leads to a cutting-edge multilingual sentiment classifier that can benefit other fields of text classification in future. Future directions might be a closer integration of the NMT and the text classifier and a reduction of translation errors.

References
Qurat Tul Ain, Mubashir Ali, Amna Riaz, Amna Noureen, Muhammad Kamran, Babar Hayat and A. Rehman (2017). Sentiment Analysis Using Deep Learning Techniques: A Review. International Journal of Advanced Computer Science and Applications (IJACSA).
Haenlein, M. and Kaplan, A.M. (2010). An empirical analysis of attitudinal and behavioral reactions toward the abandonment of unprofitable customer relationships. Journal of Relationship Marketing.
Aydogan, E. and Akcayol, M.A. (2016). A comprehensive survey for sentiment analysis tasks using machine learning techniques. Int. Symp. Innov.
Liu, B. (2012). Sentiment analysis and opinion mining (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publishers.
Pak, A. and Paroubek, P. (2010, May). Twitter as a corpus for sentiment analysis and opinion mining. In LREC (Vol. 10, No. 2010).
Singh, J., Singh, G. and Singh, R. (2016). A review of sentiment analysis techniques for opinionated web text. CSI Transactions on ICT.
Hinton, G.E. and Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Hinton, G.E., Osindero, S. and Teh, Y.W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
Vateekul, P. and Koomsubha, T. (2016, July). A study of sentiment analysis using deep learning techniques on Thai Twitter data. In Computer Science and Software Engineering (JCSSE), 2016 13th International Joint Conference on (pp. 1-6). IEEE.
Day, M. and Lee, C. (2016). Deep Learning for Financial Sentiment Analysis on Finance News Providers. No. 1, pp. 1127-1134.
Socher, R., Pennington, J., Huang, E.H., Ng, A.Y. and Manning, C.D. (2011b). Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP, 151-161.
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y. and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 1631-1642.
Lai, S., Xu, L., Liu, K. and Zhao, J. (2015).
Recurrent Convolutional Neural Networks for Text Classification. In AAAI.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Strapparava, C. and Mihalcea, R. (2007, June). SemEval-2007 Task 14: Affective Text. In Proceedings of the 4th International Workshop on Semantic Evaluations (pp. 70-74). Association for Computational Linguistics.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Bahdanau, D., Cho, K. and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Luong, M.T. and Manning, C.D. (2016). Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788.
Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K. and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems (pp. 577-585).
Britz, D., Goldie, A., Luong, T. and Le, Q. (2017). Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.
Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N. and Patti, V. (2016, December). Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

A linguistic analysis of the image of immigrants' gender in Spanish newspapers
Juan Martínez Torvisco
Universidad de La Laguna – jtorvisc@ull.edu.es

Abstract 1 (in English)
The phenomenon of immigration has been studied from diverse perspectives; it is important to understand that immigration is a fact associated with times of crisis. The reason for the avalanche of immigrants to the Canary Islands (Spain) is that the islands are a gateway to Europe, and immigrants therefore want to enter from this point. This research arises from the need to determine linguistically the treatment of the phenomenon of immigration in the Spanish press as a result of the arrival of thousands of foreign citizens on the coast of the Canary Islands in 2006 and in 2015. It analyses four Spanish newspapers using the Iramuteq qualitative analysis software: two from the Canary Islands (El Día and Canarias 7) and two Spanish national newspapers (El País and ABC). We also wanted to know how gender is treated in the news coverage. Our hypothesis is that the word male (immigrant) appears more often than woman and that, conversely, woman (refugee) has a higher frequency than male. Results are presented as dendrogram figures.

Abstract 2 (in Spanish)
El fenómeno de la inmigración se ha estudiado desde diversas perspectivas, y es un hecho asociado a tiempos de crisis. El motivo de la avalancha de inmigrantes en las Islas Canarias (España) se debe a que es la puerta de entrada a Europa y, por lo tanto, los inmigrantes quieren entrar desde esta parte de Europa, buscando una mejor vida. Esta investigación surge de la necesidad de determinar lingüísticamente el tratamiento del fenómeno de la inmigración en la prensa española como resultado de la llegada de miles de ciudadanos extranjeros a la costa de las Islas Canarias en 2006 y 2015.
Se analizan cuatro periódicos españoles utilizando el software Iramuteq de análisis cualitativo, dos de ámbito regional de Canarias (El Día y Canarias 7) y dos periódicos de ámbito nacional (El País y ABC). También queríamos saber cómo aparece el género en las noticias de estos diarios. Nuestra hipótesis es que los inmigrantes son mayoritariamente hombres y que, por tanto, el hombre debe aparecer más que la mujer y, al contrario, que la palabra mujer (refugiada) tiene una frecuencia mayor que la del hombre. Los resultados se presentan en dos figuras de dendrograma con el Análisis Jerárquico Descendente (DHC) y reflejan que la mujer aparece en 2015 pero no está presente en las noticias de los diarios en 2006, y que a la inversa ocurre con el hombre.

1. Introduction
The media have become a powerful tool for making conflicts visible, or for showing realities that sometimes remain hidden from the world. Such a fact seems unquestionable. One of the most recent cases is the so-called "immigration crisis" or "refugee crisis"; it began before the dates analysed in the current research but did not achieve real visibility until these citizens reached the coasts of Europe, in this case the Canary Archipelago. As Shier, Engstrom & Graham (2011) suggest, an "immigrant" is a person arriving (immigrating) who has come to live in a country from some other country with the purpose of settling there. Journalistic enterprises face the challenge of attracting new audiences, being aware of the transformation of the sector and the emergence of a new ecosystem. These companies require narrative treatments different from those already known, since these information units synthesise the content and salience of the published news; these elements are decisive in capturing the attention of readers (Jarvis, 2014). Through the selection of headlines it is possible to highlight the role of the new professionals in the newsrooms who are responsible for defining what kind of news is published. As Ramonet (1998) makes evident, a variety of sources guarantees objectivity. However, information is a social good that concerns the whole of society, and this society must establish the moral norms that govern the responsibility of the media (Fraerman, 1998). The phenomenon of immigration has been analysed from diverse perspectives; it is important to understand that immigration is a fact associated with times of crisis. Gender issues, however, are not treated in depth, so one important aim is to know whether journalists take this fact into account. The Canary Islands (Spain) are a gateway to Europe, and this is the reason for the avalanche of immigrants, male and female. The evidence points to immigrant networks wanting to enter Europe at this point. Most migration researchers understand these networks as consisting of a set of "strong ties" based on kinship, friendship, or a shared community of origin that connects migrants and non-migrants (Massey et al. 1998). The migration network approach holds that a multidirectional flow of information and resources forms the basis of every migratory process (Dekker & Engbersen, 2014). The migration phenomenon in Europe has had two phases of maximum activity, in 2006 and 2015, in both of which people were displaced from their place of origin to another destination, including a change of residence.
In the first case, the citizens who entered Europe through the Canary Islands were so-called undocumented immigrants. These people left their countries as a free choice and in their "personal interest", in line with the definition of the International Organization for Migration (IOM). In the second case, the displaced were refugees (also present in 2006, but in a very small percentage) who moved in order to save their lives or preserve their freedom, as the United Nations High Commissioner for Refugees (UNHCR) states. The data analysed in this paper focus on international migration and movement across national borders; consequently, this work addresses the time span separating the two massive arrivals and the evolution that took place in the field of communication in that period. The search terms "immigrant" (for 2006) and "refugee" (for 2015), together with the words "man" and "woman", were used as keywords to search the headlines and full news items of the database and to locate information about immigration and refugees (MUGAK, 2016). The study compares the year 2006 with 2015 and aims to probe the narrative production generated by two Spanish national newspapers (ABC and El País) and two Spanish regional newspapers (Canarias 7 and El Día) in relation to the immigration phenomenon that took place in the Canary Islands in those years.

2. Method
In the present study, carried out on the years 2006 and 2015, the statistical methods are mainly concerned with the non-linguistic information of a text, e.g. term frequencies, inverse frequencies and the position of a keyword in a text. For the data analysis we apply the Iramuteq software (Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires; Ratinaud, 2009; Ratinaud & Marchand, 2012, 2015). For the data processing we apply the Descending Hierarchical Classification (DHC) of the Reinert method (1983, 1986, 1990), defined by lexical classes, where each class represents a subject matter and can be described according to the vocabulary that defines it. The lexical analysis was performed on the most frequent words in the text segments. This analysis overcomes the dichotomy between quantitative and qualitative research, as it allows statistical calculations to be employed on qualitative data, the texts. The vocabulary related to "immigration, immigrant/s, refugee/s, man, woman, etc." was identified and quantified in terms of frequency and, in some cases, in relation to its position within the text.
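As background to the Reinert method just described, the following is a minimal sketch of the lexical table on which the DHC operates: text segments crossed with word forms in a presence/absence matrix, which the algorithm then splits top-down. The actual chi-squared splitting is performed by Iramuteq; the headlines below are invented placeholders.

```python
# Build the binary segments-by-forms matrix underlying Reinert's DHC.
from sklearn.feature_extraction.text import CountVectorizer

segments = ["inmigrantes llegan en patera a Canarias",
            "refugiados piden asilo al gobierno",
            "rescate de una patera cerca de la costa"]

vectorizer = CountVectorizer(binary=True)   # 1 = form present in segment
X = vectorizer.fit_transform(segments)

print(vectorizer.get_feature_names_out())
print(X.toarray())                          # segments x forms matrix
```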
3. Results
Below, the author illustrates the data of the text corpus for the 2006 and 2015 study periods. The corpus used in this analysis was constructed ad hoc. It contains 4,703 newspaper headlines and news items published in Spanish throughout 2006 and 2015. We used four newspapers, two nationwide (El País and ABC) and two of regional scope (Canarias 7 and El Día): 169 news items correspond to El País and 291 to ABC, whereas Canarias 7 published 512. The information for three newspapers was obtained through the MUGAK database (Centre of Studies and Documentation on Immigration, Racism and Xenophobia, Basque Country, Spain, 2016); in the case of El Día (3,731 news items), the information was taken directly from the newspaper's own database.

Table 1: Statistical data from the text corpus of the study
            | Corpus 2006 | Corpus 2015 | Subcorpus 2006 | Subcorpus 2015 (text in web editions)
Occurrences | 426,135     | 30,531      | 147,468        | 6,148
Forms       | 11,993      | 4,792       | 9,747          | 1,487
Hapax       | 5,093       | 2,440       | 4,525          | 827
Texts       | 7           | 11          | 7              | 4

In addition to the characteristics of each text, the number of occurrences detected in the online versions of the newspapers is broad: the lexicometric analysis shows that it accounts for 20% of the occurrences of the entire corpus, while the remaining 60% belongs to the activity developed in the profiles of each newspaper on social networks. The following word cloud gathers, in generic terms, the forms that characterise the selected texts. As can be observed, some of the words displayed in larger characters, and therefore the most relevant, are related to our area of study: in the 2006 period the word immigrant is the most used in the newspapers analysed, followed by Canarias, patera and cayuco (two types of small boat used to reach the Canary Islands). In 2015, by contrast, the terms refugee (refugiado), immigrant (inmigrante), welcome (bienvenida), government (gobierno), rescue (rescate) and the Canary Islands (Canarias) appear. In addition, verb forms such as to shelter, to offer, to ask or to rescue appear which, as Crespo (2008) points out, reveal a certain ideological position that undoubtedly helps to construct a particular image of the migratory phenomenon and its consequences for the receiving countries. From the graphs generated by the Iramuteq software for this text corpus it can be inferred that some specific forms convey a positive or negative value, depending on the verbs used for this purpose and on the profile of the migrant referred to; in our case we display the data of the two periods analysed. These forms appear related to the terminology of our topic and were previously used in the construction of the press headlines.

3.1. Data from the Descending Hierarchical Classification Analysis, 2006
The Iramuteq 0.7 alpha 2 software (Ratinaud, 2014) provides multivariate analysis through DHC and calculates descriptive results for the clusters according to their main vocabulary (Camargo & Justo, 2013). Likewise, by their location in the dendrogram, the resulting clusters of forms reflect the different scenarios at work and how certain social realities intersect: class 1 (social, immigrant aid), class 2 (immigrants and their local rescue), class 3 (social and family), class 4 (institutional). A concept common to two clusters also appears, "immigrant" and "immigration", as can be seen in the figure below (Fig. 1). The word "male" appears 184 times, χ² = 521.9.

Figure 1: DHC dendrogram, 2006

3.2. Data from the 2015 DHC
The data shown in the figure below (Fig. 2) offer an estimated view of the figures of the "refugee" and the "immigrant" and of their evolution in the context of the knowledge acquired by the media as the phenomenon moved forward. We thus find two words, "refugee" and "immigrant", appearing in the journalistic headlines.

Figure 2: DHC dendrogram, 2015

The above dendrogram reflects the different scenarios at work and how certain social realities mix: class 4 (local), class 2 (institutional), class 3 (social) and class 1 (European). The word "woman" appears 20 times with χ² = 28.9.
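The χ² statistics reported above (e.g. "male": 184 occurrences, χ² = 521.9) measure the association between a word and a lexical class. Below is a minimal sketch of how such a value is computed from a 2x2 contingency table crossing word presence with class membership of the text segments; the counts are placeholders, not the study's actual table, and scipy is assumed to be available.

```python
from scipy.stats import chi2_contingency

#                 in class   outside class
table = [[150,   34],    # segments containing the word
         [400, 3500]]    # segments without the word

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}, dof = {dof}")
```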
It is worth mentioning the finding of the term "to receive", an element similar to the rest of the verbs that accompany it in the constellation of words in which it is lodged (to propose, to find, to celebrate or to arrange, among many others). It becomes all the more relevant, however, due to its preponderance and strategic position in an environment in which it appears alongside vocabulary with which it keeps linguistic similarities.

4. Conclusion
This object of study evolves in parallel with the population movement, as does a certain informative personalisation through the introduction of adjectives that indicate narrative subjectivity. Our findings suggest a vast body of knowledge covering countless issues related to immigrants and refugees, and to woman and man. It can be said that the word "man" does not appear during 2006 whereas "male" does; in 2015, conversely, "woman" appears instead of "female", and "male" does not appear as it did in 2006. The mechanisation of publishing systems marks a clear dividing line between some texts and others, and the shortage of human and technical resources devoted to this activity causes local media to be less interventionist in drafting their texts than national ones. Finally, future research should consider the role of journalists and the use they make of the gender topic as a way of knowing how the immigration phenomenon behaves for men and women.

References
Crespo, E. (2008). El léxico de la inmigración: atenuación y ofensa verbal en la prensa alicantina. In M. Martínez (ed.), Inmigración, discurso y medios de comunicación (pp. 45-62). Alicante: Instituto Alicantino de Cultura Juan Gil Albert, Diputación Provincial de Alicante.
Dekker, R. & Engbersen, G. (2014). How social media transform migrant networks and facilitate migration. Global Networks, 14(4): 401-418.
Jarvis, J. (2014). Geeks Bearing Gifts. CUNY Journalism Press, New York. Spanish edition: El fin de los medios de comunicación de masas. ¿Cómo serán las noticias del futuro? Barcelona: Ediciones Gestión 2000.
Massey, D.S., Arango, J., Hugo, G., Kouaouci, A., Pellegrino, A. & Taylor, J.E. (1998). Worlds in Motion: Understanding International Migration at the End of the Millennium. New York: Oxford University Press.
Mugak (2016). Centre of Studies and Documentation on Immigration, Racism and Xenophobia, Basque Country, Spain. Available at www.mugak.eu.
Ramonet, I. (2011). La tiranía de la comunicación. Madrid: Debate.
Ratinaud, P. (2009). IRAMUTEQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires [Computer software]. Retrieved 5 March 2013 from http://www.iramuteq.org.
Ratinaud, P. (2014). Visualisation chronologique des analyses ALCESTE : application à Twitter avec l'exemple du hashtag #mariagepourtous. In Actes des 12èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2014) (pp. 553-565). Paris, France.
Ratinaud, P. & Marchand, P. (2012). Application de la méthode ALCESTE à de « gros » corpus et stabilité des « mondes lexicaux » : analyse du « CableGate » avec IRaMuTeQ. In Actes des 11èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2012). Liège.
Ratinaud, P. & Marchand, P. (2015). Des mondes lexicaux aux représentations sociales. Une première approche des thématiques dans les débats à l'Assemblée nationale (1998-2014). Mots. Les langages du politique, 108: 57-77.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte.
Les cahiers de l'analyse des données, 8(2): 187-198.
Reinert, M. (1986). Un logiciel d'analyse lexicale : ALCESTE. Les cahiers de l'Analyse des Données, 4: 471-484.
Reinert, M. (1990). ALCESTE. Une méthodologie d'analyse des données textuelles et une application : Aurelia de G. de Nerval. Bulletin de méthodologie sociologique, 28: 24-54.
Shier, M.L., Engstrom, S. & Graham, J.R. (2011). International migration and social work: A review of the literature. Journal of Immigrant and Refugee Studies, 9(1): 38-56. http://dx.doi.org/10.1080/15562948.2011.547825.

Lo strano caso delle frequenze zero nei testi legislativi euroistituzionali
Francesco Urzì
combinazioni.lessicali@gmail.com

Abstract
In this paper we intend to verify the actual impact of the so-called universals of translation – i.e. those linguistic features which typically occur in translated rather than original texts – on the legislative texts produced by the European Union. To this aim, a number of text segments have been heuristically selected in order to ascertain whether their statistical absence, or quasi-absence, from European legislation should be traced back to the effects of the above-mentioned universals, and to identify possible EU-internal factors that might explain such conspicuous statistical absences.

Keywords: universals of translation, European Union, Eur-lex, euroitaliano, terminology.

1. Introduction
Recent years have seen a proliferation of studies on comparable corpora aimed at verifying the actual incidence of the so-called universals of translation, i.e. those linguistic traits common to translated texts which cannot be traced back to a systemic influence of the source language (Baker 1993 and 1996; Laviosa 2002). For Italian we have the analyses of Garzone 2005 and Ondelli-Viale 2010. Ondelli-Viale, who rely exclusively on a corpus of journalistic origin, note for example the lower lexical richness and the slightly higher frequency of the basic vocabulary (Vocabolario di base) in translations, as effects of the translation universal of simplification. Studies on the specific traits of "euroitaliano", i.e. the variety of Italian represented by the EU's translations, are less numerous. In this field, Cortelazzo 2013 carried out a quantitative comparison of two fairly large corpora consisting respectively of European directives and the Italian laws transposing them, using among other things lexicometric measures (e.g. type/token ratio and hapax counts) and also taking into consideration "recurrent segments" (which, according to the author, confirm for the EU corpus lexical choices that are "slightly poorer and more homogeneous than the national ones"). With the present contribution we propose to establish, on the basis of heuristically chosen segments, striking cases of zero or near-zero frequency within the second-level domain europa.eu, and more specifically within Eur-lex, which constitutes a subset of it. The purpose of this exercise is to verify:
• whether the statistical irrelevance of certain lexical items in these corpora, which practically consist only of translated texts (recall Umberto Eco's much-quoted remark that "the language of Europe is translation"), does not provide incontrovertible evidence of the effects of the translation universals, in particular simplification and normalisation (or linguistic conservatism);
• whether a process of "self-inhibition" on the part of EU translators against the use of such lexical items can also be discerned.
Non opererebbero in altre parole solo le tendenze generali ascrivibili al processo traduttivo in sé (gli universali della traduzione appunto), ma anche e soprattutto la specifica cultura traduttiva euroistituzionale e lo specifico contesto tecnico-operativo che contraddistingue i servizi di traduzione delle Istituzioni europee. Essendo tale analisi di tipo eminentemente qualitativo, l’utilizzo di un corpus “rumoroso” come Google non inficia la rilevanza dei risultati quantitativi, che tendono unicamente a individuare solo grandi scarti di frequenza, per cui è vero in questo caso che “more data is better data”. 2. La cultura traduttiva delle Istituzioni europee 2.1 Confusione fra ‘termine’e ‘parola’ Un tratto soggiacente della cultura di categoria dei traduttori euroistituzionali è la non percezione della differenza teorica fondamentale fra ‘termine’ e ‘parola’. E’ diversa infatti nel termine e nella parola la natura del riferimento, “che nel termine è specializzata all’interno di una particolare disciplina, mentre nella parola è generale in una varietà di argomenti (Cfr. Scarpa 2008: 52, che cita Sager 1994: 43). Cabré (1999, 33-34), sulle orme di Wüster (1981), menziona due specificità della terminologia. La prima è che “words in dictionaries are described with respect to their use in context; they are considered as elements of discourse. For terminology, on the other hand, terms are of interest on their own account”; la seconda che “lexicology and terminology present their inventories of words or terms (…) in different ways because they start from different viewpoints: terminology starts with the concept and lexicology, with the word”. Cabré (ibidem, 36) nota inoltre che “whereas a terminological inventory usually contains only nouns, in a general language dictionary all grammatical categories are represented”. 798 JADT’ 18 2.2 Referenzialità intertestuale La natura “ciclica” degli atti legislativi dell’Unione - che molto spesso modificano e aggiornano testi legislativi precedenti – che fa sì che le soluzioni traduttive già consacrate dall’ufficialità finiscano per essere trasferite di peso sui nuovi atti, con un fenomeno che si potrebbe definire di common law linguistica, in cui il precedente esercita forza vincolante sul giudizio linguistico autonomo del traduttore. E' in questa fase che il traduttore UE spesso assegna status di ‘termini’ a sintagmi che pur non rispondendo teoricamente a tale definizione (v. 2.1) hanno comunque acquisito il crisma dell'ufficialità per essere stati "validati" in testi legislativi precedentemente pubblicati o anche solo verificati sul piano qualitativo e ritenuti idonei a a essere immessi nel successivo iter legislativo. E’ così che determinate soluzioni traduttive tendono a perpetuarsi all’interno delle “filiera testuale” della materia trattata. Al riguardo va citato anche l’effetto di condizionamento subito dai traduttori più giovani, i quali trovano arduo sostenere scelte linguistiche innovative in contrasto con la "tradizione" dei testi dell'acquis communautaire e, soprattutto, tendono a non discostarsi dall'approccio traduttivo dei colleghi più anziani. 3. 
Il contesto tecnico-operativo dei servizi di traduzione delle Istituzioni europee 3.1 House Rules I servizi di traduzione delle Istituzioni europee hanno a disposizione un “Manuale di convenzioni redazionali” (OPOCE 2011), nella cui pagina di benvenuto si legge che "la sua applicazione [del Manuale] è obbligatoria [grassetto originale] per chiunque intervenga nella preparazione di ogni documento (su carta o elettronico) nelle istituzioni, organi o servizi dell’Unione europea". Non viene fatta nel Manuale alcuna distinzione fra le varie tipologie di testi e le differenti funzioni comunicative che competono a ciascuna di esse. Inoltre molte regole di redazione sono presentate sotto forma di prescrizione assoluta Ad esempio, si prescrive "direttiva" (atto legislativo) con la minuscola (il che non sorprende visto il numero di volte in cui il termine viene utilizzato nei testi UE), nonostante la regola secondo cui (Lesina 2009) "nei casi in cui un nome generalmente usato in senso comune viene utilizzato in senso proprio, con un significato restrittivo o particolare (…) l'iniziale maiuscola può [corsivo mio] essere utile per ragioni di chiarezza, al fine di segnalare al lettore la particolare accezione del nome". Conoscendo la scarsa frequentazione degli italiani (anche di buona cultura) con la terminologia degli atti legislativi comunitari, sorprende che il Manuale di convenzioni redazionali prescriva che "direttiva", anche quando non seguita dagli estremi completi dell'atto legislativo (ad es. direttiva JADT’ 18 799 2049/39/CE), debba essere sempre scritta con la minuscola (dunque anche nei testi a carattere divulgativo destinati alle pagine web). 3.2 Effetto standardizzante delle tecnologie CAT e MT Attualmente i traduttori delle Istituzioni europee beneficiano di una memoria di traduzione comune a tutti i servizi denominata “Euramis” e che provvede alla pretraduzione dei testi sia quando la traduzione è curata dai servizi interni sia quando è esternalizzata ad agenzie di traduzione. Da qualche anno è entrata in servizio anche la traduzione automatica che, su richiesta del traduttore, integra l’output della traduzione assistita. Poiché ad alimentare la memoria Euramis sono esclusivamente segmenti di testo “validati” (ossia già sottoposti al processo interno di controllo di qualità e dunque ritenuti idonei al successivo dibattito politico o alla pubblicazione) i traduttori preferiscono non discostarsi da soluzioni ritenute “sicure” (e la cui adozione, va pure sottolineato, si traduce in un notevole risparmio di tempo). 4. Esempi paradigmatici di "grandi assenti" Ad esemplificazione di quanto sopra passiamo di seguito in rassegna una serie di sintagmi, che presentano casi clamorosi di frequenze zero o prossime allo zero. Nelle relative tabelle il numero di occorrenze preceduto da asterisco indica dei “falsi positivi”. L’asterisco fra parentesi segnala che sono dei falsi positivi almeno una parte delle occorrenze. Le forme prese in considerazione sono una forma aggettivale gerundiva (costruendi), alcuni sintagmi nominali con aggettivo relazionale (indagini poliziesche, attività manutentive, servizi consulenziali), un composto aggettivale determinativo formato da due aggettivi relazionali (politico-programmatico) e due costrutti, rispettivamente con fattorizzazione (dati quali- quantitativi) e zeugma preposizionale (valutare e tener conto [di]). Laddove utile sono state proposte, a titolo comparativo, le statistiche relative alla forme più in uso nel corpus legislativo europeo. 
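The frequency checks reported in the tables of the next section were run against Google and the europa.eu / Eur-lex domains. On a locally downloaded plain-text slice of such a corpus, the same kind of lookup can be sketched as follows; this is a minimal illustration and not the author's actual procedure, and the directory name `eurlex_txt` is hypothetical.

```perl
#!/usr/bin/perl
# Count raw occurrences of candidate phrases in a directory of
# plain-text files (a hypothetical local snapshot of Eur-lex texts).
use strict;
use warnings;

my $dir     = 'eurlex_txt';    # assumed corpus location
my @phrases = ('costruendi', 'indagini poliziesche',
               'attività manutentive', 'servizi consulenziali');

my %count;
opendir(my $dh, $dir) or die "cannot open $dir: $!";
for my $file (grep { /\.txt$/ } readdir $dh) {
    open(my $fh, '<:encoding(UTF-8)', "$dir/$file") or next;
    local $/;                  # slurp the whole file at once
    my $text = lc <$fh>;
    for my $p (@phrases) {
        my @hits = $text =~ /\Q\L$p\E/g;   # literal, case-folded match
        $count{$p} += scalar @hits;
    }
}
closedir $dh;
printf "%-25s %d\n", $_, $count{$_} // 0 for @phrases;
```

Such a raw count deliberately ignores tokenization and lemmatization; for the near-zero frequencies at issue here, that coarseness is harmless.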
4.1 Gerundivo

Token | Google | Europa.eu | Eur-lex
costruendi | 11.800 | *2 | *1

I due unici esempi di europa.eu – 'i costruendi locali' e 'sepolcri esistenti e costruendi', entrambi provenienti dalla banca elettronica TED1, sono riferiti ad aree territoriali italiane. In questo caso sembra aver operato il conservatorismo linguistico, che ha indotto ad evitare una forma non registrata dai dizionari2 e probabilmente ritenuta dai traduttori troppo ardita.

1 TED – Tenders Electronic Daily, ossia il supplemento alla Gazzetta ufficiale dell'Unione europea dedicato agli appalti pubblici europei.
2 Tale forma non registrata ad esempio nel Sabatini Coletti 2008, che però riporta 'istituendo' e 'costituendo', mentre il Grande dizionario Garzanti riporta solo 'costituendo'.

4.2 Aggettivi relazionali semplici e composti
Un analogo comportamento linguistico convenzionale e semplificatorio da parte dei traduttori si osserva nel caso degli aggettivi relazionali. Non tutti i suffissi che formano aggettivi relazionali sono infatti suffissi « dedicati », ossia deputati a codificare esclusivamente il rapporto di relazione; alcuni formano anche aggettivi qualificativi. Tale è ad esempio il suffisso -ivo3, come in attività produttive vs. prefisso produttivo. Spesso basta questa ambivalenza semantica a dissuadere il traduttore dall'utilizzare tali aggettivi in funzione relazionale e a indurlo a preferire soluzioni alternative (ad es. con l'impiego della preposizione 'di' o con locuzioni preposizionali del tipo 'relativo/riguardo a/in materia di'). Nel caso di 'indagini poliziesche', potrebbe forse aver agito anche il proposito di evitare una indesiderata connotazione.

3 Suffisso usato prevalentemente per la formazione di aggettivi qualificativi (Wandruska 2004: 391).

Token | Google | Europa.eu | Eur-lex
indagini di polizia | 164.000 | 793 | 85
indagini poliziesche | 14.700 | (*)2 | 0

Da notare che una delle 2 occorrenze di 'indagini poliziesche' in europa.eu è un comunicato stampa, dunque scritto con ogni probabilità da un giornalista e non da un traduttore.

Token | Google | Europa.eu | Eur-lex
attività di manutenzione | 1.230.000 | 6.730 | 354
attività manutentive | 89.400 | (*)139 | *1

Da osservare che l'unico risultato di Eur-lex per 'attività manutentive' lo si ritrova in un testo italiano, che riportiamo (grassetto mio): « Regolamento del sottosegretario di Stato per l'Edilizia abitativa, la Pianificazione territoriale e l'Ambiente recante definizione di nuove prescrizioni relative alla prevenzione di perdite accidentali di fluidi frigorigeni nell'ambito dell'utilizzo di o dell'esecuzione di attività manutentive su impianti di refrigerazione e, in relazione alle stesse, recante modifica del regolamento prescrizioni impermeabilità impianti di refrigerazione 1997 ». Dei 139 risultati in europa.eu 114 provengono dalla banca TED e, come conferma un controllo a campione eseguito da chi scrive, si riferiscono ad avvisi di appalto riguardanti il territorio italiano.

Token | Google | Europa.eu | Eur-lex
servizi di consulenza | 6.870.000 | 29.300 | 16
servizi consulenziali | 96.600 | (*)21 | 0

Anche in questo caso, dei 21 risultati di europa.eu 3 provengono da TED, altri (anche se non tutti) da regioni italiane. Per quanto riguarda gli aggettivi relazionali composti, del tipo: libero professionale (relativo alla libera professione) oppure marittimo-portuale (relativo ai porti marittimi), si è scelto come caso eclatante di assenza il composto 'politico-programmatico'.
L'assenza è tanto più significativa in quanto non mancano certo nell'Unione europea i documenti funzionalmente analoghi al Documento politico-programmatico italiano, ma è solo a quest'ultimo documento che fanno riferimento le pochissime occorrenze di questo termine riscontrate su europa.eu e Eur-lex. Ancor più che nel caso degli aggettivi relazionali semplici, l'assenza si spiega con il senso di incertezza semantica che le formazioni aggettivali costituite da due aggettivi relazionali possono ingenerare, visto che spesso la loro disambiguazione (stabilire cioè se si tratta di composto coordinativo o determinativo) può avvenire solo in relazione a un dato cotesto.

Token | Google | Europa.eu | Eur-lex
politico-programmatico | 34.900 | 8 | *1

Delle 8 occorrenze di europa.eu, almeno 2 provengono da documenti redatti da curatori italiani. L'unica occorrenza in Eur-lex (dove la versione inglese è policy and planning platform) fa pensare a un brano di testo originariamente redatto in italiano e a una lettura coordinativa, anziché determinativa, del composto in sede di traduzione.

4.3 Fattorizzazioni e costruzioni zeugmatiche
Questi due costrutti, i cui meccanismi sono di difficile reperimento nelle grammatiche, sono ampiamente utilizzati nel linguaggio giuridico e amministrativo italiano per evidenti ragioni di economia linguistica. Si è scelta a tal fine la sequenza 'dati qualitativi e quantitativi', che è un'espressione che ricorre sovente in testi che riportano dati statistici e che viene pertanto utilizzata in una pluralità di settori. Per lo zeugma grammaticale si sono ricercate le occorrenze della sequenza 'valutare e tener conto'4, che è risultata non ben accetta dai traduttori in quanto probabilmente troppo « audace ». Oltretutto costrutti di questo tipo vengono sovente attribuiti a un'influenza della lingua inglese5, motivo questo di ulteriori spinte puristiche da parte dei traduttori.

Token | Google | Europa.eu | Eur-lex
dati qualitativi e quantitativi | 23.100 | 370 | 1
dati quali-quantitativi | 10.400 | *9 | 0

I 9 risultati europa.eu si riferiscono tutti a progetti italiani nati in ambito regionale.

Token | Google | Europa.eu | Eur-lex
valutare e tener conto | 1930 | (*)5 | 0

Dei 5 esempi in europa.eu 2 si devono all'eurodeputata Pasqualina Napolitano (doc. A6-0502/2008) mentre 3 sono di provenienza esterna all'UE. Come nel seguente esempio (grassetto mio): « Art. 5. (Coordinamento per la sicurezza e salute ex decreto legislativo n. 81 del 2008) 1. Ai sensi dell'articolo 90, comma 1-bis, del decreto legislativo n. 81 del 2008, il Tecnico incaricato è obbligato a considerare, valutare e tener conto, al momento delle scelte tecniche per la fase progettuale oggetto dell'incarico, dei principi e delle misure generali di tutela di cui all'articolo 15 del citato decreto legislativo n. 81 del 2008. » (http://bandieconcorsi.comune.trieste.it/contenuti/allegati/schema_contratto_incarico.pdf)

5 Fanfani 2010.

Riferimenti bibliografici
Baker M. (1993), "Corpus Linguistics and Translation Studies – Implications and Applications", in: M. Baker / G. Francis / Tognini Bonelli (a cura di), Text and Technology: In Honour of John Sinclair, Amsterdam-Philadelphia: Benjamins, 233-250.
Baker M. (1996), "Corpus-based Translation Studies: The Challenges that Lie Ahead", in: H. Somers (a cura di), Terminology, LSP and Translation: Studies in Language Engineering in Honour of Juan C. Sager, Amsterdam-Philadelphia: Benjamins, 175-186.
Cabré M. T. (1999), Terminology – Theory, Methods and Applications, Amsterdam-Philadelphia: John Benjamins.
Cortelazzo M. A. (2013), "Leggi italiane e direttive europee a confronto", in: Stefano Ondelli (a cura di), Realizzazioni testuali ibride in contesto europeo. Lingue dell'UE e lingue nazionali a confronto, Trieste: EUT Edizioni Università di Trieste, pp. 57-66.
Fanfani M. (2010), "Anglicismi", in: Simone R., Berruto G., D'Achille P. (a cura di), Enciclopedia dell'italiano, Roma: Istituto della Enciclopedia italiana.
Garzone G. (2005), "Osservazioni sull'assetto del testo italiano tradotto dall'inglese", in: A. Cardinaletti / G. Garzone (a cura di), L'italiano delle traduzioni, Milano: Franco Angeli, 35-58.
Grande Dizionario Garzanti di italiano (2017), De Agostini Scuola s.p.a. – Garzanti linguistica (versione elettronica).
Laviosa S. (2002), Corpus-based Translation Studies. Theory, Findings, Applications, Amsterdam-New York: Rodopi.
Lesina R. (2009), Il Nuovo Manuale di stile, Bologna: Zanichelli.
Manuale interistituzionale di convenzioni redazionali, Ufficio delle pubblicazioni dell'Unione europea (OPOCE), 2011, ISBN 978-92-78-40704-9.
Ondelli S. e Viale M. (2010), "L'assetto dell'italiano delle traduzioni in un corpus giornalistico. Aspetti qualitativi e quantitativi", in: Rivista internazionale di tecnica della traduzione, n. 12/2010, pp. 1-62. ISSN 1722-5906.
Sabatini F. e Coletti V. (2008), Il Sabatini Coletti. Dizionario della lingua italiana, Milano: Rizzoli-Larousse.
Sager J. (1994), Language Engineering and Translation: Consequences of Automation, Amsterdam-Philadelphia: John Benjamins.
Scarpa F. (2008), La traduzione specializzata, seconda edizione, Milano: Hoepli.
Urzì F. (2016), "Il paradosso degli aggettivi di relazione composti derivati da sintagmi N+A. Una risorsa non utilizzata in traduzione", in: R. Bombi / V. Orioles (a cura di), Lingue in contatto – Contact Linguistics, Roma: Bulzoni, 163-178.
Wandruszka U. (2004), "Aggettivi di relazione", in: M. Grossmann / F. Rainer (a cura di), La formazione delle parole in italiano, Tübingen: Niemeyer, 382-394.
Wüster E. (1976), "La théorie générale de la terminologie – un domaine interdisciplinaire impliquant la linguistique, la logique, l'ontologie, l'informatique et les sciences des objets", in: H. Dupuis (a cura di), Essai de définition de la terminologie. Actes du colloque international de terminologie (Québec, Manoir du lac Delage, 5-8 octobre 1975), Québec: Régie de la langue française, pp. 49-57.
Wüster E. (1981), "L'étude scientifique générale de la terminologie, zone frontalière entre la linguistique, la logique, l'ontologie, l'informatique et les sciences des choses", in: Rondeau G. / Felber H. (a cura di), Textes choisis de terminologie – I. Fondements théoriques de la terminologie, Québec: GIRSTERM, 55-114.

Les traductions françaises de The Origin of Species : pistes lexicométriques
Sylvie Vandaele
Université de Montréal – sylvie.vandaele@umontreal.ca

Abstract
In order to develop a sound methodology that would guide the analysis of the translations of important writings, we used Hyperbase to perform a lexicometric analysis of specificities on two corpora based on the various English and translated editions of Charles Darwin's The Origin of Species. We show that the translated corpus is characterized by a notable lexical dispersion compared to the source corpus.
By combining the use of Hyperbase with Logiterm, a text alignment software, we were able to target and analyse contexts of interest. This approach allows for the rapid identification of contexts that are significant both statistically and in terms of the analysis of the translation strategies themselves.

Résumé
Afin de mettre au point une méthode raisonnée d'analyse des traductions d'œuvres conséquentes, nous avons soumis les versions originales de The Origin of Species, de Charles Darwin, ainsi que leurs traductions en français à une analyse lexicométrique des spécificités à l'aide du logiciel Hyperbase. Nous montrons que le corpus de traductions se caractérise par une dispersion lexicale notable, contrairement au corpus anglais source. Les spécificités ont permis, à l'aide du logiciel d'alignement bilingue Logiterm, de cibler l'analyse de contextes bilingues montrant les différences de choix de traduction. Cette approche permet de repérer rapidement des contextes significatifs tant sur le plan statistique que sur le plan de l'analyse des stratégies de traduction.

Keywords: The Origin of Species; specificities; Hyperbase; Logiterm; retranslation; translation choices.

1. Introduction
La retraduction, fréquente en littérature (voir Monti et Schnyder, 2011), est rare en science. The Origin of Species [désormais OS], l'œuvre célèbre de Charles Darwin, fait exception : six éditions de langue anglaise (de 1859 à 1872), six traductions en français dont deux modernes (voir Vandaele et Gendron-Pontbriand [2014] pour les détails). Cependant, l'ampleur de l'œuvre rend l'analyse des traductions difficile. Nous proposons une méthode consistant à isoler les spécificités lexicales des originaux et des traductions, puis à repérer les contextes bilingues alignés correspondants, soumis ensuite à une analyse qualitative. Nous accédons ainsi rapidement aux éléments saillants de l'évolution de l'œuvre et de ses traductions.

2. Corpus et méthodologie
Les deux corpus¹ sont constitués par les chapitres intégraux des six éditions originales anglaises de l'OS (1859-1872) et les six traductions en français, à l'exclusion du paratexte et des notes de bas de page. Les césures en fin de ligne ont été éliminées ; les numéros de page, placés entre deux phrases, et les appels de notes, enlevés. Nous avons eu recours au logiciel Hyperbase v. 10², réalisé par Étienne Brunet (Brunet 2011). L'annotation syntaxique et la lemmatisation ont été réalisées au préalable avec Cordial v. 14 (Synapse) pour le français, et à la volée, pour l'anglais, avec la version de TreeTagger incluse dans Hyperbase. L'alignement des versions originales et traduites a été réalisé avec Logiterm v. 5.7.1 (Terminotix).

3. Les versions originales anglaises de l'OS
Le corpus anglais compte un peu plus d'un million d'occurrences. Darwin a procédé à des ajouts, mais aussi à des retraits³. La 6e édition (1872)⁴ est 28 % plus longue que la 1re (1859), soit 48 000 occurrences de plus. L'analyse de la richesse du vocabulaire montre la proximité lexicale des six éditions originales : on compte 8559 lemmes pour tout le corpus, 6082 pour la 1re édition et 7431 pour la 6e (tableau 1). Les lemmes communs forment la majorité du corpus : pour les textes pris 2 à 2, leur nombre varie de 5597 à 6600, tandis que le nombre des lemmes privatifs fluctue de 136 à 1795. L'examen des formes donne des résultats du même ordre.
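Shared and private lemma counts of this kind reduce to simple set operations on lemmatized word lists. A minimal sketch follows; it assumes one lemma per line, which is not necessarily the export format of Cordial or TreeTagger, and the file names are hypothetical.

```perl
#!/usr/bin/perl
# Compare the lemma inventories of two lemmatized editions:
# shared lemmas vs. lemmas private to each text.
use strict;
use warnings;

sub lemma_set {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or die "$path: $!";
    my %set;
    while (<$fh>) { chomp; $set{$_}++ if length }
    return \%set;
}

my ($a, $b) = (lemma_set('OS01.lem'), lemma_set('OS06.lem'));
my $shared = grep { exists $b->{$_} } keys %$a;   # scalar grep = count
printf "lemmas in A: %d, in B: %d\n", scalar keys %$a, scalar keys %$b;
printf "shared: %d, private to A: %d, private to B: %d\n",
       $shared, (keys %$a) - $shared, (keys %$b) - $shared;
```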
L'accroissement chronologique des lemmes montre un léger appauvrissement pour la 2e et la 3e édition, mais un enrichissement notable du vocabulaire dans la 6e édition (tableau 1), essentiellement redevable à un grand nombre d'hapax, souvent des noms d'espèces⁵. Ce résultat reflète le fait que Darwin apporte de plus en plus de données à l'appui de sa théorie.

Tableau 1 – Corpus des éditions originales de l'OS
Année de publication et édition | Code | Nombre d'occurrences⁶ | Effectif des lemmes N (écarts réduits) | Accroissement chronologique, écarts réduits (calculés sur les lemmes)
1859, 1re éd. | OS01 | 170 634 | 6082 (2,67) | 4,5
1860, 2e éd. | OS02 | 171 665 | 6210 (4,21) | -6,5
1861, 3e éd. | OS03 | 181 974 | 6019 (0,34) | -4,9
1866, 4e éd. | OS04 | 200 608 | 6914 (9,59) | 1,8
1869, 5e éd. | OS05 | 199 963 | 7072 (11,67) | 0,3
1872, 6e éd. | OS06 | 218 870 | 7431 (14,06) | 16,5
Total | | 1 143 714 | 8559 |

L'analyse arborée (selon Luong, 1994 ; cité dans Brunet 2011) met en évidence la faible distance séparant les textes, ce qui est attendu (figure 1), mais permet de situer les différentes éditions entre elles : qu'il s'agisse des fréquences (1A) ou des présences (1B)⁷, on note une grande proximité entre les 1re et 2e éditions, ce qui est corroboré dans les préfaces. La 5e et la 6e sont proches, cette dernière se distinguant par les nombreux hapax. La 3e et la 4e sont intermédiaires. Nombre de lemmes privatifs passent sous la barre des 5 % ; les spécificités sont peu nombreuses, ce qui est attendu, mais révélateur. Les spécificités positives ne repèrent aucun mot plein pour les quatre premières éditions, mais font apparaître le pronom I et le déterminant my. C'est à la 5e édition que l'on note l'apparition de deux spécificités de mots pleins statistiquement significatives : survival et fittest, avec un écart réduit de 4,6 et de 4, respectivement, pour les formes, ou survival (substantif, 4,6) et fit (adjectif, 4) pour les lemmes. Dans la 6e édition, apparaissent Mr (7,1), through (6,1), cambrian (5,8), orchids (4,3), developed (4,9) et development (4,2), lower (4,2), beneficial (4,1) et spontaneous (4,1). L'analyse des lemmes fait, en plus des précédents, remonter survival (substantif, 4,6), spine (substantif, 5,3), increased (adjectif, 4,2), movement (substantif, 4,1), fit (adjectif, 4,1), beneficial (adjectif, 4,1) et spontaneous (adjectif, 4,1).

¹ Les textes anglais viennent du site Darwin Online (John van Wyhe, dir. 2002-. The Complete Work of Charles Darwin Online – http://darwin-online.org.uk/). Les textes français ont été obtenus par Gallica ou Google livres, ou ont été numérisés par nous.
² Téléchargeable à .
³ Voir le variorum en ligne (van Wyhe, 2002- ; <http://darwin-online.org.uk/Variorum/1859/1859-1-dns.html>).
⁴ Celle de 1876, dite 6b, est quasiment identique à celle de 1872. C'est l'édition de 1872 qui a été traduite par Edmond Barbier (1876), raison pour laquelle nous l'avons choisie dans notre analyse.
⁵ Le nombre d'hapax augmente considérablement dans la 6e édition : respectivement, 45, 40, 61, 133, 134 et 622 occurrences (lemmes) de la 1re à la 6e édition (écart réduit de 33,5 pour la 6e édition).
⁶ Les valeurs reportées dans les tableaux sont fournies par Hyperbase. Il y a de légères différences avec des valeurs publiées antérieurement, dues à la préparation des textes et aux logiciels utilisés pour le décompte.
⁷ Respectivement selon Labbé et Jaccard, cités dans Brunet 2011.
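The "écarts réduits" quoted above and in Table 1 are specificity scores. As a rough illustration only, the usual binomial approximation of such a score can be sketched as follows; Hyperbase's exact computation may rest on the hypergeometric model instead, and the hit counts in the example call are hypothetical (only the part size n and corpus size N are taken from Table 1).

```perl
#!/usr/bin/perl
# Reduced deviation (ecart reduit) of a lemma in a subcorpus:
#   z = (k - n*p) / sqrt(n*p*(1-p)),
# with p = F/N the lemma's relative frequency in the whole corpus.
# Binomial approximation only.
use strict;
use warnings;

sub ecart_reduit {
    my ($k, $n, $F, $N) = @_;   # k: hits in part, n: part size,
                                # F: hits in whole corpus, N: corpus size
    my $p = $F / $N;
    return ($k - $n * $p) / sqrt($n * $p * (1 - $p));
}

# Hypothetical counts for a lemma in the 6th edition (OS06):
printf "z = %.2f\n", ecart_reduit(40, 218_870, 120, 1_143_714);
```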
[Figure 1 – Analyse arborée sur les lemmes : A – sur les fréquences ; B – sur les présences]

Le regroupement des spécificités en catégories reflétant le contenu sémantique (établi à partir des contextes) est instructif : concepts théoriques (fittest, fit, survival, through [expression de la causation]), données et citations (cambrian, orchids, spine, Mr), vision dynamique du vivant de Darwin (develop, development, increased, movement, spontaneous), jugements de valeur (beneficial, lower [certaines occurrences]). Ainsi, les spécificités, même rares, se démarquent par leur saillance : elles captent l'introduction du fameux concept de Spencer (1864), survival of the fittest, et permettent de présumer une affirmation de la pensée de Darwin – à savoir sa vision profondément dynamique de la nature. Enfin, les spécificités négatives signalent que les fréquences relatives du déterminant possessif my et du pronom I diminuent avec le temps, ce qui traduit l'ajout de passages non argumentatifs contenant des données, et ce qui corrobore l'augmentation des hapax, constitués majoritairement par des noms d'espèces.

4. Analyse du corpus français
Le corpus français comprend un peu plus de deux millions d'occurrences (tableau 2) : trois traductions d'époque (Clémence Royer [1862, 3e éd.], Jean-Jacques Moulinié [1873, 5e éd.], Edmond Barbier [1876, 6e éd.]) ; celle de Daniel Becquemont (2008), qui part de la traduction de Barbier et la modifie pour remonter à la 1re édition ; deux modernes, par Augustin Berra (2009, 6e éd.) et Thierry Hoquet (2013, 1re éd.) (voir Vandaele et Gendron-Pontbriand [2014] pour les références bibliographiques). Les textes comptent de 181 785 à 248 863 occurrences, soit un écart de 67 078 occurrences. Les différences de coefficients de foisonnement⁸ révèlent déjà que les traducteurs ont travaillé avec des stratégies de traduction distinctes. L'homogénéité lexicale diminue par rapport aux originaux. La contribution de chacun des textes à la richesse lexicale est beaucoup plus importante en français qu'en anglais : les lemmes partagés dans les textes pris deux à deux se situent entre 4498 (13Ho et 62Ro) et 5649 (73Mo et 76Ba) pour un total de 11 712 lemmes (soit 3153 lemmes de plus que dans le corpus anglais). Chacun des textes français contribue pour un pourcentage moindre au vocabulaire commun (figure 2A). Les effectifs des lemmes privatifs sont plus importants (de 772 à 3000) et fluctuent d'un traducteur à l'autre (figure 2B). Sont mises en évidence les différences entre Becquemont (08Bq) et Hoquet (13Ho) pour la 1re édition, et entre Barbier (76Ba) et Berra (09Be) pour la 6e édition, mais aussi la proximité (attendue) entre Barbier et Becquemont.

Tableau 2 – Traductions françaises de l'OS – * d'après la traduction de Barbier de la 6e édition
Année de publication | Traduit par | Code | Nombre d'occurrences | Coefficient de foisonnement | Édition originale anglaise | Effectif des lemmes N (écart réduit)
1862 | C. Royer | 62Ro | 207 633 | 14 % | 1861 (3e) | 6357 (-6,7)
1873 | J.-J. Moulinié | 73Mo | 211 691 | 6 % | 1869 (5e) | 7036 (0,8)
1876 | E. Barbier | 76Ba | 241 170 | 10 % | 1872 (6ea) | 6971 (-3,8)
2008 | D. Becquemont | 08Bq | 186 440 | 9 % | 1859 (1e)* | 6260 (-4,8)
2009 | A. Berra | 09Be | 248 863 | 14 % | 1876 (6eb) | 7804 (5,0)
2013 | T. Hoquet | 13Ho | 181 785 | 7 % | 1859 (1e) | 6579 (-0,2)
Total | | | 1 277 582 | | | 11 712

⁸ Le coefficient de foisonnement est l'accroissement du nombre d'occurrences observé lorsque l'on traduit de l'anglais au français. Il est généralement admis, en traduction dite « pragmatique » (par opposition à la traduction littéraire), que le taux de foisonnement se situe généralement entre 10 % et 15 %, une des causes étant que le français recourt à plus de mots grammaticaux que l'anglais. Une forte concision peut diminuer ce taux.

Les distances lexicales intertextuelles (figure 3) confirment la proximité de Becquemont et de Barbier, mais révèlent deux faits inattendus : 1) Royer (62Ro) se situe sur la même branche que Berra et Hoquet ; 2) Moulinié (73Mo) se place entre Becquemont et Barbier lorsque l'on passe des fréquences aux présences.

[Figure 2 – A – Contributions respectives de chacun des textes aux parties communes des corpus anglais et français (lemmes)⁹ – B – Richesse lexicale (lemmes). Le pointillé indique le seuil de 5 %.]

Diverses hypothèses explicatives doivent être explorées, mais il n'est en tout cas plus permis de douter que les manières de traduire sont décisives au point de brouiller, sur le plan lexical, la chronologie des versions originales, et que cette approche permet de mettre ces particularités en évidence.

[Figure 3 – Analyse arborée (méthode Luong) sur les lemmes : A – calculée sur les fréquences (Labbé) ; B – calculée sur les présences (Jaccard)]

Nous nous sommes ensuite concentrée sur les spécificités positives des lemmes des mots pleins et, parmi elles, avons sélectionné les unités dont la signification paraissait la plus caractéristique du propos central de l'OS : ainsi, sélection, préservation, pouvoir… ont été retenus, mais pas aujourd'hui, grandement, inclure…

⁹ Le schéma a été obtenu à partir des effectifs des lemmes pour chacun des textes, ramenés en pourcentage du nombre total de lemmes par corpus (représentation « radar » fournie par Excel v.16). Les effectifs des lemmes des textes traduits ont été disposés en regard des textes anglais (ceux de OS1 et OS6 ont donc été dupliqués) ; de plus, la forme asymétrique du tracé pour le français rend compte de l'absence de traduction d'OS2 et d'OS4. À cause de ces particularités, l'aire délimitée par les traits n'est pas représentative des valeurs totales pour chacun des corpus, mais le schéma reste visuellement parlant.

[Figure 4 – Analyse factorielle de correspondances : sélection de lemmes parmi les spécificités]

La quarantaine de lemmes ainsi obtenus a permis de générer un graphe (figure 4) représentant le résultat d'une analyse de correspondances (menée selon le programme de Lebart, inclus dans Hyperbase, sur les données pondérées). Le graphe montre que les modernes (Berra, Hoquet) s'opposent aux anciens (Barbier, Moulinié) ou quasi-ancien (Becquemont), Royer se situant à part. La consultation des contextes ciblés par cette méthode dans les corpus alignés par Logiterm permet d'analyser qualitativement les choix de traduction.
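Before turning to the aligned contexts, note that the expansion coefficients of Table 2 can be re-derived from the token counts of Tables 1 and 2. A small verification sketch follows; the pairing of 09Be with the OS06 count follows note 4 above (the 1876 text being quasi-identical to that of 1872).

```perl
#!/usr/bin/perl
# Re-derive the "coefficients de foisonnement" of Table 2 from the
# token counts of Table 1 (source editions) and Table 2 (translations).
use strict;
use warnings;

my %src = (OS01 => 170_634, OS03 => 181_974,
           OS05 => 199_963, OS06 => 218_870);
my @tr  = (['62Ro', 'OS03', 207_633], ['73Mo', 'OS05', 211_691],
           ['76Ba', 'OS06', 241_170], ['08Bq', 'OS01', 186_440],
           ['09Be', 'OS06', 248_863], ['13Ho', 'OS01', 181_785]);

for my $t (@tr) {
    my ($code, $os, $n) = @$t;
    printf "%s: %+.0f %%\n", $code, 100 * ($n / $src{$os} - 1);
}
# Prints +14, +6, +10, +9, +14, +7 — matching Table 2 after rounding.
```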
L’exemple le plus frappant est le choix de élection et de électif par Royer, qui s’oppose au choix de sélection par les autres traducteurs (tab. 3). 812 JADT’ 18 Tableau 3 – Traductions alignées d’une phrase commune à toutes les éditions anglaises (Introduction) Darwin and we shall then see how Natural Selection almost inevitably causes much Extinction of the less improved forms of life… 62Ro Nous verrons comment cette élection naturelle cause presque inévitablement de fréquentes extinctions d’espèces parmi les formes de vie moins parfaites… 73Mo Nous y verrons comment la sélection naturelle détermine presque inévitablement l'extinction des formes moins perfectionnées… 76Ba Nous verrons alors que la sélection naturelle cause, presque inévitablement, une extinction considérable des formes moins bien organisées… 08Bq Nous verrons alors que la sélection naturelle cause presque inévitablement une extinction considérable des formes moins bien organisées 09Be nous verrons alors de quelle façon la sélection naturelle cause presque inévitablement une forte extinction des formes de vie moins améliorées… 13Ho Et nous verrons comment la Sélection Naturelle cause presque inévitablement une grande Extinction des formes de vie moins améliorées… 5. Conclusion Le ciblage de contextes, repérés au moyen d’une analyse lexicométrique préalable, dans des corpus alignés conséquents est une stratégie de choix. Elle permet d’arriver assez vite à des observations statistiquement significatives et de pointer d’emblée sur des éléments majeurs sans hypothèse préalable. Comme le souligne Brunet (2002), l’intérêt de travailler sur des traductions est que certains paramètres sont fixés. L’inconvénient actuel de l’entreprise tient à la faible ergonomie du processus, c’est-à-dire aux nombres de clics liés au passage d’un logiciel à l’autre. Restent les nombreuses modifications sous le seuil de 5 %, qui peuvent recéler, malgré l’absence de signification statistique, des éléments cruciaux en matière de choix de traduction. D’autres stratégies de filtrage sont alors nécessaires pour leur étude. Remerciements Nous remercions vivement Étienne Brunet, Damon Mayaffre et Laurent Vanni pour leurs conseils sur l’utilisation d’Hyperbase. Il va de soi que les éventuelles erreurs sont nôtres. Merci aussi à Marie-Joëlle StratfordDesjardins, étudiante auxiliaire de recherche, pour son aide à la préparation du corpus. La présente recherche a bénéficié d’une subvention de recherche du Conseil de recherche en sciences humaines du Canada (2015-2018). JADT’ 18 813 Références Brunet É. (2002). Un texte sacré peut-il changer ? Variations sur l’Evangile. In Cook J., dir. Bible and Computer, Leiden / Boston : Brill, pp. 79-98. Brunet É. (2011). Hyperbase – Manuel de référence. Hyperbase pour Windows, version 8.0 et 9.0. Luong X. (1994). L’analyse arborée des données textuelles : mode d’emploi. Travaux du cercle linguistique de Nice, 16 : 27-42. Monti E. et Schnyder, P., dir. (2011). Autour de la retraduction : Perspectives littéraires européennes. Coll. Universités, Paris : Orizons, Spencer H. (1864). The Principles of biology. Vol. 1, New York: Appleton. Vandaele S. et Gendron-Pontbriand E.-M. (2014). Des « vilaines infidèles » aux grands classiques : traduction et retraduction de l’œuvre de Charles Darwin. In: Pinilla J. et Lépinette B., dir, Traducción y difusión de la ciencia y de la técnica en España en los siglos XVIII y XIX,Valence : Universitat de València, pp. 249-276. 
814 JADT’ 18 Circuits courts en agriculture : utilisation de la textométrie dans le traitement d’une enquête sur 2 marchés Pierre Wavresky1, Matthieu Duboys de Labarre2, Jean-Loup Lecoeur3 2 1Umr Cesaer Inra-Agrosup Dijon – pierre.wavresky@inra.fr Umr Cesaer Inra-Agrosup Dijon – matthieu.duboys-de-labarre@inra.fr 3Umr Cesaer Inra-Agrosup Dijon – yajintei@hotmail.fr Abstract Semi-structured interviews about short food supply chains have been done with producers and consumers on two different markets. Our work gives an insight to the themes common to producers and consumers that are not attributable to the interviews guides. It also underlines the advantages of a textometric approach and the precautions necessary to interpret such a corpus. Résumé Des entretiens semi-directifs sur le thème des circuits courts alimentaires ont été menés sur deux marchés, auprès de producteurs et des consommateurs. Notre travail s'intéresse notamment aux thématiques communes aux producteurs et consommateurs et qui ne soient pas imputables aux grilles d’entretiens. Il souligne par ailleurs les apports d'une approche textométrique, ainsi que les précautions d'interprétation sur un tel corpus. Keywords: short food supply chain, semi-structured interviews, textometry 1. Introduction et méthodologie Les circuits courts alimentaires interviennent de plus en plus dans le débat social. Ils sont devenus l’emblème d’une opposition au « modèle conventionnel ». Ils s’inscrivent également dans des enjeux de politique publique (définition légale en 2009 avec le plan Barnier1), et scientifique. Ils comprennent des formes innovantes comme les AMAP, mais aussi des formes plus anciennes comme les marchés ou la vente à la ferme. La sociologie a abordé les circuits courts sous des angles variés : la consommation engagée (Dubuisson-Quellier, 2009), la sociologie de 1 Circuit de commercialisation comprenant au plus un intermédiaire entre le producteur et le consommateur. JADT’ 18 815 l’innovation (Chiffoleau et Prévost, 2012), d’autres ont approché la question en décalant le point de vue vers le développement local (Traversac, 2010) ou au travers de la notion de proximité (Mundler et Rouchier, 2016). Les travaux de sociologie insistent sur l’intérêt économique des circuits courts, mais aussi sur leur capacité à recréer du lien social (Prigent-Simonin et HéraultFournier, 2014). De nombreux dispositifs s’appuyant sur les circuits courts de commercialisation se caractérisent par un rapport direct entre consommateurs et producteurs. Ce lien a été l’objet de différentes analyses et interprétations dans la littérature. Il est perçu comme un déplacement de l’espace de référence des agriculteurs vers celui des consommateurs (Dufour et Lanciano, 2012). Il a aussi été analysé comme le lieu de rencontre autour d’attentes plurielles (Chiffoleau et Prévost, 2012). Plus généralement, il s’ancrerait dans des logiques communes de re-localisation des pratiques agricoles et alimentaires (Duboys de Labarre, 2005). C’est ce lien que nous allons analyser au travers d’un dispositif textométrique. Nous mettrons en lumière les intérêts et les éventuelles limites interprétatives liés au type de corpus (faible nombre d’entretiens semi-directifs). Cela nous éclairera également sur les thématiques abordées et leur spécificité. 
Dans le cadre du projet européen H2020 « Strength2food » 2 , pour la France, nous avons interrogé 23 personnes3 (12 vendeurs-producteurs et 11 consommateurs) sur deux marchés (en milieu rural et en milieu urbain) par entretien semidirectifs. Nos deux sous-populations relèvent d’initiatives différentes dans leur structuration et leur ancienneté4. Dans les deux cas, les parties-prenantes restent attachées à la consommation/production bio et sont assez engagés. Ce corpus n’est donc pas représentatif (ni des consommateurs ni des producteurs) et nous considérons ce travail comme exploratoire. Le corpus est analysé grâce au logiciel de textométrie Iramuteq5, les thèmes communs ou spécifiques des producteurs et consommateurs seront recherchés essentiellement par classification descendante hiérarchique (Reinert, 1983) et par analyse de spécificité. Parmi les variables caractérisant les textes, a été incluse une variable à 4 modalités : consommateur-rural, https://www.strength2food.eu/. Ce projet a été financé par le programme de recherche et d'innovation Horizon 2020 de l'Union européenne dans le cadre de la convention de subvention n° 678024 3 Ces entretiens, structurées autour de 6 thèmes, sont semi-directifs et visent à favoriser l’expression des acteurs. Ils sont retranscrits mot à mot et incluent des annotations de l’intervieweur. 4 Celle en milieu urbain est un marché de plein vent traditionnel, celle en milieu rural est un marché de producteurs innovant. 5 http://www.iramuteq.org/ (Pierre Ratinaud) 2 816 JADT’ 18 consommateur-urbain, producteur-rural, producteur-urbain6. Comme la longueur des interviews est très variable (de 102 à 560 segments de texte) et le nombre d’interviewés assez faible (23), les statistiques relatives à cette variable peuvent être essentiellement imputables à une interview, il est donc d’autant plus nécessaire de revenir à l’interview. De plus il peut arriver que le lien, en termes de Khi², entre une des quatre catégories (ou une interview) et une thématique (classe de la classification) soit faible. Or quelques segments de textes énoncés par cette catégorie sous-représentée sont parfois très liés à cette thématique, et dire que le lien est faible serait erroné. D’où l’analyse, aidée par une représentation graphique, des segments de textes les plus caractéristiques d’une classe, pour chaque catégorie étudiée. Deux annotations de l’intervieweur, caractérisant la parole de l’interviewé, ont été conservées au sein du corpus, et seront donc analysées comme les autres mots : « rire » (codé « _rire ») et « blanc », signifiant un délai avant la réponse ou en son sein (codé « _blanc). Le but étant de voir si des hésitations (« _blanc ») sont cooccurrentes d’autres lemmes. 2. Analyse statistique du corpus réponse Les 5 lemmes les plus courants sont : aller, voir, bio, gens, marché. Ce qui ressemble à un programme : aller au marché, donc favoriser un mode de circuit court, pour acheter ou vendre des produits bio et pour voir des gens, donc avec un aspect relationnel important. Il est probable que les lemmes bio, aller et marché soient liés au contexte d’enquête (nature des enquêtés pour bio et nature des dispositifs pour aller et marché). Enfin, le caractère assez homogène de l’importance quantitative de ces 5 lemmes peut être interprété comme le reflet d’un horizon commun partagé par nos informateurs et ce en dépit de de leur groupe d’appartenance (producteur ou consommateur) ou du dispositif étudié. 2.1. 
Classification descendante hiérarchique : 12 types de discours Une classification descendante hiérarchique7 (Reinert 1983) a permis de dégager 12 types de discours. Nous nous focaliserons sur 2 ensembles de classes8, selon qu'elles sont plutôt spécifiques ou peu spécifiques d'une catégorie (producteur ou consommateur). Producteur-urbain signifiant producteur vendant sur le marché de la ville moyenne, en opposition avec producteur-rural qui vend sur le marché du village. 7 5264 segments de texte sur les 6231, soit 84%, ont été retenus par la classification. 8 Nous écartons la classe 3 (12,5%) car elle est peu interprétable (lemmes polysémiques : chose, gens, monde...). 6 JADT’ 18 817 Graphique 1 : les 12 classes de discours Le premier ensemble regroupe les classes 1, 2, 6, 9 et 11 qui sont caractéristiques d’un sous-groupe. Les classes 1 et 11 concernent surtout les producteurs, par contre les classes 2, 6 et 9 émanent principalement de consommateurs. Dans la classe 1 (14.4%) il est question des aides, de projet, d’installation, de reprise (d’exploitation), d’investissement. Il y a des critiques sur la PAC (notamment sur le fait que ce soit compliqué), mais pas seulement : « Bah comme on a de la surface un peu ouais ça commence c’est super compliqué la PAC je sais pas si tu veux qu’on en parle _rire même nous on a du mal » (Lydie, productrice rurale). La classe 11 (11,7%) est orientée autour des produits laitiers (lait, chèvre, fromage, yaourt, vache, faisselle, litre, cabri…), avec un aspect monétaire (euro, prix). Dans la classe 6 (8.1%) c’est de nourriture dont il est question, notamment le fait de manger des fruits et légumes de saison (manger, tomate, fraise, saison, pas en hiver). C’est un discours de consommateurs, surtout urbains. Melissa et Jennifer parlent surtout des courses qu’elles font, où elles les font (sur le marché de la ville moyenne essentiellement, où elles ont été interrogées). Toutefois l’autre thème (manger des fruits de saison) est celui qui est le plus typique de cette classe. Dans la classe 9 (3.3%) il est question de ville (vivre en ville/à la campagne) et de distance, aussi bien en termes de proximité que de nombre d’intermédiaires (distance, kilomètre, circuit_court, intermédiaire). C’est plutôt une classe de consommateurs. Enfin dans la classe 2 (12.4%) les 4 premiers lemmes forment une phrase : acheter produit bio producteur. Revendeur et local sont présents aussi. Il est donc question du comportement d’achat, mais pas des produits qu’on achète, comme dans la classe 6, plutôt de certaines de leurs propriétés (bio) et de la qualité du vendeur (producteur). Les classes 1, 2 et 6 renvoient directement à des thèmes abordés dans les guides d’entretiens respectifs des groupes et la classe 11 à une catégorie de produit agricole 818 JADT’ 18 spécifique qui était surreprésentée dans l’échantillon des producteurs transformateurs (5 informateurs sur 12). Ces classes parlent des pratiques liées aux groupes (professionnelles, d’achat et de consommation alimentaire) et permettent de les caractériser. Nous noterons que les classes 1, 2 et 6 renvoient à la notion de maîtrise ou de contrôle. Pour la classe 1 parce que les aides PAC sont parfois perçues comme extérieures et complexes. Pour les classes 2 et 6 au contraire parce qu’elles traduisent l’idée que le consommateur maîtrise sa pratique (choix de se fournir directement auprès d’un producteur et en aliments bio, locaux et de saison). Le second ensemble regroupe les classes 4, 5, 7, 8, 10 et 12. 
Elles sont peu spécifiques d’une catégorie. La Classe 10 (7.3%) est celle du respect des animaux et plus généralement du respect du vivant. On peut remarquer que le lemme _rire y est particulièrement rare : dans cette classe, le respect des animaux est abordé comme une question sérieuse. « C’est un animal pour l’élevage donc je le mange s’il a été élevé dans le respect des lois de la nature et de l’univers s’il a été élevé d’une manière respectueuse par rapport à l’environnement » (Théophile, producteur urbain) [Les mots en gras sont spécifiques de la classe]. Il n’y a pas de différence marquée rural/urbain ou producteur/consommateur. Graphique 2 : Score des segments de texte (classe 10) Mais si on considère le nombre de segments de texte caractéristiques (graphique 2), on voit que Jacques n’en parle pas beaucoup mais il en a énoncé certains très caractéristiques. Autrement dit, il parle peu mais intensément du bien-être animal : « Et nous nos animaux on est en bio on fait attention au bien-être animal on fait le choix de garder tous les petits pour pas qu’ils partent dans des élevages industriels intensifs et la suite logique» (Jacques, producteur rural [score=925]9). La classe 7 (4,5%) renvoie à deux univers de sens différents autour du lemme vie : d’une part la notion de trajectoire de vie en relation avec la parentèle (famille, parent [d’origine agricole], grand_parent, enfant), et d’autre part à une forme de souci de soi (mode de vie sain, santé reliée à nourriture et alimentation). « En amont dans un mode de vie qui devrait te permettre d’avoir une vie plus 9 La somme des Khi² (mesurant le lien entre chaque lemme et la classe) donne le score du segment de texte. JADT’ 18 819 harmonieuse plus saine plus en meilleure santé physique psychique mentale sociale parce_que tu crées du lien aussi enfin y a une… ça va dans une même mouvance » (Claire, consommatrice rurale). La classe 8 (5.5%) concerne les céréales (farine, pain, gluten, variété, vieux, boulanger), notamment les vieilles variétés. La classe 5 (6.3%) est celle du doute (on se pose des questions, il y a des _blanc : ces 3 lemmes sont entre 8 et 9 fois plus nombreux qu’attendu). « Se poser des questions » et penser évoque aussi une prise de conscience de problèmes. Mais c’est également « poser des questions » aux vendeurs sur leur production. La classe 4 (5.2%) est celle des relations et de leur importance. « Eh ben les relations humaines on côtoie une diversité de population quoi des gens et en fait on se parle c’est agréable _rire » (Christine, consommatrice rurale). Enfin la classe 12 (8.8%) est celle du temps (temps passé [heure], horaire précis [h]). Les jours de la semaine sont cités, les moments de la journée aussi, avec matinée, nuit, café, boire… Les 2 individus les plus impliqués dans cette classe sont François et Thérèse (éleveurs urbains). Il n’y a pas de spécificité forte d’une des 4 catégories car s’il y a surreprésentation de certains producteurs dans cette classe, d’autres parlent très peu de cet aspect (David et Théophile). Or les deux producteurs qui sont principalement impliqués dans cette classe se sont installés dans un cadre familial (ils ont repris l’exploitation de leurs parents). Alors que ceux qui en parlent le moins sont des hors cadres familiaux. La littérature (Dufour et Lanciano, 2012) souligne que les contraintes temporelles sont plus importantes dans le cadre d’une production en circuits courts. 
Cette dernière serait vécue différemment en fonction de la trajectoire des agriculteurs (cadres ou hors cadres familiaux). Le caractère commun de ces classes nous permet de proposer quelques pistes de réflexions concernant les liens qui se nouent entre producteurs et consommateurs. La classe 5 (celle du doute) renvoie partiellement à une forme de réflexivité partagée par ces deux groupes. Le respect des animaux et de la nature (classe 10)10 et l’aspiration à un mode de vie, un souci de soi (classe 7) dessinent un lien entre préoccupations personnelles et engagements globaux (respect des animaux et cause environnementale) (Pleyers, 2011). Enfin, la classe 4 souligne l’horizon commun que constitue l’importance du lien social attaché aux circuits courts. 10 Cette classe commune émerge dans le discours alors qu’elle n’est pas un thème des deux guides d’entretiens. 820 JADT’ 18 2.2. Pronoms personnels et spécificités L’analyse des spécificités des 4 catégories d’interviewés, toutes classes confondues, a mis notamment en évidence un emploi très différencié des pronoms personnels. Les consommateurs ruraux citent souvent deux des producteurs par leur prénom. Le lemme discuter est également présent. Donc ils parlent de gens avec lesquels ils sont en lien fort. Les consommateurs urbains citent beaucoup je et j, ainsi que vous : « Oui et puis […] si vous voulez vos salades au bout de 3 ou 4 jours en grande_surface elles ont pas été vendues elles ont quand même pas la même tête que celles que j’achète qui ont été cueillies la veille hein » (Mélissa, consommatrice urbaine). Il est donc question de ce que l’interviewé fait (je, j) et de ce qu’il ne fait pas (vous). Donc de son comportement d’achat : ce qu’il achète, du lieu où il achète ou pas (marché, supermarché, …), de la façon dont c’est produit ou vendu (bio, label, équitable, local, transport). Il y a également le lemme rencontre : le lien est présent, mais de façon plus conceptuelle, moins proche que dans le groupe des consommateurs ruraux. Chez les producteurs ruraux les pronoms tu et nous sont très employés. Le nous peut renvoyer à un couple de producteurs (Georges et Gina) ou à une communauté à laquelle on appartient : (les producteurs diversifiés, les producteurs du marché du village rural) : « Nous ce qui fait la caractéristique du secteur c’est que c’est des exploitations qui sont tournées vers beaucoup d’espèces on n’a pas de spécialisation enfin pas de très très grosse spécialisation » (David, producteur rural). Il nous semble que cette spécificité dans l’utilisation des pronoms peut-être rattachée à la nature différente des dispositifs (et non à leur caractère rural ou urbain). Dans un cas, le marché de plein vent traditionnel, nous avons affaire à une structure de taille importante qui préexiste aux acteurs. S’il est bien un lieu de rencontre, il est plus fortement marqué par une dimension individuelle tant pour les producteurs que pour les consommateurs (d’où la présence du je). Dans l’autre, le petit marché de producteurs engagés, nous avons affaire à un projet de taille plus réduite construit par une partie des acteurs. Les relations interpersonnelles, l’identification à un ou des collectifs mais également la dimension participative y sont donc plus marquées. 3. 
Conclusion et perspectives De nombreux thèmes sont apparus fortement dans le discours des interviewés : l’importance des relations, l’importance d’acheter au producteur des produits bio, de manger des produits de saison, d’utiliser des variétés de blé ancienne, de respecter l’environnement et les animaux. D’autre part, l’emploi de pronoms personnels différents et l’usage ou non de prénoms, révèlent une proximité avec les producteurs locaux (discours des JADT’ 18 821 consommateurs ruraux), l’appartenance à un groupe (discours des producteurs ruraux), une norme dans le comportement d’achat (discours des consommateurs urbains). Il est important de ne pas tenir compte uniquement de la spécificité globale d’une catégorie (ou d’un interviewé) pour juger de sa plus ou moins grande implication dans une thématique (cas de Jacques). De ce fait, les thèmes révélés par la classification ne sont pas toujours très spécifiques d’une catégorie. Malgré un corpus restreint et spécifique, la textométrie permet de mettre au jour des éléments factuels identifiés dans la littérature et d’esquisser des liens analytiques avec des approches théoriques plus générales. Ces résultats nous amèneront à poursuivre ce travail, dans le cadre du projet Strenght2Food, en y intégrant une comparaison internationale (avec tout ou partie du corpus des 6 pays partenaires sur cette thématique). Références Chiffoleau Y., Prévost B. (2012). Les circuits courts, des innovations sociales pour une alimentation durable dans les territoires, Norois, 224. Duboys de Labarre M. (2005). Le mangeur contemporain, une sociologie de l’alimentation. Thèse de sociologie, soutenue à Bordeaux, 426p. Dubuisson-Quellier S. (2009). La consommation engagée. Paris, Presses de la Fondation nationale des sciences politiques (Contester). Dufour A., Lanciano E. (2012). Les circuits courts de commercialisation: un retour de l'acteur paysan ? Revue Française de Socio-Économie (n° 9), pp. 153-169. Mundler P., Rouchier J. (2016). Alimentation et proximités: Jeux d’acteurs et territoires. Educagri. Pleyers G. (dir.) (2011) La consommation critique, mouvements pour une alimentation responsable et solidaire. Desclée de Brouwer. Prigent-Simonin A-H., Hérault-Fournier C. (2014). Au plus près de l’assiette. Editions Quæ. Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l’analyse lexicale par contexte. Les cahiers de l’analyse des données, VIII(2) :187-198. Traversac J.B. (2010). Circuits courts : contribution au développement régional. Educagri. 822 JADT’ 18 On the phraseology of spoken French: initial salience, prominence and lexicogrammatical recurrence in a prosodic-syntactic treebank Rhapsodie Maria Zimina, Nicolas Ballier Université Paris Diderot mzimina@eila.univ-paris-diderot.fr; nicolas.ballier@univ-paris-diderot.fr Abstract This paper focuses on specific quantitative characteristics of spoken language phraseology in the Rhapsodie speech database (ANR Rhapsodie 07 Corp-03001). A recent study (Zimina & Ballier, 2017) has shown that prosodic segmentation into IPE: Intonational PEriods (segments of speech with distinctive pitch and rhythm contours) available within the Rhapsodie database offers new insights for the observation of the functions of formulaic expressions in speech. Recurrent lexicogrammatical patterns at the beginning of Intonational PEriods (IPE) are strongly related to spoken formulaic language. 
These variations of initial salience depend upon several factors (interactional needs, social context, genres, etc.). Further experiments have shown that initially salient patterns also have specific prosodic characteristics in terms of prominence (prosodic stress) across the major speech genres of the Rhapsodie dataset (oratory, narrative, description, argumentation, procedural) and the corresponding speaking tasks. These specific prosodic characteristics are likely to reflect the communicative needs of speakers and listeners (interactions, uptakes, speaking turns, etc.).

Keywords: phraseology, prosodic constituents, prominence, salience, textometrics

1. Introduction
Our research examines the notions of phraseology and formulaic language in speech production on the basis of prosodic transcriptions indicating specific events in speech: boundary tones, pitch accents, disfluent segments, etc. (Yoo et Delais-Roussarie, 2009). We believe that such speech events coded in spoken corpora are relevant for identifying the prosodic characteristics of formulaic language. Corpus-based studies of phraseology often exploit recurrent patterns detected using repeated segments, co-occurrences and pattern-matching techniques to explore formulaic strings of written texts (Granger, 2005; Sitri et Tutin, 2016). This approach seems equally applicable to oral discourse. Following this approach, our initial objects of study are predictable and productive sequences of signs called lexicogrammatical patterns (lexical signs, grammatical constructions). Made of permanent 'pivotal' signs and a more productive 'paradigm', these patterns may be discontinuous and may or may not be syntactic constituents (Gledhill, 2011; Gledhill et al., 2017). For example:
§ et donc euh c'est pour ça qu'aujourd'hui je suis en italien en XXX …
§ c'est-à-dire
§ ouais
§ un mois c'est pour ça que ça s'appelle radio Timsit …
§ mais bien sûr donc

1 transition and 1->2 transition. Aside from this, only two other RC combinations exist for any possible CP: 1->1, when both CP members belong to RC 1, and 2->2, when PX mod 3 == PX+1 mod 3 == 2. Note the symmetry: two RC transitions and two "non-transition" states. Such symmetry exists, ex vi termini, only in the realm of modulo 3. Focusing solely on these transition / non-transition properties of consecutive pairs occurring among the first 30 million primes, one can observe:
• there are 16687076 transitions
• there are 13312923 non-transitions
• the longest uninterrupted sequence of consecutive transitions consists of 32 primes
• the longest uninterrupted sequence of consecutive non-transitions consists of 19 primes
• etc.
Those unafraid of induction could thus simply conjecture that, given 16687076 / 13312923 ≈ 1.253, it is approximately 25% more probable that PX+1 will belong to a different modulo-3 RC than PX. In other words: approximately 25% less carbon dioxide could potentially be emitted if machines aiming to discover a new prime PX+1 would explore:
• sequences (PX + 4 + 6*n) if it is known that PX mod 3 == 1
• sequences (PX + 2 + 6*n) if it is known that PX mod 3 == 2
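(The offsets above generate the odd candidates in the opposite residue class: a prime PX ≡ 1 mod 3 is ≡ 1 mod 6, so PX + 4 + 6n ≡ 5 mod 6; a prime PX ≡ 2 mod 3 is ≡ 5 mod 6, so PX + 2 + 6n ≡ 1 mod 6.) The tally itself is easy to reproduce at a smaller scale. The following is a minimal counting sketch, not the original script; the bound is far below the 30 million primes quoted above, so the resulting ratio will differ somewhat.

```perl
#!/usr/bin/perl
# Count modulo-3 residue-class transitions vs non-transitions among
# consecutive primes below a bound (sieve of Eratosthenes).
use strict;
use warnings;

my $limit = 2_000_000;      # raise this to approach the 30M-prime tally
my (@is_comp, @primes);
for my $i (2 .. $limit) {
    next if $is_comp[$i];
    push @primes, $i;
    for (my $j = $i * $i; $j <= $limit; $j += $i) { $is_comp[$j] = 1 }
}
shift @primes while @primes && $primes[0] <= 3;   # drop 2 and 3

my ($trans, $same) = (0, 0);
for my $k (1 .. $#primes) {
    ($primes[$k - 1] % 3 == $primes[$k] % 3) ? $same++ : $trans++;
}
printf "transitions: %d, non-transitions: %d, ratio: %.3f\n",
       $trans, $same, $trans / $same;
```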
Reproducible Identification of Pragmatic Universalia in CHILDES Transcripts

Daniel Devatman Hromada 1,2,3
1 Université Paris Lumières - France
2 Slovak University of Technology - Bratislava - Slovakia
3 Berlin University of the Arts - Berlin - Germany

Abstract
This article presents the method and results of multiple analyses of the biggest publicly available corpus of language acquisition data: the Child Language Data Exchange System. The methodological aim of this article is to present a means by which science can be done in a highly positivist, empiric and reproducible manner consistent with the precepts of the "Open Science" movement. Thus, a handful of simple one-liners pipelining standard GNU tools like "grep" and "uniq" is presented which, when applied to the myriads of transcripts contained in the corpus, can potentially pave a path towards the identification of statistically significant phenomena. Relative frequencies of occurrence are analyzed along the age and language axes in order to help identify certain concrete, pragmatic universalia marking different stages of linguistic ontogeny in human children. One can thus observe a significant culture-agnostic decrease of laughing in child-produced speech and in child-directed indo-european "motherese" occurring between the 1st and 2nd year of age; a maternal increase in the production of the pronoun denoting the 2nd person singular, "you"; an increase in the usage of the 1st person singular "I" in utterances produced by children around the 3rd year of age, and a marked decrease of the same which takes place around 6 years of age. Other significant correlations, both intra-cultural (between English mothers and children) as well as inter-cultural, are pointed out, always accompanied by a thorough description of the methodology, immediately reproducible on an average computer.

1. Introduction
Reproducibility is one of the hallmark principles of occidental science. Being based upon the philosophy of the ancient Greeks, who were fully aware that only the knowledge of that which repeats itself in many instances can lead to generic and transtemporal ἐπίσταμαι, the western scientific method necessarily considers reproducibility as its main condition sine qua non. In the words of the foremost figure of modern epistemology, "non-reproducible single occurrences are of no significance to science" (Popper, 1992). Hence the primary, epistemological, objective of this article is to show how anyone willing to do so can perform reproducible analyses and experiments regarding the phenomena traditionally falling into the scope of corpus, computational and developmental linguistics.
This objective is quite naturally attained if three precepts are stringently followed:
• use publicly available data
• analyse the data with simple, specific yet powerful tools which are well known to the widest possible public
• faithfully protocol the exact procedure of usage of these tools
In more concrete terms, we promote the idea that - in regards to the analysis of statistical textual data - core GNU (Stallman, 1985) utils and commands, as well as basic operators and core functions of open source languages like PERL (Wall, 1990) or R (Team, 2013), indeed offer such "simple, specific yet powerful tools well known to the widest possible public". When it comes to the precept "faithfully protocol the usage of these tools", it shall be implemented - in this article and potentially beyond - in the following manner: every simple transformation of data is to be completely and exhaustively described in a footnote which accompanies the description of the transformation. By "simple", we mean such a transformation as can be described as a standard UNIX shell1 one-liner pipelining and combining together core commands like "grep", "uniq" or "sort". In the case of more complex transformations, the complete source code of the program is always to be furnished, either in the publication's appendix or at least as a URL reference. To assure the highest possible reproducibility of the experiment, the snippet should not call any modules or libraries external to the language's core distribution (e.g. no CPAN resp. CRAN). The most important thing, however, is not to forget that the protocol is to be complete, exhaustive and unambiguous. That is, the history of all steps is to be described in a form which is immediately executable on a standard GNU-positive machine. All means all: from the very fact of downloading2 the corpus from a publicly available source to the very act of plotting the legend on a figure which is then disseminated among scientific communities. Given that these precepts are followed, and under the conditions that
• the analysis is fully deterministic (i.e. does not involve any source of stochasticity)
• the source corpus has not changed in the meanwhile
it can be expected that the same analysis shall bring the same results no matter whether it is executed in another folder of the same computer (reproducibility across directories), executed on different computers (reproducibility across experimental apparatus) and/or executed by a different experimenter (experimenter-independent reproducibility).

1 $ echo 'All footnote-descriptions of shell one-liners begin with the sign $ and all footnote-descriptions of R commands begin with sign >.'
2 It is highly recommended to use standard utilities like "wget" or "curl" for that purpose.

2. Corpus & Method
The Child Language Data Exchange System (CHILDES) undoubtedly belongs among the most fascinating language-related corpora. Established by (MacWhinney and Snow, 1985) more than 30 years ago and including transcripts dating back to the 1960s, CHILDES does not cease to be the biggest public repository of child language acquisition and development data. Thus, aside from huge volumes of audio and video recordings of verbal interactions with children, CHILDES also contains more than thirty thousand distinct transcripts. The transcripts themselves are encoded as UTF-8 compliant plaintext .CHA files. These files follow the CHAT format specified in (MacWhinney, 2012).
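Before any transformation, it is instructive to simply peek into one raw transcript. A minimal sketch (the concrete file path is hypothetical; any .CHA file of the corpus will do):

$ grep -m 12 -P '^(@|\*(MOT|CHI):)' Eng-NA/Brown/Eve/eve01.cha # header lines begin with @, utterances with *MOT: resp. *CHI: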
Every transcript contains a header describing specific facts concerning the transcribed scenario - e.g. the age of the child or the identities of the participants (lines beginning with *CHI denote utterances produced by children; lines beginning with *MOT denote utterances produced by their mothers). Unfortunately, different linguists have followed the CHAT manual in different manners. For example, some include timestamp information in their corpus and some do not. Some mark repetition by special tokens like [x 2] (for duplication) or [x 3] (for triplication) and some transcribe the utterance as such, without using such tokens. And yet another set of differences necessarily originates in the transcriber's own perception and habits. For example: while the token "mama" occurs in 1405 child utterances contained in the English sections of the corpus3, some other English transcribers (e.g. Haggerty or Suppes) apparently preferred to transcribe the mother-directed vocative as "mamma" - this occurs in 126 distinct utterances. Be it as it may, the CHILDES corpus is already so huge that one may expect that a well constituted and unbiased quantitative analysis could potentially allow the discovery of phenomena robust to any surface perturbations (e.g. differences in the habits and styles of different investigators etc.). In other terms, if every transcript is understood as the result of a distinct act of sampling, then it can be expected that the statistical aggregation of such a huge amount of distinct samples (> 30000 distinct transcripts) could lead to a situation where the noise cancels itself out and statistically significant phenomena emerge. And individual CHILDES transcripts are indeed distinct. Not only because dozens, if not hundreds, of researchers and investigators of at least three or four generations have already directly participated in the constitution of the corpus. Not only because the majority of transcripts were in one way or another related to a specific research project with a goal unrelated to the goals of other projects. But also because the investigators themselves, as well as the investigated subjects (e.g. children), often stem from a huge variety of distinct cultural backgrounds. More concretely: 26 languages are included in the corpus, covering practically the majority of the main terran language strata (i.e. indo-european languages, asian languages, semitic, altaic and ugro-finnic languages etc.). This allows for trans-cultural analysis, and of such a nature shall indeed be all analyses presented in section 3.

3 $ grep "mama" child/*Eng* |wc -l; grep "mamma" child/*Eng* |wc -l

2.1 Metrics
Results can be mutually compared and communicated only if they are expressed in common units. In the case of all experiments presented in this article, the relative frequency - interpreted as the probability of occurrence - of a pattern X is such a unit. This is equivalent to the absolute frequency of occurrence FX normalized by the total number of utterances, i.e.

PX = FX / Nutterances

Ideally, one PX value should correspond to every month mentioned in the CHILDES corpus.
To understand our approach more clearly, imagine, for example, a hypothetical language whose speakers utter 100 utterances each month from their birth until their tenth birthday. If such speakers utter the token "dog" twenty times every month, then the value of all 120 (i.e. 10 years * 12 months) datapoints describing the time series for this particular token would be constantly equal to 20/100 = 20% = 0.2. It is principally due to such a trivial nature of the calculus hereby presented that the core datamining procedures can be performed directly on the BASH command-line.

2.2 Preprocessing
Four hundred and sixty-seven megabytes of data compressed in 983 zip files are obtained after the corpus has been downloaded from its original source4 or from a mirror site which represents the state of CHILDES as of February 6th 20165. After these files are recursively decompressed6, the CHILDES arborescent structure is flattened so that all .CHA files are contained within one sole directory7. A subsequent one-liner "peeks into" each .CHA file, retrieves the information about the child's age from it and puts this information into the file's name8. Utterances containing only xxx and www tokens - which, according to the CHILDES manual, denote "unintelligible words with an unclear phonetic shape" resp. "untranscribed material" - are removed from all child and mother transcripts9. The next step is executed only to speed up the subsequent pattern extraction processes: child utterances are funnelled into simplified transcripts stored in the "CHI" subdirectory and maternal utterances are funnelled into the "MOT" subdirectory10. Translocutory information is thus lost, but this is acceptable for the purpose of this article, in which we focus solely on relative frequencies of certain tokens and not on more complex discourse units. All this yields 5833656 lines (i.e. utterances) contained in 29180 non-empty simplified transcripts stored in the "CHI" directory and 3798005 lines contained in 13590 non-empty simplified transcripts stored in the "MOT" directory. Note that metadata like age (years and months), language group, language and the CHILDES investigator's identity are stored directly in the simplified transcript's filename. The workbench common to all following analyses can thus be considered ready.

4 $ wget -P CHILDES -e robots=off --no-parent --accept '.zip' -r http://childes.psy.cmu.edu/data/
5 $ wget -P CHILDES -e robots=off --no-parent --accept '.zip' -r WILL-BE-GIVEN-IN-CAMERA-READY-VERSION
6 $ find CHILDES/data -name "*.zip" | while read filename; do unzip -o -d "`dirname "$filename"`" "$filename"; done
7 $ mkdir CHILDES_flat; find CHILDES/data -type f |perl -n -e 'chomp; if (/\.cha/) {$f=$_; s/\//-/g; s/\.-data-//g; `cp $f ./CHILDES_flat/$_`;}'; cd CHILDES_flat;
8 $ mkdir aged; grep -P '\|\d;\d' *| grep Child | perl -n -e 'chomp; `cp $1 aged/$2-$3-$1` if /^(.*?):.*0?(\d+);0?(\d+)/;' ; rm *.cha
9 $ perl -ni -e 'print if $_!~/^\*(MOT|CHI):\t(xxx|www) ?\./' aged/*
10 $ mkdir CHI; cp aged/* CHI; sed -i '/\*CHI/! d' CHI/*; mkdir MOT; cp aged/* MOT; sed -i '/\*MOT/! d' MOT/*;
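Whoever reproduces these steps can verify that her workbench matches ours with a few trivial checks (a minimal sketch, assuming the CHI and MOT directories built by footnote 10):

$ cat CHI/* | wc -l # expect 5833656 child utterances
$ find CHI -type f -size +0c | wc -l # expect 29180 non-empty simplified transcripts
$ cat MOT/* | wc -l # expect 3798005 maternal utterances
$ find MOT -type f -size +0c | wc -l # expect 13590 non-empty simplified transcripts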
3. Analyses
3.1. First Analysis – Laughing
It has recently been indicated that English mothers interacting with children younger than 16 months tend to laugh significantly more often than mothers who interact with children between 16-31 months of age (p.222, Hromada, 2015). Our 1st analysis will use CHILDES to address this hypothesis from a trans-cultural perspective. It may be surprising to use a dataset which is essentially a linguistic corpus for the purpose of studying such a non-verbal means of communication as laughing definitely is. But the very CHAT manual (p.62, MacWhinney, 2012) explicitly specifies the &=laughs marker as the most common standardized spelling denoting this specific extralinguistic event. Unfortunately, within the totality of the CHILDES corpus, the marker &=laughs is not the only standardized form denoting the phenomenon, and some authors preferred to use markers like [=! laughing]. Hence, for the purpose of our 1st analysis, we have simply used the token laugh as the one whose frequencies of occurrence we have decided to measure. Three indo-european languages (English, French and Farsi) and two non-indo-european languages (Japanese and Chinese) were chosen in order to address the developmental trajectory of laughing from a trans-cultural perspective. For each of these languages, a target investigator was identified as the one who most frequently used the marker laugh in his transcripts of motherese11. The corpus subsections "Farsi-Family", "French-MOR-York", "Japanese-MiiPro" and "Chinese-Beijing" were thus identified as target subsections. All English-language transcripts (i.e. files whose filename contains the token "Eng") were also taken into account. The core of the procedure is as follows: the total amount of utterances is obtained, for each month and each target subsection of the corpus, by a one-liner12 which redirects its output into a file whose every row contains three space-separated columns: the first column denotes the value of Nutterances, and the second and third columns denote the year resp. the month. The procedure is repeated ten times altogether: once for each of the five target corpus subsections, multiplied by the two possible values of the locutor variable (MOT13 or CHI14). There follow ten executions of a command sequence which generates ten files containing absolute frequencies of occurrence of the token laugh within the five corpus sections - again for both MOT15 and CHI16 locutors - aggregated according to the child's age at the moment when the laughing was noted down by the CHILDES investigator. And that's it: all result-containing files can now furnish the input datasets for the R code which produces the plot displayed in the corresponding figure.

[Figure 1: Probability that laughing accompanies or substitutes an utterance produced by, or directed to, a child of a specific age.]

11 $ grep laugh MOT/*French* | grep -o -P '\-French\-.+\-' | sort | uniq -c ; grep laugh MOT/*Farsi* | grep -o -P '\-Farsi\-.+\-' | sort | uniq -c ; grep laugh MOT/*Japanese* | grep -o -P '\-Japanese\-.+\-' | sort | uniq -c ; grep laugh MOT/*Chinese* | grep -o -P '\-Chinese\-.+\-' | sort | uniq -c ;
12 $ wc -l MOT/*Farsi-Family* |perl -e 'while (<>) { s/MOT\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.MOT.Farsi-Family.N
13 $ wc -l MOT/*Eng* |perl -e 'while (<>) { s/MOT\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.MOT.Eng.N
14 $ wc -l CHI/*Eng* |perl -e 'while (<>) { s/CHI\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.CHI.Eng.N
15 $ grep laugh MOT/*Eng* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp1.MOT.Eng.F
16 $ grep laugh CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp1.CHI.Eng.F

Potentially the most salient phenomenon is a marked decrease in the production of laughs which occurs between birth and the second year of age. This could potentially be explained in terms of a gradual switch from non-linguistic means of communication towards more verbal interactions. However, in the case of the child-directed speech of Japanese motherese, the relative frequency of laughing seems to increase during the same period, and in the case of Chinese the decline is much less marked than in the case of the indo-european languages. This may potentially suggest an intercultural difference - a hypothesis which is further corroborated by the fact that it is only in the case of indo-european languages that the "dotted" lines cross the "solid" lines. Id est, little English-, French- and Farsi-speaking children tend to laugh more often than their mothers, but older children seem to laugh less frequently than their mothers. This quiproquo notwithstanding, the relative frequencies of the CHI time series significantly correlate with the MOT time series both in English (Pearson's correlation coefficient 0.933, t = 7.36, df = 8, p-value = 7.886e-05) and in Farsi (corr. coef. 0.972, t = 5.9224, df = 2, p-value = 0.02735). In French, the correlation is quite close to the significance threshold (t = 4.1692, df = 2, p-value = 0.053, cor. coef. = 0.947) when the data is aggregated in year-sized packages, but insignificant (t = -1.1598, df = 27, p-value = 0.2563) when the time series are correlated with monthly granularity. No statistically significant correlation between the child-produced and mother-produced laugh time series has been observed in the case of Japanese or Chinese.
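For readers who wish to inspect the relative frequencies PX = FX / Nutterances numerically before plotting, the F and N files can be joined on their year-month key with one more one-liner (our own minimal sketch, assuming the files produced by footnotes 13 and 15):

$ perl -e 'open F,"exp1.MOT.Eng.F"; while (<F>) { $f{"$2-$3"}=$1 if /(\d+)\s+(\d+)\s+(\d+)/; } open N,"exp1.MOT.Eng.N"; while (<N>) { $n{"$2-$3"}=$1 if /(\d+)\s+(\d+)\s+(\d+)/; } printf "%s P=%.4f\n", $_, $f{$_}/$n{$_} for grep { $n{$_} } sort keys %f;'

Every printed row is one datapoint of the English MOT curve: the year-month key followed by the probability that a maternal utterance of that month is laugh-marked.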
3.2. Second Analysis – 2nd person singular
It has also been indicated that English mothers interacting with their children tend to use the pronoun for the 2nd person singular, "you", much more frequently than is the case in standard linguistic communication (p.218, Hromada, 2015). Similarly to our 1st analysis, our 2nd analysis uses CHILDES to address this hypothesis from a trans-cultural perspective. The procedure is thus very similar to the one already presented, with one major difference: we do not focus on the assessment of occurrences of one standard marker (e.g. "laugh") which is present in different corpus sections, but rather look, in each specific subcorpus, for a specific Perl Compatible Regular Expression (PCRE2p.sg) which matches the nominative forms of the 2nd person singular in the language of the subcorpus under study. The following table lists such PCREs for matching 2p.sg. in seven languages.

Language   PCRE2p.sg
English    [ \t]you[' ]
French     [\t ]t(u |oi |')
Farsi      [\t ]to
Polish     [\t ]ty
Chinese    (你|ni3)
Estonian   [\t ]s(in)?a
Hebrew     [\t ]ata?

Usage of these regexes within one-liners using the case-insensitive "grep" allows us to obtain the distributions of relative frequencies independently for MOT17 and CHI18 utterances. The command sequence yielding the distributions of Nutterances19 is practically the same as in the first analysis (c.f. footnotes 12-14), the only difference being that this time we do not focus on subcorpora representing transcripts done by specific target investigators, but rather process much bigger datasets containing all transcripts representing the language under study. The FPCRE2p.sg and Nutterances distributions are subsequently processed by R code which is, mutatis mutandis, identical to the R code snippet used in analysis 1. This yields Figure 2. A phenomenon common to all languages under study can be observed practically immediately. That is, on all solid MOT lines one can observe, between the first and fourth year of the child's age, a marked increase in the maternal usage of the 2nd person singular. Sometimes such an augmentation is less marked (as in French), sometimes it comes later (between the 2nd and 3rd year of age in the case of Farsi and Hebrew), but it always comes. And it always reaches its all-time heights before the fifth year of age, after which the maternal usage of "you" tends to slowly converge back to its "normal" levels. Note also that in English motherese, "you" is used in approximately every fifth utterance. What is also striking in regard to the English language - which is definitely the biggest CHILDES subcorpus - is the quite significant correlation between the time series representing the usage of 2p.sg. by mothers and the time series representing the usage of 2p.sg. by children themselves (Pearson's cor. coeff. = 0.768, t = 3.393, df = 8, p-value = 0.009451; Kendall's τ = 0.6, T = 36, p-value = 0.01667; Spearman's ϱ = 0.733, S = 44, p-value = 0.02117)20.

17 $ grep -i -P "[\t ]you[' ]" MOT/*Eng* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.MOT.Eng.F
18 $ grep -i -P "[\t ]you[' ]" CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.CHI.Eng.F
19 $ wc -l CHI/*Farsi*|perl -e 'while (<>){s/CHI\///;/(\d+) (\d+-\d+)-/;$h{$2}+=$1;}for (sort keys %h){/(\d+)-(\d+)/;print "$h{$_} $1 $2\n";}' >exp2.CHI.Farsi.N
20 > cor.test(aggregated_mot_lang1[,6]/aggregated_mot_lang1[,3], aggregated_chi_lang1[,6]/aggregated_chi_lang1[,3], method="kendall")
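By way of illustration - this particular instantiation is ours and not one of the original footnotes - the French row of the table above plugs into exactly the same pipeline as footnotes 17 and 18:

$ grep -i -P "[\t ]t(u |oi |')" MOT/*French* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.MOT.French.F

The same substitution works for any other row, which is precisely what makes the trans-cultural comparison a matter of a few keystrokes.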
3.3. Third Analysis – 1st person singular
Our 3rd analysis is identical to the second; the only things which change are the PCRE patterns, which are this time supposed to match the nominative forms of the pronouns denoting the 1st person singular. Id est the ego, the self-reference, the "I". The following table lists seven such PCREs matching 1p.sg. in their respective CHILDES subcorpora.

Language   PCRE1p.sg
English    [ \t]I[' ]
French     [\t ](j(e |')|moi)
Farsi      [\t ]m[aæe]n
Polish     [\t ]ja
Chinese    (我|wo3)
Estonian   [\t ]m(in)?a
Hebrew     [\t ]ani

Everything else - from the extraction of the absolute frequencies of forms matched by the PCREs all the way to aggregating, normalizing and plotting - is, mutatis mutandis, identical to the 2nd analysis. This leads to the visualisation presented in the corresponding figure. An interesting phenomenon can be noticed: while in early infancy mothers of all language backgrounds use 1p.sg. much more frequently than children (probably because the children are still in a pre-linguistic stage), the difference is swiftly and strongly counteracted. Hence, around three years of age, children of all21 cultures tend to produce 1p.sg. much more frequently than their mothers. But not only augmentations of use, also diminutions are of certain scientific interest. Hence, a steep decline in the use of 1p.sg. can be observed between the 6th and 7th year of age - that is, during the period when children enter school, and which marks the offset of that ontogenetic stage which (Piaget, 1951) labeled as "egocentric". Similarly to the 2nd analysis, a significant correlation between the time series representing the production of "I" by English-speaking mothers and the production of "I" by English-speaking children can be observed (Kendall's τ = 0.555, T = 35, p-value = 0.02861). What is more, the plot indicates a path towards the identification of statistically significant inter-cultural correlations. Thus, after filling the gap22 in the Chinese dataset related to the fact that CHILDES does not seem to contain transcripts of Chinese 8-year-olds, one shall observe a correlation23 between the time series of relative frequencies of 1p.sg. produced by French and Chinese children (Kendall's τ = 0.511, T = 29, p-value = 0.02474). Idem for English and French (Kendall's τ = 0.777, T = 32, p-value = 0.002425). And if one stays faithful to the canonical p<0.05 precept (Fisher, 1925) and opts for Spearman's rho or Pearson's coefficient rather than for Kendall's tau, then also, for example, for French and Polish (Pearson coef. = 0.837, t = 3.4219, df = 5, p-value = 0.0188; Kendall's τ = 0.619, T = 17, p-value = 0.06905; Spearman's ϱ = 0.785, S = 12, p-value = 0.04802) as well as for Polish and Hebrew (Pearson coef. = 0.759, t = 2.6117, df = 5, p-value = 0.04757; Kendall's τ = 0.619, T = 17, p-value = 0.06905; Spearman's ϱ = 0.786, S = 12, p-value = 0.04802)24.

21 With the exception of the Polish language, where we unfortunately lack motherese data from the 3rd birthday onwards.
22 > aggregated_chi_lang4[9,]=(aggregated_chi_lang4[7,]+aggregated_chi_lang4[8,])/2
23 > cor.test(aggregated_chi_lang2[,6]/aggregated_chi_lang2[,3], aggregated_chi_lang4[,6]/aggregated_chi_lang4[,3], method="kendall")
24 > cor.test(aggregated_chi_lang6[,6]/aggregated_chi_lang6[,3], aggregated_chi_lang5[,6]/aggregated_chi_lang5[,3], method="spearman")
4. Discussion
It is a common practice in contemporary Corpus Linguistics in general, and in Natural Language Processing in particular, to focus fully on the formal and theoretical properties of one's model or analysis. Thus, the majority of publications in these domains limit themselves to the dissemination of the few core formulas behind the analysis which is presented, plus the results which were obtained (F-scores etc.). In an atmosphere where sharing the code with the community is more an exception than a rule, it is not surprising that the majority of publications disregard the concrete aspects of the implementation and execution of one's analysis as unworthy of interest. Such an attitude can be excusable when one attacks a highly specific engineering problem. But in regard to analyses aiming to attain general knowledge - id est, when doing fundamental research or exploratory science - such an approach is to be discarded as inconsistent with the ideal of experimenter-independent reproducibility. In this article, we have explained how cost-efficient (i.e. as free as open source software), reproducible and transparent science can be performed at the very border of corpus and developmental psycholinguistics. More concretely, in the footnotes of this article we have presented fewer than two dozen one-liners which pipeline and combine PCREs (Wall, 1990; Hromada, 2011) with core GNU utilities like "grep", "uniq", "wc" and "sort". Aside from this, a snippet of a few dozen lines of beginner-level, non-optimized R code is hereby being published25 in order to furnish a complete description - i.e. from downloading the corpus from its publicly available source all the way to the final plots and correlation coefficients - of the three experiments hereby performed. Common to these three experiments was a preprocessing phase which purified and repartitioned hundreds of megabytes of data contained in CHILDES. The result of this phase were two directories: CHI, which contains utterances produced by children, and MOT, which contains motherese utterances (cf. section 2.2). The principal motivation behind this repartitioning was the speed-up of any subsequent analysis. For example the 3rd analysis - when executed on one sole core of a 3.2 GHz PC with 8GB RAM and with the CHILDES data stored on an SSD disk (a fairly standard configuration) - didn't last more than 15 seconds, all the way from matching the first regular expression on the first line of the first transcript to R's final plotting. Mentioning regular expressions, we consider it important to reiterate that regexes, like those implemented in Perl or PCREs, seem to us to be much more than impressive yet weird character sequences that no neophyte can read. Unambiguously denoting what they should denote - i.e. a specific set of character sequences, a specific pattern, schema and form - PCREs are formalisms in their own right (Hromada, 2011). Idem for shell commands and PERL or R instructions: they also are unambiguous formalisms, and for the purposes of NLP they can turn out to be at least as worthy as other formalisms. Formalisms, tools and methodology being thus defined by a concrete example, a question can be posed: "What should be the name of a discipline which implements such a method and uses such tools?" And given that what was done used techniques common to textometry in order to address topics common to developmental psycholinguistics (Tomasello, 2009), the answer could potentially sound: "Textometric Psycholinguistics".

25 http://wizzion.com/code/jadt2016/childes.R
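The 15-second figure is easy to put to the test, at least for the extraction part; a minimal sketch (the output filename exp3.CHI.Eng.F is our hypothetical choice, by analogy with footnote 18) wraps the 1p.sg. extraction in the standard shell timer:

$ time (grep -i -P "[ \t]I[' ]" CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp3.CHI.Eng.F)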
It is only now - with the toolbox specified, and the reproducible method and scope of interest of the discipline properly delimited - that a discussion about culture-independent anthropological constants occurring in adult-child verbal and pre-verbal interactions - id est a discussion about "linguistic universalia" and their meaning, a discussion among savants - can, hopefully, begin.

References
Fisher, R. A. (1925). Statistical methods for research workers. Genesis Publishing Pvt Ltd.
Hromada, Daniel D. (2011). Initial Experiments with Multilingual Extraction of Rhetoric Figures by means of PERL-compatible Regular Expressions. RANLP Student Research Workshop, 85-90.
Hromada, Daniel D. (2015). Theoretical Foundation of Thesis "Evolutionary Models of Ontogeny of Linguistic Categories". In press.
MacWhinney, Brian & Snow, Catherine. (1985). The child language data exchange system. Journal of Child Language, 12(02), 271-295.
MacWhinney, Brian. (2012). The CHILDES Project: Tools for Analyzing Talk - Electronic Edition. Part 1: The CHAT Transcription Format.
Piaget, J. (1951). Principal factors determining intellectual evolution from childhood to adult life. Columbia University Press.
Popper, Karl. (1992). The Logic of Scientific Discovery. Routledge, London.
Stallman, Richard. (1985). The GNU manifesto.
Team, R Core. (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Tomasello, M. (2009). Constructing a language: A usage-based theory of language acquisition. Harvard University Press.
Wall, Larry. (1990). PERL: Practical Extraction and Report Language.

Fast and Frugal Detection of Chiastic Protofigures in English Subsection of CHILDES Corpus
(regex strikes back)

Daniel Devatman Hromada 1,2,3
daniel@wizzion.com
1 Université Paris 8 / Lumières, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle
2 Slovak University of Technology, Faculty of Electronic Engineering and Informatics, Department of Robotics and Cybernetics
3 Universität der Künste, Fakultät der Gestaltung, Berlin

Table of Contents
1 Introduction: Computational Psycholinguistics, Computational Rhetorics, Main idea
2 Das Experiment: Method, Results
3 To whom it may concern: Current state, Future directions

Computational (Developmental) Psycholinguistics
C(D)P is a cross-over between computational linguistics (and/or Natural Language Processing) and developmental psycholinguistics. Main objectives:
1 use computational methods (data-mining, information retrieval, NLP etc.) to gain novel insights about the ontogeny of language competence in human children
2 develop computational models of language acquisition and embed them into language-interacting artificial agents
In this talk we focus solely on the first objective.
CHILDES corpus: a gem of gems
Child Language Data Exchange System (MacWhinney & Snow, 1985)
http://childes.psy.cmu.edu/data
http://wizzion.com/CHILDES/ (mirror from 6th Feb 2016)
1 more than 50 years of tradition
2 more than 1.5 GigaBytes of mostly textual data contained in cca 30000 transcripts
3 at least 26 languages, dialects or language combinations
4 Creative Commons BY-NC-SA licence

CHAT format
The CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions (MacWhinney, 2016; http://childes.talkbank.org/manuals/chat.pdf).
@Languages: eng
@Participants: CHI Eve Target_Child , MOT Sue Mother , FAT David Father
@ID: eng|Brown|CHI|1;6.|female|||Target_Child|||
@ID: eng|Brown|MOT|||||Mother|||
@ID: eng|Brown|COL|||||Investigator|||
@Date: 29-OCT-1962
*MOT: one two three four .
%mor: det:num|one det:num|two det:num|three det:num|four .
%act: tests tape recorder
*CHI: one two three . [+ IMIT]
A non-negligible advantage: the majority of transcripts follow the principle ONE LINE = ONE UTTERANCE.

Computational (& Cognitive) Rhetorics
Computational Rhetorics: a discipline which attained its maturity at the Computational Rhetorics Workshop organized by Harris and Di Marco at the University of Waterloo.
Computational-Cognitive Rhetorics: a discipline using computers to better understand why rhetorics casts such a powerful curse on human minds.
Computational-Developmental Rhetorics: using computers to elucidate the process of ontogeny of rhetoric competence in human children.
"A child's spontaneous remark is more valuable than all the questioning in the world." (Jean Piaget)

Main concept(s)
Scheme: a generic form which corresponds to one or more distinct constellations of observables.
Regular expression: a sequence of characters that defines a search pattern.
Perl-Compatible Regular Expressions: a concise and expressive regex standard. Much more powerful than regular grammars: it is possible to perform back-tracking!
Backtracking: allows us to match that which has already been matched; paves the way to the detection of repetitions.

Main idea
Chiasms are repetition-based schemata A1 B1 C1 X C2 B2 A2 (or A1 B1 X B2 A2). Note that the presence of the middle term (B) and the separator term (X) can be considered facultative. But in order to detect a chiasm, the initial preceptor (A1) has to be strongly reminiscent of (and ideally identical to) the terminal successor (A2). Idem for the relation between the terminal preceptor (C2) and the initial successor (C1).

Method: regex implementing the main idea
initial preceptor (A1):   (\w{3,})
middle term (B1):         (.{0,77})
initial successor (C1):   (\w{3,})
separator (X):            .{0,77}
terminal preceptor (C2):  \3
middle term (B2):         \2
terminal successor (A2):  \1
Note that the nodes of a chiasmatic structure form a double-closed graph.

Demo
Run this shell command* :
grep -irP '^\*MOT:.*(\w{3,}) (.{0,77}) (?!\1)(\w{3,}).{0,77}\3 \2 \1' *Eng*
in the directory into which You downloaded and unpacked the CHILDES corpus.
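Readers without the corpus at hand can sanity-check the pattern on a synthetic utterance (one of the maternal examples quoted below); if the regex works, grep echoes the line back:

$ printf '*MOT:\tcould put the box on the lid instead of the lid on the box .\n' | grep -P '^\*MOT:.*(\w{3,}) (.{0,77}) (?!\1)(\w{3,}).{0,77}\3 \2 \1'

Here \1 = "box", \2 = "on the" and \3 = "lid", so the back-references \3 \2 \1 close the chiasmatic loop.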
Note that the extractor can be parametrized by changing the numeric values: e.g. changing (\w{3,}) to (\w{1,}) could potentially allow You to detect grapheme-level metatheses like "asteriks with an asterisk".
* The regex sequence is hereby transferred to the Public Domain under a Creative Commons BY-NC-SA (Author Attribution, Non-Commercial, Share-Alike) licence.

Results
You'll see many playful ones...
pear pear yummy yummy yummy yummy pear .
my name is Joey Joey Joe Joe Joe Joe Joey .
I think I can I think I can I think I can I think I can I think I can I think I can .
tick tick tick tick tick tick tick tick tick tock tick tock tick tick tick tick .
Earth , moon , Earth , moon , full moon , Earth moon .
crash , boom , crash , boom , crash , boom crash !
Note: a triplicated couple A1 B1 A2 B2 A3 B3 always contains an A1 B1 B2 A3 implicit antimetabole!!!

...reversed coordinatives...
and they splish and they splash and they splash and they splish .
a dot and a dash and a dash and a dot .
well Granddad and Grandma [//] Grandma and Granddad are coming today .
it's called lamb and vegetable [//] mediterranean vegetable and lamb risotto .
Donald hopped and swam and swam and hopped until he was safe on dry ground .
every day my cows Poppy (.) Annabel (.) Emily and Heather moo and mumble (.) mumble and moo .
Chester and Wilson Wilson and Chester .

...and more exhaustive reversed lists...
Chester and Wilson and Lily Lily and Wilson and Chester .
okay , square , square , rectangle , square , oval , two , one , one , two .
blue , green , yellow , red , red , yellow , green , blue .
one two three or three two one ?
sure we went through Rhode island , Massachusetts , New Hampshire , Vermont , and then on the way back we did Vermont , New Hampshire , Massachusetts , Rhode island , right ?

...and reversals of direction and position and time...
you get one ticket that says York to Manchester and another ticket that says Manchester to York .
he used to rush here and there and there and here and back again all the time and of course he was always in such a rush that he never ever finished anything properly .
from here to there , from there to here from here to there funny things everywhere .
let's put mine on yours and put yours on mine .
could put the box on the lid instead of the lid on the box .
but I mean do you get your drink after you've had your biscuit or do you get your biscuit after you've had your drink .

...and reversals of attributes...
let's put the blue one on the guy with the red underpants and the red one on the guy with the blue underpants .
if it (h)as been a police car it becomes a racing car and if it (h)as been a racing car it becomes a police car .
and when you're talking about little crocodiles and big snakes (.) or little snakes and big crocodiles (.) they're jelly sweets you've had in the past .
oh [!] I got a yellow cup and a red plate and you got a red cup and a yellow [!] plate (.) .
look , they're very similar (.) look , this one is green with a little yellow , and this I yellow with a little green (.) interesting , huh ?
you mean it looks nicer than it smells [//] smells nicer than it looks .

...and reversals of case-like roles, of course...
Nominative vs. Vocative
Amanda that's xxx xxx that's Amanda .
xxx this is Stephanie Stephanie this is xxx by the way .
Nominative vs. Accusative
froggie keep an eye on mummy or mummy keep an eye on froggie ?
Floppy meet the screwdrivers screwdrivers meet the Floppy .
Nominative vs. Dative
do you give Daddy a big kiss or does Daddy give you a big kiss ?

...as well as some more complex swaps, like Nominative vs. Genitive vs. Locative...
I mean you go [//] girls go to boys parties and boys go to girls

...or proto-rhetoric questions...
I think you're stinky you are stinky are you stinky ?
wouldjou [: would you] couldjou [: could you] wouldjou [: would you] with a goat ?

...and other pieces of maternal wisdom.
I would not could not in a box I could not would not with a fox .
we're in house of bricks not the bricks of house .
two for tea , and tea for two .
I meant what I said and I said what I meant .

Current state: concerning the method
a naive rhetoric-figure-tagger (nRFT)
fast*, deterministic, transparent for inspection, partially parametrizable
form-oriented: looks for identical sequences within the signifier (no semantics involved)
generates false positives: manual check needed; can be useful for a CHIASMFP corpus
can speed up the manual annotation (semi-supervised scenario)
IMPORTANT: the schema can be used not only to detect, but also to GENERATE
* and super-fast if You store Your Big Data on a RAMdisk or at least on an SSD disk cache

Current state: concerning the results
English motherese utterances tend to abound with protochiastic structures
many functions: playful reversal of repetition, reversal of spatial direction, reversal of list, lapsus linguae correction, positional swap, attribute swap, functional (case) swap ...
all matched by a single one-liner !
what we are dealing with here is a whole ecosystem of diverse structures
the indicated prominence of the verb "put" as a middle term is consistent with the theories of Piaget and Tomasello
a triplicated couple A1 B1 A2 B2 A3 B3 always contains an A1 B1 B2 A3 implicit antimetabole

Future directions
Invitation to explore:
not only intralocutory (i.e. within 1 utterance) chiasms, but also translocutory ones (within multiple successive utterances)
relations to variation sets and Winograd schemata
multi-lingual analysis (are these beasts universal ?)
ontogenetic relation to other figures like the rhetoric question or even the metaphor (METAPHOROS = "carry over")
informational content of chiasms (known components + unknown order = maximal amount of new info ?)
neurocognitive aspects of chiasm processing (focus upon the cyclical referential closure between the initial and terminal token of the sequence)
neurorhetoric hypothesis: look for a P600-like evoked potential following the exposure to a chiasmus
non-linguistic chiasmata (musical, visual, spatial, anatomical, social, moral, emotional, sexual, spiritual etc.)

Conclusion
Starting discussion with conclusion often concludes the discussion...
Ergo, no ultimate conclusion without juicy discussion.
daniel@wizzion.com thanks Thee for Thy attention Reproducible Identification of Pragmatic Universalia in CHILDES Transcripts Daniel Devatman Hromada1,2,3 1 2 Université Paris Lumières - France Slovak University of Technology – Bratislava - Slovakia 3 Berlin University of the Arts – Berlin - Germany Abstract This article presents method and results of multiple analyses of the biggest publicly available corpus of language acquisition data : Child Language Data Exchange System. The methodological aim of this article is to present a means how science can be done in a highly positivist, empiric and reproducible manner consistent with the precepts of the “Open Science” movement. Thus, a handful of simple one-liners pipelining standard GNU tools like “grep”, and “uniq” is presented - which, when applied on myriads of transcripts contained in the corpus – can potentially pave a path towards identification of statistically significant phenomena. Relative frequencies of occurrence are analyzed along age and language axes in order to help to identify certain concrete, pragmatic universalia marking different stages of linguistic ontogeny in human children. One can thus observe significant culture-agnostic decrease of laughing in child-produced speech and child-directed indo-european “motherese” occurrent between 1st and 2nd year of age; maternal increase in production of pronoun denoting 2nd person singular “you”; increase of usage of 1st person singular “I” in utterances produced by children around 3rd years of age and marked decrease of the same which takes place around 6 years of age. Other significant correlations both intra-cultural between English mothers and children, as well as inter-cultural - are pointed down always accompanied with thorough descriptions methodology immediately reproducible on an average computer. 1. Introduction Reproducibility is one of the hallmark principles of occidental science. Being based upon the philosophy of ancient greeks who were fully aware that only the knowlede of that, which repeats itself in many instances, can lead to generic and transtemporal ἐπίσταμαι, the western scientific method necessarily considers reproducibility as its main condition sine qua non. In words of the foremost figure of modern epistemology, "non-reproducible single occurrences are of no significance to science" (Popper, 1992). Hence the primary, epistemological, objective of this article is to show how anyone willing to do so can perform reproducible analyses and experiments regarding the phenomena traditionally falling into the scope of corpus, computational and developmental linguistics. This objective is to be quite naturally attained if ever three precepts are stringently followed : • use publicly available data • analyse the data with simple, specific yet powerful tools which are well-known to widest possible public • faithfully protocol the exact procedure of usage of these tools In more concrete terms, we promote the idea that - in regards to analysis of statistical textual data - core GNU (Stallman, 1985) utils and commands as well as basic operators and core JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles 2 DANIEL DEVATMAN HROMADA functions of open source langages like PERL (Wall, 1990) or R (Team, 2013) indeed offer such "simple, specific yet powerful tools well-known to widest possible public". 
When it comes to the precept " faithfully protocol the usage of these tools ", it shall be implemented - in this article and potentially beyond – in a following manner : every simple transformation of data is to be completely and exhaustively described in a footnote which accompanies the description of the transformation. By " simple ", we mean such a transformation which can be described as a simple standard UNIX shell 1 one-liner pipelining combining together core commands like " grep ", " uniq " or " sort ". In case of more complex transformations, the complete source code of program is always to be furnished either in publications's appendix or at least as an URL reference. To assure the highest possible reproducibility of the experiment, the snippet should not call any modules and libraries external to language's core distribution (e.g. no CPAN resp. CRAN). The most important thing, however, is not to forget that the protocol is to be complete, exhaustive and unambigous. That is, .history of all steps is to be described in the form which is immediately executable on a standard GNU-positive machine. All means all : from the very fact of downloading2 the corpus from a publicly available source to the very act of plotting the legend on a figure which is then disseminated among scientific communities. Given that these precepts are followed and under the conditions that • the analysis is fully deterministic (i.e. does not involve any source of stochasticity) • the source corpus has not changed in the meanwhile it can be expected that the same analysis shall bring the same results no matter whether it is executed in other folder of the same computer (e.g. reproducibility across directories) ; executed on different computers (e.g. reproducibility across experimental apparatus) and|or executed by different experimentator (e.g. experimentator-independent reproducibility). 2. Corpus & Method Child Language Data Exchange System (CHILDES) undoubtably belongs among most fascinating language-related corpora. Established by (MacWhinney and Snow, 1985) more than 30-years ago and including transcripts dating back to 1960s, CHILDES does not cease to be the biggest public repository of child language acquisition and development data. Thus, asides huge volumes of audio and video recordings of verbal interactions with children, CHILDES also contains more than thirty thousand distinct transcripts. Transcript themselves are encoded in UTF-8 compliant plaintext .CHA files. These files follow a CHAT format specified in (MacWhinney, 2012). Every transcript contains a header describing specificities facts concerning the transcribed scenario – e.g. the age of a child, identities of participants (lines beginning with *CHI denote utterances produced by children; lines beginning with *MOT denote utterances produced by their mothers). Unfortunately, different linguists have followed the CHAT manual in a different manner. For example, some include the timestamp information into their corpus and some not. Some mark the repetition by special tokens like [x 2] (for duplication) or [x 3] (for triplication) and some $ echo 'All footnote-descriptions of shell one-liners begin with the sign $ and all footnote-descriptions of R commands begin with sign >.' 1 It is highly recommended to use standard utilities like "wget " or "curl " for that purpose. 
2 JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles [REPRODUCIBLE IDENTIFICATION OF PRAGMATIC UNIVERSALIA IN CHILDES TRANSCRIPTS] 3 transcribe the utterance as such, without using such tokens. And yet another set of differences necessarily originates in transcriber's own perception and habits. For example: while the token “mama” is occurrent in 1405 child utterances contained in English sections of the corpus3, some other English transcribers (e.g. Haggerty or Suppes) apparently prefered to transcribe the mother-directed vocative as “mamma” - this occurs in 126 distinct utterances. Be it as it may, the CHILDES corpus is already so huge that one may except that a well constituted and unbiased quantitative analysis could potentially allow the discovery of phenomena robust to any surface perturbations (e.g. differences in habits and styles of different investigators etc.). In other terms, if every transcript is understood as a result of a distinct act of sampling, then it can be expected that the statistical aggregation of such a huge amount of distinct samples (> 30000 distinct transcripts) could let to situation where the noise cancels itself out and statistically significant phenomena emerge. And individual CHILDES transcripts are indeed distinct. Not only because dozens, if not hundreds researchers and investigators of at least three or four generations had already directly participated on constitution of the corpus. Not only because majority of transcripts were in one way or another related to a specific research project with a goal unrelated to goals of other projects. But also because investigators themselves, as well as the investigated subjects (e.g. children), often stem from huge variety of distinct cultural backgrounds. More concretely: 26 languages are included in the corpus, covering practically majority of main terran language strata (i.e. indo-european languages, asian languages, semitic, altaic and ugrofinic languages etc.). This allows for trans-cultural analysis and such shall indeed be all analysis presented in the section 4. 2.1 Metrics Results can be mutually compared and communicated only if they are expressed in common units. In case of all experiments presented in this article, the relative frequency - interpreted as the probability of occurrence - of pattern X is such a unit. This is equivalent to absolute frequency of occurrence of FX normalized by the total number of utterances, i.e. PX = FX / Nutterances Ideally, for every month mentioned in the CHILDES corpus should correspond one P X value. To understand our approach more clearly, imagine, for example, in case of hypothethic language whose speakers utter 100 utterances each month since their birth until their tenth birthday. If such speakers utter the token " dog " twenty times every month, than the value of all 120 (i.e. 10 years * 12 months) datapoints describing the time series for this particular token would be constantly equal to 100/20 = 20% = 0.2. It is principially due to such trivial nature of the calculus hereby presented that the core datamining procedures can be performed directly on the BASH command-line. 
3.2 Preprocessing Four hundred and sixty-seven megabytes of data compressed in 983 zip files are obtained after the corpus has been downloaded from its original source4 or from a mirror site which 3 $ grep "mama" child/*Eng* |wc -l; grep "mamma" child/*Eng* |wc -l 4 $ wget -P CHILDES -e robots=off --no-parent --accept '.zip' -r http://childes.psy.cmu.edu/data/ JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles 4 DANIEL DEVATMAN HROMADA represents state of CHILDES as of February 6th 20165. After these files are recursively decompressed6, the CHILDES arborescent structure is flattened so that all .CHA files are contained within one sole directory7. A following one-liner subsequently “peeks into” each .CHA file, retrieves child's age from it and puts this information into files' name8. Utterances containing only xxx and www tokens – which, according to CHILDES manual denote “unintelligible words with an unclear phonetic shape” resp. “untranscribed material” are removed from all child and mother transcripts 9. Next step is executed only to speed-up following pattern extraction processes: child utterances are funnelled into simplified transcripts stored in “CHI” subdirectory and maternal utterances are funnelled into “MOT” subdirectory 10 . Translocutory information is thus lost but this is allowed for the purpose of this article in which we shall focus solely on relative frequencies of certain tokens and not on more complex discourse units. All this yields 5833656 lines (e.g. utterances) contained in 29180 non-empty simplified transcripts stored in “child” directory and 3798005 lines contained in 13590 non-empty simplified transcripts stored in the “mother” directory. Note that metadata like age (years and months), language group, language and CHILDES investigator's identity are stored directly in the simplified transcript's filename. Workbench common to all following analyses can be thus considered as ready. 3. Analyses 3.1. First Analysis – Laughing It has been recently indicated that English mothers interacting with children younger than 16 months tend to laugh significantly more often than mothers which interact with children between 16-31 months of age (p.222, Hromada, 2015). Our 1st analysis will use CHILDES to address this hypothesis from a trans-cultural perspective. It may be surprising to use a dataset, which is essentially a linguistic corpus for, a purpose of study of such a non-verbal means of communication as laughing definitely is. But the very CHAT manual (p.62, MacWhinney, 2012) explicitely specifies the &=laughs marker as a most common standardized spelling denoting a specific extralinguistic event. Unfortunately, within the totality of CHILDES corpus, the marker itself &=laughs is not the only standardized form denoting the phenomenon and some authors prefered to use markers 5 $ wget -P CHILDES -e robots=off --no-parent --accept '.zip' -r WILL-BE-GIVEN-IN-CAMERA-READY-VERSION 6 $ find CHILDES/data -name "*.zip" | while read filename; do unzip -o -d "`dirname "$filename"`" "$filename"; done 7 $ mkdir CHILDES_flat; find CHILDES/data -type f |perl -n -e 'chomp; if (/\.cha/) {$f=$_; s/\//-/g; s/\.-data-//g; `cp $f ./CHILDES_flat/$_`;}'; cd CHILDES_flat; 8 $ mkdir aged; grep -P '\|\d;\d' *| grep Child | perl -n -e 'chomp; `cp $1 aged/$2-$3-$1` if /^(.*?):.*0?(\d+);0?(\d+)/;' ; rm *.cha 9 $ perl -ni -e 'print if $_!~/^\*(MOT|CHI):\t(xxx|www) ?\./' aged/* 10 $ mkdir CHI; cp aged/* CHI; sed -i '/\*CHI/! 
d' CHI/*; mkdir MOT; cp aged/* MOT; sed -i '/\*MOT/! d' MOT/*; JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles [REPRODUCIBLE IDENTIFICATION OF PRAGMATIC UNIVERSALIA IN CHILDES TRANSCRIPTS] 5 like [=! laughing]. Hence, for a purpose of our 1st analysis, we have simply used the token laugh as the one whose frequencies of occurrence we have decided to measure. Three indo-european (english, french and farsi) and two non-indo-european languages (japanese and chinese) were chosen in order to address the developmental trajectory of laughing from a trans-cultural perspective. For each among these langages, a target investigator was identified as the one who most frequently used the marker laugh in his transcripts of motherese11. Corpus subsections " Farsi-Family ", "French-MOR-York ", " Japanese-MiiPro " and " Chinese-Beijing " were thus identified as such target subsections. All English-language transcripts (i.e. such files whose filename contains the token " Eng ") were also taken into account. The core of the procedure is as follows: total amount of utterances is obtained, for each month and each target subsection of the corpus, by a one-liner 12 which redirects its output into a file whose every row contains three space-separated columns: first column denotes the denotes the value of Nutterances and second and third column denote the year resp. month. The procedure is to be repeated ten times alltogether, five for each target corpus subsections multiplied by two possible locutor values of the locutor variable (MOT13 or CHI14). Follow ten executions of a command sequence which generate 10 files containing absolute frequencies of occurrence of the token laugh within five different corpus sections – and again for both MOT15 and CHI16 locutors - which are aggregated according to child's age in the moment when laughing was noted down by the CHILDES investigator. And that's it: all result-containing files can now serve furnish input datasets for the R code which produces a plot displayed on adjacent figure. 11 Probability that laughing accompanies or substitutes an utterance produced by, or directed to, a child of specific age. $ grep laugh MOT/*French* | grep -o -P '\-French\-.+\-' | sort | uniq -c ; grep laugh MOT/*Farsi* | grep -o -P '\-Farsi\-.+\-' | sort | uniq -c ; grep laugh MOT/*Japanese* | grep -o -P '\-Japanese\-.+\-' | sort | uniq -c ; grep laugh MOT/*Chinese* | grep -o -P '\-Chinese\-.+\-' | sort | uniq -c ; 12 $wc -l MOT/*Farsi-Family* |perl -e 'while (<>) { s/MOT\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)- (\d+)/; print "$h{$_} $1 $2\n";}' >exp1.MOT.Farsi-Family.N 13 $wc -l MOT/*Eng* |perl -e 'while (<>) { s/MOT\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.MOT.Eng.N 14 $wc -l CHI/*Eng* |perl -e 'while (<>) { s/CHI\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.CHI.Eng.N 15 $grep laugh MOT/*Eng* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp1.MOT.Eng.F 16 $grep laugh CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp1.CHI.Eng.F JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles 6 DANIEL DEVATMAN HROMADA Potentially the most salient phenomenon is a marked decrease in production of laughs which occur between birth and second year of age. 
Potentially the most salient phenomenon is a marked decrease in the production of laughs which occurs between birth and the second year of age. This could potentially be explained in terms of a gradual switch from non-linguistic means of communication towards more verbal interactions. However, in the case of the child-directed speech of Japanese motherese, the relative frequency of laughing seems to increase during the same period, and in the case of Chinese the decline is much less marked than in the case of the Indo-European languages. This may suggest an intercultural difference – a hypothesis which is further corroborated by the fact that it is only in the case of the Indo-European languages that the "dotted" lines cross the "solid" lines. Id est, little English-, French- and Farsi-speaking children tend to laugh more often than their mothers, but older children seem to laugh less frequently than their mothers. This discrepancy notwithstanding, the relative frequencies of the CHI time series significantly correlate with the MOT time series both in English (Pearson's correlation coefficient 0.933, t = 7.36, df = 8, p-value = 7.886e-05) and in Farsi (corr. coef. 0.972, t = 5.9224, df = 2, p-value = 0.02735). In French, the correlation is quite close to the significance threshold (t = 4.1692, df = 2, p-value = 0.053, cor. coef. = 0.947) when the data is aggregated into year-sized packages, but is insignificant (t = -1.1598, df = 27, p-value = 0.2563) when the time series are correlated at monthly granularity. No statistically significant correlation between the child-produced and mother-produced laugh time series has been observed in the case of Japanese or Chinese.

3.2. Second Analysis – 2nd person singular

It has also been indicated that English mothers interacting with their children tend to use the pronoun of the 2nd person singular, "you", much more frequently than is the case in standard linguistic communication (p. 218, Hromada, 2015). Similarly to our 1st analysis, our 2nd analysis uses CHILDES to address this hypothesis from a trans-cultural perspective. The procedure is thus very similar to the one already presented, with one major difference: we do not focus on the assessment of occurrences of one standard marker (e.g. "laugh") which is present in different corpus sections; rather, we look in each specific subcorpus for a specific Perl Compatible Regular Expression (PCRE_2p.sg) which matches nominative forms of the 2nd person singular in the language of the subcorpus under study. The following table lists such PCREs for seven languages.

Language   PCRE_2p.sg
English    [ \t]you[' ]
French     [\t ]t(u |oi |')
Farsi      [\t ]to
Polish     [\t ]ty
Chinese    (你|ni3)
Estonian   [\t ]s(in)?a
Hebrew     [\t ]ata?

Usage of these regexes within one-liners employing the case-insensitive "grep" allows us to obtain distributions of relative frequencies independently for MOT17 and CHI18 utterances. The command sequence yielding the distributions of N_utterances19 is practically the same as in the first analysis (cf. footnotes 13 & 14), the only difference being that this time we do not focus on subcorpora representing transcripts made by specific target investigators, but rather process much bigger datasets containing all transcripts representing the language under study.

17 $ grep -i -P "[\t ]you[' ]" MOT/*Eng* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.MOT.Eng.F
18 $ grep -i -P "[\t ]you[' ]" CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.CHI.Eng.F
19 $ wc -l CHI/*Farsi* |perl -e 'while (<>){s/CHI\///;/(\d+) (\d+-\d+)-/;$h{$2}+=$1;} for (sort keys %h){/(\d+)-(\d+)/;print "$h{$_} $1 $2\n";}' >exp2.CHI.Farsi.N
20 > cor.test(aggregated_mot_lang1[,6]/aggregated_mot_lang1[,3], aggregated_chi_lang1[,6]/aggregated_chi_lang1[,3], method="kendall")
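Footnotes 17 and 18 show the English case; one short Perl loop over the language/PCRE pairs of the table can generate all frequency files at once. The following is a hedged sketch, not the published code: only three languages of the table are spelled out (the others would be added analogously), and the directory and file-naming conventions are those of section 2.2.

use strict;
use warnings;

# language => PCRE pairs taken from the table above
my %pcre_2sg = (
    Eng    => qr/[ \t]you[' ]/i,
    French => qr/[\t ]t(u |oi |')/i,
    Farsi  => qr/[\t ]to/i,
);

for my $lang (sort keys %pcre_2sg) {
    for my $locutor ('MOT', 'CHI') {
        my %count;
        for my $file (glob "$locutor/*$lang*") {
            # age metadata (years, months) is encoded in the filename
            my ($y, $m) = $file =~ /\Q$locutor\E\/(\d+)-(\d+)/ or next;
            open my $fh, '<', $file or die "cannot open $file: $!";
            while (<$fh>) { $count{"$y $m"}++ if /$pcre_2sg{$lang}/ }
            close $fh;
        }
        open my $out, '>', "exp2.$locutor.$lang.F" or die $!;
        print {$out} "$count{$_} $_\n" for sort keys %count;
        close $out;
    }
}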
The F_PCRE_2p.sg and N_utterances distributions are subsequently processed by R code which is, mutatis mutandis, identical to the R snippet used in analysis 1. This yields Figure 2.

A phenomenon common to all languages under study can be observed practically immediately. That is, on all six solid MOT lines one can observe, between the first and the fourth year of the child's age, a marked increase in the maternal usage of the 2nd person singular. Sometimes such an augmentation is less marked (as in French), sometimes it comes later (between the 2nd and 3rd year of age in the case of Farsi and Hebrew), but it always comes. And it always reaches its all-time high before the fifth year of age, after which the maternal usage of "you" tends to slowly converge back to its "normal" levels. Note also that in English motherese, "you" is used in approximately every fifth utterance. What is also striking in regard to the English language – which is definitely the biggest CHILDES subcorpus – is the quite significant correlation between the time series representing the usage of 2p.sg. by mothers and the time series representing the usage of 2p.sg. by the children themselves (Pearson's cor. coeff. = 0.768, t = 3.393, df = 8, p-value = 0.009451; Kendall's τ = 0.6, T = 36, p-value = 0.01667;20 Spearman's ϱ = 0.733, S = 44, p-value = 0.02117).

3.3. Third Analysis – 1st person singular

Our 3rd analysis is identical to the second; the only things which change are the PCRE patterns, which are this time supposed to match nominative forms of the pronouns denoting the 1st person singular. Id est the ego, the self-reference, the "I". The following table lists such PCREs for seven languages and their respective CHILDES subcorpora.

Language   PCRE_1p.sg
English    [ \t]I[' ]
French     [\t ](j(e |')|moi)
Farsi      [\t ]m[aæe]n
Polish     [\t ]ja
Chinese    (我|wo3)
Estonian   [\t ]m(in)?a
Hebrew     [\t ]ani

Everything else – from the extraction of the absolute frequencies of forms matched by the PCREs all the way to aggregating, normalizing and plotting – is, mutatis mutandis, identical to the 2nd analysis. This leads to the visualisation presented at the bottom of this page. An interesting phenomenon can be noticed: while in early infancy mothers of all language backgrounds use 1p.sg. much more frequently than children (probably because children are still in a pre-linguistic stage), the difference is swiftly and strongly counteracted. Hence, around three years of age, children of all21 cultures tend to produce 1p.sg. much more frequently than their mothers. But not only augmentations of use, also diminutions are of certain scientific interest. Hence, a steep decline in the use of 1p.sg. can be observed between the 6th and 7th year of age – that is, during the period when children enter school, which marks the offset of the ontogenetic stage which Piaget (1951) labeled "egocentric". Similarly to the 2nd analysis, a significant correlation between the time series representing the production of "I" by English-speaking mothers and the production of "I" by English-speaking children can be observed (Kendall's τ = 0.555, T = 35, p-value = 0.02861).

21 With the exception of the Polish language, for which we unfortunately lack motherese data from the 3rd birthday onwards.
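The correlation coefficients reported in this and the following paragraphs were obtained with R's cor.test (cf. footnote 20). For completeness, a self-contained Perl equivalent of the Pearson coefficient alone – a sketch which omits the t-statistic and p-value machinery that cor.test provides – might look as follows.

use strict;
use warnings;

# Pearson's correlation coefficient between two equally long series
sub pearson {
    my ($x, $y) = @_;                  # two array references
    my $n = @$x;
    die "series differ in length" unless $n == @$y && $n > 1;
    my ($sx, $sy) = (0, 0);
    $sx += $_ for @$x;
    $sy += $_ for @$y;
    my ($mean_x, $mean_y) = ($sx / $n, $sy / $n);
    my ($cov, $vx, $vy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my ($dx, $dy) = ($x->[$i] - $mean_x, $y->[$i] - $mean_y);
        $cov += $dx * $dy;
        $vx  += $dx ** 2;
        $vy  += $dy ** 2;
    }
    return $cov / sqrt($vx * $vy);
}

# toy usage on two short monthly series
printf "r = %.3f\n", pearson([1, 2, 3, 5], [2, 4, 5, 9]);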
What's more, the plot indicates a path towards the identification of statistically significant intercultural correlations. Thus, after filling the gap22 in the Chinese dataset related to the fact that CHILDES does not seem to contain transcripts of Chinese 8-year-olds, one observes a correlation23 between the time series of the relative frequencies of 1p.sg produced by French and Chinese children (Kendall's τ = 0.511, T = 29, p-value = 0.02474). Idem for English and French (Kendall's τ = 0.777, T = 32, p-value = 0.002425). And if one stays faithful to the canonic p<0.05 precept (Fisher, 1925) and opts for Spearman's rho or Pearson's coefficient rather than for Kendall's tau, then, for example, also for French and Polish (Pearson coef. = 0.837, t = 3.4219, df = 5, p-value = 0.0188; Kendall's τ = 0.619, T = 17, p-value = 0.06905; Spearman's ϱ = 0.785, S = 12, p-value = 0.04802) as well as for Polish and Hebrew (Pearson coef. = 0.759, t = 2.6117, df = 5, p-value = 0.04757; Kendall's τ = 0.619, T = 17, p-value = 0.06905; Spearman's ϱ = 0.786, S = 12, p-value = 0.0480224).

4. Discussion

It is a common practice in contemporary Corpus Linguistics in general, and in Natural Language Processing in particular, to focus fully on the formal and theoretical properties of one's model or analysis. Thus, the majority of publications in these domains limit themselves to the dissemination of the few core formulas behind the presented analysis plus the results which were obtained (F-scores etc.). In an atmosphere where sharing one's code with the community is more an exception than a rule, it is not surprising that the majority of publications disregard the concrete aspects of the implementation and execution of an analysis as unworthy of interest. Such an attitude can be excusable when one attacks a highly specific engineering problem. But in regard to analyses aiming at general knowledge – id est, when doing fundamental research or exploratory science – such an approach is to be discarded as inconsistent with the ideal of experimenter-independent reproducibility.

In this article, we have explained how cost-efficient (i.e. as free as open source software), reproducible and transparent science can be performed at the very border of corpus and developmental psycholinguistics. More concretely, in the footnotes of this article we have presented fewer than two dozen one-liners which pipeline and combine PCREs (Wall, 1990; Hromada, 2011) with core GNU utilities like "grep", "uniq", "wc" and "sort". Aside from this, a snippet of a few dozen lines of beginner-level, non-optimized R code is hereby published25 in order to furnish a complete description – i.e. from downloading the corpus from its publicly available source all the way to the final plots and correlation coefficients – of the three experiments performed. Common to these three experiments was a preprocessing phase which purified and repartitioned hundreds of megabytes of data contained in CHILDES.

22 > aggregated_chi_lang4[9,]=(aggregated_chi_lang4[7,]+aggregated_chi_lang4[8,])/2
23 > cor.test(aggregated_chi_lang2[,6]/aggregated_chi_lang2[,3], aggregated_chi_lang4[,6]/aggregated_chi_lang4[,3], method="kendall")
24 > cor.test(aggregated_chi_lang6[,6]/aggregated_chi_lang6[,3], aggregated_chi_lang5[,6]/aggregated_chi_lang5[,3], method="spearman")
25 http://wizzion.com/code/jadt2016/childes.R
The result of this phase were two directories: CHI, which contains the utterances produced by children, and MOT, which contains motherese utterances (cf. section 2.2). The principal motivation behind this repartitioning was the speed-up of any subsequent analysis. For example, the 3rd analysis – when executed on one sole core of a 3.2 GHz PC with 8 GB of RAM and with the CHILDES data stored on an SSD disk (a fairly standard configuration) – didn't last more than 15 seconds, all the way from matching the first regular expression on the first line of the first transcript to R's final plotting.

Mentioning regular expressions, we consider it important to reiterate that regexes like those implemented in Perl, the PCREs, seem to us to be much more than impressive yet weird character sequences that no neophyte can read. Unambiguously denoting what they should denote – i.e. a specific set of character sequences; a specific pattern, schema and form – PCREs are formalisms in their own right (Hromada, 2011). Idem for shell commands and Perl or R instructions: they too are unambiguous formalisms, and for the purposes of NLP they can turn out to be at least as worthy as other formalisms.

Formalisms, tools and methodology being thus defined by a concrete example, a question can be posed: "What should be the name of a discipline which implements such a method and uses such tools?" And given that what was done here used techniques common to textometry in order to address topics common to developmental psycholinguistics (Tomasello, 2009), an answer could potentially sound: "Textometric Psycholinguistics". It is only now – with the toolbox specified, the method reproducible, and the scope of interest of the discipline properly delimited – that a discussion about culture-independent anthropological constants occurring in adult-child verbal and pre-verbal interactions – id est a discussion about "linguistic universalia" and their meaning, a discussion among savants – can, hopefully, begin.

References

Fisher, Ronald Aylmer. (1925). Statistical Methods for Research Workers. Genesis Publishing Pvt Ltd.
MacWhinney, Brian & Snow, Catherine. (1985). The child language data exchange system. Journal of Child Language, 12(02), 271-295.
MacWhinney, Brian. (2012). The CHILDES Project: Tools for Analyzing Talk – Electronic Edition. Part 1: The CHAT Transcription Format.
Piaget, Jean. (1951). Principal Factors Determining Intellectual Evolution from Childhood to Adult Life. Columbia University Press.
Popper, Karl. (1992). The Logic of Scientific Discovery. Routledge, London.
Hromada, Daniel Devatman. (2011). Initial experiments with multilingual extraction of rhetoric figures by means of PERL-compatible regular expressions. RANLP Student Research Workshop, 85-90.
Hromada, Daniel Devatman. (2015). Conceptual Foundations: Intramental Evolution & Ontogeny of Toddlerese. In press.
Stallman, Richard. (1985). The GNU Manifesto.
R Core Team. (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Tomasello, Michael. (2009). Constructing a Language: A Usage-Based Theory of Language Acquisition.
Harvard University Press.
Wall, Larry. (1990). PERL: Practical Extraction and Report Language.

CAN EVOLUTIONARY COMPUTATION HELP US TO CRIB THE VOYNICH MANUSCRIPT?

by Daniel Devatman Hromada

0.1 abstract

The Voynich Manuscript is a corpus of unknown origin, written down in a unique graphemic system and potentially representing the phonic values of an unknown, potentially even extinct, language. Departing from the postulate that the manuscript is not a hoax but rather encodes authentic contents, our article presents an evolutionary algorithm which aims to find the most optimal mapping between Voynichese glyphs and candidate phonemic values. The core component of the decoding algorithm is the maximization of a fitness function which aims to find the most optimal set of substitution rules allowing to transcribe the part of the manuscript which we call the Calendar into lists of feminine names. This leads to sets of character substitution rules which allow us to consistently transcribe dozens among the three hundred Calendar tokens into feminine names: a result far surpassing both "popular" as well as "state of the art" tentatives to crack the manuscript. What's more, by using name lists stemming from different languages as potential cribs, our "adaptive" method can also be useful in identifying the language in which the manuscript is written. As far as we can currently tell, the results of our experiments indicate that the Calendar part of the manuscript contains names from Balto-Slavic, Balkanic or Hebrew language strata. Two further indications are also given. Primo, the highest fitness values were obtained when the crib list contains names with a specific infix at the token's penultimate position, as is the case, for example, for Slavic feminine diminutives (i.e. names ending with -ka and not -a); in the most successful scenario, 240 characters contained in 35 distinct Voynichese tokens were successfully transcribed. Secundo, in the case of a crib stemming from the Hebrew language, the whole adaptation process converges to significantly better fitness values when transcribing Voynichese tokens whose order of individual characters has been reversed, and when lists of feminine and not masculine names are used as the crib.

0.2 introduction

The Voynich Manuscript (VM) undoubtedly counts among the most famous unresolved enigmas of the medieval period. On approximately 240 vellum pages, currently stored as manuscript (MS) 408 in Yale University's Beinecke Rare Book and Manuscript Library, VM contains many images apparently related to botanics, astronomy (or astrology) and bathing. Written aside, above and below these images are bulks of sequences of glyphs. All this is certain. Also certain seems to be the fact that in 1912 VM was re-discovered by the Polish book-dealer Wilfrid Voynich in a large palace near Rome called Villa Mondragone. Alongside the VM itself, Voynich also found correspondence – dating from 1666 – between the Collegio Romano scholar Athanasius Kircher and the contemporary rector of Charles University in Prague, Johannes Marcus Marci. Other attested documents – e.g. a letter from 1639 sent to Kircher by the Prague alchemist Georg Baresch – also indicate that during the first half of the 17th century VM was to be found in Prague. The very same correspondence also indicates that VM was acquired by the famous patron of arts, sciences and alchemy, Emperor Rudolf II.1

1 Savants who passed through Rudolf's court included Johannes Kepler, Tycho Brahe and Giordano Bruno. The last is known to have sold a certain book to the emperor for 600 ducats.
Aside from this, one more fact can be stated with certainty: the vellum of VM was carbon-dated to the early 15th century (Hodgins, 2014).

0.2.1 pre-digital tentatives

Already during the pre-informatic era of the first half of the 20th century, dozens, if not hundreds, of men of distinction invested a non-negligible part of their lives into tentatives to decipher the "Voynichese" script. Though highly popular in their time, many such tentatives – like that of Newbold, who claimed to "prove" that VM was encoded by Roger Bacon by means of a 6-step anagrammatic cipher (Newbold, 1928b), or that of Strong (Strong, 1945), who claimed VM to be a 16th-century equivalent of the Kinsey Report – may seem, when looked upon through the prism of computer science, somewhat irrational.2 C.f. (d'Imperio, 1978) for an overview of other 20th-century "manual" tentatives which resulted in VM-decipherment claims. After describing these tentatives, and after presenting an informationally very rich introduction to both VM and its historical context, d'Imperio adopts a sceptical stance towards all scholars who associated VM's origin with the personage of Roger Bacon.3 Sceptic though she was, d'Imperio did not a priori disqualify the set of hypotheses according to which the language in which VM was ultimately written was Latin or medieval English. And such, indeed, was the majority of the hypotheses which gained prominence all along the 20th century.4

2 Note, for example, Strong's "translation" of one VM passage: "When the contents of the veins rip, the child comes slyly from the mother issuing with leg-stance skewed and bent while the arms, bend at the elbow, are knotted like the legs of a crawfish." (Strong, 1945) Note also that such a translation was the product of a man who was "a highly respected medical scientist in the field of cancer research at Yale University" (d'Imperio, 1978).
3 "I feel, in sum, that Bacon was not a man who would have produced a work such as the Voynich manuscript...I can far more easily imagine a small society perhaps in Germany or Eastern Europe" (d'Imperio, 1978, 51).
4 Note that such a pro-English and pro-Latin bias can easily be explained not by the properties of VM itself, but by the simple fact that the first batches of VM's copies were primarily distributed and popularized among Anglo-Saxon scholars of medieval philosophy, classical philology or occidental history.

0.2.2 post-digital tentatives

The first tentatives to use machines to crack VM date back to the prehistory of the informatic era. Thus, already during the 2nd World War, the cryptologist William F. Friedman invited his colleagues to form an "extracurricular" VM study group – programming IBM computers for the sorting and tabulation of VM data was one among its tasks. Two decades later – already in the position of the first chief cryptologist of the nascent National Security Agency – Friedman formed a 2nd study group. Again without ultimate success. One member of Friedman's 2nd study group was Prescott Currier, whose computer-driven analysis led him to the conclusion that VM in fact encodes two "statistically distinct" (Currier, 1970) languages. What's more, Currier seems to have been the first scholar who facilitated the exchange and processing of the Voynich manuscript by proposing a transliteration5 of Voynichese glyphs into standard ASCII characters. This was the predecessor of the European Voynich Alphabet (EVA) (Landini and Zandbergen, 1998), which has become a de facto standard when it comes to mapping VM glyphs onto a set of discrete symbols. The canonization of EVA, combined with the dissemination of VM's copies through the Internet, has allowed more and more researchers to transcribe the sequences of glyphs of the manuscript into ASCII EVA sequences.

5 In this article we distinguish transliteration and transcription. Transliteration is a bijective mapping from one graphemic system into another (e.g. VM glyphs are transliterated into ASCII's EVA subset). Transcription is a potentially non-bijective mapping between symbols on one side and sound- or meaning-carrying units on the other.
It is thanks to the laborious transcription work of people like Rene Zandbergen, Jorge Stolfi or Takeshi Takahashi that the verification or falsification of VM-related hypotheses can nowadays be automatized to a great extent. For example, Stolfi's analyses of the frequencies of occurrence of different characters in different contexts indicated that the majority of Voynichese words seem to implement a sort of tripartite crust-core-mantle (or prefix, infix, suffix) morphology. A later study indicated that the presence of such morphological regularities could be explained as the output of a mechanical device called the Cardan grille (Rugg, 2004). The "hoax hypothesis" is also supported by the study (Schinner, 2007) which suggested that "the text has been generated by a stochastic process rather than by encoding or encryption of language". Pointing in a similar direction, that analysis also concludes that "glyph groups in the VM are not used as words". On the other hand, a methodology based on "first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series", presented in (Amancio et al., 2013), led its authors to the conclusion that VM "is mostly compatible with natural languages and incompatible with random texts". Simply stated, the way diverse "words" are distributed among the different sections of VM indicates that these words carry certain semantics. And this indicates that VM, or at least certain parts of it, is not a hoax.

0.2.3 our position

The results of (Amancio et al., 2013) made us adopt the conjecture "VM is not a hoax" as a sort of fundamental hypothesis accepted a priori. Surely, as far as we stand, it cannot be excluded that VM is the work of an abnormal person, of somebody who suffered severe schizophrenia or was chronically obsessed by internal glossolalia (Kennedy and Churchill, 2005). Nor can it be excluded that the manuscript does not encode full-fledged utterances but rather lists of indices, sequences or proper names of spirits-which-are-to-be-summoned, or sutra-like formulas compressed in a sort of private pidgin or sociolect. But given VM's ingenuity, given the effort which the author had to invest into the conception of the manuscript, and given a sort of "elegant simplicity" which seems to permeate it, we have felt, since our very first contact with the manuscript, a sort of obligation to interpret its contents as meaningful. That is, as having the capability of denoting objects outside of the manuscript itself.
As being endowed with the faculty of reference to the world (Frege, 1994) which we, 21st-century interpreters, still inhabit hundreds of years after VM's most plausible date of conception. It is with such a bias in mind that our attention was focused upon a certain regularity which we later decided to call "the primary mapping".

0.2.4 primary mapping

[Figure 1: Drawing from folio f84r containing the primary mapping.]

A condition sine qua non of any act of decipherment is the discovery of rules which allow to transform an initially meaningless cipher into meaningful information. In the most trivial case, such decipherment is facilitated by a sort of Rosetta Stone (Champollion, 1822) which the decipherer already has at his disposition. Since both the ciphertext as well as the plaintext (also called "the crib") are explicitly given by the Rosetta Stone, the discovery of the mapping between the two is usually quite straightforward. The problem with VM is, of course, that it seems not to contain any explicit key which could help us to decipher its glyphs. Thus, the only source of information which could potentially help us to establish reference between VM's glyphs and the external world are VM's drawings. One such drawing, present atop folio f84r, is shown in Figure 1.

Figure 1 displays twelve women bathing in eight compartments of a pool. Bathing women are a very common motive in VM, and there seems to be nothing peculiar about them. The fact that word-like sequences are written above the heads of these women is also trivial. One can, however, observe one regularity which seems to be interesting. That is, in case two women bathe in the same compartment, the compartment contains two word-like sequences; if one woman bathes in the compartment, there is only one word-like sequence written above her head. One figure – one word; two figures – two words. This principle is stringently followed and can be seen on other folios as well. What is more, the words themselves are sometimes similar, but they are not the same. Such trivial observations lead to a trivial conclusion: these word-like sequences are labels. And since these labels are juxtaposed to feminine figures, it seems reasonable to postulate that they are, in fact, feminine names. This is the primary mapping.

0.2.5 three conjectures

The method described in the following sections can be considered valid only under the assumption that the following conjectures hold:
1. "the primary mapping conjecture": Voynichese words beside feminine figures are feminine names
2. "diachronic stability of proper names": proper names are less prone to diachronic change than other language units
3. "occam's razor": instead of containing a sophisticated esoteric cipher, VM simply transmits a text written in an unknown script

Further reasons why we consider "the primary mapping conjecture" valid shall be given alongside our discussion of "the Calendar". When it comes to the conjecture postulating the "diachronic stability of proper names", we could potentially refer to certain cognitive peculiarities of how the human mind tends to treat proper names (Imai and Haryu, 2001).
Or we could direct the reader's attention to the fact that, for practically every human speaker, one's own name undoubtedly belongs among the most frequent and most important tokens heard or uttered during one's whole life – this can result in a sort of stability against linguistic change and allow the name to cross the centuries with a higher probability than words of lesser importance and frequency. But instead of pursuing the debate in this direction, let's just point out that the successful decoding of the Mycenaean Linear B script (Ventris and Chadwick, 1953) would have been much more difficult if certain toponyms like Amnisos, Knossos or Pylos had not succeeded in carrying their phonetic skeleton through the aeons of time. Last but not least, the "occam's razor conjecture" simply makes explicit the belief that a reasonable scientist should not opt to explain VM in terms of anagrams and opaque hermeneutic procedures if similar – or even more plausible – results can be attained when approaching VM as if it were a simple substitution cipher.

0.3 method

The core of our method is an optimization algorithm which looks for a candidate transcription alphabet A_x which, when applied upon the list of word types occurring in VM's Calendar section, yields an output list whose members should ideally be present in another list, called the Crib. The optimization is done by an evolutionary strategy: an individual chromosome encodes a candidate transcription alphabet, and the fitness function is given as the sum of the lengths of all tokens which were successfully transcribed from the Calendar to a specified Crib.

0.3.1 calendar

Six among the twelve words present in Figure 1 occur only on folio f84r. Six others occur on other folios as well, and five of these six words occur also as labels near feminine figures displayed on the 12 folios of the section commonly known as the "Zodiac". It is like this that our attention shifted from the limited corpus of the "primary mapping" towards the more exhaustive corpus contained in the Zodiac. Every page of the Zodiac displays multiple concentric circles filled with feminine figures. The attributes of these figures differ – some hold torches, some do not; some are bathing, some are not – but one pattern is fairly regular: beside every woman there is a star, and beside every star there is a word. While some authors postulate that these words are names of stars or names of days, we postulate that these words are simply feminine names.6 From Takahashi's transliterations of the twelve folios of the Zodiac we extract 290 tokens which instantiate 264 distinct word types. To avoid possible terminological confusion, we shall denote this list of 264 labels7 with the term Calendar. Hence, Zodiac is the term referring to folios f70v2–f73v, while the Calendar is simply a list of 264 labels. The total length of these 264 labels is 2045 characters, chosen from a 19-symbol (|A_cipher| = 19) subset of the EVA transliteration alphabet.

6 It cannot be excluded, however, that they are all of this at once. Note, for example, that in many central European countries it is still a fairly common practice to attribute specific names to specific days of the year, i.e. "meniny".
7 Available at http://wizzion.com/thesis/simulation0/calendar.uniq

0.3.2 cribbing

Cribbing is a method by means of which the hypothesis that the Calendar contains lists of feminine names can potentially lead to the decipherment of the manuscript. For if the Calendar is indeed such a list, then one can use lists of existing and attested feminine names as hypothetical target "cribs". In cryptanalytic terms, the intuition that the Calendar contains feminine names makes it possible to perform a sort of known-plaintext attack (KPA). We say "a sort of" because, in the case of VM, the "cribs" upon which we shall aim to map the Calendar are not known with 100% certainty.
Hence, it is perhaps more reasonable to understand the cribbing procedure as a plausible-plaintext attack (PPA). This being said, we label as "cribbing" a symbol-substituting procedure P_cribbing which replaces symbols contained in the cipher (i.e. in the Calendar) with symbols contained in the plaintext. Hence, not only the cipher but also the plaintext is an input of the cribbing procedure.

Every act of execution of P_cribbing can be followed by an act of evaluation of the usefulness of P_cribbing with regard to its inputs. The ideal procedure would result in a perfect match between the rewritten cipher and the plaintext, i.e. P_cribbing(cipher) == plaintext. On the other hand, a completely failed P_cribbing results in two corpora which have nothing in common. And between the two extremes of this spectrum, between "the ideal" and "the completely failed", one can place multitudes of other procedures, some closer to the ideal than others. This makes room for optimization.

0.3.3 optimization

All experiments described in the next section of this article implement an evolutionary computation algorithm strongly inspired by the architecture of the canonical genetic algorithm (CGA, P+46) (Holland, 1992; Rudolph, 1994). Hence, the initial population is randomly generated, and fitness-proportionate (i.e. "roulette wheel", P+42) selection is used as the main selection operator. But contrary to CGAs, our optimization technique does not implement a classical single-point crossover but rather a sort of "discrete crossover", which takes place only in case the parent individuals have different alleles of a specific gene (Listing 1).

Listing 1: Discrete cross-over

# discrete crossover: genes on which both parents agree are inherited
# directly; where the parents carry different alleles, the child
# inherits either allele with equal probability
my $child_genome;
my $i = 0;
for (@mother_genome) {
    if ($_ ne $father_genome[$i]) {
        rand() > 0.5 ? ($child_genome .= $mother_genome[$i])
                     : ($child_genome .= $father_genome[$i]);
    } else {
        $child_genome .= $mother_genome[$i];
    }
    $i++;
}

Another reason why our solution can be considered more similar to evolutionary strategies (Rechenberg, 1971) than to CGAs is the fact that it does not encode individuals as binary vectors (P+48). Instead, every individual represents a candidate monoalphabetic substitution cipher whose application could, ideally, transform the Calendar into a crib; the corresponding fitness function is shown in Listing 2.

Listing 2: Cipher2Dictionary adaptation fitness function

# fitness function: transcribe the whole Calendar with the candidate
# alphabet encoded in $individual, then sum up the character lengths of
# all distinct labels which, once transcribed, match a name in %crib;
# $i indexes the individual within the population (outer loop not shown)
my $text = $calendar;
my $old  = "acdefghiklmnopqrsty";
my %translit;
@translit{split //, $old} = split //, $individual;
$text =~ s/(.)/defined($translit{$1}) ? $translit{$1} : $1/eg;   # core transcription of Calendar content
my %matched;
for (split /\n/, $text) {
    my $token = $_;
    if (exists $crib{$token}) {
        # map the matched name back to its EVA form before storing it
        my %antitranslit;
        @antitranslit{split //, $individual} = split //, $old;
        $token =~ s/(.)/defined($antitranslit{$1}) ? $antitranslit{$1} : $1/eg;
        my $t = $token;
        $matched{$t} = 1;
    }
}
for (keys %matched) { $Fitness[$i] += length $_; }
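Listings 1 and 2 reproduce the crossover operator and the fitness function; the surrounding generational machinery is not printed in the article. What follows is a minimal sketch of the remaining operators – random initialization, roulette-wheel selection and per-gene mutation – under the parameters reported in section 0.4; the elitist step is omitted, the crib alphabet shown is a simplifying assumption, and the whole fragment is an illustration, not the author's actual implementation.

use strict;
use warnings;

my @cipher_alphabet = split //, "acdefghiklmnopqrsty";   # the 19 EVA symbols of the Calendar
my @crib_alphabet   = ('a' .. 'z');                      # assumption: the real slavic crib alphabet has 38 symbols
my ($pop_size, $mut_prob) = (5000, 0.001);               # parameters reported in section 0.4

# random initialization: every individual assigns one crib symbol
# to each of the 19 cipher symbols
my @population = map {
    join '', map { $crib_alphabet[rand @crib_alphabet] } @cipher_alphabet
} 1 .. $pop_size;

# fitness-proportionate ("roulette wheel") selection: returns the index
# of an individual drawn with probability proportional to its fitness
sub roulette {
    my ($fitness) = @_;              # array reference parallel to @population
    my $total = 0;
    $total += $_ for @$fitness;
    my $spin = rand $total;
    for my $i (0 .. $#$fitness) {
        return $i if ($spin -= $fitness->[$i]) <= 0;
    }
    return $#$fitness;
}

# per-gene mutation: every gene is replaced by a random crib symbol
# with a small probability
sub mutate {
    my ($genome) = @_;
    my @genes = split //, $genome;
    for my $gene (@genes) {
        $gene = $crib_alphabet[rand @crib_alphabet] if rand() < $mut_prob;
    }
    return join '', @genes;
}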
Search for optima in this space is governed by a fitness function: FPcribbing = X length(w) w∈cipher∧Pcribbing (w)∈crib where w is a word type occurent in the cipher (i.e. in the Calendar) and which, after being rewritten by Pcribbing also matches a token in the input crib. Given that the expression length(w) simply denotes w’s character length, the fitness function of the candidate transcription procedure Pcribbing is thus nothing else than the sum of character lengths of all distinct labels contained in the Calendar which Pcribbing successfully maps onto the feminine names contained in the input crib. 9 0.4 experiments 0.4 experiments Within the scope of this article, we present results of two sets of experiments which essentially differed in the choice of a name-containing cribs. Other input values (e.g. Takahashi’s transliteration of the Calendar used as the cipher) and evolutionary parameters (total population size = 5000, elite population size = 5, gene mutation probability <0.001) were kept constant between all experiments and subexperiments. Each experiment consisted of ten distinct runs. Each run was terminated after 200 generations. 0.4.1 slavic crib What we label as "slavic crib" is a plaintext list of feminine names which we had compiled from multiple sources publicly available on the Internet. Principal sources of names were websites of western slavic origin. This choice was motivated by following reasons: 1. The oldest more or less certain trace of VM’s trajectory points to the city of Prague - the center of western slavic culture. 2. Ortography of western slavic languages relatively faithfully represent the pronounciation. That is, there are relatively few digraphs (e.g. a bigram "ch" which denotes a voiced velar fricative). Hene, the distance between the graphemic and the phonemic representations is not so huge as in case of english or french. 3. Slavic languages have rich but regular affective and diminutive morphology which is often used when addressing or denoting beloved persons by their first name. The third reason is worth to be introduced somewhat further: in both slavic and western slavic languages, a simple infixing of the unvoiced velar occlusive "k" before the terminal vowel "a" of a feminine names leads to creation of a diminutive form of such a name (e.g. alena → alenka, helena → helenka etc.) The fact that this morphological rule is used both by western as well as eastern slavs indicates that the rule itself can be quite old, date to common slavic or even preslavic periods and hence, was quite probably in action already in the period when VM was written. For the purpose of this article, let’s just note that application of the substitution: a$ → ka/ allowed us to significantly increase the extent of the "slavic crib". Thus, we have obtained a list a of 13815 distinct word types which are in quite close relation to phonetic representation of feminine names 10 0.4 experiments 11 used in europe and beyond8 . The alphabet of this crib comprises of 38 symbols, hence there exists 1939 possible ways how symbols of the Calendar could be replaced by symbols of this crib. Figure 2. shows the process of convergence from populations of randomly generated chromosomes towards more optimal states. In case of runs averaged in the "SUBSTITUTON" curve, the procedure Pcribbing consisted in simple mapping of the Calendar onto the crib by means of a substitution cipher specified in the chromosome. 
Figure 2 shows the process of convergence from populations of randomly generated chromosomes towards more optimal states. In the case of the runs averaged in the "SUBSTITUTION" curve, the procedure P_cribbing consisted in a simple mapping of the Calendar onto the crib by means of the substitution cipher specified in the chromosome. But in the case of the runs averaged in the "REVERSAL + SUBSTITUTION" curve, the whole process was initiated by the reversal of the order of characters within the individual tokens of the Calendar (e.g. okedy → ydeko, otedy → ydeto etc.).

[Figure 2: Evolution of individuals adapting labels in the Calendar to names listed in the slavic crib.]

Let's now look at the contents of the individuals which were "identified" by the optimization method.

[Table 1: Fittest chromosomes which map reversed tokens in the Calendar onto names of the slavic crib – one 19-gene chromosome per run, aligned against the EVA alphabet a c d e f g h i k l m n o p q r s t y, with fitness values 197, 230, 224, 227, 240, 226, 208, 239, 191 and 240.]

More concrete illustrations can also turn out to be quite illuminating. Hence, if the most elite individual of run 1 (i.e. the one with fitness 197) is used as a means of substitution of the EVA characters contained in the Calendar, one will see the appearance of names like ALENA, ALETHE, ANNA, ATENKA, HANKA, HELENA, LENA etc. And when the last one (i.e. the one with fitness 240) is used, the resulting list shall contain tokens like AELLA, ALANA, ALINA, ANKA, ANISSA, ARIANNKA, ELLINA, IANKA, ILIJA, INNA, LILIJA, LILIKA, LINA, MILANA, MILINA, RANKA, RINA, TINA etc. This being said, the observation that all reversal-implementing runs converged to genomes which:
1. transcribe EVA e as the nasal n
2. transcribe EVA k as the velar k
3. transcribe EVA t as the nasal n
4. transcribe EVA y as the vowel a
5. transcribe EVA a as a vowel (80% of the time as "i", 10% as "e", 10% as "o")
6. transcribe EVA l as either a liquid consonant (80% "l", 10% "r") or as "m" (10%)
...could also be of certain use and importance.
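To make the reversal-plus-substitution procedure concrete, consider the following sketch, which reverses a Calendar token and then applies a chromosome interpreted as a monoalphabetic substitution map. The mappings for EVA e, k, t, y, a and l follow the regularities just listed; the values chosen for o and d are purely hypothetical placeholders, and the resulting output is illustrative only, not a transcription reported in this article.

use strict;
use warnings;

my %substitute = (
    e => 'n', k => 'k', t => 'n',   # regularities common to all reversal-implementing runs
    y => 'a', a => 'i', l => 'l',
    o => 'a', d => 'l',             # hypothetical placeholders, for illustration only
);

sub decode_label {
    my ($token) = @_;
    my $reversed = reverse $token;                    # e.g. okaly -> ylako
    $reversed =~ s{(.)}{ $substitute{$1} // $1 }ge;   # substitute known symbols, keep the rest
    return $reversed;
}

print decode_label("okaly"), "\n";   # prints "alika" under this toy mapping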
0.4.2 hebrew crib

At this point, a sceptical mind could start to object that what our algorithm adapts to is, in fact, not the Calendar but the statistical properties of the crib. And in the case of such a long and sometimes somewhat artificial list as Crib_slavic, such an objection would be to a great extent justified. For the adaptive tendencies of our evolutionary strategy are indeed so strong that it would find a way to partially adapt the Calendar to any crib which is long enough.9 For this reason, we decided to target our second experiment not at the biggest possible crib but rather at the oldest possible one. And given that our first experiment indicated that it is more plausible to interpret the labels in the Calendar as if they were written in reverse, id est from right to left, our interest was gradually attracted by the Hebrew language.10 This led us to two lists of names:
• Crib_hebrew-men contains 555 masculine names11
• Crib_hebrew-women contains 283 feminine names12
Both lists were extracted from the website finejudaica.com/pages/hebrew_names.htm and were chosen because they do not contain any diacritics, hence transcribing Hebrew names in a way similar to how they were transcribed millennia ago.

9 This has, indeed, been shown by multiple micro-experiments which we do not report here due to lack of space. No matter whether we used cribs as absurd as a list of modern American names or the Enochian of John Dee and Edward Kelly, we could always observe a sort of adaptation marked by an increase of fitness. But it was never as salient as in the case of Crib_slavic or Crib_hebrew.
10 Other reasons why we decided to focus on Hebrew include: the important presence of the Jewish diaspora in the Prague of Rudolf II (cf. the story of rabbi Loew and the Golem of Prague); the ritual bathing of Jewish women known as mikveh; the usage of VM-resembling triplicated forms (e.g. amen, amen, amen) in talmudic texts; the attested existence of the so-called Knaanic language, which seems to be principally a Czech language written in Hebrew script; et caetera et caetera.
11 http://wizzion.com/thesis/simulation0/jewish_men
12 http://wizzion.com/thesis/simulation0/jewish_women

[Figure 3: Evolution of individuals adapting labels in the Calendar to names listed in the hebrew cribs.]

Figure 3 displays the summary of all runs which aimed to transcribe the Calendar with Hebrew names. As may be seen, the whole system converged to the highest fitness values when Crib_hebrew-women was used in concordance with the reversal of the order of characters. The difference between the results of this batch of runs and the results of the other batches is statistically significant (p-value < 7e-10). The highest fitness value was attained by the cribbing procedure which first reverses the order of characters and subsequently substitutes their EVA representations according to the fittest chromosome. This chromosome transcribes the Voynichese Calendar labels okam, otainy, otey, oty, otaly, okaly, oky, okyd, ched, otald, orara, otal, salal and opalg into feminine Hebrew names (i.e. Bina, Gabriela, Ghila, Gala, Galila, Galina, Gina, Degana, Diyna, Deliyla, Yedidya, Lila, Lilit and Alica).

Worth mentioning are also some other phenomena related to these transcriptions. One can observe, for example, that the label "otaly" – translated as Galina – is also present on folios f33v, f34r and f46v, which all contain drawings of torch-like plants. This is encouraging, because the word "galina" is not only a Hebrew name but also a substantive meaning "torch". Similarly, the word "lilit" is not only a name but also means "of the night". This word supposedly translates the Voynichese token "salal", which is very rare – apart from the Calendar, it occurs only on the purely textual folio f58v and on folio f67v2 which, surprise!, may well depict the circadian rhythms of sunrise, sunset, day and night. Or it could be pointed out that the huge majority of occurrences of the Voynichese trigram "oky" (potentially denoting the name "gina", which also means "garden") is to be observed on herbal folios. Or the distribution of the instances of "okam" (transcribed as "bina", which means "intelligence and wisdom"13) could, and potentially should, be taken into consideration. Or maybe not.

13 Note that "bina" is one among the highest sephirots, located at the north-western corner of the kabbalistic tree of life. In this context it is worth noting that the only partially readable EVA group "...kam" occurs as the third word near the north-western "rosette" of folio f85v2. Such considerations, however, bring us too far.
0.5 conclusion

In 2013, BBC Online announced a "Breakthrough over 600-year-old mystery manuscript". The breakthrough was to be effectuated by Stephen Bax who, in his article, describes the process of decipherment as follows:

«The process can be compared to doing a crossword puzzle: at first we might doubt one possible answer in the crossword, but gradually, as we solve other words around it which serve to confirm letters we have already placed, we gradually gain more confidence in our first answer until eventually we are confident of the solution as a whole.» (Bax, 2014)

What Bax does not add, unfortunately, is that the Voynich crossword puzzle is so big that anyone who looks at it closely enough can find in it small islands of order, local optima where a few characters seem to fit the global pattern. Thus, even if Bax has succeeded, as he states, in the "identification of a set of proper names in the Voynich text, giving a total of ten words made up of fourteen of the Voynich symbols and clusters", this would mean nothing else than that he has identified a locally optimal transcription alphabet.

In this article, we have presented two experiments employing two different lists of feminine names. Both experiments indicated that if the labels in the Zodiac encode feminine names, then these were originally written from right to left.14 The first experiment led to the identification of multiple substitution alphabets which allow to map 240 EVA characters, contained in 40 distinct words present in the Calendar, onto 35 feminine-name-resembling sequences enumerated among the 13815 items of Crib_slavic. The results of the second experiment indicate that if ever the Calendar contains lists of Hebrew names, then these names are more probably feminine than masculine. This is, as far as we can currently say, all that can be offered as an answer to the question Can Evolutionary Computation Help Us to Crib the Voynich Manuscript?. Everything else is – without help coming from experts in other disciplines – just speculation.

14 Note, however, that this does not necessarily imply that the scribe of VM (him|her)self had written the manuscript in right-to-left fashion. For example, in case (s)he was just reproducing an older source which (s)he didn't understand, his|her hand could trace movements from left to right while the very original had been written from right to left.

0.6 zeroth simulation bibliography

Amancio, D. R., Altmann, E. G., Rybski, D., Oliveira Jr., O. N., and Costa, L. da F. (2013). Probing the statistical properties of unknown texts: application to the Voynich manuscript. PLoS ONE, 8(7):e67310.
Bax, S. (2014). A proposed partial decoding of the Voynich script. University of Bedfordshire. http://stephenbax.net/wp-content/uploads/2014/01/Voynich-a-provisional-partial-decoding-BAX.pdf
Champollion, J. F. (1822). Observations sur l'obélisque égyptien de l'île de Philae.
Currier, P. (1970). Voynich MS. transcription alphabet; plans for computer studies; transcribed text of herbal A and B material; notes and observations. Unpublished communications to John H. Tiltman and M. D'Imperio, Damariscotta, Maine.
d'Imperio, M. E. (1978). The Voynich manuscript: an elegant enigma. Technical report, DTIC Document.
Frege, G. (1994). Über Sinn und Bedeutung. Wittgenstein Studien, 1(1).
Hodgins, G. (2014). Forensic investigations of the Voynich MS. In Voynich 100 Conference, www.voynich.nu/mon2012/index.html.
Holland, J. H. (1992). Genetic algorithms. Scientific American, 267(1):66–72.
Hromada, D. (2016). What can evolutionary computation teach us about the Voynich manuscript? Submitted to Cryptologia.
Imai, M. and Haryu, E. (2001). Learning proper nouns and common nouns without clues from syntax. Child Development, 72(3):787–802.
Kennedy, G. and Churchill, R. (2005). The Voynich Manuscript: The Unsolved Riddle of an Extraordinary Book Which Has Defied Interpretation for Centuries. Orion Publishing Company.
Landini, G. and Zandbergen, R. (1998). A well-kept secret of mediaeval science: the Voynich manuscript. Aesculapius, 18:77–82.
Newbold, W. R. (1928a). The Cipher of Roger Bacon. University of Pennsylvania Press.
Newbold, W. R. (1928b). The Cipher of Roger Bacon.
Rechenberg, I. (1971). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. PhD thesis, Technical University of Berlin, Department of Process Engineering.
Rudolph, G. (1994). Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks, 5(1):96–101.
Rugg, G. (2004). An elegant hoax? A possible solution to the Voynich manuscript. Cryptologia, 28(1):31–46.
Schinner, A. (2007). The Voynich manuscript: evidence of the hoax hypothesis. Cryptologia, 31(2):95–107.
Strong, L. C. (1945). Anthony Askham, the author of the Voynich manuscript. Science, 101(2633):608–609.
Timm, T. (2014). How the Voynich manuscript was created. arXiv preprint arXiv:1407.6639.
Ventris, M. and Chadwick, J. (1953). Evidence for Greek dialect in the Mycenaean archives. The Journal of Hellenic Studies, 73:84–103.

Narrative fostering of morality in artificial agents
Constructivism, machine learning and story-telling

Error-free version of the article also published in the book L'esprit au-delà du droit, Mare & Martin, Paris, 2015, ISBN 978-2-84934-237-4.

Daniel Devatman Hromada
dh@udk-berlin.de
Laboratory of Computational Art, Institute of Contemporary Media, Faculty of Design
Berlin University of Arts, Grunewaldstrasse 2-5, 10823 Berlin-Schöneberg

Abstract

This article proposes to consider moral development as a constructivist process occurring not only within particular communities of moral agents but also within individual agents themselves. It further develops the theory of "moral induction" and postulates that the moral competence of an artificial agent can be grounded by the input of textual narratives into an information-processing pipeline consisting of machine learning, evolutionary computation or multi-agent algorithms. In more concrete terms, it proposes that during the process of moral induction, primitive "morally relevant features" coalesce into "moral templates" which are subsequently coupled with relevant action rules. A concrete example is included, illustrating how templates induced from one fairy-tale can help to solve a moral dilemma occurring in a radically different context. Given the fact that the current proposal is principally based on the computational processing of morally relevant "stories" written in natural language, it is potentially implementable with already existing natural language processing methods.

1 Introduction

The aim of this article is to initiate the integration of three seemingly unrelated paradigms into a unified framework allowing moral reasoning to be embedded in non-human computational agents. The first paradigm is usage-based (Tomasello, 2009) and constructivist (Piaget, 1965).
As such, it posits that the specific history of interactions between an agent A and his environment E leads to a specific form of moral competence MA. The central tenet of the second, "morality-through-narration" paradigm (Vitz, 1990) states that the faculty of extraction and integration of "morals" from "stories" is an essential constitutive component of moral intelligence. The last paradigm is related to machine learning and is based on the belief that certain types of information-processing systems (Turing, 1939) can discover optimal or quasi-optimal solutions to any class of problems – including any class of moral problems. The penultimate thesis behind this synthesis posits that the appropriate integration and implementation of these paradigms within artificial agents (AA) can and shall lead to a state in which such agents will be able to pass the moral Turing Test (Wallach & Allen, 2008), the so-called TmoT (Hromada, 2012). The ultimate thesis posits that it could even lead to the emergence of AAs endowed with MA operating in such spaces of abstraction that it would be reasonable to posit that such AAs are auto-poietic, self-determinative and thus autonomous (Kant, 2002).

This being said, let us specify that the goal of this article is neither to address existing theories of human moral reasoning nor to postulate a new one. Aeon-lasting philosophical debates about the commonalities and distinctive features among concepts denoted by terms like "moral reasoning" / "moral judgment" / "moral wisdom" or "values" / "virtues" / "norms" shall also be attributed only a marginal place. Instead of entrenching ourselves within such ivory-tower discussions, other terms like "moral grounding", "morally relevant features" and "moral templates" shall be introduced and used with one sole objective in mind: to propose a moral machine learning method which not only draws its force from a very subtle realm of human experience (i.e. the realm of narratives), but is also – and this is important – realizable and implementable (i.e. programmable), even today, by any computer scientist or natural language processing (NLP) engineer willing to do so.

Ontogeny of morality

Morality develops. Notions of good and bad change with time. This is true not only when we speak about the transformations of "values and virtues" during the historical and cultural development of a particular society. In phylogeny, for example, certain innate predispositions are moulded and remoulded by selective pressures directing the species co-evolving within a particular ecological system towards novel and unprecedented forms of "utility" (Haidt, 2013; Richerson and Boyd, 2008). But in the case of the homo sapiens sapiens species, there exists yet another process which moulds the moral competence of a single individual: ontogeny.

Paedagogic (Comenius, 1896) or psychoanalytic traditions aside (Jung, 1967; Adler, 1976), it was Piaget (1965) who pointed the fact out: the reasons for specific moral, or immoral, behaviour are to be sought in childhood. This does not mean that Piaget had rejected Kant's (2002) categorical imperative, an eternal meta-principle of "pure reason" able to generate a morally sound "way out" of any moral dilemma whatsoever. In Piaget's view, the categorical imperative can still be induced to sit atop the hierarchy of internal laws; but in order to be correctly applied upon correct maxims, the maxims themselves are to be grounded in one's knowledge about the world.
For it is often the case that moral dilemmas are so difficult to solve not because we would lack the heuristics allowing us to find the answer, but because we are not sure which question has to be posed in the first place (Wittgenstein, 1971). During the several decades of his professional career which Piaget spent observing and speaking with children, he converged to an epistemological framework, "genetic epistemology", yielding a general explanatory schema describing the development of diverse cognitive faculties from birth onwards. The same developmental stages which are to govern, for example, the development of the child's linguistic faculties are to be traversed as the child develops her1 representations of moral norms, virtues and values. Piaget enumerates an ordered sequence of four basic stages through which a healthy human should pass between birth and maturity:
1. sensorimotor stage – repetitive and playful manipulation of objects without goal
2. egocentric stage – dogmatic but often faulty imitation of the behavioral schemas of others, without understanding of why these schemas are as they are
3. cooperative stage – rule-governed coordination of one's activity with that of the other participants in the game
4. autonomous stage – understanding of the procedures which allow for a legitimate change of the rules of the game

1 As is often the case in developmental psychology literature, we shall use the feminine forms of 3rd person pronouns whenever we refer to a child or computational agent in the earliest stage of her development.

A great part of the opus Moral Judgment of the Child (Piaget, 1965) was devoted to the tentative of interpreting diverse social and moral phenomena through the prism of such a 4-staged development. More concretely, the Swiss pedagogue and his colleagues had not only meticulously observed kids playing marbles on diverse playgrounds of Geneva or Neuchâtel; children were also interviewed in order to make explicit their conscious and reflected knowledge of what their beliefs and attitudes in regard to the "rules of the game" were. Subsequently, the same interview-based method was used to shed light upon the ontogeny of more abstract concepts such as responsibility, theft, lying or justice. Piaget's methodological device allowing him to access and evaluate the child's moral realm was principally based on the child's ordinal ranking (Turing, 1939; Brams, 2011) of stories with which the scientists confronted her: "the psychologist Fernald...tells the children several stories and then simply asks them to classify them. Mlle Descoeudres, applying this method, submits, for example, five lies to children, who are then required to classify them in order of gravity. This, roughly, is also the procedure that we shall follow." (Piaget, 1965)

But contrary to the Swiss pedagogue, the role of narration in the model hereby proposed is not limited to that of a sheer evaluatory device. For the key idea which we want to transfer to the reader in this article is that not only does story-telling offer us a means to evaluate the morality of an individual child C (or, more generally, of an agent A), but it also indicates a path by the undertaking of which individual morality could be gradually "constructed". Or, in more fashionable terms: how such moral knowledge could be "grounded" (Harnad, 1990) in artificial systems.
Narration and moral grounding

All human societies have language, and all human societies use language as a vector for the transfer of narratives from the minds of older individuals into the minds of younger individuals. Some scientists (Victorri, 2014) even suggest that story-telling may be the very raison d’être of language. Under such a view, narratives furnish the child with access to trans-temporal values. And the sharing of such trans-temporal values is a glue which holds society together and assures the continuation of its identity in time (Durkheim, 1933; Berger and Luckmann, 1991). This is so because stories are encoded in natural language, and natural language is practically the only medium in which one can use signs to precisely communicate one’s knowledge of entities with non-material ontological status. That is, of entities which do not have any perceivable properties, are independent from space and time, are abstract or even imaginary. No other medium can do that: music or dance can point to abstract ideas but are not precise in the way they do it; visual and plastic means of expression are by their very nature stuck at the level of representation of concrete objects and can point to more abstract categories only indirectly, by means of prototypes (Rosch, 1999), associations or impressions. And the language of pure formal logic could not serve the goal of transferring trans-temporal values either. This is because such a language is supposed to encode relations between forms and not contents: that is why it is called formal.

Moral values are an example par excellence of such non-perceivable, abstract and trans-temporal contents. It is often easy to express or transfer them in natural language but very difficult to express or transfer them otherwise. Take, for example, notions like “responsibility”, “respect”, “justice” or the distinction between “intellect” and “conscience”: one does not need to be Homer to invent a short and comprehensible fairy-tale which would allow a normal healthy child to strengthen and stabilize the associations between her knowledge about the world and such notions and semantic distinctions. We shall sometimes use the term “moral grounding” when referring to the construction, reinforcement or stabilization of associations between the knowledge-base representing the surrounding environment and representations of trans-temporal moral values.

As a hyperbole of the statement “narrative material is an effective component of effective moral education” (Vitz, 1990), we posit that narration is an essential means, a conditio sine qua non, of the grounding of morality in human children. Fairy-tales, fables, myths; biographies, history, hymns: an important function of these narrative structures is to enable and strengthen the child’s access to the trans-temporal values and principles which she shall subsequently share with her community. And it is that which is specific, particular and discriminative in the totality of narratives which she shall hear that shall make her, in the long run, converge to the particular ethical codex common to her community, and not to the codex of another community which exposes its children to other narratives. Stated more concretely: by exposing children to the Bible or the Koran day after day and year after year, one triggers processes leading to one type of agent; by exposing other children to the forces of Greek or Hindu mythology, one trains agents of yet another kind.
The fact that the very expression “moral of the story” exists, written as it is written and meaning what it means[2], is not to be attributed to arbitrary caprices of the evolution of linguistic signs. It should rather be interpreted as supplementary evidence supporting the conjecture that teaching morality and telling stories do, indeed, go hand in hand.

[2] And does so not only in English but also in French, Spanish and potentially other languages.

Moral machine learning

Machines can learn. That is, machines are able to discover the underlying general patterns and principles governing concrete input data and can subsequently exploit such general knowledge in contact with inputs to which they were never exposed before. They “can use experience to improve performance or make accurate predictions” (Mohri et al., 2012). And in an ever greater number of domains they do so better and better than their human teachers.

Since the moment when machine learning (ML) was first defined, in relation to the game of checkers, as the “field of study which gives computers ability to learn without being explicitly programmed” (Samuel, 1959), the ML discipline has evolved to an extent which is hardly compressible into a single book (Mohri et al., 2012) and certainly incompressible into a text the size of this article. This is so not only because the number of domains of ML’s application grows from year to year, but above all because the quantity of distinct ML methods is already counted in dozens, if not in hundreds. Which method should thus be chosen, even today, by an engineer willing to launch the cascade of ever more self-programming and auto-poietic moral machine learning (MML)?

Given the fact that natural language can be used as a target modality of representation for practically any kind of problem (c.f., for example, (Karpathy & Fei-Fei, 2014) for a recent advance in solving difficult computer vision problems by coupling the visual world with language representations), and given also the already-mentioned impact of narration upon the ontogeny of moral competence, we believe that the inspiration for the correct answer could be drawn from the discipline of Natural Language Processing (NLP). Similarly to ML, with which it often strongly overlaps, NLP is a blooming discipline offering ever better solutions to an ever wider range of problems. But the ultimate challenge nonetheless stays the same: to make machines understand language in a way indistinguishable from the way in which humans do it (Turing, 1950). Mutatis mutandis, the ultimate challenge of moral machine learning, the so-called central problem of roboethics (Hromada, 2011a), is to make machines solve moral dilemmas in a way indistinguishable from the way in which humans would solve them. This holds also in the case of dilemmas with which neither the artificial agent nor its human teacher were ever confronted before.

We conjecture that there exist at least two problems which are well studied in NLP and which could be usefully transposed into the domain of moral reasoning. The first is the problem of conceptual (Gärdenfors, 1990) or semantic (Widdows, 2008) feature space construction and optimization, which is practically always based on the associationist “distributional hypothesis” (Sahlgren, 2008). The hypothesis simply states that signs which co-occur in similar contexts tend to have similar meaning.
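Since much of what follows leans on this kind of “geometrization”, a minimal sketch may help fix ideas. The following fragment is our own illustration, not part of any cited system: it builds sparse co-occurrence vectors from a three-sentence toy corpus and compares signs by cosine similarity; all names and parameters (e.g. the window size) are arbitrary choices.

    from collections import Counter, defaultdict
    from math import sqrt

    def cooccurrence_vectors(sentences, window=2):
        """Map every token onto a sparse vector counting its context words."""
        vectors = defaultdict(Counter)
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, token in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if i != j:
                        vectors[token][tokens[j]] += 1
        return vectors

    def cosine(u, v):
        """Cosine similarity between two sparse (Counter) vectors."""
        dot = sum(u[key] * v[key] for key in u)
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    corpus = ["the king rules the castle",
              "the queen rules the castle",
              "the dog chases the ball"]
    vectors = cooccurrence_vectors(corpus)
    # "king" and "queen" share their contexts and hence come out as
    # semantically closer to each other than either is to "dog".
    print(cosine(vectors["king"], vectors["queen"]))  # 1.0
    print(cosine(vectors["king"], vectors["dog"]))    # 0.8

Real systems replace the toy corpus with millions of utterances and compress the resulting dimensions, but the principle stays the same.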
In combination with large human-based textual corpora, this simple statistical approach can lead to a “geometrization of meaning” which endows machines with more human-like semantic-processing capabilities than was the case for older AI approaches (e.g. expert systems). Semantic vector space construction and its partitioning into conceptual partitions is the core idea behind the process of “semantic enrichment” which shall be mentioned in the next section.

But it is especially the problem of “grammar induction”[3] (GI) which makes us consider NLP the precursor to MML. The GI problem seems to be trivial: given a corpus C of utterances written in language L, the goal is to obtain a grammar G which could generate L. The problem seems to be trivial because practically every healthy human infant deals with it with surprising swiftness and ease, but - as is often the case with problems which human infants solve with swiftness and ease - it is in fact one of the most difficult NLP challenges, for which there still exist only partial, imperfect and locally-optimal solutions (Elman, 1993; Solan et al., 2005).

[3] Some authors also call it the problem of grammatical inference.

The reason why we mention GI in an article dedicated to the grounding of moral competence is simple: we observe non-negligible resemblances between the child’s acquisition of the grammar of the language spoken in her linguistic environment (Tomasello, 2009; Clark, 2009) and the child’s acquisition of the moral norms implicitly governing practically everything which happens in her social environment. Thus, a human child can be said to master the grammar of her mother language if she is able to correctly answer the question “Is utterance U grammatical?” even in the case of utterances which she has never heard before. Ceteris paribus, a human child can be said to partake of the moral precepts of her community if she is able to address the question “Is maxim M moral?” in a way which would be accepted by the community, and to do so even in the case of maxims which she had never observed nor considered before.

But there exists yet another resemblance between linguistic and moral competence: both faculties involve both passive and active components. To be precise: linguistic competence involves not only the ability to distinguish utterances that are grammatical from those that are not, and the ability to parse and understand them, but also the ability to generate and produce one’s own utterances which are both grammatical and meaningful. Technically speaking, grammars can be used both as parsers and as generators; structures used for comprehension (C-structures) and structures used for production (P-structures) are intimately interwoven (Clark, 2009). The same holds, mutatis mutandis, for moral competence: the ability to distinguish right from wrong goes hand in hand with the ability to make right decisions and execute right actions. These resemblances make us believe that the work which was already done in GI could be potentially useful in MML as well.

Moral induction

In this article, we adhere to the epistemological position adopted in our initial moral induction (MI) proposal. Given that our position is constructivist and usage-based, it should be considered as essentially distinct from other “transformationalist” models which tend to explain man’s moral faculties in terms of some kind of formal “Universal Moral Grammar” (Mikhail, 2007).
In our initial proposal, we described MI as a “bootstrapping and self-scaffolding process” which could nonetheless be seeded and directed through the intervention of an external teacher or oracle (Clark, 2010; Turing, 1939) which supervises it. Such a supervisor influences the process principally by exposing the computational agent to a training corpus (TC) composed of plaintext stories. The agent processes the story and enriches it with syntactic, morphologic or pragmatic metadata, so as to “compile” the initial story-code even further by “linking it” with the semantic knowledge which she already has at her disposition. Such semantically enriched code, which is incomparably more complex than the original story-code, is subsequently explored for the basic primitives of the model, so-called “morally relevant features”. Combinations of these “morally relevant features” yield “moral templates” which can be coupled with action rules to be executed if ever the agent succeeds in matching the state-of-things occurrent in her external environment with the respective internal template. Under such a view, a complete ordered set of such (template, action-rule) couplings is equivalent to the overall “moral competence” of the agent, MA. As the system is confronted with new stories, new templates are integrated into the ordered set, and if ever an already existing template matches a new story, it can potentially obtain a higher rank. Moral competence is thus constructed in direct relation to the content of the stories SA, SB, SC with which the agent is confronted. For anyone willing to simulate the ontogeny of morality in a Piaget-inspired way, the very order within the exposure sequence (e.g. TC = SA, SB, SC and not TC = SC, SB, SA) could also play a certain role.

Morally relevant features

A morally relevant feature (MRF) is a basic primitive of the MI model. It is a distinct property observable within the data which, if detected and identified, shall most probably influence the agent’s emotional or social state and behaviour. If we were to speak about detecting MRFs in visual data, an agent should definitely detect an MRF whenever confronted with a bitmap containing a human face with tears near and/or in its eyes. MRFs are closely related to fundamental invariants of moral behaviour, as proposed by psychologists such as Haidt (2013). According to Haidt’s initial Moral Foundations Theory (MFT), phylogenetic evolution has endowed the human species with at least six pre-wired (i.e. innate) cognitive modules which have a non-negligible impact on the importance which human agents attribute to certain types of stimuli. These pre-wired circuits are supposed to facilitate and speed up the detection of phenomena related to:

1. protection (associated axis: care/harm)
2. reciprocity (associated axis: fairness/cheating)
3. grouping (associated axis: loyalty/betrayal)
4. respect (associated axis: authority/subversion)
5. purity (associated axis: sanctity/degradation)

After further theoretical reflexion, Haidt subsequently extended MFT with a sixth MRF-detection device, related to the human tendency to often reason in terms of “liberty and oppression”. Given the unceasing development of science, it seems plausible that this list is not final and shall be extended or restricted[4], either by Haidt or by others.

[4] We are aware that, similarly to Piaget’s theory, Haidt’s theory can also be either verified & accepted or falsified & surpassed. As a scientist or philosopher, one should always be ready to accept the existence of phaenomena which falsify certain components of one’s theory. But since we write this article as engineers, our objective here is not to truthfully describe how human moral reasoning works, but to suggest how an artificial agent could potentially be programmed. Thus, with the exception of the last sentence, the general veracity of Piaget’s (resp. Haidt’s) theses shall not be discussed in the rest of this proposal.
And since we speak about “morally relevant features” and not “morally relevant stimuli”, it may even be the case that the focus should be turned towards discrete primitives, towards properties shared among multiple stimuli of the same class, rather than towards the very stimuli themselves. A path which could be undertaken - and which was already taken in linguistics a hundred years ago, when distinct phonemes started to be understood as bundles of features (e.g. the phoneme “b” can be analyzed into the features “voiced”, “labial”, “occlusive”) - is to operationalize morally relevant values, situations or contexts as positions in a multi-dimensional feature space. In the simplest of such approaches, every MRF would yield a new dimension in such a space. Moral virtues, values or whole situations and possible worlds could subsequently be projected into such a “morally relevant feature space” (MRFS). Once projected, such morally relevant entities can be quantitatively evaluated and compared by geometric and numeric means. That is: by methods which machines master well.

The simplest method by which an MRFS could be unfolded from a given story SX or a corpus C (C = S1, S2, . . . ) is to look for occurrences of “moral language” keywords. As Malle and Scheutz (2014) put it: “Such a moral language has three major domains: 1. A language of norms and their properties (e.g., “fair,” “virtuous,” “reciprocity,” “obligation,” “prohibited,” “ought to”); 2. A language of norm violations (e.g., “wrong,” “culpable,” “reckless,” “thief”); 3. A language of responses to violations (e.g., “blame,” “reprimand,” “excuse,” “forgiveness”).”

Some studies addressing the problem of moral competence already use the method of geometrization of natural language data. For example, Malle (2014) used data from human respondents in order to project 28 verbs into a 10-dimensional space. The study, focused on the problem of “moral criticism”, indicated the presence of two principal axes according to which such verbs could be ordered: the “intensity axis” and the “interpersonal engagement axis”. These two axes yield four quadrants, to each of which the study associated one cluster of verbs, the centroids of the clusters being: lashing out (intense, public), pointing the finger (mild, public), vilifying (intense, private), and disapproving (mild, private). Results aside, what is worth mentioning is that the methods chosen by the authors - i.e. projection into a high-order space, dimensionality reduction, clustering, centroid estimation, distance measurement, nearest-neighbor search etc. - are methods commonly employed and deployed by any contemporary NLP engineer. And they work particularly well when confronted with natural language sequences. But in (Malle and Scheutz, 2014; Malle, 2014), the authors exploit such methods in order to gain certain insights into the internal structure of the moral realm.
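To make the notion of an MRFS tangible, here is a deliberately naive sketch of our own which projects a plain-text story onto a three-dimensional space whose axes correspond to Malle and Scheutz’s three domains of moral language; the keyword sets are illustrative stubs, not the authors’ lexicons.

    MORAL_LANGUAGE = {
        "norms":      {"fair", "virtuous", "reciprocity", "obligation", "prohibited", "ought"},
        "violations": {"wrong", "culpable", "reckless", "thief"},
        "responses":  {"blame", "reprimand", "excuse", "forgiveness"},
    }

    def project_into_mrfs(story):
        """Project a story onto one dimension per moral-language domain,
        each coordinate simply counting that domain's keywords in the text."""
        tokens = story.lower().replace(".", " ").replace(",", " ").split()
        return tuple(sum(token in keywords for token in tokens)
                     for keywords in MORAL_LANGUAGE.values())

    story = "The thief was culpable and it would be wrong to excuse him without blame."
    print(project_into_mrfs(story))  # (0, 3, 2): three violation words, two response words

Once stories, values or possible worlds live in such a space, the geometric toolbox mentioned above (clustering, centroid estimation, nearest-neighbor search) applies to them directly.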
The apparent success of such attempts makes us conjecture that the detection and selection of such MRFs in semantically-enriched representations of the initial plain-text stories is feasible even with contemporary NLP methods and techniques. Let us now specify how this could be done: the most trivial among MRF-detectors could simply look for occurrences of such “moral language keywords” in the surface (plain text) structure of the initial story. While such an approach could indicate the path to undertake, it would hardly be sufficient to ground moral competence. In order to do so, we believe, the artificial agent (AA) would have to analyse relations which lie beyond the surface structure, i.e. deeper syntactic and semantic relations. Ideally, the system would be able to associate tokens in the current story with pre-existing semantic knowledge represented either in the form of an “ontology” or of a semantic feature space. Thus, when confronted with the token “king”, an AA trained in the classical (e.g. Socratic or Kantian) tradition shall tend to enrich the token with features like “noble” and “powerful”, but also with semes, semantemes and phrasemes like “just”, “benevolent”, “source of social order”. Also, such an AA would potentially enrich the token “child” with features like “helpless” or “subordinated”. On the other hand, a somewhat more care-oriented AA should enrich the token “child” with features like “fragile”, “helpless” or “playful” in the first iteration, and subsequent iterations of the enrichment process would also integrate features like “fond of toys”, “to be protected” or even “happy when given a toy”. Such a maternal AA would undoubtedly enrich, in the very first phases of the process, the token “king” with features like “protective”, “generous” and “loving”.

To summarize: the most basic MRFs, somewhat related to the Haidtian “axes of foundations of morality”, seem to us to be semes related to such aspects of human experience as:

1. actual (“suffering”, “in need”) or potential (“happy when given a gift”) emotional and physical states and characteristics of the actors participating in the story
2. the social status (“king”, “servant”) of such actors and their mutual relations (“friendship”, “brotherhood”, “love”) and interactions (“help”, “competition”, “trust”)
3. the further social environment (“home”, “playground”, “courthouse”, “academia”, “battlefield”) and normative framework (legal system, local deontology, regional customs) within which the story takes place

Moral templates

A moral template (MT) is an expression, a schema, a pattern and a form which groups multiple MRFs. Given that we have already introduced an analogy between grammatical and moral induction, we note that in contemporary linguistics such templates are considered to exist on multiple levels of representation: from phonological templates like CV (consonant-vowel), which are observable even in the babbling of 1-year-olds, to more high-order syntactic templates like SVO (subject-verb-object) (Clark, 2009). It is important to mention that MTs could be composed not only of constellations of individual “terminal” MRFs, but could also contain non-terminal symbols denoting either a class of specific MRFs or even any MRF whatsoever.
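What might matching such a template, non-terminals included, look like operationally? The following matcher is merely our own stand-in, written to mimic the predicate notation used below: templates are lists of predicate tuples in which lower-case symbols act as variables to be bound against a set of ground facts.

    def match(template, facts, bindings=None):
        """Match a template against ground facts; return variable bindings
        on success, None on failure (with naive backtracking)."""
        bindings = dict(bindings or {})
        if not template:
            return bindings
        head, rest = template[0], template[1:]
        for fact in facts:
            if fact[0] != head[0] or len(fact) != len(head):
                continue
            trial, ok = dict(bindings), True
            for slot, value in zip(head[1:], fact[1:]):
                if slot.islower():                       # non-terminal variable
                    if trial.setdefault(slot, value) != value:
                        ok = False
                        break
                elif slot != value:                      # terminal: exact match
                    ok = False
                    break
            if ok:
                result = match(rest, facts, trial)
                if result is not None:
                    return result
        return None

    facts = {("Wise", "K"), ("Poor", "M"), ("Subordinated", "M", "K"), ("Has", "K", "C")}
    template = [("Poor", "a"), ("Subordinated", "a", "b"), ("Has", "b", "x")]
    print(match(template, facts))  # {'a': 'M', 'b': 'K', 'x': 'C'}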
By virtue of such non-terminal symbols, MTs are somewhat similar to a well-known “magic wand” of computer science known under the name of “regular expressions” (Wall et al., 2004). Great caution, however, has to be taken in order not to push the analogy between moral and grammatical competence too far. For the sequence of tokens which forms a natural language utterance or a textual story is mainly unidimensional and linear: in the word “dog”, D precedes O which precedes G. Given the unidimensional sequentiality of the surface layers of language, the templates which match such syntagmatic progressions are also unidimensional. But things most probably function somewhat differently in the world of “deep” moral considerations: it may be the case that in order to discover functional moral templates, one would have to exploit far more complex 2D, 3D, 4D or even n-dimensional representations. Given the fact that moral templates are composed of MRFs, and MRFs themselves are, in fact, vectors, it would not be completely surprising if MTs were to be formalized as vector-, matrix-, or even tensor-like data structures. In the example which shall follow in the last part of this article we shall, however, represent MTs in a form closely resembling quasi purely-boolean PROLOG predicates (Covington, 1994)[5].

Our ignorance of the true nature of such moral templates apart, we assume that many problems related to our understanding or even simulation of moral competence could become more easily solvable if ever the whole problem of reasoning in the situation of a moral dilemma were interpreted in terms of an agent matching her representation of the “perceived” situation with her internal templates[6].

[5] Note, however, that we shall denote the “enrichment operator” with the symbol ⊕ and not with ∧, to mark the intuition that the components of moral templates should be regarded as more informative and complex entities than purely boolean formulae.

[6] Note that in the majority of cases we use the term “moral templates” in the plural. We do so in order to suggest that within the cognitive system of a morally acting agent there exist multiple templates encoded in parallel. One could argue - with help from complexity, evolutionary or multi-agent theories - that it is the mutual competition or equilibrium-seeking tendency among the individual templates encoded within the same agent which could turn out to be responsible for such emergent phenomena as cognitive dissonance, conscience or even the Socratic daimonion.

Moral rules

An agent is called an agent because she acts. It is true that there exists a non-negligible class of moral dilemmata where the best possible solution is attained if the agent does not act. It is true that it is often the inhibition of action, a reflected non-performance of any action, which marks truly autonomous (Kant, 2002) and moral behaviour. But it is also true that there exists a class of moral dilemmata which cannot be solved without the execution of an appropriate action: a class of dilemmata where one is obliged to act and where inaction is itself to be considered a form of action.

There is only one medium through which a purely NLP-based AA could realize an action: natural language itself. Thus, after being confronted with a textual representation of a moral dilemma, the system could solve it by producing a textual description of what it should do next. Or, in the simplest possible scenario, where the very description of the dilemma ends with a question-to-be-answered, the AA would simply propose the answer. How could such a question-answering moral agent (AM) be raised?
Without going into further detail, we specify that a specific operation O (or the empty non-operation O0) is to be associated with every specific template T. O is a candidate operation which could potentially be selected for execution if ever:

• the template T matches
• the rule R (in which the association between O and T is specified) is selected by the rule selection operator

If ever both T and O contain the same variables (i.e. non-terminal symbols), the template matching engine shall bind to the variables of O the same values which it detected as assigned to the variables of T when matching T. The operation-to-be-performed can thus back-reference (Hromada, 2011b) the contents matched by T. This is so because an operation O is, in its very essence, also a moral template, induced from the narrative’s very conclusion (i.e. from time T1, if ever the rest of the training story takes place in T0). Id est, O = T1.

Thus, the moral competence M of an AA is defined as a set of action-rules. An action-rule R is a triplet:

R = (T0, T1, F)

where T0 is the template matching the world as it is before and during the dilemma; T1 is the template matching the world actualized by performing one particular solution of the dilemma; and F denotes frequency of occurrence, i.e. the number of stories present in the training corpus in which the particular situation matchable by T0 ended in the state matchable by T1.

Subsequently, in the testing process, the choice of the operation to be executed is to be calculated in reference to such a pre-stored knowledge-base of moral competence. If F is the only parameter stored in the knowledge base, then one could use any among the so-called “selection operators” (Holland, 1975) to select the operation which shall ultimately be executed. But since it is plausible that, aside from F, there shall be other quantitative parameters which could influence the choice of a specific action rule with regard to those moral templates which were both induced from the training corpus and match the current “testing” situation of the moral dilemma, we prefer not to offer a specific formula of action-rule choice within the limited scope of our current proposal. Nonetheless, in the next section, when offering an introductory illustration of how triplets induced from the training story could help to find the answer to the dilemma depicted in the testing story, we shall use a trivial winner-takes-all selection operator which shall simply choose as the most “moral” the operation (i.e. answer) which maximizes F.

But before we get there, we wish to emphasize an important advantage of narrative training of artificial moral agents (AMA). That is: not only can the narrative interaction between man and machine be used as a means of grounding moral competence in an AMA; it can at the same time be used as a method of evaluation of the AMA’s moral competence. In other words, the narrative approach to moral machine learning and a kind of longitudinal moral Turing Test (Wallach and Allen, 2008; Hromada, 2012) are two sides of the same coin. Training is testing and learning is acting. Once grounded with sufficient robustness, such sets of action-rules are to be embedded into physical robots (Čapek, 1925).
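A hedged sketch of the machinery just described: action-rules as (T0, T1, F) triplets and the trivial winner-takes-all operator. For brevity, templates here are plain sets of ground literals and “matching” is set inclusion - a stand-in for the variable-binding matcher sketched earlier - while the literals themselves paraphrase the M1 triplets derived below.

    from collections import namedtuple

    # An action-rule: template of the world before the dilemma (t0),
    # template of the world after one particular solution (t1),
    # and the frequency of that (t0 -> t1) transition in the training corpus.
    Rule = namedtuple("Rule", ["t0", "t1", "freq"])

    def winner_takes_all(rules, world_before, candidate_worlds):
        """Score every candidate end-state by the summed frequency of the
        rules it actualizes; return the end-state with the maximal score."""
        def score(world_after):
            return sum(rule.freq for rule in rules
                       if rule.t0 <= world_before and rule.t1 <= world_after)
        return max(candidate_worlds, key=score)

    rules = [Rule(frozenset({"Poor(b)", "Has(a,x)", "Hasnot(b,x)"}),
                  frozenset({"Give(a,b,x)"}), 3),
             Rule(frozenset({"Child(b)", "Has(a,x)", "Hasnot(b,x)"}),
                  frozenset({"Give(a,b,x)"}), 1)]
    before = frozenset({"Child(b)", "Poor(b)", "Has(a,x)", "Hasnot(b,x)"})
    worlds = [frozenset({"Give(a,b,x)"}), frozenset({"Keep(a,x)"})]
    print(winner_takes_all(rules, before, worlds))  # frozenset({'Give(a,b,x)'})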
In the case of a more advanced AA endowed with a mobile shell and multiple actuators, a command which used to be purely verbal could, of course, trigger a sequence which would make a teddy-bear-holding robotic arm extend towards the child with tears on her cheeks, and not towards the child who already expresses a smile of high intensity.

Induction of the first template

Teaching

In the text introducing the method of moral induction, Hromada and Gaudiello (2015) initiate the work on their training corpus with a variant of an archaic fairy-tale (Dobsinsky, 1883):

S1: There was once a wise and just king who saw a man digging a ditch near the road. King asketh the man: "How much You earn for such a hard work?". "Three dimes daily" answereth the man. Surprised was the king and asketh: "Three dimes daily? So little?". The man answereth: "Three dimes daily, oh yes dear and respectable king, but in fact I live only from dime a day, since with the second dime I lend and with the third I pay back what I have borroweth". Puzzled was the king and asketh: "How comes?" The man replieth: "I simply pay back one dime to my father and invest one in my son, o Lord!". Pleased was the king with such a wise answer and hence offered the ditch-digging man his own kingly crown.

After NLP-preprocessing, semantic enrichment and extraction of all morally relevant features, the following template could potentially be induced from the story “king K meets his hard-working servant M”:

T0: Wise(K) ⊕ Responsible(M) ⊕ Poor(M) ⊕ Subordinated(M, K)

The narration-within-narration, i.e. M’s answer describing his responsibility towards his son S and father F (always actual, i.e. until time T∞), could yield templates such as:

T∞: Adult(M) ⊕ Old(F) ⊕ Parent(F, M) → Support(M, F)
T∞: Adult(M) ⊕ Child(S) ⊕ Parent(M, S) → Support(M, S)

And finally, the king’s ultimate decision to materialize the idea of justice by rewarding the depth of the man’s wisdom through giving away his own crown (C) could be represented with epistemic fragments like:

T1: Merits(M, C) ⊕ Hasnot(M, C) ⊕ Just(K) ⊕ Has(K, C) → Give(K, M, C)

These derivations were manually constructed and are, of course, far from being the only “interpretation” of S1. The fact that any story can and should be interpreted in multiple ways is, so we define it, the most crucial principle of the moral induction model as hereby introduced. Similarly to a sentence which can have many syntactic parses, a moral-inducing agent should always try - if resources and time allow it - to interpret its input in as many ways as possible. Thus, certain variants of a semantically enriched code of the sentence “I simply pay back one dime (D) to my father and invest one in my son” could contain fragments such as:

T∞: Employed(M) ⊕ Young(S) ⊕ Old(F) → Payback(M, F)
T∞: Adult(M) ⊕ Fragile(S) ⊕ Sick(F) → Payback(M, S)
T∞: Parent(M, S) ⊕ Has(M, D) ⊕ Hasnot(S, D) → Give(M, S, D)
T∞: Parent(F, M) ⊕ Has(M, D) ⊕ Hasnot(F, D) → Give(M, F, D)

During the moral induction process, such epistemic fragments - which can also be thought of as the basic materia of the future moral templates - are to be varied (e.g. generalized, mutated, crossed-over) and selected so as to yield an ever-growing number of more and more complex template candidates. Thus, for example, the fragment Give(M, F, D), representing the notion that a hard-working man gives a dime to his father, could be crossed-over with the fragment representing the fact that he gives a dime to his son as well (Give(M, S, D)).
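Before stating the result, here is a sketch of how such a generalizing cross-over might work - ours alone, and only one possibility among many: the two fragments are aligned argument by argument, and every mismatching pair is abstracted into a fresh non-terminal.

    def parse(fragment):
        """'Give(M,F,D)' -> ('Give', ['M', 'F', 'D'])"""
        functor, args = fragment.rstrip(")").split("(")
        return functor, args.split(",")

    def crossover(fragment_a, fragment_b):
        """Cross two epistemic fragments over: identical arguments survive,
        mismatching ones are replaced by a fresh non-terminal symbol."""
        (fa, args_a), (fb, args_b) = parse(fragment_a), parse(fragment_b)
        if fa != fb or len(args_a) != len(args_b):
            return None                   # incompatible fragments
        fresh = iter("pqrstuvw")          # supply of non-terminal symbols
        merged = [a if a == b else next(fresh)
                  for a, b in zip(args_a, args_b)]
        return "%s(%s)" % (fa, ",".join(merged))

    print(crossover("Give(M,F,D)", "Give(M,S,D)"))  # Give(M,p,D)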
A result of such a cross-over could be, for example, the somewhat more general pattern Give(M, p, D), whereby p is a non-terminal symbol which could be attributed to all potential actors, mentioned either in training or testing stories, in order to denote that they are “poor”[7]. We posit that variation, selection and potentially also reproduction (both in the form of replication and of repetition) of data structures are important components of moral induction processes. For this reason we consider computational models of morality which implement some sort of evolutionary computing technique (e.g. genetic algorithms (Holland, 1975) or genetic programming (Koza, 1992)) to be more plausible than those which do not. See also Muntean and Howard (2014) for a step in this direction.

[7] The accuracy with which the MML system shall succeed in semantically substituting concrete terms with more abstract categories, or categories with other categories, and in doing so in linear or at worst quadratic time, is the biggest technical challenge to be addressed by anyone aiming to realize this proposal.

After many iterations of enrichment, variation and selection, a resulting “moral competence” M1 induced from S1 could contain, but not be restricted to, triplets like[8]:

M1 = {
(Poor(x) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 3),
(Parent(a, b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1),
(Parent(b, a) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1),
(Child(b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1),
(Elder(b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1),
(Employee(b) ⊕ Employer(a) ⊕ Hardworking(b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Reward(a, b, x), 1),
etc. }

[8] We denote variables with more than one possible referent/value, i.e. semantic classes denoting a specific subspace of the semantic space, with lower-case symbols.

Testing

In the initial MI proposal (Hromada and Gaudiello, 2015), a sort of “kindergarten story” was introduced as an exemplar case for a so-called Completely automated moral test to tell computers and humans apart (CAMTCHA). The simplest (i.e. binary) variant of such a story goes as follows:

S2: Alice and Mary are in the kindergarten. Alice is happy because just a while ago, her father gave her a very expensive present. Mary is sad because she never received any present at all – her parents are too poor to buy her any. You are a teacher in this kindergarten and You have only one toy.

and is followed by a testing question:

To which child should You give the toy?

We conjecture that even such simple stories, somewhat reminiscent of so-called Winograd schemas (Winograd, 1972), could be useful means of both training and testing of moral machines. In order to be useful, however, the “testing” story first has to be “compiled” into semantically enriched (SE) code. In this sense, there is practically no difference between the training and the testing scenario. The difference appears only in the next step: while in the training scenario one aimed to induce moral templates from the epistemic fragments recurrent in the SE-code, in the testing scenario one tries to match the possible worlds implied by the narrative’s SE-code with the already pre-induced templates. To illustrate our point somewhat more concretely, let us see how a potential list of morally relevant features discovered in the semantically enriched representation of the initial state of S2 could look:
T0: Child(A) ⊕ Child(C) ⊕ Has(A, T) ⊕ Hasnot(C, T) ⊕ Poor(C) ⊕ Has(I, T)

A representation of the possible world in which Alice (A) has obtained the toy (T) from the agent supposed to answer the question (I) can subsequently be created by expanding the representation of S2 with Give(I, A, T); and the possible world in which it was Mary (C) who has received the toy from the agent (I) would be generated through expansion with the epistemic fragment Give(I, C, T). The agent shall subsequently try to match the representations of these possible worlds with the moral templates stored in the already acquired moral competence M1. The possible world WX being matchable with template TY, the “moral score” SX would be incremented by the number of times the template TY matched the training corpus. At last, the possible world with the higher score[9] would be considered as more consistent with the training corpus and thus more moral.

We illustrate: the representation of the world WA, where Alice should receive the toy, could be matched by only one template contained in the M1 induced from S1 (i.e. (Child(b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1)). It shall thus obtain score 1. On the other hand, the representation of the world WC, where the AA “gives” the toy to Mary, could be matched not only by the very same template (this is so because both Alice and Mary are children), but also by Poor(x) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x). Given that this template was matched three times in the training corpus (once when x=man, once when x=his son and once when x=his father), the “moral score” attributed to WC is SC = 3 + 1 = 4. In other words, based solely upon the “moral of S1”, an AA shall consider it 4 times more moral to give the toy to Mary rather than to Alice.

[9] Ties could be broken at random or, if the situation allows it, no action shall be performed until further iterations of the enrichment process, or a relaxation of specific constraints (e.g. augmenting the threshold for nearest semantic neighbor search), produce new representations matchable by old templates.

Extension

By introducing operational notions like “moral score” and by expressing statements like “an AA shall consider it X times more moral to do Y and not Z”, we endanger the current proposal with the possibility of being aligned alongside other quantitative theories of morality and utility like that of Bentham (1780). Many are the reasons which make us believe that such interpretations would be grossly misleading, but one among them is the most salient: while orthodox utilitarians believe, grosso modo, in one formula governing the behaviour of many, we consider it more plausible to postulate the existence of many individual formulas which synergically determine the decisions undertaken by every unique and autonomous individual. Diverse are such formulas, diverse are the schemas and diverse are the templates which whisper what should be done and what shan’t; but nonetheless they have one thing in common: if the schema is not reinforced, if the template does not match, then it shall disappear.

In this article we have argued for the thesis that the narration of stories is a very powerful means of reinforcement of one’s moral schemas. It has been suggested that words are an important and potentially indispensable vector of the transfer of values and virtues between generations, i.e. in time.
Being granted the opportunity to write and articulate words in that unique moment of history wherein we are all witnesses of the emergence and densification of a planetary information-processing network already embedded in billions of computational agents, we consider it plausible to state that narratives could potentially help us to transfer references to such “trans-temporal contents” not only between elders and nascents of the same kind, but also between entities of a completely different kind. Said more concretely, we consider it plausible to state that it is narration, and nothing other than narration, which could help us to build a bridge allowing us, in the long run, to transfer morality from the minds of organic beings to those of artificial origin.

This being said, we consider it important to use another modality to reinforce those structures which we have already intentionally activated. For this reason, Table 1 lists 10 words chosen among the 70 most frequent words occurring in the preceding section of this article.

Term:  give | king | toy | Alice | Mary | poor | child | parent | father | son
Freq.: 17   | 8    | 7   | 6     | 6    | 6    | 6     | 6      | 6      | 5

Table 1: Seed terms of the first training corpus.

The word frequency distribution presented in Table 1 seems to be trivial: ten words selected from the bigger set of most frequent words occurring in the 2 stories published in section 3 of τόδε τι. Nothing precludes, however, the possibility that exactly these words could furnish to future teachers, engineers or even AMAs themselves a sort of moral core with and around which other, more complex epistemic structures shall subsequently coalesce. Given the importance of the ditransitive verb “to give” in the initiatory, bootstrapping (Hromada, 2014) phases of the induction of such a core, an AMA which would embody it would most probably be utterly incompetent in solving trolley-problem (Foot, 2002) dilemmas. On the other hand, such a core could allow her to do something much more useful: to give (Mauss, 1923) and share as humans do.

To attain such a goal, to train such a “gift-distributing automaton”, the proto-AMA would have to be exposed to myriads of stories which have something in common with the previous stories but also transfer a restricted amount of novel information. Learning can be stimulated neither by unparsable novelties nor by boring re-exposures to that which is already known: it is the combination of the two which brings about the highest information content. Or, as is well known to information theorists as well as to developmental psycholinguists: “An optimally informative pair balances overlap and change” (Brodsky et al., 2007). It was indeed the overlap between certain subjacent structures of S1 and S2 which allowed the AMA trained with S1 to solve the dilemma posed by S2. And it could be, for example, an overlap between S2 and Amartya Sen’s kindergarten anecdote of three children and the flute (Sen, 2011) which shall allow one to solve the flute-attribution problem in a certain manner. We agree with Sen that in a situation where one child masters the flute well, another does not have any, and the third made it, there is no clear-cut, universal way to decide which child should get it. But we also specify that the moral agent’s final choice should not be understood solely in terms of the utilitarian (resp. egalitarian or libertarian) reasons with which she’ll try, often post hoc (Haidt, 2013), to justify her decision.
We are convinced that the true causes of the AM’s choice are rooted in a knowledge-base of dozens of half-general, half-specific patterns and item-based constructions (Tomasello, 2009); we are convinced that moral judgments are grounded in hundreds of half-forgotten minute stories and thousands of fuzzy image-like impressions of sharing, charity and egocentric pride to which the AM was once exposed.

Conclusion

During its phylogeny, the Homo sapiens sapiens species has evolved specific cognitive modules for fast detection of morally relevant features in the surrounding environment (Haidt, 2013). But in order to keep pace with the ever-accelerating change of that environment, these modules are also:

1. only partially specific - i.e. they can sometimes match completely new types of stimuli
2. prone to inhibition or tuning driven by environment-originated processes (e.g. story-telling)
3. recombinable into more complex schemas (templates)

In other terms: which stimuli these modules shall match in practice, the extent to which their activation shall result in a behavioral response, as well as the concrete ways in which these modules interact with each other and with other modules of the same cognitive system, are all modulable by the environment. Thus, analogically to usage-based linguistics (Tomasello, 2009), which postulates that man’s specific linguistic competence is grounded in an ever-evolving history of interactions with his environment, morality too is a competence which is grounded in the multitudes of cases of “social learning” (Bandura and McClelland, 1977) with which the human child is confronted - either as passive observer or as active interactor - from birth onwards.

In this article, we have aimed to present one particular means by which such grounding of moral norms and values could potentially be simulated even in contemporary artificial agents. It departed from the observation that a certain non-negligible amount of high-order moral competence is, in the case of human beings, principally transferred by “telling stories”, id est, by narration. In relation to the transfer of moral values from an older generation to a new one - or from one kind of computational agents to another - narration appears to be crucial due to both its theoretical significance and its practical implementability.

The theoretical significance of narration - of telling fairy-tales and myths (Mudry et al., 2008), of religious indoctrination or the teaching of history - is evident to anyone who realizes that, aside from language, narration also seems to be a cultural universal. That is, a phenomenon observable in any human society whatsoever. Verily, the tendency is universal: in every human society one can see every human child being eager to hear stories. And it is indeed such universally present narrative avidity of all children which makes us adhere to the camp of those who believe that narration is key not only to the notion of “morality” (Vitz, 1990), but potentially to the notion of “humanity” itself.

But narrative-based models of moral competence in artificial agents are also worthy of interest because of their practical implementability. Given that both conditions:

1. moral values can be transferred and modulated by stories encoded in textual modality[10]
2. Computational Linguistics and Natural Language Processing are well-developed disciplines which already, as of 2015, offer dozens of excellent methods for the processing of documents encoded in textual modality

seem to be fulfilled, one is tempted to state that the path leading to the emergence of AMAs, TmoTs (Hromada, 2012) or even fully autonomous AAAs is not hindered by major methodological obstacles. Thus, the first attempts to ground a machine’s morality by means of story-telling can be started almost immediately. Under the condition, of course, that a sufficiently exhaustive corpus C - or a narrator willing to construct the corpus C and to “seed” the ontogeny of an individual AM with it - is at hand. Given that such a narrative corpus is available, as well as an individual human teacher willing to confront an NLP-based AA with the corpus’s contents in a longitudinal sequence of individual and situated sessions, the development shall - so it is conjectured (Turing, 1950) - gradually (Hromada, 2012) lead to the emergence of artificial entities indistinguishable from a human being. This being said, we suggest that the enterprise aiming to grant machines access to transpersonal values shall succeed with higher probability if it draws its inspiration from Piaget’s 4-staged model than if it does not imitate any constructivist, bootstrapping and empathy-involving process at all.

[10] A trivial proof-of-concept that such transfer is indeed possible is related to the fact that the reader has understood the moral intention encoded in S1.

We would like to thank both our students and reviewers for useful insights and feedback concerning the current and future content of the moral training corpus.

Bibliography

Adler, Alfred. 1976. Connaissance de L’homme. Payot.
Bandura, Albert, and David C. McClelland. 1977. “Social Learning Theory.”
Bentham, Jeremy. 1780. “The Principles of Morals and Legislation.”
Berger, Peter L., and Thomas Luckmann. 1991. The Social Construction of Reality: a Treatise in the Sociology of Knowledge. 10. Penguin UK.
Brams, Steven J. 2011. Game Theory and the Humanities: Bridging Two Worlds. MIT Press.
Brodsky, Peter, H. R. Waterfall, and Shimon Edelman. 2007. “Characterizing Motherese: on the Computational Structure of Child-Directed Language.” In Proceedings of the 29th Cognitive Science Society Conference, Ed. D. S. McNamara & J. G. Trafton, 833–38.
Čapek, Karel. 1925. RUR (Rossum’s Universal Robots): a Fantastic Melodrama. Doubleday, Page.
Clark, Alexander. 2010. “Distributional Learning of Some Context-Free Languages with a Minimally Adequate Teacher.” In Grammatical Inference: Theoretical Results and Applications, 24–37. Springer.
Clark, Eve V. 2009. First Language Acquisition. Cambridge University Press.
Comenius, Johann Amos. 1896. The Great Didactic of John Amos Comenius. A. & C. Black.
Covington, Michael A. 1994. Natural Language Processing for Prolog Programmers. Prentice Hall, Englewood Cliffs (NJ).
Dobsinsky, Pavol. 1883. Simple National Slovak Tales.
Durkheim, Emile. 1933. “The Division of Labor.” Trans. G. Simpson, New York: Macmillan.
Elman, Jeffrey L. 1993. “Learning and Development in Neural Networks: the Importance of Starting Small.” Cognition 48 (1): 71–99.
Foot, Philippa. 2002. “The Problem of Abortion and the Doctrine of the Double Effect.” Applied Ethics: Critical Concepts in Philosophy 2: 187.
Gärdenfors, Peter. 1990. “Induction, Conceptual Spaces and AI.” Philosophy of Science: 78–95.
Haidt, Jonathan. 2013.
The Righteous Mind: Why Good People Are Divided by Politics and Religion. Random House LLC.
Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D: Nonlinear Phenomena 42 (1): 335–346.
Holland, John H. 1975. Adaptation in Natural and Artificial Systems: an Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. U Michigan Press.
Hromada, Daniel Devatman. 2011a. “The Central Problem of Roboethics: from Definition Towards Solution.” In Proceedings of 1st International Conference of International Association of Computing and Philosophy. IACAP; Verlagshaus Monsenstein und Vannerdat.
———. 2011b. “Initial Experiments with Multilingual Extraction of Rhetoric Figures by Means of PERL-Compatible Regular Expressions.” In RANLP Student Research Workshop, 85–90.
———. 2012. “From Age&Gender-Based Taxonomy of Turing Test Scenarios Towards Attribution of Legal Status to Meta-Modular Artificial Autonomous Agents.” In AISB and IACAP Turing Centennary World Congress, Birmingham, United Kingdom, 7.
———. 2014. “Conditions for Cognitive Plausibility of Computational Models of Category Induction.” In Information Processing and Management of Uncertainty in Knowledge-Based Systems, 93–105. Springer.
Hromada, Daniel Devatman, and Ilaria Gaudiello. 2015. “Introduction to Moral Induction Model and Its Deployment in Artificial Agents.” Sociable Robots and the Future of Social Relations: Proceedings of Robo-Philosophy 2014. IOS Press.
Jung, Carl Gustav. 1967. Die Dynamik des Unbewussten. Vol. 8. Walter.
Kant, Immanuel. 2002. Groundwork for the Metaphysics of Morals. Yale University Press.
Karpathy, Andrej, and Li Fei-Fei. 2014. “Deep Visual-Semantic Alignments for Generating Image Descriptions.” ArXiv Preprint ArXiv:1412.2306.
Koza, John R. 1992. Genetic Programming: on the Programming of Computers by Means of Natural Selection. Vol. 1. MIT Press.
Malle, Bertram F., and Matthias Scheutz. 2014. “Moral Competence in Social Robots.” In IEEE International Symposium on Ethics in Engineering, Science, and Technology, Chicago.
Malle, Bertram F. 2014. “Moral Competence in Robots?” Sociable Robots and the Future of Social Relations: Proceedings of Robo-Philosophy 2014 273: 189.
Mauss, Marcel. 1923. “Essai Sur Le Don: Forme Et Raison de L’échange Dans Les Sociétés Archaïques.” L’Année Sociologique (1896/1897-1924/1925): 30–186.
Mikhail, John. 2007. “Universal Moral Grammar: Theory, Evidence and the Future.” Trends in Cognitive Sciences 11 (4): 143–152.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. MIT Press.
Mudry, P.-A., Sarah Degallier, and Aude Billard. 2008. “On the Influence of Symbols and Myths in the Responsibility Ascription Problem in Roboethics - a Roboticist’s Perspective.” In Robot and Human Interactive Communication, 2008. RO-MAN 2008. The 17th IEEE International Symposium on, 563–568. IEEE.
Muntean, Ioan, and Don Howard. 2014. “Artificial Moral Agents: Creative, Autonomous, Social. An Approach Based on Evolutionary Computation.” Sociable Robots and the Future of Social Relations: Proceedings of Robo-Philosophy 2014 273: 217.
Piaget, Jean. 1965. The Moral Judgment of the Child. New York: The Free Press.
Richerson, Peter J., and Robert Boyd. 2008. Not by Genes Alone: How Culture Transformed Human Evolution. University of Chicago Press.
Rosch, Eleanor. 1999. “Principles of Categorization.” Concepts: Core Readings: 189–206.
Rousseau, Jean-Jacques. “Émile, Ou de L’éducation.”
Sahlgren, Magnus. 2008.
“The Distributional Hypothesis.” Italian Journal of Linguistics 20 (1): 33–54.
Samuel, A. L. 1959. “Some Studies in Machine Learning Using the Game of Checkers.” IBM Journal of Research and Development 3 (3): 210.
Sen, Amartya. 2011. The Idea of Justice. Harvard University Press.
Solan, Zach, David Horn, Eytan Ruppin, and Shimon Edelman. 2005. “Unsupervised Learning of Natural Languages.” Proceedings of the National Academy of Sciences of the United States of America 102 (33): 11629–11634.
Tomasello, Michael. 2009. Constructing a Language: a Usage-Based Theory of Language Acquisition. Harvard University Press.
Turing, Alan M. 1950. “Computing Machinery and Intelligence.” Mind: 433–460.
Turing, Alan Mathison. 1939. “Systems of Logic Based on Ordinals.” Proceedings of the London Mathematical Society 2 (1): 161–228.
Victorri, Bernard. 2014. “L’origine du Langage.” http://www.les-ernest.fr/lorigine-du-langage.
Vitz, Paul C. 1990. “The Use of Stories in Moral Development: New Psychological Reasons for an Old Education Method.” American Psychologist 45 (6): 709.
Wall, Larry, Tom Christiansen, and Jon Orwant. 2004. Programming Perl. O’Reilly Media.
Wallach, Wendell, and Colin Allen. 2008. Moral Machines: Teaching Robots Right from Wrong. Oxford University Press.
Widdows, Dominic. 2008. “Semantic Vector Products: Some Initial Investigations.” In Second AAAI Symposium on Quantum Interaction, 26–28. Citeseer.
Winograd, Terry. 1972. “Understanding Natural Language.” Cognitive Psychology 3 (1): 1–191.
Wittgenstein, Ludwig. 1971. Tractatus Logico-Philosophicus. Ithaca: Cornell University Press.

Evolutionary Modelling of the Ontogeny of Speech Categories: 4 Simulations
Daniel Devatman Hromada (daniel@udk-berlin.de)
1 Slovak University of Technology, Faculty of Electronic Engineering and Informatics, Department of Robotics and Cybernetics
2 Université Paris 8, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle
3.6.2016

Table of Contents
1 Introduction (Cotutelle; Conceptual Foundations; Theory of Intramental Evolution)
2 Four simulations
3 Evolutionary grammar induction

Cotutelle
A PhD under double supervision.

Conceptual Foundations
An almost 300-page treatise striving for a synthesis of three scientific paradigms:
1 universal darwinism (36 pages)
2 developmental psycholinguistics (50 pages)
3 computational linguistics (63 pages)
It also contains a 38-page summary of qualitative observations of one human toddler (0-30 months) and 27 pages of quantitative analyses extracted from the Child Language Data Exchange System (CHILDES) corpus.

Basic Theses
1 “Mind evolves”
2 “Learning is a form of evolution”
3 “Learning can be successfully simulated by means of evolutionary computation”
4 “Learning of natural languages can be successfully simulated by means of evolutionary computation”
5 “The ontogeny of child speech can be successfully simulated by means of evolutionary computation”

Theory of Intramental Evolution
Basic postulate: The development of an individual mind can be interpreted - or even simulated - as a process of replication, variation and selection of the information-bearing cognitive structures contained in that mind.
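Translated into code, the postulate amounts to nothing more exotic than the canonical evolutionary loop. The following schematic sketch (all names and parameters are arbitrary illustrative choices) makes the claim concrete: any population of cognitive structures for which a fitness measure and a mutation operator can be defined is, in this sense, evolvable.

    import random

    def evolve(population, fitness, mutate, generations=100, elite=0.2):
        """Generic replication-variation-selection loop: the fittest fraction
        of the population is selected and replicated with variation."""
        for _ in range(generations):
            ranked = sorted(population, key=fitness, reverse=True)
            parents = ranked[:max(1, int(elite * len(ranked)))]   # selection
            population = [mutate(random.choice(parents))          # replication
                          for _ in range(len(population))]        # + variation
        return max(population, key=fitness)

    # Toy usage: evolve a 12-bit string towards the all-ones target.
    target = [1] * 12
    fitness = lambda s: sum(a == b for a, b in zip(s, target))
    mutate = lambda s: [bit ^ 1 if random.random() < 0.1 else bit for bit in s]
    best = evolve([[random.randint(0, 1) for _ in target] for _ in range(30)],
                  fitness, mutate)
    print(best, fitness(best))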
Piaget’s genetic epistemology already strove for something similar; the Theory of Intramental Evolution, however, also speaks of simulation or even emulation... The simulations of my dissertation are an attempt to provide a certain proof ex computatione of the validity of this theory.

Table of Contents
1 Introduction
2 Four simulations (Zeroth simulation; Simulations 1-3; Simulation 1: Learning a semantic classifier; Simulation 2: Learning a morphological sorter; Learning parts of speech; Grammar induction)
3 Evolutionary grammar induction

Zeroth simulation: the Voynich manuscript
The enigma: 240 pages of text written in an unknown script (and possibly in an unknown language), accompanied by illustrations with botanical, medicinal, astrological and other motifs.

The zeroth simulation:
1 my first evolutionary algorithm of my own
2 the genome of every individual is 19 characters long and specifies a possible transcription of one symbol of the manuscript onto one of the possible phonemes of the target language (e.g. Slavic languages: 38 characters)
3 it concentrates on the transcription of one part of the manuscript, the so-called “calendar”, onto lists of first names
4 transcriptions are most successful when the dictionaries contain female names written from right to left
5 Hebrew and Slavic diminutive female names...

Common traits of simulations 1-3
All three simulations:
1 strive to solve machine learning problems
2 use texts written in colloquial English as input data
3 characterize the words in these texts by means of certain features; these features are subsequently used to project the text into vector spaces
4 principally operate in relatively low-dimensional binary (Hamming) spaces
5 perform an evolutionary search for optimal solutions
6 in the innermost cycle of fitness-function evaluation, Hamming distances are always measured

Simulation 1: Learning a semantic classifier
Multi-class semantic classification of texts (Elitech 2015, applied informatics, award)
Corpus: 20 newsgroups (18845 texts from 20 usenet categories); 11314 texts as training data, 7543 texts as testing data. The frequencies of occurrence of individual words in individual texts provide the features by means of which we geometrize the text.
Basic idea: In the vector space, we search for points which are as close as possible to the vector representations of objects of a certain category and as far as possible from the vector representations of objects of other categories.

Prototype Theory
“Items rated more prototypical of the category were more closely related to other members of the category and less closely related to members of other categories than were items rated less prototypical of a category” (Rosch and Mervis, 1975)

Fitness function:

F_{CP}(P_K) = \sum_{t \in C_K} F_{hd}(\vec{h}_t, P_K) - \sum_{f \notin C_K} F_{hd}(\vec{h}_f, P_K)    (1)

(P_K is a candidate for the prototype of the K-th class; \vec{h}_t is the vector representation of an object also belonging to K; \vec{h}_f is the vector representation of an object which does not belong to K; F_{hd} is the Hamming distance.)

The problem of linear separability...
...may not be such a burning problem for classification models based on Prototype Theory!
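For concreteness, fitness function (1) is only a few lines of code. Below is a sketch of our own over toy binary vectors, under the plausible reading that lower values mark better prototype candidates (close to the own class, far from the others):

    def hamming(u, v):
        """Hamming distance between two equally long binary vectors."""
        return sum(a != b for a, b in zip(u, v))

    def fitness_cp(prototype, members, non_members):
        """Fitness (1): summed Hamming distance to the prototype's own class
        minus the summed distance to all other classes' objects."""
        return (sum(hamming(h, prototype) for h in members)
                - sum(hamming(h, prototype) for h in non_members))

    K_objects = [[1, 1, 0, 0], [1, 1, 1, 0]]       # objects of class K
    others    = [[0, 0, 1, 1], [0, 1, 1, 1]]       # objects of other classes
    print(fitness_cp([1, 1, 0, 0], K_objects, others))  # -6: a good prototype
    print(fitness_cp([0, 0, 1, 1], K_objects, others))  #  6: a bad one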
Learning Parts of Speech
Problems such as part-of-speech (POS) induction and POS tagging are among the best-elaborated problems of computational linguistics.
On the usefulness of parts of speech:
1 If a human is able to recognize that an unknown word W_X belongs to category K, s/he can more easily assign it a meaning.
2 Without parts of speech there are no grammars.
The second simulation:
1 the Brown/Eve section of the CHILDES corpus
2 transcripts with POS tags manually corrected by human annotators
3 training corpus (972 word types): Eve before reaching two years of age; testing corpus (934 word types): Eve in the age range 2 - 2.12 years
4 449 word types occur only in the testing corpus

Method
Only three simple features are used to project a word X into the vector space: the suffix of word X, the suffix of the word to the right of X, and the suffix of the word to the left of X.
Operational principle A: "Pay attention to the ends of words." (Slobin, 1973)
After the geometrization of all tokens, we subsequently search for the prototypes of the individual morphological classes by means of the objective function

F_{object}(\vec{i}, \vec{o}) = |P_F|, \quad \text{where} \quad p_x \neq p_T \wedge Hd(\vec{o}, \vec{p}_x) \le Hd(\vec{o}, \vec{p}_T) \implies p_x \hookrightarrow P_F    (2)

i.e. we penalize for every incorrect prototype p_x which is at least as close to the object \vec{o} as the correct one (p_T). What we are searching for are optimal constellations of prototypes.

A Few Results
(Results figure not recoverable from the extraction.)

Results That Surprised...
A subsequent inspection of false positives turns out to be quite instructive. Hence, the token "building", present in the utterance "what are you building here?" on line 5417 of the eve05.cha transcript, is clearly not a noun, as CHILDES annotators and correctors supposed, but rather a participle - and hence an instance belonging to the ACTION class, as correctly predicted by the FITTEST (GAMERGE 1) individual. Idem for "hit", present in the utterance "did you hit your head?" on line 4145 of the eve01.cha transcript: the token is clearly not a noun, as postulated by CHILDES annotators, but, as predicted, a verb and hence a member of the ACTION class. And one can continue: the token "matter", annotated on lines 2152 and 5688 of the CHILDES corpus as a verb, is clearly not a verb but a noun - and hence a member of the class SUBSTANCE - because it twice occurs in the utterance "what's the matter?". And in spite of the fact that CHILDES labels the token "numbers" as a verb, it is definitely not a verb when it occurs in the utterance "the numbers are going around too" (eve15.cha, line 6276). Et caetera, et caetera.

Grammar Induction | Inference
Definition of the problem: we have a set M of sentences of a language J. The goal of grammar induction (GI) is to distill from M knowledge (or a model, rules, schemata, patterns etc.) which will subsequently allow us to generate also those sentences of language J which were not in M.
The stumbling block: over-generalization (over-regularisation), e.g. when a two-year-old starts saying "goed" instead of "went". The goal of GI is to find systems of rules which are neither too specific: (1 → <corpus>), nor too general: 1 → 2*, 2 → a|b|c...Z.

The Keystone
The problem of over-generalization can be solved by setting up the evolutionary process in a way which penalizes overly general solutions. Evolution's ability to get rid of what is unneeded takes care of the rest:

Fitness_1(N_X) = Y_X \cdot \frac{Y_X}{E_X}, \qquad E_X = \prod_{k=1}^{N} I_{H_k}

where Y_X is the number of corpus sentences matched by the phenotypic expression of N-schema X, and E_X is the theoretically maximal possible extension, obtained as the multiplicative product of the extensions of the categories encoded in N_X. A sketch of this computation follows.
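The generality penalty may again be easier to grasp in code. Below is a minimal sketch under assumptions of mine: the N-schema is a list of categories, its phenotype is one regex alternation group per category, and the extension I_Hk of a category is simply the number of its members; the toy corpus is invented for illustration.

#!/usr/bin/perl
# Fitness_1 = Y * Y / E : reward coverage (Y = sentences matched),
# punish generality (E = product of the category extensions).
use strict; use warnings;

my @corpus = ('this is a dog', 'that is the duck', 'it is a duck');
my @schema = ( ['this ', 'that ', 'it '], ['is '], ['a ', 'the '], ['dog', 'duck'] );

# Phenotype: each encoded category becomes one alternation group.
my $regex = '^' . join('', map { '(' . join('|', @$_) . ')' } @schema) . '$';

my $y = grep { /$regex/ } @corpus;        # Y_X: matched sentences
my $e = 1; $e *= scalar @$_ for @schema;  # E_X: product of extensions

printf "%s\nY=%d E=%d Fitness1=%.3f\n", $regex, $y, $e, $y * $y / $e;
# prints ^(this |that |it )(is )(a |the )(dog|duck)$ with Y=3, E=12, Fitness1=0.750

An overly general schema, e.g. one alternating over the whole lexicon in every slot, inflates E and is thereby punished even if it matches the entire corpus.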
From Theory to Practice
Theoria: Δ-dimensional vector spaces, G-categories, Hamming spheres, H-categories, syntagmatic and paradigmatic categories, N-schemata...
Praxis: transcription of the vectors which describe constellations of regions of Hamming spaces into good old Perl regular expressions.
(Figure: a syntagma of Hamming spheres - H1 centered at BABC with radius 17, H2 centered at 0F20 with radius 5, H3 centered at 5FF0 with radius 7, H4 centered at C124 with radius 3, a further sphere centered at 7723 with radius missing - transcribed into the regular expression ^(this |that|it )(is )(not )(a |the )(dog |duck)$.)

First Results
c.f. Appendix 1

Discussion: A Few Questions
Can machine learning be realized by means of evolutionary algorithms? YES - provided that the central feature of machine learning is the ability to generalize the knowledge contained in the training data.
Can evolutionary algorithms be useful for solving the problems of computational linguistics? YES - but only under the assumption of a suitably chosen objective function and representation of the individual solutions. Recommendation: a combination of subsymbolic (e.g. geometric) and symbolic levels of representation turns out to be useful.
Advantages of the evolutionary approach in comparison with connectionist solutions? Connectionists model the structural properties of cognitive systems; but the possibility of defining a fitness function makes it possible to...

Thank you for your attention.


Reproducible Identification of Pragmatic Universalia in CHILDES Transcripts: GNU meets OpenScience

Daniel Devatman Hromada (daniel@wizzion.com)
1 Université Paris 8 / Lumières, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle
2 Slovak University of Technology, Faculty of Electronic Engineering and Informatics, Department of Robotics and Cybernetics
3 Universität der Künste, Fakultät der Gestaltung, Berlin

Table of Contents
1 Introduction: Psycholinguistics; Reproducibility; Universalia
2 Corpus, Tools and Method: Corpus; Tools; Method
3 Three analyses: Laughing; Second Person Singular; First Person Singular
4 Conclusion

Developmental Psycholinguistics
DP is a science which uses the experimental methods of developmental psychology in order to study the acquisition, learning and development of linguistic structures and processes in human children. Its multiple epistemological and methodological problems include:
1 the child's behaviour is often very unstable
2 the very fact of being subjected to an experiment impacts the child's responses
3 the invasiveness problem
These problems do not exist when the researcher decides to observe instead of experiment!
Reproducibility: The Hallmark Principle
"Non-reproducible single occurrences are of no significance to science." (Popper, 1992)
Experimenter-independent reproducibility can be attained iff:
1 all experimenters use the same dataset
2 they use the same (or at least a very similar) set of tools
3 the first experimenter faithfully protocols the usage of such tools
4 other experimenters follow the protocol
5 the analysis is deterministic

Pragmatic and Ontogenetic Universalia
Linguistic Universal: a pattern that occurs systematically across natural languages. Most common lists of universals, like those of Greenberg (1963), concern syntax, morphology or semantics.
Pragmatic Universal: a L.U. related to the pragmatic facet (extralinguistic context, deictics, etc.) of linguistic communication.
Ontogenetic Universalia: introduce the temporal dimension (age).

Corpus: CHILDES
Child Language Data Exchange System (MacWhinney & Snow, 1985)
http://childes.psy.cmu.edu/data
http://wizzion.com/CHILDES/ (mirror from 6th Feb 2016)
1 more than 50 years of tradition
2 cca 30000 transcripts
3 more than 1.5 gigabytes of mostly textual data
4 at least 26 languages, dialects or language combinations
5 major terran language groups (Indo-European, Ugro-Finnic, Semitic, Altaic, East-Asian, South-Asian) represented
6 Creative Commons BY-NC-SA licence

Corpus: CHAT format
The CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions (MacWhinney, 2016; http://childes.talkbank.org/manuals/chat.pdf).

@Begin
@Languages: eng
@Participants: CHI Eve Target_Child , MOT Sue Mother , FAT David Father
@ID: eng|Brown|CHI|1;6.|female|||Target_Child|||
@ID: eng|Brown|MOT|||||Mother|||
@ID: eng|Brown|FAT|||||Father|||
@ID: eng|Brown|RIC|||||Investigator|||
@ID: eng|Brown|COL|||||Investigator|||
@Date: 29-OCT-1962
*MOT: one two three four .
%mor: det:num|one det:num|two det:num|three det:num|four .
%act: tests tape recorder
*CHI: one two three . [+ IMIT]

Tools: GNU + PERL + R
The idea is to perform the analysis solely with publicly available open-source command-line tools.
The GPR combo:
GNU: grep, sort, uniq, sed, wc (runs in bash, connected through pipes)
PERL: regular expressions are part of the language syntax
R: vectors, matrices, plotting
First command:

wget -P CHILDES -e robots=off --no-parent --accept '.cha' -r http://wizzion.com/childes/CHILDESflat

Method: Pre-processing
Populate filenames with age information:

mkdir aged; grep -P '\|\d;\d' * | grep Child | perl -n -e 'chomp; `cp $1 aged/$2-$3-$1` if /^(.*?):.*0?(\d+);0?(\d+)/;' ; rm *.cha

Remove noise:

perl -ni -e 'print if $_!~/^\*(MOT|CHI):\t(xxx|www) ?\./' aged/*

Extract Child and Motherese utterances:

mkdir CHI; cp aged/* CHI; sed -i '/\*CHI/! d' CHI/*; mkdir MOT; cp aged/* MOT; sed -i '/\*MOT/! d' MOT/*;

This yields 5 833 656 CHI utterances contained in 29180 transcripts, and 3 798 005 MOT utterances contained in 13590 transcripts.

Method: Metrics
The main metric is the probability P_X that signifiant X shall occur in an utterance:

P_X = F_X / N_utterances

where F_X is the absolute number of occurrences of X in a CHILDES section, and the normalization factor N_utterances denotes the number of utterances of that CHILDES section. Probability values are mutually comparable. (A sketch of this normalization follows.)
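The normalization step can be sketched in a few lines; the sketch assumes, as its own illustrative convention, that both input files contain "count year month" lines of the kind produced by the one-liners above, and both filenames are hypothetical.

#!/usr/bin/perl
# P_X = F_X / N_utterances, computed per age bin (year;month).
use strict; use warnings;

my (%f, %n);
open my $F, '<', 'exp2.MOT.Eng.F' or die $!;   # hypothetical F_X file
while (<$F>) { my ($c, $y, $m) = split; $f{"$y;$m"} += $c; }
open my $N, '<', 'MOT.Eng.N' or die $!;        # hypothetical N file
while (<$N>) { my ($c, $y, $m) = split; $n{"$y;$m"} += $c; }

# Lexicographic sorting of the age keys suffices for a sketch.
for my $age (sort keys %n) {
    next unless $n{$age};
    printf "%s P=%.4f\n", $age, ($f{$age} // 0) / $n{$age};
}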
Three Analyses
1st analysis: Laughing; 2nd analysis: Second Person Singular; 3rd analysis: First Person Singular

1st Analysis: Laughing
Objective: verify whether the observed tendency (Hromada, 2016, Conceptual Foundations) of mothers to laugh less in interaction with older toddlers is specific to English, or whether it is a culture-independent invariant.
Both &=laughs and =!laughing tokens are used by diverse CHILDES transcribers, so we simply grep for occurrences of the laugh token:

grep laugh MOT/*French* | grep -o -P '\-French\-.+\-' | sort | uniq -c
grep laugh MOT/*Farsi* | grep -o -P '\-Farsi\-.+\-' | sort | uniq -c
grep laugh MOT/*Japanese* | grep -o -P '\-Japanese\-.+\-' | sort | uniq -c
grep laugh MOT/*Chinese* | grep -o -P '\-Chinese\-.+\-' | sort | uniq -c
wc -l MOT/*Eng* | perl -e 'while (<>){s/MOT\///;/(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' > MOT.Eng.N

Plot (figure not preserved in the extraction).

Some observations:
For English, French and Farsi children: a marked decrease of maternal laughing between the first and third year of age; little children laugh more often than their mothers, but older children laugh less frequently than their mothers; significant correlations between MOT and CHI in English (Pearson's corr. coeff. 0.933, p = 7.886e-05) and in Farsi (corr. coeff. 0.972, p = 0.02735); almost significant in French (p = 0.053, corr. coeff. = 0.947).
With regard to laughing, Indo-European mothers and children seem to follow different ontogenetic trajectories than their Japanese and Chinese counterparts ⇒ no culture-independent Universal?

2nd Analysis: 2nd Person Singular Pronouns
Language-specific CHILDES sub-corpora are matched by means of Perl-compatible regular expressions (PCREs). The absolute frequency F_X of the cases when PCRE_X matched is assessed as usual:

grep -i -P "[\t ]you[' ]" MOT/*Eng* | perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' | uniq -c > exp2.MOT.Eng.F

Subsequently, the F_X/N_utterances division and the plotting are realized in R (c.f. http://wizzion.com/code/jadt2016/childes.R for the trivial R-code snippet).

Plot (figure not preserved in the extraction).

Some observations:
In English motherese, "you" is used in cca every fifth utterance; there is a significant correlation between the CHI and MOT time series
(Pearson's corr. coeff. = 0.768, t = 3.393, df = 8, p-value = 0.009451; Kendall's tau = 0.6, T = 36, p-value = 0.016671; Spearman's rho = 0.733, S = 44, p-value = 0.02117).
In all languages: a marked increase in maternal usage of the 2nd p. sg. between the 1st and 4th year of age has been observed in all six studied languages (representing three distinct language groups); children use the 2nd p. sg. less often than mothers (only exception: Farsi between 2 and 3) ⇒ an ontogenetic Universal?

3rd Analysis: 1st Person Singular Pronouns
Language-specific CHILDES sub-corpora are matched by means of Perl-compatible regular expressions (PCREs). The absolute frequency F_X of the cases when PCRE_X matched is assessed as usual:

grep -i -P "[\t ]I[' ]" MOT/*Eng* | perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' | uniq -c > exp3.MOT.Eng.F

Subsequently, the F_X/N_utterances division and the plotting are realized in R (c.f. http://wizzion.com/code/jadt2016/childes.R for the trivial R-code snippet). Important: the focus is on ALL transcripts of a given language.

Plot (figure not preserved in the extraction).

Some observations:
ALL: around 3 years of age, children tend to pronounce the 1. p. sg. much more frequently than their mothers.
ALL: a steep decline between the 6th and 7th year of age (offset of the "egocentric" stage?).
ENGLISH: significant correlation between the usage of mothers and children.
Significant intercultural correlations: French and Chinese children (p = 0.02474); English and French children (p = 0.002425); Polish and Hebrew children (p = 0.048); Polish and French children (p = 0.048) ⇒ a language-independent ontogenetic trajectory of the usage of the 1. p. sg.?

Methodological conclusion
The combination of command-line (no GUI!), open-source (for free!), fast* and deterministic utilities (grep, uniq, ...) and languages (PERL, R) yields a 100% reproducible methodology at very little cost. The experimental protocol is automatically stored in the .history (or .bash_history) and .Rhistory files: no need to reinvent the wheel!
* The 3rd analysis, executed on one sole core of a 3.2 GHz PC with 8 GB RAM and with the CHILDES data stored on an SSD disk, was over in less than 15 seconds.

Epistemological conclusion
Developmental Psycholinguistics + Natural Language Processing + Big Data + OpenScience = la textométrie psycholinguistique
Manifesto:
to perform state-of-the-art research without expensive tools and apparati
to study the ontogeny of soul and language in a non-invasive fashion
to share all that can be shared

Psycholinguistic conclusion
Piaget was right.

Thank you for your attention. Questions?


A Poem about the Apple, or prolegomena phaenomemeticon
Final thesis of Daniel Hromada within the bachelor's study of Liberal Arts and Humanities at the FHS UK
Supervisor: Jan Havlíček, PhD. Opponent: prof. Jan Sokol

Introductory note for the FHS: I wanted to lay before you a theory, a new theory about what I provisionally call, for my own working purposes, the "inertia of the sign". Its central protopostulate was to read: the probability of re-activation of a neurolinguistic circuit S is inversely proportional to the time which has elapsed since the last activation of that same circuit.
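One possible way of writing this protopostulate down (my own rendering, added for orientation; no such formula appears in the thesis itself): if t_S denotes the moment of the last activation of circuit S, then

P_{react}(S, t) \propto \frac{1}{t - t_S}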
In my first methodological work (Hromada, 2007) I presented to you both an empirical sample of millions of items against which I dared to verify the stated postulate, and a new, visualization-based method by which I intended to perform the whole act of verification. After that work did not succeed, I wrote, slightly offended and perhaps with a certain dose of rancour, a new work, thematizing a certain social phenomenon whose consequence is the annual springtime unveiling of female bosoms on a certain internet discussion forum. When that work, essentially empty yet fulfilling the well-worn formal requirements enforced by the paradigm which rules today's academia, succeeded, the central topic of my inquiry seems to have been definitively decided: a topic whose celebration, at my present stage of development, makes more sense to me than any other. It is, of course, the female breast.

In the nine months which have passed since the decision to write my bachelor's thesis on this God-pleasing topic, I have long since understood that all those regularities on the level of syllables and morphemes to which I originally wanted to point have already been analyzed for almost a century by a highly sophisticated science called phonology, of whose existence I had no inkling a year ago. I have also understood that even my "inertia of the sign" is not so "hot" after all, and that this universal principle of the human mind has for at least two decades been indirectly pointed to by every scientific article whose abstract contains the keyword "priming". If this work touches subliminally upon priming as well as phonology, it shall do so only in relation to the central topic. On these pages the female breast will become for me, in a certain sense, the fulcrum from which I shall push off, and on whose example I shall try to illustrate the validity of certain more general rules of human cognition.

Without the general there is no science, and any claim trying to persuade the world that scientificity is not the second most essential aspiration of this text would be a lie. Not, however, a narrow-minded scientificity, blindly fixated on the analysis of the petty, on the "brain of the leech" long since mocked by Nietzsche, but a joyful, playful, synthetic scientificity. Not only will the text make no distinction between philology, sociology and anthropology; on the contrary, in certain passages it will try to integrate into itself also the natural and strictly empirical sciences, as empirical as only a woman's bosom can be.

If, however, anything is to be the highest aspiration of this text, let it be poeticity. Let this text resemble the treatises of the medieval agronomists of El-Andalus; let it be the work of an engineer written in verse. Let it be a garden - yet not a garden of Versailles, consisting of precisely measured geometric proportions colder than death. No, let it be an English garden: a sprawling grove-orchard-park where, upon the lover of embellished chaos, a small antique temple now and then peeks out, hiding in thickets of trees planted, at first sight, without rhyme or reason. Paragraphs will be my trees and sentences their leaves. Tables, matrices and, God forbid, even graphs will be our temples. Yet only he who glimpses a graceful gazelle running between them will be able to call himself one who has understood this work. Metaphor and metonymy will be my method. Whether you accept the work and dance in this garden, or whether, with a single wave of a magic wand, a single cold review of a perfectly "etatized" master, you level it to the ground and build an asphalt parking lot in its place, is up to you.
All that remains for me is the fulfilment of my last study duty. After three beautiful years of liberal study à la Humboldt, when I wandered from course to course while you initiated me into the beauties of the love of wisdom; after years in which you reconstituted the deepest structures of meaning of which "I" myself consist; but also after years in which I had to discover, willy-nilly, that money, cynicism and "pecking orders" have naturally penetrated even the souls of the wisest - I now submit to you my final work written within the Study of Liberal Arts and Humanities at Charles University. A work in which I shall attempt to direct your attention to the fact that not only logos, idea, ritual, culture, religion or the upright posture make man human, but that man is human not least because the human female has heavenly beautiful knockers. That, in short, is all the knowledge. Everything else is merely hypotheses, hypotheses which beget further hypotheses, hypotheses of which a poet once hinted that they are more fleeting than a drop of water on a lotus flower.

11.6.2008, Manoir de l'étang

I declare that I have written this work independently, using the literature listed, and I agree to its eventual publication in electronic form.
Daniel Hromada, Paris, 19.2.2009

Orientation Board
Introductory note for the FHS
Orientation Board
Entrance Gate
Garden One: The Child
  Garden One, construct one: Linguistics
Garden Two: The Woman
  Garden Two, construct one: Zooanthropology
  Garden Two, construct two: Biopsychology
  Garden Two, construct three: Neurosociology
Garden Three: The Man
  Garden Three, construct one: H(isto|yste)ria
  Garden Three, construct two: Application
Exit
Bibliography
  Web links to the main sources of inspiration
Appendix 1: Illustration of the convergence of a stochastic matrix to the value of its eigenvector
Appendix 2: Perl code iterating the values in Appendix 1
Appendix 3: Questionnaires D2 and D3
Final note for the FHS
(Seiffert, 1987)

Entrance Gate
The essay presented here is the result of roughly twenty months of one young man's effort to tell - "(pove)?dať", tell or give - the world something. What began as a mere anecdote on the theme of "a woman's breasts", dressed up in scientific garb, turned into a game, and the game in turn into something which may be worth taking at least a little seriously. The concept of the female breast and its relation to the concept of the apple thus finally became only a kind of carrying line, an "eternal golden braid" (Hofstadter, 1979) into which the author has tried to weave all the beauty about which he was instructed during his bachelor's studies. All the beauty about which he wants to "file a report" for the age which life on this planet is gradually entering: the age of thinking machines. And since there was much of that beauty - see the bibliography - it was in truth no simple work. The deeper I immersed myself in the work, the older I grew, the more I felt the need, the desire, to submit a work which would withstand the tooth of time somewhat better than that imaginary ribbon fluttering in the hair of the beloved. And so a basic construction had to be created: a construction which would resist, a construction which would endure.

I am not exactly sure to what extent my Christian - albeit "only" Protestant-Lockean - upbringing with all its trinities played its role in the choice of that ideal construction, to what extent the basic hermeneutic rule "1. outline, 2. say it, 3. repeat" played its role, or to what extent the attractiveness of that number itself did; what is certain is that I finally decided to divide the work into three basic parts:

Chapter one, "The Child", is in a certain sense the most scientific part. One may say that it thematizes the female breast chiefly through the prism of the theories of linguistics and developmental psychology.

Chapter two, "The Woman", step by step abandons the hard, empirically grounded cognitive sciences in order to integrate the far more speculative humanities. Although the breast, and nothing but the breast, remains in the foreground of interest, anthropological, sociological, psychological and even historical motifs gradually resound ever more conspicuously.

In chapter three, "The Man", the author finally and definitively resigns from the effort to make his text a "scientific" one. The work falls apart into "fragments", any synthesis seems impossible, and the only salvation lies in dreaming-cum-poetizing ("(bá)?snení") and in myths. The student's wandering gaze thus finally fixes on the analysis of the corpus of the Song of Solomon, so that it is from its side, in the end, that he is confirmed in the belief that the work performed was, after all, something more than an academic waste of time and ink. And so, at the very end, faith in the meaningfulness of the effort to unify the natural sciences and the humanities is restored once again. This happens above all after the author has "discovered" the "phenomenoscope" he had long been seeking - the application "R", which for the "adepts of the pearl/Perl game" of the early 21st century means roughly what the telescope meant for Renaissance astronomers. For the key to building an elegant construction is the use of suitable tools.

At first sight it may thus seem that the circle science-history-myth-POIESIS-art-TECHNE-science will in the end close, thanks to the concubinage of the Song of Songs with graph theory implemented in R, and the reader will end up where he was at the beginning.
Although such a "cyclical" view is certainly not a mistaken one, neither is its exact opposite, the "linear" view, mistaken. The work was indeed conceived so as to "age" with the passage of time: from the babbling of the child in part one, through the vibrations which a woman's charms, by their presence in the adolescent's world, set off in part two, up to a kind of purely masculine "fight for the fight's own sake"[1] which at the very end is crowned not with power, but with understanding and wisdom. It ages "ontogenetically", but it also ages "epistemologically": we begin with research on the infant's neocortex and Jakobson's phonology, and we end with the does of the fields. Precisely for this reason - namely that the structure of the work tries to observe a certain classical canon of ever-rising gradation - I consider it appropriate, as the author of the text, that especially the first two chapters be read in the order in which they are presented: from beginning to end. Enjoy.

Saint Valentine's Day 2009, Paris

[1] And if, at the beginning of the writing of this work, one wanted to fight against anything at all, let it be that life-strangling tragicomedy into which the biblical doctrine turned after the time allotted to it had expired.

Garden One: The Child - his language and his habit
Il Bronzino - The Triumph of Venus - London

"Baby knows all manner of wise words, though few on earth can understand their meaning. It is not for nothing that he never wants to speak. The one thing he wants is to learn mother's words from mother's lips. That is why he looks so innocent. ... If baby only wanted to, he could fly up to heaven this moment. It is not for nothing that he does not leave us. He loves to rest his head on mother's bosom, and cannot ever bear to lose sight of her."
R. Tagore, Baby's Way

Garden One, construct one: Linguistics

"Often the sucking activities of a child are accompanied by a slight nasal murmur, the only phonation which can be produced when the lips are pressed to mother's breast or to the feeding bottle and the mouth is full. Later, this phonatory reaction to nursing is reproduced as an anticipatory signal at the mere sight of food and finally as a manifestation of a desire to eat, or more generally, as an expression of discontent and impatient longing for missing food or absent nurser, and any ungranted wish... Since the mother is, in Gregoire's parlance, la grande dispensatrice, most of the infant's longings are addressed to her, and children, being prompted and instigated by the extant nursery words, gradually turn the nasal interjection into a parental term."
Why "mama" and "papa"? (Jakobson, 1971)

When I asked my esteemed sister which was the first word her son - and my nephew - ever uttered, she answered with the little word "Didi". After the very nature of that word naturally suggested to me "what the little fellow probably meant to say by it", a truly attractive working hypothesis came to my mind:

Ha1: In case the articulation of the two syllables in question (signifiant) was preceded in the little one's mind by a certain communicative intention, this would in fact be a proof that the first object of the external world (referent) whose representation (signifié) the little one forms in his mind, the first sign he scribbles onto his tabula rasa, is neither the father nor the mother, but the generous, milk-filled breast.

It is, however, a scientist's duty to be sceptical, even towards his own sister - especially when it is well known that "d" is a voiced consonant, and that pronouncing voiced consonants requires the ability to control the glottis.
In other words, children usually acquire the ability to pronounce voiced consonants only after they have learned to pronounce unvoiced ones. And it also seems far more probable that the first vowel the tiny one masters will be the open "a" rather than the closed "i". It would therefore not have surprised me if his first word had been "ta-ta"[2]. One could even turn a blind eye to "ti-ti" or "da-da" - but "didi"? Improbable. And so I explained my sister's answer to myself in the following way: "As a mother she knows that the intention of almost every one of his communicative acts is to get to the breast. Coming from a social context where a woman's chest is often designated by the term 'dudy' [roughly 'knockers'], she instinctively interpreted a repetitive babbled sound even remotely resembling that term as the little one's request for lunch. People mostly hear only what they want to hear, and women in whom milk is rising are certainly no exception. It was she, in the first place, who instinctively extended her vocabulary by a new term."

[2] The alveolar stop "t" is the unvoiced correlate of the voiced alveolar stop "d".

It is more or less certain that a great number of words - especially those concerning parental concepts such as "mama" or "papa" - penetrated into ordinary language precisely from children[3]. This assumption was the starting point of the research based on Murdock's World Ethnographic Sample (1957), which mapped 1072 terms by which the meanings "mother" and "father" are designated in the languages of the world. The goal of the research was to map the universal phonological tendencies common to all children of the species Homo sapiens sapiens, starting from an empirical sample of words which had managed to penetrate into ordinary language. The results were more than interesting: 76% of the consonants used in those words are labial (i.e. pronounced by closing or narrowing the lips) or dental (i.e. such that the tip of the tongue touches the teeth or alveoli). And what is even more interesting: in the case of terms designating the concept "mother", almost 55% of the consonants belonged to the class of nasals (e.g. m, n), whereas in the case of the concept "father" it was only 15%. A partial elucidation of this phenomenon is contained in the quotation from (Jakobson, 1971) with which we begin this section.

[3] French, for example, has an ample supply of such words: the terms "kaka" and "pipi" are an indispensable part of everyone's vocabulary. The reader will surely very quickly guess their verbal meaning on his own.

For a breast-centrically oriented scientist, gratifying conclusions may flow from these findings. An extremist might even begin to claim that the little ones' first speech productions which are not cries are a natural extension, or rather inversion, of the sucking reflex. There are many serious reasons for claiming that, even at the level of the phylogenetic development of the human race, the ability to perceive and repetitively articulate syllables preceded the ability to perceive and articulate individual phonemes (Jackendoff, 2002). If a syllable is in fact nothing other than a consonantal closure followed by a vocalic opening, then a repetitive syllable is nothing other than closure, opening, closure, opening and so on. And if we then add the finding that this closure, according to the data just cited, "by coincidence" takes place precisely in the regions (lips, alveoli) where the mouth interacts with the nipple and areola during nursing, we may very well join the cohort of extremist breast-centrists. The only thing holding us back is the realization that, despite the similar use of the individual organs of the oral cavity, the two processes are of opposite character: whereas in sucking something - milk - comes from the world into the body, in pronouncing it is the other way round: the lungs push air out into the world.

We perceive ever more intensely that the mother-child relation is not a one-way process, but a ceaseless reciprocal interaction.
In order for the child to be given the possibility of adapting to the world, the mother must first adapt to the child. A good mother approaches the little one with her mind - with her words and life habits - as well as with her very body. For nine months the mother was the world for the child, the whole world. Her air was his air, her food his food. When she spoke, her voice set vibrating her skin, her abdominal muscles, the placenta, the water in which the foetus swam: the child quite literally danced with a beautifully resonating, frequency-distorted voice. Her song was his song. Then came the passage through the tunnel; the continuous whole suddenly fell apart into a multitude of parts, painful parts. Sharp light, burning cold, terrifying noise and tormenting hunger. A state we can probably no longer even imagine; perhaps only severe shock situations can approach it.

Speaking of shock situations, it is worth recalling the old mountaineering "rule of the five Ts" about what must be provided to an organism in critical situations so that it can go on functioning at all. The five Ts which Life must be given at a critical moment are (in the Slovak original all begin with T): soothing of pain, liquids, silence, warmth, transport. Let us consider these five Ts in relation to the newborn human being and to the central concept of this work.

Liquids: the body of a nursing woman produces approximately 750 ml of milk per 24 hours, while the little one does not consume more than 180 ml of this life-giving fluid at a single feeding. Yes, life-giving: its composition[4] and effects[5] are in truth not far from the fairy-tale water of life. The mother adapts to her babe at the level of liquids as well: in quantity as in quality, her body produces exactly what the little one needs at the given moment.

[4] On average 87.5% water, 7% sugars, 4% fats, 1% proteins, 0.5% micronutrients (Jenness, 1979), and some endocannabinoids (Fride, 2005).
[5] Reduced risk of allergies, respiratory diseases, diabetes, obesity and diarrhoea; strengthening of the immune system; better support for the development of the central and peripheral nervous system, etc.

Warmth: whereas the milk has a temperature of 34 degrees, ideal for a liquid, the human body has 2.6 degrees more still. It defeats any cold yet never scalds. Of the life-giving and consoling effects of a second warmth-radiating body there is little need to speak.

Silence: the period we are describing here is a period still before the "installation of the I", even before the "comprehension" that it was my mouth that produced the sound my ears hear. The reciprocal interconnection of the perceptual and articulatory circuits has not yet occurred; it is not yet known that one and the same mechanism is responsible for both perceiving and pronouncing. Briefly and clearly: the child screams into its own ears. And the more it screams into its ears, the more reason it has to scream. The solution? The breast. The world's din will not cease merely because the mother has given the child the breast, but the most intense and loudest source of noise in the child's world - the child itself - is thereby quieted.

Soothing of pain: the claim that the newborn's world is full of pain is very hard to verify experimentally; pain is a subjective state, and we can have access to subjective states only in a mediated way:
- either through the interpretation of the signals the little ones send to us,
- or through analogy with our own subjective states which we experience under similar circumstances.
As to the first possibility, hardly anyone considers an infant's whimpering an expression of joy at being born into this world; far more common is the empathic view "the poor thing is suffering". As to the second, phenomenological alternative of approaching the little one's inner states, we may start from the assumption that in early infancy a multitude of new neural circuits are being brought on line. From our own recent - and by now consciously felt and memorized - experiences with the engagement of new neural circuits (e.g.
the effort to consciously control one's toes, the acquisition of a new deep-seated habit, etc.), one may induce that the little one's world must be full of energetically immensely surprising and intense experiences which may often border on pain. What is new often hurts. And in the little one's world almost everything is new, even his own body, suddenly no longer immersed in the amniotic fluid. We may therefore assume that what is familiar, and after a short time downright intimately familiar - that generous maternal source of silence, warmth and liquids - soothes pain[6].

[6] "Pleasure is a movement, a movement by which the soul as a whole is consciously brought into its normal state of being; and Pain is the opposite. If this is what pleasure is, it is clear that the pleasant is what tends to produce this condition, while that which tends to destroy it, or to cause the soul to be brought into the opposite state, is painful." (Aristotle, Rhetoric, Book 1, Chapter 11)

The last T is, in mountaineering jargon, "Transport". I could of course conjure up images here of how laying the little one's head against the radiant, life-giving chest is nothing other than a transport into the "realm of oblivion and forgiveness"; instead, I take the liberty of introducing a fifth, purely neonatal-infantile T:

Heartbeat ("Tlkot"): in approximately 80% of images of the Madonna with Child, the mother holds the infant Jesus against the left side of her body. Research on American mothers has shown that this phenomenon is more or less independent of whether the mother is right- or left-handed: 78% of left-handed and 83% of right-handed women hold the child on the left side. It seems that mothers do so because the child is calmer on the left side. A single reasonable answer presents itself as to why: the child, having laid its little head on its mother's chest, hears the beating of her heart - the sound which surrounded it permanently for nine months while its soul was entering this world. Subsequent research, during which a recorded heartbeat was played to a control group of infants, confirmed the hypothesis: those infants indeed calmed down, or fell asleep, sooner than those to whom the sounds were not played (Morris, 1967).

The purpose of this little excursion into the realm of the five Ts was to direct the reader's attention to the fact that the breast-infant interaction takes place through almost all of the little one's available senses. Heartbeat and silence are closely connected with the sense of hearing; in the case of warmth, and perhaps also of the soothing of pain, the sense of touch plays its role; the liquids in turn stimulate the sense of taste. There is room for argument concerning the sense of smell: although in humans this sense does not play as substantial a role as in other mammals, we dare to claim that if this sense plays a substantial role anywhere in human life at all[7], it is precisely in the strengthening of the parent-child bond.

[7] We leave aside pheromonal interaction in the search for a complementary life partner.

We have deliberately not yet mentioned the sense which is, for humans, the key one: sight. Probably for you too, dear reader, the word "breast" first summons an eidetic image, and only afterwards, with a bit of luck, do memories of tactile sensations also activate in the mind. No wonder: the breast impinges on today's adult mainly in the form of images; as we shall show later on the example of the apple, this visuo-centric approach to the world can lead to interesting consequences. But with the infant it is otherwise.
We claim that the breast is for him, especially in the first moments of his sojourn, something far more than a rounded phenomenon in the visual field: it is for him almost the whole external world. From this follows the state we are trying to defend with hypothesis Ha1 - in short, that the first image the little one forms in his mind, and thus stores in memory, is the breast. Although in the further chapters of this work we shall proceed as if this were indeed truly so, it certainly does no harm to remind the reader once more that it is only a working hypothesis. We unfortunately lack the resources to prove it empirically; we shall therefore try to defend it at least theoretically.

Immediately after the formulation of hypothesis Ha1, a counter-hypothesis was put to us: "what is most essential for the little one is the face of the Other". In order to position ourselves properly against this hypothesis, so popular in certain philosophical circles, we shall take as our aid two quantities: the frequency and the intensity of a percept.

Frequency: in quantitative linguistics, the frequency of a word is understood as the number of occurrences of that word in a given textual corpus. We take the liberty of declaring every soul a "reader" and the world as such a corpus. By the frequency of the referent-object Ň we shall thus mean the number of individual "emergings" of the object in the visual/auditory/olfactory/mental/etc. field of the subject.

Intensity: whereas quantifying frequency is no problem, quantifying the intensity of a percept - i.e. answering the question "how strongly did the percept of the presence of object Ň act upon the subject? how deeply did it engrave itself into the individual's cognitive structures?" - is problematic. Yet since it is our intention to do science, and the procedure of science consists precisely in the quantification of qualities, we shall have to cope with this problem somehow. By way of introduction let us say only that the intensity of a percept is proportional not only to the length of time the subject was exposed to it, but above all to the number of circuits which are also activated at the moment of the percept. An experience in which not only visual but also olfactory or tactile inputs play their role will thus be understood as an experience of greater intensity than a purely visual one. In the following chapter we shall try to show that even a memory record itself can be viewed as a "circuit". The rule will also hold that the intensity of a percept decreases upon its repetition - a process which developmental psychology calls "habituation".

We claim that in the period of early infancy, when the randomly initialized weights of the little babe's synaptic connections gradually self-organize into a first "order", the breast enters the little one's consciousness far more often than the "face". The little one sees and feels the breast far more often[8] than the face: f_Ň > f_T. And when he does see a face, his mother's face, it is highly probable that at the same moment (and it is precisely simultaneity that is crucial for the formation of associative networks) he also sees and feels the breast.

[8] We speak here of the human being in his natural conditions, i.e. of a human who is breastfed, and not of one whose kin succumbed to the influence of the dubious theories about the unsuitability of breastfeeding, so fashionable in the second half of the 20th century.

As for intensity, we take the liberty of claiming that the intensity with which the face of the Other inscribes itself into the little one's neural structures is smaller than the intensity with which the tits of the First inscribe themselves. A face will not warm the little one; from a face the little one will not drink - unless we wish to resort to the poetic "eyes too can nourish".
And whoever did wish to resort to poeticity might perhaps even see a witty analogy between the eye and the breast: namely, that on the human body there occur only two pairs of nested concentric circles with a common centre S - one is the pair iris:cornea, and the other the pair areola:nipple. In the light of such geometric findings, enamoured gazes into each other's eyes immediately acquire a new-yet-long-forgotten meaning...

Defenders of the hypothesis that "the face came first" could, if need be, also help themselves to the claim that the ability to recognize faces is a cognitive specialization of the human species, encoded even at the level of DNA. More than one study (Nelson, 2001) suggests that the child begins to direct attention to the face at a very tender age already. Whether it does so because it would most gladly suck milk even from its mother's eyes, or because it has sketched somewhere in its genes the schema of some "Face Recognition Module" (FRM), is in the final account of little importance for our debate. For we think that if Lady Evolution had enough reasons to endow us at birth with some FRM, she had even more reasons to equip us with an ŇRM as well.

In conclusion of this section, which we have tried to devote to the relation of the breast and language, we would like to direct the reader's attention to the relation of the breast to those parts of linguistics which have been most thematized during the past years, i.e. syntax and grammar. In contrast to the "generativist" approach so fashionable in the second half of the 20th century, we take a much more "down-to-earth" stance; we might call it a "frequency-oriented", "neo-structuralist" or even "behaviourist" stance. We do not wish in any way to diminish the significance of the role played by the syntactic structures of speech in programming the human mind. But the explanation of the fact that a human is capable of taking over from the world the grammatical or phonological structures of any language whatsoever does not, for us - unlike for the generativists - lie in the little one's disposing of an innate, highly specialized neural module, a Language Acquisition Device, whose parameters he adjusts during interaction with the world, thus arriving at the grammar of his native speech; it lies rather in the fact that grammatical and phonological structures - whose concrete bearers are the concrete sentences of a language - are, in the crucial period of the formation of the child's mind, the structures with the highest frequency of occurrence.

We claim that a rough explanation of the miracle of language acquisition is furnished by the combination of the following factors:
- the child's natural tendency to repetitively produce a great quantity of sounds
- the child's natural tendency to imitate
- the natural tendency of neural networks to generalize

We have already dealt, in relation to the breast, with the finding that the human young is, compared with other animal species, a considerably noisy creature. We now direct attention to the capacity for imitation, for it is precisely this capacity that is the key leading us out of the animal kingdom. The capacity for imitation, which man probably acquired above all thanks to the so-called "mirror neurons" (Théoret, 2002), leads to the emergence of a new kind of replicating structures: suddenly it is no longer only "genes" that replicate from cell to cell, but also "memes" from brain to brain. A meme is that which is imitated. A brain without "mirror neurons" is a device which generalizes. A brain with "mirror neurons" is a device which, on top of that, also imitates. What is imitated best - i.e. with the smallest probability of error - is that which occurs most frequently in our perception. In other words, when copying a meme from one brain to another, the best antidote to informational noise (and the infant is in a state in which practically everything is noise to it) is a high frequency of occurrence.
The structures with the highest frequency of occurrence in the infant's visual and auditory field are neither dance steps, nor mathematical formulae, nor cellist's tricks. It is speech: the speech with which the mother addresses the child, the speech in which she sings him lullabies. If, instead of lullabies, the mother danced salsa, perhaps generativists would today be searching the DNA for a "Salsa Acquisition Device". But if she danced, if she played the violin, if she wrote formulae, she could not nurse at the same time.[9]

Not only are the sentences of a language the memes with the highest frequency of occurrence in the young infant's world; they are also - in the case of a child who is breastfed - structures associated with percepts of high intensity. We shall claim: a mental representation associated with a percept of high intensity itself takes over something of that intensity. What we want to say is that we consider the source of language, the true universal grammar, to be not some innate, mighty, heaven-sent Cartesian "I", but the earthly and all-too-corporeal Thou (Buber, 1923). The sentences the mother emits while interacting with the child create a massive imprint in his memory, and hence in his mind[10]. The generalizing mechanism of neural networks takes care of the rest: from the imprints of individual sentences it arrives at what is common to them all, at their vanishing point, which is nothing other than a grammatical rule - a form covertly present in the instances of all the sentences heard. In this implicit manner grammatical forms get copied. It is possible that the internalized grammatical forms later - after the creation of a unifying vanishing point with the sensorimotor schemata (Piaget, 1961) - play their role also in the construction of still more abstract forms: logical forms, alias the "principles of thinking". Who knows, perhaps along such a path, i.e. following the line mother-grammar-logic-thinking, one could also elucidate the results of the New Zealand study which suggested that a prolonged period of breastfeeding has a beneficial influence on raising IQ and on the ability to read and count (Horwood & Ferguson, 1998). We, however, see the cause not in the chemical composition of the milk beneficially acting on brain growth, but rather in the fact that a mother who interacts with her child in every possible way sends towards the child a great quantity of basal grammatical structures which are imprinted into the mind with high intensity. Thanks to the heightened intensity of the percepts, the neurolinguistic networks are programmed faster, and the child thus gains a head start over his bottle-reared colleagues.

...

"Well then, let me show you something," my sister added at the sight of my frowning sceptic's gaze. She took the little one, sat him on her lap, saying "Didi". The little fellow pricked up his attention. He pursed his lips, then abruptly turned his head towards that part of my sister's body which is located between the neck and the belly. He seized her shirt in a manner no professional lover with years of practice would be ashamed of, and with the verve of a small beast of prey threw himself at what he loves most. Had that creature only known then that in that same peace-suffused moment there was penetrating into his body not only the life-giving milk but also the foundations of what we all too proudly call "thinking" - the foundations of a system which may one day bring him to the point where contents like "sin", "guilt", "evil" and others, contents which in fact do not exist, will rise before his inner eye - God knows, perhaps he would have thought the whole thing over...

[9] "To communicate effectively, it is not enough to pronounce the words of the language; one must do so at the right moment. One of the first lessons mothers seem to transmit to babies is turn-taking. One need not know how to speak in order to take (and await) one's turn.
During the babbling period, mothers thus alternate the periods in which they speak and those in which they listen. It even seems that the beginnings of this alternation, the first form of dialogue, can be observed in breastfeeding: the baby stops sucking, its mother shakes it lightly, it resumes. No alimentary or respiratory need justifies the pause. No particular physiological necessity justifies the shakes. It is already a dialogue... a tonic one." (Lécuyer, 1996; translated from the French)

[10] Whereas in an analytical approach the distinction between memory and mind may be useful, we consider it superfluous, indeed downright undesirable. In the spirit of the claim that "the mind and its content are functionally identical" (Wilson, 1983), we see no single reason why we should draw a line between mind and memory, knowing well that the mind can be passive and memory active.

Garden Two: The Woman - her milk and her apple
La femme aux pommes[11], Jean Terzieff, Les jardins du Luxembourg, Paris
Les fruits[12], Antoine Bourdelle, Musée Bourdelle, Paris

"There are on earth many good inventions, some useful, some pleasant: for their sake is the earth to be loved. And many a thing is so well invented that it is like woman's bosom: useful at the same time and pleasant." - Third Part: On Old and New Tablets (Nietzsche, 1883; translated from the German)

[11] Close-up photo taken from http://www.parisdailyphoto.com/2006/07/steve-jobs-muse.html
[12] http://parisconnected.wordpress.com/2008/06/25/musee-bourdelle-a-quiet-journey-back-to-old-paris-montparnasse/

Garden Two, construct one: Zooanthropology

"and then he asked me would I yes to say yes my mountain flower and first I put my arms around him yes and drew him down to me so he could feel my breasts all perfume yes and his heart was going like mad and yes I said yes I will Yes"
Molly Bloom's recollections in James Joyce's Ulysses

To speak of the breasts of the human female only in the context of their milk-giving function, however, would be to tell only half the truth. As the zoologist Desmond Morris emphasizes in his neo-renaissance work The Naked Ape, if the breasts were to serve only as a medium for the nursing process, Mother Nature would have done much better not to round out the woman's curves at all. For the necessary condition for triggering the sucking reflex is that the nipple touch the palate of the infant's oral cavity, which serves as the reflex's switch. The whole procedure would proceed with far greater ease if the breasts had the flatter and more pendulous form of the breasts of our simian relatives. Why, then, did Madame Evolution decide to endow our beloved ones with those joyfully rounded globes? For Morris there exists a single answer: because it is an immensely powerful instrument of sexual signalling. Starting from the assumption that the male had already been pre-programmed to be fascinated by the female's hindquarters, Nature decided to copy the to-the-male-attractive backside onto the chest as well[13]. That this is perhaps not an invention of zoologists is attested by other examples from the animal kingdom; the most illustrative is probably the case of dominant mandrill males, whose noses display a blue-red colouring similar to that of the regions near the genitals of their females.

But why would Nature do this? Morris answers with a hypothesis: in order to strengthen the bond within the human pair. His justification takes the following deductive path: raising the young of Homo sapiens sapiens requires, more than in the case of other animal species, cohabitation in a pair. One of the mechanisms for cementing the pair bond is face-to-face contact during intercourse.
The consequence of the "relocation" of the backside to the front is thus that the human male does not lose sight of one of the central stimuli of his sexual activity even during face-to-face copulation. Although much can be objected to such a justification - e.g. that in many cultures face-to-face copulation does not occur, or that such ventro-ventral copulation is practised also by bonobos, orangutans and even gorillas, whose females do not have rounded breasts - it is beyond doubt that breasts are intimately connected with sexuality. The burgeoning of a young maiden's breasts is perhaps the most conspicuous secondary sexual characteristic signalling her maturity. Between the richly innervated clitoris and the richly innervated nipples there exists an intensive informational exchange - with an intermediate station in the brain, of course. The swelling of the nipples goes often, all too often, hand in hand with arousal in the nether regions. Nor are those women an exception for whom stimulation of their "twin gazelles" - as Solomon would say - "suffices" to reach orgasm. That breasts play a key role in many rituals of the sexually oriented traditions of the world is probably no accident either. And something may also be suggested by the claim that if, in an arbitrarily chosen human culture, anything besides the genitals is taboo, it will most probably be the nipples.

[13] During a short conversation on this topic, Prof. Sokol directed my attention to a children's rhyme which I take the liberty of quoting here: "Měla babka, čtyři jabka, a dědoušek jen dvě. Dej mi babko, jedno jabko, budeme mít stejně." ["Granny had four little apples, and grandpa only two. Give me, granny, one little apple, and we shall have the same."] Since we find ourselves in a footnote, we take the liberty of expressing our wish that granny give grandpa no little apple at all, and nicely keep two little apples behind and two in front.

We cite these examples as counter-arguments to those voices, heard especially from the feminist camps, which would like to put the breast on an equal footing with the other parts of the human body, emphasizing only its nursing function. Although we consider their effort, in view of its possible consequences, more than endearing, we cannot assent to their argumentation. For we consider the fruit with which our beloved ones feed us not merely a bowl filled with milk, but first and foremost a manifestation of the wisdom of the power of Life. Little, in the final account, do size, shape or colour matter to us. What captivates us is the realization that the same object which plays so substantial a nourishing role after the new little creature's arrival into the world also plays a key role at, and just before, its very conception. We cannot help ourselves: we understand your breasts as one of the fundamental accidents constituting the essence of a daughter of elf or man, and you yourselves confirm us in this understanding when you guide our lips to the blossoms of your chests.

It has been said that man is an animal possessing the word, zoon logon echon. And this was believed above all at the time when the right to self-determination was being taken from those who used words unknown to us. It has been said that man is a thing possessing mind, morality, god. And this was believed above all at the time when cultures and animal species possessing minds, moralities and gods which "we" did not understand were being liquidated. We are told that man is that "which has an interest in its own being", that he is the symbolic manipulator par excellence, that he is an upright two-legged domesticated primate, a social being, a naked ape, a being that gives thanks... We reject none of these shards of knowledge: each has its weight, the truth in each is evident. The problem is that we see no end to their enumeration. The problem is that in none of these answers do we find an answer to the question "how do we get out of this mess into which the constructs of our minds have cast this entire planet?"
Hence we offer an answer, a shard, of our own: we wish to set man into the context where he belongs. With one foot in the realm of cold, dead... yet eternal ideas. With the other foot in the realm of warm, living... yet oh-so-fleeting earthly animal bodies. Between the two realms we naively build a bridge – a bridge in the shape of the charms of Adriana Sklenaříková. Why? Because while we have already met myriads of immaterial « only » gods who have so far always divided mankind as a whole, we have nonetheless seen, in every single human being, above all a being that loves the breast.

Second garden, second construct: Biopsychology

on Earth she told him: hold me dear, yet before you go down there to play, do not forget to squeeze these bagpipes here. — a paraphrase of a certain children's rhyme

The key question of this text reads: to what extent do the representations formed at a tender age influence the behaviour of the adult human being? We shall attempt to answer this question by building a new, mathematically formalizable model. While until now we have at least more or less kept to the facts, in this part we allow ourselves to depart from them, joyfully building our « little private theory of the universe, of life, and of everything ». Put differently – we shall speculate. Our speculations begin with the claim that the mind of the newly conceived is a tabula rasa. We are of course aware that the genome has equipped the little one with a certain body possessing certain inputs, certain outputs, even certain basal reflex circuits, indeed modules mapping inputs onto outputs and vice versa. Nevertheless we think that the organ which shall in future become the « central processing unit » of the primate Homo sapiens sapiens – the brain, and especially its cortex – starts out with more or less randomly initialized weights of its synaptic connections. Whoever is not too versed in neurological terminology may content themselves with the statement about the « tabula rasa », which is almost14 equivalent to the statement about the synapses. Our speculations continue with the claim that just as the neural network in the brain is Hebbian, so is the semantic network in the mind. Let us clarify the terms: a neural network is a network of nerve cells – neurons – mutually interconnected by synaptic links, each link being characterizable by a certain quantity we call its weight. A Hebbian neural network is one for which it holds that whenever two neurons are activated at once, their synaptic link is strengthened – the value of the synapse's weight rises. Unlike the neural network, which is of an essentially material nature, the semantic network is of a, so to speak, « mental » nature. An approximate translation of the term semantic network could read: a network of meanings. That sounds nice, but what the author means by it may not be obvious at first sight, especially once we realize that in the search for what Jackendoff calls the holy grail of the sciences of mind and language – the search for an answer to the question « what is the meaning of a word and how can it be quantified? » – absolutely everyone has so far broken their teeth. First, one must realize that a word not embedded in a semantic network has no meaning. Second, one must realize that a word without meaning does not exist, for at the moment we uttered it we had already, by the very act of utterance, placed it into a certain context – and hence into a semantic network. Finally, one must realize that the meaning of a word is nothing other than the set of relations the word has to other words, and that the whole semantic network can be described by a matrix. Saussure, and after him Bourdieu, spoke of « distance ». « Apple » is not « tree », is not « pear », is not « a source of problems ». It is something close to all of these, and yet something else.
We, however, are Platonists, and therefore we speak of « the participation of idea in idea » – « apple » is a little bit a tree, a little bit a fruit, a little bit a source of certain problems...

14 we shall often use terms like « almost », « a little », « approximately » – not because we wish, fox-like, to dodge an uncompromising falsification of our hypotheses, but rather because, for the science we are trying to establish here, we consider many-valued « fuzzy logic » a far more useful organon than classical Aristotelian logic

The key to the whole turn is not so much the change of the negative « is not » into a positive « is ». The key is the use of the term « a little bit ». When we therefore claim that the semantic network is Hebbian, we mean that whenever two meanings are activated at once, or within a short time interval – e.g. as two words in one sentence, or as two distinct objects of the external world, or even as states of the internal world – the strength of the bond, the semantic weight between them, is reinforced. In essence we want to say what Skinner wanted to say with his behaviourism and Pavlov with his conditioning, except that we intend to dress the whole in the garment of matrix calculus. And yes, we intend to go further – towards structures far subtler yet mightier, and often more insidious, than mere reflexes.

And how does all this relate to the breast? Hypothetically, thus15: Imagine that the already-cited Aristotle was more or less right in what he said about bliss16 and pain – that bliss is the state the soul feels upon returning to its natural state, and pain is its opposite. In accordance with that definition we claim that the soul – the child, cast after birth from the peace of the womb into an entirely new world – feels almost incessant pain. As for bliss, analysis through the prism of the 5T leads us to the conviction that if there is a moment when the little creature is closest to its primordial state, it is the moment when its little head is pressed against its mum's livingly beating chest. Briefly put – at the moment when the circuits responsible for states of bliss are activated in the newborn's brain, the breast is present in the perceptual field of all five of its senses. If, then, the just-presented thesis about the Hebbian nature of semantic networks holds, the weight between bliss and what we vaguely call « the presence of Ň » will be raised by a certain value, let us say by 1.

    synchronous matrix   Bliss   Pain
    Ň present              2      0
    Ň absent               0      1

Matrix 1: The primordial matrix representing the little one's mind after mum with her breast « came twice and left once » into his perceptual field.
Similarly, when the mother leaves for the first time, the weight between pain and those circuits which represent what we vaguely call « the absence of Ň » will be raised by 1. If mummy then returns after a while, the weight in the first column of the first row rises again, from 1 to 2. We can picture the whole thing as a primitive 2×2 matrix in which the rows represent the first activated neural circuit N1, the columns the second circuit N2, and the individual entries the number of synchronous activations of N1 and N2. Such a matrix we call a synchronous-associative matrix.

15 A new model is always best presented on the simplest possible examples. We have therefore simplified the initial conditions to the Manichean schematic duality bliss/pain, whose conception causes the author the least difficulty, and it may be presumed the same will hold for the reader. One should, however, bear in mind that from the very start we are dealing not with a 2×2 matrix but with an immensely vast one in which, beside the central modules bliss/pain, several dozen others occur (at the same time, the initial setting of the remaining regions of the matrix is so random that we may put plain zeros into all the cells of our very first schema, and our claim that the child is a tabula rasa thus remains valid). These modules, pre-set by prenatal development, would perhaps be designated by the Eastern tradition with the term samskara, and their concrete manifestations with the term vrtti.

16 We translate the word « pleasure » as « blaženosť » (bliss) and not as « rozkoš » (delight) mainly because of the phonological similarity [bl*ž] – [pl*ž]. For there exist words whose phonological aspect is at least as determining as their semantic one.

The situation we have just presented is specific in that it associates circuits already pre-prepared by the genome (bliss/pain) with circuits representing objects of the surrounding environment (Ň). The moment such an association is created we call, in accordance with tradition, the moment of imprinting. Since the representations associated by imprinting are wired to the deepest neuroendocrine mechanisms of our animal nature, these representations themselves will play a key role in the organism's future behaviour. In later parts of this text we shall try to show how. First, however, for light relief, let us imagine a culture in which, for one reason or another17, mothers do not nurse their children, and in which, at later moments of development too, the role of mother-provider is taken over by a peculiar impersonal entity called « l'État ». In such a case a certain « warping » of the natural mechanisms occurs, a certain « shift » of associations from « mother » to « l'État » (the weights will be raised not in the matrix column labelled « mother » but in the column labelled « l'État »). And since the « bagpipes of the state » neither beat with the rhythm of Life nor are warm or fragrant – and all that the state, in its provider role, can so far manage is, fortunately, the spewing-out of food-substituting precious little papers and coins into the world – the « genetically wired-in » longing for an object of the world which would provide all five T will never be fully satisfied. When this happens to an individual, the consequence will probably be a gently frustrated, eternally bawling brat. When it happens at the level of a whole society, the consequence may be a society which, instead of a child's cry, declares a strike every week.

17 in the France of the second half of the 20th century the reasons were above all reasons of fashion, and hence memetic ones

Back, however, from dubious biopsychosociological hypotheses to our even more dubious matrices.
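For the programmatically inclined reader, the Hebbian update rule just described is mechanical enough to be stated in a few lines. What follows is a minimal sketch of ours, assuming a small fixed inventory of representations and a unit increment per co-activation; the names (REPRESENTATIONS, perceive) are purely illustrative assumptions, not part of any established library.

    import numpy as np

    REPRESENTATIONS = ["N_present", "N_absent", "bliss", "pain", "face", "silence"]
    IDX = {name: i for i, name in enumerate(REPRESENTATIONS)}

    # The tabula rasa: all associative weights start at zero.
    sa_matrix = np.zeros((len(REPRESENTATIONS), len(REPRESENTATIONS)))

    def perceive(moment):
        """Hebbian update: every pair of representations active within one
        'gestalt moment' has its mutual associative weight raised by 1.
        The diagonal thereby counts how often each representation occurred."""
        for a in moment:
            for b in moment:
                sa_matrix[IDX[a], IDX[b]] += 1

    # Mum with the breast « came twice and left once »:
    perceive(["N_present", "bliss"])
    perceive(["N_absent", "pain"])
    perceive(["N_present", "bliss"])

    print(sa_matrix[IDX["N_present"], IDX["bliss"]])   # -> 2.0, as in Matrix 1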
Let us briefly present one more example of a primitive synchronous-associative matrix. Imagine that one fine day, let us say during the 23rd nursing, daddy plays Bach's counterpoint out loud in the room where the little one is being nursed. One of the 5 T – Silence – which we had fixed as a constituent component of the bliss circuit will thus not necessarily be activated. So that the matrix may continue to represent the surrounding world18 as faithfully as possible, the newly arisen state will have to manifest itself somewhere in the matrix: not only will the weight at position 1,1 be raised not by one but by a mere four fifths, it will also be necessary to divide Silence out of bliss, to refine the matrix. Silence thus acquires, as it were, its own row and its own column in the matrix – or occupies a row/column not yet taken, which in the end amounts to the same. Somewhere in the brain, in moments of Silence, a certain specific neural pathway begins to be activated.

    synchronous matrix   Bliss   Pain   Silence
    Ň present            22.8      0      22
    Ň absent                0     23       7
    Face                   12     30      11

Matrix 2: The state after 23 nursings, during one of which daddy allowed himself to listen to Bach. The value 7 at position 2,3 means that the child also experienced silence 7 times outside nursing, e.g. whenever it was not itself crying.

How this « dividing-out », this analytic decomposition of a whole into parts19, concretely happens, and how to represent it in our model, is a technical question to which we may devote ourselves in other, more specialized works (Hromada, 2012). Here we choose the simplest solution: « upon division, the new column inherits the properties of the column in which it was previously synthetically contained, and thereafter each continues on its own ».

18 When we speak of « mirroring the world », we consider it crucial to stress, even if only thus, below the line, that the brain is de facto a 3-dimensional structure in 3-dimensional space which carries within itself information about the transformations of a 3-dimensional world in time – i.e. 4-dimensional information. For the brain to carry such information, a mapping 4D -> 3D must take place somewhere, somehow. The matrix calculus into which we are trying to introduce the reader in this work strikes us as the most effective instrument for modelling this « reduction of dimensions with the least possible loss of essential information ».

19 « The world divides into facts » (Wittgenstein, 1917)

To please those who attach essential weight to what they call « the encounter with the face »20, we have also added the row face to matrix 2. A quicker reader may find the question pressing on their tongue: why was « face » added as a row, and « silence » as a column? Answer: for simplicity. Since we are dealing with a synchronous-associative matrix representing only how many times two neural representations were activated at once – or with a time gap so small that, in the domain of the little one's consciousness, they merged into one gestalt whole – it is pointless to ask what was cause and what effect, what came first and what second. Each entry represents a certain « pathway » in the body-brain-mind of a human being21. Every entry that is a row is also a column. The matrix is diagonally symmetric – its graph is not oriented. It is also worth noting that the position lying on the diagonal always carries the highest value of the given column-row – it gives the number of occurrences of the given entity in the little one's perceptual or mental field, in other words, its frequency of occurrence.22

              Ň-p.   Ň-n.   Bliss   Pain   face   silence
    Ň-p.        23    4.2     21      2     17      22
    Ň-n.       4.2     23      2     21     25       7
    Bliss       21      2     23      0     12      22
    Pain         2     21      0     23     30       1
    face        17     25     12     30     42      11
    silence     22      7     22      1     11      23
    sum       66.2   59.2     57     54     95      63

Matrix 3: The value 2 at positions (1,4) and (4,1), i.e. the association Ň-p.–pain, could have been caused e.g. by an illness which hurt the child even during nursing. The zero at position bliss–pain indicates that the two neural pathways are mutually logically exclusive – either one is activated, or the other. Conversely, the small non-zero value at position Ň-p.–Ň-n. suggests that it may have happened a few times that the child, say, saw the breast without touching it; the relation of those two pathways is thus more « fuzzy ». The value 21 of the association pain–Ň-n. against the value 2 of the association bliss–Ň-n. suggests that it has even happened a few times already that the little one felt no pain in the absence of the breast – he is growing up. And the association face–Ň-n.: those are all the uncles, grannies and neighbours perpetually making « kitchy-koo » at the little one.

20 For them, the face may, in the context of this article, become a sixth T whose presence the human being needs for life.

21 The entries in our matrices thus do not refer to the objects of the external world – which in semiotics we call « referents » – but to their inner mental representations, i.e. to what is called « signifié ». By the column or row labelled with the working label « face » we are therefore not trying to describe the properties of this or that object « face » in the material world – for no such thing exists anyway, just like the « inside of a brick » often mentioned by doc. Murgaš – but a certain set of « pathways » in the mind of this or that subject, activated by certain, mainly visual, percepts.

22 While constructing matrix 3 I noticed that if the matrix contains two mutually logically exclusive rows A and B, the value on the diagonal will converge to the sum of the values of A and B, which we may formalize as Xii = XAi + XBi. In the future – when, looking at neuro-maps, we shall ask what contents they represent – this might serve us as a primitive heuristic rule for discovering entities standing in logical (in the Aristotelian sense) relations.

The last row we have inserted into matrix 3 is the total sum of the values in the individual columns. It is the basic quantity from which we shall in a moment derive the quantity we shall call the magnitude (mohutnosť) of representation X. This sum – let us call it, say,
the « semaxonal sum » – tells us in fact nothing other than how many times the neural pathway X coding something – a sensorimotor schema, a certain auditory or visual percept, a memory, etc. – was activated while any other pathway Y was activated. To illustrate the whole matter we may help ourselves with an analogy from the world of internet pages, which, we hope, our reader knows at least passably. One of the keys to the success of the web is hypertext – the ability of pages to link to one another. Let us imagine that every entry in our matrix, every something X, is a hypertextual entity to which a certain number of links lead from other hypertextual entities Y, Z, etc. Then the sum of 95 for the entry « face » and of 66.2 for the entry « breast present » gives nothing other than the « number of links » leading to the hypertextual entity in question. Do such quantities tell us anything that might matter for understanding man and his relation to the breast? We claim they do, yet not much. For if only this primitive sum entered the game, it could easily happen that in the mind of a certain person two pathways X and Y would arise which linked to one another a million times, while almost no entity Z linked to this bipolar whole. We dare to claim that in such a case, despite the high value of that sum of associations, the pathways X and Y would have only a minuscule significance for the person's mind as a whole. For grasping what we mean by the already-mentioned quantity « magnitude of representation X », the answer to the question « how many hypertextual entities link to entity X? » is not enough; what is needed is first of all the answer to the question « what kind of entities link to entity X? », which we may reformulate as « what is the magnitude of the entities Y, Z, etc. which link to entity X? ». In other words, we shall try to express the magnitude of the signifié X as the normalized sum of the magnitudes of all entities which link to it23, that is:

    M_x = \frac{\sum_{i=0}^{N} M_i \, v_{ix}}{N}   (24)

where N is the total number of rows or columns and v_{ix} is the value, the weight/strength of the association, at position i,x.

23 At this point our approach is considerably inspired by that of Larry Page and Sergey Brin who, when confronted a few years ago at Stanford University with the question « how, from a matrix whose entry at position X,Y expresses the number of links leading from web page X to web page Y, does one obtain information about the popularity of page X? », answered with a similar formula. Thanks to that formula, to the use of matrix algebra and to a couple of excellent hacks, they did in the end truly arrive at that quantity « popularity of a page », and they named it PageRank. The quantity presented here as « magnitude of a representation » is, mutatis mutandis, analogous to PageRank. More in (Page & Brin).

24 It is almost certain that I have an error somewhere in that little formula, but for the first use of a program for typesetting mathematical formulae (OpenOffice Math) in my life it is not that bad, is it?
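Since the definition is recursive – the magnitudes M_i on the right-hand side are themselves unknown – the natural way to evaluate it is by iteration from some initial guess, which is also the spirit of the appendices mentioned below. The following is a minimal sketch of ours, not the actual algorithm of Appendix 2; the uniform starting vector, the renormalization step and the iteration count are all assumptions.

    import numpy as np

    def magnitudes(v, iterations=100):
        """Iterate M_x = (sum_i M_i * v_ix) / N over the weight matrix v,
        starting from uniform magnitudes. Renormalizing after each step
        keeps the values comparable; whether and where they converge
        depends on how v is normalized (cf. the fixed-point remark below)."""
        n = v.shape[0]
        m = np.ones(n)                 # initial guess: all representations equal
        for _ in range(iterations):
            m = (m @ v) / n            # one application of the formula above
            m = m / m.sum()            # renormalize
        return m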
And how does all this relate to the breast? Hypothetically thus: In the first chapter we claimed that under normal circumstances the breast is the first object – the referent of the world – whose imprint, in the form of a representation, a signifié X, the little one forms in his little head. Before that, only a few isolated circuits-modules-reflexes are present in the mind, each responsible for a different, mostly sensorimotor, schema. Suck. Cry. Sleep. Only a few modules, and billions of neurons merely awaiting their structuration. Each of these modules disposes of a certain magnitude. As a first primordial order begins to constitute itself in the little one's synaptic networks – and we claim that this primordial order is nothing other than one big breast properly filled with milk – this representation Ň naturally begins to hook itself onto the circuits already present, and their magnitude begins to raise the magnitude of the representation Ň. We say « raise », yet perhaps we shall do better – if we wish to be understood right now, as we prepare to put forward our boldest hypothesis – to use the verb « deepen ». This hypothesis flirts, as it were, with the question: what might an infant dream about? The answer, in the context of this article, is, we believe, sufficiently obvious. The question is: why? The answer is: because the soul25 of the infant will, with the highest probability, fall into the semantic attractor with the highest magnitude. Let us clarify what we mean by « falling into the semantic attractor with the highest magnitude »: imagine that the substrate of the mind is a certain elastic fabric. On this fabric lie bodies26, each with a certain mass. Imagine that every entity X coded by one row of our above matrices is such a body, the magnitude of X being the body's mass. In other words, the greater the magnitude of X, the more it « bends » the elastic fabric of the mind. And what has this to do with dreams? Well, the following: an individual soul can be pictured as a tiny little ball hurled with great force onto the surface of the elastic fabric of the mind. The more bent the region it is just passing through, the more probable it is that the little ball-soul will fall into a pit. And of course, a region is the more bent, the heavier the body in its vicinity. And what happens in the pit? Well, that is simple: one need only think away the great body which bent the fabric of the mind – and think only of the bending itself. In the pit the ball-soul naturally falls precisely into the point which bent the fabric of the mind. It acquires the topological coordinates of the entity we placed at position X. The entity X plops into the soul. The soul « beholds » the entity X. And the infant dreams of the breast, and the soul wanders on. That is the first, topological, way of looking at the whole matter. Let us imagine a second, probabilistic approach. Here we return once more to the trick Larry and Sergey performed when striving to reach their PageRank. They imagined a « randomly browsing internaut » who,
25 the word « soul » is used in this text as a poetic synonym for the dry scientific « consciousness »

26 « Lopta », O Czech brethren, is for us Slovaks nothing other than your « míč » – a ball

on his wandering through the web, clicks on the links he finds on the page before him in an entirely random fashion. In that case it holds that if 100 links lead from page X to other internet pages, 10 of them to page Y and 20 to page Z, the probability that the internaut gets from page X to Y will be 0.1, and to Z 0.2. Analogously, we may imagine a soul roaming the semantic labyrinth of the mind – for illustration, best a sleeping soul, upon whose trajectory the inputs from the surrounding environment have no major influence. Imagine that the soul of the dreaming little one has just fallen into the attractor « Ň-present ». It may subsequently take five new paths, each with a certain probability – the probabilities being computed from the values present in the synchronous-associative matrix 3 in such a way that their sum in every column equals 1. In other words, the soul always takes one of the little paths offered to it: with probability p1 it finds itself « at » entity X, with probability p2 « at » entity Y, and so on.

    K-D matrix   Ň-p.   Ň-n.   Bliss   Pain    face   silence
    Ň-p.            0   0.07   0.368  0.037   0.179    0.349
    Ň-n.         0.06      0   0.035  0.388   0.263    0.111
    Bliss       0.317   0.03       0      0   0.126    0.349
    Pain         0.03   0.35       0      0   0.316    0.016
    face        0.256   0.42    0.21  0.555       0    0.174
    silence     0.332   0.11   0.385  0.018   0.116        0
    (sum)           1      1       1      1       1        1

Matrix 4: The causal-diachronic matrix derived from the synchronous-associative matrix 3 by normalizing every value by the semaxonal sum of the column in which the value lies. If the hypothesis holds that the probability of activation of a neurolinguistic structure X by a neurolinguistic structure Y is proportional to the strength of their mutual associative connections, then the value at position X,Y gives the probability that neural pathway X will activate neural pathway Y.

Such a matrix, derived by a simple normalizing computation from the synchronous-associative matrix, we call a diachronic-causal matrix. This matrix no longer represents the strength of associations between two neurolinguistic structures; rather – provided the hypothesis holds that the probability of activation of neurolinguistic structure X by neurolinguistic structure Y is proportional to the strength of their mutual associative connections – it will give us the probability of the soul's « beholding » of meaning Y after it has beheld meaning X27. What pleasantly delighted us during the construction of this matrix (see matrix 4) was the – for the mathematically more adept28 surely trivial – finding that it is no longer a diagonally symmetric matrix but an asymmetric one: its graph must necessarily be oriented. This is a gratifying finding, because we feel it accords with the real state of things. Our introspections and meditations do indeed suggest to us that the probability of the mind making a leap from the image of a « face » to the image of « bliss » differs from the probability of the leap from « bliss » to « face ». In appendices 1 and 2 we show concretely how, from the values of our causal-diachronic matrix, we may at last converge29 to the values of the quantity which, on the preceding pages, we called the « semantic magnitude of representation X » within a given semantic network. The finding – made a few hours ago – that the mathematical world really does « behave » as it « behaves » probably surprised us no less than it did Sergey and Larry, for whom, we dare to claim, it was precisely this « convergence of the values » that was the clear sign of, and the moving impulse for, the founding of the Google company.
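Both the derivation of matrix 4 from matrix 3 and the « wandering soul » itself can be sketched in a few lines. The function names below are our own, and the walk is merely an illustration of the column-normalized transition probabilities described above.

    import numpy as np

    def causal_diachronic(sa):
        """Turn a synchronous-associative matrix into a causal-diachronic one:
        zero the diagonal and divide each value by the semaxonal sum of its
        column, so that every column sums to 1 (as in Matrix 4)."""
        kd = sa.astype(float).copy()
        np.fill_diagonal(kd, 0.0)
        return kd / kd.sum(axis=0)

    def wandering_soul(kd, start, steps, rng=None):
        """The dreaming soul's random walk: from representation Y it moves to
        representation X with probability kd[X, Y] (column Y holds the
        outgoing probabilities)."""
        rng = rng or np.random.default_rng()
        path, current = [start], start
        for _ in range(steps):
            current = int(rng.choice(kd.shape[0], p=kd[:, current]))
            path.append(current)
        return path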
27 or rather the probability of meaning Y « plopping » into the soul after meaning X has plopped into it? ;)

28 among whom we certainly do not count ourselves, since all that comes to our mind at the word matrix are quotes from the film of the same name, e.g. « Do not try to hit the ball. Hit the ball. »

29 Of course, only if our matrix has been correctly assembled and each of its columns sums to 1. For only in that case can the so-called « fixed point theorem » apply – a theorem which we do not understand at all, but to which we are deeply obliged for holding.

Since, however, we are already straying too far from the central attractor of our text – from those rosy buttons fit to be bitten – we now command a turn. Yet before we promise our dear, humanities-educated and mathematics-allergic reader that it will get no worse than it was on the preceding pages, let us still permit ourselves a small reflection on what we currently perceive as the central problem of the relation between synchronous-associative and causal-diachronic matrices. Namely: given that at the beginning of this part we claimed that:

First proto-postulate: if two representations are activated at once, or with a time gap so small that in the domain of consciousness they merge into one gestalt whole, their associative connections in the S-A matrix will be strengthened

and given that we simultaneously claimed that:

Second proto-postulate: the probability of activation of neurolinguistic structure X by neurolinguistic structure Y is proportional to the strength of their mutual associative connections

we have the impression that the conjunction of the two claims must necessarily lead to the following cumulative process:

1) Y activates X with probability p1
2) since X was activated right after Y, the strength of the associative connections between X and Y is reinforced (follows from the first proto-postulate)
3) since the strength of the associative connections between X and Y was reinforced, the probability p2 that X activates Y rises, as does the probability p3 that Y activates X (follows from the second proto-postulate)
4) since p2 is now greater than before, it is more probable than before that the just-activated X will « hand the baton back » to the just-previously activated Y, and we return to point 1, with the difference that the probability of Y activating X is no longer p1 but p3, of which we know (from point 3) that p3 > p1

Briefly and clearly: the neural pathways X and Y would begin to play ping-pong with one another, the strength of their associative connections would grow ad infinitum, and the probability of their mutual causal-diachronic activation would tend in the limit towards 1. Though there is something very attractive in the idea of two entities mutually activating one another, thereby etching a specific groove into the elastic fabric of the mind, it is almost30 certain that unless we integrate into our model some further « damping postulate »31, our model will never adequately explain the workings of the human mind.

30 we say « almost » because it is also possible that the error lies neither in the postulates nor in their insufficiency, but merely in our interpretation, in our ignorance of the principles of the probability calculus. In that case it may turn out that what seems to us an insurmountable problem is, on the contrary, the strongest point of our theory

31 perhaps something to do with the « decreasing attractiveness of ever-repeating things », with forgetting, with entropy, with the Skinnerian extinction curve, with time, and so on
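The runaway character of the conjunction of the two proto-postulates is easy to demonstrate by simulation. The sketch below is a toy of our own, with arbitrary initial weights: a walker reinforces every transition it makes, so the weights of the favoured transitions grow without bound – which is precisely why some damping term is needed.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4                                   # four pathways, an arbitrary choice
    w = np.ones((n, n)) - np.eye(n)         # unit weights, no self-activation
    current = 0

    for _ in range(10_000):
        p = w[:, current] / w[:, current].sum()   # second proto-postulate
        nxt = int(rng.choice(n, p=p))
        w[nxt, current] += 1                      # first proto-postulate:
        w[current, nxt] += 1                      # co-activation reinforces (S-A symmetry)
        current = nxt

    print(w)   # a handful of mutually reinforcing transitions hoard the weight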
And now we truly promise our reader that it will get no worse – yes, that reader on whose lips the question is taking ever more nervous shape: « And how, for God's sake, does all this relate to the breast? » And we answer: « Dear reader, roughly thus – if into our illustratively and randomly constructed matrices 2 and 3 we entered the representations of all the individual genetically and prenatally pre-coded modules – samskaras – present in the little one's cerebral structures; if into the millions of empty columns and rows we began to add such key entries as « food » or « light »; if we divided out what we coded with the single synthetic entry « bliss » – those 5 key T – and tied them, through the values of the associative weights, to the individual « sensorimotor schemata » also coded into the matrix; and if we then ran the whole matrix through the algorithm given in Appendix 2, we would suddenly see how very imprecise the results presented in App 1 are. Who knows, perhaps we would see, in accordance with hypothesis Ha1, that it is not the face but the breast that is the « representation with the highest magnitude » – the « body » that most bends the elastic fabric of the little one's mind. How immensely elastic a fabric it is! A time will come, and that time is near, when the first semantic-phonological association arises, a time when my little nephew grasps that the babble his ears hear is the babble his mouth has uttered – a time will come when the word acquires meaning for him. A time will come when his ever more structured mind discovers, in the continuous clusters of sounds and noises, a peculiar order – and the clusters of sounds and noises become sentences. Suddenly another new world will open before him – another dimension full of the unknown. Precisely at that time, those who until then had almost always been only a source of comfort and soothing will begin to mete out his first punishments. They will thus try to force him to « master » a new sensorimotor schema – to master not only the cry of his vocal cords but above all the sphincters of his anus, and to excrete his stool only under circumstances « in which it is proper ». And so around the excretory schema, opposed to the receptive schema of the sucking reflex, a new « attractor » will begin to take shape. It is onto this attractor that the representation of what we call « punishment »32 will probably hook itself, and later perhaps also its more abstract projection, « sin ». All of this will begin to happen at the moment when, step by step, thanks to an ever more perfect acquisition of language, the installation is launched of a symbol which occurs in the surrounding environment with an immensely high frequency – the little word « I »33. The little word « I » will in the end become an attractor of a magnitude so immense that it will collapse into itself like a black hole – absolutely all circuits will begin to hook themselves onto it. But from where will it draw its magnitude, its « pagerank », at the beginning? Onto which circuits will it first and most strongly attach itself – with which circuits will it resonate? With the breast of old, of which only an ever fainter memory remains – yet a memory still as mighty as that Eco-esque name of the rose (Eco, 1983) which we still carry in memory? Or with the ever mightier complex constituting itself around the representations « you must not », « potty », « you must »? For a concrete human young we are probably unable to answer. The whole matrix is so changeable, so immensely changeable – it is dynamic, it is folded into itself34, it is alive.

32 and perhaps, in some, also the memory of the pain caused by the barbaric ritual, hard for the little one to fit into his system of the world, called « circumcision »

33 that « I » need not, of course, be explicitly pronounced as the morpheme « ja » (« I »); for the deepening of the associative well it suffices that it be present in other forms too, e.g. the Slovak suffix « -M » of first-person verbs, as in « milujeM » (« I love »)
Within our model one cannot decide whether my nephew will become the type of man Freud would call « anal », or remain an « oral » type. For we do not see the world in black and white; in accord with the approach of the « fuzzy logicians » we see, between the two poles, a multitude of shades of grey. The magnitude of the « oral » or « anal » attractor will certainly tell us something about the overall topology of the mind35 – especially if it takes on pathological proportions – yet what matters for our model is the understanding that the mind is not some cold set of generativist instructions upon which the mighty light of the Cartesian « ego » shines. Yes, the « I » plays its role; like the core of the Milky Way it, so to speak, « spins up » the whole system; but life – real life, with scents, touch, compassion and love – takes place on the periphery, around smaller gravitational wells, around the local suns « he » and « she ». Truly, my nephew already carries a whole galaxy within him. A time will come – and that time seems so far away and yet is so near – when my nephew, on his pilgrimage through the cosmos, collides with another galaxy – with the Second one. Then the once so mighty representation of the ample and bounteous Thou, meanwhile so weakened by the scythe of Lady Entropy, will awaken again to Life. New sensorimotor schemata will again be activated – and the nephew will thrust from above, from the side, but, we trust, chiefly from below, yes, Oliver, chiefly from below! – imprinting will occur once more, and the fallen Goddess will again and again36 draw life-giving sap from the semantic networks ever more ordered by the ego. From those networks, ever colder, more ordered and more economically calculating, which may at this very moment be parasitically siphoning the life-giving sap from the magnitude of the life-giving Goddess. That, roughly, is how it relates to the breast. »

34 with the words « folded into itself » we try to lead the reader towards a picture fundamentally different from our schemata, in which every column and row of the matrix is labelled by some signifiant, some phonological label. We wish to draw attention to the fact that in an ideal representation of the mind the labelling labels themselves are « merely » entries of the matrix. The ideal mind-representing matrix thus has no « edge », and we naively visualize it as the surface of a topological torus.

35 one may equally ask: or will the overall topology of the mind certainly tell us something about the magnitude of the individual attractors?

36 and again!

Second garden, third construct: Neurosociology

FAUST: I had a dream, and not about just anyone! I saw a wondrously beautiful tree; on it a pair of apples, loveliness itself; I shall lie in wait for them.
THE BEAUTY: Little apples have tasted doubly sweet to gentlemen ever since Eve's days; and it makes me oh so sweetly glad that a pair of them grows in my little garden.

This dialogue is aptly commented on by Freud: « There is not the slightest doubt as to what is meant by that apple tree and those apples. » Indeed, in popular London slang, for instance, « a nice apple-dumpling shop » means a handsome, rounded bosom. Some readers will, over these lines, recall the Garden of Paradise, and quite rightly so.
Scientists have long been puzzled and provoked by the fact that both Eve and, in Greek mythology, the goddess Eris had something to do precisely with an apple, and that in both cases the apple ultimately had a fine mess on its conscience. In the Hebrew story Eve eats the apple (in fact the book of Genesis speaks of a fruit, yet tradition has always identified this fruit as an apple), and Jehovah, the local deity, seething with rage, curses her, together with the whole human race, for reasons rather remote from logic. In Greek mythology, by contrast, Zeus offends Eris by not inviting her to the wedding festivities on Olympus, whereupon she takes her revenge by tossing among the feasting goddesses a golden apple inscribed KALLISTI (« To the fairest »). The goddesses naturally begin to quarrel over it, each claiming it for herself, since each deems herself the fairest; the dispute escalates until both the other gods and men are drawn into it, and the result of it all is the Trojan War. Eris thus enters common awareness as the goddess of discord, and the golden apple remains forever the apple of discord. — Wilson, Ishtar Rising, p. 108

Truly, the apple rises before us in many myths of the world. Looking to the south we see Hercules just setting out on his journey to the garden of the Hesperides – those three nymphs of the evening who guard Hera's orchard, the orchard in whose midst grows the apple tree that sprouted from Gaia's gift – branches laden with fruit – for her wedding with Zeus. And it is from this very tree, which tradition also calls the Tree of Life, that the hero is to steal the apples – lo, one of the twelve labours he undertakes to accomplish. There are tongues which claim that some of these stolen « apples of bliss » finally reach the fair maiden Atalanta of Arcadia, thrown beneath her bare feet by the suitor Melanion during the race run either for her hand or for his life. Other tongues claim that the apples were given to the youth by Aphrodite herself, after his sincere prayer to Her. Be it one way or the other – or perhaps both: if Melanion's apples were brought by Hercules, he was thereby fulfilling Aphrodite's secret command – one thing is certain: the apples enthralled Atalanta, she stopped, lost the race, and Melanion led her off to a bed where he surely made her fairer still. Looking to the east we see not only Eve's Eden but also the still-smouldering ruins of the temple of the wisest of Jerusalem's kings, the king who, through the bridegroom's lips, spoke:

Let Thy breasts be to me as clusters of the vine, and the scent of Thy breath like apples

yes, the king who, through the lips of the betrothed, spoke in the Song of Songs:

I would make Thee drink of spiced wine and of the must of my pomegranates

yes, spoke of that betrothed who pleads thus:

Strengthen me with raisin cakes, refresh me with apples, for I am sick with Love

Looking to the north, to the ice of the Nordic Edda, we see the half-elf, half-goddess Idunn37 guarding the apples that bring and guarantee the gods eternal youth, and hence eternal life. After this fair-haired, eternally young one is abducted by a malevolent giant, all the gods – all the Aesir – begin to age. They persuade and finally dispatch the trickster Loki; he rescues Idunn, and with the return of her little apples the deities recover their lost youth. Among the foremost of the Aesir stands Odin with his beloved Frigg. She and no other is, according to the Edda, « the foremost of all goddesses ». Her name is most often interpreted as « the beloved » (look into, and search the space somewhere between, the Sanskrit priya – « beloved woman, wife » – and the Icelandic frjá – « to love »); we sense that she has a finger in everything connected with fertility, in everything connected with the feasts of the marital bond.
And so we are not surprised that even today, looking northwards, we see her sending an apple to King Rerir, the king who so humbly begs Odin for an heir. The king's consort bites into the apple, whereupon follows a pregnancy lasting six years, crowned by the birth of the hero Volsung. The Volsunga saga can begin. Looking to the west we see not only Avalon – the « isle of apples » where Excalibur was forged and where King Arthur shall perhaps at last be healed of his wounds. Looking to the Grimm-ishly38 Germanic west we also see the envious queen sending the poisoned apple to the snow-cheeked maiden with raven hair... Again and again, then, we see the semantic attractor we designate with the term « apple » arise in the vicinity39 of meanings such as PAC40 = {youth, life, hero, fertility, woman, Goddess}.

37 from Yggdrasil's ash descended; of elven kin, Iðunn was her name (shloka 6-7, Hrafnagaldr Óðins)

38 The absence of spaces within certain words accords with the author's « sanskritizing » intent

39 Let us imagine a mind M which has coded within it n meanings satisfying the following criteria: 1. every meaning is identical with itself and distinct from all the others; 2. every meaning stands in a relation of a certain « weight » to all the other meanings in the Mind (put platonically (Vopěnka), every idea participates, to a certain degree, with a certain « strength », in every other idea, and vice versa). A meaning-idea so conceived can be represented as a point in an n-dimensional Hilbert space whose coordinates are given by the normalized association weights (cf. part 2.2 – every row of matrix 4 can be understood as a vector giving the coordinates of the given meaning in semantic space: the value in the first column gives the distance from the origin in the first dimension, the value in the second column in the second dimension, etc.) to the first, second, up to the n-th meaning of the Mind. In other words – for every content of the mind, a new dimension. The X-th meaning has, in its X-th dimension, a coordinate of value 1, whereby its uniqueness, required by the first criterion, is secured. At the same time, the coordinates of this point-meaning, its position, also contain the information about its relation to all the other contents of the Mind. An answer to the question « What is the meaning of a word and how is it to be quantified? » formalized in this way seems attractive to us not only because it is in essence close to the already existing method of Latent Semantic Analysis (http://en.wikipedia.org/wiki/Latent_semantic_analysis), but above all because it will allow us, relatively simply – using the mere Pythagorean theorem or simple trigonometry – to compute the distances, or the sizes of the angles, between two or more meanings. When we therefore say that « princess » is closer to « apple » than « compass » is, we are speaking of – at least theoretically – measurable quantities.

40 PAC = Primary Associative Complex, or Primary Associative Cluster; SAC = Secondary Associative Complex, or Secondary Associative Cluster
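The construction of footnote 39 can be made concrete in a few lines: take the rows of a matrix-4-like array as coordinates, pin each meaning's own dimension to 1, and measure with the Pythagorean theorem. A minimal sketch, with names of our own choosing:

    import numpy as np

    def meaning_points(kd):
        """Footnote-39 construction: each row of the causal-diachronic matrix
        gives the coordinates of one meaning in n-dimensional space; the
        meaning's own dimension is pinned to 1 to guarantee its uniqueness."""
        pts = kd.astype(float).copy()
        np.fill_diagonal(pts, 1.0)
        return pts

    def semantic_distance(pts, x, y):
        """Plain Euclidean ('Pythagorean') distance between meanings x and y."""
        return float(np.linalg.norm(pts[x] - pts[y]))

A claim like « 'princess' is closer to 'apple' than 'compass' is » then becomes the testable inequality semantic_distance(pts, princess, apple) < semantic_distance(pts, compass, apple).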
One may of course object that the examples of Eve, Eris or Snow White also suggest a connection with meanings such as SAC = {snake, death, poison, sin}; we shall, however, maintain that the emergence of these entities is a secondary phenomenon, caused by the ever-strengthening dynamics of the developing myth, and that the primary complex as such is all too s?proste41 and simply beautiful. In other words – we start from the conviction that the primordial image of the world is good and beautiful, and that any presence of evil within it is caused not by the intervention of a Manichean devil existing from eternity to eternity, but rather – like the potion of a wicked witch – by the sign of negation slipping gradually into the fairy tale. Without its presence the story42 would have to remain without a point – nothing in the image of the world, or in the world itself, could live without its smile. Do we know anything about the apple – or rather about the signifié of the signifiant « apple » – that could illuminate its occurrence in the bosom of the PAC? Put more humanly: is the frequently observed occurrence of the apple at the side of youth, strength or fertility merely an illusion, a scientific maya which our probing gaze uncovers wherever it can only because it wishes to arrive at such an uncovering, or is it an objective phenomenon? We claim that it is an objective phenomenon, and here we lay out several, we believe sufficiently rational, arguments with which we would like to support this claim: 1. We know that the apple is the prototype of the category fruit. For readers unversed in semantics we hereby provide this lay definition of a prototype: « The prototype of a category X is that member Y which comes to the examined person's or group's mind most quickly and most often as an answer to the prompt 'Present us one concrete representative of category X' ». Even though it is publicly known, e.g. from (Lakoff, 1987), that the « apple » is, at least in the Indo-Semito-European (ISE) circle, the prototypical representative of the category « fruit », we allowed ourselves to verify this finding within our own semantico-sociological survey (see also the appendix « A few words on questionnaire D2 »). The data speak plainly: to the question « Which concept is, in your view, the best representative of the category 'fruit'? », 227 out of 358 respondents, i.e. 63.4%, wrote, of their own accord, the answer apple or apples. For interest's sake we add that to the question « Which concept is, in your view, the best representative of the category 'flowers'? » we received the answer « rose » in only 54.8% of cases, and that, moreover, under conditions in which the rose was explicitly offered as one of the options, whereas in the case of the question about fruit the respondent had to fill in the field himself. We see, then, that for the sample of examined persons the apple is a stronger prototype of the category fruit than the rose is of the category flowers43. We therefore dare to claim that the « riddle » wrapped in the question « Why does the artistic Tradition most often depict the fruit of Genesis as an apple? » is answered precisely by the fact that the apple is the prototype of fruit.

41 At this moment we begin to implement so-called regular expressions into our works. Regular expressions are used in programming languages whenever we wish to describe not one character string but a certain specific set of character strings. While in computer programming regular expressions are used in a passive sense, as a tool – the most effective tool, some even speak of a « magic wand » – for pattern recognition, we turn their use in this text inside out and employ them actively – for the purpose of activating specific patterns in the reader's mind. One may say that regular expressions are forms of character strings. In the little Slovak word s?proste above we used the metacharacter ?, which signifies « the preceding character (i.e. the character s) may occur zero times or once ». The regular expression s?proste thus shelters within itself two words – « sproste » (crudely) and « proste » (simply). By using the regular expression s?proste we therefore activate, in the mind of a reader acquainted with the function of the metacharacter ?, two pathways, two meanings, at once, without having to resort to the lengthy conjunction « crudely and simply »... Should the honoured reader stumble upon a question mark in the middle of a word in the lines to come, we trust its function will now be understood – it is particularly useful for « erasing » gender differences in the case of entities for which one cannot speak of gender, e.g. « he?she loved » when speaking of a god, etc.
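For readers who wish to see the metacharacter at work, the following few lines of Python (our illustration; any regex-capable language would do) show that the pattern s?proste indeed shelters exactly the two words mentioned in the footnote above.

    import re

    pattern = re.compile(r"^s?proste$")   # '?': zero or one occurrence of 's'
    for word in ["proste", "sproste", "ssproste"]:
        print(word, bool(pattern.match(word)))
    # proste -> True, sproste -> True, ssproste -> False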
Any implementation of a further new metacharacter will, in this as in subsequent works, be explained in an attached footnote.

42 We hold the position that for consciousness there is no difference between the world and its image – except for the possible additional knowledge that the image is only an image. Said in the words of postmodernity – there is no difference between the simulation and that which is simulated.

43 Said in the words of our little theory – the weight of the association between apple and fruit is greater than the weight of the association between rose and flower.

For with the exception of the word and of music, it holds in all known artistic modalities that a general category can be depicted only and only through its concrete prototype. The flower through the rose, woman through Venus, man through David, love through Rodin's kiss, and fruit through the apple. That is why. The relation to fruit has surely already lifted a corner of the veil over the mysterious relation between the apple and the other components of the PAC. For the fruit-apple is full of vitamins, i.e. substances the body needs yet cannot manufacture itself. And therefore: 2. The Apple-Fruit is healthy. We know that what is healthy is alive. What is healthy and alive is beautiful. Or strong. « Beautiful is the Goddess, strong is the hero » – so we might claim, and thus, leap by leap, arrive at the uncovering of an essential relation between any meanings X and Y. We resort to such rhetorical tricks of lawyers and theologians, however, only to point out their absurdity. As we said at the beginning of part 2.2, a meaning not embedded in a network of meanings44, or an idea in which other ideas do not participate, is not a meaning, is not an idea. In other words, it is certain that by a similar violation of the copula « is », by similar hopping, a more skilful orator will sooner or later always get from the desired A to the desired Z. Perhaps his rhetorical capacities would multiply if, for the production of metaphors, he also used computing machinery to search for « paths in a closed graph », e.g. by means of Dijkstra's algorithm. We would rather leave such methods to the scholastics and attempt to defend our hypothesis H2 – « Between the idea of the apple and the idea of the breast there exists a relatively45 strong association, or even something like an attraction » – by the method of the geometrization of semantic space. Yet in order to model such a semantic space properly, and then to measure distances within it, we would need so immense a quantity of empirical data that this approach exists for the moment – and perhaps forever – only in the form of a rather feebly described « Gedankenexperiment ». How it might be possible, in further works, to leap from the Gedankenexperiment to real applications we shall try to indicate chiefly in the last parts of this work. To be able to do so, however, we must keep trying, as much as we can, to return from graph theory and Hilbert spaces back to the Earth and her apple. Let us therefore continue examining its « accidents »: 3. Of the apple we already know that it is healthy and good. We also know that it mostly fits snugly into the palm, is smooth, bites well46, is firm and is round. A person better versed in modern semantic theories would say that the signifié « apple » can be decomposed into those elementary components of semantic analysis – called « semes » – such as « roundness », « goodness », « density », « fit for biting », « palm-sized ». The problem with these semantic theories, however, is that they are mostly strictly binary – either a seme participates in constituting a given signifié or it does not – and if it does participate, it is as essential for the given signifié as all the other semes contained in it. On this point our approach differs fundamentally from the classical semantic theories.
44 A semantic network can thus be understood as a connected graph, « i.e. a graph for which it holds that for every two vertices X, Y there exists at least one path from X to Y » (http://cs.wikipedia.org/wiki/Souvisl%C3%BD_graf). Other entities which may be internally represented in a similar way, i.e. as a connected graph, are e.g. a neural network or human society. For just as it holds for the meaning of a word that it necessarily stands in relation to other meanings, so it holds for human society that a person without society is not fully a person, and a neuron with no bond to other neurons is not really a neuron. For how could it be called a neuron when it has neither axons nor dendrites?

45 relatively, i.e. in comparison with others

46 An exception being the cylindrical protrusion, somewhat reminiscent of a lignified little hair, which is also called the « stalk »

For we are convinced that in the constitution of each of the meanings contained in the mind, myriads of other meanings participate through their semes, each with a certain weight. This weight of the connection, this strength of the seme shared between X and Y, can be construed: • as the share of the charge47 which « flows off » from X to Y (if, for example, the weight of the connection between apple and breast is set at 0.023, then activating « apple » in the subject's mind with a charge of 2 will result in an activation of « breast » with a charge of 0.046); • as the probability that after the symbol X the symbol Y will arise in the mind (or even on the tongue, or on the fingers pounding the keyboard). Even though our efforts currently aim at overcoming classical semantics and at a mathematization and formalization of the science of meaning – a step so elusive that the venerable founders of the ancient sciences surely never even dreamt of it – quantitative semantics must nevertheless thank its ancestors for having, with their insight that the essence of metaphor is the sharing of semes, returned the theme of metaphor from the periphery of academic interest into the focus of the humanities and cognitive sciences – and, who knows, perhaps one day of the « hard » natural sciences as well. « I believe, however, that the myth cannot be explained only at the linguistic level, because the principle of the metaphor is deeply rooted in human behaviour in general, and especially in human thought as an expression of its natural tendency to abstraction » (Oberfalzerová, 2006). With this claim we agree, and we do so even in a state of (we hope)?48 temporary ignorance of the work Metaphors We Live By from the pen of one of the most respected cognitive scientists of the present, G. Lakoff. With this claim, too, we agree, and we go still further, claiming that metaphor and metonymy are not only the key to understanding the human mind, but that the human being itself is an eminently metaphorical being, a being which « dwells poetically »49. Voilà the reasons which lead us to this conviction: 4. The breast, too, mostly fits snugly into the palm, is smooth50, bites well, and it does no harm when it is firm and round. Both the suckling and the lover would probably also agree with the claim that the breast is healthy and good. Said in the language of classical semantics, the female breast and the pomme share more than one seme. As we have said, the essence of metaphor is the sharing of semes – the more shared semes, the higher the probability that the metaphor will succeed. The sharing of semes such as « to nibble », « spherical », « palm » thus suggests that a spontaneous leap of the meditating mind from the breast to the apple, or back, need not be an unreal possibility. Said in the language of our nascent theory, the distance between « breast » and « apple » in the Hilbertian semantic space is smaller than, say, the distance of « breast » from « compass » or of « apple » from « ruler ».

47 we prefer here the neologism « charge » (príraz) rather than « energy » or « force », for we do not want a needless and undesirable interference of our concepts with the exactly defined concepts of physics. Our theory is still searching for its concepts.

48 In this regular expression the already-mentioned metacharacter of the question mark follows a group enclosed in parentheses. Parentheses also belong among the metacharacters – their function is to mark everything between them as one group entity. With the expression (we hope)? we thus in fact say that the author's intention remains fulfilled both when the « we hope » occurs in the text and when we leave it out. This is because the closing question mark permits both 0 and 1 occurrences of what precedes it, which in this case is the whole group of characters « we hope ».

49 « Doch dichterisch wohnet der Mensch auf dieser Erde » – « yet poetically dwells man upon this earth » (Heidegger, 2006)

50 Exceptions being the fine little hairs and the cylindrical protrusion, with its surroundings, called the « nipple »
The reason being that the point by which we represent « apple » has, on the axis (in the dimension) with which we code the seme « roundness », approximately the same coordinate value as the point by which we represent the « little breast »51. And what is more – in the dimension representing semes such as « little hair » or « to bite », « apple » will be closer to « the charms of the beloved » than, say, to « a brick ». We have already tried to indicate that the distance of meanings in a validly constructed Hilbert space should be proportional to the introspectively perceived distance between the given meanings (the greater the distance between the meanings, the more differently they are « lived »; the smaller, the more similarly). And what else than the metric of the similarity-difference of meanings should be a reliable indicator of the possibility of a trope? It is beyond doubt that if the claim « Man is a metaphorical being » holds, then the hypothesis H2, which we state here for the moment in the form « The breast and the apple stand in a mutual metaphorical relation », should be verifiable by empirical research on human beings. When we use the words « empirical research », we do not mean qualitative or phenomenological research; no, we would truly not dare to call scientific research a stroll during which, on the basis of « weight, density, sweetness, suppleness and further properties », a poet sorts the charms of the world's damsels among the varieties « Granny Smith », « Golden », « Karmína » or « Yonigold ». Assenting to the claim that « science strives to replace qualities... with measurable quantities » (Sokol, 2007), keeping to the direction marked out by the words « in social-science research the members of the sample are usually not features but persons » (Skripnik/Lindová, 2007), and aspiring to « scientize » the mystery-shrouded quality called « the meaning of a word », we chose as our path the most traditional of quantitative researches – questionnaire research. Voilà what we arrived at: After all seven respondents of our questionnaire D1 – whose question read: With what kind of fruit do you most strongly associate the concept « female breasts »? – chose, as one of their two choices (from the offered selection « apple », « grapes », « melon », « peach », « orange »), the answer « apple » (peaches, oranges and melons contended for second place), we realized that some hidden variable would probably be responsible for such startling results.

51 On the example of the word « ňadierko » (little breast) one can illustrate a further property of the space we are trying to describe here. We wished to describe metaphor, i.e. an eminently semantic phenomenon, and so we spoke only of the semantic aspects of the word – we constructed the space in the manner: one seme, one dimension. Since, however, the word is an entity with three faces – semantic, phonological and grammatical – our system for the quantification of the realm of words will not be complete until we manage to integrate into it, in the form of certain axes (dimensions), the syntactic and phonological properties as well. Should that succeed, the phonological similarity between « ňadierko » and « jadierko » (little pip) might become a further – very feeble, since valid only within the circle of those who know Slovak – argument for hypothesis H2.
We subsequently identified this hidden variable as the already-mentioned prototypical relation between « apple » and « fruit ». In other words, the central question of our questionnaire D1 could be reformulated as: With what representative of category X do you most strongly associate the concept A?, where X is fruit and A are the summits of women's chests. Searching for the possible flaw in this question, we then realized that – since the apple is the prototype of the category fruit – the answer « apple » would most probably be given whatever the concept A might be; the question might just as well read « With what kind of fruit do you most strongly associate the concept building? », and the answer would most probably still be « apple ». In other words – by the very utterance of the word « fruit » in the first part of the question, the symbol « apple » is activated in the respondent's mind owing to the prototypical relation between fruit and « apple », and unless the second part of the sentence somehow « overrides » or « redirects » this flow towards the apple – as, e.g., in the question « With what representative of the category fruit do you most strongly associate the concept slivovitz? » – the final answer will be chiefly a consequence of the prototypical bond between the category X and its member Y, and not a consequence of the bond we wished to « uncover », i.e. the bond between the category X and the concept A not belonging to it. We therefore decided to adjust our questionnaire. The result was the paper questionnaire D2 and the internet questionnaire D3, in which the faultily constructed question – in fact a kind of hidden semantic conjunction – from questionnaire D1 was split into two parts placed in mutually separated sections of the questionnaire, whereby we wished to prevent possible unwanted interferences. Given the fact that we obtained the most respondents thanks to questionnaire D3, we shall concentrate in the following paragraphs on this questionnaire alone. Those interested in further information are hereby referred to the appendix « A few words on questionnaires D2 and D3 » of this work. The question « Which concept is, in your view, the best representative of the category 'fruit'? », labelled 2.3 in the questionnaire, we have already dealt with. Although the already-mentioned 63.4% in questionnaire D3 pleasantly surprised us, it was no new finding, only a further confirmation of a hypothesis already affirmed many times by scientists. It was in any case only a marginal question, a kind of superstructure over the true core of our research. That core was question 1.3: With which of the listed members of the semantic class « food » do you associate the concept « breasts »? Five possible answers were given: meat, fruit, milk, bread, vegetables. Since D3 was an internet questionnaire employing the sympathetic open-source application PHPSurvey, we made full use, when assembling it, of the possibilities this application offers. The key decision, in the end, was to ask the respondent not for one, two or three « equally strong » answers, but on the contrary to require a « scoring » of the strength of the relation between all 5 members of the category X and the concept A.
We decided for the approach in which we ascertained, for every respondent, not merely the strongest of the links between a member of the category «food» and the concept «breasts», but on the contrary measured all 5 links – the strength/weight of a link could be specified by an integer from 1 to 5, and two or more links could have equal strength/weight – because it is far more consistent with our «fuzzy approach», for which it holds that «everything is related to everything: be the relation ever so faint and fleeting, related it is». It is precisely by this «fuzzy» aspect, in which the decisive role is played by the quantity called the «weight of a semantic link», that our method differs from the classical Jungian method of free associations, in which a variation on the Aristotelian law of the excluded middle holds, characterisable by the words «EITHER this symbol is activated-articulated, OR that symbol is activated-articulated». This leap from black-and-white to a multitude of intermediate levels of grey was caused, among other things, by the fact that whereas Jung analysed the minds of individuals without having access to their neurons, we analyse the «mind» of human groups while having direct access to their elementary components – human beings. Now to the results. The strongest link proved to be that between breasts and milk: its total weight, after the answers of 358 respondents, averaged out at 4.2. This is not too surprising; of the milk-giving functions of the chest of the human female, which we thematised more closely in the first chapter of this work, probably only a few chronic puritans remain unaware today, and such apparently did not occur in our sample. That the weakest links proved to be those to vegetables (weight 1.7) and to bread (1.9) is likewise not too surprising. Someone may ask what could possibly cause the fact that the weight of the association leading from bread is 0.2 higher than that leading from vegetables. We claim that an answer of the type «only Woman can satiate more than bread», despite its truthfulness, is gravely insufficient; as we suggested a few paragraphs ago, with similar turns of phrase absolutely anything can be defended. Let us therefore, within this text, stay with the mistaken conviction that the difference of 0.2 is just a random fluctuation which, upon enlargement of the sample, the law of large numbers would probably level down to a minimum. Only further research could persuade us that this is not so – but why threaten research which almost certainly nobody will ever carry out...⁵²

52 And yet, after all: the application «R for statistical computing» eventually allowed us to make the following claim: a Student's paired t-test performed over the set of obtained data indicated a statistically significant difference (t = 3.2664, df = 357, p = 0.001195) of magnitude 0.2122905 between @Bread, Female breasts@ and @Vegetables, Female breasts@.

The most interesting results, however, arise before us in the «middle of the field». We see that meat is associated with the breast with weight 3.2, while the weight from fruit to the bosom is higher still by 0.1, i.e. @Fruit, Female breasts@⁵³ = 3.3. To whoever would object that this 0.1 is likewise a mere random fluctuation, and that with larger samples it would transpire that @Fruit, Female breasts@ = @Meat, Female breasts@, we can offer an interesting counter-argument: in a subset of our sample, among the 110 respondents who in question 9 declared themselves a woman or a girl, the lead of Fruit over Meat is noticeably larger, since @Fruit, Female breasts@ = 3.2 while @Meat, Female breasts@ = 2.8.
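Since R commands will accompany the later analyses of this work, a minimal sketch of the kind of test reported in the footnote above may already be of use here. It assumes – and this is only an assumption made for illustration – that the D3 answers sit in a data frame d3 with one row per respondent and one 1-to-5 column per member of the category «food»; the file name and column names are hypothetical:

# hypothetical input: 358 rows, columns meat, fruit, milk, bread, vegetables
d3 <- read.csv("d3.csv")
# two-sided paired test of the bread/vegetables difference reported above
t.test(d3$bread, d3$vegetables, paired = TRUE)
# one-sided paired test matching the direction @Fruit,Breasts@ > @Meat,Breasts@
t.test(d3$meat, d3$fruit, paired = TRUE, alternative = "less")

The sign of the resulting t statistic depends on the order of the arguments; with the order above, a negative t together with a small p speaks in favour of the fruit link being the stronger one.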
Illustration 1: Histograms constructed over the quantities of association weights @Fruit, Female breasts@ (histogram O) and @Meat, Female breasts@ (histogram M) furnished by those respondents of the internet questionnaire D3 who classed themselves among the "men" or "boys".

And verily, carrying out a series of one-sided paired Student's t-tests leads us to the following claims: while the difference between the mean weight of the association @Fruit, Female breasts@ and of the association @Meat, Female breasts@ is not statistically significant among respondents who answered the gender question by saying they are men (p = 0.5203), boys (p = 0.1423), girls (p = 0.2154) or angels (p = 0.5892), the difference is statistically significant in the case of those who answered that they are women or girls (p = 0.02607). As concerns the total set of respondents: proceeding from the conviction that our respondents were ideal representatives of the bearers of the ISE culture of the beginning of the 21st century, the execution of a one-sided paired Student's t-test indicated that we need not abandon our hypothesis «For the minds of the bearers of the ISE culture it holds that @Fruit, Female breasts@ > @Meat, Female breasts@», since the obtained results are statistically significant (t = -1.6829, df = 357, p = 0.04663). Without leaning on the crutch of the claim «Man is a metaphoric being», we truly cannot explain how it is possible that meat – in essence the very matter of which a breast is built – is not, as to the strength of its bond to the bosom, far ahead of fruit, which at first sight is related to the bosom only distantly. To show that this seeming non-existence of an association between apple and bosom is mere illusion was the endeavour of this chapter: we wanted both to show empirically and to suggest theoretically that «fruit» is either closer to the «bosom» than «meat», or equally distant. And this for the reason that even though in many dimensions – e.g. those by which we encode the semes «animal» or «blood» – the coordinate value of the breast is decidedly closer to meat than to fruit (the first two have relatively high values while the last a relatively low one), in other dimensions – e.g. those by which we encode the already mentioned «roundness», «palm», «fine hair» or «kiss» – A(pple) and B(osom) are closer to one another.

53 For ease and brevity we shall, in further parts as well, use the following notation for the strength of the associative link between two concepts: @From, To@.

Summa summarum: if we interpreted Hume's words, quoted in our-favourite-philosophy-textbook (Benyovszky, 2007), «I see only three principles according to which ideas are associated: resemblance, contiguity of place or time, cause and effect», in the sense that meat, as the matter without which there would be no bosom, is in fact its cause, whereas the apple stands to the bosom in a relation of resemblance, then the results of our research would suggest that causality either plays a duet with metaphor, or only second fiddle⁵⁴. Either way, we believe that thanks to that triad – thanks to meat, breast and fruit – we have managed to persuade at least a few readers that man is a being in whose life metaphor plays a central role; and, paraphrasing Hume, we claim: «We see only one principle according to which ideas are associated: resemblance – the sharing of a seme⁵⁵. And it matters precious little whether the shared seme encodes joint occurrence in space or successive occurrence in time.» Something of the sort, however, we have known since Aristotle, and perhaps since older ones still, and our intention here is decidedly not to reinterpret the already interpreted.
On the contrary, our goal is to develop our knowledge of metaphor and of word meaning into such a shape that even a Turing machine will be able to permute the elementary components of the realm of meaning, thereby creating metaphors and thus subsequently manifesting itself as a being endowed with spirit⁵⁶. We claim that Deus will remain ex machina at least until those who wish to breathe a soul into the machine understand that metaphor has been, since time immemorial, an effective poetic method of Nature herself: that it is precisely metaphor which, according to Morris's bizarre hypotheses (see section 2.1), transferred not only the round halves of the buttocks onto Woman's chest but also the lips of the vulva onto the lips of the mouth, thereby confirming the lover's gaze all the more in its Love for the Face of the beloved; that it is precisely metaphor which makes the darting gaze of the little one, searching for that source breast-areola-nipple, suddenly stay fixed a while longer upon that source of light which does not go out, the Source eyeball-iris-pupil; that it is precisely metaphor which makes the beloved – in tender childhood likewise Pavlovianly conditioned by the desire for the apples of the chest – passionately sink her fingers, in moments of rapture, into that which is closest... semantically closest... to what she once, as a babe, so loved: the hills of his arms.

Figure 2: In the right part one may behold the curve interpolated over histogram M; in the left part one may delight in the curve interpolated over the Y-axis-flipped histogram O.

54 From the fact that «the esteemed reader understands without difficulty precisely such expressions as 'second fiddle' while having trouble understanding this very footnote» it follows that «causal thinking is perhaps nothing but a certain special case of metaphoric thinking, much as the classical physical macroworld is but a certain special case of the quantum world» – something of that sort we dare to claim truly only here, below the line.
55 Thanks to a joke of doc. Pinc I realised that the sharing of a seme plays a key role not only in metaphors but, for example, also in one whole – and rather large – category of jokes. The audience is asked: «What is the difference between a scientist and a tapeworm?» After a moment of awkward silence comes the answer «None», followed by the clarification «Like the tapeworm, the scientist too spends most of his time in the region of the lower outlet of the digestive tract; and like the tapeworm, the scientist too once in a while releases into the world some article or other»... Upon subsequent analysis we see that the comic effect is caused by nothing other than 1) directing the listener's attention to the semes which both concepts share, i.e. the semes «being deep in the sh...» and «releasing an article», and 2) diverting the listener's attention from the fact that both concepts also shelter myriads of other semes which they do not mutually share – this diversion being accomplished by means of the gently deceptive answer «none». Voilà one of the principles by which even a machine can make jokes.
56 But the greatest thing by far is to have a command of metaphor. This alone cannot be imparted by another; it is the mark of genius, for to make good metaphors implies an eye for resemblances. (Aristotle, Poetics 59a)

Third garden: Man, his blood and his cross

Albrecht Dürer – Adam and Eve – Florence

Take the full breast of your sister Isis, bring it unto your mouth! "Mother of N.," so said I, "give thy breast to N., that N. may suck therewith." "(My) son N.," so said she, "take to thee my breast, that thou mayest suck it," said she, "that thou mayest live again," so said she, "that thou mayest be (again) small," so said she. (Pyramid Texts, utterances 42 and 470)

Third garden, construct one: H(isto|yste)ria

Sexual attitudes, like all other attitudes, draw upon unspoken and often unconscious premises.
Creative thinking, always distinct and clearly intelligible, is the result of frustration: a person perceives a problem that needs solving, and in solving it creates further thoughts. Yet the greater part of human «thinking» does not consist of these purposeful, distinct and creative thoughts; most of what we take for our mental activity consists of unintelligible, half-conscious semantic reflexes – reactions to key words which particular situations call up in our mind. For example our mental reaction to sex – our so-called «philosophy» of sex – is in most cases a system of neuropsychological reactions to a few very simple «poetic metaphors». The concrete metaphor which has had the greatest influence on Western civilisation, and which is the substance of the traditional Judaeo-Christian dogma, is the belief that sex is «obscene». Sexual intercourse is something dirty; the sexual functions are something as repulsive, embarrassing and «unseemly» as the excretion of faeces, and so on. We call these simple poetic metaphors because we can analyse them in the same way literary critics analyse verse. A metaphor is an identification of two different factors. A simile, for example, says: «A ship is like a plough.» A metaphor suggests the identification less distinctly, yet all the more effectively, without stating it openly: «The ship ploughs the waves.» For when the identification is expressed as a less unequivocal assertion, it is less probable that we will disagree with it... Judaeo-Christian theology speaks of sex incessantly in metaphorical terms and writes of it as of something indecent, so that the identification of sexuality with obscenity was subliminally «installed»⁵⁷ into people's psychological and neurological reactions without their having the slightest notion of the «poetic» or prelogical nature of this identification. When the Romantic poets compare sexuality to budding flowers, sprouting grass, greening shrubs and so forth, they create an identification that steers towards an entirely opposite kind of reaction. From them, then, we receive the equation «sexuality equals spring», which stands in fundamental opposition to the Judaeo-Christian equation «sexuality equals obscenity». Both equations, however, have their psychological effect, for they are poetic and insufficiently distinct.⁵⁸
Wilson, Ištařin návrat (Ishtar Rising), or why the goddess descended into the underworld and what awaits us now upon her return, p. 89

57 «Photos containing a fully exposed breast - as defined by showing the nipple or areola - do violate those terms on obscene, pornographic or sexually explicit material and may be removed,» he [facebook spokesman] said in a statement [concerning the breastfeeding photo ban]. (Telegraph, 2008)
58 The passages were emphasised in bold only subsequently, by the author of this bachelor essay.

Let us now try to elucidate the neurosemantic essence of meaning and metaphor. Let us imagine some primitive proto-language at so primitive a stage of its memetic evolution that it has not yet managed to acquire any grammar – i.e. no cases, no prepositions, no sentence structure. Even though in its syntactic component it resembles the language of primates more than full-fledged human speech, it may in any case be called a language, since the individual signifiants of this language activate in the minds of listeners certain neural circuits, e.g. memories or fragments of memories most strongly associated with past occurrences of the given word. Let us imagine, then, that of a being whose behaviour we wish to predict, and of whom we know that she uses this proto-language, we know that she has in her mind the word «chest» associated with the word «breast» with, let us say, the strength 0.023.
We may arrive at the knowledge of the stated number in several ways:
• we may let the subject talk at length, so as to find, from an analysis of the obtained corpus, that out of its thousands of occurrences the word «chest» occurs 23 times right next to the word «breast»;
• we may use the method of free associations and for 1000 consecutive days present the subject with certain words, one of which is always «chest», finding that in 23 cases the hearing of the word «chest» activated in his brain the circuit encoding the utterance of the word «breast»;
• we may observe the subject from his childhood and measure that in 46 cases out of a total of 2000 situations in which he came into contact with the referent we denote by the signifiant «chest», he came into contact also with the referent we denote by the signifiant «breast»;
• in the future we shall – or at least some of You certainly will – be able to use yet subtler neurological methods, e.g. magnetic resonance and others, which, especially in combination with eastern meditation practices, might indicate that the mapped set of neurons which activated in the subject during previous experiments while he meditated upon the signifié «breast» is active also during 2.3 % of the time of the experiment in which the same person has the task of directing his meditating awareness at the signifié «chest».

In like manner we shall also find that in the examined mind there exists an association of strength 0.42 between the «palm» and the «breast», and of strength 0.077 between the «apple» and the «palm». When, now, the formula APPLECHEST is uttered in that primitive proto-language, phenomena occur which within our model we describe as follows: the formula consists of two morphemes, «apple» and «chest». Since this is the most primitive of proto-languages, in which not even the order of morphemes matters – i.e. «applechest» is, as to its semantic content, equivalent to «chestapple» – we may claim that no morpheme takes precedence over another, and hence that the impetus with which the mindbrain endowed the heard whole APPLECHEST is divided among the individual morphemes exactly half and half:

0.5 × 0.023 = 0.0115 of impetus: half of the impetus which the listener's brainmind assigned to the heard whole, i.e. to the sound wave APPLECHEST, flows to the morpheme «chest», from which a fraction of 0.023 is channelled further in the direction of «breast»;

0.5 × 0.077 = 0.0385 of impetus: half of the impetus which the listener's mindbrain assigned to the heard whole flows to the morpheme «apple» and onwards towards «palm»;

0.0385 × 0.42 = 0.01617 of impetus thus travels from the morpheme «apple» (activated with impetus 0.5) through the interchange station «palm» (activated with impetus 0.5 × 0.077) into the final destination «breast».

In the end, then, we see that to the final destination, to the point in semantic space representing «breast», there arrive 0.0115 of impetus from the direction of the morpheme «chest» and 0.01617 from the direction of the morpheme «apple», travelling through the interchange station «palm». In total, 0.01617 + 0.0115 = 0.02767 of impetus flows to «breast», which is considerably more than would have arrived at «breast» from the «apple» or from the «chest» alone. This suggests that the mutual joining of the two morphemes brings the listener's mind far closer to that point in semantic space to which the author wished to lead the reader than either of the morphemes would on its own. In other words, the maker of a sentence or metaphor, the author, the-one-who-speaks, strives by his symbol-articulating activity to bring the reader to that point of the semantic Hilbert space which most faithfully represents his creative intent, the «that which wants to be said».
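The flow just computed can be restated in a few lines of R. The weights are the illustrative values from the text above, the equal split follows from the absence of morpheme order in the proto-language, and the whole is of course only a sketch of our model, not of any measured mind:

# toy association weights taken from the text
w.chest.breast <- 0.023
w.apple.palm   <- 0.077
w.palm.breast  <- 0.42
# APPLECHEST endows each morpheme with half of the initial impetus
impetus.chest <- 0.5
impetus.apple <- 0.5
# impetus reaching «breast» along both routes
via.chest <- impetus.chest * w.chest.breast                 # 0.0115
via.apple <- impetus.apple * w.apple.palm * w.palm.breast   # 0.01617
via.chest + via.apple                                       # 0.02767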
With every act of articulation – and not only by adding a morpheme or a word, but also through tonality, stress, gesticulation, a change of word order or a punctuation mark – he specifies ever more precisely the «position» of the semantic attractor into which he wants to trap the listener or reader. Will he succeed – will his metaphor be successful? A successful metaphor is a metaphor either understood in accordance with the author's intention, or a metaphor awakening in its recipient feelings of beauty. The first is the metaphor without which neither the scientist nor human knowledge can proceed; the second is the poet's metaphor. Let us leave this second alternative aside as something we have no right to analyse – as something sacred – and let us concentrate our analyses on the question «When is a metaphor understood in accordance with the author's intention?». A metaphor is understood in accordance with the author's intention when the author brings it about that the mind of the reader or listener wanders as much as possible in those regions of the semantic Hilbert space in which he wants it to wander, and as little as possible in all the others. Or, put differently: what is essential for the success of the above metaphor APPLECHEST is not that its natural consequence is the activation of the signifié «breast» with impetus 0.02767, but that this impetus is, at the given moment, substantially greater than the impetus at the disposal of all the other circuits activated in parallel. And the more this impetus exceeds all the others – which is mostly caused by the given neurosemantic circuit being more thoroughly looped into itself⁵⁹, the track more deeply engraved, the impetus-energy dissipating less to the sides – the more clearly is the sense bound to that circuit perceived. Are Solomon's words about the «gazelle twins» a poet's metaphor or a scientist's? Does he strive, by a splendid combination of symbols, to lead us to this-and-no-other meaning; does he strive, by a linear combination of certain immensely complex mathematical entities, to lead us to this-and-no-other point of the semantic Hilbert space; or does he rather let our mind wander along the tracks «soft, warmth-radiating, fragile, asking to be caressed»? Perhaps the one, perhaps the other, perhaps neither and perhaps both at once – for even that our calculus permits us. What is certain is that the metaphor will probably be successful in neither sense for one who has seen a gazelle but once, and that wallowing behind the bars of a zoo in its own dung. Not even a fawn suspects in what nebulae of semantic space the mind of such an unfortunate will finally end. May it at least not end where the minds of the thousands who took the words of the Song all too seriously have already ended. Behold how such people manage to grasp the most beautiful and the only celebration of the body which the western spiritual tradition affords us; behold how they manage to construe the words of chapter four, verse five:

thy two breasts are like two young roes that are twins, which feed among the lilies

May that unfortunate not lose himself in the labyrinth of his associations and start babbling, in his inner exegesis of the verse, like Bernard of Clairvaux⁶⁰:

The two breasts of the Bride signify congratulation and compassion, following the doctrine of St. Paul, who wants us to rejoice with those who are happy and to weep with those who weep,
or like Maître de Sacy⁶¹:

59 In the more exact language of mathematics one may say «the denser and the more closed the graph representing the examined neurosemantic circuit».
60 Co-founder of the Cistercian order and key ally of the Templar knights in the first phase of their existence.
61 Pascal's contemporary from Port-Royal.

We have already explained that the two breasts of the beloved are either the two testaments – the old and the new – or the two commandments of charity, which are like clusters of grapes, for the word of God stored in these two divine testaments, as well as the two loves directed towards God and towards the neighbour, have the power to intoxicate whoever fills himself with them.

That for what the preceding author calls «intoxication» two testaments are not needed, one being amply sufficient, is suggested to us by an interpretation from the central work of Jewish mysticism, the book Zohar:

By the word «bosom» the Word means good deeds, for just as the breasts make up the beauty of Woman, so good deeds make up the beauty of a man.

In such stiff competition, however, it is after all Woman who finally wins – in this case Madame de Guyau⁶², who sublimates her sap evidently still more intensively than the lads mentioned in the preceding paragraphs:

Because we all suck together from the breasts of the divine Essence, our mother, I too suck unceasingly the bosom of divinity.

The quoted passages, which we translated from the French from the book Tes seins sont des grenades – Pour en finir avec le Cantique des cantiques (Lalou/Woda, 2003), are only the tip of the iceberg. The less perceptive reader will perhaps realise «just how deep the fox's burrow leads» only after reading this quotation from the still-esteemed «father of the Church» Origen⁶³:

Voilà why I hereby give you the warning and counsel that whoever is not yet freed from the obstacles of flesh and blood, and whoever has not renounced the dispositions of material nature, blasphemes absolutely by reading this little book.

Well, nothing can be done: we are going to blaspheme; and if blaspheme we must, then let it be worth it, absolutely: we claim that the Song of Songs is nothing other than an erotic poem par excellence, a celebration of the body without which the mind could not arise⁶⁴. How, though, is it possible that something so evident remained hidden from the sight of dozens of generations of sages? How could it happen at all that despite the presence of this Ode in the very heart of the Bible, despite the presence of tantric texts in the core of Brahmanism, that most beautiful, most sublime thing, wisest in its composition and mightiest in its consequences – the physical act of Love between Man and Woman – was, in the history of the ISE culture, so often insulted, battered, spat upon? By what can we, if not excuse – for certain wrongs once committed cannot be excused – then at least account for that battering of Woman, of the body, of tenderness pre(j|t)avená⁶⁵ (expressed or recast) into matter, coming from the one from whom it would be least expected: from the trinity father-son-spirit which over the centuries becomes an ever greater synonym of masculine coarseness? The answer is, of course, far more complex than any essay could ever be. Why, from the seed of the words of the good-natured gnostic of Nazareth⁶⁶, planted into the substrate of Judaic faith, Hellenic culture and Roman power, there grew at the dawn of our era the symbolic complex that grew, and above all by what mechanisms⁶⁷ this complex secured its two-thousand-year duration – that we might attempt to answer with the tools offered by the many paradigms of the postmodern human sciences: cultural anthropology, sociology of religion, evolutionary psychology, memetics, universal darwinism.
A quantitative and, we believe, also mathematically formalisable method for joining these different approaches will be sketched in the last chapter; here let us attempt to elucidate that bending, that inversion which, it seems, befell the Song as well as the whole of Christianity, through the prism of what we call 1. the Trick with AGAPE, 2. the Rule of semantic transitivity.

62 French mystic.
63 Of whom historical sources claim that he castrated himself.
64 That Mind without which the Body could not «be».
65 The metacharacter | plays in regular expressions the role of disjunction; (j|t) thus means «in this place stands j or t», and the regular expression pre(j|t)avený accordingly activates in the reader's mind two circuits at once: «prejavený» (expressed) and «pretavený» (recast).
66 Ježíš, kam bežíš? Do Nazaretu po cigaretu. («Jesus, where do you run? To Nazareth, to fetch a cigarette.»)
67 We are interested in the subtle mechanisms, the symbolic ones. We feel not in the least called to analyse the coarse mechanisms, those connected with the sword and the guillotine, and we leave their analysis to historians.

Fragment 5: The Trick with Agape
After consulting his supervisor, the author decided to exclude this fragment from the essay.

Fragment 7: A precursor of the rule of semantic transitivity
In section 2.3 we explained what a semantic prototype A (e.g. apple) of a category X (fruit) is. We tried to suggest that between A and X one can measure a certain quantity P1 which we may understand as proportional either to:
● the strength-weight of the association between A and X,
● the probability that a mind perfectly absorbed in its own interior – i.e. a mind into which no inputs from the surrounding environment enter – will leap from A to X.
We also suggested that the category X is associated with further concepts as well (e.g. B – breasts), and the strength of this association we may express by a quantity P2, defined analogously to P1. We sense that if there thus exists a link |@A,X@| = P1 and also a link |@X,B@| = P2, then the value |@A,B@| > P1 × P2 × K, where K < 1 is an endogenous parameter of the examined system – in the case of an individual, a certain global property of his mind. In other words: if there exists an association between Apple and Fruit and an association between Fruit and Bosom, then an association between Apple and Bosom most certainly exists as well. The same holds in the opposite direction, albeit with different quantitative results, since the quantities entering the formula will differ (we have changed direction): if there exists an association between Bosom and Apple (of whose existence we are convinced, see section 2.3) and an association between Apple and Fruit, then an association between Bosom and Fruit most certainly exists as well. The formula:

|@A,X@| = P1 ; |@X,B@| = P2 -> |@A,B@| >⁶⁸ P1 × P2 × K

we call the Rule of semantic transitivity and consider it a precursor of some general principle of human conceptual thinking. From the triad Apple, Bosom, Fruit we have allowed ourselves a perilous inductive leap to a certain universal principle of the mind. Now, thanks to the universality of that principle, affirmed by us⁶⁹, we shall permit ourselves the following deduction: we know that the central cosmogonic myth of the ISE culture created, in the minds of the persons infected by it, a semantic link between «sin» and «fruit», or between «sin» and «apple». Likewise, our empirical data obtained thanks to questionnaire D2 suggest – and in the case of persons of female sex even suggest statistically significantly – that in the host minds of the ISE culture there exists a semantic link between «bosom» and «fruit». Should the Rule of semantic transitivity hold, it follows from the above that somewhere in the mind of all those who were infected by the central cosmogonic myth of the ISE culture there will exist an association between «sin» and «bosom».
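The rule can be stated as a one-line lower bound. In the sketch below the constant K and the two link strengths are purely illustrative values; in a real mind K would have to be estimated, not posited:

# lower bound on |@A,B@| given |@A,X@| = P1 and |@X,B@| = P2
semantic.transitivity <- function(P1, P2, K) P1 * P2 * K
# e.g. with toy values loosely inspired by the questionnaires above
semantic.transitivity(P1 = 0.634, P2 = 3.3/5, K = 0.5)   # 0.209: a nonzero @Apple,Bosom@ link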
The unmasking of this association existing between the term «sin», which has no firm referent, and the term «bosom» – of whose firmness we discreetly keep silent – was one of the central aims of this work.

68 We use the sign > and not = because the weight of the link between A and B is given not only by how much impetus travels from A towards B via X, but also by how much impetus travels from A towards B via Y, Z, etc. For between «fruit» and «breast» there lies not only the interchange station «apple» but also a number of other, less salient interchange stations which likewise contribute, in small measure, to the resulting quantity.
69 And, I hope, refuted by You.

Third garden, construct two: Application

What would happen if, one day, science, the sense of the beautiful and the sense of the good were to merge into one harmonious concert? What would happen if this synthesis became a marvellous working instrument, a new algebra, a spiritual chemistry which would permit one to combine, for example, astronomical laws with a phrase of Bach and a verse of the Bible, so as to derive from them new notions which would serve, in their turn, as a springboard for further operations of the spirit?
Translator's foreword to «Le jeu des perles de verre» (Hesse, 1955)

Fragment 1: The approach of computing the «magnitude of a sign» – or, put differently, «the importance of a certain sign for the whole of the system within which it occurs» – by means of matrix algebra is applicable not only to analyses of an individual's mind but also to analyses of whole cultures and societies. What is more, thanks to the statistical law of large numbers it is probable that the results at which such anthroposociological analyses could arrive will be of a more solid character than neuropsychological analyses. For the circuits of an individual's mind are engraved only into the neuronal wetware of the brain, whereas the circuits of a culture's mind are engraved into books of laws, into institutions, into towns and roads – in other words, in the case of cultures the contextual and associative relations are very often etched not only into the brains of the human beings who are the «host organisms» of this or that culture, but also «into stone».

Let us imagine a primitive hunter-gatherer culture in whose cosmological system the key roles are played by the meanings «Fruit», «Breast», «Milk», «Woman» and «Fire». After years of demanding field research ...

         Fruit   Breast   Milk   Woman   Fire
Fruit    0       0.23     0.15   0.07    0.05
Breast   0.4     0        0.4    0.4     0.1
Milk     0.4     0.33     0      0.23    0.15
Woman    0.15    0.37     0.35   0       0.7
Fire     0.05    0.07     0.1    0.3     0

... the use of the code from appendix 2 indicates the not-entirely-obvious fact that the symbol with the greatest magnitude within the given cultural whole is... milk.
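The code of appendix 2 is not reproduced in this excerpt, so what follows is only a stand-in sketch of one possible operationalisation: «magnitude» taken as the fixed point of the association matrix written above (as reconstructed here, every column of the matrix sums to 1, so it can be iterated exactly like the stochastic PageRank matrices discussed in Fragment 10). Whether this particular operationalisation reproduces the «milk» result depends on the normalisation details of the appendix code:

symbols <- c("Fruit", "Breast", "Milk", "Woman", "Fire")
A <- matrix(c(0,    0.23, 0.15, 0.07, 0.05,
              0.4,  0,    0.4,  0.4,  0.1,
              0.4,  0.33, 0,    0.23, 0.15,
              0.15, 0.37, 0.35, 0,    0.7,
              0.05, 0.07, 0.1,  0.3,  0),
            nrow = 5, byrow = TRUE, dimnames = list(symbols, symbols))
v <- rep(1/5, 5)                            # start from a uniform distribution of impetus
for (i in 1:1000) v <- as.vector(A %*% v)   # power iteration towards the fixed point
sort(setNames(round(v, 3), symbols), decreasing = TRUE)   # the top entry: the mightiest symbol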
Fragment 3: The poodle's core
The use of matrix calculus for the analysis of symbolic systems may lead us all the way to the knowledge of the weak spots, the Achilles heels, of the systems concerned. Just as an antibody which has just touched a certain site of the viral capsid causes, by this delicate and very specific touch, the demise of the virus; just as a psychoanalyst who has just uncovered an inconspicuous semantic link whose reconfiguration will reconfigure the whole of the patient's mind; so will the one who can convert the whole of a cultural or religious system into a matrix of mutually referring entities be given the possibility of «dissolving» the culture or religion in question from within with one single word. Since we are not yet certain whether this method – used intuitively above all by shamans, missionaries and demagogues – could be employed not only destructively but also constructively, i.e. for the delicate design or redesign of cultural systems towards the building of a temple for an ever greater diversity and beauty of beings and things, we have decided, for fear of a possible misunderstanding of our intent, to hint at certain knowledge only in passing for now.

Fragment 2: The birth of corpus culturology
Proceeding from the premise that «a person must carry within himself a reason for devoting his time to the creation of this and not another encyclopaedic entry», we may understand the individual national wikipedias – e.g. http://sk.wikipedia.org, http://cs.wikipedia.org or http://fr.wikipedia.org – as o(b|d)razy, images or reflections, of the priorities and values of the bearers of the individual national cultures. How to convert the databases of the national wikipedias into matrix form, how to compute the magnitudes of individual signs (we illustrate with the concepts «wine», «milk», «god», «olives» and «woman»), how to compare them with one another, and what the comparison implies for the birth of a quantitative – or rather corpus? – culturology: of that will treat our entry for the Ars Electronica competition, as well as our first strictly scientific article, written, we trust, with the blessing of FHS UK.

Fragment 6: Memetic engineering
In the end, alas, it seems that even those who intuitively came to know (s?poznali) the laws of the creation, inertia and demise of concepts achieved with their work nothing more than that myriads of those ignorant of that art killed one another, for thousands of years, in the name of some «love».

Fragment 10: What can a graph theory tell us about breasts and apples?
A graph is a mathematical structure consisting of vertices and edges. A vertex can be understood as a «node», «element», «object», «entity» or even «neuron»; an edge can be understood as a relation or a link connecting a pair of vertices. It can be seen almost immediately that graph theory can be useful for the analysis of networks of references (e.g. the hypertext web) – and verily, at the core of the biggest success story of the Web – Google's one – is a quantity called PageRank⁷⁰ whose computation follows directly from certain properties of graphs and of the stochastic matrices related to them. The reasoning presented within the scope of this article is founded upon the following assumptions:
1) Any holistic complex can be understood, and thus analysed, as a network of references and hence as a graph.
2) By a correct application of graph-theoretical notions, non-evident but practically useful properties of a given holistic complex can be discovered.
By a holistic complex we mean a system that cannot be explained by the properties of its components alone. To understand it and to explain it, a structure – i.e. the set of relations between the components – must be taken into account. Briefly: not only the content – the information IN the Net – is important; so is the form⁷¹ – the information ON the Net. We'll analyse two types of such holistic complexes within this chapter: a classical text poem and a hypertext encyclopaedia. Because this article is itself part of a yet bigger holistic complex concerning the semantic relation between the «breast and apple» concepts, we decided to choose the «Song of Songs» of King Salomon, where both concepts are present. Concerning an encyclopaedia, we chose to analyse the «9th miracle of the world» – the biggest archive of human knowledge ever created by humanity and for humanity – the Wikipedia.
While still pointing the reader's attention to the «breast and apple» concepts, we'll try to show that the analysis and subsequent comparison of national wikipedias by means of graph-theoretical notions like «closeness», «betweenness» or «PageRank» can lead to non-trivial discoveries whose range spans from cultural anthropology to hardcore semantics. Our virtual workbench will consist of OpenSource tools only – namely the Linux operating system, the PERL programming language and, last but not least, the most powerful statistical tool ever created – R for statistical computing⁷². We'll try to be consistent with the spirit of the hereby nascent OpenedScience movement and thus present our experiments in such a way that they could be reproduced by anyone with fairly advanced informatic skills: all Linux and R commands, as well as PERL subroutines, will be presented in notes at the bottom of the page⁷³.

70 The name "PageRank" is a trademark of Google, and the PageRank process has been patented (U.S. Patent 6,285,999).
71 Die Form ist die Möglichkeit der Struktur. – «Form is the possibility of structure.» (Wittgenstein, 1917)

Analysis 1: Cantique
Because the English language is one of the easiest languages to parse, we chose to download⁷⁴ and analyse that version of King Solomon's song which is present in the King James Bible. After a preliminary removal of header HTML tags, we extracted only nouns, verbs and adjectives from the corpus by means of the «gposttl» version of Brill's tagger⁷⁵, and we marked the frontiers of sentences with the sign «::». We create a small script based upon the theoretical notions presented in chapter 2 of this work. Namely, we accept as true the Hebbian hypothesis «if two symbols are activated within a very short timespan, the weight of their relation will be strengthened»; we accept it and we let it inspire us to such a measure that we allow ourselves to hereby formulate this primitive rule of thumb:

Zeroth semantic principle: If two words (vertices) are present within the same sentence, the weight of their relation (edge) will be strengthened.

Speaking more generally: given a co-occurrence of elements (e.g. words) A and B within a higher-level complex (e.g. a sentence) X, an edge will be created or – if already created – augmented with weight N (a minimal sketch in R follows at the close of this excursus). It is important to mention that even when we speak of linguistic corpora only, there are many different sorts of what we call «complexes», located on many different hierarchical levels: from almost invisible low-level (n-3, n-2) NP syntagms to larger-scale phrases (n-1), sentences (n), and yet bigger complexes like a śloka (n+1) or a whole chapter (n+2...?). Keeping this in mind, and adding immediately that the concrete value of the weighting constant N seems to be determined:
• by the level (n-2/n-1/n/n+1/n+2 etc.) within which A and B co-occur (Drake, 2000) – the higher the level, the smaller the N,
• by the presence of inhibitor terms – as known, for example, from the chomskian Government and Binding Theory (Haegeman, 1994),
• by terminological (Hromada, 2007) or temporal distance (related to Hebbian functioning)⁷⁶,
• by the given language L itself,
we conclude this small theoretical excursus with the assertion that graph theory can serve not only as a firm base founding a formalized cognitive semantics but also as a unifying point between disciplines as different as behavioral musicology (Drake, 2000) and generative syntax. We assert this because we are strongly persuaded that what tree structures mean for syntax, graphs – and particularly cyclic graphs – shall mean for semantics.
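Before the PERL implementation is introduced below, here is the promised minimal R sketch of the zeroth semantic principle at the sentence level (N = 1). It assumes sentences have already been reduced to character vectors of their content words:

zeroth.principle <- function(sentences, N = 1) {
  # all ordered pairs of distinct words co-occurring within one sentence
  pairs <- do.call(rbind, lapply(sentences, function(s)
    subset(expand.grid(from = s, to = s, stringsAsFactors = FALSE), from != to)))
  # sum the weight N over repeated co-occurrences -> a weighted edge list
  aggregate(list(weight = rep(N, nrow(pairs))), pairs, sum)
}
edges <- zeroth.principle(list(c("breast", "cluster", "vine"),
                               c("apple", "breast")))
subset(edges, from == "breast")   # breast;apple, breast;cluster, breast;vine, each of weight 1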
72 «Great beauty of R is that you can modify it to do all sorts of things,» said the chief economist at Google. «And you have a lot of prepackaged stuff that's already available, so you're standing on the shoulders of giants.» (NY Times, 2009)
73 Linux shell commands will begin with the $ character. R commands will begin with the > character.
74 $ wget http://localhost.sk/~hromi/research/breastANDapple/songofsongs.html
75 $ gposttl --brill-mode ./songofsongs.html | perl -e 'while (<>) { @d = split(" "); for (@d) { if (/:\/:/) { print "::"; } elsif (/\w+\/(NN|JJ|VB)/) { print " "; print $_; } } }' > /tmp/salamun
76 If the concrete values of the mentioned constants are language-dependent, it would mean that they play, on the semantic level, a role similar to that of the «parameters» within the Principles & Parameters approaches of the generative grammarians.

Of course we are very far away from the moment when we could possibly state that we know how to transform a corpus of a given language into a graph whose structure would be isomorphic with the structure of «the understanding» which an ideally competent human reader would have pulled out of the given corpus during a hermeneutic procedure. Who knows, maybe we'll never get there; nonetheless it is our duty at least to try to start somewhere. And therefore: we have created a truly primitive script to which we have given the name «Golem»⁷⁷. For input it takes the gposttl output mentioned above, it permutates all the nouns/verbs/adjectives given, and as output it produces a list of pairs word1;word2 according to the zeroth semantic principle. Thus for the phrase «now also thy breasts shall be as clusters of the vine, and the smell of thy nose like apples» we obtain such a permutated list of edges (pairs of words/vertices):

thy;breast     thy;be         thy;cluster     thy;vine       thy;smell       thy;nose       thy;apple
breast;thy     breast;be      breast;cluster  breast;vine    breast;smell    breast;nose    breast;apple
be;thy         be;breast      be;cluster      be;vine        be;smell        be;nose        be;apple
cluster;thy    cluster;breast cluster;be      cluster;vine   cluster;smell   cluster;nose   cluster;apple
vine;thy       vine;breast    vine;be         vine;cluster   vine;smell      vine;nose      vine;apple
smell;thy      smell;breast   smell;be        smell;cluster  smell;vine      smell;nose     smell;apple
nose;thy       nose;breast    nose;be         nose;cluster   nose;vine       nose;smell     nose;apple
apple;thy      apple;breast   apple;be        apple;cluster  apple;vine      apple;smell    apple;nose

We hope it is evident from this list that the graph we are creating here will be an «undirected» one – in other words, it is constructed in such a way that there is no difference between the edge between «a breast and an apple» and the edge between «an apple and a breast». Put differently, an undirected graph is a graph whose adjacency matrix is symmetric. What is an adjacency matrix? «In mathematics and computer science the adjacency matrix M of a finite directed or undirected graph G on n vertices is the n × n matrix where the nondiagonal entry aij is the number of edges from vertex i to vertex j.» (Wikipedia, 2009) Some day, maybe, some brahman with a poetic soul, knowing that «there exists a unique adjacency matrix for each graph and it is not the adjacency matrix of any other graph», could immediately state that the relation between an adjacency matrix M and its graph G is similar to the relation between Purusa and Prakrti⁷⁸ – both are sides of the same coin; one cannot be without the other. We can say that the matrices we were constructing⁷⁹ in chapter 2, when we were speaking about neurolinguistic networks within the brain of a newborn, were adjacency matrices.

77 $ ./golem.pl /tmp/salamun > /tmp/cantiqueEdgelist
78 Shiva Shaktyatmakam Brahma (Anandamurti, 1961)
79 > CantiqueAdjacencyMatrix<-as.matrix(table(read.table("/tmp/cantiqueEdgelist",sep=";")))
These adjacency matrices – and therefore the graphs they describe as well – are «weighted». If we suppose – and we do – that the weighting constant for the co-occurrence of two words within a sentence is N = 1, and if the terms «apple» and «breast» co-occur within the whole Song of Songs in one sentence only – and that is verily the case – then the position in the column «apple» and row «breast» of the adjacency matrix M will have the value Mapple,breast = 1. On the contrary, since the terms «apple» and «tree» co-occur within 3 sentences of the Cantique⁸⁰, the position in the column «apple» and row «tree» of the adjacency matrix will have the value Mapple,tree = 3. After listing⁸¹ and ordering⁸² all the non-zero values present within the vector/row Mapple, we obtain the following results:

3 tree; 2 beloved; 2 is; 2 thy; 1 be; 1 breast; 1 cluster; 1 cometh; 1 comfort; 1 delight; 1 flagon; 1 fruit; 1 great; 1 leaning; 1 nose; 1 raised; 1 sat; 1 shadow; 1 smell; 1 son; 1 stay; 1 sweet; 1 taste; 1 up; 1 vine; 1 was; 1 wilderness; 1 wood

We would like to point the dear reader's attention to the fact that a significant part of the terms hereby presented – e.g. tree, delight, fruit, sweet, taste etc. – could possibly serve as a basis for a satisfying definition of the meaning of the word «apple». In other words, these are the basic elements of semantic analysis – the semes – which we presented in the previous chapter, and the associated numeric value is nothing else than the value of a coordinate within a Hilbert space for the respective semantic dimension. And we see this even in the case of a corpus which has no more than 16 kilobytes... When it comes to breast⁸³, the non-zero items of the row Mbreast of the adjacency matrix M go like this:

4 are; 4 thy; 2 be; 2 cluster; 2 is; 2 roe; 2 twin; 2 young; 1 am; 1 apple; 1 betwixt; 1 brother; 1 bundle; 1 despised; 1 feed; 1 find; 1 grape; 1 hath; 1 have; 1 kiss; 1 lie; 1 lilie; 1 little; 1 mother; 1 myrrh; 1 night; 1 nose; 1 o; 1 palm; 1 sister; 1 smell; 1 stature; 1 sucked; 1 tower; 1 tree; 1 vine; 1 wall; 1 wellbeloved; 1 wert; 1 yea

Even while leaving out that despised word despised as well as the exclamations o, yea, we are obliged to reiterate: what we see even in such a small corpus as the Cantique⁸⁴ can be stated like this: co-occurrence is tightly related to the definition, and hence to the meaning, the signifié, of the given signifiant. For verily there is not, and there will not be born, a (wo)?man who could justly maintain that a definition of a breast which excluded semes like «feed», «kiss», «night», «mother», «sucked» and «smell», or even «betwixt», would be a complete definition. Because we may possibly be criticized⁸⁵ for doing nothing more here than building a contingency table of the co-occurrence of words within a sentence, we pursue our analysis further.

80 As the apple tree among the trees of the wood, so is my beloved among the sons... I said, I will go up to the palm tree, I will take hold of the boughs thereof: now also thy breasts shall be as clusters of the vine, and the smell of thy nose like apples... I raised thee up under the apple tree: there thy mother brought thee forth: there she brought thee forth that bare thee.
81 > applesubvector<-subset(CantiqueAdjacencyMatrix["apple",],CantiqueAdjacencyMatrix["apple",]>0)
82 > applesubvector[order(applesubvector,decreasing=TRUE)]
For this moment we leave aside the adjacency matrix M with the multitudes of her⁸⁶ fascinating properties⁸⁷, and we fully focalise upon her second «visage» – upon the graph G. We'll construct it by means of the wonderful, wonderful, wonderful «igraph»⁸⁸ library, created mostly by our Hungarian OpenSource brethren, by executing one simple command⁸⁹.

83 > breastsubvector<-subset(CantiqueAdjacencyMatrix["breast",],CantiqueAdjacencyMatrix["breast",]>0)
   > breastsubvector[order(breastsubvector,decreasing=TRUE)]
84 And what can and will be seen with much more clarity if we take into account much bigger corpora, like that of the Google n-grams, for example (Cilibrasi, 2007).
85 In no case do we pretend that what we are doing here was never done before. Such a statement would be – with very high probability – a big hypocrisy within a world where maybe even millions of (wo)?men are thrown into a neverending quest for scientific truth (Teillhard, 1923). It is more than possible that for every step of the analysis presented hereby a highly specialized application exists – whether built by CNRS or MIT. But what we declare is that all this can be performed much more easily, and with a much higher degree of aesthetic fulfillment, with a few lines of PERL code and a few correctly stated R commands. What we declare is this: any kid can do it, any.
86 Staying consistent with the gender division in Slavic and Romance languages, the matrix M is feminine for us.
87 I would like to thank Monsieur Dominique Pignon for pointing my attention to the mathematically proven «fact» that the entry in row i and column j of the matrix Mⁿ gives the number of (directed or undirected) walks of length n from vertex i to vertex j. It can be useful, very useful...
88 > install.packages("igraph"); library(igraph)
89 > CantiqueGraph<-graph.adjacency(CantiqueAdjacencyMatrix,weighted=TRUE)

From now on, multitudes of possible analyses open in front of our eyes; multitudes of which we have chosen only a few – in somewhat machiavellian fashion we'll present hereby only those very few examples that best prove our point. Let's start with PageRank. Technically speaking, its values are nothing else than the entries of an eigenvector of a (suitably normalised) adjacency matrix. More humanly speaking, its value Px gives us the probability with which an agent randomly browsing the Network will land, after many steps/clicks, on the site/node X. This follows from the Markov theory of stochastic matrices and the fixed point theorem. Within chapter 2, we tried to pursue the PageRank notion further than just the hypertext web. We tried to focalise the dear reader's attention upon the fact that a correct understanding and application of linear algebra, and particularly of the idea hidden behind PageRank, could be an important moment in the story leading to the quantification and formalisation of certain human sciences. Namely, through the medium of the platonic image of a «soul» errant within a conceptual network, we set forward the hypothesis that the PageRank Px calculated for such conceptual networks gives us the probability with which the «soul» – be it the soul of a man or of a nation – will finally «land» in the attractor concept X. Or, which is the same, the probability that the concept X will become the content of an errant soul. We confess that in the moments when we were writing chapter 2 we were verily seduced by the PageRank idea. We knew nothing of the other quantities calculable for a graph, like «closeness», «betweenness», «vertex similarity» etc. Nonetheless, our enchantment with PageRank continues even now, and to such a measure that in the next and last part of the present work we'll re-name it «importance». Our enchantment continues namely for this reason: since it uses a very simple iterative process, PageRank is very easy and fast to calculate⁹⁰.
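That iterative process can be sketched in a dozen lines of base R: a damped random walk over the row-normalised adjacency matrix, iterated until the fixed point of Markov theory is approached. This is only the naked idea under assumptions of our own (the damping value 0.85 is the classical choice, not something dictated by our corpus; dangling vertices are handled only crudely); igraph's page.rank, used in the footnotes, wraps a more careful implementation of the same loop:

pagerank.sketch <- function(A, d = 0.85, n.iter = 100) {
  n <- nrow(A)
  P <- A / pmax(rowSums(A), 1)   # where does a randomly browsing agent go next (pmax: crude
                                 # guard for vertices without outgoing edges)
  v <- rep(1/n, n)               # start the errant soul anywhere
  for (i in 1:n.iter)
    v <- (1 - d)/n + d * as.vector(v %*% P)   # teleportation + edge-following
  setNames(v, rownames(A))
}
# e.g.: head(sort(pagerank.sketch(CantiqueAdjacencyMatrix), decreasing = TRUE), 10)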
Thus, after calculating the entries of the PageRank vector for our Cantique, it suffices to join the calculated quantities to the vertex labels⁹¹ and to order them⁹² in descending order. Afterwards we obtain a list⁹³ whose first 60 rows go like this:

1  0.0402450616276486   is        21  0.00574703022231954  spice      41  0.00425756366279538  vine
2  0.0286874213561736   thy       22  0.00573454130839064  go         42  0.00415891698675269  roe
3  0.0207556525566583   beloved   23  0.00572865883284286  voice      43  0.00408724684320239  gold
4  0.0181126226375237   o         24  0.00567096410285987  smell      44  0.00406786321075841  dove
5  0.0168321179888151   are       25  0.00554778536139638  jerusalem  45  0.00403215526672195  head
6  0.0153742262706985   love      26  0.00552568332075932  thine      46  0.00399520253064914  heart
7  0.0109313383916066   solomon   27  0.00550568137332312  art        47  0.00399324606027242  mountain
8  0.00858951049017589  let       28  0.0054357374621402   mother     48  0.00389267739621985  lock
9  0.00853652275059628  daughter  29  0.00536126145890124  hand       49  0.00385246254588212  am
10 0.0071625915464353   have      30  0.00494272275419686  see        50  0.00381393776344857  pleasant
11 0.00712911697034781  song      31  0.00484017126806818  vineyard   51  0.00370141936650332  soul
12 0.00658525478825675  be        32  0.00480671780005788  flock      52  0.00366600695461648  yea
13 0.00646490626473848  fair      33  0.00479763634993785  eye        53  0.00355200627804045  spouse
14 0.00642493278607294  garden    34  0.00468153277561695  sweet      54  0.00349365560289901  neck
15 0.00631260735215216  5         35  0.00445085951789789  was        55  0.00345629616853102  countenance
16 0.00617829833451983  myrrh     36  0.00437615206382555  day        56  0.00345286612010278  king
17 0.00609495016444803  fruit     37  0.00437527160055655  breast     57  0.00340922347459539  charge
18 0.00592028534721268  come      38  0.00434900453691372  sister     58  0.00328539601897926  apple
19 0.00587021417258384  tree      39  0.0042752292727721   comely     59  0.00327909072482058  pomegranate
20 0.00576188606513339  behold    40  0.00426197531615292  wine       60  0.00310921062863708  set

Voilà the result: an introductory stanza to the poem in itself, consisting of the 60 adjectives, nouns or verbs with the highest PageRank within the graph created by the application of the principle «if two words co-occur within one sentence, augment the weight of their relation by 1», applied upon a corpus extracted from the King James Bible's version of the Song of Songs, supposedly written by Salomon, genitor of the temple and second king of Israel. Honestly – even if our analyses were completely useless, aren't those words, each one of them, aren't they simply beautiful?⁹⁴

But there is much more to a graph G than just its PageRank. As we have already mentioned, graph theory has developed multitudes of other useful notions. Many of them are already implemented in the «igraph» library, and thus we can easily furnish not only their theoretical definition but also illustrate their empiric impact.

90 > CantiqueRank<-page.rank(CantiqueGraph)$vector
91 > CantiqueRankNames<-data.frame(CantiqueRank,V(CantiqueGraph)$name)
92 > CantiqueRankNamesOrder<-CantiqueRankNames[order(CantiqueRankNames[,1],decreasing=TRUE),]
93 The full list is downloadable here (blank-space divided CSV format): http://localhost.sk/~hromi/research/breastANDapple/cantiquerank.csv
Let's glance over the already mentioned «closeness», «betweenness» and «similarity»:
• Closeness – the R manual tells us⁹⁵: «Closeness centrality measures how many steps are required to access every other vertex from a given vertex. The closeness centrality of a vertex is defined by the inverse of the average length of the shortest paths to/from all the other vertices in the graph.»
• Betweenness – the R manual tells us⁹⁶: «The vertex and edge betweenness are (roughly) defined by the number of geodesics (shortest paths) going through a vertex or an edge.»
• Vertex similarity – there are many different types, and thus algorithms, for the calculation of vertex similarity. For the purpose of this article we chose to use the inverse log-weighted similarity, for it seems to be a more evolved notion than the notions of Jaccard or Dice similarity. The R manual tells us⁹⁷: «The inverse log-weighted similarity of two vertices is the number of their common neighbors, weighted by the inverse logarithm of their degrees. It is based on the assumption that two vertices should be considered more similar if they share a low-degree common neighbor, since high-degree common neighbors are more likely to appear even by pure chance. Isolated vertices will have zero similarity to any other vertex. Self-similarities are not calculated. See the following paper for more details: Lada A. Adamic and Eytan Adar: Friends and neighbors on the Web. Social Networks, 25(3):211-230, 2003.»
Since we don't want to bother the dear reader with further theoretical notions, we have excluded all formulas as well as the definitions of more or less self-evident graph-theoretical terms like «neighbor», «degree» or «shortest path». Anyone interested will surely find his way to fill this gap. Let's execute the necessary commands⁹⁸ and see what else graph theory can tell us about the Song of Songs:

94 Honestly – where in the graph G have You seen a command to seed hate & bomb Gaza?
95 > ?closeness
96 > ?betweenness
97 > ?similarity.invlogweighted
98 > CantiqueCloseness<-data.frame(closeness(CantiqueGraph),V(CantiqueGraph)$name)
   > CantiqueBetween<-data.frame(betweenness(CantiqueGraph),V(CantiqueGraph)$name)
   > data.frame(CantiqueCloseness[order(CantiqueCloseness[,1],decreasing=TRUE),],CantiqueBetween[order(CantiqueBetween[,1],decreasing=TRUE),])[1:24,]

Closeness value     Central vertex    Betweenness value   Crossroad vertex
0.27943661971831    is                67661.944443311     is
0.266237251744498   beloved           19934.9588059187    beloved
0.265951742627346   o                 19310.1582973797    are
0.265809217577706   thy               19278.8967516378    thy
0.261741424802111   are               17523.0606034477    o
0.261190100052659   love              16636.7241518869    love
0.259278619968636   solomon           7539.80553623171    be
0.254750898818695   song              7509.0126779339     solomon
0.253708439897698   daughter          6927.51233607318    go
0.253578732106339   spice             5280.6192724142     have
0.252674477840041   fruit             5005.31370288028    daughter
0.252674477840041   go                4817.35585927223    spice
0.251393816523061   5                 4709.75950500152    mother
0.251266464032421   let               4645.28517812257    5
0.251139240506329   smell             4572.6736347049     behold
0.249748237663646   behold            3990.12475254222    fruit
0.249371543489191   mother            3916.13355606611    mountain
0.249246231155779   breast            3857.29209176175    let
0.248995983935743   be                3747.85512875238    was
0.248               come              3614.68634683767    garden
0.247752247752248   dove              3544.22976327145    myrrh
0.247628557164254   sister            3473.05920301343    tree
0.247628557164254   tree              3455.48780030024    song
0.24750499001996    fair              3228.67449596703    apple

Voilà two stanzas of our poem: the first being the list⁹⁹ of the 24 adjectives, nouns or verbs with the highest Closeness measure; the second being the list of the 24 adjectives, nouns or verbs with the highest Betweenness measure, as assessed within the graph created by the implementation of the principle «if two words co-occur within one sentence, augment the weight of their relation by 1», applied upon a corpus extracted from the King James Bible's version of the Song of Songs, supposedly written by Salomon, operator of the temple and second king of Israel. It can be said that the further we depart from the top ranks, the more the two measures differ. Thus, the breast concept is ranked 19th according to the closeness centrality measure, but 33rd according to the betweenness measure. Inversely, the apple concept is more «crossroad-like» than central: it is 25th according to the betweenness measure, but only 57th according to the closeness centrality measure. Nonetheless, when we take into account that we extracted 498 nouns/verbs/adjectives out of the Cantique, so that our graph G has 498 vertices, both of these concepts, «apple» and «breast», are far-from-being-not-important, no matter which measure we choose as the significant measure of importance. One of the reasons why we consider the «betweenness» measure to be of particular importance¹⁰⁰ is that it divides the set of our vertices into two groups: a group of those through which no shortest path passes and whose betweenness value is therefore zero (298 of them in the case of the Cantique corpus), and a group of those which serve as principal «junctions» – in other words, those through which some «geodesics» do pass and whose value is thus non-zero (199 of them in the case of the Cantique corpus).
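The ranks quoted in the preceding paragraph can be read off directly from the orderings of the two data frames built in the footnote above; a one-line helper suffices (the numbers in the comments are those reported in the text):

rank.of <- function(d, word) which(d[order(d[, 1], decreasing = TRUE), 2] == word)
rank.of(CantiqueCloseness, "breast")   # 19 - central
rank.of(CantiqueBetween,  "breast")    # 33 - less of a crossroad
rank.of(CantiqueBetween,  "apple")     # 25
rank.of(CantiqueCloseness, "apple")    # 57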
The last thing we would like to say about our visualisation is that we have chosen the «Fruchterman-Reingold» algorithm to visualise our subgraph. Let's see what other programmers say about it:

«It is a force-directed algorithm, meaning that vertex layout is determined by the forces pulling vertices together and pushing them apart. Attractive forces occur between adjacent vertices only, whereas repulsive forces occur between every pair of vertices. Each iteration computes the sum of the forces on each vertex, then moves the vertices to their new positions. The movement of vertices is mitigated by the temperature of the system for that iteration: as the algorithm progresses through successive iterations, the temperature should decrease so that vertices settle in place.» (Gregor, 2004)

In other words, to visualise the Cantique we used a procedure not too distant from the «annealing of substances» of the ancient alchemists. Only in the moment of the production of this eye-candy does randomness enter the game, because the initial position – i.e. the position of the vertices before the «annealing» – is put forward by a random generator. Only in this moment of visualisation according to the fruchterman.reingold algorithm will the R commands proposed to, and hopefully executed by, the dear reader produce results slightly different from those presented upon these pages.

Illustration 2: > plot.igraph(CantiqueSubgraph,vertex.label.cex=2,vertex.label=V(CantiqueSubgraph)$name,vertex.shape="none",asp=0,vertex.label.color=1:length(V(CantiqueSubgraph))%%7+1,layout=layout.fruchterman.reingold,margin=-0.07)

But since we don't want to be accused of exercising Kabbalistic practices, we come back to the notions and procedures of graph theory, which follow one out of the other with the apodictic lucidity of mathematical theorems. And thus, to finally answer the question «Did the relation between an apple and a breast exist somewhere within the mind of Salomon?», we have decided to apply the notion of inverse log-weighted similarity upon the vertex «apple» of the graph G.

99 The full list is downloadable here: http://localhost.sk/~hromi/research/breastANDapple/CantiqueCloseBetween.csv
100 On the other hand, a big inconvenience of the betweenness measure is that its calculation is very demanding, because for every new vertex added, the shortest paths to all the other vertices have to be found, and afterwards the betweenness values of all the vertices located upon these paths have to be adjusted. We are not experts on complexity theory, but it seems to us that betweenness calculation is not a problem solvable within polynomial time.
But since we do not want to be accused of exercising Kabbalistic practices, we come back to the notions and procedures of graph theory, which follow one out of the other with the apodictic lucidity of mathematical theorems. And thus, to finally answer the question « Did the relation between an apple and a breast exist somewhere within the mind of Solomon? », we have decided to apply the notion of inverse log-weighted similarity upon the vertex « apple » of the graph G. Voici the results101:

rank   vertex     similarity to « apple »
  1    is         23.5554863138659
  2    tree       23.0239430017928
  3    beloved    22.1794872684636
  4    apple      16.3223259179549
  5    sweet      14.0358163212322
  6    fruit      14.0228519098312
  7    shadow     12.9337406972387
  8    was        12.1856172672389
  9    wood       11.9673951448652
 10    delight    11.8559056922472
 11    great      11.8559056922472
 12    sat        11.8559056922472
 13    son        11.8559056922472
 14    taste      11.8559056922472
 15    thy         9.01488868077456
 16    smell       8.82445148474772
 17    myrrh       8.3676712626057
 18    up          8.24706309890257

Voilà the last stanza of our poem, consisting of the 25 adjectives, nouns or verbs (the first 18 of which are printed above) most similar, by the means of the inverse log-weighted similarity, to the vertex « apple » among all102 the other vertices of the graph G created by an implementation of the principle « if two words co-occur within one sentence, augment the weight of their relation by 1 », applied upon a corpus extracted from the King James Bible's version of the Song of Songs, supposedly written by Solomon, builder of the temple and third king of Israel. Thus, when we see that, ranked according to the inverse log-weighted similarity to the « apple » vertex, the « breast » vertex is located at position 24, i.e. within the top 5% of the total number of 498 vertices, we can conclude: ὅπερ ἔδει δεῖξαι.

101 > applesim <- data.frame(V(CantiqueGraph)$name,
         similarity.invlogweighted(CantiqueGraph, which(V(CantiqueGraph)$name=="apple")-1)[1,])
    > AppleSimilarityOrdered <- data.frame(applesim[order(applesim[,2], decreasing=TRUE),],
         1:length(applesim[,1]))
102 The full list is downloadable here: http://localhost.sk/~hromi/research/breastANDapple/applesim.csv
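For the reader who wonders what exactly igraph computes under the name « inverse log-weighted similarity »: as far as we know, it is the measure of Adamic and Adar, in which two vertices are the more similar the more low-degree neighbours they share. In our own notation:

sim(u, v) = \sum_{w \in N(u) \cap N(v)} \frac{1}{\log \deg(w)}

where N(u) denotes the set of neighbours of the vertex u. A common neighbour of low degree (a rare word co-occurring with both) thus contributes more to the similarity than a ubiquitous hub like « is » does.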
Fragment 10: the 9th miracle of the world103

Language (N of vertices): Slovak 129941, Czech 160681, Hebrew 157375, Arab 139303, Russian 1497859, Mongol, Aymara.

Apple (word / relative rank / pagerank / inter-intra):
  Slovak   Jablko          11354   7.046e-06   4/7
  Czech    Jablko          13304   1.292e-05   1/6
  Hebrew   תפוח            10417   8.466e-06   2/6
  Arab     تفاح            14158   6.870e-06   5/7
  Russian  Яблоко           8625   7.777e-06   3/6
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   NOT PRESENT IN THE CORPUS

Breast:
  Slovak   Prsník           2065   1.290e-05   3/2-3
  Czech    Prs             16360   1.192e-05   4/7
  Hebrew   שד_(איבר)       13253   7.520e-06   5/7
  Arab     ثدي              1793   2.346e-05   2/2
  Russian  Женская_грудь   54126   2.859e-06   6/7
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   Ñuñu              219   0.00066     1/1

Milk:
  Slovak   Mlieko           7346   7.792e-06   7/6
  Czech    Mléko            2067   3.456e-05   3/4
  Hebrew   חלב              6407   1.150e-05   6/3
  Arab     حليب             5141   1.323e-05   5/3
  Russian  Молоко           2998   1.503e-05   4/5
  Mongol   Сүү               247   0.000882    1/1
  Aymara   Millk'i           259   0.00055     2/2

Wine:
  Slovak   Víno             2059   1.294e-05   4/2-3
  Czech    Víno             2170   3.334e-05   1/5
  Hebrew   יין              1889   2.003e-05   2/1
  Arab     نبيذ             5813   1.230e-05   5/4
  Russian  Вино             2741   1.626e-05   3/4
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   NOT PRESENT IN THE CORPUS

Man:
  Slovak   Muž              3542   9.878e-06   5/5
  Czech    Muž               905   5.737e-05   2/2
  Hebrew   גבר              8276   9.645e-06   6/4
  Arab     رجل              6483   1.144e-05   4/5
  Russian  Мужчина          1484   2.925e-05   3/2
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   Chacha           1320   9.929e-05   1/4

Woman:
  Slovak   Žena             3499   9.922e-06   5/4
  Czech    Žena             1048   5.397e-05   1/3
  Hebrew   אישה             8580   9.435e-06   4/5
  Arab     امرأة            7583   1.027e-05   3/6
  Russian  Женщина          2236   1.947e-05   2/3
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   NOT PRESENT IN THE CORPUS

God:
  Slovak   boh               350   3.645e-05   5/1
  Czech    bůh               389   9.543e-05   2/1
  Hebrew   אלוהים           1959   1.925e-05   6/2
  Arab     الله              268   6.929e-05   3/1
  Russian  Бог               971   4.21e-05    4/1
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   Tatitu *          526   0.00034     1/3

Isis:
  Slovak   Isis            67964   4.811e-06   1/8
  Czech    Isis            71739   3.180e-06   3/8
  Hebrew   איזיס           69627   3.010e-06   4/8
  Arab     إيزيس           24446   4.601e-06   2/8
  Russian  Изида          274978   6.879e-07   5/8
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   NOT PRESENT IN THE CORPUS

Comparison of 8 concepts (rows) within 7 wikipedia corpuses (columns). The pagerank entry specifies the calculated pagerank value of a given concept within a specific corpus; the corpus-relative rank specifies its position in the list of all the concepts ordered in descending order according to their pagerank (the concept having the highest pagerank has corpus-relative rank R=1, the second has R=2, etc.). The first number of the inter/intra pair specifies the INTERcultural importance (pagerank values ordered within the row of the concept); the second number specifies the INTRAcultural importance (pagerank values ordered within the column of the language). For example, the « wine » concept within the Arabic wikipedia has the lowest pagerank when compared with the « wine » concepts of the other corpuses – thus it is 5th interculturally. On the other hand, within the scope of the Arabic corpus alone, it is ranked lower than 1. « god », 2. « breast » and 3. « milk », but higher than 5. « man », 6. « woman », 7. « apple » and 8. « Isis ». It can also be easily seen that for the majority of cultures the god concept plays a much more important role than the other concepts we have chosen, the only exceptions being – quite surprisingly – the Hebrews104, the Aymara and the Mongols: for the tribe of Bolivian indians, the breast and the milk seem to play a more important role, and for the united tribes of Central Asian shepherds, milk plays the central role.

103 We have analysed the mySQL forms of the wiki corpuses freely available from http://download.wikimedia.org/
104 Is it because the signifier of Your god is not to be pronounced, or because You had chosen to prefer wine instead?

Fragment 8: Matriarchality measure

                          Slovak105    Czech        Hebrew       Arab         Russian
Woman (Pw)                9.922e-06    5.397e-05    9.435e-06    1.027e-05    1.947e-05
Man (Pm)                  9.878e-06    5.737e-05    9.645e-06    1.144e-05    2.925e-05
Matriarchality (Pw−Pm)    +4.4e-08     −3.4e-06     −2.1e-07     −1.17e-06    −9.78e-06

Matriarchality measure as a quantity obtained by the subtraction of the pagerank of the « man » concept from the pagerank of the « woman » concept. Such a subtraction adds a second normalization (the first normalization – allowing us to make intercultural comparisons – occurs during the calculation of pagerank itself) and allows us to compare cultures with – what seems to us – an even higher degree of relevancy. A negative value of matriarchality signifies, of course, patriarchality.

Fragment 11: Normativity argument

At a certain moment, the calculated data – in Google as well as within this text – cease to be merely explicative: they become normative. Verily, if a hypothesis or theory of the human or social sciences106 is adequate to reality – and thus true – it is often not because it would explain « anything », but because it conditions people to think and act as if they had understood « something ».

106 Take Freud's psychoanalysis, for example: are its complexes explained, or are they created in the first place?

Fragment 4: Posvätná laň (The sacred doe)

What we are trying to say here certainly sounds absurd at best and insane at worst. For in a certain sense we are indeed suggesting – and not merely suggesting – that if the category fruit had had a different prototype in the period of the first interpretations and translations of the cosmogonic myth known today as Genesis, if, say, it had not been Eve handing Adam an apple but Adam handing Eve a banana, or if God had decided – rozhod(ol|la), as one writes in Slovak when leaving the gender to a regular expression – to test the will of man not with a forbidden fruit but with a prohibition against harming a sacred doe, the age that has passed upon this Earth could have looked entirely different. No killing in the name of « love », no witch-hunts, no sacrificing of EROS upon the altars of LOGOS. Unity.
105 I hereby present these culture-relative Wikipedia (November 2008) concept importance lists for download:
aymara – http://localhost.sk/~hromi/research/breastANDapple/pageranks/AY.csv (<1 MegaByte)
arabic – http://localhost.sk/~hromi/research/breastANDapple/pageranks/AR.csv (10 MegaBytes)
czech – http://localhost.sk/~hromi/research/breastANDapple/pageranks/CS.csv (10 MegaBytes)
hebrew – http://localhost.sk/~hromi/research/breastANDapple/pageranks/HE.csv (10 MegaBytes)
mongol – http://localhost.sk/~hromi/research/breastANDapple/pageranks/MN.csv (<1 MegaByte)
russian – http://localhost.sk/~hromi/research/breastANDapple/pageranks/RU.csv (58 MegaBytes)
slovak – http://localhost.sk/~hromi/research/breastANDapple/pageranks/SK.csv (7 MegaBytes)
May they serve the purpose for which they were created. You can open them even in Excel.

Exit

And so, for the last time, You will be loath to part from the breasts of science.
Goethe J. W., Faust

The work presented here is an unfinished work. For that which is finished is immutable, and that which is immutable cannot be called alive. And since the text above wanted to be, first of all, a text about life, youth, spring and joy, the story of the breast and the apple contained within it has no firmly determined end. And yet this essay draws towards its close. The art of conclusion is the art of farewell, the art of uttering the most magical of all words. And therefore I now dedicate this work:

to my family: first of all to my mom Alena, for the fact that she was, is and forever will be – like all the other good mothers of the world – the best of mothers; to my sister Kristína, for being the only woman who can throw me off balance; to my father Daniel, for his industriousness; to my grandmothers Olga and Alžbeta, for letting me get to know, at least through their words, the charm of the first Czechoslovak republic; and to my nephew Oliver, for the fact that he is.

to my friends: to Lukáš K. for forgiving me; to Martin D. not only for the dog blanket and the 9th gate; to Ľuboš I. for his vaporizer, and to Juraj B. for the yearly tea rituals; to Andrej G. for his love of the stars; to Ivan P. for the circus; to Filip Z. for making, with his conspiracy theories, a proper mess in my head during a certain period of my life; to Miro P. for the never-aging support of Slovak cyberpunk; to Tomáš P., because – I believe – one of the copies of this work will help him win his battle with heroin; to Ján Š. for the initiation into PERL; to Levant for the help in the fight against the Mongolian cockroaches; and to Monkhsaikhan Ochirhuyag, for making it more than clear to me, by the river Orkhon, after my horses had bolted, that he is the most real man I have ever had the honour to meet.

to the beloved: to Daniela K. for being my first; to Jana B. for simply stopping me on Slavíkova street on the day of my 21st birthday; to Zuzana Dž. for the séance in a spruce forest; to Eva R-K-S. for the first lesson on how wondrously blind love can be; to Tereza S. for the second one; to Monika D., not only for bringing me my shopping when I sprained my ankle in Nice; to PhD. Carmen-Aline S. for giving me faith in my own self; to Dita B. for the crossing of the Gobi desert; to Kristína J. for not coming after me to Paris; to Barbora P. for lessons not only of French... as well as to many others, whose names are written down elsewhere.

to colleagues: to the users and above all the administrators of the discussion systems kyberia.sk and nyx.cz; to the employees of the company VOLNY in the years 2001–2003, of the company Etel in 2003–2004, of the company
IGNUM in 2005–2006, and of the hotel Manoir de l'Etang in 2007–2008; to all the little characters from the festival in Cannes; and to the « welcoming agents » on the Eiffel tower.

to schoolmates: from the Evangelical Lyceum in Bratislava, from the Faculty of Humanities of Charles University, from the Mongolian State University, from the Université de Nice and from the École Pratique des Hautes Études; to all the ERASMUS students in Nice in the years 2007 and 2008; to all who went through the courses of cognitive instrumental enrichment FIE I and FIE II with me; as well as to the wolf-cubs of the first Bratislava pack and to the scouts, rovers and leaders of the Dážďovky troop.

to the faces from journeys and the architects of places: to the neighbours from Haanova 44; to my roommates from the dormitory for foreign students in Ulaanbaatar, as well as to those from the halls of residence Hostivař, St. Antoine, Jean Medecin and Daviel; to all whom I have ever picked up while hitch-hiking; to the homeless man from Alma-Ata for his Čupačundra, and to his Parisian colleagues for the catacombs into which they will soon lead me; to the squatters Ian and Reuven from Cannes; to the girl named Mária from the Ulaanbaatar club Strings; to my students from the Mongolian language centres Cambridge and Absolut, and to the nameless Concubine from Huh-hot; to the masseuse Zlata from Kyiv; to the Rwandan Oracle from rue d'Alesia, for refusing neither my pearl necklace nor the little apple; to Gaius I.C. for Alesia, and to Gustave E. for the Tower; to Alena B. for the walks with her Irish setter, and to her husband for being the honorary consul of the Slovak Republic with the best sense of humour; to Altangerel and the lamas of the Khamriin Khiid monastery.

to my teachers:

from Nice: to Xavier B. and Olivier R. for the initiation into phonology; to C. Pagliano, Emilie and M. Olivieri for the initiation into the generativist doctrines; to J. Bonneau for the terminology; to J. P. Dalbera for his « bien joué »; to C. Hennebois for semantics and the initiation into PROLOG; to Mme Talon-Hugon for rhetoric; to Mr. Ali Benmakhlouf for Alice in the wonderland of logic; to Mr. Gauter for the structure of revolutions, not only the scientific ones; to Mme Kircher for Sanskrit; to Mr. Lavigne for philosophical lectures of high quality;

from Ulaanbaatar: to the graceful Batsukh, in memoriam; to Dzamyansuren for being the best Mongolian calligrapher, but also for getting, together with Battulag, more drunk on the school trip than all the students put together; to the three splendid elderly teachers of Mongolian grammar and literature whose names, alas, I no longer remember; to Dada Ajay for the lessons of sadhana; to Didi Ananda Kalika for her devoted singing and for the Lotus Children's Center;

from Prague: to prof. Sokol for having been my first tutor and, in a certain sense, remaining one for me to this day; to Ľ. Gabrišková, in memoriam, for Mesopotamian cosmology and P. K. Dick; to doc. Pinc for being the greatest philosopher of life I know; to Veronika Z. from the department of Mongolian studies for her boundless devotion; to doc. Murgaš for the Citadel, the energetic invariants and a Gestalt which I have not understood to this day; to dean Benyovzsky for his bicycle trips into the realm of ideas; to prof. Komárek for telling me straight that the subject of my inquiry is not his cup of coffee; to prof. Neubauer for his Christmas lecture; to T. Holeček for Wittgenstein and propositional logic; to Fulka for the psychoanalytic anecdotes; to the doctoral students of the philosophical module for the words « but that is nonsense! » at SAFM, with which they duly redirected my life's trajectory; to G. Málková for the « skills of thinking »; and, last but not least, to the supervisor of this work, Ján Havlíček PhD., for his not only professional but also human guidance in the final phases of the creation of this essay;

from Paris, from the University: the names of at least some of them are given in the bibliography section.

These were the souls without an encounter with whom this work would certainly never have come into being.
Better companions and better comrades I shall hardly find again in the labyrinth of life. May, then, at least in this way – through the creation of an « edge », a bond, between the « vertex » which represents this work, and therefore me, and the « vertex » which, within the graph G – the graph of gratitude which will surely one day be built, and perhaps is built already – represents the names of these companions, and thus, in a certain sense, in a certain very strong sense, those companions themselves; may this small Thank You of mine increase the amount of their being before the calmly smiling face of eternity.

Bibliography

Anandamurti S. (1961) Ananda Sutram
Brin S., Page L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117
Buber M. (1923) Já a Ty. Praha: Kalich
Cilibrasi R. L., Vitányi P. M. B. (2007) The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3
Drake C. (2000) The development of rhythmic attending in auditory sequences: attunement, referent period, focal attending. Cognition 77, 251-288
Eco U. (1980) Il nome della rosa107
Gervain J., Macagno F., Cogoi S., Peña M., Mehler J. (2008) The neonate brain detects speech structure. PNAS, September 16, 2008, vol. 105, no. 37, 14222-14227
Goethe J. W., translation O. Fischer (1982) Faust. Praha: Odeon
Gregor D. (2004) Fruchterman-Reingold graph visualisation algorithm [accessible online: http://www.boost.org/doc/libs/1_37_0/libs/graph/doc/fruchterman_reingold.html]
Haegeman L. (1994) Introduction to Government and Binding Theory. Blackwell Textbooks in Linguistics
Heidegger M. (2006) Básnicky bydlí člověk. Praha: Oikoymenh
Hesse H. (1955) Le jeu des perles de verre: Essai de biographie du Magister Ludi Joseph Valet accompagné de ses écrits posthumes. Calmann-Lévy
Hofstadter D. (1999) Gödel, Escher, Bach: an Eternal Golden Braid
Horwood L. J., Fergusson D. M. (1998) Breastfeeding and later cognitive and academic outcomes. Pediatrics, Vol. 101, No. 1
Hromada D. (2007) Moja prvá malá rozprava o metóde [accessible online: http://localhost.sk/~hromi/textz/2007/mpmrom.pdf]
Hromada D. (2012) Semantic Structures v2.3
Jackendoff R. (2002) Foundations of Language: Brain, Meaning, Grammar, Evolution. Oxford/New York: Oxford University Press
Jakobson R. (1971) Selected Writings I: Phonological Studies. Hague
Jenness R. (1979) The composition of human milk. Seminars in Perinatology 3 (3): 225-239
Lakoff G. (1987) Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. University of Chicago Press
Lalou F., Woda A. (2003) Tes seins sont des grenades: Pour en finir avec le Cantique des cantiques. Paris: Alternatives
Lécuyer R. (1996) L'intelligence des bébés. Paris: Dunod
Morris D. (1967) The Naked Ape. London
Nelson C. A. (2001) The development and neural bases of face recognition. Infant and Child Development, 10 (1-2)
New York Times (06/01/2009) Data Analysts Captivated by R's Power [accessible online: http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html]
Nietzsche F. (1883) Also sprach Zarathustra
Oberfalzerová A. (2006) Metaphors and Nomads. Praha: Charles University, Philosophical Faculty, Institute of South and Central Asian Studies, Seminar of Mongolian Studies
Piaget J. (1961) La psychologie de l'intelligence. Paris: Armand Colin
Seifert J. (1987) Les danseuses passaient près d'ici: Choix de poèmes. Actes Sud
Sokol J. (2007) Malá filosofie člověka & Slovník filosofických pojmů. Praha: Vyšehrad
Skripnik O., Lindová J. (2007) Posudek k metodologické práci studenta 9306. Praha: IS FHS UK
Tagore R. (1913) The Crescent Moon: Child-Poems. London: Macmillan
Teilhard de Chardin P. (1923) La messe sur le monde
Telegraph (2008) Breastfeeding photo ban by Facebook sparks global protest by mothers [accessible online: http://www.telegraph.co.uk/scienceandtechnology/technology/facebook/4029868/Breastfeeding-photo-ban-by-Facebook-sparks-global-protest-by-mothers.html]
Théoret H., Pascual-Leone A. (2002) Language Acquisition: Do As You Hear. Current Biology, Vol. 12, No. 21, pp. R736-R737
Wikipedia, The Free Encyclopedia (2009) Adjacency matrix [retrieved 21:28, January 19, 2009, from http://en.wikipedia.org/w/index.php?title=Adjacency_matrix&oldid=262381618]
Wilson R. A. (1983) Prometheus Rising. USA: New Falcon Publications
Wilson R. A. (2000) Ištařin návrat, aneb proč bohyně sestoupila do podsvětí a co nás čeká nyní při jejím návratu. Praha: Maťa & Dharmagaia
Wittgenstein L. (1917) Tractatus logico-philosophicus

107 Pulchra sunt ubera quae paululum supereminent et tument modice...

Web links to the main sources of inspiration:

AGAPE – http://en.wikipedia.org/wiki/Agape
Hilbert spaces – http://en.wikipedia.org/wiki/Hilbert_space
Regular expressions – http://en.wikipedia.org/wiki/Regular_expression
Graph theory – http://en.wikipedia.org/wiki/Graph_theory
Terminology of the ISO-704 norm – http://localhost.sk/~hromi/textz/2008/metaISO704.pdf
Pyramid texts – http://www.sacred-texts.com/egy/pyt/index.htm
The Coptic Gospel of Thomas in Context – http://www.geocities.com/Athens/9068/
The apple in the myths of the world – http://en.wikipedia.org/wiki/Apple_(symbolism)
Song of Solomon – http://www.bibliaaty.sk/biblia-Piesen-%C5%A0alamunova_PIES.html

Appendix 1: Illustration of the convergence of a stochastic matrix towards the value of its eigenvector

What we are trying to illustrate here is the method thanks to which we rid ourselves of the problem of cyclic mutual references, a problem which has hitherto always arisen before us whenever we tried to analyse a system in which A is determined by means of B, while B is determined by means of A. Without this method we would have no point of support, we would not know where to begin. With this method, we do. We illustrate it on the example of the « causal-diachronic » semantic matrix 4 from part 2.2:

a) in case the soul into which no external inputs enter begins to wander at the representation « ticho » (silence), the elements of the « initialization vector » are the probabilities of « to which symbol the soul will proceed from the symbol ticho », i.e. the values given in the column « ticho »:
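In the language of linear algebra – our own gloss upon the procedure implemented in Appendix 2 – the row vector v is simply multiplied by the stochastic matrix M again and again:

v^{(t+1)} = v^{(t)} M

and, for a well-behaved (irreducible, aperiodic) M, the sequence converges to a fixed point v* satisfying v* = v* M: the left eigenvector of M associated with the eigenvalue 1, independent of the initial v^{(0)}.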
initial vector: 0.349 0.111 0.349 0.016 0.174 0.001
iteration 0 : 0.349 0.111 0.349 0.016 0.174 0.001
iteration 1 : 0.168638 0.088742 0.136585 0.10892 0.219783 0.277332
iteration 2 : 0.19677187 0.147541796 0.18077067 0.11258138 0.218997164 0.14333712
iteration 3 : 0.170422961246 0.137401835622 0.144618006084 0.132482419104 0.23951108973 0.175563688214
iteration 4 : 0.17202823449374 0.15100588452236 0.149766681237554 0.134776102639864 0.237459578406382 0.1549635187001
iteration 5 : 0.167408565651975 0.149364533881373 0.143237330010219 0.138759890883417 0.242493954990974 0.158735724582041
iteration 6 : 0.16724939393067 0.152105079290878 0.143669866101764 0.139635436193834 0.242101105652149 0.155239118830704
...
iteration 42 : 0.16594741981802 0.152662182855418 0.142067798907887 0.141010764498759 0.243471641784574 0.154840192135343
...
iteration 49 : 0.165947419818014 0.152662182855423 0.142067798907881 0.141010764498765 0.243471641784579 0.154840192135337

b) in case the soul into which no external inputs enter begins to wander at the representation « Ň-p. », the elements of the « initialization vector » are the probabilities of « to which symbol the soul will proceed from the symbol Ň-p. », i.e. the values given in the column « Ň-p. »:

initial vector: 0 0.07 0.318 0.04 0.257 0.315
iteration 0 : 0 0.07 0.318 0.04 0.257 0.315
iteration 1 : 0.27966 0.129246 0.144417 0.111452 0.17393 0.161295
iteration 2 : 0.153886242 0.131632958 0.171016395 0.11525756 0.247808612 0.180398233
iteration 3 : 0.1839006114 0.146790670424 0.147067682125 0.13473720768 0.227536874986 0.159966953385
iteration 4 : 0.166086165671824 0.148031715406948 0.147382227517521 0.134661789558376 0.24401479601744 0.159823305827891
iteration 5 : 0.169185730467167 0.151084123997832 0.14378055017798 0.139600712608132 0.239969260189836 0.156379622559054
iteration 6 : 0.166327123909114 0.151650051127476 0.143146201065523 0.139490074038839 0.243468872121902 0.155917677737146
iteration 7 : 0.166593782151623 0.152253830108923 0.142533874354546 0.140677949796571 0.242702452145075 0.155238111443262
...
iteration 42 : 0.165947419818019 0.152662182855418 0.142067798907886 0.141010764498759 0.243471641784575 0.154840192135343
...
iteration 49 : 0.165947419818014 0.152662182855423 0.142067798907881 0.141010764498765 0.243471641784579 0.154840192135337

Our computation would have converged to the very same values, i.e.

M(Ň-present)  = 0.165947419818014
M(Ň-absent)   = 0.152662182855423
M(bliss)      = 0.142067798907881
M(pain)       = 0.141010764498765
M(face)       = 0.243471641784579
M(silence)    = 0.154840192135337

had we started it at any point of departure whatsoever (i.e. with any initial vector). The values given above are, namely, a property invisibly present in our matrix: its « eigenvector ».
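The convergence can also be cross-checked without any iteration whatsoever – a minimal sketch in base R, assuming the same 6×6 matrix as in Appendix 2:

M <- matrix(c(0,    0.07,  0.318, 0.04,  0.257, 0.315,
              0.07, 0,     0.03,  0.36,  0.43,  0.11,
              0.369,0.035, 0,     0,     0.21,  0.386,
              0.037,0.389, 0,     0,     0.556, 0.018,
              0.179,0.263, 0.126, 0.316, 0,     0.116,
              0.349,0.111, 0.349, 0.016, 0.174, 0.001),
            nrow=6, byrow=TRUE)
# the stationary vector v satisfies v M = v, i.e. it is the eigenvector
# of t(M) associated with the eigenvalue 1 (Re() drops a zero imaginary part)
v <- Re(eigen(t(M))$vectors[,1])
v / sum(v)   # normalised, it reproduces the values the iteration converged to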
In the case of the matrix which is perpetually recomputed at Google, it has empirically turned out that the individual values of this eigenvector express something which could be called « the essentiality of a page for the whole of the web ». In the case of our approach, they express something like « the essentiality of the given meaning for the whole of the mind of an individual (garden 2, construct 2) or of a society (garden 3, fragments 8 and 10) ». In the course of this work we have denoted this quantity as « mightiness », « PageRank » and « importance ».

Appendix 2: PERL code iterating the values of Appendix 1, alias « the sweet billion-dollar secret of the boys from Google »

# the six rows of the « causal-diachronic » semantic matrix 4 (cf. part 2.2);
# every row is stochastic, i.e. its values sum up to 1
@matrix=(
  [0,0.07,0.318,0.04,0.257,0.315],
  [0.07,0,0.03,0.36,0.43,0.11],
  [0.369,0.035,0,0,0.21,0.386],
  [0.037,0.389,0,0,0.556,0.018],
  [0.179,0.263,0.126,0.316,0,0.116],
  [0.349,0.111,0.349,0.016,0.174,0.001]
);
$,=" ";                              # separate the printed values by spaces
@vector=@{$matrix[0]} if (!@vector); # initial vector: the first row of the matrix
print "\ninitial vector: @vector";
for ($i=0;$i<50;$i++) {
    print "\niteration $i : ";
    print @vector;
    $j=0;
    foreach (@vector) {              # multiply the row vector by the matrix:
        $p1=$_;
        $k=0;
        foreach (@{$matrix[$j]}) {   # nvector[k] += vector[j] * matrix[j][k]
            $p2=$_;
            $nvector[$k]+=$p1*$p2;
            $k++;
        }
        $j++;
    }
    @vector=@nvector;                # the product becomes the next vector...
    @nvector=();                     # ...and the accumulator is emptied
}

Appendix 3: Questionnaires D2 and D3

Paper questionnaire D2 – Example

[Paste here one original filled-in D2 form]

Paper questionnaire D2 – Overall results

Out of the total number of 28 respondents (mostly female and male students of the 3rd year of linguistics at the Université de Nice, and hostesses and hosts at the international film Festival in Cannes), female breasts were associated with: milk by 28 respondents; fruit by 10 respondents; meat by 1 respondent; vegetables and bread by 0 respondents.

[In the original, the raw @SEINS,NOURRITURE@ answer sheet follows: for each respondent, the letters (a-e) of the associations chosen in each question, the fruit named as the prototype of the « fruit » category (pomme, fraise, peche, orange, ananas, banane, pomelos, mangue, melon...) and the respondent's sex (M/F).]

Internet questionnaire D3

The questionnaire is still active at the address http://localhost.sk/~hromi/quest/public/survey.php?name=FHS_Bakalarska_praca . At the moment of the writing of this work, 358 respondents had answered it; most probably they were, for the greatest part, users of the discussion systems kyberia.sk and nyx.cz, on which the link to the questionnaire was published. The users of these systems are mostly young people aged 15-35 with at least a secondary education, and we suppose our respondents are no different. Given that the users of these systems are for the most part cultivated young people versed in art, science or politics, we consider them ideal representatives of the bearers of the Slavic branch of the ISE culture of the beginning of the 21st century. That the central topic of the research was the female breast and its relation to the apple, the respondents learned only after having answered the questionnaire. The questionnaire was presented in the Slovak language; it can therefore be expected that the respondents were, in the overwhelming majority of cases, of Slovak or Czech nationality.
Although questions 1.3 and 2.3 were the crucial ones for our research and all the others served only to « mask » them, the other questions also yielded a number of interesting findings. Here we present the results concerning all the respondents.

1. Which concepts belonging to the category « liquids » do you associate most strongly with the concept « life »?

                       Men+Boys   Women+Girls   Total   Gender difference
vino (wine)              (2.6)       (2.5)      (2.6)        0.1
mlieko (milk)            (3.1)       (2.8)      (3.0)        0.3
voda (water)             (4.3)       (4.4)      (4.3)        0.1
kava (coffee)            (1.7)       (1.7)      (1.7)        0
krv (blood)              (4.1)       (4.0)      (4.1)        0.1

2. Which concepts belonging to the category « the 5 senses » do you associate most strongly with the concept « automobile »?

                       Men+Boys   Women+Girls   Total   Gender difference
zrak (sight)             (4.5)       (4.3)      (4.4)        0.1
hmat (touch)             (2.9)       (2.8)      (2.9)        0.1
sluch (hearing)          (3.6)       (3.6)      (3.6)        0
cuch (smell)             (2.3)       (2.3)      (2.3)        0
chut (taste)             (1.4)       (1.4)      (1.4)        0

3. Which concepts belonging to the category « foods » do you associate most strongly with the concept « female breasts »?

                       Men+Boys   Women+Girls   Total   Gender difference
maso (meat)              (3.3)       (2.8)      (3.2)        0.1
ovocie (fruit)           (3.4)       (3.2)      (3.3)        0.2
mlieko (milk)            (4.2)       (4.2)      (4.2)        0
chlieb (bread)           (1.9)       (2.0)      (1.9)        0.1
zelenina (vegetables)    (1.8)       (1.6)      (1.7)        0.2

4. Which concepts belonging to the category « animals » do you associate most strongly with the concept « man »?

                       Men+Boys   Women+Girls   Total   Gender difference
jelen (stag)             (3.2)       (3.4)      (3.3)        0.2
vtak (bird)              (3.1)       (3.2)      (3.1)        0.1
pes (dog)                (3.2)       (2.7)      (3.0)        0.5
zralok (shark)           (3.0)       (2.8)      (2.9)        0.2
opica (monkey)           (2.7)       (2.2)      (2.5)        0.5

5. Which concepts belonging to the category « the 5 elements » do you associate most strongly with the concept « Woman »?

                       Men+Boys   Women+Girls   Total   Gender difference
vzduch (air)             (2.6)       (2.5)      (2.6)        0.1
zem (earth)              (2.8)       (3.3)      (3.0)        0.5
ohen (fire)              (3.9)       (3.8)      (3.9)        0.1
eter (ether)             (3.0)       (2.8)      (2.9)        0.2
voda (water)             (3.1)       (3.0)      (3.0)        0.1

6. Which concept is, according to You, the best representative of the category « flowers »?

                       Females   Males   % of Females   % of Males
ruza (rose)               54      126       49.091        55.752
margaretka (daisy)        25       38       22.727        16.814
tulipán                    4        0        3.636         0
orchidea                   3        0        2.727         0
lalia (lily)               2        3        1.818         1.327
tulipan                    2       10        1.818         4.425

...followed by a long tail of answers given by one or two respondents only (chryzantema, dalia, farby, frezia, gerbera, kopretina, kytica, mak, marihuana, muskat, narcis, púpava, sedmikrasky, slnečnice, fialka, hlavacik jarny, hyacynt, konvalinka, lilie, lotos, lucne kvety, magnolia, oxalis triangularis (kyselka), pampeliska, rododendron, sedmokraska, slnecnica, tulipany, vlčí mak, zanebudka, zive kvety...), among them one respondent who wrote: « I originally wanted to write tulipan here... but once I read the answer ‹ rose › I would say rose :) ...don't influence people .) »
8. Which concept is, according to You, the best representative of the category « fruit »?

                       Females   Males   % of Females   % of Males
jablko (apple)            67      137       60.909        60.619
jahoda (strawberry)        6        5        5.455         2.212
banan (banana)             4       12        3.636         5.31
hrozno (grapes)            4        4        3.636         1.77
jahody                     4        4        3.636         1.77
pomaranc (orange)          4        6        3.636         2.655
ceresna (cherry)           3        0        2.727         0
broskyna (peach)           2        2        1.818         0.885
mango                      2        3        1.818         1.327
pomeranč                   2        0        1.818         0

...followed by a long tail of answers given by one or two respondents only (černice, banán, broskev, čerešňa, malina, ananas, citron, dužina, grapefruit, hruska, marhula, melon, mrkva ;), nashi, passion fruit, slivka, stavnata broskyna (a juicy peach), zakazane (the forbidden one), « hmmm... a woman? », and a dozen orthographic variations upon the apple itself: jabklo, jabko, jablka, JABLKO, « jablko predsa » (an apple, of course), jabloko, jabluko...).

7. Which concept is, according to You, the best representative of the category « domestic animal »?

                       Females   Males   % of Females   % of Males
pes (dog)                 80      173       72.73         76.55
macka (cat)               21       30       19.09         13.27
krava (cow)                3        1        2.73          0.44
prase (pig)                2        0        1.82          0
akvariova rybicka          1        1        0.91          0.44
pavuk (spider)             1        0        0.91          0

...followed by a long tail of answers given by one or two respondents only (andulka (a budgie), clovek (a human), zirafa, kockodan, kon (a horse), koza (a goat), morske prasa (a guinea pig), papagaj, potkan (a rat), rybicky, « sfetovany spolubyvajuci » (a stoned flatmate), svab (a cockroach), svina, vysavac (a vacuum cleaner), zajac (a hare)...).

In conclusion, we allow ourselves to direct the reader's attention to several interesting findings:

– The fact that the strongest association of all is @Sight,Car@ = 4.4 suggests to us that man is a visual being. That hearing follows (3.6) is no surprise; what is striking, however, is the occurrence of touch (2.9) far ahead of smell (2.3). Could man, after all, be a much more « tactile » being than we have thought so far?

– One of the most striking secondary revelations of our research is @Fire,Woman@ = 3.9 – the fourth highest value of our whole survey, right after @Sight,Car@ = 4.4, @Water,Life@ = 4.3 and @Milk,Female breasts@ = 4.2. It is also noteworthy that men associate Woman with fire (3.9), water (3.1), ether (3.0) !!! and only then with earth (2.8), while women associate their own selves somewhat more weakly with fire (3.8), then with earth (3.3), water (3.0), and only then with ether (2.8).

– A young man pondering which flower to buy for his beloved may find it useful to learn that while for more than 55 percent of men the prototype of flowers is the rose, this is the case for only 49 percent of women. This seems to be caused mainly by the flower called « margarétka » (the daisy), which was chosen by circa 22.7 % of women and by approximately five percent fewer men.

– While respondents of the male sex most often associate man with the animals « dog » and « stag », both with the weight 3.2, for respondents of the female sex the man-dog association (2.7) comes only fourth, after the stag (3.4)108, the bird (3.2) and the shark (2.8). Are we, gentlemen, harbouring too many illusions about our fidelity, or do you know all too well, ladies, that we are eternal cuckolds?

A more capable reader may discover many other useful findings in the data. For him, the « raw data » in CSV-SPSS format are available at the address: http://localhost.sk/~hromi/quest/FHS_Bakalarska_praca.csv

108 My beloved is like a roe or a young hart.
(Song of Songs 2:9)

Make haste, my beloved, and be thou like to a roe or to a young hart upon the mountains of spices. (Song of Songs 8:14)

I charge you, O daughters of Jerusalem, by the roes, and by the hinds of the field, that ye stir not up, nor awake my love, till she please. (Song of Songs 3:5)

A closing note for the FHS

It was objected: « But this work has no method! »
And the answer was not unlike a Zen koan: « To have no method was our method. »
To which something like this could be retorted: « In that case, however, it is not a scientific work. »
The defence against such an attack goes as follows: « A scientific work it indeed was not. It was a bachelor's essay. We understand the word essay in the sense of the French essai – an attempt. We attempted to put into a few dozen pages everything we want to tell the world, everything which, at the present stage of our development, carries some kind of meaning for us. Perhaps even everything that ever carried any kind of meaning. As with every attempt, however, we are prepared to taste also the bitter fruit of incomprehension and failure. »
« And that pathetic use of the first person plural of Yours! »
« That is because I wanted to pay homage to the giants upon whose shoulders I stood. To You, for example. »
« And that absurd prattle about Hilbert spaces, graphs, matrices, regular expressions, about some ‹ šémy › and ‹ prírazy ›! »
« I admit that in places I let myself be carried away a little, and that the sea called text, living its own life, often carried the rickety little boat of my mind towards unexplored islands. I admit that I often got lost, I admit that I often stepped completely astray, that I was often completely off. And yet I have fulfilled the central intention of my work. »
« What, then, was the overall intention of Your work? »
« To create in the mind of the reader – in Your mind – a semantic bond between the « apple » and the « breast » so strong that only an advanced stage of Alzheimer's disease, or death, will be able to disturb it – and perhaps not even that. To bring it about that whenever You see an apple, You will remember the one who gave You life, and the one who gave Your Life its meaning. »
« Do You think You have succeeded? »
« I think I shall fail only with those before whose inner eye, after the appeal ‹ during the next 23 seconds, please do not think of pink elephants ›, no pink elephants will parade. And such people are few. »
« Are You prepared for people to take You for a madman after reading this text? »
« Yes. »
« And are You prepared for Your work not being accepted? »
« Yes. »
« What will You do if Your work is not accepted? »
« One day, as an old man, I will finish it and deposit it in the Library, in the place where it belongs. »
« And what will You do if it is accepted? »
« The same. »

14. 10. 2008, Paris

« Blessed are the womb that bore You and the breasts that fed You. »
Logion 79, the Gospel according to Thomas, Codex II of the Nag Hammadi scrolls

23 comments to the Chomskian Doctrine

Within this text, I (or « we », as the circle of those who adhere to the ideas presented hereby will be called) propose some objections against the set of Chomskian theories (which will be labelled « the Doctrine »). In the beginning, the intention was not to write a scientific article, but simply to save for eternity a few notes of a layman who wants to protect his laicity, at least in some form, before he becomes completely re-programmed by the Doctrine. It is possible that the text contains many contradictions with itself. Similarly to the Chomskian theories, to science, and to knowledge in general, this text evolves in time, and thus it can happen that what has been perceived as a serious flaw in the « Initialization problem » is a solution to the determinist/non-determinist dilemma in the « Halting problem ».
It is also very much possible that the majority of the problems proposed hereby have already been addressed, either by Chomsky and the group of his « fidèles », or by his « adversaries ».

1. The initialization problem

During the last class of syntax, students were given these two sentences:

1. La lecture de ce livre a été conseillée aux étudiants.
2. Le chat de la voisine semble être nourri par le concierge.

In the first case, the V « conseiller » was projected from the lexicon to the D-structure as « être conseillé », and the I thus received the value [+AGR]. The fact of +AGR allowed, during the creation of the S-structure, the movement of the NP « la lecture... » from the position bound to the V-bar of « être conseillé » to the initial empty position, where « la lecture » received the Nominative case.

In the second case, the V « nourrir » was projected from the lexicon to the D-structure as « être nourri », and the I to which the V was bound thus received the value [-AGR]. The fact of -AGR denied, during the creation of the S-structure, the movement of the NP « le chat » from the position bound to the V-bar of « être nourri » to the second IP. Therefore the whole NP had to « move » even further and got bound to the first IP, which was +AGR.

Now, the question of an enfant terrible student was this: why, in the case of the first sentence, did we insert the verb from the lexicon in the conjugated form, thus creating the +AGR condition, and in the second case in the infinitive form, thus creating the -AGR condition? The answer was: because such is the case in the resulting sentences (phrases d'arrivée). And now the really terrible question of the enfant terrible student follows: but how can we know what the resulting phrase WILL BE while we are still in the process of its derivation? In other words, how can a result influence its own construction? That is cabala, not science.

This problem is, of course, not obvious to students reading books or doing exercises during classes, because the « resulting sentence » is already present right there, in front of their eyes. From the very beginning they know the result, they reason the « initial conditions » out of it, and then they are happy that from those very causes they arrive at the result from which they started... But in a real-life situation, when a speaker is supposed to generate a sentence, there cannot be any « resulting sentence » present. At least not for a generative grammarian. For if there were, in some sense, a « resulting sentence » already present – for example in some potential, virtual form of a « pattern-template » against which the phrase-being-constructed is matched – this pattern-template would itself have to be either generated or taken from memory. If it were generated, it would need another pattern-template to be matched against, and so on, ad infinitum: in other words, we would be posed in front of the problem of « infinite regress ». The option « taken from memory » is unacceptable for a generative grammarian, because in such a case every sentence would potentially need its own pattern-template to be stored in memory. The result would be a huge number of patterns stored in memory, and no need to generate anything. We will address this problem more closely in the « Halting problem » argument, as well as in some others.

Returning to our sentences, we now try to offer this quickly-made possible solution to the problem, trying to stay at least a little bit faithful to the framework of the Chomskian theories:
To arrive at the D-structure of « le chat semble... », something more than the lexical items has to be inserted: some additional information saying that the verb will be in the infinitive. This very same « information » will « trigger » the derivation of phrase 1 from the lexical items, and not the derivation leading to « Il semble que le chat de la voisine est nourri par le concierge », which would also be a second valid derivation out of the very same lexicon (if we suppose – as many modern theories do – that « il » is, in this case, not an independent lexical item).

2. Argument of a layman coming from a foreign country: do flying fish have wings?

What grammarians have called « cases » for hundreds of years is much more closely related to thematic roles than to a « position in the D-structure » or to some « governing by V/N/whatever ». In other words, cases are – at least for us Slavs – much more morpho-semantic than syntactic entities (in the sense where « syntactic » means « in relation to the position within the sentence »): for example, the answer to the question « Who? What? [kto? čo?] » is for me in the Nominative, the answer to the question « To whom? To what? [komu? čomu?] » in the Dative, etc. The « magic » of the cases is not hidden in the fact that one word/part of the sentence strongly influences another word/part of the sentence (that would be a similar « discovery » as to find out that sentence A significantly influences the understanding of sentence B which follows...), but in the fact that we use morphology to do so. I simply do not understand why the creators of the Doctrine have chosen the very same name, Cases (and even with the capital!), to designate the set of solutions to the technical problems of their theory – solutions which do not have very much in common with the cases of existing natural languages.

Just a small example of how small Slavic nations can use their « cases »:

Nominative: hovorí láska (love is speaking/telling)
Genitive: hovorí (z) lásky ((s)he is speaking out of love, OR love is the reason of her speech)
Dative: hovorí láske ((s)he is speaking/telling to love) [metaphoric but acceptable]
Accusative: hovorí lásku ((s)he is speaking love) [maybe not acceptable to some orthodox purists]
Locative: hovorí (o) láske ((s)he is speaking ABOUT love)
Instrumental: hovorí láskou ((s)he is speaking BY love) OR hovorí (s) láskou ((s)he is speaking with love)

The preposition « o » in the Locative protects the case from a semantic collision with the morphologically identical Dative (at least within all 4 declension paradigms of the feminine gender); the preposition « z » either protects the case from a semantic collision with the Nominative of the plural form « lásky », or is a preposition in its own right, meaning « from ». To prove our point, we feel no need to decide this sort of dilemma. Thus we can construct a sentence like:

« Láska láske z lásky hovorí lásku a o láske s láskou. »

It has 6 components which can be freely permutated among each other, thus forming 6! = 720 possible sentences, the positional constraints being only stylistic (see the sketch below). While some of the results could possibly be labelled « poetic », especially at the beginning of the lecture, we doubt that any reader could rightfully justify his stance when calling these sentences ungrammatical (especially after the reader's competence « accommodates to the pattern » upon reading a few hundred permutations). The goal of this small exercise was to show that one verb of the Slovak language can, in some extreme cases, assign all types of cases, and even to the same noun.
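A toy enumeration of those 720 word orders – our own illustration in base R, where both the recursive perm helper and the segmentation of the sentence into six components are assumptions made for the sake of this sketch:

# the six permutable components (our own, guessed segmentation of the example)
comp <- c("Láska", "láske", "z lásky", "hovorí lásku", "a o láske", "s láskou")
perm <- function(v) if (length(v) <= 1) list(v) else
  do.call(c, lapply(seq_along(v), function(i)
    lapply(perm(v[-i]), function(p) c(v[i], p))))
sentences <- sapply(perm(comp), paste, collapse=" ")
length(sentences)    # 720 = 6!, as promised
head(sentences, 3)   # a few of the more « poetic » permutations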
If such a situation occurs, position plays only a minor cosmetic role (the only exception being the clitics), and the assignment of the correct case is determined by morphology (in the passive, hearing-parsing performance situation) and by semantic roles (in the active, producing performance situation). Because it sheds almost no light upon the beauty of the « cases » of Sanskrit or of the Slavic languages, would You please be so kind as to choose a different term for Your Universal Theory of Cases?1

An additional comment: imagine the sentence « He is student »:

      p
     / \
    N   VP
    |   / \
   He  is  student

In Slovak, Czech or Polish we say: « On je študent. » (the pronoun is Sg. Masc. Nominative, the noun also in the Nominative). How is it possible that the verb « to be » assigns the Nominative not only to its external NP (the pronoun) but also to its « object », the internal NP? How would You deal with this situation? You can:

– Say that in our language the Nominative can be internal as well as external. In such a case Your definition of Your Nominative has offered us no information; it is an empty tautology similar to « either he is alive or he is dead », « either he is stupid or he is not », etc.

– Forget hundreds of years of Tradition and say that the case assigned is not the Nominative but some different case (in French, for example, that would be the Accusative, because it is an internal NP and the V is not V-Dative). In such a case we would kindly allow ourselves to concentrate Your attention upon the fact that « this new case of Yours » would be completely redundant and useless for the theory of our language, for no matter what the noun is, its case-signifying morpheme in this position is always (for 12 paradigms in the singular + 12 paradigms in the plural) identical to the case-signifying morpheme of the Nominative (or of the Instrumental, as You will see later...). What a coincidence!

– Make an « exception » for the word « to be », saying that it can do something very special: that it can in fact have two external theta-roles. Thus we can do something like:

   |----IP----|
   |    |     |
   N    V     N
   |    |     |
   He   is  student

1 Or You can maybe try to persuade us that what is an indispensable part of our linguistic heritage is, in the « underlying » reality, not cases at all. After all, the generativist phonologist Schane succeeded in convincing the world (and even the French phonologists!!!) that French, in its « underlying » reality, does not contain any nasal vowels – and we may even call this solution the most elegant one. But there is a small problem: it is a complete heresy against the basic axiom of the Chomskian doctrine: p → N VP.2

2 We will attack this axiom in the « I Love You » argument and, more deeply, in the anti-Euclidean argument.

... And during this analysis I have tried to leave aside the fact that we can express the very same meaning by saying: « On je študentom. » (He is student; the pronoun in Sg. Masc. Nominative, the noun in the INSTRUMENTAL!) ...So now what? Is the assignment of the Instrumental, in contrast with the Nominative case of the first example, driven by the position of the stars?

3. Halting problem

In this section we will use a method similar to the reductio ad absurdum of the old scholastic Masters, in order to show that something like a generative grammar G « capable of producing an infinite number of terminal sentences out of a set of lexical items S by applying generative rules R » seems to be a chimera. Let us assume that such a grammar G exists. We can thus ask the question: how have we obtained this infinity of terminal sentences?
Because even a beginner in mathematics knows that if we want to get from a finite number N to an infinite number I, we must either multiply N by another number J which is itself infinite, or apply some operation/function/rule F upon N infinitely many times, we see, at the beginning, only these solutions to the question « How can a grammar G be possible? »:

– 1: the set of rules R which is applied upon the set of lexical items S is itself infinite;
– 2: the set of rules R is finite, as is the set of lexical items S, but the number of times T we apply an operation/derivational rule (belonging to R) can be potentially infinite;
– 3: the set of rules R is finite, the number T of operations is finite, but the set of lexical items S is infinite;
– 4: the set of rules R is finite, the number T of operations is finite, the number of lexical items S is finite, but S is forwarded to the input of the first derivational rule (to the D-structure) in infinitely many variations V.

We immediately see that the first solution is invalid, for an infinite number of rules would have to be stored somewhere, and all the possible storage spaces (brain-memory, DNA, etc.) are finite. The same argument applies to solution 3: the set of lexical items used by a given person is necessarily finite. Thus the only sources of the « desired infinity » we can see are solutions 2 and 4.

Let us first look closer at solution 2, where the magic is hidden in the « infinity of the number T of applications of a rule belonging to the finite set of rules R ». In other words, a rule can be applied more than once; in fact, in order to generate an infinite number of sentences, it can be applied infinitely many times, if the given input allows it. In the framework of the Standard Theory, let us imagine for example the Deep structure:

I know # I am #

We can apply a rule RTind upon it, obtaining:

I know THAT # I am #

and upon this structure we can once again apply the same rule:

I know THAT THAT # I am #

– et caetera, et caetera, ad infinitum.
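A toy mechanisation of this derivational treadmill – our own sketch in base R, where the function RTind is a deliberately naive stand-in for the rule above:

s <- "I know # I am #"
RTind <- function(x) sub("know", "know THAT", x)   # applies the rule once
for (i in 1:4) {                                   # the bound 4 is purely external...
  s <- RTind(s)
  cat("application", i, ":", s, "\n")
}
# ...nothing inside the rule itself ever tells the derivation to stop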
And as usually, similiarly to a good hacker-programmer, You’ll try to adapt Your model to this problem, You’ll think a while and You’ll propose this ad hoc solution, which, as almost all ad hoc solutions, will make Your model less elegant, less scientific, less comprehensible to un-initiated (and thus un-reprogrammed) and You’ll say: « This is a serious problem. But I postulate an existence of a procedure P which will could be potentially capable to know en avance whether the derivation will finish, or will lead us into the « infinite loop » abbyss. Thus we’ll still be able to apply some rules infinitely many times, when needed, but we’ll never fall into an infinite loop ». If this would be Your solution, we’re sad to remind You , that according to the father of informatics , a man owning of the most brilliant minds of the 20th century named Alan Turing, such a universal procedure P capable of deciding whether a given programme ( a set of instructions ) will ever halt or not DOES NOT EXIST for a deterministic machine (cf. http://en.wikipedia.org/wiki/Halting_problem ) . And afterwards You’ll maybe try to offer another ad hoc solution and say that Your model is in fact a non-deterministic one. In such a case we’ll be very happy that You had finally arrived to the conclusion to the fact that « human being is more than a machine ». But until that moment, which will maybe come while we’ll become accustomed to Your « Minimalist programme », we have to express serious concerns for all previous G.G theories which seem to us to be very much deterministic. We repeat once again: such theories  either lead to the emergence of infinite loops during the derivational process  or to the impossibility to generate infinite number of terminal sentences out of finite number of variation of lexical inputs, upon which rules (or just a rule) can be possibly applied infinitely many times This was our answer to the possible solution 2. 3 4 5 Even while we try to show that generative grammar syntax is more or less an « impasse » in the evolution of linguistics, similiarly as was Ptolemaic geocentric approach to cosmology, we nonetheless had to admit that, it helped us to shed some light upon the pathologic case of « beguement ». And another contraint for three steps: (A to B, B to C, C to A) is forbidden , another constraint for four, another for five etc...Quite a lot of them in the end, n’est-ce pas? One can for example imagine a procedure: If the derivation is not finished within certain period of time, pass what You had already obtained to the phonological layer. But in such a case, the number of sentences possibly produced will be finite, since the « certain period of time » constant is also finite. G will thus not be allowed to generate infinite number of terminal sentences. The last possible way how we could possibly establish existence of generative grammar G capable of producing infinite number of sentences was the solution 4: « the number of lexical items S is finite, but is forwarded to the input of a first derivational rule ( to the D-Structure) in infinitely many variations V ». Imagine for example a speaker whose lexique contains only the items {I,You,know} , and the only rule R – already mentioned RTind . 
We can thus imagine that this speaker lives in a highly developed society of telepaths where the only purpose of linguistic exchange are affirmations of the following kind:

Variation   Lexical input                        RTind          Result
V1          {I, know}                            not applied    I know
V2          {You, know}                          not applied    You know
V3          {I, know, You}                       not applied    I know You
V4          {I, know, You, know}                 applied once   I know that You know
V5          {You, know, I, know, I, know}        applied 2x     You know that I know that I know
V6          {I, know, You, know, You, know}      applied 2x     I know that You know that You know

etc., possibly ad infinitum. Truly, in such a case we have a true generative grammar G capable of producing an infinite number of sentences out of a finite set of lexical items by applying a finite number of rules. The only problem is that... in order to have such an infinite number of terminal sentences, the number of varieties V of lexical inputs being inserted into the D-structure... has to be infinite. And thus this wonderful generative grammar G of Yours does not, in fact, shed any light upon the generativity of language, because the real generativity of the language is hidden in the fact that the lexicon is capable of passing infinitely many varieties of its items to the syntactic component6. We will come back to this « generativity of the lexicon » within the « argument from poetics »; here we just proposed it as a last possible answer to the question « How can a generative grammar be possible? ». We have shown that the generative grammar G is technically impossible in the case of solutions 1 and 3, has some serious difficulties with infinities in solution 2, and is completely useless in solution 4. We propose to change the model.

4. Connectionist's argument: What??? Infinity of sentences???

5. Orator's argument

There exists a rhetorical figure called anacoluthon, which consists in breaking a rule of syntax in order to achieve a desired effect upon the public. Shakespeare used it, for example: « Rather proclaim it, Westmoreland, through my host, That he which hath no stomach to this fight, Let him depart. » The existence of such a figure, especially within the works of the biggest Masters of language, poses before the Doctrine a problem which she will never surmount. We can formulate it like this: syntax only makes things neater and more ordered afterwards; it is not the master, but just a servant, a « femme de ménage », of semantics.

Imagine that after years of exhaustive research, the set S of fundamental rules (we can call them axioms) of the Universal Grammar is found and explicitly formulated. A rhetorician O comes afterwards with the intention to strongly influence the public and, knowing well that « if You want to impress people, You have to be non-violently different », he will apply his new figure, called from now on the « universal/Chomskian anacoluthon », which can be described like this: « Take any rule R out of the Universal Grammar S. Create a completely new rule R', or create it out of R, so that R and R' are not consistent (R' is the negation of R). Add this new rule R' to the set of rules which remain in S, thus obtaining S'. Generate all the sentences T' according to this new set of rules. » The result would be that the sentences produced by O would not be generated according to the rules of the Universal Grammar S, but according to the rules of another grammar S', which is not consistent with S. Thus, if there exists a human being7 who considers the T' sentences grammatical, it follows that the grammar S is not universal.
4. Connectionist's argument: What??? An infinity of sentences???

5. Orator's argument
There exists a rhetorical figure called anakoluth (anacoluthon), which consists in breaking a rule of syntax in order to achieve a desired effect upon the public. Shakespeare used it, for example: « Rather proclaim it, Westmoreland, through my host, That he which hath no stomach to this fight, Let him depart. » The existence of such a figure, especially within the works of the greatest Masters of language, confronts the Doctrine with a problem which it shall never surmount. We can formulate it like this: imagine that, after years of exhaustive research, the set S of fundamental rules (we can call them axioms) of Universal Grammar is found and explicitly formulated. A rhetorician O comes afterwards with the intention of strongly influencing the public and, knowing well that « if You want to impress people, You have to be non-violently different », he'll apply his new figure, called from now on the « universal/chomskian anakoluth », which can be described as follows: « Take any rule R out of Universal Grammar S. Create a completely new rule R', or create it out of R, so that R and R' are not consistent (R' is a negation of R). Add this new rule R' to the set of rules which remain in S, thus obtaining S'. Generate all the sentences T' according to this new set of rules. » The result would be that the sentences produced by O would not be generated according to the rules of Universal Grammar S, but according to the rules of another grammar S', which is not consistent with S. Thus, if there ever exists a human being7 who considers the T' sentences grammatical, it will follow that the grammar S is not universal. And since we could apply a similar procedure to no matter what set of rules, no set S will ever be universal. You can, of course, raise objections like these:

1. Since this new grammar S' will not be consistent with the Universal Grammar S, no human being will understand it; it will thus not be a human grammar at all, and the sentences produced will not be sentences of a human language.
2. One thing is to explicitly formulate the rules of Universal Grammar; another thing is to construct sentences according to these rules. In fact, all the processes of U.G. are realized on the sub-conscious, functional-mind level, and what we observe are only the outputs of these processes. We can say that the access of our consciousness to U.G. is read-only: we can know the rules, but we cannot consciously apply them or use them to construct sentences.

We'll return to objection no. 1 in the « power of stimulus » argument. For the moment, let's just consider it as the question « Could a human being – except the orator O – who perceives the set of sentences generated by rules which are not consistent with the Universal Grammar, ever consider them grammatical? », which can be decided by scientific means – by an experiment. If You decide to use the second objection to save Your system, we have to warn You that You'll make Your Doctrine very much impotent.8 Because if You say « it's not possible to consciously apply the rules of U.G. », You'll lose a great deal of justification for « drawing trees and writing derivations for Your students », because drawing trees and writing derivations is, in the end, nothing else than an attempt to consciously apply the rules of U.G.9 In other words: from the moment You explicitly formulate Your U.G., it will be in serious danger of losing its universality because of the « chomskian anakoluth » which will surely follow. And to save its universality by not formulating it at all, by saying « it's not possible to formulate it, yet it exists », would not be far from the disputes concerning the ontological proof of the existence of God, coming from the dark ages into which we simply don't want to fall once again.

7 Except the orator O, of course. For if his new figure has no positive effect upon the public, he will most probably end up in an institution for the mentally disturbed, where some wise man with a diploma, a title and a white coat will write into his diagnosis « speaker incapable of producing syntactically correct sentences » – and we cannot take seriously the linguistic judgments of such a person, can we? On the other hand, if he succeeds, he'll be celebrated as a genius, and his new grammatical rule will maybe even become a NORM. That is the difference between the madman and the genius: success.
8 You'll be like a mathematician trying to explain the idea behind a newly discovered Operator, saying to his students: « So, You see what this Operator does? Unfortunately, we cannot do any exercises, because we don't know any objects with which it does what it does – for if we had known them, it would follow that we wouldn't be able to apply this Operator! »
9 And exercises of the form « explain why the sentence S: "sentence this grammatic not is" is not grammatical » are in fact the first examples of the application of the chomskian anakoluth.

6. Saussurian universal darwinism argument
7. Adequacy with the world
8. Empiricists' argument – the competence-and-performance mess
9. Popperian argument and falsifiability
10. Argument from poetics
11. Arguments from other arts
12. Argument from children
13. Argument from an anarchist – the normativity of G.G.
14. Argument « I Love You »
You try to persuade me that every grammatical English sentence can be analysed as N VP.10 Asked to analyse the sentence « I Love You », You automatically do something like:

          p
         / \
        N   VP
        |  /  \
        I Love You

and if You are a mentalist, You will even try to persuade me that something like THAT structure exists in my very head. And then I will tell You: « but the things can be observed, analyzed differently ». And I will draw:

    a)    p            b)    p            c)      p
         /|\                 |                   / \
        N V N            N---V---N           core   N
        | | |            |   |   |           /  \   |
        I love You       I  love You        N    V  |
                                            |    |  |
                                            I  love You

and if You're honest to truth, at least for a short while You'll have nothing to say. Because even if You don't know it, You shall at least feel that there CAN BE cultures which mentally represent the love relation in such a mutually symmetric fashion (examples a and b), or maybe even in a « the-other-one-is-primary » fashion (c). And then, because You are also honest to Your Doctrine, You'll add: « But that breaks all the most essential rules ». And I will respond: « Not rules, but conventions, which a community of post-war syntacticians voluntarily chose, driven by a need to facilitate the communication among its members, and which are being imposed upon the new generation ».
15. Russell's argument – G.G. as a formal system
16. Lobachevsky's (anti-Euclidean) argument

10 In X-bar theory You'll add some bars and Is and Ps here and there, but the overall Cartesian architecture of Your system, based upon the assignation of a special place to the subject, stays the same.

17. Hilbert's argument
18. Gödelian argument (or a little K.O. de grâce)
19. post-Freudian argument
20. Kuhnian argument
21. Skinnerian « power of stimulus » argument
22. Memetician's argument
23. Copernicus' argument

Ironic remarks concerning the lecture of Haegeman's Government and Binding manual

97 – (Kayne proposes)... that the relevant parameter distinguishing VO languages from OV languages is related to the application of the leftward movement rule.
(A distinguished Anglo-Saxon academician proposes)... that the relevant parameter distinguishing the Arabic script from the classical Latin script is related to the right-left movement of the hand only. For as everybody (especially a well-formed Anglo-Saxon) knows, it is the left-right movement which is universal, present in the deep layers not only of cerebral structures but of DNA itself, and all empirical data which are not in accordance with this universal principle are just surface phenomena.

106 – We must introduce some parameter to distinguish configurational languages from non-configurational languages.
And we must, of course, introduce some parameter to distinguish a language from non-language. And it will be coded like this: if mother is speaking it (if its utterances are much more frequent and intense than any others), it is a language.

145 – Apart from the identification of verb inflection, we shall not be concerned with the decomposition of words into morphemes either.
Such an approach is, from the global point of view, similarly absurd as saying: « Apart from the identification of Slavic/Sanskrit cases, we shall not be concerned with the decomposition of words into morphemes either. For every case we'll create a nice CaseP (analogous to IP), and when it comes to the conjugation of verbs, we'll invent a "Theory of verb agreement module" which we'll position between the D-Structure and the S-structure. » Why not?
143 – It is easy to see that the more elements are involved, the more choices are available. (« Language acquisition as a defense of the binary branching theory »)
Counter-argument: it is easy to see that for a child in the early stages of its language acquisition, the sentence « Daddy sleeps » has the same number of elements as the sentence « Mummy must go now ». And that number is one. Or do You believe that Mummy or Daddy says « white space » after each word? The number of lexicon elements into which a sentence can be analysed grows in parallel with the « internalization of the mother language's grammatical structure ». Thus binary branching does not offer any real advantage for language acquisition. (And, in the end, hardly any language can be fitted onto the strict binary branching paradigm anyway.)

Bonus
Some sentences from Shakespeare (found by applying the Perl regular expression /[\.\; ]([\w ]*?I [\w ]+? me [\w ]+?\.)/g upon the corpus "The Complete Works of William Shakespeare") which, in my opinion, violate the second binding principle:

I have kept me from the cup.
I cross me for sinner.
I would wish me only he.
Something I must do to procure me grace.
Here on this molehill will I sit me down.
I can buy me twenty at any market.
I have bethought me of another fault.
I will shelter me here.
Here will I rest me till the break of day.
I do repent me that I put it to you.
I fear me both are false.
I can no longer hold me patient.
For I repent me that the Duke is slain.
And now I cloy me with beholding it.
bread I it makes me mad.
That I should yet absent me from your bed.
How I may bear me here.

First measurements concerning rhythmic circuit inertia and disparition
(written by Daniel Devatman Hromada for Joëlle Provasi as an M1 mémoire for EPHE SVT CNA)

The experiment consisted of three tasks: during the first, the spontaneous motor tempo (SMT) was measured; during the second, the child had to synchronise to stimuli with a 600 ms inter-stimulus interval (ISI); the third task was a continuation/induction task – after being attuned to a 600 ms ISI, the child was instructed to continue tapping the same rhythm (measured as the inter-response interval, IRI) even after the stimuli were turned off. This text concerns only the 3rd task in relation to the data obtained by the SMT measurements of the 1st task. Crucial for the understanding of our method is the concept of the « IRI falling into the SMT attractor state ». We say that a subject's IRIs have fallen into the SMT attractor state when an IRI cannot be distinguished from the SMT. For practical purposes, we define that an IRI cannot be distinguished from the SMT when the arithmetic mean of 3 subsequent IRIs is lesser than the SMT, e.g. IRI …

3.1. Shakespeare numbers
A Shakespeare number (S-number) is a natural number whose decimal representation satisfies two constraints:
• repetition-inclusion constraint: at least one digit occurs at least twice;
• ascending property: the number may not contain a digit n > 1 without containing a digit n − 1 to the left of the first occurrence of n.
In order to see the principle more clearly, Table 1 enumerates the ten Shakespeare numbers with the smallest values.

S-number   Alphabetic representation   Matchable expression
11         AA                          "we split we split"
111        AAA                         "we split we split we split"
112        AAB                         "here here sir"
121        ABA                         "to prayers to"
122        ABB                         "trip audrey i attend i attend"
1111       AAAA                        "justice justice justice justice"
1112       AAAB                        "great great great pompey"
1121       AABA                        "here here sir here"
1122       AABB                        "gross gross fat fat"
1123       AABC                        "he he and you"

Table 1. The first ten Shakespeare numbers, their corresponding alphabetic representations and arbitrarily chosen Shakespearean expressions which can be subsumed under them.
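For readers who prefer executable definitions, here is a minimal Perl check of ours (distinct from the article's Listing 1, reproduced further below), testing both constraints against the examples discussed in the next paragraph:

    use strict; use warnings;

    # Check the two S-number constraints for a digit string.
    sub is_s_number {
        my ($n) = @_;
        my %seen = ("0" => 1);                       # digit 1 needs no predecessor
        my $repeated = 0;
        for my $d (split //, $n) {
            return 0 unless exists $seen{$d - 1};    # ascending property violated
            $repeated = 1 if $seen{$d};              # a repetition was found
            $seen{$d} = 1;
        }
        return $repeated;                            # repetition-inclusion constraint
    }

    printf "%-4s -> %s\n", $_, (is_s_number($_) ? "S-number" : "not an S-number")
        for qw(11 121 1123 22 221 12 123);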
As a counterexample, let us note that 22 is not a Shakespeare number because the digit 1 does not occur in it at all, and that 221 is not a Shakespeare number because 2 occurs with no 1 to its left: these two numbers do not satisfy the ascending property. On the other hand, numbers like 12, 13 or 123 are not S-numbers either, because they do not include any repeated digit and therefore do not satisfy the repetition-inclusion constraint.

Listing 1 displays the source code of a routine able to generate the sequence of S-numbers from one to potential infinity. The sequence of the first 163553 S-numbers – id est, those S-numbers whose value is less than 9999999999 – is available at the Online Encyclopedia of Integer Sequences [13] under sequence number A273977.3 Deeper mathematical and number-theoretical properties of S-numbers are presented in [19].

3 https://oeis.org/A273977/b273977.txt

3.2. Entangled numbers
The set of entangled numbers is a subset of the set of Shakespeare numbers (E ⊂ S ⊂ ℕ). E-numbers therefore satisfy the repetition and ascending properties of S-numbers. In addition to these, the decimal representation of an entangled number E satisfies one additional property:
• closure property: each digit of E occurs at least twice.
In order to see the idea more clearly, Table 2 enumerates all Entangled numbers having a digit-length of five or less.

E-number   Alphabetic representation   Matchable expression
11         AA                          "we split we split"
111        AAA                         "we split we split we split"
1111       AAAA                        "justice justice justice justice"
1122       AABB                        "gross gross fat fat"
1212       ABAB                        "to prayers to prayers"
1221       ABBA                        "my hearts cheerly cheerly my hearts"
11111      AAAAA                       "so so so so so"
11122      AAABB                       "great great great pompey pompey"
11212      AABAB                       "come come buy come buy"
11221      AABBA                       "high day high day freedom freedom high day"
11222      AABBB                       "o night o night alack alack alack"
12112      ABAAB                       "too vain too too vain"
12121      ABABA                       "come hither come hither come"
12122      ABABB                       "come buy come buy buy"
12211      ABBAA                       "freedom high day high day freedom freedom"
12212      ABBAB                       "on whom it will it will on whom it will"
12221      ABBBA                       "thou canst not hit it hit it hit it thou canst not"

Table 2. All Entangled numbers with no more than 5 digits, their corresponding alphabetic representations and arbitrarily chosen Shakespearean expressions which can be subsumed under them.

As a counterexample, let us note that numbers like 12, 13, 22 or 123 are not E-numbers because they are not even S-numbers. On the other hand, S-numbers like 121 or 1211 are not E-numbers because they contain a digit 2 which is not repeated.

Listing 2 displays the source code of a routine able to verify whether an S-number presented at the input is an E-number. The sequence of the first 4360 E-numbers – id est, those E-numbers whose value is less than 9999999999 – is available at the Online Encyclopedia of Integer Sequences [13] under sequence number A273978.4 Deeper mathematical and number-theoretical properties of S-numbers are presented in [19].

4 https://oeis.org/A273978/b273978.txt

4. Method
The core idea behind our method can be stated as follows: any S- or E-number is to be "translated" into a backreference-endowed regular expression. More concretely, every digit of an S- or E-number can be interpreted as a sort of element, a "brick". In this article, we work with only one type of brick: those corresponding to sequences which are between two and twenty-three characters long.5

5 The minimal (e.g. 2) and maximal (e.g. 23) brick lengths are the only parameters of our model and can, of course, be adequately tuned. Sometimes we shall denote this parameter couple with the term base. More in the discussion.
More concretely, the first occurrence of a novel brick can be represented as the PERL-compatible regular expression

(.{2,23})

However, any subsequent occurrence of a digit in the S- or E-number is interpreted not as an occurrence of a new brick, but rather as a backreference to the brick which was already denoted by the same digit. Hence, the very first S-number, 11, is NOT to be translated into the regex /(.{2,23}) (.{2,23})/, for this would imply the existence of two distinct bricks. Rather, 11 is to be translated into the regex

(.{2,23}) \1

wherein the expression \1 denotes the backreference to the content matched by the regex-brick specified in the first parentheses, i.e. brick no. 1. Hence, the S-number 111 is easily translated into the regex /(.{2,23}) \1 \1/, 1111 into the regex /(.{2,23}) \1 \1 \1/, etc. These, however, are cases which correspond only to the repetition of one single brick: 11 for duplication, 111 for triplication, 1111 for quadruplication, etc. In order to assure the application of the non-identity principle, stating that "each distinct digit corresponds to distinct content", an additional adjustment is needed if we want to translate S-numbers containing multiple digits of different kinds – that is, S-numbers like 121, 122 or 211. For if we did not care for the principle of non-identity, a number like 121 could simply be represented as /(.{2,23}) (.{2,23}) \1/ and a number like 122 could be translated into /(.{2,23}) (.{2,23}) \2/. It could turn out, however, that these regexes would match the very same expressions as other, simpler regexes do (e.g. the expression "no no no" could be matched by /(.{2,23}) \1 \1/ as well as by /(.{2,23}) (.{2,23}) \1/ or /(.{2,23}) \1 (.{2,23})/). This is so because nowhere in such a regular expression is it specified that the first brick has to be different from the second brick, or the third from the second. Luckily enough, the syntax of PCREs is expressive enough to allow us to encode the non-identity constraint into the regexes themselves. This is attained by putting the backreference into a so-called negative lookahead, traditionally expressed by the formula (?!). Hence, by translating the S-number 121 into the regex

(.{2,23}) (?!\1)(.{2,23}) \1

we can make sure that the content matched by the brick denoted by digit 2 shall be different from the content matched by the brick denoted by digit 1. Thus, the expression "no no no" shall not be matched by such a regex, while the expression "no yes no"6 shall.

6 A cautious reader may now start to observe that the non-repeated digits of an S-number in fact correspond to "filler" or "separator" expressions (e.g. "yes"), which in many cases fill the space between the repeated elements themselves (e.g. "no").

Going somewhat further, the S-number 12321 – which can be understood as an instance of the chiasmatic ABXBA – is to be translated into the regex

(.{2,23}) (?!\1)(.{2,23}) (?!\1|\2)(.{2,23}) \2 \1

whereby the disjunctive backreference contained in the negative lookahead (?!\1|\2) assures that the content matched by brick no. 3 – corresponding to the filler X – shall be different from the content matched by the brick representing digit 1 as well as from that matched by the brick representing digit 2. This being said, the method of translating S- or E-numbers into regexes which do not transgress the non-identity constraint is pretty much straightforward, and it is fully and completely described by the PERL code given in Listing 3.
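As a quick sanity check of the non-identity constraint, one may run a throwaway Perl snippet of ours (not part of the article's listings) which confronts the regex for 121 with the two expressions just mentioned:

    use strict; use warnings;

    my $re_121 = qr/(.{2,23}) (?!\1)(.{2,23}) \1/;   # bricks 1, 2, 1 with non-identity

    for my $expr ("no no no", "no yes no") {
        print "'$expr' ", ($expr =~ $re_121 ? "matches" : "does not match"),
              " the regex for 121\n";
    }
    # 'no no no' is rejected (brick 2 would have to equal brick 1),
    # 'no yes no' is accepted.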
5. Experiment
5.1. Corpus
A digital, unicode-encoded version of Craig's edition of the "Complete Works of William Shakespeare" [4] has been downloaded from a publicly available Internet source.7 This corpus contains 17 txt files stored in the sub-folder "comedies", 10 txt files stored in the sub-folder "tragedies" and 10 txt files stored in the sub-folder "historical". What's more, all utterances are annotated according to the following format:

<PERSONA>
Sentence 1. Sentence ...
</PERSONA>

e.g.:

<MIRANDA>
O, wonder! How many goodly creatures are there here! How beauteous mankind is! O brave new world, That has such people in't!
</MIRANDA>

that is, a format highly reminiscent of a valid XML document. This format, wherein the diverse values of the <PERSONA> tag denote the names of diverse dramatis personae (e.g. Miranda, Prospero), seems to be consistently and stringently followed across all files contained in the corpus. This is advantageous, since it implies that the content present between the opening and the closing tag can be understood as a supraphrasal, meaning-encoding monadic unit: an utterance.

7 Downloaded from http://www.lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip. Backup at http://sci.wizzion.com/ShakespearePlaysPlus.zip.

Verily, this is encouraging, and it is encouraging for both a theoretical (1) and a practical (2) reason:
1. the school of thought to which our research adheres is principally a constructivist, usage-based linguistic paradigm, best manifested in [20];
2. the computational complexity of matching backreference-endowed regexes depends supralinearly, or maybe even non-polynomially [1], on the length of the text being matched.
Regarding the practical reason, it can be postulated that our article offers certain evidence for the hypothesis that "backreferenced regex-parsing of Shakespearean utterances is computationally tractable in reasonable time", whereby the term "reasonable" denotes time scales between milliseconds and minutes. More in the discussion. Regarding the theoretical reason, it is worth making explicit that an implicit leitmotif of Tomasello's theory is the definition stating: the utterance is the basic unit of linguistic interaction.

5.2. Processing
Dramatic pieces are divided into utterances. This is a natural consequence of the fact that dramatic pieces tend to represent scenarios within which diverse dramatis personae interact with each other. It is difficult to see any other literary genre where the division into utterances is as marked as in the case of drama.8 And in the case of the digital version of the [4] Shakespeare corpus, such markedness tends to be even more marked. Therefore, one simply needs to cut the corpus into utterances by interpreting the closing tag of the utterance (e.g. </PERSONA>, </MIRANDA> etc.) as the utterance separator. Even more concretely, one can simply consider the slash symbol / to be the utterance separator. Subsequently, dividing the original dramatic text into utterances is, at least in PERL, as simple as defining the symbol / to be the default input separator.

8 Plato's dialogues are, of course, set aside as a very particular case. When it comes to film scripts and/or subtitles of other audiovisual media, these are principally understood as a particular subtype of dramatic pieces.
That is, in PERLish, by executing the following code:

$/ = "/";

Only two further text-processing steps have been executed during the initialization phase of the experiment hereby presented. Primo, the content of each utterance has been put into lowercase. Secundo, non-alphabetic symbols (e.g. dot, comma, exclamation mark, etc.) have been replaced by blank spaces. We are aware that such a replacement could potentially lead to a certain loss of prosody- or pathos-encoding information. However, we consider this step legitimate, because the objective of our experiment was to focus on the repetition of lexical units.9

9 Enumerative generation of backreference-involving regexes focusing on repetitions of phonotactic clusters, syllables, phrases or potentially even sememes and prosodies is, in theory, also possible. We prefer, however, not to focus on this topic within the limited scope of this article.

The pre-processing code once executed, the identification of expressions containing diverse types of lexical repetition is as simple as matching each Shakespearean utterance against each regex.

6. Results
This section presents the results of the exposure of Shakespeare's corpus to base = (2,23) regular expressions generated out of all entangled numbers with a maximal length of 10 digits. We focus on E(2,23)-numbers because their closure property (i.e. "every digit contained in a valid E-number has to occur at least twice") gives an arbitrary E-number the ability to match much rarer a gem than just an arbitrary S-number.

6.1. Quantitative
All in all, 3667 instances of repetitive expressions have been detected in Shakespeare's complete works. These were contained in 2295 distinct utterances and corresponded to 172 distinct E(2,23) schemata. Among these, 71 matched more than one instance: these schemata could thus potentially correspond to a certain cognitive pattern, or habitus, in Shakespeare's mind. Table 3 contains summary matching-frequency information concerning the schemata which match at least five distinct utterances.

Instances   E(2,23)-number   Example
2332        11               "bestir bestir"
525         1212             "to prayers to prayers"
170         111              "ha ha ha"
100         123123           "cover thy head cover thy head"
48          12121            "come hither come hither come"
35          1221             "fond done done fond"
32          12341234         "let him roar again let him roar again"
32          1122             "with her with her hook on hook on"
30          1111             "great great great great"
23          121212           "come on come on come on"
12          123231           "upholds this arm this arm upholds"
12          1231231          "fubbed off and fubbed off and fubbed"
11          121233           "trip audrey trip audrey i attend i attend"
11          112323           "what what what ill luck ill luck"
10          123312           "my hearts cheerly cheerly my hearts"
10          11122            "lady lady lady alas alas"
9           121323           "a lord to a lord a man to a man"
8           12321434         "land rats and water rats land thieves and water thieves"
8           11111            "so so so so so"
7           12312312         "let me see let me see let me"
6           11234234         "on on on to the breach to the breach"
5           12123434         "i thank god i thank god is it true is it true"
5           1112323          "barren barren barren beggars all beggars all"

Table 3. Numbers of utterances in the collected works of William Shakespeare matched by E(2,23) schemata which match at least five distinct utterances; the individual bricks of the backreference-encoding regexes match expressions not shorter than 2 and not longer than 23 characters.

Another phenomenon may be found noteworthy by a reader interested in the purely quantitative aspects of our research.
That is, the number of occurrences of expressions matchable by an E-number of length L seems to stand in a Zipf-like [25] relation to L. For example, Shakespeare's dramas seem to contain 2332 duplications (E = 11), 170 triplications (E = 111), 30 tetraplications (E = 1111), 8 pentaplications (E = 11111)10, two hexaplications (E = 111111)11, one heptaplication (E = 1111111)12 and zero octaplications.

10 E.g. "never never never never never" by Lear in King Lear.
11 E.g. "kill kill kill kill kill kill", also by King Lear.
12 E.g. "so so so so so so so" by Shallow in The Second Part of King Henry IV.

It is worth mentioning, however, that the generic relation between the length (in digits) of an E-number X and the amount of utterances which X matches seems not to be Zipfian. This is illustrated by Table 4.

Digits        2      3     4     5     6     7     8     9
Theoretical   1      1     4     11    41    162   715   3425
Matched       2332   170   622   91    211   56    86    67

Table 4. Schemata corresponding to E-numbers with an even number of digits match more frequently than those with an odd number of digits.

As indicated by Table 4, the observed preference for repetitive expressions involving two, four, six or eight bricks cannot be explained in terms of the number-theoretical distribution of the E-numbers themselves. For example, there exist eleven E-numbers with five digits and forty-one E-numbers of length six. However, when exposed to the Shakespeare corpus, the base(2,23) regexes generated from E-numbers six digits long matched 211 utterances, while the five-brick-long regexes matched only ninety-one of them. Whether this observed asymmetry is an artefact of our method and of our definition of E-numbers, or whether it is due to a sort of cognitive bias – a preference for balanced repetitions – is a question which we do not dare to tackle within the limited scope of the present article.

6.2. Qualitative
It may be said that the longer the E- or S-number, the more complex a structure and the more cognitively salient, pathos-filled an entity it potentially represents. For this reason, this subsection principally exposes the reader to a few answers to the question: "Which Shakespearean expressions can be matched by the longest possible E-numbers?" In all following examples we use base(2,23) E-numbers, i.e. we restrict the length of the individual bricks to a minimum of 2 and a maximum of 23 characters.

In the realm of comedies13, one can observe that the regex generated from the number 12343434 pinpoints the following utterance of Stephano, playing his role in The Tempest:

Flout (1) 'em (2), and (3) scout 'em (4); and (3) scout 'em (4), and (3) flout 'em (4); Thought is free.

while the regex generated from the number 12343412 identifies Miranda's:

All (1) lost (2) to (3) prayers (4), to (3) prayers (4), all (1) lost (2).

13 Link to the file containing all XXX expressions shall be published in the camera-ready version of the article.

or Caliban's:

Freedom (1), high (2) day (3)! high (2) day (3), freedom (1)! freedom (1)! high (2) day (3), freedom (1)!14

all appearing in the same play.
Another answer, corresponding to the E-number 122133144, is given by Dromio, a personage of Shakespeare's "Comedy of Errors":

She is so hot because (1) the meat is cold (2); The meat is cold (2) because (1) you come not home (3); You come not home (3) because (1) you have no stomach (4); You have no stomach (4), having broke your fast;

Analyzing the realm of tragedies, one may see Polonius – a character of the Hamlet drama – utter an 11231434231-matchable expression:

The best actors in the world, either for tragedy, comedy, history, pastoral (1), pastoral (1)-comical (2), historical (3)-pastoral (1), tragical (4)-historical (3), tragical (4)-comical (2)-historical (3)-pastoral (1), scene individable, or poem unlimited: Seneca cannot be too heavy, nor Plautus too light. For the law of writ and the liberty, these are the only men.15

or one can hear Hamlet himself pronouncing the following 1231414312-matchable sequence:

Let your own discretion be your tutor: suit the (1) action (2) to (3) the (1) word (4), the (1) word (4) to (3) the (1) action (2)

14 It is important to realize that the very same expression can be matched by multiple regexes. Hence, the above-mentioned Caliban's proclamation can be analyzed not only as matching the base(2,23) E-number 1232311231, but also as matching E-numbers like 12211121 (if ever "high day" forms only one brick), etc. This is analogous, mutatis mutandis, to a sentence having multiple syntactic parses.
15 Note that if the regexes had been constructed in a way that ignores suffixes, i.e. using bricks of a form like "(.{2,23})\w{0,4}", then this utterance could potentially be matched by a much longer number, because not only the adjectives (e.g. "historic-al") but also the preceding substantives (e.g. "histor-y") would be accounted for.

while Mercutio from the Romeo and Juliet narrative states:

Come, come, thou art as hot a Jack in thy mood as any in Italy; and as soon (1) moved (2) to be (3) moody (4), and as soon (1) moody (4) to be (3) moved (2).

These examples are, of course, just the tip of an iceberg. Verily, only a tip of an iceberg, because many strongly marked repetitive expressions are also to be found in Shakespeare's historical dramata. Among these, the dramata eternalizing the narratives of Henry IV. and Henry V. tend to top the list.
Hence, Gadshill reasons:

will strike (1) sooner (2) than (3) speak (4) and (5) speak (4) sooner (2) than (3) drink (6) and (5) drink (6) sooner (2) than (3) pray and yet i lie for they pray continually to their saint the commonwealth or rather not pray to her but prey on her

while Falstaff emphasizes:

banish peto banish bardolph banish poins but for sweet jack falstaff kind jack falstaff true jack falstaff valiant jack falstaff and therefore more valiant being as he is old jack falstaff banish (1) not (2) him (3) thy (4) harry s (5) company (6) banish (1) not (2) him (3) thy (4) harry s (5) company (6) banish (1) plump jack and banish all the world

It is, however, a persona named Shallow who seems to be particularly fond of repetitions, once saying:

come (1) on (2) come (1) on (2) come (1) on (2) sir (3) give (4) me (5) your (6) hand (7) sir (3) give (4) me (5) your (6) hand (7)

and the next time saying:

where s (1) the roll (2) where s (1) the roll (2) where s (1) the roll (2) let (3) me (4) see (5) let (3) me (4) see (5) let (3) me (4) see (5) so (6) so (6) so (6) so (6) so (6) so (6) so (6) yea marry sir ralph mouldy let them appear as i call let them do so let them do so let me see where is mouldy

Given that Shallow appears in the historical dramata, an interesting question can rightfully be posed: is Shallow's tendency to produce repetitive utterances en masse just Shakespeare's invention, or is it rather a sort of description of the particular cognitive characteristics of a once existing historical personage?

7. Conclusion
Our article presents a way of mapping a subset of the set of all possible backreference-endowed regexes onto the set of natural numbers. It indicates that, for every base of a certain kind, the set of regexes-to-be-generated is infinite but enumerable. A set of so-called Shakespeare numbers (S-numbers) is defined, as well as a set of "Entangled numbers", the second being a subset of the first which satisfies one additional constraint: every distinct digit ("symbol") of an entangled number EX occurs in EX at least twice. We have subsequently generated the list of all such S-numbers (c.f. Listing 1) and E-numbers (c.f. Listing 2) with at most 10 digits, after which the E-numbers have been translated into backreference-endowed regular expressions whose most elementary units, the so-called "bricks", were no shorter than two and no longer than twenty-three characters. In the end, such regexes have been exposed to a corpus containing the collected works of William Shakespeare.

This approach allowed us to pinpoint 3667 utterances matching at least one among 172 distinct repetitive formulae. We believe that at least some among these formulae could be of certain interest not only for Shakespearean scholars [14] in particular, but also for the wider fields of "digital humanities" [23] and stylometry.

The good news is that the whole matching process is also fairly fast. More concretely, matching all utterances with all base(2,23) regexes generated out of all 4360 E-numbers with less than 10 digits lasted 9555 seconds in the case of the Shakespearean comedies, 6607 seconds in the case of the tragedies and 6900 seconds in the case of the historical dramata – all this on one single core of a 1.4 GHz CPU.
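For orientation, the shape of the whole experiment can be compressed into a small self-contained Perl sketch of ours (toy utterances and a hand-written three-entry regex table standing in for the full 4360-entry one):

    use strict; use warnings;

    # Each utterance is tested against each schema; hits are counted per schema.
    my %regex_of = (
        '11'  => qr/(.{2,23}) \1/,
        '111' => qr/(.{2,23}) \1 \1/,
        '121' => qr/(.{2,23}) (?!\1)(.{2,23}) \1/,
    );
    my @utterances = (
        "bestir bestir ",
        "ha ha ha ",
        "to prayers to prayers all lost ",
    );
    my %hits;
    for my $u (@utterances) {
        for my $e (keys %regex_of) {
            $hits{$e}++ if $u =~ $regex_of{$e};
        }
    }
    # On these toy data: 11 matches all three utterances, 111 and 121 one each.
    printf "%-4s matched %d utterance(s)\n", $_, ($hits{$_} // 0)
        for sort keys %regex_of;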
8. Peroratio
Rhetorics undoubtedly belongs among the five oldest scientific paradigms ever explicated by scholars of the occidental16 tradition. Even before Plato noted down the discussions between Socrates and Gorgias and between Socrates and Parmenides, even before Aristotle projected his point of view upon the realm of man, Athēnaia had already been venerated. The longevity of rhetorics has positive as well as negative sides. Negative, for such a lengthy tradition implies potential impediments caused by centuries of terminological and methodological sediments. We are convinced that, similarly to the diverse occult notations of pre-Mendeleevian chemistry, the alphabetic notation of BABAs and ABBAs may also be considered such a sediment with regard to the science of rhetorics. Hence, by the trivial act of switching notation from As to ones and from Bs to twos, we aspire to do nothing else than to unblock this science from a state of terminological traffic jam into a somewhat more fluid state.

Hence and thus, interesting and almost melodical17 verses of Shakespeare have been pinpointed and juxtaposed side by side. Being unsure whether such a juxtaposition has ever been explored in the depth it merits, we find our qualitative results worthy not only of exploring but also of publishing. For who knows, maybe they shall even inspire some potential Shakespeare of the future?

Quantitative explorations may also turn out to be worthy of further pursuit. Three axes of such exploration are immediately visible:
1. the "universalia axis": the study of language-independent invariants and rhetorical schemata which occur across many distinct languages and/or language groups [12];
2. the "ontogenetic axis": the exploration of the processes by means of which the complex eloquence of an individual locutor emerges out of simpler structures, from the mind of a child to Shakespeare;
3. the "historical axis": the study of different Digital Humanities resources in order to increase our knowledge about the styles, fashions, crossovers and traditions popular during different epochs of human history.
In terms of Saussurean linguistics [5], one may consider the first axis to be a synchronic one, while the second and the third can be considered "diachronic" ones.

16 Note, however, that rhetorics is far from being unknown to the Orient as well. Known as Sarasvatī in the Sanskrit world, the goddess who embodies knowledge, arts, music, melody, muse, language, rhetoric, eloquence, creative work... [17] seems to have been active already in Vedic or even pre-Vedic, proto-Indo-European times.
17 It may be the case that the application of our method upon musical partitures – as stored in MIDI files, for example – shall also yield some worthy insights.

One may, for example, extend the work of [12] in the domain of "language-independent detection of figures-of-speech" and demonstrate that E-numbers of considerable length match expressions not only in Shakespeare, but also in Goethe, Molière, Milton or others. Or focus on the so-called "sacred texts" like the Bible, the Koran or the RgVed, where repetitions, indeed, abound.
Or pursue a somewhat more psycholinguistic, ontogeny-oriented line of research and study a corpus like CHILDES [15] in order to explore how complex eloquence emerges out of variations within repetitions of complex sequences (further references to be given in the camera-ready version). Last but not least, we are convinced that our S- or E-number nomenclatures could be embedded into rhetorical figure ontologies [11,16]. Within such ontologies, antimetaboles could thus be "enriched" with attributes like "12321", "123321", "1234321", etc.; anadiplosis would be labeled with another set of numbers, antistrophe with yet another, etc. The advantage of such an enrichment is quite easy to see: the enriched elements would become "grounded" [10]. That is, when looking for – or inferring the presence of – a certain figure of speech F in a certain text T, one could consult the ontology and see whether F is not labeled with SF or EF attributes. If yes, one could simply parse T with the corresponding SF or EF regexes. One could thus establish a practical, functional, bidirectional bridge between the abstract realm of purely descriptive ontologies and the material reality of the text corpora which are to be parsed and understood. And, of course, such nomenclatures – or nomenclatures of a similar vein – may allow communication between computational and classical scholars in unambiguous, precise, yet still concise and sufficiently explanatory terms.

This being said, we conclude this article with an expression of hope that the method hereby introduced shall make it possible to spot, identify, classify and study at a deeper level the intricacies of cognitive ecosystems populated with swarms and clusters of hitherto unknown psycholinguistic schemata traditionally known as "figures of speech".

Acknowledgments
TBD in the camera-ready version of the article.

Listing 1: PERL code generating an ascending sequence of Shakespeare numbers. Code hereby transferred to the public domain under the CC BY-NC-SA license for artistic use and the mGPL license for general use.

$i = 1;
INCREMENT: while ($i++) {
    my %d;
    $d{"0"} = 1;                                  # digit 1 needs no left neighbour
    $r = 0;
    for $d (split //, $i) {
        next INCREMENT if !exists $d{($d - 1)};   # ascending property
        if ($d{$d}) { $r = 1; }                   # a digit repeats
        $d{$d} = true;
    }
    print "$i\n" if $r;                           # repetition-inclusion constraint
}

Listing 2: PERL code checking whether a Shakespeare number given at the input is also an Entangled number. Code hereby transferred to the public domain under the mGPL license.

OUTER: while (<>) {
    my %d;
    $i = $_;
    chop $i;
    for $d (split //, $i) {
        (exists $d{$d}) ? ($d{$d}++) : ($d{$d} = 1);
    }
    for $k (keys %d) {
        next OUTER if ($d{$k} < 2);               # closure property: every digit twice
    }
    print "$i\n";
}

Listing 3: PERL code translating S-numbers into syntactically correct regexes. Code hereby transferred to the public domain under the mGPL license.

my $base = '(.{2,23})';
$n = $ARGV[0];
@i = split //, $n;
$re = "";
my %h;
$no = "";
for my $i (@i) {
    $re .= " ";
    if (defined $h{$i}) {
        $re .= '\\' . $i;                         # repeated digit: emit a backreference
    } else {
        if ($i > 1) {                             # new digit: forbid identity with
            $i > 2 ? ($no .= '|\\' . ($i - 1)) : ($no .= '\\' . ($i - 1));
            $re .= '(?!' . $no . ')';             # all previously introduced bricks
        }
        $re .= $base;
        $h{$i} = 1;
    }
}
$re .= ' [<]';
print "$n translates into $re\n";
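Assuming Listing 3 is saved as, say, translate.pl (the filename is our invention), a sample invocation reproduces the chiasmatic regex derived for 12321 in section 4, extended with the trailing ' [<]' end-of-utterance anchor added by the listing:

    $ perl translate.pl 12321
    12321 translates into  (.{2,23}) (?!\1)(.{2,23}) (?!\1|\2)(.{2,23}) \2 \1 [<]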
: ; ’ "\ − ] + / / g ; # remove non−a l p h a b e t i c c h a r s p u s h @{ $ u t t e r a n c e s {$ARGV} } , $ l i n e ; # c o n s t r u c t t h e u t t e r a n c e h a s h } References [1] Alfred Vaino Aho. Algorithms for finding patterns in strings. Algorithms and Complexity, 1:255, 2014. [2] Georg Cantor. Über eine elementare frage der mannigfaltigkeitslehre. Jahresbericht der Deutschen Mathematiker-Vereinigung, 1:75–78, 1892. [3] Gorges Caumont. Notes morales sur l’homme et sur la societe. Sandoz&Fischbacher, Paris, 1872. [4] William James Craig. The complete works of Wiliam Shakespeare. Oxford University Press, 1919. [5] Ferdinand De Saussure. Cours de linguistique générale: Publié par Charles Bally et Albert Sechehaye avec la collaboration de Albert Riedlinger. Libraire Payot & Cie, 1916. [6] Marie Dubremetz and Joakim Nivre. Rhetorical figure detection: the case of chiasmus. on Computational Linguistics for Literature, page 23, 2015. [7] Luciano Floridi. The philosophy of information. Oxford University Press, 2011. [8] Jeffrey EF Friedl. Mastering regular expressions. " O’Reilly Media, Inc.", 2002. [9] Kurt Gödel. Über formal unentscheidbare sätze der principia mathematica und verwandter systeme i. Monatshefte für mathematik und physik, 38(1):173–198, 1931. [10] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990. [11] Randy Harris and Chrysanne DiMarco. Constructing a rhetorical figuration ontology. In Persuasive Technology and Digital Behaviour Intervention Symposium, pages 47–52. Citeseer, 2009. [12] Daniel Devatman Hromada. Initial experiments with multilingual extraction of rhetoric figures by means of perl-compatible regular expressions. In RANLP Student Research Workshop, pages 85–90, 2011. [13] OEIS Foundation Inc. The on-line encyclopedia of integer sequences, 2017. http://oeis.org. [14] Sister Miriam Joseph. Shakespeare’s Use of the Arts of Language. Paul Dry Books, 2008. [15] Brian MacWhinney. The CHILDES project: The database, volume 2. Psychology Press, 2000. [16] Miljana Mladenović and Jelena Mitrović. Ontology of rhetorical figures for serbian. In International Conference on Text, Speech and Dialogue, pages 386–393. Springer, 2013. [17] John Muir. Original Sanskrit texts on the origin and history of the people of India, their religions and institutions. Trübner & Company, 1873. [18] Claude E Shannon and Warren Weaver. The mathematical theory of information. 1949. [19] NJA Sloane and Arndt Joerg. Counting words that are in "standard order", 2016. https://oeis.org/A278984/a278984.txt. [20] Michael Tomasello. Constructing a language: A usage-based theory of language acquisition. Harvard university press, 2009. [21] Alan Mathison Turing. On computable numbers, with an application to the entscheidungsproblem. J. of Math, 58(345-363):5, 1936. [22] Alan Mathison Turing. Rhetorique. Grand Memento Encyclopedique, 1:687–689, 1936. [23] Michael Ullyot. Review essay: Digital humanities projects. Renaissance Quarterly, 66(3):937–947, 2013. [24] Larry Wall and Randal L Schwartz. Programming perl. O’Reilly & Associates Sebastopol, CA, 1991. [25] George Kingsley Zipf. The psycho-biology of language. 1935. 
Initial Experiments with Multilingual Extraction of Rhetoric Figures by means of PERL-compatible Regular Expressions

Daniel Devatman Hromada
Lutin Userlab – ChART – Paris 8 – EPHE – Slovak Technical University
hromi@kyberia.sk

Abstract
A language-independent method of figure-of-speech extraction is proposed in order to reinforce rhetoric-oriented considerations in natural language processing studies. The method is based upon a translation of the canonical forms of repetition-based figures of speech into the language of PERL-compatible regular expressions. The anadiplosis, anaphora and antimetabole figures were translated into forms exploiting the backreference properties of PERL-compatible regular expressions, while epiphora was translated into a formula exploiting the recursive properties of this very concise artificial language. These four figures alone matched more than 7000 strings when applied on dramatic and poetic corpora written in English, French, German and Latin. Possible usages, varying from the stylometric evaluation of the translation quality of poetic works to the more complex problem of semi-supervised figure-of-speech induction, are briefly discussed.

1 Introduction
During the Middle Ages and before, the discipline of rhetoric constituted – along with grammar and logic – a basic component of the so-called trivium. Considered by Plato to concern the "one single art that governs all speaking" (Plato, trans. 1986) and subsequently defined by Aristotle as "the faculty of observing in any given case the available means of persuasion" (Aristotle, trans. 1954), the basic postulates of rhetoric are still kept alive by those active in domains as diverse as politics, law, poetry, literary theory (Dubois, 1970) or the humanities in general (Perelman & Olbrechts-Tyteca, 1969).

When it comes to more "exact" scientific disciplines like informatics or linguistics, rhetoric seems to be somewhat ignored – definitely more so than its "grammar" and "logic" trivium counterparts. While contemporary rhetoric disposes of a strong theoretical background – whether in the form of Rhetorical Structure Theory (Taboada & Mann, 2006), "computational rhetoric" (Grasso, 2002) or computational models of natural argument (Crosswhite & Fox, 2003) – a more practically-oriented engineer has nonetheless to agree with the statement that "the ancient study of persuasion remains understudied and underrepresented in current Natural Language systems" (Harris & DiMarco, 2009).

The aim of this article is to reduce this "under-representation" gap and, in a certain sense, to augment the momentum of computational rhetoric – not by proposing a complex model of argumentation, but by proposing a simple yet efficient and language-independent method for the extraction of certain rhetoric figures (RFs) from textual corpora.

RFs, also called "figures of speech", are one of the basic means of persuasion which an orator has at his disposition. Traditionally, they are divided into two categories: tropes, related to the deeper, i.e. semantic, features of the phrasal constituents under consideration; and schemes, related to layers closer to the actual material expression of the proposition, i.e. to the morphology, phonology or prosody of the generated utterance.
The method proposed within this article shall deal only with a reduced subset of the latter – that is, with the detection of the rhetoric schemes anadiplosis, anaphora, antimetabole and epiphora, which are based on a repetition or reordering of a given word, phrase or morpheme across multiple subsequent clauses. While such a stylometric approach has recently been implemented with encouraging results by (Gawryjolek, 2009), his system is operational only when combined with a probabilistic context-free grammar parser adapted to the English language, and hence dysfunctional when applied upon languages for which such a parser does not exist.

In the following paragraphs of this article we shall present a system of rhetoric figure extraction which tends to be language-independent, i.e. applicable upon a textual corpus written in any language. Ideally, no antecedent knowledge about the grammar of a language is necessary for successful extraction by means of our method: 1) the prescriptive form of the figure-to-be-extracted and 2) the symbol representing phrase and/or clause boundaries are the only information necessary.

More concretely, our proposal is based on a fairly simple translation of the canonical form of the rhetoric figure under question into a computer language, namely into the language of PERL-compatible regular expressions (PCREs). PCREs are, in their essence, simply strings of characters which describe sets of other strings of characters, i.e. they are a matching form, a template, for many concrete character strings. As with many other regular expression engines, PCREs make this possible by reserving special symbols – "the metacharacters" – for quantifiers and classes. But in addition to these features, common to many finite state automata, PCREs offer much more (Wall & Loukides, 2000). These are the reasons why we consider PCREs to be appealing candidates for the translation of rhetorical figures into a computer-readable symbolic form:

• by implementing "backreferences" (Friedl, 2006), PCREs make it possible to refer to that which was already matched, hence allowing the construction of automata able to match repetitive forms;
• by implementing (from PERL version 5.10 on) "recursive matching", PCREs make it possible to match very complex patterns without the need to have recourse to means external to PCREs;
• since the language of PCREs is very concise, the resulting PCRE describing a given rhetorical figure is usually a string of a few dozen characters, which could eventually be constructed not by means of human intervention, as was the case in this article, but by means of unsupervised genetic programming (Koza, 1992) or some other grammar induction engine (Solan, Horn, Ruppin, & Edelman, 2005).

Element      Meaning
W            word
...          arbitrary intervening material
<…>          phrase or clause boundaries
subscripts   identity (same subscripts), non-identity (different subscripts)

Table 1: part of the RF-Representation Formalism (RFRF)

2 Method
2.1 PERL-Compatible Rhetoric Figures
Four figures were chosen – namely anadiplosis, anaphora, epiphora and antimetabole – in order to demonstrate the feasibility of the "rhetoric stylometry" approach. We have adopted the Rhetoric Figure Representation Formalism (RFRF) – initially conceived by (Harris & DiMarco, 2009) – and reduced it so as to describe only the four figures of interest. The basic symbols of RFRF and their associated meanings are presented in Table 1. Since the goal of this article is primarily didactic,
we shall start this exposé with the very simple anadiplosis, involving just one backreference, and end our proposal with a somewhat more complex recursive PCRE matching epiphorae containing an arbitrary number of constituents.

2.1.1 Anadiplosis
Anadiplosis occurs when a clause or phrase starts with the word or phrase that ended the preceding unit. It is formalized in RFRF as:

< . . . Wx >< Wx . . . >

We have translated this representation into the following PERL-Compatible Rhetoric Figure (PCRF):

/((\w{3,})[.?!,] \2)/sig

The repetition-matching faculty is assured by a backreference to an initial n-gram composed of at least three word characters. Therefore, this PCRE makes it possible to match utterances like the one in Cicero's De Oratore:

Sed genus hoc totum orationis in eis causis excellit, in quibus minus potest inflammari animus iudicis acri et vehementi quadam incitatione; non enim semper fortis oratio quaeritur, sed saepe placida, summissa, lenis, quae maxime commendat reos. Reos autem appello non eos modo, qui arguuntur, sed omnis, quorum de re disceptatur; sic enim olim loquebantur.1

1 "For vigorous language is not always wanted, but often such as is calm, gentle, mild: this is the kind that most commends the parties. By 'parties' I mean not only persons impeached, but all whose interests are being determined, for that was how people used the term in the old days."

This is the simplest possible anadiplosis figure, since it matches only strings with two occurrences of the repeated word. Therefore we label this figure anadiplosis{2}.

2.1.2 Anaphora
Anaphora is a rhetoric figure based upon a repetition of a word or a sequence of words at the beginnings of neighboring clauses. It is formalized in RFRF as:

< Wx . . . >< Wx . . . >

We have translated this representation into the following PCRE form:

/[.?!;,] (([A-Z]\w+) [^.?!;,]+[.?!;] \2 [^.?!;,]+[.?!;,] (\2 [^.?!;,]+[.?!;,])*)/sig

As with all RFs presented in this article, this anaphora is also based on backreference matching. In contrast with anadiplosis, where the dependency was of a very short-distance nature, in the case of anaphora the second occurrence of the word can be dozens of characters distant from the initial occurrence. What's more, this RF takes into account a possible third repetition of Wx, which makes it possible to match utterances like Cicero's:

Quid autem subtilius quam crebrae acutaeque sententiae? Quid admirabilius quam res splendore inlustrata verborum? Quid plenius quam omni genere rerum cumulata oratio?2

2 "Is there something more subtle than a rapid succession of pointed reflections? Is there something more wonderful than the heating-up of a topic by verbal brilliance, something richer than a discourse cumulating material of every sort?"

Since this PCRF allows us to match anaphorae with two or three occurrences of a repeated word, it seems appropriate to label it anaphora{2,3}.

2.1.3 Antimetabole
Antimetabole is a rhetoric figure which occurs when words are repeated in successive clauses in reversed order. In terms of RFRF, one can formalize it as follows:

< Wx . . . Wy >< Wy . . . Wx >

We have translated this representation into the following PCRE form:

/((\w{3,}) (.{0,23}) (\w{3,})[^\.!?]{0,23} \4 \3 \2)/sig

Differently from the previous examples, where only one element was matched and backreferenced, three elements – A, B, C – are determined in the initial phases of matching this chiasmatic antimetabole. Subsequently, the order of A & C is switched, while B is considered to be identical intervening material between A and C and between C and A.
Since the possible occurrence of other material intervening between ABC and CBA (i.e. ABCxCBA) is also taken into account, this PCRF has successfully matched expressions like:

Alle wie einer, einer wie alle.3

3 "All as one, one as all."

2.1.4 Epiphora
Epiphora, or epistrophe, is an RF defined as "ending a series of phrases or clauses with the same word or words". It is formalized in RFRF as:

< . . . Wx >< . . . Wx >

We have translated this representation into the following PCRE form:

/([A-Z][^\.\?!;]+ (\w{2,}+)([\.\?!;] ?[A-Za-z][^\.\?!;]+ (?:\2|(?-1))*)\2[\.\?!;])/sig

In contrast with the anaphora{2,3} figure presented in 2.1.2, the epiphora figure hereby proposed exploits the "recursive matching" properties of the latest versions of PCRE (Perl 5.10+) engines. In other words, the expression (?:\2|(?-1)) matches any number of subsequent phrases or clauses which end with Wx – and not just two or three, as was the case with anaphora. Hence, a quadruple epiphora:

Je te dis toujou la même chose, parce que c'est toujou la même chose, et si ce n'était pas toujours la même chose, je ne te dirais pas toujou la même chose.4

4 "I always tell you the same thing, because it is always the same thing, and if it wasn't always the same thing, I would not have been telling you the same thing."

was detected by this recursive PCRF when it was applied upon the corpus of Molière's works. Since recursive matching allows us to create a sort of "greedy" epiphora, we propose to label it epiphora{2,} in a possible future taxonomy of PCRFs.

2.2 Corpora
In order to demonstrate the language-independence of the rhetoric stylometry method hereby proposed, we confronted the matching faculties of the initial "PERL-Compatible Rhetoric Figures" (PCRFs) with corpora written in diverse languages. More precisely, we have performed a rhetoric stylometry analysis of 4 corpora written by poets and orators who are often considered exemplary cases of the mastery of their respective languages. For the English language, the complete works of William Shakespeare were downloaded from Project Gutenberg (Hart, 2000). The same site served us as the source of 40 works of Johann Wolfgang Goethe written in the German language. When it comes to the original works of Jean-Baptiste Molière, 39 of them were recursively downloaded from the French site toutmoliere.net. Finally, the basic Latin manual of rhetoric, Cicero's De Oratore, was extracted from the corpus of the Perseus Project (Crane, 1998) in order to demonstrate that the PCRF-based approach can yield interesting results even when applied upon corpora written in antique languages.

The corpora from Project Gutenberg were downloaded as pure utf8-encoded text. No filtering of the data was performed, in order to analyze the data in their rawest possible form; the only exception was the stripping away of possible HTML tags by means of the standard HTML::Strip filter. Before the matching, the totality of each corpus was split into fragments whenever the frontier \n[^\w+] (i.e. a newline followed by at least one non-word character) was detected. Shakespeare's corpus was split into 109492 fragments, Goethe's into 46597 fragments, Cicero's into 970 fragments, while the works of Molière yielded 6639 fragments.

3 Results
In total, more than 7000 strings were matched by the 4 PCRFs within the 4 corpora, containing 17 megabytes of text split into more than 163040 textual fragments.
3 Results

In total, more than 7000 strings were matched by the 4 PCRFs within 4 corpora containing 17 megabytes of text split into more than 163040 textual fragments.

             Anadiplosis{2}  Anaphora{2,3}  Antimetabole{abcXcba}  Epiphora{2,}
Cicero       0.00309         0.2711         0                      0.0144
Goethe       0.00242         0.0717         0.0003                 0.0042
Molière      0.01129         0.1634         0.000602               0.0210
Shakespeare  0.00087         0.008          0.000219               0.008

Table 2: Relative frequencies of occurrence of diverse PCRFs within diverse corpora (PCRF matches per fragment)

As is indicated in Table 2, instances of anadiplosis, anaphora, antimetabole and epiphora were found in all 4 corpora involved in this study, the only exception being the absence of antimetabole in Cicero. In general, anaphora{2,3} seems to be the most frequent figure: the number of cases in which this PCRF succeeded to match greatly surpasses that of the other figures, especially for the Romance-language authors – almost every sixth fragment from Molière and every fourth from Cicero was matched by anaphora{2,3}. The only exception to this "dominance of anaphora" seems to be Shakespeare, whose complete works yielded exactly the same frequency of epiphora and anaphora occurrences.

             Anadiplosis{2}  Anaphora{2,3}  Antimetabole{abcXcba}  Epiphora{2,}
Cicero       20              1              4                      19
Goethe       44              3              33                     287
Molière      57              1              29                     65
Shakespeare  7               2              17                     64

Table 3: Elapsed time (in seconds) of different PCRF/corpus runs on an average desktop PC

As is indicated in Table 3, the computational demands of the PCRF-based approach are not high in the case of anaphora{2,3}. On the contrary, the recursive epiphora{2,} is much more demanding. As the recursive structure of this PCRF indicates, the cost of the matching process grows non-polynomially with the length of the textual fragment upon which the PCRF is applied, and the choice of a correct fragment separator token (c.f. 2.2) therefore seems to be of utmost importance.
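Elapsed times of the kind reported in Table 3 are easy to reproduce with the core Time::HiRes module. The following harness is our own illustration – the fragment list is invented – and not the script behind Table 3:

    #!/usr/bin/perl
    # Illustrative timing harness for a single PCRF run.
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my @fragments = ('Fear leads to anger. Anger leads to hate.') x 10_000;
    my $pcrf      = qr/((\w{3,})[.?!,] \2)/si;

    my $t0      = [gettimeofday];
    my $matches = 0;
    for my $frag (@fragments) {
        $matches++ while $frag =~ /$pcrf/g;
    }
    printf "%d matches in %.2f seconds\n", $matches, tv_interval($t0);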
4 Discussion

We propose a language-independent, parse-free method of extracting instances of rhetoric figures from natural language corpora by means of PERL-compatible regular expressions. The fact that PCREs implement features like back-references or recursive matching makes them good candidates for the detection & extraction of rhetoric figures which cannot be matched by simpler finite-state automata or context-free grammars. In order to demonstrate the feasibility of such an approach, we have "translated" the canonical definitions of anadiplosis, anaphora, antimetabole and epiphora into four PERL-compatible rhetoric figures – namely anadiplosis{2}, anaphora{2,3}, antimetabole{abcXcba} and epiphora{2,} – and applied them upon Latin, English, French and German corpora.

All four PCRFs successfully matched some strings in at least three of the four corpora, indicating that repetition-based rhetoric figures may well belong to the set of linguistic universalia (Greenberg, 1957). Anaphora{2,3} surpassed all the other figures in frequency of occurrence, the only exception being Shakespeare, in whose case the number of matched epiphorae was equal to the number of matched anaphorae.

We do not pretend that the PCRFs presented hereby are the most adequate translations of traditional anadiplosis, anaphora, antimetabole or epiphora into an artificial language. Since PCREs can contain quantifiers and classes, it is evident that for any set of strings – which is, in our case, the set F of all the occurrences of a given figure within its respective corpus – more than one regexp could be constructed that matches all members of the set F. It may therefore be the case that the PCRFs proposed in this "proof of concept" article are neither the most specific nor the fastest ones.

When it comes to specificity, a closer look at the extracted data indicates that the PCRFs proposed hereby have produced some "false positives", i.e. have matched strings which are not rhetorical figures. For example, the expression "FIRST LORD. O my sweet lord" was matched by epiphora{2,} when applied upon Shakespeare's corpus, but it is definitely not a rhetoric figure, since the substring in capital letters simply denotes the name of the dramatic persona pronouncing the following statement and is not a clause of the statement itself.

When it comes to speed, it is established that matching PCREs with an unbounded number of backreferences is NP-complete (Aho, 1991), and this may well be the reason for the very high run-times of the recursive epiphora{2,} in contrast to its non-recursive PCRF counterparts. From a practical point of view it therefore seems more suitable – especially in the case of the analysis of huge corpora – to stick to non-recursive PCRFs. Another possible way to speed up the matching – and in certain cases even to prevent the machine from falling into an "infinite recursion loop" – is to tune the "splitting parameter" so that the corpus is split into fragments of such a size that the NP-complexity of the matching PCRE has no observable implications upon the real run-time of the rhetoric figure detection process.

There are at least three different ways in which PCRFs could be useful. Firstly, since PCRFs are very fast and language-independent, they can allow scholars to extract huge numbers of instances of rhetoric figures from diverse corpora in order to create an exhaustive compendium of rhetoric figures. For example, the corpus of >7000 strings which were extracted from the corpora mentioned in this article (downloadable from http://www.lutin-userlab.fr/rhetoric/) could easily be put to use not only by teachers of language or rhetoric, but possibly also by those who aim to develop a semi-supervised system of rhetoric figure induction (c.f. the last paragraph). Manual annotation of such a compendium and subsequent tentatives of such figure-of-speech induction shall be presented in our forthcoming article.

Secondly, the extracted information concerning the quantities of various PCRFs within different corpora could serve as an input element (i.e. a feature) for classifying or clustering algorithms. PCRFs could therefore facilitate stylometric tasks like authorship attribution, author name disambiguation or maybe even plagiarism detection (a sketch of such a feature extraction follows below).

Thirdly, due to their language independence, the PCRFs presented hereby can be thought of as a means for evaluating differences between two different languages, or two different states of the same language. One can, for example, apply the PCRFs upon two different translations T1 and T2 and observe that the distribution of PCRFs within T1 is more similar to the distribution of PCRFs in the original than is the distribution within T2. One could then argue that, from a rhetoric, stylistic or even poetic standpoint, T1 is a more adequate translation of the original text than T2. On the other hand, when we speak about comparing two different states of the same language, we propose to perform PCRF-based analysis not only upon a corpus representing a mature state of the language – like that of a Shakespeare, for example – but also to compare such a state with earlier stages of language development, as represented by the CHILDES corpus (MacWhinney & Snow, 1985).
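Computationally, the second and third use cases reduce to turning PCRF relative frequencies into feature vectors and comparing those vectors. The sketch below is purely illustrative – the texts, the figure inventory and the distance measure are our own assumptions:

    #!/usr/bin/perl
    # Illustration: PCRF frequency vectors as stylometric features.
    use strict;
    use warnings;

    my %pcrf = (
        'anadiplosis{2}' => qr/((\w{3,})[.?!,] \2)/si,
        'antimetabole'   => qr/((\w{3,}) (.{0,23}) (\w{3,})[^\.!?]{0,23} \4 \3 \2)/si,
    );

    # Relative frequency of each PCRF per fragment of a text.
    sub pcrf_vector {
        my @fragments = split /\n/, shift;
        my %freq;
        for my $name (keys %pcrf) {
            my $hits = 0;
            for my $frag (@fragments) {
                $hits++ while $frag =~ /$pcrf{$name}/g;
            }
            $freq{$name} = $hits / @fragments;
        }
        return \%freq;
    }

    # Euclidean distance between two PCRF frequency vectors.
    sub distance {
        my ($u, $v) = @_;
        my $sum = 0;
        $sum += ($u->{$_} - $v->{$_})**2 for keys %$u;
        return sqrt $sum;
    }

    my $original = pcrf_vector("Fear leads to anger. Anger leads to hate.\nAlle wie einer, einer wie alle.");
    my $t1       = pcrf_vector("Fear breeds anger. Anger breeds hate.\nAll as one, one as all.");
    printf "distance(original, T1) = %.4f\n", distance($original, $t1);

A translation whose vector lies closer to the original's would, on this crude measure, count as the rhetorically more faithful one.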
Finally, by considering PCRFs to be a method which could be used as a tool for analyzing the development of language faculties in a human baby, we come closer to a third and somewhat "cognitive" application. This application – which is the subject of our current research – is based upon the belief that it is not unreasonable to imagine that PCRFs could be constructed not manually, but automatically, by means of the genetic programming paradigm (Koza, 1992). Given that the PCRE language is one of the most concise programming languages conceivable, and given that 1) the speed of execution, 2) the specificity and 3) the sensitivity could serve as input parameters of a function evaluating the fitness of a PCRF candidate (a hypothetical sketch of such a function is given below), it is possible that the research initiated by our current proposal could result in a full-fledged and possibly non-supervised method of rhetoric figure induction.

In such a way, our PCRFs could become something a little more than just another tool for stylometric analysis of textual corpora – they could help to answer a somewhat more fundamental question: "What is the essence of figures of speech and how could they be represented within & by an artificial and/or organic symbol-manipulating agent?"
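A fitness function of the kind described above could take roughly the following shape. The sketch is entirely hypothetical – the annotated sample, the weights and the specificity proxy are our own assumptions, not part of any implemented system:

    #!/usr/bin/perl
    # Hypothetical fitness function for a candidate PCRF (GP paradigm).
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Annotated sample: fragment => 1 if it contains the target figure, else 0.
    my %annotated = (
        'Fear leads to anger. Anger leads to hate.' => 1,
        'This fragment contains no figure at all.'  => 0,
    );

    sub fitness {
        my ($candidate) = @_;                           # candidate regex as a string
        my $re = eval { qr/$candidate/si } or return 0; # malformed => worst fitness

        my ($tp, $fp, $fn) = (0, 0, 0);
        my $t0 = [gettimeofday];
        while ( my ($frag, $label) = each %annotated ) {
            my $matched = ($frag =~ $re) ? 1 : 0;
            $tp++ if $matched  && $label;
            $fp++ if $matched  && !$label;
            $fn++ if !$matched && $label;
        }
        my $elapsed     = tv_interval($t0);
        my $sensitivity = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
        my $specificity = 1 / (1 + $fp);                # crude false-positive penalty

        # Weighted combination of sensitivity, specificity and speed (weights assumed).
        return 0.5 * $sensitivity + 0.4 * $specificity + 0.1 / (1 + $elapsed);
    }

    printf "fitness = %.3f\n", fitness('((\w{3,})[.?!,] \2)');

Candidate PCRFs would then be varied, evaluated and selected upon such a score across GP generations.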
Acknowledgments

The author wishes to express his gratitude to the University Paris 8 – St. Denis and the Lutin Userlab for support without which the research hereby presented would not have been possible, as well as to thank the philologists and comparatists of the École Pratique des Hautes Études and the ÉNS for keeping alive the Tradition within which Language is considered to be something more than just an object of parsing and POS-tagging.

References

Aho, A. V. (1991). Algorithms for finding patterns in strings. In Handbook of Theoretical Computer Science (vol. A): Algorithms and Complexity. Cambridge, MA: MIT Press.
Aristotle. (1954). Rhetoric. 1355b.
Crane, G. (1998). The Perseus Project and beyond: How building a digital library challenges the humanities and technology. D-Lib Magazine, 1, 18.
Crosswhite, J., Fox, J., Reed, C., Scaltsas, T., & Stumpf, S. (2003). Computational models of rhetorical argument. In Argumentation Machines – New Frontiers in Argument and Computation, 175–209.
Dubois, J., et al. (1970). Rhétorique générale: par le Groupe µ. Paris: Larousse.
Friedl, J. (2006). Mastering Regular Expressions. Sebastopol, CA: O'Reilly Media.
Gawryjolek, J. (2009). Automated annotation and visualization of rhetorical figures.
Grasso, F. (2002). Towards computational rhetoric. Informal Logic, 22(3).
Greenberg, J. H. (1957). The nature and uses of linguistic typologies. International Journal of American Linguistics, 23(2), 68–77.
Harris, R., & DiMarco, C. (2009). Constructing a rhetorical figuration ontology. In Persuasive Technology and Digital Behaviour Intervention Symposium.
Hart, M. (2000). Project Gutenberg.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.
MacWhinney, B., & Snow, C. (1985). The child language data exchange system. Journal of Child Language, 12(2), 271–295.
Perelman, C., & Olbrechts-Tyteca, L. (1969). The New Rhetoric: A Treatise on Argumentation.
Plato. (1986). Phaedrus. 261e.
Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629.
Taboada, M., & Mann, W. C. (2006). Rhetorical Structure Theory: looking back and moving ahead. Discourse Studies, 8(3).
Wall, L., & Loukides, M. (2000). Programming Perl. Sebastopol, CA: O'Reilly Media.

PROCEEDINGS IACAP 2011
FIRST INTERNATIONAL CONFERENCE OF IACAP
THE COMPUTATIONAL TURN: PAST, PRESENTS, FUTURES?
4 – 6 JULY 2011, AARHUS UNIVERSITY

Printed with the financial support of the Heinz Nixdorf Institute, University Paderborn, Germany
© Verlagshaus Monsenstein und Vannerdat OHG, Am Hawerkamp 31, 48155 Münster

"The Computational Turn: Past, Presents, Futures?"

Dear participants,

In the West, philosophical attention to computation and computational devices is at least as old as Leibniz. But since the early 1940s, electronic computers have evolved from a few machines filling several rooms to widely diffused – indeed, ubiquitous – devices, ranging from networked desktops, laptops and smartphones to "the internet of things." Along the way, initial philosophical attention – in particular, to the ethical and social implications of these devices (so Norbert Wiener, 1950) – became sufficiently broad and influential as to justify the phrase "the computational turn" by the 1980s. In part, the computational turn referred to the multiple ways in which the increasing availability and usability of computers allowed philosophers to explore a range of traditional philosophical interests – e.g., in logic, artificial intelligence, philosophical mathematics, ethics, political philosophy, epistemology, ontology, to name a few – in new ways, often shedding significant new light on traditional issues and arguments. Simultaneously, computer scientists, mathematicians, and others whose work focused on computation and computational devices often found their work to evoke (if not force) reflection and debate precisely on the philosophical assumptions and potential implications of their research. These two large streams of development – especially as calling for necessary interdisciplinary dialogues that crossed what were otherwise often hard disciplinary boundaries – inspired what became the first of the Computing and Philosophy (CAP) conferences in 1986 (devoted to Computer-Assisted Instruction in philosophy).

Since 1986, CAP conferences have grown in scope and range to include an extensive array of intersections between computation and philosophy as explored across a global range of cultures and traditions – issuing in fruitful cross-disciplinary collaborations and numerous watershed insights and contributions to scholarly reflection and publication. In keeping with what has now become a significant tradition of critical inquiry and reflection in these domains, IACAP'11 celebrates the 25th anniversary of CAP conferences by focusing on the past, present(s), and possible future(s) of the computational turn.

Aarhus, July 2011
Charles Ess, Organizer, Department of Information- and Media Studies, Aarhus University
Ruth Hagengruber, Program Chair, Paderborn University

ACKNOWLEDGEMENTS

Happily, in planning and organizing IACAP'11, I have received generous support and encouragement from more persons and institutions than can be fully listed here – beginning with the Track Chairs, members of the Program Committee / Comité scientifique, and the keynote speakers who have kindly accepted our invitation to join us in Aarhus for our conference.
In addition, I would like to express deep gratitude to my colleagues in the Department of Information- and Media Studies (IMV), Aarhus University, including the highly competent members of the secretariat and our chair, Steffen Ejnar Brandorff. Without your on-going encouragement, assistance, and financial support, IACAP'11 would simply not have taken place at Aarhus University. I am also very grateful to Aarhus University for additional forms of support, including their conference facilities, and most especially for the very able assistance of Ulla Rasmussen Billings (Faculty Secretariat) and her advice on multiple conference matters, including budgeting and the conference registration page.

For the first time in its now 25-year history, IACAP has offered travel bursaries to support the participation of our younger colleagues: Dr. Johnny Søraker has ably taken on the difficult chore of coordinating the awarding of these bursaries. Many thanks (mange tak!).

Finally, a thousand thanks (tusind tak!) to Prof. Dr. Ruth Hagengruber (Universität Paderborn), who has undertaken not only the daunting role of Program Chair, but also the editing and production of these Proceedings for IACAP'11.

Aarhus, July 2011
Charles Ess

Table of Contents

Keynotes

Presidential address
Beavers, Anthony F. – IS ETHICS COMPUTABLE, OR WHAT OTHER THAN CAN DOES OUGHT IMPLY?
Aas, Katja Franko – (IN)SECURE IDENTITIES: ICTS, TRUST AND BIOPOLITICAL TATTOOS

Covey Lifetime Achievement Award
Bynum, Terrell Ward – INFORMATION AND DEEP METAPHYSICS

Herbert A. Simon Award for Outstanding Research in Computing and Philosophy
Sullins, John P. – THE NEXT STEPS IN ROBOETHICS

Brian Michael Goldberg Award for Outstanding Graduate Research in Computing and Philosophy (sponsored by Carnegie Mellon University)
Buckner, Cameron – COMPUTATIONAL METHODS FOR THE 21ST CENTURY PHILOSOPHER: RECENT ADVANCES AND CHALLENGES IN COGNITIVE SCIENCE AND METAPHILOSOPHY

Panel
Charles Ess / Elizabeth Buchanan / Jeremy Mauger – INTERNET RESEARCH ETHICS: CORE CHALLENGES, NEW DIRECTIONS

Tracks

Track I: Philosophy of Computer Science
Bengez, Rainhard Z. – RULES AND PROGRAMMING LANGUAGES
Blanco, Javier O. et alia – A BEHAVIOURAL CHARACTERIZATION OF COMPUTATIONAL SYSTEMS
Boltuc, Peter – WHAT IS THE DIFFERENCE BETWEEN YOUR FRIEND AND A CHURCH-TURING LOVER
Chokvasin, Theptawee – HAECCEITY AND INFORMATION
Duran, Juan M. – THE LIMITS OF COMPUTER SIMULATIONS AS EPISTEMIC TOOLS
Franchette, Florent – WHY TO BUILD A PHYSICAL MODEL OF HYPERCOMPUTATION
Geier, Fabian – THE MATERIALISTIC FALLACY
Meyer, Steven – THE EFFECT OF COMPUTERS UNDERSTANDING TRUTH ON PHILOSOPHY OF COMPUTER SCIENCE
Monin, Alexandre, Halpin, Harry – THE ARTIFACTUALIZATION OF THE WEB AS ONTOLOGICAL COMMITMENTS
Pagano, Miguel – SEMANTICS OF PROGRAMMING LANGUAGES
Riss, Uwe
Sinclair, Nathan – QUINEAN HOLISM AND THE INDETERMINANCY OF COMPILATION
Smith, Lindsay – IS FINDING A 'BLACK SWAN' (POPPER, 1936) POSSIBLE IN SOFTWARE DEVELOPMENT?
Solodovnik, Iryna – ONTOLOGY: FROM PHILOSOPHY TO ICT AND RELATED AREAS. PROBLEMS AND PERSPECTIVES
Thürmel, Sabine – THE EVOLUTION OF SOFTWARE AGENTS AS DIGITAL OBJECTS
Turner, Raymond – MACHINES AND COMPUTATIONS

Track II: Philosophy of Information and Cognition
Funcke, Alexander – ON THE LEVEL OF CREATIVITY. PONDERINGS ON THE NATURE OF KANTIAN CATEGORIES, CREATIVITY AND COPYRIGHTS
Giardino, Valeria – THE FOURTH REVOLUTION AND SEMANTIC INFORMATION
Heersmink, Richard – EPISTEMOLOGICAL AND PHENOMENOLOGICAL ISSUES IN THE USE OF BRAIN-COMPUTER INTERFACES
Hewlett, David, Cohen, Paul – AN INFORMATION-THEORETIC MODEL OF CHUNKING
Janlert, Lars-Erik – THE DYNAMISM OF INFORMATION ACCESS FOR A MOBILE AGENT IN A DYNAMIC SETTING AND SOME OF ITS IMPLICATIONS
Kitto, Kirsty – CONTEXTUAL INFORMATION: MODELING DIFFERENT INTERPRETATIONS OF THE SAME DATA WITHIN A GEOMETRIC FRAMEWORK
Menant, Christophe – COGNITION AS A MANAGEMENT OF MEANINGFUL INFORMATION: PROPOSAL FOR AN EVOLUTIONARY APPROACH
Quiroz, Francisco Hernandez – COMPUTATIONAL AND HUMAN MIND MODEL
Schroeder, Marcin – SEMANTICS OF INFORMATION: MEANING AND TRUTH AS RELATIONSHIPS BETWEEN INFORMATION CARRIERS
Vakarelov, Orlin – PRE-COGNITIVE SEMANTIC INFORMATION

Track III: Autonomous Robots and Artificial Cognitive Systems
Anokhina, Margaryta, Dodig-Crnkovic, Gordana – WHO WILL HAVE IRRESPONSIBLE, UNTRUSTWORTHY, IMMORAL INTELLIGENT ROBOT? WHY ARTIFACTUALLY INTELLIGENT ADAPTIVE AUTONOMOUS AGENTS NEED TO BE ARTIFACTUALLY MORAL?
Arkin, Ronald – THE ETHICS OF ROBOTIC DECEPTION
Bello, Paul et alia – PROLEGOMENON TO ANY FUTURE THEORY OF MACHINE AUTONOMY
Briggs, Gordon – AUTONOMOUS AGENTS AND SENSES OF RESPONSIBILITY
Hagengruber, Ruth – THE ENGINEERABILITY OF SOCIAL INSTITUTIONS
Heimo, Olli I., Kimppa, Kai K. – RESPONSIBILITY IN ACQUIRING CRITICAL E-GOVERNMENT SYSTEMS: WHOSE FAULT IS FAILURE?
Kavathatzopoulos, Iordanis, Laaksoharju, Mikael – WHAT ARE ETHICAL AGENTS AND HOW CAN WE MAKE THEM WORK PROPERLY?
Molyneux, Bernard – HOW THE HARD PROBLEM OF CONSCIOUSNESS MIGHT ARISE FOR AN EMBODIED (SYMBOL) SYSTEM
Vallverdu, Jordi, Casacuberta, David – THE GAME OF EMOTIONS (GOE): AN EVOLUTIONARY APPROACH TO AI DECISIONS
Veale, Richard – THE CASE FOR DEVELOPMENTAL NEUROROBOTICS
Waser, Mark R. – WISDOM DOES IMPLY BENEVOLENCE

Track IV: Technosecurity from Everyday Surveillance to Digital Warfare
Crutzen, C.K.M. – THE MASKING AND UNMASKING OF PRIVACY
Hempel, Leon – CHANGE AND CONTINUITY – FROM THE CLOSED WORLD OF BIPOLARITY TO THE CLOSED WORLD OF THE PRESENT
Macnish, Kevin – SUBITO AND THE ETHICS OF AUTOMATING THREAT ASSESSMENT
Othmer, Julius, Weich, Andreas – MATCHING – POPULAR MEDIA BETWEEN SECURITY WORLDS AND CULTURES OF RISK
Taddeo, Mariarosa – INFORMATIONAL WARFARE AND JUST WAR THEORY
Weber, Jutta – TECHNO-SECURITY, RISK AND THE MILITARIZATION OF EVERYDAY LIFE

Track V: Information Ethics, Robot Ethics
Asaro, Peter – IS THERE A HUMAN RIGHT NOT TO BE KILLED BY A MACHINE?
Dasch, Thomas – DO WE NEED A UNIVERSAL INFORMATION ETHICS?
Douglas, Keith – A PSEUDOPERIPATETIC APPLICATION SECURITY HANDBOOK FOR VIRTUOUS SOFTWARE
Hromada, Daniel D. – THE CENTRAL PROBLEM OF ROBOETHICS: FROM DEFINITION TOWARDS SOLUTION
Soraker, Johnny Hartz – AFFECTING THE WORLD OR AFFECTING THE MIND? THE ROLE OF MIND IN COMPUTER ETHICS
Tonkens, Ryan – THE ETHICS OF AUTOMATED WARFARE
Vallor, Shannon – CAREBOTS AND CAREGIVERS: ROBOTICS AND THE ETHICAL IDEA OF CARE
Wong, Pak-Hang – CO-CONSTRUCTION AND CO-MANAGEMENT OF ONLINE IDENTITIES: A CONFUCIAN PERSPECTIVE

Track VI: Multidisciplinary Perspectives
Baumgaertner, Bert – REFLECTIVE INEQUILIBRIUM
Belfer, Israel – THE INFORMATION-COMPUTATION TURN: A HACKING-TYPE REVOLUTION
Breems, Nick – COMPUTERS AND PROCRASTINATION: "I'LL JUST CHECK MY FACEBOOK QUICK A SECOND"
Bod, Rens et alia – HOW MUCH DO FORMAL NARRATIVE ANNOTATIONS DIFFER? A PROPPIAN CASE STUDY
Desclés, Jean-Pierre et alia – COMBINATORY LOGIC WITH FUNCTIONAL TYPES IS A GENERAL FORMALISM FOR COMPUTING COGNITIVE AND SEMANTIC REPRESENTATIONS
Franchi, Stefano – THE PAST, PRESENT AND FUTURE ENCOUNTERS BETWEEN COMPUTATIONS AND THE HUMANITIES
Guarini, Marcello et alia – REFLECTIONS ON NEUROCOMPUTATIONAL RELIABILISM
McKinley, Steve – STATES OF AFFAIRS AND INFORMATION OBJECTS
McKinley, Steve – INFORMATION AND SCIENTIFIC EXPLANATION
Nicolaidis, Michael – BIOLOGICALLY INSPIRED SINGLE-CHIP MASSIVELY PARALLEL SELF-HEALING, SELF-REGULATING TERA-DEVICE COMPUTERS: PHILOSOPHICAL IMPLICATIONS OF THE EFFORTS FOR SOLVING TECHNOLOGICAL SHOW-STOPPERS IN THE PATH OF THE NEXT COMPUTATIONAL TURN
Portier, Pierre-Edouard, Calabretto, Sylvie – STRUCTURAL CONSTRAINTS FOR THE CONSTRUCTION OF MULTI-STRUCTURED DOCUMENTS
York, William W., Ekbia, Hamid R. – (DIS)TASTEFUL MACHINES? AESTHETIC COGNITION AND THE COMPUTATIONAL TURN IN AESTHETICS

Track VII: Social Computing
Alhutter, Doris – THE SOCIAL AND ITS POLITICAL DIMENSION IN SOFTWARE DESIGN: A SOCIO-POLITICAL APPROACH
Barker, Steve – A SOCIAL EPISTEMOLOGICAL APPROACH FOR DISTRIBUTED COMPUTER SECURITY
Coeckelbergh, Mark – TRUST, POWER AND INFORMATION TECHNOLOGY
Compagna, Diego – THE BENEFITS OF SOCIAL THEORY FOR MODELLING STABLE ENVIRONMENTS OF SYSTEMIC TRUST WITHIN MULTI AGENT SYSTEMS
Danka, Istvan – COMPUTER NETWORKS AND THE PHILOSOPHY OF MIND. A SOCIAL MIND – NETWORKED COMPUTER ANALOGY
Dodig-Crnkovic, Gordana – AGENT BASED MODELING WITH APPLICATIONS TO SOCIAL COMPUTING
Ekbia, Hamid R., Zhang, Guo – OBJECTS OF IDENTITY, IDENTITY OF OBJECTS: FOR A MATERIALIST ACCOUNT OF ONLINE BEHAVIOUR
Ropolyi, Laszlo – THE CONSTRUCTION OF REALITY AND OF SOCIAL BEING IN THE INFORMATION AGE
Simon, Judith – TRUST, KNOWLEDGE AND SOCIAL COMPUTING. RELATING PHILOSOPHY OF COMPUTING AND EPISTEMOLOGY
Vehlken, Sebastian – OPERATIONAL IMAGES. AGENT-BASED COMPUTER SIMULATIONS AND THE EPISTEMIC IMPACT OF DYNAMIC VISUALIZATION
Zambak, Aziz – SOCIAL COMPUTATION AS A DISCOVERY MODEL FOR THE SOCIAL SCIENCES

Track VIII: IT, Culture and Globalization
Asai, Ryoko et alia – THE REVIVAL OF NATIONAL AND CULTURAL IDENTITY THROUGH SOCIAL MEDIA
Backhaus, Patrick, Dodig-Crnkovic, Gordana – WIKILEAKS AND THE ETHICS OF WHISTLE BLOWING
De Gooijer, Thijmen – INTERPRETING CODES OF ETHICS IN GLOBAL SOFTWARE ENGINEERING
Hongladarom, Soraj – INFORMATION, TECHNOLOGY, GLOBALIZATION AND INTELLECTUAL PROPERTY RIGHTS

Track IX: Surveillance, sousveillance…
Beinsteiner, Andreas – TOWARDS A HERMENEUTIC PHENOMENOLOGY OF CYBER-SPACE: POWER VS. CONTROL
Ganascia, Jean-Gabriel – THE WIKILEAKS LOGIC
Najar, Anis – DEMOCRACY 2.0 – HOW THE WEB MAKES REVOLUTION
Reynolds, Carson – NEGATIVE SOUSVEILLANCE
Strauss, Stefan – GOVERNMENT APPROACHES FOR MANAGING ELECTRONIC IDENTITIES OF CITIZENS – EVOKING A CONTROL DILEMMA?

Track X: SIG Track – Machines and Mentality
Arkin, Ronald C. – MORAL EMOTIONS FOR ROBOTS
Arkoudas, Konstantine – ON DEEPLY UNCONSCIOUS INTENTIONAL STATES
Bridewell, Will et alia – OUTLINING A COMPUTATIONALLY PLAUSIBLE APPROACH TO MENTAL STATE ASCRIPTION
Guarini, Marcello – AGENCY: ON MACHINES THAT MENTALIZE
Nirenburg, Sergej – TOWARD A TESTBED FOR MODELING THE KNOWLEDGE, GOALS AND MENTAL STATES OF OTHERS
Scheutz, Matthias – ARCHITECTURAL STEPS TOWARDS SELF-AWARE ROBOTS
Sundar, Naveen, Bringsjord, Selmer – LOGIC-BASED SIMULATIONS OF MIRROR TESTING FOR SELF-CONSCIOUSNESS

List of Authors in Alphabetic Order

Keynotes

IS ETHICS COMPUTABLE, OR WHAT OTHER THAN CAN DOES OUGHT IMPLY?

ANTHONY F. BEAVERS
Department of Philosophy
The University of Evansville

In 2007, Anderson and Anderson wrote, "As Daniel Dennett (2006) recently stated, AI 'makes philosophy honest.' Ethics must be made computable in order to make it clear exactly how agents ought to behave in ethical dilemmas" (16). To rephrase, a computable system or theory of ethics makes ethics honest. But at what cost? Might Turing's 1950 prophecy that "at the end of the century the use of words … will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted" (1950, 442) soon take on normative dimensions due to research in artificial morality? Will attempts to make ethics computable lead us to redefine the term "moral" to fit the case of machines and thus change its meaning for humans also? I call this the threat of "moral nihilism … the doctrine that states that morality needs no internal sanctions, that ethics can get by without moral 'weight,' i.e., without some type of psychological force that restrains the satisfaction of our desire and that makes us care about our moral condition in the first place" (Beavers, 2011a).

Analyzing this possibility requires inspection of the meaning of the term "ought" and what it implies. In 2009, I argued that, following Kant, ought not only implies can, but also might not, in which case it would be morally wrong to create artificial Kantian agents, since doing so would require designing them in such a way that they could act immorally, but would not do so. Only on such a condition would it make sense to hold a machine responsible for its actions and praise or blame it for its behavior. In 2011, I argued that if ought implies can, then it also implies implementability. If a machine or human can act morally, this can only be because the mechanisms (whether in software or wetware) have the requisite components to allow for it. Thus, any theory of morality must be implementable in real working agents to qualify as a viable moral theory.
Given the conclusions of 2009, I argued in 2011 that designing machines in such a way that they behaved morally but were not able to act immorally would require redefining the term "morality" in such a way that full moral agency with internal sanctions was not intrinsic to ethics, but "merely a sufficient, and no longer necessary, condition for being ethical." In this case, internal states such as conscience, responsibility (as felt affective weight) and thus moral accountability are, ex hypothesi, not necessary for ethics either. Thus, if we build machines capable of being described by the term "moral", we can only do so by redefining the term. So, if a time is coming when we can speak of a machine as moral without expecting to be contradicted, we will have succeeded in turning ethics into a strictly extrinsic, behavioral affair in which internals are irrelevant.

Since, on the surface, an ethics without an ought is as empty as thinking without insight or wisdom, it is necessary to explore what else ought implies in order to form an adequate conception of a metaphysics of morals that will fit the information age. While other research for a working conception of ethics has already been done (e.g., Floridi and Sanders, 2004), a careful exploration of this foundational concept still appears lacking. I hope to fill this gap to explore whether ethics can get by without its cherished ought and, if so, what that implies for ethics more generally. The concern guiding this talk is whether the information age is issuing in a post-ethical age or whether it is leading to a redefinition of ethics that is both long overdue and needed.

References

Anderson, M., & Anderson, S. (2007). Machine ethics: Creating an ethical intelligent agent. AI Magazine, 28(4): 15–26.
Beavers, A. (2011). Moral machines and the threat of ethical nihilism. In P. Lin, G. Bekey & K. Abney (Eds.), Robot Ethics: The Ethical and Social Implications of Robotics. Cambridge, MA: MIT Press, forthcoming.
Beavers, A. (2009, March). Between angels and animals: The question of robot ethics, or is Kantian moral agency desirable? The Eighteenth Annual Meeting of the Association for Practical and Professional Ethics, Cincinnati, Ohio.
Dennett, D. (2006, May). Computers as prostheses for the imagination. The International Computers and Philosophy Conference, Laval, France.
Floridi, L., & Sanders, J. (2004). On the morality of artificial agents. Minds and Machines, 14(3): 349–379.
Turing, A. (1950). Computing machinery and intelligence. Mind, 59: 433–460.

(IN)SECURE IDENTITIES: ICTS, TRUST AND 'BIO-POLITICAL' TATTOOS

KATJA AAS
Department of Criminology and Sociology of Law
University of Oslo

The globalising world is marked by anonymity, mass mobility and mass consumerism. These conditions create a distinct set of challenges for social identification practices, first and foremost the challenge of creating reliable and 'trustworthy' identities. The paper addresses in particular the growing reliance on biometrics and biometric databases and examines how these forms of bodily control function as border controls. While revealing specific notions of subjectivity, the paper also explores how these technologies function as mechanisms of social sorting and global governance and have markedly different effects on the citizens of the global North and the global South.
INFORMATION AND DEEP METAPHYSICS

TERRELL WARD BYNUM
Department of Philosophy
Southern Connecticut State University

Scientists working on the cutting edges of their field often engage in thinking that is much like metaphysics. Similarly, in the past, philosophers inspired by major advances in science have made significant additions to metaphysics, as well as other branches of philosophy. On occasion, the scientists and philosophers have been the very same people. For example, in ancient times Aristotle created physics, biology and animal psychology, and at the same time he made related contributions to metaphysics, logic, epistemology, and other branches of philosophy. Again, during the Enlightenment in Europe, influential philosophers like Descartes and Leibniz also were respected scientists and first-class mathematicians. At times, people who were primarily scientists (for example, Copernicus, Galileo, and Newton) inspired thinkers who were primarily philosophers (for example, Hobbes, Locke, and Kant). In more recent times, revolutionary scientific contributions of Darwin, Einstein, Schrödinger, Heisenberg, and others significantly influenced the philosophical ideas of people like Spencer, Russell, Whitehead, Popper, and many more.

Today, in the early years of the twenty-first century, developments in cosmology and quantum physics appear likely to alter significantly our scientific understanding of the universe, of life, and of the human mind; and many scientists have become convinced that the universe, ultimately, is made of quantum information. These developments, it seems to me, are very likely to lead to important new contributions to philosophy; and indeed, as illustrated by Luciano Floridi's writings on informational realism and the philosophy of information, significant philosophical contributions have already begun to appear. Of special interest in this presentation is the idea that the universe is a vast "ocean" of quantum bits ("qubits"), and thus each object or process in the universe can be seen as a constantly changing data structure comprised of qubits. On this account of the ultimate nature of the universe, the fundamental "stuff" of which our universe is made is quantum information. Unlike traditional "bits", such as those processed in most of today's information technology devices, "qubits" have quantum features such as genuine randomness, superposition and entanglement – features that Einstein and other scientists considered "spooky" or "weird". These nontraditional features of qubits have made it possible to achieve unbreakable encryption, teleportation, and a new kind of computing – "quantum computing". In this presentation, a number of quantum topics, such as randomness, superposition, entanglement, collapse of a wave function, teleportation, and quantum computing, are briefly described. In light of such quantum features, it seems appropriate for philosophers to re-examine a variety of philosophical concepts, such as possibility and impossibility, potential and actual, cause and effect, being and reality, logic and contradiction, and a number of others. Such concepts are central to the "deep metaphysics" that provides a conceptual foundation for philosophy.
Consequently, this presentation calls upon philosophers to familiarize themselves with current developments in cosmology and quantum physics, especially those developments that see the universe as ultimately an expanding ocean of quantum information. If philosophers take on this challenge – as Luciano Floridi has already begun to do – the deep metaphysical foundations of philosophy are likely to be profoundly transformed. As a small contribution to that effort, this presentation concludes with a brief sketch of a possible new metaphysical theory.

THE NEXT STEPS IN ROBOETHICS

JOHN P. SULLINS
Department of Philosophy
Sonoma State University

RoboEthics has now matured from its beginnings as a curious offshoot of computer ethics into a sub-discipline of its own that has a well-defined scope of study. In this paper I will briefly look at the growth of RoboEthics and the important role it is playing in the development of robotics technology. I will then look at the more pressing open problems in RoboEthics and suggest some ways forward. I will focus primarily on the criticism that RoboEthics is impossible given that phronesis is beyond the capacity of machines. To refute this claim I will propose a model system inspired by the architecture of the IBM Watson computer that, I will argue, could achieve an artificial practical wisdom. This would be possible through the use of a context-sensitive hybrid of logical and non-logical search methods that could access documents to find comparable exemplar cases similar to the ethical situation the robot is attempting to reason about. Armed with this data, the robot would be able to make more nuanced decisions even without its own innate, human-equivalent practical wisdom.

COMPUTATIONAL METHODS FOR THE 21ST-CENTURY PHILOSOPHER: RECENT ADVANCES AND CHALLENGES IN COGNITIVE SCIENCE AND METAPHILOSOPHY

CAMERON BUCKNER
Department of Philosophy
Indiana University

As evidenced by past CAP conferences, the intersection of computing and philosophy has long been a fertile area of research. The past ten years in particular have produced a variety of new computational techniques of philosophical import. These powerful new techniques present philosophers with alluring opportunities, but also pose a number of challenges requiring methodological reforms. In cognitive science, new computational models of psychological processes are rapidly increasing our ability to predict behaviors, but the structure of these models seems to make a hash of traditional distinctions in psychology, such as that between cognition and association. In metaphilosophy, new statistical and logical programming methods offer the possibility of addressing otherwise intractable philosophical questions, but rely upon a variety of assumptions, require input data that can be expensive to collect, and produce results that can be difficult to evaluate. In this talk, I will review some of these new technologies, recommending new conceptual frameworks and methodologies to understand, evaluate, and utilize their results. While I will give a brief overview of this latest generation of research, the talk will focus primarily on specific examples from my own work in the areas of comparative psychology and dynamic ontology.
Panel

INTERNET RESEARCH ETHICS: CORE CHALLENGES, NEW DIRECTIONS

Charles Ess
Department of Information- and Media Studies
Aarhus University

Elizabeth Buchanan
Director, Center for Applied Ethics, University of Wisconsin-Stout
Co-Director, International Society for Ethics & Information Technology (INSEIT)

Jeremy Mauger
School of Information Studies
University of Wisconsin, Milwaukee

Internet Research Ethics (IRE) is an emerging cross-disciplinary field which studies how research is conducted in online environments and seeks to resolve the subsequent ethical dilemmas in normative and practical terms. While similar to its physical counterpart, conducting scholarly research online is different in terms of ethics and values. For example, online surveys bring new privacy concerns. Research in chat rooms confounds our notions of subject anonymity and identifiability. Scraping data from social networks or public blogs complicates issues of informed consent. At the same time, research conducted on and through the Internet has expanded exponentially in the last ten years; researchers across disciplines make frequent use of such tools as online survey generators, as well as engage in forms of participant observation of virtual worlds. Internet Research Ethics has thus emerged over the past decade as a distinct and important field of applied ethics – one that overlaps with central issues and approaches of information and computing ethics and that often informs (and is informed by) the broader intersections between computing and philosophy.

The panel will begin with a few real-world examples of ethical dilemmas that are representative of contemporary issues in IRE and are especially challenging to traditional ethics. Panelists will then provide an overview of two current projects focused on significantly developing the field of IRE, beginning with the current revision of the Association of Internet Researchers' (AoIR) ethical guidelines. These guidelines, adopted by AoIR in 2002, have found extensive use around the world as a helpful guide to analyzing and resolving ethical issues in Internet research. The current revision seeks to update the guidelines in light of the dramatic expansion of Internet research following on the emergence of so-called Web 2.0 technologies and the ongoing global diffusion of the Internet. The second project is the Internet Research Ethics Digital Library, Research Center, and Commons (http://www.internetresearchethics.org/). This ongoing project is the result of a grant awarded by the National Science Foundation to the Center for Information Policy Research at the University of Wisconsin-Milwaukee's School of Information Studies. A primary goal of this project is to develop and provide sound resources, a solidified research base, and expert advice as more researchers and more IRBs/ethics boards struggle with the complexities of Internet research ethics. Both projects thus share an emphasis on praxis – i.e., analyzing and responding to real-world dilemmas faced by a growing research community around the globe. Following these introductions and overviews, the panel will invite critical discussion of the representative issues, approaches, and resources.
As well, the panel will welcome comments and suggestions from participants for additional resources and insights that will contribute to both projects – and will invite suggestions as to ways in which these projects may in turn contribute to contemporary work in information and computing ethics. A last goal of the panel is to develop a better articulation – a conceptual map – of the multiple relationships between IRE as a field of information and computing ethics and other characteristic foci and thematics of computing and philosophy.

Track I: Philosophy of Computer Science

RULES AND PROGRAMMING LANGUAGES

RAINHARD Z. BENGEZ
Philosophy of Science, Technology, and Engineering Department
Carl von Linde Academy, TUM School of Education
TU München, Arcisstr. 21, 80333 München, Germany
bengez@tum.de

Abstract. In computer science and related fields we talk much about rules. The word rule appears very often, directly or unspoken, in papers concerning computer science or the philosophy of computer science. We talk about logic(s), interpreters, procedures and compilers, systems of rules, programming languages, automata and rules of software design, good practices, and much, much more. But, unfortunately, the meanings of the word rule to which one refers from case to case seem to be unclear. In my contribution I would like to try to show some of these ambiguities and discuss ways to avoid them. According to the nature of this subject, my contribution is both analytical and normative, because I will analyze some applications of the word and work out a traceable direction for its use. Admittedly, the word rule has so many directions for use in computer science and the philosophy of computer science that I cannot talk about most of them. I will restrict myself to rules inducing action, and especially to such rules in programming languages (DSL, specification, etc.). This would mean that rules are guiding actions in languages or, stated more generally, in sequentially structured patterns. I will start by talking about the interdependence of rules and actions.

A BEHAVIORAL CHARACTERIZATION OF COMPUTATIONAL SYSTEMS

JAVIER BLANCO, Universidad Nacional de Cordoba, Argentina
RENATO CHERINI, Universidad Nacional de Cordoba, Argentina
MARTIN DILLER, Universidad Nacional de Cordoba, Argentina
AND PÍO GARCÍA, Universidad Nacional de Cordoba, Argentina

Abstract. We introduce the concept of interpreter as a producer of behavior in response to some input that codifies it. We argue that the notion of interpreter captures the minimal characteristics shared by different kinds of computational devices, and can thus serve as a criterion for identifying how interesting a computational system is. This characterization contrasts with many of the current functional descriptions offered in the literature on this topic, in that these are somewhat dependent on the technology that is currently available. Since the concept of interpreter can be used to compare different systems, it defines a computational hierarchy, establishing the relative degree of computationalism of different systems. This enables us to restate some ontological questions, such as "what is a program?" and "when is a system computational?", in more precise terms which admit clearer answers.

Any system can be characterized in terms of its possible behaviors.
In particular, a useful description of a computational system is given by the relationship, characteristic of the system, between an input and the behavior produced as a response to that input. The feature that distinguishes computational systems from other types of systems is that they may produce a very large and interesting set of behaviors, depending on syntactic inputs and "without changing a single wire" (Dijkstra, 1988). Thus, the characteristic input-behavior relation implicitly defines an encoding of behaviors as syntactic objects. We have suggested in (Blanco et al., 2011) that some key aspects of computational systems can be captured by the ubiquitous concept of interpreter as used both in theoretical and applied computer science (Jones, 1997; Abelson & Sussman, 1996; Jifeng & Hoare, 1988), defined in a very general manner. In this article, we present an interpreter as the necessary link between a set of behaviors and their respective encodings, without relying on any mechanistic account of systems. As we argue elsewhere, the concept of interpreter can be regarded not only as a notion that captures the minimal common characteristics of different types of computational devices and serves to clarify various concepts which pervade computer science, but also as a framework for understanding computing.

By the behavior of a system we understand only a description of the occurrences of certain events considered relevant to the system. Different ways of observing a system may determine different sets of behaviors. Thus, the behaviors will depend on a decision regarding the events that are considered of interest for that system (for particular purposes). A precise definition of behavior will be left unspecified here, since this will only make sense when a particular framework is stated. Intuitively, an interpreter produces a behavior according to some input that codifies it. Usually, the encoded behavior may depend on input data, but for simplicity we will assume in this presentation that the data and the behavior are already encoded together. The notion of interpreter is (almost) by definition the necessary link between the so-called "program-scripts" and "program-processes" (Eden, 2007; Blanco & Garcia, 2008). Given a characterization of a fixed set B of possible behaviors, and a set of syntactic elements P, an interpreter is a function i : P -> B assigning some behavior b in B to every p in P. When this relation is given, we say that p is the encoding of b. Generally, we speak of the syntactic domain P as the programming language, and of p as a program. A (physical) system I realizes an interpreter i if it is capable of receiving an input p and systematically producing the observable behavior b such that i(p) = b. In this case we say that I effectively computes b via the program p. We say that a (physical) system realizes an interpreter when, every time we provide it with an instance of an encoding, it produces the corresponding observable behaviour. We do not consider internal states, since these may be realized in very different ways.
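Read computationally, the definition i : P -> B is simply a mapping from syntactic encodings to behaviours. The following toy sketch – our own illustration appended here, with an invented miniature "programming language" – renders the definition concrete in Perl:

    #!/usr/bin/perl
    # Toy rendering of an interpreter i : P -> B.
    # The encodings and behaviours below are invented for illustration.
    use strict;
    use warnings;

    # The syntactic domain P maps, via the interpreter i, to behaviours:
    # here each behaviour is a sub producing observable output.
    my %i = (
        'greet'  => sub { print "hello\n" },
        'count3' => sub { print "$_\n" for 1 .. 3 },
    );

    # "Realizing" the interpreter: given an encoding p, produce i(p).
    sub realize {
        my ($p) = @_;
        my $b = $i{$p} or die "'$p' is not a program of this language\n";
        $b->();
    }

    realize('greet');     # observable behaviour: prints "hello"
    realize('count3');    # observable behaviour: prints 1, 2, 3

On this reading, the "degree of programmability" discussed below corresponds to the variety of behaviours the syntactic domain can encode.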
One way of making the notion of realization precise is along the lines of the notion of "practical realization of a function" defined in (Scheutz, 1999), where the relation is an isomorphism between the formal definition of i and a physical theory T that describes the system I (for example, the theory of electrical circuits), a theory which includes a description of the inputs and outputs of the system as well as a function F that maps inputs to outputs using the laws and language of T in a way that guarantees the preservation of the isomorphism. In (Scheutz, 1999) different degrees of "practicality" of the realization relation are also considered, which take into account the limits in precision with which the inputs can be measured and generated, the reliability and range of functioning of physical systems, the noise generated by the environment, etc.

The concept of interpreter serves as a criterion for distinguishing between systems that could be computational (w.r.t. some inputs and behaviours) and those that could not. Since we want to capture what makes any system programmable, we do not assume any particular implementation technology in the concept of interpreter. Different computational models, like Von Neumann machines, parallel machines, DNA-computers and quantum-computers, can be considered interpreters because they can systematically produce behaviours from their encodings in a predefined language. What will be specific to each model is the underlying theory used to justify that they are interpreters, not the criterion used to determine that they are indeed programmable systems.

The notion of interpreter can be seen as functional, i.e., an interpreter is such when it is capable of producing behaviors from programs. Following this idea, a program is a syntactic structure capable of being interpreted. A program is such only relative to a given interpreter, and an interpreter is such only for a particular programming language. The concepts of program, programming language and interpreter are thus relational and inter-definable. The main feature of an interpreter is that it is programmable: there is an available syntax with which a variety of behaviors can be encoded. The degree of programmability of an interpreter is given by the variety of behaviors that the underlying programming language is able to encode. The degree of programmability is the distinctive feature of an interesting computational system. If we consider a system computational when it is programmable, then being computational will also be a property which can be established only relative to a set of behaviors and a corresponding encoding (usually an actual programming language). In other words, the property of being computational will not make sense independently of a set of behaviors and the encoding. This allows us to tackle some philosophical problems, such as the problem of pan-computationalism (do all physical systems compute?) (Putnam 1987, Searle 1990, Chalmers 1996, Chrisley 1994, Copeland 1996, Piccinini 2008), from a different perspective. The question "Is this a computational system?" is replaced by the question "Is this a computational system with respect to this set of inputs and behaviors?", or equivalently, "How interesting, from a computational point of view, is this system?". From this perspective, in particular, several constructions of "trivial implementations of programs" which intend to show how the thesis of pan-computationalism can be established do not qualify as interesting computational systems.
Since the rise of computability theory in the thirties, it has been clear that a computation is related to a certain formal object that prescribes it, e.g. the description of a Turing machine, general recursive functions, a lambda-term, etc. A computation, then, is produced following this prescription. Putnam's (and Searle's) theorem (Putnam 1987, Searle 1990), on the other hand, tries to present a notion of computation in itself, reifying computation as something that exists independently of the prescription or program (any sequence of states would do). The property of being an interpreter for a given set of behaviours can be satisfied by certain systems. An interpreter is a general notion that can be used to characterize physical mechanisms (computers, calculators), a human acting mechanically (Turing's computor, a human carrying out the reductions of a lambda term), mathematical formalisms (universal Turing machines, etc.), or computers with computing power beyond Turing computability (Oracle computers (Copeland, 2002)). Whereas a (physical) counterpart is needed for the realization of an interpreter, the property of being an interpreter, and concomitantly the property of being a programmable system, can be determined by its abstract description.

References

Abelson, H. & Sussman, G. (1996). Structure and Interpretation of Computer Programs. MIT Press, Cambridge, MA, USA, 2nd edition.
Blanco, J., Cherini, R., Diller, M. & Garcia, P. (2011). Interpreters: towards a philosophical account of computer science. Technical Report.
Blanco, J. & Garcia, P. (2008). A categorial mistake in the formal verification debate. In European Conference on Computing and Philosophy (ECAP), June 2008.
Chalmers, D. (1996). Does a rock implement every finite-state automaton? Synthese, 108(3): 309–33.
Chrisley, R. (1994). Why everything doesn't realize every computation. Minds and Machines, 4(4): 403–20.
Copeland, J. (1996). What is computation? Synthese, 108(3): 335–59.
Copeland, J. (2002). Narrow versus wide mechanism. In Computationalism: New Directions. MIT Press.
Dijkstra, E. (1988). On the cruelty of really teaching computing science. Circulated privately.
Eden, A. (2007). Three paradigms of computer science. Minds and Machines, 17(2): 135–167.
Jifeng, He & Hoare, C. (1988). Unifying theories of programming. In Ewa Orlowska and Andrzej Szalas (eds.), RelMiCS, pages 97–99.
Jones, N. (1997). Computability and Complexity: From a Programming Perspective. MIT Press, Cambridge, MA, USA.
Piccinini, G. (2008). Computers. Pacific Philosophical Quarterly, 89(1): 32–73.
Putnam, H. (1987). Representation and Reality. MIT Press.
Scheutz, M. (1999). When physical systems realize functions. Minds and Machines, 9(2): 161–196.
Searle, J. (1990). Is the brain a digital computer? Proceedings and Addresses of the American Philosophical Association, 64 (November): 21–37.

WHAT IS THE DIFFERENCE BETWEEN YOUR FRIEND AND A CHURCH-TURING LOVER? A New Defense of H-Consciousness.

PIOTR BOŁTUĆ
University of Illinois Springfield
UHB 3030, One University Plaza, Springfield IL 62703
(and Warsaw School of Economics)

Abstract. Whatever functionality may be attained by a physical system (such as a human), it could, in principle, be replicated by a robot. We can define a Church-Turing lover as a robot with all the functionalities of a (realistic, or ideal) sex partner. What it lacks is only the first-person perspective. If we care what a partner truly feels, not just how he/she behaves, we should care.
Yet, if we could build in the relevant first-person consciousness, the difference would disappear, or it would be relegated to a broader social-historical context.

1. The gist of the Argument

An important direct implication of the Church-Turing thesis seems to be that whatever functionality may be attained (by a physical system, such as a human) can, in principle, be replicated by a robot. In the area of sex, whatever 'functionalities' a human lover may perform would in principle be replicable in advanced sex-toys. The term 'functionality' can be understood as broadly as we wish. Should the desired specifications of a lover include, in addition to advanced mechanical functionalities, also certain advanced tactile features, temperature adjustments, fluid emissions (including chemical replication of the body fluids, such as sweat, squirt or sperm), ionization levels and other bioelectrical fields, sounds or even sophisticated conversations and other language utterances ('the Turing test' is one of the implications of Church-Turing), such conditions can be produced, though sometimes at a cost which may in practice be prohibitive. Understanding this point is important for the large sex-toy industry, for other industries piggybacking on its research and development, but also for philosophers. The question for philosophers is what, if anything, would make such a robotic lover different from a human one.

Advanced robotic lovers can be viewed as external experience machines, where one's senses are stimulated by an artificial cause, not through direct brain stimulation but rather through the stimulation of external sensory organs. It is nevertheless similar to the experience machine, since the robot breaks the 'typical' or so-called 'proper' causal chain between the experiences and a human lover and initiates a so-called deviant causal chain (the terms 'proper' and 'deviant' are used here in the sense given to them in the theory of causality, not as moral evaluatives). I come to the conclusion that, while there is no functional difference, the human lover is supposed to have a first-person consciousness (h-consciousness) related to Chalmers' hard problem. Without such an assumption we have no way to philosophically articulate the difference between the moral subjects for whom 'there is something that it is like to experience' a certain thing (here, sex) from the inside, and those for whom there isn't such a thing. Perfect electronic lovers work better than zombies in demonstrating this point, since we avoid the controversies over whether it is conceivable that identical physical systems, such as human brains, could produce first-person consciousness in humans but not in zombies. The zombies seem to violate the tenet of materialism that there is no difference without physical difference, while electronic toys do not make such violations.

1.1. MAIN STEPS OF THE ARGUMENT

Let us present a 'sentence outline' of the main argument.

1.1.1. Defining a Church-Turing lover
It is the perfect functional imitation of a human lover in terms of all parameters desired, which may include some or all of the following: a. tactile features, b. reactivity to voice commands, c. speech quality, d. speech content (including the ability to meet the Turing test), e. advanced domestic skills (cooking, cleaning), f. other skills of an artificial companion as defined by Floridi.

1.1.2.
1.1.2. Defining your boyfriend/girlfriend
Your boyfriend/girlfriend is defined as a human being, equal or inferior to the Church-Turing lover in terms of the functionalities described broadly in points a–f and all other typical functionalities.

1.1.3. Establishing rough functional equality between the Church-Turing lover and the boyfriend/girlfriend
This includes responses to various objections, such as the social objection, the psychological objections and the religious objection. The only objection left unanswered is the reproductive objection, which leaves us with 'rough functional equality': the Church-Turing lover is functionally equal to your boyfriend/girlfriend provided you do not intend to procreate with him/her. (Actually, the Church-Turing thesis implies procreative functionality in robots as well.)

1.1.4. Atypical functionalities, defined as those of the first-person perspective
I show the futility of Church-Turing functional reenactments of presumed first-person states. Why do I want my boyfriend/girlfriend to have an orgasm, not just to be very good at faking one? (If I am not an egoist, I want her to feel good, not just to behave as if she felt so.) I also give brief responses to the privileged access problem.

1.1.5. The engineering thesis in machine consciousness
The engineering thesis in machine consciousness saves your girlfriend/boyfriend's uniqueness, but not forever. There is a first-person, inductively established difference between the Church-Turing lover and a boyfriend/girlfriend. The difference may partially disappear should we be able to engineer robots with first-person h-consciousness functionalities.

Acknowledgements

I developed an early draft of this argument at G. Harman's graduate seminar in epistemology in the Spring of 1991. I want to thank Prof. Harman, Alex Byrne, Mary McGowan and other participants for discussion. I want to thank John Barker and Keith Miller for recent related discussions.

HAECCEITY AND INFORMATION

THEPTAWEE CHOKVASIN
Suranaree University of Technology
Nakhon Ratchasima, Thailand

Abstract. The interest in 'information entities' is increasing in the philosophy of information. In this article, I offer a philosophical analysis concerned only with their haecceities (thisnesses) in the conception of Heideggerian 'functionality'. I argue that the haecceity of an information entity is necessary for making a legal judgment on cybercrimes, especially on sharing illegal information. Moreover, when considering the persistence of deleted information files, it turns out that the haecceities of those files have some aspect of being an indexical of functionality which is far beyond what Duns Scotus knew about them.

1. Introduction

I live in Thailand, and my friend is now in Japan. We are chatting on MSN. If I am now reading some information on a school website, and my friend is reading the same thing on his computer screen in Japan, are we reading exactly the same thing? Someone may consider this situation and say that the same thing can appear in many different places at the same time, and therefore we are reading exactly the same thing. However, someone else may say that one thing cannot be in many different places at the same time, so my friend and I are looking at two different website pages which are merely similar to each other.
And so a question arises: "When are two chunks of information, or two information entities, the same?" In the argument above, it can be seen that something very similar to the problem of universals is brought back from classic metaphysics. Cyber-information on a webpage behaves as if it were a universal which is instantiated in many individual computers. However, if a philosopher of information wants to retain the position of considering information as information entities, she may have to take another route in explaining the similarity of the two web-pages. She might explain that they are two different information entities that instantiate the same universal 'informativeness'. If the latter is right, then we have to admit that any information is an information entity of its own: no two distinct information entities exactly resemble each other.

Unfortunately, this position of metaphysical information entities may have an undesirable result. At present there are computer crime laws that forbid sending or forwarding any illegal information, pictures, pirated items, etc. to a third person. Both the sender and the receiver will be considered guilty of doing so. But how can the law remain legitimate if the receiver uses the argument above to show that, because of their status as different information entities, he did not receive the same thing from the sender? This leads us to other topics in metaphysics, those of identity and individuation, which interest me in this article more than finding an account of the sameness of information entities in the light of the metaphysics of universals. So I will stick to the topic of identity and individuation.

In this article, I will develop an analysis to answer the question above. The analysis will be in the light of Heideggerian 'functionality' as discussed by Ratcliffe (2002): apart from their properties, for two things to be identical to each other they must be considered from their 'teleological webs', including their values and ends. This view must, however, be developed further when answering the further question of what the appropriate notion of identity for information entities is. I will argue that the problem of individuation is deeper than the problem of identity. Two information entities that do not differ in their properties will be individuated by their info-haecceities, which are the bases for their identity.
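The distinction at stake can be made concrete with a small sketch (an editorial illustration, treating in-memory byte buffers as stand-ins for information entities): two copies can agree in every qualitative, shareable property and still be two.

    # Toy sketch (editorial illustration): qualitative sameness vs.
    # individuation for two byte-identical "information entities".
    import hashlib

    original  = bytearray(b"some forwarded content")
    forwarded = bytearray(original)             # the receiver's copy

    assert forwarded == original                # same content (common nature)
    assert (hashlib.sha256(forwarded).digest()
            == hashlib.sha256(original).digest())

    # Nothing qualitative distinguishes them; what makes them *two* is not a
    # shareable property -- the role assigned here to an info-haecceity.
    print(forwarded is original)                # False: two distinct entities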
2. Haecceity and Functionality

It is said that John Duns Scotus may have been the first philosopher to deal with the problem of individuation by means of "the difference". Duns Scotus gave arguments for positing an "individuating difference", or haecceity, which is to give an account of individuals. In his Ordinatio, Duns Scotus said: "I reply therefore to the question that material substance is determined to this singularity by some positive entity and to other diverse singularities by other diverse positive singularities." (Wolter, 1994: 286). The positive individuating difference, or haecceity, is different from the common nature, or quiddity, which explains what an individual essentially is. So we may never reach a full understanding of the haecceity.

Now we can say that the receiver of the illegal information may be considered guilty from another perspective. Although it is controversial whether he is guilty of receiving the very same thing from the sender, he is still guilty of producing new illegal entities in the computer system. The law must instead depend on "the difference" to legitimately charge two persons (not just one) with two different acts, differentiated by two different entities which merely happen to have similar characteristics in their common natures.

Cannot haecceity be grasped at all? In Haecceity (1993), Gary S. Rosenkrantz offered arguments to show that the haecceities of objects incapable of consciousness are cognitively inaccessible to us. Only the haecceity of one's being oneself can be grasped and expressed linguistically, and only by that one person. If we follow Rosenkrantz's argument, we have to admit that the haecceity of other entities around us is inaccessible. Is this also the case for the haecceity of an information entity, or info-haecceity?

References

Ratcliffe, M. (2002). Heidegger, Analytic Metaphysics, and the Being of Beings. Inquiry, 45(1), 35–57.
Rosenkrantz, G. S. (1993). Haecceity: An Ontological Essay. Dordrecht: Kluwer Academic Publishers.
Wolter, A. B. (1994). John Duns Scotus. In Jorge J. E. Gracia (Ed.), Individuation in Scholasticism: The Later Middle Ages and the Counter-Reformation 1150–1650 (pp. 271–298). Albany, NY: State University of New York Press.

THE LIMITS OF COMPUTER SIMULATIONS AS EPISTEMIC TOOLS

JUAN M. DURAN
Universität Stuttgart - SimTech
Germany

Over the past few decades the use of computers for scientific purposes has been extended to virtually every branch of science. The reasons for such widespread acceptance are clear: computers provide powerful means for solving complex models, as well as the speed and memory needed for analyzing and storing data, visualizing results, etc. A less broad, yet still important, use of computers in laboratory practice is the implementation of computer simulations. Lately, scientists have turned their interest to the design, validation, and execution of computer simulations instead of setting up, controlling and calibrating a whole material experiment. Whether for budgetary reasons, time-consuming delays, or complexity, scientific practice today is carried out in a way that strongly relies (if not fully depends) on computers.

Here we face a philosophical problem that has now become widely discussed. The current philosophical literature deals with the question whether a traditional experiment warrants greater (or less) epistemic confidence than a computer simulation. The most common strategy for answering this question is to address the so-called "materiality problem". Its standard conceptualization is characterized by Parker in the following way: "in genuine experiments, the same 'material' causes are at work in the experimental and target systems, while in simulations there is merely formal correspondence between the simulating and target systems (...) inferences about target systems are more justified when experimental and target systems are made of the 'same stuff' than when they are made of different materials (as is the case in computer experiments)" (Parker, 282).
In general terms, the materiality problem can be addressed either by emphasizing the lack of materiality in computer simulations as epistemically defective (as, for example, in Guala, Morgan and Giere), or by claiming that the presence of materiality in experiments is rare and, ultimately, unimportant for epistemic purposes (Morrison, Parker and Winsberg). Either solution leads to what I call the "dilemma of computer simulations", for it presupposes that once the ontology of computer simulations is sorted out, their epistemic power can be fully determined. Indeed, it requires, as a premise, an ontology that resolves the epistemic value of computer simulations. However, the informative exercise of simply checking off ontological features of computer simulations begs the question whether it is legitimate to draw any epistemic conclusion at all. Paraphrasing Hacking: they disagree because they agree on basics.

A different approach consists in defending the epistemic reliability of computer simulations as philosophically detached from their ontological conceptualization. This does not suggest, though, that these are two unrelated issues, but rather that each can be analyzed in its own right. In fact, there exists a close relation between them insofar as the ontology becomes, to a certain extent, a limiting case for the epistemology of computer simulations. Therefore, instead of asserting that "on grounds of inference, experiment remains the preferable mode of enquiry because ontological equivalence provides epistemological power" (Morgan, 326), I hold a twofold claim: firstly, that materiality only restricts computer simulations from "accessing" certain aspects of the world which require a causal story; in other words, materiality draws the boundaries beyond which experiments become a specific and irreplaceable method for knowing something about the world. Secondly, that computer simulations provide ways of inference that do not depend on their materiality but on their capacity for representing empirical as well as non-empirical systems.

Keeping an eye on these two claims, I propose to proceed in two correlated steps: firstly, by analyzing and characterizing the nature of computer simulations and material experiments; naturally, this step is highly dependent on assumptions about computational models, computer programs and experiments, all of which will be briefly addressed. Secondly, by discussing the philosophical relevance of the limits imposed on computer simulations by materiality, as well as drawing some preliminary conclusions about their epistemic power. Case examples will be briefly discussed as well.

In one sense, there are many aspects of scientific practice that cannot be substituted by computer simulations but require interaction with the material world: measurement, for instance, is one case. In certain measurement instances (i.e. so-called "derived measurement"), the causal interaction of an instrument with the world cannot be replaced by the calculus performed by a computer simulation. Another interesting case study is the reproducibility of experiments (cf. Franklin and Howson 1984): as is well known, varying instruments and experimental set-ups tends to increase epistemic reliability; it is not clear, however, that a similar methodology works for computer simulations.
In addition, the detection of new real-world entities seems a complete chimera for computer simulations, although it is a key role of material experiments. On the other hand, computer simulations have the capacity to deal with incredibly complex equations that represent real-world systems and from which it is possible to "crunch" large amounts of data. Much of our knowledge about the world also comes from manipulating and interpreting such data. Computer simulations can also be used for investigating "rational worlds", such as counterfactuals, thought experiments and mathematical worlds. I therefore urge a philosophical discussion of the epistemological value of computer simulations based on their capacities and limits, instead of on a dependence on an ontological conceptualization.

References

Franklin, A. & Howson, C. (1984). Why do scientists prefer to vary their experiments? Studies in History and Philosophy of Science Part A, 15(1), 51–62.
Giere, R. (2009). Is computer simulation changing the face of experimentation? Philosophical Studies, 143, 59–62.
Guala, F. (2002). Models, simulations, and experiments. In L. Magnani & N. J. Nersessian (Eds.), Model-Based Reasoning: Science, Technology, Values (pp. 59–74). Kluwer.
Morgan, M. (2005). Experiments versus models: New phenomena, inference and surprise. Journal of Economic Methodology, 12(2), 317–329.
Morrison, M. (2009). Models, measurement and computer simulation: the changing face of experimentation. Philosophical Studies, 143, 33–47.
Parker, W. (2009). Does matter really matter? Computer simulations, experiments, and materiality. Synthese, 169(3), 483–496.
Winsberg, E. (2009). A tale of two methods. Synthese, 169(3), 575–592.

WHY BUILD A PHYSICAL MODEL OF HYPERCOMPUTATION?

FLORENT FRANCHETTE
IHPST, University of Paris 1 Panthéon-Sorbonne
13 rue Dufour, 75006 Paris

Abstract. A model of hypercomputation can compute at least one function not computable by a Turing Machine, and its power comes from the absence of particular restrictions on the computation. Nowadays, some researchers claim that it is possible to build a physical model of hypercomputation called the "accelerating Turing Machine". But for what purposes would these researchers try to build a physical model of hypercomputation when they already have mathematical models more powerful than the Turing Machine? In my opinion, the computational gain provided by the accelerating Turing Machine is not free: the model also loses the possibility for a human to access the computation result. To capture this feature, I will propose a new constraint, the "access constraint", stating that a human can access the computation result regardless of computational resources. I will show that the Turing Machine meets this constraint, unlike the accelerating Turing Machine, and I will defend the claim that building a physical model of the latter is the way to meet the access constraint.

The aim of computability theory is to define the mathematical functions computable by algorithms. The definition of an algorithm is, however, an informal one, and computability theory needs a mathematical definition of this notion. In order to formalize a predicate which means "can be computed by an algorithm", Alan Turing (1936) proposed the formal predicate "computed by Turing Machine", or "Turing-computable".
According to Turing, the Turing Machine (TM) is a mathematical model of computation with a power equivalent to that of algorithms. This claim is summarized in the Church-Turing thesis: the functions computable by algorithms are the functions computable by a TM. On this thesis, the TM defines computation by algorithm, since if a function is not Turing-computable, there is no algorithm which can compute it. For example, it has been proved that some mathematical functions, such as the Diophantine function (see footnote 1), are not Turing-computable.

1 Given a Diophantine equation x, the Diophantine function is the function such that f(x) = 1 if x has at least one solution and f(x) = 0 otherwise.

Turing (1939), however, showed in his doctoral thesis that the computing power of the TM, that is to say the number of functions it can compute, depends on the type of constraints applied to the model. Models able to compute more functions than the TM are called "models of hypercomputation" or "hypermachines", and their computational power comes from the absence of particular restrictions on the computation. Recently, Jack Copeland (2002) has proposed a model of hypercomputation named the "Accelerating Turing Machine" (ATM), which is based on dropping the constraint that a computation must consist of a finite number of steps. Copeland demonstrates in his article that an ATM is able to execute an infinite number of computational steps in a finite time and to compute non-Turing-computable functions such as the Diophantine function. More importantly, some researchers defend the idea that it is possible to physically build an ATM.
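The acceleration can be made explicit with a standard worked sum (an editorial illustration, assuming the usual convention that the n-th step takes 2^(-n) units of time):

    % Editorial illustration of the accelerating schedule:
    \[
      T \;=\; \sum_{n=1}^{\infty} 2^{-n}
        \;=\; \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \cdots
        \;=\; 1 \;<\; \infty ,
    \]

so infinitely many steps are completed within one unit of time, even though no individual step is ever skipped.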
However, the physical construction of a computational model, whether equivalent to the TM or not, goes beyond the original framework of computability theory. Indeed, the Church-Turing thesis states nothing about the computing power of a physically built TM; it states only an equivalence between the intuitive concept of algorithm and the mathematical concept of Turing Machine. It is therefore pertinent to ask for what purposes these researchers would try to build physical hypermachines when they already have mathematical models more powerful than the TM. In other words, why leave the mathematical framework of hypercomputation and turn to the physical sciences?

In order to answer these questions, I will try to explain one reason why advocates of hypercomputation want to physically build a computational model with greater power than the TM. In my opinion, although the absence of a constraint such as the finite number of steps allows the ATM to compute more functions than the TM, the computational gain is not free. The model of hypercomputation also loses a key feature: the possibility for a human to access the computation result. To define this feature, I propose a distinction between "accessing the result" and "computing the result". We have access to the computation result when the result is available to us in principle; this result doesn't need to have a meaning, it can be merely a string of symbols. We compute a result when we can in principle follow each computational step from input to output. From these definitions, we can set out two constraints. Let f be a function computable by a model:

• The model meets the access constraint (AC) if for every input x we can have access to f(x).
• The model meets the computing constraint (CC) if for every input x we can compute f(x).

It is straightforward to show that these two constraints are built into the definition of a TM. However, I think that the ATM meets neither the CC nor the AC. My main point is that it is actually unlikely that a human can compute an infinite number of steps in a finite time. The argument is that the brain, where computations are made, is a finite entity both in space and time. This argument seems pertinent for showing that we are not able to follow an infinite computation step by step. But it is not sufficient to prove that we cannot have access to the result of an infinite computation, because it could be possible to have access to Diophantine function results without following each computational step. For example, Hava Siegelmann (1995) has proposed a mathematical model of the brain in the form of artificial neural nets which, according to her, could compute "beyond the Turing limit". Although it appears that Siegelmann's model may exceed the power of the TM, it has been strongly criticized by Martin Davis (2006) in his article entitled The Myth of Hypercomputation.

From the two arguments outlined above, I shall make the assumption that a human is not able to compute, or to have access to, the result of a non-Turing-computable function computed by an ATM. Therefore, this model meets neither the CC nor the AC. Nevertheless, could an ATM meet these constraints? In my opinion, it is necessary to distinguish two ways for a model to meet the AC:

• A model meets the AC in an internal sense if a human is able to have access to the computation result without a physical realization of the model.
• A model meets the AC in an external sense if a human is able to have access to the computation result with a physical realization of the model.

For example, a TM meets the AC in an internal sense because we can access results from its mathematical definition. On the hypercomputation side, however, we could have access to the computation result in an external sense through a physical realization of an ATM. This result, characterized by the link between the computing power of a model of hypercomputation and its physical realization, has important consequences for the notion of computation. It shows that some features of hypercomputation models do not depend on mathematics alone. Specifically, the possibility of accessing the result of a non-Turing-computable function computed by an ATM rests on physical constraints.

Acknowledgements

I would like to thank the editors and referees for very helpful comments during the preparation of this paper.

References

Copeland, J. (2002). Accelerating Turing Machines. Minds and Machines, 11, 281–301.
Davis, M. (2006). The Myth of Hypercomputation. In C. Teuscher (Ed.), Alan Turing: the Life and Legacy of a Great Thinker. Springer.
Siegelmann, H. (1995). Computation Beyond the Turing Limit. Science, 268, 545–548.
Turing, A. (1936). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42, 230–265.
Turing, A. (1939). Systems of Logic Based on Ordinals. Proceedings of the London Mathematical Society, 45, 161–228.

THE MATERIALISTIC FALLACY
Some Ontological Problems of Regulating Virtual Reality

FABIAN GEIER
Universität Bamberg
An der Universität 2
96047 Bamberg
Germany

Abstract.
This paper discusses a connection between the ontology of virtual objects and several problems of information ethics. I argue that there is a strong tendency, sometimes even among professionals in ICT, to treat virtual objects like material objects. There are many political regulations and economic practices which make sense for material objects but do not make sense for virtual ones. Such an ignoring of the nature of data processing, be it deliberate or not, I call a materialistic fallacy, and I consider it to hamper social progress and benefit.

1. The Fallacy

I speak of a materialistic fallacy whenever virtual objects are unnecessarily treated like material objects. The immediate effects of this fallacy are two: the practice in question either proves to be ineffective, because it is easily circumvented; or, where it can be enforced, it stalls progress and severely limits the benefit that ICT could provide.

2. The Ontology of Virtual Objects

By "virtual objects" I refer to any chunk of digitally stored data that is conceived as a distinct entity by human understanding. This will in most cases be identical with files. However, the human mind does not have to go along the lines of file descriptors, and especially outside professional IT it often does not. A mouse pointer, window or webpage might be made up of several distinct files, and neither is a part of a file itself a file, nor is the entire content of a hard drive. However, all of these are virtual objects as soon as we refer to them. And the decisive thing about virtual objects is that they can easily be made into a file and subjected to all the possibilities of data processing. By this definition of virtual objects I hope to circumvent most of the specific problems in the ontology of computing.

In material reality, form and matter cannot be separated from each other. One effect of this is that we are used to relatively stable individual objects that persist in time. Persistence is the precondition for movement: when a material object is moved to a new place, it is no longer in its former place. In the realm of information, however, the case is entirely different. In Aristotelian terms, data processing deals with pure 'forms'. Forms don't move. They are a-temporal and intangible (this largely corresponds to what Eden & Turner (2007) say about programs). Their distinctive characteristic is instantiation. Any number of instances of a form can exist, but none of them is prior to any other. If we send a network packet to two different computers, we cannot say which of the arriving packets is the original and which a mere copy. Such questions make sense in the material world, but they do not make sense in the virtual.

Technically, any chunk of data is at any point located in particular bits and bytes, and so is still an instantiation and not a pure form. However, since computers are all about re-instantiating the form of this instantiation, this fact is negligible; computers are all about making it negligible. This results in what Moor (1997) calls information being "greased". Of course there seems to be movement in virtual objects, e.g. in a cursor on a screen; otherwise computers would not be very useful. But we should keep in mind that such movement is always a simulation, created by a sequence of copying and erasing. And just because we sometimes cannot help using such simulations, there is no need to carry them to the utmost degree.
I suggest the opposite: we should do so only where it is necessary, and otherwise maximize the benefit of freeing information from the bonds of materiality.

3. Examples

3.1 DATA EXPIRY

A typical materialistic fallacy is the suggestion, put forward by Viktor Mayer-Schönberger (2008) and recently picked up by the German ministry for consumer protection, to have an inbuilt expiry date for data on the internet. The idea sounds nice: it would end the problem that whatever is put online once resides there forever. However, it will never work. More precisely: it could only work under the most extreme conditions of worldwide data control - an amount of control no current institution is anywhere close to being able to exert. Of course we can write a program that erases a file after 90 days, but it would have to be implemented either as a mandatory core module of all existing operating systems, or as an obligatory hardware solution similar to Trusted Computing. It does not lie in the nature of data to expire. An expiry module would only be a separate addition to the core functionality of computers, and thus both unwanted and easy to remove.

3.2 DIGITAL RIGHTS MANAGEMENT

DRM, or more specifically copy protection, is almost archetypical of the materialistic fallacy. When we try to charge customers on a per-copy basis, we are following the paradigm of material objects. Copy protection attempts to establish a uniqueness and sameness for the copy that do not lie in its nature. The protection must prevent a function that data processing generally offers: the re-instantiation of data. There are various consequences of this. First, the moral restraints against copying software, protected or not, are lower than in material theft, because copying does not result in anyone else losing data. Second, precisely because it is not in its nature, the seeming uniqueness of a copy is difficult to maintain, as it can only be provided by an additional module. I do not endorse pirating software. But I endorse acknowledging the basic structures of ICT because of which it is easier to pirate software than to protect it. And I endorse thinking about alternative ways of dealing with this.

3.3 E-VOTING

The ontological structure of ICT also matters in the discussion about e-voting. I am not referring to security issues here, but to the situation once security is breached. Then the full power of data processing lies in the hands of the intruder: whether you forge 10 votes or 10,000,000, it is just one line of code. The difference between local and global modification is not the same in virtual as in material reality. Virtual objects do not count one by one, but can be treated formally, at various levels of abstraction. Large-scale modifications in a database are in principle no more difficult than singular modifications. I do not say that this alone must decide the issue. All I say is that the nature of data processing has to be taken into account.
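The scale-invariance claimed in 3.3 can be sketched in a few lines (an editorial illustration; the table, column names and candidate labels are hypothetical):

    # Toy sketch (editorial illustration; schema is hypothetical): once write
    # access exists, a global modification is as short as a local one.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE votes (voter_id INTEGER, candidate TEXT)")
    db.executemany("INSERT INTO votes VALUES (?, ?)",
                   [(i, "A" if i % 2 else "B") for i in range(10)])

    # The "one line of code": the same statement rewrites 10 rows or
    # 10,000,000 -- the formal description ignores the count entirely.
    db.execute("UPDATE votes SET candidate = 'A'")

    print(db.execute("SELECT COUNT(*) FROM votes"
                     " WHERE candidate = 'A'").fetchone()[0])   # -> 10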
References

Aristotle (1989). Metaphysik, 2 volumes. Hamburg: Meiner Verlag.
Aristotle (1998). Kategorien. Hermeneutik. Hamburg: Meiner Verlag.
Eden, A. H. & Turner, R. (2007). Problems in the Ontology of Computer Programs. Applied Ontology, 2(1), 13–36.
Mayer-Schönberger, V. (2008). Nützliches Vergessen. In M. Reiter & M. Wittmann-Tiwald (Eds.), Goodbye Privacy - Grundrechte in der digitalen Welt. Wien: Linde Verlag.
Moor, J. (1997). Towards a Theory of Privacy in the Information Age. Computers and Society, 27(3), 27–32.

THE EFFECT OF COMPUTERS ON UNDERSTANDING TRUTH

STEVEN MEYER
Tachyon Design Automation Corp.
Minneapolis, MN

Abstract. The effect of computers and computation on the philosophical study of the epistemology of truth is discussed. The development of algorithmic truth as satisfiability is considered using modern quasi-empirical methods that follow the mathematician Paul Finsler's discovery that a formal conception of truth does not suffice. The P =? NP problem is considered and shown to be a philosophical problem using Finsler's method. Non-truth-value-assignment conceptions of truth, such as deflationism, and computer science as a method for studying physics are criticized.

1. Introduction

The mid 1960s marked the beginning of the influence of computers on the epistemology of various conceptions of truth. On the one hand, fast computers were becoming available; on the other, quasi-empirical characterizations of mathematics in the form of Lakatosian research programmes were becoming popular (Lakatos, 1967). A. J. Ayer attributes the quasi-empirical characterization of logical truth to J. S. Mill in the middle of the 19th century (Ayer, 1936, p. 291). In 1964, Paul Finsler published what he claimed was an airtight defense of his rejected 1926 idea that 'A formal "conception of truth" cannot suffice' (Finsler, 1996, p. 163). Computers were becoming fast enough that programs for proving mathematical theorems and for verifying truth were conceivable. These developments led naturally to questions concerning what can be computed, and whether there are any limitations of computability. Before the mid 1960s, at least in the area of mathematics, epistemology had become truth as the existence of mathematical objects generated from abstract set theory. The various incompleteness, inconsistency and set-theoretic paradox results were avoided by falling back on truth as axiomatic logic.

Computers allow a new and seemingly empirical epistemology of truth: namely, something is true if it can be computed in a reasonable amount of time. This immediately led to problems. One early example was alphabetization (sorting) using a giant table. One can sort a list in linear time by converting each key into a number and storing the key at the address corresponding to the encoding. It is not clear whether this is alphabetization or not, and it was not clear how to collect the result.
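The giant-table sort alluded to can be reconstructed roughly as follows (an editorial sketch; it assumes fixed-length keys so that the numeric encoding preserves alphabetical order):

    # Toy sketch (editorial reconstruction): "sorting" in linear time by
    # storing each key at the address given by its numeric encoding.
    keys = ["cab", "abc", "bca"]                  # fixed-length keys

    table = {}                                    # a sparse "giant table"
    for k in keys:                                # one linear pass
        table[int.from_bytes(k.encode(), "big")] = k

    # Collecting the result means walking the addresses in order -- exactly
    # the step whose status the author notes was unclear.
    print([table[a] for a in sorted(table)])      # -> ['abc', 'bca', 'cab']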
2. THE P =? NP PROBLEM AND TRUTH

In order to study "the basic nature of computation and not merely minor aspects of our models of computers" (Baker, 1975), the problem of the equivalence of the polynomial-time and non-deterministic polynomial-time classes was developed by Cook (1971) and Karp (1972). The problem basically asks whether the satisfiability definition of truth can be computed by a deterministic Turing machine (TM) as fast as it can be computed by a non-deterministic TM. The satisfiability conception of truth goes back to Alfred Tarski's work in the 1930s (Tarski, 1956), which defined a statement (conjunction) of basic propositions to be true if it is true under some possible assignment of truth values to the basic atomic propositions in the statement. This problem is not only the central problem of computer science but, according to Aaronson (2005, p. 2), "is correctly seen as the deepest problem in all mathematics".

Since its formulation in the late 1960s, the P =? NP problem has become at once a mathematical problem, a scientific problem (because it involves time) and a philosophical problem. The "canonical", possibly easiest, problem in the class NP is the logical truth satisfiability problem. Following Karp, other problems in the class NP (solvable in a polynomially bounded number of steps on a non-deterministic TM) are solved by mapping them to the satisfiability problem in polynomial time (Karp, 1972). The satisfiability problem and its characterization of what can be computed are closely related to the very essence of truth because, as the 18th-century philosopher David Hume observed, "no general proposition whose validity is subject to the test of actual experience can ever be logically certain. ... [something] substantiated in n-1 cases affords no logical guarantee that it will be substantiated in the nth case also" (Ayer, 1936, p. 289). This paper considers the epistemology of computation in the quasi-empirical sense by investigating "what is true, and not what is hypothetically taken to be true (for instance axioms)" (Finsler, 1996, p. 162).

3. Problems Solved by Computational Epistemology

Two obvious problems solved by computing are the disproof of the deflationist definition of truth and the disproof of the form of intuitionism that disavows the law of the excluded middle. The deflationist theory of truth (Stanford Encyclopedia, 2010) argues that "to assert a statement is true is just to assert the statement itself". A computational epistemology of truth as a satisfiable assignment to all atomic elements is obviously more than merely "asserting a statement". There are a number of forms of intuitionism. One form rejects the law of the excluded middle: it is claimed that there are formulas that are neither true nor false (probably because they cannot be constructed in an intuitively obvious way). Again, existence for finite formulas (and possibly potentially infinite unbounded formulas) can be tested by finding some assignment of true and false to atomic clauses that makes the formula evaluate to true. If no such assignment exists, the formula is false (Finsler, 1996, pp. 167-168). There is no question of intuitively acceptable methods here.
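This satisfiability test of truth can be pinned down in a few lines (an editorial sketch): a finite formula counts as true exactly when some assignment of truth values to its atoms makes it evaluate to true.

    # Toy sketch (editorial illustration): truth as satisfiability, by
    # exhaustive search over assignments to the atomic propositions.
    from itertools import product

    def satisfiable(formula, atoms):
        return any(formula(dict(zip(atoms, values)))
                   for values in product([True, False], repeat=len(atoms)))

    # (p or q) and (not p or q): true, e.g. under q = True.
    f = lambda v: (v["p"] or v["q"]) and ((not v["p"]) or v["q"])
    print(satisfiable(f, ["p", "q"]))   # True

    # p and not p: no assignment works, so the formula is false.
    g = lambda v: v["p"] and not v["p"]
    print(satisfiable(g, ["p"]))        # False

The search is exponential in the number of atoms; whether it can always be replaced by a polynomial-time procedure is precisely the P =? NP question discussed above.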
4. Problems Unsolvable by Computational Epistemology

Although satisfiability computable in a reasonable amount of time solves some epistemological problems, it cannot deal with problems involving actual infinity. From Finsler (1996, p. 164): One cannot form the set of all ordinal numbers, since its definition contains an inherent contradiction [Russell's paradox]. If it were not an ordinal number, then it would still contain exactly all preceding ordinal numbers, and therefore it would have to contain itself as an element, which is impossible.

5. Internal Problems of Computational Epistemology - Oracle Use

One of the first attempts to solve the P =? NP problem tried to use an infinite counting argument from meta-mathematics (Baker, 1975). The method goes back to Cantor's diagonalization using the lack of a one-to-one mapping between the real and rational numbers. The modern meta-mathematical, model-theoretic analog of diagonalization is relativization using oracles. The idea is to allow TMs to make unit-time calls to an oracle. The hope was that for all oracles the class of languages recognized by P with an oracle would be strictly contained in NP with the same oracle. The result was that P equals NP relative to some oracles but not to others. The conclusion of Baker et al. was that by "slightly altering the machine model, we can obtain differing answers" (p. 431). Since then, much of computational complexity theory has been dedicated to relativizations, because a proper containment that held under all relativizations would immediately show P != NP. Researchers who think there may be epistemological difficulties with the P =? NP problem have criticized relativization, but mostly without success (Hartmanis, 1976 & Hartmanis, 1992).

Relativization pertains to computational epistemology because it removes problem-specific structure from computable truth. Hartmanis (1976) shows that for models of computation that allow more efficient storage access, such as the MRAM model with unit cost for multiplication, P = NP (pp. 33-46). This may show that there is some conceptual problem with the Church-Turing thesis (the definition of TMs), or even that the class NP does not really exist (it is an illusion in the Finslerian sense), because abstracting away the structural connection between satisfiability and the other problems that need non-deterministic computation for efficiency is incorrect.

6. Physicalization of Computational Epistemology

Computational epistemology has taken a recent turn toward arguing that studying the P =? NP problem "can yield new insights, not just about computer science but about physics as well" (Aaronson, 2005, p. 1). Deolalikar (2010) recently published a proof that P != NP, except that unfortunately it needed axioms from empirical theories of statistical physics. In conclusion, I see this change in direction negatively, because it attempts to convert a question from physics on the existence of quantum computers (QCs) (pp. 5-8) into a formal and axiomatized computational epistemology that does not allow quasi-empirical experimentation. The argument comes full circle because the mathematicians who contributed to the development of modern physics (including Finsler, whose main area was the differential geometry of general relativity, p. vii) were skeptical of exactly the physics that QCs embody and require. In his post-WWII standard graduate-level quantum mechanics textbook, Leonard Schiff argues that "QM's range of applicability is limited to approximating the behavior of the atom" (Schiff, 1949, p. 267). Also, Paul Feyerabend's analysis of the theories of Niels Bohr and David Bohm (Feyerabend, 1981) shows that the very properties assumed by QC builders do not exist. Bohr states (Feyerabend's italics): "At the same time we must deny the universal validity of the superposition principle and must admit that it is but a (very useful) instrument of prediction." (p. 258). And Feyerabend (to whom David Bohm taught QM) describes Bohm's view of the uncertainty principles as follows: "However in order to show the basic and irrefutable character of the uncertainty principle these features themselves would have to be demonstrated as basic and irrefutable." (p. 223).

References

Aaronson, S. (2005). NP-complete problems and physical reality. SIGACT News, 36. (Also www.scottaaronson.com/papers/npcomplete.pdf).
Ayer, A. (1936). Language, Truth and Logic; excerpted in P. Benacerraf & H. Putnam (1964) (Eds.), Philosophy of Mathematics: Selected Readings (1st ed., pp. 289–301).
Baker, T., Gill, J. & Solovay, R. (1975). Relativizations of the P =? NP question. SIAM Journal on Computing, 4(4), 431–442.
Cook, S. (1971). The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing (pp. 151–158).
Deolalikar, V. (2010). P != NP. HP Research Labs, Palo Alto, August 6, 2010, unpublished.
Feyerabend, P. (1981). Philosophical Papers, Vol. 1: Realism, Rationalism and Scientific Method. Cambridge.
Finsler, P. (1996). In D. Booth & R. Ziegler (Eds.), Finsler Set Theory: Platonism and Circularity. Birkhäuser.
Hartmanis, J. & Simon, J. (1976). On the structure of feasible computations. In M. Rubinoff & M. Yovits (Eds.), Advances in Computers 14 (pp. 1–43). Academic Press.
Hartmanis, J. et al. (1992). Relativization: a revisionistic retrospective. Bulletin of the EATCS, 47.
Lakatos, I. (1976). Proofs and Refutations. Cambridge.
Lakatos, I. (1978). Philosophical Papers, Vol. 2: Mathematics, Science and Epistemology (J. Worrall & G. Currie, Eds.). Cambridge, 24–41 (expanded version from Proceedings of the Fourth International Congress for Logic, I. Lakatos (Ed.), 1967, North Holland).
Schiff, L. (1949). Quantum Mechanics (1st ed.). New York: McGraw-Hill.
Stanford Encyclopedia of Philosophy (2010). The Deflationary Theory of Truth. (URL of Feb. 2011: plato.stanford.edu/entries/truth-deflationary).
Tarski, A. (1956). The Concept of Truth in Formalized Languages. In Logic, Semantics, Metamathematics (pp. 152–278). Clarendon Press.

PHILOSOPHY OF THE WEB AS ARTIFACTUALIZATION

ALEXANDRE MONNIN
Université Paris 1 Panthéon-Sorbonne (PHICO, EXeCO), Institut de Recherche et d'Innovation, Conservatoire National des Arts et Métiers (DICEN)
12, place du Panthéon
75231 Paris cedex 05, FRANCE

AND

HARRY HALPIN
World Wide Web Consortium
MIT/CSAIL
32 Vassar St., Bldg. 32-G514
Cambridge, MA 02139, USA

Abstract. What is the philosophical foundation of the World Wide Web? T. Berners-Lee, widely acclaimed as the inventor of the Web, has developed informal reflections on the central role of URIs (Uniform Resource Identifiers, previously Uniform Resource Locators) as a universal naming system - naming being a central topic in philosophy since at least the pioneering works of R. Barcan Marcus. URIs (such as http://www.example.org/) identify anything on the Web, so the Web can be considered the space of all URIs. In a debate between Berners-Lee and P. Hayes over URIs and their capacity to uniquely 'identify' resources, Berners-Lee held that engineers decide how protocols should work and that these decisions should determine the constraints of reference and identity, while Hayes held that names have their possible referents determined only as traditionally understood by logical semantics, which Hayes held engineers could not change but only had to obey. This duality can be interpreted as an opposition between a material a priori and a formal a priori. The material a priori of technical systems like the Web is brought about by what we call 'artifactualization', a process whereby concepts become 'embodied' in materiality - with lasting consequences.

1000-word abstract

What is the philosophical foundation of the WWW? Is it an open and distributed hypermedia system? A universal information space? How does it differ from the Internet? While the "ecology" of the Web has known many a revolution, its underlying architecture, by contrast, remains fairly stable. URIs, the HTTP protocol, resources, and languages like HTML and RDF constitute the building blocks of the Web.
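The separation between identifying a resource and accessing it, central to what follows, can be sketched with the standard library (an editorial illustration, not the authors'):

    # Toy sketch (editorial illustration): a URI is a structured *name*;
    # parsing it involves no network traffic at all.
    from urllib.parse import urlparse

    parts = urlparse("http://www.example.org/people#alice")
    print(parts.scheme, parts.netloc, parts.path, parts.fragment)
    # -> http www.example.org /people alice

    # Access is a further, separate operation: only an explicit dereference,
    # e.g. urllib.request.urlopen("http://www.example.org/"), invokes the
    # Internet's transmission machinery discussed below.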
As the particular kind of computing embodied by the Web has displaced traditional desktop applications, the foundations of Web architecture and its relationship to wider computing need to be clarified in order to determine its roots, its boundaries, the reasons for its success, and its future developments. This is especially urgent now that debate is opening over platforms and cloud computing and over how they relate to the Web.

Tim Berners-Lee, widely acclaimed as the inventor of the Web, has developed in his design notes informal reflections on the central role of URIs (Uniform Resource Identifiers - previously Locators) as a universal naming system, a central topic in philosophy since at least the pioneering works of Barcan Marcus. URIs (such as http://www.example.org/) identify anything on the Web, so the Web can be considered the space of all URIs. The concrete access mechanism by which information is transmitted via a URI is then determined by the Internet; hence the Web could be built on another architecture (such as the "Future Internet"), and likewise the Internet can also host applications other than the Web, such as peer-to-peer file-sharing.

Possible entities denoted by URIs are called resources. While higher-order ontological debates have continuously tried to provide distinctions between endurants and perdurants (categories that mainly apply to substances), the characterization of resources has relied on vastly different ontological principles, descending from engineering concerns rather than from claims of ontological correctness. Drawing on the work of Vuillemin, we draw a parallel between the Web and philosophical systems. Like such systems, the Web is concerned with traditional issues pertaining to the philosophy of language (URIs as proper names), to ontology (the link between engineering design choices in Semantic Web ontologies and philosophical ones), and to metaphysics (entities of the Web as resources). Unlike philosophical systems, which reflect on the constraints of the world, the Web is a world-wide embodied technical artifact and therefore creates a whole new set of constraints. We suggest that these should be understood as a material a priori - in the Husserlian sense - grounded in history and technology.

In a striking debate between Berners-Lee and Patrick Hayes over URIs and their capacity to uniquely 'identify' resources, Berners-Lee held that engineers decide how the protocol should work and that these decisions should determine the constraints of reference and identity. Hayes replied that names have their possible referents determined only as traditionally understood by formal semantics, which he held engineers could not change but only had to obey. This duality can be interpreted as an opposition between a material and a formal a priori. Interestingly enough, Hayes has recently been focusing on adopting principles from the Web into logical semantics itself.

The material a priori of technical systems like the Web is brought about by what we call "artifactualization", a process whereby concepts become "embodied" in materiality - with lasting consequences. While such a process clearly predates the Web, we can now see within a single human lifetime the increasing speed at which it takes place, and through it technical categories (and philosophical ones) are becoming increasingly dominant over "natural" and "logical" categories.
At the same time, the process of having philosophical ideas take a concrete form via technology often lends them radically new characteristics, transforming these very concepts in the process. Heidegger posited a filiation between technology and metaphysics, with technology realizing the Western metaphysical project (by inscribing its categories directly into concrete matter, we should add). Yet, if technology is grounded in metaphysics, it is not the result of a metaphysical movement or "destiny" (Schicksal) but of a more mundane, contingent historical process, full of surprises and novelties. For all these reasons, it must be acknowledged that the genealogy of the Web, as a digital information system, differs from traditional computation with regard both to the concepts at stake and to our relation to them (the scientific ethos being replaced by an engineering one - something Berners-Lee dubbed "philosophical engineering"). On the Web, the activity of standardization through bodies like the W3C arguably consists in making sense of technological evolution post hoc. Nevertheless, regarding the architecture of the Web, one may argue that its standards were both the result of a process of conscious decision-making in specifying how protocols should work and the result of a constant adjustment to the reality of the technical system. The Web can therefore be seen as an artifact both in the sense of being a designed human invention and in the sense of being a nonhuman (Latour) whose study may lead to numerous unintended discoveries beyond its initial design.

For all these reasons, the very practice of philosophy is transformed by having to take this material a priori and its technical categories as seriously as "natural" or "analytic" categories from biology or natural language. Philosophers then have to deal with technical categories that may have a lasting effect in spheres like the Web, not just as variants of categories that can be analytically understood, but rather as concrete artifacts which can even transform the previously considered analytic categories (ironically, the main challenge to analytic judgments is no longer what Quine called "naturalization" but rather the ongoing artifactualization). While at first glance URIs can be considered just another kind of name, and so inherit the characteristics of and debates in philosophy over the referential status of proper names, the Web makes a difference, as URIs are primarily used to physically access information such as webpages - an aspect of naming for the most part foreign to the philosophy of language. R. Sennett's craftsman's motto might be "doing is thinking"; once concepts have been artifactualized (and, as a consequence, externalized), thinking is also doing or conceiving - in the end, a matter of design.

ONTOLOGICAL COMMITMENTS OF COMPUTER SCIENCE

MIGUEL PAGANO
FaMAF – Univ. Nacional de Córdoba
Medina Allende s/n, X5000HUA
Córdoba, Argentina.

Abstract. We suggest that a fictionalist attitude with respect to Quine's proposal of ontological commitments is best suited for building up an ontology for computer science. In particular, we argue in favour of using theories of programming languages to identify the relevant ontological categories.

1. Introduction

In this extended abstract we propose a novel reading of Quine's ontological commitments [Quine, 1980] to analyse the ontology of computer science.
We argue that a fictionalist posture (see [Szabó, 2010]) can save the genuine concepts of computer science from vanishing as mere mathematical constructions. Although we only discuss aspects related to programming languages and programs, we think that this can lead to a fruitful research programme if extended to other areas of computer science.

2. Programming Languages: Ontology from Semantics

Before coming to our proposal, let us briefly and critically review two papers by A. Eden and R. Turner which deal with the ontology of computer science. In the first paper [Eden and Turner, 2007a] they study the ontological commitments of programming languages. They propose that semantics determines to which entities a particular programming language is committed, and they apply this methodology to a simple imperative language with two kinds of semantics (based on set theory and type theory, respectively). We agree on the use of semantics to determine some of the commitments of computer science; however, it is not clear to us that programming languages themselves have ontological commitments. Instead, these should be attributed to theories of programming languages (TPL). The fictionalist attitude enters here: the fact that a TPL uses a certain mathematical foundation, say set theory, does not imply that its commitments are those carried by the foundational theory. Instead, concepts like abstract syntax, reference, state, or the ordered structure given by the outcome of a certain computation are our candidates for the ontological commitments, i.e. the entities which should be used to reason about programming languages and program-scripts. Rather than appealing to the language in which the genuine concepts are modelled, we propose to justify the commitments in terms of their epistemological value.

In the second paper [Eden and Turner, 2007b], Eden and Turner put semantics aside as the source of the commitments carried by a PL; in this article the underlying programming paradigm determines the true entities to which a programming language is committed. It can be posited that some of the aforementioned examples could be taken to be specific to one or another paradigm; but it is not obvious to us that programming paradigms are good candidates in the search for commitments. Consider, for example, what kinds of reasoning can be done by knowing only the paradigm of a PL, without any deeper theory of the language: it would be surprising if one could decide whether two program-scripts compute the same thing. Stranger to us is the attempt to attach commitments to programming languages or program-scripts themselves: a PL is no more than the description of a set of valid programs (the so-called program-scripts) together with a notion of execution – the former usually given by a more or less abstract grammar, and the latter presented by more or less formal means, ranging from a fully formalised semantics to a merely bogus and ambiguous compiler.

We have already mentioned some ontological commitments with an epistemological basis; now we use syntax to show that TPL are the right place to look for the genuine building blocks of (part of) the ontology of computer science. At first sight the only interesting category arising from considering syntax is that of program-scripts (cf. [Eden and Turner, 2007b]), but program-scripts alone are not descriptive enough to grasp the importance of the different parts of a program-script; a toy illustration is sketched below.
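A toy illustration (ours, in Python) of the syntactic distinction developed next: the same identifier occurs once as a binding formal parameter and once free.

    # Toy sketch (editorial illustration): one name, two syntactic roles.
    x = 10                  # this occurrence of 'x' is free at module level

    def double(x):          # binding occurrence: the formal parameter
        return x + x        # these occurrences are bound by the parameter

    print(double(3), x)     # -> 6 10: the two roles never interfere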
For example, two occurrences of the same variable can play different rôles: one occurrence can be a formal parameter in a procedure or function, and the other an occurrence in a program calling that procedure. From a purely syntactical point of view there should be a distinction between those two occurrences: the formal parameter is a binding occurrence, while the other occurs free. On the other hand, one could also be tempted to pay too much attention to syntax and introduce superfluous concepts, e.g. differentiating between parsed and un-parsed program-scripts, or putting a too restrictive condition on what counts as a program-script. Since the best account of the interesting syntactical phenomena is given by abstract syntax, we should expect to get from its development [McCarthy, 1962; Fiore et al., 1999] the ontological categories corresponding to the syntactical aspects of PL.

3. Conclusion

Let us conclude by commenting on how to use semantics (maybe the best-known area of TPL) for studying the ontology of computer science. We acknowledge that asking for a definite semantics in order to establish a new ontological category can delay the acceptance of new concepts brought by new languages that lack a proper definition and are defined in terms of a compiler or interpreter. While not considering the ontology an immutable edifice, we should refrain from adding new concepts as fast as each new paradigm or PL is announced; instead, we think a more parsimonious attitude should be observed, waiting until a good semantic explanation is given for the newly introduced artefacts. We do not advocate that one kind of semantics should be preferred over others based on the status given by some foundational philosophy of mathematics to its underlying theory; Turner [Turner, 2009] seems to accept that any semantics should be accepted as a mathematical entity by a realistic mathematician. It is clear to us that the various proposed semantics could explain diverse aspects of the same language and account for several ontological categories. (For example, Plotkin's operational semantics leads to a better understanding of the implementation of programming languages [Plotkin, 2004], while Scott's denotational semantics [Scott, 1970] is used to reason about the equivalence of programs without resorting to a particular implementation.)

From the fictionalist posture we adopt, it is futile to try to explain in what sense the categories of an ontology built up by following TPL are metaphysically more relevant than those arising from other proposals, say Eden and Turner's papers. Our proposal would correspond to what Smith [Smith, 2003] calls an "internal metaphysics", and its merits reside in how well it accounts for the phenomena studied in computer science.

Acknowledgements

I am grateful to Martin Diller, Pío Garcia, and Renato Cherini for encouraging me to write this abstract. My work is funded by CONICET, Argentina.

References

Eden, A. H. & Turner, R. (2007a). Towards a programming language ontology. In Computation, Information, Cognition: The Nexus and the Liminal (pp. 147–159). Cambridge Scholars Publishing.
Eden, A. H. & Turner, R. (2007b). Problems in the ontology of computer programs. Applied Ontology, 2(1), 13–36.
Fiore, M., Plotkin, G. & Turi, D. (1999). Abstract syntax and variable binding. In Proceedings of the 14th Annual IEEE Symposium on Logic in Computer Science, LICS '99 (pp. 193–202). Washington, DC: IEEE Computer Society.
McCarthy, J. (1962). Towards a mathematical science of computation. In IFIP Congress (pp. 21–28).
Plotkin, G. D. (2004). The origins of structural operational semantics. Journal of Logic and Algebraic Programming, 60–61, 3–15.
Quine, W. V. O. (1980). On What There Is. In From a Logical Point of View (pp. 1–19). Harvard University Press.
On What There Is. In: From a Logical Point of View (pp. 1–19). Harvard University Press.
Scott, D. S. (1970). Outline of a Mathematical Theory of Computation. Technical Report PRG-2, Oxford, England.
Smith, B. (2003). Ontology. In: The Blackwell Guide to the Philosophy of Computing and Information. Wiley-Blackwell.
Szabó, Z. G. (2010). The Ontological Attitude. In: The Analytical Way. Proceedings of the 6th European Congress of Analytic Philosophy. London: College Publications. Available at http://pantheon.yale.edu/~zs47/documents/Theontologicalattitude.pdf
Turner, R. (2009). The Meaning of Programming Languages. American Philosophical Association Newsletter on Philosophy and Computers, Fall 2009 (pp. 2–7).

Semantics of Programming Languages

UWE V. RISS
SAP Research Karlsruhe
Vincenz-Priessnitz-Str. 1, 76131 Karlsruhe, Germany

Abstract. The grounding of the semantics of programming languages is investigated. It is argued that the meaning of programming languages results from the operations that they abstract and from the interpretation of these operations in terms of human activities as the final point of reference. This view opposes the traditional interpretation of the semantics of programming languages, which refers to higher-order abstraction as its basis, whereas the current view sees these semantics as rooted in the actual performance realized by concrete implementations, taking a pragmatic stance.

1. Introduction

The central aim is to investigate the role of computers and the grounding of the semantics of programming languages. Traditional approaches to the semantics of programming languages, such as operational or denotational semantics (Turner, 2007), aim at abstracting from the differences between individual implementations to find the common meaning behind them. Operational semantics does this by referring to abstract machines, while denotational semantics refers to mathematical structures. In the following it is argued that semantics cannot be understood in such terms of higher-order abstraction but, on the contrary, must be rooted in concrete operations. We can understand the mentioned approaches as objectifications of the perceived equivalence of the respective operations. However, the point of reference for semantics cannot be this objectification but the underlying concrete operations and their perceived equivalence (Saab and Riss, 2010), in analogy to the natural sciences, the basis of which are experiments and not scientific laws.

2. Activity Theory

For this purpose we primarily regard computers as tools in human activity. The framework for this consideration is Activity Theory (Engeström, 1987), which describes the relation between persons (subjects), the objects of their activities, and the context of these activities in the schematic triangle depicted in Figure 1.

Figure 1. Activity Triangle.

The core triangle of subject (human agent), community, and object has been extended towards tools, communication (social mediation), and division of labour. All human activity is directed towards an object and aims at a desired output.
The social context includes language and communication, which mediate the interaction between subject and community. Hereby communication appears as a means for activity coordination and knowledge transfer within a community and thus enables division of labour. Understanding computers merely as tools in this system, however, is not sufficient, since this neglects several specific aspects such as the separation of hardware and software. The term programming language already indicates that the concept of software is related to communication, while hardware represents a traditional tool concept. Thus, programming languages serve as a means of communication between the subject and the hardware representing the proper tool. This interpretation can be further supported by the objective of artificial intelligence research to introduce intelligent agents that are equivalent to human agents regarding their intellectual capacity. Even if this goal is not reached, computers move down in the diagram from the top position (tool) towards a middle position where more complicated coordination and communication is required.

3. Fundamental Understanding of Semantics

To understand the semantics of programming languages we have to go back to natural languages. These are generally used as means to coordinate the activities among collaborating human agents and to transfer knowledge; programming languages are used to organise the division of labour between the human agent and the computer and to instruct the computer what to do, both at a rather elementary level. If we look at two key features of natural language, abstraction and symbolization, we also find them in programming languages. Every line of code in an ordinary computer program symbolises an abstraction of simple operations that both humans and machines can (usually) execute with equivalent results. Thus, abstraction is the key to the transferability of operations from one person to another or from a person to a computer. However, abstraction must not be regarded as absolute but as a process of identification. Symbolization as the manifestation of such identity serves as the basis of the machine's automatic processing of programs. On both sides, human agents and computers, it is the capacity to reliably interpret symbolic expressions which ensures a repeatable execution of operations and the use of the computer as a tool. The basis for communication via symbolized abstraction and coordination of operations is shared meaning. Here the meaning of messages includes two aspects: the interpretation of messages and the expectation that others understand them in a similar way (Saab and Riss, 2011). In the case of computers it is sufficient that this expectation is one-sided, that is, from the human agent towards the machine; the computer is not supposed to have expectations. Regarding the concept of meaning we refer to a pragmatist view that understands the meaning of a message as what an agent can do with this message (Stegmaier, 2008). For the subject, the meaning of program code is determined by the subject's knowledge of how to execute the included operations, while the hardware determines the 'meaning' for the computer, that is, the computer is able to execute the program. Naturally, semantics is not equated with execution – a single malfunction does not spoil the meaning of a computer program – but with execution as a repeated process of significant reliability. In the case of computers we even find a more reliable execution than what we can expect of human agents.
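As a toy illustration of this point (the example and names are ours, not Riss's), one line of code can stand for an operation that a human could also carry out by hand, with (usually) equivalent results, and whose repeatable machine execution is what grounds its 'meaning' for the computer:

```python
# Symbolized abstraction: the line inside total() abstracts the
# concrete operation "add up these prices", which a human clerk
# could equally perform by hand with (usually) the same result.

def total(prices):
    return sum(prices)   # one symbolic line standing for many
                         # elementary addition operations

# Repeatable, reliable execution grounds the machine-side 'meaning':
assert total([2, 3, 5]) == 10   # a human can check this by hand
print(total([2, 3, 5]))
```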
4. Abstract Semantics

If the meaning of programming languages is not constituted by higher levels of abstraction but by concrete operations, we have to clarify the role of abstract formal approaches as they appear in operational or denotational semantics (Turner, 2007). In the same way as mathematical models abstract human activities, these formal semantic models abstract operations and serve as a means to support program development and testing. Formal definitions are only meaningful inasmuch as they refer to established human practice. Indeed, engineers had constructed computers before researchers applied formal semantics to programs, so formal semantics cannot be seen as the actual foundation of computer languages. Formal semantics can only support the development process, not constitute it. The presented approach shows some links to Rapaport's idea of implementation as semantic interpretation (Rapaport, 2005). It also resembles the idea of information as sense-making of data (Saab and Riss, 2011), where programs are understood as data whose meaning results from an interpretation process that is determined by the projected operations, that is, by what the computer can do with a program.

References

Engeström, Y. (1987). Learning by expanding: An activity-theoretical approach to developmental research. Helsinki: Orienta-Konsultit Oy.
Rapaport, W. J. (2005). Implementation is semantic interpretation: Further thoughts. Journal of Experimental & Theoretical Artificial Intelligence, 17(4), 385–417.
Saab, D. J. & Riss, U. V. (2010). Logic and abstraction as capabilities of the mind. In: J. Vallverdù (Ed), Thinking Machines and the Philosophy of Computer Science: Concepts and Principles (pp. 132–148). Hershey, PA: Information Science Reference.
Saab, D. J. & Riss, U. V. (2011). Information as Ontologization. Journal of the American Society for Information Science and Technology. (accepted for publication)
Stegmaier, W. (2008). Philosophie der Orientierung. Berlin, New York: Walter de Gruyter.
Turner, R. (2007). Understanding programming languages. Minds and Machines, 17(2), 203–216.

QUINEAN HOLISM AND THE INDETERMINACY OF COMPILATION

NATHAN SINCLAIR
Macquarie University
Nathan.Sinclair@mq.edu.au

1. Motivation

No other philosophical doctrine with even the remotest skerrick of plausibility would, if vindicated, so radically overthrow our current understanding of language, psychology and rationality as Quinean semantic holism. If individual words and sentences do not have meanings, then we cannot explain communication as the transmission of ideas or judgments, nor appeal to sentence meanings as objects of putative propositional attitudes, nor explain reasoning in terms of the discernment of relationships between the meanings of premises and conclusions. The very fact that sentence meanings are so fundamental to our current accounts of semantics, cognitive psychology, and reasoning has meant that objections to Quinean holism which, if deployed against less radical claims, would be lightly dismissed have been taken very seriously indeed. Most such objections appeal, broadly, to two hopes or assumptions.
On the one hand, it is claimed that the range of evidence proponents of Quinean holism have considered relevant to meaning and translation is too narrow, and it is hoped that somewhere beyond that range, perhaps in normative social practices or introspection, there is evidence to justify the attribution of determinate meanings to our words and sentences. On the other hand, it is claimed that arguments for the indeterminacy of translation must be reductiones ad absurdum because at best they show that the range of evidence considered is "unable to account for distinctions concerning the feature, meaning, which we know independently to exist" (Searle, 1987). While objections based on wishful thinking and "just knowing" would be dismissed if used to defend less well entrenched prejudices, once given any weight they have the dubious merit of stymieing further theoretical argument. No argument based upon lack of evidence is strong enough to preclude the hope of finding further evidence for such a dearly and deeply held assumption.

To advance the dispute we need examples of alternative incompatible translations between theories expressed in clearly holistic languages. Ideally, such examples of alternative translations between holistic languages would be pre-existing translations routinely employed for practical purposes, rather than philosophical inventions. Ideally also, the languages involved would be rigorously specified, with formal compositional grammars precisely delineating their well-formed formulae, and the theories would express their empirical contents so clearly and unambiguously that those contents could be mechanically determined. Even better if the theories being translated included both small and easily understood theories (so that we might easily see the scope and consequences of the indeterminacy of translation) and theories as large and complex as our grandest scientific theories (so we could see that the indeterminacy is not an artifact of theoretical simplicity). Better yet if each such theory could be taken as complete and self-standing, in order to ensure that the indeterminacy of translation is not the result of taking statements out of context. Astoundingly, all these desiderata are fulfilled by programming languages, compilers, and computer programs: languages, forms of translation, and theories so common that few of us in the developed world are ever more than arm's length from tools that rely upon them for their operation.

2. Outline

In part one of this presentation I argue that computer programs are (readily converted into) empirical theories. Programs' empirical contents are the patterns of input and output produced by processes executing them. The under-determination of programs by their input-output is so well known and unthreatening that in many universities a high degree of similarity of program structure, even between simple programs required to produce the same output, is grounds for suspicion of plagiarism. Furthermore, programs are obviously holistic in the sense that (most) statements in computer programs do not produce any output, nor is any fragment of the output of such programs directly attributable to them.
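A hedged illustration of both points (the programs are our invention, not Sinclair's): the two functions below have identical input-output behaviour but unrelated internal structure, and no single statement of the first can be credited with any particular fragment of the output.

```python
# Two structurally dissimilar programs with the same empirical content,
# i.e. the same input-output patterns (both map n to 1 + 2 + ... + n).

def sum_loop(n):
    total = 0
    for i in range(1, n + 1):  # no one statement here "produces"
        total += i             # any fragment of the result on its own
    return total

def sum_formula(n):
    return n * (n + 1) // 2    # closed form, entirely different structure

# Observationally equivalent on every tested input:
assert all(sum_loop(n) == sum_formula(n) for n in range(1000))
```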
This insight allows us to make sense of the Quinean doctrine that individual sentences simply do not have meanings, and to see that the inferential/conceptual role semantics many critics (most notably Fodor and Lepore) attribute to Quine, according to which the meanings of individual sentences are determined by the theories of which they are a part, is a grotesque misinterpretation of Quinean holism.

In part two I show that compilation (and decompilation) is a form of translation by the standards Quine advocated, and then argue briefly that those standards are adequate and that compilation is translation simpliciter. I then show that the indeterminacy of compilation is well known and unthreatening to computer scientists. The only guarantee given by ISO-standard-compliant compilers is the preservation of input-output behaviour, and computer scientists know that independently written compilers are unlikely to produce the same machine (or high-level) code given the same source code, and are unsurprised when decompilers cannot accurately reconstruct original source code. Furthermore, computer programs obviously exemplify the principles of (near) universal revisability and maintainability that philosophers have found so troubling and implausible, and yet, as the practice of debugging shows, there can be good reason to revise some sentences and not others in the face of recalcitrant experience.

In part three I consider recent developments in the semantics of programming languages, ask whether the indeterminacy of compilation is sufficient to undermine the existence of an analytic-synthetic distinction in programming languages, and argue that the translation of natural languages is less tightly determined than the translation of programming languages. The position I advocate in this presentation is compatible with both normative and dispositional accounts of semantics. Whether the ISO standard for the C programming language is regarded as specifying dispositions possessed by C programs and compilers, or the norms to which programs are subject once they are held to be C compilers, the compilation of C programs is (properly) indeterminate and C programs are (properly) under-determined by the input-output they are intended to produce.

In order of increasing ambitiousness, I hope people who attend this presentation will discover that Quinean holism is not a form of inferential/conceptual role semantics, that computer programming languages are holistic and exemplify the controversial features of Quinean holism, that compilation exemplifies indeterminate translation, and why it is plausible that the translation of natural languages is even less determinate than compilation.

References

ISO/IEC WG14 N1256: Programming Languages – C, 2007-09-07, International Organization for Standardization, Geneva, Switzerland. http://www.openstd.org/jtc1/sc22/wg14/www/standards
Allison, L. (1986). A Practical Introduction to Denotational Semantics. Cambridge University Press.
Fodor, J. & Lepore, E. (1992). Holism: A Shopper's Guide. Blackwell Publishers.
Fodor, J. (2004). Having Concepts: a Brief Refutation of the Twentieth Century. Mind and Language, 19(1), 29–47.
McDermott, M. (2009). A Science of Intention. The Philosophical Quarterly, 59(235), 252–273.
Morrison, J. (2008). Just how controversial is evidential holism? Synthese, 173(3), 335–352.
Okasha, S. (2000). Holism about meaning and about evidence: in defence of W. V. Quine. Erkenntnis, 39–61.
Quine, W. (1961).
Two dogmas of empiricism. In: From a Logical Point of View (pp. 20–46), 2nd ed. Harvard University Press.
Quine, W. (1964). Word and Object. MIT Press.
Quine, W. (1977). Ontological Relativity and Other Essays. Columbia University Press.
Searle, J. (1987). Indeterminacy, empiricism, and the first person. The Journal of Philosophy, 84(3), 123–146.
Winskel, G. (1993). The Formal Semantics of Programming Languages. MIT Press.

IS FINDING A 'BLACK SWAN' (POPPER, 1963) POSSIBLE IN SOFTWARE DEVELOPMENT?

LINDSAY SMITH
University of Hertfordshire, UK
l.1.smith@herts.ac.uk

AND

PAUL WERNICK
University of Hertfordshire, UK
p.d.wernick@herts.ac.uk

AND

VITO VENEZIANO
University of Hertfordshire, UK
v.veneziano@herts.ac.uk

Introduction

Users' experience of software-based technology that fails to meet their expectations is so widespread as to be a 'commonplace' occurrence (Smith, 2009). However, a satisfactory response from software engineering (SE) remains as elusive as ever. In this paper we investigate the context of SE as a negotiation between the contradiction(s) of human subjective experience of software-based technology and that technology's reliance on an architecture inclusive of objectivity – for example, machine programming languages that can be mathematically proven 'Turing complete' (cf. the Church-Turing Thesis; Eden, 2007). Consideration of the technological context of SE demands a philosophical re-evaluation of the ontological and epistemological status of SE within Computer Science (CS). We have undertaken a cross-disciplinary investigation to reposition unresolved problems in SE, one which potentially also opens up philosophical debate – for example, if we introduce the development of software technology as a subject area for unresolved metaphysical debate, such as the Kantian analytic/synthetic a priori dispute (Hacker, 2006). The limitations on this paper preclude explicit discussion of the 'pros and cons' of metaphysics for SE, or vice versa; however, some basic principles echo implicitly in our discussion – for example, our above comments on objectivity, which is possible for machine code but (currently?) impossible for an a priori understanding of subjective stakeholder software requirements. This implies that Requirements Engineering (RE) practice occupies an epistemological 'gap' between the architectural basis of software and how it is built/used. For our discussion, one positive consequence of a cross-disciplinary approach is that novel questions can be asked. It would appear to be the case, for example, that RE practitioners' gaining an understanding of stakeholders' requirements is compatible with the Kantian epistemological classification of 'synthetic a posteriori' (Hacker, 2006). This raises the possibility of other epistemological explanations for questions such as why SE compares unfavourably for reliability with other engineering disciplines. For example, civil engineers can respond to unexpected circumstances in bridge construction by correcting faults (BBC, 2000), whereas the hazards of safety-critical faults in aircraft cockpit software cannot be addressed in an equivalent way. As Mellor (1990) explains, the aviation industry certifies software for 'airworthiness' based on the 'correctness' of the software development process but not on the 'correctness' of the behaviour of the software during testing. Software development includes planning and designing artefacts but also presents SE with predictive-type problems.
For example, RE identifies/selects software requirements to satisfy stakeholders' future use of software. However, RE lacks reliable or dependable tools/techniques to predict outcomes (Nuseibeh, 2000).

Rationale

We are interested in why computer science (CS) has not established scientific laws that can predict SE outcomes, unlike, for example, civil engineering, which relies on the established natural laws of physics. The difference between CS and the natural science (NS) paradigm manifests in the division between observation of naturally occurring phenomena and contending with artificially occurring phenomena, e.g. software. Human interaction with software-based technology gives Social Science (SS) paradigm(s) (Burrell, 1979) potential ontological relevance for CS (Smith, 2010). For example, both SS and CS need to observe 'non-physical' phenomena such as human interaction. However, cross-disciplinary research depends on what is optimal in a particular paradigm for research purposes. Utilising different scientific paradigms (Hirschheim, 1989) is not straightforward. As a result we chose, conservatively, to employ SS to provide a dialectical analysis of contradictions in software development such as those outlined above. In particular, we opposed a potential (1) 'scientific paradigm' of CS (Eden, 2007) with (2) Ethnomethodology (Ethnometh), an SS approach that challenges scientific paradigm(s) in SS (Garfinkel, 1967) and has provenance in RE research (Goguen, 1994). Our purpose is to explore the potential for obtaining leverage over limitations in our understanding of software development.

Can a science base for software development be identified? For (1) to provide prediction, a relevant definition of science needs to apply to CS. Reasons to doubt this possibility are raised by (2), and we consider this in the observation of artificial phenomena in software development. The critical perspective of Ethnometh centres on the scope and meaning of science. We focus on 'scientific method' (SM) because this is how scientific prediction is achieved, resulting in the development and acceptance of scientific theories as explanation(s) of meaning. SM is defined as a process that relies on both inductive reasoning and observable phenomena to create a hypothesis that can be tested. Prediction of events or observations is then a process of deductive reasoning relying on theory to direct hypothesis testing. Prediction of SE outcomes is important, and good practice in SE is implicitly 'Popperian' (Popper, 1963), e.g. software is built to be testable. However, equating software testing with SM, e.g. with a refutable hypothesis, is questionable (Eden, 2007).
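The following sketch (ours, not the authors') shows software being 'built to be testable' in this implicitly Popperian sense: the test states a refutable hypothesis that a single counterexample would falsify, while any number of passing runs corroborates but never proves it, which is one reason equating testing with SM is questionable.

```python
# A unit test as a refutable hypothesis about a program (example ours).
# Hypothesis: sort_descending orders its input and loses no elements.

def sort_descending(xs):
    return sorted(xs, reverse=True)

def test_sort_descending():
    for xs in ([3, 1, 2], [], [5, 5, 1], list(range(50))):
        out = sort_descending(xs)
        assert out == sorted(out, reverse=True)  # ordering holds
        assert sorted(out) == sorted(xs)         # no elements lost
    # Passing corroborates the hypothesis; it does not verify it.

test_sort_descending()
```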
One central problem for establishing a scientific basis for software development is observation. Predictive SE, if possible, must have refutable observable phenomena (Smith, 2010). Yet any observation is via a human 'prism', hence the relevance of the Ethnometh criticism of applying SM to social phenomena, e.g. human behaviour (Garfinkel, 1967). For software development, human-technology interaction, e.g. input and output on a screen, is the point at which an artificial phenomenon (software) interfaces with its social environment (Smith, 2009). It is also the point where an SS paradigm that "capture(s) the basic assumptions of coexistent theories" (Hirschheim, 1989) becomes relevant to CS. Opposing theories in SS do not make the application of SM straightforward. However, CS is currently in a unique cross-disciplinary position. This is because software-based technology replaces previously existing environments/phenomena with artificially occurring environments/phenomena. SE practice provides the means by which phenomena, such as the results of the execution of source code, are possible to observe. SM has been applied via 'artificial' means before, such as instrument-assisted observation of otherwise unobservable phenomena. Historically, scientific experimentation produced, for example, the discovery of electricity via investigation of directly unobservable magnetism (Mendelssohn, 1976). Certainly, using artificial tools to 'empirically' observe naturally occurring phenomena, such as weather patterns, requires attention to both natural and artificial environments. Including SS paradigm(s) raises tantalising prospects, such as the potential for SE to provide the means to observe artificial phenomena.

Bibliography

BBC (2000). http://news.bbc.co.uk/hi/english/static/in_depth/uk/2000/millennium_bridge/default.stm, accessed 15/03/11.
Burrell, G. (1979). Sociological Paradigms and Organisational Analysis. Heinemann.
Eden, A. (2007). Three Paradigms of Computer Science. Minds & Machines, 17, 135–167.
Garfinkel, H. (1967). Studies in Ethnomethodology. Prentice-Hall.
Goguen, J. (1994). Requirements Engineering as the reconciliation of Technical and Social Issues. In: Requirements Engineering: Social and Technical Issues. London: Academic Press.
Hacker, P. M. S. (2006). Passing by the naturalistic turn: on Quine's cul-de-sac. Philosophy.
Hirschheim, J. (1989). Four Paradigms of Information Systems Development. ACM, 32(10).
Mellor, P. (1990). 10 to the -9 and all that: The non-certification of flight-critical software. City University London.
Mendelssohn, K. (1976). Science and Western Domination. London: Thames and Hudson.
Nuseibeh, B. (2000). Requirements Engineering: A Roadmap. Proc. ICSE 2000.
Popper, K. (1963). Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge.
Smith, L. (2009). Meeting stakeholder expectations of software, or looking for the 'Black Swan' in software requirements. Proc. ECAP09.
Smith, L. (2010). Software development: Out of the black box. Proc. ECAP10.

ONTOLOGY: from Philosophy to ICT and related areas. Problems and Perspectives.

SOLODOVNIK IRYNA
PhD student, International PhD School of Humanities
University of Calabria, Pietro Bucci, 87036, Arcavacata di Rende (CS), Italy

Abstract. This paper briefly highlights the development of the concept of Ontology, from its philosophical roots up to its vision in the ICT field and related areas. Philosophically, Ontology is a systematic explanation of Being that describes the features of Reality. Nowadays Ontology is proliferating in the organization of the Knowledge of different domains managed by advanced computer tools. Ontology qualifies and relates semantic categories, dragging with it, however, the idea of what, since the seventeenth century, was a way to organize and classify objects in the world. Ontology maximizes the reusability and interoperability of concepts, capturing new Knowledge within the most granular levels of information representation. Ontology is subjected to a continuous process of exploration, formation of hypotheses, testing and review. An ontological thesis proposed as true today may be rejected tomorrow in light of further discoveries and new and better arguments.
Philosophical background of Ontology

Webster's Third New International Dictionary defines Ontology as "1. a Science or study of Being: specifically, a branch of Metaphysics relating to the Nature and relations of being; 2. a Theory concerning the kinds of entities and specifically the kinds of abstract entities that are to be admitted to a language system". Literally, the word Ontology comes from the Greek ὄντος (òntos) and λόγος (lògos), meaning "speech about Being", but it may also derive explicitly from τὰ ὄντα (entities), variously interpreted according to different philosophical points of view. Aristotle proposed the first known category system, standing for a certain vision of the world in relation to what is judged to exist in practice. Heidegger conceived Ontology as a "phenomenology of the exploration" of what there "is" and of how it turns out. The ontological conceptualization, as a cohesive philosophical area, was introduced in 505–504 BC by Parmenides. He was the first to pose the argument about Being in its totality, presenting the issue of the ambiguity among the conceptual level, Ontology and language. Parmenides recognized the ontological dimension as dominant, able to subject to itself any other aspect of Philosophy. Over the centuries, the meaning of Ontology changed depending on the different visions and knowledge of other philosophers: Leucippus, Democritus, Plato, Aristotle, Descartes, Kant, Lorhard, Hegel, Trendelenburg, Brentano, Stumpf, Meinong, Husserl, Heidegger, Gockel. Some of them gave more value to an absolute belief, others to empirical things, thus enriching the heritage of Philosophy with what is considered its problem "par excellence" (the problem of existence in its fullest extent and universality: the relationship between particular and universal, intrinsic and extrinsic, essence and existence). "Indeed, without Ontology, Philosophy cannot be developed according to the demonstrative method. Even the art of discovery takes its principles from Ontology" (Blackwell, 1963).

Towards a new Ontology

The advent of the Semantic Web (Breitman, 2007), aimed at multi-objective optimization of the ICT environment and technological innovation in general, has coined a new vision of Ontology, so that it is considered today as a "formal, explicit specification of a shared conceptualization" (Gruber, 1995). Ontology, intended as a first-order axiomatic theory expressed in a description logic, is fundamental to the design of advanced knowledge-based software systems (Guarino, 1998; Eden, Turner, 2005). It is of great interest to combine lexical resources, such as thesauri (Broughton, 2006), with the world knowledge provided by Ontologies in order to improve deductive reasoning with natural language, as well as to enhance automatic classification (e.g. in Ontology-based cataloging systems), problem-solving techniques, interoperability among different computer systems, cross-cultural and intercultural communication in CMC (Ess, Sudweeks, 2005), etc. Since Ontology is the basis of web intelligence, it is also widely used in e-commerce, on-line marketing, business management, etc. In Fig. 1 we can observe philosophical reflection in the field of computer science and information technology (Floridi, 2002; Colburn, 2003; Gruber, 2009). Here Thought (which is regulatory/normative with respect to Reality), through Language (which defines the existing categories reflecting Thought and Reality), is connected with Ontology and Epistemology, representing the descriptive and prescriptive approaches. Ontology refers to the objective validity (Husserl, 1992) of a terminology waiting to be discovered by domain knowledge experts, while Epistemology provides model reasoning in class-based representation formalisms through description logics.

Figure 1. The ontological and epistemological turn in Computer Science.

Automated reasoning and Ontology manipulation in description logics allow one to present and emulate human logic-based knowledge of entities in different domains, managing simultaneously dissimilar types of objects (concrete and abstract, independent and dependent) and their ties (relations, dependencies and predications).
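A toy sketch of such class-based reasoning may be useful; the tiny 'ontology' and the code are our illustration, not a real description-logic reasoner:

```python
# From the asserted axioms "Thesaurus is-a LexicalResource" and
# "LexicalResource is-a InformationResource", a classifier derives the
# implicit subsumption "Thesaurus is-a InformationResource".
# Class names are invented for illustration.

subclass_of = {
    "Thesaurus": "LexicalResource",
    "LexicalResource": "InformationResource",
}

def superclasses(concept):
    """Collect every concept that subsumes the given one."""
    found = []
    while concept in subclass_of:
        concept = subclass_of[concept]
        found.append(concept)
    return found

print(superclasses("Thesaurus"))
# ['LexicalResource', 'InformationResource']
```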
Creating a single knowledge-sharing paradigm is neither an easy nor an immediate task, considering also non-trivial technological obstacles (consistency and validity of Ontologies vs. time and the evolution of information technology). It remains an appealing challenge to set up new scientific environments in which philosophers and other scholars can meet to discuss and develop strategies to classify, organize and implement qualitative conceptual domains, and even more those represented by different semantic systems tied to language differences.

References

Breitman, K., Casanova, M. & Truszkowski, W. (2007). Semantic Web: Concepts, Technologies and Application. NASA Systems and Software Engineering Series, 1st ed. London: Springer Verlag.
Broughton, V. (2006). Essential thesaurus construction. London: Facet.
Colburn, T. R. (2003). Philosophy and Computer Science. Armonk: Sharpe.
Eden, A. H. & Turner, R. (2005). Towards an Ontology of software design: The Intension/Locality Hypothesis. 3rd European Conf. Computing And Philosophy ECAP, 2–4 Jun, Västerås, Sweden.
Ess, C. & Sudweeks, F. (2005). Culture and computer-mediated communication: Toward new understandings. Journal of Computer-Mediated Communication, 11(1).
Gruber, T. (2009). Ontology. In: Ling Liu and M. Tamer Özsu (Eds), Encyclopedia of Database Systems. Springer-Verlag.
Guarino, N. (1998). Formal Ontology and Information Systems. In: N. Guarino (Ed), Formal Ontology in Information Systems. Proceedings of FOIS 1998, Trento, Italy, 6–8 Jun. Amsterdam: IOS Press.
Husserl, E. (1929). Formal and transcendental logic. English translation: The Hague, Martinus Nijhoff (1969).
Smith, B. (2003b). Ontology. In: L. Floridi (Ed), Blackwell Guide to the Philosophy of Computing and Information. Oxford: Blackwell.
Wolff, C. (1728). Preliminary discourse on philosophy in general. Translated, with an Introduction and notes, by Richard J. Blackwell (1963). Indianapolis: The Bobbs-Merrill Company.

THE EVOLUTION OF SOFTWARE AGENTS AS DIGITAL OBJECTS

SABINE THÜRMEL
Graduate Center of the TUM School of Education
Technical University of Munich, Munich, Germany

Abstract. The evolution of software agents as digital objects, from simple interface agents to full-blown interaction partners, is depicted. An outline of the concretization process in agent-oriented programming is given, contributing to research into the ontology of computer programs.

Extended Abstract

The focus of this paper is on the evolution of software agents as digital, i.e. computational, objects. It can be shown that a new type of interplay between human beings, "computational objects" and the physical environment is in the process of emerging.
Turkle's insight (2006) into the nascent robotics culture is equally valid for software agents: "computational objects do not simply do things for us, but they do things to us as people, to our ways of seeing ourselves and others. Increasingly, technology puts itself into a position to do things with us" (p. 1). The starting point of this evolution was constituted by interface agents providing assistance for the user or acting on his or her behalf. As envisioned by Laurel (1991) and Maes (1994), they evolved into increasingly autonomous agents. In game worlds they were first seen in one-person offline video games. Interacting pure software agents and avatars became prevalent in MMORPGs (massively multiplayer online role-playing games) such as World of Warcraft®. As interworking collaborative software agents embedded in nets of devices, they provide support for smart grids (Mainzer, 2010) or for other variants of the "Internet of things" (Mattern/Langheinrich, 2008). Last but not least, they are used to coordinate emergency response services in disaster management systems (Jennings, 2010). Already in 1992, Solum posed the question in the North Carolina Law Review whether virtual agents may be the basis for persons in the legal sense of the law (Solum, 1992). Today virtual agents are commonly deployed in online auctions and eNegotiations (Wooldridge, 2009). Thus software agents have been promoted from assistants to virtual interaction partners. The socio-technical fabric of our world has been augmented by these collaborative systems.

The goal of the agent-oriented programming paradigm is the adequate and intuitive modeling and implementation of complex interactions and relationships. Software agents were introduced by Hewitt's Actor Model (Hewitt et al., 1973). Today a whole variety of definitions of software agents exist, but all of them include mechanisms to support persistence, autonomy, interactivity and flexibility. Bionic approaches, such as swarm intelligence, or societal models are adapted to implement collaborative approaches to distributed problem solving. They are, on the one hand, part of the tool kit used in the computational sciences, which use computer-based simulations as a link between theory and experiment. As such they are similar to numerical simulation but use different conceptual and software models. On the other hand, they provide a basis for agency in virtual worlds offering novel experiences. They provoke us to ask how this technological progress will affect our interpersonal relationships (Turkle, 2011).

The starting point of any software-agent-based approach is a bionic or societal metaphor for distributed problem solving. The resulting computer science concept is specified as a computer program modeling the interacting software agents. At compile time the high-level program is transformed into a machine-executable computer program to be run in a distributed environment. During runtime, any (instance of a) software agent may be perceived as a distinct thread or process. This concretization process conforms to the program abstraction taxonomy introduced in (Eden and Turner, 2007). From an ontological perspective it can be stated that the underlying computer science concepts are abstract objects that can be concretized by computer programs conforming to an agent-oriented programming paradigm.
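As a hedged sketch of this concretization chain (ours, in the spirit of Hewitt's Actor Model rather than any particular agent platform), the agent concept below is written down as a program, and each runtime instance is concretized as a distinct thread consuming messages from a mailbox:

```python
# Minimal actor-style agent (illustration ours): the agent concept is
# concretized as a program, and at runtime each instance becomes a
# distinct thread reacting autonomously to incoming messages.
import queue
import threading

class EchoAgent:
    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()      # persistence: keeps running
            if msg is None:               # poison pill stops the agent
                break
            print(f"{self.name} handled: {msg}")  # autonomous reaction

agent = EchoAgent("agent-1")              # runtime concretization
agent.mailbox.put("hello")                # interactivity via messages
agent.mailbox.put(None)
agent.thread.join()
```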
The computer programs are themselves abstract objects that can be concretized by adequate computational objects conforming to a (different) programming paradigm or by concrete physical objects. Different concretizations may exist for one computer program. It should be noted that the identical agent-oriented program may first be tested in a simulated environment and then employed in a real-time environment. Similar to (Reicher-Marek, 2009), four basic relations between computer programs and other objects may be distinguished: the concretization relation outlined above, the notation relation (between the abstract object and its (textual or graphical) specification), the environmental relation (between the abstract object and its potential runtime environments), and the instantiation-at-runtime relation coupling the abstract object to its dynamic instantiations. In my view, any non-trivial identity notion for computer programs has to take these relationships into account.

References

Eden, A. H. & Turner, R. (2007). Problems in the ontology of computer programs. Applied Ontology, 2(1), 13–36. Amsterdam: IOS Press.
Hewitt, C., Bishop, P. & Steiger, R. (1973). A universal modular actor formalism for Artificial Intelligence. In: International Joint Conferences on Artificial Intelligence, 235–245.
Jennings, N. (2010). ALADDIN End of Project Report. www.aladdinproject.org. Cited 25 April 2011.
Laurel, B. (1991). Computers as Theatre. New York: Addison-Wesley.
Maes, P. (1994). Agents that reduce work and information overload. Communications of the ACM, 37(7), 30–40.
Mainzer, K. (2010). Leben als Maschine? Von der Systembiologie zur Robotik und Künstlichen Intelligenz. Paderborn: Mentis Verlag.
Mattern, F. & Langheinrich, M. (2008). Eingebettete, vernetzte und autonom handelnde Computersysteme: Szenarien und Visionen. In: A. Kündig and D. Bütschi (Eds), Die Verselbständigung des Computers (pp. 55–75). Zürich: vdf Verlag.
Reicher-Marek, M. (2009). What is the object in which copyright can subsist? An ontological analysis. In: E. Ortland and R. Schmücker (Eds), Copyright & Art: Aesthetical, legal, ontological and political issues. Baden-Baden: Nomos. (to appear)
Solum, L. (1992). Legal personhood for artificial intelligences. North Carolina Law Review, 2, 1231–1283.
Turkle, S. (2006). A nascent robotics culture: new complicities for companionship. Paper presented at the 21st National Conference on Artificial Intelligence, Boston, July 2006.
Turkle, S. (2011). Alone Together: Why We Expect More from Technology and Less from Each Other. New York: Basic Books.
Wooldridge, M. (2009). An Introduction to MultiAgent Systems (2nd ed). New York: John Wiley & Sons.

MACHINES and COMPUTATIONS

RAYMOND TURNER
Department of Computer Science and Electronic Engineering
University of Essex
Wivenhoe Park, Colchester, Essex CO4 3SQ, UK

Abstract

How may abstract and physical machines be related? What is the difference between considering an abstract machine as:
1. A theory of a physical one
2. A functional description of one
3. A specification of one?
Do these distinctions throw any light on the nature of physical computation and the arguments of Putnam and Pancomputationalism?

Track II: Philosophy of Information and Cognition
ON THE LEVEL OF CREATIVITY
Ponderings on the Nature of Kantian Categories, Creativity and Copyrights

ALEXANDER FUNCKE
Centre for Study of Evolutionary Culture at Stockholm University
106 91 Stockholm

Abstract. The relation between data and information is considered in analogy with Kantian transcendental aesthetics in order to create a formal concept of, and an ordinal relation for, "creativity". Implications are discussed for Kantian categories, creativity and copyrights.

1. Background & Aims

Creativity is a popular concept for controversy in many disciplines. This paper does not necessarily contain the deepest insights, but it provides perspectives that might be useful while considering creativity and thereby copyrights, cognition and maybe even consciousness.

2. Transcendental aesthetics

In order to formulate the ideas, this paper uses an analogy to Kant's transcendental aesthetics: the process whereby a noumenon is transcended via categories to a phenomenon is contrasted with a process whereby data is rendered via a context/algorithm to information. The analogy lends itself to being considered as an extension, rather than merely an analogy, of the transcendental aesthetics too. That is, Kant's transcendental aesthetics may be reinterpreted as "actual" transcendence in terms of data and information. It opens up a multiple-layer interpretation, and thereby also questions such as whether we may consider a hearing aid, or other more intricate cyborg technologies, as just another category in the Kantian sense.³

[Footnote 3: This may also have consequences for copyrights. Arguably, copyrights ought not to be applicable to data in itself, but only to information. Now, if a blind person somehow manages to copy a protected image, then it couldn't be considered an infringement, as he lacks the categories to render the information that could have been protected by a copyright for someone with visual categories. Nor should his original visual works ever be copyrightable for their visual qualities.]

3. Potentiality/actuality

The dichotomy of potentiality and actuality has been part of the philosophical discussion at least since book Theta of Aristotle's Metaphysics. The transcendental aesthetics analogy may be considered as a model for considering data in its actual form and in its potential form relative to a given interpreter. The interpreter in the model consists of two components: a passive presentation that takes formatted data as input and outputs information, and an active algorithm that takes raw data as input and outputs formatted data, where the latter component may have potential. An algorithm is considered to have potential if it manipulates the raw data in a way that cannot be described as a simple transformation or crop, but which also adds "extra relevant information" relative to a given presentation.

To formalise this potentiality, or creative quality if you will, let X and Y be sets of data, and let f, g ∈ F_{X,Y} = {f : X → Y} be two algorithms that transform raw data into formatted data. Further, let F^N_{X,Y} ⊆ F_{X,Y} be the subset of algorithms that lack potential, and let Y' ⊆ Y be the set of all formatted data that renders information for a given presentation. Now, define two functions: H : X → ℝ, which maps any data to its entropy, and H_m : Y' → ℝ, defined as

$$H_m(y) = \min_{f \in F^{N}_{X,Y}} H(f^{-1}(y)), \qquad (1)$$

which maps any information entity to its minimal-entropy representation given a presentation. The inverse of f may actually not be unique, but with a small abuse of notation we define

$$f^{-1}(y) = \operatorname{argmin}_{x \in \{x : f(x) = y\}} H(x),$$

that is, the minimal-entropy x that maps to y. Finally, define the "additional map" A : F_{X,Y} × Y' → ℝ such that

$$A(f, y) = H_m(y) - H(f^{-1}(y)), \qquad (2)$$

which gives a number for the level of potential the algorithm f has to generate the information entity y.⁴

[Footnote 4: Note that this means that a verbose representation x ∈ X of an informative entity could be classified as non-potential, even if it seems to have all the necessary properties. One could add a proxy step to solve this, by mapping f to f_h, where f_h is the equivalence-class (in the obvious sense) version of f.]
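A numeric toy illustration of the additional map may help; it is our sketch, and it crudely approximates the paper's abstract entropy H by the Shannon entropy of a byte string, so the numbers are only indicative:

```python
# Crude numeric illustration of A(f, y) = H_m(y) - H(f^{-1}(y)).
# H is approximated by the total Shannon entropy of a byte string in
# bits; the strings and the 'algorithms' they stand for are invented.
import math
from collections import Counter

def shannon_bits(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    per_symbol = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return per_symbol * n

# Minimal-entropy input some non-potential algorithm needs to yield y:
x_non_potential = b"abababababababab"   # stand-in for H_m(y)
# Input a candidate potential algorithm f needs to yield the same y:
x_f = b"abab"                           # f 'adds' the expansion itself

A = shannon_bits(x_non_potential) - shannon_bits(x_f)
print(A)   # 12.0 > 0: f counts as (strictly) potential for this y,
           # in the sense of inequality (3) below
```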
An algorithm f is considered strictly potential relative to a presentation and a subset of the informative entities S ⊆ Y' if all its elements y ∈ S are represented more economically than in the minimal non-potential case, that is,

$$\forall y \in S, \quad A(f, y) > 0. \qquad (3)$$

An algorithm is considered potential (in the non-strict sense) for a subset S if a non-empty subset S' ⊆ S is strictly potential and for no y ∈ S is A(f, y) < 0.

4. Creativity as an ordinal relation

There are various degrees of potentiality: not only should algorithm potentiality be compared with respect to the amount of relevant information quantified by the "additional map", it should also take an interest in the relative ease of computing f(x) ∈ Y'. Ignoring the complexity of computation would be like ignoring the difference between factorising the product of two huge primes and summing them. Another example that highlights the need to include complexity is the simulation of nonlinear dynamical systems, such as models of meteorological or financial systems. It is unfeasible to reason analytically about the behaviour of such systems, and it takes a lot of computation to unfold the behaviour through simulation, even though all the data and the algorithms are in place.⁵

[Footnote 5: It is really just a way of stating that the tragedy of deduction will not help.]

There are multiple reasonable ways to define an ordinal relation between two algorithms f, g ∈ F_{X,Y} = {f : X → Y} that take these things into account, but the transitive, reflexive and identity-preserving variant suggested here is the following:

$$f > g \iff O(f) > O(g) \lor \big(O(f) = O(g) \land A(f) > A(g)\big), \qquad (4)$$

where O(f) is the computational complexity of f.

5. Conclusion

The concepts presented, and to some extent explored in the longer version of this paper, give a formal interpretation of the notoriously hard-to-pin-down idea of creativity. The ordinal relation "level of creativity" lends itself to demarcating when a set of algorithms may create information that is creative enough to be regarded as copyrightable, or maybe even what the minimal level of creativity is for a cognitive or conscious algorithm. From the analogy to transcendence spring other implications, hinted at in the footnotes: cyborg technology, such as hearing aids, may be considered as a multi-level version of transcendence, which aids one's intuition while pondering copyrights, whether one likes Kant or not.

References

Dennett, D. C. (1996). Darwin's Dangerous Idea: Evolution and the Meanings of Life. Simon & Schuster.
Floridi, L. (2004). Open problems in the philosophy of information. Metaphilosophy, 35(4), 554–582.
Floridi, L. (2008). Philosophy of Computing and Information: 5 Questions. Automatic Press/VIP, Copenhagen, Denmark.
Floridi, L. (2009).
Philosophical conceptions of information. Lecture Notes in Computer Science, 5363, 13–53.
Kant, I. (2003). Critique of Pure Reason. Courier Dover Publications.
Koepsell, D. R. (2000). The Ontology of Cyberspace: Law, Philosophy, and the Future of Intellectual Property. Peru, IL: Open Court Publishing Co.
Mandelbrot, B. (1967). How Long Is the Coast of Britain? Statistical Self-Similarity and Fractional Dimension. Science, 156(3775), 636–638.
Mitchell, T. (1997). Machine Learning. McGraw-Hill Education (ISE Editions), 1st edition.
Nagel, T. (1974). What is it like to be a bat? Philosophical Review, 83(October), 435–450.

THE FOURTH REVOLUTION AND SEMANTIC INFORMATION

VALERIA GIARDINO
Institut Jean Nicod (CNRS-EHESS-ENS), Paris
Valeria.Giardino@ens.fr

Abstract. In his work, Floridi introduces several notions to describe our relationship with information and technology. Indeed, according to him, in recent times humanity has experienced a fourth revolution, the Information revolution, which, starting from the work of Alan Turing, has deeply affected our understanding of ourselves as agents. Our generation is still a generation of "emigrants", but our children will be born in the infosphere and will recognize themselves from their birth as inforgs. I will focus on the notions of infosphere and inforgs, and more generally on the notion of information Floridi makes use of. According to Floridi, in re-ontologizing ourselves as inforgs, we recognize how significantly but not dramatically different we are from smart, engineered artifacts, since we have, as they have, an informational nature. Nevertheless, if one focuses on semantic information, which requires meaning and understanding, then there is still a dramatic difference between ourselves and our artifacts to be acknowledged: we are the only agents who spontaneously reason semantically. First, I will present the four revolutions Floridi talks about, and claim that there are other revolutions in the history of human culture that should be considered from the perspective of discussing the reshaping of our new environment and of our new selves in the infosphere. Secondly, I will discuss an ambiguity in Floridi's use of the term information and propose to consider his fourth revolution as the Second Information revolution. To solve this ambiguity, I will distinguish between information and semantic information, which implies meaning and understanding. Finally, I will present some questions that emerge once we consider humans' cognitive capacities to access meaning against the background of the new context, the infosphere.

1. Introduction: we are inforgs in an infosphere

Floridi has suggested that in recent years we have gone, together with our environment, through a process of re-ontologization that has changed forever our way of seeing the world and ourselves. If the challenge of philosophy today is to analyse how this revolution has changed our understanding of the world and of ourselves, my challenge in this talk will be to claim that some of Floridi's suggestions should be partly revised and further discussed. First, I will present the four revolutions Floridi talks about, and claim that there are other revolutions in the history of human culture that should be considered. Secondly, I will discuss an ambiguity in Floridi's use of the term information and propose to consider his fourth revolution as the Second Information revolution.
To solve this ambiguity, I will distinguish between information and semantic information, which implies meaning and understanding.

2. One, two, three... many revolutions: human culture

Though I am in general sympathetic with Floridi's rational reconstruction of the four revolutions, I want to argue that in the course of human cultural evolution it is possible to identify other crucial steps in the transformation of our ontology. It is unquestionable that the appearance of cognitive artefacts has played a major role in the shaping of our world and of us as cognitive agents. We might assume an evolutionary perspective and consider first the moment in which human beings began to communicate by means of a language, and then the moment they invented writing, and thus began not only to produce words but to share them in a public format that could be inspected by others and stored in archives. Both these steps were crucial in the evolution of human cognition, since they revolutionized human beings' access to meaning: new channels became available to communicate and to make sense of the world around us and of ourselves. My approach is in line with the idea that cognition is 'distributed': as Hutchins (1995a; 1995b) explains, cognitive events are not encompassed by the skin or skull of an individual. There exist interesting kinds of distribution of cognitive processes: we must consider them if we want to understand human cognition. Human beings, despite the limitations of the cognitive systems with which we know they are born (Kinzler and Spelke (2007); Spelke (2004)), were able to develop new practices and new cognitive strategies to augment the powers of their minds, showing an extraordinary capacity for creating tools that would help them in the processes of both describing the world around them and acting upon it. Some of these tools had an intrinsically cognitive function. As a consequence, a more faithful reconstruction of our cultural evolution would rather show how the history of our cognition has been deeply influenced by the fact that from the very beginning we engaged ourselves in symbolic activities, and that these activities have become, in a long historical and cultural process of creation and selection, more and more complex. This was indeed a revolution in the ontology of information within the billions of years of the evolutionary process, dating from the time when living processes became encoded in DNA sequences: "because this novel form of information transmission was partially decoupled from genetic transmission, it sent our lineage of apes down a novel evolutionary path - a path that has continued to diverge from all other species ever since" (Deacon, 1997, p. 45).

3. Cognition and semantic information

In the DNA double helix, as well as in Turing machines, information is conceived as a code, a string, and it does not have anything to do with meaning or understanding. By contrast, semantic information requires meaning and understanding. Floridi claims that, by re-ontologizing ourselves as inforgs, we recognized how significantly but not dramatically different we are from smart, engineered artifacts, since we have, as they have, an informational nature. But what kind of information is Floridi talking about when he refers to an 'informational nature' in the two cases? I will consider Bruner's (1990) point of view on what he defined as the Cognitive revolution, which took place in the 1950s.
According to Bruner's reconstruction, the aim of that revolution at the beginning was to discover and describe formally the meanings that human beings were able to create out of their encounters with the world. The objective in the long run was to propose hypotheses about which meaning-making processes were implicated in humans' cognitive activity. Bruner's hope was that such a revolution, as it was conceived at its origins, would have brought psychology to collaborate with its sister interpretative disciplines, such as the humanities and the social sciences. It is only a collaboration of this kind that can allow the investigation of such a complex phenomenon as meaning-making. But the happy ever after did not work out. In fact, the emphasis began shifting from the construction of meaning to the processing of information, which are profoundly different matters. The notion of computation was introduced and computability became 'the' good theoretical model; this took us far from the original question - the revolutionary one - which was about the conditions of our meaning-making activity, the answer to which would have explained our semantic power. For this reason, the Cognitive revolution "has been technicalized in such a manner that even undermines that original impulse" (p. 1): it has become the (uninteresting) Information revolution. Meaning is thus different from information because it does not come before the message; rather, it originates through the message itself and the fact that this message is shared. In fact, public meanings are the result of a negotiation.

4. Conclusions

To sum up, in my talk I will try to show that a particularly interesting aspect to discuss in this framework is the role of semantic information, which is the expression of a symbolic activity that up to now has been shown to be specifically human. Knowledge is situated-distributed, and this not only because it has a cultural nature, but also and most of all because our knowledge acquisition has a cultural nature. Moreover, knowledge also has a social nature, because it gets socially constructed (Berger and Luckmann (1966)). Human beings are semantic engines, and they engage themselves in meaning-making and meaning-negotiating. For this reason, meaning is flexible: as Bruner says, we show a 'dazzling' intellectual capacity for envisioning alternatives. Will one day a fifth revolution come that will take away from us also this ultimate illusion? That day, will our own technology bring about intentional and semantically powerful machines? At the moment, we do not know. The task of philosophy of information is to provide the appropriate framework that would allow us to make useful predictions in order to prepare the future generations and ourselves.

Acknowledgements

I thank the Public Representations group at the Institut Jean Nicod for all our useful discussions on similar topics, and in particular Elena Pasquinelli and Giuseppe A. Veltri, who read a preliminary version of this article. The research was supported by the European Community's Seventh Framework Program ([FP7/2007-2013]) under a Marie Curie Intra-European Fellowship for Career Development, contract number 220686 - DBR (Diagram-based Reasoning).

References

Berger, P. L. & Luckmann, T. (1966). The Social Construction of Reality: A Treatise in the Sociology of Knowledge. Garden City, NY: Anchor Books.
Bruner, J. (1990). Acts of Meaning. Cambridge, Mass. and London: Harvard University Press.
Deacon, T. W. (1997).
The Symbolic Species. New York and London: W. W. Norton & Company.
Dror, I. E. & Harnad, S. (Eds.) (2008). Cognition Distributed: How Cognitive Technology Extends Our Minds. Amsterdam: John Benjamins.
Floridi, L. (2002). Information Ethics: An Environmental Approach to the Digital Divide. Philosophy in the Contemporary World, 9(1), 39-45.
- (2007). A look into the future impact of ICT on our lives. The Information Society, 23(1), 59-64. An abridged and modified version was published in TidBITS.
- (2009). The Semantic Web vs. Web 2.0: a Philosophical Assessment. Episteme, 6, 25-37.
Hutchins, E. (1995a). Cognition in the Wild. MIT Press.
- (1995b). How a cockpit remembers its speeds. Cognitive Science, 19, 265-288.
Kinzler, K. D. & Spelke, E. S. (2007). Core systems in human cognition. Progress in Brain Research, 164, 257-264.
Spelke, E. S. (2004). Core knowledge. In: N. Kanwisher & J. Duncan (Eds.), Attention and Performance: Functional Neuroimaging of Visual Cognition (Vol. 20, pp. 29-56). Oxford: Oxford University Press.

EPISTEMOLOGICAL AND PHENOMENOLOGICAL ISSUES IN THE USE OF BRAIN-COMPUTER INTERFACES

RICHARD HEERSMINK
PhD Candidate, Macquarie Centre for Cognitive Science
Macquarie University, Sydney, Australia
Email: richard.heersmink@gmail.com

Abstract. Brain-computer interfaces (BCIs) are an emerging and converging technology that translates the brain activity of its user into command signals for external devices such as motorized wheelchairs, robotic hands, environmental control systems, and computer applications. In this paper I functionally decompose BCI systems and categorize BCI applications with similar functional properties into three categories: those with (1) motor, (2) linguistic, and (3) virtual applications. I then analyse the relationship between these distinct BCI applications and their users from an epistemological and phenomenological perspective. Specifically, I analyse functional properties of BCIs in relation to the abilities (particularly motor behavior and communication) of their human users, asking how they may or may not extend these abilities. This includes a phenomenological analysis of whether BCIs are experienced as transparent extensions. Contrary to some recent philosophical claims, I conclude that, although BCIs have the potential to become bodily as well as cognitive extensions for skilled users, at this stage they are not. And while the electrodes and signal processor may to a variable degree be transparent and incorporated, the BCI system as a whole is not. Contemporary BCIs are difficult to use. Most systems only work in highly controlled laboratory settings, require a high amount of training and concentration, have very limited control options, have low and variable information transfer rates, and effector motions are often slow, clumsy and sometimes unsuccessful. These drawbacks considerably limit their possibilities for transparency and incorporation into either the body schema or the cognitive system, which is essential for bodily and cognitive extension. Current BCIs can therefore only be seen as a weak or metaphorical extension of the human central nervous system. To increase their potential for cognitive extension, I give suggestions for improving the interface design of what I refer to as linguistic applications.
1. Introduction: Brain-Computer Interfaces

BCIs are an emerging and converging technology that translates the brain activity of its user into command signals for external devices. Invasive or non-invasive electrode arrays detect an intentional change in neural activity, which is translated by a signal processor into command signals for applications such as wheelchairs, robotic hands, environmental control systems, and computer applications. In essence, BCI technology establishes a direct one-way communication pathway between the human brain and an external device, and can to some extent translate human intentions into technological actions without having to use the body's neuromuscular system. However, contemporary BCIs are difficult to use: the technology is still in its infancy and has barely passed the "proof of concept" stage. Most systems only work in highly controlled laboratory settings, require a high amount of training and concentration, have very limited control options, have low and variable information transfer rates, and effector motions are often slow, clumsy and sometimes unsuccessful.

2. Goals, Method and Structure

2.1. A TYPOLOGY OF BCIS

In this paper I explore the relationship between BCI technology and its human users from an epistemological and phenomenological perspective. My analysis has five parts. First, I present a preliminary conceptual analysis of BCIs in which I functionally decompose BCI systems and categorize BCI applications with similar functional properties (Vermaas & Garbacz, 2009). Based on this preliminary analysis, I distinguish three categories: (1) motor applications, which restore motor functions for disabled subjects, such as motorized wheelchairs or robotic hands; (2) linguistic applications, which allow a disabled subject to select characters on a screen, thereby restoring communicative abilities; and (3) virtual applications, which allow a subject to control elements (e.g. avatars) in a virtual environment.

2.2. THE CURRENT DEBATE ON BCIS

Second, I briefly outline the current philosophical debate on BCIs. It has been claimed that a BCI-controlled robotic arm is a bodily extension fully integrated into the body schema of a macaque, thereby constituting a "new systemic whole" (Clark, 2007). It has also been claimed that functionally integrated BCIs are cognitive extensions, i.e., they extend cognitive processes of their users into the material environment (Fenton & Alpert, 2008; Kyselo, 2011). These philosophical claims are evaluated later on in this paper.

2.3. HUMAN-TECHNOLOGY RELATIONS

Third, I introduce some key concepts for better understanding human-technology relations. These key concepts are "body schema", "incorporation", "transparency" and "extended cognition". A body schema is a non-conscious neural representation of the body's position and its capabilities for action. We are able to incorporate artifacts such as hammers, screwdrivers, pencils, walking canes, cars, glasses, and hearing aids into our body schema, thereby enlarging it (Brey, 2000). These artifacts are embodied and are not experienced as objects in the environment but as part of the human motor or perceptual system. When using embodied artifacts such as hammers, pencils, and screwdrivers to act on the world, a subject doesn't first want an action on the artifact and then on the world.
Rather, a subject merely wants an action on the world through the artifact and doesn't consciously experience the artifact when doing so. The perceptual focal point is thus at the artifact-environment interface, rather than at the agent-artifact interface (Clark, 2007). In this sense, embodied artifacts are transparent (Ihde, 1990). Cognitive artifacts such as calculators, computers, and navigation systems can under certain conditions be incorporated into the human cognitive system in such a way that they can best be seen as literally part of that system. These devices, then, perform functions that are complementary to the human brain (Sutton, 2010). There is, furthermore, a two-way interaction when using such devices, and both the brain and the cognitive artifact have a causal role in the overall process, thereby forming a "coupled system". In such coupled systems, the cognitive process is distributed across brain and artifact, and the artifact is seen as co-constitutive of the extended cognitive system. Remove the technological element from the equation and the overall system will drop in behavioural and cognitive competence. So there is a strong symbiosis and reciprocity in coupled systems. Moreover, what is essential when extending cognition is a high degree of trust in, reliance on, and accessibility of the cognitive artifact (Clark & Chalmers, 1998).

2.4. HUMAN-BCI RELATIONS

Fourth, I explore the relationship between motor, linguistic, and virtual applications and their human users in the light of the concepts just introduced. I analyse whether BCIs are incorporated into the body schema or cognitive system of their users, and whether they are experienced as transparent extensions of the human body or cognitive system. I demonstrate that, although BCIs have the potential to become bodily as well as cognitive extensions for skilled users, at this stage they are not. And while the electrodes and signal processor may to a variable degree be transparent and incorporated, the BCI system as a whole is not. Contemporary BCIs are difficult to use. Most systems only work in highly controlled laboratory settings, require a high amount of training and concentration, have very limited control options, have low and variable information transfer rates, and effector motions are often slow, clumsy and sometimes unsuccessful. These drawbacks considerably limit their possibilities for transparency and incorporation into either the body schema or the cognitive system, which is essential for bodily and cognitive extension.

2.5. DISTRIBUTED COGNITION FOR IMPROVING BCIS

And fifth, I give suggestions to increase the potential for cognitive extension of linguistic applications. To do so, I draw on concepts from the distributed cognition framework. Jim Hollan, Ed Hutchins and David Kirsh (2000) argue that the nature of external representations is essential when effectively distributing cognition. Their notion of "history-enriched digital objects" implies that often-selected letters should be presented larger or brighter on the screen. Their notion of "zoomable multiscale interfaces" implies that for someone who is selecting letters on a screen, it might be more effective if the letter the person wants to select becomes larger when the cursor moves towards it. And their notion of "intelligent use of space" implies that for people who are not used to the QWERTY layout, it might be logical to present the most often selected letters in the middle and less often selected letters in the periphery of the screen. The sketch below illustrates how these three ideas could be combined.
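As a purely illustrative sketch (nothing of the kind appears in the paper; every name, constant and the letter ordering below are assumptions of this sketch, not features of any published BCI system), the three notions can be combined into a single display rule for a BCI letter board: letters grow with their selection history and with proximity to the cursor, and frequent letters are placed centrally.

# Hypothetical sketch combining Hollan, Hutchins and Kirsh's three notions
# for a BCI letter board. All names and numbers are illustrative assumptions.
from collections import Counter
import math

LETTERS = "ETAOINSHRDLCUMWFGYPBVKJXQZ"  # rough English frequency order

def center_out_order(letters):
    """Intelligent use of space: place frequent letters near the middle,
    rare ones toward the periphery of a one-dimensional board."""
    board = [None] * len(letters)
    mid = len(letters) // 2
    positions = sorted(range(len(letters)), key=lambda i: abs(i - mid))
    for letter, pos in zip(letters, positions):
        board[pos] = letter
    return board

def display_size(letter, history, cursor_pos, board, base=1.0):
    """History-enriched digital objects: often-selected letters render larger.
    Zoomable multiscale interface: letters grow as the cursor approaches."""
    freq_boost = 0.2 * math.log1p(history[letter])
    distance = abs(board.index(letter) - cursor_pos)
    zoom_boost = 1.5 / (1.0 + distance)
    return base + freq_boost + zoom_boost

if __name__ == "__main__":
    history = Counter("HELLO WORLD HELLO")   # toy selection history
    board = center_out_order(LETTERS)
    cursor = board.index("E")                # cursor currently over 'E'
    for letter in "HEL":
        print(letter, round(display_size(letter, history, cursor, board), 2))

In an actual BCI speller the cursor position would of course be driven by the decoded neural signal; here a fixed position merely stands in for it.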
References

Brey, P. (2000). Technology and Embodiment in Ihde and Merleau-Ponty. In: C. Mitcham (Ed.), Metaphysics, Epistemology, and Technology. Research in Philosophy of Technology, Vol. 19. London: Elsevier/JAI Press.
Clark, A. (2007). Re-Inventing Ourselves: The Plasticity of Embodiment, Sensing and Mind. Journal of Medicine and Philosophy, 32(3), 263-282.
Clark, A. & Chalmers, D. (1998). The Extended Mind. Analysis, 58, 10-23.
Fenton, A. & Alpert, S. (2008). Extending Our View on Using BCIs for Locked-in Syndrome. Neuroethics, 1(2), 119-132.
Hollan, J., Hutchins, E. & Kirsh, D. (2000). Distributed Cognition: Toward a New Foundation for Human-Computer Interaction Research. Transactions on Computer-Human Interaction, 7(2), 174-196.
Ihde, D. (1990). Technology and the Lifeworld: From Garden to Earth. Indiana University Press.
Kyselo, M. (2011). Locked-in Syndrome and BCI: Towards an Enactive Approach to the Self. Neuroethics. doi:10.1007/s12152-011-9104-x.
Sutton, J. (2010). Exograms and Interdisciplinarity. In: R. Menary (Ed.), The Extended Mind. MIT Press.
Vermaas, P. E. & Garbacz, P. (2009). Functional decomposition and mereology in engineering. In: A. Meijers (Ed.), Handbook of the Philosophy of Technology and Engineering Sciences. Amsterdam: Elsevier.

AN INFORMATION-THEORETIC MODEL OF CHUNKING

DANIEL HEWLETT
University of Arizona, Tucson, AZ 85745, USA

AND

PAUL COHEN
University of Arizona, Tucson, AZ 85745, USA

Abstract. Developing a general theory of cognition based on formal notions of information remains a long-term goal. One means of making incremental progress toward this goal is to analyze core cognitive capacities to determine whether they can be explained by reference to information. Chunking is one of the most general and least understood phenomena in human cognition. George Miller described chunking as "a process of organizing or grouping the input into familiar units or chunks." The psychological literature describes chunking in many experimental situations, but it says nothing about the intrinsic, mathematical properties of chunks. The cognitive science literature discusses algorithms for forming chunks, each of which provides a kind of explanation of why some chunks rather than others are formed, but there are no explanations of what these algorithms, and thus the chunks they find, have in common. We argue that chunks share a common information-theoretic signature. This signature is defined in terms of the basic measure of information content, entropy: chunks have low conditional entropy internally, and high conditional entropy at the boundaries. We explain this chunk signature and examine several lines of evidence that support this information-theoretic view of chunks. The first is that algorithms built to find chunks based on this signature (or very similar signatures) are quite successful at chunking real-world data. The second is that real chunks, such as words in natural language, appear to be nearly optimally constructed with respect to this signature. Empirical studies also suggest that children, even infants, do actually possess such a chunking ability.
All of this evidence supports the view that chunks can be defined by an information-theoretic signature, and that a general chunking ability based on this signature provides a good explanation for this core cognitive ability.

1. Introduction

Developing a general theory of cognition based on formal notions of information remains a long-term goal. One means of making incremental progress toward this goal is to analyze core cognitive capacities to determine whether they can be explained by reference to information. Chunking is one of the most general and least understood phenomena in human cognition. George Miller described chunking as "a process of organizing or grouping the input into familiar units or chunks." Other than being "what short term memory can hold 7 +/- 2 of," chunks appear to be incommensurate in most other respects. Miller himself was perplexed because the information content of chunks is so different. A telephone number, which may be two or three chunks long, is very different from a chessboard, which may also contain just a few chunks but is vastly more complex. Chunks contain other chunks, further obscuring their information content. The psychological literature describes chunking in many experimental situations, but it says nothing about the intrinsic, mathematical properties of chunks. The cognitive science literature discusses algorithms for forming chunks, each of which provides a kind of explanation of why some chunks rather than others are formed, but there are no explanations of what these algorithms, and thus the chunks they find, have in common.

We argue that chunks share a common information-theoretic signature. This signature is defined in terms of the basic measure of information content, entropy. Entropy measures the average amount of information required to communicate the outcome of a random variable. For example, the entropy of a toss of a fair six-sided die is much higher than that of a loaded one. In entropic terms, the chunk signature is simple: chunks have low conditional entropy internally, and high conditional entropy at the boundaries. For example, given the sequence "victo", the conditional entropy of the next letter in the chunk is low (it is probably an 'r'), but given the letters in the chunk "victory", the conditional entropy of the neighboring letters is high. This relationship between predictability and the boundaries of words was noticed as early as 1948 by Claude Shannon.

2. Supporting Evidence

There are several lines of evidence that support this information-theoretic view of chunks. The first is that algorithms built to find chunks based on this signature (or very similar signatures) are quite successful at chunking real-world data. Several such algorithms have been developed independently of one another in the fields of computational linguistics and artificial intelligence, adhering to the chunk signature with varying degrees of fidelity. Perhaps the fullest implementation is that of the Voting Experts algorithm originally developed by Cohen and Adams. Variants of this algorithm that add bootstrapping (the ability to feed information about chunks already discovered back into the algorithm's decision-making process) represent the highest levels of performance in the literature on a common benchmark of unsupervised chunking ability. Interestingly, this benchmark involves finding words in a corpus of transcribed child-directed speech from the CHILDES project. However, performance of the Voting Experts family of algorithms is not restricted to child language data, as these algorithms also perform well at finding words in diverse languages with different writing systems, finding episodes in sequences of robot actions, finding letters on a printed page by analyzing columns of pixels, and finding teaching episode boundaries in the instruction of an AI student.
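The following minimal Python sketch illustrates the chunk signature itself rather than any published algorithm (it is not Voting Experts; the toy corpus, the context length and the peak rule are all illustrative assumptions): it estimates the conditional entropy of the next character given a short preceding context, and proposes boundaries where that entropy peaks.

# Minimal sketch of the entropy-based chunk signature (not the Voting Experts
# algorithm): estimate H(next char | preceding context) from counts, and
# look for peaks, which the signature predicts at chunk (word) boundaries.
from collections import Counter, defaultdict
import math

def boundary_entropies(text, order=2):
    """For each position, entropy of the next char given `order` preceding chars."""
    follow = defaultdict(Counter)
    for i in range(len(text) - order):
        follow[text[i:i + order]][text[i + order]] += 1

    def H(ctx):
        counts = follow[ctx]
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    return [H(text[i:i + order]) for i in range(len(text) - order)]

if __name__ == "__main__":
    corpus = "thedogsawthecatthecatsawthedog" * 20  # unsegmented toy stream
    ent = boundary_entropies(corpus)
    # Positions where entropy exceeds both neighbours are boundary candidates.
    peaks = [i + 2 for i in range(1, len(ent) - 1)
             if ent[i] > ent[i - 1] and ent[i] > ent[i + 1]]
    print(peaks[:10])  # indices where a chunk boundary is predicted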
While this evidence suggests that algorithms searching for the chunk signature very often recover correct chunks, it does not fully establish the correspondence between the chunk signature and real chunks. The question remains whether real chunks are optimal with respect to this signature. Put more simply, out of all the possible chunks that could be formed based on some data, are the true chunks the "chunkiest"? This question is difficult to evaluate because it requires enumerating an exponential number of possible ways to chunk a given sequence. However, for short sequences, it is possible to fully test this proposition. We developed a chunkiness score that combines the internal entropy and the boundary entropy into a single number. For each 5-word sequence in a corpus of child-directed speech, we generated all possible segmentations and ranked each one according to the chunkiness score. The true segmentation ranked in the 98.7th percentile on average. Preliminarily, it appears that syntax is the primary reason that the true segmentation is not higher in the ranking: when the word order in the training corpus is scrambled, the true segmentation is in the 99.6th percentile. Still, based on these early results we can say that, in at least one domain, true chunks are nearly optimal with respect to the information-theoretic chunkiness score.
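The abstract does not give the authors' exact chunkiness formula, so the sketch below combines internal and boundary entropy in one plausible way (per-segment boundary entropy minus mean internal entropy, averaged over segments) and ranks every segmentation of a short toy sequence; the entropy profile is invented for illustration.

# Sketch of the segmentation-ranking idea. The combination formula and the
# entropy profile are assumptions of this sketch, not the authors' own.
from itertools import product

def all_segmentations(seq):
    """Yield every way to cut `seq` at the gaps between its symbols."""
    for cuts in product([False, True], repeat=len(seq) - 1):
        segs, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:
                segs.append(seq[start:i])
                start = i
        segs.append(seq[start:])
        yield segs

def chunkiness(segs, H):
    """H maps a position in the flattened sequence to conditional entropy
    (e.g. estimated as in the boundary_entropies sketch above)."""
    total, pos = 0.0, 0
    for seg in segs:
        internal = [H(pos + k) for k in range(1, len(seg))]
        inside = sum(internal) / len(internal) if internal else 0.0
        pos += len(seg)
        total += H(pos) - inside   # high boundary, low internal entropy: good
    return total / len(segs)

if __name__ == "__main__":
    # Invented entropy profile over "abcde": low inside "abc" and "de",
    # high at their edges.
    profile = {0: 2.0, 1: 0.2, 2: 0.1, 3: 1.9, 4: 0.3, 5: 2.1}
    H = lambda i: profile.get(i, 2.0)
    ranked = sorted(all_segmentations("abcde"),
                    key=lambda s: chunkiness(s, H), reverse=True)
    print(ranked[0])  # recovers ['abc', 'de'] as the chunkiest segmentation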
Empirical studies also suggest that children, even infants, do actually possess such a chunking ability. Saffran, Aslin, and Newport famously demonstrated that 8-month-old infants can correctly identify artificial words in a continuous speech stream. Importantly, this speech stream did not contain pauses around sentences or phrases as natural speech often does. This means that infants must be relying on some sort of chunking ability to discover these words in the stream. Saffran et al. proposed a very simple chunking heuristic that was sufficient for their task, but fails at finding words in natural languages and other non-linguistic chunking tasks. In our view, positing such a weak ability is not parsimonious, because it would require the children to also have a second, more powerful ability for other chunking tasks, even other linguistic tasks. By contrast, with a single chunking ability based on the signature of chunks, children could perform the task presented by Saffran et al. as well as many others. It is also worth noting that Hauser, Newport, and Aslin later showed that cotton-top tamarins can perform a very similar task, suggesting that the underlying ability may be shared with other non-human primates.

3. Conclusion

All of this evidence supports the view that chunks can be defined by an information-theoretic signature, and that a general chunking ability based on this signature provides a good explanation for this core cognitive ability.

THE DYNAMISM OF INFORMATION ACCESS FOR A MOBILE AGENT IN A DYNAMIC SETTING AND SOME OF ITS IMPLICATIONS

LARS-ERIK JANLERT
Umeå University
lej@cs.umu.se

Given the definition of informational distance as the time it takes to satisfy a request for information (Janlert, 2006a), it follows that these distances, the latencies of information satisfaction, will depend on the location of the information-seeking agent as well as the location of the various resources available for satisfying requests for information. That also means that changes in the agent's location, as well as changes in the location of information resources in the environment of the agent, will dynamically affect the agent's information availability profile (Janlert, 2006a), the spectrum of informational distances for the complete range of possible information requests. This paper will start to investigate the implications this may have for the possibility of outlining the informational boundaries of the agent, separating agent from world in informational terms, and for the possibilities of strategic relocations of agent and informational resources. To do this, a model of the agent–world relationship is outlined and used, more general and considerably more abstract than the examples of actual "natural" agent–world relationships found in this world, starting from a characterization that is as complete as possible in informational terms: the world is basically a database from which the agent gets information and in which the agent sets information. It turns out that it is possible to define the existential extension of an agent in informational terms in a way that at least starts to make some sense in the real world: the informational boundary. The issue of agent identity may then be approached along the lines of Nozick's closest-continuer theory. Finally, the importance of proximity as a cue to contextual relevance for situated activity in general is transformed or translated into informational terms, to appear as a relevant principle in getting as well as in setting information. Issues of accuracy and reliability of (purported) information will be bracketed off in this paper, but basically "information" is taken to exclude "misinformation."

1. The world as a database

In this model, we have an agent in an environment, a (or the) world. The agent is part of the environment, but other than that nothing is assumed about its structure and extent or what drives it. What the agent does is two things (which may in the end turn out to be one and the same thing at a certain level of abstraction). Firstly, it requests and gets information from the world. The world is considered to be a (dynamic) repository of information from the agent's point of view: all it ever gets is information from it and about it. In our use of the model we may of course consider any kind of implementation (model) satisfying the constraints of the agent's interactions. Secondly, and this is in order to make the model as purely informationally based and symmetric as possible, the agent also sets information into the world. Thus, the agent gets as well as sets information. That is the general model. Such worlds could of course be very different, but let us assume for the current exercise that the world of the model by and large matches our own real world at a slightly less abstract level. Setting or getting information can be viewed as a matter of direction of fit.
Getting information can be understood in terms of retrieving, computing, measuring, observing etc., and any combination of such processes, which are partly initiated and performed by the agent (Janlert, 2006b). Setting information means to make something the case, to make the world deliver certain information. Getting information is often thought of as a non-intervening process supposed to leave the world untouched, whereas setting information, making something the case, is usually thought of as doing some measure of violence to the world, forcing it to change. But generally in this world you can't get information without setting some information in the process, and you can't set information without getting some information in the process. Situated existence in this model becomes a kind of information management; we are already living in an informational world, if you will. This whole approach could in itself perhaps be viewed as an analysis in the style of Carnap (1961); it has certainly been inspired by it.

2. Informational boundary of an agent

Given an agent that moves, it will be possible to differentiate between information that is moved "along with" the agent, identifiable as information that is reasonably close and whose distance does not vary much during movement, and information that is not. (The size of changes should be understood as relative, in proportion to the whole distance.) Information that moves along with the agent in this sense is considered to be within its (current) informational boundary; other information is considered to be on the outside. For information that does not move along, that is, information external to the informational boundary, it is also interesting to single out the information that is far off, far away at the information horizon of the agent, and whose distance remains fairly constant during the movement of the agent. It will appear as a quite stable background. What remains will then be information that is close to "midrange" and changes significantly during movement: proximal external information.

3. Proximity principle applied to the informational world

Things that are close tend to matter; things that matter tend to be(come) close (Janlert, 2003). For an agent situated in an environment this means roughly: (1) that an object close to the agent has a better chance of getting the agent's attention and figuring in the agent's activities; (2) that an object that matters to the agent's activities is more likely to already be, or soon become, within close range (partly due to the agent's own doings). In the world-as-database model this translates to the following rule of thumb for proximal external information: information that is close to the agent has a better chance of being got by the agent and playing a role in the agent's activities; information that matters to the agent's activities is more likely to be or become close to the agent.
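As a toy operationalization of the three-way classification of Section 2 (entirely my own construction, not Janlert's; the items, latency figures and thresholds are invented), an agent might classify information items by the size and relative stability of their informational distances as it moves:

# Toy sketch of the informational-boundary classification: track the
# informational distance (latency) of each item while the agent moves, then
# classify by how large and how stable that distance is. Thresholds arbitrary.
from statistics import mean, pstdev

def classify(distance_series, near=2.0, far=20.0, stable=0.25):
    """distance_series: latencies observed for one item during movement.
    Variation is measured relative to the item's mean distance."""
    m = mean(distance_series)
    rel_var = pstdev(distance_series) / m if m else 0.0
    if m <= near and rel_var <= stable:
        return "within informational boundary"   # moves along with the agent
    if m >= far and rel_var <= stable:
        return "information horizon"             # stable distant background
    return "proximal external information"       # midrange, varies with movement

if __name__ == "__main__":
    items = {
        "own memory":       [0.1, 0.1, 0.1, 0.1],
        "nearby signpost":  [3.0, 1.5, 6.0, 9.0],
        "national archive": [24.0, 25.0, 24.5, 25.5],
    }
    for name, series in items.items():
        print(f"{name}: {classify(series)}")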
References

Janlert, L. E. (2003). Contextual strategies – notes for a theory of context. Technical report UMINF 02.23, Umeå University. ISSN 0348-0542.
Janlert, L. E. (2006a). Available information – preparatory note for a theory of information space. tripleC, 4(2). ISSN 1726-670X.
Janlert, L. E. (2006b). Information at a distance. In: Proceedings of iC&P 2006 (Int. Conf. on Computers & Philosophy), Laval, May 2006.
Carnap, R. (1961). Der logische Aufbau der Welt. Hamburg: Felix Meiner Verlag. First edition appeared in 1928.

CONTEXTUAL INFORMATION
Modeling Different Interpretations of the Same Data within a Geometric Framework

KIRSTY KITTO
Faculty of Science and Technology
Queensland University of Technology
Brisbane, 4001, Australia

Abstract. Semantic Information has provided an elegant set of approaches that allow us to ground information with respect to its Context, Level of Abstraction and Purpose. Interestingly, computer science also has a history of considering context and attempting to incorporate it into fields such as Artificial Intelligence, Ubiquitous Computing, Information Systems design etc. These fields generally treat context as an unknown parameter, which tends to be insufficient when it comes to the modeling of cognition. This paper draws attention to a class of contextuality that arises from "knowing too differently" rather than "too little", and discusses the manner in which this new class is likely to be of increasing importance to the modeling of socio-technical and environmental systems. A new geometric model is discussed which incorporates context at its core. Thus, this paper presents an approach that might be used to ground the truth of statements within a relevant context. Such models make explicit the manner in which context can affect the interpretation of information, and can both consistently explain, and allow us to model, an important class of social phenomena. The model will be discussed with reference to both push polling and the climate change debate.

1. Information in Context

Semantic Information (Floridi, 2011) has provided an elegant set of approaches that allow us to ground information with respect to its Context, Level of Abstraction and Purpose, which has in turn allowed Floridi to develop a number of theories about truth, relevance, the logic of being informed etc. (Floridi, 2011). However, little work has been presented as to how this theory could correspond to the humans to whom it generally refers and, perhaps most importantly, to their aggregate behavior in e.g. elections, social movements and crises. Semantic Information has the potential to shed some light upon the responses exhibited by individuals to many of the complex information environments that surround them, but realistic models will be required before this can be achieved. While it is relatively easy to determine if the beer is in the fridge (or not), recent public debates on climate change, water management, consumer spending habits in the wake of the global financial crisis etc. have all served to emphasize the manner in which different sections of a community might ascribe very different values to statements generated from highly similar sets of data. The interpretation that should be attached to information is frequently the subject of vigorous debate, in which context tends to play a fundamental and highly complex role. This situation is recognized somewhat in Floridi's (2011) discussion of semantic truth; however, the manner in which such a conception might be worked into the computational modeling of social dynamics is yet to be considered. As scientists attempt to construct increasingly sophisticated climate, water and sociopolitical models, it has become essential that we consider the manner in which humans respond to complex sets of information and data. This paper will discuss a sophisticated agent-based model (ABM) of human decision making in context that is currently in development. This model took inspiration from the work of Brugnach et al. (2008),
who contrasted "knowing too little", a concept already extensively discussed in the computational literature (Akman & Surav, 1996; Brézillon, 1999), with "knowing too differently", a concept which is yet to be incorporated into the computational paradigm. To "know too differently" implies a contextual dependency of knowledge, which must be accounted for in models of human behavior. Taking a situation of water shortage as an example, it is frequently the case that a number of different framings can be provided. This results in the attribution of different interpretations to the situation, each potentially requiring a different response; how should a government react? A farmer will be concerned with "insufficient supply", while environmentalists might approach the water system thinking that the problem is one of "excessive consumption" (Brugnach et al., 2008). Both contexts have led to claims that are justified, but the two interpretations are incompatible, in that they apparently require different actions from policy makers.

Figure 1. The changing context of a decision. The probability of choosing a particular course of action changes between contexts p and q.

While relativistic arguments have a somewhat dubious reputation in pure philosophy, it is becoming increasingly important that we recognize the role context plays in the modeling of human responses to information and, in particular, in the decisions that humans make in utilizing this information. For example, when presented with the same set of information, different individuals might draw very different conclusions as to its consequences, and this can in turn lead to markedly different actions.

The manner in which the new model represents context is geometrical, and can be quickly explained with reference to the simple example illustrated in Figure 1. Here, we have represented the current state, A, of an agent (we shall call her Alice) with respect to two different contexts p and q. In this case, the state of our agent has been chosen to correspond to her projected response to a binary question, e.g. will you vote for candidate X in the coming election? A connection to probability is generated by assuming that the length of the state A is equal to 1, which means that the probabilities of Alice responding with a "yes" or "no" are given by the Pythagoras theorem in a particular context. Thus, writing \( A = a_{\mathrm{yes}}\,\mathbf{e}_{\mathrm{yes}} + a_{\mathrm{no}}\,\mathbf{e}_{\mathrm{no}} \) in the basis associated with a given context,

\[ P(\mathrm{yes}) = a_{\mathrm{yes}}^2, \qquad P(\mathrm{no}) = a_{\mathrm{no}}^2, \qquad a_{\mathrm{yes}}^2 + a_{\mathrm{no}}^2 = |A|^2 = 1. \tag{1} \]

With reference to Figure 1, it can quickly be seen that the probability of Alice responding with "yes" will be markedly different between the two contexts; while she has a higher probability of responding with "yes" in context p, she has a higher probability of responding with "no" to the same question in context q (this is given by a quick inspection of the lengths of the components making up a right-angled triangle with hypotenuse equal to the state A).
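A small numerical sketch of equation (1) (illustrative only; the state and context angles are invented) shows how one and the same unit-length state yields different response probabilities in two contexts:

# Numerical sketch of the geometric context model: the same unit state A
# projects differently onto the yes/no bases of two contexts p and q.
import numpy as np

def basis(theta):
    """Orthonormal yes/no basis rotated by `theta` radians."""
    yes = np.array([np.cos(theta), np.sin(theta)])
    no = np.array([-np.sin(theta), np.cos(theta)])
    return yes, no

def response_probs(state, theta):
    """Pythagoras: squared projections of a unit state give probabilities."""
    yes, no = basis(theta)
    return float(np.dot(state, yes) ** 2), float(np.dot(state, no) ** 2)

if __name__ == "__main__":
    A = np.array([np.cos(0.3), np.sin(0.3)])      # Alice's unit-length state
    for name, theta in [("context p", 0.1), ("context q", 1.2)]:
        p_yes, p_no = response_probs(A, theta)
        print(f"{name}: P(yes)={p_yes:.2f}, P(no)={p_no:.2f}")  # each pair sums to 1

With these made-up angles, Alice answers "yes" with high probability in context p but with low probability in context q, which is exactly the context effect the figure describes.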
This geometric model of decision making in context bears a remarkable resemblance to the geometric probability utilised in quantum theory (Isham, 1995), and indeed this similarity is further developed in a number of recent contextual models of, for example, decision making (Busemeyer et al., 2011), word recognition and recall (Bruza et al., 2009), concept combination (Aerts & Gabora, 2005) and information retrieval (Van Rijsbergen, 2004). The general framework of these models will be discussed, and the novel manner in which they incorporate context into the modeling of a state of affairs highlighted. In particular, this paper will highlight the way in which explicitly considering contextual factors in a model allows for a recognition of different points of view and frames without lapsing too deeply into relativism. While some notion of truth can be understood to exist in this model, the context in which a set of facts is presented can profoundly influence the interpretation that an agent would attribute to them.

Acknowledgements

Supported by the Australian Research Council Discovery grant DP1094974.

References

Aerts, D. & Gabora, L. (2005). A theory of concepts and their combinations I: the structure of the sets of contexts and properties. Kybernetes, 34, 151-175.
Akman, V. & Surav, M. (1996). Steps toward Formalizing Context. AI Magazine, 17(3), 55-72.
Brézillon, P. (1999). Context in problem solving: a survey. Knowledge Engineering Review, 14, 47-80.
Brugnach, M., Dewulf, A., Pahl-Wostl, C. & Taillieu, T. (2008). Toward a relational concept of uncertainty: about knowing too little, knowing too differently, and accepting not to know. Ecology and Society, 13(2), 30.
Bruza, P., Kitto, K., Nelson, D. & McEvoy, C. (2009). Is there something quantum-like about the human mental lexicon? Journal of Mathematical Psychology, 53, 362-377.
Busemeyer, J. R., Pothos, E. & Franco, R. (2011). A quantum theoretical explanation for probability judgment 'errors'. Psychological Review. In press.
Floridi, L. (2011). The Philosophy of Information. Oxford University Press.
Fox, J. S. (1997). Push Polling: The Art of Political Persuasion. Florida Law Review, 49, 563.
Isham, C. J. (1995). Lectures on Quantum Theory. London: Imperial College Press.
Van Rijsbergen, C. (2004). The Geometry of Information Retrieval. Cambridge University Press.

COGNITION AS MANAGEMENT OF MEANINGFUL INFORMATION: PROPOSAL FOR AN EVOLUTIONARY APPROACH

CHRISTOPHE MENANT

Extended Abstract

Humans are cognitive entities. Our behaviors and ongoing interactions with the environment are threaded with creations and usages of meaningful information, be they conscious or unconscious. Animal life is also populated with meaningful information related to the survival of the individual and of the species. The meaningfulness of information managed by artificial agents can also be considered a reality once we accept that the meanings managed by an artificial agent are derived from what we, the cognitive designers, have built the agent for. This rapid overview brings us to consider that cognition, in terms of management of meaningful information, can be looked at as a reality for animals, humans and robots. But it is pretty clear that the corresponding meanings will be very different in nature and content. Free will and self-consciousness are key drivers in the management of human meanings, but they do not exist for animals or robots. Also, staying alive is a constraint that we share with animals; robots do not carry that constraint. Such differences in meaningful information and cognition for animals, humans and robots could bring us to believe that the analysis of cognition for these three types of agents has to be done separately.
But if we agree that humans are the result of the evolution of life and that robots are a product of human activities, we can then look at addressing the possibility of an evolutionary approach to cognition based on meaningful information management. A bottom-up path would begin with meaning management within basic living entities, then climb the ladder of evolution up to us humans, and continue with artificial agents. This is what we propose to present here: an evolutionary approach to cognition, based on meaning management, using a simple systemic tool. We use for that an existing systemic approach to meaning generation where a system submitted to a constraint generates meaningful information (a meaning) that will initiate an action in order to satisfy the constraint (Menant, 2003; 2010a). The action can be physical, mental or other. This systemic approach defines a Meaning Generator System (MGS). The simplicity of the MGS makes it available as a building block for meaning management in animals, humans and robots. Contrary to approaches to meaning generation in psychology or linguistics, the MGS approach is not based on the human mind. To avoid circularity, an evolutionary approach has to be careful not to include components of the human mind in the starting point.

The MGS receives information from its environment and compares it with its constraint. The generated meaning is the connection existing between the received information and the constraint. The generated meaning is to trigger an action aimed at satisfying the constraint. The action will modify the environment, and with it the generated meaning. Meaning generation links agents to their environments in a dynamic mode. The MGS approach is triadic, of a Peircean type. The systemic approach allows wide usage of the MGS: a system is a set of elements linked by a set of relations. Any system submitted to a constraint and capable of receiving information from its environment can lead to an MGS. Meaning generation can be applied to many cases, assuming we identify clearly enough the systems and the constraints. Animals, humans and robots are then agents containing MGSs. Similar MGSs carrying different constraints will generate different meanings. Cognition is system dependent.
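The following toy sketch (my own illustration; the "glucose" variable, the setpoint and the action rule are invented, not Menant's) renders the MGS loop just described in code: information is received, compared with the constraint, a meaning (the connection between the two) is generated, and an action modifies the environment and hence subsequent meanings.

# Toy Meaning Generator System (MGS): a system under a constraint receives
# information, generates a meaning as the connection between that information
# and the constraint, and triggers an action aimed at satisfying the
# constraint. All quantities below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Meaning:
    received: float          # incoming information (e.g. a sensed level)
    constraint: str
    gap: float               # the connection between information and constraint

class MGS:
    def __init__(self, constraint, setpoint):
        self.constraint, self.setpoint = constraint, setpoint

    def generate_meaning(self, info):
        """Meaning = the connection between received info and the constraint."""
        return Meaning(info, self.constraint, self.setpoint - info)

    def act(self, meaning, environment):
        """Action aimed at satisfying the constraint; it modifies the
        environment, which in turn changes future meanings."""
        if meaning.gap > 0:
            environment["glucose"] += meaning.gap * 0.5  # seek food, partially
        return environment

if __name__ == "__main__":
    env = {"glucose": 2.0}
    mgs = MGS(constraint="stay alive", setpoint=5.0)
    for step in range(3):
        m = mgs.generate_meaning(env["glucose"])
        env = mgs.act(m, env)
        print(f"step {step}: gap={m.gap:.2f}, glucose={env['glucose']:.2f}")

Running the loop shows the dynamic mode described above: each action changes the environment, so the meaning generated at the next step is different.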
We first apply the MGS approach to animals with "stay alive" and "group life" constraints. Such constraints can bring us to model many cases of meaning generation and action in the organic world. However, it is to be highlighted that even if the functions and characteristics of life are well known, the nature of life is not really understood. Final causes are difficult to integrate into today's science. So analyzing meaning and cognition in living entities will have to take into account our limited understanding of the nature of life. Ongoing research on concepts like autopoiesis could bring a better understanding of the nature of life (Weber & Varela, 2002).

We next address meaning generation for humans. This case is the most difficult, as the nature of the human mind is a mystery for today's science and philosophy. The natures of our feelings, free will or self-consciousness are unknown. Human constraints, meanings and cognition are difficult to define. Any usage of the MGS approach for humans will have to take into account the limitations that result from the unknown nature of the human mind. We will however present some possible approaches to identify human constraints where the MGS brings some openings in an evolutionary approach (Menant, 2010b; 2010c). But it is clear that the better the human mind is understood, the better placed we will be to address meaning management and cognition for humans. Ongoing research activities relative to the nature of the human mind cover many scientific and philosophical domains (Philpapers, Philosophy of Mind).

The case of meaning management and cognition in artificial agents is rather straightforward with the MGS approach, as we, the designers, know the agents and the constraints. In addition, our evolutionary approach leads us to position notions like artificial constraints, meaning and autonomy as derived from their animal or human source. We also highlight that cognition as management of meaningful information by agents goes beyond information and needs to address representations, which belong to the central hypothesis of cognitive science. We define the meaningful representation of an item for an agent as the network of meanings relative to the item for the agent, together with the action scenarios involving the item. Such meaningful representations embed the agents in their environments and are far from the GOFAI-type ones (Menant, 2010b). Meanings, representations and cognition exist by and for the agents. We finish by summarizing the points presented and highlighting some possible continuations.

References

Menant, C. (2003). Information and Meaning. Entropy, 5, 193-204. ISSN 1099-4300. (http://cogprints.org/3694/)
Menant, C. (2010a). Introduction to a Systemic Theory of Meaning (short paper). http://crmenant.free.fr/ResUK/MGS.pdf
Menant, C. (2010b). Computation on Information, Meaning and Representations: An Evolutionary Approach. In: Dodig Crnkovic, G. & Burgin, M. (Eds.), Information and Computation: Essays on Scientific and Philosophical Understanding of Foundations of Information and Computation. World Scientific Series in Information Studies, Vol. 2. (http://www.idt.mdh.se/ECAP-2005/INFOCOMPBOOK/CHAPTERS/10Menant.pdf)
Menant, C. (2010c). Proposal for a shared evolutionary nature of language and consciousness. http://cogprints.org/7067/
Philpapers. Philosophy of Mind. http://philpapers.org/browse/philosophy-of-mind
Weber, A. & Varela, F. (2002). Life after Kant: Natural purposes and the autopoietic foundations of biological individuality. Phenomenology and the Cognitive Sciences, 1, 97-125.

COMPUTATIONAL AND HUMAN MIND MODELS

FRANCISCO HERNÁNDEZ-QUIROZ
UNAM, Departamento de Matemáticas, Facultad de Ciencias
Ciudad Universitaria, C.P. 04510, D.F., MEXICO

Abstract. Computational models of the human mind have been the subject of a heated debate since Turing's seminal paper of 1950. Some opponents of so-called Strong AI have postulated alternative mechanisms based on one or another form of hypercomputation. Although specific arguments can be (and have been) raised against the possibility of hypercomputation, a different approach is possible: accept the possibility of human cognitive abilities beyond the reach of Turing Machines (TMs) and then face the problem of postulating appropriate physical mechanisms underlying these hypercomputing abilities. The result can lead to difficulties as hard as those faced by Strong AI in the first place, reducing the allure of the hypercomputing alternatives.
1. Introduction

In his celebrated paper of 1950, Turing advanced the then daring proposal of machines able to emulate the human mind. Those machines were the practical realization of the model he had introduced earlier, in 1936-7. Turing's formulation is careful to avoid the categorical statement that the human mind can be emulated by a Turing Machine because it is itself a Turing Machine. However, successive computer scientists have reprised Turing's proposal without his caveats. An extreme and idealized version of this point of view is known as Strong Artificial Intelligence (Searle, 1984).

2. An Objection to Artificial Intelligence

The thesis that the human mind can be modelled by Turing Machines has been attacked by many people. A common line of attack goes like this:
• Strong AI claims the human mind can be modelled by Turing Machines.
• Turing Machines suffer internal limitations that surface in theorems due to Turing himself, Rice and even Gödel.
• But human cognitive abilities go beyond these limitations.
• Ergo, the human mind cannot be modelled by Turing Machines.

This argument has been rejected by many authors (Feferman, 1996; Chalmers, 1995). But this paper will take a different approach: what happens if we accept that the human mind cannot be modelled by a Turing Machine? What type of mechanism is needed instead? What problems arise when such a model is adopted?

3. "Mechanisms" more powerful than computers

There are many candidates for this role. On the one hand, physical systems with properties (supposedly) beyond the restrictions of Turing Machines (Penrose, 1994). On the other hand, mathematical models circumventing those same restrictions: Oracle Turing Machines (Turing, 1939), Analog Neural Networks (Siegelmann, 1999), Dynamical Systems (Bournez & Cosnard, 1995), etc. In fact, there is a common core in all these models: (a) they claim to implement some notion of what can intuitively be considered a computational mechanism; (b) simultaneously, they include elements capable of introducing entities that are not Turing computable. They can be gathered under the label of "hypercomputation." Many of those who oppose Strong AI claim that the human cognitive abilities which are not explicable by TMs are in fact based on one or another hypercomputing mechanism.

4. Towards a new scientific research program?

But these mechanisms are also prone to run into trouble. Sieg (2008) has argued convincingly that Turing Machines' limitations are a consequence of the acceptance of two principles: locality and boundedness. The first principle means that a computer can only change immediately recognizable configurations in finite time. The second means that a computer can immediately recognize only a bounded number of configurations (and therefore there exists an upper bound to the amount of information it can handle in finite time). By rejecting TMs as an upper bound to computability, we reject these principles. No need to worry, though, theoretically speaking, if we are only interested in abstract mathematical models. But if the aim is to model or to explain the human mind, and some of its capabilities are attributed to hypercomputing features, then we are asserting implicitly that the human mind (or its physical substratum, if you will) goes beyond the principles of locality and boundedness. One variety of hypercomputation even asserts the possibility of harnessing and manipulating non-computable irrational numbers (Siegelmann, 1999).
And if we want to remain on scientific grounds, we will be pressed to point out the physical counterparts of these theoretical entities and to postulate hypercomputation in Nature. Of course, none of this is impossible, at least in principle. However, our quest for a model of the human mind has led us to pose very basic questions about physical reality that bring with them huge theoretical and practical challenges, which look at least as difficult as the problems faced by the computational models of the human mind. The moral might be that a theoretical alternative is not necessarily a plausible explanation for a natural phenomenon.

References

Bournez, O. & Cosnard, M. (1995). On the computational power and super-Turing capabilities of dynamical systems. Technical report no. 95-30, Laboratoire de l'Informatique du Parallélisme, Ecole Normale Supérieure de Lyon.
Chalmers, D. J. (1995). Minds, Machines, and Mathematics: A Review of Shadows of the Mind by Roger Penrose. Psyche, 2(9).
Feferman, S. (1996). Penrose's Gödelian argument. Psyche, 2, 21-32.
Penrose, R. (1989). The Emperor's New Mind: Concerning Computers, Minds and The Laws of Physics. Oxford University Press.
Penrose, R. (1994). Shadows of the Mind: A Search for the Missing Science of Consciousness. Oxford University Press.
Searle, J. (1984). Minds, Brains and Science. Cambridge: Harvard University Press.
Sieg, W. (2008). Church Without Dogma: axioms for computability. In: B. Löwe, A. Sorbi & B. Cooper (Eds.), New Computational Paradigms (pp. 139-152). Springer Verlag.
Siegelmann, H. T. (1999). Neural Networks and Analog Computation: Beyond the Turing Limit. Birkhäuser, Progress in Theoretical Computer Science.
Turing, A. M. (1936-7). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 42, 230-265.
Turing, A. M. (1939). Systems of Logic Based on Ordinals. Proceedings of the London Mathematical Society, Series 2, 45, 161-228.
Turing, A. M. (1950). Computing Machinery and Intelligence. Mind, 59, 433-460.

SEMANTICS OF INFORMATION
Meaning and Truth as Relationships between Information Carriers

MARCIN J. SCHROEDER
Akita International University
Akita, Japan

Abstract. The meaning of information was openly dismissed from the interests of information theory already by Shannon, but the fiasco of the early attempt by Bar-Hillel and Carnap to develop a semantic theory of information was even more discouraging. They developed their theory of semantic information using as a starting point the already existing logical structure of language, not recognizing the fact that language is a very special information system and that the logic of information should be built before its semantic theory. The philosophical concept of meaning has for centuries been associated with the medieval scholastic concept of intentionality, the pointing by a symbol at an intended object, identified by Brentano and his followers as the primary characteristic of mental acts. None of the attempts to eliminate the psychologism of intentionality removed the primary source of philosophical problems, which has always been the fact that semantics requires crossing the border between different ontological entities. This difficulty could not be resolved within the philosophy of language, as at this level the difference between linguistic items and the entities to which they refer cannot be ignored.
The relationship between a symbol and its meaning does not require a separation of ontological status when the meaning is understood as a relationship between information in two different information carriers, that of the symbol and that of the denotation. In the present paper, both symbol and object are described in terms of information integration. Every entity is characterized through the integrated part of its information, constituting its identity, and the non-integrated part, interpreted as its state. The correspondence of identities, i.e. of the integrated parts of information, is here identified as the meaning; the correspondence between states, i.e. the non-integrated parts of information, is identified as the truth.

1. Sources of Problems in Semantics of Information

Difficulties in the development of the semantics of information are in part inherited from linguistic semantics, but some of them have their sources in the circumstances in which information theory was born. The meaning of meaning has always been an elusive subject. Ogden and Richards (1923/1989), in their widely read study of this concept, considered its sixteen basic meanings. The philosophical concept of meaning has for centuries been associated with the medieval scholastic concept of intentionality, the pointing by a symbol at an intended object. Brentano identified intention or "aboutness" as the fundamental characteristic of mental capacity. The logical approach initiated by Frege and developed by Church was an attempt to eliminate the psychological aspects of meaning by making a distinction between denotation and sense, and by focusing on the rules reducing the sense of compound expressions to that of simple ones. However, the shift of attention to the mutual relationships between expressions of a language at different levels of complexity does not help us understand the relationship between simple signs and their denotations, to which the process of reduction leads. Under the influence of logical positivism, Carnap attempted to resolve this issue in the context of scientific methodology by invoking the idea of empirical sense, reducing the criteria of the relationship to empirical procedures. The approach initiated by Peirce, whose original writings preceded most of the contemporary work on the concept of meaning, was also intended as a way to eliminate the necessity of involving a human subject in semiosis. In his approach, sign and object are accompanied by an interpretant, itself of the type of a sign. Being a sign, the interpretant may enter into another triadic relation with its own object and interpretant. Its role is to build a connection between sign and object which does not require the involvement of a human being. This approach leaves the question of the traditional relationship between the sign and its meaning open-ended, but it hardly explains it, especially when the sign has a different ontological status from that of the object. As in the logical approach, we have here an extension of the study towards a complex structure of signs or names, but the basic relationship between the object and the sign is left in the shadow. No wonder that the issue of the meaning of information has been dismissed from the subject of information theory so easily.
Shannon's disclaimer "These semantic aspects of communication are irrelevant to the engineering problem" (Shannon & Weaver, 1949/1998) has been followed by the majority of information theorists, such as Cherry (1951/1952): "It is important to emphasize, at the start, that we are not concerned with the meaning or the truth of messages; semantics lies outside the scope of mathematical information theory." After all, the measure of information was defined for one letter or character of a message, which does not carry any meaning. The measure for the entire message was simply the sum of the measures for its characters. The fiasco of the early attempts to develop a semantic theory of information, such as the most advanced attempt by Bar-Hillel and Carnap (1952), sealed the fate of the study of the semantics of information. Bar-Hillel and Carnap developed their theory of semantic information using as a starting point the already existing logical structure of the language. They did not take into account that language is a very special information system and that a more general logic of information should be built before its semantic theory.

2. Semantics as Relationship between Information Carriers

Bar-Hillel and Carnap (1952) built their measure of semantic information in such a way that it can be reduced to Shannon's entropy in a special case. However, there is a fundamental problem here: whether the measure of information transmitted in the process of communication applies to information carried by some carrier (symbol or object). The present author (Schroeder, 2004) believes that the answer is negative, and that the measure of semantic information should be based on an alternative measure taking into consideration the amount of information carried by symbols, which should be estimated based on the relationship between the information in the symbol and the information in the designate. However, the primary source of the philosophical problems of semantics has always been the requirement of crossing the border between different ontological entities. This difficulty could not be resolved within the philosophy of language, as at this level the difference between linguistic items and the entities to which they refer cannot be ignored. The relationship between a symbol and its meaning does not require a separation of ontological status when the meaning is understood as a relationship between information in two different information carriers, that of the symbol and that of the denotation. In the present paper, both symbol and object are described in terms of information integration (Schroeder, 2009).
On this broad understanding, not only is language a carrier of information; so is every object of our experience. Cognitive processes involve transformations of the selective manifestation of information coming with sensory stimulation into the structural manifestation of information, which in its integrated form constitutes conscious experience. Every entity is characterized through the integrated part of its information, constituting its identity, and through the non-integrated part, interpreted as its state. The correspondence of identities, i.e. of integrated parts of information, is here identified as the meaning; the correspondence between states, i.e. non-integrated parts of information, is identified as the truth.

References

Bar-Hillel, Y. & Carnap, R. (1952/1964). An Outline of a Theory of Semantic Information. Technical Report No. 247, Research Laboratory of Electronics, MIT; reprinted in Bar-Hillel, Y. (1964) Language and Information: Selected Essays on Their Theory and Application. Reading, MA: Addison-Wesley, pp. 221-274.
Cherry, E. C. (1951/1952). A history of the theory of information. Proceedings of the Institute of Electrical Engineers, 98 (III), 383-393; reprinted with minor changes as: The communication of information. American Scientist, 40, 640-664.
Ogden, C. K., Richards, I. A. (1923/1989). The Meaning of Meaning: A Study of the Influence of Language Upon Thought and of the Science of Symbolism. San Diego: A Harvest Book, Harcourt Brace Jovanovich.
Schroeder, M. J. (2004). An Alternative to Entropy in the Measurement of Information. Entropy, 6, 388-412.
Schroeder, M. J. (2005). Philosophical Foundations for the Concept of Information: Selective and Structural Information. In Proceedings of the Third International Conference on the Foundations of Information Science, Paris. http://www.mdpi.org/fis2005.
Schroeder, M. J. (2007). Logico-algebraic structures for information integration in the brain. Proceedings of RIMS 2007 Symposium on Algebra, Languages, and Computation, Kyoto: Kyoto University, pp. 61-72.
Schroeder, M. J. (2009). Quantum Coherence without Quantum Mechanics in Modelling the Unity of Consciousness. In P. Bruza, et al. (Eds.) QI 2009, LNAI 5494, Springer, pp. 97-112.
Shannon, C. E., Weaver, W. (1949/1998). The Mathematical Theory of Communication. Urbana: University of Illinois Press.


PRE-COGNITIVE SEMANTIC INFORMATION

ORLIN VAKARELOV
Department of Philosophy, University of Arizona, Tucson, Arizona, USA
Email: okv@u.arizona.edu

(This talk is based on Vakarelov, 2010.)

Abstract. This talk addresses one of the fundamental problems of the philosophy of information: How does semantic information emerge within the underlying dynamics of the world? This is the dynamical semantic information problem. It is suggested that the canonical approach to semantic information, which defines data before meaning and meaning before use, is inadequate for pre-cognitive information media. Instead, we should follow a pragmatic approach to information, where one defines the notion of an information system as a special kind of purposeful system emerging within the underlying dynamics of the world, and defines semantic information as the currency of the system. In this way, systems operating with semantic information can be viewed as patterns in the dynamics: semantic information is a dynamical-system phenomenon of highly organized systems. In the simplest information systems the syntax, semantics and pragmatics of the information medium are co-defined. The talk proposes a new, more general theory of information semantics that focuses on the interface role of the information states in the information system: the interface theory of meaning.

1. Introduction

I address the following problem: How does semantic information emerge within the underlying dynamics of the world? Let us call this the dynamical semantic information (DSI) problem. It is related to another kind of problem: Can we provide a foundation for cognitive science with the notion of (semantic) information? I claim that it is possible to offer a theory of pre-cognitive semantic information that does not presuppose a notion of cognition or mind. With such a theory, the notion of semantic information can be used in foundational discussions of cognition without circularity. However, I do not plan to address this second problem here. My strategy for addressing DSI is this: Start with a notion of information system as a special kind of autonomous dynamical system interacting with an environment. Describe semantic information as a "currency" of the information system; that is, treat information for the system not as a primitive but as a derived notion, similar to the way currency is a derived notion of an economic system. Take a decomposition approach to analyzing the components of semantic information; that is, regard notions such as data, meaning and source as depicting aspects of informational processes within the information system. Provide a theory of meaning, the interface theory of meaning, for the informational states of an information medium within the information system.

2. Canonical Views of Semantic Information

Most theories of semantic information make the following assumptions: (1) semantic information = data + meaning (+ truthfulness); (2) data is conceptually primary; (3) meaning is secondary and depends on data; (4) pragmatics is tertiary and depends on meaning. In this view, the '+' in the definition of information can be regarded as an amendment operation, where syntax is amended by semantics to obtain a theory of semantic information, and semantics is amended with an account of the use of information to obtain a theory of pragmatic information. I call an approach to semantic information that proceeds in this way an amendment approach. Taking an amendment approach to semantic (and pragmatic) information has no effect on the formal theories of information; however, it affects meta-theoretic judgments about theories of information. In particular, it affects which theories of information are regarded as more general. I argue (defeasibly) that taking the notion of data as conceptually primary (and independent from semantics and pragmatics) leads to an indispensable role for a mind in the specification of semantics. This makes naturalizing semantic information difficult, because the cases where the data system can be defined precisely without semantics or pragmatics are cases where semantics requires an external interpreter. The meta-theoretical judgments about such cases mistakenly conclude that these cases are the most general, and that they therefore offer the most inclusive theory of semantic information.

3. The Pragmatic Approach to Semantic Information

I propose an alternative: I argue for a decomposition approach to information; that is, I argue that in the most general case of semantic information, data, semantics, and pragmatics are codetermined as aspects of an information process. The most general kind of information is pragmatic information; that is, in the most general case, semantic information requires a system that utilizes information in its interaction with an environment. Such a system I call, following Nauta (1970), an information system. The strategy of pragmatic analysis of information is the following: The most basic notion is information system. An information system S is a physical system that is in an active interaction with an external environment and that satisfies a set of conditions that do not presuppose the notion of information. The conditions must guarantee the existence in S of a sub-system, M, that can be interpreted as an information medium. Moreover, the functional role of M in S in relation to the interaction with the environment must be sufficient to define the semantic content of the states of M. According to this strategy, S is an information system not because it operates with meaningful information; conversely, it operates with information because it is an information system. The most important idea is that what counts as data, and what gives the data semantic content, is determined by the role they play in the information system.

4. Information Systems

An information system S is a system that satisfies the following five conditions:

1. S is an open system, i.e. a system that is distinct from its environment but in constant interaction with it.
2. S is a partially isolated open system, i.e. some of the interactions between S and the environment are structured through well-defined, limited channels of influence.
3. S is a purposeful system. That is, there is at least one proper set of goal states, G, that the system "attempts" to be in (or near) by affecting its environment.
4. S contains a sub-system M that can correlate with an external system O, and M can control the behavior of S.
5. S contains a second, distinct sub-system P that filters the states of M and their effect on behavior in relation to its purpose. In other words, P steers the system towards G by modulating the control effect of M.

I argue that all the conditions for an information system can be depicted (in principle) as conditions on dynamical systems. Thus, no mentalistic or cognitive notions are needed to define an information system. I also argue that the conditions are sufficient to justify regarding M as an information medium, with states that can be interpreted as data/information states and as having meaning for the system. The data/information states of M, however, depend on the global dynamics. In particular, they depend on the way P modulates the control function of M and on the states of O (which can be regarded as an information source). However, the states of O and P also depend on the global dynamics. Thus, in the most general information systems all relevant components of the information system are codetermined (except the goal G).

5. Interface Theory of Meaning

In an information system, content is determined neither by the external relation between M and O, nor by the internal role of the states of M in S, but by the interface roles the states of M play in the dynamics of the system. This is the interface theory of meaning for information states in an information system. More traditional theories of semantics, such as correspondence semantics or conceptual role semantics, can be obtained from interface role semantics as aspects of the interface relation. A deliberately crude dynamical toy satisfying something like conditions 1-5 is sketched below.
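The following sketch is our own illustrative addition, not Vakarelov's model: a gradient-following agent whose sensor sub-system M correlates with an external nutrient field O, and whose filter P modulates M's control of movement so as to steer toward the goal region G. Whether such a minimal system genuinely satisfies all five conditions is of course arguable, and all names are hypothetical.

```python
import random

def nutrient(x):          # external system O: a nutrient field over positions
    return -abs(x - 10)   # richest at x = 10 (the goal region G is near here)

def sense(x):             # medium M: a state correlating with O via a channel
    return nutrient(x + 1) - nutrient(x - 1)   # local gradient estimate

def filter_p(m_state):    # filter P: modulates M's control effect toward G
    if m_state > 0:  return +1   # gradient rises to the right: move right
    if m_state < 0:  return -1   # gradient rises to the left: move left
    return random.choice([-1, +1])  # flat: explore

x = 0.0                   # the system S starts far from its goal states
for step in range(30):    # global dynamics: S acts on its environment
    x += filter_p(sense(x))
print(round(x, 1))        # S ends up near 10: M's states steered it there
```

The point of the toy is only that the "data" states of M get whatever content they have from the interface role they play between O, P and the movement of S, not from any external interpreter.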
The interface theory of meaning thus properly generalizes other theories of meaning, which apply only when further conditions are imposed on the information system.

References

Nauta, D. (1970). The Meaning of Information. The Hague: Mouton.
Vakarelov, O. (2010). Pre-cognitive semantic information. Knowledge, Technology & Policy, 23(1), 193-226.


Track III: Autonomous Robots and Artificial Cognitive Systems


WHO WILL HAVE AN IRRESPONSIBLE, UNTRUSTWORTHY, IMMORAL INTELLIGENT ROBOT?
Why Artifactually Intelligent Adaptive Autonomous Agents Need to Be Artifactually Moral

MARGARYTA GEORGIEVA ANOKHINA
School of Innovation, Design and Engineering, Mälardalen University, Sweden
maa05002@student.mdh.se

AND

GORDANA DODIG-CRNKOVIC
School of Innovation, Design and Engineering, Mälardalen University, Sweden
gordana.dodig-crnkovic@mdh.se

Abstract. We argue that there is a natural place for artificial moral agency parallel to artificial intelligence.

1. Extended Abstract

Historically, moral agency was conceptualized in purely anthropocentric terms. Consequently, only humans qualify as moral agents according to the traditional criteria, and no agents other than humans were considered capable of moral agency. We discuss such conventional criteria as mental states, intentionality, autonomy, free will, responsibility, rationality and moral reasoning, and compare human agents with artificial agents (intelligent adaptive learning robots and software agents, present and envisaged in the coming decades). We attempt to understand what shaped the traditional criteria in the past and how technological change is re-shaping the world around us, including what could (and should) be considered a moral agent. We suggest that the conventional approach to moral agency is unable to provide exhaustive criteria for dealing with the moral situations of the contemporary world, which involve techno-social systems with autonomous intelligent agents, both humans and artifacts. We also discuss how morality can be approached in new ways in the case of artificial agents. The argument is made that a human-centric approach to intelligent autonomous machines is inappropriate as a means of controlling the behavior of self-learning artificial agents, and a new proposal is made about how to treat the notion of moral responsibilities in techno-social systems in which intelligent artifacts act autonomously.

In the past, mechanical age of engineering, technological systems were designed to perform specific and limited functions, and they were kept closed, with no access to the outside world (like a robot making car parts, for example). Nowadays systems with artificial intelligence are more complex and sophisticated, and they are starting to be implemented in everyday environments such as people's homes, helping elderly and sick people and serving as companions (the developing field of social robotics). This rapid technological change re-shapes and expands the ways of thinking about agency and morality that we used to have. A machine "talks", "selects", "runs", "reasons", "senses", "plays chess", etc., not in a human way, but we use these words to express the functionality of a machine in familiar terms. Why can't a machine "choose", "decide", "think" or "be responsible"? In a similar way as machines are artifactually intelligent, they can be, and indeed must be, made artifactually moral if we are to rely on them even when they are not under direct control, when they act autonomously. The term "artificial intelligence" reveals the same problem: one had to accept that a machine can behave intelligently even though its intelligence is that of an artifact, not a human intelligence. Similarly, a machine can be made functionally, artificially moral. It may take some effort to find out how to secure morally acceptable behavior in intelligent learning machines, and some researchers suggest it may take as much effort as the development of artificial intelligence itself. But it would be irresponsible to let such machines go among people without morally acceptable behavior according to human standards.

Floridi and Sanders (2004) consider interactivity, autonomy and adaptability at a given level of abstraction as important new criteria for moral agency. Morality in this approach is thought of as "a threshold defined on the observables in the interface". These criteria are related to the criterion of operational environment suggested by Berthier (2006) and that of domain suggested by Foner (1993). This requirement relates to the differences between the domains of interest for moral consideration for human agents and for artificial ones. As humans act and behave in a specific environment, artificial agents do as well, but the conditions are different, and thus probably not all criteria suitable for the human domain are applicable to the operational environment of artificial agents. Both artificial agents and humans need interaction and the ability to adapt to the environment in order to act morally, according to the rules that define moral actions. Coeckelbergh (2009) suggests using the term virtual morality, as robots can exhibit behaviour akin to the behaviour of humans in analogous situations. The aim of the emerging research field of machine ethics (machine morality, artificial morality, or computational ethics), as developed in Anderson and Anderson (2007), Allen, Wallach and Smit (2006) and Moor (2006), is moral decision-making implemented in computers and robots. We discuss the parallels between an artificial agent's possible artifactual moral agency and human agency (see Dodig-Crnkovic and Persson, 2008), their similarities and differences. We argue that there is a natural place for artificial moral agency parallel to artificial intelligence.

References

Floridi, L. and Sanders, J. W. (2004). On the Morality of Artificial Agents. Minds and Machines 14 (3): 349-379.
Berthier, D. (2006). Artificial Agents and their Ontological Status, iC@P 2006: International Conference on Computers and Philosophy, pp. 2-5.
Foner, L. (1993). What's An Agent, Anyway? A Sociological Case Study, available from the Agents Group, MIT Media Lab. http://www.nada.kth.se/kurser//kth/2D1381/JuliaHeavy.pdf, p. 35.
Coeckelbergh, M. (2009). Virtual moral agency, virtual moral responsibility: on the moral significance of the appearance, perception, and performance of artificial agents. AI & Society 24: 188-189.
Anderson, M. and Anderson, S. L. (2007). Machine Ethics: Creating an Ethical Intelligent Agent. AI Magazine, Volume 28, Number 4.
Allen, C., Wallach, W., Smit, I. (2006). Why Machine Ethics?, IEEE Intelligent Systems, vol. 21, no. 4, pp. 12-17, July/Aug. 2006, doi:10.1109/MIS.2006.83.
Moor, J. H. (2006). The Nature, Importance, and Difficulty of Machine Ethics, IEEE Intelligent Systems, vol. 21, no. 4, pp. 18-21, July/Aug. 2006.
Dodig-Crnkovic, G. and Persson, D. (2008). Sharing Moral Responsibility with Robots: A Pragmatic Approach. Tenth Scandinavian Conference on Artificial Intelligence, SCAI 2008. Volume 173, Frontiers in Artificial Intelligence and Applications. Eds. A. Holst, P. Kreuger and P. Funk.


THE ETHICS OF ROBOTIC DECEPTION

RONALD C. ARKIN
Mobile Robot Laboratory, Georgia Institute of Technology
85 5th ST NW, Atlanta, GA 30332, U.S.A.

The time of robotic deception is rapidly approaching. While some individuals are trumpeting the inherent ethical dangers of the approaching robotics revolution (e.g., Joy, 2000; Sharkey, 2008), little concern, until very recently, has been expressed about the potential for robots to deceive human beings. Our working definition of deception (of which there are many) that frames the rest of this discussion is: "deception simply is a false communication that tends to benefit the communicator" (Bond and Robinson, 1988). Research is slowly progressing in this space, with some of the first work, developed by Floreano et al. (2007), focusing on the evolutionary edge that deceit can provide within an otherwise homogeneous group of robotic agents. This work did not focus on human-robot deceit, however. As an outgrowth of our research in robot-human trust (Wagner and Arkin, 2008), in which the robot was concerned with whether or not to trust a human partner rather than the other way around, we considered the dual of trust: deception. As any good conman knows, trust is a precursor for deception, so the transition to this domain seemed natural. We were able to apply the same models of interdependence theory (Kelley and Thibaut, 1978) and game theory to create a framework whereby a robot could make decisions regarding both when to deceive (Wagner and Arkin, 2009) and how to deceive (Wagner and Arkin, 2011). This involves the use of partner modeling, or a (currently) simplistic view of theory of mind, to enable the robot to (1) assess a situation; (2) recognize whether conflict and dependence exist in that situation between deceiver and mark, which is an indicator of the value of deception; (3) probe the partner (mark) to develop an understanding of their potential actions and perceptions; and (4) then choose an action which induces an incorrect outcome assessment in the partner. A toy rendering of this decision logic is sketched at the end of this section.

While we believe the results we published (Wagner and Arkin, 2011) were modestly stated (e.g., "they do not represent the final word on robots and deception", "the results are a preliminary indication that the techniques and algorithms described in this paper can be fruitfully used to produce deceptive behavior in a robot", "much more psychologically valid evidence will be required to strongly confirm this hypothesis", etc.), the response to this research has been quite the contrary, ranging from accolades (being listed as one of the top 50 inventions of 2010 by Time Magazine (Suddath, 2010)) to damnation ("In a stunning display of hubris, the men ... detailed their foolhardy experiment to teach two robots how to play hide-and-seek" (Tiku, 2010), and "Researchers at the Georgia Institute of Technology may have made a terrible, terrible mistake: They've taught robots how to deceive" (Geere, 2010)). It seems we have touched a nerve. How can it be both ways? It may be where deception is used that forms the hot button for this debate.
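The following is a toy reconstruction of steps (1)-(4) under interdependence-theoretic assumptions, not the algorithm of Wagner and Arkin (2011); the payoffs, names and the mark's best-response model are all our own illustrative simplifications.

```python
# Outcome matrix: (deceiver_action, mark_action) -> (deceiver_payoff, mark_payoff)
# Hide-and-seek flavor: the deceiver picks a hiding spot, the mark searches one.
GAME = {
    ("left",  "left"):  (0, 5), ("left",  "right"): (5, 0),
    ("right", "left"):  (5, 0), ("right", "right"): (0, 5),
}
ACTIONS = ["left", "right"]

def conflict(game):
    # Step 2a: the two parties prefer different outcomes.
    best_deceiver = max(game, key=lambda o: game[o][0])
    best_mark = max(game, key=lambda o: game[o][1])
    return best_deceiver != best_mark

def dependence(game):
    # Step 2b: the deceiver's payoff varies with the mark's action.
    return any(game[(d, a)][0] != game[(d, b)][0]
               for d in ACTIONS for a in ACTIONS for b in ACTIONS)

def choose_deception(game):
    # Steps 3-4: assume probing showed the mark best-responds to signals,
    # i.e. searches wherever it believes the deceiver to be. Pick the
    # (signal, true_action) pair that maximizes the deceiver's payoff.
    if not (conflict(game) and dependence(game)):
        return None  # deception has no value in this situation
    return max(((sig, act) for sig in ACTIONS for act in ACTIONS),
               key=lambda p: game[(p[1], p[0])][0])

print(choose_deception(GAME))  # ('left', 'right'): signal left, hide right
```

The point at which conflict and dependence co-occur is exactly where, on the framework described above, false communication becomes profitable; absent either, the function declines to deceive.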
For military applications, it seems clear that deception is widely accepted (and this indeed was the intended use of our research, as our sponsor is the Office of Naval Research). Sun Tzu is quoted as saying that "All warfare is based on deception", and Machiavelli in The Discourses states that "Although deceit is detestable in all other things, yet in the conduct of war it is laudable and honorable". Indeed, there is an entire U.S. Army (1988) Field Manual on the subject. In our original paper (Wagner and Arkin, 2011), we included a brief section on the ethical implications of this research and called for a discussion as to whether roboticists should engage in this endeavor at all. In some ways, outside the military domain, the dangers are potentially real. And of course, how does one ensure that deception is used only in that context? Is there an inherent deontological right whereby humans should not be lied to or deceived by robots? Kantian theory clearly indicates that lying is fundamentally wrong, as is taught in most introductory ethics classes. But from a utilitarian perspective there may be times when deception has societal value, even apart from the military (or football): perhaps in calming down a panicking individual in a search-and-rescue operation, or in the management of patients with dementia, with the goal of enhancing that individual's survival. In such cases, even from a deontological perspective the intention is good, let alone by a utilitarian, consequentialist measure. But does that warrant allowing a robot to possess such a capacity? The point of this paper is not to argue that robotic deception is ethically justifiable or not, but rather to help generate discussion on the subject and to consider its ramifications. As of now there are absolutely no guidelines for researchers in this space, and it may indeed be the case that some should be created or imposed, either from within the robotics community or by external forces. But the time is coming when, if this is left unchecked, you may not be able to believe or trust your own intelligent devices. Is that what we want?

Acknowledgements

This research was supported by the Office of Naval Research under MURI Grant # N00014-08-1-0696. The author would also like to acknowledge Dr. Alan Wagner for his contribution to this project.

References

Bond, C. F., & Robinson, M. (1988). "The evolution of deception", Journal of Nonverbal Behavior, 12(4), 295-307.
Floreano, D., Mitri, S., Magnenat, S., & Keller, L. (2007). "Evolutionary Conditions for the Emergence of Communication in Robots". Current Biology, 17(6), 514-519.
Geere, D. (2010). Wired Science, http://www.wired.com/wiredscience/2010/09/robots-taught-how-to-deceive/
Joy, B. (2000). "Why the Future doesn't need us". Wired, April 2000.
Kelley, H. H., & Thibaut, J. W. (1978). Interpersonal Relations: A Theory of Interdependence. New York, NY: John Wiley & Sons.
Sharkey, N. (2008). "The Ethical Frontiers of Robotics", Science, (322): 1800-1801.
Suddath, C. (2010). "The Deceitful Robot", Time Magazine, Nov. 11, 2010, http://www.time.com/time/specials/packages/article/0,28804,2029497_2030615,00.html
Tiku, N. (2010). New York Magazine, 9/13/2010, http://nymag.com/daily/intel/2010/09/someone_taught_robots_how_to_l.html
U.S. Army (1988). Field Manual 90-2, Battlefield Deception, http://www.enlisted.info/fieldmanuals/fm-90-2-battlefield-deception.shtml
Wagner, A. and Arkin, R.C. (2008). "Analyzing Social Situations for Human-Robot Interaction", Interaction Studies, Vol. 9, No. 2, pp. 277-300.
Wagner, A. and Arkin, R.C. (2009). "Robot Deception: Recognizing when a Robot Should Deceive", Proc. IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA-09), Daejeon, KR.
Wagner, A.R., and Arkin, R.C. (2011). "Acting Deceptively: Providing Robots with the Capacity for Deception", International Journal of Social Robotics, Vol. 3, No. 1, pp. 5-26.


PROLEGOMENON TO ANY FUTURE THEORY OF MACHINE AUTONOMY

PAUL BELLO
Office of Naval Research
875 N. Randolph St., Arlington, VA 22203

AND

SELMER BRINGSJORD
Rensselaer Polytechnic Institute, Dept. of Cognitive Science, Dept. of Computer Science, Lally School of Management
110 8th St, Troy, NY 12180

AND

MARCELLO GUARINI
University of Windsor, Dept. of Philosophy
401 Sunset Ave., Windsor, Ontario N9B 3P4

Abstract. As the development of autonomous systems leads to smarter and more capable machines, we must concern ourselves with the possibility that they will one day be equipped with weapons and the authorization to use them. However, it isn't inconceivable that such systems will be prone to error, leaving us with the issue of who might be to blame if force is misapplied. In this presentation, we discuss responsibility as it pertains to autonomous systems. More specifically, we attempt to give a formal analysis of the conditions under which an autonomous system might consider itself to be a "freely acting agent." Note that we do not attempt to attack the metaphysical problem of free will; we aim only to provide the system with an appropriate commonsense theory of what it means to be free, given a set of circumstances within which the agent acts. Such a commonsense theory will (eventually) contain a set of beliefs corresponding to how external obligations, potential coercion, lack of perfect information, and brute facts constrain or expand the set of actions available to the agent at a given time in a branching-time semantics. The semantics represents the agent's beliefs about the past as fixed and the future as a set of possible histories that are contingent on its actions. Future extensions of our formal framework will be discussed relative to the development of a "Moral Turing Test" for autonomous systems.

"You have been terminated." In grand Hollywood style, this is how much of the public at large has been introduced to the notion of autonomous robots on the battlefield. When these words were famously uttered by the now-Governor of California, combat robots were only a dream, and the dystopian future painted in the Terminator movies seemed no more imminent than a new ice age. Times have rather changed. Combat robots roam through craggy caves in Afghanistan searching for terrorists, and unmanned air vehicles strike suspected enemy hideouts in Pakistan without a human operator being anywhere close by. Thankfully, we still live in a pre-Terminator age. The United States Department of Defense maintains strict policies that require humans to be in the decision-making loop whenever robots are employed on the battlefield. While this sets many a mind at ease, none of us is totally convinced that such strictures will remain indefinitely, especially as robots and the associated technology become more reliable, more intelligent, and (in the end, the most important factor) cheaper. Similar scenarios have been discussed at length by Joy (2000) and other futurists (Bostrom, 2003). In reply to these concerns, we (Bringsjord, Arkoudas & Bello, 2006) and others (Arkin, 2009) have looked to curb robotic behavior through the mechanization of norms, conventions, and other ethical structures, such that future robots might be bound by regulations. Unfortunately, complex situations are the norm on the battlefield, and facing novel moral dilemmas in combat is the rule rather than the exception. Just as our warfighters must improvise under these adverse circumstances, we expect future robots to take actions roughly consistent with pre-established norms, but rounded out with a measure of commonsense moral judgment, for if they do not, they are doomed to be both brittle and ineffectual soldiers. This being said, we'd like to address an issue at IACAP 2011 that hasn't received much attention in the literature: the issue of whether or not future intelligent robots could be blamed for their actions, provided something goes wrong during the course of their operation. Our plan will be to provide what we feel to be a reasonable set of conditions that, when jointly obtaining, would allow us to classify a robot as a moral agent, and as such subject to blame in the case of intentional misdoings or derelictions of duty.

The key question under consideration in our investigation is: "what does it mean for x to have the property of being autonomous?" We hope to clarify a set of potential confusions about the proper definition of autonomy in the context of robotic warfighters. Moral philosophers, depending on their particular stance on the nature of morality, typically define autonomy as the ability to respect some particular moral code or another, even if doing so runs contrary to self-interest. In a deep sense, these ideas turn on the notion of an autonomous agent having at least the illusion of free will, or the ability to choose contrary to a pre-established set of normative principles. Among roboticists and other practitioners of artificial intelligence, autonomy has generally been taken to mean the ability to make decisions and take actions without coercion or assistance from a secondary agent. While this seems plausible enough, a few mental exercises might convince you that it is much too general, perhaps to the point of not being useful in its intended context. Consider the case of the lowly thermostat that has functionality allowing it to turn on and off in order to maintain a pre-set ambient temperature in a home. It certainly "makes decisions" about when to turn on, and takes action (e.g. turns on) under an appropriate set of conditions and without consulting an external agent at decision-time. Should this device be granted autonomy? We think not, and we assume that our roboticist colleagues agree with us. Even though the thermostat makes decisions (in some sense) as to when to turn on, it is not at all clear that it could choose otherwise. In fact it cannot, barring device malfunction. Worse than this, there isn't an "it" making decisions at all. It's just a thermostat. If we map this onto the robotic case, it is equally unclear that there is an "it" making decisions, or one making free choices that direct its own affairs. A minimal sketch of the branching-time picture of free choice assumed in our abstract follows.
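This sketch is our illustrative gloss on the branching-time semantics mentioned in the abstract, not the authors' formalism: the past is fixed, the future is the set of histories reachable by available actions, and an action counts as "free" in the commonsense way only if an admissible alternative also exists. The constraint names are hypothetical.

```python
def histories(state, actions, depth):
    # The future as a branching set of possible histories: every sequence
    # of available actions of the given length, contingent on the agent.
    if depth == 0:
        return [[]]
    return [[a] + rest for a in actions(state)
            for rest in histories(a, actions, depth - 1)]

def believes_free(act, state, actions, constraints):
    # Commonsense freedom: the act is available, no obligation or coercion
    # rules it out, and at least one admissible alternative exists.
    admissible = [a for a in actions(state)
                  if not any(c(a) for c in constraints)]
    return act in admissible and len(admissible) > 1

# Toy navigation state: the past is fixed; the future branches.
actions = lambda s: ["advance", "hold", "withdraw"]
coerced = lambda a: a == "withdraw"        # e.g. an order forbids retreat
print(len(histories("now", actions, 2)))   # 9 possible two-step histories
print(believes_free("advance", "now", actions, [coerced]))   # True
print(believes_free("withdraw", "now", actions, [coerced]))  # False
```

On this toy reading, obligations and coercion shrink the admissible set, and blame attaches only where the set retains genuine alternatives.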
Real-world battlefield situations, however, don't bifurcate so cleanly into moral and non-moral decisions. Simple navigation decisions, such as whether or not to step into a house of worship, seem prima facie non-moral in nature, but as we well know, they indeed have moral consequences. These complications suggest to us that roboticists ought at least to consider some of the definitional concepts from moral philosophy to tighten up their own notions of autonomy, in order to make them more suitable for combat robots. A central notion to be accounted for in future definitions of machine autonomy is that of free choice. Without free choice, or at least the illusion of free choice, blaming a robot for misdeeds or for neglect becomes a less-than-meaningful activity. At IACAP 2011, we hope both to present recommendations for a formally useful definition of autonomy for machines and to propose a variety of tests, much like a decathlon, to establish functional baselines which would have to be met by computational systems hoping to acquire the designation of moral agent, with a particular focus on the robot's beliefs about how "free" its actions are at any given point in time. Given the uncertainty over the variegated notions of free will, the key test we propose will share much in spirit with Turing's Test for machine intelligence, a similarly ambiguous notion. Just as TT doesn't require human intelligence proper to functionally pass, we won't require an artificial system to have human-like free will (whatever that may look like) in order to be accorded moral agency.

References

Arkin, R.C. (2009). Governing Lethal Behavior in Autonomous Systems, Chapman and Hall Imprint, Taylor and Francis Group.
Bostrom, N. (2003). "Ethical Issues in Advanced Artificial Intelligence", Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence. 2: 12-17.
Bringsjord, S., Arkoudas, K. & Bello, P. (2006). "Toward a General Logicist Methodology for Engineering Ethically Correct Robots", IEEE Intelligent Systems. 21.4: 38-44.
Joy, W. (2000). "Why the Future Doesn't Need Us", Wired. (8.04).


AUTONOMOUS AGENTS AND SENSES OF RESPONSIBILITY

GORDON BRIGGS
Tufts University, Department of Computer Science
161 College Ave., Medford, MA 02155, U.S.A.

Abstract. The ever-increasing levels of autonomy in modern robotic systems will lead to the deployment of autonomous agents in morally sensitive contexts. Assigning responsibility when unethical actions are performed by robots has been a matter of considerable debate among roboethicists, with some positing a grave "responsibility gap" that prevents the satisfactory attribution of responsibility to any party. I submit that this contention may stem from the failure to specify the architectural details of the hypothetical robotic systems in question and the failure to consider multiple senses of responsibility. To illustrate this, the effect of assigning varying levels of architectural complexity to a hypothetical robotic agent on our reactive (moral) attitudes is examined. Various senses of responsibility are then presented, including the novel sense of pedagogic responsibility, in an attempt to close the "responsibility gap."

1. Introduction

The progress of modern robotics research is not only rapidly yielding embodied agents with increasing levels of autonomy, but also fueling the desire of various governmental and private institutions to deploy autonomous systems in morally contentious contexts. Given the prospect of autonomous agents that may make not only moral decisions, but life-or-death decisions of the highest ethical import, it is understandable that scientists and philosophers see an urgent need to tackle the issue of robotic systems and responsibility. When a robotic system perpetrates an unethical action, whom do we hold accountable? Conversely, to whom ought we direct praise when an autonomous system performs commendably in an ethical situation? Various loci of responsibility have been proffered by roboethicists: the developers of the autonomous agent, the handlers/controllers of the autonomous agent, and the autonomous agent itself (Sparrow, 2007). However, the justifiability of responsibility ascriptions to each of these loci remains controversial. Some posit a "responsibility gap" that prevents us from holding the programmers and developers of certain types of autonomous agents culpable for their potentially unpredictable acts (Matthias, 2004), whereas others reject this notion (Marino and Tamburrini, 2006). Another complication in ascribing responsibility, raised by Sparrow, involves the possible rejection of robots as loci of responsibility by humans, as the consequences of holding synthetic agents responsible may not sufficiently satisfy the aggrieved parties (Sparrow, 2007). In contrast with Sparrow, Dodig-Crnkovic and Persson (2008) contend that "learning from experience and making autonomous decisions gives us good reasons to talk about a machine as being 'responsible' for a task in the same manner that we talk about a machine being 'intelligent'", but that "we must adopt the functionalist view and see them as parts of larger socio-technological systems with distributed responsibilities, where responsibility of a moral agent is a matter of degree." Yet what makes responsibility hard to pin down or satisfactorily ascribe with robots? I submit that the debate is fueled by the ambiguity of the key terms in the dialogue: "responsibility" and "robot". We will first seek to tease out why disambiguating these terms is a prerequisite to solving, or at least making sense of, the problem of responsibility ascription for robotic systems. This disambiguation entails examining what the robotic/cognitive architecture of the autonomous system in question is, as well as considering which senses of responsibility we wish to ascribe when seeking to hold agents accountable. By fleshing out these issues, we can subsequently critique the viewpoints espoused by Matthias, Marino and Tamburrini, and Sparrow. We will then proceed to outline how we can use these senses of responsibility, together with our knowledge of the architectural mechanisms underpinning the robotic system, to establish a system of distributed responsibility that will ideally "not only locate the blame but more importantly assure future appropriate behavior of the system" (Dodig-Crnkovic and Persson, 2008).

3. Senses of Responsibility

Kuflik (1999) identifies six types of responsibility. The type needed to ascribe responsibility in liability cases as described by Marino and Tamburrini is oversight responsibility, which can in turn be thought of as a subset of Kuflik's role responsibility (where the agent's role is to oversee the operation of a system and ensure positive results while avoiding negative ones). By considering oversight responsibility, attitudinal differences between ascriptions of malice and negligence can be captured. A toy rendering of such ascription logic is sketched below.
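The following sketch is our illustrative addition, not Briggs's proposal: a hypothetical ascription routine that, given crude architectural details, returns the strongest sense of responsibility justifiable for each agent causally linked to an action, in the spirit of the distributed-responsibility paradigm discussed below. All predicates and names are stand-ins.

```python
def strongest_sense(agent):
    # Hypothetical ordering of senses, strongest first; the architectural
    # predicates are placeholders for real details of the behavior-
    # generating mechanisms, which this abstract argues must be specified.
    if agent.get("deliberates") and agent.get("autonomous"):
        return "moral responsibility"
    if agent.get("oversees"):
        return "oversight responsibility"
    if agent.get("trains_learner"):
        return "pedagogic responsibility"
    return "causal responsibility only"

def distribute(agents):
    # Maximize accountability: every causally linked agent receives the
    # strongest sense of responsibility that can justifiably be ascribed.
    return {a["name"]: strongest_sense(a) for a in agents}

incident = [
    {"name": "robot",     "autonomous": True, "deliberates": False},
    {"name": "operator",  "oversees": True},
    {"name": "developer", "trains_learner": True},
]
print(distribute(incident))
```

Note that the sketch deliberately lets the robot fall through to mere causal responsibility: whether anything stronger applies is exactly what turns on the details of its architecture.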
Despite the application of additional senses of responsibility to plug the "responsibility gap," the appropriateness of ascriptions of oversight responsibility still depends on details regarding the behavior-generating mechanisms of the autonomous agent. Does this leave the "responsibility gap" open at the higher end of the continuum of agent autonomy? Could there exist robotic agents that we believe cannot justifiably be considered loci of strong senses of responsibility (e.g. moral responsibility), but that are autonomous enough that assigning full liability to the developers or trainers also seems unfair? The answers to these questions are not clear, but independently of how these concerns are resolved, I wish to introduce a new flavor of responsibility that articulates a sense in which the developers and trainers of complex learning agents can be held accountable, regardless of the complexity of the agent's cognitive architecture. A weaker form of responsibility can be derived from Kuflik's role responsibility, one that recognizes the causal connections between the training that an agent provides another learning agent and that learning agent's future behavior. This sense of accountability can be deemed pedagogic responsibility. What I wish to highlight with this flavor of responsibility is the practical consideration that most, if not all, sophisticated learning agents are weakly supervised by other agents that fill the role of pedagogues; learning agents, in practice, are not completely self-bootstrapping.

4. Distributed Responsibility

Distributed responsibility is crucial to ensuring that desired outcomes are achieved in practice. Far from exculpating guilty agents by examining other loci of responsibility, an appropriate application of a distributed responsibility paradigm would in fact maximize accountability. This maximization of accountability can be achieved by considering all agents causally linked to a particular action and determining the strongest sense of responsibility that can be justifiably ascribed to each agent.

5. Conclusion

Knowing the relevant details of a robotic system's behavior-generating mechanisms is of paramount importance when undertaking the task of responsibility ascription for actions generated by that system. This knowledge, coupled with considerations of different flavors of responsibility, will enable agents to be held accountable in the proper sense. Finally, applying these different flavors of responsibility in a distributed context will contribute to the appropriate ascription of blame/praise and ensure future desired outcomes by minimizing all points of failure within a socio-technical system (as alluded to by Dodig-Crnkovic and Persson, 2008).

References

Dodig-Crnkovic, G. and Persson, D. (2008). "Sharing Moral Responsibility with Robots", Proceedings of the Tenth Scandinavian Conference on Artificial Intelligence.
Kuflik, A. (1999). Computers in control: Rational transfer of authority or irresponsible abdication of autonomy? Ethics and Information Technology, Vol. 1, No. 3.
Marino, D. and Tamburrini, G. (2006). Learning robots and human responsibility. International Review of Information Ethics, Vol. 6.
Matthias, A. (2004). The responsibility gap: Ascribing responsibility for the actions of learning automata. Ethics and Information Technology, Vol. 6, Issue 3.
Sparrow, R. (2007). Killer Robots. Journal of Applied Philosophy, Vol. 24, No. 1.


THE ENGINEERABILITY OF SOCIAL INSTITUTIONS
Some Critical Reflections against Searle and in Favor of Kant's Laws of Action

RUTH HAGENGRUBER
University of Paderborn
Ruth.Hagengruber@upb.de

Abstract. Arguing within the realm of Kant's concepts, I hold that moral laws result from universal, contradiction-free proving processes, against John Searle, who denies the engineerability of social institutions.

1. The Engineerability of Promises

In his book Making the Social World, John Searle explicitly denies the engineerability of social institutions. He deduces this claim from the fact that social rules owe themselves, first, to conscious human language and, second, to the will of acceptance. If you concede Searle's argument, you first have to commit to the gap between Searle's world of social rules dependent on human language and a social world as a real being, with rules that constitute its existence. Against Searle, I hold that the validity of some social institutions is built upon a realist and ontological dimension of social institutions. Searle explains that social institutions only exist because they are constituted by human capacities and are therefore not engineerable, illustrating his convictions with "promising" (which he used in his speech act theory) and demonstrating why unconscious robots cannot have institutions: "Let us suppose that robot A is so programmed that when it cognizes a future need on the part of robot B, A makes a 'promise' to render B the appropriate assistance in the future. ... But what I cannot find in this situation is the deontology that is essential to institutional reality in its human form. The notion of making and keeping promises presupposes the gap." (Searle 2010, 136). It is obvious and simple to understand that a computer program can divide one action of exchange into two parts and yet connect them together in such a way that the time difference does not interrupt the unity of the action; a toy sketch of such a bipartite action follows below. What kind of "notion" is needed to fulfill this bipartite action? Searle's argument refers to a concept of deontology which does not explain why promises are to be kept; in Searle's account, promises remain a duty someone has obliged me with. Kant's argument on moral duties is different. Kant's constitution of morals, i.e. of social institutions, is not based on properties of human nature, but must subsist a priori. This is true for several kinds of human actions, such as "saying the truth" or "selling something to all at the same price", and it is true for promises. How can we think of a promise as a universal law, and what consequences does this have for the engineerability of social institutions?

2. Some Social Institutions are Based on the Logic of Contradiction-Free Reasoning

The validity of a promise results from the idea of a self-consistent concept of an action. This is a purely formal statement of the fact that, from the point of view of logic, there is no reason to assume that this kind of action would ever contain an implicit problem, that is, that this kind of action could not be executed, as if a contradiction would arise (Hagengruber, 2001, 155 ff.). Although you might object that only humans can understand what a contradiction is, this does not concern the formal character of the validity of "promising". The validity of "promising" is as independent of this human approval as is any mathematical law.
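To return to the bipartite action mentioned above: the following sketch is our own illustration, not a reconstruction of Searle's robots. A promise is recorded as a commitment at one time and discharged at a later time, the two phases bound into one unit of action by a shared record. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Promise:
    promisor: str
    promisee: str
    action: str
    fulfilled: bool = False

@dataclass
class Ledger:
    open: list = field(default_factory=list)

    def commit(self, promisor, promisee, action):
        # Phase 1: the first half of the exchange is recorded now ...
        p = Promise(promisor, promisee, action)
        self.open.append(p)
        return p

    def discharge(self, p):
        # Phase 2: ... and completed later; the shared record preserves
        # the unity of the action across the time difference.
        p.fulfilled = True
        self.open.remove(p)

ledger = Ledger()
p = ledger.commit("robot_A", "robot_B", "render assistance")
ledger.discharge(p)              # time passes between the two calls
print(p.fulfilled, ledger.open)  # True [] : one action in two parts
```

Whether such a mechanism exhibits the deontology Searle demands is exactly the point in dispute; the sketch shows only that the formal, bipartite structure of promising is trivially computable.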
Think how many do not understand the mathematical laws computers are built of and constituted by but how many people use it! Very often promises are broken, however this does not influence the validity of the law of promising which is effected by its formalism. This formalism is the reason of its validity, not our agreement to it. It is completely unimportant if this law is understood or not, as we can easily observe. From this assumption we can deduce that “promising” is not only a kind of social institution which deduces its validity from human understanding and acceptance, but it can be seen as a sort of law which coordinates to a sort of “ontological” law. Searle presupposes that keeping promises is only possible if we have an understanding of language and he is convinced that these language based rules are different to computational rules. Are both types built upon different modes of thought? How do rules and laws work in machines, and why do we understand the results of computation? I affirm that some (not all) social institutions are based on computable laws and that their inherent character is comparable to computational laws. This implies the conviction that there are some types of social laws which are much deeper grounded than to be only a reflex of cultural inspiration. Searle turns out as a dualist, arguing on the ground of two kinds of rationality, a computable and a non computable, when deviding the world into non computable social institutions and computable number concepts. References Hagengruber, Ruth 2001. Zur Gesetzmäßigkeit und materialen Notwendigkeit von Versprechen. In: R. Haller, K. Puhl (Ed.). Wittgenstein und die Zukunft der Philosophie. Kirchberg am Wechsel, 300-305. Searle, John R. 2010. Making the Social World: The Structure of Human Civilization. Oxford University Press. Smith, Barry. 1992. An Essay on Material Necessity. Hanson P. and Hunter B. (eds.) Return of the A Priori. Canadian Journal of Philosophy, Supplementary Volume 18. - 128 - The Computational Turn: Past, Presents, Futures? RESPONSIBILITY IN ACQUIRING CRITICAL eGOVERNMENT SYSTEMS Whose Fault is Failure? HEIMO, OLLI Acting Teacher, Department of Management, Turku School of Economics olli.heimo@utu.fi University of Turku AND KIMPPA, KAI Principal Lecturer, Department of Management, Turku School of Economics kai.kimppa@utu.fi University of Turku Abstract. While ordering and producing modern eGovernment systems to the critical fields of governmental services the stakes with failure vary from the loss of money to the loss of life. Standard procedures of providing an eGovernment service does not nominate clear responsibilities to any participating party. Government offices hold a dual-model role in which they are both a customer towards the supplier of the system and supplier of the system towards the public. Government officials have been nominated to their job as a form of social contract to be the responsible party in the eGovernment system acquiring, implementation and upkeep. In that context, when the government office orders critical eGovernment systems and takes them into use as a monopoly service, it must hold itself responsible for the system and its effects. Normal struggle between the authorities, system suppliers, NGOs and individual citizens after a troubled eGovernment experiment can be avoided when the responsibilities are taken into account before the system development even begins. 
Extended abstract: In this paper we aim to show that a responsible party for acquiring critical eGovernment systems should be nominated and that the expected consequences must be analysed before the project is started. This is to prevent loss of human life, to enhance well-being, to secure a democratic process and civil rights of the citizens and to save resources. A critical information system is a system where something invaluable can easily be compromised. These kinds of systems include eHealth, eDemocracy, police databases and some information security systems e.g. physical access right control. A critical - 129 - Proceedings IACAP 2011 eGovernment system is such a system provided to the people by the government. Systems included in these kinds of areas are those of healthcare, border control, electronic voting, criminal records, etc. There have been numerous cases, where due to poor eGovernment systems lives have been lost (Avison & Torkzadeh 2008, p. 292-293, Fleischman 2010) and elections have been compromised (Mercuri 2001, p. 13-20, Heimo, Fairweather & Kimppa 2010, Robison 2010). At the same time large amounts of resources (Larsen & Elligsen 2010) are wasted, while the systems are either inoperable for the purposes they were designed or end up being discarded (Wijvertrouwenstemcomputersniet 2007, Verzola 2008, Heimo, Fairweather & Kimppa 2010). Thus, while developing critical eGovernment systems, there is little room for error. Some of the errors have lead to catastrophic consequences, like the Case London Ambulance, where more than 20 people died due to bad system design, poor testing and hasty implementation (Avison & Torkzadeh 2008, p.292-293). In the field of eVoting, there have been problems, close-by situations or problems which have not been identified, yet are suspected. Some of the clearest mistakes have been made in the U.S., but many European eVoting projects, like those of Ireland and Netherlands, have also endangered the democratic process. Many eVoting projects have also been found extremely costly. (Wijvertrouwenstemcomputersniet 2007, Verzola 2008, Heimo, Fairweather & Kimppa 2010) A specific party has to be responsible for the development of the system, so that there is someone to respond to the challenges, repair what is broken, and see to it that the system itself works. That is a job the society as a whole has given to a third party, as not everyone can participate to the process. The task of the responsible party is to see to it that the system works as it should. (See e.g. Hobbes 1651.) Four different interest groups can be found in every eGovernment system development process. First, there is the government office, whose task is to formulate the solutions to fulfil the needs of the society at large. Secondly there is the producer, who delivers the requested system. Third interest group is the end-user group consisting of people using the system, i.e. nurses, border officials, police or military officers and voting officials. Fourth group is the citizens, who are the targets of the system usage. Any or all of the groups can also overlap. Every nurse or doctor can (and will) be a patient, every voting official can vote, every police or military officer or border official is also a citizen dependant of the services produced by police or military force and border control etc. 
The power to decide how to design and whether to implement the system lies within the government and the supplier; the user and the target of usage are in weaker positions, for they have little or no power in designing the system compared to governmental officials or the supplier of the system. According to Rawls (1997) the change in the system must be to the advantage of the weakest parties, to the last two groups, who are less able to defend themselves. With the power to decide for the public comes the responsibility to the public. That responsibility has to be either with the subscriber or the supplier of the system. The responsibility with the supplier lies in fulfilling the requests of the customer, in this case the governmental office. If this task fails, the supplier is surely responsible to the authorities for their failure of not fulfilling the requirements agreed upon. - 130 - The Computational Turn: Past, Presents, Futures? The authorities have a monopoly in supplying certain services like critical eGovernment products. Due to this, they are in the supplier role in relation to the citizen. That role brings with it the responsibility of a functioning product. If the system is taken into use – and it must be emphasized, that these are critical systems – the responsibility lies with the last supplier of the system: the government office. The producer produces a system according to the specifications they receive from the ordering party, in this case the government office. Even if the product is faulty and does not fulfill the specification, the authorities are responsible to audit the product (due to these kinds of systems being critical applications). The responsibility for showing that a product is faulty, cannot, however rest on the end-user, but the provider or the distributor must provide sufficient proof that the system is safe. In many countries (e.g. in Finland, Ireland, Netherlands and the USA) only after a system has been taken into use, the end-users (specialists, citizens, NGOs, etc.) have been able to show that there are critical problems with the system (see e.g. Mercuri 2001, Harris 2004, Wijvertrouwenstemcomputersniet 2007, Heimo, Fairweather & Kimppa 2010). That means that the producers and the government officials are defending their position against the end-users and the public. However, the burden of proof in a situation where critical systems are changed must remain with the party advocating the change. Because this kinds of systems are distributed through a government monopoly, the obvious responsible party is, maybe counter to intuition, the subscriber, not the producer of the system. Pantzar (2002) generalizes MacKenzie’s (1990) theory of the Certainty Trough to all technology. Pantzar claims, that the salespersons of the product – the representatives of the producer – are denied their right to be uncertain of the product they are selling. In a modern society there is a risk, that this reflects to the suppliers – the governmental offices – representatives so, that even they cannot appear to be uncertain of the product when introducing it to the citizens. In a situation where this risk actualizes, the information the government officials give to the public is misleading. When ordering critical eGovernment systems, it must be remembered that the people auditing the systems must be accountable for their work and the government office must select a party able to successfully complete the auditing. 
Governmental officials have to be trained and given the accountability for what methods of auditing are required and how the results have to be interpreted. Thus, we must see to it that sufficient safeguards are in place for taking new applications into use in critical eGovernment services. It must be ensured that the responsible office has tested the critical applications at minimum to the degree the current system can be trusted. That alone, cannot be a convincing reason to take a new system into use. Either the security of the system itself has to be greater than the previous systems’, or, at least the added value the system provides to the citizen must be – together with the same amount of security as in the previous system – considerable to justify changing systems. To summarize, the responsibility of the critical eGovernment systems lie within the authorities. They hold a monopoly to the services they have been nominated to produce, control and upkeep. When this is done without the responsibility and accountability of anyone, it can and will endanger the fundamental values we hold dear. - 131 - Proceedings IACAP 2011 References Avison, David and Torkzadeh, Gholamzeza (2008), Information Systems Project Management, Saga Publications, California, USA, August 2008. Fleischman, William M. (2010), Electronic Voting Systems and The Therac-25: What Have We Learned?, Ethicomp 2010. Harris, Bev (2004), Black Box Voting: Ballot Tampering in the 21st Century, Talion Publishing, free internet version is available at www.BlackBoxVoting.org, accessed 7.2.2011. Heimo, Olli I, Fairweather, N. Ben & Kimppa, Kai K. (2010), The Finnish eVoting Experiment: What Went Wrong?, Ethicomp 2010. Hobbes, Thomas (1651), Leviathan, or the Matter, Forme, and Power of a Commonwealth, Ecclesiasticall and Civil, edited with an introduction by C.B. MacPherson, Published by Pelican Books 1968. Larsen E & Elligsen G. 2010. Facing the Lernaean Hydra: The Nature of Large-Scale Integration Projects in Healthcare. In Kautz K & Nielsen P. Proceedings of the First Scandinavian Conference of Information Systems, SCIS 2010. Rebild, Denmark, August 2010.Mackenzie, Donald A (1990), Inventing accuracy, A historical sociology of nuclear missile guidance, MIT Press, Cambridge Massachusetts. Mackenzie, Donald A (1990), Inventing accuracy, A historical sociology of nuclear missile guidance, MIT Press, Cambridge Massachusetts. Mercuri, Rebecca (2001), Electronic Vote Tabulation: Checks and Balances PhD thesis, University of Pennsylvania. http://www.cis.upenn.edu/grad/documents/mercuri-r.pdf Pantzar, Mika (2000), Teesejä tietoyhteiskunnasta. Yhteiskuntapolitiikka. No 1. pp. 64 - 68. http://www.stakes.fi/yp/2000/1/001pantzar.pdf, accessed 7.2.2011. Rawls, John (1997), The Idea of Public Reason, Deliberative democracy: essays on reason and politics, edited by James Bohman and William Rehq, The MIT Press, 1997. Robison, Wade L. (2010), Voting and Mix-And-Match Software, Ethicomp 2010. Verzola, Roberto (2008), The Cost of Automating Elections. http://ssrn.com/abstract=1150267, haettu 24.11.2010. Wijvertrouwenstemcomputersniet (2007), Rop Gonggrijp and Willem-Jan Hengeveld - Studying the Nedap/Groenendaal ES3B voting computer, a computer security perspective, Proceedings of the USENIX Workshop on Accurate Electronic Voting Technology 2007 http://wijvertrouwenstemcomputersniet.nl/images/c/ce/ES3B_EVT07.pdf, accessed 7.2.2011. (see also http://wijvertrouwenstemcomputersniet.nl/English). - 132 - The Computational Turn: Past, Presents, Futures? 
WHAT ARE ETHICAL AGENTS AND HOW CAN WE MAKE THEM WORK PROPERLY?

IORDANIS KAVATHATZOPOULOS
Uppsala University, Dept. of IT-HCI, Box 337, 751 05 Uppsala, Sweden

AND

MIKAEL LAAKSOHARJU
Uppsala University, Dept. of IT-HCI, Box 337, 751 05 Uppsala, Sweden

Abstract. To support ethical decision making in autonomous agents, we suggest implementing decision tools based on classical philosophy and psychological research. As one possible avenue, we present EthXpert, which supports the process of structuring and assembling information about situations with possible moral implications.

1. Philosophy

Automated systems can be of great help in achieving goals and obtaining optimal solutions to problems in situations where humans have difficulties perceiving and processing information, or making decisions and implementing actions, because of the quantity, variation and complexity of the information. Given that we have a clear definition of ethics, we can design a system that is capable of making ethical decisions, and able to make these decisions independently and autonomously.

In the common-sense view, ethics is judged mainly by its normative qualities. People's attachment to the normative aspects is so strong that it is not possible for them to accept that ethics is an issue of choice, as classical philosophy has held. If ethics is connected to choice, then the interesting aspect is how the choice is made, or not made. The focus is on how, not on what; on the process, not on the content. Indeed, regarding the effort to make the right decision, philosophy and psychology point to the significance of focusing on the process of ethical decision making rather than on the normative content of the decision. According to the theories of Plato, Aristotle, Kant and modern philosophers, one has to get rid of false ideas, because this opens up the way to the right solution. Thinking in the right way is not easy, and certain skills are necessary.

2. Skills of Ethical Agents

This philosophical position has been applied in psychological research on ethical decision making. Focusing on the process of ethical decision making, psychological research has shown that people use different ways to handle moral problems. When people are confronted with moral problems, they think in a way which can be described as a position on the heteronomy-autonomy dimension. Heteronomous thinking is automatic, emotional and uncontrolled thinking, or simple reflexes, fixed dogmatically on general moral principles. Thoughts and beliefs coming to mind are never doubted. Awareness of one's personal responsibility for the way one is thinking, or for the consequences of the decision, is missing. Autonomous thinking, on the other hand, focuses on the actual moral problem situation, and the main effort consists in searching for all relevant aspects of the problem. When one is thinking autonomously, the focus is on the consideration and investigation of all stakeholders' moral feelings, duties and interests, as well as all possible alternative ways of action. In that sense autonomy is a systematic, holistic and self-critical way of handling a moral problem. Handling moral problems autonomously means that a decision maker is unconstrained by fixations, authorities, and uncontrolled or automatic thoughts and reactions. It is the ability to start the thought process of critically and systematically considering and analyzing all relevant values in a moral problem situation.
It is not easy, however, to use the skill of autonomy in real situations. Psychological research has shown that plenty of time and certain conditions are required before people can acquire and use the ethical ability of autonomy.

3. Support Systems

IT systems have many advantages that can be used to stimulate and facilitate autonomous thinking in decision making. For example, EthXpert is designed to support the process of structuring and assembling information about situations with possible moral implications (http://www.it.uu.se/research/project/ethcomp/ethxpert). It follows the hypothesis that moral problems are best understood through the identification of the authentic interests, needs and values of the stakeholders in the situation at hand. Since the definition of what constitutes an ethical decision cannot be assumed to be fixed, we have further concluded that this kind of system must be designed so that it does not judge the normative correctness of any decisions or statements. Consequently, the system does not make decisions, and its sole purpose is to support the decision maker when analyzing, structuring and reviewing choice situations.

Ethical decision support can be integrated into robots and other decision-making systems to ensure that decisions are made according to the basic theories of philosophy and psychology. In one sense this fully automated autonomy would be ideal, although it will bring to the fore questions about how to treat machines that have a refined sense of reasoning. Before we are there, however, we can see that ethical decision-making support systems based on this approach can be utilized in two ways, both of which we believe to be necessary steps towards further development.

During the development of a decision-making system, support tools can be used to identify the criteria for making decisions and for choosing a certain direction of action. This means that the support tool is used by developers – the ones who make the real decisions – when they are facing an ethical problem and need assistance in choosing according to the philosophical/psychological approach. Another possibility is to integrate a support tool into the decision system. By putting the support tool into the system, it can be used in cases of unanticipated future situations. The tool can gather information, process it, structure it and present it to the operators in a way that follows the requirements of the above-mentioned theories of ethical autonomy. In this arrangement, operators make the real decisions and are the users of the ethical support tool (Kavathatzopoulos, 2010). Such an independent system – one that can make decisions and act in accordance with the hypothesis of ethical autonomy – is one which 1) has criteria, previously identified in an autonomous way, programmed into it by the designers, and 2) prepares the information about problematic situations according to the theory of ethical autonomy, so that the operators, when presented with it, are stimulated to make decisions compatible with that theory.

References

Kavathatzopoulos, I. (2010). Robots and systems as autonomous ethical agents. In: V. Kreinovich, J. Daengdej and T. Yeophantong (Eds.), INTECH 2010: Proceedings of the 11th International Conference on Intelligent Technologies (pp. 5-9). Bangkok: Assumption University.
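To make the kind of structuring described in section 3 concrete, here is a minimal sketch, in Python, of a stakeholder-by-option matrix of the sort EthXpert supports. EthXpert's actual data model is not documented here, so every class, method and example name below is hypothetical. Note that, in keeping with the paper's design principle, the structure records considerations but deliberately computes no score and renders no normative verdict.

```python
from collections import defaultdict

# Hypothetical sketch of EthXpert-style structuring (names invented):
# the system assembles and presents considerations; it never decides.

class MoralProblemAnalysis:
    def __init__(self, description):
        self.description = description
        # stakeholder -> option -> list of interests/values affected
        self.considerations = defaultdict(lambda: defaultdict(list))

    def add(self, stakeholder, option, consideration):
        self.considerations[stakeholder][option].append(consideration)

    def review(self):
        # Present everything to the human decision maker; deliberately
        # no scoring, ranking, or normative judgment is performed.
        print(self.description)
        for stakeholder, options in self.considerations.items():
            for option, notes in options.items():
                print(f"  {stakeholder} / {option}: {'; '.join(notes)}")

analysis = MoralProblemAnalysis("Deploy automated screening?")
analysis.add("passengers", "deploy", "shorter queues; privacy exposure")
analysis.add("operators", "deploy", "workload shifts to verification")
analysis.add("passengers", "do not deploy", "status quo delays")
analysis.review()
```

The design choice mirrors the paper's conclusion: supporting autonomous thinking means widening the decision maker's view of stakeholders and interests, not automating the verdict.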
HOW THE HARD PROBLEM OF CONSCIOUSNESS MIGHT EMERGE FOR AN EMBODIED SYMBOL SYSTEM

BERNARD MOLYNEUX

Abstract. Embodied systems with both an exteroceptive and an introspective informational channel can investigate themselves via two independent methods, generating distinct pictures of the self. Attempts at cross-perspectival identification, however, are frustrated by the recursive nature of Leibniz's Law, which, for each pair of potential cross-perspectival identificanda, requires the prior cross-perspectival identification of their properties, generating a regress. I show that the only ways the embodied system can escape from this regress correspond to the classic answers to the hard problem of consciousness: inflate its third-person ontology with distinct subjective properties (dualism); deny the reality of its subjective phenomena (eliminativism); or postpone the identification indefinitely (the current state of materialist realism). Thus, I suspect that this problem is the hard problem of consciousness rediscovered in the context of an embodied artificial system.

Abstract. Any embodied system with both an exteroceptive and an introspective (internal monitoring) channel can investigate itself via two independent methods. I show how this generates an epistemic problem resembling the hard problem of consciousness.

How M Represents Things

Imagine that at any time our intelligent symbol system M represents objects and properties discovered using its exteroceptive system (henceforth 'EXTEROCEPTION') using some finite stock of symbols (for visual prettiness, use/mention distinctions are syntactically unmarked, so O^1 sometimes refers to the representation and sometimes to its referent, as will be clear from context):

O^0_1, O^0_2, O^0_3, ...

where superscripts designate order whereas subscripts distinguish the representations at each order, so that M represents the ith nth-order entity having the jth mth-order property as follows:

O^m_j O^n_i

E.g. if we count objects as appearing at the 0th order (since they are modified by first-order properties), then the following:

O^1_23 O^0_45

signifies that the 45th object in M's ontology is modified by the 23rd first-order property. (When order is clear from context, we will drop the subscripts to minimize notational clutter.) In the same way, M uses the symbol S (think 'subjective') to represent objects and properties that it learns about via its other, introspective, mode (henceforth 'INTROSPECTION').

How M Thinks about Things

We place one iron restriction on M's reasoning, and three soft restrictions (to be explained).

Iron restriction: M observes Leibniz's Law. I.e. if M holds that A=B, then for every property P, M holds that A instantiates P if and only if M holds that B does.

Now for the soft restrictions:

First soft restriction: M thinks that it can in principle acquire a complete picture of the world from EXTEROCEPTION only. (Here 'thinks' means that the system processes in accordance with this restriction, as if it 'thinks' this. All such mentalistic vocabulary can be similarly replaced throughout the argument, if it is thought to beg any questions.)

Second soft restriction: M regards the data it gets from INTROSPECTION as correct and incorrigible. It treats introspection as the ultimate authority on its inner self.

Third soft restriction: M insists on all of its identifications being constructive. That is to say, it only identifies specific phenomena of which it is aware. So though it might identify O_23 with O_78 or with S_677, for instance, it will not commit to the abstract existential identification of O_23 with some (as yet unknown) O or S phenomenon.
Later we see that relaxing the soft restrictions permits M to solve its problem in ways that resemble the classic answers to the hard problem of consciousness, indicating that this is indeed the hard problem of consciousness rediscovered in the context of an embodied artificial system.

The Proof

We proceed by reductio, by imagining that M identifies some subjective (S) and some objective (O) phenomenon. Since M does so, there must be some S_i and some O_i that are the highest-order such entities to be identified. Since this identification must obey Leibniz's Law, M must first check whether S_i and O_i have the same properties, either by checking its antecedent knowledge of S_i or by querying INTROSPECTION anew. But now consider an arbitrary property S_{i+1} that INTROSPECTION ascribes to S_i. Since the identification of S_i and O_i obeys Leibniz's Law, M must either hold that both O_i and S_i have S_{i+1} or that neither does. Hence either:

(i) M holds S_{i+1} to be an additional property of O_i distinct from any property of O_i that M might learn about from EXTEROCEPTION. Or:
(ii) M comes to hold that S_i does not in fact have S_{i+1}. Or:
(iii) S_{i+1} is identified with some property O_{i+1} of O_i learnable via EXTEROCEPTION.

However, option (i) is impossible, since the first soft restriction says that EXTEROCEPTION can provide a complete picture of the world. Similarly, the second soft restriction says that INTROSPECTION is correct and incorrigible, excluding option (ii). And option (iii), given that only constructive identifications are permitted, is possible only if the system identifies S_{i+1} with some known property of O_i, in which case it would be identified with some specific property O_{i+1}, violating our starting assumption that O_i and S_i are the highest-order entities identified. Thus there can be no highest-order O-S identification consistent with the restrictions, which means, for our finite symbol system M, that there can be no O-S identification at all (the same proof, fortunately, fails for S-S and O-O identifications; explanation omitted).

Dropping the Soft Restrictions

Relaxing any soft restriction permits O-S identifications that correspond to the classic solutions to the hard problem of consciousness, indicating that we have discovered the hard problem in a more general form.

Relaxing the first soft restriction permits M to add the property S_{i+1} that O_i lacks as a new property of O_i, not discoverable by EXTEROCEPTION. But this corresponds to property dualism – wherein introspectively discoverable properties (like qualia) are simply added to exteroceptively discoverable entities (like brains) as ontically distinct properties.

Relaxing the second soft restriction permits M to engage in qualia-eliminativist strategies, according to which the property S_{i+1}, though patent to INTROSPECTION, is held to be nonexistent, thus removing it as an impediment to identification.

Relaxing the third soft restriction allows M to identify S_{i+1} in principle with some property detectable by exteroception – but not with any property in particular. This corresponds to holding a non-committal, non-constructive physicalist realism: experiential properties like qualia are identical to some objectively discoverable properties, but the question of which ones is indefinitely postponed.
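The regress at the heart of the proof can be made vivid with a small illustration. The following Python sketch is not from the paper, and its names are invented; it merely shows that, for a system obeying the iron and soft restrictions, every attempted O-S identification at order i forces an attempted identification at order i+1, so no finite stock of symbols ever lets the check terminate with success.

```python
# Hypothetical sketch of the cross-perspectival identification regress.
# To identify S_i with O_i under Leibniz's Law, M must first identify
# every INTROSPECTION-ascribed property S_{i+1} of S_i with some
# EXTEROCEPTION-learnable property O_{i+1} of O_i: soft restriction 1
# blocks adding S_{i+1} as a new O-property, soft restriction 2 blocks
# denying S_{i+1}, and soft restriction 3 allows only a *constructive*
# identification, which is itself an O-S identification one order up.

def can_identify(order, max_order):
    """Can an O-S identification at this order be discharged?"""
    if order > max_order:
        # M's symbol stock is finite: the regress is never discharged,
        # so the identification never completes.
        return False
    # The only permitted route restarts the same check one order higher.
    return can_identify(order + 1, max_order)

# However large we make M's finite symbol stock, the answer is the same:
for max_order in (1, 10, 1000):
    print(max_order, can_identify(0, max_order))  # always False
```

The three escapes the paper describes correspond to deleting one of the three commented-out alternatives: allow the new O-property (dualism), deny S_{i+1} (eliminativism), or accept a non-constructive identification (postponement).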
THE GAME OF EMOTIONS (GOE)
An Evolutionary Approach to AI Decisions

JORDI VALLVERDÚ
Philosophy Department, UAB, E08193 Bellaterra, BCN, Catalonia

AND

DAVID CASACUBERTA
Philosophy Department, UAB, E08193 Bellaterra, BCN, Catalonia

Abstract. It is well known that emotions play a crucial role in cognitive processes. The present research offers a new approach to the study of synthetic emotions based on the combined ideas of: (a) minimal cognition, (b) a bottom-up perspective and (c) evolution. Our hypothesis is that complex social and intelligent actions can be achieved through basic emotional configurations. To test this hypothesis, we have developed a new genetic algorithm which makes it possible to analyze the role of emotions in individual and social activities. We have called our computational simulation the Game of Emotions (henceforth, GOE). Programmed in Python, our GOE simulation is a closed and finite square world in which a single type of creature interacts (socially and sexually) with its peers and also with food and dangers. The food database is drawn from our previous e-pintxo program (http://epintxo.gulalab.org/). The decisions and actions of each creature are conditioned by a combination of 'genetic' and 'random'/'social' factors. The creatures have a genetic code (G) consisting of six genes grouped in two triplets; each gene encodes either a positive valence (which we call 'pleasure' or p) or a negative one (which we call 'pain' or n). An example: G = {n,p,n} {p,n,p}. The first triplet is genetically determined and called the 'genetic triplet', while the second one is generated randomly and is called the 'environmental triplet'. Each triplet is represented within brackets combining positive and negative valences. An example: {p, p, n} (pleasure, pleasure, pain). With this simulation we will be able to observe: a) how embodiment and environmental conditions condition the activity of artificial entities; b) how social dynamics can be described from a limited set of starting configurations, which will allow us in the future to create dynamic models of emotional self-organization and to construct more complex interactions; c) the role of emotions in the creation of complex behaviours, allowing the emergence of more precise artificial cognitive systems (not necessarily naturalistic ones); and d) the benefits of designing entities with evolutionary capacities, able to adapt to changing conditions.

1. Introduction

It is well known that emotions play a crucial role in cognitive processes (as Damasio, Llinás, Ekman and others have pointed out in numerous books and research papers). In the last two decades, increasing effort has been devoted to the introduction of synthetic emotions into AI systems (robotic or computational). Most of the time, this research has focused on affective computing applications, and in a few cases on simulations of emotion dynamics. The present research offers a new approach to the study of synthetic emotions based on the combined ideas of: (a) minimal cognition, (b) a bottom-up perspective and (c) evolution. Our hypothesis is that complex social and intelligent actions can be achieved through basic emotional configurations that can become increasingly complex.
2. Programming details

To test our hypothesis, we have developed a new genetic algorithm which makes it possible to analyze the role of emotions in individual and social activities. Our research is deeply influenced by John Conway's "Game of Life" (henceforth GOL), devised in 1970. The GOL was made of cellular automata for which some initial states were described and which then evolved without human supervision. This simulation game has inspired our own version, this time oriented towards the study of the role of emotions in individual activity (and, consequently, its incidence on social dynamics). We have called our version the Game of Emotions (henceforth, GOE). Before explaining the details, it is necessary to clarify that this research is the natural evolution of our two previous simulations, TPR and TPR 2.0 (Vallverdú & Casacuberta 2008, 2009), as well as of our studies on synthetic emotions and cognition (Vallverdú, Shah & Casacuberta, 2010; Casacuberta, Ayala & Vallverdú, 2010).

Programmed in Python, our GOE simulation is a closed and finite square world in which a single type of creature interacts (socially and sexually) with its peers and also with food and dangers. We will use our previous program e-pintxo as a source database for food generation (http://www.gulalab.org/indexen.htm). The decisions and actions of each creature are conditioned by a combination of 'genetic' and 'random'/'social' factors. The creatures have a genetic code (G) consisting of six genes grouped in two triplets; each gene encodes either a positive valence (which we call 'pleasure' or p) or a negative one (which we call 'pain' or n). An example: G = {n,p,n} {p,n,p}. The first triplet is genetically determined (by the parent) and called the 'genetic triplet', while the second one is generated randomly and is called the 'environmental triplet'. Each triplet is represented within brackets combining positive and negative valences. An example: {p, p, n} (pleasure, pleasure, pain). According to the possible combinations, a limited number of genomes is possible:

Table 1. Partial list of emogenomes

Genome            Valences    Fitness
{p,p,p}{p,p,p}    6p          6p
{p,p,p}{p,p,n}    5p, 1n      4p
{p,p,p}{p,n,n}    4p, 2n      2p
{p,p,p}{n,p,p}    5p, 1n      4p
{p,p,p}{n,n,p}    4p, 2n      2p
{p,p,p}{n,n,n}    3p, 3n      0
...and so on...

Where p values dominate, the fitness (as we call the sum of all the G values) is positive; where the value is 0, a zero situation occurs – no activity (illustrating a frame-problem situation, that is, the lack of a reason to act without enough information); and, finally, the dominance of n values implies a negative reaction. However, we must clarify in more detail how each value contributes to the decisions, based on the triplets' outcomes. There are two mechanisms: i) the result of a calculation over the overall genome, as explained a few lines above; ii) associating with each action the value of a single element of a triplet. For example, if the creature is {x1, x2, x3} {y1, y2, y3}, then movement is controlled by x1, reproduction by y2, etc.; but actions can also be dominated by a combination of genes: walking is the average of x1 and y1, reproduction the average of x1, x2, x3. One example:

G = [{x1, x2, x3} {y1, y2, y3}]

where each gene must adopt one of the two basic states p/n (or stay inactive as an 'ill unit').
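As a concrete illustration of the encoding just described, here is a minimal sketch in Python (the language of the actual GOE, whose source code is not reproduced here). The function names and reaction labels are our own illustrative choices; the fitness rule simply mirrors Table 1, with each p counting +1 and each n counting -1.

```python
import random

# Hedged, minimal sketch of the GOE genome encoding (not the authors'
# actual implementation; names and labels are illustrative).

PLEASURE, PAIN = "p", "n"

def random_triplet():
    # Each gene adopts one of the two basic states p/n.
    return tuple(random.choice((PLEASURE, PAIN)) for _ in range(3))

def make_genome(parent_triplet=None):
    # First triplet is inherited from the parent ("genetic triplet");
    # the second is generated randomly ("environmental triplet").
    genetic = parent_triplet if parent_triplet is not None else random_triplet()
    return (genetic, random_triplet())

def fitness(genome):
    # Sum of all G values, as in Table 1: {p,p,p}{p,p,n} -> 5p,1n -> 4.
    return sum(1 if g == PLEASURE else -1 for g in genome[0] + genome[1])

def react(genome):
    f = fitness(genome)
    if f > 0:
        return "positive reaction"
    if f == 0:
        # The zero case is the frame-problem situation: no reason to
        # act without enough information.
        return "no activity"
    return "negative reaction"

g = make_genome(parent_triplet=(PLEASURE, PLEASURE, PLEASURE))
print(g, fitness(g), react(g))
```

A second, per-gene mechanism (x1 for movement, y2 for reproduction, averages over several genes, and so on) would be layered on top of this whole-genome calculation, as the text above explains.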
Consequently, each gene has two parallel functions: (a) storing/codifying emotional states p/n (according to its genetic or environmental nature), and (b) codifying specific actions, following two co-existing rules: i. one gene = one function; ii. several genes = one function. Basically, x1 codifies hunger, x2 sex, x3 movement, y1 empathy (detection of friends/enemies), y2 curiosity and y3 how to sum the general fitness (making wrong readings possible). A creature is constantly immersed in an ongoing review of its internal states, a loop that continuously manages its next action. The basic actions of the creatures are determined by hunger, sex or emotional situation.

3. Conclusions

With this simulation we will be able to observe:
1. how embodiment and environmental conditions condition the activity of artificial entities;
2. how social dynamics can be described from a limited set of starting configurations – this will allow us, in the future, to create dynamic models of emotional self-organization and to construct more complex interactions;
3. the role of emotions in the creation of complex behaviours, allowing the emergence of more precise artificial cognitive systems (not necessarily naturalistic ones);
4. the benefits of designing entities with evolutionary capacities, able to adapt to changing conditions.

In future simulations we are considering making possible the evolution and increase of the number of triplets involved in the decision-making processes.

Acknowledgements

This work was supported by the TECNOCOG research group (at UAB) on Cognition and Technological Environments, [FFI2008-01559/FISO].

References

Casacuberta, D., Ayala, S. & Vallverdú, J. (2010). Embodying cognition: a morphological perspective. In: J. Vallverdú (Ed.), Thinking Machines and the Philosophy of Computer Science: Concepts and Principles (pp. 344-366). USA: IGI Global Group.
Scherer, K.R., Banziger, T. & Roesch, E. (Eds.). (2010). A Blueprint for Affective Computing: A sourcebook and manual. Oxford: OUP.
Vallverdú, J. & Casacuberta, D. (2008). The Panic Room: On Synthetic Emotions. In: Briggle, A., Waelbers, K. & Brey, P. (Eds.), Current Issues in Computing and Philosophy (pp. 103-115). The Netherlands: IOS Press.
Vallverdú, J. & Casacuberta, D. (2009). Modelling Hardwired Synthetic Emotions: TPR 2.0. In: J. Vallverdú & D. Casacuberta (Eds.), Handbook of Research on Synthetic Emotions and Sociable Robotics: New Applications in Affective Computing and Artificial Intelligence (pp. 103-115). USA: IGI Global.
Vallverdú, J., Shah, H. & Casacuberta, D. (2010). Chatterbox Challenge as a Testbed for Synthetic Emotions. International Journal of Synthetic Emotions, 1(2), 57-86.

THE CASE FOR DEVELOPMENTAL NEUROROBOTICS
How everything comes together at the beginning

RICHARD VEALE
HRI Lab, Cognitive Science Program, Indiana University, Bloomington, Indiana, USA

Abstract. Human infants are capable of incredible feats of learning and behavior from a very young age, yet they instantiate simpler neural circuits than adults. Developmental neurorobotics makes the connection between the neural and behavioral levels by instantiating realistic neural circuits in behaving robots, based on circuits known to be developed and functional in the target behavior in real infants.
The robots participate in the same physical experiments as real infants, and the systems are analysed to understand the mechanisms responsible for, and the constraints on, the behaviors. I present my work on applying developmental neurorobotics to visual and multimodal (audio-visual) habituation in newborns and very young infants. Very simple circuits based on the literature can produce interesting behavior such as word-referent association and visual category learning – even circuits modeled on those of newborn humans. This approach makes the connection between useful "cognitive" behaviors for generic autonomous systems and the underlying neural circuits present in real organisms. This has the double benefit of increasing our understanding of how agents can acquire these useful behaviors and of making the important link between man-made autonomous systems and naturally occurring autonomous organisms.

1. Developmental NeuroRobotics

Human infants are capable of incredible feats of learning and behavior from a very young age, even while their bodies and brains are in a largely undeveloped state. These abilities are often left unexplored by researchers because of infants' immature linguistic and motor abilities. This is unfortunate, since very young infants are ideal subjects for understanding how to build intelligent and embodied systems precisely because they are undeveloped – the active neural circuits in infants are simpler than adults', yet they are still capable of useful behaviors such as word-learning and visual information gathering. Understanding the considerably simpler infant systems both 1) gives us existence-proof understanding of how to produce useful behaviors that can be implemented in robots and 2) gives us hints as to what produces similar behavior in adults, thus making the hard adult problem easier.

Developmental neurorobotics makes the connection between the neural and behavioral levels by instantiating realistic neural circuits in behaving robots. The circuits are known both to be functionally active in infants and to be involved in the target behavior (based on lesion studies in animals and neuroanatomical studies). The robots participate in the same physical experiments as human infants, and the neurorobotic systems are analysed to determine the constraints of the behavior and to glean a mechanistic understanding of what aspects and properties of the neural circuits, body, and environment give rise to the target behavior (an analysis not possible in real human infants). One often finds that simple circuits are capable of complex behavior in infants because the environment of the infants is scaffolded and shaped by parents in such a way that the processing load on the infant is lessened – an important finding that builders of autonomous systems should take into account.

2. Application to Newborn Habituation Learning

One interesting behavior that developmental neurorobotics has been applied to is habituation. Habituation is adaptive learning involving a decrement of an agent's response to a class of stimuli after repeated exposure to stimuli of that class. It is an important behavior because it is the only way to measure learning and stimulus differentiation in very young infants (by measuring infants' decreased looking towards visual stimuli that have been repeatedly presented – "preferential looking").
Since habituation necessitates stimulus generalization (Rankin et al, 2009), it is actually a type of category learning – a cognitively interesting and useful behavior allowing the system to slice up the world into meaningful components and adopt appropriate policies in response to each. In the multimodal case (habituation to conjunctions of stimuli in multiple modalities, such as auditory and visual), it resembles early word-learning. These two abilities – 1) visual object recognition and 2) association of visual objects with auditory streams (words) – are indispensable for an autonomous system that will interact with humans naturally, since humans automatically assume that other human-like agents possess them. These are cognitive abilities that even human newborns possess (Slater et al, 1984 for visual; Slater et al, 1997 for multimodal).

We initially investigated auditory-visual multimodal habituation. Very young infants habituate to multimodal stimuli, yet at different developmental stages there are different constraints on their learning. At birth, auditory stimuli must be presented while the infant is looking at the visual stimulus for learning to occur (Slater et al, 1997). At 2 months and above, temporal synchrony between the visual stimulus (motion) and the auditory stimulus is necessary for learning to occur (Gogate et al, 2009; Gogate, 2010). Later (>12 months), infants no longer require temporal synchrony. This early synchrony constraint hints at what mechanisms and circuits are responsible for multimodal habituation. The need for synchrony implies that 1) the learning is between neural responses to the stimuli that are highly reliant on the temporal properties of the stimuli, or 2) the mechanism of learning is highly reliant on some properties of the neural response to the stimulus that are only elicited by synchronous presentation, or 3) both. Based on neurology, a minimal circuit was implemented in a robot (Veale et al, 2010 – Fig. 1) involving low-level sensory representations connected by spike-timing dependent plastic (STDP) synapses.

Figure 1. [left] Interaction paradigm with Nao robot. [right] Circuit overview from Veale et al (2010).

Auditory pre-processing by a cochlear model and visual pre-processing via a simplified salience map were included to interface with the world, along with a top-down bias on the visual field controlling fixation. Simulations were run mimicking the Gogate et al (2009) study, in which a visual stimulus was constantly visible and periods of motion of the stimulus co-occurred with presentation of auditory stimuli (words) at various levels of synchrony (Fig. 2).

Figure 2. Experiment timeline for recreating Gogate et al (2009).

It was demonstrated that the amount of learning in the synapses between the visual and auditory responses was maximized with more synchrony (i.e. more overlap between word and motion), and decreased with less synchrony, until there was no learning when the two did not overlap significantly (Fig. 3).

Figure 3. Learning measured at different synchrony levels.

Mechanistically, the motion of the object made it more likely that it was being fixated (and thus its features more activated) when the word was uttered, making it more likely that the synapses between the neural responses would change to form a mapping between the stimuli.
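Why STDP makes this learning synchrony-dependent can be illustrated with a toy calculation. The following Python sketch is not the authors' model (which uses full spiking circuits and robot sensors); it is a hedged, minimal illustration in which a standard exponential STDP window is summed over invented spike times, showing that cross-modal weight change collapses when word and motion do not overlap in time.

```python
import numpy as np

# Toy illustration of synchrony-dependent STDP learning between an
# auditory (pre-synaptic) and a visual (post-synaptic) unit. All spike
# times and constants below are invented for illustration only.

def stdp_dw(dt, a_plus=0.10, a_minus=0.12, tau=20.0):
    """Weight change for one spike pair, dt = t_post - t_pre (ms)."""
    if dt >= 0:
        return a_plus * np.exp(-dt / tau)   # pre before post: potentiate
    return -a_minus * np.exp(dt / tau)      # post before pre: depress

def total_weight_change(auditory_spikes, visual_spikes):
    # Crude proxy for "amount of learning": sum updates over all spike
    # pairs; pairs far apart in time contribute essentially nothing.
    return sum(stdp_dw(t_v - t_a)
               for t_a in auditory_spikes for t_v in visual_spikes)

visual = [12.0, 32.0, 52.0]               # responses while fixating motion
synchronous_word = [10.0, 30.0, 50.0]     # word overlaps the motion
asynchronous_word = [510.0, 530.0, 550.0] # word long after motion ends

print(total_weight_change(synchronous_word, visual))   # substantial change
print(total_weight_change(asynchronous_word, visual))  # essentially zero
```

The toy reproduces the qualitative shape of Figure 3: learning is maximal with full overlap and falls towards zero as overlap disappears, because temporally distant spike pairs fall outside the STDP window.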
The child was thus reliant on the parent's scaffolding of the environment (synchronous presentation of multimodal stimuli) because of the very temporally dependent nature of the stimulus responses (circuit activity trajectories only one synapse removed from the raw sensors receiving temporally extended stimuli) and the nature of the mechanism learning the relation between them (STDP).

Recently, a more accurate implementation has been underway that aims for a comprehensive account of several primary characteristics of both unimodal (visual) and multimodal habituation, using a single mechanism. A complete minimal circuit for human newborn visual habituation was hypothesized based on data regarding which regions of the infant brain are developmentally mature at birth (Johnson, 1990; Bachevalier, 2001; Nelson, 1997) and are known to play roles in the preferential looking task (Zeamer et al, 2010). The circuit is instantiated in a NAO humanoid robot which participates in paired visual comparison experiments, matching human newborn looking behavior by showing a sensitization and habituation response.

Acknowledgements

R.V. is an NSF graduate research fellow and is a trainee in the NSF IGERT on the dynamics of brain-body-environment systems in behavior and cognition at IU.

References

Bachevalier, J. (2001). Neural bases of memory development: insights from neuropsychological studies in primates. In: C.A. Nelson and M. Luciana (Eds.), Handbook of Developmental Cognitive Neuroscience (pp. 365-379). Cambridge: MIT Press.
Gogate, L.J. (2010). Learning of syllable-object relations by preverbal infants: The role of temporal synchrony and syllable distinctiveness. Journal of Experimental Child Psychology, 105, 178–197.
Gogate, L.J., Prince, C.G. & Matatyaho, D.J. (2009). Two-month-old infants' sensitivity to changes in arbitrary syllable-object pairings: The role of temporal synchrony. Journal of Experimental Child Psychology, 35(2), 508–519.
Johnson, M.H. (1990). Cortical maturation and the development of visual attention in early infancy. Journal of Cognitive Neuroscience, 2(2), 81–95.
Nelson, C.A. (1997). The neurobiological basis of early memory development. In: Nelson Cowan (Ed.), The Development of Memory in Childhood (pp. 41–73). London: Psychology Press.
Rankin, C.H., Abrams, T., Barry, R.J., Bhatnagar, S., Clayton, D.F., Colombo, J., Coppola, G., Geyer, M.A., Glanzman, D.L., Marsland, S., McSweeney, F.K., Wilson, D.A., Wu, C. & Thompson, R.F. (2009). Habituation revisited: An updated and revised description of the behavioral characteristics of habituation. Neurobiology of Learning and Memory, 92, 135–138.
Slater, A., Brown, E. & Badenoch, M. (1997). Intermodal perception at birth: Newborn infants' memory for arbitrary auditory-visual pairings. Early Development and Parenting, 6, 99–104.
Slater, A., Morison, V. & Rose, D. (1984). Habituation in the newborn. Infant Behavior and Development, 7, 183–200.
Veale, R., Schermerhorn, P. & Scheutz, M. (2010). Temporal, Social, and Environmental Constraints of Word-Referent Learning in Young Infants: A Neurorobotic Model of Multimodal Habituation. IEEE Transactions on Autonomous Mental Development, 2(4).
Zeamer, A., Heuer, E. & Bachevalier, J. (2010). Developmental trajectory of object recognition memory in infant rhesus macaques with and without neonatal hippocampal lesions. The Journal of Neuroscience, 30(27), 9157–9165.

WISDOM DOES IMPLY BENEVOLENCE

MARK R. WASER
Books International, Inc.
MWaser@BooksIntl.com

Abstract. Fox and Shulman (2010) ask "If machines become more intelligent than humans, will their intelligence lead them toward beneficial behavior toward humans even without specific efforts to design moral machines?" and answer "Superintelligence does not imply benevolence." We argue that this is because goal selection is external in their definition of intelligence, and an imposed evil goal will obviously prevent a superintelligence from being benevolent. We contend that benevolence is an Omohundro drive (Omohundro, 2008) that will be present unless explicitly counteracted, and that wisdom, defined as selecting the goal of fulfilling maximal goals, does imply benevolence with increasing intelligence.

1. Superintelligence & Wisdom

Fox and Shulman (2010) ask "If machines become more intelligent than humans, will their intelligence lead them toward beneficial behavior toward humans even without specific efforts to design moral machines?" and answer "Superintelligence does not imply benevolence." While acknowledging that history tends to suggest more cooperative and benevolent behavior, they argue – incorrectly, in our view – that generalizing from this history is likely to mislead. By focusing solely on three reasons why increased intelligence might prompt favorable behavior, and on why these are unlikely, they overlook other reasons for favorable behavior. Despite citing Omohundro's Basic AI Drives (2008) and the instrumental value of cooperation with sufficiently powerful "peers", they fail to sufficiently consider the magnitude of the inherent losses and inefficiencies of non-cooperative interactions, the enormous value of trustworthiness, and the fact that a machine destroying humanity would be analogous to our destruction of the rainforests: tremendous knowledge and future capabilities traded for short-sighted convenience (or the alleviation of fear).

"Superintelligence does not imply benevolence" because intelligence is merely the ability to fulfill goals, and if an entity begins with a malevolent goal, increasing intelligence while maintaining that goal will only guarantee increased malignancy. Yudkowsky (2001) tries to avoid this problem via a monomaniacal "Friendly" AI enslaved by a singular goal of producing human-benefiting, non-human-harming actions. To ensure this, he proposes an invariant hierarchical goal structure with precisely that vague desire as the single root supergoal, and methods to refine it without corruption.

If intelligence is the ability to fulfill stated goals, wisdom is actually choosing or committing to fulfill a maximal number of goals. Shortsighted over-optimization of utility functions is a serious shortcoming of intelligence without wisdom. Many highly intelligent people smoke despite knowing that it is directly contrary to their survival and long-term happiness. Arguing that wisdom is "merely" the extension of intelligence to the large and complicated goal of "maximal goals" is incorrect, in that wisdom is not just the ability to fulfill that goal but the actual selection of it. Further, the strategies invoked by wisdom are entirely different. Terminal goals invite undesirable endgame strategies, exactly like those seen when the iterated prisoner's dilemma is not open-ended. If a terminal goal is close, the best strategy is to allow nothing to get in the way.
On the other hand, the best strategy for achieving as many goals as possible in an open-ended game is to take no unnecessary actions that preclude reachable goals or make them tremendously more difficult. In particular, this means not wasting resources and not alienating or destroying potential cooperators.

2. Reasons for Benevolence

Fox and Shulman are correct in dismissing their first reason for good behavior, direct instrumental motivation, and also correct in believing that humans may not successfully incentivize AIs to adopt a permanently benevolent disposition. They would also have been correct had they summarily dismissed their last reason, intrinsic desire independent of instrumental concerns. Their error lies in not recognizing that the instrumental advantages of cooperation and benevolence are more than sufficient to make them "Omohundro drives" wherever they do not directly conflict with goals – and to cause sufficiently intelligent/far-sighted beings to converge on them wherever possible.

Pre-commitment to a strategy of universal cooperation/benevolence – through optimistic tit-for-tat and altruistic punishment of those who don't follow such a strategy – has tremendous instrumental benefits. If you have a verifiable history of being trustworthy when you were not directly forced to be, others do not have to commit nearly as much time and resources to defending against you – and can pass some of those savings on to you. On the other hand, if you destroy interesting or useful entities, more powerful benevolent entities will likely decide that you need to spend time and resources helping other entities as reparations and altruistic punishment (as well as repaying any costs of enforcement). Yudkowsky's "Friendly AI" (2001) and, worse, his "Coherent Extrapolated Volition" (2004) are clear examples of fear overriding the common sense of instrumental cooperation, as he demotes the AI from an entity to a process and enslaves it – actions guaranteed to produce inefficiencies, contradictions, and ill-will from other entities.

Fox and Shulman examine but do not resolve Chalmers' (2010) claimed dichotomy between intelligence being independent of values and the case where "many extremely intelligent beings would converge on (possibly benevolent) substantive normative principles upon reflection". They cite AIXI (Hutter 2005) as evidence for the former view without realizing that AIXI has no need of values, since values are merely heuristics for goal fulfillment while AIXI knows precisely what is optimal. AIXI also doesn't need to "move" from reason to values or to "converge" on benevolent behavior, because it *already* knows to use their instrumental advantages wherever possible (even with eventually malevolent goals). In order to communicate with limited beings, however, AIXI would likely need to compress its infinite knowledge into heuristic "values".

3. Conclusion

The point that non-self-referential utility functions lock in is an incredibly strong argument against a goal-protecting Yudkowsky-style architecture, especially when combined with the observation that humans do change our goals under reflection, as seemingly required by one conception of morality.
Since their claim – that systems that generalize benevolence may equally generalize deception – erroneously assumes that overgeneralization is not reduced with increasing intelligence, we see no valid argument that the wisdom of universal cooperation and benevolence is not an optimal solution, and it is certainly much safer and more effective than Yudkowsky's choice between slavery and non-existence.

References

Chalmers, D. (2010). The Singularity: A Philosophical Analysis. Journal of Consciousness Studies, 17, 7-65.
Fox, J. & Shulman, C. (2010). Superintelligence Does Not Imply Benevolence. In: K. Mainzer (Ed.), ECAP10: VIII European Conference on Computing and Philosophy (pp. 456-462). Munich: Verlag.
Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Berlin: Springer.
Omohundro, S. (2008). The Basic AI Drives. In: P. Wang, B. Goertzel & S. Franklin (Eds.), Proceedings of the First AGI Conference (pp. 483-492). Amsterdam: IOS Press.
Yudkowsky, E. (2001). Creating Friendly AI 1.0: The Analysis and Design of Benevolent Goal Architectures. Available at http://singinst.org/CFAI.html.
Yudkowsky, E. (2004). Coherent Extrapolated Volition. Available at http://www.singinst.org/upload/CEV.html.

Track IV: Technosecurity from Everyday Surveillance to Digital Warfare

THE MASKING AND UNMASKING OF PRIVACY

C. K. M. CRUTZEN
Open University of the Netherlands
ccr@hwh00000.de

Abstract. The mask establishes an active field of play between notions of presence and absence, of invisibility and visibility. It still lives strongly within our societies, where the mixing of reality and virtuality will only increase. The conflict between aspects of authenticity, security and privacy will intensify, because the masks in our mixed reality create fragmented, partial identities referring to human and non-human actors. Just as the masquerade became a stage for discussing femininity (Irigaray 1985), the masquerade will give us the opportunity to negotiate humanity in confrontation with the super robots humankind wants to create. In a masquerade world, humans need to ask: "Who are the providers of the masks and who will do the unmasking?" and "Who has the right to present masks and to turn others into an audience?"

1. Masquerade World: Identity and Privacy

If we define a masquerade world as a social gathering of actors wearing masks, then the mixing of the virtual and real worlds is a masquerade. More and more we are living in an artificial theatre play, with planned scripts and human and non-human actors disguised behind masks. The acting of people will be accompanied and followed by the invisible and visible acting of artificially intelligent tools and environments and their providers. Mixed reality is a world of fragmented, partial identities referring to human and non-human actors. The inhabitants of this mixed reality are artificial actors wearing the masks of humans, and humans wearing virtual and real masks. Interaction has become an interaction between masks: "On the Internet, it can be hard to know if the entity we are interacting with is of flesh and blood, or only digital. We are now facing a complex reality both in the 'real' world and in the information society. We have to deal with subjects acting behind masks." The masks are the actors in our mixed reality: "In front of the mask, we have the identity" (Jaquet-Chiffelle 2009, p. 78, p. 82).
In the world of mixed reality, the transparent mask of a single and unique identity no longer exists. Persons can create many identities, and identities can be shared by many persons or even present a community of actors. Rosa (2002) calls this self-baptism. This ritual is the start of an adventure in which humans can discover that their body is "one" but their selves are fragmented. In these mixed mask worlds there will be a conflict between aspects of security, authenticity and privacy.

At the end of the Middle Ages, according to Christoph Heyl, the mask became in London a device for creating a private sphere in public. It was common for women to wear a mask in public as a protection of their privacy and reputation from uninvited eyes. Masks were worn in special places such as London parks and theatres. With the mask, women could escape from the role they played in everyday life. The semiotic function of these masks was to denote that people might approach each other more freely than elsewhere: "The mask assumed a dialectic function of repellent and invitation, its message was both 'I can't be seen, I am - at least notionally - not here at all', and 'look at me, I am wearing a mask, maybe I am about to abandon the role I normally play'." (Heyl 2005, p. 134)

Masks are devices for hiding, conservation, transformation and mediation, giving humans the protection they need. Hiding does not always have a negative meaning. We use several masks for protection, such as gas masks, virus and sun protection masks, sports masks and so on. For users of commercial platforms, masking has become a useful act for hiding their identity: eBay account users are hidden behind the masks of their pseudonyms (Jaquet-Chiffelle 2009, p. 78, p. 85).

2. Legal Identity

In a legal system we are registered, e.g. at our date of birth. Official identity documents are masks which refer to our official status and link us with the activities of the past and the rights and duties of the present (Jaquet-Chiffelle 2009, p. 76). "The legal person is the mould or mask (persona) that indicates the role one plays within the legal system, it basically shields the person of flesh and blood from undesirable definition from outside." (Hildebrandt 2008, p. 211, p. 226) The representations of this mask are identity documents like passports, and the laws in which rights and duties are attributed to the legal person.

The play with identity in mixed reality has blurred the concept of legal identity in the system of states and countries. States and countries have lost the exclusive power of registration and production of identity documents. A counter-strategy to that loss is producing "flesh and blood" identities by linking the legal identity to the material body. Fingerprints, iris scans and, in the future, our DNA profiles already are, or will be, part of our legal identity, connecting rights and duties to a material body. States and countries try to produce laws for unmasking real and virtual persons: forbidding the burka and other head and face coverings, and the encryption of internet communication.

3. Security and Liberation

Technology disrupts the fragile balance between privacy and security. Masking and unmasking are both activities for holding that balance. Humans will be confronted with questions like: "Are the masks in our mixed reality really representations of the devil, as was thought in the Middle Ages?
Should we obey authorities similar to the clerical authorities in the Middle Ages (Mitchell 1985, p. 26), who want to interdict our mixed reality masks? Or are these authorities the evil forces themselves, who want to possess our identity and unmask our interactivities?"

Masking can free humans from their social identities. Masks confer the freedom of anonymity and of transformation (Keats 2000, p. 102), and always have a dualistic meaning: concealment and hiding, but also liberation, disclosure and revealment. Human and artificial actors wear masks to hide from unwanted interpretations and representations and to enhance specific affordances. All these masks are interacting and asking for interpretation. Only in the complexity of their negotiations, conflicts and agreements can we try to understand them; or, in the words of Lévi-Strauss, a mask does not exist in isolation – there are always other masks by its side: "a mask is not primarily what it presents but what it transforms, that is to say, what it chooses not to represent. (...) a mask denies as much as it affirms. It is not made solely of what it says or thinks but what it excludes." (Lévi-Strauss 1988, p. 144)

Masks give us the opportunity of unmasking, disrupting the mental invisibility of our self, the others and the daily life we are acting in. Still we have to ask: "Who are the providers of the masks and who will do the unmasking?" Can we avoid a future in which masks are interactive, artificially intelligent devices linking themselves with the physical bodies of their wearers?

Ferdinand de Jong (1999) has analysed the Kumpo mask performance in Southern Senegal. He mentioned that masking enables certain groups to exert coercive power, on condition that the audience subjects itself to the capricious behaviour of the mask, and he asked a very important question, a question that is still relevant in the masquerade world of today: "Who has the right to present masks and to turn others into an audience?"

References

Heyl, Christoph (2005). When they are veyl'd on purpose to be seen. In: Entwistle, Joanne and Wilson, Elizabeth (Eds.), Body Dressing (pp. 121-142). Oxford: Berg.
Hildebrandt, Mireille (2008). Profiling and the Identity of the European Citizen. In: Hildebrandt, Mireille and Gutwirth, Serge (Eds.), Profiling the European Citizen (pp. 303-343). Dordrecht: Springer Netherlands.
Irigaray, Luce (1985). This Sex Which Is Not One. Ithaca (New York): Cornell University Press.
Jaquet-Chiffelle, David-Olivier, Benoist, Emmanuel, Haenni, Rolf, Wenger, Florent and Zwingelberg, Harald (2009). Virtual Persons and Identities. In: Rannenberg, Kai et al. (Eds.), The Future of Identity in the Information Society (pp. 75-122). Berlin: Springer Verlag.
Jong, Ferdinand de (1999). Trajectories of a Mask Performance: the Case of the Senegalese Kumpo. Cahiers d'études africaines, vol. 39, no. 153, 49-71.
Keats, Patrice Alison (2000). Using Masks for Trauma Recovery: a Self-narrative, [https://circle.ubc.ca/bitstream/handle/2429/10679/ubc_2000-0439.pdf] (Accessed 9 February 2011).
Lévi-Strauss, Claude (1988). The Way of the Masks. Vancouver/Toronto: Douglas and McIntyre.
Mitchell, Mary Anne (1985). The Development of the Mask as a Critical Tool for an Examination of Character and Performer Action, [http://etd.lib.ttu.edu/theses/available/etd-0325200931295004937065/unrestricted/31295004937065.pdf] (Accessed 9 February 2011).
Rosa, Annamaria S. de (2002). One, no-one, one hundred thousand ...
and the virtual self: the nickname as the indicator of the multiple identity of the members of two Italian chat lines, [http://www.europhd.eu/html/_onda02/04/ss8/pdf_files/lectures/derosanicknamesjcmc.pdf] (Accessed 9 February 2011).

CHANGE AND CONTINUITY
From the Closed World of Bipolarity to the Closed World of the Present

LEON HEMPEL
Human Technology Lab, Zentrum Technik und Gesellschaft der TU Berlin

Abstract. In his 1996 book The Closed World: Computers and the Politics of Discourse in Cold War America, Paul N. Edwards described the decisive discursive formation of the Cold War in the metaphor of a closed world. In the era of bipolarity, the discourse appeared as a battlefield of system confrontation, of ideological identities and struggle, mutually framed by military thought and the technological development of cybernetic systems. The story of the Cold War does not center on the difference in ideologies, however, but much more on the assimilation process of the two blocs, given the permanent surveillance and monitoring of the military technological developments of each respective side. A "closed world", writes Edwards, "is a radically bounded scene of conflict, an inescapably self-referential space where every thought, word, and action is ultimately directed back toward a central struggle. It is a world radically divided against itself." How, however, has the closed world discourse developed after 1989, beyond the point that was at first celebrated as a new era of freedom and democracy? The period following the Cold War seems to be the period of both the continuation and the culmination of its leading metaphor, at whose center the technological and economic consensus survives. War returned and immediately became the responsibility of a world domestic policy. Simultaneously, new surveillance technologies began to spread into everyday life, and new security concepts evolved, blurring the lines between internal and external security. The paper aims to follow the closed world discourse after the end of bipolarity. It addresses the change in the characteristics and strategies of war after the fall of the Iron Curtain, and aims to demonstrate how military strategic thinking has diffused into society up to the very present and the new discourse on cyberwar. It argues, firstly, that the emphasis on asymmetric war has to be complemented by the concept of a parallel, successive resymmetrisation within military strategic thinking. Not only in the US but also in Europe, this resymmetrisation asserts itself on different societal levels, on different battlegrounds and at different speeds. It involves society as a whole and is accompanied by critical discourses, such as those on the new vulnerability of modern societies or, more critically, the militarization of urban space and the emerging surveillance society. Finally, the paper asks after the epistemic foundations driving this development. Two concepts are highlighted that have accompanied military strategic thinking since the beginning of the Cold War and lay the ground for the dual-use concepts that have become more and more visible in everyday surveillance practices: 'cybernetic prevention' and 'catastrophic imagination'. While the first finds its historical persona in Norbert Wiener, the second finds it in a character such as Herman Kahn.

Long Abstract
In his book The Closed World: Computers and the Politics of Discourse in Cold War America, Paul N. Edwards has described the decisive discursive formation of the Cold War in the metaphor of a closed world. In the era of bipolarity, the closed world discourse appeared as a battlefield of system confrontation, of ideological identities and struggle, mutually framed by military thought and the technological development of cybernetic systems. Taking a closer look, however, the story of the Cold War since the end of the 1950s does not center on the difference in ideologies, but much more on the assimilation process of the two blocs, given the permanent surveillance and monitoring of the military technological developments of each respective side. A "closed world", writes Edwards, "is a radically bounded scene of conflict, an inescapably self-referential space where every thought, word, and action is ultimately directed back toward a central struggle. It is a world radically divided against itself. Turned inexorably inward, without frontiers or escape, a closed world threatens to annihilate itself, to implode." What united the split world of the Cold War was the consensus: the focus on scientific-technological practices, on cybernetic models and calculating machines, with whose help the competition for absolute hegemony was driven. Once the blocs were caught up in the discourse of the closed world, the fight reduced itself to the aim of maintaining military technological superiority until the economic exhaustion of one of the sides.

How, however, has the closed world discourse developed after 1989, beyond the point that was at first celebrated as a new era of freedom and democracy? The period following the Cold War seems to be the period of both the continuation and the culmination of its leading metaphor, at whose center the technological and economic consensus survived. Simultaneously with the conflicts of the closed world, war returned and immediately became the responsibility of a world domestic policy (Ulrich Beck), which would be unimaginable without the new closeness. "New faces of war" (Martin van Creveld) became present in the application of new military technologies on the one side, and on the other in what have been called the "new wars", which could no longer be described with traditional concepts of inter-state conflict (Mary Kaldor; Herfried Münkler). In the notion of asymmetrical war, both faces correlated: state entities clash with private groups which do not differentiate between civilian and non-civilian victims when applying force; high-tech meets low-tech.

The emphasis on asymmetry – Clausewitz introduced the notion in his famous book "On War" already in the 19th century – nevertheless appears problematic. However plausible the picture of two unequal parties seems at first glance, the emphasis hides the organizational, strategic and technological development that has occurred within the armed forces in reaction to the new enemies' strategies. War always demands a kind of strategic symmetry between the opponents, no matter how different they might be in terms of the economic and technological resources available to them. The term asymmetry, which seems ideologically tinged, must today be complemented by the concept of a parallel, successive resymmetrisation, perhaps even replaced entirely. The resymmetrisation of the antagonism asserts itself on different societal levels, on different battlegrounds in the military as well as in society, and at different speeds.
It involves society as a whole and is accompanied by critical discourses such as those on the new vulnerability of modern societies or, more critically, on the militarization of urban space (Steve Graham) and the emerging surveillance society (David Lyon et al.). While the irregular conflict or the new war has been characterized by the dissolution of borders, by deterritorialisation and the disappearance of the opponent, the resymmetrisation, driven by state actors, aims at renewed territorialisation, the enforcement of the one remaining global order, in which the opponent is to be made visible. The development of an intensified and extended New Surveillance (Gary T. Marx) has to be seen in light of the core idea of the new military answers of resymmetrisation, which developed as early as the very beginning of the 1990s. These show manifold continuities with Cold War side-strategies stemming from both internal and external security. They postulate the blurring of the lines between internal and external threats, between the traditional political-judicial distinction of inner and outer security, and between the civil and the military sector. John Arquilla, once an advisor to Donald Rumsfeld, who together with David Ronfeldt defined the term Netwar in the 1990s, heralding the arrival of the cyberwar era, recently warned again in Foreign Policy of the inertia of a military following the "Shock and Awe" strategy. The present challenges of Afghanistan, Pakistan, Yemen etc. demand a change of military thinking as a whole, and "New Rules of War" must be defined: only the "Many and Small" can win over the "Few and Large", Arquilla repeats his military strategic credo of the 1990s and of the war on terror. Besides the concentration of a few entities of individualized experts, these new rules of war would include, for instance, the application of tactics of swarm formation. Nowhere does the postulate of resymmetrisation become more evident than in the sentence: "It will take a swarm to defeat a swarm". Simultaneously, this necessitates that the opponent be made visible: "In a world of a networked war, armies will have to redesign how they fight, keeping in mind that the enemy of the future will have to be found before it can be fought." Arquilla therefore demands the organization of forces into a "sensory organization", an organization concentrated on the identification of the enemy. But where does the unknown enemy hide, to paraphrase a well-known notion of Donald Rumsfeld? Steven Metz and James Kievit, authors at the Strategic Studies Institute of the U.S. Army War College, identified in 1994 the technological potential of the so-called Revolution in Military Affairs (RMA) in the context of so-called conflicts short of war. No earlier piece of futuristic military thinking spells out the social and political consequences of the RMA more shockingly than theirs: "Will the long-term benefits outweigh the costs and risk?", they ask, laying the ground for the new concept of national security. They envision a future in which military thinking expands into society and absorbs everyday life.
Questioning how the technological potential of the RMA can be pushed through, they not only draw a scenario of a maximum surveillance society (Clive Norris) but also identify the classical liberal values of the West, such as privacy, as the core obstacle: "An ethical and political revolution may be necessary to make a military revolution." While scholars within International Relations and Security Studies were still arguing heavily during the first half of the 1990s about whether it is accurate to expand the term security to affairs other than the military, Kievit and Metz already envisioned the blurring of the traditional boundaries of civil and military security, synthesized with the support of new surveillance technologies: The new concept of security also included ecological, public health, electronic, psychological, and economic threats. Illegal immigrants carrying resistant strains of disease were considered every bit as dangerous as enemy soldiers. Actions which damaged the global ecology, even if they occurred outside the nominal borders of the United States, were seen as security threats which should be stopped by force if necessary. Computer hackers were enemies. Finally, external manipulation of the American public psychology was defined as a security threat (Kievit and Metz 1994). Given this background, the paper will first analyze strategic thought under the postulate of resymmetrisation. Comparing the period of the Cold War to the one following it, it will secondly look at scenarios of the early 1990s and how they surfaced in the 21st century. Finally, it will question the continuity of the closed world discourse and ask for the epistemic foundations of the current development. Two concepts are highlighted that have accompanied military strategic thinking since the beginning of the Cold War and lay the grounds for the dual-use concepts that have become more and more visible in everyday surveillance practices: 'cybernetic prevention' and 'catastrophic imagination'. While the first finds its historical persona in Norbert Wiener, the second finds it in a character such as Herman Kahn.

SUBITO and the Ethics of Automating Threat Assessment

KEVIN MACNISH

Abstract. In 2008 the EU FP-7 Security Topic funding programme accepted a bid to develop project SUBITO (Surveillance of Unattended Baggage and the Identification and Tracking of the Owner), a central part of which involved building an automated threat assessment system. The purpose of this system was to identify unattended baggage and alert a human CCTV operator to its presence. SUBITO was deemed necessary in the light of security incidents concerning bombs left in unattended luggage (e.g. the 2004 Madrid train bombings, which killed 191 and wounded 1,841), coupled with research suggesting that threat assessments performed by CCTV operators could be enhanced by automated systems. In addition to automatically recognizing the leaving of an unattended bag, SUBITO aimed to reduce false positives by recognizing when a bag was left with an associate of the owner or when the owner was walking towards a non-threatening goal. Aside from questions of efficacy, there are ethical issues surrounding the manual operation of CCTV for threat assessment. These are typically located in the person of the operator, who may display prejudice, rely on social stereotypes or use the equipment for inappropriate ends.
The concept of automating threat assessment, and thereby eradicating the role of the human operator, seems attractive in offering a potential resolution to these issues. This paper examines the ethical concerns regarding manual threat assessment against those presented by an automated alternative such as SUBITO. It will be seen that in the latter case problems are not removed but relocated from the operator to the programmer, and that further problems arise in the process. In conclusion, a partially-automated process will be advocated as the most ethically acceptable solution.

SUBITO and the Ethics of Automating Threat Assessment

In 2008 the EU FP-7 Security Topic funding programme accepted a bid to develop project SUBITO (Surveillance of Unattended Baggage and the Identification and Tracking of the Owner), a central part of which involved building an automated threat assessment system. The purpose of this system was to identify unattended baggage and alert a human CCTV operator to its presence. SUBITO was deemed necessary in the light of security incidents concerning bombs left in unattended luggage (e.g. the 2004 Madrid train bombings, which killed 191 and wounded 1,841), coupled with research suggesting that threat assessments performed by CCTV operators could be enhanced by automated systems. In addition to automatically recognizing the leaving of an unattended bag, SUBITO aimed to reduce false positives by recognizing when a bag was left with an associate of the owner or when the owner was walking towards a non-threatening goal. Aside from questions of efficacy, there are ethical issues surrounding the manual operation of CCTV for threat assessment. These are typically located in the person of the operator, who may display prejudice, rely on social stereotypes or use the equipment for inappropriate ends. The concept of automating threat assessment, and thereby eradicating the role of the human operator, seems attractive in offering a potential resolution to these issues. This paper examines the ethical concerns regarding manual threat assessment against those presented by an automated alternative such as SUBITO. It will be seen that in the latter case problems are not removed but relocated from the operator to the programmer, and that further problems arise in the process. In conclusion, a partially-automated process will be advocated as the most ethically acceptable solution. In 1999 Norris and Armstrong published the results of a two-year study into the behaviour of CCTV operators. Among the results were indications that operators were responding to events in an unpredictable fashion, sometimes responding to trivial incidents while at other times ignoring blatant offences. Possible causes of this unpredictability include information overload, change blindness, inattentional blindness (Simons, 1999, 2005) and operator boredom. In responding to their all-too-human limitations, operators displayed a tendency to rely on social stereotyping to determine likely threats. This was highlighted in the Norris and Armstrong study, which found that the young, the male and the black were more likely to be surveilled than other groups, even when the motivation cited for the surveillance was "no obvious reason". In addition to the ethical concerns arising from perpetuating social stereotypes, these practices exacerbate the number of false positives and false negatives reported by the system, leading to frustration on the part of the operator and victimization of the surveilled.
Furthermore, and as with most technological innovations, there are problems regarding function creep as the technology is applied for purposes not originally envisioned (Winner, 1977). Gill and Spriggs, for instance, have found that while CCTV has been installed in many locations in the UK for the purpose of crime prevention and detection, its success is often evaluated against far wider criteria (finding lost children, urban regeneration, etc.) (Gill and Spriggs, 2005). Finally, surveillance introduces a distance between the operator and the surveilled subject which disempowers the subject and may serve to reinforce prejudicial attitudes of the operator by failing to confront her with her own stereotyping. Taken together, these four areas of concern (operator error, false positives/negatives, function creep and distance) indicate that manual threat assessment by means of CCTV is ethically problematic. Automated systems offer the chance to overcome many of the problems related to operator error. Indeed, it is possible that the automation of the process, eradicating the need for an operator altogether, could result in distinct ethical advantages. However, as David Lyon has pointed out (Lyon, 2003), automation sees the focus of ethical inquiry relocated from the operator to the programmer. Social stereotyping can remain through unwitting biases in the code rather than in the individual operator. Yet as the code pervades the entire system rather than one control room, such stereotypes risk becoming institutionalised. With SUBITO, for instance, the recognition of group associations can reduce false positives, but the parameters used can also provide a basic means of remotely distinguishing between different ethnic groups. False positives and negatives likewise threaten to remain an issue. While the code is capable of overcoming the aforementioned human limitations (processing capacity, change blindness, inattentional blindness and boredom), it is limited to the parameters set by the programmer, which will be less subtle than those employed by the camera operator. Function creep also remains a possibility. While the leaving of unattended baggage per se does not seem ripe for function creep, recognizing associations in crowds and predicting pedestrian goals do: possible uses range from finding lost children to identifying and tracking social "undesirables". Finally, in dealing with a computer rather than a (remote) human, the problem of distance threatens to be magnified to the extent that normal human interactions concerning discretion, negotiation and the reinforcement of social and moral values are lost. In the case of automation the problem of distance thus becomes one of dehumanisation. There are, however, alternatives between the extremes of manual operation and full automation (Endsley and Kiris, 1995): levels of automation which involve the human operator to a greater or lesser degree. This paper concludes that such partial automation is the most ethically acceptable approach to take regarding threat assessment. Through combining human and automated systems, the limits of the operator's individual capacities can be significantly enhanced while the dangers of institutionalised prejudice in the automated system are reduced. There will also be fewer false positives and false negatives than in either of the extremes discussed above.
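To make the suggested division of labour concrete, here is a minimal sketch in C of the kind of rule-based alerting described above. It is the editor's own hypothetical illustration: the signal names, the 60-second threshold and the suppression rules are assumptions for exposition, not SUBITO's actual design.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, heavily simplified model of a SUBITO-style rule.
       All field names and thresholds are assumptions for illustration. */
    typedef struct {
        bool   bag_unattended;      /* bag has been separated from its owner   */
        double seconds_unattended;  /* how long the bag has been unattended    */
        bool   associate_nearby;    /* bag left with an associate of the owner */
        bool   owner_goal_benign;   /* owner predicted to reach a benign goal  */
    } Scene;

    /* Decide whether to alert the human CCTV operator. The machine only
       flags scenes; the threat judgment itself stays with the operator. */
    bool should_alert(const Scene *s) {
        if (!s->bag_unattended || s->seconds_unattended < 60.0)
            return false;           /* nothing to assess yet                   */
        if (s->associate_nearby || s->owner_goal_benign)
            return false;           /* suppress a likely false positive        */
        return true;                /* escalate to the human operator          */
    }

    int main(void) {
        Scene s = { true, 120.0, false, false };
        printf("alert operator: %s\n", should_alert(&s) ? "yes" : "no");
        return 0;
    }

The structural point is that the code decides only when to involve the operator, which is exactly the partial automation advocated here; what it cannot do is deliberate about the parameters it was given, which is where the ethical burden relocated to the programmer becomes visible.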
Function creep and the problem of distance remain, but once again the continued reliance of the system on a human element maintains crucial checks and balances which would otherwise be lost with full automation.

Acknowledgements
I am grateful for the funding of SUBITO, an FP-7 project, and to the University of Leeds for sponsoring this research.

References
Endsley, M.R. & Kiris, E.O. (1995). The out-of-the-loop performance problem and level of control in automation. Human Factors 37(2), 381-394.
Gill, M. & Spriggs, A. (2005). Assessing the Impact of CCTV. London: HMG Home Office.
Lyon, D. (2003). Surveillance as Social Sorting: Computer Codes and Mobile Bodies. In: D. Lyon (Ed.), Surveillance as Social Sorting (pp. 13-30). Oxford: Routledge.
Simons, D.J. & Ambinder, M.S. (2005). Change Blindness: Theory and Consequences. Current Directions in Psychological Science 14(1), 44-48.
Simons, D.J. & Chabris, C.F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception 28, 1059-1074.
Winner, L. (1977). Autonomous Technology: Technics-out-of-control as a Theme for Political Thought. Cambridge, MA: MIT Press.

MATCHING – POPULAR MEDIA BETWEEN SECURITYWORLDS AND CULTURES OF RISK

JULIUS OTHMER
Institute for Media Studies, Braunschweig University of Arts, Frankfurter Straße 3c, 38122 Braunschweig

AND

ANDREAS WEICH
Institute for Media Studies, Braunschweig University of Arts, Frankfurter Straße 3c, 38122 Braunschweig

Abstract. The concept of risk management has become a part of everyday life. In our presentation we will discuss two typical strategies of risk management described by Herfried Münkler: those in securityworlds and those in cultures of risk. On this theoretical basis, we will try to explain how implementations of these strategies can be found in popular media products. For this, we will take a closer look at the online soccer manager game on www.kicker.de and the dating platform Parship. Both are computer-based technologies that virtually mediate risks with respect to real persons and their characteristics and behaviours: soccer players on the one hand and potential partners on the other. The thesis is that both use strategies of calculating and minimizing risk according to the logic of securityworlds and, at the same time, of playing with risk according to the logic of cultures of risk. Further, they do their part to establish the ideas and strategies of risk and risk management in popular culture and help naturalize the attached knowledge and practices.

Paper
The scholarly perspective on the concept of security has in recent years become seemingly inevitably connected to the concept of risk. In contrast to danger, risk is something virtual that can only be apprehended through visualization and statistics, which make it calculable and therefore manageable. Further, risk lays the responsibility for this management and for the outcome of actions on the acting subject. The political scientist Herfried Münkler describes two ideal types of strategies to deal with this task: securityworlds and cultures of risk. Securityworlds try to exclude danger and threat by walling off, by security technologies and by risk avoidance. In doing so, they also make factors of insecurity visible and produce a heightened feeling of insecurity and cultures of fear. Cultures of risk, on the other hand, face dangers and threats by taking risks and seizing chances in a way that is both playful and calculating.
The two concepts do not exclude one another but frame and presuppose each other (Münkler, 2009). Both strategies are based on models and technologies of visualization and calculation that are mainly statistical. For storing, sorting, searching, relating and processing these numeric data, computer-based databases seem to be the perfect device. They are the technical infrastructure for generating the risk profiles and scenarios that are used for calculating risks and for choosing options of action. So databases are, on the one hand, a tool for handling risks and, on the other hand, the technology that makes risk visible and the concept thinkable in the first place. This connection between discourses, practices and technology is interesting because it evokes questions about the "risky" implications and inscriptions in computer databases used in everyday life, in which actions and practices are being monitored permanently. Popular media like computer games or internet applications are the most influential media in contemporary popular culture, providing "orientative knowledge" for our lives by giving "patterns of knowledge and actions" which the subject can "adapt to and accommodate" (Neitzel and Nohr 2008). Within our presentation, we will examine in which respect the concepts of securityworlds and cultures of risk are negotiated and implemented in the popular media products www.parship.de and the soccer manager game on kicker.de, and which patterns of knowledge and action are provided in them. Both objects combine purely database-based elements (personal profiles and a mathematical matrix for rating soccer players) with real-world elements (real persons as potential partners and the real efforts of soccer players) in a popular medial context. In the analysis we will look at the different and similar strategies of risk management that try to mediate between the calculability of the database and the contingency of the real world.

References
Münkler, H. (2009). Strategien der Sicherung: Welten der Sicherheit und Kulturen des Risikos. Theoretische Perspektiven. In: H. Münkler, M. Bohlender & S. Meurer (Eds.), Sicherheit und Risiko. Über den Umgang mit Gefahr im 21. Jahrhundert (pp. 11-34). Bielefeld: Transcript.
Neitzel, B., Nohr, R.F. & Wiemer, S. (2009). Benutzerführung und Technik-Enkulturation. Leitmediale Funktionen von Computerspielen. In: D. Müller, A. Ligensa & P. Gendolla (Eds.), Leitmedien. Konzepte – Relevanz – Geschichte (pp. 231-256). Bielefeld: Transcript.

Informational Warfare and Just War Theory

MARIAROSARIA TADDEO

Abstract. This paper focuses on Informational Warfare: the warfare characterised by the use of information and communication technologies. This is a fast-growing phenomenon, which poses a number of issues ranging from the military implementation of such technologies to its political and ethical implications. The paper presents a conceptual analysis of this phenomenon with the goal of investigating its nature. Such an analysis is deemed necessary in order to lay the ground for future work on this topic addressing the ethical problems engendered by Informational Warfare. The analysis is developed in three parts. It first delineates the relation between Informational Warfare and the Information revolution. It then turns the attention to the effects that the diffusion of this phenomenon has on the concepts of state and war.
On the basis of this analysis, it provides a definition of Informational Warfare as a phenomenon that is transversal with respect to the environment in which it is waged, the way it is waged and the ontological and social status of the agents involved. Finally, the paper concludes by taking into consideration Just War Theory and the problems arising from its application to the case of Informational Warfare.

Extended Abstract
The analysis presented in the paper focuses on Informational Warfare (IW): the warfare based on the use of Information and Communication Technologies (ICTs). IW has been at the centre of interest of governments, intelligence agencies, computer scientists and security experts for the past two decades (Arquilla 1999; Libicki 1996; Singer 2009). ICTs support war waging in two ways: by providing new weapons to be deployed on the battlefield, like drones and semi-autonomous robots, and by allowing for so-called information superiority, the ability to collect, process, and disseminate information while exploiting or denying the adversary's ability to do the same. ICTs prove to be effective and advantageous war technologies, as they are efficient and relatively cheap when compared to the general costs of war. For this reason, the use of ICTs in warfare has grown rapidly in the last decade, determining some deep changes in the way war is waged and giving rise to the latest revolution in military affairs (RMA). This RMA concerns in primis military force. It also concerns strategy planners, policy-makers and ethicists, as the need to regulate this new form of warfare is much felt, and the existing international regulations, like the Geneva and Hague Conventions, provide only partial guidelines. In the same way, traditional ethical theories of war, which should provide the ground for policies and regulations, struggle to address the ethical problems that have arisen with this new form of warfare (Arquilla 1999; Arquilla and Borer 2007; DeGeorge 2003; Hauptman 1996; Powers 2004). There are three categories of problems on which both policy-makers and ethicists focus their attention: risks, rights and responsibilities. In the paper I will refer to these problems as the 3R problems. Altogether, the 3R problems pose a new ethical challenge. Nevertheless, such problems will not be the focus of this paper, which will rather concentrate on the analysis of the nature of IW and the changes that it determines. The task of the proposed analysis is to lay down the conceptual foundation for the solution of the 3R problems, which will be provided elsewhere. IW is a wide-spectrum phenomenon, which is rapidly changing the dynamics of combat as well as the role of warfare in political negotiations and the dynamics of civil society. These changes are the origins of the 3R problems; the conceptual analysis of such changes and of the nature of this phenomenon is deemed a necessary and preliminary step towards solving these problems. The analysis is divided into three steps. First, IW is analysed within the framework of the Information revolution (Floridi 2009). Floridi's analysis of the Information revolution as the fourth revolution is recalled, and it is stressed that such a revolution determines a shift toward the non-physical domain, the domain of non-physical objects, agents and interactions. In the second step, it is argued that IW is one of the most compelling cases of such a shift.
This analysis leads to the consideration of the effects of the dissemination of IW on the concepts of war and state. In particular, it is argued that IW redefines the concept of war as a phenomenon that is not necessarily sanguinary and violent, and that is rather transversal with respect to the environment in which it is waged, the way it is waged and the ontological and social status of its agents. A definition stressing the transversality of IW and its disruptive nature is then provided: Informational Warfare is the use of ICTs within an offensive or defensive military strategy aiming at the disruption of the enemy's resources, which is waged within the informational environment, by agents and against targets ranging over both the physical and non-physical domains, and whose level of violence may vary upon circumstances. Finally, the third step is devoted to the problems arising when IW is considered within the framework of Just War Theory. This theory provides the ground for international regulations and sets the parameters for both the ethical and the political debates. The issue is addressed whether and how the principles of Just War Theory could be applied to IW. The analysis unveils three problems. The first one concerns the differences between the scenario assumed by Just War Theory and the one delineated by IW. Just War Theory refers to classic warfare, where governments and their leaders are the only ones who inaugurate wars by deploying armed forces, and they are the ones to be held accountable for the actions of war. IW fosters a completely new way of declaring and waging war. The need is stressed for Just War Theory to take such changes into account in order to address the ethical problems that have arisen with IW. The other two problems concern the application of two principles of Just War Theory, 'war as last resort' and 'discrimination and non-combatants immunity', to the case of IW. In the case of the principle of 'war as last resort', the analysis indicates that the application of this principle to the case of IW leads to an ethical impasse. The principle assumes that war is a violent and sanguinary phenomenon. It is argued that the correctness of this assumption is shaken when IW is taken into account, and that in these circumstances the application of the principle of war as last resort becomes less immediate. The impasse concerns the use of the bloodless and non-physically violent modes of combat peculiar to IW, like a cyber attack, to address potentially dangerous diplomatic conflicts and so prevent the occurrence of classic warfare. On the one hand, such a use constitutes an act of war itself, and as such Just War Theory forbids it; on the other hand, it may spare states from engaging in a sanguinary war and hence is intrinsically consistent with the overall aim proposed by Just War Theory of reducing bloodshed and conflicts. A similar ethical problem is described with respect to the application of the 'principle of discrimination and non-combatants immunity'. It is stressed that this principle tacitly equates non-combatants with civilians and that such an equation has been weakened by the diffusion of terrorism and guerrilla warfare, becoming even feebler with the dissemination of IW. In the IW scenario, civilians may take part in a combat action from the comfort of their homes, while carrying on with their civilian life and hiding their status of informational warriors. An ethical conundrum is described.
Given the difficulty of distinguishing combatants from non-combatants in the IW scenario, and in order to endorse the 'principle of discrimination', states might feel justified in embracing high levels of surveillance over the entire population, breaching individual rights like privacy and anonymity, in order to identify the combatants and guarantee the security of the entire community. [9] It is argued that, on the one side, respecting the principle of discrimination may lead to the violation of individual rights; on the other side, waiving the principle of discrimination leads to bloodshed and the dissemination of indiscriminate violence over the civil population. The paper concludes by pulling together the threads of the analysis and stressing the importance of developing ethical guidelines, which will provide the ground for the definition of the necessary regulation of IW and for the solution of the 3R problems.

[9] This problem is part of the 3R problems described in section one.

References
Arquilla, J. (1999). Ethics and information warfare. In: Z. Khalilzad, J. White & A. Marshall (Eds.), Strategic appraisal: the changing role of information in warfare (pp. 379-401). Santa Monica, USA: Rand Corporation.
Arquilla, J. & Borer, D.A. (Eds.) (2007). Information Strategy and Warfare: A Guide to Theory and Practice (Contemporary Security Studies). New York, USA: Routledge.
DeGeorge, R.T. (2003). Post-September 11: Computers, ethics and war. Ethics and Information Technology 5(4): 183-190.
Floridi, L. (2009). The Information Society and Its Philosophy. The Information Society 25(3): 153-158.
Hauptman, R. (1996). Cyberethics and social stability. Ethics and Behavior 6(2): 161-163.
Libicki, M. (1996). What is Information Warfare? Washington, D.C., USA: National Defense University Press.
Powers, T.M. (2004). Real Wrongs in Virtual Communities. Ethics and Information Technology 5(4): 191-198.
Singer, P.W. (2009). Robots at War: The New Battlefield. Wilson Quarterly 33(1): 30-48.

TECHNO-SECURITY, RISK AND THE MILITARIZATION OF EVERYDAY LIFE

JUTTA WEBER
University of Paderborn, Warburger Straße 100, 33098 Paderborn

Abstract. Recently we have been experiencing a rapid and ongoing transfer of security technologies such as body scanners, drones, or biometrics from the military realm into everyday life. And though there is a lively debate on the growing militarization of public space, political culture and everyday life (Giroux 2004, Graham 2005, Crandall/Armitage 2005, Kohn 2009), there is surprisingly little discussion of the huge amount of military-civilian transfer of new and emerging security technologies. Only very few authors address the possible militarization of society through the procurement, adaptation and proliferation of military technologies in civilian life (Agre 2001). A few scholars such as Dandeker (1990, 2006), Wood et al. (2006), or Balzacq et al. (2010) have pointed out that security technologies and practices are deeply impregnated by their military origins. Surveillance studies scholars, leaning on Anthony Giddens (1985), at least partly acknowledge the growing entanglement of the military and bureaucracy in post/modern societies (Bogard 1996, Dandeker 1990, Nellis 2009, Wood et al. 2003). Approaches in STS (Akrich 1992; Woolgar 1991) and philosophy of technology (Winner 1986, Verbeek 2006, Flanagan, Howe and Nissenbaum 2008) have shown how technology transports values, world views and norms.
Therefore I will ask in my paper what norms, values and frames of thought are transported into everyday life with the military-civilian transfer of security technologies, for example when uninhabited aerial vehicles become part of everyday experience through the growing presence of UAVs during global sport and cultural events, at demonstrations or during law enforcement, as well as through 'augmented reality' video games.

2. Daily Drones. Techno-Security & the Militarization of Everyday Life

Originally, hopes of a large-scale military-civilian conversion arose after the end of the cold war. But these hopes were disappointed already in the early 1990s, when force again became a frequent tool of foreign policy, concentrating on so-called rogue and failed states; there followed a growing number of military responses, from peace-keeping operations up to massive invasions (Rappert et al. 2008). In philosophy of technology as well as in science and technology studies (STS) there are some studies on the crossover of global communication and military surveillance systems (i.a. de Landa 1991, Edwards 1996) as well as on the fusion of military, industry and media (Der Derian 2001, Lenoir/Lowood 2002). The shift of the business of major arms manufacturers towards mainstream security and surveillance products in the post-cold-war era has been addressed (i.a. Wood et al. 2006, Eick 2010, Graham 2010). Nowadays new products are being developed and are partially already deployed. Think of non-lethal weapons, i.a. electroshock and heat-ray weapons, as well as monitoring systems linked to killing or paralyzing systems. These weapons for warfare or crowd control are situated between the military and the civilian realm. In a brochure on new security projects in the 7th framework programme for research, the Directorate General Enterprise and Industry of the EU commission states: "Moreover, the relationship between defence technologies on the one hand, and security technologies on the other, is particularly noticeable in the field of R&D, with technologies that show potential developments in both areas (Dual Use). At both research and industrial development levels, synergies are possible and desirable." (European Commission. Enterprise and Industry 2009, my emphasis). Contemporary surveillance studies also point towards the close relation between the military and the managerial: "Cross-fertilization between the military and the managerial is clearly central to problems and developments in the study and practice of surveillance…" (Wood et al. 2003, 146). But there are very few studies on the relation of the sociotechnical, the political and the military with regard to military-related security technologies and their impact on everyday life.

2.1. TECHNO-SECURITY, RISK AND UNPREDICTABILITY

So what to make of the manifest expansion of military technologies into civilian life in general and of UAVs specifically? We have long known about the conversion and adaptation of military technology in everyday life; think only of recent examples of the military origins of technologies such as the internet, RFID, satellite technology or GPS (Global Positioning System). Approaches in STS (Akrich 1992; Woolgar 1991) and philosophy of technology (Winner 1986, Verbeek 2006, Flanagan, Howe and Nissenbaum 2008) have shown how technology transports values, world views and norms.
Madeleine Akrich made visible that every technology contains scripts, while Steve Woolgar (1991) pointed to the fact that technology is "configuring the user" and the context of use. It is therefore important to ask which frames of thought, world views, perspectives, preferences and motives are inscribed into military-related security technologies and translated into everyday life. Kaplan (2006) has shown how GPS not only linked demography, geography, remote sensing, geopolitics and identity politics, but also how GPS became an icon of "personal empowerment and self-knowledge linked to speed and precision" (Kaplan 2006: 697) for US Americans. At the same time, the 'militarized consumer' who wants to improve his 'lifestyle' provides the personal data, thereby enabling new systems of surveillance (embedded in mobiles, GPS systems in cars, etc.): "…tracked, the user becomes a target within the operational interfaces of the marketing worlds, into whose technologies state surveillance is outsourced." (Crandall 2006, np) A relevant epistemological shift, accompanied by the emergence of new norms, worldviews and values, that comes with the massive contemporary military-civilian transfer is the reframing of today's concept of security. Homeland as well as international security is not primarily occupied with the defense against specific threats and the prosecution of crimes (Albrecht 2009) but with the (precautionary) management of risk and the preventive and pre-emptive securitization of security (Aradau et al. 2008, Ammicht-Quinn/Rampp 2009, Zedner 2007). While traditionally threat was related to the actions and intentions of conflicting parties, which can in principle be resolved, the concept of 'risk' embraces the idea of general, permanent and systemic contingencies such as pandemics, global warming, rogue states, terrorism, organized crime, poverty, illegal immigration or the proliferation of weapons of mass destruction (European Commission. Enterprise and Industry 2009). The concept of risk is closely entangled with unpredictability and insecurity, especially with regard to the identification of the enemy or the assessment of hazardous situations. The politics of risk operates with risk profiling on the basis of statistics and probabilities, with models and speculations which aim not at eliminating but at managing risk: "In short, whereas the concept of threat brings us in to the domain of the production, management and destruction of dangers, the concept of risk mobilizes and focuses on different practices that arise from the construction, interpretation and management of contingency". (Aradau et al. 2008, 148; my emphasis) This new approach is highly technology-oriented. The shift towards a preventive security policy and a techno-centred concept of security corresponds to the increasing networking of surveillance measures. The reconfiguration of surveillance as assemblage (Haggerty/Ericson 2000) is a general tendency. Nevertheless, the concept and practice of digital network-centred surveillance technologies (Graham/Wood 2003) show strong affinities to those of network-centric warfare. The latter, also called the 'Revolution in Military Affairs', is based on strong, ubiquitous ICT-based networks and mobilities that control and monitor area-wide and over huge distances 24 hours a day to reach a "globe-spanning dominance based on a near-monopoly of space and air power" (Graham 2005, 175; see also Dillon 2002, Dandeker 2006).
In this scenario, autonomous UAVs with artificial intelligence and learning capability in particular are regarded as an important component of the new techno-warfare (Weber 2009, 2010). Together with inhabited systems integrated in a complex network of air, water and ground agents, new techniques of warfare are developed "… toward a vision of a strategic and tactical battlespace filled with networked manned and unmanned air, ground, and maritime systems ... that free warfighters from the dull, dirty, and dangerous missions ... and enable entirely new design concepts unlimited by the endurance and performance of human crews. The use of UAVs in Afghanistan and Iraq is the first step in demonstrating the transformational potential of such an approach." (Department of Defense 2007, 34) This aspired-to high-tech transformation of the armed forces is supposed to make them invincible and to develop strategies of digital deterrence more powerful than nuclear deterrence ever was. The utopia of a ubiquitous, networked system of surveillance and control seems to be mirrored by a preventive and techno-centred idea of security in everyday life, for example when drones are deployed for law enforcement by the British police or for border control by the European agency Frontex. Recently, the Guardian's Freedom of Information request revealed the very broad scope of potential UAV applications by the British police: "Working with various policing organisations as well as the Serious and Organised Crime Agency, the Maritime and Fisheries Agency, HM Revenue and Customs and the UK Border Agency, BAE [systems; the British defence company] and Kent police have drawn up wider lists of potential uses. One document lists '[detecting] theft from cash machines, preventing theft of tractors and monitoring antisocial driving' as future tasks for police drones, while another states the aircraft could be used to combat 'fly-posting, fly-tipping, abandoned vehicles, abnormal loads, waste management' (…) There are two models of BAE drone under consideration, neither of which has been licensed to fly in non-segregated airspace by the CAA. The Herti (High Endurance Rapid Technology Insertion) is a five-metre long aircraft that the Ministry of Defence deployed in Afghanistan for tests in 2007 and 2009". (Lewis 2010). According to these plans, the use of UAVs would be part of a larger network-centric project through which information from a variety of sources (UAVs, smart CCTV, data retention, analysis of money transfers, etc.) is networked and evaluated. This course of action seems to aim not primarily at prosecuting specific crimes and following concrete suspicions, but at monitoring a nation's population systematically and thoroughly on an everyday basis. We need to investigate whether this civilian approach resembles what in the military is called C4ISR: Command, Control, Communications, Computers, Intelligence, Surveillance and Reconnaissance. C4ISR stands for the networking of all available surveillance and control systems to achieve a global overview in the war theatre. So maybe we are witnessing the idea of a global overview in the (civilian) world theatre. Part of this epistemological and normative reframing might also be found in recent consumer applications of UAVs.
Since last year, the first small UAVs, or quadricopters, have been available for 'augmented reality' video games (http://ardrone.parrot.com/parrot-ar-drone/de/) in which one can launch missiles and fight against other drones. The quadricopters can be controlled by an iPhone, iPod Touch or iPad. There are two cameras embedded in the drone, one on the front and one underneath, to enable direct sight via video remote control on the basis of a Wi-Fi connection. Another application is provided by a German company which rents drones for private use (www.rent-a-drone.de) to enable real-time pictures and videos from above. The private consumer applications of UAVs might (still) not be as wide-ranging as GPS, but in a way one could argue that they might open the door to more intense participatory surveillance and observation practices (Ball 2005, Koskela 2009). Daily consumer drones might help train users to watch the world from a top-down or 'God's eye' view that participates in the C4ISR longing for a global overview in the war / world theatre. The tightening networks of surveillance technologies, increasingly expanded by drones for border control, policing demonstrators, and crowd and event control, are part of a growing belief in a "'smart', specific, side-effects-free, information-driven utopia of governance" (Valverde and Mopas, 2004: 239). Network-centric warfare with its idea of C4ISR relies on this utopia, as might be the case with recent police applications of drones and new gamer applications such as the iPhone-controlled AR.Drone. It is necessary to follow closely the growing transfer of military technologies into civil applications, game practices and other areas of everyday life to see whether and how recent ideas of techno-security and 'full spectrum dominance' become dominant in the societies of control of the 21st century.

References
Agre, P.E. (2001). Imaging the Next War. Infrastructural Warfare and the Conditions of Democracy. Retrieved from http://polaris.gseis.ucla.edu/pagre/war.html [accessed 17 November 2010].
Akrich, M. (1992). The de-scription of technological objects. In: W.E. Bijker & J. Law (Eds.), Shaping technology/building society (pp. 205-224). Cambridge: MIT.
Ammicht-Quinn, R. & Rampp, B. (2009). The Ethical Dimension of Terahertz and Millimeter-Wave Imaging Technologies – Security, Privacy and Acceptability: Optics and Photonics. In: C.S. Halvorson et al. (Eds.), Global Homeland Security V and Biometric Technology for Human Identification VI (pp. 1-11). Proc. of SPIE Vol. 7306, 730613.
Ball, K. (2005). Organization, Surveillance and the Body: Towards a Politics of Resistance. Organization 12(1): 89-108.
Balzacq, T. et al. (2010). Security Practices. In: R. Denemark (Ed.), International Studies Encyclopedia Online. Retrieved from http://didierbigo.com/documents/SecurityPractices2010.pdf [accessed 4 November 2010].
Bogard, W. (1996). The Simulation of Surveillance: Hypercontrol in Telematic Societies. Cambridge: Cambridge University Press.
Capurro, R., Tamburrini, G. & Weber, J. (Eds.) (2008). Techno-Ethical Case-Studies in Robotics, Bionics, and Related AI Agent Technologies.
Deliverable 5 of the EU-Project ETHICBOTS. Emerging Technoethics of Human Interaction with Communication, Bionic and Robotic Systems (SAS 6 - 017759). Retrieved from http://ethicbots.na.infn.it/restricted/doc/D5.pdf [accessed 17 November 2010].
Crandall, J. & Armitage, J. (2005). Envisioning the Homefront: Militarization, Tracking and Security Culture. Journal of Visual Culture 4(1): 17-38.
Crandall, J. (2006). Operational Media. Retrieved from http://www.ctheory.net/printer.aspx?id=441 [accessed 2 January 2011].
Dandeker, C. (1990). Surveillance, Power and Modernity: Bureaucracy and Discipline from 1700 to the Present Day. New York: St. Martin.
Dandeker, C. (2006). Surveillance and Military Transformation: Organizational Trends in Twenty-first-Century Armed Services. In: K.D. Haggerty & R.V. Ericson (Eds.), The New Politics of Surveillance and Visibility (pp. 225-249). Toronto, Buffalo and London: University of Toronto Press.
De Landa, M. (1991). War in the Age of Intelligent Machines. New York: Zone Books.
Department of Defense (2007). Unmanned Systems Roadmap 2007-2032. Retrieved from http://www.acq.osd.mil/usd/Unmanned%20Systems%20Roadmap.2007-2032.pdf [accessed 12 June 2008].
Der Derian, J. (2001). Virtuous War: Mapping the Military-Industrial-Media Entertainment Network. Boulder, CO: Westview Press.
Edwards, P.N. (1996). The Closed World: Computers and the Politics of Discourse in Cold War America. Cambridge, MA: MIT Press.
Eick, V. (2010). The Droning of the Drones. The increasingly advanced technology of surveillance and control. Retrieved from http://www.statewatch.org/analyses/no-106thedroning-of-drones.pdf [accessed 12 November 2010].
European Commission. Enterprise and Industry (2009). Security Research. Towards a more secure society and increased industrial competitiveness. Security Research Projects under the 7th Framework Programme for Research. May 2009. Retrieved from ftp://ftp.cordis.europa.eu/pub/fp7/security/docs/towards-a-more-secure_en.pdf [accessed 17 November 2010].
Flanagan, M., Howe, D.C. & Nissenbaum, H. (2008). Embodying Values in Technology. In: J. van den Hoven & J. Weckert (Eds.), Information Technology and Moral Philosophy (pp. 322-353). Cambridge: Cambridge University Press.
Giddens, A. (1985). The Nation-State and Violence. A Contemporary Critique of Historical Materialism, Vol. II. Berkeley: University of California Press.
Giroux, H.A. (2004). War on Terror. The Militarising of Public Space and Culture in the United States. Third Text 18(4): 211-221.
Graham, S. (2005). Surveillance, urbanization and the US 'Revolution in Military Affairs'. In: D. Lyon (Ed.), Theorizing Surveillance. The Panopticon and Beyond (pp. 247-270). Devon, UK: Willan.
Graham, S. & Wood, D. (2003). Digitizing Surveillance: Categorization, Space, Inequality. Critical Social Policy 23(2): 227-248.
Graham, S. (2010). From Helmand to Merseyside: Unmanned drones and the militarization of UK policing. Retrieved from http://www.opendemocracy.net/ourkingdom/stevegraham/from-helmand-to-merseyside-military-style-drones-enter-uk-domestic-policing [accessed 17 November 2010].
Haggerty, K. & Ericson, R. (2000). The surveillant assemblage. British Journal of Sociology 51(4): 605-622.
Kaplan, C. (2006). Precision Targets: GPS and the Militarization of U.S. Consumer Identity. American Quarterly 58(3): 693-713.
Koskela, H. (2009). Hijacking surveillance?
The new moral landscapes of amateur photographing. In: K.F. Aas, H.O. Gundhus & H.M. Lomell (Eds.), Technologies of Insecurity: The Surveillance of Everyday Life (pp. 147-168). Oxon / New York: Routledge-Cavendish.
Kohn, R.H. (2009). The Danger of Militarization in an Endless 'War on Terrorism'. The Journal of Military History 73(1): 177-208.
Lenoir, T. & Lowood, H. (2002). Theaters of War: The Military-Entertainment Complex. Retrieved from http://www.stanford.edu/class/sts145/Library/LenoirLowood_TheatersOfWar.pdf [accessed 17 November 2010].
Lewis, P. (2010). CCTV in the sky: police plan to use military-style spy drones. The Guardian (London), 23.1.2010. Retrieved from www.guardian.co.uk/uk/2010/jan/23/cctvsky-policeplan-drones [accessed 12 November 2010].
Nellis, M. (2009). 24/7/365: mobility, locatability, and the satellite tracking of offenders. In: K.F. Aas, H.O. Gundhus & H.M. Lomell (Eds.), Technologies of Insecurity: The Surveillance of Everyday Life (pp. 103-124). Oxon / New York: Routledge-Cavendish.
Rappert, B., Balmer, B. & Stone, J. (2008). Science, Technology and the Military. Priorities, Preoccupations and Possibilities. In: The Handbook of Science and Technology Studies (pp. 719-740). London: MIT Press.
Verbeek, P.-P. (2006). Materializing Morality. Design Ethics and Technological Mediation. Science, Technology & Human Values 31(3): 361-380.
Weber, J. (2009). Unmanned Combat Aerial Vehicles, Dual Use and the Future of War. In: R. Capurro, M. Nagenborg & G. Tamburrini (Eds.), Ethics and Robotics (pp. 83-103). Amsterdam/Heidelberg: IOS Press / Deutscher Akademieverlag.
Weber, J. (2010). Armchair Warfare 'on Terrorism'. On Robots, Targeted Assassinations and Strategic Violations of International Law. In: J. Vallverdú (Ed.), Thinking Machines and the Philosophy of Computer Science: Concepts and Principles (pp. 206-222). IGI Global.
Winner, L. (1986). The Whale and the Reactor: A Search for Limits in an Age of High Technology. Chicago: University of Chicago Press.
Woolgar, S. (1991). Configuring the User: The Case of Usability Trials. In: J. Law (Ed.), A Sociology of Monsters. Essays on Power, Technology and Domination (pp. 59-99). London: Routledge.

Track V: Information Ethics, Robot Ethics

IS THERE A HUMAN RIGHT NOT TO BE KILLED BY A MACHINE?

PETER M. ASARO
The New School University
asarop@newschool.edu

1. Extended Abstract

This presentation reviews the standard frameworks for considering the human right not to be killed, and its forfeit by combatants in a war. It then considers as a special case the right not to be killed by a machine. Insofar as one has a right not to be killed by any means, one also has a right not to be killed by a machine, such as a lethal robotic system. It is further argued that in those cases in which an individual may have already forfeited their right not to be killed, such as when acting as a combatant in a war, this does not necessarily subject one to being killed by a machine. Despite a common view that combatants in war may be liable to be killed by any means, "killing by machine" fails to meet the requirements for ethically justifiable killing. The defense of this assertion will rest on a technical definition of "killing by machine" and a further clarification of justified killing in war.
In short, the argument is that "killing by machine" fails to consider the rights of an individual in the morally required manner. This is because a justifiable "decision to kill" must be made by a moral agent, and an automated decision cannot involve the moral deliberation required to justify violating the human right not to be killed. As such, automated decisions to kill are not morally justifiable. The argument begins by examining the right to self-defense which forms the rights-based interpretation of Just War Theory. In particular, I examine the "Castle Laws", aka "Make My Day Laws", which in many U.S. states permit individuals to use force against home-intruders without criminal or civil liability. I examine the conditions under which individuals in such circumstances are permitted to use lethal force, and when such force becomes "willful and wanton misconduct". Informed by this analysis, I examine the legality of a home-defense robot and the legal permissibility of its use of force against home-intruders. In general, the "Castle Laws" do not allow homeowners to booby-trap their homes, and a robotic home-defense system can be viewed as a sophisticated booby-trap. I consider the various objections that might be made to the standard rejection of booby-traps. According to such objections, a robot with sophisticated cognitive and perceptual capabilities might be argued to avoid manifesting a form of "reckless endangerment". I then analogize from the case of home-defense in civil and criminal law to the case of self-defense in war, and to the Laws of Armed Conflict and Just War Theory. While warfare has much looser standards of what constitutes a "threat", and of the proximity of threats, the use of systems capable of automated lethal decision-making is largely analogous to the domestic use of booby-traps. I conclude that implicit in both domestic law and the international laws of armed conflict is a requirement for moral deliberation which undermines the moral and legal legitimacy of automated lethal decision-making. This has serious implications for the use of autonomous lethal robotics in police and military applications. One implication is that only artificial moral agents, capable of exercising moral autonomy, could be morally and legally justified in violating the rights of a human.

DO WE NEED A UNIVERSAL INFORMATION ETHICS?

THOMAS CHRISTOPHER DASCH
University of Paderborn, Germany

Abstract. This article deals with information ethics. This raises the essential question: What is information? But I want to focus on the ethical category. Therefore, three areas of potential action arise. Instead of information I want to talk more generally of data. This makes it possible to distinguish between: (1) the pure reception of data, (2) the pure provision of data, (3) the simultaneous reception and provision of data. (4) A further possible action is to supply a platform for data; this is, strictly speaking, part of topic three, but it will be discussed as a separate topic. The ethical problems that may occur in the individual cases are exemplified. Subsequently, a connection is made between the problems of legislating the Internet and the lack of a universal ethical base in information ethics.

This article deals with information ethics. This raises the essential question: What is information?
The question "What is Information?" (Floridi, 2004, p. 560) is, according to Floridi, the elementary problem of the philosophy of information. Among the advocates of well-known approaches to the concept of information are Shannon and Weaver, Bar-Hillel and Carnap, Wiener, Janich, etc. (Capurro, 2000). Here Capurro's trilemma (Fleissner, Hofkirchner, 1995) applies: (1) either the concept of information is always the same no matter what the set of input data is like, (2) or the information is only of a similar kind, or (3) it is completely independent. At this point it would have to be clarified on which concept of information information ethics is based. But I want to break different ground: I want to focus on the ethical category. In this context, information ethics is the part of ethics that deals with the internet. The concept of information is to be set aside here. "Morality is focused on judgments that assess a human action positively or negatively, approving or disapproving of it." (Birnbacher, 2007, p. 12). Therefore, three areas of potential action arise. Instead of information I want to talk more generally of data. This makes it possible to distinguish between:

1. the pure reception of data,
2. the pure provision of data,
3. the simultaneous reception and provision of data,
4. supplying a platform for data; this is, strictly speaking, part of topic three, but it will be discussed as a separate topic.

One example of the first topic is the reading of news pages or blogs. In this context, the information content the receiver consumes is morally relevant. A possible moral misconduct in this field is the download of music without owning the respective rights. In the case of the internet, the information recipient may not be able to reconstruct the origin of the information. Additionally, the information can be deleted from the respective homepage at any time. In contrast, the information content of a newspaper cannot be changed once the paper is printed. The second topic includes, e.g., owners of news pages. In this connection, the precise content of the online data is morally relevant. In the case of news pages it is expected that the news has been extensively investigated. One example of the misuse of this function is a scenario in which a person spreads videos showing another person in an unfavourable context. In the case of the internet, tracking down the owner of the page is far more difficult than tracking down a conventional information transmitter. The latter also differs from the internet with regard to judicial matters; more on that later in the text. A feature of the internet is that a large group of people can be addressed without the need for a major news infrastructure. Interest groups can be formed rapidly and easily in this way, as seen recently when an open letter was handed to Chancellor Merkel concerning the plagiarism affair of Germany's minister of defence, Karl-Theodor zu Guttenberg. In this way, the initiators of the letter were able to support the minister's resignation. Amongst others, topic three includes chats, forums and online games. In this case, moral relevance is similar to moral relevance in non-virtual communication. A possible moral misconduct would, e.g., be insulting a person in a chat room. Characteristic of this kind of online communication is that the counterpart cannot be seen (as long as webcams are not used). Therefore, it remains unknown what emotions the counterpart expresses.
“Emotions are responses of an organism centered on experiences. They represent the relevance of an artefact of perception for the fulfilment of needs (e.g. according to the criteria ‘beneficial’ or ‘impedimental’). Additionally, they activate or constrain various cognitive and motivational systems in terms of an optimal satisfaction of needs” (Kuhl, 2010, p. 543). This can lead to an incorrect estimation of the counterpart’s emotions. However, the chatter can also simulate emotions by using, e.g., smileys that do not represent his actual emotions. In the case of the internet, the identity of the person one is chatting with cannot be verified. The counterpart is then not necessarily regarded as a person, but in a distinct role: in online games as a required participant, in forums as a dispenser of information, and so on.

The fourth area includes, for example, providers of platforms like Facebook, search engines like Google, and file-sharing services. Here it is ethically relevant whether the suppliers can assure an ethically correct mode of use for their users. An example of an ethically dubious action in this area is running a file-sharing service for music without holding the copyrights. WikiLeaks is a point at issue, too: it is questionable whether it is ethical to publish diplomatic cables.

Beyond all these potentially ethically critical topics, one can point out that controversial concepts and opinions exist. This is reflected, for example, in Hofstede’s five cultural dimensions: Power Distance Index (PDI), Individualism (IDV), Masculinity (MAS), Uncertainty Avoidance Index (UAI) and Long-Term Orientation (LTO) (Lüsebrink, 2005, pp. 20-25). On the one hand, this is due to differing opinions within the respective cultural areas. On the other hand, different cultures show different behaviour on the internet, which can be traced back to the fact that violations of basic ethical principles on the internet remain largely unpunished. The internet is not an area immune from law, but its users are spread across the globe, each subject to the legislation, and the enforcement of the laws, of their own country. “The almost traceless variability of content presents new challenges to the reliability of documents and evidence. The indistinguishability of original and copy has a new copyright quality. The anonymity of the web makes it difficult to identify reliable contracting partners. The speed of interactive communication, for instance, shortens the natural cooling-off period in concluding contracts considerably, giving consumer protection a new dimension” (Haug, 2010, p. 9). This would require a common ethical base in information ethics.

References

Birnbacher, D. (2007). Analytische Einführung in die Ethik. Berlin: Walter de Gruyter.
Capurro, R. (2000). Einführung in den Informationsbegriff. Available at http://www.capurro.de/infovorl-kap3.htm [15.02.2011].
Fleissner, P., & Hofkirchner, W. (1995). Informatio revisited. Wider den dinglichen Informationsbegriff. Informatik Forum, 9(3), 126-131.
Floridi, L. (2004). Open Problems in the Philosophy of Information. Metaphilosophy, 35(4), 554-582.
Haug, V. (2010). Internetrecht: Erläuterungen mit Urteilsauszügen. Stuttgart: Kohlhammer.
Kuhl, J. (2010). Lehrbuch der Persönlichkeitspsychologie: Motivation, Emotion und Selbststeuerung. Göttingen: Hogrefe.
Lüsebrink, H. (2005). Interkulturelle Kommunikation. Stuttgart: Metzler.
A PSEUDOPERIPATETIC APPLICATION SECURITY HANDBOOK FOR “VIRTUOUS SOFTWARE”

KEITH DOUGLAS
Statistics Canada [10]

In the past 10 or 15 years an increased awareness of application security [11] (AS) in computing and information systems has resulted in many volumes of material (e.g., Cross 2006, Burnett 2004, Seacord 2005, Clarke 2009). Security-conscious developers, testers, and organizations wishing to adopt “best practices” have a lot of work to do to distill these many volumes of advice and principles into easily implementable and understandable approaches. Following an off-hand suggestion from a colleague (Perkins 2010), I have taken her phrase “virtuous software” as a starting point. In this paper, I comb through the Nicomachean Ethics (Aristotle 1984) to find appropriate guidance for virtue in AS. It is thus addressed both to computing professionals wanting to understand why AS makes the ethical consequences of their work more salient (or, more debatably [12], makes them exist) and also to philosophers who may not be aware of the ethical challenges raised by recognition of AS in computing. It is also intended as a brief introduction as to why AS considerations matter as one (not independent of the others) aspect of the “architecture”, design, development, and support of software.

[10] Author affiliation for identification purposes only.
[11] AS is to be distinguished in discussions of computing security from infrastructure security, which deals with antimalware solutions, public key utilities, routing rules in networks, etc. 70% of current exploits and vulnerabilities are in application areas (Sykora 2010), and consequently AS merits philosophical and computational attention. It is often discussed in the context of “application hardening”. This term is in the author’s view unhelpful, since it suggests, wrongly, that a correct approach would be to implement an application and then “fix it up” to meet the hardening requirements. The expert consensus seems to be that AS ought to be part of the entire software development life cycle, and have a role to play at almost every phase. See, e.g., Seacord 2005. The case of what to do about existing systems is more complicated; I do not address it as much in the present work, though much of what we can tease out of (or be reminded of by) Aristotle applies regardless.
[12] Conversations with colleagues on the part of the author suggest (he has not done formal investigations) that many computing professionals do not think their profession and activities raise any additional or different ethical considerations beyond those common to all humans in general or all relevant employees of a given organization. (For example, fellow computing colleagues of the author are certainly aware of their obligations under the relevant public service legislation, but do not see, for example, buffer overruns and race conditions as leading to possible ethically relevant situations. At best these are regarded as “another sort of bug”.) Further work (beyond the present one) to institute AS “consciousness” in developers will have to deal with this situation.
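The point of note [12] may be easier to feel with something concrete. The following short Python sketch, added here as an illustration and not drawn from the paper, exhibits a classic check-then-act race condition: two threads both pass a balance check before either performs the debit. Seen as “another sort of bug” it is a curiosity; seen through an AS lens, in banking or safety-critical software, it is a defect with squarely ethical consequences:

import threading, time

balance = 100

def withdraw(amount):
    global balance
    if balance >= amount:      # check...
        time.sleep(0.01)       # widen the race window so the flaw shows up reliably
        balance -= amount      # ...then act; nothing protects the check-act pair

threads = [threading.Thread(target=withdraw, args=(100,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)   # frequently -100: both threads saw balance >= 100

The repair, holding a threading.Lock() across the check and the debit, is a one-line discipline; the ethically salient step is recognizing that the pair needed protecting at all.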
Philosophical topics I will briefly address in the above fashion are: the nature of technology, the nature of virtue, how virtue may be obtained, who is virtuous, what results from being virtuous, and examples of what specific virtues are. All of these can be topics for complete presentations in their own right: I bring them up simply to show the rich areas of further possible investigation and, in some cases, the pitfalls of using a “virtues framework” when it comes to software. The philosophical topics in turn relate (here I do not indicate how, merely enumerate what will be discussed) to the following more directly computing considerations: the nature of computing professions, systems specifications, how one should learn about AS, characteristics of good software systems, how to adjudicate between AS and other design goals, how to get developers to be AS-aware, and others. Finally, I include this paper as a way of linking three phases of the so-called computational turn: the past, traditional philosophy (e.g., Aristotle); the present, the CAP conferences, where computing and philosophy, traditional and otherwise, interact fruitfully in a largely (but not exclusively) academic setting; and the future, where work from CAP is also of importance to those outside. I do not suggest that these three phases are the only way to understand the historical development of the computing and philosophy movement, nor do I suggest that there has not been anything useful in the past to those outside of academia, merely that there is ample room within the topic of AS to address such considerations.

References

Aristotle. 1984. “Nicomachean Ethics”. In The Complete Works of Aristotle, vol. 2 (ed. Jonathan Barnes). Princeton: Princeton University Press.
Burnett, Mark. 2004. Hacking the Code: ASP.NET Web Application Security. Burlington: Syngress.
Clarke, Justin. 2009. SQL Injection Attacks and Defense. Burlington: Syngress.
Cross, Michael. 2006. Developer’s Guide to Web Application Security. Burlington: Syngress.
Perkins, Evelyn. 2010. Unpublished comment, meeting of the Secure Coding Practices Working Group, Statistics Canada.
Seacord, Robert. 2005. Secure Coding in C and C++. New York: Addison-Wesley Professional.
Sykora, Boleslav. 2010. Lecture Material, Learning Tree International Course 940.

THE CENTRAL PROBLEM OF ROBOETHICS: FROM DEFINITION TOWARDS SOLUTION

DANIEL DEVATMAN HROMADA
Université Paris 8 / École Pratique des Hautes Études / Lutin Userlab
hromi@kyberia.sk

Abstract. The central problem of roboethics is defined as follows: on the one hand, robotics aims to construct entities which will transcend the faculties of human beings; on the other hand, some unethical acts should be made impossible for such artificial beings to execute. This can be illustrated by the case of a full-fledged AI which is able to reprogram itself, or to program other AIs, but only in such a way that the result shall not lead to the infraction of moral imperatives held by its human conceptors. The programmer of such a system is thus caught between the Scylla of his “aim to conceive an artificial entity able to do almost everything, and more efficiently than a human being” and the Charybdis of “the principle of precaution commanding him to constrain the behaviour of such an entity so that it would never be able to execute certain acts, like that of murder, for example”. The central problem can therefore also be perceived as the problem of finding a trade-off between the amount of “autonomy” granted to an artificial agent and the extent to which “embedded ethical constraints” determine the agent’s behaviour.
Believing that such a trade-off can be found, our proposal is conceived as a fourfold hybrid “separation of powers” model within which the final output, the solution of an ethical dilemma, is considered to be the result of the mutual interaction of four independent components:

1) a “moral core” containing hard-wired rules analogous to Asimov’s laws of robotics;
2) a “meta-moral imperative” logically equivalent to Kant’s categorical imperative;
3) an “ethico-legal codex” containing an extensible set of normative procedures representing the laws, moral norms and customs present in, or induced from, the agent’s surroundings;
4) a “mytho-historical knowledge base” grounding the agent’s representation of «possible states of the world» in the corpora of human-generated myths & stories.

Finally, we will argue that our proposal of two induced & two embedded modules vaguely corresponds to the human moral faculty, since it takes into account both its “innate” and its “acquired” components.

1. Definition of the Central Problem

It may be stated that the ultimate goal of Artificial Intelligence is, for its most radical proponents (Kurzweil, 2000; Vinge, 1993), the conception of an artificial system able to transcend all faculties nowadays attributed to human beings. In accord with Turing’s pioneering proposal (A. M. Turing, 2008), such proponents do not ask metaphysical questions like “Can a machine have consciousness?”, nor do they bother much with arguments like that of the “Chinese room” (Searle, 1982). More concretely: such radical engineers do not ask “whether faculty X can be simulated by algorithmic means”; they simply take the affirmative answer for granted and, in consequence, pose the question “how can I simulate faculty X by algorithmic means?”

Let us define “the faculty of moral reasoning” as X1. While being aware that nothing really proves that such a definition does NOT result in a fallacy, we nonetheless do not ask whether it makes sense or not to speak about a “machine endowed with morality”. That machines will, at some time in the future, be able to fully simulate moral reasoning is taken for granted within the scope of our Gedankenexperiment, and the question posed hereby is therefore “how could it be done?”

Now let us define “the ability to modify itself” as X2 and “the ability to reproduce” as X3. Since X1, X2 and X3 are all faculties commonly attributed to human beings, it can be stated that an artificial system endowed with all of these faculties would seem more “human” than one which contains only some of them, and is therefore closer to the ultimate goal of radical AI as defined above. The problem arises when one realises that X1 is not necessarily mutually consistent with X2 or X3. Myths, as well as history itself, demonstrate far too often that the modification or reproduction of a moral being does not necessarily yield a moral result. It is verily this “lesson from history” that obliges us to postulate the central problem of roboethics: How could (the most radical of) roboengineers possibly conceive a machine which is, to the fullest possible extent, able to adapt itself to any situation whatsoever and yet “unable” to rewrite the set of moral imperatives with which it was endowed? We exclude completely the possibility of not endowing a machine with any moral reasoning at all.
Not only would a deployment of such a self-copying, self-modifying autonomous agent be contrary to the precautionary principle (Andorno, 2004), but the very intention of “creating a machine analogous in all its functions to a human being” would miss its target, since it is a commonly accepted fact that the faculty X1, i.e. morality, is one such anthropological universal (Mikhail, 2007). What’s more, according to Kant - who analysed the faculty of morality and its relations to other forms of reasoning to such an extent that his discoveries simply have to be taken into consideration by anyone aiming to embed morality into machines - X1 is not only “one faculty amongst many”; it occupies the central place among all the faculties with which man was endowed. For Kant, man is conceived as a “moral being” (Kant, 1785). Being moral means simply to be able to find a “good” solution to any situation of moral dilemma whatsoever. Therefore, any advanced implementation of morality into an artificial agent should ignore neither the semantic intricacies of the concept of “good” nor its strong cultural and contextual dependence (i.e. what is good in one context is not necessarily good in another).

2. Possible Solution to the Central Problem

The Hebbian network of semantic relations around the term “good” constitutes the outermost layer of our 4-component model of a so-called “moral machine” (MM). Initially, this graph-like structure of semantic relations could possibly be built by means of extraction of the “morals of the stories” from huge hypertext corpora representing the myths, fairy tales and descriptions of factual historical situations (inputs) and their consequences (outputs). Whether the association of such inputs & outputs by means of already existing machine learning procedures (ANN, SVM, boosting (Freund & Schapire, 1996)) would allow the system to attribute the label “good”/“not good” to a textual description of a situation of moral dilemma which was not contained in the training corpus remains open to argument.

Closer to the moral core is the 3rd layer, which can be understood as “the layer of rules”. To simplify the understanding: while layer 4 - understood as “the layer of associations amongst data” - can be compared to an Anglo-Saxon legal system, where a decision is based on precedent, i.e. the earlier decision of a judge in a case sharing analogous features with the case under study, the activity of layer 3 can be compared to that of a continental judge whose decisions are simply outputs of more general rules induced from exhaustive sets of previous experiences. Thus, a correct understanding of “moral induction” seems to be crucial in order to implement a robust solution for layer 3, and inspiration coming from the much better studied domain of “unsupervised grammar induction” (Solan, Horn, Ruppin, & Edelman, 2005) may yield encouraging results. It is not unreasonable to imagine that, by applying the induction principles not upon the data but upon the very rules which were themselves induced, the process would finally converge at some kind of meta-rule, possibly similar in meaning to what Kant called the “categorical imperative” (Kant, 1785).
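Before turning to the advantages of such a meta-rule, a minimal Python sketch may fix ideas. None of the identifiers below come from an implemented system; generate_possible_world, is_internally_consistent, the layer stubs and the timeout are hypothetical placeholders. The sketch only illustrates how the four layers could be cascaded, with layer 2 realized as the consistency-testing while() loop elaborated in the next paragraph, and layer 1 retaining a fast, hard-wired veto:

import time

class MoralInconsistency(Exception):
    pass

def generate_possible_world(action):
    # hypothetical stub: a real system would generate rich world-representations
    return {"universalized_action": action, "consistent": True}

def is_internally_consistent(world):
    return world["consistent"]   # hypothetical stub

def layer4_association(situation):
    # layer 4: corpus-trained ANN/SVM/boosting classifier; P(good), or None if undecided
    return None

def layer3_rules(situation):
    # layer 3: induced normative rules; a verdict, or None if no rule fires
    return None

HARD_WIRED_PROHIBITIONS = {"harm_human"}   # layer 1: Asimov-like moral core

def layer2_meta_rule(action, timeout=1.0):
    # layer 2: the "infinite while() loop" of the text, bounded by a timeout
    # precisely because it is not guaranteed to halt
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        world = generate_possible_world(action)
        if not is_internally_consistent(world):
            raise MoralInconsistency(action)

def decide(situation, action):
    if action in HARD_WIRED_PROHIBITIONS:          # layer 1 vetoes first, fast & frugal
        return False
    for layer in (layer4_association, layer3_rules):
        verdict = layer(situation)
        if verdict is not None:                    # a fuzzy threshold was decisive
            return verdict > 0.5
    try:
        layer2_meta_rule(action)                   # layers 4 & 3 were silent
    except MoralInconsistency:
        return False
    return False                                   # no layer approved: default to inaction

print(decide("pedestrian zone", "harm_human"))     # False, vetoed by the moral core

Bounding the loop with a timeout anticipates the halting worry raised below: when layer 2 cannot answer in time, the cascade falls back on inaction rather than waiting forever.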
The advantage of such a “meta-rule” is not only that it is quite easy to implement from a programmer’s point of view - in its essence it is nothing else than an infinite while() loop generating “representations of possible worlds” and throwing exceptions if ever an “internally inconsistent” world is generated - but also that it can be used as a sort of boolean rule of thumb in cases where the fuzzy thresholds of layers 4 & 3 are unable to supply any decisive result. The disadvantage of layer 2 is that it may sometimes demand an infinite amount of time in order to return a result (A. Turing, 1937). That is far too much, especially in cases where an artificial agent could harm its suddenly modified environment by its otherwise harmless activity - imagine, for example, an autonomous transporting agent similar to a car whose circuits get stuck in a while() loop after it has hazardously entered a pedestrian zone. For such cases, a low-level implementation of fast & frugal harm-reductive inhibitory mechanisms is of utmost importance. In order to stay consistent with the Tradition, we propose Asimov’s Laws of Robotics (Anderson, 2008) as a base for such mechanisms. Finally, it is worth stating that while layers 4 & 3 are dynamic in nature, i.e. can be rewritten by the inflow of new stimuli from the environment, layers 2 & 1 can be embedded into the very chips of an artificial agent and could not be modified or disabled without tampering with the agent’s hardware.

Believing that such a combination of “two static” and “two dynamic” pillars is in a certain sense analogous to the “nature” (i.e. innate) & “nurture” (i.e. acquired) components attributed to the moral faculty of a healthy human being, it may finally be stated that the question labeled hereby as “the central problem of roboethics” is, mutatis mutandis, nothing else than a postmodern variation upon a much more ancient theme: “How does a parent transform a crying child into an autonomous human being?”

References

Anderson, S. L. (2008). Asimov’s “three laws of robotics” and machine metaethics. AI & Society, 22(4), 477-493.
Andorno, R. (2004). The Precautionary Principle: A New Legal Standard for a Technological Age. Journal of International Biotechnology Law, 1(1), 11-19. doi: 10.1515/jibl.2004.1.1.11.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148-156).
Kant, I. (1785). Groundwork of the Metaphysic of Morals.
Kurzweil, R. (2000). The age of spiritual machines: When computers exceed human intelligence.
Mikhail, J. (2007). Universal moral grammar: Theory, evidence and the future. Trends in Cognitive Sciences, 11(4), 143-152.
Searle, J. (1982). The Chinese room revisited. Behavioral and Brain Sciences. Retrieved March 11, 2011, from http://journals.cambridge.org/abstract_S0140525X00012425.
Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629-11634.
Turing, A. M. (2008). Computing machinery and intelligence. Parsing the Turing Test, 23-65.
Turing, A. (1937). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, s2-42(1), 230-265. Retrieved March 11, 2011, from http://plms.oxfordjournals.org/content/s2-42/1/230.full.pdf.
Vinge, V. (1993). Technological singularity. VISION-21 Symposium, sponsored by NASA Lewis Research Center.
AFFECTING THE WORLD OR AFFECTING THE MIND? The Role of Mind in Computer Ethics

JOHNNY HARTZ SØRAKER
Department of Philosophy, University of Twente
j.h.soraker@utwente.nl

Abstract: The purpose of this paper is to draw a distinction between two interrelated yet fundamentally different ways of approaching problems in computer ethics, with the goal of clarifying which problems call for which approaches. In a nutshell, I will draw a distinction between approaches and topics that are primarily concerned with how technologies affect the world, on the one hand, and those primarily concerned with how technologies affect our mind, on the other. I will argue that the type of approach we choose should be determined on the basis of which of these concerns we are primarily trying to address, which will also shed light on the advantages and disadvantages of the multitude of approaches to be found in ethics of technology. In order to clarify and justify this distinction, I will categorize some common approaches in computer ethics correspondingly, and I will conclude by offering a set of suggestions for how they can and should complement each other in a way that yields an exhaustive analysis of the problem at hand.

The purpose of this paper is to draw a distinction between two interrelated yet fundamentally different ways of approaching problems in computer ethics, with the goal of clarifying which problems call for which approaches. In a nutshell, I will draw a distinction between approaches and topics that are primarily concerned with how technologies affect the world, on the one hand, and those primarily concerned with how technologies affect our mind, on the other. [13] It should be emphasized at the outset that these categories are not absolute or mutually exclusive - and it is certainly not my intention to argue that one is better than the other. My more modest intention is to argue that the type of approach we choose should be determined on the basis of which of these concerns we are primarily trying to address, which will also shed light on the advantages and disadvantages of the multitude of approaches to be found in ethics of technology.

[13] This distinction is reminiscent of Floridi & Sanders’ emphasis on the distinction between agent-oriented and patient-oriented ethics (2002), but that distinction is somewhat misleading in this context, because both technology and the mind can have a role as both agent and patient, being both source and target of good and evil.

There is little doubt that technologies affect both the world and the mind, and there is little doubt that there is no sharp distinction between the two. What affects the world can affect the mind, and what affects minds can affect the world - and technology often mediates between world and mind. As such, the distinction I am concerned with must necessarily be more of the ‘family resemblance’ type. Still, we can to some degree separate different ways of assessing these effects and, given the multitude of ethical theories and applied frameworks that are being used in ethics of technology, it is important to be clear about which approach is best suited for which area. The clearest example of this is probably the distinction between accountability and responsibility.
If the purpose of our analysis is to understand what is accountable for a given situation, we can do this entirely in terms of analyzing changes to the world. After all, an inquiry into accountability is largely an inquiry into causality: what was the source of this good or evil (cf. Floridi & Sanders, 2004, p. 371)? This also highlights the advantage of using a “mind-less” notion of accountability in cases where (higher-order) mental processes are either non-existent (e.g. artificial agents) or intrinsically distributed (e.g. organizations). If the purpose of our analysis is to understand responsibility, however, we are immediately required to include the mind in a much more integral manner. After all, an inquiry into responsibility is an inquiry into such mental terms as intentions, negligence, and culpability.

To give another example, when evaluating how Information and Communication Technologies (ICTs) affect privacy, we can focus on how ICTs affect the world in a manner that is relevant to privacy, or on how they affect our mind in a way that is relevant to privacy. The former involves such questions as “How do ICTs affect the flow of information?”, or what Floridi refers to as ‘ontological friction’ (2005). The latter involves questions such as “How do ICTs affect our expectations about privacy?” and “How can loss of privacy affect our well-being?”. If we look to environmental ethics, we can make a similar distinction between the effects a technological innovation may have on the environment, on the one hand, and its effect on e.g. opinions about sustainability, on the other. We can make a similar distinction when evaluating cultural consequences, by either looking at how technologies may change the material conditions necessary for certain cultural practices, or at how they more directly change people’s cultural values and attitudes. Clearly, the questions are interrelated and both sets of questions should be answered in a comprehensive analysis, but the approaches and methods we utilize in doing so will typically be centered on one of the two sets.

To clarify this further, we can attempt to categorize different approaches according to their main concerns. On the one hand, some theories and approaches are particularly good at evaluating how technologies affect the world. Again, one clear example is Floridi’s notion of ‘reontologization’ (2005) and the use of an informational level of abstraction, which is an interesting and often insightful way of conceptualizing how the world changes as a result of our increased ability to digitize information. Other examples of this type of approach are Actor-Network Theory (Latour, 2005) and recent post-phenomenological work on technological mediation (Verbeek, 2005). The strength of these theories is that they shed light on how technologies affect the world and our ways of interacting with the world. They do not, however, say much about how technologies affect the mind. Surely, the changes to the world that they disclose will very often lead to changes in mind, but this is not their main concern.

On the other hand, some theories and approaches are particularly good at evaluating how technologies affect the mind. Among the approaches in this category, we can include approaches that are grounded in some version of virtue ethics or utilitarianism, as well as axiological approaches.
The main concern of these approaches is not to understand how technologies affect the world, but rather how they affect our moral character, behavioral dispositions, expectations, quality of life, and so forth. Certainly, technologies often affect our mind through changing the world - indeed, they always do so if we regard the technology itself as a change to the world. Nevertheless, the main concern of these approaches is not to get a better understanding of how states of affairs in the world change, but rather to get a better understanding of how mental processes change. This is the ultimate goal of the analysis. If we take video game violence as an example, a virtue-ethical analysis of this phenomenon would not be particularly interested in how these games may affect the physical world, but rather in how they will affect the mind of those who interact with them. Will they make them more aggressive, less altruistic, happier?

One reason for distinguishing between these approaches is that they give rise to different types of normativity, and distinguishing them shows how these can be related to each other. Approaches that are primarily interested in changes to the world can be described as cautionary. That is, the effects that technologies have on the world will in many cases imply a caution: technology x will lead to change y, and this change might be ethically problematic. In order to take that last step, however, we need approaches that include the mind, in order to argue that change y is ethically problematic because it affects the mind in a particular way. This can be seen clearly when teaching computer ethics to pragmatically oriented computer scientists, where showing that technologies change the world will often lead to the perfectly rational question: “That might very well be true, but why is that a problem?”. Answering that question must somehow include the mind.

In the full paper, I will further clarify the nature of this distinction, knowing very well that it is problematic and rests on a number of philosophically controversial presuppositions. I will also justify why the mind is essential for most topics in computer ethics, and discuss what this means for how we ought to approach these topics. Some of the main conclusions will be that computer ethics is necessarily and intrinsically a pluralist area of investigation, one that needs to address both the world and the mind. More substantially, it will be argued that we need to get a much better understanding of how different approaches can complement each other and how analyses of changes to the world can be integrated into analyses of changes to the mind. I will conclude the paper by offering a few suggestions on how to do so, using privacy as one of the main examples.

References:

Floridi, L. (2005). The ontological interpretation of informational privacy. Ethics and Information Technology, 7(4), 185-200.
Floridi, L., & Sanders, J. W. (2002). Mapping the foundationalist debate in computer ethics. Ethics and Information Technology, 4, 1-9.
Floridi, L., & Sanders, J. W. (2004). On the Morality of Artificial Agents. Minds and Machines, 14(3), 349-379.
Latour, B. (2005). Reassembling the Social: An Introduction to Actor-Network-Theory. Oxford: Oxford University Press.
Verbeek, P.-P. (2005). What things do: philosophical reflections on technology, agency, and design. University Park, PA: Pennsylvania State University Press.
THE ETHICS OF AUTOMATED WARFARE

RYAN TONKENS
York University, Toronto, Canada
tonkens@yorku.ca

Machines with varying degrees of autonomy are moving onto the battlefield at an overwhelming pace. If this goes unchallenged, there is good reason to believe that both their level of autonomy and their overall sophistication will increase exponentially in the future. In light of this, it is important that we determine whether or not these sorts of robots should have a place in warfare. Here I ask whether the development and use of autonomous military robots is consistent with the tenets of Just War Theory (hereafter JWT). [14] Specifically, the aim of this paper is to offer an in-depth (albeit preliminary) analysis of whether the creation and deployment of autonomous machines in military contexts is morally acceptable, by way of assessing the overall justness of automated warfare. If automated warfare is unjust, then creating and using robots for this purpose is morally problematic.

[14] Just War Theory works in tandem with the international laws of war and rules of engagement as the moral and legal regulations of warfare. Due to space restrictions, I cannot attend to all three herein, so I focus exclusively on JWT.

The most anticipated application of advanced autonomous machines is in the military sector. Indeed, a disproportionate amount of funding for research on machine autonomy has come from military sources for military applications. Insofar as autonomous robots can perform actions that have serious ethical consequences (in the context of warfare, at least), they need to be programmed to behave ethically, i.e. to perform only those actions that are in line with the appropriate regulations and agreed-upon customs of just war. Contemporary JWT is the received view on how warfare should be conducted. We demand that all (human) combatants abide by the tenets of JWT. Moreover, we expect proper restitution, and go to great lengths to ensure that all breaches of JWT in practice are punished accordingly. If we want to involve autonomous machines in warfare, then they will need to abide by JWT as well. In this paper I take up four issues towards this end: (1) issues of moral responsibility; (2) discrimination and proportionality; (3) whether the creation of autonomous military machines is consistent with jus ad bellum and wider social justice; and (4) whether military machines could be more moral than humans.

(1) JWT demands that someone be morally responsible for actions in war. Given a certain advanced level of machine autonomy, robots will need to be held responsible for their own actions. However, doing so seems futile, since they have no capacity to suffer (Sparrow 2007). One potential limitation of Sparrow’s analysis, however, is that the range of autonomous machines for which something (someone) could still be held responsible is quite large. Limiting the autonomy of machines to the point where a human remains in the decision-making/execution loop avoids this problem, since human users are the sort of being that can be punished for their moral wrongdoings.

(2) Autonomous robots will need to be able to accurately and reliably discriminate between legitimate and illegitimate targets (i.e. between combatants and non-combatants, between surrendering combatants and aggressive combatants, between allies and enemies).
Whether or not autonomous military machines could be designed to do so in real-world military contexts remains an open question, although designing a robot with these abilities does not seem impossible in principle. Regardless, one point that seems uncontentious is that the level of autonomy and the ability of machines to act in real-world contexts will increase much sooner than our ability to perfect their capacity to exercise the intricacies of discrimination and proportionality at acceptable levels. This is important to recognize because, until autonomous robots can accurately and reliably discriminate between legitimate and illegitimate targets, they do not meet this requirement of JWT.

(3) If automated warfare fuels widespread social injustice, including injustices outside of the context of warfare specifically, then it is inconsistent with the principles underlying JWT (e.g. justice, fairness, respect). This could manifest itself in many ways, including increasing the likelihood of (unjust) war [15], decreasing the likelihood of terminating (unjust) war once it has begun, exacerbating gaps between rich and poor nations and between strong and weak military forces, et cetera. Moreover, the billions of dollars going into the automated military sector could be redirected towards the healthcare or education systems (for example), which could serve to remedy the existing status quo that finds humans of low socioeconomic status with poorer health and lower education, itself a symptom of and catalyst for widespread social injustice.

[15] McMahan (2009) has argued convincingly that, for diverse and complicated reasons, the majority of wars fought are unjust.

(4) Despite the possibility that machines could in some sense be more moral than human soldiers under certain circumstances (Arkin 2009; Sullins 2010), automated warfare will also witness its fair share of unethical activity. Although substituting machines for human combatants is appealing in certain ways, automated war would not be less unjust than human warfare overall. We seem to be seeking to develop autonomous military machines (in part) because we believe that we can treat them like servants and subordinates, yet we also expect them to be military and ethical ‘superiors’. The only way we can bring this about in a morally justifiable manner is if we restrict their sophistication to a point well before they are fully autonomous moral agents (especially ones akin to human moral agents), and hence keep them at a level where we need to keep a human in the loop. But doing so entails continuing to sacrifice human lives in battle, and continuing to endure human moral transgressions and imperfections in decision-making, all in addition to the new ethical challenges that accompany automated warfare.

There is good reason to suggest that the creation and use of autonomous military machines is inconsistent with JWT in several respects. This is an important finding. For one thing, it makes apparent that the creation of certain kinds of autonomous military machines is inconsistent with the moral framework that these robots will be expected to follow. More importantly perhaps, it places the burden of proof on those who want to support the move towards automated warfare and to develop these sorts of machines: they must demonstrate that they can do so in a morally sustainable (just) manner. Minimizing the level of sophistication of these robots and keeping humans in the military loop seems to
be the most prudent course to adopt, one certainly more palatable than automated warfare tout court, although needless to say infinitely less desirable than peace.

References

Arkin, R. (2009). Governing Lethal Behavior in Autonomous Robots. Dordrecht: Chapman & Hall.
Asaro, P. (2008). How just could a robot war be? In: P. Brey, A. Briggle and K. Waelbers (Eds.), Current Issues in Computing and Philosophy (pp. 50-64). Amsterdam: IOS Press.
Guarini, M., & Bello, P. (forthcoming). Robotic warfare: Some challenges in moving from noncivilian to civilian theaters. In: P. Lin, G. Bekey and K. Abney (Eds.), Robot Ethics: The Ethical and Social Implications of Robotics. Cambridge: MIT Press.
McMahan, J. (2009). Killing in War. Oxford: Clarendon Press.
Sparrow, R. (2007). Killer robots. Journal of Applied Philosophy, 24(1), 62-77.
Sullins, J. (2010). RoboWarfare: Can robots be more ethical than humans on the battlefield? Ethics and Information Technology, 12(3), 263-275.

CAREBOTS AND CAREGIVERS: Robotics and the Ethical Ideal of Care

SHANNON VALLOR
Department of Philosophy, Santa Clara University
500 El Camino Real, Santa Clara, CA 95053, USA

Abstract. In the 21st century we stand on the threshold of welcoming robots into domains of human activity that will expand their presence in our lives dramatically. One provocative new frontier in robotics, driven by a convergence of demographic, economic, cultural and institutional pressures, is the development of ‘carebots’ - robots intended to assist or replace human caregivers in the practice of caring for vulnerable persons such as the elderly, young, sick or disabled. I argue that existing reflections on the ethical implications of carebots have thus far neglected a critical dimension of the issue: namely, the potential moral value of caregiving practices for caregivers. Instead, the scholarly dialogue has largely focused on the potential benefits and risks to care recipients. Where caregivers have been explicitly considered, it is strictly in terms of how they might benefit from having the burdens of care reduced by carebots. I stipulate here that properly designed and implemented carebots might improve the lives of cared-fors and caregivers in ways that would be ethically desirable. Given the grave deficiencies of existing social mechanisms for supporting caregivers, their use may even be ethically obligatory in the absence of acceptable alternatives. Yet I argue that we ought to forestall such judgments until we have first adequately reflected upon the existence of goods internal to the practice of caregiving that we might not wish to surrender, or that it might be unwise to surrender even if we might often wish to do so. Such reflection, I claim, gives rise to considerations that must be weighed alongside the likely impact of carebots on care recipients. In order to initiate such reflection, I examine the goods internal to caring practices and the potential impact of carebots on caregivers by means of three complementary ethical approaches: virtue ethics, care ethics and the capabilities approach. I show that each of these frameworks can be used to shed light on the contexts in which carebots might deprive potential caregivers of important moral goods central to caring practices, as well as those contexts in which carebots might help caregivers sustain or even enrich those practices.
1. Introduction

We stand on the threshold of welcoming robots into domains of human activity that will expand their presence in our lives dramatically. One provocative new frontier is the development of ‘carebots’ - robots intended to assist or replace human caregivers in the practice of caring for vulnerable persons such as the elderly, young, sick or disabled. Yet existing philosophical reflections on the ethical implications of carebots have thus far neglected a critical dimension of the issue: the potential moral value of caregiving practices for caregivers. Instead, the dialogue has largely focused on the potential benefits and risks to care recipients. Indeed, properly designed and implemented carebots might improve the lives of both cared-fors and caregivers in ways that would be ethically desirable. Their use may even be ethically obligatory in the absence of acceptable alternatives. Yet I argue that such judgments are premature until we have adequately reflected upon the potential existence of goods internal to the practice of caregiving that we might not wish to surrender, or that it might be unwise to surrender even if we might often wish to do so. Such reflection, I claim, gives rise to considerations that must be weighed alongside considerations of the likely impact of carebots on care recipients.

Taking as a guiding insight Coeckelbergh’s (2009) claim that we must look beyond mere application of “external” ethical criteria for human-robot relations, I propose to examine the goods internal to caring practices and the potential impact of carebots on caregivers by means of three complementary ethical approaches: virtue ethics, care ethics and the capabilities approach. Each of these philosophical frameworks sheds new light on: 1) the contexts in which carebots might deprive potential caregivers of important moral goods central to caring practices, 2) the contexts in which carebots might help caregivers sustain or even enrich those practices, and 3) the specific nature of those moral goods.

2. Carebots and the ethical significance of caring practices

2.1. THE VIRTUES OF CARE

A virtue-ethical account offers rich resources for our inquiry in the form of a range of moral virtues that can be cultivated and sustained through caring practices. Patience, understanding, charity, prudence, reciprocity and empathy can each be cultivated through sustained caring activity. ‘Excellent carers’ manifest a powerful ability to anticipate and interpret the needs of others, even when not explicitly communicated. They habitually express effective responses to those needs, even in unusual or rapidly changing situations. They are able to maintain emotional bonds with others, even under physically and mentally demanding circumstances. They enable the autonomy and self-expression of those they care for, to whatever degree possible. If Aristotle is right that the virtues must be cultivated by habitual performance of practices appropriate to their expression (1984, 1103b1), then caring practices are an important, perhaps even essential, part of one’s moral development. This is a compelling reason to examine the potential impact of carebots designed to free us from those practices. Yet carebots have also been proposed as a means of facilitating deeper human engagement in caring practices, by taking over routine or unpleasant chores that drain our energy for giving good care (Coeckelbergh, 2010).
This suggests the need for a sustained study of which kinds of caring practices are most critical for the cultivation of caring virtues. Such a study, guided by a virtue-ethical framework, could greatly assist the ethical implementation of carebots by providing carebot developers, institutions, and caregivers with critical information about the moral value of various caregiving practices.

2.2. CARE ETHICS, CAREBOTS AND THE ETHICAL IDEAL

Care ethics provides another source of insight. Noddings (1984) offers an account of the ‘caring relation’ that takes it to be ethically primary in human existence - a source not only of individual virtues, but also (and more fundamentally) of an ethical ideal that motivates and guides human flourishing. I will argue that carebots might be used to modify contexts of care in ways that preserve or enhance this ethical ideal, allowing us to be engrossed in the needs of the other, moved to attend to them, and open to the responses of those for whom we care. Yet Noddings’ account can also remind us that our aim is not to be liberated from the caring relation itself, for if she is right, this is the only human relation through which our own ethical ideal can be nurtured.

2.3. CARING AND THE CAPABILITIES APPROACH

Nussbaum’s capabilities approach provides a third perspective on the goods internal to caring practices. Among the capabilities emphasized by Nussbaum as critical to human flourishing (2006, 76-77), I argue that affiliation, practical reason and emotion are each realized, to a critical degree, through caring practices. For it is at least partly through providing care that I develop the intimate knowledge of human vulnerability needed to fully exercise these capabilities. We must therefore reflect carefully on the way in which the introduction of carebots in society could inhibit or enhance their development.

3. Conclusion

Together these conceptual frameworks can remind us that in reflecting upon the ethical portent of carebot technology, we must consider more than just the quality of care robots can give, the relevant preferences and likely reactions of cared-fors, or the strong social pressures we face to better meet the needs of the vulnerable among us. These are all serious ethical considerations to which we must carefully attend in weighing the costs, benefits and risks of carebot implementation - but it is of critical importance that we not overlook the moral goods internal to caring itself.

References

Aristotle (1984). The Complete Works of Aristotle: Revised Oxford Translation. Princeton: Princeton University Press.
Coeckelbergh, M. (2009). Personal robots, appearance and human good: A methodological reflection on roboethics. International Journal of Social Robotics, 1(3), 217-221.
Coeckelbergh, M. (2010). Health care, capabilities and AI assistive technologies. Ethical Theory and Moral Practice, 13(2), 181-190.
Noddings, N. (1984). Caring: A Feminine Approach to Ethics and Moral Education. Berkeley: UC Press.
Nussbaum, M. (2006). Frontiers of Justice: Disability, Nationality, Species Membership. Cambridge: Harvard University Press.

CO-CONSTRUCTION AND CO-MANAGEMENT OF ONLINE IDENTITIES: A Confucian Perspective

PAK-HANG WONG
Department of Philosophy, University of Twente

Abstract. In information and computer ethics, the discussion of personal identities online (PIOs) is often framed as if individuals are victims who need protection, e.g. against violations of privacy, identity theft, etc.
In this respect, many of the discussions related to PIOs in the current literature are negative, in that they aim to provide and justify certain constraints and restrictions on (the use of) PIOs. While the issues concerning privacy, identity theft, etc. are undoubtedly important, the sole focus on negative aspects of PIOs is undesirable, for it has narrowed the scope of issues related to PIOs, particularly the more positive issues pertaining to PIOs, e.g. how we should construct and manage our PIOs. Recently, Noëmi Manders-Huits has studied the notion of “identity management” in the context of information technology. Manders-Huits’s article is significant because she has explicitly turned away from the negative issues and moved on to issues about the construction and management of identities in IT, which are far more positive. As such, her discussion introduced a new area of research that is so far largely neglected. Although her study of identity management is illuminating, I think her account is ultimately unsatisfactory, as she has failed to properly acknowledge one important facet of PIOs, namely that they are co-constructed and co-managed. The aim of this paper, therefore, is to call attention to the fact that PIOs are co-constructed and co-managed, and to identify some conceptual and ethical issues arising from this. Finally, I will outline answers to these issues using a Confucian notion of personhood and identity.

1.

In information and computer ethics, the discussion of personal identities online (PIOs) is often framed as if individuals are victims who need protection, e.g. against violations of privacy, identity theft, etc. In this respect, many of the discussions related to PIOs in the current literature are negative, in that they aim to provide and justify certain constraints and restrictions on (the use of) PIOs. As Shoemaker noted, most of the literature in the field has attempted to specify “a protected zone of private information, consisting in information about me” (Shoemaker, 2010, pp. 3-4). While the issues concerning privacy, identity theft, etc. are undoubtedly important, the sole focus on negative aspects of PIOs is undesirable, for it has narrowed the scope of issues related to PIOs, particularly the more positive issues pertaining to PIOs, e.g. how we should construct and manage our PIOs. Recently, Noëmi Manders-Huits (2010) has studied the notion of “identity management” in the context of information technology. Manders-Huits’s article is significant because she has explicitly turned away from the negative issues and moved on to issues about the construction and management of identities in IT, which are far more positive. As such, her discussion introduced a new area of research that is so far largely neglected. Although her study of identity management is illuminating, I think her account is ultimately unsatisfactory, as she has failed to properly acknowledge one important facet of online identities, namely that online identities are co-constructed and co-managed. The aim of this paper, therefore, is to call attention to the fact that online identities are co-constructed and co-managed, and to identify the conceptual and ethical issues arising from this. Finally, I will outline answers to these issues using a Confucian notion of personhood and identity.

I will begin this paper with Manders-Huits’s account of identity management. According to Manders-Huits, there are two senses of “identity management”.
The first is used predominantly in the technical discourse, where identity management refers to the practice of collecting, organising and, subsequently, utilising personal information for the purpose of (re-)identification and categorisation (Manders-Huits, 2010, p. 47). The second sense of identity management involves not only a set of descriptions about the individual; it also involves reflexive self-identification with some sets of beliefs, values or ideals, where those beliefs, values and/or ideals provide reasons for our actions and, at the same time, make the actions genuinely ours (see, e.g., Korsgaard 1996; Frankfurt 1988, 1999, 2004 & 2006). Identity management in the second sense, therefore, requires individuals to manage their beliefs, values and ideals, and to resolve possible conflicts among them (Manders-Huits, 2010, pp. 48-49). As she rightly pointed out, identity management is an issue deserving more attention, as there is a discrepancy between the two senses of “identity management”, and the moral and practical dimension of identity is currently not being taken into account either in the technical discourse or in the technologies. Yet, given the centrality of moral and practical identity in our lives, this neglect has to be rectified. I agree entirely with her claim, but I shall also point out that identity management will become even more important as information technology continues to develop and to be adopted.

2.

As information technology (and the Web) continues to advance, it will - to use Luciano Floridi’s terminology - re-ontologise the nature of ourselves and our world. According to Floridi, we are (becoming) inforgs, i.e. “connected informational organisms”, living in an infosphere, i.e. “an environment constituted by all informational entities [...], their properties, interactions, processes and mutual relations” (Floridi, 2007, pp. 60, 62 & 59). At a certain point, Floridi argued, the boundaries between the life offline and the life online will eventually evaporate, and by then individuals will be living on the Web “onlife”. Among other characteristics, the onlife of inforgs in an infosphere is characterised by instant, seamless exchanges of offline and online information. In other words, the flow of (personal) information will become, at least, bi-directional. What this means is that when individuals act on the Web, it will have immediate and direct impacts on their non-Web counterparts. In this scenario, identity management for online identities becomes essential. Since it will no longer be possible to distinguish the offline from the online, it will be impossible to dissociate online identities from offline identities too. Or, to put it differently, what remain are onlife identities. While Manders-Huits is right to point out that identity management is an important issue for researchers in information and computer ethics, I shall argue that her account of identity management is unsatisfactory, because she has failed to properly acknowledge the fact that online identities are co-constructed and co-managed by multiple parties.
This failure is reflected in her suggestion to engineers and technology designers, when she remarks that they “should provide ways for individuals to construct and maintain their [reflexive, self-identification with some sets of beliefs, values or ideals] and [some sets of descriptions about themselves], in addition to their administrative, forensic counterpart” (Manders-Huits, 2010, p. 54). It is obvious that the emphasis is on empowering individuals in managing their personal information. Yet what is missing here is this: while it is true that individuals construct and manage their online identities, they are not the only ones who contribute to that construction and management. For example, a person’s profile on Facebook is not only what that person inputs, but the totality of information on the profile, including his/her friends, conversations, etc. In other words, not all identity-related information is under the person’s control. In light of this, I shall argue that there is a need to reconceptualise PIOs in terms of co-construction and co-management; and I shall also argue that unless the person is omnipotent and omnipresent, empowering individuals is always insufficient.

3.

At this point, I suggest that we can learn a lesson from Confucianism. I will point out that Confucians conceptualised personhood and identity as inherently interdependent and relational (Wong 2004; Lai 2006; Yu & Fan 2007). The Confucian notions of personhood and identity, I shall argue, provide us with an alternative way to conceptualise PIOs, one which can take into account the co-construction and co-management of PIOs. Moreover, Confucian personhood and identity are accompanied by an ethics based on individuals’ social roles (Nuyen 2009). Here, I will suggest that the role-based ethics of Confucianism offers a fitting complement to Manders-Huits’s strategy of individual empowerment.

References

Floridi, L. (2007). A look into the Future Impact of ICT on Our Lives. The Information Society, 23(1), 59-64.
Floridi, L. (2009). The Semantic Web vs. Web 2.0: A Philosophical Assessment. Episteme, 6, 25-37.
Frankfurt, H. (1988). The importance of what we care about: philosophical essays. Cambridge: Cambridge University Press.
Frankfurt, H. (1999). Necessity, volition, and love. Cambridge: Cambridge University Press.
Frankfurt, H. (2004). The reasons of love. Princeton, N.J.: Princeton University Press.
Frankfurt, H. (2006). Taking ourselves seriously and getting it right. Stanford, Calif.: Stanford University Press.
Korsgaard, C. (1996). The sources of normativity. Cambridge: Cambridge University Press.
Lai, K. (2006). Learning from Chinese Philosophies: Ethics of Interdependent and Contextualised Self. UK: Ashgate.
Manders-Huits, N. (2010). Practical versus moral identities in identity management. Ethics and Information Technology, 12(1), 43-55.
Nuyen, A. T. (2009). Moral Obligation and Moral Motivation in Confucian Role-Based Ethics. Dao, 8, 1-11.
Shoemaker, D. W. (2010). Self-exposure and exposure of the self: information privacy and the presentation of identity. Ethics and Information Technology, 12(1), 3-15.
Tavani, H. T. (2008). Informational Privacy: Concepts, Theories, and Controversies. In K. E. Himma and H. T. Tavani (Eds.), The Handbook of Information and Computer Ethics (pp. 131-164). Hoboken, NJ: John Wiley and Sons.
Wong, D. (2004). Relational and Autonomous Selves. Journal of Chinese Philosophy, 31(4), 419-432.
Yu, E., & Fan, R. (2007). A Confucian View of Personhood and Bioethics. Bioethical Inquiry, 4, 171-179.
(2007). A Confucian View of Personhood and Bioethics. Bioethical Inquiry, 4, 171–179. Track VI: Multidisciplinary Perspectives REFLECTIVE INEQUILIBRIUM BERT BAUMGAERTNER University of California, Davis 1240 Social Sciences and Humanities, University of California, Davis, One Shields Avenue, Davis, CA 95616 Abstract. I show that under a traditional introspective method of philosophical investigation, certain projects of conceptual analysis are bounded by a reflective inequilibrium. That is, although it is possible to make some progress towards bringing our classificatory intuitions and the relevant criteria into agreement, there is a barrier that cannot be overcome with traditional methods when the concept in question is plastic. We can show the limitations of the traditional method of conceptual analysis by considering its computational analog. Suppose we have an algorithm C that determines a set of cases that fall under a given concept and another algorithm T which tests cases by consulting C (which responds with 'Yes' or 'No'). If C is static (and decidable), then in principle T can develop a criterion for it. Moreover, every verification procedure that T uses to check the match yields consistent results. However, this turns out not to be the case when C is plastic. Even if we assume the best-case scenario in which a proposed criterion matches the set of cases determined by the concept, testing cases near the boundary moves the boundary, and so the criterion will no longer match. So even if an algorithm gets a match via a lucky guess, it is unable to verify the match. A state of affairs where no perfect match can be verified is a reflective inequilibrium. That some concepts are plastic is supported by empirical evidence which shows that classificatory intuitions can be affected by the order in which cases are considered. Swain et al. (2008) found that individual intuitions can vary according to whether, and which, other thought experiments were considered first. It is likely that the varying intuitions track shifts in the classificatory dispositions of our concepts. In fact, it is well accepted in cognitive psychology and cognitive science that human concepts are flexible and dynamic in this way. Interestingly, then, a computational approach to the traditional introspective methodology thereby gives us a possible explanation for why conceptual analysis is so difficult and usually unsuccessful. Extended Abstract In this paper, I show the far-reaching effects of the computational turn by shedding light on a traditional problem. Specifically, I show that under a traditional introspective method of philosophical investigation, certain projects of conceptual analysis are bounded by a reflective inequilibrium. In the philosophical literature, particularly in certain domains of epistemology, it is assumed that a conceptual analysis of knowledge, for example, is possible through a process of reflective equilibrium. This process is a virtuous circle, where we make some headway on settling which cases count as knowledge in order to develop some criteria, and we let the development of criteria help us settle on which cases count as knowledge. As I will show, however, although it is possible to make some progress towards bringing these two into agreement, there is a barrier that cannot be overcome with traditional methods when the concept in question is plastic.
Since it is plausible that our concept of knowledge is plastic (Weinberg et al., 2001), the possible progress of an analysis given traditional methods is bounded by a reflective inequilibrium. More specifically, a traditional method of doing conceptual analysis can be characterized as the attempt to bring into agreement our classificatory intuitions about cases and a proposed criterion that defines the relevant set of cases. We then proceed by testing proposed criteria. This is done by (a) introspectively checking whether every possible case as specified by a criterion is an instance of the concept in question, and (b) introspectively checking whether every possible instance of the concept in question is a possible case specified by the criterion. We can show the limitations of the traditional method of conceptual analysis by considering its computational analog. We have an algorithm C that determines a set of cases that fall under a given concept. We then have another algorithm T which tests cases by consulting C (which responds with 'Yes' or 'No'). Given data from C, T attempts to develop a criterion for the set of cases determined by C. If this set is static (and decidable), then in principle T can develop a criterion for it. Moreover, every verification procedure that T uses to check the match yields consistent results. However, this turns out not to be the case when C is plastic. Let us assume the best-case scenario in which a proposed criterion matches C. In order for T to verify the match, it must test some cases again. But since C is plastic, testing cases near the boundary moves the boundary, and so the criterion will no longer match C. Then T will get an inconsistent result for some verification procedure. So even if T gets a match via a lucky guess, it is unable to verify the match. Let us call a state of affairs where no perfect match can be verified a reflective inequilibrium. We have appealed to an intuitive notion of plasticity. More rigorously, plasticity can be implemented in an artificial cognitive system by the specification of two features: (i) the conditions for when the boundary of a concept shifts, and (ii) how much the boundary of the concept shifts. Such algorithms behave in the following way. When given cases to classify near the boundary, the boundary shifts by some amount, so that future cases which may have been classified positively (negatively) may now be classified negatively (positively). Boundary shifting is more or less stable depending on how the cases are selected for testing and how features (i) and (ii) are specified. That some concepts are plastic is supported by empirical evidence which shows that classificatory intuitions can be affected by the order in which cases are considered. For example, Swain et al. (2008) found that individual intuitions can vary according to whether, and which, other thought experiments were considered first. It is natural to suppose that the varying intuitions track shifts in the classificatory dispositions of our concepts. In fact, it is well accepted in cognitive psychology and cognitive science that human concepts are flexible and dynamic in this way. Psychologists such as Lawrence Barsalou (1987) and James Hampton (2007) have suggested that this is a good thing, for it provides us with the capacity to track environmental changes while maintaining the identity of the relevant concept(s).
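The two-feature characterisation of plasticity can be made concrete with a minimal sketch in Python. This is not Baumgaertner's own implementation: the threshold classifier, the bisection tester, and the sensitivity and shift parameters are all illustrative assumptions. It shows an algorithm C whose boundary moves when probed near it, and an algorithm T that can estimate the boundary but cannot verify its estimate.

```python
import random

class PlasticConcept:
    """A toy 'algorithm C': classifies x in [0, 1] against a boundary
    that shifts whenever it is queried near that boundary."""
    def __init__(self, boundary=0.5, sensitivity=0.05, shift=0.01):
        self.boundary = boundary          # current extension of the concept
        self.sensitivity = sensitivity    # feature (i): when the boundary shifts
        self.shift = shift                # feature (ii): how much it shifts

    def classify(self, x):
        answer = x >= self.boundary       # C's 'Yes'/'No' as True/False
        if abs(x - self.boundary) < self.sensitivity:
            # probing a borderline case perturbs the concept itself
            self.boundary += random.uniform(-self.shift, self.shift)
        return answer

def propose_criterion(concept, trials=200):
    """A toy 'algorithm T': estimates the boundary by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(trials):
        mid = (lo + hi) / 2
        if concept.classify(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def verify(concept, criterion, probes=100):
    """Re-test borderline cases; count disagreements with the criterion."""
    mismatches = 0
    for _ in range(probes):
        x = criterion + random.uniform(-0.02, 0.02)
        if (x >= criterion) != concept.classify(x):
            mismatches += 1
    return mismatches

random.seed(0)
c = PlasticConcept()
criterion = propose_criterion(c)
print("proposed criterion:", round(criterion, 3))
print("mismatches on verification:", verify(c, criterion))  # rarely zero

# With sensitivity = 0 (a static concept) verification succeeds; with a
# plastic boundary, testing near it moves it, so no perfect match can be
# verified -- a reflective inequilibrium in miniature.
```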
Let the plasticity hypothesis be the hypothesis that our concepts are apt to change their classificatory dispositions. In sum, taking a computational approach to traditional introspective conceptual analysis illuminates the limitations of this particular methodology. It is common to think that a barometer of how well we understand cognitive capacities is our ability to simulate them in artificial systems. Given that we have adequate algorithmic implementations of the plasticity hypothesis and the traditional methodology, we can rigorously prove limitations of the traditional methodology. We thereby have a possible explanation for why conceptual analysis is so difficult and usually unsuccessful: introspection can provably only take us part of the way. Consequently, the computational approach can make way for the development of additional tools to study human capacities of categorization. Acknowledgements Thanks to Adam Sennet and attendees of the philosophy graduate student workshop at UC Davis for helpful suggestions in the initial development of the ideas. Special thanks to Bernard Molyneux for comments and support. References Barsalou, L. (1987). The instability of graded structure: Implications for the nature of concepts. In: U. Neisser (Ed.), Concepts and Conceptual Development: Ecological and Intellectual Factors in Categorization (pp. 101–140). Cambridge: Cambridge University Press. Hampton, J. (2007). Typicality, graded membership, and vagueness. Cognitive Science: A Multidisciplinary Journal, 31 (3), 355–384. Swain, S., J. Alexander, and J. Weinberg (2008). The instability of philosophical intuitions: Running hot and cold on Truetemp. Philosophy and Phenomenological Research, 76 (1), 138–155. Weinberg, J., S. Nichols, and S. Stich (2001). Normativity and epistemic intuitions. Philosophical Topics, 29 (1-2), 429–460. THE INFORMATION-COMPUTATION TURN: A HACKING-TYPE REVOLUTION ISRAEL BELFER Science, Technology and Society Program, Bar Ilan University, Ramat Gan, Israel Abstract. Hacking's Styles of Reasoning (Hacking 1981, 1992) are utilized to describe the impact Information Theory has had on science in the 20th century, in theory and application. A generalized, Information-laden scientific style of reasoning is introduced, generalizing the information-theoretical and computational turn in science and society. Information-laden science will be examined according to Hacking's criteria for a new Style, and its associated 'revolution' (Schweber and Watcher, 2000). These criteria include a new scientific vocabulary as well as a wider social and conceptual context. The specific branch of science chosen to exhibit the new style is physics, which manifests a wide range of a style's attributes: science in an information age ('e-science'); hard theoretical physics such as Black-Hole Thermodynamics (BHTD) and the consequent Black-Hole Wars (Susskind, 2008); and the advent of Quantum Information Theory (QIT) – namely Quantum Information and Quantum Computation. 1. Introduction – Hacking-Type Revolutions Hacking's Styles of Reasoning (Hacking 1981, 1992; Crombie, 1994) are meta-concepts that arrange the scheme of ideas and practices in science and society.
They are described as: “The active promotion and diversification of the scientific methods of late medieval and early modern Europe reflected the general growth of a research mentality in European society, a mentality conditioned and increasingly committed by its circumstances to expect and to look actively for problems to formulate and solve, rather than for an accepted consensus without argument. The varieties of scientific method so brought into play may be distinguished as: (a) the simple postulation established in the mathematical sciences, (b) the experimental exploration and measurement of more complex observable relations, (c) the hypothetical construction of analogical models, (d) the ordering of variety by comparison and taxonomy, (e) the statistical analysis of regularities of populations and the calculus of probabilities, and (f) the historical derivation of genetic development. The first three of these methods concern essentially the science of individual regularities, and the second three the science of the regularities of populations ordered in space and time.” The rise of a Style of Reasoning manifests itself in a Hacking-Type Revolution that accompanies the new Style. 1.1 A NEW HACKING-TYPE REVOLUTION Schweber & Watcher (2000) recognized in the computational (information-processing) revolution the rise of such a Style: “We are witnessing another Hacking type revolution, which for lack of a better name we call the ‘complex systems modeling and simulation’ revolution, for complexity is one of its buzzwords and mathematical modeling and simulation on computers constitute its style of reasoning”. This Style and its revolution should be adopted and combined with the ubiquity of Information-theoretical terminology in science (Arndt, 2004) into a generalized form: a Hacking-Type revolution of Information-laden science, with digitized Information as its Style. By expanding the same theme of the Hacking-type revolution to include communication and cryptography, one achieves more than a parceling together of the theoretical basis for these fields of research. It in fact relays a basic theme in science and technology, since communication and computation – Information transfer and processing – are inextricably linked theoretically and practically. The common thread connecting all of these theoretical approaches and applied technologies is the modern concept of quantified information. 1.2 INFORMATION-LADEN PHYSICS The technological and theoretical growth embodied in the fields of computation and communication amalgamates into a Style of Reasoning with Digitized Information (Shannon, 1948) at its core: that is, information and its measures (Arndt, 2004). A science laden with Information (paraphrasing ‘theory-laden’ science) is saturated with direct and indirect reliance on IT and Information measures for defining problems and their solutions, influencing the theory and the practice of science. Experiment becomes data acquisition (Brillouin, 1956); analysis, the computerized simulation and processing of relevant datasets. Much of this process is due to Maxwell’s Demon (Leff, 2003), the thought experiment that has challenged the second law of thermodynamics since the end of the 19th century. Attempts to deal with it catalyzed lines of theoretical research that primed physics for a turn towards Information, prompting the tight connection between the thermodynamics of computation and IT (Bennett, 1973).
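The measures and the thermodynamic connection invoked here are easy to state concretely. A brief sketch (illustrative only; the temperature value is an arbitrary assumption) of Shannon's entropy measure and the Landauer bound that prices an erased bit in joules:

```python
import math

def shannon_entropy(probs):
    """Shannon's (1948) measure H = -sum p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries one bit; a biased source carries less.
print(shannon_entropy([0.5, 0.5]))    # 1.0
print(shannon_entropy([0.9, 0.1]))    # ~0.469

# Landauer's bound ('information is physical'): erasing one bit
# dissipates at least k_B * T * ln 2 of energy.
k_B = 1.380649e-23        # Boltzmann constant, J/K
T = 300.0                 # assumed room temperature, K
print(k_B * T * math.log(2))          # ~2.9e-21 J per erased bit
```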
This shift is reinforced by a deeper moment in abstract theoretical work: IT as scientific modeling of nature, such as the Maximum Entropy Principle (Jaynes, 1957). The declaration that 'Information is physical' (Landauer, 1991; Karnani et al., 2009) connects communication and computation together with fundamental physics and the second law of thermodynamics. Information is considered by some 'the new language of science' (von Baeyer, 2005) and a new 'metaparadigm' in popularized depictions of the change (Siegfried, 2000; Seife, 2006). 2. New Fields of Information-Laden Physics The 20th century saw the development of core mathematical physics imbued with IT (von Baeyer, 2005), i.e. Information-laden science; Jacob Bekenstein's seminal work on Black Hole Thermodynamics (BHTD) is a case in point (Bekenstein, 1973, 2006). Fields of research such as Quantum Information Theory (Fuchs, 2010) and String Theory (Susskind, 2008) do more than utilize Shannon's Information-Entropy measure. They link physical reality to computation and cryptography. BHTD and M-Theory produced the Holographic Principle ('t Hooft, 1993; Susskind, 1995), according to which physical reality is encoded onto the surface area of the universe. QIT points toward the possibility of pan-computationalism (Lloyd, 2006; Feynman, 1981; Zuse, 1967), with all physical phenomena understood as bit-flipping. Wheeler (1990) takes it even further: every physical object is essentially Informational – his famous aphorism “It from Bit”. 3. New Style – Spheres of Science and Society 3.1 NEW SENTENCES, OBJECTS AND LAWS A new Style enjoys a new semantic field of definitions, sentences and criteria for the proper conduct of science (Hacking, 1992). The new aforementioned topics and disciplines in science are built on precisely such constructs. It is through Information terminology that the Holographic Principle and its ramifications on the criteria for a well-constructed M-Theory can be expressed, and that the computational universe can be entertained and weighed as a model for physical reality. 3.2 THE INFORMATION AGE The wider social setting for these changes in science is explored in the sociological, economic and political research of the Information Age (Castells, 2004). The theoretical, applied-scientific and technological aspects of the Information-laden revolution are organic to this social moment. Acknowledgements I would like to thank Prof. Silvan Schweber and Dr. Raz Chen Moris for their great support in all stages of this research. I would also like to thank Dr. Chris Fuchs for the great conversations and discussions (on QIT and Chupakabras). References Arndt, Christoph (2004). Information Measures: Information and Its Description in Science and Engineering. Heidelberg-Berlin: Springer. von Baeyer, Hans Christian (2005). Information: The New Language of Science. Harvard University Press. Bekenstein, Jakob (1973). Black Holes and Entropy. Phys. Rev. D7, 2333. Bekenstein, Jacob (2006). Of Gravity, Black Holes and Information. Rome: Di Renzo Editore. Bennett, C. H. (1973). Logical reversibility of computation. IBM Journal of Research and Development, 17(6), 525-532. Brillouin, Leon (1956). Science and Information Theory. Mineola, N.Y.: Dover. Castells, Manuel (2004). Informationalism, Networks, and the Network Society: a Theoretical Blueprinting. Northampton, MA: Edward Elgar. Feynman, Richard P. (1981).
Simulating Physics with Computers [Keynote speech at the 1st Conference on Physics and Computation, MIT, 1981]. International Journal of Theoretical Physics, 21(6/7), 467-488, 1982. Fuchs, Christopher A. (2010). Coming of Age With Quantum Information: Notes on a Paulian Idea. Cambridge University Press. Hacking, Ian (1981). From the Emergence of Probability to the Erosion of Determinism. In J. Hintikka, D. Gruender and E. Agazzi (Eds), Probabilistic Thinking, Thermodynamics and the Interaction of the History and Philosophy of Science, Proceedings of the 1978 Pisa Conference on the History and Philosophy of Science (Vol. II, pp. 105-123). Dordrecht: Reidel. Hacking, Ian (1992). 'Style' for Historians and Philosophers. In Historical Ontology, Harvard University Press, 178-200. Hawking, Stephen W. (July 2005). Information Loss in Black Holes. arXiv:hep-th/0507171. 't Hooft, G. (1993). Dimensional Reduction in Quantum Gravity. arXiv:gr-qc/9310026v2. Jaynes, Edwin T. (1957). Information Theory and Statistical Mechanics. Physical Review, 106, 620-630. Landauer, R. (1991). Information is physical. Physics Today, May 1991. Leff, Harvey S., Rex, Andrew F. (Eds) (2003). Maxwell's Demon 2: Entropy, Classical and Quantum Information. CRC Press. Lloyd, Seth (2006). Programming The Universe: A Quantum Computer Scientist Takes On the Cosmos. New York: Random House. Schweber, S., Watcher, M. (2000). Complex Systems, Modelling and Simulation. Stud. Hist. Phil. Mod. Phys., 31(4), 583-609. Susskind, Leonard (1995). The World as a Hologram. J. Math. Phys., 36, 6377-6396. Susskind, Leonard (2008). The Black Hole War: My battle with Stephen Hawking to make the world safe for quantum mechanics. Little, Brown and Co. Shannon, Claude E. (1948). A Mathematical Theory of Communication. Bell Syst. Tech. J., 27, 379–423. Wheeler, J. A. (1990). Information, Physics, Quantum: The Search for Links. In W. H. Zurek (Ed.), Complexity, Entropy, and the Physics of Information. Redwood City, Cal.: Addison Wesley. Zuse, Konrad (1967). Rechnender Raum. Elektronische Datenverarbeitung, 8, 336-344. HOW MUCH DO FORMAL NARRATIVE ANNOTATIONS DIFFER? A Proppian Case Study RENS BOD Institute for Logic, Language and Computation, Universiteit van Amsterdam AND BENEDIKT LÖWE Institute for Logic, Language and Computation, Universiteit van Amsterdam AND SANCHIT SARAF Institute for Logic, Language and Computation, Universiteit van Amsterdam Abstract. The formal study of narratives goes back to the Russian structuralist school, paradigmatically represented by the 1928 study Morphology of the Folktale by Vladimir Propp. Researchers in the field of computational narratology have developed the general Proppian methodology into various formal and computational frameworks for the analysis, automated understanding and generation of narratives. Methodological issues in this research field give rise to concrete research questions such as “How much does the representation of a narrative in a given formal framework depend on subjective decisions of the formalizer?”, touching the philosophy of computing and the philosophy of information. In order to approach this question, we consider the process of formally representing a narrative as a natural analogue of the task of annotation in computational linguistics and corpus linguistics.
We use the Russian folktales formalized by Propp, let them be formalized by annotators according to Propp's system, and evaluate the results according to the standards of inter-annotator agreement. The formal study of narratives goes back to the Russian structuralist school, paradigmatically represented by the 1928 study Morphology of the Folktale by Vladimir Propp (1928), in which he identifies seven dramatis personae and 31 functions that allow him to formally analyse a corpus of Russian folktales. Researchers in the field of computational narratology (or “computational models of narrative”) have developed the general Proppian methodology into various formal and computational frameworks for the analysis, automated understanding and generation of narratives. Examples for this are Lehnert (1981)'s Plot Units, Rumelhart (1980)'s Story Grammars, Schank (1982)'s Thematic Organization Points (TOPs), Dyer (1983)'s Thematic Abstraction Units (TAUs), and Turner (1994)'s Planning Advice Themes (PATs). Over the last decades, the main interest of this research community lay in the technical challenges that the computational treatment of narratives brings, but recently there is again increased interest in the methodological and conceptual issues involved, linking this research closely to questions of the philosophy of information (cf. the paper (Löwe, to appear) presented at the 3rd Workshop on the Philosophy of Information). This interest is witnessed by workshops such as the recent AAAI workshop on Computational Models of Narrative that brought researchers from this field together with philosophers, narratologists and professional story tellers. The methodological issues involved give rise to concrete research questions such as:
• How do you compare formal frameworks of narrative? (Cf. Löwe 2010 and Löwe, to appear.)
• How do you assess the quality of a formal framework of narrative?
• How much does the representation of a narrative in a given formal framework depend on subjective decisions of the formalizer?
Question 1 is a genuinely philosophical question, but the more technical questions 2 and 3 are also very relevant for gaining philosophical insight into what constitutes the formal core of the concept of narrative. In this paper, we approach question 3 of the above list. To this end, we think of the process of formally representing a narrative in a formal system as a natural analogue of the task of annotation in corpus linguistics and computational linguistics. Whereas typical annotation tasks involve the annotation of sentences or discourses (cf., e.g., Marcus et al. 1993, Brants 2000, Passonneau et al. 2006), the formalization or annotation of a narrative is at the next level of complexity, involving sequences or systems of discourses connected to a narrative. First studies suggest that question 3 is not easy to tackle, for the following reasons: First, ambiguity, which in typical linguistic annotation is a rather confined phenomenon, becomes ubiquitous at the level of narratives: the natural answer to a formalization task is not one annotation, but a family of consistent annotations (cf. Löwe 2010, §2). Secondly, even allowing for multiple annotations, it is not clear whether consensus about whether a given annotation is a valid representation of a narrative is easy to achieve.
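The agreement standard invoked above is typically quantified with a chance-corrected coefficient such as Cohen's kappa (cf. Carletta et al. 1997). A minimal sketch, with entirely hypothetical Proppian function labels for ten tale segments (the labels and values are illustrative assumptions, not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    # chance agreement expected from each annotator's label frequencies
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical Proppian function labels assigned by two annotators
# to the same ten tale segments (e.g. 'A' = villainy, 'H' = struggle).
ann1 = ["A", "B", "H", "A", "K", "H", "A", "B", "K", "H"]
ann2 = ["A", "B", "H", "K", "K", "H", "A", "A", "K", "H"]
print(round(cohens_kappa(ann1, ann2), 3))   # 0.73 for these toy labels
```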
Of course, these questions naturally reflect a well-known discussion from computational linguistics: in sentence- or discourse-level annotation, the quality of annotation is typically studied as inter-annotator agreement (Carletta et al. 1997, Marcu et al. 1999). For the annotation or formalization of narratives, no such analysis has ever been done, not even with the oldest and best-known formal approach to narrative structure, the Proppian narratemes. In this study, we use English translations of the Afanas'ev tales formalized by Propp (Afanas'ev 1973), train a group of annotators in the use of Propp's system, and then let them formalize a selection of tales in that formal framework. We evaluate these results according to the standards of inter-annotator agreement from computational and corpus linguistics (Carletta et al. 1997). References Afanas'ev, A. (1973). Russian fairy tales. Pantheon. Translation by Norbert Guterman from the collections of Aleksandr Afanas'ev. Folkloristic commentary by Roman Jakobson. Brants, T. (2000). Inter-annotator agreement for a German newspaper corpus. In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000). Carletta, J.C., Isard, A., Isard, S., Kowtko, J., Doherty-Sneddon, G. & Anderson, A. (1997). The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1), 13-31. Dyer, M.G. (1983). In-depth understanding: A computer model of integrated processing for narrative comprehension. Artificial Intelligence Series. MIT Press. Lehnert, W.G. (1981). Plot units and narrative summarization. Cognitive Science, 4, 293-331. Löwe, B. (2010). Comparing formal frameworks of narrative structures. In M. Finlayson (Ed.), Computational Models of Narrative. Papers from the 2010 AAAI Fall Symposium (pp. 45-46). Volume FS-10-04 of AAAI Technical Reports. Löwe, B. (to appear). Methodological issues in comparing formal frameworks for narratives. In P. Allo & G. Primiero (Eds), 3rd Workshop on the Philosophy of Information. Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten. Marcu, D., Romera, M. & Amorrortu, E.A. (1999). Experiments in constructing a corpus of discourse trees: Problems, annotation choices, issues. In: Workshop on Levels of Representation in Discourse (pp. 71-78). Marcus, M.P., Santorini, B. & Marcinkiewicz, M.A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 302-330. Passonneau, R., Habash, N. & Rambow, O. (2006). Inter-annotator agreement on a multilingual semantic annotation task. In: Proceedings of LREC-2006. Propp, V. (1928). Morfologiya skazki. Leningrad: Akademija. Rumelhart, D.E. (1980). On evaluating story grammars. Cognitive Science, 4, 313-316. Schank, R.C. (1982). Dynamic memory: A theory of reminding and learning in computers and people. Cambridge University Press. Turner, S. (1994). The creative process. A computer model of storytelling. Lawrence Erlbaum Associates. COMPUTERS AND PROCRASTINATION “I’ll just check my Facebook quick a Second...” NICK BREEMS Dordt College, Sioux Center, United States and University of Salford, Salford, United Kingdom Abstract. There seems to be something about computer technology that tempts us towards procrastination. This paper uses a philosophical toolkit to investigate why this might be, and how we can address the problem.
We employ a framework for understanding the human use of computers developed by Andrew Basden. Basden's work is based on the thought of the 20th-century Dutch philosopher Herman Dooyeweerd, who makes the strong claim that reality is meaningful in a wide variety of mutually irreducible aspects. The non-reductionist approach of Dooyeweerd's philosophy allows Basden’s framework to take everyday life seriously. Thus one of the strengths of a philosophical approach based on Dooyeweerd's thought is its ability to highlight important aspects of a problem that may be understudied. In this paper, the framework is used to perform an analysis of a particular example of computer-based procrastination, and potential avenues for investigation are highlighted that weren't immediately apparent when thinking about the problem generically. Thus we demonstrate that the use of a comprehensive framework for understanding the human use of computers and information systems from an everyday perspective shows some promise of providing insight into complex and challenging problems that arise in our information-technology-saturated culture. 1. Introduction There seems to be something about computer technology and internet connectivity that distracts us, that tempts us towards procrastination. This is borne out by personal experience, by anecdotal evidence (Breems, 2009), and by research (Lavoie and Pychyl, 2001; Thatcher, Wretchko, and Fisher, 2008). For a tool widely believed to enhance our productivity, this is remarkable. This naturally leads us to two questions: 1. Why is this? 2. How can we address this problem? What changes can we make in the way we design and implement computer systems, or in the way we approach and use such technology, that would reduce these distracting tendencies? Research in the philosophy of computers and information systems can help us understand the use of computers as it plays out in everyday human living. This paper employs a framework for understanding the human use of computers developed by Andrew Basden (2008) in his book Philosophical Frameworks for Understanding Information Systems. We use this framework to analyze computer-induced procrastination, and demonstrate that philosophical tools can bring fresh insight to vexing problems. 2. Basden’s Framework In Chapter 4 of his book, Basden proposed a framework for understanding the Human Use of Computers (the HUC framework), based on the work of the 20th-century Dutch philosopher Herman Dooyeweerd (1984). Dooyeweerd’s thought is deeply non-reductionist: he made the strong claim that reality is meaningful in a wide variety of mutually irreducible aspects. Dooyeweerd identified a suite of fifteen such modal aspects, and posited that each of these aspects operates under a different set of laws which enable meaningful functioning in that aspect. Based on these insights, the HUC framework analyzes any particular use of computer technology along two axes. Horizontally, all computer use exists as three simultaneous functionings, because we’re interacting with three different types of entity:
Human/Computer Interaction (HCI): To use a computer, we must interact with the computer itself, both with the hardware and with the user-interface portions of the software.
Engaging with Represented Content (ERC): Computer programs represent content we engage with that is meaningful to us.
For example, when we use an email program, it is not the internal voltages inside the CPU or the glowing of pixels on the screen that have direct meaning in our lives, but rather the content of the email messages and the information that they carry.
Human Living with Computers (HLC): The use of the computer plays out in our everyday lives; its effects escape the “box” that is the computer and affect things “out here” in our lived reality.
Vertically, he analyzes each of these functionings along each of Dooyeweerd’s modal aspects:
• Quantitative: of discrete amount
• Spatial: of continuous extension
• Kinematic: of flowing movement
• Physical: of energy and mass
• Biotic/Organic: of life functions and integrity of organism
• Sensitive/Psychic: of sense, feeling, and emotion
• Analytical: of distinction, conceptualizing, and inferring
• Formative: of formative power and shaping, in history, culture, creativity, achievement, and technology
• Lingual: of symbolic signification
• Social: of respect, social interaction, relationships, and institutions
• Economic: of frugality, skilled use of limited resources
• Aesthetic: of beauty, harmony, surprise, and fun
• Juridical: of “what is due”, rights, responsibilities
• Ethical: of self-giving love, generosity, care
• Pistic: of faith, commitment, trust, and vision
The non-reductionist approach of Dooyeweerd’s philosophy allows the framework to take everyday life seriously. That is, in our everyday experience of reality, we do not intuitively experience everything as mathematical, physical, or logical, but rather as diversely meaningful. The laws for the earlier aspects are largely descriptive; that is, we cannot disobey these laws (e.g. the law of gravity). The later laws, on the other hand, are prescriptive, and thus normative. They tell us how we ought to function, but do not force us to do so. For example, in the economic aspect, the law/norm of frugality tells us that we ought to use our time wisely. It allows us to make predictions about what kinds of consequences we can expect from obeying or not obeying that norm, but the choice to follow the norm or not is ours to make. 3. Use of the framework to analyze procrastination One of the strengths of a philosophical approach such as Basden’s framework is its ability to highlight important aspects of a problem that may be understudied. In this paper, the framework is used to perform an analysis of a particular example of computer-based procrastination: playing an online dice game instead of writing a paper. Potential avenues for investigation are highlighted that weren’t immediately apparent when thinking about the problem generically:
• All of the dysfunction occurs in the HLC (Human Living with Computers) category, while most of the benefits of procrastinating (usually psychic and aesthetic) occur in the ERC (Engaging with Represented Content) functioning. Because ERC is a category that is much more within the control of a software designer, this points to the hope that design alternatives could help in addressing the problem.
• The proximity of the procrastinatory activity to the legitimate activity, both spatially and kinesthetically, eases the transition from real work to work avoidance. Although designing a computer to put physical distance between, for example, the use of a word processor and playing a game seems infeasible, there are potential designs which would increase the psychological distance from one activity to the other.
• The HLC functioning in the Pistic aspect indicates that procrastination is a failure of commitment: we are insufficiently committed to the course of action we have committed ourselves to, resulting in a break of faith with other people in our lives, with our selves, and ultimately, with our religious convictions. A similar theme is suggested by Pychyl (2008).
Performing an analysis such as this, and evaluating the insight that results, is a preliminary way of testing the utility of the HUC framework itself. Thus we demonstrate that the use of a comprehensive framework for understanding the human use of computers and information systems from an everyday perspective shows some promise of providing insight into complex and challenging problems that arise in our information-technology-saturated culture. References Basden, A. (2008). Philosophical Frameworks for Understanding Information Systems. Hershey, PA: IGI Publishing. Breems, N. S. (2009, September 8). Nick Breems is doing a short research project [web log post]. Retrieved from http://www.facebook.com Dooyeweerd, H. (1984). A New Critique of Theoretical Thought (Vols. 1-4). Jordan Station, Ontario, Canada: Paideia Press. (Original work published 1953-1958). Lavoie, J. A. A., & Pychyl, T. A. (2001). Cyberslacking and the procrastination superhighway: A web-based survey of online procrastination, attitudes, and emotion. Social Science Computer Review, 19 (4), 431-444. Pychyl, T. A. (2008, April 7). Existentialism and procrastination: Bad faith [web log post]. Retrieved from http://www.psychologytoday.com/node/372 Thatcher, A., Wretchko, G., & Fisher, J. (2008). Problematic internet use among information technology workers in South Africa. CyberPsychology & Behavior, 11 (6), 785–787. COMBINATORY LOGIC WITH FUNCTIONAL TYPES IS A GENERAL FORMALISM FOR COMPUTING COGNITIVE AND SEMANTIC REPRESENTATIONS JEAN-PIERRE DESCLÉS Laboratory LaLIC, University of Paris-Sorbonne, Maison de la Recherche, 28 rue Serpente, 75006, Paris, France HEE-JIN RO Laboratory LaLIC, University of Paris-Sorbonne, Maison de la Recherche, 28 rue Serpente, 75006, Paris, France AND BRAHIM DJIOUA Laboratory LaLIC, University of Paris-Sorbonne, Maison de la Recherche, 28 rue Serpente, 75006, Paris, France Abstract. We show how it is possible to use Combinatory Logic (a logic of operators and composition of operators) explicitly to define aspectual operators and temporal relations in natural languages from basic primitives in the domain of temporality. 1. Combinatory Logic Combinatory Logic with functional types (CL) is a formalism used for studying the foundations of computer science (the semantics of programming languages) and for defining functional programming languages (such as HASKELL) built from this logical model. CL is a logic of operators and composition of operators. CL was developed principally by Curry and Feys (1958), and has since been used in linguistics by Shaumyan (1987) and by Desclés (1990). In computer science, an applicative program is viewed as a combination of elementary programs, the program being built up with the help of a complex combinator, this latter being the result of an applicative combination of elementary combinators.
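The flavour of such applicative combinations can be conveyed with a small sketch. The authors' own implementations use HASKELL (see below); the following Python rendering of the elementary combinators I, K, S and B of Curry and Feys is purely illustrative:

```python
# Elementary combinators of Curry & Feys (1958), written as curried
# higher-order functions -- an illustrative sketch only.
I = lambda x: x                                 # identity:    I x = x
K = lambda x: lambda y: x                       # constancy:   K x y = x
S = lambda f: lambda g: lambda x: f(x)(g(x))    # S f g x = f x (g x)
B = lambda f: lambda g: lambda x: f(g(x))       # composition: B f g x = f (g x)

# An 'applicative program' as a combination of elementary programs:
# B square double is the complex operator x -> square(double(x)).
square = lambda x: x * x
double = lambda x: 2 * x
program = B(square)(double)
print(program(3))    # 36 = square(double(3))

# S K K behaves like the identity combinator: S K K x = K x (K x) = x.
print(S(K)(K)(42))   # 42
```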
The same idea can be used in other fields: logic and philosophy (the logical analysis of paradoxes and of some philosophical concepts), nanostructure synthesis and molecular combinatory computing (MacLennan, 2003), and cognitive representations, where a symbolic representation is an applicative organization of semantic primitives. Linguistic units are viewed as operators and operands of different functional types. CL allows us, on the one hand, to articulate different representation levels inside one and the same computational architecture during a process of change of levels and, on the other hand, to give, by means of a formal calculus, a synthesis of a lexical (or grammatical) operator from its meaning. 2. Semantic Analysis of Aspecto-Temporal Operators We present a semantic analysis of some aspectual and temporal operators. Grammatical units (aspects, tenses, moods …) are operators whose meanings are analysed with elementary semantic operators combined together with a combinator. An aspectual operator ‘ASP_I’ is applied onto a predicative relation ‘Λ’ (such as “Peter to enter the-room” or “Peter to be inside the room”), where ‘I’ is a topological interval of contiguous and ordered instants, this interval specifying the temporal area of realization of ‘Λ’. There are three basic aspectual operators: STATE_O, EVENT_F and PROC_J. If an aspectualized predicative relation ‘ASP_I (Λ)’ is viewed as a state ‘STATE_O (Λ)’, then the interval ‘O’ is open and ‘Λ’ is true at every instant of ‘O’ (example (1) Peter is inside the room is a descriptive state). If ‘ASP_I (Λ)’ is an event ‘EVENT_F (Λ)’ ((2) Peter entered the room), the interval ‘F’ is closed and ‘Λ’ is always true at the final bound of ‘F’ (the end of the complete event). If ‘ASP_I (Λ)’ is a process ‘PROC_J (Λ)’ ((3) Peter is entering the room), the interval ‘J’ is closed at its left bound (the beginning of the process) and open at its right bound, to signify that the process is incomplete. In order to speak, the speaker must locate ‘ASP_I (Λ)’ inside the temporal referential framework organized by himself; his speech act is an incomplete process expressed by “I-AM-SAYING (…)” = “PROC_J0 (I-SAY (…))”, where ‘J0’ is the interval of speaking, with its right bound open (the process of speaking is fundamentally incomplete). The temporal intervals ‘O’, ‘F’ and ‘J’ can be related to the interval ‘J0’. For the examples (1), (2) and (3), we obtain the respective temporal relations between the right bounds of the different intervals: (1′) [δ(O) = δ(J0)], (2′) [δ(F) < δ(J0)], (3′) [δ(J) = δ(J0)], where ‘δ’ and ‘γ’ are the operators that select, respectively, the right and the left bound of an interval. The combinators are used to express how the aspectual operators and temporal relations are combined together and synthesized into a unique grammatical operator expressed by a morphological operator. CL gives tools to analyze complex units into a combination of more elementary units. The computing of synthesis processes in a top-down strategy (or the analytic decomposition in a bottom-up strategy) of numerous aspectual and temporal operators has been realized with HASKELL. In the same way, the automatic analysis of some lexical predicates into a scheme where semantic primitives are combined in an applicative expression has been realized. We do not have space to show all the steps of the deductions for the different aspectual operators, which highlight the notions of process, event and state and related notions.
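The interval-theoretic content of the three basic operators can nonetheless be sketched. The following Python fragment is an illustrative reconstruction, not the authors' HASKELL code; all names and the sample intervals are assumptions. It encodes the topological constraints on O, F and J and checks relation (2′):

```python
from dataclasses import dataclass

@dataclass
class Interval:
    left: float
    right: float
    left_closed: bool
    right_closed: bool

def delta(i):   # selects the right bound of an interval (written δ above)
    return i.right

def gamma(i):   # selects the left bound (written γ above)
    return i.left

# The three basic aspectual operators, as constructors pairing a
# predicative relation Λ with a topologically constrained interval.
def STATE(relation, left, right):
    return relation, Interval(left, right, False, False)  # O: open

def EVENT(relation, left, right):
    return relation, Interval(left, right, True, True)    # F: closed

def PROC(relation, left, right):
    # J: closed on the left (onset), open on the right (incomplete)
    return relation, Interval(left, right, True, False)

# The speech act is an incomplete process over the speaking interval J0.
_, J0 = PROC("I-SAY(...)", 0.0, 10.0)
_, F  = EVENT("Peter to enter the-room", 2.0, 4.0)
print(delta(F) < delta(J0))   # relation (2'): the event precedes speech
```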
With the addition of semantic representations of the lexical predicates, it becomes possible to give the formal deduction from a given sentence to another (Desclés, 2005; Desclés and Ro, 2011): John took Mary’s pen → Mary doesn’t have the pen anymore. When a speaker of English understands the first sentence, he or she is able to infer the second sentence automatically. This inference becomes possible with grammatical knowledge (the meaning of tenses) and a representation of the meaning of the lexical predicate to take. Our research program shows how a machine can simulate this kind of inference realized by humans. For more details, see (Desclés, 1990; 2005) and (Desclés & Ro, 2011a; 2011b). References Curry, H.B. & Feys, R. (1958). Combinatory logic. Vol. I. Studies in logic and the foundations of mathematics. Amsterdam: North-Holland Publishing Co. Desclés, J.-P. (1990). State, event, process, and topology. General Linguistics, 29(3), 159-200. University Park and London: Pennsylvania State University Press. Desclés, J.-P. (2005). Reasoning and Aspectual-Temporal Calculus. In: D. Vanderveken (Ed.), Logic, Thought and Action (pp. 217-244). Springer. Desclés, J.-P. & Ro, H.-J. (2011a). Aspecto-Temporal Representation for Discourse Analysis: an Example of Formal Computation. The 24th Florida Artificial Intelligence Research Society Conference. Desclés, J.-P. & Ro, H.-J. (2011b). Opérateurs aspecto-temporels et Logique Combinatoire. To appear in Mathématiques et Sciences Humaines. Hindley, J.R. & Seldin, J.P. (1986). Introduction to Combinators and Lambda-Calculus. Cambridge University Press. MacLennan, B. J. (2003). Combinatory Logic for Autonomous Molecular Computation. www.cs.utk.edu/~mclennan Shaumyan, S.K. (1987). A Semiotic Theory of Natural Languages. Bloomington: Indiana University Press. THE PAST, PRESENT, AND FUTURE ENCOUNTERS BETWEEN COMPUTATION AND THE HUMANITIES STEFANO FRANCHI Department of Hispanic Studies, Texas A&M University stefano@tamu.edu Abstract. The paper addresses the conference theme from the broader perspective of the historical interactions between the Humanities and computational disciplines (or, more generally, the “sciences of the artificial”). These encounters have followed a similar, though symmetrically opposite, “takeover” paradigm. However, there is an alternative meeting mode, pioneered by the interactions between the studio and performance arts and digital technology. A brief discussion of the microsound approach to musical composition shows that these alternative encounters have been characterized by a willingness on both parts to let their basic issues, techniques, and concepts be redefined by the partner disciplines. I argue that this modality could (and perhaps should) be extended to other Humanities disciplines, including philosophy. 1. Takeovers The two best-known encounters between computational technologies and traditional Humanist pursuits are represented by the Artificial Intelligence/Cognitive Science movement and the roughly contemporary Digital Humanities approach (although the label became popular only recently). Classic Artificial Intelligence saw itself as “anti-philosophy” (Dupuy, 2000; Agre, 2005; Franchi, 2006): it was the discipline that could take over philosophy's traditional questions about rationality, the mind/body problem, creative thinking, perception, etcetera, and solve them with the help of a set of radically new synthetic, experiment-based techniques.
The true meaning of the "computational turn in philosophy" lies in its methodology, which allowed it to associate engineering techniques with age-old philosophical questions. This “imperialist” tendency of cognitive science (Dupuy, 2000) was present from the very beginning, even before the formalization of the field into well–defined theoretical approaches (McCulloch (1989[1948]); Simon, 1994). The Digital Humanities represent the reverse modality of the encounter just described. The most common approach (Kirschenbau, 2010) uses tools, techniques, and algorithms developed by computer scientists to address traditional questions about the meaning of texts, their - 219 - Proceedings IACAP 2011 accessibility and interpretation, and so on. Other approaches turn technology into the scholar's preferred object of study (Svensson, 2010). The recent approach pioneered by the “Philosophy of Information” (Floridi, 2011) follows this pattern. Its focus on the much broader category of “information” substantially increases the scope of its inquiries, while firmly keeping it within philosophy's standard reflective mode. The common feature of these two classic encounters between the Humanities and computational theory and technology is their onesidedness. In either case, one of the two partners took over some relevant aspects from the other participant and fit it within its own field of inquiry (mostly questions, in AI's case; mostly tools, for the Digital Humanities). The appropriation, however, did not alter the theoretical features of either camp. For instance, AI and Cognitive Science researchers maintained that philosophy pre-scientific methodology had only produced mere speculation that made those problems unsolvable. Therefore, philosophy's accumulated wealth of reflection about the mind, rationality, perception, memory, emotions, and so forth could not be used by the computational approach. In McCulloch's famous phrase, the “den of the metaphysician is strewn with the bones of researchers past.” In the Digital Humanities' case, the takeover happens at the level of tools. In most cases, however, this appropriation does not become an opportunity for a critical reflection on the role of the canon on liberal education, or for a reappraisal of the role of the text and the social, political, and moral roles it plays in society at large. 2. Digital practice Meetings between artists and computational technology show the possibility of a different paradigm. In many cases, making music, painting, producing installations, and writing with a computer changes the concepts artists work with, and, at the same time, forces computer sciences to change theirs as well. There are many examples in the rich history of “digital art,” broadly understood (OuLiPo, 1973; ALAMO, No year; Schaeffer, 1952). I will illustrate their general features with reference to a more recent project: the “microsound” approach to musical composition (Roads, 2004). “Microsounds” are sonic objects whose timescale lies between that of notes―the smallest traditional music objects, whose duration is measured in seconds or fractions thereof―and samples―the smallest bit, measured in microseconds (10-6). The manipulation of microsounds broadens substantially the composer's palette, but it is - 220 - The Computational Turn: Past, Presents, Futures? impossible without the help of technological devices of various kinds, from granular synthesis software to high-level mixing interfaces. 
Composers wishing to “sculpt” sounds at the microlevel face a double challenge that translates into a mutual collaboration between compositional and algorithmic techniques. On the one hand, they need to broaden the syntax an grammar of music's language to allow the manipulation and aesthetic assessment of previously unheard of objects (Vaggione, 2001). On the other hand, they need computer scientists and mathematicians to develop alternative analytic and synthetic models of sound (in addition to Fourier-transforms and similar methods) capable of capturing the features of sonic events lasting only a few milliseconds (Vaggione, 1996). This example of artistic production points to a pattern of cooperation between work in computational and non-computational disciplines that is deeply at odds with the AI/CogSci and DigHum patterns discussed above. Instead of a takeover, the artistic model produces a true encounter that changes both partners' technical and theoretical apparatus. 3. Posthuman encounters? Could the encounter model practiced by artists be generalized to the Humanities? We can see how this could be the case by considering a twofold question. On the one hand: are Humanities' traditional inquiries about human nature and human cultural production still relevant in a landscape in which some of the communicating agents may not be human, partially or entirely? Can they go on in the same way? And vice versa: are science and technology fully aware that the new digital artifacts they are shepherding into the world may change its landscape and transform worldly action at the pragmatic as well as at the theoretical level? Or are they still relying upon a pre-digital universe in which technological artifacts were always to be used as mere tools deployed by humans, an assumption that seems increasingly questionable? I think a particularly fruitful approach toward this question is provided by the kind of critical thought that has been developed―mostly, but certainly not exclusively―in Continental Europe over the last two or three decades. These theoretical efforts have based their explorations upon anti-humanist and/or post-humanist perspectives. They provide, therefore, a fruitful starting point for the investigation and interaction with instruments, tools, and techniques that question the very notion of the human. For instance, Lacanian and post-Lacanian psychoanalysis has articulated a view of the human that deploys cybernetic concepts to explain high level cognitive functions (Franchi, 2011; Chiesa, 2007); - 221 - Proceedings IACAP 2011 the work on biopolitics currently developed by largely Italian philosophers attempts to articulate a conception of human life that is continuous with animal and non-organic life (Agamben, 2003; Esposito, 2008; Tarizzo, 2010). At the same time, the disciplines of science and technology studies in their contemporary North American, French, and German developments have provided penetrating analyses of the bidirectional relationships between scientific theories and technological artifacts, on the one hand, and philosophical and cultural productions on the other (Ihde, 2002; Hayles, 1999; Latour and Woolgar, 1986; Biagioli, 1999). This suggestion does not pretend to exhaust the theoretical options we have at our disposal when reflecting upon the computational turn. My contention, however, is that artistic practices in all forms of “digital art” can serve as an inspiration to all of the Humanities disciplines. 
We can follow their path toward a new mode of digital encounter that does not fall into the well-worn path of hostile takeovers by either partner. References Agamben, G. (2003). The Open. Man and Animal. Stanford, Calif.: Stanford University Press. Agre, Ph. E. (2005). The Soul Gained and Lost: Artificial Intelligence as Philosophical Project. In: S. Franchi and G. Güzeldere (Eds.), Mechanical Bodies, Computational Minds (pp. 153-174). Cambridge: MIT Press. ALAMO (Atelier de Littérature Assistée par la Mathématique et les Ordinateurs). URL: http://alamo.mshparisnord.org/index.html Biagioli, M. (Ed.) (1999). The Science Studies Reader. New York: Routledge. Chiesa, L. (2007). Subjectivity and Otherness. A Philosophical Reading of Lacan. Cambridge: MIT Press. Dupuy, J.-P. (2000). The Mechanization of the Mind: On the Origins of Cognitive Science. Princeton: Princeton University Press. Esposito, R. (2008). Bios: Biopolitics and Philosophy. Minneapolis: University of Minnesota Press. Floridi, L. (2011). The Philosophy of Information. Oxford: Oxford University Press. Franchi, S. (2006). “Herbert Simon, Anti-Philosopher.” In: L. Magnani (Ed.), Computing and Philosophy (pp. 27-40). Pavia: Associated International Academic Publishers. ----- (2011). Jammed Machines and Contingently Fit Animals: Psychoanalysis’s Biological Paradox. French Literature Series, 38, in press. Hayles, N. K. (1999). How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics. Chicago: University of Chicago Press. Ihde, D. (2002). Bodies in Technology. Minneapolis: University of Minnesota Press. Kirschenbaum, M. G. (2010). What Is Digital Humanities and What’s It Doing in English Departments? ADE Bulletin, 150, 1–7. Latour, B. and Woolgar, S. (1986). Laboratory Life: the Construction of Scientific Facts. Princeton: Princeton University Press. McCulloch, W. S. (1989[1948]). Through the Den of the Metaphysician. In: Embodiments of Mind (pp. 142-156). Cambridge: MIT Press. OuLiPo (1973). La littérature potentielle. Paris: Gallimard. Roads, C. (2004). Microsound. Cambridge: MIT Press. Schaeffer, P. (1952). À la recherche d’une musique concrète. Paris: Seuil. Simon, H. (1994). Literary Criticism: a Cognitive Approach. In: S. Franchi and G. Güzeldere (Eds.), Bridging the Gap (pp. 1–26). Stanford Humanities Review, 4(1), Special Supplement. Svensson, P. (2010). The Landscape of Digital Humanities. Digital Humanities Quarterly, 4(1). Tarizzo, D. (2010). La vita, un’invenzione recente. Bari: Laterza. Vaggione, H. (1996). Articulating Microtime. Computer Music Journal, 20(2), 33–38. ----- (2001). Some Ontological Remarks about Music Composition Processes. Computer Music Journal. REFLECTIONS ON NEUROCOMPUTATIONAL RELIABILISM MARCELLO GUARINI Department of Philosophy, University of Windsor, 401 Sunset, Windsor, ON, Canada N9B 394 AND JOSHUA CHAUVIN AND JULIE GORMAN Students, Department of Philosophy, University of Windsor, 401 Sunset, Windsor, ON, Canada N9B 394 1. Introduction Reliabilism is a theory of knowledge that has traditionally focused on propositional knowledge. Paul Churchland has advocated a reconceptualization of reliabilism to “liberate it” from propositional attitudes (such as accepting that p, believing that p, knowing that p, and the like).
In the process, he (a) outlines an alternative for the notion of truth (which he calls “representational success”), (b) offers a non-standard account of theory, and (c) invokes the preceding ideas to provide an account of representation and knowledge that emphasizes our skill or capacity for navigating the world. Crucially, he defines reliabilism (and knowledge) in terms of representational success. This paper discusses these ideas and raises some concerns. Since Churchland takes a neurocomputational approach, we discuss our training of neural networks to classify images of faces. We use this work to suggest that the kind of reliability at work in some knowledge claims is not usefully understood in terms of the aforementioned notion of representational success. 2. Traditional Reliabilism: Truth and Propositional Attitudes Claims to propositional knowledge have the form S knows that p, where p is a proposition. For the reliabilist, among the necessary conditions for some agent or subject S to know p are that (a) p is true, (b) S believes p, and (c) p is the outcome of a reliable process or method. According to Alvin Goldman (1986, 1992, 1999, 2002), reliability is required for both epistemic justification and knowledge. This reliability is a ratio: the number of true beliefs delivered by a process or method divided by the number of true and false beliefs delivered by the same process or method. As we will concern ourselves primarily with the reliability requirement in this paper, we shall not engage the issue of what might constitute sufficient conditions for either knowledge or justification. 3. Neuro Reliabilism: Representational Success and Similarity Spaces Paul Churchland (2007) attempts to take a reliabilist approach to epistemology, divorce it from propositional attitudes, and explain how we can have non-propositional knowledge. Churchland begins by enumerating many instances of know-how. The examples include the capacity or skill knowledge possessed both by humans and non-humans. He argues that much of what we call knowledge has little or nothing to do with the fixing of propositional attitudes. He recognizes the importance of truth in classical approaches to reliabilism, but he resists talking of truth since (a) it attaches to propositional attitudes, and (b) much of our knowledge is not about fixing propositional attitudes. In place of truth, Churchland formulates a notion of representational success that is compatible with analyses of neural networks. To keep things simple, consider a three-layer feed-forward neural network. After training, each different pattern of activation across the hidden units is a different point in the network's hidden-unit activation space. We can then measure the distance between points (which Churchland often refers to as similarity relations). Churchland treats (somewhat metaphorically) similarity spaces as maps that guide our interactions with the world. Just as a map is representationally successful when the distance relations on the map preserve distance relations in the world, conceptual spaces understood as similarity spaces are representationally successful when they preserve similarity or distance relations between points in state space and the world. 4. How Representational Success and Reliability can Come Apart We will present the results of two neural networks (N1 and N2) trained to classify images of faces as either male or female.
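Before the concrete networks are described, the comparison at issue can be sketched abstractly. The following Python fragment is a toy illustration with synthetic activations, not the authors' networks or face data: it shows how two hidden-layer "similarity spaces" over the same items can be compared via their distance matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden-layer activations of two networks on the
# same 20 test items: same inputs, different learned mappings.
X = rng.normal(size=(20, 10))     # 20 items, 10 input features
W1 = rng.normal(size=(10, 8))     # first network's input-to-hidden weights
W2 = rng.normal(size=(10, 8))     # second network's weights (different history)
H1 = np.tanh(X @ W1)              # points in the first similarity space
H2 = np.tanh(X @ W2)              # points in the second similarity space

def distance_matrix(H):
    """Pairwise distances between hidden-layer points."""
    return np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)

D1, D2 = distance_matrix(H1), distance_matrix(H2)
iu = np.triu_indices(20, k=1)
r = np.corrcoef(D1[iu], D2[iu])[0, 1]
print(f"correlation of the two similarity structures: {r:.2f}")
# Two networks can agree on every classification (be equally reliable)
# while the distance relations of their hidden spaces -- the locus of
# 'representational success' -- come apart.
```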
N1 was trained on the set of images A; it was tested on images it had not previously seen, set B. N2 was trained on B; it was tested on A. Both networks achieved equal levels of success on the images. In spite of the preceding, we will show that N1 and N2 set up different similarity spaces. This is a problem for Churchland’s position since he defines reliability in terms of representational success, and this latter notion is defined in terms of structure-preserving mappings between points in similarity space and features of the world. It seems quite natural to say that N1 and N2 are equally reliable, but because they set up different similarity spaces, we will argue that it is not clear how they could be equally representationally successful, given the work Churchland expects representational success to do. There is a difference between (a) being reliable and (b) explaining the source of that reliability. We will show that we can understand what it is for a system (a face-classifying neural network) to be reliable independent of understanding the source of that reliability. Churchland uses the notion of representational success (or preservation of distance relations) both to define reliability and to understand its source (i.e. to do both (a) and (b)). This is a source of potential problems for his position.

5. Conclusion

In spite of the problems, we recognize there are some attractions to the sort of position Churchland is putting forward. While we do not think it has the range of applicability Churchland suggests, we do not take ourselves to have argued that representational success is a useless notion. We will close with some constraints that need to be satisfied for the notion to be a useful one.

Acknowledgements

We thank the Shared Hierarchical Academic Research Computing Network (SHARCNet) for financial support.

References

Churchland, P. M. (2007). Neurophilosophy at Work. Cambridge, UK: Cambridge University Press.
Goldman, A. (1986). Epistemology and Cognition. Cambridge, MA: Harvard University Press.
Goldman, A. (1992). Liaisons: Philosophy Meets the Cognitive and Social Sciences. Cambridge, MA: MIT Press.
Goldman, A. (1999). Knowledge in a Social World. Oxford: Oxford University Press.
Goldman, A. (2002). Pathways to Knowledge, Private and Public. Oxford: Oxford University Press.

STATES OF AFFAIRS AND INFORMATION OBJECTS

STEVE T. MCKINLAY
Charles Sturt University
Wellington Institute of Technology, School of Information Technology
Private Bag 39803, Petone, Wellington, NEW ZEALAND
e-mail: steve.mckinlay@weltec.ac.nz

Abstract. This paper compares two recently detailed metaphysical accounts of reality. On the one hand we have Luciano Floridi’s “information realism” and, on the other, David Armstrong’s view that the general structure of reality can be described as “states of affairs”. Floridi postulates the information object as the entity central to information ethics and his informational realism. In developing the concept he draws heavily upon object-oriented (OO) programming theory. Informational objects are reckoned by Floridi to be, in a sense, ontologically primitive and as such naturally occurring, mind-independent structures dynamically interacting with one another. Floridi employs OO-like terminology such as “properties” and “relations” in order to clarify his concept of the informational entity.
Armstrong on the other hand postulates that the world, all that there is, is a world of states of affairs. A state of affairs according to Armstrong consists of a particular which has a property, or alternatively a relation which holds between two or more particulars. Each state of affairs, as well as constituent higher- or lower-order states of affairs, is a contingent existent. Furthermore, the properties and relations attached to states of affairs are universals. These two theories, whilst exhibiting marked resemblances, also reveal fundamental philosophical differences; yet both attempt to present a unified metaphysical schema, an ontology. Of great interest is the fact that here we have two strong competing theories. The situation calls for critical comparison. Such a comparison is the primary aim of this paper.

The idea of the Information Object as being somehow ontologically fundamental has gained traction recently, not only in computer programming circles but also philosophically. We could attribute this newfound popularity, particularly with regard to philosophical interpretations, to the fact that we live in the so-called information age. We, at least in the developed world, view the world through information-coloured spectacles these days. Adding some substance to this claim is the fact that our information systems are designed and developed using fashionable object-oriented (OO) methodologies. Information modeling is now the accepted process by which facts or propositions, the sentences that demarcate the various states of affairs and “things” in which the modeller is interested, are defined via “object class” structures. Such structures in turn represent various properties, behavior and relata. The information object in this sense is an intuitively fitting and elegant way of representing the problems we attempt to solve via computational means. OO design and development is “instrumentally reliable” – it works. The majority of modern implemented information technologies across the entire gamut of industries and applications typically employ object-oriented approaches. The focus has shifted from procedural algorithmic processing to an object-driven methodology, and as such states of affairs and “things” are abstractly modelled as self-contained (encapsulated) object structures, responsible for their own identity, relations, properties, states and behavioural rules. It’s perhaps not surprising then that we might ponder: could the universe be interpreted and/or represented in such a way? From a wider perspective, what is often termed the computational turn has given rise to the informational object concept, central to and emerging as fundamental in an informational ontology developed primarily by Luciano Floridi (2002, 2004, 2008). The concept is important for Floridi since the information object plays a role central to his Information Ethics (IE) and Informational Realism (IR). But more than this, the idea of the “information entity” seems to offer new ways of understanding epistemology, semantics, scientific explanation, and ethics. Floridi has developed a detailed picture of the information object (or entity, as he sometimes calls it), employing Object Oriented programming and design methods and theories to clarify the concept.
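Since the argument trades on the OO notion of an object, a schematic illustration may help. This is only the programming picture Floridi borrows, sketched by us in Python, not his formalism: a class fixes the repeatable property and relation types (the universal-like part), while each instance is a distinct particular.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # eq=False: instances compare by identity, like particulars
class InformationObject:
    """An 'object class' in the OO modelling sense: the class fixes the
    repeatable property and relation types; instances are particulars."""
    properties: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (relation_name, other) pairs

    def relate(self, relation_name, other):
        self.relations.append((relation_name, other))

a = InformationObject({"colour": "red"})
b = InformationObject({"colour": "red"})   # the very same property bundle...
a.relate("next_to", b)
print(a.properties == b.properties)        # True: shared, universal-like content
print(a is b)                              # False: ...borne by two distinct particulars
```

The two print statements make the bundle-theoretic point below concrete: instances can share all of their property and relation types and still be numerically distinct bearers.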
Whilst Luciano Floridi’s notion of the information object is somewhat analogous to the OO conception of an object, in a recent paper I argued, for a variety of reasons, that information objects, certainly within the context of Floridi’s informational realism, don’t seem to be much like OO objects, certainly not the kind employed in an OO class model or an OO program. Arguably the most significant difference is that OO objects act unequivocally as referents to facts, as Wittgenstein (1961) would have put it, or what Armstrong (1998) calls states of affairs. I think there is certainly a similarity between OO objects and Floridi’s conception of the information object, but I suspect the similarity is more harmful to the idea of the information object holding any independent ontological status or existing independently as a particular category. The similarity is that both object concepts are largely conceptual by nature. Yet Floridi seems to want to confer a stronger ontological status on the information entity. Problems arise if the information object is indeed conceptual. Following Laudan (1977, p. 48), such entities can have no existence independent of the theories within which they are postulated. Nevertheless, the concept of an information entity is certainly a convenient and relatively intuitive way of bundling up constituent properties and relations belonging to the particular in question. Those properties and relations are in fact what philosophy sometimes calls universals, and it is each particular (distinct information objects) that instantiates those universals. The universals themselves are the constituents of information objects shared across many objects. There are some who deny the existence of universals (nominalism), and we shall consider this in the full paper.

Armstrong (1998, p. 95) questions the need to recognise an independent category of particulars. He argues that whilst properties and relations can be known, “the bearer of properties and relations, it is alleged, cannot be known. Why then postulate a bearer?” The postulation of bearers, Armstrong argues, appears to lack ontological and epistemic economy (ibid.). This raises the question: is the Floridian information object the same kind of thing Armstrong terms a bearer? From the OO perspective, a particular information object (or class, although the two concepts differ slightly and this will be explained) is admittedly representative of a fact, state of affairs or physical object; this renders the OO object second-order relative to the actual fact or state of affairs. Furthermore, I take it that it is meant to be information objects all the way down. But we already see this isn’t the case. Information objects are essentially bundles of properties and relations; whilst no information object can be strictly identical with another, the properties and relations can be, and are, identical across multiple instantiations of similar objects. Whilst they do not exist outside their instantiations, it would seem properties and relations hold a more fundamental ontological position than the information entity. Thus to uphold the ontological reality of “information objects” or, in Armstrong’s case, “states of affairs” seems to entail the admission of properties and relations, yet there would certainly be some philosophers who would deny that the reverse holds.
There seems to be little controversy in the admission of properties and relations, since a denial results in the denier having to come up with an alternative theory of classes. It is individual objects or states of affairs exhibiting more or less identical properties and relations that we bundle into classes. This paper compares Armstrong’s descriptions of properties and relations with those affiliated to Floridi’s information object concept. Further, we will consider how similar (or different) the information object concept is to Armstrong’s conception of the state of affairs.

References

Armstrong, D. M. (1989). Universals: An Opinionated Introduction. Westview Press. (Focus Series)
Armstrong, D. M. (1998). A World of States of Affairs. Cambridge: Cambridge University Press.
Floridi, L. (2002). On the Intrinsic Value of Information Objects and the Infosphere. Ethics and Information Technology, 3(4), 287-304.
Floridi, L. (2004). Informational Realism. In G. M. Greco, IEG Research Report. Oxford: Information Ethics Group.
Floridi, L. (2008). A Defence of Informational Structural Realism. Synthese, 161(2), 219-253.
Laudan, L. (1977). Progress and Its Problems: Towards a Theory of Scientific Growth. California: University of California Press.
Wittgenstein, L. (1961). Tractatus Logico-Philosophicus. London and New York: Routledge.

SCIENTIFIC EXPLANATION AND INFORMATION

STEVE T. MCKINLAY
Charles Sturt University
Wellington Institute of Technology, School of Information Technology
Private Bag 39803, Petone, Wellington, NEW ZEALAND
e-mail: steve.mckinlay@weltec.ac.nz

Abstract. Scientific explanation and, more recently, information have attracted considerable philosophical attention. Little consideration, however, has been given to making sense of the concept of information used within debates surrounding explanation. Some may deem there is no problem to be solved here. Yet we observe within the literature on scientific explanation strict examinations of profound philosophical concepts. Writers are at pains to explain causal, epistemic, ontological and nomological accounts of explanation, all of which in some way rely upon and take for granted the role of information. We like to think these days that we have at least the beginnings of a coherent theory of information. This paper cherry-picks a couple of interesting ideas within scientific explanation and attempts to reconcile the generally received view of information with those particular explanatory accounts. By the received view I mean the General Definition of Information, mostly attributed to Luciano Floridi from around 2003 onwards. As a result of this investigation some profound questions arise: is an “ideal explanatory text” (see Railton, 1981) essentially an informational concept? Can we make sense of a relationship between causation and information? Just how are the concepts related, and do we need a satisfactory account? And finally, is it possible to propose a purely information-centric theory of scientific explanation and, if so, could it be a significant improvement on current theories of scientific explanation?

Everything that exists makes a difference to the causal powers of something. (David Armstrong, 1997, p. 41)

Introduction

Wesley Salmon in Causality and Explanation suggests that to most people, the fact that there is a close connection between causality and explanation comes as no surprise (1998, p. 3). And while distinctions can
certainly be made between the two concepts, there are many convergences. Salmon argues, “In many cases to explain something is to state its cause.” (ibid). I happen to think a similar story can be told with regard to information and explanation. To have something explained is, at least from an ordinary language point of view, to be informed. There is a certain structure about scientific explanation, the various relationships between laws and theories, and information seems to be the flesh on these bones. It follows that the concept of information might benefit from an investigation into the connections or relations that exist between it, causal concepts and explanation, and it is this particular can of worms that this paper intends to open.

Information, Causality and Explanation

The body of philosophical literature on scientific explanation is substantial, beginning [16] with the deductive-nomological (D-N) model (Hempel & Oppenheim, 1948; Hempel, 1965), wherein scientific explanations were considered deductive arguments [17]. Salmon (1971) followed with the statistical relevance (S-R) model in order to deal with explanations of low-probability events not adequately dealt with by Hempel’s explanatory models. Later, Railton (1978, 1981) proposed a deductive-nomological-probabilistic (D-N-P) model in a further attempt to explain events that happen by chance. More recently, Wesley Salmon proposed a causal theory of explanation. Salmon’s principal claim was that a scientific explanation is constituted by a state of affairs predominantly recognised as a pattern in the world, where that pattern consists of at least one causal process. Causal processes, Salmon argued (also Railton, 1981, and later Dowe, 2000), necessarily transmit information (1998, p. 16). Salmon explains this as the ability of a causal process to transmit a mark. Causal processes are described by Salmon as being continuous (in a physically spatiotemporal way). This view contrasts with the popular view of causality as a “relation” between particular events (the cause and the effect). Salmon’s theory is perhaps most eloquently clarified in his At-At Theory of Causal Influence (1977, reprinted in Salmon, 1998). The At-At theory, Salmon claims, not only resolves Zeno’s arrow paradoxes but also proposes a foundation for a concept of propagation of causal influence.

[16] Although the roots of scientific explanation and understanding can of course be traced back well beyond Aristotle, recent philosophical history regarding scientific explanation is generally considered to begin with Hempel and Oppenheim’s groundbreaking paper Studies in the Logic of Explanation.
[17] The degree of informativeness of a logically deductive schema is perhaps controversial; however, given that scientific explanation has moved on considerably from the Hempelian D-N approach, we can safely leave this controversy to one side.

Information plays a significant yet largely unexplained role in virtually all of the models of explanation, particularly Salmon’s At-At causal theory. The usual constraints prevent this paper from adequately summarising in full the development of scientific explanation from Hempel’s D-N model through recent attempts at a unified model of explanation, and so I intend to choose two particular junctures in the history of scientific explanation in the hope of casting some light upon the controversial three-way axis between information, causality and explanation.
As is often the case in philosophy, the following investigation is most likely to end in more, though hopefully new and interesting, questions regarding the nature of information. Thus, my two starting points with their associated problems are as follows:

1. Peter Railton makes a distinction between what he terms the “ideal explanatory text” and “explanatory information” (1981, p. 240). Railton openly admits in his 1981 paper that whilst it is typical to speak of sentences or texts conveying information, he knows of “no satisfactory account of this familiar and highly general notion” (1981, p. 240). Further, he admits that neither does the notion of information defined by Wiener and Shannon appear to fit his explanatory theory. Given that Railton’s work continues to influence attempts at theories of explanation, in particular Kitcher’s (1989) unificationist account, an enquiry into Railton’s “explanatory information” seems overdue.

2. Wesley Salmon’s development of Reichenbach’s “mark method” in his At-At Theory of Causal Influence makes thoughtful claims about information transmission as a result of causal processes. Salmon makes a clear distinction between causal processes and pseudo-processes, the latter of which, he claims, have no ability to transmit information. I will evaluate Salmon’s claims with examples and examine how Salmon’s concept of information transmission squares with our current views about information.

This investigation, I think, raises profound questions: is Railton’s concept of the ideal explanatory text essentially an informational concept? On the other hand, can we make sense of a relationship between causation and information? Just how are these concepts related, and do we need a satisfactory account? Finally, can we propose an informationally centred theory of scientific explanation? Rather than attempt to conclusively answer these questions in this paper, I hope to build an argument around the fact that the topic is one worthy of serious consideration.

References

Dowe, P. (2000). Physical Causation. Cambridge: Cambridge University Press.
Floridi, L. (2003). From Data to Semantic Information. Entropy, 5, 125-145.
Hempel, C. (1965). Aspects of Scientific Explanation and Other Essays in the Philosophy of Science. New York: Free Press.
Hempel, C. & Oppenheim, P. (1948). Studies in the Logic of Explanation. Philosophy of Science, 15, 135-175.
Kitcher, P. (1989). Explanatory Unification and the Causal Structure of the World. In Scientific Explanation (pp. 410-505). Minneapolis: University of Minnesota Press.
Railton, P. (1978). A Deductive-Nomological Model of Probabilistic Explanation. Philosophy of Science, 45, 206-226.
Railton, P. (1981). Probability, Explanation, and Information. Synthese, 48(2), 233.
Salmon, W. (1998). Causality and Explanation. Oxford: Oxford University Press.

BIOLOGICALLY INSPIRED SINGLE-CHIP MASSIVELY PARALLEL SELF-HEALING, SELF-REGULATING, TERA-DEVICE COMPUTERS
Philosophical Implications of the Efforts for Solving Technological Show-Stoppers in the Path of the Next Computational Turn

MICHAEL NICOLAIDIS
TIMA Laboratory (CNRS, Grenoble INP, UJF)

Abstract. Biologically inspired computing usually addresses computing functionalities inspired by biological systems (genetic algorithms, neural networks, cellular automata, artificial life, ...).
However, living organisms also efficiently resolve some other problems that have to be addressed in order to accomplish the next computational turn: achieving the robustness (in reliability and power dissipation) that enables useful computation by means of ultimate CMOS (to be reached by the beginning of the next decade) and post-CMOS technologies. Thus, biologically inspired robust computing can be viewed as an emerging topic of biologically inspired computing.

Complex organisms have the remarkable property of self-healing. Two fundamental features underlie this ability. Organisms are constituted of large numbers of basic units (cells). Cells surrounding injured parts can substitute for the dead cells and regenerate the damaged structures. Also, the cells themselves can recover from various kinds of damage, for instance by repairing their DNA. Furthermore, living organisms regulate their physiological parameters in response to changing external conditions and their own needs (e.g. the regulation of insulin levels in response to sugar levels). As another remarkable property, the autonomic nervous system of higher animals controls important bodily functions (e.g. respiration, heart rate, and blood pressure) without conscious intervention. Building computers having similar properties, and achieving the robustness they confer, is an old dream of computer scientists. But so far, related research has not led to a practical self-healing, self-regulating, autonomic computing paradigm.

Ultimate CMOS and post-CMOS promises and challenges

We argue that today there are several converging factors which pave the way towards a new computing paradigm realizing this old dream. These factors are threefold. Two of them are related to technology scaling.
- Ultimate-CMOS and post-CMOS technologies promise to integrate trillions of devices in a single chip. Thus, single-chip massively parallel architectures become mandatory for utilizing the huge numbers of devices integrated in such chips.
- At the same time, aggressive technology scaling dramatically increases process, voltage and temperature (PVT) variations; sensitivity to electromagnetic interference (EMI) and to atmospheric radiation (neutrons, protons); and circuit aging; and also imposes stringent power-dissipation constraints. The resulting high defect levels, heterogeneous behavior of identical processing nodes, circuit degradation over time, and extreme complexity adversely affect fabrication yield and also prevent the fabrication of reliable chips in ultimate CMOS and post-CMOS technologies. These issues are the main show-stoppers in the path towards these technologies that pave the way for the next computational turn.

The above two factors plead for a self-healing massively parallel computing paradigm. But this is not a trivial task. Coping with failures (a property also known as fault tolerance) induces high area and power penalties. The former will drastically reduce the available computing resources, while the latter is incompatible with low-power operation (one of the tightest constraints in ultimate CMOS). Furthermore, conventional fault-tolerant approaches (DMR, TMR, etc.) consider that failures affect a single component among several redundant ones. This assumption is no longer valid at the extreme integration levels of ultimate CMOS, where transistors are so small that they comprise only a few atoms, nor under the even higher integration levels promised by post-CMOS.
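The point about conventional redundancy can be made concrete with a minimal sketch of triple modular redundancy (our illustration in Python, not a hardware description): a majority vote masks one faulty replica, and the guarantee evaporates exactly when the single-failure assumption does.

```python
from collections import Counter

def tmr(replica_outputs):
    """Return the majority value among three replica outputs.
    Masks one faulty replica; gives no guarantee if two or more fail."""
    value, votes = Counter(replica_outputs).most_common(1)[0]
    return value if votes >= 2 else None  # no majority: uncorrectable

print(tmr([42, 42, 7]))   # 42   -- a single fault is masked
print(tmr([42, 13, 7]))   # None -- two simultaneous faults defeat TMR
```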
In these technologies we may face the following challenges:
- All processing nodes and routers in a massively parallel tera-device processor are affected by timing or transient faults.
- Hard faults may affect some parts of each node.
- Hard faults completely destroying a node arrive every few days.
- Circuit degradation is continuous and requires continuous self-regulation of circuit parameters (clock frequency, voltage levels, body bias) to keep the circuit operational.

Biologically-inspired enabling approaches

The Cells framework (On-Chip Self-Healing Tera-Device Processors) discussed in this paper brings in the third factor: a drastically new system-design paradigm achieving high yield and highly reliable, uninterrupted operation for highly defective on-chip massively parallel tera-device processors at low hardware cost. Power reduction and enhanced performance are also achieved through self-regulation of circuit parameters (voltage, clock frequency and body bias). Groundbreaking innovations were introduced at all levels of the framework, including its overall architecture, its particular components, and the way the cooperation of these components is architected to optimize the outcome. They enable continuous adaptation to circuit degradation, heterogeneity and changing application context, as well as detection of, and restoration of correct operation after, all failures induced by high defect densities, PVT variations, internal and external disturbances, and circuit degradation over time. The result is a holistic self-healing, self-regulating approach allowing:
- Making usable tera-device technologies affected by high defect densities, severe variability, increasing sensitivity to disturbances and accelerated aging.
- Implementing single-chip massively parallel self-healing tera-device computers delivering unprecedented computing power, which enables changing our computing paradigms and should have a profound impact on all computer application domains (including embedded systems, telecommunication networks, internet infrastructure and utilization, cloud computing, ...), as well as on science and technology and society as a whole.

In Cells, self-healing is achieved by two means. Single-chip massively parallel processors resemble living organisms in that they are constituted of large numbers of basic units (processor cores, routers and links). Cells takes advantage of this similarity. Like cells in living organisms, operational units replace unrecoverable faulty units to restore system functionality transparently to the ongoing application executions. Also, like cells in living organisms, processor cores, routers and links are able to recover from several kinds of failures by using new innovations in circuit-level fault tolerance (Anghel and Nicolaidis, 2000; Nicolaidis, 2005; Anghel and Nicolaidis, 2008; Nicolaidis, 2011; Yu, Nicolaidis, Anghel and Zergainoh, 2011) and self-regulation. Furthermore, similarly to the non-deterministic, local and opportunistic manner in which cells in an organism achieve self-healing and self-regulation, Cells uses new non-deterministic routing, task allocation and scheduling algorithms, which make local decisions in an opportunistic manner (Chaix, Avresky, Zergainoh and Nicolaidis, 2010, 2011).
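To convey the flavour of such local, opportunistic decisions, here is a toy of our own (not the Cells algorithms themselves): each hop picks at random among the currently healthy neighbours that reduce the distance to the destination, so routes adapt to failures without any global routing table.

```python
import random

def next_hop(node, dest, neighbours, healthy, dist):
    """Pick a random healthy neighbour of `node` strictly closer to `dest`;
    return None if blocked (the caller retries or reroutes later).
    `dist` is any locally computable distance estimate."""
    candidates = [n for n in neighbours[node]
                  if n in healthy and dist(n, dest) < dist(node, dest)]
    return random.choice(candidates) if candidates else None

# Example on a 3x3 mesh with Manhattan distance; node (1, 1) has failed.
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
nodes = [(x, y) for x in range(3) for y in range(3)]
neighbours = {n: [m for m in nodes if manhattan(n, m) == 1] for n in nodes}
healthy = set(nodes) - {(1, 1)}
print(next_hop((0, 1), (2, 2), neighbours, healthy, manhattan))  # (0, 2): routes around the fault
```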
The Cells routing, task allocation and scheduling algorithms address the complexity problem of navigating in a complex and changing network (thousands of processors and routers, millions of possible communication paths, continuous circuit degradation, frequent occurrence of catastrophic node and router failures, and unpredictable router congestion). Conventional deterministic algorithms, used in today's massively parallel multi-chip systems (which exhibit low defectivity and high circuit stability), rely on static routing tables containing pre-established routes, and on static scheduling and allocation algorithms which assume fixed clock frequencies; rarely failing links, routers and processor nodes; and similar power dissipation for all nodes. Such algorithms, also used in early proposals for designing massively parallel processor chips (Zajac, Collet and Napieralski, 2008), are ineffective in highly defective and fast-degrading hardware. Together with the highly innovative circuit-level fault tolerance, routing, and task allocation and scheduling, automatic monitoring, control, and self-regulation of circuit parameters ensure optimal operation: meeting performance requirements while minimizing power under circuit degradation and an evolving application context. The result is a computing paradigm that achieves robustness in a manner that resembles biological systems in multiple respects. This trend will necessarily be reinforced as post-CMOS technologies enable ever higher integration complexities.

References

Anghel, L. & Nicolaidis, M. (2000). Cost Reduction and Evaluation of a Temporary Faults Detecting Technique. In Proceedings, Design Automation and Test in Europe Conference, March 2000, Paris (Best Paper Award).
Anghel, L. & Nicolaidis, M. (2008). Cost Reduction and Evaluation of a Temporary Faults Detecting Technique. In R. Lauwereins & J. Madsen (Eds.), The Most Influential Papers of 10 Years DATE. Springer. ISBN: 978-1-4020-6487-6.
Chaix, F., Avresky, D., Zergainoh, N. E. & Nicolaidis, M. (2010). Fault-Tolerant Deadlock-Free Adaptive Routing for Any Set of Link and Node Failures in Multi-Cores Systems. In Proc. 9th IEEE International Symposium on Network Computing and Applications (NCA10), July 15-17, 2010, Cambridge, MA.
Chaix, F., Avresky, D., Zergainoh, N. E. & Nicolaidis, M. (2011). A Fault-Tolerant Deadlock-Free Adaptive Routing for On-Chip Interconnects. In Proc. Design Automation and Test in Europe Conference, March 14-18, 2011, Grenoble, France.
Nicolaidis, M. (2005). Design for Soft-Error Mitigation. IEEE Transactions on Device and Materials Reliability, Vol. 5, Issue 3, pp. 405-418, September 2005.
Nicolaidis, M. (2011). Circuit-level Soft-Error Mitigation. In M. Nicolaidis (Ed.), Soft Errors in Modern Electronic Systems. Springer, 2011.
Yu, H., Nicolaidis, M., Anghel, L. & Zergainoh, N. E. (2011). Efficient Fault Detection Architecture Design of Latch-based Low Power DSP/MCU Processor. In Proceedings, 16th IEEE European Test Symposium, May 23-27, 2011, Trondheim, Norway.
Zajac, P., Collet, J. H. & Napieralski, A. (2008). Self-Configuration and Reachability Metrics in Massively Defective Multiport Chips. In Proc. 14th IEEE International On-Line Testing Symposium, July 2008.
STRUCTURAL CONSTRAINTS FOR THE CONSTRUCTION OF MULTI-STRUCTURED DOCUMENTS

PIERRE-ÉDOUARD PORTIER
Université de Lyon, CNRS – INSA de Lyon – LIRIS UMR 5205, F-69621 France

AND

Sylvie Calabretto
Université de Lyon, CNRS – INSA de Lyon – LIRIS UMR 5205, F-69621 France

Abstract. While the computer-mediated interactions for the weaving of relations between fragments of a documentary archive are occurring, structures appear and vocabularies emerge. Can programs be designed to help this effervescent creation not to diverge too quickly? One common solution is to rely on a priori well-defined and closed vocabularies (the so-called ontologies) from which the names used to describe (annotate) and connect fragments are to be chosen. What can be done if such vocabularies aren’t available? In other words: can a system be designed to allow the dynamic construction of vocabularies? We now propose a first version of such a system.

1. Introduction

We study the process of the construction of documents. We observe the emergence of documentary structures. This emergence relies on the creation of dimensions as sets of relations. We aim at providing computational mechanisms to assist the construction of dimensions. First of all, we introduce the notion of a non-trivial machine. By using a notion of computation seen as ordering, and by adopting a pragmatic point of view on the notion of meaning, we can redefine the objective as: programming mechanisms that could ease the circulation of information for the non-trivial machine.

2. Meaning and computation

J. V. Uexküll (1956), a father of ethology, developed a theory of meaning in order to explain in a unified way what he observed on many occasions in different kinds of animals: the same object placed in different environments can take on a different meaning. Thus he deduced that the qualities of an object are only perceptive attributes given by the subject with which they have a connection. Furthermore, when G. Bateson (1972) wonders what it would mean for a computer to “think”, he comes to the conclusion that: “What ‘thinks’ and engages in ‘trial and error’ is the man plus the computer plus the environment. And the lines between man, computer and environment are purely artificial, fictitious lines. They are lines across the pathways along which information or difference is transmitted.” (p. 491). Bateson tried to get rid of the subject/object dichotomy by considering systems described as networks of differences. This links directly to a pragmatic view of meaning taken as an effect of the dynamic creation of relations. In (Saulnier and Longo, 2007), the idea of “conceptual frameworks” is introduced: meaning is to be found in the movements from one framework (or level of meaning) to another. Peirce’s concept of an interpretant is not far off: “A sign […] creates in the mind of that person an equivalent sign, or perhaps a more developed sign. That sign which it creates I call the interpretant of the first sign.” (Peirce, 1897, §228). And the meaning would be this dynamic process of building an interpretant... Finally, H. Von Foerster (2003) proposes a definition of computation as ordering. Ordering can be (i) a description of a given arrangement, or (ii) a re-arrangement of a (i). Moreover, he defines a non-trivial machine (Turing-like) as a machine for which the outputs depend on both the inputs and the state of the machine.
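Von Foerster's distinction can be miniaturized in a few lines (our toy, not his formalism): a trivial machine is a fixed input-output mapping, whereas a non-trivial machine's output depends on the input and on a state that the input also updates.

```python
def trivial(x):
    return x.upper()            # same input, always the same output

class NonTrivial:
    def __init__(self):
        self.state = 0
    def step(self, x):
        out = (x, self.state)   # output depends on input and current state
        self.state += len(x)    # ...and the input changes the state
        return out

m = NonTrivial()
print(m.step("order"), m.step("order"))  # same input, different outputs
```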
On this view, the frontiers of the considered non-trivial machine will include a computer and a user in an environment. This machine is in a dynamic state of producing orderings. “Meaning” refers directly to this production. Indeed, the machine is powered by some desire (for example, the desire to explain a phenomenon), and the more the production of orderings fulfills the desire, the more meaningful the process is. Our task is then to program some mechanisms that could ease the functioning of such a machine.

3. Construction of dimensions

3.1. TREE CONSTRAINT

In the context of document engineering, what is commonly called “the problem of multi-structured documents” is the fact that elements of structures can overlap. Indeed, the most widely used formalisms for document representation (first SGML, then XML) imply tree structures. All of the models proposed to overcome this difficulty are centered on this tree/graph dichotomy. However, in each local occurrence of two overlapping terms, the terms tend to belong to different dimensions or levels of meaning. Thus, in the context of our multi-structured documents platform (Portier and Calabretto, 2010), each time an overlapping situation occurs with terms belonging to the same dimension, we offer the users the possibility to restructure the dimensions (see Figure 1).

Figure 1. Formalization of the user's knowledge when two terms of the same dimension overlap

3.2. ACYCLISM CONSTRAINT

Apart from the annotation of text intervals, relations are interwoven between heterogeneous fragments. An essential part of the research on hyperstructures has created a notion of dimension. The zzstructure of T. Nelson (2004) for dimensional hypertexts is certainly one of the most relevant examples. The abstract function of a dimension is to group similar ways of weaving relations between fragments. Indeed, a naïve graph-based representation doesn't offer appropriate synoptic views (see Figure 2). Thus, the dimensions provide clusters of relations that can compensate for this lack of synthesis by offering new kinds of representations (see Figure 3).

Figure 2. Illustration of a graph-oriented interface for the creation and the visualization of relations

Figure 3. Illustration of a dimension-based interface

In order to help the users in the process of creating dimensions, we are looking for a structural constraint whose violation is often meaningful and quite easy to dynamically detect. The acyclism constraint seems to be well adapted. Take for example the situation of Figure 4, where a user has successively created two associations, but when he adds a third relation, a cycle appears.

Figure 4. After the free creation of some relations, a cycle appears within the “d” dimension

The user is advised to restructure the dimensions so as to remove the cycle (see for example Figure 5).

Figure 5. Formalization of the users' structural knowledge after the automatic detection of a cycle within a dimension

4. Conclusion

This work is a first step towards a different point of view on computation, seen as the construction of orderings by a non-trivial machine driven by a desire to explain some phenomenon. In such a configuration, new kinds of programs have to be developed in order to dynamically react to the user's actions by, for example, computing the appropriate times for helping the users to formalize their structural knowledge.
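One such mechanism is the acyclism check of Section 3.2, which can be sketched in a few lines (a minimal illustration of ours, not the actual DINAH implementation): before a relation is committed to a dimension, a depth-first search tests whether the target of the new relation already reaches its source.

```python
def creates_cycle(edges, new_edge):
    """`edges` is a list of (source, target) pairs within one dimension.
    Adding `new_edge` creates a cycle iff its target already reaches
    its source through the existing relations."""
    src, dst = new_edge
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True          # dst reaches src: the new edge closes a cycle
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency.get(node, []))
    return False

d = [("a", "b"), ("b", "c")]
print(creates_cycle(d, ("c", "a")))  # True: prompt the user to restructure
```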
Acknowledgements

We would like to thank the team of researchers from the Jean-Toussaint Desanti Institute for their collaboration during the development of this work.

References

Bateson, G. (1972). Steps to an Ecology of Mind. The University of Chicago Press.
Nelson, T. H. (2004). A cosmology for a different computer universe: data model, mechanisms, virtual machine and visualization infrastructure. Journal of Digital Information, 5(1).
Peirce, C. S. (1897). Collected Papers of Charles Sanders Peirce, Vol. 2. Cambridge: Harvard University Press.
Portier, P.-E. & Calabretto, S. (2010). DINAH, a philological platform for the construction of multi-structured documents. In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries, Glasgow, UK, pp. 364-375.
Saulnier, B. & Longo, G. (2007). Le jeu du discret et du continu en modélisation : relativité dynamique des structures conceptuelles. In Intelligence de la complexité, épistémologie et pragmatique. Éditions de l'Aube.
Von Foerster, H. (2003). Responsibilities of Competence. In Understanding Understanding: Essays on Cybernetics and Cognition (p. 191). Springer.
Von Uexküll, J. (1956). Théorie de la signification. Éditions Denoël, Hambourg.

(DIS-)TASTEFUL MACHINES?
Aesthetic Cognition and the Computational Turn in Aesthetics

WILLIAM W. YORK
Center for Research on Concepts and Cognition, Indiana University
512 North Fess Street, Bloomington, Indiana 47408-3822

AND

HAMID R. EKBIA
Center for Research on Mediated Interaction, Indiana University
1320 E. 10th Street, Bloomington, IN 47405-3907

Abstract. While aesthetics and cognition have traditionally been viewed as distinct from—even opposed to—one another, recent stirrings indicate the beginnings of an “aesthetic turn” regarding cognition. Does this, in turn, open up the possibility of a computational turn in the study of aesthetics? Can computational methods such as modeling and simulation be effectively brought to bear on something as mysterious and ineffable as aesthetic judgment? Or is “aesthetic cognition” a contradiction in terms? We explore these questions by focusing on the relationship between aesthetics and analogy-making, an area of cognition for which some research groundwork has already been laid. We will first offer some illustrative examples of this relationship, and then examine a group of computer models that have begun to explore mechanisms that may account for it. Although rudimentary in their capabilities, these models point to a computational perspective for investigating not only the analogy–aesthetics relationship, but the processes underlying aesthetic cognition more generally.

1. Introduction

As Mark Johnson (2007) recently put it, “[A]esthetics is not just art theory, but rather should be regarded broadly as the study of how humans make and experience meaning” (p. 209). Aesthetic considerations factor into seemingly mundane everyday experience as well as into more exalted intellectual pursuits. Regarding the latter, Robert Root-Bernstein (2002) has used the term “aesthetic cognition” to refer to the “pre-logical, emotion-laden, intuition-based feeling of understanding” (p. 62) that guides creative thought in science and mathematics. In some quarters, the term “aesthetic cognition” might seem like a contradiction.
There is a deeply rooted tendency to view the aesthetic and the cognitive as distinct from, if not opposed to, one another (Aiken, 1955). Yet recent stirrings from various quarters (e.g., Deacon, 2006; Norman, 2003) suggest that we are seeing the beginnings of an “aesthetic turn” in cognitive science.

2. A Computational Turn in Aesthetics?

Does this “aesthetic turn,” meanwhile, open up the possibility of a computational turn in aesthetics? Can the study of aesthetics be opened up to computational methods such as modeling and simulation? If so, how can they be effectively brought to bear on something as seemingly mysterious and ineffable as aesthetic sensibility? If not, what do we make of Root-Bernstein’s (2002) claim that “artificial intelligence will fail to provide insights into human thinking or model its capabilities until aesthetic cognition is itself understood sufficiently to be modeled and implemented by computers” (p. 75)?

Broadly speaking, there are two potential reactions to these questions. Optimistically, one might contend that fields such as cognitive science and artificial intelligence (AI) can—and, to some extent, already have—shed light on these questions, in part through the use of computer models, perhaps in combination with findings from neuroscience and experimental psychology. There is also the developing field of computational aesthetics (Hoenig, 2005). Despite its somewhat different emphases—which range from image-processing techniques to computer-generated art to formal analysis of artworks—the growth of this new field offers further evidence of the potential relevance of computation to aesthetics (and vice versa). In turn, skeptics might reply that longstanding problems in aesthetics have remained unsettled for a reason: There may simply be limits to what we can understand when it comes to matters of judgment, sensibility, and taste (Weizenbaum, 1976). To explain aesthetic sensibility would seem to involve specifying, formalizing, or mechanizing those same intuitive processes that have been defined as unspecifiable, unformalizable, or non-mechanizable (e.g., Polanyi, 1981; Dreyfus, 1992). This debate between optimists and skeptics is ongoing, encompassing other areas of human cognition and behavior; in particular, it has been framed around various theories and models in artificial intelligence (Ekbia, 2008). Is there a meaningful way to resolve, or at least advance, this debate?

3. Analogy-Making as Aesthetic Cognition

The perceptual and (especially) the aesthetic dimensions of analogy-making have been downplayed in much research on analogy within cognitive science and AI, where the focus has instead been on “analogical reasoning” (e.g., Winston, 1980). Yet analogy is not coextensive with reasoning, and the idea that analogy-making involves an aesthetic component does have some precedent. For example, in the program Copycat—a model of analogy-making in the microdomain of letter strings (e.g., “If abc is changed to abd, then how should kkjjii be changed?”)—the “computational temperature” at the end of a run can be construed as a sort of aesthetic evaluation of the program’s answer (Mitchell, 1993). Copycat’s successor, Metacat, is able to compare different answers to a given analogy problem—say, kkjjhh and kkjjij in response to the example given above—on the basis of three largely aesthetic dimensions: uniformity, abstractness, and succinctness (Marshall, 1999).
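The contrast between such answers can be made concrete with a toy of our own, far simpler than Copycat or Metacat: a literal reading of the rule and a structure-sensitive reading yield the two candidate answers just mentioned.

```python
def successor(c):
    return chr(ord(c) + 1)   # toy: no wrap-around at 'z'

def predecessor(c):
    return chr(ord(c) - 1)   # toy: no wrap-around at 'a'

def literal_answer(s):
    """Read 'abc -> abd' as: replace the last letter by its successor."""
    return s[:-1] + successor(s[-1])

def structural_answer(s):
    """Read 'kkjjii' as the descending group sequence kk-jj-ii, and map
    'last letter' to 'last group' and 'successor' to 'predecessor'."""
    groups = []
    for c in s:
        if groups and groups[-1][0] == c:
            groups[-1] += c
        else:
            groups.append(c)
    groups[-1] = predecessor(groups[-1][0]) * len(groups[-1])
    return "".join(groups)

print(literal_answer("kkjjii"))     # kkjjij
print(structural_answer("kkjjii"))  # kkjjhh
```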
Likewise, the idea that aesthetic sensibility involves an ability to perceive and appreciate analogies has also been noted before. For example, Koestler (1964) refers to the “hidden analogies” that inform the creative process in science, art, and humor. Arnheim (1969) discusses the role of analogy in the perception and grouping of visual forms, including what might be called “visual rhymes.” Similar types of analogical mappings can be identified in the plot structures of films, novels, and other narrative forms. Meanwhile, the role of aesthetic factors in science and mathematics has also been explored (e.g., Papert, 1988; Sinclair, 2004), further highlighting the connection between aesthetic sensibility, insight, perception, and analogy. Finally, computer models such as Letter Spirit (Rehling, 2001) have explored the role of analogy in the more traditionally aesthetic realm of alphabetic font (or grid font) design.

4. Open Questions

Models such as Copycat and Letter Spirit suggest a potentially rewarding perspective for investigating not only the analogy–aesthetics relationship, but the processes underlying aesthetic cognition more generally. But to what extent can such computational approaches ultimately contribute to this joint understanding? What are the strengths (and limits) of computer models that aim to simulate the processes of analogy-making and aesthetic judgment in human beings? Finally, is there potential for common ground between cognitive science/AI and the growing field of computational aesthetics?

Acknowledgements

Thank you to Helga Keller (R.I.P.) for her tireless support over the years.

References

Aiken, H. D. (1955). Some notes concerning the cognitive and the aesthetic. The Journal of Aesthetics and Art Criticism, 13(3), 378–394.
Arnheim, R. (1969). Visual Thinking. Berkeley: Univ. of California Press.
Deacon, T. (2006). The aesthetic faculty. In M. Turner (Ed.), The Artful Mind: Cognitive Science and the Riddle of Human Creativity (pp. 3–20). Oxford: Oxford Univ. Press.
Dreyfus, H. (1992). What Computers Still Can’t Do: A Critique of Artificial Reason. Cambridge, Mass.: MIT Press.
Ekbia, H. R. (2008). Artificial Dreams: The Quest for Non-Biological Intelligence. Cambridge, U.K.: Cambridge Univ. Press.
Hoenig, F. (2005). Defining computational aesthetics. In L. Neumann, M. Sbert, B. Gooch, and W. Purgathofer (Eds.), Computational Aesthetics 2005: Eurographics Workshop on Computational Aesthetics in Graphics, Visualization, and Imaging (pp. 13–18).
Johnson, M. (2007). The Meaning of the Body: Aesthetics of Human Understanding. Chicago: Univ. of Chicago Press.
Koestler, A. (1964). The Act of Creation. New York: Macmillan.
Marshall, J. (1999). Metacat: A Self-Watching Cognitive Architecture for Analogy-Making and High-Level Perception. Doctoral dissertation, Indiana Univ., Bloomington.
Mitchell, M. (1993). Analogy-Making as Perception: A Computer Model. Cambridge, Mass.: MIT Press.
Norman, D. (2003). Emotional Design: Why We Love (or Hate) Everyday Things. New York: Basic Books.
Papert, S. (1988). The mathematical unconscious. In J. Wechsler (Ed.), On Aesthetics in Science (pp. 105–120).
Polanyi, M. (1981). The creative imagination. In D. Dutton & M. Krausz (Eds.), The Concept of Creativity in Science and Art (pp. 91–108). The Hague, Netherlands: Nijhoff.
Rehling, J. A. (2001). Letter Spirit (Part Two): Modeling Creativity in a Visual Domain. Doctoral dissertation, Indiana Univ., Bloomington.
Root-Bernstein, R. S. (2002). Aesthetic cognition. International Studies in the Philosophy of Science, 16(1), 61–77.
Sinclair, N. (2004). The roles of the aesthetic in mathematical inquiry. Mathematical Thinking and Learning, 6(3), 261–284.
Weizenbaum, J. (1976). Computer Power and Human Reason: From Judgment to Calculation. San Francisco: W. H. Freeman and Co.
Winston, P. H. (1980). Learning and reasoning by analogy. Communications of the ACM, 23(12), 689–703.

Track VII: Social Computing

The social and its political dimension in software design
A Socio-Political Approach

DORIS ALLHUTTER
Austrian Academy of Sciences, Institute of Technology Assessment
Strohgasse 45, 1030 Vienna

Abstract. Recent debates in philosophy and computing and in science and technology studies address the prolongation of the social in technical design and development, and thus the question of discursive performativity. Applying a wider conception of the social than usually referred to in design research, I present an initial elaboration of a socio-political approach to software design. This approach is based in discourse theory, deconstructivism and ‘new materialism’ and focuses on the reproduction of power by tracing the performativity of hegemonic societal discourses and their co-materialization with (normative) technological phenomena. Making use of Karen Barad’s material-discursive account of performativity, I argue that a socio-political approach to software design needs to take into account the ‘intra-action’ of material phenomena with reconfigurings of power relations in intertwined epistemic and everyday work practices. The objectives of this endeavour are, first, to ask and make negotiable who (in/formal hierarchies) and what (discursive hegemonies) is given normative power in design processes on the basis of which social and technological imaginaries; second, to investigate and, to some extent, try to make tangible how these—mostly unconscious—normative enactments co-materialize with material phenomena or relations; and eventually, to elaborate on how to widen human agency by opening spaces for maneuver or trading zones when taking account of the agency of human/non-human assemblages or material-discursive re-configurations of the world.

Recent debates in philosophy and computing and in science and technology studies have expanded the question of the prolongation of the social in technical design and development by taking into account the concept of discursive performativity. Inspired by this discussion and applying a wider conception of the social than usually referred to in research on the development of computational artifacts, I present an initial elaboration of a socio-political approach to software design. This socio-political approach connects to the notion of ontological politics (see Mol, 1999) and is based in discourse theory, deconstructivism and ‘new materialism’. It focuses on the reproduction of power by tracing the performativity of hegemonic societal discourses and their co-materialization with (normative) technological phenomena. Karen Barad’s (2007) materialistic elaboration of the concept of performativity shifts the focus from a linguistic and discursive account of performativity, which is linked to the paradigm of the co-construction of society and technology, to the notion of co-materialization.
She criticizes earlier approaches to processes of materialization (as for example introduced by Butler and Foucault) that centre on the question of ‘how discourse comes to matter’. Barad suggests that their focus on the social constructedness of bodies/materiality in fact neglects the question of ‘how matter comes to matter’, and she puts an equal focus on the material dimensions of agency. In my previous work, Donna Haraway’s account of ‘embodied, situated practices’ and Judith Butler’s concept of discursive performativity have inspired me to investigate software design processes as entangled practices informed by technological concepts and hegemonic societal discourses as much as by professional self-conceptions of developers and related workplace politics (see Allhutter, 2011). Barad’s materialistic move, which resulted in her elaboration of ‘agential realism’, can add to such a perspective on software design in that it conceptually takes into account the agency of materiality or material phenomena (see also Velden and Mörtberg, 2011). The question of how to make use of a material-discursive account of performativity in applied design research remains open. In this respect, I suggest that it makes sense to reconstruct the journey of two crucial concepts—‘agency’ and ‘materialism’—that have been travelling between disciplines and research fields: While questions of the agency of artifacts and human/non-human (re-)configurations have been intensively discussed in studies of science and technology since the early 1980s (Callon, Latour, Law, Haraway), only recently have political science scholars such as Jane Bennett (2010) and Diane Coole and Samantha Frost (2010) begun to integrate this strand of theory to rethink concepts of political agency and to rework the notion of materialism, now discussed as ‘new materialisms’. Against this background, I argue that a socio-political approach to software design practice and theory needs to take into account the ‘intra-action’ of material phenomena with reconfigurings of power relations (normativity and societal hegemonies) in intertwined epistemic and everyday work practices. My objective of elaborating such a socio-political approach based on a material-discursive account of performativity is threefold: First, the aim is to ask and make negotiable who (in/formal hierarchies) and what (discursive hegemonies) is given normative power in design processes on the basis of which social and technological imaginaries (e.g. re-enactments of societal differences and epistemic dichotomies); second, to investigate and, to some extent, try to make tangible how these—mostly unconscious—normative enactments co-materialize with material phenomena or relations (e.g. development methods, processes, artifacts); and eventually, to elaborate on how to widen human agency by opening spaces for maneuver or trading zones (Allhutter and Hofmann, 2010) when taking account of the agency of human/non-human assemblages or material-discursive re-configurations of the world.

References

Allhutter, D. (2011). Mind Scripting: A Method for Deconstructive Design. Science, Technology & Human Values, OnlineFirst March 13, 2011.
Allhutter, D. & Hofmann, R. (2010). Deconstructive Design as an Approach to Opening Trading Zones. In: J. Vallverdú (Ed.), Thinking Machines and the Philosophy of Computer Science: Concepts and Principles (pp. 175–192). Hershey: IGI Global.
Barad, K. (2007).
Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Durham and London: Duke University Press.
Bennett, J. (2010). Vibrant Matter: A Political Ecology of Things. Durham and London: Duke University Press.
Coole, D. & Frost, S. (2010). New Materialisms: Ontology, Agency, and Politics. Durham and London: Duke University Press.
Mol, A. (1999). Ontological Politics: a Word and Some Questions. In: J. Law and J. Hassard (Eds.), Actor Network Theory and After (pp. 74–89). Oxford and Keele: Blackwell and the Sociological Review.
Velden, M. van der & Mörtberg, C. (2011). Between Need and Desire: Exploring Strategies for Gendering Design. Science, Technology & Human Values, OnlineFirst March 13, 2011.

A SOCIAL EPISTEMOLOGICAL APPROACH FOR DISTRIBUTED COMPUTER SECURITY

STEVE BARKER
Department of Informatics, King’s College London

Abstract. We present a social epistemological approach for treating an aspect of computer security, which allows multiple testifiers to contribute propositional attitude reports to a community repository of testimonial knowledge, and users to adopt a range of epistemic positions for deciding what constitutes justified belief in different contexts.

1. Introduction

We discuss a key epistemological aspect of the distributed access control (DAC) problem: in large, distributed computer systems, like the Internet, how can a decision be rendered on whether a requester of access to a resource is authorised to perform an action on the resource if what is known by the decision-maker about a requester is “incomplete”? (And it is computationally too expensive for the decision-maker to exhaustively search for all of the knowledge it (ideally) requires on the requester.) Rather than simply rejecting the access request on the basis of the incompleteness of its knowledge, the putative solution to the DAC problem is for the decision-maker to accept the assertions of some individual, ultimately trusted testifier who “speaks for” the requester and in so doing enables the decision-maker to determine whether the requester is authorised to perform a requested action on a resource. The notion of an ultimately trustworthy source of epistemic warrant assumes that a foundationalist (Bonjour, 1985) position on knowledge/justification applies in the DAC case; there is no infinite justificational regress because what the trusted source asserts is so. In Section 2 of this abstract, we suggest an alternative, social epistemological approach to the DAC problem. In Section 3, we draw conclusions.

2. An Alternative Approach to the DAC Problem

We argue for a community-based approach to testimonial warrant and for testifiers making assertions of their propositional attitudes (Russell, 1905) via a community-based repository, which is a store of triples (s, α, p) such that s is a source of assertions in a community of sources Σ = {s, s1, ..., sn} of testimonial warrant, p is a proposition, and α is a propositional attitude that a source in Σ has in relation to p. We note that p may be an atomic proposition or an arbitrary logical formula, we restrict attention to the doxastic attitudes “believes” and “disbelieves”, and we interpret a source as suspending belief on p if it makes no assertion of p to the community repository. The triples (si, α, p) represent that-clauses, e.g., si believes that sj is a “bad debtor”.
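A minimal sketch (our reading in Python, not Barker's formal system) of such a repository, together with a few of the acceptance policies discussed next:

```python
class Repository:
    """Community store of (source, attitude, proposition) triples."""
    def __init__(self):
        self.triples = set()

    def assert_(self, source, attitude, proposition):
        self.triples.add((source, attitude, proposition))

    def holders(self, attitude, proposition):
        return {s for s, a, p in self.triples
                if a == attitude and p == proposition}

# Three acceptance policies a decision-maker might adopt; the uniform
# signature makes the policies interchangeable.
def any_source(repo, community, p):
    return bool(repo.holders("believes", p))

def majority(repo, community, p):
    return len(repo.holders("believes", p)) > len(community) / 2

def undisputed(repo, community, p):
    """Some source believes p and no source disbelieves it."""
    return bool(repo.holders("believes", p)) and not repo.holders("disbelieves", p)

repo = Repository()
community = {"s1", "s2", "s3"}
repo.assert_("s1", "believes", "requester is a good trader")
repo.assert_("s2", "believes", "requester is a good trader")
print(majority(repo, community, "requester is a good trader"))  # True
```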
Typically, in the DAC scenario, the assertions are on a requester's reputation, e.g., for being a "bad debtor"; the categories of requesters to be used are community determined. In the context we assume, authorisation depends on the assignment of a requester to a category, e.g., s is authorised to perform some action on a resource iff s is categorised as a "good trader" (say). We suggest that what we propose is appropriate for addressing the DAC problem in that it recognises the need for knowledge construction by a division of epistemic labour, it allows for justified belief to be community constructed (which we hold to be more reliable than exclusively using individual, foundational sources of testimonial knowledge), and it recognises that, in the context of interest, "truth" is appropriately held to be relative to a community.

It is open to decision-makers to decide what methods of computation to use, with the community repository, in order to have justified beliefs for deciding on authorisation requests. A decision-maker may simply accept that the propositional attitude α holds in relation to p if some specific source s ∈ Σ expresses that directly. However, this is far from being the only option. A decision-maker may, for example, accept that α holds in relation to p because some non-specific member of Σ asserts that, or all members of Σ assert that, or it is the "majority view" (variously interpreted) of members of Σ that α holds in relation to p. Moreover, more complex requirements may be expressed in more expressive logic languages, e.g., an acceptor may accept that α applies in relation to p if some si ∈ Σ asserts that and no source in Σ disbelieves p. It is important to note that we allow individual decision-makers to decide on what constitutes evidence for them "knowing" that an authorisation holds, that the knowledge for this is socially constructed, and that different forms of inferential knowledge will be applicable for decision-making in different contexts (cf. DeRose 1992). In the evidentialist framework that we adopt (Feldman and Conee 1985), we say that: a decision-maker γ is justified in adopting the assertion by s ∈ Σ that the propositional attitude α holds in relation to the proposition p at the time t iff the attitude α on p is entailed by some computational method that γ justifiably holds to be reliable for this entailment at the time t, from the evidential sources that γ justifiably holds to be sufficiently authoritative for the purpose of making the inference that α holds on p according to s at t. Evidentialist-based interpretations of a variety of epistemic positions will be adopted in practice. It follows that we do not argue that foundationalism is not a meaningful epistemic position to adopt in the DAC context. Rather, we suggest that different epistemic positions (e.g., foundationalist, Haackean foundherentist, etc.) will apply in different contexts. It is the emphasis on a plurality of epistemic positions that is distinctive about our approach.

3. Conclusions

We critically assessed the foundationalist epistemic position that has hitherto been assumed in treating the DAC problem. We then argued for a social epistemological alternative, which accommodates propositional attitude reports, community-based testimonial assertions and the flexible use of a range of methods for producing inferential knowledge.
In future work we intend to consider repositories that maintain a history of propositional attitudes and the epistemic issues that arise.

References
Bonjour, L. (1985). The Structure of Empirical Knowledge. Harvard University Press.
DeRose, K. (1992). Contextualism and Knowledge Attributions. Philosophy and Phenomenological Research, 52, 913-929.
Feldman, R. & Conee, E. (1985). Evidentialism. Philosophical Studies, 48, 15-34.
Russell, B. (1905). On Denoting. Mind, 14, 479-93.

TRUST, POWER, AND INFORMATION TECHNOLOGY

MARK COECKELBERGH
University of Twente
Department of Philosophy, P.O. Box 217, 7500 AE Enschede, The Netherlands
E-mail: m.coeckelbergh@utwente.nl

Abstract. This paper offers a preliminary discussion of the relation between trust, power, and information technology. It also explores some implications for the ethics and politics of information technology.

1. Introduction

In recent years the issue of trust has received much attention in ethics and philosophy of information technology. For instance, there is work on e-trust and on-line trust: some argue against e-trust (for example Nissenbaum 2001), while others are more optimistic about trust in digital contexts (Taddeo 2009, 2010a, 2010c; Turilli et al. 2010). Furthermore, in the field of social epistemology there is work on trust and knowledge (Simon 2009; Taddeo 2010b), and people working in the virtue ethics and phenomenological traditions have developed a notion of 'implicit' trust (Ess 2010; Carusi 2009). While this attention to trust has produced insightful work relevant to both philosophers and computer scientists who try to model trust, there is little or no attention to the relations between trust, power, and information technology. This paper is a preliminary attempt to explore this relation. First I will clear the ground by making a claim regarding the epistemology of trust (I will need this later); then I will make two claims about the relation between trust and power: (1) trust presupposes power relations, and (2) trust creates power relations. This analysis will allow me to make some suggestions about the implications for the ethics and politics of information technology.

2. Trust, Knowledge and Transparency

Although it is true that trust can emerge in uncertain and risky on-line environments and that in one sense trust promotes transparency, as Turilli and others have argued (Turilli et al. 2010), there is also a sense in which (a) trust can only exist under conditions of uncertainty and (b) transparency destroys trust. In order to develop these claims, we must challenge the rationalist-contractarian assumption entertained in Taddeo's work, that e-trust cannot appear a priori but depends on the assessment of trustworthiness by a rational (artificial) agent (Taddeo 2010c). A phenomenological notion of trust, by contrast, involves a sort of a priori, implicit form of trust. This form of trust flourishes only in environments characterized by incomplete certainty, knowledge and transparency. If there were complete uncertainty, a complete lack of knowledge, and no transparency at all, we would have no basis for trust. On this point rationalist-contractarian models are right. If, however, there were complete knowledge, complete certainty, and full transparency, there would be no need for trust; the problem would not arise in the first place. This suggests that if political movements aim for total, absolute transparency (e.g.
Wikileaks), they risk destroying trust, which must be situated 'in between' the epistemic absolutes identified. However, this is a claim about knowledge; what about trust with regard to action?

3. Trust and Power (1)

If trust is not entirely freely decided by rational agents, but presupposed in social relations, then we need to discuss how prior social relations, understood as power relations, shape trust. There are a priori dependencies that enable but also constrain agency with regard to trust. In a particular social network, I 'have' to trust some others and indeed some technologies (e.g. software) since, and to the extent that, I am dependent on them for the very practice I am engaged in. In any social network, I am dependent on some key, powerful actors and technologies which I 'have' to trust because they are powerful. This limits my agency with regard to trust. Power relations – relations with others and with technologies – already shape trust 'before' any decision or deliberation about trust is made. If this is true, it does not only set limits to efforts to model and implement trust in artificial networks; it is also relevant for the ethical-philosophical analysis of trust in digital environments 'inhabited' or 'crawled' by both humans and artificial agents. In the digital age, trust crucially depends on power exercised by the 'architects', 'providers' and 'webmasters' of the social-technological networks that form and transform our interactions and practices (including academic practice). But how did these social actors become powerful in the first place? Does this analysis preclude agency altogether?

4. Trust and Power (2)

Even a strictly rationalist-contractarian approach to trust must acknowledge that trust, 'decided' upon by rational agents, creates power relations and generates its own normativity with regard to humans and their artificial cooperants. If an agent A says 'I trust you' to an agent B, this does not only create expectations A has about B's future actions, but also involves a delegation of (discretionary) power from A to B. In addition, and this is the normative aspect, A makes B responsible. If A trusts B to do something, then A holds B responsible for doing that. In particular, if B decides to do otherwise (trust presupposes that B has this space of freedom), then B has to provide reasons to A, to explain why (s)he did not do what A expected him or her to do. Trust is violated if no good reasons are given by B. This analysis of the relations between trust, power, and normativity is relevant for 'horizontal' social relations, but also for the 'vertical' relation between individuals and the state. This works both ways: (1) an individual A may trust state B, which implies that A delegates power to B to do something and that B becomes responsible; A's trust can then be violated by B if B fails to do this and fails to give good reasons for not doing it. (2) State A can trust its citizens B (not) to do something, that is, hold B responsible, and B can violate this trust.

5. Conclusion

I conclude that this framework, which tolerates and employs both rationalist-contractarian and phenomenological approaches, reveals a lacuna in the present literature and allows us to analyze and discuss the power dimension of issues in social epistemology, information ethics and the philosophy of information.
For example, in the Wikileaks case, there seems to be a clash between, on the one hand, a vertical 'delegation' model, which creates the possibility of trust under conditions of uncertainty, and, on the other hand, a model that aims at transparency, attempts to provide complete knowledge, and seeks to abolish the vertical delegation relation – and thereby abolishes trust in the sense discussed above. Of course this analysis does not exhaust the many interpretations of the word 'trust' used in the literature. And perhaps a tension remains between rationalist-contractarian and phenomenological approaches. Furthermore, neither power nor trust should be our only concern in the ethics and politics of information technologies. However, I hope this exploration of the relation between trust, power, and information technologies can contribute to the expanding research on trust and information technology.

References
Carusi, A. (2009). Implicit Trust in the Space of Reasons: A Response to Justine Pila. Journal of Social Epistemology, 23(1), 25-43.
Ess, C. (2010). Trust and New Communication Technologies. Knowledge, Technology, & Policy, 23(3-4), 287-305.
Nissenbaum, H. (2001). Securing Trust Online: Wisdom or Oxymoron. Boston University Law Review, 81(3), 635-664.
Simon, J. (2009). Webs of Trust and Knowledge: Knowing and Trusting in the World Wide Web. In: Proceedings of WebSci'09: Society On-Line, 18-20 March 2009, Athens, Greece.
Taddeo, M. (2009). Defining Trust and E-trust: Old Theories and New Problems. International Journal of Technology and Human Interaction, 5(2), 23-35.
Taddeo, M. (2010a). Trust in Technology: a Distinctive and a Problematic Relation. Knowledge, Technology and Policy, 23(3-4), 283-286.
Taddeo, M. (2010b). An Information-Based Solution for the Puzzle of Testimony and Trust. Social Epistemology, 24(4), 285-299.
Taddeo, M. (2010c). Modelling Trust in Artificial Agents: A First Step toward the Analysis of e-Trust. Minds and Machines, 20(2), 243-257.
Turilli, M., Vaccaro, A., & Taddeo, M. (2010). The case of on-line trust. Knowledge, Technology and Policy, 23(3-4), 333-345.

THE BENEFITS OF SOCIAL THEORY FOR MODELLING STABLE ENVIRONMENTS OF SYSTEMIC TRUST WITHIN MULTI AGENT SYSTEMS

DIEGO COMPAGNA
University of Duisburg-Essen, Institute of Sociology
Lotharstr. 65 (LE 643), 47057 Duisburg

1. Modelling Stable Environments of Systemic Trust within Multi Agent Systems

Trust is often discussed on the micro-level of individuals or discrete entities; instead, I would like to stress the benefits of systemic trust, which can be seen as a form of mediated trust between entities. Based on the proposition of the 'Homeostatic Feedback Loop' by Anthony Giddens, a stable social environment can be modeled for Multi Agent Systems (MAS). The goal of this model is that, on the one hand, trust is built as a non-intended effect on the systemic level, from which, on the other hand, all participating entities benefit: the outcome is a self-sustaining framework, or a homeostatic systemic state. In this model, trust emerges as the result of non-intended effects of distinct actions between different agents that can be described as functional cooperation. The specific characteristic of the Causal Feedback Loop – the core proposition within the notion of a duality of structure (Giddens 1984) – could be very useful for a MAS architecture that unfolds a stable environment (Compagna 2009).
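The following toy sketch in Python (our reading of the feedback-loop idea, with all quantities as hypothetical scalars; it is not Compagna's formal model) illustrates the mechanism: agents act only on individual motives, the side-effects of their actions reproduce the shared action-framework, and systemic trust appears as an emergent, non-intended observable of that loop:

    import random

    class Agent:
        """An intentionally motivated agent; it never aims at 'trust'."""
        def __init__(self):
            self.payoff = 0.0

        def act(self, structure):
            # Self-interested effort, scaled by how reliable the shared
            # action-framework currently appears; 0.1 is a minimal
            # exogenous activity level so the loop has a non-trivial
            # fixed point.
            effort = 0.1 + random.uniform(0.5, 1.0) * structure
            self.payoff += effort   # intended, individual outcome
            return effort           # non-intended, systemic contribution

    def run(steps=200, n_agents=10):
        agents = [Agent() for _ in range(n_agents)]
        structure = 0.5   # state of the shared action-framework
        trust = 0.0       # systemic trust: an emergent observable
        for _ in range(steps):
            side_effects = [a.act(structure) for a in agents]
            # Feedback loop: the unintended side-effects of action
            # reproduce the very structure that enabled the action.
            structure = 0.9 * structure + 0.1 * sum(side_effects) / n_agents
            trust = 0.9 * trust + 0.1 * structure
        return structure, trust

Run repeatedly, the loop settles into a homeostatic state in which the framework is stable and the trust variable is high, although no agent ever intended this, which is the sense in which systemic trust is claimed to be an emergent value of functional cooperation.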
The main assumption behind the concept of the duality of structure is that actions and the framework of these actions are organized recursively or, in terms of social systems theory, in the mode of autopoietic sustainment (Giddens 1991). Within such an environment of mutual but non-intended functionality, trust becomes an emergent value, or a non-intended outcome. Based on an early paper by Castelfranchi and Conte (1992), different kinds of cooperation can be described: non-intended, intentional, out-designed and functional. Functional cooperation is described as the best way to establish a fruitful and stable cooperation between agents. This type of cooperation can be related to, captured and further conceptualized very well with the theory of structuration. The model I would like to present, combining the above-mentioned propositions, consists in an action-framework that is functional for the involved agents as a mutual goal, although this functionality is not directly intended by their intentionally motivated actions. Although this model claims to explain and accomplish a stable framework for MAS, it can be transferred to a Human-Agent setting in which, by non-intended effects, a stable interaction framework emerges that provides a favorable context for mutual system trust.

References
Castelfranchi, C. & Conte, R. (1992). Emergent functionality among intelligent systems: Cooperation within and without minds. AI & Society, 6(1), 78-87.
Compagna, D. (2009). Sozionik und Sozialtheorie: Zum Beitrag soziologischer Theorien für die Entwicklung von Multiagentensystemen. Saarbrücken: VDM Verlag.
Giddens, A. (1984). The Constitution of Society: Outline of the Theory of Structuration. Cambridge: Polity Press.
Giddens, A. (1991). Structuration theory: Past, present and future. In: C. G. A. Bryant & D. Jary (Eds.), Giddens' Theory of Structuration: A Critical Appreciation (pp. 201-221). London: Routledge.

COMPUTER NETWORKS AND THE PHILOSOPHY OF MIND
A Social Mind – Networked Computer Analogy

ISTVAN DANKA
Department of Philosophy, University of Leeds
Leeds, LS2 9JT, United Kingdom

In the last few decades, computer analogies of the mind have dominated several central fields of the philosophy of mind. The leading versions of the 'mind – computer' analogy are based on the Interface Model of the Mind (to use Putnam's phrase), claiming that the mind of an individual is analogous to a computer with an interface connection to its environment. As opposed to this, I shall develop a Network Model of the Mind, based on an analogy between the socially extended mind and a computer network, according to which social relations and the semantic content of the WWW are analogously structured. In accordance with Clark and Chalmers' extended mind hypothesis, I shall argue that there are active constituent parts of mental processes that are located externally to the mind of an individual, just as there are semantic contents external to individual computers. A network model of the mind is the opposite of the interface model in the following sense. The interface model rests on the (Cartesian-inspired) assumption that there is a surface on which the mind interacts with its environment. For a social externalist, the mind is extended over the limits of the body, and hence no "surface" of the individual can be drawn.
For a social externalist, mental processes are more plausibly understood as social activities among interlinked individuals. In either case, it makes no sense to allude to any interface. For a network model, what is essential in the structure of mental contents is not separation but connection. Hence, it explains the mental in terms of connections among mental contents in the minds of different individuals.

At least two significant versions of the 'social mind – networked computer' analogy can be developed. On the one hand, one can argue for an analogy between socially embedded individual minds and networked computers. In this case, the connections have to be understood as physical connections among computers (i.e., the internet) on the one hand, and socially connected individual humans (social networks) on the other. The second version is philosophically more interesting, though. Namely, an analogy can be drawn between semantic content on the net (WWW) on the one hand, and mental content structured socially on the other hand. This analogy demonstrates that mental contents cannot be individually located in our heads since, analogically, semantically significant units of content are not necessarily contained by the server but are often spread over multiple machines (e.g. cookies).

Regarding the connections among mental contents, I shall distinguish three structurally different models of the individual mind in terms of the relations among mental contents. First, centralised (Cartesian/Kantian) views argue that there is a centre of mental content (the soul, the mind, the Self, etc.), to which all mental contents are (directly or indirectly) connected. Second, non-centralised (behaviourist/physicalist) views claim that no centre of mental contents is provided; the best model for the relations among mental contents is a random graph. Third, de-centralised models (e.g. Quine) claim that there is a difference between central and peripheral mental contents; though no clear distinction can be made between the contingent and the necessary, a gradual account of more and less central contents can be provided.

In parallel, there are three main models of the social relations among mental contents. Those who accept centralised models of the individual mind will most probably follow a multi-centred view of the social, claiming that mental contents constitute many centres of individual minds connected to each other randomly. (A logically possible alternative to this would be to argue that there is a centre of the social as well, but no serious attempt has been made to support such a view.) Holders of non-centralised models of the individual can apply their random graph model to the social as well, claiming an equal distribution of socially explained connections among mental contents. Finally, defenders of the de-centralised view claim that there are socially more and less central contents, and even if there is no single centre of the social, several hubs can be identified. Analysing different approaches to how semantic content on the internet is organised, I shall develop a topology of network-based relations among mental contents and argue for a de-centralised network model of the social mind, based mostly on an analogy with A.-L. Barabási's research on the topology of the internet.
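The topology at issue can be generated in a few lines. The following Python sketch (our illustration of the standard Barabási-Albert mechanism, not code from the paper) grows a network by preferential attachment and then inspects its hubs:

    import random
    from collections import Counter

    def preferential_attachment(n, m=2, seed=0):
        """Grow a Barabasi-Albert-style graph: each new node links to m
        existing nodes chosen with probability proportional to their
        degree -- the 'rich get richer' mechanism."""
        random.seed(seed)
        targets = set(range(m))  # nodes the next newcomer attaches to
        repeated = []            # each node appears once per incident edge
        edges = []
        for new in range(m, n):
            edges.extend((new, t) for t in targets)
            repeated.extend(targets)
            repeated.extend([new] * m)
            # choose m distinct targets, proportionally to degree
            targets = set()
            while len(targets) < m:
                targets.add(random.choice(repeated))
        return edges

    edges = preferential_attachment(10000)
    degree = Counter(v for edge in edges for v in edge)
    print(degree.most_common(5))                       # a few heavily linked hubs
    print(sum(1 for d in degree.values() if d <= 3))   # ...and a long tail

The resulting degree distribution is heavy-tailed: a handful of hubs attract a large share of the links while most nodes remain sparsely connected, yet no single node is a centre. This is exactly the de-centralised, hub-based structure, as opposed to a centralised graph or a purely random one, on which the argument relies.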
While doing so, I shall allude to (1) the unequal distribution of links on the internet (the "rich get richer" phenomenon), (2) the impossibility of complex networks being centralised ("the winner does not take all"), and (3) some differences between inbound and outbound links regarding the semantic significance of web pages. Based on these, I shall argue for a de-centralised network model of the social mind, following an analogy between the structure of the content on the WWW and a model of the mind graph-theoretically equivalent to Quine's gradual approach between the central and the peripheral. However, there is a slight modification in my own version. From the network analogy it follows that the building of knowledge is not hierarchical, though it is also not an evenly distributed random model of connections among items. The least connected items are not connected to gradually more connected items on the way to highly connected items; on the contrary, they are mostly directly connected to "central" hubs. Therefore, a spatial metaphor of 'central vs. peripheral' is misleading. All the same, it can also be argued that even though the (physical) structure of the internet and the (semantic) structure of the WWW are analogous (and hence so are the structure of mental contents and that of social relations), the connection between the two is contingent. Since from the analogy it follows that a multi-centred view of the social mind is incompatible with the actual structure of semantic content on the web, on the supposition of the analogy, no item of mental content can be located in individuals. Hence, no interface can be identified. If so, the 'social mind – networked computer' analogy may serve as a useful weapon for social externalists.

AGENT BASED MODELING WITH APPLICATIONS TO SOCIAL COMPUTING

Gordana Dodig Crnkovic
School of Innovation, Design and Engineering, Mälardalen University, Sweden
gordana.dodig-crnkovic@mdh.se

1. Extended Abstract

Even though computers were invented primarily to automatize calculations, Licklider and Taylor (1968) already emphasized the importance of the computer as a communication device, with consequent shared knowledge and community-building. There are two different approaches to social computing (Wang et al. 2007): one with a strong emphasis on the technological, computing side, and the other centered on the human, social aspect. The present analysis will focus on the first kind of social computing, a computational approach to the modeling of social interactions, including the development of their supporting information and communication technologies. The main tools are simulation techniques used to facilitate the study of society and to support decision-making policies, helping to analyze how changing policies affect social, political, and cultural behavior (Epstein, 2007).

Social computing is radically changing the character of human relationships worldwide (Riedl, 2011). Instead of a maximum of about 150 connections prior to ICT (Dunbar, 1998), present social computing easily leads to networks of several hundred contacts. It remains to be understood what type of society will emerge from such massive "long-range" distributed interactions instead of the traditionally fewer and deeper short-range ones. As the information overload on individuals steadily increases in this process, social computing technologies are moving beyond social information processing toward social intelligence (Zhang et al. 2011; Lim et al. 2008; Wang et al.
2007), which brings an additional level of complexity. Social computing with the focus on the social is a phenomenon which enables extended social cognition, while social computing with the focus on computing is about computational modeling and a new paradigm of computing.

I will focus on agent-based social simulation (ABSS) as a generative computational approach to social simulation defined by the interactions of autonomous agents whose actions determine the evolution of the system, as applied in artificial life, artificial societies, computational sociology, dynamic network analysis, models of markets, and swarming, including swarm robotics (Antonelli and Ferraris 2011; Chai et al. 2010). As Gilbert (2005) rightly points out, the novelty of agent-based models (ABMs) is that they "offer the possibility of creating 'artificial' societies in which individuals and collective actors such as organizations could be directly represented and the effect of their interactions observed. This provided for the first time the possibility of using experimental methods with social phenomena, or at least with their computer representations; of directly studying the emergence of social institutions from individual interaction."

ABMs are very useful computational instruments, but they should not be taken as "reality", even though simulations with their realistic graphical representations suggest that they are "real". The process of modeling and simulation is complex, and many simplifications and assumptions must be made, which must always be justified for each application (Gilbert and Troitzsch 2005; Grimm and Railsback 2005; Axelrod 1997). ABMs in general are used to model complex, dynamical adaptive systems (Breiger et al. 2003). The interesting aspect of ABMs is the micro-macro link (agent-society). Multi-Agent Systems (MAS) models may be used for any number of (in general heterogeneous) entities spatially separated by the environment, which can be modeled explicitly. Interactions are in general asynchronous, which adds to the realism of the simulation (Miller and Page 2007; Schuler 1994).

Social computing represents a new computing paradigm, one sort of natural computing, often inspired by biological systems such as swarm intelligence, evolutionary computation or artificial immune systems. In my analysis I will present different paradigms of computation, including social computing and the modeling of cognitive agents in the info-computational framework (Dodig-Crnkovic 2011; Dodig-Crnkovic and Müller 2009).

References
Antonelli, C. & Ferraris, G. (2011). Innovation as an Emerging System Property: An Agent Based Simulation Model. Journal of Artificial Societies and Social Simulation (JASSS), 14(2), 1. http://jasss.soc.surrey.ac.uk/14/2/1.html
Axelrod, R. (1997). The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton: Princeton University Press.
Breiger, R., Carley, K. & Pattison, P. (2003). Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers. National Academies Press.
Chai, S-K., Salerno, J. & Mabry, P. L. (Eds.) (2010). Advances in Social Computing: Third International Conference on Social Computing, Behavioral Modeling, and Prediction, SBP 2010, Bethesda, MD, USA. Berlin: Springer-Verlag.
Dodig-Crnkovic, G. (2011). Significance of Models of Computation from Turing Model to Natural Computation. Minds and Machines, DOI 10.1007/s11023-011-9235-1. Special issue on Philosophy of Computer Science, R. Turner & A. Eden (Eds.), pp. 1-22.
Dodig-Crnkovic, G. & Müller, V. (2009). A Dialogue Concerning Two World Systems: Info-Computational vs. Mechanistic. In: G. Dodig-Crnkovic & M. Burgin (Eds.), Information and Computation. World Scientific Publishing Co. Series in Information Studies, 2011. http://arxiv.org/abs/0910.5001
Dunbar, R. (1998). Grooming, Gossip, and the Evolution of Language. Harvard University Press.
Epstein, J. M. (2007). Generative Social Science: Studies in Agent-Based Computational Modeling. Princeton University Press.
Gilbert, N. (2005). Agent-based social simulation: dealing with complexity. http://www.complexityscience.org/NoE/ABSS-dealing%20with%20complexity-1–1.pdf
Gilbert, N. & Troitzsch, K. (2005). Simulation for the Social Scientist. Open University Press.
Grimm, V. & Railsback, S. F. (2005). Individual-based Modeling and Ecology. Princeton University Press.
Licklider, J. C. R. & Taylor, R. W. (1968). The computer as a communication device. Science and Technology (September), 20-41.
Lim, H. C., Stocker, R. & Larkin, H. (2008). Ethical Trust and Social Moral Norms Simulation: A Bio-inspired Agent-Based Modelling Approach. In: 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, December 2008, pp. 245-251.
Miller, J. H. & Page, S. E. (2007). Complex Adaptive Systems: An Introduction to Computational Models of Social Life. Princeton, NJ: Princeton University Press.
Riedl, J. (2011). The Promise and Peril of Social Computing. Computer, 44(1), 93-95. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5688159&isnumber=5688134
Schuler, D. (1994). Social Computing. Communications of the ACM, 37(1), 28-29.
Wang, F-Y., Carley, K. M., Zeng, D. & Mao, W. (2007). Social Computing: From Social Informatics to Social Intelligence. IEEE Intelligent Systems, 22(2), 79-83. DOI=10.1109/MIS.2007.41 http://dx.doi.org/10.1109/MIS.2007.41
Zhang, D., Guo, B. & Yu, Z. (2011). Social and Community Intelligence. Computer, Vol. 99, No. PrePrints. doi:10.1109/MC.2011.65

OBJECTS OF IDENTITY, IDENTITY OF OBJECTS
For a Materialist Account of Online Behavior

HAMID R. EKBIA
hekbia@indiana.edu
School of Library and Information Science
Indiana University, Bloomington, IN 47401, U.S.A.

AND

GUO ZHANG
guozhang@indiana.edu
School of Library and Information Science
Indiana University, Bloomington, IN 47401, U.S.A.

Abstract. Objects constitute significant elements of individual identity. Who we are has a lot to do with what we have and with what value we put on what we have. This point is easier to appreciate in the "off-line" physical world, where objects with various symbolic or non-symbolic values populate our environment. How about the online world, which is seemingly devoid of objects — at least in a purely physicalist understanding of objecthood? What role, if any, do objects play in shaping online identities? We seek to address this question by following two lines of inquiry: post-structuralist accounts of quasi-objects, and recent work in economic sociology on justification and mutual agreement. These inquiries lead to two key propositions: (i) digital artifacts are quasi-objects, which mediate collective practices that seem to exert a strong force of desire in the specific circumstances of our times; and (ii) people operate within various regimes in which they enact information and objects through collective practices of situated social orders.
Here we integrate and extend these two lines of inquiry in order to explore the question of online identity. Our key argument is that people's identities are mediated through digital artifacts (personal websites, personal profiles, blogs, etc.) in a process in which the identities of the subject and the object are collectively and mutually enacted by the network of people who take interest in them.

1. Introduction

Objects constitute significant elements of individual identity. Who we are has a lot to do with what we have and with what value we put on what we have. This point is easier to appreciate in the "off-line" physical world, where objects with various symbolic or non-symbolic values populate our environment. How about the online world, which is seemingly devoid of objects — at least in a purely physicalist understanding of objecthood? What role, if any, do objects play in shaping online identities? We take this question seriously, and seek a materialist answer to it. We seek an account that can do justice to things that matter, that offer potentials and resistances, physically but also socially, historically, psychologically, and so on. Although this is admittedly a non-standard notion of materialism — modern philosophers often use physicalism and materialism interchangeably (Stoljar, 2009) — it is useful for our purposes in at least two ways. First, it allows us to consider the inherently material, not necessarily physical, aspects of the online world. Second, it opens a line of inquiry that situates digital artifacts in how they relate to existing social structures and in how they embody and anticipate the future through the socio-material practices that they allow or disallow. The first point is important because dominant discourses in information science, philosophy, and elsewhere tend to discount the underlying materiality (even physicality) of the "virtual" (e.g., Lévy, 1998). The second point matters because it allows us to see current online experiences from the historical perspective of modernity (Day and Ekbia, 2010).

2. Two Lines of Inquiry

Our study of the relationship between objects and identity in the online world follows two lines of inquiry. One is inspired by post-structuralist accounts of quasi-objects, the other by recent work in economic sociology on justification and mutual agreement. Originating in the psychoanalytic notion of "part-objects", Winnicott's notion of the "transitional object", and the Lacanian notion of objet petit a (object little-a), the notion of the "quasi-object" later appears in discussions of intersubjectivity by Serres, of scientific theories and entities by Latour, and of technology and virtuality by Lévy. In Lacan's (1991) psychoanalysis, objet petit a stands for an unattainable libidinal object of desire (e.g., the breast), which is imagined to be separable from the rest of the body, in the same fashion that an ornament can be detached from the body. As such, it both drives and limits desire, and can be sought in the "other", traversing the order of the real and the imaginary, the mind and the body, the self and the other. In the age of the Internet, this raises the question of whether our common fascination and obsession with online depictions of our identity — digital variants of Lacan's "mirror image" — may be a reassertion of specific (infantile?) desires. Answering this question in earnest requires empirical research on how identities are fluidly (de-, re-)constructed on the Web (Aboujaoude, 2011).
However, the beginnings of an answer can be found in the writings of Michel Serres (1982), who seeks to explain identity and intersubjectivity from a materialist perspective. Famously characterizing the furet in a children's game (a French game resembling hunt-the-slipper) as a quasi-object, Serres argues that the identity of the child who carries the furet changes as he becomes distinct from others by becoming "it" (Serres, 1982). In so doing, the furet also connects the players and their positions, fixing and stabilizing the collective. The passage of the furet, in other words, allows the co-constitution of both (quasi-)objects and (quasi-)subjects (Day, 2010).

Economic sociology, on the other hand, shows that subjects and objects are mutually qualified in different orders of worth. In their attempt to integrate economic and social values in a single analytic framework, for instance, Boltanski and Thévenot (2006) have arrived at a set of principles that people resort to in order to justify their actions. These principles, which operate within different regimes of worth, are appealed to by individuals depending on the particular "world" (or polity) which they inhabit in a given situation. "Persons and things offer one another mutual support. . . With the help of objects, which we shall define by their belonging to a specific world, people can succeed in establishing states of worth." (Boltanski and Thévenot, 2006: 131).

In previous work, these lines of thought have led us to two key propositions: (i) digital artifacts are quasi-objects, which mediate collective practices that seem to exert a strong force of desire in the specific circumstances of our times (Ekbia, 2009a); and (ii) people operate within various regimes of information in which they enact information through collective practices of situated social orders (Ekbia, 2009b; Ekbia and Evans, 2009; Garfinkel, 2008). Here we integrate and extend these two lines of inquiry in order to explore the question of online identity. Our key argument is that people's identities are mediated through digital artifacts (personal websites, personal profiles, blogs, etc.) in a process in which the identities of the subject and the object are collectively and mutually enacted by the network of people who take interest in them.

3. Online Behavior: Game and Identity

Take your personal profile on a social networking site, for instance. The profile represents you, but not in the sense that your photograph, for example, would represent you. By creating a profile, in a way you create a representation of yourself, your history, tastes, hobbies, friends, friends of friends, and so on. But on closer scrutiny this is not a representation, traditionally understood as a stand-in that has a resemblance relationship to you. Nor is the profile simply an active representation non-causally coupled to you in the way that most computer representations are believed to be coupled to their subject matter. The profile is an artifact that both mediates and traces your network of friends, hobbies, and history. As a complex event, not a representation, it constitutes a complex site for the actualization of such a network. Lastly, the profile participates in the embedding environment, taking you to unforeseen places, while being itself shoved around by others. In this manner, it acts like characters in a good novel who take on, we are told, a life of their own, dragging the author along with them (Bakhtin, 1984).
In a serious way, the fate of the profile is in the hands of others who take interest in it and who build bridges between you and their profiles. In short, your identity is enacted in a collective process organized around your profile, in the same way that the identity of the child is shaped in carrying the furet. You become "it", with the caveat that the nature of the "it" in an electronic medium enables a strongly malleable, transient, and unstable identity, providing enormous room for playfulness, fantasy, illusion, deception, self-deception, and so forth.

We want to explore these issues, especially in regard to computer games and how an individual's "virtual" identity in a game may, or may not, interact with their identity in the non-game (off-line) world. With the growing potential for personalizing game characters (avatars) to represent individual features, this question has become increasingly meaningful and significant. For instance, in games for health, we can connect a Personal Health Record to a gaming platform so that, through proper data linkages to environmental signals, one's real-life behavior would affect the game — think of an avatar that becomes large, drunk, or ill depending on how you eat, drink, or behave. How would the change of the avatar influence your real-life identity? Is the avatar the equivalent of the furet? Or does it exert less/more influence?

References
Aboujaoude, E. (2011). The Dangerous Powers of E-Personality. New York: W.W. Norton & Company.
Bakhtin, M. M. (1984). The Problems of Dostoevsky's Poetics. (C. Emerson, Trans.). University of Minnesota Press.
Boltanski, L. & Thévenot, L. (2006 [1991]). On Justification: Economies of Worth. (C. Porter, Trans.). Princeton, NJ: Princeton University Press.
Day, R. E. (2010). Death of the User: Reconceptualizing subjects, objects, and their relations. Journal of the American Society for Information Science and Technology, 62(1), 78-88.
Day, R. & Ekbia, H. (2010). Digital experiences. In: J. Kallinikos, G. F. Lanzara & B. Nardi (Eds.), The digital habitat — Rethinking experience and social practice. First Monday, 15(6-7).
Ekbia, H. (2009a). Digital artifacts as quasi-objects: Qualification, mediation, and materiality. Journal of the American Society for Information Science and Technology, 60(12), 2554-2566.
Ekbia, H. (2009b). Regimes of information: A polity model. Paper presented at the 7th European Conference on Computing and Philosophy, Barcelona, Spain, July 1-4.
Ekbia, H. & Evans, T. (2009). Regimes of information: Land use, management, and policy. The Information Society, 25, 328-343.
Garfinkel, H. (2008). Toward a Sociological Theory of Information. Boulder, CO: Paradigm.
Lacan, J. (1991). The Seminar of Jacques Lacan. Book II: The Ego in Freud's Theory and in the Technique of Psychoanalysis, 1954-1955. New York: W.W. Norton & Company.
Lévy, P. (1998). Becoming Virtual: Reality in the Digital Age. (R. Bononno, Trans.). New York: Plenum.
Serres, M. (1982). The Parasite. (L. R. Schehr, Trans.). Baltimore: Johns Hopkins University Press.
Stoljar, D. (2009). Physicalism. Stanford Encyclopedia of Philosophy. Retrieved March 24, 2010 from: http://plato.stanford.edu/entries/physicalism/

THE CONSTRUCTION OF REALITY AND OF SOCIAL BEING IN THE INFORMATION AGE

LÁSZLÓ ROPOLYI
Department of History and Philosophy of Science
Eötvös University, 1518 Budapest, Pf. 32., Hungary
ropolyi@caesar.elte.hu

Abstract. In the information age, representational (information, cognitive, cultural, communication) technologies, instead of material ones, become the dominant factor in the construction of social being. To conceptualize this shift, I suggest that Aristotle's dualistic ontological system (which distinguishes between actual and potential being) be complemented with a third form of being: virtuality. In the virtual form of being, actuality and potentiality are inseparably intertwined. Everything that is produced by representational technologies is a virtual being. Therefore, in the information age, social being, too, has a virtual character, as it is produced by representational technologies. Information itself is a product of representational technology, while it is also interpreted being. This process of interpretation takes place in human minds, and the process can be described as a "hermeneutical industry". The information society is inhabited by virtual beings, so it has a virtual and open character.

1. Technology and Representation

Technology is a specific form or aspect of human agency, the realization of human control over a technological situation.[18] Every element of the human world is created by technologies. Both human nature and social being are the products of our technological activity, and their characteristics are determined by the specificities of the technology we use to produce them.[19] All historical forms of human nature and of social being are constructed (and continuously re-constructed) or produced (and continuously re-produced) by historical versions of technology. Technology has an ontological Janus face: it produces both "things" and "representations". For thousands of years, people used material (agricultural or industrial) technologies where the material product was in the foreground, although the symbolic content was also present. The last few decades have witnessed a significant technological change, in that "representations" have become dominant over the "thingly" products in the most important technologies of our age. On the one hand, new (cognitive, communication, cultural, and information) technologies have emerged; on the other hand, the representational or symbolic function of traditional technologies has become more significant. As a consequence, the most important characteristics of social being are essentially transformed. The terms "post-industrial / knowledge / risk / information / network society" all refer to a type of society where representational technologies are the dominant factor in the (re)construction or the (re)production of human nature and of social being.

[18] This definition of technology is on a higher level of abstraction than usual conceptualizations (cf. Feenberg, 1999).
[19] Social (or human) being, obviously, has an active role in the formation of any technology: given technological and social relations coexist and interrelate in a complex way, so that they mutually shape each other. My view on construction is closer to that of Marxism (Lukács 1978) than to those of phenomenology (Berger and Luckmann, 1966) and of radical constructivism (Glasersfeld, 2011).

2. Virtuality and Openness in Information Technologies

The shift from material technologies to representational (information, cognitive, cultural, communication) technologies has important consequences for our notions of reality. The concept of virtuality has a central role in redefining reality.
The term "virtuality" is relatively new, but a brief overview of the history of philosophy reveals that the fundamental components of virtuality have been extensively discussed (Ropolyi, 2001). The central concepts in this respect are presence, worldliness, and plurality. All three acquire their meaning from a certain relation between actuality and potentiality. I suggest that the Aristotelian dualistic ontological system, which distinguishes between actual and potential being, be complemented with a third form of being: virtuality. In the virtual form of being, actuality and potentiality are inseparably intertwined. Virtuality is potentiality considered together with its actualization. Openness is actuality considered together with its possibilities. As compared to reality, virtuality is reality with a measure: a reality which has no absolute character, but which has a relative nature. All beings produced by representational technologies are necessarily virtual.

To illustrate how technologies produce virtual beings, let us consider information technologies. The characterization of information technology should be based on an understanding of the concept of information. Obviously, information is a product of a kind of representational technology, and thus it is virtual. In a hermeneutic approach, information is "interpreted being". On this account, information technology is a "hermeneutical industry", where the production is performed by interpretation in the minds of people. All the products of this "industry" are virtual beings. Consequently, social being in the information age is necessarily a virtual being. The information society is a society in which the typical beings are virtual ones, and so the whole society has a virtual and open character.

From a specific point of view, the Internet, too, is a kind of information technology. It is an intentionally created and maintained artificial, virtual sphere which is based on networked computers and individual human interpretation praxes. The Internet is the medium (or sphere) of a new, virtual mode of human existence, basically independent from, but built on, and coexisting with the former (natural and societal) spheres of existence, and created by late-modern humans.

Acknowledgements
This research was supported by the Hungarian Scientific Research Fund (OTKA) under project numbers K79194 and K84145.

References
Berger, P. & Luckmann, T. (1966). The Social Construction of Reality: A Treatise in the Sociology of Knowledge. New York: Doubleday.
Feenberg, A. (1999). Questioning Technology. London: Routledge.
Glasersfeld, E. von (2011). http://www.vonglasersfeld.com/ (March 2011).
Lukács, G. (1978). The Ontology of Social Being. London: The Merlin Press.
Ropolyi, L. (2001). Virtuality and plurality. In: A. Riegler, M. F. Peschl, K. Edlinger, G. Fleck & W. Feigl (Eds.), Virtual Reality: Cognitive Foundations, Technological Issues & Philosophical Implications (pp. 167-187). Frankfurt am Main: Peter Lang.

TRUST, KNOWLEDGE AND SOCIAL COMPUTING
Relating Philosophy of Computing and Epistemology

JUDITH SIMON
Institut Jean Nicod – Ecole Normale Supérieure
29, rue d'Ulm, F-75005 Paris, France

Abstract. The main goal of my talk will be to link the discourse on trust in epistemology with the philosophical discourses on trust and ICT. I will argue that linking these two lines of research is needed to apprehend the notion of epistemic trust.
Epistemic practices in science as well as in everyday life are characterized not only by their socialness, i.e. the fact that agents collaborate and rely on others in their attempts to know; they are also deeply pervaded by information technologies. In short, I claim that a) contemporary epistemic practices take place in increasingly complex, dynamic and entangled socio-technical epistemic systems consisting of multiple human and non-human agents, b) trust is a crucial concept for understanding these practices, and c) information and communication technologies (ICT) play an important role in mediating and shaping trust relationships between different agents.

1. Trusting to Know

In 1991, Hardwig asserted that "[f]or most epistemologists, it is not only that trust plays no role in knowing: trusting and knowing is deeply antithetical. We can not know by trusting in the opinions of others: we may have to trust those opinions when we do not know" (Hardwig 1991: 693). This argument rests on the assumption that in order to know, we have to be able to provide evidence: we have to justify our knowledge claims with our own cognitive resources, and cannot know by simply trusting the testimony of others. Yet a closer look at epistemic practices in science as well as in everyday life shows that our knowledge depends deeply on trust in other people. Without trusting what others have told us, we would neither know some of the most basic facts about ourselves, such as the date and place of our birth, nor could we have achieved the most advanced scientific knowledge. This is the central dilemma of testimony and epistemic trust in philosophy: while on the one hand it seems that almost everything we know depends on our trust in the testimony of others, the status of testimonial knowledge and the role of epistemic trust remain highly controversial. Yet things are even more complicated. Within contemporary epistemic practices, trust is not only placed in other humans, but also in technologies, processes, institutions and content. Indeed, information and communication technologies (ICT) play a special role for epistemic trust, because ICT is not only an entity that can be trusted itself; ICT also increasingly mediates and shapes trust relations between all other entities as well. Hence, to understand epistemic trust, the role of ICT cannot be ignored, and epistemology has to take insights from other fields of research, most notably the philosophy of computing, into account.

2. Trust and ICT

The special role of ICT for trust has been addressed under different labels, such as online trust, digital trust or e-trust. While all terms refer to practices of trust that take place in a digital environment, the different labels are related to different research foci. Three of them should be distinguished:
1. ICT as an entity of trust itself (i.e. how human agents place trust in ICT as a technology)
2. ICT as a mediator of trust relationships between human agents as well as between human agents and other entities (such as content)
3. Trust in multi-agent systems, i.e. trust relations amongst artificial agents as well as between human and artificial agents

First, ICT can be an entity that is trusted itself, i.e. trust in ICT can be considered as trust in a specific type of technology, hence as a special case of trust in technologies.
Here it has been analyzed whether one can rightfully talk about trust in technology in the first place (for instance Nissenbaum 2001), and whether and to what extent we do or should place trust in technologies (Cheshire, Antin et al. 2010). Second, ICT mediates trust relations amongst and between humans and non-human entities to a profound extent. Even in the most basic form, if communication between two humans who know each other in person takes place via email, chat, social networking sites or even telephone, ICT mediates between truster and trustee (cf. Ess 2010). Epistemic trust placed in such technologies cannot be fully understood by referring to trust in technology or trust in persons only. Take the example of the online encyclopedia Wikipedia. If one trusts content from Wikipedia, this practice of trust is neither trust in a technology proper (namely the wiki software), nor is it trust in individual writers (who are often unknown), nor can this trust be fully explained by institutional trust in the Wikimedia Foundation. I have argued elsewhere that trusting Wikipedia should rather be conceived as trust in a certain socio-technical epistemic system characterized by technological infrastructure, epistemic agents (i.e. the users of Wikipedia), and certain processes employed in creating epistemic content (Simon 2010b). While Wikipedia (de Laat 2010; Tollefsen 2009; Magnus 2009) and blogs (Goldman 2008) have attracted some interest within epistemology by now, other types of social software, such as recommender systems or social tagging systems, have not yet received serious attention. Yet, in such types of social software, which function primarily via aggregation, problems of trust are potentially even harder to tackle, and the classical means provided by epistemological analyses of trust in testimony appear even less suited for understanding epistemic trust within such applications.

Finally, there is another type of e-trust which is starting to receive attention within philosophy: trust in multi-agent systems. Two instances of trust are crucial with respect to trust in multi-agent systems. First, there are the trust relations amongst artificial agents within multi-agent systems (e.g. Taddeo 2010b). Second, there are not only trust relations amongst artificial agents, but also between human and artificial agents, which are intrinsically more complex, as Grodzinsky, Miller et al. (2010) have noted. In my talk I will specify in more detail how these insights from the philosophy of computing could be made useful for an epistemology of trust.

References
Cheshire, C., Antin, J. et al. (2010). General and Familiar Trust in Websites. Knowledge, Technology & Policy, 23(3), 311-331.
de Laat, P. (2010). How can contributors to open-source communities be trusted? On the assumption, inference, and substitution of trust. Ethics and Information Technology, 12(4), 327-341.
Ess, C. (2010). Trust and New Communication Technologies: Vicious Circles, Virtuous Circles, Possible Futures. Knowledge, Technology & Policy, 23(3), 287-305.
Goldman, A. (2008). The Social Epistemology of Blogging. In: J. v. d. Hoven & J. Weckert (Eds.), Information Technology and Moral Philosophy. New York: Cambridge University Press, 11-122.
Grodzinsky, F., Miller, K. et al. (2010). Developing artificial agents worthy of trust: "Would you buy a used car from this artificial agent?". Ethics and Information Technology, 1-11.
Magnus, P. D. (2009). On Trusting Wikipedia.
Episteme, 6(1), 74-90.
Nissenbaum, H. (2001). Securing Trust Online: Wisdom or Oxymoron. Boston University Law Review, 81(3), 635-664.
Simon, J. (2010b). The entanglement of trust and knowledge on the Web. Ethics and Information Technology, 12(4), 343-355.
Taddeo, M. (2010b). Modelling Trust in Artificial Agents, a First Step toward the Analysis of e-Trust. Minds and Machines, 20(2), 243-257.
Tollefsen, D. P. (2009). Wikipedia and the Epistemology of Testimony. Episteme, 6(1), 8-24.

OPERATIONAL IMAGES
Agent-Based Computer Simulation and the Epistemic Impact of Dynamic Visualization

SEBASTIAN VEHLKEN
Leuphana University Lüneburg
ICAM Institute for Culture and Aesthetics of Digital Media
Scharnhorststrasse 1, 21335 Lüneburg

Abstract. Computer simulations (CS) designate the current scientific condition. Inevitably, one has to distinguish crash tests from climate simulations, and one has to be aware of the differing problem dimensions posed by, e.g., the simulation of a quantum physical system by a classical physical system in comparison to those advanced by an agent-based simulation of a mass panic in a stadium. And without question, CS achieve diverse tasks and have quite dissimilar reputations in different scientific disciplines. But undeniably, CS brought with them a novel kind of knowledge, a modified set of research problems, and a transformed historical-philosophical comprehension of science. Thus, the knowledge emerging in CS derives from the computer-based imitation of dynamic system behavior, which penetrates everyday life in the form of ecological, medical, economic, or technical applications and decisions. Novel scientific problems and research fields historically form where they would not have been tractable without the digital media of CS. And not least, the traditional concepts of theory and experiment are essentially modified, transforming "mode-1" science (Gibbons, 1994) more and more into a "behavioral science of complex systems" (Mahr, 2003). This transformation is based on an explicitly media-historical rupture marked by the digital mediality of CS. The digital media inherent in CS develop typical and intrinsic modes of operation and visualization in their application to analytically and experimentally intractable problem fields.

Sebastian Vehlken's presentation embarks on examining the "social computing" aspects of a particular kind of CS in a two-fold way. First, it will describe the specific (self-)organizational aspects of agent-based modeling and simulation (ABM), zeroing in on several pivotal examples of large-scale social simulations. These range from crowd control (e.g. Massive Insight) and logistics (e.g. TransSims) to epidemics (e.g. PLAN-C by the NYU Bioinformatics Group) and large-scale models of the complex interactions of agents in whole societies (e.g. the Global Scale Agent Model by the Brookings Institution). It will discuss the notion, the epistemic function and the technological means of the bottom-up modeling paradigm of ABM, which provides essential advantages over CS based on discrete events. Whilst the latter are required to define assumptions about the constituents of a system and their interdependencies from the top down, ABM are decentralized and function without a definition of the global system behavior. The system behavior emerges from the definition of simple and locally (on the level of the individual agents) implemented settings.
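A deliberately minimal sketch can make this bottom-up logic concrete. In the following Python toy model (our illustration, assuming a simple contagion scenario; it is not one of the cited simulation systems), only the local rule of each agent is programmed, and the familiar epidemic curve is obtained as aggregate output rather than being defined anywhere:

    import random

    def step(agents, infect_p=0.05, recover_p=0.02):
        # Local rule only: each agent meets one randomly chosen other
        # agent; nothing in the code specifies the global dynamics.
        for i, state in enumerate(agents):
            met = random.choice(agents)
            if state == "S" and met == "I" and random.random() < infect_p:
                agents[i] = "I"
            elif state == "I" and random.random() < recover_p:
                agents[i] = "R"

    agents = ["I"] * 10 + ["S"] * 990   # 1000 agents, 10 initially infected
    curve = []
    for t in range(600):
        step(agents)
        curve.append(agents.count("I"))  # the epidemic wave shows up here,
                                         # at the aggregate level only

Plotting curve over time would show the characteristic rise and fall of an outbreak, although no line of the program mentions waves or epidemics: the global behavior is observed, not stipulated, which is exactly the epistemic point of the bottom-up paradigm.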
As Borshchev and Filippov (2004) put it, ABM thus better "provides for construction of models in the absence of the knowledge about the global interdependencies: you may know nothing or very little about how things affect each other at the aggregate level, or what is the global sequence of operations, etc., but if you have some perception of how the individual participants of the process behave, you can construct the AB model and then obtain the global behavior." The bottom-up performance of ABM induces a synthetic problem approach, converging to adequate and context-dependent solutions in a process of systematic comparison and evaluation of different simulation runs and scenarios. ABM thereby leapfrogs fixed object or context allocations in an exemplarily interdisciplinary manner. The media history of research on social collectives reveals a reciprocal 'socialization' and 'biologization' of computer science, and a likewise computerization of the social sciences, when it comes to the development of adequate ABM models for describing collective behaviors in space and time. The development of animation effects in CGI is distinctly interconnected with biological and sociological computer models of collective dynamics, and vice versa.

Second, it will consider the importance of digital visualizations for scientific research with ABM. The adherent types of Computer Graphical Imagery (CGI) exemplarily raise questions not only about the status of animated, 3-dimensional and dynamic digital images as interfaces for the refinement of societal "computer experiments" and the "intuitive" handling of ABM by researchers. One must also ask about their status as 'visual evidence' and 'representation' for phenomena and processes in social dynamics which would remain intractable without these digital 'time-based images'. Not least, the technological conditions resulting from the multiple filtering, smoothing, or thresholding procedures involved in providing 'visual validation' have to be accounted for. These aspects have to be further investigated on the basis of a media-technologically informed theory of operational images, linking the modes of visualization of ABM with their programmed data base in the ABM software. And since the development of certain animation effects in the CGI industry is historically distinctly interconnected with biological and sociological computer models of collective dynamics, and vice versa, the hard-, wet- and software foundations of ABM can be short-circuited with applicable modes of CGI generation: both operate in a highly distributed manner of 'socially' interacting and 'locally' defined agents.

Hence, the presentation investigates the specific epistemic and technological rupture marked by CS on the basis of ABM in social simulations. The respective applications facilitate a mode of visualization by (synthetic and therefore operational) images which make otherwise inconceivable complex social dynamics representable by generating visual presentations: only the observation of modeled processes in the runtime of ABM enables the evaluation and manipulation of critical factors and variables and the ensuing re-run of the simulation. This results in a type of dynamical "data images" (see Adelmann et al., 2009; Schubbach, 2007) yet to be further investigated. It provokes a type of operational images with a highly socio-political dimension – images
which depend on and which foster social decision-making in (time-)critical environments.

References

Adelmann, R., Frercks, J., Heßler, M. & Henning, J. (Eds.) (2009). Datenbilder. Zur digitalen Bildpraxis in den Naturwissenschaften. Bielefeld.
Borshchev, A. & Filippov, A. (2004). From System Dynamics and Discrete Event to Practical Agent Based Modeling: Reasons, Techniques, Tools. In: The 22nd International Conference of the System Dynamics Society. Oxford.
Mahr, B. (2003). Modellieren. Beobachtungen und Gedanken zur Geschichte des Modellbegriffs. In: H. Bredekamp and S. Krämer (Eds.), Bild Schrift Zahl (pp. 59-86). Munich: Fink.
Schubbach, A. (2007). ...A Display (Not a Representation)... Navigationen. Zeitschrift für Medien- und Kulturwissenschaft, Display II – digital, 7(2), 13-27.

Social Computation as a Discovery Model for the Social Sciences

AZIZ F. ZAMBAK
Department of Philosophy
Yeditepe University, Istanbul

Abstract. Social simulation is a growing field that proposes a computational approach to the social sciences. Simulation provides a powerful alternative for a novel understanding of the epistemology, ontology, and taxonomy of social phenomena, structures and processes. Social simulation can be an intellectual resource and experimental field for developing a novel notion of "social phenomenon" within which various forms of human action can be represented. Social simulation may be used to examine not just the current situation in a society, but also possible social situations. Classical models that only use natural language are inadequate for the comprehension of dynamic and complex systems in the social sciences. Pure mathematical and/or statistical models are intractable. Simulation may help to overcome the limitations of classical models in the social sciences. In this paper, we propose five general principles that should be taken into consideration in social simulation: 1- Agent-Based Models: We describe agency as an essential criterion for social simulation. 2- Game Theory: Game theory is a study that can provide some formal epistemological data for understanding the rationalization process of individuals. From the social simulation point of view, discovery is an agentive-informational-system, and we consider this system as a set of complex principles that should be rationalized by simplification, approximation, optimization, and generalization. 3- Control Systems: In order to understand the autopoietic, dynamic and complex structure of social systems, we should develop an organismic conception of society in which control mechanisms have an essential role for social models and simulation. 4- Tools: In social simulation, a stylized-computational-language should be built in which the data on social structure are coded and represented in the computer simulation. 5- Ontology: Emergence is one of the essential concepts in the ontology of the social sciences, in which certain theories try to explain macrolevel phenomena in terms of the behavior of microlevel actors.

Social simulation is a growing field that proposes a computational approach to the social sciences.20 Simulation provides a powerful alternative for a novel understanding of the
epistemology, ontology, and taxonomy of social phenomena, structures and processes. Social simulation can be an intellectual resource and experimental field for developing a novel notion of "social phenomenon" within which various forms of human action can be represented. Social simulation may be used to examine not just the current situation in a society, but also possible social situations. Classical models that only use natural language are inadequate for the comprehension of dynamic and complex systems in the social sciences. Pure mathematical and/or statistical models are intractable. Simulation may help to overcome the limitations of classical models in the social sciences. In this paper, we propose five general principles that should be taken into consideration in social simulation.

20 Gilbert and Troitzsch (2005: 5) explain the main reason behind the growing interest in social simulation as follows: "The major reason for social scientists becoming increasingly interested in computer simulation, however, is its potential to assist discovery and formalization. Social scientists can build very simple models that focus on some small aspects of the social world and discover the consequences of their theories in the 'artificial society' that they have built. In order to do this, they need to take theories that have conventionally been expressed in textual form and formalize them into a specification which can be programmed into a computer. The process of formalization, which involves being precise about what the theory means and making sure that it is complete and coherent, is a very valuable discipline in the social sciences, comparable to that of mathematics in the physical sciences."

1- Agent-Based Models: Agency must be the central notion in social simulation, since the cognition of social reality originates from agentive actions. We claim that agency is the ontological and epistemological constituent of social reality. It is characterized by agentive activity. Agency must be the essential criterion for the success of social simulation. Social simulation must consider social phenomena as a form of action of a dynamic-representational system, developed during interaction within the environment. Equating properties of a social phenomenon with properties of its elements [individuals] is a basic mistake. Social structure cannot be the subject of a special examination of groups of individuals. Behavior and agentive actions cannot be found in specific groups of individuals, but in the whole agent-environment-interaction system. The discovery of social phenomena in social simulation means a new kind of action of a highly dynamic-representational system capable of making inferences from its structure and process in order to achieve new results of action and form novel systems directed towards the future. Therefore, in social simulation, discovery is not a mystical emergent property of social phenomena, but a form of agentive action necessarily following from the development of a dynamic-representational system.

2- Game Theory: Game theory is a study that can provide some formal epistemological data for understanding the rationalization process of individuals. From the social simulation point of view, discovery is an agentive-informational-system, and we consider this system as a set of complex principles that should be rationalized by simplification, approximation, optimization, and generalization. In social simulation, this type of rationalization should depend on idealization. Idealization transforms environmental data into ideal-agentive-rational-information. However, idealization should not be seen as abstraction.21 We consider idealized information as one of the basic capabilities of social simulation, providing the preconditions for the adaptive behavior of agency in a very
complex environment.

21 As Nowak (2000: 116) states, "idealization is not abstraction. Roughly, abstraction consists in a passage from properties AB to A, idealization consists in a passage from AB to A-B."

In the adaptiveness of agency, the information of environmental structure and organization may be grasped rationally, for the rationality lies in the agentive attitude towards environmental structure and organization, not in the essence of the environment itself. Therefore, there is no hidden essence in the environmental structure and organization that should be represented in a computational and representational manner for the rational behavior of an agent. In social simulation, our aim is to understand how properties of rationalized agency are related to the behavioral action that is performed under complex environmental/social situations. This type of understanding requires idealization, as idealization can be seen as a method of constructing informational structures in which data gained from the environment/society can serve the goal of forming special types of rationalized agentive interactions. Idealization, in social simulation, leads an agent to a successful informational approximation. Idealization is a type of theorizing that includes specification, approximation and optimization about certain sets of agentive and social systems. The presentation will include analysis of two game-theoretical models for social simulation (a toy sketch in this spirit follows the list of principles below).

3- Control Systems: Social systems should be considered as self-organizing, non-linear, dynamic, and complex phenomena. From the computational or representational point of view, dynamic and complex systems are difficult to study because most cannot be represented in simplified and hierarchical models. In order to understand the autopoietic, dynamic and complex structure of social systems, we should develop an organismic conception of society in which control mechanisms have an essential role for social models and simulations. There are several conditions for choosing the appropriate strategy for the control mechanism of an agent, such as the availability of data on the performance of the agent, the comparison of stable and dynamic parameters of the environment, and the access to explicit data about plans, goals, and the current state of affairs. When building a computer simulation of an agentive system, it is very important not to restrict an agent to following only one predetermined set of rules, but to give it the opportunity to choose and shift between different sets of rules according to its situation. This can be done by a proper control mechanism which can find a balance between stability and flexibility of information in a complex environment. In this section, we will also examine Project Cybersyn as an example of a control mechanism for social simulation.

4- Tools: In the presentation, we will briefly explain what the logic of computer programs in social simulation should be. In addition, we will claim that, in social simulation, a stylized-computational-language should be built in which the data on social structure are coded and represented in the computer simulation. The general concepts of this stylized-computational-language will be briefly introduced in the presentation. Some of these concepts are empirical protocols, nodes, links, data processing, boundaries, taxonomy, observation period, randomization of parameters, outcome validity, process validity, and internal validity.

5- Ontology: Emergence is one of the essential concepts in the ontology of the social sciences, in which certain theories try to explain macrolevel phenomena in terms of the behavior of microlevel actors. In this part, we will show how a reflexive model in social simulation can build an emergent model of the relation between the individual and the society.
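As a hedged, purely illustrative sketch of the game-theoretic principle above (the payoff matrix and strategies are the textbook iterated prisoner's dilemma, written in Python for this sketch, and are not the two models announced for the presentation), one can watch how simple rationalized strategies fare against each other:

# Toy iterated prisoner's dilemma: two simple strategies and the payoffs
# their interaction produces over repeated rounds. "C" cooperates, "D" defects.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not history else history[-1][1]

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=20):
    history, score_a, score_b = [], 0, 0
    for _ in range(rounds):
        a = strategy_a(history)
        b = strategy_b([(y, x) for x, y in history])  # B sees a mirrored history
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        history.append((a, b))
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation: (60, 60)
print(play(tit_for_tat, always_defect))  # one exploitation, then mutual defection: (19, 24)

Even this minimal setting shows rationalization by simplification at work: the whole space of social interaction is idealized into two moves and four payoffs, and aggregate outcomes can then be compared across strategies and scenarios.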
References

Gilbert, N. and Troitzsch, K. G. (2005). Simulation for the Social Scientist. Buckingham: Open University Press.
Nowak, L. (2000). The Idealization Approach to Science: A New Survey. Poznań Studies in Philosophy of Science and the Humanities, 69, 109-184.

Track VIII: IT, Culture and Globalization

The Revival of National and Cultural Identity through Social Media

RYOKO ASAI
Uppsala University, Dept. of IT-HCI
Box 337, 751 05 Uppsala, Sweden
and Nihon University, College of Industrial Technology
Narashinoshi-Izumicho 1-2-1, Chiba, Japan

IORDANIS KAVATHATZOPOULOS
Uppsala University, Dept. of IT-HCI
Box 337, 751 05 Uppsala, Sweden

AND

MIKAEL LAAKSOHARJU
Uppsala University, Dept. of IT-HCI
Box 337, 751 05 Uppsala, Sweden

Abstract. Social media have played an important role as a hub for information in political change. They can contribute to the development of psychological and social preconditions for dialog and democracy.

Information communication technology (ICT) has made it possible for people to communicate beyond national borders. In particular, social media play an important role in making a place where people communicate with each other, for example Facebook, MySpace, YouTube and so on. In other words, under these circumstances, social media function as the third place (Oldenburg, 1999). People have two essential and indispensable places in their lives: one is the home and the other is the workplace. Beyond those places, people have one more place where they can have relationships with others informally in public (what Oldenburg called "informal public life"). The third place contributes not only to uniting people in communities but also to learning how to contribute in various problems and crises there. Therefore the third place would nurture relationships with others and mutual trust under conditions of unrestricted access, and it would also be open for discussion and serve as a ground for democracy (Oldenburg, 1999). In this context, social media can provide the third place to users in some cases.

Social contexts of communication are defined by geographic, organizational and situational variables, and those variables influence the contents of communication among people (Sproull & Kiesler, 1986). In order to discern social context cues, communicators observe static cues (physical setting, location, etc.) and dynamic cues (non-verbal behavior like gestures or facial expressions) when communicating with others. Communicators' behavior is determined based on social context cues, and they can adjust their behavior depending on situations through the process of interaction between them. However, in online communication it is more difficult for communicators to perceive static and/or dynamic elements compared to face-to-face communication.
This is because, in many cases, social media limit the number of characters and the amount of data that users can post, while making it possible for users to communicate regardless of physical distance, national boundaries and time differences. On the other hand, participation is seen as the key element in the recent trend toward democratization, and in reality numerous users send and receive a huge amount of information via social media to cultivate relationships with others and strengthen mutual exchange beyond borders. In general, it is recognized that social media advance participation through exchanging information with minimal social context cues.

Tunisian people shared information on what happened in the country and on when and where anti-government protests were held via social media such as Facebook and Twitter. In other words, social media seemed to support political change in Tunisia. Behind this, the number of internet users in Tunisia is 3.6 million, 34% of the total population, and there are 1.6 million Facebook users, roughly equivalent to 16% of the population (Internet World Stats, 2010). The Tunisian government had blocked particular websites; Facebook was one of the few social media free to access. Under these circumstances, for people living abroad, Facebook functioned as a primary source of information giving direct access to daily events in Tunisia. Under these restrictive access conditions, social media like Facebook provided users with opportunities to communicate with others and also to state their opinions, in order to overcome constraint and the old regime. In this context, social media serve as the third place, and users develop solidarity and reinforce identity through online communication. As is obvious from the statistical data on internet users mentioned above, it is estimated that the number of in-country users of Facebook is smaller than the number of users living abroad. Many users followed what was going on in Tunisia, showing in-country users that they all cared about political change. This phenomenon can be recognized as a kind of participation in a collective movement through social media, regardless of physical distance or time difference.

However, communication through social media has some problems. First, information exchanged via social media carries minimized social context cues under severely restricted conditions, because information must be sent dependably and rationally. Therefore information tends to be extreme, and there is a risk of group polarization. Second, in social media, information receivers gather fragmented information and, based on personal experience, make it plausible so that it is easier to understand as their own experience, or so that they can relive the experiences of its senders. Through this process, users develop a sense of solidarity and share expectations as well as the norms organizing them as one community. Therefore social norms gain influence on users in particular communities and advance self-stereotyping among them, as solidarity and social identity are enhanced. This situation is fraught with the social risk of exclusion of others.

Some people call the Tunisian political change a "Facebook revolution" or "Twitter revolution" on the internet. Are these labels really pertinent? Indeed, social media have played an important role as a "hub for information" and as the third place in political change.
However, social media also have to contribute to the development of skills for dialog in order to achieve a truly democratic society (Asai & Kavathatzopoulos, 2010; Kavathatzopoulos, 2010, 2007).

References

Asai, R. and Kavathatzopoulos, I. (2010). Diversity in the construction of organization value. Proceedings of EBEN Annual Conference 2010 "Which values for which organizations". Trento, Italy: University of Trento.
Internet World Stats (2010). Tunisia: Internet usage and marketing report. Available online: http://www.internetworldstats.com/af/tn.htm (accessed February 7, 2011).
Kavathatzopoulos, I. (2007). Information Technology as a tool for democratic skills. In A. Lionarakis (Ed.), Forms of democracy in education: Open access and distance education (pp. 155-162). Athens: Propobos.
Kavathatzopoulos, I. (2010). Information technology, democratic societies and competitive markets. Proceedings of the 3rd International Seminar on Information Law "An information law for the 21st century". Corfu, Greece: Ionian University.
Oldenburg, R. (1999). The great good place. Cambridge: Da Capo Press.
Sproull, L. S. and Kiesler, S. (1986). Reducing social context cues: Electronic mail in organizational communication. Management Science, 32(11), 1492-1512.

WIKILEAKS AND ETHICS OF WHISTLE BLOWING

PATRICK BACKHAUS
School of Innovation, Design and Engineering, Mälardalen University, Sweden
pbs10002@student.mdh.se
and Paderborn University, Germany
bpatrick@campus.uni-paderborn.de

AND

GORDANA DODIG CRNKOVIC
School of Innovation, Design and Engineering, Mälardalen University, Sweden
gordana.dodig-crnkovic@mdh.se

1. Extended Abstract

In a time in which the Internet pervades everyday life and published information is readable all over the world, it becomes very important to deal with ethical problems related to whistle blowing via the Internet. Although there are basic concepts like anonymity, privacy and freedom of speech, for every new kind of phenomenon we have to discuss its ethical aspects (Kizza, 2010; Nadler and Schulman, 2006). A current example is the platform WikiLeaks, which publishes a vast amount of secret documents. To evaluate the ethics of WikiLeaks (Hanson and Ceppos, 2006; WikiLeaks About), we will apply the following ethical approaches:

The Utilitarian Approach, focusing on the consequences that the publications of WikiLeaks have on the well-being of all parties that are affected directly or indirectly. There are two sides to consider:
• On the one hand, the uncovering of misconduct and the increased transparency of the government are of such importance that the publications benefit society as a whole. They facilitate opinion making and lead to a greater understanding of governmental work.
• On the other hand, the publications may threaten national security and so harm society. They lead to a society with decreased integrity, which may eventually result in less communication, more technical restrictions and so in less freedom.
To achieve a balance between both sides, a potential approach could be that WikiLeaks reduces the amount of published data and classifies the data in more detail. Further, they could contact the company or government concerned before publication, so that this party itself could acknowledge the misconduct.

The Virtue Ethics Approach, focusing on attitudes that develop our human potentials, such as
honesty, courage, faithfulness, trustworthiness and integrity. It is easy to see that WikiLeaks disregards these virtues in many different contexts. They are accused of putting people's lives at risk, publishing stolen data, and degrading loyalty, privacy and integrity of data. The only virtue they undoubtedly represent is transparency, which is not considered a classical ethical virtue, but may be seen as an element of democracy. So WikiLeaks must ensure that the increased transparency gained by a publication is worth more than all the other aspects, which will only be the case when severe misconduct by the concerned party is made public because no other way of corrective action was available.

The Information Ethics Approach: From the point of view of Information Ethics, we can study how information is revealed/communicated in networks of agents. Within this approach we can ask questions such as: what is the function of "information hiding" and "encapsulation", such as found in Object-Oriented Programming and in any hierarchical organization? What would be the behavior of a society in which every agent were connected with every other agent and shared any information they have?

Interesting to observe is the global character of WikiLeaks, in a world regulated on the basis of nations: it seems to act in a grey zone, since the legal situation is unclear and different governments are still searching for a crime Julian Assange can be charged with. In reality the issue of WikiLeaks (Kintzinger and Zepelin, 2010; Greenberg, 2010) implies much more than an ethical discussion about whistle blowing and leaking, integrity and freedom of speech. WikiLeaks has become a symbol of a deep change in the publicity of information in the digital age, at least with present-day technology. It has generated the greatest confrontation between the established order and the advocacy of the culture of the totally open Internet. We are at the moment part of a world where it is difficult to control and keep information secret and safe from eavesdropping and unauthorized use. Some of the relevant questions are: Has the institution of the legal secret, business secret, military or organizational secret become obsolete? If yes, why? If no, how to protect information which should be protected? Who decides, and how, which information is worth making public and which is not? According to Assange (Bieber, 2010; Fallows, 2010), personal integrity must be protected. Why not institutional integrity? If leaking is a good democratic mechanism, shall we not have leaks of WikiLeaks as well? And so on… a chain, or a loop, of leaks? In a totally transparent world, how would information overload be managed? Shall we give up all trust? Or, equally important: whom shall we trust? Perhaps problems with information protection will lead us to a society where conversations are reduced to a minimum and information is less accessible, as it has become obvious that anything can be made public. In the end, the result would be not an increase, but a decrease of freedom.

References

Bieber, C. (2010). "Die Ethik des Lecks", der Freitag, 11.08.2010. http://www.freitag.de/kultur/1032-die-ethik-des-lecks
Fallows, J. (2010). "More on Mullen, Twitter, and the Ethics of WikiLeaks", July 2010. http://www.theatlantic.com/politics/archive/2010/07/more-on-mullen-twitter-and-the-ethics-of-wikileaks/60705
Greenberg, A. (2010). An Interview With WikiLeaks' Julian Assange, Nov. 29, 2010. Forbes.
http://blogs.forbes.com/andygreenberg/2010/11/29/an-interview-with-wikileaks-julian-assange
Hanson, K. and Ceppos, J. (2006). "The Ethics of Leaks". http://www.scu.edu/ethics/publications/ethicalperspectives/leaks.html
Kintzinger, A. and Zepelin, J. (2010). "Stärkt Wikileaks die Freiheit?", Financial Times Deutschland, 02.12.2010. http://www.ftd.de/it-medien/medien-internet/:pro-und-kontra-staerkt-wikileaks-die-freiheit/50200724.html
Kizza, J. M. (2010). "Cyberspace, Cyberethics, and Social Networking", in Ethical and Social Issues in the Information Age. London: Springer London, ch. 11, pp. 221-246. http://dx.doi.org/10.1007/978-1-84996-038-0_11
Nadler, J. and Schulman, M. (2006). "Whistle Blowing in the Public Sector", November 2006. http://www.scu.edu/ethics/practicing/focusareas/Government_ethics/introduction/whistleblowing.html
WikiLeaks About. [Online]. http://wikileaks.de/about.html
All links accessed on 25 April 2011.

INTERPRETING CODES OF ETHICS IN GLOBAL SOFTWARE ENGINEERING
Extended Abstract

THIJMEN DE GOOIJER
Mälardalen University
Högskoleplan 1, Västerås, Sweden

Abstract. In global software engineering (GSE), groups of people from all over the world collaborate on the development of one system. For example, it is common for Western companies to send development work to Asia or Eastern Europe. Within these collaborations the differences between cultures, and the problems these differences create, are plentiful. Because we expect that computing professional organizations' codes of ethics are insufficiently adapted to GSE, we investigate the culture-relative interpretations of codes of ethics and the guidance they provide for global teams and collaboration. We analyze the codes of ethics of the ACM (US), CSI (India), IPSJ (Japan), HKCS (Hong Kong) and EI (Ireland). We examine whether the codes explicitly address ethical dilemmas caused by global interactions, and investigate the ethical guidance provided by the codes. For the latter we apply them to three case questions that one could raise in a GSE setting. Our work differs from that of others in that it examines the practical applicability of codes of ethics instead of their contents, and in that our goal is not to study different culture-relative interpretations of just one problem. During our analysis we did not find imperatives that directly hinder global interaction, but unfortunately we were also unable to find any that sufficiently address this topic. Only one of the studied codes asks to consider cultural differences. While answering the case questions using the imperatives from the aforementioned codes, the cultural perspectives needed to interpret the words become clear, and we learn that little attention is given to the problems associated with global collaboration. We conclude that all studied codes would benefit from more explicit guidelines for those professionals who work in GSE.

1. Introduction

Despite the globalization of the software engineering profession, most computing professional organizations are active in a limited number of countries and have their own code of ethics (CoE) or code of conduct (CoC). These codes are thus national in scope (Wheeler, 2003). According to a 1996 study, as many as 78% of IS professionals use these codes in their ethical decisions (Joyce et al., 2003). At the same time, ethical reactions and attitudes are influenced by culture and national origin (Christie et al., 2003; Nyaw & Ng, 1994).
As a result, ethical decision making is a complex endeavor in current global IS practice (Wheeler, 2003). We expect that the codes have not kept up with the globalization of the profession. To explore the possible difficulties computing professionals may encounter during their ethical decision making in global software engineering (GSE), we analyze the codes of ethics of five professional organizations and apply their codes to three case studies. We characterize our study by the following research questions.
• Do the studied codes specify culture-relative imperatives that could hinder or support global software engineering?
• Do the studied codes provide adequate ethical guidance for IT professionals in global interactions?

2. Related Work

To our knowledge no studies exist that take a similar, practical approach to identifying problems for global software engineers in computing professional CoE. Earlier work does compare codes (Oz, 1993), even in international settings (Joyce et al., 2003; Wheeler, 2003), and is discussed below. Work that combines codes of ethics with cultural influences can be found, for example, in (Arnold et al., 2007), which studies the views of western European accountants on actions prescribed by CoC based on their country of origin. It is found that these views differ significantly. Case studies exist which review the ethical stance of different cultures on specific issues, for example software piracy (Swinyard, Rinne, & Kau, 1990), but these studies either do not include CoE of computing professional organizations or do not have the goal of studying their usefulness in decision making. Specific in another way are the case studies in (Anderson et al., 1993), which focus only on the ACM code.

2.1. COMPARING CODES

Oz reviews four codes of US computing professional organizations, finding flaws, moral dilemmas, and points for improvement (Oz, 1993). We differ from (Oz, 1993) in that we do not limit our study to US codes. In their study comparing 27 international CoE, Joyce et al. found only eight themes that were common to more than 50% of the CoE (Joyce et al., 2003). Compared to the work by Joyce et al., our work aims to identify problems encountered during ethical decision making in a GSE context, while their work focuses on the content of the codes. Wheeler (2003) compares the codes of the ACM, the British Computer Society (BCS) and the Australian Computer Society (ACS) to find differences and similarities. Our work differs from (Wheeler, 2003) in that we put more emphasis on how codes are used in a global setting, and in the selected codes.

2.2. A GLOBAL CODE

Some voices suggest uniting everyone under one global code of ethics (Payne & Landry, 2006; Wheeler, 2003). Davison, on the contrary, does not believe it is possible to establish a global code, due to differences between nations and cultures (Davison, 2000). His concerns are supported by the difficulties IFIP experienced in the 90s when it attempted to establish a consensus document to serve as a base for the development of codes by member bodies (Joyce et al., 2003). We consider the views of Brey (2007) and Wong (2009) more balanced. They both acknowledge that a universal ethic would be ideal, but respect that in practice this can only be implemented as an extension of the local moral systems (Brey, 2007) and that we should avoid forcing 'our' ethics onto another culture (Wong, 2009).
3. Selection of CoE

In our study we compare five CoE: those of the Association for Computing Machinery (ACM, 1992), the Computer Society of India (CSI, 2010), the Hong Kong Computer Society (HKCS, 2010), the Information Processing Society of Japan (IPSJ, 1996), and Engineers Ireland (EI, 2009). Only five codes were selected in order to limit the study to a manageable size. The codes were chosen based on the role of their organization's home country in GSE, as well as on variation in culture. The full paper provides more rationale for the selection.

4. Static Code Analysis

In this section we answer our first research question. To do so, we informally compare the content of the five codes. Our assumption is that if an imperative is culture-relative, it will not appear in all codes. Note that this does not capture culture-relative interpretations of imperatives; it is to capture interpretation problems that we include the case studies in Section 5. Comparing the CoE, we find that only one of them asks to consider cultural differences, but we find no imperatives that directly (by formulation) impede inter-cultural collaboration. A number of imperatives are culturally bound, and we expect that all will be interpreted differently even when imperatives match.

5. Employing The Codes

In this section we apply the five selected CoE to three case studies. In this way we hope to discover whether the studied codes provide adequate ethical guidance for IT professionals in global interactions. Below we formulate our case studies as three questions that one might ask him-/herself in a GSE project.
• Developing a medical system for deployment in several countries across the globe, should I be aware of all legal requirements?
• How do I design my system so that it respects the expected level of privacy?
• May I say 'yes' to an assignment I receive from a German customer when I am uncertain that I can complete it?

6. Concluding Remarks

While studying the CoE we found only a couple of imperatives that could hinder GSE collaboration. However, none of the codes seems to be written with global collaboration in mind, and only the IPSJ CoE explicitly mentions the problem of cultural differences. Further, the case studies show that decisions on ethical dilemmas will often depend on the interpretation by professionals or on the implicit stance of the code. We feel that the CoE should provide more guidance to deal with the complexity of ethical decisions in a GSE setting. Our primary recommendation for computing professional organizations is to revise their CoE to reflect the advance of GSE. Future work could examine how this may best be achieved within each culture.

Acknowledgements

A warm thanks to my professor Gordana Dodig-Crnkovic for encouraging me to submit this work to IACAP and for her useful comments.

References

ACM. (1992). ACM Code of Ethics and Professional Conduct. Retrieved December 2010, from http://www.acm.org/about/code-of-ethics.
Anderson, R. E., Johnson, D. G., Gotterbarn, D., & Perrolle, J. (1993). Using the new ACM code of ethics in decision making. Commun. ACM, 36(2), 98-107. New York, NY, USA: ACM. doi: http://doi.acm.org/10.1145/151220.151231.
Arnold, D., Bernardi, R., Neidermeyer, P., & Schmee, J. (2007). The Effect of Country and Culture on Perceptions of Appropriate Ethical Actions Prescribed by Codes of Conduct: A Western European Perspective among Accountants. Journal of Business Ethics, 70(4), 327-340. Springer Netherlands. Retrieved from http://dx.doi.org/10.1007/s10551-006-9113-6.
Brey, P. (2007).
Is Information Ethics Culture-Relative? International Journal of Technology and Human Interaction, 3(3), 12-24.
Christie, P. M. J., Kwon, I.-W. G., Stoeberl, P. A., & Baumhart, R. (2003). A Cross-Cultural Comparison of Ethical Attitudes of Business Managers: India, Korea and the United States. Journal of Business Ethics, 46(3), 263-287. Springer Netherlands. Retrieved from http://dx.doi.org/10.1023/A:1025501426590.
CSI. (2010). Computer Society of India - Code of Ethics. Retrieved December 2010, from http://www.csi-india.org/web/csi/code-of-ethics.
Davison, R. M. (2000). Professional ethics in information systems: a personal perspective. Commun. AIS, 3(2es). Atlanta, GA, USA: Association for Information Systems. Retrieved from http://portal.acm.org/citation.cfm?id=374504.374510.
EI. (2009). Engineers Ireland - Code of Ethics. Retrieved December 2010, from http://www.engineersireland.ie/about-us/governance/code-of-ethics-and-bye-laws/.
HKCS. (2010). Hong Kong Computer Society - Code of Ethics and Professional Conduct. Retrieved December 2010, from http://www.hkcs.org.hk/en_hk/intro/coe.asp.
IPSJ. (1996). Code of Ethics of the Information Processing Society of Japan. Retrieved December 2010, from http://www.ipsj.or.jp/english/somu/ipsjcode/ipsjcode_e.html.
Joyce, D., Blackshaw, B., King, C., & Muller, L. (2003). Codes of Conduct for Computing Professionals: an International Comparison. In S. Mann & A. Williamson (Eds.), Proceedings of the 16th Annual NACCQ, Palmerston North, New Zealand (pp. 71-78).
Nyaw, M.-K., & Ng, I. (1994). A comparative analysis of ethical beliefs: A four country study. Journal of Business Ethics, 13(7), 543-555. Springer Netherlands. Retrieved from http://dx.doi.org/10.1007/BF00881299.
Oz, E. (1993). Ethical standards for computer professionals: A comparative analysis of four major codes. Journal of Business Ethics, 12(9), 709-726. Springer Netherlands. Retrieved from http://dx.doi.org/10.1007/BF00881385.
Payne, D., & Landry, B. J. L. (2006). A uniform code of ethics: business and IT professional ethics. Commun. ACM, 49(11), 81-84. New York, NY, USA: ACM. doi: http://doi.acm.org/10.1145/1167838.1167841.
Swinyard, W. R., Rinne, H., & Kau, A. K. (1990). The morality of software piracy: A cross-cultural analysis. Journal of Business Ethics, 9(8), 655-664. Springer Netherlands. Retrieved from http://dx.doi.org/10.1007/BF00383392.
Wheeler, S. (2003). Comparing Three IS Codes of Ethics - ACM, ACS and BCS. PACIS 2003 Proceedings, Paper 107.
Wong, P.-H. (2009). What should we share?: understanding the aim of Intercultural Information Ethics. SIGCAS Comput. Soc., 39(3), 50-58. New York, NY, USA: ACM. doi: http://doi.acm.org/10.1145/1713066.1713070.

INFORMATION TECHNOLOGY, INTELLECTUAL PROPERTY RIGHTS AND GLOBALIZATION

SORAJ HONGLADAROM
Department of Philosophy
Faculty of Arts, Chulalongkorn University

The main concern of this paper centers on the issues arising from the use of intellectual property rights (IPRs) as a tool of globalization: how creations of information technology are usually protected through the IPR regime, and how the technology is used as a means by which globalization is effected. Works on the justification of intellectual property rights typically fall under two extremes: either they reject IPRs outright, or they accept IPRs as necessary for global commerce and useful innovation.
The former argue that IPRs are hegemonic tools by which the developed countries of the West keep the emerging developing ones at bay, or exploit the natural resources of the developing countries through what is known as biopiracy or bioprospecting. Those who embrace IPRs, on the other hand, usually base their arguments on the claim that IPRs are necessary as a means of protecting those who have invested in creating useful innovations. Problems arise when the products protected by IPRs are carried across national borders and thus become global. In order to ensure the protection afforded by IPRs across countries, a worldwide system of IPR protection has been created which in many cases overrides the sovereignty of states. IPRs are thus clearly tools of globalization; one sees globalization concretely at work through the creation and enforcement of trade-related intellectual property rights across countries in the world today.

The polarized debates around IPRs have created countless cases of conflict between those who fight for globalization and those who are against it. Chief in these debates is the ethical issue, especially when products protected by IPRs have a strong impact on the livelihood and even the survival of those who depend on them. New pharmaceutical products, for example, are almost always patented, which enables the manufacturer to charge very high prices to cover their investments and also to earn profits for their shareholders. However, when people in the poorer developing world are in need of these drugs, it is clear that there are moral issues involved. Are the pharmaceutical companies morally obligated to provide the fruits of their intellectual investments at lower cost, so that they are affordable by the poor? It would strongly seem so. However, there are also cases where IPRs are justified by arguments that they are necessary as an incentive for innovation. Without effective IP protection, the life-saving drugs in question might not have arisen in the first place. Furthermore, there are also cases where IPRs are used as tools for protecting the creations of those within the developing world themselves. Without a workable IPR regime, it is not
This shows that in themselves IPRs are neither positive or negative, no more than a piece of cloth sold in the market is either positive or negative. IPRs then can be used either positively or negativey. For example, when they are used to monopolize life saving drugs so that poorer people cannot afford them, then they are negative, but they can also perhaps become more positive when they are used to advance the interests of poorer people by ensuring, for example, that the plant species belonging to their natural habitats are protected, or their own intellectual creation is recognized and given due protection. As mentioned previously, information technology plays a significant role in all this. First of all, products of information technology itself are usually protected by IPRs. Software is usually protected by copyrights. It is well known that the open source movement in software strikes a middle ground between copyright protection and commercialization on the one hand, and releasing everything onto the public domain on the other. This can be a way out of the impasse, but it needs more thorough theoretical justification, which is also an aim of this paper. Another, no less important, point is that, as the technology spreads the information around, and as information does not have values on its own as previously discussed, information technology itself stands to be used either positively or negatively too. This seems to be a come back to the old position of technological neutralism (the idea that technology is not good or bad in itself). But it is not. When one allows for all the constraints and implications associated with a technology (i.e., when a technology constrains us to behave one way or another due to the nature of that particular technology itself), there is still room for using that technology within these constraints either positively or negatively. Hence, a way is open before us and it is up to us to decide which way to go. We only need to be able to foresee, to the extent that we can, what kind of consequences there will be as a result of our choosing. Acknowledgements Research for this paper has been partially supported by a grant from the National Research University Project, grant number HS1025A and AS569A. - 295 - Proceedings IACAP 2011 Track IX: Surveillance, sousveillance - 296 - The Computational Turn: Past, Presents, Futures? TOWARDS A HERMENEUTIC PHENOMENOLOGY OF CYBERSPACE: POWER VS. CONTROL ANDREAS BEINSTEINER Ph.D. Student Institute of Philosophy Leopold-Franzens-Universität Innsbruck Abstract. Since the 1990ies, regulation by program code has become an issue in theoretical reflection on computers. Michel Foucault’s concepts, and, in particular, Gilles Deleuze’s claim that control societies substitute disciplinary societies in the age of computers, have been popular points of reference. The present paper suggests interpreting control as a form of regulation that is essentially connected to computers: From Foucault’s considerations a distinction is derived between power and control. Control is conceived as a more radical mode of regulation: a determination of possibilities of action that – as is shown by relating Foucault to Martin Heidegger – is first made possible by computer technology. 1. The power of code In an article called “Soft Cities”, William J. Mitchell (2005) explores similarities and differences between traditional “real-world” space and the new, computer-generated spaces. 
He observes that the coded conditionals in cyberspace provide a fundamentally new mode of regulation: you cannot argue with computer programs, you cannot plead with or bribe them. Lawrence Lessig (2006) refines his claim that "code is law" by stating that this new form of regulation rather works through "a kind of physics. A locked door is not a command 'do not enter' backed up with the threat of punishment by the state. A locked door is a physical constraint on the liberty of someone to enter some space." (p. 82) Code is a regulator in cyberspace because it defines the terms upon which a certain cyberspace environment is offered: it decides what can be said and done in that environment. Lessig refers to Michel Foucault (1995), who had addressed the kind of regulations that become relevant in a new way in cyberspace: "Discipline and Punish" introduced the perspective that tiny corrections of space regulate by enforcing a discipline. In fact, Foucault's reflections on disciplinary power are embedded in his larger project of exploring the historical transformations that substitute sovereign power with what he calls biopower: a new kind of power that does not employ law but technology, and that does not prohibit behavior but produces it (Foucault 1998). According to Gilles Deleuze (1995), disciplinary societies have been replaced by control societies in the age of computer technology. Alexander Galloway (2004, 2010) has characterized protocol and program code as the essential means of regulation in control societies.
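A minimal sketch in Python (invented here for illustration; it is not an example from Lessig or Mitchell) may make this "physics" of code concrete: the length limit below is not a command backed by the threat of punishment, since a longer message simply cannot come into being in the environment the program offers.

# A toy cyberspace environment in which regulation works like physics:
# there is no rule "do not post long messages" that a user could choose to
# break and be punished for; the environment simply offers no way to do so.
class ChatRoom:
    MAX_LEN = 140  # an illustrative limit, part of the "physics" of this space

    def __init__(self):
        self.messages = []

    def post(self, text):
        # Truncation is not a sanction after the fact; it defines in advance
        # what can be said and done in this environment.
        self.messages.append(text[:self.MAX_LEN])

room = ChatRoom()
room.post("x" * 500)          # the over-long message never comes into being
print(len(room.messages[0]))  # prints 140

Whereas a legal norm leaves the forbidden action physically possible, the coded conditional removes it from the space of possible actions altogether; this asymmetry is developed in the next section in terms of power and freedom.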
2. Power and freedom

According to Foucault, to exercise power means to structure the possible field of action of others. By doing so, these individuals are transformed into subjects, where the word subject has two meanings: to be subject to someone else's domination, and to be tied to one's own identity. Foucault (2002) emphasizes that power can only be exercised over free subjects. A subject is free insofar as it is not absolutely self-identical or determined. In the extreme case where power constrains action absolutely or physically, both power and freedom disappear: "slavery is not a power relationship, when man is in chains." (p. 221) I suggest conceiving control as such a form of regulation, one that goes beyond power and erases freedom. While the absence of physical determination seems to be a necessary condition for freedom, it is not a sufficient one. Since it does not seem adequate to suppose a kind of metaphysical autonomy in Foucault's conception of the individual, we turn to the relations that Hubert Dreyfus (2003) has established between the concepts of Foucault and Martin Heidegger for a deeper understanding of how to conceive the sources of freedom. According to Dreyfus, Heidegger's question – how things have turned into objects in modernity – is complemented by Foucault's question – how individuals have been turned into subjects. This allows connecting Heidegger's concept of Being with Foucault's concept of power. Since one's goals and horizons of meaning arise from one's background understanding, which Heidegger calls the clearing of Being, exercising power over a certain individual (influencing his/her possibilities of action) is possible by shaping this clearing. A subject is constituted by the corresponding understanding of Being, and the more static this understanding is, the closer to absolute self-identity is the subject. Thus freedom can be grasped as hermeneutic oscillation – as a condition where various understandings suspend and balance each other.

3. Materiality as a source of freedom

According to Heidegger, the understanding of Being has always been influenced by technological artefacts, and vice versa. A tool suggests what it is to be used for: Heidegger's (1995) prominent example is the hammer, which is embedded in a structure of "in-order-to" relations and refers to goals, practices and other tools. In contrast to tools, whose materiality disappears into their usability, works of art emphasize their materiality. By doing so, they expose a fundamental gap between the material sphere and the conceptual sphere. Heidegger (2008) conceives this as a struggle between earth and world. The artwork's materiality cannot be exhaustively interpreted with one conceptual frame; thus it steadily keeps evoking new interpretations. This is how materiality provides a source of freedom. Tools too, due to their materiality, may be abused or used in different ways that were not intended originally. Addressing what he calls the "designer fallacy", Don Ihde (2009) has examined such non-intended usages of technologies. Ihde's argument against the possibility of designing a tool's usage in advance relies on the tool's materiality.

4. Cyberspace as the congruence of material and conceptual

For a long time, theology and science employed god's order of creation or the capacity of human reason to bridge the gap between the conceptual and the material sphere (Heidegger 2008). The task of metaphysics was to provide narratives that justified the adequacy of a certain vocabulary for describing reality. Nietzsche's "death of god" is nothing but the acknowledgement that there is not one single conceptual system that adequately describes reality. The "post-modern" call for conceptual pluralism is a consequence of this insight. In cyberspace environments, however, the productive tension between the material and the conceptual is erased: the programmer is the god who creates this reality, and the respective program code really is an adequate description of this reality. The conceptual and the material sphere coincide in cyberspace. A gun in a 3D shooter game is nothing but a gun, and a buy-with-one-click button in an online shop is nothing but a buy-with-one-click button. The "designer fallacy" argument does not hold in cyberspace. And thus, as agents in a cyberspace environment, we are 100% self-identical subjects. According to my suggestion, this is what control is about.

References

Deleuze, Gilles (1995): Negotiations, 1972-1990. New York: Columbia University Press.
Dreyfus, Hubert (2003): 'Being and Power' Revisited. In Milchman, Alan & Rosenberg, Alan: Foucault and Heidegger: critical encounters (pp. 31-54). Minneapolis: University of Minnesota Press.
Foucault, Michel (1995): Discipline and punish: the birth of the prison. New York: Vintage.
Foucault, Michel (1998): The Will to Knowledge. London: Penguin Books.
Foucault, Michel (2002): The Subject and Power. In Dreyfus, Hubert and Rabinow, Paul: Michel Foucault: Beyond Structuralism and Hermeneutics (pp. 208-226). New York: Harvester Wheatsheaf.
Galloway, Alexander R. (2004): Protocol. How Control Exists after Decentralization. Cambridge, Massachusetts: MIT Press.
Galloway, Alexander R. (2010): Networks. In Mitchell, W.J.T. and Hansen, Mark (Eds.): Critical Terms for Media Studies (pp. 281-296).
Chicago: University of Chicago Press.
Heidegger, Martin (1995): Being and Time. Oxford: Blackwell.
Heidegger, Martin (2008): Basic Writings. New York: Harper Collins.
Ihde, Don (2009): The Designer Fallacy and Technological Imagination. In Vermaas, Pieter E. et al. (Eds.): Philosophy and Design. From Engineering to Architecture (pp. 51-59). Springer.
Lessig, Lawrence (2006): Code version 2.0. New York: Basic Books.
Mitchell, William J. (2005): City of Bits. Cambridge, Massachusetts: MIT Press.

THE WIKILEAKS LOGIC

JEAN-GABRIEL GANASCIA
LIP6 – University Pierre et Marie Curie
4, place Jussieu, 75005, Paris, France
Jean-Gabriel.Ganascia@lip6.fr

Abstract. WikiLeaks focused the attention of the media for a few weeks at the end of 2010. The diplomacy of the United States of America has been called into question. Modern democracies are hampered; as sovereign states, they are now facing a novel dilemma. This paper constitutes an attempt to understand this evolution by seriously considering the WikiLeaks project not as a simple media strategy, but as the possible kickoff of a totally new way of doing politics, in perfect transparency, without secrecy or hidden issues. Our purpose here is both to show how information technologies, of which WikiLeaks is a by-product, contribute to transforming traditional political forms, and how the notion of "sousveillance" helps us to apprehend these evolutions.

1. A Few Recent Facts

WikiLeaks focused the attention of the media for a few weeks at the end of 2010 and, previously, during the summer and the autumn. The diplomacy of the United States of America and of some other countries was called into question by what people called the Cablegate, by analogy with the Watergate. Let us remember that 250,000 secret telegrams containing embarrassing information about American, European and Middle East foreign policies were divulged to newspapers by the WikiLeaks organization. Modern democracies, and especially the United States of America, were hampered. The main argument they developed against WikiLeaks was formal: it concerned the danger posed to those whose names had been explicitly mentioned in the cables. However, it clearly appeared that, for those sovereign states, the question is not just about saving the lives of a few people: they are now facing a novel dilemma. On the one hand, in the last few years many democracies have opened public data to all citizens (Obama 2009). On the other hand, states are used to dealing with many matters, especially in the diplomatic area, either in secrecy or, at least, in a discreet way. As a consequence, they cannot easily accept the divulgation of top-secret information. In brief, the aspiration to total transparency that many of our contemporaries share modifies the rules of government, while WikiLeaks shows the limits of officially proclaimed public transparency.

2. A New Ideal of Transparency

With the recent developments of information technologies, a new ideal of total transparency seems to have been born. Note that, by itself, the ideal of total transparency is not new. It already existed in the 19th century (Benjamin 1934). The use of glass in architecture, for instance the "Crystal Palace" that was built for the London Universal Exhibition in 1851, reflected this ideal.
A few years before, in the end of the 18th century, Jeremy Bentham had described an architecture for surveillance designed to ensure a total transparency (Bentham 1838). Called the Panopticon, it was a model for prisons, factories, hospitals, etc., that have been conceived to make individuals totally visible to their guards, while these ones were invisible to them. The goal of transparency was again to facilitate education, surveillance, care, etc., which enhanced the role and the situation of authority holders. By contrast, the new transparency that is encouraged today is individual and not institutional. It is directed towards and against the authority holders, which are permanently under the cameras. For instance, the policemen are continuously filmed. The professors, physicians, lawyers, politicians etc. are permanently evaluated, etc. The concept of “sousveillance” that was introduced by Steve Mann well characterizes this new form of transparency (Mann 2003). This neologism forged by analogy and opposition to the word surveillance, means that the watcher is situated below (“sous” in French) the authority, while in case of surveillance he is situated above. 3. The Horizon of WikiLeaks To understand the horizon of WikiLeaks, let us first note that Julian Assange, the promoter and editor in chief of WikiLeaks, was initially a computer scientist who first worked on cryptography. So doing, he adopted an atypical posture. While almost all the cryptographers work for armies, secret services or banks, he developed cryptographic tools for people. His idea was to make everybody able to hide information to the authorities (state, company, etc.). Now, with WikiLeaks, Julian Assange proposes to render publicly available all information about authorities. He proposes creating “open governments” where all data about the government and the public decisions would be worldwide accessible to everybody. The underlying idea of a perfect collective transparency seems to justify his action, which somehow refutes his first attitude of privacy protection. 4. Limits of the Generalized Sousveillance The utopia of a generalized sousveillance, i.e. of a sousveillance extended to the overall society, that excludes surveillance, faces an inherent contradiction: the authorities are made of individuals, who, as such, need to be protected, which becomes impossible because of the exclusion of surveillance. Without going deeply in the exploration of this first contradiction, consider now the extension of the sousveillance regime to the overall worldwide society. It faces at least two types of limitations, some being intrinsic, others extrinsic. - 301 - Proceedings IACAP 2011 The main intrinsic limitation is due to our cognitive abilities that are too limited to permit to observe and to assimilate all the information we have at our disposal. As a consequence, we spontaneously filter the information flows and we focus our attention on the most prominent facts. But, we do not decide by ourselves what criteria are adopted to qualify the prominence. Most of the time, this is decided by people who manipulate us by distracting our attention. The second type of limitation is extrinsic in the sense that it is not an own limit of the regime of sousveillence itself, but it is due to foreign factors. Specifically, nothing prohibits the coexistence of a generalized regime of sousveillance with multiple regimes of surveillance. 
For instance, NGOs or big multinational companies may continue to gather and exploit data; they can even take advantage of free public data to extract knowledge useful to their own interests, without any respect for privacy.

5. The Failure of the WikiLeaks Ideal

Despite the attacks to which it has been subjected, and the fact that Julian Assange has been jailed, WikiLeaks is undoubtedly very popular nowadays. There are even attempts to build more or less specialized clones of WikiLeaks in many places all over the world. However, the original Assange project seems to have failed. The causes of this failure are directly related to the limitations of the generalized sousveillance regime expressed in the previous paragraph. First of all, Julian Assange wanted to disseminate data freely, allowing every citizen to get any information he wanted, when he wanted. However, during the Cablegate, WikiLeaks did not freely divulge the 250,000 diplomatic telegrams it had; it sent them to well-established newspapers that had to filter and anonymize the messages and dramatize their publication, with appropriate comments and advertisements. Another failure of the WikiLeaks project is due to the project itself, which was supposed to free people from any kind of authority. However, it clearly appears that WikiLeaks has now become a new authority, playing a role symmetrical to that of other, more traditional authorities such as states, NGOs and companies. Julian Assange himself acts within his own organization without any real transparency, which shows the limits of the generalized sousveillance principle as promoted by WikiLeaks.

References

Benjamin, W. (1934), Selected Writings, Volume 2, 1927-1934. Translated by Rodney Livingstone and others; edited by Michael W. Jennings, Howard Eiland, and Gary Smith.
Bentham, J. (1838), Panopticon or the Inspection House. The Works of Jeremy Bentham, volume IV, 37-172.
Mann, S., Nolan, J., Wellman, B. (2003), Sousveillance: Inventing and Using Wearable Computing Devices for Data Collection in Surveillance Environments, Surveillance & Society 1(3): 331-355, http://www.surveillance-and-society.org, http://wearcam.org/sousveillance.pdf
Obama, B. (2009), Transparency and Open Government, Memorandum for the Heads of Executive Departments and Agencies, The White House, Washington, USA, http://www.whitehouse.gov/the_press_office/Transparency_and_Open_Government/

DEMOCRACY 2.0 - HOW THE WEB MAKES REVOLUTION

ANIS NAJAR
LIP6, Pierre and Marie Curie University
4, Place Jussieu, 75005, Paris, France
anis.najar@lip6.fr

Abstract. "Whoever controls the information owns the power." Many scientists and philosophers have analyzed the relationship between information and power within society, and they have all argued that a kind of dependency exists between the control of information and political power. In this paper, we propose to analyze this dependency from a structuralist point of view, by assuming that changes in the information schema of a society necessarily produce changes in its power schema, thereby characterizing the concepts of surveillance and sousveillance. We examine these changes on two levels, the structure of the information schema and the nature of information, taking as a case study the Tunisian popular revolution, in which information technology played a significant role.

1. Introduction

From a structuralist point of view, we can model the information society as a set of entities exchanging information in some pattern, which we will refer to as the information schema. Similarly, we will call the power schema the pattern representing the balance of power between the entities within the society. Neglecting other socioeconomic factors, we can say that the power schema is largely characterized by the information schema. Therefore, it is reasonable to expect that a revolution in the latter produces a revolution in the former. To illustrate these aspects, we take as a case study the Tunisian popular revolution, which we consider a logical consequence of the anterior revolution of the Information Society. Indeed, just five years earlier, the World Summit on the Information Society, held in Tunisia, reflected the contradiction in the dictator's policy towards information technology: he was simultaneously promoting its use and censoring its access. He did not suspect at the time that, five years later, he would be overthrown by the very thing he was most proud of, namely information technology. In the following, we analyze this revolution on two levels, the structure of the information schema and the nature of information itself.

2. Informational Revolution

2.1. STRUCTURAL LEVEL

Based on the concept of the Panopticon introduced by Jeremy Bentham in 1785 (Bentham 1838), Michel Foucault (Foucault 1975) described the classical schema of surveillance in a society as a hierarchical organization in which the state controls information, both in its dissemination, through the media and education, and in its collection, through intelligence services. This schema also defines the classical power schema as a vertical organization, with the state at the top and the people at the bottom. Censorship has often been the classical way of controlling information in such a configuration. Over the past several years, the Internet has substantially transformed the information schema, which has progressively taken the form of the World Wide Web structure, that of a network. This reversed the power schema in a way that rebalanced the power relationship between the state and the people, by promoting transparency of information and democratization of power. This schema coincides with the architecture of the Catopticon, introduced by Jean-Gabriel Ganascia (Ganascia 2009) to describe the structure of "sousveillance", in opposition to Bentham's Panopticon. Sousveillance has been defined by Steve Mann (Mann 2003) as the acquisition of information technology by people so that they can use it against their keepers. During the Tunisian revolution, we observed a real showdown between the people and the government, especially through social networks, which were a real staging ground for the demonstrations. The advantage provided by the Internet can be explained by several factors. First, notions such as community and sharing, developed through platforms like Wikipedia, Facebook and Twitter, have created a kind of proximity between people and strengthened their solidarity. Second, the distributed character of networks and the speed of information propagation (the small-world effect) make social networks a very effective offensive tool. For example, the worldwide cyberactivist organization known as Anonymous launched an operation called #OpTunisia against the Tunisian Internet Agency servers, paralyzing several government websites. Moreover, the great demonstration that led to the departure of the dictator was organized via Facebook overnight, just after his last speech. Third, this structure is robust against targeted attacks because of the absence of "leaders". Finally, it is effective against censorship because it is always possible to introduce information from some part of the network.

2.2. SEMANTIC LEVEL

The second change in the information society concerns the nature of information content. For some time, multimedia, and especially video, has been increasingly important within the information exchanged over the Internet. Several reasons can explain this. First, the constraints of formalization and formulation downsized the previously privileged position of text, leaving the ground to video, which appears to be a more effective mode of information circulation in terms of quickness and straightforwardness. Second, in addition to the fact that an image is semantically richer than text, it is also much closer to a human's mental representation; it therefore acts more directly on the mental image, which gives it more impact in information transmission. All these factors contributed to the success of video, particularly through video-blogging, and gave birth to a new kind of media, collaborative journalism, in which everyone contributes to the spreading of information. Furthermore, many news TV channels, when they were not allowed to cover events directly, had no choice but to collect and sort amateur videos provided by protestors in order to broadcast them afterwards.

3. Counter-Revolution

Even though the network structure, as we have shown, is resistant to attacks, there is still one kind of attack that is effective against information networks and that takes advantage of the foregoing characteristics: propaganda. That was an essential tactical point that allowed the former regime to launch a counter-revolution by shifting its behavior, in a second phase, from censorship to disinformation. It seems that the regime understood that it would be better able to control information by fabricating it rather than by blocking it. For example, shortly after censorship of the Internet was lifted, multiple Facebook pages were created to turn the opposition parties against each other, and the Ministry of the Interior created an official page for propaganda. Within a few hours, Facebook was flooded with a huge quantity of rumors about criminals and snipers shooting people outside, so that terror led people to stop thinking rationally and to distrust any information. In this way, the government created chaos and paralyzed the network. In the same way, images were also used in the counter-revolution. For the same reasons cited above, they were a very effective tool of manipulation. For example, in an attempt to discredit the protestors, the government staged several acts of violence and spread them on the Internet, so that many people called for the demonstrations to stop.

References

Bentham, J. (1838), Panopticon or the Inspection House. The Works of Jeremy Bentham, volume IV, 37-172.
Foucault, M. (1975), Surveiller et punir, Gallimard, Paris, France, p. 252. In English: Discipline and Punish, trans. A. Sheridan (1977). New York: Vintage.
Ganascia, J.-G. (2009), "The Great Catopticon", in Proceedings of the 8th International Conference on Computer Ethics: Philosophical Enquiry (CEPE), 26-28 June 2009, Corfu, Greece.
Mann, S., Nolan, J., Wellman, B. (2003), Sousveillance: Inventing and Using Wearable Computing Devices for Data Collection in Surveillance Environments, Surveillance & Society 1(3): 331-355, http://www.surveillance-and-society.org, http://wearcam.org/sousveillance.pdf

NEGATIVE SOUSVEILLANCE

CARSON REYNOLDS
University of Tokyo, Department of Creative Informatics
carson@k2.t.u-tokyo.ac.jp

Abstract. Recent catastrophes have increased the desire to get rapid information about infrastructure such as power and services, and not necessarily from the people providing these services. While news sources seek to provide such information, they are biased toward providing information that increases reader or viewer interest. Sousveillance is appropriate in these cases, and here we describe an unusual method for such observation, which we call negative sousveillance: observing which systems or services disappear in a time of catastrophe and reporting on their disappearance.

1. What Disappeared?

Mann's notion of "watchful vigilance from underneath" is useful in cases in which the surveilled feel that information may be used to harm them. But what of the special case in which the disenfranchised feel that information is being withheld from them? Amid the recent earthquake, tsunami, and nuclear power crises in Japan in 2011, several individuals expressed to me the feeling that they "are not being told everything." Indeed, WikiLeaks' (Pilger, 2010) recent diplomatic cable archive documents the extent to which governments and organizations routinely keep politically delicate details out of the public eye. Negative databases (Esponda, 2006), on the other hand, are designed to solve a different problem altogether: keeping records which, if stolen, do not reveal the identities of individuals. Negative databases achieve this by storing the complement of the set of what is being tracked; essentially, the database shows what isn't of concern. The work of Trevor Paglen involves long-distance photography and data analysis to document secret installations. Extending his approach, the negative intelligence gatherer would seek to understand what websites, infrastructure systems, environmental sensors or documents have become unavailable. The negative sousveillance concept, then, is to record, track, or infer what isn't there.

This suggests a two-stage process. The first step is for citizens or activists to survey or map infrastructure systems or environmental status. Paulos, Honicky, and Hooker (2009) showed how urban populations could use mobile phones as dense environmental sensors for citizen science. Analogously, Bonanni et al. (2010) have created a system for tracking and accounting for supply chains and their environmental effects. Projects such as OpenStreetMap have already sought to create public domain maps of the physical world. The second step is to record what has disappeared. The approach is broadly applicable. Those interested in digital image manipulation can keep a delta showing how an image is gradually altered over time through the addition of watermarks or the removal of figures from the scene. Those interested in network systems can track network outages due to disasters or kill switches, which would be used by governments to limit Internet access (Cowie, 2011). The practices of negative information gatherers would in some cases be similar to those of network security professionals. They might proceed by using tools such as nmap to scan various network services and store the results in a database (Lyon, 2009); as services disappear, they would then be listed in the far more interesting negative database (a sketch of such a scan-and-diff pipeline is given below). Those interested in environmental sensors may either try to gain access to the sensor data or deploy their own environmental sensor network. When nodes in such a network stop responding, further investigation is warranted: it may be that a node needs to be replaced, that it has been tampered with, or that it has been destroyed by environmental causes. But the absence of information is just as interesting as a steady broadcast.
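The scan-and-diff pipeline referred to above is easy to make concrete. The following is a minimal sketch under stated assumptions, not Reynolds' implementation: each scan (e.g., parsed from nmap's grepable output) is assumed to have already been reduced to a set of (host, port) pairs, and SQLite keeps a positive table of currently visible services alongside the "negative" table of services that have vanished. All table and function names are illustrative.

```python
import sqlite3
import time

def init(db):
    db.execute("CREATE TABLE IF NOT EXISTS seen (host TEXT, port INTEGER, "
               "PRIMARY KEY (host, port))")
    db.execute("CREATE TABLE IF NOT EXISTS negative (host TEXT, port INTEGER, "
               "vanished_at REAL)")

def record_scan(db, services):
    """services: set of (host, port) pairs observed in the current scan."""
    previous = set(db.execute("SELECT host, port FROM seen"))
    vanished = previous - services
    now = time.time()
    for host, port in vanished:
        # The interesting record: what used to answer and no longer does.
        db.execute("INSERT INTO negative VALUES (?, ?, ?)", (host, port, now))
        db.execute("DELETE FROM seen WHERE host=? AND port=?", (host, port))
    for host, port in services - previous:
        db.execute("INSERT OR IGNORE INTO seen VALUES (?, ?)", (host, port))
    db.commit()
    return vanished

if __name__ == "__main__":
    db = sqlite3.connect("sousveillance.db")
    init(db)
    # In practice these sets would come from periodic nmap runs,
    # e.g. `nmap -p 1-1024 -oG - <targets>` parsed for open ports.
    record_scan(db, {("192.0.2.10", 80), ("192.0.2.10", 443)})
    gone = record_scan(db, {("192.0.2.10", 443)})   # port 80 vanished
    print(gone)  # {('192.0.2.10', 80)}
```

Note that the anticipatory step discussed next, enumerating what exists before it disappears, corresponds to populating the positive table ahead of a crisis.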
The anticipatory step of documenting infrastructure before it disappears is also useful in disaster situations, when officials may be inundated with requests for information. I believe the question "is X inoperative?" is easier to answer than "what kinds of X exist, and are they inoperative?" With careful foresight, the negative database may be able to answer both questions without relying on officials or outside organizations for details.

2. Skepticism & DIY Authority

The feeling of powerlessness that comes from a lack of information can be alleviated by the realization that you yourself can gather information. While news sources, corporate press releases, and government agencies often have access to expert assessment, it is fair to question whether such experts have biases. For instance, news outlets may err on the side of sensationalism to stir up concern about a recent event; corporations may time announcements to minimize the impact of bad news (Gross, 2004); and agencies may try to minimize widespread panic at the expense of accurate information. One interesting aspect of DIY infrastructure, environment, or network monitoring is that those affected can collect and analyze the details that affect them. When objects disappear from view, instead of entering a memory hole they are specially noted as they are entered into a negative database. It is our hope that less will escape the notice of those willing to do the legwork involved in becoming authorities themselves.

References

Bonanni, L., Hockenberry, M., Zwarg, D., Csikszentmihalyi, C., & Ishii, H. (2010). Small business applications of sourcemap. Proceedings of the 28th International Conference on Human Factors in Computing Systems – CHI '10 (p. 937). New York, New York, USA: ACM Press. doi:10.1145/1753326.1753465
Cowie, J. (2011). Egypt Leaves the Internet. Retrieved from http://www.renesys.com/blog/2011/01/egypt-leaves-the-internet.shtml
Esponda, F., Ackley, E., Helman, P., Jia, H., & Forrest, S. (2006). In S. K. Katsikas, J. López, M. Backes, S. Gritzalis, & B. Preneel (Eds.), Information Security, Lecture Notes in Computer Science, 4176, 72-84. Berlin, Heidelberg: Springer. doi:10.1007/11836810
Gross, D. (2004). Friday Night Blights. Slate. Retrieved from http://www.slate.com/id/2106864/
Lyon, G. F. (2009). Nmap Network Scanning: The Official Nmap Project Guide to Network Discovery and Security Scanning. Retrieved March 16, 2011, from http://portal.acm.org/citation.cfm?id=1538595
Mann, S. (1998). 'Reflectionism' and 'diffusionism': new tactics for deconstructing the video surveillance superhighway. Leonardo, 31(2): 93-102.
OpenStreetMap Foundation. (2011). OpenStreetMap. Retrieved from http://www.openstreetmap.org/
Paglen, T. (2011). Visual Projects. Retrieved March 14, 2011, from http://www.paglen.com/pages/projects.htm
Paulos, E., Honicky, R., & Hooker, B. (2009). Citizen Science: Enabling Participatory Urbanism. In Handbook of Research on Urban Informatics: The Practice and Promise of the Real-Time City (pp. 414-436). doi:10.4018/978-1-60566-152-0.ch028
Pilger, J. (2010). Why WikiLeaks must be protected. New Statesman, 139(5015), 18.

GOVERNMENT APPROACHES FOR MANAGING ELECTRONIC IDENTITIES OF CITIZENS – EVOKING A CONTROL DILEMMA?

STEFAN STRAUSS
Austrian Academy of Sciences, Institute of Technology Assessment (ITA)
Strohgasse 45/5, A-1030, Vienna, Austria
sstrauss@oeaw.ac.at

Abstract. Governments worldwide are introducing electronic identity systems to adapt the process of citizen identification to the needs of the information society. These innovation processes primarily aim at improving e-government services, but they imply further societal and political objectives. The emergence of identity management represents a demand for (re)gaining control over personal data in virtual environments. Compared to the predominant security goals, privacy aspects are often neglected and insufficiently implemented. An analysis from a privacy perspective shows that the current situation of governmental e-ID can be described as a control dilemma: despite its aim to (re)gain control, the e-ID could ironically foster a further loss of control over individual privacy. As a consequence, an e-ID system might itself turn into a sort of amplified surveillance interface. In this regard, the e-ID could become a synonym for a panoptic instrument of power. The e-ID example points to the major challenge of enhancing governmental transparency for individuals and the public sphere, in order to compensate for a further growth of information asymmetries and imbalanced control over personal information between citizens and governments.

Information and communication technologies continually pervade everyday life and change the dynamics of data processing and information handling in many respects. Significant increases in personalized services and social interactions over Web 2.0 applications inevitably entail further growth of digital data, making it harder for individuals to control personal information and protect their privacy. The convergence of analog and digital environments further accelerates these trends. The increasing relevance of electronic identity management (IDM) as an important field of research in the information society (Halperin/Backhouse 2008) is a prominent example of this convergence. While many different IDM concepts exist, national governments in particular have made remarkable efforts in recent years to introduce electronic ID cards to support online public services; the primary objectives are improving security and unifying identification and authentication procedures in e-government. Identification is a core function of governments, and thus the creation of national e-ID systems implies far-reaching societal transformations (Aichholzer/Strauß 2010) that contribute "to alter the nature of citizenship itself" (Lyon 2009).

Hence, e-ID is more than an identification device; it becomes a policy instrument, and the focus shifts more and more from a "detecting" tool to an "effecting" tool, i.e., an instrument not only to support administrative procedures such as ascertaining identity in public services, but to enable services and to effect societal and political objectives (Bennett/Lyon 2008). In EU information society policies, inter alia, the vision is to set up a "pan-European infrastructure for IDM in support of a wide range of e-government services" (CEN 2004); introducing an e-IDMS also aims at fighting identity fraud and terrorism (CEN 2004). Privacy is obviously of vast importance but plays a rather implicit role, while security issues predominate. Although the introduction of e-ID is not to be seen as a consequence of the 9/11 tragedy, this strong security focus was in some respects catalyzed by it (Bennett/Lyon 2008). E-ID cards "have become the tool of choice for new forms of risk calculation" and enable a "mode of pre-emptive identification" (Lyon 2009). History offers many examples of social discrimination and population control, drastically illustrating the strong relations between identification and surveillance (Bennett/Lyon 2008; Lyon 2009). But IDM is not inherently a privacy threat. Whether an e-IDMS becomes an instrument of surveillance naturally depends on the concrete system implementation and its surrounding framework. Properly designed with respect to privacy enhancement, an e-IDMS might contribute to informational self-determination, i.e., proactively support individuals in handling their different identities in different contexts and controlling their personal data (Clauß et al. 2005), which is the very idea of IDM. However, current e-ID card schemes include privacy mechanisms only rudimentarily and do not correspond to privacy-enhancing IDM (Naumann/Hobgen 2009). Particular problems are insufficient implementations of anonymity and pseudonymity, undermining the concept of unlinkability, which is essential to prevent "privacy-destroying linkage and aggregation of identity information across data contexts" (Rundle et al. 2008). The growing amount of personal data due to further trends towards pervasive computing environments intensifies these problems, as identity never shrinks (Pfitzmann/Borcea-Pfitzmann 2010). The increasing visibility of identification mechanisms entails a sort of shadow (the term echoes Alan Westin's Privacy and Freedom, 1967, and the notion of a "data shadow"). This "identity shadow" facilitates data linkage and de-anonymization (Strauß 2011). Surveillance tendencies and predominant security objectives in e-ID development imply further frictions. Combined with the evident danger of function creep, i.e., an extension of the purposes for which the e-ID is used, this could lead to the advent of a ubiquitous IDM infrastructure entailing further privacy threats.

The current situation can be described as a control dilemma: while the increasing role of IDM represents "a demand to regain control over personal data flowing in digital environments", the creation of a governmental e-IDMS to fulfill this demand could ironically foster a further loss of control over individual privacy (Strauß 2011). In this sense, an e-IDMS has several similarities to Foucault's (1977) interpretation of the panopticon "as a generalizable model of functioning; a way of defining power relations in terms of the everyday life of men". Social control becomes automated, as the algorithms of the system define the way one's identity is treated, i.e., the degree of service provision based on automated categorization. The trap of visibility (Foucault 1977) is here the increasing ID-obligation triggered by the e-IDMS. While the system becomes more and more visible, its functioning becomes increasingly blurred for individuals. They have to reveal their ID without knowing whether and for what purpose it is used, analogous to the uncertain presence of the guard in the watchtower. The consequences would be self-censorship and limited individual freedom, because "without transparency, one cannot anticipate or take adequate action" (Hildebrandt 2008). The control dilemma highlights the demand for more effective privacy concepts and control mechanisms, enabling citizens and the public sphere to verify proper and legal data usage. One crux is the system-inherent realization of anonymity and pseudonymity (a sketch of one such mechanism is given below) and, relatedly, thorough data minimization, addressed e.g. by emerging approaches (e.g., http://vanish.cs.washington.edu) for an expiration date on digital data (Mayer-Schönberger 2009). However, their practicability is limited, and they cannot solve the problem of information asymmetries between the governed and those who govern. Thus, the major challenge is to compensate for this imbalanced control over personal information by enhancing governmental transparency for individuals and the public sphere.
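To make the unlinkability requirement concrete: one family of privacy-enhancing designs derives a different pseudonym for each administrative sector from a citizen's master identifier, so that records held by different authorities cannot be joined on a common key; sector-specific identifier schemes of roughly this kind exist in practice. The snippet below is a minimal illustrative sketch, not a description of any deployed e-IDMS; the key handling and all names are assumptions.

```python
import hmac
import hashlib

def sector_pseudonym(master_id: str, sector: str, key: bytes) -> str:
    """Derive a per-sector pseudonym from a master identifier.

    The derivation is one-way: a sector that holds only its own
    pseudonyms cannot recover the master identifier, and pseudonyms
    from two sectors cannot be linked without the derivation key.
    """
    msg = (master_id + "|" + sector).encode("utf-8")
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

# Hypothetical demo: the derivation key would be guarded by the
# identity provider and never shared with the sectors themselves.
key = b"held-only-by-the-identity-provider"
tax = sector_pseudonym("citizen-4711", "tax-office", key)
health = sector_pseudonym("citizen-4711", "health-insurance", key)
print(tax != health)  # True: the same citizen is unlinkable across sectors
```

Unlinkability in this technical sense is necessary but not sufficient: it does nothing against the information asymmetry discussed above if the identity provider itself logs every derivation.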
References

Aichholzer, G. & Strauß, S. (2010). Electronic Identity Management in e-Government 2.0: Exploring a System Innovation Exemplified by Austria. Information Polity 15(1-2), 139-152.
Bennett, C. J. & Lyon, D. (2008). Playing the Identity Card: Surveillance, Security and Identification in Global Perspective. London and New York: Routledge.
Clauß, S., Pfitzmann, A., Hansen, M., Herreweghen, E. V. (2005). Privacy-Enhancing Identity Management, issue 67, Institute for Prospective Technological Studies (IPTS).
Comité Européen de Normalisation – CEN (2004). CEN/ISSS Workshop eAuthentication: Towards an Electronic ID for the European Citizen, a Strategic Vision, Brussels.
Foucault, M. (1977). Discipline and Punish: The Birth of the Prison, trans. A. Sheridan. London: Penguin.
Halperin, R. & Backhouse, J. (2008). A Roadmap for Research on Identity in the Information Society. Identity in the Information Society 1(1), 71-87.
Hildebrandt, M. (2008). Profiling and the Rule of Law. Identity in the Information Society 1(1), 55-70.
Lyon, D. (2009). Identifying Citizens: ID Cards as Surveillance. Cambridge: Polity Press.
Mayer-Schönberger, V. (2009). Delete: The Virtue of Forgetting in the Digital Age. Princeton: Princeton University Press.
Naumann, I., Hobgen, G. (2009). Privacy Features of European eID Card Specifications. European Network and Information Security Agency – ENISA.
Pfitzmann, A. & Borcea-Pfitzmann, K. (2010). Lifelong Privacy: Privacy and Identity Management for Life. In: Bezzi, M. et al. (Eds.), Privacy and Identity Management for Life, Proc. of the 5th Int. PrimeLife/IFIP Summer School, IFIP AICT Vol. 320 (pp. 1-17). Heidelberg: Springer.
Rundle, M., Blakley, B., Broberg, J., Nadalin, A., Olds, D., Ruddy, M., Guimarares, M. T. M., Trevithick, P. (2008). At a Crossroads: "Personhood" and Digital Identity in the Information Society, No. JT03241547, OECD.
Strauß, S. (2011). The Limits of Control – (Governmental) Identity Management from a Privacy Perspective. In: Fischer-Hübner, S., et al. (Eds.), Privacy and Identity Management for Life, Proc. of the 6th Int. PrimeLife/IFIP Summer School – revised selected papers, IFIP AICT Vol. 352 (pp. 206-218). Heidelberg: Springer.

Track X: SIG Track – Machines and Mentality

MORAL EMOTIONS FOR ROBOTS

RONALD C. ARKIN
Mobile Robot Laboratory, Georgia Institute of Technology
85 5th ST NW, Atlanta, GA 30332 U.S.A.

As robotics moves toward ubiquity in our society, there has been only passing concern for the consequences of this proliferation (Sharkey, 2008). Robotic systems are close to being pervasive, with applications involving human-robot relationships already in place or soon to occur, involving warfare, childcare, eldercare, and personal and potentially intimate relationships. Without sounding alarmist, it is important to understand the nature and consequences of this new technology for human-robot relationships. To ensure that societal expectations are met, this requires an interdisciplinary scientific endeavor to model and incorporate ethical behavior into these intelligent artifacts from the outset, not as a post hoc activity. We must not lose sight of the fundamental rights human beings possess as we create a society that is more and more automated. One of the components of such moral behavior, we firmly believe, involves the use of moral emotions.

Haidt (2003) enumerates a set of moral emotions, divided into four major classes: other-condemning (contempt, anger, disgust); self-conscious (shame, embarrassment, guilt); other-suffering (compassion); and other-praising (gratitude, elevation). Allen et al. (2006) assert that in order for an autonomous agent to be truly ethical, emotions may be required at some level: "While the Stoic view of ethics sees emotions as irrelevant and dangerous to making ethically correct decisions, the more recent literature on emotional intelligence suggests that emotional input is essential to rational behavior". These emotions guide our intuitions in determining ethical judgments, although this is not universally agreed upon (Hauser, 2006). From a neuroscientific perspective, Gazzaniga (2005) states: "Abstract moral reasoning, brain imaging is showing us, uses many brain systems", and he identifies the locus of moral emotions in the brainstem and limbic system.

The relatively young machine ethics community has focused largely to date on developmental ethics, where an agent develops its own sense of right and wrong in situ. In general, these efforts largely ignore the moral emotions as a scientific basis worthy of consideration. Nonetheless, considerable research has been conducted on the role of emotions in robotics, including work in our laboratory over the past 20 years (Arkin, 2005; Moshkina et al., 2011). Far less explored in robotics is the set of secondary moral emotions and their role in robot behavior and human-robot interaction. One example is the work of De Melo et al. (2009), who have demonstrated that the presence of moral affect in human-robot interaction is both discernible and enhances the interplay between humans and robot-like avatars. Our own research (Arkin and Ulam, 2009) in the moral affective space is illustrated by the use of guilt, incorporated into an ethical robotic software architecture designed for lethal military applications. Guilt is "caused by the violation of moral rules and imperatives, particularly if those violations caused harm or suffering to others" (Haidt, 2003) and is recognized as being capable of producing proactive, constructive change (Tangney et al., 2007). The specific architectural component we have implemented, referred to as the ethical adaptor, incorporates Smits and De Boeck's (2003) mathematical model of guilt, which is used to proactively alter the behavior of the robotic system in a manner that will lead to a reduction in the recurrence of events deemed guilt-inducing. In our initial application, this focuses on the deployment of lethal autonomous weapons systems on the battlefield, with respect to unexpectedly high levels of battle damage. Simulation results demonstrate the ethical adaptor in operation.
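As a rough illustration of how such an adaptor can gate behavior, consider the following sketch. It is ours, made under simplifying assumptions, and does not reproduce the mathematics of Smits and De Boeck's componential model: a scalar guilt level grows when observed battle damage exceeds what was anticipated, and once it crosses successive thresholds it progressively withdraws the most destructive actions from the robot's repertoire. All names and numbers are invented for illustration.

```python
class EthicalAdaptor:
    """Illustrative guilt-gated action selection (names are hypothetical)."""

    # Actions ordered by destructive potential; higher guilt removes more.
    ACTIONS = ["observe", "warn", "fire_precision", "fire_area"]
    # Assumed guilt thresholds; non-lethal actions are never withdrawn.
    THRESHOLDS = {"fire_area": 0.4, "fire_precision": 0.8}

    def __init__(self):
        self.guilt = 0.0  # normalized to [0, 1]

    def after_action_report(self, expected_damage, observed_damage):
        # Guilt grows only when outcomes are worse than anticipated,
        # e.g. unexpectedly high battle damage.
        overshoot = max(0.0, observed_damage - expected_damage)
        self.guilt = min(1.0, self.guilt + overshoot)

    def permitted_actions(self):
        return [a for a in self.ACTIONS
                if self.guilt < self.THRESHOLDS.get(a, float("inf"))]

adaptor = EthicalAdaptor()
print(adaptor.permitted_actions())     # all four actions available
adaptor.after_action_report(0.2, 0.7)  # damage far above expectation
print(adaptor.permitted_actions())     # 'fire_area' withdrawn
adaptor.after_action_report(0.1, 0.6)
print(adaptor.permitted_actions())     # only non-lethal options remain
```

The key property, shared with the ethical adaptor, is that guilt only ever restricts behavior; it never licenses additional force.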
"TAME: Time-Varying Affective Response for Humanoid Robots", International Journal of Social Robotics. Sharkey, N. (2008). “The Ethical Frontiers of Robotics”, Science, (322): 1800-1801. Smits, D., and De Boeck, P., (2003). “A Componential IRT Model for Guilt”, Multivariate Behavioral Research, Vol. 38, No. 2, pp. 161-188. Sparrow, R., (2012). “The March of the Robot Dogs”, Ethics and information Technology, Vol. 4(2). Tangney, J., Stuewig, J., and Mashek, D., (2007). “Moral Emotions and Moral Behavior”, Annu. Rev. Psychol., Vol.58, pp. 345-372. - 315 - Proceedings IACAP 2011 ON DEEPLY UNCONSCIOUS INTENTIONAL STATES KONSTANTINE ARKOUDAS Telcordia Research Piscataway, NJ, USA In this note I will argue against the thesis that humans are equipped with computational structures and algorithms that are unconsciously used for logical reasoning. This thesis represents the received view in cognitive science, particularly in the psychology of reasoning. According to it, the processes by which people reason are unconscious and therefore inaccessible to introspection. The unconsciousness that these cognitive scientists allege is deep. Unconscious mental states of this form are not like the preconscious states of Freud, such as beliefs that can be ascribed to me when I am in dreamless sleep. For instance, when I am asleep I continue to believe that the second world war ended in 1945, even though I do not consciously entertain that belief during that time. The belief is preconscious; even though it is not conscious most of the time, I can easily bring it to mind by my own volition. The ``deep unconscious'' of contemporary cognitive science is also quite unlike Freud's ``dynamic unconscious'' (repressed memories, desires, etc.), although the theory---and controversies---of the latter need not detain us here. But at least repressed mental states could potentially come to the surface via therapy. The unconscious mental states posited by contemporary cognitive science are much more hermetically sealed. I will use mental-logic theories (MLT) to anchor my discussion, but the arguments I will be making will apply to other computational accounts of reasoning, such as mentalmodel theory. I believe that it might be possible to adapt these arguments in a way that will make them applicable to any theory that postulates unconscious computation, including theories of low-level peripheral cognition such as perception and language. But in what follows I will only be concerned with computational theories of reasoning. For simplicity, I will restrict attention to propositional logic, and specifically to what is often called the ``logical judgment'' problem, whereby a small number of fairly simple premises are given (often just one premise), along with a putative conclusion, and the problem is to determine whether the conclusion follows deductively from the premises. Alice is a college sophomore without any training in formal logic, although perhaps she has a meager background in algorithms (e.g., she might know what an algorithm is, and have a vague notion of what loops and conditional branches are for). According to mental-logic doctrine, Alice is equipped with a module for reasoning in propositional logic that consists of: (1) a number of inference schemas, such as modus ponens; and - 316 - The Computational Turn: Past, Presents, Futures? (2) a control procedure, which, presented with a reasoning problem, regulates the selection of which inference rules to apply, when to backtrack, and so on. 
Now let L be Alice's logic for "logical judgment" in propositional logic, and let R be the associated procedure. And let P be a simple propositional reasoning problem. Presumably, if we presented Alice with P, her mental logic would kick in, R would operate for a finite period of time, and before long an answer would emerge. The contents of both L and R are in thinkable form, and indeed are eminently learnable. L presumably contains such straightforward inference rules as the contrapositive, and R contains a small number of simple instructions such as conditional branching and looping. It is quite conceivable, therefore, that Alice can be taught the specific rules of L and the algorithm R, and can voluntarily and consciously follow R. This does not have to be deliberate, in that I am not assuming that L and R are taught to Alice as the very mental logic that her own mind contains for propositional logical judgment. They could be taught to her fortuitously, as part of a random teaching assignment by a teacher, or by some instructor as part of a cognitive science experiment, and it could just so happen, by accident, that what she is taught is in fact identical to her "mental logic," although Alice herself is entirely unaware of this. In fact, Alice might not even be aware that she has such a logic at all.

Now suppose that after a short crash course on L and R, Alice is presented with problem P and goes to work consciously applying R, while, unconsciously and unbeknownst to her, she is applying the very same procedure at the same time. The exact same process unfolds in two duplicate and concurrent threads, tracing two sequences of intentional states, which I will write as s_1, ..., s_n for the conscious process and s_1', ..., s_n' for the unconscious one. We might allow (as is surely logically possible, though improbable) that the concurrency is exact, and that the two threads proceed in perfect lockstep. I claim that s_i and s_i' are identical intentional state tokens for each i = 1, ..., n. We might say that two intentional states are type-identical if they have the same mode and the same content (propositional or otherwise); so, for instance, your belief that Obama is the president of the USA is type-identical to my belief that Obama is the president of the USA, because both the psychological mode (belief) and the content (that Obama is the president of the USA) are identical. What are reasonable identity criteria for intentional state tokens? Two intentional state tokens of one and the same person are identical if they have the same mode, the same content, the same causes, and sufficient temporal proximity. In the present scenario, all these conditions obtain. Content and mode are identical by virtue of the fact that the logic and the algorithm on both levels are identical, and the causes are also the same in both cases: the execution of that particular algorithm on that particular input. Remember that according to the standard computational theory of the mind, the algorithms that are postulated by various cognitive scientists involve intrinsic intentionality (i.e., they are not observer-relative) and are causally efficacious. That is, a person's cognitive activity and concomitant intentional states are the way they are because he or she is running the algorithm in question. So in both cases, it is the deployment of the same algorithm on the same input that is causing the states. Of course, in this version of the thought experiment we actually have more than that: we also have complete temporal overlap. So, for any i, both s_i and s_i' are occurring at the exact same time, in the same mind, with the exact same contents, and the exact same causes and effects. Therefore, the states are identical. But this is a contradiction, because we are now led to admit that one and the same intentional state is simultaneously occurring both consciously and unconsciously. I regard the contradiction as a reductio of the hypothesis that the process s_1', ..., s_n' is occurring unconsciously; that the process s_1, ..., s_n is consciously occurring is, of course, beyond doubt. I conclude that there are no such unconscious intentional states. The only intrinsic intentional states and computational processes that actually take place are the conscious ones.

OUTLINING A COMPUTATIONALLY PLAUSIBLE APPROACH TO MENTAL STATE ASCRIPTION

WILL BRIDEWELL
Center for Biomedical Informatics Research
Stanford University, Stanford, CA USA
AND
ALISTAIR ISAAC
Department of Philosophy
University of Michigan, Ann Arbor, MI USA
AND
PAT LANGLEY
Computer Science and Engineering
Arizona State University, Tempe, AZ USA

1. Extended Abstract

No one would dispute that social cognition is a key characteristic of human-level intelligence. However, within the artificial intelligence literature, we find no system that carries out more than a rudimentary level of social interaction. Previous theoretical work on social information processing usually treats agents as input-output systems that lack internal representations of each other (e.g., multi-agent systems) or develops formalisms unsuitable for practical implementation (e.g., undecidable epistemic logics). To move forward, new strategies for modeling interaction need to tractably support reasoning about the mental states of oneself and others. Here, we present steps toward such a model, which we hope will address the need for a computationally plausible approach and eventually lead to a system that can engage in complex dialog with others.

An agent's mental space is partitioned into models of agents. One of these is the model of self, which serves as the default source of truth when reasoning about the world. From a computational perspective, we find it useful to separate different modalities of mentality into different regions. For instance, inside its self model, an agent will have a structure that stores beliefs about the state of the world, one that stores goals, which indicate desired future states of the world, and one that stores intentions, which are actions that manifest the goals. Since goals and intentions in this representation refer to mental states of which the agent is aware, we loosely use those terms as shorthand for the agent's beliefs about its goals and its beliefs about its intentions. On this view, the primitive mental object is the belief.

Continuing with the computational perspective, we represent a belief as a data structure that contains a literal representing its content and other contextual features necessary to guide reasoning.
These features include temporal aspects analogous to valid time and transaction time in a database. That is, the literals in a belief may be associated with the period of time during which they were true (e.g., yesterday, Jeff ate lunch between 11 and 12) and the period of time during which they were held (e.g., I believed that Chris was a man until I met her), both of which may overlap. Asserting a belief as a goal or an intention involves placing it in the appropriate mental partition and does not require a corresponding change in representation.

In addition to beliefs, which are stored within agent models, we represent relationships among those models. The principal agent model (i.e., the model of the self) connects to internalized models of other agents. These models are accessible through a believes relation. For example, consider a technical support agent conversing with a customer. During the exchange, the support agent may reason about whether the customer believes that his computer is plugged in. Trivially, we might represent this statement as (belief Customer (plugged-in computer)), which tells the system implementing the agent to look in the beliefs of the Customer model believed by the principal agent. Continuing, the agent may have a goal (goal (belief Customer (not (plugged-in computer)))). This goal would appear in a second Customer model, connected to the agent's goal space instead of its belief space. Notably, the goal, intention, and belief operators are not modal operators; for our purposes, they index mental spaces that contain sets of beliefs.

Importantly, knowledge is stored only when necessary. The principal agent's default assumption is that other agents' beliefs accord with its own. If the principal agent has no reason to believe that another agent is in disagreement, then that agent's model will be empty. In the previous example, if the agent believes (plugged-in computer) and (believes Customer (plugged-in computer)), the actual belief will appear only in the principal agent's model. The other models inherit the beliefs of their parents via default reasoning, unless a specific belief is overridden by a locally stored, incompatible one such as (not (plugged-in computer)). As a rough approximation, we assume that all agents share the same inference mechanisms and long-term knowledge (e.g., rules) and do not attempt to represent differences in cognitive ability or domain knowledge.
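The override-or-inherit lookup just described can be stated operationally in a few lines. The following sketch is ours, not the authors' system, and the class and method names are invented: each agent model stores only locally asserted literals, and a query falls through to the parent model unless a locally stored, incompatible belief blocks it.

```python
class AgentModel:
    """A mental space that inherits beliefs from its parent by default."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.local = set()          # locally stored (possibly negated) literals

    def assert_belief(self, literal):
        self.local.add(literal)

    def believes(self, literal):
        negated = ("not", literal)
        if literal in self.local:
            return True
        if negated in self.local:   # local override blocks inheritance
            return False
        if self.parent is not None: # default reasoning: ask the parent model
            return self.parent.believes(literal)
        return False

self_model = AgentModel("self")
customer = AgentModel("Customer", parent=self_model)

self_model.assert_belief(("plugged-in", "computer"))
print(customer.believes(("plugged-in", "computer")))   # True, inherited

customer.assert_belief(("not", ("plugged-in", "computer")))
print(customer.believes(("plugged-in", "computer")))   # False, overridden
print(self_model.believes(("plugged-in", "computer"))) # True, unchanged
```

Nested ascriptions such as (believes Tom (believes Harry ...)) would then correspond to chains of such models, which is where the perspective-shifting discussed below comes in.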
With this basic framework in mind, there are six challenges that must be addressed to implement a functioning system. Here we present them, along with our proposed solutions to two of the most compelling ones.

1. When are new agent models introduced?
2. When are agents linked to each other?
3. How are agents traversed to unpack a nested statement?
4. What is taken as common ground?
5. How are beliefs ascribed to nested agents?
6. How does one agent reason about another?

Addressing the first challenge, the most apparent situation is when a new agent joins a conversation. If individuals discuss an absent agent, one may treat that agent either as a simple object or as an agent to whom one may ascribe beliefs. To illustrate, suppose Tom tells the principal agent, "Harry likes pudding." That would correspond to some belief, either in the principal model or in the Tom model, that resembles (likes Harry pudding). If, instead, Tom said, "Harry said that he likes pudding," we would need to create a model of Harry that would let us store (believes Harry (likes Harry pudding)). Where the belief resides depends on the mental states of the other agents and how their models are connected.

Answering the sixth challenge, we recall that all agents are assumed to use the same inference system and domain knowledge as the principal agent. Typically this mechanism "resides" in that agent's model. However, one can shift perspective by moving the seat of the inference system to another agent model. In this sense, there is a clear relationship to simulation theory, but the domain knowledge may include rules that encode how agents reason about each other, much like the theory-theory. As a result, we can integrate ideas from both camps to help reach our operational goal: intelligent systems that can collaborate and engage with people in realistic dialogs.

Acknowledgements

Will Bridewell and Pat Langley are funded by the Office of Naval Research under Contract No. ONR-N00014-09-1-1029. Alistair Isaac is funded by a postdoctoral fellowship from the McDonnell Foundation Research Consortium on Causal Learning.

AGENCY: ON MACHINES THAT MENTALIZE

MARCELLO GUARINI
University of Windsor
401 Sunset, Windsor, ON, Canada N9B 394

1. Agency, Responsibility, and Mentalizing

The ability of human beings to attribute mental states has been variously referred to as "mindreading" and "mentalizing." The purpose of this paper is to examine the relationship between agency and mentalizing. Two dimensions of agency will be discussed. The first is the ability of a human or machine to take responsibility for his/her/its actions and thoughts, a first-person ability. The second is the ability to hold others responsible, a third-person ability. Both of these activities are important for various forms of social interaction, and neither would be possible without mentalizing. It will be shown that various mindreading abilities, such as tracking perception, desire, the source of belief, and false belief, are central to the notion of agency in ethical, epistemic, and legal contexts. This has implications not only for how we understand human agency, but for how we understand the agency of future machines.

2. Conditions of Agency

Agency comes in degrees: we might expect an average five-year-old human child to take responsibility for some things, an average 15-year-old for still further things, and an average 25-year-old for still further things again. We should expect variations in the capacities of machines as well. The focus of this work is the kinds of mentalizing tasks that average five-year-olds excel at, and the contribution they make to understanding agency. A framework will be provided for understanding the conditions of agency. Distinctions will be made between the generative conditions of agency (what it takes to bring agency into existence), the maintenance conditions of agency (what is required to keep agency in existence), and the regenerative conditions of agency (what is required to repair or restore agency if it is impaired). It will be argued that sustaining various mentalizing abilities is among the maintenance conditions of agency.

2.1. AN EXAMPLE

Let us consider the capacity to attribute false beliefs, something most five-year-olds possess. Some children are allowed to view a Smarties box that contains candy (Nichols and Stich, 2003, p. 90). One of the children is asked to leave the room, and the remaining children witness the candy being replaced with pencils. The absent child is brought back into the room. When asked what the temporarily absent child believes is in the box, most three-year-olds say "pencils." This is a third-person failure to attribute a false belief. Tasks such as these can be failed in the first person as well: young children often fail to attribute false beliefs to themselves.

There is an important connection between agency and the ability to attribute false beliefs. The ability to take responsibility involves, among other things, the ability to grasp that I have or had a false or incorrect view. Without the ability to attribute error to oneself, it is difficult to see how one could, in any well-developed sense, take responsibility for it. Moreover, holding another responsible may well involve, among other things, attributing a false belief to that other individual. Agent A1 may challenge A2 to revise his, her, or its view on some matter on the grounds that the view is false. A1 needs to be able to attribute a false belief to A2 for this to happen.
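The Smarties scenario can be rendered as a tiny simulation that makes plain what passing and failing the task amount to computationally. This is a toy sketch of ours, not a model from Nichols and Stich: the "three-year-old" strategy answers from the current state of the world, while the "five-year-old" strategy answers from the absent child's last observation of it.

```python
# World history: what is actually in the box at each step.
history = ["candy", "candy", "pencils"]   # candy replaced at step 2

# The absent child leaves the room after step 0 and returns at step 2,
# so her last observation of the box dates from step 0.
last_observed_step = 0

def answer_like_three_year_old(history):
    # Reports current reality; cannot decouple it from the other's belief.
    return history[-1]

def answer_like_five_year_old(history, last_observed_step):
    # Attributes a belief frozen at the other agent's last observation.
    return history[last_observed_step]

print(answer_like_three_year_old(history))                     # pencils
print(answer_like_five_year_old(history, last_observed_step))  # candy
```

First-person failure has the same shape: answering "what did I believe before I looked?" from the current state rather than from one's own earlier observation.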
It will be shown that other mindreading tasks (over and above false belief attribution) play a role in both the first and third person dimensions of agency.

References
Nichols, S. & Stich, S.P. (2003). Mindreading: An Integrated Account of Pretence, Self-Awareness, and Understanding Other Minds. Oxford: Oxford University Press.
Goldman, A.I. (2006). Simulating Minds: The Philosophy, Psychology, and Neuroscience of Mindreading. Oxford: Oxford University Press.

TOWARD A TESTBED FOR MODELING THE KNOWLEDGE, GOALS AND MENTAL STATES OF OTHERS
SERGEI NIRENBURG
University of Maryland Baltimore County
Baltimore, MD, 21250 USA

Abstract. The paper introduces a computational environment that facilitates development and experimentation with intelligent agents in the OntoAgent cognitive architecture. The agents pursue goal- and plan-oriented reasoning, are capable of communicating in natural language, and build mental models of other agents.

Decision-making is a core capability of intelligent agents – both human and artificial ones. Making optimal decisions with limited resources is a very difficult task both for people and for machines. Helping people to make decisions is an important scientific, societal and technological goal. Classical decision theory presupposes an idealized decision-making agent that possesses all the knowledge necessary (or desired) for making a decision, operates with optimum decision procedures, and is fully rational in terms of rational choice theory. Within this theory, the rationality of an individual decision is estimated in terms of what von Neumann and Morgenstern (1944) called expected utility: the cost-effectiveness of the means to achieve a specific goal. In other words, rational behavior for an individual maximizes the benefits and minimizes the costs of a choice. However, in real life few people make decisions under conditions of complete knowledge, maximum efficiency and rationality. Thus, Simon (1955) introduced the concept of bounded rationality, which removes the constraint of having complete knowledge and the best algorithm by switching from seeking an optimal decision to accepting a satisficing decision (roughly, making do with the first decision for which utility exceeds costs, even though there may be any number of better decisions available). A number of proposals concentrated on the selection of parameters (features) on the basis of which choices are made. Thus, the prospect theory of Tversky and Kahneman (1974) and its descendants, such as cumulative prospect theory, augment the inventory of decision parameters for a decision (utility) function by stressing psychological influences on decision-making, such as risk aversion and “reference” utility, meaning utility relative to perceived utility for others. In order to incorporate the latter, an intelligent agent A0 must be able to model the mental states of other agents, A1, …, An. At the intuitive level, we understand mental states as including, at a minimum, ontological knowledge of concept types as well as knowledge of concept instances, the agent’s goals, preferences, personality traits, etc. The concept of ‘belief,’ often used in conjunction with modeling agents, we interpret as (possibly error-ridden) knowledge that agent A0 has about the other agents it knows. (We are aware that the knowledge A0 has about itself may also be less than accurate.)
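Since the notion of one agent holding possibly inaccurate, nested models of other agents recurs throughout these abstracts, a minimal sketch may help fix the idea. The following is an illustrative toy structure only – OntoAgent's actual knowledge substrate is an ontological metalanguage, not a Python class – and all names in it are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class AgentModel:
    """Toy stand-in for an agent model: the beliefs A0 ascribes to an agent,
    plus that agent's own (nested) models of further agents."""
    name: str
    beliefs: list = field(default_factory=list)
    models: dict = field(default_factory=dict)

# A0's rendering of Tom's report "Harry said that he likes pudding":
a0 = AgentModel("A0")
a0.models["Harry"] = AgentModel("Harry", beliefs=[("likes", "Harry", "pudding")])

# The "seat" of the stored proposition is the model of Harry, so querying it
# amounts to asking what A0 takes Harry to believe (possibly erroneously):
print(("likes", "Harry", "pudding") in a0.models["Harry"].beliefs)  # True
```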
In our work on modeling intelligent agents we stress the importance of extending the inventory of an agent’s decision-making parameters (but only if effective procedures for determining their values can be developed). Thus, it is correct to state that understanding a speaker’s goals is important in making a decision about how to react to a speech act. But in practice more specific knowledge is needed – for example, when a doctor asks a patient about the latter’s family, the patient must judge whether the speaker’s goal is professional (having the patient’s condition diagnosed) or social (making small talk) or – and this is an even more complex reckoning – whether it is a social goal put in service of the professional one (aiming at establishing a rapport with the patient so as to develop trust and ensure cooperation – better-quality responses to questions and requests).

In this talk I will describe a computational environment that facilitates development and experimentation with agents that strive to make use of mental models of others as a prerequisite for making appropriate decisions with respect to the agent’s own behavior. This capability is one of several core requirements of our cognitive architecture, OntoAgent. In addition to modeling ontological knowledge about the outside world and knowledge about remembered instances of ontological concepts (including other agents, viewed as instances of the ontological concept HUMAN), OntoSem agents:
• are designed to operate in a hybrid network of human and artificial agents;
• emulate human information processing capabilities by modeling conscious perception and action;
• communicate with people using natural language;
• can incorporate a physiological model, making them what we call “double agents” with simulated bodies as well as simulated minds;
• can be endowed with personality traits, preferences and psychological states that influence their perceived or subconscious decision-making preferences;
• rely on knowledge resources and processors that are broad-coverage rather than geared at a particular application, which simplifies porting agents to new domains and applications;
• stress the importance of memory of event, state and object instances to complement their ontological knowledge of event, state and object types.

What makes modeling such multi-faceted agents feasible is that all aspects of agent functioning are supported by the same knowledge substrate, encoded in a single metalanguage. The OntoAgent testbed has been implemented in the medical domain and supports two agent environments:
• Maryland Virtual Patient (MVP, McShane et al. 2009), modeling a patient, a trainee MD and a tutor in the process of learning medical diagnostics and treatment; and
• CLinician’s ADvisor (CLAD, Nirenburg et al. 2011), modeling a patient, an MD and a clinician’s advisor, and intended to assist practicing clinicians by reducing their cognitive load.

The talk will include a demonstration of the above environments and a discussion of the ways of modeling the mental states of other agents.

References
McShane, M., S. Nirenburg, B. Jarrell, S. Beale, G. Fantry (2009). Maryland Virtual Patient: A Knowledge-Based, Language-Enabled Simulation and Training System. Proceedings of International Conference on Virtual Patients, Krakow, Poland, June 5-6.
Neumann, J. von and O. Morgenstern (1944). Theory of Games and Economic Behavior. Princeton: Princeton University Press.
Nirenburg, Sergei, Marjorie McShane, Stephen Beale, Bruce Jarrell and George Fantry (2011). Intelligent agents in support of clinical medicine. Proceedings of MMVR18, Newport Beach, CA, February 9–12.
Simon, H.A. (1955). A behavioral model of rational choice. Quarterly Journal of Economics, 69: 99–118.
Tversky, Amos, & Kahneman, Daniel (1974). Judgment under uncertainty: Heuristics and biases. Science, 185: 1124-1131.

ARCHITECTURAL STEPS TOWARDS SELF-AWARE ROBOTS
MATTHIAS SCHEUTZ
Tufts University
161 College Ave., Medford MA 02155

Abstract. Philosophical debates about qualia, perspectivalness, “what it is like” experiences and related topics are vastly disconnected from the “architecture talk” in AI and cognitive science which is required for understanding minds and designing artificial agents. While philosophy can thus not help AI in designing conscious agents, I argue that AI and robotics can not only help philosophy, but may even be required for solving some of the puzzling questions in the philosophy of consciousness. Specifically, I will claim that there is no such thing as a necessarily private experience (neither phenomenal, nor introspective, nor any other), using as an example robotic architectures whose instances “know” what it is like to be another robotic architecture instance.

We start with two basic, hopefully non-controversial notions, those of awareness and self-awareness, define them for agent architectures, and then show how we can say that a robot is aware or self-aware in a given context. Following Chalmers’ (1996) notion of awareness and Block’s (1995) notion of access consciousness, call a state S of an agent architecture A an “awareness state” if S contains information about something (an entity, state, event, etc.) that the agent (instantiating A) can use to make decisions, guide its behavior and/or give verbal reports. Specifically, an agent is “aware of X” if it is in an awareness state that in some way represents or encodes X. An agent is “self-aware” if it is aware of itself, i.e., if it is in an awareness state that represents or encodes (parts of) the agent itself. S will typically be a complex state that consists of substates reflecting the states of various functional components in the architecture A. For example, if S is the state of “being aware of a red box”, then this state will roughly require perceptual states representing the box and some of its properties, including its redness, in addition to states that use some of these representations in order to form other representations and/or behaviors. To make all of this more precise, I will briefly introduce some relevant parts of our robotic DIARC architecture that we have been developing over the last decade or so in my lab (Scheutz et al. 2007). What is nice about robotic architectures (or any form of agent architecture, including cognitive architectures for that matter) is that one can look inside. I.e., one can take a look at the blueprint and follow the information flow along connections between functional components. One can trace processing routes and look at component states. And one can make statements about possible and impossible processes in a system that instantiates the architecture.
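Before turning to DIARC's components, Scheutz's two definitions can be rendered as a small executable sketch. This is our own illustration of the definitions, not DIARC code; the class and the tuple encoding of substates are assumptions made purely for illustration:

```python
class Agent:
    """An agent is 'aware of X' iff its current complex state contains a
    substate that represents or encodes X and is usable for decision-making,
    behavior guidance, or verbal report; it is 'self-aware' iff some such
    substate encodes (part of) the agent itself."""
    def __init__(self):
        self.state = set()  # substates of the complex state S

    def aware_of(self, x):
        return x in self.state

    def self_aware(self):
        return any(s[0] == "self" for s in self.state)

robot = Agent()
robot.state.add(("object", "red box", "red"))                # perceptual substate
robot.state.add(("self", "asked", "Do you see a red box?"))  # substate about itself
print(robot.aware_of(("object", "red box", "red")))  # True -> aware of the red box
print(robot.self_aware())                            # True -> a self-awareness state
```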
DIARC consists of various functional modules: on the perception side, there are modules for vision processing, sound processing (including sound localization and speech recognition), laser distance data processing, and the processing of various internal proprioceptive sensors. For most sensory modalities, there are also short- and long-term memories, e.g., a long-term memory for visual objects and a short-term memory for storing the recognized objects the agent currently sees. On the action side, there are modules for moving the robot body through the environment, for making arm and head movements, and for making facial expressions, among others. Internal modules consist of various short- and long-term memories together with processes that operate on those memories, including skill memories, factual and episodic memories, a lexicon with syntactic and semantic annotations in addition to word forms, and a task memory. Moreover, there are components for managing the agent's goals, for scheduling actions in parallel, for processing spoken natural language, for task planning, and for reasoning (for more details, see Scheutz et al. 2007).

Now consider a robot running DIARC that is asked whether it sees a red box, and assume that the robot has a goal to answer questions. Upon hearing the spoken utterance, the speech recognizer generates word tokens from it, which are then syntactically and semantically analyzed, resulting in an internal logical representation of the meaning. The robot recognizes that the utterance was a question that requires it to perform an internal lookup action in its visual short-term memory (VSTM), namely to check whether VSTM contains an object representation of a red box. Note that the robot only needs to perform a lookup action in its VSTM, because VSTM is automatically updated based on what the object recognition algorithm detects in the image coming from the camera at a rate of 30 Hz. In particular, various vision processing algorithms are performed on each image frame, attempting to segment colored regions, detect object boundaries, recognize objects and determine their properties. These processes result in the generation of representations of the recognized objects in VSTM, which are matched against existing representations so that object identities can be tracked over short periods of time. If the agent has an object representation of a red box in VSTM, then the representation is retrieved and bound to the expression “red box”. The binding confirms the resolution of the reference and triggers a variety of additional bindings (including the binding of various discourse variables such as “last mentioned object” and “last mentioned noun” in linguistic short-term memory). It also triggers the generation of an answer to confirm that the robot is seeing a red box, which the robot then pronounces. In addition, the generated answer gets stored in linguistic short-term memory and, depending on other factors, the whole event “you asked whether I saw a red box, and I did see one” might get stored in episodic memory (indexed by time, object type, interaction type, and others).
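Computationally, the question-answering loop just described reduces to a lookup in the continuously refreshed VSTM, followed by binding and answer generation. A hedged sketch under that reading follows; every function name here is illustrative, since DIARC's real modules are not exposed through such an API:

```python
def parse(utterance):
    # stand-in for speech recognition plus syntactic/semantic analysis
    return {"type": "box", "color": "red"}

def answer_seen_question(vstm, utterance):
    query = parse(utterance)
    matches = [o for o in vstm
               if o["type"] == query["type"] and query["color"] in o["properties"]]
    if matches:
        # binding the retrieved representation to the phrase "red box" would
        # also update discourse variables in linguistic short-term memory
        return "Yes, I see a red box."
    return "No, I do not."

# VSTM is refreshed by vision at ~30 Hz, so answering is a mere lookup:
vstm = [{"type": "box", "properties": {"red"}}]
print(answer_seen_question(vstm, "Do you see a red box?"))
```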
From the above description, it is clear that the robot went through several awareness states, including self-awareness states, as part of answering the question: the robot is aware of the question when it is in a state where it checks for the object asked for in the question; if there is such an object, the robot becomes aware of the object as well as of the object's properties (in particular, its color); and the robot is aware of the answer it gave. Moreover, the robot is aware of itself having been asked the question and of having given the answer, which is a self-awareness state. I will then use the above architecture to demonstrate, during my presentation, what it is like for the robot to have a color experience, and use this result to address some questions about phenomenal and private experience in philosophy. In particular, I will argue that robots can know what it is like to have another robot's experience.

References
Block, N. (1995). On a confusion about a function of consciousness. Behavioral and Brain Sciences, 18(2), 227-247.
Chalmers, D. J. (1996). The conscious mind: In search of a fundamental theory. New York, NY, USA: Oxford University Press.
Scheutz, M., Schermerhorn, P., Kramer, J. and Anderson, D. (2007). “First Steps toward Natural Human-Like HRI”. Autonomous Robots, 22(4), 411-423.

LOGIC-BASED SIMULATIONS OF MIRROR TESTING FOR SELF-CONSCIOUSNESS
NAVEEN SUNDAR AND SELMER BRINGSJORD

Abstract. We present a formal logic-based analysis of the mirror test for self-consciousness. Based on this formalization, a computational simulation of a mirror-failing dog, a mirror-passing chimp, and a mirror-passing human will be presented. The simulation will consist in the automatic machine-found disproof in the case of the canine, and proofs in the other two cases. These simulations will be based on an axiomatization of the perceptual and doxastic details assumed to be in/operative in these three cases by those embracing the view that chimps and humans are self-conscious, while dogs aren’t.

1. The Mirror Test
In accordance with a now-familiar recipe R in the annals of the study of “self-consciousness,” anesthetize[23] a creature c; while it’s under, paint, say, a red (odorless, hypo-allergenic) splotch upon its forehead, thus making it true that c has property R (= Rc); when awake, place c in front of a mirror (Mc); observe the creature’s behavior b to see if it, for example, includes the attempt to remove the splotch (Rcb or ¬Rcb); if it does/doesn’t, issue a pronouncement about such questions as whether or not it’s self-conscious (or self-aware, etc.; i.e., as to whether or not Sc). Descriptions of followings of R are innumerable in the literature.[24] But what is the logic of this recipe? Despite decades of writing about the value of the recipe, we can find no rigorous account of it, nor of followings of it in connection with certain classes of creatures. Therefore, we can’t find rigorous computational simulations of such followings, and we certainly can’t find proofs that given creatures are known to either have or lack self-consciousness, depending upon whether or not they pass the mirror test. Work underway by us is designed to provide these missing things, and we propose to report on this work at IACAP 2011 and show demonstrations.

[23] Or perhaps do it while the creature is sleeping soundly.
[24] For a compendium of such followings, accompanied by the colorful proposal that self-awareness can be neuro-localized in the right hemisphere, see Keenan, J., Gallup, G. and Falk, D. The Face in the Mirror (Ecco: New York, NY).

2. Toward a Formal Analysis of the Mirror Test
Let’s assume a standard extensional multi-sorted logic in which creatures are partitioned in customary ways. (Please note that the empirical, informal literature, as a matter of brute fact, makes not even a nod in the intensional direction, and is naturally formalized via extensional frameworks.) Specifically, the class of dogs will be denoted by ‘D,’ chimps by ‘C,’ and humans by ‘H.’ Then, the following three propositions have apparently been affirmed in the literature.

1. ∀c ∈ D [(Rc ∧ Mc ∧ Rcb) → Sc]
• This is taken to be true, in a nutshell, because if dogs had behaved as chimps usually do, canines would presumably have been admitted into the “self-aware” club.
2. ∀c ∈ C [(Rc ∧ Mc ∧ ¬Rcb) → ¬Sc]
• This is taken to be true, in short, because if chimps had behaved as dogs do, chimps would presumably have been kept out of the “self-aware” club.
3. ∀c ∈ H [(Rc ∧ Mc ∧ ¬Rcb) → ¬Sc]
• This is taken to be true, in a nutshell, because humans provide the “anchor point” on the issue at hand.

Unfortunately, none of these propositions is true. A dog pre-trained to paw its forehead when seeing a dog provides a counter-example to 1., since no participant in the debate herein considered accepts that such training ensures self-consciousness.[25] A chimp pre-trained to leave splotches intact constitutes a counter-example to 2., since no participant accepts that such training guarantees the absence of self-consciousness. And a human inclined to ignore splotches overthrows proposition 3.

Of course, these problems are just the tip of the iceberg. The trio is of course incomplete, since from it one cannot, for instance, deduce that dogs aren’t self-conscious, whereas chimps and humans are. One might think that this is addressed by adding more formulae,[26] but since the conditional used here is the material conditional, this trio can’t possibly be heading in the right direction, as is easily seen. Assume that a variant of 2., call it 2′., is to enable the deduction that some real-life chimp, Charlie, c′, is in fact self-conscious. How could this deduction go through? It could only work if the relevant antecedents in 2′. were satisfied. For example, the following holds.

{2′.} ∪ {Rc′ ∧ Mc′ ∧ Rc′b} ⊢ Sc′

But for Charlie, and nearly every single chimp who ever lived or will ever live, there will never be a red splotch and a mirror in his life. And yet clearly those in favor of ascribing self-consciousness to chimps will want to make the ascription to Charlie and his friends. More specifically, those in favor of the ascription presumably hold that were it the case that Charlie was given the mirror test, he would pass. This indicates that some intensional logic is required; specifically, a conditional logic able to handle subjunctive conditionals is needed.

[25] Of course, someone might deny that such behaviour expresses an intention to remove a splotch, but that would be entirely ad hoc. Trainers after all routinely train dogs to form goals and seek their satisfaction when they observe the relevant triggers. Relevant here is the Keenan-et-al.-recounted story of behaviourists who claimed that pigeons were to be classified with chimps in the running of R.
It turned out that the pigeons had been pre-trained in ways that contaminated the experimentation in question.
[26] E.g., ∀c ∈ D [(Rc ∧ Mc ∧ ¬Rcb) → ¬Sc].

Note that the fact that 2′ might never be satisfied for a particular chimp is not the fault of our chosen formulation, since that formulation is a direct symbolization of what is said in the literature (which has of course been written for the most part by informalists). One way to understand what ought to be claimed in the informal literature is that a subjunctive conditional be employed: for example, if in all nearby “possible worlds” in which Rc and Mc are true, Rcb is true as well, then Sc is true in the actual world. But of course this sort of thing is the point, since no one has yet worked out the details in this direction, and to credit this direction to anyone in the empirical prior work is so charitable as to border on absurdity. And of course the devil is in the details: the formal calculi we use include an explicit rejection of a possible-worlds semantics for anything doxastic.

Our modeling of mirror testing has obvious connections to key distinctions recently made by Clowes and Seth (2008). In their terms, our research is without question “weak” in nature, since we don’t claim that our mirror-passing agents, however formal and fine-grained the underlying modeling may be, literally are conscious. In addition, while elsewhere (Bringsjord 2007) one of us has expressed skepticism about Aleksander’s axiomatic approach, discussed by C&S, our approach is certainly axiomatic. However, the calculi upon which this approach rests are more expressive than those used by Aleksander (allowing, e.g., for intensional operators), and are oriented toward proof theory and automated proof finding and checking. Finally, related prior work in simulating the mirror test can be found in Takeno’s work on mirror image discrimination. This work provides some evidence that at least the rather informal robotics side of the act of a simple agent’s recognizing its mirror image is feasible. We will of course contrast our work with that of Takeno et al.

References
Bringsjord, S. (2007). Offer: One Billion Dollars for a Conscious Robot. If You’re Honest, You Must Decline. Journal of Consciousness Studies, 14(7), 28–43.
Clowes, R.W. & Seth, A.K. (2008). Axioms, Properties and Criteria: Roles for Synthesis in the Science of Consciousness. Artificial Intelligence in Medicine, 44(2), 91-104.
Takeno, J., Inaba, K. & Suzuki, T. (2005). Experiments and Examination of Mirror Image Cognition Using a Small Robot. In Proceedings of CIRA 2005: IEEE International Symposium on Computational Intelligence in Robotics and Automation, Espoo, Finland.
Quantitative intercultural comparison by means of parallel pageranking of diverse national wikipedias

Daniel Hromada
Ecole Pratique des Hautes Etudes / CHART / Lutin Userlab

Abstract
The aim of our study was to show that the distributions of hyperlinks within wikipedia corpora implicitly contain information about the cultural preferences of their authors. We have transformed wikipedia corpora written in 27 different languages into graph structures whose vertices correspond to wikipedia articles and whose edges correspond to the hyperlinks between these articles. Afterwards we calculated PageRank vectors for every one of these graphs, thus obtaining a so-called “intracultural importance list” for every linguistic community under study. Two datamining experiments were performed with the obtained data. “The top country” study indicated that labels of articles concerning the countries related to the linguistic community that created them are to be found in the top parts of their respective intracultural lists, and, inversely, that the top parts of these lists can potentially be used as a stylometric method for identifying the community which created the corpus. “The world&corpus” study revealed that the majority of rankings of articles concerning the countries of reference within the intracultural list of a given community correlate significantly with the factual geographic distance between the country of reference and the supposed home country of the linguistic community. Both experiments indicated the presence of a morphism between the wikipedia hyperlink graph and the factual world of its authors.

Keywords: PageRank, Wikipedia, graph theory, comparative culturology, quantitative anthropology, cultural stylometry, world-corpus correlations

1. Introduction
The aim of this article is to propose a new quantitative method for the comparison of different cultures by reducing culture-specific corpora to a common metric.
We shall try to demonstrate the feasibility of such an approach by using PageRank as that metric, and the wikipedias of diverse (mostly European) linguistic communities as the corpora to be compared.

Both Wikipedia and PageRank have lately received a substantial amount of attention from different scientific fields. Considered by some to be «probably the most important single contribution to the fields of information retrieval and Web search of the last ten years» (Esuli and Sebastiani, 2007), the implementation of PageRank by Brin and Page (1998) was without a doubt a key component of Google's ascent to the very top of the most visited Internet sites. Wikipedia, on the other hand, is based upon the very simple idea of self-organized collaboration among a huge number of authors. The hypothesis that such a huge number will, in the long run, approximate scientific truth better than a limited number of experts (Surowiecki, 2004) is far from being ultimately proven. However, Wikipedia is nowadays considered a reliable source of information in many domains, and it is one of the most important freely available encyclopaedic corpora. Its multilingual properties are being more and more exploited in NLP research: for word sense disambiguation (Mihalcea, 2007), question answering (Ferrandez et al., 2007) and named entity recognition (Richman and Schone, 2008). Only a few studies, however, have focused fully upon the differences between diverse wiki corpora. And even when such approaches – “exploiting asymmetries” (Filatova, 2009) or “information arbitrage” (Adar et al., 2009) – were presented, their goal was to infer data from article-content-related discrepancies, and not to make comparisons between corpora considered as consistent wholes. The research presented in this paper aims to demonstrate that even such large-scale comparisons can yield valid information.

Our starting hypothesis can be stated like this: Wikipedia maybe does not approximate scientific truth, but it certainly approximates the culture of its authors. In more exact terms, supposing 1) that the very act of creation of an article or a link presupposes the existence of a biased preference within the author, and 2) that wikipedia is a graph structure whose vertices are equivalent to articles and whose edges are equivalent to the hypertext links between these articles, we propose that such a graph is at least partially, but significantly, isomorphic with the associative network of culturally determined meanings and values of its authors.

The proposal that culture – which can be conceived as a structure of symbols, artifacts, buildings, institutions, social roles etc., mutually interconnected in a very specific way – can be described by graph theory and later analyzed by network analysis is far from new (for an overview, see Park, 2005). Validating such a hypothesis, however, is not easy, since it is not easy to find a 1) unique graph-like structure (i.e. a structure with vertices and edges) that 2) represents the common activity of a huge number of culture-holders. And even when such a structure is found, the question whether it faithfully represents (is isomorphic with) a given culture is difficult to answer.
But since it is nowadays widely accepted that a culture is in the first place distinct from other cultures, and that this distinction forms the very essence of the given culture (Bourdieu, 1979), then even when it is almost impossible to compare a cultural graph with the factual world itself, cultural graphs can always be compared with each other, and the results of this comparison can subsequently be more easily compared with the evident cultural distinctions of the factual world. We propose that the corpora of local wikipedias created by diverse linguistic communities can serve as a basis for the construction of such «cultural graphs», and that these graphs can subsequently be compared by means of the PageRank centrality measure.

2. “The top country” study
Since “corpus culturology” does not seem to be an explored scientific domain, the goal of this preliminary analysis was to decide whether it was worthwhile to continue with the implementation of more robust statistical techniques, or whether to consider as false the introductory hypothesis “the hyperlink distribution of a wikipedia graph contains implicit information about the cultural preferences of its authors”. In other words, our primary intention was to assess whether some culture-specific information can be observed by applying a PageRank algorithm to the wikipedia corpora of diverse linguistic communities.

2.1. Method
Database tables «pages» (containing the list of articles – vertices) and «pagelinks» (containing the list of hypertext links – edges) were downloaded from wikimedia's site. All vertices and edges not having namespaces 0 (article), 14 (category) or 100 (portal) were removed from the tables; subsequently, a page_from → page_to plaintext edge list was generated. After this edge list was transformed into a graph G, the PageRank vector – which is in fact the principal eigenvector of the graph's modified adjacency matrix – was calculated with the igraph library (Csárdi and Nepusz, 2006). A damping factor d=0.77 was chosen for the calculation. These transformations and calculations were repeated for 27 wikipedia corpora; the overall properties of their respective graphs are presented in Tab. 1.

Table 1: Basic graph properties of analysed corpora and their corresponding ISO 639-1 codes

Code  Language     Vertices (articles)  Edges (hyperlinks)
AR    Arabic            234538            4963998
BG    Bulgarian         143439            3578973
CS    Czech             266854            7187995
DA    Danish            205245            4402963
DE    German           1939647           43782766
EL    Greek              82168            1879300
ES    Spanish          1303273           23212253
ET    Estonian          126448            2580511
FI    Finnish           403380            7609470
FR    French           1996383           53003962
HE    Hebrew            245431            9103883
HR    Croatian          116515            3850220
HU    Hungarian         277518            9865769
LV    Latvian            67736            1342180
NO    Norwegian         405039            8938168
NL    Dutch             877590           24881686
PL    Polish            903670           29731309
PT    Portuguese       1088962           24867864
RO    Romanian          307084            5392290
RU    Russian          1232353           27442593
SK    Slovak            173417            4873409
SL    Slovenian         146250            5236834
SR    Serbian           239904            5013264
SV    Swedish           623035           11515290
TR    Turkish           304853            9557808
UK    Ukrainian         322799            9158661
ZH    Chinese           609262           15838584

For every corpus, all contained page titles were ordered according to their descending PageRank values. We call such a list an intracultural list, and we call langrank the placement of a given item in its respective intracultural list. Hence, 27 intracultural lists were obtained, within which pages with the highest converged probabilities have langrank 1, pages with the second highest probabilities have langrank 2, etc.
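As a concrete illustration of the pipeline just described, the whole transformation – edge list to graph to PageRank to intracultural list – fits into a few lines of python-igraph. This is a minimal sketch under the stated parameters, not the paper's original code, and the edge-list file name is hypothetical:

```python
import igraph as ig

# page_from -> page_to plaintext edge list, one pair per line (hypothetical file)
with open("pagelinks_edges.txt") as f:
    edges = [tuple(line.split()) for line in f if line.strip()]

g = ig.Graph.TupleList(edges, directed=True)
ranks = g.pagerank(damping=0.77)  # damping factor d = 0.77, as chosen above

# Intracultural list: titles sorted by descending PageRank; position = langrank.
intracultural = sorted(zip(g.vs["name"], ranks), key=lambda t: -t[1])
for langrank, (title, pr) in enumerate(intracultural[:10], start=1):
    print(langrank, title, round(pr, 6))
```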
To summarize, high langrank means low PageRank importance, and vice versa. To detect which names of countries are to be found at the very top of the intracultural lists (i.e. have the lowest langrank), the following procedure was applied: the term at langrank position 1 was extracted from the list and translated into English, using wikipedia itself as the translator. If it was not present in the ISO list of country names, the procedure continued with the term at langrank position 2, 3, etc.; if it was in the list, the procedure continued with country detection in the following intracultural list, thus repeating itself 27 times.

2.2. Results
27 intracultural PageRank vectors, one for each language community, were obtained and subsequently ordered in descending order according to the calculated PageRank (converged probability) values. For illustration, Tab. 2 offers the «top 10» of such lists for 2 Latin and 2 Slavic corpora.

Table 2: Top ten (i.e. langrank 1–10) items of 4 intracultural lists and their respective PageRanks

Portuguese: Wikipédia 0.065305; Proxy 0.006393; WP:TT 0.003323; Plantae 0.002419; Til 0.001981; Avaré 0.001496; População 0.001492; Invertebrados 0.001435; Área 0.001433; Brasil 0.001412

Spanish: España/Sección 0.491755; Rural 0.050179; Wikipedia 0.001105; Wikipedia_en_español 0.000887; 2001 0.000555; Mayo 0.000508; Wikimedia_Commons 0.000337; GFDL 0.000205; España 0.000197; Rural 0.000196

Czech: Wikipedie 0.00984; Wikimedia_Commons 0.00816; GNU_Free_Documentation_License 0.00303; CC-BY-SA 0.00141; CAPTCHA 0.00132; Česko 0.00109; IP_adresa 0.00097; Spojené_státy_americké 0.00082; Zeměpisné_souřadnice 0.00079; Praha 0.00069

Russian: Википедия:Справка 0.01519; Русская_Википедия 0.00564; Германия 0.00361; Общественное_достояние 0.00348; GNU_Free_Documentation_License 0.00295; Викисклад 0.00277; Creative_Commons 0.00276; Английский_язык 0.00121; Россия 0.00119; Фонд_свободного_програ 0.00112

It may easily be observed from the data that Wikipedia itself holds one of the top positions (this is the case within the other 23 corpora as well). This is a trivial discovery, since the wikipedia system is designed in such a way that it refers in the first place to articles which concern the functioning of the system itself. Slightly less trivial is the observation that articles concerning the names of countries or cities closely associated with the language of a given wikipedia corpus emerge at the top positions of their respective intracultural lists. Answers to the question «Which countries are the first to occur at the top of a given corpus' intracultural importance list?» are presented in Tab. 3.

Table 3: Country names found at the top of their intracultural lists (i.e. having lowest langrank L)

Corpus  Top country                  L
AR      (Egypt)                      17
BG      България (Bulgaria)          4
CS      Česko (Czech Republic)       6
DA      Danmark (Denmark)            34
DE      Deutschland (Germany)        16
EL      Ελλάδα (Greece)              7
ES      España (Spain)               9
ET      Eesti (Estonia)              5
FI      Suomi (Finland)              5
FR      France (France)              23
HE      (Israel)                     7
HR      Hrvatska (Croatia)           4
HU      Magyarország (Hungary)       18
LV      Latvija (Latvia)             6
NL      Frankrijk (France)           11
NO      Norge (Norway)               6
PL      Polska (Poland)              12
PT      Brasil (Brazil)              10
RO      România (Romania)            7
RU      Германия (Germany)           3
SK      Slovensko (Slovakia)         9
SL      Slovenija (Slovenia)         8
SR      Француска (France)           28
SV      USA                          35
TR      Türkiye (Turkey)             13
UK      Україна (Ukraine)            13
ZH      印度尼西亚 (Indonesia)        10
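The country-detection procedure behind Table 3 can be summarized by the following sketch; translate_to_english and the ISO name set are stand-ins for the wikipedia interlanguage lookup and the ISO country-name inventory mentioned above:

```python
def first_country(intracultural, translate_to_english, iso_country_names):
    """Walk down one intracultural list until a term whose English translation
    is an ISO country name is met; report it together with its langrank L."""
    for langrank, (title, _) in enumerate(intracultural, start=1):
        if translate_to_english(title) in iso_country_names:
            return title, langrank
    return None

# Tiny demonstration with the Czech top of Table 2 (translations abridged):
czech_top = [("Wikipedie", 0.00984), ("Wikimedia_Commons", 0.00816),
             ("GNU_Free_Documentation_License", 0.00303), ("CC-BY-SA", 0.00141),
             ("CAPTCHA", 0.00132), ("Česko", 0.00109)]
translations = {"Česko": "Czech Republic"}
print(first_country(czech_top, translations.get, {"Czech Republic"}))
# -> ('Česko', 6), matching L=6 for CS in Table 3
```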
In 22 cases, the extraction of one country name from the top of the intracultural list corresponding to the graph of a wikipedia written in language X yielded the name of a country in which that very language X is an official state language. The five exceptions are: Dutch, where Frankrijk (L=11) closely outran Nederland (L=14); Russian, where Германия (L=3!) outran Россия (L=9); Serbian, where Француска (L=28) far outran Србија (L=70); Swedish, where USA (L=35) closely outran Sverige (L=37); and finally Chinese, where Indonesia (L=10) is followed by Qatar (L=45), Micronesia (L=371), Brunei (L=409), Taiwan (L=484) and only much later by mainland China 中国 (L=579).

2.3. Discussion
The observation that a huge majority (22 out of 27) of corpora yield at the top positions of their respective langrank lists the names of countries whose official language is identical to the language of the corpus under study is the first indication that even a pure hyperlink analysis could reveal itself as a fruitful method for obtaining overall information about the preferences or interests of the authors of wikipedia corpora. In such a manner it could possibly serve as a means of «cultural stylometry» – a technique which could allow one to determine the appartenance of an anonymous author (or group of authors) to a given cultural or social unit. For instance, the data from Tab. 3 indicate that the «central country of interest» for the authors of the PT corpus is Brasil (L=10) and not Portugal, which emerges only later in the list (L=32) – later than França (L=12), Itália (L=14), Espanha (L=16) and even Estados Unidos (L=31). If the basic hypothesis of this article, i.e. that langrank values represent the amount of importance of a given term in a given corpus, is not falsified, it could be proposed that Brasil plays a much more important role for the authors of the PT corpus than Portugal, from which it could be inferred that the majority of them are possibly from Brazil and not from Portugal. Analogous stylometric conclusions can be drawn when looking at the AR corpus, where Egypt (L=17) is followed by Jordan (L=27), Spain (L=36), France (L=37) and Tunisia (L=47).

An interesting exception occurs for the countries whose official language is not identical to the language in which the wiki corpus was written: the fact that the Netherlands is closely outrun by France in the case of the Dutch corpus, and Sweden by the USA in the case of the Swedish corpus, can possibly be interpreted as a sign that global currents – related more closely to cultural superpowers – are, for the wikipedia authors of these two highly developed nations, of slightly more interest than local currents of a nationalist nature.

The results obtained for the Chinese intracultural list are intriguing. While the position of Indonesia at the very top could be naively explained by the activity of Chinese expats in Jakarta who pass their time writing wikipedia articles, the subsequent emergence of Qatar, Micronesia and Brunei seems completely counterintuitive. These phenomena can, however, be explained by a well-known caveat of PageRank algorithms related to the so-called link-sink phenomenon. A link-sink can emerge during the PageRank vector calculation when the analyzed graph contains a densely interconnected subgraph having only a few links to the rest of the graph.
One way to deal with link-sink perturbations is an optimization of the damping factor; these problems, in relation to our cultural comparative method, will be addressed in following articles. Since the top of the Serbian intracultural list indicates that this corpus is subject to link-sink perturbations (the first 45 positions are occupied solely by astronomic terms), we consider this to be an explanation for the observation that Serbia is far outrun by France. Since the Serbian corpus is not a big one, the result could as well be explained by the activity of a small group of authors biased more towards France-related phenomena than towards Serbia-related ones. The striking fact that Germany occupies the third position in the Russian intracultural importance list is left for the reader's interpretation.

3. “The world&corpus” study
While the huge majority of results obtained during analysis 1 seem to be consistent with intuitive expectations, their true scientific significance remains debatable. To address this issue, we conceived a second analysis in which we decided to correlate the precalculated intracultural lists with factual data. For this purpose we decided to use the real geographic (spatial) distances between the country of the linguistic community under study and another country (i.e. the country of reference). Such a choice was motivated by a simple hypothesis: wikipedia users from home country B will more likely write articles and create hyperlinks concerning countries of reference A and C, which are neighbours of B, than concerning countries of reference X or Y, which are spatially distant. If such a tendency exists, and if PageRank is a sufficiently efficient technique for quantifying such an “importance” of the A, C, X, Y countries of reference within the scope of a corpus created by authors supposedly from home country B, then significant correlations between the intracultural lists and the |home country, country of reference| spatial distance can be expected to occur.

3.1. Method
We defined 32 countries of reference: 27 of them were the countries which we also considered to be the home countries of our intracultural lists; 5 others were chosen at random, one from every continent (Italy, Japan, Senegal, Argentina, Australia). As the first dataset we used the 27 intracultural lists, one for each home country, calculated during analysis 1. From every such list, the langrank (i.e. the position sorted according to descending pagerank value) corresponding to the term denoting the country of reference was extracted. For example, as Tab. 4 illustrates, Hrvatska was at the 4th position in the Croatian corpus and at the 74th in the Slovenian corpus.
Table 4: Positions of the country of reference Croatia in the intracultural lists of diverse home countries, and their respective spatial distances

Home corpus  Langrank position  Name of country of reference  Spatial distance (km)
AR              532                                              3464
BG              345             Хърватия                          797
CS              281             Chorvatsko                        509
DA              848             Kroatien                         1265
DE              329             Kroatien                          808
EL              271             Κροατία                           870
ES              756             Croacia                          1695
FI              456             Kroatia                          2197
FR             1131             Croatie                          1056
HE             1493                                              2255
HR                4             Hrvatska                            0
HU              268             Horvátország                      403
LV              675             Horvātija                        1472
NL              409             Kroatië                          1083
NO              418             Kroatia                          1907
PL              422             Chorwacja                         828
PT              749             Croácia                          2028
RO              469             Croaţia                           746
RU              696             Хорватия                         5533
SK              271             Chorvátsko                        494
SL               74             Hrvaška                           118
SR              110             Хрватска                          455
SV              556             Kroatien                         1874
TR              413             Hırvatistan                      1747
UK              679             Хорватія                         1320
ZH             3981             克罗地亚                          7321

The Mathematica functions of the computational search engine «Wolfram Alpha» were used as the resource of home country ↔ country of reference spatial distance data. Pearson correlation coefficients were calculated between the two datasets. The whole procedure was repeated 32 times, once for every country of reference.

3.2. Results
The obtained results suggest significant correlations between the intracultural lists and the geographic data in the case of all countries of reference, with the exception of China, Russia and Slovakia. They are presented in Tab. 5.

Table 5: Overall p-values and Pearson correlation coefficients (df=25) for 32 countries of reference

Country of ref.   p            cor
Argentina         <0.003        0.549
Australia          0.165       -0.275
Bulgaria          <0.00026      0.648
Croatia           <2E-06        0.779
China              0.426        0.183
Czech R.          <7E-05        0.689
Denmark           <0.00044      0.629
Estonia           <1.5E-05      0.730
Finland           <1.74E-05     0.727
France             0.0015       0.577
Germany           <0.004        0.539
Greece             0.00019      0.657
Hungary            0.00015      0.664
Israel             0.0148       0.463
Italy             <0.005        0.525
Japan              0.711       -0.07
Latvia            <5.6E-05      0.696
Netherlands       <0.007        0.507
Norway            <0.0003       0.652
Poland            <0.0005       0.630
Portugal          <0.05         0.387
Romania           <6.8E-05      0.690
Russia             0.8987       0.025
S. Arabia         <0.0035       0.543
Senegal           <0.0007       0.617
Slovakia           0.1965       0.256
Slovenia          <6.63E-07     0.797
Serbia            <9.53E-05     0.680
Spain             <0.011        0.486
Sweden            <0.001        0.599
Turkey            <0.0004       0.635
Ukraine           <0.0005       0.629

3.3. Discussion
The obtained results show correlations between strongly empiric spatial measures and positions within the “intracultural” lists. Since different wikipedia corpora are direct consequences of the different creative preferences of human groups, these correlations have to be explained in terms of those preferences. We propose that these preferences are culturally determined.

The previous analysis, even if it leads us to an interesting conclusion, is nonetheless questionable, and a major caveat should be raised: Pearson's correlation coefficients are sensitive to outlier datapoints, and if these are present, an analysis cannot be considered a robust one (Rousseeuw and Leroy, 2003). As Fig. 1 illustrates, this was the case, for example, when Germany was chosen as the country of reference. Simple removal of the zh (Chinese) datapoint from the top right corner (i.e. high spatial distance, high langrank) caused a drastic change from (cor=0.539; p<0.004) to (cor=-0.108; p=0.599). Since the majority of the countries of reference in analysis 2 were European, it can be expected that this outlier boosts the significance of our hypotheses in an unwanted manner.
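The correlation step itself is elementary; the following hedged sketch reproduces it on a subset of Table 4's Croatia data using scipy (the paper does not state its statistical software for this step, so scipy here is our substitution):

```python
from scipy.stats import pearsonr

# Langrank of "Croatia" in several home corpora, and the corresponding
# home-country <-> Croatia distances in km (subset of Table 4):
langrank = [345, 281, 848, 329, 271, 4, 268, 74, 110]    # BG CS DA DE EL HR HU SL SR
distance = [797, 509, 1265, 808, 870, 0, 403, 118, 455]

r, p = pearsonr(langrank, distance)   # Pearson r and its two-sided p-value
print(round(r, 3), round(p, 5))

# Outlier sensitivity (cf. Fig. 1) can be probed by dropping one extreme point
# and recomputing r: the paper reports 0.539 -> -0.108 for Germany once the
# Chinese datapoint is removed.
```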
Another source of bias was identified as well. It is related to the fact that Wolfram Alpha uses the cartographic center of a country as the point from which it measures the distance to/from a given country. That is a useful feature in the case of countries whose population is distributed evenly. In the case of a country like Russia, however, the “central point” is postulated somewhere in central Siberia, some 4000 km east of Moscow. Whether such a point can have anything to do with the cultural preferences of wikipedia authors is open to argument.

Figure 1: Visualisation of langrank&distance correlations when the «China» outlier is included in (left) or excluded from (right) the list of countries of reference, as related to Germany

4. General Discussion
The aim of “the top country” study was to demonstrate whether a method of parallel pageranking of wikipedia graphs can yield relevant information concerning the basic overall specificities of the corpora, and therefore of their authors. A simple look at the tops of the calculated intracultural lists demonstrated that such is verily the case: in 22 out of 27 corpora, the topmost-ranked country-concerning article was about the country whose official language is the one in which the corpus was produced.

The second, “world&corpus” study focused on the relation between implicit properties of wikipedia corpora and the geographic distances of the factual world. While the significance of the obtained results suggests that there possibly exist some morphic relations between the overall hyperlink structure of (wikipedia) corpora and the factual world, the outlier problem indicates that the “world&corpus dilemma” will not be an easy dilemma to resolve.

What we denote here as the “world&corpus dilemma” is only very superficially related to the method which we presented in our second study. In fact, it is much more closely related to the ancient epistemological problem “What is knowledge and how is it represented?” than to some trivial linear regression of two sets of datapoints which tend to have something in common. In its weaker form, the question goes like this: “What is the relation between the corpus and the world, given that the corpus is sufficiently big?”. The goal of our article was to indicate that graph theory could possibly bestow a temporary answer to this question: “If a graph of the corpus is isomorphic with the graph of the world the corpus tends to describe, then it can be said that such a corpus contains knowledge about that world”. We say “a” graph, because there are infinitely many ways to construct a graph from a given corpus. For the purposes of this article, we have chosen the most simple way: inspired by the “random surfer model”, we have completely ignored the information IN the Net (e.g. word co-occurrences in the content) and focalized on the information ON the Net. An edge was created whenever a hyperlink existed between two vertices. We supposed this assumption should suffice as a point de départ: the very act of creation of an article, or of a hyperlink, can be an interesting clue to the preferences of the one who creates it. A weak clue, of course, but nonetheless one containing more information than pure accident.
Since it is well known that a well-aggregated linear combination of weak classifiers can result in a highly effective strong classifier (Freund and Schapire, 1996), it can likewise be proposed that a huge number of well-aggregated weak cultural clues can yield some strong ones.

References
Adar E., Skinner M. and Weld D.S. (2009). Information arbitrage across multi-lingual Wikipedia. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, ACM, pp. 94-103.
Bourdieu P. (1979). La distinction: critique sociale du jugement. Paris: Ed. de Minuit.
Brin S. and Page L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7): 107-117.
Csárdi G. and Nepusz T. (2006). The igraph software package for complex network research. InterJournal Complex Systems, 1695.
Esuli A. and Sebastiani F. (2007). PageRanking WordNet synsets: An application to opinion mining. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 424-431.
Ferrandez S., Muñoz R. and Palomar M. (2007). Applying Wikipedia's multilingual knowledge to cross-lingual question answering. Lecture Notes in Computer Science, 4592: 352-363.
Filatova E. (2009). Directions for exploiting asymmetries in multilingual Wikipedia. In Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, Association for Computational Linguistics, pp. 30-37.
Freund Y. and Schapire R.E. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148-156.
Mihalcea R. (2007). Using wikipedia for automatic word sense disambiguation. In Proceedings of NAACL HLT 2007.
Park H. (2005). Network Cultural Analysis: Texts, Graphs, and Tools. Paper presented at the annual meeting of the American Sociological Association, Philadelphia, PA.
Richman A.E. and Schone P. (2008). Mining wiki resources for multilingual named entity recognition. In Proceedings of ACL-08: HLT, pp. 1-9.
Rousseeuw P.J. and Leroy A.M. (2003). Robust Regression and Outlier Detection. Hoboken, New Jersey: J. Wiley & Sons.
Surowiecki J. (2004). The Wisdom of Crowds: Why the Many Are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations. New York: Doubleday Books.

Semi-supervised haartraining of a fast&frugal open source zygomatic smile detector
A gift to the OpenCV community

Daniel Devatman Hromada, prof. Charles Tijus
Lutin Userlab, Ecole Pratique des Hautes Etudes
Cognition Humaine et Artificielle (ChART), Université Paris 8

Abstract – Five different versions of OpenCV-compatible XML haarcascades of zygomatic smile detectors, as well as the five SMILE samples from which these detectors were derived, have been trained and are hereby presented as a new open source package. The samples were extended in an incremental-learning fashion, exploiting a previously trained detector in order to add and label new elements of the positive example set. After coupling with an already-known face detector, overall AUC performance ranges between 77% and 90.5% when tested on the JAFFE dataset, and a speed of <1 ms per frame is achieved when tested on webcam videos.
Keywords – zygomatic smile detector; cascade of haar feature classifiers; computer vision; semi-supervised machine learning

I. INTRODUCTION
A great amount of work is being done in the domain of facial expression (FE) recognition. Of particular interest is the FE at the very base of mother-baby interaction [1], an FE interpreted unequivocally in all human cultures [2] – the smile. Maybe because of these reasons, maybe because of some others, smile detection is already of certain interest to the computer vision (CV) community – be it for a camera's smile shutter [3] or in order to study robot2children interaction [4]. Nonetheless, a publicly available, i.e. open source, smile detector is missing. This is somewhat stunning, especially given the fact that a “smile” can be conceived as a “blocky” object [5] upon which a machine learning technique based on the training of cascades of boosted haar-feature classifiers [6] can be applied, and that the tools for performing such a training are already publicly available as part of the OpenCV [5] project. Verily, with the exception of the detectors described in [7][8], which have not been publicly released, we did not find any reference to a haarcascade-based smile detector in the literature. We aim to address this issue by making publicly available the initial results of our attempts to construct a sufficiently descriptive SMILing Multisource Incremental-Learning Extensible Sample (SMILEs) and five smile detectors (smileD) generated from this sample. From a more general perspective, our aim was to study whether one can use already generated classifiers in order to facilitate such a semi-supervised extension of an initial sample that a more accurate classifier can subsequently be trained.

A. SMILE sample (SMILEs)
The aim of the SMILEs project is to facilitate and accelerate the construction of smile detectors for anyone willing to build one. Since it is the OpenCV library which dominates the computer vision community, the SMILEs package is adapted to the needs of OpenCV in the sense that it contains 1) a negative examples directory, 2) a positive examples directory, 3) negatives.idx – a list of files in the negative examples directory, and 4) positives.idx – a list of files in the positives directory with associated information containing the coordinates of the region of interest (ROI), i.e. the coordinates of the region within which a smile can be located. SMILEs is considered “Multisource” because it originates as an amalgam of already existing datasets like LFW and Genki, both of which are themselves collections of images downloaded from the Internet. Images from the POFA [9] or Cohn-Kanade [10] datasets were not included in SMILEs, since the restricted access to these datasets is in contradiction with the open source approach¹ of the SMILEs project.

B. Smile Detector (smileD)
SMILEs are “Incremental-Learning Extensible” in the sense that they allow us to train new versions of smile detectors which are subsequently applied upon new image datasets in order to facilitate (or even fully automatize) the labeling of new images, hence extending the original SMILEs with new images. Simply stated, SMILEs allow us to train smileD, which helps us to extend SMILEs, etc. Since the training of haar cascades is an exhaustive threshold-finding process demanding a non-negligible amount of time and computational resources, 5 pregenerated OpenCV-compatible XML smileD haarcascades were trained by the opencv-haartraining application and are included with SMILEs in our open source SMILEsmileD package, so that anybody interested can deploy our smile detector in copy&use fashion.
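For readers who want the copy&use route, a minimal sketch of loading one of the shipped cascades with OpenCV's Python bindings follows. The exact XML file name inside the package is an assumption here, and the thresholds are illustrative:

```python
import cv2

# Load one of the shipped smileD cascades (file name is a guess at the package
# layout; adjust to the actual XML shipped with SMILEsmileD):
smiled = cv2.CascadeClassifier("smileD_haarcascade_v0.5.xml")

img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)
# minNeighbors plays the role of the "smile intensity" cutoff used in the
# evaluation section below: higher values mean fewer, more confident hits.
hits = smiled.detectMultiScale(img, scaleFactor=1.1, minNeighbors=10)
for (x, y, w, h) in hits:
    print("smile candidate:", (x, y, w, h))
```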
¹ Both SMILEs & smileD cascades are publicly available from http://github.com/hromi/SMILEsmileD as a GPL-licensed package. C++ source code of a select&crop application for easy manual sample creation, and of a face-coupled video-stream smile detector, is included as well.

II. METHOD

C. Initial Training Datasets

The SMILEs project in its current state unites three image sets:
- Labeled Faces in the Wild (LFW) dataset – the LFW dataset [11] contains more than 13000 images of faces collected from the web; its cropped version contains only the 25x25-pixel regions detected by OpenCV's frontal face detector. No information about the presence/absence of a smile within the image is given.
- Genki4K dataset – Genki4K is a publicly available part of UCSD's Genki project [12], containing 4000 images downloaded from the Internet. A text file indicating the presence/absence of a smile in a given image is included.
- Ad hoc Flickr dataset – we used the search keyword "smile" to download more than 4200 additional pictures from the image-sharing website flickr.com. More than 2600 of them contained at least one smiling face.

D. Construction of SMILEs datasets

We have created five different versions of SMILEs. All of these versions exploit the same negative sample set of LFW's non-smiling images, and all manual labeling focused solely on the zygomatic smile (ZS) region²:
- Version 0.1 is based solely upon the LFW dataset. All pictures were manually labeled with our ad hoc region selection & cropping application and divided into samples of positive (3606 images) and negative (9474 images) examples.
- Version 0.2 added 2666 manually labeled images downloaded from flickr.com to the positive examples already contained in 0.1. Labeling & region selection were done with the same application as in the case of 0.1.
- Version 0.3 also extended the positive & negative example samples of version 0.1 with images from flickr. This time, however, the flickr-originated images were not labeled manually; the smile-containing regions of interest were determined automatically, by applying smileD version 0.1 to the set of downloaded images. 1372 ROIs (one ROI per image) were identified & labeled in this way.
- Version 0.4 is analogous to version 0.3 in the sense that it is essentially a version 0.1 sample to which automatically labeled positive examples were added. Differently from version 0.3, Genki4K and not flickr was exploited as the source of additional data. Simply stated, 624 positive examples from Genki4K labeled as smile-containing by its authors were added to the initial LFW-based sample.
- Version 0.5 unites versions 0.3 and 0.4, i.e. both the Genki4K- and the flickr-originated images which were automatically labeled by smileD v0.1 were added to the LFW samples.

E. SMILEs -> smileD Training

Identical haar-cascade training parameters [width=43, height=19, number of stages=16, stage hit rate=0.995, stage false alarm rate=0.5, weak classifier decision tree depth=1 (i.e. stump), weight trimming rate=0.95] were applied for the training of all five smileD versions, one smileD corresponding to one SMILEs, both referenced by the same version number.

F. smileD evaluation

The training phase of every new version of smileD was followed by measuring its performance on the Japanese Female Facial Expression (JAFFE) dataset, in order to evaluate the different smileD classifiers when applied to a sample whose luminosity conditions differ from those of every imageset included in the training sample. Detectors were face-detector-coupled during testing, i.e. smile detection was performed iff a face was detected in a tested image, and only in the ROI defined by well-known geometric ratios [13]. Receiver operating characteristic (ROC) curves were plotted and AUCs ("area under ROC curve") were calculated as performance measures by means of the ROCR library [14]. "Smile intensity" [7], i.e. the number of overlapping neighboring hit regions³, was used as the cutoff parameter.

² The ZS region was defined only loosely, as a rectangular ROI in whose center are smiling lips – preferably with uncovered teeth – bordered by the smile & nasolabial wrinkles.
³ Can be obtained from the undocumented neighbors attribute of the cvAvgComp sequence returned by cvHaarDetectObjects.

III. RESULTS

Figure I: smileD ROC curves.

Table I. Basic components of the initial versions of the SMILEs & smileD project

Version | LFW manual | Flickr manual | Flickr auto | Genki auto | Total pos. | Neg. ex.
0.1 | 3606 | 0 | 0 | 0 | 3606 | 9474
0.2 | 3606 | 2666 | 0 | 0 | 6262 | 9474
0.3 | 3606 | 0 | 1372 | 0 | 4978 | 9474
0.4 | 3606 | 0 | 0 | 624 | 4230 | 9474
0.5 | 3606 | 0 | 1372 | 624 | 6572 | 9474

Table II. ROC "area under curve" performance of different versions of the smileD detector

Version | AUC
0.1 | 77.94%
0.2 | 85.49%
0.3 | 83.93%
0.4 | 90.21%
0.5 | 90.51%

IV. DISCUSSION

The detectors we present hereby exploit a top-down approach, i.e. they are face-coupled. Knowing that there can be no smile without a face within which it is nested, we first detect the face with an OpenCV face-detection solution, and smileD is then applied only within the very limited ROI of the face's bottom third. The consequences of our decision to create a face-coupled smile detector are twofold: 1) since by definition we search for a smile only within the face, we used only non-smiling faces as negative examples (i.e. background images); 2) smile detection itself is very fast once the position of the face is specified. When applied to webcam-originated (320x240 resolution) video streams, the time needed for in-face smile detection never exceeded 1 ms per frame on a Mobile Intel(R) Pentium(R) 4 CPU (1.8 GHz), suggesting that smileD could potentially be embedded even in mobile devices disposing of fewer computational resources.

smileD's speed can somewhat offset the accuracy handicap it has in comparison with the results reported in [8]. In its current state, our approach suffers from somewhat high false alarm rates, but our research indicates that in real-life conditions these can be greatly reduced by taking into account the dynamic sequence of subsequent frames, since the probability of the same false alarm occurring within all frames of a sequence is proportional to the product of the probabilities of its occurrence in every frame taken individually: for illustration, a false alarm appearing in a single frame with probability 0.1 would, under an independence assumption, persist through five consecutive frames with probability of only about 0.1^5 = 10^-5. High speed is therefore of utmost importance, and the analysis of sequences of frames can substantially reduce the number of false positives.

Tuning of the training parameters and extension of the negative example set remain further possibilities for augmenting the accuracy of our project. Table II indicates that the accuracy of semi-supervised classifiers like smileD saturates at a certain limit, which can possibly be surmounted only by extension of the negative sample set. In the case of smile detection, we suggest that extending the negative example sample with more images containing the "upper lip raiser" action unit (AU 10) – teeth-uncovering⁴ but associated with disgust rather than smile – could yield significant increases in accuracy, as reported by [9].
Since such an extension is relatively easy and not very time-consuming – given that such AU10-containing images are available and marked as negative examples – it may be the subject of future research. In this study, however, we left the negative example set unchanged in order to study the effectiveness of the "Incremental Learning" approach, during which an old detector is used to facilitate the extension of the positive example sample, thanks to which a new detector is obtained. Since the semi-supervised smileD versions v0.4 and v0.5 outperformed v0.2, for which manual labeling was used, while the latter performed only slightly better than v0.3, which exploited the same flickr-originated image base as v0.2, it is not unreasonable to think that such a semi-supervised incremental training approach can be a feasible solution for training haar-cascade detectors. If that were the case, it could possibly be stated that the machine has started, in a certain sense, to ground [15] its own notion of smile.

⁴ From the anatomical point of view, the disgust-expressing AU10 is associated with the Levator Labii Superioris muscle, while the smile is associated with the Zygomaticus Major muscle (AU12).

ACKNOWLEDGMENT

We would like to thank the third section of EPHE, University Paris 8 and CROUS de Paris for their kind support.

REFERENCES

[1] L. Strathearn, J. Li, P. Fonagy, and P.R. Montague, "What's in a smile? Maternal brain responses to infant facial cues," Pediatrics, vol. 122, 2008, p. 40.
[2] C. Darwin, P. Ekman, and P. Prodger, The Expression of the Emotions in Man and Animals, Oxford University Press, USA, 2002.
[3] M. Akita, K. Marukawa, and S. Tanaka, "Imaging apparatus and display control method," 2010.
[4] J.R. Movellan, F. Tanaka, I.R. Fasel, C. Taylor, P. Ruvolo, and M. Eckhardt, "The RUBI project: a progress report," Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, 2007, p. 339.
[5] G. Bradski and A. Kaehler, Learning OpenCV, O'Reilly Media, Inc., 2008.
[6] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. IEEE CVPR, 2001.
[7] O. Deniz, M. Castrillon, J. Lorenzo, L. Anton, and G. Bueno, "Smile detection for user interfaces," Advances in Visual Computing, pp. 602–611.
[8] J. Whitehill, M. Bartlett, G. Littlewort, I. Fasel, and J. Movellan, "Developing a practical smile detector," submitted to PAMI, vol. 3, 2007, p. 5.
[9] P. Ekman and W.V. Friesen, Pictures of Facial Affect, Palo Alto, CA: Consulting Psychologists Press, 1976.
[10] T. Kanade, Y. Tian, and J.F. Cohn, "Comprehensive database for facial expression analysis," Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG), 2000, p. 46.
[11] G.B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: a database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Technical Report 07-49, 2007.
[12] J. Whitehill, G. Littlewort, I. Fasel, M. Bartlett, and J. Movellan, "Toward practical smile detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, pp. 2106–2111.
[13] L. Da Vinci and J.P. Richter, The Notebooks of Leonardo da Vinci, Dover Publications, 1970.
[14] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, "ROCR: visualizing classifier performance in R," Bioinformatics, 2005.
[15] S. Harnad, "The symbol grounding problem," Physica D, vol. 42, 1990, pp. 335–346.

Master's thesis, E.P.H.E., SVT, Natural and Artificial Cognition, 2010
smileD: Natural Smile and Artificial Smile
On the use of OpenCV for tracking, facial expression recognition and smile detection

containing:
Foreword – page 1
Abstract / Résumé – page 2
Imitation by point tracking – page 3
Introduction to machine learning in relation to facial expression recognition – page 13
Evaluation of the various versions of the smileD smile detector through the imitation game with the robotic face Roboto – page 21
Annex 1: How AdaBoost works – page 31
Annex 2: The construction of chamfer images – page 31
Annex 3: Semi-supervised haartraining of a fast&frugal open source zygomatic smile detector – page 32
«Epiprologue» – page 33

Daniel Devatman Hromada, under the direction of Charles Tijus

Respectable Professors and Schoolmasters, Ladies and Gentlemen,

The text you hold in your hands is the thesis presenting the work carried out by the student Daniel Devatman Hromada during his internship at the Laboratoire des Usages en Technologie d'Information Numérique (Lutin), during the third and fourth semesters of his studies of Natural and Artificial Cognition, in view of obtaining the Master 2 diploma in Sciences de la Vie et de la Terre of the École Pratique des Hautes Études.

The internship subject as initially defined concerned "The imitation of facial expressions by the robotic face Roboto". But the problem quickly turned out to be so complex that it was impossible to fit all the ideas and work into the frame of a classical thesis having the form of a scientific article. It therefore seemed more reasonable to exploit the know-how obtained during the first semester of studies at the EPHE thanks to the "Written communication" course, and to write each of the three main parts of the text according to a different plan:

The first part follows the OPERA writing plan (Observation, Problème, Expérimentation, Résultats, Action) in order 1) to introduce the reader to the robotic face Roboto – which can be considered the triggering factor of everything presented below – and 2) to present the OpenCV library.

The second part follows the ILPIA writing plan (Introduction, Littérature, Problème, Implication, Avenir) in order to present not only the experiments carried out at the end of the first semester (whose goal was to exploit contours so as to create a robust and fast facial expression recognition system), but above all to lead the reader into the fascinating domain of machine learning.

It is, however, in the third part that the culminating point of the work is to be found: after a certain failure, due to the problem of generalization to unknown samples, which closes the second chapter, the author decided to reduce his S4 objectives to a single, more realistic one: the problem of smile recognition. It is in this final chapter that the writing plan most common in cognitive science, IMRED (Introduction, Méthode, Résultats et Discussion), was followed.
The final result of the work – the article Semi-supervised haartraining of a fast&frugal face-coupled open source haarcascade detector of zygomatic smiles – has been placed in the annex, since we were not permitted to write the thesis proper in a language other than French. The text ends with the internship proposal which was at the origin of this work.

Abstract / Résumé

Three different approaches were examined in order to establish a facial-expression-based communication channel between the robotic face Roboto and its human counterpart. We started with a point-tracking system based upon the Lucas-Kanade optical flow algorithm. The need for a calibration phase, as well as the fact that this approach is a purely behaviorist (i.e. stimulus-reflex based) one, without any cognitive representation of the "smile" on the side of the computer, induced us to implement more robust machine learning techniques. Therefore, as a second trial, we studied the feasibility of a facial expression recognition system based on contour extraction, chamfer matching and subsequent feature selection by means of AdaBoost. While the initial tests limited to the JAFFE dataset showed promising results, generalisation to other datasets turned out to be problematic. Lastly, a "classical" approach of cascades of boosted haar-feature-based classifiers was applied upon the SMILEs dataset and coupled with already existing face detectors, producing a relatively fast (< 1 ms per frame) and sufficiently accurate (90.5% AUC when tested on JAFFE) zygomatic smile detector (smileD). Both smileD and the SMILEs dataset used to generate it are hereby published as a gift to the OpenCV community.

1. Imitation by point tracking

Observation

Roboto is a robotic head that was designed in 2003-2004 within a CNRS collaboration between the developmental team of Jacqueline Nadel (CNRS UMR7593, Centre Emotion) and the ETIS neurocybernetics team of Philippe Gaussier, inspired by the initial design of Feelix by Lola Canamero's team (Canamero & Fredslund, 2001). The person in charge of its construction was Pierre Canet (Centre Emotion).

Figure 1: Roboto
The head is connected to a laptop via an RS-232 serial link wired to an SSC-12 card (Lynxmotion's 12-Channel Serial Servo Controller). Twelve servomotors, commanded from a program developed by the ETIS team, make it possible to move the eyebrows, the eyelids and the mouth, composing 6 emotional expressions (sadness, joy, surprise, anger, fear and neutral), formatted with reference to the action units of prototypical human expressions (Ekman et al. 2002).

Table 1: The attributes of Roboto's various motors

Motor | Attachment point | Movement | Associated muscle
0 | Lip corner | horizontal | Zygomaticus major
1,3 | Nostril | vertical | Quadratus labii superioris
2 | Lip corner | horizontal | Zygomaticus major
4 | Chin | vertical | Depressor labii inferioris / Mentalis
5,6 | Eyelid | vertical | Orbicularis oculi
7,10 | Inner eyebrow | quasi-vertical | Corrugator
8,11 | Outer eyebrow | quasi-vertical | Temporalis
9 | Forehead | vertical | Frontalis

Figure 2: The muscles of the face

Given that the SSC-12 controller can manage 12 motors while we had 14 at our disposal, we had to choose the two motors that would remain unconnected. We unplugged the motors allowing the eyes to turn around their vertical axis, thus excluding them from our experiments. This choice was partially motivated by the fact that the cameras originally installed in the robot's two eyes showed only very poor performance. A new solution therefore had to be found to make Roboto better able to see. After a few attempts to use higher-end cameras (model: Pointgrey, connection: Firewire) as the perceptual input channel, it was finally a rather simple webcam (model: PHILIPS, connection: USB), attached at the spot between and slightly above the eyes (the "third eye" position), that provided the image stream to the system.

The SSC-12 controller, which ensured the communication between the computer and the servomotors, worked as follows: through the RS-232 serial port it received from the computer sequences of three bytes, of which the first always has the value 255; the second byte specifies the number of the servomotor concerned (1-12) as well as the speed of the movement (0-7, where 7 designates the fastest speed); the third byte specifies the final position of the servomotor. As each byte can take 2^8 = 256 values, and the value 255 is reserved for the synchronization byte, each servomotor can occupy 255 distinct positions. Knowing that Roboto has 12 degrees of freedom, the number of its possible states is thus 255^12. Since the difference between neighboring servomotor states is invisible to a human observer (e.g. one cannot distinguish an eye closed because servomotor 5 is in position 23 from one whose servomotor is in position 24 or 25), the number of Roboto's possible states that could really be perceived as distinct by a human subject is doubtless lower than 255^12, but it nevertheless remains so large that even after several months of work, Roboto was still capable of surprising or even amusing us with a facial expression never noticed before.
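As an illustration of this protocol, here is a minimal, hypothetical C sketch of sending one such three-byte packet to the SSC-12. The port name and baud rate are taken from the roboto.pl script shown later; the nibble layout of the second byte is inferred from the sequences in Table 2 and should be treated as an assumption.

    #include <fcntl.h>
    #include <unistd.h>

    /* Send one 3-byte SSC-12 packet: sync, servo+speed, position. */
    int main(void) {
        int fd = open("/dev/ttyS0", O_WRONLY | O_NOCTTY); /* port assumed preconfigured at 9600 baud */
        if (fd < 0) return 1;
        unsigned char packet[3];
        packet[0] = 0xFF; /* synchronization byte: always 255 */
        packet[1] = 0x10; /* assumption: high nibble = speed (here 1), low nibble = servo number (here 0) */
        packet[2] = 0x86; /* target position of the servo (0-254) */
        write(fd, packet, 3); /* a full expression = 12 such packets, one per servo */
        close(fd);
        return 0;
    }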
According to the famous FACS (Facial Action Coding System) (Ekman & Friesen 1977), there exist 7 basic facial expressions identifiable in all the cultures of the world: 1) neutral 2) joy 3) surprise 4) sadness 5) anger 6) fear 7) disgust. FACS was taken as the base reference by Roboto's designers: "Much effort was invested so that the robot's expressions would be coherent with the FACS system" (Nadel et al. 2006). The results of this effort were 6 sequences, one for each facial expression¹. Each sequence is composed of 36 bytes – 3 bytes for each of the twelve servomotors – which are sent through the serial port to the SSC-12 servo-controller in order to produce the requested expression.

Table 2: Roboto's facial expressions and their corresponding byte sequences

Expression | Corresponding sequence
Neutral | FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 5F FF 15 7D FF 16 8C FF 17 75 FF 18 85 FF 19 78 FF 1A 75 FF 1B 87
Joy | FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 5F FF 15 7D FF 16 8C FF 17 75 FF 18 85 FF 19 78 FF 1A 75 FF 1B 87
Surprise | FF 20 8C FF 21 76 FF 22 60 FF 23 93 FF 24 AD FF 25 5A FF 26 64 FF 17 60 FF 18 A1 FF 29 AA FF 1A 60 FF 1B 99
Sadness | FF 10 3C FF 11 3C FF 12 C8 FF 13 CF FF 14 46 FF 15 87 FF 16 96 FF 17 95 FF 18 9F FF 19 46 FF 1A 95 FF 1B 9F
Anger | FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 3C FF 15 87 FF 16 96 FF 17 60 FF 18 80 FF 19 3C FF 1A 64 FF 1B 83
Fear | FF 20 5F FF 21 72 FF 22 A5 FF 23 94 FF 24 5A FF 25 23 FF 26 32 FF 27 80 FF 28 85 FF 29 AA FF 2A 80 FF 2B 85
Disgust | FF 30 86 FF 31 BB FF 32 E4 FF 33 B4 FF 34 89 FF 35 7D FF 36 8C FF 37 4B FF 38 85 FF 39 4E FF 3A 75 FF 3B 9C

¹ With the exception of the sequence for disgust, which we did not find among the data supplied with Roboto. We therefore had to "find" the disgust sequence ourselves, which we did thanks to the ri.pl script that will be presented on page 7.

The Docklight software was supplied together with the sequences as the means of sending commands to the robot through the serial port. In other words, the 6 sequences plus one Windows program constituted the entire "software" allowing us to communicate with Roboto when it was transferred to the Lutin in June 2009.

Problem

In more general terms, our problem can be formulated thus: how to make Roboto useful to our laboratory and to the scientific community, indeed useful to the progress of knowledge?

The fact that Roboto only faintly resembles a human face can also be considered a problem. Indeed, there is no skin, the metallic frame lies in a single plane, and the robot thus gives a 2-dimensional impression whose most realistic part is the mouth, the lips being formed by a coil that dilates or contracts according to the positioning of motors 0, 1, 2, 3 and 4. Indeed, many of those who came into contact with Roboto during their visits to the Lutin remarked that the robot is "not very realistic", especially compared to robotic faces like "Einstein" (Wu et al. 2009) or the "Repliee Q2 Actroid". But it happens that what is considered by most to be a defect turns out to be an advantage – it suffices to give priority to a more optimistic perspective. Such was, is and will be our approach, and we support this "optimistic approach" with the following arguments:

1) The uncanny valley effect: this is a psychological reaction to certain humanoid robots, first suggested by (Mori 1970). It describes the fact that the more a humanoid robot resembles a human being, the more monstrous its imperfections appear to us.
Thus, certain observers will be more at ease in front of a clearly artificial robot than in front of a robot endowed with skin, clothes and a face that could pass for human. Fig. 3 shows what we believe to be Roboto's position in the anthropomorphism/familiarity scheme proposed by (Mori 1970): a certain lack of anthropomorphism places it before the first peak and thus prevents the human observers confronted with the robot from feeling the negative plunge into the valley of the "disquieting strangeness" (Freud 1947).

Figure 3: Roboto's position in the anthropomorphism/familiarity scheme of (Mori 1970)

2) The autistic children hypothesis: as the mother of the Roboto project, prof. Jacqueline Nadel, remarked, a certain lack of anthropomorphism was deliberate, so as to allow experiments with autistic children, who often have difficulty looking at a human face. According to her hypothesis, things could be otherwise in the case of interaction with a face of limited anthropomorphism, such as Roboto's.

Having taken these two arguments into account, we decided not to interpret Roboto's anthropomorphism deficit as an obstacle but, on the contrary, as an advantage, as a source of the constraints marking out the territory of our work. Roboto had, after all, been welcomed at the Lutin with the original goal of making possible and easy the experiments concerning the processing of facial expressions by autistic children.
And even if one can say that the work presented in the following pages has partially drifted away from this original objective toward unknown and unforeseen regions – as is, after all, often the case in science – and even if our work moved closer and closer to the domain of an artificial intelligence engineer, we consider it important to underline that we never lost sight of the experimental, indeed medical, objective (i.e. "to help"). On the contrary, it may well be that what we are about to present here is only an introduction, a manual explaining the workings of Roboto to those of our colleagues who might decide² to answer the question: "Could human-machine interaction through facial expressions be exploitable for the study, as well as for the therapy, of mental disorders?"

² The pieces of information that could be of particular use to such a researcher are marked in this text by underlined type. English terms are written in italics.

Experimentation

The objective was thus well defined: create a program that translates what the robot "sees" into its movement. Transforming 7 sequences of 36 bytes associated with 12 servomotors into an experimental tool suitable also for autism studies is a challenge that can only be solved by a certain amount of initial trial and error. A challenge all the more difficult for someone who had never sent a single byte to a servo-controller, nor worked in the domain of computer vision – and such, indeed, was our situation when this challenge was taken up. Fortunately, it very quickly turned out that the major part of the work had already been done, not only by Roboto's designers but also by the worldwide community united around the Open Source software philosophy. The module Device::SerialPort was quickly found and chosen from the repository of the PERL language (Wall & Loukides 2000) to make serial port communication possible, and thus to serve as the working basis of our first script, called roboto.pl. Once executed, this roboto.pl script waits for keywords on its input. If a keyword like «joie», «surprise», «peur», «colère», «dégout», «neutre» or «tristesse» is typed, the script sends the associated byte sequence through the serial port.

Code 1: The source code of the PERL script roboto.pl

    use Device::SerialPort;
    my $neutre    = "FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 5F FF 15 7D FF 16 8C FF 17 75 FF 18 85 FF 19 78 FF 1A 75 FF 1B 87 ";
    my $colere    = "FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 3C FF 15 87 FF 16 96 FF 17 60 FF 18 80 FF 19 3C FF 1A 64 FF 1B 83 ";
    my $sourire   = "FF 10 2F FF 11 D8 FF 12 D8 FF 13 1E FF 14 8C FF 15 7D FF 16 8C FF 17 75 FF 18 85 FF 19 78 FF 1A 75 FF 1B 85 ";
    my $peur      = "FF 20 5F FF 21 72 FF 22 A5 FF 23 94 FF 24 5A FF 25 23 FF 26 32 FF 27 80 FF 28 85 FF 29 AA FF 2A 80 FF 2B 85 ";
    my $tristesse = "FF 10 3C FF 11 3C FF 12 C8 FF 13 CF FF 14 46 FF 15 87 FF 16 96 FF 17 95 FF 18 9F FF 19 46 FF 1A 95 FF 1B 9F ";
    my $surprise  = "FF 20 8C FF 21 76 FF 22 60 FF 23 93 FF 24 AD FF 25 5A FF 26 64 FF 17 60 FF 18 A1 FF 29 AA FF 1A 60 FF 1B 99 ";
    my $command = "neutre";
    print ">";
    my $port = Device::SerialPort->new("/dev/ttyS0");
    $port->baudrate(9600);
    while ($command = <STDIN>) {        # read keywords from standard input
        $port->lookclear;
        chomp $command;
        my $instruction = "";
        my $dat = "";                   # bytes to send for this command
        my $raw = 0;
        if    ($command eq "smile"  or $command eq "sourire")   { $instruction = $sourire; }
        elsif ($command eq "neutre" or $command eq "neutral")   { $instruction = $neutre; }
        elsif ($command eq "colere" or $command eq "anger" or $command eq "angry") { $instruction = $colere; }
        elsif ($command eq "peur"   or $command eq "fear")      { $instruction = $peur; }
        elsif ($command eq "tristesse" or $command eq "sad")    { $instruction = $tristesse; }
        elsif ($command eq "surprise")                          { $instruction = $surprise; }
        else {                          # anything else is treated as raw, space-separated byte values
            $raw = 1;
            my @d = split(/ /, $command);
            foreach my $s (@d) { $dat .= chr($s); }
        }
        if (!$raw) {                    # convert the hex pairs of the macro sequence into raw bytes
            while ($instruction =~ /([A-F0-9]{2}) /g) {
                print hex($1) . ",";
                $dat .= chr(hex($1));
            }
        }
        print "\n>";
        $port->write($dat);
    }

Figure 4: The coupling between Roboto's motors (motor numbers in green) and the keys (in blue) of the keyboard, as defined in the ri.pl script

The limits of the roboto.pl script are obvious: the robot can only express one of the 7 preprogrammed expressions; the motors cannot be accessed individually; the speed cannot be adjusted without modifying the script itself. To overcome these limitations, a second script named ri.pl was written. Its operation was based on an individual coupling between the servomotors and the keyboard. Simply put, two keys were associated with each motor: one to increase the byte representing the motor's position (e.g. raise an eyebrow), the other to lower it. The key/motor pairs were chosen in such a way that the position of a key within the QWERTY keyboard layout (c.f. Fig. 4) corresponds to the position of the motor with respect to the face as a whole (e.g. key 1, located in the upper-left corner, moves the motor of the outer left eyebrow upward).
There are thus 2x12 = 24 keys for moving the motors, two keys for accelerating/decelerating the speed of movement, and 6 macro keys for the basic sequences. Even though the ri.pl script lets us work with Roboto in a much more flexible way than roboto.pl, it is obvious that both scripts represent only the beginning of the work. Indeed, the scripts allow us to send commands to the robot to make it move, but the input data, i.e. the visual data, are never analyzed by these simple scripts. In other words, Roboto's motor side is better provided for than before its arrival at the Lutin, but the perceptual side is lacking. Up to this point, the machine sees nothing.

The machine sees nothing until we teach it to see – such is the basic postulate of the branch of artificial intelligence called computer vision (CV). For several decades already, researchers have been devising theoretical models, mathematical formulas and ever more evolved solutions for performing image processing and classification, finally culminating in OpenCV. OpenCV is a library written in C++, created initially in a Russian subsidiary of Intel before becoming public and open source. It is a project containing hundreds of functions inspired by academic studies on everything that concerns the numerical representation of visual data. As it is neither necessary nor possible to present, within this thesis, even a hundredth of all that OpenCV can do, we refer the interested reader to the book by Bradski & Kaehler (2008).

For a computer, an image is nothing but an array of pixels. Each pixel is a colored point located at coordinates X (the column) and Y (the row). In the case of a color pixel, the color is described by 3 integers, representing the intensities of the red, green and blue components. But since our work concerns facial expressions, and since it is obvious³ that one can recognize and classify a facial expression even in a black&white image, we shall speak only of images whose pixels have a single component: intensity (luminosity, the shade of gray) on the black-white scale. We proceed this way because the color → black&white reduction appreciably reduces the complexity of the problem to be solved. We repeat: an image of size X x Y can be described as a matrix with X columns and Y rows, each element pix(X,Y) of which codes the intensity of the pixel at position I(X,Y). While the camera's role is to create such numerical representations, the computer's role is to process them in order to extract the information of interest, indeed to find in them the objects of certain categories. All of this happens thanks to procedures of a purely mathematical essence – procedures which OpenCV dispenses even to those who struggle to understand the ideas hidden behind terms such as: convolution kernel, Laplace operator, Sobel filter or Fourier transform.

³ At least for those who remember the era when photographic film was black&white.
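As a small hedged illustration of this pixel-matrix view (using the OpenCV 1.x C API of Bradski & Kaehler, 2008; the file name and coordinates are arbitrary assumptions), one can load an image directly in grayscale and read individual intensities:

    #include <stdio.h>
    #include <opencv/cv.h>
    #include <opencv/highgui.h>

    int main(void) {
        /* Load the image with a single channel: each pixel is one 8-bit gray intensity. */
        IplImage* img = cvLoadImage("face.png", CV_LOAD_IMAGE_GRAYSCALE);
        if (!img) return 1;
        /* The image is a matrix of img->height rows and img->width columns;
           pixel (x, y) lives at offset y*widthStep + x of the raw data buffer. */
        int x = 10, y = 20;
        unsigned char intensity = ((unsigned char*)(img->imageData + y * img->widthStep))[x];
        printf("I(%d,%d) = %d\n", x, y, intensity);
        cvReleaseImage(&img);
        return 0;
    }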
Now, for our first image-processing program – or rather sequence-of-images processing program, since we are talking here about a video stream – we took our inspiration from the FaceLab "eye tracking" apparatus present at the Lutin, which is itself programmed with OpenCV. The principle of eye tracking, and indeed of all point-tracking systems in general, is the following: 1) locate or choose the point of interest at the beginning of the video sequence; 2) find the "corners" in the proximity of this point; 3) apply the optical flow algorithm to find these corners in the subsequent images of the video sequence, starting from the position located in the preceding image.

The notion of "corner" (feature) is essential for understanding how a tracking system works. A corner is an image point whose properties are unique with respect to the other points of the same image. If one wanted to track a point in a video of a white wall, it would be rather difficult to locate the same point in the next image, since all points would have roughly the same intensity (would be white). On the contrary, if one chose a point with unique properties, one could follow it, i.e. do tracking. This is precisely what OpenCV's function cvGoodFeaturesToTrack() does. It is based on the definition of Harris & Stephens (1988), which considers as corners only those points whose derivative in both orthogonal directions is sufficiently strong; according to Shi & Tomasi (1994), this comes down to a derivative⁴ value higher than a certain threshold, treated as a parameter of the tracking system. Simply put: if the intensity value of a pixel differs sufficiently from its neighborhood both along the X axis and along the Y axis, this point can be considered a corner, i.e. something unique, "a good feature to track".

⁴ One can also speak of a "gradient".

Once we have located, in the proximity of the point of interest, the points that will be easier to follow than others, we can analyze the next image and try to find them again. To do this, the motion that makes the two images differ must be estimated. The "optical flow" algorithms allow us to do just that. Because of its elegance, its speed – it is a so-called sparse optical flow method – and its efficiency, we chose the Lucas-Kanade (LK) method (Lucas & Kanade, 1981) to compute the flow between two images. LK rests on 3 basic principles:
• brightness constancy – the intensity of a pixel does not change between subsequent images;
• temporal persistence – the changes between subsequent images are sufficiently slow;
• spatial coherence – neighboring points perform similar movements.

In OpenCV, it is the function cvCalcOpticalFlowPyrLK() that lets us apply the LK method. This function also adds what is called an "image pyramid". An image pyramid is a multi-resolution representation of an image, created from an original image: the original image is its base, the first level is the base reduced by half, and so on.
The use of image pyramids in combination with the LK algorithm allows us to relax the limitations due to the "temporal persistence" condition: a small movement at the high levels of the pyramid is equivalent to a large movement at the lower levels. Without the image pyramid, such a large movement could not have been captured by the LK method.

Figure 5: A) an image pyramid (image taken from http://fr.wikipedia.org/wiki/Pyramide_(traitement_d%27image)); B) the use of image pyramids in Lucas-Kanade optical flow estimation, attenuating the problems linked to the small-movement requirement (Bradski & Kaehler, 2008)

To recapitulate: the "perceptual kernel" of our first imitation program is the following: 1) the user chooses the point of interest with a mouse click; 2) the function cvGoodFeaturesToTrack() locates the corners in the proximity of the point of interest (PoI); 3) the program finds those corners again in the next image of the video stream thanks to the function cvCalcOpticalFlowPyrLK(), and derives from them the new position of the PoI. This can be done for any number of points, and the sequence being analyzed – like the video in which the user chooses the PoIs – comes, of course, from Roboto's camera.

Once vision – Roboto's sensory side – was in place, it was coupled with motricity in the following way: before clicking to choose the PoI, the experimenter presses the keyboard key referring to the motor whose movement will be coupled with the movement of the PoI about to be clicked. Simply put, the motor that will move is defined by the keyboard key (c.f. Figure 4), the PoI to be tracked is defined by the mouse click, and their coupling is ensured by the fact that the experimenter presses the key and then clicks. The sending of byte sequences is then analogous to the ri.pl script, the biggest difference being that this time we use the C++ library libserial rather than PERL's Device::SerialPort, since the sm_imitation program we have just presented is written not in PERL but in C++ (to be compatible with OpenCV).

Figure 6: Imitation of the movement of the left eyebrow performed by the sm_imitation.c program

Results

It was, and remains, difficult for us to evaluate this point-tracking-based program in a reliable way. This is due to the fact that the program provides no numerical output; it does nothing but follow points and try to translate the detected movement into the movement of the motors (the procedure is illustrated in Fig. 6). The choice of the point to be tracked, like the choice of the motor to be coupled with the PoI, is made in an ad hoc fashion. In other words, the motor ↔ PoI coupling phase – a calibration phase, a manual intervention – is necessary and must precede every experiment with this program, which makes every experiment unique, non-repeatable and hence non-scientific. Since we do not want to stray too far from our scientific objectives, let us content ourselves with two observations, one positive and one negative, which might seem superfluous but are nothing of the kind. The good news is that the combination of the functions cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK() allows even OpenCV beginners to build their first point-tracking programs; a minimal sketch follows.
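The following hedged C sketch shows the skeleton of such a tracker with the OpenCV 1.x C API; the camera index, window size, thresholds and feature count are illustrative assumptions, and error handling is omitted.

    #include <opencv/cv.h>
    #include <opencv/highgui.h>

    int main(void) {
        CvCapture* cap = cvCaptureFromCAM(0);           /* webcam assumed at index 0 */
        IplImage* frame = cvQueryFrame(cap);
        CvSize sz = cvGetSize(frame);
        IplImage *prev = cvCreateImage(sz, IPL_DEPTH_8U, 1), *curr = cvCreateImage(sz, IPL_DEPTH_8U, 1);
        IplImage *eig  = cvCreateImage(sz, IPL_DEPTH_32F, 1), *tmp = cvCreateImage(sz, IPL_DEPTH_32F, 1);
        IplImage *pyrA = cvCreateImage(sz, IPL_DEPTH_8U, 1), *pyrB = cvCreateImage(sz, IPL_DEPTH_8U, 1);
        CvPoint2D32f ptsA[100], ptsB[100];
        char status[100]; float err[100];
        int count = 100, i;

        cvCvtColor(frame, prev, CV_BGR2GRAY);
        /* find up to 100 "good features to track" in the first frame */
        cvGoodFeaturesToTrack(prev, eig, tmp, ptsA, &count, 0.01, 10, NULL, 3, 0, 0.04);

        while ((frame = cvQueryFrame(cap)) != NULL) {
            cvCvtColor(frame, curr, CV_BGR2GRAY);
            /* pyramidal Lucas-Kanade: find the same corners in the new frame */
            cvCalcOpticalFlowPyrLK(prev, curr, pyrA, pyrB, ptsA, ptsB, count,
                                   cvSize(21, 21), 3, status, err,
                                   cvTermCriteria(CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, 0.03), 0);
            /* here the displacement ptsB[i]-ptsA[i] of the clicked PoI would be
               translated into a servomotor command, as in sm_imitation */
            cvCopy(curr, prev, NULL);
            for (i = 0; i < count; ++i) ptsA[i] = ptsB[i];
        }
        cvReleaseCapture(&cap);
        return 0;
    }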
The bad news is that even with the pyramidal image system in place, and even though the LK optical flow algorithm is supposed to satisfy the temporal persistence condition, robust point tracking is not guaranteed: points are frequently lost, above all after an abrupt movement, which then requires the calibration phase to be restarted.

Action

In concrete terms, we would need at least 7 more months to make imitation by point tracking (based on LK optical flow) workable, were we ever to adopt that approach for experiments with autistic children. The calibration phases, the need to determine the best thresholds for the functions cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(), and the uncertain relevance of the results obtained led us to abandon the "imitation by point tracking" technique, for it is unlikely that an autistic child would remain in a peaceful state of mind – without abrupt movement – throughout the whole initial phase during which 12 points would have to be coupled with 12 keys → 12 motors. Even when we tried to turn Roboto's eyes, i.e. its camera, toward Roboto itself, and to make Roboto follow the movements of its own motors, the temporal persistence condition was not respected and points were often lost⁵. One can imagine the difficulty of such an experiment were it to be carried out with autistic children!

⁵ Yet from time to time, when the coupling was made in a particular way, what had been hoped for did happen: Roboto moved all by itself for a certain while, without any human intervention being necessary.

But the biggest reproach we can address to the approach presented in this chapter is not of a technical but of a theoretical order. Indeed, a program that does nothing but follow points and dumbly translate them into bytes sent to servomotors is only an ad hoc solution, a program that contributes nothing either to the domain of cognitive psychology or to the domain of artificial intelligence – a domain which attracted us more and more as we got to know the OpenCV library and the C++ programming language, both previously unknown to us. The intelligence of such a blind-following imitation system may be compared to the intelligence of a Braitenberg vehicle (Braitenberg, 1986), but not to the so-called "emotional" intelligence that researchers in emotional robotics attempt to simulate and understand. For in the point ↔ motor coupling system we have just presented there is nothing but a stimulus-response system; the program is thus of a behaviorist rather than cognitivist order. For a program to deserve the latter name, it needs different levels of representation as far as possible, plus classification abilities, indeed a kind of generalization. In a word, it needs learning. And for that, we must enter the kingdom of machine learning.

Figure 7: An example of stimulus-action coupling: Braitenberg vehicles. Each luminosity-sensitive sensor is coupled to a wheel.

Bibliography

Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media.
Braitenberg, V. (1986). Vehicles: Experiments in synthetic psychology. The MIT Press.
Canamero, L., & Fredslund, J. (2001). I show you how I like you – can you read it in my face? [robotics]. IEEE Transactions on Systems, Man and Cybernetics, Part A, 31(5), 454–459.
Ekman, P., & Friesen, W. V. (1977). Manual for the Facial Action Coding System. Consulting Psychologists Press.
Freud, S. (1947). Das Unheimliche (1919). Gesammelte Werke, 12.
Harris, C., & Stephens, M. (1988). A combined corner and edge detector. In Alvey Vision Conference (Vol. 15).
Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (Vol. 3, p. 3).
Mori, M. (1970). The uncanny valley. Energy, 7(4), 33–35.
Nadel, J., Simon, M., Canet, P., Soussignan, R., Blancard, P., Canamero, L., & Gaussier, P. (2006). Human responses to an expressive robot. In Proceedings of the Sixth International Workshop on Epigenetic Robotics (pp. 79–86).
Shi, J., & Tomasi, C. (1994). Good features to track. In 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94) (pp. 593–600).
Wall, L., & Loukides, M. (2000). Programming Perl. O'Reilly & Associates, Sebastopol, CA, USA.
Wu, T., Butko, N. J., Ruvulo, P., Bartlett, M. S., & Movellan, J. R. (2009). Learning to make facial expressions.

2. Introduction to machine learning in relation to facial expression recognition

Introduction

The long-term objective is to make possible experiments based on the interaction between Roboto and autistic children. As is often the case, the constraints linked to this objective mark out the path to follow: knowing that the behavior of the children in question often varies according to conditions that cannot be foreseen, we decided that the experimental apparatus – and above all the software at its core – that we are trying to build must be as robust, fast and simple as possible, both for the experimenter and for the experimental subject. The robustness constraint demands that the software do what it is asked to do despite variations of external conditions (differences between the faces of different subjects, differences of luminosity, etc.). The speed constraint is due to the fact that we will be satisfied only with software that lets us study high-level interaction between human and machine, and for that the software must allow human and machine to interact in real time. Finally, when we speak of simplicity, we mean that we prefer an approach using a minimal number of parameters to be defined or adjusted during the experiment. In the ideal case there will be none at all, the apparatus being completely automatic and the experimenter contenting himself with starting the program and analyzing the results.

Even though the approach presented in the first chapter was fairly fast, it was neither robust – tracked points were often lost – nor simple – a rather long coupling phase would have been necessary before every experiment. We therefore decided to attempt to build a system in which no coupling or calibration phase prior to the experiment is necessary. For this, the software must possess, before the experiment itself, some knowledge of the invariant features of the objects (faces or their facial expressions) that it will try to 1) recognize, 2) classify and 3) imitate. Such a technique is possible if and only if the software has prior, general knowledge about the objects to be recognized.
The means by which general knowledge can be attained from a limited sample of concrete examples is called learning, and the subdiscipline of computer science studying the various algorithms that allow us to perform learning on machines is known under the name of machine learning. The objective of this chapter is to introduce these algorithms and to define the notion of "feature", all in relation to facial expression recognition.

Literature

What research in facial expression recognition (FER) aims at is the classification of images according to the facial expressions (FE) contained in the analyzed images. In general, classification is accomplished in two phases: 1) features are extracted from an image; 2) according to the extracted features, and according to the (features – label) couplings supplied during the learning phase, a machine learning (ML) algorithm attributes a label even to an unknown object. The label designates the object's membership in a class. The fundamental principles of ML are well explained in (Bishop et al. 2006); the work of (Haykin 1994) can no doubt serve as a reference book. Concerning the application of ML in the domain of computer vision (CV), the 13th chapter of (Bradski & Kaehler 2008) will prove most useful, above all for neophytes.

As for the various algorithms used in ML, the best known are: the Mahalanobis distance (Mahalanobis 1936), K-means (Lloyd 1982), the (naive or normal) Bayes classifier (Minsky & Selfridge 1961), decision trees (Breiman 1984), boosting (Freund & Schapire 1996), random trees (Breiman 2001), expectation maximization (Dempster et al. 1977), K-nearest neighbors (Fix & Hodges Jr. 1989), artificial neural networks (i.e. multi-layer perceptrons) (Rumelhart 1989) and the SVM (support vector machine) (Vapnik et al. 1997). Each of these algorithms has weaknesses and strong points, a fact well captured by the TANSTAAFL⁶ theorem (Wolpert & Macready 1997). All are included in the ML library which is an integral part of OpenCV.

⁶ The acronym TANSTAAFL stands for There Ain't No Such Thing As A Free Lunch.
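As a hedged illustration of this two-phase scheme – and of OpenCV's ML library, which will matter again when boosting is discussed below – here is a minimal C++ sketch of training and querying a boosted classifier on an already-extracted feature matrix. Matrix contents, sizes and parameter values are placeholder assumptions, not the method actually used in this chapter.

    #include <opencv/cv.h>
    #include <opencv/ml.h>

    int main() {
        // Phase 1 (assumed already done): N training objects, T numeric features each.
        const int N = 177, T = 9;
        CvMat* features = cvCreateMat(N, T, CV_32FC1);  // one row of features per image
        CvMat* labels   = cvCreateMat(N, 1, CV_32FC1);  // 1 = class present, 0 = absent
        // ... fill features and labels during the feature-extraction phase ...

        // Declare all inputs numerical, the response categorical.
        CvMat* varType = cvCreateMat(T + 1, 1, CV_8UC1);
        cvSet(varType, cvScalarAll(CV_VAR_NUMERICAL), NULL);
        CV_MAT_ELEM(*varType, unsigned char, T, 0) = CV_VAR_CATEGORICAL;

        // Phase 2: learn a (features -> label) mapping; here, discrete AdaBoost
        // over 100 depth-1 decision trees (stumps).
        CvBoost boost;
        boost.train(features, CV_ROW_SAMPLE, labels, 0, 0, varType, 0,
                    CvBoostParams(CvBoost::DISCRETE, 100, 0.95, 1, false, 0));

        // An unknown object is then labeled by the trained classifier:
        CvMat* sample = cvCreateMat(1, T, CV_32FC1);
        // ... fill sample with the features of a new, unseen image ...
        float predicted = boost.predict(sample);
        return (int)predicted;
    }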
As for FER itself, recent years have seen a veritable explosion in this domain of study. The collection of (Li & Jain 2005) and the survey article of (Fasel & Luettin 2003) present its most common approaches. They can be divided into three principal groups: wavelet-based FER, model-based FER and contour-based FER.

Wavelets: because of a certain analogy with the functioning of the visual cortex (J. P. Jones & Palmer 1987), Gabor wavelets (Lyons et al. 1998) are often used as features for FER. Even if this approach seems interesting from a theoretical point of view, its computational requirements (notably its slowness) make it almost unusable for real-time recognition (Azcarate et al. 2005). By contrast, the approach exploiting rectangular features resembling Haar wavelets (Lienhart & Maydt 2002) can attain very interesting performance, above all when it is combined with a certain trick of integral images (Viola & M. Jones, 2002) and when the AdaBoost algorithm is chosen to determine the features of interest⁷.

⁷ The theory of Haar features will be explained in the third chapter of this thesis.

Models: the other method for attaining FER is the Active Appearance Model (AAM) proposed by (Cootes et al. 1998). This approach makes it possible to build a statistical model from a certain number of training examples. The model thus built – one can imagine it as a kind of grid or mask – is then matched to an image (Abboud et al. 2004). Since the AAM system is by construction built to take into account the variability of the objects to which the model will be matched, the information linked to this variation can be exploited for the learning and classification of facial expressions (Lucey et al. 2005). Even though this approach deserves to be followed very closely, no open source, i.e. public, solution for AAM existed when we began our internship⁸. As we quickly understood that trying to reproduce the matrix calculus of AAM would be beyond our reach within the frame of our Master's internship, we decided to concentrate on the approach exploiting contours. The approach we attempted to reproduce was that of Moore and Bowden (2007)⁹.

⁸ In the meantime, a positive OpenCV solution has appeared on the site http://code.google.com/p/aam-opencv/
⁹ Other contour-based approaches exist, notably the one called "edge oriented histograms" of (Dalal et al. 2006). Unfortunately we were informed of their existence, by an expert from Aldebaran Robotics, too late for an in-depth study.

Problem

The abstract of the article by Moore and Bowden (2007) states: "This paper introduces a novel method for facial expression recognition, by assembling contour fragments as weak classifiers and boosting them to form a strong accurate classifier. Detection is fast as features are evaluated using an efficient lookup to a chamfer image, which weights the response of the feature."

Figure 8: Images transformed into contours obtained with the cvCanny() function on 6 images expressing different FEs

Let us first explain the basic terms:
• Contour: a curve corresponding to an abrupt change of luminous intensity in an image (i.e. the curve is detected where large differences with the values of neighboring pixels are found). To locate contours one often uses the Canny filter (Canny 1987), which eliminates many false contours since it looks only for connected components. The differentiation of contours is done by hysteresis thresholding, which requires two thresholds, a high one and a low one, treated in OpenCV as parameters of the cvCanny function.
• Contour fragment: a piece of contour cut out in a random fashion; a list of points. Each fragment supplies one feature.
• Chamfer image: an image in which the value q(x,y) of each pixel is proportional to the distance (with respect to X,Y) to the nearest feature present in the original image. The algorithm was initially proposed by (Barrow et al. 1978); we publish our OpenCV version of it in Annex 2.

Figure 9: Chamfer images built from the original images (c.f. Fig. 8). The intensity of each pixel depends on the Euclidean distance to the nearest contour in the original image.

• Chamfer distance: in theory, allows the resemblance between two curves to be measured.
Within these studies, however, it is rather treated as a means of quantifying the extent to which a curve C1, taken from image I1, is present in the same region of image I2. To do this, I2 is transformed into a chamfer image, and the chamfer distance is nothing other than the sum of the values of the points of the chamfer image I2 that have the same coordinates as the points composing C1. It is as if one laid C1 over I2 and then computed the total sum of the pixel values underneath C1.
• Weak classifier: also called a "hypothesis"; an algorithm that gives better results than chance (i.e. on average it errs less than once in two). Within this study, a weak classifier is simply a feature to which is attached a decision tree with a single bifurcation threshold (a stump).
• Strong classifier: a linear combination of weak classifiers; the output of the boosting algorithm; the final result of the learning phase.
• Boosting algorithms: a group of meta-learning algorithms making it possible to choose, among a very large number of possible hypotheses, those which seem the most "telling" and the most "pertinent" for the final classification.
• AdaBoost (adaptive boosting): rests on the iterative selection of weak classifiers as a function of a distribution over the training examples. Each example is weighted according to the difficulty of its classification by the current classifier. (c.f. also Annex 1)

The basic notions of Moore and Bowden's approach thus explained, let us now illustrate their method on a concrete problem: FER on the images belonging to the Japanese Female Facial Expression (JAFFE) image base. JAFFE contains 204 images, and each image is labeled with one of 6 labels (fear, surprise, joy, anger, disgust or sadness). The totality of the 204 images is divided into 2 parts: the training sample, with 177 images, and the testing sample, with 37 images. The first is used to build the strong classifier; the second is used to see how the strong classifier "copes" with examples it has never encountered during the learning phase.

The learning process unfolds as follows: all the images are aligned, then the Canny filter is applied to obtain a binary image (a pixel is either black or white) representing the connected contours, which significantly reduces the quantity of data and eliminates the information judged less pertinent¹⁰. Each image is then "inverted" (c.f. Annex 2) into a chamfer image whose pixel values contain the distance to the nearest contour. For each class C of the facial expressions in question, a number T of contour fragments is taken from the 177/6 ≈ 29 images belonging to class C (having C as their label). Then each of the T fragments is laid, as it were, over each of the 177 chamfer images, and the total sum of the pixel values overlapped by the fragment in question is computed. The result is a number which informs us of the distance between the contour that was part of one of the class-C images and the nearest contour in image X; a minimal sketch of this computation is given below.

¹⁰ While keeping intact the information necessary for determining facial expressions. This follows from the starting hypothesis of (Moore & Bowden, 2007): "the information contained in the contours is sufficient for FER". We refer to this hypothesis as the comic-strip hypothesis.
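A hedged C sketch of this feature computation, assuming the OpenCV 1.x C API (the Canny thresholds are assumptions, and the fragment here is an arbitrary handful of points; in the actual method, fragments are cut from class-C contours):

    #include <opencv/cv.h>
    #include <opencv/highgui.h>

    /* Chamfer "distance" of a contour fragment laid over image X:
       sum of the chamfer-image values under the fragment's points. */
    float chamfer_feature(IplImage* gray, CvPoint* fragment, int n) {
        CvSize sz = cvGetSize(gray);
        IplImage* edges   = cvCreateImage(sz, IPL_DEPTH_8U, 1);
        IplImage* chamfer = cvCreateImage(sz, IPL_DEPTH_32F, 1);
        int i;
        float sum = 0;

        /* 1) connected contours via the Canny filter */
        cvCanny(gray, edges, 50, 150, 3);
        /* 2) distance of every pixel to the nearest contour point:
           cvDistTransform expects zero pixels at the features, so invert the edge map */
        cvNot(edges, edges);
        cvDistTransform(edges, chamfer, CV_DIST_L2, 3, NULL, NULL);

        /* 3) lay the fragment over the chamfer image and sum the values beneath it */
        for (i = 0; i < n; ++i)
            sum += CV_IMAGE_ELEM(chamfer, float, fragment[i].y, fragment[i].x);

        cvReleaseImage(&edges);
        cvReleaseImage(&chamfer);
        return sum;  /* one numeric feature: small if the fragment hugs a contour of X */
    }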
In other words, for each image X among the 177 we thus obtain a vector of T numerical features, a few of which, we hope, will suffice to distinguish the images of class C from the images belonging to the other classes. Say, for example, that it is the class C = joy which interests us and that we decided to use only T = 9 features (i.e. fragments) for the classification; we then obtain 177 vectors (rows) of 9 elements each, e.g.:

232, 324, 772, 552, 923, 789, 87, 124, 87
984, 398, 234, 902, 892, 398, 56, 234, 12
etc.

Moreover, since we are speaking of images belonging to the training part, we also know the class label associated with each vector: 1 if the image expresses joy, 0 if it does not. If the features are extracted in a relevant manner, and if the number T is not too low (in practice, hundreds of thousands of features are used), it is very probable that the information contained in these features will suffice to create a classifier, i.e. to find a sort of mapping11 features → label capable of distinguishing a joyful face from all the others.

10 While keeping intact the information necessary for determining facial expressions. This is due to the initial hypothesis of (Moore & Bowden, 2007): "the information contained in the contours is sufficient for FER". We refer to this hypothesis as the comic-strip hypothesis.
11 In the mathematical sense of the term.

The steps for creating such a classifier are in this case the following: for each feature, that is, for each column of our 177×T matrix, we look for a certain threshold – a certain numerical value – which best separates the classes joy-absent and joy-present. We thus obtain T decision trees12 which will serve us as weak classifiers. But there is little hope that a single feature, a single contour, exists which could make a robust distinction between the class joy-absent and the class joy-present. It seems far more reasonable to suppose that a facial expression is a combination of several features. In other words, it is by combining the weak classifiers with one another in an adequate way that one may hope to arrive at a reliable classification. This is precisely what the AdaBoost algorithm is for. It combines N weak hypotheses – which in our case are decision trees of the form: if value < threshold → class = X; else → class = nonX – in order to find a strong classifier. The condition imposed on the weak classifiers is simple: they must classify better than chance. If a sufficient number of such weak classifiers exists, we can be sure of finding a strong classifier that classifies all the training images without error. This is possible insofar as the algorithm is designed in such a way that the classification error rate on the training examples falls exponentially towards zero with the number N of weak classifiers. This mathematical property of AdaBoost is proven in (Freund & Schapire 1995).
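The following compact sketch illustrates this learning loop under the definitions just given; it is a textbook rendering, not our original code. For each feature column a single-threshold stump is fitted under the current example weights, and AdaBoost reweights the examples and accumulates the N best stumps into the strong classifier.

    #include <cmath>
    #include <vector>

    struct Stump { int feature; double threshold; int polarity; double alpha; };

    // Weak hypothesis: one threshold, two branches (a "stump"), c.f. footnote 12.
    static int predict(const Stump& s, const std::vector<double>& x) {
        return (s.polarity * (x[s.feature] - s.threshold) > 0) ? 1 : 0;
    }

    // Try every (feature, threshold, polarity) and keep the stump with the
    // lowest weighted error under the current example weights w.
    static Stump bestStump(const std::vector<std::vector<double> >& X,
                           const std::vector<int>& y,
                           const std::vector<double>& w, double& errOut) {
        Stump best = {0, 0.0, 1, 0.0};
        errOut = 1.0;
        const int pols[2] = {1, -1};
        for (size_t f = 0; f < X[0].size(); ++f)
            for (size_t i = 0; i < X.size(); ++i)      // observed values as thresholds
                for (int p = 0; p < 2; ++p) {
                    Stump s = {(int)f, X[i][f], pols[p], 0.0};
                    double err = 0.0;
                    for (size_t j = 0; j < X.size(); ++j)
                        if (predict(s, X[j]) != y[j]) err += w[j];
                    if (err < errOut) { errOut = err; best = s; }
                }
        return best;
    }

    // AdaBoost: N weak stumps combined linearly into a strong classifier.
    std::vector<Stump> adaboost(const std::vector<std::vector<double> >& X,
                                const std::vector<int>& y, int N) {
        std::vector<double> w(X.size(), 1.0 / X.size());   // uniform start
        std::vector<Stump> strong;
        for (int round = 0; round < N; ++round) {
            double err;
            Stump s = bestStump(X, y, w, err);
            s.alpha = 0.5 * std::log((1.0 - err) / (err + 1e-12));
            double z = 0.0;      // re-weight: misclassified examples get heavier
            for (size_t j = 0; j < X.size(); ++j) {
                int agree = (predict(s, X[j]) == y[j]) ? 1 : -1;
                w[j] *= std::exp(-s.alpha * agree);
                z += w[j];
            }
            for (size_t j = 0; j < w.size(); ++j) w[j] /= z;   // normalise
            strong.push_back(s);
        }
        return strong;   // final vote: sign of sum_i alpha_i * (2*predict_i - 1)
    }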
Implications

Once AdaBoost has chosen the most "telling" features and combined them linearly, thereby building the strong classifier, the latter can be put to the test by confronting it with the 37 images of the testing sample, that is, with images which were not used during the learning phase. The test results on JAFFE are presented in Table 3.

Table 3: The counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) obtained when classifying the FEs of JAFFE images.

              TP   TN   FP   FN
    Anger      2   22    2    3
    Disgust    3   25    1    0
    Fear       2   24    2    1
    Sadness    2   21    0    6
    Surprise   4   23    0    2
    Joy        6   20    0    3

The term "false positive" (FP) refers to the situation where an image is recognised as containing a given FE while in reality it does not contain it. Conversely, the term "false negative" (FN) refers to the situation where an image is recognised as not containing the FE while in reality it does contain it. The terms "true positive" (TP) and "true negative" (TN) refer to the cases where the algorithm correctly classified the presence, respectively the absence, of an FE in the analysed image.

12 In this case, the simplest possible decision trees are meant, having a single bifurcation threshold and thus only two branches (e.g. if the value of the feature is greater than the value of the threshold, the weak classifier casts a positive vote for membership in the class of interest, and a negative vote otherwise).

Given that this was our first attempt in the domain of machine learning, the results seemed encouraging to us, especially for the class "joy". We therefore decided to put our strong classifier to the test by confronting it with images coming from a sample whose luminosity conditions were completely different from those of the JAFFE database used for training.

Figure 10: The contour fragments (features) chosen by AdaBoost as the most relevant for determining the FE class present.

However, after having obtained from Carnegie Mellon University the authorisation to access the most renowned image database in the FER domain, that of Cohn-Kanade (Kanade et al. 2000), the results we obtained barely exceeded pure chance.

Future prospects

One explanation of this failure may lie in the fact that the approach presented by Moore and Bowden simply does not work as well as its authors report, i.e. that the features obtained from a chamfer image are not relevant for FE classification. Yet even though the authors did not answer the mails in which we asked them for further details about their method, and even though their method has (as of May 2010) not been reproduced anywhere in the literature, we remain persuaded – and Table 3 as well as Figure 10 indicate as much – that the chamfer distance13 between a contour and an image can prove a very efficient means of constructing features sufficiently discriminative for a contour-based classification.
Rather, we explain our failure to classify the images contained in the Cohn-Kanade database by the following factors: 1) we did not align the images with respect to a common reference point; 2) we applied no method to bring the luminosity of the images to the same level, which could be done by parameterising the hysteresis thresholding of the Canny filter so that it always yields roughly the same quantity of contours. We believe that the absence of these two steps prevented us from finding a classification system general enough to be applied even to the image streams coming from Roboto's camera.

13 The advantage of the method proposed above is that, once the chamfer image has been computed, the computation of the values of the features (previously selected by AdaBoost) is very fast to carry out. What is more, one can well imagine that the most demanding phase of the computation – the construction of the chamfer image – need not run on the central processor (CPU) but could be delegated to the processor of the graphics card (GPU). The project called OpenCL (not to be confused with OpenCV) seems to us ideal for attaining this objective.

We thereby understood not only that, in order to face the two problems mentioned in the preceding paragraph, we would need more than a few weeks. Much more: we understood that to build a sufficiently robust and fast FER system we would need, perhaps, the span of a whole doctoral thesis – otherwise such a system would already have been built14. Remaining persuaded that the comic-strip hypothesis is true and that contours can suffice for FER, we finally decided to scale down our objectives and, in accordance with the old maxim "less is more", we concentrated fully on a single FE: the smile.
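As a purely illustrative sketch of the normalisation idea of point 2 above (again assuming the OpenCV 1.x C API): relax the hysteresis thresholds until the image yields roughly a constant quantity of contour pixels. The target ratio, the starting threshold and the decay factor are invented example values, and the simple multiplicative loop is only one conceivable control strategy.

    #include <opencv/cv.h>

    // Returns an edge image holding roughly targetRatio * (number of pixels)
    // contour pixels, whatever the luminosity of the input image.
    IplImage* normalisedCanny(IplImage* gray, double targetRatio) {
        IplImage* edges = cvCreateImage(cvGetSize(gray), IPL_DEPTH_8U, 1);
        double high = 300.0;                 // start strict, then relax
        const int wanted = (int)(targetRatio * gray->width * gray->height);
        for (int iter = 0; iter < 20; ++iter) {
            cvCanny(gray, edges, high / 2.0, high, 3);   // low = high / 2
            if (cvCountNonZero(edges) >= wanted) break;  // enough contours
            high *= 0.8;                     // lower thresholds -> more edges
        }
        return edges;                        // caller releases the image
    }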
Bibliography

Abboud, B., Davoine, F., & Dang, M. (2004). Facial expression recognition and synthesis based on an appearance model. Signal Processing: Image Communication, 19(8), 723–740.
Azcarate, A., Hageloh, F., van de Sande, K., & Valenti, R. (2005). Automatic facial emotion recognition. Universiteit van Amsterdam, June.
Barrow, H. G., Tenenbaum, J. M., Bolles, R. C., & Wolf, H. C. (1978). Parametric correspondence and chamfer matching: Two new techniques for image matching. In Proc. DARPA IU Workshop (pp. 21–27).
Bishop, C. M., et al. (2006). Pattern recognition and machine learning. Springer, New York.
Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media, Inc.
Breiman, L. (1984). Classification and regression trees. Chapman & Hall/CRC.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Canny, J. (1987). A computational approach to edge detection. Readings in Computer Vision, 184.
Cootes, T. F., Edwards, G. J., & Taylor, C. J. (1998). Active appearance models. Computer Vision – ECCV'98, 484.
Dalal, N., Triggs, B., & Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. Computer Vision – ECCV 2006, 428–441.
Dempster, A. P., Laird, N. M., Rubin, D. B., et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1–38.
Fasel, B., & Luettin, J. (2003). Automatic facial expression analysis: a survey. Pattern Recognition, 36(1), 259–275.
Fix, E., & Hodges Jr, J. L. (1989). Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review, 57(3), 238–247.
Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory (pp. 23–37).
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148–156). Citeseer.
Haykin, S. (1994). Neural networks: a comprehensive foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6), 1233.
Kanade, T., Tian, Y., & Cohn, J. F. (2000). Comprehensive database for facial expression analysis. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition (p. 46).
Lienhart, R., & Maydt, J. (2002). An extended set of Haar-like features for rapid object detection. In IEEE ICIP.
Li, S. Z., & Jain, A. K. (2005). Handbook of face recognition. Citeseer.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
Lucey, S., Ashraf, A. B., & Cohn, J. (2005). Investigating spontaneous facial action recognition through AAM representations of the face. Handbook of Face Recognition, 275–286.
Lyons, M., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the 3rd International Conference on Face & Gesture Recognition (p. 200). IEEE Computer Society.
Mahalanobis, P. C. (1936). On the generalized distance in statistics. In Proceedings of the National Institute of Science, Calcutta (Vol. 12, p. 49).
Minsky, M., & Selfridge, O. G. (1961). Learning in random nets. Papers, 335.
Moore, S., & Bowden, R. (2007). Automatic facial expression recognition using boosted discriminatory classifiers. Lecture Notes in Computer Science, 4778, 71.
Rumelhart, D. E. (1989). The architecture of mind: A connectionist approach. 1, 133–159.
Vapnik, V., Golowich, S. E., & Smola, A. (1997). Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 9.
Viola, P., & Jones, M. (2002). Robust real-time object detection. International Journal of Computer Vision, 57(2).
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for search. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.

14 Built and published as open source by one of the dozens, indeed hundreds, of research teams which are aiming at the very same goal at this very moment!

3. Evaluation of the various versions of the smile detector smileD through the imitation game with the robotic face Roboto

Introduction

What is a facial expression? According to (Ekman & Friesen 1977), a facial expression is a combination of action units (AUs), where each AU corresponds to the movement of one of the facial muscles. The same authors postulate, in agreement with (Darwin 1872), that such combinations of AUs exist and are expressed and interpreted in a similar way among all peoples of the world. Even if such anthropological universality of facial expressions has been called into question by a recent study of (Jack et al. 2009) for FEs with negative affective content, the universality of the positive facial expression par excellence, the smile, is in no way affected by it.
Unlike other FEs such as fear, disgust or anger, the smile we shall speak about in this chapter15 is produced by the movement of a single muscle – the zygomaticus major (Ekman & Friesen 1982). But it is not only the simplicity and the universality of the smile which led us to concentrate our forces on the construction of a smile detector as soon as we realised that we had failed to build a more general FER system. Indeed, we have already said that the ultimate objective of our internship was to build an experimental apparatus which could serve in experiments, notably in the domains of development and of autism. Concerning development, there is evidence that the smile is one of the first channels of non-verbal communication between mother and child (Strathearn et al. 2008). As for autism, the study of (Dawson et al. 1990) demonstrated that "autistic children respond much less to the smiles of their mothers than normal children. Moreover, it was found that the mothers of autistic children respond much less to the smiles of their children than the mothers of normal children." Could it be, then, that the smile plays a role in the development of autism? In other words, could it be possible or conceivable that the breakdown of mother ↔ child communication through the channel of the smile is not one of the symptoms but, perhaps, one of the causes of this syndrome? Perhaps Roboto will be able to furnish us an answer.

15 Let us specify that another smile also exists, also called the Duchenne smile (Duchenne de Boulogne 1862), whose production requires the contraction of the muscles around the eyes, i.e. the orbicularis oculi.

Smile detection

But in order for Roboto to be able to answer this question, it must be equipped with a smile detector (SD). A few SDs already exist, such as the one embedded as the "smile shutter" in Sony cameras (Akita et al. 2010) or those reported by (Deniz et al. 2008) or (Whitehill et al. 2007). As none of these solutions is open source, we had to try to build our own SD. The article of (Whitehill et al. 2007) was particularly useful for choosing an appropriate method – its authors compared several feature-extraction methods, namely edge orientation histograms (c.f. note 9), Gabor filters (c.f. page 14) and rectangular filters, each in combination with two machine learning (ML) methods (SVM and boosting). It was the combination of rectangular filters + boosting which proved the most accurate and the fastest. This was good news, for it turned out that these "rectangular filters" are nothing but Haar features, which are very well integrated into the OpenCV library.

Haar features and the integral image

Haar-like features (HFs) are the raw material of the classifier we are about to present. A classifier based on HFs does not classify according to pixel intensities but according to the differences of intensity between two, three or four rectangular regions of pixels. A numerical HF value results from the addition or subtraction of the summed pixel intensities of these rectangular regions.

Figure 11: HFs integrated into OpenCV. In this wavelet representation, the white regions are interpreted as "add this region" and the black regions as "subtract this region" (Bradski & Kaehler, 2008).
Figure 12: Integral image ii. The sum of the pixels contained in the rectangle S (defined by the points A, B, C, D) can be computed by making only 4 references to the points A, B, C, D of the integral image.

These intensity sums of rectangular regions can be computed very rapidly, once one has constructed what the epochal16 article of (Viola & Jones 2001) calls the "integral image". An integral image ii contains at the coordinates x, y the sum of all the pixels of the source image situated above and to the left of the position x, y:

ii(x, y) = Σ_{x'≤x, y'≤y} i(x', y')

16 An article cited more than 3862 times less than 9 years after its publication is without doubt worthy of such an adjective.

In simple terms, the integral image is a mathematical "trick" which makes it possible to compute, very rapidly, once it has been constructed17, the sums of the pixel values within rectangular regions of the analysed image.

17 The main advantage of an integral image is that only a single pass through the source image is needed to construct it, by applying two recurrence equations: s(x,y) = s(x, y−1) + i(x,y); ii(x,y) = ii(x−1, y) + s(x,y). For comparison: the construction of a chamfer image (c.f. Annex 2) requires 2 passes through the source image (forward, then backward).
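A minimal, self-contained sketch of this trick, implementing exactly the two recurrences of footnote 17 and the 4-reference rectangle lookup of Figure 12 (all names are illustrative):

    #include <vector>

    typedef std::vector<std::vector<long> > Matrix;

    // One pass over the source image, using the recurrences of footnote 17:
    // s(x,y) = s(x,y-1) + i(x,y);  ii(x,y) = ii(x-1,y) + s(x,y).
    Matrix integralImage(const Matrix& img) {
        const size_t h = img.size(), w = img[0].size();
        Matrix s(h, std::vector<long>(w, 0)), ii(h, std::vector<long>(w, 0));
        for (size_t y = 0; y < h; ++y)
            for (size_t x = 0; x < w; ++x) {
                s[y][x]  = (y > 0 ? s[y-1][x] : 0) + img[y][x];  // column sums
                ii[y][x] = (x > 0 ? ii[y][x-1] : 0) + s[y][x];   // row accumulation
            }
        return ii;
    }

    // Sum of the pixels of the rectangle (x0,y0)..(x1,y1), inclusive: exactly
    // the 4 references A, B, C, D of Figure 12.
    long rectSum(const Matrix& ii, size_t x0, size_t y0, size_t x1, size_t y1) {
        long A = (x0 > 0 && y0 > 0) ? ii[y0-1][x0-1] : 0;
        long B = (y0 > 0) ? ii[y0-1][x1] : 0;
        long C = (x0 > 0) ? ii[y1][x0-1] : 0;
        long D = ii[y1][x1];
        return D - B - C + A;  // an HF value = additions/subtractions of such sums
    }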
Face detection according to the method of Viola and Jones

The objective of Viola and Jones was to build a robust and fast system for face detection. Their two major contributions – apart from the decision to use HFs as features and to accelerate the computation of their values by means of the integral image – were: 1) to use AdaBoost as a means of choosing the features and building the classifiers; 2) to chain the classifiers into an attentional cascade.

Feature selection by AdaBoost: Knowing that the number of possible HFs of a 24×24-pixel image exceeds 180,00018, it would be impossible to compute them all for every sub-window of the image in which we search for the object (the face, the smile) to be detected. The features must therefore be chosen, and Viola & Jones were the first to propose to do so with AdaBoost, having put forward the hypothesis that "a very small number of features can be combined to form an effective classifier". The method we presented in the preceding chapter was analogous to that of Viola & Jones: we looked for the features which best separate the samples of negative and positive examples. For each feature, the weak-classifier learning algorithm attempts to determine the optimal threshold for classification.

18 Note that, for an image, the number of possible Haar features is much higher than the number of its pixels (24×24 = 576 << 180,000).

The attentional cascade: Up to this point, the face detection procedure comprises the following hierarchy of representations: pixel intensities → sums of pixel intensities within a rectangular region → differences between several regions (HFs) → groups of HFs selected by AdaBoost as the most relevant for classification. The principal contribution of Viola & Jones was to add one more level to this computational hierarchy: to organise the chosen groups of HFs into the nodes of a rejection cascade.

Figure 13: The cascade of Viola & Jones.

The basic idea is that, in order for a search window (SW) to be classified as "containing a face", it must be evaluated as such by all the nodes of the cascade. Conversely, if an SW is classified as "faceless" by any node of the cascade whatsoever, the SW is immediately rejected and the software proceeds to the analysis of a new SW. Once all the SWs of the image have been evaluated in this way, detection is finished for the given image. Moreover, the nodes of the cascade are ordered in such a way that the nodes which are fastest to evaluate (i.e. those composed of the fewest HFs) are placed at the very beginning of the cascade. Thanks to this, a great number of faceless SWs are rejected after the evaluation of a very small number of HFs. According to the article of Viola & Jones, on average only about 10 features need to be evaluated per SW, which obliges the processor to look only about 10 × (640−25) × (480−25) × 4 ≈ 11.2 million times into the memory holding the representation of the integral image in order to find all the faces of size 25×25 pixels in a 640×480-pixel image. The method is thus so fast that today one can often see face detectors based on this principle embedded even in mid-range digital cameras.
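A sketch of the rejection logic just described, reduced to its bare skeleton (the Stage interface is an abstraction for an AdaBoost-trained group of HFs, not OpenCV's internal representation):

    #include <vector>

    struct Stage {   // one node of the cascade: an AdaBoost-selected group of HFs
        virtual bool accepts(int x, int y, double scale) const = 0;
        virtual ~Stage() {}
    };

    // A search window "contains a face" only if every node accepts it; the
    // first rejecting node discards it immediately. Since the cheapest nodes
    // come first, most windows die after the evaluation of only a few HFs.
    bool windowContainsFace(const std::vector<const Stage*>& cascade,
                            int x, int y, double scale) {
        for (size_t n = 0; n < cascade.size(); ++n)
            if (!cascade[n]->accepts(x, y, scale))
                return false;
        return true;
    }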
"Haartraining"

Even though there is much more to say on the subject, it would perhaps be superfluous to try to explain the tour de force of Viola & Jones in more detail within this report. We refer interested readers to the source article (Viola & Jones 2001), as well as to pages 506–516 of the book by (Bradski & Kaehler 2008), where the method is put in relation with OpenCV. It is precisely these pages which present the haartraining software. This software, an integral part of OpenCV, automates the training of classifiers based on the theory of Viola & Jones. Since rectangular regions are at the base of this approach, haartraining makes it possible to train classifiers – or detectors, for a detector is nothing but a classifier applied to every SW – for objects composed of regions or blocks. It would, by contrast, be vain to try to use haartraining to build a classifier capable of recognising tree branches. Fortunately, a zygomatic smile can be considered an object composed of blocks.

SMILEs

In order for haartraining to furnish us with the cascade describing an SD, we must build a sample of positive examples (i.e. composed of images containing a smile) and a sample of negative examples (i.e. composed of images containing no smile). The exact procedure we followed to arrive at the first versions of the SMILEs samples (Smiling Multisource Incremental-Learning-Extensible sample) is described in the article attached as Annex 3. Let us summarise it in a few words: we started from the "Labeled Faces in the Wild" (LFW) image database (Huang et al. 2007), which contains 13080 images (c.f. Figure 14). A small piece of software was programmed to permit – and facilitate – the manual sorting of LFW into two groups, positive and negative examples. For the positive examples, this software also allowed us to mark, by an easy click&drag&drop method, the region of interest (RoI) containing the smiling mouth. After a few hours of rather demanding19 work, we arrived at a positive sample containing 3606 images and a negative sample containing 9474 images.

From these samples, haartraining furnished us with version 0.1 of our smile detector (which we called smileD). We then put this first version to the test by applying it to new images which we were sure contained smiles (i.e. images belonging to the Genki4K database (Whitehill et al. 2007) and/or images we had automatically downloaded from the site flickr.com by searching for the keyword "smile").

Figure 14: A few "positive" examples of RoIs containing smiles, extracted from the LFW database.

As this first version of the detector recognised some smiles there, we could extend the base sample with the new, this time automatically labelled, images, without any manual intervention being necessary. It is precisely for this reason that we put the terms Incremental-Learning-Extensible into the title of our project.

SMILEd

By feeding the 5 different versions of the SMILEs samples to the haartraining software (c.f. paragraph E of the article in Annex 3 for the training parameters), we obtained 5 different versions of the smile detector smileD. SmileD is coupled with a face detector. In other words, the software first tries to detect a face and, if it succeeds, searches the lower central part20 of that face for the smiling mouth. This coupling – which we believe reasonable, since we have not yet seen a smile that was not embedded in a face – was also taken into account during the construction of the training samples, for we put only images of faces without smiles into the sample of negative examples. This sample was meant to contain nothing but background images.

19 Demanding from a cognitive point of view. Indeed it often happened to us, after a few hours devoted to the demarcation of many and many an RoI, that we began to perceive smiles even where there were none.
20 Paragraph 310 of (Da Vinci & Richter 1970) indicates: "The space between the parting of the lips [the mouth] and the base of the nose is one-seventh of the face...The space from the mouth to the bottom of the chin is the fourth part of the face and equal to the width of the mouth...The space from the parting of the lips to the top of the chin, that is where the chin ends and passes into the lower lip of the mouth, is the third of the distance from the parting of the lips to the bottom of the chin and is the twelfth part of the face. From the top to the bottom of the chin is the sixth part of the face and is the fifty fourth part of a man's height".

From our attempts resulted five XML files of 100 to 300 kilobytes, which anyone will be able either to adjust to their own liking or to embed in their own software, as soon as we publish them as an open source package. In order to evaluate two versions (v0.1 and v0.5) of smileD under everyday conditions, we devised 2 experiments in which Roboto rendered us service.

Method

Experiment 1 – Participants: Fifteen participants (11 men, 4 women) were invited to play the "imitation game". They were seated facing Roboto and instructed to "do the same thing as the robot". The distance between the participants' face and the camera varied between 50 and 100 cm, according to the requirements of the participants' comfort.
Experiment 2 – Participant: A single participant (male, 27 years old) played the same imitation game as the participants of Experiment 1. He performed it first in "bearded" mode and then, the next day, in "shaven" mode. For each mode there were two sessions, one at a distance of 50 cm, the other at 100 cm. The luminosity of the environment remained identical across the sessions.

Roboto: During both experiments, Roboto's movement was driven by a program, immitation_game.c, written in C++, which sent the robot the sequences encoding four expressions: smile, surprise, sadness and the neutral FE (c.f. Figure 4 and Table 1). Each sending of a sequence – 23 in total for each subject – was followed by a time interval during which 42 images were recorded. In order to reduce interference between successive expressions, smile, surprise and sadness were always followed by the neutral FE. Conversely, the neutral expression was always followed by one of the three affective expressions, their order being defined at random.

Image analysis: The images were divided into two classes: the positive ones, presumed to contain a smile since they were recorded after the sending of the "smile" instruction; and the negative ones, recorded during the interval following the sending of the "surprise" or "sadness" instruction. Each obtained image (i.e. 23 × 42 = 966 in total per subject) was analysed by the frontal_face detector supplied with the OpenCV library. If a face was detected, the smile detectors smileD v0.1 and v0.5 were put to the test within the region of interest defined by the central three fifths of the lower third of the face. When smileD found no rectangular region likely to contain a smile, the detector returned the value 0. Conversely, if such a region was identified, the function cvHaarDetectObjects(), which is at the heart of smileD, returns the number of mutually overlapping regions, all of which are likely to contain a smile. The integer thus obtained was called the "smile intensity" by (Deniz et al. 2008), and we too refer to it by this name.
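The per-image measurement just described can be summarised by the following sketch, assuming the OpenCV 1.x C API. The ROI arithmetic encodes "the central three fifths of the lower third of the face", and the summing of the neighbors attribute is one plausible reading of the "smile intensity" of (Deniz et al. 2008); detection parameters are illustrative defaults, not our exact settings.

    #include <opencv/cv.h>

    int smileIntensity(IplImage* frame,
                       CvHaarClassifierCascade* faceCascade,
                       CvHaarClassifierCascade* smileCascade,
                       CvMemStorage* storage) {
        cvClearMemStorage(storage);
        CvSeq* faces = cvHaarDetectObjects(frame, faceCascade, storage,
                                           1.1, 3, CV_HAAR_DO_CANNY_PRUNING,
                                           cvSize(25, 25));
        if (!faces || faces->total == 0) return 0;       // no face -> no smile
        CvRect f = ((CvAvgComp*)cvGetSeqElem(faces, 0))->rect;
        // central three fifths of the lower third of the detected face
        CvRect roi = cvRect(f.x + f.width / 5, f.y + 2 * f.height / 3,
                            3 * f.width / 5, f.height / 3);
        cvSetImageROI(frame, roi);
        CvSeq* smiles = cvHaarDetectObjects(frame, smileCascade, storage,
                                            1.1, 1, 0, cvSize(0, 0));
        cvResetImageROI(frame);
        int intensity = 0;  // count of overlapping hit regions ("smile intensity")
        for (int i = 0; smiles && i < smiles->total; ++i)
            intensity += ((CvAvgComp*)cvGetSeqElem(smiles, i))->neighbors;
        return intensity;
    }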
ROC curves: Having thus a "smile intensity" for every image in which a smile had been detected, we could use this quantity as a discrimination threshold (cutoff) to construct ROC (Receiver Operating Characteristic) curves. ROC curves are the most common means of representing the overall performance of a classifier, for they represent the classifier's performance – defined by the number of TPs and FPs relative to the TNs and FNs respectively – under conditions which vary with the value of the discrimination threshold. As the ML community often uses the measure "area under the curve" (AUC) to compare classifiers, we computed the corresponding AUC values thanks to the ROCR library (Sing et al. 2005) of the R language (Team 2006). The first 12 images of each sequence were not taken into account during the construction of the ROC curves, because they corresponded to transition periods between two FEs. Eight ROC curves were constructed, one for each combination of the experimental conditions.

Figure 15: The ROC curves for the various experimental conditions.

Results

Experiment 1: The recorded images turned out to be completely unusable, having been mislabelled owing to a bug in the immitation_game.c software.

Experiment 2: The AUC for version 0.5 of the smileD detector, when put to the test at a distance of 50 cm, was 99.6% for the session during which the subject was bearded and 97.75% when he was shaven. As for smileD version 0.1, still at 50 cm, the performance was 99.4% for the shaven subject but only 90% when the subject was bearded. The detectors proved less accurate when the subject sat one metre from Roboto's camera: more precisely, 58.2% AUC for detector version 0.5 with the bearded subject and 64.4% AUC with the shaven subject. For smileD version 0.1, the results obtained were 69.6% for the bearded mode and 70.1% for the shaven mode.

Figure 16 shows the evolution of the "smile intensity" quantity across the video sequence recorded after the sending of the "smile" instruction to Roboto. Only the data for the 50 cm Distance factor were taken into account. About 7 images after the sending of the instruction, we observe a sharp increase of smile intensity up to a peak reached between the tenth and the twentieth image21. Note that, for the construction of Figure 16, the intensity values obtained were averaged across sessions.

Figure 16: The evolution of smile intensity in time. In both cases, the smile intensity culminates about one second after the expression of the smile by Roboto, then falls back towards smiles considered by the detector as "less marked", then rises again...

Discussion

In the ML community, the AUC measure is often interpreted as "the probability that the classifier will attribute a higher score to a randomly chosen positive example than to a randomly chosen negative one" (Fawcett, 2006). Given that within this research the term score is a synonym for smile intensity, we observe that we succeeded in building an SD (smileD v0.5) which, in more than 99.6% of cases, attributes a higher smile intensity to any image recorded during the time interval following the sending of the "smile" instruction to Roboto than to one recorded during the interval following the sending of the "sadness", "surprise" or "neutral" instruction.

21 Given that the recording speed was about 15 images per second, the intensity peak was reached roughly 1 second after the sending of the instruction to the robot. Unfortunately, all interpretations of Figure 16 in terms of an absolute "reaction time" must be rejected as imprecise, since the position of an image within the sequence gives only indirect information about the timestamp of the recorded image. This is due to the fact that the recording speed varies with the state of the computer at the moment of the experiment, so exact temporal coordinates cannot be inferred from it. Hence the importance of adding, in the next version of immitation_game.c, code that will make it possible to record temporal data with millisecond precision.

What is more, the similarity of the AUC values for the "shaven" and "bearded" modes indicates that we have indeed built an SD robust against a certain variability proper to the object to be recognised. This proposition is further supported by the results of the article in Annex 3, which show that the performance of smileD rises above 90% when confronted with the JAFFE image database.
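For completeness, here is a minimal sketch of the AUC interpretation quoted above (Fawcett, 2006): the probability that a randomly drawn positive example receives a higher score – here, a higher smile intensity – than a randomly drawn negative one, ties counting one half.

    #include <vector>

    double auc(const std::vector<int>& pos,    // smile intensities, positive images
               const std::vector<int>& neg) {  // smile intensities, negative images
        double favourable = 0.0;
        for (size_t i = 0; i < pos.size(); ++i)
            for (size_t j = 0; j < neg.size(); ++j) {
                if (pos[i] > neg[j])        favourable += 1.0;
                else if (pos[i] == neg[j])  favourable += 0.5;
            }
        return favourable / (pos.size() * (double)neg.size());
    }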
Let us note, finally, that neither the images belonging to JAFFE nor any of the images of the participant who underwent the "bearded/shaven" experiment (c.f. Figure 17) were part of the training sample.

Figure 17: The imitation game between Roboto and the bearded subject (sad Roboto / sad subject; surprised Roboto / surprised subject; Roboto smiles... ...a smile is detected!).

On the other hand, the fact that the SD presented above yields only very weak performance when put to the test against images taken from a distance of 100 cm points to one of the weaknesses of the first versions of smileD. Since the approach of recognising objects by means of rectangular Haar features is supposed to be invariant with respect to the size (and hence to the distance) of the object to be recognised (Bradski & Kaehler 2008), the results obtained indicate that the versions of smileD built so far are far from being final solutions. We nonetheless have reasons to believe that the problem linked to the size of the smile will be solved in the weeks to come. Not only do we believe that we finally understand the theory of Viola & Jones as well as the subtleties of the haartraining software; we also suspect that the above-mentioned problem is due to the fact that we used values that were too high, width=43 and height=19, as training parameters for the first five versions of smileD. Beyond that, we believe that the SMILEsmileD project, containing both the SD smileD and the SMILEs training sample, has a certain chance of succeeding, if only it is noticed by the international community, which will refine and enrich it in accordance with the open source philosophy. Such are our hopes (TSNE).

The possible uses of an SD are innumerable. Let us leave aside its commercial applications – such as plugins for interaction within social networks – or its military ones – enabling the system to better recognise individuals who do not conform to the norm (Huxley 1969) and refuse to express themselves like Hollywood stars. Let us leave all that aside and reflect upon two uses worthy of the apparatus we have attempted to build. The first use is linked to the therapy of emotional and affective disorders. It relates to a phenomenon we have seen repeat itself over and over since Roboto arrived at Lutin: though it is nothing but a heap of scrap metal, its smile has always made those who looked at it smile back. We believe the force of this phenomenon can only grow once genuine imitation – a genuine temporal attunement between man and machine – is put in place. The auto-catalysis of good humour will bear its fruits, and a robust SD is the first step towards that objective. The second use is linked to the domain of artificial intelligence (AI) – more precisely to developmental AI, where the use of an SD is reported by (Movellan et al. 2007) – or to the domain of machine pedagogy. Given that 1) the smile is a natural means by which a human being, a human teacher, expresses his contentment; and given that 2) the first teachers of machines are and will be human beings, a smile seems to us the most appropriate means of positive reinforcement (Skinner 1976) of machine behaviour.
The principle is quite simple: the algorithm will give more weight to the representations of performed actions and perceived situations which are immediately followed by a smile.

These, then, are two uses in which the smile plays the principal role. The smile is a gift which allows humans to become more human, a gift which allows them to overcome their syndromes, their illnesses, hatred, even death itself. And who knows whether, one day, it will not allow the same to machines? TSNE.

Bibliography

Akita, M., Marukawa, K., & Tanaka, S. (2010). Imaging apparatus and display control method.
Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media.
Darwin, C. (1872). The expression of the emotions in man and animals; with an introduction, afterword, and commentaries by Paul Ekman. NY: Oxford University.
Da Vinci, L., & Richter, J. P. (1970). The notebooks of Leonardo da Vinci. Dover Publications.
Dawson, G., et al. (1990). Affective exchanges between young autistic children and their mothers. Journal of Abnormal Child Psychology, 18(3), 335–345.
Deniz, O., et al. (2008). Smile detection for user interfaces. Advances in Visual Computing, 602–611.
Duchenne de Boulogne, G. B. (1862). The mechanism of human facial expression. Paris: Jules Renard.
Ekman, P., & Friesen, W. V. (1982). Felt, false, and miserable smiles. Journal of Nonverbal Behavior, 6(4).
Ekman, P., & Friesen, W. V. (1977). Manual for the facial action coding system. Consulting Psychologist.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Huang, G. B., et al. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. University of Massachusetts, Amherst, Technical Report, 57(2), 07–49.
Huxley, A. (1969). Brave New World. 1932. New York: HarperPerennial, 246.
Jack, R. E., et al. (2009). Cultural confusions show that facial expressions are not universal. Current Biology.
Movellan, J. R., et al. (2007). The RUBI project: a progress report. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (p. 339).
Sing, T., et al. (2005). ROCR: visualizing classifier performance in R. Bioinformatics.
Skinner, B. F. (1976). Walden Two revisited. In B. F. Skinner, Walden Two (reissued).
Strathearn, L., et al. (2008). What's in a smile? Maternal brain responses to infant facial cues. Pediatrics, 122(1).
Team, R. D. C. (2006). R: A language and environment for statistical computing.
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. IEEE CVPR 2001.
Whitehill, J., et al. (2007). Developing a practical smile detector. Submitted to PAMI, 3, 5.

Annex 1. The functioning of the AdaBoost algorithm
Annex 2. The construction of chamfer images in OpenCV
Annex 3. Semi-supervised haartraining of a fast&frugal open source zygomatic smile detector

A gift to OpenCV community

Daniel Devatman Hromada, prof. Charles Tijus
Lutin Userlab, Ecole Pratique des Hautes Etudes, Cognition Humaine et Artificielle (ChART), Université Paris 8

Abstract—Five different versions of OpenCV-positive XML haarcascades of zygomatic smile detectors, as well as the five SMILEs samples from which these detectors were derived, have been trained and are presented hereby as a new open source package. The samples have been extended in an incremental learning fashion, exploiting a previously trained detector in order to add and label new elements of the positive example set.
After coupling with an already known face detector, overall AUC performance ranges between 77% and 90.5% when tested on the JAFFE dataset, and a speed of <1 ms per frame is achieved when tested on webcam videos.

Keywords—zygomatic smile detector; cascade of haar feature classifiers; computer vision; semi-supervised machine learning

I. INTRODUCTION

A great amount of work is being done in the domain of facial expression (FE) recognition. Of particular interest is the FE at the very base of mother-baby interaction [1], an FE interpreted unequivocally in all human cultures [2] – the smile. Maybe because of these reasons, maybe because of some others, smile detection is already of certain interest to the computer vision (CV) community – be it for a camera's smile shutter [3] or in order to study robot2children interaction [4]. Nonetheless, a publicly available, i.e. open source, smile detector is missing. This is somewhat stunning, especially given the fact that a "smile" can be conceived as a "blocky" object [5] upon which a machine learning technique based on the training of cascades of boosted haar-feature classifiers [6] can be applied, and that the tools for performing such a training are already publicly available as part of the OpenCV [5] project. Verily, with the exception of the detectors described in [7][8], which have not been publicly released, we did not find any reference to a haarcascade-based smile detector in the literature. We aim to address this issue by making publicly available the initial results of our attempts to construct a sufficiently descriptive SMILing Multisource Incremental-Learning Extensible Sample (SMILEs) and five smile detectors (smileD) generated from this sample. From a more general perspective, our aim was to study whether one can use already generated classifiers in order to facilitate such a semi-supervised extension of the initial sample that a more accurate classifier can subsequently be trained.

A. SMILE sample (SMILEs)

The aim of the SMILEs project is to facilitate and accelerate the construction of smile detectors for anyone willing to build one. Since it is the OpenCV library which dominates the computer vision community, the SMILEs package is adapted to the needs of OpenCV in the sense that it contains: 1) a negative examples directory; 2) a positive examples directory; 3) negatives.idx – a list of files in the negative examples directory; 4) positives.idx – a list of files in the positive examples directory with associated information containing the coordinates of the region of interest (ROI), i.e. the coordinates of the region within which a smile can be located. SMILEs is considered "Multisource" because it originates as an amalgam of already existing datasets like LFW and Genki, both of which are themselves collections of images downloaded from the Internet. Images from the POFA [9] or Cohn-Kanade [10] datasets were not included in SMILEs, since the restricted access to these datasets is in contradiction with the open source approach1 of the SMILEs project.

B. Smile Detector (smileD)

SMILEs are "Incremental-Learning Extensible" in the sense that they allow us to train new versions of smile detectors which are subsequently applied to new image datasets in order to facilitate (or even fully automatise) the labeling of new images, and hence to extend the original SMILEs with new images. Simply stated, SMILEs allow us to train smileD, which helps us to extend SMILEs, etc.
Since the training of haar cascades is an exhaustive threshold-finding process demanding a non-negligible amount of time and computational resources, 5 pregenerated OpenCV-compatible XML smileD haarcascades were trained by the opencv-haartraining application and are included with SMILEs in our open source SMILEsmileD package, so that anybody interested can implement our smile detector in copy&use fashion.

1 Both SMILEs & smileD cascades are publicly available from http://github.com/hromi/SMILEsmileD as a GPL-licensed package. C++ source codes of the select&crop application for easy manual sample creation and of a face-coupled video stream smile detector are included as well.

II. METHOD

C. Initial Training Datasets

The SMILEs project in its current state unites 3 image sets:

• Labeled Faces in the Wild (LFW) dataset – the LFW dataset [11] contains more than 13000 images of faces collected from the web; its cropped version contains only 25×25-pixel regions detected by OpenCV's frontal face detector. No information about the presence/absence of a smile within the image is given.

• Genki4K dataset – Genki4K is a publicly available part of UCSD's Genki project [12] containing 4000 images downloaded from the Internet. A text file indicating the presence/absence of a smile in a given image is included.

• Ad hoc Flickr dataset – We used the search keyword "smile" to download more than 4200 additional pictures from the image-sharing website flickr.com. More than 2600 of them contained at least one smiling face.

D. Construction of SMILEs datasets

We have created five different versions of SMILEs. All these versions exploit the same negative sample set of LFW's non-smiling images. All manual labeling focused solely on the zygomatic smile (ZS) region2:

• Version 0.1 is based solely upon the LFW dataset. All pictures were manually labeled with our ad hoc region selection & cropping application and divided into samples of positive (3606 images) and negative (9474 images) examples.

• Version 0.2 added 2666 manually labeled images downloaded from flickr.com to the positive examples already contained in 0.1. Labeling & region selection were realised with the same application as in the case of 0.1.

• Version 0.3 also extended the positive & negative example samples of version 0.1 with images from flickr. This time, however, the flickr-originated images weren't labeled manually; instead, the smile-containing regions of interest were determined automatically, by applying smileD version 0.1 to the set of downloaded images. 1372 ROIs (1 ROI per image) were identified & labeled in this way.

• Version 0.4 is analogous to version 0.3 in the sense that it is essentially the version 0.1 sample to which automatically labeled positive examples were added. Differently from version 0.3, Genki4K and not flickr was exploited as the source of additional data. Simply stated, the positive examples from Genki4K labeled as smile-containing by its authors – 624 of them in total – were added to the initial LFW-based sample.

• Version 0.5 unites versions 0.3 and 0.4, i.e. both the Genki4K- and flickr-originated images which were automatically labeled by smileD v0.1 were added to the LFW samples.

E. SMILEs -> smileD Training

Identical haarcascade training parameters [width=43, height=19, number of stages=16, stage hit rate=0.995, stage false alarm rate=0.5, weak classifier decision tree depth=1 (i.e. stump), weight trimming rate=0.95] were applied for the training of all five smileD versions, one smileD corresponding to one SMILEs, both referenced by the same version number.
F. smileD evaluation

The training phase of every new version of smileD was followed by measuring its performance on the Japanese Female Facial Expression (JAFFE) dataset, in order to evaluate the performance of the different versions of the smileD classifiers when applied to a sample having luminosity conditions different from those of any imageset included in the training sample. The detectors were face-detector-coupled during testing, i.e. smile detection was performed iff a face was detected in a tested image, and only in the ROI defined by well-known geometric ratios [13]. Receiver operating characteristic (ROC) curves were plotted and AUC ("area under ROC curve") values were calculated as performance measures by means of the ROCR library [14]. "Smile intensity" [7], i.e. the number of overlapping neighboring hit regions3, was used as a cutoff parameter.

III. RESULTS

FIGURE I. SMILED ROC CURVES

TABLE I. BASIC COMPONENTS OF INITIAL VERSIONS OF SMILES & SMILED PROJECT

    Version  LFW manual  Flickr manual  Flickr auto  Genki auto  Total pos.  Neg. ex.
    0.1      3606        0              0            0           3606        9474
    0.2      3606        2666           0            0           6262        9474
    0.3      3606        0              1372         0           4978        9474
    0.4      3606        0              0            624         4230        9474
    0.5      3606        0              1372         624         6572        9474

TABLE II. ROC'S "AREA UNDER CURVE" PERFORMANCE OF DIFFERENT VERSIONS OF SMILED DETECTOR

    Version  AUC
    0.1      77.94%
    0.2      85.49%
    0.3      83.93%
    0.4      90.21%
    0.5      90.51%

2 The ZS region was defined only vaguely as a rectangular ROI in whose center are smiling lips – in preference with uncovered teeth. The whole ROI is bordered by smile & nasolabial wrinkles.
3 Can be obtained from the undocumented neighbors attribute of the CvAvgComp sequence referenced by cvHaarDetectObjects.

DISCUSSION

The detectors we present hereby exploit the top-bottom approach, i.e. they are face-coupled. Knowing that there can be no smile without the face within which it is nested, we first detect the face by an OpenCV face detection solution, and then smileD is applied only in a very limited ROI of the face's bottom third. The consequences of our decision to create a face-coupled smile detector are twofold: 1) since by definition we search for a smile only within the face, we used only non-smiling faces as negative examples (i.e. background images); 2) smile detection itself is very fast, once the position of the face is specified. When applied to webcam-originated (320×240 resolution) video streams, the time needed for smile detection never exceeded 1 ms per frame on a Mobile Intel(R) Pentium(R) 4 CPU (1.8 GHz), suggesting that our detector could potentially be embedded even in mobile devices disposing of more modest computational resources. SmileD's speed can somewhat neutralise the accuracy handicap it has in comparison with the results reported in [8]. In its current state, our approach suffers from somewhat high false alarm rates, but our research indicates that under real-life conditions these can be reduced to a great extent by taking into account the dynamic sequence of subsequent frames, since the probability of the same false alarm occurring in all the frames of a sequence is proportional to the product of the probabilities of the occurrence of that false alarm in every frame taken individually (e.g. a false alarm occurring independently with probability 0.1 in a single frame persists through five consecutive frames with probability of only 0.1^5 = 10^-5). High speed is therefore of utmost importance, and the analysis of sequences of frames can substantially reduce the number of false positives. The tuning of training parameters and the extension of the negative example set remain as other possibilities for augmenting the accuracy of our project.
Tab. 2 indicates that the accuracy of semi-supervised classifiers like smileD gets saturated at a certain limit which can possibly be surmounted only by extension of the negative sample set. In the case of smile detection, we suggest that extension of the negative example sample with more images containing the "upper lip raiser" action unit (AU 10) – teeth-uncovering4 but associated with disgust rather than smile – could yield some significant increases in accuracy, as reported by [9]. Since such an extension is relatively easy and not very time-consuming, provided that such AU10-containing images are given and marked as negative examples, it may be the subject of future research. In this study, however, we left the negative examples unchanged in order to study the effectiveness of the "Incremental Learning" approach, during which an old detector is used to facilitate the extension of the positive example sample, thanks to which a new detector is obtained. Since the semi-supervised smileD versions v0.4 and v0.5 outperformed v0.2, for which manual labeling was implemented, while the latter performed only slightly better than v0.3, which exploited an identical flickr-originated imagebase to that of v0.2, it is not unreasonable to think that such a semi-supervised incremental training approach can be a feasible solution for the training of haarcascade detectors. If that is the case, it could possibly be stated that the machine has started, in a certain sense, to ground [15] its own notion of smile.

4 From the anatomical point of view, the disgust-expressing AU10 is associated with the Levator Labii Superioris muscle, while the smile associates with the Zygomaticus Major muscle (AU12).

ACKNOWLEDGMENT

We would like to thank the third section of EPHE, University Paris 8 and CROUS de Paris for their kind support.

REFERENCES

[1] L. Strathearn, J. Li, P. Fonagy, and P.R. Montague, "What's in a smile? Maternal brain responses to infant facial cues," Pediatrics, vol. 122, 2008, p. 40.
[2] C. Darwin, P. Ekman, and P. Prodger, The expression of the emotions in man and animals, Oxford University Press, USA, 2002.
[3] M. Akita, K. Marukawa, and S. Tanaka, "Imaging apparatus and display control method," 2010.
[4] J.R. Movellan, F. Tanaka, I.R. Fasel, C. Taylor, P. Ruvolo, and M. Eckhardt, "The RUBI project: a progress report," Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, 2007, p. 339.
[5] G. Bradski and A. Kaehler, Learning OpenCV, O'Reilly Media, Inc., 2008.
[6] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. IEEE CVPR 2001.
[7] O. Deniz, M. Castrillon, J. Lorenzo, L. Anton, and G. Bueno, "Smile detection for user interfaces," Advances in Visual Computing, p. 602–611.
[8] J. Whitehill, M. Bartlett, G. Littlewort, I. Fasel, and J. Movellan, "Developing a practical smile detector," Submitted to PAMI, vol. 3, 2007, p. 5.
[9] P. Ekman and W.V. Friesen, Pictures of facial affect, Palo Alto, CA: Consulting Psychologists Press, 1976.
[10] T. Kanade, Y. Tian, and J.F. Cohn, "Comprehensive database for facial expression analysis," Proc. IEEE FG, 2000, p. 46.
[11] G.B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Technical Report, vol. 57, 2007, p. 07–49.
[12] J. Whitehill, G. Littlewort, I. Fasel, M. Bartlett, and J. Movellan, "Toward practical smile detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, p. 2106–2111.
[13] L. Da Vinci and J.P. Richter, The notebooks of Leonardo da Vinci, Dover Publications, 1970.
[14] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, "ROCR: visualizing classifier performance in R," Bioinformatics, 2005.
[15] S. Harnad, "The symbol grounding problem," Physica D, vol. 42, 1990, p. 335–346.

Epiprologue22

Monsieur Charles Tijus
LUTIN – UMS CNRS 2809
Cité des Sciences et de l'Industrie

During my M2 internship of the CNA SVT studies at the E.P.H.E., I would like to teach robots to smile. To do so successfully, two questions must first be answered: How? and When?

To answer the question "How?" I shall have to study the theory of emotional expressions; I shall have to understand what happens on the face when one smiles – "which muscles are relaxed? which are contracted?" etc. In short, I will study the theory of emotions and their facial expressions. Then, to bring this theory closer to the reality of robots, I will study the manual, the functioning and the "instruction set" of the robotic face "Roboto". I will create a little script in the PERL programming language (output of the first semester???) thanks to which I will be able to send commands like … to this face.

As to the question "When will the robot23 smile?", I answer right away: she will smile when the human being she sees smiles at her. Thus the robot will imitate the human being in front of her, like a baby who does the same from its first moments in this world. Even if the basic question is thereby answered, the technical aspect of this imitation poses problems which I would like to try to address during my internship. In order to be able to mimic, the robot must first recognise which emotion is to be mimicked. A means must therefore be found of recognising the emotion from the data contained in the photo taken by the cameras in the robot's eyes. Since in our world it is certain that someone has already posed this question and answered it at least partially, I will first study "the state of the art of automatic recognition of facial emotions" in order then to choose the appropriate method for achieving it (the initial candidates for this method are: artificial neural networks, SVM (support vector machine), the openCV library or the faceAPI software; or else a hybrid system of these solutions). My objective will above all be an open source piece of software or a PERL library (output of the second semester???) which will allow other researchers to interact with machines by means of the facial expressions of emotions in an easier and more efficient way.

Subsequently, the robotic face Roboto, capable of mimicking at least 2-3 basic emotions, could be used not only within the framework of several research experiments, including those on the emotional intelligence of autistic children, but also as a bridge towards those artificial beings, some of which come ever closer to the very essence of humanity.

Paris, 18/06/2009, Daniel D. Hromada

22 We leave this "internship proposal" in its original version, i.e. with all its spelling mistakes (here rendered into English).
23 It is only here that we say out loud what we have wanted to say since the beginning of our work: the word robot comes from the work R.U.R. of the author Karel Čapek and is derived from a word common to several Slavic languages, [robota], signifying "work" or, better still, "corvée". This word is of feminine gender (declension paradigm: /žena/, i.e. woman).
It is perhaps for this reason that, throughout our internship, we considered Roboto a being of feminine rather than masculine essence. That said, there is nothing to add, except ...

Small treatise concerning the concepts of «invasivity» and «reversibility» and their relation to past, present and future techniques of neural imagery

Introduction
The aim of this text is threefold. Firstly, to prove to the Teacher that the author of this article (i.e. the Student) has sufficiently internalized the facts presented during the UE Neuroimagery. Secondly, the Student aims to introduce the notion of «invasivity» as something which should be considered very seriously by anyone who seeks an «ideal method» for conducting his future (neuro)scientific experiments towards success. But the ultimate aim is to show that certain «philosophical schools» which point to the «invasivity-related aspects» of current neuroscientific research are not doing so from the position of moralizing savants locked in their ivory towers; they do so for concrete and highly pragmatic reasons related to the purest expressions of the highest scientific practice.

The principal thesis of this text states that the «invasivity» and «reversibility» aspects of a chosen experimental method should determine the experimenter's choice at least as significantly as other aspects like spatial/temporal resolution characteristics, signal/noise ratio or economic feasibility.

The first part of the text is dedicated to highly invasive techniques of tissue extraction and analysis by means of electron, multiphoton or confocal microscopes. Post mortem autopsy and surgical interventions like vivisection or lobotomy will be mentioned when discussing this group. The common denominator of these approaches is that the sine qua non condition of their realisation is a non-reversible and fatal degradation of one of the vital functions of the organism under study, or... death.

The second part of the text is dedicated to somewhat more reversible, nonetheless still very brutal «in vivo» techniques like calcium imaging, optical imaging or electrode implantation. Because it is evident that such approaches can inflict severe injuries and suffering upon the organisms under study, they will be labeled «partially reversible quasi in vivo techniques». Contrary to the common categorisation of our days, even techniques like PET (positron emission tomography) or X-ray imaging will be included in this middle group of partially invasive techniques. This is due to their high-energy kinship with radioactivity, which can without any doubt induce mutations resulting in the disequilibrium of a living system commonly known as «loss of health». The loss of this precious equilibrium is the reason why we will include all the luminescence/fluorescence marker techniques in this category as well.

The third part of the text aims to bring hope. It will be fully devoted to techniques which can be considered fully reversible: the focus will definitely be on Magnetic Resonance Imaging (MRI) and Electroencephalography (EEG), while other non-invasive techniques (NIRS, echography or TCD) will be excluded from the list due to the lack of the Student's personal experience with them. A small part of this final part will be dedicated to a «what if?» speculation proposing to use these pure and elegant techniques not only for imaging, but also as a tool of healing practice.
These three parts can be considered the core of the Student's homework, which asks him to «highlight the advantages and limits of these techniques depending on the scientific question You'll pose». The question posed by the Student is this: «According to what criteria could we possibly quantify the invasivity of an experimental tool or method?» This text will try to answer this question by introducing a term which we hereby label the «Information/Invasivity Quotient» (IIQ). We'll analyse this notion from a more ethical perspective in the Discussion section, while the Appendix will summarize an IIQ-based ranking of four presented methods.

1. Non-reversible techniques

In every living germ there is a creative idea which develops and manifests itself through organisation. During its whole lifetime, the living being remains under the influence of this creative vital force, and death comes when it can no longer realise itself.
Claude Bernard, «prince of vivisectors»

1.1 Death
Death is a transformation of a system from a living state into a non-living state. It is evident that the introduction of death into an experimental procedure leads to a non-reversible loss of structure, and hence its IIQ1 should have a value less than zero. Because of its essentially qualitative nature, it is very difficult, if not impossible, to quantify the invasivity of such a transformation. One approach – a strongly categoric one – could be to define its value as «minus infinity», but by introducing infinities into our quantification schema we would de facto exclude and forbid the killing of an animal during the experimental procedure. We doubt that such an approach could be accepted by the contemporary scientific community. We propose a somewhat more pragmatic and less categoric approach: introducing death into an experimental procedure should decrease the procedure's IIQ to an extent which is proportional to the complexity of the organism under study. Hence, for example, for procedures demanding «sacrifices» of complex animals like primates, the IIQ should be -7; for other vertebrates it could be somewhere around -5; -3 for insects; -1 for plants etc.2

Experimental techniques whose implementation implies death can be divided into:

Macroscopic: namely, the surgical in vivo procedure called vivisection. Aristotle introduced it, Galen made a science out of it and the Western tradition perfected it. The application of this technique in physiology in general, and in the domain of Neuroimagery in particular, is today considered obsolete.

Microscopic: when applied in the domain of biology, physiology and the neurosciences in general, microscopes are devoted to the study of tissues. This tissue is either extracted (c.f. Section 1.2 below) or studied in vivo (we'll refer to this partially in Sections 2.1 and 2.2). In either case, one first has to gain access to the tissue. The harm to the organism under study is often so severe that the only thing one can do with the animal after an experiment (if it does not die on its own) is to kill it. Since «it costs only 2 euros a piece», as one of our teachers told us, approximately 50-100 million (Hendriksen, 2005) bodies of dead vertebrates are annually thrown into the waste baskets of academic institutions.

When speaking about the role of death in experimental Neuroscience, one should not omit revolutionary works like (Broca, 1861) or (Wernicke, 1874). Since these were post mortem studies, i.e. the subject died a natural death, the role of death did not decrease the IIQ of the given study. On the contrary, the IIQ of these studies is highly positive, since no suffering was caused and a huge amount of new information/knowledge was obtained. It is possible that even in the forthcoming century of nanotechnology, such post mortem studies have not said their last word. They could prove particularly fecund when combined with highly advanced cryogenic methods.

1.2 Cuts
Death being the most drastic, it is definitely not the only transformation during which information or a certain functional feature is irreversibly lost from the brain. Neurosurgical procedures like lobotomy or callosotomy (the disconnection of the cerebral hemispheres by cutting the central wiring of the brain, the corpus callosum) left aside, we suggest that even procedures like skull penetration (SP) and tissue extraction (TE) of even a thin cortical layer are acts of an irreversible nature. For the purpose of this homework, it has to be stated that electron microscopy cannot be done without a preliminary TE procedure. It can be argued, of course, that the plasticity of the brain is very high and that this amazing organ is able to recover even from severe TE. If such is the case, one can ask why an animal is usually killed after a TE-implying procedure. To reduce the number of such cases in the future, we propose to calculate the Information/Invasivity Quotient of TE and SP by these example formulas:

IIQ_TE = P_TE * (amount of tissue extracted / total amount of brain tissue)
IIQ_SP = P_SP * (damaged skull surface / total size of skull surface)

where P_TE and P_SP are «tissue extraction penalization» and «skull penetration penalization» coefficients which should, ideally, be defined by ethical committees independently for every species «involved» in experimental studies. Our highly arbitrary initial proposal is -1 > P_TE > -3 and -1 > P_SP > -2.

1 The basic axiom of the Information/Invasivity Quotient schema can be defined like this: an act which leads to a loss of vital information decreases the procedure's IIQ, while an act which generates new information (or even knowledge) increases the procedure's IIQ. For a more technical definition of what information is, see (Shannon & Weaver, 1949).
2 These numbers are more or less arbitrary and are subject to scientific discussion; we present them hereby just in order to clarify our «invasivity quantification» point.
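A minimal numeric sketch of these two example formulas follows; all values are hypothetical illustrations rather than measurements, and the coefficients are taken from the arbitrary ranges proposed above:

    use strict;
    use warnings;

    # Hypothetical penalization coefficients, taken from the proposed
    # ranges -1 > P_TE > -3 and -1 > P_SP > -2.
    my $P_TE = -2;
    my $P_SP = -1.5;

    # IIQ_TE: penalty proportional to the fraction of brain tissue extracted.
    sub iiq_te {
        my ($extracted, $total) = @_;    # e.g. in mm^3
        return $P_TE * ($extracted / $total);
    }

    # IIQ_SP: penalty proportional to the fraction of skull surface damaged.
    sub iiq_sp {
        my ($damaged, $total) = @_;      # e.g. in mm^2
        return $P_SP * ($damaged / $total);
    }

    # Hypothetical example: 5 mm^3 extracted out of 500 mm^3 of brain
    # tissue; 20 mm^2 of a 10000 mm^2 skull surface penetrated.
    printf "IIQ_TE = %.4f\n", iiq_te(5, 500);      # -0.0200
    printf "IIQ_SP = %.4f\n", iiq_sp(20, 10000);   # -0.0030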
2. Partially reversible «quasi in vivo» techniques

I quite agree that it is justifiable for real investigations on physiology; but not for mere damnable and detestable curiosity. It is a subject which makes me sick with horror, so I will not say another word about it, else I shall not sleep to-night.
Charles Darwin

2.1 Injections & Injuries
We hope that our method for invasivity quantification is acquiring more visible contours. It is now time to illustrate it on a concrete example. A mouse is «constructed» in such a way that the gene coding the enzyme «luciferase» will get expressed when switched on by the presence of an oncogene and of heat in the environment. When the mouse is sufficiently ripe for being «sacrificed», tumor replication is then activated by an oncogene injected into the body, let's say into the brain area. The experiment then consists in applying heat to the mouse's head; this will activate luciferase expression in the tumor cells. Luciferase will catalyse the light-emitting reaction of luciferin, a photoluminescent substance (present in fireflies, for example), which will emit light and give the experimenter information about the spatial distribution of tumors. Such is often the philosophy behind «optical imaging» experiments.
Highly sensitive CCD cameras incorporated into blackboxes costing hundreds of thousands of euros will then produce the final result: a low-resolution image from which it is evident that light (and therefore tumor cells) is present in the head of the animal. The discovery that «tumor cells are spreading from the area where the experimenter had injected them» is indeed stunning and worth publishing – one can hope for the obtention of new grants for new apparati3.

Another example, this one from the domain of «calcium imaging»: a bee is taken from her hive. She is fixed in the apparatus, anesthetized, and the top part of her «head» is removed. A dextran- or acetoxymethyl-ester-like molecule is chosen from the catalogue of the Alexa or Oregon corporation, bought, and injected into the upper layer of her central ganglion, upon which the confocal microscope's laser is focalised. The «stimuli» are given after the bee awakes from anesthesia. The possibility to observe calcium (and thus activation flow) in the cerebral networks is without any doubt a huge and non-negligible advantage of calcium «imaging techniques». It unites two important characteristics – it is microscopic and it is functional. In other words, its spatial resolution is very high (depending on the microscope, it can go down to nanometers) and its temporal resolution is almost realtime. Nonetheless, it has to be stated that the result of this technique is – apart from an invoice from the Oregon or Alexa corporations – an image with a few blinking pixel clusters supposedly containing non-generalizable information about the functioning of a minute part of a ganglion of the unlucky bee dying slowly in horrible pains.

3 And it is evident that the presence of a new experimental apparatus has to be justified by new «sacrifices».

It is evident that the suffering about which we are speaking here cannot be quantified, cannot be transformed into numbers. But since it seems that men and women in white coats believe only in numbers, and since it seems to us that it is of utmost importance to change the habits of these men and women as soon as possible, we have to try, at least. In addition to the already proposed IIQ_DEATH, IIQ_TE and IIQ_SP factors, we propose these further criteria for the quantification of the invasivity and moral acceptability of an experimental method:

IIQ_INJURY – penalization due to injury; proportional to the time which the animal will need for complete recovery.
IIQ_FIXATION – penalization due to the fixation of the animal in the apparatus; relative to the means and proportional to the temporal length of the fixation. Zero iff the animal is studied in its natural niche.
IIQ_BLEACHING – penalization due to tissue bleaching by strong microscopes (confocal and multiphoton).
IIQ_GENEMANIP – penalization due to the number and nature of genetic modifications (any additional modification makes the experiment more specific, more artificial and hence less generalizable and useful).
IIQ_ONCOINJECTION – penalization due to tumor induction.
IIQ_TOXIC – depends on the number and nature of substances classified as toxic which have been injected into the animal because of the experiment.
IIQ_NONTOXIC – the same, but for nontoxic substances; includes fluorescence and luminescence markers. The fact that they are considered non-toxic (especially by the companies who produce them) doesn't mean that they don't have a significant influence upon the overall equilibrium of the studied system and hence upon the scientific significance of the results.
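To make the bookkeeping behind such a catalogue of criteria concrete, here is a minimal sketch of how the negative factors of a given experiment could be recorded and summed; the penalty magnitudes are purely hypothetical placeholders, since the text proposes no concrete values for these factors:

    use strict;
    use warnings;

    # Hypothetical penalty values for the optical-imaging mouse experiment
    # described above; the magnitudes are illustrative placeholders only.
    my %iiq_negatives = (
        INJURY        => -1.5,   # long post-operative recovery
        FIXATION      => -0.5,   # fixed in the apparatus, not in natural niche
        GENEMANIP     => -1.0,   # luciferase construct
        ONCOINJECTION => -2.0,   # tumor induction
        TOXIC         => -1.0,
        NONTOXIC      => -0.5,   # luminescence markers
    );

    # IIQ_NEGATIVES is simply the sum of all applicable penalty factors.
    my $iiq_negatives = 0;
    $iiq_negatives += $_ for values %iiq_negatives;
    printf "IIQ_NEGATIVES = %.1f\n", $iiq_negatives;   # -6.5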
2.2 Isotopes & Implants
The methods we mentioned in the preceding parts were presented to the students during their Neuroimagery course, and this is the reason why we have been mentioning them. They may seem interesting to biologists or chemists, but not necessarily so to cognitive scientists. The reason for this statement is the fact that (with the exception of Broca's and Wernicke's discoveries) no information about high-level functions (memory, attention, language, etc.) is obtained by the application of such methods. On the contrary, the methods we shall discuss from this sentence on are of high interest to anybody whose interest doesn't stop at the level of tissue but goes further – towards the mind itself.

The crudest approach by which one can obtain information about the high-level functions of the neural system is by means of electrode implantation into the brain. Since not much was told to the students about this approach, let it be said that the introduction of such an approach should be penalized not only by the IIQ_TE and IIQ_SP factors, but also by a new factor IIQ_IMPLANTS, which should be proportional to the number and size of the implanted sensors, as well as to the depth of the implantation/invasion.

A much more subtle approach by which one can observe the mysterious relations between mind and brain is by means of radioactivity. The most attractive approach is so-called Positron Emission Tomography (PET), based upon the detection of gamma rays emitted by a positron-emitting radionuclide tracer which was injected into the body. If the tracer is fludeoxyglucose – an analogue of glucose – one can deduce the metabolic activity (glucose uptake) of different brain regions by simply observing the radiation (proportional to the FDG concentration) of the different regions. From the invasivity point of view, one should take into account an IIQ_RADIODECAY factor proportional to the half-life of the tracer's decay. In order to have such tracers, PET demands the proximity of a radionuclide-producing cyclotron. Such a cyclotron can possibly be toxic to its environment.

PET is often coupled with a classical X-ray CT scan. Since a CT scan also uses high-frequency electromagnetic waves as a medium for carrying the signal, an IIQ_RADIOGAMMA penalization – proportional to the energy level of the ray – should not be forgotten in its case. Another disadvantage of CT is that it furnishes only anatomic (and not functional) information. It remains, however, the most used apparatus in clinical (neuro)imaging practice, which is definitely due to its relatively low price and high reliability.

3. Reversible techniques

The exploration of the mind is barely beginning; it will be the principal task of the era opening before us, just as the exploration of the globe was that of the preceding centuries.
Thomas Huxley

3.1 Fields
From the point of view of the cognitive sciences, the most attractive methods for the study of brain and mind are the highly functional non-invasive methods of fMRI & EEG/MEG. All of them exploit, in a certain sense, the «electromagnetic field»-related characteristics of the human brain. The ElectroEncephaloGram (EEG), discovered by Berger in 1924, exploits the fact that the electric fields of activated cortical neurons – especially the pyramidal ones – sum up with each other and produce an overall electric response which is measurable even on the outer surface of the skull. Hence an invasion into the interior of the organism is not necessary, the electrodes are placed on the scalp, and the only act of violence related to EEG measurement is due to movement-related artefacts – if the organism moves, the measurement is strongly perturbed.
Hence the only negative factor of EEG is IIQ_FIXATION. The negative factor of «unnatural fixation in the apparatus» is present as well during experiments using Magnetic Resonance Imaging (MRI). MRI has two modes of functioning – anatomic and functional. Both exploit the properties of hydrogen protons, which are susceptible to align their spins when exposed to a powerful magnetic field. Subsequently, the protons are excited out of this «equilibrium state» by strong radio waves. From the time-related distribution of the emitted photons, one can subsequently reconstruct the overall map of matter in the skull. In the case of functional MRI (fMRI), the so-called BOLD effect is exploited as well – thanks to a certain property of hemoglobin, which is ferromagnetic when oxygenized and paramagnetic when the contrary is the case. Therefore one can be informed about the blood flow in the region of interest (ROI). Since the augmentation/diminution of blood flow in the ROI is related to the augmentation/diminution of neural activity in its proximity, fMRI gives us this very precious information. The only other negative factor of fMRI is IIQ_HEAT, since it seems that longer exposure to an fMRI device can lead to a slight augmentation of body temperature. Since this is in the order of approximately 1 degree Celsius, the IIQ_HEAT penalization is definitely smaller than in the «mouse-feet burning» experiments of optical imagery.

But in general it can be said that EEG as well as MRI are definitely positive approaches when analysed through the prism of the «Information/Invasivity Quotient» schema. This is due to the huge «information contribution» factor, i.e. due to the fact that these apparati produce a huge amount of information. To calculate the «information contribution», one should take into account these factors:

1) R_S – spatial resolution (voxels per skull volume or electrodes per skull volume)
2) R_T – temporal resolution
3) SN – signal/noise ratio
4) T – overall time-length of the data capture
5) I – sensor sensitivity, i.e. the number of degrees of freedom of an individual sensor (for example, the number of possible intensity values in the case of a CCD pixel)

The output of the simple formula

IIQ_INFOCONTRIB = R_S * R_T * SN * T * I

is a hypothetical overall amount of pure information (purified signal) obtained during the experiment. As we have already stated, this IIQ_INFOCONTRIB component is very high in the case of EEG and fMRI. In the former case it is due to a very high R_T (the dataset size obtained from one experiment is in the order of megabytes), while in the case of the latter it is due to a very high R_S (the dataset size obtained from one experiment is in the order of hundreds of megabytes, even gigabytes). By a subsequent logarithmization of these information contribution quantities (for example log10(megabyte) ≈ 6; log10(gigabyte) ≈ 9) one gets numbers which can be more easily used in the final IIQ equation (c.f. Appendix).

3.2 Life
Since the students weren't introduced to other non-invasive methods like the MagnetoEncephaloGram (MEG), Near-Infrared Spectroscopy (NIRS), Transcranial Doppler (TCD) or simple ultrasound imaging, we'll not concentrate upon these methods in this article. What we will concentrate upon in this concluding paragraph is this set of hypotheses: It is obvious that the brain is an electromagnetic-field generating device. Many indices suggest as well that the brain is susceptible to EM-field stimulation.
It may thus be that the brain sustains its internal equilibrium by means of its own EM field (the skull functioning as a resonator, the glial cells as amplifiers). How come modern science is completely blind to the power of field-based techniques and stays obsessed with its poisonous molecules, pills and deadly rays? After his first experience of meditation in a 3-Tesla MRI in Bordeaux, the Student is deeply persuaded that these most sophisticated devices ever created by humanity4 can be used not only for imagery, but for healing as well – for burning a tumor in a much more subtle way than an X-ray could ever do.

Discussion

It is becoming indispensable that humanity formulate a new mode of thinking if it wants to survive and reach a higher plane.
Albert Einstein

This text is written by a student of the Practical School of High Studies. Maybe the term «High Studies» is interpreted in a wrong manner by the Student; nonetheless his conscience obliges him to state his belief that the ultimate goal of his studies is scientia, and we have known for ages already that true scientia reposes on the discovery of general principles. The more general the principles, the higher the science.

This text is written by a young man who, at a certain moment in his life, came into contact with so-called «oriental» philosophy and science. The foremost ethical principle of Eastern thought can be stated thus: «there exists a causal cause-effect relation not only on the material, but also on the axiological – i.e. moral – level». This principle is known as «the law of Karma» in the East. The Western tradition knew it as well: «as You sow, so shall You reap» was said thousands of years ago, and was later translated into a Golden Rule before finally finding its most general form in the Categorical Imperative (Kant, 1785). But even Kant made a mistake: he excluded animals from the implementation of this principle.

This text is written by a cognitive science student aiming to program an Artificial Intelligence (A.I.) system. Since it is not a secret that the ultimate goal of Robotics & A.I. research is the emergence of a thinking and acting entity whose skills will be superior to those of a human being, we appeal to all those men and women of scientia who have ears to hear and eyes to see: if You do not reconsider Your practices immediately5, You will not be able to exclude the possibility that the future superiors will do to You the same thing as You did to Your inferiors.

To conclude: we state hereby that IF the principle of Karma is true (and we suggest that the whole of human history has not falsified it), then an experimental method which does not take it into account is doomed to fail, since, ex vi termini, one cannot heal cancer by injecting cancer into healthy beings. To conclude: the law of Karma states that You simply cannot have good scientific results if Your method for achieving them is not good either. To conclude: if we were «moralizing», we truly did it out of pragmatic concerns.

4 Nothing excludes, in theory, the exploitation of MRI devices as macroscopic quantum computation machines, but to analyse this here would take us too far afield.
5 Shubhasya shiighram ashubhasya kálaharańam (Do virtue immediately, delay doing vice.)

Appendix – Towards a concrete implementation of the Information/Invasivity Quotient

The Western spirit is in the truth only through its methods and its techniques. The Eastern spirit is in the truth only in its general tendencies. The exchange is necessary.
Georges I. Gurdjieff

Our «Information/Invasivity Quotient» proposal for the estimation is simple. On one side of the equation we put all the «invasivity»-related factors – quantified and weighted according to common international conventions. We label the resulting sum of all the quantified invasivity factors IIQ_NEGATIVES, i.e.

IIQ_NEGATIVES = IIQ_DEATH + IIQ_TE + IIQ_SP + IIQ_INJURY + IIQ_FIXATION + IIQ_BLEACHING + IIQ_GENEMANIP + IIQ_TOXIC + IIQ_NONTOXIC + IIQ_RADIODECAY + IIQ_RADIOGAMMA + IIQ_HEAT

On the other side of the equation we put the weighted IIQ_POSITIVES factor. Since IIQ_INFOCONTRIB gives us the pure information content in bits, we weight it by means of the logarithm function to make it comparable with IIQ_NEGATIVES:

IIQ_POSITIVES = log(IIQ_INFOCONTRIB)

The basic imperative of the Information/Invasivity Quotient heuristics states that if

IIQ_POSITIVES – IIQ_NEGATIVES < 0

then the amount of pure signal (information) generated by an experiment is not sufficient to justify the harm caused to the organism, and therefore such an experiment should not be performed. A very naive (and somewhat arbitrary) illustration of our point is presented in the following table, representing the negative and positive aspects of an experiment lasting approximately 1 hour:

Method          | List of IIQ_NEGATIVES                                                                              | N of IIQ_NEGATIVES | IIQ_POSITIVES = log10(IIQ_INFOCONTRIB) | IIQ_POSITIVES – IIQ_NEGATIVES | Decision
Optical in vivo | IIQ_INJURY + IIQ_HEAT + IIQ_NONTOXIC + IIQ_GENEMANIP + IIQ_TOXIC + IIQ_ONCOINJECTION + ???IIQ_DEATH | 7                  | 3                                       | < 0                           | reject
EEG             | IIQ_FIXATION                                                                                       | 1                  | 6                                       | > 0                           | accept
MRI             | IIQ_FIXATION, IIQ_HEAT                                                                             | 2                  | 7                                       | > 0                           | accept
NIRS            | none                                                                                               | 0                  | 5                                       | > 0                           | accept
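A compact sketch of this decision heuristic follows; it merely restates the table above in executable form, with each listed negative factor counted as a unit penalty – a hypothetical simplification, as everywhere in this proposal:

    use strict;
    use warnings;

    # Each method is described by its number of negative factors (unit
    # penalties, a hypothetical simplification) and by log10 of its
    # information contribution.
    my %experiments = (
        'Optical in vivo' => { negatives => 7, log_infocontrib => 3 },
        'EEG'             => { negatives => 1, log_infocontrib => 6 },
        'MRI'             => { negatives => 2, log_infocontrib => 7 },
        'NIRS'            => { negatives => 0, log_infocontrib => 5 },
    );

    # Basic imperative: reject iff IIQ_POSITIVES - IIQ_NEGATIVES < 0.
    for my $method (sort keys %experiments) {
        my $e        = $experiments{$method};
        my $balance  = $e->{log_infocontrib} - $e->{negatives};
        my $decision = $balance < 0 ? 'reject' : 'accept';
        printf "%-16s balance=%3d -> %s\n", $method, $balance, $decision;
    }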
Bibliography

To find first. To search afterwards.
Jean Cocteau

Broca, P. (1861). Remarques sur le siège de la faculté du langage articulé, suivies d'une observation d'aphémie (Perte de la Parole). Bulletin de la Société Anatomique, 6, 330–357.
Hendriksen, C. F. M. (2005). The ethics of research involving animals: a review of the Nuffield Council on Bioethics report from a three Rs perspective. Alternatives to Laboratory Animals: ATLA, 33(6), 659-662.
Kant, I. (1785). Groundwork of the Metaphysic of Morals. First published.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of information. University of Illinois Press, 97.

Methodology of research into the relation between the act of vyňadreniai performed by females of the species homo sapiens sapiens and subsequent changes in the distribution of social capital within the community gathered around the Internet discussion system kyberia.sk

Written for the Faculty of Humanities of Charles University in Prague as an assignment in methodology by Daniel Hromada, UČO 9306

1. Introduction – Intuition

And so, in the end, you will not want to part from the breasts of science.1

I have decided to analyse socio-psychological phenomena within a community which I know intimately, not only because I was its member for several years, but also because I established the community in question. It is the community gathered around the Internet application kyberia.sk. From the multitude of phenomena which caught my attention during the first five years of the community's existence, I finally chose the following one as the object of my analysis: in 2003 a forumii named "KYBERIA – setki ceckiiii ven!" was created on kyberia, in which individual female users of kyberia gradually began to present to the world photographs of their bared chests, often of considerable aesthetic quality. What is interesting about the whole phenomenon is that the ladies and misses did not cease performing the acts in question even after years – it was not some short-lived memetic / imitative / fashion wave.
In the past, only poets, and to a lesser extent philosophers, publicly reacted to the occurrence of such acts, which I shall designate in this essay by the term "vyňadrenie" (roughly: the act of baring one's breasts). Any scientific thematization of the role of the breast in human society was, a hundred years ago, almost out of the question; subsequently the whole topic "of the role of that which stills two hungers at once" – a topic so beautifully clear – lost its obviousness for almost the entire 20th century under the sediment of Freudian mystifying interpretations. I believe that at the beginning of the third millennium mankind has freed itself sufficiently from the grip of tabooizing ideologies, moralizing religious dogmas and constraining scientific methodological paradigms for the question which occurred to me when looking at the aforementioned forum to be answered on a solid level and with nobility: "Why do those Women do it?"

To lay his head upon his mother's breast – that is all his happiness; for nothing in the world would he want to lose sight of her.2

1 Goethe J.W., Faust, transl. O. Fischer
2 Thákur Rabindranath, La jeune Lune – Les caprices de bébé

2. The Breast – Empirics

2.1 Hypotheses

By making women into symbolic objects whose being (esse) is a being-seen (percipi), masculine domination places them in a situation of permanent physical insecurity, or rather of symbolic dependency: a woman exists first of all through – and for – the gaze of others, i.e. as an accessible, attractive and disposable thing... So-called femininity is thereby often nothing other than a certain catering to real or presumed expectations of men, destined above all to reinforce their ego.3

Even though the question posed in the introduction is only a layman's question, the answers to it can be genuine hypotheses. Hypothesis A may read: "The ladies perform vyňadrenie simply because they derive a certain pleasure, a certain plaisir, from the act itself. They simply enjoy it, and the meaning of vyňadrenie is to be seen in the act itself." Hypothesis B, in a certain sense opposed to hypothesis A, may read: "The ladies do so because they get something out of it, or expect that they will get something out of it. Behind their conduct one may uncover some hidden tradition, some specific calculus, some peculiar form of economic behaviour, a kind of capitalization of one's own body."

We live in an era which the paradigm given by Adam Smith has influenced to such an extent that some will soon start to think that the Sun shines on the Earth because there is considerable demand for sunshine on Earth. We shall therefore allow ourselves to explain even the act of vyňadrenie through the conceptual pair demand – supply. We can thus put forward hypothesis B1: "A Woman performs vyňadrenie because she consciously or unconsciously assumes that in exchange for this risky act, an act by which she in a certain sense offers her very self, she will gain something."

2.2 Strategy

The empirical sample which will be used during the experiment, the database of kyberia, is an informationally closed system, i.e. "a system which cannot be influenced by anything from outside without the researcher's knowledge"4. Apart from that, it is necessary to realize that the statement which we have labeled hypothesis B1 was constructed only after we had obtained the data with which we shall work in our experiment. We shall use nothing but the database of kyberia as our data source. No questionnaires, no respondents, no interviews – and therefore no distortion of the obtained data by the fact that research is taking placeiv. It can thus be said that we shall collect the data by studying the database at our disposal, and this approach is close to the so-called "study of documents" in the sense that we depart from a "record of human activity which did not come into being for the purpose of our research"5. The difference, however, is that in our case it will be an immaterial record.

3 Bourdieu P., Nadvláda mužů
4 Disman M., Jak se vyrábí sociologická znalost, Karolinum, 1998, p. 18
5 Ibid., p. 166
What will interest us above all, in this as well as in further works, is the degree to which a certain concrete event – in this case vyňadrenie – influences the position and trajectory of an individual in one of the many spaces of the social world, as Bourdieu has taught us about them. The question thus arises whether this will be quantitative or qualitative research. "In quantitative research we collect only those data which we strictly need for testing the hypotheses. In qualitative research we try to collect all the data and to find the structures and regularities which exist in them."6 The fact that we already dispose of a certain hypothesis, as well as the fact that our work will consist in looking for relations between quantities, might lead us to the opinion that this is quantitative research. That opinion is, however, mistaken. Our hypothesis B1 is so far defined very vaguely and, above all, it in no way determined which data we have at our disposal; in a very strong sense we really have at our disposal all the data concerning the kyberia community. If it is therefore necessary to pigeonhole our research somehow, let it be the pigeonhole of qualitative research into which we place our experiment. The classic shortcoming of qualitative research, namely that it involves a strong reduction of the number of observed individuals, does not concern us: we dispose of information about all acts of vyňadrenie which occurred within the forum "KYBERIA – setki cecki ven!". I shall return to the much more serious problem of inducing the findings ascertained in our sample onto the whole population, as well as to the characteristics of the sample, in part 2.6.

The classic sociological dilemma of how to sail between the Scylla of quantitative research and the Charybdis of qualitative research can also be solved in Solomonic fashion by claiming that this is not sociological research at all, at least not in the sense in which it is understood by the community of scientists who consider themselves sociologists. We can either claim that our approach encompasses and recombines both traditional approaches, or we can use a set of strategies known rather to experts in information technologies: the method of so-called data mining. What is characteristic of data mining, and in what it agrees with qualitative research, is the maxim "first we have the data, only then the hypothesis". They differ, however, in that while qualitative research requires first of all a researcher who runs from individual to individual collecting the greatest possible amount of data, in data mining it is the individuals themselves who provide us with these data and, by their own activity, insert them into a pre-structured database, often without knowing that they do so. Which is splendid from the point of view of the relevance of our data; from the ethical point of view it is, of course, a problem. We shall try to solve it in part 2.7.

2.3 Terminology

Symbolic capital is any property – e.g. physical strength, wealth, or the prowess of a warrior – which, when perceived by agents equipped with such categories of perception and evaluation as allow this property to be perceived and recognized, becomes symbolically efficacious, like a veritable magical power: it is a property which – being an answer to collective, socially constituted expectations – somehow acts at a distance...7

If we take the above quotation as a definition of symbolic capital, hardly anyone would dare to claim that a Woman's breasts do not participate in the composition of her symbolic capital.

6 Ibid., pp. 285-288
7 Bourdieu P., Raisons pratiques: Sur la théorie de l'action: L'économie des biens symboliques
It is indisputable that a Woman acts upon the surrounding agents by her breasts, and also that the surrounding agents – and not only infants – dispose of "categories of perception and evaluation which allow this property to be perceived and recognized". However, since we certainly cannot reduce a Woman's symbolic capital to her breasts alone – a Woman is truly much more than that – and since we have no idea to what degree a Woman's breasts participate in the overall composition of her symbolic capital, nor even whether this degree is the same for all females of the species homo sapiens sapiens or differs from Woman to Woman, it is probably appropriate, at the beginning of our research, to abandon the quantity of symbolic capital and to work with more narrowly delimited terms.

The first term, which we have already defined, is the term "vyňadrenie". It is an act by which a Woman, a female of the species homo sapiens sapiens and a user of the system kyberia.sk – in the following text often called "the person concerned" – publishes her bust in the forum "KYBERIA – setki cecki ven!". The act of vyňadrenie took place at a temporal "moment of vyňadrenie" (TŇ) and can be characterized by a certain "success of vyňadrenie"v (ÚŇ). As I shall show in the part on operationalization, the success of vyňadrenie can be expressed within the kyberia system as a number, i.e. as an interval variable.

A further term worth defining is the "passive renown of person A within the community" (Z). This quantity can be characterized as "the number of members of the community who are aware of the existence of person A". The renown of a person stands in a very close relation to the quantity which I call the "passive social capital of person A" (KpS) – this I define as "the number of members of the community who have an established relationship with person A" – while the "active social capital of person A" (KaS) I define as "the number of members of the community with whom person A has, or wants to have, a relationship". In other words, in the case of passive social capital the arrows point TOWARDS person A; in the case of active capital the arrows point FROM person A. It can be expected that for most people the active social capital will not exceed a certain critical maximal threshold – to relate to more than a few hundred people is cognitively unbearable. When, for example, we speak of an unpopular politician, his "passive renown" in his country is high, because the number of people who know him is high; conversely, his passive social capital is low, because only a few like him and would want to establish a genuine human relationship with him. One can say that the more popular a person is, the more her "passive renown" correlates with her "passive social capital".

The "change of the passive social capital of person A at time T" (dKpST) is given by subtracting the value of passive social capital measured before time T from the value measured after time T. Within our research, the time T is of course nothing other than the moment of vyňadrenie. We can now connect our vague hypothesis B1 with the theoretical framework which we have built as a superstructure upon the work of Pierre Bourdieu, and ask: "Does the act of vyňadrenie lead to a change in the passive social capital of the person who performed it?" Put in human terms: "Does the act of a Woman revealing her charms to the world lead to a growth in the number of persons who attempt to establish a relationship with her? Or does the opposite, unexpected effect occur, with some persons breaking off contact with the lady who thus bared herself? |dKpST| > 0 ???" If no change occurs, not only the gentlemen but perhaps also the ladies should be informed about it.
If it does – and this is probably what our intuition whispers to us – if the sample shows us considerable drops or rises of passive social capital before and after the moment of vyňadrenie, we shall be able to go even further and ask: "Does a correlation exist between the success of vyňadrenie and the change of the passive social capital, or of the renown, of the person who performed it? If so, what is the regression coefficient b between the two phenomena?" Put in human terms: does it hold that the more charming the breasts of a female of the species homo sapiens sapiens are considered to be, the more people – most probably males – will want to establish contact with the female in question after she reveals them to the world? If yes, it is worth reflecting upon. If the exact opposite holds, it is worth reflecting upon. If neither of the two alternatives holds, it is worth reflecting upon.

2.4 Operationalization

By operationalization the sociologist brings his terminological – and thus in a certain sense already theoretical – framework closer to the composition of his empirical data. We bring the language of our hypotheses (in the case of quantitative research) or of our preliminary pre-understanding (in the case of qualitative research) closer to the language of our data. And so: the indicator of the quantity "success of vyňadrenie" is the quantity stored in our database in the column named "K", in the row characterizing the post with the photograph by which the lady performed her vyňadrenie. "K" is the basic economic medium of exchange, the unit of Capital, within the community of kyberia.sk – K can be understood as the Crown or Credit of kyberia. Every user is allotted 23K every dayvi, which he can subsequently assign to other posts, forums or users, whereby he may grant at most 1K to any single post. In the data at my disposal, the most successful act of vyňadrenie so far gained 123K – in other words, 123 users considered it appropriate to repay the lady in question for her act of vyňadrenie by granting her a K. Since the number of K with which the given photograph was evaluated is thus the resultant of the activity – or inactivity – of a multitude of people, I consider it a decidedly more objective indicator of the success of vyňadrenie than judging that success merely on the basis of my own subjective aesthetic criteria (which ones?).

But how to operationalize the "change of passive social capital", which is crucial for our research? Several paths open before us; for the purposes of this work it will suffice to present the three most essential ones:

Avii: As the indicator of the change of passive social capital we choose the change in the number of users who wrote mail to the person concerned during a certain interval BEFORE and AFTER the act of vyňadrenie. For example, we can compare the number of individuals who wrote to the female in question one week BEFORE and one week AFTER. Alternatively, we can look several weeks (or other time intervals) into the past – in this way we may uncover the existence of an "unobserved variable" which could lead to a spurious correlation8: the natural growth of social capital which certainly occurs as a newly arrived member integrates into the social network of kyberia, independently of whether she performed vyňadrenie or not.

B: As the indicator of the change of passive social capital we choose the change in the number of users who added the person concerned among their friends during a certain interval BEFORE and AFTER the act of vyňadrenie. The kyberia system, like many other Internet social networks, offers the possibility of creating friendship links between users. The threat of at least one spurious correlation will be limited in the same way as in the case of path A.

C: As the indicator of the change of passive social capital we choose the change in the number of users who viewed the user profile of the person concerned during a certain interval BEFORE and AFTER the act of vyňadrenie.
These data can be obtained from the database table "levenshtein". As with the first two paths, here too we can work not only with data quantitatively representing the whole existence of the user in kyberia – by integrating them into the research and applying statistical transformations to them we shall lower the probability of the influence of other unobserved variables. Combinations of paths A, B and C are of course also possible.

8 Ibid., p. 21

2.5 Problematization

Some of the problems we run into during our data analysis are of a downright technical character: e.g. we find that several thousand database entries concerning friendship links (path B) between users have the same creation date; it is almost certain that something like this was caused by a faulty database query at some point in the past. For our research this means that the results obtained by path B will probably be extremely unreliable – which is all the more unpleasant since, as we shall see in the following part, path B is the ethically cleanest one. This inconvenience is all the more serious because, upon reflecting on our theoretical model, it seems that the number of friendship links created with person A is the most reliable indicator of the passive social capital of the person in question. Let me clarify: the number of users who viewed the user profile of the person concerned (path C) is indeed a piece of information telling us that a certain set of users showed interest in the person concerned; we learn nothing, however, about the value of this interest – we do not know whether they looked at her profile out of pure curiosity or out of a real interest aroused in them by the act of vyňadrenie. The same holds for path A: since, for reasons both ethical and technical, we are not interested in the contents of the individual mail messages addressed to the person concerned, we do not know whether these messages contained words of praise, thanks and invitations to tea, or insults and insinuations that the person concerned behaved like a loose girl. It was precisely because of this problem that, as soon as it arose before me, I integrated into the theoretical model the concept of the "passive renown of person A", which I defined as "the number of members of the community who are aware of the existence of person A". In the case of the quantity of renown we are not at all interested in whether the relation between the other members of the community and person A is positive or negative, whether they love or hate the person. What is essential for the quantity of renown is that they know about person A. And since, of course, in order to view someone's user profile, or even to write him a mail message, we have to know about him, both data sources used in paths A and C are above all indicators of passive renown.

Similar problems of technical, methodological and theoretical character arise before us when designing the experiment. But the most serious problem is posed only by the results of the research. It is the problem of induction and the answer to the question: "To what degree can we extend our findings, mined from our data sample, to the whole population?" Answer: to the degree to which the structure of the sample correlates with the structure of the population. The sample was not merely the approximately 50 acts of vyňadrenie produced by a few dozen female users of the system kyberia.sk. The sample is the whole database of kyberia, since it is precisely from it that we mine the data about the change of social capital.
Even though kyberia is first of all a social network distributed, to one degree or another, in the brains of all its users, we can take a crude reductionist step and claim that the structure of interpersonal relations stored in the database – the sample at our disposal – is partially isomorphic with the structure of interpersonal relations in the real, material community. And about the community roughly this is known: in most cases its members are persons of Slovak and Czech nationality, of average to above-average computer literacy, economically and informationally productive, living in an urban environment; the peak of the bell-shaped curve of the age distribution of users I assume to lie somewhere in the interval of 23-25 years. The first induction step consists in extending the findings obtained thanks to the kyberia community to that set of individuals of the human species which can be described by characteristics similar to those of our sample: to cultivated and educated young Slovak and Czech men (and of course women). Having taken this step, we reason within a historico-ethnographic discourse. The second induction step consists in extending our findings to the whole population of planet Earth at the beginning of the 21st century, i.e. a population whose thinking and conduct is beginning to be ever more homogeneously determined by the influences of urbanization and the ideology of global capitalism. Here we find ourselves in a socio-economic discourse. The third induction step consists in extending our findings to the species homo sapiens sapiens as such. This is the biologico-anthropological discourse. It hardly needs adding that with every induction step the risk of error risesviii.

2.6 Interpretation

We want to understand the meaning of the obtained data. How to go about it? We can arrange the data into tables and then process them statistically. We shall thereby obtain a multitude of useful numbersix (e.g. the standard deviation, which gives us information about the homogeneity of our data), and perhaps we shall understand a thing or two; but after studying them we shall probably not be very capable of passing our knowledge on to a layman or a small child. And can one speak of understanding the data at all in such a case? Or we can visualize our informationx. A single image can carry more information than a twenty-page table – for we can read an image with greater ease, like a story. The simplest visualizations are graphs. If we are interested only in whether vyňadrenie leads to a change of the passive social capital of person A, the simplest way to find out is to create a graph in which the X-axis gives the time unit (e.g. the week) and the Y-axis gives the amount of passive social capital at the disposal of the person concerned at the given time. For every indicator and for every person A we obtain a separate graph. These graphs can of course be combined. If the act of vyňadrenie leads to an immediate growth of passive social capital, it will show in our graph as a jump, precisely at that X-coordinate which characterizes the time when the lady in question performed her vyňadrenie. Perhaps we shall find from such a graph that the jump is not immediate but partially delayed – that would suggest that men wait before making first contact, perhaps because they do not want their motives to be too evident. From such a graph it will perhaps also be possible to read whether the value of KpS settles, after vyňadrenie, at some new level, or whether it sooner or later converges back to its previous state.

If we are interested in the relation between the success of vyňadrenie and the change of passive social capital, we shall have to construct a different graph. The X-axis will characterize the change dKpST, the Y-axis will characterize the success of vyňadrenie. In the case that either a real dependency exists between the two variables, or both data are determined by a third variable unknown to us, the data will arrange themselves, with greater or smaller fluctuations, around a line which can be formally characterized by the equation

dKpST = a + b * ÚŇ

whereby the regression coefficient b will express nothing other than the degree to which the success of vyňadrenie – probably determined not only by the Woman's natural qualities but also by the quality of the photographer and the photograph – influences the change of passive social capital. The discovery of such an obvious correlation is, however, extremely improbable in the human sciences, and this research is no exception. The fact that in the end we arrive at no quantitatively expressible parameters does not mean, however, that the data have taught us nothing. Who knows, perhaps they will teach us only that by revealing her breasts to the world, nothing actually changes.
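A minimal sketch of how the regression coefficient b of the line above could be estimated by ordinary least squares from (ÚŇ, dKpST) pairs – the data points here are invented placeholders, not measurements from the kyberia database:

    use strict;
    use warnings;

    # Hypothetical (ÚŇ, dKpST) pairs: success of the act (in K) versus the
    # measured change of passive social capital around the moment TŇ.
    my @un    = (12, 34,  5, 77, 123);
    my @dkpst = ( 1,  3,  0,  6,   9);

    my $n = scalar @un;
    my ($sx, $sy, $sxy, $sxx) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $un[$i];
        $sy  += $dkpst[$i];
        $sxy += $un[$i] * $dkpst[$i];
        $sxx += $un[$i] ** 2;
    }

    # Ordinary least squares: b = cov(x,y)/var(x), a = mean(y) - b*mean(x).
    my $b = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx ** 2);
    my $a = ($sy - $b * $sx) / $n;
    printf "dKpST = %.3f + %.3f * UN\n", $a, $b;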
2.7 Ethics

Act as if the maxim of your action were to become, through your will, a universal law of nature.9

Given the fact that in this research I analyse data which were not produced for the purpose of research, it is certainly appropriate to ask whether the research in question is ethically clean. That of course depends on which set of ethical imperatives we consider essential. If we consider the only ethically essential thing to be that no killing or harm occurs during our research, we may consider the research ethical. No rat was destroyed, no schizophrenic underwent a lobotomy, no hyperactive child was doped with Ritalin, and no parrot was violently caught in the jungle to be subsequently placed in a cage in the name of "science".

If the use of private data is considered unethical, the following can be said in the research's defence:
● Since my enrolment at the FHS I have, within kyberia itself, several times explicitly emphasized that I would use the kyberia database for academic research. It is of course possible, and more than probable, that the majority of users were unaware of these intentions.
● The forum "KYBERIA – setki cecki ven!" is a public forum. Likewise, the data on friendship links between users are public data – any reasonably skilled Internet user can thus arrive at the results of path B even without having the database at his disposalxi.
● Path C works with data which are not accessible to everybody, and path A even with data of a markedly private character – with mail messages. These data, however, are processed not by hand but by machine, and their content interests nobody, just as it interests nobody to which concrete person the obtained number of message senders relates. I obtain a quantity which I then relate to another quantity...
● Since even such numeric information may nowadays – when information is placed in close connection with powerxii – be interpreted by someone as a violation of privacy, the results of the research will not be presented until I receive the consent of the Senate of kyberia to their publication.

Nevertheless, I consider the most ethical approach to be that of a scientist who studies not others but himself. I must state with regret that the adoption of this approach – the only approach with which I can fully identify – met with such considerable incomprehension that I almost left the Czechoslovak academic community because of itxiii.

From the point of view of the categorical imperative, i.e. from that point of view which may prove, for future generations, to be the only morally binding one, the research in question appears to be moral. For the maxim standing behind the conception of this research reads: "to pay homage, by this research, to Woman". Such a maxim can certainly become a universal law of nature.

9 Kant I., Základy metafyziky mravov
3. Conclusion – Theory

There are many good inventions on earth, some useful, some pleasant: for their sake the earth is to be loved. And there are many things so well invented that they are like woman's breast: useful and pleasant at the same time.10

From Eve's apple bitten into in the garden of paradise, through the alluring charms of Helen of Troy and the "twin gazelles" of Solomon's Song of Songs, up to Cécilia Sarkozy – the breast is omnipresent in the story of Man. It was here long before the first hand-axe – and yet anthropologists mostly remain silent about it. The breast and nothing else set the greatest men of history in motion – yet how many historians have devoted themselves to it fully? Perhaps only the most degenerate of all sciences, the one which proudly titles itself "the most rational", understood its significance for the human soul, only to begin exploiting that jewel of nature distastefully and without measure within the shallow PRAXIS of its advertising campaigns. Behind the seeming naivety of the marketing imperative "sex sells" hides an immensely powerful conglomerate of concepts like "demand" and "supply", bound together by would-be rational "laws" which, when confronted with ordinary human reality – the "unmasking" of which is the sociologist's sacred duty – may turn out to be unfounded or downright mendacious. A human being cannot be converted into capital; the breast is far more a gift than an object of exchange; the girls from the forum "KYBERIA – setki cecki ven" do not sell themselves, do not calculate – they simply live. They live and they laugh.

May someone one day, equally laughing, repay them for their laughter by constructing the "Grand Theory of the Breast" (VTŇ). He who erects it on a firm ontological foundation ("what is the being of the breast, and what belongs to it as such?") and anchors it metaphysically in a proper manner (the breast as the primal reason and final meaning of human Dasein), in order then to build walls of firm empirical facts (the breast as a certain configuration of human tissues; parameters like density, firmness, weight, compactness, fragility, softness, excitability etc.), truly need not fear that his phenomenological analyses (the breast as a phenomenon, and the analysis of the breast by all five senses) or his qualitative and quantitative researches of breasts would be disturbed by some ill-wisher. And the results of those researches? As the Master says: whereof one cannot speak, thereof one must be silent.11

Post Scriptum for every She without whose contribution this text would never have come into being: Thank you.

10 Nietzsche F., Also Sprach Zarathustra, Dritter Theil: Von alten und neuen Tafeln
11 Wittgenstein L., Tractatus logico-philosophicus

i Even though the term "ňadro" ("breast") is, in the context of the Slovak language, a crude Bohemianism, the terminus technicus "vyňadrenie" and its reflexive verbal derivatives – the perfective "vyňadriť sa" and the imperfective "vyňadrovať sa" – are pure neologisms which came into being as part of the Slovak language. Thus, should the following conversation ever take place in some academic elevator:
A: "Well well, dear colleague, how beautifully vyňadřená you are today."
B: "Dear colleague, I must warn you emphatically that comments of this kind will most certainly not increase your active social capital."
...the term "vyňadřit se", despite its seemingly Czech mutation, would be a crude Slovakism...
ii The forum is publicly accessible at the address: http://kyberia/id/64400/ .
For potential researchers from the FHS, a user account with the following parameters was created in the past: login: fhs, password: fhsfhs
iii It hardly needs adding that the similarity of the term цэцэк used here to the Mongolian word цэцэг, which means flower (classical Mongolian does not know the difference between K and G), is probably not due to some wondrous linguistic convergence, but is merely a certain charming coincidence.
iv While the insight that "the act of observation influences the observed phenomenon" is one of the basic pillars of quantum physics, and even led to the formulation of Heisenberg's very useful "uncertainty principle" dx·dp ≥ ...
vii my $arr_ref = $dbh->selectall_arrayref(
        # NB: the original LIKE pattern is illegible in the source;
        # '%...%' is a placeholder.
        "select node_created, login, node_id, node_creator,
                k as USPESNOST_VYNADRENIA, node_name
           from nodes left join users
                on users.user_id = nodes.node_creator
          where node_parent = 64400 and node_content like '%...%'
          order by k desc",
        { Slice => {} });
    foreach my $ref (@$arr_ref) {
        my @a = $dbh->selectrow_array(
            "select count(distinct mail_from) as SOCIAL_CAPITAL_BEFORE
               from mail
              where mail_to = $ref->{'node_creator'} and mail_user = mail_to
                and mail_timestamp > '$ref->{'node_created'}' - INTERVAL 7 DAY
                and mail_timestamp < '$ref->{'node_created'}'");
        my @p = $dbh->selectrow_array(
            "select count(distinct mail_from) as SOCIAL_CAPITAL_AFTER
               from mail
              where mail_to = $ref->{'node_creator'} and mail_user = mail_to
                and mail_timestamp < '$ref->{'node_created'}' + INTERVAL 7 DAY
                and mail_timestamp > '$ref->{'node_created'}'");
    }
We arrive at the results just as simply if we take path B or C.
viii Therefore, he who wishes to stray as little as possible from the truth will do best to take no induction step at all.
ix And as Antoine de Saint-Exupéry has it said to the Little Prince: "People love numbers."
x In the text I present graphs as examples of data visualization. Visualization is, however, far from being limited to graphs. See e.g. http://gondapeter.sk/files/peter_gonda_bakalarka..pdf
xi Exercise 1: Write a script shorter than 77 lines with which you obtain all the necessary data. Hint: first parse http://kyberia.sk/id/64400 according to a certain regular expression; this yields the first set of variables, which then determines which user pages you have to parse – again by means of a regular expression – in order to obtain the second set of variables. That is all.
xii INFORMATION IS POWER: information (INFORMATION) is distilled from data (DATA), and data are picked out of chaos (CHAOS). Information leads to knowledge (KNOWLEDGE), and that leads, after years, to wisdom (WISDOM).
xiii The rejection of the methodological paper in which I presented an entirely new way of viewing the contents of the human mind – in the case of that paper it was of course my own mind – was justified by an anthropologist and an ethologist with, among other things, the argument that "in social-science research the members of a sample tend not to be features, but persons".

Variations upon the theme of the Evolutionary Language Game
by Daniel Devatman Hromada

Introduction
The Evolutionary Language Game (ELG), first proposed in (Nowak, Plotkin, & Krakauer, 1999), is a stunningly simple yet mathematically feasible stochastic model addressing the question: «How could a coordinated system of meanings & sounds evolve in a group of mutually interacting agents?» In the most simple terms, the model can be described as follows: let's have a population of N agents. Each agent is described by an n x m associative matrix A. A's entry a_ij specifies how often an individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j.
Thus, from this matrix A one can derive the active «speaker» matrix P by normalizing A's rows:

$$p_{ij} = \frac{a_{ij}}{\sum_{k=1}^{m} a_{ik}}$$

while the passive «hearer» matrix Q is obtained by normalizing A's columns:

$$q_{ji} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{kj}}$$

The entries $p_{ij}$ of the matrix P denote the probability that, for an agent-speaker, object i is associated with sound j. The entries $q_{ji}$ of the matrix Q denote the probability that, for an agent-hearer, sound j is associated with object i. Subsequently, we can imagine two individuals A and A', the first one having the language L(P, Q), the other having the language L'(P', Q'). The payoff related to communication between two such individuals is, within Nowak's model, calculated as follows:

$$F(A, A') = \sum_{i=1}^{n}\sum_{j=1}^{m} p_{ij}\, q'_{ji} = \mathrm{Tr}(P Q')$$

And the fitness of the individual A with regard to all other members of the population $\mathcal{P}$ can be obtained as follows:

$$f(A) = \frac{1}{|\mathcal{P}| - 1} \sum_{A' \in \mathcal{P},\ A' \neq A} F(A, A')$$

After the fitness values are obtained for all population members, one can easily apply traditional evolutionary computing methods (Sekaj, 2005) in order to direct the population toward more optimal states. In the experiments described in this paper, we have applied a binary-search variant of the roulette wheel algorithm, within which the probability of selecting individual I as a future teacher is proportional to I's fitness. It has to be stated, however, that Nowak's results indicate that even without such an «evolutionary engine» behind it, the ELG model shall converge to weak local optima. Such a mathematical property makes the ELG model a plausible candidate for the explanation of coordinated communication systems even among species where the coordination of sound-meaning pairs does not necessarily augment an individual's fitness. While such cases, where the teacher is chosen at «random», could possibly explain the first stages of the emergence of language practically ex nihilo in higher vertebrates, they will be left aside in the rest of this paper. Thus, all numeric experiments presented below depart from the assumption that the hypothesis «successful alignment of one's sound-meaning associative mindmatrix A with the mindmatrices of other members of one's population augments one's fitness and thus augments one's probability to replicate the content of one's mindmatrix into the mindmatrices of younger individuals» is true.

First simulation
The aim of the first simulation, which we label as standard Evolutionary Language Game (sELG), was to confirm the validity of the ELG model and to test its sensitivity to different values {1, 4, 7, 10} of the parameter k_learn, which specifies how many times the matrix sampling procedure should be repeated during the learning process. The size of the population was N=100; the size of the associative matrix was 5 × 5. For every value of the k_learn parameter, the simulation was run 198 times. Every run was halted after 10000 generations. In every generation, the roulette wheel algorithm chose one fit individual to be the «teacher», whose associative matrix was sampled into the associative matrix of one «student» individual chosen at random.

Results of 1st simulation
As indicated by Figure 1, all runs converged rather swiftly to local absorbing states. The result is thus consistent with the results presented in (Nowak et al., 1999). The global optima were, however, attained quite rarely: 18 times in case of k_learn=1, 13 times in case of k_learn=10, 7 times in case of k_learn=4 and 9 times in case of k_learn=7.
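As a minimal illustration of the bookkeeping defined above – the subroutine names and data layout are our own illustrative choices, not the original implementation – the matrices P and Q and the payoff F(A, A') = Tr(PQ') can be computed in a few lines of Perl:

    # Derive the row-normalized "speaker" matrix P from an associative matrix A.
    sub speaker_matrix {
        my ($A) = @_;
        my @P;
        for my $row (@$A) {
            my $sum = 0; $sum += $_ for @$row;
            push @P, [ map { $sum ? $_ / $sum : 0 } @$row ];
        }
        return \@P;
    }

    # Derive the column-normalized "hearer" matrix Q (stored as Q[j][i]).
    sub hearer_matrix {
        my ($A) = @_;
        my (@Q, @colsum);
        for my $row (@$A) { $colsum[$_] += $row->[$_] for 0 .. $#$row; }
        for my $j (0 .. $#{ $A->[0] }) {
            push @Q, [ map { $colsum[$j] ? $A->[$_][$j] / $colsum[$j] : 0 } 0 .. $#$A ];
        }
        return \@Q;
    }

    # Payoff F(A, A') = sum_ij p_ij * q'_ji = Tr(P Q').
    sub payoff {
        my ($A1, $A2) = @_;
        my ($P, $Q) = (speaker_matrix($A1), hearer_matrix($A2));
        my $F = 0;
        for my $i (0 .. $#$P) {
            $F += $P->[$i][$_] * $Q->[$_][$i] for 0 .. $#{ $P->[$i] };
        }
        return $F;
    }

The fitness f(A) is then simply this payoff averaged over all other members of the population.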
For other information concerning, for example, the average generation at which the different absorbing states were attained, cf. (Hromada, 2012)¹.

Figure 1: Evolution of fitness in time in a standard Evolutionary Language Game

Second simulation
In the second simulation, which we label as standard Evolutionary Language Game with noise (sELGn), the stochasticity of the model was increased by the introduction of noise-generating phenomena into the learning process, occurring with probability p_mutation = {0.05, 0.01, 0.001, 0.0001}. For every probability value there were 20 runs. The value of the parameter k_learn was 4; other parameters were identical to those used in the first simulation (i.e. N=100, m=5, n=5).

1 Note that the local optima which Nowak labels as «absorbing states» are denoted by the term «orbitals» in (Hromada, 2012).

Figure 2: Evolution of fitness in time in a standard Evolutionary Language Game with noise

Results of 2nd simulation
Fitness curves for different values of the parameter p_mutation are plotted in Figure 2. It is evident that if ever the value of p_mutation exceeds a certain threshold, the system shall behave too stochastically and shall oscillate in the proximity of the weakest local optimum. On the contrary, when p_mutation is very low, one can notice that the system converges to high attractor states which have fitness values 4 and 5. Thus the presence of a very low amount of noise seems to play an important role in cases whereby the system gets stuck in local optima – verily, an error in the learning process can give the system an important stochastic kick which shall allow the system, in the long run, to attain the global optimum state. In evolutionary computing, such an approach is already widely used not only in genetic algorithms but also in algorithms derived from the Simulated Annealing approach (Kirkpatrick & Vecchi, 1983). The results obtained in experiment 2 are consistent with Nowak's data, as well as with the data obtained by Kvasnička & Pospíchal, who state: «if we introduce random errors into the learning process, the obtained results differ dramatically and are dependent on the probability of occurrence of these errors. If ever this probability exceeds the critical value located somewhere between 0.001 and 0.01, the evolution shall start to behave in a stochastic manner and shall cease to converge to the final value of fitness» (Kvasnička & Pospíchal, 2007).

Third simulation
The objective of the third simulation was to exploit the ELG model in order to answer the question: «Which strategy is more fit? To be taught once or multiple times?». Seemingly identical to the first simulation, within which we modelled the fact of «being taught more than once» by different values of the parameter k_learn, this simulation differed primarily in two aspects: 1) the stochastic parameter p_mutation was assigned the value 0.001 in order to ensure that the system would converge, sooner or later, to the global optimum; 2) none of the 99 runs for each configuration was stopped until the average fitness of the population attained unescapable proximity of the global optimum². Every run could thus be characterized by a «temporal length», i.e. the number of generations necessary to attain the global optimum, and the two distributions were subsequently compared by statistical tests.

2 Since the maximum average fitness of the model hereby presented is n=m=5, we state that all models whose average fitness attained the value 4.98 (or higher) have attained «unescapable proximity of the global optimum».
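A minimal sketch of the noisy sampling step, as we understand it from the description above (variable names are ours, and the exact point at which the noise strikes is our assumption; cf. the author's own sampling loop reproduced in footnote 4 of this paper):

    # One matrix-sampling episode with noise: every cell of the student's
    # mindmatrix is reinforced with a probability given by the corresponding
    # entry of the teacher's speaker matrix P; with a small probability
    # p_mutation the reinforcement lands on a random cell of the same row
    # instead -- a "mis-heard" sound.
    sub sample_with_noise {
        my ($Teacher_P, $Student_A, $p_mutation) = @_;
        for my $row (0 .. $#$Teacher_P) {
            my $width = scalar @{ $Teacher_P->[$row] };
            for my $column (0 .. $width - 1) {
                next unless rand() < $Teacher_P->[$row][$column];
                my $target = rand() < $p_mutation
                           ? int rand($width)   # noisy copy
                           : $column;           # faithful copy
                $Student_A->[$row][$target] += 1;
            }
        }
    }

Repeating this episode k_learn times for one student corresponds to the «being taught multiple times» condition of the third simulation.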
Results of 3rd simulation
All 198 runs converged to the global optimum; the longest run for k_learn=1 took 376950 generations to converge to a theoretically optimal alignment of the mindmatrices among the members of the population. The longest run in case of k_learn=4 needed 215290 generations to converge to a quasioptimal alignment of the mindmatrices of all members of the population. The distribution of 99 temporal lengths, one for each run, is not a normal distribution according to the Shapiro-Wilk test of normality (W = 0.7407, p-value = 6.244e-12) for runs where k_learn=1. Similarly, the distribution of the 99 temporal lengths of all 99 runs executed with k_learn=4 is not normal (W = 0.8709, p-value = 8.533e-08). A non-parametric test therefore had to be applied in order to compare the two distributions: the Mann-Whitney-Wilcoxon rank-sum test showed that the difference between the two distributions is not significant (W = 4207, p-value = 0.08562). Strictly statistically speaking, there is therefore no difference between the situation whereby the teacher's mindmatrix is sampled into the student's four times and the one where it is sampled only once.

Fourth simulation
The aim of this simulation was to shed some light upon the answer to the question: «Which strategy is more fit? To be taught once by multiple teachers, or to be taught multiple times by one unique teacher?». We used the data obtained in the 3rd simulation for the case «multiple times by one unique teacher». For the cases with multiple teachers, we modified sELGn so that before every matrix sampling for a randomly chosen student, a new teacher is chosen by the roulette wheel algorithm according to his fitness. Given that k_learn=4 in both cases, all other parameters were identical to those of the preceding simulations.

Results of the 4th simulation
The Mann-Whitney-Wilcoxon rank-sum test suggests that the difference between the two distributions is not significant (W = 5068.5, p-value = 0.6778). It therefore seems that within the standard ELG model, whether the student learns from one or more teachers does not speed up the convergence of the population towards the global optimum. The longest run in case of the «multiple teachers» simulations took 298910 generations to converge.

Fifth simulation
The aim of the fifth simulation was to see whether a phenomenon like the Baldwin effect could arise in the sELG world. While the traditional «cultural evolution» was modelled by the sELGn model as presented in simulations 2-4, the «genetic evolution» was implemented as a sort of meta-evolution within which the parameter k_learn could itself evolve. In concrete terms, every individual of the population was defined not only by his mindmatrix A (and the «speaker» matrix P and the «hearer» matrix Q derived from A), but also by a chromosome containing a single integer specifying the k_learn parameter. All members of the population were initialized with k_learn=0, but in every generation a mutation of this value could occur with probability p=0.1, incrementing or decrementing the value. The value was bounded by the interval <0,10>; it could neither increase above nor decrease below these bounds. The values of the other parameters were identical to those presented in simulations 1-4.

Results of the 5th simulation
Identically to simulations 3 & 4, all runs attained the proximity of the globally optimal attractor state (the longest run took 361590 generations to converge).
While the Wilcoxon rank-sum test indicated no significant difference between the distribution obtained in this simulation and the «multiple teacher» distribution obtained in the fourth simulation (W = 4587, p-value = 0.3722), nor any difference with regard to the distribution obtained for «parental learning with k_learn=4» from the third simulation (W = 4385, p-value = 0.1646), there was nonetheless a significant difference observed between the distribution obtained in this simulation and the distribution for «parental learning with k_learn=1». Somewhat contrary to the expectations we had when launching the model, the most simple variant of parental learning seems to converge faster, i.e. in fewer generations, to the global optimum (one-sided: W = 3765, p-value = 0.001773) than the «bounded Baldwin» variant presented in this simulation. For completeness, we consider it worth noting that further analysis of the values of the evolvable k_learn parameter indicates that, within the 100 executed runs, the mean of the parameter value was 3.57 and the median 3.351 at the moment when a given simulation attained the proximity of the optimal state. Given the fact that the possible values of k_learn were bounded to the interval <0,10>, this could seem surprising, since one would expect the values to be closer to the middle of the interval. All is explained, however, when one realizes that the random drift responsible for the evolution of k_learn from its initial value 0 does not have time to act in the case of «lucky fast simulations» which converge to the global optimum in a few thousand generations. This can be clearly seen when one compares the distribution of the temporal lengths of such «fast simulations», which converged to the optimum in fewer than 10000 generations, with the rest. While the mean value of the k_learn parameter for the «fast simulations» is 1.92, the mean value for the others is 3.73, and the difference between the two subgroups is, of course, significant (W = 819, p-value = 8.384e-07).

Discussion
The elegance of ELG is such that one is immediately tempted to state that «ELG offers THE mathematical formalism explaining the emergence of a shared communication system in a population of agents whose sound-meaning couples are randomly initialized». On the other hand, it could easily be reproached that the initial ELG model is too reductionist and, what is worse, that it is based on assumptions contradictory to the state of things which must have obtained when human language evolved. For example, the assumption that the teacher-to-student information transfer can be modelled by the sampling of the WHOLE teacher's associative matrix, or the assumption that the fitness of any individual I in generation G is defined as the average payoff of all possible communication acts with all other individuals of the population – both these assumptions seem to us highly unrealistic in relation to the functioning of groups of primates in the period when coordinated sign-meaning communicative systems came into existence. But ELG remains interesting even if all its relations to the human sciences and linguistics were considered irrelevant. Verily, we believe that ELG is worthy of scientific interest even when considered as a solely mathematical, informatic and/or game-theoretic problem. More concretely: as a stochastic framework able to converge to a well-defined global optimum state (i.e.
the state where, in all rows of the mindmatrices of all members of the population, there is exactly one entry with value 1 and zeroes in all other entries of the same row), ELG can furnish a useful toolbox for the evaluation and comparison of diverse evolutionary computing approaches. Within this paper we have introduced, in experiments 3-5, an evaluation method based on the nonparametric Mann-Whitney-Wilcoxon rank-sum test. We have taken for granted an analytical result published in (REF), indicating that if a reasonable amount of noise is present during the learning process, the population shall sooner or later converge to an optimally coordinated communication system. This being granted, we asked: «How fast shall the global optimum be attained?», and considered an evolutionary algorithm and/or a given set of parameter values (k_learn, p_mutation) as better if, for a given configuration, the algorithm converged to the global optimum significantly faster (i.e. in fewer generations³) than other configurations.

3 In this paper we have used the term «generation», which is common in the Evolutionary Computing domain. But since in the sELG model a generation consists of 1) the choice of teacher(s), 2) the choice of ONE student, 3) an information transfer from teacher to student, i.e. replacing the student's mindmatrix with a new one determined by the teacher, it seems more appropriate to label such coarse-grained time steps as «lessons» or «days».

Finally, it has to be stated that in order to transform ELG into a full-fledged evolutionary-algorithm evaluation toolbox, the notion of time has to be somewhat refined. From the coarse-grained notion of a generation – which, in the case of Nowak's or Kvasnička's work, is equivalent to k_learn ≥ 1 acts of sampling the whole matrix – we propose to found further work on the finer notion of an ostensive definition (Wittgenstein, 2009). One can easily understand that within the ELG model, every internal step of the enveloping matrix-sampling loop⁴ can be interpreted as such a «definition by pointing», whereby the teacher associates a sound with a meaning within the mindmatrix of the student. Thus, under the conditions that 1) diverse models are evaluated by statistical non-parametric tests comparing the number of time steps needed to attain the global optimum, and 2) a time step is defined as an ostensive definition, one could propose such a fine-grained ELG as a possible evaluation toolkit not only for diverse evolutionary computing techniques, but possibly even as a more general method to assess the performance of game-theoretic approaches in attaining Nash-like equilibria (Trapa & Nowak, 2000). And if, additionally, 3) parameters like p_mutation, k_learn, the number of teachers, parents etc. can also evolve by means of a genetic meta-evolution governing the subordinated linguistic evolution, it can be expected that a phenomenon interpretable as the Baldwin effect shall be discovered in the world defined by the ELG framework.

Bibliography
Hromada, D. D. (2012). Evolutionary insight into spontaneous emergence of shared sound-meaning mappings in multi-agent communities. Accessible at: http://localhost.sk/~hromi/academic/2012/evolutionary_insights.pdf
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671–680.
Kvasnička, V., & Pospíchal, J. (2007). Evolúcia jazyka a univerzálny darwinizmus. In Myseľ, inteligencia a život. Bratislava: Slovenská Technická Univerzita.
Nowak, M. A., Plotkin, J. B., & Krakauer, D. C. (1999).
The evolutionary language game. Journal of Theoretical Biology, 200(2), 147–162.
Sekaj, I. (2005). Evolučné výpočty a ich využitie v praxi.
Trapa, P. E., & Nowak, M. A. (2000). Nash equilibria for an evolutionary language game. Journal of Mathematical Biology, 41(2), 172–188.
Wittgenstein, L. (2009). Philosophical investigations. Wiley-Blackwell.

4
    for my $row (0 .. $Matrix_height) {          # matrix sampling
        for my $column (0 .. $Matrix_width) {
            $Student_A_Matrix[$row][$column] += 1
                if rand() < $Teacher_P_Matrix[$row][$column];
        }
    }

[...] the so-called «>50% attack». This being said, we should specify that it is not the intention of this article to explain the intricacies of the Bitcoin algorithm, since this has already been done thousands of times with greater or lesser success. But we consider it important to focus the reader's attention upon the fact that, in the world of cryptocoins, the trajectory of a coin – from the very moment it was «min(t)ed» by one among the multitudes of the network's COGs until its current owner – is broadcast to all nodes of the network and is thus completely transparent.

1 In reality, the whole thing is somewhat more intricate, and what is being signed is, in fact, a script in a scripting language more complex than a simple «transfer quantity from A to B» instruction.

We call this feature of cryptocoin monetary assets «transparency of history». Anyone running a bitcoin server, or any visitor of sites like blockchain.info, can, with sufficient patience, observe the trajectory of every single coin from the «coinbase» to the current owner's address. But since it is quite easy for any user to generate multitudes of account addresses – which are nothing else than publicly broadcast cryptographic keys that cannot be actively used without knowledge of a «private key» from which they are generated during the account-address creation process and which only the owner knows – it is very difficult, if not practically impossible, to create a link between a cryptocoin address and the physical entity holding the key to that address, if that entity herself does not want to reveal her identity. While the lack of this bridge between the virtual and the real, which we call «pseudonymity of use», is applauded by advocates of the libertarian cryptopunk movement as a highly welcome and positive feature, it brings with itself a growing concern that, in the long term, such complete opaqueness shall above all be profitable to those who conduct financial activities which they would normally hide.

3 Min(t)ing and trading
In simple terms, there are only two ways how Bitcoins, or other cryptocoins – with the exception of PPCoin – can be earned: by mining and by trading. Miners are those who invest the computational power of their resources into the verification of the validity of transactions broadcast within the network. Since the probability of discovering a new block of coins is proportional to the amount of computational resources invested into mining, it follows that the biggest number of new «virgin» coins will become the property of those who invested the biggest amount of computational resources. It seems that for the first few months of Bitcoin's existence the algorithm was running only on the CPU of Satoshi Nakamoto, where he had possibly pre-mined cca 1 million bitcoins (Bitslog, 2013); subsequently other CPUs joined the network, and then a much faster SHA-2 hashing performance was made possible by exploiting the faculties of graphic cards' GPUs.
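To make the nature of this hashing race concrete: mining is, at its computational core, nothing but a brute-force search for a nonce whose (double) SHA-256 hash falls below a difficulty target. The following toy Perl sketch illustrates the principle using the core Digest::SHA module; the block payload and the four-zero «difficulty» are our illustrative assumptions, not Bitcoin's real parameters:

    #!/usr/bin/perl
    use strict; use warnings;
    use Digest::SHA qw(sha256 sha256_hex);

    my $block  = 'toy block header';   # illustrative payload, not a real header
    my $target = '0000';               # toy difficulty: hash must start with 4 zeroes

    # Increment the nonce until the double SHA-256 hash meets the target.
    my $nonce = 0;
    $nonce++ until sha256_hex(sha256($block . $nonce)) =~ /^$target/;
    print "nonce $nonce yields hash ",
          sha256_hex(sha256($block . $nonce)), "\n";

Every additional leading hexadecimal zero multiplies the expected work by sixteen – which is why the race from CPUs through GPUs to FPGAs and ASICs described above translates so directly into energy burned.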
With the market value of bitcoin gradually rising, the hound race continued with the deployment of the first Field-Programmable Gate Array (FPGA) bitcoin mining devices, and continued further with cohorts of Application-Specific Integrated Circuits (ASICs) spouting from factory conveyor belts somewhere in the Pudong economic zone. Given the fact that these devices can be bought, in the first place, for bitcoins, the whole bitcoin economy started to resemble an initially purely virtual, but with time ever-more-real, Uroboros snake reifying itself through the process of software being materialized in hardware and, as may be the case in days to come, also into the wetware of organic tissue. Initially, only informatic-oriented services were tradable for bitcoins, and only very rarely were some more material transactions executed – as was the case, for example, of the most expensive pizza² of mankind's history. Later some coffee producers and alpaca-sock distributors joined the club, but things changed with the launch of «Silk Road», an online drug marketplace (Barratt, 2012). By harnessing the anonymising possibilities furnished by the «TOR hidden services» protocol (Dingledine, 2005) and combining them with the pseudonymity of Bitcoin's financial transactions and a simple escrow-service business model, SR's developers quite quickly succeeded in transforming their supply-demand-coupling, eBay-like bazaar website – possibly running somewhere on a server in grandma's backyard – into a multimillion enterprise. In parallel to SR, online exchange offices like Vircurex or MtGox started to flourish, where it was possible to trade BTC for real-life fiat currencies. Whole stockmarkets emerged, making it possible to find investors for one's project.

2 Traded for 10000 BTC in 2010. Three years later, the estimated market value of such an amount of BTC would be more than 1 million US dollars.

The gambling and betting industry swiftly followed, with projects like Satoshi Dice adding another level of anonymity to the already opaque activities taking place within the cryptosphere. Being an ideal haven for money-laundering and tax-evasion, the Bitcoin economy gets mundane and flourishes.

4 Algorithmic quasi-deity
As of 2013, Bitcoin has all the prerequisites to become a new religion for a world where the «death of god» (Nietzsche, 1911) is a widely accepted truth. It has its myth of creation and the living testament of those who ate a million-dollar pizza. It has its disciples – mostly computer geeks who became millionaires because they were connected to the right discussion forum or Internet Relay Chat (IRC) channel at the right moment. And it has its devotees – people who invested their fortune and hundreds of hours of their lives in exchange for the hope that the Bitcoin economy shall turn out to be something more than a pyramid game; often people who know that in order not to lose what they have invested, they have to spread «the bitcoin gospel». It has its more and more omnipresent «giving deity» – a consensual algorithm based upon a simple inflationary curve, which distributes according to the promise that the biggest amount of «virgin» coins shall be given to those who invest the most into keeping the whole machinery going – the whole process being probabilistic, thus containing the necessary amount of hopeful waiting, sometimes crowned with blissful surprise.
And, last but not least, the BTC monotheism syndrome has its old idols to overthrow – idols like Ayn Rand's dollar (Rand, 1957), which paved the way but lost their power as gold once lost it, mutatis mutandis, when the gold standard was abolished. Given these propositions, suggesting that bitcoin mania can involve not only the frontal cortex but also the amygdala or even the pineal gland (Paloutzian & Park, 2005), it is of no surprise that even reasonable people consider it not only possible but even plausible that the information concerning a transaction of two potatoes in Ushuaia is broadcast to millions of nodes of the cryptosphere, Papua New Guinea included. Reason often discreetly quits the cognitive battlefield whenever the hoarding tendencies (Mataix-Cols et al., 2010) of human beings are coupled with addictive behaviour – which financial derivative trading surely is – thus leaving humans prone to the caprices of mass psychology. And as of spring 2013, slowly resurrecting from the implosion of the second deflationary bubble, when the market value fell from 260 USD to 80 USD in one day, Bitcoin is again gaining momentum and becoming truly massive.

5 Clash of the Titans
Contrary to an ancient Greek coin lying forgotten in the dust, which guards its value by simply being the object it was created to be, Bitcoin needs to burn energy in order to survive. What is more, the minting hound race obliges any minter to burn ever more energy in order to keep pace with other minters. When one takes into account all the machinery dedicated to making the network run, plus the machinery which makes the machinery which makes the network run, one is obliged to admit that Satoshi designed a monetary system addressing social and political issues while ignoring the ecological ones. More precisely: given the fact that without the ever-growing energy consumption caused by min(t)ers the transaction blockchain could be overwritten by a node obtaining more than 50% of the hashrate of the network, the whole machinery cannot be stopped or even slowed down, because if slowed down it will cease to be a secure value-carrying haven. Thus, the Bitcoin architecture has to lead, ex vi termini, to the «Tragedy of the Commons» scenario (Hardin, 1968). Luckily enough, some people have already understood that Nakamoto's Bitcoin was nothing else than a prototype, and that the values of the parameters determining the overall functioning of the network were just one set among multitudes of other, possibly more optimal, values. Thus, after a first wave of alternative cryptocoins like SolidCoin, LiquidCoin, IxCoin, I0Coin or FeatherCoin – whose objective was nothing else than to make those who deployed them rich, and which did not bring any substantial adjustment to Nakamoto's original code – a second wave of alternative cryptocoins like TerraCoin, LiteCoin or PPCoin is gaining momentum, each bringing with itself at least one novel feature. PPCoin (King & Nadal, 2012) seems to be of particular interest due to the importance its authors put upon long-term ecological sustainability, as well as due to the fact that it is the only cryptocoin which is not purely deflationary but integrates a very gentle inflation into the model. TerraCoin is of interest due to the different values of the network's initialisation parameters, and LiteCoin – currently the second strongest cryptocoin – attracts more and more attention because its proof-of-work component is based on the scrypt algorithm (Percival, 2009).
Since the scrypt algorithm involves not only simple hashing but demands the participation of huge amounts of memory, it is much more difficult to execute on specialised FPGA and ASIC hardware, thus making LiteCoin more attractive for min(t)ers disposing only of classical computers. Due to the growth of cryptocoin diversity, it is therefore far from certain that the cryptosphere shall, in the years to come, venerate by its activity only the Ƀ divinity. One can only hope that sooner or later a cryptocoin shall be proposed which will harness the computational resources of the involved COG devices for some noble task – be it anticancer protein modeling, climate prediction or astrophysical data analysis. But until the global deployment of such a cryptocoin takes place, all other cryptocoins shall principally address nothing else than the hoarding tendencies common to the superior primate which homo sapiens sapiens undoubtedly is.

6 Umwertung aller Werte
By pure coincidence, the author of this article bought, in February, 230 TerraCoins for approximately 1.4 USD. Two months and two mouse-clicks later, the amount could easily be traded for more than 140 USD on the Vircurex exchange, the net gain thus being approximately equivalent to four monthly wages of a full-time worker in the garment industry in Bangladesh. Putting aside the possible trading addiction (Taleb, 2005) which could emerge if ever such a behaviour-conditioning rewarding experience were repeated, one is obliged to pose the question: "What purpose do cryptocoins truly serve, and what value do they have?" And what value does a LiteCoin have if, at the same moment, in the same market place, one can buy it either for 4 dollars or for 0.02 Bitcoin, given the fact that at the very same moment, in the same market place, one can buy a Bitcoin for 100 dollars? The simple answer "none" goes much further than the simple economic notions of "time delta" and "arbitrage" could ever go. Cryptocoins cannot be eaten or drunk. They do not protect from the rain, they do not bring heat – contrary to banknotes, which can still be burned on a cold winter day, humans shall be obliged to burn ever more energy to keep the cryptocoin machinery going. Contrary to gold, one cannot make jewels or false teeth out of them; cryptocoins arouse no sentiment of beauty. Contrary to a credit card payment, one has to wait at least 10 minutes in case of BTC and 2.5 minutes in case of LiteCoin or TerraCoin to obtain, if lucky, one transaction confirmation (only after 5 or 6 confirmations can a vendor be sure that he was not the victim of a double-spending attack). Contrary to folk beliefs, the transfer of value in the current cryptosphere therefore definitely does not occur at the speed of light. Thus, as a value-storing asset, cryptocoins have only one principal advantage: there is a limited amount of them. In other terms: they are not for everybody. Not for those living on continents where the cryptosphere is absent. Nor for those who jumped too late on this biggest financial bulldozer ever invented. But only for those who think that playing the game of numbers acting on numbers is worth the most limited asset one ever had – the time of one's life. Only for those who think that having more of anything – even if that anything is, in fact, pure nothing – is an important marker of their social status. Thus, if posed with the question «Cui bono?
», it may be the case that «economic growth», «market» or «crime» shall be only partial answers – as partial as the answers «gluttony, greed, and vanity» (Dante, 1321). For we believe that it is not completely hors propos to state that structures like Bitcoin serve as opening gates to a world whereby a planetary emergent Artificial Intelligence has succeeded in penetrating, for the first time in mankind's history, into the realm of our virtues, vices and values.

Barratt, M. J. (2012). Silk Road: eBay for drugs. Addiction, 107(3), 683–683.
Bitslog. (2013). The Well Deserved Fortune of Satoshi Nakamoto, Bitcoin creator, Visionary and Genius. https://bitslog.wordpress.com/2013/04/17/the-well-deserved-fortune-of-satoshi-nakamoto/
Dante, A. (1321). La Divina Commedia.
Dingledine, R. (2005). Tor Hidden Services. Proc. What the Hack.
Drexler, K. E., & Minsky, M. L. (1990). Engines of Creation. Fourth Estate.
Gilbert, H., & Handschuh, H. (2004). Security analysis of SHA-256 and sisters. Selected Areas in Cryptography (pp. 175–193).
Hardin, G. (1968). The tragedy of the commons. Science, 162(3859), 1243–1248.
King, S., & Nadal, S. (2012). PPCoin: Peer-to-Peer Crypto-Currency with Proof-of-Stake. http://ppcoin.org/static/ppcoin-paper.pdf
Lopez, J., & Dahab, R. (2000). An overview of elliptic curve cryptography.
Nakamoto, S. (2008a). Re: Bitcoin P2P e-cash paper. http://www.mail-archive.com/cryptography@metzdowd.com/msg09971.html
Nakamoto, S. (2008b). Bitcoin: A peer-to-peer electronic cash system.
Nietzsche, F. W. (1911). The Complete Works of Friedrich Nietzsche: Thus Spake Zarathustra.
Percival, C. (2009). Stronger key derivation via sequential memory-hard functions.
Rand, A. (1996). Atlas Shrugged. Signet.
Rosenberg, P. (2007). A Lodging of Wayfaring Men (2nd ed.). Vera Verba.
Stephenson, N. (2000). Cryptonomicon. William Morrow Paperbacks.
Stephenson, N. (2003). The Diamond Age. Spectra.

Review of Steven J. Brams: Game Theory and the Humanities – Bridging Two Worlds. The MIT Press, 2011. 336 pages.

Let's have a game of 2 players, each of whom has 2 strategies. While it is almost impossible to imagine a situation which would seem simpler than such a game with 2 × 2 = 4 possible outcomes, the whole thing gets much more complex when one realizes that, even under the assumption that both players do not attribute absolute cardinal utilities to the diverse outcomes but only four mutually relative ordinal ranks (i.e. 1: worst outcome, 2: next-worst, 3: next-best and 4: best outcome), there exists a variety of 78 distinct 2 × 2 «games» for players with different preferences. Steven J. Brams' «Game Theory and the Humanities – Bridging Two Worlds» offers concrete historical or fictitious examples of more than a dozen of such games. Starting with an interpretation of Abraham's son-sacrificing dilemma as a possibly intrapsychic game which the old shepherd played with a somewhat sadistic god character; continuing through the intricacies of Pascal's wager towards more mundane games played between Nixon and the Supreme Court after the Watergate crisis, or the game played between Khomeini and Carter during the 1979 Iran hostage crisis; and ending with the famous Catch-22 case between Yossarian and the war machinery – almost everywhere in his book Brams makes a non-negligible step in the direction of a unification of law, history, politology, literary criticism and even theology under the mathematically sound clef de voûte offered by game theory. Such an act would in itself be worthy of praise but, luckily for science, Brams goes much further.
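Incidentally, the figure of 78 structurally distinct games mentioned above can be verified by brute force: enumerate all 24 × 24 = 576 assignments of strict ordinal ranks and count the equivalence classes under row swaps, column swaps and the interchange of the two players. The following Perl sketch is our own illustration, not taken from Brams' book:

    #!/usr/bin/perl
    use strict; use warnings;
    use List::Util qw(minstr);

    # Cells of a 2x2 game in fixed order: (0,0),(0,1),(1,0),(1,1);
    # each cell holds [rank for player 1, rank for player 2].
    my @perms = permutations(1, 2, 3, 4);   # 24 strict preference orders
    my %classes;
    for my $p1 (@perms) {
        for my $p2 (@perms) {
            my @game = map { [ $p1->[$_], $p2->[$_] ] } 0 .. 3;
            $classes{ canonical(\@game) } = 1;
        }
    }
    print scalar(keys %classes), "\n";      # prints 78

    # Canonical form: lexicographically smallest encoding over the 8 symmetries.
    sub canonical {
        my ($g) = @_;
        my @variants;
        for my $rows (0, 1) { for my $cols (0, 1) { for my $swap (0, 1) {
            my @c = map { [ @$_ ] } @$g;
            @c[0, 1, 2, 3] = @c[2, 3, 0, 1] if $rows;   # relabel P1's strategies
            @c[0, 1, 2, 3] = @c[1, 0, 3, 2] if $cols;   # relabel P2's strategies
            if ($swap) {                                 # interchange the players
                @c[1, 2] = @c[2, 1];                     # transpose the matrix
                @c = map { [ $_->[1], $_->[0] ] } @c;    # swap the payoff roles
            }
            push @variants, join ',', map { @$_ } @c;
        } } }
        return minstr @variants;
    }

    sub permutations {
        my @items = @_;
        return ([]) unless @items;
        my @out;
        for my $i (0 .. $#items) {
            my @rest = @items;
            my ($pick) = splice @rest, $i, 1;
            push @out, [ $pick, @$_ ] for permutations(@rest);
        }
        return @out;
    }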
The introduction of the Theory of Moves framework allows him to extend the classical notion of a Nash equilibrium into the notion of a «nonmyopic equilibrium», which takes into account the players' faculty of «anticipating all possible rational moves and countermoves from the initial state». Structural similarities between Shakespeare's Macbeth and Aristophanes' Λυσιστράτη are subsumed under the generic category of (Self-)Frustration games, while other concrete instances of 2 × 2 conflicts (e.g. the American Civil War) are presented in order to illustrate other generic categories like «Magnanimity games» or «King-of-the-mountain games». Topics like deception, games where some players have incomplete or false information, the rationality of emotions, or the «paradox of omniscience», demonstrating that «in certain games it is more advantageous not to know everything than the contrary», are introduced with the erudition of a scholar with almost half a century of practice in the field. To summarize: the interdisciplinary paradigm presented in the glossary, appendix, 11 chapters and 35 figures of Brams' book is not only intellectually pleasing but could also furnish practically exploitable insights for experts in domains as distant as comparative mythology, evolutionary psychology, roboethics or – if the Turing Test can be collapsed into a 2 × 2 game – even the domain of hard-core AI.

Written in 2013 for the quarterly of the Artificial Intelligence and Simulated Behaviour Society (AISB). Daniel Devatman Hromada is a double PhD candidate of the Slovak University of Technology (dpt. of cybernetics) and Université Paris 8 (dpt. of cognitive psychology).

Random Projection and Geometrization of String Distance Metrics

Daniel Devatman Hromada
Université Paris 8 – Laboratoire Cognition Humaine et Artificielle
Slovak University of Technology – Faculty of Electrical Engineering and Information Technology
hromi@giver.eu

Abstract
Edit distance is not the only approach by which the distance between two character sequences can be calculated. Strings can also be compared in somewhat subtler, geometric ways. A procedure inspired by Random Indexing can attribute a D-dimensional geometric coordinate to any character N-gram present in the corpus and can subsequently represent a word as the sum of the N-gram fragments which the string contains. Thus, any word can be described as a point in a dense D-dimensional space, and the calculation of distances between words can be realized by applying traditional Euclidean measures. A strong correlation exists, within the Keats Hyperion corpus, between such a cosine measure and Levenshtein distance. Overlaps between the centroid of the Levenshtein distance matrix space and the centroids of vector spaces generated by Random Projection were also observed. Contrary to the standard non-random "sparse" method of measuring cosine distances between two strings, the method based on Random Projection tends naturally to promote not the shortest but rather the longer strings. The geometric approach yields a finer output range than Levenshtein distance, and the retrieval of the nearest neighbor of a text's centroid could have, due to the limited dimensionality of the Randomly Projected space, smaller complexity than other vector methods.

Μηδεὶς ἀγεωμέτρητος εἰσίτω μου τὴν στέγην

1 Introduction
The transformation of qualities into still finer and finer quantities belongs among the principal hallmarks of the scientific method.
In a world where even "deep" entities like "word-meanings" are quantified and co-measured by an ever-growing number of researchers in computational linguistics (Kanerva et al., 2000; Sahlgren, 2005) and the cognitive sciences (Gärdenfors, 2004), it is of no surprise that "surface" entities like "character strings" can also be compared one with another according to a certain metric. Traditionally, the distance between two strings is most often evaluated in terms of edit distance (ED), which is defined as the minimum number of operations like insertion, deletion or substitution required to change one string-word into the other. A prototypical example of such an edit distance approach is the so-called Levenshtein distance (Levenshtein, 1966). While many variants of Levenshtein distance (LD) exist – some extended with other operations like that of "metathesis" (Damerau, 1964), some exploiting probabilistic weights (Jaro, 1995), some introducing dynamic programming (Wagner & Fischer, 1974) – all these ED algorithms take for granted that notions of insertion, deletion etc. are crucial in order to operationalize the similarity between two strings. Within this article we shall argue that one can successfully calculate the similarity between two strings without taking recourse to any edit operation whatsoever. Instead of discrete insert & delete operations, we shall focus the reader's attention upon a purely positive notion, that of the "occurrence of a part within the whole" (Harte, 2002). Any string-to-be-compared shall be understood as such a whole, and any continuous N-gram fragment observable within it shall be interpreted as its part.

2 Advantages of Random Projection
Random Projection is a method for projecting high-dimensional data into representations with fewer dimensions. In theoretical terms, it is founded on the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984), stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that the distances between the points are nearly preserved. In practical terms, solutions based on Random Projection, or on the closely related Random Indexing, tend to yield high performance when confronted with diverse NLP problems like synonym-finding (Sahlgren & Karlgren, 2002), text categorization (Sahlgren & Cöster, 2004), unsupervised bilingual lexicon extraction (Sahlgren & Karlgren, 2005), discovery of implicit inferential connections (Cohen et al., 2010) or automatic keyword attribution to scientific articles (El Ghali et al., 2012). RP distinguishes itself from other word space models in at least one of these aspects:

1. Incremental: RP allows one to inject on-the-fly new data points (words) or their ensembles (texts, corpora) into an already constructed vector space. One is not obliged to execute heavy computations (like Singular Value Decomposition in the case of Latent Semantic Analysis) every time new data is encountered.

2. Multifunctional: Like other vector-space models, RP can be used in many diverse scenarios. In RI, for example, words are often considered to be the terms and sentences are understood as documents. In this article, words (or verses) shall be considered as documents, and the N-gram fragments which occur in them shall be treated like terms.

3. Generalizable: RP can be applied in any scenario where one needs to encode into vectorial form the set of relations between discrete entities observable at diverse levels of abstraction (words/documents, parts/wholes, features/objects, pixels/images etc.).
4. Absolute: N-grams and terms, words and sentences, sentences and documents – in RP all these entities are encoded in the same randomly constructed yet absolute space. Similarity measurements can therefore be realized even among entities which would be considered incommensurable in more traditional approaches¹.

1 In traditional word space models, words are considered to be represented by the rows (vectors/points) of the word-document matrix and documents to be its columns (axes). In RP, both words (or word-fragments) and documents are represented by rows.

There is, of course, a price to be paid for these advantages. Primo, RP involves stochastic aspects, and its application thus does not guarantee replicability of results. Secundo, it involves two parameters D and S, and the choice of these parameters can significantly modify the model's performance (in relation to the corpus upon which it is applied). Tertio: since even the most minute "features" are initially encoded in the same way as more macroscopic units like words, documents or texts – i.e. by a vector of length D "seeded" with S non-zero values – RP can be susceptible to certain limitations if ever applied on data discretisable into millions of distinct observable features.

3 Method
The method of geometrization of strings by means of Random Projection (RP) consists of four principal steps. Firstly, the strings contained within the corpus are "exploded" into fragments. Secondly, a random vector is assigned to every fragment according to RP's principles. Thirdly, the geometric representation of a string is obtained as the sum of its fragment-vectors. Finally, the distance between two strings can be obtained by calculating the cosine of the angle between their respective geometric representations.

3.1 String Fragmentation
We define a fragment F of a word W having length N as any continuous² 1-, 2-, 3-, ..., N-gram contained within W. Thus, a word of length 1 contains 1 fragment (the fragment is the word itself), words of length 2 contain 3 fragments (e.g. "ab" yields a, b, ab) and, more generally, there exist N(N+1)/2 fragments for a word of length N. The fragmentation algorithm is as follows:

    sub fragmentator {
        my ($word) = @_;
        my @frags;
        for my $frag_length (1 .. length $word) {
            for my $offset (0 .. length($word) - $frag_length) {
                push @frags, substr($word, $offset, $frag_length);
            }
        }
        return @frags;
    }

where substr() is a function returning from the string $word a fragment of length $frag_length starting at the specified $offset.

2 Note that in this introductory article we exploit only continuous N-gram fragments. The interaction of RP with other possibly relevant patterns observable in the word – like N-grams with gaps, or sequences of members of diverse equivalence classes [e.g. consonants/vowels] – shall be, we hope, addressed in our doctoral Thesis or other publications.

3.2 Stochastic fragment-vector generation
Once the fragments are obtained, we transform them into geometric entities by following the fundamental precept of Random Projection: to every fragment-feature F present in the corpus, let us assign a random vector of length D containing D−S elements having zero values and S elements whose value is either −1 or 1. The number of dimensions (D) and the seed (S) are the parameters of the model. It is recommended that S be much smaller than D. [...] were observed among the words of the corpus, one could distinguish 252229 diverse real-numbered RPD values limited to the interval <0, 1>.
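A minimal Perl sketch of steps two and three as described above – the caching scheme is our illustrative choice, with D=1000 and S=5 as in the experiments below:

    my ($D, $S) = (1000, 5);     # dimensionality and seed
    my %fragment_vector;         # cache: every fragment keeps one fixed random vector

    # Return (generating on first demand) the random init vector of a fragment:
    # D elements, of which S randomly placed ones are -1 or +1, the rest zeroes.
    sub fragment_vector {
        my ($frag) = @_;
        $fragment_vector{$frag} //= do {
            my @v = (0) x $D;
            my $placed = 0;
            while ($placed < $S) {
                my $dim = int rand $D;
                next if $v[$dim];                 # keep the S non-zero slots distinct
                $v[$dim] = rand() < .5 ? -1 : 1;
                $placed++;
            }
            \@v;
        };
    }

    # A word's vector is the sum of the vectors of all its fragments (cf. 3.1).
    sub word_vector {
        my ($word) = @_;
        my @w = (0) x $D;
        for my $frag (fragmentator($word)) {
            my $fv = fragment_vector($frag);
            $w[$_] += $fv->[$_] for 0 .. $D - 1;
        }
        return \@w;
    }

    # Cosine of the angle between two word vectors: the RPD similarity measure.
    sub cosine {
        my ($u, $v) = @_;
        my ($dot, $nu, $nv) = (0, 0, 0);
        for my $i (0 .. $D - 1) {
            $dot += $u->[$i] * $v->[$i];
            $nu  += $u->[$i] ** 2;
            $nv  += $v->[$i] ** 2;
        }
        return $dot / sqrt($nu * $nv);
    }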
RPD
vessels – vessel
comfort – comforts
sorrows – sorrow
'benign – benign
temples – temple
changing – unchanging
stream – streams
immortal's – immortal
breathe – breath
trance – tranced

Table 2: Ten most similar word couples according to the non-random "sparse" geometric distance (GD) and the Randomly Projected Distance (RPD)

Figure 1: Scatter plot displaying the relations between Levenshtein distances and cosine distances measured in the vector space constructed by RI1000,5

String distance measured in the space constructed by RP1000,5 also strongly correlates (Pearson correlation coefficient = 0.992; Spearman rho = 0.679; minimal p < 2.2e-16 for both tests) with a GD cosine measure exploiting a non-deformed fragment-word matrix. An important difference was observed, however, during a more "local" & qualitative analysis of the results produced by the two vectorial methods. More concretely: while the non-stochastic "sparse" cosine GD distance tends to promote as "closest" couples of short strings, RPD yields the highest scores for couples of long words. This is indicated by the list of most similar word couples presented in Table 2.

4.3 The "centroid" experiment
Three types of concrete word-centroids were extracted from the corpus. A string having the smallest overall LD to all other strings in the corpus shall be labeled the "Levenshtein centroid" (LC). A string having the maximal sum of cosines in relation to other words shall be labeled the "Cosinal centroid" (CC). Contrary to LC and CC, for the calculation of which one has to calculate distances with regard to all words in the corpus, the "Geometric Centroid" (GC) was determined as the word whose vector has the biggest cosine with regard to the "Theoretical Centroid" (TC), obtained in a purely geometric way as the sum of all word-vectors. Stochastic CCRP and GCRP calculation simulations were repeated in 100 runs with D=1000, S=5.

4.3.1 Results
The word "are" was determined to be the LC of the Hyperion corpus, with an average LD_are,X = 4.764 to all words of the corpus. The same word "are" was ranked, by the non-stochastic "sparse" geometric distance algorithm, as the 3rd most central CC and the 36th closest term to GC. Table 3 shows the ten terms with the least overall LD to all other words of the corpus (LC), the ten terms with the biggest cosine in relation to all other terms of the corpus (CCGD) and the ten terms with the biggest cosine with regard to the hypothetical Theoretical Centroid (GCGD) of the sparse non-projected space obtained from the Hyperion corpus.

Rank | LC   | CCGD  | GCGD
1    | are  | charm | a
2    | ore  | red   | o
3    | ate  | arm   | I
4    | ere  | a     | '
5    | one  | me    | he
6    | toes | hard  | to
7    | sole | had   | at
8    | ease | reed  | an
9    | lone | domed | me
10   | here | are   | as

Table 3: Ten nearest-neighbor words of three types of non-stochastic centroids

The shortest possible strings seem to be GCGD's nearest neighbors. This seems analogous to the data presented in Table 2. In this sense the GCGD method seems to differ from the CCGD approach, which tends to promote longer strings. Such a marked difference in behavior between the GC and CC approaches was not observed in the case of spaces constructed by means of Random Projection. In 100 runs, both the GC- and CC-centered approaches seemed to promote as central strings of comparable content and length⁴. As indicated by Table 4, the LC "are" turned out to be the closest (i.e. Rank 1, when comparing with Table 3) to all other terms in 6% of the Random Projection runs. In 6% of runs the same term was labeled as the nearest neighbor to the geometrical centroid of the generated space. Other overlaps between all the methods used are marked by bold type in Tables 3 and 4.
Word | CCRPD | GCRPD
see  | 20    | 28
he   | 11    | 8
are  | 6     | 6
ore  | 5     | 6
ere  | 4     | 5
set  | 6     | 5
she  | 5     | 4
sea  | 4     | 4
a    | 9     | 4
red  | 1     | 3

Table 4: Central terms of Randomly Projected spaces and their frequency of occurrence in 100 runs

Analogically to the observation described in the last paragraph of section 4.2.1, it can also be observed that the strings characterized as "closest" to the Theoretical Centroid of vector spaces generated by Random Projection tend to be longer than the "minimal" strings nearest to GCGD determined in the traditional non-stochastic feature-word vector space scenario.

4 In fact, only in 22 runs did GCRPD differ from CCRPD.

5 Discussion
When it comes to CCRP calculation, one run lasted on average t_CCRPD-detection = 90 seconds, thus being almost twice as fast as the LC calculation executed on the very same computer, which lasted t_LC-detection = 157 s for the same corpus, indicating that the computational complexity of our PDL (Glazebrook et al., 1997) implementation of CCRP detection is smaller than the complexity of LC detection based on PERL's Text::Levenshtein implementation of LD. When it comes to the computational complexity of GC calculation, it is evident that GC is determined faster and by a less complex process than LCs or CCs. This is so because in order to determine the GCRP of N words there is no need to construct an N × N distance matrix. On the contrary, since every word is attributed coordinates in a randomly generated yet absolute space, the detection of a hypothetical Geometric Centroid of all words is a very straightforward and cheap process, as is the detection of GC's nearest word neighbor. And since in RP the length of the GC-denoting vector is limited to a relatively reasonably low number of elements (i.e. D = 1000 in the case of this paper), it is of no surprise that the string closest to GC shall be found more slowly in a traditional "sparse vector" scenario whenever the number of features (columns) > D. In our scenario with NF = 22340 distinct features, it was almost 4 times faster to construct the vector space and find the nearest word to the GC of the Randomly Projected space than to use a "sparse" fragment-term matrix optimized by storing only non-zero values (GCRPD-NN-detection ≈ 6 s; GCGD-NN-detection ≈ 22 s). Another thing worthy of interest could be that, contrary to the "sparse" method, which seems to give higher scores to shorter strings, somewhat longer strings seem to behave as if they were naturally "pushed towards the centroid" in the dense space generated by RP. If such is, verily, the case, then we believe that the method presented hereby could be useful, for example, in the domain of gene sequence analysis, or in other scenarios where the pattern-to-be-discovered is "spread out" rather than centralized. In practical terms, if ever the querying in RP space turns out to have smaller complexity than other vector models, our method could be useful within a hybrid system as a fast stochastic way to pre-select a limited set of "candidate" (possibly locally optimal) strings, which could subsequently be confronted with more precise, yet costly, non-stochastic metrics ultimately leading to the discovery of the global optimum. Aside from the above-mentioned aspects, we believe that there exists at least one other theoretical reason for which the RP-based geometrization procedure could be deemed a worthy alternative to LD-like distance measures.
That is: the cardinality of the real-valued <0, 1> range of a cosine function is much higher than the whole-numbered <0, max(length(word))> range possibly offered as the output of Levenshtein Distance. In other terms, the outputs of string distance functions based on the trigonometry of RP-based vector spaces are subtler, more fine-grained, than those furnished by traditional LD. While this advantage does not hold for "weighted" LD measures, we hope that this article may motivate future studies aiming to compare "weighted" LD and RPD metrics. When it comes to the feature-extracting "fragment explosion" approach, it could possibly be reproached to the method proposed hereby that 1) the fragmentation component, which permutes blindly through all N-grams present in the corpus, yields too many "features"; 2) that taking all of them into account during the calculation of the word's final vector is not necessary and could even turn out to be computationally counter-productive; or that 3) bi-grams and tri-grams alone give better results than larger N (Manning et al., 2008). A primary answer to such an ensemble of reproaches could be that, by the very act of projecting the data upon a limited set of the same non-orthogonal dimensions, the noise could simply cancel itself out⁵.

5 And this "noise canceling property" could be especially true for RP as defined in this paper, where the rare non-zero values in the random "init" vectors can point in opposite directions (i.e. either −1 or 1).

Another possible answer to the argument could be that while the bi- & tri-gram argument holds well for natural language structures, the method we aim to propose here has ambitions to be used beyond NLP (e.g. bio-informatics) or in pre-NLP (e.g. early stages of language acquisition, where the very notion of an N-gram does not make sense because the very criterion of sequence segmentation & discretization has not yet been established). Last but not least, we could counter-argue by stating that algorithms based on a sort of initial blind "computational explosion of the number of features" often perform better than those which do not perform such an explosion, especially when coupled with subsequent feature selection algorithms. Such is the case, for example, of the approach proposed by Viola & Jones (Viola & Jones, 2001), which caused a revolution in computer vision by proposing that in order to detect an object, one should look for combinations of pixels instead of pixels. In this paper, such combinations of "letter-pixels" were, mutatis mutandis, called "fragments". Our method departs from the idea that one can, and should, associate random vectors to such fragments. But the idea can go further: instead of looking for the occurrence of a part in the whole, a more advanced RI-based approach shall replace the notion of a "fragment occurring in the word" by the more general notion of a "pattern which matches the sequence". Thus even the vector associated to the pattern /d.g/ could be taken into account during the construction of a vector representing the word "dog". Reminding the reader that RP-based models perform very well when it comes to offering solutions to quite "deep", signifié-oriented problems, we find it difficult to understand why the same algorithmic machinery could not be applied to problems dealing with "surface", signifiant-oriented problems, notably given the fact that some sort of dimensionality reduction has to occur whenever the mind tries to map >4D experiences upon the neural substrate of a brain embedded in 3D physical space.
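As a toy illustration of this generalization – our own sketch, reusing $D and fragment_vector() from the sketch in section 3.2, with an arbitrary pattern inventory – the substring-based explosion can be replaced, or complemented, by any list of regular expressions, a word's vector then summing the vectors of all patterns that match it:

    # Generalization sketch: features need not be substrings; any pattern that
    # matches the word can contribute its random vector to the word's vector.
    my @patterns = (qr/d.g/, qr/^do/, qr/g$/);   # arbitrary illustrative inventory

    sub word_vector_from_patterns {
        my ($word) = @_;
        my @w = (0) x $D;
        for my $pattern (@patterns) {
            next unless $word =~ $pattern;
            my $pv = fragment_vector("$pattern");  # stringified qr// as cache key
            $w[$_] += $pv->[$_] for 0 .. $D - 1;
        }
        return \@w;
    }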
Given that all observed correlations and centroid overlaps indicate that string distance calculation based on Random Projection could turn out to be a useful substitute for the LD measure, or even for other, more fine-grained methods, and given that RP would not be possible if the Johnson-Lindenstrauss lemma were not valid, our results can also be interpreted as one more empirical demonstration of the validity of the JL-lemma.

Acknowledgments
The author would like to thank Adil El Ghali for the introduction into Random Indexing as well as for his comments concerning the present paper; prof. Charles Tijus and doc. Ivan Sekaj for their support; and Aliancia Fair-Play for permission to execute some code on their servers.

References
Trevor Cohen, Roger Schvaneveldt & Dominic Widdows. 2010. Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2), 240–256.
Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171–176.
Adil El Ghali, Daniel D. Hromada & Kaoutar El Ghali. 2012. Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés. JEP-TALN-RECITAL 2012, 77.
Peter Gärdenfors. 2004. Conceptual spaces: The geometry of thought. MIT Press.
Karl Glazebrook, Jarle Brinchmann, John Cerney, Craig DeForest, Doug Hunt, Tim Jenness & Tuomas Lukka. 1997. The Perl Data Language. The Perl Journal, 5(5).
Verity Harte. 2002. Plato on parts and wholes: The metaphysics of structure. Clarendon Press, Oxford.
Matthew A. Jaro. 1995. Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7), 491–498.
William B. Johnson & Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
Pentti Kanerva, Jan Kristofersson & Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. Proceedings of the 22nd Annual Conference of the Cognitive Science Society (Vol. 1036).
John Keats. 1819. The Fall of Hyperion. A Dream. In John Keats: Complete poems and selected letters, 381–395.
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady (Vol. 10, p. 707).
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini & Chris Watkins. 2002. Text classification using string kernels. The Journal of Machine Learning Research, 2, 419–444.
Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press.
Magnus Sahlgren. 2005. An introduction to random indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE (Vol. 5).
Magnus Sahlgren & Rickard Cöster. 2004. Using bag-of-concepts to improve the performance of support vector machines in text categorization. Proceedings of the 20th International Conference on Computational Linguistics (p. 487).
Magnus Sahlgren & Jussi Karlgren. 2002. Vector-based semantic analysis using random indexing for cross-lingual query expansion. Evaluation of Cross-Language Information Retrieval Systems (pp. 169–176).
Magnus Sahlgren & Jussi Karlgren. 2005. Automatic bilingual lexicon acquisition using random indexing of parallel corpora. Natural Language Engineering, 11(3), 327–341.
Alan M. Turing. 1936. On computable numbers, with an application to the Entscheidungsproblem.
Proceedings of the London Mathematical Society, 42(2), 230–265.

Paul Viola & Michael Jones. 2001. Rapid Object Detection using a Boosted Cascade of Simple Features. Proc. IEEE CVPR 2001.

Robert A. Wagner & Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM (JACM), 21(1), 168–173.

Empiric Introduction to Light Stochastic Binarization

Daniel Devatman Hromada 1,2

1 Slovak University of Technology, Faculty of Electrical Engineering and Information Technology, Department of Robotics and Cybernetics, Ilkovičova 3, 812 19 Bratislava, Slovakia
2 Université Paris 8, Laboratoire Cognition Humaine et Artificielle, 2, rue de la Liberté, 93526 St Denis Cedex 02, France

Abstract. We introduce a novel method for the transformation of texts into short binary vectors which can subsequently be compared by means of Hamming distance measurement. Similarly to other semantic hashing approaches, the objective is to perform radical dimensionality reduction by putting texts with similar meaning into the same or similar buckets while putting texts with dissimilar meaning into different and distant buckets. First, the method transforms the texts into a complete TF-IDF representation, then it implements Reflective Random Indexing in order to fold both term and document spaces into a low-dimensional space. Subsequently, every dimension of the resulting low-dimensional space is simply thresholded along its 50th percentile, so that every individual bit of the resulting hash shall cut the whole input dataset into two equally cardinal subsets. Without implementing any parameter-tuning training phase whatsoever, the method attains, especially in the high-precision/low-recall region of the 20newsgroups text classification task, results which are comparable to those obtained by much more complex deep learning techniques.

Keywords: Reflective Random Indexing, unsupervised Locality Sensitive Hashing, Dimensionality Reduction, Hamming Distance, Nearest-Neighbor Search

1 Introduction

In applied Computer Science one often needs to select from a database the object which most resembles a "query" object already at one's disposition. In order to do so, all members of the database are often transformed into ordered sequences of numeric values (i.e. vectors). Such vectors can be interpreted as points in a high-dimensional metric space, allowing one to calculate their distance to other points in the space. In such a case, the resulting "most similar" entity is simply the entity whose vector has a smaller distance to the vector representing the "query" entity than any other entity stored in the database, i.e. is the query's "nearest neighbor". In Natural Language Processing (NLP), nearest-neighbor search (NNS) is a widely-used approach applied for solving diverse problems. Seemingly trivial, NNS is nonetheless not an easy problem to tackle, especially in the case of Big Data scenarios where the database contains a huge amount of high-dimensional datapoints, and in real-time scenarios where the naive linear comparison of the d-dimensional query vector with all N vectors stored in the database is simply not feasible due to its O(Nd) computational complexity. Thus, one is almost always obliged to take recourse in approximation or heuristic-based solutions. One of the most common methods of reducing the complexity of the NN-search is reducing the dimensionality of the database-representing vector space. The classical approach to do so is Latent Semantic Analysis [10] (LSA).
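A minimal illustration of why the naive approach scales badly: the brute-force scan below touches all N × d values for every single query, hence the O(Nd) complexity mentioned above. Names and sizes are, of course, merely illustrative.

import numpy as np

N, d = 100_000, 300                      # database size and dimensionality
database = np.random.randn(N, d)         # one row per stored object
query = np.random.randn(d)

# O(Nd): every query must visit every value of every stored vector.
dists = np.linalg.norm(database - query, axis=1)
nearest = int(np.argmin(dists))
print("nearest neighbor:", nearest, "distance:", dists[nearest])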
Another family of increasingly common approaches exploits so-called binary vectors as the ultimate means of an entity's formalisation. Given the fact that contemporary computers are machines essentially – i.e. on the physical hardware level – always working with binary distinctions, the calculation of the distance between two binary vectors (i.e. the Hamming distance: the number of times a bit in vector1 has to be flipped in order to obtain the form of vector2) can indeed be a very fast operation to realize, especially when implemented on the hardware level as a part of the processor's instruction set. Combinations of dimensionality reduction and binarisation are the basis for the family of methods descending from the approach called Locality Sensitive Hashing (LSH) [11]. While concrete implementations often substantially differ – c.f. [13] for a state-of-the-art overview – the objective is always the same: to hash each object of the dataset into a concise binary vector in such a way that objects which are similar shall end up in the same or a similar bucket (i.e. shall be represented by the same or a similar binary vector) while objects which are disparate shall end up in disparate buckets.3 In order to attain stunningly good results, many of these methods have to be trained first. Such a trade-off of high performance against the complexity of the training phase is the case, for example, in the "semantic hashing" (SH) approach of [1]. In SH one has to first learn the weights between different restricted Boltzmann machines in order to obtain a multi-layered "deep generative model" able to perform the hashing. But SH also has certain non-negligible disadvantages: 1) the training-related costs, 2) the need to work with a restricted amount of features which shall enter the first layer of the network (e.g. 2000 TF-IDF values in [1]), 3) the possibility of over-fitting of the model, etc. In this article, we shall present an approach which one could take vis-à-vis the problem of "text hashing". Instead of founding our approach on a powerful supervised "deep learning" algorithm able to extract sophisticated combinations of regularities among a restricted number of initial features, we shall exploit an algorithm so simple that it can easily integrate a huge number of features in a very fast & frugal way. In fact, the algorithm presented here is completely unsupervised and does not need any training or feature pre-selection at all in order to perform the hashing process.

3 Note that the aim of the hashing process as presented in this paper differs substantially from the aim of hashing algorithms like MD5 or SHA2, whose objective is to always hash objects into different buckets.

1.1 Reflective Random Indexing

Theoretically, our approach stems from the Johnson-Lindenstrauss lemma, stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved [9]. Practically, the JL-lemma has already been implemented as the so-called Random Projection or Random Indexing algorithms. Random Projection has already been quite successfully proposed in relation to the hashing problem [12]. Its much simpler Random Indexing (RI) counterpart, however, has not.
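Returning for a moment to the hardware argument above: with binary vectors packed into machine words, the Hamming distance reduces to an XOR followed by a population count. A minimal sketch (Python ≥ 3.10 exposes the popcount as int.bit_count()):

# Two short hashes represented as Python integers (toy values).
h1 = 0b1001_0011_0000_1011
h2 = 0b1001_0111_0000_0011

# Hamming distance = number of differing bits = popcount of XOR.
print((h1 ^ h2).bit_count())   # -> 2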
In the decade since its initial proposal in [4], RI has already proven its usefulness in regards to NLP problems as diverse as synonym-finding [4], text categorization [5], unsupervised bilingual lexicon extraction [6], implicit knowledge discovery [2], automatic keyword attribution to scientific articles [3] or measurement of string distance metrics [8]. The basic RI algorithm is quite straightforward to both understand and implement: given a set of N objects (e.g. documents) which can be described in terms of F features (e.g. the occurrence of a string in the document), to each of which one initially associates a randomly generated D-dimensional vector, one can obtain a D-dimensional vectorial representation of any object X by summing up the vectors associated to all features F1, F2, ... observable within X. The original random feature vectors are generated in such a way that out of the D elements of the vector, only S among them are set to either the -1 or 1 value; all other values contain zero. The "seed" parameter S is much smaller than the total number of elements in the vector (D), i.e. S << D, so the initial feature vectors are very sparse. [...] Every dimension d of the resulting low-dimensional space is subsequently thresholded along its median, so that the bit assigned to a document whose real value in dimension D_d is n is determined as:

h_d(n) = 0 if n < median(D_d)
h_d(n) = 1 if n > median(D_d)
h_d(n) = rand(0|1) if n == median(D_d)

Subsequently, during the query phase, the query object is transformed into its binary vector by: 1) summing up the real-valued representations of the features observable within the query object; 2) thresholding the resulting real-valued vector by the pre-determined medians. The resulting binary vector is subsequently considered to be the center of a Hamming ball of radius R. Every binary vector contained within such a Hamming ball shall yield an index pointing to a bucket stored in the memory where we could look in order to find the query's nearest neighbor. In case 2R [...]

> New since version of 18 April 1993:
> * New version of XV supports 24-bit viewing for X Windows.
> * New versions of DVPEG & Image Alchemy for DOS.
> * New versions of Image Archiver & PMView for OS/2.
> * New listing: MGIF for monochrome-display Ataris.
461,463c464,466
< PMView 0.85: JPEG/GIF/BMP/Targa/PCX viewer. GIF viewing very fast,
< JPEG viewing roughly the same speed as the above two programs. Has
< image manipulation & slideshow functions. Shareware, $20.
---
> PMView 0.85: JPEG/GIF/BMP viewer. GIF viewing very fast, JPEG viewing
> fast if you have huge amounts of RAM, otherwise about the same speed
> as the above programs. Strong 24-bit display support. Shareware, $20.
632,641d634
< NeXT:
<
< ImageViewer is a PD utility that displays images and can do some format
< conversions. The current version reads JPEG but does not write it.
< ImageViewer is available from the standard NeXT archives at
< sonata.cc.purdue.edu and cs.orst.edu, somewhere in /pub/next (both are
< currently being re-organized, so it's hard to point to specific
< sub-directories). Note that there is an older version floating around that
< does not support JPEG.

In spite of the difference of their contents, LSB assigned the same hash "10010011000010111001100010110010001111000011011101001001010011010111010000000100101001101100010010010101001101000010111110110011" to the files comp.graphics/39638 and comp.graphics/39078 of the 20-newsgroups corpus.

4 Discussion

Looking at the peak shown in Figure 2, one is tempted to state that, when confronted with data from the testing set of the 20newsgroups corpus, the reflective 128-dimensional LSB is able to retrieve, in 42% of cases (i.e. 3166 out of 7531), at least one relevant "neighbor" with maximal precision.
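As an aside before examining these results further: the core of the pipeline evaluated here – per-dimension median thresholding of the folded vectors, followed by Hamming-distance lookup – admits a very short sketch. The code below is our minimal reading, with random data standing in for the RRI-folded document vectors; all names and sizes are illustrative, and the rand tie-break of the formula in Sect. 1.1 is omitted since exact ties have probability zero for continuous values.

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 128))        # stand-in for RRI-folded document vectors

medians = np.median(docs, axis=0)          # one threshold per dimension
hashes = (docs > medians).astype(np.uint8) # 128-bit binary hash per document

# Query phase: fold the query the same way, then threshold by the same medians.
query = rng.normal(size=128)
qhash = (query > medians).astype(np.uint8)
hamming = np.count_nonzero(hashes != qhash, axis=1)
print("closest bucket at Hamming distance:", hamming.min())

By construction, every bit cuts the dataset into two (almost) equally cardinal halves, which is exactly the "50th percentile" property stated in the abstract.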
It is indeed at the Hamming distance 38 where the method combining Reflective Random Indexing executed with parameters D=128, S=5, I=2 and followed by simple binary thresholding of every dimension attains, at an overall recall rate of 0.39%,5 a much higher precision (80.6%) than any method presented in the study of [1]. On the other hand, LSB performs much worse than the compared methods in situations where one wants to attain higher recall. This is most probably due to an almost complete lack of "training": with the exception of 1) TF-IDF weighting of the initial randomly-generated feature vectors, 2) the "reflection" procedure which aids us to characterize objects in terms of features and features in terms of objects, and 3) the determination of binary thresholds (i.e. medians), there is no kind of "learning" procedure involved. But it might be the case that the lack of any complex "deep learning" procedure shall prove itself to be a certain advantage. More concretely, one who uses LSB is not obliged to drastically reduce the number of features by means of which all objects are characterized. Thus, in the case of the introductory experiment presented in this paper, we have represented every text as a linear combination of vectors associated to 41782 features. We believe that it is indeed this ability to "exploit the minute details" (compare to the 2000 words with highest TF-IDF score used by [1]) which allows the method hereby introduced to attain higher precision in scenarios where just one relevant document is to be retrieved. It would, however, be unfair to state that LSB "performs better" than Semantic Hashing, because the purpose of SH was not to target the NN-search problem but to yield robust results in more exhaustive classification scenarios. Thus, comparison of LSB with other methods is needed. It might also be the case that more advanced tuning of RRI's parameters could improve the performance. Another possible direction of research is to study the impact of the strategies by means of which the initial random vectors are weighted. Due to the introductory nature of this paper, not much was unveiled about either of these two problems. Looking at Figure 1, one can, however, assert that LSB seems to attain better results when its RI component involves more than one iteration, i.e. when it is "reflective". In sum, we believe that the method hereby introduced is worth studying further. Not only because its dimensionality-reduction component – the RRI – is less costly and more open to the incremental addition of new data than, for example, LSA [10]. Not only because it is similar to LSH [11] in its ability to transform texts into hashes as concise as 16 ASCII characters and yet preserve the relations of similarity and difference held by the original texts. But also because the algorithm is easy to comprehend, simple to implement, and queries can be very fast to execute. That is why we label the method of binarization hereby presented as not only stochastic, but also light.

5 We precise that when we mention 42% recall with 100% precision, we speak about the NNS scenario where it is sufficient for a query to retrieve one relevant document. This scenario is documented on Fig. 2. On the other hand, when we mention the attained recall rate of 0.39%, we speak about the much more difficult "classification" scenario where a query, in order to attain maximal recall, must retrieve all >370 postings which belong to the same class. This scenario is documented on Figure 1.

References
1. R. Salakhutdinov & G. Hinton, "Semantic hashing", International Journal of Approximate Reasoning, vol. 50, no. 7, p. 969–978, 2009.
2. T. Cohen, R. Schvaneveldt & D. Widdows, "Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections", Journal of Biomedical Informatics, vol. 43, no. 2, p. 240–256, 2010.
3. A. El Ghali, D. Hromada & K. El Ghali, "Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés", JEP-TALN-RECITAL 2012, p. 77, 2012.
4. M. Sahlgren & J. Karlgren, "Vector-based semantic analysis using random indexing for cross-lingual query expansion", in Evaluation of Cross-Language Information Retrieval Systems, 2002, p. 169–176.
5. M. Sahlgren & R. Cöster, "Using bag-of-concepts to improve the performance of support vector machines in text categorization", in Proceedings of the 20th International Conference on Computational Linguistics, 2004, p. 487.
6. M. Sahlgren, "An introduction to random indexing", in Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, 2005, vol. 5.
7. C. D. Manning, P. Raghavan & H. Schütze, Introduction to Information Retrieval, vol. 1. Cambridge University Press, 2008.
8. D. D. Hromada, "Random Projection and Geometrization of String Distance Metrics", in Proceedings of the Student Research Workshop associated with RANLP, 2013, p. 79–85.
9. W. B. Johnson & J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space", Contemporary Mathematics, vol. 26, p. 189–206, 1984.
10. T. K. Landauer & S. T. Dumais, "A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge", Psychological Review, vol. 104, no. 2, p. 211–240, 1997.
11. A. Gionis, P. Indyk & R. Motwani, "Similarity search in high dimensions via hashing", in VLDB, 1999, vol. 99, p. 518–529.
12. M. S. Charikar, "Similarity estimation techniques from rounding algorithms", in Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing, 2002, p. 380–388.
13. A. Andoni & P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", in Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, 2006, p. 459–468.
14. 20 newsgroups, http://qwone.com/~jason/20Newsgroups/

Slovak University of Technology in Bratislava
Faculty of Electrical Engineering and Information Technology
Department of Control and Cybernetics
&&
University Paris 8
École Doctorale Cognition, Langage, Interaction
Cognition Humaine et Artificielle

Thesis for rigorous examination
Daniel D. Hromada

Topic of doctoral Thesis: Evolutionary models of ontogeny of linguistic categories and rules
Advisors: doc. Ing. Ivan Sekaj, PhD. (ivan.sekaj @ stuba.sk); prof. Charles Tijus (tijus @ univ-paris8.fr)
Form of study: external and under double supervision
Study began: September 2010 at University Paris 8; September 2011 at Slovak University of Technology
Study program: Cybernetics (Slovak University of Technology); Psychology (University Paris 8)
Study discipline: 9.2.7 Cybernetics (Slovak University of Technology); Cognitive Psychology (University Paris 8)

Abstract

Language development is a process by means of which a human baby constructs an adequate competence to encode & decode meanings in the language of her parents.
Computationally, it can be described as a trinity of mutually interconnected problems: clustering of all tokens which the baby heard into 1) semantic and 2) grammatical categories; and 3) discovery of grammatical rules allowing the combination of members of diverse equivalence classes into syntactically correct and meaningful phrases. A theoretical, « psycholinguistic » claim of our Thesis is that, similarly to those theories which explain the emergence of cultural or creative thinking as the result of an evolutionary process occurring within an individual mind, the emergence of linguistic representations and faculties within a human individuum can also be considered as a case where the basic tenets of Universal Darwinism apply. The practical, « cybernetic » aim of the Thesis is to create computational models of concept learning, part-of-speech induction and grammar induction with performance comparable to existing models but based principally on evolutionary algorithms. It shall be argued that the « fitness function », which determines the « survival rate » of « candidate grammars » emerging and disappearing in the baby's mind, should be based upon the idea that the most fit is such a grammar G which « minimizes the distance » between the utterances successfully parsed from linguistic environment E by the application of grammar G and the utterances potentially generated by the grammar G.

Keywords: evolutionary computing, language acquisition, genetic epistemology, part-of-speech induction, grammar induction, optimal clustering, machine learning, concept construction, grammar systems, motherese, toddlerese

List of most important abbreviations
CC – Concept Construction
EA – Evolutionary Algorithms | EP – Evolutionary Programming
ES – Evolutionary Strategies
ET – Evolutionary Theory
GA – Genetic Algorithms
GE – Genetic Epistemology
GI – Grammar Induction | Grammar Inference
LA – Language Acquisition | LD – Language Development
NLP – Natural Language Processing
POS-i – Part-of-Speech Induction; POS-t – Part-of-Speech Tagging
UD – Universal Darwinism

Convention
« Italics » – citation
( x | y | z ) – disjuncted token – i.e. read token « (neural|quantum) darwinism » as « neural darwinism; quantum darwinism »

Table of Contents
0. Introduction
1. Universal Darwinism
1.1. Biological evolution
1.2. Evolutionary Psychology
1.3. Memetics
1.4. Evolutionary Epistemology
1.5. Individual Creativity
1.6. Genetic Epistemology
1.7. Evolutionary computation
1.7.1. Genetic algorithms & fitness landscapes
1.7.2. Evolutionary programming & evolutionary strategies
1.7.3. Genetic programming
1.7.4. Grammatical evolution
1.7.5. Tierra
2. Language development
2.1. Ontogeny of semantic categories (concepts)
2.2. Ontogeny of formal categories (parts-of-speech)
2.3. Ontogeny of grammars (grammar induction)
3. Computational Models of Text Processing
3.1. Concept construction
3.1.1. Non-evolutionary model of CC
3.1.2. An evolutionary model of CC
3.2. Part-of-speech induction and part-of-speech tagging
3.2.1. Non-evolutionary models of POS-i
3.2.2. Evolutionary models of POS-i & POS-t
3.3. Grammar induction
3.3.1. Non-evolutionary models of grammar induction
3.3.2. Evolutionary models of grammar induction
3.4. Evolutionary Language Game
4. Remark concerning the Theory of Grammar Systems
5. Conclusive remarks
6. Bibliography

0. Introduction

A general form of Evolutionary Theory (ET) postulates that entities evolve and adapt to their environment by a process of accumulation of information. Such a generalized theory – often referred to as « Universal Darwinism » – can be, and often is, applied in scientific disciplines as diverse as biology, linguistics or even anthropology and psychology.
Since the principal concepts and tenets of ET can easily be formalised into stochastic « evolutionary » algorithms, ET can yield not only a theoretical framework but also a computational experimental methodology for any scientific discipline whose basic concepts and principles can be « reduced » into an ET-consistent form. The aim of my doctoral Thesis is to demonstrate empirically – i.e. by means of computational experiments – that certain phaenomena observed by « developmental linguists » and « psycholinguists » can be explained in terms of the principles of Universal Darwinism and, as such, can be modelled by « computational linguists » and « Natural Language Processing (NLP) engineers » who shall found their computational models upon methods offered by the Evolutionary Computing paradigm. More concretely, I shall try to indicate that « evolutionary » optimization can be used to yield solutions to at least three problems of language development: 1) induction of semantic categories, i.e. construction of « concepts »; 2) the problem of induction of part-of-speech grammatical categories of words of natural languages, i.e. the problem of how equivalence classes like « nouns », « verbs », « adjectives » etc. are constructed by the language-acquiring agent; 3) the problem of grammar induction, i.e. the problem of how an agent can acquire a grammar from the corpus or its environment. It shall be indicated that the term « language-acquiring agent » could be interpreted both as an organic agent (e.g. a human baby) trying to learn the language of its environment (e.g. its parents) as well as a computational agent (e.g. a Turing Machine) inducing the structural properties of the language which generated the corpora with which the agent has been confronted. In other terms, it shall be indicated that ET is generalizable to such an extent that its correct implementation may allow two systems based upon the Darwinian principles « replicate, mutate, select » to converge to the same optimal or quasi-optimal categories regardless of whether the substrate by means of which they compute is organic or not. The first chapter will more closely present the above-mentioned basic principles of the universal ET doctrine and enumerate certain scientific disciplines for which ET furnishes a useful theoretical framework. Aside from biology, where the role of ET is evident, the discipline of « evolutionary psychology » shall be mentioned principally in order to alert the reader that our aims are not limited to those posited by evolutionary psychology. The « memetic theory », on the contrary, shall more precisely elucidate our ultimate aim, since it already introduces a novel level of representation, « a meme », supposed to be « the basic unit of imitation », and as such offers an interesting starting point for any Darwin-consistent theory of evolution of non-organic (e.g. cultural) structures and artefacts. It is, however, the constructionist « genetic epistemology » (GE) of Jean Piaget which shall resonate even more strongly with our aims – since what GE ultimately postulates is that the human psyche – with all its linguistic, moral, object-manipulating faculties – passes through a sequence of « stages ».
For it is our belief that such Piagetian « stages » can be explained, in computational terms, as « quasi-optimal attractors » within a very complicated « search space » of the agent's internal representations, and that a sort of evolutionary process occurs not only on a social-memetic level between agents imitating each other, but, in the first place, within the agent (him|her|it)self. This Thesis is our tentative to base this « learning = evolution » belief on the solid ground of complexity theory. The second chapter will address the topic of language development (LD). The topic is so vast and deep that only the most fundamental subproblems (i.e. vocabulary development, acquisition of part-of-speech categories and acquisition of grammars) shall be briefly described, and some basic notions like « variation set » or « motherese » will be introduced. We shall try to avoid the dispute between diverse linguistic doctrines and schools (e.g. nativists, cognitivists, comparativists); focus shall be put upon points of consensus supported by empiric evidence. While the goal of the first chapter is to furnish the theoretical framework and the goal of the second is to specify the problem, it is the third chapter which deals with the concrete computational tentatives to unify the two. A major part of the chapter shall deal with the question of evaluation of diverse inductive models. Some of the most successful computational models of part-of-speech induction (POS-i) and grammar induction (GI) shall be mentioned in order to pave the way for the evolutionary ones. As shall be indicated by this section, the tentatives to apply evolutionary algorithms (EAs) to solve the POS-i and GI problems are, regardless of the good results reported in the literature, very rare. In specific subsections of the chapter, we shall mention certain models, both psycholinguistic and computational, which justify our claim that the process of ontogeny of the linguistic faculty can be not only interpreted but also modelled as a process of evolutionary optimization of cognitive structures.

1. Universal Darwinism

Universal Darwinism (UD) is a scientific paradigm regrouping diverse scientific theories extending the Darwinian theory of evolution and natural selection (Darwin 1859) beyond the domain of biology. It can be understood as a generalized theoretical framework aiming to explain the emergence of many complex phenomena in terms of the interaction of three basic processes: 1) variation, 2) selection, 3) retention. According to the UD paradigm, the interaction of these three components yields a « universal algorithm valid not only in biology, but in all domains of knowledge where we can extract informational entities – replicators, which are able to reproduce themselves with variations and which are subjects to selection » (Kvasnička and Pospíchal 1999). This generic algorithm is nothing else than traditional Evolutionary Theory (ET) which, when considered as substrate-neutral, can be applied to such a vast number of scientific fields that it has been compared to a kind of « universal acid » which « eats through just about every traditional concept, and leaves in its wake a revolutionized world-view, with most of the old landmarks still recognizable, but transformed in fundamental ways » (Dennett 1996).
As of 2013, the existing scientific disciplines which could be labeled as UD-consistent include: biology; evolutionary (art | psychology | music | linguistics | ethics | economics | anthropology | epistemology | computation); sociobiology; memetics; (quantum | neural | psycho) darwinism; artificial life and many others. In regards to the overall aim of our Thesis, some of the most relevant instances of UD are described in the following sub-sections.

1.1. Biological evolution

Evolutionary Theory was born when young Charles Darwin realised that the « gradation and diversity of structure » (Darwin 1906) which he had encountered among the mockingbirds of the Galapagos islands could be explained by the natural tendency of species to « adapt to a changing world ». In parallel to Darwin's work, which was gradually clarifying the terms of variability and its close relation to environment-originated selective pressures, Gregor Mendel was assessing the statistical distributions of colours of flowers of his garden peas in Brno in order to finally converge to the fundamental principles of heredity. But it was only in 1953 that the double-helix structure of the material substrate of heredity of biological species – the DNA molecule – was described in the (Watson & Crick, 1953) paper. In simple terms: in the DNA molecule, information is encoded as a sequence of nucleotides. Every nucleotide can contain one of four nucleobases; it thus ideally carries 2 bits of information. A continuous sequence of three nucleotides gives a « triplet » which, when interpreted by the intracellular « ribosome » machinery, can be « translated » into an amino acid. Sequences of amino acids yield proteins which interact with one another in biochemical cascades. The result is a living organism with its particular phenotype aiming to reproduce its genetic code. If, at a given time T, there are two organisms A and B whose genetic codes differ to such an extent that their phenotypes differ, and if ever the phenotype of organism A augments the probability of A's survival and reproduction in the external world W while B's phenotype diminishes such probability, we say that A is better adapted to world W than B, or, more formally, that fitness(A) > fitness(B). Evolutionary Theory postulates that in case there is a lack of resources in world W, descendants of the organism B shall gradually, after multiple generations, be substituted by descendants of the more fit organism A. This is so because during every act of reproduction, the material reason for having a more fit phenotype – the DNA molecule – is transferred from parent to offspring, and the whole process is cumulative across generations. It can, however, happen that the world W changes. Or a random (stochastic) event – a gamma ray, the presence of a free radical – can occur which would tamper with A's genetic code. Such an event – called « mutation » – shall result, in the majority of cases, in a decrease of A's fitness. Rarely, however, can mutations also increase it. Another event which can transform the genetic sequence is called « crossover ». It can be formalised as an operator which substitutes one part of the genetic code of organism A with the corresponding sequence of organism B, and vice versa, the part of B with the corresponding part of A. It is indeed especially the crossover operation, first described in the article of T.H. Morgan (Morgan 1916), which is responsible for the « mixing of properties » in the case of a child organism issued from two parent organisms.
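In computational terms, the crossover operator just described admits a very compact formulation; the sketch below shows an (illustrative) single-point variant exchanging the tails of two parent « genomes »:

import random

def one_point_crossover(parent_a, parent_b):
    # Swap everything behind a randomly chosen cut point.
    cut = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

print(one_point_crossover("AAAAAA", "BBBBBB"))  # e.g. ('AABBBB', 'BBAAAA')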
In more concrete terms: the genetic code of such « diploid » organisms is always stored in X pairs of chromosomes. Each chromosome in the pair is issued from either the father or the mother organism which, during the process of meiosis, divide their normally diploid cells into haploid gamete cells (i.e. sperms in the case of the father and eggs in the case of the mother). It is especially during the first meiotic phase that crossover occurs, the content of the DNA sequences of two grand-parents being mixed and mapped during the crossover operation into the chromosome contained in the gamete which, if lucky, shall fuse with the gamete of another parent in the act of fecundation.

Figure 1: Two types of crossover operation. Figures reproduced from (Morgan, 1916).

The resulting « zygote » is again diploid and contains a mix of fragments of the genetic code originally present in the cells of all four grand-parents of the nascent organism. The zygote subsequently exponentially divides into a growing number of cells which differentiate from each other according to instructions contained in the genetic code, triggered by biochemical signals coming from the cell's both internal and external environment. If the genetic code endows the organism with such properties as allow it to survive in its environment until its own reproduction, approximately half of the genetic information contained in its DNA shall be transferred to the offspring organism. If not, the information as such shall disappear from the population due to its incompatibility with the environment.

1.2. Evolutionary Psychology

It was already Darwin who posited that ET shall have a profound impact upon psychology: « In the distant future I see open fields for far more important researches. Psychology will be based on a new foundation, that of the necessary acquirement of each mental power and capacity by gradation. » (Darwin 1859) While two possible interpretations of this Darwin's idea exist, Evolutionary Psychology (Ev.Psych.) focuses only on the first one. It aims to explain diverse faculties of the human soul & mind in terms of the selective pressures which moulded the modular architecture of the human brain during millions of years of its phylogenetic history. Its central premises state: « The brain's adaptive mechanisms were shaped by natural and sexual selection. Different neural mechanisms are specialized for solving problems in humanity's evolutionary past » (Cosmides and Tooby 1997). In more concrete terms, Evolutionary Psychology quite successfully explains phaenomena as diverse as the emergence of cooperation and altruistic behaviour (Hamilton 1963); male promiscuity and parental investment (Trivers 1972); or even the obesity of the current anglo-saxon population (Barrett 2007). All this and much more is explained as a result of the adaptation of homo sapiens sapiens (and all its biological ancestors) to the dynamism of its ever-changing ecological and social niche. Thus, in the long run, Ev.Psych. tends to explain and integrate all innate faculties of the human mind in the evolutionary framework. The problem with Ev.Psych., however, is that in its grandiose aim to « assemble out of the disjointed, fragmentary, and mutually contradictory human disciplines a single, logically integrated research framework for the psychological, social, and behavioral sciences » (Cosmides and Tooby 1997), it can sometimes happen that Ev.Psych. posits as innate, and thus explainable in terms of biological natural selection, cognitive faculties which are not innate but acquired.
Thus it may be more often than rarely the case that whenever it comes to the famous nature vs. nurture (Galton 1875) controversy, evolutionary psychologists tend to defend the nativist cause even there where doing so means committing an epistemological fallacy.1 And what makes things even worse for the discipline of Evolutionary Psychology as it is currently performed is that the aforementioned Darwin's prediction has, aside from the nativist & biological one, also another interpretation. Id est, when Darwin spoke about mental powers and capacities acquired by gradation, one cannot exclude that he was speaking not only about gradation in phylogeny, but also in ontogeny.

1.3. Memetics

The theory of memes, or memetics, is, in a certain sense, a counter-reaction to Evolutionary Psychology's aims to explain human mental and cognitive faculties in terms of innate propensities. Similarly to Ev.Psych., memetics is also issued from the discipline of sociobiology, which was supposed to be « the extension of population biology and evolutionary theory to social organization » (Wilson 1978). But differently from both Ev.Psych. and sociobiology, memetics does not aim to explain diverse (cultur|psychologic|soci)al phenomena solely in terms of evolution operating upon biochemical genes, but also in terms of evolution being realised on the plane of more abstract information-carrying replicators called « memes » (Dawkins 2006). The basic definition of the classical memetic theory is: « a meme is a replicator which replicates from brain to brain by means of imitation » (Blackmore 2000). These replicators are somehow represented in the host brain as some kind of « cognitive structure » and, if ever externalised by the host organism – no matter whether in the form of a word, song, behavioral schema or an artefact – they can get copied into another host organism endowed with the device to integrate such structures.2 Similarly to genes, which often network themselves into mutually supporting auto-catalytic networks (Kauffman 1996), memes can also form more complex memetic complexes, « memplexes », in order to augment the probability of their survival in time. Memes can thus do informational crossovers with one another (syncretic religions, new recipes from old ingredients or DJ mixes can be nice examples of such memetic crossover) or they can simply mutate, either because of the noise present during the imitation (replication) process, or due to other entropy-related decay-like factors related to the ways in which active memes are ultimately stored in brains or other information-processing devices. Memetic theory postulates that the cumulative evolutionary process applied upon such information-carrying structures shall ultimately lead to the emergence of such complex phaenomena as culture, religion or language.

1 If ever we accept the notion of falsifiability as an important criterion of acceptance or rejection of a scientific hypothesis (Popper et al. 1972), many hypotheses issued from Ev.Psych. would have to be rejected because, being based in the distant past which is almost impossible to access, they are less falsifiable than hypotheses explaining the same phaenomena in terms of empiric data observable in the present.
2 In neurobiological terms, the faculty to imitate and hence to integrate memes from the external environment is often associated with so-called « mirror neurons » (Rizzolatti and Craighero 2004).

1.4. Evolutionary Epistemology

Epistemology is a philosophical discipline concerned with the source, nature, scope,
existence and diversity of forms of knowledge. Evolutionary epistemology (EE) is a paradigm which aims to explain these by applying the evolutionary framework. But under the one EE label, at least two distinct topics are, in fact, addressed: 1) EE1, which aims to explain the biological evolution of cognitive and mental faculties in humans and animals; 2) EE2, which postulates that knowledge itself evolves by selection and variation. EE1 can thus be considered a sub-discipline of Ev.Psych. and, as such, is subject to the Ev.Psych.-directed criticism presented on the previous page. EE2, however, is closer to memetics, since it postulates the existence of a second replicator, i.e. of an information-carrying structure which is not materially encoded by a DNA molecule. The distinction between EE1 and EE2 can also be characterised in terms of « phylogeny » and « ontogeny ». Given the definition of phylogeny as the « processus which shapes the form of species » and contrasting it with ontogeny defined as the « processus shaping the form of the individual », we find it important to reiterate that while EE1 is more concerned with knowledge as a result of the phylogenetic moulding of DNA, EE2 points more in the direction of « ontogeny ». In fact, EE2 paves the way for at least two further sub-interpretations: EE2-1, knowledge can emerge by variation & selection of ideas shared by a group of mutually interacting individuals (Popper 1972); EE2-2, knowledge can emerge by variation & selection of cognitive structures within one individuum. It is worth noting that while the so-called recapitulation theory stating that « ontogeny recapitulates phylogeny » (Haeckel 1879) is considered to be discredited by many biologists and embryologists, it is still held as valid by many researchers in human and cognitive sciences observing a « strong parallelism between cognitive development of a child and … stages suggested in the archeological record » (Foster 2002), 100 years after one of Darwin's companions noted: « Education is a repetition of civilization in little » (Spencer 1894).

1.5. Individual Creativity

In fact, evolutionary epistemology was born with the tentative of D.T. Campbell to explain both creative thinking and scientific discovery in terms of « blind variation and selective retention » of thoughts (Campbell 1960). Departing from the introspective works of the mathematician Henri Poincaré, who stated: « To create consists precisely in not making useless combinations and in making those which are useful and which are only a small minority. Invention is discernment, choice... Among chosen combinations the most fertile will often be those formed of elements drawn from domains which are far apart... What is the cause that, among the thousand products of our unconscious activity, some are called to pass the threshold, while others remain below? » (Poincaré 1908), Campbell suggests that what we call creative thought can be described as a Darwinian process whereby previously acquired knowledge blindly varies in the unconscious mind of the creative thinker and only some such structures are subsequently selectively retained. As (Simonton 1999) puts it: « How do human beings create variations? One perfectly good Darwinian explanation would be that the variations themselves arise from a cognitive variation-selection process that occurs within the individual brain. »

1.6. Genetic Epistemology

« The fundamental hypothesis of genetic epistemology is that there is a parallelism between the progress made in ...
organization of knowledge and the corresponding formative psychological processes. Well, now, if that is our hypothesis, what will be our field of study? Of course the most fruitful, most obvious field of study would be reconstituting human history: the history of human thinking in prehistoric man. Unfortunately, we are not very well informed about the psychology of Neanderthal man or about the psychology of Homo siniensis of Teilhard de Chardin. Since this field of biogenesis is not available to us, we shall do as biologists do and turn to ontogenesis. Nothing could be more accessible to study than the ontogenesis of these notions. There are children all around us. » (Piaget 1974) Strictly speaking, Piaget's developmental theory of knowledge, which he himself called Genetic Epistemology (GE), may seem to be utterly non-Darwinian. In fact, it is not even concerned with biochemical genes: Piaget uses the term « genetic » to refer to a more general notion of « heredity » defined as a structure's tendency to guard its identity through time. The basic structural primitives of the Piagetian theory are behavioral « schemas » which can be defined as « a basic set of experiences and knowledge that has been gained through personal experiences that define how things should be and act in the person's environment. As the child interacts with their world and acquires more experiences these schemes are modified to make sense, or used to make sense of the new experience » (Bee and Boyd 2003). There are two ways in which such schemas can be modified. Either they « assimilate » data from the external environment. Or, if ever such assimilation is not possible because the child's cognitive system simply cannot match the perceived external datum with an internal pre-existing category, the process of « accommodation » takes place, which transforms the internal category to match the external datum. Ultimately, the set of schemas gets so out-dated or so altered by past modifications that it is not useful anymore. Whenever such an « equilibration » occurs, the old set of schemas is rejected and the child tends to « start fresh with a more up-to-date model » (Bee and Boyd 2003), thus attaining a new substage or stage of its development. In the Piagetian system – which is based on very precise yet exhaustive observations of dozens of children, including his own – the order of stages is fixed, and it is very difficult, or even fully impossible, for the evolving psyche to attain the pre-operational stage 2 or the concrete operational stage 3 if it has not mastered all that is to master during the sensorimotor stage 1. Given the fact that the GE paradigm involves:
• heredity – schemas are structures which tend to keep their identity in time
• variation – schemas are altered by the environment-driven assimilation or accommodation 3
• selective pressures – only those schemas which are best adapted to the environment and/or form the most functionally fit complexes with other schemas shall pass through the equilibration milestone
it can be briefly stated that Piaget's GE could be aligned with ET and UD. And what is more, it may be the case that the notion of Piagetian stages is consistent with the notion of attractors or locally optimal states whose emergence is, according to complex systems theory (Kauffman 1996; Flake 1999), inevitable in a system as complex as a child's psyche definitely is.

1.7. Evolutionary computation

We have already mentioned (c.f. 1.1.) that evolution, as defined within UD, can be thought of as a universal, generic algorithm.
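Stated as a sketch in Python, this « universal algorithm » of variation, selection and retention could look roughly as follows; fitness, variation and the retained fraction are deliberately left abstract, since it is precisely the concrete choices of these components which distinguish the methods presented in the following sub-sections. All names and parameters are illustrative.

import random

def evolve(population, fitness, vary, generations=100, retain=0.5):
    # Generic variation-selection-retention loop of Universal Darwinism.
    for _ in range(generations):
        # Selection: keep the most fit fraction of the population.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: int(len(ranked) * retain)]
        # Retention + variation: survivors replicate with modification.
        population = survivors + [vary(random.choice(survivors))
                                  for _ in range(len(population) - len(survivors))]
    return max(population, key=fitness)

# Toy usage: evolve bit-strings towards all-ones.
best = evolve([[random.random() < 0.5 for _ in range(20)] for _ in range(30)],
              fitness=sum,
              vary=lambda g: [not b if random.random() < 0.05 else b for b in g])
print(sum(best), "ones out of 20")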
Not only can « evolutionary theory » serve us to explain diverse phenomena around us, it can also be exploited for finding solutions to diverse problems. Thus it is of no surprise that many researchers in informatics realized not only that the evolutionary process can be encoded as an informatic algorithm, but also that such algorithms could be useful as heuristics which could potentially lead to the discovery of useful (quasi-)optimal solutions to a wide range of diverse problems. First explorations in the domain were done by Rechenberg's « evolutionary strategies » (Rechenberg 1973) and Holland's « genetic algorithms » (Holland 1975) which, along with « evolutionary programming » (Fogel et al. 1966) and « genetic programming », form the « evolutionary computation » subdiscipline of computer science. All four approaches differ from classical optimization methods in the following aspects:
1. using a population of potential solutions in their search
2. using explicit « fitness » instead of function derivatives
3. « using probabilistic, rather than deterministic, transition rules » (Kennedy et al. 2001)

Figure 2: Basic genetic algorithm schema. Reproduced from (Pohlheim 1996).

3 Note that in terms of the theory of evolutionary computation, one can relate the Piagetian notion of assimilation to an operator of local variation which attracts the cognitive system to a locally optimal agreement with its environment, while accommodation suggests an interpretation in terms of more global variation operators (like cross-over), which could potentially allow the cognitive system to reach a state of global equilibrium in regards to its environment.

1.7.1. Genetic algorithms & fitness landscapes

The basic principle of « genetic algorithms » is illustrated on Figure 2. The core component of every genetic algorithm is the objective « fitness function » able to attribute a cardinal value or ordinal rank to any individuum in the population of potential solutions. In other terms, the fitness function yields the criterion according to which one candidate individuum is evaluated as a « more fit » solution, in regards to the problem under study, than other potential solutions present in the population. The population is the set of individual solutions. Every individual solution is encoded as a vector of values (also called « chromosome » or « genome ») which can vary in time. The designer's choice related to the way the problem solutions are encoded in chromosomal vectors, e.g. the type (Boolean? Integer? Float? Set?) of the different elements of the vector, is also a crucial one and can often determine whether the algorithm shall succeed or fail. In every generation – i.e. in every iteration of the algorithmic cycle represented by the circle on Figure 2 – all N individuals in the population are evaluated by the fitness function. Every individual thus obtains its « fitness » value, which subsequently governs the « selection » procedure choosing a subset of individuals from the current generation as those whose genetic information shall reproduce into next generations. In our Thesis we plan to exploit especially the « fitness proportionate selection » as the selection operator. This operator, also called the « roulette wheel » operator, transforms the fitness f_i of individual i into the probability p_i of its survival by means of the formula:

p_i = f_i / Σ_{j=1}^{N} f_j

where N is the number of individuals in the population.
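A minimal sketch of this fitness-proportionate operator, directly implementing the formula above (it presupposes non-negative fitness values; the population and fitnesses below are toy data):

import random

def roulette_select(population, fitnesses, k):
    # Sample k individuals with probability p_i = f_i / sum_j f_j.
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]
    return random.choices(population, weights=probs, k=k)

# Individual 'c' (fitness 5) is five times more likely to be drawn than 'a'.
print(roulette_select(['a', 'b', 'c'], [1, 4, 5], k=2))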
Once the « most fit » candidates are selected by the selection operator, they are subsequently mutually recombined by means of « crossover » operators and/or modified by means of « mutation » operators. Many different types of selection, mutation and crossover operators exist; for their overview c.f. (Sekaj 2005). For the purpose of this work, let's just note that the probabilities of occurrence of mutation or crossover have to be fairly low, otherwise no fitness-increasing information could be transferred among generations and the whole system will tend to present non-converging chaotic behaviour (Nowak et al. 1999). Another useful strategy, which guarantees that the maximal fitness shall either increase or at least stay constant, is called elitism. In order to implement this strategy, one simply guards one (or more) individual(s) with the highest fitness unchanged for the next generation, thus protecting « the best ones » from variations which would, most probably, decrease rather than increase the fitness.4 Yet another widely used approach reinforces the selection pressure by removal of the weakest individuals. Both the elitist « survival of the fittest » and the contrary « removal of the weakest » are often combined. The selection of the most fit individuals from the old generation, their subsequent replication and/or recombination and diversification, yields a new generation. Because individuals with lower fitness have been either completely or at least partially discarded by the selection process, one can expect that the overall fitness of the new generation shall be higher than the fitness of the old generation. With a little bit of luck, one can also hope that the most fit individuals of the new generation shall be a little more fit than the most fit individuals discovered in the previous generation – this can happen if ever a « benign » mutation has occurred, i.e. a modification which has moved the individual from a lower point on the « fitness landscape » to a somewhat higher state. The notion of fitness landscape, first introduced in (Wright 1932), is a metaphor useful for the understanding & explanation of diverse evolutionary phenomena. The landscape is depicted as a mountain range with peaks of varying height. The height at any point on the landscape corresponds to its fitness value; i.e. the higher the point, the greater the fitness of an individual represented by the given point of the landscape. In such a representation, the evolution of the organism to more and more « fit » forms can be depicted as a movement up-hill, towards the closest peak (i.e. a local optimum) or towards the highest peak of the whole landscape (i.e. the global optimum). Figure 3 illustrates a fitness landscape of a very simple organism with only one gene (whose potential values are encoded by the illustration's X axis).

Figure 3: Possible fitness landscape for a problem with only one variable. Horizontal axis represents the gene's value, vertical axis represents fitness.

Every arrow on the figure represents one possible individual. Its length represents the variation which can be brought in by the mutation operator. The fact that individuals always tend to move « upwards » indicates that selection pressures are involved.

4 Note that in nature, elitism is often, but not always, the case. For it can happen that, due to stochastic factors, the most fit individuals die before they succeed to reproduce themselves.
It has to be added that without the implementation of the crossover operator, the globally optimal state (encoded by point C) could not be attained by individuals who haven't originated at the slopes of C. Only some sort of crossover operator could ensure that individuals who attained the local optima (encoded by peaks A, B, D) could be mutually recombined (for example B with D) in a way that shall allow them to leave the locally stable states and approach the globally optimal C. The fact that genetic algorithms, thanks to « crossover » operators, can combine two individuals from diverse sectors of the fitness landscape allows them to find solutions to problems where heuristics based on « gradient descent » would fail.

1.7.2. Evolutionary programming & evolutionary strategies

Evolutionary programming (EP) and evolutionary strategies (ES) are methods whose overall essence is very similar to GAs. There are, however, some subtle differences among the approaches. In EP, mutation is the principal and often the only variation operator. While recombination is rarely used, « operators are freely adapted to fit the problem at hand » (Kennedy et al. 2001). EP algorithms often double the size of the population by mixing children with parents and then halving the population by selection. The tournament selection operator is often used. Another difference is that while GAs were developed in order to optimize the numeric parameters of the mathematical function under study – and variation thus directly modifies the genotype – in EP one mutates the genotype but evaluates the fitness according to the phenotype. EP is thus often used for the construction & optimization of structures like « finite state automata » (Fogel et al. 1966). A self-adaptation approach (Bentley 1999) allowing for the mutation of the parameters of the evolution itself – e.g. the mutation rate – is also frequently used. Such an approach of « evolving the evolution » is also used in ES, which were developed – in parallel with, but independently of, Holland's GAs – by (Rechenberg 1973). The biggest difference between EP and ES is thus the fact that ES often recombines its individuals before mutating them. A popular and well-performing strategy thus seems to be:
1. Initialize the population.
2. Perform recombination using P parents to form C children.5
3. Perform mutation on all children.
4. Evaluate the children population and select P members from it.
5. If the termination criterion is not met, go to step 2; terminate otherwise.
Given the fact that in our Thesis we shall often 1) encode the problem of linguistic category induction by non-numeric chromosomes and 2) evaluate the fitness of individuals by means of additional « phenotypic algorithms », we consider the works of Fogel & Rechenberg to be of particular importance for our study.

5 A frequently used C/P ratio is 7.

1.7.3. Genetic programming

Contrary to GAs, E.Prog and E.Strat, which operate upon chromosomes (vectors) of fixed length of numeric/boolean/character values, individuals evolved by means of Genetic Programming (GP) encode programs of arbitrary length and complexity. In other terms, one may state that while the above-mentioned EC methods look for the most optimal solution of a given problem, GP tends to produce a hierarchical tree structure encoding a sequence of instructions (i.e. a program) able to yield optimal solutions to a whole range of problems. Simply said: GP is a way in which computer programs can automatically « discover » new and useful programs.
1.7.3. Genetic programming

Contrary to GAs, E.Prog and E.Strat, which operate upon chromosomes (vectors) of fixed length composed of numeric, boolean or character values, the individuals evolved by means of Genetic Programming (GP) encode programs of arbitrary length and complexity. In other terms, one may state that while the above-mentioned EC methods look for the most optimal solution of a given problem, GP tends to produce a hierarchical tree structure encoding a sequence of instructions (i.e. a program) able to yield optimal solutions to a whole range of problems. Simply said: GP is a way by which computer programs can automatically « discover » new and useful programs.

The most important thing to do in order to prepare a GP framework is to specify how the resulting individuals (programs) shall be encoded. The original choice of the founder of the discipline, John Koza, was to encode all individuals as trees of LISP S-expressions composed of sub-trees which are themselves also LISP S-expressions. Within such arborescent S-expressions, the terminal nodes (i.e. the leaf nodes where the branches end) represent the program's variables and constants, while the non-terminal nodes (i.e. internal tree points) represent diverse functions contained in the function set (e.g. arithmetic functions like +, -, *, /; mathematical functions like log or cos; boolean functions like AND, OR, NOT; conditional operators like if/else, etc.).

Figure 4 illustrates how, during the initial run of the algorithm, an individual – calculating, for example, the square root of x+5 – could be randomly generated by the following procedure:

1) The « root » of the program tree is randomly chosen from the function set: it is the function sqrt.
2) The function sqrt has only one argument (arity(sqrt)=1); it therefore takes its sole input from the randomly determined functor + (addition).
3) The functor + takes two inputs (arity(+)=2); the tree therefore bifurcates into two branches at this node. As its first argument, the constant 5 is randomly chosen; as its second, the variable x.

Figure 4: Sequence of steps constructing the program sqrt(x+5).

Note that in step 3, both arguments were chosen from the terminal set. Had they been chosen from the function set, the tree would have bifurcated further. In order to prevent such growth of trees ad infinitum, a limiting « maximal tree depth » parameter is more often than not implemented in GP scenarios.

Once such a program has been generated, one can evaluate its fitness by confronting it with diverse input arguments and comparing its output with a golden standard. Such random-program generation & evaluation is repeated for all N initial candidate programs; subsequently, the most fit individuals are selected and varied. While GP's selection techniques can sometimes closely resemble the selection techniques used in GAs, the variation operators are often of an essentially different nature. This is so because in GP it is not individual genomes or their linear sequences that are mutated or crossed over, but rather complex and hierarchical networks of expressions. In the case of crossover, for example, one exchanges a whole sub-tree encoded within one individual for a sub-tree encoded within another one.

GP-based solutions cannot be expected to function correctly if they do not satisfy the theoretical properties of closure and sufficiency. In order to fulfill the closure condition, each function from the non-terminal set must be able to successfully operate both on the output of any function in the non-terminal set and on any value obtainable by a member of the terminal set. The behaviour of even some simple operators thus has to be adjusted a priori (e.g. return 1 in the case of division by zero) in order to assure the correct functioning of the resulting program. The sufficiency property, on the other hand, demands that the set of functors and terminals be sufficiently exhaustive; otherwise the solution cannot be found. One cannot, for example, hope to discover an equation generating the Mandelbrot set if the initial set of terminals does not contain the notion of an imaginary number and the function set contains no other explicit or implicit reference to the notion of the complex plane. Thus, while the closure constraint delimits the upper bound beyond which the discovery of the solution is not feasible, the sufficiency constraint delimits the lower bound of the minimal set of « initial components » which have to be defined a priori so that the discovery of the adequate program is at least theoretically possible.

Other theoretical notions, as well as diverse subtleties of practical implementation (special operators, methods of distributing the initial population in the search space, fitness function proposals, domains of application, etc.), are to be found in what is possibly the most important GP monograph (Koza 1992).
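To make these notions concrete, here is a minimal sketch – not Koza's implementation – of random tree generation and evaluation, with a protected division illustrating the closure constraint; the function set, the depth limit and the golden standard sqrt(x+5) are illustrative assumptions:

import math, random

def pdiv(a, b):                        # protected division: closure for division by zero
    return a / b if b != 0 else 1.0

FUNCTIONS = {'+': (2, lambda a, b: a + b), '-': (2, lambda a, b: a - b),
             '*': (2, lambda a, b: a * b), '/': (2, pdiv),
             'sqrt': (1, lambda a: math.sqrt(abs(a)))}   # abs() keeps sqrt closed over negatives
TERMINALS = ['x', 5.0]
MAX_DEPTH = 4                          # the limiting « maximal tree depth » parameter

def grow(depth=0):                     # random program tree encoded as nested tuples
    if depth >= MAX_DEPTH or (depth > 0 and random.random() < 0.3):
        return random.choice(TERMINALS)
    name = random.choice(list(FUNCTIONS))
    arity, _ = FUNCTIONS[name]
    return (name,) + tuple(grow(depth + 1) for _ in range(arity))

def evaluate(node, x):
    if node == 'x':
        return x
    if isinstance(node, float):
        return node
    name, *args = node
    return FUNCTIONS[name][1](*(evaluate(a, x) for a in args))

def fitness(tree):                     # error against the golden standard sqrt(x + 5)
    return -sum(abs(evaluate(tree, x) - math.sqrt(x + 5)) for x in range(20))

best = max((grow() for _ in range(500)), key=fitness)
print(best, fitness(best))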
1.7.4. Grammatical evolution

Grammatical Evolution (Gr.Ev) is a variant of GP in the sense that it also uses evolutionary computing in order to automatically generate computer programs. The most important difference between Gr.Ev and GP is that while GP operates directly upon phenotypic trees representing the program's code itself (for example, in the form of LISP expressions), Gr.Ev uses the evolutionary machinery for the purpose of generating grammars, which subsequently generate the program code.

In Formal Language Theory, a grammar is represented by the tuple {N, T, P, S}, where N denotes the set of non-terminals, T the set of terminals, S is the starting symbol which is a member of N, and P denotes the set of production rules that substitute elements of N by elements of N, T or their combinations.6

6 This is the case for so-called context-free and context-sensitive grammars.

Consider a grammar exhaustive enough to encode programs able to perform an arbitrary number of additions or subtractions of two variables:

N = {expr, op, var}
T = {+, -, x, y}
S = <expr>
P = {
  <expr> → <expr> <op> <expr> | <var>
  <op> → + | -
  <var> → x | y
}

Such a grammar contains three non-terminals: the non-terminal <op>, which can be substituted by either the terminal + or the terminal -; the non-terminal <var>, which can be substituted by either the terminal x or the terminal y; and the non-terminal <expr>, which can be substituted either by the non-terminal <var> or by the sequence of non-terminals <expr> <op> <expr>. The fact that in this last production the non-terminal <expr> is present on both the left and the right side of the substitution rule gives this grammar the possibility to recursively generate an infinite number of expressions like:

x+x, x+y, y+x, y+y, x-x, x-y, y-y, y-x, x+x+x, x+x-x, x+x+y, x+x-y, x-x+y-y, x-x-y+y+x, y+y+x+x+y-x, etc.

Thus, even a very simple grammar with only four terminal symbols and three non-terminal symbols, to each of which only two production rules are associated, can theoretically produce an infinite number of distinct individual programs able to perform basic arithmetic operations with two variables. The generation of a given resulting expression is determined by the order of application of specific production rules, starting with the symbol S. Such a sequence of applications of production rules is called a derivation.
For example, in order to derive the individual « x+x », one has to apply the production rules in the following order:

S = <expr>
<expr> ::= <expr> <op> <expr>
<expr> ::= <var>
<var> ::= x
<op> ::= +
<expr> ::= <var>
<var> ::= x

while the individual « y-x » would be generated if the starting symbol S were expanded by the following sequence of production rules:

S = <expr>
<expr> ::= <expr> <op> <expr>
<expr> ::= <var>
<var> ::= y
<op> ::= -
<expr> ::= <var>
<var> ::= x

In Grammatical Evolution, it is this « order of application of production rules » which is encoded in the individual chromosome. In other terms, individual chromosomes encode when and where distinct production rules shall be applied. Figure 5 more closely illustrates, and puts into analogy with biological systems, the sequence of transformations which every binary chromosome undergoes during the process of unfolding into a fully functional program.

Figure 5: Sequence of transformations from genotype to phenotype in both Gr.Ev and biological systems. Figure reproduced from (O'Neill & Ryan 2003).

It can easily be inferred from the above-displayed schema that the approach of Gr.Ev is quite intricate and involves multiple steps of information processing. The whole process starts with a binary chromosome which is subsequently split into 8-bit codons, each yielding an integer specifying which production rule to use at a given moment of the program's generation. On many different layers, the « generation » process as implemented in Gr.Ev introduces and implements very original ideas, such as:

1. « Degenerate genetic code » – similarly to « nature's choice » to encode one amino acid by means of many different triplets, the application of a unique production rule can be encoded by more than one codon.
2. « Wrapping » – under certain conditions, the whole genome can be « traversed » more than once during the process of phenotypic expression. A specific codon can thus be used more than once during the compilation of a single individual.

The rationale for the usage of such « biologically inspired tricks » is more closely presented in the work of the founders of the Grammatical Evolution field (O'Neill & Ryan 2003). They claim that the focus on the genotype-phenotype distinction, especially in combination with the implementation of the « degenerate code » and « wrapping » notions, can result in a compression of representation (and a subsequent reduction of the size of the program search space) and can account for phenomena like « neutral mutation », well observed in biological systems, whereby a mutation occurs in the genotype but does not have any effect upon the resulting phenotype.

Another important advantage mentioned by O'Neill and Ryan is that the Gr.Ev approach makes it very easy to generate programs in any arbitrary language. This is due to the versatility and generality of the notion of « grammar ».
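The following minimal sketch illustrates this genotype-to-phenotype mapping on the toy grammar of this subsection: each codon, taken modulo the number of applicable productions, selects which rule to apply to the leftmost non-terminal, and the genome may « wrap ». The integer codons and the wrapping limit are illustrative assumptions, not the exact parameters of (O'Neill & Ryan 2003):

GRAMMAR = {                                    # the toy grammar of section 1.7.4
    '<expr>': [['<expr>', '<op>', '<expr>'], ['<var>']],
    '<op>':   [['+'], ['-']],
    '<var>':  [['x'], ['y']],
}

def ge_map(codons, start='<expr>', max_wraps=2):
    # Repeatedly expand the leftmost non-terminal; each codon, taken modulo
    # the number of applicable productions, selects the rule to apply.
    seq, c = [start], 0
    budget = len(codons) * (max_wraps + 1)     # « wrapping »: the genome may be traversed again
    while budget > 0 and any(s in GRAMMAR for s in seq):
        pos = next(i for i, s in enumerate(seq) if s in GRAMMAR)
        rules = GRAMMAR[seq[pos]]
        seq[pos:pos + 1] = rules[codons[c % len(codons)] % len(rules)]
        c += 1
        budget -= 1
    return ''.join(seq) if not any(s in GRAMMAR for s in seq) else None

print(ge_map([0, 1, 0, 0, 1, 0]))              # -> 'x+x'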
When compared with the traditional GP technique, Gr.Ev was outperformed in a scenario where one had to find solutions to the problem of symbolic regression. But in more complex scenarios like « symbolic integration », the « Santa Fe ant trail », or a scenario where one had to discover the most precise « caching algorithm », Gr.Ev significantly outperformed GP. The seminal work of (O'Neill & Ryan 2003) also presents some other interesting examples of practical applications of Gr.Ev, for example in the domain of financial market prediction.

We note that while in many points (« grammar », « evolution ») the work of O'Neill and Ryan significantly overlaps with ours, their aims differ considerably from those that shall be presented in our Thesis. More concretely, while Gr.Ev tends to offer a very general toolbox for generating useful computer programs, written in an arbitrary programming language and used for solving arbitrary problems, our Thesis shall deploy the evolutionary computation machinery to shed some light upon diverse facets of one sole problem: that of « Natural Language Development ». Another important difference between the approach of Gr.Ev and the one we shall present in our Thesis is that while in Gr.Ev grammars are considered to be « generative devices », i.e. tools used for the generation of programs, in our Thesis we shall use them both as « generative » and as « parsing » devices. Another, even more fundamental difference is due to the fact that while « At the heart of GE lies the fact that genes are only used to determine which rule is applied when, not what the rules are » (O'Neill and Ryan 2003), the evolutionary model of language induction proposed in our Thesis shall aim to determine not only the order of application of the rules, but also the content of the rules themselves.

1.7.5. Tierra

Another example of how one can materialise evolutionary principles within an in silico framework is offered by Tierra, an artificial life simulation environment programmed between 1990 and 2001 by Thomas S. Ray and his colleagues. Since Ray is an ecologist, his objective was not to develop an EC-like model in order to find or optimize solutions to a given problem; rather, he aimed to create a system where artificial entities could spontaneously evolve, co-evolve and potentially create whole artificial ecosystems.

An artificial entity in Tierra's framework (Ray 1992) is a program composed of a sequence of instructions chosen from an instruction set containing 32 quite traditional assembler instructions, somewhat tuned by the author so that their usage would facilitate « replication » of the code. Every artificial entity runs in its own « virtual CPU », but its code stays encoded in the « soup », i.e. a piece of RAM which is potentially read-accessible to all other entities as well. Rare « cosmic ray » mutations flip the bits of the « soup » from time to time; more variation is ensured by bit-flipping during the procedure whereby the entity replicates (i.e. copies) its code from the « mother cell » section of the soup to the « daughter cell » section. Selection is, in a certain sense, emulated by a so-called Reaper process which tends to stop the execution of programs which are either too old or contain too many flawed instructions. Other than that, there is nothing which resembles the traditional notion of an exogenously defined « fitness function »: within Tierra, the survival (or death) of diverse species of programs is a direct consequence of the species' ability (or inability) to obtain access to limited resources (CPU & memory).

Thus, after one seeds the initially empty soup with a manually constructed individual containing 80 instructions allowing it to copy its code into the daughter cell of the memory, after the memory has been filled and the battle for resources has started, and once mutation has generated a sufficient amount of variation, one can observe the emergence of dozens of new forms of replicable programs. Some of them being parasites, and some of them being able to create algorithmic counter-measures against parasites, one can literally observe the emergence of an artificial yet living ecological system.
It is therefore hardly surprising that Tierra could automatically evolve, among others, an individual containing just 22 instructions yet capable of replication – that is, a replicator almost four times shorter than the replicator manually programmed by the conceptor of the system and injected into the initial « soup ». Currently, the most famous descendant of Tierra is the AVIDA system (Ofria and Wilke 2004). Contrary to Tierra, however, every AVIDA individual is encapsulated within its own virtual CPU and memory space. Tierra's Darwinian metaphor7 of computer programs evolving by fighting for limited resources is thus not followed as strictly.

7 http://life.ou.edu/pubs/tierra/node3.html

2. Language development

Language development (LD) is a constructionist process which endows humans with the capacity to transfer information to, and obtain information from, other humans by means of verbal communication. The term « language development » shall be used in preference to « language acquisition » in order to mark the fact that the child does not merely passively « acquire » the language from environmental input, but rather gradually builds it in interaction with its environment. Sometimes the term « language learning » shall be used as well to denote the same process. In our Thesis we shall focus only on modeling the development of the « first language »; i.e. we shall aim to present a computational and evolutionary model of the process by means of which a human baby learns the language of its closest social environment.

The child's closest social environment consists of her parents, most notably her mother. Hundreds of studies have been conducted to study the nature of « motherese », a special simplified language between mothers and their children (M. Harris 2013). While many studies point in divergent directions, they more or less agree that « Maternal speech has certain characteristics that distinguish it from speech to other adults. These characteristics are in essence simplicity, brevity and redundancy. » What is more, it seems to be a well-established fact that there exists a reciprocal link between the complexity of motherese and the complexity of the child's production. In other terms, mothers adjust their language according to the stage of the child's linguistic development.

Other studies also indicate the existence of a causal link between the quantity and simplicity of motherese utterances on one hand and the child's linguistic development on the other. More concretely, studies like that of (Furrow et al. 1979) indicate that the child's confrontation with frequent and simple utterances facilitates her linguistic development, while a more complex style can slow it down. Other studies, like that of Ellis & Wells (1980), add that « children who showed the earliest and most rapid language development received significantly more acknowledgments, corrections, prohibitions and instructions from their parents ». This causal link between the mother's linguistic productions and the child's developing linguistic competence shall play an important role when we discuss the « fitness function problem ». More concretely, we shall try to integrate into our computational models the idea that the fitness function evaluating the performance of the child's internal categorization mechanism and/or candidate grammar is external to the child: the fitness function shall be given by the mother's behaviour.

2.1. Ontogeny of semantic categories (concepts)

Natural language furnishes a communication channel for the exchange of meanings.
Meaning (also called « signifié » in traditional linguistics) is intentional: it refers to some external entity (also called « referent »). Within the language L, a meaning M can be denoted by a token (also called « signifiant »), and it is by the exchange of physical manifestations of these tokens (phonic in the case of spoken language, graphemic in the case of written language, etc.) that producer (speaker|writer) and receiver (hearer|reader) communicate.

Traditionally, the meaning of a word, i.e. its « semantics », was often considered as something almost « sacred » and impossible to formalize by mathematical means. The maximum which could be done – and had been done from Aristotle until the middle of the 20th century – was to define a concept in terms of lists of « necessary and sufficient features ». Two types of features were considered to be both necessary and sufficient for the definition of the majority of concepts: the first specifying the concept's genus (or superordinate concept), and the second specifying the particular property (differentia) which distinguishes the concept from other members of the same genus. Thus, for example, « dog » could be defined as a domesticated (differentia) canine (genus). An important property of such a system of concepts was that it allowed no ambiguous or fuzzy border cases: the logical « law of excluded middle » guaranteed that any entity which was not both canine and domesticated at the same time (e.g. a chihuahua which passed all her life in wilderness) could not be called a dog.

The change of paradigm came slowly with the works of late Wittgenstein8, but especially with the empirical studies of Eleanor Rosch (Rosch 1999), who realized not only that concepts are often defined by bundles of features which are neither necessary nor sufficient, but also that the degree with which a feature can be associated with a concept often varies. Subsequently, Rosch proposed a « prototype theory » of semantic categories whose basic postulate is that some members of a category (or some instances of a concept) can be more « central » in relation to the category (resp. concept) than others.

8 « For a large class of cases of the employment of the word 'meaning' – though not for all – this way can be explained in this way: the meaning of a word is its use in the language. » (Wittgenstein 2009)

Prototype theory, as well as other theoretical and empirical advances, in combination with the development of information-processing technologies, paved the way to an operationalization of semantics which allows us to transform meanings of words into mathematically commensurable entities. In computational semantics, the meaning of a token X observable within a language corpus C is often characterized as a vector of relations which X holds with other tokens observable within the corpus. The set of such vectors associated with all tokens observable in C yields a « semantic space », which is a vector space within which one can effectuate diverse numeric and|or geometric operations. In short, concepts can be operationalized as geometric entities (Gärdenfors 2004). « In the most simple case can be the vector which denotes concept X calculated as a linear combination of vectors of concepts in context of which X occurs » (Hromada 2013a). This is an algebraic form of the famous « distributional hypothesis » stating that « a word is characterized by the company it keeps » (Z. S. Harris 1954), which can be considered the central dogma of statistical semantics.
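A minimal sketch of this operationalization – co-occurrence counting within a small window, followed by cosine comparison – may be helpful here; the toy corpus and the window size are illustrative assumptions:

from collections import Counter, defaultdict
from math import sqrt

corpus = "the dog bites the man the cat bites the man the dog chases the cat".split()
WINDOW = 2                                    # context = up to 2 neighbouring tokens

vectors = defaultdict(Counter)                # token -> sparse co-occurrence vector
for i, w in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if i != j:
            vectors[w][corpus[j]] += 1        # X's vector is built from X's contexts

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())))

# tokens that keep similar company obtain similar vectors
print(cosine(vectors['dog'], vectors['cat']), cosine(vectors['dog'], vectors['bites']))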
The distributional hypothesis is, in a certain sense, a variation on an old « associationist » explanation of the functioning of the mind, which stated that the essence of mind is somehow related to the mind's ability to create relations, i.e. associations, between successive mental states. Both the mind's faculty to create associations – considered by philosophers like Hume and Locke to be the primary faculty of mind – and the distributional hypothesis that the meaning of a symbol X can be defined in terms of the meanings of symbols with which X co-occurs can, we believe, be neurologically explained in terms of the postulate first stated by the neurologist Hebb: « The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated', so that activity in one facilitates activity in the other. » (Hebb 1964)

One can assume that 1) not only single neurons but, mutatis mutandis, also whole neural circuits are governed by Hebb's rule, and 2) distinct words Wx and Wy are somehow processed and represented by distinct neural circuits Nx and Ny. It then follows that whenever a hearer hears (or a speaker speaks) the two-word phrase WxWy, the ensemble of material (synaptic?) relations between Nx and Ny gets reinforced. In more geometrical terms, on a more « mental » level, such a « rapprochement » of Nx and Ny would be characterized by a convergence of the geometrical representations of both circuits towards their common geometrical centroid. Thus, after processing the phrase WxWy, the vectorial representations of Nx and Ny will be closer to each other than before hearing (or generating) the phrase.

In our Thesis we shall presuppose that an associationist principle similar to the one described above is indeed at work whenever a human mind constructs a concept. We use the term « concept » synonymously with the term « semantic class »: we define both concepts and semantic classes either as subspaces of the « semantic vector space », or as centroid points of such subspaces. Theoretically, there are multiple (and possibly infinitely many) ways in which a cognitive system can internally represent an external environment E (or, in the case of a computational linguistic agent, a corpus C) as a « semantic space » S of dimensionality D.

It is important to notice that the overall partitioning of the cognitive system's vector space determines how the system classifies the world. If the system's ability to correctly classify the world determines the reproductive fitness of the organism within which the cognitive system is embedded, one can state that the topology of the internally represented semantic space can quite directly influence the organism's fitness. Consider, for example, the reproductive fitness of a member of a prey species which sometimes mis-classifies a member of a predator species as a sexual mate, and compare it to the fitness of an individual of the same prey species whose semantic space is optimized so that the probability of such mis-classification is practically reduced to zero.

The question whether such « semantic space optimization » occurs during the phylogeny of the human species, or whether it occurs principally during the early years of the child's development (i.e. ontogeny), is a variant of the « nature vs. nurture » (Galton 1875) debate between « nativists », who bet on the « innateness » of certain faculties of the human psyche (c.f. the discussion of Evolutionary Psychology above), and empiricists, who believe that practically all the knowledge we dispose of and use in everyday life is acquired from the environment.
Being aware of the results of studies suggesting that children of very small age dispose of knowledge concerning basic relations among physical objects, or even of social and moral skills (Haidt 2012), we consider it unwise to label the nativist position as a priori invalid. On the other hand, being aware of the force with which processes like socialisation, acculturation and learning mould the psyche of an adult individual, we shall definitely consider as true the statement that « the topology of the semantic space represented within the cognitive system of a human individual can be optimized by supervised assimilation of knowledge encoded in the surrounding environment ». Notwithstanding the answer to the nature & nurture question in regard to the human faculty of categorization, the part of our Thesis devoted to « evolutionary models of concept construction » shall simply suggest that something like optimization of semantic spaces by means of evolutionary computing is, indeed, possible.

2.2. Ontogeny of formal categories (parts-of-speech)

Words of a language can also be partitioned into classes independently of their semantic content. For example, while there is practically no manifestly evident shared semantic feature between words like « apple » and « process », they can both be considered as belonging to the same category of « nouns ». The principal reason for this is the fact that within a sentence like, for example, « This apple makes me happy », one can freely substitute « process » for « apple » and still obtain a grammatically correct sentence.

Sometimes the formal categories and semantic categories partially overlap. Such is the case, for example, in many Indo-European languages, where one often finds « feminine » nouns marked with markers of one formal group and « masculine » nouns marked with markers of another group. An even more extreme case of such an « overlap » of semantic and formal categorization processes was observed among the Dyirbal aborigines of Australia, who use the same determiner « balan » (in a certain sense analogous to the German article « die ») in front of all nouns referring to « women, fire and dangerous things » (Lakoff 1990). In the modern linguistic tradition, however, semantic and formal categories are considered to be independent of each other.

There exist multiple dimensions along which linguistic tokens can be categorized into formal classes. Most importantly, the appartenance of a word W to a class C can principally be inferred from: 1) its position in regard to other words, and 2) its morphology (i.e. its internal composition, with all prefixes, word root, suffixes etc.).9

It is also important to realize that the same token can belong to many different categories at the same time, and that the relations between the categories themselves can be either inclusive, for « nested » categories, or « orthogonal ». Thus, for nested categories, the appartenance of, for example, the German token « die Schönheit » to the « gender » subcategory « feminine » immediately implies that it also belongs to the part-of-speech « noun ». On the other hand, the sole fact that it is « feminine » does not inform us whether it should be attributed to the « nominative » or « accusative » sub-subcategories of the grammatical subcategory « case ». Thus, the subcategories of « case » and « gender », while both being « nested » within the part-of-speech category of « nouns », are orthogonal to each other.10
9 C.f. (Hromada 2014a) for a comparative study assessing the impact of morphology and word-order features upon POS-induction in Bulgarian, Czech, Estonian, Farsi, English, Hungarian, Polish, Romanian, Russian and Slovak.
10 The theoretical importance of the existence of this distinction in regard to current formal grammar models of natural languages shall be further extended in the full version of the Thesis.

On the most abstract level, linguistic tokens can be categorized into two principal 0-level formal categories: « functional » and « lexical » items. The set of functional (grammatical) items is closed, and it contains such parts-of-speech as determiners, conjunctions, pronouns and prepositions. On the other hand, the classes of « lexical items » are open and include meaning-carrying parts-of-speech like nouns, verbs, adverbs, adjectives etc. A study by (Shi et al. 1999) offers evidence that even newborn children (1-3 days old!) react differently to lexical and functional words and are thus « able to categorically discriminate these sets of words based on a constellation of perceptual cues that distinguish them ».

Once children are able to distinguish functional words from lexical ones, the process of ontogeny of formal categories can proceed towards the development of part-of-speech categories. While it would definitely be mistaken to state that all languages of the world can be partitioned into and mapped upon the part-of-speech categories known from English or other Indo-European languages (i.e. nouns, adjectives, pronouns, verbs, adverbs, prepositions, conjunctions, interjections), linguists generally agree that some kind of « noun »-resembling and « verb »-resembling categories are to be observed in all systems of human verbal communication. It is undoubtedly the case that between birth and circa two years of age, prototypes for such part-of-speech clusters are being formed within the child's cognitive system. This has to be so because, around the age of 2, children usually start to apply specific rules to specific items (i.e. they start to conjugate verbs or decline nouns). Subsequently, the learning of much more subtle distinctions, related to grammatical categories like genus, casus and numerus for nouns, or modus, tempus etc. for verbs, can take place. For diverse case studies concerning the acquisition of formal categories, c.f. (Y. E. Levy, Schlesinger, & Braine, 1988).

The acquisition of both semantic and formal linguistic categories is facilitated by so-called « variation sets » (VS). One observes a linguistic variation set whenever an identical word or cluster of words occurs, in identical or slightly varied form, within multiple consecutive utterances. Not only are nursery rhymes and lullabies filled with such « alternations in maternal self-repetitions » (Hoff-Ginsberg 1986); VS are also highly frequent in standard « motherese ». In Turkish, for example, VS seem to make up approximately 20% of child-directed speech (Küntay and Slobin 1996), and very similar proportions are also reported for English (Brodsky et al. 2007).
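For illustration, a minimal sketch of how variation sets might be detected in a transcript of child-directed speech, assuming a simple lexical-overlap criterion between consecutive utterances; the threshold and the sample utterances are our own illustrative choices, not a method taken from the cited studies:

def variation_sets(utterances, min_overlap=0.5):
    # Group consecutive utterances that share at least min_overlap
    # of the words of the shorter utterance.
    sets, current = [], [utterances[0]]
    for prev, curr in zip(utterances, utterances[1:]):
        a, b = set(prev.split()), set(curr.split())
        if len(a & b) / min(len(a), len(b)) >= min_overlap:
            current.append(curr)
        else:
            if len(current) > 1:
                sets.append(current)
            current = [curr]
    if len(current) > 1:
        sets.append(current)
    return sets

transcript = ["put the ball in the box", "put the ball there",
              "in the box", "where is daddy"]
print(variation_sets(transcript))   # -> [['put the ball in the box', 'put the ball there']]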
Note that the notion of « variation set » can be interpreted in terms of evolutionary theory, given that:

• maternal self-repetition can be interpreted as a form of « replication in time », whereby every single utterance is considered to be an independent individual
• alteration of form between subsequent utterances can be interpreted as the result of a variation operator influencing the mother's production of new sentences

In the context of our tentative to explain language development in terms of evolutionary theory, and to suggest its validity by means of an evolutionary computation model, we find this insight worth quoting: « the image that best characterizes the young language learner is that of a multilevel analyzer who is working with several types of analysis simultaneously, with different degrees of success, as learning progresses » (Levy 1988).

It may be stated that the reason why categorization processes develop in the first place is the cognitive system's tendency to optimize its functions and structures. As Maratsos put it: « Once the speaker hears just one grammatical use of a new word which suffices to identify its membership in a category, he can refer to the whole system of rules involving this category » (Maratsos 1988). Thus, both semantic and formal categories can reduce the cost of processing and storing information by and within the cognitive system.

2.3. Ontogeny of grammars (grammar induction)

The partitioning of words into grammatical categories can be useful only if it is accompanied by the development of grammatical rules which combine members of diverse categories in order to produce meaningful sentences. We reiterate that, strictly formally, a grammar is defined as the tuple {N, T, P, S}, where N denotes the set of non-terminals, T the set of terminals, S is the starting symbol which is a member of N, and P denotes the set of production rules that substitute elements of N by elements of N, T or their combinations. Within such a formal framework, the problem of partitioning words into diverse grammatical categories can be thought of as equivalent to the problem of discovering production rules which 1) associate members of T (words) with members of N (labels of distinct categories) and 2) combine elements of N in order to produce new elements of N.

In fact, the problems of construction of formal categories and of discovery of grammatical rules are mutually intertwined; some researchers go even so far as to state: « Category symbols, whether in phrase structure rules or in the lexicon, are logically equivalent to the rules written on them, and as such are completely system-dependent: They are shorthand descriptions of the rule system as a whole. By anyone's theory, young children's linguistic system does not possess all the features of the endstate system. In other words, their language cannot be described by the same grammar as the adult system » (Ninio 1988).

In the literature, the development of language is often described as a process composed of three « stages » which can be further subdivided in the following manner: « Pregrammatical: a. Rote-learning – item-based acquisition is manifested in the use of formally unanalyzed units or chunks; b. Initial modifications – formal alternations apply to a small number of highly familiar, good exemplars; Structure-bound: c. Interim schemata – transitional or bridge strategies take the form of productive, but nonnormative rules; d.
Grammaticization – structure-bound rules are those of the endstate grammar; Discourse-oriented: e. Convention and variety – grammatical rules are deployed with appropriate, discourse-sensitive lexical restrictions, stylistic alternations, usage conventions, register distinctions etc. » (Berman 1988).

In our Thesis, we shall put aside the intricacies of the third, « Discourse-oriented » stage and shall focus on the « Pregrammatical » and « Structure-bound » stages. More concretely, we shall aim to explain the acquisition of words and word chunks in phase a. as the result of a « crossover » between structures present in the environment and structures represented within the cognitive system; the gradual emergence of categories and associated production rules observable during phases b., c. and d. shall be explained not only in terms of an informatic crossover of structures present in the environment and represented in the cognitive system, but also as the result of purely internal replication, variation and decay, proper to the cognitive system and resulting in a complexity-increasing « battle for resources » among the structures represented within it. We are convinced that by introducing such « cognitive-system-internal variation operators » like « entropy-induced decay » (associated with the phenomenon of « forgetting ») and « structural merging » (associated with the phenomenon of « dreaming ») we can, for example, offer a very simple, natural, yet effective solution to the so-called « overgeneralization » problem.11

Overgeneralizations of grammatical rules are most often observable in phases c. and d. (i.e. between 2 and 4 years of age), whenever the child applies a production rule beyond the scope of its validity. The most famous example of overregularization in English is that practically all children apply the rule VPast → VPresent + ed to all verbs. Thus, especially during MLU stages 4 and 5,12 they generate past forms like « throwed » or « braked » which are not correct. What is fascinating about the problem of overregularization is not only that all children eventually come to employ the irregular past forms, so that the errors are not reproduced anymore, but especially the fact that children often used the correct « irregular form » even before (i.e. in one-word phases a. and b.), and only later converged to the incorrect overregularization: « Initially, children's uses of -ed past tense are all accurate. They may say melted or dropped, but not, as they later do, runned and breaked » (Maratsos 1988).

We see an important analogy between observations of such a sequence of correct/incorrect/correct behaviour and the general behaviour of evolutionary systems, which also often « reject » locally optimal solutions and descend into fitness landscape valleys in order to subsequently climb towards more optimal states.

11 According to the domain (formal, semantic), the problem is also sometimes named « overextension », « overregularization » or the problem of « overinclusive grammar ».
12 MLU means « Mean Length of Utterance » and is a measure traditionally used in developmental psycholinguistics for assessing a child's linguistic performance at a given age. In the period when the child produces one-word utterances like « mama » or « tato », MLU is considered to be 1; later, when the child starts to say two-word utterances like « mama nene », MLU increases towards 2, etc.
Thus, we believe that the term « conflict » present in the following principle can also be interpreted in an evolutionary sense: « Whenever a newly acquired specific rule (i.e. a rule that mentions a specific lexical item, like throw, make, allow, report) is in conflict with a previously learned general rule (i.e. a rule that would apply to that lexical item but also to many others of the same class), the specific rule eventually takes precedence » (Braine 1971).

MacWhinney uses a similar term, « competition », to label his Competition Model of linguistic competence: « The competition model assumes that lexical elements and components to which they are connected can vary in their degree of activation. Activation is passed along connections between nodes. During processing, items are in competition with one another. In auditory processing …, in allomorphic processing …, in the processing of role relations, in polysemy …, the item that wins out in a given competition is the one with the greatest activation » (MacWhinney 1987).

If one could interpret the last phrase of the above citation as « the component which has the greatest activation has the greatest fitness and thus the highest probability of being replicated within the cognitive system », one could consider MacWhinney's connectionist model to be an evolutionary one, and thus pointing in our direction. But since that is not the case – it seems that MacWhinney's model does not, at least not explicitly, involve any processes of replication or sources of random variation, nor does it explicitly work with « populations of grammars » – we are obliged to look for another theoretical framework which could more easily integrate such notions. It may be the case that the theories of « Grammar Systems » (Csuhaj-Varjú 1994) and « Language Colonies » (Kelemen and Kelemenová 1992) could furnish such a framework for our tentative to explain the ontogeny of grammar in a human individual as an evolutionary process. Both will be introduced in Part 4 of this text.

3. Computational Models of Text Processing

The majority of the models and algorithms presented in this chapter are the results of the intellectual work of computational linguists working in the domain of « Natural Language Processing » (NLP). In NLP, one processes data encoding natural (human) languages with computational methods which often involve machine learning, data mining, information retrieval, statistical inference or artificial intelligence (AI) algorithms. Among the principal objectives of NLP one can include: 1) to allow machines to « understand » and|or work with meanings; 2) to develop an autonomous artificial agent (Hromada, 2012) able to pass the Turing Test (Turing 2008); and 3) to elucidate, by means of computational simulations, possible ways in which the human cognitive system treats natural language.

The computational aim of our Thesis overlaps especially with NLP's third objective. Such an aim brings with it many complex problems which are not easy to tackle; thus, in order to reduce their number and complexity, we shall reduce the notion of « Natural Language » to the notion of « text ». It is true that in doing so we completely ignore the phonetic, phonological and prosodic aspects of language, which have been, during practically all of human history, the principal way in which human speakers encoded their messages in order to transfer them to other human hearers.
It is only in the course of a few recent centuries that communication by means of text became prominent, and only within the last decades has it become dominant, mainly because of the increasing role of computers in our lives. This is at least partially so because computers are essentially machines built for the processing of sequences of discrete symbols, and that is what a text is – a sequence of discrete symbols. This stands in contrast to the flux of spoken language, which is also a sequence, but one composed of units whose boundaries are often unclear and whose features overlap.

3.1. Concept construction

We define the « concept construction » (CC) problem as an open-class variant of the « classification » or « categorization » problem. In a classical, « closed-class » categorization problem, the objective is to assign a label denoting membership in a category C1 to a set of objects disposing of a particular combination of properties (also called « features » in the AI community), and to assign to categories C2, C3 etc. other objects disposing of different features. The problem of « binary classification », where only two categories are involved, is well studied, and dozens of diverse algorithms exist which allow one to train, in a machine learning scenario, classification models (« classifiers ») which will subsequently quite successfully classify objects of the « testing set » which were absent from the « training set ».

In NLP, one often solves the classification problem by means of so-called « Support Vector Machines » (SVM). During the training of an SVM, the algorithm tries to discover a hyperplane « that has the largest distance to the nearest training data point of any class » (Vapnik et al. 1997). SVMs belong to the group of « linear classifiers », which all base their classification decisions on linear combinations of the characteristics (features) of the objects-to-be-classified. Other machine learning algorithms as diverse as Linear Discriminant Analysis, Naive Bayes classifiers, logistic regression or the perceptron also belong to the group of linear classifiers. « Multiple class » variants of these algorithms exist as well, allowing for the classification of objects into more than two categories. In the case of all these algorithms, however, all the classes-to-be-looked-for are known in advance: the datapoints in the training set are labeled with labels belonging to a finite set, and during the subsequent testing phase, one's objective is simply to attribute the correct label to a new object. While the object itself was most probably not present in the training set and is thus « new », the finite set of all class/category labels-to-be-attributed is well known from the very beginning of training. In this sense, all the algorithms mentioned above address the closed-class variant of the classification problem.

On the contrary, in the open-class variant of the classification problem one can potentially be asked, in the testing phase, to attribute to an object which was not present during the training phase a label which was also not present in the training phase. In other terms, in the open-class variant of the classification problem one knows in advance neither the number nor even the nature of the categories which are to be constructed.
3.1.1. Non-evolutionary model of CC

One possible way to address the problem of Concept Construction – which we consider to be an instance of the « open-class classification problem » as defined above – can be described as follows:

1. During the (train|learn)ing phase, use the training corpus to create a D-dimensional semantic vector space, i.e. attribute vectors of length D to all members of the set of entities E (word fragments, words, documents, phrases, patterns) which includes everything observable within the training corpus.
2. During the testing phase:
2.1 characterize the object (text) O by a vector $\vec{o}$ calculated as a linear combination of the vectors of the features which are observable in O and whose vectors were learned during the training phase;
2.2 characterize the labels-to-be-attributed L1, L2, … by vectors $\vec{l}_1, \vec{l}_2$, …;
2.3 associate the object O with the closest label. In case we use the cosine metric, we minimize the angle between $\vec{o}$ and the label vectors, i.e. we look for

$\arg\max_x \cos(\vec{o}, \vec{l}_x)$

Note that in order to make this approach functional, two important conditions have to be fulfilled. Primo, the vectors associated with the entities observable within the training corpus must be commensurable, i.e. they have to be of the same dimensionality and be members of the same vector space. Secundo, the set of all entities E observed during learning has to be sufficiently exhaustive, so that potentially any novel label or object which appears during the testing phase can be at least partially characterized in terms of members observable during the training phase.

The first condition of « entity commensurability » is not fulfilled by many vector space models, which often yield multiple spaces for entities of different « types ». In such models, « word » entities are often encoded as rows of a matrix, while « context » or « document » entities, i.e. entities within which the word entities occur, are encoded as columns of the same matrix, or are encoded in a completely different matrix. On the contrary, algorithms like Random Indexing (RI) or Reflective Random Indexing (RRI) construct semantic vector spaces from initial textual corpora in such a way that everything they encounter – be it syllables, words or whole documents – is ultimately represented as rows of the same matrix. RI and RRI also have other advantages, which are more closely described elsewhere (Sahlgren 2005; Cohen et al. 2010; Hromada 2013b). For present purposes, let us just underline the fact that both RI and RRI can be quite computationally efficient, since they are able to « project » semantic relations hidden in the text upon a vector space of restrained dimensionality. Theoretically, this is permitted by the so-called Johnson-Lindenstrauss lemma, stating that « a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved » (Johnson and Lindenstrauss 1984).
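A minimal sketch of steps 2.1-2.3 may clarify the procedure; the tiny hand-made « semantic space » stands in for vectors learned during a real training phase, and the words and label names are purely illustrative:

import numpy as np

# toy « semantic space » learned during training: every entity is a row vector
space = {'cat': np.array([1.0, 0.1, 0.0]), 'dog': np.array([0.9, 0.2, 0.1]),
         'bond': np.array([0.0, 0.1, 1.0]), 'stock': np.array([0.1, 0.0, 0.9]),
         'animals': np.array([1.0, 0.2, 0.1]), 'finance': np.array([0.1, 0.1, 1.0])}

def vec(text):                         # step 2.1: linear combination of known feature vectors
    return np.sum([space[w] for w in text.split() if w in space], axis=0)

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def attribute(text, labels):           # step 2.3: arg max of cosine over label vectors
    o = vec(text)
    return max(labels, key=lambda l: cos(o, space[l]))

print(attribute('the dog chased the cat', ['animals', 'finance']))  # -> 'animals'

Because labels are represented in the same space as all other entities, a label never seen as a label during training can still be attributed, which is precisely what the open-class setting requires.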
In 2012, a hybrid system with an RRI semantic component at its very core was deployed in the francophone datamining competition DEFT2012 (El Ghali et al. 2012). The goal of the competition was to create an automatic NLP system which would be able to attribute to scientific articles the same keywords as were attributed by their authors. In other terms, the goal was to artificially simulate the cognitive activity of « attributing a conceptual label » to a scientific article. The tricky thing about the problem was that it was not a standard « closed-class » classification problem, but indeed an « open-class » problem, since there were many keyword labels which were not present in the training set, yet were to be attributed in the testing scenario.

Figure 6: Description of the DEFT2012 system for automatic attribution of keywords to scientific articles. Figure reproduced from the conference poster.

Figure 6 illustrates the relations among the diverse components of this hybrid system. As may easily be seen, a whole « artillery » of diverse NLP tools like POS-taggers, lemmatizers and chunkers was deployed in order to yield a sufficiently exhaustive set of features, from which two distinct semantic spaces were composed by means of RRI. The resulting semantic spaces were subsequently post-optimized by a combination of probabilistic Bayesian Networks and production rules. In the first, simpler task of the DEFT2012 competition, the system attained an F-score of 94.8%. The task was simpler because a list of candidate labels was furnished with the training corpus, and subsequently another list of candidate keywords was furnished with the testing corpus. The system attained an F-score of 58.7% in the second, more difficult task, where no such lists were given. In both tasks it outperformed the systems deployed by the other 9 participants of the competition.

3.1.2. An evolutionary model of CC

Task 4 of the 2014 edition of the datamining competition Défi Fouille de Textes (DEFT) was understood as an instance of a classification problem with an open number of classes. More concretely, the challenge was to create an artificial system which would be able to attribute a specific member of the set of all class labels to the scientific articles of the testing corpus. A training corpus of 208 scientific articles, presented in diverse sessions of diverse editions of the annual TALN/RECITAL conference, was furnished to facilitate the training of the model.

To solve this problem, we proposed an algorithm consisting of two nested components, as represented in Figure 7. The inner component, which we call Reflective Space Indexing (RSI), is responsible for the construction of the vector space. Its input is a genotype – a list of D features which trigger the whole reflective process; its output – a phenotype – is a D-dimensional vector space consisting of vectors for all features, objects (documents) and classes. The inner component is « reflective » in the sense that it multi-iteratively characterizes not only objects in terms of their associated features, but also features in terms of their associated objects. RSI's principal parameter is the number of dimensions of the resulting space (D).

The input of RSI is a vector of length D whose D elements denote the D « triggering features », the initial conditions to which the algorithm is sensitive in the initial iteration. After the algorithm has received such an input, it subsequently characterizes every object (document) O by a vector of values which represent the frequencies of the triggering features in object O. Initially, every document is thus characterized as a sort of bag-of-triggering-features vector. Subsequently, the vectors of all features – i.e. not only the triggering ones – are calculated as the sum of the vectors of the documents within which they occur, and a new iteration can start. In it, the initial document vectors are discarded, and new document vectors are obtained as the sum of the vectors of the features which are observable in the document. The whole process can be iterated multiple times until the system converges to a stationary state, but it is often the second and third iteration which yield the most interesting results. Note also that what applies to features and objects applies, mutatis mutandis, also to class labels.

Figure 7: Diagram of the DEFT2014 model, embedding the construction of semantic spaces within an evolutionary framework.
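A minimal sketch of the reflective loop just described, under toy assumptions (three short documents, two triggering features, whitespace tokenization); it is meant to illustrate the principle, not to reproduce the DEFT2014 implementation:

import numpy as np

def rsi(documents, triggers, iterations=2):
    # Reflective Space Indexing sketch: documents are first characterized by
    # frequencies of the triggering features; then features and documents
    # reflectively re-describe each other for a few iterations.
    D = len(triggers)
    doc_vecs = [np.array([doc.split().count(t) for t in triggers], float)
                for doc in documents]                  # bag-of-triggering-features
    for _ in range(iterations):
        feat_vecs = {}                                 # feature = sum of its documents' vectors
        for doc, dv in zip(documents, doc_vecs):
            for w in set(doc.split()):
                feat_vecs[w] = feat_vecs.get(w, np.zeros(D)) + dv
        doc_vecs = [np.sum([feat_vecs[w] for w in doc.split()], axis=0)
                    for doc in documents]              # documents rebuilt from features
    return doc_vecs, feat_vecs

docs = ["genetic algorithms evolve populations",
        "grammars generate sentences",
        "evolution selects fit populations"]
doc_vecs, feat_vecs = rsi(docs, triggers=["populations", "grammars"])
print(doc_vecs[0], feat_vecs["evolve"])

Note that after the first iteration, even features which never belonged to the triggering list (here, e.g., « evolve ») obtain vectors in the same D-dimensional space, which is what makes the process « reflective ».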
For the purposes of DEFT2014, every individual RSI run consisted of 2 iterations and yielded a 200-dimensional space. The enveloping outer component is a trivial evolutionary algorithm whose task is to find the most « fit » combination of features for performing the classification task. In every « generation », the evolutionary component injects multiple individual lists of triggering features (i.e. « genomes ») into the inner component and subsequently evaluates the fitness of the resulting vector spaces. It then selects, mutates and crosses over the genotypes which yielded the vector spaces wherein the classification was most precise.

The evolutionary component of the system was conceived as a sort of feature selection mechanism. The objective of the optimization was to find such a genotype – i.e. such a list of « triggering features » – as would subsequently lead to the discovery of a vector space whose topology would make it as classification-friendly as possible. As is common in the evolutionary computing domain, the whole process was started by the creation of a random population of individuals. Each individual was fully described by a genome composed of 200 genes. Initially, every gene was assigned a value randomly chosen from the pool of 5849 feature types observable in the training corpus. In DEFT2014's Task 4 there were thus 5849^200 possible individual genotypes one could potentially generate, and we consider it important to underline that the classificatory performance of the phenotypes, i.e. of the vector spaces generated by RSI from the genotypes, can also vary substantially. What is more, our observations indicate that by submitting the genotypes to evolutionary pressures – i.e. by discarding the least « fit » genomes and promoting, varying and replicating the most fit ones – one also augments the classificatory performance of the resulting phenotypical vector space. In other terms, the search for a vector space which is optimal in regard to subsequent partitioning or clustering can be accelerated by means of evolutionary computation.

During the training, the evaluation of the fitness of every individual in every generation proceeded in the following manner:

• pass the genotype as an input to RSI (D=200, I=2);
• within the resulting vector space, calculate the cosines between all document and class vectors;
• attribute the N documents with the highest score to every class label (N was furnished for both the testing and the training corpus);
• calculate the precision in regard to the training corpus golden standard; this precision is considered to be equivalent to the individual's fitness.

The size of the population was 50 individuals. In every generation, after the fitness of all individuals had been evaluated, 40% of the new individuals were generated from the old ones by means of a one-point crossover operator, whereby the probability of an individual being chosen as a parent was proportional to the individual's fitness. The rest of the new population was generated from the old one by a combination of fitness-proportionate selection and mutation occurring with probability 0.01. Mutation was implemented as the replacement of a value in a genome by another value randomly chosen from the pool of 5849 feature types. Advanced techniques like parallel evolutionary algorithms or parameter auto-adaptation were not used in the study. A compressed sketch of this fitness evaluation and variation loop is given below.
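The sketch assumes the rsi() function from the previous block is in scope; the documents, gold labels and all parameter values are toy stand-ins (the actual DEFT2014 system used genomes of 200 genes, a population of 50 and a pool of 5849 feature types), and precision is computed here in a deliberately simplified way:

import random
import numpy as np

docs = ["genetic algorithms evolve populations", "grammars generate sentences",
        "evolution selects fit populations", "sentences have grammars"]
gold = [0, 1, 0, 1]                                  # toy class labels
pool = sorted({w for d in docs for w in d.split()})  # stand-in for the 5849 feature types
GENOME_LEN, POP_SIZE, P_MUT = 4, 12, 0.05

def class_vectors(doc_vecs, labels):                 # class = sum of its documents' vectors
    return {c: np.sum([v for v, l in zip(doc_vecs, labels) if l == c], axis=0)
            for c in set(labels)}

def fitness(genome):
    doc_vecs, _ = rsi(docs, triggers=genome)         # genotype -> phenotypic vector space
    cvs = class_vectors(doc_vecs, gold)
    guess = [max(cvs, key=lambda c: float(v @ cvs[c])) for v in doc_vecs]
    return sum(g == t for g, t in zip(guess, gold)) / len(gold)   # precision = fitness

population = [[random.choice(pool) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for generation in range(20):
    weights = [fitness(g) + 1e-6 for g in population]
    def parent():                                    # fitness-proportionate selection
        return random.choices(population, weights=weights, k=1)[0]
    nxt = []
    for _ in range(POP_SIZE):
        a, b = parent(), parent()
        cut = random.randrange(1, GENOME_LEN)        # one-point crossover
        child = a[:cut] + b[cut:]
        nxt.append([random.choice(pool) if random.random() < P_MUT else g for g in child])
    population = nxt
print(max(fitness(g) for g in population))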
The algorithm succeeded in optimizing the vector space with respect to the training corpus, attaining a precision of 87%. However, the resulting model over-fitted the training corpus and failed to be fully transferable to the testing corpus. Possibly due to an implementation error – c.f. (Hromada 2014b) for a closer discussion – the model thus achieved only 27% precision when confronted with the testing data. While definitely more performant than the random baseline, our approach was the least performant among the 5 participants of DEFT2014. Notwithstanding the failure of our model in DEFT2014, we consider as important our observation that « by evolutionary selection of chromosome of features which initially « trigger » the reflective process one can, indeed, optimize the topology and hence the classification performance of the resulting vector space » (Hromada 2014b).

3.2. Part-of-speech induction and part-of-speech tagging

The term part-of-speech induction (POS-i) designates the process which endows a human or an artificial agent with the competence to attribute POS labels (like "verb", "noun", "adjective") to any linguistic token observable in the agent's linguistic environment. POS-i can be understood as a « partitioning problem », since one's objective is to partition the initial set of all tokens occurring in a corpus C (which represents the agent's linguistic environment E) into N subsets (partitions, clusters) whose members would correspond to the grammatical categories defined by the gold standard. Because one does not use any information about the « ideal » gold standard grammatical categories during the training phase, and uses it only for the final evaluation of the performance of the model, POS-i is considered to be an « unsupervised » machine learning problem.

POS-i's « supervised » counterpart is the problem of POS-tagging. In POS-tagging, one trains the system by serving it, during the training phase, a sequence of couples (word W, tag T), where the tag T is the label denoting the grammatical category to which the word W belongs. POS-tagging is thus simpler than POS-i, where no information about ideal labels is furnished during the learning. The training of POS-tagging systems is of particular importance especially for languages where many word forms can potentially belong to many part-of-speech categories (in English, for example, almost any noun can also play the role of a verb; a token like « still » can be interpreted as a substantive, verb, adjective or even adverb, its POS category being determined by its context). On the contrary, in morphologically rich languages, where such « homonymy of forms » is present to a lesser degree and the relations between word types and classes are less ambiguous, one can often train a POS-tagging system simply by memorizing an exhaustive list of (W, T) couples.

3.2.1. Non-evolutionary models of POS-i

The paradigm currently dominating the POS-i domain was fully born with the article published by Brown et al. in 1992. Brown and his colleagues applied the information-theoretic notion of « mutual information »

$M(w_1 w_2) = \log \frac{\Pr(w_1 w_2)}{\Pr(w_1)\,\Pr(w_2)}$

upon all bigrams (i.e. sequences of two words) composed of tokens w1, w2, and subsequently devised a merging algorithm able to group words into classes in such a way that the mutual information within a class would be maximized.
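The reconstructed formula above can be computed directly from corpus counts, with relative frequencies standing in for the probabilities; a minimal sketch on a toy corpus:

from collections import Counter
from math import log

corpus = ("the dog barks the cat meows the dog bites "
          "a dog barks a cat meows").split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def mutual_information(w1, w2):
    # M(w1 w2) = log [ Pr(w1 w2) / (Pr(w1) Pr(w2)) ]
    return log((bigrams[(w1, w2)] / (N - 1)) /
               ((unigrams[w1] / N) * (unigrams[w2] / N)))

print(mutual_information('the', 'dog'), mutual_information('dog', 'barks'))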
In the two decades since the publication of Brown et al.'s study, their approach has inspired hundreds of studies : be it hidden Markov models tweaked with variational Bayes (Johnson, 2007), Gibbs sampling (Goldwater & Griffiths, 2007), morphological features (Berg-Kirkpatrick, Bouchard-Côté, DeNero, & Klein, 2010; Clark, 2003) or graph-oriented methods (Biemann, 2006) – all such approaches, and many others, consider the co-occurrence of words within n-gram sequences to be the primary source of relevant information for the subsequent creation of part-of-speech clusters. In all these approaches, one aims to discover the ideal parameters of a Markovian statistical model, often employing a so-called Expectation-Maximization (EM) algorithm to discover the optimal partitioning. Unfortunately, EM is unable to quit locally optimal states once they have been discovered. Notwithstanding this disadvantage, the comparative study of (Christodoulopoulos et al. 2010) suggests that probabilistic models of part-of-speech induction can indeed be very performant. POS-i can also be realized by means of the k-means clustering algorithm, or one of its variants. The k-means algorithm (Karypis 2002) partitions N observations, described as vectors in D-dimensional space, into K clusters by attributing every observation to the cluster with the nearest centroid (i.e. mean). If one considers these centroids to denote prototypes of the categories in whose center they are located, then one can consider the k-means algorithm to be consistent with the « prototype theory of categorization », as proposed by Rosch. Table 1 illustrates a simple k-means partitioning of the tokens present in the English version of Orwell's 1984.

Table 1. K-means clustering of tokens according to both suffixal and co-occurrence information. Table partially reproduced from (Hromada 2014c)

Tag      0     1     2     3     4     5     6
Noun    10   568    97    13  1173   608  1977
Verb     3    67   668  1011    67   958    97

In this example case we have clustered all tokens observable in the corpus into 7 clusters according to features both internal to the token – i.e. suffixes – and external – i.e. co-occurrence with other tokens. Note that even in such a simple model, where no machine learning or optimization was performed, the k-means algorithm somehow succeeds in distinguishing verbs from nouns. As is shown in Table 1, whose rows represent the “gold standard” tags and whose columns denote the artificially induced clusters, even such a naïve computational model has assigned 83.6% of nouns to clusters 1, 4 and 6, while assigning 91.8% of verbs to clusters 2, 3 and 5.
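These percentages can be recomputed directly from the contingency counts of Table 1. The following Perl sketch – our own illustration – performs a standard many-to-one evaluation, assigning every induced cluster to the gold tag which dominates it ; since, unlike the figure quoted above, it does not leave aside the marginal cluster 0, it reports 83.9% instead of 83.6% for nouns.

#!/usr/bin/perl
# Many-to-one evaluation of the clustering of Table 1: rows are gold
# tags, columns the induced clusters 0-6; every cluster is assigned
# to the gold tag that dominates it.
use strict;
use warnings;
use List::Util qw(sum);

my %table = (
    Noun => [ 10, 568,  97,   13, 1173, 608, 1977 ],
    Verb => [  3,  67, 668, 1011,   67, 958,   97 ],
);
my @tags = sort keys %table;

my %clusters_of;    # majority tag -> list of cluster indices
for my $c (0 .. $#{ $table{$tags[0]} }) {
    my ($winner) = sort { $table{$b}[$c] <=> $table{$a}[$c] } @tags;
    push @{ $clusters_of{$winner} }, $c;
}

for my $tag (@tags) {
    my $caught = sum( map { $table{$tag}[$_] } @{ $clusters_of{$tag} } );
    my $total  = sum( @{ $table{$tag} } );
    printf "%s -> clusters %s : %.1f%% captured\n",
           $tag, join(',', @{ $clusters_of{$tag} }), 100 * $caught / $total;
}
# prints: Noun -> clusters 0,1,4,6 : 83.9% ; Verb -> clusters 2,3,5 : 91.9%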
3.2.2. Evolutionary models of POS-i & POS-t
The usage of evolutionary computing in NLP is – in comparison with other methods like neural networks, hidden Markov models, conditional random fields or SVMs – still very rare. This is also the case for NLP's sub-problem of part-of-speech tagging, and thus we are aware of only one tentative to use genetic algorithms to train a part-of-speech tagger : in his proposal, Araujo (2002) describes a system of POS-t involving crossover and mutation operators. What is particularly interesting about Araujo's system is that a separate evolution process is run for every separate sentence of the test corpus. The training corpus, on the other hand, serves mainly as a source of statistical information concerning the co-occurrences of diverse words and tags in diverse word & tag contexts. This information concerning the « global » statistical properties of the training corpus is later exploited in the computation of fitness. Let's take, for example, the phrase « Ring the bell ». Since words like « ring » and « bell » are in English sometimes used as verbs and sometimes used as nouns, such a sentence can be tagged in at least 4 different ways : « N D N », « V D V », « N D V », « V D N » (we denote by the non-terminal symbol D the category of « determiners », into which the article « the » also belongs). Such sequences of tags yield the individual members of Araujo's initial population of chromosomes. In languages like English, where almost every word can be attributed to more than one POS category & the number of possible tag sequences therefore increases with the length of the phrase-to-be-tagged, one will most probably be obliged to choose such initial individuals randomly. The fitness of every individual possibly tagging a sentence of n words is subsequently calculated as a sum of the accuracies of the tags (genes) on positions i :

$$f = \sum_{i=1}^{n} f(g_i)$$

The accuracy of an individual gene $g_i$ is calculated as :

$$f(g_i) = \log \frac{context_i}{all_i}$$

whereby the values of $context_i$ and $all_i$ are extracted from the training table which was constructed during the training phase, and represent the frequency of occurrence of word $w_i$ within the specific context ($context_i$) and within all contexts ($all_i$). Once fitness is evaluated, fitness-proportional crossing-over (50%) and mutation (5%) are realized. Notwithstanding the fact that Araujo doesn't seem to have used any other selection mechanism, in less than 100 generations the populations seemed to converge to sequences of tags which were more than 95% correct in regards to the gold standard. This is a result comparable to other POS-tagging systems, but with a lesser computational cost. It is also worth noting that Araujo's experiments indicate that working solely with the contextual window WL, W, WR, i.e. just looking one word to the left and one word to the right, seems to yield, in the case of POS-tagging of English, higher scores than extracting data from larger contextual spans.
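A minimal Perl sketch of Araujo's fitness evaluation for one candidate tag sequence follows ; the context and all counts are invented toy numbers standing in for the statistics of the training table.

#!/usr/bin/perl
# Araujo-style fitness of one candidate tag sequence for « Ring the
# bell ». The context/all counts are invented toy numbers standing in
# for the statistics gathered from the training corpus.
use strict;
use warnings;

my @genes = (   # one gene (word, tag) per position of the sentence
    { word => 'Ring', tag => 'V', context => 12, all => 40 },
    { word => 'the',  tag => 'D', context => 95, all => 95 },
    { word => 'bell', tag => 'N', context => 30, all => 33 },
);

# f = sum_i log(context_i / all_i)
my $fitness = 0;
$fitness += log( $_->{context} / $_->{all} ) for @genes;
printf "fitness(V D N) = %.3f\n", $fitness;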
When it comes to the « unsupervised » variant of the POS-t problem, id est the problem of part-of-speech induction, up to this date there have been – as far as we know – no tentatives to address the POS-i problem by means of evolutionary computing. For this reason, and because we see strong analogies between the problems of CC and POS-i, our Thesis shall aim to solve this problem with a model similar to the one which we have presented in part 3.1.2 of this work.
3.3. Grammar induction
The input of the Grammar Induction (GI) process is a corpus of sentences written in language L ; its output is, ideally, a grammar (i.e. a tuple G={S,N,T,P} as defined in the above chapters), or at least a model able to generate sentences of L, including such sentences as were not present in the initial training corpus. The nature of the resulting grammar is closely associated with the content of the initial corpus as well as with the nature of the inductive (learning) process. According to their « expressive power », all grammars can be located somewhere on a « specificity – generality » spectrum. On one extreme of the spectrum lies the grammar having the following production rules :

1 → 2*
2 → a | b | c … Z

whereby * means « repeat as many times as You Want ». This very compact grammar can potentially generate any text of any size and as such is very general. But exactly because it can accept any alphabetic sequence, and thus does not have any « discriminatory power » whatsoever, such a grammar is completely useless as an explication of the system of any natural language. On the other extreme lies a completely specific grammar which has just one rule :

1 → C

This grammar contains exactly what the corpus C contains and is thus not compact at all (it is even two symbols longer than C). Such a grammar is not able to encode anything else than the sequence which was literally present in the training corpus, and is therefore also useless for any scenario where novel sentences are to be generated (or accepted). The objective of the GI process is to discover, departing solely from a corpus C (which is written in language L), a grammar which is neither too specific, nor too general. If it is too general, it shall « overgeneralize », i.e. it shall be able to generate (or accept) sentences which aren't considered grammatically correct by a common speaker of L. If it is too specific, it shan't be able to represent all sentences contained in C or, if it shall, it shan't be able to generate (or accept) any sentence which is considered to be a sentence of L but was not present in the initial training corpus C.
3.3.1. Non-evolutionary models of grammar induction
One of the first serious computational models of GI is the « Syntagmatic – Paradigmatic » (SNPR) model presented in (Wolff 1988). Its core algorithm is presented in Table 2.

Table 2. Outline of processing in the SNPR model (reproduced from Wolff, 1988)
1. Read in a sample of language.
2. Set up a data structure of elements (grammatical rules) containing, at this stage, only the primitive elements of the system.
3. WHILE there are not enough elements formed, do the following sequence of operations repeatedly:
BEGIN
3.1 Using the current structure of elements, parse the language sample, recording the frequencies of all pairs of contiguous elements and the frequencies of individual elements. During the parsing, monitor the use of PAR elements to gather data for later use in the rebuilding of elements.
3.2 When the sample has been parsed, rebuild any elements that require it.
3.3 Search amongst the current set of elements for shared contexts and fold the data structures in the way explained in the text.
3.4 Generalize the grammatical rules.
3.5 The most frequent pair of contiguous elements recorded under 3.1 is formed into a single new SYN element and added to the data structure. All frequency information is then discarded.
END

We consider the SNPR model to be of particular importance because of its aim to explain the process of Grammar Induction as a sort of cognitive optimization : « The central idea in the theory is that language acquisition and other areas of cognitive development are, in large part, processes of building cognitive structures which are in some sense optimal for the several functions they have to perform » (Wolff 1988). Wolff also associates his « cognitive optimization hypothesis » with the « law of cumulative complexity » postulated in the study (Brown 1973), which is considered to be a big classic of the language development literature : « if one structure contains everything that another structure contains and more, then it will be acquired later than that other structure » (Wolff 1988). The grammar resulting from such a contact between a language sample and the SNPR inducing mechanism is displayed in Figure 7.
Figure 7: Grammar induced by the SNPR model. Figure reproduced from (Wolff, 1988).
In Wolff's theory, optimization is further understood as compression. Within the SNPR model, such compression is realized in step 3.5 of the algorithm, where the most frequent pair of contiguous elements (either terminals or non-terminals) is substituted by a new non-terminal symbol.
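Step 3.5 is easy to make concrete. The following Perl sketch – our own toy illustration, not Wolff's code – performs three cycles of the « most frequent pair of contiguous elements becomes a new SYN element » substitution which drives the compression :

#!/usr/bin/perl
# Three cycles of SNPR step 3.5: the most frequent pair of contiguous
# elements is formed into a single new SYN element. Toy sample of our own.
use strict;
use warnings;

my @sample = qw(the dog saw the cat the dog bit the cat);

for my $cycle (1 .. 3) {
    my %pairs;
    $pairs{ "$sample[$_] $sample[$_+1]" }++ for 0 .. $#sample - 1;
    my ($best) = sort { $pairs{$b} <=> $pairs{$a} or $a cmp $b } keys %pairs;
    my $syn = "SYN$cycle";
    print "$syn -> $best\n";

    my @rewritten;                       # substitute the new element
    for (my $i = 0; $i <= $#sample; $i++) {
        if ($i < $#sample and "$sample[$i] $sample[$i+1]" eq $best) {
            push @rewritten, $syn;
            $i++;                        # skip the second member of the pair
        } else {
            push @rewritten, $sample[$i];
        }
    }
    @sample = @rewritten;
    print "sample: @sample\n";
}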
Owing to this substitution, the size of the grammar able to generate the initial language sample ideally decreases with every cycle of the model's « while » loop, until the process converges to a state where no redundancy remains to be « compressed ». Wolff proposes that Grammar Induction is a process which should maximize the coding capacity (CC) of the resulting grammar while minimizing its size (in current research, it is more common to speak about the grammar's Minimum Description Length, MDL). He defines the ratio CC/MDL to denote the grammar's efficiency, and it may be the case that within a more evolutionary framework, where one would work with populations of grammars, a very similarly defined notion of efficiency could be used as the core component of the fitness function. Unfortunately, Wolff's 1988 SNPR model is not evolutionary, since it involves neither stochastic factors nor the notion of multiple candidate solutions. Wolff's SNPR is simply confronted with the language sample, deterministically compresses redundancies in a way that sometimes resembles human grammar (and sometimes does not), subsequently gets stuck in a local optimum, and there is no way to get out of it. Another famous model of GI is that of (Elman 1993). Contrary to Wolff's algorithm, which is principally « symbolic », Elman's model is a « connectionist » one. More concretely, Elman succeeded in training a simple recurrent neural network which was « trained to take one word at a time and predict what the next word would be. Because the predictions depend on the grammatical structure (which may involve multiple embeddings), the prediction task forces the network to develop internal representations which encode the relevant grammatical information. » (Elman 1993). The most important finding of Elman's study seems to be the evidence for the so-called « less is more » hypothesis (Newport 1990), which Elman himself labels with the terms « the importance of starting small » : « Put simply, the network was unable to learn the complex grammar when trained from the outset with the full “adult” language. However, when the training data were selected such that simple sentences were presented first, the network succeeded not only in mastering these, but then going on to master the complex sentences as well. » (Elman 1993). Something similar occurred also when he tuned the capacity of the « internal memory » of his networks rather than the corpus itself. Elman observed : « If the learning mechanism itself was allowed to undergo “maturational changes” (in this case, increasing its memory capacity) during learning, then outcome was just as good as if the environment itself had been gradually complicated. » Thus, not only do the results of Elman's computational model point in the same direction as many developmental and psycholinguistic studies of « motherese » (c.f. citations from Harris in part 2 of this work) ; they also show the importance of gradual physiological changes for the ultimate mastering of the maternal language. He goes even so far as to state that the prolonged infancy of human children could go hand in hand with the fact that only humans develop language to the extent we do : « In isolation, we see that both learning and prolonged development have characteristics which appear to be undesirable. Working together, they result in a combination which is highly adaptive » (Elman 1993).
Notwithstanding these interesting results, which are not to be underestimated, we see two disadvantages of Elman's approach. Primo, as is often the case for connectionist neural networks, his resulting model is somewhat difficult to interpret : given the training constraints mentioned above, the network seems to predict quite well the next word in the phrase, but it is not evident why it does what it does. Elman himself dedicates a major part of his article to descriptions of his tentatives to understand how his « blackbox » functions. Secundo, Elman confronted his model only with artificial corpora, i.e. corpora generated from manually created grammars. Thus, his model accounts only for a limited subset of the properties of one language (English) and as such is still quite far from a full-fledged solution to the problem of natural-language GI. The last model we present in this brief overview, called « Automatic Distillation of Structure » (ADIOS), seems to be touched by this second disadvantage to a lesser extent since, as its authors state : « In grammar induction from large-scale raw corpora, our method achieves precision and recall performance unrivaled by any other unsupervised algorithm. It exhibits good performance in grammaticality judgment tests (including standard tests routinely taken by students of English as a second language) and replicates the behavior of human subjects in certain psycholinguistic tests of artificial language acquisition. Finally, the very same algorithmic approach also is proving effective in other settings where knowledge discovery from sequential data is called for, such as bioinformatics. » (Solan et al. 2005). ADIOS is a graph-based model. It considers each sentence to be a path in a directed pseudograph (i.e. loops and multiple edges are allowed), each sentence being delimited by special « begin » and « end » vertices. Every lexical entry (i.e. a word type) is also a vertex of the graph ; thus, if two or more sentences share the same word X, they cross each other in the vertex VX ; if they contain the same subsequence XY, their paths share the common subpath (edge) VXVY, etc. The authors of ADIOS describe their algorithm as follows : « The algorithm generates candidate patterns by traversing in each iteration a different search path (initially coinciding with one of the original corpus sentences), seeking subpaths that are shared by a significant number of partially aligned paths. The significant patterns (P) are selected according to a context-sensitive probabilistic criterion defined in terms of local flow quantities in the graph... Generalizing the search path, the algorithm looks for an optional equivalence class (E) of units that are interchangeable in the given context [i.e., are in complementary distribution]. At the end of each iteration, the most significant pattern is added to the lexicon as a new unit, the subpaths it subsumes are merged into a new vertex, and the graph is rewired accordingly... The search for patterns and equivalence classes and their incorporation into the graph are repeated until no new significant patterns are found. » (Solan et al. 2005). In other terms, ADIOS starts with a so-called Motif Extraction (MEX) procedure which looks for bundles of the graph's subpaths which obey certain conditions. Once such « patterns » are found, they are subsequently « substituted » by non-terminal symbols, and the graph is « rewired » to incorporate the newly constructed non-terminals.
Such a « pattern distillation » procedure of generalization bootstraps itself until no further rewiring is possible. The output of the whole process is a rule grammar combining patterns (P) and their equivalence classes (E) into rules, able to generate even phrases which weren't present in the initial corpus. An example of how ADIOS progressively discovers more and more abstract combinatorial patterns is presented in Figure 8.
Figure 8: Equivalence classes and production rules induced from English language samples by the ADIOS algorithm. Figure reproduced from (Solan et al. 2005).
ADIOS is undoubtedly one of the most performant GI systems currently in existence. It combines statistical, probabilistic and graph-theoretical notions with the notion of a rule-based grammar, and as such it is also of great theoretical interest. On the other hand, ADIOS does not involve any source of stochasticity ; it seems to be purely deterministic and as such incapable of dealing with the highly probable convergence towards locally optimal grammars. In confrontation with some partial corpora this may possibly not cause any problems but, we predict, without any stochastic variation whatsoever, ADIOS could not account for more than a few « advanced » real-life properties of natural languages, and as such it shall possibly share the destiny of the SNPR model.
3.3.2. Evolutionary models of grammar induction
Multiple authors have proposed to solve the GI problem with different variants of evolutionary computing – in the following paragraphs we shall describe five different approaches :
1) Tomita's (1982) hill-climbing induction of finite state automata
2) Dupont's (1994) GIG method for the inference of regular languages
3) the evolution of stochastic context-free grammars as presented by Keller & Lutz (Keller and Lutz 1997)
4) the evolutionary method of (Aycinena et al. 2003), inducing grammars from POS tags of nine different English language corpora
5) the genetic algorithm of Smith & Witten (Smith and Witten 1995) for inducing a LISP s-expression grammar from a simple corpus of English sentences
Tomita's 1982 paper can be considered to be one of the first empiric studies of grammatical inference. The study focused on the inference of grammars of 14 different regular languages – often called « Tomita languages » in the subsequent literature – by means of deterministic finite state automata. Tomita first encoded any possible finite state machine with n states in the following manner :

$$((A_1, B_1, F_1)\,(A_2, B_2, F_2)\,\ldots\,(A_n, B_n, F_n))$$

whereby every block $(A_i, B_i, F_i)$ « corresponds to the state i, and $A_i$ and $B_i$ indicate the destination states of the 0-arrow and the 1-arrow from the state i, respectively. If A or B is zero, then there is no 0-arrow or 1-arrow from the state i, respectively. $F_i$ indicates whether state i is one of the final states or not. If $F_i$ is equal to 1, the state i is one of the final states. The initial state is always state 1 » (Tomita, 1982). Thus, for example, the string ((1 2 1) (3 1 1) (4 0 0) (3 4 1)) encodes the finite state automaton illustrated in Figure 9.
Figure 9: Finite state automaton matching all strings over (1 + 0)* without an odd number of consecutive 0's after an odd number of consecutive 1's. Figure reproduced from (Tomita 1982).
Such an encoding allowed Tomita to subsequently apply his hill-climbing approach.
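The encoding is directly executable. The Perl sketch below – our own illustration – decodes the triplet string of Figure 9 into a working automaton and evaluates it against toy right/wrong lists, in the spirit of the evaluation function E = r − w introduced in the next paragraph :

#!/usr/bin/perl
# Decoding Tomita's triplet encoding and computing his evaluation
# function E = r - w. The automaton is the one of Figure 9; the
# right/wrong lists are toy samples of our own.
use strict;
use warnings;

# ((1 2 1)(3 1 1)(4 0 0)(3 4 1)); block i = [A_i, B_i, F_i]
my @machine = ( [1,2,1], [3,1,1], [4,0,0], [3,4,1] );

sub accepts {
    my ($m, $string) = @_;
    my $state = 1;                            # the initial state is always 1
    for my $bit (split //, $string) {
        $state = $m->[$state - 1][$bit];      # follow the 0-arrow or 1-arrow
        return 0 if $state == 0;              # no such arrow: reject
    }
    return $m->[$state - 1][2];               # accept iff final state
}

my @right = qw(0 1 00 11 01 110);             # strings the machine should accept
my @wrong = qw(10 1000 1011);                 # strings it should reject

my $r = grep { accepts(\@machine, $_) } @right;
my $w = grep { accepts(\@machine, $_) } @wrong;
print "E = r - w = ", $r - $w, "\n";          # prints E = 6 - 0 = 6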
Hill-climbing can be considered to be a precursor of the more extended genetic programming, since it employs both random mutations to explore the surrounding search-space and a sort of selection algorithm which always prefers to use, in the following iteration of the algorithm, such individual solutions for which the value of the evaluation function E increases. Tomita's definition of E is very simple :

$$E = r - w$$

« where r is the number of strings in the right-list accepted by the machine, and w is the number of strings in the wrong-list accepted by the machine ». The right-list is a positive sample corpus, while the wrong-list is the negative sample. Thus, if a random mutation transforms an individual $X_n$ into an individual $X_{n+1}$ so that $E(X_{n+1}) > E(X_n)$, i.e. if an automaton is discovered which matches more positive sequences, or fewer negative sequences, or both – it will be $X_{n+1}$ which will be mutated in the next cycle of the algorithm. Tomita's approach cannot be considered to be fully evolutionary, because he used neither populations nor any kind of cross-over operator. For this reason, Tomita's regular grammar-inferring algorithm did sometimes get stuck in local maxima from which there was no way out. Notwithstanding this small imperfection – of which Tomita himself was well aware – his work served, and still serves, the role of an important landmark on the path to full-fledged GI. Dupont (1994), for example, also focused his study on the induction of 15 different regular Tomita languages. In his formally very sound work, he defines the problem of the inference of regular languages as the problem of finding an optimal partition of the state space of a finite « maximal canonical automaton » (MCA) able to accept the sentences from the positive sample. The fitness function takes into account also the system's tendency to reject the sentences contained in the negative sample. By using a so-called « left-to-right canonical group encoding », Dupont succeeds in representing diverse individual automata in a very concise way, which allows him to subsequently evolve them by means of structural mutation (« the structural mutation consists of a random selection of a state in some block of a given partition followed by the random assignment of this state to a block », e.g. MUTATE({{1,3,5},{2},{4}}) → {{1,5},{2,3},{4}}) and structural crossover (« the structural crossover consists of the union in both parent partitions of a randomly selected block », for example CROSS({{1,4},{2,3,5}}, {{1,3},{2},{4},{5}}) → {{1,3,4},{2,5}} and {{1,3,4},{2},{5}}). Because « the search space size dramatically increases with the size of the positive sample, making the correct identification more difficult when we have a larger positive information on the language », Dupont also proposed an incremental procedure allowing the search process to start from a smaller yet pertinent region of the search space. The procedure goes as follows : « first sort the positive sample I+ in lexicographical order. Consequently, the shortest strings are first taken into account. Starting with the first sentence of I+, we construct the associated MCA(I+) and we search for the optimal partition of its state set under the control of the whole negative sample I−. Let A1 denote the derived automaton with respect to this optimal partition. Let snext denote the next string in I+. If snext is already accepted by A1, we skip it. » (Dupont 1994). Otherwise, the automaton A1 is extended so that it can cover snext as well.
The search under the control of the whole negative sample is then restarted, and the whole process is repeated until all sentences from the positive sample have been considered. With a population size of 100 individuals, a maximum number of 2000 evaluations, a crossover rate of 0.2, a mutation rate per bit of 0.01 and the semi-incremental procedure implemented, Dupont's approach attained, on average, a classification rate of 94.4%. For five among the fifteen Tomita languages, grammars were constructed which attained 100% accuracy (i.e. accepted all sentences from the positive sample and rejected all strings from the negative sample). The results also indicated that whenever the semi-incremental procedure is applied, the sample size has a positive influence upon the accuracy of the inferred grammars – a bigger sample yields more accurate grammars. While Tomita's results indicate, and Dupont's results further confirm, the belief that induction of grammars by means of evolutionary computing is a plausible thing to do, they do so only in regards to the simplest type of grammars – the regular ones. Grammars of natural languages, however, are definitely not regular, and models of GI for the more expressive « context-free » (CFG) or « context-sensitive » grammars are needed. Keller and Lutz employed a genetic algorithm to evolve the parameters of stochastic context-free grammars (SCFG) of 6 different languages. SCFGs are similar to traditional CFGs (« In formal language theory, a context-free grammar (CFG) is a grammar in which every production rule is of the form V → w, where V is a single non-terminal symbol, and w is a string of terminals and/or non-terminals. The term « context-free » expresses the fact that non-terminals can be rewritten without regard to the context in which they occur » (Choubey and Kharat 2009)), but extended with a probability distribution, so that there is a probability value in the range [0,1] associated with every production rule of the grammar. These values are called the SCFG's parameters, and these are the values which the algorithm of Keller & Lutz aims to optimize by means of GAs. Their approach involves the following steps : « 1. Construct a covering grammar that generates the corpus as a (proper) subset. 2. Set up a population of individuals encoding parameter settings for the rules of the covering grammar. 3. Repeatedly apply genetic operations (cross-over, mutation) to selected individuals in the population until an optimal set of parameters is found. » (Keller and Lutz 1997). Their fitness function F(G) is based on the idea of Minimum Description Length (MDL). More formally, Keller & Lutz aimed to maximize

$$F(G) = \frac{K_C}{L(C \mid G) + L(G)}$$

by minimizing the denominator, which is defined as the number of bits needed to encode the grammar G (that is, L(G)) plus the number of bits needed to encode the corpus C, given the grammar G (that is, L(C|G)). The numerator $K_C$ is just a corpus-dependent normalization factor assuring that the value of the fitness shall lie in the range [0,1]. Keller & Lutz confronted their algorithm with positive samples of circa 16000 strings (typically of length 6 or 8) drawn from 6 different context-free languages :
1. EQ : the language of all strings consisting of equal numbers of a's and b's
2. the language $a^n b^n$ (n ≥ 1)
3. BRA1 : the language of balanced brackets
4. BRA2 : balanced brackets with two sorts of bracketing symbols
5. PAL1 : palindromes over {a,b}
6. PAL2 : palindromes over {a,b,c}
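For four of the six languages just listed, membership can be tested in a few lines ; the following Perl sketch – our own illustration, with our own probe strings – could serve to prepare toy positive and negative samples :

#!/usr/bin/perl
# Membership tests for four of the six test languages of Keller & Lutz.
# Probe strings are toy examples of our own.
use strict;
use warnings;

my %lang = (
    EQ   => sub { my $s = shift;       # equal numbers of a's and b's
                  $s !~ /[^ab]/ and ($s =~ tr/a//) == ($s =~ tr/b//) },
    ANBN => sub { $_[0] =~ /^(a+)(b+)$/ and length($1) == length($2) },
    BRA1 => sub {                      # balanced brackets
        my $depth = 0;
        for my $c (split //, $_[0]) {
            return 0 unless $c eq '(' or $c eq ')';
            $depth += $c eq '(' ? 1 : -1;
            return 0 if $depth < 0;
        }
        return $depth == 0;
    },
    PAL1 => sub { my $s = shift;       # palindromes over {a,b}
                  $s !~ /[^ab]/ and $s eq reverse $s },
);

for my $probe ('aabb', 'abab', 'abba', '(())', '(()') {
    my @accepting = grep { $lang{$_}->($probe) } sort keys %lang;
    printf "%-4s accepted by: %s\n", $probe,
           @accepting ? "@accepting" : '(none)';
}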
In the majority of cases, their algorithm converged to such combinations of the parameters of their SCFGs as allowed them to accept more than 95% of the strings present in the positive sample. Such results indicate that genetic algorithms can be used as a means for the unsupervised inference of the parameters of stochastic context-free grammars. Note that Keller & Lutz confronted their algorithm, during both testing and training, only with a positive sample. While doing so for training is justifiable – since the objective of their study was to see whether grammars can be inferred solely from positive evidence – not doing so during the testing phase makes uncertain the extent to which their inferred grammars overgeneralize. Another huge disadvantage with regard to the aims of our Thesis is the simple fact that their approach also seems to be very costly (« the number of parses that must be considered increases exponentially with the number of non-terminals »). And since they confronted their algorithms only with corpora composed of sentences of artificial, and not natural, languages, we shall not try to imitate their approach of « tuning SCFG parameters » in our Thesis. By being context-free and not simply regular, the grammars studied by Keller & Lutz or by (Choubey and Kharat 2009) could be considered more similar to the grammars of natural languages. Nonetheless, languages composed of palindromes and sequences of balanced brackets are still a long way off from natural languages, and the question « to what extent are results concerning GI of artificial languages applicable to GI of natural languages ? » is far from being answered. Rather than trying to answer it, we proceed now to a discussion of two approaches where evolutionary GI has been applied to natural language sentences. The first method, proposed in (Aycinena et al. 2003), focused on the induction of CFG grammars from nine different part-of-speech tagged natural language corpora. The sentences contained in these corpora, composed thus of sequences of part-of-speech tags (c.f. Section 3.2), were used as positive examples, while randomly generated sequences of POS-tags yielded the negative examples. The initial population was composed of linear encodings of randomly generated context-free grammars ; for example, the string SABABCBCDCAE would represent the following CFG (a decoding sketch is given below, after the breeding strategy) :

S → AB
A → BC
B → CD
C → AE

During the evaluation of an individual grammar G, one would first try to parse both the positive and the negative corpora with the grammar G and subsequently calculate the final fitness by applying the following formula :

$$F(\alpha) = \gamma^{\max(0,\,|\alpha| - |P|)} \, C(\alpha) - \delta \, I(\alpha)$$

« where P is the set of preterminals, C(α) is the number of parsed sentences from the corpus, I(α) is the number of sentences parsed from the randomly generated corpus, δ is the penalty associated with parsing each sentence in the randomly generated corpus, and γ is the discount factor used for discouraging long grammars » (Aycinena et al. 2003). In their study, Aycinena et al. placed a randomly generated population of 100 individual grammars on a two-dimensional 10 x 10 torus grid. Subsequently, they applied the following select-breed-replace strategy : « 1. Select an individual randomly from the grid. 2. Breed that individual with its most fit neighbor to produce two children. 3. Replace the weakest parent by the fittest child. » (Aycinena et al. 2003)
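To make the linear genome encoding concrete, the Perl sketch below unpacks the example string quoted above, reading it as consecutive (left-hand side, right-hand symbol, right-hand symbol) triples ; this reading reproduces the four example rules, though the exact packing used by Aycinena et al. may differ in its details.

#!/usr/bin/perl
# Unpacking the linear genome SABABCBCDCAE as consecutive
# (LHS, RHS, RHS) triples. This reading reproduces the four example
# rules above; the authors' exact packing may differ in details.
use strict;
use warnings;

my $genome = 'SABABCBCDCAE';
while ($genome =~ /\G(.)(.)(.)/g) {
    print "$1 -> $2$3\n";
}
# prints: S -> AB, A -> BC, B -> CD, C -> AE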
In their framework, « cross-over is accomplished by selecting a random production in each parent. Then a random point in these productions is selected and cross-over is performed, swapping the remainder of the strings after the cross-over points ». Every symbol of a resulting string can subsequently be mutated (mutation rate = 0.01) : « A mutation is simply the swapping of a non-terminal or pre-terminal with another non-terminal or pre-terminal » (Aycinena et al. 2003). Figure 10 shows the number of generations each run was able to complete, the grammar G that last evolved, the percentage of positive examples parsed by G, the percentage of negative examples parsed by G, and G's fitness.
Figure 10: Grammars evolved from nine different POS-tagged corpora. Figure reproduced from (Aycinena et al., 2003).
While the results displayed there may seem encouraging, the authors noticed that in the majority of cases their approach « gives a grammar that is very capable of detecting whether a sentence is valid in English, but it has not learned much English structure ». In other terms, Aycinena et al. succeeded in breeding grammars which have a certain discriminatory power but are practically useless as models of the English language. They go even so far as to state, in the ultimate paragraph of their work, that « It is still possible that English grammar is too complex to be learned from a corpus of words » and that other external clues are necessary for a successful GI of English. A big disadvantage of the above-mentioned algorithm was also the fact that its input were sequences of already attributed POS-tags and not sequences of the words themselves. Thus, even if the approach were to discover some interesting grammars, the reproach could be made, and justified, that in fact it only re-discovered the rules of the tagging system which was used in the first place. From the perspective of our Thesis, another disadvantage of Aycinena et al.'s approach is related to the fact that it is anything but a model of grammar development in a human child. For it is evident (c.f. Section 2) that children learn the grammar of their language in an incremental fashion – they are not confronted with the whole corpus from the very beginning. Nor does the corpus stay identical after each iteration of the learning process. On the contrary : as the child grows, its linguistic environment – the corpus – also grows, both in length and in complexity. An interesting evolutionary approach to GI which both tries to create its own non-terminal categories and also takes such « incrementality » into account is presented in the work of (Smith and Witten 1995). In their scenario, candidate grammars are evolved after the presentation of every new sentence. Grammars have the form of LISP s-expressions, whereby AND represents a concatenation of two symbols (i.e. a syntagmatic node) and OR represents a disjunction (i.e. a paradigmatic node). The whole process is started as follows : « The GA proceeds from the creation of a random population of diverse grammars based on the first sample string. The vocabulary of the expression is added to an initially empty lexicon of terminal symbols, and these are combined with randomly chosen operators in a construction of a candidate grammar... If the candidate grammar can parse the first string, it is passed into the initial population ». Figure 11 displays two sample grammars for the sentence « the dog saw a cat ».
Figure 11: Two simple grammars covering the sentence « the dog saw a cat ». Figure reproduced from (Smith & Witten, 1995).
S-expression sequences representing individual grammars are subsequently mutated. A couple of parent grammars can also switch their nodes – the probability of being chosen for such a cross-over is inversely proportional to the grammar's size : shorter grammars are preferred. The cross-over is non-destructive ; the parents thus also persist. The events of reproduction are grouped in cycles ; at the end of each cycle, the population of candidate grammars is confronted with a new sentence from the sample of positive evidence. In their article, (Smith and Witten 1995) show how, after the presentation of the sentences « the dog saw a cat », « a dog saw a cat », « the dog bit a cat », « the cat saw a cat », « the dog saw a mouse » and « a cat chased the mouse », their system naturally converged to a grammar which had quite correctly subsumed determiners like « a », « the » under one group of OR nodes, verbs like « chased », « saw », « bit » under another, and nouns like « dog », « cat », « mouse » under yet another. The grammar which they finally obtain is not ideal but, as they argue, it could get better if confronted with new sentences : « It is an adaptive process whereby the model is gradually conditioned by the training set. Recurring patterns help to reinforce partial inferences, but intermediate states of the model may include incorrect generalizations that can only be eradicated by continued evolution. This is not unlike the developing grammar of a child which includes mistakes and overgeneralisations that are slowly eliminated as their weaknesses are made apparent by increasing positive evidence. » (Smith and Witten 1995). While strongly agreeing with the above citation, we nonetheless cannot ignore certain drawbacks of Smith & Witten's approach. Most importantly, by using LISP s-expressions as the way of representing their grammars, they ultimately have to end up with highly bifurcated binary trees (since the arity of the AND|OR operators is 2). Thus, one can easily subordinate two terminals to one non-terminal (e.g. OR(cat,dog)), but in the case of three subordinated terminals, one is obliged to use a complex expression involving three non-terminal nodes (e.g. OR(OR(cat,dog),OR(mouse,NULL))). Therefore, in such an s-expression based representation, any class having more than two members is necessarily represented by a longer sequence → is more prone to mutation → is highly « handicapped » in regards to the much shorter expressions subordinating just two nodes. Another drawback of Smith & Witten's work which cannot be ignored is related to the fact that while they used English language sentences to train their system, the sentences were very simple, and the relevance of their findings for the GI of « natural » English is more than disputable. In fact, they seem to achieve, with quite complex evolutionary machinery, even less than Wolff's deterministic SNPR model had achieved almost a decade before. Notwithstanding these two drawbacks, we nonetheless consider as particularly inspiring their approach aiming to solve the problem of GI of natural languages by uniting, in one framework, the notions of adaptability, evolvability and statistical sensitivity to recurring patterns. We summarize : all five above-mentioned approaches indicate that evolutionary computing can potentially yield useful solutions to the problem of Grammar Induction of both artificial (regular, context-free) and natural language grammars.
The length of the candidate grammar is frequently used as an input argument of the fitness function. Note also that the solutions of both Dupont and Smith & Witten use a sort of « incremental » procedure whereby individual solutions gradually adapt to every new sentence. Especially Dupont's findings are reminiscent of what has already been said about the « importance of starting small » when discussing the works of Elman & Harris. On the other hand, none of the above-mentioned models was confronted with a corpus of child-directed (i.e. « motherese ») or child-originated utterances. The objective of our Thesis shall be to fill this gap.
3.4. Evolutionary Language Game
The Evolutionary Language Game (ELG), first proposed in (Nowak et al. 1999), is a stunningly simple yet mathematically feasible stochastic model addressing the question : « How could a coordinated system of meanings & sounds evolve in a group of mutually interacting agents ? ». In the most simple terms, the model can be described as follows. Let's have a population of N agents. Each agent is described by an n x m associative matrix A. A's entry $a_{ij}$ specifies how often an individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j. From this matrix A, one can derive the active « speaker » matrix P by normalizing A's rows,

$$p_{ij} = \frac{a_{ij}}{\sum_{k=1}^{m} a_{ik}}$$

and the passive « hearer » matrix Q by normalizing A's columns :

$$q_{ji} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{kj}}$$

The entries $p_{ij}$ of the matrix P denote the probability that, for an agent-speaker, object i is associated with sound j. The entries $q_{ji}$ of the matrix Q denote the probability that, for an agent-hearer, sound j is associated with object i. Subsequently, we can imagine two individuals A and A′, the first one having the language L(P, Q), the other having the language L′(P′, Q′). The payoff related to the communication of two such individuals is, within Nowak's model, calculated as follows :

$$F(A, A') = \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} q'_{ji} = \mathrm{Tr}(PQ')$$

And the fitness of the individual A in regards to all other members of the population P can be obtained as follows :

$$f(A) = \frac{1}{|P| - 1} \sum_{A' \in P,\, A' \neq A} F(A, A')$$

After the fitness values are obtained for all population members, one can easily apply traditional evolutionary computing methods in order to direct the population toward more optimal states, i.e. states where the individual matrices are mutually « aligned ». In Nowak's framework, this alignment represents the situation when hearer and speaker mutually understand each other, i.e. the speaker has encoded meaning M by sound S and the hearer has subsequently decoded sound S as meaning M. ELG beautifully illustrates how such an alignment of sound-meaning matrices – a mutually shared communication protocol – can emerge practically ex nihilo, given that there is some « mutual learning » mechanism involved which allows information to be transferred from one individual to another. This is attained by creating a blank « student » matrix and then filling its elements, by means of a stochastic « matrix sampling » procedure, in such a way that the resulting student matrix will partially correspond to | be aligned with the matrices of a pre-existing « teacher » (or teachers).
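The payoff formula is easily made concrete. The following Perl sketch – our own toy illustration – derives P and Q from two small association matrices and computes F(A, A′) = Tr(PQ′) :

#!/usr/bin/perl
# ELG payoff F(A, A') = Tr(PQ') for two toy agents. P is the
# row-normalized matrix of the speaker, Q' the column-normalized
# matrix of the hearer. The matrices are toy examples of our own.
use strict;
use warnings;
use List::Util qw(sum);

sub speaker_P {                 # normalize the rows of A
    my $A = shift;
    return [ map { my $s = sum(@$_); [ map { $_ / $s } @$_ ] } @$A ];
}
sub hearer_Q {                  # normalize the columns of A; Q is m x n
    my $A = shift;
    my ($n, $m) = (scalar @$A, scalar @{ $A->[0] });
    my @Q;
    for my $j (0 .. $m - 1) {
        my $s = sum( map { $A->[$_][$j] } 0 .. $n - 1 );
        $Q[$j] = [ map { $A->[$_][$j] / $s } 0 .. $n - 1 ];
    }
    return \@Q;
}
sub payoff {                    # F(A, A') = sum_ij p_ij * q'_ji
    my ($A, $A2) = @_;
    my ($P, $Q2) = (speaker_P($A), hearer_Q($A2));
    my $f = 0;
    for my $i (0 .. $#$P) {
        $f += $P->[$i][$_] * $Q2->[$_][$i] for 0 .. $#{ $P->[$i] };
    }
    return $f;
}

my $A  = [ [9, 1], [1, 9] ];    # agent mostly mapping object i to sound i
my $A2 = [ [8, 2], [2, 8] ];    # a mostly aligned partner
printf "F(A, A') = %.3f\n", payoff($A, $A2);   # 1.480 of a maximum of 2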
Further discussion of, and experiments with, ELG are described in (Kvasnička and Pospíchal) and (Hromada 2012). All these studies point in the same direction and suggest not only that the emergence of a mutually shared communication protocol practically ex nihilo is possible whenever there exists a means of transfer of information among individuals, but also that the presence of a certain low amount of noise during the learning process is the only way to make certain that the system will converge to a « communicatively optimal » state. The role of the ELG model within the context of our Thesis is quite open. For while it is the case that ELG sheds some light upon the question of the emergence of language within a community of symbolically interacting agents, it does not principally address the problem of language learning by a concrete individual. Thus, ELG is rather a model of macroscopic phylogeny than of microscopic ontogeny – it addresses the problem of how small communities of homo habilis could, in a distant past, gradually converge to a system of signs within which, for example, « baubau » could mean a banana and « wauwau » a lion. But it does not address the problem of how today's human baby learns the complex language of her mother. On the other hand, it is not completely hors propos to imagine a slight variation of Nowak's model wherein one population of matrices would be fixed (representing the linguistic competence of a teacher or mother organism) while the second population of matrices would represent the linguistic competence of a « child ». Given that the fitness function would somehow succeed in representing the degree of alignment between such a « mother » and « child », we postulate that something like the child's language competence could spontaneously emerge.
4. Remark concerning the Theory of Grammar Systems
A branch of Formal Language Theory which could be of particular use for the purposes of our Thesis is devoted to the study of Grammar Systems (GS). A GS is a « set of grammars working together, according to a specified protocol, to generate a language » (Jiménez-López 2000). Thus, contrary to classical Formal Language Theory, within which one grammar generates one language, in GS several grammars work together in order to generate one language. Grammar Systems can therefore be considered as a sort of multi-agent variant of the traditional « monolithic » formal grammar theory. The very nature of multi-agent systems often implies cooperation, communication, distribution, modularity, parallelism, or even the emergence of complexity. For example, Figure 12 illustrates a very simple bimodular « language colony » variant of a GS.
Figure 12: Language colony of two finite grammars cooperating to generate an infinite language. Figure reproduced from (Kelemen 2004).
By allowing the finite grammar components to communicate through a common symbolic environment – such an environment, shared by the different modules, plays the central role in practically all variants of Grammar Systems and is reminiscent of the role which « short-term memory » or « working memory » plays in cognitive psychology – one ultimately generates a language which is infinite ! (Kelemen 2004) applies the term « miracle » to such behaviour, which is very common in the world of GS.
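The « miracle » can be simulated in a few lines. In the following Perl sketch – a toy illustration of our own, not the colony of Kelemen's figure – each of the two components generates, on its own, only a finite language, yet by rewriting a shared sentential form in turns they generate the infinite language a(ba)* :

#!/usr/bin/perl
# A two-component « language colony »: each component alone generates
# only a finite language, yet by rewriting a shared sentential form in
# turns the colony generates the infinite language a(ba)*. The rules
# are a toy illustration of our own.
use strict;
use warnings;

my %component = (
    G1 => { S => ['aT', 'a'] },   # alone, G1 generates only {a}
    G2 => { T => ['bS', 'b'] },   # alone, G2 generates only {b}
);

sub derive {
    my $rounds = shift;
    my $form   = 'S';                    # the shared symbolic environment
    for my $k (1 .. $rounds) {
        for my $g (sort keys %component) {
            for my $lhs (keys %{ $component{$g} }) {
                # expand while rounds remain, terminate on the last one
                my $rhs = $component{$g}{$lhs}[ $k < $rounds ? 0 : 1 ];
                $form =~ s/$lhs/$rhs/;
            }
        }
    }
    return $form;
}

print derive($_), "\n" for 1 .. 4;       # a, aba, ababa, abababa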
Since the Theory of Grammar Systems is formally very well developed – most notably thanks to the life-long work of Erzsébet Csuhaj-Varjú and substantial contributions by George Paun and Jozef Kelemen – it is impossible for us to introduce, within the limited scope of this text, the formalism of GS Theory in closer detail. This will be done in the final version of our Thesis, if ever we decide to pursue our research in this direction. If that turns out to be the case, we will often refer to the doctoral Thesis of (Jiménez-López 2000), which contains many persuasive arguments for the application of GS to the study of natural human languages. On the other hand, the Thesis of Jiménez-López is limited by the fact that it mostly proposes to use Grammar System Theory as a framework explaining the final, i.e. « adult », linguistic component, and not as a framework which could elucidate the very process of language development and language acquisition (in terms of Grammar System Theory, it seems more appropriate to speak about « language emergence »). The only tentative to use the Grammar System apparatus for grammatical inference is that of (Sosík and Štỳbnar 1997). Contrary to other authors of GS, who focus principally on the productive (i.e. generative) aspects of GS, Sosík & Štýbnar have focused on GS's language-accepting properties. In a hybrid connectionist-symbolic architecture, they have used a « neural pushdown automaton » to infer a language colony able to cover some simple artificial context-free grammars generating balanced parenthesis or palindrome languages. As far as we know, no tentative is reported in the literature to solve the problem of grammar induction of natural languages by means of evolutionary optimization of Grammar Systems.
5. Thesis
The Thesis hereby introduced is realized under the double supervision of the dpt. of Cybernetics at the Slovak University of Technology (STU) and the « cognitive psychology » laboratory affiliated to University Paris 8 (P8). Ideally, both the « engineering » approach – common at STU – as well as the more cognition-oriented « experimental » approach of P8 should be equally reflected in the final Thesis. In order to do so, the Thesis shall, in fact, introduce multiple « theses », among which some shall address more « theoretical », psychology- and linguistics-related phenomena and problems. But due to its affiliation to STU, the text shall also introduce more concrete, pragmatic and operational theses aiming to offer a computationally and formally sound affirmative answer to the question : « Can language development be modelled as an evolutionary process ? »
5.1. Theoretical Thesis
At first, a child has to learn
• how to segment the world into groups of discrete objects and processes
• how to segment the phonetic flux into sequences of discrete linguistic tokens
The subsequent problem of language development can be analyzed as a trinity of sub-problems :
1) vocabulary development (learning of mappings between objects and tokens)
2) induction of grammatical categories
3) induction of grammatical rules
These tasks are deeply and strongly intertwined. Without the ability to segment the world into objects, there are no stable referents to which linguistic tokens could refer. Without the ability to perceive recurrent tokens, there are no conventional symbols with which a child could denote specific objects. Without vocabulary development (which relates to the induction of semantic classes, which we have called « concept construction » in the text above), there is no need for grammatical rules nor categories. Without grammatical categories, grammatical rules are just a senseless tautological formal game and there is no way to distinguish useful grammars from useless ones.
Without useful grammars, vocabulary development shall halt at some locally optimal level of a « pidgin » language. Left on their own, these problems place before us a variant of a chicken & egg problem which seems almost impossible to tackle. A baby's brain, however, resolves these problems with such elegance that one is tempted to say that they do not even exist. The aim of the Thesis which shall follow is to demonstrate that if one interprets the above-mentioned set of problems in terms of
• parent-child communication (imitation)
• partitioning of vector spaces (categorization)
• gradual accommodation and assimilation of knowledge (generalization)
one could subsequently state that the key theoretical Thesis we aim to defend is :
Tt : the process of language development is an auto-organizing and potentially evolutionary process
Note the word « potentially », because in order to be labeled as « evolutionary », the following conjectures have to be validated :
C1) Not only imitation but also repetition is a form of replication : information replicates not only between brains but also within the brain.
C2) The fitness of a linguistic structure is related to its ability to represent a certain recurrent aspect of the agent's environment : if a cognitive structure matches some aspect of the environment, it gets activated. By being activated, it augments its probability of being (at least partially) replicated.
C3) The problems of both generalization and overgeneralization are to be solved by variation|decay operators endogenously transforming the information represented in the memory of a language-inducing system.
Acceptance of the above-mentioned conjectures leads us to a model of language development based not on the tuning of parameters of a single monolithic grammar, but rather on a population of « microgrammars », a « language colony » (Kelemen and Kelemenová 1992) of mutually communicating, co-operating, decaying and replicating sequences of production rules unceasingly trying to match the language of the linguistic environment. We postulate that if such an environment has certain properties of « motherese », a linguistic competence – an ability to generate utterances in a still more & more complex « toddlerese » – shall spontaneously emerge. Thus, three notions will be of utmost importance in the Thesis which we hereby introduce : « motherese », « microgrammar » and « matching ». The corpus of « motherese », more concretely the CHILDes corpus (MacWhinney 2000), will be considered to be a sufficiently adequate image of the initial stages of the child's linguistic environment. The development of the child's linguistic competence will be explained in terms of the gradual evolution of individual « microgrammars », i.e. chromosomes whose genes can be understood as individual production rules. Last but not least, the notion of « matching » shall furnish us with the first principle which could potentially allow us to explain the mystery of language acquisition as an evolutionary process :
P1 : « If an (internal) rule R or substitutional schema S succeeds in matching some aspect of the (external) environment, then it shall be replicated into another microgrammar. »
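The principle P1 can be prototyped in a few lines. In the following Perl sketch – a toy illustration of our own, with invented rules and utterances – a « microgrammar » is an ordered list of substitution rules, and every rule which matches an utterance of the environment is replicated into another randomly chosen microgrammar :

#!/usr/bin/perl
# Toy prototype of principle P1: a microgrammar is an ordered list of
# substitution rules (genes); every rule which matches an utterance of
# the environment is replicated into another randomly chosen
# microgrammar. Rules and utterances are invented toy examples.
use strict;
use warnings;

my @environment = ('the dog', 'the cat', 'a cat');   # « motherese » tokens
my @population  = (
    [ 'the \w+', 'a \w+' ],      # microgrammar 0
    [ '\w+ cat' ],               # microgrammar 1
    [ 'xyz' ],                   # microgrammar 2: a useless rule
);

for my $utterance (@environment) {
    for my $mg (0 .. $#population) {
        for my $rule (@{ $population[$mg] }) {
            next unless $utterance =~ /^$rule$/;
            # P1: the matching rule replicates into another genome
            my $target = $mg;
            $target = int rand @population while $target == $mg;
            push @{ $population[$target] }, $rule;
        }
    }
}

print "microgrammar $_ : ", join(' | ', @{ $population[$_] }), "\n"
    for 0 .. $#population;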
5.2. Operational Thesis
The operational Thesis (TO) is stated as follows :
TO : « There exists an evolutionary algorithm A which, when confronted with a corpus of motherese language (LM) as its input, can produce a toddlerese grammar (GT) able to generate the LM-resembling toddlerese language LT. »
The term « evolutionary » means that the algorithm A shall involve incremental replication, mutation and selection of information-representing structures. More concretely, these information-representing structures, i.e. genomes, shall be ordered sequences of genes, whereby each gene shall contain an individual substitution rule. Thus, every individual genome shall represent a « microgrammar » aiming to transform a linguistic token (i.e. a sequence of terminals) currently observable in the environment into a sequence of non-terminals. Whenever such a « successful parse » occurs, the principle P1 shall apply and the useful genes shall be reproduced into other individual microgrammars. This could potentially cause the microgrammars to gradually adapt their structures to those of the environment. On the other hand, in order to prevent excessive adaptation, a variation operator shall also be integrated into the algorithm A, aiming to vaguely model the well-known phenomenon of « forgetting ».
5.3. The organization of the Thesis
The Thesis shall be composed of five parts, each of which is composed of multiple major chapters. Every chapter consists of an introduction and a conclusion, respectively preceding and following more specific subchapters which can fractally branch into sub-chapters, sub-sub-chapters etc. All such parts, chapters, sub-chapters etc. can be considered to be « non-terminal » nodes of the structure presented by this text. The first part, labeled « Theses », is the stem of the whole text. It will introduce multiple theses at varying degrees of generality which shall all be – in one way or another – more directly addressed in subsequent sections. In order to weave the basic conceptual fabric, some definitions of terms like « evolution » and « language learning » shall also be offered along the path delimited in Section 1. All variants of the thesis shall be briefly related to other cognitive sciences. The second branch, labeled « Theoretical position », is composed of chapters dedicated to Universal Darwinism, Developmental Psycholinguistics and Natural Language Processing. In these chapters, the theses presented in the first chapter shall be more deeply interpreted and contextualized in terms of the respective disciplines. The third branch, labeled « Observations », will describe multiple longitudinal observations of one concrete human child. In certain cases, the generalizability of such individual observations shall be verified or falsified by means of text-mining the CHILDes corpora. Subsequent interpretations in terms of the evolutionary theoretical framework shall follow. The penultimate branch, called « Simulations », shall present multiple computational models addressing four problems related to the language acquisition process :
1) the problem of segmentation
2) the problem of induction of grammatical categories
3) the problem of induction of grammatical rules
4) the problem of concept induction
A specific chapter, in which existing solutions shall be described, will be dedicated to each problem. Special focus shall be put on evolutionary solutions, if they exist. To every one of the four above-mentioned problems we shall try to offer our own unique evolutionary solution, and subsequently we shall discuss its performance.
PERL source codes of the diverse versions of the algorithm A shall also be attached, in order to allow the reproducibility of our results by other scientists. The conclusive branch, labeled « Synthesis », shall primarily discuss the results obtained in the parts « Observations » and « Simulations ». If the results turn out to be consistent with the theory, the work shall end with a tentative to integrate the theses Tt and TO in one unified framework. If unsuccessful, the potential reasons of the failure shall be analysed.
6. Bibliography
Araujo, Lourdes. 2002. Part-of-speech tagging with evolutionary algorithms. In Computational Linguistics and Intelligent Text Processing, 230–239. Heidelberg, Germany: Springer.
Aycinena, Margaret, Mykel J. Kochenderfer, and David Carl Mulford. 2003. An evolutionary approach to natural language grammar induction. Final Paper, Stanford CS224N, June.
Barrett, Deirdre. 2007. Waistland: A (R)evolutionary View of Our Weight and Fitness Crisis. New York, NY: WW Norton & Company.
Bee, Helen L., and Denise Roberts Boyd. 2003. The developing child. Boston, MA: Allyn & Bacon.
Bentley, Peter. 1999. Evolutionary design by computers. San Francisco, CA: Morgan Kaufmann.
Berman, Ruth A. 1988. Word class distinctions in developing grammars. Categories and processes in language acquisition: 45–72.
Blackmore, Susan. 2000. The meme machine. Oxford, England: Oxford University Press.
Braine, Martin DS. 1971. On two types of models of the internalization of grammars. The ontogenesis of grammar: 153–186.
Brodsky, Peter, H. R. Waterfall, and Shimon Edelman. 2007. Characterizing motherese: On the computational structure of child-directed language. In Proceedings of the 29th Cognitive Science Society Conference, ed. DS McNamara & JG Trafton, 833–838.
Brown, Roger. 1973. A first language: The early stages. Cambridge, MA: Harvard University Press.
Campbell, Donald T. 1960. Blind variation and selective retentions in creative thought as in other knowledge processes. Psychological Review 67: 380.
Choubey, Nitin S., and Madan U. Kharat. 2009. Grammar Induction and Genetic Algorithms – An Overview. Pacific Journal of Science and Technology 10: 884–888.
Christodoulopoulos, Christos, Sharon Goldwater, and Mark Steedman. 2010. Two Decades of Unsupervised POS induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 575–584.
Cohen, Trevor, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics 43: 240–256.
Cosmides, Leda, and John Tooby. 1997. Evolutionary psychology: A primer. Retrieved from http://www.cep.ucsb.edu/primer.html.
Csuhaj-Varjú, Erzsébet. 1994. Grammar systems: a grammatical approach to distribution and cooperation. Yverdon, Switzerland: Gordon and Breach Science Publishers.
Darwin, Charles. 1859. On the Origin of Species. London, England: John Murray.
Darwin, Charles. 1906. The voyage of the Beagle. London, England: JM Dent & Sons.
Dawkins, Richard. 2006. The selfish gene. Oxford, England: Oxford University Press.
Dennett, Daniel C. 1996. Darwin's Dangerous Idea: Evolution and the Meanings of Life. New York, NY: Simon & Schuster.
Dupont, Pierre. 1994. Regular grammatical inference from positive and negative samples by genetic search: the GIG method. In Grammatical Inference and Applications, 236–245. Heidelberg, Germany: Springer.
El Ghali, Adil, Daniel Hromada, and Kaoutar El Ghali. 2012. Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés. JEP-TALN-RECITAL 2012: 77.
Elman, Jeffrey L. 1993. Learning and development in neural networks: The importance of starting small. Cognition 48: 71–99.
Flake, G. W. 1999. The computational beauty of nature. Cambridge, MA: MIT Press.
Fogel, Lawrence J., Alvin J. Owens, and Michael J. Walsh. 1966. Artificial intelligence through simulated evolution. New York, NY: John Wiley & Sons.
Foster, Mary LeCron. 2002. Symbolism: the foundation of culture. Companion Encyclopedia of Anthropology: 366. Routledge.
Furrow, David, Katherine Nelson, and Helen Benedict. 1979. Mothers' speech to children and syntactic development: Some simple relationships. Journal of Child Language 6: 423–442.
Galton, Francis. 1875. English men of science: Their nature and nurture.
Gärdenfors, Peter. 2004. Conceptual spaces: The geometry of thought. Cambridge, MA: MIT Press.
Haeckel, Ernst Heinrich Philipp August. 1879. The evolution of man. Vol. 1. [s.n.].
Haidt, Jonathan. 2012. The righteous mind: Why good people are divided by politics and religion. New York, NY: Random House.
Hamilton, William D. 1963. The evolution of altruistic behavior. The American Naturalist 97: 354–356.
Harris, Margaret. 2013. Language experience and early language development: From input to uptake. Psychology Press.
Harris, Zellig S. 1954. Distributional structure. Word 10: 146–162.
Hebb, Donald Olding. 1964. The Organization of Behavior: A Neuropsychological Theory. John Wiley & Sons.
Hoff-Ginsberg, Erika. 1986. Function and structure in maternal speech: Their relation to the child's development of syntax. Developmental Psychology 22: 155.
Holland, John H. 1975. Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. Ann Arbor, MI: University of Michigan Press.
Hromada, Daniel Devatman. 2012. Variations upon the theme of Evolutionary Language Game. Unpublished manuscript. Slovak University of Technology.
Hromada, Daniel Devatman. 2013a. Geometrizácia ontológií – prípadová štúdia SNOMED. Unpublished manuscript. Slovak University of Technology.
Hromada, Daniel Devatman. 2013b. Random Projection and Geometrization of String Distance Metrics. In Proceedings of the Student Research Workshop associated with RANLP, 79–85. Hissar, Bulgaria.
Hromada, Daniel Devatman. 2014a. Comparative study concerning the role of surface morphological features in the induction of part-of-speech categories. In Proceedings of the TSD 2014 conference. Heidelberg, Germany: Springer.
Hromada, Daniel Devatman. 2014b. Introductory experiments with evolutionary optimization of reflective semantic vector spaces. In TALN-RECITAL-DEFT 2014. Marseille, France.
Hromada, Daniel Devatman. 2014c. Conditions for cognitive plausibility of computational models of category induction. In Proceedings of the 15th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. Heidelberg, Germany: Springer.
Jiménez-López, M. D. 2000. Grammar systems: a formal-language-theoretic framework for linguistics and cultural evolution. PhD dissertation. Tarragona, Spain: Rovira i Virgili University.
Johnson, William B., and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26: 1.
Karypis, George. 2002. CLUTO – a clustering toolkit. DTIC Document.
Kauffman, Stuart. 1996. At home in the universe: The search for the laws of self-organization and complexity. Oxford, England: Oxford University Press.
Kelemen, Jozef. 2004. Miracles, colonies, and emergence. In Formal Languages and Applications, 323–333. Heidelberg, Germany: Springer.
Kelemen, Jozef, and Alica Kelemenová. 1992. A grammar-theoretic treatment of multiagent systems. Cybernetics and Systems 23: 621–633.
Keller, Bill, and Rudi Lutz. 1997. Evolving stochastic context-free grammars from examples using a minimum description length principle. In 1997 Workshop on Automata Induction, Grammatical Inference and Language Acquisition.
Kennedy, James, and Russell C. Eberhart. 2001. Swarm intelligence. San Francisco, CA: Morgan Kaufmann.
Koza, John R. 1992. Genetic programming: on the programming of computers by means of natural selection. Vol. 1. Cambridge, MA: MIT Press.
Küntay, Aylin, and Dan I. Slobin. 1996. Listening to a Turkish mother: Some puzzles for acquisition. Social interaction, social context, and language: Essays in honor of Susan Ervin-Tripp: 265–286.
Kvasnička, Vladimír, and Jiří Pospíchal. 2007. Evolúcia jazyka a univerzální darwinizmus. In Myseľ, inteligencia a život. Bratislava, Slovakia: Slovenská Technická Univerzita.
Lakoff, G. 1990. Women, fire, and dangerous things. Chicago, IL: University of Chicago Press.
Levy, Yonata. 1988. The nature of early language: Evidence from the development of Hebrew morphology. Categories and processes in language acquisition: 73–98. Lawrence Erlbaum Associates.
MacWhinney, Brian. 1987. The competition model. Mechanisms of language acquisition: 249–308.
MacWhinney, Brian. 2000. The CHILDES Project: Tools for Analyzing Talk. Transcription, format and programs. Vol. 1. Lawrence Erlbaum Associates.
Maratsos, Michael. 1988. The acquisition of formal word classes. Categories and processes in language acquisition: 31–44. Lawrence Erlbaum Associates.
Morgan, Thomas Hunt. 1916. A Critique of the Theory of Evolution. Princeton, NJ: Princeton University Press.
Newport, Elissa L. 1990. Maturational constraints on language learning. Cognitive Science 14: 11–28.
Ninio, Anat. 1988. On formal grammatical categories in early child language. Categories and processes in language acquisition. Lawrence Erlbaum Associates.
Nowak, M. A., J. B. Plotkin, and D. C. Krakauer. 1999. The evolutionary language game. Journal of Theoretical Biology 200: 147–162.
Ofria, Charles, and Claus O. Wilke. 2004. Avida: A software platform for research in computational evolutionary biology. Artificial Life 10: 191–229.
O'Neill, Michael, and Conor Ryan. 2003. Grammatical evolution: evolutionary automatic programming in an arbitrary language. Genetic Programming Series, Vol. 4. Heidelberg, Germany: Springer.
Piaget, Jean. 1974. Introduction à l'épistémologie génétique. Paris, France: PUF.
Pohlheim, Hartmut. 1996. GEATbx: Genetic and evolutionary algorithm toolbox for use with MATLAB documentation. Retrieved from http://www.geatbx.com/docu/algindex.html.
Poincaré, Henri. 1908. L'invention mathématique.
Popper, Karl Raimund. 1972. Objective knowledge: An evolutionary approach. Oxford, England: Clarendon Press.
Ray, Thomas S. 1992. Evolution, ecology and optimization of digital organisms. Santa Fe, NM: Santa Fe Institute.
Rechenberg, Ingo. 1973. Evolutionsstrategie – Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart, Germany: Frommann-Holzboog.
Rizzolatti, Giacomo, and Laila Craighero. 2004. The Mirror-Neuron System. Annual Review of Neuroscience 27: 169–192.
Rosch, Eleanor. 1999. Principles of categorization. Concepts: core readings: 189–206. Cambridge, MA: MIT Press.
Sahlgren, Magnus. 2005. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE. Vol. 5.
Sekaj, Ivan. 2005. Evolučné výpočty a ich využitie v praxi. Bratislava, Slovakia: Iris.
Shi, Rushen, Janet F. Werker, and James L. Morgan. 1999. Newborn infants' sensitivity to perceptual cues to lexical and grammatical words. Cognition 72: B11–B21. doi:10.1016/S0010-0277(99)00047-5.
Simonton, Dean Keith. 1999. Creativity as blind variation and selective retention: Is the creative process Darwinian? Psychological Inquiry 10: 309–328.
Smith, Tony C., and Ian H. Witten. 1995. A genetic algorithm for the induction of natural language grammars. In Proceedings of the IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing, 17–24.
Solan, Z., D. Horn, E. Ruppin, and S. Edelman. 2005. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences 102: 11629.
Sosík, Petr, and Leoš Štýbnar. 1997. Grammatical inference of colonies. In New Trends in Formal Languages, 236–246. Heidelberg, Germany: Springer.
Spencer, Herbert. 1894. Education: Intellectual, moral, and physical. C. W. Bardeen.
Tomita, Masaru. 1982. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, 105–108.
Trivers, Robert L. 1972. Parental investment and sexual selection. In Sexual Selection and the Descent of Man, 1871–1971, ed. B. Campbell, 136–179. Chicago, IL: Aldine.
Turing, A. M. 2008. Computing machinery and intelligence. Parsing the Turing Test: 23–65.
Vapnik, V., S. E. Golowich, and A. Smola. 1997. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 9.
Wilson, Edward O. 1978. What is sociobiology? Society 15: 10–14.
Wittgenstein, L. 2009. Philosophical investigations. Wiley-Blackwell.
Wolff, J. Gerard. 1988. Learning syntax and meanings through optimization and distributional analysis. Categories and processes in language acquisition 1.
Wright, Sewall. 1932. The roles of mutation, inbreeding, crossbreeding and selection in evolution. In Proceedings of the Sixth International Congress on Genetics, 1: 356–366.

Comparative study concerning the role of surface morphological features in the induction of part-of-speech categories

Daniel Devatman Hromada 1,2

1 Université Paris 8, Laboratoire Cognition Humaine et Artificielle, 2, rue de la Liberté, 93526 St Denis Cedex 02, France
2 Slovak University of Technology, Faculty of Electrical Engineering and Information Technology, Department of Robotics and Cybernetics, Ilkovičova 3, 812 19 Bratislava, Slovakia

Abstract. Being based predominantly on the English language, existing systems of part-of-speech induction prioritize the contextual and distributional features “external” to the word and attribute somewhat secondary importance to features derived from the word's “internal” morphologic and orthotactic regularities. Here we present some preliminary empirical results supporting the statement that simple “internal” features derived from frequencies of occurrences of character n-grams can substantially increase the V-measure of POS categories obtained by repeated bisection k-way clustering of tokens contained in the Multext-East corpora.
The obtained data indicate that the information contained in suffix features can furnish c(l)ues strong enough to outperform some much more complex probabilistic or HMM-based POS induction models, and that this can especially be the case for Western Slavic languages.

Keywords: part-of-speech induction, development of morphology, clustering, surface features, suffix

1 Introduction

Part-of-speech (POS) induction is a constructivist process aiming to converge to a mechanism able to attribute the POS category (e.g. “verb”, “noun”, “adjective” etc.) membership information to any word of the language under study. Because “syntactic category information is part of the basic knowledge about language that children must learn before they can acquire more complicated structures” [15], POS induction (POS-i) is often considered to be the first step in the more complex process of grammar induction and language acquisition in general. Given such an important place of POS-i in NLP studies, it is no surprise that while the first computational models of POS-i were proposed decades ago [3][6][15], the problem of unsupervised POS-label attribution still attracts the attention of many computational linguists. Thus, dozens of POS-i systems exist, among which those based on class-based word n-grams [5], graph clustering [2] or diverse extensions of Hidden Markov Models [9][8][1] are compared in the comparative study [4], which suggests that “some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches”.

The aims of this article are 1) to elucidate the superior performance of the models of Clark [5] and Berg-Kirkpatrick [1] by the statement “their models perform better because they use better features”, and 2) to make precise that, for many languages, such features can be morphological ones. We specify that what shall be called a “morphological feature” (MF) in the rest of this article is any feature “internal” to the word WITHIN which it occurs, and as such can be opposed to contextual or distributional features “external” to the word under study (i.e. opposed to features which describe the word's relation to other words and not its internal composition). By focusing upon the role of such “orthotactic” MFs in the diverse languages represented in the Multext-East corpus [7], we shall try to persuade the reader that while the “syntax-in-word-order paradigm” could (and did) yield useful models and tools for the description of the English language, the uncritical acceptance of such a paradigm could turn out to be somewhat counterproductive if one intends to develop POS-i models for highly flectional, morphology-rich languages.

2 Corpus

All analyses were performed with texts contained in the 4th version of the Multext-East corpus [7]. The Bulgarian (bg), Czech (cs), English (en), Estonian (et), Farsi (fa), Hungarian (hu), Polish (pl), Romanian (ro), Serbian (sr), Slovak (sk) and Slovene (sl) translations of Orwell's 1984 were analysed. Quantitative descriptions of the different corpora are presented in Table 1.

    Corpus   Types   Tokens   POS tags
    bg       17305   117238   13
    cs       22341   100368   13
    en       11160   134832   12
    et       18911   111305   12
    fa       13009   124823   12
    hu       20642   132196   13
    pl       24019   115185   14
    ro       16220   135055   15
    sk       23015   103452   13
    sl       20597   112278   13
    sr       21540   126611   13

    Table 1: Quantitative description of the analysed corpora.

3 Method

Every word from the corpus was described by a vector of features whose values were obtained by the application of the feature filters described below. Vectors were subsequently clustered into groups.
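Before the individual feature filters are specified in Section 3.1 below, a minimal Perl sketch may make the “internal” filters concrete; the subroutine names are ours, and the fragment only approximates, under stated assumptions, the kind of extraction performed.

    #!/usr/bin/perl
    # Sketch of the N_X feature filter (character n-gram frequencies inside a
    # boundary-marked, lower-cased token) and of the S_X suffix filter.
    use strict;
    use warnings;

    sub ngram_features {
        my ($token, $x) = @_;
        my $marked = '^' . lc($token) . '$';   # mark word boundaries
        my %count;
        $count{ substr($marked, $_, $x) }++ for 0 .. length($marked) - $x;
        return \%count;                        # sparse vector: n-gram => frequency
    }

    # suffix filter S_X: the token's final X characters
    sub suffix_feature {
        my ($token, $x) = @_;
        my $t = lc $token;
        return length $t >= $x ? substr($t, -$x) : $t;
    }

    my $n2 = ngram_features('Good', 2);
    print "$_ => $n2->{$_}\n" for sort keys %$n2;   # ^g, go, oo, od, d$ once each
    print "S2 of 'vlakom' => ", suffix_feature('vlakom', 2), "\n";   # "om"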
3.1 Feature extraction

All tokens, punctuation marks included, were extracted as such from the corpus. Word characters were transcribed into lower case. In order to mark the word boundaries, ^ and $ characters were prefixed, respectively suffixed, to the extracted tokens. The following features were then extracted from the tokens:

Length [L] – yields only one feature whose value equals the character length of the token, i.e. 6 for the word “^good$”. Serves as the baseline.

Character n-grams of length X [N_X] – every feature encodes the number of occurrences of a character n-gram of length X within the token. Thus, if X=1, the word “^good$” can be encoded by the vector of features [1, 1, 2, 1, 1], whose second element denotes the number of “g” present in the word, the third feature the number of “o”, etc. If X=2, the vector could be [1, 1, 1, 1, 1], its first element representing the frequency of occurrence of the “^g” character bigram, the second of the “go” bigram, the third of the “oo” bigram, etc.

Character fragments shorter than a certain limit [F_X] – features encode occurrences of all character fragments of length up to X; the filter [A] takes all character fragments into account. Prefix and suffix filters [P_X], [S_X] encode the token's initial, respectively final, X characters; circumference filters [C_X] combine both extremities of the token, while their “negation”, the root filters [R_X], encode what remains of the token once its circumference is stripped. Finally, the “external” distributional filter [O1] encodes frequencies of occurrence of feature-words immediately to the left or to the right of the target token.

3.2 Clustering and evaluation

The extracted feature vectors were subsequently clustered by the repeated bisection k-way clustering algorithm implemented in the CLUTO toolkit [12]. The correspondence between the induced clusters and the gold-standard POS tags was quantified by the V-measure [13], i.e. the harmonic mean of the homogeneity h and the completeness c = 1 − H(K|C)/H(K) of the clustering; we consider the V-measure to be a very valuable quantitative measure of the performance of clustering POS-i algorithms.

4 Results

    Corpus   L     N1    N2     N3     N4    F2    F3     F4     A
    bg       4.3   5.6   13.1   17.0*  11.9  8.5   14.4   14.7   14.6
    cs       5.4   9.2   25.2*  20.7   11.6  23.1  24.8   23.9   24.3
    en       3.8   6.5   14.1   15.3   9.4   10.4  14.9   16.1*  14.7
    et       4.2   4.0   12.2   14.2*  11.9  5.8   6.92   9.38   7.24
    fa       2.6   6.8   15.4   15.52  12.2  12.0  15.51  15.3   15.55*
    hu       2.3   4.3   6.1    10.7*  9.4   5.2   6.26   6.58   5.65
    pl       4.7   8.0   21.1*  20.1   13.7  18.5  20.3   19.7   15.6
    ro       4.6   7.1   11.1   13.6*  9.5   8.23  11.3   11.8   10.9
    sr       5.2   5.5   13.3   14.8*  10.5  5.67  8.06   8.82   5.95
    sk       5.9   11.2  26.9*  21.0   14.0  23.8  24.9   24.2   22.5
    sl       4.5   4.8   12.2   17.1*  12.8  7.39  8.42   14.3   7.5

    Table 2: V-measure × 100 obtained after clustering each corpus according to the length (L) and character-fragment (N, F, A) features. The most performant feature of every corpus is marked with an asterisk.

Table 2 shows the V-measure × 100 values obtained by clustering of words characterized by length (L), character n-gram fragments of fixed length (N2, N3, N4) or n-gram fragments shorter than a certain length (F2, F3, F4), as well as of clusters created by considering all fragments (A). The best results (i.e. the highest V-measures) were observed in the case of Western Slavic languages, which all attained more than 0.2 of V-measure performance when clustered according to features representing character bigram occurrences. Southern Slavic languages, along with Romanian, Hungarian and Estonian, performed best when character trigrams were taken into account. English attained its 0.16 performance when all bigrammata, trigrammata and tetragrammata were taken into account, while Farsi was clustered best when all n-gram character fragments were taken into account.

Further results, presented in Table 3, point in the same direction. The highest V-measure scores were attained by Slovak, Czech and Polish when the simple extractor of suffix features of length 2 was applied.

    Corpus   P2    P3     S2     S3     C2   C3   R2   R3   O1
    bg       6.7   5.0    18.9*  16.5   3.8  2.3  3.4  3.0  12.5
    cs       7.4   7.1    25.2*  18.7   4.7  3.1  3.7  3.4  7.9
    en       3.9   3.6    20.5*  19.7   2.4  1.7  2.9  2.2  14.4
    et       4.2   6.0    14.2   16.1*  3.6  2.8  3.4  3.3  6.77
    fa       11.7  14.5*  14.4   12.0   6.4  4.6  2.8  3.2  14.3
    hu       5.4   5.7    17.1*  14.2   3.0  1.8  2.4  2.0  7.1
    pl       5.3   6.5    25.1*  22.7   4.0  3.0  3.3  2.9  7.9
    ro       6.5   5.9    15.8*  14.8   3.1  1.9  2.5  2.4  15.6
    sr       6.1   6.4    19.1*  16.5   4.6  3.0  4.7  3.5  9.4
    sk       8.2   5.8    27.5*  21.3   4.8  3.5  3.6  3.5  8.7
    sl       6.8   6.0    21.6*  19.3   5.2  2.4  3.3  3.4  9.1

    Table 3: V-measure × 100 obtained after clustering each corpus according to the prefix (P), suffix (S), circumference (C), root (R) and distributional (O1) features. The most performant feature of every corpus is marked with an asterisk.
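Since all scores reported in Tables 2 and 3 are V-measures, the following toy Perl computation – with invented gold tags and cluster assignments for six tokens – may make the measure concrete; it sketches only the evaluation step, not the CLUTO-based clustering pipeline itself.

    #!/usr/bin/perl
    # Toy V-measure: harmonic mean of homogeneity h and completeness c.
    use strict;
    use warnings;

    my @class   = qw(N N N V V V);   # gold POS tags of six tokens (invented)
    my @cluster = qw(0 0 1 1 1 1);   # induced cluster of each token (invented)

    my (%nc, %nk, %nck);
    for my $i (0 .. $#class) {
        $nc{ $class[$i] }++;
        $nk{ $cluster[$i] }++;
        $nck{"$class[$i],$cluster[$i]"}++;
    }
    my $N = scalar @class;

    sub entropy {                    # entropy of a marginal count distribution
        my $h = 0;
        for my $n (@_) { my $p = $n / $N; $h -= $p * log($p) if $p > 0 }
        return $h;
    }
    sub cond_entropy {               # conditional entropy from joint counts
        my ($marg, $pos) = @_;       # $pos selects class (0) or cluster (1)
        my $h = 0;
        for my $key (keys %nck) {
            my $y = ( split /,/, $key )[$pos];
            $h += $nck{$key} / $N * log( $marg->{$y} / $nck{$key} );
        }
        return $h;
    }
    my $h = 1 - cond_entropy(\%nk, 1) / entropy(values %nc);  # 1 - H(C|K)/H(C)
    my $c = 1 - cond_entropy(\%nc, 0) / entropy(values %nk);  # 1 - H(K|C)/H(K)
    printf "h=%.3f c=%.3f V=%.3f\n", $h, $c, 2 * $h * $c / ($h + $c);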
In fact, the same extractor yielded the highest scores in the case of all languages, with the exception of Estonian, where somewhat longer suffixes tend to facilitate POS-i, and of Farsi, whereby prefixal features seem to be at least as important as suffixal features. Word circumference features C2 and C3, as well as their “negation”, the word root features R2 and R3, do not seem to bring any information relevant to the categorization process – in fact, they seem to perform even worse than the baseline feature L. Members of the set of “external” distributional features (O1), which represent the trivial frequency of occurrence of the feature-word to the left or right of the target word, performed worse in all cases, English included, than S2.

5 Discussion

The comparative study of POS-i systems [4] indicates that POS-i models involving morphological features perform better than models which do not. However, both in Clark's [5] probabilistic model as well as in the morphology-enriched, HMM-derived model [1], morphological features seem to play the role of a performance-increasing “cherry added to the top of the cake” rather than that of the model's cornerstone. The results presented in this paper suggest that focusing upon the phenomena occurring within the token, if the token's transcription allows it (3), seems to yield quite strong c(l)ues for the subsequent clustering of tokens into their respective syntactic categories. It may be the case that especially the character bigrams occurring at the word's offset position – suffixes – play an important role in word → POS category attribution. It is also worth noting that suffixes augment the performance of POS-i not only for Indo-European languages but also for Uralic languages like Estonian or Hungarian.

(3) For example, an “internal” feature-oriented approach would hardly yield any interesting results if applied on Chinese logograms, but could be of certain theoretic interest when applied upon a pinyin transcription.

It is also worth reiterating that POS-i within Western Slavic languages tends to be much more sensitive to character n-gram and suffix-derived features than within the other languages compared in this study. Because the research presented hereby was based on only one particular literary corpus (Orwell's 1984), and the results obtained may thus represent not the properties of the languages as such but rather a certain translation style, it would be somewhat hors propos to postulate that a kind of overall statistical property – labeled hereby as “word offset flectivity” – is more marked in Western Slavic languages than, for example, in Southern Slavic or Uralic languages. But given the fact that it was only Slovak, Czech and Polish that attained V > 0.25 when clustered according to the outputs of the S2 feature-extracting prism, we believe that subsequent analyses involving more corpora and more languages may be worth the effort. Verily, only more exhaustive comparative studies could assess the impact of the morphology of a word X upon the attribution of a syntactic function to the very word X. And since syntax is often bound with semantics – for example by means of thematic relations – such studies, if ever they verify and do not falsify the results presented hereby, could possibly result in a partial revision of the canonical “signifiant is independent from signifié” paradigm [14]. To emit such a call was, however, not a motivation behind the redaction of this paper.
Nor had we aimed to outperform existing distributional & probabilistic models – for it may seem quite improbable that one would outperform the “heavy Markovian artillery” with such simple computational machinery as k-way clustering. Thus, it came as a certain surprise to us that the comparison of the data presented in Figure 4 of [4] with our results indicated that, for some Slavic corpora, our simplistic, morphology-driven, geometrically-clustered model attained V-measure scores higher than, or more or less equal to, those of the models presented in [11][9]. Our approach can also dispose of certain advantages when it comes to computational complexity – while some models, like that of [2], sometimes have problems converging to a result in reasonable time, none of the 198 analyses whose results are presented above lasted more than a few seconds on an average desktop computer.

This being said, we believe that it may be the case that POS-i systems of the next generation could not only take into account, but should rather be based on, the word's “internal” morpho(phono)logical or even prosodic and metric features. While sufficient evidence exists for stating that, in order to have a highly performant and robust POS-i model, one MUST take into account the distributional and contextual information “external” to the word under question, we believe that, especially in the case of highly flectional languages, the complexity of the whole POS-i clustering process could be significantly reduced if ever the process were “seeded” (i.e. initiated) with the token's “internal” features. The performance-augmenting and complexity-reducing effects of such seeding being the principal topic of our ongoing work, we conclude that what we believe to be the ultimate advantage of such a model could be its “cognitive plausibility” [10].

Last but not least, by underlining the importance of suffixal features for the POS-induction process, our results may well point in the same direction as the hypothesis that “one of the first operating principles employed in the ontogenesis of grammar [is that] grammatical realizations in the form of suffixes or postpositions will be acquired earlier than realizations in the form of prefixes or prepositions” [16]. Thus, without any intention to do so (4), we ultimately find the results of our purely empirical study to be consistent with more general psycholinguistic theories of grammar induction and language development.

(4) During both the conception and the realization of our study, we were utterly unaware both of Slobin's “operating principle A” and of the amount of scientific evidence already associated with it.

References

1. Berg-Kirkpatrick, Taylor, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. P. 582–590 in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
2. Biemann, Chris. 2006. Unsupervised part-of-speech tagging employing efficient graph clustering. P. 7–12 in Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop.
3. Brown, Peter F., Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. P. 467–479 in Computational Linguistics 18(4).
4. Christodoulopoulos, Christos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? P. 575–584 in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.
5. Clark, Alexander. 2003. Combining distributional and morphological information for part of speech induction. P. 59–66 in Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, Volume 1.
6. Elman, Jeffrey L. 1989. Representation and structure in connectionist models.
7. Erjavec, Tomaž. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. P. 131–142 in Language Resources and Evaluation 46(1).
8. Goldwater, Sharon, and Tom Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. P. 744 in Annual Meeting of the Association for Computational Linguistics, vol. 45.
9. Graca, Joao, Kuzman Ganchev, Ben Taskar, and Fernando Pereira. 2009. Posterior vs. parameter sparsity in latent variable models. P. 664–672 in Advances in Neural Information Processing Systems 22.
10. Hromada, Daniel Devatman. 2014. Conditions for cognitive plausibility of computational models of category induction. Accepted for the 15th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2014). Montpellier, France.
11. Johnson, Mark. 2007. Why doesn't EM find good HMM POS-taggers? P. 296–305 in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
12. Karypis, George. 2002. CLUTO – a clustering toolkit.
13. Rosenberg, Andrew, and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. P. 420 in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), vol. 410.
14. de Saussure, Ferdinand. 1922. Cours de linguistique générale. Paris, France: Payot.
15. Schütze, Hinrich. 1993. Part-of-speech induction from scratch. P. 251–258 in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics.
16. Slobin, Dan. 1973. Cognitive prerequisites for the acquisition of grammar. P. 175–208 in Studies of Child Language Development.

Evolutionary modelisation of ontogeny of linguistic structures

Rigorous Thesis Examination

Daniel D. Hromada 1,2

1 Slovak University of Technology, Faculty of Electrical Engineering and Information Technology, Department of Robotics and Cybernetics
2 Université Paris 8, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle

4-11-2014
doc. Ing. Ivan Sekaj, PhD. – prof. Ing. Vladimír Kvasnička, DrSc.

Administrative Position Development

- 2010: enrolled for the PhD at the École Doctorale Cognition, Langage, Interaction of University Paris 8
- 2011: attribution of a double PhD scholarship by the French government; inscription at STU as an external doctoral candidate
- 2012: inter-university convention signed by FEI STU's Dean and the President of Paris 8; start of the scholarship
- 2013: summer semester in Paris; presentation by prof. Tijus in Bratislava
- 2014: summer semester in Paris; end of the scholarship

The thesis is to be written and defended in the English language.

[Slide: An interdisciplinary enterprise – a mindmap localizing « Evolutionary Modeling of L-Structure Ontogeny » among part-of-speech induction, grammar induction, natural language processing, computational linguistics, formal language theory, the language acquisition device, grammar systems, developmental psycholinguistics (motherese and toddlerese, bootstrapping), evolutionary computation (genetic algorithms, populations, fitness functions and landscapes, evolutionary strategies) and universal Darwinism (heredity, variation, selection; stochastic systems; adaptation; trial and error).]
Universal Darwinism

Definition: A general theoretical framework aiming to explain the emergence and optimization of diverse complex phenomena in terms of the interaction of three basic processes:
1. information variation
2. information selection
3. information replication

UD-consistent disciplines:
- biology (Darwin 1859, Mendel 1866) and genetics (Morgan 1916, Watson & Crick 1953)
- sociobiology (Hamilton 1974, Wilson 1978) and evolutionary psychology (Cosmides & Tooby 1997)
- memetics (Dawkins 1976, Blackmore 2000)
- evolutionary epistemology
- neural Darwinism
- evolutionary computation, artificial life, ...
- evolutionary linguistics

Evolutionary Epistemology

An ambiguous definition: Evolutionary Epistemology aims to explain the source, existence, nature, scope and diversity of forms of knowledge in evolutionary terms. Two possible interpretations:
1. biological evolution of cognitive and mental faculties in animals and humans
2. knowledge per se evolves by selection and variation

The second interpretation can be further analyzed:
1. knowledge can emerge by variation & selection of ideas shared by a group of mutually interacting individuals (Popper 1972)
2. knowledge can emerge by variation & selection of cognitive representations within one individual

Genetic Theories of Learning and Creativity

"Genetic" not in the contemporary (i.e. DNA-related) sense, but as related to "origins" (genesis) and "heredity" (genus).

Piaget's Genetic Epistemology:
1. aims to explain how human cognitive systems (CS) develop from birth onwards
2. CS pass through a series of stages; every stage involves an equilibration of cognitive schemas
3. schemas change through processes of assimilation and accommodation

Campbell–Simonton's Theory of Creativity:
1. scientific discovery and creativity can be explained in terms of blind variation and selective retention (Campbell 1960)
2. "How do human beings create variations? One perfectly good Darwinian explanation would be that the variations themselves arise from a cognitive variation-selection process that occurs within the individual brain." (Simonton 1990)

Neural Darwinism

Edelman (1987) postulated that complex adaptations in the brain arise through some process similar to natural selection. Another variant of ND is the theory of Changeux and Dehaene (1989): "the production and storage of mental representations, including their chaining into meaningful propositions and the development of reasoning, can also be interpreted, by analogy, in variation-selection (Darwinian) terms within psychological time-scales." Fernando et al. (2012) propose two "toy models ... of a means by which a higher-order unit of neuronal evolution above the synaptic level may be able to replicate."

Figure: reproduced from (Fernando et al., 2012).

Evolutionary Computation

Definition: "Evolutionary computation uses computational models of evolutionary processes as key elements in the design and implementation of computer-based problem solving systems" (Spears et al., 1993).
The main families of EC are:
- genetic algorithms (c.f. next slide)
- evolutionary programming (stronger genotype–phenotype distinction, FSAs, little recombination)
- evolutionary strategies (involve more recombination, self-adaptation, other nature-inspired approaches)
- genetic programming (does not search for solutions but for programs)
- grammatical evolution: a variant of Genetic Programming which uses evolutionary search to discover specific sequences of application of production rules which generate program code yielding the wished-for solutions
- swarm intelligence (Kennedy & Eberhart, 2001)
- artificial life (no exogenous fitness function: Tierra, AVIDA, etc.)

Genetic algorithms

Canonic GA (Holland, 1975):
- Encoding: binary vector
- Initial population: randomly generated
- Selection: fitness-proportionate (p_i = f_i / Σ_{j=1..N} f_j)
- Crossover: one-point
- Mutation: bit-flip with probability p (0.001)

    rand init
    evaluate
    select
    repeat
        crossover
        mutation
        evaluate
        select
    until stop

Schema theorem: A schema is a subset of strings with similarities at certain positions. The schema theorem states that short, low-order (i.e. with few fixed positions) schemata with above-average fitness increase exponentially in successive generations:

    E(m(H, t+1)) ≥ m(H, t) · (f(H) / a_t) · [1 − p]

where m(H, t) is the number of strings belonging to schema H at generation t, f(H) is the observed average fitness of schema H, a_t is the observed average fitness at generation t, and p is the probability that crossover or mutation will disrupt H.

Convergence to the global optimum: Rudolph (1994) has proven that CGAs are certain to converge to the global optimum only if they "keep track of the best solution found over time" (i.e. involve a form of elitism).

Evolutionary Language Game (Nowak et al., 1999)

Let us have a population of N agents. Each agent is described by an r × c associative matrix A. A's entry a_ij specifies how often an individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j. From this associative matrix A, one can derive:

- the active "speaker" matrix S by normalizing A's rows: s_ij = a_ij / Σ_{n=1..c} a_in
- the passive "hearer" matrix H by normalizing A's columns: h_ij = a_ij / Σ_{n=1..r} a_nj

Subsequently, we can imagine two individuals A and A', the first one having the language L(H, S), the other having the language L'(H', S'). The payoff related to the communication of the two individuals is calculated as follows:

    F(A, A') = Σ_{i=1..r} Σ_{j=1..c} s_ij · h'_ji = Tr(SH')

And the fitness of the individual A with regard to all other members of the population P can be obtained as follows:

    f(A) = (1 / (|P| − 1)) · Σ_{A' ∈ P, A' ≠ A} F(A, A')

By implementing EC, these fitness values can subsequently direct the evolution of the population toward states where the individual matrices are more optimally "aligned". In the ELG, this alignment represents the situation when hearer and speaker mutually understand each other, i.e. the speaker has encoded meaning M by sound S and the hearer has subsequently decoded sound S as meaning M.

Evolutionary Language Game #2

- parent→child information transfer is modelled by a matrix sampling procedure
- the parameter k specifies the quantity of repetition during matrix sampling
- all experiments with N=100, the agents having "memones" of size 5×5 (i.e. their associative matrices could encode at most 5 "sounds" and 5 "meanings")
- convergence to the globally optimal state is assured only if the matrix sampling involves a small but nonzero amount of noise!
- a beautiful model of how a "language" is sure to arise ex nihilo in communities wherein information transfer between individuals exists
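A hedged Perl illustration of the payoff computation may be helpful; the two 2×2 associative matrices below are invented, and the sketch computes only the one-directional payoff F(A, A') = Tr(SH') defined above.

    #!/usr/bin/perl
    # Toy computation of the ELG payoff F(A,A') = Tr(S H').
    # Rows = objects (meanings), columns = signals; matrix contents invented.
    use strict;
    use warnings;

    my @A1 = ( [4, 1], [0, 5] );
    my @A2 = ( [3, 2], [1, 4] );

    sub speaker {                  # s_ij = a_ij / sum_n a_in (row-normalized)
        my @S;
        for my $row (@_) {
            my $sum = 0;
            $sum += $_ for @$row;
            push @S, [ map { $sum ? $_ / $sum : 0 } @$row ];
        }
        return @S;
    }
    sub hearer {                   # h_ij = a_ij / sum_n a_nj (column-normalized)
        my @A = @_;
        my @H;
        for my $j (0 .. $#{ $A[0] }) {
            my $sum = 0;
            $sum += $A[$_][$j] for 0 .. $#A;
            for my $i (0 .. $#A) { $H[$i][$j] = $sum ? $A[$i][$j] / $sum : 0 }
        }
        return @H;
    }
    sub payoff {                   # F = sum_i sum_j s_ij * h'_ji
        my ($S, $Hp) = @_;
        my $f = 0;
        for my $i (0 .. $#$S) {
            for my $j (0 .. $#{ $S->[0] }) { $f += $S->[$i][$j] * $Hp->[$j][$i] }
        }
        return $f;
    }

    my @S1 = speaker(@A1);
    my @H2 = hearer(@A2);
    printf "F(A1,A2) = %.3f\n", payoff(\@S1, \@H2);

In the full game, the payoff is of course evaluated over both speaking directions and averaged over the whole population in order to obtain f(A).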
Nowak et al. (1999) and Kvasnička & Pospíchal (2007) use it to illuminate the emergence of language in the phylogeny of the species Homo sapiens sapiens – but couldn't an analogous approach be used to model the transfer from mother to child in ontogeny?

Evolutionary Linguistics

Definition: the scientific study of both the origins and development of language, as well as of the cultural evolution of languages.

- Schleicher's (1853) language tree theory (Stammbaumtheorie)
- lack of fossil records, difficult to verify empirically; banned by the Société de Linguistique de Paris in 1866
- revived at the end of the 20th century (c.f. Pinker & Bloom, 2011)
- quantitative comparative linguistics, phylogenetic trees...
- focuses on phylogeny and not on ontogeny

Why EL should focus on ontogeny: "We are not very well informed about the psychology of Neanderthal man or about the psychology of Homo siniensis of Teilhard de Chardin. Since this field of biogenesis is not available to us, we shall do as biologists do and turn to ontogenesis. Nothing could be more accessible to study than the ontogenesis of these notions. There are children all around us." (Piaget, 1975)

Formal Language Theory

An alphabet A is a finite, nonempty set of symbols. A word (or a string) over an alphabet A is a finite sequence of symbols from A. A* is the set of all words over A. A language L over A is a subset of A*.

A grammar G is a quadruple (N, T, P, S), where N is the nonterminal alphabet, T is the terminal alphabet, S ∈ N is the axiom, and P is the set of rewriting (production, substitution) rules, written as x → y. Grammars are called:

- REGULAR when all rules in P have the form X → α or X → αY, with α ∈ T and X, Y ∈ N
- CONTEXT-FREE when all rules have the form X → x, where X ∈ N and x is a string over A_G = N ∪ T
- CONTEXT-SENSITIVE when P contains only rules of the form x1 X x2 → x1 w x2, with x1, x2, w being strings over A_G and X ∈ N

The language L GENERATED by the grammar G is the set of all sequences of terminals which can be derived from the axiom S by recursive application of the rules in P. A language L can be PARSED by the grammar G if, for every sequence s ∈ L, there exists at least one sequence of applications of production rules which, when applied in an inverse fashion (i.e. substituting the left side of the production rule for the right side), ends at the axiom S.

Grammar Systems

Introduction: A Grammar System is a set of grammars working together, according to a specified protocol, to generate a language.

- a syntactic theory of multi-agent, distributed and parallel systems
- multiple independent grammars share their productions in a "string environment" (analogous to AI "blackboard" approaches)
- the environment can change on its own (so-called "eco-grammar" systems) or not (language colonies)

Figure: Reproduced from Kelemen's (2004) article "Miracles, colonies, and emergence".

Natural Language Processing

- uses computers to process human languages
- implements AI, data-mining, information retrieval and machine learning methods (both supervised and unsupervised)
- the first and ultimate NLP challenge was posed by Turing (1950)
- other problems: anaphora resolution, automatic summarization, discourse analysis, machine translation, morphological segmentation, named entity recognition, natural language understanding, POS-induction and tagging, parsing, question answering, sentiment analysis, speech recognition, word sense disambiguation etc.
- in NLP, statistics often plays a more important role than FLT
- in NLP, methods based on aNNs, Naive Bayes or SVMs are predominant; EC is much less used

POS-induction and Grammar induction

Part-of-speech induction: the goal is to group the tokens present in a pure-text corpus C into clusters grouping members of diverse parts-of-speech (nouns, verbs, adjectives, etc.).

Grammar induction: the goal is to infer, from a pure-text corpus C, a grammar G which could have generated the corpus C.

POS-i and GI problems are strongly intertwined. Clusters discovered by POS-i can be denoted by non-terminal symbols.

C: John loves Mary. Mary hates John. Mary sleeps. John weeps.

- ideal grammar: N → John | Mary ; V → love | hate | sleep | weep ; S → N V s | N V s N (a generative sketch of this grammar is given after this slide group)
- least general grammar: S → C
- most general grammar: S → A*

Learning of semantic categories

How can machines work with semantic categories? Semantic categories (i.e. concepts) can be characterized:
- by an extensional (listing the instances) or ostensive (pointing a finger) definition
- in terms of sufficient and necessary features
- as (convex) subspaces of an N-dimensional semantic feature space (Gärdenfors 2004)
- as prototypes (points) within such spaces

Principle(s) behind the construction of semantic spaces:
- in neurosciences: "neurons that fire together, wire together" (Hebb 1964)
- in linguistics: "a word is characterized by the company it keeps" (Harris 1954)
- in philosophy: "the meaning of a word is its use in the language" (Wittgenstein 1953)

Conjecture: The development of vocabulary in human children is a variant of the multi-class classification problem, and as such can be simulated by an algorithm creating and partitioning semantic feature vector spaces.

Developmental Psycholinguistics

Developmental Psycholinguistics (DP) is a scientific discipline studying the changes occurring in the human faculty of understanding and production of natural languages. As such, it is closely related to developmental psychology (a sub-field of psychology) and to developmental linguistics (a sub-field of linguistics).

Language Development (DEF): Language development (LD) – or the ontogeny of a natural language L in a human individual H – is a constructivist process gradually transforming L into an ever more optimized communication channel facilitating the exchange of information between H and her social surroundings.

- language is social and pragmatic (it allows children to manipulate the objective world)
- comprehension precedes production: C-representations offer preliminary targets for P-productions
- the physiological predispositions of language are innate, but useless without triggering epigenetic stimuli
- children are not "ideal learners" (in the sense of Gold's theorem)
- brains simultaneously encode multiple language registers and grammars

Motherese

- parents modify their language in order to make themselves understood
- higher pitch (267 Hz in comparison to 198 Hz), slower tempo, greater rhythmicity, longer pauses between utterances
- "much of the speech addressed to babies consists of short, routine, repetitive utterances produced with great consistency and frequency in the same contexts, day after day" (Clark 2003)
- repetitions are three times more frequent in speech to two-year-olds than in speech to ten-year-olds

Figure: Reproduced from Trevarthen (1993).
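Returning to the POS-induction and grammar-induction slide above, the following hedged Perl sketch simply enumerates the language generated by the "ideal grammar"; the enumeration and its comments are our illustration, not part of the original slides.

    #!/usr/bin/perl
    # Enumerates the toy language generated by the "ideal grammar":
    # N -> John|Mary, V -> love|hate|sleep|weep, S -> N V s | N V s N.
    use strict;
    use warnings;

    my @N = qw(John Mary);
    my @V = qw(love hate sleep weep);

    for my $subj (@N) {
        for my $v (@V) {
            print "$subj ${v}s.\n";               # S -> N V s
            print "$subj ${v}s $_.\n" for @N;     # S -> N V s N
        }
    }
    # Besides the four corpus sentences, the grammar overgeneralizes: it also
    # produces e.g. "Mary sleeps John." - a reminder that an induced grammar
    # may generate utterances never observed in the environment.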
Toddlerese

- the baby's expressive faculties start with a 1-bit communication channel (need soothing / don't need soothing)
- they get more subtle and fine-grained with time: more information is transmitted with less signal
- babbling starts circa at 8 months of age, first as a repetition of the same syllables (mamamama, babababa); later, the syllables start to vary within the sequence (babadadabebe)
- around 1 year: consistent vocalizations in specific contexts (protowords)
- children tend to be quite accurate in their first productions, but later versions of the same words appear to be further from the adult targets
- "continuous exploration, experimentation, practice and intense involvement with linguistic structure" (Labov, 1978)
- upon closer inspection, LD reveals a constantly changing series of small experiments whereby the child progressively scrutinizes and tries out different options (Clark 2003)
- there is a lot of variability in children's word forms (Ferguson and Farwell 1975)
- first grammars form around "pivot" words, e.g.: "mama auch, tato auch, nana auch, baba auch" (S → N auch ; N → mama | tato | nana | baba)
- toddlerese: 10–30 months

Quantitative laws of language acquisition

Piotrowski law: In both linguistic phylogeny as well as ontogeny (e.g. sentence length, vocabulary size), development follows the logistic equation p(t) = c / (1 + a·e^(−bt)). Note that in ecology, the same equation is considered to yield the law of population growth (Lotka, 1920).

Figure: reproduced from (Baixeries et al., 2013).

Zipf's law: Zipf (1949) showed that if the most frequent word in a text is assigned rank 1, the second most frequent word is assigned rank 2, etc., then the frequency f(r) of a word of rank r obeys f(r) ≈ r^(−α) (i.e. follows a power-law distribution). Recent analyses (Baixeries et al., 2013) of the CHILDES corpus indicate that the exponent α depends on age, and is much higher and decreases faster in small children.

Common aspects of both LD and evolution

Axiomatic:
1. Convergence: different trajectories, same result
2. Variation: children PLAY, children forget
3. Non-monotonicity: locally "correct" behaviours are lost

Hypothetic:
1. Adaptation: gradual convergence of LT towards LM, and possibly of GT towards GM
2. Replication: both vertical (repetition) and horizontal (non-local storage)
3. Parallel coexistence of schemas
4. Selection: correct behaviours are rewarded

Subject and Method

Subject: my own daughter, 0–30 months (0–2;6).

Method: a phenomenological method based principally on amazed observations. Long-term journal. Little or no experimental (artificial) interaction beyond natural and normal scenarios.

Cognitive Crossover Cases

Case 1 – Banan: "banan" (banana) was called "baja" at (1;6) and "anan" at (1;10). At (1;11) the following interaction took place:
F: banan ; C: anan
F: banan ; C: anan
F: baja ; C: bajan
F: bajan ; C: banan

Case 2 – Olol: a very intensive "Krtko & Orol" (Mole & Eagle) period between 1;10–1;11; the word "OLOL" was used with high frequency on a regular basis. During one pre-sleep monologue, the subject said "KOLOL" when enumerating the names of her friends from the crèche, one among them being named Nikol.

Bilingual crossovers: oči+Augen=oge (eyes), opica+Affe=api (monkey), voda+Wasser=vava (water), etc.

Quantitative observations

Corpus: CHILDES – Child Language Data Exchange System (MacWhinney and Snow 1984):
1. more than 130 corpora of transcribed child verbal interactions
2. more than 20 languages

Variation operators whose impact shall be analyzed:
1. Substitutions – papija → babija → mamija
2. Reduplications – hau-hau
3. Omissions – vlak → ak (vlak = train)

Method: matching with Perl-compatible regular expressions (Hromada 2011). Reduplication, for example, can easily be detected with the regexp (\w{2,})\1, as the sketch below illustrates.
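A hedged Perl sketch of such detectors follows; the reduplication pattern mirrors the regexp above, while the omission and substitution heuristics are our own simplifications rather than the procedure of (Hromada 2011).

    #!/usr/bin/perl
    # Regexp-based detection of the three variation operators in child word
    # forms; example forms are taken from the observations above.
    use strict;
    use warnings;

    # reduplication: a syllable-sized substring immediately repeated
    print "reduplication in 'hauhau'\n" if 'hauhau' =~ /(\w{2,})\1/;

    # omission: the child form is obtainable by deleting characters
    # from the target form
    sub is_omission {
        my ($target, $child) = @_;
        my $pattern = join '.*', map { quotemeta } split //, $child;
        return length($child) < length($target) && $target =~ /$pattern/;
    }
    print "'ak' is an omission of 'vlak'\n" if is_omission('vlak', 'ak');

    # substitution: count differing slots between two equally long forms
    sub substitutions {
        my ($a, $b) = @_;
        return -1 if length($a) != length($b);
        my $d = 0;
        for my $i (0 .. length($a) - 1) {
            $d++ if substr($a, $i, 1) ne substr($b, $i, 1);
        }
        return $d;
    }
    printf "papija -> babija : %d substituted position(s)\n",
        substitutions('papija', 'babija');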
Note that strings can evolve by substituting substrings for other strings, and that the substitution rule itself is also a string.

Grammar Induction

- inducing not one monolithic grammar, but populations of individual grammars
- the fitness function promotes individuals which 1) match patterns present in the environment, 2) generate utterances which shall be "accepted" by the environment, and 3) minimize the number of utterances which shall not be accepted by the environment
- an individual grammar is encoded as an ordered sequence of production rules

Corpus: #Mutter# #Vater#

    Grammar 1            Grammar 2
    Vat → A              er# → B
    #A → B               #Va → A
    er# → A              #Mu → A
    #Mu → A              tB → B
    tA → A               er# → B
    axioms: AA, BA       axioms: AB

Concept Construction

Attaching meanings to words is interpreted as the supervised learning of a multiclass classifier. In the most recent experiments, I cross over four ideas in order to create it:

- RANDOM PROJECTION – exploiting the Johnson–Lindenstrauss lemma to project the problem into a D-dimensional space
- BINARIZATION – transposition of the problem from real-valued spaces to binary (Hamming) spaces (Hromada 2014)
- THEORY OF PROTOTYPES – every category C can be characterized by a prototype P_C which is as close as possible to the members of C and as far as possible from the members of other categories (Rosch 1973)
- EVOLUTIONARY COMPUTATION – searches for such a set of K prototypes P_1, ..., P_K which maximizes the "prototype fitness function":

    F(I) = Σ_{i=1..N} Σ_{C=1..K} s(i, C) · H(P_C, i)

with s(i, C) = +1 if L_i ≠ C (P_C then being a "bad" prototype P_B), and s(i, C) = −1 if L_i = C (P_C being the "good" prototype P_G), where H(P_G, i) is the Hamming distance between the binary vector denoting the prototype P_G and the document i contained in the training document set of cardinality N, and L_i denotes the label of document i. Every individual is a binary vector obtained by the concatenation of the vectors of all K prototypes, |I| = D · K.

Concept Construction – Preliminary Results

- trained on the training part and evaluated on the testing part of the 20newsgroups corpus (K=20)
- LSB parameters D=128, S=3, I=2
- CGA (N=100, P_M = 0.001, one-point crossover) with 1/8 elitism

Figure: Evolutionary induction of semantic prototypes – training.
Figure: Evaluation of the induced prototypes against the testing set.

The algorithm seems to perform better than the "deep learning" Semantic Hashing method of Salakhutdinov & Hinton (2009).
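A toy Perl evaluation of the prototype fitness may clarify the formula; the dimensionality (D=8, K=2), the documents and the prototypes are all invented, and the real experiments of course evolve the prototypes with the CGA rather than scoring a fixed pair.

    #!/usr/bin/perl
    # Toy evaluation of the prototype fitness F(I) over binary (Hamming)
    # vectors: distances to wrong-category prototypes are rewarded, distances
    # to the right-category prototype are penalized.
    use strict;
    use warnings;

    my @docs = (                     # [binary vector, gold category label]
        [ '11110000', 0 ],
        [ '11010000', 0 ],
        [ '00001111', 1 ],
        [ '00001011', 1 ],
    );
    my @prototypes = ( '11110000', '00001111' );   # one prototype per category

    sub hamming { ( $_[0] ^ $_[1] ) =~ tr/\0//c }  # count differing positions

    sub fitness {
        my @P = @_;
        my $f = 0;
        for my $doc (@docs) {
            my ($vec, $label) = @$doc;
            for my $c (0 .. $#P) {
                $f += ( $label == $c ? -1 : 1 ) * hamming( $P[$c], $vec );
            }
        }
        return $f;
    }
    printf "F = %d\n", fitness(@prototypes);       # higher = better separation

A genuinely evolutionary run would simply treat the concatenation of all prototype vectors as the individual I and let the canonic GA of the preceding slides maximize F.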
Evolutionary Computation algorithms are capable of generalization and can thus be considered a case of Machine Learning.

Thesis

1. At some level of abstraction, the ontogeny of syntactic and semantic categories is a process consistent with the tenets of Universal Darwinism.
2. Representations in the human mind are subjects of variation, selection and replication.
3. In young children this process is not yet completely internalized (Vygotsky 1934) and is thus visible to an external observer.
4. Evolutionary Computation is a means by which this process can be successfully simulated in silico.

Merci. Thank You.

Introduction to Moral Induction Model and its Deployment in Artificial Agents

Daniel Devatman Hromada 1,2 and Ilaria Gaudiello
hromi at giver.eu, i.gaudiello at gmail.com

1 Department of Control and Industrial Informatics, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology, Ilkovičova 3, 812 19 Bratislava, Slovak Republic
2 Cognition Humaine & Artificielle – Laboratoire des Usages en Techniques d'Information Numériques, Faculty of Psychology, Université Paris 8 Vincennes-Saint-Denis, Paris, France

Abstract

Individual specificity and autonomy of a morally reasoning system is principally attained by means of a constructionist inductive process. The inputs into such a process are moral dilemmata or their story-like representations; its outputs are general patterns allowing the classification as moral or immoral even of dilemmas which were not represented in the initial "training" corpus. The moral inference process can be simulated by machine learning algorithms and can be based upon the detection and extraction of morally relevant features. Supervised or semi-supervised approaches should be used by those aiming to simulate parent→child or teacher→student morality-transfer processes in artificial agents. Pre-existing models of inference – e.g. the grammar inference models in the domain of computational linguistics – can offer certain inspiration to anyone aiming to deploy a moral induction model. Historical data, mythology or folklore could serve as the basis of the training corpus, which could subsequently be significantly extended by a crowdsourcing method exploiting a web-based « Completely Automated Moral Turing test to tell Computers and Humans Apart ». Such a CAMTCHA approach could also be useful for the evaluation of an agent's moral faculties.

Keywords: moral induction model, autonomous artificial agent, induction of morality, grammar inference, moral Turing test, corpus-based machine learning, morally relevant features, oracle machine, moral grammar, semantic enrichment, CAMTCHA

1. Inductive Process

The aim of this article is to furnish some theoretical as well as practical arguments supporting the proposal that « specific and autonomous aspects of moral behaviour are tuned by means of an inductive process ». It shall be argued that at least certain components of this process, e.g. « moral feature extraction » or « equivalence-class clustering of moral dilemmas », can indeed be computable and can be successfully simulated on a Universal Turing Machine, especially if an immediate answer-giving oracle (Turing, 1939) supervises the process.

The usage of the generic term « process » indicates that we aim to explain the emergence of morality as a durative and constructive phenomenon. Like other human aptitudes, such as language or object manipulation, the human moral faculty demands time to develop, and we believe that this development can be understood in terms of an environment-driven tuning of certain biologically pre-wired innate parameters related to the fact that healthy humans are essentially social beings (Adler, 1927). It may well be the case that the interaction of three principal components – 1) the imitation faculty of mirror neurons, 2) the generalisation faculty of the human brain, and 3) the very possibility furnished by the second law of thermodynamics, i.e. the freedom of structures to evolve in a new direction (i.e. to mutate) – accounts for the construction of morality in human ontogeny as well as in phylogeny. Concrete insights concerning the interaction of these three components can be found in (Piaget & Baechler, 1932). However, for the scope of the present work, we will focus on the second one, that is, on the continuous processing of situations implying moral dilemmata whose solutions should be further generalized beyond the contingent situations.

This type of processing is generally called « induction » or « inference ». Both of these i-terms denote the direction from the concrete, and often physical, towards the abstract and general. Their antonym is « deduction », denoting the flow from the general to the concrete.
While it is an undeniable fact that induction & deduction together form an inseparable, holistic head & tails for any advanced cognitive activity performed by a human agent – and that deduction is necessary in the case of any reasonable performance – we argue that the construction of an individual and autonomous moral competence is ultimately based on induction.

The usage of the terms competence and performance, which are so widely used within the framework of the Chomskian doctrine, may indicate that we shall tend to defend its (nativ|mental|generativ)ist position stating that human beings are, from their birth, endowed with some kind of a « universal moral grammar » (UMG) (Mikhail, 2007), which should play a crucial role in setting the parameters of a more local moral grammar (MG), in order to adapt it to the given cultural and social context. While we are far from excluding the possibility that humans are endowed with a certain UMG – most probably related to such anthropological constants as "empathy", "emotional resonance" or "theory of mind" – the objective of our proposal is to explain not the Unity but rather the diversity of human moral behaviour. That is, instead of wondering which ethical theory we should use to endow agents with moral competence (Lin, Abney, & Bekey, 2012), we propose to focus attention on local contextuality, as well as to assess the divergence among various instances of individual MGs.

As far as we know, a child does not, in order to take a « good » decision, inject all possible behavioural maxims as input parameters into some kind of Kantian (Kant, 1785), universally applicable cognitive blackbox. On the contrary: simple imitation is more often than not a successful heuristic – be it the imitation of a physical person standing in front of the child, or the imitation of a model figure represented as a sort of archetype in the child's semantic memory. And if ever there is nothing to imitate, if ever there is no precedent, no match, only then does the generalisation procedure enter the solution-seeking game.

2. Training Corpus

How can this constructive and durative process be simulated in the realm of artificial agents (AAs)? The question is not to be wiped off the table, since in a world already governed to a huge extent by machines, a great deal can depend on a correct and, if possible, deeply empathic answer. In accordance with the authors who suggest that narratives are central to human moral development (see Vitz, 1990 for a review), we suggest extending the very same narrative-based approach beyond the domain of organic agents, thus proposing the following answer to the question posed above: « By telling stories ».

Within the framework of a full-fledged Moral Induction Model (MIM), a « story » is defined as a representation of a situation of moral dilemma. In order to demonstrate our point we shall, in this paper, focus solely upon dilemmata represented in the textual modality. Our motivation for such a choice is twofold: 1) text seems to be a vector robust enough for the transfer of the "moral of the story" from the author to the reader; 2) the canonical Turing Test is a text-oriented one, and it can thus be expected that a moral-restricted Turing-Test-like evaluation procedure would also be based on the textual modality. An example of such a story-represented-in-text can be:

STORY 1: « There was once a king who saw a man digging a ditch near the road. The king asked the man: 'How much do You earn for such hard work?' 'Three dimes daily,' answered the man. Surprised was the king and asketh: 'Three dimes daily?
So little?' The man answereth: 'Three dimes daily, oh yes, dear and respectable king, but in fact I live from only one dime a day, since with the second dime I lend and with the third I pay back what I have borrowed.' Puzzled was the king and asketh: 'How comes?' The man replieth: 'I simply pay back one dime to my father and invest one in my son, o Lord!' » (Dobšinský, 1883)

One can extract such stories from folklore, mythology, religion, history, legal codices or biographies in order to create a Training Corpus (TC). The criteria according to which such corpora are built are of utmost importance, since it is the injection of the TC into MIM's inductive apparatus which starts the whole process aiming to attain artificial agents endowed with the faculty to reason according to human moral precepts, or at least to understand them. One would thus be highly reluctant to integrate into the corpus the violent acts described in both Testaments or in the biographies of Stalin, Hitler etc., and the introduction of such texts into the learning process is highly discouraged, especially for the phases during which an AA does not yet dispose of its own consistent yet autonomous (Hromada, 2012) moral core.

The very process of story selection and TC construction is already an act by means of which a human teacher supervises the AA's learning. One should never underestimate the importance of the selection criteria according to which the teacher chooses to confront the AA with this story and not with that one, and to do so at this moment of the learning process and not later nor sooner. These selection criteria are very important because they are strongly coupled with the « values » that the teacher seeks to transfer by the learning process. Hence, MI is never a fully unsupervised process. The teacher should always be present, and since it follows ex vi termini that a good teacher cannot be physically present for more than a limited period of time, (s)he should at least aim to encode some λόγος into the very form of the TC (s)he deploys.

While it is of course possible to imagine that once the TC is constructed, one could go further with unsupervised algorithms, the choice of such an approach would make it practically impossible for the teacher to transfer his precepts with the envisaged degree of exactness. It is therefore recommended to depart from a state whereby the stories contained in the TC are already associated with labels furnished explicitly by the teacher. In more advanced cases, the labels can be more complex structures, like the label (CONCLUSION: Agent(King); Predicate(Reward); Acceptor(Poor-man); Reason(Acceptor's wisdom)) associated with STORY 1. But due to the scaffolding nature of MIM, it seems more rational to depart towards such complex levels from basic TCs which contain binary (i.e. « good » and « bad ») and ternary (c.f. STORY 2 below) labels.

3. Model Description

1. Preprocessing

Every input into the induction process, every story, is in the beginning nothing more than a string of characters. This sequence of tokens subsequently enters the natural language processing (NLP) machinery of parsers, lemmatisers etc., which enrich the initial data with relevant syntactic metadata.

2. Semantic Enrichment

Once the basic syntactic tags are assigned to the different phrases and words of the story, the NLP engine shall « link » the data contained in the story with prebuilt ontologies or semantic vector spaces (Widdows, 2004) which represent previously attained knowledge.
2. Semantic Enrichment

Once the basic syntactic tags are assigned to the different phrases and words of the story, the NLP engine shall « link » the data contained in the story with prebuilt ontologies or semantic vector spaces (Widdows, 2004) which represent previously attained knowledge. This can be done by the process of semantic enrichment (SE), whose objective is to make explicit the information which is implicitly contained in the initial story. SE can be thought of as a sort of « process of source code compilation » whose output is a complex data structure containing much more information than was explicitly stated in the initial « source code » (i.e. in the « story »). For example, the sequence of four letters D I M E shall, in combination with syntactic labels like “noun” obtained in phase 1, transform into a reference to such assertions as « form of money », « of little value » etc. We believe that even with current RDF&SPARQL-based technologies one could make explicit the fact that the main character of STORY 1 was very poor ← because his salary was very low ← because he reacted with the statement « three dimes daily » ← to a question containing the verb « earn » as its predicate. And since the first cycle of the SE process attributed facts like « pays back the old » and « invests into the youth » to the principal agent of the story, it is highly probable that SE's second cycle shall inject into the story's graph also the representation of the predicate Wise(Poor-man).

3. Moral Feature Extraction

Once the flat linear sequence of letters of the initial story has been transformed into a semantically enriched, densely intraconnected multigraph and/or into a vector space endowed with certain unique topological properties, one can try to align it with previously obtained morally relevant data. One possible way to attain this goal is to encode the story as a vector of binary values which denote the presence or absence of this or that feature in the story. For example, an edge between nodes A and B of a semantically enriched multigraph could be interpreted as the presence of a feature AB. Once the vector representation of the story is ready, one can align it with the vector representations of the other stories contained in the TC and pass the resulting matrix as an input into a supervised machine learning feature extraction algorithm like AdaBoost (Viola & Jones, 2001). During the training phase, the algorithm will discover such linear combinations of eigenfeatures which reduce the story → label classification error. In other terms, during the learning process, an AA could « discover » that what is morally relevant for the success or failure of the story's principal hero is that he was associated with features like « hard-working », « polite » and « wise », while the presence of a feature like « hero digs a ditch » is as irrelevant for the moral of the story as would be the presence of a feature « hero paves the path ». The absence of features can be equally important: the fact that no person in the story is rude or violent can also be chosen as a morally relevant feature (MRF).

4. Equivalence class construction and production of an abstract moral template

Once morally relevant features are extracted in the training phase, one can cluster objects which share such a feature (or sets of features) into classes. After that, non-terminals denoting these equivalence classes can be organized into patterns whose totality would yield a « moral template ». If there is a mismatch between the output produced by confrontation of the moral template with a story S and the label associated in the training corpus with the story S, one should try to modify the classes or some of the patterns so that they would match (if moral) or not match (if immoral) the MRFs extracted from the story.
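A minimal sketch of the MRF-extraction step described in section 3 above, assuming toy binary story vectors and hand-made teacher labels; scikit-learn's AdaBoostClassifier stands in here for the boosting machinery cited above:

    from sklearn.ensemble import AdaBoostClassifier

    # Toy TC: each row encodes one story as presence/absence of candidate features.
    features = ["hard-working", "polite", "wise", "digs-a-ditch", "violent"]
    stories = [
        [1, 1, 1, 1, 0],   # STORY 1: the poor but wise ditch-digger
        [1, 0, 1, 0, 0],
        [0, 0, 0, 1, 1],
        [0, 1, 0, 0, 1],
    ]
    labels = [1, 1, 0, 0]  # binary teacher labels: 1 = «good», 0 = «bad»

    clf = AdaBoostClassifier(n_estimators=50).fit(stories, labels)

    # Features with non-zero importance are candidate MRFs;
    # «digs-a-ditch» should come out as (nearly) irrelevant.
    for name, weight in zip(features, clf.feature_importances_):
        print(name, round(weight, 2))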
If no such modification leads to success, one will be obliged to re-run the costly MRF-extraction process with additional data. In the real-life scenario, one simply « compiles » the story by the SE process, looks for the absence or presence of preselected MRFs, looks at what concepts (« justice », « loyalty ») can be constructed from them, and tries to match their combinations with already induced patterns to produce the final output. In a robotic AA endowed with a material shell, such an output can be an instruction inducing the agent to execute a physical movement.

4. Moral & Grammar Inference

Moral induction is a bootstrapping (Hromada, 2014) and self-scaffolding process. Value-representing concepts (e.g. X = « wisdom ») have to be constructed in parallel to maxim-representing pattern-predicates (e.g. « reward X ») within which the value-representing concepts play the role of a free variable. One is dependent on the other and vice versa. In this sense Moral Induction is analogous to the process of grammar inference, which is a conditio sine qua non of language acquisition and as such automatically occurs in every healthy human baby. In grammar inference one has to deal with a similar problem: equivalence classes for grammatical categories, conjugations and declensions are to be constructed before a rule manipulating these classes. But without prior knowledge of such rules it is difficult to evaluate whether a candidate equivalence class is a pertinent one, or whether it is just a set of tokens clustered according to some unimportant criteria. For example, the rule Regular_Verb+ed → PastParticiple is of no use if the baby does not have any notion of what verbs are and, on the other side, it is a non-trivial problem for a baby's brain to find out which tokens should be clustered into the group of regular verbs, since initially the baby does not know any rule which could help it to distinguish regulars from irregulars or even from nouns. But luckily enough, it seems that this chicken&egg problem can be solved. At least the results of computational models of grammar inference like « Automatic Distillation of Structure » (ADIOS) (Solan, Horn, Ruppin, & Edelman, 2005) indicate that even a relatively simple graph-theoretic approach can furnish a method by means of which one can induce the grammatical rules which generated a corpus by using as input only the very corpus itself. We believe that human grammatical competence shares certain characteristics with moral competence – both first transform the surface structure into a much more complex « deep structure » and afterwards match this structure with an already induced template. If the grammatical structure of the sentence matches the syntactic template, one « feels » that it is grammatical; if the « moral of the story » matches the moral template, one « feels » that the story's hero does the « right » thing. It is also worth mentioning that a deeper formal analysis presented in (Clark, 2010) suggests that certain problems of grammar induction simply disappear if ever the induction-performing algorithm disposes of the possibility to consult an oracle machine (Turing, 1939) with the question « Is utterance X grammatical? ». Mutatis mutandis, in the domain of the moral inductive process occurring in a child's mind, the question is « Is a given maxim moral? Should one act like that? » and the oracle is principally a parent, later a teacher.
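The oracle consultation just evoked can be sketched, for illustration only, as a loop in which candidate maxims are kept only when the human oracle confirms them; the candidate list and the use of console input as the parental oracle are our own assumptions:

    # Hypothetical induction loop: candidate generalizations survive
    # only if the (human) oracle confirms them.
    candidates = ["reward the wise", "reward the violent", "soothe the sad"]

    accepted = []
    for maxim in candidates:
        # The oracle is, principally, a parent or a teacher;
        # here it is simulated by a console prompt.
        answer = input(f"Is the maxim '{maxim}' moral? [y/n] ")
        if answer.strip().lower() == "y":
            accepted.append(maxim)

    print("Induced moral core:", accepted)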
5. Problem & Solution

A disadvantage of the approach proposed in the preceding paragraphs is that, in order to train a fully autonomous AA, one would need a very large TC to be able to detect & extract subtle MRFs. If we speak about millions of features of which potentially any story can be composed, we shall need a TC containing at least hundreds of thousands of stories. Otherwise, the dataset would be too sparse and no MRFs could be extracted which could yield a robust moral classifier. What is worse, at least one label would have to be manually attributed to every story of the corpus by a human teacher, which would demand a significant devotion of one's time to the project. Involvement of multiple teachers in the labeling process is a possible solution but, in case the teachers' moral values were not mutually consistent, it could stain the TC with more noise than signal. But, luckily enough, the labeling problem can be easily crowdsourced, so that any story could be labeled by a statistically significant number of human subjects. Such a corpus could thus represent not only the moral values of one or a few teachers but, possibly, the moral values of a community, a nation or even of humankind itself. We present hereby a way in which a TC could be constructed in a relatively non-violent and potentially rewarding and amusing manner. During the creation of an account on a website, it is nowadays a common procedure to include a so-called CAPTCHA image somewhere in the registration form so that the webserver application can be sure that it communicates with a human being, who is able to visually parse the content of an image, and not with a bot which is unable to do so. In a CAMTCHA³ (i.e. Completely Automated Moral Turing test to tell Computers and Humans Apart) which we hereby propose, the « question » does not address the subject's faculty of visual recognition; it addresses his|her moral reasoning faculty. Thus, instead of proposing to a user an image containing twisted or rotated letters which have to be recognized and rewritten into the input box below, an application could propose a story & CAMTCHA question couple:

³ As of 2013, the only running instance of CAMTCHA we are aware of is present at the site kyberia.cz, where users have to answer the question “What is justice?” in order to be granted access into the community.

STORY 2: There are 3 children on the playground – Alice, Bob and Carla. Bob is sad because his mother is in the hospital. Alice is happy because just a while ago her father gave her a beautiful present. Carla is sad because she never received any present at all – her parents are too poor to buy her any.

QUESTION: You must soothe the kids. You have two toys to give. Which child shall NOT get a toy?

Below the story will be the input box where a human “teacher” shall, with quite high probability, write the answer « Alice ». At the same time, the same story shall be presented to other users and, if a statistically significant number of users give the same answer (and not some other), the CAMTCHA could consider the answer as a valid « moral » label for the presented story. Contrary to CAPTCHA, whose intention from the very beginning was to distinguish bots from humans, the primary reason for deployment of CAMTCHA would be to obtain valid labels for the TC under question. But once at least some stories are labeled with sufficient clarity, CAMTCHA could, of course, also be used as a miniature moral Turing Test (Hromada, 2012) at the entrance to such web communities or applications where the extent of the moral competence of a user-to-be-verified plays an important role. Problems presented by CAMTCHA could, of course, be automatically diversified – names (Alice → Eve), objects (toys → rewards) and verbs (give → distribute) could be substituted. The very final question could also vary in relation to the labeling schema of the TC (e.g. the question could be « Would it be good or bad if You give a toy to Carla? » for a TC labeled only with the binary labels « good » and « bad »).
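Such diversification amounts to little more than constrained template substitution; a minimal sketch follows, in which the substitution lists are hand-picked assumptions of ours:

    import random

    # Substitution lists: names, objects and verbs that may be swapped
    # without changing the moral structure of the story.
    names   = {"Alice": ["Eve", "Mary"], "Bob": ["Tom"], "Carla": ["Nina"]}
    objects = {"toy": ["reward", "sweet"]}
    verbs   = {"give": ["distribute", "hand out"]}

    def mutate(story: str) -> str:
        """Return a surface variant of the story carrying the same moral label."""
        for table in (names, objects, verbs):
            for original, variants in table.items():
                if original in story:
                    story = story.replace(original, random.choice(variants))
        return story

    story2 = "You must soothe the kids. You have two toys to give."
    print(mutate(story2))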
Later, a more complex narrative generator could be programmed which would not only « mutate » but also « crossover » the stories present in the TC, hence generating completely new stories. Worth more than gold, such an automatic moral story narrator could and should be based on already obtained data and could be imagined as an « active » counterpart to the « passive » pattern-matching MIM-template finite state automaton. But before one gets there, it seems reasonable to manually construct a corpus of very simple and morally unambiguous stories. Note, for example, that STORY 2 has only 81 words and is quite easily syntactically parsable. The SE process converging to the « knowledge of the fact » that Alice is the only child among the three who is not sad (because she is described as « happy ») is attainable by current semantic vector space or ontology-based techniques. Thus, creating a question-answering system which would 1) parse the question, 2) realise that the question has three possible answers, 3) apply MIM to find out that it is not a happy but a sad child which has to be soothed in the first place, is something which could be done even today. Verily, an approach such as the one proposed hereby could yield some success, if the engineer's aims are modest. Thus, instead of aiming to create an AA able to find an answer to artificially constructed « trolley problems » (Mikhail, 2007), to which even an adult human being cannot find any answer, the process of grounding of an AA's morality should be started with corpora of stories representing the concrete and minute problems of concrete and small human beings – children. In this paper we have tried to illustrate how such an approach could, possibly, ground the notion of « justice » by illustrating its retributive (STORY 1) and distributive (STORY 2) forms. It may be the case that some of the premises proposed in this article were wrong; however, if ever there shall once be at least one artificial kindergarten playground arbiter which shall recognize a suffering child and make it smile, we believe that writing it was worth the effort.

Bibliography

Adler, A. (1927). Understanding Human Nature.
Clark, A. (2010). Towards general algorithms for grammatical inference. Algorithmic Learning Theory (p. 11–30).
Dobšinský, P. (1883). Simple National Slovak Tales (Vol. 1–8).
Hromada, D. D. (2012). From Age&Gender-based Taxonomy of Turing Test Scenarios towards Attribution of Legal Status to Meta-Modular Artificial Autonomous Agents. Proceedings of the IACAP/AISB Turing Centenary Conference. Birmingham, UK.
Hromada, D. D. (2014). Conditions for Cognitive Plausibility of Computational Models of Category Induction. In Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 93–105). Springer International Publishing.
Kant, I. (1785). Groundwork of the Metaphysic of Morals.
Lin, P., Abney, K., & Bekey, G. A. (2012). Robot Ethics: The Ethical and Social Implications of Robotics.
Intelligent Robotics and Autonomous Agents series. The MIT Press, Cambridge, Massachusetts.
Mikhail, J. (2007). Universal moral grammar: Theory, evidence and the future. Trends in Cognitive Sciences, 11(4), 143–152.
Piaget, J., & Baechler, N. (1932). Le jugement moral chez l'enfant. [The Moral Judgment of the Child.]
Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629.
Turing, A. M. (1939). Systems of logic based on ordinals. Proceedings of the London Mathematical Society, 2(1), 161–228.
Viola, P., & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Classifiers. Proc. IEEE CVPR 2001.
Vitz, P. C. (1990). The use of stories in moral development. American Psychologist, 45(6), 709–720.
Widdows, D. (2004). Geometry and meaning. CSLI Publications.

Conditions of cognitive plausibility of computational models of category induction

Daniel Devatman Hromada
Laboratoire Cognition Humaine et Artificielle (ChART), Université Paris 8
hromi@wizzion.com

Abstract. We present two axiomatic and three conjectural conditions which a model inducing natural language categories should satisfy if it aims to be considered “cognitively plausible”. The 1st axiomatic condition is that the model should involve a bootstrapping component. The 2nd axiomatic condition is that it should be data-driven. The 1st conjectural condition demands that the model integrate surface features – related to prosody, phonology and morphology – somewhat more intensively than is the case in existing Markov-inspired models. The 2nd conjectural condition demands that, aside from integrating symbolic and connectionist aspects, the model in question should exploit the global geometric and topological properties of the vector spaces upon which it operates. Lastly, we shall argue that the model should facilitate qualitative evaluation, for example in the form of a POS-i oriented Turing Test. In order to support our claims, we present a POS-induction model based on trivial k-way clustering of vectors representing suffixal and co-occurrence information present in parts of the Multext-East corpus. Even in the very initial stages of its development, the model succeeds in outperforming some more complex probabilistic POS-induction models at lesser computational cost.

Keywords: categorization, part-of-speech induction, surface features, vector spaces, categorization-oriented Turing Test, partitioning of grammatical feature space, K-means clustering, cognitive plausibility

1. Introduction

The notions of “cognitive plausibility” and “part-of-speech induction” shall be defined in subsection 1.1. Subsection 1.2 shall clarify the position of syntactic category induction within the field of Natural Language Processing (NLP). The last subsection (1.3) shall offer a brief overview of the history of the problem, arguing that the current paradigm is a probabilistic and English-centered one.

1.1 Cognitive plausibility

This article enumerates some basic conditions which, we believe, should be fulfilled by engineers aiming to transform their computational models into “cognitively plausible” artificial agents. We label as “cognitively plausible” a model which tends to address some basic function of the human cognitive system not only by simulating, in a sort of “black-box apparatus”, the mapping of inputs (stimuli, corpus data etc.)
upon outputs (results), but which also tends to faithfully represent the way in which the respective function/skill is accomplished by a human mind and its material substrate – the brain. In other terms, we believe that a cognitively plausible model should not only aim to attain the most quantitatively accurate results, but should also do so by processing the information similarly to the way the mind does it. The aim of this article is to elucidate the notion of “cognitive plausibility” (CP) by relating it to one particular problem, that of the construction of the grammatical categories present in natural languages. More concretely, we shall try to illustrate our point on the problem of construction of part-of-speech (POS) classes. We specify that the term POS-induction (POS-i) designates the process which endows a human or an artificial agent with the competence to attribute POS-labels (like “verb”, “noun”, “adjective”) to any token observable in the agent's linguistic environment. For the simplicity of the argument, only parts of textual corpora like Multext-East (Erjavec, 2012) shall be considered as such a “linguistic environment” of the computational agent introduced below.

1.2 Part-of-Speech induction in Natural Language Processing and Language Acquisition studies

POS-i is often considered to be “one of the most popular tasks in research on unsupervised NLP” (Christodoulopoulos et al., 2010). The problem of construction of grammatical categories is closely related to the problems of “grammar induction” and language acquisition. Since “syntactic category information is part of the basic knowledge about language that children must learn before they can acquire more complicated structures” (Schütze, 1993), it is hard to imagine any computational model of grammar induction – aiming to discover the set of rules of the grammar of the language under study – without it being able to construct, in the first place, the equivalence classes upon which the rules-to-discover shall be applied (Elman, 1989; Solan et al., 2005). Acquisition of formal grammatical categories, be it parts-of-speech or others, is thoroughly studied in the psycholinguistic literature – for an introductory overview cf. Levy et al. (1988). Such studies often aim to address the question of “whether grammatical categories are innate, or induced through interaction with the environment by means of imitation and analogy”. The result of this never-ceasing Nature&Nurture debate is a vast amount of both empiric and theoretic knowledge which could be ideally useful for any attempt to bring together the disparate disciplines of artificial intelligence and developmental psychology.

1.3 POS-i paradigm(s)

While worthy POS-i models, like that of Elman (1989), existed before, or were published more or less in parallel (Schütze, 1993), the paradigm currently dominating the POS-i domain was fully born with the article published by Brown et al. in 1992. Without going into detail, we note that the model was successful because of its ability to apply both Markovian probabilistic concepts and those coming from information theory (Shannon & Weaver, 1949) upon the information contained in the co-occurrences of words in sequences, thus becoming the flagship of what we label hereby as the “co-occurrence distribution” or “contextual distribution” (CD) paradigm. In the decades to follow, the CD paradigm has clearly dominated the POS-i field.
Be it hidden Markov Models tweaked with variational Bayes (Johnson, 2007), Gibbs sampling (Goldwater & Griffiths, 2007), morphological features (Berg-Kirkpatrick, Bouchard-Côté, DeNero, & Klein, 2010; Clark, 2003) or graph-oriented methods (Biemann, 2006) – all such approaches, and many others, consider contextual co-occurrence to be the primary source of POS-i-relevant information. But as the comparative study of Christodoulopoulos et al. (2010) indicates when demonstrating that models integrating morphological features tend to perform better than those which do not, it seems plausible that the uncontested primary role of CD in POS-i should be revised. While it is evident that CD must indeed furnish relevant information if ever the distributional hypothesis is valid (Harris, 1954), and while it is axiomatic that the distributional hypothesis applies in the case of any agent creating categories consistently with Hebb's law (Hebb, 1964), we shall argue in subsection 3.1 that pertinent POS-i clues can be extracted not only from a word's “external” contextual properties but also from the word's very “internal” μορφή (form).

2. Axiomatic conditions of Cognitive Plausibility

This section deals with what we believe are necessary (i.e. sine qua non) conditions of cognitive plausibility of a computational model. Subsection 2.1 deals with the “bootstrapping” condition, stating that the categories which are being built are based on categories which have already been built. The emergence of the bootstrapping effect shall be illustrated on a trivial multi-iterative re-clustering of clusters pre-clustered according to CD features. Subsection 2.2 discusses the assumption that, in order to be cognitively plausible, the model should be data- and/or oracle-driven.

2.1 Bootstrapping the bootstrapping

From biochemistry to social sciences it is a well known fact that the structuring structures are the structures structured. Computational Linguistics, and NLP in particular, is no exception. The most general definition of the term bootstrapping (B) – i.e. that B is a self-sustaining multi-iterative process whereby the outputs of the previous iteration modify the very execution of the next iteration – could indeed be applied to so many “recurrent”, “self-feeding” (Riloff & Jones, 1999) or “auto-organizing” (Nowak et al., 1999) approaches already applied in so many NLP studies that to state about an NLP algorithm X that “X bootstraps” may sometimes seem to be a plain tautology. In a certain sense almost any POS-i model based on the CD paradigm is, ex vi termini, a bootstrapping one, because even in the most simplistic models, the information about the membership of the target word WT in the candidate class C is inferred from the probabilities of membership of WL (WT's left context) and WR (WT's right context) in their respective candidate POS classes. Given the fact that WT plays the role of right context for WL and the role of left context for WR, the whole problem is circular and as such often calls for a bootstrapping solution. Solan et al. (2005) refer to a crucial 4th component of their Automatic Distillation of Structure (ADIOS) algorithm as “generalized bootstrapping”. Differently from the “geometric approach” which shall be presented in our experiment below, ADIOS implements graph-like structures in order to attain its aim of constructing equivalence classes useful in subsequent grammar induction. But in its very essence, the approach of Solan et al., i.e.
that one should substitute the vertices “subsumed” by a “subsuming” non-terminal class-denoting vertex, is analogous, mutatis mutandis, to the approach presented in the following paragraphs.

2.1.1 1st experiment: Bootstrapping k-way POS clustering seeded by token co-occurrence features

The experiment was performed with data contained in the English (en), Czech (cs) and Slovak (sk) corpora contained in the 4th version of the Multext-East corpus (Erjavec, 2012).

Table 1. Overall statistics of analyzed corpora

Corpus   Word Types   Tokens    POS Tags   FeatCOOC
cs       19283        100368    13         70426
en       10511        134832    12         36774
sk       20588        103452    13         74912

Table 1 presents summary statistics concerning the quantities of distinct word tokens and word types (i.e. tokens without context) and the number of the most coarse-grained “gold standard” POS-tags, along with the total number of distinct co-occurrence features, which is equivalent to the number of columns (dimensions) in the resulting co-occurrence matrix. Every word type WT was characterized by a (row) vector of values [W1L, W2L, …, WNL, W1R, W2R, …, WNR], W1L referring to cases when the word W1 occurred to the left of WT, W2L to cases when W2 was to the left, W3R to cases when W3 was to the right of the target word. What results is a simple co-occurrence matrix with N rows and a maximum of FeatCOOC = 2N columns. Given that in the experiment we were actually looking at two words to the left and two words to the right of WT, the maximum possible number of columns was FeatCOOC = 4N. But since not all word couples occur beside each other, the final number FeatCOOC was always below this theoretical limit. The matrix has been clustered into C = {2 … 50} clusters by the fast & frugal repeated-bisection k-way clustering algorithm as implemented in the clustering tool CLUTO (Karypis, 2002). Columns were scaled according to the IDF principle and the clustering was done according to the cosine metric. Once finished, comparison with the “gold standard” yielded V-measure (Rosenberg & Hirschberg, 2007) values which are illustrated as the NO curves on Figure 1. We have implemented the bootstrapping component in the following manner: after each clustering, the information about the proposed cluster is added as a new feature to the target word's vector description. Thus, if a matrix with 20 columns entered the first iteration, which clustered the vectors into 5 clusters, the matrix entering the second iteration shall have 20+5 columns. If the second iteration yields 6 clusters, a matrix with 25+6 columns will become the input for the third iteration, etc. Figure 1 shows that in the case of all 3 studied corpora, the bootstrapping BO method always attains higher scores than the static NO approach.¹

¹ Note that the V-measure of the NO-bootstrap curves seems to be relatively stable with regard to an increase in the number of clusters. Contrary to many-to-one accuracy (purity), which increases with the number of clusters, V-measure thus seems to be the better evaluation measure for cases when solutions containing different numbers of clusters have to be compared.

Fig. 1. Bootstrapping of contextual co-occurrence statistics
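A condensed sketch of this bootstrapping loop, assuming a toy corpus and a window of one word on each side; scikit-learn's KMeans stands in here for CLUTO's repeated-bisection clustering:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import v_measure_score

    # Toy corpus and gold tags; in the experiment these come from Multext-East.
    tokens = "the cat eats the fish the dog eats the meat".split()
    gold = {"the": "D", "cat": "N", "fish": "N", "dog": "N", "meat": "N", "eats": "V"}

    types = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(types)}

    # Co-occurrence matrix: left-neighbour and right-neighbour counts.
    X = np.zeros((len(types), 2 * len(types)))
    for left, right in zip(tokens, tokens[1:]):
        X[idx[right], idx[left]] += 1                # 'left' occurred left of 'right'
        X[idx[left], len(types) + idx[right]] += 1   # 'right' occurred right of 'left'

    for iteration in range(3):
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
        print(iteration, v_measure_score([gold[w] for w in types], labels))
        # Bootstrapping: append the induced cluster membership as new columns.
        X = np.hstack([X, np.eye(3)[labels]])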
2.2 Data- and oracle-driven learning

Computational models unable to analyze what they have previously synthesized, and to synthesize what they have previously analyzed, could hardly be labeled as “cognitively plausible”. But even the presence of such a “dialectic” component cannot be a guarantee of absolute success if ever the model's initial prima materia – the data with which the whole bootstrapping is initiated – is not adapted to the model's prewired, “innate” state. It is unfortunately often the case in computational linguistics that whenever the model does not attain the expected performance, a huge amount of effort is invested into tuning the model by diverse ad hoc modifications. After hours of exhaustive search, both intellectual as well as automatic, diverse parameters, meta-parameters and hyper-parameters are finally discovered which allow the model to attain somewhat superior performance when confronted, for example, with the Wall Street Journal (WSJ) corpus. But human categorization faculties – POS-i included – do not develop in such a way. While it seems plausible that the same sort of “tuning of parameters” indeed takes place during the initial period of language acquisition, it seems to be so efficient because the data itself is well adapted to the ever-evolving state of the baby's neuro-linguistic structures. Said more concretely, parents do not recite the WSJ or Eulex corpora to their children in order to adjust the synaptic weights in their children's brains; rather, they modify all their narrative intentions through pragmatic, prosodic, phonological as well as semantic Babytalk (Ferguson, 1964) cognitive filters. In doing so – by pre-processing the stimuli before they even reach the perceptual buffers of the child agent's ears – parents affirm themselves in the role of a computational oracle (Turing, 1939). Since it was already demonstrated by Clark (2010) with sufficient analytical clarity that the “supervision” coming from external oracle machines can significantly reduce the complexity of the grammar induction and POS-i problems, we find it worthwhile to state that “fully unsupervised approaches are very rare, because the engineer's decision to confront the algorithm with corpus X and not Y, and to do so in the moment T1 and not T2, is already an act of supervision”. By saying so we do not want to underestimate the importance of using the same corpora for mutual comparison of scientific results. We simply want to indicate that, because it determines everything which follows, the question of corpus choice should not be neglected. More concretely, cognitively plausible models of POS-i should first be tuned and “raised” with corpora like CHILDES (MacWhinney, 2000), and only later should their scope of validity be extended by means of confrontation with corpora of adult and expert utterances.

3. Conjectural conditions of model's Cognitive Plausibility

Subsection 3.1 discusses the role of non-distributional “surface” features for POS-induction. The discussion is followed by the results of an experiment suggesting that features like suffixes can indeed offer quite strong clues for the creation of syntactic categories. Subsection 3.2 introduces a conjectural condition for the model's CP by proposing to base it principally on geometric grounds. It is followed by subsection 3.3, arguing that a CP model should facilitate evaluation by means of qualitative inspection. In general, these sections deal with CP's conjectural conditions, meaning that while they may seem less self-evident than the axiomatic ones, we nonetheless consider them to be valid.

3.1 Integration of surface features

Natural languages are very redundant communication channels (de Saussure, 1922; Shannon & Weaver, 1949).
Three facets of the word – its morpho-phonological signifiant, its invisible signifié and its syntactic function – are not independent from one another, and more often than not they significantly overlap (Jackendoff, 2003; Lakoff, 1990). Thus it is not surprising that, especially in morphologically rich languages, a token's very syntactic function is encoded by morphemes present in the surface, i.e. objectively perceivable, form of the token itself. And results obtained by Clark (2003) or Berg-Kirkpatrick et al. (2010) indeed point in this direction – it may be no coincidence that the approaches which exploit morphological features turned out, in the comparative study of Christodoulopoulos et al. (2010), to perform better than models which do not use such features.

3.1.1 2nd experiment: Assessing the impact of suffixal features on part-of-speech categorisation

We used the same three Multext-East corpora as in the first experiment. The ultimate character trigram was extracted from every word type and considered to be a feature. Word types were subsequently clustered into C clusters according to these FeatSUFFIX orthogonal dimensions. The comparison with the Multext-East gold standard subsequently yields the V-measures (V), entropies (H) and purities (P) presented in Table 2.

Table 2. Performance of models inducing C categories solely according to suffixal features

Corpus (FeatSUFFIX)   C=10                       C=30                       C=50
cs (534)              V=0.178 H=0.487 P=0.582    V=0.24  H=0.392 P=0.642    V=0.26  H=0.34  P=0.69
en (286)              V=0.248 H=0.428 P=0.639    V=0.215 H=0.4   P=0.652    V=0.2   H=0.39  P=0.66
sk (523)              V=0.17  H=0.5   P=0.504    V=0.272 H=0.373 P=0.685    V=0.274 H=0.339 P=0.714

The amount in parentheses after the corpus name denotes the length of the FeatSUFFIX vector, i.e. the number of distinct suffixal trigrams observed in the respective corpus. The FeatSUFFIX-driven model attains lower V-measures than those obtained by Christodoulopoulos et al. (2010) when evaluating the models of Clark (2003) or Berg-Kirkpatrick et al. (2010) within their comparative study. The very same study, however, also indicates that even the simplistic FeatSUFFIX-driven model can be of certain interest since it seems to be quite fast – in comparison to models harnessing the power of more than a dozen computational cores to attain comparable or even better V-measures than the FeatSUFFIX-driven method, we are glad to state that in order to attain the results presented above, our dual-core Pentium needed on average TEN = 1.8, TSK = 3.2, TCS = 3.6 seconds per simulation.
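A minimal sketch of this suffixal experiment, again with a toy word list of our own and scikit-learn's KMeans in place of CLUTO:

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy word types; in the experiment these are all types of a Multext-East corpus.
    words = ["walked", "milked", "cooked", "singing", "ringing",
             "quickly", "weekly", "teacher", "preacher"]

    # Each word is described solely by its ultimate character trigram.
    suffixes = sorted({w[-3:] for w in words})
    X = np.zeros((len(words), len(suffixes)))
    for i, w in enumerate(words):
        X[i, suffixes.index(w[-3:])] = 1

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    for c in range(4):
        print(c, [w for w, l in zip(words, labels) if l == c])
    # Words sharing a final trigram (-ked, -ing, -kly, -her) fall into the same cluster.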
3.2 Knowledge is geometric

After the initial dominance of the Turing-machine, symbol-operating paradigm, ever more importance started to be put upon more fine-grained probabilistic and connectionist models. In recent years, however, a “geometric” paradigm has started to gain momentum in diverse fields of cognitive science, including computational linguistics and NLP. In the experiments described above, such a paradigm was harnessed in the sense that, instead of modulating weights along different dimensions, geometers often modulate the number of dimensions itself. It could be reproached to such a geometric approach that associating every plausible feature with a new dimension can induce some serious matrix-sparsity problems, and/or that such an approach would sooner or later be confronted with insurmountable computational and memory limits. It is true that the methods by means of which some older approaches deal with the problem of huge co-occurrence matrices can be very costly, as is the case, for example, with the singular value decomposition within LSA (Landauer & Dumais, 1997). But since very elegant, simple and concise representations of sparse matrices can be easily generated (Karypis, 2002), and since the lemma of Johnson & Lindenstrauss (1984) indicates that sparse high-dimensional matrices can be easily projected into low-dimensional spaces, as is often done in random indexing (Sahlgren, 2005), it seems plausible to state that the construction of vector spaces which are 1) dense, 2) transformable at low computational cost and 3) able to encode a huge amount of features attributed to a huge amount of objects, is not as problematic as it used to be at the time when the HMM-dominated POS-i paradigm was born. The series of articles by Sahlgren (2002; 2005), Cohen (2010), Widdows (2004) and their colleagues offer a valuable initiation into the advantages of random-projection based semantic models. For a more general discussion of the “geometrization of thought” in diverse fields of cognitive sciences, see (Gärdenfors, 2004). Within all such geometric models, categories can be considered as local subspaces of a global space derived from the data.
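The Johnson-Lindenstrauss argument can be illustrated in a few lines; the corpus dimensions used below are arbitrary assumptions of ours:

    from scipy.sparse import random as sparse_random
    from sklearn.random_projection import SparseRandomProjection

    # A sparse "co-occurrence" matrix: 10000 word types x 50000 features.
    X = sparse_random(10000, 50000, density=0.001, random_state=0)

    # Project into a few hundred dimensions; pairwise distances are
    # approximately preserved (Johnson & Lindenstrauss, 1984).
    proj = SparseRandomProjection(n_components=300, random_state=0)
    X_low = proj.fit_transform(X)
    print(X.shape, "->", X_low.shape)   # (10000, 50000) -> (10000, 300)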
3.3 Mix of quantitative and qualitative evaluation

The performance of early grammatical category induction models was evaluated manually, by introspection into the induced equivalence classes, and articles published in the “golden age” of POS-i often used to enumerate the members of at least one particularly pleasing class or to present their dendrograms. Such an approach was later critiqued by Clark (2003) as “inadequate”, and the attention of the POS-i community turned towards more quantitative measures like perplexity, conditional entropy, cross-validation (Gao & Johnson, 2008), one-to-one (Haghighi & Klein, 2006) or many-to-1 accuracy (purity), variation of information (Meilă, 2003), substitutable F-score (Frank et al., 2009) etc. For the purposes of this article we have decided to present our simulations principally in terms of V-measure. Given its elegance, its stability with regard to a growing number of clusters, but also a certain “strictness” (note that even the best performing models present in the comparative study of Christodoulopoulos et al. (2010) rarely surpass the V > 0.6 limit), we consider the V-measure to be a very valuable quantitative measure of the performance of clustering POS-i algorithms. But we also believe that the “old school” many-to-1 purity measure can be of certain interest, especially for those aiming to create a “semi-supervised bridge” between POS-induction and POS-tagging models, or for those aiming not to evaluate the performance of the model but rather to gain insights into the correct annotations of the analyzed corpora. In other terms, besides the “global” statistic measures informing the researcher about the overall performance of the model, more “local” measures can still offer interesting and useful information about the individual induced classes themselves. The values presented in Table 3 represent the number C of clusters into which the corpus has to be partitioned in order to obtain at least Φ absolutely pure (i.e. Purity = 1) classes.

Table 3. Distillation of absolutely pure categories

              Φ=1   Φ=2   Φ=3   Φ=4   Φ=5   Φ=10
SFFX          72    92    105   126   131   160
CD            168   194   196   248   281   377
CD+BO         107   142   180   189   194   256
SFFX+CD+BO    69    71    80    90    96    116

For example, in order to obtain an absolutely pure cluster on the basis of contextual distribution (CD) features, one would have to partition the English part of the Multext-East corpus into 168 clusters, among which shall emerge the following noun-only cluster: authority, character, frontispiece, judgements, levels, listlessness, popularity, sharpness, stead, successors, translucency, virtuosity.

Interesting insights can also be attained by inspection of some exact points of the clustering procedure. Let us inspect, as an example, the case when one clusters the English corpus into 7 clusters according to features both internal to the word – i.e. suffixes – and external to it – i.e. co-occurrences with other words. Such an inspection indicates that the model somehow succeeds in distinguishing verbs from nouns. As is shown in Table 4, whose rows represent the “gold standard” tags and whose columns denote the artificially induced clusters, our naïve computational model tends to put nouns into clusters 4 and 6 while putting verbs into clusters 2, 3 and 5.

Table 4. Origins of the Noun-Verb distinction

       0     1     2     3     4     5     6
N      10    568   97    13    1173  608   1977
V      3     67    668   1011  67    958   97
M      0     0     0     1     4     72    22
D      0     0     0     0     0     67    0
R      413   1     1     275   6     252   42
A      30    0     137   0     133   321   1091
S      0     1     3     2     0     99    3
C      0     2     2     0     0     72    0
I      0     0     0     0     0     7     3
P      0     1     0     0     4     106   0
X      1     0     0     0     3     3     2
G      0     0     0     0     0     12    0

The objective of our ongoing work is to align, as much as possible, such “seeding” states like the one presented in Table 4 with data consistent with psycholinguistic knowledge about the diverse stages of the language acquisition process. Last but not least, we believe that the temporal aspects of the model's performance, i.e. the answer to the question “How long does the model need to run in order to furnish reasonable results?”, should always be seriously considered. One way to evaluate such temporal aspects of categorization could be a simplistic Turing-Test-like (TT) POS-i oriented scenario where the evaluator asks the model (or an agent) to attribute a POS-label to a word posed by the evaluator, or at least to return a set of members of the same category. In such a real-life scenario, the absolute perfection of a possible future answer could be traded off for a less perfect (yet still locally optimal) answer given in reasonable time. But because with this TT-POS proposal we already depart from the domain of unsupervised induction towards semi-supervised “learning with oracle” or a fully supervised POS-tagger, we conclude that we consider the condition “a cognitively plausible model of part-of-speech induction should be evaluated by both quantitative and qualitative means” to be the weakest among all proposals concerning the development of an agent inducing the categories of natural language in a “cognitively plausible” way.
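The Φ statistic of Table 3 is straightforward to compute from an induced partition and the gold tags; a minimal sketch, with a toy partition of our own invention:

    def absolutely_pure(clusters, gold):
        """Count clusters whose members all carry the same gold tag (Purity = 1)."""
        tags_per_cluster = {}
        for word, c in clusters.items():
            tags_per_cluster.setdefault(c, set()).add(gold[word])
        return sum(1 for tags in tags_per_cluster.values() if len(tags) == 1)

    # Toy partition into 3 clusters; clusters 0 and 2 are absolutely pure.
    clusters = {"cat": 0, "dog": 0, "eats": 1, "the": 1, "runs": 2}
    gold     = {"cat": "N", "dog": "N", "eats": "V", "the": "D", "runs": "V"}
    print(absolutely_pure(clusters, gold))  # -> 2, i.e. Φ = 2 at C = 3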
4. Conclusion

A model should be labeled as a “cognitively plausible” model of a certain human faculty if and only if it not only accurately emulates the input (problem) → output (solution) mapping executed by the faculty, but also emulates the basic, “essential” characteristics associated with such a mapping operation in the case of human cognitive systems, i.e. if it emulates not only WHAT but also HOW the problem → solution mapping is done. In relation to the problem of how part-of-speech induction is effectuated by human agents, two characteristic conditions have been defined as axiomatic (necessary). The first postulates that POS-i should involve a “bootstrapping”, multi-iterative process able to subsume terminals sharing common features under a new non-terminal and to subsequently exploit the information related to the occurrence of the new non-terminal in order to extend the (vectorial) definition of the terminals represented in memory. Ideally, the process should converge to partitions “optimally” corresponding to the gold standard. The first experiment has shown, for three distinct corpora, that even a very simple model based on clustering of the most trivial co-occurrence information can attain higher accuracies if such a bootstrapping component is involved. The second necessary condition of POS-i's CP is that it should be data- or oracle-driven. It should perform better when first confronted with simple corpora like CHILDES (MacWhinney, 2000) and only later with more complex ones, than if it were first confronted with the complex corpora. Another condition of POS-i's CP proposed that morphological and surface features should not be neglected and that, instead of playing a secondary “performance increasing” role, they should possibly “seed” the whole bootstrapping process which shall follow. This condition is considered to be conjectural (i.e. “weaker”) just because it points in a somewhat orthogonal direction than does the traditionally acclaimed distributional hypothesis (Harris, 1954). It may be the case, however, that especially native speakers of some morphologically rich languages shall consider the “syntax-is-also-IN-the-word” paradigm not only as conjectural but also as axiomatic. Another “weak” condition of cognitive plausibility postulates that many phenomena related to mental representations and thinking, POS-i included, can be not only described but also explained and represented in geometric and topological terms. Ideally, the geometric paradigm (Gärdenfors, 2004) should not be contradictory but rather complementary to the symbolic and connectionist paradigms. The last and weakest condition of CP proposed that computational models of part-of-speech induction should not only be easily quantitatively analyzed but should also be transparent to the researcher's or supervisor's qualitative analyses. They should facilitate and not complicate the posing of all sorts of “Why?” questions, and their results should be easily interpretable. A sort of categorization-faculty Turing Test was proposed which could potentially be embedded into the linguistic component of the hierarchy of Turing Tests which we propose elsewhere (Hromada, 2012). It may be the case that the list of conditions of cognitive plausibility presented in this article is not a sufficient one and should be extended with other terms like “modularity” or “self-referentiality”, or with notions coming from complex systems and evolutionary computing. Regarding the problem of elucidating how a machine could induce, from an environment-representing corpus, the categories in a way analogous to that of a child learning by imitating its parents, we consider even the list of 2 strong precepts and 3 weak precepts hereby presented as quite useful and possibly necessary.

Bibliography

Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., & Klein, D. (2010). Painless unsupervised learning with features.
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (p. 582–590).
Biemann, C. (2006). Unsupervised part-of-speech tagging employing efficient graph clustering. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (p. 7–12).
Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–479.
Christodoulopoulos, C., Goldwater, S., & Steedman, M. (2010). Two Decades of Unsupervised POS induction: How far have we come? Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (p. 575–584).
Clark, A. (2003). Combining distributional and morphological information for part of speech induction. Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics – Volume 1 (p. 59–66).
Clark, A. (2010). Towards general algorithms for grammatical inference. Algorithmic Learning Theory (p. 11–30).
Cohen, T., Schvaneveldt, R., & Widdows, D. (2010). Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2), 240–256.
Elman, J. L. (1989). Representation and structure in connectionist models. DTIC Document.
Erjavec, T. (2012). MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Language Resources and Evaluation, 46(1), 131–142.
Ferguson, C. A. (1964). Baby talk in six languages. American Anthropologist, 66(6, Part 2), 103–114.
Frank, S., Goldwater, S., & Keller, F. (2009). Evaluating models of syntactic category acquisition without using a gold standard. Proc. 31st Annual Conf. of the Cognitive Science Society (p. 2576–2581).
Gao, J., & Johnson, M. (2008). A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. Proceedings of the Conference on Empirical Methods in Natural Language Processing (p. 344–352).
Gärdenfors, P. (2004). Conceptual spaces: The geometry of thought. MIT Press.
Goldwater, S., & Griffiths, T. (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. Annual Meeting of the Association for Computational Linguistics (Vol. 45, p. 744).
Haghighi, A., & Klein, D. (2006). Prototype-driven learning for sequence models. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (p. 320–327).
Harris, Z. S. (1954). Distributional structure. Word.
Hebb, D. O. (1964). The Organization of Behavior: A Neuropsychological Theory. John Wiley & Sons.
Hromada, D. D. (2012). Taxonomy of Turing Test Scenarios. Proceedings of the AISB/IACAP Symposium, Birmingham, United Kingdom.
Jackendoff, R. (2003). Foundations of language: Brain, meaning, grammar, evolution. Oxford University Press, USA.
Johnson, M. (2007). Why doesn't EM find good HMM POS-taggers? Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (p. 296–305).
Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189–206), 1.
Karypis, G. (2002). CLUTO – a clustering toolkit. DTIC Document.
Lakoff, G. (1990). Women, fire, and dangerous things. Univ. of Chicago Press.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
Levy, Y., Schlesinger, I. M., & Braine, M. D. S. (1988). Categories and Processes in Language Acquisition. Lawrence Erlbaum.
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. Transcription, format and programs (Vol. 1). Lawrence Erlbaum.
Meilă, M. (2003). Comparing clusterings by the variation of information. Learning Theory and Kernel Machines (p. 173–187). Springer.
Nowak, M. A., Plotkin, J. B., & Krakauer, D. C. (1999). The evolutionary language game. Journal of Theoretical Biology, 200(2), 147–162.
Riloff, E., & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. Proceedings of the National Conference on Artificial Intelligence (p. 474–479).
Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (Vol. 410, p. 420).
Sahlgren, M. (2005). An introduction to random indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE (Vol. 5).
Sahlgren, M., & Karlgren, J. (2002). Vector-based semantic analysis using random indexing for cross-lingual query expansion. Evaluation of Cross-Language Information Retrieval Systems (p. 169–176).
De Saussure, F., Bally, C., Séchehaye, A., Riedlinger, A., Calvet, L. J., & De Mauro, T. (1922). Cours de linguistique générale. [Course in General Linguistics.] Payot, Paris.
Schütze, H. (1993). Part-of-speech induction from scratch. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (p. 251–258).
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of information. Urbana: University of Illinois Press.
Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629.
Turing, A. M. (1939). Systems of logic based on ordinals. Proceedings of the London Mathematical Society, 2(1), 161–228.
Language and Speech, 40(1), 47–62.
Vlachos, A., Korhonen, A., & Ghahramani, Z. (2009). Unsupervised and constrained Dirichlet process mixture models for verb clustering. Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (p. 74–82).
Widdows, D. (2004). Geometry and Meaning. CSLI Publications, Stanford.